Job Salary Prediction

The Job Salary Prediction project was developed with the aim of predicting job advertisement salaries in the United Kingdom (UK).

NOTE:
This project was developed following the CRISP-DM methodology.

Project Overview


Proposed Dataset (Data Understanding)

The proposed dataset (Job Salary Prediction) consists of a large number of rows (240k+ samples), each representing an individual job ad.

The dataset features are:

  • Id:
    • A unique identifier for each job ad
  • Title:
    • Briefly, the Title is a summary of the position or role.
  • FullDescription:
    • The full text of the job ad as provided by the job advertiser.
    • Where you see ***s, we have stripped values from the description in order to ensure that no salary information appears within the descriptions.
    • There may be some collateral damage here where we have also removed other numerics.
  • LocationRaw:
    • This column represents the job location as provided by the advertiser, which may include cardinal points (West, East) and/or other references.
  • LocationNormalized:
    • It has the same meaning as the LocationRaw column, but with less information and fewer references.
    • That's because this column is the result of pre-processing of the LocationRaw column done by Adzuna.
  • ContractType:
    • This column represents the contract type of each job vacancy sample, which is either full_time or part_time.
    • In other words, this column tells us whether the employee works full-time (e.g. 40 hours per week) or part-time (e.g. 20 hours per week).
  • ContractTime:
    • The contract duration, which can be permanent or contract.
  • Company:
    • The employer name provided by the job advertiser.
  • Category:
    • The job category (one of 29):
      • IT Jobs
      • Engineering Jobs
      • Accounting & Finance Jobs
      • Healthcare & Nursing Jobs
      • Sales Jobs
      • Other/General Jobs
      • Teaching Jobs
      • Hospitality & Catering Jobs
      • PR, Advertising & Marketing Jobs
      • Trade & Construction Jobs
      • HR & Recruitment Jobs
      • Admin Jobs
      • Retail Jobs
      • Customer Services Jobs
      • Legal Jobs
      • Manufacturing Jobs
      • Logistics & Warehouse Jobs
      • Social work Jobs
      • Consultancy Jobs
      • Travel Jobs
      • Scientific & QA Jobs
      • Charity & Voluntary Jobs
      • Energy, Oil & Gas Jobs
      • Creative & Design Jobs
      • Maintenance Jobs
      • Graduate Jobs
      • Property Jobs
      • Domestic help & Cleaning Jobs
      • Part time Jobs
  • SalaryRaw:
    • This column represents the salary of the ad (sample). However:
      • It has no standard formatting;
      • It may include bonuses;
      • Remuneration may be stated:
        • Per hour;
        • Per month;
        • Per annum.
  • SalaryNormalised:
    • It has the same meaning as the "SalaryRaw" column, however Adzuna has normalized the data so that the salary is represented in an annualized way.
  • SourceName:
    • The name of the website or advertiser from which the job ad was received.

All of the data are real and come from actual job ads, so they are clearly subject to lots of real-world noise (a quick loading and inspection sketch in pandas follows the list below), including but not limited to:

  • Ads that are not UK based;
  • Salaries that are incorrectly stated;
  • Fields that are incorrectly normalised;
  • And duplicate adverts.
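
For reference, here is a minimal sketch of loading the training CSV with pandas and taking a first look at the columns and noise described above. The path data/Train_rev1.csv is an assumption; point it at wherever you saved the Kaggle files.

import pandas as pd

# Assumed path to the Kaggle training CSV (adjust to your local setup).
TRAIN_CSV = "data/Train_rev1.csv"

df = pd.read_csv(TRAIN_CSV)

# Inspect the columns described above and how many values are missing.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# A first look at the real-world noise: duplicate adverts.
print("Duplicated ads:", df.duplicated(subset=["Title", "FullDescription", "Company"]).sum())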

Exploratory Data Analysis (Data Understanding)

Here, let's apply Exploratory Data Analysis (EDA) to understand more about the data.

For each Exploratory Data Analysis (EDA) I'll create a new Jupyter Notebook with a different focus.

[v1] - Exploratory Data Analysis/EDA (Data Understanding)
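
As an illustration of the kind of checks the EDA notebook covers (not a copy of its code), here is a minimal sketch over the training data. The file path and the salary column spelling are assumptions; adjust them to your local files.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/Train_rev1.csv")  # same assumed path as the earlier sketch
SALARY_COL = "SalaryNormalised"          # adjust if your CSV spells it "SalaryNormalized"

# How are ads spread across the 29 job categories?
print(df["Category"].value_counts())

# The contract columns may be sparse; count missing values explicitly.
print(df["ContractType"].value_counts(dropna=False))
print(df["ContractTime"].value_counts(dropna=False))

# Salary distribution (salaries tend to be right-skewed, so a log-scaled count axis helps).
df[SALARY_COL].plot(kind="hist", bins=50, logy=True)
plt.xlabel("Annual salary (GBP)")
plt.show()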

Training & Evaluation (Modeling & Evaluation)

Here, let's create models to make predictions and evaluate how well our models learned.

Each time I change/update the model (apply preprocessing or add new features) I'll create a new Jupyter Notebook.

[v1] - Training & Evaluation (baseline, dummy, PoC, prototype)

Data Lake Architecture

The project follows the Data Lake architecture below to store and make data available (a minimal layout sketch in Python follows the list):

[Data Lake Architecture diagram]

  • Landing (Entry Point/Ingestion):
    • The "Landing" bucket serves as the entry point for the data lake.
    • It is used to receive raw data, often in its original format, with little or no transformation.
    • Data here can come from various sources such as server logs, IoT devices, social media feeds, etc.
    • The primary purpose of this bucket is to store raw data, allowing data to be quickly dumped into the data lake without an immediate need for processing or structuring. This helps capture all available data for future analysis and transformation.
  • Processing:
    • The "Processing" bucket is where raw data from the "Landing" layer undergoes processing and transformation to make it more usable and valuable.
    • Data here may be cleaned, enriched, structured, and transformed into suitable formats for advanced analytics, machine learning, reporting, and other use cases.
    • Typically, data processing tools like Apache Spark, Apache Flink, or ETL (Extract, Transform, Load) services are used in this layer.
  • Curated:
    • The "Curated" bucket is where processed and ready-to-use data is stored in an organized and structured manner.
    • Data in this bucket is usually refined, optimized, and may be indexed to enable quick and efficient access.
    • This layer is often used by Data Analysts, Data Scientists, and other professionals to conduct analyses, create Dashboards, reports, and other activities that require high-quality data.
    • This is where data becomes "curated" and prepared for consumption by applications and systems that rely on accurate and reliable information.
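
To make the three layers concrete, here is a minimal Python sketch that mirrors them as local directories and moves a file through Landing, Processing, and Curated. The directory names and helper functions are illustrative assumptions; in practice these layers would be object-store buckets (e.g. on S3 or MinIO).

from pathlib import Path
import shutil

# Hypothetical local layout mirroring the three buckets described above.
LANDING = Path("datalake/landing")
PROCESSING = Path("datalake/processing")
CURATED = Path("datalake/curated")

for layer in (LANDING, PROCESSING, CURATED):
    layer.mkdir(parents=True, exist_ok=True)

def ingest(raw_file: Path) -> Path:
    """Drop a raw file into the Landing layer untouched."""
    return Path(shutil.copy2(raw_file, LANDING / raw_file.name))

def process(landing_file: Path) -> Path:
    """Clean/transform the raw data and write the result to Processing."""
    target = PROCESSING / landing_file.name
    # ...cleaning, enrichment, and reformatting would happen here...
    shutil.copy2(landing_file, target)
    return target

def curate(processed_file: Path) -> Path:
    """Publish the refined, analysis-ready data to Curated."""
    target = CURATED / processed_file.name
    shutil.copy2(processed_file, target)
    return target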

Settings

To use the project, first prepare the virtual environment and install the dependencies:

Set environment:

poetry env use python

Activate environment:

poetry shell

Install dependencies:

poetry install

I preferred to store the data in PostgreSQL (running in a Docker container) because it makes it easier to run SQL queries from all applications.

With docker compose installed, run:

sudo docker compose up -d

As the datasets are huge and cannot be downloaded using the Kaggle API (they are very old datasets), you will need to download them manually.

Finally, you can run the "jsp etl" CLI command to load the DataFrames into PostgreSQL:

jsp etl --help
Usage: jsp etl [OPTIONS] COMMAND [ARGS]...

  Command to apply ETL processes.

Options:
  --help  Show this message and exit.

Commands:
  load-all    Loads all train and test data into PostgreSQL.
  load-test   Loads test data into PostgreSQL.
  load-train  Loads train DataFrames into PostgreSQL.

Here you can use the following commands (a sketch of the underlying load step follows this list):

  • jsp etl load-all:
    • To load the train and test data into PostgreSQL.
  • jsp etl load-test:
    • To load the test data into PostgreSQL.
  • jsp etl load-train:
    • To load the train data into PostgreSQL.
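
The real ETL code lives in the repository's CLI; the snippet below is only a rough sketch of what a load step such as load-train amounts to. The user, password, port, CSV path, and table name are assumptions (the database name jsp-db comes from the psql session later in this README); match them to your docker-compose.yml.

import pandas as pd
from sqlalchemy import create_engine

# Assumed connection string; adjust credentials and port to your setup.
ENGINE = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/jsp-db")

def load_train(csv_path: str = "data/Train_rev1.csv") -> None:
    """Read the training CSV and write it to a PostgreSQL table."""
    df = pd.read_csv(csv_path)
    # chunksize keeps memory usage reasonable for the 240k+ rows.
    df.to_sql("train", ENGINE, if_exists="replace", index=False, chunksize=10_000)

load_train()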

You can also check the data in the Docker container by following the instructions below (a Python alternative is sketched after them):

OPEN THE DOCKER CONTAINER:

sudo docker container exec -it postgres-container bash

OPEN THE POSTGRES SHELL:

psql -U postgres

CHECK DATABASES:

\l

CONNECT TO THE DATABASE:

\c jsp-db

CHECK TABLES:

\dt
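
If you prefer to check the loaded data from Python rather than psql, here is a minimal sketch using SQLAlchemy and pandas (same assumed credentials and table name as in the ETL sketch above):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/jsp-db")

# List the public tables created by the ETL commands (equivalent to \dt).
with engine.connect() as conn:
    tables = conn.execute(
        text("SELECT tablename FROM pg_catalog.pg_tables WHERE schemaname = 'public'")
    ).fetchall()
print(tables)

# Peek at a few rows of the (assumed) train table.
print(pd.read_sql("SELECT * FROM train LIMIT 5", engine))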

If you are interested in committing something, initialize the pre-commit settings:

pre-commit install

Tech Stack


Rodrigo Leite da Silva
