# Predicting Potential Salaries using Data Science
---

### Background
Job hunting is part and parcel of life. On average, a person changes jobs about five times during their working career. One of the biggest hurdles in job hunting is negotiating a salary, and this salary prediction model aims to lower that hurdle for both job seekers and hiring companies.

### Target Audience

**1. Primary stakeholders: Job seekers** <br/>
With this salary prediction model readily available, job seekers can gain insights into the industry from the exploratory data analysis and get a benchmark of the salary that might be offered before applying for a job. The model is not only useful for a job seeker's next potential job: existing employees can also check whether they are paid in line with the market trend and negotiate for higher pay within their company. As the exploratory data analysis shows the market demand for such jobs, companies might need to offer competitive salaries in order to retain their existing employees.


**2. Secondary stakeholders: Companies seeking to hire** <br/>
Companies setting up new data science teams will be able to set realistic hiring budgets. They will also be able to check hiring market trends and decide whether to offer more competitive salaries when they urgently need to fill certain roles.


**3. Platform: Job Portals and Recruitment Agencies** <br/>
This salary prediction model can be utilized by job portals and recruitment agencies to reconcile these two stakeholders. As the companies post up their job listings, the job seekers are able 

### Data Source
The dataset used in this salary prediction model is scraped from the Singapore job portal mycareersfuture.sg, and comprises jobs in the data science industry, including but not limited to roles such as data scientist, data analyst and data engineer. While this particular industry was selected as a form of personal research before embarking on my job hunt as a data scientist, it is merely a proof of concept, and can be easily tweaked to predict salaries in other industries.

### Project Workflow

**1. Scraping the Data** <br/>

Datasets were obtained from mycareersfuture.sg with an API scraper, one dataset for each of the following search keywords:
* data scientist
* data analyst
* data engineer
* business analyst
* machine learning
* python
* data

These datasets were then concatenated to create a comprehensive dataset that includes most jobs in the data science industry.
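
The concatenation step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two small frames are made-up stand-ins for the per-keyword CSV exports, and dropping duplicates on `job_id` is an assumption (a listing may well match more than one keyword, but the original does not say how overlaps were handled).

```python
import pandas as pd

# Hypothetical per-keyword results; in practice these would be read from the
# CSV files exported by the scraper (one file per search keyword).
data_scientist_jobs = pd.DataFrame({
    "job_id": ["a1", "b2"],
    "job_title": ["Data Scientist", "Senior Data Scientist"],
})
python_jobs = pd.DataFrame({
    "job_id": ["b2", "c3"],
    "job_title": ["Senior Data Scientist", "Python Developer"],
})

# Stack the per-keyword frames into one dataset.
combined = pd.concat([data_scientist_jobs, python_jobs], ignore_index=True)

# Assumption: a listing matched by several keywords should be counted once,
# so duplicates are dropped on the job_id column.
combined = combined.drop_duplicates(subset="job_id").reset_index(drop=True)

print(combined)
```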

**2. Cleaning the Data** <br/>

Several cleaning and transformation tasks were performed to prepare a dataset suitable for modelling. Some of the actions taken at this stage were:
* Feature engineering the target variable (average salary) from the minimum and maximum salary values of each job listing, and removing listings whose values appeared invalid, such as zero values and values that were too high.
* Standardizing and simplifying categorical features such as `job_title` and `position_level`, which had a huge number of possible unique values.

After cleaning, the dataset had 4,000 rows and 10 columns.
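
As a rough sketch of the target-variable step, the average salary can be taken as the midpoint of the advertised range, with implausible rows filtered out afterwards. The sample values and the `MAX_PLAUSIBLE_SALARY` cut-off below are hypothetical; the threshold actually used for "too high" is not stated.

```python
import pandas as pd

# Made-up listings using the column names produced by the scraper.
jobs = pd.DataFrame({
    "salary_minimum": [4000, 0, 5000, 90000],
    "salary_maximum": [6000, 0, 8000, 150000],
})

# Target variable: midpoint of the advertised salary range.
jobs["average_salary"] = (jobs["salary_minimum"] + jobs["salary_maximum"]) / 2

# Hypothetical upper bound for a plausible monthly salary (an assumption).
MAX_PLAUSIBLE_SALARY = 30000

# Keep rows with a non-zero salary that does not exceed the upper bound.
is_valid = (jobs["average_salary"] > 0) & (jobs["average_salary"] <= MAX_PLAUSIBLE_SALARY)
cleaned = jobs[is_valid].reset_index(drop=True)

print(cleaned)
```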

**3. Exploratory Data Analysis** <br/>

Correlations between the different features in the dataset are investigated in this section. Further feature engineering was done on text features such as `skills` to convert them into categorical features for modelling.
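
One possible way to turn the `skills` text into categorical features is sketched below. It relies on the fact that the scraper joins each listing's skills with `'; '`, and uses pandas' `Series.str.get_dummies` to create one indicator column per skill. Whether the project used this exact approach is an assumption; the sample rows are made up.

```python
import pandas as pd

# Made-up listings; each row holds skills joined by "; " as in the scraped data.
jobs = pd.DataFrame({"skills": ["Python; SQL", "SQL; Tableau", "Python"]})

# One 0/1 indicator column per distinct skill.
skill_flags = jobs["skills"].str.get_dummies(sep="; ")

print(skill_flags)
```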

**4. Pre-processing** <br/>

This section involves one-hot encoding the standardized categorical variables in the dataset. The resultant dataset for modelling contained about 4,000 rows and 90 columns.
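
A minimal sketch of one-hot encoding with `pandas.get_dummies` is shown here. The column names `position_level` and `employment_type` and the sample values are illustrative assumptions, not the project's final feature set.

```python
import pandas as pd

# Made-up rows with two categorical features and one numeric feature.
jobs = pd.DataFrame({
    "position_level": ["Executive", "Manager", "Executive"],
    "employment_type": ["Full Time", "Contract", "Full Time"],
    "minimum_years_experience": [2, 5, 3],
})

# Each listed categorical column is replaced by one indicator column per
# category; numeric columns are passed through unchanged.
encoded = pd.get_dummies(jobs, columns=["position_level", "employment_type"])

print(encoded.columns.tolist())
```

Each categorical column expands into as many columns as it has distinct values, which is why the column count grows substantially after this step.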

A second target variable, `salary_above_median`, was also created for classification modelling. It indicates whether the salary offered is above or below the median value for the candidate's years of experience.
* 1 represents a salary above the median value for that number of years of experience
* 0 represents a salary below the median value for that number of years of experience
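
The labelling rule above could be implemented along these lines. This sketch assumes that `minimum_years_experience` is the experience column used for grouping and that a salary exactly equal to the median is labelled 0; neither detail is confirmed by the text, and the sample values are made up.

```python
import pandas as pd

# Made-up listings with two experience groups.
jobs = pd.DataFrame({
    "minimum_years_experience": [1, 1, 1, 5, 5, 5],
    "average_salary": [3000, 4000, 5000, 7000, 8000, 12000],
})

# Median salary of each experience group, aligned row by row with the frame.
median_by_experience = (
    jobs.groupby("minimum_years_experience")["average_salary"].transform("median")
)

# Assumption: salaries equal to the group median are labelled 0.
jobs["salary_above_median"] = (jobs["average_salary"] > median_by_experience).astype(int)

print(jobs)
```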

**5a. Regression Modelling** <br/>


**5b. Classification Modelling** <br/>



## MyCareersFuture.sg API Calls 
---
Credit: https://github.com/pwaaron/jobscrapers

In [40]:
import requests
from urllib.parse import urlencode
from bs4 import BeautifulSoup

import pandas as pd

In [41]:
API_LINK = 'https://api1.mycareersfuture.sg/v2/jobs?'

### Variables to change
Please change the search query and the number of total jobs you would like to query.

In [73]:
jobs = []
LIMIT = 100 # Limit should not exceed 100. The smaller the number, the gentler it is
SEARCH_QUERY = 'python' #search query
TOTAL_JOBS = 20000 # Number of jobs to be queried
N_PAGES = TOTAL_JOBS//LIMIT

### Running query


In [14]:
#For limited queries
for page in range(N_PAGES):
    query = {'limit': LIMIT, 'page': page, 'search': SEARCH_QUERY}
    r = requests.get(API_LINK + urlencode(query))
    jobs.extend(r.json()["results"])

In [74]:
#To query all pages
page = 0

while True:
    query = {'limit': LIMIT, 'page': page, 'search': SEARCH_QUERY}
    r = requests.get(API_LINK + urlencode(query))
    results = r.json()["results"]  # parse each response only once
    if not results:  # an empty page means there are no more listings
        break
    jobs.extend(results)
    page += 1
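
The "fetch pages until one comes back empty" pattern can also be isolated in a small helper, which makes it testable without network access. The sketch below is an illustration only: `fetch_all_pages`, `fake_fetch` and the `max_pages` safety cap are hypothetical names and choices, not part of the original scraper. In real use, the fetch function would wrap the `requests.get(...).json()["results"]` call shown in the cell above.

```python
from typing import Callable, Dict, List


def fetch_all_pages(fetch_page: Callable[[int], List[Dict]],
                    max_pages: int = 1000) -> List[Dict]:
    """Collect results page by page until an empty page is returned.

    max_pages is a safety cap (an assumption) so that a misbehaving
    endpoint cannot cause an endless loop.
    """
    collected: List[Dict] = []
    for page in range(max_pages):
        results = fetch_page(page)
        if not results:
            break
        collected.extend(results)
    return collected


# Stand-in for the real API: two pages of results followed by an empty page.
fake_pages = [[{"uuid": "a"}, {"uuid": "b"}], [{"uuid": "c"}], []]


def fake_fetch(page: int) -> List[Dict]:
    return fake_pages[page] if page < len(fake_pages) else []


all_jobs = fetch_all_pages(fake_fetch)
print(len(all_jobs))
```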

In [75]:
len(jobs)

2677

### Extract the Information Out

In [76]:
job_id = list(map(lambda job: job['uuid'], jobs))
ext_job_id = list(map(lambda job: job['metadata']['jobPostId'], jobs))
job_title = list(map(lambda job: job['title'], jobs))
job_description = list(map(lambda job: BeautifulSoup(job['description'], 'lxml').text, jobs))
minimum_years_experience = list(map(lambda job: job['minimumYearsExperience'], jobs))
ssoc_code = list(map(lambda job: job['ssocCode'], jobs))
categories = list(map(lambda job: '; '.join(list(map(lambda category: category['category'], job['categories']))), jobs))
employment_types = list(map(lambda job: '; '.join(list(map(lambda employmentType: employmentType['employmentType'], job['employmentTypes']))), jobs))
position_levels = list(map(lambda job: '; '.join(list(map(lambda positionLevel: positionLevel['position'], job['positionLevels']))), jobs))
skills = list(map(lambda job: '; '.join(list(map(lambda skill: skill['skill'], job['skills']))), jobs))
organisation = list(map(lambda job: job['postedCompany']['name'], jobs))
new_posting_date = list(map(lambda job: job['metadata']['newPostingDate'], jobs))
original_posting_date = list(map(lambda job: job['metadata']['originalPostingDate'], jobs))
closing_date = list(map(lambda job: job['metadata']['expiryDate'], jobs))
last_updated = list(map(lambda job: job['metadata']['updatedAt'], jobs))
salary_minimum = list(map(lambda job: job['salary']['minimum'], jobs))
salary_maximum = list(map(lambda job: job['salary']['maximum'], jobs))
salary_type = list(map(lambda job: job['salary']['type']['salaryType'], jobs))
api_link = list(map(lambda job: job['_links']['self']['href'], jobs))
job_url = list(map(lambda job: job['metadata']['jobDetailsUrl'], jobs))
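
The lookups above index directly into nested dictionaries, so a single listing with a missing or empty field would raise a `KeyError` or `TypeError`. If that turns out to be a problem, a small helper like the hypothetical `get_nested` below could return a default instead. It is a sketch under that assumption, not part of the original scraper, and the sample record is made up.

```python
def get_nested(record, *keys, default=None):
    """Walk nested dictionaries, returning default if any key is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up listing with an empty salary type and no metadata section.
job = {"title": "Python Developer", "salary": {"minimum": 8000, "type": None}}

salary_min = get_nested(job, "salary", "minimum")
salary_kind = get_nested(job, "salary", "type", "salaryType")
post_id = get_nested(job, "metadata", "jobPostId", default="")

print(salary_min, salary_kind, repr(post_id))
```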

### Save as Dataframe and Export as CSV

In [77]:
col = {'job_id': job_id, 'ext_job_id': ext_job_id, 
       'job_title': job_title, 'job_description': job_description,
       'minimum_years_experience': minimum_years_experience, 
       'ssoc_code': ssoc_code, 'categories': categories, 
       'employment_types': employment_types, 'position_levels': position_levels, 
       'new_posting_date': new_posting_date, 'original_posting_date': original_posting_date, 
       'closing_date': closing_date, 'last_updated': last_updated, 
       'skills': skills, 'organisation': organisation,
       'salary_minimum': salary_minimum, 'salary_maximum': salary_maximum, 'salary_type': salary_type,
       'api_link': api_link, 'job_url': job_url}

In [78]:
jobs = pd.DataFrame(col)

In [79]:
PREFIX = "mycareersfuturesg_results"
FILENAME = '_'.join((PREFIX + ' '+ SEARCH_QUERY).split()) + ".csv"
jobs.to_csv(FILENAME, index=False)
print("DONE")

DONE


### Checking exported .csv

In [80]:
jobs.shape

(2677, 20)

In [81]:
jobs.head()

Unnamed: 0,job_id,ext_job_id,job_title,job_description,minimum_years_experience,ssoc_code,categories,employment_types,position_levels,new_posting_date,original_posting_date,closing_date,last_updated,skills,organisation,salary_minimum,salary_maximum,salary_type,api_link,job_url
0,90f5a8ac82ff1d5dfb7e42b52539ffe7,MCF-2021-0106896,Product Technical Lead ( Python ),We are looking for Product Technical Lead ( Py...,8,9943,Information Technology,Permanent,Professional,2021-04-24,2021-03-12,2021-05-24,2021-05-10T16:38:24.000Z,AWS; Django; Flask; Git; Github; Machine Learn...,ARYAN SOLUTIONS PTE. LTD.,10000,11000,Monthly,https://api.mycareersfuture.gov.sg/v2/jobs/90f...,https://www.mycareersfuture.gov.sg/job/informa...
1,84e22053f08516412d1380a206d4d190,MCF-2021-0186226,Python Expert,\nNewly Created Role\nExposure to Latest Techn...,4,2504,Information Technology,Full Time,Professional,2021-04-26,2021-04-26,2021-05-26,2021-05-10T16:25:51.000Z,Designing; Encoding; Microsoft Technologies; O...,MICHAEL PAGE INTERNATIONAL PTE LTD,5000,8000,Monthly,https://api.mycareersfuture.gov.sg/v2/jobs/84e...,https://www.mycareersfuture.gov.sg/job/informa...
2,752afd8aa0a2863251e9b9ba5f637422,MCF-2021-0174611,Python Developer,"Job description :\n\nExperience with Python, D...",5,10216,Information Technology,Contract,Professional,2021-04-20,2021-04-20,2021-05-20,2021-05-10T16:29:33.000Z,Analysis; API; Django; Flask; JavaScript; NumP...,U3 INFOTECH PTE. LTD.,8000,11000,Monthly,https://api.mycareersfuture.gov.sg/v2/jobs/752...,https://www.mycareersfuture.gov.sg/job/informa...
3,5d9d21e6b03f30170841e5a4175f88a4,MCF-2021-0211484,Python Developer,Background\nImpact Credit Solutions (“ICS”) is...,3,2447,Banking and Finance; Engineering; Information ...,Full Time,Manager,2021-05-10,2021-05-10,2021-06-09,2021-05-10T16:17:41.000Z,Capital Markets; Core Banking; Django; Finance...,IMPACT CREDIT SOLUTIONS PTE. LTD.,7500,10000,Monthly,https://api.mycareersfuture.gov.sg/v2/jobs/5d9...,https://www.mycareersfuture.gov.sg/job/banking...
4,907c8a431d0e58fee992d682856c2efb,MCF-2021-0209870,Python and React Junior Engineer,• 1-6 years of experience working with various...,1,14320,Information Technology,Full Time,Professional,2021-05-10,2021-05-10,2021-06-09,2021-05-10T16:18:43.000Z,Building Codes; Business Intelligence; DevOps;...,TOSS-EX PTE. LTD.,4900,5000,Monthly,https://api.mycareersfuture.gov.sg/v2/jobs/907...,https://www.mycareersfuture.gov.sg/job/informa...


In [51]:
jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15102 entries, 0 to 15101
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   job_id                    15102 non-null  object
 1   ext_job_id                15102 non-null  object
 2   job_title                 15102 non-null  object
 3   job_description           15102 non-null  object
 4   minimum_years_experience  15102 non-null  int64 
 5   ssoc_code                 15102 non-null  int64 
 6   categories                15102 non-null  object
 7   employment_types          15102 non-null  object
 8   position_levels           15102 non-null  object
 9   new_posting_date          15102 non-null  object
 10  original_posting_date     15102 non-null  object
 11  closing_date              15102 non-null  object
 12  last_updated              15102 non-null  object
 13  skills                    15102 non-null  object
 14  organisation          