# Salary Predictions Based on Job Descriptions

## Part 1 - DEFINE

### ---- 1 Define the problem ----

The assignment I received only came with this instruction:
`Your job as a data scientist is in this assignment is to examine a set of job postings with salaries and then predict salaries for a new set of job postings.`

The data provided in both the training and testing dataset has the following features:
- companyId
- jobType
- degree
- major
- industry
- yearsExperience
- milesFromMetropolis

I didn't receive any additional background or information about why I'd be asked to do this. Understanding the "why" behind any data science project is crucial. If this were a real project, I would ask the following questions:

- Who are the ultimate end users of this project?
- Where does the source data come from?
- What are the main issues they are trying to solve?
- How are they solving those issues right now?
- What are the issues/disadvantages of the current approach?
- What are the key components to a successful model?

**For the purposes of this exercise, I will assume the following:**

The human resources department has requested this project. They obtained the data about job posting salaries from an independent salary research firm. The research firm claims that the data was collected within the last year from comparable companies.

The HR department would like to ensure that the company offers competitive salaries, not too high or too low. They would like to include a predicted salary as a reference point in their decision of whether or not to approve salaries for job postings.

At the moment, they are using rough salary bands provided by the salary research firm as a reference point. However, these are based only on the generic job title (CEO, CFO, Janitor, etc.), and the HR department feels that it would be good to include a few other factors to get a more specific salary estimate.

A successful model would be able to predict a salary based on the features provided in the datasets.

**Other items to consider**

In order to make a more accurate model, it would be good to include other features in the data. For example, the specific market that the job is in will likely have a large impact on the salary. Also, there are other components to compensation that may be worth considering as well, such as bonus, vacation days, etc. It would also be good to have more information about things like company size. In my model, I haven't removed any industries, but it would likely be appropriate to focus only on the industry of the target client. 

While job postings are an interesting data point, it is important to remember that they do not represent the actual salaries ultimately agreed upon. It would be good to consider the cost of acquiring actual salary data which could be used in defining guidelines for appropriate salary bands and salary negotiations.

In [1]:
#import your libraries
import pandas as pd
#import sklearn as sk
#etc

__author__ = "Steve Anderson"
__email__ = "steve@ranksmarts.com"

## Part 2 - DISCOVER

### ---- 2 Load the data ----

In [40]:
train_feat = pd.read_csv('data/train_features.csv',index_col='jobId')
print("Rows and columns in train_feat:",train_feat.shape)
train_feat.head(3)

Rows and columns in train_feat: (1000000, 7)


| jobId | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis |
|---|---|---|---|---|---|---|---|
| JOB1362684407687 | COMP37 | CFO | MASTERS | MATH | HEALTH | 10 | 83 |
| JOB1362684407688 | COMP19 | CEO | HIGH_SCHOOL | NONE | WEB | 3 | 73 |
| JOB1362684407689 | COMP52 | VICE_PRESIDENT | DOCTORAL | PHYSICS | HEALTH | 10 | 38 |


In [41]:
train_salaries = pd.read_csv('data/train_salaries.csv',index_col='jobId')
print("Rows and columns in train_salaries:",train_salaries.shape)
train_salaries.head(3)

Rows and columns in train_salaries: (1000000, 1)


| jobId | salary |
|---|---|
| JOB1362684407687 | 130 |
| JOB1362684407688 | 101 |
| JOB1362684407689 | 137 |


In [42]:
test_feat = pd.read_csv('data/test_features.csv',index_col='jobId')
print("Rows and columns in test_feat:",test_feat.shape)
test_feat.head(3)

Rows and columns in test_feat: (1000000, 7)


| jobId | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis |
|---|---|---|---|---|---|---|---|
| JOB1362685407687 | COMP33 | MANAGER | HIGH_SCHOOL | NONE | HEALTH | 22 | 73 |
| JOB1362685407688 | COMP13 | JUNIOR | NONE | NONE | AUTO | 20 | 47 |
| JOB1362685407689 | COMP10 | CTO | MASTERS | BIOLOGY | HEALTH | 17 | 9 |


### ---- 3 Clean the data ----

In [48]:
train_w_salary = train_feat.join(train_salaries)
print("Rows and columns in train_w_salary:",train_w_salary.shape)
train_w_salary.head(3)

Rows and columns in train_w_salary: (1000000, 8)


| jobId | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis | salary |
|---|---|---|---|---|---|---|---|---|
| JOB1362684407687 | COMP37 | CFO | MASTERS | MATH | HEALTH | 10 | 83 | 130 |
| JOB1362684407688 | COMP19 | CEO | HIGH_SCHOOL | NONE | WEB | 3 | 73 | 101 |
| JOB1362684407689 | COMP52 | VICE_PRESIDENT | DOCTORAL | PHYSICS | HEALTH | 10 | 38 | 137 |


In [50]:
#Check for duplicates in all other feature columns

features = ['companyId', 'jobType', 'degree', 'major', 'industry',
       'yearsExperience', 'milesFromMetropolis']

train_feat_dups = train_feat[train_feat.duplicated(features,keep=False)]
print("Number of rows with duplicate data in train_feat:",len(train_feat_dups))

Number of rows with duplicate data in train_feat: 15917


In [58]:
train_w_salary[train_feat.duplicated(features,keep=False)].sort_values(features).head(30)

| jobId | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis | salary |
|---|---|---|---|---|---|---|---|---|
| JOB1362685003735 | COMP0 | CEO | BACHELORS | BIOLOGY | SERVICE | 23 | 34 | 214 |
| JOB1362685283347 | COMP0 | CEO | BACHELORS | BIOLOGY | SERVICE | 23 | 34 | 122 |
| JOB1362685004580 | COMP0 | CEO | HIGH_SCHOOL | NONE | AUTO | 0 | 82 | 129 |
| JOB1362685288664 | COMP0 | CEO | HIGH_SCHOOL | NONE | AUTO | 0 | 82 | 97 |
| JOB1362685165440 | COMP0 | CEO | HIGH_SCHOOL | NONE | AUTO | 15 | 13 | 125 |
| JOB1362685283913 | COMP0 | CEO | HIGH_SCHOOL | NONE | AUTO | 15 | 13 | 156 |
| JOB1362684435928 | COMP0 | CEO | HIGH_SCHOOL | NONE | AUTO | 23 | 94 | 136 |
| JOB1362684748853 | COMP0 | CEO | HIGH_SCHOOL | NONE | AUTO | 23 | 94 | 105 |
| JOB1362684556793 | COMP0 | CEO | HIGH_SCHOOL | NONE | EDUCATION | 11 | 63 | 106 |
| JOB1362684734837 | COMP0 | CEO | HIGH_SCHOOL | NONE | EDUCATION | 11 | 63 | 83 |


In [51]:
features2 = ['companyId', 'jobType', 'degree', 'major', 'industry',
       'yearsExperience', 'milesFromMetropolis', 'salary']

train_w_salary_dups = train_w_salary[train_w_salary.duplicated(features2,keep=False)]
print("Number of rows with duplicate data in train_w_salary:",len(train_w_salary_dups))

Number of rows with duplicate data in train_w_salary: 372


In [52]:
train_w_salary_dups.sort_values(features2)

| jobId | companyId | jobType | degree | major | industry | yearsExperience | milesFromMetropolis | salary |
|---|---|---|---|---|---|---|---|---|
| JOB1362685182180 | COMP0 | CTO | HIGH_SCHOOL | NONE | HEALTH | 19 | 28 | 142 |
| JOB1362685321124 | COMP0 | CTO | HIGH_SCHOOL | NONE | HEALTH | 19 | 28 | 142 |
| JOB1362684800083 | COMP0 | JANITOR | HIGH_SCHOOL | NONE | EDUCATION | 9 | 79 | 32 |
| JOB1362684834695 | COMP0 | JANITOR | HIGH_SCHOOL | NONE | EDUCATION | 9 | 79 | 32 |
| JOB1362684777036 | COMP0 | JANITOR | NONE | NONE | FINANCE | 5 | 18 | 73 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| JOB1362685209805 | COMP9 | JUNIOR | NONE | NONE | HEALTH | 15 | 27 | 93 |
| JOB1362684424988 | COMP9 | MANAGER | NONE | NONE | WEB | 5 | 4 | 84 |
| JOB1362685314558 | COMP9 | MANAGER | NONE | NONE | WEB | 5 | 4 | 84 |
| JOB1362685056059 | COMP9 | VICE_PRESIDENT | HIGH_SCHOOL | NONE | WEB | 9 | 34 | 98 |
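
The two checks above separate rows that repeat every feature and the salary from rows that repeat the features but disagree on the salary. The following is a minimal sketch of one possible way to treat them, not a step the notebook actually performs: it drops the second copy of exact duplicates and only flags the conflicting rows. The five-row `sample` frame is a hypothetical stand-in for `train_w_salary`, with rows copied from the outputs above.

```python
import pandas as pd

features = ['companyId', 'jobType', 'degree', 'major', 'industry',
            'yearsExperience', 'milesFromMetropolis']

# Small illustrative sample; rows copied from the outputs above.
sample = pd.DataFrame(
    [
        ['JOB1362685182180', 'COMP0', 'CTO', 'HIGH_SCHOOL', 'NONE', 'HEALTH', 19, 28, 142],
        ['JOB1362685321124', 'COMP0', 'CTO', 'HIGH_SCHOOL', 'NONE', 'HEALTH', 19, 28, 142],
        ['JOB1362685003735', 'COMP0', 'CEO', 'BACHELORS', 'BIOLOGY', 'SERVICE', 23, 34, 214],
        ['JOB1362685283347', 'COMP0', 'CEO', 'BACHELORS', 'BIOLOGY', 'SERVICE', 23, 34, 122],
        ['JOB1362684407687', 'COMP37', 'CFO', 'MASTERS', 'MATH', 'HEALTH', 10, 83, 130],
    ],
    columns=['jobId'] + features + ['salary'],
).set_index('jobId')

# Exact duplicates (same features AND same salary): keep the first copy only.
deduped = sample[~sample.duplicated(features + ['salary'], keep='first')]

# Rows that share features but disagree on salary are kept and only flagged.
conflicting = deduped[deduped.duplicated(features, keep=False)]

print("Rows before:", len(sample), "after:", len(deduped))
print("Rows with conflicting salaries:", len(conflicting))
```

Whether to drop, keep, or average the conflicting rows is a judgment call that depends on how the research firm collected the postings, so the sketch leaves them in place.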


### ---- 4 Explore the data (EDA) ----

In [None]:
#summarize each feature variable
#summarize the target variable
#look for correlation between each feature and the target
#look for correlation between features
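
The four EDA steps listed above could be sketched roughly as follows. The frame `df` is a tiny hypothetical stand-in for `train_w_salary` with made-up values, and comparing mean salary per category is just one assumed way to look at the relationship between a categorical feature and the target.

```python
import pandas as pd

# Tiny hypothetical stand-in for train_w_salary (values are made up).
df = pd.DataFrame({
    'jobType': ['CEO', 'CEO', 'JANITOR', 'JANITOR', 'JUNIOR', 'JUNIOR'],
    'industry': ['HEALTH', 'WEB', 'HEALTH', 'WEB', 'HEALTH', 'WEB'],
    'yearsExperience': [20, 10, 5, 1, 3, 8],
    'milesFromMetropolis': [10, 50, 80, 90, 40, 20],
    'salary': [200, 170, 60, 45, 90, 110],
})

numeric_cols = ['yearsExperience', 'milesFromMetropolis']
categorical_cols = ['jobType', 'industry']

# Summarize the numeric features and the target.
print(df[numeric_cols + ['salary']].describe())

# Summarize the categorical features by their value counts.
for col in categorical_cols:
    print(df[col].value_counts())

# Correlation between each numeric feature and the target.
target_corr = df[numeric_cols].corrwith(df['salary'])
print(target_corr)

# For a categorical feature, compare the mean salary per category.
mean_by_job = df.groupby('jobType')['salary'].mean().sort_values(ascending=False)
print(mean_by_job)

# Correlation between the numeric features themselves.
feature_corr = df[numeric_cols].corr()
print(feature_corr)
```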

### ---- 5 Establish a baseline ----

In [None]:
#select a reasonable metric (MSE in this case)
#create an extremely simple model and measure its efficacy
#e.g. use "average salary" for each industry as your model and then measure MSE
#during 5-fold cross-validation
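
A minimal sketch of the baseline described above, assuming the "model" is simply the mean training salary per industry and that folds are formed by shuffling row positions. The function name `industry_mean_cv_mse` and the toy data are hypothetical; real use would pass `train_w_salary`.

```python
import numpy as np
import pandas as pd


def industry_mean_cv_mse(df, n_folds=5, seed=0):
    """Return per-fold MSE of predicting each row's salary as its industry mean."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(df)), n_folds)
    fold_mse = []
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        train, valid = df.iloc[train_idx], df.iloc[valid_idx]
        # "Model": the mean training salary for each industry.
        industry_means = train.groupby('industry')['salary'].mean()
        # Fall back to the overall mean for industries unseen in this fold.
        preds = valid['industry'].map(industry_means).fillna(train['salary'].mean())
        fold_mse.append(float(((valid['salary'] - preds) ** 2).mean()))
    return fold_mse


# Hypothetical toy data; the salaries are made up.
toy = pd.DataFrame({
    'industry': ['HEALTH', 'WEB', 'AUTO', 'EDUCATION', 'FINANCE'] * 20,
    'salary': np.tile([115, 120, 110, 100, 130], 20) + np.arange(100) % 7,
})
scores = industry_mean_cv_mse(toy)
print("MSE per fold:", scores)
print("Mean MSE:", np.mean(scores))
```

Computing the industry means inside each fold, from the training part only, keeps the validation rows from influencing their own predictions.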

### ---- 6 Hypothesize solution ----

In [None]:
#brainstorm 3 models that you think may improve results over the baseline model based
#on your EDA

Brainstorm 3 models that you think may improve results over the baseline model based on your EDA and explain why they're reasonable solutions here.

Also write down any new features that you think you should try adding to the model based on your EDA, e.g. interaction variables, summary statistics for each group, etc

## Part 3 - DEVELOP

You will cycle through creating features, tuning models, and training/validating models (steps 7-9) until you've reached your efficacy goal.

#### Your metric will be MSE and your goal is:
 - <360 for entry-level data science roles
 - <320 for senior data science roles

### ---- 7 Engineer features  ----

In [None]:
#make sure that data is ready for modeling
#create any new features needed to potentially enhance model
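
As a rough illustration of the two tasks above, the sketch below adds per-group salary statistics and one-hot encodes the categorical columns. The toy `train` and `test` frames, the choice of `jobType` and `industry` as grouping columns, and the `group_*` column names are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy rows in the same layout as train_w_salary / test_feat.
train = pd.DataFrame({
    'jobType': ['CEO', 'CEO', 'JANITOR', 'JANITOR'],
    'industry': ['HEALTH', 'HEALTH', 'WEB', 'WEB'],
    'yearsExperience': [10, 20, 2, 4],
    'salary': [150, 190, 40, 60],
})
test = pd.DataFrame({
    'jobType': ['CEO', 'JANITOR'],
    'industry': ['HEALTH', 'WEB'],
    'yearsExperience': [15, 3],
})

group_cols = ['jobType', 'industry']

# Summary statistics of salary per group, computed on training data only.
group_stats = (
    train.groupby(group_cols)['salary']
    .agg(group_mean='mean', group_min='min', group_max='max')
    .reset_index()
)

# Attach the group statistics to both sets with a left merge.
train_fe = train.merge(group_stats, on=group_cols, how='left')
test_fe = test.merge(group_stats, on=group_cols, how='left')

# One-hot encode the categorical columns so that linear models can use them.
train_fe = pd.get_dummies(train_fe, columns=group_cols)
test_fe = pd.get_dummies(test_fe, columns=group_cols)

# Align the test columns to the training columns (minus the target).
test_fe = test_fe.reindex(columns=train_fe.columns.drop('salary'), fill_value=0)
print(test_fe)
```

Group statistics derived from the target leak information if they are computed on rows that are later used for validation, so during cross-validation they would need to be recomputed from each fold's training part.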

### ---- 8 Create models ----

In [None]:
#create and tune the models that you brainstormed during part 2
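
The notebook does not say which models were brainstormed, so the following is only a sketch with three assumed candidates from scikit-learn (linear regression, random forest, gradient boosting), each wrapped in a pipeline that one-hot encodes the categorical columns. The helper `make_pipeline_for` and the hyperparameter values are illustrative, not tuned.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ['companyId', 'jobType', 'degree', 'major', 'industry']
numeric = ['yearsExperience', 'milesFromMetropolis']


def make_pipeline_for(estimator):
    """Wrap an estimator with one-hot encoding of the categorical columns."""
    preprocess = ColumnTransformer(
        [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical)],
        remainder='passthrough',  # numeric columns pass through unchanged
    )
    return Pipeline([('preprocess', preprocess), ('model', estimator)])


# Candidate models; hyperparameters are illustrative, not tuned values.
models = {
    'linear_regression': make_pipeline_for(LinearRegression()),
    'random_forest': make_pipeline_for(
        RandomForestRegressor(n_estimators=50, max_depth=10, random_state=0)),
    'gradient_boosting': make_pipeline_for(
        GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)),
}

# Smoke test on three rows taken from the training output above.
X = pd.DataFrame(
    [['COMP37', 'CFO', 'MASTERS', 'MATH', 'HEALTH', 10, 83],
     ['COMP19', 'CEO', 'HIGH_SCHOOL', 'NONE', 'WEB', 3, 73],
     ['COMP52', 'VICE_PRESIDENT', 'DOCTORAL', 'PHYSICS', 'HEALTH', 10, 38]],
    columns=categorical + numeric,
)
y = [130, 101, 137]
for name, model in models.items():
    model.fit(X, y)
    print(name, model.predict(X).round(1))
```

Keeping the encoder inside the pipeline means it is refit on each training fold during cross-validation, and `handle_unknown='ignore'` lets the model score categories it did not see during training.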

### ---- 9 Test models ----

In [None]:
#do 5-fold cross validation on models and measure MSE
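
A minimal sketch of 5-fold cross-validation scored by MSE with scikit-learn's `cross_val_score`. The synthetic `X` and `y` merely stand in for an engineered feature matrix and the salary column, and `LinearRegression` is a placeholder for whichever model is being tested.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic numeric data standing in for the engineered feature matrix.
rng = np.random.default_rng(0)
X = rng.uniform(0, 25, size=(200, 2))
y = 50 + 2.0 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(0, 5, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn maximizes scores, so MSE is exposed as a negated value.
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring='neg_mean_squared_error', cv=cv)
mse_per_fold = -neg_mse

print("MSE per fold:", mse_per_fold.round(2))
print("Mean MSE: %.2f (std %.2f)" % (mse_per_fold.mean(), mse_per_fold.std()))
```

Reusing the same `KFold` object, with a fixed `random_state`, for every candidate keeps the comparison between models on identical splits.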

### ---- 10 Select best model  ----

In [None]:
#select the model with the lowest error as your "production" model
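
Selecting the model with the lowest error can be as simple as taking the minimum over a dictionary of mean cross-validated MSE values. The model names and numbers below are placeholders invented for the example, not results from this project.

```python
# Hypothetical mean cross-validated MSE per model; placeholder numbers only.
cv_mse = {
    'model_a': 400.0,
    'model_b': 380.0,
    'model_c': 365.0,
}

# The "production" model is the one with the lowest mean MSE.
best_name = min(cv_mse, key=cv_mse.get)
print("Best model:", best_name, "with MSE", cv_mse[best_name])
```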

## Part 4 - DEPLOY

### ---- 11 Automate pipeline ----

In [None]:
#write script that trains model on entire training set, saves model to disk,
#and scores the "test" dataset
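
The script described above might look roughly like the sketch below: one function fits a pipeline on the full training set and pickles it, another loads it and scores new rows. The function names `train_and_save` and `load_and_score`, the use of `LinearRegression`, the file name, and the toy data are all assumptions; real use would read the CSV files and plug in the selected model.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def train_and_save(train_df, target, categorical, model_path):
    """Fit a pipeline on the full training set and pickle it to model_path."""
    preprocess = ColumnTransformer(
        [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical)],
        remainder='passthrough',
    )
    pipeline = Pipeline([('preprocess', preprocess),
                         ('model', LinearRegression())])
    pipeline.fit(train_df.drop(columns=[target]), train_df[target])
    with open(model_path, 'wb') as f:
        pickle.dump(pipeline, f)


def load_and_score(test_df, model_path):
    """Load the pickled pipeline and return predictions indexed like test_df."""
    with open(model_path, 'rb') as f:
        pipeline = pickle.load(f)
    return pd.Series(pipeline.predict(test_df), index=test_df.index, name='salary')


# Toy demonstration with made-up rows; real use would read the CSV files.
train_df = pd.DataFrame({
    'jobType': ['CEO', 'JANITOR', 'CEO', 'JANITOR'],
    'yearsExperience': [10, 2, 20, 4],
    'salary': [150, 40, 190, 60],
})
test_df = pd.DataFrame({'jobType': ['CEO', 'JANITOR'],
                        'yearsExperience': [15, 3]},
                       index=pd.Index(['JOB_A', 'JOB_B'], name='jobId'))

model_path = os.path.join(tempfile.mkdtemp(), 'salary_model.pkl')
train_and_save(train_df, 'salary', ['jobType'], model_path)
predictions = load_and_score(test_df, model_path)
print(predictions)
```

Pickle files should only be loaded from trusted sources, and a model saved this way is best loaded with the same scikit-learn version that wrote it.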

### ---- 12 Deploy solution ----

In [None]:
#save your prediction to a csv file or optionally save them as a table in a SQL database
#additionally, you want to save a visualization and summary of your prediction and feature importances
#these visualizations and summaries will be extremely useful to business stakeholders
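
One possible way to persist the outputs mentioned above is sketched below, writing predictions both to a CSV file and to a SQLite table and printing a short summary. The prediction values, feature names, importances, file names, and table name are placeholders chosen for the example.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd

# Hypothetical predictions and importances; the numbers are placeholders.
predictions = pd.DataFrame(
    {'salary': [111.0, 92.5, 180.25]},
    index=pd.Index(['JOB_A', 'JOB_B', 'JOB_C'], name='jobId'),
)
importances = pd.Series(
    {'feature_a': 0.5, 'feature_b': 0.3, 'feature_c': 0.2},
    name='importance',
).sort_values(ascending=False)

out_dir = tempfile.mkdtemp()

# Option 1: save the predictions to a CSV file, keeping jobId as a column.
csv_path = os.path.join(out_dir, 'test_salaries.csv')
predictions.to_csv(csv_path)

# Option 2: save them as a table in a SQLite database.
db_path = os.path.join(out_dir, 'predictions.db')
with closing(sqlite3.connect(db_path)) as conn:
    predictions.to_sql('test_salaries', conn, if_exists='replace')
    n_rows = conn.execute('SELECT COUNT(*) FROM test_salaries').fetchone()[0]

# Short summary for stakeholders.
summary = predictions['salary'].describe()
print(summary)
print(importances)
print("Rows written to SQL:", n_rows)
```

If matplotlib is installed, the sorted `importances` series can also be drawn as a horizontal bar chart with `importances.plot.barh()` and saved alongside the CSV.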

### ---- 13 Measure efficacy ----

We'll skip this step since we don't have the outcomes for the test data