# Salary Predictions Based on Job Descriptions

### Problem Definition

As a Data Scientist in the recruitment industry, my goal is to build a system to predict salaries for a new set of job postings based on data provided on a set of historic job postings that include salaries.

Two CSV data files are available as a basis for training a machine learning model:

- train_features.csv: Each row represents metadata for an individual job posting. The “jobId” column represents a unique identifier for the job posting. The remaining columns describe features of the job posting.
- train_salaries.csv: Each row associates a “jobId” with a “salary”.

The data upon which predictions should be made are stored in a further CSV data file:

- test_features.csv: Similar to train_features.csv, each row represents metadata for an individual job posting.

The output of my system should be a CSV file entitled test_salaries.csv where each row has the following format:

jobId, salary

### Library Imports

In [None]:
# Importing base libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

# Importing scikit-learn pre-processing libraries



# Importing scikit-learn machine learning libraries



# Author information
__author__ = "Ross MacDonald"
__email__ = "ross.macdonald@technologist.com"

### Reusable Functions

In [None]:
# Load a CSV data file, then print a completion message and confirm the shape
def load_file(file):
    df = pd.read_csv(file)
    shape = df.shape
    print("Data file is loaded, the shape of the dataset = {}".format(shape))
    return df

# Print the shape of a dataframe and return its first and last x rows
def ends(df, x):
    print('{} rows x {} columns'.format(df.shape[0], df.shape[1]))
    # DataFrame.append was removed in pandas 2.0, so concatenate head and tail instead
    return pd.concat([df.head(x), df.tail(x)])

### Load the Data

In [None]:
# Load the data into pandas dataframes (csv files for training and test data)
train_feature_df = load_file('data/train_features.csv')
train_target_df = load_file('data/train_salaries.csv')
test_df = load_file('data/test_features.csv')

In [None]:
# Merge the data on jobId to get a single training dataset (includes features and target)
train_df = pd.merge(train_feature_df,train_target_df,how="inner",on="jobId")
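An inner join silently drops any jobId that appears in only one of the two files. As a minimal sketch of how that could be checked, the example below uses two tiny stand-in dataframes (the jobId values and numbers are made up) together with the `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Tiny stand-in frames; the notebook itself uses train_feature_df and train_target_df
features = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "yearsExperience": [3, 10, 7],
})
salaries = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB4"],
    "salary": [90, 130, 110],
})

# validate="one_to_one" raises MergeError if jobId is duplicated on either side
# indicator=True adds a _merge column recording which side each row came from
checked = pd.merge(features, salaries, how="outer", on="jobId",
                   validate="one_to_one", indicator=True)

# Rows that an inner join would have dropped without warning
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["jobId", "_merge"]])
```

If `unmatched` is empty on the real data, the inner join loses nothing.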

In [None]:
# Initial check of top 10 rows of train_df to confirm load and get an initial view on data / types
train_df.head(10)

In [None]:
# Initial check of top 10 rows of test_df to confirm load and get an initial view on data / types
test_df.head(10)

In [None]:
# Get information on the train_df dataframe. Ensure data types and number of entries per column are as expected.
train_df.info()

In [None]:
# Get information on the test_df dataframe. Ensure data types and number of entries per column are as expected.
test_df.info()

### Clean the Data

In [None]:
# Check for duplicate rows in train_df
print("Number of duplicated rows = {}".format(train_df.duplicated().sum()))

In [None]:
# Check for duplicate rows in test_df
print("Number of duplicated rows = {}".format(test_df.duplicated().sum()))

In [None]:
# Check for null values in the columns of train_df
train_df.isnull().sum().to_frame('Null Entries')

In [None]:
# Check for null values in the columns of test_df
test_df.isnull().sum().to_frame('Null Entries')

##### There are no duplicates or null entries in the data, so there is no need to drop rows / entries or substitute any values.

In [None]:
# Get the column names for numerical columns from train_df to enable us to filter for invalid values 
print(train_df.select_dtypes(include=['float64', 'int64']).columns.values)

In [None]:
# Having found the column names for numerical columns from train_df, find rows that contain invalid values (i.e. <0 or <=0)
print(train_df[(train_df['yearsExperience'] < 0) | (train_df['milesFromMetropolis'] < 0) | (train_df['salary'] <= 0)])

In [None]:
# Remove the rows that contain invalid data from the train_df dataframe (i.e. salary was invalid in all cases, thus only
# include rows with salaries more than 0)
train_df = train_df[train_df.salary > 0]

In [None]:
# Check dataframe information to confirm that the expected rows have been dropped
train_df.info()

In [None]:
# Reset the index of train_df after dropping the invalid values and print head and tail (using function ends) to confirm
# reindex. Makes rows sequential to prevent any misunderstanding in subsequent analysis.
train_df = train_df.reset_index(drop=True)
ends(train_df, 3)

##### Invalid numerical entries have been dropped from the train_df dataframe, ensuring only valid data informs the prediction model / system. Index has been reset to prevent any misunderstanding (due to non-sequential rows in data) in any further analysis.

In [None]:
# Generate summary statistics for the training data train_df
train_df.salary.describe()

##### The summary 

In [None]:
# Visualize target variable (salary)
plt.figure(figsize = (14, 6))
plt.subplot(1,2,1)
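sns.boxplot(x=train_df.salary)
plt.subplot(1,2,2)
# distplot is deprecated in seaborn; histplot with kde=True gives a comparable histogram with a density curve
sns.histplot(train_df.salary, bins=50, color='green', kde=True)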
plt.show()

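In [None]:
# Compute the quartiles and the interquartile range (IQR) of salary
Q1, Q3 = np.percentile(train_df.salary, [25, 75])
IQR = Q3 - Q1
print(IQR)

In [None]:
# Keep only salaries within 1.5 * IQR of the quartiles (i.e. exclude potential outliers)
df = train_df.salary[(train_df.salary >= (Q1 - 1.5 * IQR)) & (train_df.salary <= (Q3 + 1.5 * IQR))]
print(df)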
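The 1.5 × IQR rule is easier to verify on a handful of values. The sketch below applies it to a small, made-up salary series (not the training data) and separates the values inside the fences from the potential outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries; 400 is an obvious outlier
salary = pd.Series([80, 95, 100, 110, 120, 130, 400])

q1, q3 = np.percentile(salary, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean masks need the element-wise operators & and |;
# Python's `and` / `or` raise ValueError when applied to a Series
within = salary[(salary >= lower) & (salary <= upper)]
outliers = salary[(salary < lower) | (salary > upper)]

print("Fences:", lower, upper)
print(outliers)
```

Whether values outside the fences are truly invalid is a judgment call: very high salaries may be legitimate for senior roles, so inspecting them before dropping anything is usually safer.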

### ---- 4 Explore the data (EDA) ----

In [None]:
#summarize each feature variable
#summarize the target variable
#look for correlation between each feature and the target
#look for correlation between features
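The four EDA steps listed above could start along the following lines. This is only a sketch on a made-up stand-in for train_df: the numeric column names follow the notebook, `industry` is taken from the baseline step, and all values are invented.

```python
import pandas as pd

# Hypothetical stand-in for train_df; the values are made up
demo_df = pd.DataFrame({
    "industry": ["HEALTH", "WEB", "WEB", "FINANCE", "HEALTH", "FINANCE"],
    "yearsExperience": [1, 5, 10, 15, 20, 24],
    "milesFromMetropolis": [90, 70, 50, 40, 20, 5],
    "salary": [60, 95, 120, 150, 140, 190],
})

# Summarize the numeric features and the target
print(demo_df.describe())

# Summarize a categorical feature by its level counts
print(demo_df["industry"].value_counts())

# Correlation of each numeric feature with the target, and between features
corr = demo_df.select_dtypes(include="number").corr()
print(corr["salary"].drop("salary"))
print(corr)

# Mean salary per category relates a categorical feature to the target
print(demo_df.groupby("industry")["salary"].mean())
```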

### ---- 5 Establish a baseline ----

In [None]:
#select a reasonable metric (MSE in this case)
#create an extremely simple model and measure its efficacy
#e.g. use "average salary" for each industry as your model and then measure MSE
#during 5-fold cross-validation
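A minimal sketch of the baseline described above, assuming an `industry` column and using synthetic data in place of train_df. The "model" is just the mean salary per industry, learned on each training fold and scored by MSE on the held-out fold.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for train_df (industry names and salaries are made up)
rng = np.random.default_rng(0)
base = {"HEALTH": 100.0, "WEB": 120.0, "FINANCE": 140.0}
demo_df = pd.DataFrame({"industry": rng.choice(list(base), size=300)})
demo_df["salary"] = demo_df["industry"].map(base) + rng.normal(0, 10, size=300)

fold_mse = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(demo_df):
    train, val = demo_df.iloc[train_idx], demo_df.iloc[val_idx]
    # "Model": mean salary per industry, learned from the training fold only
    industry_mean = train.groupby("industry")["salary"].mean()
    # Fall back to the overall mean for industries unseen in the training fold
    preds = val["industry"].map(industry_mean).fillna(train["salary"].mean())
    fold_mse.append(mean_squared_error(val["salary"], preds))

print("Baseline MSE per fold:", np.round(fold_mse, 1))
print("Mean baseline MSE:", np.mean(fold_mse))
```

Computing the group means inside each fold matters: averaging over the full dataset first would let the validation salaries leak into the predictions.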

### ---- 6 Hypothesize solution ----

In [None]:
#brainstorm 3 models that you think may improve results over the baseline model based
#on your EDA

Brainstorm 3 models that you think may improve results over the baseline model based on your EDA and explain why they're reasonable solutions here.

Also write down any new features that you think you should try adding to the model based on your EDA, e.g. interaction variables, summary statistics for each group, etc.

## Part 3 - DEVELOP

You will cycle through creating features, tuning models, and training/validating models (steps 7-9) until you've reached your efficacy goal.

#### Your metric will be MSE and your goal is:
 - <360 for entry-level data science roles
 - <320 for senior data science roles

### ---- 7 Engineer features  ----

In [None]:
#make sure that data is ready for modeling
#create any new features needed to potentially enhance model
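One of the feature ideas mentioned earlier is summary statistics for each group. The sketch below shows one possible way to build them, assuming an `industry` column and using made-up train and test frames. The feature names `industry_mean` and `industry_median` are illustrative, not part of the original data.

```python
import pandas as pd

# Hypothetical stand-ins for the training and test sets
train = pd.DataFrame({
    "industry": ["HEALTH", "WEB", "WEB", "FINANCE"],
    "yearsExperience": [2, 5, 10, 8],
    "salary": [70, 100, 130, 150],
})
test = pd.DataFrame({
    "industry": ["WEB", "HEALTH", "AUTO"],
    "yearsExperience": [3, 12, 6],
})

# Group statistics come from the training data only and are then joined onto
# both sets, so nothing about the test rows influences the features
group_stats = (train.groupby("industry")["salary"]
               .agg(industry_mean="mean", industry_median="median")
               .reset_index())
train_fe = train.merge(group_stats, on="industry", how="left")
test_fe = test.merge(group_stats, on="industry", how="left")

# Categories unseen in training fall back to the overall training statistics
test_fe["industry_mean"] = test_fe["industry_mean"].fillna(train["salary"].mean())
test_fe["industry_median"] = test_fe["industry_median"].fillna(train["salary"].median())
print(test_fe)
```

Note that each training row's own salary contributes to its group statistic here; computing the statistics within cross-validation folds avoids that optimistic bias.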

### ---- 8 Create models ----

In [None]:
#create and tune the models that you brainstormed during part 2

### ---- 9 Test models ----

In [None]:
#do 5-fold cross validation on models and measure MSE

### ---- 10 Select best model  ----

In [None]:
#select the model with the lowest error as your "production" model
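Steps 9 and 10 can be sketched together: run 5-fold cross-validation on each candidate, record the mean MSE, and keep the candidate with the lowest value. The three models and the synthetic data below are illustrative assumptions, not the notebook's actual candidates or features.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the engineered training features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate models are illustrative choices only
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

mean_mse = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MSE is exposed as its negative
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    mean_mse[name] = -neg_mse.mean()
    print(f"{name}: mean MSE = {mean_mse[name]:.1f}")

# The "production" model is the one with the lowest cross-validated MSE
best_name = min(mean_mse, key=mean_mse.get)
print("Best model:", best_name)
```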

## Part 4 - DEPLOY

### ---- 11 Automate pipeline ----

In [None]:
#write script that trains model on entire training set, saves model to disk,
#and scores the "test" dataset
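A possible shape for such a script is sketched below: fit on the full training set, persist the model with `joblib`, reload it, and score the test set. The data, the choice of `LinearRegression`, and the file name `salary_model.joblib` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins for the prepared training and test sets
train = pd.DataFrame({
    "yearsExperience": [1, 5, 10, 15, 20],
    "milesFromMetropolis": [80, 60, 40, 20, 10],
    "salary": [70, 100, 130, 160, 185],
})
test = pd.DataFrame({
    "jobId": ["JOB_A", "JOB_B"],
    "yearsExperience": [3, 12],
    "milesFromMetropolis": [70, 30],
})
feature_cols = ["yearsExperience", "milesFromMetropolis"]

# Train on the entire training set
model = LinearRegression().fit(train[feature_cols], train["salary"])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save the fitted model to disk, then reload it as a scoring script would
    model_path = Path(tmp_dir) / "salary_model.joblib"
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

# Score the test set with the reloaded model
predictions = loaded.predict(test[feature_cols])
print(predictions)
```

A temporary directory keeps the sketch free of side effects; a real pipeline would write the model to a fixed, versioned location.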

### ---- 12 Deploy solution ----

In [None]:
#save your prediction to a csv file or optionally save them as a table in a SQL database
#additionally, you want to save a visualization and summary of your prediction and feature importances
#these visualizations and summaries will be extremely useful to business stakeholders
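The output format from the problem definition (one `jobId, salary` row per posting in test_salaries.csv) could be produced as sketched below. The data and the random forest are placeholders, chosen only because tree ensembles expose `feature_importances_`.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-ins for the prepared training and test sets
train = pd.DataFrame({
    "yearsExperience": [1, 5, 10, 15, 20, 24],
    "milesFromMetropolis": [80, 60, 40, 20, 10, 5],
    "salary": [70, 100, 130, 160, 185, 200],
})
test = pd.DataFrame({
    "jobId": ["JOB_A", "JOB_B"],
    "yearsExperience": [3, 12],
    "milesFromMetropolis": [70, 30],
})
feature_cols = ["yearsExperience", "milesFromMetropolis"]

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(train[feature_cols], train["salary"])

# One row per job posting in the required "jobId, salary" format
output = pd.DataFrame({
    "jobId": test["jobId"],
    "salary": model.predict(test[feature_cols]),
})

# Feature importances, sorted for a summary table or bar chart
importances = (pd.Series(model.feature_importances_, index=feature_cols)
               .sort_values(ascending=False))

with tempfile.TemporaryDirectory() as out_dir:
    out_path = Path(out_dir) / "test_salaries.csv"
    output.to_csv(out_path, index=False)  # index=False keeps only the two columns
    written = pd.read_csv(out_path)

print(written)
print(importances)
```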

### ---- 13 Measure efficacy ----

We'll skip this step since we don't have the outcomes for the test data.