# Unit 2 Assessment

## Swetha Veerla (U62395128)

In this assignment, we will focus on salary prediction. The data set for this assignment includes information on job postings. Use this data set to see if you can predict the salary of a job posting (i.e., the `Salary` column in the data set) based on the job description and other attributes of the job. This is important, because this model can make a salary recommendation as soon as a job posting is entered into a system.

## Description of Variables

The descriptions of the variables are provided in "Jobs - Data Dictionary.docx".

## Goal

Use the **jobs_alldata.csv** data set and build models to predict **salary**.

**Be careful: this is a REGRESSION task**

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


## Important hints:

* This assignment requires you to work with a text-based column in addition to regular numeric/categorical columns. So you will have to pay attention to your pipelines during data processing.
* You can do your data prep before or after the train/test split. Regardless, you should use train_test_split only once. If you find yourself using it twice, it means you are doing something wrong.
* Recommended approach: 
    * import the data and perform the train/test split - like we always do. 
    * identify the names of numeric, categorical, feature engineered, and text columns - like we always do
    * create individual pipelines for each type of column - like we always do. For the text pipeline, I would recommend the TFIDF Vectorizer with SVDs. Though, you can also use TFIDF Vectorizer with top N terms (without SVDs).
    * combine all pipelines using the column transformer - like we always do 
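
The recommended approach above can be sketched on a tiny, made-up data frame. The column names (`years`, `site`, `posting`) and the two SVD components are assumptions chosen for illustration only; they are not the columns or settings of the jobs data set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are made up for this sketch.
toy = pd.DataFrame({
    "years": [1, 5, 3, 8],
    "site": ["Remote", "East", "Remote", "West"],
    "posting": [
        "manage budget reports and audits",
        "design software systems and databases",
        "prepare budget audits for agency",
        "lead engineering design reviews",
    ],
})

# Text pipeline: TF-IDF term weights, then a low-rank SVD projection.
text_pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])

toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["years"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["site"]),
    # A scalar column name (not a list) hands the text step a 1-D Series.
    ("text", text_pipeline, "posting"),
])

toy_x = toy_preprocessor.fit_transform(toy)
print(toy_x.shape)  # 1 numeric + 3 one-hot + 2 SVD columns
```

Each column type gets its own pipeline, and the column transformer stacks the outputs side by side, so the final width is the sum of the widths produced by the individual steps.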

# Section 1: (6 points in total)

## Data Prep (4 points)

## Setup

In [1]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(9870)

# Get the data

In [2]:
jobs = pd.read_csv('jobs_alldata.csv')
jobs.head()

Unnamed: 0,Salary,Job Description,Location,Min_years_exp,Technical,Comm,Travel
0,67206,Civil Service Title: Regional Director Mental ...,Remote,5,2,3,No
1,88313,The New York City Comptroller’s Office Burea...,Remote,5,2,4,10-15 hrs
2,81315,With minimal supervision from the Deputy Commi...,East campus,5,3,3,5-10 hrs
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...,East campus,1,1,3,No
4,55675,Only candidates who are permanent in the Princ...,Southeast campus,1,1,3,5-10 hrs


# Split the data into train and test


In [3]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(jobs, test_size=0.3)
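
When `random_state` is not passed, `train_test_split` draws from NumPy's global random state, so the split above is reproducible only if the `np.random.seed(9870)` cell is run first and nothing else consumes random numbers in between. A minimal sketch of the alternative, passing an explicit seed (the value 42 and the toy array are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(20).reshape(10, 2)  # toy data, not the jobs data set

# The same explicit seed yields the same partition on every call.
first_train, first_test = train_test_split(rows, test_size=0.3, random_state=42)
second_train, second_test = train_test_split(rows, test_size=0.3, random_state=42)

print(np.array_equal(first_test, second_test))
print(first_train.shape, first_test.shape)
```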

## Check the missing values


In [4]:
train_set.isna().sum()

Salary             0
Job Description    0
Location           0
Min_years_exp      0
Technical          0
Comm               0
Travel             0
dtype: int64

In [5]:
test_set.isna().sum()

Salary             0
Job Description    0
Location           0
Min_years_exp      0
Technical          0
Comm               0
Travel             0
dtype: int64

## Data Prep

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

## Separate the target variable


In [7]:
train_y = train_set[['Salary']]
test_y = test_set[['Salary']]

train_inputs = train_set.drop(['Salary'], axis=1)
test_inputs = test_set.drop(['Salary'], axis=1)

## Identify the numerical and categorical columns (programmatically)

In [8]:
train_inputs.dtypes

Job Description    object
Location           object
Min_years_exp       int64
Technical           int64
Comm                int64
Travel             object
dtype: object

In [9]:
train_y.shape

(1689, 1)

In [10]:
train_inputs.shape

(1689, 6)

## Feature Engineering (1 point)

Create one NEW feature from existing data. You either transform a single variable, or create a new variable from existing ones. 

Grading: 
- 0.5 points for creating the new feature correctly
- 0.5 points for the justification of the new feature (i.e., why did you create this new feature)

In [11]:
def new_col(df):
    # Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()

    # Ratio of communication to technical skill; 0/0 generates NaN, which is filled with 0
    df1['Skills_Ratio'] = (df1['Comm'] / df1['Technical']).fillna(0)

    # Replace infinity with 1 (a nonzero value divided by 0 generates infinity).
    # Assign the result back instead of using inplace=True on a selected column,
    # which recent pandas versions warn about.
    df1['Skills_Ratio'] = df1['Skills_Ratio'].replace(np.inf, 1)

    # To check whether the calculation is made correctly, return df1 instead
    return df1[['Skills_Ratio']]
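
The two edge cases handled in the function can be checked on a three-row toy frame. This is a minimal sketch: `skills_ratio` is a hypothetical stand-in that follows the same formula, and the input values are made up.

```python
import numpy as np
import pandas as pd

def skills_ratio(df):
    """Comm / Technical, with 0/0 mapped to 0 and x/0 mapped to 1."""
    ratio = (df["Comm"] / df["Technical"]).fillna(0)
    return ratio.replace(np.inf, 1).to_frame("Skills_Ratio")

# Row 0 is an ordinary ratio, row 1 is 0/0, row 2 is a nonzero value over 0.
toy = pd.DataFrame({"Comm": [3, 0, 2], "Technical": [2, 0, 0]})
toy_ratio = skills_ratio(toy)
print(toy_ratio)
```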

## Feature Engineering Justification 

In [122]:
new_col(train_inputs)

Unnamed: 0,Skills_Ratio
1647,1.500000
2172,1.000000
980,1.500000
1999,4.000000
594,1.500000
...,...
127,2.000000
793,4.000000
58,1.333333
1697,3.000000


## Caveat for creating a pipeline for text columns


In [13]:
def new_col_text(df):
    #Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()
    
    # First, convert the dataframe column to a numpy array. Then, call the ravel function to make it one-dimensional
    return np.array(df1).ravel()
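
The reason for the flattening step is that `TfidfVectorizer` expects a one-dimensional iterable of strings, while selecting columns with a list gives the transformer a two-dimensional frame. A small sketch with made-up postings shows the shapes involved:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up postings; "posting" is a hypothetical column name.
toy = pd.DataFrame({"posting": [
    "budget analyst for city agency",
    "software engineer for data systems",
    "senior budget director",
]})

two_d = toy[["posting"]]          # what a list-based column selection passes on
flat = np.array(two_d).ravel()    # one string per document

print(two_d.shape, flat.shape)

# The vectorizer now sees three documents, one per row of the frame.
toy_terms = TfidfVectorizer().fit_transform(flat)
print(toy_terms.shape[0])
```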

In [14]:
new_col(train_set)

Unnamed: 0,Skills_Ratio
1647,1.500000
2172,1.000000
980,1.500000
1999,4.000000
594,1.500000
...,...
127,2.000000
793,4.000000
58,1.333333
1697,3.000000


# Pipeline

In [15]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [16]:
text_column = ['Job Description']

In [17]:
for col in text_column:
    categorical_columns.remove(col)

In [18]:
numeric_columns

['Min_years_exp', 'Technical', 'Comm']

In [19]:
categorical_columns

['Location', 'Travel']

In [20]:
text_column

['Job Description']

In [21]:
feat_eng_columns = ['Technical', 'Comm']

In [22]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [23]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='mean')),
                ('scaler', StandardScaler())])

In [24]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
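
The `handle_unknown='ignore'` setting matters when the test set contains a category that never appeared in training: the encoder then emits an all-zero row instead of raising an error. A minimal sketch, where "West campus" is a made-up value used only to trigger that case:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["Remote"], ["East campus"], ["Southeast campus"]]))

# "West campus" is a made-up value that was not present during fit.
encoded = encoder.transform(np.array([["Remote"], ["West campus"]])).toarray()

print(encoder.categories_[0])
print(encoded)
```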

In [25]:
my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col)),
                               ('scaler', StandardScaler())])

In [26]:
No_of_SVD_Components=300

In [27]:
text_transformer = Pipeline(steps=[
                ('my_new_column', FunctionTransformer(new_col_text)),
                ('text', TfidfVectorizer(stop_words='english')),
                ('svd', TruncatedSVD(n_components=No_of_SVD_Components, n_iter=10)) 
            ])
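
To get a feel for what the TF-IDF and SVD steps do, the sketch below runs them on five made-up postings and reports how much of the variance in the term matrix the retained components explain. The corpus and the choice of three components are assumptions for illustration; the notebook itself uses 300 components on the real job descriptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus standing in for the job descriptions.
postings = [
    "budget analyst prepares financial reports",
    "software engineer builds data pipelines",
    "financial auditor reviews budget reports",
    "data engineer maintains software platforms",
    "program director manages agency staff",
]

tfidf = TfidfVectorizer(stop_words="english")
term_matrix = tfidf.fit_transform(postings)

# Project the sparse term matrix onto a few dense components.
svd = TruncatedSVD(n_components=3, n_iter=10, random_state=0)
reduced = svd.fit_transform(term_matrix)

print(term_matrix.shape, "->", reduced.shape)
print("Explained variance retained:", svd.explained_variance_ratio_.sum())
```

Checking `explained_variance_ratio_` on the real data is one way to judge whether the chosen number of components is larger or smaller than it needs to be.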

In [28]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('text', text_transformer, text_column),
        ('trans', my_new_column, feat_eng_columns)],
        remainder='passthrough')

# remainder='passthrough' is optional. You don't have to use it.

# Transform: fit_transform() for Train

In [29]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[ 1.09658984, -0.21755792, -0.14708252, ...,  0.03200587,
        -0.02860663, -0.31457249],
       [ 1.09658984,  0.60633295, -0.14708252, ..., -0.02333141,
        -0.01234264, -0.76725456],
       [ 0.53955518, -0.21755792, -0.14708252, ..., -0.01457511,
         0.03646927, -0.31457249],
       ...,
       [ 1.09658984,  0.60633295,  0.97193717, ...,  0.01376906,
        -0.01494834, -0.46546652],
       [-1.1315488 , -1.04144879, -0.14708252, ...,  0.03125062,
        -0.00284168,  1.04347371],
       [ 1.09658984, -1.04144879, -0.14708252, ...,  0.03203856,
        -0.04802004,  1.04347371]])

In [30]:
train_x.shape

(1689, 313)

# Transform: transform() for TEST

In [31]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[ 1.09658984e+00, -1.04144879e+00,  9.71937175e-01, ...,
         4.56091711e-03,  4.72315876e-02,  1.94883785e+00],
       [ 1.09658984e+00,  6.06332951e-01,  2.09095687e+00, ...,
         9.77981804e-03, -3.06396521e-02, -1.63678472e-01],
       [-1.13154880e+00, -2.17557921e-01,  9.71937175e-01, ...,
        -4.20437450e-03,  8.74643708e-03,  1.38109574e-01],
       ...,
       [ 5.39555182e-01,  6.06332951e-01,  9.71937175e-01, ...,
        -4.30328393e-02, -7.90382007e-02, -4.65466518e-01],
       [ 1.09658984e+00,  2.25411470e+00, -1.47082517e-01, ...,
         1.09630358e-02, -1.24747192e-03, -1.12940022e+00],
       [ 1.09658984e+00, -2.17557921e-01, -1.47082517e-01, ...,
        -5.54828537e-03,  1.19191260e-02, -3.14572495e-01]])

In [32]:
test_x.shape

(724, 313)

## Find the Baseline (1 point)

In [33]:
from sklearn.dummy import DummyRegressor

dummy_regr = DummyRegressor(strategy="mean")

dummy_regr.fit(train_inputs, train_y)

In [34]:
from sklearn.metrics import mean_squared_error

In [35]:
#Baseline Train RMSE
dummy_train_pred = dummy_regr.predict(train_inputs)

baseline_train_mse = mean_squared_error(train_y, dummy_train_pred)

baseline_train_rmse = np.sqrt(baseline_train_mse)

print('Baseline Train RMSE: {}' .format(baseline_train_rmse))

Baseline Train RMSE: 29334.594049222505


In [36]:
#Baseline Test RMSE
dummy_test_pred = dummy_regr.predict(test_inputs)

baseline_test_mse = mean_squared_error(test_y, dummy_test_pred)

baseline_test_rmse = np.sqrt(baseline_test_mse)

print('Baseline Test RMSE: {}' .format(baseline_test_rmse))

Baseline Test RMSE: 28874.209927959517
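
A useful way to read the baseline: a `DummyRegressor` with `strategy="mean"` always predicts the training mean, so its training RMSE equals the population standard deviation of the training target. The sketch below checks this on four made-up salary values:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

salaries = np.array([50000.0, 60000.0, 70000.0, 90000.0])  # made-up values
features = np.zeros((len(salaries), 1))  # ignored by DummyRegressor

mean_model = DummyRegressor(strategy="mean").fit(features, salaries)
baseline_rmse = np.sqrt(mean_squared_error(salaries, mean_model.predict(features)))

# np.std uses ddof=0 by default, which matches the RMSE around the mean.
print(baseline_rmse, salaries.std())
```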


# Models


## Decision Tree: (1 point)

In [60]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=4) 

tree_reg.fit(train_x, train_y)

In [61]:
#Train RMSE
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 23278.91479281602


In [62]:
#Test RMSE
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 26182.360042468208
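
The same three-step RMSE calculation is repeated for every model in this notebook. One possible way to shorten it is a small helper; `rmse` is a hypothetical name, and the synthetic data below exists only to make the sketch runnable on its own.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def rmse(model, features, target):
    """Root mean squared error of a fitted model on the given data."""
    return np.sqrt(mean_squared_error(target, model.predict(features)))

# Synthetic regression data, unrelated to the jobs data set.
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(60, 3))
toy_y = 2.0 * toy_x[:, 0] + rng.normal(scale=0.1, size=60)

toy_tree = DecisionTreeRegressor(max_depth=4, random_state=0)
toy_tree.fit(toy_x[:40], toy_y[:40])

toy_train_rmse = rmse(toy_tree, toy_x[:40], toy_y[:40])
toy_test_rmse = rmse(toy_tree, toy_x[40:], toy_y[40:])
print("Train RMSE: {}".format(toy_train_rmse))
print("Test RMSE: {}".format(toy_test_rmse))
```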


## Voting regressor (1 point):

The voting regressor should have at least 3 individual models

In [63]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(max_depth=5)
svm_reg = SVR(kernel="rbf", C=10, gamma='scale',epsilon=0.1) 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('sgd', sgd_reg)])

# SVR and SGDRegressor expect a 1-D target, so flatten the single-column dataframe
voting_reg.fit(train_x, train_y.values.ravel())


In [64]:
#Train RMSE
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 20515.48141095037


In [65]:
#Test RMSE
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 22863.790141797363
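
With default (uniform) weights, a `VotingRegressor` predicts the plain average of its members' predictions. The sketch below verifies this on synthetic data; the three member models are chosen for illustration and differ from the ones used in the notebook.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, unrelated to the jobs data set.
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(50, 2))
toy_y = toy_x[:, 0] - 3.0 * toy_x[:, 1] + rng.normal(scale=0.1, size=50)

toy_voter = VotingRegressor(estimators=[
    ("lin", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ("knn", KNeighborsRegressor(n_neighbors=3)),
]).fit(toy_x, toy_y)

# estimators_ holds the fitted clones of the member models.
member_preds = np.array([est.predict(toy_x) for est in toy_voter.estimators_])
averages_match = np.allclose(toy_voter.predict(toy_x), member_preds.mean(axis=0))
print(averages_match)
```

Because the ensemble is an average, one weak member pulls every prediction toward its own output, so it can be worth checking each member's RMSE separately.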


## A Boosting model: (1 point)

Build either an Adaboost or a GradientBoost model

In [95]:
from sklearn.ensemble import GradientBoostingRegressor

gb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=100, 
                                   learning_rate=0.1, 
                                  tol=0.1, n_iter_no_change=5, validation_fraction=0.2,
                                  verbose=1) 


# Flatten the single-column target to 1-D, which the regressor expects
gb_reg.fit(train_x, train_y.values.ravel())


      Iter       Train Loss   Remaining Time 
         1   812411927.0000            7.58s
         2   794377892.6388            6.10s
         3   777327156.7930            5.56s
         4   757310038.4339            5.29s
         5   743470876.8205            5.11s
         6   727504405.0043            4.96s
         7   715808313.2156            4.84s
         8   704072468.2404            4.75s
         9   690550692.2952            4.67s
        10   679885670.2211            4.59s
        20   585494536.5305            3.93s
        30   520543571.1871            3.41s
        40   472313698.7862            2.93s
        50   431223612.3240            2.44s
        60   397942168.3415            1.96s
        70   367081544.0996            1.47s
        80   342725421.0022            0.98s
        90   319299635.4323            0.49s
       100   299619595.8390            0.00s


In [96]:
#Train RMSE
train_pred = gb_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 18262.083374104182


In [97]:
#Test RMSE
test_pred = gb_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 21883.20019282459
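
The training log above reaches iteration 100, which suggests that early stopping (`n_iter_no_change=5`) did not trigger before the `n_estimators` limit. After fitting, the `n_estimators_` attribute reports how many trees were actually built. A minimal sketch on synthetic data, with a deliberately large limit of 500 trees chosen for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data, unrelated to the jobs data set.
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(200, 4))
toy_y = toy_x[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

toy_gb = GradientBoostingRegressor(
    max_depth=2, n_estimators=500, learning_rate=0.1,
    n_iter_no_change=5, validation_fraction=0.2, random_state=0,
).fit(toy_x, toy_y)

# n_estimators_ is the number of trees kept after early stopping.
print(toy_gb.n_estimators_, "of", toy_gb.n_estimators, "trees were fitted")
```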


## Neural network: (1 point)

In [46]:
from sklearn.neural_network import MLPRegressor

dnn_reg = MLPRegressor(hidden_layer_sizes=(75,50,25),
                       max_iter=1000,
                       early_stopping=True,
                      alpha = 0.2)

# Flatten the single-column target to 1-D, which the regressor expects
dnn_reg.fit(train_x, train_y.values.ravel())


In [47]:
#Train RMSE
train_pred = dnn_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 17084.397611281154


In [48]:
#Test RMSE
test_pred = dnn_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 20538.552635552045


## Grid search (1 point)

Perform either a full or randomized grid search on any model you want. There have to be at least two parameters in the search.

In [116]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'min_samples_leaf': np.arange(1, 15), 
     'max_depth': np.arange(1,5),
    }
  ]

tree_reg = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_reg, param_grid, cv=5, n_iter=5,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x, train_y)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


In [117]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

27701.55039388517 {'min_samples_leaf': 3, 'max_depth': 3}
27647.342580376262 {'min_samples_leaf': 14, 'max_depth': 3}
28536.498978095515 {'min_samples_leaf': 3, 'max_depth': 1}
28536.498978095515 {'min_samples_leaf': 13, 'max_depth': 1}
28536.498978095515 {'min_samples_leaf': 12, 'max_depth': 1}


In [118]:
grid_search.best_params_

{'min_samples_leaf': 14, 'max_depth': 3}
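
The scores printed above are the square root of the mean MSE across folds, which is not quite the same as the mean of the per-fold RMSE values. If the per-fold RMSE is preferred, scikit-learn also offers the `neg_root_mean_squared_error` scorer. The sketch below uses it with the same two parameters on synthetic data; the data and the seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, unrelated to the jobs data set.
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(120, 3))
toy_y = 4.0 * toy_x[:, 0] + rng.normal(scale=0.5, size=120)

toy_search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions={
        "min_samples_leaf": np.arange(1, 15),
        "max_depth": np.arange(1, 5),
    },
    n_iter=5, cv=5, scoring="neg_root_mean_squared_error", random_state=0,
).fit(toy_x, toy_y)

# Scores are negated RMSE values, so flip the sign to report them.
toy_best_rmse = -toy_search.best_score_
print(toy_best_rmse, toy_search.best_params_)
```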

In [119]:
grid_search.best_estimator_

In [120]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 25449.647784476525


In [121]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 26843.455209496227


# Discussion (4 points in total)


## List the train and test values of each model you built (1 point)

## Which model performs the best and why? (0.5 points) 
## How does it compare to baseline? (0.5 points)

Hint: The best model is the one that has the lowest TEST RMSE value (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.

## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? (1 point)

## Is there any overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? (1 point)
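
As a starting point for the questions above, the RMSE values printed in the earlier cells can be gathered into one table. This is a sketch only: the numbers are copied from the outputs above and rounded to two decimals, and the column names `test_minus_train` and `pct_below_baseline` are made up for illustration. A large positive gap between test and train RMSE is one common sign of overfitting, though how large is too large remains a judgment call.

```python
import pandas as pd

# RMSE values copied from the cell outputs above, rounded to two decimals.
results = pd.DataFrame(
    {
        "train_rmse": [29334.59, 23278.91, 20515.48, 18262.08, 17084.40, 25449.65],
        "test_rmse": [28874.21, 26182.36, 22863.79, 21883.20, 20538.55, 26843.46],
    },
    index=[
        "Baseline (mean)",
        "Decision tree (max_depth=4)",
        "Voting regressor",
        "Gradient boosting",
        "Neural network",
        "Decision tree (randomized search)",
    ],
)

baseline_test = results.loc["Baseline (mean)", "test_rmse"]

# Gap between test and train error, and improvement over the baseline.
results["test_minus_train"] = results["test_rmse"] - results["train_rmse"]
results["pct_below_baseline"] = 100 * (1 - results["test_rmse"] / baseline_test)

best_model = results["test_rmse"].idxmin()
print(results.round(2).sort_values("test_rmse"))
print("Lowest test RMSE:", best_model)
```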