# Unit 2 Assessment

In this assignment, we will focus on salary prediction. The data set for this assignment includes information on job postings. Use it to see whether you can predict the salary of a job posting (i.e., the `Salary` column in the data set) from the job description and other attributes of the job. This matters because such a model can recommend a salary as soon as a job posting is entered into a system.

## Description of Variables

The variables are described in "Jobs - Data Dictionary.docx".

## Goal

Use the **jobs_alldata.csv** data set and build models to predict **salary**.

**Be careful: this is a REGRESSION task**

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


## Important hints:

* This assignment requires you to work with a text-based column in addition to regular numeric/categorical columns. So you will have to pay attention to your pipelines during data processing.
* You can do your data prep before or after the train/test split. Regardless, you should use train_test_split only once. If you find yourself using it twice, it means you are doing something wrong.
* Recommended approach: 
    * import the data and perform the train/test split - like we always do. 
    * identify the names of numeric, categorical, feature engineered, and text columns - like we always do
    * create individual pipelines for each type of column - like we always do. For the text pipeline, I would recommend the TF-IDF vectorizer followed by SVD. You can also use the TF-IDF vectorizer with the top N terms (without SVD).
    * combine all pipelines using the column transformer - like we always do 
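
The recommended approach above can be sketched on a small made-up data set. This is only an illustration under stated assumptions: the rows are invented, the column names mirror the assignment, and `n_components=2` is chosen to fit the toy data rather than the real one. Passing the text column name as a plain string (not a list) is one way to hand `TfidfVectorizer` the one-dimensional input it expects.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the jobs data; the rows are made up for illustration.
toy = pd.DataFrame({
    "Salary": [50000, 62000, 71000, 58000, 90000, 83000, 66000, 75000],
    "Job Description": [
        "prepare engineering reports and drawings",
        "manage grant portfolio and budgets",
        "cyber security analyst monitoring threats",
        "inspect sidewalks and file reports",
        "director of regional mental health programs",
        "senior engineer leading design reviews",
        "coordinate business promotion events",
        "audit financial records for the comptroller",
    ],
    "Location": ["HQ", "Remote", "HQ", "Remote",
                 "East campus", "HQ", "Remote", "East campus"],
    "Min_years_exp": [1, 3, 2, 1, 5, 4, 2, 3],
})

# Split once, then fit all preprocessing on the training rows only.
train_set, test_set = train_test_split(toy, test_size=0.25, random_state=0)
train_inputs = train_set.drop(columns=["Salary"])
test_inputs = test_set.drop(columns=["Salary"])

numeric_columns = ["Min_years_exp"]
categorical_columns = ["Location"]
text_column = "Job Description"  # a string, so the vectorizer receives 1-D input

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
text_transformer = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])

# Combine the per-type pipelines into a single preprocessor.
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_columns),
    ("cat", categorical_transformer, categorical_columns),
    ("text", text_transformer, text_column),
])

train_x = preprocessor.fit_transform(train_inputs)
test_x = preprocessor.transform(test_inputs)
print(train_x.shape, test_x.shape)
```

The preprocessor is fitted on the training inputs only and then reused on the test inputs, so both matrices end up with the same number of columns.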

# Section 1: (6 points in total)

## Data Prep (4 points)

In [None]:
#Group Assignment by Shreya Reddy Vurelly and krishnasai Chaluvadi

In [184]:
#importing the data
import numpy as np
import pandas as pd

np.random.seed(99)

In [185]:
jobs = pd.read_csv("jobs_alldata.csv")
jobs.head()

Unnamed: 0,Salary,Job Description,Location,Min_years_exp,Technical,Comm,Travel
0,67206,Civil Service Title: Regional Director Mental ...,Remote,5,2,3,No
1,88313,The New York City Comptroller’s Office Burea...,Remote,5,2,4,10-15 hrs
2,81315,With minimal supervision from the Deputy Commi...,East campus,5,3,3,5-10 hrs
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...,East campus,1,1,3,No
4,55675,Only candidates who are permanent in the Princ...,Southeast campus,1,1,3,5-10 hrs


In [186]:
# Check for missing values

jobs[['Job Description']].isna().sum()

Job Description    0
dtype: int64

In [187]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(jobs, test_size=0.3)

In [188]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

In [189]:
train_y = train_set['Salary']
test_y = test_set['Salary']

train_inputs = train_set.drop(['Salary'], axis=1)
test_inputs = test_set.drop(['Salary'], axis=1)

## Feature Engineering (1 point)

Create one NEW feature from existing data. You can either transform a single variable or create a new variable from existing ones.

Grading: 
- 0.5 points for creating the new feature correctly
- 0.5 points for the justification of the new feature (i.e., why did you create this new feature)

In [190]:
def new_col(df): 
    df_copy = df.copy()
    df_copy['level'] = pd.cut(df_copy['Min_years_exp'],
                              bins=[1, 2, 4, np.inf],  # intervals: [1, 2], (2, 4], (4, inf)
                              labels=False,
                              include_lowest=True,  # makes the first interval include 1
                              ordered=True)
    
    return df_copy[['level']]

In [191]:
new_col(train_set)

Unnamed: 0,level
2127,2
1436,2
603,0
1797,0
17,0
...,...
1092,1
1768,1
1737,1
1209,2


In [192]:
def text_trans(df):
    #Create a copy so that we don't overwrite the existing dataframe
    df1 = df.copy()
    
    # First, convert the dataframe column to a numpy array. Then, call the ravel function to make it one-dimensional
    return np.array(df1['Job Description']).ravel()

In [193]:
text_trans(train_set)

array(['Only Candidates permanent in the Assistant Civil Engineer title or those who can provide proof of successful registration for the December 2018 Open Competitive Exam #9026 may apply.  Failure to do so will result in your disqualification.   The Department of Design and Construction, Division of Infrastructure is seeking Design Engineers. Under the supervision of an Engineer – in - Charge, the selected candidates will prepare contract documents, specifications, and final estimates; engage in engineering investigations; and prepare contract plans and working drawings.  Candidates will also participate in field surveys of existing conditions; prepare reports; engage in engineering reviews and studies; and prepare designs with minimal supervision.',
       '• Under general direction, with wide latitude for independent initiative and judgment, oversee and facilitate grant management from grant award through closeout. Administer a grant project portfolio, including but not limite

In [194]:
train_inputs

Unnamed: 0,Job Description,Location,Min_years_exp,Technical,Comm,Travel
2127,Only Candidates permanent in the Assistant Civ...,Southeast campus,5,4,3,No
1436,"• Under general direction, with wide latitud...",HQ,5,3,2,5-10 hrs
603,About New York City Cyber Command NYC Cyber Co...,East campus,1,1,3,1-5 hrs
1797,The NYC Department of Environmental Protection...,HQ,1,2,1,1-5 hrs
17,The NYC Department of Environmental Protection...,HQ,1,2,3,5-10 hrs
...,...,...,...,...,...,...
1092,The Division of Sidewalk & Inspection Manageme...,Remote,3,1,3,No
1768,The Bureauof Veterinary and Pest Control Servi...,HQ,4,3,3,No
1737,"The Fire Department, City of New York (FDNY), ...",East campus,4,2,3,1-5 hrs
1209,***PLEASE NOTE APPLICANTS MUST BE PERMANENT IN...,Remote,5,2,3,5-10 hrs


In [195]:
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

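# Note: 'Job Description' is also an object column, so it ends up in this list
# and is one-hot encoded in addition to being vectorized by the text pipeline.
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()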

text_column = ['Job Description']

feat_eng_columns = ['Min_years_exp']

In [196]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [197]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [198]:
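# These two classes are used below but were not imported earlier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

number_svd_components = 300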
text_transformer = Pipeline(steps=[
                ('my_new_column', FunctionTransformer(text_trans)),
                ('text', TfidfVectorizer(stop_words='english')),
                ('svd', TruncatedSVD(n_components=number_svd_components, n_iter=10))
            ])

In [199]:
feat_column = Pipeline(steps=[('feat_column', FunctionTransformer(new_col)),
                               ('scaler', StandardScaler())])

In [200]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('trans', feat_column, feat_eng_columns),
        ('text', text_transformer, text_column)],
        remainder='passthrough')

In [201]:
train_x = preprocessor.fit_transform(train_inputs)

train_x

train_x.shape

(1689, 1502)

In [202]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

test_x.shape

(724, 1502)

## Find the Baseline (1 point)

In [203]:
from sklearn.dummy import DummyRegressor

dummy_regr = DummyRegressor(strategy="mean")

dummy_regr.fit(train_x, train_y)

In [204]:
from sklearn.metrics import mean_squared_error

In [205]:
#Baseline Train RMSE
dummy_train_pred = dummy_regr.predict(train_x)

baseline_train_mse = mean_squared_error(train_y, dummy_train_pred)

baseline_train_rmse = np.sqrt(baseline_train_mse)

print('Baseline Train RMSE: {}'.format(baseline_train_rmse))

Baseline Train RMSE: 29127.250205690703


In [206]:
#Baseline Test RMSE
dummy_test_pred = dummy_regr.predict(test_x)

baseline_test_mse = mean_squared_error(test_y, dummy_test_pred)

baseline_test_rmse = np.sqrt(baseline_test_mse)

print('Baseline Test RMSE: {}'.format(baseline_test_rmse))

Baseline Test RMSE: 29358.248255682094
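
The same predict, MSE, square-root sequence is repeated for every model in the sections that follow. A small helper could keep those cells shorter. The sketch below is a suggestion rather than part of the assignment: the name `rmse` is hypothetical, and the data is made up only to show the helper running.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error


def rmse(model, inputs, targets):
    """Return the root mean squared error of a fitted model on (inputs, targets)."""
    predictions = model.predict(inputs)
    return float(np.sqrt(mean_squared_error(targets, predictions)))


# Made-up data, only to show the helper running end to end.
toy_x = np.zeros((4, 1))
toy_y = np.array([10.0, 20.0, 30.0, 40.0])

# A mean-strategy baseline always predicts the training mean (25.0 here).
baseline = DummyRegressor(strategy="mean").fit(toy_x, toy_y)
baseline_rmse = rmse(baseline, toy_x, toy_y)
print("Baseline RMSE: {}".format(baseline_rmse))
```

With the notebook's own variables, the call would look like `rmse(dummy_regr, test_x, test_y)`.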


# Section 2: (5 points in total)

Build the following models:


## Decision Tree: (1 point)

In [210]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=5) 

tree_reg.fit(train_x, train_y)

In [211]:
#Train RMSE
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 23435.392678789485


In [212]:
#Test RMSE
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 26416.54627003106


In [222]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=10) 

tree_reg.fit(train_x, train_y)

In [223]:
#Train RMSE
train_pred = tree_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 14549.544722688199


In [224]:
#Test RMSE
test_pred = tree_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 24524.171127224614


## Voting regressor (1 point):

The voting regressor should have at least 3 individual models

In [213]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor
from sklearn.metrics import mean_squared_error


dtree_reg = DecisionTreeRegressor(max_depth=20)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), 
                        ('svr', svm_reg), 
                        ('sgd', sgd_reg)])

voting_reg.fit(train_x, train_y)



In [214]:
#Train RMSE
train_pred = voting_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

#Test RMSE
test_pred = voting_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Train RMSE: 10868.659644426523
Test RMSE: 19898.73673248899


## A Boosting model: (1 point)

Build either an AdaBoost or a Gradient Boosting model.

In [218]:
#Use GradientBoosting

from sklearn.ensemble import GradientBoostingRegressor

gb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=100, learning_rate=0.1) 

gb_reg.fit(train_x, train_y)

In [219]:
#Train RMSE
train_pred = gb_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

#Test RMSE
test_pred = gb_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Train RMSE: 18128.64026194977
Test RMSE: 22788.72432265877


## Neural network: (1 point)

In [231]:
from sklearn.neural_network import MLPRegressor
dnn_reg = MLPRegressor(hidden_layer_sizes=(500,500,500,500,500),
                       max_iter=5000)

dnn_reg.fit(train_x, train_y)

In [232]:
#Train RMSE
train_pred = dnn_reg.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

#Test RMSE
test_pred = dnn_reg.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Train RMSE: 199.33029137552657
Test RMSE: 17451.40409445141


## Grid search (1 point)

Perform either a full or randomized grid search on any model you want. The search must cover at least two parameters.

In [166]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'min_samples_leaf': np.arange(1, 30), 
     'max_depth': np.arange(1,30)}
  ]

tree_reg = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_reg, param_grid, cv=5, n_iter=10,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x, train_y)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [167]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

27733.385316350046 {'min_samples_leaf': 12, 'max_depth': 4}
27169.01841877581 {'min_samples_leaf': 22, 'max_depth': 18}
27649.282290648054 {'min_samples_leaf': 8, 'max_depth': 3}
27276.98216012597 {'min_samples_leaf': 21, 'max_depth': 9}
27329.347371430194 {'min_samples_leaf': 2, 'max_depth': 3}
27286.026236982907 {'min_samples_leaf': 4, 'max_depth': 22}
27426.442211121408 {'min_samples_leaf': 9, 'max_depth': 11}
27391.85448427366 {'min_samples_leaf': 25, 'max_depth': 10}
27780.901938353883 {'min_samples_leaf': 13, 'max_depth': 29}
27280.130342183307 {'min_samples_leaf': 9, 'max_depth': 13}


In [168]:
grid_search.best_params_

{'min_samples_leaf': 22, 'max_depth': 18}

In [169]:
grid_search.best_estimator_

In [170]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(train_x)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 18751.94910533815


In [171]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(test_x)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 27860.12112991056


# Discussion (4 points in total)


## List the train and test values of each model you built (1 point)

## Which model performs the best and why? (0.5 points) 
## How does it compare to baseline? (0.5 points)

Hint: The best model is the one that has the lowest TEST RMSE value (regardless of any of the training values). If you select your model based on TRAIN values, you will lose points.
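
One way to apply this hint is to collect every model's scores in a single table and rank on the test column only. The sketch below assumes the RMSE values printed in the cells of this notebook (rounded to two decimals); the model labels are informal names chosen here, not identifiers from the code.

```python
import pandas as pd

# Train/test RMSE values copied (rounded) from the cell outputs in this notebook.
results = pd.DataFrame(
    [
        ("Baseline (mean)", 29127.25, 29358.25),
        ("Decision tree (max_depth=5)", 23435.39, 26416.55),
        ("Decision tree (max_depth=10)", 14549.54, 24524.17),
        ("Voting regressor", 10868.66, 19898.74),
        ("Gradient boosting", 18128.64, 22788.72),
        ("Neural network", 199.33, 17451.40),
        ("Randomized search tree", 18751.95, 27860.12),
    ],
    columns=["model", "train_rmse", "test_rmse"],
)

# Rank on TEST RMSE only; the training column is kept for reference.
ranked = results.sort_values("test_rmse").reset_index(drop=True)
best = ranked.iloc[0]

# Compare the best test RMSE against the baseline test RMSE.
baseline_test = results.loc[results["model"] == "Baseline (mean)", "test_rmse"].iloc[0]
improvement_pct = 100 * (baseline_test - best["test_rmse"]) / baseline_test

print(ranked)
print("Best model by test RMSE: {}".format(best["model"]))
print("Reduction vs. baseline test RMSE: {:.1f}%".format(improvement_pct))
```

Sorting on `test_rmse` keeps the selection independent of the training scores, which is what the hint asks for.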

## Is there any evidence of overfitting in the best model, why or why not? If there is, what did you do about it? (1 point)

## Is there any overfitting in the other models (besides the best model), why or why not? If there is, what did you do about it? (1 point)
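
A simple way to look for overfitting is to compare each model's test RMSE with its train RMSE. The sketch below computes that ratio for two of the models in this notebook, using their printed RMSE values (rounded). The function name `overfit_ratio` is hypothetical, and how large a ratio counts as overfitting is a judgment call rather than a fixed rule.

```python
def overfit_ratio(train_rmse, test_rmse):
    """Return test RMSE divided by train RMSE; values far above 1 suggest overfitting."""
    return test_rmse / train_rmse


# (train RMSE, test RMSE) pairs copied (rounded) from two cells in this notebook.
reported = {
    "Decision tree (max_depth=5)": (23435.39, 26416.55),
    "Neural network": (199.33, 17451.40),
}

ratios = {name: overfit_ratio(train, test) for name, (train, test) in reported.items()}

for name, ratio in ratios.items():
    print("{}: test RMSE is {:.2f}x train RMSE".format(name, ratio))
```

A ratio close to 1 means the model does about as well on unseen data as on the training data, while a very large ratio means the training error says little about test performance.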