Building a salary prediction system. The data set for this assignment includes information on job descriptions and salaries. Use this data set to see if you can predict the salary of a job posting (i.e., the `Salary` column in the data set) based on the job description. This is important, because this model can make a salary recommendation as soon as a job description is entered into a system.

## Description of Variables

The description of variables are provided in "Jobs - Data Dictionary.docx"

## Goal

Use the **jobs_alldata.csv** data set and build models to predict **salary**.

## Data Prep

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mp

In [2]:
saldf = pd.read_csv('jobs_alldata.csv')

In [3]:
saldf.head(5)

Unnamed: 0,Salary,Job Description,Location,Min_years_exp,Technical,Comm,Travel
0,67206,Civil Service Title: Regional Director Mental ...,Remote,5,2,3,0
1,88313,The New York City Comptrollerâ€™s Office Burea...,Remote,5,2,4,10-15
2,81315,With minimal supervision from the Deputy Commi...,East campus,5,3,3,5-10
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...,East campus,1,1,3,0
4,55675,Only candidates who are permanent in the Princ...,Southeast campus,1,1,3,5-10


## Assign the "target" variable

In [4]:
target = saldf['Salary']

# Text Mining

## Assign the "text" (input) variable

In [5]:
# Check for missing values

saldf[['Job Description']].isna().sum()

Job Description    0
dtype: int64

In [6]:
input_data = saldf['Job Description']

## Split the data

In [7]:
from sklearn.model_selection import train_test_split

train_set, test_set, train_y, test_y = train_test_split(input_data, target, test_size=0.3, random_state=42)

In [8]:
train_set.shape, train_y.shape

((1689,), (1689,))

In [9]:
test_set.shape, test_y.shape

((724,), (724,))

## Sklearn: Text preparation

We need to prepare the text data. We'll use sklearn's CountVectorizer, which counts the frequency of words that appear in your entire data set.<br>
CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [10]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english')

train_x_tr = tfidf_vect.fit_transform(train_set)

In [11]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data 
# features will be very different and match the train set!!!

test_x_tr = tfidf_vect.transform(test_set)


In [12]:
train_x_tr.shape, test_x_tr.shape

((1689, 9914), (724, 9914))

In [13]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
train_x_tr

<1689x9914 sparse matrix of type '<class 'numpy.float64'>'
	with 250443 stored elements in Compressed Sparse Row format>

In [14]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
train_x_tr.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.02913336, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

## Latent Semantic Analysis (Singular Value Decomposition)

In [15]:
from sklearn.decomposition import TruncatedSVD

In [16]:
#If you are performing Latent Semantic Analysis, recommended number of components is 100

svd = TruncatedSVD(n_components=300, n_iter=10)

In [17]:
train_x_lsa = svd.fit_transform(train_x_tr)

In [18]:
train_x_lsa.shape

(1689, 300)

In [19]:
train_x_lsa

array([[ 0.24720598, -0.20327125,  0.11489911, ..., -0.02279996,
         0.00647618, -0.06844994],
       [ 0.17294821, -0.13074886,  0.00477585, ...,  0.01940549,
        -0.04161913, -0.010966  ],
       [ 0.5877761 ,  0.365734  ,  0.13443229, ..., -0.01437579,
         0.0099372 , -0.00641932],
       ...,
       [ 0.13385776, -0.10605211, -0.04018447, ..., -0.01465256,
         0.02430085,  0.00872729],
       [ 0.15340621, -0.12353644, -0.04283056, ..., -0.02456958,
        -0.02298649, -0.04267142],
       [ 0.21722752, -0.05642434,  0.31527024, ..., -0.00661978,
        -0.02403752, -0.04506847]])

### Let's transform the test data set

In [20]:
test_x_lsa = svd.transform(test_x_tr)

In [21]:
test_x_lsa.shape

(724, 300)

## Processing colums other than text

In [22]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer

In [23]:
nontextdf = pd.read_csv('jobs_alldata.csv')
nontextdf.head()

Unnamed: 0,Salary,Job Description,Location,Min_years_exp,Technical,Comm,Travel
0,67206,Civil Service Title: Regional Director Mental ...,Remote,5,2,3,0
1,88313,The New York City Comptrollerâ€™s Office Burea...,Remote,5,2,4,10-15
2,81315,With minimal supervision from the Deputy Commi...,East campus,5,3,3,5-10
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...,East campus,1,1,3,0
4,55675,Only candidates who are permanent in the Princ...,Southeast campus,1,1,3,5-10


# Split the data into train and test

In [24]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(saldf, test_size=0.3)

# Dropping the variables we can't use 

In [25]:
train = train_set.drop(['Job Description', 'Travel'], axis=1)
test = test_set.drop(['Job Description', 'Travel'], axis=1)
train.head()

Unnamed: 0,Salary,Location,Min_years_exp,Technical,Comm
910,65141,HQ,4,2,4
1291,51458,Southeast campus,5,2,2
491,62410,HQ,5,1,3
2104,100548,HQ,2,5,2
1869,86242,East campus,5,3,2


## Separate the target variable (we don't want to transform it)

In [26]:
train_inputs = train.drop(['Salary'], axis=1)
test_inputs = test.drop(['Salary'], axis=1)

## Feature Engineering

## Here i didnot transform a column because i could not see a pattern to transform a single column or combine 2 columns as it should not be done without a reason. But have done text mining and one hot encoding for categorical variables


In [27]:
# Identify the numerical columns
numeric_columns = ['Min_years_exp','Technical','Comm']

# Identify the categorical columns
categorical_columns = ['Location']

# Pipeline

In [28]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [29]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [30]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns)],
        remainder='passthrough')

#passtrough is an optional step. You don't have to use it.

# Transform: fit_transform() for TRAIN

In [31]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[ 0.57095113, -0.22153976,  0.98746763, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.12548619, -0.22153976, -1.28015008, ...,  0.        ,
         1.        ,  0.        ],
       [ 1.12548619, -1.04754561, -0.14634123, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 1.12548619, -1.04754561,  0.98746763, ...,  1.        ,
         0.        ,  0.        ],
       [-1.09265404,  0.6044661 , -1.28015008, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.12548619,  0.6044661 , -2.41395893, ...,  0.        ,
         0.        ,  0.        ]])

In [32]:
train_x.shape

(1689, 8)

# Tranform: transform() for TEST

In [33]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[ 0.57095113,  0.6044661 ,  0.98746763, ...,  0.        ,
         0.        ,  0.        ],
       [-1.09265404, -1.04754561, -0.14634123, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.57095113, -1.04754561, -0.14634123, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 0.57095113, -1.04754561, -0.14634123, ...,  0.        ,
         0.        ,  0.        ],
       [-1.09265404,  0.6044661 , -0.14634123, ...,  1.        ,
         0.        ,  0.        ],
       [ 1.12548619,  0.6044661 , -1.28015008, ...,  0.        ,
         0.        ,  0.        ]])

In [34]:
test_x.shape

(724, 8)

## Combining Text mining data and non-text columns processed data

In [35]:
train_x_cb=np.concatenate((train_x_lsa,train_x),axis=1) 

In [36]:
train_x_cb

array([[ 0.24720598, -0.20327125,  0.11489911, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.17294821, -0.13074886,  0.00477585, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.5877761 ,  0.365734  ,  0.13443229, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.13385776, -0.10605211, -0.04018447, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.15340621, -0.12353644, -0.04283056, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.21722752, -0.05642434,  0.31527024, ...,  0.        ,
         0.        ,  0.        ]])

In [37]:
train_x_cb.shape

(1689, 308)

In [38]:
test_x_cb=np.concatenate((test_x_lsa,test_x),axis=1) 

In [39]:
test_x_cb

array([[ 0.22442943, -0.15304556, -0.07083585, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.18129172, -0.14482415, -0.05639426, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.25478046, -0.19879586, -0.07065608, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 0.27755586, -0.18287792,  0.03599143, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.14498797, -0.11293755, -0.07782072, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.26692051, -0.08935042,  0.44355519, ...,  0.        ,
         0.        ,  0.        ]])

In [40]:
test_x_cb.shape

(724, 308)

## Find the Baseline

In [41]:
from sklearn.metrics import mean_squared_error

In [42]:
#First find the average value of the target

mean_value = np.mean(train_y)

mean_value

78566.0307874482

In [43]:
# Predict all values as the mean

baseline_pred = np.repeat(mean_value, len(test_y))

In [44]:
baseline_mse = mean_squared_error(test_y, baseline_pred)

baseline_rmse = np.sqrt(baseline_mse)

print('Baseline RMSE: {}' .format(baseline_rmse))

Baseline RMSE: 28294.892856870818


# Explore the SVDs - OPTIONAL

In [45]:
svd.explained_variance_.sum()

0.6686490638820455

In [46]:
#These are the all the components:
svd.components_

array([[ 6.88496587e-04,  8.57475540e-02,  9.35138953e-05, ...,
         2.58331510e-04,  2.64894838e-04,  4.74548918e-04],
       [-4.24457233e-04,  9.21278086e-02, -8.70152800e-05, ...,
        -5.37307789e-04,  4.90695772e-04, -9.88739389e-04],
       [-5.92600177e-04, -6.23455368e-02, -1.68140617e-04, ...,
        -3.77315274e-04, -3.92896112e-04, -5.15104924e-04],
       ...,
       [ 1.54880303e-02,  8.39712028e-03,  1.51191855e-03, ...,
         6.02299529e-03,  1.87934402e-03, -2.08995908e-03],
       [-2.36825972e-02, -6.73228937e-03,  2.67184886e-05, ...,
         3.42238707e-03, -3.90742033e-03, -3.59153477e-03],
       [-1.66974419e-02, -9.74245930e-03,  4.03521347e-03, ...,
        -3.84009561e-03,  5.06943722e-04,  9.95543406e-03]])

In [47]:
svd.components_.shape

(300, 9914)

In [48]:
#Let's select the first component:

first_component = svd.components_[0,:]

In [49]:
# Sort the weights in the first component, and get the indeces

indeces = np.argsort(first_component).tolist()

In [50]:
#Let's get the feature names from the count vectorizer:
feat_names = tfidf_vect.get_feature_names_out()

In [51]:
#Print the last 10 terms (i.e., the 10 terms that have the highest weigths)

for index in indeces[-10:]:
    print(feat_names[index], "\t\tweight =", first_component[index])

bureau 		weight = 0.10717931879805626
management 		weight = 0.10960765437645147
new 		weight = 0.12230386159112756
design 		weight = 0.12986195254896962
city 		weight = 0.1360852112302988
project 		weight = 0.13846106935768068
dep 		weight = 0.14965285958788005
construction 		weight = 0.15479839429867367
wastewater 		weight = 0.15841529884565272
water 		weight = 0.26425044968095424


## Decision Tree:

In [52]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

tree_reg = DecisionTreeRegressor(max_depth = 10, max_features = 2) 

tree_reg.fit(train_x_cb, train_y)

DecisionTreeRegressor(max_depth=10, max_features=2)

In [53]:
#Train RMSE
train_pred = tree_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 18462.013568912105


In [54]:
#Test RMSE
test_pred = tree_reg.predict(test_x_cb)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 25393.464381007892


## Voting regressor

In [55]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(max_depth=20)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), ('svr', svm_reg), ('sgd', sgd_reg)],
                        weights=[0.6, 0.2, 0.2])

voting_reg.fit(train_x_cb, train_y)

VotingRegressor(estimators=[('dt', DecisionTreeRegressor(max_depth=20)),
                            ('svr', SVR(C=10, epsilon=0.01)),
                            ('sgd', SGDRegressor(max_iter=10000))],
                weights=[0.6, 0.2, 0.2])

In [56]:
#Train RMSE
train_pred = voting_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 8778.619007225168


In [57]:
#Test RMSE
test_pred = voting_reg.predict(test_x_cb)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 18247.085889126734


## Regularization

In [58]:
#Let's restrict the depth in decision tree
dtree_reg = DecisionTreeRegressor(min_samples_leaf = 10, max_depth=5)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), ('svr', svm_reg), ('sgd', sgd_reg)],
                        weights=[0.6, 0.2, 0.2])


voting_reg.fit(train_x_cb, train_y)


VotingRegressor(estimators=[('dt',
                             DecisionTreeRegressor(max_depth=5,
                                                   min_samples_leaf=10)),
                            ('svr', SVR(C=10, epsilon=0.01)),
                            ('sgd', SGDRegressor(max_iter=10000))],
                weights=[0.6, 0.2, 0.2])

In [59]:
#Train RMSE
train_pred = voting_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 21575.92658609869


In [60]:
#Test RMSE
test_pred = voting_reg.predict(test_x_cb)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 23269.315377788254


In [61]:
for reg in (dtree_reg, svm_reg, sgd_reg, voting_reg):
    reg.fit(train_x_cb, train_y)
    test_y_pred = reg.predict(test_x_cb)
    print(reg.__class__.__name__, 'Test rmse=', np.sqrt(mean_squared_error(test_y, test_y_pred)))

DecisionTreeRegressor Test rmse= 25219.498646091455
SVR Test rmse= 28420.40182338191
SGDRegressor Test rmse= 20576.993242733573
VotingRegressor Test rmse= 23237.211104937775


In [62]:
#Voting regressor model with weights 

In [63]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor 
from sklearn.svm import SVR 
from sklearn.ensemble import VotingRegressor


dtree_reg = DecisionTreeRegressor(max_depth=20)
svm_reg = SVR(kernel="rbf", C=10, epsilon=0.01, gamma='scale') 
sgd_reg = SGDRegressor(max_iter=10000, tol=1e-3)

voting_reg = VotingRegressor(
            estimators=[('dt', dtree_reg), ('svr', svm_reg), ('sgd', sgd_reg)],
                        weights=[0.33, 0.33, 0.34])

voting_reg.fit(train_x_cb, train_y)

VotingRegressor(estimators=[('dt', DecisionTreeRegressor(max_depth=20)),
                            ('svr', SVR(C=10, epsilon=0.01)),
                            ('sgd', SGDRegressor(max_iter=10000))],
                weights=[0.33, 0.33, 0.34])

In [64]:
#Train RMSE
train_pred = voting_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 14684.737208900684


In [65]:
#Test RMSE
test_pred = voting_reg.predict(test_x_cb)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 19280.543397683497


In [66]:
#Voting Regressor model with Bagging

In [67]:
from sklearn.ensemble import BaggingRegressor


#If you want to do pasting, change "bootstrap=False"
#n_jobs=-1 means use all CPU cores
#bagging automatically performs soft voting

bag_reg = BaggingRegressor( 
            SGDRegressor(), n_estimators=50, 
            max_samples=1000, bootstrap=True, n_jobs=-1) 

bag_reg.fit(train_x_cb, train_y)

BaggingRegressor(base_estimator=SGDRegressor(), max_samples=1000,
                 n_estimators=50, n_jobs=-1)

In [68]:
#Train RMSE
train_pred = bag_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 20650.125753723147


In [69]:
#Test RMSE
test_pred = bag_reg.predict(test_x_cb)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 22031.26868836874


## A Boosting model


In [70]:
#AdaBoosting Model

In [71]:
from sklearn.ensemble import AdaBoostRegressor 

#Create Adapative Boosting with Decision Stumps (depth=10)
ada_reg = AdaBoostRegressor( 
            DecisionTreeRegressor(max_depth=10), n_estimators=500, 
            learning_rate=0.1) 

ada_reg.fit(train_x_cb, train_y)

AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=10),
                  learning_rate=0.1, n_estimators=500)

In [72]:
#Train RMSE
train_pred = ada_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 2900.8903801670617


In [73]:
#Test RMSE
test_pred = ada_reg.predict(test_x_cb)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 15465.78556942667


# Early Stopping

In [None]:
#Notice that learning rate and tol are high to see early stopping
ada_reg = AdaBoostRegressor( 
            DecisionTreeRegressor(max_depth=5), n_estimators=300, 
           learning_rate=1) 

ada_reg.fit(train_x_cb, train_y)

In [None]:
#Train RMSE
train_pred = ada_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = ada_reg.predict(test_x_cb)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

## Neural network

In [None]:
from sklearn.neural_network import MLPRegressor

#Default settings create 1 hidden layer with 100 neurons
mlp_reg = MLPRegressor(hidden_layer_sizes=(100,), max_iter = 5000)
mlp_reg.fit(train_x_cb, train_y)

In [None]:
#Train RMSE
train_pred = mlp_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
mlp_pred = mlp_reg.predict(test_x_cb)

mlp_mse = mean_squared_error(test_y, mlp_pred)

mlp_rmse = np.sqrt(mlp_mse)

print('MLP RMSE: {}' .format(mlp_rmse))

In [None]:
dnn_reg = MLPRegressor(hidden_layer_sizes=(100, 100, 100), max_iter = 10000, tol = 0.0001, early_stopping = True)
dnn_reg.fit(train_x_cb, train_y)

In [None]:
train_pred = dnn_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
dnn_pred = dnn_reg.predict(test_x_cb)

dnn_mse = mean_squared_error(test_y, dnn_pred)

dnn_rmse = np.sqrt(dnn_mse)

print('DNN RMSE: {}' .format(dnn_rmse))

In [None]:
#Default settings create 1 hidden layer with 100 neurons
mlp_reg = MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000)

mlp_reg.fit(train_x_cb, train_y)

In [None]:
#Train RMSE
train_pred = mlp_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = mlp_reg.predict(test_x_cb)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

In [None]:
dnn_reg = MLPRegressor(hidden_layer_sizes=(50,50,50,50,50),
                       max_iter=1000)

dnn_reg.fit(train_x_cb, train_y)

In [None]:
#Train RMSE
train_pred = dnn_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = dnn_reg.predict(test_x_cb)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

In [None]:
dnn_reg = MLPRegressor(hidden_layer_sizes=(50,50,50),
                       max_iter=1000,
                       early_stopping=True,
                      alpha = 0.1)

dnn_reg.fit(train_x_cb, train_y)

In [None]:
#Train RMSE
train_pred = dnn_reg.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = dnn_reg.predict(test_x_cb)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

## Grid search


In [None]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'min_samples_leaf': np.arange(10, 30), 
     'max_depth': np.arange(10,30)}
  ]

tree_reg = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_reg, param_grid, cv=5, n_iter=10,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x_cb, train_y)

In [None]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

In [None]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(train_x_cb)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

In [None]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(test_x_cb)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))