# Modelling

In this notebook we explore several model families: linear models, boosting models, and ensemble models.

scikit-learn's `Pipeline` is used for data pre-processing and model training.

scikit-learn's `GridSearchCV` is used to find the hyperparameters that best fit each model.

In [1]:
# imports
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV


from utils import data_preprocessor, sep_columns_from_desc, findCorrelation, cat_correlation

# Load Data

In [2]:
# Load the data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print("Dimensions of train: {}".format(train_df.shape))
print("Dimensions of test: {}".format(test_df.shape))

Dimensions of train: (1460, 81)
Dimensions of test: (1459, 80)


# Train and Validation splits

Prepare the data by separating the target (`SalePrice`) from the features, and exclude columns that do not contribute to the model:
- `Id` - identifier column
- `GarageYrBlt` - per the EDA, this is redundant with `GarageArea`

Split the data into training and validation sets.

Identify the categorical and numerical columns and create a preprocessor from them.
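`data_preprocessor` comes from the local `utils` module and is not shown in this notebook. As a rough sketch only, a preprocessor of this kind could be built on scikit-learn's `ColumnTransformer`. The function name `build_preprocessor`, the imputation strategies, and the one-hot encoding are assumptions, not the actual implementation; the step names `num` and `imputer` are chosen to match the commented-out `preprocessor__num__imputer__strategy` key that appears in the parameter cell later on.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_preprocessor(cat_cols, num_cols, rm_cols):
    """Hypothetical stand-in for utils.data_preprocessor (assumed behaviour)."""
    # Keep only the columns that are not scheduled for removal
    rm = set(rm_cols)
    cat_keep = [c for c in cat_cols if c not in rm]
    num_keep = [c for c in num_cols if c not in rm]

    num_pipe = Pipeline([("imputer", SimpleImputer(strategy="median"))])
    cat_pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    # Columns not listed in either transformer are dropped (remainder="drop")
    return ColumnTransformer(
        [("num", num_pipe, num_keep), ("cat", cat_pipe, cat_keep)],
        remainder="drop",
    )


# Tiny placeholder frame, not the housing data
demo = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450.0, np.nan, 11250.0],
    "Street": ["Pave", "Grvl", "Pave"],
})
pre = build_preprocessor(cat_cols=["Street"], num_cols=["Id", "LotArea"], rm_cols=["Id"])
demo_out = pre.fit_transform(demo)
print(demo_out.shape)
```

Dropping columns inside the transformer, rather than from the DataFrame, means the same raw frame can be passed at training and prediction time.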

In [3]:
# data preprocessor
cat_cols, num_cols= sep_columns_from_desc(filename='data_description.txt',
                                          data_cols=train_df.columns)

# remove columns that do not contribute to model
no_use_cols = ['Id', 'GarageYrBlt']

# remove columns with high correlation
rm_num_cols = findCorrelation(train_df[num_cols].corr(), cutoff=0.7)
rm_cat_cols = cat_correlation(train_df, cat_cols)

# combine all columns to remove
rm_cols = rm_cat_cols + rm_num_cols + no_use_cols

preprocessor = data_preprocessor(cat_cols, num_cols, rm_cols)
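`findCorrelation` is also defined in `utils` and not shown here. A simplified sketch of one possible rule, assuming the helper returns a list of column names: flag any column whose absolute correlation with an earlier column exceeds the cutoff. The name `find_correlated_columns` and the selection rule are illustrative; the real helper may choose which column of a pair to drop differently.

```python
import numpy as np
import pandas as pd


def find_correlated_columns(corr, cutoff=0.7):
    """Hypothetical simplified stand-in for utils.findCorrelation."""
    abs_corr = corr.abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
    upper = abs_corr.where(mask)
    # Flag a column when it correlates above the cutoff with an earlier column
    return [col for col in upper.columns if (upper[col] > cutoff).any()]


# Synthetic data: "a_copy" is nearly collinear with "a", "b" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})
to_drop = find_correlated_columns(demo.corr(), cutoff=0.7)
print(to_drop)
```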

In [4]:
# store the prediction target in a separate variable
y = train_df['SalePrice']
# remove the target column from the training features
train_df = train_df.drop('SalePrice', axis=1)

# # remove columns that do not contribute to the model
# rm_cols = ['Id', 'GarageYrBlt']

# train_df = train_df.drop(rm_cols, axis=1)

# split data into training and validation
X_train, X_valid, y_train, y_valid = train_test_split(train_df, y, train_size=0.8, test_size=0.2, random_state=0)
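With 1,460 training rows, an 80/20 split should leave 1,168 rows for training and 292 for validation. A quick check on a placeholder frame of the same length (synthetic values, not the housing data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with the same number of rows as train.csv
X_demo = pd.DataFrame({"feature": np.arange(1460)})
y_demo = pd.Series(np.arange(1460), name="target")

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0
)
print(X_tr.shape, X_va.shape)
```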

# Tuning Hyper Parameters with GridSearchCV 

## Linear Models

Set up the grid search for tuning the hyperparameters. The following algorithms are compared to find the one that works best:
1) Linear Regression
2) Ridge Regression
3) Lasso Regression
4) Elastic Net
5) Random Forest
6) XGBoost

In [5]:
# Initialize the regression model
model1 = LinearRegression()
model2 = Ridge(alpha=0.1)
model3 = Lasso(alpha=0.1)
model4 = ElasticNet(alpha=0.1, l1_ratio=0.5)
model5 = RandomForestRegressor(random_state=0, n_jobs=-1)
model6 = XGBRegressor(n_estimators=1000, learning_rate=0.05)

Create a parameter dictionary for each model, holding the model and its hyperparameter grid

In [6]:
# Create parameter dictionary
param1 = {}
# param1['preprocessor__num__imputer__strategy'] = ['mean', 'median']
param1['model__fit_intercept'] = [True, False]
param1['model'] = [model1]

param2 = {}
param2['model__alpha'] = [0.1, 0.25, 0.5, 0.75, 1]
param2['model'] = [model2]

param3 = {}
param3['model__alpha'] = [0.1, 0.25, 0.5, 0.75, 1]
param3['model'] = [model3]

param4 = {}
param4['model__alpha'] = [0.1, 0.25, 0.5, 0.75, 1]
param4['model__l1_ratio'] = [0.1, 0.25, 0.5, 0.75, 0.9]
param4['model'] = [model4]

param5 = {}
param5['model__n_estimators'] = [50, 100, 200]
param5['model__max_depth'] = [5, 10, 20]
param5['model__criterion']=['squared_error', 'absolute_error', 'friedman_mse', 'poisson']
param5['model'] = [model5]

param6 = {}
param6['model__max_depth'] = [5, 10, 20]
param6['model__learning_rate'] = [0.01, 0.05, 0.1]
param6['model__eval_metric'] = [mean_absolute_error, mean_squared_error]
param6['model'] = [model6]


In [7]:
# Create a pipeline with 'model1' as a placeholder 'model' step; the grid search swaps in each candidate model
pipe = Pipeline([('preprocessor', preprocessor), ('model', model1)])
params = [param1, param2, param3, param4, param5, param6]
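When `GridSearchCV` receives a list of dictionaries, it searches each dictionary as a separate grid, and a `'model'` key replaces the pipeline step of that name with the listed estimator. A minimal sketch of the same pattern on synthetic data; the `demo_*` names and the `StandardScaler` step are illustrative and not part of this notebook:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The 'model' step is a placeholder; each grid below replaces it
demo_pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

demo_params = [
    {"model": [Ridge()], "model__alpha": [0.1, 1.0]},
    {"model": [Lasso()], "model__alpha": [0.1, 1.0]},
]

demo_gs = GridSearchCV(demo_pipe, demo_params, cv=5)
demo_gs.fit(X_demo, y_demo)

# 2 estimators x 2 alpha values = 4 candidates in total
n_candidates = len(demo_gs.cv_results_["params"])
best_model_name = type(demo_gs.best_params_["model"]).__name__
print(n_candidates, best_model_name)
```

Because the grids are independent, hyperparameters such as `model__l1_ratio` only need to appear in the dictionary of the estimator that accepts them.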

In [8]:
pipe.fit(X_train,y_train)
preds = pipe.predict(X_valid)
print("Mean Squared Error: {}".format(mean_squared_error(y_valid, preds)))

Mean Squared Error: 2962417664.434398


In [9]:
# Grid search CV
# n_jobs=-1 means use all processors available on the machine
gs = GridSearchCV(pipe, params, cv=25, n_jobs=-1) 

In [10]:
# Fit the model
gs.fit(X_train, y_train)



In [11]:
# pd.DataFrame(gs.cv_results_)
gs.best_score_


0.8520503640174866
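`best_score_` is the mean cross-validated score of the best candidate. Since no `scoring` argument is passed to `GridSearchCV` above, it falls back to the estimator's own `score` method, which is R² for regressors, so the value above is an R² and not an error. A sketch on synthetic data of how an explicit error metric could be requested instead (the data and the `Ridge` grid are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
grid = {"alpha": [0.1, 1.0, 10.0]}

# Default scoring: the estimator's own .score(), which is R^2 for regressors
r2_search = GridSearchCV(Ridge(), grid, cv=5).fit(X_demo, y_demo)

# Explicit scoring: negated MSE, so that "greater is better" still holds
mse_search = GridSearchCV(
    Ridge(), grid, cv=5, scoring="neg_mean_squared_error"
).fit(X_demo, y_demo)

print(r2_search.best_score_)    # R^2, at most 1.0
print(-mse_search.best_score_)  # cross-validated MSE in squared target units
```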

In [12]:
# predict once and reuse the result
valid_preds = gs.predict(X_valid)
pd.DataFrame({'actual': y_valid, 'predicted': valid_preds, 'error': y_valid - valid_preds})


      actual      predicted          error
529   200624  219161.531250  -18537.531250
491   133000  156661.328125  -23661.328125
459   110000  106319.031250    3680.968750
279   192000  209512.015625  -17512.015625
655    88000   90391.148438   -2391.148438
...      ...            ...            ...
326   324000  295949.843750   28050.156250
440   555000  530029.312500   24970.687500
1387  136000  161625.406250  -25625.406250
1323   82500   77703.921875    4796.078125


In [14]:
mean_squared_error(y_valid, gs.predict(X_valid)), mean_absolute_error(y_valid, gs.predict(X_valid))

(919437680.1233581, 16338.514581549658)
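MSE is expressed in squared dollars, which makes it hard to compare with the MAE. Taking the square root gives an RMSE in the same unit as `SalePrice`. A small worked calculation using the two validation MSE values printed in this notebook (roughly 54,400 for the plain `LinearRegression` pipeline and roughly 30,300 for the tuned model):

```python
import math

baseline_mse = 2962417664.434398   # LinearRegression pipeline, validation set
tuned_mse = 919437680.1233581      # best grid-search estimator, validation set

baseline_rmse = math.sqrt(baseline_mse)
tuned_rmse = math.sqrt(tuned_mse)

print(round(baseline_rmse, 2), round(tuned_rmse, 2))
# Fraction of the baseline squared error that remains after tuning
print(round(tuned_mse / baseline_mse, 3))
```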

In [18]:
gs.best_params_

{'model': XGBRegressor(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=1000, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...),
 'model__eval_metric': <function sklearn.metrics._regression.mean_absolute_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')>,
 'model__learning_rate': 0.05,
 'model__max_depth': 5}

In [15]:
# Hard stop to avoid executing the code below
assert True==False

AssertionError: 

## Submit Results

Predict on the test dataset and store the results in a CSV file

In [16]:
pd.DataFrame({
    'Id': test_df['Id'],
    'SalePrice': gs.predict(test_df.drop(['Id'], axis=1))
}).to_csv('submission.csv', index=False)
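Before uploading, it can help to sanity-check the submission frame. The sketch below assumes a two-column `Id`/`SalePrice` layout as written above; `check_submission` is a hypothetical helper and the values are placeholders, not real predictions.

```python
import pandas as pd


def check_submission(sub, expected_rows):
    """Hypothetical helper: basic sanity checks before uploading."""
    assert list(sub.columns) == ["Id", "SalePrice"], "unexpected columns"
    assert len(sub) == expected_rows, "row count does not match the test set"
    assert sub["Id"].is_unique, "duplicate Ids"
    assert sub["SalePrice"].notna().all(), "missing predictions"
    assert (sub["SalePrice"] > 0).all(), "non-positive prices"
    return True


# Placeholder values for illustration only
demo_sub = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [120000.0, 155000.0, 180000.0],
})
ok = check_submission(demo_sub, expected_rows=3)
print(ok)
```

For the real file, `expected_rows` would be `len(test_df)`, which is 1459 according to the shapes printed earlier.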

Submit the results to the Kaggle competition page

In [17]:
!kaggle competitions submit -c home-data-for-ml-course -f submission.csv -m "performing correlation analysis and droping correlated columns"

100%|██████████████████████████████████████| 21.2k/21.2k [00:01<00:00, 20.9kB/s]
Successfully submitted to Housing Prices Competition for Kaggle Learn Users