## **Results**

**Ridge Regressor** is the best model, with the highest mean cross-validation score. Its scores and hyperparameters are listed below.

**Best Mean Cross-validation score: 0.84**

**Best Train Performance:  0.8737983750237459**

**Best Test Performance:  0.7840688355575975**

* **test mse: 1483903856.0287342**
* **test rmse: 38521.47266173419**
* **test r2: 0.7840688355575975**

**alpha=300**

```
{'regressor': Ridge(alpha=300, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=0, solver='auto', tol=0.001), 'regressor__alpha': 300,
'regressor__random_state': 0}
```

## Data PreProcessing

In [44]:
!pip install feature-engine



In [45]:
from math import sqrt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from pathlib import Path

pd.set_option('display.max_columns', None)
%matplotlib inline

### Load Datasets

In [46]:
# load dataset
data = pd.read_csv('houseprice.csv')

### Types of variables



In [47]:
# we have an Id variable, that we should not use for predictions:

print('Number of House Id labels: ', len(data.Id.unique()))
print('Number of Houses in the Dataset: ', len(data))

Number of House Id labels:  1460
Number of Houses in the Dataset:  1460


#### Find categorical variables

In [48]:
# find categorical variables - hint: data type == 'O'

categorical = [var for var in data.columns if data[var].dtype=='O']

print(f'There are {len(categorical)} categorical variables')

There are 43 categorical variables


#### Find temporal variables

In [49]:
# make a list of the numerical variables first - hint: data type != 'O'
numerical = [var for var in data.columns if data[var].dtype!='O']

# list of variables that contain year information - hint: variable name contains 'Yr' or 'Year'
year_vars = [var for var in numerical if 'Yr' in var or 'Year' in var]

year_vars

['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']

#### Find discrete variables

Discrete variables are identified here as numerical variables with fewer than 20 unique values, excluding the year variables.

In [50]:
# let's list the discrete variables: numerical, fewer than 20 unique values, not a year variable
discrete = [var for var in numerical if len(data[var].unique()) < 20 and var not in year_vars]

print(f'There are {len(discrete)} discrete variables')

There are 14 discrete variables


#### Continuous variables

In [51]:
# find continuous variables - hint: numerical variables not in discrete or year_vars
# Also remove the Id variable and the target variable SalePrice,
# which are both numerical

continuous = [var for var in numerical if var not in discrete and var not in [
    'Id', 'SalePrice'] and var not in year_vars]

print('There are {} numerical and continuous variables'.format(len(continuous)))

There are 18 numerical and continuous variables


### Separate train and test set

In [52]:
# Let's separate into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.drop(['Id', 'SalePrice'], axis=1),
                                                    data['SalePrice'],
                                                    test_size=0.1,
                                                    random_state=0)

X_train.shape, X_test.shape

((1314, 79), (146, 79))

### Create New Variables

Replace 'YearBuilt', 'YearRemodAdd' and 'GarageYrBlt' with the time elapsed between each of them and 'YrSold'.
So YearBuilt = YrSold - YearBuilt.

Transform 'YearRemodAdd' and 'GarageYrBlt' in the same way.
After the transformation, drop 'YrSold'.

In [53]:
# function to calculate elapsed time

def elapsed_years(df, var):
    # capture difference between year variable and
    # year the house was sold
    
    df[var] = df['YrSold'] - df[var]
    return df

In [54]:
for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    X_train = elapsed_years(X_train, var)
    X_test = elapsed_years(X_test, var)

In [55]:
# drop YrSold
X_train.drop('YrSold', axis=1, inplace=True)
X_test.drop('YrSold', axis=1, inplace=True)
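
The `elapsed_years` function above overwrites the columns of the DataFrame it receives. As a minimal sketch of the same idea that leaves its input untouched, the helper below works on a copy and drops the reference column in one go. The name `elapsed_years_copy` and the toy rows are my own illustration and are not taken from `houseprice.csv`.

```python
import pandas as pd


def elapsed_years_copy(df, year_cols, ref_col="YrSold"):
    """Return a new DataFrame where each column in year_cols holds
    ref_col minus the original year, and ref_col itself is dropped.
    (Hypothetical helper, not part of the original notebook.)"""
    out = df.copy()
    for col in year_cols:
        out[col] = out[ref_col] - out[col]
    return out.drop(columns=[ref_col])


# Toy rows for illustration only
toy = pd.DataFrame({
    "YearBuilt": [2000, 1995],
    "YearRemodAdd": [2005, 1995],
    "YrSold": [2008, 2010],
})

aged = elapsed_years_copy(toy, ["YearBuilt", "YearRemodAdd"])
print(aged)
```

Working on a copy avoids pandas warnings about modifying a slice, which can appear when the frame comes out of `train_test_split`.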

In [56]:
year_vars.remove('YrSold')

In [57]:
# capture the column names for use later in the notebook
final_columns = X_train.columns
final_columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'Wo

### Feature Engineering Pipeline

In [58]:
# I will treat discrete variables as if they were categorical
# to treat discrete as categorical using Feature-engine
# we need to re-cast them as object

X_train[discrete] = X_train[discrete].astype('O')
X_test[discrete] = X_test[discrete].astype('O')

In [59]:
# import relevant modules for feature engineering
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from feature_engine import missing_data_imputers as mdi
from feature_engine import categorical_encoders as ce
from feature_engine.variable_transformers import YeoJohnsonTransformer
from feature_engine.discretisers import DecisionTreeDiscretiser
from feature_engine.wrappers import SklearnTransformerWrapper

In [60]:
house_preprocess = Pipeline([

    # missing data imputation
    ('missing_ind', mdi.AddMissingIndicator(
        variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])),
    ('imputer_num', mdi.MeanMedianImputer(
        imputation_method='mean',
        variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])),
    ('imputer_cat', mdi.CategoricalVariableImputer(
        imputation_method='missing', variables=categorical)),

    # categorical encoding
    ('rare_label_enc', ce.RareLabelCategoricalEncoder(
        tol=0.01, n_categories=3, variables=categorical + discrete)),
    ('categorical_enc', ce.MeanCategoricalEncoder(
        variables=categorical + discrete)),

    # transforming numerical variables
    ('yjt', YeoJohnsonTransformer(
        variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])),

    # discretisation and encoding
    ('treeDisc', DecisionTreeDiscretiser(
        cv=2, scoring='neg_mean_squared_error',
        regression=True,
        param_grid={'max_depth': [1, 2, 3, 4, 5, 6]},
        variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])),

    # feature scaling
    ('scaler', SklearnTransformerWrapper(transformer=StandardScaler())),
])

In [61]:
house_preprocess.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('missing_ind',
                 AddMissingIndicator(how='missing_only',
                                     variables=['LotFrontage', 'MasVnrArea',
                                                'GarageYrBlt'])),
                ('imputer_num',
                 MeanMedianImputer(imputation_method='mean',
                                   variables=['LotFrontage', 'MasVnrArea',
                                              'GarageYrBlt'])),
                ('imputer_cat',
                 CategoricalVariableImputer(fill_value='Missing',
                                            imputation_method='missing',
                                            return_ob...
                                                      'LotFrontage', 'LotArea',
                                                      'Street', 'Alley',
                                                      'LotShape', 'LandContour',
                                                    

In [62]:
# Apply Transformations
X_train=house_preprocess.transform(X_train)
X_test=house_preprocess.transform(X_test)
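
To make the two categorical encoding steps in the pipeline less of a black box, here is a rough plain-pandas sketch of what rare-label grouping followed by mean (target) encoding does. It is not Feature-engine's implementation: the toy data, the `tol` of 0.2 and the `>=` comparison are my assumptions, and the sketch ignores the `n_categories` argument, which as far as I understand tells the encoder to leave variables with only a few categories alone.

```python
import pandas as pd

# Toy training data for illustration only
train = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "SalePrice": [100, 120, 110, 200, 220, 500],
})

# 1) Rare-label grouping: categories whose share of rows is below tol
#    are replaced by a single "Rare" label
tol = 0.2  # assumed threshold for this toy example
freq = train["Neighborhood"].value_counts(normalize=True)
frequent = freq[freq >= tol].index
grouped = train["Neighborhood"].where(train["Neighborhood"].isin(frequent), "Rare")

# 2) Mean encoding: each category is replaced by the mean target value
#    observed for that category in the training data
category_means = train["SalePrice"].groupby(grouped).mean()
encoded = grouped.map(category_means)

print(pd.DataFrame({"original": train["Neighborhood"],
                    "grouped": grouped,
                    "encoded": encoded}))
```

Because the encoding is learnt from the target, the mapping should be fitted on the training set only and then reused on the test set, which is what calling `fit` on `X_train` and `transform` on both sets achieves.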

## <span class="mark">DO NOT CHANGE STEPS BEFORE THIS POINT</span>

## Try the different models learnt in class - the best model should be chosen based on mean cross-validation score

## Regression Models - Tune different models one by one

### Ridge regression

In [63]:
# Code snippet already provided in the assignment
# Train a Ridge regression model
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
ridge = Ridge()

#define a list of parameters
param_ridge = {'alpha':[0.001, 0.01, 0.1, 1, 10, 100, 1000] }

grid_ridge = GridSearchCV(ridge, param_ridge, cv=5, return_train_score = True)
grid_ridge.fit(X_train, y_train)

# Mean Cross Validation Score
print("Best Mean Cross-validation score: {:.2f}".format(grid_ridge.best_score_))

print()

#find best parameters
print('Ridge parameters: ', grid_ridge.best_params_)

# Check test data set performance

print("Ridge Test Performance: ", grid_ridge.score(X_test,y_test))

Best Mean Cross-validation score: 0.83

Ridge parameters:  {'alpha': 100}
Ridge Test Performance:  0.7861405134143578
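
Each grid search in this notebook passes `return_train_score=True`, but the train scores are never looked at. As a sketch of how they could be used, the block below prints the mean train and validation R2 for every `alpha`, which helps to see where the model starts to under- or overfit. It runs on synthetic data from `make_regression` (my assumption, so that the block is self-contained); with the notebook's data, `grid_ridge.cv_results_` could be read in the same way.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

grid = GridSearchCV(
    Ridge(),
    {"alpha": [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    cv=5,
    return_train_score=True,
)
grid.fit(X, y)

# cv_results_ holds one entry per parameter combination
results = grid.cv_results_
for params, train_r2, val_r2 in zip(results["params"],
                                    results["mean_train_score"],
                                    results["mean_test_score"]):
    print(f"alpha={params['alpha']:<8} train R2={train_r2:.3f}  validation R2={val_r2:.3f}")

print("best:", grid.best_params_)
```

A large gap between train and validation R2 points to overfitting, while both scores dropping together at high `alpha` points to too much regularisation.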


### Lasso regression

In [64]:
from sklearn import linear_model
from warnings import filterwarnings
filterwarnings('ignore')

# Training a Lasso regression model
lasso_reg = linear_model.Lasso(max_iter=1000)

#define parameters
param_lasso = {'alpha':[0.001, 0.01, 0.1, 1, 10, 100, 1000] }

grid_lasso = GridSearchCV(lasso_reg, param_lasso, cv=10, return_train_score = True)
grid_lasso.fit(X_train, y_train)

# Mean Cross Validation Score
print("Best Mean Cross-validation score: {:.2f}".format(grid_lasso.best_score_))

print()

#find best parameters
print('Lasso parameters: ', grid_lasso.best_params_)

# Check test data set performance

print("Lasso Test Performance: ", grid_lasso.score(X_test,y_test))

Best Mean Cross-validation score: 0.84

Lasso parameters:  {'alpha': 1000}
Lasso Test Performance:  0.8143430518610633


### Decision Tree Regression

In [65]:
from sklearn.tree import DecisionTreeRegressor

dtree_reg = DecisionTreeRegressor()

#define parameters 
param_dtreereg = {'max_depth': [2, 3, 4, 5, 6, 7]}

grid_dtreereg = GridSearchCV(dtree_reg, param_dtreereg, cv=10, return_train_score = True)
grid_dtreereg.fit(X_train, y_train)

# Mean Cross Validation Score
print("Best Mean Cross-validation score: {:.2f}".format(grid_dtreereg.best_score_))

print()

#find best parameters
print('Decision Tree parameters: ', grid_dtreereg.best_params_)

# Check test data set performance

print("Decision Tree Test Performance: ", grid_dtreereg.score(X_test,y_test))

Best Mean Cross-validation score: 0.73

Decision Tree parameters:  {'max_depth': 4}
Decision Tree Test Performance:  0.815636500897849


### K-Nearest Neighbours (KNN) Regression

In [66]:
from sklearn.neighbors import KNeighborsRegressor

knnreg = KNeighborsRegressor()

#define parameters
param_knn = {'n_neighbors': np.arange(1, 30, 2)}

grid_knnreg = GridSearchCV(knnreg, param_knn, cv=10, return_train_score = True)
grid_knnreg.fit(X_train, y_train)

# Mean Cross Validation Score
print("Best Mean Cross-validation score: {:.2f}".format(grid_knnreg.best_score_))

print()

#find best parameters
print('KNN Regression parameters: ', grid_knnreg.best_params_)

# Check test data set performance

print("KNN Regression Test Performance: ", grid_knnreg.score(X_test,y_test))


Best Mean Cross-validation score: 0.79

KNN Regression parameters:  {'n_neighbors': 9}
KNN Regression Test Performance:  0.6484091011554165


### ElasticNet Regression



In [67]:
from sklearn.linear_model import ElasticNet

elnet = ElasticNet()

#define parameters
param_elnet = {'alpha':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}

grid_elnet = GridSearchCV(elnet, param_elnet, cv=10, return_train_score = True)
grid_elnet.fit(X_train, y_train)

# Mean Cross Validation Score
print("Best Mean Cross-validation score: {:.2f}".format(grid_elnet.best_score_))

print()

#find best parameters
print('ElasticNet parameters: ', grid_elnet.best_params_)

# Check test data set performance

print("ElasticNet Test Performance: ", grid_elnet.score(X_test,y_test))

Best Mean Cross-validation score: 0.84

ElasticNet parameters:  {'alpha': 1}
ElasticNet Test Performance:  0.7786442924623866
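
To compare the five individually tuned models side by side, the scores printed above can be collected and ranked. The numbers below are copied from the cell outputs; the ranking code is only a small sketch. Two caveats: the Ridge search used `cv=5` while the others used `cv=10`, and the cross-validation scores are printed to two decimals, so the 0.84 ties cannot be resolved from these outputs alone.

```python
# (model, best mean CV R2, test R2) as printed in the cells above
reported = [
    ("Ridge", 0.83, 0.7861405134143578),
    ("Lasso", 0.84, 0.8143430518610633),
    ("Decision Tree", 0.73, 0.815636500897849),
    ("KNN", 0.79, 0.6484091011554165),
    ("ElasticNet", 0.84, 0.7786442924623866),
]

# Rank by mean cross-validation score, the selection criterion used in this notebook;
# sorted() is stable, so tied models keep their original order
ranked = sorted(reported, key=lambda row: row[1], reverse=True)

print(f"{'model':<14}{'cv R2':>7}{'test R2':>9}")
for name, cv_r2, test_r2 in ranked:
    print(f"{name:<14}{cv_r2:>7.2f}{test_r2:>9.3f}")
```

Note that the decision tree has the lowest cross-validation score but the highest test score. With only 146 test houses, the test score is noisy, which is one reason to select on the cross-validation score instead.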


## Tune Multiple Models with one GridSearch

In [68]:
from sklearn.linear_model import LinearRegression
model_gs = Pipeline([("regressor", LinearRegression())])

In [69]:
model_parm_gd = [
    
    # Linear Regressor
    { 'regressor': [LinearRegression()]},
    
    # Ridge Regressor
    { 'regressor': [Ridge()],
      'regressor__alpha':[0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500],
      'regressor__random_state': [0, 30, 42] },
    
    # Lasso Regressor
    { 'regressor': [linear_model.Lasso()],
      'regressor__alpha':[0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500, 1000] },
    
    # Decision Tree Regressor 
    {'regressor': [DecisionTreeRegressor()],
      'regressor__max_depth': [2, 3, 4]},
    
    # KNN Regressor      
    {'regressor' :[KNeighborsRegressor()],
     'regressor__n_neighbors': np.arange(1, 30, 2)},
    
    # Elastic Net Regressor
    { 'regressor': [ElasticNet()],
      'regressor__alpha':[0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500, 600, 1000],
      'regressor__random_state': [0, 30, 42] },
]

In [70]:
grid_search_house_pipe = GridSearchCV(model_gs, model_parm_gd)

In [71]:
grid_search_house_pipe.fit(X_train,y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('regressor',
                                        LinearRegression(copy_X=True,
                                                         fit_intercept=True,
                                                         n_jobs=None,
                                                         normalize=False))],
                                verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'regressor': [LinearRegression(copy_X=True,
                                                         fit_intercept=True,
                                                         n_jobs=None,
                                                         normalize=False)]},
                         {'regressor': [Ridge(alpha=300, copy...
                                                   fit_intercept=True,
                                               

In [72]:
print(grid_search_house_pipe.best_params_)

{'regressor': Ridge(alpha=300, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=0, solver='auto', tol=0.001), 'regressor__alpha': 300, 'regressor__random_state': 0}
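
`best_params_` only reports the overall winner. As a sketch of how to see the best score reached by each regressor in a multi-model search, the block below groups `cv_results_` by the class of the `regressor` parameter. It uses synthetic data and a shortened grid (both my assumptions) so that it runs on its own; the same loop should work on `grid_search_house_pipe.cv_results_`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for X_train / y_train
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

pipe = Pipeline([("regressor", LinearRegression())])
param_grid = [
    {"regressor": [LinearRegression()]},
    {"regressor": [Ridge()], "regressor__alpha": [0.1, 1, 10]},
    {"regressor": [Lasso()], "regressor__alpha": [0.1, 1, 10]},
]
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)

# Keep the best mean CV score (and its parameters) per regressor class
best_by_model = {}
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    name = type(params["regressor"]).__name__
    if name not in best_by_model or score > best_by_model[name][0]:
        tuned = {k: v for k, v in params.items() if k != "regressor"}
        best_by_model[name] = (score, tuned)

for name, (score, tuned) in sorted(best_by_model.items(),
                                   key=lambda kv: kv[1][0], reverse=True):
    print(f"{name:<17} best mean CV R2={score:.3f}  {tuned}")
```

One observation on the grid used in this notebook: as far as I know, `random_state` only affects `Ridge` with the `sag` or `saga` solvers and `ElasticNet` with `selection='random'`, so with the defaults the three `random_state` values probably just repeat the same fit three times.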


In [73]:
# let's get the predictions
X_train_preds = grid_search_house_pipe.predict(X_train)
X_test_preds = grid_search_house_pipe.predict(X_test)

In [76]:
print("Best Mean Cross-validation score: {:.2f}".format(grid_search_house_pipe.best_score_))
print()
print("Best Train Performance: ", grid_search_house_pipe.score(X_train,y_train))
print()
print("Best Test Performance: ", grid_search_house_pipe.score(X_test,y_test))

Best Mean Cross-validation score: 0.84

Best Train Performance:  0.8737983750237459

Best Test Performance:  0.7840688355575975


In [75]:
# check model performance:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

print('test mse: {}'.format(mean_squared_error(y_test, X_test_preds)))
print('test rmse: {}'.format(sqrt(mean_squared_error(y_test, X_test_preds))))
print('test r2: {}'.format(r2_score(y_test, X_test_preds)))

test mse: 1483903856.0287342
test rmse: 38521.47266173419
test r2: 0.7840688355575975
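
For reference, the three metrics above can be written out by hand. The sketch below uses plain Python on a toy example (the toy values and the function name `regression_metrics` are mine) and then checks that the reported RMSE is the square root of the reported MSE.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and R2 from two equal-length sequences.
    (Hypothetical helper for illustration.)"""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Toy example: errors are 1, 0, 1, 0
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(metrics)

# Consistency check on the values reported above
reported_mse = 1483903856.0287342
reported_rmse = 38521.47266173419
print(sqrt(reported_mse), reported_rmse)
```

The RMSE is in the units of the target, so the model's typical error on the test set is roughly 38,500 in `SalePrice` units. The R2 from `r2_score` matches the value returned by `grid_search_house_pipe.score`, since R2 is the default score for scikit-learn regressors.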
