<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Ames Housing Data & Kaggle Challenge

Note: This is part 2 of the code notebook, covering the following:
1. [Library Imports & Functions Creation](#1.-Library-Imports-&-Functions-Creation)
2. [Data Preparation for Train & Test Datasets](#2.-Data-Preparation-for-Train-&-Test-Datasets)
3. [Base Models Training](#3.-Base-Models-Training)
4. [Hyperparameter Tuning of Models](#4.-Hyperparameter-Tuning-of-Models)
5. [Further Analysis on the Best Model](#5.-Further-Analysis-on-the-Best-Model)
6. [Kaggle Submission](#6.-Kaggle-Submission)
7. [Conclusion](#7.-Conclusion)

## 1. Library Imports & Functions Creation

In [1]:
import numpy as np
import pandas as pd
import warnings
import time
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV, Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from warnings import simplefilter

In [2]:
def model_scores(model, X_train, X_test, y_train, y_test):
    """Print the mean CV score, the R^2 scores and the RMSE of a fitted model on both splits."""
    print(f'Mean CV score: {round(cross_val_score(model, X_train, y_train).mean(), 3)}.')
    print(f'Training score: {round(model.score(X_train, y_train), 3)}.')
    print(f'Testing score: {round(model.score(X_test, y_test), 3)}.')
    # RMSE = sqrt(MSE(y_true, y_pred))
    print(f'RMSE in training: {round(np.sqrt(mean_squared_error(y_train, model.predict(X_train))), 3)}.')
    print(f'RMSE in testing: {round(np.sqrt(mean_squared_error(y_test, model.predict(X_test))), 3)}.')
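
The "Mean CV score" printed by `model_scores` relies on the defaults of `cross_val_score`. As a minimal sketch on synthetic data (the arrays and the choice of `Ridge` are assumptions for illustration only), the defaults amount to 5 folds scored with the estimator's own `score` method, which is R<sup>2</sup> for regressors. `cross_val_score` also clones and refits the model on each fold, so the result does not depend on how the model passed in was fitted.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# With no `cv` or `scoring` argument, scikit-learn uses 5 folds and R^2
default_scores = cross_val_score(Ridge(), X_demo, y_demo)
r2_scores = cross_val_score(Ridge(), X_demo, y_demo, cv=5, scoring="r2")

print(len(default_scores))                      # one score per fold
print(np.allclose(default_scores, r2_scores))   # the two calls agree
```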

## 2. Data Preparation for Train & Test Datasets

### 2.1 Combining both datasets

In [3]:
# Importing all datasets
# The original train dataset will be needed for the Kaggle submission
house_test_cleaned = pd.read_csv('./datasets/house_test_cleaned.csv')
house_train_combined = pd.read_csv('./datasets/house_test_combined.csv')
house_train = pd.read_csv('./datasets/train.csv')
house_test = pd.read_csv('./datasets/test.csv')

In [4]:
# We need to check the number of features in both datasets
print(f'Number of Combined Features for Train Dataset: {house_train_combined.shape[1]}')

Number of Combined Features for Train Dataset: 164


In [5]:
print(f'Number of Combined Features for Test Dataset: {house_test_cleaned.shape[1]}')

Number of Combined Features for Test Dataset: 154


### 2.2 Identify differing features of both datasets

In [6]:
# We shall loop through each dataset to find the columns that are not the same 
test_col = []
for col in house_test_cleaned:
    test_col.append(col)

In [7]:
# Features missing from test dataset
for col in house_train_combined:
    if col not in test_col: print(col)

MS SubClass_SC150
MS Zoning_C (all)
Neighborhood_GrnHill
Neighborhood_Landmrk
Condition 2_Feedr
Condition 2_PosN
Condition 2_RRAe
Condition 2_RRAn
Condition 2_RRNn
Roof Matl_Membran
Exterior 1st_CBlock
Exterior 1st_ImStucc
Exterior 1st_Stone
Exterior 2nd_Stone
Heating_OthW
Heating_Wall
Misc Feature_Gar2
Misc Feature_TenC


In [8]:
train_col = []
for col in house_train_combined:
    train_col.append(col)

In [9]:
# Features missing from train dataset
for col in house_test_cleaned:
    if col not in train_col: print(col)

Roof Matl_Metal
Roof Matl_Roll
Exterior 1st_PreCast
Exterior 2nd_Other
Exterior 2nd_PreCast
Mas Vnr Type_CBlock
Heating_GasA
Sale Type_VWD
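
The two loops above can also be written as set differences on the column indexes. The sketch below uses two toy frames (`train_demo` and `test_demo` are hypothetical stand-ins, not the project data) and the pandas `Index.difference` method.

```python
import pandas as pd

# Toy frames standing in for the train and test feature tables
train_demo = pd.DataFrame(columns=["Age", "Heating_Wall", "Garage Area"])
test_demo = pd.DataFrame(columns=["Age", "Garage Area", "Sale Type_VWD"])

# Columns present in one frame but absent from the other
only_in_train = train_demo.columns.difference(test_demo.columns).tolist()
only_in_test = test_demo.columns.difference(train_demo.columns).tolist()

print(only_in_train)  # ['Heating_Wall']
print(only_in_test)   # ['Sale Type_VWD']
```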


### 2.3 Removing unique features of each dataset

In [10]:
house_test_cleaned.drop(columns=['Roof Matl_Metal',
                              'Roof Matl_Roll',
                              'Exterior 1st_PreCast',
                              'Exterior 2nd_Other',
                              'Exterior 2nd_PreCast',
                              'Mas Vnr Type_CBlock',
                              'Heating_GasA',
                              'Sale Type_VWD'],inplace=True)

In [11]:
house_train_combined.drop(columns=['MS SubClass_SC150',
                              'MS Zoning_C (all)',
                              'Neighborhood_GrnHill',
                              'Neighborhood_Landmrk',
                              'Condition 2_Feedr',
                              'Condition 2_RRAe',
                              'Condition 2_RRAn',
                              'Condition 2_RRNn',
                              'Roof Matl_Membran',
                              'Condition 2_PosN',
                              'Exterior 1st_CBlock',
                              'Exterior 1st_ImStucc',
                              'Exterior 1st_Stone',
                              'Exterior 2nd_Stone',
                              'Heating_OthW',
                              'Heating_Wall',
                              'Misc Feature_Gar2',     
                              'Misc Feature_TenC'],inplace=True)

In [12]:
print(f'Number of Combined Features for Train Dataset After 2.3: {house_train_combined.shape[1]}')

Number of Combined Features for Train Dataset After 2.3: 146


In [13]:
print(f'Number of Combined Features for Test Dataset After 2.3: {house_test_cleaned.shape[1]}')

Number of Combined Features for Test Dataset After 2.3: 146
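
Matching column counts do not by themselves guarantee matching column order, and the scaler and models later receive the features by position. A minimal sketch of one way to align two frames, assuming toy data (`train_demo` and `test_demo` are hypothetical), keeps only the shared columns in the training frame's order:

```python
import pandas as pd

# Toy frames with overlapping columns in a different order
train_demo = pd.DataFrame({"Age": [10], "Heating_Wall": [1], "Garage Area": [400]})
test_demo = pd.DataFrame({"Garage Area": [350], "Sale Type_VWD": [1], "Age": [5]})

# Keep only the shared columns, in the training frame's order
shared = [col for col in train_demo.columns if col in test_demo.columns]
train_aligned = train_demo[shared]
test_aligned = test_demo[shared]

print(list(train_aligned.columns))  # ['Age', 'Garage Area']
print(list(test_aligned.columns))   # ['Age', 'Garage Area']
```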


## 3. Base Models Training

We shall be covering four base models:
1. Linear Regression
2. Lasso Regression
3. Ridge Regression
4. ElasticNet Regression

### 3.1 Scaling Choice: Min Max Scaler

This project uses only the Min Max Scaler for scaling. The rationale is to keep feature values from becoming negative: Min Max scaling maps each feature linearly onto the range [0, 1], whereas scalers that centre the data, such as the Standard Scaler, produce negative values (see the [scikit-learn comparison of scalers](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py)). As a recap, the dataset itself contains no negative values. The concern with negative feature values is that they make the direction of a feature's relationship with the target variable harder to read from its coefficient.
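
As a small illustration of what the scaler computes, the sketch below (toy values, chosen only for this example) compares `MinMaxScaler` with the formula (x - min) / (max - min) applied by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy, non-negative feature column
values = np.array([[0.0], [50.0], [100.0]])

scaled = MinMaxScaler().fit_transform(values)

# The same transformation written out by hand
manual = (values - values.min()) / (values.max() - values.min())

print(scaled.ravel())                # 0.0, 0.5 and 1.0
print(np.allclose(scaled, manual))   # the two agree
```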

### 3.2 Base: Linear Regression

In [14]:
y = house_train['SalePrice']
X = house_train_combined

In [15]:
# Using the random state of 42 for consistent result
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [16]:
lr = LinearRegression()

In [17]:
# Note: the model is fit here on the unscaled features, before the Min Max scaling in the next cell
lr_model = lr.fit(X_train, y_train)

In [18]:
mm = MinMaxScaler()
X_train = mm.fit_transform(X_train)
X_test = mm.transform(X_test)

In [19]:
# Results for base linear regression
warnings.filterwarnings("ignore")
model_scores(lr_model, X_train, X_test, y_train, y_test)

Mean CV score: -2.075698653063636e+16.
Training score: -7.095.
Testing score: -7.385.
RMSE in training: 226264.213.
RMSE in testing: 226905.54.
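
In the cells above, the linear model is fit in cell 17 and the features are scaled afterwards in cell 18, so the model is scored on inputs it was not trained on. The sketch below reproduces that kind of mismatch on synthetic data and contrasts it with a scikit-learn `Pipeline`, which learns the scaling and the model together. The data and variable names are assumptions for illustration, and this is one possible arrangement rather than the notebook's method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on a large scale, used only for illustration
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1000, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=200)

# Mismatch: fit on raw features, then score on scaled features
raw_model = LinearRegression().fit(X_demo, y_demo)
X_scaled = MinMaxScaler().fit_transform(X_demo)
mismatch_r2 = raw_model.score(X_scaled, y_demo)

# Pipeline: the scaler is fitted and applied inside fit() and score()
pipe = make_pipeline(MinMaxScaler(), LinearRegression()).fit(X_demo, y_demo)
pipeline_r2 = pipe.score(X_demo, y_demo)

print(mismatch_r2 < 0)      # negative R^2 from the mismatch
print(pipeline_r2 > 0.99)   # near-perfect fit when the steps stay in sync
```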


### 3.3 Base: Lasso Regression

In [20]:
lasso = Lasso()

In [21]:
lasso_model = lasso.fit(X_train, y_train)

In [22]:
# Results for base lasso regression
warnings.filterwarnings("ignore")
model_scores(lasso_model, X_train, X_test, y_train, y_test)

Mean CV score: 0.829.
Training score: 0.887.
Testing score: 0.883.
RMSE in training: 26722.091.
RMSE in testing: 26772.47.


### 3.4 Base: Ridge Regression

In [23]:
ridge = Ridge()

In [24]:
ridge_model = ridge.fit(X_train, y_train)

In [25]:
# Results for base ridge regression
warnings.filterwarnings("ignore")
model_scores(ridge_model, X_train, X_test, y_train, y_test)

Mean CV score: 0.839.
Training score: 0.885.
Testing score: 0.887.
RMSE in training: 26941.657.
RMSE in testing: 26388.169.


### 3.5 Base: Elastic Net Regression

In [26]:
elasticnet = ElasticNet()

In [27]:
elasticnet_model = elasticnet.fit(X_train, y_train)

In [28]:
# Results for base elastic net regression
warnings.filterwarnings("ignore")
model_scores(elasticnet_model, X_train, X_test, y_train, y_test)

Mean CV score: 0.566.
Training score: 0.571.
Testing score: 0.6.
RMSE in training: 52060.967.
RMSE in testing: 49555.979.


### 3.6 Evaluation of Base Model Training

| Index | Model Name | Alpha Value | L1 Ratio Value | Mean CV R<sup>2</sup> Score | R<sup>2</sup> Train Score | R<sup>2</sup> Test Score | RMSE Train Score | RMSE Test Score |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | Base: Linear Regression | - |  -  | -2.076e+16 | -7.095 | -7.385 | 226264.213 | 226905.54 | 
| 2 | Base: Lasso Regression | 1 | -  | 0.829 | 0.887 | 0.883 | 26722.091 | 26772.47 |
| 3 | Base: Ridge Regression | 1 | - | 0.839 | 0.885 | 0.887 | 26941.657 | 26388.169 |
| 4 | Base: Elastic Net Regression | 1 | 0.5 | 0.566 | 0.571 | 0.600 | 52060.967 | 49555.979 |

1. The base models of the Lasso and Ridge regressions clearly outperform the base models of Linear and Elastic Net regressions. 
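2. The base Linear regression scores are negative, i.e. worse than always predicting the mean. Note that this model was fit before the Min Max scaling step but scored on the scaled data, so its training and testing scores likely reflect that mismatch rather than the model itself. The extreme mean CV score, which comes from refitting on the scaled data, likely points to numerical instability from the many dummy features rather than to underfitting.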
3. The result of the base Elastic Net is not ideal either, with a similar issue of underfitting, though to a lesser extent than the base Linear regression.
4. Base Lasso and Ridge regressions show comparable results with Ridge regression performing slightly better on RMSE Testing Score.
5. Nevertheless, both base Lasso and Ridge regressions also share similar issues with underfitting, which is more apparent when using the mean CV R<sup>2</sup>.

## 4. Hyperparameter Tuning of Models

We shall be tuning the following models:

1. Lasso Regression
2. Ridge Regression
3. ElasticNet Regression

### 4.1 HyperPara: Lasso Regression

In [29]:
# I focus on tuning only the alpha values, as the other parameters were tested and had no effect
# I encountered issues running GridSearchCV with lasso, so LassoCV is used instead
# This code might take a while to run
l_alphas = np.logspace(0, 50, 500)
lasso_cv = LassoCV(alphas=l_alphas).fit(X_train, y_train)

In [30]:
# This code might take a while to run
model_scores(lasso_cv, X_train, X_test, y_train, y_test)

Mean CV score: 0.841.
Training score: 0.874.
Testing score: 0.896.
RMSE in training: 28194.816.
RMSE in testing: 25241.134.


In [31]:
print(f'Alpha value: {lasso_cv.alpha_:.3f}')

Alpha value: 100.927
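
As a quick check on the grid used above, the sketch below rebuilds `np.logspace(0, 50, 500)` and confirms that the selected alpha of 100.927 is its 21st point. It also counts how many candidates fall at or below 1,000: only 30 of the 500, because the grid runs up to 1e50. The narrower grid `narrow_grid` is a hypothetical alternative that places all candidates in that lower range; it is not something the notebook ran.

```python
import numpy as np

# Same grid as in the cell above: 500 points from 10**0 to 10**50
alpha_grid = np.logspace(0, 50, 500)

selected = round(float(alpha_grid[20]), 3)
n_small = int((alpha_grid <= 1e3).sum())

print(selected)  # 100.927
print(n_small)   # 30

# Hypothetical alternative: 100 points between 1 and 1,000
narrow_grid = np.logspace(0, 3, 100)
```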


### 4.2 HyperPara: Ridge Regression

In [32]:
# I will leave out unimportant parameters in the interest of running time
ridge_para = {
    "alphas": np.logspace(0, 50, 500)}

In [33]:
ridge_grid = GridSearchCV(RidgeCV(),
                          ridge_para,
                          n_jobs=-1,
                          cv=5,
                          verbose = 1
                         )

In [34]:
ridge_grid.fit(X_train, y_train)

Fitting 5 folds for each of 500 candidates, totalling 2500 fits


In [35]:
# This code might take a while to run
start = time.time()
model_scores(ridge_grid, X_train, X_test, y_train, y_test)
end = time.time()
print(f"Runtime of the code is {end - start}")

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
Fitting 5 folds for each of 500 candidates, totalling 2500 fits
Fitting 5 folds for each of 500 candidates, totalling 2500 fits
Fitting 5 folds for each of 500 candidates, totalling 2500 fits
Fitting 5 folds for each of 500 candidates, totalling 2500 fits
Mean CV score: 0.84.
Training score: 0.88.
Testing score: 0.887.
RMSE in training: 27491.844.
RMSE in testing: 26350.511.
Runtime of the code is 38.54296398162842


In [36]:
ridge_grid.best_params_

{'alphas': 3.169582088726117}

### 4.3 HyperPara: ElasticNet Regression

In [37]:
# I focus on tuning the alpha and L1 ratio values, as the other parameters were tested and had no effect
# I encountered issues running GridSearchCV with elastic net, so ElasticNetCV is used instead
# This code might take a while to run
enet_alphas = np.logspace(0, 50, 500)
enet_ratio = np.linspace(0,1,5)
enet_model = ElasticNetCV(alphas=enet_alphas, l1_ratio=enet_ratio, cv=5).fit(X_train, y_train)

In [38]:
start = time.time()
model_scores(enet_model, X_train, X_test, y_train, y_test)
end = time.time()
print(f"Runtime of the code is {end - start}")

Mean CV score: 0.841.
Training score: 0.874.
Testing score: 0.896.
RMSE in training: 28194.816.
RMSE in testing: 25241.134.
Runtime of the code is 175.4595410823822


In [39]:
print(f'Alpha value: {enet_model.alpha_:.3f}')

Alpha value: 100.927


In [40]:
print(f'L1 Ratio value: {enet_model.l1_ratio_:.3f}')

L1 Ratio value: 1.000


### 4.4 Evaluation & Selection of Best Model

| Index | Model Name | Alpha Value | L1 Ratio Value | Mean CV R<sup>2</sup> Score | R<sup>2</sup> Train Score | R<sup>2</sup> Test Score | RMSE Train Score | RMSE Test Score |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | HyperPara: Lasso Regression | 100.927 | - | 0.841 | 0.874 | 0.896 | 28194.816 | 25241.134 |
| 2 | HyperPara: Ridge Regression | 3.170 | - | 0.840 | 0.880 | 0.887 | 27491.844 | 26350.511 |
| 3 | HyperPara: ElasticNet Regression | 100.927 | 1 | 0.841 | 0.874 | 0.896 | 28194.816 | 25241.134 |

1. Based on the RMSE test score, the Lasso regression outperforms the Ridge regression (25,241 versus 26,351).
2. Interestingly, the ElasticNet regression has the same scores as the Lasso regression. With an L1 ratio of 1, ElasticNet reduces to Lasso, which reinforces the Lasso regression as the best performer.
3. While the Ridge regression performs worse on the test scores, it does better on both the R<sup>2</sup> train score and the RMSE train score. Compared with the Lasso regression, the Ridge regression therefore seems to have less of an issue with underfitting, despite its poorer R<sup>2</sup> and RMSE test scores.
4. As with the base models, underfitting remains an issue for all models when examining their mean CV R<sup>2</sup> scores.
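
The identical Lasso and ElasticNet scores follow from the L1 ratio of 1: at that setting the ElasticNet penalty contains no L2 term and is the Lasso penalty. A minimal sketch on synthetic data (the arrays and the alpha of 0.1 are assumptions for illustration) shows the two estimators producing the same coefficients:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data with two irrelevant features, used only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
enet_demo = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)

# With l1_ratio=1.0 the two models solve the same problem
same_coefs = np.allclose(lasso_demo.coef_, enet_demo.coef_)
print(same_coefs)
```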

## 5. Further Analysis on the Best Model

With the best RMSE test score, I am inclined to choose the 'HyperPara: Lasso Regression' as the best model. As the focus of the project is to identify the best predictors, I shall focus on the features which perform the best for the final model.

| Model Name | Alpha Value | L1 Ratio Value | Mean CV R<sup>2</sup> Score | R<sup>2</sup> Train Score | R<sup>2</sup> Test Score| RMSE Train Score| RMSE Test Score
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| HyperPara: Lasso Regression | 100.927 |  -  | 0.841 | 0.874 | 0.896 | 28194.816 | 25241.134 |

In [41]:
# Top 10 features with the largest positive Lasso coefficients
out_arr = pd.Series(lasso_cv.coef_, index = house_train_combined.columns).to_frame()
out_arr.reset_index(inplace=True)
out_arr = out_arr.rename(columns={'index': 'Features', 0: 'Lasso Coef'})
out_arr.sort_values(by='Lasso Coef', ascending= False).head(10)

| | Features | Lasso Coef |
|:---:|:---:|:---:|
| 30 | Floors SF | 214278.396516 |
| 6 | Overall Qual | 99627.248859 |
| 70 | Neighborhood_StoneBr | 60852.470284 |
| 64 | Neighborhood_NridgHt | 48505.584697 |
| 100 | Roof Matl_WdShngl | 45415.065374 |
| 63 | Neighborhood_NoRidge | 37640.051737 |
| 19 | Kitchen Qual | 35438.139276 |
| 23 | Garage Area | 33124.045604 |
| 8 | Mas Vnr Area | 32951.074973 |
| 31 | Size of Porch | 26732.879657 |


In [42]:
# Bottom 5 features with the largest negative Lasso coefficients
out_arr.sort_values(by='Lasso Coef', ascending= False).tail()

| | Features | Lasso Coef |
|:---:|:---:|:---:|
| 104 | Exterior 1st_Stucco | -10257.562759 |
| 55 | Neighborhood_Edwards | -15069.074549 |
| 85 | Bldg Type_Twnhs | -23173.668683 |
| 28 | Age | -24250.593483 |
| 86 | Bldg Type_TwnhsE | -26292.253098 |


In [50]:
print(f'Number of Features: {lasso_cv.n_features_in_}')

Number of Features: 146


In [43]:
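print(f'Total Features Removed: {lasso_cv.n_features_in_ - len(out_arr[out_arr["Lasso Coef"] != 0])}')

Total Features Removed: 70


### Evaluation

1. The expression above subtracts the number of non-zero coefficients from the 146 input features, so 70 features have been removed (zero coefficient) and the remaining 76 are used by the best model.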
2. The tables above list the 10 largest positive and the 5 largest negative Lasso coefficients. Based on the result, there are more features with strong positive coefficients than with strong negative ones.
3. Floor square feet is the feature with the strongest coefficient, followed by overall quality. Other strong features related to size and quality are kitchen quality, garage area, masonry veneer area and porch size.
4. Apart from size and quality, location also matters. Several neighbourhoods have very strong positive coefficients, namely Stone Brook, Northridge Heights and Northridge. Interestingly, Edwards has a negative coefficient. Further research might be useful to find out more about these places.
5. As expected, the age of the house has an inverse relationship with the sale price.
6. Apart from the age of the house and the Edwards neighbourhood, other strong negative predictors of sale price are a stucco exterior and the townhouse building types (end unit and inside unit).
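
The split between retained and removed features can be read directly from the coefficients: Lasso removes a feature by setting its coefficient to exactly zero. The sketch below uses a handful of hypothetical coefficients (the names and values are made up for illustration) to count both groups and rank the retained features by magnitude. Comparing magnitudes in this way is meaningful here because all features were scaled to the same [0, 1] range.

```python
import pandas as pd

# Hypothetical coefficients, for illustration only
coefs = pd.Series(
    {"Floors SF": 2.5, "Age": -1.0, "Pool Area": 0.0, "Fence": 0.0, "Garage Area": 0.8}
)

kept = coefs[coefs != 0]       # features the model still uses
dropped = coefs[coefs == 0]    # features removed by the L1 penalty
print(len(kept), len(dropped))  # 3 2

# Rank the retained features by absolute coefficient size
ranked = kept.reindex(kept.abs().sort_values(ascending=False).index)
print(list(ranked.index))  # ['Floors SF', 'Age', 'Garage Area']
```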

## 6. Kaggle Submission

In [44]:
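# Preparing data for Kaggle Submission
# Note: fit_transform refits the scaler on the test data; mm.transform(kaggle_data)
# would instead reuse the ranges learned from the training data
kaggle_data = house_test_cleaned
kaggle_data = mm.fit_transform(kaggle_data)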
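
The cell above calls `fit_transform` on the Kaggle data, which learns new minimum and maximum values from that data instead of reusing those of the training set. A minimal sketch with toy values (chosen only for this example) shows how the two approaches differ for the same inputs:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training and test columns, for illustration only
train_vals = np.array([[0.0], [100.0]])
test_vals = np.array([[20.0], [60.0]])

# Reuse the training ranges: 20 and 60 map to 0.2 and 0.6
reused = MinMaxScaler().fit(train_vals).transform(test_vals)

# Refit on the test data: the same values map to 0.0 and 1.0
refit = MinMaxScaler().fit_transform(test_vals)

print(reused.ravel())
print(refit.ravel())
```

A model trained on features scaled with the training ranges generally expects new data to be scaled the same way, so the first variant is the usual choice.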

In [45]:
kaggle_pred = lasso_cv.predict(kaggle_data)

In [46]:
kaggle_pred  = pd.Series(kaggle_pred, name="SalePrice")

In [47]:
id_col = house_test['Id']
kaggle = pd.concat([id_col, kaggle_pred], axis=1)
kaggle = pd.DataFrame(kaggle)

In [48]:
# Exporting the data for kaggle submission
kaggle.to_csv("datasets/kaggle_lasso.csv", index = False)

##### Final Kaggle Private RMSE Score: 25190
##### Final Kaggle Public RMSE Score: 31440

### Evaluation of Kaggle Score
1. The private score is ranked 49th in the scoring chart.
2. The public score is ranked 53rd in the scoring chart.
3. The result leaves room for improvement. I will consider some ways to further improve the score under conclusion.

## 7. Conclusion

### 7.1 Possible Areas for Improvement

Broadly speaking, regression analysis can generally be improved by using more data. However, our best regression model shows signs of underfitting, so there is room to increase the complexity of the regression. This can be achieved by using a more complex regression model or by creating more complex features. One possibility is to use polynomial feature engineering to create more sophisticated features for regression analysis. Another possibility is to make fuller use of lasso and ridge regression. A personal learning point from this analysis is that methods of dropping features, such as screening on the statistical significance of each feature against the target variable or on multicollinearity between features, might not always improve the result of a regression.

### 7.2 Limitation

The analysis has the following limitations: 
1. Outdated dataset from 2006-2010.
2. Data limited to the City of Ames, Iowa. Hence, the findings cannot be generalised to other locations.
3. Data limited to micro factors. Other macro factors such as bank interest rate, demographics of population, population growth, Gross Domestic Product (GDP) of nation/ state, crime rate, government policy etc. should be taken into consideration to make the analysis more robust and holistic.

Nevertheless, as indicated under the problem statement and background, the project aims to analyse the best predictors for Ames during the period of the Great Recession. With a recession looming, this project hopes that the findings will be especially useful for prediction during such a period, although further research is required on the extent of their relevance.

### 7.3 Summary

In summary, this project aims to identify the best predictors of house prices through extensive linear regression modelling. This first requires a helpful regression model, and the project settles on the lasso regression as its best model. The following features are considered the best predictors of house prices:

1. Size (Floor*, Garage*, Masonry Veneer*, Porch*)
2. Quality (Overall*, Kitchen*)
3. Age of House^
4. Neighbourhood (Stone Brook*, Northridge Heights*, Northridge*, Edwards^)
5. Building Type (Townhouse End Unit^, Townhouse Inside Unit^)
6. Roof Material (Wood Shingles*)
7. Exterior of House (Stucco^)

Note: Strong positive relationship predictors*, Strong negative relationship predictors^