# Housing Price for Beginners using Basic Ridge, Lasso, Elastic Net

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Acknowledgement 

Notebooks that inspired me the most:
- https://www.kaggle.com/houcembenmansour/house-price-prediction
- https://www.kaggle.com/rbyron/simple-linear-regression-models
- https://www.kaggle.com/apapiu/regularized-linear-models

# Objective:
The goal of this work is to build a model that can accurately predict `SalePrice`.  
This notebook aims to give beginners (myself included) exposure to working with a continuous target variable, so we will only implement basic models:
- Simple Linear Regression (OLS)
- Ridge Regression (regression with L2 regularization)
- Lasso Regression (regression with L1 regularization)
- Elastic Net (regression with a combination of L1 and L2 regularization)

We will use RMSE (root mean squared error) as the metric.
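
To make the metric concrete, here is a minimal sketch of how RMSE is computed. The helper name `rmse_manual` and the toy prices are made up for illustration and are not part of the competition data.

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy prices (made up for illustration)
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 330000]

toy_rmse = rmse_manual(actual, predicted)
print(toy_rmse)
```

Because the errors are squared before averaging, the one prediction that is off by 30000 dominates the score. Lower RMSE is better.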

In [None]:
# Basic
import pandas as pd
import numpy as np
import random

# Plots
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Misc.
import warnings
warnings.filterwarnings('ignore')

In [None]:
train = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test  = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")
sample= pd.read_csv("../input/house-prices-advanced-regression-techniques/sample_submission.csv")

# EDA

This section explores the data. Goals:
- get to know the dataset: how many instances, how many features, etc.
- get a quick overview of how each feature correlates with the dependent variable

## Getting to know data

In [None]:
train.head()

In [None]:
print('Data shape: ', train.shape)
print('There are %d instances' %train.shape[0])
print('There are %d features' %train.shape[1])

## Correlation heatmap
Goal: see the correlation between every pair of features. Correlation ranges between -1 and 1: values near -1 or 1 indicate a strong negative or positive correlation respectively, while values closer to zero indicate a weak correlation. We want to see which features have a high correlation (either positive or negative) with `SalePrice`.
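
As a rough illustration of what the correlation value measures, the sketch below computes the Pearson correlation by hand on toy arrays. The helper `pearson_r` and the numbers are assumptions made up for this example.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Toy data (made up): price rises with area and falls with age
area = np.array([50, 80, 100, 120, 150])
price = np.array([100, 150, 210, 230, 310])
age = np.array([40, 30, 25, 10, 5])

r_pos = pearson_r(area, price)   # close to +1
r_neg = pearson_r(age, price)    # close to -1
print(r_pos, r_neg)
```

The same values can be obtained with `np.corrcoef`, and this is the quantity that `DataFrame.corr()` computes by default.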

In [None]:
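# Create correlation matrix from the numeric columns only
# (newer pandas versions raise an error if text columns are passed to corr)
corrmat = train.select_dtypes(include='number').corr()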
# corrmat

plt.figure(figsize=(10, 10))
ax = sns.heatmap(corrmat, square=True, vmax=1, vmin=-1)
ax.set_title('Correlation Heatmap of Housing Pricing Train data')
plt.show()

For now we only need to focus on the last row, `SalePrice`:
- positively correlated: `OverallQual`, `GrLivArea`
- negatively correlated: none

No variable has a high negative correlation with `SalePrice`.  
Let's look at the correlation values of the two positively correlated variables and plot them to see what patterns they show.

In [None]:
train['OverallQual']

In [None]:
# Set seaborn theme
sns.set(style='darkgrid', palette='muted')

In [None]:
# Example of positive correlation
target = 'SalePrice'
var1 = 'OverallQual'
var2 = 'GrLivArea'


fig, (ax1, ax2)  = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
sns.boxplot(x=var1, y=target, data=train, ax=ax1)
ax1.set_title('Correlation values %.3f' %corrmat.loc[target, var1])
sns.scatterplot(x=var2, y=target, data=train, ax=ax2)
ax2.set_title('Correlation values %.3f' %corrmat.loc[target, var2])
fig.tight_layout()
plt.show()

These plots illustrate positive correlation: the value of one variable increases as the other increases. Negative correlation is the mirror image: one variable decreases as the other increases.  
Next we will look at which features have a high positive correlation and choose them as the features to train our model.

## Observe missing values

In [None]:
train.shape

In [None]:
total = train.isna().sum().sort_values(ascending=False)
percent = total/len(train)

missing = pd.concat([total, percent], axis=1)
missing.columns = ['total', 'percentage']

# let's also see corresponding correlation values
corr_tmp = corrmat.SalePrice
corr_tmp.name = 'corrval'
missingcorr = missing.merge(corr_tmp, how='outer', left_index=True,right_index=True).sort_values(by='percentage', ascending=False)
missingcorr.head(20)

What can we learn here?
- Not all variables have a correlation value; the ones without are categorical (non-numeric) features.
- The first four features are missing in more than 80% of the rows, so we can delete them entirely and pretend they don't exist.
- The same goes for `LotFrontage` and `FireplaceQu`; where a correlation value is available it is low, so dropping them entirely won't be a problem.
- The `GarageXXX` features show interesting percentages: they all have the same number of missing values, which suggests the missing values belong to the same instances. For now let's just drop them; their correlation values are not that high anyway.
- The same logic applies to `BsmtXX` and `MasVnrXX`. `MasVnrXX` has relatively few missing values, but for now let's just drop those columns too.
- `Electrical` has only one missing value, so in this case we can delete the row instead.
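
The guess that a group of columns is missing in the same rows can be checked directly. Below is a minimal sketch on a toy frame; the values are made up, and only two garage columns are included for brevity.

```python
import numpy as np
import pandas as pd

# Toy frame (made up): two garage columns that are missing in the same rows
toy = pd.DataFrame({
    'GarageType': ['Attchd', np.nan, 'Detchd', np.nan],
    'GarageQual': ['TA', np.nan, 'Gd', np.nan],
    'LotArea': [8450, 9600, 11250, 9550],
})

garage_cols = ['GarageType', 'GarageQual']
na_mask = toy[garage_cols].isna()

# True when every row is either fully missing or fully present across the group
same_rows = bool((na_mask.all(axis=1) == na_mask.any(axis=1)).all())
print(na_mask.sum())
print(same_rows)
```

On the real data the same check would use `train` and the full list of `GarageXXX` columns that contain missing values.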

In [None]:
# Dropping columns
del_cols = missingcorr[missingcorr['percentage'] > missingcorr.loc['Electrical', 'percentage']].index
del_cols
print('Initial data shape:', train.shape)

# The train_nona refers to train data without NaN values
train_nona = train.drop(columns=del_cols)
print('After dropping columns:', train_nona.shape)

train_nona = train_nona.dropna(axis=0, how='any')
print('After dropping instance of `Electrical`: ', train_nona.shape)

print('Total missing values in data after cleaning: ', train_nona.isna().sum().sum())

# Feature Engineering
This section covers the following:
- pick the features to build our model on, based on the highest correlation values
- check the normality of the features and transform them if needed

## Pick important features
Let's check the correlation matrix once again and sort the values in descending order.

In [None]:
highcorr = corrmat.SalePrice.sort_values(ascending=False)

# Pick correlation values that are > 0.5
highcorr = highcorr[highcorr > 0.5]
highcorr

Now we have some meaningful features that are probably the best candidates to train our model on, but do we need them all?  
Let's take a look at the correlation heatmap of just these features.

In [None]:
# Correlation matrix with above features
highcorrmat = corrmat.loc[highcorr.index, highcorr.index]

highcorrmat

In [None]:
# Correlation heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(highcorrmat, annot=True, fmt='.2f')
plt.show()

Now, let me introduce **multicollinearity**. The term means that two independent variables (predictors/features) are highly correlated with each other, which can cause problems when we train our model. An example from the heatmap above is `GarageCars` and `GarageArea`: the two are highly correlated, so we can drop one and keep the other. This makes the training data less redundant and reduces its dimension, which in turn makes our model less complex.

Other examples of multicollinearity in the heatmap above:
- `TotalBsmtSF` and `1stFlrSF`
- `GrLivArea` and `TotRmsAbvGrd`

Intuitively, we can understand why they are highly correlated: both variables in each pair essentially describe the same thing. For example, the number of cars you can fit in your garage (`GarageCars`) implicitly dictates the area of the garage itself (`GarageArea`). The same goes for the other pairs.
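
Instead of reading the pairs off a heatmap by eye, they can also be listed programmatically. The sketch below is one possible way to do it; the helper `correlated_pairs`, the 0.8 threshold and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(corr, threshold=0.8, exclude=()):
    """Return (feature_a, feature_b, r) for pairs with |r| above the threshold."""
    cols = [c for c in corr.columns if c not in exclude]
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, float(r)))
    # Strongest pairs first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Toy data (made up): garage_area is nearly a multiple of garage_cars
rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=200)
toy = pd.DataFrame({
    'garage_cars': cars,
    'garage_area': cars * 250 + rng.normal(0, 40, size=200),
    'lot_noise': rng.normal(0, 1, size=200),
})

pairs = correlated_pairs(toy.corr(), threshold=0.8)
print(pairs)
```

On the notebook's data one could call it as `correlated_pairs(highcorrmat, exclude=['SalePrice'])`, so that correlations with the target itself are not reported as pairs.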

So how do we decide which feature to drop from each pair? For now, let's keep the feature with the higher correlation to our target variable. We keep:
- `GarageCars`
- `TotalBsmtSF`
- `GrLivArea`

And we drop: `GarageArea`, `1stFlrSF`, `TotRmsAbvGrd`.  
To keep things simpler, let's also remove the other features among the lowest three: `YearBuilt`, `YearRemodAdd`.

Let's remove those features.

In [None]:
# Create new dataframe containing selected features 
drop_cols = ['GarageArea', '1stFlrSF', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd']

# Select highest correlated features
sel_train = train_nona[highcorrmat.index]

# Remove feature with multicollinearity
sel_train = sel_train.drop(columns=drop_cols)

sel_train.shape

## Pairplot & Outliers
We can plot all the features against each other using a pair plot, which helps us spot patterns or outliers that may exist.

In [None]:
sns.pairplot(sel_train)
plt.show()

A few observations from the plots against `SalePrice`:
- The two highest values of `GrLivArea` don't follow the trend
- The instance with the highest value of `TotalBsmtSF` doesn't follow the trend

Let's delete those instances.

There are also about three instances with `GarageCars` equal to 4 that don't follow the trend, but we will ignore them for now.

In [None]:
# Check index of those instance 
print(sel_train['GrLivArea'].sort_values()[-2:].index)
print(sel_train['TotalBsmtSF'].sort_values()[-1:].index)

In [None]:
# Dropping the two GrLivArea rows is enough when the TotalBsmtSF index printed above is one of them
sel_train = sel_train.drop(index=sel_train['GrLivArea'].sort_values()[-2:].index)

# Check to see if those outliers have been removed
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(12, 5))
fig.suptitle('After Removing Outliers')
sns.scatterplot(x='GrLivArea', y='SalePrice', data=sel_train, ax=ax1)
sns.scatterplot(x='TotalBsmtSF', y='SalePrice', data=sel_train, ax=ax2)
plt.show()

## Normality
I consider this section quite advanced, since we could actually proceed to build the model using the data as it is.
But since I've just learned about it, I thought I might as well put it here.

This section checks whether each numeric feature follows a normal distribution.  
We do this because many models, linear regression included, often work better when the data is roughly normally distributed.   
Where a feature is not, we apply a log transformation (`np.log`, or `np.log1p`, i.e. log(1 + x), when zeros may be present) to bring it closer to a normal distribution.

We will do this for the columns with continuous values, i.e. `SalePrice`, `GrLivArea` and `TotalBsmtSF`.
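
To see what the transformation does to skewness, here is a small sketch on synthetic data. The log-normal "prices" are generated for illustration only and are not the competition data.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Synthetic right-skewed "prices" (log-normal), made up for this example
prices = rng.lognormal(mean=12, sigma=0.4, size=5000)

logged = np.log1p(prices)

skew_before = skew(prices)   # clearly positive: long right tail
skew_after = skew(logged)    # close to zero: roughly symmetric
print(skew_before, skew_after)

# np.expm1 undoes np.log1p, which is how values get back to the price scale
restored = np.expm1(logged)
```

The inverse matters later: a model trained on a log-transformed target predicts on the log scale, so its predictions need `np.expm1` before they can be read as prices.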

### `SalePrice`

In [None]:
from scipy.stats import norm

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
sns.distplot(sel_train.SalePrice, kde=True, fit=norm, ax=ax1)
_ = stats.probplot(sel_train.SalePrice, plot = ax2)
fig.tight_layout()
plt.show()

We can see that `SalePrice` is not normal: the distribution has positive skewness, and the Q-Q plot shows that the points don't follow the diagonal line. Let's apply a log transformation!

In [None]:
# Applying log transformation
sel_train['SalePrice'] = np.log1p(sel_train.SalePrice)

In [None]:
# See effect of log transformation 
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
sns.distplot(sel_train.SalePrice, kde=True, fit=norm, ax=ax1)
_ = stats.probplot(sel_train.SalePrice, plot = ax2)
fig.tight_layout()
plt.show()

### `GrLivArea`

In [None]:
# Plot GrLivArea
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
sns.distplot(sel_train.GrLivArea, kde=True, fit=norm, ax=ax1)
_ = stats.probplot(sel_train.GrLivArea, plot = ax2)
fig.tight_layout()
plt.show()

Same phenomenon as before: `GrLivArea` is also positively skewed.

In [None]:
# Apply log transformation
sel_train.GrLivArea = np.log(sel_train.GrLivArea)

In [None]:
# See effect of log transformation 
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
sns.distplot(sel_train.GrLivArea, kde=True, fit=norm, ax=ax1)
_ = stats.probplot(sel_train.GrLivArea, plot = ax2)
fig.tight_layout()
plt.show()

### `TotalBsmtSF`

In [None]:
# Plot TotalBsmtSF
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
sns.distplot(sel_train.TotalBsmtSF, kde=True, fit=norm, ax=ax1)
_ = stats.probplot(sel_train.TotalBsmtSF, plot = ax2)
fig.tight_layout()
plt.show()

Notice that a couple of values are zero, and zero cannot be transformed using a plain log.   
To solve this, we use `np.log1p`, which maps zero to zero, and then plot only the non-zero values.
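
The difference between a plain log and `np.log1p` at zero can be seen on a tiny made-up array:

```python
import numpy as np

basement = np.array([0.0, 500.0, 1000.0])  # made-up areas; 0 means no basement

# Silence the divide-by-zero warning that log(0) would otherwise raise
with np.errstate(divide='ignore'):
    plain_log = np.log(basement)     # log(0) is -inf

shifted_log = np.log1p(basement)     # log(1 + 0) is exactly 0

print(plain_log)
print(shifted_log)
```

A `-inf` in a feature column would break model fitting, which is why `np.log1p` is the safer choice for columns that contain zeros.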

In [None]:
# transform data
sel_train['TotalBsmtSF'] = np.log1p(sel_train.TotalBsmtSF)

In [None]:
# See effect of log transformation
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
sns.distplot(sel_train.TotalBsmtSF[sel_train.TotalBsmtSF>0], kde=True, fit=norm, ax=ax1)
_ = stats.probplot(sel_train.TotalBsmtSF[sel_train.TotalBsmtSF>0], plot = ax2)
fig.tight_layout()
plt.show()

In [None]:
# Split the target variable from the predictors
y = sel_train.SalePrice
X = sel_train.drop(columns='SalePrice')

## Training

In [None]:
# Models
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, LassoCV, Ridge, RidgeCV
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Ordinary Least Squares (OLS) Regression
First, let's use an ordinary least squares (OLS) model, which is basic linear regression without regularization.   
Regularization itself will not be discussed extensively in this notebook.
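
For intuition about what OLS does, the sketch below fits the same toy problem twice: once by solving the least-squares problem directly with NumPy, and once with `LinearRegression`. The data and the "true" coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
# Made-up ground truth: y = 3 + 2*x0 - 1*x1 + noise
y_toy = 3 + 2 * X_toy[:, 0] - 1 * X_toy[:, 1] + rng.normal(0, 0.1, size=100)

# Least squares on [1, x0, x1]: minimizes the sum of squared residuals
A = np.column_stack([np.ones(len(X_toy)), X_toy])
beta, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

ols = LinearRegression().fit(X_toy, y_toy)

print(beta)                        # [intercept, coef_0, coef_1]
print(ols.intercept_, ols.coef_)   # same numbers, split into two attributes
```

Both approaches should recover values close to 3, 2 and -1, since they minimize the same objective.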

In [None]:
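# Function to compute the mean RMSE (Root Mean Squared Error) over `cv` cross-validation folds
def rmse(model, X, y, cv):
    # scikit-learn returns negated MSE per fold, so flip the sign before the square root
    fold_rmse = np.sqrt(-cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=cv))
    return fold_rmse.mean()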

In [None]:
lin = LinearRegression()

In [None]:
rmse_sc = rmse(lin, X, y, 5)
rmse_sc

In [None]:
# Create List to append dictionary of scores and model
all_scores = []

In [None]:
all_scores.append(dict(model='OLS', score=rmse_sc))

### Ridge Regression
If you have studied L1 and L2 regularization, it's good to know that regression with each penalty goes by a different name:
- regression with L1 regularization is also called Lasso
- regression with L2 regularization is also called Ridge

Some sources I recommend reading:
- https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c
- https://stats.stackexchange.com/questions/866/when-should-i-use-lasso-vs-ridge
- https://stats.stackexchange.com/questions/200416/is-regression-with-l1-regularization-the-same-as-lasso-and-with-l2-regularizati

In [None]:
# Let's try the default value, alpha=1
ridge = Ridge(alpha=1)

In [None]:
rmse(ridge, X, y, cv=5)

If you have ever heard of `lambda` as the regularization term, in scikit-learn it is set by the parameter `alpha`. It governs how strongly we regularize the model. It is a hyperparameter, meaning we can choose whichever value gives the best score on the metric we care about, in this case `rmse`. Let's create some candidate values for alpha.
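
To get a feel for what `alpha` does, the sketch below fits `Ridge` on made-up data with increasing `alpha` and records the size of the coefficient vector. The data and the chosen `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(60, 3))
# Made-up linear target with some noise
y_toy = X_toy @ np.array([4.0, -2.0, 1.0]) + rng.normal(0, 0.5, size=60)

coef_norms = {}
for alpha in [0.01, 1, 100, 10000]:
    model = Ridge(alpha=alpha).fit(X_toy, y_toy)
    # Euclidean length of the coefficient vector
    coef_norms[alpha] = float(np.linalg.norm(model.coef_))
    print(alpha, coef_norms[alpha])
```

A small `alpha` leaves the coefficients close to the OLS solution, while a very large `alpha` shrinks them towards zero. The best value usually lies somewhere in between, which is why we search over a range.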

In [None]:
# Selection for alphas
alphas = np.logspace(5,-5,50)
alphas

In [None]:
# Ridge CV is a ridge regression with built-in CV implementation
ridgecv = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error')
ridgecv.fit(X, y)

# RidgeCV gives us the model trained with best alpha values
ridgecv.alpha_

In [None]:
mod = Ridge(alpha=ridgecv.alpha_)
mod.fit(X, y)

In [None]:
rmse(mod, X, y, cv=5)

In [None]:
# Computing rmse
rmse_sc = rmse(ridgecv, X, y, cv=5)
rmse_sc

Oops, it gives nearly the same value as OLS. Don't worry, this happens. Notice also that, in this case, the default alpha and the tuned alpha give practically the same `rmse_sc`.

One more thing to note: `ridgecv` keeps the ridge model fitted with the best alpha, so it can be used directly for prediction.

In [None]:
all_scores.append(dict(model='Ridge', score=rmse_sc))

### Lasso Regression
The same logic as before applies to the Lasso model.
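
One practical difference from Ridge is that the L1 penalty can set coefficients exactly to zero. The sketch below shows this on made-up data where only the first two features matter; the data and `alpha=0.5` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(200, 5))
# Made-up target: only the first two features matter
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(0, 0.5, size=200)

lasso_toy = Lasso(alpha=0.5).fit(X_toy, y_toy)
ridge_toy = Ridge(alpha=0.5).fit(X_toy, y_toy)

print(lasso_toy.coef_)   # irrelevant features get exactly 0
print(ridge_toy.coef_)   # irrelevant features get small but non-zero values
```

This is why Lasso is sometimes described as doing feature selection as a side effect of fitting.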

In [None]:
# Try lasso with the default alpha = 1
lasso = Lasso(alpha=1)
rmse(lasso, X, y, cv=5)

In [None]:
# Find the best alpha among the candidates
lassocv = LassoCV(alphas=alphas)
lassocv.fit(X, y)
lassocv.alpha_

In [None]:
# Compute rmse and add to list
rmse_sc = rmse(lassocv, X, y, cv=5)
rmse_sc

We can see that the tuned alpha gives a better (lower) RMSE than the default value.  
Still, the RMSE with the tuned alpha is similar to that of the previous two models.

In [None]:
all_scores.append(dict(model='Lasso', score=rmse_sc))

### Elastic Net
Elastic net is a combination of both L1 and L2 regularization.  
We will skip using the default alpha values, and immediately jump to using ElasticNetCV
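
In scikit-learn the mix between the two penalties is controlled by `l1_ratio`: 1.0 is a pure L1 penalty and the default of 0.5 weights both. The sketch below illustrates this on made-up data; the data and `alpha=0.1` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
X_toy = rng.normal(size=(150, 4))
# Made-up target driven by the first two features
y_toy = 2 * X_toy[:, 0] + X_toy[:, 1] + rng.normal(0, 0.3, size=150)

# l1_ratio=1.0 is a pure L1 penalty, so it should match Lasso
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_toy, y_toy)
lasso_toy = Lasso(alpha=0.1).fit(X_toy, y_toy)

# l1_ratio=0.5 (the default) mixes the L1 and L2 penalties
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_toy, y_toy)

print(enet_l1.coef_)
print(lasso_toy.coef_)
print(enet_mix.coef_)
```

Note that calling `ElasticNetCV(alphas=alphas)` without an `l1_ratio` argument keeps the default mix of 0.5 and only tunes `alpha`.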

In [None]:
elasticcv = ElasticNetCV(alphas=alphas)
elasticcv.fit(X, y)

In [None]:
# Compute rmse and add to list
rmse_sc = rmse(elasticcv, X, y, cv=5)
rmse_sc

In [None]:
all_scores.append(dict(model='ElasticNet', score=rmse_sc))

## Results summary 

In [None]:
pd.DataFrame(all_scores).sort_values(by='score')

Current results suggest Ridge Regression gives the lowest RMSE score, so let's use it to predict the test data.

# Test

## Grab desired columns and impute missing values

In [None]:
# Desired columns
want_cols = X.columns

# Copy so that later assignments modify this frame and not a view of `test`
sel_test = test[want_cols].copy()

In [None]:
# We will impute missing values with the mean; first check which columns have missing values
sel_test.isna().sum()

Recall that `GarageCars` only contains integer values, so let's respect that by rounding the mean.

In [None]:
sel_test['GarageCars'] = sel_test['GarageCars'].fillna(round(sel_test['GarageCars'].mean()))
sel_test['TotalBsmtSF'] = sel_test['TotalBsmtSF'].fillna(sel_test['TotalBsmtSF'].mean())

In [None]:
# After imputing missing values
sel_test.isna().sum().sum()

## Predict using Ridge

In [None]:
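# Apply the same log transforms that were applied to the training features
X_sub = sel_test.copy()
X_sub['GrLivArea'] = np.log(X_sub['GrLivArea'])
X_sub['TotalBsmtSF'] = np.log1p(X_sub['TotalBsmtSF'])

# The model predicts log1p(SalePrice), so convert back to the price scale
ypred = np.expm1(ridgecv.predict(X_sub))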

# Creating output csv
result = pd.DataFrame({sample.columns[0] : sample['Id'],
                        sample.columns[1] : ypred
})


In [None]:
result.to_csv('./20210629-housing-ridge.csv', index=False)