<h1><center>House Price Prediction - Ridge and Lasso Regression with RFE</center></h1>

# Table of Contents

* [1. Data Loading, Understanding and Cleaning the Data](#1)
 * [1.1 Loading the data ](#1.1)
 * [1.2 Analysing the dataframe ](#1.2)
 * [1.3 Cleaning the dataframe ](#1.3)
* [2. Visualising the Data](#2)
 * [2.1 Visualising the Target Variables ](#2.1)
 * [2.2 Visualising Numeric Variables ](#2.2)
 * [2.3 Visualising Categorical Variables](#2.3)
* [3. Data Preparation](#3)
 * [3.1 Converting categorical data into numerical data](#3.1)
 * [3.2 Dummy Variables](#3.2)
 * [3.3 Rescaling the Features](#3.3)
 * [3.4 Splitting the Data into Training and Testing Sets](#3.4)
* [4. Building a Linear Model](#4)
 * [4.1 Using RFE for Initial Feature Selection](#4.1)
 * [4.2 Building model using Ridge Regression](#4.2)
 * [4.3 Building model using Lasso Regression](#4.3)
* [5. Validating the assumptions of Linear Regression](#5)
 * [5.1 Residual Analysis on the train data](#5.1)
 * [5.2 Preserving Homoscedasticity](#5.2)
 * [5.3 Observations are independent of each other](#5.3)
 * [5.4 No Multicollinearity](#5.4)
* [6. Making Prediction using the Final Model And Evaluation](#6)
 * [6.1 Final Model](#6.1)
 * [6.2 Conclusion](#6.2)

<a id="1"></a>
## Step 1: Data Loading, Understanding and Cleaning the Data

### Let's start with importing all the required libraries for the analysis.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
import statsmodels.api as sm
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import GridSearchCV

# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

from scipy.stats import skew


# Display full output in each Jupyter cell, not just the last statement
from IPython.core.interactiveshell import InteractiveShell
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

<a id="1.1"></a>
## 1.1 Loading the data

In [None]:
housing = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")

<a id="1.2"></a>
##  1.2 Analysing the dataframe

In [None]:
housing.head()

In [None]:
housing.shape

In [None]:
test.shape

In [None]:
housing.info()

In [None]:
housing.describe([0.25,0.50,0.75,0.99])

<a id="1.3"></a>
##  1.3 Cleaning the dataframe

In [None]:
# The 'Id' column is not needed for prediction, so drop it
housing.drop('Id', axis=1,inplace=True)
test.drop('Id', axis=1,inplace=True)

In [None]:
#Looking at columns with NaN Values
housing.isnull().sum().sort_values(ascending=False).head(20)

In [None]:
#Any rows with greater than 1 null value?
housing[housing.isnull().sum(axis=1)>1]

In [None]:
#Looking at the percentage of null values
round(housing.isnull().sum()*100/housing.shape[0],2).sort_values(ascending=False).head(20)

### Considering 10% as threshold and dropping columns having more than threshold NaN values

In [None]:
threshold = 10
# Percentage of null values per column, computed once and reused
null_pct = round(housing.isnull().sum()*100/housing.shape[0],2)
drop_cols = null_pct[null_pct>threshold].index.tolist()
drop_cols
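
The threshold rule above can be wrapped in a small helper. This is a minimal sketch on a toy frame: the function name `columns_above_missing_threshold` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (hypothetical values)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, np.nan, 60.0, 84.0],
    "Alley": [np.nan, np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

def columns_above_missing_threshold(df, threshold_pct):
    """Return the columns whose share of NaN values exceeds threshold_pct."""
    # The mean of a boolean mask is the fraction of True values, i.e. the NaN share
    missing_pct = df.isnull().mean() * 100
    return missing_pct[missing_pct > threshold_pct].index.tolist()

cols_to_drop = columns_above_missing_threshold(toy, threshold_pct=10)
print(cols_to_drop)
```

Computing the percentage once and filtering it keeps the threshold logic in one place, so the same helper could be applied to both the train and test frames.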

In [None]:
housing.drop(columns=drop_cols, inplace=True)
test.drop(columns=drop_cols, inplace=True)
housing.shape

In [None]:
housing.head()

### Checking remaining columns with null values and imputing them

In [None]:
round(housing.isnull().sum()*100/housing.shape[0],2)[round(housing.isnull().sum()*100/housing.shape[0],2)>0].sort_values(ascending=False)

In [None]:
round(test.isnull().sum()*100/test.shape[0],2)[round(test.isnull().sum()*100/test.shape[0],2)>0].sort_values(ascending=False)

### Converting years to age in years (relative to 2021) for GarageYrBlt, YearBuilt, YearRemodAdd & YrSold

In [None]:
housing['GarageYrBlt'] = 2021-housing['GarageYrBlt']
housing['YearBuilt'] = 2021-housing['YearBuilt']
housing['YearRemodAdd'] = 2021-housing['YearRemodAdd']
housing['YrSold'] = 2021-housing['YrSold']
housing[['GarageYrBlt','YearBuilt','YearRemodAdd','YrSold']].head()

In [None]:
test['GarageYrBlt'] = 2021-test['GarageYrBlt']
test['YearBuilt'] = 2021-test['YearBuilt']
test['YearRemodAdd'] = 2021-test['YearRemodAdd']
test['YrSold'] = 2021-test['YrSold']
test[['GarageYrBlt','YearBuilt','YearRemodAdd','YrSold']].head()

### Null value treatment
#### Instead of dropping the null values, which would result in data loss, we will impute them according to the data dictionary provided with the data.

In [None]:
# NA in GarageType, GarageFinish, GarageQual & GarageCond means 'No Garage', so we will replace NA by 'No Garage'
housing['GarageFinish'].fillna('No Garage', inplace=True)
housing['GarageType'].fillna('No Garage', inplace=True)
housing['GarageQual'].fillna('No Garage', inplace=True)
housing['GarageCond'].fillna('No Garage', inplace=True)

In [None]:
# Imputing GarageYrBlt with -1, since these houses don't have garage 
housing['GarageYrBlt'].fillna(-1, inplace=True)

In [None]:
# NA in BsmtExposure, BsmtFinType2, BsmtQual,BsmtCond & BsmtFinType1 means 'No Basement', so we will replace NA by 'No Basement'
housing['BsmtExposure'].fillna('No Basement', inplace=True)
housing['BsmtFinType1'].fillna('No Basement', inplace=True)
housing['BsmtQual'].fillna('No Basement', inplace=True)
housing['BsmtCond'].fillna('No Basement', inplace=True)
housing['BsmtFinType2'].fillna('No Basement', inplace=True)

In [None]:
housing['MasVnrType'].fillna('None', inplace=True)
housing['MasVnrArea'].fillna(0, inplace=True)

In [None]:
# Dropping remaining rows with na
housing.dropna(axis=0, inplace=True)

In [None]:
housing.shape

In [None]:
# NA in GarageType, GarageFinish, GarageQual & GarageCond means 'No Garage', so we will replace NA by 'No Garage'
test['GarageFinish'].fillna('No Garage', inplace=True)
test['GarageType'].fillna('No Garage', inplace=True)
test['GarageQual'].fillna('No Garage', inplace=True)
test['GarageCond'].fillna('No Garage', inplace=True)
# Imputing GarageYrBlt with -1, since these houses don't have garage 
test['GarageYrBlt'].fillna(-1, inplace=True)
# NA in BsmtExposure, BsmtFinType2, BsmtQual,BsmtCond & BsmtFinType1 means 'No Basement', so we will replace NA by 'No Basement'
test['BsmtExposure'].fillna('No Basement', inplace=True)
test['BsmtFinType1'].fillna('No Basement', inplace=True)
test['BsmtQual'].fillna('No Basement', inplace=True)
test['BsmtCond'].fillna('No Basement', inplace=True)
test['BsmtFinType2'].fillna('No Basement', inplace=True)
test['MasVnrType'].fillna('None', inplace=True)
test['MasVnrArea'].fillna(0, inplace=True)
test.shape

### Observation : 
Now we can see that all `null` values in the training data are taken care of.

### Outlier Treatment

In [None]:
housing.describe([0.25,0.50,0.75,0.99])

### Removing outliers with 1.5 * IQR fences, taking the lower and upper quantiles as 0.25 & 0.99 respectively

In [None]:
num_col = list(housing.dtypes[housing.dtypes !='object'].index)
def drop_outliers(x):
    # Filter column by column; quantiles are recomputed on the rows that remain
    for col in num_col:
        Q1 = x[col].quantile(.25)   # lower quantile
        Q3 = x[col].quantile(.99)   # upper quantile
        IQR = Q3-Q1
        x =  x[(x[col] >= (Q1-(1.5*IQR))) & (x[col] <= (Q3+(1.5*IQR)))] 
    return x
housing = drop_outliers(housing)
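
To make the fence arithmetic concrete, here is a small sketch on toy data. It is a variant, not the cell above: it computes every fence once on the unfiltered frame and combines them in a single mask, whereas the notebook filters column by column. The function name `drop_outliers_sketch`, its parameters, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def drop_outliers_sketch(df, columns, lower_q=0.25, upper_q=0.99, k=1.5):
    """Keep rows that lie within [q_low - k*spread, q_high + k*spread] for every column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q_low = df[col].quantile(lower_q)
        q_high = df[col].quantile(upper_q)
        spread = q_high - q_low
        # between() is inclusive on both ends by default
        keep &= df[col].between(q_low - k * spread, q_high + k * spread)
    return df[keep]

# 99 ordinary rows and one extreme value (hypothetical data)
toy = pd.DataFrame({"area": [100] * 99 + [10_000]})
filtered = drop_outliers_sketch(toy, ["area"])
print(len(toy), len(filtered))
```

With such a wide quantile range (0.25 to 0.99), only very extreme points fall outside the fences, which is why relatively few rows are removed.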

In [None]:
housing.shape

In [None]:
housing.describe([0.25,0.50,0.75,0.99])

In [None]:
# Dropping PoolArea column, since all values are 0 after removing outliers
housing.drop(columns=['PoolArea'], inplace=True)
test.drop(columns=['PoolArea'], inplace=True)

<a id="2"></a>
## Step 2: Visualising the Data

Let's now spend some time doing what is arguably the most important step - **understanding the data**.
- If there is some obvious multicollinearity going on, this is the first place to catch it
- Here's where you'll also identify if some predictors directly have a strong association with the outcome variable

<a id="2.1"></a>
### 2.1 Visualising the Target Variables

In [None]:
plt.title('SalePrice')
sns.distplot(housing['SalePrice'], bins=10)
plt.show()

### Observation : 
Now we can see our target variable `SalePrice` is skewed, so doing log transformation

In [None]:
housing['SalePrice'] = np.log1p(housing['SalePrice'])

plt.title('SalePrice')
sns.distplot(housing['SalePrice'], bins=10)
plt.show()
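
Because the target is now on a `log1p` scale, any prediction has to be mapped back before it is read as a price. The sketch below, using hypothetical prices, shows that `np.expm1` is the exact inverse of `np.log1p`, while a plain `np.exp` is off by one.

```python
import numpy as np

prices = np.array([50_000.0, 150_000.0, 755_000.0])  # hypothetical sale prices

log_prices = np.log1p(prices)      # log(1 + price), the transform applied to SalePrice
recovered = np.expm1(log_prices)   # exact inverse: exp(x) - 1
off_by_one = np.exp(log_prices)    # exp alone returns price + 1

print(np.allclose(recovered, prices))
print(np.allclose(off_by_one, prices + 1))
```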

<a id="2.2"></a>
### 2.2 Visualising Numeric Variables

In [None]:
#Get list of numeric variables
num_vars = list(housing.dtypes[housing.dtypes !='object'].index)

#Let's review the numeric variables
housing[num_vars].head()

In [None]:
# Check the numerical values using pairplots

plt.figure(figsize=(10,5))
sns.pairplot(housing, x_vars=['MSSubClass','LotArea', 'MasVnrArea'], y_vars='SalePrice',height=4, aspect=1,kind='scatter')
sns.pairplot(housing, x_vars=['OverallQual', 'OverallCond','OpenPorchSF'], y_vars='SalePrice',height=4, aspect=1,kind='scatter')
sns.pairplot(housing, x_vars=['BsmtFinSF1', 'BsmtUnfSF','TotalBsmtSF'], y_vars='SalePrice',height=4, aspect=1,kind='scatter')
sns.pairplot(housing, x_vars=['1stFlrSF','2ndFlrSF', 'GrLivArea'], y_vars='SalePrice',height=4, aspect=1,kind='scatter')
sns.pairplot(housing, x_vars=['BsmtFullBath','FullBath', 'HalfBath'], y_vars='SalePrice',height=4, aspect=1,kind='scatter')
sns.pairplot(housing, x_vars=['BedroomAbvGr','TotRmsAbvGrd', 'Fireplaces'], y_vars='SalePrice',height=4, aspect=1,kind='scatter')
sns.pairplot(housing, x_vars=['GarageCars','GarageArea', 'WoodDeckSF'], y_vars='SalePrice',height=4, aspect=1,kind='scatter')
sns.pairplot(housing, x_vars=['3SsnPorch','MiscVal','KitchenAbvGr'], y_vars='SalePrice',height=4, aspect=1,kind='scatter')
plt.show()

#### <u> Observations </u>:  
- `MSSubClass`, `3SsnPorch` & `MiscVal` don't seem to have a relationship with `SalePrice`, so they can be dropped.
- `KitchenAbvGr`, `BsmtFullBath` & `BsmtHalfBath` don't seem to have a relationship with `SalePrice`, so they can be dropped.
- `GrLivArea`, `TotalBsmtSF` and `1stFlrSF` have a similar issue with extreme outliers, so I'll clip them all back to a maximum value of 3000.
- `GarageArea` also has some outliers.

In [None]:
# Dropping & Clipping
housing.drop(columns=[ 'MSSubClass','3SsnPorch','MiscVal'], inplace=True)
housing.drop(columns=[ 'KitchenAbvGr','BsmtFullBath','BsmtHalfBath'], inplace=True)
housing['GrLivArea'] = housing['GrLivArea'].clip(0, 3000)
housing['TotalBsmtSF'] = housing['TotalBsmtSF'].clip(0, 3000)
housing['1stFlrSF'] = housing['1stFlrSF'].clip(0, 3000)
housing['GarageArea'] = housing['GarageArea'].clip(0, 1200)
housing['BsmtFinSF1'] = housing['BsmtFinSF1'].clip(0, 2500)
housing['OpenPorchSF'] = housing['OpenPorchSF'].clip(0, 400)
housing['LotArea'] = housing['LotArea'].clip(0, 60000)
housing.shape

In [None]:
# Dropping & Clipping
test.drop(columns=[ 'MSSubClass','3SsnPorch','MiscVal'], inplace=True)
test.drop(columns=[ 'KitchenAbvGr','BsmtFullBath','BsmtHalfBath'], inplace=True)
test['GrLivArea'] = test['GrLivArea'].clip(0, 3000)
test['TotalBsmtSF'] = test['TotalBsmtSF'].clip(0, 3000)
test['1stFlrSF'] = test['1stFlrSF'].clip(0, 3000)
test['GarageArea'] = test['GarageArea'].clip(0, 1200)
test['BsmtFinSF1'] = test['BsmtFinSF1'].clip(0, 2500)
test['OpenPorchSF'] = test['OpenPorchSF'].clip(0, 400)
test['LotArea'] = test['LotArea'].clip(0, 60000)
test.shape

In [None]:
# Let's check the correlation coefficients to see which variables are highly correlated

plt.figure(figsize = (30, 20))
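# Restrict to numeric columns: the categorical (object) columns are not encoded yet
sns.heatmap(housing.select_dtypes(include='number').corr(), annot = True, cmap="YlGnBu")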
plt.show()

#### <u> Observations </u>:  
- `SalePrice` seems to be most correlated with `OverallQual`, `GrLivArea` and `GarageCars`.
- `GarageCars` and `GarageArea` seem to be highly correlated with each other.
- `GrLivArea` and `TotRmsAbvGrd` seem to be highly correlated with each other.
- `TotalBsmtSF` and `1stFlrSF` seem to be highly correlated with each other.
- `GrLivArea` and `2ndFlrSF` seem to be highly correlated with each other.
- `GrLivArea` and `FullBath` seem to be highly correlated with each other.
- `YearBuilt` and `GarageYrBlt` seem to be highly correlated with each other.
- `YearBuilt` and `YearRemodAdd` seem to be highly correlated with each other.
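
Pairs like the ones listed above can also be pulled out of the correlation matrix programmatically instead of being read off the heatmap. The following is a rough sketch on synthetic data; the helper name `high_corr_pairs` and the 0.8 cut-off are assumptions, not values used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs with |r| above threshold."""
    corr = df.corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() > threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in strong.items()]

rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=200)
toy = pd.DataFrame({
    "GarageCars": cars,
    "GarageArea": cars * 250 + rng.normal(0, 20, size=200),  # synthetic, nearly collinear
    "LotArea": rng.normal(10_000, 2_000, size=200),           # synthetic, independent
})
pairs = high_corr_pairs(toy)
print(pairs)
```

From each reported pair, one column can then be dropped, typically the one with the weaker correlation to the target.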

In [None]:
# Dropping GarageArea, TotRmsAbvGrd, 1stFlrSF, 2ndFlrSF, FullBath, GarageYrBlt and YearRemodAdd
housing.drop(columns=['GarageArea','TotRmsAbvGrd','1stFlrSF','2ndFlrSF','FullBath','GarageYrBlt', 'YearRemodAdd'], inplace=True)
housing.shape

In [None]:
# Dropping GarageArea, TotRmsAbvGrd, 1stFlrSF, 2ndFlrSF, FullBath, GarageYrBlt and YearRemodAdd
test.drop(columns=['GarageArea','TotRmsAbvGrd','1stFlrSF','2ndFlrSF','FullBath','GarageYrBlt', 'YearRemodAdd'], inplace=True)
test.shape

### Checking for skewness within independent variables

In [None]:
numerical_columns = housing.select_dtypes(include=['int64', 'float64'])
skewness_of_feats = numerical_columns.apply(lambda x: skew(x)).sort_values(ascending=False)
print(skewness_of_feats)

In [None]:
# Using log transformation for fixing skewness within variables
housing['LowQualFinSF'] = np.log1p(housing['LowQualFinSF'])
housing['LotArea'] = np.log1p(housing['LotArea'])
housing['BsmtFinSF2'] = np.log1p(housing['BsmtFinSF2'])
housing['ScreenPorch'] = np.log1p(housing['ScreenPorch'])
housing['EnclosedPorch'] = np.log1p(housing['EnclosedPorch'])
housing['MasVnrArea'] = np.log1p(housing['MasVnrArea'])
housing['OpenPorchSF'] = np.log1p(housing['OpenPorchSF'])
housing['WoodDeckSF'] = np.log1p(housing['WoodDeckSF'])
housing['BsmtUnfSF'] = np.log1p(housing['BsmtUnfSF'])

In [None]:
# Using log transformation for fixing skewness within variables
test['LowQualFinSF'] = np.log1p(test['LowQualFinSF'])
test['LotArea'] = np.log1p(test['LotArea'])
test['BsmtFinSF2'] = np.log1p(test['BsmtFinSF2'])
test['ScreenPorch'] = np.log1p(test['ScreenPorch'])
test['EnclosedPorch'] = np.log1p(test['EnclosedPorch'])
test['MasVnrArea'] = np.log1p(test['MasVnrArea'])
test['OpenPorchSF'] = np.log1p(test['OpenPorchSF'])
test['WoodDeckSF'] = np.log1p(test['WoodDeckSF'])
test['BsmtUnfSF'] = np.log1p(test['BsmtUnfSF'])
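
The choice of columns to transform can be automated from the skewness values. The sketch below is one possible way to do it, under stated assumptions: it uses pandas' `Series.skew()` (a sample-adjusted estimate, so the numbers differ slightly from `scipy.stats.skew`), and the cut-off of 1.0 and the helper name `log1p_skewed` are illustrative choices.

```python
import numpy as np
import pandas as pd

def log1p_skewed(df, columns, skew_threshold=1.0):
    """Apply log1p to the columns whose skewness exceeds skew_threshold."""
    out = df.copy()
    skewness = out[columns].skew()
    skewed_cols = skewness[skewness > skew_threshold].index.tolist()
    for col in skewed_cols:
        out[col] = np.log1p(out[col])
    return out, skewed_cols

toy = pd.DataFrame({
    "ScreenPorch": [0, 0, 0, 0, 0, 0, 0, 0, 120, 480],   # mostly zeros, long right tail
    "OverallQual": [4, 5, 5, 6, 6, 6, 7, 7, 8, 9],        # roughly symmetric
})
transformed, skewed_cols = log1p_skewed(toy, ["ScreenPorch", "OverallQual"])
print(skewed_cols)
```

Returning the list of transformed columns makes it easy to apply exactly the same transform to the test frame, rather than re-deriving the list from test-set skewness.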

<a id="2.3"></a>
### 2.3 Visualising Categorical Variables

As you might have noticed, there are many categorical variables as well. Let's make boxplots for some of these variables.

In [None]:
plt.figure(figsize = (20,20)) 
plt.subplot(3,3,1)
sns.boxplot(x='MSZoning', y="SalePrice", data=housing)
plt.subplot(3,3,2)
sns.boxplot(x='BldgType', y="SalePrice", data=housing)
plt.subplot(3,3,3)
sns.boxplot(x='Street', y="SalePrice", data=housing)
plt.subplot(3,3,4)
sns.boxplot(x='LotShape', y="SalePrice", data=housing)
plt.subplot(3,3,5)
sns.boxplot(x='HouseStyle', y="SalePrice", data=housing)
plt.subplot(3,3,6)
sns.boxplot(x='Utilities', y="SalePrice", data=housing)
plt.subplot(3,3,7)
sns.boxplot(x='RoofStyle', y="SalePrice", data=housing)
plt.subplot(3,3,8)
sns.boxplot(x='LandSlope', y="SalePrice", data=housing)
plt.subplot(3,3,9)
sns.boxplot(x='Neighborhood', y="SalePrice", data=housing)
plt.show()

In [None]:
plt.figure(figsize = (20,20)) 
plt.subplot(3,3,1)
sns.boxplot(x='ExterQual', y="SalePrice", data=housing)
plt.subplot(3,3,2)
sns.boxplot(x='Foundation', y="SalePrice", data=housing)
plt.subplot(3,3,3)
sns.boxplot(x='BsmtQual', y="SalePrice", data=housing)
plt.subplot(3,3,4)
sns.boxplot(x='Heating', y="SalePrice", data=housing)
plt.subplot(3,3,5)
sns.boxplot(x='CentralAir', y="SalePrice", data=housing)
plt.subplot(3,3,6)
sns.boxplot(x='Electrical', y="SalePrice", data=housing)
plt.subplot(3,3,7)
sns.boxplot(x='KitchenQual', y="SalePrice", data=housing)
plt.subplot(3,3,8)
sns.boxplot(x='GarageType', y="SalePrice", data=housing)
plt.subplot(3,3,9)
sns.boxplot(x='GarageQual', y="SalePrice", data=housing)
plt.show()
plt.figure(figsize = (20,5)) 
plt.subplot(1,2,1)
sns.boxplot(x='SaleType', y="SalePrice", data=housing)
plt.show()

#### <u> Observations </u>: 

- `MSZoning` of type 'FV' has a high `SalePrice`, and type 'C (all)' has the lowest `SalePrice`.
- Streets of type 'Pave' have a higher `SalePrice` than those of type 'Grvl'.
- The `Utilities` column has almost all of its values as 'AllPub', so it gives little information and is not an important feature.
- The house with Exterior Quality of type Excellent has the highest SalePrice.
- The house with Basement Quality of type Excellent has the highest SalePrice.
- The house with Kitchen Quality of type Excellent has the highest SalePrice.
- The house with Garage Quality of type Excellent has the highest SalePrice.

In [None]:
housing.drop(columns=['Utilities'], inplace=True)

In [None]:
test.drop(columns=['Utilities'], inplace=True)

<a id="3"></a>
## Step 3: Data Preparation

<a id="3.1"></a>
### 3.1 Converting categorical data into numerical data

In [None]:
cat_vars = list(housing.dtypes[housing.dtypes =='object'].index)
housing[cat_vars].head(10)

#### Let's look at the columns below. Their values clearly follow an order, so we can treat them as ordinal.

In [None]:
housing[['LandSlope','ExterQual','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2',
            'HeatingQC','CentralAir',  'KitchenQual','GarageFinish','GarageQual','GarageCond',
             'ExterCond','LotShape']].head()

In [None]:
housing['LandSlope'] = housing.LandSlope.map({'Sev':0,'Mod':1,'Gtl':2})
housing['ExterQual'] = housing.ExterQual.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
housing['BsmtQual'] = housing.BsmtQual.map({'No Basement':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
housing['BsmtCond'] = housing.BsmtCond.map({'No Basement':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
housing['BsmtExposure'] = housing.BsmtExposure.map({'No Basement':0,'No':1,'Mn':2,'Av':3,'Gd':4})
housing['BsmtFinType1'] = housing.BsmtFinType1.map({'No Basement':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6})
housing['BsmtFinType2'] = housing.BsmtFinType2.map({'No Basement':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6})
housing['HeatingQC'] = housing.HeatingQC.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
housing['CentralAir'] = housing.CentralAir.map({'N':0,'Y':1})
housing['KitchenQual'] = housing.KitchenQual.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
housing['GarageFinish'] = housing.GarageFinish.map({'No Garage':0,'Unf':1,'RFn':2,'Fin':3})
housing['GarageQual'] = housing.GarageQual.map({'No Garage':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
housing['GarageCond'] = housing.GarageCond.map({'No Garage':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
housing['ExterCond'] = housing.ExterCond.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
housing['LotShape'] = housing.LotShape.map({'IR1':0,'IR2':1,'IR3':2,'Reg':3})

In [None]:
housing.head()

In [None]:
test['LandSlope'] = test.LandSlope.map({'Sev':0,'Mod':1,'Gtl':2})
test['ExterQual'] = test.ExterQual.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
test['BsmtQual'] = test.BsmtQual.map({'No Basement':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
test['BsmtCond'] = test.BsmtCond.map({'No Basement':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
test['BsmtExposure'] = test.BsmtExposure.map({'No Basement':0,'No':1,'Mn':2,'Av':3,'Gd':4})
test['BsmtFinType1'] = test.BsmtFinType1.map({'No Basement':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6})
test['BsmtFinType2'] = test.BsmtFinType2.map({'No Basement':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6})
test['HeatingQC'] = test.HeatingQC.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
test['CentralAir'] = test.CentralAir.map({'N':0,'Y':1})
test['KitchenQual'] = test.KitchenQual.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
test['GarageFinish'] = test.GarageFinish.map({'No Garage':0,'Unf':1,'RFn':2,'Fin':3})
test['GarageQual'] = test.GarageQual.map({'No Garage':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
test['GarageCond'] = test.GarageCond.map({'No Garage':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
test['ExterCond'] = test.ExterCond.map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
test['LotShape'] = test.LotShape.map({'IR1':0,'IR2':1,'IR3':2,'Reg':3})
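
One thing to watch with `Series.map` and a dictionary is that any label missing from the dictionary silently becomes `NaN`. A small guard can catch this early. This is a sketch only: the helper `map_ordinal` is hypothetical and not part of the original notebook.

```python
import pandas as pd

QUALITY_ORDER = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

def map_ordinal(series, mapping):
    """Map labels to integers and fail loudly if any label is missing from the mapping."""
    mapped = series.map(mapping)
    # Labels that were present in the input but produced NaN after mapping
    unmapped = series[mapped.isna() & series.notna()].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped categories in {series.name}: {unmapped}")
    return mapped

kitchen = pd.Series(["TA", "Gd", "Ex", "Fa"], name="KitchenQual")
encoded = map_ordinal(kitchen, QUALITY_ORDER)
print(encoded.tolist())
```

Genuine missing values in the input still pass through as `NaN`; only labels that exist but are not covered by the mapping raise an error.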

<a id="3.2"></a>
### 3.2 Dummy Variables
Creating Dummy Variables for Categorical Features

In [None]:
cat_vars = list(housing.dtypes[housing.dtypes =='object'].index)
cat_vars

In [None]:
#Converting remaining categorical features to dummy variables using one-hot encoding.
housing = pd.get_dummies(data=housing,columns=cat_vars,drop_first=True)

In [None]:
housing.info()

In [None]:
test = pd.get_dummies(data=test,columns=cat_vars,drop_first=True)
test.info()

In [None]:
# Get the columns that are present in the training set but missing from the test set
missing_cols = set( housing.columns ) - set( test.columns )
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    test[c] = 0
# Ensure the columns in the test set are in the same order as in the train set
# (join='left' keeps exactly the training columns)
housing, test = housing.align(test, join='left', axis=1)
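
An equivalent way to line up the two encoded frames is `DataFrame.reindex` with the training columns. The sketch below uses two tiny hypothetical frames (`train` and `new`) to show the effect; it is an illustration, not the notebook's own code.

```python
import pandas as pd

train = pd.DataFrame({"Neighborhood": ["NAmes", "OldTown", "Veenker"],
                      "LotArea": [8450, 9600, 11250]})
new = pd.DataFrame({"Neighborhood": ["NAmes", "Blueste"],
                    "LotArea": [9000, 7000]})

train_enc = pd.get_dummies(train, columns=["Neighborhood"], drop_first=True)
new_enc = pd.get_dummies(new, columns=["Neighborhood"], drop_first=True)

# Reindex to the training columns: dummy columns absent from `new` are added as zeros,
# and dummy columns that exist only in `new` are dropped
new_aligned = new_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(new_aligned.columns))
```

Either approach gives the model the same set of columns, in the same order, that it saw during training.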

In [None]:
test.info()

#### <u> Observation </u>:  After One-Hot Encoding  we have `183` numeric columns

### Dividing into X and Y sets for the model building

In [None]:
housing.describe()
# Putting all feature variable to X

X = housing.drop(['SalePrice'], axis=1)
X.head()

In [None]:
# Putting response variable to y

y = housing['SalePrice']
y.head()

<a id="3.3"></a>
### Step 3.3 Rescaling the Features
Let's bring all numeric variables to the same scale so as to simplify model evaluation and interpretation.

In [None]:
# scaling the features

from sklearn.preprocessing import scale

# storing column names in cols
# scaling (the dataframe is converted to a numpy array)

cols = X.columns
X = pd.DataFrame(scale(X))
X.columns = cols
X.columns

In [None]:
cols = test.columns
test = pd.DataFrame(scale(test))
test.columns = cols
test.columns
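
The cells above standardise each frame with its own mean and standard deviation. A common alternative, sketched here on synthetic data, is to learn the scaling statistics from the training rows only and reuse them for unseen rows, so that both are on exactly the same scale. The arrays `X_fit` and `X_new` are made-up stand-ins.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_fit = rng.normal(loc=1500.0, scale=400.0, size=(100, 2))   # synthetic "training" features
X_new = rng.normal(loc=1500.0, scale=400.0, size=(20, 2))    # synthetic unseen rows

scaler = StandardScaler().fit(X_fit)       # learn mean and standard deviation from training rows only
X_fit_scaled = scaler.transform(X_fit)
X_new_scaled = scaler.transform(X_new)     # reuse the same statistics for unseen rows

print(X_fit_scaled.mean(axis=0).round(6), X_fit_scaled.std(axis=0).round(6))
```

With this approach the unseen rows will generally not have a mean of exactly zero after scaling, which is expected.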

<a id="3.4"></a>
### Step 3.4 Splitting the Data into Training and Testing Sets
- As you know, the first basic step for regression is performing a train-test split.
- We will split the data into 2 parts: train data and test data.

In [None]:
np.random.seed(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size = 0.3, random_state=42)

In [None]:
len(X_train.index)

In [None]:
len(X_test.index)

#### <u> Observation </u>:  We now have 1007 rows in the training dataset and 432 rows in the test dataset

<a id="4"></a>
## Step 4: Building a Linear Model

<a id="4.1"></a>
### Step 4.1 Using RFE for Initial Feature Selection
Let's use Recursive Feature Elimination (RFE) to automatically select 50 best features.

In [None]:
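# Running RFE with the number of output variables equal to 50
lm = LinearRegression()
lm.fit(X_train, y_train)

# n_features_to_select is passed by keyword; recent scikit-learn versions require this
rfe = RFE(lm, n_features_to_select=50) # running RFE
rfe = rfe.fit(X_train, y_train)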
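
To see what RFE does on a small scale, here is a toy sketch with synthetic data in which only two of six columns carry signal. The data and the choice of two features are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
# Only columns 0 and 3 drive the target; the rest are noise (synthetic data)
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=200)

# RFE repeatedly fits the estimator and removes the feature with the smallest |coefficient|
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X_toy, y_toy)

print(selector.support_)   # boolean mask of kept features
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

Since elimination is driven by coefficient magnitude, it helps that the features have already been brought to a common scale before RFE is run.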

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

In [None]:
# Creating X_train & X_test dataframes with RFE selected variables
X_train_rfe = X_train[col]
X_test_rfe = X_test[col]

In [None]:
test_rfe = test[col]

<a id="4.2"></a>
### Step 4.2 Building model using Ridge Regression

In [None]:
# list of alphas

params = {'alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.02, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 
                    9.0, 10.0, 20, 50, 100, 500, 1000 ]}

ridge = Ridge()

folds = 5
ridge_model_cv = GridSearchCV(estimator = ridge, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            
ridge_model_cv.fit(X_train_rfe, y_train)

In [None]:
ridge_cv_results = pd.DataFrame(ridge_model_cv.cv_results_)
ridge_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])

In [None]:
# plotting mean test and train scores with alpha 

# keep alpha as float: casting to int would turn every alpha below 1 into 0
ridge_cv_results['param_alpha'] = ridge_cv_results['param_alpha'].astype('float64')

# plotting
plt.figure(figsize=(16,8))

plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_train_score'])
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_test_score'])
plt.xlabel('alpha')
plt.xscale('log')
plt.ylabel('Negative Mean Absolute Error')   # matches the scoring used in GridSearchCV
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()

In [None]:
# checking the optimal value of alpha found by the grid search
print(ridge_model_cv.best_params_)

In [None]:
# Building the model with alpha
ridge = Ridge(alpha=ridge_model_cv.best_params_['alpha'])

ridge.fit(X_train_rfe, y_train)
y_train_pred = ridge.predict(X_train_rfe)
y_test_pred = ridge.predict(X_test_rfe)

print(r2_score(y_true=y_train,y_pred=y_train_pred))
print(r2_score(y_true=y_test,y_pred=y_test_pred))

In [None]:
# Check the mean squared error

mean_squared_error(y_test, y_test_pred)

In [None]:
model_param = list(ridge.coef_)
model_param.insert(0,ridge.intercept_)
cols = X_train_rfe.columns
# Index.insert returns a new Index, so the result must be assigned back
cols = cols.insert(0,'const')
ridge_coef = pd.DataFrame(list(zip(cols,model_param,(abs(ele) for ele in model_param))))
ridge_coef.columns = ['Feature','Coef','Mod']
ridge_coef.sort_values(by='Mod',ascending=False).head(10)
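
One detail worth keeping in mind when labelling coefficients: a pandas `Index` is immutable, so `Index.insert` returns a new `Index` rather than changing the existing one. The sketch below, with hypothetical feature names and coefficient values, shows the difference and one way to build the table.

```python
import pandas as pd

features = pd.Index(["GrLivArea", "OverallQual"])   # hypothetical feature names
coefs = [0.12, 0.09]                                  # hypothetical coefficients
intercept = 12.0                                      # hypothetical intercept

unchanged = features
unchanged.insert(0, "const")              # return value discarded: `unchanged` still has 2 labels
labels = features.insert(0, "const")      # keep the returned Index instead

coef_table = pd.DataFrame({"Feature": labels, "Coef": [intercept] + coefs})
coef_table["Mod"] = coef_table["Coef"].abs()
print(coef_table)
```

If the labels and values differ in length, `zip` stops at the shorter one without an error, so a misalignment of this kind can go unnoticed.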

<a id="4.3"></a>
### Step 4.3 Building model using Lasso Regression

In [None]:
lasso = Lasso()

folds = 10
lasso_model_cv = GridSearchCV(estimator = lasso, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            
lasso_model_cv.fit(X_train_rfe, y_train)

In [None]:
lasso_cv_results = pd.DataFrame(lasso_model_cv.cv_results_)
lasso_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])

In [None]:
# plotting mean test and train scores with alpha 

# keep alpha as float: casting to int would turn every alpha below 1 into 0
lasso_cv_results['param_alpha'] = lasso_cv_results['param_alpha'].astype('float64')

# plotting
plt.figure(figsize=(16,8))

plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_train_score'])
plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_test_score'])
plt.xlabel('alpha')
plt.xscale('log')
plt.ylabel('Negative Mean Absolute Error')   # matches the scoring used in GridSearchCV
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()

In [None]:
# checking the optimal value of alpha found by the grid search
print(lasso_model_cv.best_params_)

In [None]:
# Building the model with the best alpha found by the grid search
lasso = Lasso(alpha=lasso_model_cv.best_params_['alpha'])

lasso.fit(X_train_rfe, y_train)
y_train_pred = lasso.predict(X_train_rfe)
y_test_pred = lasso.predict(X_test_rfe)

print(r2_score(y_true=y_train,y_pred=y_train_pred))
print(r2_score(y_true=y_test,y_pred=y_test_pred))

In [None]:
# Check the mean squared error

mean_squared_error(y_test, y_test_pred)

In [None]:
model_param = list(lasso.coef_)
model_param.insert(0,lasso.intercept_)
cols = X_train_rfe.columns
# Index.insert returns a new Index, so the result must be assigned back
cols = cols.insert(0,'const')
lasso_coef = pd.DataFrame(list(zip(cols,model_param,(abs(ele) for ele in model_param))))
lasso_coef.columns = ['Feature','Coef','Mod']
lasso_coef.sort_values(by='Mod',ascending=False).head(20)

### Observation: 
After building models with both Ridge and Lasso, we can see that the r2 scores are almost the same for both of them. Since Lasso's L1 penalty can shrink coefficients all the way to zero, and so also helps in feature elimination, I am going to consider it as my final model.
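
The feature-elimination behaviour of Lasso can be seen on a small synthetic example. This is an illustrative sketch only: the data, the alpha values, and the variable names are assumptions and are unrelated to the housing model.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 8))
# Synthetic target: only the first two columns matter
y_toy = 2.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=300)

ridge_toy = Ridge(alpha=1.0).fit(X_toy, y_toy)   # L2 penalty: shrinks coefficients, rarely to exactly 0
lasso_toy = Lasso(alpha=0.2).fit(X_toy, y_toy)   # L1 penalty: can set coefficients exactly to 0

print("ridge zero coefficients:", int(np.sum(ridge_toy.coef_ == 0)))
print("lasso zero coefficients:", int(np.sum(lasso_toy.coef_ == 0)))
```

Ridge keeps small non-zero weights on the noise columns, while Lasso removes them from the model entirely, which is the sense in which it performs feature selection.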

<a id="5"></a>
## Step 5: Validating the assumptions of Linear Regression
Let's verify that the model fulfills the assumptions of linear regression 

<a id="5.1"></a>
### 5.1: Residual Analysis on the train data
That is, are the error terms normally distributed?

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.distplot((y_train - y_train_pred), bins = 20)
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label

#### <u> Observations </u>:  
The error terms appear to be approximately normally distributed with a mean close to zero, so we can use this model to make predictions.

<a id="5.2"></a>
### 5.2: Preserving Homoscedasticity
The probability distribution of the errors should have constant variance. We can check this by looking at a plot of residuals against fitted values.

In [None]:
# Fitted values (predictions) on the x-axis, residuals on the y-axis
plt.scatter(y_train_pred, (y_train - y_train_pred))
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

#### <u> Observations </u>:  
- The scatter plot doesn't show any funnel-shaped pattern, so we can say that homoscedasticity is well preserved.

<a id="5.3"></a>
### 5.3: Observations are independent of each other
We can use the Durbin-Watson test for verification. The test statistic takes values between 0 and 4. The closer it is to 2, the less autocorrelation there is between successive residuals.

In [None]:
print('The Durbin-Watson value for Final Model is',round(sm.stats.stattools.durbin_watson((y_train - y_train_pred)),4))
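
For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes it directly with NumPy on synthetic residuals; the helper name `durbin_watson_stat` is an assumption made for this example.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
independent = rng.normal(size=5000)          # synthetic uncorrelated residuals
trending = np.cumsum(rng.normal(size=5000))  # synthetic random walk: strong positive autocorrelation

print(round(durbin_watson_stat(independent), 2))
print(round(durbin_watson_stat(trending), 2))
```

Uncorrelated residuals give a value near 2, positively autocorrelated residuals push it towards 0, and negatively autocorrelated residuals push it towards 4.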

<a id='6'></a>
## Step 6: Making Prediction using the Final Model And Evaluation

<a id='6.1'></a>
### 6.1 Final Model
#### Building the Final model using Lasso with Optimal alpha 0.0001

In [None]:
lasso = Lasso(alpha=0.0001)

lasso.fit(X_train_rfe, y_train)
y_train_pred = lasso.predict(X_train_rfe)
y_test_pred = lasso.predict(X_test_rfe)

In [None]:
test_rfe = test_rfe.fillna(test_rfe.interpolate())
preds = lasso.predict(test_rfe)
# SalePrice was transformed with np.log1p, so invert with np.expm1
final_predictions = np.expm1(preds)

In [None]:
test.index = test.index + 1461
submission = pd.DataFrame({'Id': test.index ,'SalePrice': final_predictions })
submission.to_csv("submission.csv",index=False)

In [None]:
# Plotting y_test and y_pred to understand the spread

fig = plt.figure()
plt.scatter(y_test, y_test_pred)
fig.suptitle('y_test vs y_pred', fontsize = 20)              # Plot heading 
plt.xlabel('y_test', fontsize = 18)                          # X-label
plt.ylabel('y_pred', fontsize = 16) 

In [None]:
#Let's visualize Actual vs Predicted for Test Data

c = range(1, len(y_test) + 1)   # one index per test row
fig = plt.figure(figsize=(20,8))
plt.plot(c,y_test, color="blue", linewidth=2.5, linestyle="-")
plt.plot(c,y_test_pred, color="red",  linewidth=2.5, linestyle="-")
fig.suptitle('Actual vs Predicted Test Data', fontsize=20)              # Plot heading 
plt.xlabel('Index', fontsize=18)                               # X-label
plt.ylabel('SalePrice (log scale)', fontsize=16)               # Y-label

In [None]:
#Let's get the r-square for test data
r2_score(y_test, y_test_pred)

In [None]:
model_param = list(lasso.coef_)
model_param.insert(0,lasso.intercept_)
cols = X_train_rfe.columns
# Index.insert returns a new Index, so the result must be assigned back
cols = cols.insert(0,'const')
lasso_coef = pd.DataFrame(list(zip(cols,model_param,(abs(ele) for ele in model_param))))
lasso_coef.columns = ['Feature','Coef','Mod']
lasso_coef.sort_values(by='Mod',ascending=False).head(10)

<a id='6.2'></a>

## Conclusion :

- The optimal alpha (the regularisation strength, often written as lambda) for Ridge and Lasso is as below:
    - Ridge - 0.1
    - Lasso - 0.0001
    
- The Mean Squared error in case of Ridge and Lasso are:
    - Ridge - 0.01922
    - Lasso - 0.01904
    
- The r2_score for test data in case of Ridge and Lasso are:
    - Ridge - 87.9%
    - Lasso - 88.04%

- The Mean Squared Error of Lasso is slightly lower than that of Ridge

- Also, since Lasso helps in feature reduction, Lasso has an edge over Ridge.
  
- Hence, based on Lasso, the factors that generally affect the price are:
    - Lot Area
    - MSZoning
    - KitchenQual
    - Neighborhood
    - SaleCondition
    - Overall quality
    
Therefore, the variables selected by Lasso are significant variables for predicting the price of a house.

<h2><center> If you liked this notebook, please don't forget to comment and upvote. </center></h2>
<h2><center>Thank you!</center></h2>