<center><h1>US House Pricing Prediction</h1></center>

<h3>Introduction</h3>


A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual values and flip them on at a higher price. For the same purpose, the company has collected a data set from the sale of houses in Australia. The data is provided in the CSV file below.

<h3>Business Problem</h3>

The company is looking at prospective properties to buy to enter the market. You are required to build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.

<br>

**The company wants to know:**

Which variables are significant in predicting the price of a house, and

How well those variables describe the price of a house.

Also, determine the optimal value of lambda for ridge and lasso regression.
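To make the role of lambda concrete, here is a minimal sketch that picks a ridge penalty by comparing validation error over a small grid. It uses synthetic data and a closed-form ridge solution without an intercept; the helper `ridge_fit`, the candidate grid and the toy coefficients are all assumptions for illustration, not the notebook's actual procedure. scikit-learn's `Ridge` and `Lasso` call this penalty `alpha`.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept): (X'X + lam*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# synthetic data: five features, two of which have no effect on the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# simple hold-out split
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

# evaluate each candidate penalty on the validation set
candidate_lambdas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = {}
for lam in candidate_lambdas:
    coef = ridge_fit(X_train, y_train, lam)
    val_errors[lam] = float(np.mean((X_val @ coef - y_val) ** 2))

# the "optimal" lambda is the one with the lowest validation error
best_lambda = min(val_errors, key=val_errors.get)
print(best_lambda, val_errors[best_lambda])
```

A cross-validated grid search follows the same idea, averaging the error over several folds instead of a single hold-out split.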

 <br>

**Business Goal**


You are required to model the price of houses with the available independent variables. This model will then be used by the management to understand how exactly the prices vary with the variables. They can accordingly manipulate the strategy of the firm and concentrate on areas that will yield high returns. Further, the model will be a good way for management to understand the pricing dynamics of a new market.


<h2>Data Definition</h2>

<ul><ol>MSSubClass: Identifies the type of dwelling involved in the sale.	</ol>
<ol>MSZoning: Identifies the general zoning classification of the sale.</ol>
<ol>LotFrontage: Linear feet of street connected to property</ol>
<ol>LotArea: Lot size in square feet</ol>
<ol>Street: Type of road access to property</ol>
<ol>Alley: Type of alley access to property</ol>
<ol>LotShape: General shape of property</ol>
<ol>LandContour: Flatness of the property</ol>
<ol>Utilities: Type of utilities available</ol>
<ol>LotConfig: Lot configuration</ol>
<ol>LandSlope: Slope of property</ol>
<ol>Neighborhood: Physical locations within Ames city limits</ol>
<ol>Condition1: Proximity to various conditions</ol>
<ol>Condition2: Proximity to various conditions (if more than one is present)</ol>
<ol>BldgType: Type of dwelling</ol>
<ol>HouseStyle: Style of dwelling</ol>
<ol>OverallQual: Rates the overall material and finish of the house</ol>
<ol>OverallCond: Rates the overall condition of the house</ol>
<ol>YearBuilt: Original construction date</ol>
<ol>YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)</ol>
<ol>RoofStyle: Type of roof</ol>
<ol>RoofMatl: Roof material</ol>
<ol>Exterior1st: Exterior covering on house</ol>
<ol>Exterior2nd: Exterior covering on house (if more than one material)</ol>
<ol>MasVnrType: Masonry veneer type</ol>
<ol>MasVnrArea: Masonry veneer area in square feet</ol>
<ol>ExterQual: Evaluates the quality of the material on the exterior </ol>
<ol>ExterCond: Evaluates the present condition of the material on the exterior</ol>
<ol>Foundation: Type of foundation</ol>
<ol>BsmtQual: Evaluates the height of the basement</ol>
<ol>BsmtCond: Evaluates the general condition of the basement</ol>
<ol>BsmtExposure: Refers to walkout or garden level walls</ol>
<ol>BsmtFinType1: Rating of basement finished area</ol>
<ol>BsmtFinSF1: Type 1 finished square feet</ol>
<ol>BsmtFinType2: Rating of basement finished area (if multiple types)</ol>
<ol>BsmtFinSF2: Type 2 finished square feet</ol>
<ol>BsmtUnfSF: Unfinished square feet of basement area</ol>
<ol>TotalBsmtSF: Total square feet of basement area</ol>
<ol>Heating: Type of heating</ol>
<ol>HeatingQC: Heating quality and condition</ol>
<ol>CentralAir: Central air conditioning</ol>
<ol>Electrical: Electrical system</ol>
<ol>1stFlrSF: First Floor square feet</ol>
<ol>2ndFlrSF: Second floor square feet</ol>
<ol>LowQualFinSF: Low quality finished square feet (all floors)</ol>
<ol>GrLivArea: Above grade (ground) living area square feet</ol>
<ol>BsmtFullBath: Basement full bathrooms</ol>
<ol>BsmtHalfBath: Basement half bathrooms</ol>
<ol>FullBath: Full bathrooms above grade</ol>
<ol>HalfBath: Half baths above grade</ol>
<ol>BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)</ol>
<ol>KitchenAbvGr: Kitchens above grade</ol>
<ol>KitchenQual: Kitchen quality</ol>
<ol>TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)</ol>
<ol>Functional: Home functionality (Assume typical unless deductions are warranted)</ol>
<ol>Fireplaces: Number of fireplaces</ol>
<ol>FireplaceQu: Fireplace quality</ol>
<ol>GarageType: Garage location</ol>
<ol>GarageYrBlt: Year garage was built</ol>
<ol>GarageFinish: Interior finish of the garage</ol>
<ol>GarageCars: Size of garage in car capacity</ol>
<ol>GarageArea: Size of garage in square feet</ol>
<ol>GarageQual: Garage quality</ol>
<ol>GarageCond: Garage condition</ol>
<ol>PavedDrive: Paved driveway</ol>
<ol>WoodDeckSF: Wood deck area in square feet</ol>
<ol>OpenPorchSF: Open porch area in square feet</ol>
<ol>EnclosedPorch: Enclosed porch area in square feet</ol>
<ol>3SsnPorch: Three season porch area in square feet</ol>
<ol>ScreenPorch: Screen porch area in square feet</ol>
<ol>PoolArea: Pool area in square feet</ol>
<ol>PoolQC: Pool quality</ol>
<ol>Fence: Fence quality</ol>
<ol>MiscFeature: Miscellaneous feature not covered in other categories</ol>
<ol>MiscVal: Dollar value of miscellaneous feature</ol>
<ol>MoSold: Month Sold (MM)</ol>
<ol>YrSold: Year Sold (YYYY)</ol>
<ol>SaleType: Type of sale</ol>
<ol>SaleCondition: Condition of sale</ol></ul>

In [None]:
#Import Required Packages
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

import math
import datetime

from scipy import stats

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler,OrdinalEncoder,LabelEncoder
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score,mean_squared_error

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# set default plot style
sns.set_style('darkgrid')

In [None]:
# import warnings module and set ignore to hide the warning messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# read the csv dataset
df = pd.read_csv('/kaggle/input/housing-prices-dataset/train.csv')

In [None]:
# Preview the data
df.head()

In [None]:
# check the shape of the dataset
df.shape

In [None]:
# view the dataframe info
df.info()

The dataset contains 81 features and 1460 records.

<h1>Data Cleaning</h1>

Let's start data cleaning by dropping the Id column first, as it is only a row identifier and carries no information for our model.

In [None]:
# dropping column Id from dataframe
df.drop('Id',axis=1,inplace=True)

In [None]:
# define a method to check the null percentage of the features
def check_null_percentage(df):
    # percentage of missing values per column, keeping only columns with nulls
    missing_pct = 100 * df.isnull().sum() / df.shape[0]
    missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
    return missing_pct.rename_axis('Columns').to_frame('Missing_Percentage')

In [None]:
# view the null percentage of each feature
check_null_percentage(df)

A missing value here usually means that the house does not have the feature in question. For example, about 5% of the garage attributes are missing because those houses have no garage, which is why 'GarageType', 'GarageFinish', 'GarageQual' and 'GarageCond' all share the same missing percentage. Such values are best filled with "NA": filling with the mode would give them a different meaning (a garage where there is none) and would not be the right approach. So let's proceed by filling this kind of missing value with NA.
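The difference between the two fill strategies can be seen on a tiny made-up frame. The column values below are illustrative only and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# toy data: the second house has no garage, so its garage fields are missing
toy = pd.DataFrame({
    'GarageType': ['Attchd', np.nan, 'Detchd', 'Attchd'],
    'GarageQual': ['TA', np.nan, 'Fa', 'TA'],
})

# explicit "NA" label: the absence of a garage stays visible as its own category
filled_with_label = toy.fillna('NA')

# mode imputation: the house is silently given the most common garage
filled_with_mode = toy.fillna(toy.mode().iloc[0])

print(filled_with_label.loc[1].tolist())
print(filled_with_mode.loc[1].tolist())
```

With the label, the model can still learn a separate effect for houses without a garage; with the mode, that information is lost.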

In [None]:
# list all the null columns
NA_columns = ['Alley','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','FireplaceQu',\
          'GarageType','GarageFinish','GarageQual','GarageCond','PoolQC','Fence','MiscFeature']

# fill the null values with NA
df[NA_columns] = df[NA_columns].fillna('NA')

In [None]:
# check whether any row has more than 5 missing values; such rows could be considered for dropping
df[df.isnull().sum(axis=1) > 5]

In [None]:
# check if there are duplicated rows so if there are any we can drop them
df.duplicated(keep=False).sum()

There are no duplicated rows in the given data. Let's proceed with some further analysis.

To make a good prediction, the categorical features need a reasonably balanced mix of labels. If the same value applies to 90% of the data set, the feature can explain little of the target variable and may produce misleading results when used in our model. So let's find the features that are dominated by a single value.
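As a rough illustration of this check, the sketch below computes the share of the most frequent value per column on toy data and flags columns above 90%. The frame and the names `shares` and `near_constant` are assumptions for the example.

```python
import pandas as pd

# toy data: Street is almost constant, LotShape is reasonably balanced
toy = pd.DataFrame({
    'Street': ['Pave'] * 19 + ['Grvl'],
    'LotShape': ['Reg'] * 12 + ['IR1'] * 8,
})

# percentage of rows taken by the most common value of each column
shares = {
    col: 100 * toy[col].value_counts(normalize=True, dropna=False).iloc[0]
    for col in toy.columns
}

# columns dominated by a single value are candidates for dropping
near_constant = [col for col, share in shares.items() if share > 90]
print(shares, near_constant)
```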

In [None]:
# Create a method that returns a tuple with the most common value of a feature,
# its percentage of all rows and its count
def top_unique_count(x):
    # compute the value counts once and keep only the top entry
    top = x.value_counts(ascending=False,dropna=False).head(1)
    unq_cnt = (top.index.values[0],
               100 * top.values[0]/len(x),
               top.values[0])
    return unq_cnt

In [None]:
# Assign the result to a variable and name the rows once the tuples are expanded into a dataframe
unique_df = df.apply(lambda x: top_unique_count(x)).rename(index={0:"Value",1:'Percentage',2:'Count'})\
    .T.sort_values(by='Count',ascending=False)
unique_df.head(25)

We can see that many features hold the same value in more than 90% of their rows.

In [None]:
# view the null percentage of each feature
check_null_percentage(df)

LotFrontage has some null values that should be handled. From my analysis, LotFrontage is similar within a neighbourhood and within a LotConfig, so we can group by Neighborhood and LotConfig and fill the missing values with the median of each group.

In [None]:
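# fill missing LotFrontage with the median of its (Neighborhood, LotConfig) group;
# groups whose median is NaN stay NaN and are handled in the next step
df['LotFrontage'] = df.groupby(['Neighborhood','LotConfig'])['LotFrontage'].\
                        transform(lambda x: x.fillna(x.median()))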
df['LotFrontage'].isnull().sum()

There are still some null values in LotFrontage. Now let's group by LotConfig alone and fill the remaining nulls with the median LotFrontage of each group, as LotConfig is the more relevant of the two to LotFrontage.
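The two-stage fill can be sketched on a small made-up frame: first the fine-grained group median, then the coarser LotConfig median for whatever is still missing. The data and the names `fine` and `coarse` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'LotConfig':    ['Inside', 'Inside', 'Corner', 'Corner', 'Corner'],
    'LotFrontage':  [60.0, np.nan, np.nan, 80.0, 90.0],
})

# stage 1: median per (Neighborhood, LotConfig); an all-NaN group yields NaN
fine = toy.groupby(['Neighborhood', 'LotConfig'])['LotFrontage'].transform('median')
toy['LotFrontage'] = toy['LotFrontage'].fillna(fine)

# stage 2: fall back to the median per LotConfig for the rows still missing
coarse = toy.groupby('LotConfig')['LotFrontage'].transform('median')
toy['LotFrontage'] = toy['LotFrontage'].fillna(coarse)

print(toy['LotFrontage'].tolist())
```

Using `transform` keeps the result aligned with the original index, so it can be assigned straight back to the column.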

In [None]:
# fill the remaining nulls with the median LotFrontage of each LotConfig group
df['LotFrontage'] = df.groupby(['LotConfig'])['LotFrontage'].transform(lambda x: x.fillna(x.median()))

Let's analyse the garage features to determine how GarageYrBlt can be filled.

In [None]:
df.loc[df.GarageYrBlt.isnull(),['GarageType','GarageCars','GarageArea','GarageFinish','GarageYrBlt','GarageQual','GarageCond']]

Inspecting the garage data, a few homes do not have a garage and hence their garage data is null. For categorical variables we can fill the nulls with NA, and for numerical variables based on a count or measurement we can fill them with 0. The garage year cannot be filled with 0, but we can replace it with the year the house was built. This is an assumption on my part: it will add some correlation between the two columns, but should have comparatively little effect on the target column.

Let's check the YearBuilt column for nulls before we proceed, since the fill only works where YearBuilt is present.
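A minimal sketch of this fill on made-up rows, assuming the same column names as the dataset:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'YearBuilt':   [1995, 1960, 2004],
    'GarageYrBlt': [1995.0, np.nan, 2005.0],
    'GarageArea':  [480, 0, 520],
})

# where the garage year is missing, fall back to the construction year
toy['GarageYrBlt'] = toy['GarageYrBlt'].fillna(toy['YearBuilt'])
print(toy['GarageYrBlt'].tolist())
```

Only the missing entry changes; houses whose garage was built later than the house keep their own garage year.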

In [None]:
df.YearBuilt.isnull().sum()

In [None]:
# replacing null values of GarageYrBlt with YearBuilt
df.loc[df.GarageYrBlt.isnull(),'GarageYrBlt'] = df.loc[df.GarageYrBlt.isnull(),'YearBuilt']

Let's fill the null values of masonry veneer area with 0 and of masonry veneer type with "Not present".

In [None]:
# fill 0 and 'Not present' for the numerical and categorical feature's null values
# (assigning the result back avoids chained in-place modification)
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)
df['MasVnrType'] = df['MasVnrType'].fillna('Not present')

<h2>Creating Derived Features</h2>

In [None]:
# Adding square feet of first floor and second floor
df['TotalFlrSFAbvGrd'] = df[['1stFlrSF','2ndFlrSF']].sum(axis=1)
# Adding all the bathrooms
df['TotalBath'] = df[['BsmtFullBath','BsmtHalfBath','FullBath','HalfBath']].sum(axis=1)
# Adding square feet of all porches and the wood deck
df['TotalPorchSF'] = df[['OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','WoodDeckSF']].sum(axis=1)

In [None]:
check_null_percentage(df)

In [None]:
# get the features that holds more than 90% of its data with a same value
drop_columns = unique_df.query('Percentage > 90.0').index.values
drop_columns

In [None]:
# drop the columns from above analysis
df.drop(columns=drop_columns,inplace=True)
del drop_columns

In [None]:
# create a list of numerical features
numerical_features = list(df.select_dtypes(include=[np.number]).columns.values)

# create a list of features that are categorical
# (np.object was removed from NumPy, so the dtype is given by name)
categorical_features = list(df.select_dtypes(include=['object']).columns.values)

# Create a feature list for time series data
timeseries_features = ['YearBuilt', 'YearRemodAdd', 'YrSold', 'MoSold', 'GarageYrBlt']

In [None]:
# removing time series features from the numerical list to avoid repetition
for col in timeseries_features:
    numerical_features.remove(col) 

In [None]:
# adding numerical features to the categorical list if a feature has 10 or fewer unique values
cat_feature = pd.Series(df[numerical_features].nunique().sort_values(),name='Count').to_frame().query('Count <= 10').index.values
categorical_features.extend(cat_feature)

In [None]:
# removing the features just moved to the categorical list from the numerical list
for col in cat_feature:
    numerical_features.remove(col)

From the data we can observe that TotalBsmtSF and TotRmsAbvGrd sum up the sub-categories of their related features. Similarly, we can calculate a total square footage for the floors and for the porches, and also the total number of bathrooms.

<br>
<h2>Analysing Numerical Variables</h2>

In [None]:
# define subplot columns and rows and the figure size
fig,ax = plt.subplots(math.ceil(len(numerical_features)/3),3,figsize=(15,30),sharey=True)

# initialize the row and column variable

i ,j = 0, 0
for col in sorted(numerical_features):

    # Plot a regression plot for the numerical feature and SalePrice
    # (x and y are passed by keyword, as recent seaborn versions require)
    sns.regplot(x=col,y='SalePrice',data=df,ax=ax[i][j])
    if j == 2:
        j=0
        i +=1
    else:
        j +=1
        
# hide last two grids as it doesn't have any plots to show
ax[6][1].set_visible(False)
ax[6][2].set_visible(False)

<h2>Observation</h2>

Outlier and Correlation:

    1stFlrSF : It has an outlier above 4000 sqft at a low price, which is not a plausible sale
    
    2ndFlrSF : It is not well correlated with SalePrice
    
    BsmtFinSF1 : One record is above 5000 sqft at a low price, which is not plausible
    
    BsmtFinSF2 : An outlier of almost 1500 sqft is priced low, which doesn't look right; the feature is also not correlated with SalePrice
    
    BsmtUnfSF : The feature doesn't influence SalePrice much
    
    EnclosedPorch : An outlier above 500 sqft is priced low
    
    GarageArea : There are a few outliers above 1200 sqft in relation to price
    
    GrLivArea : There are a few outliers with extreme sqft at a low price
    
    LotArea : A few outliers are present above 100000 sqft
    
    LotFrontage : Two outliers sit at the extreme, above 300 linear feet
    
    MasVnrArea : There is one outlier above 1500 sqft
    
    MSSubClass : This feature doesn't provide any information about SalePrice
    
    OpenPorchSF : A few outliers are present above 400 sqft
    
    TotalBsmtSF : An outlier is present at about 6000 sqft


In [None]:
# define the number of outliers to be handled for each feature
feature_outlier_count = {'1stFlrSF':1,
                'BsmtFinSF1':1,
                'BsmtFinSF2':1,
                'EnclosedPorch':2,
                'GarageArea':4,
                'GrLivArea':4,
                'LotArea':4,
                'LotFrontage':2,
                'MasVnrArea':1,
                'OpenPorchSF':3,
                'TotalBsmtSF':4,
                'TotRmsAbvGrd':1,
                'TotalFlrSFAbvGrd':2,
                'TotalPorchSF':1,
                'WoodDeckSF':3}

In [None]:
# define a method to print the outlier to have a visual representation of the data with saleprice
def print_outliers(feature_list):
    for k,v in feature_list.items():
        if v:
            display(df.loc[df[k].isin(sorted(df[k])[-v:]),[k,'SalePrice']])

# returns the outlier highest value or the value specific to index when specified 
def get_outliers(feature,index=-1):
    return df.loc[df[feature] == sorted(df[feature])[index],[feature,'SalePrice']].sort_values(by=feature,ascending=False)

In [None]:
# prints the outlier for each feature
print_outliers(feature_outlier_count)

Let's drop the record at index 1298. The record does not appear to be wrong and may simply be an exceptional sale, but it lies far beyond the rest of the data on several features, so it needs to be handled.

**This record has a large impact on many features. Deleting it is the best option, because it distorts features that are otherwise well correlated with SalePrice.**
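One way to confirm that a single record dominates several features is to look up which row holds the maximum of each feature and count how often each row appears. The sketch below does this on made-up values; the frame and the names `max_rows` and `hits` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    '1stFlrSF':    [900, 1100, 4700, 1000],
    'TotalBsmtSF': [850, 1000, 6100, 950],
    'GrLivArea':   [1500, 1700, 5600, 1600],
})

# index label of the row holding the maximum of each feature
max_rows = toy.idxmax()

# how many features each row tops; a row that tops many is a shared outlier
hits = max_rows.value_counts()
print(hits)
```

Here row 2 holds the maximum of all three features, which is the pattern described above.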

In [None]:
# Get the index of the outlier in feature 1stFlrSF
outlier_index = get_outliers('1stFlrSF').index.values[0]
outlier_index

In [None]:
df.iloc[1298]

In [None]:
# Remove the outlier record by its index value and, if the same record is also
# the top outlier of other features, reduce their count in feature_outlier_count
def remove_outlier_features_count_for_index(outlier_idx):
    for col in feature_outlier_count.keys():
        # use the function argument rather than the global outlier_index
        if (feature_outlier_count[col] > 0) and (outlier_idx in get_outliers(col).index.values):
            feature_outlier_count[col] = feature_outlier_count[col]-1 
    df.drop(outlier_idx,inplace=True)
    df.reset_index(drop=True,inplace=True)

In [None]:
remove_outlier_features_count_for_index(outlier_index)

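Let's fix the remaining outliers by replacing each one with a typical value of the feature among houses in the same SalePrice range. For TotRmsAbvGrd we use the mode of the houses sold at the same price; for the other features we use the mean of the houses closest in price.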
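The replacement idea can be sketched on toy data: find the rows whose SalePrice is closest to the outlier's price and use the mean of their feature values. This is a simplified variant that explicitly excludes the outlier row and uses `nsmallest`; it is not the exact slicing used by the notebook's own helper, and the data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'GrLivArea': [1500, 1600, 1550, 1450, 5600, 1700],
    'SalePrice': [180000, 190000, 185000, 175000, 184000, 200000],
})

# the row with the largest feature value is treated as the outlier
outlier_idx = toy['GrLivArea'].idxmax()

# absolute price difference between every row and the outlier
price_gap = (toy['SalePrice'] - toy.loc[outlier_idx, 'SalePrice']).abs()

# four rows closest in price, leaving out the outlier itself
neighbours = price_gap.drop(outlier_idx).nsmallest(4).index

# replace the outlier with the mean feature value of those neighbours
toy.loc[outlier_idx, 'GrLivArea'] = int(toy.loc[neighbours, 'GrLivArea'].mean())
print(toy.loc[outlier_idx, 'GrLivArea'])
```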

In [None]:
df.loc[df.index[get_outliers('TotRmsAbvGrd').index.values[0]],'TotRmsAbvGrd'] = df.loc[df['SalePrice'] == get_outliers('TotRmsAbvGrd').SalePrice.values[0],'TotRmsAbvGrd'].mode()[0]

feature_outlier_count['TotRmsAbvGrd'] = 0

In [None]:
def fix_outliers(outlier_features_list):
    for k,v in outlier_features_list.items():
        while v > 0:
            # replace the outlier with the mean of the feature over four houses closest to it in SalePrice
            replace_with = df.loc[(df['SalePrice']-get_outliers(k)['SalePrice'].values[0]).abs().argsort()[v:v+4],k].mean()
            if (df[k].dtypes == np.int64) | (df[k].dtypes == np.int32):
                df.loc[df.index[get_outliers(k).index.values[0]],k] = int(replace_with)
            else:
                df.loc[df.index[get_outliers(k).index.values[0]],k] = round(replace_with,1)        
            v = v-1
            feature_outlier_count[k] = v

In [None]:
# pass the dictionary containing all the features with the number of outliers to be fixed
fix_outliers(feature_outlier_count)

Even after fixing the extreme outliers, some outliers remain that are not extreme but look incorrect: for example, a large square footage paired with a really low SalePrice, which is not normal.

Let's have a look at those misleading values.

In [None]:
df[['1stFlrSF','SalePrice']].sort_values(by='1stFlrSF',ascending=False)[:3]

As we can see here, the top two values of 1stFlrSF look inconsistent with their sale prices.

In [None]:
df.loc[df.index[get_outliers('1stFlrSF',-2).index.values[0]],'1stFlrSF'] = df.loc[(df['SalePrice']-get_outliers('1stFlrSF',-2)['SalePrice'].values[0]).abs().argsort()[1:1+4],'1stFlrSF'].mean()

In [None]:
df[['BsmtFinSF1','SalePrice']].sort_values(by='BsmtFinSF1',ascending=False)[:3]

The same applies to BsmtFinSF1, which has one more outlier with a misleading value.

In [None]:
df[['LotArea','SalePrice']].sort_values(by='LotArea',ascending=False)[:7]

Here there are three more outliers: very large lots with sale prices that are too low.

In [None]:
# Assign the number of outliers to be fixed
feature_outlier_count['LotArea']=3
feature_outlier_count['BsmtFinSF1']=1

# call the method to fix the outliers
fix_outliers(feature_outlier_count)

In [None]:
# Delete the variable as it's no longer needed
del feature_outlier_count

Let's check the regression plots of the numerical variables again to see if they look better.

In [None]:
# define subplot columns and rows and the figure size
fig,ax = plt.subplots(math.ceil(len(numerical_features)/3),3,figsize=(15,30),sharey=True)

# initialize the row and column variable

i ,j = 0, 0
for col in sorted(numerical_features):

    # Plot a regression plot for the numerical feature and SalePrice
    # (x and y are passed by keyword, as recent seaborn versions require)
    sns.regplot(x=col,y='SalePrice',data=df,ax=ax[i][j])
    if j == 2:
        j=0
        i +=1
    else:
        j +=1
        
# hide last two grids as it doesn't have any plots to show
ax[6][1].set_visible(False)
ax[6][2].set_visible(False)

Observation:

After fixing the outliers we can see the correlation better now. Let's determine which features are less correlated and drop them.

    BsmtFinSF2 : It has a low correlation and hence can be dropped
    
    BsmtUnfSF : It depicts little correlation with SalePrice and hence can be dropped
    
    EnclosedPorch : It has a low, slightly negative correlation and doesn't provide much information
    
    MSSubClass : This looks like a categorical variable, so let's reassign it to the categorical feature list
    
    

In [None]:
# drop the variables that are not in correlation with sale price
df.drop(['BsmtFinSF2','BsmtUnfSF','EnclosedPorch'],axis=1,inplace=True)
for col in ['BsmtFinSF2','BsmtUnfSF','EnclosedPorch']:
    numerical_features.remove(col)

Since MSSubClass has too many categories, let's reduce them to a few grouped labels.

In [None]:
# change the type to string 
df.MSSubClass = df.MSSubClass.astype(str)

# reducing the number of categories
df.MSSubClass.replace({'20':'1story', '30':'1story', '40':'1story', '45':'1story', '50':'1story', 
                           '60':'2story', '70':'2story', '75':'2story', '80':'nstory',
                           '85':'nstory', '90':'nstory', '120':'1story', '150':'1story',
                           '160':'2story','180':'nstory','190':'nstory'}, inplace=True)

# adding MSSubClass to the categorical feature list
categorical_features.append('MSSubClass')

# removing it from numerical feature list
numerical_features.remove('MSSubClass')

<br>
<h2> Analysis on Categorical Variables</h2>

In [None]:
# define the subplots with col and row count
fig,ax = plt.subplots(math.ceil(len(categorical_features)/3),3,figsize=(20,60),sharey=True)

# initialize the row and column number
i ,j = 0, 0

# add properties to the boxplot style
PROPS = {
    'boxprops':{'facecolor':'none', 'edgecolor':'black','linewidth':0.3},
} 

for col in sorted(categorical_features):
    # plot a boxplot for SalePrice with feature
    # (x and y are passed by keyword, as recent seaborn versions require)
    sns.boxplot(x=col,y='SalePrice',data=df,ax=ax[i][j],showfliers=False,**PROPS)
    
    # plot a stripplot with SalePrice for the feature
    sns.stripplot(x=col,y='SalePrice',data=df,ax=ax[i][j],alpha=0.5)
    
    # rotate the x-ticks if there are many labels
    if df[col].nunique() > 8:
        ax[i][j].tick_params(axis='x',rotation=45)
    if j == 2:
        j=0
        i +=1
    else:
        j +=1

<h2>Observation</h2>

**Combine Categories:**

    BedroomAbvGr : Combine 0 , 5 , 6 and 8
    
    BldgType : Combine 2fmCon ,Twnhs and Duplex  
    
    BsmtCond : Combine No Basement, Fa and Poor
    
    BsmtExposure : Combine Mn and Av
    
    BsmtFinType1 : Combine ALQ, Rec, BLQ and LwQ
    
    BsmtFinType2 : Combine BLQ , Rec and LwQ
    
    BsmtFullBath : Combine 2 and 3
    
    BsmtQual : Combine No Basement and Fa
    
    Condition1 : Combine RRNn and RRAn, PosN and PosA , RRNe and RRAe and Feedr and Artery
    
    Exterior2nd : Combine MetalSd, Wd Shng, HbBoard, Plywood, Wd Sdng , Stucco and combine CBlock, Other , Stone, AsphShn, ImStucc, Brk Cmn, BrkFace
    
    FireplaceQu : Combine No Fireplace, Po and Fa
    
    Foundation: Combine Wood, Slab and Stone
    
    FullBath : Combine 0 and 1
    
    GarageType: Combine Detchd, CarPort, No Garage, Basment and 2Types.
    
    GarageQual : Combine Ex and Gd , Po , Fa and No Garage
    
    HeatingQC : Combine Fa and Po
    
    HouseStyle : Combine 2Story and 2.5Fin, SFoyer and 1.5Fin, SLvl and 1Story, 1.5Unf and 2.5Unf
    
    LotShape: Combine IR2 and IR3
    
    MSZoning : Combine RM and RH to other
    
    MasVnrType: Combine None, Not present and BrkCmn
    
    Neighborhood : combine MeadowV , BrDale and IDOTRR , Sawyer , NAmes , NPkVill , Mitchel , SWISU and Blueste , Gilbert , Blmngtn , SawyerW and NWAmes, ClearCr , CollgCr and Crawfor, Veenker, Timber and Somerst , OldTown , Edwards and BrkSide , StoneBr , NridgHt and NoRidge.
    
    OverallCond : 1, 2 and 3 , 6, 7, and 8
    
    OverallQual : 1 and 2
    
    SaleCondition: Combine AdjLand, Alloca, Family and Abnorml
    
    SaleType: Combine COD, ConLD, ConLI, CwD, ConLw, Con and Oth.
    

**Columns to Drop:**
    
    ExterCond : drop this column as the mean is the same for TA and Gd and the other values are too few for prediction
    
    Exterior1st : The data is spread across the whole price range, so the correlation will be low and might not be helpful in prediction
    
    Fence : The mean is almost the same for all types of fence, so let's drop it
    
    LotConfig : The means of all labels are in the same range
    
    RoofStyle : The two categories holding most of the data points have the same mean
    

**Highly Correlated Features:**
    
    Fireplaces, GarageCars, HeatingQC, KitchenQual
    

<h3>Handling Nominal Categories</h3>

In [None]:
# Combine categories of the nominal categorical variables only;
# ordinal categories need to be factorized later, so they are left as they are


# df.BedroomAbvGr = df.BedroomAbvGr.astype(str)
# df.BedroomAbvGr.replace({'0':'5','6':'5','8':'5'},inplace=True)
df.BldgType.replace({'2fmCon':'Twnhs','Duplex':'Twnhs'},inplace=True)
# df.BsmtCond.replace({'No Basement':'Fa','Poor':'Fa'},inplace=True)
df.BsmtExposure.replace({'Mn':'Av'},inplace=True)
# df.BsmtFinType1.replace({'ALQ':'Rec', 'BLQ':'Rec','LwQ':'Rec'},inplace=True)
# df.BsmtFinType2.replace({'BLQ':'LwQ' , 'Rec':'LwQ' },inplace=True)
# df.BsmtFullBath = df.BsmtFullBath.astype(str)
# df.BsmtFullBath.replace({'3':'2'},inplace=True)
# df.BsmtQual.replace({'No Basement' : 'Fa'},inplace=True)
df.Condition1.replace({'RRNn' : 'RRAn', 'PosN' : 'PosA' , 'RRNe' : 'RRAe' , 'Feedr' : 'Artery'},inplace=True)
df.Exterior2nd.replace({'MetalSd':'Wd Sdng', 'Wd Shng':'Wd Sdng', 'HbBoard':'Wd Sdng','Plywood':'Wd Sdng',\
                        'Stucco':'Wd Sdng' , 'CBlock': 'BrkFace','Other': 'BrkFace' , 'Stone': 'BrkFace',\
                        'AsphShn': 'BrkFace', 'ImStucc': 'BrkFace', 'Brk Cmn': 'BrkFace'},inplace=True)
# df.FireplaceQu.replace({'Po':'No Fireplace', 'Fa':'No Fireplace'},inplace=True)
df.Foundation.replace({'Wood':'Stone','Slab':'Stone'},inplace=True)
# df.FullBath = df.FullBath.astype(str)
# df.FullBath.replace({'0':'1'},inplace=True)
# houses without a garage were filled with 'NA' earlier, so that is the label to merge
df.GarageType.replace({'CarPort':'Detchd', 'NA':'Detchd', 'Basment':'Detchd' , '2Types':'Detchd'},inplace=True)
# df.GarageQual.replace({'Ex':'NA', 'Gd':'NA' , 'Po':'NA' , 'Fa':'NA' },inplace=True)
# df.HeatingQC.replace({'Po':'Fa'},inplace=True)
# df.HouseStyle.replace({'2.5Fin':'2Story', '1.5Fin':'SFoyer', 'SLvl':'1Story', '1.5Unf': '2.5Unf'},inplace=True)
df.LotShape.replace({'IR3':'IR2'},inplace=True)
df.MSZoning.replace({'RH':'RM'},inplace=True)
df.MasVnrType.replace({'None':'BrkCmn', 'Not present':'BrkCmn'},inplace=True)
df.Neighborhood.replace({'BrDale':'MeadowV' , 'IDOTRR':'MeadowV' ,\
                         'NAmes':'Sawyer' , 'NPkVill':'Sawyer' , 'Mitchel':'Sawyer' , 'SWISU':'Sawyer', 'Blueste':'Sawyer' ,\
                         'Blmngtn':'Gilbert' , 'SawyerW':'Gilbert', 'NWAmes':'Gilbert',\
                         'ClearCr':'Crawfor' , 'CollgCr' :'Crawfor',\
                         'Timber':'Veenker', 'Somerst':'Veenker' ,\
                         'Edwards':'OldTown', 'BrkSide':'OldTown' ,\
                         'StoneBr' : 'NridgHt' , 'NoRidge': 'NridgHt'},inplace=True)
# df.OverallCond = df.OverallCond.astype(str)
# df.OverallCond.replace({'2': '3','1':'3', '6': '7', '8':'7'},inplace=True)
# df.OverallQual = df.OverallQual.astype(str)
# df.OverallQual.replace({'1':'2'},inplace=True)
df.SaleCondition.replace({'AdjLand':'Abnorml', 'Alloca':'Abnorml', 'Family' :'Abnorml'},inplace=True)
df.SaleType.replace({'ConLD':'COD', 'ConLI':'COD', 'CwD':'COD', 'ConLw':'COD', 'Con':'COD', 'Oth':'COD'},inplace=True)

Dropping the categorical features that are less correlated with SalePrice, i.e. those with roughly the same mean across all their labels.
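The visual judgement from the box plots can be backed by a simple number: the spread between the highest and lowest per-label median of SalePrice. The sketch below uses medians (the line shown in a box plot) rather than means, on made-up data; `median_spread` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'Fence':       ['NA', 'MnPrv', 'NA', 'MnPrv', 'NA', 'MnPrv'],
    'KitchenQual': ['TA', 'TA', 'Gd', 'Gd', 'Ex', 'Ex'],
    'SalePrice':   [120000, 125000, 180000, 185000, 300000, 305000],
})

def median_spread(frame, feature, target='SalePrice'):
    """Difference between the largest and smallest per-label median of the target."""
    medians = frame.groupby(feature)[target].median()
    return float(medians.max() - medians.min())

# a small spread means the labels barely separate the prices
spreads = {col: median_spread(toy, col) for col in ['Fence', 'KitchenQual']}
print(spreads)
```

In this toy example Fence hardly separates the prices while KitchenQual does, so Fence would be the candidate to drop.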

In [None]:
# add columns to drop
drop_columns = ['ExterCond', 'Fence', 'LotConfig' ,'RoofStyle' ,'Exterior1st']

# drop the selected features
df.drop(columns=drop_columns,inplace=True)

# remove the dropped columns from categorical feature list
for cat in drop_columns[:]:
    categorical_features.remove(cat)

<br>
<h2> Time Series Analysis</h2>

In [None]:
# list down the variables for time series
timeseries_features

In [None]:
df[timeseries_features].info()

In [None]:
# change the types to integer
df.YrSold = df.YrSold.astype(int)
df.GarageYrBlt = df.GarageYrBlt.astype(int)

In [None]:
# create a derived column dateSold by combining month sold and year sold
df['dateSold'] = df['MoSold'].astype(str)+'-1-'+df['YrSold'].astype(str)
df['dateSold'] =pd.to_datetime(df['dateSold'])

# add the new column to timeseries list
timeseries_features.append('dateSold')

In [None]:
# preview the new column
df['dateSold'].head()

In [None]:
df.loc[df.GarageYrBlt < 1900,['GarageYrBlt','YearBuilt']]

In [None]:
# define the subplots with the number of rows and columns and the figure size accordingly
fig,ax = plt.subplots(math.ceil(len(timeseries_features)/2),2,figsize=(15,15),sharey=True)

# initialize the row and column index
i ,j = 0, 0
for col in sorted(timeseries_features):
    if col == 'GarageYrBlt':
        # create a line plot for GarageYrBlt with SalePrice for years from 1880
        # onwards, as only two records fall before that year; the same filtered
        # rows are used for both axes so x and y stay aligned
        sns.lineplot(x=col,y='SalePrice',data=df.loc[df[col] >= 1880],ax=ax[i][j])
    else:
        # create a line plot for the time data with SalePrice
        sns.lineplot(x=col,y='SalePrice',data=df,ax=ax[i][j])
    
    # if the x-ticks are more rotate the labels
    if df[col].nunique() > 8:
        ax[i][j].tick_params(axis='x',rotation=45)
    if col == "YrSold":
        ax[i][j].xaxis.set_ticks([2006,2007,2008,2009,2010])
    if j == 1:
        j=0
        i +=1
    else:
        j +=1

In [None]:
# move SalePrice to the end for ease of visualisation in the heat map
df_dummy = df.pop('SalePrice')
df.insert(df.shape[1],'SalePrice',df_dummy)
del df_dummy

In [None]:
# set figure size
plt.figure(figsize=(15,15))

# plot correlation heatmap
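# restrict the correlation to numeric columns (required by recent pandas versions)
sns.heatmap(df.corr(numeric_only=True),annot=True);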

In [None]:
# drop MoSold and YrSold as they are almost zero correlated
df.drop(['MoSold','YrSold'],axis=1,inplace=True)
for col in ['MoSold','YrSold']:
    timeseries_features.remove(col)

<br>
<h2> Encoding Category Labels </h2>

In [None]:
# reset these categorical numerical variables to integer 
df[['HalfBath','Fireplaces','FullBath','BsmtFullBath','GarageCars','BedroomAbvGr','OverallCond','OverallQual']] = df[['HalfBath','Fireplaces','FullBath','BsmtFullBath','GarageCars','BedroomAbvGr','OverallCond','OverallQual']].astype(int)

In [None]:
# assign the categorical columns that are non integer to categorical_columns as a list
categorical_columns =['ExterQual','BsmtQual','BsmtCond','HeatingQC','KitchenQual','FireplaceQu','GarageQual','HouseStyle','BsmtFinType2','BsmtFinType1','GarageFinish']

<h3> Manual Ordinal Data Encoding</h3>

Convert the object-typed ordinal features to the Categorical data type for encoding.

In [None]:
# list the labels from lowest to highest when creating each ordered categorical feature
# Converting normal Object features to Categorical Data Type features
df['ExterQual']=pd.Categorical(df['ExterQual'],ordered=True,categories=['Fa','TA','Gd','Ex'])
df['BsmtQual']=pd.Categorical(df['BsmtQual'],ordered=True,categories=['NA','Fa','TA','Gd','Ex'])
df['BsmtCond']=pd.Categorical(df['BsmtCond'],ordered=True,categories=['NA','Po','Fa','TA','Gd'])
df['HeatingQC']=pd.Categorical(df['HeatingQC'],ordered=True,categories=['Po','Fa','TA','Gd','Ex'])
df['KitchenQual']=pd.Categorical(df['KitchenQual'],ordered=True,categories=['Fa','TA','Gd','Ex'])
df['FireplaceQu']=pd.Categorical(df['FireplaceQu'],ordered=True,categories=['NA','Po','Fa','TA','Gd','Ex'])
df['GarageQual']=pd.Categorical(df['GarageQual'],ordered=True,categories=['NA','Po','Fa','TA','Gd','Ex'])

In [None]:
df['GarageFinish'] = pd.Categorical(df['GarageFinish'],ordered=True,categories=['NA','Unf','RFn','Fin'])

In [None]:
df['BsmtFinType1']=pd.Categorical(df['BsmtFinType1'],ordered=True,categories=['NA','Unf','LwQ','Rec','BLQ','ALQ','GLQ'])
df['BsmtFinType2']=pd.Categorical(df['BsmtFinType2'],ordered=True,categories=['NA','Unf','LwQ','Rec','BLQ','ALQ','GLQ'])

In [None]:
df['HouseStyle']=pd.Categorical(df['HouseStyle'],ordered=True,categories=[ 'SFoyer','1.5Unf','1Story','1.5Fin','SLvl','2.5Unf','2Story','2.5Fin'])

**Explanation:**
<pre>Now that we have converted all these ordinal variables to Categorical data type features with an ordered relation, let's convert the labels to integers.

The integers are assigned based on the order: the lowest category is assigned zero and the highest category is assigned the last position, counting from zero.

Example: For **FireplaceQu**

['NA' < 'Po' < 'Fa' < 'TA' < 'Gd' < 'Ex'] : Categorical Data Type

[  0  <   1  <  2   <  3   <  4   <  5  ] : Integer Representation</pre>
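
The mapping above can be reproduced on a small sample. This is a minimal sketch with made-up quality labels (`quality` is a hypothetical series, not a column from the data set); it reads the integer position of each label from the ordered categorical's `codes`.

```python
import pandas as pd

# category order from lowest to highest, as in the FireplaceQu example
order = ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex']

# made-up sample of labels
quality = pd.Series(['TA', 'Gd', 'NA', 'Ex', 'Po'])

# ordered categorical: each label maps to its position in `order`
as_cat = pd.Categorical(quality, categories=order, ordered=True)
codes = pd.Series(as_cat.codes, index=quality.index)

print(dict(zip(quality, codes)))
```

A label that is not listed in `order` would receive the code -1, so it is worth checking for negative codes after encoding.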

In [None]:
# convert the ordered categories to their integer codes
# (position in the category order; values outside the listed categories become -1)
for col in categorical_columns:
    df[col] = df[col].cat.codes

In [None]:
df.info()

In [None]:
# reassign the categorical features 
categorical_features = list(df.select_dtypes(include=['object']).columns.values)

In [None]:
df.shape

We have finally brought the data down to 53 columns, but a few categorical features still need to be converted to dummy variables.

<br>
<h2> Dummy Variable Creation </h2>

Let's create dummy variables for the remaining categorical features.

Because the labels in these categorical variables have been reduced to a smaller set, fewer dummy features will be generated.

In [None]:
# print the shape of categorical columns and the number of created dummy columns
pd.get_dummies(df[categorical_features],drop_first=True).shape,len(categorical_features)

From the column count above, 29 new columns will be added to our final DataFrame.
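
To see where such a count comes from, here is a minimal sketch on a made-up two-column frame (`toy` and its values are illustrative only). With `drop_first=True`, a feature with k distinct labels contributes k - 1 dummy columns, because the dropped label is implied when all the others are zero.

```python
import pandas as pd

# toy frame: Street has 2 labels, LotShape has 3 labels
toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                    'LotShape': ['Reg', 'IR1', 'IR2']})

# drop_first=True removes the first (alphabetically smallest) label of each feature
dummies = pd.get_dummies(toy, drop_first=True)

# expected number of dummy columns: sum of (k - 1) over the features
expected = sum(toy[c].nunique() - 1 for c in toy.columns)

print(dummies.columns.tolist(), expected)
```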

In [None]:
# create dummy variables for the categorical features
house_price = pd.concat([df,pd.get_dummies(df[categorical_features],drop_first=True)],axis=1)

In [None]:
# drop the original categorical columns from the dataframe
house_price.drop(columns=categorical_features,inplace=True)

In [None]:
house_price

In [None]:
# reset index for the new dataframe
house_price.reset_index(drop=True,inplace=True)

<h3> Handling DATE Object </h3>

We have one more feature to clean up: our date column. For the model to use it, the date must be converted to a number, and a natural choice is the Unix timestamp.

In [None]:
# preview the dateSold feature
house_price.dateSold.head()

In [None]:
# attach a time of day (HH:MM:SS) to each date before converting it to a Unix timestamp

# let's create a constant time
tm = datetime.time(10,10)

# convert dateSold to a Unix timestamp
house_price.dateSold = house_price.dateSold.apply(lambda x: datetime.datetime.combine(x, tm).timestamp())
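
A vectorised alternative is sketched below on two made-up dates. It treats the naive dates as UTC midnight, so its numbers differ from the cell above, which attaches 10:10 in the machine's local time zone. The name `unix_seconds` is illustrative.

```python
import pandas as pd

# made-up first-of-month sale dates
dates = pd.Series(pd.to_datetime(['2008-02-01', '2010-06-01']))

# whole seconds elapsed since 1970-01-01 00:00:00, treating the dates as UTC
unix_seconds = (dates - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

print(unix_seconds.tolist())
```

Subtracting the epoch and dividing by one second avoids a Python-level `apply` and does not depend on the local time zone.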

In [None]:
house_price.dateSold.head()

In [None]:
# reassigning all the numerical features to the numerical_features variable as a list
numerical_features = list(df.select_dtypes(include=[np.number]).columns.values)

<h1> Data Preparation </h1>
    
<h3> Our data has been cleaned and now has 81 independent variables and 1 target variable</h3>

Split the independent features and target feature to x and y respectively

In [None]:
house_price.shape

In [None]:
# extract the target feature from the DataFrame
y = house_price.pop('SalePrice')
# assign the independent variables to x
X = house_price

# remove the target feature from numerical features before we perform scaling
numerical_features.remove('SalePrice')

Let's split the data into train and test sets with the sklearn library.

Let's define the test size as 30% and the train size as 70%.

In [None]:
# split the data to test and train
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.7,test_size=0.3,random_state=0)

<h2> Feature Scaling</h2>

In [None]:
# create a StandardScaler object
scaler = StandardScaler()

# Fit and transform our train data with Standard Scaler
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])

# transform our test data with the same scaler object
X_test[numerical_features] = scaler.transform(X_test[numerical_features])
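
The reason for fitting on the train split only can be seen in a NumPy-only sketch of the same arithmetic (synthetic data; all names here are illustrative). The mean and standard deviation come from the train split and are reused unchanged on the test split, so no information from the test data leaks into the scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50, scale=10, size=(70, 2))   # synthetic "train" features
test = rng.normal(loc=50, scale=10, size=(30, 2))    # synthetic "test" features

# statistics are learned from the train split only
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the train statistics on the test split

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

The scaled train columns have mean 0 and standard deviation 1 by construction, while the scaled test columns are only close to that.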

In [None]:
X_train.head()

In [None]:
y_train.head()

<br>
<h1> Model Building </h1>

<h2>Linear Regression</h2>

Let's first build a Linear Regression model with all the features we have, to check whether it overfits.

In [None]:
# create a linear regressor
lr = LinearRegression()

# fit the model on the train predictors and target
lr.fit(X_train,y_train)

# predict values for the train data set
y_train_pred = lr.predict(X_train)

print('\nIntercept:', lr.intercept_)
print('Coefficients:', lr.coef_)
print('\n\nTrain results \n')
print('Mean squared error (MSE): {:.2f}'.format(mean_squared_error(y_train, y_train_pred)))
print('Coefficient of Determination (R2): {:.2f}'.format(r2_score(y_train, y_train_pred)))
print('Residual Sum of Squares (RSS): {:.2f}'.format(np.sum(np.square(y_train - y_train_pred))))

# Predicting values for our test data
y_test_pred = lr.predict(X_test)
print('\n\nTest results \n')
print('Test Mean squared error (MSE): {:.2f}'.format(mean_squared_error(y_test, y_test_pred)))
print('Test Coefficient of Determination (R2): {:.2f}'.format(r2_score(y_test, y_test_pred)))
print('Test Residual Sum of Squares (RSS): {:.2f} \n'.format(np.sum(np.square(y_test - y_test_pred))))

Lets calculate and store the outputs in a metric variable

In [None]:
# Lets calculate some metrics such as R2 score, RSS and RMSE

# initialize a list
metric = []

# calculate the R2 score for the training data
metric.append(r2_score(y_train, y_train_pred))

# calculate the R2 score for the test data
metric.append(r2_score(y_test, y_test_pred))

# calculate the RSS for the training data
metric.append(np.sum(np.square(y_train - y_train_pred)))

# calculate the RSS for the test data
metric.append(np.sum(np.square(y_test - y_test_pred)))

# calculate the RMSE (square root of the MSE) for the training data
metric.append(mean_squared_error(y_train, y_train_pred)**0.5)

# calculate the RMSE (square root of the MSE) for the test data
metric.append(mean_squared_error(y_test, y_test_pred)**0.5)

# add number of features for the model
metric.append(len(X_train.columns))

# add the alpha value if present
metric.append(0)
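
Since the same set of metrics is collected for every model, the calculation could be wrapped in a small helper. The sketch below is one possible shape for it: `regression_metrics` is a hypothetical name, it computes only one split at a time, and the three-point example uses made-up numbers.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features, alpha=0):
    """Hypothetical helper: R2, RSS and RMSE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = float(np.sum((y_true - y_pred) ** 2))          # residual sum of squares
    tss = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    return {
        'R2': 1 - rss / tss,
        'RSS': rss,
        'RMSE': (rss / len(y_true)) ** 0.5,
        'No of Features': n_features,
        'Alpha': alpha,
    }

# made-up predictions for three houses
example = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 7.5], n_features=1)
print(example)
```

Calling the helper once for the train predictions and once for the test predictions would give the same numbers as the cell above, without repeating the formulas for each model.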

<h3>Observation:</h3>

    Our model has a high R2 score on the train data and a much lower R2 score on the test data. Hence it clearly **overfits**.
    
    There are several ways to overcome this problem:
        - reduce the number of features using RFE to make the model simpler (less complex)
        - use Ridge regression, which shrinks the coefficient values, and tune the model with its hyperparameter (alpha)
        - use Lasso regression, which shrinks some coefficients all the way to zero and so rejects those features; with its hyperparameter this helps in selecting the features for our model

<H2> Feature Selection</H2>
<h3> Analysing the Correlated Features </h3>

Before we use RFE, let's analyse the top correlated features.

In [None]:
# combine the train predictors and the target so that correlations can be computed
correlation_df = pd.concat([X_train,y_train],axis=1)

In [None]:
# Creating a method to get the top correlated features
def get_top_corr_features():
    # correlation of every feature with SalePrice, highest first
    corr_cols = correlation_df.corr().loc[:,'SalePrice'].sort_values(ascending=False)
    corr_cols = corr_cols.reset_index()
    # order the data with positive corr first and negative corr last
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    corr_cols = pd.concat([corr_cols[corr_cols.SalePrice>0],corr_cols[corr_cols.SalePrice<0]])
    return corr_cols
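
The ranking idea can be tried on a tiny frame. In this minimal sketch the values are made up and `HouseAge` is a hypothetical column; `corrwith` computes the correlation of each predictor with the target directly, which is an alternative to slicing the full correlation matrix.

```python
import pandas as pd

# made-up data: price rises with living area and falls with age
toy = pd.DataFrame({
    'GrLivArea': [1000, 1500, 2000, 2500, 3000],
    'HouseAge':  [40, 30, 35, 10, 5],
    'SalePrice': [100, 150, 190, 260, 300],
})

# correlation of every predictor with the target, most positive first
corr = (toy.drop(columns='SalePrice')
           .corrwith(toy['SalePrice'])
           .sort_values(ascending=False))

print(corr)
```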

In [None]:
# take the 30 most positively correlated features, then the 25 most negatively correlated
# ones, and lastly the remaining features; the slice drops SalePrice itself from the top
top_corr = get_top_corr_features()
corr_cols = pd.concat([top_corr.head(30),top_corr.tail(25).sort_values('SalePrice',ascending=True),top_corr.iloc[30:57]]).iloc[1:].reset_index(drop=True)
# rename columns
corr_cols.columns = ['Corr Feature','SalePrice Corr']
# preview the result
corr_cols[corr_cols['Corr Feature'].isin(['2ndFlrSF',
 'Condition1_PosA',
 'Exterior2nd_BrkFace',
 'Exterior2nd_CmentBd',
 'Exterior2nd_VinylSd',
 'LotShape_IR2',
 'MSZoning_FV',
 'SaleCondition_Partial',
 'SaleType_CWD'])]

<h2>RFE</h2>

In [None]:
# Reorder the columns by correlation so that, when the model is built by adding features one by one, the noise features are added only at the end
X_train = X_train[corr_cols['Corr Feature']]
X_test = X_test[corr_cols['Corr Feature']]

Reducing the number of features to 50 using RFE

In [None]:
# create a linear model for RFE
lm_rfe = LinearRegression()

# fit the linear model on the train data
lm_rfe.fit(X_train,y_train)

# create RFE for our linear regressor and reduce to 50 features
# (n_features_to_select must be passed by keyword in recent scikit-learn releases)
rfe = RFE(lm_rfe,n_features_to_select=50)

# fit RFE on the train data
rfe = rfe.fit(X_train,y_train)

RFE assigns each feature a rank: features ranked 1 are selected, and those with a rank greater than 1 can be dropped. The Selected column shows True for features to retain and False for features to reject.
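
How `support_` and `ranking_` behave is easiest to see on synthetic data. In the sketch below (made-up data, not the housing set) only the first three of six features influence the target, and RFE is asked to keep three.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# only columns 0, 1 and 2 carry signal; columns 3 to 5 are pure noise
y = 5 * X[:, 0] - 4 * X[:, 1] + 3 * X[:, 2] + rng.normal(scale=0.1, size=200)

selector = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)

print(selector.support_)   # True for the retained features
print(selector.ranking_)   # 1 for retained features, larger numbers were dropped earlier
```

Features are eliminated one at a time by smallest absolute coefficient, so a higher rank means the feature was discarded earlier.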

In [None]:
# print the rank of each features
pd.DataFrame(zip(X_train.columns,rfe.support_,rfe.ranking_),columns=['Feature','Selected','Rank']).sort_values('Rank')

In [None]:
# lets extract the top 50 selected columns by RFE
rfe_selected_columns = X_train.columns[rfe.support_]

In [None]:
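# top correlated feature list: the 25 most positive rows (the slice drops SalePrice itself)
# followed by the 25 most negative
top_corr = get_top_corr_features()
corr_selected_columns = pd.concat([top_corr.head(25),top_corr.tail(25).sort_values('SalePrice',ascending=True)]).iloc[1:].reset_index(drop=True)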
# rename the columns
corr_selected_columns.columns = ['Corr_feature','SalePrice_Corr']

In [None]:
# Sort the DataFrame by feature name
corr_selected_columns = corr_selected_columns.sort_values(by='Corr_feature').reset_index(drop=True)
# Sort the Series by feature name
rfe_selected_columns = pd.Series(sorted(rfe_selected_columns),name='RFE')
# inner join the two to get the features common to both selections
corr_rfe_features = pd.merge(left=rfe_selected_columns,right=corr_selected_columns,how='inner',\
                             left_on='RFE',right_on='Corr_feature').sort_values(by='SalePrice_Corr',ascending=False).reset_index(drop=True)

In [None]:
corr_rfe_features

<h3>OLS Model</h3>

In [None]:
# this function can be reused to build an OLS model for the given features
def build_model(X_train_rfe):
    # adding a constant variable for intercept
    X_train_rfe = sm.add_constant(X_train_rfe)

    # Initialize an OLS model for our dataset and fit the data to model
    lm = sm.OLS(y_train,X_train_rfe).fit()

    # view the summary of the model for selected features
    print(lm.summary())

    return lm

<h3> VIF Analysis </h3>

The VIF score should be below 5 for an ideal model.
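
For reference, the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on the other features. The NumPy sketch below illustrates the formula on synthetic data; `vif_manual` is a hypothetical helper that includes an intercept in each auxiliary regression, so its numbers may differ from those of statsmodels' `variance_inflation_factor` when that function is given a matrix without a constant column.

```python
import numpy as np

def vif_manual(X):
    """Hypothetical helper: VIF_i = 1 / (1 - R_i^2) for every column of X."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for i in range(X.shape[1]):
        target = X[:, i]
        # regress column i on an intercept plus all the other columns
        others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1 - residual.var() / target.var()
        vifs.append(1 / (1 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)               # independent of the others
x3 = x1 + 0.1 * rng.normal(size=500)    # nearly a copy of x1

vifs = vif_manual(np.column_stack([x1, x2, x3]))
print(vifs.round(2))
```

The two nearly collinear columns get a VIF far above 5, while the independent column stays close to 1.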

In [None]:
def VIF(X_train_rfe):
    # create a dummy dataframe
    vif = pd.DataFrame(columns=['Features','VIF'])
    
    # extract the column values to vif features column value
    vif['Features'] = X_train_rfe.columns

    # calculate vif for the train data for the added features
    vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])]
    
    # round the value to 2 decimals
    vif['VIF'] = round(vif['VIF'], 2)
    
    # sort with the highest VIF value first
    vif = vif.sort_values(by = "VIF", ascending = False)
    
    # print vif table
    display(vif)
    
    # return the vif table
    return vif

There are two ways to build a model that finds the best-fitting subset of the features selected by RFE:
<ul><li>Dropping features one by one from the model built with 50 features until it shows good performance without overfitting</li><li>Adding features one by one to the model until it shows good performance metrics</li></ul>

Let's add features one by one and build our model.

Let's create custom logic based on the conditions below.

<table>
<tr><th style="text-align:center">Order</th><th style="text-align:center">P-value</th><th style="text-align:center">VIF</th><th style="text-align:center">Action</th></tr>
<tr><td>1</td><td>High</td><td>High</td><td>Drop these columns First</td></tr>
<tr><td>2</td><td>High</td><td>Low</td><td>Drop these columns one by one, because this could lower the VIF values of other columns and so prevent them from being dropped in the next step</td></tr>
<tr><td>3</td><td>Low</td><td>High</td><td>Drop the columns with VIF greater than 5</td></tr>
<tr><td>4</td><td>Low</td><td>Low</td><td>Keep these features</td></tr>
</table>	
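
The table can be read as a small decision rule. The sketch below is only an illustration of that rule: `selection_action` and its default thresholds (0.05 for the p-value, 5 for the VIF) are assumptions taken from the values used elsewhere in this notebook.

```python
def selection_action(p_value, vif, p_limit=0.05, vif_limit=5.0):
    """Hypothetical helper: map a feature's p-value and VIF to the table's order and action."""
    high_p = p_value > p_limit
    high_vif = vif > vif_limit
    if high_p and high_vif:
        return 1, 'drop first'
    if high_p:
        return 2, 'drop one by one'
    if high_vif:
        return 3, 'drop (VIF above limit)'
    return 4, 'keep'

# made-up (p-value, VIF) pairs covering the four rows of the table
for p, v in [(0.30, 12.0), (0.30, 1.5), (0.01, 12.0), (0.01, 1.5)]:
    print(p, v, selection_action(p, v))
```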

In [None]:
def perform_feature_selection(train_data,rfe=False,corr=False):
    # create empty data frames for the selected train features and the VIF table
    X_train_rfe = pd.DataFrame()
    vif = pd.DataFrame()

    # counter used to skip the VIF check while only a single feature is present
    count = 1

    # flag used to stop the outer loop from adding further features to the model
    stop = False

    # R2 score of the previous model
    r2score = 0.0

    # fetch the features based on corr/rfe selection
    if rfe:
        cols = rfe_selected_columns
    elif corr:
        cols = corr_selected_columns.Corr_feature
    else: 
        cols = corr_rfe_features.RFE.values
        
    for col in cols:

        # add the column to the training data set
        if col in train_data.columns.values:
            X_train_rfe[col] = train_data[col]

            # rebuild the model to check for high VIFs and p-values each
            # time a feature is dropped under the conditions above, after
            # the new feature has been added to the model
            while True:
                # build the model
                lm = build_model(X_train_rfe)

                # Drop the most recently added column if the R2 score does not change
                if round(r2score,3) == round(lm.rsquared,3):

                    print("\n\n Dropping "+X_train_rfe.columns.values[-1]+" and rebuilding the model as it did not add any info to model \n\n")

                    X_train_rfe.drop(X_train_rfe.columns.values[-1],axis=1, inplace=True)
                    
                    # build the model again as we have removed a feature 
                    lm = build_model(X_train_rfe)

                # Assign new r2score to check for the next build on adding new feature
                r2score = lm.rsquared

                # ignore vif and p-value check since there will
                # be only 1 column on first iteration
                if count != 1:

                    # calculate VIF
                    vif = VIF(X_train_rfe)

                    # if the model reaches the required R2 score, stop adding further features
                    if lm.rsquared >= 0.90:
                        stop = True
                        break

                    # check whether any p-value is high
                    if (lm.pvalues > 0.05).sum() > 0:

                        # extract the features with a high p-value
                        feature = lm.pvalues[lm.pvalues > 0.05].index

                        # check that this feature is not const
                        if feature[0] != 'const':

                            # if the VIF value is also high, drop this column first
                            # (.values is needed: `in` on a Series tests the index, not the values)
                            if feature[0] in vif.loc[vif.VIF > 5,'Features'].values:
                                X_train_rfe.drop(feature[0],axis=1,inplace=True)                # order 1
                            else:
                                # if only the p-value is high, drop it
                                X_train_rfe.drop(feature[0],axis=1,inplace=True)                # order 2

                        # if const is first in the list, drop the second feature
                        # instead; a third feature with a high p-value will be
                        # checked in the next pass, after the model is rebuilt
                        # without the current feature
                        elif feature[0] == 'const' and len(feature) > 1:
                            X_train_rfe.drop(feature[1],axis=1,inplace=True)                    # order 2

                    # if any VIF value is high, drop the newly added column
                    if (vif.VIF > 5).sum() > 0 and col in X_train_rfe.columns.values:
                        X_train_rfe.drop(col,axis=1,inplace=True)   # order 3
                    else:
                        break                                                                   # order 4
                else:
                    break
            # stop the process
            if stop:
                break

            # increment count on adding new feature
            count = count + 1
            
    return X_train_rfe

In [None]:
X_train_rfe = perform_feature_selection(X_train,corr=True)

In [None]:
# keep only the selected features in the test data
X_test_rfe = X_test[X_train_rfe.columns.values]
X_test_rfe.shape,y_test.shape

<h2>Linear Regression With RFE</h2>

In [None]:
# Let's calculate some metrics such as R2 score, RSS and RMSE

lr_rfe = LinearRegression()

lr_rfe.fit(X_train_rfe,y_train)

y_train_pred = lr_rfe.predict(X_train_rfe)
y_test_pred = lr_rfe.predict(X_test_rfe)

# initialize a list
metric1 = []

# calculate the R2 score for the training data
metric1.append(r2_score(y_train, y_train_pred))

# calculate the R2 score for the test data
metric1.append(r2_score(y_test, y_test_pred))

# calculate the RSS for the training data
metric1.append(np.sum(np.square(y_train - y_train_pred)))

# calculate the RSS for the test data
metric1.append(np.sum(np.square(y_test - y_test_pred)))

# calculate the RMSE (square root of the MSE) for the training data
metric1.append(mean_squared_error(y_train, y_train_pred)**0.5)

# calculate the RMSE (square root of the MSE) for the test data
metric1.append(mean_squared_error(y_test, y_test_pred)**0.5)

# add number of features for the model
metric1.append(len(X_train_rfe.columns))

# add the alpha value if present
metric1.append(0)

In [None]:
# Residual analysis
y_test_res = y_test - y_test_pred

In [None]:
# calculate the residual and plot the results
res = y_test_res
plt.scatter( y_test_pred , res)
plt.axhline(y=0, color='r', linestyle=':')
plt.xlabel("Predictions")
plt.ylabel("Residual")
plt.show()

In [None]:
# Distribution of errors
# (histplot is used because distplot is deprecated in recent seaborn releases)
sns.histplot(y_test_res,kde=True)

plt.title('Normality of error terms/residuals')
plt.xlabel("Residuals")
plt.show()

<h2> Ridge Regression </h2>

In [None]:
# list of alphas to tune: if alpha is too high the model underfits;
# if it is too low, the overfitting is not addressed
params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 
 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 3.5,
 4.0, 4.5, 5.0, 6.0, 7.0, 8.0, 9.0, 10, 15, 20, 25, 50, 100, 500, 1000 ]}
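
The effect of alpha on the coefficients can be illustrated with the closed-form ridge solution, beta = (X^T X + alpha I)^-1 X^T y. The sketch below uses synthetic data and omits the intercept for brevity (sklearn's `Ridge` fits one by default), so it is an illustration of the shrinkage effect rather than a reproduction of the model in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

def ridge_closed_form(X, y, alpha):
    """Hypothetical helper: ridge coefficients without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# size of the coefficient vector for increasingly strong regularisation
alphas = (0.01, 1.0, 100.0, 10000.0)
norms = [np.linalg.norm(ridge_closed_form(X, y, a)) for a in alphas]

for a, n in zip(alphas, norms):
    print(a, round(n, 4))
```

The coefficient norm falls steadily as alpha grows: very large values push every coefficient towards zero (underfitting), while very small values leave the least-squares solution almost unchanged.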

In [None]:
ridge = Ridge()

# cross validation
folds = 5
ridge_cv = GridSearchCV(estimator = ridge, 
                        scoring= 'neg_mean_absolute_error',  
                        param_grid = params, 
                        return_train_score=True,
                        cv = folds, 
                        verbose = 1)            
ridge_cv.fit(X_train, y_train)

In [None]:
# fetch the results of our gridsearch
ridge_results = pd.DataFrame(ridge_cv.cv_results_)
ridge_results

In [None]:
# plot a graph for comparing the test and train performance on different alpha params
ridge_results['param_alpha'] = ridge_results['param_alpha'].astype('float32')

# plotting
plt.figure(figsize=(16,5))

# plot train result
plt.plot(ridge_results['param_alpha'], ridge_results['mean_train_score'])

# plot test results
plt.plot(ridge_results['param_alpha'], ridge_results['mean_test_score'])

# set labels and title 
plt.xlabel('alpha')
plt.ylabel('Negative Mean Absolute Error')
plt.xscale('log')
plt.title("Negative Mean Absolute Error and alpha")

# plot legend
plt.legend(['train score', 'test score'], loc='upper right')

# show graph
plt.show()

In [None]:
print(ridge_cv.best_params_)
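
A note on reading these results: with `neg_mean_absolute_error`, scores are negative and the best alpha is the one with the highest (closest to zero) mean validation score. The sketch below shows this on synthetic data with a shortened, made-up alpha grid.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=120)

grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid=grid,
                      scoring='neg_mean_absolute_error', cv=5)
search.fit(X, y)

# best_score_ is the highest mean validation score; negate it to read it as an MAE
print(search.best_params_, round(-search.best_score_, 3))
```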

In [None]:
# Fit the Ridge model with the best alpha from the grid search (15 in this run)
alpha = ridge_cv.best_params_['alpha']

# initialise Ridge Model
ridge = Ridge(alpha=alpha)

# Fit the data
ridge.fit(X_train, y_train)

# printing coefficients which have been penalised
print(ridge.coef_)

In [None]:
# Let's calculate some metrics such as R2 score, RSS and RMSE

y_train_pred = ridge.predict(X_train)
y_test_pred = ridge.predict(X_test)

# initialize a list
metric2 = []

# calculate the R2 score for the training data
metric2.append(r2_score(y_train, y_train_pred))

# calculate the R2 score for the test data
metric2.append(r2_score(y_test, y_test_pred))

# calculate the RSS for the training data
metric2.append(np.sum(np.square(y_train - y_train_pred)))

# calculate the RSS for the test data
metric2.append(np.sum(np.square(y_test - y_test_pred)))

# calculate the RMSE (square root of the MSE) for the training data
metric2.append(mean_squared_error(y_train, y_train_pred)**0.5)

# calculate the RMSE (square root of the MSE) for the test data
metric2.append(mean_squared_error(y_test, y_test_pred)**0.5)

# add number of features for the model
metric2.append(len(X_train.columns))

# add the alpha value if present
metric2.append(ridge_cv.best_params_['alpha'])

Observation:
    
    Ridge balances the coefficients well with a regularisation alpha of 15. However, with 81 features the model is complex and prone to overfitting, so let's try building a Ridge model on the RFE train data.

Let's take the train data from RFE and perform Ridge regression as a less complex model.

In [None]:
# fit the RFE data to gridcv for ridge
ridge_cv.fit(X_train_rfe, y_train)

In [None]:
print(ridge_cv.best_params_)

In [None]:
# Fit the Ridge model with the best alpha from the grid search (5 in this run)
alpha = ridge_cv.best_params_['alpha']

# initialise the Ridge model
ridge_rfe = Ridge(alpha=alpha)

# fit the RFE data to the Ridge model
ridge_rfe.fit(X_train_rfe, y_train)

# printing coefficients which have been penalised
print(ridge_rfe.coef_)

In [None]:
# Let's calculate some metrics such as R2 score, RSS and RMSE

y_train_pred = ridge_rfe.predict(X_train_rfe)
y_test_pred = ridge_rfe.predict(X_test_rfe)

# initialize a list
metric3 = []

# calculate the R2 score for the training data
metric3.append(r2_score(y_train, y_train_pred))

# calculate the R2 score for the test data
metric3.append(r2_score(y_test, y_test_pred))

# calculate the RSS for the training data
metric3.append(np.sum(np.square(y_train - y_train_pred)))

# calculate the RSS for the test data
metric3.append(np.sum(np.square(y_test - y_test_pred)))

# calculate the RMSE (square root of the MSE) for the training data
metric3.append(mean_squared_error(y_train, y_train_pred)**0.5)

# calculate the RMSE (square root of the MSE) for the test data
metric3.append(mean_squared_error(y_test, y_test_pred)**0.5)

# add number of features for the model
metric3.append(len(X_train_rfe.columns))

# add the alpha value if present
metric3.append(ridge_cv.best_params_['alpha'])

In [None]:
metric3

<h2>Lasso Regression</h2>

In [None]:
# create lasso regression instance
lasso = Lasso()

params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 10.0, 20, 50, 75, 100, 200, 250, 300, 500, 1000 ]}


# cross validation
lasso_cv = GridSearchCV(estimator = lasso, 
                        return_train_score=True,
                        cv = folds, 
                        param_grid = params, 
                        verbose = 1,
                        scoring= 'neg_mean_absolute_error')            

# run the grid search on the training data
lasso_cv.fit(X_train, y_train) 

In [None]:
# fetch the Lasso results
lasso_results = pd.DataFrame(lasso_cv.cv_results_)
lasso_results

In [None]:
# plot a graph for comparing the test and train performance on different alpha params
lasso_results['param_alpha'] = lasso_results['param_alpha'].astype('float32')

# plotting
plt.figure(figsize=(16,5))

# plot train result
plt.plot(lasso_results['param_alpha'], lasso_results['mean_train_score'])

# plot test results
plt.plot(lasso_results['param_alpha'], lasso_results['mean_test_score'])

# set labels and title 
plt.xlabel('alpha')
plt.ylabel('Negative Mean Absolute Error')
plt.xscale('log')
plt.title("Negative Mean Absolute Error and alpha")

# plot legend
plt.legend(['train score', 'test score'], loc='upper right')

# show graph
plt.show()

In [None]:
# Printing the best hyperparameter alpha
print(lasso_cv.best_params_)

In [None]:
# Fit the Lasso model with the best alpha from the grid search (200 in this run)
alpha = lasso_cv.best_params_['alpha']

# build Lasso model
lasso = Lasso(alpha=alpha)

# fit the training data
lasso.fit(X_train, y_train)

In [None]:
# Let's calculate some metrics such as R2 score, RSS and RMSE

y_train_pred = lasso.predict(X_train)
y_test_pred = lasso.predict(X_test)

# initialize a list
metric4 = []

# calculate the R2 score for the training data
metric4.append(r2_score(y_train, y_train_pred))

# calculate the R2 score for the test data
metric4.append(r2_score(y_test, y_test_pred))

# calculate the RSS for the training data
metric4.append(np.sum(np.square(y_train - y_train_pred)))

# calculate the RSS for the test data
metric4.append(np.sum(np.square(y_test - y_test_pred)))

# calculate the RMSE (square root of the MSE) for the training data
metric4.append(mean_squared_error(y_train, y_train_pred)**0.5)

# calculate the RMSE (square root of the MSE) for the test data
metric4.append(mean_squared_error(y_test, y_test_pred)**0.5)

# add the number of features kept by the model (non-zero coefficients)
feature_cnt = int(np.count_nonzero(lasso.coef_))
metric4.append(feature_cnt)

# add the alpha value if present
metric4.append(lasso_cv.best_params_['alpha'])

**Observation:**
    
    Lasso performed well and has the added advantage of feature selection: it kept 53 of the 81 features, shrinking the coefficients of almost 30 features to zero.
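
The way Lasso discards features can be reproduced on synthetic data. In this sketch (made-up data, with an alpha chosen only for illustration) three of ten features carry signal; the L1 penalty sets the coefficients of the noise features exactly to zero, and the count of non-zero coefficients gives the number of selected features.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only the first three columns influence the target
y = 4 * X[:, 0] - 3 * X[:, 1] + 2 * X[:, 2] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.5).fit(X, y)

# features whose coefficient survived the L1 penalty
n_selected = int(np.count_nonzero(model.coef_))

print(model.coef_.round(2), n_selected)
```

The surviving coefficients are also shrunk towards zero, which is the price paid for the automatic selection.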

Let's build Lasso with our RFE data.

In [None]:
# fit the Training data from RFE selection
lasso_cv.fit(X_train_rfe, y_train) 

In [None]:
# Printing the best hyperparameter alpha
print(lasso_cv.best_params_)

In [None]:
# Fit the Lasso model with the best alpha from the grid search (200 in this run)
alpha = lasso_cv.best_params_['alpha']

# build Lasso with the best alpha
lasso_rfe = Lasso(alpha=alpha)

# fit the RFE train data
lasso_rfe.fit(X_train_rfe,y_train)

In [None]:
# Let's calculate some metrics such as R2 score, RSS and RMSE

y_train_pred = lasso_rfe.predict(X_train_rfe)
y_test_pred = lasso_rfe.predict(X_test_rfe)

# initialize a list
metric5 = []

# calculate the R2 score for the training data
metric5.append(r2_score(y_train, y_train_pred))

# calculate the R2 score for the test data
metric5.append(r2_score(y_test, y_test_pred))

# calculate the RSS for the training data
metric5.append(np.sum(np.square(y_train - y_train_pred)))

# calculate the RSS for the test data
metric5.append(np.sum(np.square(y_test - y_test_pred)))

# calculate the RMSE (square root of the MSE) for the training data
metric5.append(mean_squared_error(y_train, y_train_pred)**0.5)

# calculate the RMSE (square root of the MSE) for the test data
metric5.append(mean_squared_error(y_test, y_test_pred)**0.5)

# add the number of features kept by the model (non-zero coefficients)
feature_cnt = int(np.count_nonzero(lasso_rfe.coef_))
metric5.append(feature_cnt)

# add the alpha value if present
metric5.append(lasso_cv.best_params_['alpha'])

<h2> Model Summary </h2><br>

In [None]:
betas = pd.DataFrame(index=X_train.columns.values, 
                     columns = ['Linear', 'Linear RFE','Ridge','Ridge RFE', 'Lasso', 'Lasso RFE'])

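# lr was fitted before the columns of X_train were reordered, so its coefficients are
# matched to the column order recorded at fit time (feature_names_in_, scikit-learn >= 1.0)
betas.loc[lr.feature_names_in_,'Linear'] = lr.coef_ # Linear Regression
betas.loc[X_train_rfe.columns,'Linear RFE'] = lr_rfe.coef_ # Linear Regression on the selected features
betas.loc[X_train.columns,'Ridge'] = ridge.coef_ # Ridge Regression
betas.loc[X_train_rfe.columns,'Ridge RFE'] = ridge_rfe.coef_ # Ridge Regression on the selected features
betas.loc[X_train.columns,'Lasso'] = lasso.coef_ # Lasso Regression
betas.loc[X_train_rfe.columns,'Lasso RFE'] = lasso_rfe.coef_ # Lasso Regression on the selected features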

print("\n\n   ======== Coefficients for features under each model built ========")

betas['Total Coeff'] = abs(betas).sum(axis=1)

betas.sort_values(by='Total Coeff',ascending=False)

The top five features by total absolute coefficient across the six models are:
<b>
<ul>
<ol>TotalFlrSFAbvGrd</ol>
<ol>Neighborhood_NridgHt</ol>
<ol>Neighborhood_Sawyer</ol>
<ol>GrLivArea</ol>
<ol>BldgType_Twnhs</ol>
</ul>
</b>

To do:
    
    We can see that the least used features were the **Exterior2nd** columns, which don't impact SalePrice much. We can try rebuilding the model after removing them.

In [None]:
# Creating a table which contain all the metrics

lr_table = {'Metric': ['R2 Score (Train)','R2 Score (Test)','RSS (Train)','RSS (Test)',
                       'RMSE (Train)','RMSE (Test)','No of Features','Alpha'], 
        'Linear Regression': metric
        }

lr_metric = pd.DataFrame(lr_table ,columns = ['Metric', 'Linear Regression'] )

lr_metric_rfe = pd.Series(metric1, name = 'Linear Regression RFE')

rg_metric = pd.Series(metric2, name = 'Ridge Regression')
rg_rfe_metric = pd.Series(metric3, name = 'Ridge Regression RFE')
ls_metric = pd.Series(metric4, name = 'Lasso Regression')
ls_rfe_metric = pd.Series(metric5, name = 'Lasso Regression RFE')

final_metric = pd.concat([lr_metric,lr_metric_rfe, rg_metric,rg_rfe_metric, ls_metric,ls_rfe_metric], axis = 1)

print("\n\n  \t\t\t\t  ============= Overall view of the models ================")

round(final_metric,4).astype(str)

<h1> Conclusion </h1>

In [None]:
print("\n\n \t   =============== Final models ================")

round(final_metric[['Metric','Linear Regression RFE','Ridge Regression RFE','Lasso Regression RFE']],4).astype(str)

**The final models have very similar R2 scores: the train R2 score differs by only about 0.001 between the models, and the R2 score increases by about 0.002 across successive models.**

Lasso looks slightly better than the other models, so we can conclude that Lasso performs best among these regressions.