## Problem Statement :

A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual value and flip them at a higher price. To support this, the company has collected a data set of house sales in Australia.

The company wants to know:

- Which variables are significant in predicting the price of a house

- How well those variables describe the price of a house



## Business Goal :

The company is looking at prospective properties to buy to enter the market.

 - Build a regression model using regularization, so as to predict the actual value of the prospective properties and decide whether to invest in them or not.

 - Determine the optimal value of lambda for ridge and lasso regression.

 - Model the price of houses with the available independent variables. This model will then be used by the management to understand how exactly the prices vary with the variables. Further, the model will be a good way for management to understand the pricing dynamics of a new market.



## 1. <u>Data Understanding and Exploration</u>

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
#Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn import linear_model
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score


pd.set_option('display.max_columns', 100)

In [None]:
#Import dataset
housing_df = pd.read_csv('../input/house-prices-data/train.csv')
housing_df.head()

In [None]:
#check the shape of dataset
housing_df.shape

### Observation

 - #### As seen above, the dataset has 81 columns: 80 candidate predictors (including the Id column) and 1 target variable, SalePrice.


In [None]:
# summary of the dataset
housing_df.info()

In [None]:
numerical_vars = housing_df.dtypes[housing_df.dtypes != "object"].index
print("Number of Numerical features: ", len(numerical_vars))

categorical_vars = housing_df.dtypes[housing_df.dtypes == "object"].index
print("Number of Categorical features: ", len(categorical_vars))

### Observation

 - #### The dataset contains three data types: object, float64 and int64.
 - #### There are 38 numerical features and 43 categorical features.
 - #### Many columns contain null values.

In [None]:
#check the statistical distribution of the dataset
housing_df.describe([0.25,0.50,0.75,0.99]) 

## 2. <u>Data Cleaning</u>

 - ### 2.1 Missing/Null Value Analysis

In [None]:
(housing_df.isnull().sum()/len(housing_df.index)).sort_values(ascending=False).head(20)

### Observation

 - #### Several columns (PoolQC, MiscFeature, Alley and Fence) have more than 50% null values. Let's drop them, along with the 'Id' column, which is not useful for our analysis.

In [None]:
housing_df = housing_df.drop(['PoolQC','MiscFeature','Alley','Fence','Id'],axis='columns')
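
The columns dropped above were read off the null-value table by hand. A minimal sketch of deriving the same kind of list from a threshold is shown below. The toy frame `toy_df` and the names `null_threshold` and `cols_to_drop` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for housing_df (hypothetical values)
toy_df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

null_threshold = 0.5  # drop columns with more than 50% missing values

# Fraction of missing values per column
null_fraction = toy_df.isnull().mean()

# Columns whose missing fraction exceeds the threshold
cols_to_drop = null_fraction[null_fraction > null_threshold].index.tolist()
print(cols_to_drop)

cleaned_df = toy_df.drop(columns=cols_to_drop)
```

Deriving the list this way keeps the drop step in sync with the data if the threshold or the input file changes.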

#### Let's analyze the other columns with NA values one by one and decide how to handle them

In [None]:
#FireplaceQu - fireplace quality for different houses
housing_df.FireplaceQu.value_counts()

### Observation

 - #### Around 690 properties do not have a fireplace, so we will convert NA to 'No Fireplace'.

In [None]:
#replace NA in FireplaceQu column with No Fireplace
housing_df['FireplaceQu'] = housing_df['FireplaceQu'].fillna('No Fireplace')

In [None]:
#sanity check
housing_df.FireplaceQu.isnull().sum()

In [None]:
#LotFrontage - linear feet of street connected to property
sns.histplot(housing_df.LotFrontage, kde=True)
plt.show()

In [None]:
sns.boxplot(x=housing_df.LotFrontage)
plt.show()

### Observation

 - #### As seen from the plots above, the distribution of LotFrontage is right-skewed and the column has a few outliers. Hence we will replace the NA values with the median instead of the mean.

In [None]:
# Impute the null values with median values in LotFrontage column

housing_df['LotFrontage'] = housing_df['LotFrontage'].replace(np.nan, housing_df['LotFrontage'].median())

In [None]:
#sanity check
housing_df.LotFrontage.isnull().sum()
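
To illustrate why the median is preferred for a skewed column, here is a small sketch on made-up numbers. The `frontage` values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample with one extreme value and two gaps
frontage = pd.Series([50.0, 60.0, 65.0, 70.0, 75.0, 313.0, np.nan, np.nan])

mean_value = frontage.mean()      # pulled upward by the 313 outlier
median_value = frontage.median()  # robust to the outlier

print(f"mean = {mean_value:.1f}, median = {median_value:.1f}")

# Median imputation, assigned back rather than modified in place
frontage_imputed = frontage.fillna(median_value)
```

A single extreme value moves the mean far above the typical lot, while the median stays close to the bulk of the data.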


In [None]:
#garage related columns - GarageType , GarageCond, GarageYrBlt , GarageFinish , GarageQual
print(housing_df.GarageType.value_counts())
print(housing_df.GarageCond.value_counts())
print(housing_df.GarageYrBlt.value_counts())
print(housing_df.GarageFinish.value_counts())
print(housing_df.GarageQual.value_counts())

In [None]:
#impute all the NA values in Garage related columns except GarageYrBlt with 'No Garage'
garage_cols = ['GarageCond', 'GarageType', 'GarageFinish', 'GarageQual']
housing_df[garage_cols] = housing_df[garage_cols].fillna('No Garage')

In [None]:
sns.histplot(housing_df.GarageYrBlt, kde=True)
plt.show()

### Observation

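 - #### As seen from the plot above, the distribution is not normal, and a missing GarageYrBlt means the house has no garage, so imputing the median or mean year would not be meaningful. We impute 0 as a placeholder instead.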

In [None]:
# Impute the null values with 0
housing_df['GarageYrBlt']=housing_df['GarageYrBlt'].fillna(0)

In [None]:
# Create a new column Garage_status with 2 values - 0 and 1. Garages built before 2000, and houses
# without a garage (GarageYrBlt == 0), are considered old (0); garages built in 2000 or later are new (1).

def get_Garage_status(x):
    # x is the garage build year, or 0 when there is no garage
    return 1 if x >= 2000 else 0
    
housing_df['Garage_status'] = housing_df['GarageYrBlt'].apply(get_Garage_status)

In [None]:
housing_df['Garage_status'].value_counts()

In [None]:
#BsmtFinType1 - rating of basement finished area
housing_df.BsmtFinType1.value_counts()

In [None]:
#replace NA in BsmtFinType1 column with No Basement
housing_df['BsmtFinType1'] = housing_df['BsmtFinType1'].fillna('No Basement')

In [None]:
#sanity check
housing_df.BsmtFinType1.isnull().sum()

In [None]:
#BsmtExposure - walkout or garden level walls.
housing_df.BsmtExposure.value_counts()

In [None]:
#replace NA in BsmtExposure column with No Basement
housing_df['BsmtExposure'] = housing_df['BsmtExposure'].fillna('No Basement')

In [None]:
#sanity check
housing_df.BsmtExposure.isnull().sum()

In [None]:
#BsmtCond - condition of the basement.
housing_df.BsmtCond.value_counts()

In [None]:
#replace NA in BsmtCond column with No Basement
housing_df['BsmtCond'] = housing_df['BsmtCond'].fillna('No Basement')

In [None]:
#sanity check
housing_df.BsmtCond.isnull().sum()

In [None]:
#BsmtQual - height of the basement.

housing_df.BsmtQual.value_counts()

In [None]:
#replace NA in BsmtQual column with No Basement
housing_df['BsmtQual'] = housing_df['BsmtQual'].fillna('No Basement')

In [None]:
#sanity check
housing_df.BsmtQual.isnull().sum()

In [None]:
#BsmtFinType2 - rating of basement finished area, if they are of multiple types.
housing_df.BsmtFinType2.value_counts()

In [None]:
#replace NA in BsmtFinType2 column with No Basement
housing_df['BsmtFinType2'] = housing_df['BsmtFinType2'].fillna('No Basement')

In [None]:
#sanity check
housing_df.BsmtFinType2.isnull().sum()

In [None]:
#MasVnrArea - masonry veneer area in square feet.
sns.histplot(housing_df.MasVnrArea, kde=True)
plt.show()

In [None]:
sns.boxplot(x=housing_df.MasVnrArea)
plt.show()

### Observation

 - #### As seen from the plots above, the distribution of MasVnrArea is right-skewed and the column has many outliers. Hence we will replace the NA values with the median instead of the mean.

In [None]:
housing_df['MasVnrArea'] = housing_df['MasVnrArea'].fillna(housing_df['MasVnrArea'].median())

In [None]:
#sanity check
housing_df.MasVnrArea.isnull().sum()

In [None]:
#MasVnrType - masonry veneer type.
housing_df.MasVnrType.value_counts()

In [None]:
#replace NA in MasVnrType column with None
housing_df['MasVnrType'] = housing_df['MasVnrType'].fillna('None')

In [None]:
#sanity check
housing_df.MasVnrType.isnull().sum()

In [None]:
#Electrical - type of electrical system
housing_df.Electrical.value_counts()

In [None]:
#replace the NA value with the mode, since there is only one NA value in column Electrical
housing_df['Electrical'] = housing_df['Electrical'].fillna(housing_df['Electrical'].mode()[0])

In [None]:
#sanity check
housing_df.Electrical.isnull().sum()

In [None]:
# Create a new column named Is_Remodelled: 0 = not remodelled, 1 = remodelled,
# 2 = inconsistent record (remodel year earlier than build year).
def check_Remodelled(df):
    if(df['YearBuilt'] == df['YearRemodAdd']):
        return 0
    elif(df['YearBuilt'] < df['YearRemodAdd']):
        return 1
    else:
        return 2
    
housing_df['Is_Remodelled'] = housing_df.apply(check_Remodelled, axis=1)

In [None]:
# Create a new column Built_Remodel_Age: years between the build or last remodel and the sale.
# When the house was never remodelled, YearRemodAdd equals YearBuilt, so one subtraction covers both cases.

def Built_Remodel_Age(df):
    return df['YrSold'] - df['YearRemodAdd']
       
housing_df['Built_Remodel_Age'] = housing_df.apply(Built_Remodel_Age, axis=1)
housing_df.head()

In [None]:
#lets drop the original columns
housing_df.drop(['YearBuilt', 'YearRemodAdd', 'YrSold', 'GarageYrBlt'], axis = 1, inplace = True)

In [None]:
#check for columns where a single value accounts for at least 85% of the rows

def get_Same_ValueCounts(threshold=0.85):
    remove_columns = []
    # minimum count of the most frequent value (1241 of 1460 rows at 85%)
    min_count = threshold * len(housing_df)
    cols=housing_df.select_dtypes(['int64','float64','object']).columns
    for col in cols:
        if(housing_df[col].value_counts().max() >= min_count):
            remove_columns.append(col)
    return remove_columns

remove_columns = get_Same_ValueCounts()


In [None]:
#display columns with more than 85% same values
print(remove_columns)

In [None]:
#lets drop those columns
housing_df.drop(remove_columns, axis = 1, inplace = True)

In [None]:
housing_df

In [None]:
# Check if there are any duplicate rows in the dataset

housing_df[housing_df.duplicated(keep=False)]

 - ### 2.2 Outlier treatment

In [None]:
#Check the outliers in predictor variables

plt.figure(figsize=(20, 12))
plt.subplot(5,3,1)
sns.boxplot(y = 'LotArea', palette='Set3', data = housing_df)
plt.subplot(5,3,2)
sns.boxplot(y = 'MasVnrArea', palette='Set3', data = housing_df)
plt.subplot(5,3,3)
sns.boxplot(y = 'TotalBsmtSF', palette='Set3', data = housing_df)
plt.subplot(5,3,4)
sns.boxplot(y = 'WoodDeckSF', palette='Set3', data = housing_df)
plt.subplot(5,3,5)
sns.boxplot(y = 'OpenPorchSF', palette='Set3', data = housing_df)
plt.show()

In [None]:
# Outlier treatment: keep only rows strictly below an upper quantile of each column.
# The filters run in sequence, so each cutoff is computed on the rows that survived the previous one.

upper_quantiles = {
    'LotArea': 0.98,
    'MasVnrArea': 0.98,
    'TotalBsmtSF': 0.99,
    'WoodDeckSF': 0.99,
    'OpenPorchSF': 0.99,
}

for column, q in upper_quantiles.items():
    cutoff = housing_df[column].quantile(q)
    housing_df = housing_df[housing_df[column] < cutoff]
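
Dropping rows above a quantile shrinks the dataset with every filter. A common alternative, shown here only as a sketch and not what this notebook does, is to cap extreme values with `Series.clip` so that every row is kept. The `lot_area` values and the 0.90 cutoff are hypothetical.

```python
import pandas as pd

# Hypothetical lot areas with one extreme value
lot_area = pd.Series([8000.0, 9000.0, 9500.0, 10000.0, 11000.0, 12000.0, 215000.0])

upper_cap = lot_area.quantile(0.90)

# Approach used above: drop rows at or beyond the cutoff
filtered = lot_area[lot_area < upper_cap]

# Alternative: keep every row but cap values at the cutoff
capped = lot_area.clip(upper=upper_cap)

print(len(lot_area), len(filtered), len(capped))
print(capped.max() == upper_cap)
```

Which approach is preferable depends on whether the extreme rows are considered data errors or genuine but unusual houses.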

## 3. <u>Data Visualization</u>

In [None]:
#check the distribution of the response variable - SalePrice

plt.figure(figsize=(15,5))
sns.histplot(housing_df['SalePrice'], kde=True)

print("Skewness: %f" % housing_df['SalePrice'].skew())
print("Kurtosis: %f" % housing_df['SalePrice'].kurt())

### Observation

 - #### The target variable SalePrice is not normally distributed; its distribution is right-skewed. This can reduce the performance of regression models.
 - #### Therefore we apply a log transformation so that the resulting distribution looks approximately normal.

In [None]:
#Transform the skewed distribution to approximately normal by taking log of sale price
housing_df['SalePrice'] = np.log(housing_df['SalePrice'])
                                 
plt.figure(figsize=(15,5))
sns.histplot(housing_df['SalePrice'], kde=True)
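
Because the model is now trained on log prices, its predictions have to be mapped back with `np.exp` before they can be read as prices. The sketch below shows the effect of the transform on skewness and the round trip, using hypothetical `prices` rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sale prices
prices = pd.Series([100000.0, 120000.0, 130000.0, 150000.0, 180000.0, 755000.0])

log_prices = np.log(prices)
print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())

# Values on the log scale are mapped back with the inverse transform
recovered = np.exp(log_prices)
print(np.allclose(recovered, prices))
```

The same inverse transform applies to any prediction made by the models fitted later in the notebook.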


In [None]:
f, axes = plt.subplots(ncols=3, figsize=(16,4))

# Lot Area: In Square Feet
sns.histplot(housing_df['LotArea'], color="#DF3A01", ax=axes[0]).set_title("Distribution of LotArea")
axes[0].set_xlabel("Lot Area (Square Ft)")
axes[0].set_ylabel("Number of Houses")

# MoSold: Month of the year sold
sns.histplot(housing_df['MoSold'], discrete=True, color="#045FB4", ax=axes[1]).set_title("Monthly Sales Distribution")
axes[1].set_ylabel("Number of Houses Sold")
axes[1].set_xlabel("Month of the Year")

# House Value (log-transformed SalePrice)
sns.histplot(housing_df['SalePrice'], color="#088A4B", ax=axes[2]).set_title("Distribution of log(SalePrice)")
axes[2].set_ylabel("Number of Houses")
axes[2].set_xlabel("Log of Sale Price")



plt.show()

### Observation

 - #### Most of the houses were sold in June and July.
 

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x="MSZoning", data=housing_df, palette="Set2")
plt.title('Building Sales by Zoning', fontsize=18)
plt.ylabel('Number of houses sold', fontsize=14)
plt.xlabel('MSZoning', fontsize=14)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
sns.countplot(x="Neighborhood", data=housing_df, palette="Set2")
ax.set_title("Types of Neighborhoods", fontsize=20)
ax.set_xlabel("Neighborhoods", fontsize=16)
ax.set_ylabel("Number of Houses Sold", fontsize=16)
ax.tick_params(axis='x', rotation=90)  # rotate neighborhood labels so they do not overlap
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.boxplot(x="Neighborhood", y="SalePrice", data=housing_df)
ax.set_title("Range Value of the Neighborhoods", fontsize=18)
ax.set_ylabel('Price Sold', fontsize=16)
ax.set_xlabel('Neighborhood', fontsize=16)
ax.tick_params(axis='x', rotation=90)  # rotate neighborhood labels so they do not overlap
plt.show()

### Observation

- #### Most of the houses sold were in a Residential Low Density zone.

- #### The neighbourhoods with the most houses sold are NridgHt, CollgCr and NWAmes.

- #### The most expensive neighbourhoods are Crawfor, Sawyer and NridgHt.

In [None]:
sns.jointplot(x='GrLivArea',y='SalePrice',data=housing_df,kind='hex')

sns.jointplot(x='GarageArea',y='SalePrice',data=housing_df,kind='hex')

sns.jointplot(x='TotalBsmtSF',y='SalePrice',data=housing_df,kind='hex')

plt.show()

In [None]:
plt.figure(figsize=(16,6))
plt.subplot(121)
ax = sns.regplot(x="LotFrontage", y="SalePrice", data=housing_df)
ax.set_title("Lot Frontage vs Sale Price", fontsize=16)

plt.subplot(122)
ax1 = sns.regplot(x="LotArea", y="SalePrice", data=housing_df, color='#FE642E')
ax1.set_title("Lot Area vs Sale Price", fontsize=16)

plt.show()

### Observation

- #### GrLivArea, GarageArea and TotalBsmtSF are positively correlated with the price of the house. Houses with more GrLivArea are sold at higher prices. Most houses have a GarageArea between 200 and 1000.

- #### LotArea and LotFrontage do not show any strong pattern.

In [None]:
fig, ax = plt.subplots(figsize=(14,8))
palette = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71", "#FF8000", "#AEB404", "#FE2EF7", "#64FE2E"]

sns.swarmplot(x="OverallQual", y="SalePrice", data=housing_df, ax=ax, palette=palette, linewidth=1)
plt.title('Correlation between OverallQual and SalePrice', fontsize=18)
plt.ylabel('Sale Price', fontsize=14)
plt.show()

### Observation

- #### SalePrice increases rapidly with better overall quality, which is reasonable.

In [None]:
sns.catplot(x="Fireplaces", y="SalePrice", data=housing_df, hue="FireplaceQu", kind="point");

### Observation

- #### Houses with 2 fireplaces of excellent quality have higher sale prices.

In [None]:
#check the correlation among numeric variables
plt.figure(figsize=(20,10))
corr_matrix = housing_df.corr(numeric_only=True)
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)]=True
sns.heatmap(corr_matrix,mask=mask,cmap="coolwarm", annot=True)
plt.show()

### Observation

#### As seen from the heatmap above, the following variables are highly correlated:

 - #### GarageCars & GarageArea with a correlation coefficient of 0.89
 - #### TotRmsAbvGrd & GrLivArea with a correlation coefficient of 0.83
 - #### TotalBsmtSF & 1stFlrSF with a correlation coefficient of 0.77

In [None]:
#drop highly correlated variables

housing_df.drop(['1stFlrSF','TotRmsAbvGrd', 'GarageArea'], axis = 1, inplace = True)
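
The correlated pairs above were read off the heatmap. A minimal sketch of listing such pairs programmatically follows. It uses synthetic columns, and the names `toy_df`, `corr_threshold` and `high_pairs` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins (hypothetical): garage area is nearly proportional to car capacity
garage_cars = rng.integers(0, 4, size=n).astype(float)
toy_df = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": 250.0 * garage_cars + rng.normal(0, 40, size=n),
    "LotArea": rng.normal(10000, 2000, size=n),
})

corr_threshold = 0.75
corr_matrix = toy_df.corr().abs()

# Keep only the strict upper triangle so each pair appears once
upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(upper_mask).stack().dropna()

high_pairs = pairs[pairs > corr_threshold].sort_values(ascending=False)
print(high_pairs)
```

One variable from each listed pair can then be dropped, as was done by hand above.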

### Let's visualize the numerical predictor variables against the target variable -

In [None]:
corr=housing_df.corr(numeric_only=True)["SalePrice"]
corr.sort_values(ascending=False)

In [None]:
from scipy import stats
nr_rows = 8
nr_cols = 3

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*3.5,nr_rows*3))
num_cols=list(housing_df.select_dtypes(['int64','float64']))

for r in range(0,nr_rows):
    for c in range(0,nr_cols):  
        i = r*nr_cols+c
        if i < len(num_cols):
            sns.regplot(x=housing_df[num_cols[i]], y=housing_df['SalePrice'], ax = axs[r][c])
            stp = stats.pearsonr(housing_df[num_cols[i]], housing_df['SalePrice'])
            str_title = "r = " + "{0:.2f}".format(stp[0]) + "      " + "p = " + "{0:.2f}".format(stp[1])
            axs[r][c].set_title(str_title,fontsize=11)
            
plt.tight_layout()    
plt.show() 

### Observation
 
 - #### OverallQual, GrLivArea, GarageCars, TotalBsmtSF, FullBath and Built_Remodel_Age have an absolute correlation of more than 0.5 with SalePrice.



### Let's visualize the categorical predictor variables against the target variable -

In [None]:
cat_col=list(housing_df.select_dtypes('object'))
plt.figure(figsize=(20,70))
for m,n in enumerate(cat_col):
    plt.subplot(11,2,(m+1))
    sns.boxplot(x=n, y='SalePrice',data=housing_df)
    plt.xlabel(n, fontsize=14)
plt.show()

## 4. <u>Data Preparation</u>

There are three kinds of predictor variables: ordered categorical, unordered categorical and numeric.
We will convert the ordered categorical variables to a numeric type.

For values that can be ordered, we assign weights by mapping each level to a number based on the data dictionary.


In [None]:
housing_df['LotShape'] = housing_df['LotShape'].map({'Reg': 3, 'IR1': 2, 'IR2': 1, 'IR3': 0})
housing_df['ExterQual'] = housing_df['ExterQual'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})
housing_df['BsmtQual'] = housing_df['BsmtQual'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'No Basement': 0})
housing_df['BsmtExposure'] = housing_df['BsmtExposure'].map({'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1, 'No Basement': 0})
housing_df['BsmtFinType1'] = housing_df['BsmtFinType1'].map({'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 
                                                                 'No Basement': 0})
housing_df['HeatingQC'] = housing_df['HeatingQC'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})
housing_df['KitchenQual'] = housing_df['KitchenQual'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})
housing_df['FireplaceQu'] = housing_df['FireplaceQu'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'No Fireplace': 0})
housing_df['GarageFinish'] = housing_df['GarageFinish'].map({'Fin': 3, 'RFn': 2, 'Unf': 1, 'No Garage': 0 })
housing_df['BldgType'] = housing_df['BldgType'].map({'Twnhs': 5, 'TwnhsE': 4, 'Duplex': 3, '2fmCon': 2, '1Fam': 1})
housing_df['HouseStyle'] = housing_df['HouseStyle'].map({'SLvl': 8, 'SFoyer': 7, '2.5Fin': 6, '2.5Unf': 5, '2Story': 4, 
                                                                 '1.5Fin': 3, '1.5Unf': 2, '1Story': 1})
housing_df['LotConfig'] = housing_df['LotConfig'].map({'Inside': 5, 'Corner': 4, 'CulDSac': 3, 'FR2': 2, 'FR3': 1})
housing_df['MasVnrType'] = housing_df['MasVnrType'].map({'BrkCmn': 1, 'BrkFace': 1, 'CBlock': 1, 'Stone': 1, 'None': 0 })
housing_df['SaleCondition'] = housing_df['SaleCondition'].map({'Normal': 1, 'Partial': 1, 'Abnorml': 0, 'Family': 0, 
                                                                   'Alloca': 0, 'AdjLand': 0})
housing_df.head()
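
One thing to watch with `Series.map` and a dictionary is that any label missing from the dictionary silently becomes NaN. The sketch below shows a quick check for that, using a hypothetical column with one label (`'No Kitchen'`) that the mapping does not cover.

```python
import pandas as pd

quality_scale = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1}

# Hypothetical column containing a label the dictionary does not cover
kitchen_qual = pd.Series(['Gd', 'TA', 'Ex', 'No Kitchen'])

mapped = kitchen_qual.map(quality_scale)

# Labels that were silently turned into NaN by the mapping
unmapped_labels = kitchen_qual[mapped.isnull()].unique().tolist()
print(unmapped_labels)
```

Running a check like this after each mapping helps confirm that no new nulls were introduced before scaling and modelling.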

In [None]:
# creating dummies for the following columns -
unordered_cols = ['MSZoning', 'Neighborhood', 'RoofStyle', 'Exterior1st', 'Exterior2nd', 'Foundation', 'GarageType']

for i in unordered_cols:
    dummies_df=pd.get_dummies(housing_df[i],prefix=i,drop_first=True)   
    housing_df=pd.concat([housing_df,dummies_df],axis=1)
    housing_df.drop(i,axis=1,inplace=True)   

In [None]:
housing_df.head()

## 5. <u>Train Test Split</u>

In [None]:
X = housing_df.drop(['SalePrice'], axis=1)
X.head()

In [None]:
# Putting response variable to y

y = housing_df['SalePrice']
y.head()

In [None]:
# storing column names in cols
cols = X.columns

X = pd.DataFrame(scale(X))
X.columns = cols
X.columns

In [None]:
# split into train and test
np.random.seed(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size = 0.3, random_state=42)
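
In the cells above, the predictors are scaled before the split, so the test rows contribute to the scaling statistics. A commonly recommended alternative, sketched here on synthetic data and not what this notebook does, is to fit a `StandardScaler` on the training rows only and reuse it on the test rows. `X_demo`, `y_demo` and the other names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic predictors and target standing in for X and y (hypothetical values)
X_demo = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 500, size=100),
    "OverallQual": rng.integers(1, 11, size=100).astype(float),
})
y_demo = pd.Series(rng.normal(12, 0.4, size=100))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=42
)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_tr_scaled = pd.DataFrame(scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
# Reuse the training statistics on the test rows
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.mean().round(6).tolist())
```

With this ordering the test set stays unseen until evaluation, which gives a more honest estimate of out-of-sample performance.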

## 6. <u>Model Building and Evaluation</u>

In [None]:
lm = LinearRegression()
lm.fit(X_train, y_train)

# running RFE to keep 50 features
rfe = RFE(lm, n_features_to_select=50)
rfe = rfe.fit(X_train, y_train)

In [None]:
# Assign the columns selected by RFE to cols
col = X_train.columns[rfe.support_]

# assign the 50 features selected using RFE to a dataframe and view them
temp_df = pd.DataFrame(list(zip(X_train.columns,rfe.support_,rfe.ranking_)), columns=['Variable', 'rfe_support', 'rfe_ranking'])
temp_df = temp_df.loc[temp_df['rfe_support'] == True]
temp_df.reset_index(drop=True, inplace=True)

temp_df

In [None]:
# Assign the 50 columns to X_train_rfe

X_train_rfe = X_train[col]

In [None]:
X_train = X_train_rfe[X_train_rfe.columns]
X_test =  X_test[X_train.columns]

### 6.1 Ridge Regression

Ridge regression adds an L2 penalty on the size of the coefficients, which makes it well suited to multiple regression data that suffer from multicollinearity.
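
For reference, the objective that ridge regression minimizes can be written as follows. Here $\lambda$ corresponds to the `alpha` parameter of scikit-learn's `Ridge`, and the intercept $\beta_0$ is left unpenalized:

$$
\left(\hat{\beta}_0, \hat{\beta}\right)^{\text{ridge}} = \arg\min_{\beta_0,\,\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
$$

A larger $\lambda$ shrinks the coefficients more strongly, and $\lambda = 0$ recovers ordinary least squares.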

In [None]:
#Lets use Grid Search Cross Validation method to get the best value of hyperparameter alpha for ridge regression model.

params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 
                    9.0, 10.0, 20, 50, 100, 500, 1000 ]}

ridge = Ridge()


folds = 5
ridge_model_cv = GridSearchCV(estimator = ridge, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            
ridge_model_cv.fit(X_train, y_train)

In [None]:
# display the mean scores

ridge_cv_results = pd.DataFrame(ridge_model_cv.cv_results_)
ridge_cv_results = ridge_cv_results[ridge_cv_results['param_alpha']<=500]
ridge_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])

In [None]:
# plotting mean test and train scores with alpha 
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_train_score'], label='Train score')
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_test_score'], label='Test score')
plt.xlabel('alpha')
plt.ylabel('Negative Mean Absolute Error')
plt.title("Negative Mean Absolute Error and alpha")
plt.legend()
plt.show()

In [None]:
# get the best estimator; its alpha is the optimal lambda
ridge_model_cv.best_estimator_
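
The selected alpha can also be read from the fitted search object instead of being typed in by hand. A minimal sketch on synthetic data is shown below. `X_demo`, `y_demo`, `search` and the shortened alpha grid are hypothetical stand-ins for the notebook's own objects.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical) standing in for X_train and y_train
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=120)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_demo, y_demo)

# Read the selected alpha from the search instead of hard-coding it
best_alpha = search.best_params_["alpha"]
print("best alpha:", best_alpha)

# With refit=True (the default), best_estimator_ is already refit on all the data
final_ridge = search.best_estimator_
print(final_ridge.coef_.round(3))
```

Using `best_params_` keeps the later cells consistent with the search if the grid or the data changes.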

In [None]:
#Let’s check out the coefficient values with alpha value as 10.
alpha=10.0
ridge=Ridge(alpha=alpha)
ridge.fit(X_train,y_train)
print(ridge.coef_)

In [None]:
# Check the mean squared error

mean_squared_error(y_test, ridge.predict(X_test))

In [None]:
#check the R2 value for optimum alpha value:
alpha=10.0

ridge=Ridge(alpha=alpha)
ridge.fit(X_train,y_train)

y_train_pred_ridge= ridge.predict(X_train)
y_test_pred_ridge= ridge.predict(X_test)

print('train R2 score is',r2_score(y_train,y_train_pred_ridge))
print('test R2 score is',r2_score(y_test,y_test_pred_ridge))

### Observation

#### As seen above, the R2 value is 0.92 on the train data and 0.90 on the test data. The model predicts well, and the small gap between train and test suggests it is not overfitted.

In [None]:
# Put the Features and coefficients in a dataframe and determine the top 10 significant features  

ridge_df = pd.DataFrame({'Features':X_train.columns, 'Coefficient':ridge.coef_.round(4)})
ridge_df.reset_index(drop=True, inplace=True)
ridge_df.sort_values('Coefficient',ascending=False).head(10)

In [None]:
#Top 10 and bottom 10 Features
plt.figure(figsize=(20, 10))
r_coefs = pd.Series(ridge.coef_, index = X_train.columns)

r_imp_coefs = pd.concat([r_coefs.sort_values().head(10),
                     r_coefs.sort_values().tail(10)])
r_imp_coefs.plot(kind = "barh", color='yellowgreen')
plt.xlabel("Ridge coefficient", weight='bold')
plt.title("Feature importance in the Ridge Model", weight='bold')
plt.show()

### 6.2 Lasso Regression

Lasso regression is a shrinkage and variable selection method for linear regression models. Its goal is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing an L1 constraint on the model parameters, which shrinks the regression coefficients toward zero and sets some of them exactly to zero. Variables whose coefficient is zero after shrinkage are excluded from the model; variables with non-zero coefficients are the ones most strongly associated with the response variable.
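
The variable-selection behaviour described above can be seen in a small sketch on synthetic data, where only two of ten features actually drive the target. The data, the `alpha=0.2` value and the names `lasso_demo` and `ridge_demo` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): only the first two of ten features drive the target
X_demo = rng.normal(size=(500, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

lasso_demo = Lasso(alpha=0.2).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.2).fit(X_demo, y_demo)

# Lasso sets the irrelevant coefficients exactly to zero; ridge only shrinks them
print("lasso non-zero coefficients:", np.count_nonzero(lasso_demo.coef_))
print("ridge non-zero coefficients:", np.count_nonzero(ridge_demo.coef_))
```

This is why the Lasso coefficient table later in the notebook can be filtered to the non-zero rows, while the Ridge table cannot.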

In [None]:
lasso=Lasso()
params = {'alpha': [0.000001, 0.00001,0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 500, 1000, 10000]}
folds=5
lasso_model_cv=GridSearchCV(estimator=lasso,
                           param_grid=params,
                           scoring= 'r2',
                            cv=folds,
                            return_train_score=True,
                           verbose=1)
lasso_model_cv.fit(X_train, y_train)

In [None]:
# display the mean scores

lasso_cv_results = pd.DataFrame(lasso_model_cv.cv_results_)
lasso_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])

In [None]:
# plotting mean test and train scores with alpha 
plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_train_score'], label='Train')
plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_test_score'], label='Test')
plt.xlabel('alpha')
plt.ylabel('R2_score')
plt.xscale('log')
plt.legend()
plt.show()

In [None]:
# get the best estimator; its alpha is the optimal lambda

lasso_model_cv.best_estimator_

In [None]:
# check the coefficient values with lambda = 0.0001

alpha = 0.0001

lasso = Lasso(alpha=alpha)
        
lasso.fit(X_train, y_train) 
lasso.coef_

In [None]:
# Check the mean squared error

mean_squared_error(y_test, lasso.predict(X_test))

In [None]:
#check the R2 value for optimum alpha value:
alpha=0.0001

lasso= Lasso(alpha=alpha)
lasso.fit(X_train, y_train)

y_train_pred_lasso= lasso.predict(X_train)
y_test_pred_lasso= lasso.predict(X_test)

print('train R2 score is',r2_score(y_train,y_train_pred_lasso))
print('test R2 score is',r2_score(y_test,y_test_pred_lasso))

### Observation

#### As seen above, the R2 value is 0.92 on the train data and 0.90 on the test data. The model predicts well, and the small gap between train and test suggests it is not overfitted.

In [None]:
# Put the Features and coefficients in a dataframe and determine the top 10 significant features 
lasso_df = pd.DataFrame({'Features':X_train.columns, 'Coefficient':lasso.coef_.round(4)})
lasso_df = lasso_df[lasso_df['Coefficient'] != 0.00]
lasso_df.reset_index(drop=True, inplace=True)
lasso_df.sort_values('Coefficient',ascending=False).head(10)


In [None]:
#Top 10 and bottom 10 Features
plt.figure(figsize=(20, 10))
L_coefs = pd.Series(lasso.coef_, index = X_train.columns)

L_imp_coefs = pd.concat([L_coefs.sort_values().head(10),
                     L_coefs.sort_values().tail(10)])
L_imp_coefs.plot(kind = "barh", color='yellowgreen')
plt.xlabel("Lasso coefficient", weight='bold')
plt.title("Feature importance in the Lasso Model", weight='bold')
plt.show()

### <u>Conclusion</u>

 - #### Both Lasso and Ridge regression have almost the same R2 values: 0.92 on the train data and 0.90 on the test data.
 
 - #### The optimal lambda value is 10 for Ridge and 0.0001 for Lasso.

 - #### The mean squared error is 0.013579 for Ridge and 0.013474 for Lasso.
 
 - #### The top 10 most significant variables identified by Ridge and Lasso are almost the same; only their order differs.

 - #### Lasso helps in feature reduction (the coefficient of one of the features became 0), and its mean squared error is slightly lower than that of Ridge. Hence we will go ahead with Lasso.



#### Top 10 most significant variables in Ridge are:
 - 'GrLivArea', 0.1005
 - 'MSZoning_RL', 0.0899
 - 'OverallQual', 0.0702
 - 'MSZoning_FV', 0.0590
 - 'MSZoning_RM', 0.0587
 - 'TotalBsmtSF', 0.0513
 - 'OverallCond', 0.0460
 - 'Foundation_PConc', 0.0443
 - 'GarageCars', 0.0373
 - 'BsmtFinSF1', 0.033


#### Top 10 most significant variables in Lasso are:
 - 'MSZoning_RL', 0.1326
 - 'GrLivArea', 0.1036
 - 'MSZoning_RM', 0.0967
 - 'MSZoning_FV', 0.0811
 - 'OverallQual', 0.0693
 - 'TotalBsmtSF', 0.0504
 - 'Foundation_PConc', 0.0468
 - 'OverallCond', 0.0457
 - 'GarageCars', 0.0379
 - 'MSZoning_RH', 0.0335


#### Hence the top 10 variables which are significant in predicting the price of a house -

 - 'MSZoning_RL', 0.1326
 - 'GrLivArea', 0.1036
 - 'MSZoning_RM', 0.0967
 - 'MSZoning_FV', 0.0811
 - 'OverallQual', 0.0693
 - 'TotalBsmtSF', 0.0504
 - 'Foundation_PConc', 0.0468
 - 'OverallCond', 0.0457
 - 'GarageCars', 0.0379
 - 'MSZoning_RH', 0.0335


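#### How well do those variables describe the price of a house?

The Lasso model reaches an R2 of 0.92 on the train data and 0.90 on the test data, so the selected variables explain about 90% of the variance in log SalePrice on unseen houses. The five largest Lasso coefficients are:
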
 - 'MSZoning_RL' with coefficient value 0.1326
 - 'GrLivArea' with coefficient value 0.1036
 - 'MSZoning_RM' with coefficient value 0.0967
 - 'MSZoning_FV' with coefficient value 0.0811
 - 'OverallQual' with coefficient value 0.0693
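
Because SalePrice was log-transformed and the predictors were standardized, a coefficient $b$ can be read as follows: a one-standard-deviation increase in that predictor multiplies the predicted price by $e^{b}$, holding the other predictors fixed. The sketch below converts the coefficients listed above into approximate percentage changes. The dictionary `lasso_top` simply copies those values.

```python
import math

# Lasso coefficients copied from the list above
lasso_top = {
    "MSZoning_RL": 0.1326,
    "GrLivArea": 0.1036,
    "MSZoning_RM": 0.0967,
    "MSZoning_FV": 0.0811,
    "OverallQual": 0.0693,
}

# Percentage change in predicted price for a one-standard-deviation increase
pct_change = {name: (math.exp(coef) - 1) * 100 for name, coef in lasso_top.items()}

for name, pct in pct_change.items():
    print(f"{name}: {pct:.1f}%")
```

For example, the GrLivArea coefficient of 0.1036 corresponds to roughly a 10.9% higher predicted price per standard deviation of living area.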