# Housing Case Study

This notebook analyses the dataset of a real estate company and builds a model of house sale prices and their dependence on different parameters.

# Step 1 Reading and Understanding the data

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
#Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



In [None]:
#Read the data from the CSV file
housing_data = pd.read_csv('../input/housing-simple-regression/Housing.csv')
housing_data.head()

The dataframe above shows the first 5 rows of the dataset, which contains both numerical and categorical data. Let's check the last 5 rows as well.

In [None]:
housing_data.tail()

In [None]:
#Check the shape of the dataframe
housing_data.shape

This dataset contains 545 rows and 13 columns: 6 columns contain numerical data and the other 7 contain categorical data.

In [None]:
#Check the information about the data
housing_data.info()

In [None]:
#Check each column for null values
housing_data.isna().any()

Clearly there are no null values present in the dataset.
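
`isna().any()` only reports whether a column has at least one null. A per-column count is often more informative. The sketch below uses a small made-up frame (`toy_nulls` is a hypothetical stand-in, not the housing data) to show the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only
toy_nulls = pd.DataFrame({"area": [7420, 8960, np.nan], "bedrooms": [4, 4, 3]})

# Number of missing values in each column
null_counts = toy_nulls.isna().sum()
print(null_counts)
```

On the real data the same call would be `housing_data.isna().sum()`.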



In [None]:
#Statistical summary of the data
housing_data.describe()

This dataframe shows the count, mean, standard deviation, minimum, maximum and quartiles (25%, 50%, 75%) for each column containing numerical data.

# Step 2 Visualizing the data

**VISUALIZING NUMERICAL DATA**

In [None]:
housing_data.hist(figsize=(20,20))

The histograms show that only area and price are continuous, while the other columns (bathrooms, bedrooms, parking and stories) are discrete.

In [None]:
#let's make pairplot of numerical data
sns.pairplot(housing_data)
plt.show()

The pair plots show a correlation between area and price. Now let's visualize the categorical data as well.

**VISUALIZING CATEGORICAL DATA**

In [None]:
#Box plots
plt.figure(figsize=(20,12))
plt.subplot(2,3,1)
sns.boxplot(x='mainroad', y='price', data=housing_data)
plt.subplot(2,3,2)
sns.boxplot(x='guestroom', y='price', data=housing_data)
plt.subplot(2,3,3)
sns.boxplot(x='basement', y='price', data=housing_data)
plt.subplot(2,3,4)
sns.boxplot(x='hotwaterheating', y='price', data=housing_data)
plt.subplot(2,3,5)
sns.boxplot(x='airconditioning', y='price', data=housing_data)
plt.subplot(2,3,6)
sns.boxplot(x='furnishingstatus', y='price', data=housing_data)
plt.show()






The box plots show that furnishingstatus has three levels, so dummy encoding has to be performed for it. The other categorical columns also need to be converted into numerical data for modelling.

# Step 3 Data Preparation

Let's convert categorical data into numerical data

In [None]:
#List of variables to map
varlist= ['mainroad','guestroom','basement','hotwaterheating','airconditioning','prefarea']
#Defining the map function
def binary_map(x):
    return x.map({'yes':1,'no':0})
housing_data[varlist]=housing_data[varlist].apply(binary_map)


In [None]:
housing_data.head()
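
One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. The sketch below, on a made-up column (`toy_binary` is hypothetical), shows how an inconsistent label such as `'Yes'` would surface, so the mapped columns can be checked before modelling.

```python
import pandas as pd

# Hypothetical toy column; 'Yes' is deliberately inconsistent with the mapping keys
toy_binary = pd.DataFrame({"mainroad": ["yes", "no", "Yes"]})

mapped = toy_binary["mainroad"].map({"yes": 1, "no": 0})

# Values missing from the mapping dictionary turn into NaN
unmapped = int(mapped.isna().sum())
print(unmapped)
```

On the real data, `housing_data[varlist].isna().sum()` after the mapping would play the same role.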

**DUMMY VARIABLES**

The attribute furnishingstatus has 3 levels. We need to convert this into numerical data as well.

In [None]:
#Get the dummy variable for the attribute furnishingstatus and store it in a new dataframe
df=pd.get_dummies(housing_data['furnishingstatus'])

In [None]:
#Check what the new dataframe df looks like
df.head()

There are 3 levels: furnished, semi-furnished and unfurnished. Two columns are enough to determine the furnishing type, because a row with 0 in both columns must belong to the remaining level, so the first column can be dropped.

In [None]:
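#Let's drop the first column
#dtype=int keeps the dummies numeric (newer pandas versions default to boolean columns)
df=pd.get_dummies(housing_data['furnishingstatus'], drop_first=True, dtype=int)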
df.head()
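
To make the baseline level explicit, here is a minimal sketch on a made-up series (`toy_status` is hypothetical). With `drop_first=True` the first level in sorted order is dropped, and rows belonging to it are encoded as zeros in every remaining column.

```python
import pandas as pd

# Hypothetical toy series with the three furnishing levels
toy_status = pd.Series(
    ["furnished", "semi-furnished", "unfurnished", "furnished"],
    name="furnishingstatus",
)

# The first level in sorted order ('furnished') becomes the implicit baseline
toy_dummies = pd.get_dummies(toy_status, drop_first=True, dtype=int)
print(toy_dummies)
```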

In [None]:
#Add the above dataframe df into original housing_data dataframe
housing_data=pd.concat([housing_data, df], axis=1)
housing_data.head()

The original furnishingstatus column is still present in the concatenated dataframe, so let's drop it.

In [None]:
housing_data.drop(['furnishingstatus'], axis=1, inplace= True)
housing_data.head()

# Step 4 Scaling the data

Except for area, all attributes have small integer values, so MinMaxScaler will be used to bring every feature into the same [0, 1] range. But first, let's split the data into training and test sets.

In [None]:
#Splitting of data into train and test set
from sklearn.model_selection import train_test_split
train, test = train_test_split(housing_data,train_size=0.8,test_size=0.2,random_state=0)

In [None]:
#Scaling the features
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()

In [None]:
#Apply the scaler to the numerical columns only; the binary and dummy columns are already 0/1
num = ['area','bedrooms','bathrooms','stories','parking','price']
train[num] = scaler.fit_transform(train[num])
train.head()
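
With its default settings, MinMaxScaler maps each column to (x - min) / (max - min), where min and max are learned from the data passed to `fit`. The sketch below uses made-up numbers and a separate `toy_scaler` (both hypothetical) to show why the scaler is fitted on the training set only: test values are transformed with the training min and max, so they can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values standing in for a single 'area' column
toy_train_area = np.array([[1000.0], [2000.0], [3000.0]])
toy_test_area = np.array([[1500.0], [4000.0]])

toy_scaler = MinMaxScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train_area)  # learns min=1000, max=3000
toy_test_scaled = toy_scaler.transform(toy_test_area)        # reuses the training min and max

print(toy_train_scaled.ravel())
print(toy_test_scaled.ravel())
```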

In [None]:
#Let's find out the correlation matrix
corr=housing_data.corr()
corr

In [None]:
#Let's check heatmap
plt.figure(figsize=(16,10))
sns.heatmap(corr,annot=True, cmap='YlGnBu')
plt.show()

The largest correlation in the matrix and heatmap is 0.54, between price and area. Since no pair of predictors is more strongly correlated than that, there is no sign of severe multicollinearity.
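
Multicollinearity concerns the predictors rather than the target, so it can help to look at the correlation matrix with the target column removed. The following is a minimal sketch on synthetic data (`toy_corr_df` and its deliberately collinear `bathrooms` column are made up for illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical synthetic frame, not the housing data
toy_corr_df = pd.DataFrame({
    "price": rng.normal(size=100),
    "area": rng.normal(size=100),
    "bedrooms": rng.normal(size=100),
})
# Deliberately make 'bathrooms' almost a copy of 'bedrooms'
toy_corr_df["bathrooms"] = 0.9 * toy_corr_df["bedrooms"] + rng.normal(scale=0.1, size=100)

# Absolute correlations between predictors only (target dropped)
pred_corr = toy_corr_df.drop(columns="price").corr().abs()

# Hide the diagonal (self-correlation of 1) before searching for the strongest pair
off_diagonal = ~np.eye(len(pred_corr), dtype=bool)
strongest_pair = pred_corr.where(off_diagonal).stack().idxmax()
print(strongest_pair)
```

On the real data the same steps would start from `housing_data.drop(columns='price')`.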

# Step 5 Model Building

Let's use automated feature selection.

In [None]:
from sklearn.linear_model import LinearRegression
#X_train and train are the same object, so popping 'price' leaves X_train with the predictors only
X_train= train
Y_train= train.pop('price')


**RECURSIVE FEATURE ELIMINATION (RFE)**

In [None]:
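from sklearn.feature_selection import RFE
lm = LinearRegression()
# RFE fits the estimator itself, so no separate lm.fit call is needed.
# n_features_to_select must be passed by keyword in recent scikit-learn versions.
rfe = RFE(lm, n_features_to_select=10)
rfe = rfe.fit(X_train, Y_train)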

Displaying the columns in order of preference that can be used for model building as suggested by RFE

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

Columns marked True with ranking 1 are the ones selected by RFE.
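
To illustrate how `support_` and `ranking_` relate, here is a small sketch on synthetic data (`X_toy`, `y_toy` and `toy_rfe` are made up for illustration). Only the first two columns drive the target, so RFE should keep them with rank 1 and give the noise columns higher ranks in the order they were eliminated.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical synthetic data: columns 0 and 1 are informative, 2 and 3 are noise
X_toy = rng.normal(size=(200, 4))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

toy_rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X_toy, y_toy)

print(toy_rfe.support_)   # boolean mask of the selected columns
print(toy_rfe.ranking_)   # 1 for selected columns, larger values for eliminated ones
```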

In [None]:
#Display the columns supported by RFE
col = X_train.columns[rfe.support_]
col

In [None]:
#Display the columns not supported by RFE
X_train.columns[~rfe.support_]

Building the model using statsmodels for detailed statistics.

In [None]:
# Creating training dataframe with RFE selected variables
X_train_rfe = X_train[col]

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_rfe = sm.add_constant(X_train_rfe)

In [None]:
# Running the linear model
lm = sm.OLS(Y_train,X_train_rfe).fit()   

In [None]:
#Let's see the summary of our linear model
print(lm.summary())

According to the summary above, the coefficient for the feature bedrooms is statistically insignificant, so it can be dropped.

In [None]:
#Dropping the bedrooms column
X_train_new = X_train_rfe.drop(["bedrooms"], axis = 1)

In [None]:
#Rebuilding the model without bedrooms
#X_train_new still contains the 'const' column, so add_constant leaves it unchanged
X_train_lm = sm.add_constant(X_train_new)
lm = sm.OLS(Y_train,X_train_lm).fit()





In [None]:
#Let's see the summary of our linear model
print(lm.summary())

In [None]:
X_train_new.columns

In [None]:
X_train_new = X_train_new.drop(['const'], axis=1)

In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

A variance inflation factor (VIF) below 5 is generally considered acceptable. Here every attribute has a VIF below 5, so the model is doing well so far.
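
For reference, the standard definition of the VIF for predictor $j$ is given below, where $R_j^2$ is the coefficient of determination obtained by regressing predictor $j$ on all the other predictors. A value of 5 therefore corresponds to $R_j^2 = 0.8$.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$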

# Step 6 Residual Analysis on the training data

To check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot a histogram of the error terms and see what it looks like.








In [None]:
#Predicted value of price
Y_train_price = lm.predict(X_train_lm)

In [None]:
# Displaying error terms
# histplot with kde=True replaces the deprecated sns.distplot
fig = plt.figure()
sns.histplot((Y_train - Y_train_price), bins=20, kde=True)
fig.suptitle('Error Terms', fontsize=20)
plt.xlabel('Errors', fontsize=18)
plt.show()
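
A histogram is a visual check. As a complement, the sketch below compares sample quantiles with theoretical normal quantiles using `scipy.stats.probplot`. The residuals here are synthetic (`toy_residuals` is a made-up stand-in for `Y_train - Y_train_price`), so the numbers say nothing about the housing model itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical synthetic residuals, standing in for Y_train - Y_train_price
toy_residuals = rng.normal(loc=0.0, scale=0.1, size=500)

# probplot fits a line through the quantile-quantile points;
# r close to 1 means the points lie near a straight line, as expected for normal data
(_, _), (qq_slope, qq_intercept, qq_r) = stats.probplot(toy_residuals, dist="norm")
print(qq_r)
```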

# Step 7 Model evaluation on test data


In [None]:
#Applying the scaling to the test set (transform only, using the scaler fitted on the training set)
num = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking','price']
test[num] = scaler.transform(test[num])

In [None]:
Y_test = test.pop('price')
X_test = test

In [None]:
# Now let's use our model to make predictions.

# Creating X_test_new dataframe by dropping variables from X_test
X_test_new = X_test[X_train_new.columns]

# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)
# Making predictions
Y_pred = lm.predict(X_test_new)

In [None]:
# Displaying actual and predicted price values for the test data
fig = plt.figure()
plt.scatter(Y_test,Y_pred)
fig.suptitle('Y_test vs Y_pred', fontsize=20)            
plt.xlabel('Y_test', fontsize=18)                          
plt.ylabel('Y_pred', fontsize=16)                          
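
The scatter plot can be complemented with numeric scores. A minimal sketch with `sklearn.metrics` follows; the arrays are made-up values (`y_true_toy` and `y_pred_toy` are hypothetical), not results from this model.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical toy values, for illustration only
y_true_toy = [3.0, -0.5, 2.0, 7.0]
y_pred_toy = [2.5, 0.0, 2.0, 8.0]

toy_r2 = r2_score(y_true_toy, y_pred_toy)                      # share of variance explained
toy_rmse = mean_squared_error(y_true_toy, y_pred_toy) ** 0.5   # root mean squared error

print(toy_r2, toy_rmse)
```

For this notebook the corresponding calls would be `r2_score(Y_test, Y_pred)` and `mean_squared_error(Y_test, Y_pred) ** 0.5`, both on the scaled price.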

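Since the error terms are approximately normally distributed, the model is doing well enough. The final equation, expressed in the min-max scaled variables, is:

$$
\text{price} = 0.314 \times \text{area} + 0.193 \times \text{bathrooms} + 0.112 \times \text{stories} + 0.040 \times \text{mainroad} + 0.051 \times \text{guestroom} + 0.111 \times \text{hotwaterheating} + 0.082 \times \text{airconditioning} + 0.07 \times \text{parking} + 0.069 \times \text{prefarea}
$$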