# Multiple Linear Regression



### Housing Case Study



#### Problem Statement:


Consider that a real estate company has the data of real-estate prices in Delhi. The company wants to optimise the selling price of the properties, based on important factors such as area, bedrooms, parking, etc.
 

Essentially, the company wants:

- To identify the variables affecting house prices, e.g., area, number of rooms, bathrooms, etc.
- To create a linear model that quantitatively relates house prices with variables, such as the number of rooms, area, number of bathrooms, etc.
- To know the accuracy of the model, i.e. how well these variables predict the house prices


**So the interpretation of the data is important.**

The steps performed in multiple linear regression are as follows:
1. Reading, understanding and visualizing the data
2. Preparing the training and testing data split
3. Training the model
4. Residual analysis
5. Prediction and evaluation on the test dataset

# Step 1: Reading, understanding and visualizing the data

In [None]:
# standard imports for loading, plotting and modelling the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
from sklearn.metrics import r2_score

In [None]:
# ignoring all the warnings that we get
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Reading the data set
data_url = "/kaggle/input//housing-simple-regression/Housing.csv"

housing = pd.read_csv(data_url)
housing.head()

In [None]:
housing.info()

In [None]:
housing.shape

In [None]:
housing.describe()

In [None]:
# plotting the dataset
# numerical variables
sns.pairplot(housing)

In [None]:
# visualizing the categorical variables
# make a box plot of the continuous variable (price) against each categorical variable
plt.figure(figsize=(15,10))
plt.subplot(2,3,1)
sns.boxplot(x="mainroad",y="price",data=housing)

plt.subplot(2,3,2)
sns.boxplot(x="airconditioning",y="price",data=housing)

plt.subplot(2,3,3)
sns.boxplot(x="furnishingstatus",y="price",data=housing)

plt.subplot(2,3,4)
sns.boxplot(x="guestroom",y="price",data=housing)

plt.subplot(2,3,5)
sns.boxplot(x="basement",y="price",data=housing)

plt.subplot(2,3,6)
sns.boxplot(x="prefarea",y="price",data=housing)

The box plots show how the price distribution shifts across the levels of each categorical variable, i.e. how much of the variance in price each one helps to explain.

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(x="furnishingstatus",hue="airconditioning", y="price",data=housing)

In [None]:
# we can infer that having air conditioning usually increases the price of a house compared to not having it

# Step 2: Preparing the data for modeling

- Encoding
    - binary categorical variables to 0 and 1
    - other categorical variables to dummy variables (one hot encoding)
- Splitting into train and test datasets
- Rescaling some of the variables


In [None]:
# convert the binary categorical variables to 1 or 0 (one hot encoding would also work)
housing.columns

In [None]:
var_list = ['mainroad','guestroom', 'basement',
            'hotwaterheating', 'airconditioning', 'prefarea']

# one way of doing it is shown below
# for var in var_list:
#     housing[var] = housing[var].apply(lambda x: 1 if x=="yes" else 0)

# it can also be done by subsetting the dataset
housing[var_list] = housing[var_list].apply(lambda x: x.map({"yes":1,"no":0}))
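One thing worth checking after a `map`-based encoding is whether any label failed to match the mapping. The sketch below uses a made-up toy frame (not the housing data) to show that unmatched labels silently become `NaN`.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "basement": ["no", "no", "Yes"],  # note the stray capitalised label
})

binary_cols = ["mainroad", "basement"]
encoded = toy[binary_cols].apply(lambda col: col.map({"yes": 1, "no": 0}))

# Labels missing from the mapping become NaN instead of raising an error,
# so count them before modelling.
unmapped = encoded.isna().sum()
print(encoded)
print(unmapped)
```

If `unmapped` is non-zero for any column, the labels probably need cleaning (for example lower-casing) before the mapping is applied.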

In [None]:
housing.head()

In [None]:
# converting furnishing status to dummy variables (one hot encoding)
status = pd.get_dummies(housing['furnishingstatus'])
status.head()

We can drop the `furnished` column: two dummy columns are enough to represent the three levels, using the following structure.

Identification (`semi-furnished`, `unfurnished`):
- 00 furnished
- 10 semi-furnished
- 01 unfurnished
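The encoding table above can be reproduced on a small made-up series. This is only an illustration; `status_toy` is a hypothetical name, and `dtype=int` is used so the dummies print as 0/1 rather than booleans.

```python
import pandas as pd

status_toy = pd.Series(
    ["furnished", "semi-furnished", "unfurnished", "furnished"],
    name="furnishingstatus",
)

# drop_first=True drops the alphabetically first level ("furnished"),
# which becomes the baseline encoded as all zeros.
dummies = pd.get_dummies(status_toy, drop_first=True, dtype=int)
print(dummies)
```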

In [None]:
# dropping the redundant column; dtype=int keeps the dummies numeric (0/1) for statsmodels
status = pd.get_dummies(housing['furnishingstatus'],drop_first=True,dtype=int)
status.head()

In [None]:
# concatenate the dummy columns to housing

housing = pd.concat([housing,status],axis=1)

In [None]:
housing = housing.drop(['furnishingstatus'],axis=1)
housing.head()

In [None]:
# performing the train test split
df_train, df_test = train_test_split(housing,train_size=0.7,random_state=100)
print(df_train.shape)
print(df_test.shape)

## Rescaling the Features

1. Min-max scaling (normalization): rescales each variable to lie between 0 and 1
2. Standardization: rescales the data so that mean = 0 and std = 1
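Both rescaling options can be written out by hand in a few lines. The sketch below uses made-up area values and plain NumPy rather than the sklearn scalers, just to make the two formulas concrete.

```python
import numpy as np

area = np.array([1500.0, 3000.0, 4500.0, 6000.0])  # made-up values

# Min-max scaling: (x - min) / (max - min), result lies in [0, 1].
area_minmax = (area - area.min()) / (area.max() - area.min())

# Standardization: (x - mean) / std, result has mean 0 and std 1.
area_std = (area - area.mean()) / area.std()

print(area_minmax)
print(area_std)
```

Min-max scaling keeps the binary 0/1 columns on the same range as the rescaled numeric columns, which is one reason it is a convenient choice for this dataset.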


In [None]:
housing.columns

In [None]:
# sklearn preprocessing classes such as MinMaxScaler expose three main methods
# fit(): learns the parameters, here just the min and max of each column
# transform(): applies (x - xmin) / (xmax - xmin)
# fit_transform(): does the above two in one step

In [None]:
# create the scaler object
scaler = MinMaxScaler()

# list of the numeric variables to rescale
scaler_list = ['area','bedrooms','bathrooms','stories','parking','price']

# fit the scaler on the training set and transform it
df_train[scaler_list] = scaler.fit_transform(df_train[scaler_list])
df_train.head()

# Step 3: Modelling or Training the model

In [None]:
# how many of the features should we choose for optimal model training?

In [None]:
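# plotting a heat map to understand the correlation among the features
plt.figure(figsize=(14,10))
sns.heatmap(df_train.corr(), annot= True, cmap='YlGnBu')
# widen the y-limits by half a cell so the top and bottom rows are not cut off
# (a workaround that is only needed on some matplotlib versions)
b, t = plt.ylim()
b += 0.5
t -= 0.5
plt.ylim(b, t)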

In [None]:
y_train = df_train.pop('price')
X_train = df_train
X_train.head()

In [None]:
# for every new feature variable we add, we check the following:
# - the significance of the variable
# - if the variable is correlated with others, we look at its VIF

# using only area for now
X_train_sm = sm.add_constant(X_train['area'])

lr = sm.OLS(y_train, X_train_sm)

lr_model = lr.fit()

In [None]:
lr_model.params

In [None]:
lr_model.summary()

### Summary Evaluation

- The p-values of the coefficients are low, which means the coefficients are significant
- R-squared is 28.3 percent, which means the model explains about 28 percent of the variance in price using area alone
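To make the "variance explained" reading of R-squared concrete, here is a minimal sketch that computes it by hand on synthetic data (not the housing data). For a single predictor with an intercept it equals the squared correlation between predictor and target.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 0.5 * x + rng.normal(0, 0.2, size=200)

# Fit y = b0 + b1 * x by least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

ss_res = np.sum((y - y_hat) ** 2)        # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```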

In [None]:
# now we add another variable and look at the result

X_train_sm = X_train[['area','bathrooms']]
X_train_sm = sm.add_constant(X_train_sm)

lr = sm.OLS(y_train,X_train_sm)

lr_model = lr.fit()

In [None]:
lr_model.summary()

### Summary Evaluation
We have 2 independent variables now

- The p-values of the coefficients are low, which means the coefficients are significant
- R-squared is 48 percent, which means the model explains 48 percent of the variance in price with the given variables
- This is a better model than the previous one

In [None]:
# now we add another variable: bedrooms


X_train_sm = X_train[['area','bathrooms','bedrooms']]
X_train_sm = sm.add_constant(X_train_sm)

lr = sm.OLS(y_train,X_train_sm)

lr_model = lr.fit()
lr_model.summary()

### The approach above adds features one at a time (bottom-up)
### We can also go top-down: build the model using all the features and then look at the VIF and p-values to drop the unnecessary variables

In [None]:
# now we add all the variables

X_train_sm = sm.add_constant(X_train)

lr = sm.OLS(y_train,X_train_sm)

lr_model = lr.fit()
lr_model.summary()

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
def vif_calculate(X_df):
    vif = pd.DataFrame()
    vif['Features'] = X_df.columns
    vif['vif'] = [variance_inflation_factor(X_df.values,i) for i in range(X_df.shape[1])]
    vif['vif'] = round(vif['vif'],2)
    vif = vif.sort_values(by="vif",ascending=False)
    return vif

In [None]:
def train_model(y_train,X_train):
    X_train_sm = sm.add_constant(X_train)

    lr = sm.OLS(y_train,X_train_sm)

    lr_model = lr.fit()
    return lr_model

In [None]:
vif_calculate(X_train)

In [None]:
# we usually stick with VIF less than 5

We could have
- high p-value, high VIF = drop it
- high-low
    - high p-value, low VIF (remove these first)
    - low p-value, high VIF (after removing the above, recompute the VIF and check again)
- low p-value, low VIF = keep it
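The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it by hand on synthetic data; `vif_by_hand` is a hypothetical helper, and it adds an intercept to each auxiliary regression. statsmodels' `variance_inflation_factor` does not add one itself, so its numbers can differ from this sketch when the input has no constant column.

```python
import numpy as np

def vif_by_hand(X):
    """Return 1 / (1 - R^2) for each column regressed on the others."""
    n, p = X.shape
    vifs = []
    for i in range(p):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept included
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of a
c = rng.normal(size=500)                  # unrelated to a and b
vifs = vif_by_hand(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

The two near-duplicate columns get a very large VIF while the unrelated column stays close to 1, which is the pattern the elimination rules above are meant to catch.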

In [None]:
# calculate everything once again and eliminate features based on the above rules

In [None]:
X_train = X_train.drop('semi-furnished',axis=1)
X_train.head()

In [None]:
train_model(y_train,X_train).summary()

In [None]:
vif_calculate(X_train)

In [None]:
# repeating the elimination step
# bedrooms has a high p-value, so we eliminate it
X_train = X_train.drop('bedrooms',axis=1)
X_train_sm = sm.add_constant(X_train)
lr_model = train_model(y_train,X_train)
lr_model.summary()

In [None]:
vif_calculate(X_train) # now almost every feature has a VIF below 5, so this could be our final model

# Step 4: Residual Analysis


In [None]:
y_train_pred = lr_model.predict(X_train_sm)

res = y_train - y_train_pred
res

In [None]:
# distribution of the error terms: it should be approximately normal and centred at zero
# (histplot replaces the deprecated distplot)
sns.histplot(res, kde=True)
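Alongside the plot, a couple of numbers can summarise the residuals. The sketch below uses synthetic data and a plain least-squares fit (not the notebook's `lr_model`) to show the mean and a simple skewness estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.uniform(size=n), rng.uniform(size=n)])
y = X @ np.array([0.1, 0.6, 0.3]) + rng.normal(scale=0.05, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
res = y - X @ beta

# With an intercept in the model the residuals average to zero by construction,
# so the informative checks are symmetry and tail behaviour.
res_mean = res.mean()
skewness = np.mean((res - res_mean) ** 3) / res.std() ** 3
print(res_mean, skewness)
```

A skewness far from zero would suggest the error terms are not symmetric, which the histogram should also make visible.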

# Step 5: Prediction and Evaluation of our model

In [None]:
# the same transformation needs to be applied to the test set as well
# we never call fit() on the test set
# we only call transform(), using the parameters learned from the training set

df_test[scaler_list] = scaler.transform(df_test[scaler_list])
df_test.head()
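Because the scaler's min and max come from the training set only, scaled test values are not guaranteed to lie in [0, 1]. The sketch below reproduces the fit/transform split by hand on made-up numbers to show this.

```python
import pandas as pd

train = pd.DataFrame({"area": [2000.0, 4000.0, 6000.0]})
test = pd.DataFrame({"area": [1000.0, 5000.0, 7000.0]})

# "fit": learn min and max from the training data only.
col_min, col_max = train["area"].min(), train["area"].max()

# "transform": reuse those training statistics on both sets.
train_scaled = (train["area"] - col_min) / (col_max - col_min)
test_scaled = (test["area"] - col_min) / (col_max - col_min)
print(test_scaled.tolist())  # [-0.25, 0.75, 1.25]
```

So a minimum slightly below 0 or a maximum slightly above 1 in the scaled test set is expected behaviour, not a bug.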

In [None]:
df_test.describe()

In [None]:
y_test = df_test.pop('price')
X_test = df_test

In [None]:
X_test_sm = sm.add_constant(X_test)
X_test_sm

In [None]:
X_test_sm = X_test_sm.drop(['semi-furnished','bedrooms'],axis=1)
X_test_sm

In [None]:
y_test_pred = lr_model.predict(X_test_sm)
y_test_pred

In [None]:
r2_score(y_true=y_test,y_pred=y_test_pred)

#### Whatever the model has learned on the training set generalizes well to the test set, because the R-squared score is about the same for both

# We can also use RFE - Recursive feature elimination

In [None]:
import sklearn
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [None]:
df_train, df_test =train_test_split(housing,train_size=0.7,random_state=100)

In [None]:
scaler = MinMaxScaler()

In [None]:
scaler_list = ['area','bedrooms','bathrooms','stories','parking','price']

# fit the scaler on the training set and transform it
df_train[scaler_list] = scaler.fit_transform(df_train[scaler_list])
df_train.head()

In [None]:
y_train = df_train.pop('price')
X_train = df_train

In [None]:
# setting the dimensions of the y variable
y_train = y_train.values.reshape(-1,1)

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)  # features first, target second

In [None]:
rfe = RFE(lr, n_features_to_select=10) # we choose to keep the best 10 variables

In [None]:
rfe = rfe.fit(X_train,y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
# our 10 best features would be
X_train.columns[rfe.support_].tolist() # RFE ranks these variables as the most important for the model
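The idea behind recursive feature elimination can be sketched by hand: fit the model, drop the feature with the smallest coefficient magnitude, and repeat. The code below is a simplified illustration on synthetic data, not sklearn's implementation; `rfe_by_hand` is a hypothetical helper, and comparing coefficients this way only makes sense when the features share a scale.

```python
import numpy as np

def rfe_by_hand(X, y, names, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        weakest = int(np.argmin(np.abs(beta[1:])))  # skip the intercept
        keep.pop(weakest)
    return [names[i] for i in keep]

rng = np.random.default_rng(3)
X = rng.uniform(size=(400, 4))            # features already on a 0-1 scale
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=400)
selected = rfe_by_hand(X, y, ["f0", "f1", "f2", "f3"], n_keep=2)
print(selected)
```

The two features that actually drive `y` survive, while the pure-noise features are eliminated first.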

In [None]:
columns_to_drop = X_train.columns[~rfe.support_]
columns_to_drop 

In [None]:
X_train.drop(columns_to_drop,axis=1,inplace=True)

- We can again use statsmodels to check whether these features are good or not
- We can also use the VIF values of the features to understand the multicollinearity

In [None]:
vif_calculate(X_train)

In [None]:
# using statsmodels

X_train_sm = sm.add_constant(X_train)

# pass the design matrix that includes the constant term
lr = sm.OLS(y_train,X_train_sm)

lr_model = lr.fit()
lr_model.summary()

### Summary Evaluation of the OLS model fit

- We can see that the VIF value of the bedrooms variable is high and its p-value is also high
- Drop bedrooms and generate the summary and VIF values again

In [None]:
X_train = X_train.drop('bedrooms',axis=1)

In [None]:
vif_calculate(X_train)

In [None]:
X_train_sm = X_train_sm.drop('bedrooms',axis=1)

In [None]:
lr = sm.OLS(y_train,X_train_sm)
lr_model = lr.fit()
lr_model.summary()

# Insights and inferences

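- Even after dropping the bedrooms variable we still get a high value of R-squared
- The p-values of all the coefficients are very small, which means the coefficients are significant
- The VIF values of the variables are below 5
- The summary reported an R-squared of almost 91 percent. This figure depends on whether the constant term is part of the design matrix passed to `sm.OLS`: without a constant, statsmodels reports an uncentered R-squared, which is not comparable with the values from Step 3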
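When comparing R-squared values across models, it helps to know that a fit without an intercept is scored against the total sum of squares around zero, which usually gives a much larger number. The sketch below shows the gap on synthetic data using plain NumPy; it is an illustration of the two definitions, not a re-run of the housing model.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(size=300)
y = 0.3 + 0.2 * x + rng.normal(scale=0.1, size=300)

# Model with intercept, scored with the usual (centered) R^2.
X_const = np.column_stack([np.ones_like(x), x])
beta_c, *_ = np.linalg.lstsq(X_const, y, rcond=None)
r2_centered = 1 - np.sum((y - X_const @ beta_c) ** 2) / np.sum((y - y.mean()) ** 2)

# Model without intercept, scored with the uncentered R^2 (total sum of
# squares taken around zero instead of around the mean).
X_noconst = x.reshape(-1, 1)
beta_u, *_ = np.linalg.lstsq(X_noconst, y, rcond=None)
r2_uncentered = 1 - np.sum((y - X_noconst @ beta_u) ** 2) / np.sum(y ** 2)

print(round(r2_centered, 3), round(r2_uncentered, 3))
```

The uncentered value is far higher even though the underlying relationship is the same, so the two kinds of R-squared should not be compared directly.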

#### Question
How come dropping bedrooms makes the model better when it seems to be a variable of high significance?
- The bedrooms variable must be highly correlated with some other variable, which is why we can eliminate it

Suppose we have 3 variables X, Y, Z that all take the same values (i.e. they are perfectly correlated). Then

10X = 9X + 1Y = 8X + 1Y + 1Z = ... etc.

so it doesn't matter which of the variables you choose if they are so similar in nature.

It is possible that bedrooms plays the role of Y and area the role of X. We can then remove Y and get a larger coefficient for X, which compensates for the removed variable.
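This argument can be checked numerically. In the sketch below (synthetic data, hypothetical helper `fit_r2`), `x` and `z` are two nearly identical predictors and `y` is the target. Dropping `z` barely changes R-squared, and the coefficient of `x` grows to absorb the effect of the removed variable.

```python
import numpy as np

def fit_r2(X, y):
    """Least-squares fit with intercept; returns (slopes, centered R^2)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return beta[1:], 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(5)
x = rng.normal(size=500)
z = x + rng.normal(scale=0.05, size=500)      # almost identical to x
y = 1.0 * x + 1.0 * z + rng.normal(scale=0.5, size=500)

coef_both, r2_both = fit_r2(np.column_stack([x, z]), y)
coef_x_only, r2_x_only = fit_r2(x.reshape(-1, 1), y)
print(coef_both, round(r2_both, 3))
print(coef_x_only, round(r2_x_only, 3))
```

With both predictors the individual coefficients are unstable, while the single-predictor model recovers a coefficient close to 2, the sum of the two true effects.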
