### Multiple Linear Regression
## Housing Case Study

#### Problem Statement:

Consider a real estate company that has a dataset containing the prices of properties in the Delhi region. It wishes to use the data to optimise the sale prices of the properties based on important factors such as area, bedrooms, parking, etc.

Essentially, the company wants —


- To identify the variables affecting house prices, e.g. area, number of rooms, bathrooms, etc.

- To create a linear model that quantitatively relates house prices with variables such as number of rooms, area, number of bathrooms, etc.

- To know the accuracy of the model, i.e. how well these variables can predict house prices.

**So interpretation is important!**

## Step 1: Reading and Understanding the Data

Let us first import NumPy and pandas and read the housing dataset.

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd

In [None]:
housing = pd.read_csv("C:/Users/Phanendra.varma.UPGRAD/Downloads/Housing.csv")

Question 1

Inspect the various aspects of the housing dataset and identify the mean of the area feature in the dataset

- 5150
- 545
- 510
- 296

In [None]:
# Enter code here
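
The usual inspection calls can be sketched on a small made-up frame. The column values below are illustrative assumptions, not taken from `Housing.csv`; the same calls apply to `housing`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up)
toy = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "area": [10, 20, 30, 40],
    "mainroad": ["yes", "no", "yes", "yes"],
})

print(toy.shape)           # (rows, columns)
toy.info()                 # column dtypes and non-null counts
summary = toy.describe()   # count, mean, std, min, quartiles, max of numeric columns
print(summary)

# The mean of one column can be read from the summary or computed directly
area_mean = toy["area"].mean()
```

`describe()` only covers numeric columns by default, so string columns such as `mainroad` need `value_counts()` or `describe(include="all")`.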

## Step 2: Visualising the Data

Let's now spend some time doing what is arguably the most important step - **understanding the data**.
- If there is some obvious multicollinearity going on, this is the first place to catch it
- Here's where you'll also identify if some predictors directly have a strong association with the outcome variable

We'll visualise our data using `matplotlib` and `seaborn`.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#### Visualising Numeric Variables

Let's make a pairplot of all the numeric variables. 



Question 2

Identify the plot that best represents the relationship between the area and price columns

1. ---Refer to the Image provided on Platform---
2. ---Refer to the Image provided on Platform---
3. ---Refer to the Image provided on Platform---
4. ---Refer to the Image provided on Platform---

In [None]:
# Enter code here

#### Visualising Categorical Variables

As you might have noticed, there are a few categorical variables as well. Let's make a boxplot for some of these variables.




Question 3

What does a box plot of price against mainroad look like?

1. ---Refer to the Image provided on Platform---
2. ---Refer to the Image provided on Platform---
3. ---Refer to the Image provided on Platform---
4. ---Refer to the Image provided on Platform---

In [None]:
#Enter code here


We can also visualise two of these categorical features in the same plot by using the `hue` argument. Below is the plot for `furnishingstatus` with `airconditioning` as the hue.

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'furnishingstatus', y = 'price', hue = 'airconditioning', data = housing)
plt.show()

## Step 3: Data Preparation

- You can see that your dataset has many columns with the values 'yes' or 'no'.

- To fit a regression line, we need numerical values rather than strings. Hence, we convert them to 1s and 0s, where 1 stands for 'yes' and 0 for 'no'.

Question 4

Identify the columns that can be part of `varlist`

- guestroom
- basement
- hotwaterheating
- bathrooms


In [None]:
# Enter code here ------ Define the List of variables to map

varlist =  []


In [None]:

# Defining the map function
def binary_map(x):
    return x.map({'yes': 1, "no": 0})

# Applying the function to the housing list
housing[varlist] = housing[varlist].apply(binary_map)

In [None]:
# Check the housing dataframe now

housing.head()
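
One thing worth checking after the mapping: `Series.map` with a dictionary returns `NaN` for any value that is not a key, so a differently cased or misspelt label silently becomes missing. A minimal sketch on a made-up column (the column name `flag` is hypothetical):

```python
import pandas as pd

def binary_map(x):
    return x.map({"yes": 1, "no": 0})

# Hypothetical column with one label that does not match the dictionary keys
toy = pd.DataFrame({"flag": ["yes", "no", "Yes"]})
mapped = toy[["flag"]].apply(binary_map)

# Count the values that were not mapped; anything above 0 needs cleaning first
n_unmapped = int(mapped["flag"].isna().sum())
print(mapped)
print(n_unmapped)
```

Running `housing[varlist].isna().sum()` after the mapping gives the same check on the real columns.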

### Dummy Variables

The variable `furnishingstatus` has three levels. We need to convert these levels into integers as well.

For this, we will use something called `dummy variables`.

You can read more about dummy variable here - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

In [None]:
# Get the dummy variables for the feature 'furnishingstatus' and store it in a new variable - 'status'
# dtype = int gives 0/1 columns (recent pandas versions return True/False by default)
status = pd.get_dummies(housing['furnishingstatus'], dtype = int)

In [None]:
# Check what the dataset 'status' looks like
status.head()

Now, you don't need three columns. You can drop the `furnished` column, as the type of furnishing can be identified with just the last two columns where — 
- `00` will correspond to `furnished`
- `01` will correspond to `unfurnished`
- `10` will correspond to `semi-furnished`
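
The encoding can be checked on a three-row series holding one example of each level. This is a sketch on made-up rows; `dtype=int` is passed explicitly so the result shows 0/1 rather than booleans.

```python
import pandas as pd

levels = pd.Series(
    ["furnished", "semi-furnished", "unfurnished"], name="furnishingstatus"
)

# Dummy columns come out in sorted order, so drop_first=True removes 'furnished'
dummies = pd.get_dummies(levels, drop_first=True, dtype=int)

# Put each level next to its two-column code
encoded = pd.concat([levels, dummies], axis=1)
print(encoded)
```

The remaining columns are `semi-furnished` and `unfurnished`, in that order, which is where the `00`, `10` and `01` codes come from.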

In [None]:
# Let's drop the first column from status df using 'drop_first = True'
# dtype = int keeps the dummies numeric (0/1) so they can be passed to the regression
status = pd.get_dummies(housing['furnishingstatus'], drop_first = True, dtype = int)

In [None]:
# Add the results to the original housing dataframe
housing = pd.concat([housing, status], axis = 1)

In [None]:
# Drop 'furnishingstatus' as we have created the dummies for it
housing.drop(['furnishingstatus'], axis = 1, inplace = True)

Question 5

Suppose another categorical column has five different labels. Using dummy variables on that column would add a minimum of ___ columns to the dataset.

- 2
- 6
- 5
- 4

## Step 4: Splitting the Data into Training and Testing Sets

As you know, the first basic step for regression is performing a train-test split.

In [None]:
from sklearn.model_selection import train_test_split

# random_state fixes the shuffle, so the train and test sets contain the same rows on every run
# (np.random.seed(0) only affects later NumPy randomness; train_test_split does not need it)
np.random.seed(0)
df_train, df_test = train_test_split(housing, train_size = 0.7, test_size = 0.3, random_state = 100)
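
A quick way to confirm that the split is reproducible is to run it twice on a toy frame and compare the row indices. The sketch below uses made-up data with 100 rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

# Same random_state twice: both calls should pick exactly the same rows
train_a, test_a = train_test_split(toy, train_size=0.7, test_size=0.3, random_state=100)
train_b, test_b = train_test_split(toy, train_size=0.7, test_size=0.3, random_state=100)

same_rows = train_a.index.equals(train_b.index) and test_a.index.equals(test_b.index)
no_overlap = set(train_a.index).isdisjoint(test_a.index)
print(len(train_a), len(test_a), same_rows, no_overlap)
```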

### Rescaling the Features 

Rescaling does not change the fit of a linear regression model: the predictions and the R-squared stay the same. What it changes is the size of the coefficients. Here, every column except `area` holds small integer values, so without rescaling some coefficients would come out very large or very small compared with the others, which makes them awkward to compare at the time of model evaluation. It is therefore advisable to standardise or normalise the variables so that the coefficients are all on a comparable scale. As you know, there are two common ways of rescaling:

1. Min-Max scaling 
2. Standardisation (mean-0, sigma-1) 

This time, we will use MinMax scaling.
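
Both rescaling methods are short formulas, and writing them out by hand next to the scikit-learn scalers makes the difference concrete. The values below are made up and only meant to resemble an `area`-like column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1650.0], [3000.0], [6000.0], [16200.0]])  # made-up values

# Min-Max scaling: (x - min) / (max - min), so values land in [0, 1]
minmax_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Standardisation: (x - mean) / std, giving mean 0 and standard deviation 1
# (StandardScaler uses the population standard deviation, i.e. ddof=0)
standard_manual = (x - x.mean(axis=0)) / x.std(axis=0)

minmax_sklearn = MinMaxScaler().fit_transform(x)
standard_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(minmax_manual, minmax_sklearn))
print(np.allclose(standard_manual, standard_sklearn))
```

Min-Max scaling pins the smallest value to 0 and the largest to 1, so a single extreme value compresses everything else; standardisation has no fixed range.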

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
# Apply scaler() to all the columns except the 'yes-no' and 'dummy' variables
num_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking','price']

df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
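
The effect of Min-Max scaling on a fitted model can be checked with a small experiment. The sketch below uses made-up data and a plain NumPy least-squares fit (not the `statsmodels` model built later); the helper names `fit_r2` and `minmax` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
area = rng.uniform(1500, 16000, size=50)            # made-up values on a large scale
rooms = rng.integers(1, 6, size=50).astype(float)   # small integers
price = 300.0 * area + 50000.0 * rooms + rng.normal(0, 1e5, size=50)

def fit_r2(features, target):
    """Least-squares fit with an intercept; returns (coefficients, R-squared)."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r2 = 1 - residuals.var() / target.var()
    return coef, r2

def minmax(values):
    return (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

raw = np.column_stack([area, rooms])
coef_raw, r2_raw = fit_r2(raw, price)
coef_scaled, r2_scaled = fit_r2(minmax(raw), minmax(price))

print(coef_raw[1:], coef_scaled[1:])   # slopes differ by orders of magnitude before scaling
print(r2_raw, r2_scaled)               # R-squared is the same in both fits
```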

Question 6

Identify the correlation coefficient between the area and price columns

- 0.53
- 0.32
- 0.52
- 0.43

In [None]:
# Enter code here --- check the correlation coefficients to see which variables are highly correlated
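
The mechanics of a correlation check can be sketched on a made-up frame; the same calls work on `df_train`. The column `age` is a hypothetical example and not part of the housing data.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "area": [2.0, 4.0, 6.0, 8.0, 10.0],   # rises in step with price
    "age": [5.0, 4.0, 3.0, 2.0, 1.0],     # falls in step with price
})

corr = toy.corr()   # pairwise Pearson correlation matrix

# Correlation of every other column with price, strongest positive first
with_price = corr["price"].drop("price").sort_values(ascending=False)
print(with_price)
# sns.heatmap(corr, annot=True) would draw the same matrix as a colour-coded grid
```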


As you might have noticed, `area` seems to be correlated with `price` the most. Let's see a pairplot for `area` vs `price`.

We pick `area` as the first variable and we'll try to fit a regression line to that.

### Dividing into X and Y sets for the model building

In [None]:
y_train = df_train.pop('price')
X_train = df_train

## Step 5: Building a linear model

Build a linear regression model on the training data using `statsmodels`.

Question 7

Build a linear regression model to predict the price of a house using the two input variables that have the highest correlation with price. What is the R-squared of the model?

- 0.681
- 0.480
- 0.759
- 0.915



In [None]:
import statsmodels.api as sm

# Enter code here

### Adding all the variables to the model

Question 8

Build a linear regression model (name it `lr_1`) to predict the price of a house using all the features of the housing dataframe. From the summary statistics, identify which of these variables are insignificant at a significance level of 0.05.

- Parking
- stories
- main road
- semifurnished


In [None]:
# Enter code here
# Build a linear model as lr_1


In [None]:
print(lr_1.summary())

Looking at the p-values, it looks like some of the variables aren't really significant (in the presence of other variables). We could simply drop the variable with the highest, non-significant p value. A better way would be to supplement this with the VIF information. 

### Checking VIF



In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs

vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
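
To see what a VIF measures, the textbook definition can be written out by hand: regress one feature on all the others and compute 1 / (1 - R²). The sketch below does this with NumPy on synthetic data and includes an intercept in each auxiliary regression. Note that `variance_inflation_factor` uses the columns exactly as passed, so on a matrix without a constant column its values can differ from this intercept-based version.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1, x2, x4 = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)   # nearly a linear combination of x1 and x2
X_toy = np.column_stack([x1, x2, x3, x4])

def vif_with_intercept(X, i):
    """Textbook VIF: 1 / (1 - R^2) from regressing column i on the others plus an intercept."""
    target = X[:, i]
    others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    residuals = target - others @ coef
    r2 = 1 - residuals.var() / target.var()
    return 1.0 / (1.0 - r2)

vifs = [vif_with_intercept(X_toy, i) for i in range(X_toy.shape[1])]
print(np.round(vifs, 2))
```

The independent column `x4` ends up close to 1, while `x3` and the two columns it is built from are heavily inflated.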

### Dropping the variable and updating the model

As you can see from the summary statistics and the VIF dataframe, some variables are insignificant and some have a high VIF. Let's build a second linear regression model after dropping the variable with the highest VIF and the variable that is least significant.


In [None]:
# Dropping highly correlated variables and insignificant variables
X = X_train
X = X.drop('semi-furnished', axis = 1)
X = X.drop('bedrooms', axis = 1)

In [None]:
# Build a second fitted model as lr_2

X_train_lm = sm.add_constant(X)

lr_2 = sm.OLS(y_train, X_train_lm).fit()

In [None]:
# Print the summary of the model
print(lr_2.summary())

In [None]:
# Calculate the VIFs again for the new model

vif = pd.DataFrame()
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
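
Since the VIF table is recomputed after every change to the feature set, the cell can be wrapped in a small function. `vif_table` is a hypothetical helper name, and the frame below is synthetic.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    """Hypothetical helper: VIF per column of X, sorted from highest to lowest."""
    values = X.values.astype(float)
    table = pd.DataFrame({
        "Features": X.columns,
        "VIF": [variance_inflation_factor(values, i) for i in range(values.shape[1])],
    })
    table["VIF"] = table["VIF"].round(2)
    return table.sort_values(by="VIF", ascending=False).reset_index(drop=True)

rng = np.random.default_rng(4)
toy = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
toy["c"] = toy["a"] + toy["b"] + 0.1 * rng.normal(size=100)   # nearly collinear with a and b

toy_vif = vif_table(toy)
print(toy_vif)
```

With this in place, each model iteration needs only `vif_table(X)`.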

### Dropping the variable and updating the model

As you might have noticed, all the VIF values are now under 5. But from the summary statistics, we can still see that some variables have a slightly higher p-value. We should drop those variables as well.

In [None]:
#drop the variable with the p value = 0.030

X = X_train
X = X.drop('basement', axis = 1)
X = X.drop('semi-furnished', axis = 1)
X = X.drop('bedrooms', axis = 1)

In [None]:
# Build a third fitted model as lr_3

X_train_lm = sm.add_constant(X)
lr_3 = sm.OLS(y_train, X_train_lm).fit()


In [None]:
# print the summary of the model
print(lr_3.summary())

## Step 7: Residual Analysis of the train data

Now, to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot the histogram of the error terms and see what it looks like.
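
A few numbers can complement the visual check. The sketch below uses synthetic data with normally distributed noise and a plain NumPy least-squares fit (not `lr_3`), and computes the bin counts behind a 20-bin histogram along with two simple summaries of the residuals.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=300)
y = 0.2 + 0.5 * x + rng.normal(scale=0.05, size=300)   # made-up data with normal noise

design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coef

counts, edges = np.histogram(residuals, bins=20)   # the numbers behind a 20-bin histogram
resid_mean = residuals.mean()                      # ~0 whenever an intercept is fitted
within_2sd = np.mean(np.abs(residuals) < 2 * residuals.std())   # ~0.95 for normal errors

print(counts)
print(resid_mean, within_2sd)
```

A residual mean of zero is guaranteed by the intercept, so the shape of the histogram (roughly symmetric, single peak around zero) is what carries the information about normality.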

In [None]:
# By using the final dataframe that contains the necessary columns (list of columns used for building model 3) you can make predictions on the training dataset
y_train_price = lr_3.predict(X_train_lm)

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.histplot((y_train - y_train_price), bins = 20, kde = True)
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label