# Boston House Price Prediction

*In this project we use machine learning to predict house prices in Boston, Massachusetts (US).*

*The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970.*

*Each house is described by several features, and the goal is to predict its value as accurately as possible.*

# 0. Overview

Below is a step-by-step overview of the whole project.


- 1. Importing Libraries


- 2. Exploring Dataset
    - 2.1. We will be importing the dataset using Pandas library.
    - 2.2. Finding variables which are useful for prediction.


- 3. Univariate and Multivariate Analysis  
    - 3.1 MEDV
    - 3.2 TAX
    - 3.3 PTRATIO
    - 3.4 LSTAT
    - 3.5 RM


- 4. Splitting Dataset into Train and Test Set


- 5. Multiple Linear Regression
    - 5.1 Model Preparation
    - 5.2 Model Evaluation
    - 5.3 Model Interpretation


- 6. Decision Tree
    - 6.1 Model Preparation
    - 6.2 Model Interpretation


- 7. Random Forest
    - 7.1 Model Preparation
    - 7.2 Model Interpretation


- 8. Conclusion

---

# 1. Importing Libraries

First we import the main libraries used throughout the project. Any other library is imported where it is first needed.

In [None]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings #to remove warning from the notebook
warnings.filterwarnings(action='ignore')

# 2. Exploring Dataset

## 2.1 Loading Dataset

Here we load the **Boston House Price** dataset and take a first look at it.

In [None]:
#loading dataset
name= ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df = pd.read_csv(filepath_or_buffer="../input/boston-house-prices/housing.csv",sep=r"\s+",names=name) #whitespace-separated file without a header row
df.head()

The **Boston House Price** dataset has 14 columns (13 predictors plus the target), described as follows:
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's

The main thing to notice is that **MEDV** is the outcome variable we need to predict; all other variables are predictors.

In [None]:
#shape of our dataset
df.shape

This dataset has 14 columns and 506 rows, i.e. 506 observations.

In [None]:
#information about the data
df.info()

We can see that all columns in the dataset are numeric, either float or int. There is no categorical variable, which makes our life a little easier here :)

In [None]:
#checking for missing data
df.isnull().sum()
#there is no missing value in the data

There are *no missing* values in the dataset, which again reduces our workload. Cheers!

## 2.2 Finding variables which are useful for prediction

In [None]:
plt.figure(figsize=(12,12))
sns.heatmap(data=df.corr().round(2),annot=True,cmap='coolwarm',linewidths=0.2,square=True)

The big colorful picture above, called a *heatmap*, helps us understand how the features are correlated with each other.
- A positive value implies a positive correlation between two features, whereas a negative value implies a negative correlation.


- I am interested in which features correlate well with our dependent variable MEDV and can therefore help produce good predictions.


- I observed that INDUS, RM, TAX, PTRATIO and LSTAT show a reasonably strong correlation with MEDV, and I want to know more about them.


- However, INDUS also correlates strongly with TAX and LSTAT, which is a pain point for us :(
  
  because it leads to **multicollinearity**. So I decided NOT to use this feature and to continue the analysis with the remaining four features.
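
The screening described above can also be done programmatically. Below is a minimal sketch on synthetic data (not the Boston data); the helper name `screen_features` and both thresholds are assumptions made for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

def screen_features(data, target, min_target_corr=0.3, max_pair_corr=0.7):
    """Return (candidates, collinear_pairs) based on absolute Pearson correlation."""
    corr = data.corr()
    # Correlation of each predictor with the target, strongest first
    target_corr = corr[target].drop(target).abs().sort_values(ascending=False)
    candidates = list(target_corr[target_corr >= min_target_corr].index)
    # Pairs of candidate predictors that are strongly correlated with each other
    collinear_pairs = [
        (a, b, round(float(corr.loc[a, b]), 2))
        for i, a in enumerate(candidates)
        for b in candidates[i + 1:]
        if abs(corr.loc[a, b]) >= max_pair_corr
    ]
    return candidates, collinear_pairs

# Synthetic demo data: x2 is almost a copy of x1, x3 is independent of both
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
demo = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3,
                     "y": 2 * x1 - 1.5 * x3 + rng.normal(scale=0.5, size=200)})

candidates, collinear_pairs = screen_features(demo, target="y")
print(candidates)
print(collinear_pairs)
```

Any pair that shows up in `collinear_pairs` is a candidate for dropping one of its two features, which is the same reasoning applied to INDUS above.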

In [None]:
#these features show a good to very good correlation with our target variable, house price (MEDV)
df1 = df[['RM','TAX','PTRATIO','LSTAT','MEDV']]
df1.head()

Now we have a new dataset containing only the variables selected from the heatmap.

In [None]:
sns.pairplot(data=df1)

The 5x5 grid of plots above helps us understand how each variable (feature) is distributed on its own and how it relates to the others.

**Observations**
- RM and MEDV look roughly normally distributed, while LSTAT is skewed to the right.


- RM and LSTAT show a fairly good linear relationship with MEDV.


- There seem to be some outliers in the dataset; we will study them shortly.

In [None]:
#description about data
desc = df1.describe().round(2)
desc

The table above shows summary statistics such as the mean and the median (50%). The count for every variable is 506.

**Observations**
- The maximum values of MEDV and LSTAT are much higher than their 75th percentiles, which is a warning sign for me.


- We will study each feature separately to see how the data is distributed and whether there are any outliers.

# 3. Univariate and Multivariate Analysis

## 3.1 MEDV

In [None]:
#Box Plot and Distribution Plot for Dependent variable MEDV
plt.figure(figsize=(20,3))

plt.subplot(1,2,1)
sns.boxplot(x=df1.MEDV,color='#005030')
plt.title('Box Plot of MEDV')

plt.subplot(1,2,2)
sns.distplot(a=df1.MEDV,color='#500050')
plt.title('Distribution Plot of MEDV')
plt.show()

From the two figures above we can observe that:
- MEDV is approximately normally distributed
- It contains some extreme values which could be potential outliers

Next we look at the data points that lie outside the whiskers.

**Potential outliers** are values below *Q1 - 1.5 × IQR* or above *Q3 + 1.5 × IQR*
- Q3 -> Quartile 3, below which 75% of the data lies
- Q1 -> Quartile 1, below which 25% of the data lies
- IQR -> Inter-Quartile Range, Q3 - Q1
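
Since the same whisker limits are computed for several features, the rule can be wrapped in a small helper. This is only a sketch on a toy series; `iqr_bounds` is a hypothetical name and is not used elsewhere in the notebook.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) whisker limits using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy series, not the Boston data
prices = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 50])
lower, upper = iqr_bounds(prices)

# Potential outliers lie below the lower limit or above the upper limit
outliers = prices[(prices < lower) | (prices > upper)]
print(lower, upper)
print(outliers.tolist())
```

Here Q1 = 13 and Q3 = 17, so the limits are 7 and 23 and only the value 50 is flagged.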

In [None]:
MEDV_Q3 = desc['MEDV']['75%']
MEDV_Q1 = desc['MEDV']['25%']
MEDV_IQR = MEDV_Q3 - MEDV_Q1
MEDV_UV = MEDV_Q3 + 1.5*MEDV_IQR
MEDV_LV = MEDV_Q1 - 1.5*MEDV_IQR

df1[df1['MEDV']<MEDV_LV]

**Observations:**
- For these two low house prices, TAX = 666, which is very high for a house with approximately 5 rooms.
- For these two low house prices, LSTAT is also high.

**Conclusion:**
- Both TAX and LSTAT are negatively correlated with MEDV, which means the higher the TAX and LSTAT, the lower the house price, and vice versa.
- Such low house prices are therefore plausible.
- Therefore, I will keep these data points.

In [None]:
df1[df1['MEDV']>MEDV_UV].sort_values(by=['MEDV','RM'])

**Observations:**
- For house prices = 50, the number of rooms ranges from about 5 to 9, which is quite unusual.
- Also, for these houses TAX ranges from low to high.
- For house prices from 37 to just under 50, RM is above the 75th percentile of the data. Since RM is positively correlated with MEDV, this could be the reason for the somewhat higher prices.
- Also, for these houses PTRATIO and LSTAT lie in the lower 25% to 50% of all observations. Since PTRATIO and LSTAT are negatively correlated with MEDV, this could also explain the somewhat higher prices.

**Conclusion:**
- I am going to DROP ALL entries with MEDV = 50, because I believe these entries are outliers and could hurt the quality of the predictions.
- I am going to keep all entries with MEDV from 37 to just under 50, since I did not observe any unusual behaviour for them.

In [None]:
print(f'Shape of dataset before removing outliers: {df1.shape}')
df2 = df1[df1['MEDV'] != 50].copy()  #copy so that later edits to df2 do not touch df1
print(f'Shape of dataset after removing outliers: {df2.shape}')

As we can see, we have deleted the 16 rows of our dataset that had MEDV = 50.

## 3.2 TAX

In [None]:
#Box Plot, Distribution Plot and Scatter Plot for TAX
plt.figure(figsize=(20,3))

plt.subplot(1,3,1)
sns.boxplot(x=df2.TAX,color='#005030')
plt.title('Box Plot of TAX')

plt.subplot(1,3,2)
sns.distplot(a=df2.TAX,color='#500050')
plt.title('Distribution Plot of TAX')

plt.subplot(1,3,3)
sns.scatterplot(x=df2.TAX,y=df2.MEDV)
plt.title('Scatter Plot of TAX vs MEDV')

plt.show()

From the three figures above we can observe that:
- TAX is NOT normally distributed
- Although the box plot does not show any outliers, there are some extreme TAX values in the dataset, which bothers me.
- Also, the scatter plot shows that for these extreme TAX values MEDV ranges from low to high.

In [None]:
temp_df = df2[df2['TAX']>600].sort_values(by=['RM','MEDV'])  #mask built from df2 itself so the indexes match
temp_df.shape

There are 132 entries in total, most with TAX = 666, which I think is the *DEVIL'S* number. Now let's dive deeper into them.

In [None]:
temp_df

In [None]:
temp_df.describe()

**Observations:**
- RM for these entries lies between 3.5 and 8.78.
- PTRATIO is the same for almost all of these entries and equals 20.20.
- LSTAT for these entries lies between 2.96 and 37.97.
- MEDV for these entries lies between 5 and 29.80.
- These observations are very unusual; it seems unlikely that all these houses have such high TAX values.
- These values are most likely missing values that someone imputed carelessly.

**Conclusion:**
- Among the selected features, LSTAT is the one most correlated with TAX in the heatmap above, so I am going to replace those 132 TAX values with the mean of the remaining TAX values, computed separately for the following LSTAT intervals.
- Interval 1: TAX_10 -> Replace extreme TAX values whose LSTAT is in [0, 10) with the mean of the other TAX values whose LSTAT is in [0, 10).
- Interval 2: TAX_20 -> Replace extreme TAX values whose LSTAT is in [10, 20) with the mean of the other TAX values whose LSTAT is in [10, 20).
- Interval 3: TAX_30 -> Replace extreme TAX values whose LSTAT is in [20, 30) with the mean of the other TAX values whose LSTAT is in [20, 30).
- Interval 4: TAX_40 -> Replace extreme TAX values whose LSTAT >= 30 with the mean of the other TAX values whose LSTAT >= 30.
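
The interval-wise replacement can also be written without an explicit loop. The sketch below uses `pd.cut` to assign each row to an LSTAT interval and runs on a tiny made-up table; `impute_by_bin` is a hypothetical helper, not the code this notebook runs.

```python
import numpy as np
import pandas as pd

def impute_by_bin(data, col, by, bins, threshold):
    """Replace values of `col` above `threshold` with the mean of the
    remaining `col` values that fall in the same `by` interval."""
    out = data.copy()
    out[col] = out[col].astype(float)
    # Integer interval index per row; intervals are closed on the left: [low, high)
    bin_id = pd.cut(out[by], bins=bins, right=False, labels=False)
    extreme = out[col] > threshold
    # Mean of the non-extreme values in each interval
    bin_means = out.loc[~extreme, col].groupby(bin_id[~extreme]).mean()
    # Look up the interval mean for every extreme row
    out.loc[extreme, col] = bin_id[extreme].map(bin_means)
    return out

# Made-up values for illustration only
demo = pd.DataFrame({
    "LSTAT": [5.0, 8.0, 6.0, 15.0, 12.0, 18.0, 35.0, 31.0],
    "TAX":   [300, 320, 666, 400, 420, 666, 500, 666],
})
result = impute_by_bin(demo, col="TAX", by="LSTAT",
                       bins=[0, 10, 20, 30, np.inf], threshold=600)
print(result)
```

If an interval contains only extreme values, there is no mean to fall back on and the affected rows end up as NaN, so it is worth checking for missing values afterwards.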

In [None]:
TAX_10 = df2[(df2['TAX']<600) & (df2['LSTAT']>=0) & (df2['LSTAT']<10)]['TAX'].mean()
TAX_20 = df2[(df2['TAX']<600) & (df2['LSTAT']>=10) & (df2['LSTAT']<20)]['TAX'].mean()
TAX_30 = df2[(df2['TAX']<600) & (df2['LSTAT']>=20) & (df2['LSTAT']<30)]['TAX'].mean()
TAX_40 = df2[(df2['TAX']<600) & (df2['LSTAT']>=30)]['TAX'].mean()

indexes = list(df2.index)
for i in indexes:
    if df2['TAX'][i] > 600:
        if (0 <= df2['LSTAT'][i] < 10):
            df2.at[i,'TAX'] = TAX_10
        elif (10 <= df2['LSTAT'][i] < 20):
            df2.at[i,'TAX'] = TAX_20
        elif (20 <= df2['LSTAT'][i] < 30):
            df2.at[i,'TAX'] = TAX_30
        elif (df2['LSTAT'][i] >= 30):
            df2.at[i,'TAX'] = TAX_40

print('Values imputed successfully')

In [None]:
#This shows that all the extreme TAX values were replaced
df2[df2['TAX']>600]['TAX'].count()

This shows that those values were replaced successfully :)

In [None]:
sns.distplot(a=df2.TAX,color='#500050')
plt.title('Distribution Plot of TAX after replacing extreme values')
plt.show()

## 3.3 PTRATIO

In [None]:
#Box Plot, Distribution Plot and Scatter Plot for PTRATIO
plt.figure(figsize=(20,3))

plt.subplot(1,3,1)
sns.boxplot(x=df2.PTRATIO,color='#005030')
plt.title('Box Plot of PTRATIO')

plt.subplot(1,3,2)
sns.distplot(a=df2.PTRATIO,color='#500050')
plt.title('Distribution Plot of PTRATIO')

plt.subplot(1,3,3)
sns.scatterplot(x=df2.PTRATIO,y=df2.MEDV)
plt.title('Scatter Plot of PTRATIO vs MEDV')

plt.show()

From the three figures above we can observe that:
- PTRATIO is NOT normally distributed
- There are a few low PTRATIO values in the dataset, which bothers me.

In [None]:
df2[df2['PTRATIO']<14].sort_values(by=['LSTAT','MEDV'])

**Observations:**
- PTRATIO is the same for all the data points above.
- RM and MEDV increase together, which is expected because they are positively correlated.
- As LSTAT increases MEDV decreases, which is consistent with their negative correlation.

**Conclusion:**
- I don't observe any unusual behaviour for these data points. Therefore, I will keep them.

## 3.4 LSTAT

In [None]:
#Box Plot, Distribution Plot and Scatter Plot for LSTAT
plt.figure(figsize=(20,3))

plt.subplot(1,3,1)
sns.boxplot(x=df2.LSTAT,color='#005030')
plt.title('Box Plot of LSTAT')

plt.subplot(1,3,2)
sns.distplot(a=df2.LSTAT,color='#500050')
plt.title('Distribution Plot of LSTAT')

plt.subplot(1,3,3)
sns.scatterplot(x=df2.LSTAT,y=df2.MEDV)
plt.title('Scatter Plot of LSTAT vs MEDV')

plt.show()

From the three figures above we can observe that:
- LSTAT is roughly bell-shaped but skewed to the right.
- There are some high LSTAT values in the dataset, which we will analyse.

In [None]:
LSTAT_Q3 = desc['LSTAT']['75%']
LSTAT_Q1 = desc['LSTAT']['25%']
LSTAT_IQR = LSTAT_Q3 - LSTAT_Q1
LSTAT_UV = LSTAT_Q3 + 1.5*LSTAT_IQR
LSTAT_LV = LSTAT_Q1 - 1.5*LSTAT_IQR

df2[df2['LSTAT']>LSTAT_UV].sort_values(by='LSTAT')

**Observations:**
- For these 7 houses the LSTAT value is high and MEDV is low, which is consistent with the negative correlation.
- RM is low and TAX is a little higher, which is also consistent with a low MEDV.

**Conclusion:**
- I don't find any strong reason to exclude these data points. Therefore, I will keep them for our model.

## 3.5 RM

In [None]:
#Box Plot, Distribution Plot and Scatter Plot for RM
plt.figure(figsize=(20,3))

plt.subplot(1,3,1)
sns.boxplot(x=df2.RM,color='#005030')
plt.title('Box Plot of RM')

plt.subplot(1,3,2)
sns.distplot(a=df2.RM,color='#500050')
plt.title('Distribution Plot of RM')

plt.subplot(1,3,3)
sns.scatterplot(x=df2.RM,y=df2.MEDV)
plt.title('Scatter Plot of RM vs MEDV')

plt.show()

From the three figures above we can observe that:
- RM is approximately normally distributed.
- There are some low and high RM values in the dataset, which we will analyse.
- The scatter plot of RM vs MEDV shows a good positive linear relationship.

In [None]:
RM_Q3 = desc['RM']['75%']
RM_Q1 = desc['RM']['25%']
RM_IQR = RM_Q3 - RM_Q1
RM_UV = RM_Q3 + 1.5*RM_IQR
RM_LV = RM_Q1 - 1.5*RM_IQR

df2[df2['RM']<RM_LV].sort_values(by=['RM','MEDV'])

**Observations:**
- I am most concerned about two data points (row index - 365 & 367) where MEDV is high although RM is very low, even though RM and MEDV are positively correlated.
- Also, for these two data points TAX and PTRATIO are both above the median, even though both are negatively correlated with MEDV.
- For the remaining data points, I don't see any unusual behaviour.

**Conclusion:**
- I am going to delete those two data points (row index - 365 & 367), as they may distort the predictions of our model.
- I am going to keep all other points.

In [None]:
print(f'Shape of dataset before removing data points: {df2.shape}')
df3 = df2.drop(axis=0,index=[365,367])
print(f'Shape of dataset after removing data points: {df3.shape}')

The difference in shape shows that two data points (outliers) were removed.

In [None]:
df3[df3['RM']>RM_UV].sort_values(by=['RM','MEDV'])

**Observations:**
- Among the data points above, I am concerned about only one (row index - 364), where MEDV is very low although RM is very high, even though RM and MEDV are positively correlated.
- Also, for this data point both LSTAT and MEDV are low, even though the two are negatively correlated.
- For the remaining data points, I don't see any unusual behaviour.

**Conclusion:**
- I am going to delete this data point (row index - 364), as I believe it could be a human error made while entering the data.
- I am going to keep all other points.

In [None]:
print(f'Shape of dataset before removing data points: {df3.shape}')
df3 = df3.drop(axis=0,index=[364])
print(f'Shape of dataset after removing data points: {df3.shape}')

The difference in shape shows that one data point (outlier) was removed.

---

Now we are done with the univariate and multivariate analysis, and I feel the data is ready to go into the **Black Box**, i.e. the model.

But before that we need to split the data into a Training set and a Test set. We will then build the model on the Training set and test its accuracy on the Test set.

# 4. Splitting Dataset into Train and Test Set

In [None]:
#Now we will split our dataset into independent variables (X) and the dependent variable (y)

X = df3.iloc[:,0:4].values
y = df3.iloc[:,-1:].values

First, we have divided our data into two sets:

**X** contains all independent variables

**y** contains the dependent variable MEDV

In [None]:
print(f"Shape of Independent Variables X = {X.shape}")
print(f"Shape of Dependent Variable y = {y.shape}")

In [None]:
def FeatureScaling(X):
    """
    This function takes an array as input and standardizes it:
    each column is rescaled to mean = 0 and standard deviation = 1.
    
    Input <- 2 dimensional numpy array
    Returns -> New numpy array after applying Feature Scaling
    """
    mean = np.mean(X,axis=0)  #column means
    std = np.std(X,axis=0)    #column standard deviations
    return (X - mean)/std     #broadcasts over rows; the input array is left unchanged

- Feature scaling is a technique that brings the independent features onto a comparable scale.


- A few advantages of scaling the features are:
    - It makes gradient descent converge faster.
    - It keeps features with large numeric ranges (such as TAX) from dominating the parameter updates.


- Here we use standardization, which rescales each independent variable so that its distribution is centred around 0 with a standard deviation of 1.
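
This notebook standardizes the whole of X before splitting it. A common alternative is to compute the mean and standard deviation on the training rows only and reuse them for the test rows, so that no information from the test set influences the scaling. The sketch below shows that variant on random data; both helper names are hypothetical.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return the column means and standard deviations of the training data."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def apply_standardizer(X, mean, std):
    """Scale X with previously computed statistics (returns a new array)."""
    return (X - mean) / std

# Random demo data with two columns on very different scales
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[6.0, 400.0], scale=[0.7, 150.0], size=(100, 2))
X_train_demo, X_test_demo = X_demo[:80], X_demo[80:]

mean, std = fit_standardizer(X_train_demo)
X_train_scaled = apply_standardizer(X_train_demo, mean, std)
X_test_scaled = apply_standardizer(X_test_demo, mean, std)  # uses the training statistics

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3), X_test_scaled.std(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1 exactly, while the test columns are only close to that, which is expected.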

In [None]:
X = FeatureScaling(X)

Set of Independent variables X is now scaled down.

In [None]:
m,n = X.shape
X = np.append(arr=np.ones((m,1)),values=X,axis=1)

We also need a variable for the **bias** term, so we add a new column of 1's as the first column of X.

In [None]:
#Now we will split our data into Train set and Test Set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state = 42)

print(f"Shape of X_train = {X_train.shape}")
print(f"Shape of X_test = {X_test.shape}")
print(f"Shape of y_train = {y_train.shape}")
print(f"Shape of y_test = {y_test.shape}")

Here we can see that we have split the data into Training Set (80% of total data) and Test Set (20% of total data)

# 5. Multiple Linear Regression
##### Here we are building Multiple Linear Regression from Scratch

## 5.1 Model Preparation

In [None]:
#ComputeCost function determines the cost (half the mean of squared errors)

def ComputeCost(X,y,theta):
    """
    This function takes three inputs and uses the Cost Function to measure the error of the predicted values
    against the actual values.
    Cost Function: Sum of squared errors of the predicted values divided by twice the number of data points
    J = 1/(2*m) *  Summation(Square(Predicted values - Actual values))
    
    Input <- Three numpy arrays X, y and theta
    Return -> The cost calculated from the Cost Function
    """
    m=X.shape[0] #number of data points in the set
    J = (1/(2*m)) * np.sum((X.dot(theta) - y)**2)
    return J

This function computes the cost of our Multiple Linear Regression model: half the mean of the squared prediction errors.
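
Written out, the cost computed by `ComputeCost` and the gradient that Gradient Descent follows are shown below, where $m$ is the number of rows of $X$ and $x^{(i)}$ is the $i$-th row written as a column vector. The symbols are my notation for what the code computes.

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)\top} \theta - y^{(i)} \right)^2
          = \frac{1}{2m} \lVert X\theta - y \rVert_2^2

\nabla_\theta J(\theta) = \frac{1}{m} X^\top \left( X\theta - y \right)

\theta \leftarrow \theta - \alpha \, \nabla_\theta J(\theta)
```

The factor $\frac{1}{2}$ is there only so that it cancels when the square is differentiated.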

In [None]:
#Gradient Descent Algorithm to minimize the Cost and find best parameters in order to get best line for our dataset

def GradientDescent(X,y,theta,alpha,no_of_iters):
    """
    Gradient Descent Algorithm to minimize the Cost
    
    Input <- X, y and theta are numpy arrays
            X -> Independent Variables/ Features
            y -> Dependent/ Target Variable
            theta -> Parameters 
            alpha -> Learning Rate i.e. size of each steps we take
            no_of_iters -> Number of iterations we want to perform
    
    Return -> theta (numpy array) which are the best parameters for our dataset to fit a linear line
             and Cost Computed (numpy array) for each iteration
    """
    m=X.shape[0]
    J_Cost = []
    for i in range(no_of_iters):
        error = np.dot(X.transpose(),(X.dot(theta)-y))
        theta = theta - alpha * (1/m) * error
        J_Cost.append(ComputeCost(X,y,theta))
    
    return theta, np.array(J_Cost)

This is our Gradient Descent Algorithm which will minimize the *Error in Prediction*.

Basically, it will find the best coefficients **theta**, which represent the best-fitting line for our data.
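
One way to gain confidence in a hand-written Gradient Descent is to compare its result with a closed-form least-squares solution. The sketch below does this on synthetic data with `np.linalg.lstsq`; the function is a re-implementation of the same update rule for illustration, and the data and coefficients are made up.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, n_iters):
    """Batch gradient descent for the cost J = 1/(2m) * sum((X @ theta - y)**2)."""
    m = X.shape[0]
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / m
        theta = theta - alpha * gradient
    return theta

rng = np.random.default_rng(0)
m = 200
features = rng.normal(size=(m, 3))                 # already on a standard scale
X_demo = np.hstack([np.ones((m, 1)), features])    # bias column first
true_theta = np.array([[20.0], [3.0], [-1.0], [-2.0]])
y_demo = X_demo @ true_theta + rng.normal(scale=0.5, size=(m, 1))

theta_gd = gradient_descent(X_demo, y_demo, np.zeros((4, 1)), alpha=0.03, n_iters=1000)
theta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Columns: gradient descent result, least-squares result
print(np.hstack([theta_gd, theta_ls]).round(3))
```

If the two columns disagree noticeably, the learning rate is probably too small for the number of iterations, or the features were not scaled.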

In [None]:
iters = 1000

alpha1 = 0.001
theta1 = np.zeros((X_train.shape[1],1))
theta1, J_Costs1 = GradientDescent(X_train,y_train,theta1,alpha1,iters)

alpha2 = 0.003
theta2 = np.zeros((X_train.shape[1],1))
theta2, J_Costs2 = GradientDescent(X_train,y_train,theta2,alpha2,iters)

alpha3 = 0.01
theta3 = np.zeros((X_train.shape[1],1))
theta3, J_Costs3 = GradientDescent(X_train,y_train,theta3,alpha3,iters)

alpha4 = 0.03
theta4 = np.zeros((X_train.shape[1],1))
theta4, J_Costs4 = GradientDescent(X_train,y_train,theta4,alpha4,iters)

- Now we run the Gradient Descent Algorithm with four different **learning rates** *alpha*, each for 1000 iterations.


- After that we will find the *best learning rate* for our algorithm by visualizing the results.


- Finally we will get the best *theta*, which represents the best-fitting line for our data.

In [None]:
plt.figure(figsize=(8,5))
plt.plot(J_Costs1,label = 'alpha = 0.001')
plt.plot(J_Costs2,label = 'alpha = 0.003')
plt.plot(J_Costs3,label = 'alpha = 0.01')
plt.plot(J_Costs4,label = 'alpha = 0.03')
plt.title('Convergence of Gradient Descent for different values of alpha')
plt.xlabel('No. of iterations')
plt.ylabel('Cost')
plt.legend()
plt.show()

**Observations:**
- We can see that for ***alpha = 0.03***, the Gradient Descent algorithm converges to the minimum much faster than for any other value of alpha tried.


- For *alpha = 0.03*, the algorithm converged to the minimum cost somewhere before 50 iterations.


- In order of convergence speed, fastest first: *alpha = 0.03 -> 0.01 -> 0.003 -> 0.001*.


- Thus, the best value is *alpha = 0.03*, and the corresponding best *theta* is *theta4*.
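
Reading the convergence point off a plot is approximate. A small helper can report the first iteration at which the cost stops improving by more than a tolerance. This is a sketch on synthetic cost curves that stand in for `J_Costs1` to `J_Costs4`; the helper name and the tolerance are assumptions, and it presumes the cost is decreasing.

```python
import numpy as np

def iterations_to_converge(costs, tol=1e-3):
    """Return the index of the first iteration whose cost is less than `tol`
    below the previous iteration's cost, or None if there is no such iteration."""
    drops = -np.diff(np.asarray(costs, dtype=float))   # positive while the cost decreases
    below = np.nonzero(drops < tol)[0]
    return int(below[0]) + 1 if below.size else None

# Synthetic, geometrically decaying cost curves (not the notebook's real costs)
steps = np.arange(1000)
fast = 100 * 0.9 ** steps      # behaves like a large learning rate
slow = 100 * 0.999 ** steps    # behaves like a small learning rate

fast_iter = iterations_to_converge(fast)
slow_iter = iterations_to_converge(slow)
print(fast_iter, slow_iter)
```

A result of `None` means the curve was still improving by more than `tol` per step at the end of the run, which is the situation of the smallest learning rates in the plot above.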

In [None]:
theta4

Above is the value of theta corresponding to alpha = 0.03

In [None]:
def Predict(X,theta):
    """
    This function predicts the result for the unseen data
    """
    y_pred = X.dot(theta)
    return y_pred

The Predict function predicts the house price, i.e. MEDV, for new unseen data using the regression coefficients, i.e. theta.

In [None]:
y_pred = Predict(X_test,theta4)
y_pred[:5]

Predicted value for Test Set is saved in *y_pred* successfully.

## 5.2 Model Evaluation

In [None]:
plt.scatter(x=y_test,y=y_pred,alpha=0.5)
plt.xlabel('y_test',size=12)
plt.ylabel('y_pred',size=12)
plt.title('Predicted Values vs Original Values (Test Set)',size=15)
plt.show()

In the scatter plot above the points do not lie exactly on a straight diagonal line; the scatter around the diagonal represents the differences between the actual and predicted values.

In [None]:
sns.residplot(x=y_pred.ravel(),y=(y_pred-y_test).ravel())
plt.xlabel('Predicted Values',size=12)
plt.ylabel("Residuals",size=12)
plt.title('Residual Plot',size=15)
plt.show()

In [None]:
sns.distplot(y_pred-y_test)
plt.xlabel('Residual',size=12)
plt.ylabel('Frequency',size=12)
plt.title('Distribution of Residuals',size=15)
plt.show()

**Observations:**
- The *Distribution of Residuals* plot shows that the residuals are approximately normally distributed.


- In the *Residual Plot* above, I did not find any significant pattern in the residuals (errors of prediction).


- These plots give no obvious sign that our model is under-fitting or over-fitting the data.

In [None]:
from sklearn import metrics
r2= metrics.r2_score(y_test,y_pred)
N = X_test.shape[0]
p = X_test.shape[1] - 1  #number of predictors; the bias column of 1's is not counted
adj_r2 = 1-((1-r2)*(N-1))/(N-p-1)
print(f'R^2 = {r2}')
print(f'Adjusted R^2 = {adj_r2}')

The *R square* value above is calculated on the Test Set. It is not very high, but it still indicates a fairly good linear relationship between the independent variables and the dependent variable.
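
In the adjusted R square formula, p counts only the predictors, so a bias column of 1's in X must not be included. The sketch below wraps the formula in a hypothetical helper and shows, with illustrative numbers only, how miscounting the bias column shifts the result.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2; n_predictors excludes the intercept (bias) column."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Illustrative numbers only, not the notebook's output
correct = adjusted_r2(0.77, n_samples=98, n_predictors=4)      # four predictors
miscounted = adjusted_r2(0.77, n_samples=98, n_predictors=5)   # bias counted by mistake
print(round(correct, 4), round(miscounted, 4))
```

The difference is small for a test set of this size, but it grows as the number of rows shrinks or the number of columns grows.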

In [None]:
from sklearn import metrics
mse = metrics.mean_squared_error(y_test,y_pred)
mae = metrics.mean_absolute_error(y_test,y_pred)
rmse = np.sqrt(metrics.mean_squared_error(y_test,y_pred))
print(f'Mean Squared Error: {mse}',f'Mean Absolute Error: {mae}',f'Root Mean Squared Error: {rmse}',sep='\n')

From the *evaluation metrics* above, we can see that the Root Mean Squared Error of our Multiple Regression Model is fairly low (MEDV is measured in $1000's), which is a good thing for us.
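
For reference, the three metrics are simple enough to compute by hand. Below is a small NumPy sketch on toy values (not the notebook's predictions); `regression_errors` is a hypothetical helper name.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MSE, MAE, RMSE) for two arrays of equal length."""
    residuals = (np.asarray(y_true, dtype=float).ravel()
                 - np.asarray(y_pred, dtype=float).ravel())
    mse = float(np.mean(residuals ** 2))      # mean squared error
    mae = float(np.mean(np.abs(residuals)))   # mean absolute error
    return mse, mae, float(np.sqrt(mse))      # RMSE is the square root of MSE

# Toy values in $1000's
mse_demo, mae_demo, rmse_demo = regression_errors([20.0, 25.0, 30.0], [22.0, 24.0, 27.0])
print(mse_demo, mae_demo, rmse_demo)
```

RMSE and MAE are in the same unit as MEDV, which makes them easier to interpret than MSE.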

## 5.3 Model Interpretation

In [None]:
#coefficients of regression model
coeff=theta4.ravel().round(2)  #flatten the (5,1) array into 5 values
features=['Bias','RM','TAX','PTRATIO','LSTAT']
eqn = 'MEDV = ' + ' + '.join(f"({c} * {f})" for f,c in zip(features,coeff))

print(eqn)

In [None]:
sns.barplot(x=features,y=coeff)
plt.ylim([-5,25])
plt.xlabel('Coefficient Names',size=12)
plt.ylabel('Coefficient Values',size=12)
plt.title('Visualising Regression Coefficients',size=15)
plt.show()

**Observations:**

- **MEDV = (21.74 * Bias) + (2.74 * RM) + (-1.06 * TAX) + (-1.93 * PTRATIO) + (-3.03 * LSTAT)**


- Because the features were standardized, each coefficient is the change in MEDV (in $1000's) for an increase of one standard deviation in that feature, with the other features held constant.


- From the equation above, a one standard deviation increase in RM raises the house price by 2.74 units, and vice versa.


- A one standard deviation increase in TAX lowers the house price by 1.06 units, and vice versa.


- A one standard deviation increase in PTRATIO lowers the house price by 1.93 units, and vice versa.


- A one standard deviation increase in LSTAT lowers the house price by 3.03 units, and vice versa.


(The four observations above are also meaningful, since RM is positively correlated with MEDV, while TAX, PTRATIO & LSTAT are negatively correlated with MEDV.)
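
Since the model was trained on standardized features, its coefficients are expressed per standard deviation of each feature. To read them per original unit (for example per additional room), they can be converted back with the mean and standard deviation used for scaling. The sketch below uses hypothetical statistics and coefficients, not the values of this notebook.

```python
import numpy as np

def to_original_units(theta, mean, std):
    """Convert [bias, w1, ..., wk] fitted on standardized features into an
    intercept and slopes that apply to the unscaled features."""
    theta = np.asarray(theta, dtype=float).ravel()
    slopes = theta[1:] / std
    intercept = theta[0] - np.sum(theta[1:] * mean / std)
    return intercept, slopes

# Hypothetical feature statistics and coefficients, for illustration only
mean = np.array([6.0, 400.0])
std = np.array([0.5, 100.0])
theta_scaled = np.array([22.0, 3.0, -1.0])

intercept, slopes = to_original_units(theta_scaled, mean, std)

# Both forms must give the same prediction for the same raw input
x_raw = np.array([6.5, 300.0])
pred_scaled = theta_scaled[0] + theta_scaled[1:] @ ((x_raw - mean) / std)
pred_raw = intercept + slopes @ x_raw
print(intercept, slopes, pred_scaled, pred_raw)
```

In this made-up example a coefficient of 3.0 per standard deviation becomes 6.0 per unit, because the standard deviation of that feature is 0.5.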

**Conclusion:**

- *As we know, the price of a house increases with the number of rooms. Whereas if the share of lower-status population in a region is high (LSTAT), or the pupil-teacher ratio is larger (PTRATIO), i.e. fewer teachers for more students, or the TAX rate is higher, the house price will obviously go down.*


- Our multiple regression model does not explain the data perfectly (the R square value is 0.77), but it still captures a good relationship between house price (i.e. MEDV) and the other factors affecting it.


- We will fit a few more models on this dataset and at the end choose the model which explains the data best.

---

# 6. Decision Tree
#### We will be using the sklearn library to build a Decision Tree model on the dataset.

## 6.1 Model Preparation

In [None]:
X_dt = df3.iloc[:,:-1].values
y_dt = df3.iloc[:,-1].values

In [None]:
from sklearn.model_selection import train_test_split
X_train_dt,X_test_dt,y_train_dt,y_test_dt = train_test_split(X_dt,y_dt,test_size=0.2,random_state=42)

print(f"Shape of X_train_dt = {X_train_dt.shape}")
print(f"Shape of X_test_dt = {X_test_dt.shape}")
print(f"Shape of y_train_dt = {y_train_dt.shape}")
print(f"Shape of y_test_dt = {y_test_dt.shape}")

We have divided the dataset into a Training set and a Test set.

In [None]:
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=42)  #fixed seed so that the results are reproducible
dt.fit(X_train_dt,y_train_dt)

We fitted a Decision Tree Regressor with default parameters (apart from a fixed random seed).

In [None]:
y_pred_dt = dt.predict(X_test_dt)
y_pred_dt[:5]

We used the Decision Tree Regressor to predict the House Price on the Test Set.

## 6.2 Model Interpretation

In [None]:
plt.scatter(x=y_test_dt,y=y_pred_dt,alpha=0.5)
plt.xlabel('y_test',size=12)
plt.ylabel('y_pred',size=12)
plt.title('Predicted Values vs Original Values (Test Set)',size=15)
plt.show()

In the scatter plot above the points do not lie exactly on a straight diagonal line; the scatter around the diagonal represents the differences between the actual and predicted values.

In [None]:
sns.residplot(x=y_pred_dt,y=(y_pred_dt-y_test_dt))
plt.xlabel('Predicted Values',size=12)
plt.ylabel("Residuals",size=12)
plt.title('Residual Plot',size=15)
plt.show()

In [None]:
sns.distplot(y_pred_dt-y_test_dt)
plt.xlabel('Residual',size=12)
plt.ylabel('Frequency',size=12)
plt.title('Distribution of Residuals',size=15)
plt.show()

**Observations:**
- The *Distribution of Residuals* plot shows that the residuals are approximately normally distributed.


- In the *Residual Plot* above, I did not find any significant pattern in the residuals (errors of prediction).


- These plots give no obvious sign that our model is under-fitting or over-fitting the data.
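
Residual plots alone cannot rule out over-fitting, and a decision tree grown with default parameters usually fits its training data almost perfectly. A more direct check is to compare the R square on the training set with the one on the test set. The sketch below does this on synthetic data; the `max_depth=4` setting is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, not the Boston data
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

full_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
small_tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_tr, y_tr)

# score() returns R^2; a large gap between train and test hints at over-fitting
for name, model in [("unrestricted", full_tree), ("max_depth=4", small_tree)]:
    print(name, round(model.score(X_tr, y_tr), 3), round(model.score(X_te, y_te), 3))
```

The same two `score` calls can be applied to `dt` with `X_train_dt`, `y_train_dt`, `X_test_dt` and `y_test_dt` from this notebook.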

In [None]:
from sklearn import metrics
r2_dt= metrics.r2_score(y_test_dt,y_pred_dt)
N,p = X_test_dt.shape
adj_r2_dt = 1-((1-r2_dt)*(N-1))/(N-p-1)
print(f'R^2 = {r2_dt}')
print(f'Adjusted R^2 = {adj_r2_dt}')

The *R square* value above is calculated on the Test Set. It is not very high, but it is still better than our Multiple Linear Regression score, which means that the Decision Tree model fits the test data better than the Multiple Linear Regression model.

In [None]:
from sklearn import metrics
mse_dt = metrics.mean_squared_error(y_test_dt,y_pred_dt)
mae_dt = metrics.mean_absolute_error(y_test_dt,y_pred_dt)
rmse_dt = np.sqrt(metrics.mean_squared_error(y_test_dt,y_pred_dt))
print(f'Mean Squared Error: {mse_dt}',f'Mean Absolute Error: {mae_dt}',f'Root Mean Squared Error: {rmse_dt}',sep='\n')

From the *evaluation metrics* above, we can see that the Root Mean Squared Error of our Decision Tree Model is low, which is a good thing for us. All these error scores are also lower than those of the Linear Regression Model.

**Conclusion:**
- The Decision Tree Model gives a better R square than Linear Regression, which means that it predicts house prices more accurately than Linear Regression, and we may use it for predicting house prices.


- We will try one more model, Random Forest, before reaching our final conclusion.

---

# 7. Random Forest
#### We will be using the sklearn library to build a Random Forest model on the dataset.

## 7.1 Model Preparation

In [None]:
X_rf = df3.iloc[:,:-1].values
y_rf = df3.iloc[:,-1].values

In [None]:
from sklearn.model_selection import train_test_split
X_train_rf,X_test_rf,y_train_rf,y_test_rf = train_test_split(X_rf,y_rf,test_size=0.2,random_state=42)

print(f"Shape of X_train_rf = {X_train_rf.shape}")
print(f"Shape of X_test_rf = {X_test_rf.shape}")
print(f"Shape of y_train_rf = {y_train_rf.shape}")
print(f"Shape of y_test_rf = {y_test_rf.shape}")

We have divided the dataset into a Training set and a Test set.

In [None]:
warnings.filterwarnings(action='ignore')
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train_rf,y_train_rf)

We fitted a Random Forest Regressor with default parameters (apart from a fixed random seed).

In [None]:
y_pred_rf = rf.predict(X_test_rf)
y_pred_rf[:5]

We used the Random Forest Regressor to predict the House Price on the Test Set.

## 7.2 Model Interpretation

In [None]:
plt.scatter(x=y_test_rf,y=y_pred_rf,alpha=0.5)
plt.xlabel('y_test',size=12)
plt.ylabel('y_pred',size=12)
plt.title('Predicted Values vs Original Values (Test Set)',size=15)
plt.show()

In the scatter plot above the points do not lie exactly on a straight diagonal line; the scatter around the diagonal represents the differences between the actual and predicted values.

In [None]:
sns.residplot(x=y_pred_rf,y=(y_pred_rf-y_test_rf))
plt.xlabel('Predicted Values',size=12)
plt.ylabel("Residuals",size=12)
plt.title('Residual Plot',size=15)
plt.show()

In [None]:
sns.distplot(y_pred_rf-y_test_rf)
plt.xlabel('Residual',size=12)
plt.ylabel('Frequency',size=12)
plt.title('Distribution of Residuals',size=15)
plt.show()


**Observations:**
- The *Distribution of Residuals* plot shows that the residuals are approximately normally distributed.


- In the *Residual Plot* above, I did not find any significant pattern in the residuals (errors of prediction).


- These plots give no obvious sign that our model is under-fitting or over-fitting the data.


In [None]:
from sklearn import metrics
r2_rf= metrics.r2_score(y_test_rf,y_pred_rf)
N,p = X_test_rf.shape
adj_r2_rf = 1-((1-r2_rf)*(N-1))/(N-p-1)
print(f'R^2 = {r2_rf}')
print(f'Adjusted R^2 = {adj_r2_rf}')


The *R square* value above is calculated on the Test Set. It is not very high, but it is still better than our Decision Tree score, which means that the Random Forest model fits the test data better than the Decision Tree model.

In [None]:
from sklearn import metrics
mse_rf = metrics.mean_squared_error(y_test_rf,y_pred_rf)
mae_rf = metrics.mean_absolute_error(y_test_rf,y_pred_rf)
rmse_rf = np.sqrt(metrics.mean_squared_error(y_test_rf,y_pred_rf))
print(f'Mean Squared Error: {mse_rf}',f'Mean Absolute Error: {mae_rf}',f'Root Mean Squared Error: {rmse_rf}',sep='\n')

From the *evaluation metrics* above, we can see that the Root Mean Squared Error of our Random Forest Model is low, which is a good thing for us. All these error scores are also lower than those of the Decision Tree and Linear Regression models.

**Conclusion:**
- The Random Forest Model gives a better R square than both earlier models, which means that it predicts house prices more accurately than either of them, and we may use it for predicting house prices.
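
A fitted Random Forest can also tell us which features it relied on most, through its `feature_importances_` attribute. The sketch below illustrates this on synthetic data with made-up feature names; for this notebook the same attribute could be read from `rf` and paired with the names RM, TAX, PTRATIO and LSTAT.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the first column drives the target, the third is pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 5 * X_demo[:, 0] + 1 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_demo, y_demo)

names = ["strong", "weak", "noise"]
importances = forest.feature_importances_   # impurity-based, sums to 1
for name, importance in sorted(zip(names, importances), key=lambda t: -t[1]):
    print(f"{name}: {importance:.3f}")
```

These importances are impurity-based, so they describe how the forest split the training data rather than a causal effect of each feature.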

---

# 8. Conclusion

*We have built three models for our dataset, and they differ in their prediction capability. Now it's time to wrap up what we learned and reach our final conclusion.*

*So guyz it's Show time!*

In [None]:
results=pd.DataFrame({'Linear Regression':[r2,adj_r2],'Decision Tree':[r2_dt,adj_r2_dt],
                      'Random Forest':[r2_rf,adj_r2_rf]},index=['R square','Adj R square'])

results.plot(kind='bar',alpha=0.7,grid=True,title='Interpreting Results',rot=0,figsize=(10,5),colormap='jet')
results

- *The figure above clearly shows that the R square and Adjusted R square values of the* **Random Forest** *model are the highest among the three models we used for predicting house prices.*


- *We can say that the Random Forest model is a good choice for this dataset, as it predicts house prices best without overfitting.*


- *Also, this data was collected many years ago and many things have changed since then, so it may not be feasible to use this model for current house price predictions in Boston. Other, newer features must be taken into consideration before we can build a better model for the current scenario.*


- *Still, this data is enough to learn and understand how to predict house prices from a given set of features using Machine Learning.*

 # Thank You
Please leave your valuable feedback on this