# Overview
The Boston housing dataset has been widely used in regression analysis since the 1970s. This classic dataset continues to serve as a canvas for developing the skills of new learners. It contains information collected by the U.S. Census Service concerning housing in the Boston area.

# Goal
Predict Boston Housing Prices based on certain associated home and neighborhood attributes.

# Attributes
The dataset provides three features plus the house price (MEDV, the median home value), which is the target to predict:

1) RM - average number of rooms per dwelling.

2) LSTAT - percentage of the population considered lower status.

3) PTRATIO - pupil-teacher ratio by town.

4) MEDV - median value of owner-occupied homes in $1000's.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data = pd.read_csv('../input/bostonhoustingmlnd/housing.csv')

In [None]:
data.head()

# Correlation analysis
From the correlation matrix plotted in the next section, we can infer the following:

1) RM - average number of rooms per dwelling. This feature is positively correlated with price: the higher the value of RM, the higher the value of 'MEDV' tends to be, since homes with more rooms generally sell for more.

2) LSTAT - % of lower status of the population. This feature is negatively correlated with price: the greater the value of LSTAT, the lower the value of 'MEDV' tends to be. One possible explanation is that neighbourhoods with a higher percentage of "lower class" homeowners may see higher crime rates, which can push housing prices down. In addition, developers of high-end real estate may avoid building in a region where local residents could not afford such homes, so houses there tend to be cheaper. The correlation itself does not prove either cause.

3) PTRATIO - pupil-teacher ratio by town. This feature is negatively correlated with price: the lower the value of PTRATIO, the higher the value of 'MEDV' tends to be. Regions with a low PTRATIO are more desirable because students can get more attention from teachers, which may benefit their education.
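
To make "positively" and "negatively" correlated concrete, here is a minimal sketch of the Pearson coefficient, which is what `data.corr()` computes by default. The helper name `pearson_r` and the toy numbers are assumptions made up for illustration; they are not values from housing.csv.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up toy values, NOT taken from housing.csv
toy_rm = [5.0, 6.0, 7.0, 8.0]        # rooms rise...
toy_lstat = [20.0, 15.0, 9.0, 4.0]   # ...while LSTAT falls...
toy_medv = [15.0, 21.0, 30.0, 42.0]  # ...and price rises

r_rm = pearson_r(toy_rm, toy_medv)        # positive sign expected
r_lstat = pearson_r(toy_lstat, toy_medv)  # negative sign expected
print(f"RM vs MEDV:    {r_rm:+.2f}")
print(f"LSTAT vs MEDV: {r_lstat:+.2f}")
```

A coefficient near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one. The sign only describes direction, not cause.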

# Correlation Matrix

In [None]:
import seaborn as sns
plt.figure(figsize=(15, 10))
sns.heatmap(data.corr(), annot=True)  # annot=True prints each coefficient in its cell

# It's important to standardize the data because the features use different units of measurement

In [None]:
from sklearn.preprocessing import StandardScaler  # the only preprocessing class used below

# Let's split the data for training and testing

In [None]:
y = data["MEDV"]               # target: median home value
X = data.drop(columns="MEDV")  # features; returns a copy and leaves `data` unchanged

In [None]:
X.shape

In [None]:
y.shape

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
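
As a rough illustration of what a shuffled split with `test_size=0.33` and a fixed seed does, here is a minimal sketch in plain Python. The helper `simple_split` and the row count of 100 are hypothetical. This is not scikit-learn's implementation, and it will not pick the same rows as `train_test_split`.

```python
import math
import random

def simple_split(n_samples, test_size=0.33, seed=3):
    """Shuffle row indices and cut them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    n_test = math.ceil(test_size * n_samples)  # test fraction, rounded up
    return indices[n_test:], indices[:n_test]

# 100 is a made-up row count, not the size of housing.csv
train_idx, test_idx = simple_split(100)
print(len(train_idx), len(test_idx))
```

Fixing the seed matters because every model is then evaluated on the same held-out rows, which keeps the comparison between models fair.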

# It's important to scale our data because the features are in different units of measurement

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)                  # learn mean and std from the training set only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)    # reuse the training statistics on the test set
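
The arithmetic behind the cell above can be sketched by hand for a single feature. `StandardScaler` subtracts the training mean and divides by the training standard deviation (the population form, dividing by n). The toy values below are made up for illustration and are not from housing.csv.

```python
from statistics import mean, pstdev

# Made-up toy values for one feature, NOT taken from housing.csv
train_col = [4.0, 6.0, 8.0, 10.0]
test_col = [5.0, 12.0]

mu = mean(train_col)       # learned from the training rows only
sigma = pstdev(train_col)  # population standard deviation (divides by n)

train_scaled = [(x - mu) / sigma for x in train_col]
test_scaled = [(x - mu) / sigma for x in test_col]  # reuses the training mu and sigma

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it is transformed with the training statistics so that no information from the test set leaks into preprocessing.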

# I tried the following models, with hyperparameter tuning:

Random Forest Regressor - with hyperparameter tuning, this model gave comparatively better results

Decision Tree Regressor

Linear Regression

# Let's import the regressor evaluation metrics

In [None]:
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
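
For reference, the three metrics imported above can be written out in a few lines of plain Python. This is a minimal sketch of the standard formulas rather than scikit-learn's code. The helper names `rmse`, `mae`, `r2` and the toy numbers are assumptions for illustration only.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R^2: 1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up toy values in $1000's, NOT model output from this notebook
toy_actual = [20.0, 30.0, 40.0]
toy_pred = [22.0, 28.0, 43.0]

print(f"RMSE: {rmse(toy_actual, toy_pred):.3f}")
print(f"MAE:  {mae(toy_actual, toy_pred):.3f}")
print(f"R2:   {r2(toy_actual, toy_pred):.3f}")
```

RMSE and MAE are in the same units as MEDV ($1000's), so lower is better. R2 is unitless, and values closer to 1 are better.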

# Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=500, max_depth=7, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
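
The notebook does not show how values such as `n_estimators` and `max_depth` were chosen. One common approach is a cross-validated grid search, sketched below under stated assumptions: the candidate grid is hypothetical, and the data is a synthetic stand-in from `make_regression` so the block runs on its own. In this notebook you would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch is self-contained; NOT the housing data
X_demo, y_demo = make_regression(n_samples=120, n_features=3, noise=10.0, random_state=0)

# Hypothetical candidate values; the grid actually used is not given in the source
param_grid = {"n_estimators": [20, 50], "max_depth": [3, 7]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,                                   # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Cross-validated RMSE: {-search.best_score_:.2f}")
```

Tuning on cross-validation folds of the training data keeps the test set untouched until the final evaluation.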

In [None]:
rf_RMSE_test_error = sqrt(mean_squared_error(y_test, rf_pred))
print(f"rf_Test_error: {rf_RMSE_test_error:.2f}")

In [None]:
print(mean_absolute_error(y_test, rf_pred))

In [None]:
r2_test = r2_score(y_test, rf_pred)

In [None]:
print("The rf model performance for the test set")
print("--------------------------------------------")
print("RMSE of test set is {}".format(rf_RMSE_test_error))
print("R2 Score of test set is {}".format(r2_test))

In [None]:
rf_compare = pd.DataFrame({'Actual': y_test, 'Predicted': rf_pred})
print(rf_compare)
(rf_compare.head(10)).plot(kind='bar', figsize=(15,5))
plt.show() 

# Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
dtr_model = DecisionTreeRegressor(random_state=42, max_depth=5)
dtr_model.fit(X_train, y_train)
DT_pred = dtr_model.predict(X_test)

In [None]:
DT_RMSE_test_error = sqrt(mean_squared_error(y_test, DT_pred))
print(f"DT_RMSE_Test_error: {DT_RMSE_test_error:.2f}")

In [None]:
print(mean_absolute_error(y_test, DT_pred))

In [None]:
r2_test = r2_score(y_test, DT_pred)

In [None]:
print("The DT model performance for the test set")
print("--------------------------------------------")
print("RMSE of test set is {}".format(DT_RMSE_test_error))
print("R2 Score of test set is {}".format(r2_test))

In [None]:
DT_compare = pd.DataFrame({'Actual': y_test, 'Predicted': DT_pred})
print(DT_compare)
(DT_compare.head(10)).plot(kind='bar', figsize=(15,5))
plt.show() 

# Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression 
LR_model = LinearRegression()
LR_model = LR_model.fit(X_train, y_train)
LR_pred = LR_model.predict(X_test)

In [None]:
LR_RMSE_test_error = np.sqrt(mean_squared_error(y_test, LR_pred))
print(f"LR_RMSE_Test_error: {LR_RMSE_test_error:.2f}")

In [None]:
print(mean_absolute_error(y_test, LR_pred))

In [None]:
r2_test = r2_score(y_test, LR_pred)

In [None]:
print("The LR model performance for the test set")
print("--------------------------------------------")
print("RMSE of test set is {}".format(LR_RMSE_test_error))
print("R2 Score of test set is {}".format(r2_test))

In [None]:
lr_compare = pd.DataFrame({'Actual': y_test, 'Predicted': LR_pred})
print(lr_compare)
(lr_compare.head(10)).plot(kind='bar', figsize=(15,5))
plt.show()

# Motivation - Keep Learning Until you get it Right :)