<h5>The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970.<br>
The attributes are defined as follows (taken from the UCI Machine Learning Repository):</h5>

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centers
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. MEDV: Median value of owner-occupied homes in $1000s
We can see that the input attributes have a mixture of units.

In [None]:
# Importing the libraries 
import pandas as pd
import numpy as np
from sklearn import metrics

In [None]:
# Importing the Boston dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston
boston = load_boston()

In [None]:
#initialising the data Frame
df=pd.DataFrame(boston.data)

In [None]:
#seeing the dataset roughly
df.head(8)

In [None]:
# Adding the feature names as column labels
df.columns = boston.feature_names


In [None]:
# Checking the shape before adding the target column, i.e. the price
df.shape

In [None]:
# Adding the target variable to the DataFrame
# (median value of owner-occupied homes in $1000s)
df['PRICE'] = boston.target
df.head()

In [None]:
#checking columns after adding target values
df.shape

In [None]:
# Checking that every column has the expected data type
df.dtypes

In [None]:
#Analysing the data.
#Statistics of dataset described.
df.describe()

In [None]:
# Counting missing values per column; all sums are 0, so no data is missing.
df.isnull().sum()

<h1>1. Outlier Detection</h1>
<h4>Outliers significantly affect the mean and the standard deviation, and therefore the estimators of the model. To see outliers visually we need a box plot or a scatter plot, so let's draw a box plot for every column.</h4>

In [None]:
#checking outliers using boxplot
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline

fig, axs = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
count = 0
axs = axs.flatten()
for tar,var in df.items():
    sns.boxplot(y=tar, data=df, ax=axs[count])
    count= count+1
plt.tight_layout(pad=0.5, w_pad=0.78, h_pad=4.0)


<h4>We can see a large number of outliers in CRIM, ZN, RM and B. Let's check the percentage of outliers in each column.</h4>


In [None]:
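# Percentage of outliers per column (1.5 * IQR rule)
for tar, var in df.items():
    q1 = var.quantile(0.25)
    q3 = var.quantile(0.75)
    iqr = q3 - q1
    # Strict inequalities, so a column with IQR = 0 (e.g. CHAS) is not flagged entirely
    var_col = var[(var < q1 - 1.5*iqr) | (var > q3 + 1.5*iqr)]
    perc = np.shape(var_col)[0]*100.0/np.shape(df)[0]
    print("Column %s outliers = %.2f%%" % (tar, perc))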

<h4>Here DIS, CRIM, ZN and B are highly skewed. CHAS is discrete in nature.</h4>
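
The 1.5 × IQR rule used in the outlier cell can be wrapped in a small helper. This is a minimal sketch on toy values, not the Boston data; the function name `iqr_outlier_percent` and the sample list are hypothetical. It uses strict inequalities so that a column whose IQR is 0 (such as a mostly constant dummy variable) is not reported as 100% outliers.

```python
import numpy as np

def iqr_outlier_percent(values, k=1.5):
    """Percentage of values strictly outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return 100.0 * mask.sum() / values.size

# Toy data (hypothetical): nine typical values and one extreme value.
sample = [10, 11, 12, 12, 13, 13, 14, 15, 16, 100]
outlier_pct = iqr_outlier_percent(sample)

# A constant column has IQR = 0; strict inequalities keep it at 0% outliers.
constant_pct = iqr_outlier_percent([0, 0, 0, 0])

print(outlier_pct, constant_pct)
```

Applied to the notebook's DataFrame, the same helper could be called once per column, for example `iqr_outlier_percent(df['CRIM'])`.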

<h1>2. Feature Selection</h1>

In [None]:
# Pairwise correlation between all features
corr=df.corr()
corr

In [None]:
#using a heatmap to see correlation between features more clearly.
plt.figure(figsize=(20,20))
sns.heatmap(corr.abs(), annot=True,cmap='Greens')

<h4>From the correlation matrix, RM, LSTAT, TAX, NOX, INDUS and PTRATIO are correlated with PRICE. TAX and RAD are highly correlated with each other (0.91). So, here we get our predictors.</h4>
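
Reading predictors off a heatmap can also be done programmatically. The sketch below is one possible approach, assuming synthetic stand-in columns (`x1`, `x2`, `x3`, `target`) rather than the Boston features; the helper names `rank_by_target_corr` and `correlated_pairs` and the 0.9 threshold are hypothetical choices.

```python
import numpy as np
import pandas as pd

def rank_by_target_corr(frame, target):
    """Absolute Pearson correlation of each feature with the target, descending."""
    corr = frame.corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False)

def correlated_pairs(frame, target, threshold=0.9):
    """Feature pairs whose absolute correlation is at least `threshold`."""
    corr = frame.drop(columns=[target]).corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs

# Synthetic stand-in data (hypothetical columns, not the Boston values).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)                   # unrelated noise
toy = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3,
                    "target": 3 * x1 + rng.normal(scale=0.5, size=200)})

ranking = rank_by_target_corr(toy, "target")
pairs = correlated_pairs(toy, "target")
print(ranking)
print(pairs)
```

On the notebook's data the equivalent calls would be `rank_by_target_corr(df, 'PRICE')` and `correlated_pairs(df, 'PRICE')`; a pair such as TAX and RAD would be expected to appear in the second result.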

In [None]:
# Checking the skewness in the data
fig, axs = plt.subplots(ncols=7, nrows=2, figsize=(24,12))
count = 0
axs = axs.flatten()
for tar, var in df.items():
    # histplot with kde=True replaces the deprecated distplot
    sns.histplot(var, kde=True, stat='density', ax=axs[count])
    count = count+1
plt.tight_layout(pad=0.5, w_pad=0.6, h_pad=5.0)

<h1>3. Model Building and Evaluation</h1>

<h5>Train || Test split procedure</h5>

1. Split Data in Train/Test for both X and y
2. Fit/Train Scaler on Training X Data
3. Scale X Test Data
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Test Data (by creating predictions and comparing to y_test)
7. Adjust Parameters as Necessary and repeat steps 5 and 6
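
The steps above can be sketched end to end as follows. This is an illustration under stated assumptions: the data are synthetic (`X_demo`, `y_demo` are hypothetical names), and a plain linear model stands in for whichever estimator is being tuned. The point it shows is that the scaler is fitted on the training features only and then reused on the test features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 rows, 3 features on different scales.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y_demo = X_demo @ np.array([2.0, 0.1, 0.003]) + rng.normal(scale=0.5, size=200)

# 1. Split X and y into train and test sets.
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.4, random_state=10)

# 2. Fit the scaler on the training features only.
scaler = StandardScaler().fit(Xtr)
Xtr_s = scaler.transform(Xtr)

# 3. Scale the test features with the statistics learned from the training set.
Xte_s = scaler.transform(Xte)

# 4-5. Create the model and fit it on the scaled training data.
model = LinearRegression().fit(Xtr_s, ytr)

# 6. Evaluate on the held-out test data.
test_r2 = r2_score(yte, model.predict(Xte_s))
print("Test R^2:", test_r2)
```

Fitting the scaler on the full data set instead would leak test-set statistics into training, which is why step 2 comes after the split.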

In [None]:
# Splitting the target variable from the independent variables
X = df.drop(['PRICE'], axis = 1)
y = df['PRICE']

In [None]:
X

In [None]:
# Splitting the data into train and test sets; the test set is used to validate the model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.4, random_state = 10)

Types of regression algorithms (this notebook fits Linear Regression, SVM regression, Random Forest and XGBoost):
1. Linear Regression
2. Ridge Regression
3. Lasso Regression
4. Random Forest Regressor
5. XGBoost Regressor

<h1>1. LINEAR REGRESSION</h1>

In [None]:
# Import library for Linear Regression
from sklearn.linear_model import LinearRegression

# Create a Linear regressor
lm = LinearRegression()

# Train the model using the training sets 
lm.fit(X_train, y_train)

In [None]:
# Value of y intercept
lm.intercept_

In [None]:
# Converting the coefficient values to a DataFrame
coefficients = pd.DataFrame([X_train.columns,lm.coef_]).T
coefficients = coefficients.rename(columns={0: 'Attribute', 1: 'Coefficients'})
coefficients

In [None]:
#predicting on training data
y_pred=lm.predict(X_train)
#Model Evaluation and error calculations
print('R^2 =',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2 =',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE =',metrics.mean_absolute_error(y_train, y_pred))
print('MSE =',metrics.mean_squared_error(y_train, y_pred))
print('RMSE =',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
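
The same five metrics are printed for every model in this notebook, so they can be collected in one helper. The sketch below is a NumPy-only illustration; the name `regression_report` and the toy values are hypothetical. Adjusted R^2 is computed as 1 − (1 − R^2)(n − 1)/(n − p − 1), where n is the number of samples and p the number of predictors, matching the expression used in the cell above.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Return R^2, adjusted R^2, MAE, MSE and RMSE as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    mse = ss_res / n
    return {"R2": r2, "Adjusted R2": adj_r2,
            "MAE": np.mean(np.abs(residuals)), "MSE": mse, "RMSE": np.sqrt(mse)}

# Toy values (hypothetical) with one predictor.
report = regression_report([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], n_features=1)
for name, value in report.items():
    print(name, "=", value)
```

In the notebook this could be called as `regression_report(y_train, y_pred, X_train.shape[1])`.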

<h2>Model Validation</h2>

To validate the model we need to check a few assumptions of linear regression. The common assumptions are the following:

1. Linear relationship: linear regression assumes that the relationship between the dependent and independent variables is linear. This can be checked with a scatter plot of actual vs predicted values.
2. The residual errors should be normally distributed.
3. The mean of the residual errors should be 0, or as close to 0 as possible.
4. Multivariate normality: for inference, linear regression assumes normally distributed errors (the variables themselves need not be normal). This assumption is best checked with a Q-Q plot of the residuals.
5. Linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other. The variance inflation factor (VIF) identifies correlation between independent variables and the strength of that correlation. For predictor j, VIF_j = 1/(1−R_j^2), where R_j^2 is obtained by regressing predictor j on the remaining predictors. If VIF = 1 there is no correlation, if 1 < VIF < 5 there is moderate correlation, and VIF > 5 indicates a critical level of multicollinearity.
6. Homoscedasticity: the data are homoscedastic, meaning the residuals have equal variance along the regression line. We can look at a scatter plot of residuals vs fitted values; a heteroscedastic plot would exhibit a funnel-shaped pattern.
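
For the multicollinearity check, a per-predictor VIF regresses each feature on all the others and applies 1/(1 − R^2) to that auxiliary regression. The sketch below is a NumPy-only illustration on synthetic columns; the function name `variance_inflation_factors` and the data are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic columns (hypothetical): a and b are nearly collinear, c is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

On the notebook's data this could be applied to the feature matrix, for example `variance_inflation_factors(X.values)`. The statsmodels package also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose.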

In [None]:
# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_pred)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual Price vs Predicted Price")
plt.show()

In [None]:
# Plotting actual observations vs predicted observations
import matplotlib.pyplot as plt
import seaborn as sns
f = plt.figure(figsize=(14,5))
ax = f.add_subplot(121)
# x and y must be passed as keywords in recent seaborn versions
sns.scatterplot(x=y_train, y=y_pred, ax=ax, color='r')
ax.set_title('Actual Vs Predicted value')

# Check for residual normality & mean
ax = f.add_subplot(122)
a = (y_train - y_pred)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(a, kde=True, stat='density', ax=ax, color='b')
ax.axvline(a.mean(), color='k', linestyle='--')
ax.set_title('Check for Residual normality & mean: \n Residual error');

In [None]:
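# Check for multicollinearity
# Note: this applies 1/(1 - R^2) to the model's test-set R^2; a per-feature VIF
# would instead regress each predictor on the other predictors.
R_square = lm.score(X_test,y_test)
VIF_LR = 1/(1- R_square)
VIF_LR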

1. Actual vs predicted price is linear in nature.
2. Residuals are approximately normally distributed, so the normality assumption holds.
3. 1/(1 − R^2) < 5 for this model; note that it is computed from the model's test R^2, not from a per-feature VIF.

<h2>Evaluating the model on test data</h2>

In [None]:
#predicting the data using above model
y_tpred= lm.predict(X_test)
#Model Evaluation
tpred_linreg = metrics.r2_score(y_test, y_tpred)
print('R^2:', tpred_linreg)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_tpred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_tpred))
print('MSE:',metrics.mean_squared_error(y_test, y_tpred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_tpred)))

<h1>2. SVM REGRESSION</h1>

In [None]:
#Standardising the data 
from sklearn.preprocessing import StandardScaler
ss= StandardScaler()
X_train= ss.fit_transform(X_train)
X_test= ss.transform(X_test)


In [None]:
#importing SVM regressor
from sklearn import svm
reg= svm.SVR()

#training the model
reg.fit(X_train,y_train)

In [None]:
#Predicting the model on train data
y_pred= reg.predict(X_train)

In [None]:
# Model Evaluation and error calculations
print('R^2 =',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2 =',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE =',metrics.mean_absolute_error(y_train, y_pred))
print('MSE =',metrics.mean_squared_error(y_train, y_pred))
print('RMSE =',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))

<h2>Model Validation</h2>

In [None]:
# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_pred)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual Price vs Predicted Price")
plt.show()

In [None]:
# Plotting actual observations vs predicted observations
import matplotlib.pyplot as plt
import seaborn as sns
f = plt.figure(figsize=(14,5))
ax = f.add_subplot(121)
# x and y must be passed as keywords in recent seaborn versions
sns.scatterplot(x=y_train, y=y_pred, ax=ax, color='r')
ax.set_title('Actual Vs Predicted value')

# Check for residual normality & mean
ax = f.add_subplot(122)
a = (y_train - y_pred)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(a, kde=True, stat='density', ax=ax, color='b')
ax.axvline(a.mean(), color='k', linestyle='--')
ax.set_title('Check for Residual normality & mean: \n Residual error')

In [None]:
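# Check for multicollinearity using the Variance Inflation Factor formula
# (applied to the SVR model's test-set R^2; the scaled X_test must be scored with reg, not lm)
R_square = reg.score(X_test,y_test)
VIF_SVR = 1/(1- R_square)
VIF_SVR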

<h2>Evaluating the model on test data</h2>

In [None]:
#predicting the data using our test model
y_tpred= reg.predict(X_test)
#Model Evaluation
tpred_svm = metrics.r2_score(y_test, y_tpred)
print('R^2:', tpred_svm)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_tpred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_tpred))
print('MSE:',metrics.mean_squared_error(y_test, y_tpred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_tpred)))

<h1>3. RANDOM FOREST REGRESSOR</h1>

In [None]:
# Importing the Random Forest regressor and training it
from sklearn.ensemble import RandomForestRegressor
rfr=RandomForestRegressor()
rfr.fit(X_train,y_train)

In [None]:
#Predicting the model
y_pred=rfr.predict(X_train)

In [None]:
# Model Evaluation
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, y_pred))
print('MSE:',metrics.mean_squared_error(y_train, y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))

The adjusted R^2 on the training data is very good; the test-set evaluation later in this section shows whether it generalises.

<h2>Model Validation</h2>

In [None]:
# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_pred)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual Price vs Predicted Price")
plt.show()

In [None]:
# Plotting actual observations vs predicted observations
import matplotlib.pyplot as plt
import seaborn as sns
f = plt.figure(figsize=(14,5))
ax = f.add_subplot(121)
# x and y must be passed as keywords in recent seaborn versions
sns.scatterplot(x=y_train, y=y_pred, ax=ax, color='g')
ax.set_title('Actual Vs Predicted value')
# Check for residual normality & mean
ax = f.add_subplot(122)
a = (y_train - y_pred)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(a, kde=True, stat='density', ax=ax, color='b')
ax.axvline(a.mean(), color='k', linestyle='--')
ax.set_title('Check for Residual normality & mean: \n Residual error')


Actual price v/s predicted price is almost a straight line. It can be a good model.

In [None]:
#Check for Multicollinearity using Variance Inflation Factor
R_square=rfr.score(X_test,y_test)
VIF_RFR = 1/(1-R_square)
VIF_RFR

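This value is very high. Because it is computed from the model's test R^2, a high value reflects a high R^2 rather than collinearity among the predictors.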

<h2>Evaluating the model on test data</h2>

In [None]:
#predicting the data using above model
y_tpred= rfr.predict(X_test)
#Model Evaluation
tpred_rfr = metrics.r2_score(y_test, y_tpred)
print('R^2:',tpred_rfr)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_tpred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_tpred))
print('MSE:',metrics.mean_squared_error(y_test, y_tpred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_tpred)))

<h1>4. XGBOOST REGRESSOR</h1>

In [None]:
# Importing the XGBoost regression library
from xgboost import XGBRegressor
# Creating an XGBoost regressor with default parameters
xgbr= XGBRegressor()
# Training the model
xgbr.fit(X_train, y_train)

We can tune the default parameter values above to improve the model's accuracy on both the train and test datasets.
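
One common way to tune such parameters is a cross-validated grid search. The sketch below is an illustration under assumptions: it uses scikit-learn's `GradientBoostingRegressor` as a stand-in estimator so that it runs without xgboost installed, the data are synthetic, and the grid values are arbitrary. Since `XGBRegressor` follows the scikit-learn estimator interface and accepts `n_estimators`, `max_depth` and `learning_rate`, it could be passed to `GridSearchCV` in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical), with one non-linear term.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo[:, 0] ** 2 + 2 * X_demo[:, 1] + rng.normal(scale=0.3, size=150)

# Candidate values to try (arbitrary illustrative choices).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# 3-fold cross-validation on the training data, scored by R^2.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_r2 = search.best_score_
print(best_params, best_cv_r2)
```

The search should be run on the training split only, with the test split kept aside for the final evaluation.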

In [None]:
#predicting the model
y_pred=xgbr.predict(X_train)

In [None]:
# Model Evaluation and error calculations
print('R^2 =',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2 =',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE =',metrics.mean_absolute_error(y_train, y_pred))
print('MSE =',metrics.mean_squared_error(y_train, y_pred))
print('RMSE =',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))

The adjusted R^2 value is very high on the training data. Let us see whether it holds up on the test data as well.

<h2>Model Validation</h2>

In [None]:
# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_pred)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual Price vs Predicted Price")
plt.show()

The points fall almost exactly on a straight line.

In [None]:
# Plotting actual observations vs predicted observations
import matplotlib.pyplot as plt
import seaborn as sns
f = plt.figure(figsize=(14,5))
ax = f.add_subplot(121)
# x and y must be passed as keywords in recent seaborn versions
sns.scatterplot(x=y_train, y=y_pred, ax=ax, color='g')
ax.set_title('Actual Vs Predicted value')

# Check for residual normality & mean
ax = f.add_subplot(122)
a = (y_train - y_pred)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(a, kde=True, stat='density', ax=ax, color='b')
ax.axvline(a.mean(), color='k', linestyle='--')
ax.set_title('Check for Residual normality & mean: \n Residual error')


In [None]:
#check for Multicollinearity using Variance Inflation Factor
R_square=xgbr.score(X_test,y_test)
VIF_XGBR = 1/(1-R_square)
VIF_XGBR

VIF_XGBR > 5. Because this value is computed from the model's test R^2, it indicates a high R^2 rather than collinearity among the predictors.

<h2>Evaluating the model on test data</h2>

In [None]:
#predicting the data using above model
y_tpred= xgbr.predict(X_test)
#Model Evaluation
tpred_xgbr = metrics.r2_score(y_test, y_tpred)
print('R^2:',tpred_xgbr)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_tpred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_tpred))
print('MSE:',metrics.mean_squared_error(y_test, y_tpred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_tpred)))

<h1> Choosing the best model</h1>

In [None]:
models = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest', 'XGBoost', 'Support Vector Machines'],
    'R-squared Score': [tpred_linreg*100, tpred_rfr*100, tpred_xgbr*100, tpred_svm*100]})
models.sort_values(by='R-squared Score', ascending=False)
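
A single train/test split can favour one model by chance, so a k-fold comparison is a useful cross-check on the table above. The sketch below is an illustration under assumptions: the data are synthetic and linear by construction (so the linear model is expected to rank first here, which says nothing about the Boston data), XGBoost is left out so the example runs with scikit-learn alone, and the names `candidates` and `cv_table` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical): a linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 1.5]) + rng.normal(scale=0.3, size=200)

candidates = {
    "Linear Regression": LinearRegression(),
    # Scaling lives inside the pipeline so it is refitted on each training fold.
    "Support Vector Machines": make_pipeline(StandardScaler(), SVR()),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
rows = []
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="r2")
    rows.append({"Model": name, "Mean R^2": scores.mean(), "Std R^2": scores.std()})

cv_table = pd.DataFrame(rows).sort_values(by="Mean R^2", ascending=False)
print(cv_table)
```

The standard deviation column gives a rough sense of how much each score moves between folds, which a single split cannot show.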


<h1>OUTCOME :</h1>
<h1>FOR PREDICTING HOUSE PRICES IN THE NEAR FUTURE WE WILL USE THIS XGBOOST MODEL</h1>