# Boston Housing

### Boston housing is considered the "Hello World" of ML

So, as is the custom, let's start our journey here.

## About the dataset


The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA.

This dataset has 14 columns, each describing one attribute:

* CRIM - per capita crime rate by town.
* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS - proportion of non-retail business acres per town.
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise).
* NOX - nitric oxides concentration (parts per 10 million).
* RM - average number of rooms per dwelling.
* AGE - proportion of owner-occupied units built prior to 1940.
* DIS - weighted distances to five Boston employment centres.
* RAD - index of accessibility to radial highways.
* TAX - full-value property-tax rate per $10,000.
* PTRATIO - pupil-teacher ratio by town.
* B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
* LSTAT - % lower status of the population.
* MEDV - Median value of owner-occupied homes in $1000's.
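
The formula for the `B` column is easier to read with a few numbers plugged in. Here is a minimal sketch; the helper name `b_feature` is made up for illustration and is not part of sklearn or the dataset.

```python
def b_feature(bk: float) -> float:
    """Return 1000 * (Bk - 0.63)^2 for a proportion Bk between 0 and 1."""
    if not 0.0 <= bk <= 1.0:
        raise ValueError("bk must be a proportion between 0 and 1")
    return 1000.0 * (bk - 0.63) ** 2


# The transform is a parabola with its minimum at Bk = 0.63, so two different
# proportions on either side of 0.63 can produce the same B value.
for bk in (0.0, 0.26, 0.63, 1.0):
    print(f"Bk = {bk:.2f} -> B = {b_feature(bk):.1f}")
```

Because the mapping is not one-to-one, the original proportion cannot always be recovered from `B` alone.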



In [None]:
# Importing the libraries 
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

### Load dataset

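Here we use the copy of the dataset that ships with sklearn. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the next cell needs an older release.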

In [None]:
from sklearn.datasets import load_boston
boston = load_boston()

# Create a dataframe
data = pd.DataFrame(boston.data)

In [None]:
data.head(10)  # Printing the first 10 data rows

### Looks fine !!

In [None]:
# Naming the features
data.columns = boston.feature_names
data.head()

In [None]:
# We need a price variable in the dataframe too
data['PRICE'] = boston.target
data.head()

### The price doesn't show up in the data initially because it is the target, not a feature

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.dtypes

In [None]:
# Dataset is not always flawless
# Check for missing values
data.isnull().sum()
# But this seems to be flawless

In [None]:
# Finding out the correlation between the features
corr = data.corr()
corr.shape
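
By default `data.corr()` computes the Pearson correlation between every pair of columns, which is why the result is a square table with one row and one column per variable. Here is a rough sketch of the same calculation in plain NumPy. The toy columns (`rooms`, `price`, `noise`) and the helper `pearson_matrix` are invented for illustration and are not the real dataset.

```python
import numpy as np


def pearson_matrix(values: np.ndarray) -> np.ndarray:
    """Pearson correlation between the columns of a 2-D array."""
    centered = values - values.mean(axis=0)
    # Sample covariance between every pair of columns.
    cov = centered.T @ centered / (len(values) - 1)
    std = np.sqrt(np.diag(cov))
    # Dividing by the product of standard deviations rescales to [-1, 1].
    return cov / np.outer(std, std)


# Made-up toy data: price depends on rooms, noise depends on nothing.
rng = np.random.default_rng(0)
rooms = rng.normal(6.0, 0.7, size=200)
price = 9.0 * rooms - 30.0 + rng.normal(0.0, 3.0, size=200)
noise = rng.normal(size=200)
toy = np.column_stack([rooms, price, noise])

corr_toy = pearson_matrix(toy)
print(corr_toy.round(2))
```

The diagonal is always 1 (every column correlates perfectly with itself), and values near 0 mean there is no linear relationship.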

In [None]:
# Plotting the heatmap of correlation between features
plt.figure(figsize=(20,20))
sns.heatmap(corr, cbar=True, square= True, fmt='.1f', annot=True, annot_kws={'size':15}, cmap='Blues')

In [None]:
# Splitting target variable and independent variables
X = data.drop(['PRICE'], axis = 1)
y = data['PRICE']

In [None]:
# Splitting to training and testing data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 4)
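
To see what a 70/30 split amounts to, here is a minimal sketch of a shuffled hold-out split using only NumPy. `shuffle_split` is a hypothetical helper, not sklearn's implementation, and its seed will not reproduce the exact rows that `train_test_split` picks. The row count of 506 is the usual size of this dataset.

```python
import numpy as np


def shuffle_split(n_rows: int, test_size: float, seed: int):
    """Return (train_idx, test_idx) for a shuffled hold-out split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)  # random ordering of row indices
    n_test = int(np.ceil(test_size * n_rows))
    # The first n_test shuffled rows become the test set, the rest train.
    return order[n_test:], order[:n_test]


train_idx, test_idx = shuffle_split(n_rows=506, test_size=0.3, seed=4)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable, which is what `random_state` does in the cell above.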

## Magic starts here!!!

## Linear Regression

In [None]:
# Import library for Linear Regression
from sklearn.linear_model import LinearRegression

# Create a Linear regressor
lm = LinearRegression()

# Train the model using the training sets 
lm.fit(X_train, y_train)

In [None]:
# Value of y intercept
lm.intercept_

In [None]:
# Converting the coefficient values to a dataframe
coefficients = pd.DataFrame([X_train.columns, lm.coef_]).T
coefficients = coefficients.rename(columns={0: 'Attribute', 1: 'Coefficients'})
coefficients
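
The intercept and coefficients come from an ordinary least squares fit: the model picks the weights that minimise the sum of squared errors on the training rows. Here is a rough sketch of that idea with NumPy's `np.linalg.lstsq` on synthetic data. All names and numbers below are invented for illustration, and this is not a claim about how sklearn solves the problem internally.

```python
import numpy as np

# Synthetic data with known weights so the recovered values can be checked.
rng = np.random.default_rng(4)
n_rows, n_features = 200, 3
X_toy = rng.normal(size=(n_rows, n_features))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 22.0
y_toy = true_intercept + X_toy @ true_coef + rng.normal(0.0, 0.1, size=n_rows)

# Prepend a column of ones so the first solved weight acts as the intercept.
design = np.column_stack([np.ones(n_rows), X_toy])
weights, *_ = np.linalg.lstsq(design, y_toy, rcond=None)
intercept, coef = weights[0], weights[1:]

print("intercept:", round(float(intercept), 2))
print("coefficients:", coef.round(2))
```

Each coefficient reads as "change in predicted price per unit change in that feature, with the others held fixed", which is how the table above can be interpreted.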

In [None]:
# Model prediction on train data
y_pred = lm.predict(X_train)


In [None]:
# Model Evaluation
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('Mean Absolute Error:',metrics.mean_absolute_error(y_train, y_pred))
print('Mean Squared Error:',metrics.mean_squared_error(y_train, y_pred))
print('Root Mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
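
For anyone curious what those five numbers are, here is a small sketch that computes them by hand with NumPy. The function name `regression_report` and the toy prices are made up for illustration. The adjusted R² line uses the same formula as the cell above, with `n` rows and `n_features` predictors.

```python
import numpy as np


def regression_report(y_true, y_hat, n_features):
    """Compute R^2, adjusted R^2, MAE, MSE and RMSE from scratch."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y_true)
    residuals = y_true - y_hat
    ss_res = np.sum(residuals ** 2)                 # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalises R^2 for the number of predictors used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    mse = ss_res / n
    return {
        "r2": r2,
        "adjusted_r2": adj_r2,
        "mae": np.mean(np.abs(residuals)),
        "mse": mse,
        "rmse": np.sqrt(mse),
    }


# Toy values, not taken from the dataset.
report = regression_report(
    y_true=[20, 22, 30, 35, 40],
    y_hat=[21, 21, 31, 34, 41],
    n_features=1,
)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

R² is the share of the variation in price that the model accounts for, while MAE and RMSE are in the same units as the target (thousands of dollars here).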

## I understand those values make no sense

So here's a chart

In [None]:
# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_train)
plt.title("What we want")
plt.xlabel("Actual Prices")
plt.ylabel("Predicted prices")
plt.show()


plt.scatter(y_train, y_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted prices")
plt.title("What we got")
plt.show()

## I know it looks like we missed the target by a long shot.

## But it explains about 73% of the variance in house prices (R² on the training data).

That's marginally more than my 12th grade score.