# Tutorial: normal equations for the Boston dataset
In this notebook we build regression models on the Boston housing dataset.

In [2]:
%matplotlib inline
import sklearn.datasets
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn.preprocessing import PolynomialFeatures

# Plot styling (plt.style.use is the same function as matplotlib.style.use)
sns.set_context("poster")
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (10, 6)
scatter_size = 60

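# Load the Boston dataset.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
boston = sklearn.datasets.load_boston()

def feature_normalize(dataset):
    # Standardize each column to zero mean and unit standard deviation
    mu = np.mean(dataset, axis=0)
    sigma = np.std(dataset, axis=0)
    return (dataset - mu) / sigma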

X = feature_normalize(boston.data)
Y = boston.target[:, np.newaxis]
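
To make the effect of `feature_normalize` concrete, here is a minimal sketch on a toy array. The array `toy` is a made-up stand-in for `boston.data`, not part of the notebook; the function body is the one defined above.

```python
import numpy as np

def feature_normalize(dataset):
    """Standardize each column to zero mean and unit standard deviation."""
    mu = np.mean(dataset, axis=0)
    sigma = np.std(dataset, axis=0)
    return (dataset - mu) / sigma

# Hypothetical toy data: 4 samples, 2 features on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

toy_norm = feature_normalize(toy)
col_means = toy_norm.mean(axis=0)  # per-column mean after scaling
col_stds = toy_norm.std(axis=0)    # per-column standard deviation after scaling
print(col_means, col_stds)
```

After scaling, both columns carry the same values even though the raw scales differ by a factor of 100. Note that a constant column would have `sigma == 0` and lead to a division by zero, so this sketch assumes every feature varies.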

## Boston housing dataset

In [3]:
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

### The data

In [4]:
# Turn the NumPy array into a pandas DataFrame
features = pd.DataFrame(boston.data, columns=boston.feature_names)
full_data = pd.DataFrame(boston.data, columns=boston.feature_names)
full_data['price'] = boston.target
full_data.head(10)

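|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | price |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |
| 5 | 0.02985 | 0.0 | 2.18 | 0.0 | 0.458 | 6.43 | 58.7 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.12 | 5.21 | 28.7 |
| 6 | 0.08829 | 12.5 | 7.87 | 0.0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5.0 | 311.0 | 15.2 | 395.6 | 12.43 | 22.9 |
| 7 | 0.14455 | 12.5 | 7.87 | 0.0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5.0 | 311.0 | 15.2 | 396.9 | 19.15 | 27.1 |
| 8 | 0.21124 | 12.5 | 7.87 | 0.0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5.0 | 311.0 | 15.2 | 386.63 | 29.93 | 16.5 |
| 9 | 0.17004 | 12.5 | 7.87 | 0.0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5.0 | 311.0 | 15.2 | 386.71 | 17.1 | 18.9 |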


## Train - Test Split

In [5]:
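# Boolean mask: each row lands in the training set with probability 0.80.
# No seed is set, so the split (and its exact size) changes on every run.
rnd_indices = np.random.rand(len(X)) < 0.80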

train_x = X[rnd_indices]
train_y = Y[rnd_indices]
val_x = X[~rnd_indices]
val_y = Y[~rnd_indices]
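
The mask above gives a split of roughly, not exactly, 80/20, and it differs between runs. If a reproducible split of fixed size is preferred, one possible alternative is a seeded permutation. This is a sketch rather than the notebook's method: the seed `0`, the `*_demo` names, and the random stand-in arrays are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed 0 is an arbitrary, hypothetical choice

# Hypothetical stand-ins for X and Y, with the same shapes as the Boston data.
X_demo = rng.normal(size=(506, 13))
Y_demo = rng.normal(size=(506, 1))

n = len(X_demo)
perm = rng.permutation(n)   # shuffled row indices 0..n-1
n_train = int(0.8 * n)      # fixed training-set size

train_idx, val_idx = perm[:n_train], perm[n_train:]

train_x_demo, train_y_demo = X_demo[train_idx], Y_demo[train_idx]
val_x_demo, val_y_demo = X_demo[val_idx], Y_demo[val_idx]

print(train_x_demo.shape, val_x_demo.shape)
```

Every row ends up in exactly one of the two sets, and rerunning the cell with the same seed reproduces the same split.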

## Base model

Take a constant hypothesis equal to the mean of the target.  
Calculate the MAPE (mean absolute percentage error).  
Calculate the mean squared error (printed as `MSS` below).

In [4]:
y_hat = Y.mean()
print("Mape: {:3.1f}%".format(100*(np.abs(Y-y_hat)/Y).mean()))
print("MSS:  {:3.1f}".format((np.square(Y-y_hat)).mean()))

Mape: 36.3%
MSS:  84.4
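
The two metrics can also be written as small reusable functions, which makes it easier to compare later models against this baseline. The following is a minimal sketch; the function names `mape` and `mse` and the toy targets are assumptions, and `np.abs(y_true)` in the denominator matches the cell above only because house prices are positive.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean(np.square(y_true - y_pred))

# Hypothetical toy targets, shaped like Y (one column).
y_toy = np.array([[10.0], [20.0], [30.0]])
baseline = y_toy.mean()  # constant hypothesis: predict the mean everywhere

toy_mape = mape(y_toy, baseline)
toy_mse = mse(y_toy, baseline)
print("MAPE: {:3.1f}%".format(toy_mape))
print("MSE:  {:3.1f}".format(toy_mse))
```

For the constant-mean hypothesis, the mean squared error equals the variance of the target, so any useful model should score below it.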


## Solve the normal equations

$$
\boldsymbol \theta = (X^\top X)^{-1}X^\top Y
$$
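
As an illustration of the formula, the sketch below solves the normal equations with NumPy on synthetic data. Several things here are assumptions rather than part of the notebook: the helper name `solve_normal_equations`, the column of ones prepended to $X$ so that $\boldsymbol \theta$ includes an intercept, and the use of `np.linalg.solve` on $X^\top X \boldsymbol \theta = X^\top Y$ in place of forming the inverse explicitly, which gives the same solution when $X^\top X$ is invertible.

```python
import numpy as np

def solve_normal_equations(X, Y):
    """Least-squares parameters from the normal equations (hypothetical helper).

    A column of ones is prepended to X so that theta[0] acts as the intercept.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    # Solve (Xb^T Xb) theta = Xb^T Y instead of inverting Xb^T Xb directly.
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ Y)

# Synthetic, noise-free data with known parameters (made up for this sketch).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_theta = np.array([[2.0], [1.0], [-3.0], [0.5]])  # intercept first
Y_demo = np.hstack([np.ones((200, 1)), X_demo]) @ true_theta

theta_hat = solve_normal_equations(X_demo, Y_demo)
print(theta_hat.ravel())
```

Because the synthetic targets contain no noise, the recovered parameters should match `true_theta` up to floating-point error. On the notebook's data the call would presumably be `solve_normal_equations(train_x, train_y)`.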

## Add polynomial features
