# Demo: Multivariable Regression on Boston Housing Data

In [None]:
import pandas as pd
import numpy as np

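names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'PRICE']

# The file is whitespace-delimited with no header row; '?' marks missing values.
# sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas releases.
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data',
                 header=None, sep=r'\s+', names=names, na_values='?')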

"""
Attribute Information:
    1.  CRIM      per capita crime rate by town
    2.  ZN        proportion of residential land zoned for lots over 
                  25,000 sq.ft.
    3.  INDUS     proportion of non-retail business acres per town
    4.  CHAS      Charles River dummy variable (= 1 if tract bounds 
                  river; 0 otherwise)
    5.  NOX       nitric oxides concentration (parts per 10 million)
    6.  RM        average number of rooms per dwelling
    7.  AGE       proportion of owner-occupied units built prior to 1940
    8.  DIS       weighted distances to five Boston employment centres
    9.  RAD       index of accessibility to radial highways
    10. TAX       full-value property-tax rate per $10,000
    11. PTRATIO   pupil-teacher ratio by town
    12. B         1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town
    13. LSTAT     % lower status of the population
    14. MEDV      Median value of owner-occupied homes in $1000's (loaded above as PRICE)
"""

## Forming the Design Matrix

We want to put the features of each sample into a feature vector and stack those vectors into a design matrix. Here we compare the pandas and NumPy data types and see why ```df['feature'].values``` matters: it returns a plain NumPy array rather than a pandas object.
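
As a minimal sketch of that difference, the toy frame below (made-up values, not the housing data) shows that column selection returns pandas objects while `.values` returns NumPy arrays. `to_numpy()` is the accessor the pandas documentation recommends nowadays and gives the same result here.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; the column names only mimic the housing data.
toy = pd.DataFrame({
    "RM": [6.5, 5.9, 7.1],
    "AGE": [65.2, 78.9, 61.1],
    "PRICE": [24.0, 21.6, 34.7],
})

col = toy["RM"]                   # pandas Series: carries an index and a name
arr = toy["RM"].values            # plain NumPy array: values only
mat = toy[["RM", "AGE"]].values   # 2-D NumPy array, one row per sample

print(type(col).__name__, col.shape)
print(type(arr).__name__, arr.shape)
print(type(mat).__name__, mat.shape)
```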

In [None]:
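# All column names, including the target
print(df.columns.to_list())
# Keep every column except the target 'PRICE' as a feature
features = df.columns.to_list()
features.remove('PRICE')
print(features)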

Treat the features of each sample as a vector, $\mathbf{x}$, and stack the samples in an $N$ by $D$ matrix, $X$, where $N$ is the number of samples and $D$ is the number of features.
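
Written out, with $\mathbf{x}_i \in \mathbb{R}^D$ denoting the feature vector of sample $i$ (the index $i$ is notation introduced here for illustration), the design matrix has one sample per row:

$$
X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix} \in \mathbb{R}^{N \times D}
$$

With the 14 column names defined at the top and `PRICE` removed, $D = 13$ here.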

In [None]:
print(df)

In [None]:
# Features
X = df[features].values
print(X)
print(X.shape)
print(f"The dataset contains {X.shape[0]} data points with {X.shape[1]} features")

In [None]:
# Labels
y = df['PRICE'].values
print(y.shape)
# Reshape to a column vector of shape (N, 1)
y = y.reshape(-1,1)
print(y.shape)

# Linear Regression

We are going to use [sklearn](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares).

First, we define a linear regression model and we fit the model to our data.

[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
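
Before fitting the housing data, a quick sanity check on synthetic data can make the API concrete. The sketch below generates noise-free targets from weights chosen purely for illustration (`true_w` and `true_b` are assumptions, not values from the dataset) and checks that `LinearRegression` recovers them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic design matrix: 50 samples, 3 features (illustrative sizes).
X_toy = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])   # hypothetical weights
true_b = 4.0                          # hypothetical intercept
y_toy = X_toy @ true_w + true_b       # noise-free linear targets

model = LinearRegression(fit_intercept=True)
model.fit(X_toy, y_toy)

print(model.coef_)       # should be close to true_w
print(model.intercept_)  # should be close to true_b
```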

In [None]:
from sklearn.linear_model import LinearRegression

regr = LinearRegression(fit_intercept=True)
regr.fit(X, y)

The quality of the fit can be evaluated by computing the mean squared error (MSE) between the model predictions and the corresponding targets, here on the same data used for fitting.
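
For reference, the MSE over $N$ samples is $\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$. The sketch below, on made-up numbers, computes it by hand and with `sklearn.metrics.mean_squared_error`; the two values should agree.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented targets and predictions, shaped (N, 1) like y in this notebook.
y_true = np.array([[3.0], [5.0], [7.0]])
y_pred = np.array([[2.5], [5.0], [8.0]])

mse_manual = np.mean((y_true - y_pred) ** 2)       # average of squared residuals
mse_sklearn = mean_squared_error(y_true, y_pred)   # same quantity via sklearn

print(mse_manual, mse_sklearn)
```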

In [None]:
y_hat = regr.predict(X)  # Model prediction
print(y_hat.shape)

In [None]:
mse_y = np.mean((y-y_hat)**2)
print(mse_y)

### Here are the parameters of the model:
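
These parameters fully determine the predictions: $\hat{y} = X\mathbf{w} + w_0$. As a small sketch on synthetic data (the arrays below are invented), note that when the target is a column vector of shape `(N, 1)`, as `y` is in this notebook, `coef_` has shape `(1, D)` and `intercept_` has shape `(1,)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(20, 4))   # N=20 samples, D=4 features (illustrative)
y_demo = rng.normal(size=(20, 1))   # column-vector target

reg_demo = LinearRegression().fit(X_demo, y_demo)
print(reg_demo.coef_.shape, reg_demo.intercept_.shape)

# Rebuild the predictions by hand from the fitted parameters.
y_manual = X_demo @ reg_demo.coef_.T + reg_demo.intercept_
print(np.allclose(y_manual, reg_demo.predict(X_demo)))
```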

In [None]:
print(regr.coef_)        # these are the weights [w_1, ..., w_D]
print(regr.intercept_)   # this is the bias w_0

### Here is a fancy way to compare $y$ and $\hat{y}$:

In [None]:
Y = np.hstack([y, y_hat])
with np.printoptions(precision=2):
    print(Y[:10,:])

# Exercise
Compute the least-squares solution with NumPy and compare your result with the one from sklearn.
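
As a hint, here is one standard formulation (the symbols $A$ and $\hat{\mathbf{w}}$ are introduced here for illustration). Append a column of ones to $X$ to absorb the intercept, $A = [\mathbf{1}\;\; X]$, and minimise the squared error:

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{y} - A\mathbf{w}\|_2^2
\quad\Longrightarrow\quad
A^T A\,\hat{\mathbf{w}} = A^T \mathbf{y}
$$

When $A^T A$ is invertible this gives $\hat{\mathbf{w}} = (A^T A)^{-1} A^T \mathbf{y}$; `np.linalg.lstsq` solves the same problem without forming the inverse explicitly. With this ordering of columns, the first entry of $\hat{\mathbf{w}}$ should match `regr.intercept_` and the remaining entries `regr.coef_`.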

In [None]:
## To do