# Homework Lecture 2: Linear Regression

## Preliminaries

### Imports

In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import scipy.optimize
import sklearn.datasets
import sklearn.linear_model


%matplotlib inline



### Data Directories 

Create the directories at the paths below.

In [2]:
raw_data_dir="../../raw/california_housing"
data_dir="../../data/ProbabilisticTools"


### Random Seed

In [3]:
seed=2506
np.random.seed(seed)

### Get Data

<div class="alert alert-block alert-success"> Problem 0 </div>
We download the California housing dataset using the function `sklearn.datasets.fetch_california_housing`.

In [4]:
import sklearn.datasets
housing=sklearn.datasets.fetch_california_housing()

In [5]:
housing.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR'])

In [6]:
print(housing.DESCR)

California housing dataset.

The original database is available from StatLib

    http://lib.stat.cmu.edu/datasets/

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.

References
----------

Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.




In [7]:
print(len(housing.feature_names),housing.feature_names)

8 ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


In [8]:
print(housing.data.shape,housing.target.shape)

(20640, 8) (20640,)


In [9]:
data=pd.DataFrame(housing.data,columns=housing.feature_names)
data["value"]=housing.target
data.describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |


## Data Pre-Processing

The variables in `data` have very different scales.
We will replace the values $x$ in each column by their standardized values
$$
    z = \frac{x - \bar{x}}{\sigma_x}
$$

<div class="alert alert-block alert-info"> Problem 1.1 </div>
Compute the mean and standard deviation of each column in `data`.

[HINT] Pandas has convenient functions to compute the column mean and standard deviation.

In [10]:
mean=data.mean()
std=data.std()

print("name,mean,std")
for name in data.columns:
    print(name,mean[name],std[name])

name,mean,std
MedInc 3.8706710029070246 1.8998217179452732
HouseAge 28.639486434108527 12.585557612111637
AveRooms 5.428999742190365 2.4741731394243205
AveBedrms 1.0966751496062053 0.47391085679546435
Population 1425.4767441860465 1132.4621217653375
AveOccup 3.070655159436382 10.386049562213591
Latitude 35.6318614341087 2.1359523974571117
Longitude -119.56970445736148 2.003531723502581
value 2.0685581690891843 1.1539561587441483


<div class="alert alert-block alert-info"> Problem 1.2 </div>
Create a new `DataFrame` called `data_standarized` in which the value $x$ of each column is replaced by its standardized value
$$
    z = \frac{x - \bar{x}}{\sigma_x}
$$

In [11]:
data_standarized=(data-mean)/std
data_standarized.describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -2.526564e-14 | 1.817399e-15 | 4.300802e-15 | 5.446427e-15 | -2.836528e-16 | -7.20117e-16 | -7.636681e-14 | -1.429215e-12 | -3.212312e-14 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| min | -1.774256 | -2.196127 | -1.852274 | -1.610729 | -1.256092 | -0.2289944 | -1.447533 | -2.385935 | -1.662601 |
| 25% | -0.6881019 | -0.8453727 | -0.3994399 | -0.191167 | -0.5637952 | -0.06170912 | -0.7967694 | -1.113182 | -0.756145 |
| 50% | -0.1767908 | 0.02864502 | -0.08078293 | -0.1010626 | -0.2291262 | -0.02431526 | -0.6422715 | 0.5389006 | -0.235328 |
| 75% | 0.4592952 | 0.6642943 | 0.2519554 | 0.006015724 | 0.2644885 | 0.02037404 | 0.972933 | 0.7784775 | 0.5014851 |
| max | 5.858144 | 1.856137 | 55.1619 | 69.57003 | 30.2496 | 119.4162 | 2.957996 | 2.625216 | 2.540349 |
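
One detail worth knowing about the standardization above: pandas' `std` uses the sample standard deviation (`ddof=1`) by default, while NumPy's `std` uses the population version (`ddof=0`). The sketch below illustrates the difference on a small made-up frame (`toy` and its values are assumptions for illustration, not part of the homework data).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

# pandas default: sample standard deviation (ddof=1), as used in the cell above.
z_pandas = (toy - toy.mean()) / toy.std()

# NumPy default: population standard deviation (ddof=0).
values = toy.to_numpy()
z_numpy = (values - values.mean(axis=0)) / values.std(axis=0)

# The two standardizations differ only by the constant factor sqrt((n - 1) / n).
n = len(toy)
ratio = z_pandas.to_numpy() / z_numpy
print(ratio)
print(np.sqrt((n - 1) / n))
```

With 20640 rows the factor is very close to 1, so either convention would give nearly the same `X` and `Y` here.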


<div class="alert alert-block alert-info"> Problem 1.3</div>
1. Create a numpy array variable named `X` with all the features (excluding the house values)
2. Create a numpy array variable named `Y` with the house prices (values)

In [12]:
# `DataFrame.as_matrix()` was removed in pandas 1.0; `to_numpy()` is the replacement
X=data_standarized[housing.feature_names].to_numpy()
Y=data_standarized["value"].to_numpy()
print(X.shape,Y.shape)

(20640, 8) (20640,)


## Exact Solution with Numpy

We assume a linear model
$$
     y = \sum_d x_d \theta_d  + \epsilon
$$
where $d$ runs through the housing features and $\epsilon$ is a Gaussian noise term.

<div class="alert alert-block alert-info"> Problem 2.1 </div>
Can you find a reason why we have not included a bias term `b` in the equation?

In [13]:
# X and Y have zero mean, so b would come up to be zero if we solved for it.
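
A small numerical check of this argument, as a sketch on synthetic data (the arrays `X_raw`, `y_raw` and the coefficients used to generate them are made up for illustration): once features and target are centered, fitting an explicit bias column returns a bias of essentially zero and the same slopes as the fit without a bias.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a non-zero offset (all values are illustrative).
X_raw = rng.normal(loc=3.0, scale=2.0, size=(200, 3))
y_raw = 1.5 + X_raw @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=200)

# Center features and target, as the standardization above does.
Xc = X_raw - X_raw.mean(axis=0)
yc = y_raw - y_raw.mean()

# Least squares with an explicit column of ones for the bias b.
A = np.hstack([np.ones((len(Xc), 1)), Xc])
coef, *_ = np.linalg.lstsq(A, yc, rcond=None)
b, theta_with_bias = coef[0], coef[1:]

# Least squares without a bias term.
theta_no_bias, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

print("bias:", b)
print("max slope difference:", np.abs(theta_with_bias - theta_no_bias).max())
```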

<div class="alert alert-block alert-info"> Problem 2.1 </div>
Using only `numpy` matrix algebra functions, find the Maximum Likelihood values of $\theta_d$

[Hint] Computing matrix inverses is computationally expensive. The function `numpy.linalg.solve` can be used to solve systems of linear equations.

In [14]:
theta_exact=np.linalg.solve( np.dot(X.T,X),np.dot(X.T,Y))
for idx in range(X.shape[1]):
    print(data.columns[idx],theta_exact[idx])

MedInc 0.7189522722254268
HouseAge 0.10291077971424403
AveRooms -0.23010693262952092
AveBedrms 0.264917894140825
Population -0.003902323642704787
AveOccup -0.03408034125556681
Latitude -0.7798454455510937
Longitude -0.7544152220969756
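
A short derivation of why the linear system solved above gives the Maximum Likelihood values, assuming the Gaussian noise $\epsilon$ has a fixed variance $\sigma^2$ (the symbol $\sigma^2$, the sample index $i$ and the sample count $N$ are introduced here for illustration and do not appear in the problem statement). The log-likelihood of the data is

$$
    \log L(\theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i \Big(y_i - \sum_d x_{id}\theta_d\Big)^2
$$

Only the second term depends on $\theta$, so maximizing the likelihood is the same as minimizing the sum of squared errors. Setting the derivative with respect to each $\theta_d$ to zero gives

$$
    \sum_i x_{id}\Big(y_i - \sum_{d'} x_{id'}\theta_{d'}\Big) = 0
    \quad\Longrightarrow\quad
    X^T X\,\theta = X^T Y
$$

which is the system passed to `np.linalg.solve`.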


<div class="alert alert-block alert-info"> Problem 2.2 </div>
Create a variable named `Y_pred` that contains, for each sample in $X$, the value of $Y$ predicted by the maximum likelihood model

In [15]:
Y_pred=np.dot(X,theta_exact)

## Gradient Descent Optimization

We will now solve the same problem using Gradient Descent, instead of the analytic solution.

<div class="alert alert-block alert-info"> Problem 3.1 </div>
Define a python function `mse(theta,X,Y)` that computes the mean square error function given $\theta$, $X$ and $Y$

In [16]:
def mse(theta,X,Y):
    Y_pred=np.dot(X,theta)
    dY=Y_pred-Y
    return 0.5*np.mean(dY**2)

<div class="alert alert-block alert-info"> Problem 3.2 </div>
Define a python function `grad(theta,X,Y)` that computes the gradient of the error function given $\theta$, $X$ and $Y$

In [17]:
def grad(theta,X,Y):
    Y_pred=np.dot(X,theta)
    dY=(Y_pred-Y)
    return np.dot(X.T,dY)/len(X)
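
For reference, the expression in `grad` follows from differentiating the error used in `mse`. Written with a sample index $i$ and sample count $N$ (notation introduced here for illustration), the error is

$$
    E(\theta) = \frac{1}{2N}\sum_i \Big(\sum_d x_{id}\theta_d - y_i\Big)^2
$$

and its derivative with respect to one component $\theta_d$ is

$$
    \frac{\partial E}{\partial \theta_d}
    = \frac{1}{N}\sum_i x_{id}\Big(\sum_{d'} x_{id'}\theta_{d'} - y_i\Big)
    = \frac{1}{N}\big[X^T (X\theta - Y)\big]_d
$$

The factor $\frac{1}{2}$ in `mse` cancels the 2 coming from the square, which is why `grad` has no extra constant.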

<div class="alert alert-block alert-info"> Problem 3.4 </div>
Using [`numpy.random.normal`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.normal.html) 
generate a random guess of the vector $\theta$ so that each component is $\mathcal{N}(0,1)$ distributed

In [18]:
D=X.shape[1]
theta0=np.random.normal(size=D)
theta0

array([ 1.18332114,  1.3144417 , -1.43101293, -0.90304957,  1.47576176,
       -1.11638228,  1.71375566, -0.16266391])

<div class="alert alert-block alert-info"> Problem 3.3 </div>
Use the function [`scipy.optimize.check_grad`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.check_grad.html)
to verify numerically that `grad` is really the gradient of `mse` for the $\theta$ guess.

[HINT] `grad` is the gradient of `mse` if `check_grad` returns a very small number (say $\approx 10^{-8}$)

In [19]:
scipy.optimize.check_grad(mse,grad,theta0,X,Y)

2.5193743230817395e-07
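
To see what such a check does, here is a minimal sketch of a finite-difference gradient test. It is not SciPy's implementation: `check_grad` is documented as comparing against a forward finite-difference approximation, while the hypothetical helper `finite_difference_grad` below uses central differences, and the data are synthetic.

```python
import numpy as np

def mse(theta, X, Y):
    # Same convention as above: half the mean squared error.
    dY = X @ theta - Y
    return 0.5 * np.mean(dY ** 2)

def grad(theta, X, Y):
    return X.T @ (X @ theta - Y) / len(X)

def finite_difference_grad(f, theta, *args, eps=1e-6):
    """Central-difference approximation of the gradient (hypothetical helper)."""
    approx = np.zeros_like(theta)
    for d in range(len(theta)):
        step = np.zeros_like(theta)
        step[d] = eps
        approx[d] = (f(theta + step, *args) - f(theta - step, *args)) / (2 * eps)
    return approx

# Synthetic data, only to exercise the check.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
Y_toy = rng.normal(size=50)
theta_toy = rng.normal(size=3)

numeric = finite_difference_grad(mse, theta_toy, X_toy, Y_toy)
error = np.linalg.norm(numeric - grad(theta_toy, X_toy, Y_toy))
print(error)
```

A small norm of the difference indicates that the analytic and numerical gradients agree at that point.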

<div class="alert alert-block alert-info"> Problem 3.4 </div>
**Steepest Descent Algorithm**

1. Pick a value for the learning rate $\eta$
1. Implement the steepest descent update rule
    $$
        \theta \leftarrow \theta - \eta \frac{\partial E}{\partial \theta}
    $$
1. Run the update rule on a loop, starting from your random guess for $\theta$. Repeat  $T=1000$ times
1. Every 100 steps, print the step number and the current error
1. After 1000 steps, print the final error, and the final $\theta$ parameters.
2. If the process did not converge, modify the value of the learning rate $\eta$ and repeat until convergence.

In [20]:
eta=0.05
T=1000
theta=theta0
for t in range(T):
    if (t % 100 ==0):
        print(t,mse(theta,X,Y))
    theta = theta - eta * grad(theta,X,Y)
print(T,mse(theta,X,Y))
print("theta",theta)

0 5.936400542007097
100 0.29115086653133304
200 0.24903259847247267
300 0.2269112448592256
400 0.21457695563442136
500 0.207504029123795
600 0.2033502396992242
700 0.20086303921832627
800 0.19935100825037902
900 0.19842118045332546
1000 0.19784449078061972
theta [ 7.64373858e-01  1.13963018e-01 -3.11180639e-01  3.29872832e-01
 -4.71264888e-04 -3.59837562e-02 -6.56625211e-01 -6.36252061e-01]


<div class="alert alert-block alert-info"> Problem 3.5 </div>
Compare the MSE of the steepest descent solution to the exact solution.

In [21]:
E=mse(theta,X,Y)
E_exact=mse(theta_exact,X,Y)
print("approx",E)
print("exact",E_exact)
print("diff",E-E_exact)

approx 0.19784449078061972
exact 0.19687411846320466
diff 0.0009703723174150636
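
The exact error can also be translated into a coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below is plain arithmetic on the numbers printed above (`N` and `E_exact` are copied from the outputs). It assumes that `Y` was standardized with the pandas default `ddof=1`, so that the mean of $Y^2$ is $(N-1)/N$.

```python
# Values copied from the outputs above.
N = 20640
E_exact = 0.19687411846320466

# `mse` includes a factor 1/2, so the mean squared residual is twice E_exact.
mean_sq_resid = 2 * E_exact

# Y has zero mean and unit sample variance (ddof=1), so mean(Y**2) = (N - 1) / N.
mean_sq_y = (N - 1) / N

r2 = 1 - mean_sq_resid / mean_sq_y
print(round(r2, 3))
```

The result can be compared against the R-squared value reported in the `statsmodels` summary later in this notebook.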


<div class="alert alert-block alert-info"> Problem 3.6 </div>
Compare the  steepest descent parameters $\theta$  to the exact solution.

In [22]:
print("feature, theta,theta_exact")
for idx in range(X.shape[1]):
    print(data.columns[idx],theta[idx],theta_exact[idx])

feature, theta,theta_exact
MedInc 0.7643738575365989 0.7189522722254268
HouseAge 0.11396301783052945 0.10291077971424403
AveRooms -0.3111806391332855 -0.23010693262952092
AveBedrms 0.32987283228271413 0.264917894140825
Population -0.000471264887870642 -0.003902323642704787
AveOccup -0.03598375622085677 -0.03408034125556681
Latitude -0.656625211472127 -0.7798454455510937
Longitude -0.636252060635037 -0.7544152220969756
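
The remaining gap is largest for `Latitude` and `Longitude`. One plausible explanation, offered as an assumption rather than something verified in this notebook, is that correlated features give the Hessian $H = X^T X / N$ a small eigenvalue, and steepest descent moves slowly along that direction. For a quadratic error the distance to the optimum evolves as $\theta_t - \theta^* = (I - \eta H)^t (\theta_0 - \theta^*)$, so the method converges only if $\eta < 2/\lambda_{max}$ and the slowest direction shrinks by $1 - \eta\lambda_{min}$ per step. The sketch below checks this on two synthetic, strongly correlated features (all data are made up).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated synthetic features, standardized.
base = rng.normal(size=500)
X_toy = np.column_stack([base + 0.1 * rng.normal(size=500),
                         base + 0.1 * rng.normal(size=500)])
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
Y_toy = X_toy @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=500)

# Hessian of the error and its eigenvalues (ascending order).
H = X_toy.T @ X_toy / len(X_toy)
g0 = X_toy.T @ Y_toy / len(X_toy)
eigvals = np.linalg.eigvalsh(H)

eta = 0.05
eta_max = 2.0 / eigvals[-1]          # largest stable learning rate
slowest = 1.0 - eta * eigvals[0]     # per-step shrink factor of the slowest direction

# Run steepest descent and compare with the closed-form prediction.
theta_star = np.linalg.solve(H, g0)
theta = np.zeros(2)
for _ in range(200):
    theta = theta - eta * (H @ theta - g0)
predicted = np.linalg.matrix_power(np.eye(2) - eta * H, 200) @ (np.zeros(2) - theta_star)

print("eigenvalues:", eigvals)
print("eta_max:", eta_max, "slowest factor:", slowest)
print("distance to optimum:", np.linalg.norm(theta - theta_star))
```

Running more steps, or using a larger $\eta$ that stays below $2/\lambda_{max}$, would bring the parameters closer to the exact solution.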


## Sklearn Comparison

<div class="alert alert-block alert-info"> Problem 4.1 </div>
Use [`sklearn.linear_model.LinearRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
to fit our model.

[Hint] You will need to create a `LinearRegression` object, and then call the `fit` method. Make sure not to fit the intercept (bias).


In [23]:
model=sklearn.linear_model.LinearRegression(fit_intercept=False)
model.fit(X,Y)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)

<div class="alert alert-block alert-info"> Problem 4.2 </div>
Compute the mean squared difference between the exact model's predictions `Y_pred` that we saved before and
`sklearn`'s linear regression predictions

In [24]:
Y_sk = model.predict(X)
dY=(Y_sk-Y_pred)
np.dot(dY.T,dY)/len(dY)

3.9626724673140836e-30

In [25]:
# It is also fine to include the factor 1/2 used in `mse` above (it is just a matter of convention)
Y_sk = model.predict(X)
dY=(Y_sk-Y_pred)
0.5*np.dot(dY.T,dY)/len(dY)

1.9813362336570418e-30

<div class="alert alert-block alert-info"> Problem 4.3 </div>
Compare the sklearn solution to the exact solution we found earlier.

[Hint] The solution is stored in the model's `coef_` attribute

In [26]:
print("feature, theta,theta_exact")
for idx in range(X.shape[1]):
    print(data.columns[idx],model.coef_[idx],theta_exact[idx])

feature, theta,theta_exact
MedInc 0.7189522722254276 0.7189522722254268
HouseAge 0.10291077971424487 0.10291077971424403
AveRooms -0.23010693262952173 -0.23010693262952092
AveBedrms 0.2649178941408253 0.264917894140825
Population -0.0039023236427050455 -0.003902323642704787
AveOccup -0.034080341255566575 -0.03408034125556681
Latitude -0.779845445551089 -0.7798454455510937
Longitude -0.7544152220969702 -0.7544152220969756


### Statsmodels Comparison

In [27]:
import statsmodels.api as sm



We will solve the problem using `statsmodels` so that we can appreciate the difference in emphasis between Machine Learning (`sklearn`) and Statistical Modeling (`statsmodels`).

<div class="alert alert-block alert-info"> Problem 5.1 </div>
Use [`statsmodels.api.OLS`](http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html) to solve the same linear regression problem


In [28]:
model=sm.OLS(Y,X)
results=model.fit()

<div class="alert alert-block alert-info"> Problem 5.2 </div>
Compare the `statsmodels` solution to the exact solution we found earlier.

[Hint] The fitted parameters are stored in the `params` attribute of the results object

In [29]:
print("feature, theta,theta_exact")
for idx in range(X.shape[1]):
    print(data.columns[idx],results.params[idx],theta_exact[idx])

feature, theta,theta_exact
MedInc 0.7189522722254271 0.7189522722254268
HouseAge 0.1029107797142449 0.10291077971424403
AveRooms -0.23010693262952087 -0.23010693262952092
AveBedrms 0.2649178941408249 0.264917894140825
Population -0.003902323642705284 -0.003902323642704787
AveOccup -0.03408034125556665 -0.03408034125556681
Latitude -0.7798454455510893 -0.7798454455510937
Longitude -0.7544152220969701 -0.7544152220969756


<div class="alert alert-block alert-info"> Problem 5.3 </div>
Print a `statsmodels` result summary (method `summary` of the results object).

It will show you a number of estimates on goodness-of-fit, significance of coefficients, etc.

In [30]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.606
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     3971.
Date:                Mon, 24 Sep 2018   Prob (F-statistic):               0.00
Time:                        14:23:28   Log-Likelihood:                -19668.
No. Observations:               20640   AIC:                         3.935e+04
Df Residuals:                   20632   BIC:                         3.942e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.7190      0.007    104.056      0.0

### Independence Test for Categorical Variables

<div class="alert alert-block alert-info"> Problem 6.1 </div>

Read the data from the file `homework.csv` in the `data_dir` directory.

Perform a $\chi^2$ test of independence between the variables `X` and `Y`.
Are `X` and `Y` dependent on each other?

[Hint] You can copy any code you need from the [`CategoricalInference`](./CategoricalInference.ipynb) Notebook,
but make sure to import any python modules you may need.

In [31]:
data=pd.read_csv(data_dir+"/homework.csv")
X=data["X"]
Y=data["Y"]

In [32]:
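# `as_matrix()` was removed in pandas 1.0; request floats so the one-hot columns are numeric
Z_x=pd.get_dummies(X).to_numpy(dtype=float)
Z_y=pd.get_dummies(Y).to_numpy(dtype=float)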

In [33]:
import scipy.special as special
def C2_independence(Z_x,Z_y):
    N=len(Z_x)
    D=Z_x.shape[1]
    K=Z_y.shape[1]
    # p_y has index k
    p_y=Z_y.mean(axis=0)
    # p_x has index d
    p_x=Z_x.mean(axis=0)
    # p will be K*D, with indexes k,d
    p=p_y[:,np.newaxis]*p_x[np.newaxis,:]
    # expectation if x and y are independent
    expect=N*p
    # Z_y has indexes i,k and Z_x has indexes i,d
    #Z will be N*K*D, with indexes i,k,d
    Z=Z_y[:,:,np.newaxis]*Z_x[:,np.newaxis,:]
    # observations for each (y,x) 
    # sum over i, left with a K*D matrix
    obs=Z.sum(axis=0) # last two expressions are the same as np.dot(Z_y^T,Z_x)
    df=obs-expect
    df2=df*df
    # guard against 0/0 when an expected count is zero
    c2 = (df2/np.maximum(1e-9,expect)).sum()
    return c2,special.chdtrc((K-1)*(D-1),c2)

In [34]:
C2_independence(Z_x,Z_y)

(20.306931610738086, 0.061498086442245095)

The p-value, that is the probability of a $\chi^2$ score at least this large if X and Y are independent, is about 6%. At a 5% significance threshold we therefore cannot reject the hypothesis that they are independent.
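
As a cross-check, the same test is available in SciPy as `scipy.stats.chi2_contingency`. The sketch below runs it on a synthetic frame, since `homework.csv` is not reproduced here (the frame `toy` and its category labels are made up). `correction=False` requests the plain Pearson statistic.

```python
import numpy as np
import pandas as pd
import scipy.stats

rng = np.random.default_rng(0)

# Synthetic stand-in for homework.csv; X and Y are drawn independently.
toy = pd.DataFrame({
    "X": rng.choice(["a", "b", "c"], size=300),
    "Y": rng.choice(["u", "v"], size=300),
})

# Contingency table of observed counts, rows indexed by Y and columns by X.
table = pd.crosstab(toy["Y"], toy["X"])

chi2, p_value, dof, expected = scipy.stats.chi2_contingency(table, correction=False)

# The statistic is sum((observed - expected)^2 / expected) over all cells.
manual = ((table.to_numpy() - expected) ** 2 / expected).sum()
print(chi2, manual, p_value, dof)
```

Applied to `pd.crosstab(Y, X)` for the homework data, this should agree with `C2_independence` up to floating-point error, provided every expected count is non-zero.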