# Homework Lecture 2: Linear Regression

## Preliminaries

### Imports

In [4]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import scipy.optimize
import sklearn.datasets
import sklearn.linear_model


%matplotlib inline



### Data Directories 

Create a directory with the path below

In [5]:
raw_data_dir="../../raw/california_housing"
data_dir="../../data/probabilisticTools"
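
The instruction above can also be carried out from Python rather than by hand. This is a minimal sketch, assuming a hypothetical helper named `ensure_dir` (not part of the notebook). It is demonstrated inside a temporary base directory so the example has no side effects; in the notebook you would pass `data_dir` or `raw_data_dir` instead.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` (and any missing parents) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)  # exist_ok avoids an error on re-runs
    return os.path.abspath(path)


# Demonstration in a throwaway location (assumption: illustrative path only)
with tempfile.TemporaryDirectory() as base:
    created = ensure_dir(os.path.join(base, "data", "probabilisticTools"))
    exists = os.path.isdir(created)

print(exists)
```

Because of `exist_ok=True`, re-running the cell is harmless when the directory is already there.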


### Random Seed

In [6]:
seed=2506
np.random.seed(seed)

### Get Data

<div class="alert alert-block alert-success"> Problem 0 </div>
We download the California housing dataset using the function `sklearn.datasets.fetch_california_housing`.

In [7]:
import sklearn.datasets
housing=sklearn.datasets.fetch_california_housing()

In [8]:
housing.keys()

dict_keys(['target', 'DESCR', 'data', 'feature_names'])

In [9]:
print(housing.DESCR)

California housing dataset.

The original database is available from StatLib

    http://lib.stat.cmu.edu/

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.

References
----------

Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.




In [10]:
print(len(housing.feature_names),housing.feature_names)

8 ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


In [11]:
print(housing.data.shape,housing.target.shape)

(20640, 8) (20640,)


In [12]:
data=pd.DataFrame(housing.data,columns=housing.feature_names)
data["value"]=housing.target
data.describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |


## Data Pre-Processing

The variables in `data` have very different scales.
We will replace the values $x$ in each column by their standardized values
$$
    z = \frac{x - \bar{x}}{\sigma_x}
$$

<div class="alert alert-block alert-info"> Problem 1.1 </div>
Compute the mean and standard deviation of each column in `data`.

[HINT] Pandas has convenient functions to compute the column mean and standard deviation.

In [13]:
print(data.mean())
print(data.std())

MedInc           3.870671
HouseAge        28.639486
AveRooms         5.429000
AveBedrms        1.096675
Population    1425.476744
AveOccup         3.070655
Latitude        35.631861
Longitude     -119.569704
value            2.068558
dtype: float64
MedInc           1.899822
HouseAge        12.585558
AveRooms         2.474173
AveBedrms        0.473911
Population    1132.462122
AveOccup        10.386050
Latitude         2.135952
Longitude        2.003532
value            1.153956
dtype: float64


<div class="alert alert-block alert-info"> Problem 1.2 </div>
Create a new `DataFrame` called `data_standarized` in which the value $x$ of each column is replaced by its standardized value
$$
    z = \frac{x - \bar{x}}{\sigma_x}
$$

In [14]:
data_standarized = (data-data.mean())/data.std()
data_standarized.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.344709 | 0.982119 | 0.628544 | -0.153754 | -0.974405 | -0.049595 | 1.052523 | -1.327803 | 2.12958 |
| 1 | 2.332181 | -0.607004 | 0.327033 | -0.263329 | 0.861418 | -0.09251 | 1.043159 | -1.322812 | 1.314124 |
| 2 | 1.782656 | 1.856137 | 1.155592 | -0.049015 | -0.820757 | -0.025842 | 1.038478 | -1.332794 | 1.258663 |
| 3 | 0.932945 | 1.856137 | 0.156962 | -0.049832 | -0.76601 | -0.050328 | 1.038478 | -1.337785 | 1.165072 |
| 4 | -0.012881 | 1.856137 | 0.344702 | -0.032905 | -0.759828 | -0.085614 | 1.038478 | -1.337785 | 1.172871 |
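
A quick sanity check on a standardized frame is that every column has mean 0 and standard deviation 1. The sketch below uses a small synthetic `DataFrame` (the names `toy` and `toy_standardized` are illustrative, not from the notebook) and also shows one subtlety: pandas' `std()` defaults to the sample estimate (`ddof=1`), while NumPy's `std()` defaults to `ddof=0`, so the two disagree slightly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for `data`: two columns on very different scales
toy = pd.DataFrame({
    "a": rng.normal(loc=1000.0, scale=250.0, size=500),
    "b": rng.normal(loc=-3.0, scale=0.1, size=500),
})

# Same expression as in the notebook cell above
toy_standardized = (toy - toy.mean()) / toy.std()

col_means = toy_standardized.mean()                     # ~0 for every column
col_stds = toy_standardized.std()                       # pandas default: ddof=1
col_stds_np = toy_standardized.to_numpy().std(axis=0)   # NumPy default: ddof=0

print(col_means.tolist())
print(col_stds.tolist())
print(col_stds_np)
```

With pandas' convention the standardized columns have unit standard deviation exactly, while the NumPy value is smaller by a factor $\sqrt{(N-1)/N}$, which is negligible for 20,640 rows.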


<div class="alert alert-block alert-info"> Problem 1.3</div>
1. Create a numpy array variable named `X` with all the features (excluding the house values)
2. Create a numpy array variable named `Y` with the house prices (values)

In [15]:
X = np.array(data_standarized.drop(['value'],axis = 1))
Y = np.array(data_standarized['value'])

## Exact Solution with Numpy

We assume a linear model
$$
     y = \sum_d x_d \theta_d  + \epsilon
$$
where $d$ runs through the housing features and $\epsilon$ is a Gaussian noise term.

<div class="alert alert-block alert-info"> Problem 2.1 </div>
Can you find a reason why we have not included a bias term `b` in the equation?

**Ans:** Every column, features and target alike, has been standardized to zero mean. A least-squares fit always passes through the point of means $(\bar{x}, \bar{y})$, which is now the origin, so a fitted bias `b` would be zero and can be omitted.
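
The claim can be checked numerically. The sketch below uses synthetic data (all names are illustrative assumptions): it centers the features and the target, then fits once with an explicit column of ones and once without. The coefficient on the ones column plays the role of `b`.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000

# Synthetic raw data with a clearly non-zero offset (4.0)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(N, 3))
y_raw = X_raw @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.3, size=N)

# Standardize features and target, as done for the housing data
X_c = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_c = (y_raw - y_raw.mean()) / y_raw.std()

# Fit with an explicit bias column: the last coefficient is b
X_aug = np.column_stack([X_c, np.ones(N)])
coef_aug, *_ = np.linalg.lstsq(X_aug, y_c, rcond=None)

# Fit without a bias column
coef_nob, *_ = np.linalg.lstsq(X_c, y_c, rcond=None)

bias = coef_aug[-1]
print("fitted bias:", bias)
print("max coefficient difference:", np.max(np.abs(coef_aug[:-1] - coef_nob)))
```

The ones column is orthogonal to every centered feature, so the fitted bias equals the mean of the centered target, which is zero up to rounding, and the remaining coefficients are unchanged.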

<div class="alert alert-block alert-info"> Problem 2.1 </div>
Using only `numpy` matrix algebra functions, find the Maximum Likelihood values of $\theta_d$

[Hint] Computing matrix inverses is computationally expensive.  The function `numpy.linalg.solve` can be used to solve systems of linear equations.

In [16]:
# `np.linalg.solve` gave problems here (it needs a square system), so `np.linalg.lstsq` is used instead
res = np.linalg.lstsq(X,Y,rcond=None)
theta_exact = res[0]
theta_exact

array([ 0.71895227,  0.10291078, -0.23010693,  0.26491789, -0.00390232,
       -0.03408034, -0.77984545, -0.75441522])
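
The hint about `numpy.linalg.solve` can still be followed. Passing `X` directly is rejected because `X` is not square, but the normal equations $X^T X\,\theta = X^T Y$ form a square $D \times D$ system. A minimal sketch on synthetic data (the names `X_toy`, `Y_toy` and `theta_true` are assumptions for illustration) comparing both routes:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 500, 4

# Synthetic regression problem with known parameters
X_toy = rng.normal(size=(N, D))
theta_true = np.array([0.7, 0.1, -0.2, 0.3])
Y_toy = X_toy @ theta_true + rng.normal(scale=0.5, size=N)

# Normal equations: a square D x D system that `solve` accepts
A = X_toy.T @ X_toy
b = X_toy.T @ Y_toy
theta_solve = np.linalg.solve(A, b)

# Least-squares route used in the notebook cell above
theta_lstsq, *_ = np.linalg.lstsq(X_toy, Y_toy, rcond=None)

max_diff = np.max(np.abs(theta_solve - theta_lstsq))
print(theta_solve)
print(max_diff)
```

Both routes avoid forming an explicit inverse. `lstsq` is generally the more robust choice when $X^T X$ is badly conditioned, since forming $X^T X$ squares the condition number.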

<div class="alert alert-block alert-info"> Problem 2.2 </div>
Create a variable named `Y_pred` that, for each sample in $X$, contains the maximum likelihood model's predicted value for $Y$

In [17]:
Y_pred = np.dot(X, theta_exact)
Y_pred

array([ 1.78784232,  1.65348419,  1.39347822, ..., -1.64417577,
       -1.51604801, -1.34559232])

## Gradient Descent Optimization

We will now solve the same problem using Gradient Descent, instead of the analytic solution.

<div class="alert alert-block alert-info"> Problem 3.1 </div>
Define a python function `mse(theta,X,Y)` that computes the mean square error function given $\theta$, $X$ and $Y$

In [18]:
def mse(_theta,_X,_Y):
    # Use the argument _Y (not the global Y) so the function works for any data set
    _dY = np.dot(_X,_theta) - _Y
    return 0.5 * np.mean(_dY**2)

<div class="alert alert-block alert-info"> Problem 3.2 </div>
Define a python function `grad(theta,X,Y)` that computes the gradient of the error function given $\theta$, $X$ and $Y$

In [19]:
def grad(_theta,_X,_Y):
    # Use the argument _Y (not the global Y) so the function works for any data set
    _dY = np.dot(_X,_theta) - _Y
    return np.dot(_X.T,_dY)/len(_X)
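
For reference, the gradient implemented above follows from differentiating the error term by term. A short derivation, assuming the same convention as the `mse` code (a factor $\frac{1}{2}$ in front of the mean over the $N$ samples, indexed here by $i$):

$$
    E(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \Big( \sum_d x_{id}\,\theta_d - y_i \Big)^2
$$

$$
    \frac{\partial E}{\partial \theta_d} = \frac{1}{N} \sum_{i=1}^{N} x_{id} \Big( \sum_{d'} x_{id'}\,\theta_{d'} - y_i \Big)
    \quad\Longleftrightarrow\quad
    \nabla_\theta E = \frac{1}{N} X^T (X\theta - Y)
$$

Setting this gradient to zero gives the normal equations $X^T X\,\theta = X^T Y$, whose solution is the exact least-squares estimate.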

<div class="alert alert-block alert-info"> Problem 3.3 </div>
Using [`numpy.random.normal`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.normal.html) 
generate a random guess of the vector $\theta$ so that each component is $\mathcal{N}(0,1)$ distributed

In [20]:
D = len(X.T)
np.random.seed(seed)
theta0 = np.random.normal(size = D)
theta0

array([ 1.18332114,  1.3144417 , -1.43101293, -0.90304957,  1.47576176,
       -1.11638228,  1.71375566, -0.16266391])

<div class="alert alert-block alert-info"> Problem 3.4 </div>
Use the function [`scipy.optimize.check_grad`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.check_grad.html)
to verify numerically that `grad` is really the gradient of `mse` for the  $\theta$ guess.

[HINT] `grad` is the gradient of `mse` if `check_grad` returns a very small number (say $\approx 10^{-8}$)

In [21]:
scipy.optimize.check_grad(mse,grad,theta0,X,Y)

2.5193743484242718e-07
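
To see what such a check does, the gradient can be approximated by finite differences and compared with the analytic one. The sketch below is a hand-rolled version using central differences on synthetic data. The central-difference scheme, the step `eps`, and the names `mse_toy`, `grad_toy` and `finite_difference_grad` are assumptions of this sketch, not a description of how `check_grad` works internally.

```python
import numpy as np


def mse_toy(theta, X, Y):
    """Half mean squared error, same convention as the notebook's `mse`."""
    r = X @ theta - Y
    return 0.5 * np.mean(r ** 2)


def grad_toy(theta, X, Y):
    """Analytic gradient of `mse_toy` with respect to theta."""
    r = X @ theta - Y
    return X.T @ r / len(X)


def finite_difference_grad(f, theta, *args, eps=1e-6):
    """Approximate each partial derivative with a central difference."""
    g = np.zeros_like(theta)
    for d in range(len(theta)):
        step = np.zeros_like(theta)
        step[d] = eps
        g[d] = (f(theta + step, *args) - f(theta - step, *args)) / (2 * eps)
    return g


rng = np.random.default_rng(3)
X_toy = rng.normal(size=(200, 5))
Y_toy = rng.normal(size=200)
theta_toy = rng.normal(size=5)

g_num = finite_difference_grad(mse_toy, theta_toy, X_toy, Y_toy)
g_ana = grad_toy(theta_toy, X_toy, Y_toy)
err = np.linalg.norm(g_num - g_ana)  # 2-norm of the discrepancy
print(err)
```

A sign error or a missing factor in `grad_toy` would make `err` comparable in size to the gradient itself instead of tiny.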

<div class="alert alert-block alert-info"> Problem 3.5 </div>
**Steepest Descent Algorithm**

1. Pick a value for the learning rate $\eta$
1. Implement the steepest descent update rule
    $$
        \theta \leftarrow \theta - \eta \frac{\partial E}{\partial \theta}
    $$
1. Run the update rule on a loop, starting from your random guess for $\theta$. Repeat  $T=1000$ times
1. Every 100 steps, print the step number and the current error
1. After 1000 steps, print the final error, and the final $\theta$ parameters.
2. If the process did not converge, modify the value of the learning rate $\eta$ and repeat until convergence.

In [22]:
# We pick eta = 0.1, which appears to converge
eta=0.1
T=1000
theta=theta0
for t in range(T):
    if (t % 100 ==0):
        print(t,mse(theta,X,Y))
        print("theta",theta)
    theta = theta - eta * grad(theta,X,Y)
print(T,mse(theta,X,Y))
print("theta:",theta)

0 5.93640054201
theta [ 1.18332114  1.3144417  -1.43101293 -0.90304957  1.47576176 -1.11638228
  1.71375566 -0.16266391]
100 0.248941746659
theta [ 0.92096184  0.21102494 -0.48196332  0.41671382  0.03360645 -0.04775863
  0.18835653  0.19647676]
200 0.214525984741
theta [ 0.86685232  0.16092102 -0.45290784  0.4242497   0.01528205 -0.04214132
 -0.21809466 -0.20713863]
300 0.203325894136
theta [ 0.82149443  0.13539224 -0.39905156  0.39371718  0.00658294 -0.03901443
 -0.44677408 -0.43207916]
400 0.199339626442
theta [ 0.78780326  0.12160549 -0.34936883  0.35877819  0.00200484 -0.03712981
 -0.57891274 -0.56097246]
500 0.19783918183
theta [  7.64278404e-01   1.13923252e-01  -3.11041198e-01   3.29775515e-01
  -4.84492127e-04  -3.59783571e-02  -6.57004727e-01  -6.36622446e-01]
600 0.197256533747
theta [ 0.74841603  0.10952044 -0.28374397  0.30837827 -0.00188027 -0.03526769
 -0.70399532 -0.68189612]
700 0.197026582328
theta [ 0.7379457   0.10693595 -0.26512412  0.29349238 -0.00268457 -0.0348258

<div class="alert alert-block alert-info"> Problem 3.6 </div>
Compare the MSE of the steepest descent solution to the exact solution.

In [23]:
E = mse(theta,X,Y)
E_exact = mse(theta_exact,X,Y)
print("approx",E)
print("exact",E_exact)
print("diff",E-E_exact)

approx 0.196883903105
exact 0.196874118463
diff 9.78464191842e-06


<div class="alert alert-block alert-info"> Problem 3.7 </div>
Compare the  steepest descent parameters $\theta$  to the exact solution.

In [24]:
print(theta)
print(theta_exact)
print(np.dot(theta-theta_exact,theta-theta_exact)/len(theta))
# The mean squared distance is small (about 5e-5), so gradient descent is close to
# the exact solution, although it has not fully converged after 1000 steps.

[ 0.72390482  0.10387113 -0.23940455  0.27258007 -0.00361725 -0.03426718
 -0.7681941  -0.74333741]
[ 0.71895227  0.10291078 -0.23010693  0.26491789 -0.00390232 -0.03408034
 -0.77984545 -0.75441522]
5.36491001839e-05
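
The per-component gap above is largest in the last two parameters. One plausible explanation, offered here as an interpretation rather than something the notebook output establishes, is that strongly correlated features make the quadratic error surface badly conditioned. For this error function the Hessian is $H = X^T X / N$, steepest descent converges only for $\eta < 2/\lambda_{\max}(H)$, and the error along an eigen-direction with eigenvalue $\lambda$ shrinks by $|1 - \eta\lambda|$ per step. A sketch on synthetic data with two anti-correlated features (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 2000

# Two strongly anti-correlated features plus one independent feature
z = rng.normal(size=N)
X_toy = np.column_stack([
    z + 0.3 * rng.normal(size=N),
    -z + 0.3 * rng.normal(size=N),
    rng.normal(size=N),
])
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

H = X_toy.T @ X_toy / N            # Hessian of the half mean squared error
eigvals = np.linalg.eigvalsh(H)    # eigenvalues in ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]

eta_max = 2.0 / lam_max            # largest step size that still converges
eta = 0.1                          # same value as in the notebook

# Per-step contraction factor of the slowest eigen-direction
slowest = max(abs(1 - eta * lam_min), abs(1 - eta * lam_max))
# Steps needed to shrink the error along that direction by a factor 1000
steps_for_1e3 = np.log(1e-3) / np.log(slowest)

print(lam_min, lam_max, eta_max)
print(slowest, steps_for_1e3)
```

When `lam_min` is small, the contraction factor is close to 1 and many hundreds of steps are needed, which matches the kind of slow tail seen in the error printout.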


## Sklearn Comparison

<div class="alert alert-block alert-info"> Problem 4.1 </div>
Use [`sklearn.linear_model.LinearRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
to fit our model.

[Hint] You will need to create a `LinearRegression` object, and then call the `fit` method. Make sure not to fit the intercept (bias).


In [25]:
model = sklearn.linear_model.LinearRegression(fit_intercept=False)
model.fit(X,Y)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)

<div class="alert alert-block alert-info"> Problem 4.2 </div>
Compute the mean squared difference between the exact model's predictions `Y_pred`, which we saved before, and
`sklearn`'s linear regression predictions.

In [26]:
Y_sk = model.predict(X)
dY=(Y_sk-Y_pred)
np.dot(dY.T,dY)/len(dY)

1.8755077202641051e-29

<div class="alert alert-block alert-info"> Problem 4.3 </div>
Compare the sklearn solution to the exact solution we found earlier.

[Hint] The solution is stored in the model's `coef_` attribute

In [27]:
print("feature, theta, theta_exact, diff")
for idx in range(X.shape[1]):
    print(idx,model.coef_[idx],theta_exact[idx],model.coef_[idx] - theta_exact[idx])

feature, theta, theta_exact, diff
0 0.718952272225 0.718952272225 3.5527136788e-15
1 0.102910779714 0.102910779714 4.57966997658e-16
2 -0.23010693263 -0.23010693263 5.27355936697e-16
3 0.264917894141 0.264917894141 6.66133814775e-16
4 -0.00390232364271 -0.0039023236427 -5.66820895775e-16
5 -0.0340803412556 -0.0340803412556 4.85722573274e-16
6 -0.779845445551 -0.779845445551 -1.99840144433e-15
7 -0.754415222097 -0.754415222097 -3.33066907388e-16


### Statsmodels Comparison

In [28]:
import statsmodels.api as sm

We will solve the same problem using `statsmodels` to appreciate the difference in emphasis between machine learning (`sklearn`) and statistical modeling (`statsmodels`).

<div class="alert alert-block alert-info"> Problem 5.1 </div>
Use [`statsmodels.api.OLS`](http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html) to solve the same linear regression problem


In [29]:
model=sm.OLS(Y,X)
results=model.fit()

<div class="alert alert-block alert-info"> Problem 5.2 </div>
Compare the `statsmodels` solution to the exact solution we found earlier.

[Hint] The fitted parameters are stored in the results object's `params` attribute

In [30]:
print("feature, theta, theta_exact, diff")
for idx in range(X.shape[1]):
    print(idx,results.params[idx],theta_exact[idx],results.params[idx] - theta_exact[idx] )

feature, theta, theta_exact, diff
0 0.718952272225 0.718952272225 3.33066907388e-15
1 0.102910779714 0.102910779714 5.55111512313e-17
2 -0.23010693263 -0.23010693263 2.49800180541e-16
3 0.264917894141 0.264917894141 5.55111512313e-17
4 -0.0039023236427 -0.0039023236427 -1.71303943253e-16
5 -0.0340803412556 -0.0340803412556 -6.93889390391e-17
6 -0.779845445551 -0.779845445551 -1.66533453694e-15
7 -0.754415222097 -0.754415222097 -8.881784197e-16


<div class="alert alert-block alert-info"> Problem 5.3 </div>
Print a `statsmodels` result summary (method `summary` of the results object).

It will show you a number of estimates on goodness-of-fit, significance of coefficients, etc.

In [31]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.606
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     3971.
Date:                Thu, 01 Feb 2018   Prob (F-statistic):               0.00
Time:                        23:26:29   Log-Likelihood:                -19668.
No. Observations:               20640   AIC:                         3.935e+04
Df Residuals:                   20632   BIC:                         3.942e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.7190      0.007    104.056      0.0
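
The R-squared figure in the summary can be related to the error function used earlier. For a model without an intercept, R-squared is computed against the uncentered total sum of squares, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i y_i^2$. A minimal sketch on synthetic data, using a hypothetical helper `r_squared_uncentered` and NumPy only:

```python
import numpy as np


def r_squared_uncentered(X, Y, theta):
    """R-squared for a no-intercept model: 1 - SSR / sum(Y^2)."""
    residual = Y - X @ theta
    return 1.0 - np.sum(residual ** 2) / np.sum(Y ** 2)


rng = np.random.default_rng(5)
X_toy = rng.normal(size=(300, 3))
Y_toy = X_toy @ np.array([0.7, -0.2, 0.1]) + rng.normal(scale=0.8, size=300)

theta_toy, *_ = np.linalg.lstsq(X_toy, Y_toy, rcond=None)
r2 = r_squared_uncentered(X_toy, Y_toy, theta_toy)

# Same quantity expressed through the notebook's error convention,
# which carries a factor 1/2 in front of the mean squared residual
mse_value = 0.5 * np.mean((X_toy @ theta_toy - Y_toy) ** 2)
r2_from_mse = 1.0 - 2.0 * mse_value / np.mean(Y_toy ** 2)

print(r2, r2_from_mse)
```

Applied to the numbers above: the standardized target has mean square very close to 1, so $R^2 \approx 1 - 2 \times 0.196874 \approx 0.606$, consistent with the value reported in the summary.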

### Independence test for categorical variables

<div class="alert alert-block alert-info"> Problem 6.1 </div>

Read the data from file 'homework.csv' in the  'data_dir' directory

Perform a $\chi^2$ test of independence between the variables `X` and `Y`.
Are 'X' and 'Y' dependent on each other?

[Hint] You can copy any code you need from the [`CategoricalInference`](./CategoricalInference.ipynb) Notebook,
but make sure to import any python modules you may need.

In [32]:
import scipy.special as special
data_filename=data_dir+"/homework.csv"
data=pd.read_csv(data_filename)
X = np.array(data['X'])
Y = np.array(data['Y'])


# `as_matrix()` was removed from pandas; `.values` gives the same array
Z_x = pd.get_dummies(X).values.astype(float)
Z_y = pd.get_dummies(Y).values.astype(float)

def C2_independence(Z_x,Z_y):
    N=len(Z_x)
    D=Z_x.shape[1]
    K=Z_y.shape[1]
    # p_y has index k
    p_y=Z_y.mean(axis=0)
    # p_x has index d
    p_x=Z_x.mean(axis=0)
    # p will be K*D, with indexes k,d
    p=p_y[:,np.newaxis]*p_x[np.newaxis,:]
    # Z_y has indexes i,k and Z_x has indexes i,d
    #Z will be N*K*D, with indexes i,k,d
    Z=Z_y[:,:,np.newaxis]*Z_x[:,np.newaxis,:]
    # sum over i, left with a K*D matrix
    obs=Z.sum(axis=0) # last two expressions are the same as np.dot(Z_y^T,Z_x)
    # expect
    expect=N*p
    df=obs-expect
    df2=df*df
    c2 = (df2/np.maximum(1e-9,expect)).sum()
    return c2,special.chdtrc((K-1)*(D-1),c2)

C2_independence(Z_x,Z_y)

OSError: File b'../../data/probabilisticTools/homework.csv' does not exist

**Ans:** `X` and `Y` do not appear to be independent.
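
As a cross-check of the hand-written test, the same statistic can be obtained from a contingency table with `scipy.stats.chi2_contingency`. The sketch below runs on synthetic categorical data, not on `homework.csv`, so it says nothing about the homework answer. The names `x`, `y_dep`, `y_ind` and `chi2_test` are assumptions for illustration. It also confirms the remark in the code above that the one-hot product `Z_y^T Z_x` is the table of observed counts.

```python
import numpy as np
import pandas as pd
import scipy.stats

rng = np.random.default_rng(6)
N = 1000

x = rng.choice(["a", "b", "c"], size=N)
# y_dep follows a function of x 70% of the time, so it depends on x
y_dep = np.where(
    rng.random(N) < 0.7,
    np.where(x == "a", "u", "v"),
    rng.choice(["u", "v"], size=N),
)
# y_ind is drawn without looking at x
y_ind = rng.choice(["u", "v"], size=N)


def chi2_test(x, y):
    """Chi-squared test of independence from the table of observed counts."""
    observed = pd.crosstab(y, x).to_numpy()
    chi2, p_value, dof, _expected = scipy.stats.chi2_contingency(
        observed, correction=False
    )
    return chi2, p_value, dof


chi2_dep, p_dep, dof_dep = chi2_test(x, y_dep)
chi2_ind, p_ind, dof_ind = chi2_test(x, y_ind)

# The one-hot product reproduces the contingency table
Z_x = pd.get_dummies(x).to_numpy().astype(float)
Z_y = pd.get_dummies(y_dep).to_numpy().astype(float)
obs_onehot = Z_y.T @ Z_x

print("dependent:  ", chi2_dep, p_dep, dof_dep)
print("independent:", chi2_ind, p_ind, dof_ind)
```

A very small p-value is evidence against independence. A large p-value only means the data give no evidence of dependence.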