# Linear Regression
### Jack Bennetto
#### January 24, 2017


## Objectives

 * State the assumptions of a linear-regression model
 * Estimate a linear-regression model
 * Evaluate a linear-regression model
 * Fix common problems that could compromise results

## Agenda

Morning

 * Introduction to regression
 * Simple linear regression
 * Multiple linear regression
 * Assumptions for linear regression
 
Afternoon

 * Assessing accuracy and comparing models
 * Categorical variables

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as scs
import sklearn
import statsmodels.api as sm

# scatter_matrix now lives in pandas.plotting (pandas.tools.plotting was removed)
from pandas.plotting import scatter_matrix
from sklearn.linear_model import LinearRegression

In [None]:
n = 1000
b0 = 5
b1 = 0.4
x = scs.uniform(0,10).rvs(n)
y = b0 + b1*x + scs.norm(0,1).rvs(n)


model = LinearRegression()
# sklearn expects a 2-D feature matrix, so reshape x into a single column
model.fit(x.reshape(-1, 1), y)

b0hat = model.intercept_
b1hat = model.coef_[0]

print("b0hat = ", b0hat)
print("b1hat = ", b1hat)

In [None]:
# statsmodels does not add an intercept automatically, so add a column of ones
x_with_const = sm.add_constant(x)
est = sm.OLS(y, x_with_const)
est = est.fit()
est.summary()

## Comparing models

The problems with RSS are

 * It's hard to interpret
   * It grows roughly in proportion to $n$
   * It's in units of the response squared
   
   
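A minimal sketch of both issues, using synthetic data drawn from the same model as the earlier cells. The helper `rss_for` is a made-up name for this illustration, not part of any library. RSS grows roughly tenfold when $n$ does, and multiplying the response by 100 (a change of units) multiplies RSS by $100^2$.

```python
import numpy as np

def rss_for(n, scale=1.0, seed=0):
    """Fit a least-squares line to n synthetic points and return its RSS.

    `scale` multiplies the response, mimicking a change of units.
    """
    rng = np.random.default_rng(seed)
    x_demo = rng.uniform(0, 10, n)
    y_demo = scale * (5 + 0.4 * x_demo + rng.normal(0, 1, n))
    # polyfit returns coefficients from highest degree to lowest
    slope, intercept = np.polyfit(x_demo, y_demo, 1)
    return np.sum((y_demo - (intercept + slope * x_demo)) ** 2)

rss_small = rss_for(1000)
rss_large = rss_for(10000)
rss_rescaled = rss_for(1000, scale=100)

print("RSS, n = 1000:       ", rss_small)
print("RSS, n = 10000:      ", rss_large)
print("RSS, response x 100: ", rss_rescaled)
```
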
## $R^2$

Unlike RSS, $R^2$ is unitless. For a least-squares fit with an intercept it ranges from 0 to 1 on the training data, and it represents the fraction of the variance in the response that is explained by the model.

$$R^2 = 1 - \frac{RSS}{TSS}$$

where RSS and TSS are given by

$$RSS = \sum_{i=1}^n (y_i - \hat y_i)^2$$
$$TSS = \sum_{i=1}^n (y_i - \bar y)^2$$



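For a simple linear regression fit by least squares, this is equal to the square of the correlation coefficient between $x$ and $y$. With several predictors, it equals the squared correlation between $y$ and $\hat y$.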
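A quick numerical check of this identity, as a sketch on synthetic data. The variable names ending in `_demo` are chosen here only to avoid clobbering the notebook's `x` and `y`.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, 1000)
y_demo = 5 + 0.4 * x_demo + rng.normal(0, 1, 1000)

# Least-squares line with an intercept
slope, intercept = np.polyfit(x_demo, y_demo, 1)
y_hat_demo = intercept + slope * x_demo

rss_demo = np.sum((y_demo - y_hat_demo) ** 2)
tss_demo = np.sum((y_demo - np.mean(y_demo)) ** 2)
r_squared = 1 - rss_demo / tss_demo

# Pearson correlation between the predictor and the response
r = np.corrcoef(x_demo, y_demo)[0, 1]

print("R^2 from RSS/TSS:      ", r_squared)
print("squared correlation r^2:", r ** 2)
```

The two printed values should agree up to floating-point error.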

## Adjusted $R^2$

$$\bar{R^2} = 1 - \frac{(1 - R^2)(n - 1)}{n - p}$$

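where $n$ is the number of data points and $p$ is the number of fitted parameters, including the intercept. Adding a predictor never lowers $R^2$ on the training data, so adjusted $R^2$ applies a penalty that grows with $p$.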
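The formula can be sketched as a small helper. `adjusted_r2` is a hypothetical name for this illustration, and $p$ is assumed to count the intercept, which is what the $n - p$ denominator implies.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p fitted parameters (intercept included)."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

# Same R^2 and sample size, different model sizes:
# the model with more parameters gets the lower adjusted R^2.
small_model = adjusted_r2(0.80, n=50, p=2)
large_model = adjusted_r2(0.80, n=50, p=10)

print("adjusted R^2, p = 2: ", small_model)
print("adjusted R^2, p = 10:", large_model)
```
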


## AIC/BIC (Akaike/Bayesian Information Criterion)

AIC and BIC are two metrics used to compare models of different complexity. In both cases, the lower the value, the better the model.

$$AIC = -2 \ln(\hat{\cal L}) + 2p$$
$$BIC = -2 \ln(\hat{\cal L}) + \ln(n)\,p$$

where $\hat{\cal L}$ is the maximized likelihood of the fitted model (the probability of the data under the fitted parameters), $p$ is the number of parameters fitted in the model, and $n$ is the number of data points. They are similar, but BIC penalizes complex models more heavily whenever $\ln(n) > 2$, that is, for $n \ge 8$.
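A minimal sketch of both criteria for a least-squares fit, assuming Gaussian errors so that the maximized log-likelihood can be written in terms of RSS. The helper `gaussian_aic_bic` is a made-up name, and counting only the regression coefficients in $p$ (not the noise variance) is a convention assumed here; other software may count differently.

```python
import numpy as np

def gaussian_aic_bic(y, y_hat, p):
    """AIC and BIC for a least-squares fit, assuming Gaussian errors.

    p counts the fitted coefficients (intercept included); the noise
    variance is not counted, which is an assumed convention.
    """
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    # Log-likelihood maximized over the noise variance (sigma^2 = RSS / n)
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = -2 * log_lik + 2 * p
    bic = -2 * log_lik + np.log(n) * p
    return aic, bic

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, 200)
y_demo = 5 + 0.4 * x_demo + rng.normal(0, 1, 200)

# Compare a straight line (2 coefficients) with a degree-5 polynomial (6 coefficients)
for degree in (1, 5):
    coefs = np.polyfit(x_demo, y_demo, degree)
    y_hat_demo = np.polyval(coefs, x_demo)
    aic, bic = gaussian_aic_bic(y_demo, y_hat_demo, p=degree + 1)
    print("degree", degree, "AIC", aic, "BIC", bic)
```

The higher-degree fit always has the smaller RSS, so the comparison turns on whether that improvement outweighs the penalty term.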

In [None]:
X = x.reshape(-1, 1)
# In sklearn's LinearRegression, score() returns R^2
print("R^2:", model.score(X, y))

# We can also calculate it from the unexplained variance: R^2 = 1 - RSS/TSS
rss = np.sum((y - model.predict(X))**2)
tss = np.sum((y - np.mean(y))**2)
print("RSS: {}  TSS: {} R^2: {}".format(rss, tss, 1 - rss/tss))


## Dummy variables

When using categorical features, create dummy variables: one 0/1 indicator column per category.

In the dataset below, the "mode" column labels the mode of travel taken, coded as follows:

 1. air
 2. train
 3. bus
 4. car

Can we use this for linear regression?


In [None]:
travel_df = sm.datasets.modechoice.data.load_pandas().data

In [None]:
travel_df.head()

In [None]:
travel_df['mode'].value_counts()

In [None]:
dummies = pd.get_dummies(travel_df['mode'])
dummies.head()

In [None]:
dummies.columns = 'air train bus car'.split()
dummies.head()

In [None]:
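# Keep three of the four dummies; "car" becomes the baseline, avoiding collinearity with the intercept
dummies = dummies[['air', 'train', 'bus']]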
travel_df = pd.concat((travel_df, dummies), axis=1)
travel_df.head()
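
As a sketch of why one dummy column is dropped, the example below uses toy data (hypothetical values, not the modechoice dataset). With an intercept plus all four dummies the design matrix is rank-deficient, because the dummies sum to the intercept column. With "car" dropped, the intercept is the mean response for car and each coefficient is that mode's offset from car.

```python
import numpy as np
import pandas as pd

# Toy data: 25 trips per mode, with a made-up mean cost for each mode
toy = pd.DataFrame({"mode": ["air", "train", "bus", "car"] * 25})
rng = np.random.default_rng(0)
base_cost = {"air": 100.0, "train": 60.0, "bus": 40.0, "car": 50.0}
toy["cost"] = toy["mode"].map(base_cost) + rng.normal(0, 5, len(toy))

all_dummies = pd.get_dummies(toy["mode"], dtype=float)

# Intercept plus all four dummies: 5 columns, but only rank 4
X_full = np.column_stack([np.ones(len(toy)), all_dummies.to_numpy()])
full_rank = np.linalg.matrix_rank(X_full)
print("columns:", X_full.shape[1], "rank:", full_rank)

# Drop "car" so it becomes the baseline category
kept = all_dummies[["air", "train", "bus"]]
X_toy = np.column_stack([np.ones(len(toy)), kept.to_numpy()])
coefs, _, rank, _ = np.linalg.lstsq(X_toy, toy["cost"].to_numpy(), rcond=None)

print("intercept (mean cost for car):  ", coefs[0])
print("air, train, bus offsets vs. car:", coefs[1:])
```
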