#### Introduction to Statistical Learning, Exercise 3.2

__Please do yourself a favour and only look at the solutions after you have honestly tried to solve the exercises.__

# Multiple Linear Regression on the Auto Data Set

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

### A. Preparation and Visualisation

Load the `Auto` data set and modify it to use the `name` column as the row index. Then remove the `name` column and produce a scatter plot matrix for all remaining variables, except `origin` (you can use the `vars` keyword argument of `sns.pairplot()` to achieve this). Use the `origin` variable for colour coding the plots (use the `hue` keyword argument of `sns.pairplot()`).



In [None]:
auto = datasets.Auto()
auto.set_index(auto['name'], inplace=True)
auto.drop('name', axis=1, inplace=True)
auto.head()

In [None]:
axg = sns.pairplot(data=auto, vars=auto.columns.drop('origin'), hue='origin')

### B. Correlations

Compute the *correlation matrix* of all the variables in the `Auto` data set using the `corr()` method of the data frame. 

In [None]:
cor = auto.corr()
cor
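
A full correlation matrix can be hard to scan by eye. The following is a minimal sketch, on a synthetic stand-in for the data (the `toy` frame and its values are assumptions, not the real `Auto` data), of how the pairs could be ranked by absolute correlation. The same steps should apply to the `cor` frame computed above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Auto data (hypothetical values, not the real data set).
rng = np.random.default_rng(0)
weight = rng.normal(3000, 800, size=200)
toy = pd.DataFrame({
    'weight': weight,
    'horsepower': 0.04 * weight + rng.normal(0, 10, size=200),
    'acceleration': rng.normal(15, 3, size=200),
})

toy_cor = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(toy_cor.shape, dtype=bool), k=1)
pairs = toy_cor.where(mask).stack().dropna()

# Sort the pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=np.abs, ascending=False)
print(ranked)
```

Sorting on the absolute value keeps strong negative correlations at the top of the list, next to the strong positive ones.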

### C. Multiple Linear Regression

Use the `smf.ols()` function to produce a multiple linear regression fit using `mpg` as the response and all other variables as the predictors.

Produce a summary of the fit result. Comment on the output. For example:

  - Is there a relationship between the predictors and the response?
  - Which predictors appear to have a statistically significant relationship to the response?
  - What does the coefficient for the `year` variable suggest?

In [None]:
formula = 'mpg~' + '+'.join(auto.columns.drop('mpg'))
lm = smf.ols(formula, auto).fit()
lm.summary()

Comments:

  - Several coefficients differ from zero and have low $p$-values, which indicates a relationship between the predictors and the response.
  - According to the $p$-values, the most significant variables are `weight`, `year`, `origin` and `displacement`. One has to be careful about `origin`, though: it is essentially a qualitative variable.
  - The coefficient for `year` is positive, suggesting that newer models achieve more miles per gallon.
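
The caveat about `origin` can be made concrete. A qualitative variable is usually entered as a set of indicator (dummy) columns rather than as a single number. The sketch below shows this coding with `pd.get_dummies()` on a hypothetical four-row frame (the `toy` values are made up). In the formula interface, writing `C(origin)` instead of `origin` should request an equivalent coding.

```python
import pandas as pd

# Hypothetical toy frame; in the Auto data `origin` is coded 1, 2, 3.
toy = pd.DataFrame({'mpg': [18.0, 26.0, 31.0, 24.0],
                    'origin': [1, 2, 3, 2]})

# One indicator column per level, dropping the first level as the baseline.
dummies = pd.get_dummies(toy['origin'], prefix='origin', drop_first=True)
print(dummies)
```

With this coding each coefficient measures the difference from the baseline level, instead of forcing the effect to change linearly with the numeric code.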

### D. Control Plots

Use the `lmplots.plot()` function to produce the summary plots for the fitted model. Comment on any problems you can see with the fit.

Do the residual plots suggest any unusually large outliers? Does the leverage plot indicate any observations (car models) with unusually high leverage? 

In [None]:
fig = lmplots.plot(lm)

The residuals vs fitted values plot shows a structure. Ideally, the residuals should be randomly scattered around the horizontal zero line. The structure highlighted by the *lowess estimate* line indicates the linear model does not fit the data well.

There are three models with unusually high residuals ('mazda glc', 'vw dasher diesel' and 'vw rabbit diesel').

One model has particularly high leverage ('buick estate wagon (sw)').
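
For reference, the leverage of an observation is the corresponding diagonal element of the hat matrix $H = X(X^TX)^{-1}X^T$. The sketch below computes it with plain NumPy on synthetic data (the single extreme `x` value is an assumption made for illustration). For a fitted statsmodels result, `lm.get_influence().hat_matrix_diag` should give the same quantity.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, size=50)
x[0] = 8.0  # one hypothetical observation far from the other predictor values

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Leverage is the diagonal of the hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# The leverages sum to the number of fitted coefficients, so the average is p / n.
n, p = X.shape
average = p / n
print(leverage.argmax(), leverage.max(), average)
```

Observations whose leverage is several times the average $p/n$ are the ones worth inspecting.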

### E. Interaction Terms

Use the `*` and/or `:` symbols in the formula to fit linear regression models with interaction terms. Do any interactions appear to be statistically significant?

Inspired by the correlation and scatter plot matrices, we try two models:

  1. Add interaction terms between `horsepower`, `weight` and `displacement` (large correlations).
  2. Add an interaction term for `year` and `acceleration` (small correlation).
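
In the formula syntax, `a:b` adds only the product of the two predictors, while `a*b` is shorthand for `a + b + a:b`. The sketch below illustrates what the interaction column is, using NumPy least squares on synthetic data. The names `a` and `b` and all coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)

# Hypothetical response with a genuine interaction effect of size 1.5.
y = 1.0 + 2.0 * a - 1.0 * b + 1.5 * a * b + rng.normal(scale=0.1, size=n)

# Columns: intercept, a, b and the interaction a:b (the element-wise product).
X = np.column_stack([np.ones(n), a, b, a * b])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef.round(2))
```

The last fitted coefficient estimates the interaction effect, which is what the `horsepower:weight` style terms below add to the model.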

In [None]:
f1 = formula + '+horsepower:weight+horsepower:displacement+weight:displacement'
lm1 = smf.ols(f1, auto).fit()
print(lm1.model.formula)
lm1.summary().tables[1]

In [None]:
f2 = formula + '+year:acceleration'
lm2 = smf.ols(f2, auto).fit()
print(lm2.model.formula)
lm2.summary().tables[1]

The `year:acceleration` term appears to have a relationship with the response and is statistically significant (low $p$-value).

### F. Non-linear Transformations

Try a few different non-linear transformations on some of the predictors, such as $\log(X)$, $\sqrt{X}$ and $X^2$. Comment on your findings.


We first make a few plots to get some inspiration.

In [None]:
ax = lmplots.plot_fit(lm, 'horsepower', lowess=True)

In [None]:
ax = lmplots.plot_fit(lm, 'weight', lowess=True)

In [None]:
ax = lmplots.plot_fit(lm, 'acceleration', lowess=True)

Inspired by the plots, we try quadratic terms for `horsepower` and `weight` and a $\log()$ transformation for `acceleration`.

In [None]:
f3 = formula + '+I(horsepower**2)+I(weight**2)+np.log(acceleration)'
lm3 = smf.ols(f3, auto).fit()
lm3.summary().tables[1]

In [None]:
ax = lmplots.plot_fit(lm3, 'acceleration')

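The squared `weight` has a small $p$-value but the coefficient is almost zero, so it appears to have little impact on the predicted response. Note, however, that the size of a coefficient depends on the scale of its predictor, and `weight**2` takes very large values.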

The logarithm of `acceleration` has a surprisingly large coefficient as well as a low $p$-value.
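
When reading these coefficients, it helps to keep the scale of each predictor in mind: a coefficient multiplies the predictor's value, so a tiny coefficient on a huge predictor can still move the prediction. The sketch below illustrates this with a squared weight term on synthetic data. All numbers are hypothetical and are not taken from the `Auto` fit.

```python
import numpy as np

rng = np.random.default_rng(3)
weight = rng.uniform(1600, 5000, size=300)  # hypothetical weights
# Hypothetical curved relationship between weight and mpg.
mpg = 60 - 0.02 * weight + 2e-6 * weight**2 + rng.normal(scale=1.0, size=300)

# Columns: intercept, weight and weight squared.
X = np.column_stack([np.ones_like(weight), weight, weight**2])
coef, *_ = np.linalg.lstsq(X, mpg, rcond=None)

# Contribution of the squared term to the prediction at weight = 4000.
contribution = coef[2] * 4000**2
print(coef[2], contribution)
```

Comparing the contribution at typical predictor values, rather than the raw coefficients, is one way to judge how much each term matters.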