#### Introduction to Statistical Learning, Lab 3.3

# Interactions & Non-linear Transformations

We often want to include interaction terms and non-linear transformations of the predictors in our model. Both are fully supported by the formula mini language. Useful references:

  - [statsmodels documentation](https://www.statsmodels.org/stable/)
  - [statsmodels formula interface](https://www.statsmodels.org/stable/example_formulas.html)
  - [the formula mini language](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-language)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from islpy import datasets, lmplots
sns.set()
%matplotlib inline

#### Data Set

We use the `Boston` data set to demonstrate multiple linear regression with interaction terms and non-linear transformations.

In [None]:
boston = datasets.Boston()
boston.head()

#### Model Specification & Fit

The `smf.ols()` function builds a statistical *model* prepared for fitting with *ordinary least squares* (ols). This is the type of fit explained in detail in the lecture.

The syntax to use interaction terms is `y~x1:x2`. This will include a term corresponding to $x_1\times x_2$ in the model.

There is a shorthand notation for including an interaction term and the predictors themselves: `y~x1*x2`. This is equivalent to `y~x1+x2+x1:x2`.

As in the simple regression with one predictor, a constant term for the intercept is added automatically.
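The equivalence of the two notations can be checked directly on the design matrix that the formula parser builds. The following is a minimal sketch on synthetic data; the column names `x1` and `x2` and the random values are assumptions made for illustration and are not part of the `Boston` data set.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Synthetic predictors (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=5), "x2": rng.normal(size=5)})

# Right-hand sides of the shorthand and the spelled-out formula
short = dmatrix("x1*x2", df)
long = dmatrix("x1 + x2 + x1:x2", df)

short_cols = short.design_info.column_names
long_cols = long.design_info.column_names
print(short_cols)

# Both formulas should produce the same columns with the same values
same_design = short_cols == long_cols and np.allclose(
    np.asarray(short), np.asarray(long)
)

# The interaction column is the element-wise product of the two predictors
interaction = np.asarray(short)[:, short_cols.index("x1:x2")]
is_product = np.allclose(interaction, df["x1"] * df["x2"])
print(same_design, is_product)
```

The printed column names also show the intercept column that is added automatically.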

The formula `medv~lstat*age` means we are using `lstat`, `age` and the interaction term `lstat`$\times$`age` as our predictors and `medv` as our dependent variable:

$$ \mathrm{medv} = \beta_0 + \beta_1 \mathrm{lstat} + \beta_2 \mathrm{age} + \beta_3 \mathrm{lstat}\times\mathrm{age}$$
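To see how the terms of such a model map to fitted coefficients, here is a small sketch on synthetic data with a known interaction effect. The variable names `x1`, `x2`, `y` and the true coefficient values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y = 1 + 2*x1 - 1*x2 + 0.5*x1*x2 + small noise (assumed values)
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = (
    1.0
    + 2.0 * df["x1"]
    - 1.0 * df["x2"]
    + 0.5 * df["x1"] * df["x2"]
    + rng.normal(scale=0.1, size=n)
)

fit = smf.ols("y ~ x1*x2", data=df).fit()

# One coefficient per term, including the automatically added intercept
print(fit.params)
```

The `params` attribute is a pandas Series indexed by term name, so a single coefficient can be read with, for example, `fit.params["x1:x2"]`.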

In [None]:
model = smf.ols(formula='medv~lstat*age', data=boston)
model_fit = model.fit()

#### Fit Result Summary

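We can get a comprehensive summary using the `summary()` method. Now we get the results for all four $\beta$ coefficients: the intercept $\beta_0$, the two main effects $\beta_1$ and $\beta_2$, and the interaction coefficient $\beta_3$.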

In [None]:
model_fit.summary()

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4.5))
lmplots.plot_fit(model_fit, 'lstat', ax=ax1)
lmplots.plot_fit(model_fit, 'age', ax=ax2)
plt.show()

#### Non-linear Transformations of the Predictors

The formula mini language also accommodates non-linear transformations of the predictors. For instance, given a predictor $x$ we can create a predictor $x^2$ using the expression `I(x**2)`. Here `I()` acts as an escape: it tells the formula parser to evaluate the expression inside as ordinary Python arithmetic. This is necessary because `**` has a special meaning in the formula language, where it generates interactions between terms rather than powers.

We now perform a regression of `medv` on `lstat` and `lstat` squared. The formula `medv~lstat+I(lstat**2)` describes the following model:

$$ \mathrm{medv} = \beta_0 + \beta_1 \mathrm{lstat} + \beta_2 \mathrm{lstat}^2$$

We stress again that this is still a linear model because it is *linear in the parameters*.
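One way to see this is to treat $x^2$ as just another column of the design matrix and solve the ordinary least squares problem directly. The sketch below uses NumPy only and synthetic data; the true coefficients 3, -2 and 0.25 are assumptions made for illustration.

```python
import numpy as np

# Synthetic data from a quadratic relationship (hypothetical coefficients)
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 3.0 - 2.0 * x + 0.25 * x**2 + rng.normal(scale=0.5, size=200)

# Design matrix with columns 1, x, x^2: the model is linear in beta
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares solution of X @ beta ≈ y
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```

The fit is non-linear in $x$ but requires nothing beyond a linear least squares solve, because the unknowns $\beta_0, \beta_1, \beta_2$ enter only linearly.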

In fact, we can use any Python expression this way, for example `I(np.log(age))`. The only restriction is that the variable names must be valid Python identifiers.
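The effect of `I()` can be inspected by building design matrices with the formula parser directly. This is a minimal sketch on a tiny hypothetical data frame with a single column `x`; it is meant to illustrate the behaviour described above rather than to serve as a reference.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Tiny hypothetical data frame
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

# Without I(), ** is a formula operator, so x**2 does not square the values
plain = dmatrix("x**2", df)
print(plain.design_info.column_names)

# With I(), the expression is evaluated as ordinary Python arithmetic
squared = dmatrix("x + I(x**2)", df)
print(np.asarray(squared))

# Other Python expressions, such as a NumPy function call, work the same way
logged = dmatrix("I(np.log(x))", df)
print(np.asarray(logged))
```

Comparing the first and second design matrices shows why forgetting `I()` silently fits a different model instead of raising an error.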

In [None]:
model2 = smf.ols(formula='medv~lstat+I(lstat**2)', data=boston)
model2_fit = model2.fit()
model2_fit.summary()

In [None]:
model2_fit.pvalues

The low $p$-value of the quadratic term suggests that the term improves the model over the simple regression `medv~lstat`.

To properly assess this, we would like to perform a hypothesis test with the two models. Our null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior.

The `anova_lm` function from `statsmodels.stats.api` performs such a test.
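For two nested models, the $F$-statistic of this test can be reproduced by hand from the residual sums of squares. The sketch below does so on synthetic data and compares the result with the table returned by `anova_lm`; the data, the variable names, and the import path `statsmodels.stats.anova` (which provides the same function) are choices made for this illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic data with a genuine quadratic effect (hypothetical coefficients)
rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({"x": rng.uniform(0, 10, size=n)})
df["y"] = 3.0 - 2.0 * df["x"] + 0.25 * df["x"] ** 2 + rng.normal(size=n)

small = smf.ols("y ~ x", data=df).fit()
full = smf.ols("y ~ x + I(x**2)", data=df).fit()

# F = ((RSS_small - RSS_full) / extra parameters) / (RSS_full / residual df of full)
df_diff = small.df_resid - full.df_resid
f_manual = ((small.ssr - full.ssr) / df_diff) / (full.ssr / full.df_resid)

# The second row of the ANOVA table holds the comparison of the two models
table = anova_lm(small, full)
print(f_manual, table["F"].iloc[1])
```

A large $F$ means that the extra term reduces the residual sum of squares by far more than would be expected from noise alone.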

In [None]:
model1 = smf.ols(formula='medv~lstat', data=boston)
model1_fit = model1.fit()

In [None]:
import statsmodels.stats.api as sms
sms.anova_lm(model1_fit, model2_fit)

Here the $F$-statistic is about 135 and the $p$-value is virtually zero. This provides clear evidence that the full model with the quadratic term is far superior.

We make some plots to confirm this visually.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4.5))
lmplots.plot_fit(model1_fit, 'lstat', ax=ax1)
lmplots.plot_fit(model2_fit, 'lstat', ax=ax2)
ax1.set_title(model1.formula)
ax2.set_title(model2.formula)
plt.show()

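Of course, we can combine all of the above in a single formula, here with both main effects, their interaction, and a quadratic term in `lstat`.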

In [None]:
lm = smf.ols(formula='medv~lstat*age+I(lstat**2)', data=boston).fit()
fig, ax = lmplots.plot_fit_3D(lm, 'lstat', 'age')
fig.suptitle(lm.model.formula)
plt.show()