#### Introduction to Statistical Learning, Lab 3.2

# Multiple Linear Regression

In the Python environment, the most popular libraries for model fitting (and therefore linear regression) are *sklearn* and *statsmodels*. The statsmodels library provides an R-style formula-based interface. We will mostly use this interface because it provides more flexibility and better parameter reporting. It also maps quite well onto the examples in the ISLR book.


  - [statsmodels documentation](https://www.statsmodels.org/stable/)
  - [statsmodels formula interface](https://www.statsmodels.org/stable/example_formulas.html)
  - [the formula mini language](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-language)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from islpy import datasets
%matplotlib inline

#### Data Set

We use the `Boston` data set to demonstrate multiple linear regression.

In [None]:
boston = datasets.Boston()
boston.head()

#### Model Specification

The `smf.ols()` function builds a statistical *model* prepared for fitting with *ordinary least squares* (OLS). This is the type of fit explained in detail in the lecture.

The syntax for multiple regressors (variables, predictors, features...) is `y~x1+x2+x3`. As in simple regression with one predictor, a constant term for the intercept is added automatically.

The formula `medv~lstat+age` means we are using `lstat` and `age` as our predictors and `medv` as our dependent variable:

$$ \mathrm{medv} = \beta_0 + \beta_1 \mathrm{lstat} + \beta_2 \mathrm{age}$$
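
To make the automatically added intercept concrete, here is a minimal sketch in plain NumPy. It uses synthetic data and made-up coefficients (assumptions for illustration, not the Boston data) and is not how statsmodels is implemented internally. It only shows that a formula with two predictors corresponds to a design matrix with a column of ones plus one column per predictor.

```python
import numpy as np

# Synthetic data (an assumption for illustration; not the Boston data)
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 100, n)
true_beta = np.array([30.0, -1.0, 0.05])  # hypothetical beta_0, beta_1, beta_2
y = true_beta[0] + true_beta[1] * x1 + true_beta[2] * x2 + rng.normal(0, 0.5, n)

# Design matrix: the column of ones plays the role of the intercept term
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: minimise ||y - X @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

The three entries of `beta_hat` estimate $\beta_0$, $\beta_1$ and $\beta_2$ in the same order as the columns of `X`.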

In [None]:
model = smf.ols(formula='medv~lstat+age', data=boston)

#### Fitting the Model

We *fit* the model to the data by calling the `fit()` method:

In [None]:
model_fit = model.fit()

#### Fit Result Summary

We can get a comprehensive summary using the `summary()` method. Now we get the results for all three $\beta$ coefficients.

In [None]:
model_fit.summary()

#### Specific Summary Tables

We can also select a specific table from the summary, for example the table of fitted coefficients:

In [None]:
model_fit.summary().tables[1]

#### Fit Result Parameters

Or we can retrieve only the fitted parameters ($\beta_0$ = *intercept*, $\beta_1$ = *lstat*, $\beta_2$ = *age*) as a pandas series using the `params` attribute:

In [None]:
model_fit.params
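
Because `params` is a pandas series indexed by term name, individual coefficients can be looked up by label. The sketch below uses synthetic data with hypothetical column names `x1`, `x2` and `y` (not the Boston data) to compute one prediction by hand and compare it with the model's own `predict()` method.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with hypothetical column names (for illustration only)
rng = np.random.default_rng(1)
df = pd.DataFrame({'x1': rng.uniform(0, 10, 100),
                   'x2': rng.uniform(0, 100, 100)})
df['y'] = 30 - 1.0 * df['x1'] + 0.05 * df['x2'] + rng.normal(0, 0.5, 100)

fit = smf.ols('y ~ x1 + x2', data=df).fit()

# params is indexed by term name: 'Intercept', 'x1', 'x2'
b = fit.params

# prediction by hand for one new observation
by_hand = b['Intercept'] + b['x1'] * 5.0 + b['x2'] * 50.0

# the same prediction from the fitted model
new = pd.DataFrame({'x1': [5.0], 'x2': [50.0]})
from_model = fit.predict(new).iloc[0]
print(by_hand, from_model)
```

The two printed numbers should agree up to floating-point rounding.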

#### Confidence Intervals

The 95% confidence intervals for the coefficients can be retrieved via the `conf_int()` method:

In [None]:
model_fit.conf_int()
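
The confidence level is controlled by the `alpha` argument of `conf_int()`, which defaults to 0.05. As a hedged sketch on synthetic data (hypothetical columns `x1`, `x2`, `y`), the following compares 95% and 99% intervals and reproduces the 95% interval by hand as estimate ± t-quantile × standard error.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic data with hypothetical column names (for illustration only)
rng = np.random.default_rng(2)
df = pd.DataFrame({'x1': rng.uniform(0, 10, 100),
                   'x2': rng.uniform(0, 100, 100)})
df['y'] = 30 - 1.0 * df['x1'] + 0.05 * df['x2'] + rng.normal(0, 0.5, 100)
fit = smf.ols('y ~ x1 + x2', data=df).fit()

ci95 = fit.conf_int()            # default alpha=0.05
ci99 = fit.conf_int(alpha=0.01)  # wider intervals at the 99% level

# by hand: estimate +/- t-quantile * standard error
t_crit = stats.t.ppf(0.975, fit.df_resid)
lower = fit.params - t_crit * fit.bse
upper = fit.params + t_crit * fit.bse

print(ci95)
print(ci99)
```

The returned data frame has one row per coefficient; column `0` holds the lower bound and column `1` the upper bound.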

#### Visualising the Fit Results

With two predictors we can visualise the data and the fit result in a 3D plot. The `seaborn` library does not provide 3D plotting facilities. There is a good reason for that: it is very hard to make informative 3D charts. Most of the time it is much better to think of a good way to visualise the data in 2D.

That said, we want to give at least one example of a 3D chart. Our approach is similar to the one variable case:

  - First produce a 3D scatter plot.
  - Next get a range of predictor values from the plot's x- and y-axis and compute all predictions on a 2D grid.
  - Then use the `plot_surface()` method to overlay the prediction plane of the fitted model.
  
As in the one-variable case, this approach might seem a bit heavy-handed for a linear model (it plots surface segments between many points on the grid, while only very few are necessary). But again it has the advantage that it works with *any* model!

In particular, this also works for the confidence interval surfaces, which are *not* planes.

What follows is quite a bit of code; making reasonable-looking 3D plots takes some work. We include a number of different features in this chart so you have a reference to come back to.
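
The grid step from the list above can be tried in isolation. This is a minimal sketch with made-up coefficients and small grid sizes (both are assumptions for illustration); for a model that is not linear, the line computing `zv` would be replaced by a call to that model's prediction function.

```python
import numpy as np

# hypothetical coefficients (beta_0, beta_1, beta_2), for illustration only
b0, b1, b2 = 30.0, -1.0, 0.05

xs = np.linspace(0, 40, 5)    # grid along the first predictor
ys = np.linspace(0, 100, 4)   # grid along the second predictor
xv, yv = np.meshgrid(xs, ys)  # both arrays have shape (ys.size, xs.size)

# evaluate the plane at every grid point in one vectorised step
zv = b0 + b1 * xv + b2 * yv
print(zv.shape)
```

Each row of `xv`, `yv` and `zv` corresponds to one value of the second predictor, which is the layout `plot_surface()` expects.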

In [None]:
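# importing mplot3d registers the 3D projection (only needed on older matplotlib)
from mpl_toolkits.mplot3d import axes3d
from matplotlib import cm

fig = plt.figure(figsize=(12, 9))
# create the 3D axes via the figure; Axes3D(fig) is no longer added to the
# figure automatically in recent matplotlib versions
ax = fig.add_subplot(projection='3d')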

# 3D scatter plot of the raw data
ax.scatter(boston.lstat, boston.age, boston.medv)

# prepare point grids from the ranges of the scatter plot
xs = np.linspace(*ax.get_xlim(), 100)
ys = np.linspace(*ax.get_ylim(), 100)
xv, yv = np.meshgrid(xs, ys, copy=False)
zv = np.zeros((ys.size, xs.size))
lv = np.zeros((ys.size, xs.size))
uv = np.zeros((ys.size, xs.size))

# compute predictions and CI bounds for the rows in the point grids
for idx, y in enumerate(yv):
    pred = model_fit.get_prediction({'lstat': xs, 'age': y}).summary_frame()
    zv[idx] = pred['mean']
    lv[idx] = pred['mean_ci_lower']
    uv[idx] = pred['mean_ci_upper']

# plot the prediction & CI boundary surfaces
ax.plot_surface(xv, yv, zv, alpha=0.4)
ax.plot_surface(xv, yv, lv, alpha=0.2, color='C1')
ax.plot_surface(xv, yv, uv, alpha=0.2, color='C1')

# add contour plot of the CI width to the bottom of the figure
ax.contourf(xv, yv, uv-lv,
            zdir='z',
            offset=ax.get_zlim()[0],
            levels=30,
            antialiased=True,
            cmap=cm.Oranges)

# set figure title and axes labels
ax.set_title('Linear regression on Boston housing data: medv ~ lstat + age')
ax.set_xlabel('lstat')
ax.set_ylabel('age')
ax.set_zlabel('medv')

# specify viewing angle
ax.view_init(15, -70)

The chart looks cool, but that alone is a bad reason to make it. Choosing a good viewing angle and the right surface transparencies can be a bit tricky. You can waste a lot of time (and CPU cycles) on this kind of thing.

Obviously, once we use more than two predictors, this approach to visualisation won't work anymore. You can play tricks with colour coding and so on. But you can do that in 2D as well and the results will be much more readable!

The bottom line is: __*don't make 3D charts unless you absolutely have to*__.