#### Introduction to Statistical Learning, Lab 3.1

# Simple Linear Regression

In the Python ecosystem, the most popular libraries for model fitting (and therefore linear regression) are *sklearn* and *statsmodels*. The statsmodels library provides an R-style formula-based interface. We will mostly use this interface because it offers more flexibility and better parameter reporting. It also maps well onto the examples in the ISLR book.


  - [statsmodels documentation](https://www.statsmodels.org/stable/)
  - [statsmodels formula interface](https://www.statsmodels.org/stable/example_formulas.html)
  - [the formula mini language](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-language)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from islpy import datasets
%matplotlib inline

#### Data Set

We use the `Boston` data set to demonstrate simple linear regression.

In [None]:
boston = datasets.Boston()
boston.head()

#### Model Specification

The `smf.ols()` function builds a statistical *model* prepared for fitting with *ordinary least squares* (ols). This is the type of fit explained in detail in the lecture.

The formula `medv~lstat` means we are using `lstat` as our predictor and `medv` as our dependent variable:

$$ \mathrm{medv} = \beta_0 + \beta_1 \mathrm{lstat} $$

In [None]:
model = smf.ols(formula='medv~lstat', data=boston)
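
To make the least-squares idea concrete, here is a minimal sketch of the closed-form estimates for a single predictor, checked against `numpy`'s generic solver. It does not use the `Boston` data: the arrays `x` and `y` are a synthetic stand-in, and the coefficients 34.5 and -0.95 are made up for illustration.

```python
import numpy as np

# Synthetic stand-in for (lstat, medv); the true coefficients are made up.
rng = np.random.default_rng(0)
x = rng.uniform(2, 38, size=200)
y = 34.5 - 0.95 * x + rng.normal(0, 6, size=200)

# Closed-form least-squares estimates for one predictor.
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# Cross-check against numpy's generic least-squares solver.
X = np.column_stack([np.ones_like(x), x])  # intercept column + predictor
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta0, beta1)
```

Both routes minimise the same residual sum of squares, so they should agree up to floating-point error.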

#### Fitting the Model

We *fit* the model to the data by calling the `fit()` method:

In [None]:
model_fit = model.fit()

#### Fit Result Summary

We can get a comprehensive summary using the `summary()` method:

In [None]:
model_fit.summary()
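
Two of the headline numbers in such a summary, $R^2$ and the residual standard error, are easy to reproduce by hand. The sketch below does so on synthetic data (the arrays and coefficients are assumptions for illustration, not the `Boston` data).

```python
import numpy as np

# Synthetic stand-in data; the true coefficients are made up.
rng = np.random.default_rng(0)
x = rng.uniform(2, 38, size=200)
y = 34.5 - 0.95 * x + rng.normal(0, 6, size=200)

# Least-squares fit with an intercept.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

n, p = X.shape                      # p counts the intercept as a parameter
rss = np.sum(resid ** 2)            # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
r_squared = 1 - rss / tss
rse = np.sqrt(rss / (n - p))        # residual standard error

print(r_squared, rse)
```

For a single predictor with an intercept, $R^2$ equals the squared correlation between predictor and response, which is a handy sanity check.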

#### Specific Summary Tables

We can also select a specific table from the summary. For example the fitted coefficients:

In [None]:
model_fit.summary().tables[1]

#### Fit Result Parameters

Or we can retrieve only the fitted parameters ($\beta_0$ = *intercept*, $\beta_1$ = *lstat*) as a pandas series using the `params` attribute:

In [None]:
model_fit.params

#### Confidence Intervals

The 95% confidence intervals for the coefficients can be retrieved via the `conf_int()` method:

In [None]:
model_fit.conf_int()
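
As a rough sketch of where such intervals come from: each one is the estimate plus or minus a quantile times the coefficient's standard error. The example below uses synthetic data and, as a simplifying assumption, the normal quantile (about 1.96) instead of the t quantile used for OLS; for a few hundred observations the difference is small.

```python
import numpy as np
from statistics import NormalDist

# Synthetic stand-in data; the true coefficients are made up.
rng = np.random.default_rng(0)
x = rng.uniform(2, 38, size=200)
y = 34.5 - 0.95 * x + rng.normal(0, 6, size=200)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

n, p = X.shape
sigma2 = resid @ resid / (n - p)            # estimated error variance
cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance matrix of the estimates
se = np.sqrt(np.diag(cov))                  # standard errors of beta0, beta1

# Large-sample approximation: normal quantile instead of the t quantile.
z = NormalDist().inv_cdf(0.975)
ci = np.column_stack([beta - z * se, beta + z * se])

print(ci)  # one row per coefficient: lower bound, upper bound
```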

#### Making Predictions

The purpose of fitting a model is usually to make predictions for predictor values that have not been observed yet. We use the `predict()` method for that. Note that the data passed to `predict()` must provide all keys (column names) used on the right-hand side of the formula. In practice this will almost always be a pandas data frame with the required columns, but any `dict`-like object with the required keys will work.

In [None]:
model_fit.predict({'lstat': [5, 10, 15]})

#### Prediction with Confidence and Prediction Intervals

In case we need confidence and/or prediction intervals we use the `get_prediction()` method and extract a summary data frame from the result with `summary_frame()`:

In [None]:
pred = model_fit.get_prediction({'lstat': [5, 10, 15]})
pred.summary_frame()

The `mean_ci` columns are the confidence interval limits and the `obs_ci` columns are the prediction interval limits.

For instance, the 95% confidence interval associated with an `lstat` value of 10 is (24.47, 25.63), and the 95% prediction interval is (12.83, 37.28). As expected, they are both centred around the same point, the predicted value 25.05, but the prediction interval is substantially wider.
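
The difference in width can be traced back to the two standard errors involved. The sketch below computes both by hand for simple linear regression on synthetic data (the arrays, the coefficients and the normal-quantile approximation are assumptions for illustration): the prediction interval adds the irreducible error variance $s^2$ to the variance of the estimated mean.

```python
import numpy as np
from statistics import NormalDist

# Synthetic stand-in data; the true coefficients are made up.
rng = np.random.default_rng(0)
x = rng.uniform(2, 38, size=200)
y = 34.5 - 0.95 * x + rng.normal(0, 6, size=200)

n = x.size
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x_bar
resid = y - (beta0 + beta1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))   # residual standard error

x0 = np.array([5.0, 10.0, 15.0])            # new predictor values
y0 = beta0 + beta1 * x0                     # predicted mean response

# Standard error of the estimated mean vs. of a single new observation.
se_mean = s * np.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
se_obs = s * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)

z = NormalDist().inv_cdf(0.975)             # large-sample stand-in for the t quantile
mean_ci = np.column_stack([y0 - z * se_mean, y0 + z * se_mean])
obs_ci = np.column_stack([y0 - z * se_obs, y0 + z * se_obs])

print(mean_ci)
print(obs_ci)
```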

#### Plotting the Fit Results

Our goal is to make a graph with a scatter plot and overlay the line resulting from the fit. There are (somewhat unfortunately) plenty of ways to do that.

We recommend the following approach:

  - First use `seaborn` to produce the scatter plot.
  - Next get a range of predictor values from the plot's x-axis.
  - Then use the `matplotlib` `plot()` function to overlay the prediction curve of the fitted model.
  
This approach might seem a bit heavy-handed for a linear model (it plots line segments between many points on the line, while only two are necessary). But it does have the advantage that it works with *any* model!

In [None]:
ax = sns.scatterplot(x='lstat', y='medv', data=boston)
xs = np.linspace(*ax.get_xlim(), 100)
ax.plot(xs, model_fit.predict({'lstat': xs}), color='C1', lw=2)
plt.show()

Note that we have modified the colour and width of the line. The colour name `C1` refers to the second colour in the default colour cycle. We highly recommend sticking to the colours in the default colour cycle; they were selected for good reasons!

#### Quick Regression Visualisation

If we are not interested in all the statistics and flexibility `statsmodels` provides, we can use `seaborn`'s built-in regression plot facility. This is useful to have a quick look but lacks a lot of the additional information provided by a fitted model. In particular, this does not allow us to compute predictions from future data sets.

In [None]:
ax = sns.regplot(x='lstat', y='medv', data=boston,
                 line_kws={'color': 'C1', 'lw': 2})

Note that `seaborn` also draws the confidence interval around the predictions as a shaded area. We will look at how to retrieve this information from our fitted model later in this lab.

#### Residuals & Hat-values

We use the `get_influence()` method to get access to a host of useful quantities, including residuals, studentised residuals and hat-values.

In [None]:
influence = model_fit.get_influence()

Residuals:

In [None]:
ax = sns.scatterplot(x=model_fit.predict(), y=influence.resid)

Studentised residuals:

In [None]:
ax = sns.scatterplot(x=model_fit.predict(), y=influence.resid_studentized)
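
For reference, a studentised residual is a raw residual divided by an estimate of its standard deviation. The sketch below assumes the internally studentised definition, $r_i = e_i / (s\sqrt{1 - h_i})$, and uses synthetic data rather than the `Boston` data.

```python
import numpy as np

# Synthetic stand-in data; the true coefficients are made up.
rng = np.random.default_rng(0)
x = rng.uniform(2, 38, size=200)
y = 34.5 - 0.95 * x + rng.normal(0, 6, size=200)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

n, p = X.shape
s2 = resid @ resid / (n - p)                     # estimated error variance
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # hat-values (leverages)

# Internally studentised residuals: scale each residual by its own std. dev.
r = resid / np.sqrt(s2 * (1 - h))

print(np.abs(r).max())
```

Because the residuals are now on a common scale, values far outside roughly $\pm 3$ are usually treated as potential outliers.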

The residual plots suggest that there is some non-linearity in the data. We can get *leverage* statistics by accessing the `hat_matrix_diag` property of the `influence` object.

In [None]:
ax = sns.scatterplot(x=boston.index, y=influence.hat_matrix_diag)

In [None]:
influence.hat_matrix_diag.argmax()

The `argmax()` method of `numpy` arrays gives us the *index* of the maximum value in the array. In this case it tells us which *observation* has the largest leverage statistic.
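
For simple linear regression the leverage has a closed form, $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, so the observation with the largest leverage is the one whose predictor value lies farthest from the mean. The following sketch checks this on synthetic data (an assumption for illustration) against the diagonal of the hat matrix.

```python
import numpy as np

# Synthetic predictor values standing in for lstat.
rng = np.random.default_rng(0)
x = rng.uniform(2, 38, size=200)

n = x.size
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)

# Closed-form leverage for one predictor plus intercept.
h_formula = 1 / n + (x - x_bar) ** 2 / sxx

# Same quantity as the diagonal of the hat matrix H = X (X'X)^-1 X'.
X = np.column_stack([np.ones_like(x), x])
h_matrix = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

i_max = h_formula.argmax()   # index of the highest-leverage observation
print(i_max, x[i_max], h_formula[i_max])
```

The leverages always sum to the number of fitted parameters (here 2), so the average leverage is $2/n$.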

#### Plotting the Fit Results with Confidence Interval Boundaries

Our goal is to make a graph with a scatter plot and overlay the line resulting from the fit together with the 95% confidence interval boundaries.

We recommend the following approach:

  - First use `seaborn` to produce the scatter plot.
  - Next get a range of predictor values from the plot's x-axis.
  - Then use the `matplotlib` `plot()` function to overlay the predicted curve of the fitted model.
  - Then use the `matplotlib` `fill_between()` function to overlay the confidence interval boundaries.
  
This approach does have the advantage that it works with *any* model!

In [None]:
ax = sns.scatterplot(x='lstat', y='medv', data=boston)
xs = np.linspace(*ax.get_xlim(), 100)
pred = model_fit.get_prediction({'lstat': xs}).summary_frame()
ax.plot(xs, pred['mean'], color='C1', lw=2)
lower = pred['mean_ci_lower']
upper = pred['mean_ci_upper']
ax.fill_between(xs, lower, upper, color='C1', alpha=0.3)  # same colour as the fit line
plt.show()