#### Introduction to Statistical Learning, Lab 3.5

# Producing Consistent Plots

We have stressed earlier that visualisation is *very* important. A big part of this is consistency. Unfortunately, consistency can be hard to achieve in the Python ecosystem.

The `seaborn` library provides a consistent, well-designed look, and we generally recommend using it for your plotting tasks.

However, there are some things `seaborn` can't do easily. In particular, we often want to plot quantities derived from a model that *we* have fitted. That can be a problem, because many of `seaborn`'s convenience features (such as `regplot()`) do their own fits, over which we have little control. In fact, these fits will often be plain wrong for our purposes, for example when we have fitted a multiple regression model. Yet we still want `seaborn`'s consistent look.

Many plots referred to in the ISL book are based on the convenient functions provided by R. Almost none of these are readily provided in the Python ecosystem, at least not with a consistent look and full control over the underlying model. 

So what do we do about this? 

We have to write some code, and as you have seen, this kind of thing is extremely tedious. But you are in luck: we did most of the work for you! All you have to do is:

```python
from islpy import datasets, utils, lmplots
```
We will demonstrate some functions from the `lmplots` module in this lab.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

#### Data Set

We use the `Auto` data set to demonstrate the usage of qualitative variables. We start by loading it and making the `name` column the index (it is not a useful variable and makes good row labels).

Later in this lab we also change the `origin` variable to strings referring to the region. This is more readable and makes `statsmodels` treat it as a qualitative variable (we could also force this with the `C()` notation). For now, `origin` stays numeric.

In [None]:
auto = datasets.Auto()
auto.set_index('name', inplace=True)
auto.head()

#### Model Specification & Fit

We would like to predict `mpg` based on all predictors.

In [None]:
formula = 'mpg~' + '+'.join(auto.columns.drop('mpg'))
lm = smf.ols(formula=formula, data=auto).fit()
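The formula string built above is plain string concatenation. A minimal sketch of what it produces, assuming the `Auto` columns that remain after moving `name` to the index are the ones listed here (the list is an assumption for illustration; check `auto.columns` on your copy):

```python
# Column names assumed for illustration; check auto.columns for the real ones.
columns = ['mpg', 'cylinders', 'displacement', 'horsepower',
           'weight', 'acceleration', 'year', 'origin']

response = 'mpg'
# Every column except the response becomes a predictor.
predictors = [c for c in columns if c != response]
formula = response + '~' + '+'.join(predictors)
print(formula)
```

Building the string from the column names means the formula follows the data set automatically if a column is added or dropped.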

#### Fit Result Summary

We can get a comprehensive summary using the `summary()` method. It reports the estimates for the intercept and for the $\beta$ coefficients of all predictors.

In [None]:
lm.summary()

#### Plotting Fit Results

The `plot_fit()` function from `islpy.lmplots` automatically marginalises out all other variables. The `seaborn.regplot()` function does not (and cannot) do this for us. By default, the plot shows a scatter plot of the response versus the specified predictor, overlaid with the fitted values and their 95% confidence interval.

In [None]:
ax = lmplots.plot_fit(lm, 'weight')
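To make "marginalising out" more concrete, here is a rough sketch of one way to do it: vary the predictor of interest over a grid and hold every other predictor at its mean. This is our reading of the idea, not necessarily what `plot_fit()` does internally. The data are synthetic and only borrow the `Auto` column names.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins for two predictors and a response (not the Auto data).
weight = rng.normal(3000, 800, n)
horsepower = rng.normal(100, 30, n)
mpg = 45 - 0.006 * weight - 0.04 * horsepower + rng.normal(0, 2, n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), weight, horsepower])
beta, *_ = np.linalg.lstsq(X, mpg, rcond=None)

# Vary weight over its observed range; hold horsepower at its mean.
grid = np.linspace(weight.min(), weight.max(), 50)
X_grid = np.column_stack([np.ones_like(grid), grid,
                          np.full_like(grid, horsepower.mean())])
fitted_line = X_grid @ beta
```

The resulting line has the slope of the `weight` coefficient from the multiple regression, which is generally different from the slope of a simple regression of the response on `weight` alone.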

We can also add the 95% *prediction interval*:

In [None]:
ax = lmplots.plot_fit(lm, 'weight', show_pi=True)
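As a reminder of how the two intervals differ, these are the standard textbook expressions for ordinary least squares. The notation is ours: $X$ is the design matrix with $n$ rows and $p$ columns (including the intercept), $x_0$ is the predictor vector at which we predict, and $\hat{\sigma}$ is the residual standard error.

$$
\text{confidence interval:}\quad \hat{y}_0 \pm t_{n-p,\,0.975}\;\hat{\sigma}\sqrt{x_0^\top (X^\top X)^{-1} x_0}
$$

$$
\text{prediction interval:}\quad \hat{y}_0 \pm t_{n-p,\,0.975}\;\hat{\sigma}\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}
$$

The extra $1$ under the square root accounts for the noise of a single new observation, which is why the prediction interval is always wider than the confidence interval.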

We can also influence colours and other properties, but we recommend *not* doing this in the interest of consistency. If you don't like the defaults or need different colours, define a proper colour scheme so that the `'CN'` colours are recognised. Be graphically consistent.

In [None]:
ax = lmplots.plot_fit(lm, 'weight', scolor='C3', fcolor='black', pcolor='C5', show_pi=True)
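One way to define such a colour scheme is to set matplotlib's property cycle, because the `'CN'` names are looked up there. The sketch below uses a made-up palette; the hex values are purely illustrative.

```python
import matplotlib as mpl
from cycler import cycler
from matplotlib.colors import to_hex

# Hypothetical palette, chosen only for illustration.
palette = ['#1b9e77', '#d95f02', '#7570b3', '#e7298a', '#66a61e', '#e6ab02']

# 'C0', 'C1', ... are resolved against the current property cycle.
mpl.rcParams['axes.prop_cycle'] = cycler(color=palette)

# 'C3' now refers to the fourth entry of the palette.
c3 = to_hex('C3')
print(c3)
```

Calling `sns.set_palette(palette)` has the same effect on the property cycle, so plots that use `'C3'` and similar names pick up the new scheme without further changes.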

We also provide a 3D plot facility. While it looks cool, remember that 2D plots are usually *much* more readable.

In [None]:
fig = lmplots.plot_fit_3D(lm, 'weight', 'horsepower')

The 3D plotting function also has some (unfortunate, as always) side effects: it resets the `seaborn` look to the `matplotlib` default. We have to repair this before we make more plots.

In [None]:
sns.set()
%matplotlib inline
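If a side effect like this comes from changes to matplotlib's `rcParams` (an assumption; we have not checked how `plot_fit_3D()` alters the style), it can also be contained with `rc_context()`, which restores the previous settings on exit. The function below is a hypothetical stand-in for any plotting call that changes the global style.

```python
import matplotlib as mpl

def style_changing_plot():
    """Hypothetical stand-in for a function that alters the global style."""
    mpl.rcParams['axes.facecolor'] = 'red'

before = mpl.rcParams['axes.facecolor']
with mpl.rc_context():
    style_changing_plot()
    inside = mpl.rcParams['axes.facecolor']   # changed while inside the block
after = mpl.rcParams['axes.facecolor']        # restored on exit
```

Whether this is enough for a given function depends on how it changes the style, so calling `sns.set()` again remains the simple fallback.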

Now we can make plots again like before.

In [None]:
ax = lmplots.plot_fit(lm, 'horsepower')

#### R-style Linear Model Control Plots

We provide functions to replicate R-style control plots on linear models. You can show the overall summary by using `lmplots.plot()`.

In [None]:
fig = lmplots.plot(lm)

For more flexibility, we can also show the plots individually. For example, we might not want the annotations (this option also works for `lmplots.plot()`).

In [None]:
ax = lmplots.plot_resid(lm, annotations=0)

Or maybe we want more annotations (the default is three annotations).

In [None]:
ax = lmplots.plot_leverage(lm, annotations=8)
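For reference, leverage is the diagonal of the hat matrix $H = X (X^\top X)^{-1} X^\top$. The sketch below computes it with numpy on synthetic data and picks the observations with the highest leverage. How `plot_leverage()` chooses which points to annotate is not covered here, so ranking by leverage is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 2
# Synthetic design matrix with an intercept column (not the Auto data).
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

# With the reduced QR decomposition X = QR, the hat matrix is H = Q Q',
# so its diagonal is the row-wise sum of squares of Q.
Q, _ = np.linalg.qr(X)
leverage = (Q ** 2).sum(axis=1)

# Indices of the observations with the highest leverage, largest first.
n_annotations = 3
top = np.argsort(leverage)[::-1][:n_annotations]
print(top, leverage[top])
```

A quick sanity check is that the leverages sum to the number of columns of `X`, here 3.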

#### Plotting with Qualitative Predictors

The `origin` variable in the `Auto` data set is qualitative. It is a good idea to make it explicitly so and let the `statsmodels` library deal with the encoding. 

In [None]:
# Assign the result back; in-place replace on a selected column is unreliable in recent pandas
auto['origin'] = auto['origin'].replace({1: 'US', 2: 'EU', 3: 'JP'})
auto.head()
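The encoding mentioned above is, by default, treatment (dummy) coding: one level serves as the reference and every other level gets a 0/1 indicator column. The pure-Python sketch below imitates this on a few made-up values. The column names mimic the style that `statsmodels` reports, but the code is an illustration, not the library's implementation.

```python
# Made-up values standing in for the recoded origin column.
origin = ['US', 'EU', 'JP', 'US', 'JP']

# Levels are sorted; the first one acts as the reference level.
levels = sorted(set(origin))
reference, others = levels[0], levels[1:]

# One 0/1 indicator column per non-reference level.
dummies = {f'origin[T.{lvl}]': [int(v == lvl) for v in origin]
           for lvl in others}

print(reference)
for name, column in dummies.items():
    print(name, column)
```

A row with zeros in all indicator columns belongs to the reference level, which is why three regions need only two extra columns.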

In [None]:
lm = smf.ols(formula, auto).fit()
lm.summary()
lm.model.exog_names

This works fine for the control plots.

In [None]:
fig = lmplots.plot(lm)

Unfortunately, `lmplots.plot_fit()` will fall over. This is understandable: predictions are made from the data set passed at model construction, and the magic done to the qualitative variables loses some information.

There is no easy generic way around this problem because we can't meaningfully compute a mean from qualitative variables (needed for marginalising out).
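The point about means can be seen with the standard library alone. In this sketch, with made-up values, a numeric predictor has a mean to hold it at, while a qualitative one does not.

```python
import statistics

weights = [3504, 3693, 3436]   # made-up numeric values
origins = ['US', 'EU', 'US']   # qualitative labels

# A numeric predictor has a mean we can hold it at.
print(statistics.mean(weights))

# A qualitative predictor does not: the mean is undefined for labels.
try:
    statistics.mean(origins)
    mean_defined = True
except TypeError:
    mean_defined = False
print(mean_defined)
```

Any substitute, such as fixing the variable at one particular level, would be a modelling choice rather than a generic default.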

So for now, the `plot_fit()` and `plot_fit_3D()` functions won't work if your fit includes explicitly qualitative variables.