In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from statsmodels.formula.api import ols

# Developing and Assessing the Model

This notebook introduces the idea of a line of best fit using conventional Least Squares routines.

In [None]:
ads = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

In [None]:
ads.head()

In [None]:
ads.info()

### Relationships Between Variables

Because we are examining linear relationships, it's important to remember that our goal is to develop a model we can use for prediction.  We will call the quantity we want to predict the **response** variable, and with a simple linear model we are looking at something like:

$$\text{response} = \text{predictor}\times \text{slope} + \text{intercept}$$

In the example of the advertising data, we are interested in how the **predictors** of TV, radio, and newspaper are related to sales. To start, let's look at a basic plot of the data. 

In [None]:
fig, axs = plt.subplots(1, 3, sharey=True, figsize=(16, 8))
ads.plot(kind='scatter', x='TV', y='sales', ax=axs[0])
ads.plot(kind='scatter', x='radio', y='sales', ax=axs[1])
ads.plot(kind='scatter', x='newspaper', y='sales', ax=axs[2])

There are a number of libraries with linear model capabilities.  We will start by looking at the `polyfit` and `polyval` functions in NumPy.  As a first example, let's see how `TV` and `sales` are related.

In [None]:
lm = np.polyfit(ads['TV'], ads['sales'], 1)

In [None]:
lm

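`np.polyfit` returns the coefficients from the highest degree down, so the first value is the slope and the second is the intercept.  Thus, the equation (with some rounding) is

$$ y = 0.05x + 7.03$$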
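
The order of the coefficients is easy to mix up, so it can help to check it on a line whose answer is known in advance. The sketch below uses made-up points lying exactly on $y = 2x + 5$ (not the advertising data); the names `x`, `y`, `slope`, `intercept`, and `pred` are illustrative only.

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x + 5 (not the advertising data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 5.0

# polyfit returns coefficients from highest degree to lowest: [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")

# polyval expects the same ordering, so the pair can be passed straight through
pred = np.polyval([slope, intercept], 10.0)
print(f"prediction at x = 10: {pred:.2f}")
```

Because `polyval` uses the same ordering, passing the output of `polyfit` directly to `polyval` is safe even if the slope and intercept are never unpacked by hand.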

With `.polyval()` we can evaluate our model at the values for television and plot a predicted line.

In [None]:
predictions = np.polyval(lm, ads['TV'])

In [None]:
plt.figure(figsize = (9, 5))
plt.scatter(ads['TV'], ads['sales'])
plt.plot(ads['TV'], predictions, color = 'black', linewidth = 4)

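Remember that the idea here is to use the model to predict how money spent on television advertising affects sales.  If we evaluate the model at 50,000 we find a prediction of roughly 2,384 for sales in that market.  Keep the units in mind, though: in this dataset the advertising budgets are recorded in thousands of dollars and sales in thousands of units, so an input of 50,000 lies far beyond the observed TV budgets and the prediction is an extrapolation.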

In [None]:
np.polyval(lm, 50000)

### StatsModels

Now, we investigate the `statsmodels` version of linear regression.  Here, the summary information is more detailed than what NumPy provides.

In [None]:
import statsmodels.formula.api as smf

In [None]:
lm = smf.ols(formula = 'sales ~ TV', data = ads).fit()

In [None]:
lm.summary()

In [None]:
#investigate residuals
lm.resid[:5]

In [None]:
#examine parameters
lm.params

In [None]:
#r2 value
lm.rsquared

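The $r^2$ value tells us the proportion of the variance in the response that the model explains.  It comes from the formula:

$$r^2 = \frac{TSS - RSS}{TSS}$$

Here $TSS$ is the total sum of squares (the variation of sales about its mean) and $RSS$ is the residual sum of squares (the variation left over after the fit).  This is a way for us to understand the predictive capability of our model in terms of the variance.  In this example, the model explains roughly 61% of the variance in sales (the correlation between TV and sales is about 0.78, and $0.78^2 \approx 0.61$).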
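
To make the formula concrete, here is a minimal sketch that computes $TSS$, $RSS$, and $r^2$ by hand. It uses synthetic data generated with NumPy rather than the advertising data, and every variable name in it is an assumption made for illustration.

```python
import numpy as np

# Synthetic data for illustration only (not the advertising data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

# Fit a line and compute the fitted values
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept

# TSS: total variation of y about its mean
tss = np.sum((y - y.mean()) ** 2)
# RSS: variation left over after the fit
rss = np.sum((y - fitted) ** 2)

r_squared = (tss - rss) / tss

# For a one-predictor least squares line with an intercept,
# r^2 equals the squared correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```

The two printed values should agree up to floating-point error, which is why a correlation and an $r^2$ for the same fit are easy to confuse: one is the square of the other.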

In [None]:
# recent seaborn releases require x and y to be passed as keywords
sns.jointplot(x='TV', y='sales', data=ads, kind='reg')

### Residuals

We can examine the residuals of the model to understand more about the quality of fit and whether we are using an appropriate model.  The following assumptions are made about residuals in the OLS method.

<div class="alert alert-danger" role="alert">
<ul>
<li> Should be balanced and symmetric about 0 </li>
<li> Should be free of trends </li>
<li> The overall magnitude of the residuals should be roughly the same across the entire dataset.  This assumption of constant variance is called "homoscedasticity". </li>
<li> Should follow a Gaussian (normal) distribution </li>
    </ul>
</div>
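
Some of these conditions can be checked numerically as well as visually. The sketch below is a rough illustration on synthetic data with constant noise (not the advertising data): it confirms that the residual mean is essentially zero and compares the residual spread for low and high predictor values. The median split and the name `spread_ratio` are assumptions chosen for this example, not a standard test.

```python
import numpy as np

# Synthetic data with constant noise (illustration only, not the advertising data)
rng = np.random.default_rng(1)
x = rng.uniform(0, 300, size=500)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=500)

# Fit a line and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# With an intercept in the model, least squares forces the residual mean to ~0
print("mean residual:", resid.mean())

# Crude homoscedasticity check: compare residual spread for low vs. high x
low = resid[x < np.median(x)]
high = resid[x >= np.median(x)]
spread_ratio = high.std() / low.std()
print("spread ratio (high / low):", spread_ratio)
```

A ratio near 1 is consistent with constant variance, while a ratio far from 1 hints at heteroscedasticity. This is only a quick screen; the residual plots remain the more informative diagnostic.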

In [None]:
# recent seaborn releases require x and y to be passed as keywords
sns.residplot(x='TV', y='sales', data=ads)

In [None]:
# distplot is deprecated in seaborn; histplot with a KDE overlay is the replacement
sns.histplot(lm.resid, kde=True)

In [None]:
resids = pd.DataFrame({'residuals': lm.resid})

In [None]:
resids.skew()

In [None]:
resids.kurt()

**Further Assessments**

There are many other ways to deepen your understanding of regression models.  The statsmodels library has many additional capabilities, including some nice plots of summary fit information.

http://www.statsmodels.org/dev/graphics.html#regression-plots

In [None]:
import statsmodels.api as sm
fig = plt.figure(figsize=(15,8))

# pass in the model as the first parameter, then specify the 
# predictor variable we want to analyze
# (here the fitted model is `lm`, created with smf.ols above)
fig = sm.graphics.plot_regress_exog(lm, "TV", fig=fig)

**PROBLEM**

How do the other features fare as single predictors?  

1. Determine a line of best fit for the other variables against sales.

2. Decide which of the linear models is the best predictor and why.