#### Introduction to Statistical Learning, Exercise 3.3

__Please do yourself a favour and only look at the solutions after you have honestly tried to solve the exercises.__

# Multiple Linear Regression on the Carseats Data Set

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from islpwf import datasets, utils, lmplots
sns.set()
%matplotlib inline

### A. Regression Fit

Load the `Carseats` data set and perform a multiple linear regression with `Sales` as the response and `Price`, `Urban`, and `US` as the predictors.


In [None]:
carseats = datasets.Carseats()
carseats.head()

In [None]:
lm = smf.ols('Sales~Price+Urban+US', carseats).fit()

### B. Interpretation

Provide an interpretation of each coefficient in the model. Be careful: some of the variables are qualitative!

To understand the meaning of the variables you can always call `help` on the data set:

```python
help(datasets.Carseats)
```

In [None]:
lm.summary().tables[1]

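The variable `Price` is quantitative, while `Urban` and `US` are qualitative.

  - `Price` has a negative coefficient and a small $p$-value: with `Urban` and `US` held fixed, a higher price is associated with lower `Sales`.
  - `Urban` has a large $p$-value, so there is no evidence that `Sales` differ between urban and non-urban stores. It is therefore not a good predictor.
  - `US`: with `Price` and `Urban` held fixed, stores in the US have higher `Sales` than stores outside the US. This is significant, as the $p$-value is low.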
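
To see how the qualitative predictors show up in the coefficient table, the following minimal sketch fits the same formula on a small synthetic data frame. The frame `demo`, its effect sizes, and the name `demo_fit` are assumptions made for illustration; this is not the Carseats data. The formula interface dummy-codes string columns, using the alphabetically first level (`No`) as the baseline, so the coefficients are labelled `Urban[T.Yes]` and `US[T.Yes]`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in with the same column names as Carseats (NOT the real data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Price": rng.uniform(50, 150, n),
    "Urban": rng.choice(["No", "Yes"], n),
    "US": rng.choice(["No", "Yes"], n),
})
# Assumed effects, chosen only for illustration: Urban has no effect at all.
demo["Sales"] = (13 - 0.05 * demo["Price"]
                 + 1.2 * (demo["US"] == "Yes")
                 + rng.normal(0, 1, n))

demo_fit = smf.ols("Sales ~ Price + Urban + US", demo).fit()

# One coefficient per quantitative predictor, one per non-baseline level
# of each qualitative predictor, plus the intercept.
coef_names = list(demo_fit.params.index)
print(coef_names)
print(demo_fit.pvalues)
```

A coefficient such as `US[T.Yes]` is read as the expected difference in `Sales` between a `Yes` store and a `No` store with all other predictors equal.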

### C. Model Equation

Write down the model in equation form. Be careful to handle qualitative variables properly!

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i =
\begin{cases}
    \beta_0 + \beta_3 x_{i3} + \epsilon_i: & x_{i1} = x_{i2} = 0 \\
    \beta_0 + \beta_1 + \beta_3 x_{i3} + \epsilon_i: & x_{i1} = 1, x_{i2} = 0 \\
    \beta_0 + \beta_2 + \beta_3 x_{i3} + \epsilon_i: & x_{i1} = 0, x_{i2} = 1 \\
    \beta_0 + \beta_1 + \beta_2 + \beta_3 x_{i3} + \epsilon_i: & x_{i1} = x_{i2} = 1 \\
\end{cases}
$$

where

$$
(x_{i1}, x_{i2}, x_{i3}) = (\mathsf{Urban}_i, \mathsf{US}_i, \mathsf{Price}_i)
$$

and $0$ means 'No' and $1$ means 'Yes'.
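
The four cases of the equation can be evaluated directly once the dummies are coded as $0$ and $1$. The sketch below uses hypothetical coefficient values (they are not the fitted ones) and a hypothetical helper `expected_sales`, purely to show that the qualitative predictors only shift the intercept while the `Price` slope stays the same.

```python
# Hypothetical coefficients for illustration only (NOT the fitted values).
beta = {
    "Intercept": 13.0,       # beta_0
    "Urban[T.Yes]": -0.02,   # beta_1
    "US[T.Yes]": 1.2,        # beta_2
    "Price": -0.05,          # beta_3
}

def expected_sales(price, urban, us, beta=beta):
    """Evaluate beta_0 + beta_1*x1 + beta_2*x2 + beta_3*x3 without the error term."""
    x1 = 1 if urban == "Yes" else 0  # Urban dummy
    x2 = 1 if us == "Yes" else 0     # US dummy
    return (beta["Intercept"]
            + beta["Urban[T.Yes]"] * x1
            + beta["US[T.Yes]"] * x2
            + beta["Price"] * price)

# One line per case of the model equation, at a fixed price of 100.
for urban in ("No", "Yes"):
    for us in ("No", "Yes"):
        print(f"Urban={urban:3s} US={us:3s} -> {expected_sales(100.0, urban, us):.2f}")
```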

### D. Null Hypothesis

For which of the predictors can you reject the *null hypothesis* $H_0: \beta_j = 0$?

The $p$-values for the `US` and `Price` predictors are virtually zero while the $p$-value of the `Urban` predictor is very large. Hence we can reject the null hypothesis for `US` and `Price`, but not for `Urban`.

### E. Improving the Model

Given your answer to the previous question, fit a smaller model that only uses the predictors for which the null hypothesis can be clearly rejected.


In [None]:
lm1 = smf.ols('Sales~US+Price', carseats).fit()

### F. Fit Quality

How well do the models from A and E fit the data?

In [None]:
lm.summary().tables[0]

In [None]:
lm1.summary().tables[0]

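Both models fit the data about equally well: since `Urban` contributes almost nothing, dropping it leaves $R^2$ essentially unchanged. However, the $F$-statistic for the model from E (with `Urban` omitted) is larger, because it explains the same share of the variance with one predictor fewer, and its adjusted $R^2$ is slightly higher. Therefore the model from E is preferable.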
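
The comparison can also be made numerically instead of reading it off two summary tables. The sketch below is an illustration on synthetic data (the frame `demo` and its effect sizes are assumptions, not the Carseats data). It collects $R^2$, adjusted $R^2$, the $F$-statistic, and the residual standard error of both fits, and runs a partial $F$-test of the nested models with `compare_f_test`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for Carseats (NOT the real data); Urban has no true effect.
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "Price": rng.uniform(50, 150, n),
    "Urban": rng.choice(["No", "Yes"], n),
    "US": rng.choice(["No", "Yes"], n),
})
demo["Sales"] = (13 - 0.05 * demo["Price"]
                 + 1.2 * (demo["US"] == "Yes")
                 + rng.normal(0, 2, n))

full = smf.ols("Sales ~ Price + Urban + US", demo).fit()
reduced = smf.ols("Sales ~ US + Price", demo).fit()

# Side-by-side fit statistics for the two models.
comparison = pd.DataFrame(
    {
        "R2": [full.rsquared, reduced.rsquared],
        "adj_R2": [full.rsquared_adj, reduced.rsquared_adj],
        "F": [full.fvalue, reduced.fvalue],
        "RSE": [np.sqrt(full.mse_resid), np.sqrt(reduced.mse_resid)],
    },
    index=["A: Price+Urban+US", "E: US+Price"],
)
print(comparison)

# Partial F-test of the nested models; H0: the Urban coefficient is zero.
f_stat, p_value, df_diff = full.compare_f_test(reduced)
print(f_stat, p_value, df_diff)
```

A large $p$-value in the partial $F$-test means the larger model does not fit significantly better than the smaller one.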

### G. Confidence Intervals

Using the model from E, obtain the 95% confidence intervals for the coefficients.

We can read off the 95% confidence intervals from the last two columns of the second summary table.

In [None]:
lm1.summary().tables[1]

If we want to use the confidence intervals for further computations, it is better to retrieve the corresponding data frame from the fitted model.

In [None]:
cis = lm1.conf_int()
cis

This also allows us to change the confidence level, for example to 98% by passing `alpha=0.02`.

In [None]:
cis = lm1.conf_int(alpha=0.02)  # alpha = 1 - confidence level
cis
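
As a cross-check of what `conf_int` computes, the intervals can be rebuilt by hand as $\hat\beta_j \pm t_{1-\alpha/2,\,n-p-1} \cdot \mathrm{SE}(\hat\beta_j)$. The sketch below does this on synthetic data; the frame `demo`, the fit `fit`, and the helper `manual_conf_int` are assumptions made for illustration and are not part of the exercise.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic stand-in for Carseats (NOT the real data).
rng = np.random.default_rng(2)
n = 150
demo = pd.DataFrame({
    "Price": rng.uniform(50, 150, n),
    "US": rng.choice(["No", "Yes"], n),
})
demo["Sales"] = (13 - 0.05 * demo["Price"]
                 + 1.2 * (demo["US"] == "Yes")
                 + rng.normal(0, 2, n))
fit = smf.ols("Sales ~ US + Price", demo).fit()

def manual_conf_int(fit, alpha=0.05):
    """Estimate +/- t-quantile times standard error, for every coefficient."""
    t_crit = stats.t.ppf(1 - alpha / 2, fit.df_resid)
    return pd.DataFrame({0: fit.params - t_crit * fit.bse,
                         1: fit.params + t_crit * fit.bse})

ci95 = manual_conf_int(fit, alpha=0.05)
ci98 = manual_conf_int(fit, alpha=0.02)
print(ci95)
print(ci98)

# Largest absolute deviation from the library result at each level.
print(np.abs(ci95.values - fit.conf_int(alpha=0.05).values).max())
print(np.abs(ci98.values - fit.conf_int(alpha=0.02).values).max())
```

The 98% intervals are wider than the 95% intervals, because a higher confidence level requires a larger $t$-quantile.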

### H. Outliers & High Leverage Observations

Is there any evidence of outliers or high leverage observations in the model from E?

In [None]:
fig = lmplots.plot(lm1)

No, the fitted model looks solid in every respect.
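
A numeric check can complement the visual impression from the diagnostic plots. The sketch below uses statsmodels' `get_influence()` on a synthetic fit (the data and the fit are assumptions, not the Carseats model). It flags observations whose studentized residual exceeds 3 in absolute value and observations whose leverage exceeds $2(p+1)/n$. The factor 2 is a common rule of thumb assumed here; the average leverage is always $(p+1)/n$.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for Carseats (NOT the real data).
rng = np.random.default_rng(3)
n_obs = 200
demo = pd.DataFrame({
    "Price": rng.uniform(50, 150, n_obs),
    "US": rng.choice(["No", "Yes"], n_obs),
})
demo["Sales"] = (13 - 0.05 * demo["Price"]
                 + 1.2 * (demo["US"] == "Yes")
                 + rng.normal(0, 2, n_obs))
fit = smf.ols("Sales ~ US + Price", demo).fit()

influence = fit.get_influence()
student = influence.resid_studentized_external  # studentized residuals
leverage = influence.hat_matrix_diag            # diagonal of the hat matrix

n, p = int(fit.nobs), int(fit.df_model)         # p counts predictors without intercept
outliers = np.flatnonzero(np.abs(student) > 3)  # rule of thumb: |t_i| > 3
high_leverage = np.flatnonzero(leverage > 2 * (p + 1) / n)  # assumed cutoff

print("possible outliers:", outliers)
print("high leverage observations:", high_leverage)
```

Applied to the fitted model from E, the same calls give the indices of any observations worth inspecting individually.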