#### Introduction to Statistical Learning, Exercise 3.5

__Please do yourself a favour and only look at the solutions after you honestly tried to solve the exercises.__

# The Collinearity Problem

We will simulate a data set that has high collinearity among predictors and investigate it with the help of our knowledge of the simulated truth.

We will use `numpy`'s random generator facilities for the simulation. If you want reproducible results you should set the random seed explicitly like this at the beginning (the actual seed value does not matter):

```python
np.random.seed(seed=123)
```

 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

### A. Simulated Data Set

Execute the following commands:

```python
np.random.seed(123)
x1 = np.random.uniform(size=100)
x2 = 0.5 * x1 + np.random.normal(size=100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + np.random.normal(size=100)
```

The last line creates a linear model in which $y$ is a function of $x_1$ and $x_2$. Write out the form of the linear model. What are the regression coefficients?


In [None]:
np.random.seed(123)
x1 = np.random.uniform(size=100)
x2 = 0.5 * x1 + np.random.normal(size=100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + np.random.normal(size=100)

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon $$

$$ \beta_0 = 2, \beta_1 = 2, \beta_2 = 0.3 $$


### B. Predictor Correlation

What is the correlation between $x_1$ and $x_2$? Create a scatter plot showing the relationship between the two variables.

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
ax = sns.scatterplot(x=x1, y=x2)
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
plt.show()

The scatter plot shows a strong, positive and approximately linear relationship between $x_1$ and $x_2$.
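The question also asks for the correlation itself. From the simulation recipe, the population value is $\frac{0.5/12}{\sqrt{(1/12)(0.25/12 + 0.01)}} \approx 0.82$, and the sample value should be close to that. The following is a minimal sketch of how the sample correlation could be computed with `numpy` alone. It assumes that a local `np.random.RandomState(123)` reproduces the same random stream as `np.random.seed(123)`, so that it does not disturb the notebook's global random state.

```python
import numpy as np

# Regenerate the simulated predictors with a local generator; RandomState(123)
# yields the same stream as np.random.seed(123) without touching global state.
rng = np.random.RandomState(123)
x1 = rng.uniform(size=100)
x2 = 0.5 * x1 + rng.normal(size=100) / 10

# Sample Pearson correlation: off-diagonal entry of the 2x2 correlation matrix.
r = np.corrcoef(x1, x2)[0, 1]
print(f"corr(x1, x2) = {r:.3f}")
```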

### C. Regression on Full Model

First create a data frame from `y`, `x1` and `x2`. Then fit a linear regression model with `y` as the response and `x1` and `x2` as the predictors.

Describe the results. What are $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$? Can you reject the null hypothesis $H_0 : \beta_1 = 0$? How about the null hypothesis $H_0 : \beta_2 = 0$?

In [None]:
data = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2})
lm = smf.ols('y~x1+x2', data).fit()
lm.summary()

According to the $F$-statistic, the model as a whole is significant.

In [None]:
lm.params

The estimated parameters are:

$$ \hat{\beta}_0 = 1.79, \hat{\beta}_1 = 2.02, \hat{\beta}_2 = 0.54 $$

The parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ are reasonably close to the true parameters, but $\hat{\beta}_2$ deviates from the truth and its standard error is large.

The $p$-value for $\beta_1$ is small and we can reject the null hypothesis $H_0: \beta_1 = 0$. We can *not* reject the null hypothesis $H_0: \beta_2 = 0$ due to the large $p$-value.
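A large standard error is the typical symptom of collinearity. One common way to quantify it, which this notebook does not compute, is the variance inflation factor, $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors. The sketch below is an illustration under the assumption that only `x1` and `x2` are in the model, in which case $R_j^2$ is simply their squared correlation.

```python
import numpy as np

rng = np.random.RandomState(123)
x1 = rng.uniform(size=100)
x2 = 0.5 * x1 + rng.normal(size=100) / 10

# With exactly two predictors, the R^2 from regressing one on the other
# equals their squared correlation, so both predictors share the same VIF.
r = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - r ** 2)

# Standard errors scale with sqrt(VIF) relative to uncorrelated predictors.
print(f"VIF = {vif:.2f}, standard-error inflation = {np.sqrt(vif):.2f}")
```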

### D. Regression on First Predictor

Now fit a model with `y` as the response and only `x1` as the predictor. Comment on the results. Can you reject the null hypothesis $H_0 : \beta_1 = 0$?

In [None]:
lm1 = smf.ols('y~x1', data).fit()
lm1.summary()

The fit has improved according to the $F$-statistic. The $p$-value for $\beta_1$ is virtually zero and we can reject the null hypothesis $H_0: \beta_1 = 0$.

### E. Regression on Second Predictor

Now fit a model with `y` as the response and only `x2` as the predictor. Comment on the results. Can you reject the null hypothesis $H_0 : \beta_2 = 0$?

In [None]:
lm2 = smf.ols('y~x2', data).fit()
lm2.summary()

The fit has improved according to the $F$-statistic, but not as much as when predicting on `x1` only. The $p$-value for $\beta_2$ is virtually zero and we can reject the null hypothesis $H_0: \beta_2 = 0$.

### F. Comparing Results

Do the results obtained in C to E contradict each other? Explain your answer.

No, they don't contradict each other. The reason we cannot reject $H_0: \beta_2 = 0$ in the full fit is that $x_2$ has a strong linear correlation with $x_1$. That is, once $x_1$ is in the model, $x_2$ adds hardly any extra information. In the single-predictor fits, each predictor carries nearly all of the information on its own (again due to the strong linear correlation), and we can therefore reject the null hypothesis in both cases.
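One way to check this argument numerically is to compare $R^2$ across the three fits: if $x_2$ adds little beyond $x_1$, the full model should barely improve on `y~x1`. The sketch below uses plain `numpy` least squares rather than `statsmodels` so that it runs on its own. The helper `r_squared` is a hypothetical name introduced here for illustration.

```python
import numpy as np

rng = np.random.RandomState(123)
x1 = rng.uniform(size=100)
x2 = 0.5 * x1 + rng.normal(size=100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + rng.normal(size=100)


def r_squared(y, *predictors):
    """R^2 of an ordinary least squares fit of y on an intercept plus predictors."""
    X = np.column_stack([np.ones_like(y), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss


r2_full = r_squared(y, x1, x2)
r2_x1 = r_squared(y, x1)
r2_x2 = r_squared(y, x2)
print(f"R^2  y~x1+x2: {r2_full:.3f}   y~x1: {r2_x1:.3f}   y~x2: {r2_x2:.3f}")
```

Because the models are nested, the full model's $R^2$ can never be lower than that of either single-predictor model. What matters here is how small the gain is.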

### G. Adding a Bad Measurement

Now suppose we obtain one additional measurement, which was unfortunately wrongly measured.

```python
data.loc[data.shape[0]] = [6, 0.1, 0.8]
```
Re-fit the three linear models from C to E using this new data set. What effect does the bad observation have on each of the models? Is the new observation an outlier in any of the models? A high-leverage point? Both? Explain your answers.  

In [None]:
data.loc[data.shape[0]] = [6, 0.1, 0.8]
data.tail()

In [None]:
lm = smf.ols('y~x1+x2', data).fit()
lm.summary().tables[1]

Judging from the $p$-values, we can now clearly reject the null hypothesis for $\beta_2$, but only barely for $\beta_1$. However, the estimated parameters are far from the true values: the single bad measurement has a large impact on the results.

In [None]:
fig = lmplots.plot(lm)
plt.show()

The bad measurement is among the outliers, but it is not extreme compared to the other outliers. It does, however, have extremely high leverage, lying outside the 0.5 Cook's distance contour. This is a clear indication that something is wrong with this observation.
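The quantities read off the diagnostic plot can also be computed directly. The sketch below is a `numpy`-only illustration, not the method used by `lmplots`. It uses the standard definitions: leverage $h_i$ is the $i$-th diagonal entry of the hat matrix $H = X (X^T X)^{-1} X^T$, and Cook's distance is $D_i = \frac{e_i^2}{p\, s^2} \cdot \frac{h_i}{(1 - h_i)^2}$, with residuals $e_i$, $p$ coefficients and residual variance estimate $s^2$.

```python
import numpy as np

rng = np.random.RandomState(123)
x1 = rng.uniform(size=100)
x2 = 0.5 * x1 + rng.normal(size=100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + rng.normal(size=100)

# Append the mismeasured observation (y, x1, x2) = (6, 0.1, 0.8) as index 100.
y = np.append(y, 6.0)
x1 = np.append(x1, 0.1)
x2 = np.append(x2, 0.8)

X = np.column_stack([np.ones_like(y), x1, x2])
n, p = X.shape

# Leverage: diagonal of the hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Residuals and the unbiased estimate of the error variance.
resid = y - H @ y
s2 = resid @ resid / (n - p)

# Cook's distance combines residual size and leverage.
cooks = resid ** 2 / (p * s2) * leverage / (1 - leverage) ** 2

print(f"leverage[100] = {leverage[100]:.3f} (mean leverage = {p / n:.3f})")
print(f"Cook's distance[100] = {cooks[100]:.3f}")
```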

In [None]:
lm1 = smf.ols('y~x1', data).fit()
lm1.summary().tables[1]

The $p$-value for $\beta_1$ is virtually zero, so we can reject $H_0: \beta_1 = 0$. The values of the parameters are close to the true values.

In [None]:
fig = lmplots.plot(lm1)
plt.show()

In this fit the new observation is the most extreme outlier. It also has high leverage, but lies well inside the 0.5 Cook's distance contour. We would still conclude that something is wrong with this observation, but observation 71 looks equally bad.

In [None]:
lm2 = smf.ols('y~x2', data).fit()
lm2.summary().tables[1]

The $p$-value for $\beta_2$ is virtually zero, but the fitted parameter value is far from the truth.

In [None]:
fig = lmplots.plot(lm2, annotations=5)
plt.show()

The new observation is not among the outliers in the `y~x2` fit. But the leverage is very high, approaching the 0.5 Cook's distance contour.

In summary, the bad observation affects the full `y~x1+x2` fit most. This is because it does not follow the true correlation between $x_2$ and $x_1$ at all. This can also be seen in the $x_2$ versus $x_1$ scatter plot.

In [None]:
ax = sns.scatterplot(x='x1', y='x2', data=data)
# Highlight the bad observation (row 100) on the same axes
x1_new, x2_new = data.iloc[100]['x1'], data.iloc[100]['x2']
ax = sns.scatterplot(x=[x1_new], y=[x2_new], ax=ax)
ax.annotate('100', (x1_new, x2_new))
plt.show()
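The summary above can be backed up with numbers. Leverage depends only on the predictors, not on `y`, so the bad observation's leverage in each of the three models can be compared without refitting anything. The sketch below is a `numpy`-only illustration. The helper `leverage` and the dictionary `lev` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.RandomState(123)
x1 = rng.uniform(size=100)
x2 = 0.5 * x1 + rng.normal(size=100) / 10

# Append the predictor values of the mismeasured observation (index 100).
x1 = np.append(x1, 0.1)
x2 = np.append(x2, 0.8)


def leverage(*predictors):
    """Diagonal of the hat matrix for a model with an intercept."""
    X = np.column_stack([np.ones(len(predictors[0])), *predictors])
    return np.diag(X @ np.linalg.solve(X.T @ X, X.T))


models = {"y~x1+x2": (x1, x2), "y~x1": (x1,), "y~x2": (x2,)}
lev = {}
for name, cols in models.items():
    h = leverage(*cols)
    lev[name] = h[100]
    # Leverages sum to the number of coefficients, so their mean is p / n.
    mean_h = (len(cols) + 1) / len(h)
    print(f"{name:8s} leverage[100] = {h[100]:.3f} ({h[100] / mean_h:.1f}x the mean)")
```

Taken separately, $x_1 = 0.1$ is an unremarkable value and $x_2 = 0.8$ is merely large. It is the combination of the two that is far from the rest of the data, which is why the leverage should be highest in the full model.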