#### Introduction to Statistical Learning, Exercise 3.4

__Please do yourself a favour and only look at the solutions after you have honestly tried to solve the exercises.__

# Simulated Data

We will simulate a data set and investigate how things change with varying parameters of the simulation. This allows us to evaluate our models against the simulation truth.

We will use `numpy`'s random generator facilities for the simulation. If you want reproducible results, set the random seed explicitly at the beginning, like this (the actual seed value does not matter):

```python
np.random.seed(seed=123)
```
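
As an aside, NumPy also provides a `Generator` interface through `np.random.default_rng()`, which keeps the random state in a local object instead of a global one. The following is only a minimal sketch of that alternative; the names `rng`, `rng_again`, `x_demo` and `x_again` are made up for illustration, and the draws differ from those produced by `np.random.seed()`, which the rest of this notebook uses.

```python
import numpy as np

# Local generator object; it does not touch NumPy's global random state
rng = np.random.default_rng(seed=123)
x_demo = rng.normal(loc=0.0, scale=1.0, size=100)

# A second generator created with the same seed reproduces the same draws
rng_again = np.random.default_rng(seed=123)
x_again = rng_again.normal(loc=0.0, scale=1.0, size=100)

print(np.array_equal(x_demo, x_again))  # True
```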

 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from islpy import datasets, utils, lmplots
sns.set()
%matplotlib inline

### A. Simulate the Feature

Using the `np.random.normal()` function, create a vector `x` containing 100 observations drawn from a $N(0, 1)$ distribution. This represents the feature $X$.

In [None]:
np.random.seed(seed=123)  # we only need to do this once at the start
x = np.random.normal(size=100)
x

### B. Simulate the Noise

Using the `np.random.normal()` function, create a vector `eps` containing 100 observations drawn from a $N(0, 0.25)$ distribution. That is, a normal distribution with mean 0 and standard deviation 0.25. This represents the noise $\epsilon$.

In [None]:
eps = np.random.normal(0, 0.25, size=100)
eps
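
The second argument of `np.random.normal()` is `scale`, which is the standard deviation and not the variance. A quick way to convince yourself is to draw a much larger sample and look at its summary statistics. The sketch below assumes a sample size of 100,000, chosen only for this check, and uses a local generator (the names `rng_check` and `eps_check` are made up here) so that the seeded global state of the notebook is left alone.

```python
import numpy as np

# Local generator so the notebook's global seed is not disturbed
rng_check = np.random.default_rng(seed=123)

# Illustrative larger sample; the exercise itself uses size=100
eps_check = rng_check.normal(loc=0, scale=0.25, size=100_000)

print(eps_check.mean())       # should be close to 0
print(eps_check.std(ddof=1))  # should be close to 0.25 (standard deviation)
print(eps_check.var(ddof=1))  # should be close to 0.25**2 = 0.0625 (variance)
```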

### C. Simulate the Response

Using the vectors `x` and `eps`, generate a vector `y` according to the following model:

$$ Y = -1 + 0.5 X + \epsilon $$

This represents the response $Y$. What is the length of the vector `y`? What are the values of $\beta_0$ and $\beta_1$ in this model?

In [None]:
y = -1 + 0.5 * x + eps 
y

In [None]:
y.size, y.shape[0]

$\beta_0 = -1$ and $\beta_1 = 0.5$

### D. Plot the y-x Relationship

Create a scatter plot of `y` versus `x`. Comment on what you observe.

In [None]:
ax = sns.scatterplot(x=x, y=y)  # recent seaborn versions require keyword arguments

There is clearly a linear relationship in the simulated data. The simulated noise is also clearly visible.

### E. Linear Model Fit

Fit a least squares model to the data with `y` as the response and `x` as the predictor. Comment on what you observe. How do the estimated parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ compare to the true values $\beta_0$ and $\beta_1$?

Hint: you can do this directly with `sm.OLS()` but creating a data frame first and using `smf.ols()` might be more convenient.

In [None]:
data = pd.DataFrame({'y': y, 'x':x})
data.head()

In [None]:
lm = smf.ols('y~x', data).fit()
lm.summary()

The model fits the data very well. The estimated parameters are close to the true parameters.
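
One way to make "close" more concrete is to put the estimates next to the true values and the 95% confidence intervals, and check whether the intervals contain the truth. The following is a minimal sketch that re-creates the simulated data so it runs on its own; the table name `comparison` and its column labels are chosen here for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Re-create the simulated data from parts A to C
np.random.seed(seed=123)
x = np.random.normal(size=100)
eps = np.random.normal(0, 0.25, size=100)
data = pd.DataFrame({'y': -1 + 0.5 * x + eps, 'x': x})

lm = smf.ols('y ~ x', data).fit()

# 95% confidence intervals, one row per parameter (Intercept, x)
comparison = lm.conf_int(alpha=0.05)
comparison.columns = ['ci_lower', 'ci_upper']

# Add the estimates and the simulation truth as leading columns
comparison.insert(0, 'estimate', lm.params)
comparison.insert(0, 'truth', [-1.0, 0.5])  # order: Intercept, x
print(comparison)
```

If a true value lies inside its interval, the estimate is consistent with the simulation truth at the 5% level.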

### F. Plotting Estimate & Truth

Make a scatter plot of `y` versus `x` and overlay the *population regression line* and the *least squares line*. Create an appropriate legend by using labels and the `legend()` function. 

In [None]:
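ax = sns.scatterplot(x='x', y='y', data=data)
# use a separate grid for the lines so the simulated `x` is not overwritten
x_grid = np.linspace(*ax.get_xlim(), 10)
y_true = -1 + 0.5 * x_grid  # population regression line
y_fit = lm.predict({'x': x_grid})  # least squares line
ax.plot(x_grid, y_true, label='truth')
ax.plot(x_grid, y_fit, color='C1', label='fit')
ax.legend()
plt.show()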

### G. Quadratic Term

Now fit a model that uses `x` and `x`$^2$ as the predictors. Does the quadratic term improve the model fit? Explain your answer.

In [None]:
lm2 = smf.ols('y~x+I(x**2)', data).fit()
lm2.summary()

No, the quadratic term does *not* improve the model fit. The $F$-statistic is lower than for the purely linear model, and the $p$-value of the quadratic term is very high, so there is no evidence that its coefficient differs from zero. This is of course expected, because the data has a truly linear relationship.
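
The two models are nested, so they can also be compared directly with an $F$-test. The sketch below is one possible way to do this with the `compare_f_test()` method of the fitted results; it re-creates the simulated data so it runs on its own. With a single added term, the $p$-value of this test equals the $p$-value of the quadratic coefficient in the summary.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Re-create the simulated data from parts A to C
np.random.seed(seed=123)
x = np.random.normal(size=100)
eps = np.random.normal(0, 0.25, size=100)
data = pd.DataFrame({'y': -1 + 0.5 * x + eps, 'x': x})

lm = smf.ols('y ~ x', data).fit()              # linear model
lm2 = smf.ols('y ~ x + I(x**2)', data).fit()   # linear plus quadratic term

# Adjusted R^2 penalises the extra parameter of the larger model
print(lm.rsquared_adj, lm2.rsquared_adj)

# F-test of the larger model against the restricted (linear) model
f_value, p_value, df_diff = lm2.compare_f_test(lm)
print(f_value, p_value, df_diff)
```

A large `p_value` means the data give no reason to prefer the quadratic model over the linear one.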