# Preparing the environment

In [None]:
# import all needed packages
import numpy as np
import pandas as pd

import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.stats.outliers_influence \
     import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm

from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)

# A) Case study
(see textbook Exercise 3.7.9)

*Information on the dataset*

Gas mileage, horsepower, and other information for 392 vehicles.

A data frame with 392 observations on the following 9 variables.

- `mpg`: miles per gallon
- `cylinders`: Number of cylinders between 4 and 8
- `displacement`: Engine displacement (cu. inches)
- `horsepower`: Engine horsepower
- `weight`: Vehicle weight (lbs.)
- `acceleration`: Time to accelerate from 0 to 60 mph (sec.)
- `year`: Model year (modulo 100)
- `origin`: Origin of car (1. American, 2. European, 3. Japanese)
- `name`: Vehicle name

In [None]:
# run this cell to load the data
Auto = load_data('Auto')
Auto

## Task A.1
Produce a scatterplot matrix which includes all of the variables in the data set.

In [None]:
fig, ax = plt.subplots(figsize=(15,15))
pd.plotting.scatter_matrix(Auto, ax=ax);

The scatterplots suggest that nearly all of the plotted variables are related to `mpg`, in several cases nonlinearly (for example `displacement`, `horsepower`, and `weight`). A regression model that includes all available predictors is therefore a reasonable starting point.

## Task A.2
Compute the matrix of correlations between the variables using the `DataFrame.corr()` method.

In [None]:
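# numeric_only=True skips the non-numeric `name` column;
# without it, recent pandas versions raise an error on string columns
corrmat = Auto.corr(numeric_only=True)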
corrmat

In [None]:
# Additional possibility: visualize the correlation matrix as a heatmap
# (seaborn was already imported as sns at the top of the notebook)
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(corrmat, ax=ax);

We observe that several of our potential predictors are highly correlated with each other, e.g. `displacement` and `weight`, `displacement` and `cylinders`, and `displacement` and `horsepower`. It might be a good idea to investigate this in more detail later.
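One common way to follow up on correlated predictors is the variance inflation factor (VIF), which regresses each predictor on all the others and reports $1/(1-R^2)$. The block below is a minimal sketch on synthetic stand-in data, not on the `Auto` data; the helper `vif` and the variables `a`, `b`, `c` are hypothetical names chosen for illustration. The `variance_inflation_factor` function imported at the top of the notebook serves the same purpose.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic data (assumption, for illustration only):
# b is almost a multiple of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], vifs.round(2))))
```

In this sketch the two nearly collinear columns receive very large values, while the independent column stays close to 1. A frequently quoted rule of thumb treats values above 5 or 10 as a sign of problematic collinearity.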

## Task A.3
Use the `sm.OLS()` function to perform a multiple linear regression with `mpg` as the response and all other variables except `name` as the predictors. Use the `summarize()` function to print the results. Comment on the output. For instance:

1. Is there a relationship between the predictors and the response? Use the `summary()` method of the fitted statsmodels results object to answer this question.
2. Which predictors appear to have a statistically significant relationship to the response?
3. What does the coefficient for the `year` variable suggest?

In [None]:
# change dtype of origin column to make
# sure that ModelSpec recognizes origin as
# qualitative variable
Auto_new = Auto.astype({'origin' : 'category'})

In [None]:
Auto.dtypes

In [None]:
Auto_new.dtypes

In [None]:
# create design matrix
predictors = Auto_new.columns.drop(['name','mpg'])
design = MS(predictors).fit(Auto_new)

In [None]:
X = design.transform(Auto_new)
X
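To see why the dtype of `origin` matters, the sketch below dummy-codes an integer-coded category by hand with `pd.get_dummies()`. It uses a hypothetical four-row toy frame rather than the `Auto` data, and the column names that `ModelSpec` produces may differ from the ones shown here.

```python
import pandas as pd

# Hypothetical toy frame (not the Auto data) with an integer-coded category
toy = pd.DataFrame({"weight": [3504, 2130, 2372, 4341],
                    "origin": [1, 2, 3, 1]})

# Left as an integer, origin would enter the model as a single column with
# one slope, which imposes an ordered, equally spaced effect on the levels.
toy = toy.astype({"origin": "category"})

# Dummy coding: one indicator column per level, except the baseline level 1
dummies = pd.get_dummies(toy, columns=["origin"], drop_first=True, dtype=float)
dummies.insert(0, "intercept", 1.0)
print(dummies)
```

Each non-baseline level gets its own coefficient, interpreted as the shift in the response relative to the baseline level.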

In [None]:
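# for comparison: build a design matrix by hand from the original data frame
# (here origin keeps its numeric dtype, so it is NOT dummy-coded)
X_manual = Auto.drop(columns=['mpg','name'])
X_manual['intercept'] = np.ones(Auto.shape[0])
X_manual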

In [None]:
X_manual.dtypes

In [None]:
y = Auto.mpg
model = sm.OLS(y, X)
results = model.fit()
summarize(results)

*Please document your observations here*.

## Task A.4
Produce some diagnostic plots of the linear regression fit as described in the lab. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
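As background for the leverage question, the following sketch computes leverage values (the diagonal of the hat matrix) and internally studentized residuals with plain NumPy. It runs on synthetic data with one deliberately extreme point, not on the `Auto` fit, and all variable names are illustrative assumptions. For a fitted statsmodels model, `results.get_influence()` exposes comparable quantities.

```python
import numpy as np

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 1, size=n)
y = 1 + 2 * x + rng.normal(scale=0.3, size=n)

# Append one point far away from the other x values and far off the line
x = np.append(x, 3.0)
y = np.append(y, 0.0)

# Least squares fit with an intercept
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Leverage: diagonal of the hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Internally studentized residuals
p = X.shape[1]
sigma2 = resid @ resid / (len(y) - p)
student = resid / np.sqrt(sigma2 * (1 - leverage))

print("index of largest leverage:", leverage.argmax())
print("largest leverage:", leverage.max().round(3))
print("average leverage p/n:", round(p / len(y), 3))
```

The leverage values always sum to the number of columns in the design matrix, so their average is $p/n$; observations with leverage far above that average deserve a closer look.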

In [None]:
# your code here

## Task A.5
Fit some models with interactions as described in the lab. Do any interactions appear to be statistically significant?

In [None]:
# your code here

*Please document your observations here*.

## Task A.6
Try a few different transformations of the variables, such as $\log(X)$, $\sqrt{X}$, $X^2$. Comment on your findings.
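One simple way to compare transformations is to fit the same response against each transformed predictor and compare $R^2$. The sketch below does this with NumPy on synthetic data whose true relationship is logarithmic; the helper `r_squared` and the data are assumptions made for illustration and say nothing about which transformation suits the `Auto` variables.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple least squares fit of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Synthetic data (assumption): y depends on log(x) plus noise
rng = np.random.default_rng(2)
x = rng.uniform(1, 100, size=200)
y = 5 + 3 * np.log(x) + rng.normal(scale=0.5, size=200)

fits = {
    "X": r_squared(x, y),
    "log(X)": r_squared(np.log(x), y),
    "sqrt(X)": r_squared(np.sqrt(x), y),
    "X^2": r_squared(x ** 2, y),
}
for name, r2 in fits.items():
    print(f"{name:8s} R^2 = {r2:.3f}")
```

$R^2$ alone is a coarse criterion; residual plots of the transformed fits usually give a clearer picture of whether the nonlinearity has been removed.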

In [None]:
# your code here

*Please document your observations here.*

# B) Simulation study
(see textbook Exercise 3.7.14)

This problem focuses on the *collinearity* problem.

## Task B.1
Run the cell below:

In [None]:
rng = np.random.default_rng(10)
x1 = rng.uniform(0, 1, size=100)
x2 = 0.5 * x1 + rng.normal(size=100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + rng.normal(size=100)

The last line corresponds to creating a linear model in which `y` is a function of `x1` and `x2`. Write out the form of the linear model. What are the regression coefficients?

*Your answer here*

## Task B.2
What is the correlation between `x1` and `x2`? Create a scatterplot displaying the relationship between the variables.

In [None]:
# your code here

## Task B.3
Using this data, fit a least squares regression to predict `y` using `x1` and `x2`. Describe the results obtained. What are $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$? How do these relate to the true $\beta_0$, $\beta_1$, and $\beta_2$? Can you reject the null hypothesis $H_0 : \beta_1 = 0$? How about the null hypothesis $H_0 : \beta_2 = 0$?

In [None]:
# your code here

*Your comments and observations here*

## Task B.4
Now fit a least squares regression to predict `y` using only `x1`.

Comment on your results. Can you reject the null hypothesis
$H_0 : \beta_1 = 0$?

In [None]:
# your code here

*Your comments and observations here*

## Task B.5
Now fit a least squares regression to predict `y` using only `x2`.

Comment on your results. Can you reject the null hypothesis
$H_0 : \beta_1 = 0$, where $\beta_1$ now denotes the coefficient of `x2` in this single-predictor model?

In [None]:
# your code here

*Your comments and observations here*.

## Task B.6
Do the results obtained in Tasks B.3 to B.5 contradict each other? Explain your answer.

*Your answer here*

## Task B.7
Suppose we obtain one additional observation, which was unfortunately mismeasured. We use the function `np.concatenate()` to add this additional observation to each of `x1`, `x2` and `y`.

In [None]:
x1 = np.concatenate([x1, [0.1]])
x2 = np.concatenate([x2, [0.8]])
y = np.concatenate([y, [6]])

Re-fit the linear models from Task B.3 to Task B.5 using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.

In [None]:
# your code here

*Your answer here*