# Preparing the environment

In [None]:
# import all needed packages
import numpy as np
import pandas as pd

import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.stats.outliers_influence \
     import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm

from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)

# A) Case study
(see textbook Exercise 3.7.9)

*Information on the dataset*

Gas mileage, horsepower, and other information for 392 vehicles.

A data frame with 392 observations on the following 9 variables.

- `mpg`: miles per gallon
- `cylinders`: Number of cylinders between 4 and 8
- `displacement`: Engine displacement (cu. inches)
- `horsepower`: Engine horsepower
- `weight`: Vehicle weight (lbs.)
- `acceleration`: Time to accelerate from 0 to 60 mph (sec.)
- `year`: Model year (modulo 100)
- `origin`: Origin of car (1. American, 2. European, 3. Japanese)
- `name`: Vehicle name

In [None]:
# run this cell to load the data
Auto = load_data('Auto')
Auto

## Task 1
Produce a scatterplot matrix which includes all of the variables in the data set.

In [None]:
fig, ax = plt.subplots(figsize=(15,15))
pd.plotting.scatter_matrix(Auto, ax=ax);

In the scatterplot matrix, every plotted variable appears to be related to `mpg` (note that `scatter_matrix` only plots numeric columns, so `name` is left out). This suggests that a regression model using all available predictors is a reasonable starting point.

## Task 2
Compute the matrix of correlations between the variables using the `DataFrame.corr()` method.

In [None]:
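# `name` is not numeric, so restrict to the numeric columns before computing correlations
corrmat = Auto.select_dtypes(include='number').corr()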
corrmat

In [None]:
# Additional possibility: visualization of correlation matrix using a heatmap
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(corrmat, ax=ax);

We observe that several of our potential predictor variables are highly correlated with each other, e.g. `displacement` and `weight`, `displacement` and `cylinders`, and `displacement` and `horsepower`. Such collinearity can make individual coefficient estimates unstable, so we examine it in more detail with the VIF table in Task 4.

## Task 3
Use the `sm.OLS()` function to perform a multiple linear regression with `mpg` as the response and all other variables except `name` as the predictors. Use the `summarize()` function to print the results. Comment on the output. For instance:

1. Is there a relationship between the predictors and the response? Use the `summary()` function from statsmodels to answer this question.
2. Which predictors appear to have a statistically significant relationship to the response?
3. What does the coefficient for the `year` variable suggest?

In [None]:
# change the dtype of the origin column so that
# ModelSpec treats origin as a qualitative
# variable rather than a numeric one
Auto_new = Auto.astype({'origin' : 'category'})
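
To see why the dtype matters, the sketch below shows on a small made-up frame (`toy` is hypothetical, not part of the exercise) how a categorical column can be expanded into dummy variables. It uses `pd.get_dummies` as a stand-in; `ModelSpec` is expected to do something comparable for categorical columns, but its exact column naming may differ.

```python
import pandas as pd

# hypothetical toy frame with an integer-coded group variable
toy = pd.DataFrame({'origin': [1, 2, 3, 1, 2]})

# as a number, origin would enter a regression as a single slope,
# which implies that the codes 1, 2, 3 are equally spaced on a scale
print(toy['origin'].mean())

# as a category, each level (except a baseline) gets its own indicator column
toy_cat = toy.astype({'origin': 'category'})
dummies = pd.get_dummies(toy_cat, columns=['origin'], drop_first=True)
print(dummies)
```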

In [None]:
Auto.dtypes

In [None]:
Auto_new.dtypes

In [None]:
# create design matrix
predictors = Auto_new.columns.drop(['name','mpg'])
design = MS(predictors).fit(Auto_new)

In [None]:
X = design.transform(Auto_new)
X

In [None]:
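# alternative: build the design matrix by hand
# (note: origin stays numeric here, so it is NOT treated as qualitative)
X_manual = Auto.drop(columns=['mpg','name'])
X_manual['intercept'] = np.ones(Auto.shape[0])
X_manual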

In [None]:
X_manual.dtypes

In [None]:
y = Auto.mpg
model = sm.OLS(y, X)
results = model.fit()
summarize(results)

### Analysis of the model
1. Is there a relationship between the predictors and the response? Use the `summary()` function from statsmodels to answer this question.

In [None]:
# your code here

*Your interpretation here*

2. Which predictors appear to have a statistically significant relationship to the response?

*Your interpretation here*

3. What does the coefficient for the `year` variable suggest?

*Your interpretation here*

## Task 4
Produce some diagnostic plots of the linear regression fit as described in the lab. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

### Residual plot

In [None]:
# your code here

*Your interpretation here*

### Outlier analysis

In [None]:
# your code here

*Your interpretation here*

### Influential observations

In [None]:
# your code here

*Your interpretation here*

### VIF table

In [None]:
# your code here

*Your interpretation here*

## Task 5
Try a few different transformations of the variables, such as $\log(X)$, $\sqrt{X}$, $X^2$. Comment on your findings.

In [None]:
# Your code here

*Please document your observations here.*