### Required Assignment 8.2: Comparing Complexity and Variance

**Expected Time: 60 Minutes**

**Total Points: 35**

In this activity, you will explore the effect of model complexity on the variance in predictions. Continuing with the automotive data, you will build models on a subset of 10 vehicles. You will then evaluate each model's error on the entire dataset and investigate how the variance of that error changes with model complexity.

#### Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)


In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import plotly.express as px

In [None]:
auto = pd.read_csv('data/auto.csv')

In [None]:
auto.head()

### The Sample

Below, a sample of ten vehicles is extracted from the data. This sample forms our **training** data and is split into `X_train` and `y_train`. Build your models on this smaller dataset, then evaluate their performance on the entire dataset.

In [None]:
X = auto.loc[:,['horsepower']]
y = auto['mpg']
sample = auto.sample(10, random_state = 22)
X_train = sample.loc[:, ['horsepower']]
y_train = sample['mpg']

In [None]:
X_train

In [None]:
y_train

In [None]:
X.shape

[Back to top](#Index:) 

### Problem 1

#### Iterate on Models

**20 Points**

Complete the code below according to the following instructions:

- Check that `X` holds the `horsepower` column of `auto`; it was assigned in the sample cell above.
- Check that `y` holds the `mpg` column of `auto`; it was also assigned in the sample cell above.

Use a `for` loop to loop over the values from one to ten. For each iteration `i`:

- Use `Pipeline` to create a pipeline object. Inside the pipeline object, define a tuple where the first element is the string identifier `'quad_features'` and the second element is an instance of `PolynomialFeatures` of degree `i` with `include_bias = False`. Inside the pipeline, define another tuple where the first element is the string identifier `'quad_model'` and the second element is an instance of `LinearRegression`. Assign the pipeline object to the variable `pipe`.
- Use the `fit` function on `pipe` to train your model on `X_train` and `y_train`. The fitted pipeline does not need to be assigned to a new variable.
- Use the `predict` function on `pipe` to predict `mpg` for every vehicle in `X`, the full dataset. Assign the result to `preds`.
- Store `preds` in the `model_predictions` dictionary under the key for degree `i`, that is, `f'degree_{i}'`.
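
The steps above can be sketched on a small synthetic dataset. This is only an illustration under stated assumptions: `X_toy`, `y_toy`, `toy_pipe`, and the step names `poly_features` and `linear_model` are hypothetical and are not part of the assignment. The toy target is exactly quadratic, so the degree-2 pipeline should reproduce it almost perfectly while the degree-1 pipeline cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data: y = x^2 exactly (not the auto dataset)
X_toy = np.arange(1, 7, dtype=float).reshape(-1, 1)
y_toy = X_toy.ravel() ** 2

toy_predictions = {}
for degree in (1, 2):
    toy_pipe = Pipeline([
        ('poly_features', PolynomialFeatures(degree=degree, include_bias=False)),
        ('linear_model', LinearRegression()),
    ])
    # fit trains the pipeline in place; predictions come from predict
    toy_pipe.fit(X_toy, y_toy)
    toy_predictions[f'degree_{degree}'] = toy_pipe.predict(X_toy)

# Largest absolute error of each model on the toy data
max_abs_error = {
    name: float(np.max(np.abs(preds - y_toy)))
    for name, preds in toy_predictions.items()
}
print(max_abs_error)
```

Keeping the feature transformation and the regression in one pipeline means a single `fit` and a single `predict` call handle both steps consistently.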

In [None]:
### GRADED

### YOUR SOLUTION HERE
model_predictions = {f'degree_{i}': None for i in range(1, 11)}

print("Starting Dictionary of Predictions\n", model_predictions)
#for 1, 2, 3, ..., 10

    #create pipeline
    
    #fit pipeline on training data
    
    #make predictions on all data
    
    #assign to model_predictions
    


### BEGIN SOLUTION
for i in range(1, 11):
    pipe = Pipeline([
        ('quad_features', PolynomialFeatures(degree = i, include_bias = False)),
        ('quad_model', LinearRegression()),
    ])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X)  # predict on all data so the rows line up with y
    model_predictions[f'degree_{i}'] = preds
### END SOLUTION

# Answer check
model_predictions['degree_1'][:10]

[Back to top](#Index:) 

### Problem 2

#### DataFrame of Predictions

**5 Points**

Use the `model_predictions` dictionary to create a DataFrame of the ten models' predictions, with one column per model. Assign your solution to `pred_df` below.
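
As a minimal sketch of the idea, the toy dictionary below uses the same key pattern as `model_predictions` but holds made-up values. The names `toy_predictions` and `toy_pred_df` are hypothetical. Passing a dictionary of equal-length arrays to `pd.DataFrame` produces one column per key and one row per observation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dictionary; the values are invented for illustration
toy_predictions = {
    'degree_1': np.array([20.0, 25.0, 30.0]),
    'degree_2': np.array([21.0, 24.5, 31.0]),
}

# Each key becomes a column; each array position becomes a row
toy_pred_df = pd.DataFrame(toy_predictions)
print(toy_pred_df.shape)
print(list(toy_pred_df.columns))
```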

In [None]:
### GRADED

### YOUR SOLUTION HERE
pred_df = ''
    


### BEGIN SOLUTION
pred_df = pd.DataFrame(model_predictions)
### END SOLUTION

# Answer check
print(type(pred_df))
print(pred_df.head())

[Back to top](#Index:) 

### Problem 3

#### DataFrame of Errors

**5 Points**

Now, determine the error for each model and create a DataFrame of these errors. One way to do this is to use your prediction DataFrame's `.subtract` method with `axis = 0` to subtract `y` from each column. Assign the DataFrame of errors to `error_df` below.
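
The following sketch shows how `.subtract` behaves with `axis = 0` on made-up numbers; `toy_pred_df`, `toy_y`, and `toy_error_df` are hypothetical names. With `axis = 0`, pandas matches the Series to the DataFrame by row label, so the predictions and the true values need the same index for every row to receive an error value.

```python
import pandas as pd

# Hypothetical predictions from two models and the matching true values
toy_pred_df = pd.DataFrame({
    'degree_1': [20.0, 25.0, 30.0],
    'degree_2': [21.0, 24.5, 31.0],
})
toy_y = pd.Series([22.0, 24.0, 29.0])

# Subtract the true value in each row from every column of predictions
toy_error_df = toy_pred_df.subtract(toy_y, axis=0)
print(toy_error_df)
```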

In [None]:
### GRADED

### YOUR SOLUTION HERE
error_df = ''
    


### BEGIN SOLUTION
error_df = pred_df.subtract(y, axis = 0)
### END SOLUTION

# Answer check
print(type(error_df))
print(error_df.head())

[Back to top](#Index:) 

### Problem 4

#### Mean and Variance of Model Errors

**5 Points**


Using the DataFrame of errors, examine the mean and variance of each model's error. Which degree model has the highest variance? Compute the variance of each column with `.var()` and assign the result to `variance_errors`. Then assign the degree with the highest variance, as an integer, to `highest_var_degree`.

HINT: Use `int(variance_errors.idxmax().split('_')[1])` to get the integer output of the degree. 
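
To see how the hint works, here is a small sketch on a made-up error table; `toy_error_df`, `toy_means`, `toy_variances`, and `toy_highest` are hypothetical names. `.var()` returns one sample variance per column (pandas uses `ddof=1` by default), `.idxmax()` returns the label of the largest entry, and splitting that label on `'_'` isolates the degree.

```python
import pandas as pd

# Hypothetical error columns; the values are invented for illustration
toy_error_df = pd.DataFrame({
    'degree_1': [-2.0, 1.0, 1.0],
    'degree_2': [-1.0, 0.5, 2.0],
})

toy_means = toy_error_df.mean()      # mean error per model
toy_variances = toy_error_df.var()   # sample variance per model (ddof=1)

# 'degree_1' -> ['degree', '1'] -> 1
toy_highest = int(toy_variances.idxmax().split('_')[1])
print(toy_means)
print(toy_variances)
print(toy_highest)
```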

In [None]:
### GRADED

### YOUR SOLUTION HERE
highest_var_degree = ''
variance_errors = ''
### BEGIN SOLUTION
variance_errors = error_df.var()
highest_var_degree = int(variance_errors.idxmax().split('_')[1])
print(f"\nThe degree with the highest variance in errors is degree {highest_var_degree}.")
### END SOLUTION

# Answer check
print(type(highest_var_degree))
print(highest_var_degree)