## Colab Activity 8.1: Adding Nonlinear Features

**Estimated time: 60 minutes**


This activity focuses on building polynomial models with `sklearn`. You will fit a standard first-degree linear regression model, then create a quadratic term similar to the `hp2` feature from Video 8.2. Using scikit-learn, you will compare the performance of the two models and determine the appropriate model complexity.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import plotly.express as px

### The Data

This exercise uses a dataset on automobiles that includes their horsepower and fuel economy. Your goal is to build a model that predicts the `mpg` column using the `horsepower` column as your model's input. Below, the dataset is loaded, and a scatterplot of `horsepower` vs. `mpg` is displayed.

In [None]:
auto = pd.read_csv('data/auto.csv')
auto

In [None]:
px.scatter(data_frame=auto, x='horsepower', y='mpg')

In [None]:
auto.info()

In [None]:
auto.head()

[Back to top](#Index:) 

## Problem 1

### Regression with `horsepower`


Complete the code below according to these instructions:

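- Assign the `horsepower` column from the `auto` DataFrame to the `X` variable. Keep `X` two-dimensional (for example, a one-column DataFrame), since sklearn estimators expect a 2-D feature array.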
- Assign the `mpg` column from the `auto` DataFrame to the `y` variable.
- Instantiate and fit a sklearn `LinearRegression` model to predict `mpg` using the `horsepower` column. Assign this model to the variable `first_degree_model` below.  
- Calculate the mean squared error between `first_degree_model.predict(X)` and `y` and assign it to the variable `first_degree_mse` below.
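
The steps above follow the usual fit-then-score pattern in scikit-learn. The sketch below illustrates that pattern on a small made-up dataset rather than on `auto`; the names `toy`, `X_toy`, `y_toy`, `toy_model`, and `toy_mse` are hypothetical and chosen only for this illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical toy data (not the auto dataset): y = 3x + 2 exactly.
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})
toy['y'] = 3 * toy['x'] + 2

# Double brackets return a one-column DataFrame, so the features keep
# the 2-D shape (n_samples, n_features) that sklearn expects.
X_toy = toy[['x']]
y_toy = toy['y']

# Fit the model, then score its predictions on the same data.
toy_model = LinearRegression().fit(X_toy, y_toy)
toy_mse = mean_squared_error(y_toy, toy_model.predict(X_toy))

print(toy_model.coef_, toy_model.intercept_, toy_mse)
```

Because the toy target is an exact line, the fitted slope and intercept should match it and the error should be essentially zero. Real data such as `auto` will leave a nonzero error.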

In [None]:
X = ''
y = ''
first_degree_model = ''
first_degree_mse = ''



# Answer check
print(type(first_degree_model))
print(first_degree_model.coef_)
print(first_degree_mse)

[Back to top](#Index:) 

## Problem 2

### Creating a quadratic feature

To build a second-degree (quadratic) model, first add a new column to the data by squaring the `horsepower` column. Assign these values to a new column named `hp2` below.
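
As a minimal sketch of the idea, element-wise squaring of a pandas column can be written with the `**` operator. The DataFrame `toy` and the column names `x` and `x2` are hypothetical stand-ins, not part of the activity's dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data.
toy = pd.DataFrame({'x': [2.0, 3.0, 10.0]})

# ** applies element-wise to a Series; assigning to a new key adds a column.
toy['x2'] = toy['x'] ** 2

print(toy)
print(toy.shape)  # one more column than before
```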

In [None]:


auto['hp2'] = ''


# Answer check
print(auto.shape)

[Back to top](#Index:) 

## Problem 3

### Building a quadratic model



Complete the code below according to these instructions:

- Assign the `horsepower` and `hp2` columns from the `auto` DataFrame to the `X` variable.
- Assign the `mpg` column from the `auto` DataFrame to the `y` variable.
- Instantiate a sklearn `LinearRegression` model and use its `fit` method to train the model on `X` and `y`. Assign this model to the variable `quadratic_model` below.
- Calculate the mean squared error between `quadratic_model.predict(X)` and `y` and assign it to the variable `quad_mse` below.
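
A quadratic model of this kind is still a linear regression: it is linear in its coefficients, and the curvature comes entirely from the squared feature. The sketch below shows this on made-up data generated from a known quadratic; every name in it (`toy`, `X_toy`, `toy_quad_model`, and so on) is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical toy data from a known curve: y = 1 + 2x + 0.5x^2.
toy = pd.DataFrame({'x': np.arange(6, dtype=float)})
toy['x2'] = toy['x'] ** 2
toy['y'] = 1 + 2 * toy['x'] + 0.5 * toy['x2']

# Two feature columns: the original feature and its square.
X_toy = toy[['x', 'x2']]
y_toy = toy['y']

toy_quad_model = LinearRegression().fit(X_toy, y_toy)
toy_quad_mse = mean_squared_error(y_toy, toy_quad_model.predict(X_toy))

# coef_ holds one coefficient per feature column, in column order.
print(toy_quad_model.coef_, toy_quad_model.intercept_, toy_quad_mse)
```

Since the toy target has no noise, the fit should recover the generating coefficients up to floating-point error.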

In [None]:


X = ''
y = ''
quadratic_model = ''
quad_mse = ''


# Answer check
print(quadratic_model.coef_)
print(quadratic_model.intercept_)
print(quad_mse)

[Back to top](#Index:) 

## Problem 4

### Plotting Predictions


Because the rows of `auto` are not ordered by horsepower, a line plot of `.predict(X)` would jump back and forth between points instead of tracing a smooth curve. To plot the predictions of your quadratic model correctly, use the `sort_values()` method on `auto[['horsepower', 'hp2']]` to sort both features by the `horsepower` column.

Assign this as a DataFrame to `x_for_pred` below.  

Note that the resulting DataFrame should start with:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>horsepower</th>      <th>hp2</th>    </tr>  </thead>  <tbody>    <tr>      <th>19</th>      <td>46.0</td>      <td>2116.0</td>    </tr>    <tr>      <th>101</th>      <td>46.0</td>      <td>2116.0</td>    </tr>    <tr>      <th>324</th>      <td>48.0</td>      <td>2304.0</td>    </tr>    <tr>      <th>323</th>      <td>48.0</td>      <td>2304.0</td>    </tr>    <tr>      <th>242</th>      <td>48.0</td>      <td>2304.0</td>    </tr>  </tbody></table>
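
The index labels in the table above are out of order because `sort_values()` reorders rows but keeps each row's original index label. A minimal sketch on a hypothetical three-row frame (`toy`, `toy_sorted`, `x`, and `x2` are illustrative names only):

```python
import pandas as pd

# Hypothetical unsorted toy frame with a feature and its square.
toy = pd.DataFrame({'x': [3.0, 1.0, 2.0]})
toy['x2'] = toy['x'] ** 2

# sort_values returns a new DataFrame ordered by the given column;
# the original frame is left unchanged and index labels travel with rows.
toy_sorted = toy[['x', 'x2']].sort_values('x')

print(toy_sorted)
print(type(toy_sorted))
```

Passing a frame sorted this way to a fitted model's `predict` yields predictions in increasing order of the feature, which is what a line plot needs.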

In [None]:


x_for_pred = ''


# Answer check
print(type(x_for_pred))
x_for_pred.head()

[Back to top](#Index:) 

## Problem 5

### Comparing the model performance



Reflect on the mean squared error of the two models.  Which model more closely approximated the data -- linear or quadratic?  Assign your answer as a string to `best_model` below (`linear` or `quadratic`).  
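
When comparing the two errors, it may help to keep in mind that both were computed on the same data used for fitting. For ordinary least squares, adding a feature cannot increase the training error, so a lower training MSE by itself shows a closer fit to these points rather than better predictions on new data. The sketch below illustrates this on synthetic data; the generating curve, noise level, seed, and all variable names are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (assumed for illustration): a curved trend plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(50, 200, size=100)
y = 60 - 0.5 * x + 0.0015 * x**2 + rng.normal(0, 2, size=100)

# Nested feature sets: the quadratic design contains the linear one.
X_lin = x.reshape(-1, 1)
X_quad = np.column_stack([x, x**2])

mse_lin = mean_squared_error(y, LinearRegression().fit(X_lin, y).predict(X_lin))
mse_quad = mean_squared_error(y, LinearRegression().fit(X_quad, y).predict(X_quad))

print(mse_lin, mse_quad)
```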

In [None]:


best_model = ''


# Answer check
print(best_model)