<a href="https://colab.research.google.com/github/revathys/CustomerCoupons/blob/main/colab_activity8_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Colab Activity 8.1: Adding Nonlinear Features

**Estimated time: 60 minutes**


This activity focuses on building polynomial models with `sklearn`. You will fit a standard first-degree linear regression model, then create a quadratic term similar to `hp2` from Video 8.2. Using scikit-learn, you will compare the performance of the two models and determine the appropriate model complexity.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import plotly.express as px

### The Data

For this exercise, you will use a dataset on automobiles that includes their horsepower and fuel economy. Your goal is to build a model that predicts the `mpg` column using the `horsepower` column as the model's input. Below, the dataset is loaded and a scatterplot of `horsepower` vs. `mpg` is displayed.

In [2]:
auto = pd.read_csv('data/auto.csv')
auto

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
387,27.0,4,140.0,86.0,2790,15.6,82,1,ford mustang gl
388,44.0,4,97.0,52.0,2130,24.6,82,2,vw pickup
389,32.0,4,135.0,84.0,2295,11.6,82,1,dodge rampage
390,28.0,4,120.0,79.0,2625,18.6,82,1,ford ranger


In [3]:
px.scatter(data_frame=auto, x='horsepower', y='mpg')

In [4]:
auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   year          392 non-null    int64  
 7   origin        392 non-null    int64  
 8   name          392 non-null    object 
dtypes: float64(4), int64(4), object(1)
memory usage: 27.7+ KB


In [5]:
auto.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino


[Back to top](#Index:)

## Problem 1

### Regression with `horsepower`


Complete the code below according to the following instructions:

- Assign the `horsepower` column from the `auto` DataFrame to the `X` variable.
- Assign the `mpg` column from the `auto` DataFrame to the `y` variable.
- Instantiate and fit a sklearn `LinearRegression` model to predict `mpg` using the `horsepower` column. Assign this model to the variable `first_degree_model` below.  
- Calculate the model mean squared error between `first_degree_model.predict(X)` and `y` and assign it to the variable `first_degree_mse` below.  

In [8]:
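X = auto[['horsepower']]  # double brackets keep X two-dimensional, as sklearn expects
y = auto['mpg']
first_degree_model = LinearRegression(fit_intercept=True).fit(X, y)
# mean_squared_error takes (y_true, y_pred); the value is the same in either order
first_degree_mse = mean_squared_error(y, first_degree_model.predict(X))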



# Answer check
print(type(first_degree_model))
print(first_degree_model.coef_)
print(first_degree_mse)

<class 'sklearn.linear_model._base.LinearRegression'>
[-0.15784473]
23.943662938603108
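
The mean squared error printed above is the average of the squared residuals, $\frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2$. The following is a minimal sketch that checks this against `mean_squared_error` on a small made-up dataset. The `toy_*` names and values are hypothetical and are not taken from `auto.csv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical toy data (not from auto.csv): five horsepower/mpg-like pairs.
toy_X = np.array([[50.0], [80.0], [110.0], [150.0], [200.0]])
toy_y = np.array([35.0, 28.0, 22.0, 16.0, 12.0])

toy_model = LinearRegression().fit(toy_X, toy_y)
toy_pred = toy_model.predict(toy_X)

# MSE computed by hand: the mean of the squared residuals.
manual_mse = np.mean((toy_y - toy_pred) ** 2)

# The same quantity from scikit-learn; the signature is (y_true, y_pred).
sklearn_mse = mean_squared_error(toy_y, toy_pred)

print(manual_mse, sklearn_mse)
```

The two printed values should agree up to floating-point rounding.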


[Back to top](#Index:)

## Problem 2

### Creating quadratic feature

To build a second-degree or quadratic model, you will first add a new column to the data based on squaring the `horsepower` column.  Assign these new values to the new column with the name `hp2` below.

In [9]:


auto['hp2'] = auto['horsepower']**2


# Answer check
print(auto.shape)

(392, 10)
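
Squaring the column by hand is all this activity requires. As an aside, scikit-learn's `PolynomialFeatures` transformer can generate the same squared term. The sketch below assumes a small hypothetical DataFrame named `toy` in place of the real data and only illustrates that the two approaches produce the same values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-in for the horsepower column (not the real dataset).
toy = pd.DataFrame({'horsepower': [46.0, 88.0, 130.0, 165.0]})

# Manual approach: square the column directly.
manual_hp2 = toy['horsepower'] ** 2

# degree=2 with include_bias=False yields two columns: x and x^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(toy[['horsepower']])

print(poly_features)
```

The second column of `poly_features` matches `manual_hp2`, so either route gives a model the same inputs.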


[Back to top](#Index:)

## Problem 3

### Building a quadratic model



Complete the code below according to the following instructions:

- Assign the `horsepower` and `hp2` columns from the `auto` DataFrame to the `X` variable.
- Assign the `mpg` column from the `auto` DataFrame to the `y` variable.
- Instantiate a sklearn `LinearRegression` model and use the `fit` function to train your model using `X` and `y`. Assign this model to the variable `quadratic_model` below.  
- Calculate the model mean squared error between `quadratic_model.predict(X)` and `y` and assign it to the variable `quad_mse` below.  

In [None]:


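X = auto[['horsepower', 'hp2']]
y = auto['mpg']
quadratic_model = LinearRegression().fit(X, y)
quad_mse = mean_squared_error(y, quadratic_model.predict(X))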


# Answer check
print(quadratic_model.coef_)
print(quadratic_model.intercept_)
print(quad_mse)

[Back to top](#Index:)

## Problem 4

### Plotting Predictions


Because the data is not ordered by horsepower, a line plot of `.predict(X)` would zigzag back and forth instead of tracing a single curve. To plot the correct predictions for your quadratic model, use the `sort_values()` method on `auto[['horsepower', 'hp2']]` to sort the two features by the `horsepower` column.

Assign this as a DataFrame to `x_for_pred` below.  

Note that the resulting DataFrame should start with:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>horsepower</th>      <th>hp2</th>    </tr>  </thead>  <tbody>    <tr>      <th>19</th>      <td>46.0</td>      <td>2116.0</td>    </tr>    <tr>      <th>101</th>      <td>46.0</td>      <td>2116.0</td>    </tr>    <tr>      <th>324</th>      <td>48.0</td>      <td>2304.0</td>    </tr>    <tr>      <th>323</th>      <td>48.0</td>      <td>2304.0</td>    </tr>    <tr>      <th>242</th>      <td>48.0</td>      <td>2304.0</td>    </tr>  </tbody></table>

In [None]:


x_for_pred = auto[['horsepower', 'hp2']].sort_values(by='horsepower')


# Answer check
print(type(x_for_pred))
x_for_pred.head()
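
To see why the sorting matters, here is a minimal sketch that fits a quadratic model and draws its predictions as a line over a scatterplot. It assumes synthetic data in a hypothetical DataFrame named `toy`, with made-up coefficients, and uses matplotlib rather than plotly. It is an illustration under those assumptions, not the activity's required solution.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data with a curved trend (not the real dataset).
rng = np.random.default_rng(0)
hp = rng.uniform(50, 220, size=60)
toy = pd.DataFrame({'horsepower': hp, 'hp2': hp ** 2})
toy['mpg'] = 57 - 0.47 * hp + 0.0012 * hp ** 2 + rng.normal(0, 2, size=60)

toy_model = LinearRegression().fit(toy[['horsepower', 'hp2']], toy['mpg'])

# Sort the feature rows by horsepower so the line is drawn left to right.
toy_sorted = toy[['horsepower', 'hp2']].sort_values(by='horsepower')
toy_pred = toy_model.predict(toy_sorted)

fig, ax = plt.subplots()
ax.scatter(toy['horsepower'], toy['mpg'], alpha=0.5, label='data')
ax.plot(toy_sorted['horsepower'], toy_pred, color='red', label='quadratic fit')
ax.set_xlabel('horsepower')
ax.set_ylabel('mpg')
ax.legend()
```

Passing the unsorted features to `ax.plot` instead would connect the points in row order, which is what produces the tangled line.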

[Back to top](#Index:)

## Problem 5

### Comparing the model performance



Reflect on the mean squared error of the two models.  Which model more closely approximated the data -- linear or quadratic?  Assign your answer as a string to `best_model` below (`linear` or `quadratic`).  

In [None]:


best_model = 'quadratic'


# Answer check
print(best_model)
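
When comparing the two errors, it helps to know that a least-squares fit with an extra column can never have a higher training MSE than the fit without it, so the size of the drop matters more than its direction. The sketch below illustrates this on synthetic data; the variable names and coefficients are hypothetical and are not taken from the activity's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical synthetic data with a curved relationship (not the real dataset).
rng = np.random.default_rng(1)
x = rng.uniform(50, 220, size=100)
y_toy = 57 - 0.47 * x + 0.0012 * x ** 2 + rng.normal(0, 2, size=100)

# Feature matrices for the first-degree and second-degree models.
X_lin = x.reshape(-1, 1)
X_quad = np.column_stack([x, x ** 2])

lin_pred = LinearRegression().fit(X_lin, y_toy).predict(X_lin)
quad_pred = LinearRegression().fit(X_quad, y_toy).predict(X_quad)

# Training MSE for each model, computed on the data used to fit it.
lin_mse = mean_squared_error(y_toy, lin_pred)
quad_mse_toy = mean_squared_error(y_toy, quad_pred)

print(lin_mse, quad_mse_toy)
```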