<a href="https://colab.research.google.com/github/sensei-jirving/Online-DS-PT-01.24.22-cohort-notes/blob/main/Week_06/Lecture_02/Challenge/Challenge_bias_variance_activity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
<img src="https://course_report_production.s3.amazonaws.com/rich/rich_files/rich_files/2470/s300/cd-logo-blue-600x600.png" alt="Coding Dojo Logo" class="center" height="50">

# Bias/Variance Tradeoff

*Make a copy of this notebook to edit!*

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn.searchenginejournal.com%2Fwp-content%2Fuploads%2F2020%2F08%2Fcopy-the-colab-notebook-to-your-google-drive-5f2579179f746.jpg&f=1&nofb=1" alt="Make a copy" class="center" height="300">

</center>

## What is the bias/variance trade-off?

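The bias/variance trade-off is the balance we strike in machine learning between a model that is too simple (high bias) and a model that is too sensitive to its training data (high variance). Reducing one usually increases the other, so the goal is the balance that gives the lowest error on new data.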

![](https://miro.medium.com/max/1400/1*9hPX9pAO3jqLrzt0IE3JzA.png)
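
The trade-off can be made precise for squared error. As a sketch, assume the target is generated as $y = f(x) + \varepsilon$, where the noise $\varepsilon$ has mean zero and variance $\sigma^2$, and let $\hat{f}$ be a model fit on a randomly drawn training set. These symbols are introduced here for illustration only and are not used elsewhere in the notebook. The expected error at a point $x$ then splits into three parts:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Making a model more flexible tends to shrink the first term and grow the second, which is why the two have to be balanced.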

## High Bias

Bias is the error that comes from a model being too simple to capture the patterns in your data. A high-bias model fits poorly even on the data it was trained on.

**High bias = underfit**

## High Variance

Variance is how much your model's predictions change depending on the particular data it was trained on. A high-variance model is overfit: it scores well on its training data but generalizes poorly to new data.

**High variance = overfit**

![](https://miro.medium.com/max/875/1*wHw8x8hZZdfUO-kBf-lobg.jpeg)

## Bias/Variance Tradeoff using Tree Based Models

In this notebook you will:
1. load and inspect data (code provided)
2. select features for use in modeling (code provided)
3. change data types to fit variable types (code provided)
  1. ordinal and numeric = numeric data type
  2. nominal = object datatype
4. create a column transformer for data preprocessing, including:
  1. scaling numeric data
  2. one-hot encoding nominal categorical data
5. create a baseline model
6. fit a Decision Tree model with a high bias
7. fit a Decision Tree model with a high variance
8. fit a Random Forest model with the optimal value for max_depth

Import Libraries

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import set_config
set_config(display='diagram')

Load Data

You can download the data [here](https://drive.google.com/file/d/1hcRko3e-Os1ImFJGXaWOYMtkFjuveZyJ/view?usp=sharing), which is originally from [this source](https://archive.ics.uci.edu/ml/datasets/Auto+MPG).

**Data Dictionary:**

Variable Name | Description | Units
--- | --- | ---
mpg | Target variable | [mpg](https://www.kbb.com/what-is/mpg/)
cylinders | number of [cylinders](https://en.wikipedia.org/wiki/Cylinder_%28engine%29) the car has | discrete number
displacement | [engine displacement](https://en.wikipedia.org/wiki/Engine_displacement) | cubic inches
horsepower | [engine power](https://en.wikipedia.org/wiki/Horsepower) of the car | horsepower
weight | weight of car | pounds
acceleration | elapsed time to go from 0 to 60 miles per hour | seconds
model year | model year of the car | year
origin | Country of origin | discrete number
car name | name of the car | n/a


In [None]:
# Load Data
mpg = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vTg36jLawSOgGP9hp0oJ3OYZiHMWbuGLiau-8DMjtcKNv7v9Zy_zFBQs9gZU-44GGeIyfXE2iwo26_z/pub?output=csv')
mpg.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [None]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    int64  
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   model year    392 non-null    int64  
 7   origin        392 non-null    int64  
 8   car name      392 non-null    object 
dtypes: float64(3), int64(5), object(1)
memory usage: 27.7+ KB


Deal with categorical variables.

In [None]:
# origin
mpg['origin'].value_counts()

1    245
3     79
2     68
Name: origin, dtype: int64

In [None]:
# model year
mpg['model year'].nunique()

13

Even though 'model year' and 'origin' are integer datatypes, the small number of unique values tells us they represent categories, not actual numeric values.  We will need to one-hot encode them after we split the data.  'model year' could be interpreted as either ordinal (earlier years to later years) or nominal.  We will treat it as nominal in this instance, although treating it as ordinal would also be a valid choice.
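
To make the encoding step concrete, here is a minimal sketch of what one-hot encoding does to a column like 'origin'. The tiny `train_demo` and `test_demo` frames are made-up stand-ins, not the notebook's data, and `handle_unknown='ignore'` is one possible choice rather than a requirement of this challenge. Fitting the encoder on the training rows only is the reason the encoding happens after the split.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data: 'origin' stored as object, as in the notebook
train_demo = pd.DataFrame({"origin": [1, 3, 2, 1]}).astype("object")
test_demo = pd.DataFrame({"origin": [2, 4]}).astype("object")  # 4 never appears in training

# Learn the categories from the training rows only
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train_demo).toarray()
test_encoded = encoder.transform(test_demo).toarray()

print(encoder.categories_)  # the categories learned from the training data
print(train_encoded)        # one column per category, a single 1 in each row
print(test_encoded)         # the unseen category 4 becomes a row of zeros
```

Each original value becomes a row with a single 1 in the column for its category, so the model no longer treats 3 as "three times" 1.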

In [None]:
#transform 'model year' and 'origin' into object type variables for one-hot encoding
cat_cols = ['model year','origin']
mpg[cat_cols] = mpg[cat_cols].astype('object')
mpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    int64  
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   model year    392 non-null    object 
 7   origin        392 non-null    object 
 8   car name      392 non-null    object 
dtypes: float64(3), int64(3), object(3)
memory usage: 27.7+ KB


In [None]:
# car name
mpg['car name'].nunique()

301

Most car names are unique (301 distinct values in 392 rows), so this column will not help with modeling.
I will drop it.

In [None]:
mpg.drop(columns = 'car name', inplace = True)

Split data into target vector and features matrix. We are predicting mpg.

In [None]:
X = mpg.drop(columns='mpg')
y = mpg['mpg']

### Train-Test-Split

Splitting the data lets us evaluate the model on data it has not seen. A large gap between the training score and the testing score is a sign of high variance.
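
As a quick illustration of the split sizes, the sketch below runs `train_test_split` on stand-in arrays with the same number of rows as the mpg data (392). The arrays `X_demo` and `y_demo` are invented for this example. With no `test_size` given, scikit-learn holds out 25% of the rows for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the mpg data (392)
X_demo = np.arange(392 * 2).reshape(392, 2)
y_demo = np.arange(392)

# Default split: 75% of rows for training, 25% for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
```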

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# ColumnTransformer

Create 

1. column selectors
2. one hot encoding and scaling transformers 
3. and a ColumnTransformer object that will one-hot encode the categorical variables and scale the numeric variables.

**Note:** Tree-based models do not require scaled features, but scaling does not hurt them and keeps the preprocessing reusable for models that do need it.
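
The three pieces listed above can be sketched on a toy frame as follows. This is an illustration under stated assumptions, not the solution to the exercise: the `toy` DataFrame is made up, and the `OneHotEncoder` is left at its default sparse setting because the name of that parameter differs between scikit-learn versions.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one object (categorical) column
toy = pd.DataFrame({
    "weight": [3504, 3693, 2130, 2372],
    "horsepower": [130, 165, 88, 95],
    "origin": pd.Series([1, 1, 3, 2], dtype="object"),
})

# 1. Column selectors pick columns by dtype
num_selector = make_column_selector(dtype_include="number")
cat_selector = make_column_selector(dtype_include="object")

# 2. and 3. Pair each transformer with its selector inside one ColumnTransformer
preprocessor = make_column_transformer(
    (StandardScaler(), num_selector),
    (OneHotEncoder(handle_unknown="ignore"), cat_selector),
    remainder="passthrough",
)

processed = preprocessor.fit_transform(toy)
print(processed.shape)  # 2 scaled numeric columns + 3 one-hot columns
```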

In [None]:
# Create your preprocessing steps
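# use make_column_selector, make_column_transformer (or ColumnTransformer), and OneHotEncoder
# remember to set sparse_output=False for the OneHotEncoder
# (the parameter is named sparse=False in scikit-learn versions before 1.2)
# remember to set remainder='passthrough' for the ColumnTransformer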


Create a function to take the true and predicted labels and print MAE, MSE, RMSE, and R2 metrics

In [None]:
# Create a function to take the true and predicted labels and print MAE, MSE, RMSE, and R2 metrics
def evaluate_model(y_true, y_pred):
  """Takes two arrays, true labels and predicted labels, and prints
  MAE, MSE, RMSE, and R2 metrics."""
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
  rmse = np.sqrt(mse)  # RMSE is the square root of MSE, in the units of the target
  r2 = r2_score(y_true, y_pred)

  print(f'scores: MAE: {mae:,.2f} \nMSE: {mse:,.2f} \nRMSE: {rmse:,.2f} '
        f'\nR2: {r2:.2f}')
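
To sanity-check the four metrics that `evaluate_model` prints, it can help to work a tiny case by hand. The arrays below are invented for illustration; the comments show the arithmetic behind each value.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up labels and predictions; the errors are -1, 0, and +2
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)  # (1 + 0 + 2) / 3 = 1.0
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)   # (1 + 0 + 4) / 3, about 1.67
rmse_demo = np.sqrt(mse_demo)                             # about 1.29
r2_demo = r2_score(y_true_demo, y_pred_demo)              # 1 - 5/8 = 0.375

print(mae_demo, mse_demo, rmse_demo, r2_demo)
```

MAE and RMSE are in the units of the target (mpg here), while R2 is unitless, which makes it convenient for comparing models.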

## Baseline model

1. Instantiate DummyRegressor with the 'mean' strategy to get some baseline metrics.

2. Put your ColumnTransformer and baseline model into a pipeline.

3. Fit your pipeline on the training data and evaluate it on BOTH the training and the testing data using a metric of your choice.
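
As a small illustration of what the baseline does, the sketch below fits a `DummyRegressor` on made-up numbers (not the mpg data, and without the pipeline). With the 'mean' strategy it ignores the features and always predicts the mean of the training target, so its R2 on the training data is 0.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Made-up data; the features are ignored by the dummy model
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([10.0, 20.0, 30.0, 40.0])

baseline = DummyRegressor(strategy="mean")
baseline.fit(X_demo, y_demo)

preds = baseline.predict(X_demo)    # every prediction is the training mean, 25.0
train_r2 = r2_score(y_demo, preds)  # predicting the mean gives R2 = 0 on the training data

print(preds, train_r2)
```

Any real model should beat these scores; if it does not, it has learned nothing useful from the features.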

In [None]:
#instantiate baseline


#create pipeline

#fit pipeline

#create train and test predictions

#evaluate model


## High Bias
Fit a Decision Tree Regressor pipeline to predict mpg that has very high bias by adjusting max_depth.

## High Variance
Now, fit a Decision Tree Regressor pipeline to predict mpg that has very high variance by adjusting max_depth.
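
To show how max_depth moves a tree between the two extremes, here is a sketch on synthetic data (a noisy sine curve invented for this example, not the mpg data and without the preprocessing pipeline). A depth-1 tree can make only one split, while an unlimited tree keeps splitting until it has memorized the training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus random noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(0, 0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

# High bias: a single split cannot follow the curve
stump = DecisionTreeRegressor(max_depth=1, random_state=42).fit(X_tr, y_tr)
# High variance: no depth limit, so the tree fits the noise as well
deep = DecisionTreeRegressor(max_depth=None, random_state=42).fit(X_tr, y_tr)

for name, model in [("max_depth=1", stump), ("max_depth=None", deep)]:
    train_r2 = model.score(X_tr, y_tr)  # score() returns R2 for regressors
    test_r2 = model.score(X_te, y_te)
    print(f"{name}: train R2 = {train_r2:.2f}, test R2 = {test_r2:.2f}")
```

The pattern to look for is a low score on both sets for the shallow tree, and a near-perfect training score with a noticeably lower testing score for the deep tree.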

## Balance

1. Choose a metric to optimize in order to decide which model is best.  In a comment, explain why you chose that metric.

2. Adjust the max_depth to get the best model possible.  

Hint: You might try creating a loop over a reasonable number of max_depth values and storing lists of scores, then plotting those scores to visually determine the best value for the max_depth.
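
The loop described in the hint might look like the following sketch. It runs on synthetic data invented for this example and skips the preprocessing pipeline, so treat it as the shape of a solution rather than the solution itself; the range of depths and the names `scores` and `best_depth` are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus random noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(0, 0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

# One row per max_depth value, one column per data split
depths = list(range(1, 16))
scores = pd.DataFrame(index=depths, columns=["train", "test"], dtype=float)

for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    scores.loc[depth, "train"] = tree.score(X_tr, y_tr)  # R2 on training data
    scores.loc[depth, "test"] = tree.score(X_te, y_te)   # R2 on testing data

# The index of the top row after sorting is the best max_depth
best_depth = scores.sort_values(by="test", ascending=False).index[0]
print(scores)
print("best max_depth:", best_depth)
```

Calling `scores.plot()` afterwards draws both curves, which makes the point where the testing score stops improving easy to spot.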

In [None]:
#create a range of max_depth values

#create a dataframe to store train and test scores.

#loop over the values in depths

  #fit a new model with max_depth

  #put the model into a pipeline
 
  
  #fit the model
  
  
  #create prediction arrays
  
  
  #evaluate the model using R2 Score
 
  
  #store the scores in the scores dataframe


In [None]:
#display the scores from adjusting the max_depth.


In [None]:
#plot the scores to visually determine the best max_depth


In [None]:
#sort the dataframe by test scores and save the index (max_depth) of the best score


## Balance for Random Forest

1. Choose a metric to optimize in order to decide which model is best.  In a comment, explain why you chose that metric.

2. Adjust the max_depth to get the best model possible.  

Hint: You might try creating a loop over a reasonable number of max_depth values and storing lists of scores, then plotting those scores to visually determine the best value for the max_depth.
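
One way to choose a "reasonable" range of max_depth values for a Random Forest is to fit an unconstrained forest first and see how deep its trees actually grow. The sketch below does this on synthetic data invented for this example; the forest size and variable names are arbitrary, and this is only one possible approach.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with two features
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 2))
y_syn = np.sin(X_syn[:, 0]) + 0.5 * X_syn[:, 1] + rng.normal(0, 0.3, size=300)

# Fit a forest with no depth limit
forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_syn, y_syn)

# Each fitted tree in the forest reports its own depth
tree_depths = [est.get_depth() for est in forest.estimators_]
max_depth_found = max(tree_depths)

# Depths beyond the deepest tree cannot change the model, so stop there
depths = range(1, max_depth_found + 1)
print("deepest tree:", max_depth_found)
```

Looping over `depths` then covers every value that can make a difference, without wasting time on larger ones.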

In [None]:
#create a range of max_depth values

#create a dataframe to store train and test scores.


#loop over the values in depths

  #print the depth being tried (depth = n)
  #fit a new model with max_depth=n
  

  #put the model into a pipeline
  
  
  #fit the model
  
  
  #create prediction arrays
 
  
  #evaluate the model using R2 Score
 
  
  #store the scores in the scores dataframe
  

In [None]:
#display the scores from adjusting the max depth for the Random Forest


In [None]:
#plot the scores to visually determine the best max_depth


In [None]:
#sort the dataframe by test scores and save the index (max_depth) of the best score


In [None]:
#create and fit a final model using the best value for max_depth


#evaluate the final model
