## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and hopefully the accounts hold a lot more money when you retire.
- These differ from regular bank accounts in that they carry certain tax benefits. Also, employers frequently match money that you put into a 401(k).
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:
import pandas as pd
df = pd.read_csv('401ksubs.csv')
df.head(10)

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809
5,0,15.0,1,0,60,3,0.0,0,0,225.0,3600
6,0,37.155,1,0,49,5,3.483,0,1,1380.494,2401
7,0,31.896,1,0,38,5,-2.1,0,0,1017.355,1444
8,0,47.295,1,0,52,2,5.29,0,1,2236.817,2704
9,1,29.1,0,1,45,1,29.6,0,1,846.81,2025
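
The `Unnamed: 0` column in the preview looks like a row index that was saved into the CSV. A minimal sketch of how it could be kept out of the feature set, assuming the file's first header cell is empty (the three rows below are copied from the preview; `csv_text` is a stand-in for the real file):

```python
import io
import pandas as pd

# Three rows copied from the preview above; the first header cell is assumed to be empty.
csv_text = """,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
"""

# Default read: pandas labels the header-less first column "Unnamed: 0".
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that column as the row index, so it cannot be used as a feature by accident.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_index_column.columns.tolist())
print(without_index_column.columns.tolist())
```

The cells in this notebook keep the default read, so `Unnamed: 0` remains among the columns passed to the models.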


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

In [2]:
# Variables that capture someone's expenditures
# Variables that capture someone's savings

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

In [3]:
# This may lead to potential intended/unintended racial discrimination by the model.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

In [4]:
# Do not use incsq as it is simply the squared value of our y that we are trying to predict.
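
As a quick sanity check on that claim, the sketch below recomputes the squares for the first five rows shown in the preview. The values are copied by hand from the output above, and the tolerance allows for the rounding visible in `incsq`.

```python
import numpy as np
import pandas as pd

# First five rows of the preview, copied by hand.
sample = pd.DataFrame({
    "inc":   [13.17, 61.23, 12.858, 98.88, 22.614],
    "incsq": [173.4489, 3749.113, 165.3282, 9777.254, 511.393],
    "age":   [40, 35, 44, 44, 53],
    "agesq": [1600, 1225, 1936, 1936, 2809],
})

# incsq is rounded in the file, so compare with a small relative tolerance.
inc_matches = np.allclose(sample["inc"] ** 2, sample["incsq"], rtol=1e-4)
age_matches = np.allclose(sample["age"] ** 2, sample["agesq"])

print(inc_matches, age_matches)
```

Because `incsq` is a deterministic function of `inc`, including it as a feature would leak the target into the model.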

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

In [5]:
# incsq and agesq. SMEs may have suspected a nonlinear (quadratic) relationship between income or age and e401k, which the squared terms let a linear model capture.

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

In [6]:
# inc and age are incorrectly described. inc should refer to someone's income, and age should refer to someone's age.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

In [7]:
# Linear regression (yes because we can determine feature importance)
# KNN (no)
# Random forest/decision tree (yes)
# XGBoost (yes)
# Support vector regressor (yes)

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [8]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [9]:
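# Standardize every column, including the target `inc`, so the RMSE values below are in standard-deviation units.
scaled_df = pd.DataFrame(StandardScaler().fit_transform(df), columns = df.columns)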
scaled_df.head(10)

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,-0.803173,-1.082858,-1.300887,-0.506898,-0.104886,-1.2355,-0.226651,-0.617776,1.712236,-0.648965,-0.216227
1,1.245062,0.912268,-1.300887,1.972784,-0.590372,-1.2355,2.109561,1.618709,-0.584032,0.542404,-0.63494
2,-0.803173,-1.09581,0.768706,-0.506898,0.283503,-0.580086,-0.298179,-0.617776,-0.584032,-0.651671,0.158941
3,-0.803173,2.475242,0.768706,1.972784,0.283503,-0.580086,0.042656,-0.617776,-0.584032,2.550909,0.158941
4,-0.803173,-0.690807,-1.300887,-0.506898,1.157377,-1.2355,-0.00972,-0.617776,-0.584032,-0.536366,1.133706
5,-0.803173,-1.006889,0.768706,-0.506898,1.837058,0.075328,-0.298179,-0.617776,-0.584032,-0.631789,2.016912
6,-0.803173,-0.087163,0.768706,-0.506898,0.768989,1.386157,-0.243724,-0.617776,1.712236,-0.246792,0.678145
7,-0.803173,-0.305481,0.768706,-0.506898,-0.29908,1.386157,-0.331012,-0.617776,-0.584032,-0.367786,-0.390411
8,-0.803173,0.333781,0.768706,-0.506898,1.06028,-0.580086,-0.215472,-0.617776,1.712236,0.038525,1.016466
9,1.245062,-0.421552,-1.300887,1.972784,0.3806,-1.2355,0.164607,-0.617776,1.712236,-0.424609,0.258315


In [10]:
X_train, X_test, y_train, y_test = train_test_split(scaled_df.drop(columns = ['e401k', 'p401k', 'pira', 'inc', 'incsq']), scaled_df['inc'], test_size = 0.2, random_state = 42)
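
The split above uses a dataframe that was standardized before splitting, so the scaler has already seen the test rows. One alternative, sketched here on synthetic data rather than the 401(k) file, is to put the scaler inside a pipeline so it is fit on the training rows only. The names `X_demo`, `y_demo`, and `knn_pipeline` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three unscaled features and a noisy linear target.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fit inside the pipeline, using the training rows only.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
knn_pipeline.fit(X_tr, y_tr)

# The learned means match the training split, not the full dataset.
scaler = knn_pipeline.named_steps["standardscaler"]
print(scaler.mean_)
print(X_tr.mean(axis=0))
```

With this layout the target stays in its original units, so an RMSE computed from `knn_pipeline.predict` would be directly interpretable.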

In [11]:
np.random.seed(42)
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)

cart_reg = DecisionTreeRegressor()
cart_reg.fit(X_train, y_train)

bagged_reg = BaggingRegressor()
bagged_reg.fit(X_train, y_train)

random_forest_reg = RandomForestRegressor()
random_forest_reg.fit(X_train, y_train)

adaboost_reg = AdaBoostRegressor()
adaboost_reg.fit(X_train, y_train)

support_vector_reg = SVR()
support_vector_reg.fit(X_train, y_train)

##### 9. What is bootstrapping?

In [12]:
# Randomly sampling from the original data with replacement, typically drawing a sample the same size as the original.
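
A minimal sketch of one bootstrap draw with NumPy, using a made-up array of 1,000 row indices. Because the draw is with replacement, some rows appear more than once and others not at all.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the row indices of a training set.
original = np.arange(1000)

# One bootstrap sample: same size as the original, drawn with replacement.
bootstrap_sample = rng.choice(original, size=original.size, replace=True)

# Fraction of distinct original rows that made it into this sample.
unique_fraction = np.unique(bootstrap_sample).size / original.size

print(bootstrap_sample.size, round(unique_fraction, 3))
```

For large samples the expected fraction of distinct rows is about 1 - 1/e, or roughly 63%.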

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

In [13]:
# A single decision tree is grown once on the original training sample. With bagged decision trees, one tree is grown on each bootstrapped sample, and the trees' predictions are then aggregated (averaged for regression, majority vote for classification).

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

In [14]:
# In a set of bagged decision trees, every feature is considered at each split. In a random forest, only a random subset of the features is considered at each split.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

In [15]:
# Restricting each split to a random subset of features decorrelates the trees, so averaging them reduces the variance of our predictions more than bagging alone.
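
In scikit-learn, the size of that random subset is controlled by `max_features`. The sketch below, on synthetic data, contrasts a forest that considers all features at each split (which behaves like bagged trees) with one that considers only a square-root-sized subset. As far as I know, recent scikit-learn releases default `RandomForestRegressor` to `max_features=1.0`, meaning all features, so it may be worth setting this parameter explicitly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with 8 features.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# All 8 features are candidates at every split: effectively bagged trees.
all_features_forest = RandomForestRegressor(n_estimators=50, max_features=1.0, random_state=42)

# Only int(sqrt(8)) = 2 randomly chosen features are candidates at every split.
subset_features_forest = RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=42)

all_features_forest.fit(X_demo, y_demo)
subset_features_forest.fit(X_demo, y_demo)

# Each fitted tree records how many features it considers per split.
print(all_features_forest.estimators_[0].max_features_)
print(subset_features_forest.estimators_[0].max_features_)
```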

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [16]:
from sklearn.metrics import mean_squared_error
def rmse_score(model, X_train, X_test, y_train, y_test):
    mse_train = mean_squared_error(y_true = y_train,
                                  y_pred = model.predict(X_train))
    mse_test = mean_squared_error(y_true = y_test,
                                  y_pred = model.predict(X_test))
    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test) 
    
    print("The training RMSE for " + str(model) + " is: " + str(rmse_train))
    print("The testing RMSE for " + str(model) + " is: " + str(rmse_test))
    return (rmse_train, rmse_test)

rmse_score(linear_reg, X_train, X_test, y_train, y_test)

The training RMSE for LinearRegression() is: 0.8370830332453492
The testing RMSE for LinearRegression() is: 0.8675193605893168


(0.8370830332453492, 0.8675193605893168)

In [17]:
rmse_score(knn_reg, X_train, X_test, y_train, y_test)

The training RMSE for KNeighborsRegressor() is: 0.6858277907277434
The testing RMSE for KNeighborsRegressor() is: 0.8378921541643052


(0.6858277907277434, 0.8378921541643052)

In [18]:
rmse_score(cart_reg, X_train, X_test, y_train, y_test)

The training RMSE for DecisionTreeRegressor() is: 0.0939782006070435
The testing RMSE for DecisionTreeRegressor() is: 1.12495120395057


(0.0939782006070435, 1.12495120395057)

In [19]:
rmse_score(bagged_reg, X_train, X_test, y_train, y_test)

The training RMSE for BaggingRegressor() is: 0.36598824971122906
The testing RMSE for BaggingRegressor() is: 0.8735992000085644


(0.36598824971122906, 0.8735992000085644)

In [20]:
rmse_score(random_forest_reg, X_train, X_test, y_train, y_test)

The training RMSE for RandomForestRegressor() is: 0.3207732600275436
The testing RMSE for RandomForestRegressor() is: 0.8436312635478437


(0.3207732600275436, 0.8436312635478437)

In [21]:
rmse_score(adaboost_reg, X_train, X_test, y_train, y_test)

The training RMSE for AdaBoostRegressor() is: 0.9352864035892462
The testing RMSE for AdaBoostRegressor() is: 0.9780338648986455


(0.9352864035892462, 0.9780338648986455)

In [22]:
rmse_score(support_vector_reg, X_train, X_test, y_train, y_test)

The training RMSE for SVR() is: 0.7858885593080794
The testing RMSE for SVR() is: 0.8206525603025993


(0.7858885593080794, 0.8206525603025993)
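
To compare the seven models side by side, the RMSE values printed above can be collected into one table. The numbers below are copied from those outputs and rounded to four decimals; `gap` is a helper column introduced here as a rough overfitting indicator.

```python
import pandas as pd

# Training and testing RMSE copied from the outputs above (rounded to 4 decimals).
rmse_results = pd.DataFrame(
    {
        "train_rmse": [0.8371, 0.6858, 0.0940, 0.3660, 0.3208, 0.9353, 0.7859],
        "test_rmse": [0.8675, 0.8379, 1.1250, 0.8736, 0.8436, 0.9780, 0.8207],
    },
    index=[
        "LinearRegression", "KNeighborsRegressor", "DecisionTreeRegressor",
        "BaggingRegressor", "RandomForestRegressor", "AdaBoostRegressor", "SVR",
    ],
)

# A large positive gap means the model does much better on data it has already seen.
rmse_results["gap"] = rmse_results["test_rmse"] - rmse_results["train_rmse"]

print(rmse_results.sort_values("test_rmse"))
```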

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

In [23]:
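# Every model has a higher testing RMSE than training RMSE, but the gap is small for linear regression, AdaBoost, and SVR. The decision tree (0.09 train vs. 1.12 test), the bagged trees, and the random forest overfit the most; KNN overfits moderately.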

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [24]:
# Support vector regressor. Its testing RMSE is the lowest of the seven models, and its train/test gap is small, so overfitting is not severe.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

In [25]:
# Perform feature engineering and explore other possible relationships (log or other mathematical transformations).
# Run a grid search to tune the model's hyperparameters.
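
A grid search along those lines might look like the sketch below. It runs on synthetic data, and the grid values for `C` and `epsilon` are illustrative guesses, not tuned recommendations for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the training data.
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=42)

# Hypothetical grid; a real search would likely cover a wider range.
param_grid = {"C": [0.1, 1, 10], "epsilon": [0.01, 0.1]}

# scikit-learn maximizes scores, so RMSE is exposed as a negated value.
search = GridSearchCV(SVR(), param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best setting
```

The search should be fit on the training split only, with the test split held back for the final RMSE.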

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

In [26]:
# p401k = 1 if the person participates in a 401(k). Anyone who participates must be eligible, so p401k = 1 implies e401k = 1. Using it leaks the target into the model, and it would not be known for prospective customers.
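
Assuming `p401k` flags participation in a 401(k), a cross-tabulation makes the leakage visible. The sketch uses only the ten rows shown in the preview at the top of the notebook, so it illustrates the idea without saying anything about the full dataset.

```python
import pandas as pd

# e401k and p401k values copied from the ten preview rows.
preview = pd.DataFrame({
    "e401k": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    "p401k": [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Rows are p401k, columns are e401k.
print(pd.crosstab(preview["p401k"], preview["e401k"]))

# Every participant in the preview is also eligible.
participants = preview[preview["p401k"] == 1]
print((participants["e401k"] == 1).all())
```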

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

In [27]:
# Logistic regression
# KNN
# Decision tree/set of bagged decision tree
# Random forest classifier

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC

In [29]:
# Recover the 0/1 labels from the standardized e401k column (a positive scaled value means eligible).
X_train, X_test, y_train, y_test = train_test_split(scaled_df.drop(columns = ['e401k', 'p401k']),
                                                    (scaled_df['e401k'] > 0).astype(int),
                                                    test_size = .2,
                                                    random_state = 42)

In [30]:
np.random.seed(42)
logreg_class = LogisticRegression()
logreg_class.fit(X_train, y_train)

knn_class = KNeighborsClassifier()
knn_class.fit(X_train, y_train)

cart_class = DecisionTreeClassifier()
cart_class.fit(X_train, y_train)

bagged_class = BaggingClassifier()
bagged_class.fit(X_train, y_train)

random_forest_class = RandomForestClassifier()
random_forest_class.fit(X_train, y_train)

adaboost_class = AdaBoostClassifier()
adaboost_class.fit(X_train, y_train)

support_vector_class = SVC()
support_vector_class.fit(X_train, y_train)

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

In [31]:
# False positives: Not eligible but predicted eligible
# False negatives: Eligible but predicted not eligible
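
These counts can be read off a confusion matrix. A small sketch with made-up labels, where 1 means "eligible":

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 1, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

print("False positives (not eligible, predicted eligible):", fp)
print("False negatives (eligible, predicted not eligible):", fn)
```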

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

In [32]:
# Minimize false positives. This ensures that we do not spend a lot of money reaching out to a lot of ineligible people.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

In [33]:
# Maximize specificity (TN / (TN + FP)); fewer false positives means higher specificity. Precision is another option.
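
Fewer false positives means higher specificity, TN / (TN + FP). scikit-learn has no dedicated specificity function that I am aware of, but specificity is the recall of the negative class, so one way to compute it is `recall_score` with `pos_label=0`. A sketch with made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = eligible, 0 = not eligible.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 1, 0, 0, 0, 1]

# Specificity = TN / (TN + FP), i.e. recall computed for the negative class.
specificity = recall_score(y_true_demo, y_pred_demo, pos_label=0)

# Precision = TP / (TP + FP) also improves as false positives fall.
precision = precision_score(y_true_demo, y_pred_demo)

print(specificity, precision)
```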

##### 23. Suppose that instead of optimizing for the metric in problem 22, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

In [34]:
# F1 score is the harmonic mean of precision and recall, so it is high only when both are high. It penalizes false positives and false negatives together.
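
For reference, the standard definition of the F1 score, written in terms of precision and recall and then in terms of true positives (TP), false positives (FP), and false negatives (FN):

$$
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

Both FP and FN sit in the denominator, so an increase in either kind of error lowers the score.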

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [35]:
from sklearn.metrics import f1_score
def f1_scorer(model, X_train, X_test, y_train, y_test):
    f1_train = f1_score(y_true = y_train,
                        y_pred = model.predict(X_train))
    f1_test = f1_score(y_true = y_test,
                       y_pred = model.predict(X_test))
    
    print("The training F1-score for " + str(model) + " is: " + str(f1_train))
    print("The testing F1-score for " + str(model) + " is: " + str(f1_test))
    return (f1_train, f1_test)

print(f1_scorer(logreg_class, X_train, X_test, y_train, y_test))
print()
print(f1_scorer(knn_class, X_train, X_test, y_train, y_test))
print()
print(f1_scorer(cart_class, X_train, X_test, y_train, y_test))
print()
print(f1_scorer(bagged_class, X_train, X_test, y_train, y_test))
print()
print(f1_scorer(random_forest_class, X_train, X_test, y_train, y_test))
print()
print(f1_scorer(adaboost_class, X_train, X_test, y_train, y_test))
print()
print(f1_scorer(support_vector_class, X_train, X_test, y_train, y_test))

The training F1-score for LogisticRegression() is: 0.4727870199219552
The testing F1-score for LogisticRegression() is: 0.4777870913663035
(0.4727870199219552, 0.4777870913663035)

The training F1-score for KNeighborsClassifier() is: 0.653122648607976
The testing F1-score for KNeighborsClassifier() is: 0.4977511244377811
(0.653122648607976, 0.4977511244377811)

The training F1-score for DecisionTreeClassifier() is: 1.0
The testing F1-score for DecisionTreeClassifier() is: 0.4702627939142462
(1.0, 0.4702627939142462)

The training F1-score for BaggingClassifier() is: 0.9725380444288962
The testing F1-score for BaggingClassifier() is: 0.49615975422427033
(0.9725380444288962, 0.49615975422427033)

The training F1-score for RandomForestClassifier() is: 1.0
The testing F1-score for RandomForestClassifier() is: 0.5465465465465464
(1.0, 0.5465465465465464)

The training F1-score for AdaBoostClassifier() is: 0.5621436716077537
The testing F1-score for AdaBoostClassifier() is: 0.568848758465011

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

In [36]:
# Overfitting occurs for KNN, decision tree, bagging and random forest.
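
The same side-by-side view used for RMSE can be built for F1. The values are copied from the printed outputs above and rounded to four decimals. The support vector classifier is left out because its scores do not appear in the output shown; `gap` is a helper column introduced here.

```python
import pandas as pd

# Training and testing F1-scores copied from the outputs above (rounded to 4 decimals).
f1_results = pd.DataFrame(
    {
        "train_f1": [0.4728, 0.6531, 1.0000, 0.9725, 1.0000, 0.5621],
        "test_f1": [0.4778, 0.4978, 0.4703, 0.4962, 0.5465, 0.5688],
    },
    index=[
        "LogisticRegression", "KNeighborsClassifier", "DecisionTreeClassifier",
        "BaggingClassifier", "RandomForestClassifier", "AdaBoostClassifier",
    ],
)

# Higher F1 is better, so a large positive gap (train minus test) suggests overfitting.
f1_results["gap"] = f1_results["train_f1"] - f1_results["test_f1"]

print(f1_results.sort_values("test_f1", ascending=False))
```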

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [37]:
# Logistic regression. Although its F1 score is not the best, it does not overfit and it is highly interpretable.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

In [38]:
# Same as above: a grid search over hyperparameters and additional feature engineering.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.