# Supervised Learning Model Comparison

![A piggybank with 401k written beside it](https://imgur.com/2xg0qOu.jpg)
*Getty Images*

The data science process:

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

Today we'll be focused on creating and comparing regression and classification models. 

## Step 1: Define the problem.

Scenario:

We're data scientists at a financial services company. Specifically, we want to leverage data in order to identify potential customers.

A 401(k) and an IRA are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and will hopefully have grown substantially by the time you retire.
- These differ from regular bank accounts in that they carry certain tax benefits. Also, employers frequently match money that you put into a 401(k).
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We'll tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, we should pretend that we do not have access to the `e401k`, `p401k`, and `pira` variables.

## Step 2: Gather the data.

##### Let's read in the data from the repository.

In [17]:
import pandas as pd
import numpy as np

In [18]:
df = pd.read_csv('401ksubs.csv')
df.head()

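| | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |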


In [30]:
df.shape

(9275, 11)

##### Other variables that, if available, would be helpful to have:

- Savings
- Job type
- Credit history
- Investments
- Rents or owns

##### Is it ethically sound to put `race` into our model in order to better predict who to target when advertising IRAs and 401(k)s?

No. Doing so would make our model discriminate based on race.

Discriminating based on race is unacceptable.

## Step 3: Explore the data.

##### When attempting to predict income, which feature(s) would we reasonably not use? Why?

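`inc`, because if we already had `inc` we wouldn't need to predict it.

`incsq`, for the same reason: it is computed directly from `inc`.

`e401k`, `p401k`, and `pira`, because per the note above we are pretending we don't have access to them.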

##### Note that two variables have already been created for us through feature engineering

`incsq` and `agesq`.

This might have been done because age and income are exceptionally good indicators of eligibility, so a subject-matter expert might have wanted to make age and income more potent in their model. Squared terms also let a linear model capture a curved (non-linear) relationship with age or income.
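To make the feature engineering concrete, here is a minimal sketch of how squared columns like these could be derived. The values are copied from the first rows of the `df.head()` output above; `toy` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# First three rows of inc and age from the df.head() output above.
toy = pd.DataFrame({"inc": [13.17, 61.23, 12.858], "age": [40, 35, 44]})

# Each engineered column is just the original column squared.
toy["incsq"] = toy["inc"] ** 2
toy["agesq"] = toy["age"] ** 2
print(toy)
```

The squared ages (1600, 1225, 1936) match the `agesq` values in the data, which suggests the columns were built this way.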

##### Looking at the data dictionary, one variable description appears to be an error.

`age`'s current description is age^2. The correct description should simply be a person's age.

`inc`'s description has a similar issue. Its current definition is inc^2. Its actual description should be a person's income.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, we should pretend that we do not have access to the `e401k`, `p401k`, and `pira` variables.

##### Potential models to use for regression problems and whether they're appropriate for solving this specific regression problem

- a multiple linear regression model (Yes, we can understand influence of features.)
- a $k$-nearest neighbors model (No, we cannot understand influence of features.)
- a decision tree (Yes, we can understand influence of features.)
- a set of bagged decision trees (Yes, we can understand influence of features.)
- a random forest (Yes, we can understand influence of features.)
- a set of extremely randomized trees (Yes, we can understand influence of features.)
- an Adaboost model (Yes, we can understand influence of features.)
- an XGBoost model (Yes, we can understand influence of features.)
- a support vector regressor (Yes with a linear kernel, where we can read coefficients; with a non-linear kernel the influence of features is much harder to read.)
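As a rough illustration of what "understanding the influence of features" can look like in practice, the sketch below fits a linear regression and a random forest on synthetic data (not the 401(k) data) and reads off `coef_` and `feature_importances_`. The data-generating rule and the names `lr_demo` and `rf_demo` are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
# Hypothetical target: driven mostly by feature 0, weakly by feature 1, not at all by feature 2.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

lr_demo = LinearRegression().fit(X, y)
rf_demo = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Linear models expose one signed coefficient per feature.
print("Linear coefficients:", lr_demo.coef_)
# Tree ensembles expose non-negative importances that sum to 1.
print("Forest importances: ", rf_demo.feature_importances_)
```

A `KNeighborsRegressor` has neither attribute, which is why it was marked "No" in the list above.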

##### Let's try a bunch of model types and see what they spit out.

Model types we'll attempt:
- a multiple linear regression model
- a k-nearest neighbors model
- a decision tree
- a set of bagged decision trees
- a random forest
- an Adaboost model
- a support vector regressor

In [38]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
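The cell that builds `scaled_df` does not appear in this notebook. Judging from the output below, where even the binary columns are standardized, it likely passed the whole DataFrame through `StandardScaler`. Here is a minimal sketch under that assumption; `scale_frame` and `toy_df` are hypothetical names used only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_frame(frame):
    """Standardize every column to mean 0 and unit variance, keeping names and index."""
    scaler = StandardScaler()
    return pd.DataFrame(scaler.fit_transform(frame),
                        columns=frame.columns,
                        index=frame.index)

# Toy stand-in for df; with the real data this would be scaled_df = scale_frame(df).
toy_df = pd.DataFrame({"e401k": [0, 1, 0, 0], "age": [40, 35, 44, 53]})
toy_scaled = scale_frame(toy_df)
print(toy_scaled)
```

Note that scaling the whole frame also rescales binary columns such as `e401k`, so they are no longer 0/1 afterwards.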

In [40]:
scaled_df.head()

| | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.803173 | -1.082858 | -1.300887 | -0.506898 | -0.104886 | -1.2355 | -0.226651 | -0.617776 | 1.712236 | -0.648965 | -0.216227 |
| 1 | 1.245062 | 0.912268 | -1.300887 | 1.972784 | -0.590372 | -1.2355 | 2.109561 | 1.618709 | -0.584032 | 0.542404 | -0.63494 |
| 2 | -0.803173 | -1.09581 | 0.768706 | -0.506898 | 0.283503 | -0.580086 | -0.298179 | -0.617776 | -0.584032 | -0.651671 | 0.158941 |
| 3 | -0.803173 | 2.475242 | 0.768706 | 1.972784 | 0.283503 | -0.580086 | 0.042656 | -0.617776 | -0.584032 | 2.550909 | 0.158941 |
| 4 | -0.803173 | -0.690807 | -1.300887 | -0.506898 | 1.157377 | -1.2355 | -0.00972 | -0.617776 | -0.584032 | -0.536366 | 1.133706 |


In [41]:
X_train, X_test, y_train, y_test = train_test_split(scaled_df.drop(columns = ['e401k', 'p401k', 'pira', 'inc', 'incsq']),
                                                    scaled_df['inc'],
                                                    test_size = .2,
                                                    random_state = 42)

In [42]:
import numpy as np
np.random.seed(42)

In [58]:
lr = LinearRegression()
lr.fit(X_train, y_train)

knn_r = KNeighborsRegressor()
knn_r.fit(X_train, y_train)

dt_r = DecisionTreeRegressor()
dt_r.fit(X_train, y_train)

bag_r = BaggingRegressor()
bag_r.fit(X_train, y_train)

rf_r = RandomForestRegressor()
rf_r.fit(X_train, y_train)

ada_r = AdaBoostRegressor()
ada_r.fit(X_train, y_train)

svr = SVR()
svr.fit(X_train, y_train)

SVR()

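In [None]:
# pipe_svec_r was never defined above; assumed here to be a scaler + SVR pipeline.
pipe_svec_r = Pipeline([('ss', StandardScaler()), ('svr', SVR())])
pipe_svec_r.fit(X_train, y_train)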

##### Bootstrapping?

Bootstrapping is random resampling with replacement.

Combining the models fit on bootstrapped samples gives us a better idea of what we'd see from the whole population than fitting a single model on our original sample.
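A minimal sketch of resampling with replacement, using the five `inc` values from `df.head()` as a tiny sample. The number of resamples (1,000) and the choice to bootstrap the mean are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = np.array([13.17, 61.23, 12.858, 98.88, 22.614])  # inc values from df.head()

# Each bootstrap resample has the same size as the sample and is drawn with replacement,
# so some observations repeat and others are left out.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

print("Sample mean:", sample.mean())
print("Bootstrap 95% interval for the mean:", np.percentile(boot_means, [2.5, 97.5]))
```

The spread of `boot_means` approximates how much the statistic would vary across samples from the population, which is the same idea bagging applies to whole models.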

##### A decision tree vs a set of bagged decision trees

With a set of bagged decision trees, we bootstrap many different samples, grow one decision tree on each bootstrapped sample, and aggregate the trees' predictions.

A decision tree only uses the original sample and grows just one tree.

##### A set of bagged decision trees vs a random forest

The fundamental difference is that in a random forest, only a random subset of the features is considered at each split, and the best split within that subset is used. In bagging, all features are considered at every split.

With bagged decision trees, we generate many different trees on pretty similar data, so these trees are strongly correlated with one another. Averaging strongly correlated trees removes little variance, so the ensemble can still have high variance. By "de-correlating" our trees from one another, we can drastically reduce the variance of our model.

A random forest reduces variance (at the expense of a small increase in bias) and thus often improves the overall performance of the final model.
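The contrast can be sketched in scikit-learn on synthetic data (not the 401(k) data). The choice of `max_features=2` is an assumption for illustration; it is set explicitly because recent scikit-learn versions let `RandomForestRegressor` consider all features by default, which makes it behave much like bagging.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

# Bagging: each tree sees a bootstrap sample and may split on any of the 6 features.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=42)

# Random forest: bootstrap samples as well, but each split considers only 2 random features.
forest = RandomForestRegressor(n_estimators=50, max_features=2, random_state=42)

bagged.fit(X, y)
forest.fit(X, y)

print("Features considered per split (bagging):", bagged.estimators_[0].max_features)
print("Features considered per split (forest): ", forest.estimators_[0].max_features)
```

A `max_features` of `None` on the bagged trees means every feature is a candidate at every split.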

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### Using RMSE, evaluate each of the models we fit on both the training and testing data.

In [47]:
from sklearn import metrics
from sklearn.metrics import mean_squared_error

In [51]:
def rmse(model, X_train, X_test, y_train, y_test):
    mse_train = mean_squared_error(y_true = y_train,
                                   y_pred = model.predict(X_train))
    mse_test = mean_squared_error(y_true = y_test, 
                                  y_pred = model.predict(X_test))
    
    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test)               
    print('Training RMSE: ', rmse_train)
    print('Testing RMSE: ', rmse_test)

In [52]:
rmse(lr, X_train, X_test, y_train, y_test)

Training RMSE:  0.8370830332453493
Testing RMSE:  0.8675193605893169


In [53]:
rmse(knn_r, X_train, X_test, y_train, y_test)

Training RMSE:  0.6860819173270954
Testing RMSE:  0.837326616404933


In [54]:
rmse(dt_r, X_train, X_test, y_train, y_test)

Training RMSE:  0.0939782006070435
Testing RMSE:  1.1272071615450177


In [55]:
rmse(bag_r, X_train, X_test, y_train, y_test)

Training RMSE:  0.36582653691398165
Testing RMSE:  0.8746667054536171


In [56]:
rmse(rf_r, X_train, X_test, y_train, y_test)

Training RMSE:  0.32073401874223845
Testing RMSE:  0.843648400610395


In [57]:
rmse(ada_r, X_train, X_test, y_train, y_test)

Training RMSE:  0.9352864035892462
Testing RMSE:  0.9780338648986455


In [59]:
rmse(svr, X_train, X_test, y_train, y_test)

Training RMSE:  0.7858885593080797
Testing RMSE:  0.8206525603025994


##### What do we notice?

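All of our models are overfit to some degree.

Testing RMSE is worse (higher) than training RMSE for every model. The gap is small for linear regression, AdaBoost, and the support vector regressor, and largest for the decision tree (about 0.09 training vs. 1.13 testing).

The models with large gaps are generalizing poorly to held-out/unseen data.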
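When comparing many models, it can help to collect the training and testing RMSE in one table and look at the gap directly. The sketch below does this on synthetic data with two models. The names `X_tr`, `X_te`, `models`, and `summary` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    # A large positive gap (test minus train) is the signature of overfitting.
    rows.append({"model": name, "train_rmse": train_rmse,
                 "test_rmse": test_rmse, "gap": test_rmse - train_rmse})

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```

On this synthetic data the fully grown tree memorizes the training set, so its gap is far larger than the linear model's, the same pattern seen in the results above.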

##### Based on everything we've covered so far, if we had to pick just one final model to answer the problem in front of us, which model should we pick?

We'd get rid of KNN right away because we can't easily identify which features affect income, which is the point of what we're doing. We're looking for which features best predict income, not which model makes the best predictions.

Outside of that, it's pretty much a judgment call among the models that overfit the least (linear regression, AdaBoost, and the support vector regressor).

Some considerations we might have are:

- Do we have time to tune the models to try and eke out better performance?
- Is one model substantially better at solving the problem we want to solve?
- Do we need something understandable by a lay audience? (e.g. Linear regression is more common and more easily understood than AdaBoost or support vector machines.)

If given time, we should try tuning the three remaining models to see if one performs substantially better than the others. If they all have roughly the same performance, we should generally go with the simplest/easiest to understand model. In this case, that is linear regression.

##### What could we do to improve the performance of our models?

- Use gridsearch
- Use polynomial features
- Remove outliers
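As one example of the first suggestion, here is a minimal grid search sketch with `GridSearchCV` on synthetic data. The parameter grid, the choice of `KNeighborsRegressor`, and the 5-fold setting are assumptions for illustration, not tuned recommendations for the 401(k) data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=300)

# Every combination in the grid is scored with 5-fold cross-validation.
param_grid = {"n_neighbors": [3, 5, 15, 25], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

print("Best parameters:", search.best_params_)
# scikit-learn maximizes scores, so RMSE is reported negated.
print("Best cross-validated RMSE:", -search.best_score_)
```

With the notebook's data, the search would be fit on `X_train` and `y_train` only, keeping the test set untouched until final evaluation.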

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.


##### Why would we shy away from using `p401k` in our model?

If you already participate in a 401(k), you're eligible by default. Including `p401k` would effectively hand the model a list of everyone with a 401(k), which won't necessarily tell us anything about the factors that best determine eligibility.
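One quick way to see this kind of leakage is to cross-tabulate the two columns. The sketch below uses a small hypothetical table in the style of the data; the values are made up for illustration, and the real check would use `df` itself.

```python
import pandas as pd

# Hypothetical rows: every participant (p401k == 1) is also eligible (e401k == 1).
toy = pd.DataFrame({"e401k": [0, 1, 0, 1, 1, 0],
                    "p401k": [0, 1, 0, 1, 0, 0]})

# Rows are eligibility, columns are participation.
table = pd.crosstab(toy["e401k"], toy["p401k"])
print(table)

# If participation implies eligibility, no row is a participant without being eligible.
leaks = int(((toy["p401k"] == 1) & (toy["e401k"] == 0)).sum())
print("Participants who are not eligible:", leaks)
```

An empty cell for "participates but not eligible" means `p401k == 1` perfectly predicts the positive class, which is the reason to leave it out.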

##### Potential classification models we can use and whether they're appropriate for solving our problem

- a logistic regression model (Yes, we can predict whether or not one is eligible for a 401(k).)
- a $k$-nearest neighbors model (Yes, we can predict whether or not one is eligible for a 401(k).)
- a Naive Bayes model (Yes, we can predict whether or not one is eligible for a 401(k).)
- a decision tree (Yes, we can predict whether or not one is eligible for a 401(k).)
- a set of bagged decision trees (Yes, we can predict whether or not one is eligible for a 401(k).)
- a random forest (Yes, we can predict whether or not one is eligible for a 401(k).)
- a set of extremely randomized trees (Yes, we can predict whether or not one is eligible for a 401(k).)
- an Adaboost model (Yes, we can predict whether or not one is eligible for a 401(k).)
- an XGBoost model (Yes, we can predict whether or not one is eligible for a 401(k).)
- a support vector classifier (Yes, we can predict whether or not one is eligible for a 401(k).)

##### Let's attempt to use the following models for our classification problem:

- a logistic regression model
- a k-nearest neighbors model
- a decision tree
- a set of bagged decision trees
- a random forest
- an Adaboost model
- a support vector classifier

In [61]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC


In [60]:
# Scaling turned e401k into a continuous column, so we rebuild a binary target:
# positive scaled values correspond to the original e401k == 1.
X_train, X_test, y_train, y_test = train_test_split(scaled_df.drop(columns = ['e401k', 'p401k']),
                                                    (scaled_df['e401k'] > 0).astype(int),
                                                    test_size = .2,
                                                    random_state = 42)

In [62]:
np.random.seed(42)

In [64]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

knn_c = KNeighborsClassifier()
knn_c.fit(X_train, y_train)

dt_c = DecisionTreeClassifier()
dt_c.fit(X_train, y_train)

bag_c = BaggingClassifier()
bag_c.fit(X_train, y_train)

rf_c = RandomForestClassifier()
rf_c.fit(X_train, y_train)

ada_c = AdaBoostClassifier()
ada_c.fit(X_train, y_train)

svc = SVC()
svc.fit(X_train, y_train)

SVC()

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### If our "positive" class is someone eligible for a 401(k), then false positives and false negatives are...

False positives: people we predicted to be eligible but who actually are not.

False negatives: people we predicted to be ineligible but who actually are eligible.

##### In this specific case, we want to minimize false positives...

Because incorrectly assuming someone is eligible might hurt our company financially.

##### If we want to minimize false positives, then we want to maximize specificity.

$$\text{Specificity} = \frac{TN}{N} = \frac{TN}{TN + FP}$$
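A small worked example of the formula, using `confusion_matrix` from scikit-learn. The labels are hypothetical and exist only to make the arithmetic easy to follow.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = eligible, 0 = not eligible.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# For binary 0/1 labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Specificity:", specificity)
```

Here four of the five truly ineligible people are predicted ineligible, so specificity is 4 / 5 = 0.8. Each additional false positive lowers it.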

##### If we wanted to balance our false positives and false negatives we'd probably want to use `f1-score`.

$$
\begin{eqnarray*}
F_1 &=& \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} \\
&=& \frac{2}{\frac{1}{\frac{TP}{TP + FP}} + \frac{1}{\frac{TP}{TP + FN}}} \\
&=& \frac{2}{\frac{TP + FP}{TP} + \frac{TP + FN}{TP}} \\
&=& \frac{2}{\frac{TP + FP + TP + FN}{TP}} \\
&=& \frac{2}{2 + \frac{FP + FN}{TP}}
\end{eqnarray*}
$$

Since the F1 score is the harmonic mean of Precision and Recall, it gives equal weight to both:<br>
A model will obtain a high F1 score if both Precision and Recall are high<br>
A model will obtain a low F1 score if both Precision and Recall are low<br>
A model will obtain a fairly low F1 score if one of Precision and Recall is low and the other is high, because the harmonic mean is pulled toward the smaller value<br>
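To see how $F_1$ behaves in these three cases, the sketch below computes it directly from precision and recall and compares it with the plain arithmetic mean. The helper `f1_from` and the example values are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Both high, both low, and one high with one low.
for p, r in [(0.9, 0.9), (0.1, 0.1), (0.9, 0.1)]:
    arithmetic = (p + r) / 2
    print(f"precision={p}, recall={r}: "
          f"F1={f1_from(p, r):.2f}, arithmetic mean={arithmetic:.2f}")
```

In the mixed case the arithmetic mean is 0.50 but $F_1$ is only 0.18, so a model cannot earn a good $F_1$ by excelling at one of the two and neglecting the other.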

##### Let's evaluate our models using `f1-score`

In [65]:
from sklearn.metrics import f1_score

In [66]:
def f1_scores(model, X_train, X_test, y_train, y_test):
    f1_train = f1_score(y_true = y_train,
                        y_pred = model.predict(X_train))
    f1_test = f1_score(y_true = y_test, 
                       y_pred = model.predict(X_test))
                   
    print('Training F1: ', f1_train)
    print('Testing F1: ', f1_test)

In [67]:
f1_scores(logreg, X_train, X_test, y_train, y_test)

Training F1:  0.4727870199219552
Testing F1:  0.4777870913663035


In [68]:
f1_scores(knn_c, X_train, X_test, y_train, y_test)

Training F1:  0.653122648607976
Testing F1:  0.4977511244377811


In [69]:
f1_scores(dt_c, X_train, X_test, y_train, y_test)

Training F1:  1.0
Testing F1:  0.4702627939142462


In [70]:
f1_scores(bag_c, X_train, X_test, y_train, y_test)

Training F1:  0.9725380444288962
Testing F1:  0.49615975422427033


In [71]:
f1_scores(rf_c, X_train, X_test, y_train, y_test)

Training F1:  1.0
Testing F1:  0.5465465465465464


In [75]:
f1_scores(ada_c, X_train, X_test, y_train, y_test)

Training F1:  0.5621436716077537
Testing F1:  0.5688487584650113


In [74]:
f1_scores(svc, X_train, X_test, y_train, y_test)

Training F1:  0.47162162162162163
Testing F1:  0.45207956600361665


##### We want our $F_1$ score to be as high as possible. Thus, overfitting occurs when we have a high training $F_1$ score and a low testing $F_1$ score.

Models that appear to be overfit are:
- knn
- decision trees
- bagged decision trees
- random forest
- SVC

##### Which model should we pick?

Borrowing from earlier...

First, we remove any models that cannot solve our problem:
- In this case, all models can solve our problem. We want to maximize our ability to correctly predict whether or not someone is eligible for a 401(k).

Among the models that _can_ solve our problem, we want to find the one that performs best on a metric of our choice:
- In this case, the models that seem to overfit the least (i.e. have the smallest gap between training and testing $F_1$) are logistic regression and AdaBoost.

Once again, it becomes a judgment call:
- Do we have time to tune the models to try and eke out better performance?
- Is one model substantially better at solving the problem we want to solve?
- Do we need something understandable by a lay audience? (e.g. Logistic regression is more common and more easily understood than AdaBoost.)

If given time, we should try tuning the remaining models to see if one performs substantially better than the others.

If they all have roughly the same performance, we generally go with the simplest/easiest to understand model.

In this case, that would be logistic regression, although AdaBoost's higher testing $F_1$ (about 0.57 vs. 0.48) is worth weighing against that simplicity.

## Step 6: Answer the problem.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.