## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.

Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and, ideally, grows so that there is much more of it when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv("401ksubs.csv")
df.head()

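|   | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|-------|-----|------|------|-----|-------|--------|-------|------|-------|-------|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |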


In [4]:
df.shape

(9275, 11)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9275 entries, 0 to 9274
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   e401k   9275 non-null   int64  
 1   inc     9275 non-null   float64
 2   marr    9275 non-null   int64  
 3   male    9275 non-null   int64  
 4   age     9275 non-null   int64  
 5   fsize   9275 non-null   int64  
 6   nettfa  9275 non-null   float64
 7   p401k   9275 non-null   int64  
 8   pira    9275 non-null   int64  
 9   incsq   9275 non-null   float64
 10  agesq   9275 non-null   int64  
dtypes: float64(3), int64(8)
memory usage: 797.2 KB


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

1. `investments`: 1 if person has investments, 0 if person does not have investments
2. `properties`: number of properties that a person owns
3. `savings`: amount of savings that a person has

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

Using `race` to decide who sees advertising means treating people differently based on a protected characteristic. The results could reinforce negative stereotypes or further stigmatise racial minorities.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

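`incsq` should be left out: it is computed directly from `inc`, so including it would leak the target into the features. Per the note in Step 1, `e401k`, `p401k`, and `pira` should also be excluded when predicting `inc`.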

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

`incsq` and `agesq`. Squared terms let a linear model capture curved (non-linear) relationships. For example, an outcome may rise with age and then level off or decline, which `age` alone cannot represent in a linear model.
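
To make the idea of a squared term concrete, here is a minimal sketch on synthetic data (not the lab's dataset). The curve shape, coefficients, and noise level are assumptions chosen only for illustration: an outcome that peaks around age 50 is fit once with `age` alone and once with `age` plus its square.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, illustrative data: the outcome peaks around age 50 (assumed shape).
age = rng.uniform(25, 65, size=500)
income = 60 - 0.05 * (age - 50) ** 2 + rng.normal(0, 2, size=500)

# Model 1 uses age only; Model 2 adds the squared term (analogous to `agesq`).
X_linear = age.reshape(-1, 1)
X_quad = np.column_stack([age, age ** 2])

r2_linear = LinearRegression().fit(X_linear, income).score(X_linear, income)
r2_quad = LinearRegression().fit(X_quad, income).score(X_quad, income)

print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_quad:.3f}")
```

The model with the squared term should explain noticeably more of the variance here, because a straight line cannot bend to follow the peak.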

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

The description for `inc` is listed as `inc^2`. It should instead describe `inc` itself, that is, one's annual income.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

| Model                | Explanation                                                        | Answer |
|----------------------|--------------------------------------------------------------------|--------|
| Linear Regression    | A linear regression model fits a linear relationship between the input features and a continuous target variable. |Yes|
| Decision Trees       | Decision trees recursively split the feature space into regions and predict the mean target value within each region. |Yes|
| Random Forests       | Random forests average the predictions of many decision trees, each grown on a bootstrap sample with a random subset of features considered at each split. |Yes|
| Support Vector Machines | The regression variant (SVR) fits a function that ignores errors inside a margin around the target and penalises errors outside it. |Yes|
| kNN                  | k-Nearest Neighbors predicts the average target value of a point's k nearest neighbors, so it works for regression as well as classification. |Yes|
| Extremely Randomized Trees | Extremely Randomized Trees is an ensemble method that builds multiple decision trees using randomly selected features and split points, increasing diversity and reducing overfitting. |Yes|
| Bagged Decision Trees | Bagging averages predictions from multiple trees trained on bootstrap samples of the training data, reducing variance. |Yes|
| AdaBoost             | AdaBoost sequentially combines weak learners to create a strong learner, with each new learner focusing on the observations the previous ones predicted poorly. |Yes|
| XGBoost              | XGBoost is an optimized gradient boosting framework that uses a tree-based ensemble method, providing high performance and scalability for regression and classification tasks. |Yes|


##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an AdaBoost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!
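
As a rough illustration of the pipeline suggestion, the sketch below wraps a scaler and each candidate model in a `Pipeline` and loops over them with one shared train/test split. It runs on synthetic stand-in data from `make_regression`; the names `X_demo`, `y_demo`, `candidates`, and `results` are hypothetical and not part of the lab.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; in the lab this would be X and y built from `df`.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "tree": DecisionTreeRegressor(random_state=42),
}

results = {}
for name, estimator in candidates.items():
    # The scaler is fit on the training data only, inside the pipeline.
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    pipe.fit(Xtr, ytr)
    rmse_train = mean_squared_error(ytr, pipe.predict(Xtr)) ** 0.5
    rmse_test = mean_squared_error(yte, pipe.predict(Xte)) ** 0.5
    results[name] = (rmse_train, rmse_test)
    print(f"{name}: train RMSE = {rmse_train:.2f}, test RMSE = {rmse_test:.2f}")
```

Keeping the scaler inside the pipeline means the same object can later be passed to cross-validation or a grid search without leaking test information into the scaling step.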

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [8]:
X = df.drop(columns=["e401k", "p401k", "pira", "inc", "incsq"])
y = df["inc"]

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

In [10]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [11]:
import numpy as np
np.random.seed(42)

In [16]:
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

knn_reg = KNeighborsRegressor()
knn_reg.fit(X_train, y_train)

dt_reg = DecisionTreeRegressor()
dt_reg.fit(X_train, y_train)

bag_reg = BaggingRegressor()
bag_reg.fit(X_train, y_train)

rf_reg = RandomForestRegressor()
rf_reg.fit(X_train, y_train)

ab_reg = AdaBoostRegressor()
ab_reg.fit(X_train, y_train)

sv_reg = SVR()
sv_reg.fit(X_train, y_train)

##### 9. What is bootstrapping?

Bootstrapping is a statistical resampling method that involves repeatedly sampling with replacement from the original dataset to estimate parameters or assess the variability of a statistic.
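
A minimal sketch of the idea, using NumPy on a made-up sample (the distribution parameters are assumptions; in the lab the sample could be a column such as `df["inc"]`). It resamples with replacement many times and uses the spread of the resampled means to form a percentile interval.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sample only; not drawn from the lab's data.
sample = rng.normal(loc=40.0, scale=20.0, size=200)

n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Draw n observations WITH replacement from the original sample.
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile interval: the middle 95% of the bootstrap means.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {sample.mean():.2f}")
print(f"95% bootstrap interval for the mean: ({lower:.2f}, {upper:.2f})")
```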

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

A decision tree is a single model fit once to the full training set by recursively splitting the feature space. A set of bagged decision trees fits many trees, each to a bootstrap sample (drawn with replacement) of the training data, and then averages their predictions (or takes a majority vote for classification). Aggregating many trees reduces variance compared with a single tree.

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

In a set of bagged decision trees, each tree is trained on a bootstrap sample of the data and may consider every feature at every split. A random forest also trains each tree on a bootstrap sample, but at each split only a random subset of the features is considered. This extra randomness makes the trees less correlated with one another.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

A random forest may be superior to a set of bagged decision trees because selecting a random subset of features at each split decorrelates the trees. Bagged trees tend to be highly correlated, since strong predictors dominate the top splits of every tree, and averaging correlated trees reduces variance less than averaging less-correlated ones. The random forest accepts a small increase in the bias of each tree in exchange for a larger reduction in the variance of the ensemble, which often improves generalisation.
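
The difference between the two ensembles can be seen directly in scikit-learn. The sketch below is an illustration on synthetic data, with `max_features=0.5` and `n_estimators=25` chosen arbitrarily. Per the scikit-learn documentation, `RandomForestRegressor` considers all features at each split by default, so the feature subsampling is requested explicitly here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with 8 features, for illustration only.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)

# Bagged trees: each tree sees a bootstrap sample and may split on ANY feature.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=0)
bagged.fit(X_demo, y_demo)

# Random forest: bootstrap samples PLUS a random subset of features at each split.
# max_features=0.5 is an illustrative choice, not a recommendation.
forest = RandomForestRegressor(n_estimators=25, max_features=0.5, random_state=0)
forest.fit(X_demo, y_demo)

# Number of features each individual tree considers per split.
print("Bagged tree features per split:", bagged.estimators_[0].max_features_)
print("Forest tree features per split:", forest.estimators_[0].max_features_)
```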

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [13]:
from sklearn.metrics import mean_squared_error

In [14]:
def rmse_score(model, X_train, X_test, y_train, y_test):
    mse_train = mean_squared_error(y_true = y_train,
                                  y_pred = model.predict(X_train))
    mse_test = mean_squared_error(y_true = y_test,
                                  y_pred = model.predict(X_test))
    rmse_train = mse_train ** 0.5
    rmse_test = mse_test ** 0.5
    
    print("The training RMSE for " + str(model) + " is: " + str(rmse_train))
    print("The testing RMSE for " + str(model) + " is: " + str(rmse_test))
    return (rmse_train, rmse_test)

In [15]:
rmse_score(linear_reg, X_train, X_test, y_train, y_test)

The training RMSE for LinearRegression() is: 20.164244947447397
The testing RMSE for LinearRegression() is: 20.897416610818784


(20.164244947447397, 20.897416610818784)

In [17]:
rmse_score(dt_reg, X_train, X_test, y_train, y_test)

The training RMSE for DecisionTreeRegressor() is: 2.2638130048030134
The testing RMSE for DecisionTreeRegressor() is: 27.223114831640345


(2.2638130048030134, 27.223114831640345)

In [18]:
rmse_score(bag_reg, X_train, X_test, y_train, y_test)

The training RMSE for BaggingRegressor() is: 8.77259615341449
The testing RMSE for BaggingRegressor() is: 21.189514901954276


(8.77259615341449, 21.189514901954276)

In [19]:
rmse_score(rf_reg, X_train, X_test, y_train, y_test)

The training RMSE for RandomForestRegressor() is: 7.786319107886259
The testing RMSE for RandomForestRegressor() is: 20.258773929676256


(7.786319107886259, 20.258773929676256)

In [20]:
rmse_score(ab_reg, X_train, X_test, y_train, y_test)

The training RMSE for AdaBoostRegressor() is: 21.286674118383086
The testing RMSE for AdaBoostRegressor() is: 22.381383635759548


(21.286674118383086, 22.381383635759548)

In [21]:
rmse_score(sv_reg, X_train, X_test, y_train, y_test)

The training RMSE for SVR() is: 19.791523022677538
The testing RMSE for SVR() is: 20.515564669076976


(19.791523022677538, 20.515564669076976)

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

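|           Model          | Training RMSE | Testing RMSE |
|:------------------------:|:-------------:|:------------:|
|     Linear Regression    |     20.164    |    20.897    |
|       Decision Tree      |      2.264    |    27.223    |
|   Bagged Decision Trees  |      8.773    |    21.190    |
|       Random Forest      |      7.786    |    20.259    |
|         AdaBoost         |     21.287    |    22.381    |
| Support Vector Regressor |     19.792    |    20.516    |

Every model has a higher testing RMSE than training RMSE, but the size of the gap differs greatly. The decision tree, bagged decision trees, and random forest show clear evidence of overfitting: their training RMSE is far below their testing RMSE. Linear regression, AdaBoost, and SVR have gaps of roughly 1 or less, so there is little evidence of overfitting for those models.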

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

Linear regression, AdaBoost, and SVR have testing RMSE scores close to their training RMSE scores, which indicates that they generalise better than the tree-based models. Of these three, SVR has the lowest testing RMSE, so I would pick SVR and then refine it with hyperparameter tuning.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1. Tune models using GridSearch.
2. Consider interaction terms in the feature engineering stage.
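
As a sketch of the first idea, the block below runs a small grid search with `GridSearchCV` over a k-nearest neighbors pipeline on synthetic data. The grid values and the choice of kNN are illustrative assumptions, not tuned recommendations, and the data is a stand-in for the lab's `X` and `y`.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for illustration.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsRegressor())])

# Hypothetical grid; keys use the "<step name>__<parameter>" convention.
param_grid = {
    "knn__n_neighbors": [3, 5, 11, 21],
    "knn__weights": ["uniform", "distance"],
}

# Scores are negated RMSE, so larger (closer to zero) is better.
search = GridSearchCV(pipe, param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(Xtr, ytr)

print("Best parameters:", search.best_params_)
print("Best cross-validated RMSE:", -search.best_score_)
```

The search uses only the training split, so the held-out test set remains available for a final, unbiased check of the chosen model.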

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

`p401k` leaks information about the target. Anyone who participates in a 401(k) must be eligible for one, so `p401k = 1` implies `e401k = 1`. A model that uses `p401k` would lean heavily on this near-tautology and tell us little about people who are eligible but not yet participating, who are the potential customers we want to find.
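
One quick way to check for this kind of leakage is to cross-tabulate the two columns. The sketch below uses a small made-up frame (`toy`, not the lab's `df`) in which every participant is also eligible. On the real data the analogous call would be `pd.crosstab(df["p401k"], df["e401k"])`.

```python
import pandas as pd

# Made-up rows for illustration: every participant (p401k = 1) is eligible (e401k = 1).
toy = pd.DataFrame({
    "e401k": [1, 1, 1, 0, 0, 1],
    "p401k": [1, 0, 1, 0, 0, 1],
})

# Rows are p401k values, columns are e401k values.
table = pd.crosstab(toy["p401k"], toy["e401k"])
print(table)

# Count of participants who are NOT eligible; zero signals that p401k reveals the target.
participants_not_eligible = table.loc[1, 0]
print("Participants who are not eligible:", participants_not_eligible)
```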

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

| Model                | Explanation                                                        | Answer |
|----------------------|--------------------------------------------------------------------|--------|
| Linear Regression    | A linear regression model predicts a continuous value, so it is not suited to predicting a binary class label. |No|
| Logistic Regression  | A logistic regression model estimates the probability of the positive class as a function of the input features. |Yes|
| Decision Trees       | Decision trees recursively split the feature space into regions, making binary decisions at each node to classify the target variable. |Yes|
| Random Forests       | Random forests are ensemble methods that aggregate the predictions of multiple decision trees, providing better predictive performance and robustness. |Yes|
| Support Vector Machines | Support Vector Machines find the hyperplane that maximizes the margin between classes in the feature space, making them effective for classification tasks. |Yes|
| kNN                  | k-Nearest Neighbors classifies a data point based on the majority class of its k nearest neighbors. |Yes|
| Extremely Randomized Trees | Extremely Randomized Trees is an ensemble learning method that builds multiple decision trees using randomly selected features and split points, increasing diversity and reducing overfitting. |Yes|
| Bagging Classifier   | Bagging Classifier aggregates predictions from multiple base classifiers trained on bootstrap samples of the training data, reducing variance and improving performance. |Yes|
| AdaBoost             | AdaBoost sequentially combines weak learners to create a strong learner, with each new learner focusing on the mistakes of the previous ones, resulting in improved predictive performance. |Yes|
| XGBoost              | XGBoost is an optimized gradient boosting framework that uses a tree-based ensemble method, providing high performance and scalability for regression and classification tasks. |Yes|

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an AdaBoost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

False positives are people the model predicts are eligible for a 401(k) who are actually not eligible. False negatives are people the model predicts are not eligible who actually are eligible.

In [23]:
X = df.drop(columns=["e401k", "p401k"])
y = df["e401k"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = 0.2,
    random_state = 42
)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [24]:
np.random.seed(42)

In [25]:
lr_class = LogisticRegression()
lr_class.fit(X_train, y_train)

knn_class = KNeighborsClassifier()
knn_class.fit(X_train, y_train)

dt_class = DecisionTreeClassifier()
dt_class.fit(X_train, y_train)

bag_class = BaggingClassifier()
bag_class.fit(X_train, y_train)

rf_class = RandomForestClassifier()
rf_class.fit(X_train, y_train)

ab_class = AdaBoostClassifier()
ab_class.fit(X_train, y_train)

sv_class = SVC()
sv_class.fit(X_train, y_train)

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

Minimise false positives. A false positive means we predict that someone is eligible for a 401(k) when they are not, so any marketing effort directed at that person is wasted.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

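Maximise specificity, the true negative rate TN / (TN + FP), since it rises as false positives fall. Precision, TP / (TP + FP), is another metric that rewards having fewer false positives.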
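
To show how false positives enter these metrics, here is a small sketch that computes specificity and precision from a confusion matrix. The labels are made up for illustration (1 = eligible, 0 = not eligible) and do not come from the lab's models.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Toy labels for illustration only: 1 = eligible, 0 = not eligible.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

# For binary labels, ravel() returns counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)          # share of actual negatives predicted negative
precision = precision_score(y_true, y_pred)  # share of predicted positives that are correct

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Specificity: {specificity:.3f}")
print(f"Precision:   {precision:.3f}")
```

Both metrics have FP in the denominator, so reducing false positives raises each of them.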

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

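F1-score is the harmonic mean of precision and recall, which can be written as F1 = 2TP / (2TP + FP + FN). Both false positives and false negatives appear in the denominator, so an increase in either one lowers the score when the number of true positives is held constant. This makes F1-score appropriate when we want to penalise both kinds of error rather than favour one.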
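
The relationship between the counts and the F1-score can be checked numerically. The sketch below uses made-up labels (not the lab's predictions) and computes F1 three ways: from the raw counts, as the harmonic mean of precision and recall, and with scikit-learn's `f1_score`.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels for illustration only: 1 = eligible, 0 = not eligible.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Two equivalent forms of the F1-score.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)
f1_harmonic = 2 * precision * recall / (precision + recall)

# scikit-learn's implementation, for comparison.
f1_sklearn = f1_score(y_true, y_pred)

print(f"F1 from counts:        {f1_from_counts:.4f}")
print(f"F1 as harmonic mean:   {f1_harmonic:.4f}")
print(f"F1 from scikit-learn:  {f1_sklearn:.4f}")
```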

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [26]:
from sklearn.metrics import f1_score

In [27]:
def f1_scorer(model, X_train, X_test, y_train, y_test):
    f1_train = f1_score(y_true = y_train,
                        y_pred = model.predict(X_train))
    f1_test = f1_score(y_true = y_test,
                       y_pred = model.predict(X_test))
    
    print("The training F1-score for " + str(model.__class__.__name__) + " is: " + str(f1_train))
    print("The testing F1-score for " + str(model.__class__.__name__) + " is: " + str(f1_test))
    print()

In [29]:
f1_scorer(lr_class, X_train, X_test, y_train, y_test)
f1_scorer(knn_class, X_train, X_test, y_train, y_test)
f1_scorer(dt_class, X_train, X_test, y_train, y_test)
f1_scorer(bag_class, X_train, X_test, y_train, y_test)
f1_scorer(rf_class, X_train, X_test, y_train, y_test)
f1_scorer(ab_class, X_train, X_test, y_train, y_test)
f1_scorer(sv_class, X_train, X_test, y_train, y_test)

The training F1-score for LogisticRegression is: 0.4727870199219552
The testing F1-score for LogisticRegression is: 0.4773869346733668



The training F1-score for KNeighborsClassifier is: 0.6514866390666164
The testing F1-score for KNeighborsClassifier is: 0.49661399548532736

The training F1-score for DecisionTreeClassifier is: 1.0
The testing F1-score for DecisionTreeClassifier is: 0.4705882352941176

The training F1-score for BaggingClassifier is: 0.9728878782578274
The testing F1-score for BaggingClassifier is: 0.4885145482388974

The training F1-score for RandomForestClassifier is: 1.0
The testing F1-score for RandomForestClassifier is: 0.5462753950338599

The training F1-score for AdaBoostClassifier is: 0.5621436716077537
The testing F1-score for AdaBoostClassifier is: 0.5688487584650113

The training F1-score for SVC is: 0.4707470747074707
The testing F1-score for SVC is: 0.45347786811201446



##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

|           Model          | Training F1 | Testing F1 |
|:------------------------:|:----------:|:---------:|
|     Logistic Regression  |    0.473   |    0.477  |
|    k-Nearest Neighbors   |    0.651   |    0.497  |
|       Decision Tree      |    1.000   |    0.471  |
|   Bagged Decision Trees  |    0.973   |    0.489  |
|       Random Forest      |    1.000   |    0.546  |
|         AdaBoost         |    0.562   |    0.569  |
| Support Vector Classifier |    0.471   |    0.453  |

Since a higher F1-score is better, a model is overfit when its testing F1-score is well below its training F1-score.

This is clearly the case for kNN, the decision tree, bagged decision trees, and the random forest. The support vector classifier shows only a small drop, and logistic regression and AdaBoost score about the same on both sets.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

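AdaBoost. It has the highest testing F1-score of all the models, and its training and testing scores are nearly identical, so there is no sign of overfitting. Logistic regression and the support vector classifier also have small gaps, but their testing F1-scores are lower.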

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1. Tune models using GridSearch.
2. Consider interaction terms in the feature engineering stage.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.