<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Supervised Learning Model Comparison

---

### Let us begin...

Recall the `data science process`.
   1. Define the problem.
   2. Gather the data.
   3. Explore the data.
   4. Model the data.
   5. Evaluate the model.
   6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and hopefully has a lot more money in it when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

#### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. 

#### When predicting `e401k`, you may use the entire dataframe if you wish.

In [3]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import root_mean_squared_error, make_scorer, f1_score

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, \
BaggingClassifier, RandomForestClassifier, AdaBoostClassifier

### Step 2: Gather the data.

##### 1. Read in the data.

In [5]:
d401k_df = pd.read_csv('401ksubs.csv')

In [6]:
# review sample
d401k_df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


In [7]:
# check for null
d401k_df.isnull().sum()

e401k     0
inc       0
marr      0
male      0
age       0
fsize     0
nettfa    0
p401k     0
pira      0
incsq     0
agesq     0
dtype: int64

In [8]:
d401k_df.describe()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
count,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0
mean,0.392129,39.254641,0.628571,0.20442,41.080216,2.885067,19.071675,0.276226,0.25434,2121.192483,1793.652722
std,0.488252,24.090002,0.483213,0.403299,10.299517,1.525835,63.963838,0.447154,0.435513,3001.469424,895.648841
min,0.0,10.008,0.0,0.0,25.0,1.0,-502.302,0.0,0.0,100.1601,625.0
25%,0.0,21.66,0.0,0.0,33.0,2.0,-0.5,0.0,0.0,469.1556,1089.0
50%,0.0,33.288,1.0,0.0,40.0,3.0,2.0,0.0,0.0,1108.091,1600.0
75%,1.0,50.16,1.0,0.0,48.0,4.0,18.4495,1.0,1.0,2516.0255,2304.0
max,1.0,199.041,1.0,1.0,64.0,13.0,1536.798,1.0,1.0,39617.32,4096.0


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

- Debt: Total debt (loans, credit card debt)
- Savings Rate: Percentage of income saved each year
- Risk Tolerance: Individual's willingness to take on investment risk.

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

It could lead to discriminatory practices, as certain racial groups might be excluded or targeted disproportionately, regardless of their actual financial needs or potential.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

Calculate income by taking the square root of incsq, which is derived from inc squared.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs (Subject Matter Experts) might have done this!

- `incsq` is income ^ 2
- `agesq` is age ^ 2

Subject Matter Experts might have created these squared terms to capture non-linear relationships.

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

- `inc` it should be `income` not income ^ 2 as dictionary say.
- `age` it should be `age` not age ^ 2 as dictionary say.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

- Linear Regression: Suitable for linear relationships
- Gradient Boosting: It can handles complex relationships
- Gradient Descent: Would be used to iteratively adjust the model minimize the difference between predicted income and actual income

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [22]:
# initialize values

X = d401k_df[['marr','male','age','agesq','fsize','nettfa']]
y = d401k_df['inc']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [23]:
# scale value
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

In [24]:
test_models = {
    'LR': LinearRegression(),
    'KNN': KNeighborsRegressor(),
    'DCT': DecisionTreeRegressor(),
    'BAG': BaggingRegressor(),
    'RF': RandomForestRegressor(),
    'ADA': AdaBoostRegressor(),
}

search_params = {
    'LR': {},
    'KNN': {'n_neighbors': [1, 3, 5, 7]},
    'DCT': {'max_depth': [None, 10, 15, 30, 50], 'min_samples_split': [0.001, 0.01, 0.1, 1.0], 'min_samples_leaf': [1, 2, 4, 8]},
    'BAG': {'n_estimators': [10, 15, 30, 50, 100]},
    'RF': {'n_estimators': [10, 15, 30, 50, 100], 'max_depth': [None, 5, 10, 15]},
    'ADA': {'n_estimators': [10, 15, 30, 50, 100], 'learning_rate': [0.001, 0.01, 0.1, 1.0]},
}

In [25]:
# testing model, evaluate with R^2
for name, model in test_models.items():
    grid = GridSearchCV(estimator=model, param_grid=search_params[name], scoring='r2', cv=5, n_jobs=8)
    grid.fit(X_train_sc, y_train)
    print(f"Best Estimator of {model} is {grid.best_estimator_}")
    print(f"Best Score of {model} is {grid.best_score_:.8f}")
    print(f"Training Score of {model} is {grid.score(X_train_sc, y_train):.8f}")
    print(f"Testing Score of {model} is {grid.score(X_test_sc, y_test):.8f}")
    print(f"Difference Training/Testing Score of {model} is {grid.score(X_train_sc, y_train) - grid.score(X_test_sc, y_test):.8f}")
    print('-'*80)

Best Estimator of LinearRegression() is LinearRegression()
Best Score of LinearRegression() is 0.28311732
Training Score of LinearRegression() is 0.29257484
Testing Score of LinearRegression() is 0.27494825
Difference Training/Testing Score of LinearRegression() is 0.01762659
--------------------------------------------------------------------------------
Best Estimator of KNeighborsRegressor() is KNeighborsRegressor(n_neighbors=7)
Best Score of KNeighborsRegressor() is 0.30340663
Training Score of KNeighborsRegressor() is 0.48741439
Testing Score of KNeighborsRegressor() is 0.33819723
Difference Training/Testing Score of KNeighborsRegressor() is 0.14921716
--------------------------------------------------------------------------------
Best Estimator of DecisionTreeRegressor() is DecisionTreeRegressor(min_samples_leaf=2, min_samples_split=0.1)
Best Score of DecisionTreeRegressor() is 0.37222742
Training Score of DecisionTreeRegressor() is 0.38579921
Testing Score of DecisionTreeRegres

In [26]:
# Best model is RandomForestRegressor(max_depth=5)
# Display the relative importance of each feature in the model
model = RandomForestRegressor(max_depth=5)
model.fit(X_train_sc, y_train)
importance_df = pd.DataFrame({'Feature': X_train.columns,
                              'Importance': model.feature_importances_}) \
                            .sort_values(by='Importance', ascending=False)
importance_df
# Nettfa is the feature with the highest predictive power.

Unnamed: 0,Feature,Importance
5,nettfa,0.702537
0,marr,0.235889
3,agesq,0.027085
2,age,0.025699
1,male,0.005446
4,fsize,0.003344


##### 9. What is bootstrapping?

The bootstrap method is a resampling technique that involves repeatedly drawing samples from a data set with replacement.

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

- Decision trees are single models that can be lead to overfitting.
- Bagged decision trees are ensembles of multiple decision trees that can reduce overfitting and improve performance.

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

Bagging and random forest are both ensemble techniques that use decision trees.
- Bagging reduces variance by averaging multiple trees.
- Random forest further improves performance by introducing feature randomness.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

Random forests are generally superior to bagged decision trees because they introduce additional randomness by considering only a subset of features at each split, further reducing correlation between trees and improving 
performance.

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [36]:
# testing model, evaluate with (neg)RMSE
for name, model in test_models.items():
    grid = GridSearchCV(estimator=model, param_grid=search_params[name], scoring='neg_root_mean_squared_error', cv=5, n_jobs=8)
    grid.fit(X_train_sc, y_train)
    print(f"Best Estimator of {model} is {grid.best_estimator_}")
    print(f"Best Score of {model} is {abs(grid.best_score_)}")
    print(f"Training Score of {model} is {abs(grid.score(X_train_sc, y_train))}")
    print(f"Testing Score of {model} is {abs(grid.score(X_test_sc, y_test))}")
    print(f"Difference Training/Testing Score of {model} is {abs(grid.score(X_train_sc, y_train)) - abs(grid.score(X_test_sc, y_test))}")
    print('-'*80)

Best Estimator of LinearRegression() is LinearRegression()
Best Score of LinearRegression() is 20.28457138101961
Training Score of LinearRegression() is 20.164244947447397
Testing Score of LinearRegression() is 20.897416610818777
Difference Training/Testing Score of LinearRegression() is -0.7331716633713796
--------------------------------------------------------------------------------
Best Estimator of KNeighborsRegressor() is KNeighborsRegressor(n_neighbors=7)
Best Score of KNeighborsRegressor() is 19.999444960091047
Training Score of KNeighborsRegressor() is 17.16425342367413
Testing Score of KNeighborsRegressor() is 19.96514133648189
Difference Training/Testing Score of KNeighborsRegressor() is -2.8008879128077595
--------------------------------------------------------------------------------
Best Estimator of DecisionTreeRegressor() is DecisionTreeRegressor(min_samples_split=0.1)
Best Score of DecisionTreeRegressor() is 18.983492502580713
Training Score of DecisionTreeRegressor(

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

It doesn't overfit except KNeighborsRegressor and BaggingRegressor, as the RMSE remains relatively the same.

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

DecisionTreeRegressor is a suitable model due to its strong performance and lack of overfitting.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

- Conduct research into potential new features
- Refine hyperparameters through GridSearch
- Perform feature engineering and selection

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

- It trend to overfit our model

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

- Logistic Regression is well-suited for binary classification tasks.
- KNN is a non-parametric algorithm but can be sensitive to noise and outliers.
- Decision Trees can handle both categorical and numerical data and are robust to outliers.
- Random Forests can mitigate overfitting through feature sampling.
- AdaBoost provides insights into feature importance.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [48]:
test_models = {
    'LR': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'DCT': DecisionTreeClassifier(),
    'BAG': BaggingClassifier(),
    'RF': RandomForestClassifier(),
    'ADA': AdaBoostClassifier()
}

search_params = {
    'LR': {'solver': ['lbfgs', 'liblinear'], 'C': [0.001, 0.01, 0.1, 1]},
    'KNN': {'n_neighbors': [1, 3, 5, 7]},
    'DCT': {'max_depth': [None, 10, 20, 30]},
    'BAG': {'n_estimators': [10, 15, 30, 50]},
    'RF': {'n_estimators': [10, 15, 30, 50], 'max_depth': [None, 5, 15, 30]},
    'ADA': {'n_estimators': [10, 15, 30, 50], 'learning_rate': [0.001, 0.01, 0.1, 1], 'algorithm': ['SAMME']},
}

## Step 5: Evaluate the model. (Part 2: Classfication Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

- False Positive: Predicts that an individual is eligible for 401(k), when in reality they are not.
- False Negative: Predicts that an individual is ineligible for 401(k), when in reality they are eligible.

In [51]:
# initialize values
X = d401k_df.drop(columns=['e401k','p401k'])
y = d401k_df['e401k']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [52]:
# scale values
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

In [53]:
# find the baseline
y.value_counts(normalize=True).mul(100).round(2)

e401k
0    60.79
1    39.21
Name: proportion, dtype: float64

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

- Minimizing false negatives is more desirable.
- Result in an individual missing out on valuable retirement savings opportunities, potentially leading to financial hardship in their later years.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

- Specificity measures the proportion of actual negatives that are correctly identified as negative

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

F1 provides a balanced measure of both false positives and false negatives.

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [61]:
# testing model with F1 score
for name, model in test_models.items():
    grid = GridSearchCV(estimator=model, param_grid=search_params[name], scoring='f1', cv=5, n_jobs=8)
    grid.fit(X_train_sc, y_train)
    
    best_model = grid.best_estimator_
    best_model.fit(X_train_sc, y_train)
    y_train_pred = best_model.predict(X_train_sc)
    y_test_pred = best_model.predict(X_test_sc)
    print(f"Best Estimator of {model} is {grid.best_estimator_}")
    print(f"Best Score of {model} is {grid.best_score_:.8f}")
    print(f"Training F1-Score of {model} is {best_model.score(X_train_sc, y_train):.8f}")
    print(f"Testing F1-Score of {model} is {best_model.score(X_test_sc, y_test):.8f}")
    print(f"Difference Training/Testing F1-Score of {model} is {best_model.score(X_train_sc, y_train) - best_model.score(X_test_sc, y_test):.8f}")
    print('-'*80)

Best Estimator of LogisticRegression() is LogisticRegression(C=1, solver='liblinear')
Best Score of LogisticRegression() is 0.47012010
Training F1-Score of LogisticRegression() is 0.65404313
Testing F1-Score of LogisticRegression() is 0.66361186
Difference Training/Testing F1-Score of LogisticRegression() is -0.00956873
--------------------------------------------------------------------------------
Best Estimator of KNeighborsClassifier() is KNeighborsClassifier(n_neighbors=7)
Best Score of KNeighborsClassifier() is 0.47906587
Training F1-Score of KNeighborsClassifier() is 0.73045822
Testing F1-Score of KNeighborsClassifier() is 0.63665768
Difference Training/Testing F1-Score of KNeighborsClassifier() is 0.09380054
--------------------------------------------------------------------------------
Best Estimator of DecisionTreeClassifier() is DecisionTreeClassifier(max_depth=10)
Best Score of DecisionTreeClassifier() is 0.53168399
Training F1-Score of DecisionTreeClassifier() is 0.767924

In [62]:
# Best model is AdaBoostClassifier(algorithm='SAMME', learning_rate=0.01, n_estimators=15)
# Display the relative importance of each feature in the model
model = AdaBoostClassifier(algorithm='SAMME', learning_rate=0.01, n_estimators=15)
model.fit(X_train_sc, y_train)
importance_df = pd.DataFrame({'Feature': X_train.columns,
                              'Importance': model.feature_importances_}) \
                            .sort_values(by='Importance', ascending=False)
importance_df
# Nettfa is the feature with the highest predictive power.

Unnamed: 0,Feature,Importance
5,nettfa,1.0
0,inc,0.0
1,marr,0.0
2,male,0.0
3,age,0.0
4,fsize,0.0
6,pira,0.0
7,incsq,0.0
8,agesq,0.0


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

BaggingClassifier, which tends to overfit as evidenced by a 0.35 difference between training and testing F1-scores.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

RandomForestClassifier, as it performs well for both regression and classification tasks.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

- Conduct research into potential new features
- Refine hyperparameters through GridSearch
- Perform feature engineering and selection

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

* Regression: RandomForestRegressor achieved the highest R^2 score
  * The model's performance is suboptimal, with an R^2 of 0.39 and a significant difference of 0.04 between training and testing scores, suggesting overfitting.
  * `nettfa` is the feature with the highest predictive power for the income.
* Classification: AdaBoostClassifier achieved the highest F1 score
  * `nettfa` is the feature with the highest predictive power for the `e401k`.