<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Supervised Learning Model Comparison

_author The arbitrary and capricious heart of data science_

---

### Let us begin...

Recall the "data science process."
   1. Define the problem.
   2. Gather the data.
   3. Explore the data.
   4. Model the data.
   5. Evaluate the model.
   6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, so the account will hopefully hold a lot more money by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:
# Imports
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Modeling, preprocessing, and metrics from scikit-learn
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              RandomForestRegressor, AdaBoostClassifier,
                              BaggingClassifier, RandomForestClassifier)
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             RocCurveDisplay, accuracy_score, roc_auc_score,
                             recall_score, precision_score, f1_score,
                             classification_report, mean_squared_error, r2_score)

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Open dataset
df = pd.read_csv("401ksubs.csv")

In [3]:
# View top 5 rows 
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


In [4]:
# Check the number of rows and columns
# The sample is large (9,275 rows),
# so I'm comfortable with an 80/20 train/test split
df.shape

(9275, 11)

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

Other variables that would be helpful:
- Education level and self-employment status.
- Personal investments such as stocks, bonds, and property. These show that someone has some investing knowledge.
- Ownership of major assets such as a house, condo, or car. This is a good indicator that someone has strong financial health and is able to hold on to money.

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

This would be unethical because deciding who sees financial advertising based on race is discriminatory. In addition, if one racial group shows lower income simply because of how the data was gathered, the model will carry an inherent bias against that group.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

I wouldn't use the `incsq` feature to predict income (`inc`) because `incsq` is simply `inc` squared. If we include it, the model can recover the answer directly from the feature. We want to predict income, not hand the model the answer. Per the note above, we also set aside `e401k`, `p401k`, and `pira`.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

In [5]:
df[["age", "agesq","inc","incsq"]].head()

Unnamed: 0,age,agesq,inc,incsq
0,40,1600,13.17,173.4489
1,35,1225,61.23,3749.113
2,44,1936,12.858,165.3282
3,44,1936,98.88,9777.254
4,53,2809,22.614,511.393


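The features `incsq` and `agesq`. Squared terms let a linear model capture curved (non-linear) relationships. For example, SMEs may expect income to rise with age and then level off or fall near retirement, which a straight line in `age` alone cannot represent.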

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

- `inc` is described as `inc^2` in the data dictionary, but it should refer to one's income.
- `age` is described as `age^2` in the data dictionary, but it should refer to one's age.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

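List of all modeling tactics. Because the question is which features best predict income, a tactic is most useful here if it can both predict `inc` and report how much each feature contributes.
 - a multiple linear regression model (Yes: it predicts income, and its coefficients show each feature's contribution)
 - a k-nearest neighbors model (No: it can predict income, but it provides no coefficients or feature importances, so it cannot tell us which features matter)
 - a decision tree (Yes: it predicts income and reports feature importances)
 - a set of bagged decision trees (Partly: it predicts income, but scikit-learn's `BaggingRegressor` does not expose feature importances directly)
 - a random forest (Yes: it predicts income and reports feature importances, like a decision tree)
 - an Adaboost model (Yes: it predicts income and reports feature importances, like a decision tree)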

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [6]:
# Check missing value
df.isnull().sum()

e401k     0
inc       0
marr      0
male      0
age       0
fsize     0
nettfa    0
p401k     0
pira      0
incsq     0
agesq     0
dtype: int64

In [7]:
# Check types 
# Not have object type
df.dtypes

e401k       int64
inc       float64
marr        int64
male        int64
age         int64
fsize       int64
nettfa    float64
p401k       int64
pira        int64
incsq     float64
agesq       int64
dtype: object

In [8]:
# Regression: What features best predict one's income?
# Drop incsq because it is derived directly from income
# Drop the 401k/IRA eligibility and participation flags, per the note above
# Define X & y
X = df.drop(columns=["inc","incsq", "e401k", "p401k","pira"])
y = df["inc"]

In [9]:
# Check
X.head()

Unnamed: 0,marr,male,age,fsize,nettfa,agesq
0,0,0,40,1,4.575,1600
1,0,1,35,1,154.0,1225
2,1,0,44,2,0.0,1936
3,1,1,44,2,21.8,1936
4,0,0,53,1,18.45,2809


In [10]:
# Split into training & testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size =0.2,
                                                    random_state=42)

In [11]:
# KNN is distance-based, so the features must be on the same scale
# Scaling also keeps a later GridSearch over hyperparameters comparable across models
# I decided to use the scaled features for all of my models
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
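
The scaler is fit on the training split only, and the learned mean and standard deviation are then reused on the test split. A minimal pure-Python sketch of that idea follows. The toy ages and the helper names `fit_scaler` and `transform` are made up for illustration and are not part of the lab.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Apply a previously learned mean and standard deviation to any split."""
    return [(v - mu) / sigma for v in values]

# Toy ages, invented for this example
train_age = [25, 35, 45, 55]
test_age = [30, 60]

mu, sigma = fit_scaler(train_age)               # statistics come from train only
train_scaled = transform(train_age, mu, sigma)  # mean 0, std 1 by construction
test_scaled = transform(test_age, mu, sigma)    # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Fitting on the training data alone keeps information about the test set from leaking into preprocessing, which is why the cell above calls `fit_transform` on `X_train` but only `transform` on `X_test`.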

In [12]:
# Instantiate & fit our models

# Linear regression
lr = LinearRegression()
lr.fit(X_train_sc, y_train)

# k-nearest neighbors model
knn = KNeighborsRegressor()
knn.fit(X_train_sc, y_train)

# Decision tree
dtr = DecisionTreeRegressor()
dtr.fit(X_train_sc, y_train)

# Set of bagged decision trees
bag = BaggingRegressor()
bag.fit(X_train_sc, y_train)

# Random forest
rdf = RandomForestRegressor()
rdf.fit(X_train_sc, y_train)

# Adaboost model
ada = AdaBoostRegressor()
ada.fit(X_train_sc, y_train)

In [13]:
# List of our models
model_list_reg = [lr, knn, dtr, bag, rdf, ada]
feature_names = list(X.columns)

# Columns of the summary DataFrame: model name, R^2 scores, one column per feature
model_table_dict = {'model': [], 'R2_training': [], 'R2_testing': []}
model_table_dict.update({name: [] for name in feature_names})

# Build the summary of our models
# Show R^2 of each model on the training and testing sets
# Show the coefficients or feature importances of each model, where available
for model in model_list_reg:
    model_table_dict['model'].append(str(model).split("()")[0])
    model_table_dict['R2_training'].append(round(model.score(X_train_sc, y_train), 3))
    model_table_dict['R2_testing'].append(round(model.score(X_test_sc, y_test), 3))

    # Linear regression exposes coefficients through coef_
    if hasattr(model, 'coef_'):
        weights = model.coef_
    # Decision tree, random forest, and AdaBoost expose feature_importances_
    elif hasattr(model, 'feature_importances_'):
        weights = model.feature_importances_
    # KNN and bagging expose neither attribute
    else:
        weights = ["-"] * len(feature_names)

    for name, weight in zip(feature_names, weights):
        model_table_dict[name].append(weight)

model_df = pd.DataFrame(model_table_dict)
model_df

Unnamed: 0,model,R2_training,R2_testing,marr,male,age,fsize,nettfa,agesq
0,LinearRegression,0.293,0.275,10.297189,1.221685,31.549951,-3.277678,8.153784,-31.2624
1,KNeighborsRegressor,0.526,0.324,-,-,-,-,-,-
2,DecisionTreeRegressor,0.991,-0.236,0.104979,0.017113,0.103982,0.065292,0.609194,0.09944
3,BaggingRegressor,0.863,0.286,-,-,-,-,-,-
4,RandomForestRegressor,0.896,0.316,0.104379,0.021274,0.099212,0.068824,0.606589,0.099723
5,AdaBoostRegressor,0.261,0.224,0.157077,0.023726,0.052208,0.009793,0.712831,0.044364


In [14]:
# LinearRegression model
# Not overfit
# R^2 for training and testing is very similar
# Largest coefficient: age (note that agesq is almost equal and opposite, so read the two together)
model_df.iloc[0]

model          LinearRegression
R2_training               0.293
R2_testing                0.275
marr                  10.297189
male                   1.221685
age                   31.549951
fsize                 -3.277678
nettfa                 8.153784
agesq                  -31.2624
Name: 0, dtype: object

In [15]:
# KNeighborsRegressor model
# Somewhat overfit
# R^2 on training is higher than on testing
model_df.iloc[1]

model          KNeighborsRegressor
R2_training                  0.526
R2_testing                   0.324
marr                             -
male                             -
age                              -
fsize                            -
nettfa                           -
agesq                            -
Name: 1, dtype: object

In [16]:
# DecisionTreeRegressor model
# Badly overfit!
# R^2 on training is far higher than on testing, and the testing R^2 is negative
# Most important feature: nettfa
model_df.iloc[2]

model          DecisionTreeRegressor
R2_training                    0.991
R2_testing                    -0.236
marr                        0.104979
male                        0.017113
age                         0.103982
fsize                       0.065292
nettfa                      0.609194
agesq                        0.09944
Name: 2, dtype: object
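
A negative testing R² can look surprising, so here is a small sketch of why it is possible. R² compares the model's squared error with the squared error of always predicting the mean, so any model that does worse than the mean scores below zero. The numbers and the helper name `r2_score_manual` are invented for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy targets and two sets of predictions (made up)
y_true = [10.0, 20.0, 30.0, 40.0]
mean_only = [25.0] * 4                 # always predict the mean
bad_preds = [40.0, 10.0, 45.0, 15.0]   # worse than predicting the mean

r2_mean = r2_score_manual(y_true, mean_only)
r2_bad = r2_score_manual(y_true, bad_preds)

print(r2_mean)  # 0.0
print(r2_bad)   # negative, because SS_res exceeds SS_tot
```

Read this way, the tree's testing R² of -0.236 says that its test predictions are less accurate than simply predicting the average income for everyone.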

In [17]:
# BaggingRegressor model
# Overfit
# R^2 on training is much higher than on testing
model_df.iloc[3]

model          BaggingRegressor
R2_training               0.863
R2_testing                0.286
marr                          -
male                          -
age                           -
fsize                         -
nettfa                        -
agesq                         -
Name: 3, dtype: object

In [18]:
# RandomForestRegressor model
# Overfit
# R^2 on training is much higher than on testing
# Most important feature: nettfa
model_df.iloc[4]

model          RandomForestRegressor
R2_training                    0.896
R2_testing                     0.316
marr                        0.104379
male                        0.021274
age                         0.099212
fsize                       0.068824
nettfa                      0.606589
agesq                       0.099723
Name: 4, dtype: object

In [19]:
# AdaBoostRegressor model
# Not overfit
# R^2 for training and testing is similar
# Most important feature: nettfa
model_df.iloc[5]

model          AdaBoostRegressor
R2_training                0.261
R2_testing                 0.224
marr                    0.157077
male                    0.023726
age                     0.052208
fsize                   0.009793
nettfa                  0.712831
agesq                   0.044364
Name: 5, dtype: object

##### 9. What is bootstrapping?

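- Bootstrapping is a statistical resampling technique used to quantify the uncertainty of an estimate or a model.
- A bootstrap sample is a random sample drawn with replacement from the original sample and of the same size. Because of the replacement, a bootstrap sample usually repeats some observations and omits others, so it is not an exact replica of the original. Repeating a calculation on many bootstrap samples shows how much the result varies.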
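
As a rough sketch of the idea, the snippet below bootstraps the mean of a small sample using only the standard library. The income values are toy numbers and `bootstrap_means` is a hypothetical helper, not something from the lab.

```python
import random
from statistics import mean, stdev

def bootstrap_means(data, n_resamples, seed=42):
    """Return the mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Sample with replacement, same size as the original sample
        resample = rng.choices(data, k=len(data))
        means.append(mean(resample))
    return means

# Toy incomes in thousands of dollars (made up for this example)
incomes = [13.2, 61.2, 12.9, 98.9, 22.6, 35.0, 47.5, 18.3]

boot = bootstrap_means(incomes, n_resamples=1000)

# The spread of the bootstrap means approximates the standard error of the mean
print(round(mean(incomes), 2), round(stdev(boot), 2))
```

The spread of `boot` is the uncertainty estimate: a wide spread means the sample mean would change a lot if the data had been collected again.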

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

- A single decision tree is one model fit once to the training data. Grown without limits, it tends to overfit, so its predictions have high variance.

- A set of bagged decision trees fits several trees (say B of them), each on a bootstrap sample of the training `rows` drawn randomly with replacement. Every tree still considers all of the features. As a result, we end up with an ensemble of different models, and the average of the predictions from the different trees is more robust than a single decision tree.

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

- A random forest builds on a set of bagged decision trees with one important extension: in addition to bootstrapping the `rows`, the algorithm considers only a random subset of the features at each split.
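
The two sources of randomness can be sketched with the standard library alone. This shows only the sampling mechanics, not a full tree implementation, and the helper names `bootstrap_row_indices` and `candidate_features` are hypothetical.

```python
import random

def bootstrap_row_indices(n_rows, rng):
    """Bagging and random forests: draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(feature_names, max_features, rng):
    """Random forest only: consider a random subset of features at a split."""
    return rng.sample(feature_names, max_features)

rng = random.Random(42)
features = ["marr", "male", "age", "fsize", "nettfa", "agesq"]

# Step shared by both methods: one bootstrap sample of rows per tree
rows = bootstrap_row_indices(10, rng)

# Extra random forest step: a fresh feature subset for each split
split_features = candidate_features(features, 2, rng)

print(rows)
print(split_features)
```

A bagged tree would skip the second step and evaluate all six features at every split.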

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

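- A random forest usually performs at least as well as a set of bagged decision trees, for both classification and regression. Bagged trees tend to be highly correlated because every tree can split on the same strong features. By considering only a random subset of features at each split, a random forest decorrelates its trees, so averaging them reduces variance further, usually at the cost of only a small increase in bias.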
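
A standard way to see the variance side of this tradeoff is the variance of an average of correlated predictors. The symbols here are introduced for illustration and are not from the lab: assume $B$ trees $T_b$, each with variance $\sigma^2$, and pairwise correlation $\rho$ between any two trees.

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees shrinks only the second term. The first term depends on $\rho$, so making the trees less correlated (which is what sampling features at each split aims to do) lowers the floor that averaging alone cannot get below.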

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [20]:
# Add RMSE columns to the dict behind the DataFrame
model_table_dict.update({'RMSE_training': [], 'RMSE_testing':[]})

# Compute RMSE for each model on the training and testing sets
# np.sqrt(MSE) is used because squared=False was removed from mean_squared_error in newer scikit-learn releases
for i in model_list_reg:
    y_pred_train = i.predict(X_train_sc)
    y_pred_test = i.predict(X_test_sc)
    model_table_dict['RMSE_training'].append(round(np.sqrt(mean_squared_error(y_train, y_pred_train)),2))
    model_table_dict['RMSE_testing'].append(round(np.sqrt(mean_squared_error(y_test, y_pred_test)),2))

model_df = pd.DataFrame(model_table_dict)
model_df

Unnamed: 0,model,R2_training,R2_testing,marr,male,age,fsize,nettfa,agesq,RMSE_training,RMSE_testing
0,LinearRegression,0.293,0.275,10.297189,1.221685,31.549951,-3.277678,8.153784,-31.2624,20.16,20.9
1,KNeighborsRegressor,0.526,0.324,-,-,-,-,-,-,16.5,20.18
2,DecisionTreeRegressor,0.991,-0.236,0.104979,0.017113,0.103982,0.065292,0.609194,0.09944,2.26,27.28
3,BaggingRegressor,0.863,0.286,-,-,-,-,-,-,8.88,20.73
4,RandomForestRegressor,0.896,0.316,0.104379,0.021274,0.099212,0.068824,0.606589,0.099723,7.73,20.3
5,AdaBoostRegressor,0.261,0.224,0.157077,0.023726,0.052208,0.009793,0.712831,0.044364,20.61,21.62
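
For reference, RMSE is the square root of the average squared error, so it is expressed in the same units as the target. A minimal hand-rolled version follows, with made-up numbers and a hypothetical helper named `rmse`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared residuals."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy targets and predictions (invented for illustration)
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

result = rmse(y_true, y_pred)
print(round(result, 2))  # sqrt(6.5), about 2.55
```

Assuming `inc` is recorded in thousands of dollars, as the data dictionary appears to indicate, a testing RMSE of about 20 means a typical prediction misses by roughly $20,000.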


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

In [21]:
# A lower RMSE means the model predicts income more accurately

In [22]:
# LinearRegression model
# Not overfit
# RMSE for training and testing is very similar
model_df.loc[0, ['model',"RMSE_training","RMSE_testing"]]

model            LinearRegression
RMSE_training               20.16
RMSE_testing                 20.9
Name: 0, dtype: object

In [23]:
# KNeighborsRegressor model
# Somewhat overfit
# RMSE on training is lower than on testing
model_df.loc[1, ['model',"RMSE_training","RMSE_testing"]]

model            KNeighborsRegressor
RMSE_training                   16.5
RMSE_testing                   20.18
Name: 1, dtype: object

In [24]:
# DecisionTreeRegressor model
# Badly overfit!
# RMSE on training is far lower than on testing
model_df.loc[2, ['model',"RMSE_training","RMSE_testing"]]

model            DecisionTreeRegressor
RMSE_training                     2.26
RMSE_testing                     27.28
Name: 2, dtype: object

In [25]:
# BaggingRegressor model
# Overfit!
# RMSE on training is much lower than on testing
model_df.loc[3, ['model',"RMSE_training","RMSE_testing"]]

model            BaggingRegressor
RMSE_training                8.88
RMSE_testing                20.73
Name: 3, dtype: object

In [26]:
# RandomForestRegressor model
# Overfit!
# RMSE on training is much lower than on testing
model_df.loc[4, ['model',"RMSE_training","RMSE_testing"]]

model            RandomForestRegressor
RMSE_training                     7.73
RMSE_testing                      20.3
Name: 4, dtype: object

In [27]:
# AdaBoostRegressor model
# Not overfit
# RMSE for training and testing is very similar
model_df.loc[5, ['model',"RMSE_training","RMSE_testing"]]

model            AdaBoostRegressor
RMSE_training                20.61
RMSE_testing                 21.62
Name: 5, dtype: object

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [28]:
# Pick the RandomForestRegressor as the final model
# It has one of the lowest testing RMSE scores and one of the highest testing R^2 scores (good prediction),
# even though it is overfit (high variance)
model_df.loc[4, ['model',"R2_training","R2_testing","RMSE_training","RMSE_testing"]]

model            RandomForestRegressor
R2_training                      0.896
R2_testing                       0.316
RMSE_training                     7.73
RMSE_testing                      20.3
Name: 4, dtype: object

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

- First, I would check for outliers in the dataset.
- Second, I would perform a grid search to find the best hyperparameters for this model.
- Third, I would try feature engineering, such as transforming income to a log scale.
- Lastly, I would collect more data.

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

If we include whether or not someone already participates in a 401k, then every person with a 401k must by definition be eligible for a 401k. However, this probably doesn't contribute to our understanding of what makes someone eligible for a 401k and will only confound what we're actually interested in studying.

In [29]:
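# Intended check: anyone who participates in a 401k must also be eligible for one
# Caution: duplicated() only flags repeated (e401k, p401k) pairs; it does not compare the two columns
# The ~0.03% "False" share is just the first occurrence of each of the 3 distinct pairs (3 / 9275)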
df[["e401k","p401k"]].duplicated().value_counts(normalize=True).mul(100)

True     99.967655
False     0.032345
dtype: float64
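
A more direct way to examine how `p401k` relates to `e401k` is to count each combination of the two flags. The sketch below does this with the standard library on a handful of made-up pairs. It does not use the lab's data, so the counts are illustrative only.

```python
from collections import Counter

# Toy (e401k, p401k) pairs, invented for illustration
pairs = [(1, 1), (1, 0), (0, 0), (1, 1), (0, 0), (0, 0)]
counts = Counter(pairs)

# Rows recorded as participating without being eligible
participants_not_eligible = counts[(0, 1)]

# Share of rows where the two flags take the same value
agreement = sum(n for (e, p), n in counts.items() if e == p) / len(pairs)

for combo in sorted(counts):
    print(combo, counts[combo])
print(participants_not_eligible, round(agreement, 3))
```

On the real dataframe, the same table could be produced with `pd.crosstab(df["e401k"], df["p401k"])`. If no row participates without being eligible, `p401k == 1` reveals the target outright.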

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

- a logistic regression model (Yes, we can predict whether or not one is eligible for a 401(k).)
- a $k$-nearest neighbors model (Yes, we can predict whether or not one is eligible for a 401(k).)
- a decision tree (Yes, we can predict whether or not one is eligible for a 401(k).)
- a set of bagged decision trees (Yes, we can predict whether or not one is eligible for a 401(k).)
- a random forest (Yes, we can predict whether or not one is eligible for a 401(k).)
- a set of extremely randomized trees (Yes, we can predict whether or not one is eligible for a 401(k).)
- an Adaboost model (Yes, we can predict whether or not one is eligible for a 401(k).)

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [30]:
# Define X & y
X = df.drop(columns=["e401k", "p401k"])
y = df["e401k"]

# Split into training & testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state=42)

# KNN is distance-based, so the features must be on the same scale
# Scaling also keeps a later GridSearch over hyperparameters comparable across models
# I decided to use the scaled features for all of my models
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

In [31]:
# Check shape
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((7420, 9), (1855, 9), (7420,), (1855,))

In [32]:
# Create a baseline score 
# baseline score is ~61%
y_train.value_counts(normalize=True).mul(100).round(2)

0    60.73
1    39.27
Name: e401k, dtype: float64

In [33]:
logreg = LogisticRegression()
logreg.fit(X_train_sc, y_train)

In [34]:
y_preds = logreg.predict(X_test_sc) 

In [35]:
y_preds

array([0, 0, 1, ..., 0, 0, 1], dtype=int64)

In [36]:
# List of our models
model_list_class = [LogisticRegression(),
                    KNeighborsClassifier(),
                    DecisionTreeClassifier(),
                    BaggingClassifier(),
                    RandomForestClassifier(),
                    AdaBoostClassifier()]

# Dict columns in DataFrame
model_table_dict = {'model':[], 'accuracy': [], 'true_negatives':[], 
                  'false_positives':[],'false_negatives':[],'true_positives':[], 
                  'F1_training':[], 'F1_testing':[]}

# Build the summary of our models
# Show the test-set accuracy and confusion matrix of each model
# Show the training and testing F1-score of each model
for i in model_list_class:
    # fit
    i.fit(X_train_sc, y_train)
    
    # predict
    y_train_pred = i.predict(X_train_sc)
    y_test_pred = i.predict(X_test_sc)
    
    # unpack the test-set confusion matrix once (order: tn, fp, fn, tp)
    tn, fp, fn, tp = confusion_matrix(y_test, y_test_pred).ravel()
    
    # add values to the dict
    model_table_dict['model'].append(str(i).split("()")[0])
    model_table_dict['accuracy'].append(accuracy_score(y_test, y_test_pred))
    model_table_dict['true_negatives'].append(tn)
    model_table_dict['false_positives'].append(fp)
    model_table_dict['false_negatives'].append(fn)
    model_table_dict['true_positives'].append(tp)
    model_table_dict['F1_training'].append(round(f1_score(y_train, y_train_pred),2))
    model_table_dict['F1_testing'].append(round(f1_score(y_test, y_test_pred),2))

# make Dataframe
model_df = pd.DataFrame(model_table_dict)
model_df

Unnamed: 0,model,accuracy,true_negatives,false_positives,false_negatives,true_positives,F1_training,F1_testing
0,LogisticRegression,0.663612,946,186,438,285,0.47,0.48
1,KNeighborsClassifier,0.639353,856,276,393,330,0.65,0.5
2,DecisionTreeClassifier,0.58221,739,393,382,341,1.0,0.47
3,BaggingClassifier,0.649057,877,255,396,327,0.97,0.5
4,RandomForestClassifier,0.659838,878,254,377,346,1.0,0.52
5,AdaBoostClassifier,0.691105,904,228,345,378,0.56,0.57


## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

- False positives are people we incorrectly predict to be eligible for a 401k.
- False negatives are people we incorrectly predict to be ineligible for a 401k.

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

- We want to predict who is eligible for a 401k.
- We would rather minimize false positives because we want to take on as little risk as we can.
- It would be bad for business if people who aren't eligible for a 401k were treated as though they could open one.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

- If I want to minimize false positives, then I want to optimize specificity.
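
Specificity is TN / (TN + FP), so it can be computed directly from the test-set confusion matrix counts in the model table above. This sketch copies those counts by hand. The `specificity` helper is written for this example.

```python
# (TN, FP, FN, TP) copied from the test-set results table above
confusion = {
    "LogisticRegression": (946, 186, 438, 285),
    "KNeighborsClassifier": (856, 276, 393, 330),
    "DecisionTreeClassifier": (739, 393, 382, 341),
    "BaggingClassifier": (877, 255, 396, 327),
    "RandomForestClassifier": (878, 254, 377, 346),
    "AdaBoostClassifier": (904, 228, 345, 378),
}

def specificity(tn, fp):
    """Share of truly ineligible people that the model labels ineligible."""
    return tn / (tn + fp)

spec = {name: round(specificity(tn, fp), 3)
        for name, (tn, fp, fn, tp) in confusion.items()}
best = max(spec, key=spec.get)

for name, value in spec.items():
    print(f"{name}: {value}")
print("Highest specificity:", best)
```

With these counts, logistic regression has the highest specificity (about 0.836), ahead of AdaBoost (about 0.799), so the ranking by specificity differs from the ranking by accuracy.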

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

- It balances our false positives and false negatives.
- Since $F_1 = \frac{2TP}{2TP + FP + FN}$, as either false positives or false negatives increase, the denominator increases while the numerator stays fixed, meaning our $F_1$-score decreases.
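
The count-based form of the score, F1 = 2TP / (2TP + FP + FN), can be checked against the testing F1 values in the model table. The counts below are copied from that table, and `f1_from_counts` is a helper written for this sketch.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written in terms of confusion matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

# Test-set counts copied from the model table
ada_f1 = f1_from_counts(tp=378, fp=228, fn=345)
logreg_f1 = f1_from_counts(tp=285, fp=186, fn=438)

print(round(ada_f1, 2))     # 0.57
print(round(logreg_f1, 2))  # 0.48
```

True negatives never appear in the formula, so F1 rewards finding eligible people and gives no credit for correctly labeling ineligible ones.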

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data. 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

In [37]:
# LogisticRegression model
# Accuracy is ~66%, higher than the baseline
# Not overfit
# F1-score for training and testing is very similar
model_df.loc[0]

model              LogisticRegression
accuracy                     0.663612
true_negatives                    946
false_positives                   186
false_negatives                   438
true_positives                    285
F1_training                      0.47
F1_testing                       0.48
Name: 0, dtype: object

In [38]:
# KNeighborsClassifier model
# Accuracy is ~64%, higher than the baseline
# Somewhat overfit
# F1-score on training is higher than on testing
model_df.loc[1]

model              KNeighborsClassifier
accuracy                       0.639353
true_negatives                      856
false_positives                     276
false_negatives                     393
true_positives                      330
F1_training                        0.65
F1_testing                          0.5
Name: 1, dtype: object

In [39]:
# DecisionTreeClassifier
# Accuracy is ~58%, lower than the baseline
# Badly overfit
# F1-score on training is far higher than on testing
model_df.loc[2]

model              DecisionTreeClassifier
accuracy                          0.58221
true_negatives                        739
false_positives                       393
false_negatives                       382
true_positives                        341
F1_training                           1.0
F1_testing                           0.47
Name: 2, dtype: object

In [40]:
# BaggingClassifier
# Accuracy is ~65%, higher than the baseline
# Badly overfit
# F1-score on training is far higher than on testing
model_df.loc[3]

model              BaggingClassifier
accuracy                    0.649057
true_negatives                   877
false_positives                  255
false_negatives                  396
true_positives                   327
F1_training                     0.97
F1_testing                       0.5
Name: 3, dtype: object

In [41]:
# RandomForestClassifier
# Accuracy is ~66%, higher than the baseline
# Badly overfit
# F1-score on training is far higher than on testing
model_df.loc[4]

model              RandomForestClassifier
accuracy                         0.659838
true_negatives                        878
false_positives                       254
false_negatives                       377
true_positives                        346
F1_training                           1.0
F1_testing                           0.52
Name: 4, dtype: object

In [42]:
# AdaBoostClassifier
# Accuracy is ~69%, higher than the baseline and the highest of all the models
# Not overfit
# F1-score for training and testing is very similar
model_df.loc[5]

model              AdaBoostClassifier
accuracy                     0.691105
true_negatives                    904
false_positives                   228
false_negatives                   345
true_positives                    378
F1_training                      0.56
F1_testing                       0.57
Name: 5, dtype: object

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [43]:
# Pick the AdaBoostClassifier as the final model
# Its accuracy is the highest of all the models (good prediction)
# Its testing F1-score is also the highest of all the models
# Not overfit
model_df.loc[5]

model              AdaBoostClassifier
accuracy                     0.691105
true_negatives                    904
false_positives                   228
false_negatives                   345
true_positives                    378
F1_training                      0.56
F1_testing                       0.57
Name: 5, dtype: object

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

- First, I would check for outliers in the dataset.
- Second, I would perform a grid search to find the best hyperparameters for this model.
- Lastly, I would collect more data.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

In [44]:
# Regression: What features best predict one's income?
# The feature that best predicts one's income is net total financial assets (nettfa).
# That is reasonable because people with high net total financial assets
# tend to have knowledge of investing and saving money.
# Limitation: even the best models explain only about a third of the variance in test-set income.

In [45]:
# Classification: Predict whether or not one is eligible for a 401k.
# We can predict eligibility for a 401k with approximately 69 percent test accuracy (baseline: ~61 percent).
# However, we could improve our model by using a grid search to find the best hyperparameters.