## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and hopefully each account holds a lot more money by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:
import pandas as pd
df = pd.read_csv("./401ksubs.csv")

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

In [2]:
df

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.170,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.230,0,1,35,1,154.000,1,0,3749.1130,1225
2,0,12.858,1,0,44,2,0.000,0,0,165.3282,1936
3,0,98.880,1,1,44,2,21.800,0,0,9777.2540,1936
4,0,22.614,0,0,53,1,18.450,0,0,511.3930,2809
5,0,15.000,1,0,60,3,0.000,0,0,225.0000,3600
6,0,37.155,1,0,49,5,3.483,0,1,1380.4940,2401
7,0,31.896,1,0,38,5,-2.100,0,0,1017.3550,1444
8,0,47.295,1,0,52,2,5.290,0,1,2236.8170,2704
9,1,29.100,0,1,45,1,29.600,0,1,846.8100,2025


In [3]:
df.corrwith(df['e401k'])

e401k     1.000000
inc       0.268178
marr      0.080843
male     -0.027641
age       0.031526
fsize     0.012015
nettfa    0.143950
p401k     0.769170
pira      0.118643
incsq     0.206618
agesq     0.017526
dtype: float64

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

Advertising based on race is implicit racism: it assumes a trend holds across all members of a given race and enables discriminatory decisions.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

In [5]:
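# We would not use e401k, p401k or pira (see the note above), and incsq should also be excluded because it is computed directly from inc.
# (The correlations above are with e401k, not with inc, so they do not rank predictors of income.)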

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

In [6]:
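# incsq and agesq. Squared terms let a linear model capture nonlinear effects, e.g. income that rises with age and then levels off.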


##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

In [7]:
# Age Sq? Seems like an ineffective adjustment.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

In [8]:
# Random Forest Regressor, Decision Tree, Linear Regression.
# Logistic Regression is a classification method, so it is not appropriate for a continuous target like inc.

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a $k$-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!
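A minimal sketch of the shared-split idea above, run on synthetic data rather than the lab's `401ksubs.csv`. The `demo_*` and `Xd_*` names and the choice of models are assumptions for illustration. Every model is fit on the same seeded split, and KNN and SVR are wrapped in a pipeline with `StandardScaler` because both are sensitive to feature scale.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the lab data (an assumption, not the real dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# One seeded split, reused by every model so the scores are comparable.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=42)

# Distance- and margin-based models get a scaler in front of them.
demo_models = {
    "linear": LinearRegression(),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "tree": DecisionTreeRegressor(random_state=42),
    "svr": make_pipeline(StandardScaler(), SVR()),
}

demo_scores = {}
for name, model in demo_models.items():
    model.fit(Xd_train, yd_train)
    demo_scores[name] = model.score(Xd_test, yd_test)  # R^2 on the held-out set

for name, r2 in demo_scores.items():
    print(f"{name}: test R^2 = {r2:.3f}")
```

Swapping `X_demo` and `y_demo` for the lab's `X` and `y` should work the same way, though the scores will of course differ.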

In [9]:
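# Drop the target plus the variables the note says to ignore when predicting inc.
# Caution: incsq is computed directly from inc, so leaving it in X leaks the target.
X = df.drop(['e401k','p401k','pira','inc'], axis = 1)
y = df['inc']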

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [12]:
svr = SVR()

In [13]:
svr.fit(X_train,y_train)
svr.score(X_test,y_test)
svr_preds = svr.predict(X_test)

In [14]:
knn = KNeighborsRegressor()

In [15]:
knn.fit(X_train,y_train)
knn.score(X_test,y_test)
knn_preds = knn.predict(X_test)

In [16]:
bag = BaggingRegressor()

In [17]:
bag.fit(X_train,y_train)
bag.score(X_test,y_test)

bag_preds = bag.predict(X_test)

In [18]:
lr = LinearRegression()

In [19]:
lr.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [20]:
lr.score(X_test,y_test)
lr_preds = lr.predict(X_test)

In [21]:
rfr = RandomForestRegressor()

In [22]:
rfr.fit(X_train,y_train)
rfr.score(X_test,y_test)

rfr_preds = rfr.predict(X_test)

In [23]:
ada = AdaBoostRegressor()

In [24]:
ada.fit(X_train,y_train)
ada.score(X_test,y_test)
ada_preds = ada.predict(X_test)

##### 9. What is bootstrapping?

It's a way of randomly sampling our data WITH replacement (i.e., we do not remove an observation from the pool once it is selected, so the same observation can appear multiple times in a sample).
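As a small illustration of sampling with replacement, here is a sketch using NumPy on a toy array. The array and the seed are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # toy "dataset" of 10 observations

# A bootstrap sample has the same size as the data and is drawn WITH replacement.
boot_idx = rng.integers(0, len(data), size=len(data))
boot_sample = data[boot_idx]

# Because of replacement, some observations can repeat and others can be left out.
n_unique = len(np.unique(boot_sample))
print("bootstrap sample:", boot_sample)
print("distinct observations drawn:", n_unique)
```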

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

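A single decision tree is fit once on the full training set. A set of bagged decision trees fits many trees, each on a different bootstrap sample of the training data, and then combines their predictions (averaging for regression, voting for classification).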

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

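In a set of bagged decision trees, every split in every tree may consider all of the features. In a random forest, each split considers only a random subset of the features, so the individual trees are less correlated with one another.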
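The difference can be sketched in scikit-learn on synthetic data (the data and variable names are assumptions for illustration). `max_features="sqrt"` is set explicitly on the forest because, depending on the scikit-learn version, the regressor's default may consider all features at each split, which would make it behave like plain bagging.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# Synthetic data with 8 features (an assumption, not the lab data).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)

# Bagged trees: each tree sees a bootstrap sample; every split may use all 8 features.
bagged = BaggingRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Random forest: also bootstrap samples, but each split considers a random feature subset.
forest = RandomForestRegressor(
    n_estimators=25, max_features="sqrt", random_state=0
).fit(X_demo, y_demo)

# max_features_ on a fitted tree is the number of features considered per split.
print("bagged tree, features per split:", bagged.estimators_[0].max_features_)
print("forest tree, features per split:", forest.estimators_[0].max_features_)
```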

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

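Less variance. Because each split only sees a random subset of features, the trees in a random forest are less correlated with one another than bagged trees are, so averaging them reduces variance more.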

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [26]:
from sklearn.metrics import mean_squared_error

In [27]:
MSE_svr = mean_squared_error(y_test, svr_preds)

MSE_svr

612.7976878399319

In [28]:
MSE_knn = mean_squared_error(y_test, knn_preds)

MSE_knn

0.22319366506252677

In [29]:
MSE_bag = mean_squared_error(y_test, bag_preds)

MSE_bag

0.03436396824062117

In [30]:
MSE_lr = mean_squared_error(y_test, lr_preds)

MSE_lr

64.47549025929648

In [31]:
MSE_rfr = mean_squared_error(y_test, rfr_preds)

MSE_rfr

0.04067810031047903

In [32]:
MSE_ada = mean_squared_error(y_test, ada_preds)

MSE_ada

5.672423249799138
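The cells above report test-set MSE. RMSE is its square root, and the question also asks for the training-set value. A minimal sketch on synthetic data follows; the `rmse` helper and the `Xd_*` names are hypothetical and introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the lab data (an assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=1)

tree = DecisionTreeRegressor(random_state=1).fit(Xd_train, yd_train)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

rmse_train = rmse(yd_train, tree.predict(Xd_train))
rmse_test = rmse(yd_test, tree.predict(Xd_test))

# A training RMSE far below the testing RMSE is the usual sign of overfitting.
print(f"train RMSE = {rmse_train:.3f}, test RMSE = {rmse_test:.3f}")
```

The same helper could be applied to each fitted model in the lab, e.g. `rmse(y_train, knn.predict(X_train))` alongside `rmse(y_test, knn_preds)`.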

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

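No. I think most of these models performed poorly overall. Note that only test-set error was computed above, so a comparison against training error is still needed to confirm this.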

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

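I wasn't surprised to find that KNN scored well (test MSE of about 0.22), although the bagged trees (about 0.034) and the random forest (about 0.041) had lower test error. KNN performs well on smaller data sets like this one, and its computational cost is relatively negligible.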

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

Better engineer the features so that they are more informative for the regression. This seems to be the biggest issue currently. More data would potentially help in this endeavour as well; simply put, I would work to gather more conclusive data.

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

In [35]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

In [36]:
X = df.drop('e401k', axis = 1)
y = df['e401k']

X_train, X_test, y_train, y_test = train_test_split(X,y)

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a $k$-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [37]:
rfc = RandomForestClassifier()
rfc.fit(X_train,y_train)
rfc.score(X_test,y_test)
rfc_pred = rfc.predict(X_test)

In [38]:
Ada = AdaBoostClassifier()
Ada.fit(X_train,y_train)
Ada.score(X_test,y_test)
Ada_preds = Ada.predict(X_test)

In [39]:
Bag = BaggingClassifier()
Bag.fit(X_train,y_train)
Bag.score(X_test,y_test)
Bag_preds = Bag.predict(X_test)

In [40]:
lg = LogisticRegression()
lg.fit(X_train,y_train)
lg.score(X_test,y_test)
lg_preds = lg.predict(X_test)

In [41]:
svc = SVC()
svc.fit(X_train, y_train)
svc.score(X_test,y_test)
svc_preds = svc.predict(X_test)

In [42]:
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
knn.score(X_test,y_test)
knn_preds = knn.predict(X_test)

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

- False Positive = Someone ineligible for a 401k who is predicted as eligible
- False Negative = Someone eligible for a 401k who is predicted as ineligible
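To make the two error types concrete, here is a sketch that counts them on a handful of made-up labels (the arrays are assumptions for illustration, with 1 meaning eligible).

```python
import numpy as np

# Toy labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # eligible, predicted eligible
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # ineligible, predicted eligible
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # eligible, predicted ineligible
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # ineligible, predicted ineligible

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=3, FP=2, FN=1, TN=2
```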

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

False negatives likely have less of a long-term effect: people who believe their contributions are not being matched will keep putting money directly into their savings. A false positive is more costly: someone who is wrongly told they are eligible for a 401k might contribute less to their savings, expecting a partial match. We would therefore rather minimize false positives.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

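Specificity (or precision), since both improve as false positives fall. If false negatives were the priority instead, the metric to optimize would be recall (sensitivity).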

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

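Essentially, it is the harmonic mean of precision and recall, so it is only high when both are high. If we are not optimizing for a single type of error as above, it gives one number that reflects how well the model balances false positives against false negatives.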
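A short worked example of the calculation, using made-up confusion-matrix counts (assumed values, not results from this lab):

```python
# Toy confusion-matrix counts (assumed values for illustration).
tp, fp, fn = 3, 2, 1

precision = tp / (tp + fp)  # of the predicted positives, how many were correct
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
# precision=0.600, recall=0.750, f1=0.667
```

The same value falls out of the equivalent form `2*tp / (2*tp + fp + fn)`, which makes it clear that true negatives play no role in the score.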

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [43]:
from sklearn.metrics import f1_score


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

In [44]:
f1_score(y_test, rfc_pred)

0.8060422960725075

In [45]:
f1_score(y_test, Ada_preds)

0.8182386008744534

In [46]:
f1_score(y_test,Bag_preds)

0.8107121119902616

In [47]:
f1_score(y_test, lg_preds)

0.819262038774234

In [48]:
f1_score(y_test, svc_preds)

0.016477857878475798

In [49]:
f1_score(y_test, knn_preds)

0.47957371225577267
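The scores above are all computed on the test set. Checking for overfitting also needs the training-set F1 for each model. A minimal sketch on synthetic data follows; the data, the `Xd_*` names, and the single unpruned tree are assumptions chosen to make the gap easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with noisy labels (an assumption, not the lab data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, flip_y=0.2, random_state=2
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=2)

# An unpruned tree can memorize the training set, which makes overfitting visible.
clf = DecisionTreeClassifier(random_state=2).fit(Xd_train, yd_train)

f1_train = f1_score(yd_train, clf.predict(Xd_train))
f1_test = f1_score(yd_test, clf.predict(Xd_test))

# A large drop from training F1 to testing F1 suggests the model is overfit.
print(f"train F1 = {f1_train:.3f}, test F1 = {f1_test:.3f}")
```

For the lab's models, the analogous comparison would be, for example, `f1_score(y_train, rfc.predict(X_train))` against `f1_score(y_test, rfc_pred)`.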

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

AdaBoost is among the strongest performers across all the models (test F1 of about 0.818), essentially tied with logistic regression (about 0.819). The boosting method seems especially good for a dataset of this size.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

More features would allow for a more complex model; I think the lack of them is a major limitation for the classifier. Alternatively, gathering more observations for the features already available could help.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Regression: Age, Marriage and Fsize
Classification: (Note Above Models.)