## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and hopefully the accounts have a lot more money in them when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:

import pandas as pd
import numpy as np

from math import sqrt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import BaggingRegressor, BaggingClassifier, RandomForestRegressor, RandomForestClassifier, AdaBoostRegressor, AdaBoostClassifier
from sklearn.metrics import mean_squared_error, f1_score
from sklearn import svm

In [2]:
df = pd.read_csv('./401ksubs.csv')


In [3]:
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

### 1) College Degree

### 2) Dependents (children)

### 3) Age

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

#### This could lead to discrimination based on race. Targeting financial products by race could also expose the company to legal liability.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

In [4]:
df.head()


Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


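#### Age squared and income squared don't add new information; they are just the squares of `age` and `inc`. In particular, `incsq` is computed directly from income, so it should not be used to predict `inc`. Per the note above, `e401k`, `p401k`, and `pira` are also off limits.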

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

### The two engineered features are income squared (`incsq`) and age squared (`agesq`). One hypothesis is that they create a clearer threshold for who should qualify for certain retirement plans. Another is that squared terms let a linear model capture curved (nonlinear) relationships with age and income.


##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

#### Income should not be squared. Also, the income column should be shown in thousands. Maybe there was some confusion, and that is why it was squared to put it in thousands. Net total financial assets are in thousands, so let's scale the data for income.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

-  Linear Regression: Predicts a continuous target variable from the X variables. Appropriate here because income is continuous.

-  Ridge Regression: Linear regression with an L2 penalty that shrinks large coefficients toward 0. It does not set coefficients exactly to 0.

-  Lasso Regression: Linear regression with an L1 penalty. Unlike ridge, the lasso can set coefficients exactly to 0, so it also performs feature selection.

-  ElasticNet Regression: A regression tactic that combines the penalties of Lasso Regression and Ridge Regression.
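
The difference between the ridge and lasso penalties can be sketched on synthetic data. The data set, the `alpha` values, and the `*_demo` names below are assumptions chosen for illustration; they are not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 10 features, only 3 informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=42
)
X_demo_sc = StandardScaler().fit_transform(X_demo)

# alpha values are arbitrary picks for the demo, not tuned
ridge_demo = Ridge(alpha=10.0).fit(X_demo_sc, y_demo)
lasso_demo = Lasso(alpha=5.0).fit(X_demo_sc, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
print("Exactly-zero coefficients (ridge):", n_zero_ridge)
print("Exactly-zero coefficients (lasso):", n_zero_lasso)
```

Ridge shrinks every coefficient but leaves them nonzero, while the lasso typically drops the uninformative features entirely.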

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [5]:
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


In [6]:
df.isnull().sum()

e401k     0
inc       0
marr      0
male      0
age       0
fsize     0
nettfa    0
p401k     0
pira      0
incsq     0
agesq     0
dtype: int64

### Train/Test Split

In [7]:
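# Predictors: e401k, p401k, and pira are excluded per the note above
features = ['marr', 'male', 'agesq', 'fsize', 'nettfa']

X = df[features]
# Note: the target here is `incsq` (income squared), not `inc`,
# so the scores and RMSEs below are on the squared-income scale
y = df['incsq']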

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


In [9]:
ss = StandardScaler()


In [10]:
ss.fit(X_train)


StandardScaler(copy=True, with_mean=True, with_std=True)

In [11]:
X_train_sc = ss.transform(X_train)

In [12]:
X_test_sc = ss.transform(X_test)
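
As the prompt above suggests, a pipeline is one alternative to scaling by hand. The sketch below uses synthetic data from `make_regression` as a stand-in for the lab's dataframe (an assumption), and the names `pipe`, `cv_r2`, and `test_r2` are hypothetical. Putting the scaler inside the pipeline means it is refit on the training portion of each cross-validation fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Scaling and modeling are chained, so each CV fold fits its own scaler
pipe = make_pipeline(StandardScaler(), LinearRegression())

cv_r2 = cross_val_score(pipe, X_tr, y_tr, cv=5).mean()
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)

print("Mean CV R^2:", round(cv_r2, 3))
print("Test R^2:   ", round(test_r2, 3))
```

Swapping `LinearRegression()` for any other estimator reuses the same split and the same scaling step.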

## Linear Regression

In [13]:
lr = LinearRegression()


In [14]:
lr.fit(X_train_sc, y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [15]:
cross_val_score(lr, X_train_sc, y_train).mean()




0.2508303293885998

In [16]:
lr.score(X_train_sc, y_train)

0.2535546958398234

In [17]:
lr.score(X_test_sc, y_test)


0.1767209397886863

## KNN Regressor

In [18]:
knn = KNeighborsRegressor()


In [19]:
knn.fit(X_train_sc, y_train)


KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [20]:
cross_val_score(knn, X_train_sc, y_train).mean()




0.23088654138199372

In [21]:
knn.score(X_train_sc, y_train)


0.48097097365025354

In [22]:
knn.score(X_test_sc, y_test)


0.22033294839055806

## Decision Tree

In [23]:
dt = DecisionTreeRegressor()

In [24]:
dt.fit(X_train_sc, y_train)


DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [25]:
cross_val_score(dt, X_train_sc, y_train).mean()




-0.32222597640513323

In [26]:
dt.score(X_train_sc, y_train)


0.9930593439412755

In [27]:
dt.score(X_test_sc, y_test)


-0.5083446453570741

## Bagged Decision Tree


In [28]:
bag = BaggingRegressor()


In [29]:
bag.fit(X_train_sc, y_train)


BaggingRegressor(base_estimator=None, bootstrap=True, bootstrap_features=False,
                 max_features=1.0, max_samples=1.0, n_estimators=10,
                 n_jobs=None, oob_score=False, random_state=None, verbose=0,
                 warm_start=False)

In [30]:
cross_val_score(bag, X_train_sc, y_train).mean()




0.19724318199339808

In [31]:
bag.score(X_train_sc, y_train)


0.8525935993751484

In [32]:
bag.score(X_test_sc, y_test)


0.11575061423771736

## Random Forests


In [33]:
rf = RandomForestRegressor()


In [34]:
rf.fit(X_train_sc, y_train)




RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [35]:
cross_val_score(rf, X_train_sc, y_train).mean()




0.21205527419590917

In [36]:
rf.score(X_train_sc, y_train)


0.85852283057014

In [37]:
rf.score(X_test_sc, y_test)


0.13746293159442657

## Ada Boost Model

In [38]:
ada = AdaBoostRegressor()

In [39]:
ada.fit(X_train_sc, y_train)


AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
                  n_estimators=50, random_state=None)

In [40]:
cross_val_score(ada, X_train_sc, y_train).mean()




-0.16514836994918003

In [41]:
ada.score(X_train_sc, y_train)


-0.07342142040412813

In [42]:
ada.score(X_test_sc, y_test)


-0.182522733695756

## Support Vector Machine


In [43]:
svr = svm.SVR()


In [44]:
svr.fit(X_train_sc, y_train)


SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)

In [45]:
cross_val_score(svr, X_train_sc, y_train).mean()




-0.07432116300293101

In [46]:
svr.score(X_train_sc, y_train)


-0.06097722225719848

In [47]:
svr.score(X_test_sc, y_test)


-0.06038690657261925

##### 9. What is bootstrapping?

###### Bootstrapping draws random samples from the data set with replacement, typically each the same size as the original data. A model or statistic is then computed on each resample.
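
A minimal sketch of a single bootstrap resample, using a toy array rather than the lab data (the array and the variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # toy data set standing in for the rows of a dataframe

# Draw len(data) row indices WITH replacement; some rows repeat, others are left out
boot_idx = rng.integers(0, len(data), size=len(data))
boot_sample = data[boot_idx]

print("Original: ", data)
print("Bootstrap:", boot_sample)
print("Rows left out of this resample:", sorted(set(data) - set(boot_sample)))
```

Repeating this many times and refitting on each resample is the basis of bagging.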

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

###### A single decision tree is fit once on the full training data. Bagged decision trees fit many trees, each on a different bootstrapped sample, and then average their predictions (or take a majority vote for classification). It is an ensemble method whose main benefit is lower variance than a single tree.
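
Bagging can be written out by hand in a few lines. This is a sketch on synthetic data, not how `BaggingRegressor` is implemented internally; the number of trees (25) and all names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):
    # Each tree sees its own bootstrapped sample of the rows
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_demo[idx], y_demo[idx]))

# The bagged prediction is the average of the individual trees' predictions
bagged_pred = np.mean([tree.predict(X_demo) for tree in trees], axis=0)
print(bagged_pred[:5])
```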

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

###### A random forest bootstraps rows just like bagging, but it also considers only a random subset of features at each split when growing each tree. The trees are still fit independently (in parallel), and their predictions are averaged as with bagged decision trees. The extra feature sampling is the difference.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

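A random forest might be superior to a set of bagged decision trees because it has lower variance. Fully grown decision trees have low bias and high variance, and bagged trees fit on overlapping bootstrapped samples tend to be highly correlated with one another. Sampling features at each split decorrelates the trees, so averaging them reduces variance more, usually at the cost of only a small increase in bias.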
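
In scikit-learn, the feature sampling is controlled by `max_features`. The sketch below compares the two ensembles on synthetic data; the data set, `n_estimators=50`, and `max_features="sqrt"` are assumptions for illustration, and which ensemble scores higher will depend on the data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=42)

# Bagging: bootstrapped rows, every feature is a candidate at every split
bag_demo = BaggingRegressor(n_estimators=50, random_state=42)

# Random forest: bootstrapped rows AND a random subset of features per split
rf_demo = RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=42)

bag_cv = cross_val_score(bag_demo, X_demo, y_demo, cv=5).mean()
rf_cv = cross_val_score(rf_demo, X_demo, y_demo, cv=5).mean()
print("Bagged trees  mean CV R^2:", round(bag_cv, 3))
print("Random forest mean CV R^2:", round(rf_cv, 3))
```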



## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

## Linear Regression

In [49]:
lr_pred = lr.predict(X_train_sc)
lr_train = sqrt(mean_squared_error(y_train, lr_pred))
lr_train


2577.458304052816

In [50]:
lr_predics_test = lr.predict(X_test_sc)
lr_rms_test = sqrt(mean_squared_error(y_test, lr_predics_test))
lr_rms_test

2771.707049080118

## KNN Regressor

In [51]:
knn_pred = knn.predict(X_train_sc)
knn_train = sqrt(mean_squared_error(y_train, knn_pred))
knn_train

2149.257625655916

In [53]:
knn_pred_test = knn.predict(X_test_sc)
knn_test = sqrt(mean_squared_error(y_test, knn_pred_test))
knn_test

2697.294596593576

## Decision Tree

In [54]:
dt_predics_train = dt.predict(X_train_sc)


In [55]:
dt_rms_train = sqrt(mean_squared_error(y_train, dt_predics_train))


In [56]:
dt_rms_train


248.53806629171348

In [57]:
dt_predics_test = dt.predict(X_test_sc)


In [58]:
dt_rms_test = sqrt(mean_squared_error(y_test, dt_predics_test))


In [59]:
dt_rms_test


3751.665284022304

## Bagged Decision Tree


In [60]:
bag_predics_train = bag.predict(X_train_sc)


In [61]:
bag_rms_train = sqrt(mean_squared_error(y_train, bag_predics_train))


In [62]:
bag_rms_train


1145.3832403701276

In [63]:
bag_predics_test = bag.predict(X_test_sc)


In [64]:
bag_rms_test = sqrt(mean_squared_error(y_test, bag_predics_test))


In [65]:
bag_rms_test


2872.507524721036

## Random Forests


In [66]:
rf_predics_train = rf.predict(X_train_sc)


In [67]:
rf_rms_train = sqrt(mean_squared_error(y_train, rf_predics_train))


In [68]:
rf_rms_train


1122.111037638943

In [69]:
rf_predics_test = rf.predict(X_test_sc)


In [70]:
rf_rms_test = sqrt(mean_squared_error(y_test, rf_predics_test))


In [71]:
rf_rms_test

2837.0218185520785

## Ada Boost Model

In [72]:

ada_predics_train = ada.predict(X_train_sc)

In [73]:
ada_rms_train = sqrt(mean_squared_error(y_train, ada_predics_train))


In [74]:
ada_rms_train


3090.848952834557

In [75]:
ada_predics_test = ada.predict(X_test_sc)


In [76]:
ada_rms_test = sqrt(mean_squared_error(y_test, ada_predics_test))


In [77]:
ada_rms_test


3321.8387043521025

## Support Vector Machine


In [78]:
svr_predics_train = svr.predict(X_train_sc)


In [79]:
svr_rms_train = sqrt(mean_squared_error(y_train, svr_predics_train))


In [80]:
svr_rms_train


3072.880584136111

In [81]:
svr_predics_test = svr.predict(X_test_sc)


In [82]:
svr_rms_test = sqrt(mean_squared_error(y_test, svr_predics_test))


In [83]:
svr_rms_test


3145.617917298943
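
The cells above repeat the same three steps for every model. One way to condense them is a small helper and a loop, sketched here on synthetic data with only three of the models; the `rmse` helper, the `results` dictionary, and the data are assumptions, not part of the original notebook.

```python
from math import sqrt

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def rmse(model, X, y):
    """Root mean squared error of a fitted model on (X, y)."""
    return sqrt(mean_squared_error(y, model.predict(X)))


# Synthetic stand-in for the lab data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "tree": DecisionTreeRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = (rmse(model, X_tr, y_tr), rmse(model, X_te, y_te))
    print(f"{name:>6}: train RMSE = {results[name][0]:.2f}, test RMSE = {results[name][1]:.2f}")
```

A large gap between the training and testing columns is the overfitting signal asked about in the next question.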

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

#### All of the models seem overfit except linear regression, SVM, and AdaBoost.

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.


If I had to pick one model, I would pick the linear regression model because it was the only model that did not have a major discrepancy between train and test. Its RMSE might not be the lowest, but it shows the least overfitting, with only a small gap between train and test. Picking the best model is not only about the best score; it is also about choosing the least overfit model, one that represents the data as a whole.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1) Ensemble model

2) Dummifying age range

3) Regularization

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

`p401k` records whether someone participates in a 401(k), and only eligible people can participate, so using it would leak the target into our features. The model would not do well on new data where participation is unknown.

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

-  Logistic Regression: Classification where coefficients can be interpreted and regularized.

-  K-Nearest Neighbors: Classifies each point by the majority class of its nearest neighbors.

-  Decision Trees: Split the data until the leaves are pure (or a depth limit is reached).

-  Bagged Decision Trees: Fit many trees on bootstrapped samples and combine their votes.

-  Random Forest: Bagged trees that also sample a random subset of features at each split.

Minimize false positives. It is better to predict that someone is not eligible when they actually are eligible than vice versa.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [84]:
features = ['incsq', 'marr', 'male', 'agesq', 'fsize', 'nettfa', 'pira']

X = df[features]
y = df['e401k']

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify = y)


In [86]:
ss = StandardScaler()
ss.fit(X_train)
X_train_sc = ss.transform(X_train)
X_test_sc = ss.transform(X_test)


### Logistic Regression

In [87]:
logreg = LogisticRegression()
logreg.fit(X_train_sc, y_train)
cross_val_score(logreg, X_train_sc, y_train).mean()




0.6354235468547004

In [94]:
logreg.score(X_train_sc, y_train)



0.6374353076480737

In [95]:
logreg.score(X_test_sc, y_test)

0.6442432082794308

### KNN

In [89]:
knn = KNeighborsClassifier()
knn.fit(X_train_sc, y_train)
cross_val_score(knn, X_train_sc, y_train).mean()




0.6368634276476947

In [92]:
knn.score(X_train_sc, y_train)



0.7538815411155837

In [93]:
knn.score(X_test_sc, y_test)

0.6248382923673997

### Decision Tree


In [96]:
dt = DecisionTreeClassifier()
dt.fit(X_train_sc, y_train)


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [97]:
cross_val_score(dt, X_train_sc, y_train).mean()




0.5904235220612299

In [98]:
dt.score(X_train_sc, y_train)


1.0

In [99]:
dt.score(X_test_sc, y_test)


0.5838723587753342

### Bagged Decision Tree


In [100]:
bag = BaggingClassifier()
bag.fit(X_train_sc, y_train)


BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,
                  max_features=1.0, max_samples=1.0, n_estimators=10,
                  n_jobs=None, oob_score=False, random_state=None, verbose=0,
                  warm_start=False)

In [101]:
cross_val_score(bag, X_train_sc, y_train).mean()




0.6385875655787293

In [102]:
bag.score(X_train_sc, y_train)


0.9772857964347326

In [103]:
bag.score(X_test_sc, y_test)


0.6515739542906425

## Random Forests


In [104]:
rf = RandomForestClassifier()
rf.fit(X_train_sc, y_train)




RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [105]:
cross_val_score(rf, X_train_sc, y_train).mean()




0.6528167861711939

In [106]:
rf.score(X_train_sc, y_train)


0.9774295572167913

In [107]:
rf.score(X_test_sc, y_test)


0.6425183268650281

## AdaBoost


In [108]:
ada = AdaBoostClassifier()
ada.fit(X_train_sc, y_train)


AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=None)

In [109]:
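# Only the last expression in a cell is displayed, so the output below is the
# training accuracy; the cross-validation score is computed but not shown
cross_val_score(ada, X_train_sc, y_train).mean()
ada.score(X_train_sc, y_train)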



0.6922081656124209

In [110]:
ada.score(X_test_sc, y_test)


0.6873652436394998

## SVC

In [111]:
svc = svm.SVC()
svc.fit(X_train_sc, y_train)
cross_val_score(svc, X_train_sc, y_train).mean()




0.6661884006228119

In [112]:
svc.score(X_train_sc, y_train)


0.6835825186889016

In [113]:
svc.score(X_test_sc, y_test)


0.6778783958602846

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

False Positives: Predicted: eligible for a 401(k). Actual: not eligible.

False Negatives: Predicted: not eligible. Actual: eligible.

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

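Minimize false positives. A false positive means we predicted someone is eligible when they actually are not, so it is better to miss an eligible person than to target an ineligible one.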

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

Optimize specificity, TN / (TN + FP), which approaches 1 as the number of false positives goes to 0.
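
Specificity has no dedicated scorer in `sklearn.metrics`, but it can be computed from the confusion matrix. The labels below are made-up toy values for illustration, not predictions from the models above.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumption): 1 = eligible for a 401(k), 0 = not eligible
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

specificity = tn / (tn + fp)
print("TN, FP, FN, TP:", tn, fp, fn, tp)  # 4 1 1 2
print("Specificity:", specificity)        # 0.8
```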

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

F1 score is the harmonic mean of precision and recall, so it is high only when both false positives and false negatives are low. It is not the same as accuracy.
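
A small worked example shows how F1 is built from precision and recall and how it can differ from accuracy. The toy labels are assumptions for illustration only.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels (assumption): 4 eligible people, 6 not eligible
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN) = 2 / 4

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true_demo, y_pred_demo)
accuracy = accuracy_score(y_true_demo, y_pred_demo)

print("F1 (manual): ", round(f1_manual, 4))   # 0.5714
print("F1 (sklearn):", round(f1_sklearn, 4))  # 0.5714
print("Accuracy:    ", accuracy)              # 0.7
```

Here accuracy looks respectable because most true negatives are classified correctly, while F1 is pulled down by the missed positives.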

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

## Logistic Regression

In [131]:
logreg_train = logreg.predict(X_train_sc)
logreg_f1_train = f1_score(y_train, logreg_train)
logreg_f1_train

0.29749303621169915

In [132]:
logreg_test = logreg.predict(X_test_sc)
logreg_f1_test = f1_score(y_test, logreg_test)
logreg_f1_test


0.29906542056074764

## KNN Classification

In [133]:
knn_train = knn.predict(X_train_sc)
knn_f1_train = f1_score(y_train, knn_train)
knn_f1_train


0.657188626351622

In [134]:
knn_test = knn.predict(X_test_sc)
knn_f1_test = f1_score(y_test, knn_test)
knn_f1_test


0.4746376811594203

## Decision Tree

In [135]:
dt_train = dt.predict(X_train_sc)
dt_f1_train = f1_score(y_train, dt_train)
dt_f1_train


1.0

In [136]:
dt_test = dt.predict(X_test_sc)
dt_f1_test = f1_score(y_test, dt_test)
dt_f1_test


0.4820182501341922

### Bagged Decision Tree


In [137]:
bag_train = bag.predict(X_train_sc)
bag_f1_train = f1_score(y_train, bag_train)
bag_f1_train

0.9704230625233995

In [138]:
bag_test = bag.predict(X_test_sc)
bag_f1_test = f1_score(y_test, bag_test)
bag_f1_test


0.49688667496886674

### Random Forests


In [139]:
rf_train = rf.predict(X_train_sc)
rf_f1_train = f1_score(y_train, rf_train)
rf_f1_train


0.9705164319248827

In [140]:
rf_test = rf.predict(X_test_sc)
rf_f1_test = f1_score(y_test, rf_test)
rf_f1_test


0.4702875399361023

## AdaBoost

In [141]:
ada_train = ada.predict(X_train_sc)
ada_f1_train = f1_score(y_train, ada_train)
ada_f1_train

0.5824068656134191

In [142]:
ada_test = ada.predict(X_test_sc)
ada_f1_test = f1_score(y_test, ada_test)
ada_f1_test

0.5752782659636789

## SVM

In [143]:
svc_train = svc.predict(X_train_sc)
svc_f1_train = f1_score(y_train, svc_train)
svc_f1_train


0.47078624669391683

In [144]:
svc_test = svc.predict(X_test_sc)
svc_f1_test = f1_score(y_test, svc_test)
svc_f1_test


0.4645161290322581

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

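Logistic regression, AdaBoost, and the SVC have nearly identical training and testing F1 scores, so they show little evidence of overfitting. KNN, the decision tree, the bagged trees, and the random forest all score far higher on the training data than on the testing data. The data set does not seem to contain good predictors for this classification.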

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

AdaBoost shows the highest testing scores and is only slightly overfit. I would choose this one, but I would want to run a pipeline to find the best hyperparameters. It performs better than the logistic regression.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1) Use a grid search and a voting classifier

2) Engineer some features to create a better dataset

3) Regularization on the logistic and linear models
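
A grid search over AdaBoost hyperparameters, scored on F1, might look like the sketch below. The synthetic data from `make_classification`, the parameter grid, and the names `gs` and `param_grid` are assumptions for illustration; sensible ranges for the real data would need experimentation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the lab data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# Small illustrative grid; these values are arbitrary picks
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]}

# scoring="f1" makes the search select on F1 instead of accuracy
gs = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid, scoring="f1", cv=3)
gs.fit(X_tr, y_tr)

test_f1 = f1_score(y_te, gs.predict(X_te))
print("Best params:", gs.best_params_)
print("Best CV F1: ", round(gs.best_score_, 3))
print("Test F1:    ", round(test_f1, 3))
```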

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.