# Project 2 Homework

In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:75% !important; }</style>"))

In [2]:
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from GPyOpt.methods import BayesianOptimization
from sklearn.model_selection import cross_val_score, KFold
from sklearn.datasets import load_diabetes
from scipy.stats import uniform, randint
import re
import pandas as pd
from tqdm import tnrange, tqdm_notebook
import timeit
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import sklearn

from tpot import TPOTRegressor, TPOTClassifier
import tpot

np.random.seed(8675309)  # seed courtesy of Tommy Tutone

For this project you're going to apply hyperparameter optimization to both a regression and a classification problem. It looks like a lot to do below, but it's mostly a matter of modifying code from the presentation. 

## Objective

For each of the models in problems 1 and 2 below, apply the following 4 tuning methods from the presentation: GridSearchCV, RandomSearchCV, BayesianOptimization, and TPOT.
* **For TPOT**: In Problem 1 do only hyperparameter optimization. In Problem 2 do **both** hyperparameter optimization and also run TPOT and let it choose the model. See the presentation for examples of both.

### What to submit

For each problem you need to include the following:

1. A pandas table that reports:
    * The best parameters for each tuning method
    * The optimized score from the test data
    * The number of model fits used in the optimization
2. A brief discussion about which hyperparameter optimization approach worked best

### Notes:
* **For problem 1**: your pandas table should include the best parameters for each of the 4 tuning methods above.
* **For problem 2**: your pandas table should include the best parameters for each of the 5 tuning methods (the 4 methods above and the TPOT model search).
* **For GridSearchCV**: you should include at least 2 or 3 values for each hyperparameter and one of those values should be the default.
* **For BayesianOptimization**: you'll have to use `int()` or `bool()` to cast the float values of the hyperparameters inside your `cross_cv()` function.
* **For TPOT**: you should use a finer grid than for GridSearchCV, but not more than 10 to 20 possible values for each hyperparameter.  You could lower the number of possible values to keep the search space smaller.
    * If your code is too slow you can reduce the number of cross-validation folds to 3 and if your dataset is really large you can randomly choose a smaller subset of the rows.
* Use section headers to label your work.  Your summary / discussion should be more than simply "XYZ is the best model", but it also shouldn't be more than a few paragraphs and a table.


### Regarding data

* You can use either the specified dataset or you can choose your own.  
    * If you use your own data it should have at least 500 rows and 10 features.  
    * If your data has categorical features you'll need "one hot" encode it (convert categorical features into multiple binary features).  <a href="https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/">Here is a nice tutorial</a>.  For categories with only two values you can remove one of the two hot encoded columns.
* If you do want to use your own data, we suggest first getting things working with the suggested datasets.  Finding, cleaning, and preparing data can take a lot of time.

# Problem 1 - Optimize Random Forest Regression

### Find optimized hyperparameters for a random forest regression model. 

You may use either the diabetes data used in the presentation or a dataset that you choose.  **You do not need to include the TPOT general search for this problem** (use TPOT to optimize RandomForestRegressor, but don't run TOPT to choose a model). Here are ranges for a subset of the hyperparameters:

Hyperparameter |Type | Default Value | Typical Range
---- | ---- | ---- | ----
n_estimators | discrete / integer | 100 | 10 to 150
max_features | continuous / float | 1.0 | 0.05 to 1.0
min_samples_split | discrete / integer | 2 | 2 to 20
min_samples_leaf | discrete / integer | 1 | 1 to 20
bootstrap | discrete / boolean | True | True, False


You can add other hyperparameters to the optimization if you wish.
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">Documentation for sklearn RandomForestRegressor</a>

<font color = "blue"> *** 15 points: </font>

In [3]:
# load data and define x, y
diabetes = load_diabetes()
xd = np.array(diabetes.data)
yd = np.array(diabetes.target)

# split data into train/test sets
x_train, x_test, y_train, y_test = train_test_split(xd, yd, test_size=0.2, random_state=123)

In [20]:
def calculate_scores(model, best_params, x_test = x_test, y_test = y_test):
    """
    Function to calculate scores for both regression models and classifiers. It works with only randomforest regressor, but could be changed
    """
    y_pred = model.predict(x_test)  
    results = best_params
    condition = False

    try:
        condition = isinstance(model.estimator, sklearn.ensemble.forest.RandomForestRegressor)
        
    except:
        condition = isinstance(model, (sklearn.ensemble.forest.RandomForestRegressor, tpot.tpot.TPOTRegressor))
    
    if condition == True:
        r_squared = model.score(x_test,y_test)
        mse = mean_squared_error(y_test,y_pred)
        rmse = np.sqrt(mse)

        results['rmse'] = rmse
        results['r_squared'] = r_squared
    
    else:
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average="weighted")
        sensitivity = recall_score(y_test, y_pred, average="weighted")
        
        results['accuracy'] = accuracy
        results['precision'] = precision
        results['sensitivity'] = sensitivity

    return results

def cv_score_rf(hyp_parameters):
    """
    Perform CV on RF
    """
    hyp_parameters = hyp_parameters[0]
    rf_model = RandomForestRegressor(n_estimators=int(hyp_parameters[0]),
                                 max_features=hyp_parameters[1],
                                 min_samples_split=int(hyp_parameters[2]),
                                 min_samples_leaf=int(hyp_parameters[3]),
                                 bootstrap=bool(hyp_parameters[4]))
    scores = cross_val_score(rf_model,
                             X=x_train,
                             y=y_train,
                             cv=KFold(n_splits=5))
    return np.array(scores.mean())

def cv_score_xgb(hyp_parameters):
    """
    Perform CV on XGB
    """
    hyp_parameters = hyp_parameters[0]
    xgb_model = xgb.XGBClassifier(objective="binary:logistic",
                                 learning_rate=hyp_parameters[0],
                                 max_depth=int(hyp_parameters[1]),
                                 n_estimators=int(hyp_parameters[2]),
                                 subsample=hyp_parameters[3],
                                 min_child_weight=int(hyp_parameters[4]),
                                 reg_alpha=hyp_parameters[5],
                                 reg_lambda=hyp_parameters[6],
                                njobs=-1)
    scores = cross_val_score(xgb_model,
                             X=x_train,
                             y=y_train,
                             cv=KFold(n_splits=5))
    return np.array(scores.mean())  # return average of 5-fold scores

def lines_that_start_with(string, fp):
    return [line for line in fp if line.startswith(string)]

def lines_that_contain(string, fp):
    return [line for line in fp if string in line]

def optimize_model(model, 
                   opt_type, 
                   x_train = x_train, 
                   y_train = y_train, 
                   x_test = x_test, 
                   y_test = y_test, 
                   use_parallel = True,
                   cv = 5,
                   n = 3):

    # Set params for RF    
    if isinstance(model, RandomForestRegressor):
        n_estimators = np.linspace(10,150, n, dtype=np.int).tolist() if opt_type != 'randomcv' else randint(10,151)
        max_features =  np.linspace(0.05,1, n).tolist() if opt_type != 'randomcv' else uniform(0.05,1)
        min_samples_split = np.linspace(2,20, n, dtype=np.int).tolist() if opt_type != 'randomcv' else randint(2,21)
        min_samples_leaf = np.linspace(1,20, n, dtype=np.int).tolist() if opt_type != 'randomcv' else randint(1,21)
        bootstrap = np.array([True, False], dtype=bool).tolist() if opt_type != 'randomcv' else [True,False]
    
        params = {
            "n_estimators": n_estimators,
            "max_features": max_features,
            "min_samples_split": min_samples_split,
            "min_samples_leaf": min_samples_leaf,
            "bootstrap": bootstrap
        }

    # Set RFS for XGB
    else: 
        n_estimators = np.linspace(50,150, n, dtype=np.int).tolist() if opt_type != 'randomcv' else randint(50,151)
        max_depth =  np.linspace(1,10, n, dtype=np.int).tolist() if opt_type != 'randomcv' else randint(1,11)
        min_child_weight = np.linspace(1,20, n, dtype=np.int).tolist() if opt_type != 'randomcv' else randint(1,21)
        learning_rate = np.linspace(0.001,1, n).tolist() if opt_type != 'randomcv' else uniform(0.001,1)
        subsample = np.linspace(0.05,1, n).tolist() if opt_type != 'randomcv' else uniform(0.05,0.9)
        reg_lambda = np.linspace(0,5, n, dtype=np.int).tolist() if opt_type != 'randomcv' else randint(0,5)
        reg_alpha = np.arange(1,6, n, dtype=np.int).tolist() if opt_type != 'randomcv' else randint(0,5)
    
        params = {
            "learning_rate": learning_rate,
            "max_depth": max_depth,
            "n_estimators": n_estimators,
            "subsample": subsample,
            "min_child_weight": min_child_weight,
            "reg_lambda": reg_lambda,
            "reg_alpha:": reg_alpha
        }
    
    # Specify to use parallel processing
    n_jobs = -1 if use_parallel == True else 1
    
    # if tuning method == grid search
    if opt_type == 'gridcv':
        print("Optimizing hyperparameters with GridSearchCV...")
        grid_search = GridSearchCV(model,
                           param_grid=params,
                           cv=cv,
                           verbose=1,
                           n_jobs=n_jobs,
                           return_train_score=True)
        
        grid_search.fit(x_train, y_train)
        best_params = grid_search.best_params_
        results = calculate_scores(grid_search, best_params)
        return(results)

    # if tuning method == random CV
    elif opt_type == 'randomcv':
        print("Optimizing hyperparameters with RandomSearchCV...")
        random_search = RandomizedSearchCV(
            model,
            param_distributions=params,
            random_state=8675309,
            n_iter=10,
            cv=cv,
            verbose=1,
            n_jobs=n_jobs,
            return_train_score=True)
        
        random_search.fit(x_train, y_train)
        best_params = random_search.best_params_
        results = calculate_scores(random_search, best_params)
        return(results)
    
    # if tuning method == Bayesian optimization
    elif opt_type == "bayes":
        print("Optimizing hyperparameters with Bayesian Optimization...")
        
        # if model is RF
        if isinstance(model, RandomForestRegressor):
            hp_bounds = [{'name': 'n_estimators', 'type': 'discrete', 'domain': (min(n_estimators), max(n_estimators))}, 
            {'name': 'max_features','type': 'continuous','domain': (min(max_features), max(max_features))}, 
            {'name': 'min_samples_split','type': 'discrete','domain': (min(min_samples_split), max(min_samples_split))}, 
            {'name': 'min_samples_leaf','type': 'discrete','domain': (min(min_samples_leaf), max(min_samples_leaf))}, 
            {'name': 'bootstrap','type': 'discrete','domain': (True, False)}]
            cv_score = cv_score_rf
        
        # if model is XGB
        else:
            hp_bounds = [{'name': 'learning_rate','type': 'continuous','domain': (min(learning_rate), max(learning_rate))}, 
            {'name': 'max_depth','type': 'discrete','domain': (min(max_depth), max(max_depth))}, 
            {'name': 'n_estimators','type': 'discrete','domain': (min(n_estimators), max(n_estimators))}, 
            {'name': 'subsample','type': 'continuous','domain': (min(subsample), max(subsample))}, 
            {'name': 'min_child_weight','type': 'discrete','domain': (min(min_child_weight), max(min_child_weight))}, 
            {'name': 'reg_alpha','type': 'continuous','domain': (min(reg_alpha), max(reg_alpha))}, 
            {'name': 'reg_lambda','type': 'continuous','domain': (min(reg_lambda), max(reg_lambda))}]
            cv_score = cv_score_xgb

        # create optmizer
        optimizer = BayesianOptimization(f=cv_score,
                                         domain=hp_bounds,
                                         model_type='GP',
                                         acquisition_type='EI',
                                         acquisition_jitter=0.05,
                                         exact_feval=True,
                                         maximize=True,
                                         verbosity=True,
                                        njobs=n_jobs)

        optimizer.run_optimization(max_iter=20,verbosity=False)
        best_params = {}

        # if model is RF, convert continuous/discrete vals
        if isinstance(model, RandomForestRegressor):
            for i in range(len(hp_bounds)):
                if hp_bounds[i]['type'] == 'continuous':
                    best_params[hp_bounds[i]['name']] = optimizer.x_opt[i]
                elif hp_bounds[i]['type'] == 'discrete' and hp_bounds[i]['name'] != 'bootstrap':
                    best_params[hp_bounds[i]['name']] = int(optimizer.x_opt[i])
                else:
                    best_params[hp_bounds[i]['name']] = bool(optimizer.x_opt[i])
    
            bayopt_search = RandomForestRegressor(**best_params)
    
        # if model is XGB, do same conversion
        else:
            for i in range(len(hp_bounds)):
                if hp_bounds[i]['type'] == 'continuous':
                    best_params[hp_bounds[i]['name']] = optimizer.x_opt[i]
                else:
                    best_params[hp_bounds[i]['name']] = int(optimizer.x_opt[i])

            bayopt_search =  xgb.XGBClassifier(objective="binary:logistic", **best_params)
        
        bayopt_search.fit(x_train,y_train)
        results = calculate_scores(bayopt_search, best_params)
                
        return(results)
    
    # if tuning method is TPOT
    elif opt_type == 'tpot':
        print("Optimizing hyperparameters with TPOT...")
        
        # specify config for regressor
        if isinstance(model, RandomForestRegressor):
            tpot_config = {
                'sklearn.ensemble.RandomForestRegressor': {
                    "n_estimators": n_estimators,
                    "max_features": max_features,
                    "min_samples_split": min_samples_split,
                    "min_samples_leaf": min_samples_leaf,
                    "bootstrap": bootstrap
                }
            }

            tpot = TPOTRegressor(generations=5,
                                 scoring="r2",
                                 population_size=15,
                                 verbosity=2,
                                 config_dict=tpot_config,
                                 cv=cv,
                                 random_state=8675309)

            tpot.fit(x_train, y_train)
            tpot.export('tpot_rf.py')

            # process output file to get params
            with open("tpot_rf.py", "r") as fp:
                for line in lines_that_start_with("exported_pipeline = ", fp):
                    parse_this = line

            p = re.compile(r"[\w]+=[\w|[\d+\.\d]+")
            match_list = p.findall(parse_this)
            best_params = {}

            for match in match_list:
                key, val = match.split("=")
                best_params[key] = eval(val)

            results = calculate_scores(tpot, best_params)

        # specify config for XGB
        else:
            tpot_config = {
                'xgboost.XGBClassifier': {
                    'n_estimators': n_estimators,
                    'max_depth': max_depth,
                    'learning_rate': learning_rate,
                    'subsample': subsample,
                    'min_child_weight': min_child_weight,
                    'reg_alpha': reg_alpha,
                    'reg_lambda': reg_lambda,
                    'nthread': [1],
                    'objective': ['binary:logistic'],
                }
            }

            tpot = TPOTClassifier(generations=5,
                     population_size=1,
                     verbosity=2,
                     config_dict=tpot_config,
                     cv=cv,
                     random_state=8675309)

            tpot.fit(x_train, y_train)
            tpot.export('tpot_xbg.py')
            
            # process output file to get params
            with open("tpot_xbg.py", "r") as fp:
                # for line in lines_that_contain("exported_pipeline =", fp):
                #     parse_this = line

                for line in lines_that_contain("XGB", fp):
                    parse_this = line

            p = re.compile(r"[\w]+=[\d+\.\d+]+")
            match_list = p.findall(parse_this)
            best_params = {}

            for match in match_list:
                key, val = match.split("=")
                best_params[key] = eval(val)

            if 'nthread' in best_params:
                del best_params['nthread']

            results = calculate_scores(tpot, best_params)
        return(results)

def wrapper(model, cv = 3, n = 3):
    start_time = timeit.default_timer()
    tuning_methods = ['gridcv','randomcv','bayes','tpot']
    
    if isinstance(model, RandomForestRegressor):
        cols = ['n_estimators','max_features','min_samples_split','min_samples_leaf','bootstrap','rmse','r_squared']
    else:
        cols = ['n_estimators','max_depth','min_child_weight','learning_rate','subsample','reg_lambda','reg_alpha','accuracy','precision','sensitivity']
    
    df  = pd.DataFrame(columns = cols)
    
    print("Parallel processing being used for all tuning methods except TPOT")
    for method in tqdm_notebook(tuning_methods):
        results = optimize_model(model, opt_type = method, cv = cv, n = n)
        df.loc[method] = results
        
    stop_time = timeit.default_timer()
        
    print(f"Done! Time elapsed: {round(stop_time - start_time)} seconds")
        
    return df

### Run optimization

In [5]:
# define model and pass it to wrapper
rf_model = RandomForestRegressor(random_state=0)
# optimize_model(rf_model, opt_type="tpot", cv=2, n=3)
df = wrapper(rf_model)

Parallel processing being used for all tuning methods except TPOT


HBox(children=(IntProgress(value=0, max=4), HTML(value='')))

Optimizing hyperparameters with GridSearchCV...
Optimizing hyperparameters with RandomSearchCV...
Optimizing hyperparameters with Bayesian Optimization...
Optimizing hyperparameters with TPOT...


HBox(children=(IntProgress(value=0, description='Optimization Progress', max=90, style=ProgressStyle(descripti…

Generation 1 - Current best internal CV score: 0.43028051129919564
Generation 2 - Current best internal CV score: 0.4333338187237798
Generation 3 - Current best internal CV score: 0.4333338187237798
Generation 4 - Current best internal CV score: 0.43703229676507416
Generation 5 - Current best internal CV score: 0.43703229676507416

Best pipeline: RandomForestRegressor(input_matrix, bootstrap=True, max_features=0.525, min_samples_leaf=1, min_samples_split=11, n_estimators=150)

Done! Time elapsed: 64 seconds


### View Results

In [6]:
df

Unnamed: 0,n_estimators,max_features,min_samples_split,min_samples_leaf,bootstrap,rmse,r_squared
gridcv,80,0.525,11,1,True,53.462674,0.54632
randomcv,131,0.632398,2,5,True,51.525567,0.578601
bayes,150,0.577421,2,20,True,53.692206,0.542416
tpot,150,0.525,11,1,True,53.552177,0.5448


### Summary:
<font color = "blue"> *** 5 points: </font>

I ran this model for much longer

# Problem 2 - Optimize XGBoost Classifier

### Find optimized hyperparameters for an xgboost classifier model. 

This problem contains 5 parts.


### Notes:

#### About the data
The first cell below loads a subset of the loans default data from DS705 and your job is to predict whether a loan defaults or not.  The `status_bad` column is the target column and a 1 indicates a loan that defaulted.  We have selected a subset of the original data that includes 2000 each of good and bad loans.  The data has already been cleaned and encoded.  You're welcome to look into a different dataset, but start by getting this working and then add your own data.

#### This is classification, not regression
The score for each model will be accuracy and not MSE.  Your summary table should include accuracy, sensitivity, and precision for each optimized model applied to the test data.  (<a href="https://classeval.wordpress.com/introduction/basic-evaluation-measures/">Here is a nice overview of metrics for binary classification data</a>) that includes definitions of accuracy and such.

For the models you'll mostly just need to change 'regressor' to 'classifier', e.g. `XGBClassifier` instead of `XGBRegressor`.


Hyperparameter | Type | Default Value | Typical Range
---- | ---- | ---- | ----
n_estimators | discrete / integer | 100 | 50 to 150
max_depth | discrete / integer | 3| 1 to 10
min_child_weight | discrete / integer | 1 | 1 to 20
learning_rate | continuous / float | 0.1 | 0.001 to 1
sub_sample | continuous / float | 1 | 0.05 to 1
reg_lambda | continuous / float | 1 | 0 to 5
reg_alpha  | continuous / float | 0 | 0 to 5

## Part 1: Loading the data

In [7]:
# Do not change this cell for loading and preparing the data
import pandas as pd
import numpy as np

X = pd.read_csv('./data/loans_subset.csv')

# split into predictors and target
# convert to numpy arrays for xgboost, OK for other models too
y = np.array(X['status_Bad']) # 1 for bad loan, 0 for good loan
x = np.array(X.drop(columns = ['status_Bad']))

# split into test and training data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=0) # notice the lower case x/y labels

## Part 2

Write a function called `my_classifier_results` modeled after `my_regression_results` that applies a model to the test data and prints out the accuracy, sensitivity, precision, and the confusion matrix.  There is no need to make a plot.

<font color = "blue"> *** 5 points - (don't delete this cell) </font>

In [8]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

def my_classifier_results(model, x_test = x_test, y_test = y_test):
    y_pred = model.predict(x_test)    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average="weighted")
    sensitivity = recall_score(y_test, y_pred, average="weighted")    
    print(f"Accuracy: {accuracy}, precision: {round(precision,4)}, sensitivity: {round(sensitivity,4)}\n")
    cmtx = pd.DataFrame(
        confusion_matrix(y_test, y_pred, labels=[1,0]), 
        index=['true:bad', 'true:good'], 
        columns=['pred:bad','pred:good']
    )
    print(f"{cmtx}\n")

## Part 3

Start by training some baseline models using default values of the hyperparameters.  We've included logistic regression in a cell below to get you started.  Use `LogisticRegression`, `RandomForestClassifier`, and `GaussianNB` (Gaussian Naive Bayes) from `sklearn`.  Also use `XGBClassifier` from `xgboost` where you may need to include `objective="binary:logistic"` as an option. The default scoring method for all of the `sklearn` classifiers is accuracy. Apply `my_classifier_results` to the test data for each model.

<font color = "blue"> *** 10 points - (don't delete this cell) </font>

In [9]:
from sklearn.linear_model import LogisticRegression

rf_classifier = RandomForestClassifier()
log_classifier = LogisticRegression(solver='lbfgs',max_iter=1000)
gnb_classifier = GaussianNB()
xgbr_classifier = xgb.XGBClassifier(objective="binary:logistic")

model_names = ["RandomForest classifier","LogisticRegression classifier", "GaussianBN classifier", "XGB classifier"]
models = [rf_classifier, log_classifier, gnb_classifier, xgbr_classifier]

for model, name in zip(models, model_names):
    print(f"Fitting {name}\n")
    model.fit(x_train, y_train)
    my_classifier_results(model)

Fitting RandomForest classifier

Accuracy: 0.6575, precision: 0.6584, sensitivity: 0.6575

           pred:bad  pred:good
true:bad        119         78
true:good        59        144

Fitting LogisticRegression classifier

Accuracy: 0.5475, precision: 0.5507, sensitivity: 0.5475

           pred:bad  pred:good
true:bad        126         71
true:good       110         93

Fitting GaussianBN classifier

Accuracy: 0.56, precision: 0.5851, sensitivity: 0.56

           pred:bad  pred:good
true:bad        160         37
true:good       139         64

Fitting XGB classifier

Accuracy: 0.6625, precision: 0.6627, sensitivity: 0.6625

           pred:bad  pred:good
true:bad        132         65
true:good        70        133



## Part 4

Now use the four hyperparameter optimization techniques on `XGBClassifier` and TPOT general model optimization.  Apply `my_classifer_results` to the test data in each case.
* Feel free to use 3 folds instead of 5 for cross validation to speed things up. 
* Choose a very small number of iterations, population size, etc. until you're sure things are working correctly, then turn up the numbers.  General TPOT optimization will take a while (fair warning: it took about 30 minutes on my Macbook Pro with generations = 10, population_size=40, and cv=5)  
* The hyperparameters to consider for are the same as they were in the presentation , but here they are again for convenience:

<font color = "blue"> *** 10 points - (don't delete this cell) </font>

In [17]:
from sklearn.model_selection import GridSearchCV

# define the grid
params = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [2, 4, 6],
    "n_estimators": [10, 100],
    "subsample": [0.8, 1],
    "min_child_weight": [1, 3],
    "reg_lambda": [1, 3],
    "reg_alpha:": [1, 3]
}

# setup the grid search
grid_search = GridSearchCV(xgbr_classifier,
                           param_grid=params,
                           cv=3,
                           verbose=1,
                           n_jobs=-1,
                           return_train_score=True)

grid_search.fit(x_train, y_train)

Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   13.0s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   42.1s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 960 out of 960 | elapsed:  1.8min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='warn', n_jobs=-1,
             param_grid={'learning_rate': [0.01, 0.1], 'max_depth': [2, 4, 6],
                         'min_child_weight': [1, 3], 'n_estimators': [10, 100],
          

In [22]:
# define model and pass it to wrapper
# optimize_model(xgbr_classifier, opt_type="gridcv", cv=3, n=2)
df = wrapper(xgbr_classifier, cv=3, n=2)

Parallel processing being used for all tuning methods except TPOT


HBox(children=(IntProgress(value=0, max=4), HTML(value='')))

Optimizing hyperparameters with GridSearchCV...
Fitting 3 folds for each of 192 candidates, totalling 576 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.0s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   17.4s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   52.1s
[Parallel(n_jobs=-1)]: Done 576 out of 576 | elapsed:  1.2min finished


Optimizing hyperparameters with RandomSearchCV...
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    3.6s finished


Optimizing hyperparameters with Bayesian Optimization...
Optimizing hyperparameters with TPOT...


HBox(children=(IntProgress(value=0, description='Optimization Progress', max=6, style=ProgressStyle(descriptio…

Generation 1 - Current best internal CV score: 0.4997222222222222
Generation 2 - Current best internal CV score: 0.4997222222222222
Generation 3 - Current best internal CV score: 0.4997222222222222
Generation 4 - Current best internal CV score: 0.6483333333333333
Generation 5 - Current best internal CV score: 0.6483333333333333

Best pipeline: XGBClassifier(input_matrix, learning_rate=0.001, max_depth=1, min_child_weight=20, n_estimators=150, nthread=1, objective=binary:logistic, reg_alpha=1, reg_lambda=5, subsample=1.0)

Done! Time elapsed: 155 seconds


In [23]:
df

Unnamed: 0,n_estimators,max_depth,min_child_weight,learning_rate,subsample,reg_lambda,reg_alpha,accuracy,precision,sensitivity
gridcv,50.0,1.0,1.0,1.0,1.0,0.0,,0.63,0.630163,0.63
randomcv,70.0,8.0,5.0,0.183223,0.786292,0.0,,0.6025,0.602415,0.6025
bayes,150.0,1.0,1.0,0.105238,0.778466,1.214367,1.691005,0.6575,0.657458,0.6575
tpot,150.0,1.0,20.0,0.001,1.0,5.0,1.0,0.63,0.635929,0.63


## Fit TPOT AutoML

In [24]:
tpot = TPOTClassifier(generations=5,
                     population_size=5,
                     verbosity=2,
                     cv=2,
                     random_state=8675309)

tpot.fit(x_train, y_train)
print(tpot.score(x_test, y_test))
tpot.export('tpot_optimal_pipeline.py')

HBox(children=(IntProgress(value=0, description='Optimization Progress', max=30, style=ProgressStyle(descripti…

Generation 1 - Current best internal CV score: 0.655833381430056
Generation 2 - Current best internal CV score: 0.655833381430056
Generation 3 - Current best internal CV score: 0.6566668693416263
Generation 4 - Current best internal CV score: 0.6566668693416263
Generation 5 - Current best internal CV score: 0.6616674881689778

Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.01, max_depth=5, max_features=0.55, min_samples_leaf=15, min_samples_split=7, n_estimators=100, subsample=0.2)
0.655


In [25]:
my_classifier_results(tpot)

Accuracy: 0.655, precision: 0.6567, sensitivity: 0.655

           pred:bad  pred:good
true:bad        137         60
true:good        78        125



## Part 5 - Summary

* In addition to your summary table, answer:
    * If the bank is primarily interested in correctly identifying loans that are truly bad, then which model should they use?  Why?

<font color = "blue"> *** 5 points - (don't delete this cell) </font>