# Fit the model developed in (LINK NOTEBOOK) and make Kaggle submission 

### Issues preventing full incorporation into one pipeline
- the current `feature_engineer_aggregate_individuals` has to run after the other feature engineering, but inside a pipeline it cannot operate on the NumPy array produced by the feature_extraction `FeatureUnion`
- between feature engineering and the other transformations, I want to subset to heads of household (otherwise these transformers will use non-heads in fitting)
- if I'm using early stopping, I need to split the data into train and validation sets (which would have to happen after the transformations...)

### Why that matters
- Hyperopt tuning on entire process
- Using `pickle` to save fitted model should it need to be used in production
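
To make the `pickle` point concrete, here is a minimal sketch of why a single pipeline object is convenient: the fitted transformers and the estimator round-trip through `pickle` together. The data, step names, and `LogisticRegression` stand-in are assumptions for illustration, not this project's model.

```python
import pickle

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the competition data)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 5))
X_demo[rng.rand(100, 5) < 0.1] = np.nan   # sprinkle in missing values
y_demo = rng.randint(0, 2, size=100)

demo_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', LogisticRegression()),
])
demo_pipeline.fit(X_demo, y_demo)

# Serialize the fitted pipeline (transformers + estimator) as one object
payload = pickle.dumps(demo_pipeline)
restored = pickle.loads(payload)

# The restored pipeline imputes and predicts like the original
same_predictions = np.array_equal(demo_pipeline.predict(X_demo),
                                  restored.predict(X_demo))
print(same_predictions)
```

If any preprocessing lives outside the pipeline, it has to be re-implemented (and kept in sync) wherever the pickled model is loaded.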

## Things I've learned: 
- always be precise in the problem you're trying to solve. For example, when figuring out how to best use kfold for xgboost early stopping validation, should I be doing this as a custom estimator, or doing the whole pipeline for the subset of the data? (isn't this just bagging?) 

# READ
- http://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/
- https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de
- http://scikit-learn.org/stable/modules/ensemble.html
- https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

# Questions
- should I be bagging the XGBoost classifier, or stacking multiple? 
- using `BaggingClassifier` can you implement early stopping with oob? 
- should you bag AND stack? 


### Things I'm figuring out
- how/if to use early stopping within kfold cross validation (rather than just one validation set). **if I do this please please show your work in a notebook**
    - WHAT DO I WANT? I want a pipeline step that runs xgboost with early stopping  on a kfold and uses soft voting to combine, right? 
    - OR do I want BAGGING? (e.g. `BaggingClassifier`) - the oob could theoretically be used for early stopping, right? 
    - OR, do I want a regular pipeline that is then fit e.g. 5 times for a kfold with 5 folds, and then combine the predictions? This seems slower, but avoids leakage etc and leverages the pipeline tools (e.g. using the different folds for every pipeline step). BUT then it wouldn't actually help make better predictions when pipeline.fit is called (and hence doesn't actually help hyperparameter tuning). ISN'T THIS JUST STACKING (http://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/), but does it allow hyperparameter optimization?
    - OR is it better to just have the one xgboost classifier in the pipeline, but then do stacking of diverse estimators? https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/ 
        - may need a custom pipeline object. I've seen multiple people iterate over kfoldcv and aggregate the predictions (using soft voting). Examples include https://www.kaggle.com/sudosudoohio/stratified-kfold-xgboost-eda-tutorial-0-281 and https://www.kaggle.com/willkoehrsen/a-complete-introduction-and-walkthrough 
       - Here's sample code to create a custom estimator: http://danielhnyk.cz/creating-your-own-estimator-scikit-learn/
       - The sklearn `GradientBoostingClassifier` has a built-in `validation_fraction`, which seems like it could more easily be put into a pipeline, though this notebook makes it seem like it doesn't actually improve test scores? http://scikit-learn.org/dev/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html
       - Here's a related discussion on using just the transformers from the pipeline to transform data (which can then be used by the xgboost classifier). https://github.com/scikit-learn/scikit-learn/issues/8414 (note: make sure it hasn't been implemented by sklearn already; I recall that pipeline.transform may inherently call all but the estimator)
       - Probably unrelated, but here's a lengthy pipeline for titanic data: https://github.com/mratsim/MachineLearning_Kaggle/blob/master/Kaggle%20-%20001%20-%20Titanic%20Survivors/Kaggle-001-Python-MagicalForest.py#L526
- how/if to use Hyperopt within kfold cross validation (rather than just one validation set). Granted, this would be much slower. Would doing a proper splitting of the xgboost model help this, given it has to use a set of parameters to train for multiple different cv sets? 
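
As a point of comparison for the options above, here is a minimal sketch of the simplest one: sklearn's `GradientBoostingClassifier` with built-in early stopping (`validation_fraction` plus `n_iter_no_change`) as the last step of an ordinary pipeline. The synthetic data, the step names, and `MAX_ROUNDS` are assumptions for illustration. The validation rows are carved out inside `fit`, after the transformers have run, so no manual train/validation split is needed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic multiclass stand-in data (not the competition data)
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

MAX_ROUNDS = 300  # upper bound; early stopping usually ends sooner
early_stop_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', GradientBoostingClassifier(
        n_estimators=MAX_ROUNDS,
        validation_fraction=0.2,   # held out internally during fit
        n_iter_no_change=10,       # patience, in boosting rounds
        random_state=0)),
])

# Every fold refits the whole pipeline, early stopping included
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(early_stop_pipeline, X_demo, y_demo,
                            cv=cv, scoring='f1_macro')

early_stop_pipeline.fit(X_demo, y_demo)
rounds_used = early_stop_pipeline.named_steps['model'].n_estimators_
print(f'F1 macro = {cv_scores.mean():.3f} (std {cv_scores.std():.3f}), '
      f'rounds used = {rounds_used}')
```

The trade-off is that the internal validation split is a single random holdout, not a k-fold or out-of-bag set, so it does not answer the bagging question by itself.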


### Important note: 
My goal with this repository is **not to get the best Kaggle score**. I know, crazy, right? I'm more interested in learning best practices, such as building one pipeline for the entire model. Most if not all of the leading kernels (such as https://www.kaggle.com/willkoehrsen/a-complete-introduction-and-walkthrough) may do pipelines for a couple steps but not for the whole model. Indeed, I scored higher when I did all the transformers (including feature selection) once then scored cross validation on just the final estimator (rather than the entire pipeline).  
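
One possible explanation for that last observation (an assumption on my part, not something verified on this dataset) is selection leakage: when feature selection sees every row before cross-validation, the held-out folds have already influenced which features survive. The sketch below shows the effect on pure-noise data, where the labels are unrelated to the features, so an honest estimate should hover around chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X_noise = rng.normal(size=(100, 2000))   # pure noise features
y_noise = rng.randint(0, 2, size=100)    # labels unrelated to X_noise
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# (a) Select features once on ALL rows, then cross-validate only the estimator
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_scores = cross_val_score(LogisticRegression(), X_selected, y_noise, cv=cv)

# (b) Cross-validate the whole pipeline, so selection is refit inside each fold
honest_pipeline = Pipeline(steps=[
    ('selector', SelectKBest(f_classif, k=20)),
    ('model', LogisticRegression()),
])
honest_scores = cross_val_score(honest_pipeline, X_noise, y_noise, cv=cv)

print(f'select-then-CV accuracy: {leaky_scores.mean():.3f}')
print(f'pipeline CV accuracy:    {honest_scores.mean():.3f}')
```

If the same pattern holds here, the higher score from transforming once is optimistic rather than better, which is another argument for keeping everything in one pipeline.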

## Lessons
- **Feature engineering:** 
    - Hand-picking features that seemed logical hurt the model; the best approach seemed to be making many features and automating feature selection (or using a properly regularized estimator)
- **Feature selection:** 
    - `RFECV` seems to be super sensitive to settings, including the classifier used and hyperparameters therein. Even within cross-validation folds (ie, using the same classifier and hyperparameters), I saw between 13 and 143 features selected. This merits further investigation before using in a final pipeline.
    - `SelectFromModel` is much quicker and more consistent than `RFECV`, especially given that it exposes an important hyperparameter (`threshold`) for hyperparameter tuning. User note: make sure your feature importances are scaled appropriately for your `threshold`. Oddly, using the final estimator (`BaggedLGBMClassifier`) as the `SelectFromModel` classifier resulted in worse cross-validation scores.
- **Early stopping**
    - This was key, though it still results in overfitting
- **Bagging**
    - For simple (and untuned) classification algorithms, bagging had inconsistent results on bias and variance. However, for my implementation of `BaggedLGBMClassifier`, it significantly reduced both bias and variance. Hooray!
    - I ended up writing a **custom sklearn estimator**, `BaggedLGBMClassifier`, because I couldn't figure out a way to use bagging *and* early_stopping with the sklearn API (e.g. `BaggingClassifier`). My implementation uses bagging of 5 `LGBMClassifier` estimators whose early_stopping is determined using the unsampled (aka "out-of-bag") observations as validation set (since bootstrapped sampling of the data leaves ~37% of the data unsampled). 
- **Hyperparameter tuning** 
    - I had success using `Hyperopt` - while it takes a little more setup than `GridSearchCV`, I found it to be very useful. 
    - *Question*: I'm curious whether using the base implementation of `Hyperopt` can be dangerous. Unlike `GridSearchCV`, it uses only one train/test split of the data to find optimal hyperparameters - couldn't this lead to overfitting? I'm hoping that my use of bagging for the final estimator will avoid or mitigate potential overfitting. 
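
A quick check of the ~37% out-of-bag figure from the bagging lesson above: the chance that a given row is never drawn in n draws with replacement is (1 - 1/n)^n, which tends to 1/e ≈ 0.368. The row count below is arbitrary, chosen only for illustration.

```python
import numpy as np

n_rows = 10_000
rng = np.random.RandomState(0)

# Bootstrap sample of row positions: same size as the data, with replacement
boot_idx = rng.randint(0, n_rows, size=n_rows)

# Rows that were never drawn are the out-of-bag rows
oob_idx = np.setdiff1d(np.arange(n_rows), boot_idx)
oob_fraction = len(oob_idx) / n_rows

# Theory: P(row never drawn) = (1 - 1/n)^n, which approaches 1/e
expected_fraction = (1 - 1 / n_rows) ** n_rows
print(f'simulated: {oob_fraction:.3f}, '
      f'expected: {expected_fraction:.3f}, 1/e: {np.exp(-1):.3f}')
```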
    
    
## Next steps
- I haven't yet used sophisticated insights tools for my `BaggedLGBMClassifier`, such as permutation importance, partial dependence plots, and SHAP values. Here is a notebook where I've used these tools for a simple Random Forest classifier: https://github.com/zwrankin/chicago_bicycle_share/blob/master/notebooks/2018_09_24_initial_data_exploration_and_models.ipynb
- One of the obstacles to implementing these tools is figuring out how to get feature names out of pipelines with feature selection. 
- Build ensembles with `brew` (https://pypi.org/project/brew/) or `mlens` (https://github.com/flennerhag/mlens)
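
On the feature-names obstacle above, one approach that may work is to read the selector's boolean mask with `get_support()` and apply it to the input column names. This is a minimal sketch on made-up data with hypothetical step and column names. It assumes every step before the selector keeps the column count and order (true for mean imputation unless a column is entirely missing).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up data: only feat_2 and feat_4 carry signal
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 6)),
                      columns=[f'feat_{i}' for i in range(6)])
y_demo = (X_demo['feat_2'] + X_demo['feat_4'] > 0).astype(int)

demo_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('selector', SelectKBest(f_classif, k=2)),
    ('model', LogisticRegression()),
])
demo_pipeline.fit(X_demo, y_demo)

# Boolean mask over the selector's input columns, in the original order
mask = demo_pipeline.named_steps['selector'].get_support()
selected_names = X_demo.columns[mask].tolist()
print(selected_names)
```

With a `FeatureUnion` upstream the column order is no longer the input order, so the names would have to be tracked through that step as well.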
  
## Notebook



In [1]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

%matplotlib inline

In [2]:
from sklearn.preprocessing import Imputer, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, RFECV
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from xgboost import XGBClassifier
import lightgbm as lgb

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

In [38]:
# Tools for developing code
%load_ext autoreload 
%autoreload 2

# Add library to path 
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from lib.model import kfold, f1_scorer
from lib.model import load_and_process_training_data, load_and_process_test_data
from lib.visualization import report_cv_scores
from lib.visualization import plot_learning_curve, plot_feature_importances


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load processed training data & tuned hyperparameters
(LINK TO NOTEBOOK)

In [4]:
train_X = pd.read_hdf('../models/training.hdf', key='train_X')
train_y = pd.read_hdf('../models/training.hdf', key='train_y')
val_X = pd.read_hdf('../models/training.hdf', key='val_X')
val_y = pd.read_hdf('../models/training.hdf', key='val_y')

features = train_X.columns

# Try bagging of base estimators using `BaggingClassifier`
Initial results show that improvements in bias and variance are inconsistent across estimators and depend on the fraction of samples and features given to each bagged estimator

In [5]:
# For methods we don't need to do early_stopping
X = pd.concat([train_X, val_X]).reset_index().drop('index', axis=1)
y = pd.concat([train_y, val_y]).reset_index().drop('index', axis=1)

In [18]:
%%time
# Create classifiers
rf = RandomForestClassifier(random_state=1)
et = ExtraTreesClassifier(random_state=1)
knn = KNeighborsClassifier()
svc = SVC(random_state=1)
rg = RidgeClassifier(random_state=1)
clf_array = [rf, et, knn, svc, rg]
for clf in clf_array:
    vanilla_scores = cross_val_score(clf, X, y, cv=kfold, scoring=f1_scorer, n_jobs=-1)
    bagging_clf = BaggingClassifier(clf, max_samples=0.5, max_features=1.0, n_estimators=10, bootstrap=True, random_state=1)
    bagging_scores = cross_val_score(bagging_clf, X, y, cv=kfold, scoring=f1_scorer, n_jobs=-1)
    
    print(f"Mean of: {vanilla_scores.mean():.3f}, std: (+/-) {vanilla_scores.std():.3f} [{clf.__class__.__name__}]")
    print(f"Mean of: {bagging_scores.mean():.3f}, std: (+/-) {bagging_scores.std():.3f} [Bagging {clf.__class__.__name__}]\n")

Mean of: 0.374, std: (+/-) 0.026 [RandomForestClassifier]
Mean of: 0.357, std: (+/-) 0.022 [Bagging RandomForestClassifier]

Mean of: 0.355, std: (+/-) 0.013 [ExtraTreesClassifier]
Mean of: 0.337, std: (+/-) 0.020 [Bagging ExtraTreesClassifier]

Mean of: 0.366, std: (+/-) 0.022 [KNeighborsClassifier]
Mean of: 0.326, std: (+/-) 0.010 [Bagging KNeighborsClassifier]

Mean of: 0.324, std: (+/-) 0.021 [SVC]
Mean of: 0.322, std: (+/-) 0.026 [Bagging SVC]

Mean of: 0.314, std: (+/-) 0.013 [RidgeClassifier]
Mean of: 0.310, std: (+/-) 0.012 [Bagging RidgeClassifier]

Wall time: 40.5 s


In [6]:
%%time
# Create classifiers
rf = RandomForestClassifier(random_state=1)
et = ExtraTreesClassifier(random_state=1)
knn = KNeighborsClassifier()
svc = SVC(random_state=1)
rg = RidgeClassifier(random_state=1)
clf_array = [rf, et, knn, svc, rg]
for clf in clf_array:
    vanilla_scores = cross_val_score(clf, X, y, cv=kfold, scoring=f1_scorer, n_jobs=-1)
    bagging_clf = BaggingClassifier(clf, max_samples=1.0, max_features=1.0, n_estimators=10, bootstrap=True, random_state=1)
    bagging_scores = cross_val_score(bagging_clf, X, y, cv=kfold, scoring=f1_scorer, n_jobs=-1)
    
    print(f"Mean of: {vanilla_scores.mean():.3f}, std: (+/-) {vanilla_scores.std():.3f} [{clf.__class__.__name__}]")
    print(f"Mean of: {bagging_scores.mean():.3f}, std: (+/-) {bagging_scores.std():.3f} [Bagging {clf.__class__.__name__}]\n")

Mean of: 0.374, std: (+/-) 0.026 [RandomForestClassifier]
Mean of: 0.361, std: (+/-) 0.013 [Bagging RandomForestClassifier]

Mean of: 0.355, std: (+/-) 0.013 [ExtraTreesClassifier]
Mean of: 0.344, std: (+/-) 0.015 [Bagging ExtraTreesClassifier]

Mean of: 0.366, std: (+/-) 0.022 [KNeighborsClassifier]
Mean of: 0.336, std: (+/-) 0.027 [Bagging KNeighborsClassifier]

Mean of: 0.324, std: (+/-) 0.021 [SVC]
Mean of: 0.342, std: (+/-) 0.018 [Bagging SVC]

Mean of: 0.314, std: (+/-) 0.013 [RidgeClassifier]
Mean of: 0.320, std: (+/-) 0.007 [Bagging RidgeClassifier]

Wall time: 44.8 s


## Manually implement bagging (with early stopping) for boosted estimator
(I couldn't seem to manage this in the native APIs, but seems like there should be a way)

### Conclusion: With soft voting over 5 identically configured classifiers, each fit to its own bootstrap sample, the bagged ensemble has a higher mean F1 and a lower standard deviation across folds than any of the component estimators. Hooray!

In [9]:
tuned_params = pickle.load(open("../models/tuned_params.p", "rb"))
EARLY_STOPPING_ROUNDS = 10
tuned_params

{'boosting_type': 'dart',
 'colsample_bytree': 0.5796397953791418,
 'learning_rate': 0.08739537002929919,
 'min_child_samples': 15,
 'num_leaves': 48,
 'reg_alpha': 0.4239159481112283,
 'reg_lambda': 0.36419362906439723,
 'subsample_for_bin': 40000,
 'subsample': 0.986210861412967,
 'class_weight': 'balanced',
 'limit_max_depth': 1,
 'max_depth': 22}

In [6]:
from sklearn.utils import resample

In [35]:
def bag_and_boost_model(X, y, random_state):
    """
    Fits a gradient boosted model to bootstrapped data, and uses the unsampled (aka "out-of-bag") observations
    as a validation set for early stopping
    Different than sklearn's native BaggingClassifier because it allows early stopping  
    """
    # Note that due to bootstrapping, sample_fraction=1 still leaves ~37% of data in the validation set 
    X_train, y_train = resample(X, y, n_samples=len(X), replace=True, random_state=random_state) 
    valid_idx = [i for i in X.index if i not in X_train.index]
    X_valid = X.loc[valid_idx]
    y_valid = y.loc[valid_idx]
    
    fit_params = {"eval_set": [(X_valid, y_valid)], 
              "early_stopping_rounds": EARLY_STOPPING_ROUNDS, 
              "verbose": False}
    
    clf = lgb.LGBMClassifier(**tuned_params, objective = 'multiclass', 
                               n_jobs = -1, n_estimators = 10000,
                               random_state = 10)
 
    return clf.fit(X_train, y_train, **fit_params) 

In [81]:
%%time
%%capture --no-stdout

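def stacked_clfs_predict(clfs, X_test):
    # Soft voting: average the class probabilities across the bagged classifiers
    probs = np.mean([clf.predict_proba(X_test) for clf in clfs], axis=0)
    # Map each argmax column back to its class label (assumes all clfs share the same classes_)
    return clfs[0].classes_[np.argmax(probs, axis=1)]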

def score_predictions(predictions, y_test, scores):
    score = f1_score(y_test, predictions, average = 'macro')
    scores.append(score)
    return scores
    
def print_cv_scores(clf_name, scores):
    print(f'{clf_name} Cross Validation F1 Score = {round(np.mean(scores), 4)} with std = {round(np.std(scores), 4)}')
    
scores, clf1_scores, clf2_scores, clf3_scores, clf4_scores, clf5_scores = [[] for _ in range(6)]
np.random.seed(0)

for i, (train_index, test_index) in enumerate(kfold.split(X, y)):
    # print(f'[Fold {i + 1}/5]')
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test= y.iloc[train_index], y.iloc[test_index]

    clf1 = bag_and_boost_model(X_train, y_train, random_state=i*100 + 1)
    clf2 = bag_and_boost_model(X_train, y_train, random_state=i*100 + 2)
    clf3 = bag_and_boost_model(X_train, y_train, random_state=i*100 + 3)
    clf4 = bag_and_boost_model(X_train, y_train, random_state=i*100 + 4)
    clf5 = bag_and_boost_model(X_train, y_train, random_state=i*100 + 5)
    
    stacked_predictions = stacked_clfs_predict([clf1, clf2, clf3, clf4, clf5], X_test)

    scores = score_predictions(stacked_predictions, y_test, scores)
    clf1_scores = score_predictions(clf1.predict(X_test), y_test, clf1_scores)
    clf2_scores = score_predictions(clf2.predict(X_test), y_test, clf2_scores)
    clf3_scores = score_predictions(clf3.predict(X_test), y_test, clf3_scores)
    clf4_scores = score_predictions(clf4.predict(X_test), y_test, clf4_scores)
    clf5_scores = score_predictions(clf5.predict(X_test), y_test, clf5_scores)

# Use a separate loop variable so the ensemble's `scores` list is not overwritten
for name, clf_scores in zip(['clf1', 'clf2', 'clf3', 'clf4', 'clf5', 'ENSEMBLE'], 
                            [clf1_scores, clf2_scores, clf3_scores, clf4_scores, clf5_scores, scores]):
    print_cv_scores(name, clf_scores)

clf1 Cross Validation F1 Score = 0.401 with std = 0.0132
clf2 Cross Validation F1 Score = 0.4065 with std = 0.0134
clf3 Cross Validation F1 Score = 0.4077 with std = 0.0126
clf4 Cross Validation F1 Score = 0.4096 with std = 0.0224
clf5 Cross Validation F1 Score = 0.3957 with std = 0.0215
ENSEMBLE Cross Validation F1 Score = 0.4166 with std = 0.0123
Wall time: 21.3 s


## Goal - `BaggedLGBMClassifier` Pipeline estimator
- fit(X, y)
- predict(X)
- feature_importances_ (ideally...)

## REMEMBER - if it's in a pipeline it'll get an array, not a DataFrame
## Am I breaking sklearn conventions by having `fit_params` be normal params?
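
To make the goal concrete, here is a minimal sketch of an estimator with that interface. It is NOT the real `BaggedLGBMClassifier` from `lib.pipeline`: the class name, its parameters, and the helper `proba_at_stage` are hypothetical, and sklearn's `GradientBoostingClassifier` stands in for LightGBM. Instead of true early stopping it trains all rounds and then keeps the stage with the lowest out-of-bag log loss. It also assumes every bootstrap sample contains every class. The settings live in `__init__`, which is also how sklearn's own estimators expose `n_iter_no_change`.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss


def proba_at_stage(clf, X, n_rounds):
    """Class probabilities using only the first n_rounds boosting stages."""
    for stage, proba in enumerate(clf.staged_predict_proba(X), start=1):
        if stage == n_rounds:
            return proba
    raise ValueError('n_rounds exceeds the number of fitted stages')


class BaggedBoostingSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical sketch: bagged boosting with out-of-bag stage selection."""

    def __init__(self, n_bags=5, max_rounds=200, random_state=0):
        # Tunable settings live in __init__ so get_params/set_params/clone work
        self.n_bags = n_bags
        self.max_rounds = max_rounds
        self.random_state = random_state

    def fit(self, X, y):
        # Inside a Pipeline X arrives as an array, so index by position only
        X, y = np.asarray(X), np.asarray(y).ravel()
        self.classes_ = np.unique(y)
        rng = np.random.RandomState(self.random_state)
        self.estimators_, self.best_rounds_ = [], []
        n_rows = len(X)
        for _ in range(self.n_bags):
            boot = rng.randint(0, n_rows, size=n_rows)     # bootstrap rows
            oob = np.setdiff1d(np.arange(n_rows), boot)    # out-of-bag rows
            clf = GradientBoostingClassifier(
                n_estimators=self.max_rounds,
                random_state=int(rng.randint(0, 2**31 - 1)))
            clf.fit(X[boot], y[boot])
            # Score every boosting stage on the out-of-bag rows, keep the best
            losses = [log_loss(y[oob], proba, labels=clf.classes_)
                      for proba in clf.staged_predict_proba(X[oob])]
            self.estimators_.append(clf)
            self.best_rounds_.append(int(np.argmin(losses)) + 1)
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        # Soft voting: average the probabilities of the bagged models
        return np.mean(
            [proba_at_stage(clf, X, n_rounds)
             for clf, n_rounds in zip(self.estimators_, self.best_rounds_)],
            axis=0)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

    @property
    def feature_importances_(self):
        # Averaged over bags; uses all trained stages, not only the best ones
        return np.mean([clf.feature_importances_ for clf in self.estimators_],
                       axis=0)


# Synthetic multiclass stand-in data (not the competition data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, n_classes=3,
                                     random_state=0)
model = BaggedBoostingSketch(n_bags=3, max_rounds=50, random_state=1)
model.fit(X_demo, y_demo)
proba = model.predict_proba(X_demo)
preds = model.predict(X_demo)
print(model.best_rounds_)
```

Because the sketch only uses positional indexing, it behaves the same whether it receives a DataFrame directly or an array from an upstream pipeline step.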

In [128]:
from lib.pipeline import BaggedLGBMClassifier

In [163]:
model = BaggedLGBMClassifier(random_state=1)

In [164]:
%%time
%%capture --no-stdout

scores = []
np.random.seed(0)

for i, (train_index, test_index) in enumerate(kfold.split(X, y)):
    # print(f'[Fold {i + 1}/5]')
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test= y.iloc[train_index], y.iloc[test_index]

    model.fit(X_train, y_train)
    scores = score_predictions(model.predict(X_test), y_test, scores)

print_cv_scores('Model', scores)

Model Cross Validation F1 Score = 0.4184 with std = 0.0209
Wall time: 22.3 s


## Make a simple pipeline to see if my `BaggedLGBMClassifier` can be used easily in one

In [5]:
X_data, y_data = load_and_process_training_data()
X_data.shape
y_data = y_data - 1 #try normalizing to 0 to avoid bugs

In [174]:
%%time
%%capture --no-stdout

pipeline = Pipeline(steps=[
                        ('imputer', Imputer(strategy='mean')),
                        ('model', BaggedLGBMClassifier()),
                        ])

pipeline.fit(X, y)

  assert (y.index == X.index, 'X and y indices do not match')


Wall time: 5.94 s


In [9]:
# del(pipeline)
from lib.model import pipeline

In [10]:
%%time
%%capture --no-stdout

pipeline.fit(X_data, y_data)

KeyboardInterrupt: 

Wall time: 44.6 s


In [11]:
%%time
transformer_pipeline = Pipeline(pipeline.steps[:-1])
classifier = pipeline.steps[-1][1]  # the final estimator itself; Pipeline() expects a list of (name, step) tuples
transformed = transformer_pipeline.fit_transform(X_data, y_data)

Wall time: 1min 27s


In [24]:
transformed.shape
X_transformed = transformed

In [26]:
X_transformed.shape

(2973, 38)

In [27]:
y_data.shape

(2973,)

In [31]:
scores = []
classifier = pipeline.steps[-1][1]
classifier.fit(X_transformed, y_data)
scores = score_predictions(classifier.predict(X_transformed), y_data, scores)

  y = column_or_1d(y, warn=True)


In [59]:
scores

[0.4315401291050065,
 0.36638650684008356,
 0.3362760034627622,
 0.21534631507079502,
 0.39413173981216787]

In [13]:
%%time
cv_score = cross_val_score(pipeline, X_data, y_data, cv=kfold, scoring=f1_scorer, n_jobs=-1)
print(f'Cross Validation F1 Score = {round(cv_score.mean(), 4)} with std = {round(cv_score.std(), 4)}')

Cross Validation F1 Score = 0.3455 with std = 0.0557
Wall time: 5min 57s


### So when I used the already-transformed data, I got a good CV score.
When I use the full pipeline on raw data, I get a bad CV score (highly variable across folds: some high, some low).
Here, let's see if I can isolate the issue.
I'm starting with `SelectFromModel` because it's much quicker.
- Next I tried `RFECV`, even though it's slow, hoping it would give better results (since that's what created the saved transformed data that gives good, low-std CV scores). In fact, it chooses anywhere from 3 to 143 features and gives bad scores.
- Oddly, using the `BaggedLGBMClassifier` in `SelectFromModel` gave very poor scores.

# DOCUMENT THIS STUFF

In [4]:
def score_predictions(predictions, y_test, scores):
    score = f1_score(y_test, predictions, average = 'macro')
    scores.append(score)
    return scores
    
def print_cv_scores(clf_name, scores):
    print(f'{clf_name} Cross Validation F1 Score = {round(np.mean(scores), 4)} with std = {round(np.std(scores), 4)}')
    

In [40]:
X_data, y_data = load_and_process_training_data()
X_data.shape
y_data = y_data - 1 #try normalizing to 0 to avoid bugs

In [9]:
# train_X = pd.read_hdf('../models/training.hdf', key='train_X')
# train_y = pd.read_hdf('../models/training.hdf', key='train_y')
# val_X = pd.read_hdf('../models/training.hdf', key='val_X')
# val_y = pd.read_hdf('../models/training.hdf', key='val_y')

# X_data = pd.concat([train_X, val_X]).reset_index().drop('index', axis=1)
# y_data = pd.concat([train_y, val_y]).reset_index().drop('index', axis=1)

In [30]:
# I'm always getting the highest 
folds = [fold for fold in kfold.split(X_data, y_data)]
train_index = folds[0][0]
test_index = folds[0][1]

In [58]:
%%time
%%capture --no-stdout

scores = []
np.random.seed(0)
from lib.model import pipeline

for i, (train_index, test_index) in enumerate(kfold.split(X_data, y_data)):
    # print(f'[Fold {i + 1}/5]')
    #HACK 
#     train_index = folds[0][0]
#     test_index = folds[0][1]
    X_train, X_test = X_data.iloc[train_index], X_data.iloc[test_index]
    y_train, y_test= y_data.iloc[train_index], y_data.iloc[test_index]

    transformer_pipeline = Pipeline(pipeline.steps[:-1])
    transformer_pipeline.fit(X_train, y_train)
    X_train_t = transformer_pipeline.transform(X_train)
    X_test_t = transformer_pipeline.transform(X_test)
    print(X_train_t.shape)
    
    classifier = pipeline.steps[-1][1]
    classifier.fit(X_train_t, y_train)
    scores = score_predictions(classifier.predict(X_test_t), y_test, scores)
    
print_cv_scores('Model', scores)

  y = pd.DataFrame(y).reset_index().drop('index', axis=1)


(2377, 198)
(2377, 198)
(2379, 198)
(2379, 198)
(2380, 198)
Model Cross Validation F1 Score = 0.3487 with std = 0.0737
Wall time: 1min 10s


In [57]:
scores

[0.4294427075904234,
 0.4046136631095094,
 0.4217817676594283,
 0.3141564968296868,
 0.4166412128993996]

In [None]:
RandomForestClassifier

In [20]:
np.random.random_sample(10)

array([0.44714384, 0.96890304, 0.36316126, 0.39753349, 0.73868266,
       0.44908343, 0.93728436, 0.25723349, 0.38175981, 0.69714162])

In [6]:
%%time
%%capture --no-stdout

# NOTE - selector__threshold range seems to be from 
# 0.01 (11 features) to 0.001 (177 features), 0.005 gives 96 features
param_grid = dict(selector__threshold=[0.001, 0.007, 0.01])

grid = GridSearchCV(pipeline, param_grid=param_grid, cv=kfold, scoring=f1_scorer, n_jobs=-1)
grid.fit(X_data, y_data)

# print(grid.best_params_)
report_cv_scores(grid.cv_results_)

Model with rank: 1
Mean validation score: 0.380 (std: 0.041)
Parameters: {'selector__threshold': 0.007}

Model with rank: 2
Mean validation score: 0.379 (std: 0.040)
Parameters: {'selector__threshold': 0.01}

Model with rank: 3
Mean validation score: 0.357 (std: 0.067)
Parameters: {'selector__threshold': 0.001}

Wall time: 1min 43s
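
To see why the useful `threshold` range is so narrow, it helps to remember that tree-based importances are normalized to sum to 1, so with many features the typical importance is small. This is a minimal sketch on synthetic data. The `RandomForestClassifier` selector and the threshold values are assumptions for illustration, not the selector used in `lib.model`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 50 features, 5 of them informative
X_demo, y_demo = make_classification(n_samples=500, n_features=50,
                                     n_informative=5, random_state=0)

# Importances sum to 1, so the average importance here is 1/50 = 0.02
forest = RandomForestClassifier(n_estimators=100, random_state=0)
importances = forest.fit(X_demo, y_demo).feature_importances_

thresholds = [0.001, 0.01, 0.02, 0.05]
n_selected = []
for threshold in thresholds:
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold=threshold)
    selector.fit(X_demo, y_demo)
    # Features whose importance is >= threshold are kept
    n_selected.append(int(selector.get_support().sum()))

for threshold, count in zip(thresholds, n_selected):
    print(f'threshold={threshold}: {count} features kept')
```

`SelectFromModel` also accepts string thresholds such as `'mean'` or `'0.5*mean'`, which scale with the number of features and may be easier to tune than absolute values.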
