# **Feature Selection**

## Objectives
* Run a grid search with cross-validation to select a classification model. We will repeat this grid search several times, removing different features each time, in order to test Hypothesis 1 and construct non-trivial models.

## Inputs
* The train and test data set aside in the last notebook.
* The pipeline that was produced in the last notebook.
* Parameter values determined in the previous notebook.

## Outputs
* A choice of classification model that we will further tune and evaluate.

## Additional Comments
* We have chosen to do classification. We may yet do a regression model on the point differential.      

---

# Change working directory
We change the working directory to the project root so that the `src` package and the data folders can be found.
* We check the current directory with `os.getcwd()`

In [1]:
import os

home_dir = '/workspace/pp5-ml-dashboard'
os.chdir(home_dir)
current_dir = os.getcwd()
print(current_dir)

/workspace/pp5-ml-dashboard


We now load our training and test sets, as well as some of the packages that we will be using.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from src.utils import get_df, save_df

train_dir = 'train/csv'
X_TrainSet = get_df('X_TrainSet',train_dir)
Y_TrainSet = get_df('Y_TrainSet',train_dir)

test_dir = 'test/csv'
X_TestSet = get_df('X_TestSet',test_dir)
Y_TestSet = get_df('Y_TestSet',test_dir)

## Section 1: Full Pipeline
We will build the full pipeline here. It will accept some parameters so that we can tune it later. We also declare some constants that we established in the last notebook.

In [3]:
from sklearn.preprocessing import StandardScaler
from feature_engine import transformation as vt
from feature_engine.selection import DropFeatures, SmartCorrelatedSelection
from sklearn.pipeline import Pipeline


# Constants needed for feature engineering
META = ['season', 'play_off']
TRIVIAL = ['pts_home','pts_away','plus_minus_home']
THRESH = 0.85
TRANSFORMS = {'log_e':(vt.LogTransformer, False), 
                'log_10':(vt.LogTransformer,'10'),
                'reciprocal':(vt.ReciprocalTransformer,False), 
                'power':(vt.PowerTransformer,False),
                'box_cox':(vt.BoxCoxTransformer,False),
                'yeo_johnson':(vt.YeoJohnsonTransformer,False)}
TRANSFORM_ASSIGNMENTS = {
    'yeo_johnson': ['dreb_away', 'blk_home', 'oreb_away', 'fta_away', 'dreb_home', 
                    'ast_home', 'stl_away', 'pts_away', 'stl_home', 'reb_away',
                    'pts_home', 'fgm_away', 'oreb_home', 'pf_away', 'pf_home'],
    'box_cox': ['ast_away', 'fta_home']
                            }


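def base_pipeline(to_drop=None, thresh=THRESH):
    # Build a new list instead of extending the argument in place, so that a
    # caller's list (e.g. TRIVIAL) is not mutated across repeated calls.
    to_drop = list(dict.fromkeys([*(to_drop or []), *META]))
    pipeline = Pipeline([
        ('dropper', DropFeatures(features_to_drop=to_drop)),
        ('corr_selector', SmartCorrelatedSelection(method="pearson",
                                                   threshold=thresh,
                                                   selection_method="variance")),
    ])
    # Remember what was dropped so that later steps can skip those columns
    pipeline.to_drop = to_drop
    return pipeline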

    
def add_transformations(pipeline, transform_assignments):
    # Call this only after the base pipeline has been fit: the correlation
    # selector exposes features_to_drop_ only once it is fitted.
    dropping = set(pipeline.to_drop) | set(pipeline['corr_selector'].features_to_drop_)

    # Restrict each transformer to the columns that survive the dropping steps
    new_assignments = {key: [val for val in value if val not in dropping]
                       for key, value in transform_assignments.items()}
    for transform, targets in new_assignments.items():
        if not targets:
            continue
        pipeline.steps.append(
            (transform, TRANSFORMS[transform][0](variables=targets))
        )
    pipeline.steps.append(('scaler', StandardScaler()))
    return pipeline
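
To make the `corr_selector` step more concrete, here is a minimal pure-Python sketch of the idea behind a variance-based correlated-feature filter: a feature is dropped when its absolute Pearson correlation with an already kept, higher-variance feature exceeds the threshold. This is a greedy simplification written for illustration and is not `feature_engine`'s implementation; the helper name `drop_correlated` and the toy columns are assumptions.

```python
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def drop_correlated(data, threshold):
    """Return the features to drop, keeping the highest-variance member of each correlated group."""
    # Visit features from highest to lowest variance so the first one seen in a group is kept
    ordered = sorted(data, key=lambda name: pvariance(data[name]), reverse=True)
    kept, dropped = [], []
    for name in ordered:
        if any(abs(pearson(data[name], data[k])) > threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return dropped


# Hypothetical toy columns: 'pts' equals 2 * 'fgm' + 10, so the two are perfectly correlated
toy = {
    'fgm': [30, 35, 40, 45],
    'pts': [70, 80, 90, 100],
    'tov': [12, 9, 15, 10],
}
print(drop_correlated(toy, threshold=0.85))
```

Lowering the threshold (the notebook later uses `thresh=0.6`) makes this filter more aggressive, which leaves fewer columns for the steps that follow.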

We add the last two steps of the pipeline, the feature selection and the model itself. We do this in a separate step since it depends on the model.

After a first pass, we will see that our models only use the features in `TRIVIAL`. These features make the classification problem trivial, as the winner of the game can be computed directly from `'plus_minus_home'`. The points of each team can also be used together to compute the winner, so we remove them as well.
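
A tiny illustration of why these columns make the task trivial, assuming `plus_minus_home` is the home team's point margin (home points minus away points). The box scores below are made up.

```python
# Hypothetical box scores; column names follow the data set, values are invented
games = [
    {'pts_home': 101, 'pts_away': 95},
    {'pts_home': 88, 'pts_away': 102},
    {'pts_home': 110, 'pts_away': 109},
]

for game in games:
    # The margin's sign alone decides the label
    game['plus_minus_home'] = game['pts_home'] - game['pts_away']
    game['home_win'] = int(game['plus_minus_home'] > 0)

labels_from_points = [int(g['pts_home'] > g['pts_away']) for g in games]
labels_from_margin = [g['home_win'] for g in games]
print(labels_from_margin)
```

Either the margin or the pair of scores reproduces the label exactly, so a model given these columns has nothing left to learn.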

# Attention
This could probably be implemented better with classes. To avoid worrying about inheritance, composition could be used instead, with the hyperparameters tuned via methods. This is left for later.

In [4]:
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

MODELS = {
    'LogisticRegression': LogisticRegression,
    'DecisionTree': DecisionTreeClassifier,
    'RandomForest': RandomForestClassifier,
    'GradientBoosting': GradientBoostingClassifier,
    'ExtraTrees': ExtraTreesClassifier,
    'AdaBoost': AdaBoostClassifier,
    'XGBoost': XGBClassifier
}


def add_feat_selection_n_model(pipeline,model_name,random_state=42):
    model = MODELS[model_name](random_state=random_state)
    pipeline.steps.append(("feat_selection", SelectFromModel(model)))
    pipeline.steps.append((model_name,model))
    return pipeline
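
For intuition about what the `feat_selection` step does, the sketch below mimics the default behaviour of `SelectFromModel` for estimators without an L1 penalty: a feature is kept when its importance is at least the mean importance. The importances are invented for illustration, and `select_by_mean_importance` is a hypothetical helper, not part of scikit-learn.

```python
def select_by_mean_importance(importances):
    """Keep the features whose importance is at least the mean importance."""
    threshold = sum(importances.values()) / len(importances)
    return [name for name, score in importances.items() if score >= threshold]


# Invented importances: one dominant feature pulls the mean above all the others
importances = {
    'plus_minus_home': 0.90,
    'pts_home': 0.05,
    'pts_away': 0.04,
    'reb_home': 0.01,
}
print(select_by_mean_importance(importances))
```

When one feature dominates, the mean threshold removes almost everything else, which is consistent with models ending up with only one to three features.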

Let us now create the dictionary of pipelines, one per model, that we are going to train.

In [5]:
def create_pipelines(to_drop=None, thresh=THRESH):
    pipelines = {}
    for model_name in MODELS:
        # Fit the base steps first so that corr_selector knows what it drops
        base_pipe = base_pipeline(to_drop, thresh)
        base_pipe.fit(X_TrainSet)
        pipe_w_transforms = add_transformations(base_pipe, TRANSFORM_ASSIGNMENTS)
        pipelines[model_name] = add_feat_selection_n_model(pipe_w_transforms, model_name)
    return pipelines


PIPELINES = create_pipelines(thresh=0.6)
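
A general Python point worth keeping in mind for helpers that receive a list of columns to drop, since `create_pipelines` passes the same `to_drop` list once per model: extending an argument in place also changes the caller's list, whereas building a new list leaves it untouched. The two helper names below are hypothetical and only illustrate the difference.

```python
META = ['season', 'play_off']


def merged_in_place(to_drop):
    # Mutates the caller's list: every call appends META again
    to_drop.extend(META)
    return to_drop


def merged_copy(to_drop):
    # Builds a new, de-duplicated list and leaves the argument alone
    return list(dict.fromkeys([*to_drop, *META]))


trivial_a = ['pts_home', 'pts_away']
for _ in range(3):
    merged_in_place(trivial_a)

trivial_b = ['pts_home', 'pts_away']
for _ in range(3):
    result = merged_copy(trivial_b)

print(len(trivial_a), len(trivial_b), result)
```

The copying variant matters most when a module-level constant such as `TRIVIAL` is passed in, because later cells reuse that constant.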

## Section 2: Cross Validation Search
We are going to perform multiple grid searches as we refine our selection of features. The final grid search will determine the best model for our data, which we will tune in the next notebook.

We have a few different hypotheses. One of them, Hypothesis 1, is that the models will gravitate strongly towards the features related to points. In order of estimated impact, they are:
* `'plus_minus'`
* `'pts'`
* `'fgm'`
* `'fg3m'`, `'ftm'`

We will start by training our models on all of these features. To test our hypothesis, we will remove one collection of features at a time until no remaining feature directly determines the score (and hence the winner). We will check which features the models use along the way to see if we can validate our hypothesis. We expect that the earlier models will be able to predict the outcome of the games flawlessly.
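
The removal schedule described above can be written down as cumulative drop lists, one per round of grid search. This is only a sketch of the plan: the stage names are made up, and grouping `pts_home` and `pts_away` into a single stage is an assumption that mirrors the `TRIVIAL` constant.

```python
# Feature groups in the order they are removed (stage names are illustrative)
stages = [
    ('all_features', []),
    ('without_plus_minus', ['plus_minus_home']),
    ('without_points', ['pts_home', 'pts_away']),
]

drop_schedule = {}
dropped_so_far = []
for stage_name, newly_dropped in stages:
    # Each round drops everything from the earlier rounds plus its own group
    dropped_so_far = dropped_so_far + newly_dropped
    drop_schedule[stage_name] = list(dropped_so_far)

for stage_name, to_drop in drop_schedule.items():
    print(stage_name, to_drop)
```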

We won't be passing any parameters as we will tune the hyperparameters in the following notebook.
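
Running the search without parameters still fits each pipeline: an empty grid contains exactly one candidate, the default configuration, so with 5-fold cross-validation there are 5 fits per model. A small sketch of that arithmetic, where `n_candidates` is a hypothetical helper and the parameter names in the non-empty grid are only examples:

```python
from math import prod


def n_candidates(param_grid):
    # Number of parameter combinations; the empty product is 1
    return prod(len(values) for values in param_grid.values())


cv_folds = 5
empty_grid = {}
# Example grid with made-up parameter names, for comparison only
example_grid = {'model__max_depth': [3, 5, 7], 'model__n_estimators': [50, 100]}

for grid in (empty_grid, example_grid):
    candidates = n_candidates(grid)
    print(f"{candidates} candidate(s), {candidates * cv_folds} fits")
```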

Note: we are silencing warnings below. These warnings generate hundreds of lines, and no exceptions are raised.

In [6]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import logging
logging.captureWarnings(True)
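
The effect of a category-based filter can be checked in isolation. The sketch below uses only the standard library; the `noisy` function is a made-up stand-in for library code that emits warnings.

```python
import warnings


def noisy():
    # Stand-in for library code that warns about upcoming changes
    warnings.warn("this API will change", FutureWarning)
    warnings.warn("something else", UserWarning)
    return "done"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # show everything by default
    warnings.filterwarnings("ignore", category=FutureWarning)  # then hide one category
    result = noisy()

categories = [w.category.__name__ for w in caught]
print(result, categories)
```

Only the `FutureWarning` is dropped, so other warning categories would still surface if they occur.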

In [16]:
from sklearn.model_selection import GridSearchCV

# PYTHONWARNINGS is inherited by child processes; this is intended to silence
# warnings in the parallel workers started through n_jobs as well.
os.environ['PYTHONWARNINGS'] = 'ignore'


def grid_search(X_train, y_train, pipelines, param_grid=None):
    # An empty grid yields a single candidate: each pipeline with its default parameters
    if param_grid is None:
        param_grid = {}
    grids = {}
    for count, (pipe_name, pipe) in enumerate(pipelines.items(), start=1):
        print(f"### Beginning grid search for {pipe_name} ###")
        grid = GridSearchCV(estimator=pipe,
                            param_grid=param_grid,
                            cv=5,
                            n_jobs=-2,
                            verbose=3,
                            scoring=['accuracy', 'precision'],
                            refit='precision')
        grid.fit(X_train, y_train)
        grids[pipe_name] = grid
        print(f"Finished with model {count}.")
        print(f"{len(pipelines) - count} models remaining.")
    return grids


In [17]:

GRIDS = grid_search(X_TrainSet,Y_TrainSet,PIPELINES)

### Beginning grid search for LogisticRegression ###
Fitting 5 folds for each of 1 candidates, totalling 5 fits


[CV 1/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   2.7s
[CV 2/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.7s
[CV 4/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.7s
[CV 3/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.7s
[CV 5/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.8s
Finished with model 1.
6 models remaining.
### Beginning grid search for DecisionTree ###
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 2/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.7s
[CV 3/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.7s
[CV 1/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.7s
[CV 5/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.7s
[CV 4/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.7s
Finished with model 2.
5 models remaining

We see from the above that the models are "flawless."

We will use the following code to generate reports for evaluating our models. You can pass one of the resulting pipelines to `clf_performance` or the whole collection of grids to `grid_search_report_best` if you like.

In [20]:
# get_best_scores is defined below, so it is not imported from src.model_eval
from src.model_eval import clf_performance, grid_search_report_best

print(GRIDS.keys())


def extract_best_estimator(grid_collection, model_name):
    return grid_collection[model_name].best_estimator_


def get_best_scores(grid_collection):
    # Report the best candidate per model, ranked by precision and then accuracy
    for name, grid in grid_collection.items():
        res = (pd.DataFrame(grid.cv_results_)
               .sort_values(by=['mean_test_precision', 'mean_test_accuracy'], ascending=False)
               .filter(['params', 'mean_test_precision', 'mean_test_accuracy'])
               .values)
        intro = f"Best {name} model:"
        print(f"{intro} Avg. Precision: {res[0][1]*100}%.")
        print(f"{len(intro)*' '} Avg. Accuracy: {res[0][2]*100}%.")
        print()


get_best_scores(GRIDS)

grid_search_report_best(GRIDS, X_TrainSet, Y_TrainSet, X_TestSet, Y_TestSet, ['loss', 'win'])


dict_keys(['LogisticRegression', 'DecisionTree', 'RandomForest', 'GradientBoosting', 'ExtraTrees', 'AdaBoost', 'XGBoost'])
Best LogisticRegression model: Avg. Precision: 100.0%.
                               Avg. Accuracy: 100.0%.

Best DecisionTree model: Avg. Precision: 100.0%.
                         Avg. Accuracy: 100.0%.

Best RandomForest model: Avg. Precision: 100.0%.
                         Avg. Accuracy: 100.0%.

Best GradientBoosting model: Avg. Precision: 100.0%.
                             Avg. Accuracy: 100.0%.

Best ExtraTrees model: Avg. Precision: 100.0%.
                       Avg. Accuracy: 100.0%.

Best AdaBoost model: Avg. Precision: 100.0%.
                     Avg. Accuracy: 100.0%.

Best XGBoost model: Avg. Precision: 100.0%.
                    Avg. Accuracy: 100.0%.

LogisticRegression
#### Train Set #### 

---  Confusion Matrix  ---
                Actual loss Actual win
Prediction loss       13689          0
Prediction win            0      21187


---  C

The last line generates a report for the best estimator of each type. There were 2 incorrect predictions, both made by the `ExtraTrees` model. This is as we expected: the points scored in the game determine who wins, and much of that data is still present. Let's recall which features we train on. Some features that we dropped were common to each model.

In [21]:
print(META)
sample_pipe = GRIDS['LogisticRegression'].best_estimator_
corr_dropped = list(sample_pipe['corr_selector'].features_to_drop_)
print(corr_dropped)


['season', 'play_off']
['fgm_home', 'fg3m_home', 'ftm_home', 'dreb_home', 'ast_home', 'stl_home', 'pf_home', 'fgm_away', 'fg3a_away', 'ftm_away', 'dreb_away', 'ast_away', 'stl_away', 'pf_away']


So some of the point-related features were removed, but the plus/minus score and the actual scores were retained.
In the end, each model was only trained on the following features.

In [24]:
def find_features(X, fitted_pipe, initial_drop):
    # Columns removed before model-based selection: the initial drop plus
    # whatever the correlation selector discarded
    corr_dropped = list(fitted_pipe['corr_selector'].features_to_drop_)
    auto_dropped = list(initial_drop) + corr_dropped
    cols = [col for col in X.columns if col not in auto_dropped]

    # Boolean mask over the remaining columns: True where SelectFromModel kept the feature
    support = fitted_pipe['feat_selection'].get_support()
    if len(cols) != support.shape[0]:
        raise ValueError("Remaining columns do not match the feature selection mask.")
    kept = [col for col, keep in zip(cols, support) if keep]
    feat_selected_dropped = [col for col, keep in zip(cols, support) if not keep]
    return pd.Index(kept), auto_dropped + feat_selected_dropped


def find_overlaps(best_pipes, initial_drop):
    feat_lists = [set(find_features(X_TrainSet, pipe, initial_drop)[0])
                  for pipe in best_pipes.values()]
    overlap_kept = set.intersection(*feat_lists)
    overlap_dropped = set(X_TrainSet.columns) - overlap_kept - set(META)
    return overlap_kept, overlap_dropped


best_pipes = {name: grid.best_estimator_ for name, grid in GRIDS.items()}

for pipe_name, pipe in best_pipes.items():
    print(pipe_name, "used:")
    kept, dropped = find_features(X_TrainSet, pipe, META)
    print(kept)
overlap_kept, overlap_dropped = find_overlaps(best_pipes, META)
print(f"All models used: {overlap_kept}.")
print(f"All models dropped: {overlap_dropped}.")

LogisticRegression used:
Index(['pts_home', 'plus_minus_home', 'pts_away'], dtype='object')
DecisionTree used:
Index(['plus_minus_home'], dtype='object')
RandomForest used:
Index(['pts_home', 'plus_minus_home', 'pts_away'], dtype='object')
GradientBoosting used:
Index(['plus_minus_home'], dtype='object')
ExtraTrees used:
Index(['pts_home', 'plus_minus_home', 'pts_away'], dtype='object')
AdaBoost used:
Index(['plus_minus_home'], dtype='object')
XGBoost used:
Index(['plus_minus_home'], dtype='object')
All models used: {'plus_minus_home'}.
All models dropped: {'fta_home', 'stl_away', 'pts_away', 'blk_home', 'fta_away', 'ftm_home', 'fga_away', 'blk_away', 'tov_away', 'fg3m_away', 'fg3m_home', 'ast_away', 'dreb_home', 'oreb_away', 'reb_home', 'reb_away', 'fg3a_home', 'fgm_home', 'pf_home', 'dreb_away', 'fgm_away', 'fg3a_away', 'stl_home', 'tov_home', 'ftm_away', 'pts_home', 'fga_home', 'ast_home', 'pf_away', 'oreb_home'}.
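
The overlap reported above is a plain set intersection over the per-model feature sets. The sketch below recomputes it from the printed output, and also takes the union to show that no model reached beyond the three columns listed in `TRIVIAL`.

```python
# Feature sets copied from the output above
used = {
    'LogisticRegression': {'pts_home', 'plus_minus_home', 'pts_away'},
    'DecisionTree': {'plus_minus_home'},
    'RandomForest': {'pts_home', 'plus_minus_home', 'pts_away'},
    'GradientBoosting': {'plus_minus_home'},
    'ExtraTrees': {'pts_home', 'plus_minus_home', 'pts_away'},
    'AdaBoost': {'plus_minus_home'},
    'XGBoost': {'plus_minus_home'},
}

common = set.intersection(*used.values())  # used by every model
anywhere = set.union(*used.values())       # used by at least one model
print(common)
print(sorted(anywhere))
```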


This is a large step towards validating our first hypothesis. Let's remove `'plus_minus_home'` and see how this impacts the models.

In [25]:
PIPELINES_wo_pm = create_pipelines(to_drop=['plus_minus_home'],thresh=0.6)
GRIDS_wo_pm = grid_search(X_TrainSet,Y_TrainSet,PIPELINES_wo_pm)

### Beginning grid search for LogisticRegression ###
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 3/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.7s
[CV 4/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.7s
[CV 2/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.8s
[CV 5/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.8s
[CV 1/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.8s
Finished with model 1.
6 models remaining.
### Beginning grid search for DecisionTree ###
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 3/5] END . accuracy: (test=0.999) precision: (test=1.000) total time=   0.8s
[CV 5/5] END . accuracy: (test=1.000) precision: (test=1.000) total time=   0.9s
[CV 2/5] END . accuracy: (test=0.999) precision: (test=1.000) total time=   0.9s
[CV 4/5] END . accuracy: (test=0.999) precision: (test=0.999) total time=   0.9s
[CV 1/5] E

Unsurprisingly, these models also performed extremely well. `AdaBoost` was the 'worst' performing model.

In [26]:
get_best_scores(GRIDS_wo_pm)

clf_performance(X_TrainSet,Y_TrainSet,X_TestSet,Y_TestSet,GRIDS_wo_pm['AdaBoost'].best_estimator_,['loss','win'])

Best LogisticRegression model: Avg. Precision: 100.0%.
                               Avg. Accuracy: 100.0%.

Best DecisionTree model: Avg. Precision: 99.94809538713346%.
                         Avg. Accuracy: 99.9455205353326%.

Best RandomForest model: Avg. Precision: 99.96697000239465%.
                         Avg. Accuracy: 99.96845919239748%.

Best GradientBoosting model: Avg. Precision: 99.91977346807073%.
                             Avg. Accuracy: 99.91398096083654%.

Best ExtraTrees model: Avg. Precision: 99.98584349176355%.
                       Avg. Accuracy: 99.9885308769853%.

Best AdaBoost model: Avg. Precision: 98.77393315282352%.
                     Avg. Accuracy: 98.75272393212981%.

Best XGBoost model: Avg. Precision: 99.92451828391368%.
                    Avg. Accuracy: 99.93118402880536%.

#### Train Set #### 

---  Confusion Matrix  ---
                Actual loss Actual win
Prediction loss       13341        348
Prediction win           42      21145


---  C

Let's see what features were removed and what our models were trained on.

In [35]:
best_pipes_wo_pm = {name: grid.best_estimator_ for name, grid in GRIDS_wo_pm.items()}
initial_drop = ['plus_minus_home'] + META

# find_features and find_overlaps are reused from the earlier cell.

for pipe_name, pipe in best_pipes_wo_pm.items():
    print(pipe_name, "used:")
    kept, dropped = find_features(X_TrainSet, pipe, initial_drop)
    print(kept)

overlap_kept, overlap_dropped = find_overlaps(best_pipes_wo_pm,initial_drop)
print(f"All models used: {overlap_kept}.")
print(f"All models dropped: {overlap_dropped}.")

LogisticRegression used:
Index(['pts_home', 'pts_away'], dtype='object')
DecisionTree used:
Index(['pts_home', 'pts_away'], dtype='object')
RandomForest used:
Index(['pts_home', 'pts_away'], dtype='object')
GradientBoosting used:
Index(['pts_home', 'pts_away'], dtype='object')
ExtraTrees used:
Index(['pts_home', 'pts_away'], dtype='object')
AdaBoost used:
Index(['pts_home', 'pts_away'], dtype='object')
XGBoost used:
Index(['pts_home', 'pts_away'], dtype='object')
All models used: {'pts_home', 'pts_away'}.
All models dropped: {'fta_home', 'oreb_away', 'stl_away', 'blk_home', 'reb_home', 'reb_away', 'fg3a_home', 'fta_away', 'ftm_home', 'fga_away', 'blk_away', 'tov_away', 'fg3m_away', 'fgm_home', 'pf_home', 'dreb_away', 'fg3m_home', 'fgm_away', 'plus_minus_home', 'ast_away', 'fg3a_away', 'stl_home', 'dreb_home', 'tov_home', 'ftm_away', 'fga_home', 'ast_home', 'pf_away', 'oreb_home'}.
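
One plausible reason, offered here as an assumption rather than something this notebook verifies, for why some models were slightly imperfect even though `pts_home` and `pts_away` fully determine the winner: the label depends on the difference of the two columns, which a linear rule captures exactly, while a single axis-aligned threshold (the building block of a decision tree) cannot. The toy games below are invented.

```python
# Invented games as (pts_home, pts_away); the label is 1 when the home team wins
games = [(100, 90), (90, 100), (120, 110), (110, 120), (95, 94), (94, 95)]
labels = [int(home > away) for home, away in games]


def accuracy(predictions):
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


# Linear rule on the difference of the two features
linear_acc = accuracy([int(home - away > 0) for home, away in games])


def best_single_threshold(column):
    """Best accuracy of any rule 'feature > t' or 'feature <= t' on one column."""
    values = [game[column] for game in games]
    best = 0.0
    for t in sorted(set(values)) + [min(values) - 1]:
        above = [int(v > t) for v in values]
        below = [1 - a for a in above]
        best = max(best, accuracy(above), accuracy(below))
    return best


stump_acc = max(best_single_threshold(0), best_single_threshold(1))
print(linear_acc, stump_acc)
```

Deeper trees can approximate the diagonal boundary with many splits, which would fit with tree-based scores that are very close to, but not exactly, 100%.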


As expected, they focused completely on the points scored by each team. Despite this, some models made inaccurate predictions. Let's go further and remove the points each team scored.

In [36]:
PIPELINES_wo_pts = create_pipelines(to_drop=TRIVIAL,thresh=0.6)
GRIDS_wo_pts = grid_search(X_TrainSet,Y_TrainSet,PIPELINES_wo_pts)

### Beginning grid search for LogisticRegression ###
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 2/5] END . accuracy: (test=0.918) precision: (test=0.929) total time=   0.6s
[CV 3/5] END . accuracy: (test=0.930) precision: (test=0.935) total time=   0.7s
[CV 1/5] END . accuracy: (test=0.926) precision: (test=0.941) total time=   0.7s
[CV 4/5] END . accuracy: (test=0.925) precision: (test=0.937) total time=   0.7s
[CV 5/5] END . accuracy: (test=0.924) precision: (test=0.934) total time=   0.7s
Finished with model 1.
6 models remaining.
### Beginning grid search for DecisionTree ###
Fitting 5 folds for each of 1 candidates, totalling 5 fits


This made it more challenging for the models. Let's see what the best performance was and which features the models trained on.

In [None]:
get_best_scores(GRIDS_wo_pts)

The best performing LogisticRegression model had a score of 92.46759519581731%.

The best performing DecisionTree model had a score of 84.47069193712802%.

The best performing RandomForest model had a score of 88.32435467429548%.

The best performing GradientBoosting model had a score of 89.10138272335668%.

The best performing ExtraTrees model had a score of 88.51358472263327%.

The best performing AdaBoost model had a score of 88.98096618000066%.

The best performing XGBoost model had a score of 89.57449319325244%.
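
The scores above can be ranked programmatically. The values below are copied from the output and rounded to two decimals; nothing else is assumed.

```python
# Scores (in percent) copied from the output above, rounded to two decimals
scores = {
    'LogisticRegression': 92.47,
    'DecisionTree': 84.47,
    'RandomForest': 88.32,
    'GradientBoosting': 89.10,
    'ExtraTrees': 88.51,
    'AdaBoost': 88.98,
    'XGBoost': 89.57,
}

# Sort model names from highest to lowest score
ranking = sorted(scores, key=scores.get, reverse=True)
best, worst = ranking[0], ranking[-1]
print(ranking)
print(f"Spread between best and worst: {scores[best] - scores[worst]:.2f} points")
```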



The Decision Tree model was the worst and the Logistic Regression was the best. Let's look at the reports of each.

In [None]:
target_names = ['LogisticRegression','DecisionTree']
two_grids = {name: grid for name, grid in GRIDS_wo_pts.items() if name in target_names}
grid_search_report_best(two_grids,X_TrainSet,Y_TrainSet,X_TestSet,Y_TestSet,['loss','win'])

LogisticRegression
#### Train Set #### 

---  Confusion Matrix  ---
                Actual loss Actual win
Prediction loss       12305       1384
Prediction win         1236      19951


---  Classification Report  ---
              precision    recall  f1-score   support

        loss       0.91      0.90      0.90     13689
         win       0.94      0.94      0.94     21187

    accuracy                           0.92     34876
   macro avg       0.92      0.92      0.92     34876
weighted avg       0.92      0.92      0.92     34876
 

#### Test Set ####

---  Confusion Matrix  ---
                Actual loss Actual win
Prediction loss        3142        349
Prediction win          308       4920


---  Classification Report  ---
              precision    recall  f1-score   support

        loss       0.91      0.90      0.91      3491
         win       0.93      0.94      0.94      5228

    accuracy                           0.92      8719
   macro avg       0.92      0.92   

These are both good models. Their accuracy and precision are within the desired range. Remember, we wanted accuracy above 70% and average precision above 75%.
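
As a cross-check against those targets, the headline metrics can be recomputed by hand from the Logistic Regression test-set confusion matrix. The rows are read here as the actual class, since that is the orientation whose row sums match the `support` column; precision on the win class is used as a stand-in for the precision target.

```python
# Test-set counts for LogisticRegression, copied from the report above.
# Rows are read as the actual class, which matches the 'support' column.
actual_loss = {'pred_loss': 3142, 'pred_win': 349}
actual_win = {'pred_loss': 308, 'pred_win': 4920}

tp = actual_win['pred_win']    # wins predicted as wins
fp = actual_loss['pred_win']   # losses predicted as wins
fn = actual_win['pred_loss']   # wins predicted as losses
tn = actual_loss['pred_loss']  # losses predicted as losses

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_win = tp / (tp + fp)
recall_win = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision_win:.3f} recall={recall_win:.3f}")
print("meets targets:", accuracy > 0.70 and precision_win > 0.75)
```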

In [None]:
sample_wo_pts = GRIDS_wo_pts['LogisticRegression'].best_estimator_
corr_dropped_wo_pts = list(sample_wo_pts['corr_selector'].features_to_drop_)
print(corr_dropped_wo_pts)

# Features removed before the correlation step in this run
initial_drop_wo_pts = META + TRIVIAL

best_pipes = {name: grid.best_estimator_ for name, grid in GRIDS_wo_pts.items()}

# find_features returns (kept, dropped); only the kept features are needed here
feat_dict = {name: find_features(X_TrainSet, pipe, initial_drop_wo_pts)[0]
             for name, pipe in best_pipes.items()}
for pipe_name, feats in feat_dict.items():
    print(pipe_name)
    print(feats)

# Every feature used by at least one model
all_feats = {feat for feats in feat_dict.values() for feat in feats}
print(all_feats)

['fg3m_home', 'ftm_home', 'dreb_home', 'ast_home', 'stl_home', 'pf_home', 'fg3a_away', 'ftm_away', 'dreb_away', 'ast_away', 'stl_away', 'pf_away']
LogisticRegression
Index(['fgm_home', 'fg3a_home', 'fta_home', 'fgm_away', 'fg3m_away',
       'fta_away'],
      dtype='object')
DecisionTree
Index(['fgm_home', 'fta_home', 'fgm_away', 'fta_away'], dtype='object')
RandomForest
Index(['fgm_home', 'fta_home', 'fgm_away', 'fta_away'], dtype='object')
GradientBoosting
Index(['fgm_home', 'fta_home', 'fgm_away', 'fta_away'], dtype='object')
ExtraTrees
Index(['fgm_home', 'fta_home', 'reb_home', 'fgm_away', 'fta_away', 'reb_away'], dtype='object')
AdaBoost
Index(['fgm_home', 'fta_home', 'fgm_away', 'fta_away'], dtype='object')
XGBoost
Index(['fgm_home', 'fta_home', 'fgm_away', 'fta_away'], dtype='object')
{'reb_away', 'fgm_home', 'fg3a_home', 'fg3m_away', 'fta_home', 'fgm_away', 'reb_home', 'fta_away'}


There are some interesting observations here. A lot happened during the feature selection steps: the `corr_selector` step left the (automated) `feat_selection` step with less to choose from.

Interesting observations:
* The Logistic Regression kept the most features and performed the best. It was able to almost completely compute the score of the away team.
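
To see why those features come close to the score, recall the box-score identity: points equal two per made field goal, plus one extra per made three-pointer, plus one per made free throw. The Logistic Regression kept `fgm_away`, `fg3m_away` and `fta_away`, so only the made free throws are missing, with attempts as an upper bound. The stat line below is hypothetical, and the helper names are for illustration.

```python
def points(fgm, fg3m, ftm):
    # Every made field goal is worth 2, a made three adds 1 more, a free throw is worth 1
    return 2 * fgm + fg3m + ftm


def points_upper_bound(fgm, fg3m, fta):
    # With attempts instead of makes, the free-throw part can only be bounded from above
    return 2 * fgm + fg3m + fta


# Hypothetical away-team stat line
fgm_away, fg3m_away, ftm_away, fta_away = 38, 10, 17, 22

exact = points(fgm_away, fg3m_away, ftm_away)
bound = points_upper_bound(fgm_away, fg3m_away, fta_away)
print(exact, bound, bound - exact)
```

The gap between the bound and the exact score is just the number of missed free throws, which is why the away score is "almost" recoverable.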

# Attention
Maybe run an experiment later to see what happens when we remove one or both feature selection steps.

In [None]:
NEW_PIPELINES = create_pipelines()
'''fitted_pipelines = {}
for pipe_name, pipe in NEW_PIPELINES.items():
    pipe.fit(X_TrainSet,Y_TrainSet)
    fitted_pipelines[pipe_name] = pipe
'''
GRIDS_wo_pm = grid_search(X_TrainSet,Y_TrainSet,NEW_PIPELINES)

### Beginning grid search for LogisticRegression ###
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5] END ..................................., score=1.000 total time=   1.3s
[CV 4/5] END ..................................., score=1.000 total time=   1.3s
[CV 2/5] END ..................................., score=1.000 total time=   1.4s
[CV 3/5] END ..................................., score=1.000 total time=   1.5s
[CV 5/5] END ..................................., score=1.000 total time=   1.5s
Finished with model 1.
6 models remaining.
### Beginning grid search for DecisionTree ###
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 2/5] END ..................................., score=1.000 total time=   0.8s
[CV 3/5] END ..................................., score=1.000 total time=   0.8s
[CV 4/5] END ..................................., score=1.000 total time=   1.1s
[CV 5/5] END ..................................., score=1.000 total time=   1.1s
[CV 1/5] E

In [None]:
def get_best_scores(GRIDS):
  for grid in GRIDS.values():
    res = (pd.DataFrame(grid.cv_results_)
       .sort_values(by='mean_test_score',ascending=False)
       .filter(['params','mean_test_score'])
       .values)

    print(res)

from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
  # confusion_matrix puts true labels on rows and predictions on columns,
  # so transpose to match the "Prediction" rows and "Actual" columns used here
  print(pd.DataFrame(confusion_matrix(y_true=y, y_pred=prediction).T,
        columns=["Actual " + sub for sub in label_map],
        index=["Prediction " + sub for sub in label_map]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

 

After examining the reports for all of the models in `GRIDS` and those in our new search `GRIDS_wo_pm`, we found that the two worst performing models are still very close to being 100% accurate. They are AdaBoost and ExtraTrees. Other models did fail to be perfect, but these were the only two that did not score 100% across the board on all metrics for the test set.

In [None]:
for name, grid in GRIDS_wo_pm.items():
    print(name)
    best_estimator = grid.best_estimator_
    clf_performance(X_TrainSet,Y_TrainSet,X_TestSet,Y_TestSet,best_estimator, label_map=['loss', 'win']) 

LogisticRegression
#### Train Set #### 

---  Confusion Matrix  ---
                Actual loss Actual win
Prediction loss       13689          0
Prediction win            0      21187


---  Classification Report  ---
              precision    recall  f1-score   support

        loss       1.00      1.00      1.00     13689
         win       1.00      1.00      1.00     21187

    accuracy                           1.00     34876
   macro avg       1.00      1.00      1.00     34876
weighted avg       1.00      1.00      1.00     34876
 

#### Test Set ####

---  Confusion Matrix  ---
                Actual loss Actual win
Prediction loss        3491          0
Prediction win            0       5228


---  Classification Report  ---
              precision    recall  f1-score   support

        loss       1.00      1.00      1.00      3491
         win       1.00      1.00      1.00      5228

    accuracy                           1.00      8719
   macro avg       1.00      1.00   

In [None]:
best_ada = GRIDS_wo_pm['AdaBoost'].best_estimator_
best_extra_tree = GRIDS_wo_pm['ExtraTrees'].best_estimator_
print("Ada Boost")
clf_performance(X_TrainSet,Y_TrainSet,X_TestSet,Y_TestSet,best_ada, label_map=['loss', 'win'])

print("Extra Tree")
clf_performance(X_TrainSet,Y_TrainSet,X_TestSet,Y_TestSet, best_extra_tree, label_map=['loss', 'win'])

Ada Boost
#### Train Set #### 

---  Confusion Matrix  ---
                Actual loss Actual win
Prediction loss       13689          0
Prediction win            0      21187


---  Classification Report  ---
              precision    recall  f1-score   support

        loss       1.00      1.00      1.00     13689
         win       1.00      1.00      1.00     21187

    accuracy                           1.00     34876
   macro avg       1.00      1.00      1.00     34876
weighted avg       1.00      1.00      1.00     34876
 

#### Test Set ####

---  Confusion Matrix  ---
                Actual loss Actual win
Prediction loss        3491          0
Prediction win            0       5228


---  Classification Report  ---
              precision    recall  f1-score   support

        loss       1.00      1.00      1.00      3491
         win       1.00      1.00      1.00      5228

    accuracy                           1.00      8719
   macro avg       1.00      1.00      1.00  

Let's inspect which features our models used.

In [None]:
for pipe_name, grid in GRIDS_wo_pm.items():
    print(pipe_name)
    # GridSearchCV fits clones, so the fitted steps live on best_estimator_
    print(find_features(X_TrainSet, grid.best_estimator_, META)[0])
