# **Feature Selection**

## Objectives
* Run a cross-validated grid search in order to select a classification model. We will repeat this grid search several times, removing different features each time, in order to test our hypotheses and construct non-trivial models.

## Inputs
* The train and test data set aside in the last notebook.
* The pipeline that was produced in the last notebook.
* Parameter values determined in the previous notebook.

## Outputs
* A choice of classification model that we will further tune and evaluate.

## Additional Comments
* We have chosen to do classification. We may yet do a regression model on the point differential.      

---

# Change working directory
We need to change the working directory from its current folder to the project's root folder
* We access the current directory with os.getcwd()

In [1]:
import os

home_dir = '/workspace/pp5-ml-dashboard'
os.chdir(home_dir)
current_dir = os.getcwd()
print(current_dir)

/workspace/pp5-ml-dashboard


We now load our training and test sets, as well as some of the packages that we will be using.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from src.utils import get_df, save_df

train_dir = 'train/csv'
X_TrainSet = get_df('X_TrainSet',train_dir)
Y_TrainSet = get_df('Y_TrainSet',train_dir)

test_dir = 'test/csv'
X_TestSet = get_df('X_TestSet',test_dir)
Y_TestSet = get_df('Y_TestSet',test_dir)

## Section 1: Full Pipeline
We will build the full pipeline here. It will accept some parameters so that we can tune it later. We also declare some constants that we established in the last notebook.

In [3]:
from sklearn.preprocessing import StandardScaler
from feature_engine import transformation as vt
from feature_engine.selection import DropFeatures, SmartCorrelatedSelection
from sklearn.pipeline import Pipeline


# Constants needed for feature engineering
META = ['season', 'play_off']
TRIVIAL = ['pts_home','pts_away','plus_minus_home']
THRESH = 0.85
TRANSFORMS = {'log_e':(vt.LogTransformer, False), 
                'log_10':(vt.LogTransformer,'10'),
                'reciprocal':(vt.ReciprocalTransformer,False), 
                'power':(vt.PowerTransformer,False),
                'box_cox':(vt.BoxCoxTransformer,False),
                'yeo_johnson':(vt.YeoJohnsonTransformer,False)}
TRANSFORM_ASSIGNMENTS = {
    'yeo_johnson': ['dreb_away', 'blk_home', 'oreb_away', 'fta_away', 'dreb_home', 
                    'ast_home', 'stl_away', 'pts_away', 'stl_home', 'reb_away',
                    'pts_home', 'fgm_away', 'oreb_home', 'pf_away', 'pf_home'],
    'box_cox': ['ast_away', 'fta_home']
                            }


def base_pipeline(to_drop=None,thresh=THRESH):
    # Always drop the META columns. Build a new list so that the list
    # passed in by the caller is never mutated.
    to_drop = sorted(set(to_drop or []) | set(META))
    pipeline = Pipeline([
        ('dropper', DropFeatures(features_to_drop=to_drop)),
        ('corr_selector', SmartCorrelatedSelection(method="pearson",
                                                   threshold=thresh,
                                                   selection_method="variance")
                                                   )
                        ])
    # Remember what was dropped so that add_transformations can skip those columns
    pipeline.to_drop = to_drop
    return pipeline

    
def add_transformations(pipeline, transform_assignments):
    # This needs to be called after the above is fit so that the correlation selector has that attr
    to_drop = pipeline.to_drop
    dropping = set(to_drop + pipeline['corr_selector'].features_to_drop_)
    
    new_assignments = { key: [val for val in value if val not in dropping] 
                       for key,value in transform_assignments.items()}
    ''' Olde version
    for value in transform_assignments.values():
        for drop_term in dropping:
            if drop_term in value:
                value.remove(drop_term)
    '''
    for transform, targets in new_assignments.items():
        if not targets:
            continue
        pipeline.steps.append(
            (transform, TRANSFORMS[transform][0](variables=targets))
            )
    pipeline.steps.append(('scaler', StandardScaler()))
    return pipeline
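
Helpers such as `base_pipeline` receive a list of columns and combine it with `META`. A minimal sketch, using plain lists and the hypothetical helper names `merge_in_place` and `merge_copy`, of why it matters whether that combination mutates the caller's list:

```python
META = ['season', 'play_off']

def merge_in_place(to_drop):
    # Extends the very list object that the caller passed in
    to_drop.extend(META)
    return sorted(set(to_drop))

def merge_copy(to_drop):
    # Builds a new collection and leaves the caller's list untouched
    return sorted(set(to_drop) | set(META))

caller_list = ['plus_minus_home']
merge_in_place(caller_list)
merge_in_place(caller_list)
mutated_length = len(caller_list)   # the caller's list keeps growing

safe_list = ['plus_minus_home']
merged = merge_copy(safe_list)
safe_length = len(safe_list)        # the caller's list is unchanged
```

Both helpers return the same merged columns; only the side effect on the argument differs, which becomes relevant when the same list is reused across several calls.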

We add the last two steps of the pipeline, the feature selection and the model itself. We do this in a separate step since it depends on the model.

After a first pass, we will see that our models only use the features in `TRIVIAL`. These features make the classification problem trivial, since the winner of the game can be computed directly from `'plus_minus_home'`. The points scored by each team can also be used together to compute the winner, so we remove them as well.
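
To make this concrete, here is a small sketch on made-up scores (the numbers are illustrative and not taken from the data set). It assumes that `plus_minus_home` is the home team's points minus the away team's points, and that the target marks a home win:

```python
import pandas as pd

# Hypothetical scores for four games
games = pd.DataFrame({
    'pts_home': [101, 95, 110, 88],
    'pts_away': [99, 102, 104, 90],
})
games['plus_minus_home'] = games['pts_home'] - games['pts_away']

# Two rules that recover the winner without any model
win_from_plus_minus = games['plus_minus_home'] > 0
win_from_points = games['pts_home'] > games['pts_away']

rules_agree = bool((win_from_plus_minus == win_from_points).all())
print(rules_agree)
```

Under these assumptions a single threshold on one column already separates the classes, which is why a model given these columns has little reason to look at anything else.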

# Attention
This could probably be implemented better with classes. I would rather not worry too much about inheritance; it could be done with composition, with the hyperparameters tuned via methods. Do that later.

In [4]:
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

MODELS = {
    'LogisticRegression': LogisticRegression,
    'DecisionTree': DecisionTreeClassifier,
    'RandomForest': RandomForestClassifier,
    'GradientBoosting': GradientBoostingClassifier,
    'ExtraTrees': ExtraTreesClassifier,
    'AdaBoost': AdaBoostClassifier,
    'XGBoost': XGBClassifier
}


def add_feat_selection_n_model(pipeline,model_name,random_state=42):
    model = MODELS[model_name](random_state=random_state)
    pipeline.steps.append(("feat_selection", SelectFromModel(model)))
    pipeline.steps.append((model_name,model))
    return pipeline

Let us now create our list of models that we are going to train.

In [5]:
def create_pipelines(to_drop=None,thresh=THRESH):
    PIPELINES = {}
    for model_name in MODELS:
        base_pipe = base_pipeline(to_drop,thresh)
        base_pipe.fit(X_TrainSet)
        pipe_w_transforms= add_transformations(base_pipe,TRANSFORM_ASSIGNMENTS)
        PIPELINES[model_name] = add_feat_selection_n_model(pipe_w_transforms,model_name)
    return PIPELINES
PIPELINES = create_pipelines()

## Section 2: Cross Validation Search
We are going to perform multiple grid searches as we refine our selection of features. The final grid search will determine the best model for our data, which we will tune in the next notebook.

# Attention below cell

In [6]:
# this cell was for dev and can be deleted later
pipe1 = PIPELINES['LogisticRegression']
pipe1
rand_forest_pipe = base_pipeline()
rand_forest_pipe.steps.append(('feature_selection', SelectFromModel(RandomForestClassifier(random_state=42))))
rand_forest_pipe.steps.append(('model',RandomForestClassifier(random_state=42)))
#logistic_pipe = add_feat_selection_n_model(logistic_pipe,'LogisticRegression')
rand_forest_pipe

Pipeline(steps=[('dropper',
                 DropFeatures(features_to_drop=['play_off', 'season'])),
                ('corr_selector',
                 SmartCorrelatedSelection(selection_method='variance',
                                          threshold=0.85)),
                ('feature_selection',
                 SelectFromModel(estimator=RandomForestClassifier(random_state=42))),
                ('model', RandomForestClassifier(random_state=42))])

We have a few different hypotheses. One of them, hypothesis 1, is that the models will gravitate strongly towards the features related to points. In order of estimated impact, they are:
* `'plus_minus'`
* `'pts'`
* `'fgm'`
* `'fg3m'`, `'ftm'`

We will start by training our model on all of these features. To test our hypothesis, we will remove one collection of features at a time until we are left with no features that help you directly determine the score (and hence winner). We will check which features the models use along the way to see if we can validate our hypothesis. We expect that the earlier models will be able to predict the outcome of the games flawlessly.
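
The removal schedule described above can be sketched as a list of cumulative `to_drop` lists, one per round of grid search. The `_home`/`_away` column names for `fgm`, `fg3m` and `ftm` are assumed from the naming of the other columns, and `FEATURE_GROUPS` and `drop_schedule` are hypothetical names:

```python
from itertools import accumulate

# Feature groups in the order in which we expect the models to prefer them
FEATURE_GROUPS = [
    ['plus_minus_home'],
    ['pts_home', 'pts_away'],
    ['fgm_home', 'fgm_away'],
    ['fg3m_home', 'fg3m_away', 'ftm_home', 'ftm_away'],
]

# Round 0 drops nothing extra; each later round also drops the next group
drop_schedule = [[]] + list(
    accumulate(FEATURE_GROUPS, lambda done, group: done + group)
)

for round_number, to_drop in enumerate(drop_schedule):
    print(round_number, to_drop)
```

Each entry could then be passed as `to_drop` to `create_pipelines`, ideally as a copy (`list(to_drop)`) so that the schedule itself is not modified.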

If you wish to get immediate information, set the verbosity to 3. It will report on how the model is performing during each step of the grid search. Or you could wait and see the performance in the following cell.

Note: we are silencing warnings below. These warnings generate hundreds of lines and no exceptions are raised.

In [None]:
from sklearn.model_selection import GridSearchCV


param_grid = {"model__n_estimators":[10*i for i in range(1,10)],}

def grid_search(X_train, y_train, pipelines,param_grid={}):
    GRIDS = {}
    for pipe_name, pipe in pipelines.items():
        print(f"### Beginning grid search for {pipe_name} ###")
        try:
            grid=GridSearchCV(estimator=pipe,
                    param_grid=param_grid,
                    cv=5,
                    n_jobs=-2,
                    verbose=3,
                    scoring='accuracy')
            grid.fit(X_train,y_train)
        except ValueError as e:
            msg = "Invalid parameter model for estimator"
            if msg not in str(e):
                raise e
            print("An estimator was passed invalid parameters. We will "
                  f"continue the grid search for {pipe_name} with "
                  "param_grid={}")
            print(f"### Beginning grid search for {pipe_name} with empty params. ###")
            grid=GridSearchCV(estimator=pipe,
                    param_grid={},
                    cv=5,
                    n_jobs=-2,
                    verbose=3,
                    scoring='accuracy')
            grid.fit(X_train,y_train)
        GRIDS[pipe_name] = grid
    return GRIDS
GRIDS = grid_search(X_TrainSet,Y_TrainSet,PIPELINES, param_grid)
print(len(GRIDS))
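
The `ValueError` fallback above depends on how scikit-learn addresses pipeline parameters: a key in `param_grid` has the form `<step name>__<parameter>`, so it only applies if the pipeline has a step with that name and the estimator in that step has that parameter. A minimal sketch with two toy pipelines (the helper `has_param` is hypothetical):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def has_param(pipeline, key):
    """Return True if `key` is a parameter that the pipeline exposes."""
    return key in pipeline.get_params(deep=True)


forest_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('RandomForest', RandomForestClassifier(random_state=42)),
])
logistic_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('LogisticRegression', LogisticRegression(random_state=42)),
])

# No step is called "model", so this key is unknown to the pipeline
forest_has_model_key = has_param(forest_pipe, 'model__n_estimators')
# The key prefixed with the actual step name is known
forest_has_named_key = has_param(forest_pipe, 'RandomForest__n_estimators')
# Logistic regression has no n_estimators parameter at all
logistic_has_named_key = has_param(logistic_pipe, 'LogisticRegression__n_estimators')

print(forest_has_model_key, forest_has_named_key, logistic_has_named_key)
```

In the pipelines built by `add_feat_selection_n_model`, the final step appears to be named after the model rather than `model`, so a grid keyed on `model__n_estimators` would fall through to the empty grid for every pipeline. Checking the keys this way before starting a search is one option for catching that early.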

We will use the following code to generate reports for evaluating our models.

In [9]:
# warning check
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
from sklearn.model_selection import GridSearchCV
import sys

param_grid = {"model__n_estimators":[10*i for i in range(1,3)],}
pipe_name = list(PIPELINES.keys())[0]
pipe = PIPELINES[pipe_name]
print(f"### Beginning grid search for {pipe_name} ###")
sys.stderr = open(os.devnull, "w")  # silence stderr
try:
    grid=GridSearchCV(estimator=pipe,
                    param_grid=param_grid,
                    cv=5,
                    n_jobs=-2,
                    verbose=3,
                    scoring='accuracy')
    

    grid.fit(X_TrainSet,Y_TrainSet)   
except ValueError as e:
    msg = "Invalid parameter model for estimator"
    if msg not in str(e):
        raise e
    print("An estimator was passed invalid parameters. We will "
                  f"continue the grid search for {pipe_name} with "
                  "param_grid={}")
    print(f"### Beginning grid search for {pipe_name} with empty params. ###")
    grid=GridSearchCV(estimator=pipe,
                    param_grid={},
                    cv=5,
                    n_jobs=-2,
                    verbose=3,
                    scoring='accuracy')
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        grid.fit(X_TrainSet,Y_TrainSet)
sys.stderr = sys.__stderr__ 
        

### Beginning grid search for LogisticRegression ###
Fitting 5 folds for each of 2 candidates, totalling 10 fits
An estimator was passed invalid parameters. We will continue the grid search for LogisticRegression with param_grid={}
### Beginning grid search for LogisticRegression with empty params. ###
Fitting 5 folds for each of 1 candidates, totalling 5 fits


  f = X[feature_group].std().sort_values(ascending=False).index[0]

[CV 1/5] END ..................................., score=1.000 total time=   1.5s
[CV 2/5] END ..................................., score=1.000 total time=   1.5s
[CV 4/5] END ..................................., score=1.000 total time=   1.6s


  return f(*args, **kwargs)


[CV 5/5] END ..................................., score=1.000 total time=   1.4s
[CV 3/5] END ..................................., score=1.000 total time=   1.4s
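
As an alternative to reassigning `sys.stderr` by hand, here is a sketch that uses context managers from the standard library, which restore the previous state even if the wrapped call raises. `run_quietly` and `noisy` are hypothetical names, and output written by separate worker processes (as with `n_jobs` other than 1) may not be captured this way:

```python
import contextlib
import io
import sys
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func with warnings ignored and stderr captured in a buffer."""
    buffer = io.StringIO()
    with warnings.catch_warnings(), contextlib.redirect_stderr(buffer):
        warnings.simplefilter("ignore")
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy():
    # Stand-in for a fit call that warns and writes to stderr
    warnings.warn("this API will change", FutureWarning)
    print("diagnostic message", file=sys.stderr)
    return 7


stderr_before = sys.stderr
result, captured = run_quietly(noisy)
stderr_restored = sys.stderr is stderr_before
```

The captured text remains available for inspection, so nothing is lost if a message turns out to matter.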


In [None]:
from src.model_eval import confusion_matrix_and_report, get_best_scores


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)
  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

get_best_scores(GRIDS)

As there are several different models, we print a report for each one.

In [None]:
for name, grid in GRIDS.items():
    best_pipe = grid.best_estimator_
    print(name)
    clf_performance(X_TrainSet,Y_TrainSet,X_TestSet,Y_TestSet,best_pipe, label_map=['loss', 'win'])




We are not surprised to see the high level of accuracy. The points scored in the game literally determine who wins, and we have a lot of that data still present. Let's recall which features we train on. Some features that we dropped were common to each model.

In [None]:
print(META)
pipe1 = PIPELINES['LogisticRegression']
pipe1[:2].fit(X_TrainSet,Y_TrainSet)
pipe1['corr_selector'].__dir__()

In the end, each model was only trained on the following features.

In [None]:
def find_features(X,Y,pipe):
    pipe.fit(X,Y)
    # Boolean mask over the columns that reach the feature selection step
    features = pipe['feat_selection'].get_support()
    # Columns removed earlier by this pipeline's dropper and correlation selector
    dropped = (set(pipe['dropper'].features_to_drop)
               | set(pipe['corr_selector'].features_to_drop_))
    candidates = [col for col in X.columns if col not in dropped]
    if len(candidates) != features.shape[0]:
        raise ValueError("Column count does not match the feature selection mask.")
    return pd.Index(candidates)[features]


for pipe_name, pipe in PIPELINES.items():
    print(pipe_name)
    print(find_features(X_TrainSet,Y_TrainSet,pipe))

Let's see how the training goes if we remove `'plus_minus_home'`, `'pts_home'`, and `'pts_away'`. We expect that the models so far were able to predict the winner solely based on these. Since we will be dropping them before we select features using correlation or `SelectFromModel`, we expect this to have an impact on the models. If you only remove `'plus_minus_home'`, this impacts the performance, but barely. The models fail to be perfect; however, the worst scoring models, ExtraTrees and AdaBoost, still have a lowest score of 97%.

In [None]:
# Drop the trivial score features in addition to META
NEW_PIPELINES = create_pipelines(to_drop=list(TRIVIAL))
'''fitted_pipelines = {}
for pipe_name, pipe in NEW_PIPELINES.items():
    pipe.fit(X_TrainSet,Y_TrainSet)
    fitted_pipelines[pipe_name] = pipe
'''
GRIDS_wo_pm = grid_search(X_TrainSet,Y_TrainSet,NEW_PIPELINES)

In [None]:
def get_best_scores(GRIDS):
  for grid in GRIDS.values():
    res = (pd.DataFrame(grid.cv_results_)
       .sort_values(by='mean_test_score',ascending=False)
       .filter(['params','mean_test_score'])
       .values)

    print(res)

from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
  # Rows of sklearn's confusion_matrix are the actual classes, columns the predictions
  print(pd.DataFrame(confusion_matrix(y_pred=prediction, y_true=y),
        columns=["Prediction " + sub for sub in label_map],
        index=["Actual " + sub for sub in label_map]
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)
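
When reading the printed matrix it helps to recall the orientation that scikit-learn uses: entry `[i, j]` of `confusion_matrix` counts samples whose true class is `i` and whose predicted class is `j`. A small check on made-up labels (the values are illustrative only):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

label_map = ['loss', 'win']
y_true = [0, 0, 0, 1]   # three actual losses, one actual win
y_pred = [0, 1, 1, 1]   # two of the losses are predicted as wins

matrix = confusion_matrix(y_true=y_true, y_pred=y_pred)
report = pd.DataFrame(
    matrix,
    index=["Actual " + label for label in label_map],
    columns=["Prediction " + label for label in label_map],
)
print(report)

actual_loss_predicted_win = int(report.loc["Actual loss", "Prediction win"])
actual_win_predicted_loss = int(report.loc["Actual win", "Prediction loss"])
```

With true labels on the rows, the two misclassified losses appear in the "Actual loss" row under "Prediction win".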

 

After examining the reports for all of the models in `GRIDS` and those in our new search `GRIDS_wo_pm`, we found that the two worst performing models are still very close to being 100% accurate. They are AdaBoost and ExtraTrees. Other models did fail to be perfect, but these were the only two that did not score 100% across the board on all metrics for the test set.

In [None]:
for name, grid in GRIDS_wo_pm.items():
    print(name)
    best_estimator = grid.best_estimator_
    clf_performance(X_TrainSet,Y_TrainSet,X_TestSet,Y_TestSet,best_estimator, label_map=['loss', 'win']) 

In [None]:
best_ada = GRIDS_wo_pm['AdaBoost'].best_estimator_
best_extra_tree = GRIDS_wo_pm['ExtraTrees'].best_estimator_
print("Ada Boost")
clf_performance(X_TrainSet,Y_TrainSet,X_TestSet,Y_TestSet,best_ada, label_map=['loss', 'win'])

print("Extra Tree")
clf_performance(X_TrainSet,Y_TrainSet,X_TestSet,Y_TestSet, best_extra_tree, label_map=['loss', 'win'])

Let's inspect which features our models used.

In [None]:
for pipe_name, pipe in NEW_PIPELINES.items():
    print(pipe_name)
    print(find_features(X_TrainSet,Y_TrainSet,pipe))
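
For reference, a self-contained sketch of how the boolean mask returned by `get_support()` maps back to column names. It uses synthetic data in which only one column carries signal; the column names and sizes are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise_a': rng.normal(size=200),
    'noise_b': rng.normal(size=200),
})
# The target depends on the 'signal' column only
y = (X['signal'] > 0).astype(int)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X, y)

# One boolean per input column, in the same order as X.columns
mask = selector.get_support()
selected = list(X.columns[mask])
print(selected)
```

The mask is aligned with the columns that enter the selector, which is why the columns dropped by earlier pipeline steps have to be accounted for before applying it.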


