# **Model Selection**

## Objectives
* Do a grid search using cross-validation in order to select a classification model.

## Inputs
* The train and test data set aside in the last notebook.
* The pipeline that was produced in the last notebook.
* Parameter values determined in the previous notebook.

## Outputs
* A choice of classification model that we will further tune and evaluate.

## Additional Comments
* We have chosen to treat this as a classification problem. We may yet fit a regression model on the point differential.

---

# Change working directory
We need to change the working directory from the notebook's folder to the project root, so that the relative data paths and the `src` imports used below resolve.
* We confirm the current directory with `os.getcwd()`.

In [1]:
import os

home_dir = '/workspace/pp5-ml-dashboard'
os.chdir(home_dir)
current_dir = os.getcwd()
print(current_dir)

/workspace/pp5-ml-dashboard


We now load our training and test sets, as well as some of the packages that we will be using.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from src.utils import get_df, save_df

train_dir = 'train/csv'
X_TrainSet = get_df('X_TrainSet',train_dir)
Y_TrainSet = get_df('Y_TrainSet',train_dir)

test_dir = 'test/csv'
X_TestSet = get_df('X_TestSet',test_dir)
Y_TestSet = get_df('Y_TestSet',test_dir)

## Section 1: Full Pipeline
We will build the full pipeline here. It will accept some parameters so that we can tune it later. We also declare some constants that we established in the last notebook.

In [None]:
from sklearn.preprocessing import StandardScaler
from feature_engine import transformation as vt
from feature_engine.selection import DropFeatures, SmartCorrelatedSelection
from sklearn.pipeline import Pipeline


# Constants needed for feature engineering
META = ['season', 'play_off']
THRESH = 0.85
TRANSFORMS = {'log_e':(vt.LogTransformer, False), 
                'log_10':(vt.LogTransformer,'10'),
                'reciprocal':(vt.ReciprocalTransformer,False), 
                'power':(vt.PowerTransformer,False),
                'box_cox':(vt.BoxCoxTransformer,False),
                'yeo_johnson':(vt.YeoJohnsonTransformer,False)}
TRANSFORM_ASSIGNMENTS = {
    'yeo_johnson': ['dreb_away', 'blk_home', 'oreb_away', 'fta_away', 'dreb_home', 
                    'ast_home', 'stl_away', 'pts_away', 'stl_home', 'reb_away',
                    'pts_home', 'fgm_away', 'oreb_home', 'pf_away', 'pf_home'],
    'box_cox': ['ast_away', 'fta_home']
                            }


def pipeline(to_drop=None, thresh=THRESH,
             transform_assignments=TRANSFORM_ASSIGNMENTS):
    # Copy so the caller's list is never mutated; META is always dropped.
    to_drop = list(to_drop or []) + META
    pipe = Pipeline([
        ('dropper', DropFeatures(features_to_drop=to_drop)),
        ('corr_selector', SmartCorrelatedSelection(method="pearson",
                                                   threshold=thresh,
                                                   selection_method="variance")),
    ])
    for transform, variables in transform_assignments.items():
        # TRANSFORMS maps each name to a (transformer class, log base) tuple;
        # the second element is only set for the base-10 LogTransformer.
        transformer_cls, base = TRANSFORMS[transform]
        kwargs = {'base': base} if base else {}
        pipe.steps.append((transform, transformer_cls(variables=variables, **kwargs)))
    pipe.steps.append(('scaler', StandardScaler()))
    return pipe
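
A note on the `to_drop` argument: a helper that calls `extend` on a list passed in by the caller changes that list as a side effect, so calling it twice with the same list accumulates the metadata columns twice. The sketch below is illustrative only. `build_drop_list_inplace`, `build_drop_list_copy` and the column name `game_id` are hypothetical and are not part of this notebook's code.

```python
META = ['season', 'play_off']


def build_drop_list_inplace(to_drop=None):
    """Mutating version: extends the caller's list in place."""
    if not to_drop:
        return META
    to_drop.extend(META)
    return to_drop


def build_drop_list_copy(to_drop=None):
    """Non-mutating version: always returns a fresh list."""
    return list(to_drop or []) + META


# Calling the mutating helper twice grows the caller's list each time.
user_cols = ['game_id']  # hypothetical column name
build_drop_list_inplace(user_cols)
build_drop_list_inplace(user_cols)
mutated = list(user_cols)

# The copying helper leaves the caller's list untouched.
safe_cols = ['game_id']
first = build_drop_list_copy(safe_cols)
second = build_drop_list_copy(safe_cols)

print(mutated)
print(safe_cols, second)
```

Building a fresh list on every call keeps repeated pipeline construction (for example, once per candidate model) from interfering with earlier calls.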

In [None]:
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

MODELS = {
    'LogisticRegression': LogisticRegression,
    'DecisionTree': DecisionTreeClassifier,
    'RandomForest': RandomForestClassifier,
    'GradientBoosting': GradientBoostingClassifier,
    'ExtraTrees': ExtraTreesClassifier,
    'AdaBoost': AdaBoostClassifier,
    'XGBoost': XGBClassifier
}


def add_feat_selection_n_model(pipeline, model, **params):
    # SelectFromModel needs an estimator instance, not the class itself.
    pipeline.steps.append(("feat_selection", SelectFromModel(MODELS[model]())))
    pipeline.steps.append((model, MODELS[model](**params)))
    return pipeline
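
To illustrate where these helpers are heading, here is a minimal sketch of a cross-validated grid search over several candidate classifiers. It is not the notebook's actual search: the data is synthetic (a stand-in for `X_TrainSet` and `Y_TrainSet`), the feature-engineering steps are reduced to a scaler, and the `candidates` dictionary, the parameter grids, `cv=5`, the accuracy scoring and the uniform step name `model` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption: binary target).
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Hypothetical candidates: name -> (estimator class, parameter grid).
candidates = {
    'LogisticRegression': (LogisticRegression, {'model__C': [0.1, 1.0]}),
    'DecisionTree': (DecisionTreeClassifier, {'model__max_depth': [3, 5]}),
}

results = {}
for name, (estimator_cls, grid) in candidates.items():
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        # SelectFromModel expects an estimator instance, not the class.
        ('feat_selection', SelectFromModel(estimator_cls(random_state=0))),
        ('model', estimator_cls(random_state=0)),
    ])
    # 5-fold cross-validation over every combination in the grid.
    search = GridSearchCV(pipe, param_grid=grid, cv=5, scoring='accuracy')
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Rank candidates by their best mean cross-validated accuracy.
for name, (score, params) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f'{name}: mean CV accuracy={score:.3f}, best params={params}')
```

Grid keys follow scikit-learn's `<step name>__<parameter>` convention, so a pipeline that names its final step after the model (as `add_feat_selection_n_model` does) would need keys such as `RandomForest__n_estimators` instead of `model__...`.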

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)
