# What are we doing?

## Objectives

+ Construct a cross-validation pipeline.
+ Use cross-validation to evaluate different hyperparameter performance.
+ Perform grid search for systematic evaluation.
+ Store and manage results.

## Procedure

The diagram below, taken from scikit-learn's documentation, shows the procedure that we will follow:

![](./images/05_grid_search_workflow.png)


+ System requirements:

    - Automation: the system should operate automatically with minimal supervision.
    - Replicability: changes to code and (arguably) data should be logged and controlled. Randomness should also be controlled (random seeds, etc.).
    - Persistence: persist results for later analysis.
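
The replicability and persistence requirements can be illustrated with a minimal sketch, assuming that a run's results are small enough to store as JSON. The function name `persist_results`, the file name pattern, and the record fields are hypothetical and are not part of the course code.

```python
import json
import random
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def persist_results(results, seed, out_dir):
    """Write one experiment record (hypothetical layout) to a JSON file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "seed": seed,  # stored so that the run can be repeated
        "results": results,
    }
    out_path = out_dir / f"run_seed_{seed}.json"
    out_path.write_text(json.dumps(record, indent=2))
    return out_path


seed = 42
random.seed(seed)  # control randomness explicitly
fake_score = round(random.random(), 6)  # stand-in for a real metric

with tempfile.TemporaryDirectory() as tmp:
    path = persist_results({"score": fake_score}, seed, tmp)
    reloaded = json.loads(path.read_text())

print(reloaded["seed"], reloaded["results"])
```

Storing the seed next to the results is one simple way to make a run repeatable; a dedicated experiment tracker would typically record more (code version, data version, environment).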


## What is a Hyperparameter?

+ Generally speaking, hyperparameters are settings that control the learning process and are chosen before training rather than learned from the data: regularization weights, learning rate, split criterion (entropy or Gini), etc.
+ Hyperparameters drive the behaviour and performance of a model, so model selection is closely tied to hyperparameter tuning.
+ Selection criteria are based on performance evaluation and, to get more reliable performance estimates, we use cross-validation.
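
To make the cross-validation idea concrete, here is a minimal sketch of how k-fold splitting partitions row indices, assuming contiguous folds and no shuffling. The helper `k_fold_indices` is hypothetical and written only for illustration; in practice scikit-learn's splitters do this work.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=5))
for train_idx, valid_idx in folds:
    print("validate on", valid_idx, "train on", len(train_idx), "rows")
```

Every row is used for validation exactly once, and the model's score is averaged over the folds, which is why the estimate is more stable than a single train/validation split.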

## Searching the Hyperparameter Grid

+ To address the automation requirement, we can use `GridSearchCV`, a scikit-learn class that performs a grid search over a hyperparameter space and cross-validates every candidate.
+ To "Search the Hyperparameter Grid" exhaustively means that we consider every possible combination of hyperparameter values in the search space and evaluate the model under each one. For example, if we explore two parameters, kernel (takes values "rbf" and "poly") and C (takes values 1.0 and 0.5), then the grid consists of the combinations:

    + (rbf, 1.0)
    + (rbf, 0.5)
    + (poly, 1.0)
    + (poly, 0.5)

+ Under each combination, we perform CV and evaluate the model's performance.
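
The enumeration above can be reproduced with a short sketch that uses only the standard library. The variable names `search_space` and `grid` are illustrative; scikit-learn also provides `ParameterGrid` for the same purpose.

```python
from itertools import product

search_space = {
    "kernel": ["rbf", "poly"],
    "C": [1.0, 0.5],
}

# The Cartesian product of the value lists gives every combination.
names = list(search_space)
grid = [dict(zip(names, values)) for values in product(*search_space.values())]

for combo in grid:
    print(combo)
```

The number of combinations is the product of the list lengths, so the grid grows quickly as parameters or values are added.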

# Setup

We start with the [Give Me Some Credit](https://www.kaggle.com/c/GiveMeSomeCredit) data that we used in the previous session.

In [1]:
%load_ext dotenv
%dotenv 
import os
import sys
sys.path.append(os.getenv('SRC_DIR'))
import pandas as pd
import numpy as np
ft_path = os.getenv("CREDIT_DATA")
df_raw = pd.read_csv(ft_path)


In [2]:
df = df_raw.drop(columns = ["Unnamed: 0"]).rename(
    columns = {
        'SeriousDlqin2yrs': 'delinquency',
        'RevolvingUtilizationOfUnsecuredLines': 'revolving_unsecured_line_utilization', 
        'age': 'age',
        'NumberOfTime30-59DaysPastDueNotWorse': 'num_30_59_days_late', 
        'DebtRatio': 'debt_ratio', 
        'MonthlyIncome': 'monthly_income',
        'NumberOfOpenCreditLinesAndLoans': 'num_open_credit_loans', 
        'NumberOfTimes90DaysLate':  'num_90_days_late',
        'NumberRealEstateLoansOrLines': 'num_real_estate_loans', 
        'NumberOfTime60-89DaysPastDueNotWorse': 'num_60_89_days_late',
        'NumberOfDependents': 'num_dependents'
    }
).assign(
    high_debt_ratio = lambda x: (x['debt_ratio'] > 1)*1,
    missing_monthly_income = lambda x: x['monthly_income'].isna()*1,
    missing_num_dependents = lambda x: x['num_dependents'].isna()*1, 
)

In [11]:
df.head(100)

,delinquency,revolving_unsecured_line_utilization,age,num_30_59_days_late,debt_ratio,monthly_income,num_open_credit_loans,num_90_days_late,num_real_estate_loans,num_60_89_days_late,num_dependents,high_debt_ratio,missing_monthly_income,missing_num_dependents
0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0,0,0,0
1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0,0,0,0
2,0,0.658180,38,1,0.085113,3042.0,2,1,0,0,0.0,0,0,0
3,0,0.233810,30,0,0.036050,3300.0,5,0,0,0,0.0,0,0,0
4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0.245353,37,0,0.288417,6500.0,11,1,1,1,0.0,0,0,0
96,0,0.542243,48,2,10.000000,,2,0,0,0,,1,1,1
97,0,0.010531,57,0,0.280665,5714.0,6,0,1,0,0.0,0,0,0
98,0,0.363200,32,0,0.480524,2900.0,4,0,1,0,0.0,0,0,0


Use a simple pipeline composed of:

+ Preprocessing steps.
+ Logistic Regression classifier.

We will explore the hyperparameter space by evaluating different regularization strategies and parameters.

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

In [4]:

num_cols = ['revolving_unsecured_line_utilization', 'age',
       'num_30_59_days_late', 'debt_ratio', 'monthly_income',
       'num_open_credit_loans', 'num_90_days_late', 'num_real_estate_loans',
       'num_60_89_days_late', 'num_dependents', 
       # Although expressed as numbers, these columns are boolean:
       # 'high_debt_ratio',
       # 'missing_monthly_income', 
       # 'missing_num_dependents' 
       ]

pipe_num_simple = Pipeline([
    ('imputer', SimpleImputer(strategy = 'median')),
    ('standardizer', StandardScaler())
])

ctransform_simple= ColumnTransformer([
    ('numeric_simple', pipe_num_simple, num_cols),
], remainder='passthrough')

pipe_lr = Pipeline([
    ('preprocess', ctransform_simple),
    ('clf', LogisticRegression())
])
pipe_lr

Pipeline
parameter,value
steps,"[('preprocess', ...), ('clf', ...)]"
transform_input,
memory,
verbose,False

ColumnTransformer (preprocess)
parameter,value
transformers,"[('numeric_simple', ...)]"
remainder,'passthrough'
sparse_threshold,0.3
n_jobs,
transformer_weights,
verbose,False
verbose_feature_names_out,True
force_int_remainder_cols,'deprecated'

SimpleImputer (imputer)
parameter,value
missing_values,
strategy,'median'
fill_value,
copy,True
add_indicator,False
keep_empty_features,False

StandardScaler (standardizer)
parameter,value
copy,True
with_mean,True
with_std,True

LogisticRegression (clf)
parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,
random_state,
solver,'lbfgs'
max_iter,100

Obtain the parameters of the pipeline with `.get_params()`.

In [5]:
pipe_lr.get_params()

{'memory': None,
 'steps': [('preprocess',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('numeric_simple',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(strategy='median')),
                                                    ('standardizer',
                                                     StandardScaler())]),
                                    ['revolving_unsecured_line_utilization', 'age',
                                     'num_30_59_days_late', 'debt_ratio',
                                     'monthly_income', 'num_open_credit_loans',
                                     'num_90_days_late', 'num_real_estate_loans',
                                     'num_60_89_days_late', 'num_dependents'])])),
  ('clf', LogisticRegression())],
 'transform_input': None,
 'verbose': False,
 'preprocess': ColumnTransformer(remainder='passthrough',
                   tra
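
The nested parameter names returned by `.get_params()` follow the pattern `<step name>__<parameter name>`. A minimal sketch, assuming a smaller stand-in pipeline (`demo_pipe` is a hypothetical name and is not the notebook's `pipe_lr`):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A smaller stand-in for pipe_lr, used only to illustrate parameter naming.
demo_pipe = Pipeline([
    ('standardizer', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Nested parameters are addressed as <step name>__<parameter name>.
clf_params = sorted(k for k in demo_pipe.get_params() if k.startswith('clf__'))
print(clf_params)

# set_params accepts the same names, which is how a search can
# reconfigure a step inside the pipeline.
demo_pipe.set_params(clf__C=0.5)
print(demo_pipe.get_params()['clf__C'])
```

These are the names to use as keys when describing a search space for a pipeline.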

## Set Up the Splitting Strategy

In [6]:
X = df.drop(columns = 'delinquency')
Y = df['delinquency']

scoring = ['neg_log_loss', 'roc_auc', 'f1', 'accuracy', 'precision', 'recall']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)



To perform the Grid Search we need to define a parameter grid:

- A parameter grid defines all of the combinations of parameters that we need to explore.
- The `GridSearchCV` class performs an exhaustive search over these parameter combinations.
- The parameter grid is defined as a dictionary of lists:

    * Each entry's key is the name of the parameter.
    * Each entry's value is the list of values that we would like to explore.

In [7]:
param_grid = {
    'clf__C': [0.01, 0.5, 1.0],
    'clf__penalty': ['l1', 'l2'],
    'clf__solver': ['liblinear'],
    }
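
Before launching a search it can help to estimate its cost. The sketch below counts the candidates in the grid above and the resulting number of model fits, assuming five-fold cross-validation (`n_folds` is an assumption of this example).

```python
from math import prod

param_grid = {
    'clf__C': [0.01, 0.5, 1.0],
    'clf__penalty': ['l1', 'l2'],
    'clf__solver': ['liblinear'],
}
n_folds = 5  # assumption: five-fold cross-validation

# One candidate per combination of values; one fit per candidate per fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

If refitting is enabled, one additional fit on the full training set follows the search.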

Some key inputs to [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) are:

+ `estimator`: the pipeline or classifier that we are tuning.
+ `param_grid`: the parameter grid defined as a dictionary of lists described above.
+ `n_jobs`: settings for parallel computation.
+ `refit`: options for refitting the model using the best-performing configuration.

In [8]:
grid_cv = GridSearchCV(
    estimator=pipe_lr, 
    param_grid=param_grid, 
    scoring = scoring, 
    cv = 5,
    refit = "neg_log_loss")
grid_cv.fit(X_train, Y_train)

GridSearchCV
parameter,value
estimator,Pipeline(step...egression())])
param_grid,"{'clf__C': [0.01, 0.5, ...], 'clf__penalty': ['l1', 'l2'], 'clf__solver': ['liblinear']}"
scoring,"['neg_log_loss', 'roc_auc', ...]"
n_jobs,
refit,'neg_log_loss'
cv,5
verbose,0
pre_dispatch,'2*n_jobs'
error_score,
return_train_score,False

ColumnTransformer (preprocess)
parameter,value
transformers,"[('numeric_simple', ...)]"
remainder,'passthrough'
sparse_threshold,0.3
n_jobs,
transformer_weights,
verbose,False
verbose_feature_names_out,True
force_int_remainder_cols,'deprecated'

SimpleImputer (imputer)
parameter,value
missing_values,
strategy,'median'
fill_value,
copy,True
add_indicator,False
keep_empty_features,False

StandardScaler (standardizer)
parameter,value
copy,True
with_mean,True
with_std,True

LogisticRegression (clf)
parameter,value
penalty,'l1'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,
random_state,
solver,'liblinear'
max_iter,100

Access the cross-validation results using the attribute `.cv_results_`:

In [12]:
res = grid_cv.cv_results_
res = pd.DataFrame(res)
res.columns

res[['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_clf__C', 'param_clf__penalty', 'param_clf__solver', 'params',
       'mean_test_neg_log_loss',
       'std_test_neg_log_loss', 'rank_test_neg_log_loss']].sort_values('rank_test_neg_log_loss')

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf__C,param_clf__penalty,param_clf__solver,params,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss
4,5.222793,0.165292,0.037115,0.00151,1.0,l1,liblinear,"{'clf__C': 1.0, 'clf__penalty': 'l1', 'clf__so...",-0.225373,0.000492,1
5,0.338633,0.007494,0.037509,0.001842,1.0,l2,liblinear,"{'clf__C': 1.0, 'clf__penalty': 'l2', 'clf__so...",-0.225373,0.000488,2
2,5.145516,0.195592,0.037696,0.001681,0.5,l1,liblinear,"{'clf__C': 0.5, 'clf__penalty': 'l1', 'clf__so...",-0.225374,0.000488,3
3,0.341457,0.006971,0.037561,0.001018,0.5,l2,liblinear,"{'clf__C': 0.5, 'clf__penalty': 'l2', 'clf__so...",-0.225377,0.000481,4
0,3.484204,0.439289,0.035476,0.004364,0.01,l1,liblinear,"{'clf__C': 0.01, 'clf__penalty': 'l1', 'clf__s...",-0.227513,0.000446,5
1,0.315505,0.021464,0.035747,0.004147,0.01,l2,liblinear,"{'clf__C': 0.01, 'clf__penalty': 'l2', 'clf__s...",-0.228725,0.000628,6
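
The top-ranked configurations in the table above are nearly tied, while their fit times differ a lot. A minimal sketch of one way to weigh the two, using values copied from that table; the one-standard-deviation tolerance is an illustrative choice, not a rule from this notebook.

```python
# (C, penalty, mean_fit_time, mean_test_neg_log_loss, std_test_neg_log_loss)
rows = [
    (1.0,  'l1', 5.222793, -0.225373, 0.000492),
    (1.0,  'l2', 0.338633, -0.225373, 0.000488),
    (0.5,  'l1', 5.145516, -0.225374, 0.000488),
    (0.5,  'l2', 0.341457, -0.225377, 0.000481),
    (0.01, 'l1', 3.484204, -0.227513, 0.000446),
    (0.01, 'l2', 0.315505, -0.228725, 0.000628),
]

# Scores are negative log loss, so the highest value is the best.
best = max(rows, key=lambda r: r[3])
threshold = best[3] - best[4]  # best mean minus its standard deviation

# Keep configurations whose mean score is within the tolerance,
# then pick the one that was fastest to fit.
close_enough = [r for r in rows if r[3] >= threshold]
cheapest = min(close_enough, key=lambda r: r[2])

print(len(close_enough), "configurations within tolerance")
print("cheapest of those:", cheapest[:2])
```

On these numbers, four configurations fall within the tolerance, and `C=1.0` with the `l2` penalty is the fastest of them, fitting roughly fifteen times faster than the top-ranked `l1` configuration.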


Access the best-performing configuration:

In [13]:
grid_cv.best_params_

{'clf__C': 1.0, 'clf__penalty': 'l1', 'clf__solver': 'liblinear'}

In [14]:
grid_cv.best_estimator_

Pipeline
parameter,value
steps,"[('preprocess', ...), ('clf', ...)]"
transform_input,
memory,
verbose,False

ColumnTransformer (preprocess)
parameter,value
transformers,"[('numeric_simple', ...)]"
remainder,'passthrough'
sparse_threshold,0.3
n_jobs,
transformer_weights,
verbose,False
verbose_feature_names_out,True
force_int_remainder_cols,'deprecated'

SimpleImputer (imputer)
parameter,value
missing_values,
strategy,'median'
fill_value,
copy,True
add_indicator,False
keep_empty_features,False

StandardScaler (standardizer)
parameter,value
copy,True
with_mean,True
with_std,True

LogisticRegression (clf)
parameter,value
penalty,'l1'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,
random_state,
solver,'liblinear'
max_iter,100


This is the best-performing classifier (pipeline), refit on the complete training set using the best configuration.
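
Once a search has been refit, the held-out test set can give a final performance estimate. A minimal, self-contained sketch, assuming synthetic data from `make_classification` in place of the credit data; the names `demo_grid`, `X_demo`, and `y_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the credit data so that the example runs on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_grid = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={'C': [0.01, 0.5, 1.0]},
    scoring='neg_log_loss',
    cv=5,
)
demo_grid.fit(X_tr, y_tr)

# best_estimator_ was refit on all of X_tr; score it on unseen data.
proba = demo_grid.best_estimator_.predict_proba(X_te)
test_log_loss = log_loss(y_te, proba)
test_auc = roc_auc_score(y_te, proba[:, 1])
print(f"test log loss: {test_log_loss:.4f}, test ROC AUC: {test_auc:.4f}")
```

The test set plays no part in choosing the hyperparameters, so these scores are not biased by the search itself.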

# Tracking GridSearchCV Experiments

+ We can expand our infrastructure for hyperparameter tuning across various models.
+ The plan:

    - Create a model ingredient to obtain the classifier object.
    - Create experiment param grids to organize our parameter grids.
    - Schedule the experiments.
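
One way to picture the plan above is a small registry that pairs each model with its parameter grid. This is only a sketch under assumptions: `get_model`, `PARAM_GRIDS`, and `schedule_experiments` are hypothetical names, the random forest entry is an illustrative second model, and none of this is taken from the code in `./05_src/`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def get_model(name):
    """Hypothetical 'model ingredient': map a name to a fresh classifier."""
    models = {
        'logistic': lambda: LogisticRegression(solver='liblinear'),
        'random_forest': lambda: RandomForestClassifier(random_state=42),
    }
    return models[name]()


# Hypothetical experiment grids, keyed by model name.
PARAM_GRIDS = {
    'logistic': {'clf__C': [0.01, 0.5, 1.0], 'clf__penalty': ['l1', 'l2']},
    'random_forest': {'clf__n_estimators': [100, 300], 'clf__max_depth': [3, None]},
}


def schedule_experiments(names):
    """Pair each requested model with its grid; running them is left out."""
    return [(name, get_model(name), PARAM_GRIDS[name]) for name in names]


schedule = schedule_experiments(['logistic', 'random_forest'])
for name, model, grid in schedule:
    print(name, type(model).__name__, grid)
```

Each scheduled entry could then be placed in the `clf` step of a pipeline and passed to `GridSearchCV` together with its grid.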


Explore the code in `./05_src/exp__logistic_simple.py` and `./05_src/exp__logistic_grid_search.py`:

+ `exp__logistic_simple.py` implements a single experiment run in MLflow, i.e., the code trains and evaluates a single set of parameters.
+ `exp__logistic_grid_search.py` runs through a series of tests, each defined by one parametrization of the model pipeline. Each run is recorded independently as a parent run.
+ Also notice that we have pulled the data component of the experiment to a module of its own.

## Running Experiments from the Command Line

Access the experiment through the [Command Line Interface](https://sacred.readthedocs.io/en/stable/command_line.html).

```
cd src  # if required
python -m credit.exp__logistic_grid_search
```
