# Grid Search CV Example

This notebook contains an example of how the Grid Search CV module can be used to search for the best parameters when generating or optimising a set of rules.

The Grid Search CV process is as follows:
* Generate k-fold stratified cross validation datasets.
* Generate unique sets of rule generation/optimisation parameters from the given search space.
* For each of the training and validation datasets:
    * Train/optimise a set of rules using each unique set of parameters (using the training dataset).
    * Use the provided *scorer* to calculate the maximum overall performance of each rule set on the validation dataset.
* Return the parameters which generated the highest mean overall performance across the validation datasets.
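
The steps above can be sketched in plain Python. This is a minimal illustration of the general technique, not Iguanas's implementation: the helper names (`stratified_folds`, `grid_search_cv`, `fit_threshold_rule`, `accuracy`) and the toy data are all hypothetical, and a single threshold rule stands in for a generated rule set.

```python
from itertools import product
from statistics import mean


def stratified_folds(y, k):
    """Assign each row index to one of k folds, keeping class ratios similar."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        members = [i for i, label_i in enumerate(y) if label_i == label]
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds


def grid_search_cv(X, y, param_grid, fit, score, k=3):
    """Return (best_params, best_mean_score) over all parameter combinations."""
    names = list(param_grid)
    param_sets = [dict(zip(names, values)) for values in product(*param_grid.values())]
    folds = stratified_folds(y, k)
    mean_scores = []
    for params in param_sets:
        fold_scores = []
        for val_idx in folds:
            held_out = set(val_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            # Train on the other folds, then score on the held-out fold.
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], **params)
            fold_scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
        mean_scores.append(mean(fold_scores))
    best = max(range(len(param_sets)), key=mean_scores.__getitem__)
    return param_sets[best], mean_scores[best]


# Toy stand-ins (hypothetical): a "rule" flags rows where x > threshold.
def fit_threshold_rule(X_train, y_train, threshold):
    return lambda x: int(x > threshold)


def accuracy(rule, X_val, y_val):
    return mean(int(rule(x) == label) for x, label in zip(X_val, y_val))


X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
best_params, best_score = grid_search_cv(
    X, y, {'threshold': [2, 6, 9]}, fit_threshold_rule, accuracy
)
print(best_params, best_score)
```

In this toy case the threshold of 6 separates the two classes on every fold, so it is returned as the best parameter set.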

## Requirements

To run, you'll need the following:

* A labelled, processed dataset (nulls imputed, categorical features encoded).

----

## Import packages

In [1]:
from iguanas.rule_selection import GridSearchCV, GreedyFilter
from iguanas.rule_generation import RuleGeneratorDT
from iguanas.rule_optimisation import BayesianOptimiser
from iguanas.metrics.classification import FScore, Precision

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from hyperopt import tpe, anneal

## Read in data

Let's read in some labelled, processed dummy data.

In [2]:
X_train = pd.read_csv(
    'dummy_data/X_train_gs.csv', 
    index_col='eid'
)
y_train = pd.read_csv(
    'dummy_data/y_train_gs.csv', 
    index_col='eid'
).squeeze()
X_test = pd.read_csv(
    'dummy_data/X_test_gs.csv', 
    index_col='eid'
)
y_test = pd.read_csv(
    'dummy_data/y_test_gs.csv', 
    index_col='eid'
).squeeze()

----

## Generate rules using *GridSearchCV*

We can use the *GridSearchCV* class to implement stratified k-fold cross validation when searching for the best rule generation parameters. **This allows us to find the best rule generation parameters whilst also reducing the likelihood of overfitting.**

### Set up class parameters

We first need to define the search values for each parameter in the provided rule generation class. We define these in a dictionary, where the dictionary keys are the relevant rule generation parameters and the dictionary values are lists of search values for each parameter. The *GridSearchCV* class will then calculate each unique combination of parameter values, and find the set of parameters that produce the best mean overall rule performance across the folds.

In [3]:
p = Precision()
fs = FScore(beta=1)

In [4]:
param_grid = {
    'metric': [fs.fit], 
    'n_total_conditions': [1, 4], 
    'tree_ensemble': [
        RandomForestClassifier(n_estimators=5, random_state=0), 
        RandomForestClassifier(n_estimators=15, random_state=0), 
    ],
    'target_feat_corr_types': [None, 'Infer']
}
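
As a rough check on the size of the search, the unique combinations can be enumerated with `itertools.product`. This sketch uses stand-in string labels (hypothetical names) in place of the real metric and classifier objects, since only the number of values per parameter affects the count; how *GridSearchCV* enumerates combinations internally is not shown here.

```python
from itertools import product

# Stand-in labels replace the real metric and classifier objects (hypothetical
# names); only the number of values per parameter matters for the count.
param_grid_sketch = {
    'metric': ['f1'],
    'n_total_conditions': [1, 4],
    'tree_ensemble': ['rf_5_trees', 'rf_15_trees'],
    'target_feat_corr_types': [None, 'Infer'],
}

param_sets = [
    dict(zip(param_grid_sketch, values))
    for values in product(*param_grid_sketch.values())
]
n_folds = 3
print(len(param_sets))            # number of unique parameter sets
print(len(param_sets) * n_folds)  # number of fit/validate runs across folds
```

The counts (8 parameter sets, 24 fit/validate runs with 3 folds) agree with the "8 unique parameter sets" message and the 24/24 progress count in the outputs further down.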

Now that we have our search values, we can define the rest of the *GridSearchCV* class parameters. For the *scorer* parameter, we'll use the *GreedyFilter* class to calculate the best combined rule performance for each parameter set and fold.

**Note here that we are splitting the data into 3 folds for training/validation.**

**Please see the class docstring for more information on each parameter.**

In [5]:
scorer = GreedyFilter(
    metric=fs.fit, 
    sorting_metric=p.fit
)
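
To give a feel for what "best combined rule performance" means, here is a minimal sketch of a greedy filter, assuming it works roughly as follows: rank the rules by the sorting metric, OR together the top *n* rules for each *n*, and keep the *n* that maximises the main metric. The functions and toy data below are hypothetical and are not the *GreedyFilter* source code.

```python
def precision(y_true, y_pred):
    flagged = sum(y_pred)
    hits = sum(t and p for t, p in zip(y_true, y_pred))
    return hits / flagged if flagged else 0.0


def recall(y_true, y_pred):
    positives = sum(y_true)
    hits = sum(t and p for t, p in zip(y_true, y_pred))
    return hits / positives if positives else 0.0


def f1(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if p + r else 0.0


def greedy_filter(rule_flags, y_true, metric, sorting_metric):
    """Return (rules_to_keep, best_combined_score)."""
    ranked = sorted(
        rule_flags,
        key=lambda name: sorting_metric(y_true, rule_flags[name]),
        reverse=True,
    )
    combined = [0] * len(y_true)
    best_n, best_score = 0, float('-inf')
    for n, name in enumerate(ranked, start=1):
        # OR the next-best rule into the running combined prediction.
        combined = [c | f for c, f in zip(combined, rule_flags[name])]
        score = metric(y_true, combined)
        if score > best_score:
            best_n, best_score = n, score
    return ranked[:best_n], best_score


y_true = [1, 1, 1, 0, 0, 0]
rule_flags = {
    'rule_a': [1, 0, 0, 0, 0, 0],  # precision 1.0, recall 1/3
    'rule_b': [0, 1, 1, 0, 0, 0],  # precision 1.0, recall 2/3
    'rule_c': [1, 1, 1, 1, 1, 1],  # precision 0.5, flags everything
}
kept, score = greedy_filter(rule_flags, y_true, metric=f1, sorting_metric=precision)
print(kept, score)
```

Here the two high-precision rules together reach a perfect F1 score, and adding the noisy third rule would lower it, so only the first two are kept.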

In [6]:
params = {
    'rule_class': RuleGeneratorDT,
    'param_grid': param_grid,
    'scorer': scorer,
    'cv': 3,
    'num_cores': 2,
    'verbose': 1
}

### Instantiate class and run fit method

Once the parameters have been set, we can run the *.fit()* method to search for the best rule generation parameters.

In [7]:
gs_cv = GridSearchCV(**params)

8 unique parameter sets


In [8]:
gs_cv.fit(
    X=X_train, 
    y=y_train
)

--- Fitting and validating rules using folds ---
100%|██████████| 24/24 [00:05<00:00,  4.24it/s]
--- Re-fitting rules using best parameters on full dataset ---
--- Filtering rules to give best combined performance ---


### Outputs

See the `Attributes` section in the class docstring for a description of each attribute generated:

In [9]:
gs_cv.param_results_per_fold.head()

Fold,ParamSetIndex,metric,n_total_conditions,tree_ensemble,target_feat_corr_types,Performance
0,0,<bound method FScore.fit of FScore with beta=1>,1,"RandomForestClassifier(n_estimators=5, random_...",,0.511278
0,1,<bound method FScore.fit of FScore with beta=1>,1,"RandomForestClassifier(n_estimators=5, random_...",Infer,0.511278
0,2,<bound method FScore.fit of FScore with beta=1>,1,"(DecisionTreeClassifier(max_depth=4, max_featu...",,0.532258
0,3,<bound method FScore.fit of FScore with beta=1>,1,"(DecisionTreeClassifier(max_depth=4, max_featu...",Infer,0.532258
0,4,<bound method FScore.fit of FScore with beta=1>,4,"RandomForestClassifier(n_estimators=5, random_...",,0.614458


In [10]:
gs_cv.param_results_aggregated.head()

ParamSetIndex,metric,n_total_conditions,tree_ensemble,target_feat_corr_types,PerformancePerFold,MeanPerformance,StdDevPerformance
0,<bound method FScore.fit of FScore with beta=1>,1,"RandomForestClassifier(n_estimators=5, random_...",,"[0.5112781954887218, 0.35185185185185186, 0.38...",0.417464,0.068072
1,<bound method FScore.fit of FScore with beta=1>,1,"RandomForestClassifier(n_estimators=5, random_...",Infer,"[0.5112781954887218, 0.35185185185185186, 0.38...",0.417464,0.068072
2,<bound method FScore.fit of FScore with beta=1>,1,"(DecisionTreeClassifier(max_depth=4, max_featu...",,"[0.532258064516129, 0.38961038961038963, 0.402...",0.441582,0.064346
3,<bound method FScore.fit of FScore with beta=1>,1,"(DecisionTreeClassifier(max_depth=4, max_featu...",Infer,"[0.532258064516129, 0.38961038961038963, 0.402...",0.441582,0.064346
4,<bound method FScore.fit of FScore with beta=1>,4,"RandomForestClassifier(n_estimators=5, random_...",,"[0.6144578313253012, 0.538860103626943, 0.5]",0.551106,0.047523
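
The aggregated columns can be reproduced from the per-fold scores. The sketch below uses parameter set 4, the only row above whose three fold scores are printed in full. It assumes the standard deviation is the population form (dividing by n rather than n - 1), which appears to match the reported value.

```python
from statistics import mean, pstdev

# Per-fold scores for parameter set 4, copied from the PerformancePerFold column.
fold_scores = [0.6144578313253012, 0.538860103626943, 0.5]

mean_performance = mean(fold_scores)
# Assumption: population standard deviation (divide by n, not n - 1).
std_performance = pstdev(fold_scores)
print(round(mean_performance, 6), round(std_performance, 6))
```

The rounded results line up with the MeanPerformance (0.551106) and StdDevPerformance (0.047523) values shown for that row.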


In [11]:
gs_cv.best_score

0.5829592533675241

In [12]:
gs_cv.best_params

{'metric': <bound method FScore.fit of FScore with beta=1>,
 'n_total_conditions': 4,
 'tree_ensemble': RandomForestClassifier(max_depth=4, n_estimators=15, random_state=0),
 'target_feat_corr_types': 'Infer'}

----

## Apply rules to a separate dataset

Use the *.transform()* method to apply the best performing rules overall to a separate dataset.

In [13]:
X_rules_test = gs_cv.transform(
    X=X_test, 
    y=y_test, 
    sample_weight=None
)
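
Conceptually, applying a rule set to a dataset yields one binary column per rule. The following sketch illustrates the idea with plain dictionaries; the record IDs, feature names and rules are invented for illustration and are unrelated to the rules generated in this notebook or to how *.transform()* is implemented.

```python
# Hypothetical rows keyed by record ID; feature names are invented for illustration.
rows = {
    'id_1': {'order_total': 250.0, 'num_orders': 1},
    'id_2': {'order_total': 40.0, 'num_orders': 7},
    'id_3': {'order_total': 900.0, 'num_orders': 9},
}

# Each rule is a condition on one row; these are not rules from this notebook.
rules = {
    'Rule_A': lambda row: row['order_total'] > 200,
    'Rule_B': lambda row: row['order_total'] > 200 and row['num_orders'] >= 5,
}

# One binary column per rule, one entry per record (1 = rule triggered).
X_rules_sketch = {
    name: {record_id: int(rule(row)) for record_id, row in rows.items()}
    for name, rule in rules.items()
}
print(X_rules_sketch)
```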

### Outputs

The *.transform()* method returns a dataframe with one binary column per rule, showing which records each rule flags in the given dataset. See the `Attributes` section in the class docstring for a description of each attribute generated:

In [14]:
gs_cv.rule_descriptions.head()

Rule,Precision,Recall,PercDataFlagged,Metric,Logic,nConditions
RGDT_Rule_20211209_100,0.847826,0.336207,0.010497,0.481481,(X['account_number_sum_order_total_per_account...,3
RGDT_Rule_20211209_6,0.860465,0.318966,0.009813,0.465409,(X['account_number_avg_order_total_per_account...,2
RGDT_Rule_20211209_104,0.860465,0.318966,0.009813,0.465409,(X['account_number_sum_order_total_per_account...,2
RGDT_Rule_20211209_39,0.837838,0.267241,0.008444,0.405229,(X['account_number_avg_order_total_per_account...,2
RGDT_Rule_20211209_112,0.852941,0.25,0.007759,0.386667,(X['account_number_sum_order_total_per_account...,2


In [15]:
X_rules_test.head()

eid,RGDT_Rule_20211209_100,RGDT_Rule_20211209_6,RGDT_Rule_20211209_104,RGDT_Rule_20211209_39,RGDT_Rule_20211209_112,RGDT_Rule_20211209_46,RGDT_Rule_20211209_31,RGDT_Rule_20211209_121,RGDT_Rule_20211209_45,RGDT_Rule_20211209_23,...,RGDT_Rule_20211209_95,RGDT_Rule_20211209_94,RGDT_Rule_20211209_16,RGDT_Rule_20211209_88,RGDT_Rule_20211209_8,RGDT_Rule_20211209_43,RGDT_Rule_20211209_65,RGDT_Rule_20211209_15,RGDT_Rule_20211209_41,RGDT_Rule_20211209_30
975-8351797-7122581,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
785-6259585-7858053,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
057-4039373-1790681,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
095-5263240-3834186,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
980-3802574-0009480,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


----

## Optimise rules using *GridSearchCV*

We can use the *GridSearchCV* class to implement stratified k-fold cross validation when searching for the best rule optimisation parameters. **This allows us to find the best rule optimisation parameters whilst also reducing the likelihood of overfitting.**

### Set up class parameters

We first need to define the search values for each parameter in the provided rule optimisation class. We define these in a dictionary, where the dictionary keys are the relevant rule optimisation parameters and the dictionary values are lists of search values for each parameter. The *GridSearchCV* class will then calculate each unique combination of parameter values, and find the set of parameters that produces the best mean overall rule performance across the folds.

In [16]:
f0dot5 = FScore(beta=0.5)
f1 = FScore(beta=1)
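
The two metrics differ only in how they weight precision against recall. The sketch below uses the standard F-beta formula (it is not the *FScore* class itself) and applies it to the precision and recall of the top rule in the rule descriptions table from the previous section.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall (standard F-beta formula)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of the top rule in the earlier rule_descriptions table.
precision, recall = 0.847826, 0.336207
print(round(f_beta(precision, recall, beta=1), 4))
print(round(f_beta(precision, recall, beta=0.5), 4))
```

With beta=1 the result is about 0.4815, in line with that rule's Metric value; with beta=0.5 it rises to about 0.65, because beta below 1 gives more weight to precision, which is the stronger of the two for this rule.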

In this example, we'll optimise the rules we generated in the previous Grid Search exercise. To do this, we need to convert the generated rules into the standard Iguanas lambda expression format:

In [17]:
rule_lambdas = gs_cv.as_rule_lambdas(
    as_numpy=False, 
    with_kwargs=True
)
lambda_kwargs = gs_cv.lambda_kwargs
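
For orientation, here is a small sketch of what a rule in this format is assumed to look like when `with_kwargs=True`: each rule is a lambda that takes threshold values as keyword arguments and returns the rule's string form, with the current thresholds stored separately. The rule name, feature name and threshold below are hypothetical.

```python
# Hypothetical rule on an invented feature name; thresholds are left as
# named placeholders so an optimiser can substitute new values.
rule_lambdas_sketch = {
    'Rule_A': lambda **kwargs: "(X['order_total']>{order_total})".format(**kwargs)
}
lambda_kwargs_sketch = {
    'Rule_A': {'order_total': 200.0}
}

# Injecting the stored keyword arguments recovers the rule's string form.
rule_string = rule_lambdas_sketch['Rule_A'](**lambda_kwargs_sketch['Rule_A'])
print(rule_string)
```

Keeping the thresholds outside the rule logic is what lets an optimiser try different values without rebuilding the rule.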

In [18]:
param_grid = {
    'rule_lambdas': [rule_lambdas],
    'lambda_kwargs': [lambda_kwargs],
    'metric': [f0dot5.fit, f1.fit],
    'n_iter': [30],
    'algorithm': [tpe.suggest, anneal.suggest],
    'verbose': [0]
}

Now that we have our search values, we can define the rest of the *GridSearchCV* class parameters. For the *scorer* parameter, we'll use the *GreedyFilter* class to calculate the best combined rule performance for each parameter set and fold.

**Note here that we are splitting the data into 3 folds for training/validation.**

**Please see the class docstring for more information on each parameter.**

In [19]:
scorer = GreedyFilter(
    metric=fs.fit, 
    sorting_metric=p.fit
)

In [20]:
params = {
    'rule_class': BayesianOptimiser,
    'param_grid': param_grid,
    'scorer': scorer,
    'cv': 3,
    'num_cores': 2,
    'verbose': 1
}

### Instantiate class and run fit method

Once the parameters have been set, we can run the *.fit()* method to search for the best rule optimisation parameters.

In [21]:
gs_cv = GridSearchCV(**params)

4 unique parameter sets


In [22]:
gs_cv.fit(
    X=X_train, 
    y=y_train
)

--- Fitting and validating rules using folds ---
100%|██████████| 12/12 [00:18<00:00,  1.53s/it]
--- Re-fitting rules using best parameters on full dataset ---
--- Filtering rules to give best combined performance ---


### Outputs

See the `Attributes` section in the class docstring for a description of each attribute generated:

In [23]:
gs_cv.param_results_per_fold.head()

Fold,ParamSetIndex,rule_lambdas,lambda_kwargs,metric,n_iter,algorithm,verbose,Performance
0,0,{'RGDT_Rule_20211209_100': <function _ConvertR...,{'RGDT_Rule_20211209_100': {'account_number_su...,<bound method FScore.fit of FScore with beta=0.5>,30,<function suggest at 0x7f99f8ec4e50>,0,0.608108
0,1,{'RGDT_Rule_20211209_100': <function _ConvertR...,{'RGDT_Rule_20211209_100': {'account_number_su...,<bound method FScore.fit of FScore with beta=0.5>,30,<function suggest at 0x7f99f8ed6670>,0,0.614379
0,2,{'RGDT_Rule_20211209_100': <function _ConvertR...,{'RGDT_Rule_20211209_100': {'account_number_su...,<bound method FScore.fit of FScore with beta=1>,30,<function suggest at 0x7f99f8ec4e50>,0,0.574586
0,3,{'RGDT_Rule_20211209_100': <function _ConvertR...,{'RGDT_Rule_20211209_100': {'account_number_su...,<bound method FScore.fit of FScore with beta=1>,30,<function suggest at 0x7f99f8ed6670>,0,0.633803
1,0,{'RGDT_Rule_20211209_100': <function _ConvertR...,{'RGDT_Rule_20211209_100': {'account_number_su...,<bound method FScore.fit of FScore with beta=0.5>,30,<function suggest at 0x7f99f8ec4e50>,0,0.546763


In [24]:
gs_cv.param_results_aggregated.head()

ParamSetIndex,rule_lambdas,lambda_kwargs,metric,n_iter,algorithm,verbose,PerformancePerFold,MeanPerformance,StdDevPerformance
0,{'RGDT_Rule_20211209_100': <function _ConvertR...,{'RGDT_Rule_20211209_100': {'account_number_su...,<bound method FScore.fit of FScore with beta=0.5>,30,<function suggest at 0x7f99f8ec4e50>,0,"[0.608108108108108, 0.5467625899280575, 0.5350...",0.563301,0.032043
1,{'RGDT_Rule_20211209_100': <function _ConvertR...,{'RGDT_Rule_20211209_100': {'account_number_su...,<bound method FScore.fit of FScore with beta=0.5>,30,<function suggest at 0x7f99f8ed6670>,0,"[0.6143790849673203, 0.562962962962963, 0.5323...",0.569905,0.033836
2,{'RGDT_Rule_20211209_100': <function _ConvertR...,{'RGDT_Rule_20211209_100': {'account_number_su...,<bound method FScore.fit of FScore with beta=1>,30,<function suggest at 0x7f99f8ec4e50>,0,"[0.574585635359116, 0.5454545454545454, 0.5362...",0.552091,0.016346
3,{'RGDT_Rule_20211209_100': <function _ConvertR...,{'RGDT_Rule_20211209_100': {'account_number_su...,<bound method FScore.fit of FScore with beta=1>,30,<function suggest at 0x7f99f8ed6670>,0,"[0.6338028169014086, 0.5735294117647058, 0.549...",0.585451,0.035624


In [25]:
gs_cv.best_score

0.5854506121697506

In [26]:
gs_cv.best_params

{'rule_lambdas': {'RGDT_Rule_20211209_100': <function iguanas.rules._convert_rule_dicts_to_rule_strings._ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211209_104': <function iguanas.rules._convert_rule_dicts_to_rule_strings._ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211209_6': <function iguanas.rules._convert_rule_dicts_to_rule_strings._ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211209_39': <function iguanas.rules._convert_rule_dicts_to_rule_strings._ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211209_31': <function iguanas.rules._convert_rule_dicts_to_rule_strings._ConvertRuleDictsToRuleStrings._convert_to_lambda.<locals>._make_lambda.<locals>.<lambda>(**kwargs)>,
  'RGDT_Rule_20211209_112': <function 

----

## Apply rules to a separate dataset

Use the *.transform()* method to apply the best performing rules overall to a separate dataset.

In [27]:
X_rules_test = gs_cv.transform(
    X=X_test, 
    y=y_test, 
    sample_weight=None
)

### Outputs

The *.transform()* method returns a dataframe with one binary column per rule, showing which records each rule flags in the given dataset. See the `Attributes` section in the class docstring for a description of each attribute generated:

In [28]:
gs_cv.rule_descriptions.head()

Rule,Precision,Recall,PercDataFlagged,Metric,Logic,nConditions
RGDT_Rule_20211209_100,0.847826,0.336207,0.010497,0.481481,(X['account_number_sum_order_total_per_account...,3
RGDT_Rule_20211209_22,0.857143,0.310345,0.009585,0.455696,(X['account_number_avg_order_total_per_account...,3
RGDT_Rule_20211209_44,0.857143,0.310345,0.009585,0.455696,(X['account_number_avg_order_total_per_account...,3
RGDT_Rule_20211209_60,0.857143,0.310345,0.009585,0.455696,(X['account_number_avg_order_total_per_account...,3
RGDT_Rule_20211209_9,0.857143,0.310345,0.009585,0.455696,(X['account_number_avg_order_total_per_account...,3


In [29]:
X_rules_test.head()

eid,RGDT_Rule_20211209_100,RGDT_Rule_20211209_22,RGDT_Rule_20211209_44,RGDT_Rule_20211209_60,RGDT_Rule_20211209_9,RGDT_Rule_20211209_52,RGDT_Rule_20211209_31,RGDT_Rule_20211209_141,RGDT_Rule_20211209_140,RGDT_Rule_20211209_150,RGDT_Rule_20211209_113,RGDT_Rule_20211209_117,RGDT_Rule_20211209_41,RGDT_Rule_20211209_64,RGDT_Rule_20211209_16,RGDT_Rule_20211209_43
975-8351797-7122581,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
785-6259585-7858053,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
057-4039373-1790681,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
095-5263240-3834186,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
980-3802574-0009480,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


----