# Rules-Based System (RBS) example

This is an example of how Iguanas can be used to set up a Rules-Based System (RBS) from scratch. This includes:

* Generating new rules
* Optimising existing rules
* Combining these rules and removing those which are unnecessary
* Setting up and optimising the RBS pipeline
* Testing the optimised RBS pipeline on a test set

In this example, we'll be creating an RBS for a **transaction fraud use case** (i.e. identifying potentially fraudulent transactions). The metric that we'll be optimising for is the **F1 Score**, and we'll focus only on **rules to capture fraudulent behaviour** (rather than also including rules which capture good behaviour, which is also a valid approach).
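
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from raw confusion counts; the function name and the example counts are illustrative and are not part of Iguanas.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score given true positives, false positives and false negatives."""
    if tp == 0:
        # No fraud caught: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 frauds caught, 20 false alarms, 120 frauds missed
score = f1_from_counts(tp=80, fp=20, fn=120)
print(round(score, 4))
```

Because the harmonic mean is dominated by the smaller of the two values, a rule set with high precision but low recall (as in these made-up counts) still scores poorly.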

## Requirements

To run, you'll need the following:

* A raw, labelled dataset.

---

## Table of contents

1. [Read/process data](#ReadProcessData)
2. [Rule Generation](#RuleGeneration)
3. [Rule Optimisation](#RuleOptimisation)
4. [Combine rules and remove those which are unnecessary](#CombineRules)
5. [Set up the RBS Pipeline](#SetUpPipeline)
6. [Optimise the RBS Pipeline](#OptimiseThePipeline)
7. [Filter rules for the optimised RBS Pipeline](#FilterRulesForPipeline)
8. [Apply the optimised RBS Pipeline to the test set](#ApplyPipeline)
9. [Convert generated rule conditions to system-ready](#ConvertGenToSys)
10. [Our final rule set and RBS Pipeline](#FinalRuleSet)
11. [There's an easier (and better) way!](#BetterWay)

----

## Import packages

In [1]:
from iguanas.rule_generation import RuleGeneratorDT
from iguanas.rule_optimisation import BayesianOptimiser
from iguanas.metrics.classification import FScore, Precision
from iguanas.metrics.pairwise import JaccardSimilarity
from iguanas.rules import Rules, ConvertProcessedConditionsToGeneral, ReturnMappings
from iguanas.correlation_reduction import AgglomerativeClusteringReducer
from iguanas.rule_selection import SimpleFilter, GreedyFilter, CorrelatedFilter
from iguanas.rbs import RBSPipeline, RBSOptimiser

import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from category_encoders.one_hot import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
import pickle
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
import seaborn as sns

In [2]:

DATA_FOLDER_NAME    = 'BlockChainAttacksDataSet'
DATA_FOLDER_PATTERN = 'DataSet0'
DATA_FILE_EXT       = 'csv'

PROJECT_DIR_NAME = 'CyVers' #<! Royi: Anton, don't change it, it should be a team constant
PROJECT_DIR_PATH = os.path.join(os.getcwd()[:os.getcwd().find(PROJECT_DIR_NAME)], PROJECT_DIR_NAME) #>! Pay attention, it will create issues in cases you name the folder `CyVersMe` or anything after / before `CyVers`
TRAIN_BY_TSX    = 1
TRAIN_BY_FILES  = 2
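
The substring search used for `PROJECT_DIR_PATH` breaks when a parent folder merely contains the project name, as the comment warns. A more robust alternative, sketched here under the assumption that the project folder is an ancestor of the working directory, matches whole path components instead. `find_project_root` is a hypothetical helper, not part of the project code, and the example path is made up.

```python
from pathlib import PurePath, PurePosixPath
from typing import Optional


def find_project_root(start: PurePath, dir_name: str) -> Optional[PurePath]:
    """Walk upwards from `start` and return the nearest ancestor named exactly `dir_name`."""
    for candidate in (start, *start.parents):
        if candidate.name == dir_name:
            return candidate
    return None


# A look-alike parent folder (`CyVersMe`) no longer confuses the lookup
example = PurePosixPath('/home/user/CyVersMe/CyVers/notebooks')
root = find_project_root(example, 'CyVers')
print(root)
```

In the notebook, the same helper could be called with `Path.cwd()` as the starting point.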

In [3]:
from DataSetsAuxFun import *

In [4]:
# Parameters
dataSetRotoDir = os.path.join(PROJECT_DIR_PATH, DATA_FOLDER_NAME)

# Features Analysis
numCrossValPps = 4

# Training
trainMode = TRAIN_BY_FILES
testSetRatio = 1 / 3
numKFolds = 3
gridSearchScore = 'f1' #<! Use strings from `sklearn.metrics.get_scorer_names()`
gridSearchScore = 'recall' #<! We need to have better PD

# Amount USD Outlier threshold
amountUsdOutlierThr = 1e9

randomState = 42

# lSlctedFeaturesRaw    = ['Amount', 'Currency', 'Amount [USD]', 'Receiver Type']
lSlctedFeaturesRaw    = ['Amount', 'Currency', 'Currency Type', 'Amount [USD]', 'Receiver Type', 'Gas Price', 'Gas Limit', 'Gas Used']
lSlctedFeaturesCalc   = [enumObj.name for enumObj in FeatureName if ((enumObj is not FeatureName.TIME_MAX) and (enumObj is not FeatureName.TIME_MIN))]
lSlctdFeatures        = lSlctedFeaturesRaw + lSlctedFeaturesCalc
# lCatFeatures          = ['Currency', 'Receiver Type']
lCatFeatures          = ['Currency', 'Currency Type', 'Receiver Type']
# lFeaturesRemove       = [FeatureName.TIME_MAX.name, FeatureName.TIME_MIN.name] #<! Auxiliary features to be removed before processing

timeColStr = 'Block Time'

In [5]:
# Loading / Generating Data
lCsvFile = ExtractCsvFiles(dataSetRotoDir, folderNamePattern = DATA_FOLDER_PATTERN)
print(f'The number of file found: {len(lCsvFile)}')

lCsvColName     = ['Transaction ID', 'Block Time', 'Transaction Time', 'Sender ID', 'Receiver ID', 'Receiver Type', 'Amount', 'Currency', 'Currency Hash', 'Currency Type', 'Amount [USD]', 'Gas Price', 'Gas Limit', 'Gas Used', 'Gas Predicted', 'Balance In', 'Balance Out', 'Label', 'Risk Level']
lCsvColNameFlag = [True,              True,         True,               True,        True,          True,            True,     True,       True,            True,            True,           True,        True,        True,       True,            True,         True,          False,   False]  #<! Flags if a column is a must to have

# dfData = pd.read_csv(os.path.join(DATA_FOLDER_NAME, csvFileName))
#dfData, dAssetFile = LoadCsvFilesDf(lCsvFile, baseFoldePath = '')
dfData, dAssetFile =  LoadCsvFilesDf(lCsvFile, baseFoldePath = '', lColName = lCsvColName, lColFlag =  lCsvColNameFlag)

numRows, numCols = dfData.shape

print(f"The number of rows (Samples): {numRows}, The number of columns: {numCols}, number of unique sender id's: {dfData['Sender ID'].unique().shape}")
print(f'The data list of columns is: {dfData.columns} with {len(dfData.columns)} columns')

The number of file found: 328


The number of rows (Samples): 1845647, The number of columns: 19, number of unique sender id's: (328,)
The data list of columns is: Index(['Transaction ID', 'Transaction Time', 'Block Time', 'Sender ID',
       'Receiver ID', 'Receiver Type', 'Amount', 'Currency', 'Currency Hash',
       'Currency Type', 'Amount [USD]', 'Gas Price', 'Gas Limit', 'Gas Used',
       'Gas Predicted', 'Balance In', 'Balance Out', 'Label', 'Risk Level'],
      dtype='object') with 19 columns


In [6]:
dfData[timeColStr] = pd.to_datetime(dfData[timeColStr], infer_datetime_format = True) #<! Stable time format

In [7]:
# Sort data by transaction date
dfData.sort_values(timeColStr, inplace = True)
# dfData.reset_index(drop = True, inplace = True)

In [8]:
# Detecting invalid `Amount USD`

dsInValidTrnsUsd = ((dfData['Amount [USD]'] == 0) | (dfData['Amount [USD]'].isna()) | (dfData['Amount [USD]'] == ''))

print(f'Number of invalid `Amount [USD]`: {dsInValidTrnsUsd.sum()}')

Number of invalid `Amount [USD]`: 163887


In [9]:
# Remove invalid data
dfData.drop(dfData.index[dsInValidTrnsUsd], inplace = True) #<! Royi: Should we do a reset index?

In [10]:
# Detecting Outliers in the `Amount [USD]`

dsOutlierTrnsUsd = ((dfData['Amount [USD]'] >= amountUsdOutlierThr) | (dfData['Amount [USD]'] <= 0))

print(f'Number of outliers `Amount [USD]`: {dsOutlierTrnsUsd.sum()}')

Number of outliers `Amount [USD]`: 0


In [11]:
# Remove outliers
dfData.drop(dfData.index[dsOutlierTrnsUsd], inplace = True) #<! Royi: Should we do a reset index?

In [12]:
# From now on this is the data to work with
numRows, numCols = dfData.shape

print(f'The number of rows (Samples): {numRows}, The number of columns: {numCols}')

The number of rows (Samples): 1681760, The number of columns: 19


In [13]:
numGrps = dfData.GrpBySender.numGrps

In [14]:
dfGbs = dfData.GrpBySender

In [15]:
# Features - Amount Based

sum_s           = dfGbs.AggBySender(colName = dfGbs.amountUSDColLabel, grpLabel = None, calcType = CalcType.TYPE_SUM)
mean_s          = dfGbs.AggBySender(colName = dfGbs.amountUSDColLabel, grpLabel = None, calcType = CalcType.TYPE_MEAN)
std_s           = dfGbs.AggBySender(colName = dfGbs.amountUSDColLabel, grpLabel = None, calcType = CalcType.TYPE_STD)
var_s           = dfGbs.AggBySender(colName = dfGbs.amountUSDColLabel, grpLabel = None, calcType = CalcType.TYPE_VAR)
median_s        = dfGbs.AggBySender(colName = dfGbs.amountUSDColLabel, grpLabel = None, calcType = CalcType.TYPE_MEDIAN)
count_s         = dfGbs.AggBySender(colName = dfGbs.amountUSDColLabel, grpLabel = None, calcType = CalcType.TYPE_COUNT)
min_s           = dfGbs.AggBySender(colName = dfGbs.amountUSDColLabel, grpLabel = None, calcType = CalcType.TYPE_MIN)
max_s           = dfGbs.AggBySender(colName = dfGbs.amountUSDColLabel, grpLabel = None, calcType = CalcType.TYPE_MAX)
coint_c         = dfGbs.AggBySender(colName = dfGbs.currencyColLabel, grpLabel = None, calcType = CalcType.TYPE_COUNT_COIN_TYPE)
receiver_type_c = dfGbs.AggBySender(colName = dfGbs.receiverTypeColLabel, grpLabel = None, calcType = CalcType.TYPE_COUNT_RECEIVER_TYPE)

gas_pr_mean     = dfGbs.AggBySender(colName = dfGbs.gasPriceColLabel, grpLabel = None, calcType = CalcType.TYPE_MEAN)
gas_lim_mean    = dfGbs.AggBySender(colName = dfGbs.gasLimitColLabel, grpLabel = None, calcType = CalcType.TYPE_MEAN)
gas_used_mean   = dfGbs.AggBySender(colName = dfGbs.gasUsedColLabel, grpLabel = None, calcType = CalcType.TYPE_MEAN)
gas_pr_std      = dfGbs.AggBySender(colName = dfGbs.gasPriceColLabel, grpLabel = None, calcType = CalcType.TYPE_STD)
gas_lim_std     = dfGbs.AggBySender(colName = dfGbs.gasLimitColLabel, grpLabel = None, calcType = CalcType.TYPE_STD)
gas_used_std    = dfGbs.AggBySender(colName = dfGbs.gasUsedColLabel, grpLabel = None, calcType = CalcType.TYPE_STD)
gas_pr_med      = dfGbs.AggBySender(colName = dfGbs.gasPriceColLabel, grpLabel = None, calcType = CalcType.TYPE_MEDIAN)
gas_lim_med     = dfGbs.AggBySender(colName = dfGbs.gasLimitColLabel, grpLabel = None, calcType = CalcType.TYPE_MEDIAN)
gas_used_med    = dfGbs.AggBySender(colName = dfGbs.gasUsedColLabel, grpLabel = None, calcType = CalcType.TYPE_MEDIAN)


dfData[FeatureName.AMOUNT_SUM_ASSET.name]          = sum_s
dfData[FeatureName.AMOUNT_MEAN_ASSET.name]         = mean_s
dfData[FeatureName.AMOUNT_STD_ASSET.name]          = std_s
dfData[FeatureName.AMOUNT_VAR_ASSET.name]          = var_s
dfData[FeatureName.AMOUNT_MEDIAN_ASSET.name]       = median_s
dfData[FeatureName.AMOUNT_MIN_ASSET.name]          = min_s
dfData[FeatureName.AMOUNT_MAX_ASSET.name]          = max_s
dfData[FeatureName.TSX_COUNT_ASSET.name]           = count_s
dfData[FeatureName.COIN_TYPE_COUNT_ASSET.name]     = coint_c
dfData[FeatureName.RECEIVER_TYPE_COUNT_ASSET.name] = receiver_type_c

dfData[FeatureName.GAS_PRICE_MEAN_ASSET.name] = gas_pr_mean
dfData[FeatureName.GAS_PRICE_STD_ASSET.name] = gas_pr_std
dfData[FeatureName.GAS_PRICE_MEDIAN_ASSET.name] = gas_pr_med

dfData[FeatureName.GAS_LIMIT_MEAN_ASSET.name] = gas_lim_mean
dfData[FeatureName.GAS_LIMIT_STD_ASSET.name] = gas_lim_std
dfData[FeatureName.GAS_LIMIT_MEDIAN_ASSET.name] = gas_lim_med

dfData[FeatureName.GAS_USED_MEAN_ASSET.name] = gas_used_mean
dfData[FeatureName.GAS_USED_STD_ASSET.name] = gas_used_std
dfData[FeatureName.GAS_USED_MEDIAN_ASSET.name] = gas_used_med

#COIN_TYPE_COUNT_USR                 

In [16]:
# Features - Time Based

td_mean_s   = dfGbs.AggBySender(colName = dfGbs.timeDiffAssetColLabel, grpLabel = None, calcType = CalcType.TYPE_TIME_DIFF_MEAN)
td_std_s    = dfGbs.AggBySender(colName = dfGbs.timeDiffAssetColLabel, grpLabel = None, calcType = CalcType.TYPE_TIME_DIFF_STD)
td_median_s = dfGbs.AggBySender(colName = dfGbs.timeDiffAssetColLabel, grpLabel = None, calcType = CalcType.TYPE_TIME_DIFF_MEDIAN)
td_min_s    = dfGbs.AggBySender(colName = dfGbs.timeDiffAssetColLabel, grpLabel = None, calcType = CalcType.TYPE_TIME_DIFF_MIN)
td_max_s    = dfGbs.AggBySender(colName = dfGbs.timeDiffAssetColLabel, grpLabel = None, calcType = CalcType.TYPE_TIME_DIFF_MAX)

dfData[FeatureName.TIME_DIFF_MEAN_ASSET.name]   = td_mean_s
dfData[FeatureName.TIME_DIFF_STD_ASSET.name]    = td_std_s
dfData[FeatureName.TIME_DIFF_MEDIAN_ASSET.name] = td_median_s
dfData[FeatureName.TIME_DIFF_MIN_ASSET.name]    = td_min_s
dfData[FeatureName.TIME_DIFF_MAX_ASSET.name]    = td_max_s

In [17]:
# Features - Amount Based (User)

sum_s           = dfGbs.AggByReceiver(colName = dfGbs.amountUSDColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_SUM)
mean_s          = dfGbs.AggByReceiver(colName = dfGbs.amountUSDColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_MEAN)
std_s           = dfGbs.AggByReceiver(colName = dfGbs.amountUSDColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_STD)
var_s           = dfGbs.AggByReceiver(colName = dfGbs.amountUSDColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_VAR)
median_s        = dfGbs.AggByReceiver(colName = dfGbs.amountUSDColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_MEDIAN)
count_s         = dfGbs.AggByReceiver(colName = dfGbs.amountUSDColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_COUNT)
min_s           = dfGbs.AggByReceiver(colName = dfGbs.amountUSDColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_MIN)
max_s           = dfGbs.AggByReceiver(colName = dfGbs.amountUSDColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_MAX)
coin_c          = dfGbs.AggByReceiver(colName = dfGbs.currencyColLabel, grpLabel = None, calcType = CalcType.TYPE_COUNT_COIN_TYPE)
receiver_type_c = dfGbs.AggByReceiver(colName = dfGbs.receiverTypeColLabel, grpLabel = None, calcType = CalcType.TYPE_COUNT_RECEIVER_TYPE) #<! Royi: We need to check why is it so important?!?!

gas_pr_mean     = dfGbs.AggByReceiver(colName = dfGbs.gasPriceColLabel, grpLabel = None, calcType = CalcType.TYPE_MEAN)
gas_lim_mean    = dfGbs.AggByReceiver(colName = dfGbs.gasLimitColLabel, grpLabel = None, calcType = CalcType.TYPE_MEAN)
gas_used_mean   = dfGbs.AggByReceiver(colName = dfGbs.gasUsedColLabel, grpLabel = None, calcType = CalcType.TYPE_MEAN)

gas_pr_std      = dfGbs.AggByReceiver(colName = dfGbs.gasPriceColLabel, grpLabel = None, calcType = CalcType.TYPE_STD)
gas_lim_std     = dfGbs.AggByReceiver(colName = dfGbs.gasLimitColLabel, grpLabel = None, calcType = CalcType.TYPE_STD)
gas_used_std    = dfGbs.AggByReceiver(colName = dfGbs.gasUsedColLabel, grpLabel = None, calcType = CalcType.TYPE_STD)

gas_pr_med      = dfGbs.AggByReceiver(colName = dfGbs.gasPriceColLabel, grpLabel = None, calcType = CalcType.TYPE_MEDIAN)
gas_lim_med     = dfGbs.AggByReceiver(colName = dfGbs.gasLimitColLabel, grpLabel = None, calcType = CalcType.TYPE_MEDIAN)
gas_used_med    = dfGbs.AggByReceiver(colName = dfGbs.gasUsedColLabel, grpLabel = None, calcType = CalcType.TYPE_MEDIAN)

gas_pr_quant    = dfGbs.dfSubGrpByRec[dfGbs.gasPriceColLabel].transform('quantile' ,q =0.75)#dfGbs.AggByReceiver(colName = dfGbs.gasPriceColLabel, grpLabel = None, calcType = CalcType.TYPE_PCTILE)
gas_lim_quant   = dfGbs.dfSubGrpByRec[dfGbs.gasLimitColLabel].transform('quantile' ,q =0.75)#dfGbs.AggByReceiver(colName = dfGbs.gasLimitColLabel, grpLabel = None, calcType = CalcType.TYPE_PCTILE)
gas_used_quant  = dfGbs.dfSubGrpByRec[dfGbs.gasUsedColLabel].transform('quantile' ,q =0.75)#dfGbs.AggByReceiver(colName = dfGbs.gasUsedColLabel, grpLabel = None, calcType = CalcType.TYPE_PCTILE)


dfData[FeatureName.AMOUNT_SUM_USR.name]          = sum_s
dfData[FeatureName.AMOUNT_MEAN_USR.name]         = mean_s
dfData[FeatureName.AMOUNT_STD_USR.name]          = std_s
dfData[FeatureName.AMOUNT_VAR_USR.name]          = var_s
dfData[FeatureName.AMOUNT_MEDIAN_USR.name]       = median_s
dfData[FeatureName.AMOUNT_MIN_USR.name]          = min_s
dfData[FeatureName.AMOUNT_MAX_USR.name]          = max_s
dfData[FeatureName.TSX_COUNT_USR.name]           = count_s
dfData[FeatureName.COIN_TYPE_COUNT_USR.name]     = coin_c
dfData[FeatureName.RECEIVER_TYPE_COUNT_USR.name] = receiver_type_c    

dfData[FeatureName.GAS_PRICE_MEAN_USR.name] = gas_pr_mean
dfData[FeatureName.GAS_PRICE_STD_USR.name] = gas_pr_std
dfData[FeatureName.GAS_PRICE_MEDIAN_USR.name] = gas_pr_med

dfData[FeatureName.GAS_LIMIT_MEAN_USR.name] = gas_lim_mean
dfData[FeatureName.GAS_LIMIT_STD_USR.name] = gas_lim_std
dfData[FeatureName.GAS_LIMIT_MEDIAN_USR.name] = gas_lim_med

dfData[FeatureName.GAS_USED_MEAN_USR.name] = gas_used_mean
dfData[FeatureName.GAS_USED_STD_USR.name] = gas_used_std
dfData[FeatureName.GAS_USED_MEDIAN_USR.name] = gas_used_med

dfData[FeatureName.GAS_PRICE_QUANTILE_USR.name] = gas_pr_quant
dfData[FeatureName.GAS_LIMIT_QUANTILE_USR.name] = gas_lim_quant
dfData[FeatureName.GAS_USED_QUANTILE_USR.name] = gas_used_quant


In [18]:
# Features - Time Based (User)

td_mean_s   = dfGbs.AggByReceiver(colName = dfGbs.timeDiffUserColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_TIME_DIFF_MEAN)
td_std_s    = dfGbs.AggByReceiver(colName = dfGbs.timeDiffUserColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_TIME_DIFF_STD)
td_median_s = dfGbs.AggByReceiver(colName = dfGbs.timeDiffUserColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_TIME_DIFF_MEDIAN)
td_min_s    = dfGbs.AggByReceiver(colName = dfGbs.timeDiffUserColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_TIME_DIFF_MIN)
td_max_s    = dfGbs.AggByReceiver(colName = dfGbs.timeDiffUserColLabel, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_TIME_DIFF_MAX)

dfData[FeatureName.TIME_DIFF_MEAN_USR.name]   = td_mean_s
dfData[FeatureName.TIME_DIFF_STD_USR.name]    = td_std_s
dfData[FeatureName.TIME_DIFF_MEDIAN_USR.name] = td_median_s
dfData[FeatureName.TIME_DIFF_MIN_USR.name]    = td_min_s
dfData[FeatureName.TIME_DIFF_MAX_USR.name]    = td_max_s

In [19]:
# Features - Time Based

dfData[FeatureName.TIME_HOUR.name]    = dfGbs.GetTimeVals(periodTimeType = PeriodTimeType.HOUR_DAY)
dfData[FeatureName.TIME_WEEKDAY.name] = dfGbs.GetTimeVals(periodTimeType = PeriodTimeType.DAY_WEEK)

In [20]:
# Ratio Based Features

dfData[FeatureName.AMOUNT_MEAN_RATIO_USR_ASSET.name]    = dfData[FeatureName.AMOUNT_MEAN_USR.name] / dfData[FeatureName.AMOUNT_MEAN_ASSET.name]
dfData[FeatureName.AMOUNT_STD_RATIO_USR_ASSET.name]    = dfData[FeatureName.AMOUNT_STD_USR.name] / dfData[FeatureName.AMOUNT_STD_ASSET.name]
dfData[FeatureName.TIME_DIFF_MEAN_RATIO_USR_ASSET.name] = dfData[FeatureName.TIME_DIFF_MEAN_USR.name] / dfData[FeatureName.TIME_DIFF_MEAN_ASSET.name]
dfData[FeatureName.TIME_DIFF_STD_RATIO_USR_ASSET.name] = dfData[FeatureName.TIME_DIFF_STD_USR.name] / dfData[FeatureName.TIME_DIFF_STD_ASSET.name]

In [21]:
# Frequency Based Features

dfData[FeatureName.TIME_MAX.name] = dfGbs.AggByReceiver(colName = timeColStr, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_MAX)
dfData[FeatureName.TIME_MIN.name] = dfGbs.AggByReceiver(colName = timeColStr, grpLabel = None, subGrpLabel = None, calcType = CalcType.TYPE_MIN)

dfData[FeatureName.TIME_INTERVL_USR.name] = ((dfData[FeatureName.TIME_MAX.name] - dfData[FeatureName.TIME_MIN.name])).dt.total_seconds()

# Frequency of the User Transactions
dfData[FeatureName.TSX_FREQ_HZ_USR.name] = dfData[FeatureName.TSX_COUNT_USR.name] / dfData[FeatureName.TIME_INTERVL_USR.name]
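
One caveat with the frequency feature: a user with a single transaction has a zero-length interval, so the division yields an infinite value that the later null imputation may not catch. A scalar sketch of a guarded version follows; `tsx_frequency_hz` is an illustrative helper, and returning NaN for degenerate intervals is an assumption about the desired behaviour.

```python
import math
from typing import Optional


def tsx_frequency_hz(tsx_count: int, interval_seconds: Optional[float]) -> float:
    """Transactions per second, or NaN when the interval is zero, negative or missing."""
    if interval_seconds is None or math.isnan(interval_seconds) or interval_seconds <= 0:
        return math.nan
    return tsx_count / interval_seconds


print(tsx_frequency_hz(10, 3600.0))  # 10 transactions over one hour
print(tsx_frequency_hz(1, 0.0))      # single transaction: frequency is undefined
```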

In [22]:
#Ratios between the user and the mean of all users.
dfData[FeatureName.GAS_PRICE_USR_ASSET_RATIO_MEAN.name] = dfData[FeatureName.GAS_PRICE_MEAN_USR.name] / dfData[FeatureName.GAS_PRICE_MEAN_ASSET.name]
dfData[FeatureName.GAS_LIMIT_USR_ASSET_RATIO_MEAN.name] = dfData[FeatureName.GAS_LIMIT_MEAN_USR.name] / dfData[FeatureName.GAS_LIMIT_MEAN_ASSET.name]
dfData[FeatureName.GAS_USED_USR_ASSET_RATIO_MEAN.name] = dfData[FeatureName.GAS_USED_MEAN_USR.name] / dfData[FeatureName.GAS_USED_MEAN_ASSET.name] 
#Gas Price', 'Gas Limit', 'Gas Used'
dfData[FeatureName.GAS_PRICE_LIMIT_RATIO.name] = dfData['Gas Price'] / dfData['Gas Limit']
dfData[FeatureName.GAS_PRICE_USED_RATIO.name] = dfData['Gas Price'] / dfData['Gas Used']
dfData[FeatureName.GAS_USED_LIMIT_RATIO.name] = dfData['Gas Used'] / dfData['Gas Limit'] 

dfData[FeatureName.GAS_PRICE_LIMIT_RATIO_MEAN.name] = dfData[FeatureName.GAS_PRICE_MEAN_USR.name] / dfData[FeatureName.GAS_LIMIT_MEAN_USR.name]
dfData[FeatureName.GAS_PRICE_USED_RATIO_MEAN.name] = dfData[FeatureName.GAS_PRICE_MEAN_USR.name] / dfData[FeatureName.GAS_USED_MEAN_USR.name]
dfData[FeatureName.GAS_USED_LIMIT_RATIO_MEAN.name] = dfData[FeatureName.GAS_USED_MEAN_USR.name] / dfData[FeatureName.GAS_LIMIT_MEAN_USR.name]


#Compare it to 75 quantile (TSX Gas Price / Quantile(75) of Gas Price).
dfData[FeatureName.GAS_PRICE_QUANTILE_RATIO.name] = dfData['Gas Price'] / dfData[FeatureName.GAS_PRICE_QUANTILE_USR.name]
dfData[FeatureName.GAS_LIMIT_QUANTILE_RATIO.name] = dfData['Gas Limit'] / dfData[FeatureName.GAS_LIMIT_QUANTILE_USR.name]
dfData[FeatureName.GAS_USED_QUANTILE_RATIO.name] =  dfData['Gas Used'] / dfData[FeatureName.GAS_USED_QUANTILE_USR.name]

In [23]:
#Feature to indicate first transaction
dfData[FeatureName.MIN_INDICATOR.name] = 0 ; dfData.loc[dfData[timeColStr] == dfData[FeatureName.TIME_MIN.name], FeatureName.MIN_INDICATOR.name] = 1 
# TODO: this may be incorrect and needs a review.

In [24]:
#56
#Create features based on the currency of the transactions:
# 1. The number of different types of currencies per user. <-- done previously = dfData[FeatureName.COIN_TYPE_COUNT_USR.name]
# 2. The average of the number of types of all user for an asset. <-- groupby asset , mean(number of different types of currencies per user)
# 3. The ratio between a specific user to the average of the asset. --> 1/2
    

dfData[FeatureName.COIN_TYPE_COUNT_USR_MEAN_ASSET.name]    = dfGbs.AvgByUserCoinType()
dfData[FeatureName.COIN_TYPE_USR_MEAN_ASSET_RATIO.name]  = dfData[FeatureName.COIN_TYPE_COUNT_USR.name] / dfData[FeatureName.COIN_TYPE_COUNT_USR_MEAN_ASSET.name]


In [78]:
dfData_ = dfData.copy(deep=True) # Copy of the data frame for experimenting with categorical variables

In [91]:
l_sfs = [
    #'Amount',
 #'Gas Price',
 #'Gas Used',
 'AMOUNT_SUM_USR',
 'AMOUNT_MEAN_ASSET',
 'AMOUNT_STD_USR',
 'AMOUNT_VAR_USR',
 'AMOUNT_MIN_ASSET',
 'AMOUNT_MIN_USR',
 'AMOUNT_MAX_USR',
 'TIME_DIFF_MEAN_USR',
 'TIME_DIFF_STD_USR',
 'TIME_DIFF_MEDIAN_USR',
 'TIME_DIFF_MIN_ASSET',
 'TIME_DIFF_MIN_USR',
 'TIME_DIFF_MAX_ASSET',
 'TIME_DIFF_MAX_USR',
 'COIN_TYPE_USR_MEAN_ASSET_RATIO',
 #'COIN_TYPE_COUNT_USR',
 #'RECEIVER_TYPE_COUNT_USR',
 #'TIME_HOUR',
 #'TIME_WEEKDAY',
 'TIME_INTERVL_USR',
 'TIME_DIFF_STD_RATIO_USR_ASSET',
 'TIME_DIFF_MEAN_RATIO_USR_ASSET',
 #'GAS_PRICE_STD_USR',
 #'GAS_PRICE_MEDIAN_USR',
 'GAS_USED_MEAN_USR',
 'GAS_USED_STD_ASSET',
 'GAS_USED_STD_USR',
 #'MIN_INDICATOR',
 'GAS_PRICE_USR_ASSET_RATIO_MEAN',
 #'GAS_USED_LIMIT_RATIO',
 'GAS_PRICE_USED_RATIO_MEAN',
 #'GAS_USED_LIMIT_RATIO_MEAN',
 #'GAS_PRICE_QUANTILE_RATIO',
 'GAS_USED_QUANTILE_RATIO',
 'GAS_LIMIT_QUANTILE_RATIO',
 'GAS_PRICE_QUANTILE_USR',
 'GAS_USED_QUANTILE_USR']


In [92]:
dfData_ = dfData_[l_sfs+['Label', 'Sender ID']]

---

In [93]:
print(dfData_.max())

AMOUNT_SUM_USR                                             2877530151.438098
AMOUNT_MEAN_ASSET                                            48995742.155142
AMOUNT_STD_USR                                              302919513.850023
AMOUNT_VAR_USR                                           91760231871134256.0
AMOUNT_MIN_ASSET                                                 19788118.17
AMOUNT_MIN_USR                                                  238171636.87
AMOUNT_MAX_USR                                                   614182955.6
TIME_DIFF_MEAN_USR                                               142108354.0
TIME_DIFF_STD_USR                                             72917978.79002
TIME_DIFF_MEDIAN_USR                                             142108354.0
TIME_DIFF_MIN_ASSET                                               18487552.0
TIME_DIFF_MIN_USR                                                142108354.0
TIME_DIFF_MAX_ASSET                                               71935375.0

## Read/process data <a name="ReadProcessData"></a>

### Read in data

In [94]:
#data = pd.read_csv(
#    'dummy_data/dummy_pipeline_output_data.csv',
#    index_col='eid'
#)

data = dfData_

In [95]:
data.shape

(1681760, 29)

Then we can split the data into features (*X*) and the target column (*y*):

In [96]:
fraud_column = 'Label'#fraud_column = 'sim_is_fraud'
X = data.drop(
    [fraud_column, 'Sender ID'], 
    axis=1
)
y = data[fraud_column]

### Process the data

#### Train/test split

Before applying any data processing steps, we should split the data into training and test sets:

In [97]:
'''
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.33, 
    random_state=0,
    groups = dfData['Sender ID']
)
'''


from sklearn.model_selection import GroupShuffleSplit 

splitter = GroupShuffleSplit(test_size=.33, random_state = 0)
split = splitter.split(X,y, groups=dfData['Sender ID'])
train_inds, test_inds = next(split)

X_train = X.iloc[train_inds] ; X_test = X.iloc[test_inds]
y_train = y.iloc[train_inds] ; y_test = y.iloc[test_inds]

#hKFoldSplt = StratifiedGroupKFold(n_splits = numKFolds, shuffle = True, random_state = randomState)
#for vTrainIdx, vTestIdx in hKFoldSplt.split(mX, vY, groups = dfData['Sender ID']):
#    mXTrain, mXTest, vYTrain, vYTest = mX.iloc[vTrainIdx], mX.iloc[vTestIdx], vY.iloc[vTrainIdx], vY.iloc[vTestIdx]
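
The group-aware split keeps every transaction of a given `Sender ID` on one side of the split, so the rules are evaluated on senders never seen during training. The idea can be sketched in plain Python as below; this is a simplified stand-in for `GroupShuffleSplit`, and the function name and toy data are assumptions.

```python
import random


def group_split(groups, test_size=0.33, seed=0):
    """Return (train_idx, test_idx) such that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Hold out a proportion of the groups (not of the rows)
    n_test = max(1, round(test_size * len(unique_groups)))
    held_out = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in held_out]
    test_idx = [i for i, g in enumerate(groups) if g in held_out]
    return train_idx, test_idx


senders = ['a', 'a', 'b', 'c', 'c', 'c', 'd', 'e', 'f']
train_idx, test_idx = group_split(senders)
train_groups = {senders[i] for i in train_idx}
test_groups = {senders[i] for i in test_idx}
print(train_groups & test_groups)  # empty set: no sender leaks across the split
```

Note that the resulting row proportions can differ from `test_size`, because groups vary in size.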


#### Process data for rule generation

When generating new rules, we first need to process the data. The main data processing steps that need to be applied before using the rule generator are:

* Remove unnecessary columns
* Impute null values
* One hot encode categorical features
* Feature selection ***(in this case, the feature set is small, so this step is omitted from the example)***

##### Remove unnecessary columns

We need to remove columns that are not useful or do not make sense in our rules. In this case, that means any feature whose name contains 'sim' or 'eid', plus any high-cardinality columns. Note, however, that your own dataset may contain additional columns that need to be removed:

In [98]:
sim_cols = X_train.filter(regex='sim_').columns.tolist()
eid_cols = X_train.filter(regex='eid').columns.tolist()
high_card_cols = X_train.select_dtypes(include='object').columns[(X_train.select_dtypes(include='object').nunique() > 50)].tolist()

In [99]:
X_train = X_train.drop(sim_cols + eid_cols + high_card_cols, axis=1)
X_test = X_test.drop(sim_cols + eid_cols + high_card_cols, axis=1)

In [100]:
X_train.shape, X_test.shape

((1153340, 27), (528420, 27))

##### Impute null values

We can now impute the null values. You can use any imputation method you like - here we'll impute using the following methodology:

* Impute numeric values with -1.
* Impute categorical features with the category 'missing'.
* Impute boolean features with 'missing'.

In [101]:
print("Number of null values in X_train:", X_train.isna().sum().sum())

Number of null values in X_train: 3049587


In [102]:
num_cols = X_train.select_dtypes(include=np.number).columns.tolist()
cat_cols = X_train.select_dtypes(include=object).columns.tolist()
bool_cols = X_train.select_dtypes(include=bool).columns.tolist()

In [103]:
X_train[bool_cols] = X_train[bool_cols].astype(object)
X_test[bool_cols] = X_test[bool_cols].astype(object)

In [104]:
X_train.loc[:, num_cols] = X_train.loc[:, num_cols].fillna(-1)
X_train.loc[:, cat_cols] = X_train.loc[:, cat_cols].fillna('missing')
X_train.loc[:, bool_cols] = X_train.loc[:, bool_cols].fillna('missing')
X_test.loc[:, num_cols] = X_test.loc[:, num_cols].fillna(-1)
X_test.loc[:, cat_cols] = X_test.loc[:, cat_cols].fillna('missing')
X_test.loc[:, bool_cols] = X_test.loc[:, bool_cols].fillna('missing')

In [105]:
print("Number of null values in X_train:", X_train.isna().sum().sum())

Number of null values in X_train: 0


##### One hot encode categorical features

Now we can one hot encode the categorical features:

In [106]:
ohe = OneHotEncoder(use_cat_names=True)

In [107]:
ohe.fit(X_train)
X_train = ohe.transform(X_train)
X_test = ohe.transform(X_test)



In [108]:
X_train.shape, X_test.shape

((1153340, 27), (528420, 27))
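
To make the transformation concrete, the following plain-Python sketch shows what one hot encoding with category names does to a single categorical column. It is an illustration only, not the `category_encoders` implementation; the `<column>_<category>` naming is meant to mirror what `use_cat_names=True` produces, and the sample values are made up.

```python
def one_hot(values, column):
    """Encode a list of categories as a dict of 0/1 indicator columns."""
    categories = sorted(set(values))
    return {
        f'{column}_{cat}': [int(v == cat) for v in values]
        for cat in categories
    }


encoded = one_hot(['EOA', 'Contract', 'missing', 'EOA'], 'Receiver Type')
for name, col in encoded.items():
    print(name, col)
```

In this notebook the selected features are all numeric, so the encoder has nothing to expand, which is consistent with the unchanged shapes above.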

---

## Rule Generation <a name="RuleGeneration"></a>

Now that we've processed our raw data, we can use this to generate rules. There are two rule generator algorithms in Iguanas:

* `RuleGeneratorDT`: Generate rules by extracting the highest performing branches from a tree ensemble model.
* `RuleGeneratorOpt`: Generate rules by optimising the thresholds of single features and combining these one condition rules with AND conditions to create more complex rules.

**In this example, we'll only use the** `RuleGeneratorDT`**, but you can use the** `RuleGeneratorOpt` **instead or in addition.**
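
To illustrate the idea behind `RuleGeneratorDT`: each root-to-leaf branch of a decision tree is a conjunction of threshold conditions, which maps naturally onto a rule string. The sketch below builds such a string from a hand-written branch; the helper, the thresholds and the exact string layout are assumptions for illustration and do not reproduce the library's internals.

```python
def branch_to_rule(branch):
    """Join (feature, operator, threshold) conditions with AND into one rule string."""
    conditions = [f"(X['{feat}']{op}{thr})" for feat, op, thr in branch]
    return '&'.join(conditions)


# Hypothetical branch with two conditions
branch = [
    ('AMOUNT_MAX_USR', '>', 50000.0),
    ('TIME_DIFF_MIN_USR', '<=', 30.0),
]
rule = branch_to_rule(branch)
print(rule)
```

The `n_total_conditions` parameter used in the next section limits how many such conditions a single generated rule may contain.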

### Set up class parameters

**Please see the class docstring for more information on each parameter**

In [109]:
p = Precision()
f1 = FScore(beta=1)

In [110]:
params = {
    'metric': f1.fit,
    'n_total_conditions': 4,   
    'tree_ensemble': RandomForestClassifier(n_estimators=10, random_state=0),
    'target_feat_corr_types': 'Infer',
    'num_cores': 4,
    'verbose': 1
}

### Instantiate class and run fit method

Once the parameters have been set, we can run the `fit` method to generate the rules:

In [111]:
rg = RuleGeneratorDT(**params)

In [112]:
X_rules_gen_train = rg.fit(
    X=X_train, 
    y=y_train
)

--- Calculating correlation of features with respect to the target ---
--- Returning column datatypes ---
--- Training tree ensemble ---
--- Extracting rules from tree ensemble ---
100%|██████████| 10/10 [00:08<00:00,  1.15it/s]


### Outputs

The `fit` method returns the binary columns of the generated rules. See the `Attributes` section in the class docstring for a description of each attribute generated:

In [113]:
X_rules_gen_train.head()

Unnamed: 0,RGDT_Rule_20221010_0,RGDT_Rule_20221010_1,RGDT_Rule_20221010_2,RGDT_Rule_20221010_3,RGDT_Rule_20221010_4,RGDT_Rule_20221010_5,RGDT_Rule_20221010_6,RGDT_Rule_20221010_7,RGDT_Rule_20221010_8,RGDT_Rule_20221010_9,...,RGDT_Rule_20221010_74,RGDT_Rule_20221010_75,RGDT_Rule_20221010_76,RGDT_Rule_20221010_77,RGDT_Rule_20221010_78,RGDT_Rule_20221010_79,RGDT_Rule_20221010_80,RGDT_Rule_20221010_81,RGDT_Rule_20221010_82,RGDT_Rule_20221010_83
905528,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,0,0,0,1,0,0
905527,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,1,1,1,1,1,1
905526,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,0,0,0,1,0,0
905525,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,0,0,0,1,0,0
905524,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,0,0,0,1,0,0


----

## Rule Optimisation <a name="RuleOptimisation"></a>

Now we can optimise the existing rules.

First, we'll read in the existing rules, which have been stored in the standard Iguanas string format:

In [115]:
with open('rule_strings.pkl', 'rb') as f:
    rule_strings = pickle.load(f)

We can then instantiate the `Rules` class with these rules, so that we can convert them into the standard Iguanas lambda expression format:

In [116]:
existing_rules = Rules(rule_strings=rule_strings)

To convert the rules, we use the `as_rule_lambdas` method from the instantiated `Rules` class:

In [117]:
existing_rule_lambdas = existing_rules.as_rule_lambdas(as_numpy=False, with_kwargs=True)

The standard Iguanas lambda expression format allows new values to be injected into the condition string of a rule. This means that the rule's performance can be evaluated with new values (this capability is leveraged in the rule optimisers).
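
As an illustration of why this format is useful, a rule can be represented as a function that formats threshold values into a condition string, which is then evaluated against the data. The sketch below evaluates the string row by row against plain dicts instead of a DataFrame; the rule, the thresholds and the names `rule_lambda` and `apply_rule` are hypothetical and are not Iguanas APIs.

```python
# A rule as a lambda: keyword arguments are injected into the condition string
rule_lambda = lambda **kwargs: "(X['AMOUNT_MAX_USR']>{AMOUNT_MAX_USR})".format(**kwargs)


def apply_rule(rule_string, rows):
    """Evaluate the condition string row by row; each row plays the role of X."""
    return [int(eval(rule_string, {'__builtins__': {}}, {'X': row})) for row in rows]


rows = [{'AMOUNT_MAX_USR': 1}, {'AMOUNT_MAX_USR': 4}, {'AMOUNT_MAX_USR': 9}]

original = rule_lambda(AMOUNT_MAX_USR=3)
candidate = rule_lambda(AMOUNT_MAX_USR=8)  # a new threshold proposed by an optimiser
print(original, apply_rule(original, rows))
print(candidate, apply_rule(candidate, rows))
```

An optimiser can repeat this inject-and-evaluate loop for many candidate thresholds and keep the one with the best metric value.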

We can now use the `BayesianOptimiser` class to optimise the thresholds of these rules.

### Set up class parameters

**Please see the class docstring for more information on each parameter**

In [118]:
params = {
    'rule_lambdas': existing_rule_lambdas,
    'lambda_kwargs': existing_rules.lambda_kwargs,
    'metric': f1.fit,
    'n_iter': 10,
    'num_cores': 4,
    'verbose': 1
}

### Instantiate class and run fit method

Once the parameters have been set, we can run the `fit` method to optimise the thresholds of the rules.

**Note:** we use the raw, unprocessed data here, as productionised rules will usually run on raw data:

In [119]:
ro = BayesianOptimiser(**params)

In [120]:
X_rules_opt_train = ro.fit(
    X=X.loc[X_train.index], 
    y=y_train
)

--- Checking for rules with features that are missing in `X` ---
100%|██████████| 23/23 [00:00<?, ?it/s]

--- Checking for rules that exclusively contain non-optimisable conditions ---
0it [00:00, ?it/s]

--- Checking for rules that exclusively contain zero-variance features ---
0it [00:00, ?it/s]

RulesNotOptimisedError: There are no optimisable rules in the set

### Outputs

The `fit` method returns the binary columns of the optimised rules. See the `Attributes` section in the class docstring for a description of each attribute generated.

**Note the following cases where the rule optimiser will be unable to run:**

* Rules that contain features that are missing in `X`.
* Rules that contain no optimisable features (e.g. all of the conditions are string-based).
* Rules that contain exclusively zero-variance features.
* Rules that contain a feature that is completely null in `X`.

In [None]:
X_rules_opt_train.head()

---

## Combine rules and remove those which are unnecessary <a name="CombineRules"></a>

We now have two sets of rules:

1. Newly generated rules
2. Optimised existing rules

We can combine these rule sets, then apply correlation reduction and filtering methods to remove those which are unnecessary:

In [None]:
# Combine the binary columns of each rule set
X_rules_train = pd.concat([
    X_rules_gen_train, 
    X_rules_opt_train
], axis=1)

In [None]:
X_rules_train.head(2)

In [None]:
X_rules_train.shape

### Standard filter

We can use the `SimpleFilter` class from the `rule_selection` module to filter out rules whose performance is below a desired threshold. In this example, we'll filter out rules with an F1 score below 0.01:

In [None]:
fr = SimpleFilter(
    threshold=0.01,
    operator='>=',
    metric=f1.fit,
)

Now we can run the `fit_transform` method to remove the rules which do not meet the filter requirements:

In [None]:
X_rules_train = fr.fit_transform(
    X_rules=X_rules_train, 
    y=y_train
)

#### Outputs

The `fit_transform` method returns a dataframe containing the filtered rule binary columns. See the `Attributes` section in the class docstring for a description of each attribute generated.

In [None]:
X_rules_train.shape
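The idea behind this kind of threshold filter can be sketched in plain Python. The `f1_score` and `simple_filter` helpers, the rule names and the toy labels below are hypothetical and only illustrate the logic; they are not the `SimpleFilter` implementation.

```python
def f1_score(y_pred, y_true):
    """F1 score of binary predictions: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(y_pred, y_true) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def simple_filter(rule_flags, y_true, threshold):
    """Keep the names of rules whose F1 score is >= threshold (hypothetical helper)."""
    return [
        name for name, flags in rule_flags.items()
        if f1_score(flags, y_true) >= threshold
    ]


# Toy labels and rule binary columns (illustrative only)
y = [1, 1, 0, 0, 0]
rule_flags = {
    'RuleA': [1, 1, 0, 0, 0],  # catches both frauds, no false positives
    'RuleB': [0, 0, 1, 1, 0],  # only flags good transactions
    'RuleC': [1, 0, 1, 0, 0],  # one true positive, one false positive
}

kept = simple_filter(rule_flags, y, threshold=0.01)
print(kept)
```

Here `RuleB` is dropped because it never flags a fraudulent transaction, so its F1 score is 0.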

### Remove correlated rules

We can use the `CorrelatedFilter` class from the `rule_selection` module along with a correlation reduction class to remove correlated rules - see the `correlation_reduction` module for more information on these classes. 

In this example, we'll be using the `AgglomerativeClusteringReducer` class from that module. To instantiate this class, we also need to define a similarity function - see the `metrics.pairwise` module for more information. In this example, we'll use the Jaccard similarity:

In [None]:
js = JaccardSimilarity()
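For reference, the Jaccard similarity of two binary rule columns is the number of rows flagged by both rules divided by the number of rows flagged by at least one of them. The sketch below computes it in plain Python on toy columns; the `jaccard_similarity` helper and the data are illustrative assumptions, not the `JaccardSimilarity` implementation.

```python
def jaccard_similarity(a, b):
    """Intersection over union of the rows flagged by two binary rule columns."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0


# Toy rule binary columns (illustrative only)
rule_a = [1, 1, 0, 1, 0]
rule_b = [1, 1, 0, 0, 0]  # overlaps heavily with rule_a
rule_c = [0, 0, 1, 0, 1]  # flags completely different rows

print(jaccard_similarity(rule_a, rule_b))
print(jaccard_similarity(rule_a, rule_c))
```

A value close to 1 means the two rules flag largely the same transactions, so keeping both adds little.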

In [None]:
acfr = AgglomerativeClusteringReducer(
    threshold=0.75,
    strategy='bottom_up', 
    similarity_function=js.fit, 
    metric=f1.fit
)

Now we can instantiate the `CorrelatedFilter` class, and run the `fit_transform` method to remove correlated rules:

In [None]:
fcr = CorrelatedFilter(correlation_reduction_class=acfr)

In [None]:
X_rules_train = fcr.fit_transform(
    X_rules=X_rules_train,
    y=y_train
)

#### Outputs

The `fit_transform` method returns a dataframe containing the binary columns of the uncorrelated rules. See the `Attributes` section in the class docstring for a description of each attribute generated.

In [None]:
X_rules_train.shape

### Greedy filter

We can use the `GreedyFilter` class from the `rule_selection` module to sort the rules by a given metric (e.g. precision), then iterate through the sorted rules and calculate the combined performance of the top *n* rules for each value of *n*. Here, we'll sort the rules by precision, then calculate the F1 score of the top *n* combined rules:

In [None]:
gf = GreedyFilter(
    metric=f1.fit, 
    sorting_metric=p.fit,
    verbose=1
)

In [None]:
X_rules_train = gf.fit_transform(
    X_rules=X_rules_train, 
    y=y_train
)

We can also plot the combined performance of the top *n* rules (calculated from running the `fit` method) on the training set using the `plot_top_n_performance_on_train` method:

In [None]:
gf.plot_top_n_performance_on_train()

The graph plots the F1 score of the top *n* combined rules, with the rules sorted by precision. Once adding further rules no longer improves the combined F1 score (i.e. the curve plateaus or drops), those rules are redundant, so the algorithm keeps only the top *n* rules that deliver the maximum combined performance and drops the rest.
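The greedy selection described above can be sketched as follows. This is a simplified illustration in plain Python under stated assumptions: the helper functions, rule names and toy data are hypothetical, rules are combined with a logical OR, and ties are broken by taking the smallest *n*; the actual `GreedyFilter` implementation may differ.

```python
def precision(y_pred, y_true):
    """Fraction of flagged rows that are truly positive."""
    tp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)
    flagged = sum(y_pred)
    return tp / flagged if flagged else 0.0


def f1_score(y_pred, y_true):
    """F1 score of binary predictions: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(y_pred, y_true) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def greedy_filter(rule_flags, y_true):
    """Sort rules by precision, score the top-n OR-combined rules, keep the best n."""
    ranked = sorted(
        rule_flags,
        key=lambda name: precision(rule_flags[name], y_true),
        reverse=True,
    )
    combined = [0] * len(y_true)
    scores = []
    for name in ranked:
        # A row is flagged if any of the rules added so far flags it
        combined = [max(c, f) for c, f in zip(combined, rule_flags[name])]
        scores.append(f1_score(combined, y_true))
    best_n = scores.index(max(scores)) + 1
    return ranked[:best_n], scores


# Toy labels and rule binary columns (illustrative only)
y = [1, 1, 1, 0, 0, 0]
rule_flags = {
    'RuleA': [1, 1, 0, 0, 0, 0],
    'RuleB': [0, 0, 1, 1, 0, 0],
    'RuleC': [0, 0, 0, 0, 1, 1],
}

kept, scores = greedy_filter(rule_flags, y)
print(kept, [round(s, 3) for s in scores])
```

In this toy case the combined F1 score rises when `RuleB` is added and falls when `RuleC` is added, so only the top two rules are kept.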

#### Outputs

The `fit_transform` method returns a dataframe containing the filtered rule binary columns. See the `Attributes` section in the class docstring for a description of each attribute generated.

In [None]:
X_rules_train.shape

----

## Set up the RBS Pipeline <a name="SetUpPipeline"></a>

Now, let's set up our RBS Pipeline using our combined, filtered rule set. In this case, we'll go for a simple approach:

1. If any rules trigger, reject the transaction.
2. If no rules trigger, approve any remaining transactions.

To set up the pipeline using the logic above, we first need to create the `config` parameter. This is just a list which outlines the stages of the pipeline. Each stage should be defined using a tuple of two elements: 

1. The first element should be an integer which corresponds to the decision made at that stage (either `0` or `1`).
2. The second element should be a list that dictates which rules should trigger for that decision to be made.

In our example, the config will be:

In [None]:
config = [
    (1, X_rules_train.columns.tolist())
]

Here, the first stage is configured via the tuple in the first element of the list. This says to apply a decision of `1` (i.e. reject) to transactions where any of the rules have triggered.

We also need to specify the final decision to be made if no rules are triggered - this is set via the `final_decision` parameter. In our case this should be `0`, as we want to approve any remaining transactions:

In [None]:
final_decision = 0

With these parameters configured, we can now instantiate our `RBSPipeline` class:

In [None]:
rbsp = RBSPipeline(
    config=config,
    final_decision=final_decision
)
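To make the decision logic concrete, here is a minimal plain-Python sketch of how a `config` and `final_decision` of this shape could be applied. The `apply_pipeline` helper, the rule names and the toy data are hypothetical, and the sketch assumes that stages are applied in order with earlier stages taking precedence; it is not the `RBSPipeline` implementation.

```python
def apply_pipeline(config, final_decision, rule_flags, n_rows):
    """Apply each stage in order; rows that no stage decides get final_decision."""
    decisions = [None] * n_rows
    for decision, rule_names in config:
        for i in range(n_rows):
            undecided = decisions[i] is None
            triggered = any(rule_flags[name][i] for name in rule_names)
            if undecided and triggered:
                decisions[i] = decision
    return [final_decision if d is None else d for d in decisions]


# Toy rule binary columns for four transactions (illustrative only)
rule_flags = {
    'RuleA': [1, 0, 0, 0],
    'RuleB': [0, 1, 0, 0],
}

# Single stage: reject (1) if any rule triggers, otherwise approve (0)
config = [(1, ['RuleA', 'RuleB'])]
preds = apply_pipeline(config, final_decision=0, rule_flags=rule_flags, n_rows=4)
print(preds)

# Two stages: approve if RuleB triggers, else reject if any rule triggers
two_stage_config = [(0, ['RuleB']), (1, ['RuleA', 'RuleB'])]
two_stage_preds = apply_pipeline(two_stage_config, 0, rule_flags, 4)
print(two_stage_preds)
```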

---

## Optimise the RBS Pipeline <a name="OptimiseThePipeline"></a>

Now that we have our RBS Pipeline set up, we can optimise it using the RBS Optimiser. Here, we just pass the instantiated pipeline class to the `pipeline` parameter:

In [None]:
rbso = RBSOptimiser(
    pipeline=rbsp, 
    metric=f1.fit,
    n_iter=60, 
    verbose=1
)

Then we run the `fit_predict` method to optimise the pipeline using the given dataset, then apply it to the dataset:

In [None]:
pipe_pred_train = rbso.fit_predict(
    X_rules=X_rules_train, 
    y=y_train
)

### Outputs

The `fit_predict` method optimises the pipeline and returns the prediction of the optimised pipeline by applying it to the given dataset. See the `Attributes` section in the class docstring for a description of each attribute generated.

In [None]:
rbso.config

We can also use scikit-learn's *classification_report* and *confusion_matrix* functions to generate some performance metrics for the pipeline:

In [None]:
print(
    classification_report(
        y_true=y_train, 
        y_pred=pipe_pred_train, 
        digits=4
    )
)

In [None]:
sns.set_style('white')

In [None]:
cm = ConfusionMatrixDisplay(
    confusion_matrix(
        y_true=y_train, 
        y_pred=pipe_pred_train
    )
)
cm.plot()

----

## Filter rules for the optimised RBS Pipeline <a name="FilterRulesForPipeline"></a>

Now that we know which rules we need for our final, optimised RBS Pipeline, we can filter our original generated and optimised rule sets to include only those rules which are required.

In [None]:
rbs_rule_names_gen = [rule for rule in rbso.rules_to_keep if rule in rg.rule_names]
rbs_rule_names_opt = [rule for rule in rbso.rules_to_keep if rule in ro.rule_names]

Finally, we filter the original generated and optimised rule sets:

In [None]:
rg.filter_rules(include=rbs_rule_names_gen)
ro.filter_rules(include=rbs_rule_names_opt)

----

## Apply the optimised RBS Pipeline to the test set <a name="ApplyPipeline"></a>

To apply our optimised RBS Pipeline to the test set, we first need to apply our filtered generated and optimised rules to the test set:

In [None]:
# Generated rules
X_rules_gen_test = rg.transform(X=X_test)
# Optimised rules (note we're using the raw, unprocessed data here)
X_rules_opt_test = ro.transform(X=X.loc[X_test.index])

Now we can combine these binary columns into one set:

In [None]:
X_rules_test = pd.concat([
    X_rules_gen_test, 
    X_rules_opt_test
], axis=1)

Then, using these binary columns, apply our optimised RBS Pipeline to the test set, using the `predict` method:

In [None]:
opt_pipe_pred_test = rbso.predict(X_rules=X_rules_test)

### Outputs

The `predict` method returns the prediction of the optimised pipeline by applying it to the given dataset.

We can also use scikit-learn's *classification_report* and *confusion_matrix* functions to generate some performance metrics for the pipeline:

In [None]:
print(
    classification_report(
        y_true=y_test, 
        y_pred=opt_pipe_pred_test, 
        digits=4
    )
)

In [None]:
cm = ConfusionMatrixDisplay(
    confusion_matrix(
        y_true=y_test, 
        y_pred=opt_pipe_pred_test
    )
)
cm.plot()

In [None]:
rbso.config

### Compare to initial RBS pipeline performance

If we assume that, for our initial RBS pipeline:

* Only the **original, existing rules** were used (since these are likely to be the rules that are currently productionised).
* The RBS pipeline was set up in a similar way to our optimised RBS Pipeline (i.e. if any rules trigger, reject the transaction; else, approve the transaction).

then we can calculate the performance of the initial RBS Pipeline and compare it to our optimised pipeline.

To set up the initial RBS Pipeline, we follow a similar process as before. First, we need the names of the **original, existing rules** that are used to reject transactions:

In [None]:
existing_rule_names = list(existing_rules.rule_strings.keys())

Then we can create our `config` using these rule names:

In [None]:
config = [
    (1, existing_rule_names)
]

Now, we need to apply the original, existing rules to the test set:

In [None]:
X_rules_existing_test = existing_rules.transform(X=X.loc[X_test.index])

Then instantiate the `RBSPipeline` class using the `config` we created above, keeping the other parameters the same:

In [None]:
rbsp_initial = RBSPipeline(
    config=config,
    final_decision=0
)

Now we can apply the initial RBS pipeline to the test set:

In [None]:
init_pipe_pred_test = rbsp_initial.predict(X_rules=X_rules_existing_test)

#### Performance comparison

Finally, we can compare the performance of the initial RBS Pipeline and the optimised RBS Pipeline:

In [None]:
# F1 Score
init_pipe_f1 = f1.fit(init_pipe_pred_test, y_test)
opt_pipe_f1 = f1.fit(opt_pipe_pred_test, y_test)
print(f'The F1 score of the initial RBS Pipeline is: {round(init_pipe_f1, 3)}')
print(f'The F1 score of the optimised RBS Pipeline is: {round(opt_pipe_f1, 3)}')
print(f'% improvement in F1 score is: {round(100*(opt_pipe_f1-init_pipe_f1)/init_pipe_f1)}%')

In [None]:
# Get confusion matrices for initial and optimised pipelines
initial_conf_matrix = confusion_matrix(y_true=y_test, y_pred=init_pipe_pred_test)
opt_conf_matrix = confusion_matrix(y_true=y_test, y_pred=opt_pipe_pred_test)

In [None]:
conf_matrix_diff = opt_conf_matrix - initial_conf_matrix

In [None]:
print(f'Absolute change in true positives: {conf_matrix_diff[1, 1]}')
print(f'Absolute change in false positives: {conf_matrix_diff[0, 1]}')
print(f'Absolute change in true negatives: {conf_matrix_diff[0, 0]}')
print(f'Absolute change in false negatives: {conf_matrix_diff[1, 0]}')
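The indexing above relies on the layout of a binary confusion matrix in scikit-learn, where rows are the true labels and columns are the predicted labels, i.e. `[[TN, FP], [FN, TP]]`. The plain-Python sketch below reproduces that layout on toy data to show how the differences are read; the `confusion_matrix_2x2` helper and the predictions are illustrative assumptions, not results from this notebook.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Binary confusion matrix with rows = true label, columns = predicted label."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Toy labels and predictions (illustrative only)
y_true = [1, 1, 1, 0, 0, 0]
init_pred = [1, 0, 0, 1, 0, 0]
opt_pred = [1, 1, 0, 0, 0, 0]

init_cm = confusion_matrix_2x2(y_true, init_pred)
opt_cm = confusion_matrix_2x2(y_true, opt_pred)

# Element-wise difference: optimised minus initial
diff = [
    [opt_cm[r][c] - init_cm[r][c] for c in range(2)]
    for r in range(2)
]

print(f'Change in true positives: {diff[1][1]}')
print(f'Change in false positives: {diff[0][1]}')
print(f'Change in true negatives: {diff[0][0]}')
print(f'Change in false negatives: {diff[1][0]}')
```

A positive change in true positives or true negatives is an improvement, whereas a positive change in false positives or false negatives is a deterioration.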

---

## Convert generated rule conditions to system-ready <a name="ConvertGenToSys"></a>

Now that we have our final rule set and our optimised RBS Pipeline, we can convert the conditions of the generated rules to work on raw, unprocessed data (which is usually the type of data seen in a production system) - this involves the following:

* Adding a null condition if the generated condition covered imputed null values.
* Converting generated conditions with One Hot Encoded features into conditions that flag that specific category.

For example:

- If a numeric rule condition initially had a threshold such that the imputed null values were included in the condition, the converted condition has an additional condition to check whether the feature is also null. 
    - E.g. If a rule initially had the logic `(X['num_items']<=1)` (which included the imputed value of 0), then the converted rule logic would be `((X['num_items']<=1)|(X['num_items'].isna()))`, with an additional condition to check for nulls.
- If a categorical rule condition checks whether the value is the imputed null category, the converted condition is such that it will explicitly check for null values. 
    - E.g. If a rule initially had the logic `(X['country_missing']==True)`, then the converted rule logic would be `(X['country'].isna())`, such that it explicitly checks for null values.
- For categorical rule conditions, the converted condition is such that it will explicitly check for the category. 
    - E.g. If a rule initially had the logic `(X['country_US']==False)`, then the converted rule logic would be `(X['country']!='US')`, such that it explicitly checks whether the 'country' column is not equal to the 'US' category.
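The first case can be checked with a small plain-Python sketch. It assumes that nulls in `num_items` were imputed with 0 (as in the example above) and uses `None` to stand in for a null value; the data and variable names are illustrative only.

```python
# Raw feature values, where None represents a null (illustrative only)
raw_num_items = [0, 1, 5, None, 2]

# Processed feature: nulls imputed with 0 (assumed imputation value)
IMPUTED_VALUE = 0
processed_num_items = [IMPUTED_VALUE if v is None else v for v in raw_num_items]

# Generated rule applied to the processed data: num_items <= 1
processed_flags = [int(v <= 1) for v in processed_num_items]

# Same condition applied naively to the raw data: nulls never satisfy it
naive_raw_flags = [int(v is not None and v <= 1) for v in raw_num_items]

# Converted rule applied to the raw data: (num_items <= 1) OR (num_items is null)
converted_raw_flags = [int(v is None or v <= 1) for v in raw_num_items]

print(processed_flags)
print(naive_raw_flags)
print(converted_raw_flags)
```

Without the additional null condition, the rule misses the row with the null value when it runs on raw data; with it, the flags match those produced on the processed data.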

To do this, we can use the `ConvertProcessedConditionsToGeneral` class from the `iguanas.rules` module. Note that we only need to apply this process to the generated rules, since those are the only rules which reference the processed data.

Before we can use this class, we need to provide the following:

* A dictionary of the value used to impute nulls for each feature in the original, unprocessed dataset.
* A dictionary of the category linked to each One Hot Encoded column.

To get these dictionaries, we can use the `ReturnMappings` class from the `iguanas.rules` module:

In [None]:
rm = ReturnMappings()

In [None]:
imputed_values_mapping = rm.return_imputed_values_mapping(
    [num_cols, -1], 
    [cat_cols, 'missing'], 
    [bool_cols, 'missing']
)

In [None]:
ohe_categories_mapping = rm.return_ohe_categories_mapping(
    pre_ohe_cols=X.columns, 
    post_ohe_cols=X_train.columns, 
    pre_ohe_dtypes=X.dtypes
)

Now that we have our mapping dictionaries for imputed values and one hot encoded values, we can convert the logic of our generated rules to make them production-ready:

In [None]:
conv_gen_rules = ConvertProcessedConditionsToGeneral(
    imputed_values=imputed_values_mapping, 
    ohe_categories=ohe_categories_mapping
)

In [None]:
conv_gen_rule_strings = conv_gen_rules.convert(
    rule_strings=rg.rule_strings, 
    X=X_train
)

### Outputs

The `convert` method returns a dictionary containing the set of rules which account for imputed/OHE variables, defined using the standard Iguanas string format (values) and their names (keys). See the `Attributes` section in the class docstring for a description of each attribute generated.

---

## Our final rule set and RBS Pipeline <a name="FinalRuleSet"></a>

We can now (finally!) create the rule set that we'll use in our optimised RBS pipeline. All we need to do is add our generated rules (that were reformatted for raw data) to our optimised rules:

In [None]:
rbs_rule_strings = {}
rbs_rule_strings.update(conv_gen_rule_strings)
rbs_rule_strings.update(ro.rule_strings)

Then we can create an instance of the `Rules` class using these rules (we can use this class to change between representations of the rules, if required):

In [None]:
rbs_rules = Rules(rule_strings=rbs_rule_strings)

**Our final rules (in the standard Iguanas string format):**

In [None]:
rbs_rules.rule_strings

**Our optimised RBS Pipeline configuration:**

In [None]:
rbso.config