# Optimisation Functions Example

This notebook shows how to apply optimisation functions to a dataset, how to use them in other Iguanas modules, and how to create your own.

## Requirements

To run, you'll need the following:

* A dataset containing binary predictor columns and a binary target column.

----

## Import packages

In [2]:
from iguanas.rule_optimisation.optimisation_functions import (
    Precision, Recall, FScore, Revenue, AlertsPerDay, PercVolume
)

import pandas as pd
import numpy as np
import databricks.koalas as ks
from typing import Union

## Create data

Let's create some dummy predictor columns and a binary target column. For this example, let's assume the dummy predictor columns represent rules that have been applied to a dataset.

In [3]:
np.random.seed(0)

y_pred = pd.Series(np.random.randint(0, 2, 1000), name='A')
y_preds = pd.DataFrame(np.random.randint(0, 2, (1000, 5)), columns=list('ABCDE'))
y = pd.Series(np.random.randint(0, 2, 1000), name='label')
amounts = pd.Series(np.random.randint(0, 1000, 1000), name='amounts')

y_pred_ks = ks.from_pandas(y_pred)
y_preds_ks = ks.from_pandas(y_preds)
y_ks = ks.from_pandas(y)
amounts_ks = ks.from_pandas(amounts)


----

## Apply optimisation functions

This example applies four supervised optimisation functions and two unsupervised optimisation functions:

**Supervised optimisation functions**

* Precision score
* Recall score
* Fbeta score
* Revenue
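
For reference, the first three metrics follow the standard confusion-matrix definitions. The sketch below is a plain-Python illustration of those definitions on toy data, not the Iguanas implementation; the helper names (`confusion_counts`, `precision_from_counts`, `recall_from_counts`, `fbeta_from_scores`) and the toy lists are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn) for two equal-length sequences of 0/1 values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def precision_from_counts(tp, fp):
    # Share of flagged records that are truly positive
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall_from_counts(tp, fn):
    # Share of truly positive records that are flagged
    return tp / (tp + fn) if (tp + fn) else 0.0


def fbeta_from_scores(p, r, beta=1.0):
    # Weighted harmonic mean of precision and recall
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


# Toy data (hypothetical): 4 positives, 4 records flagged, 3 of them correct
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 1, 0, 0, 1, 0]

tp, fp, fn = confusion_counts(y_true_toy, y_pred_toy)
p = precision_from_counts(tp, fp)
r = recall_from_counts(tp, fn)
print(tp, fp, fn)                        # 3 1 1
print(p, r, fbeta_from_scores(p, r, 1))  # 0.75 0.75 0.75
```

Setting `beta` above 1 weights recall more heavily; setting it below 1 weights precision more heavily.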

**Unsupervised optimisation functions**

* Alerts per day (calculates the negative squared difference between the daily number of records a rule flags and the targeted daily number of records flagged)
* Percentage of volume (calculates the negative squared difference between the percentage of the overall volume that the rule flags and the targeted percentage of volume flagged)
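
The two unsupervised metrics can be sketched directly from the descriptions above. This is a minimal illustration rather than the Iguanas source; the function names and argument names are hypothetical.

```python
def alerts_per_day_score(n_flagged, n_days, target_per_day):
    """Negative squared gap between observed and targeted alerts per day."""
    return -((n_flagged / n_days - target_per_day) ** 2)


def perc_volume_score(n_flagged, n_records, target_perc):
    """Negative squared gap between observed and targeted share of volume."""
    return -((n_flagged / n_records - target_perc) ** 2)


# A rule flagging 60 of 1,000 records over 10 days,
# with targets of 5 alerts per day and 2% of volume
print(alerts_per_day_score(60, 10, 5))    # -1.0
print(perc_volume_score(60, 1000, 0.02))  # roughly -0.0016
```

Both scores are at most 0 and reach 0 only when the rule hits its target exactly, so maximising them pulls a rule towards the target from either side. As a consistency check, a rule flagging 504 of 1,000 records over 10 days gives roughly -2061.16 and -0.234256 under these formulas, which agrees with the single-predictor values in the Outputs section.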

**Note that the *FScore*, *Precision* and *Recall* classes are ~100 times faster on larger datasets than the equivalent functions in Sklearn's *metrics* module. They also work with Koalas DataFrames, whereas the Sklearn functions do not.**

### Instantiate class and run fit method

We can run the *.fit()* method to calculate the optimisation metric for each column in the dataset.

#### Supervised optimisation functions

##### Precision score

In [4]:
precision = Precision()
# With Pandas (single predictor)
rule_precision = precision.fit(y_true=y, y_preds=y_pred, sample_weight=None)
# With Koalas (single predictor)
rule_precision_ks = precision.fit(y_true=y_ks, y_preds=y_pred_ks, sample_weight=None)
# With Pandas (multiple predictors)
rule_precisions = precision.fit(y_true=y, y_preds=y_preds, sample_weight=None)
# With Koalas (multiple predictors)
rule_precisions_ks = precision.fit(y_true=y_ks, y_preds=y_preds_ks, sample_weight=None)

##### Recall score

In [5]:
recall = Recall()
# With Pandas (single predictor)
rule_recall = recall.fit(y_true=y, y_preds=y_pred, sample_weight=None)
# With Koalas (single predictor)
rule_recall_ks = recall.fit(y_true=y_ks, y_preds=y_pred_ks, sample_weight=None)
# With Pandas (multiple predictors)
rule_recalls = recall.fit(y_true=y, y_preds=y_preds, sample_weight=None)
# With Koalas (multiple predictors)
rule_recalls_ks = recall.fit(y_true=y_ks, y_preds=y_preds_ks, sample_weight=None)

##### Fbeta score (beta=1)

In [6]:
f1 = FScore(beta=1)
# With Pandas (single predictor)
rule_f1 = f1.fit(y_true=y, y_preds=y_pred, sample_weight=None)
# With Koalas (single predictor)
rule_f1_ks = f1.fit(y_true=y_ks, y_preds=y_pred_ks, sample_weight=None)
# With Pandas (multiple predictors)
rule_f1s = f1.fit(y_true=y, y_preds=y_preds, sample_weight=None)
# With Koalas (multiple predictors)
rule_f1s_ks = f1.fit(y_true=y_ks, y_preds=y_preds_ks, sample_weight=None)

##### Revenue

In [7]:
rev = Revenue(y_type='Fraud', chargeback_multiplier=2)
# With Pandas (single predictor)
rule_rev = rev.fit(y_true=y, y_preds=y_pred, sample_weight=amounts)
# With Koalas (single predictor)
rule_rev_ks = rev.fit(y_true=y_ks, y_preds=y_pred_ks, sample_weight=amounts_ks)
# With Pandas (multiple predictors)
rule_revs = rev.fit(y_true=y, y_preds=y_preds, sample_weight=amounts)
# With Koalas (multiple predictors)
rule_revs_ks = rev.fit(y_true=y_ks, y_preds=y_preds_ks, sample_weight=amounts_ks)



#### Unsupervised optimisation functions

##### Alerts per day

In [8]:
apd = AlertsPerDay(n_alerts_expected_per_day=5, no_of_days_in_file=10)
# With Pandas (single predictor)
rule_apd = apd.fit(y_preds=y_pred)
# With Koalas (single predictor)
rule_apd_ks = apd.fit(y_preds=y_pred_ks)
# With Pandas (multiple predictors)
rule_apds = apd.fit(y_preds=y_preds)
# With Koalas (multiple predictors)
rule_apds_ks = apd.fit(y_preds=y_preds_ks)

##### Percentage of volume

In [9]:
pv = PercVolume(perc_vol_expected=0.02)
# With Pandas (single predictor)
rule_pv = pv.fit(y_preds=y_pred)
# With Koalas (single predictor)
rule_pv_ks = pv.fit(y_preds=y_pred_ks)
# With Pandas (multiple predictors)
rule_pvs = pv.fit(y_preds=y_preds)
# With Koalas (multiple predictors)
rule_pvs_ks = pv.fit(y_preds=y_preds_ks)

### Outputs

The *.fit()* method returns the optimisation metric defined by the class:

In [10]:
rule_precision, rule_precision_ks, rule_precisions, rule_precisions_ks

(0.48214285714285715,
 0.48214285714285715,
 array([0.4875717 , 0.47109208, 0.47645951, 0.48850575, 0.4251497 ]),
 array([0.4875717 , 0.47109208, 0.47645951, 0.48850575, 0.4251497 ]))

In [11]:
rule_recall, rule_recall_ks, rule_recalls, rule_recalls_ks

(0.5051975051975052,
 0.5051975051975052,
 array([0.53014553, 0.45738046, 0.52598753, 0.53014553, 0.44282744]),
 array([0.53014553, 0.45738046, 0.52598753, 0.53014553, 0.44282744]))

In [12]:
rule_f1, rule_f1_ks, rule_f1s, rule_f1s_ks

(0.4934010152284264,
 0.4934010152284264,
 array([0.50796813, 0.46413502, 0.5       , 0.50847458, 0.43380855]),
 array([0.50796813, 0.46413502, 0.5       , 0.50847458, 0.43380855]))

In [13]:
rule_rev, rule_rev_ks, rule_revs, rule_revs_ks

(1991,
 1991,
 array([ 15119, -14481,  11721,  25063, -74931]),
 array([ 15119, -14481,  11721,  25063, -74931]))

In [14]:
rule_apd, rule_apd_ks, rule_apds, rule_apds_ks

(-2061.16,
 -2061.16,
 array([-2237.29, -1738.89, -2313.61, -2227.84, -2034.01]),
 array([-2237.29, -1738.89, -2313.61, -2227.84, -2034.01]))

In [15]:
rule_pv, rule_pv_ks, rule_pvs, rule_pvs_ks

(-0.234256,
 -0.234256,
 array([-0.253009, -0.199809, -0.261121, -0.252004, -0.231361]),
 array([-0.253009, -0.199809, -0.261121, -0.252004, -0.231361]))

The *.fit()* method can be passed as an argument to various Iguanas modules (wherever the `opt_func` parameter appears). For example, in the RuleGeneratorOpt module, passing it to `opt_func` sets the metric that the rules are optimised for.
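
The pattern here is that the bound *.fit()* method itself is passed as a callable, not the class instance and not the result of calling it. The sketch below shows that pattern in isolation; `ToyPrecision` and `pick_best_rule` are hypothetical stand-ins written for this illustration and are not part of Iguanas.

```python
class ToyPrecision:
    """Hypothetical metric with a fit(y_true, y_preds, sample_weight) method."""

    def fit(self, y_true, y_preds, sample_weight=None):
        flagged = sum(y_preds)
        hits = sum(t * p for t, p in zip(y_true, y_preds))
        return hits / flagged if flagged else 0.0


def pick_best_rule(rules, y_true, opt_func):
    """Hypothetical consumer: score every rule with opt_func, keep the best."""
    scores = {name: opt_func(y_true, preds, None) for name, preds in rules.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores


y_true_toy = [1, 0, 1, 0, 1, 0]
rules_toy = {
    "A": [1, 1, 1, 0, 0, 0],  # 2 of 3 flagged records are positive
    "B": [1, 0, 1, 0, 1, 1],  # 3 of 4 flagged records are positive
}

# Pass the bound method, without calling it
best_name, scores = pick_best_rule(rules_toy, y_true_toy, opt_func=ToyPrecision().fit)
print(best_name)  # B
```

Because the consumer only needs a callable with the right arguments, any object exposing a compatible *.fit()* method can be swapped in without changing the consumer.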

----

## Creating your own optimisation function

Say we want to create a class that calculates the positive likelihood ratio (true positive rate / false positive rate).

The class needs a *.fit()* method that takes three arguments: the binary target, the binary predictor(s) and any event-specific weights to apply. The method should return a single numeric value for a single predictor, or an array containing one value per column when several predictors are passed.

In [16]:
class PositiveLikelihoodRatio:

    def fit(self,
            y_true: Union[np.ndarray, pd.Series, ks.Series],
            y_preds: Union[np.ndarray, pd.Series, ks.Series, pd.DataFrame, ks.DataFrame],
            sample_weight: Union[np.ndarray, pd.Series, ks.Series]) -> Union[float, np.ndarray]:
        # sample_weight is accepted to match the expected signature but is not used here

        def _calc_plr(y_true, y_preds):
            # True positive rate: share of positive records that the rule flags
            tpr = (y_true * y_preds).sum() / y_true.sum()
            # False positive rate: share of negative records that the rule flags
            fpr = ((1 - y_true) * y_preds).sum() / (1 - y_true).sum()
            # Return 0 if either rate is 0 (this also avoids dividing by a zero FPR)
            return 0 if tpr == 0 or fpr == 0 else tpr / fpr

        # Set this option to allow calc of TPR/FPR on Koalas dataframes
        with ks.option_context("compute.ops_on_diff_frames", True):
            if y_preds.ndim == 1:
                # Single predictor: return one value
                return _calc_plr(y_true, y_preds)
            else:
                # Multiple predictors: return one value per column
                plrs = np.empty(y_preds.shape[1])
                for i, col in enumerate(y_preds.columns):
                    plrs[i] = _calc_plr(y_true, y_preds[col])
                return plrs

We can then apply the *.fit()* method to the dataset to check it works:

In [17]:
plr = PositiveLikelihoodRatio()
# With Pandas (single predictor)
rule_plr = plr.fit(y_true=y, y_preds=y_pred, sample_weight=None)
# With Koalas (single predictor)
rule_plr_ks = plr.fit(y_true=y_ks, y_preds=y_pred_ks, sample_weight=None)
# With Pandas (multiple predictors)
rule_plrs = plr.fit(y_true=y, y_preds=y_preds, sample_weight=None)
# With Koalas (multiple predictors)
rule_plrs_ks = plr.fit(y_true=y_ks, y_preds=y_preds_ks, sample_weight=None)

In [18]:
rule_plr, rule_plr_ks, rule_plrs, rule_plrs_ks

(1.004588142519177,
 1.004588142519177,
 array([1.02666243, 0.96105448, 0.98196952, 1.0305076 , 0.79801195]),
 array([1.02666243, 0.96105448, 0.98196952, 1.0305076 , 0.79801195]))
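
The single-predictor value can be cross-checked by hand. The counts below are not printed anywhere in this notebook; they are inferred from the single-predictor precision and recall outputs shown earlier (243/504 and 243/481) together with the 1,000-row dataset, so treat them as an assumption.

```python
import math

n_records = 1000
tp = 243          # flagged and positive (inferred)
n_flagged = 504   # records flagged by the rule (inferred)
n_positive = 481  # records where the target is 1 (inferred)

fp = n_flagged - tp                  # flagged but negative
n_negative = n_records - n_positive  # records where the target is 0

tpr = tp / n_positive
fpr = fp / n_negative
plr_check = tpr / fpr

print(round(plr_check, 6))  # 1.004588
```

A ratio this close to 1 means the rule flags positives and negatives at almost the same rate, which is what we would expect from randomly generated predictors.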

Finally, after instantiating the class, we can pass its *.fit()* method to a relevant Iguanas module. For example, passing it to the *opt_func* parameter of the *BayesianOptimiser* class generates rules that maximise the positive likelihood ratio.

----