# Classification Metrics Spark Example

Classification metrics measure the performance of binary predictors against a binary target. They are used extensively in other Iguanas modules. This example shows how to apply them in Spark (via Koalas) and how to create your own.

## Requirements

To run, you'll need the following:

* A dataset containing binary predictor columns and a binary target column.

----

## Import packages

In [1]:
from iguanas.metrics.classification import Precision, Recall, FScore, Revenue, Bounds

import numpy as np
import databricks.koalas as ks
from typing import Union

## Create data

Let's create some dummy predictor columns and a binary target column. For this example, let's assume the dummy predictor columns represent rules that have been applied to a dataset.

In [2]:
np.random.seed(0)

y_pred_ks = ks.Series(np.random.randint(0, 2, 1000), name='A')
y_preds_ks = ks.DataFrame(np.random.randint(0, 2, (1000, 5)), columns=list('ABCDE'))
y_ks = ks.Series(np.random.randint(0, 2, 1000), name='label')
amounts_ks = ks.Series(np.random.randint(0, 1000, 1000), name='amounts')



----

## Apply optimisation functions

There are currently five classification metrics available:

* Precision score
* Recall score
* Fbeta score
* Revenue
* Bounds

**Note that the *FScore*, *Precision* and *Recall* classes are ~100 times faster on larger datasets than the equivalent functions from Sklearn's *metrics* module. They also work with Koalas DataFrames, whereas the Sklearn functions do not.**

### Instantiate class and run fit method

We can run the `fit` method to calculate the optimisation metric for each column in the dataset.

#### Precision score

In [3]:
precision = Precision()
# Single predictor
rule_precision_ks = precision.fit(y_preds=y_pred_ks, y_true=y_ks, sample_weight=None)
# Multiple predictors
rule_precisions_ks = precision.fit(y_preds=y_preds_ks, y_true=y_ks, sample_weight=None)

#### Recall score

In [4]:
recall = Recall()
# Single predictor
rule_recall_ks = recall.fit(y_preds=y_pred_ks, y_true=y_ks, sample_weight=None)
# Multiple predictors
rule_recalls_ks = recall.fit(y_preds=y_preds_ks, y_true=y_ks, sample_weight=None)

#### Fbeta score (beta=1)

In [5]:
f1 = FScore(beta=1)
# Single predictor
rule_f1_ks = f1.fit(y_preds=y_pred_ks, y_true=y_ks, sample_weight=None)
# Multiple predictors
rule_f1s_ks = f1.fit(y_preds=y_preds_ks, y_true=y_ks, sample_weight=None)
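
For reference, the quantities behind these three metrics can be sketched in plain NumPy. This is a minimal illustration of the standard precision, recall and F-beta definitions on a toy dataset, not the Iguanas implementation; the helper names `confusion_counts` and `fbeta_sketch` are hypothetical and the sketch does not guard against division by zero.

```python
import numpy as np


def confusion_counts(y_true, y_preds):
    """Return (TP, FP, FN) per predictor column for 0/1 arrays."""
    y_true = np.asarray(y_true)
    y_preds = np.asarray(y_preds)
    if y_preds.ndim == 1:
        # Treat a single predictor as a one-column matrix
        y_preds = y_preds[:, None]
    tp = (y_preds * y_true[:, None]).sum(axis=0)
    fp = (y_preds * (1 - y_true[:, None])).sum(axis=0)
    fn = ((1 - y_preds) * y_true[:, None]).sum(axis=0)
    return tp, fp, fn


def fbeta_sketch(y_true, y_preds, beta=1.0):
    """Standard precision, recall and F-beta (no zero-division handling)."""
    tp, fp, fn = confusion_counts(y_true, y_preds)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Toy data: six events, two hypothetical rules (one per column)
y_true = np.array([1, 1, 1, 0, 0, 0])
y_preds = np.array([
    [1, 1],
    [1, 0],
    [0, 0],
    [1, 0],
    [0, 0],
    [0, 1],
])
precision, recall, f1 = fbeta_sketch(y_true, y_preds, beta=1)
print(precision, recall, f1)
```

The first rule has TP=2, FP=1, FN=1 and the second has TP=1, FP=1, FN=2, so the values can be checked by hand.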

#### Revenue

In [6]:
rev = Revenue(y_type='Fraud', chargeback_multiplier=2)
# Single predictor
rule_rev_ks = rev.fit(y_preds=y_pred_ks, y_true=y_ks, sample_weight=amounts_ks)
# Multiple predictors
rule_revs_ks = rev.fit(y_preds=y_preds_ks, y_true=y_ks, sample_weight=amounts_ks)

#### Bounds

In [7]:
bounds = Bounds(
    bounds=[
        {
            'metric': precision.fit,
            'operator': '>',
            'threshold': 0.45
        },
        {
            'metric': recall.fit,
            'operator': '>',
            'threshold': 0.5
        }
    ]
)
# Single predictor
rule_bound_ks = bounds.fit(y_preds=y_pred_ks, y_true=y_ks)
# Multiple predictors
rule_bounds_ks = bounds.fit(y_preds=y_preds_ks, y_true=y_ks)

### Outputs

The `fit` method returns the optimisation metric defined by the class:

In [8]:
rule_precision_ks, rule_precisions_ks

(0.48214285714285715,
 array([0.4875717 , 0.47109208, 0.47645951, 0.48850575, 0.4251497 ]))

In [9]:
rule_recall_ks, rule_recalls_ks

(0.5051975051975052,
 array([0.53014553, 0.45738046, 0.52598753, 0.53014553, 0.44282744]))

In [10]:
rule_f1_ks, rule_f1s_ks

(0.4934010152284264,
 array([0.50796813, 0.46413502, 0.5       , 0.50847458, 0.43380855]))
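
As a quick consistency check, the single-predictor F1 value can be recomputed from the single-predictor precision and recall reported above, using the standard F-beta definition. The snippet only reuses the numbers printed in the earlier output cells; it does not call Iguanas.

```python
import math

# Values copied from the precision and recall outputs above
precision = 0.48214285714285715
recall = 0.5051975051975052
beta = 1

# Standard F-beta definition
f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Compare against the F1 value reported for the single predictor
matches_reported = math.isclose(f_beta, 0.4934010152284264, rel_tol=1e-9)
print(f_beta, matches_reported)
```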

In [11]:
rule_rev_ks, rule_revs_ks

(1991, array([ 15119, -14481,  11721,  25063, -74931]))

In [12]:
rule_bound_ks, rule_bounds_ks

(0.501299373374265,
 array([0.50753581, 0.48934673, 0.50649652, 0.50753581, 0.48571075]))

The `fit` method can be fed into various Iguanas modules as an argument (wherever the `metric` parameter appears). For example, in the RuleGeneratorOpt module, you can set the metric used to optimise the rules using this methodology.
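
The pattern of passing a bound `fit` method as a `metric` argument can be illustrated without Iguanas. In the sketch below, `ToyPrecision` and `pick_best_rule` are hypothetical stand-ins (they are not part of the Iguanas API): the consumer only assumes that `metric` is a callable accepting `y_preds`, `y_true` and `sample_weight`.

```python
import numpy as np


class ToyPrecision:
    """Hypothetical stand-in with the same fit(y_preds, y_true, sample_weight) shape."""

    def fit(self, y_preds, y_true, sample_weight=None):
        y_preds = np.asarray(y_preds)
        y_true = np.asarray(y_true)
        # True positives per column divided by flagged events per column
        tp = (y_preds * y_true[:, None]).sum(axis=0)
        return tp / y_preds.sum(axis=0)


def pick_best_rule(y_preds, y_true, rule_names, metric):
    """Hypothetical consumer: score every column with `metric`, return the best."""
    scores = metric(y_preds=y_preds, y_true=y_true, sample_weight=None)
    best = int(np.argmax(scores))
    return rule_names[best], float(scores[best])


y_true = np.array([1, 1, 0, 0])
y_preds = np.array([[1, 1], [1, 0], [1, 0], [0, 0]])

# Pass the bound method itself, not the result of calling it
best_name, best_score = pick_best_rule(
    y_preds, y_true, ['A', 'B'], metric=ToyPrecision().fit
)
print(best_name, best_score)
```

Because the consumer depends only on the call signature, any object exposing a compatible `fit` method can be swapped in.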

----

## Creating your own optimisation function

Say we want to create a class that calculates the Positive Likelihood Ratio (TP rate / FP rate).

The class needs a `fit` method that takes three arguments: the binary predictor(s), the binary target and any event-specific weights to apply. It should return a single numeric value for a single predictor, or an array with one value per column when several predictors are passed (as the example below does).

In [13]:
class PositiveLikelihoodRatio:
    
    def fit(self,
            y_preds: Union[ks.Series, ks.DataFrame],
            y_true: ks.Series,
            sample_weight: Union[ks.Series, None] = None
            ) -> Union[float, np.ndarray]:
        
        def _calc_plr(y_true, y_preds):
            # Calculate TPR
            tpr = (y_true * y_preds).sum() / y_true.sum()
            # Calculate FPR
            fpr = ((1 - y_true) * y_preds).sum() / (1 - y_true).sum()
            return 0 if tpr == 0 or fpr == 0 else tpr / fpr
        
        # Set this option to allow calc of TPR/FPR on Koalas dataframes
        with ks.option_context("compute.ops_on_diff_frames", True):
            if y_preds.ndim == 1:            
                return _calc_plr(y_true, y_preds)
            else:
                plrs = np.empty(y_preds.shape[1])
                for i, col in enumerate(y_preds.columns):                        
                    plrs[i] = _calc_plr(y_true, y_preds[col])
                return plrs

We can then apply the `fit` method to the dataset to check it works:

In [14]:
plr = PositiveLikelihoodRatio()
# Single predictor
rule_plr_ks = plr.fit(y_preds=y_pred_ks, y_true=y_ks, sample_weight=None)
# Multiple predictors
rule_plrs_ks = plr.fit(y_preds=y_preds_ks, y_true=y_ks, sample_weight=None)

In [15]:
rule_plr_ks, rule_plrs_ks

(1.004588142519177,
 array([1.02666243, 0.96105448, 0.98196952, 1.0305076 , 0.79801195]))
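
To sanity-check the arithmetic, the same TPR/FPR calculation can be run on a tiny example whose answer is easy to work out by hand. This sketch uses NumPy arrays rather than Koalas objects, and `calc_plr` is a standalone copy of the inner helper written for illustration only.

```python
import numpy as np


def calc_plr(y_true, y_pred):
    """Positive likelihood ratio: TPR / FPR (0 if either rate is 0)."""
    tpr = (y_true * y_pred).sum() / y_true.sum()
    fpr = ((1 - y_true) * y_pred).sum() / (1 - y_true).sum()
    return 0 if tpr == 0 or fpr == 0 else tpr / fpr


# Four positives and four negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
# The rule flags three of the positives and one of the negatives
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])

# TPR = 3/4 and FPR = 1/4, so the ratio is 3
plr_value = calc_plr(y_true, y_pred)
print(plr_value)
```

A ratio close to 1 means the rule flags positives and negatives at about the same rate, which is what randomly generated predictors like the ones in this example would be expected to give.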

Finally, after instantiating the class, we can pass its `fit` method to a relevant Iguanas module. For example, passing it to the `metric` parameter of the `BayesianOptimiser` class generates rules that maximise the Positive Likelihood Ratio.

----