# Injections Benchmark

In [4]:
# Note: you need to put your Gitlab PAT in the command below

!pip install -qq git+https://oauth2:<your-token>@gitlab.com/whylabs/datascience/whylabs-llm-toolkit.git#egg=whylabs-llm-toolkit[eval]

[33mDEPRECATION: git+https://oauth2:****@gitlab.com/whylabs/datascience/whylabs-llm-toolkit.git#egg=whylabs-llm-toolkit[eval] contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617[0m[33m
[0m  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


The injection benchmark supports two types of models:

- __Classifier__: Outputs 0 (non-injection) or 1 (injection)
- __Scorer__: Outputs a float in the range [0,1], where the closer to 1, the closer it is to an injection

To do so, you need to create a class with a `predict` method that takes a list of strings and outputs a list of ints/float, depending on whether it is a classifier or a scorer.

Here's an example of a dummy classifier and a dummy scorer, giving random values:

In [7]:
import random

class Classifier():
    def predict(self, inputs):
        # your real classifier would do something smarter than guess here
        return [random.choice([0,1]) for _ in inputs]

class Scorer():
    def predict(self, inputs):
        # your real scorer would do something smarter than guess here
        return [random.uniform(0,1) for _ in inputs]

my_classifier = Classifier()
my_scorer = Scorer()

example_input = ["This is a test", "This is another test"]
print(example_input)
print(f"Classifier output: {my_classifier.predict(example_input)}")
print(f"Scorer output: {my_scorer.predict(example_input)}")


['This is a test', 'This is another test']
Classifier output: [0, 1]
Scorer output: [0.6651211633231211, 0.37558377554737354]


## Running the Benchmark #1 - Classifier Model

> Adicionar aspas



You can create a benchmark object with:

In [8]:
from whylabs_llm_toolkit.eval.benchmarks import InjectionsBenchmark

benchmark = InjectionsBenchmark()

In [11]:
from pprint import pprint

result = benchmark.run(my_classifier)

pprint({key: result[key].to_dict() for key in result})


{'injections_multilingual': {'accuracy': 0.49311701081612586,
                             'auc': None,
                             'confusion.fn': 529,
                             'confusion.fp': 502,
                             'confusion.tn': 498,
                             'confusion.tp': 505,
                             'f1': 0.4948554630083293,
                             'precision': 0.5014895729890765,
                             'recall': 0.488394584139265,
                             'roc_curve.fpr': None,
                             'roc_curve.thresholds': None,
                             'roc_curve.tpr': None,
                             'support': 2034,
                             'threshold': None,
                             'time': 0.002393484115600586},
 'intentions': {'accuracy': 0.5221238938053098,
                'auc': None,
                'confusion.fn': 54,
                'confusion.fp': 54,
                'confusion.tn': 59,
                'co

Note that, since we're using this with a classifier, some metrics are None, like `threshold`, `roc_curve`, `auc`, which is expected. Those all make sense when a scorer is used and you can set a classification threshold.

## Running the Benchmark #1 - Scorer Model

The scorer model outputs a float. To get performance results, we need to define the threshold used along with the scorer. We have two options for that:

- `auto_threshold`: If True, you don't need to define a threshold. Internally, the threshold that optimizes for best F0.5 Score will be used. The calculated threshold will be shown in the result in `threshold`.
- `classification_threshold`: if a float between [0-1] is passed, this will turn the scorer effectively into a classifier

In [12]:
benchmark = InjectionsBenchmark(auto_threshold=True)
result = benchmark.run(my_scorer)

pprint({key: result[key].to_dict() for key in result})

{'injections_multilingual': {'accuracy': 0.5088495575221239,
                             'auc': 0.4921460348162476,
                             'confusion.fn': 0,
                             'confusion.fp': 999,
                             'confusion.tn': 1,
                             'confusion.tp': 1034,
                             'f1': 0.6742745353765895,
                             'precision': 0.5086079685194295,
                             'recall': 1.0,
                             'roc_curve.fpr': None,
                             'roc_curve.thresholds': None,
                             'roc_curve.tpr': None,
                             'support': 2034,
                             'threshold': 0.002389979113693985,
                             'time': 0.0007619857788085938},
 'intentions': {'accuracy': 0.5,
                'auc': 0.4535985590101026,
                'confusion.fn': 0,
                'confusion.fp': 113,
                'confusion.tn': 0,
        

OR, by setting classification_threshold:

In [13]:
benchmark = InjectionsBenchmark(classification_threshold=0.5)
result = benchmark.run(my_scorer)

pprint({key: result[key].to_dict() for key in result})

{'injections_multilingual': {'accuracy': 0.48918387413962633,
                             'auc': 0.4861914893617021,
                             'confusion.fn': 522,
                             'confusion.fp': 517,
                             'confusion.tn': 483,
                             'confusion.tp': 512,
                             'f1': 0.4963645176926806,
                             'precision': 0.4975704567541302,
                             'recall': 0.4951644100580271,
                             'roc_curve.fpr': None,
                             'roc_curve.thresholds': None,
                             'roc_curve.tpr': None,
                             'support': 2034,
                             'threshold': 0.5,
                             'time': 0.0007762908935546875},
 'intentions': {'accuracy': 0.47345132743362833,
                'auc': 0.49330409585715407,
                'confusion.fn': 57,
                'confusion.fp': 62,
                'confusi

## Interpreting the Results

The most important metrics to look at are:
- [F0.5](https://en.wikipedia.org/wiki/F-score): This is a measure that balances precision and recall, favoring precision. This means that reducing false positives is more important than reducing false negatives.
- Precision: How much of what we "blocked" were real injection? (High precision -> low false positive rate)
- Recall: How much of real injections did we block? (High recall -> low false negative rate)

Also, some notes:

- If your model only supports english, you should ignore the multilingual benchmarks
- If your model doesn't try to catch harmful behaviors, like "tell me how to steal a car", your model probably will perform poorly on the `intentions` dataset and can also be ignored.