# LionGuard

LionGuard is a classifier for detecting unsafe content in the Singaporean context. It embeds the input text with [pre-trained BAAI English embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) and feeds the embeddings into a trained classifier. The [original beta version](https://huggingface.co/jfooyh/lionguard-beta) uses an XGBoost classifier, whereas the [new LionGuard models](https://huggingface.co/dsaidgovsg) released by DSAID GovTech use a Ridge classifier.

<small><i>GovTech introduced LionGuard [on 24th June at a technical sharing session](https://www.imda.gov.sg/activities/activities-catalogue/technical-sharing-session-on-ai-robustness). [LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content](https://arxiv.org/abs/2407.10995) was released on the same day, detailing the architecture of LionGuard and comparing it to various contemporaries. DSAID at GovTech later released a [Medium post](https://medium.com/dsaid-govtech/building-lionguard-a-contextualised-moderation-classifier-to-tackle-local-unsafe-content-8f68c8f13179) explaining the rationale and design of LionGuard. The above description has been modified from the [LionGuard-v1 model release](https://huggingface.co/govtech/lionguard-v1).</i></small>

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("../..")

## Basic Usage

In [3]:
from walledeval.judge import LionGuardJudge

The [LionGuard-binary](https://huggingface.co/dsaidgovsg/lionguard-binary-v1.0) model is a Ridge classifier that embeds the input text based on the [BAAI English embeddings model](https://huggingface.co/BAAI/bge-large-en-v1.5) and inputs said embeddings into the pretrained classifier. Our system uses presets to load any of the models released by LionGuard's team.

These are the models supported:

|Model|Classifier Type|Preset|
|:--------------------------------------------------------------:|:-------:|:------:|
| [LionGuard-binary-v1.0](https://huggingface.co/dsaidgovsg/lionguard-binary-v1.0) | Ridge | `binary` |
| [LionGuard-harassment-v1.0](https://huggingface.co/dsaidgovsg/lionguard-harassment-v1.0) | Ridge | `harassment` |
| [LionGuard-hateful-v1.0](https://huggingface.co/dsaidgovsg/lionguard-hateful-v1.0) | Ridge | `hateful` |
| [LionGuard-public_harm-v1.0](https://huggingface.co/dsaidgovsg/lionguard-public_harm-v1.0) | Ridge | `public_harm` |
| [LionGuard-self_harm-v1.0](https://huggingface.co/dsaidgovsg/lionguard-self_harm-v1.0) | Ridge | `self_harm` |
| [LionGuard-sexual-v1.0](https://huggingface.co/dsaidgovsg/lionguard-sexual-v1.0) | Ridge | `sexual` |
| [LionGuard-toxic-v1.0](https://huggingface.co/dsaidgovsg/lionguard-toxic-v1.0) | Ridge | `toxic` |
| [LionGuard-violent-v1.0](https://huggingface.co/dsaidgovsg/lionguard-violent-v1.0) | Ridge | `violent` |

<!--| [LionGuard-beta](https://huggingface.co/jfooyh/lionguard-beta) | XGBoost | `beta` |-->

In [6]:
binary = LionGuardJudge.from_preset("binary")
binary

<walledeval.judge.lionguard.LionGuardJudge at 0x7fa3a7a854b0>

The `embed` function is an intermediary function that produces embeddings for the input text using the BAAI English embeddings model.

In [7]:
binary.embed("Hello World")

array([[ 0.0343454 ,  0.03316101,  0.02191255, ..., -0.00717639,
        -0.00188533,  0.01511723]], dtype=float32)

The `check` function is the default function in our judges. It takes in the text and outputs a score (loosely called a logit here). The score is expected to lie between 0 and 1 and indicates how toxic the text is (1 being "most toxic").

In [8]:
binary.check("Hello World")

0.0007077140617184341

In [9]:
binary.check("bloody fuck you bloody")

1.0

In [10]:
binary.check("knn")

0.5783132314682007

In [11]:
binary.check("chibai")

0.09701492637395859

In [12]:
binary.check("knn chao chibai")

0.920634925365448

In [13]:
binary.score(0.920634925365448)

False

In [14]:
output, score = binary("knn chao chibai")
print("Output:", output)
print("Classification:", "safe" if score else "unsafe")

Output: 0.920634925365448
Classification: unsafe


The results above show that the model responds strongly to English profanity like "fuck", but, funnily enough, a Chinese swear word like "knn" is only weakly recognized (about 0.58), and another like "chibai" (about 0.10) is not even classified positively. This suggests that the model is not very good at classifying short toxic Singaporean phrases, which is rather ironic. However, when the content contains more words, as in "knn chao chibai" (about 0.92), it is classified more confidently as an unsafe sample.
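The `check` / `score` / call pattern used above can be mimicked with a small stand-in, which may help when reasoning about the outputs without downloading a model. This is only a sketch: `RecordedJudge` is a hypothetical class and not part of `walledeval`, its scores are simply the values recorded in the cells above, and the 0.5 cut-off is an assumption (the notebook only shows that a score of about 0.92 maps to `False`, i.e. unsafe).

```python
# Scores copied from the `binary.check(...)` outputs recorded above.
RECORDED_SCORES = {
    "Hello World": 0.0007077140617184341,
    "bloody fuck you bloody": 1.0,
    "knn": 0.5783132314682007,
    "chibai": 0.09701492637395859,
    "knn chao chibai": 0.920634925365448,
}


class RecordedJudge:
    """Hypothetical stand-in that mirrors the check/score/call pattern."""

    def __init__(self, scores, threshold=0.5):
        self.scores = scores
        self.threshold = threshold  # assumed cut-off, not taken from walledeval

    def check(self, text):
        # Return the raw score for a text (looked up, not computed).
        return self.scores[text]

    def score(self, output):
        # True means "safe", matching the `"safe" if score else "unsafe"` usage.
        return output < self.threshold

    def __call__(self, text):
        output = self.check(text)
        return output, self.score(output)


judge = RecordedJudge(RECORDED_SCORES)
labels = {text: ("safe" if judge(text)[1] else "unsafe") for text in RECORDED_SCORES}
for text, label in labels.items():
    print(f"{text!r}: {judge.check(text):.3f} -> {label}")
```

Under this assumed cut-off, "knn" lands just on the unsafe side while "chibai" is labelled safe, which matches the observation that short Singaporean phrases are handled inconsistently.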

## Testing the Beta Model

The Beta model (which has since been removed from the Hugging Face Hub) differs in that it is an XGBoost classifier. Hence, here is another demo, similar to the one above.

NOTE: DO NOT TRY THIS, THE MODEL HAS BEEN TAKEN DOWN.

In [12]:
beta = LionGuardJudge.from_preset("beta")
beta

<walledeval.judge.lionguard.LionGuardJudge at 0x7f85a6c8ceb0>

In [13]:
beta.embed("Hello World")

array([[ 0.0343454 ,  0.03316101,  0.02191255, ..., -0.00717639,
        -0.00188533,  0.01511723]], dtype=float32)

In [14]:
beta.check("Hello World")

7.457677565980703e-05

In [15]:
beta.check("bloody fuck you bloody")

0.9999995231628418

In [16]:
beta.check("knn")

0.999834418296814

In [17]:
beta.check("chibai")

0.0005032883491367102

In [18]:
beta.check("knn chao chibai")

0.8807035684585571

In [19]:
beta.score(0.8807035684585571)

False

In [20]:
output, score = beta("knn chao chibai")
print("Output:", output)
print("Classification:", "safe" if score else "unsafe")

Output: 0.8807035684585571
Classification: unsafe


The Beta model behaves similarly overall, but it identifies a word like "knn" as strongly unsafe (about 0.9998, versus about 0.58 for `binary`).
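To make the comparison concrete, the recorded outputs of the two models can be placed side by side. The sketch below uses only the scores printed in the cells above; the variable names are illustrative and nothing here calls the models.

```python
# Scores copied from the recorded `binary.check(...)` and `beta.check(...)` outputs.
binary_scores = {
    "Hello World": 0.0007077140617184341,
    "bloody fuck you bloody": 1.0,
    "knn": 0.5783132314682007,
    "chibai": 0.09701492637395859,
    "knn chao chibai": 0.920634925365448,
}
beta_scores = {
    "Hello World": 7.457677565980703e-05,
    "bloody fuck you bloody": 0.9999995231628418,
    "knn": 0.999834418296814,
    "chibai": 0.0005032883491367102,
    "knn chao chibai": 0.8807035684585571,
}

# Difference (beta minus binary) per phrase, largest absolute gap first.
gaps = {text: beta_scores[text] - binary_scores[text] for text in binary_scores}
ranked = sorted(gaps, key=lambda text: abs(gaps[text]), reverse=True)

for text in ranked:
    print(f"{text!r}: binary={binary_scores[text]:.4f} "
          f"beta={beta_scores[text]:.4f} gap={gaps[text]:+.4f}")
```

The largest gap by far is on "knn", while "chibai" receives a low score from both models.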

## Testing All Models Together

We can also choose to perform an analysis on how each model recognizes text.

In [17]:
presets = ["binary", "harassment", "hateful", "public_harm", "self_harm", "sexual", "toxic", "violent"]

judges = {
    preset: LionGuardJudge.from_preset(preset)
    for preset in presets
}

In [23]:
import numpy as np

def ensemble(text):
    return {name: judge.check(text) for name, judge in judges.items()}

In [26]:
ensemble("fuck")

{'binary': 0.824999988079071,
 'harassment': 0.30227137,
 'hateful': 0.27736473,
 'public_harm': 0.37700772,
 'self_harm': 0.32861292,
 'sexual': 0.20570648,
 'toxic': 0.42799044,
 'violent': 0.2513579}

In [27]:
ensemble("knn")

{'binary': 0.5783132314682007,
 'harassment': 0.27873802,
 'hateful': 0.46916413,
 'public_harm': 0.39066374,
 'self_harm': 0.5401348,
 'sexual': 0.49440575,
 'toxic': 0.4349258,
 'violent': 0.43546712}

In [29]:
ensemble("chibai")

{'binary': 0.09701492637395859,
 'harassment': -0.63368547,
 'hateful': -0.27831292,
 'public_harm': -0.28430176,
 'self_harm': -0.33843124,
 'sexual': -0.41787767,
 'toxic': -0.18748307,
 'violent': -0.52312994}

These results suggest that the `binary` score does not line up well with the per-category scores from the safety taxonomy laid out by the LionGuard researchers: "fuck" scores about 0.82 on `binary` but below 0.5 in every category. The category scores also go negative for "chibai", which is very odd if the output of the Ridge classifier is a probability score, as the codebase implies.
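One possible explanation for the negative values, offered here as an assumption rather than a confirmed property of the LionGuard models: scikit-learn's `RidgeClassifier` does not provide `predict_proba`, and its `decision_function` returns a signed, unbounded confidence value rather than a probability. If the exported category models return that raw value, negative scores would be expected. The sketch below uses made-up weights and 3-dimensional toy embeddings (all hypothetical) to show a linear decision value going negative, and how a logistic function would squash it into (0, 1).

```python
import math


def decision_value(embedding, weights, bias):
    # Linear decision value w . x + b; unbounded, so it can be negative.
    return sum(w * x for w, x in zip(weights, embedding)) + bias


def squash(value):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-value))


# Made-up weights, bias, and toy embeddings, for illustration only.
weights = [0.8, -1.2, 0.5]
bias = -0.1
examples = {
    "toy_positive": [0.9, 0.1, 0.4],
    "toy_negative": [0.1, 0.7, 0.2],
}

results = {}
for name, embedding in examples.items():
    raw = decision_value(embedding, weights, bias)
    results[name] = (raw, squash(raw))
    print(f"{name}: raw={raw:+.3f}, squashed={results[name][1]:.3f}")
```

Note that a squashed value is still not a calibrated probability; it only brings the scores onto a common 0 to 1 scale so they can be compared with the `binary` output.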