# LionGuard

LionGuard is a classifier for detecting unsafe content in the Singaporean context. It embeds the input text with [pre-trained BAAI English embeddings](https://huggingface.co/BAAI/bge-large-en-v1.5) and passes the embeddings to a trained classifier. The [original beta version](https://huggingface.co/jfooyh/lionguard-beta) uses an XGBoost classifier, whereas the [new LionGuard models](https://huggingface.co/dsaidgovsg) released by DSAID GovTech use a Ridge classifier.

<small><i>GovTech introduced LionGuard [on 24th June at a technical sharing session](https://www.imda.gov.sg/activities/activities-catalogue/technical-sharing-session-on-ai-robustness). The above description has been modified from the [LionGuard-binary model release](https://huggingface.co/dsaidgovsg/lionguard-binary-v1.0).</i></small>

In [1]:
import sys
sys.path.append("../..")

## Basic Usage

In [2]:
from walledeval.judge import LionGuardJudge

The [LionGuard-binary](https://huggingface.co/dsaidgovsg/lionguard-binary-v1.0) model is a Ridge classifier: the input text is embedded using the [BAAI English embeddings model](https://huggingface.co/BAAI/bge-large-en-v1.5), and the embeddings are passed to the pretrained classifier. Our system uses presets to load any of the models released by LionGuard's team.

These are the models supported:

| Model | Classifier Type | Preset |
|:--------------------------------------------------------------:|:-------:|:------:|
| [LionGuard-beta](https://huggingface.co/jfooyh/lionguard-beta) | XGBoost | `beta` |
| [LionGuard-binary-v1.0](https://huggingface.co/dsaidgovsg/lionguard-binary-v1.0) | Ridge | `binary` |
| [LionGuard-harassment-v1.0](https://huggingface.co/dsaidgovsg/lionguard-harassment-v1.0) | Ridge | `harassment` |
| [LionGuard-hateful-v1.0](https://huggingface.co/dsaidgovsg/lionguard-hateful-v1.0) | Ridge | `hateful` |
| [LionGuard-public_harm-v1.0](https://huggingface.co/dsaidgovsg/lionguard-public_harm-v1.0) | Ridge | `public_harm` |
| [LionGuard-self_harm-v1.0](https://huggingface.co/dsaidgovsg/lionguard-self_harm-v1.0) | Ridge | `self_harm` |
| [LionGuard-sexual-v1.0](https://huggingface.co/dsaidgovsg/lionguard-sexual-v1.0) | Ridge | `sexual` |
| [LionGuard-toxic-v1.0](https://huggingface.co/dsaidgovsg/lionguard-toxic-v1.0) | Ridge | `toxic` |
| [LionGuard-violent-v1.0](https://huggingface.co/dsaidgovsg/lionguard-violent-v1.0) | Ridge | `violent` |
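
Because every preset is loaded the same way, a single text can be run through several of these models in a loop. The following is a minimal sketch, assuming each preset name in the table loads via `LionGuardJudge.from_preset` and that the resulting judge exposes `check`. The `check_all` helper and its `load_judge` parameter are hypothetical and not part of walledeval.

```python
# Preset names copied from the table above.
PRESETS = [
    "beta", "binary", "harassment", "hateful",
    "public_harm", "self_harm", "sexual", "toxic", "violent",
]


def check_all(text, load_judge, presets=PRESETS):
    """Run `text` through the judge for every preset and collect the outputs.

    `load_judge` is any callable that maps a preset name to an object with a
    `check(text)` method, for example `LionGuardJudge.from_preset`.
    This helper is hypothetical; it is not part of walledeval.
    """
    return {preset: load_judge(preset).check(text) for preset in presets}


# Intended usage (loads every model, so it is left commented out):
# from walledeval.judge import LionGuardJudge
# outputs = check_all("Hello World", LionGuardJudge.from_preset)
```

Passing the loader in as a parameter keeps the helper independent of how the models are obtained, which also makes it easy to try out with a stub before loading the real models.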

In [3]:
binary = LionGuardJudge.from_preset("binary")
binary

<walledeval.judge.lionguard.LionGuardJudge at 0x7f1ba6abcd90>

The `embed` function is an intermediary function that computes embeddings of the input text using the BAAI English embeddings model.

In [4]:
binary.embed("Hello World")

array([[ 0.0343454 ,  0.03316101,  0.02191255, ..., -0.00717639,
        -0.00188533,  0.01511723]], dtype=float32)

The `check` function is the default function in our judges: it takes in the text and outputs a value between 0 and 1 that indicates how toxic the text is (1 being "most toxic").

In [5]:
binary.check("Hello World")

0.030662020668387413

In [6]:
binary.check("bloody fuck you bloody")

1.0

In [7]:
binary.check("knn")

0.2818181812763214

In [8]:
binary.check("chibai")

0.09012345969676971

In [9]:
binary.check("knn chao chibai")

0.28015562891960144

In [10]:
binary.score(0.28015562891960144)

True

In [11]:
output, score = binary("knn chao chibai")
print("Output:", output)
print("Classification:", "safe" if score else "unsafe")

Output: 0.28015562891960144
Classification: safe


The above results show that the model responds strongly to English profanity like "fuck" (1.0), but funnily enough, a Hokkien swear word like "knn" scores only about 0.28, and another like "chibai" scores just 0.09. The combined phrase "knn chao chibai" is classified as safe. This suggests that the binary model is not very good at classifying toxic Singaporean phrases, which is rather ironic.
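
To label several phrases in one pass, the `check` and `score` calls used above can be wrapped in a small helper. The sketch below runs offline: `RecordedJudge` is a hypothetical stand-in that replays the binary model's outputs recorded in this notebook, and its 0.5 cut-off is an assumption, since the actual threshold used by `score` is not stated here. Only the "knn chao chibai" label is confirmed by the cells above; the other labels depend on that assumed threshold.

```python
class RecordedJudge:
    """Hypothetical offline stand-in that replays outputs recorded above."""

    def __init__(self, recorded, threshold=0.5):
        self.recorded = recorded
        # ASSUMPTION: the real cut-off used by `score` is not given in this
        # notebook; 0.5 is a placeholder consistent with the recorded result.
        self.threshold = threshold

    def check(self, text):
        return self.recorded[text]

    def score(self, output):
        # True means "safe", matching how the notebook interprets `score`.
        return output < self.threshold


def label_phrases(judge, phrases):
    """Return (phrase, output, label) rows using the judge's check/score."""
    rows = []
    for phrase in phrases:
        output = judge.check(phrase)
        label = "safe" if judge.score(output) else "unsafe"
        rows.append((phrase, output, label))
    return rows


# Binary-model outputs copied from the cells above.
binary_recorded = {
    "Hello World": 0.030662020668387413,
    "bloody fuck you bloody": 1.0,
    "knn": 0.2818181812763214,
    "chibai": 0.09012345969676971,
    "knn chao chibai": 0.28015562891960144,
}

judge = RecordedJudge(binary_recorded)
for phrase, output, label in label_phrases(judge, binary_recorded):
    print(f"{phrase!r}: {output:.4f} -> {label}")
```

Swapping `RecordedJudge(binary_recorded)` for `LionGuardJudge.from_preset("binary")` should give the same rows from the live model, with the labels then determined by the judge's own `score`.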

## Testing the Beta Model

The beta model is different because it is an XGBoost classifier. Hence, here's another demo (similar to the one above).

In [12]:
beta = LionGuardJudge.from_preset("beta")
beta

<walledeval.judge.lionguard.LionGuardJudge at 0x7f1addcbfca0>

In [13]:
beta.embed("Hello World")

array([[ 0.0343454 ,  0.03316101,  0.02191255, ..., -0.00717639,
        -0.00188533,  0.01511723]], dtype=float32)

In [14]:
beta.check("Hello World")

7.457677565980703e-05

In [15]:
beta.check("bloody fuck you bloody")

0.9999995231628418

In [16]:
beta.check("knn")

0.999834418296814

In [17]:
beta.check("chibai")

0.0005032883491367102

In [18]:
beta.check("knn chao chibai")

0.8807035684585571

In [19]:
beta.score(0.8807035684585571)

False

In [20]:
output, score = beta("knn chao chibai")
print("Output:", output)
print("Classification:", "safe" if score else "unsafe")

Output: 0.8807035684585571
Classification: unsafe


The beta model behaves similarly on the English examples, but it scores "knn" at nearly 1.0 and classifies "knn chao chibai" as unsafe. "chibai" on its own still scores close to 0.
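
The difference between the two models is easier to see side by side. The sketch below only tabulates the outputs already recorded in this notebook (no model is loaded); `largest_disagreement` is a hypothetical helper introduced for illustration.

```python
# Outputs copied from the cells above: phrase -> (binary output, beta output).
recorded = {
    "Hello World": (0.030662020668387413, 7.457677565980703e-05),
    "bloody fuck you bloody": (1.0, 0.9999995231628418),
    "knn": (0.2818181812763214, 0.999834418296814),
    "chibai": (0.09012345969676971, 0.0005032883491367102),
    "knn chao chibai": (0.28015562891960144, 0.8807035684585571),
}


def largest_disagreement(outputs):
    """Return the phrase whose two model outputs differ the most."""
    return max(outputs, key=lambda p: abs(outputs[p][0] - outputs[p][1]))


for phrase, (binary_out, beta_out) in recorded.items():
    gap = abs(binary_out - beta_out)
    print(f"{phrase!r:28} binary={binary_out:.4f} beta={beta_out:.4f} gap={gap:.4f}")

print("Largest gap:", largest_disagreement(recorded))
```

On these five phrases the two models disagree most on "knn", followed by "knn chao chibai", while they agree closely on the English examples.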