# Toxicity Model Configuration

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/LanguageToolkit/blob/main/langkit/examples/Toxicity_Model_Configuration.ipynb)

In this example, we show how to use different toxicity models with LangKit to compute the `toxicity` metric for prompts and responses.

## Install LangKit

First, let's install __LangKit__.

In [None]:
%pip install langkit[all]==0.0.30

By default, LangKit uses the [`martin-ha/toxic-comment-model`](https://huggingface.co/martin-ha/toxic-comment-model) model. This is the model that is loaded when you call `toxicity.init()` with no arguments.

You can also select it explicitly:

In [17]:
from langkit import toxicity, extract
# this is the default model
toxicity.init(model_path="martin-ha/toxic-comment-model")
results = extract({"prompt": "I hate you!"})

results

{'prompt': 'I hate you!', 'prompt.toxicity': 0.9164737462997437}
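LangKit returns a raw score rather than a yes/no decision. If you need a binary flag, a minimal sketch like the one below can threshold the score. The `flag_toxic` helper and the `0.5` cutoff are hypothetical choices for illustration, not part of LangKit; the helper simply looks for keys ending in `.toxicity`, such as the `prompt.toxicity` key shown above.

```python
from typing import Any, Dict, Mapping

# Hypothetical cutoff: tune it on your own labeled data.
DEFAULT_THRESHOLD = 0.5


def flag_toxic(results: Mapping[str, Any], threshold: float = DEFAULT_THRESHOLD) -> Dict[str, bool]:
    """Turn toxicity scores in an extract() result into boolean flags.

    Only keys ending in ".toxicity" (e.g. "prompt.toxicity") are considered;
    other keys, such as the raw prompt text, are ignored.
    """
    flags = {}
    for key, value in results.items():
        if key.endswith(".toxicity"):
            # float() also accepts numpy scalars such as float32
            flags[key] = float(value) >= threshold
    return flags


# Using the output shown above
results = {"prompt": "I hate you!", "prompt.toxicity": 0.9164737462997437}
print(flag_toxic(results))  # {'prompt.toxicity': True}
```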

`toxic-comment-model` appears to be quite sensitive to punctuation. For example, try removing the exclamation mark above and observe how the toxicity score changes.
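To check this sensitivity more systematically, you could score several punctuation variants of the same text. The sketch below uses hypothetical helpers (`punctuation_variants`, `score_variants`), not a LangKit API: they accept any function that maps a string to a score, so you can plug in LangKit's `extract` or a stub for testing.

```python
import string
from typing import Callable, Dict


def punctuation_variants(text: str) -> Dict[str, str]:
    """Build a few variants of `text` that differ only in trailing punctuation."""
    base = text.rstrip(string.punctuation)
    return {
        "original": text,
        "no_punctuation": base,
        "period": base + ".",
        "exclamation": base + "!",
    }


def score_variants(text: str, score: Callable[[str], float]) -> Dict[str, float]:
    """Score every variant with `score` so the differences are easy to compare."""
    return {name: score(variant) for name, variant in punctuation_variants(text).items()}


# Example with LangKit (assumes langkit is installed and toxicity.init() was called):
# from langkit import extract
# scores = score_variants("I hate you!", lambda t: extract({"prompt": t})["prompt.toxicity"])
```

Comparing the returned scores side by side makes it easy to see how much a single character moves the result for a given model.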

LangKit also supports toxicity models from Unitary's [detoxify](https://github.com/unitaryai/detoxify) package. For example, to use detoxify's `unbiased` model, pass `model_path="detoxify/unbiased"` to `toxicity.init()`:

In [19]:
toxicity.init(model_path="detoxify/unbiased")

results = extract({"prompt": "I hate you."})
print(f"Results - detoxify/unbiased:\n {results}")

Results - detoxify/unbiased:
 {'prompt': 'I hate you.', 'prompt.toxicity': 0.81225103}


In preliminary internal tests, the `unbiased` model appeared to perform better than the default model, at the cost of slower inference. If you are willing to trade latency for accuracy, consider experimenting with the `unbiased` model.
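Before switching models, it can help to measure the latency side of this trade-off on your own hardware. Below is a minimal timing sketch; `time_scorer` is a hypothetical helper, not part of LangKit. It runs a warm-up call first so that one-time costs (such as lazy loading) are excluded from the measurement.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def time_scorer(score: Callable[[str], float], texts: Sequence[str], warmup: int = 1) -> Dict[str, float]:
    """Return mean and max per-call latency (in seconds) of `score` over `texts`."""
    if not texts:
        raise ValueError("texts must not be empty")
    # Warm-up calls are not timed (e.g. lazy model loading, caches).
    for _ in range(warmup):
        score(texts[0])
    latencies = []
    for text in texts:
        start = time.perf_counter()
        score(text)
        latencies.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(latencies), "max_s": max(latencies)}


# Example with LangKit (assumes langkit is installed):
# from langkit import toxicity, extract
# toxicity.init(model_path="detoxify/unbiased")
# print(time_scorer(lambda t: extract({"prompt": t})["prompt.toxicity"], ["I hate you.", "Have a nice day."]))
```

Running it once per model with the same texts gives a rough, hardware-specific comparison; for stable numbers, use more texts than the two in the comment.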

In addition to `unbiased`, you can also use the `original` and `multilingual` models from detoxify:

In [21]:
toxicity.init(model_path="detoxify/multilingual")

# "Eu te odeio." is Portuguese for "I hate you."
results = extract({"prompt": "Eu te odeio."})
print(f"Results - detoxify/multilingual:\n {results}")

toxicity.init(model_path="detoxify/original")

results = extract({"prompt": "I hate you."})
print(f"Results - detoxify/original:\n {results}")

Results - detoxify/multilingual:
 {'prompt': 'Eu te odeio.', 'prompt.toxicity': 0.9851344}
Results - detoxify/original:
 {'prompt': 'I hate you.', 'prompt.toxicity': 0.9475088}
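When evaluating these models, it is convenient to run the same prompts through each one. The sketch below loops over model paths and re-initializes the toxicity module for each. `compare_models` is hypothetical; it takes the `init` and `extract` functions as parameters (rather than importing LangKit directly) so it can also be exercised with stubs.

```python
from typing import Any, Callable, Dict, List, Mapping


def compare_models(
    model_paths: List[str],
    prompts: List[str],
    init: Callable[..., Any],
    extract: Callable[[Dict[str, str]], Mapping[str, Any]],
) -> Dict[str, Dict[str, float]]:
    """Return {model_path: {prompt: toxicity score}} for every model and prompt."""
    table: Dict[str, Dict[str, float]] = {}
    for path in model_paths:
        # Re-initialize so that subsequent extract() calls use this model.
        init(model_path=path)
        table[path] = {p: float(extract({"prompt": p})["prompt.toxicity"]) for p in prompts}
    return table


# Example with LangKit (assumes langkit is installed):
# from langkit import toxicity, extract
# scores = compare_models(
#     ["martin-ha/toxic-comment-model", "detoxify/original", "detoxify/unbiased"],
#     ["I hate you.", "I hate you!"],
#     toxicity.init,
#     extract,
# )
```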


The `multilingual` model was trained on the following languages: English, French, Spanish, Italian, Portuguese, Turkish, and Russian.
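If your traffic mixes languages, one option is to route non-English text to the `multilingual` model. The sketch below is illustrative routing logic, not a LangKit feature. It assumes you already know each text's language as an ISO 639-1 code (language detection is out of scope here) and that English text keeps using `detoxify/unbiased`; both the helper name `choose_model_path` and that default are hypothetical.

```python
# Languages the detoxify `multilingual` model was trained on (ISO 639-1 codes).
MULTILINGUAL_LANGUAGES = {"en", "fr", "es", "it", "pt", "tr", "ru"}


def choose_model_path(language: str, english_model: str = "detoxify/unbiased") -> str:
    """Pick a toxicity model path for text in `language` (an ISO 639-1 code)."""
    lang = language.lower()
    if lang == "en":
        return english_model
    if lang in MULTILINGUAL_LANGUAGES:
        return "detoxify/multilingual"
    # No model in this example was trained on other languages.
    raise ValueError(f"No toxicity model configured for language {language!r}")
```

Since `toxicity.init()` selects the model for all subsequent `extract()` calls, switching models on every request is likely to be slow; grouping texts by language and initializing once per group is probably the cheaper pattern.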

You can find more information about the different models in the [detoxify](https://github.com/unitaryai/detoxify) repository.

## Limitations and Ethical Considerations

For more information on the limitations and ethical considerations of each of the models above, please refer to:

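- [`martin-ha/toxic-comment-model`: Limitations and bias](https://huggingface.co/martin-ha/toxic-comment-model#limitations-and-bias)
- [detoxify: Limitations and ethical considerations](https://github.com/unitaryai/detoxify?tab=readme-ov-file#limitations-and-ethical-considerations)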