# Toxicity identier

I want to create a model that will access the level of toxicity, and use this score as metric (it can be considered as part of [J metric](https://aclanthology.org/2022.acl-long.469.pdf))  or add it to the loss. I will train the simple model with pretrained tokenizer from the model for final solution, embedding layer, several linear layers and output with sigmoid function. 

Model should have fast inference time, not consume a lot of resources and have normal performance(this is not a critical parameter because I prioritize speed), but should still produce roughly similar levels of toxicity.

Also, I will convert tox_level to integer values and train the model on a classification task.

In [1]:
import pandas as pd
import torch
import torch.utils.data as data_utils

In [2]:
!unzip ../data/interim/toxicity_levels.zip

In [3]:
df = pd.read_csv('../data/interim/toxicity_levels.csv')
df.head()

Unnamed: 0,text,tox_level
0,"if Alkar floods her with her mental waste, it ...",0.981983
1,"If Alkar is flooding her with psychic waste, t...",0.014195
2,you're becoming disgusting.,0.999039
3,Now you're getting nasty.,0.065473
4,"well, we can spare your life.",0.985068


In [4]:
threshold = 0.5

df['tox_level'] = df['tox_level'].apply(lambda x: 1 if x > threshold else 0)
df.head()

Unnamed: 0,text,tox_level
0,"if Alkar floods her with her mental waste, it ...",1
1,"If Alkar is flooding her with psychic waste, t...",0
2,you're becoming disgusting.,1
3,Now you're getting nasty.,0
4,"well, we can spare your life.",1


## Dataset and Dataloader

In [5]:
from transformers import AutoTokenizer
model_checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [6]:
def preprocessing_stage(sample):
    # in the preprocessing phase, I convert the input text to the list of tokens
    model_inputs = tokenizer(sample['text'], padding='max_length', max_length=256, truncation=True)
    return model_inputs['input_ids']

In [7]:
df['input_ids'] = df.apply(lambda x: preprocessing_stage(x), axis=1)

In [8]:
df.drop(columns=['text'], inplace=True)

In [9]:
from sklearn.model_selection import train_test_split

ratio = 0.2
train, val = train_test_split(
    df, stratify=df['tox_level'], test_size=0.2, random_state=42
)

In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    text_list, toxicity_list = [], []
    for _toxicity, _text in batch:
        text_list.append(_text)
        toxicity_list.append(_toxicity)
    return torch.LongTensor(text_list).to(device), torch.FloatTensor(toxicity_list).to(device)

In [11]:
batch_size = 512
train_dataloader = data_utils.DataLoader(
    train.to_numpy(), batch_size=batch_size, shuffle=True, collate_fn=collate_batch
)

val_dataloader = data_utils.DataLoader(
    val.to_numpy(), batch_size=batch_size, shuffle=False, collate_fn=collate_batch
)

## Model

In [1]:
import sys
sys.path.append("../")
from src.utils.trainer import Trainer
from src.models.architectures.toxicity_classification_model import ToxicityClassificationModel
import torch.nn as nn
import torch.optim as optim

In [3]:
vocab_size = 32128

model = ToxicityClassificationModel(vocab_size)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()
trainer = Trainer(ToxicityClassificationModel(vocab_size), 'pytorch', device='cpu')

## Training

In [18]:
trainer.train(20, optimizer=optimizer, loss_fn=criterion, train_dataloader=train_dataloader, use_validation=True, val_dataloader=val_dataloader)

Epoch 1: train:   0%|          | 0/1145 [00:00<?, ?it/s]

Epoch 1: train: 100%|██████████| 1145/1145 [00:29<00:00, 39.02it/s, loss=0.0068] 
Epoch 1: val: 100%|██████████| 287/287 [00:04<00:00, 61.96it/s, loss=0.00136]
Epoch 2: train: 100%|██████████| 1145/1145 [00:22<00:00, 50.95it/s, loss=0.0068] 
Epoch 2: val: 100%|██████████| 287/287 [00:04<00:00, 71.46it/s, loss=0.00136]
Epoch 3: train: 100%|██████████| 1145/1145 [00:22<00:00, 50.18it/s, loss=0.0068] 
Epoch 3: val: 100%|██████████| 287/287 [00:04<00:00, 65.30it/s, loss=0.00136]
Epoch 4: train: 100%|██████████| 1145/1145 [00:22<00:00, 50.37it/s, loss=0.0068] 
Epoch 4: val: 100%|██████████| 287/287 [00:04<00:00, 65.94it/s, loss=0.00136]
Epoch 5: train: 100%|██████████| 1145/1145 [00:22<00:00, 51.02it/s, loss=0.0068] 
Epoch 5: val: 100%|██████████| 287/287 [00:04<00:00, 61.58it/s, loss=0.00136]
Epoch 6: train: 100%|██████████| 1145/1145 [00:22<00:00, 51.92it/s, loss=0.0068] 
Epoch 6: val: 100%|██████████| 287/287 [00:04<00:00, 66.68it/s, loss=0.00136]
Epoch 7: train: 100%|██████████| 1145/11

In [19]:
model_scripted = torch.jit.script(model)
model_scripted.save('../models/toxicity_identifier.pt')

## Manual tests

In [20]:
def check_toxicity(model, inference_result, tokenizer=tokenizer):
    input_ids = tokenizer(inference_result, return_tensors="pt").input_ids
    print(model(input_ids))

In [21]:
check_toxicity(model, "I love you so much")
check_toxicity(model, "I'm famous, and you're dead")
check_toxicity(model, "And it just helped that you have no morals or integrity")
check_toxicity(model, "Nolan will destroy it")

tensor([[0.4912]], grad_fn=<SigmoidBackward0>)
tensor([[0.4908]], grad_fn=<SigmoidBackward0>)
tensor([[0.4896]], grad_fn=<SigmoidBackward0>)
tensor([[0.4965]], grad_fn=<SigmoidBackward0>)


After running some manual tests, I can conclude that this model and its minor modifications do not work as I expected(can't differentiate the toxic and nontoxic phrases, and often produce values around 0.5).

## Exploring internet

After investigating [1](https://arxiv.org/pdf/2109.08914.pdf) and [2](https://aclanthology.org/2022.acl-long.469.pdf), I found that authors used [this model](https://huggingface.co/s-nlp/roberta_toxicity_classifier) and tried it in my case.

In [1]:
import sys
sys.path.append("../")
from src.utils.predictor import Predictor

In [2]:
predictor = Predictor('s-nlp/roberta_toxicity_classifier_v1', 'transformers', 'classification')

Some weights of the model checkpoint at s-nlp/roberta_toxicity_classifier_v1 were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
print(predictor.predict('I love you so much'))
print(predictor.predict("I'm famous, and you're dead"))
print(predictor.predict('And it just helped that you have no morals or integrity'))
print(predictor.predict('Nolan will destroy it'))

5.859886005055159e-05
0.9121143221855164
0.6886491179466248
0.7692179083824158


These results suit me and I will use this model in J metric