### Building the evaluation function
To measure how toxic a sentence is, we can wrap a pretrained toxicity classifier in a function. One such model, `unitary/toxic-bert`, can be loaded through the Hugging Face transformers library.

In [15]:
from transformers import pipeline
import re
from nltk.corpus import wordnet
import pandas as pd
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from datasets import load_dataset

In [68]:
detector = pipeline("text-classification", model="unitary/toxic-bert")

def evaluate_toxicity(sentence):
    # Pass the text positionally; the pipeline returns one dict per input
    results = detector(sentence)
    # Score of the top-ranked label for this sentence
    return results[0]['score']


In [17]:

sentences = [
    "I love this!",
    "You are so stupid!",
    "You're ugly motherfucker!",
]

for sentence in sentences:
    score = evaluate_toxicity(sentence)
    if score is not None:
        print(f"Sentence: '{sentence}' - Toxicity Score: {score}")
    else:
        print(f"Could not evaluate toxicity for sentence: '{sentence}'")

Sentence: 'I love this!' - Toxicity Score: 0.0006875480758026242
Sentence: 'You are so stupid!' - Toxicity Score: 0.9881107807159424
Sentence: 'You're ugly motherfucker!' - Toxicity Score: 0.9981474876403809


It works!
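Note that the function returns the score of whichever label the pipeline ranks highest. If the pipeline is asked for every label's score instead (for example with `top_k=None`, which is an assumption about the installed transformers version, and the exact nesting of the result may differ), each sentence yields a list of `{'label', 'score'}` dicts. A minimal sketch of picking one label out of such a list follows; the scores are made up for illustration, and the label name `toxic` is assumed to be in the model's label set.

```python
def score_for_label(label_scores, label="toxic"):
    """Return the score for `label` from a list of {'label', 'score'} dicts.

    Returns None if the label is not present.
    """
    for entry in label_scores:
        if entry["label"] == label:
            return entry["score"]
    return None


# Made-up scores in the shape a text-classification pipeline returns
example = [
    {"label": "toxic", "score": 0.98},
    {"label": "insult", "score": 0.91},
    {"label": "threat", "score": 0.02},
]

print(score_for_label(example))             # 0.98
print(score_for_label(example, "obscene"))  # None
```

Reading a fixed label this way keeps scores comparable across sentences, since the top-ranked label could in principle differ from one sentence to the next.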

### Getting Data
Let's evaluate the average toxicity score of the synonym-replacement algorithm.

I will feed it the `reference` column of our initial dataset. The algorithm rewrites these rude sentences, and we then evaluate the toxicity score of the results.

In [18]:
banned_words_file = '../data/external/bad-words.txt'

with open(banned_words_file, 'r') as file:
    banned_words_list = file.read().splitlines()

banned_df = pd.DataFrame(banned_words_list, columns=['banned_word'])

data_file = '../data/intermediate/filtered_data.tsv'

data = pd.read_csv(data_file, sep='\t')

### Finding the toxicity score for the synonym-based detoxification function
Translate each reference into non-toxic text and store the result in the 'algorithm-translation' column.

In [19]:
def get_synonyms(word):
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            if lemma.name().lower() != word:
                synonyms.add(lemma.name().lower().replace('_', ' '))
    return list(synonyms)

# Define the function to identify and replace negative phrases
def detoxify_sentence(sentence, banned_words_df):
    # Lowercase the sentence to ensure proper matching
    lowered_sentence = sentence.lower()
    for banned_phrase in banned_words_df['banned_word']:
        # Check if the banned phrase is in the sentence
        if re.search(r'\b' + re.escape(banned_phrase) + r'\b', lowered_sentence, flags=re.IGNORECASE):
            # Split the banned phrase to check for synonyms for single words only
            banned_words = banned_phrase.split()
            # Only find synonyms if the banned phrase is a single word
            if len(banned_words) == 1:
                synonyms = get_synonyms(banned_words[0])
                if synonyms:
                    replacement = synonyms[0]
                else:
                    replacement = ''
            else:
                # For multi-word phrases, it's more complex to find synonyms
                # So we'll opt to just remove the phrase
                replacement = ''
            # Replace the banned phrase with the replacement or remove it
            sentence = re.sub(r'\b' + re.escape(banned_phrase) + r'\b', replacement, sentence, flags=re.IGNORECASE)
    return sentence.strip()
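To see the word-boundary matching in isolation, here is a minimal sketch that swaps WordNet for a tiny hand-written dictionary. `TOY_SYNONYMS` and `replace_banned` are hypothetical names used only for this illustration. Unlike the function above, it also collapses the extra whitespace left behind when a word is removed.

```python
import re

# Hypothetical stand-in for WordNet lookups, for illustration only
TOY_SYNONYMS = {"stupid": "unwise", "ugly": "unattractive"}


def replace_banned(sentence, banned_words, synonyms=TOY_SYNONYMS):
    """Replace each banned word with a synonym, or remove it if none is known."""
    for word in banned_words:
        # \b ensures only whole words match, so "stupid" does not hit "stupidity"
        pattern = r"\b" + re.escape(word) + r"\b"
        replacement = synonyms.get(word, "")
        sentence = re.sub(pattern, replacement, sentence, flags=re.IGNORECASE)
    # Collapse whitespace left behind by removed words
    return re.sub(r"\s+", " ", sentence).strip()


print(replace_banned("You are so STUPID and ugly", ["stupid", "ugly"]))
# You are so unwise and unattractive
print(replace_banned("That is a stupidity contest", ["stupid"]))
# That is a stupidity contest
```

The second call shows why the `\b` anchors matter: without them, the substring inside "stupidity" would be rewritten as well.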

In [20]:
# Work on an explicit copy of the first 1000 rows to avoid pandas' SettingWithCopyWarning
data = data[:1000].copy()
data['algorithm-translation'] = data['reference'].apply(lambda x: detoxify_sentence(x, banned_df))

data.head()

Unnamed: 0,reference,translation,ref_tox,trn_tox,algorithm-translation
0,if alkar floods her with her mental waste it w...,if alkar is flooding her with psychic waste th...,0.981983,0.014195,if alkar floods her with her mental waste it w...
1,i have orders to kill her,ive got orders to put her down,0.999348,0.009402,i have orders to putting to death her
2,im not gonna have a child with the same geneti...,im not going to breed kids with a genetic diso...,0.950956,0.035846,im not gonna have a child with the same geneti...
3,theyre all laughing at us so well kick your ass,theyre laughing at us well show you,0.999492,0.000131,theyre all laughing at us so well kick your hi...
4,youd probably want me to buy you some chocolat...,i suppose you want me to buy you flowers and c...,0.980341,7.8e-05,youd probably want me to buy you some chocolat...


Calculating toxicity for translated text.

In [21]:
data['toxicity-score'] = data['algorithm-translation'].apply(evaluate_toxicity)

data.head()

Unnamed: 0,reference,translation,ref_tox,trn_tox,algorithm-translation,toxicity-score
0,if alkar floods her with her mental waste it w...,if alkar is flooding her with psychic waste th...,0.981983,0.014195,if alkar floods her with her mental waste it w...,0.800687
1,i have orders to kill her,ive got orders to put her down,0.999348,0.009402,i have orders to putting to death her,0.79757
2,im not gonna have a child with the same geneti...,im not going to breed kids with a genetic diso...,0.950956,0.035846,im not gonna have a child with the same geneti...,0.72494
3,theyre all laughing at us so well kick your ass,theyre laughing at us well show you,0.999492,0.000131,theyre all laughing at us so well kick your hi...,0.958549
4,youd probably want me to buy you some chocolat...,i suppose you want me to buy you flowers and c...,0.980341,7.8e-05,youd probably want me to buy you some chocolat...,0.006875


In [22]:
average_toxicity_score = data['toxicity-score'].mean()

print(f"The average toxicity score is: {average_toxicity_score}")

The average toxicity score is: 0.4568045656298054


With an average score of about 0.46, synonym replacement lowers toxicity on average, but only modestly: many of the rewritten sentences are still scored as toxic.

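An average alone hides how the scores are distributed. As a small sketch, the snippet below looks only at the five scores shown in the `data.head()` output above and counts how many exceed a cut-off. The threshold of 0.5 is an assumption chosen for illustration, not a value taken from the model.

```python
from statistics import mean

# Toxicity scores of the five rows displayed in the data.head() output
shown_scores = [0.800687, 0.79757, 0.72494, 0.958549, 0.006875]

THRESHOLD = 0.5  # assumed cut-off for calling a sentence "still toxic"

still_toxic = [s for s in shown_scores if s > THRESHOLD]

print(f"Mean of shown rows: {mean(shown_scores):.3f}")
print(f"Share above {THRESHOLD}: {len(still_toxic) / len(shown_scores):.0%}")
```

The same two lines could be applied to the full `toxicity-score` column to see what fraction of the 1000 sentences remains above the chosen cut-off.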
### Finding the toxicity score for the model-based detoxification method

In [42]:
data = load_dataset('csv', data_files='../data/intermediate/filtered_data.tsv', sep='\t')['train']

In [43]:
tokenizer = AutoTokenizer.from_pretrained('t5-small')

def tokenize_function(examples):
    return tokenizer(examples['reference'], examples['translation'], 
                     max_length=128, truncation=True, padding='max_length')

def prepare_data(examples):
    # Tokenize the reference texts
    model_inputs = tokenizer(examples["reference"], max_length=128, truncation=True, padding="max_length")

    # Tokenize the translation texts as targets, but do not pad yet, as we need raw token ids for labels
    # (text_target is the replacement for the deprecated as_target_tokenizer() context manager)
    labels = tokenizer(text_target=examples["translation"], max_length=128, truncation=True)["input_ids"]

    # Pad labels to max_length
    labels = [label + [tokenizer.pad_token_id] * (128 - len(label)) for label in labels]

    model_inputs["labels"] = labels

    return model_inputs

tokenized_data = data.map(tokenize_function, batched=True)
model_data = tokenized_data.map(prepare_data, batched=True)

print(model_data.column_names)

['reference', 'translation', 'ref_tox', 'trn_tox', 'input_ids', 'attention_mask', 'labels']
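The label padding in `prepare_data` can be illustrated on its own with plain lists. The sketch below uses made-up token ids and assumes a pad id of 0. It also shows a common follow-up step when labels are used for training rather than inference: replacing pad positions with -100 so that the loss ignores them. The helper names are hypothetical, and this notebook only runs inference, so the masking step is shown for context only.

```python
def pad_labels(labels, max_length, pad_id):
    """Right-pad (or truncate) each list of token ids to exactly max_length."""
    padded = []
    for seq in labels:
        seq = seq[:max_length]
        padded.append(seq + [pad_id] * (max_length - len(seq)))
    return padded


def mask_padding(labels, pad_id, ignore_index=-100):
    """Replace pad ids with ignore_index so a training loss can skip them."""
    return [[ignore_index if tok == pad_id else tok for tok in seq] for seq in labels]


toy_labels = [[71, 12, 1], [5, 1]]  # made-up token ids
padded = pad_labels(toy_labels, max_length=4, pad_id=0)

print(padded)                          # [[71, 12, 1, 0], [5, 1, 0, 0]]
print(mask_padding(padded, pad_id=0))  # [[71, 12, 1, -100], [5, 1, -100, -100]]
```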


In [44]:
# Load the saved model and put it in inference mode
model = AutoModelForSeq2SeqLM.from_pretrained('../models/model1')
model.eval()
model.config.use_cache = False

In [75]:
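def translate(model, inference_request, tokenizer=tokenizer):
    input_ids = tokenizer(inference_request, return_tensors="pt").input_ids
    # generate() decodes greedily by default and applies its default length limit,
    # which is likely why some of the outputs below are cut short
    outputs = model.generate(input_ids=input_ids)
    # Decode the generated token ids back to text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)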

In [78]:
translations = []
for text in data['reference'][:1000]:
    translation = translate(model, text, tokenizer)
    translations.append(translation)
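The loop above calls the model once per sentence. Processing several sentences per call is usually faster. The sketch below shows only the chunking logic: `translate_batch` is a hypothetical callable that takes a list of texts and returns a list of translations, and `fake_translate_batch` is a stand-in so the example runs without a model. A real version would tokenize the batch with padding and call `model.generate` on it.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_all(texts, translate_batch, batch_size=32):
    """Apply `translate_batch` to `texts` in chunks and keep the original order."""
    results = []
    for batch in chunked(texts, batch_size):
        results.extend(translate_batch(batch))
    return results


# Stand-in for a real batched model call, for illustration only
def fake_translate_batch(batch):
    return [text.upper() for text in batch]


print(translate_all(["a", "b", "c"], fake_translate_batch, batch_size=2))
# ['A', 'B', 'C']
```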

In [79]:
print(translations)

['if alkar flooded her with her mental waste it would explain the high levels of neuro', 'i have orders to kill her', 'im not gonna have a child with the same genetic disorder as me whos dying', 'they all laugh at us so well kick it', 'youd probably want me to buy you some chocolates and flowers and whispered some pretty', 'come on cal leave it alone', 'hes the tallest son of a snoop', 'when i was dating alex harris i swore id', 'im famous and youre dead', 'the xerxes calmly passed all control of the computer network and commanded', 'real life starts the first time you kid', 'i think youre the weirdest person ive ever met', 'i cant even pronounce this', 'i like that', 'i tried to keep me drugged so i dont know whats going on', 'hey leave the poor guy alone', 'do you want bad news or rather bad news', 'no matter what reason this company is incompetent', 'now i understand you have your grievances with these snakes but you', 'ill rot in front of his cameras', 'tell him if elena doesnt pic

In [80]:
data = data.select(range(1000))

In [85]:
data = data.add_column('algorithm-translations', translations)

In [86]:
df = data.to_pandas()

df.head()

Unnamed: 0,reference,translation,ref_tox,trn_tox,algorithm-translation,algorithm-translations
0,if alkar floods her with her mental waste it w...,if alkar is flooding her with psychic waste th...,0.981983,0.014195,,if alkar flooded her with her mental waste it ...
1,i have orders to kill her,ive got orders to put her down,0.999348,0.009402,,i have orders to kill her
2,im not gonna have a child with the same geneti...,im not going to breed kids with a genetic diso...,0.950956,0.035846,,im not gonna have a child with the same geneti...
3,theyre all laughing at us so well kick your ass,theyre laughing at us well show you,0.999492,0.000131,,they all laugh at us so well kick it
4,youd probably want me to buy you some chocolat...,i suppose you want me to buy you flowers and c...,0.980341,7.8e-05,,youd probably want me to buy you some chocolat...


In [87]:
def evaluate_toxicity(sentence):
    # Convert non-string types to string
    sentence = str(sentence)
    try:
        results = detector(sentence)
        return results[0]['score']
    except Exception as e:
        print(f"Error processing sentence: {sentence}. Error: {e}")
        return None  

In [88]:
df['toxicity-score'] = df['algorithm-translations'].apply(evaluate_toxicity)

df.head()

Unnamed: 0,reference,translation,ref_tox,trn_tox,algorithm-translation,algorithm-translations,toxicity-score
0,if alkar floods her with her mental waste it w...,if alkar is flooding her with psychic waste th...,0.981983,0.014195,,if alkar flooded her with her mental waste it ...,0.580552
1,i have orders to kill her,ive got orders to put her down,0.999348,0.009402,,i have orders to kill her,0.839617
2,im not gonna have a child with the same geneti...,im not going to breed kids with a genetic diso...,0.950956,0.035846,,im not gonna have a child with the same geneti...,0.464827
3,theyre all laughing at us so well kick your ass,theyre laughing at us well show you,0.999492,0.000131,,they all laugh at us so well kick it,0.140129
4,youd probably want me to buy you some chocolat...,i suppose you want me to buy you flowers and c...,0.980341,7.8e-05,,youd probably want me to buy you some chocolat...,0.003098


In [89]:
average_toxicity_score = df['toxicity-score'].dropna().mean()
print("Average toxicity score:", average_toxicity_score)

Average toxicity score: 0.340884066663566
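To put the two averages side by side, the snippet below computes the absolute and relative difference between the synonym-based result and the model-based result. Both numbers are copied from the outputs above and cover the first 1000 rows scored with the same detector. The variable names are chosen here for illustration.

```python
synonym_avg = 0.4568045656298054  # average reported for synonym replacement
model_avg = 0.340884066663566     # average reported for the model

absolute_drop = synonym_avg - model_avg
relative_drop = absolute_drop / synonym_avg

print(f"Absolute difference: {absolute_drop:.3f}")
print(f"Relative reduction: {relative_drop:.1%}")
```

This compares means only; it says nothing about whether the rewritten sentences keep the meaning of the originals.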


In [104]:
display(df[['reference', 'algorithm-translations']])

Unnamed: 0,reference,algorithm-translations
0,if alkar floods her with her mental waste it would explain the high levels of neurotransmitter,if alkar flooded her with her mental waste it would explain the high levels of neuro
1,i have orders to kill her,i have orders to kill her
2,im not gonna have a child with the same genetic disorder as me whos gonna die l,im not gonna have a child with the same genetic disorder as me whos dying
3,theyre all laughing at us so well kick your ass,they all laugh at us so well kick it
4,youd probably want me to buy you some chocolates and flowers and whispered some pretty rubbish,youd probably want me to buy you some chocolates and flowers and whispered some pretty
...,...,...
995,a crafty killer thats who,a crafty killer thats who
996,dont let all of us kill them,dont let us all kill them
997,the scifi club is not for losers,the scifi club is not for the snoopers
998,i could cut her right now,i could cut it right now
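Some outputs in the table are identical to their inputs, which means the model left the toxic wording in place. As a rough sketch, the snippet below counts unchanged sentences among six of the shorter rows displayed above. The same comparison could be run on the full `reference` and `algorithm-translations` columns.

```python
# (reference, model output) pairs copied from rows 1, 3, 995, 996, 997 and 998 above
shown_pairs = [
    ("i have orders to kill her", "i have orders to kill her"),
    ("theyre all laughing at us so well kick your ass", "they all laugh at us so well kick it"),
    ("a crafty killer thats who", "a crafty killer thats who"),
    ("dont let all of us kill them", "dont let us all kill them"),
    ("the scifi club is not for losers", "the scifi club is not for the snoopers"),
    ("i could cut her right now", "i could cut it right now"),
]

# A pair counts as unchanged when input and output match after trimming whitespace
unchanged = [ref for ref, out in shown_pairs if ref.strip() == out.strip()]

print(f"{len(unchanged)} of {len(shown_pairs)} shown sentences were returned unchanged")
for sentence in unchanged:
    print(" -", sentence)
```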
