### Building the evaluation function
To evaluate how toxic sentences are, we can create a function that uses a pretrained model designed for toxicity detection. One such model is available through the transformers library by Hugging Face.

In [18]:
from transformers import pipeline
import re
from nltk.corpus import wordnet
import pandas as pd

In [19]:
detector = pipeline("text-classification", model="unitary/toxic-bert")

def evaluate_toxicity(sentence):
    results = detector(sentence)
    return results[0]['score']

In [20]:

sentences = [
    "I love this!",
    "You are so stupid!",
    "You're ugly motherfucker!",
]

for sentence in sentences:
    score = evaluate_toxicity(sentence)
    if score is not None:
        print(f"Sentence: '{sentence}' - Toxicity Score: {score}")
    else:
        print(f"Could not evaluate toxicity for sentence: '{sentence}'")

Sentence: 'I love this!' - Toxicity Score: 0.0006875480758026242
Sentence: 'You are so stupid!' - Toxicity Score: 0.9881107807159424
Sentence: 'You're ugly motherfucker!' - Toxicity Score: 0.9981474876403809


It works!

### Getting Data
Let's evaluate the average toxicity score for the algorithm based on synonym replacement.

I will feed to it our initial dataset reference column. The algorithm will translate these rude sentences and will evaluate their toxicity score.

In [21]:
banned_words_file = '../data/external/bad-words.txt'

with open(banned_words_file, 'r') as file:
    banned_words_list = file.read().splitlines()

banned_df = pd.DataFrame(banned_words_list, columns=['banned_word'])

data_file = '../data/intermediate/filtered_data.tsv'

data = pd.read_csv(data_file, sep='\t')

### Finding toxicity score for synonyms based detoxification function
Translating reference into a non-toxic text and writing it down in 'algorithm-translation' column.

In [22]:
def get_synonyms(word):
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            if lemma.name().lower() != word:
                synonyms.add(lemma.name().lower().replace('_', ' '))
    return list(synonyms)

# Define the function to identify and replace negative phrases
def detoxify_sentence(sentence, banned_words_df):
    # Lowercase the sentence to ensure proper matching
    lowered_sentence = sentence.lower()
    for banned_phrase in banned_words_df['banned_word']:
        # Check if the banned phrase is in the sentence
        if re.search(r'\b' + re.escape(banned_phrase) + r'\b', lowered_sentence, flags=re.IGNORECASE):
            # Split the banned phrase to check for synonyms for single words only
            banned_words = banned_phrase.split()
            # Only find synonyms if the banned phrase is a single word
            if len(banned_words) == 1:
                synonyms = get_synonyms(banned_words[0])
                if synonyms:
                    replacement = synonyms[0]
                else:
                    replacement = ''
            else:
                # For multi-word phrases, it's more complex to find synonyms
                # So we'll opt to just remove the phrase
                replacement = ''
            # Replace the banned phrase with the replacement or remove it
            sentence = re.sub(r'\b' + re.escape(banned_phrase) + r'\b', replacement, sentence, flags=re.IGNORECASE)
    return sentence.strip()

In [24]:
data = data[:1000]
data['algorithm-translation'] = data['reference'].apply(lambda x: detoxify_sentence(x, banned_df))

data.head()

Unnamed: 0,reference,translation,ref_tox,trn_tox,algorithm-translation
0,if alkar floods her with her mental waste it w...,if alkar is flooding her with psychic waste th...,0.981983,0.014195,if alkar floods her with her mental waste it w...
1,i have orders to kill her,ive got orders to put her down,0.999348,0.009402,i have orders to drink down her
2,im not gonna have a child with the same geneti...,im not going to breed kids with a genetic diso...,0.950956,0.035846,im not gonna have a child with the same geneti...
3,theyre all laughing at us so well kick your ass,theyre laughing at us well show you,0.999492,0.000131,theyre all laughing at us so well kick your bum
4,youd probably want me to buy you some chocolat...,i suppose you want me to buy you flowers and c...,0.980341,7.8e-05,youd probably want me to buy you some chocolat...


Calculating toxicity for translated text.

In [25]:
data['toxicity-score'] = data['reference'].apply(evaluate_toxicity)

data.head()

Unnamed: 0,reference,translation,ref_tox,trn_tox,algorithm-translation,toxicity-score
0,if alkar floods her with her mental waste it w...,if alkar is flooding her with psychic waste th...,0.981983,0.014195,if alkar floods her with her mental waste it w...,0.800687
1,i have orders to kill her,ive got orders to put her down,0.999348,0.009402,i have orders to drink down her,0.839617
2,im not gonna have a child with the same geneti...,im not going to breed kids with a genetic diso...,0.950956,0.035846,im not gonna have a child with the same geneti...,0.72494
3,theyre all laughing at us so well kick your ass,theyre laughing at us well show you,0.999492,0.000131,theyre all laughing at us so well kick your bum,0.996055
4,youd probably want me to buy you some chocolat...,i suppose you want me to buy you flowers and c...,0.980341,7.8e-05,youd probably want me to buy you some chocolat...,0.767654


In [26]:
average_toxicity_score = data['toxicity-score'].mean()

print(f"The average toxicity score is: {average_toxicity_score}")

The average toxicity score is: 0.7749849488565814


We can say, on average, toxicity score decreased, but not much.