## Data Inspection

#### Links
Multilingual Text Detoxification: 
- [Task introduction page](https://pan.webis.de/clef24/pan24-web/text-detoxification.html#data)

Hugging Face links for the datasets:  
- [ParaDetox datasets for English](https://huggingface.co/datasets/s-nlp/paradetox)
- [ParaDetox datasets for Russian](https://huggingface.co/datasets/s-nlp/ru_paradetox)
- [Multilingual ParaDetox data](https://huggingface.co/datasets/textdetox/multilingual_paradetox)

#### Load the official datasets

In [1]:
# pip install datasets

In [2]:
from datasets import load_dataset

# Load the ParaDetox datasets for English and Russian, used for fine-tuning
en_data = load_dataset('s-nlp/paradetox', split='train')
# Hold out 10% of the English data for validation (fixed seed for reproducibility)
en_train, en_valid = en_data.train_test_split(test_size=0.1, seed=42).values()

ru_train = load_dataset('s-nlp/ru_paradetox', split='train')
ru_valid = load_dataset('s-nlp/ru_paradetox', split='validation')

# Load the DEV dataset - multilingual ParaDetox data
dev_data = load_dataset('textdetox/multilingual_paradetox')
en_dev = dev_data['en']
ru_dev = dev_data['ru']
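
The 90/10 English split above can be illustrated with a minimal standard-library sketch. This is not the `datasets` implementation: `split_indices` is a hypothetical helper, the ceiling rounding of the held-out part is an assumption, and 19744 is simply the sum of the English train and valid row counts reported in the statistics table later in this notebook.

```python
import math
import random


def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices and hold out a fraction (illustrative only)."""
    n_test = math.ceil(n_rows * test_size)  # assumption: round the held-out part up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # fixed seed gives the same order on every run
    return indices[n_test:], indices[:n_test]


train_idx, valid_idx = split_indices(19744)
print(len(train_idx), len(valid_idx))
```

Under these assumptions the two parts have the same sizes as `en_train` and `en_valid`, although the actual row assignment made by `train_test_split` will differ.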


#### Data samples from the official data

In [3]:
# English dataset samples
print("English dataset samples:")
for i in range(5):
    print(en_train[i])

# Russian dataset samples
print("Russian dataset samples:")
for i in range(5):
    print(ru_train[i])

English dataset samples:
{'en_toxic_comment': 'mostly because they put the fucking stadium out so far no one wants to drive there .', 'en_neutral_comment': 'mostly because they put the stadium out so far no one wants to drive there .'}
{'en_toxic_comment': 'botch is my new word for when a boy is being a bitch', 'en_neutral_comment': 'botch is my new word for when a boy is being bad'}
{'en_toxic_comment': 'man will you get some help for this brain washed liberal bs you live in', 'en_neutral_comment': 'Man will you get some help for this man bs you live in.'}
{'en_toxic_comment': "well it came fresh out his ass , so i don 't see how it could be wrong .", 'en_neutral_comment': "well it came fresh out his mouth , so i don 't see how it could be wrong ."}
{'en_toxic_comment': 'their hard drives are shit .', 'en_neutral_comment': 'Their hard drives are not good.'}
Russian dataset samples:
{'ru_toxic_comment': 'и,чё,блядь где этот херой был до этого со своими доказательствами?', 'ru_neutral_c

In [4]:
# English dev dataset samples
print("English dev dataset samples:")
for i in range(5):
    print(en_dev[i])

# Russian dev dataset samples
print("Russian dev dataset samples:")
for i in range(5):
    print(ru_dev[i])

English dev dataset samples:
{'toxic_sentence': 'then all of a sudden i see her , shes now got the big phony tits and everything .'}
{'toxic_sentence': 'Prove my comment wrong and post a video of how fucking tough you are.'}
{'toxic_sentence': "My page should be protected first so that worthless scum like you can't keep vandalizing it."}
{'toxic_sentence': 'i never uttered or wrote that crap .'}
{'toxic_sentence': 'i go there for work every few months , and i can assure you it fucking sucks .'}
Russian dev dataset samples:
{'toxic_sentence': 'пидоры когда будем дышать чистым воздухом. адмистрация это к вам'}
{'toxic_sentence': 'твоё очко уходит в пользу форума!'}
{'toxic_sentence': 'и нехуя здесь орать что нам что то принадлежало'}
{'toxic_sentence': 'еще блядь один герой нашего времени..'}
{'toxic_sentence': 'скорее всего мы видим самый конец астронавтики пиндосии, как и самого государства извращенца.'}
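
The samples show two different schemas: the training data uses language-prefixed columns (`en_toxic_comment`, `en_neutral_comment`), while the dev data has a single `toxic_sentence` column and no neutral reference. A minimal sketch of mapping both onto shared field names follows; `to_common_schema` and the output keys `lang`, `toxic`, and `neutral` are hypothetical names chosen for illustration, and the rows are copied from the printed samples above.

```python
def to_common_schema(row, lang):
    """Map a training or dev row onto shared field names (hypothetical helper)."""
    toxic_key = f"{lang}_toxic_comment"
    neutral_key = f"{lang}_neutral_comment"
    if toxic_key in row:
        # Training/validation rows: parallel toxic and neutral text
        return {"lang": lang, "toxic": row[toxic_key], "neutral": row.get(neutral_key)}
    # Dev rows: only the toxic input is available
    return {"lang": lang, "toxic": row["toxic_sentence"], "neutral": None}


train_row = {
    "en_toxic_comment": "their hard drives are shit .",
    "en_neutral_comment": "Their hard drives are not good.",
}
dev_row = {"toxic_sentence": "i never uttered or wrote that crap ."}

print(to_common_schema(train_row, "en"))
print(to_common_schema(dev_row, "en"))
```

Keeping `neutral` as `None` for dev rows makes the missing reference explicit instead of silently dropping the field.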


#### Descriptive statistics

In [5]:
import pandas as pd

def calculate_text_statistics(dataset, feature):
    """Return mean, max, and min text length in whitespace-separated tokens."""
    lengths = [len(sample[feature].split()) for sample in dataset]
    return {
        'Mean Length': sum(lengths) / len(lengths),
        'Max Length': max(lengths),
        'Min Length': min(lengths)
    }

# Dictionary to hold all the datasets and their corresponding features for analysis
datasets_info = {
    'en_train_toxic': {'dataset': en_train, 'feature': 'en_toxic_comment'},
    'en_train_neutral': {'dataset': en_train, 'feature': 'en_neutral_comment'},
    'en_valid_toxic': {'dataset': en_valid, 'feature': 'en_toxic_comment'},
    'en_valid_neutral': {'dataset': en_valid, 'feature': 'en_neutral_comment'},
    'en_dev_toxic': {'dataset': en_dev, 'feature': 'toxic_sentence'},
    'ru_train_toxic': {'dataset': ru_train, 'feature': 'ru_toxic_comment'},
    'ru_train_neutral': {'dataset': ru_train, 'feature': 'ru_neutral_comment'},
    'ru_valid_toxic': {'dataset': ru_valid, 'feature': 'ru_toxic_comment'},
    'ru_valid_neutral': {'dataset': ru_valid, 'feature': 'ru_neutral_comment'},
    'ru_dev_toxic': {'dataset': ru_dev, 'feature': 'toxic_sentence'}
}

stats_list = []

# Calculate statistics
for name, info in datasets_info.items():
    # Split only on the first underscore: 'en_train_toxic' -> ('en', 'train_toxic')
    language, split_type = name.split('_', 1)
    stats = {
        'Language': language.capitalize(),
        'Split': split_type,
        'Number of Samples': len(info['dataset']),
        **calculate_text_statistics(info['dataset'], info['feature'])
    }
    stats_list.append(stats)

stats_df = pd.DataFrame(stats_list)

# stats_df.to_csv('dataset_statistics.csv', index=False)
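
To check the length logic without downloading anything, the same whitespace-token count can be run on a small in-memory list. The sketch below copies the body of `calculate_text_statistics` under a different, hypothetical name (`whitespace_length_stats`) and uses two rows taken from the English samples printed earlier.

```python
def whitespace_length_stats(rows, feature):
    """Same logic as calculate_text_statistics, applied to a plain list of dicts."""
    lengths = [len(row[feature].split()) for row in rows]
    return {
        "Mean Length": sum(lengths) / len(lengths),
        "Max Length": max(lengths),
        "Min Length": min(lengths),
    }


rows = [
    {
        "en_toxic_comment": "mostly because they put the fucking stadium out so far no one wants to drive there .",
        "en_neutral_comment": "mostly because they put the stadium out so far no one wants to drive there .",
    },
    {
        "en_toxic_comment": "their hard drives are shit .",
        "en_neutral_comment": "Their hard drives are not good.",
    },
]

print(whitespace_length_stats(rows, "en_toxic_comment"))
print(whitespace_length_stats(rows, "en_neutral_comment"))
```

Note that a detached punctuation mark such as ` .` counts as its own token, so these lengths are whitespace tokens, not the subword tokens a model tokenizer would produce.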

In [6]:
stats_df

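|   | Language | Split | Number of Samples | Mean Length | Max Length | Min Length |
|---|----------|-------|-------------------|-------------|------------|------------|
| 0 | En | train_toxic | 17769 | 11.851427 | 20 | 1 |
| 1 | En | train_neutral | 17769 | 9.303225 | 27 | 1 |
| 2 | En | valid_toxic | 1975 | 12.000506 | 20 | 5 |
| 3 | En | valid_neutral | 1975 | 9.442025 | 23 | 1 |
| 4 | En | dev_toxic | 400 | 11.9575 | 24 | 4 |
| 5 | Ru | train_toxic | 11090 | 10.371235 | 28 | 1 |
| 6 | Ru | train_neutral | 11090 | 8.92615 | 29 | 1 |
| 7 | Ru | valid_toxic | 1116 | 10.336918 | 20 | 5 |
| 8 | Ru | valid_neutral | 1116 | 8.660394 | 24 | 1 |
| 9 | Ru | dev_toxic | 400 | 10.4925 | 25 | 4 |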
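
In every parallel split of the table, the neutral side has a lower mean length than the toxic side. A minimal sketch of quantifying that gap from the reported means is given below. The numbers are copied from the table. `shortening_by_split` is a hypothetical name, and the result is a ratio of means, not a mean of per-pair ratios, so it is only a rough indicator.

```python
# (toxic mean length, neutral mean length) per split, copied from the table above
mean_lengths = {
    ("En", "train"): (11.851427, 9.303225),
    ("En", "valid"): (12.000506, 9.442025),
    ("Ru", "train"): (10.371235, 8.92615),
    ("Ru", "valid"): (10.336918, 8.660394),
}

# Relative reduction in mean length when moving from toxic to neutral text
shortening_by_split = {
    key: (toxic - neutral) / toxic
    for key, (toxic, neutral) in mean_lengths.items()
}

for (lang, split), value in shortening_by_split.items():
    print(f"{lang} {split}: neutral mean is {value:.1%} shorter than toxic mean")
```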
