# 1. Information about the submission

## 1.1 Name and number of the assignment 

**RUSSE 2022 Russian Text Detoxification Based on Parallel Corpora**

## 1.2 Student name

**Nuzhnov Mark**

## 1.3 Codalab user ID

**Nuzhnov_Mark**

## 1.4 Additional comments

***Enter here** any additional comments which you would like to communicate to a TA who is going to grade this work not related to the content of your submission.*

# 2. Technical Report

The T5 model was presented in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
<br>


![T5](https://bs-uploads.toptal.io/blackfish-uploads/uploaded_file/file/209941/image-1584537267103-77e51b4ce416bb44fabee01f2e7a18dd.png)

The T5 (Text-to-Text Transfer Transformer) model is a transformer-based neural network architecture developed by Google Research. It is designed to perform various natural language processing tasks by converting input text into output text, hence the name "text-to-text transfer." The model is trained on a large corpus of text data and can perform tasks such as text summarization, question answering, machine translation, and more.

The T5 model consists of several components, including an encoder, a decoder, and an attention mechanism. The encoder takes in the input text and converts it into a sequence of vectors, which are then passed through multiple layers of self-attention and feed-forward neural networks. This process allows the model to capture the contextual relationships between words in the input text.

The decoder then takes these encoded vectors and generates the output text by predicting the next word in the sequence based on the previous words and the encoded input. The decoder also uses self-attention and feed-forward neural networks to generate the output text.

The attention mechanism in the T5 model is used to weigh the importance of different parts of the input text when generating the output text. This helps the model focus on the most relevant information and improve its accuracy.

We define the detoxification task as the task of style transfer: from the toxic style to the nontoxic style. We want to rewrite the sentence and preserve the context.
We define the task of style transfer as follows. Let us consider two corpora $D^{X} = {x_{1}, x_{2}, ..., x_{n} }$ and $D^{Y} = {y_{1}, y_{2}, ..., y_{n} }$ in two styles $–s^{X}$ (toxic) and $–s^{Y}$ (non-toxic), correspondingly.
The task is to create a model $ f_{\theta} : X \to Y$, where $X$ and $Y$ are all possible texts in styles $s^{X}$ and $s^{Y}$ . The task of selecting the optimal set of parameters $\theta$ for $f$ consists maximising the probability $ p(y^{'}|x, s^{Y})$ of transferring a sentence $x$ in style $s^{X}$ to the sentence $y^{'}$ which saves the content of $x$ and has the style $s^{Y}$. The parameters are maximized on the corpora $D^{X}$ and $D^{Y}$ which can be parallel or non-parallel. We focus on the transfer $s^{X} \to s^{Y}  $, where $s^{X}$ is the toxic style, and $s^{Y}$ is neutral.

## 2.1 Methodology 

## 2.2 Discussion of results

I tried several models. Results can be seen bellow.
Method |  Style transfer accuracy  | Meaning preservation | Fluency | Joint score | ChrF1
--- | --- | --- | --- | --- | ---| 
Baseline | 0.56 | 0.89 | 0.85 | 0.41 | 0.53 
T5-base_20epochs | 0.81 | 0.78 | 0.81 | 0.52 | 0.54
T5-large_20epochs | 0.80 | 0.79 | 0.82 | 0.53 | 0.55 
T5-large_35epochs | 0.81 | 0.79 | 0.82 | 0.53 | 0.54 

# 3. Code

*Enter here all code used to produce your results submitted to Codalab. Add some comments and subsections to navigate though your solution.*

*In this part you are expected to develop yourself a solution of the task and provide a reproducible code:*
- *Using Python 3;*
- *Contains code for installation of all dependencies;*
- *Contains code for downloading of all the datasets used*;
- *Contains the code for reproducing your results (in other words, if a tester downloads your notebook she should be able to run cell-by-cell the code and obtain your experimental results as described in the methodology section)*.


*As a result, you code will be graded according to these criteria:*
- ***Readability**: your code should be well-structured preferably with indicated parts of your approach (Preprocessing, Model training, Evaluation, etc.).*
- ***Reproducibility**: your code should be reproduced without any mistakes with “Run all” mode (obtaining experimental part).*


## Start

In [1]:
!wget https://raw.githubusercontent.com/s-nlp/russe_detox_2022/main/data/input/train.tsv -q
!wget https://raw.githubusercontent.com/s-nlp/russe_detox_2022/main/data/input/test.tsv -q
!wget https://raw.githubusercontent.com/s-nlp/russe_detox_2022/main/data/input/dev.tsv -q

In [2]:
!pip install -q transformers

In [3]:
!pip install sentencepiece



In [1]:
import os
import gc
import pandas as pd
from sklearn.utils import shuffle
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import Trainer, TrainingArguments
from transformers.file_utils import cached_property
from typing import Tuple, List, Dict, Union
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm, trange



device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


In [2]:
def clean_memory():
    gc.collect()
    torch.cuda.empty_cache()

In [3]:
df = pd.read_csv('train.tsv', sep='\t', index_col = 0)
df = df.fillna('')

In [4]:
df.head()

Unnamed: 0_level_0,toxic_comment,neutral_comment1,neutral_comment2,neutral_comment3
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,"и,чё,блядь где этот херой был до этого со свои...","Ну и где этот герой был,со своими доказательст...",Где этот герой был до этого со своими доказате...,"и,где этот герой был до этого со своими доказа..."
1,"О, а есть деанон этого петуха?","О, а есть деанон",,
2,"херну всякую пишут,из-за этого лайка.долбоебизм.","Чушь всякую пишут, из- за этого лайка.","Ерунду всякую пишут,из-за этого лайка.",
3,из за таких пидоров мы и страдаем,из за таких плохих людей мы и страдаем,Из-за таких людей мы и страдаем,из за таких как он мы и страдаем
4,гондон путинский он а не артист,"Человек Путина он, а не артист",,


In [5]:
df_train_toxic = []
df_train_neutral = []

for index, row in df.iterrows():
    references = row[['neutral_comment1', 'neutral_comment2', 'neutral_comment3']].tolist()
    
    for reference in references:
        if len(reference) > 0:
            df_train_toxic.append(row['toxic_comment'])
            df_train_neutral.append(reference)
        else:
            break

In [6]:
df = pd.DataFrame({
    'toxic_comment': df_train_toxic,
    'neutral_comment': df_train_neutral
})

df = shuffle(df)

In [7]:
df.head()

Unnamed: 0,toxic_comment,neutral_comment
7958,ну пизда мне! потряю сознание из-за голода и е...,ну конец мне! потряю сознание из-за голода и п...
699,"гандон ебанный,ненасытный чтобы таких бог нака...",Надеюсь его накажет бог
6036,Руслан-сучка!!!! всё из-за него...( ПОДАЮ В СУ...,"Все из-за Руслана, подаю в суд"
7359,"Твою мать! За что блять? Соплей и кашля мало,т...","что это?Соплей и кашля мало,теперь глаза начал..."
10657,ДОБАВИЛА В ВИДЕОЗАПИСИ МИТТИ. НО СИЛ СМОТРЕТЬ ...,"Добавила в видеозаписи Митти, но сил смотреть ..."


In [8]:
class PairsDataset(torch.utils.data.Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __getitem__(self, idx):
        assert idx < len(self.x['input_ids'])
        item = {key: val[idx] for key, val in self.x.items()}
        item['decoder_attention_mask'] = self.y['attention_mask'][idx]
        item['labels'] = self.y['input_ids'][idx]
        return item
    
    @property
    def n(self):
        return len(self.x['input_ids'])

    def __len__(self):
        return self.n # * 2

In [9]:
class DataCollatorWithPadding:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        batch = self.tokenizer.pad(
            features,
            padding=True,
        )
        ybatch = self.tokenizer.pad(
            {'input_ids': batch['labels'], 'attention_mask': batch['decoder_attention_mask']},
            padding=True,
        ) 
        batch['labels'] = ybatch['input_ids']
        batch['decoder_attention_mask'] = ybatch['attention_mask']
        
        return {k: torch.tensor(v) for k, v in batch.items()}

In [10]:
def evaluate_model(model, test_dataloader):
    num = 0
    den = 0

    for batch in test_dataloader:
        with torch.no_grad():
            loss = model(**{k: v.to(model.device) for k, v in batch.items()}).loss
            num += len(batch) * loss.item()
            den += len(batch)
    val_loss = num / den
    return val_loss

In [11]:
def train_loop(
    model, train_dataloader, val_dataloader, 
    max_epochs=30, 
    max_steps=1_000, 
    lr=3e-5,
    gradient_accumulation_steps=1, 
    cleanup_step=100,
    window=100,
):
    clean_memory()
    optimizer = torch.optim.Adam(params = [p for p in model.parameters() if p.requires_grad], lr=lr)

    ewm_loss = 0
    step = 0
    model.train()

    for epoch in trange(max_epochs):
        print(step, max_steps)
        if step >= max_steps:
            break
        tq = tqdm(train_dataloader)
        for i, batch in enumerate(tq):
            try:
                batch['labels'][batch['labels']==0] = -100
                loss = model(**{k: v.to(model.device) for k, v in batch.items()}).loss
                loss.backward()
            except Exception as e:
                print('error on step', i, e)
                loss = None
                clean_memory()
                continue
            if i and i % gradient_accumulation_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
                step += 1
                if step >= max_steps:
                    break

            if i % cleanup_step == 0:
                clean_memory()

            w = 1 / min(i+1, window)
            ewm_loss = ewm_loss * (1-w) + loss.item() * w
            tq.set_description(f'loss: {ewm_loss:4.4f}')

            if (i == len(train_dataloader)-1)  and val_dataloader is not None:
                model.eval()
                eval_loss = evaluate_model(model, val_dataloader)
                model.train()
                print(f'epoch {epoch}: train loss: {ewm_loss:4.4f}  val loss: {eval_loss:4.4f}')
                
            if step % 620 == 0:
                model.save_pretrained(f't5_base_{dname}_{steps}')
        
    clean_memory()

In [12]:
def train_model(x, y, model, test_size=0.1, batch_size=32, **kwargs):
    x1, x2, y1, y2 = train_test_split(x, y, test_size=test_size, random_state=42)
    train_dataset = PairsDataset(tokenizer(x1), tokenizer(y1))
    test_dataset = PairsDataset(tokenizer(x2), tokenizer(y2))
    
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, drop_last=False, shuffle=True, collate_fn=data_collator)
    val_dataloader = DataLoader(test_dataset, batch_size=batch_size, drop_last=False, shuffle=True, collate_fn=data_collator)

    train_loop(model, train_dataloader, val_dataloader, **kwargs)
    return model

In [13]:
!ls

dev.tsv  eval.zip   t5_base_10000_dev.txt   t5_base_train_10000  test.tsv
dev.zip  hw3.ipynb  t5_base_10000_eval.txt  t5_base_train_20000  train.tsv


In [15]:
model_name = 't5_large_train_20000'
model = T5ForConditionalGeneration.from_pretrained(model_name).cuda()
model.gradient_checkpointing_enable()
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained('ai-forever/ruT5-large', use_fast = False)

In [16]:
datasets = {
    'train': df
}

In [None]:
steps = 20000
for dname, d in datasets.items():
    print(f'\n\n\n  {dname}  {steps} \n=====================\n\n')
    model = train_model(d['toxic_comment'].tolist(), d['neutral_comment'].tolist(), model=model, batch_size=80, max_epochs=15, max_steps=steps)
    model.save_pretrained(f't5_large_{dname}_{steps}')




  train  20000 




  0%|          | 0/15 [00:00<?, ?it/s]

0 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 0: train loss: 0.5418  val loss: 8.7798
124 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 1: train loss: 0.4947  val loss: 9.2712
248 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 2: train loss: 0.4613  val loss: 8.9896
372 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 3: train loss: 0.4306  val loss: 8.9784
496 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 4: train loss: 0.4035  val loss: 9.0247
620 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 5: train loss: 0.3783  val loss: 9.3573
744 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 6: train loss: 0.3584  val loss: 9.2951
868 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 7: train loss: 0.3367  val loss: 9.2084
992 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 8: train loss: 0.3169  val loss: 9.4530
1116 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 9: train loss: 0.3007  val loss: 9.3156
1240 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 10: train loss: 0.2856  val loss: 9.3936
1364 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 11: train loss: 0.2734  val loss: 9.6067
1488 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 12: train loss: 0.2606  val loss: 9.4635
1612 20000


  0%|          | 0/125 [00:00<?, ?it/s]

epoch 13: train loss: 0.2609  val loss: 9.8402
1736 20000


  0%|          | 0/125 [00:00<?, ?it/s]

In [None]:
df_dev = pd.read_csv('dev.tsv', sep='\t')
toxic_inputs = df_dev['toxic_comment'].tolist()

In [None]:
def paraphrase(text, model, n=None, max_length='auto', temperature=0.0, beams=3):
    texts = [text] if isinstance(text, str) else text
    inputs = tokenizer(texts, return_tensors='pt', padding=True)['input_ids'].to(model.device)
    if max_length == 'auto':
        max_length = int(inputs.shape[1] * 1.2) + 10
    result = model.generate(
        inputs, 
        num_return_sequences=n or 1, 
        do_sample=False, 
        temperature=temperature, 
        repetition_penalty=3.0, 
        max_length=max_length,
        bad_words_ids=[[2]],  # unk
        num_beams=beams,
    )
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in result]
    if not n and isinstance(text, str):
        return texts[0]
    return texts

In [None]:
print(paraphrase(['ниюхай пипку'], model, temperature=50.0, beams=10))

In [None]:
print(paraphrase(['ёбаный рот'], model, temperature=50.0, beams=10))

In [None]:
para_results = []
problematic_batch = [] #if something goes wrong you can track such bathces
batch_size = 8

for i in tqdm(range(0, len(toxic_inputs), batch_size)):
    batch = [sentence for sentence in toxic_inputs[i:i + batch_size]]
    try:
        para_results.extend(paraphrase(batch, model, temperature=0.0))
    except Exception as e:
        print(i)
        para_results.append(toxic_inputs[i:i + batch_size])

In [None]:
with open('t5_base_10000_dev.txt', 'w') as file:
    file.writelines([sentence+'\n' for sentence in para_results])

In [None]:
df_test = pd.read_csv('test.tsv', sep='\t')
toxic_inputs = df_test['toxic_comment'].tolist()
para_results = []
problematic_batch = [] #if something goes wrong you can track such bathces
batch_size = 8

for i in tqdm(range(0, len(toxic_inputs), batch_size)):
    batch = [sentence for sentence in toxic_inputs[i:i + batch_size]]
    try:
        para_results.extend(paraphrase(batch, model, temperature=0.0))
    except Exception as e:
        print(i)
        para_results.append(toxic_inputs[i:i + batch_size])

In [None]:
with open('t5_base_10000_eval.txt', 'w') as file:
    file.writelines([sentence+'\n' for sentence in para_results])

In [None]:
!zip dev.zip t5_base_10000_dev.txt
!zip eval.zip t5_base_10000_eval.txt