<a href="https://colab.research.google.com/github/totminaekaterina/RUSSE-2022-Detoxification/blob/main/finetune_rugpt3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install --upgrade transformers==4.6.0

Collecting transformers==4.6.0
  Downloading transformers-4.6.0-py3-none-any.whl (2.3 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Installing collected packages: tokenizers, sacremoses, huggingface-hub, transformers
Successfully installed huggingface-hub-0.0.8 sacremoses-0.0.47 tokenizers-0.10.3 transformers-4.6.0


In [2]:
import argparse
import json
import random
from pprint import pprint
from pathlib import Path
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
from transformers import (AutoTokenizer,
                          AutoModelForCausalLM,
                          Trainer,
                          TrainingArguments,
                          TrainerCallback,
                          AdamW,
                          get_linear_schedule_with_warmup)

In [3]:
def set_random_seed(seed_val):
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)

In [4]:
def get_random_example(dataset):
    """
    Get a random example from the given dataset.
    Return the prompt with special tokens and the reference neutral_comment1.
    """
    sample = dataset.sample()
    prompt = f'<|startoftext|>{sample["toxic_comment"].item()}<|sep|>'
    true_output = sample["neutral_comment1"].item()
    return prompt, true_output

In [5]:
class SimplificationDataset(torch.utils.data.Dataset):
    def __init__(self, texts_list, tokenizer, gpt2_type="gpt2", max_length=1024):
        self.tokenizer = tokenizer

        texts_combined = []
        for input_text, out_text in texts_list:
            text_combined = f"<|startoftext|>{input_text}<|sep|>{out_text}<|endoftext|>"
            texts_combined.append(text_combined)
        self.encodings = tokenizer(texts_combined,
                              truncation=True,
                              max_length=max_length,
                              padding="max_length",
                              return_tensors="pt")

    def __len__(self):
        return len(self.encodings.input_ids)

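    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors="pt"), so index and
        # clone them; wrapping a tensor in torch.tensor() triggers a warning.
        item = {key: val[idx].clone() for key, val in self.encodings.items()}
        # For causal LM training the labels are the input ids themselves.
        item["labels"] = item["input_ids"].clone()
        return item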
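
Because every sequence is padded to `max_length` and the labels are a copy of `input_ids`, the loss is also computed on `<|pad|>` positions. A common alternative, not used in this notebook, is to set the labels of padded positions to `-100`, the index that PyTorch's cross-entropy loss ignores by default. A minimal sketch on plain Python lists; the function name `mask_padding_labels` and the toy token ids are assumptions for illustration:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    if len(input_ids) != len(attention_mask):
        raise ValueError("input_ids and attention_mask must have the same length")
    return [tok if keep == 1 else ignore_index
            for tok, keep in zip(input_ids, attention_mask)]


# Toy example: three real tokens followed by two padding tokens
# (the ids are hypothetical, not real tokenizer ids).
input_ids = [11, 12, 13, 0, 0]
attention_mask = [1, 1, 1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [11, 12, 13, -100, -100]
```

With tensors the same idea can be written as `labels[attention_mask == 0] = -100` on a clone of `input_ids`. Masking the padding would change the reported loss values, so they would no longer be comparable with the numbers logged in this run.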

In [6]:
class PrintExampleCallback(TrainerCallback):
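    def on_evaluate(self, args, state, control, logs=None, **kwargs):
        # Relies on the notebook-level globals `valid`, `tokenizer`, `model` and `MAX_LENGTH`.
        prompt, true_output = get_random_example(valid)
        # str.strip() trims a set of characters, not a substring, so use replace() instead.
        print(prompt.replace("<|startoftext|>", "").replace("<|sep|>", ""), true_output, sep="\n")
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)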
        model.eval()
        with torch.no_grad():
            sample_outputs = model.generate(
                input_ids,
                do_sample=True,   
                top_k=50,
                max_length=MAX_LENGTH,
                top_p=0.95,
                temperature=0.9,
                num_return_sequences=1
            ).detach().cpu()
        model.train()

        for sample in sample_outputs:
            res = (tokenizer.decode(sample, skip_special_tokens=False)
                            .split("<|sep|>")[1]
                            .replace("<|pad|>", "")
                            .replace("<|endoftext|>", ""))
            print(res, "-" * 80, sep="\n")
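
When removing special tokens from a prompt or a decoded generation for printing, note that `str.strip` treats its argument as a set of characters to trim from both ends rather than as a substring. A minimal sketch of the difference, together with the split-and-replace extraction used for generated text; the helper name `extract_rewrite` and the English example strings are assumptions for illustration:

```python
BOS, SEP, PAD, EOS = "<|startoftext|>", "<|sep|>", "<|pad|>", "<|endoftext|>"

prompt = f"{BOS}start over, please{SEP}"

# strip() trims any of the characters in the token strings from both ends,
# so it also eats leading and trailing letters of the comment itself.
stripped = prompt.strip(BOS).strip(SEP)
# replace() removes only the exact token strings.
replaced = prompt.replace(BOS, "").replace(SEP, "")

print(repr(stripped))  # ' over, plea'
print(repr(replaced))  # 'start over, please'


def extract_rewrite(decoded):
    """Return the text after the separator with padding and end tokens removed."""
    return decoded.split(SEP)[1].replace(PAD, "").replace(EOS, "")


decoded = f"{BOS}toxic input{SEP}neutral rewrite{EOS}{PAD}{PAD}"
print(extract_rewrite(decoded))  # neutral rewrite
```

Cyrillic comments are mostly unaffected because the token strings contain only Latin letters and punctuation, but a comment that begins or ends with one of those characters would be truncated.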

In [7]:
!wget https://raw.githubusercontent.com/totminaekaterina/RUSSE-2022-Detoxification/main/prepared_data/train_part.csv

--2022-02-10 06:24:24--  https://raw.githubusercontent.com/totminaekaterina/RUSSE-2022-Detoxification/main/prepared_data/train_part.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 790378 (772K) [text/plain]
Saving to: ‘train_part.csv’


2022-02-10 06:24:24 (91.0 MB/s) - ‘train_part.csv’ saved [790378/790378]



In [8]:
!wget https://raw.githubusercontent.com/skoltech-nlp/russe_detox_2022/main/data/input/dev.tsv
!wget https://raw.githubusercontent.com/skoltech-nlp/russe_detox_2022/main/data/input/test.tsv

--2022-02-10 06:24:25--  https://raw.githubusercontent.com/skoltech-nlp/russe_detox_2022/main/data/input/dev.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 200691 (196K) [text/plain]
Saving to: ‘dev.tsv’


2022-02-10 06:24:25 (78.8 MB/s) - ‘dev.tsv’ saved [200691/200691]

--2022-02-10 06:24:25--  https://raw.githubusercontent.com/skoltech-nlp/russe_detox_2022/main/data/input/test.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104462 (102K) [text/plain]
Saving to: ‘test.tsv’


2022-02-10 06:24:25 (26.6 MB/s) - ‘test.tsv’ 

In [9]:
DATA_DIR = Path("/content")
TRAIN_PATH = DATA_DIR / "train_part.csv"
VALID_PATH = DATA_DIR / "dev.tsv"
TEST_PATH = DATA_DIR / "test.tsv"

In [10]:
train = pd.read_csv(TRAIN_PATH)
valid = pd.read_csv(VALID_PATH, sep="\t")
# Keep only the first reference rewrite for validation.
valid = valid.drop(columns=["neutral_comment2", "neutral_comment3"])
valid.columns = ["toxic_comment", "neutral_comment1"]

# load model

In [11]:
model_name = "sberbank-ai/rugpt3medium_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

# add special tokens

In [12]:
special_tokens = {
    "bos_token": "<|startoftext|>",
    "pad_token": "<|pad|>",
    "sep_token": "<|sep|>",
}
tokenizer.add_special_tokens(special_tokens)
model.resize_token_embeddings(len(tokenizer))

Embedding(50261, 1024)

In [14]:
MAX_LENGTH = 200
DATA_COLS = ["toxic_comment", "neutral_comment1"]
train_dataset = SimplificationDataset(train[DATA_COLS].values.tolist(), tokenizer, max_length=MAX_LENGTH)
valid_dataset = SimplificationDataset(valid[DATA_COLS].values.tolist(), tokenizer, max_length=MAX_LENGTH)

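# Steps per epoch: dataset size // per-device batch size (8) // gradient accumulation steps (1).
EPOCH_STEPS = len(train_dataset) // 8 // 1
EVAL_STEPS = EPOCH_STEPS
# 5 is the number of training epochs.
print(f"Total steps: {EPOCH_STEPS * 5}\nEvaluate and save every {EVAL_STEPS} steps.")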

Total steps: 2405
Evaluate and save every 481 steps.
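
The total printed here (2405) is slightly lower than the `global_step=2410` that the trainer reports at the end of the run. One plausible explanation is that the last, incomplete batch of each epoch still counts as a step when `dataloader_drop_last` is `False`, so the steps per epoch round up rather than down. A minimal sketch of that arithmetic; the dataset size of 3850 is a hypothetical value chosen to be consistent with the printed numbers, since the notebook does not show the actual size:

```python
import math


def steps_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of optimizer steps in one epoch (gradient accumulation of 1 assumed)."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


NUM_EXAMPLES = 3850  # hypothetical; any size from 3849 to 3855 gives the same result
BATCH_SIZE = 8
EPOCHS = 5

floor_total = steps_per_epoch(NUM_EXAMPLES, BATCH_SIZE, drop_last=True) * EPOCHS
ceil_total = steps_per_epoch(NUM_EXAMPLES, BATCH_SIZE) * EPOCHS
print(floor_total, ceil_total)  # 2405 2410
```

Under this assumption, evaluating every 481 steps drifts one step per epoch away from the true epoch boundary, which is harmless here but worth knowing when reading the loss table.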


In [15]:
training_args = TrainingArguments(
    output_dir="result_rugpt3medium",
    logging_dir="logs_rugpt3medium",
    logging_first_step=True,
    num_train_epochs=5,
    evaluation_strategy="steps",
    eval_steps=EVAL_STEPS,
    save_steps=EVAL_STEPS,
    logging_steps=100,
    lr_scheduler_type="linear",
    warmup_steps=500,
    learning_rate=0.00005,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    weight_decay=0,
    fp16=False,
    seed=19,
)
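
With `lr_scheduler_type="linear"` and `warmup_steps=500`, the learning rate rises linearly from 0 to the peak over the first 500 steps and then decays linearly to 0 at the final step. A minimal sketch of that schedule as a plain function; the total of 2410 steps is taken from the `global_step` reported at the end of training, and the function name `linear_warmup_lr` is an assumption for illustration:

```python
def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=2410):
    """Learning rate at a given optimizer step for linear warmup plus linear decay."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * factor


# Start, mid-warmup, peak, mid-decay, and final step.
for step in (0, 250, 500, 1455, 2410):
    print(step, linear_warmup_lr(step))
```

Since warmup covers roughly the first epoch of this run, the first evaluation at step 481 happens before the learning rate has reached its peak.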

In [16]:
Path(training_args.output_dir).mkdir(exist_ok=True)
Path(training_args.logging_dir).mkdir(exist_ok=True)

In [17]:
pprint(training_args.to_dict())

{'_n_gpu': 1,
 'adafactor': False,
 'adam_beta1': 0.9,
 'adam_beta2': 0.999,
 'adam_epsilon': 1e-08,
 'dataloader_drop_last': False,
 'dataloader_num_workers': 0,
 'dataloader_pin_memory': True,
 'ddp_find_unused_parameters': None,
 'debug': [],
 'deepspeed': None,
 'disable_tqdm': False,
 'do_eval': True,
 'do_predict': False,
 'do_train': False,
 'eval_accumulation_steps': None,
 'eval_steps': 481,
 'evaluation_strategy': 'steps',
 'fp16': False,
 'fp16_backend': 'auto',
 'fp16_full_eval': False,
 'fp16_opt_level': 'O1',
 'gradient_accumulation_steps': 1,
 'greater_is_better': None,
 'group_by_length': False,
 'ignore_data_skip': False,
 'label_names': None,
 'label_smoothing_factor': 0.0,
 'learning_rate': 5e-05,
 'length_column_name': 'length',
 'load_best_model_at_end': False,
 'local_rank': -1,
 'logging_dir': 'logs_rugpt3medium',
 'logging_first_step': True,
 'logging_steps': 100,
 'logging_strategy': 'steps',
 'lr_scheduler_type': 'linear',
 'max_grad_norm': 1.0,
 'max_steps': 

In [18]:
with open(Path(training_args.output_dir) / "run_parameters.txt", "w") as f:
    pprint(training_args.to_dict(), f)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[PrintExampleCallback],
)

trainer.train()



Step,Training Loss,Validation Loss
481,0.5477,0.518247
962,0.4504,0.508188
1443,0.3702,0.531864
1924,0.3062,0.570566
2405,0.2522,0.613723


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


охренел, что ли ваш батька! кем себя возомнил, что бы на ковёр нашего генпрокурора вызывать, предатель гребанный
Кем ваш батька себя возомнил? Кто он такой чтоб вызывать нашего генпрокурора?
Охренел он, кем он себя возомнил, что бы нашего прокурора вызывать
--------------------------------------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


это тока для добоебов которые верят в эту чушь интернетувскую
Это для тех, кто верит во все что есть в интернете
это только для обманщиков, которые верят в эту чушь интернетовскую
--------------------------------------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


а вы считаете он прав?ему навстречу пошли, а он как пидор конченый слился.
А вы считаете он прав? Ему на встречу пошли, а он слился
а вы считаете, что он прав?ему навстречу пошли, а он как не мужик
--------------------------------------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


вот как поставят чипс так и снимут маски туда клонят эти идиоты! они своё добиваются так или иначе
вот как поставят чипс так и снимут маски туда клонят! они своё добиваются так или иначе
Вот как поставят чипс, так и снимут,, они свое добиваются
--------------------------------------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


суки пиздят за неделю всё разволилось
Врут, за неделю все разволилось
Враньё говорят за неделю всё разволилось
--------------------------------------------------------------------------------




TrainOutput(global_step=2410, training_loss=0.6403291135408077, metrics={'train_runtime': 3320.7936, 'train_samples_per_second': 0.726, 'total_flos': 14946785280000.0, 'epoch': 5.0, 'init_mem_cpu_alloc_delta': 2215694336, 'init_mem_gpu_alloc_delta': 1524178944, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': -1862643712, 'train_mem_gpu_alloc_delta': 4270919168, 'train_mem_cpu_peaked_delta': 1884684288, 'train_mem_gpu_peaked_delta': 7956937728})
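
In the loss table logged during training, the training loss keeps falling while the validation loss reaches its minimum at step 962 and rises afterwards, which suggests overfitting after the second epoch. A minimal sketch that picks the checkpoint with the lowest validation loss from these logged values; the `checkpoint-<step>` directory naming follows the usual `Trainer` convention and is an assumption here, since the notebook does not list the output directory:

```python
# (step, training loss, validation loss) copied from the logged table
history = [
    (481, 0.5477, 0.518247),
    (962, 0.4504, 0.508188),
    (1443, 0.3702, 0.531864),
    (1924, 0.3062, 0.570566),
    (2405, 0.2522, 0.613723),
]

# Select the row with the smallest validation loss.
best_step, _, best_val = min(history, key=lambda row: row[2])
print(best_step, best_val)  # 962 0.508188

# Assumed checkpoint location under the output_dir used in this notebook.
checkpoint_dir = f"result_rugpt3medium/checkpoint-{best_step}"
print(checkpoint_dir)  # result_rugpt3medium/checkpoint-962
```

Validation loss is only a proxy for detoxification quality, so the selected checkpoint may still be worth comparing against later ones on sampled generations.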

In [19]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
