In [1]:
! nvidia-smi

Fri Aug 11 01:32:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  Off |
|  0%   37C    P5    58W / 375W |      0MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [2]:
%pip install ruprompts
%pip install -U accelerate
%pip install datasets

Note: you may need to restart the kernel to use updated packages.

# ruPROMPTs tutorial

This [tutorial](https://github.com/ai-forever/ru-prompts/blob/main/notebooks/detox-russe-train-python.ipynb) presents an example of prompt tuning with the ruPROMPTs framework for the detoxification task.

In [2]:
!wget https://raw.githubusercontent.com/skoltech-nlp/russe_detox_2022/main/data/input/train.tsv
!wget https://raw.githubusercontent.com/skoltech-nlp/russe_detox_2022/main/data/input/dev.tsv
!wget https://raw.githubusercontent.com/s-nlp/russe_detox_2022/main/data/input/test.tsv

--2023-08-11 01:32:21--  https://raw.githubusercontent.com/skoltech-nlp/russe_detox_2022/main/data/input/train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1902888 (1.8M) [text/plain]
Saving to: ‘train.tsv.2’


2023-08-11 01:32:21 (34.3 MB/s) - ‘train.tsv.2’ saved [1902888/1902888]

--2023-08-11 01:32:21--  https://raw.githubusercontent.com/skoltech-nlp/russe_detox_2022/main/data/input/dev.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 200691 (196K) [text/plain]
Saving to: ‘dev.tsv.2’


2023-08-11 01:32:21 (8.65 MB/s) 

## Imports

In [3]:
import os

import pandas as pd
import shutil
from tqdm import tqdm

from datasets import load_dataset

import transformers
from transformers import GPT2LMHeadModel, AutoTokenizer
from transformers import set_seed
from transformers.optimization import AdamW, get_linear_schedule_with_warmup
from transformers import Trainer
from transformers import TrainingArguments
from transformers import pipeline


from ruprompts import TensorPromptProvider, LSTMPromptProvider
from ruprompts import Text2TextPreprocessor
from ruprompts import Prompt, PromptFormat

from ruprompts.callbacks import (
    FreezeTransformerUnfreezePrompt,
    ReduceCheckpoint,
    SavePretrainedPrompt,
)

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

## Data

In [4]:
# Optional one-off cleanup: drop the "index" column from train.tsv.
# df = pd.read_csv("train.tsv", sep="\t")
# df.drop(["index"], axis=1, inplace=True)
# df.to_csv("train.tsv", index=False, sep="\t")

datasets = load_dataset(
    "csv",
    data_files={"train": "train.tsv", "validation": "dev.tsv"},
    sep="\t",
)
train_dataset = datasets["train"]
valid_dataset = datasets["validation"]
test_data = pd.read_csv("test.tsv", sep="\t")['toxic_comment'].tolist()
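
Before mapping the preprocessor over the data, it can help to confirm that each TSV file has the columns this notebook relies on (`toxic_comment` as the input field and `neutral_comment1` as the target). The following is a minimal sketch using only the standard library. The helper name `missing_columns` and the sample rows are hypothetical and serve only as an illustration.

```python
import csv
import io


def missing_columns(tsv_text, required):
    """Return the required column names that are absent from a TSV header."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    return [name for name in required if name not in header]


# Hypothetical sample that mimics the columns used in this notebook.
sample = "toxic_comment\tneutral_comment1\nsome toxic text\tsome neutral text\n"
print(missing_columns(sample, ["toxic_comment", "neutral_comment1"]))
```

For the real files, the same check could be run on the text read from `train.tsv` and `dev.tsv`; an empty list means every required column is present.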

## Model

In [5]:
backbone_id = "sberbank-ai/rugpt3large_based_on_gpt2"

model = GPT2LMHeadModel.from_pretrained(backbone_id)
tokenizer = AutoTokenizer.from_pretrained(backbone_id, pad_token="<pad>", eos_token="<pad>")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Training TensorPromptProvider

In [6]:
set_seed(1)

prompt_format = PromptFormat("<P*100>{toxic_comment}<P*20>")
prompt_provider = TensorPromptProvider()

prompt = Prompt(prompt_format, prompt_provider)
prompt.patch(model, tokenizer)

preprocessor = Text2TextPreprocessor(
    prompt_format=prompt_format,
    tokenizer=tokenizer,
    target_field="neutral_comment1",
    max_tokens=1792,
    truncation_field="toxic_comment",
)

train_dataset = train_dataset.map(preprocessor)
valid_dataset = valid_dataset.map(preprocessor)

training_args = TrainingArguments(
    output_dir="./TensorPromptProvider",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=1,
    eval_steps=1000,
    save_steps=1000,
    logging_steps=100,
    evaluation_strategy="steps",
    save_strategy="steps",
    logging_strategy="steps",
    save_total_limit=2,
    metric_for_best_model="eval_loss",
    learning_rate=0.1,
    max_steps=100000,
    #report_to="tensorboard",
    # report_to=["tensorboard", "wandb"],  # uncomment to log to WandB
    logging_dir="logs",
    seed=1,
)

optimizer = AdamW(prompt_provider.parameters(), lr=training_args.learning_rate)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2000,
    num_training_steps=training_args.max_steps,
)
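
The cell above pairs the optimizer with a linear warmup schedule: the learning rate ramps up over the first 2000 steps and then decays linearly to zero at `max_steps`. The sketch below reproduces that shape in plain Python so the effective learning rate at a given step can be inspected without a GPU. The function name `linear_warmup_decay` is hypothetical, and the formula is an illustration of this kind of schedule under that assumption, not the library's own code.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 0.1  # learning_rate from the TensorPromptProvider run
for step in (0, 1000, 2000, 51000, 100000):
    lr = base_lr * linear_warmup_decay(step, warmup_steps=2000, total_steps=100000)
    print(step, round(lr, 4))
```

Under these assumptions the peak rate of 0.1 is reached at step 2000 and has halved again by step 51000.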



In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=preprocessor.collate_fn(),
    optimizers=(optimizer, scheduler),
    callbacks=[FreezeTransformerUnfreezePrompt(), ReduceCheckpoint(), SavePretrainedPrompt(prompt)],
)

trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss
1000,2.1526,2.034594
2000,1.9502,1.764478
3000,1.8289,1.718937
4000,1.6163,1.723563
5000,1.6502,1.549067
6000,1.6264,1.565487
7000,1.6358,1.517514
8000,1.549,1.618584
9000,1.5134,1.547968
10000,1.3685,1.480748


TrainOutput(global_step=100000, training_loss=1.2705454140472412, metrics={'train_runtime': 5635.404, 'train_samples_per_second': 35.49, 'train_steps_per_second': 17.745, 'total_flos': 1.3368979771224883e+17, 'train_loss': 1.2705454140472412, 'epoch': 28.79})
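
The evaluation log can be used to pick the checkpoint with the lowest validation loss. Below is a minimal sketch over the ten rows shown above; the helper name `best_step` is hypothetical. Because the displayed log covers only the first 10,000 of 100,000 steps, the result here will not match the checkpoint chosen later in the Inference section, and a given checkpoint directory exists only if it was actually kept on disk.

```python
# (step, training loss, validation loss) rows copied from the log above.
log_rows = [
    (1000, 2.1526, 2.034594),
    (2000, 1.9502, 1.764478),
    (3000, 1.8289, 1.718937),
    (4000, 1.6163, 1.723563),
    (5000, 1.6502, 1.549067),
    (6000, 1.6264, 1.565487),
    (7000, 1.6358, 1.517514),
    (8000, 1.549, 1.618584),
    (9000, 1.5134, 1.547968),
    (10000, 1.3685, 1.480748),
]


def best_step(rows):
    """Return the step whose validation loss is lowest."""
    step, _, _ = min(rows, key=lambda row: row[2])
    return step


print(f"./TensorPromptProvider/checkpoint-{best_step(log_rows)}")
```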

## Training LSTMPromptProvider

In [7]:
set_seed(1)

prompt_format = PromptFormat("<P*100>{toxic_comment}<P*20>")
prompt_provider = LSTMPromptProvider()

prompt = Prompt(prompt_format, prompt_provider)
prompt.patch(model, tokenizer)

preprocessor = Text2TextPreprocessor(
    prompt_format=prompt_format,
    tokenizer=tokenizer,
    target_field="neutral_comment1",
    max_tokens=1792,
    truncation_field="toxic_comment",
)

train_dataset = train_dataset.map(preprocessor)
valid_dataset = valid_dataset.map(preprocessor)

training_args = TrainingArguments(
    output_dir="./LSTMPromptProvider",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=1,
    eval_steps=1000,
    save_steps=1000,
    logging_steps=100,
    evaluation_strategy="steps",
    save_strategy="steps",
    logging_strategy="steps",
    save_total_limit=2,
    metric_for_best_model="eval_loss",
    learning_rate=0.01,
    max_steps=100000,
    #report_to="tensorboard",
    # report_to=["tensorboard", "wandb"],  # uncomment to log to WandB
    logging_dir="logs",
    seed=1,
)

optimizer = AdamW(prompt_provider.parameters(), lr=training_args.learning_rate)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2000,
    num_training_steps=training_args.max_steps,
)



In [8]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=preprocessor.collate_fn(),
    optimizers=(optimizer, scheduler),
    callbacks=[FreezeTransformerUnfreezePrompt(), ReduceCheckpoint(), SavePretrainedPrompt(prompt)],
)

trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss
1000,3.0008,3.308973
2000,3.256,3.24214
3000,3.1189,3.172971
4000,3.1162,3.085704
5000,3.0266,2.970207
6000,3.1149,2.932387
7000,2.9913,2.886102
8000,3.0404,2.868218
9000,3.0094,2.883643
10000,2.7718,2.889516


KeyboardInterrupt: 

## Inference

In [24]:
best_checkpoint = './TensorPromptProvider/checkpoint-52000'

prompt = Prompt.from_pretrained(best_checkpoint)

ppln = pipeline("text2text-generation-with-prompt", 
                prompt=prompt, 
                model=model, 
                tokenizer=tokenizer, 
                device=0)

ppln({"toxic_comment": "Ублюдок, мать твою, а ну иди сюда"}, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


[{'generated_text': 'А ну иди сюда'}]

In [14]:
top_p_predictions = []

beam_count = 3

for toxic_comment in tqdm(valid_dataset["toxic_comment"]):
    options = ppln(
        {"toxic_comment": toxic_comment},
        do_sample=True,
        num_beams=beam_count,
        num_return_sequences=beam_count,
        top_k=50,
        top_p=0.95,
        early_stopping=True,
    )

    # Strip padding tokens and keep the longest candidate.
    options = [option["generated_text"].replace("<pad>", "") for option in options]
    answer = sorted(options, key=len)[-1]
    top_p_predictions.append(answer)

  0%|          | 0/800 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

In [15]:
top_p_predictions[:3]

['Температуры горения хватит чтобы её расплавить',
 'А ты там был. Ты вообще служил',
 'а сам где кормишься?']
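
The candidate selection used in the sampling loop (strip `<pad>`, then keep the longest of the returned sequences) can be factored into a small helper, which makes the heuristic easy to test or swap out. This is a minimal sketch: the name `pick_longest` and the sample candidates are hypothetical, and "longest" is simply the rule this notebook uses, not a claim that it yields the best detoxification.

```python
def pick_longest(options, pad_token="<pad>"):
    """Strip the padding token from each candidate and return the longest one."""
    cleaned = [option["generated_text"].replace(pad_token, "") for option in options]
    # Same rule as the loop: on a length tie the later candidate wins.
    return sorted(cleaned, key=len)[-1]


# Hypothetical pipeline output for a single input.
fake_options = [
    {"generated_text": "short<pad><pad>"},
    {"generated_text": "a longer answer<pad>"},
    {"generated_text": "mid length"},
]
print(pick_longest(fake_options))
```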