<a href="https://colab.research.google.com/github/tmvfb/generalSVR-generator/blob/main/generalSVR_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> (old notebook version)

# Fine-tune a RuGPT model on the content of a Telegram channel  
The RuGPT3Small model is taken from [here](https://huggingface.co/ai-forever/rugpt3small_based_on_gpt2).

## Install env

In [1]:
!pip install --quiet -r requirements.txt

DEPRECATION: pytorch-lightning 1.7.7 has a non-standard dependency specifier torch>=1.9.*. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pytorch-lightning or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
DEPRECATION: distro-info 1.1build1 has a non-standard version number. pip 24.0 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of distro-info or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063

In [2]:
import pandas as pd
import numpy as np
import random
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [3]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7f0d93b31210>

In [4]:
!mkdir -p models/


## Create files and build train/validation samples

In [5]:
!python3 parser.py

In [6]:
data_path = "data/parsed_data.json"
data = pd.read_json(data_path, encoding="utf-8")
data.head(10)

|   | 0 |
|---|---|
| 0 | Дорогие подписчики и гости канала! Канал "Гене... |
| 1 | Добрый вечер, дорогие подписчики и гости канал... |
| 2 | Здравствуйте, дорогие наши подписчики и гости ... |
| 3 | Здравствуйте, дорогие подписчики и гости канал... |
| 4 | Здравствуйте, дорогие подписчики и гости канал... |
| 5 | Здравствуйте, дорогие подписчики и гости канал... |
| 6 | Здравствуйте, дорогие наши подписчики и гости ... |
| 7 | Здравствуйте, дорогие наши подписчики и гости ... |
| 8 | Дорогие подписчики и гости канала! Завтра, во ... |
| 9 | Дорогие подписчики и гости канала! Сегодня мат... |


In [7]:
data.shape

(931, 1)

In [8]:
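# Draw 150 distinct row indices for validation; a set keeps the membership tests cheap.
val_ind = set(random.sample(range(len(data)), 150))
# Every row goes to exactly one of the two lists.
train = [data.iloc[i, 0] for i in range(len(data)) if i not in val_ind]
valid = [data.iloc[i, 0] for i in range(len(data)) if i in val_ind]
# Alternative (sampling with replacement), left disabled:
# train = list(np.random.choice(data.iloc[:, 0], size=1100))
# valid = list(np.random.choice(data.iloc[:, 0], size=250))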

In [9]:
len(train), len(valid)

(781, 150)
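
The split above can also be wrapped in a small helper. The sketch below is one possible way to do it, not the notebook's own code: `split_train_valid` and its parameters are hypothetical names, and it draws from a private `random.Random(seed)` instance, so it will not reproduce the exact indices picked in cell 8.

```python
import random


def split_train_valid(texts, n_valid, seed=42):
    """Return (train, valid) lists with n_valid randomly chosen validation items."""
    rng = random.Random(seed)  # private RNG, the global random state is untouched
    valid_idx = set(rng.sample(range(len(texts)), n_valid))
    train = [t for i, t in enumerate(texts) if i not in valid_idx]
    valid = [t for i, t in enumerate(texts) if i in valid_idx]
    return train, valid


# Toy stand-in for the 931 parsed posts.
posts = [f"post {i}" for i in range(931)]
train_demo, valid_demo = split_train_valid(posts, n_valid=150)
print(len(train_demo), len(valid_demo))  # 781 150
```

Both lists keep the original post order, and no post can land in both of them.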

In [10]:
!mkdir -p artifacts


In [11]:
with open("artifacts/train.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(train))

In [12]:
with open("artifacts/valid.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(valid))
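
Because both files are written with a single `"\n".join(...)`, a post that itself contains line breaks would end up spread over several lines of the text file. If one post per line is preferred, the posts can be flattened first. A minimal sketch, assuming that collapsing whitespace is acceptable for this data; `write_one_per_line` is a hypothetical helper, not part of the notebook:

```python
from pathlib import Path


def write_one_per_line(path, texts):
    """Write each text on its own line, collapsing internal whitespace runs."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    lines = [" ".join(t.split()) for t in texts]    # newlines and tabs become spaces
    lines = [line for line in lines if line]        # skip posts that are empty
    path.write_text("\n".join(lines), encoding="utf-8")
    return len(lines)
```

It could be called as `write_one_per_line("artifacts/train.txt", train)`; the return value is the number of lines written.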

## Train 
The following code downloads the RuGPT model and tokenizer from Hugging Face and fine-tunes the model to generate essays.

In [13]:
!wget https://raw.githubusercontent.com/huggingface/transformers/main/examples/pytorch/language-modeling/run_clm.py

--2023-12-06 17:25:12--  https://raw.githubusercontent.com/huggingface/transformers/main/examples/pytorch/language-modeling/run_clm.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28779 (28K) [text/plain]
Saving to: ‘run_clm.py.3’


2023-12-06 17:25:13 (2.21 MB/s) - ‘run_clm.py.3’ saved [28779/28779]



In [14]:
torch.cuda.empty_cache()

In [15]:
!python3 run_clm.py \
    --model_name_or_path ai-forever/rugpt3small_based_on_gpt2 \
    --train_file artifacts/train.txt \
    --validation_file artifacts/valid.txt \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --block_size 2048 \
    --dataset_config_name plain_text \
    --do_train \
    --do_eval \
    --output_dir models/essays \
    --overwrite_output_dir

2023-12-06 17:25:16.361740: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-06 17:25:16.609240: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
12/06/2023 17:25:20 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
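
The `--block_size 2048` flag in the command above sets the length of the training sequences. In the Hugging Face example script, the tokenized lines are concatenated into one token stream, which is then cut into chunks of `block_size` tokens, and the incomplete tail is dropped. The sketch below mimics that grouping step on plain Python lists. It is a simplified illustration, not the script's actual code, and `group_into_blocks` is a hypothetical name.

```python
from itertools import chain


def group_into_blocks(tokenized_texts, block_size):
    """Concatenate token lists and cut the stream into fixed-size blocks."""
    stream = list(chain.from_iterable(tokenized_texts))
    usable = (len(stream) // block_size) * block_size  # drop the remainder
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


# Three toy "posts" with 9 tokens in total; with block_size=4 the last token is dropped.
toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(group_into_blocks(toy, block_size=4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that a block may span the boundary between two posts, as the first block does here.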

## Evaluate model

In [16]:
tok = GPT2Tokenizer.from_pretrained("models/essays")

In [17]:
model = GPT2LMHeadModel.from_pretrained("models/essays")

In [18]:
model.cuda()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50264, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dro

In [22]:
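# Prompt in Russian: "Dear subscribers and guests of the channel! Vladimir Putin ate fried eggs for breakfast."
text = "Дорогие подписчики и гости канала! Владимир Путин съел яичницу на завтрак."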
inpt = tok.encode(text, return_tensors="pt")

In [23]:
out = model.generate(
    inpt.cuda(),
    max_length=100,
    repetition_penalty=5.0,
    do_sample=True,
    top_k=20,
    top_p=1,
    temperature=1,
)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
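
The warning above appears because only `input_ids` were passed to `generate`. One way to avoid it is to call the tokenizer directly, which returns an `attention_mask` alongside `input_ids`, and to pass `pad_token_id` explicitly. A hedged sketch: `generate_text` is a hypothetical helper, and `model` and `tok` are assumed to be the objects loaded in the cells above.

```python
def generate_text(model, tok, prompt, device="cuda", **gen_kwargs):
    """Generate a continuation of `prompt`, passing the attention mask explicitly."""
    enc = tok(prompt, return_tensors="pt")  # contains input_ids and attention_mask
    out = model.generate(
        input_ids=enc["input_ids"].to(device),
        attention_mask=enc["attention_mask"].to(device),
        # Same fallback the warning reports: pad with the model's EOS token id.
        pad_token_id=model.config.eos_token_id,
        **gen_kwargs,
    )
    return tok.decode(out[0], skip_special_tokens=True)
```

With the settings used in this notebook, a call could look like `generate_text(model, tok, text, max_length=100, repetition_penalty=5.0, do_sample=True, top_k=20)`.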


In [24]:
tok.decode(out[0], skip_special_tokens=True)

'Дорогие подписчики и гости канала! Владимир Путин съел яичницу на завтрак. Несмотря, что в утреннем меню были исключительно омлет с колбасой (на мой вкус-не особо изысканно), у стола президента во вторник была не просто двойная порция яиц разных видов плюс несколько десятков грамм сыра «Бургер Кинг»… А ещё президент провел совещание по вопросам безопасности полетов беспилотных летательных аппаратов над территорией РФ для руководства военного блока страны.. И если вчера всё выглядело более убедительно при подготовке к встрече'
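
The `generate` call above samples with `top_k=20` and `temperature=1`. As a rough illustration of what these two settings do (this is not the Transformers implementation), the sketch below keeps the `k` largest values of a toy list of logits, rescales them by the temperature, and samples from the resulting softmax distribution. All function names are hypothetical.

```python
import math
import random


def top_k_probs(logits, k, temperature=1.0):
    """Softmax over the k largest logits; all other tokens get probability 0."""
    scaled = [x / temperature for x in logits]
    cutoff = sorted(scaled, reverse=True)[k - 1]  # k-th largest value
    # Ties with the cutoff are kept, so slightly more than k tokens may survive.
    kept = [x if x >= cutoff else -math.inf for x in scaled]
    top = max(kept)
    exps = [math.exp(x - top) for x in kept]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, k, temperature=1.0, rng=random):
    """Draw one token index from the top-k distribution."""
    probs = top_k_probs(logits, k, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


probs = top_k_probs([2.0, 1.0, 0.5, -1.0], k=2)
print([round(p, 3) for p in probs])  # [0.731, 0.269, 0.0, 0.0]
```

A smaller `k` or a temperature below 1 concentrates the probability mass on the most likely tokens, while `temperature=1` leaves the model's distribution over the kept tokens unchanged.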