It is finally time to train something.

Based on [translation.ipynb](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/pytorch/translation.ipynb) and [fred-t5 finetune repo](https://github.com/Den4ikAI/FRED-T5-Finetuning).

I use a single RTX 3060 12GB because naive use of two or more GPUs causes OOM with `FRED-T5-large`.
Training `ruT5-base` works out of the box with `CUDA_VISIBLE_DEVICES=2,3`, though.

In [None]:
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

In [None]:
from difflib import SequenceMatcher
import re
import json
from tqdm.notebook import tqdm
import random

I use only part of my data because otherwise training takes too long.
8-12 hours of fine-tuning has been fine for my usual tasks, so I stick to that budget here.

In [None]:
DATASET_FILES = [
    "/home/jovyan/data/kaggle.jsonl",
    "/home/jovyan/data/ficbook_replaces.jsonl",
    "/home/jovyan/data/pikabu_replaces.jsonl",
    # "/home/jovyan/data/librusec_replaces.jsonl"
]

In [None]:
MODEL = {
    0: {
        "type": "FRED-T5",
        "path": "/home/jovyan/wdc1/models/FRED-T5-large",
        # "path": "/home/jovyan/wdc1/models/FRED-T5-1.7B"
        # "path": "/home/jovyan/models/3_fred-t5/checkpoint-11000"
    },
    1: {
        "type": "ruT5",
        "path": "/home/jovyan/wdc1/models/ruT5-base",
    },
}[0]
MODEL_PATH = MODEL["path"]  # used below when loading the tokenizer and the model
TRAINED_SAVE_PATH = "/home/jovyan/models/7_fred-t5-large"

# He obtayn

In [None]:
import sys
import os
# Make the parent directory importable so that the local `replaces` module is found.
sys.path.append(os.path.dirname(os.getcwd()))
from replaces import Replace, Replaces

In [None]:
dataset = []
for file in DATASET_FILES:
    with open(file) as f:
        for line in tqdm(f, desc=file):
            dataset.append({"replaces": Replaces(json.loads(line)["replaces"])})

In [None]:
dataset[1000]

Also, the `ruT5` SentencePiece tokenizer lacks the newline symbol `"\n"`, so ```<extra_id_0>\n<extra_id_1>``` round-trips (encode, then decode) into ```<unk> extra_id_0<unk> extra_id_1<unk>```. Rather than fixing its output afterwards (possible but painful), [one is advised](https://github.com/google/sentencepiece/issues/101) to add the symbol explicitly.

In [None]:
from transformers import GPT2Tokenizer, T5Tokenizer, AutoTokenizer


if MODEL["type"] == "ruT5":
    tokenizer = T5Tokenizer.from_pretrained(MODEL_PATH)
    tokenizer.add_tokens("\n")
elif MODEL["type"] == "FRED-T5":
    tokenizer = GPT2Tokenizer.from_pretrained(MODEL_PATH, eos_token="</s>")
else:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

One problem with the last training iteration was wrong predictions for long numbers like `125678`.
A possible cause is tokenization: numbers may be split into parts that are hard for the model to work with.
Let's check it now.

In [None]:
good, wrong = [], []
for i in range(100, 1000):
    a = str(i)
    ids = tokenizer.encode(a)
    b = "|".join([tokenizer.decode(_) for _ in ids])
    (wrong if a != b else good).append(b)

In [None]:
len(good), len(wrong)
# (28, 872) in case of FRED-T5, (19, 881) in case of ruT5

The `FRED-T5-large` tokenizer splits the majority of three-digit numbers into several tokens.
It might be better to force numbers to be split into single digits, turning `123456` into `1 2 3 4 5 6`.

Another option is to split numbers into three-digit groups so that `1234567` turns into `1 234 567`. We try that option first.

In [None]:
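# Alternative (unused here): split every number into single digits.
# Given its own name so that the three-digit version below does not silently overwrite it.
def strip_numbers_by_digit(s):
    return " ".join(((" ".join(part) if part.isdigit() else part) for part in s.split()))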


def strip_numbers(s):
    result = []
    for part in s.split():
        if part.isdigit():
            while len(part) > 3:
                result.append(part[:- 3 * ((len(part) - 1) // 3)])
                part = part[- 3 * ((len(part) - 1) // 3):]
            if part:
                result.append(part)
        else:
            result.append(part)
    return " ".join(result)


strip_numbers("у нас было 1234567890 пакетиков травы, 750 ампул новокаина, 55555 пакетиков диэтиламида лизергиновой кислоты, солонка, на 1000/2000 наполненная кокаином")
# "у нас было 1 234 567 890 пакетиков травы, 750 ампул новокаина, 55 555 пакетиков диэтиламида лизергиновой кислоты, солонка, на 1000/2000 наполненная кокаином"
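As the expected output shows, tokens such as `1000/2000` are left untouched because only whitespace-separated, purely numeric tokens are grouped. If that ever becomes a problem, a regex-based variant could group every run of digits instead. The following is a minimal sketch of that idea, not what this notebook uses; `group_digits` is a hypothetical name, and note that it would also split the fractional part of decimals such as `3.14159`.

```python
import re


def group_digits(s):
    """Split every run of digits into groups of three, counted from the right."""
    def regroup(match):
        digits = match.group()
        head = len(digits) % 3 or 3  # length of the leftmost (possibly shorter) group
        parts = [digits[:head]]
        parts += [digits[i:i + 3] for i in range(head, len(digits), 3)]
        return " ".join(parts)

    return re.sub(r"\d+", regroup, s)


print(group_digits("750 ампул, 55555 пакетиков, на 1000/2000"))
```

Unlike `strip_numbers`, this variant keeps the original whitespace of the string intact, since it never calls `split()`.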

In [None]:
from collections import Counter
from itertools import chain
data = []
added = Counter()
for elem in tqdm(dataset):
    if all(_.type == "E" for _ in elem["replaces"]):
        continue
    if "prompt" in elem and "target" in elem:
        continue
    replace_words = list(chain(*(r.text_to.strip().lower().split() for r in elem["replaces"] if r.type != "E")))
    added.update(replace_words)
    prompt, target = "<SC1>", ""
    etid = 0
    for r in elem["replaces"]:
        if r.type == "E":
            prompt += r.text_to
        else:
            ws_number = len(r.text_from) - len(r.text_from.rstrip())
            prompt += f"[{strip_numbers(r.text_from.rstrip())}]<extra_id_{etid}>{' ' * ws_number}"
            target += f"<extra_id_{etid}> {r.text_to.strip()} \n"
            etid += 1
    elem["prompt"] = f"{prompt}</s>"
    elem["target"] = f"{target}</s>"
    data.append(elem)

This produces training examples of the following kind

    <SC1>Временами я думаю, какое применение найти тем [14 697]<extra_id_0> рублям, что лежат уже больше [33]<extra_id_1> лет?

and the model is expected to predict text like this

    <extra_id_0> четырнадцати тысячам шестистам девяноста семи
    <extra_id_1> тридцати трёх </s>
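
To make the format concrete, here is a minimal, self-contained sketch of how one prompt/target pair is assembled. The `Edit` tuple and `build_example` function are hypothetical stand-ins for the `Replace` objects from the `replaces` module, which is not shown in this notebook; only the field names `type`, `text_from`, `text_to` and the `"E"` (unchanged text) type are taken from the code above. Number grouping and the trailing `</s>` are left out for brevity.

```python
from collections import namedtuple

# Hypothetical stand-in for the Replace class; "E" marks unchanged text,
# any other type (here "R", an invented label) marks a span to be rewritten.
Edit = namedtuple("Edit", ["type", "text_from", "text_to"])


def build_example(edits):
    """Build a (prompt, target) pair in the <SC1> / <extra_id_N> format."""
    prompt, target = "<SC1>", ""
    slot = 0
    for e in edits:
        if e.type == "E":
            prompt += e.text_to
        else:
            stripped = e.text_from.rstrip()
            # Trailing whitespace is moved outside of the brackets.
            trailing = " " * (len(e.text_from) - len(stripped))
            prompt += f"[{stripped}]<extra_id_{slot}>{trailing}"
            target += f"<extra_id_{slot}> {e.text_to.strip()} \n"
            slot += 1
    return prompt, target


edits = [
    Edit("E", "больше ", "больше "),
    Edit("R", "33 ", "тридцати трёх "),
    Edit("E", "лет?", "лет?"),
]
prompt, target = build_example(edits)
print(prompt)
print(repr(target))
```

Each rewritten span gets its own `<extra_id_N>` slot, so the target lists one line per slot in the order the slots appear in the prompt.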

Let's check the most and least common __words__ we have added so far.

In [None]:
added.most_common()[:10], added.most_common()[-10:], data[0]

Apart from rare mistakes, the data looks good enough to train on.

The distribution is still too skewed for my taste, as will be shown later.
One fast and simple fix is to iterate over the examples and drop those in which **all** the replaced words have already been seen too often.
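
The filtering idea can be illustrated on toy data. This is a minimal sketch rather than the cell used for the real run: `balance` and the sample word lists are made up for illustration, and the limit is passed in explicitly instead of being derived from the data. An example is kept only while at least one of its words has been seen fewer than `limit` times.

```python
from collections import Counter


def balance(examples, limit):
    """Keep an example if any of its words is still below the limit."""
    seen = Counter()
    kept = []
    for words in examples:
        if any(seen[w] < limit for w in words):
            kept.append(words)
            seen.update(words)
    return kept, seen


# Toy data: each example is the list of words produced by its replacements.
examples = [["два"], ["два"], ["два"], ["два", "тысячи"], ["сто"]]
kept, seen = balance(examples, limit=2)
print(len(kept), dict(seen))
```

Note that a frequent word can still exceed the limit when it co-occurs with a rare one, so the limit is a soft cap rather than a hard one.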

In [None]:
occ_limit = (sum(added.values()) / len(added)) ** 2  # feel free to find another heuristic
print(occ_limit)
added2 = Counter()
balanced_data = []
for elem in tqdm(data):
    replace_words = list(chain(*[r.text_to.strip().lower().split() for r in elem["replaces"] if r.type != "E"]))
    if any((added2[word] < occ_limit for word in replace_words)):
        balanced_data.append(elem)
        added2.update(replace_words)
len(balanced_data), len(balanced_data) / len(data)

We have dropped about 2/3 of the data we had!
Let's check the result visually.

In [None]:
stat_regs = {
    "re_digits": re.compile(r"\d"),
    "re_digits_latin": re.compile(r"[a-zA-Z\d]"),
    "re_latin": re.compile(r"[a-zA-Z]")
}
for stat_name, stat_re in stat_regs.items():
    print(
        stat_name,
        len([elem for elem in tqdm(balanced_data) if any(re.search(stat_re, r.text_from) for r in elem["replaces"])])
    )

In [None]:
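from matplotlib import pyplot as plt

# Counts of the 1000 most common replaced words (log scale), after and before balancing.
axs = plt.subplot()
axs.set_yscale('log')
axs.plot([_[1] for _ in added2.most_common()[:1000]], label="balanced")
axs.plot([_[1] for _ in added.most_common()[:1000]], label="original")
axs.legend()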

So far we have lost only redundant data.

In [None]:
from datasets import Dataset

dataset = Dataset.from_list(balanced_data).train_test_split(test_size=0.01)
dataset

In [None]:
def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["prompt"],
        text_target=examples["target"],
        max_length=128,  # NB should affect memory consumption
        truncation=True
    )
    return model_inputs


dataset = dataset.map(preprocess_function, batched=True, num_proc=10)

In [None]:
dataset = dataset.remove_columns(["prompt", "target", "replaces"])

Just in case, I drop examples that may have been truncated.

In [None]:
from collections import Counter
c = Counter([len(_["input_ids"]) for _ in dataset["train"]])
sum([v for k, v in c.items() if k < 128]), c

In [None]:
for k, v in dataset.items():
    dataset[k] = [_ for _ in v if 10 < len(_["input_ids"]) < 126]
{k:len(v) for k, v in dataset.items()}

# He trayn

Time to actually train at last!

In [None]:
from transformers import T5ForConditionalGeneration


model = T5ForConditionalGeneration.from_pretrained(MODEL_PATH)

In [None]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq


training_args = Seq2SeqTrainingArguments(
    output_dir=TRAINED_SAVE_PATH,
    optim="adafactor",
    evaluation_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    logging_first_step=True,
    learning_rate=1e-4,
    lr_scheduler_type="constant",
    # gradient_checkpointing=True,
    gradient_checkpointing=False,
    gradient_accumulation_steps=8,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    save_total_limit=20,
    num_train_epochs=2,
    # predict_with_generate=True,
    # fp16=True,
    push_to_hub=False,
    remove_unused_columns=False,
    load_best_model_at_end=True,
    # auto_find_batch_size=True,
    dataloader_num_workers=4,
    report_to="tensorboard",
)


In [None]:
import transformers
transformers.logging.set_verbosity_info()

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
)
trainer.train()

In [None]:
model.save_pretrained(os.path.join(TRAINED_SAVE_PATH, "final"), safe_serialization=False)
tokenizer.save_pretrained(os.path.join(TRAINED_SAVE_PATH, "final"))

# But most importantly he explayn

In [None]:
import torch
# Sample prompts to try; only the last one is used below.
lm_texts = [
    '<SC1>я купил [iphone 12X]<extra_id_0> за [142 990]<extra_id_1> руб без [3-x]<extra_id_2> часов полдень и т.д.',
    '<SC1>я купил айфон за [14 970]<extra_id_0> рублей',
    "<SC1>Временами я думаю, какое применение найти тем [14 697]<extra_id_0> рублям, что лежат уже больше [33]<extra_id_1> лет?",
    "<SC1>Было у отца [3]<extra_id_0> сына, но не было даже [2-3]<extra_id_1> пиджаков с блёстками за [142 990 руб]<extra_id_2>.",
    "<SC1>В школе у меня одни [5]<extra_id_0>.",
    '<SC1>Было у отца [3]<extra_id_0> сына. Старшему было [35]<extra_id_1>, среднему - не меньше [33]<extra_id_2>, а младший на [4]<extra_id_3> младше всех. Бывает.',
    "<SC1>Временами я думаю, какое применение найти тем [265 948 697]<extra_id_0> рублям, что лежат уже больше [33]<extra_id_1> лет у нашего [DevOps]<extra_id_2>?",
]
lm_text = lm_texts[-1]
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to("cuda:0")
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True, max_new_tokens=50)
print(tokenizer.decode(outputs[0][1:]))
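
The decoded output should follow the target format used in the training data, so it can be parsed back into replacements and spliced into the prompt. The following is a rough sketch under that assumption; `parse_prediction` and `fill_prompt` are hypothetical helpers that are not part of this notebook, and the sample prediction is hand-written rather than real model output.

```python
import re


def parse_prediction(text):
    """Map each <extra_id_N> slot number to its predicted text."""
    text = text.replace("</s>", "")
    pairs = re.findall(r"<extra_id_(\d+)>(.*?)(?=<extra_id_\d+>|\Z)", text, flags=re.S)
    return {int(n): value.strip() for n, value in pairs}


def fill_prompt(prompt, slots):
    """Replace every "[source]<extra_id_N>" span with the predicted text."""
    def substitute(match):
        n = int(match.group(2))
        # Fall back to the original bracketed text if the slot is missing.
        return slots.get(n, match.group(1))

    body = prompt.replace("<SC1>", "").replace("</s>", "")
    return re.sub(r"\[([^\]]*)\]<extra_id_(\d+)>", substitute, body)


prompt = "<SC1>я купил айфон за [14 970]<extra_id_0> рублей"
prediction = "<extra_id_0> четырнадцать тысяч девятьсот семьдесят \n</s>"  # hand-written sample
slots = parse_prediction(prediction)
print(fill_prompt(prompt, slots))
```

Falling back to the bracketed source text keeps the sentence readable when the model skips a slot, at the cost of silently leaving that span unnormalized.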