It is finally time to train something.

Based on [translation.ipynb](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/pytorch/translation.ipynb) and [fred-t5 finetune repo](https://github.com/Den4ikAI/FRED-T5-Finetuning). Modified as in [`tensor_parallel` example](https://github.com/BlackSamorez/tensor_parallel/blob/main/examples/training_flan-t5-xl.ipynb).

I use only two GPUs of my 4x RTX 3060 12GB rig, because using three or more causes a `Bus error (core dumped)`. After that error the JupyterLab Docker container has to be restarted to recover.

> A single GPU can be selected with the `"cuda:3"` device, but `"cuda:2,3"` does not seem to be supported by 🤗 transformers.

`tensor_parallel` does not work with recent versions of transformers (despite its official requirements), so I had to downgrade transformers manually.
```
!pip install tensor_parallel
!pip install transformers==4.29.2
```
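
A quick way to confirm that the pin took effect is to compare the installed version with the expected one. This is a minimal sketch using only the standard library; `installed_matches` is a hypothetical helper introduced here for illustration, and the version string is simply the one pinned above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_matches(package: str, expected: str):
    """Return True/False for a version match, or None if the package is not installed."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        return None


print(installed_matches("transformers", "4.29.2"))
```

In a notebook the kernel usually has to be restarted after a reinstall before the new version is actually imported.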

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
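
The device names used later follow from this setting: once `CUDA_VISIBLE_DEVICES` is set, CUDA renumbers the visible GPUs from zero, so physical GPUs 2 and 3 show up as `cuda:0` and `cuda:1`. A minimal sketch of that mapping in plain Python; `visible_device_map` is a hypothetical helper for illustration only and does not query the driver.

```python
def visible_device_map(cuda_visible_devices: str) -> dict:
    """Map logical device names to physical GPU indices.

    Assumes the variable holds comma-separated integer indices
    (it may also hold GPU UUIDs, which this sketch does not handle).
    """
    physical = [int(i) for i in cuda_visible_devices.split(",") if i.strip()]
    return {f"cuda:{logical}": gpu for logical, gpu in enumerate(physical)}


print(visible_device_map("2,3"))
```

The variable has to be set before the first CUDA call in the process, which is why it sits in the very first code cell.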

In [None]:
from difflib import SequenceMatcher
import re
import json
from tqdm.notebook import tqdm
import random

In [None]:
# !pip install datasets transformers[torch]

I use only part of my data, because otherwise the model trains for too long.
8-12 hours of fine-tuning was just fine for my usual task, so I prefer to stick to that here.

In [None]:
FILES = [
    "/home/jovyan/work/dataset/ficbook_pairs.jsonl",
    # # should be first as its index fields are str not integers
    # # otherwise field type can be specified explicitly like
    # # `features=Features({'prompt': Value('string'), 'target': Value('string')})`
    "/home/jovyan/work/dataset/pikabu_pairs.jsonl",
    # "/home/jovyan/work/dataset/librusec_pairs.jsonl"
]

In [None]:
MODEL_PATH = "/home/jovyan/wdc1/models/FRED-T5-1.7B"
# MODEL_PATH = "/home/jovyan/wdc1/models/FRED-T5-large"
TRAINED_SAVE_PATH = "/home/jovyan/work/models/2_fred-t5"

In [None]:
from datasets import load_dataset, Features, Value, Dataset

dataset = load_dataset('json', data_files=FILES)
# dataset = dataset['train']#.train_test_split(test_size=0.1)
dataset, dataset["train"][0]

# He obtayn

I decided to build the training examples right next to the training code itself because
* it is actually fast and
* I see the preprocessing as part of the future model.

So, here is the code.
It finds the parts where the two lines differ and builds the "before" and "after" pair from them.
It also filters out identical pairs, since there is nothing to learn from them.
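
To make the idea concrete, here is a minimal sketch of how `difflib.SequenceMatcher` reports the differing parts of two token lists. The toy English tokens are made up for illustration; the real data is Russian and goes through the tokenizer defined further down.

```python
from difflib import SequenceMatcher

# Spoken form and written form of the same sentence, already split into tokens.
spoken = ["Total ", "- ", "thirty ", "two", "."]
written = ["Total ", "- ", "32", "."]

matcher = SequenceMatcher(a=spoken, b=written, autojunk=False)
spans = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    # Each opcode covers spoken[i1:i2] and written[j1:j2].
    spans.append((tag, "".join(spoken[i1:i2]), "".join(written[j1:j2])))

for span in spans:
    print(span)
```

The `replace` span is the "before" and "after" pair; the `equal` spans are copied into the prompt unchanged.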

In [None]:
from typing import Optional


class Replace(dict):
    def __init__(
        self,
        type: str, text_from: str, text_to: Optional[str]=None,
        *args, **kwargs
    ):
        super().__init__(*args, **kwargs)
        self["type"] = type
        self["text_from"] = text_from
        self["text_to"] = "" if not text_to else text_to

    @property
    def type(self):
        return self["type"]

    @property
    def text_from(self):
        return self["text_from"]

    @property
    def text_to(self):
        return self["text_to"]

    def extend(self, r):
        if self.type != r.type:
            raise Exception("Replace type mismatch")
        self["text_from"] += r["text_from"]
        self["text_to"] += r["text_to"]


class Replaces(list):
    def add(self, r: Replace):
        if self and r.type == self[-1].type:
            self[-1].extend(r)
        else:
            return super().append(r)

In [None]:
re_tokens = re.compile(r"[а-яА-Я]+\s*|\d+(?:\.\d+)?\s*|[^а-яА-Я\d\s]+\s*")


def tokenize(text):
    return re.findall(re_tokens, text)


"|".join(tokenize("ты, да я, да мы c тобой - вместе 2.5."))

In [None]:
re_digits = re.compile(r"\d")


def diff(seq1, seq2):
    sm = SequenceMatcher(
        # lambda x: not re.search(r"\w", x.strip()),
        a=seq1,
        b=seq2,
        autojunk=False
    )
    result = Replaces()
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        # print(tag, " ".join(seq1[i1:i2]), " ".join(seq2[j1:j2]))
        text_from, text_to = "".join(seq1[i1:i2]), "".join(seq2[j1:j2])
        if tag == "equal":
            type = "E"
        elif tag == "replace" and "".join((_.strip() for _ in seq1[i1:i2])) == "".join((_.strip() for _ in seq2[j1:j2])):
            type = "E"
        else:
            if not re.search(re_digits, text_from) and not re.search(re_digits, text_to):
                type = "E"
                text_to = None
            else:
                type = "R"
        result.add(Replace(type, text_from, text_to))
    return result


diff(
    tokenize("Всего - тридцать два с половиной."),
    tokenize("Всего - 32.5.")
)

In [None]:
dataset["train"][1000]

In [None]:
from transformers import GPT2Tokenizer


tokenizer = GPT2Tokenizer.from_pretrained(MODEL_PATH, eos_token='</s>')

One problem with the last training iteration was wrong predictions for long numbers like `125678`.
This may be caused by tokenization: the tokenizer can split a number into pieces that are hard for the model to work with.
Let's check it out now.

In [None]:
good, wrong = [], []
for i in range(100, 1000):
    a = str(i)
    ids = tokenizer.encode(a)
    b = "|".join([tokenizer.decode(_) for _ in ids])
    (wrong if a != b else good).append(b)

In [None]:
len(good), len(wrong)
# (28, 872)

This particular `FRED-T5-large` tokenizer splits the majority of three-digit numbers.
Maybe it would be better to force-split numbers into single digits, like `123456` to `1 2 3 4 5 6`.

Another option is to split numbers into three-digit groups, so that `1234567` turns into `1 234 567`. We try that option first.

In [None]:
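# Option 1 (not used below): split every number into single digits.
def strip_numbers_to_digits(s):
    return " ".join(((" ".join(part) if part.isdigit() else part) for part in s.split()))


# Option 2 (used below): split numbers into three-digit groups, counting from the right.
def strip_numbers(s):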
    result = []
    for part in s.split():
        if part.isdigit():
            while len(part) > 3:
                result.append(part[:- 3 * ((len(part) - 1) // 3)])
                part = part[- 3 * ((len(part) - 1) // 3):]
            if part:
                result.append(part)
        else:
            result.append(part)
    return " ".join(result)


strip_numbers("у нас было 1234567890 пакетиков травы, 750 ампул новокаина, 55555 пакетиков диэтиламида лизергиновой кислоты, солонка, на 1000/2000 наполненная кокаином")
# "у нас было 1 234 567 890 пакетиков травы, 750 ампул новокаина, 55 555 пакетиков диэтиламида лизергиновой кислоты, солонка, на 1000/2000 наполненная кокаином"
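
For comparison, the same three-digit grouping can be sketched with a single regular expression. This is an alternative illustration, not the code used in the rest of the notebook; `group_digits` and `GROUP_BOUNDARY` are names introduced here, and the English sample sentence is made up.

```python
import re

# A zero-width position that is preceded by a digit and followed by
# a whole number of three-digit groups up to the end of the string.
GROUP_BOUNDARY = re.compile(r"(?<=\d)(?=(?:\d{3})+$)")


def group_digits(text: str) -> str:
    parts = []
    for part in text.split():
        # Only pure-digit tokens are regrouped; "1000/2000" stays as it is.
        if part.isdigit():
            part = GROUP_BOUNDARY.sub(" ", part)
        parts.append(part)
    return " ".join(parts)


sample = "1234567890 items, 750 more, 55555 total, 1000/2000 left"
grouped = group_digits(sample)
print(grouped)
```

Like the loop above, this leaves leading zeros intact, which formatting through `int` would not.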

In [None]:
from collections import Counter
from itertools import chain
data = []
occ_limit = len(dataset["train"]) / 100  # rough trim here
added = Counter()
for elem in tqdm(dataset["train"]):
    elem.setdefault("replaces", diff(tokenize(elem["tn"]), tokenize(elem["itn"])))
    if all(_.type == "E" for _ in elem["replaces"]):
        continue
    if "prompt" in elem and "target" in elem:
        continue
    replace_words = list(chain(*(r.text_from.strip().lower().split() for r in elem["replaces"] if r.type != "E")))
    if not any(added[word] < occ_limit for word in replace_words):
        continue
    added.update(replace_words)
    prompt, target = "<SC1>", ""
    etid = 0
    for r in elem["replaces"]:
        if r.type == "E":
            prompt += r.text_from
        else:
            ws_number = len(r.text_to) - len(r.text_to.rstrip())
            prompt += f"[{strip_numbers(r.text_to.rstrip())}]<extra_id_{etid}>{' ' * ws_number}"
            target += f"<extra_id_{etid}> {r.text_from.strip()}\n"
            etid += 1
    elem["prompt"] = f"{prompt}</s>"
    elem["target"] = f"{target}</s>"
    data.append(elem)

This gives us training examples like this one

    <SC1>Временами я думаю, какое применение найти тем [14 697]<extra_id_0> рублям, что лежат уже больше [33]<extra_id_1> лет?

and we want the model to predict text like this

    <extra_id_0> четырнадцати тысячам шестистам девяноста семи
    <extra_id_1> тридцати трёх </s>
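
To show how the two formats fit together, here is a minimal sketch that puts a predicted target back into its prompt. It is not part of the original pipeline: `parse_target` and `fill_prompt` are hypothetical helpers, and the sketch assumes the target follows the `<extra_id_N> text` layout shown above.

```python
import re

SLOT = re.compile(r"\[[^\]]*\]<extra_id_(\d+)>")
SENTINEL = re.compile(r"<extra_id_(\d+)>")


def parse_target(target: str) -> dict:
    """Map each sentinel index to the text predicted for it."""
    pieces = SENTINEL.split(target.replace("</s>", ""))
    # pieces == [text before the first sentinel, id, text, id, text, ...]
    return {int(i): text.strip() for i, text in zip(pieces[1::2], pieces[2::2])}


def fill_prompt(prompt: str, target: str) -> str:
    """Replace every `[written]<extra_id_N>` slot with its predicted spoken form."""
    answers = parse_target(target)
    text = prompt.replace("<SC1>", "").replace("</s>", "")
    # Slots without a prediction are left untouched.
    return SLOT.sub(lambda m: answers.get(int(m.group(1)), m.group(0)), text)


prompt = "<SC1>Временами я думаю, какое применение найти тем [14 697]<extra_id_0> рублям, что лежат уже больше [33]<extra_id_1> лет?</s>"
target = "<extra_id_0> четырнадцати тысячам шестистам девяноста семи\n<extra_id_1> тридцати трёх\n</s>"
restored = fill_prompt(prompt, target)
print(restored)
```

Leaving unmatched slots in place makes a missing prediction easy to spot in the output.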

Let's check what we have added so far, namely the most and the least common __words__.

In [None]:
added.most_common()[:10], added.most_common()[-10:], data[0]

Apart from rare mistakes, the data looks good enough to train on.

The distribution is still too skewed for my taste, as will be shown later.
One fast and simple fix is to iterate over the examples once more and drop those in which **all** the replaced words have already been seen often enough.
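
The filtering rule can be sketched on toy data as follows. The word lists and the limit are made up for illustration, and `balance` is a name introduced here; the notebook derives its limit from the word counts of the real data.

```python
from collections import Counter


def balance(examples, limit):
    """Keep an example only while at least one of its words is still below `limit`."""
    seen = Counter()
    kept = []
    for words in examples:
        if any(seen[word] < limit for word in words):
            kept.append(words)
            seen.update(words)
    return kept, seen


toy = [["two"], ["two"], ["two"], ["hundred", "two"], ["thousand"]]
kept, seen = balance(toy, limit=2)
print(len(kept), dict(seen))
```

Note that a frequent word can still go over the limit when it appears next to a rare one, so the cap is soft: it trims the head of the distribution without losing rare words.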

In [None]:
occ_limit = sum(added.values()) / len(added)
print(occ_limit)
added2 = Counter()
balanced_data = []
for elem in tqdm(data):
    replace_words = list(chain(*[r.text_from.strip().lower().split() for r in elem["replaces"] if r.type != "E"]))
    if any((added2[word] < occ_limit for word in replace_words)):
        balanced_data.append(elem)
        added2.update(replace_words)
len(balanced_data) / len(data)

We have gotten rid of 2/3 of the data we had!
Let's check it visually now.

In [None]:
from matplotlib import pyplot as plt

axs = plt.subplot()
axs.set_yscale('log')
# word counts sorted by frequency: after balancing (added2) and before it (added)
axs.plot([_[1] for _ in added2.most_common()])
axs.plot([_[1] for _ in added.most_common()])

Only redundant data has been lost so far.

In [None]:
del data

In [None]:
dataset = Dataset.from_list([_ for _ in balanced_data]).train_test_split(test_size=0.02)
dataset

In [None]:
def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["prompt"],
        text_target=examples["target"],
        max_length=128,  # NB should affect memory consumption
        truncation=True
    )
    return model_inputs


dataset = dataset.map(preprocess_function, batched=True, num_proc=10)

In [None]:
dataset = dataset.remove_columns(["prompt", "target", "tn", "itn", "orig_index", "text_index", "replaces"])

Just in case, I drop the examples that might have been truncated.

In [None]:
from collections import Counter
c = Counter([len(_["input_ids"]) for _ in dataset["train"]])
sum([v for k, v in c.items() if k < 128]), c

In [None]:
for k, v in dataset.items():
    dataset[k] = [_ for _ in v if 10 < len(_["input_ids"]) < 126]
{k:len(v) for k, v in dataset.items()}

# He trayn

Time to actually train at last!

In [None]:
from transformers import T5ForConditionalGeneration
import torch


model = T5ForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    offload_state_dict=True
)

To train `ruT5-base` instead, use these lines:

```python
!pip install datasets transformers[sentencepiece]
from transformers import T5ForConditionalGeneration, T5Tokenizer
path = "./ruT5-base"
model = T5ForConditionalGeneration.from_pretrained(path)
tokenizer = T5Tokenizer.from_pretrained(path)
```

In [None]:
import tensor_parallel as tp


model = tp.tensor_parallel(
    model,
    ["cuda:0", "cuda:1"]
)

In [None]:
input_ids = tokenizer("A cat sat on a mat", return_tensors="pt").input_ids.to("cuda")
output_ids = tokenizer("A cat sat did not sit on a mat", return_tensors="pt").input_ids.to("cuda")
loss = model(input_ids=input_ids, labels=output_ids).loss
loss.backward()  # check nvidia-smi for gpu memory usage :)

In [None]:
# !pip install bitsandbytes scipy

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq


training_args = Seq2SeqTrainingArguments(
    output_dir=TRAINED_SAVE_PATH,
    # optim="adamw_bnb_8bit",
    optim="adafactor",
    evaluation_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    logging_first_step=True,
    learning_rate=1e-4,
    lr_scheduler_type="constant",
    # gradient_checkpointing=True,
    gradient_checkpointing=False,
    gradient_accumulation_steps=8,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    save_total_limit=20,
    num_train_epochs=2,
    # predict_with_generate=True,
    # fp16=True,
    push_to_hub=False,
    remove_unused_columns=False,
    load_best_model_at_end=True,
    # auto_find_batch_size=True,
    auto_find_batch_size=False,
    dataloader_num_workers=4,
)


In [None]:
import transformers
transformers.logging.set_verbosity_info()

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model),
    # compute_metrics=compute_metrics,
    # optimizers=(adam_bnb_optim, None),
)
trainer.train()

In [None]:
with tp.save_tensor_parallel(model):
    model.save_pretrained(os.path.join(TRAINED_SAVE_PATH, "final"))
    tokenizer.save_pretrained(os.path.join(TRAINED_SAVE_PATH, "final"))

# But most importantly he explayn

In [None]:
import torch
lm_text = '<SC1>я купил [iphone 12X]<extra_id_0> за [142 990]<extra_id_1> руб без [3-x]<extra_id_2> часов полдень и т.д.'
# lm_text = '<SC1>я купил айфон за [14 970]<extra_id_0> рублей'
lm_text = "<SC1>Временами я думаю, какое применение найти тем [14 697]<extra_id_0> рублям, что лежат уже больше [33]<extra_id_1> лет?"
# lm_text = "<SC1>Было у отца [3]<extra_id_0> сына, но не было даже [2-3]<extra_id_1> пиджаков с блёстками за [142 990 руб]<extra_id_2>."
# lm_text = "<SC1>В школе у меня одни [5]<extra_id_0>."
# lm_text = '<SC1>Было у отца [3]<extra_id_0> сына. Старшему было [35]<extra_id_1>, среднему - не меньше [33]<extra_id_2>, а младший на [4]<extra_id_3> младше всех. Бывает.'
lm_text = "<SC1>Временами я думаю, какое применение найти тем [265 948 697]<extra_id_0> рублям, что лежат уже больше [33]<extra_id_1> лет?"
input_ids = torch.tensor([tokenizer.encode(lm_text)]).to("cuda:0")
outputs = model.generate(input_ids, eos_token_id=tokenizer.eos_token_id, early_stopping=True)
print(tokenizer.decode(outputs[0][1:]))

In [None]:
!nvidia-smi