<a href="https://colab.research.google.com/github/semkud/russian_jokes/blob/main/Finetune_and_generate_RuGPTs_deepspeed_megatron.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune RuGPTs with Megatron and DeepSpeed
This notebook shows how to finetune RuGPTs models with Megatron and DeepSpeed, using RuGPT3Small as the example. Note that other models will need more GPU memory.

This notebook is valid for all RuGPTs models except RuGPT3XL.
## Install env

In [None]:
%%bash
rm -rf /usr/local/cuda
ln -s /usr/local/cuda-10.1 /usr/local/cuda

In [None]:
!pip3 install transformers==3.5.0



In [None]:
import subprocess

CUDA_version = [s for s in subprocess.check_output(["nvcc", "--version"]).decode("UTF-8").split(", ") if s.startswith("release")][0].split(" ")[-1]
print("CUDA version:", CUDA_version)

if CUDA_version == "10.0":
    torch_version_suffix = "+cu100"
elif CUDA_version == "10.1":
    torch_version_suffix = "+cu101"
elif CUDA_version == "10.2":
    torch_version_suffix = ""
else:
    torch_version_suffix = "+cu110"

CUDA version: 10.1


In [None]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243


If the install below fails, check your CUDA version and pick the matching build from https://pytorch.org/get-started/previous-versions/

In [None]:
!pip install torch==1.7.1{torch_version_suffix} torchvision==0.8.2{torch_version_suffix} torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.7.1+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.7.1%2Bcu101-cp37-cp37m-linux_x86_64.whl (735.4 MB)
Collecting torchvision==0.8.2+cu101
  Downloading https://download.pytorch.org/whl/cu101/torchvision-0.8.2%2Bcu101-cp37-cp37m-linux_x86_64.whl (12.8 MB)
Collecting torchaudio==0.7.2
  Downloading torchaudio-0.7.2-cp37-cp37m-manylinux1_x86_64.whl (7.6 MB)
Installing collected packages: torch, torchvision, torchaudio
  Attempting uninstall: torch
    Found existing installation: torch 1.10.0+cu111
    Uninstalling torch-1.10.0+cu111:
      Successfully uninstalled torch-1.10.0+cu111
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.11.1+cu111
    Uninstalli

In [None]:
%%writefile setup.sh

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Writing setup.sh


In [None]:
!sh setup.sh

Cloning into 'apex'...
remote: Enumerating objects: 9133, done.
remote: Counting objects: 100% (204/204), done.
remote: Compressing objects: 100% (162/162), done.
remote: Total 9133 (delta 108), reused 69 (delta 37), pack-reused 8929
Receiving objects: 100% (9133/9133), 14.61 MiB | 21.66 MiB/s, done.
Resolving deltas: 100% (6219/6219), done.
  cmdoptions.check_install_build_global(options)
Using pip 21.1.3 from /usr/local/lib/python3.7/dist-packages/pip (python 3.7)
Value for scheme.platlib does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /usr/local/lib/python3.7/dist-packages
sysconfig: /usr/lib/python3.7/site-packages
Value for scheme.purelib does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /usr/local/lib/python3.7/dist-packages
sysconfig: /usr/lib/python3.7/site-packages
Value for scheme.headers does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /us

In [None]:
!git clone  https://github.com/sberbank-ai/ru-gpts

Cloning into 'ru-gpts'...
remote: Enumerating objects: 683, done.
remote: Counting objects: 100% (178/178), done.
remote: Compressing objects: 100% (94/94), done.
remote: Total 683 (delta 110), reused 141 (delta 83), pack-reused 505
Receiving objects: 100% (683/683), 413.81 KiB | 5.17 MiB/s, done.
Resolving deltas: 100% (410/410), done.


In [None]:
!pip install deepspeed==0.3.7

Collecting deepspeed==0.3.7
  Downloading deepspeed-0.3.7.tar.gz (258 kB)
Collecting tensorboardX==1.8
  Downloading tensorboardX-1.8-py2.py3-none-any.whl (216 kB)
Collecting ninja
  Downloading ninja-1.10.2.3-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB)
Building wheels for collected packages: deepspeed
  Building wheel for deepspeed (setup.py) ... done
  Created wheel for deepspeed: filename=deepspeed-0.3.7-py3-none-any.whl size=252514 sha256=c2c749162861b064907a36fd8b6b8f0a1b0b6a4115a1f4e849f95feb069af61a
  Stored in directory: /root/.cache/pip/wheels/3c/d9/4b/82641510ef70fb7af78f938356bbf87503a4e7397a7d19d11e
Successfully built deepspeed
Installing collected packages: tensorboardX, ninja, deepspeed
Successfully installed deepspeed-0.3.7 ninja-1.10.2.3 tensorboardX-1.

## Download files

## Prepare data for parallel
We use a custom implementation of a distributed dataset. For training and evaluation, specify a `file.list` file that contains the paths to the txt files. The files listed in `file.list` are split between the available GPUs. The splitting logic is described by the following code:

```python
shard_size = len(files) // world_size
shard_start = rank * shard_size
shard_end = (rank + 1) * shard_size
files = files[shard_start:shard_end]
```

For more details, see the full dataset code: `src.dataset_rugpt3.RuGpt3TextDataset`.
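To make the splitting rule concrete, here is a minimal sketch that wraps the same four lines in a function and runs it for two processes. The function name `shard_files` and the file names are made up for illustration; they are not part of `ru-gpts`.

```python
def shard_files(files, world_size, rank):
    """Return the slice of `files` assigned to process `rank` (hypothetical helper)."""
    shard_size = len(files) // world_size
    shard_start = rank * shard_size
    shard_end = (rank + 1) * shard_size
    return files[shard_start:shard_end]


files = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
for rank in range(2):
    print(rank, shard_files(files, world_size=2, rank=rank))
# 0 ['a.txt', 'b.txt']
# 1 ['c.txt', 'd.txt']
```

Because of the integer division, files that do not fit into equal shards are left out (`e.txt` here), and a list with fewer files than processes yields empty shards. With one GPU and one file, as in this notebook, the single process gets the whole list.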

In [None]:
!echo train.txt > train.list
!echo valid.txt > valid.list
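With more than one text file, the list can be produced from a directory instead of by hand. The sketch below assumes the list format is one path per line, which matches what the `echo` commands above produce; `write_file_list` and its arguments are hypothetical names, not part of `ru-gpts`.

```python
from pathlib import Path


def write_file_list(data_dir, list_path, pattern="*.txt"):
    """Write the matching file paths from data_dir to list_path, one per line."""
    # Sort so that the order (and therefore the sharding) is reproducible.
    paths = sorted(str(p) for p in Path(data_dir).glob(pattern))
    Path(list_path).write_text("\n".join(paths) + "\n", encoding="utf-8")
    return paths
```

For example, `write_file_list("data/train", "train.list")` would list every `.txt` file under a hypothetical `data/train` directory.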

## Train
Load the model from Huggingface and finetune it on the training texts.

This will take around ten minutes.

In [None]:
!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts

!USE_DEEPSPEED=1 python -m torch.distributed.launch --nproc_per_node 1 ru-gpts/pretrain_gpt3.py \
  --train-data-path "train.list" \
  --test-data-path "valid.list" \
  --max-files-per-process 100 \
  --logging-dir="log" \
  --save model \
  --load-huggingface sberbank-ai/rugpt3small_based_on_gpt2 \
  --save-interval 1000 \
  --log-interval 100 \
  --eval-interval 1000 \
  --eval-iters 100 \
  --model-parallel-size 1 \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --batch-size 1 \
  --seq-length 2048 \
  --max-position-embeddings 2048 \
  --train-iters 1000 \
  --resume-dataloader \
  --distributed-backend "nccl" \
  --lr 0.00015 \
  --lr-decay-style "cosine" \
  --lr-decay-iters 3200 \
  --clip-grad 0.5 \
  --warmup .004 \
  --fp16 \
  --checkpoint-activations \
  --deepspeed-activation-checkpointing \
  --deepspeed \
  --deepspeed_config ru-gpts/src/deepspeed_config/gpt3_small_2048.json


using world size: 1 and model-parallel size: 1 
 > using dynamic loss scaling
> initializing model parallel with size 1
[2022-03-16 20:11:03,194] [INFO] [checkpointing.py:629:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
Pretrain GPT3 model
arguments:
  attention_dropout ............ 0.1
  num_attention_heads .......... 12
  hidden_size .................. 768
  intermediate_size ............ None
  num_layers ................... 12
  layernorm_epsilon ............ 1e-05
  hidden_dropout ............... 0.1
  max_position_embeddings ...... 2048
  vocab_size ................... 30522
  deep_init .................... False
  make_vocab_size_divisible_by . 8
  cpu_optimizer ................ False
  cpu_torch_adam ............... False
  sparse_mode .................. all
  fp16 ......................... True
  fp32_emb

At the end of training, the output should look something like this:

"-----------------------------------------------------------------------------------------

 validation loss at the end of training for test data | LM loss: 3.0002 | LM PPL: 20.090

-----------------------------------------------------------------------------------------"
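The two numbers in the sample line are related. Assuming the LM loss is the mean per-token cross-entropy in nats, perplexity is its exponential, and the sample values are consistent with that:

```python
import math

lm_loss = 3.0002  # LM loss from the sample line above
ppl = math.exp(lm_loss)  # perplexity = exp(mean cross-entropy)
print(f"LM PPL: {ppl:.3f}")  # LM PPL: 20.090
```

This gives a quick sanity check when reading the training log: a lower loss corresponds to an exponentially lower perplexity.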

## Generate

Load the finetuned checkpoint from the `model/` directory and generate samples.

In [None]:
!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts

!python ru-gpts/generate_samples.py \
  --load model/ \
  --model-parallel-size 1 \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --batch-size 1 \
  --seq-length 500 \
  --max-position-embeddings 2048 \
  --distributed-backend "nccl" \
  --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \
  --no-load-optim


Generate Samples
using world size: 1 and model-parallel size: 1 
 > using dynamic loss scaling
> initializing model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
prepare tokenizer done, size 50264
building GPT3 model ...
 > number of parameters on model parallel rank 0: 125231616
Load checkpoint from model/
global rank 0 is loading checkpoint model/1000/mp_rank_00_model_states.pt
  successfully loaded model/1000/mp_rank_00_model_states.pt
Loaded

Context prompt (stop to exit) >>> Вовочка говорит маме
[H[2J
Taken time 39.74


Context: Вовочка говорит маме

GPT: Вовочка говорит маме: – Мама, я больше не хочу спать! – Ну, слава Богу! – А почему? – Тогда я буду писать с утра до вечера…</s>
Василий Иванович построил дивизию. Чапаев спрашивает:– Троцкистская организация может решить, что такое перестройка?</s>
– Комиссар, наши девушки оказывают интим

### Convert checkpoint to Huggingface format

In [None]:
!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts

!python ru-gpts/convert2huggingface.py \
  --load model/ \
  --model-parallel-size 1 \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --max-position-embeddings 2048 \
  --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \
  --no-load-optim \
  --export-huggingface model_hf


using world size: 1 and model-parallel size: 1 
 > using dynamic loss scaling
> initializing model parallel with size 1
prepare tokenizer done, size 50264
building GPT3 model ...
 > number of parameters on model parallel rank 0: 125231616
Load checkpoint from model/
global rank 0 is loading checkpoint model/1000/mp_rank_00_model_states.pt
  successfully loaded model/1000/mp_rank_00_model_states.pt
Loaded
Export to huggingface model  model_hf with config {'vocab_size': 50264, 'n_positions': 2048, 'n_ctx': 2048, 'n_embd': 768, 'n_layer': 12, 'n_head': 12}
Saved huggingface model <class 'src.model.distributed.DistributedDataParallel'>
Exported in huggingface format to model_hf


#### Test load

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [None]:
model = GPT2LMHeadModel.from_pretrained("model_hf")

In [None]:
def load_tokenizer_and_model(model_name_or_path):
  # Load the tokenizer and the model; the model is moved to the GPU.
  tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path)
  model = GPT2LMHeadModel.from_pretrained(model_name_or_path).cuda()
  return tokenizer, model


def generate(
    model, tok, text,
    do_sample=True, max_length=50, repetition_penalty=5.0,
    top_k=5, top_p=0.95, temperature=1,
    num_beams=None,
    no_repeat_ngram_size=3
    ):
  # Encode the prompt and put it on the GPU next to the model.
  input_ids = tok.encode(text, return_tensors="pt").cuda()
  out = model.generate(
      input_ids,
      max_length=max_length,
      repetition_penalty=repetition_penalty,
      do_sample=do_sample,
      top_k=top_k, top_p=top_p, temperature=temperature,
      num_beams=num_beams, no_repeat_ngram_size=no_repeat_ngram_size
      )
  # Decode every returned sequence back to text.
  return list(map(tok.decode, out))

In [None]:
tok, model = load_tokenizer_and_model("model_hf")
generated = generate(model, tok, "Александр Сергеевич Пушкин родился в ", num_beams=10)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
generated[0]

'Александр Сергеевич Пушкин родился в  1805 году. Ведет свою родословную историю с 1705 года.</s>\n—\xa0Василий Иванович, как правильно пишется «ананизм» или «демократия»?    —\xa0Правильно'

In [None]:
with open('valid.txt', encoding='utf-8') as f:
    anec_file = f.read()
anecs = anec_file.split('</s>\n')

In [None]:
import random
random.shuffle(anecs)
test_anecs = anecs[0:300]

In [None]:
min([1,2,3,4])

1

In [None]:
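starts = []
natural_ends = []
for a in test_anecs:
  # Keep only jokes with at least two periods, so there is text left to continue.
  if a.count('.') > 1:
    # Positions of the first '.', '!' and '?'; str.find returns -1 when absent.
    positions = [i for i in (a.find('.'), a.find('!'), a.find('?')) if i != -1]
    # The prompt is the first sentence, including its terminator.
    start = a[0:min(positions) + 1]
    starts.append(start)
    # natural_ends stores the full original joke, not only its ending.
    natural_ends.append(a)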


In [None]:
generated = generate(model, tok, starts[0], num_beams=10)
print(generated[0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Из ментовской сводки новостей:  «За прошедшие сутки в городе H зафиксировано: один пожар, одно ДТП, одно ограбление, одно изнасилование. На пешеходном переходе мужчина и женщина подрались».</s>
Вовочка


In [None]:
neuro_ends = []
for start in starts:
  g = generate(model, tok, start, num_beams=10)[0]
  # Cut the generation at the first '<' (the start of the '</s>' marker), if any.
  num = g.find('<')
  neuro_ends.append(g[0:num] if num != -1 else g)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

UnboundLocalError: ignored
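Cutting at the first `<` also truncates any generation that happens to contain a literal `<` before the end marker. A slightly stricter variant, shown here as a sketch with a hypothetical helper name, cuts only at the full `</s>` marker seen in the outputs above:

```python
def trim_at_eos(text, eos="</s>"):
    """Return the part of `text` before the first `eos` marker (all of it if absent)."""
    head, _, _ = text.partition(eos)
    return head


print(trim_at_eos("2 < 3, said Vovochka.</s>\nNext joke"))
# 2 < 3, said Vovochka.
```

`str.partition` returns the whole string as the first element when the marker is missing, so no exception handling is needed.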

In [None]:
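# Resume after the prompt at index 83, which raised the error above.
for a in range(84,215):
  g = generate(model, tok, starts[a], num_beams=10)[0]
  # Cut the generation at the first '<' (the start of the '</s>' marker), if any.
  num = g.find('<')
  neuro_ends.append(g[0:num] if num != -1 else g)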

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [None]:
starts.pop(83)
print(neuro_ends[83])
print(starts[83])

Идет тетка мимо одного дома и видит, как лужайку косит какой-то мужик в противогазе, защитном костюме и со всем снаряжением. Тетка спрашивает:    — Мужик, что это у тебя в штанах шевелится
Идет тетка мимо одного дома и видит, как лужайку косит какой-то мужик в противогазе, защитном костюме и со всем снаряжением.
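The rerun from index 84 and the manual `pop(83)` can be avoided by letting the loop skip prompts that fail. The sketch below is one way to do that; `generate_all` and `fake_generate` are hypothetical names, and `fake_generate` only stands in for `generate(model, tok, start, num_beams=10)[0]` so the example runs without a GPU.

```python
def generate_all(prompts, generate_one):
    """Run generate_one on every prompt; return (results, failed_indices)."""
    results, failed = [], []
    for i, prompt in enumerate(prompts):
        try:
            results.append(generate_one(prompt))
        except Exception as err:
            # Keep going, but remember which prompt failed and keep the lists aligned.
            print(f"prompt {i} failed: {err!r}")
            results.append(None)
            failed.append(i)
    return results, failed


def fake_generate(prompt):
    # Stand-in for the real model call; fails on an empty prompt.
    if not prompt:
        raise ValueError("empty prompt")
    return prompt + " ..."


results, failed = generate_all(["a.", "", "b."], fake_generate)
print(failed)  # [1]
```

Since failed prompts keep a `None` placeholder, the same indices can afterwards be dropped from `starts`, `natural_ends` and the results in one pass, which keeps the three lists aligned.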


In [None]:
from pathlib import Path

print(len(starts), len(natural_ends), len(neuro_ends))
sep = '\nsep@rator\n'
# Prefix each of the first 214 items with the separator line.
str_starts = ''.join(sep + s for s in starts[:214])
str_natural = ''.join(sep + s for s in natural_ends[:214])
str_neuro = ''.join(sep + s for s in neuro_ends[:214])
# write_text opens, writes and closes the file; it returns the number of characters written.
Path('starts.txt').write_text(str_starts, encoding='UTF-8')
Path('natural_ends.txt').write_text(str_natural, encoding='UTF-8')
Path('neuro_ends.txt').write_text(str_neuro, encoding='UTF-8')

214 214 214


33394
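To read the three files back, the text can be split on the same separator. This is a sketch with a hypothetical helper name; it assumes no item contains the `sep@rator` line itself.

```python
SEP = "\nsep@rator\n"


def read_items(path):
    """Split a file laid out as SEP + item + SEP + item ... back into a list of items."""
    with open(path, encoding="UTF-8") as f:
        text = f.read()
    # The text starts with SEP, so the first piece is empty and is dropped.
    return text.split(SEP)[1:]
```

For example, `read_items('starts.txt')` should return the prompts in their original order, so that index `i` in all three files refers to the same joke.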