# Training GPT-2 from scratch
This notebook trains GPT-2 from scratch in Google Colab.

As an experiment, it also trains a GPT-2 model whose weights are tied across transformer blocks.
The intuition is that weight tying might be beneficial, because rules learned in one transformer block can be reused in the other blocks.
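
A minimal PyTorch sketch of the idea, not the code used in this notebook: the tied variant applies one and the same block object at every depth, so all layers share a single set of weights. `ToyBlock`, `BlockStack` and `count_parameters` are hypothetical names, and the block is a simplified stand-in (MLP only) for a real GPT-2 block.

```python
import torch
from torch import nn


class ToyBlock(nn.Module):
    """Simplified stand-in for a transformer block (hypothetical, MLP only)."""

    def __init__(self, n_embed):
        super().__init__()
        self.ln = nn.LayerNorm(n_embed)
        self.fc = nn.Linear(n_embed, 4 * n_embed)
        self.proj = nn.Linear(4 * n_embed, n_embed)

    def forward(self, x):
        # Pre-norm residual MLP; a real GPT-2 block also contains self-attention.
        return x + self.proj(nn.functional.gelu(self.fc(self.ln(x))))


class BlockStack(nn.Module):
    """Stack of n_layer blocks, optionally sharing one block's weights."""

    def __init__(self, n_layer, n_embed, tied):
        super().__init__()
        if tied:
            shared = ToyBlock(n_embed)
            blocks = [shared] * n_layer          # same object at every depth
        else:
            blocks = [ToyBlock(n_embed) for _ in range(n_layer)]
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


def count_parameters(model):
    # parameters() yields each shared tensor only once.
    return sum(p.numel() for p in model.parameters())
```

Both variants run the same number of block applications per forward pass, so tying reduces the parameter count but not the compute.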

Results obtained:
*   Validation loss for vanilla GPT-2: 3.2658
*   Validation loss for GPT-2 with tied parameters across transformer blocks: 3.5591

The experimental model has a higher loss but 6x fewer parameters in the transformer block layers.
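
The 6x figure matches the `n_layer = 6` setting used in the training cell: six independent blocks are replaced by one shared block. A back-of-the-envelope count, assuming the standard GPT-2 block layout (two layer norms, a fused query/key/value projection, an attention output projection, and a 4x MLP); the adapted implementation may differ in details, and `block_params` is a hypothetical helper.

```python
def block_params(n_embed):
    """Parameter count of one GPT-2 block (assumed standard layout)."""
    d = n_embed
    layer_norms = 2 * (2 * d)        # scale and bias for two layer norms
    attn_qkv = d * 3 * d + 3 * d     # fused query/key/value projection
    attn_proj = d * d + d            # attention output projection
    mlp_fc = d * 4 * d + 4 * d       # MLP expansion to 4 * d
    mlp_proj = 4 * d * d + d         # MLP contraction back to d
    return layer_norms + attn_qkv + attn_proj + mlp_fc + mlp_proj


n_embed, n_layer = 512, 6            # values used in this notebook
untied = n_layer * block_params(n_embed)
tied = block_params(n_embed)         # one block reused at every depth
print(untied, tied, untied // tied)  # 18914304 3152384 6
```

Embedding parameters are not included in this count, so the reduction for the whole model is smaller than 6x.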

## Dataset and implementation details
* The dataset is wikitext-2.
* The GPT-2 model implementation is taken from https://github.com/lopuhin/transformer-lm and adapted to the Colab environment, with added code for the weight-tying experiment.

# References
* "Language Models are Unsupervised Multitask Learners", https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
* "Attention is all you need", https://arxiv.org/pdf/1706.03762.pdf

## Steps to reproduce
Colab sessions are time-limited and may terminate due to network issues. Because of this, the training data and the model are kept on the mounted Google Drive.
Choose a base path on Google Drive and modify base_path below accordingly:

In [0]:
from google.colab import drive
drive.mount('/content/drive')
base_path='drive/My Drive/Colab Notebooks/lopuhin_transformer_lm'

# Upload dataset
* Create a sub-folder 'data' in the base folder, and upload train.txt, test.txt and valid.txt from the wikitext-2 dataset
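
Before starting a long run, it can help to confirm that the three files are where the later cells expect them. A small sketch; `missing_dataset_files` and `EXPECTED_FILES` are hypothetical names, not part of the adapted code.

```python
from pathlib import Path

# File names listed in the upload step above.
EXPECTED_FILES = ("train.txt", "valid.txt", "test.txt")


def missing_dataset_files(data_dir):
    """Return the expected dataset file names that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]
```

In this notebook it could be called as `missing_dataset_files(base_path + '/data')`; an empty list means all three files were found.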

In [0]:
!git clone https://github.com/semicontinuity/nlp.git

In [0]:
!(cd nlp; git pull; git reset --hard)

In [0]:
!pip install attrs json-log-plots fire matplotlib numpy sentencepiece torch tqdm




# Create input dataset

In [0]:
data_path = base_path + '/data'

from nlp.lopuhin_transformer_lm.lm import data
data.sp_train(
    data_path,
    data_path + '/sp-text.txt',
    data_path + '/sp-model',
)
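
SentencePiece writes its results next to the given model prefix as `<prefix>.model` and `<prefix>.vocab`, which is why the training cell further down points at `sp-model.model`. A sketch of that naming, assuming the third argument of `sp_train` is passed through unchanged as the SentencePiece model prefix; `sp_output_files` is a hypothetical helper.

```python
def sp_output_files(model_prefix):
    """Files SentencePiece creates for a given model prefix."""
    return model_prefix + ".model", model_prefix + ".vocab"


model_file, vocab_file = sp_output_files("data/sp-model")
print(model_file)  # data/sp-model.model
print(vocab_file)  # data/sp-model.vocab
```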

In [0]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# Colab provides at most one GPU, and it is not guaranteed to be available
gpu = GPUs[0] if GPUs else None

def printm():
    """Print free host RAM, the size of this process, and GPU memory usage."""
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    if gpu is None:
        print("No GPU detected")
        return
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(
        gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))

printm()

# Train/Evaluate model

In [0]:
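# Keep exactly one of these active. Judging by the names, '/work' is the run
# directory for vanilla GPT-2 and '/work-tied-blocks' is for the tied-blocks experiment.
# work_path = data_path + '/work'
work_path = data_path + '/work-tied-blocks'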

In [0]:
from nlp.lopuhin_transformer_lm.lm import main

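only_validate = True  # True: only compute the validation loss; set to False to train
tied_blocks = True    # True: run the experimental model with tied transformer blocks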

main.main(
        run_path      = work_path,
        dataset_path  = base_path + '/data/encoded',
        sp_model_path = base_path + '/data/sp-model.model',
        batch_size    = 26,  # per GPU: 36 with 16 GB of GPU RAM, 26 with 11.4 GB
        epochs        = 10,
        g_accum_gradients        = 2,  # accumulate gradients N times (globally)
        gradient_checkpointing   = False, # set to True to save GPU memory at the cost of extra compute
        n_ctx         = 256,
        n_embed       = 512,
        n_head        = 4,
        n_layer       = 6,
        epoch_pbar_refresh_every = 50,
        log_every     = 1000,
        save_every    = 1000,
        validate_every= 1000,
        only_validate = only_validate,
        tied_blocks   = tied_blocks,
)
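
For a sense of scale, the settings above can be turned into the amount of data behind one optimizer step. This is a sketch that assumes a single GPU and that `g_accum_gradients` is the number of batches accumulated before each update; `sequences_per_update` and `tokens_per_update` are hypothetical helpers.

```python
def sequences_per_update(batch_size, g_accum_gradients):
    """Sequences contributing to one optimizer step (single GPU assumed)."""
    return batch_size * g_accum_gradients


def tokens_per_update(batch_size, g_accum_gradients, n_ctx):
    """Tokens contributing to one optimizer step, with n_ctx tokens per sequence."""
    return sequences_per_update(batch_size, g_accum_gradients) * n_ctx


print(sequences_per_update(26, 2))    # 52
print(tokens_per_update(26, 2, 256))  # 13312
```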

*   Validation loss for vanilla GPT-2: 3.2658
*   Validation loss for GPT-2 with tied parameters across transformer blocks: 3.5591
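
The two losses are easier to compare as perplexities. The sketch below assumes the reported value is the mean cross-entropy in nats per SentencePiece token; under that assumption the perplexities are roughly 26 for the vanilla model and 35 for the tied model. `perplexity` is a hypothetical helper.

```python
import math


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss given in nats per token."""
    return math.exp(loss)


vanilla = perplexity(3.2658)
tied = perplexity(3.5591)
print(f"vanilla: {vanilla:.1f}, tied: {tied:.1f}")
```

These are per-token values for this notebook's tokenizer, so they are not directly comparable to word-level perplexities published for wikitext-2.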