
## Training a GPT-2 language model

This notebook trains a GPT-2 model on a custom dataset that we first encode. We then interact with our customized model. The dataset file is `dset.txt`.

## Prerequisites and Setup

In [None]:
%%shell

wget https://raw.githubusercontent.com/nshepperd/gpt-2/finetuning/train.py
wget https://raw.githubusercontent.com/nshepperd/gpt-2/finetuning/src/load_dataset.py
wget https://raw.githubusercontent.com/nshepperd/gpt-2/finetuning/encode.py
wget https://raw.githubusercontent.com/nshepperd/gpt-2/finetuning/src/accumulate.py
wget https://raw.githubusercontent.com/nshepperd/gpt-2/finetuning/src/memory_saving_gradients.py

wget https://github.com/PacktPublishing/Transformers-for-Natural-Language-Processing/raw/main/Chapter06/gpt-2-train_files/dset.txt

## Step 1: Initial steps of the training process

The program now clones OpenAI's GPT-2 repository and not N Shepperd's repository:

In [2]:
!git clone https://github.com/openai/gpt-2.git

Cloning into 'gpt-2'...
remote: Enumerating objects: 233, done.
remote: Total 233 (delta 0), reused 0 (delta 0), pack-reused 233
Receiving objects: 100% (233/233), 4.38 MiB | 11.58 MiB/s, done.
Resolving deltas: 100% (124/124), done.


We have already downloaded the files we need from N Shepperd's repository to
train the GPT-2 model.

The program now installs the requirements:

In [None]:
import os  # must be imported again whenever the VM restarts

os.chdir("/content/gpt-2")
!pip3 install -r requirements.txt

This notebook requires toposort, which is a topological sort algorithm:

In [None]:
!pip install toposort

>Do not restart the runtime right after installing the requirements. Check the TensorFlow version first, so that you restart the VM at most once during your session, and only if necessary.

We now check the TensorFlow version to make sure we are running version tf 1.x:

In [5]:
# Colab has both tf 1.x and tf 2.x installed
# Restart the runtime using 'Runtime' -> 'Restart runtime...'
%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)

TensorFlow 1.x selected.
1.15.2
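
The training scripts import `tensorflow.contrib`, which exists only in TensorFlow 1.x, so it can help to fail early when the wrong version is active. The following is a minimal sketch of such a check; the helper name `is_tf1` is hypothetical and is not part of the notebook or of TensorFlow.

```python
def is_tf1(version: str) -> bool:
    """Return True when a version string such as '1.15.2' is a 1.x release."""
    major = version.split(".")[0]
    return major == "1"


# In the notebook, pass tf.__version__ instead of the literal strings used here.
for candidate in ("1.15.2", "2.4.1"):
    verdict = "OK" if is_tf1(candidate) else "restart the runtime and select TF 1.x"
    print(candidate, "->", verdict)
```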


The program now downloads the 117M parameter GPT-2 model we will train with
our dataset:

In [1]:
import os # after runtime is restarted

os.chdir("/content/gpt-2")
!python3 download_model.py '117M' #creates model directory

Fetching checkpoint: 1.00kit [00:00, 1.13Mit/s]                                                     
Fetching encoder.json: 1.04Mit [00:00, 3.42Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 1.05Mit/s]                                                   
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:35, 14.0Mit/s]                                  
Fetching model.ckpt.index: 6.00kit [00:00, 5.14Mit/s]                                               
Fetching model.ckpt.meta: 472kit [00:00, 1.77Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 1.69Mit/s]                                                       


We will copy the dataset and the 117M parameter GPT-2 model into the src
directory:

In [2]:
# Copying the Project Resources to src
!cp /content/dset.txt /content/gpt-2/src/
!cp -r /content/gpt-2/models/ /content/gpt-2/src/

The goal is to group all of the resources we need to train the model in the src project directory.

We will now go through the N Shepperd training files.

## Step 2: The N Shepperd training files

The training files we will use come from N Shepperd's GitHub repository. We
already downloaded them in the setup cell. We will now copy them into our project directory:

In [3]:
!cp /content/train.py /content/gpt-2/src/
!cp /content/load_dataset.py /content/gpt-2/src/
!cp /content/encode.py /content/gpt-2/src/
!cp /content/accumulate.py /content/gpt-2/src/
!cp /content/memory_saving_gradients.py /content/gpt-2/src/

The training files are now in place and ready to use. Let's start with `encode.py`.
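
Before moving on, it may be worth confirming that every file really landed in the project directory. The sketch below uses the file names from the copy cell above; the helper `missing_files` is hypothetical and only illustrates the check.

```python
import os

# File names taken from the copy cell above.
TRAINING_FILES = [
    "train.py",
    "load_dataset.py",
    "encode.py",
    "accumulate.py",
    "memory_saving_gradients.py",
]


def missing_files(directory, names=TRAINING_FILES):
    """Return the names that are not regular files inside `directory`."""
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]


# Hypothetical usage in the notebook; an empty list means every file was copied:
# print(missing_files("/content/gpt-2/src"))
```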

## Step 3: Encoding the dataset

The dataset must be encoded before we can train the model on it.

The dataset is loaded, encoded, and saved in out.npz when we run the cell:

In [4]:
os.chdir("/content/gpt-2/src/")
model_name="117M"

!python /content/gpt-2/src/encode.py dset.txt out.npz

2021-05-20 05:45:18.685383: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Reading files
100% 1/1 [00:10<00:00, 10.22s/it]
Writing out.npz
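
An `.npz` file is a zip archive that holds one `.npy` member per NumPy array, so its contents can be listed with the standard library alone. The sketch below does that. The helper name `npz_array_names` is hypothetical, and the expectation that `encode.py` stores the token ids as unnamed arrays (`arr_0`, `arr_1`, and so on) is an assumption that has not been checked against the script.

```python
import zipfile


def npz_array_names(path):
    """List the arrays stored in an .npz archive (one .npy member per array)."""
    suffix = ".npy"
    with zipfile.ZipFile(path) as archive:
        return [
            name[: -len(suffix)]
            for name in archive.namelist()
            if name.endswith(suffix)
        ]


# Hypothetical usage after the encoding cell has run:
# print(npz_array_names("/content/gpt-2/src/out.npz"))
```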


Our GPT-2 117M model is ready to be trained.

## Step 4: Training the model

We will now train the GPT-2 117M model on our dataset. We send the name of our
encoded dataset to the program:

In [5]:
os.chdir("/content/gpt-2/src/")
!python train.py --dataset out.npz

2021-05-20 05:53:44.657729: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "train.py", line 28, in <module>
    import model, sample, encoder
  File "/content/gpt-2/src/model.py", line 3, in <module>
    from tensorflow.contrib.training import HParams
ModuleNotFoundError: No module named 'tensorflow.contrib'


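When you run the cell, it will train until you stop it. The trained model is saved after 1,000 steps, so stop the training once it has passed 1,000 steps. The saved model checkpoints are in `/content/gpt-2/src/checkpoint/run1`.

The traceback shown above comes from a run in which TensorFlow 2.x was active. `tensorflow.contrib` was removed in TensorFlow 2.0, so make sure the version check in Step 1 reports 1.x before you start training.

You can also stop the training from the keyboard with Ctrl + M followed by I, the Colab shortcut for interrupting execution.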
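
Step 5 copies files named `model-1000.*`. If training was stopped at a different point, the step number in those file names will differ. Here is a minimal sketch for finding the most recent saved step, assuming checkpoints follow the `model-<step>.index` naming seen in this notebook. The helper name `latest_step` is hypothetical.

```python
import os
import re


def latest_step(checkpoint_dir):
    """Return the highest <step> among files named model-<step>.index, or None."""
    pattern = re.compile(r"^model-(\d+)\.index$")
    steps = [
        int(match.group(1))
        for match in map(pattern.match, os.listdir(checkpoint_dir))
        if match
    ]
    return max(steps) if steps else None


# Hypothetical usage once training has saved at least one checkpoint:
# print(latest_step("/content/gpt-2/src/checkpoint/run1"))
```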

The program manages the optimizer and gradients with the `/content/gpt-2/src/memory_saving_gradients.py` and `/content/gpt-2/src/accumulate.py` programs.

`train.py` contains a complete list of parameters that can be tweaked to modify the training process. Run the notebook without changing them first. Then, if you wish, you can experiment with the training parameters and see if you can obtain better results.

## Step 5: Creating a training model directory

This section will create a temporary directory for our model, store the information we need, and rename it to replace the directory of the GPT-2 117M model we downloaded.

We start by creating a temporary directory named `tgmodel`:

In [None]:
run_dir = '/content/gpt-2/models/tgmodel'
if not os.path.exists(run_dir):
  os.makedirs(run_dir)

We then copy the checkpoint files that contain the trained parameters we saved
when we trained our model.

In [None]:
!cp /content/gpt-2/src/checkpoint/run1/model-1000.data-00000-of-00001 /content/gpt-2/models/tgmodel
!cp /content/gpt-2/src/checkpoint/run1/checkpoint /content/gpt-2/models/tgmodel
!cp /content/gpt-2/src/checkpoint/run1/model-1000.index /content/gpt-2/models/tgmodel
!cp /content/gpt-2/src/checkpoint/run1/model-1000.meta /content/gpt-2/models/tgmodel

Our `tgmodel` directory now contains the trained parameters of our GPT-2 model.

We will now retrieve the hyperparameters and vocabulary files from the GPT-2
117M model we downloaded:

In [None]:
!cp /content/gpt-2/models/117M/encoder.json /content/gpt-2/models/tgmodel
!cp /content/gpt-2/models/117M/hparams.json /content/gpt-2/models/tgmodel
!cp /content/gpt-2/models/117M/vocab.bpe /content/gpt-2/models/tgmodel

Our `tgmodel` directory now contains our complete customized GPT-2 117M model.
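
The two copy cells above can also be written as one Python function, which makes it easier to change the step number. This is a sketch under the assumption that the checkpoint and 117M directories are laid out as shown in this section. `assemble_model_dir` is a hypothetical helper and is not part of either repository.

```python
import os
import shutil


def assemble_model_dir(checkpoint_dir, base_model_dir, target_dir, step=1000):
    """Copy trained weights plus tokenizer and hyperparameter files into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    # Files written by training (names as used in the copy cell above).
    checkpoint_files = [
        "checkpoint",
        f"model-{step}.data-00000-of-00001",
        f"model-{step}.index",
        f"model-{step}.meta",
    ]
    # Files reused unchanged from the downloaded 117M model.
    base_files = ["encoder.json", "hparams.json", "vocab.bpe"]
    for name in checkpoint_files:
        shutil.copy(os.path.join(checkpoint_dir, name), target_dir)
    for name in base_files:
        shutil.copy(os.path.join(base_model_dir, name), target_dir)
    return sorted(os.listdir(target_dir))


# Hypothetical usage with the paths from this notebook:
# assemble_model_dir("/content/gpt-2/src/checkpoint/run1",
#                    "/content/gpt-2/models/117M",
#                    "/content/gpt-2/models/tgmodel")
```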

Our last step is to rename the original GPT-2 model we downloaded and set the
name of our model to 117M:

In [None]:
# Renaming the model directories
!mv /content/gpt-2/models/117M  /content/gpt-2/models/117M_OpenAI
!mv /content/gpt-2/models/tgmodel  /content/gpt-2/models/117M

Our trained model is now the one the cloned OpenAI GPT-2 repository will run.
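
Running the rename cell a second time would not fail cleanly: when the destination directory already exists, `mv` moves the source inside it. The following sketch performs the same swap in Python and refuses to run twice. The function name `swap_in_trained_model` and its guard logic are assumptions added for illustration.

```python
import os


def swap_in_trained_model(models_dir, trained="tgmodel", original="117M",
                          backup="117M_OpenAI"):
    """Rename original -> backup and trained -> original, never overwriting a backup."""
    trained_path = os.path.join(models_dir, trained)
    original_path = os.path.join(models_dir, original)
    backup_path = os.path.join(models_dir, backup)
    if os.path.exists(backup_path):
        raise FileExistsError(f"{backup_path} already exists; was the swap already done?")
    if not os.path.isdir(trained_path):
        raise FileNotFoundError(f"{trained_path} is missing")
    os.rename(original_path, backup_path)   # keep the downloaded model as a backup
    os.rename(trained_path, original_path)  # the trained model takes the 117M name


# Hypothetical usage with the path from this notebook:
# swap_in_trained_model("/content/gpt-2/models")
```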

Let's interact with our model!

## Step 6: Generating Unconditional Samples

We will interact with a GPT-2 117M model trained on our dataset. We
will first generate an unconditional sample that requires no input on our part. Then we will enter a context paragraph to obtain a conditional text completion response from our trained model.

Let's first run an unconditional sample:

In [None]:
os.chdir("/content/gpt-2/src")
!python generate_unconditional_samples.py --model_name '117M'

You will not be prompted to enter context sentences since this is an unconditional sample generator.

>To stop the cell, double-click on the run button of the cell or press Ctrl + M followed by I.

The result is random but makes sense from a grammatical perspective. From a
semantic point of view, the result is not as interesting because we provided no context. But still, the process is remarkable. It invents posts, writes a title, dates it, invents organizations and addresses, produces a topic, and even imagines web links!

The result of an unconditional text generator is interesting but not convincing.

## Step 7: Interactive Context and Completion Examples

We will now run a conditional sample. The context we enter conditions the model to complete the text the way we want, by generating tailor-made paragraphs.

If necessary, take a few minutes to go back through the parameters.
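
The interactive cell in this step passes `--temperature 0.8 --top_k 40`. As a rough illustration of what these two parameters control, here is a pure-Python sketch of temperature scaling followed by top-k filtering. It shows the generic technique on a toy list of scores and is not the repository's TensorFlow code. The function names, and the choice to treat `top_k <= 0` as "keep every token", are assumptions of this sketch.

```python
import math
import random


def top_k_temperature_probs(logits, temperature=0.8, top_k=40):
    """Return sampling probabilities after temperature scaling and top-k filtering."""
    # Temperatures below 1 sharpen the distribution; above 1 they flatten it.
    scaled = [value / temperature for value in logits]
    # Assumption of this sketch: top_k <= 0 means "keep every token".
    k = len(scaled) if top_k <= 0 else min(top_k, len(scaled))
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = set(ranked[:k])
    peak = max(scaled)  # subtracting the maximum keeps exp() numerically stable
    weights = [math.exp(s - peak) if i in keep else 0.0 for i, s in enumerate(scaled)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, temperature=0.8, top_k=40, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = top_k_temperature_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy example with four "tokens" and top_k=2: only the two best can be drawn.
toy_logits = [2.0, 1.0, 0.5, -1.0]
print(top_k_temperature_probs(toy_logits, temperature=0.8, top_k=2))
print(sample_token(toy_logits, temperature=0.8, top_k=2))
```

Because the final step is a random draw, two runs with the same context can produce different completions.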

The program prompts us to enter the context.

Let's enter the following paragraph written by Immanuel Kant.

```
Human reason, in one sphere of its cognition, is called upon to
consider questions, which it cannot decline, as they are presented by
its own nature, but which it cannot answer, as they transcend every
faculty of the mind.
```

Run the cell and explore the magic:

In [None]:
os.chdir("/content/gpt-2/src")
!python interactive_conditional_samples.py --temperature 0.8 --top_k 40 --model_name '117M'

Wow! I doubt anybody can see the difference between the text completion produced by our trained GPT-2 model and one written by a human. The output may also differ at each run.

In fact, I think our model could outperform many humans in this abstract exercise in philosophy, reason, and logic!


## Conclusions

We can draw some conclusions from our experiment:

- A well-trained transformer model can produce text completion that is
human-level.
- A GPT-2 model can almost reach human level in text generation on complex
and abstract reasoning.
- Text context is an efficient way of conditioning a model by demonstrating
what is expected.
- Text completion is text generation based on text conditioning if context
sentences are provided.

Bear in mind that our trained GPT-2 model will react like a human. If you enter a short, incomplete, uninteresting, or tricky context, you will obtain puzzled or bad results. GPT-2 expects the best out of us, as in real life!