# Text Generator with Fine-Tuned GPT-2

This notebook fine-tunes GPT-2 to generate text in the style of the given training data. To do this, we use Google Colab's free GPU to accelerate training, together with the `gpt-2-simple` library.


## Initialization

In [None]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



## GPU

Verify which GPU is active by running the cell below.

In [None]:
!nvidia-smi

Tue Oct  6 09:26:10 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Downloading GPT-2

We first download the GPT-2 model.
We use the smallest GPT-2 size, `124M` (the default "small" model), which takes about 500 MB on disk.

In [None]:
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 278Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 117Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 368Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:06, 79.2Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 279Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 183Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 203Mit/s]                                                       


## Mounting Google Drive

The best way to get input text to-be-trained into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

We mount Google Drive in the VM:

In [None]:
gpt2.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Uploading a Text File to be Trained to Colaboratory

In [None]:
file_name = "quotes_gpt.txt"

In [None]:
gpt2.copy_file_from_gdrive(file_name)

## Finetune GPT-2

We create a persistent TensorFlow session that stores the training configuration, then run the training for the specified number of `steps`. (To have the fine-tuning run indefinitely, set `steps = -1`.)

The model checkpoints are saved in `checkpoint/run1` by default. A checkpoint is written every 500 steps (configurable through `save_every`) and whenever the cell is stopped.

Other optional-but-helpful parameters for `gpt2.finetune`:

* **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to resume training from an existing checkpoint.
* **`sample_every`**: Number of steps between printed example outputs.
* **`print_every`**: Number of steps between printed training progress lines.
* **`learning_rate`**: Learning rate for the training.
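
To make the interplay of `steps`, `print_every`, `sample_every`, and `save_every` concrete, the sketch below lists the steps at which each event would fire for the values used in this notebook. It is plain Python and does not call `gpt-2-simple`; the helper name `event_steps` is hypothetical, and the assumption is that an event fires whenever the step counter is a multiple of its interval.

```python
def event_steps(total_steps, interval):
    """Return the 1-based steps at which an event with the given interval fires."""
    return [step for step in range(1, total_steps + 1) if step % interval == 0]


steps = 1000
schedule = {
    "print_every": event_steps(steps, 10),
    "sample_every": event_steps(steps, 200),
    "save_every": event_steps(steps, 500),
}

for name, fired in schedule.items():
    print(f"{name}: {len(fired)} events, first at step {fired[0]}, last at step {fired[-1]}")
```

A smaller `save_every` costs more disk writes but loses less work if the Colab session disconnects.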

In [None]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=1000,
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=200,
              save_every=500
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:26<00:00, 26.66s/it]


dataset has 4979602 tokens
Training...
[10 | 28.51] loss=3.73 avg=3.73
[20 | 50.25] loss=3.57 avg=3.65
[30 | 72.40] loss=3.23 avg=3.51
[40 | 94.94] loss=3.54 avg=3.51
[50 | 118.12] loss=3.54 avg=3.52
[60 | 141.16] loss=3.49 avg=3.51
[70 | 163.90] loss=3.42 avg=3.50
[80 | 186.83] loss=3.20 avg=3.46
[90 | 209.87] loss=3.35 avg=3.45
[100 | 232.72] loss=3.35 avg=3.44
[110 | 255.58] loss=3.75 avg=3.47
[120 | 278.54] loss=3.44 avg=3.47
[130 | 301.47] loss=3.43 avg=3.46
[140 | 324.30] loss=3.51 avg=3.47
[150 | 347.19] loss=3.32 avg=3.46
[160 | 370.15] loss=3.45 avg=3.46
[170 | 392.99] loss=3.42 avg=3.45
[180 | 415.77] loss=3.27 avg=3.44
[190 | 438.68] loss=3.40 avg=3.44
[200 | 461.64] loss=3.70 avg=3.45
't and that's great. That's good because in a world where every second counts, it's all you can think about, just like a thousand years ago, that might make life exciting. But even if they didn't make life exciting, you wouldn't know it at that moment. You can't ever really know which side you
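
Each progress line above appears to have the form `[step | elapsed seconds] loss=... avg=...`. Under that assumption, the following sketch parses a few of the lines to estimate the time per step and the duration of the full run. The regular expression and the helper name `parse_log` are illustrative, not part of `gpt-2-simple`.

```python
import re

# Matches lines such as "[10 | 28.51] loss=3.73 avg=3.73".
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def parse_log(text):
    """Extract (step, elapsed_seconds, loss, avg) tuples from training output."""
    rows = []
    for match in LOG_LINE.finditer(text):
        step, elapsed, loss, avg = match.groups()
        rows.append((int(step), float(elapsed), float(loss), float(avg)))
    return rows


log = """
[10 | 28.51] loss=3.73 avg=3.73
[100 | 232.72] loss=3.35 avg=3.44
[200 | 461.64] loss=3.70 avg=3.45
"""

rows = parse_log(log)
first, last = rows[0], rows[-1]
seconds_per_step = (last[1] - first[1]) / (last[0] - first[0])
print(f"about {seconds_per_step:.2f} s per step")
print(f"estimated time for 1000 steps: {seconds_per_step * 1000 / 60:.0f} min")
```

On this run the estimate comes to roughly 2.3 seconds per step, so 1000 steps take a little under 40 minutes on the Tesla T4.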

We copy the checkpoint folder to Google Drive so that the trained model is not lost.

In [None]:
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

## Load a Trained Model Checkpoint

**Load**

In [None]:
#gpt2.copy_checkpoint_from_gdrive(run_name='run1')

In [None]:
#sess = gpt2.start_tf_sess()
#gpt2.load_gpt2(sess, run_name='run1')

Loading checkpoint checkpoint/run1/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/run1/model-1000


## Generate Text From The Trained Model

We generate text with the fine-tuned model. `generate` produces a single text from the loaded model.

In [None]:
gpt2.generate(sess, run_name='run1')

There are two great traits of a teacher: (1) a passion for teaching, and (2) a passion for teaching.

For most of the history of our species, there have been two kinds of teachers: those who are able to give us a good information about the world, and those who are able to put words in our brains. We might like to think that all kinds of teachers can do this, but it's hard to believe that anyone can do that.

The stage of teaching is always changing and the end of the stage is always changing.

I used to think that the best way to learn is by doing. But I didn't know that knowledge was an act: it takes a person to do something about the world, and the world doesn’t know what the hell is going on, so it leaves things to chance.

The meanings of words are the same as the meanings of words: they are the same as the meanings of words.

There has been a great deal of self-education since the time of the poet's father, and I have often endeavoured to find out why this was so often the case. T

We generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, we can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).
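
As a rough illustration of how `nsamples` and `batch_size` relate, the sketch below computes how many parallel passes are needed. It assumes that `nsamples` must be a multiple of `batch_size`, which is worth verifying against the library; `plan_batches` is a hypothetical helper, not a `gpt-2-simple` function.

```python
def plan_batches(nsamples, batch_size):
    """Return the list of batch sizes needed to produce nsamples samples."""
    if nsamples % batch_size != 0:
        # Assumption: the generator expects whole batches only.
        raise ValueError("nsamples must be a multiple of batch_size")
    return [batch_size] * (nsamples // batch_size)


print(plan_batches(5, 5))     # a single pass produces all five samples
print(plan_batches(100, 20))  # five passes of twenty samples each
```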

Other optional-but-helpful parameters for `gpt2.generate`:

* **`length`**: Number of tokens to generate (default 1023, the maximum).
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0).
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0, which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`).
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability (gets good results on a dataset with `top_p=0.9`).
* **`truncate`**: Truncates the text at a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
* **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
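
The effect of `temperature`, `top_k`, and `top_p` can be sketched on a toy four-token vocabulary. This is a conceptual illustration in plain Python, not the code `gpt-2-simple` runs internally; the logits are made up and the function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k(probs, k):
    """Keep the k most likely tokens (k <= 0 disables the filter) and renormalize."""
    if k <= 0:
        return list(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def filter_top_p(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary

# A lower temperature concentrates probability on the most likely token.
for t in (0.7, 1.0):
    print(t, [round(q, 3) for q in softmax(logits, t)])

probs = softmax(logits, temperature=0.7)
print([round(q, 3) for q in filter_top_k(probs, 2)])
print([round(q, 3) for q in filter_top_p(probs, 0.9)])

# Draw one token index from the filtered distribution.
random.seed(0)
token = random.choices(range(len(logits)), weights=filter_top_p(probs, 0.9))[0]
print("sampled token index:", token)
```

Both filters discard the unlikely tail before sampling: `top_k` uses a fixed token count, while `top_p` adapts the count to how peaked the distribution is.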

In [None]:
gpt2.generate(sess,
              length=250,
              temperature=0.7,
              prefix="Life",
              nsamples=5,
              batch_size=5
              )

Life.

Sincerely,

Love is a language of infinite simplicity

Love is the language of endless complexity

Love is the language of impossible complexity

Love is the language of infinite complexity

Love is the language of endless complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinity complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite complexity

Love is the language of infinite compl

The results look grammatically sound. The sentences make sense and read like quotes. However, the model sometimes repeats the same sentence. Such repetition is a common failure mode when sampling at a low `temperature`; raising `temperature`, or setting `top_k` or `top_p`, may reduce it.

For bulk generation, we can write a large amount of text to a file and sort out the samples locally.

In [None]:
gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())
gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=500,
                      temperature=0.7,
                      nsamples=100,
                      batch_size=20
                      )

In [None]:
# may have to run twice to get file to download
files.download(gen_file)
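
Sorting out the samples locally could look like the sketch below. It assumes the generated file separates samples with a line of twenty `=` characters (check your file and adjust `SAMPLE_DELIM` if needed); the helper names and the 0.5 threshold are illustrative choices, and the `raw` string stands in for the contents of the downloaded file.

```python
SAMPLE_DELIM = "=" * 20 + "\n"  # assumed delimiter; adjust to match your file


def split_samples(text, delim=SAMPLE_DELIM):
    """Split the raw file contents into stripped, non-empty samples."""
    return [chunk.strip() for chunk in text.split(delim) if chunk.strip()]


def distinct_line_ratio(sample):
    """Fraction of non-empty lines that are unique (1.0 means no repeated line)."""
    lines = [line.strip() for line in sample.splitlines() if line.strip()]
    return len(set(lines)) / len(lines) if lines else 0.0


# Stand-in for the contents of the generated file.
raw = (
    "Life is short.\nArt is long.\n"
    + SAMPLE_DELIM
    + "Love is the language of infinite complexity\n" * 4
    + SAMPLE_DELIM
)

samples = split_samples(raw)
# Drop samples in which fewer than half of the lines are unique.
kept = [s for s in samples if distinct_line_ratio(s) >= 0.5]
print(f"{len(samples)} samples, {len(kept)} kept")
```

To apply this to real output, read the downloaded file with `open(...).read()` and pass the result to `split_samples`.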

Thanks to [Max Woolf](http://minimaxir.com) for the original notebook showing how to easily fine-tune GPT-2 with Colab.