<a href="https://colab.research.google.com/github/addadda023/DJT-speech-generator/blob/master/Train_a_GPT_2_Text_Generating_Model_DT_Speech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Text generation using [GPT-2-simple](https://github.com/minimaxir/gpt-2-simple), a Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model.

Let's install and import the packages. Note that TensorFlow 1.x is selected because the gpt-2-simple package doesn't support TensorFlow 2.0 yet. Keep this in mind if you want to deploy the model as a Docker image later.

In [1]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



### Check GPU status

Since a GPU is strongly recommended, check its status. Remember to select GPU under Runtime -> Change runtime type.

In [2]:
# Check which GPU is being run 
!nvidia-smi

Thu Nov 14 05:47:16 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8    10W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru
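
If you would rather check the GPU from Python than read the table, the sketch below queries `nvidia-smi` in CSV mode. The helper names `parse_gpu_line` and `query_gpu` are made up for this example, and it assumes the CSV output has the form `name, memory`; it returns `None` when no GPU or driver is visible.

```python
import subprocess

def parse_gpu_line(line):
    """Split a CSV line such as 'Tesla T4, 15079' into (name, memory_mib)."""
    name, memory = (part.strip() for part in line.split(",", 1))
    return name, int(memory)

def query_gpu():
    """Return (name, memory_mib) for the first GPU, or None if none is visible."""
    command = ["nvidia-smi", "--query-gpu=name,memory.total",
               "--format=csv,noheader,nounits"]
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                universal_newlines=True, check=True)
        lines = result.stdout.strip().splitlines()
        return parse_gpu_line(lines[0]) if lines else None
    except (OSError, subprocess.CalledProcessError, ValueError):
        # nvidia-smi is missing, failed, or printed something unexpected
        return None

gpu = query_gpu()
if gpu is None:
    print("No GPU visible; select GPU under Runtime -> Change runtime type.")
else:
    print("GPU: {} with {} MiB".format(*gpu))
```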

### GPT-2 Model download

To train the model on new text, we need to download the GPT-2 model first. 
There are four released sizes of GPT-2:

1. 124M (default): the "small" model, 500MB on disk.
2. 355M: the "medium" model, 1.5GB on disk.
3. 774M: the "large" model. It cannot currently be finetuned in Colaboratory, but it can be used to generate text from the pretrained model.
4. 1558M: the "extra large", true model. It will not work if a K80 GPU is attached to the notebook and, like 774M, it cannot be finetuned.
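
The constraints in this list can be written down as a small lookup, which is handy if you switch models often. This is only a sketch: `GPT2_MODELS` and `check_model_choice` are hypothetical names, not part of gpt-2-simple, and the "can be finetuned in Colaboratory" flags for 124M and 355M are inferred from the list rather than stated in it.

```python
# Facts transcribed from the list above. "disk" is None where no size is given.
GPT2_MODELS = {
    "124M": {"label": "small", "disk": "500MB", "finetune_in_colab": True},
    "355M": {"label": "medium", "disk": "1.5GB", "finetune_in_colab": True},
    "774M": {"label": "large", "disk": None, "finetune_in_colab": False},
    "1558M": {"label": "extra large", "disk": None, "finetune_in_colab": False},
}

def check_model_choice(model_name, finetune):
    """Return model_name if the choice fits the list above, else raise ValueError."""
    if model_name not in GPT2_MODELS:
        raise ValueError("Unknown model: {!r}".format(model_name))
    if finetune and not GPT2_MODELS[model_name]["finetune_in_colab"]:
        raise ValueError("{} cannot be finetuned in Colaboratory".format(model_name))
    return model_name

print(check_model_choice("124M", finetune=True))
```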

This next cell downloads it from Google Cloud Storage and saves it in the Colaboratory VM at `/models/<model_name>`.

In [3]:
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 258Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 126Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 482Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:05, 97.1Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 273Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 121Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 170Mit/s]                                                       


### Uploading/loading your input text file.

The best way to get the training text into the Colaboratory VM, and to get the trained model out of Colaboratory, is to route it through Google Drive first.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in and out. It will ask for an auth code; that auth is not saved anywhere.

Alternatively, you can upload the text file directly through the notebook sidebar (top left) if it's less than **10MB**.

In [4]:
gpt2.mount_gdrive()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
# Check contents of google drive
!ls "/content/drive/My Drive"

In [0]:
file_name = 'all_transcripts.txt'
gpt2.copy_file_from_gdrive(file_name)
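
To decide between the two routes described above, you can compare the file size with the 10MB sidebar limit. A minimal sketch: `upload_route` is a hypothetical helper, and it reads "10MB" as 10 × 1024 × 1024 bytes, which is an assumption.

```python
import os

def upload_route(size_bytes, limit_bytes=10 * 1024 * 1024):
    """Return 'sidebar' for files under the limit, otherwise 'gdrive'."""
    return "sidebar" if size_bytes < limit_bytes else "gdrive"

file_name = "all_transcripts.txt"
if os.path.exists(file_name):
    # Only report when the file is actually present in the working directory
    print(file_name, "->", upload_route(os.path.getsize(file_name)))
```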

### Fine tuning GPT2

The next cell will start the finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of steps (to have the finetuning run indefinitely, set steps = -1).

The model checkpoints will be saved in `/checkpoint/run1` by default. Make sure to change the `run_name` variable if you're training different versions. Checkpoints are saved every `save_every` steps (200 in the cell below) and when the cell is stopped.
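
To see which step the most recent checkpoint belongs to, you can scan the run folder for the step numbers in the file names. This sketch assumes checkpoint files are named `model-<step>.<suffix>` (for example `model-200.index`), which is the usual TensorFlow naming but is not confirmed by this notebook. `latest_step` is a hypothetical helper, and the file names in the comment are illustrative.

```python
import os
import re

def latest_step(filenames):
    """Return the highest N among names like 'model-N.index', or None."""
    steps = []
    for name in filenames:
        match = re.match(r"model-(\d+)\.", name)
        if match:
            steps.append(int(match.group(1)))
    # Compare as integers so that 1000 ranks above 200
    return max(steps) if steps else None

run_dir = os.path.join("checkpoint", "run1")
if os.path.isdir(run_dir):
    print("Latest checkpoint step:", latest_step(os.listdir(run_dir)))
```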

The training might time out after a few hours, so make sure you end training and save the results so you don't lose them! You can simply stop the cell and it will auto-store the last checkpoint data. The model will serve from that last checkpoint.

**NOTE:** If you want to rerun this cell, restart the VM first (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Parameters for gpt2.finetune:

* **restore_from:** Set to `'fresh'` to start training from the base GPT-2, or set to `'latest'` to restart training from an existing checkpoint.
* **sample_every:** Number of steps between printed example outputs.
* **print_every:** Number of steps between training progress printouts.
* **learning_rate:** Learning rate for the training. (default 1e-4, can lower to 1e-5 if you have less than 1MB of input data)
* **run_name:** Subfolder within checkpoint to save the model. This is useful if you want to work with multiple models (will also need to specify run_name when loading the model).
* **overwrite:** Set to True if you want to continue finetuning an existing model (with `restore_from='latest'`) without creating duplicate copies.
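
The parameter notes above can be bundled into one helper that builds keyword arguments for `gpt2.finetune`. This is a sketch, not a recommendation from the library: `finetune_kwargs` is a hypothetical name, the 1MB threshold is read as 1024 × 1024 bytes, and lowering the learning rate for small inputs is treated as a rule here although the text only offers it as an option.

```python
def finetune_kwargs(dataset_bytes, resume=False, run_name="run1"):
    """Build keyword arguments for gpt2.finetune from the notes above."""
    small_dataset = dataset_bytes < 1024 * 1024
    return {
        "run_name": run_name,
        # 'fresh' starts from base GPT-2, 'latest' continues an existing run
        "restore_from": "latest" if resume else "fresh",
        # Avoid duplicate checkpoint copies when continuing a run
        "overwrite": resume,
        "learning_rate": 1e-5 if small_dataset else 1e-4,
        "print_every": 10,
        "sample_every": 200,
    }

print(finetune_kwargs(dataset_bytes=3 * 1024 * 1024))
```

The result can then be unpacked into the call, for example `gpt2.finetune(sess, dataset=file_name, model_name='124M', steps=1000, **finetune_kwargs(...))`.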

The input used to finetune this model is the set of speech transcripts from all of Donald Trump's political rallies since Oct 2015. The transcripts were scraped from FactBase. Note that the text is used purely for educational purposes.

In [6]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=1000,
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=200,
              save_every=200
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:03<00:00,  3.82s/it]


dataset has 753122 tokens
Training...
[10 | 29.08] loss=2.75 avg=2.75
[20 | 51.13] loss=2.80 avg=2.77
[30 | 73.61] loss=2.70 avg=2.75
[40 | 96.60] loss=2.56 avg=2.70
[50 | 120.24] loss=2.65 avg=2.69
[60 | 144.11] loss=2.56 avg=2.67
[70 | 167.50] loss=2.57 avg=2.65
[80 | 190.93] loss=2.84 avg=2.68
[90 | 214.61] loss=2.66 avg=2.68
[100 | 238.16] loss=2.38 avg=2.65
[110 | 261.65] loss=2.62 avg=2.64
[120 | 285.19] loss=2.44 avg=2.63
[130 | 308.83] loss=2.60 avg=2.62
[140 | 332.47] loss=2.52 avg=2.61
[150 | 356.10] loss=2.49 avg=2.61
[160 | 379.69] loss=2.32 avg=2.59
[170 | 403.30] loss=2.38 avg=2.57
[180 | 426.87] loss=2.48 avg=2.57
[190 | 450.44] loss=2.51 avg=2.56
[200 | 474.03] loss=2.38 avg=2.55
Saving checkpoint/run1/model-200
 on the job, really.
This election is about jobs and jobs, and people are working. For the first time since Reagan, we've produced more product -- we're producing more. We're not exporting, not exporting at all. And now it's like all around the globe. That's a g
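
The progress lines in the output above are easy to parse if you want to estimate how long a run will take. The sketch below assumes the second field in each bracket is elapsed time in seconds, which matches the numbers shown but is not stated in this notebook. `parse_log` is a hypothetical helper, and the sample lines are copied from the output above.

```python
import re

LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")

def parse_log(text):
    """Extract step, elapsed seconds, loss and running average from log text."""
    rows = []
    for match in LOG_LINE.finditer(text):
        rows.append({
            "step": int(match.group(1)),
            "elapsed": float(match.group(2)),
            "loss": float(match.group(3)),
            "avg": float(match.group(4)),
        })
    return rows

sample = """[10 | 29.08] loss=2.75 avg=2.75
[100 | 238.16] loss=2.38 avg=2.65
[200 | 474.03] loss=2.38 avg=2.55"""

rows = parse_log(sample)
first, last = rows[0], rows[-1]
# Seconds per step between the first and last parsed lines
sec_per_step = (last["elapsed"] - first["elapsed"]) / (last["step"] - first["step"])
print("about {:.2f} s/step".format(sec_per_step))
```

At roughly 2.3 seconds per step, the 1000 steps requested above would take about 40 minutes on this GPU.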

Remember to copy the last checkpoint to Google Drive. You can then download the model from Google Drive.

In [0]:
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

### Generate text from trained model

Use `gpt2.generate` to generate a sample output.

Helpful parameters for gpt2.generate:

* **length:** Number of tokens to generate (default 1023, the maximum)
* **temperature:** The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **top_k:** Limits the generated guesses to the top k guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set top_k=40)
* **top_p:** Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with top_p=0.9)
* **truncate:** Truncates the generated text at a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller length if the input texts are short. You can also use `'\n'` to generate only 1 line of output.
* **include_prefix:** If using truncate and `include_prefix=False`, the specified prefix will not be included in the returned text.
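
To build intuition for `temperature`, `top_k` and `top_p`, here is a toy illustration on a four-token distribution in plain Python. It sketches the general sampling techniques and is not gpt-2-simple's implementation; the function names and the logits are made up for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; k=0 disables the filter."""
    if k <= 0 or k >= len(probs):
        return list(probs)
    cutoff = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= cutoff else 0.0 for p in probs]  # ties may keep extras
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [0.0] * len(probs)
    cumulative = 0.0
    for i in order:
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(kept)
    return [q / total for q in kept]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
probs = softmax(logits, temperature=0.7)
print([round(q, 3) for q in top_k_filter(probs, 2)])
print([round(q, 3) for q in top_p_filter(softmax(logits), 0.9)])
```

In both filters, the tokens that are dropped get probability zero and the remaining ones are rescaled to sum to 1, so unlikely tokens can no longer be sampled.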

In [14]:
gpt2.generate(sess, run_name='run1',
              length=1023,
              prefix='Apple')

Apple's $10 billion a year, but it's peanuts. And it's peanuts.
China, we always lose for a long period. They can't take $10 billion from us. We have to make a deal. Under the USMCA, which is a tremendous deal, we will also protect Medicare. We have to protect it. And I have to tell you, we have a great, great new governor.
Governor Cuomo is going to be fantastic. Governor Cuomo. And I have to say it, we have great new governor of New York, and I like him. I have to say it. We have great new governor of Florida, and I have to say it again, great new governor of Florida, by the way.
Great new governor of all people, just a great guy, he loves the people of Florida. And when I make a deal with a state, I want to make sure it's going to work with him. But he's a great guy, and he loves the people of Florida. And he's going to be fantastic.
And we will always protect patients with pre-existing condition. Always. Always. They have been doing it for so many decades.
And we're going to get it

### Loading a pretrained model

You can also load a different trained model by its `run_name` (here `run2`, which must already exist in the checkpoint folder) and generate text from it.

In [0]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run2')
gpt2.generate(sess, run_name='run2',
              length=100,
              prefix='Coca cola',
              truncate='\n')
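
The effect of `truncate` and `include_prefix` in the call above can be illustrated without a model. The sketch below mimics the behavior described in the parameter list on a made-up string; `truncate_text` is a hypothetical helper, not the library's code.

```python
def truncate_text(text, truncate=None, prefix=None, include_prefix=True):
    """Cut text at the first occurrence of `truncate`, optionally dropping the prefix."""
    if truncate is None:
        return text
    if prefix and not include_prefix and text.startswith(prefix):
        text = text[len(prefix):]
    end = text.find(truncate)
    # If the sequence never appears, return the text unchanged
    return text if end == -1 else text[:end]

raw = "Coca cola is great.\nSecond line of output."  # made-up generated text
print(truncate_text(raw, truncate="\n"))
print(truncate_text(raw, truncate="\n", prefix="Coca cola", include_prefix=False))
```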

In [15]:
gpt2.tf.__version__

'1.15.0'
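
Because the notebook depends on TensorFlow 1.x, a guard like the following can catch an accidental 2.x runtime early. `is_tf1` is a hypothetical helper that only inspects a version string; in the notebook you would pass it `gpt2.tf.__version__`.

```python
def is_tf1(version):
    """Return True when a version string such as '1.15.0' has major version 1."""
    return version.split(".")[0] == "1"

print(is_tf1("1.15.0"))
```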