# Blueno Mars: Lyrics Generator using GPT-2
Code adapted from `gpt-2-simple` tutorial by Max Woolf.

**Getting started**:


*   Go to Runtime -> Change runtime type and select GPU
*   Run the cell below to verify that it's running with a GPU and import the libraries. 



In [2]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
!nvidia-smi

import gpt_2_simple as gpt2
import tensorflow as tf
import os

from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
  Building wheel for gpt-2-simple (setup.py) ... done
Thu Dec 10 15:42:53 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                             

Download the model. We experimented with the "small" (124M-parameter) and "medium" (355M-parameter) versions and found the latter to give the best results. Larger versions are available, but it's not possible to run them for free on the GPU provided by Google Colab.

In [3]:
gpt2.download_gpt2(model_name="355M")

Fetching checkpoint: 1.05Mit [00:00, 272Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 123Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 481Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:05, 248Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 252Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 74.0Mit/s]                                                
Fetching vocab.bpe: 1.05Mit [00:00, 206Mit/s]                                                       


Load the corpus of 12,500 pop songs, scraped using the Genius API, from Google Drive. The file is expected to be named `data.txt`.

In [4]:
gpt2.mount_gdrive()
big_corpus_file = "data.txt"
gpt2.copy_file_from_gdrive(big_corpus_file)

Mounted at /content/drive


During fine-tuning, GPT-2 has no sense of where a document begins or ends within a larger text. We use `start_token` and `end_token` to mark the beginning and end of each song. The other tokens defined below mark line breaks, verse breaks, and unknown words.

In [6]:
start_token = "<|startoftext|>"
end_token = "<|endoftext|>"
line_break_token = "<|line_break|>"
verse_break_token = "<|verse_break|>"
unk_token = "<|UNK|>"
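
To make the role of these markers concrete, here is a minimal sketch of how one song might be laid out in the training file. The `format_song` helper and the exact layout (start and end tokens on their own lines, `<|line_break|>` closing each lyric line, `<|verse_break|>` on its own line between verses) are assumptions inferred from the sample output further down, not the preprocessing script that produced `data.txt`.

```python
start_token = "<|startoftext|>"
end_token = "<|endoftext|>"
line_break_token = "<|line_break|>"
verse_break_token = "<|verse_break|>"


def format_song(verses):
  """Hypothetical helper: lay out one song (a list of verses, each a list of lines)."""
  out = [start_token]
  for i, verse in enumerate(verses):
    if i > 0:
      # Separate consecutive verses with a verse-break marker on its own line.
      out.append(verse_break_token)
    # Close every lyric line with a line-break marker.
    out.extend(line + " " + line_break_token for line in verse)
  out.append(end_token)
  return "\n".join(out) + "\n"


# Made-up lyrics, for illustration only.
song = format_song([["hello darkness", "my old friend"], ["la la la"]])
print(song)
```

Keeping the start token on its own line matters for the cells that follow, which detect song boundaries by comparing whole lines against `start_token`.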

**Optional**: Parse the data into a smaller corpus of 3,000 songs. This speeds up fine-tuning, but it is a tradeoff: the generated results are not as diverse in content and structure as those from fine-tuning on the larger corpus of 12,500 songs.

In [None]:
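mini_corpus_file = "mini_corpus.txt"
max_songs = 3000

with open(big_corpus_file, "r") as big_corpus, open(mini_corpus_file, "w") as mini_corpus:
  num_songs = 0

  for line in big_corpus.read().splitlines():
    # Each song begins with start_token on its own line.
    if line == start_token:
      # Stop before starting a song beyond the quota.
      if num_songs >= max_songs:
        break
      num_songs += 1

    # Write every line, including the start token, so song boundaries are preserved.
    mini_corpus.write(line + "\n")

# No explicit close() is needed: the with block closes both files.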

Function to add repeat tokens. We didn't do this in our original preprocessing, so we wrote a script to add them here. We found that the model learned to use `<|repeat|>` during fine-tuning, and the generated songs incorporate repetitive structure.

In [7]:
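def add_repeat_tokens(input_file_name, repeat_file_name):
  """Copy the corpus, replacing each line that repeats the previous one with <|repeat|>."""
  with open(input_file_name, mode="r") as input_file, open(repeat_file_name, mode="w") as repeat_file:
    prev_line = ""

    # splitlines() avoids the extra empty entry that split("\n") yields after a trailing newline.
    for line in input_file.read().splitlines():
      if line == prev_line:
        repeat_file.write("<|repeat|>\n")
      else:
        repeat_file.write(line + "\n")
        prev_line = line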
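
To see what the repeat tokens do to a chorus, here is a small in-memory sketch of the same logic, working on a list of lines instead of files. `mark_repeats` mirrors the function above; `expand_repeats` is a hypothetical inverse (not part of this notebook's pipeline) that could be used to read generated songs with the repetitions written out. The lyrics are made up for illustration.

```python
REPEAT_TOKEN = "<|repeat|>"


def mark_repeats(lines):
  """Replace each line that equals the previous kept line with REPEAT_TOKEN."""
  marked, prev_line = [], ""
  for line in lines:
    if line == prev_line:
      marked.append(REPEAT_TOKEN)
    else:
      marked.append(line)
      prev_line = line
  return marked


def expand_repeats(lines):
  """Hypothetical inverse: turn each REPEAT_TOKEN back into the last real line."""
  expanded, prev_line = [], ""
  for line in lines:
    if line == REPEAT_TOKEN:
      expanded.append(prev_line)
    else:
      expanded.append(line)
      prev_line = line
  return expanded


chorus = ["na na na", "na na na", "na na na", "hey jude"]
marked = mark_repeats(chorus)
print(marked)                  # the second and third lines become repeat tokens
print(expand_repeats(marked))  # the original lines are recovered
```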

If you want to use the smaller corpus as the dataset for GPT-2, replace `big_corpus_file` with `mini_corpus_file` (created above). The code snippet below adds repeat tokens to the corpus and saves the result to be passed as the dataset to `finetune`.

In [20]:
dataset = "with_repeat_data.txt"
add_repeat_tokens(big_corpus_file, dataset)

In [9]:
num_steps_per_epoch = 3000
num_epochs = 2
steps = num_steps_per_epoch * num_epochs

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=dataset,
              model_name='355M',
              steps=steps,
              run_name='run1',
              print_every=10,
              sample_every=200,
              save_every=500,
              overwrite=True,
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from models/355M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:33<00:00, 33.76s/it]


dataset has 9739866 tokens
Training...
Saving checkpoint/run1/model-0
[10 | 23.89] loss=1.75 avg=1.75
[20 | 39.00] loss=1.82 avg=1.78
[30 | 54.29] loss=1.26 avg=1.61
[40 | 69.90] loss=0.93 avg=1.43
[50 | 85.81] loss=1.49 avg=1.45
[60 | 101.93] loss=1.12 avg=1.39
[70 | 118.45] loss=1.32 avg=1.38
[80 | 135.42] loss=1.88 avg=1.44
[90 | 152.65] loss=1.90 avg=1.50
[100 | 169.42] loss=1.05 avg=1.45
[110 | 186.07] loss=1.28 avg=1.43
[120 | 202.81] loss=1.66 avg=1.45
[130 | 219.82] loss=1.13 avg=1.43
[140 | 236.85] loss=0.70 avg=1.37
[150 | 253.69] loss=1.31 avg=1.37
[160 | 270.55] loss=1.54 avg=1.38
[170 | 287.60] loss=0.86 avg=1.35
[190 | 321.67] loss=1.27 avg=1.32
[200 | 338.55] loss=1.56 avg=1.33
'n'Nasty <|line_break|>
but when it comes to sex n
<|verse_break|>
<|UNK|> <|line_break|>
you <|UNK|> in my house <|line_break|>
you <|UNK|> my house <|line_break|>
you <|UNK|> <|UNK|> house <|line_break|>
you <|UNK|> a house <|line_break|>
you <|UNK|> inside a house door <|line_break|>
<|UNK|> ni

After the model is trained, we can copy the checkpoint folder to Google Drive to save the (partially) fine-tuned model.

In [10]:
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

Running the next cell will copy the model's checkpoint files from your Google Drive into the Colaboratory VM. This means you can run the code snippets below to start a GPT-2 session and call `generate` without fine-tuning the model again. Note that you may have to rerun the imports.

In [11]:
gpt2.copy_checkpoint_from_gdrive(run_name='run1')

In [12]:
tf.reset_default_graph()

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run1')

Loading checkpoint checkpoint/run1/model-6000
INFO:tensorflow:Restoring parameters from checkpoint/run1/model-6000


Hyperparameters of note for `generate`:


*   `top_k`: restricts sampling at each step to the k most likely next tokens. Making this parameter smaller reduces the junk we get.
*   `temperature`: controls how random the sampling is, and therefore how different the runs of the generation are from each other. We tried values between 0.7 and 1.0, and found 0.8 to be a happy medium.
*   `return_as_list`: if `True`, the function returns each sample as an element in a list. Here, each sample represents a song.
*   `length`: the number of tokens (not characters) to generate. GPT-2 has a limit of 1024 tokens per sample. During preprocessing, we calculated that the average song was ~2000 characters long, so in `generate` a sample may end before a full-length song is complete, a tradeoff of using this network.
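
To illustrate how `top_k` and `temperature` interact, here is a minimal sketch of top-k sampling with temperature over a toy vocabulary. It shows the general technique only; it is not `gpt-2-simple`'s implementation, and `sample_top_k` and the scores in `toy_logits` are made up for illustration.

```python
import math
import random


def sample_top_k(logits, top_k=40, temperature=0.8, rng=random):
  """Sample one token index from raw scores using temperature and top-k filtering."""
  # Keep only the indices of the k highest-scoring tokens (0 means keep all).
  ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
  kept = ranked[:top_k] if top_k > 0 else ranked

  # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
  scaled = [logits[i] / temperature for i in kept]

  # Softmax weights over the kept tokens (subtract the max for numerical stability).
  peak = max(scaled)
  weights = [math.exp(s - peak) for s in scaled]

  return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)
draws = [sample_top_k(toy_logits, top_k=2, temperature=0.8, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in sorted(set(draws))})
```

With `top_k=2`, only the two highest-scoring tokens can ever be drawn, which is why a smaller `top_k` cuts down on unlikely "junk" tokens. Lowering `temperature` shifts more of the draws toward the single most likely token.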



In [17]:
generated_songs = gpt2.generate(sess,
              length=1024,
              temperature=0.8,
              top_k=40,
              prefix=start_token,
              # truncate=end_token,
              include_prefix=True,
              nsamples=5,
              batch_size=5,
              return_as_list=True
              )
print(*generated_songs, sep='\n')

Write the sample text from `generate` to a text file, `results.txt`. Note that `batch_size` must be small; otherwise [this error](https://github.com/tensorflow/models/issues/1993) will report that resources are exhausted. This is why the following code snippet calls `generate` in a loop. Each call to `generate` samples five songs. We decided to output 40 songs in total to compute metrics for both this model and the RNN.

In [21]:
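# Append to results.txt; the with block closes the file even if generation fails.
with open("results.txt", "a") as result_file:
  for _ in range(8):  # 8 batches of 5 samples = 40 songs
    generated_songs = gpt2.generate(sess,
                length=1024,
                temperature=0.8,
                top_k=40,
                prefix=start_token,
                # truncate=end_token,
                include_prefix=True,
                nsamples=5,
                batch_size=5,
                return_as_list=True
                )
    # End each batch with a newline so consecutive batches don't run together.
    result_file.write("\n".join(generated_songs) + "\n")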
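
The calls above leave `truncate=end_token` commented out, so a sample may continue past the end of its first song. If only the first song of each sample is wanted, one option is to trim it afterwards. The sketch below is a hypothetical post-processing helper (`first_song` is not part of `gpt-2-simple`), and the sample string is made up for illustration.

```python
start_token = "<|startoftext|>"
end_token = "<|endoftext|>"


def first_song(sample):
  """Hypothetical helper: keep only the text before the first end token."""
  head, _, _ = sample.partition(end_token)
  return head.strip()


# A made-up sample that runs on into a second song.
sample = (
    "<|startoftext|>\nfirst song line <|line_break|>\n<|endoftext|>\n"
    "<|startoftext|>\nstart of a second song"
)
print(first_song(sample))
```

A sample with no end token is returned unchanged (apart from surrounding whitespace), so songs cut off by the length limit are kept rather than dropped.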