# Train GPT-2 Text-Generating Model for Question Generation

Adapted from [Max Woolf](http://minimaxir.com) ([this GitHub repository](https://github.com/minimaxir/gpt-2-simple), [blog post](https://minimaxir.com/2019/09/howto-gpt2/))

In [0]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
from google.colab import drive
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
import os
import shutil
import json
import tensorflow as tf
from gpt_2_simple.src import model
from tqdm import tqdm
gpt2.mount_gdrive()
drive.mount('/content/drive')   # add force_remount=True to force a remount

In [0]:
def is_mounted():
    """Check that Google Drive is mounted."""
    assert os.path.isdir('/content/drive'), "You must mount first using mount_gdrive()"


def get_file(file_name):
    """Copy file_name from the project folder on Google Drive into the working directory."""
    is_mounted()
    file_path = "NLU_Project/SummQG/data/squad/sentences_questions/" + file_name
    print("File path:", file_path)
    shutil.copyfile("/content/drive/My Drive/" + file_path, file_name)

In [0]:
file_name = "squad_sq_remainder_unique_dict.json"

In [4]:
get_file(file_name)

File path: NLU_Project/SummQG/data/squad/sentences_questions/squad_sq_remainder_unique_dict.json


## Finetune GPT-2

The next cell starts the actual finetuning of GPT-2. It creates a persistent TensorFlow session that stores the training config, then runs training for the specified number of `steps`. To have the finetuning run indefinitely, set `steps = -1`.

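Model checkpoints are saved in `/checkpoint/run1` by default. How often a checkpoint is written is controlled by `save_every`; the finetuning cell below sets `save_every=steps`, so it saves once every 5000 steps. A checkpoint is also written when the cell is stopped.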

Training may time out after roughly 4 hours; make sure you end training and save the results before then so you don't lose them!

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for `gpt2.finetune`:


* **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
* **`sample_every`**: Number of steps between printed example outputs.
* **`print_every`**: Number of steps between training progress printouts.
* **`learning_rate`**: Learning rate for the training (default `1e-4`; can be lowered to `1e-5` if you have <1MB of input data).
* **`run_name`**: Subfolder within `checkpoint` in which to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
* **`overwrite`**: Set to `True` if you want to continue finetuning an existing model (with `restore_from='latest'`) without creating duplicate copies.
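
As a minimal sketch of how these options fit together, the hypothetical helper below (`build_finetune_kwargs` is not part of `gpt-2-simple`) collects them into a dictionary that could be unpacked into `gpt2.finetune`. The 1 MB threshold follows the rule of thumb in the list above; the `print_every` and `sample_every` values are illustrative assumptions.

```python
import os

ONE_MB = 1024 * 1024  # size threshold from the learning-rate rule of thumb


def build_finetune_kwargs(dataset_path, run_name, steps, resume=False):
    """Collect the optional finetuning parameters into one dict (hypothetical helper)."""
    # Use the lower learning rate for small datasets (< 1 MB), the default otherwise.
    small_dataset = os.path.getsize(dataset_path) < ONE_MB
    return {
        "dataset": dataset_path,
        "steps": steps,
        "run_name": run_name,
        # Resuming means restoring the latest checkpoint and overwriting it in place.
        "restore_from": "latest" if resume else "fresh",
        "overwrite": resume,
        "learning_rate": 1e-5 if small_dataset else 1e-4,
        "print_every": 100,   # assumed value, adjust as needed
        "sample_every": 500,  # assumed value, adjust as needed
    }
```

With a session already created, the result could be passed on as `gpt2.finetune(sess, **build_finetune_kwargs(file_name, run_name, steps))`.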

In [0]:
# squad_sq_clean-20000-0.0001    # plateauing at 0.78
# squad_sq_clean-25000-0.00005   # plateauing at 0.39
# squad_sq_clean-35000-0.00001   # plateauing at 0.25

lr = 0.000005
steps = 5000

# NOTE: the run name ends in 0.000001, while lr above is 0.000005.
run_name = "squad_sq_clean-40000-0.000001"  # before crashing, this was performing at loss 0.17, then 0.21


In [0]:
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset=file_name,
              model_name="squad_sq_clean-35000-0.00001",
              steps=steps,
              restore_from="latest",
              learning_rate=lr,
              overwrite=False,
              run_name=run_name,
              print_every=100,
              sample_every=steps,
              save_every=steps
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/squad_sq_clean-35000-0.00001/model-10000
INFO:tensorflow:Restoring parameters from models/squad_sq_clean-35000-0.00001/model-10000


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:23<00:00, 23.60s/it]


dataset has 5722247 tokens
Training...
[100 | 95.55] loss=0.23 avg=0.23
[200 | 183.71] loss=0.13 avg=0.18
[300 | 271.85] loss=0.12 avg=0.16
[400 | 359.98] loss=0.10 avg=0.15
[500 | 448.15] loss=0.13 avg=0.14
[600 | 536.23] loss=0.05 avg=0.12
[700 | 624.33] loss=0.15 avg=0.13
[800 | 712.43] loss=0.21 avg=0.14
[900 | 800.50] loss=0.36 avg=0.16
[1000 | 888.54] loss=0.10 avg=0.16
[1100 | 976.60] loss=0.42 avg=0.18
[1200 | 1064.67] loss=0.16 avg=0.18
[1300 | 1152.71] loss=0.18 avg=0.18
[1400 | 1240.77] loss=0.10 avg=0.17
[1500 | 1328.80] loss=0.18 avg=0.17
[1600 | 1416.87] loss=0.06 avg=0.17
[1700 | 1504.90] loss=0.15 avg=0.16
[1800 | 1592.96] loss=0.08 avg=0.16
[1900 | 1681.03] loss=0.14 avg=0.16
[2000 | 1769.15] loss=0.09 avg=0.15
[2100 | 1857.27] loss=0.15 avg=0.15
[2200 | 1945.41] loss=0.32 avg=0.16
[2300 | 2033.56] loss=0.51 avg=0.18
[2400 | 2121.68] loss=0.20 avg=0.18
[2500 | 2209.79] loss=0.05 avg=0.17
[2600 | 2297.94] loss=0.28 avg=0.18
[2700 | 2386.01] loss=0.17 avg=0.18
[2800 | 24
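
The progress lines in the log above follow the pattern `[step | elapsed seconds] loss=... avg=...`. A minimal sketch, assuming the log has been copied into a string, of parsing those lines to check whether the running average has stopped improving. The names `parse_training_log` and `avg_loss_change` are hypothetical helpers, not part of `gpt-2-simple`; incomplete lines (such as a cut-off final line) are skipped.

```python
import re

# Matches complete progress lines such as "[100 | 95.55] loss=0.23 avg=0.23".
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def parse_training_log(text):
    """Return a (step, seconds, loss, avg) tuple for each complete progress line."""
    records = []
    for match in LOG_LINE.finditer(text):
        step, seconds, loss, avg = match.groups()
        records.append((int(step), float(seconds), float(loss), float(avg)))
    return records


def avg_loss_change(records, window=5):
    """Change in running-average loss over the last `window` records (None if too few)."""
    if len(records) <= window:
        return None
    return records[-1][3] - records[-1 - window][3]


sample_log = """[100 | 95.55] loss=0.23 avg=0.23
[200 | 183.71] loss=0.13 avg=0.18
[300 | 271.85] loss=0.12 avg=0.16
[2800 | 24"""

records = parse_training_log(sample_log)
print(len(records), avg_loss_change(records, window=2))
```

A change close to zero (or positive) over a long window suggests the run is plateauing at the current learning rate.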

After the model is trained, you can copy the checkpoint folder to your own Google Drive.


In [0]:
gpt2.copy_checkpoint_to_gdrive(run_name=run_name)

## Load a Trained Model Checkpoint

In [0]:
run_name = 'squad_sq_clean-40000-0.000001'

In [0]:
gpt2.copy_checkpoint_from_gdrive(run_name=run_name)

In [0]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, checkpoint_dir="/content/checkpoint", run_name=run_name)

Loading checkpoint /content/checkpoint/squad_sq_clean-40000-0.000001/model-5000
INFO:tensorflow:Restoring parameters from /content/checkpoint/squad_sq_clean-40000-0.000001/model-5000


## Generate Text From The Trained Model

In [0]:
# text_file = "squad_test_set_clean.txt"

# def squad_couplet_data_processing(data_file):
#   contexts = []
#   gold_questions = []
#   with open(data_file, "r") as infile:
#     text = infile.read()

#   couplet = text.split("<|endoftext|>")
#   for i in couplet:
#     if i == "" or i == "\n":
#       pass
#     else:
#       c_q = i.split("[question]")
#       contexts.append(c_q[0]+"[question] ")
#       gold_questions.append(c_q[1][:-1])
#   return contexts, gold_questions

# contexts, gold_questions = squad_couplet_data_processing("squad_test_set_clean.txt")

In [0]:
text_file = "squad_sq_remainder_unique_dict.json"
with open(text_file) as infile:  # close the file once the JSON is loaded
  contexts = json.load(infile).keys()

In [0]:
contexts = list(contexts)

In [9]:
len(contexts)

4953

In [0]:
for context in tqdm(contexts[:1]):
  q = gpt2.generate(sess,
                    run_name=run_name,
                    length=50,
                    temperature=0.9,
                    top_k=50,
                    top_p=0.9,
                    prefix=context,
                    nsamples=10,
                    batch_size=10,
                    include_prefix=False,
                    truncate="<|endoftext|>",
                    return_as_list=True
                    )
  print(q)


  0%|          | 0/1 [00:00<?, ?it/s]

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


100%|██████████| 1/1 [00:09<00:00,  9.78s/it]

['[question] where does midna destroy the mirror of twilight ?\n', "[question] who is midna 's tearful farewell to ?\n", '[question] what game did midna destroy before she left for the bathroom ?\n', '[question] what game does midna destroy with a tear ?\n', '[question] what item does midna destroy before leaving for the mirror of twilight ?\n', '[question] what entity destroys the mirror of twilight ?\n', '[question] what object does midna destroy with a tear ?\n', '[question] after the destruction of the mirror of twilight , who sends a tear to maintain balance between hyrule and the twilight realm ?\n', '[question] who throws the tearful farewell to link and zelda before destroying the mirror of twilight ?\n', '[question] which princess takes the mirror of twilight to destroy ?\n']





In [0]:
contexts[:1]

['\n<|startoftext|>\n[sentence] after bidding farewell to link and zelda , midna returns home before destroying the mirror of twilight with a tear to maintain balance between hyrule and the twilight realm . \n[question] ']
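
Each generated sample shown above starts with the `[question]` tag and ends with a newline. A minimal sketch of a post-processing step that strips both and drops exact duplicates while keeping the original order; `clean_questions` is a hypothetical helper, not something the notebook or `gpt-2-simple` provides.

```python
QUESTION_TAG = "[question]"


def clean_questions(samples):
    """Strip the "[question]" tag and surrounding whitespace; drop duplicates, keep order."""
    cleaned = []
    for sample in samples:
        text = sample.strip()
        if text.startswith(QUESTION_TAG):
            text = text[len(QUESTION_TAG):].strip()
        # Skip empty strings and questions already seen.
        if text and text not in cleaned:
            cleaned.append(text)
    return cleaned


samples = [
    "[question] what entity destroys the mirror of twilight ?\n",
    "[question] what entity destroys the mirror of twilight ?\n",
    "[question] which princess takes the mirror of twilight to destroy ?\n",
]
print(clean_questions(samples))
```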

# RUN BELOW SCRIPT TO GENERATE QUESTIONS IN BATCHES

In [0]:
!pip install -q gpt-2-simple

In [1]:
### RUN BELOW SCRIPT TO GENERATE QUESTIONS IN BATCHES
%tensorflow_version 1.x
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
import os
import shutil
import json
import tensorflow as tf
from gpt_2_simple.src import model
from tqdm import tqdm

run_name = 'squad_sq_clean-40000-0.000001'
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, checkpoint_dir="/content/checkpoint", run_name=run_name)

text_file = "squad_sq_remainder_unique_dict.json"
with open(text_file) as infile:  # close the file once the JSON is loaded
  contexts = list(json.load(infile).keys())
assert len(contexts) == 4953

###############################################################################
generated_questions = {}
start = 4300       # index of the first context in this batch
n_contexts = 50    # number of contexts to process per run
end = min(start + n_contexts, len(contexts))

for context in tqdm(contexts[start:end]):
  q = gpt2.generate(sess,
                    run_name=run_name,
                    length=50,
                    temperature=0.1,
                    top_k=50,
                    top_p=0.9,
                    prefix=context,
                    nsamples=10,
                    batch_size=10,
                    include_prefix=False,
                    truncate="<|endoftext|>",
                    return_as_list=True
                    )
  generated_questions[context] = q

# The file name carries the end index of the batch (4350 when start = 4300).
output_file = "squad_sq_clean_unique_contexts_" + str(end) + ".json"
with open(output_file, "w") as json_file:
  json.dump(generated_questions, json_file)
folder_dir = "/content/drive/My Drive/NLU_Project/SummQG/results/squad_sq_clean_unique_contexts/"
shutil.copyfile(output_file, folder_dir + output_file)

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Loading checkpoint /content/checkpoint/squad_sq_clean-40000-0.000001/model-5000
INFO:tensorflow:Restoring parameters from /content/checkpoint/squad_sq_clean-40000-0.000001/model-5000


  0%|          | 0/50 [00:00<?, ?it/s]

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


100%|██████████| 50/50 [13:38<00:00, 16.37s/it]


'/content/drive/My Drive/NLU_Project/SummQG/results/squad_sq_clean_unique_contexts/squad_sq_clean_unique_contexts_4350.json'
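
The script above handles one slice of 50 contexts per run and writes one JSON file per slice. A minimal sketch, under the assumption that every batch file follows the `squad_sq_clean_unique_contexts_<end index>.json` naming used above, of listing the slices that cover all contexts and of merging the per-batch files afterwards. `batch_ranges` and `merge_batches` are hypothetical helpers introduced here for illustration.

```python
import glob
import json
import os


def batch_ranges(total, size):
    """Yield (start, end) index pairs that cover range(total) in chunks of `size`."""
    for start in range(0, total, size):
        yield start, min(start + size, total)


def merge_batches(folder, pattern="squad_sq_clean_unique_contexts_*.json"):
    """Merge every per-batch JSON file in `folder` into one context -> questions dict."""
    merged = {}
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path) as infile:
            merged.update(json.load(infile))
    return merged


# The last slice is shorter when the total is not a multiple of the batch size.
print(list(batch_ranges(120, 50)))
```

Because the contexts are unique dictionary keys, merging with `dict.update` does not lose entries as long as no context was generated in two different batches.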

# LICENSE

MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.