# Train GPT-2 Text-Generating Model for Question Generation

Adapted from [Max Woolf](http://minimaxir.com) ([this GitHub repository](https://github.com/minimaxir/gpt-2-simple), [blog post](https://minimaxir.com/2019/09/howto-gpt2/))

In [1]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
import numpy as np
import os
import shutil
import json
import tensorflow as tf
from gpt_2_simple.src import model
from tqdm import tqdm

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [0]:
#gpt2.download_gpt2(model_name="355M")
#!nvidia-smi

In [3]:
gpt2.mount_gdrive()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [4]:
from google.colab import drive
drive.mount('/content/drive')   #force_remount=True

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
def is_mounted():
    """Checks if the Google Drive is mounted."""
    assert os.path.isdir('/content/drive'), "You must mount first using mount_gdrive()"

def get_file(file_name):
  """Copies a data file from Google Drive into the Colab working directory."""
  is_mounted()
  file_path = "NLU_Project/SummQG/data/narrativeqa/" + file_name
  print("File path:", file_path)
  shutil.copyfile("/content/drive/My Drive/" + file_path, file_name)

In [0]:
# Training file passed to gpt2.finetune below; copy it in first with get_file(file_name).
file_name = "training_set.txt"

In [7]:
# Get NQA Test Set
get_file("filtered_test_set.txt")

File path: NLU_Project/SummQG/data/narrativeqa/filtered_test_set.txt


## Finetune GPT-2

The next cell starts the actual finetuning of GPT-2. It creates a persistent TensorFlow session that stores the training config, then runs the training for the specified number of `steps`. (To have the finetuning run indefinitely, set `steps = -1`.)

The model checkpoints are saved in `/checkpoint/run1` by default. Checkpoints are saved every 500 steps (configurable with `save_every`) and when the cell is stopped.

The training might time out after 4ish hours; make sure you end training and save the results so you don't lose them!
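
To judge how many steps fit before such a timeout, one option is to read the pace off the progress lines that `gpt2.finetune` prints (`[step | elapsed seconds] loss=... avg=...`). The sketch below is a hypothetical helper, not part of gpt-2-simple: the names `seconds_per_step` and `max_steps_within` are made up here, and the four-hour budget is only the rough figure mentioned above.

```python
import re

# Matches progress lines such as "[50 | 53.18] loss=0.11 avg=0.11".
PROGRESS_RE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=[\d.]+ avg=[\d.]+")


def seconds_per_step(log_text):
    """Average seconds per step between the first and last progress line."""
    rows = [(int(m.group(1)), float(m.group(2)))
            for m in PROGRESS_RE.finditer(log_text)]
    if len(rows) < 2:
        raise ValueError("need at least two progress lines")
    (first_step, first_time), (last_step, last_time) = rows[0], rows[-1]
    return (last_time - first_time) / (last_step - first_step)


def max_steps_within(budget_seconds, sec_per_step):
    """Number of whole steps that fit inside a time budget."""
    return int(budget_seconds // sec_per_step)


# Two progress lines in the format printed during training.
log = "[50 | 53.18] loss=0.11 avg=0.11\n[100 | 97.35] loss=0.14 avg=0.12\n"
pace = seconds_per_step(log)
print("seconds per step:", round(pace, 4))
print("steps in 4 hours:", max_steps_within(4 * 3600, pace))
```

The estimate ignores dataset loading and checkpoint-saving time, so it is best treated as an upper bound when choosing `steps`.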

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for `gpt2.finetune`:


* **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
* **`sample_every`**: Number of steps between printed example outputs.
* **`print_every`**: Number of steps between printed training-progress lines.
* **`learning_rate`**: Learning rate for the training (default `1e-4`; can be lowered to `1e-5` if you have <1MB of input data).
* **`run_name`**: Subfolder within `checkpoint` to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
* **`overwrite`**: Set to `True` if you want to continue finetuning an existing model (with `restore_from='latest'`) without creating duplicate copies.

In [0]:
run_name = "nqa-30000-0.000001"

In [0]:
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset=file_name,
              model_name='nqa-25000-0.00001',
              steps=5000,
              restore_from = "latest",
              learning_rate = 0.000001,
              overwrite = False,
              run_name=run_name,
              print_every=50,
              sample_every=5000,
              save_every=5000
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/nqa-25000-0.00001/model-5000
INFO:tensorflow:Restoring parameters from models/nqa-25000-0.00001/model-5000


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [02:20<00:00, 140.40s/it]


dataset has 26187600 tokens
Training...
[50 | 53.18] loss=0.11 avg=0.11
[100 | 97.35] loss=0.14 avg=0.12
[150 | 141.54] loss=0.19 avg=0.15
[200 | 185.75] loss=0.08 avg=0.13
[250 | 229.95] loss=0.15 avg=0.13
[300 | 274.13] loss=0.26 avg=0.16
[350 | 318.32] loss=0.10 avg=0.15
[400 | 362.51] loss=0.13 avg=0.14
[450 | 406.69] loss=0.08 avg=0.14
[500 | 450.88] loss=0.21 avg=0.15
[550 | 495.10] loss=0.09 avg=0.14
[600 | 539.32] loss=0.17 avg=0.14
[650 | 583.52] loss=0.10 avg=0.14
[700 | 627.70] loss=0.05 avg=0.13
[750 | 671.90] loss=0.18 avg=0.14
[800 | 716.06] loss=0.11 avg=0.13
[850 | 760.24] loss=0.11 avg=0.13
[900 | 804.44] loss=0.07 avg=0.13
[950 | 848.63] loss=0.08 avg=0.13
[1000 | 892.80] loss=0.12 avg=0.13
[1050 | 936.98] loss=0.12 avg=0.13
[1100 | 981.17] loss=0.12 avg=0.13
[1150 | 1025.33] loss=0.14 avg=0.13
[1200 | 1069.50] loss=0.14 avg=0.13
[1250 | 1113.71] loss=0.06 avg=0.12
[1300 | 1157.92] loss=0.19 avg=0.13
[1350 | 1202.08] loss=0.26 avg=0.13
[1400 | 1246.26] loss=0.11 avg=0

After the model is trained, you can copy the checkpoint folder to your own Google Drive.

If you want to download it to your personal computer, it's strongly recommended that you copy it to Google Drive first, then download it from there. The checkpoint folder is copied as a `.tar` archive; you can download it and extract it locally.

In [0]:
gpt2.copy_checkpoint_to_gdrive(run_name=run_name)

You're done! Continue to the **Load a Trained Model Checkpoint** section to load your retrained model and generate text from it.

## Load a Trained Model Checkpoint

In [0]:
run_name = "nqa-25000-0.00001"

In [0]:
gpt2.copy_checkpoint_from_gdrive(run_name=run_name)

The next cell will allow you to load the retrained model checkpoint + metadata necessary to generate text.

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

In [0]:
#checkpoint_dir = "checkpoint"
#checkpoint_path = os.path.join(checkpoint_dir, run_name)
#print(checkpoint_path)
#hparams = model.default_hparams()
#with open(os.path.join(checkpoint_path, 'hparams.json')) as f:
#      hparams.override_from_dict(json.load(f))
#ckpt = tf.train.latest_checkpoint(checkpoint_path)

In [10]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, checkpoint_dir="/content/checkpoint", run_name=run_name)

Loading checkpoint /content/checkpoint/nqa-25000-0.00001/model-5000
INFO:tensorflow:Restoring parameters from /content/checkpoint/nqa-25000-0.00001/model-5000


If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = gpt2.generate(sess, return_as_list=True)[0]`

You can also pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).

Other optional-but-helpful parameters for `gpt2.generate` and friends:

* **`length`**: Number of tokens to generate (default 1023, the maximum).
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0).
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0, which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`).
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability (gets good results on a dataset with `top_p=0.9`).
* **`truncate`**: Truncates the generated text at a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
* **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
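
To make `temperature`, `top_k` and `top_p` more concrete, here is a minimal pure-Python sketch of what each one does to a next-token distribution. It is an illustration only, not gpt-2-simple's implementation: the function names are made up, and applying top-k before top-p is an assumption of this sketch rather than a statement about how the library combines them.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Keep the top_k most likely tokens (0 = no limit), then the smallest
    prefix of them whose cumulative probability reaches top_p.
    Returns {token_index: renormalised probability}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]
print(filter_top_k_top_p(probs, top_p=0.9))   # nucleus sampling only
print(filter_top_k_top_p(probs, top_k=2))     # top-k only
print(softmax([1.0, 2.0, 3.0], temperature=0.1))
```

With a very low temperature nearly all of the probability mass lands on one token, which is consistent with the near-identical questions produced by the `temperature=0.1` runs in this notebook.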

In [0]:
text_file = "filtered_test_set.txt"

In [0]:
regenerate_idx = [2004, 3618, 3631, 7445, 7583, 6998]

In [0]:
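def data_processing(data_file):
  """Splits the data file into generation prompts and their gold questions."""
  contexts = []
  gold_questions = []
  with open(data_file, "r") as infile:
    text = infile.read()

  # Records are separated by the <|endoftext|> token.
  triplet = text.split("<|endoftext|>")
  for i in triplet:
    if i == "" or i == "\n":
      continue  # skip the empty pieces left over by the split
    # Everything up to and including "[question] " is the prompt; the rest is the gold question.
    c_q = i.split("[question]")
    contexts.append(c_q[0]+"[question] ")
    gold_questions.append(c_q[1][:-1])  # drop the trailing newline
  return contexts, gold_questions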

In [0]:
contexts, gold_questions = data_processing(text_file)
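
The parsing above relies on each record having the layout `<|startoftext|>`, `[summary] ...`, `[answer] ...`, `[question] ...`, `<|endoftext|>`. The sketch below replays the same splitting logic on an in-memory string so the result can be inspected without the data file. The sample record is made up, and `split_records` is a hypothetical stand-in that skips blank pieces with `strip()` instead of the exact comparison used in `data_processing`.

```python
def split_records(text):
    """Split raw text into (prompts, gold questions), one pair per record."""
    contexts, gold_questions = [], []
    for record in text.split("<|endoftext|>"):
        if not record.strip():
            continue  # nothing but whitespace between or after records
        context, question = record.split("[question]", 1)
        # The prompt ends with the "[question] " tag so the model completes it.
        contexts.append(context + "[question] ")
        gold_questions.append(question[:-1])  # drop the trailing newline
    return contexts, gold_questions


# Made-up record in the assumed layout.
sample = (
    "<|startoftext|>\n"
    "[summary]  A farmer builds a baseball field in Iowa.\n"
    "[answer] Iowa \n"
    "[question] Where does the farmer live?\n"
    "<|endoftext|>\n"
)
contexts_demo, gold_demo = split_records(sample)
print(repr(contexts_demo[0]))
print(repr(gold_demo[0]))
```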

In [6]:
for i in regenerate_idx:
    print(gold_questions[i])

 Where is Angel Falls?
 Where does the story take place?
 In the story, what country is the main location of events?
 What city does the story begin?
 Where does Longshanks get his education?
 Where do Jem and Margaret end up living?


In [15]:
len(contexts)

8265

In [20]:
generated_questions = {}

for idx in regenerate_idx:
  q = gpt2.generate(sess,
              run_name = run_name,
              length=50,
              temperature=0.1,
              top_k = 50, 
              top_p = 0.9,
              prefix= contexts[idx],
              nsamples=10,
              batch_size=10,
              include_prefix = False,
              truncate = "<|endoftext|>",
              return_as_list=True
              )
  generated_questions[idx] = q

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [7]:
for i in regenerate_idx:
    print(contexts[i])


<|startoftext|>
[summary]  Extreme sport athlete Johnny Utah (Luke Bracey), and his friend Jeff (Max Thieriot), are traversing a steep ridgeline on motorbikes. The run ends with a jump onto a lone stone column, where Jeff overshoots the landing and falls to his death.
Seven years later, Utah is an FBI agent candidate. He attends a briefing on a skyscraper heist, in which the criminals steal diamonds, escaping by parachute, in Mumbai. A similar heist happens over Mexico where the criminals unload millions of dollars in bills over Mexico, then disappear into the Cave of Swallows. Utah's research concludes that they were done by the same men, who are attempting to complete the Ozaki 8, a list of eight extreme ordeals to honor the forces of nature. They have already completed three, and Utah predicts they'll attempt the fourth on a rare sea wave phenomenon in France. After presenting his analysis, Utah is sent undercover to France under a field agent named Pappas (Ray Winstone). They reac

In [21]:
generated_questions

{2004: [' Where does Utah find Bodhi and Grommet?\n',
  ' Where does Utah go after he completes his third ordeal?\n',
  ' Where does Utah go after he completes his third ordeal?\n',
  ' Where does Utah find Bodhi and Grommet?\n',
  ' Where does Utah find Bodhi and Grommet?\n',
  ' Where does Utah go after he completes his third ordeal?\n',
  '',
  '',
  ' Where does Utah go after he completes his third ordeal?\n',
  ''],
 3618: ['', '', '', '', '', '', '', '', '', ''],
 3631: ['', '', '', '', '', '', '', '', '', ''],
 6998: [' Where does Jem move to after he marries Margaret?\n',
  '',
  ' Where does Jem move to after he marries Margaret?\n',
  ' Where does Jem move to after he marries Margaret?\n',
  ' Where does Jem move to after he marries Margaret?\n',
  ' Where does Jem move to after he marries Margaret?\n',
  ' Where does Jem move to after he marries Margaret?\n',
  '',
  ' Where does Jem move to after his marriage to Mary?\n',
  ' Where does Jem move to after he marries Margaret


## Sample Narrative
> document_id = "04954299c7b6bdc7b31b951bc0daa277353576a9"\
> test set: start index = 854, end index = 884 (exclusive)



In [0]:
a = contexts[854:884]

In [0]:
# Randomly choose question-answer pairs from the sample narrative (indices 6662 to 6691)
rand_idx = np.random.randint(low = 6662, high = 6692, size = 12, dtype = int)

In [0]:
gold_questions[6690] # PURPOSELY CHOOSE THIS QUESTION 

' What did Graham do to save Karin from choking?'

In [0]:
rand_idx = np.append(rand_idx, 6690)

In [0]:
rand_idx

array([6683, 6691, 6684, 6684, 6666, 6675, 6685, 6670, 6666, 6667, 6669,
       6686, 6690])

In [0]:
sample_generated_questions = {}
for i in set(rand_idx):
    sample_generated_questions[i] = {}
sample_generated_questions

{6666: {},
 6667: {},
 6669: {},
 6670: {},
 6675: {},
 6683: {},
 6684: {},
 6685: {},
 6686: {},
 6690: {},
 6691: {}}

In [0]:
for key, value in sample_generated_questions.items():
    context = contexts[key]
    answer = context.split("[answer]")[1].split("[question]")[0][:-1]
    gold_question = gold_questions[key]
    value["answer"] = answer
    value["gold question"] = gold_question
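
The answer field often holds two reference answers joined by `OR`, as the output further down shows. A minimal sketch for pulling them apart follows; `extract_answers` is a hypothetical helper, and splitting on the literal string `" OR "` is an assumption based on the answers visible in this notebook.

```python
def extract_answers(context):
    """Return the reference answers found between [answer] and [question]."""
    answer_field = context.split("[answer]", 1)[1].split("[question]", 1)[0]
    return [part.strip() for part in answer_field.split(" OR ") if part.strip()]


# Made-up prompt in the same layout as the entries of `contexts`.
demo_context = (
    "<|startoftext|>\n[summary]  Karin chokes on a hot dog.\n"
    "[answer] A hot dog OR hot dog\n[question] "
)
print(extract_answers(demo_context))
```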

In [0]:
string = a[0]
print(string)


<|startoftext|>
[summary]  Ray Kinsella is a novice Iowa farmer who lives with his wife, Annie, and daughter, Karin. In the opening narration, he explains how he had a troubled relationship with his father, John Kinsella, who had been a devoted baseball fan. While walking through his cornfield one evening, he hears a voice whispering, "If you build it, he will come." He continues hearing it before finally seeing a vision of a baseball diamond in his field. Annie is skeptical of that, but she allows him to plow the corn under in order to build a baseball field. As he builds it, he tells Karin the story of the 1919 Black Sox Scandal. As months pass and nothing happens at it, his family faces financial ruin until, one night, Karin spots a uniformed man in it. Ray recognizes him as Shoeless Joe Jackson, a deceased baseball player idolized by John. Thrilled to be able to play baseball again, he asks to bring others to the field to play. He later returns with the seven other players banned 

In [0]:
sample_generated_questions

{6666: {'answer': " Ray's brother in law OR Mark ",
  'gold question': " Who can't see the players?"},
 6667: {'answer': ' Archibald "Moonlight" Graham OR Archibald Moonlight Graham',
  'gold question': ' Which player played only one game for the Giants?'},
 6669: {'answer': ' A hot dog OR hot dog',
  'gold question': ' What was Karin choking on?'},
 6670: {'answer': ' Shoeless Joe OR Shoeless Joe Jackson',
  'gold question': ' Who did Ray think "the voice" was?'},
 6675: {'answer': ' Terrence Mann OR books written by radical author Terence Mann',
  'gold question': ' Who does the PTA wish to ban?'},
 6683: {'answer': ' He heard a voice whisper, If you build it, he will come. OR A voice.',
  'gold question': ' What did Ray hear while walking through his cornfield one evening?'},
 6684: {'answer': ' A baseball diamond in his cornfield. OR Shoeless Joe Jackson',
  'gold question': ' What did he see a vision of?'},
 6685: {'answer': " Deceased baseball player Shoeless Joe Jackson. OR 'Sho

In [0]:
for key, value in sample_generated_questions.items():
    print(key)

6690
6691
6666
6667
6669
6670
6675
6683
6684
6685
6686


In [0]:
len(contexts[800].split())

902

In [0]:
string = '\n<|startoftext|>\n[summary]  Ray Kinsella is a novice Iowa farmer who lives with his wife, Annie, and daughter, Karin. In the opening narration, he explains how he had a troubled relationship with his father, John Kinsella, who had been a devoted baseball fan. While walking through his cornfield one evening, he hears a voice whispering, "If you build it, he will come." He continues hearing it before finally seeing a vision of a baseball diamond in his field. Annie is skeptical of that, but she allows him to plow the corn under in order to build a baseball field. As he builds it, he tells Karin the story of the 1919 Black Sox Scandal. As months pass and nothing happens at it, his family faces financial ruin until, one night, Karin spots a uniformed man in it. will come to watch baseball.\n[answer] "If you build it, he will come." \n[question] '

In [0]:
a[0]

'\n<|startoftext|>\n[summary]  In 2003, Dr. Serena Kogan (Helena Bonham Carter) of Cyberdyne Systems convinces death row inmate Marcus Wright (Sam Worthington) to sign over his body for medical research following his execution. One year later, the automated Skynet system is activated, becomes self-aware, begins to perceive humans as a threat, and eradicates much of humanity with nuclear weapons in the event known as "Judgment Day".\nIn 2018, John Connor (Christian Bale) leads an attack on a Skynet base, where he discovers human prisoners and schematics for a new type of Terminator, incorporating living tissue (The T-800). He is the only survivor of the assault after the base is destroyed in a nuclear explosion. Following Connor\'s departure, Marcus emerges from the base\'s wreckage and begins walking towards Los Angeles, after taking the clothing from a Resistance soldier, who died in the explosion.\nJohn returns to Resistance headquarters, located aboard a nuclear submarine, and tells

In [0]:
q = gpt2.generate(sess,
              run_name = run_name,
              length=50,
              temperature=0.9,
              top_k = 50, 
              top_p = 0.9,
              prefix= contexts[0],
              nsamples=10,
              batch_size=10,
              include_prefix = False,
              truncate = "<|endoftext|>",
              return_as_list=True
              )

In [0]:
q

[' Who is Mark Hunter?\n',
 ' Who is Mark Hunter?\n',
 ' Who is Mark Hunter?\n',
 " What is Mark's profession?\n",
 ' Who is Mark Hunter?\n',
 ' Who is Mark Hunter?\n',
 " What is Mark Hunter's profession?\n",
 ' Who is Mark Hunter?\n',
 ' How does Mark Hunter start his radio station?\n',
 " What is Mark Hunter's occupation?\n"]

In [0]:
q

[" What is Mark Hunter's identity?\n",
 ' Who is Mark Hunter?\n',
 " What is Mark Hunter's relationship to his parents?\n",
 ' Who is Mark Hunter?\n',
 " What is Mark Hunter's job as a radio DJ?\n",
 ' Who is Mark Hunter?\n',
 " What is Mark Hunter's real identity?\n",
 ' Who is Mark Hunter?\n',
 ' Who is Mark Hunter?\n',
 " What is Mark Hunter's profession?\n"]

In [0]:
q

[' How is Mark Hunter described?\n',
 " What is Mark Hunter's job?\n",
 ' Who is Mark Hunter?\n',
 " What is Mark's main profession?\n",
 ' Who is Mark Hunter?\n',
 " What is Mark's profession?\n",
 " What is Mark Hunter's occupation?\n",
 ' Who is Mark Hunter?\n',
 " What is Mark's preferred occupation?\n",
 ' Who is Mark Hunter?\n']

In [0]:
q

[' How is Mark Hunter characterized?\n',
 ' Who is Mark Hunter?\n',
 " What is Mark Hunter's occupation?\n",
 ' Who is Mark Hunter?\n',
 " What is Mark's first job in Phoenix, Arizona?\n",
 " What is Mark Hunter's personality?\n",
 " What is Mark's identity?\n",
 " What is Mark Hunter's profession?\n",
 ' Who is Mark Hunter?\n',
 ' Who is Mark Hunter?\n']

In [0]:
q


[' Who is Mark Hunter?\n',
 ' Who is Mark Hunter?\n',
 " What is Mark Hunter's profession?\n",
 ' Who is Mark Hunter?\n',
 " What is Mark Hunter's job as a radio DJ?\n",
 ' How does Mark Hunter identify himself?\n',
 " What is Mark Hunter's profession?\n",
 " What is Mark Hunter's profession?\n",
 " What is Mark Hunter's occupation?\n",
 " What is Mark Hunter's major?\n"]

In [0]:
# Generate questions for each sampled context
questions_generated = []
for key, value in tqdm(sample_generated_questions.items()):
  q = gpt2.generate(sess,
              run_name = run_name,
              length=50,
              temperature=1,
              top_k = 50, 
              top_p = 0.9,
              nsamples=10,
              batch_size=10,
              prefix= contexts[key],
              include_prefix = False,
              truncate = "<|endoftext|>",
              return_as_list=True
              )
  questions_generated.append(q)
  value["generated questions"] = q
  print(q)

  0%|          | 0/11 [00:03<?, ?it/s]


KeyboardInterrupt: ignored

In [0]:
sample_generated_questions

{6662: {'answer': ' Iowa OR Iowa',
  'generated questions': ['<|startoftext|>\n[summary]  Ray Kinsella is a novice Iowa farmer who lives with his wife, Annie, and daughter, Karin. In the opening narration, he explains how he had a troubled relationship with his father, John Kinsella, who had been a devoted baseball fan. While walking through his cornfield one evening, he hears a voice whispering, "If you build it, he will come." He continues hearing it before finally seeing a vision of a baseball diamond in his field. Annie is skeptical of that, but she allows him to plow the corn under in order to build a baseball field. As he builds it, he tells Karin the story of the 1919 Black Sox Scandal. As months pass and nothing happens at it, his family faces financial ruin until, one night, Karin spots a uniformed man in it. Ray recognizes him as Shoeless Joe Jackson, a deceased baseball player idolized by John. Thrilled to be able to play baseball again, he asks to bring others to the field 

In [0]:
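# cleaning generated set of questions
for key, value in sample_generated_questions.items():
    # Strip the leading space and trailing newline from each question, then drop duplicates (keeping order).
    cleaned = [q[1:][:-1] for q in value["generated questions"]]
    value["generated questions"] = list(dict.fromkeys(cleaned))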

In [0]:
sample_generated_questions

{6662: {'answer': ' Iowa OR Iowa',
  'generated questions': 'The extra-credit DVD depicts how the radio station became a criminal enterprise. What did Mark in his ending do?',
  'gold question': ' Where does Ray Kinsella and his family live?'},
 6663: {'answer': ' Karin  OR Karin.',
  'generated questions': 'The extra-credit DVD depicts how the radio station became a criminal enterprise. What did Mark in his ending do?',
  'gold question': " What is the name of Ray's daughter?"},
 6665: {'answer': ' Shoeless Joe Jackson OR Shoeless Joe Jackson',
  'generated questions': 'The extra-credit DVD depicts how the radio station became a criminal enterprise. What did Mark in his ending do?',
  'gold question': " Which deceased baseball player shows up at Ray's field?"},
 6667: {'answer': ' Archibald "Moonlight" Graham OR Archibald Moonlight Graham',
  'generated questions': 'The extra-credit DVD depicts how the radio station became a criminal enterprise. What did Mark in his ending do?',
  'go

## Bulk Generation

In [0]:
!pip install -q gpt-2-simple

In [0]:
%tensorflow_version 1.x
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
import os
import shutil
import json
import tensorflow as tf
from gpt_2_simple.src import model
from tqdm import tqdm

run_name = "nqa-25000-0.00001"
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, checkpoint_dir="/content/checkpoint", run_name=run_name)

text_file = "filtered_test_set.txt"

def data_processing(data_file):
  contexts = []
  gold_questions = []
  with open(data_file, "r") as infile:
    text = infile.read()

  triplet = text.split("<|endoftext|>")
  for i in triplet:
    if i == "" or i == "\n":
      pass
    else:
      c_q = i.split("[question]")
      contexts.append(c_q[0]+"[question] ")
      gold_questions.append(c_q[1][:-1])
  return contexts, gold_questions

contexts, gold_questions = data_processing(text_file)

###################################################
generated_questions = {}
i = 7800

for context in tqdm(contexts[i:i+100]):
  q = gpt2.generate(sess,
              run_name = run_name,
              length=50,
              temperature=0.1,
              top_k = 50, 
              top_p = 0.9,
              prefix= context,
              nsamples=10,
              batch_size=10,
              include_prefix = False,
              truncate = "<|endoftext|>",
              return_as_list=True
              )
  generated_questions[i] = q
  i += 1 

output_file = "nqa_"+str(i)+".json"
with open(output_file, "w") as json_file:
  json.dump(generated_questions, json_file)
shutil.copyfile(output_file, "/content/drive/My Drive/NLU_Project/SummQG/results/nqa/"+output_file)

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Loading checkpoint /content/checkpoint/nqa-25000-0.00001/model-5000
INFO:tensorflow:Restoring parameters from /content/checkpoint/nqa-25000-0.00001/model-5000


  0%|          | 0/100 [00:00<?, ?it/s]

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


100%|██████████| 100/100 [45:03<00:00, 27.04s/it]


'/content/drive/My Drive/NLU_Project/SummQG/results/nqa/nqa_7900.json'
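
One detail to keep in mind when reading these result files back: `json.dump` writes the integer context indices as strings, so `generated_questions[7800]` becomes `"7800"` after a round trip. The sketch below shows one way to restore the integer keys. `load_generated` and the file name `nqa_demo.json` are hypothetical, and the data is made up.

```python
import json
import os
import tempfile


def load_generated(path):
    """Load a results file and turn the string keys back into int indices."""
    with open(path) as f:
        return {int(idx): questions for idx, questions in json.load(f).items()}


# Round trip on made-up data inside a temporary directory.
demo = {7800: [" Who is Mark Hunter?\n"], 7801: [""]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "nqa_demo.json")
    with open(demo_path, "w") as f:
        json.dump(demo, f)
    with open(demo_path) as f:
        raw_keys = list(json.load(f))  # keys as stored on disk
    restored = load_generated(demo_path)

print(raw_keys)
print(sorted(restored))
```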

In [0]:
!nvidia-smi

Sat May  9 22:51:04 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P0    33W / 250W |   8587MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------

In [0]:
generated_ = []
for context in tqdm(contexts[:1]):
  q = gpt2.generate(sess,
              run_name = run_name,
              length=50,
              temperature=1,
              top_k = 50, 
              top_p = 0.9,
              prefix= context,
              nsamples=10,
              batch_size=10,
              include_prefix = False,
              truncate = "<|endoftext|>",
              return_as_list=True
              )
  generated_.append(q)

100%|██████████| 1/1 [00:47<00:00, 47.35s/it]


In [0]:
generated_ #T = 1

[[' Who is Mark Hunter?\n',
  " What is Mark's real identity?\n",
  ' Who is Mark Hunter?\n',
  ' How is Mark "Hard-luck" Robinson originally a high school student?\n',
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  " What is Mark Hunter's real identity?\n",
  ' How is Mark first introduced to FM radio?\n']]

In [0]:
generated_ #T = 0.5

[[" What is Mark Hunter's profession?\n",
  " What is Mark's profession?\n",
  ' Who is Mark Hunter?\n',
  " What is Mark Hunter's profession?\n",
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  " What is Mark's profession?\n",
  " What is Mark's job?\n",
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n']]

In [0]:
print(contexts[0])

<|startoftext|>
[summary]  Mark Hunter (Slater), a high school student in a sleepy suburb of Phoenix, Arizona, starts an FM pirate radio station that broadcasts from the basement of his parents' house. Mark is a loner, an outsider, whose only outlet for his teenage angst and aggression is his unauthorized radio station. His pirate station's theme song is "Everybody Knows" by Leonard Cohen and there are glimpses of cassettes by such alternative musicians as The Jesus and Mary Chain, Camper Van Beethoven, Primal Scream, Soundgarden, Ice-T, Bad Brains, Concrete Blonde, Henry Rollins, and The Pixies. By day, Mark is seen as a loner, hardly talking to anyone around him; by night, he expresses his outsider views about what is wrong with American society. When he speaks his mind about what is going on at his school and in the community, more and more of his fellow students tune in to hear his show.
Nobody knows the true identity of "Hard Harry" or "Happy Harry Hard-on," as Mark refers to hims

In [0]:
del generated_questions

In [0]:
!/opt/bin/nvidia-smi

Sat May  9 22:21:57 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
!ps -aux|grep python

root        1557  0.1  0.3 484948 101568 ?       Sl   21:53   0:06 /usr/bin/python2 /usr/local/bin/jupyter-notebook --ip="172.28.0.2" --port=9000 --FileContentsManager.root_dir="/" --MappingKernelManager.root_dir="/content"
root        2190 55.0 40.7 35091696 10905532 ?   Ssl  22:20  18:18 /usr/bin/python3 -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-fe3b34a5-802d-4ad2-9727-b86cffa92442.json
root        2695  0.0  0.0  39192  6644 ?        S    22:53   0:00 /bin/bash -c ps -aux|grep python
root        2697  0.0  0.0  38568  5616 ?        S    22:53   0:00 grep python


In [0]:
strr = !ps -aux|grep python

In [0]:
!rm -9 2190

rm: invalid option -- '9'
Try 'rm --help' for more information.


In [0]:
!kill --help

kill: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
    Send a signal to a job.
    
    Send the processes identified by PID or JOBSPEC the signal named by
    SIGSPEC or SIGNUM.  If neither SIGSPEC nor SIGNUM is present, then
    SIGTERM is assumed.
    
    Options:
      -s sig	SIG is a signal name
      -n sig	SIG is a signal number
      -l	list the signal names; if arguments follow `-l' they are
    		assumed to be signal numbers for which names should be listed
      -L	synonym for -l
    
    Kill is a shell builtin for two reasons: it allows job IDs to be used
    instead of process IDs, and allows processes to be killed if the limit
    on processes that you can create is reached.
    
    Exit Status:
    Returns success unless an invalid option is given or an error occurs.


In [0]:
roots = !ps -aux|grep python
ram_used = ["kill", "-9", "1569"]
#for root in roots:
#  ram_used.append(root.split()[1])
cmd = ' '.join(ram_used)
print(cmd)


  

  # !kill -9 25937 25969 25986 26003 26033 26054 26337 26365 26404 26426 26444 26446

kill -9 1569


In [0]:
!{cmd}

In [0]:
run_time

NameError: ignored

### First example from test set with varying temperature = 0.1, 0.5, 1

In [0]:
contexts[0]

'<|startoftext|>\n[summary]  Mark Hunter (Slater), a high school student in a sleepy suburb of Phoenix, Arizona, starts an FM pirate radio station that broadcasts from the basement of his parents\' house. Mark is a loner, an outsider, whose only outlet for his teenage angst and aggression is his unauthorized radio station. His pirate station\'s theme song is "Everybody Knows" by Leonard Cohen and there are glimpses of cassettes by such alternative musicians as The Jesus and Mary Chain, Camper Van Beethoven, Primal Scream, Soundgarden, Ice-T, Bad Brains, Concrete Blonde, Henry Rollins, and The Pixies. By day, Mark is seen as a loner, hardly talking to anyone around him; by night, he expresses his outsider views about what is wrong with American society. When he speaks his mind about what is going on at his school and in the community, more and more of his fellow students tune in to hear his show.\nNobody knows the true identity of "Hard Harry" or "Happy Harry Hard-on," as Mark refers to

In [0]:
gold_questions[:1]

[' Who is Mark Hunter?']

In [0]:
generated_questions #temperature = 0.1

[[' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  " What is Mark Hunter's profession?\n",
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n']]

In [0]:
generated_questions #temperature = 0.5

[[" What is Mark Hunter's profession?\n",
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  " What is Mark Hunter's profession?\n",
  " What is Mark Hunter's profession?\n",
  " What is Mark Hunter's occupation?\n",
  " What is Mark Hunter's background?\n",
  ' Who is Mark Hunter?\n',
  " What is Mark Hunter's occupation?\n"]]

In [0]:
generated_questions #temperature = 1

[[' Who is Mark Hunter?\n',
  " What is Mark Hunter's occupation?\n",
  ' Who is Mark Hunter?\n',
  " What is Mark's job at school?\n",
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  ' Who is Mark Hunter?\n',
  " What is Mark Hunter's social class reputation?\n",
  " What is Mark Hunter's job?\n",
  ' Who is Mark Hunter?\n']]
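
To compare the three temperature settings with numbers rather than by eye, a small summary of each batch of samples can help. The sketch below is a hypothetical helper (`summarize_samples` is not part of gpt-2-simple); it counts empty outputs and distinct questions and reports the most frequent one. The example input mirrors the `temperature = 0.1` output above.

```python
from collections import Counter


def summarize_samples(samples):
    """Count empty, distinct and most frequent questions in one batch."""
    cleaned = [s.strip() for s in samples]
    non_empty = [s for s in cleaned if s]
    counts = Counter(non_empty)
    return {
        "n_samples": len(samples),
        "n_empty": len(samples) - len(non_empty),
        "n_distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


# Same composition as the temperature = 0.1 batch shown above.
batch = [" Who is Mark Hunter?\n"] * 9 + [" What is Mark Hunter's profession?\n"]
print(summarize_samples(batch))
```

A batch made up entirely of empty strings yields `n_distinct` of 0, which is a convenient signal that the index may need to be regenerated.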

# LICENSE

MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.