<a href="https://colab.research.google.com/github/the-minerva-quest/minerva-open-data/blob/main/Minerva_Quest_DH_Tutorial_GPT_2_Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# !!! MAKE A COPY OF THIS NOTEBOOK FIRST !!!

#  Digital Humanities Tutorial: Generating Text with GPT-2
The purpose of this notebook is to provide an accessible demo for generating custom-style texts with the GPT-2 architecture. To learn more about GPT-2, check out OpenAI's [original paper](https://openai.com/blog/better-language-models/), which includes amazing output samples.

Please follow the instructions provided below. If you get stuck, reach out to me at armin.hamp@minerva.kgi.edu.

## Step I: Loading the model

By running the following cells, we will load the requisite libraries and a small version of the pretrained GPT-2 model.


In [None]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
  Building wheel for gpt-2-simple (setup.py) ... done
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



In [None]:
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 287Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 124Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 387Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:03, 140Mit/s]                                   
Fetching model.ckpt.index: 1.05Mit [00:00, 365Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 195Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 187Mit/s]                                                       


Running the cell below will mount your personal Google Drive in the VM, which later cells can use to move data in and out. It will ask for an auth code; that code is not saved anywhere.

---



In [None]:
gpt2.mount_gdrive()

Mounted at /content/drive


## Step II: Training your model

Here you will get a chance to finetune the pre-loaded GPT-2 model on the textual style of your choice. To do this, you will have to find a corpus of text that represents the particular style you want to reproduce. For example, this could be a collection of children's stories, romantic novels, or corporate email messages.

Here are two great resources to find inspiration, but feel free to compile your own dataset.



*   https://research.fb.com/downloads/babi/
*   https://github.com/niderhoff/nlp-datasets

Once you have found a style or genre to your liking:

1. Obtain a plaintext (.txt) file of the corpus of your choice. 
2. Make sure that your file is **no larger than 50MB**; a larger file might make the Colab notebook run out of memory during the finetuning stage.
3. Upload your .txt file into the sample_data folder in the left-hand sidebar. See the image below for reference:
![alt text](https://i.imgur.com/TGcZT4h.png)
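
The 50MB limit from step 2 can be checked before you upload anything. The snippet below is a minimal sketch, assuming the corpus is a file you can reach from Python; the helper name `corpus_size_ok` and the constant `MAX_BYTES` are illustrative and not part of gpt-2-simple.

```python
import os

# Illustrative limit taken from step 2 above (50 MB, counting 1 MB as 1024 * 1024 bytes).
MAX_BYTES = 50 * 1024 * 1024


def corpus_size_ok(path, max_bytes=MAX_BYTES):
    """Return True if the file at `path` is no larger than `max_bytes`."""
    return os.path.getsize(path) <= max_bytes
```

For example, `corpus_size_ok("quest.txt")` returns `False` for a corpus that should be trimmed before finetuning.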




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


4. Change the filename in the cell below to the name of the plaintext you just uploaded.

In [None]:
# The file here is from Facebook Research's bAbI project for automatic text comprehension:
# https://research.fb.com/downloads/babi/
# It is a collection of children's stories.
file_name = "quest.txt"

In [None]:
gpt2.copy_file_from_gdrive(file_name)

Finetune the model with the following hyperparameters. This process should take about 20 minutes.


In [None]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name='124M',
              steps=1000,
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=200,
              save_every=500
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:02<00:00,  2.56s/it]


dataset has 439961 tokens
Training...
[10 | 28.16] loss=3.33 avg=3.33
[20 | 49.58] loss=3.32 avg=3.32
[30 | 71.20] loss=3.24 avg=3.30
[40 | 93.02] loss=3.17 avg=3.27
[50 | 115.02] loss=3.10 avg=3.23
[60 | 137.19] loss=3.21 avg=3.23
[70 | 159.60] loss=3.23 avg=3.23
[80 | 182.12] loss=3.31 avg=3.24
[90 | 204.77] loss=2.94 avg=3.20
[100 | 227.56] loss=3.09 avg=3.19
[110 | 250.54] loss=2.67 avg=3.14
[120 | 273.51] loss=3.07 avg=3.13
[130 | 296.56] loss=2.75 avg=3.10
[140 | 319.65] loss=2.86 avg=3.08
[150 | 342.75] loss=2.71 avg=3.06
[160 | 365.83] loss=2.93 avg=3.05
[170 | 389.00] loss=2.64 avg=3.02
[180 | 412.32] loss=2.64 avg=3.00
[190 | 435.64] loss=2.46 avg=2.97
[200 | 458.93] loss=2.40 avg=2.94
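
Each progress line in the log above has the form `[step | elapsed seconds] loss=... avg=...`, so a single line is enough to project how long the full run will take. The snippet below is a rough sketch that assumes a constant time per step; the helper name `estimate_total_seconds` is hypothetical, and actual timing depends on the GPU that Colab assigns.

```python
import re

# Matches gpt-2-simple progress lines such as "[200 | 458.93] loss=2.40 avg=2.94".
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def estimate_total_seconds(log_line, total_steps):
    """Project total training time from one progress line, assuming a constant step rate."""
    match = LOG_LINE.match(log_line.strip())
    if match is None:
        raise ValueError("not a training progress line: " + repr(log_line))
    step, elapsed = int(match.group(1)), float(match.group(2))
    return elapsed / step * total_steps


# Last line of the log above, projected to the steps=1000 used in the finetune call.
seconds = estimate_total_seconds("[200 | 458.93] loss=2.40 avg=2.94", total_steps=1000)
print(round(seconds / 60, 1), "minutes")  # prints "38.2 minutes"
```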

## Generating Text From Our Newly Trained Model

The `gpt2.generate()` function will now generate text from our finetuned model. Play around with the generator and see what kind of results you can get.

By default, the `gpt2.generate()` function will generate as much text as possible (1,024 tokens) with a little bit of randomness. You can tweak the generator using the parameters below: increase the temperature to increase “creativity” by making the network more likely to pick suboptimal predictions, or provide a prefix to specify exactly how you want your text to begin.
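
To see why a higher temperature makes suboptimal predictions more likely, it helps to look at how temperature reshapes a probability distribution. The snippet below is a minimal sketch of the standard temperature-scaled softmax, not code taken from gpt-2-simple; the function name `softmax_with_temperature` and the example scores are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities, dividing by `temperature` first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A low temperature concentrates probability on the top-scoring token, while a high temperature flattens the distribution so that less likely tokens are sampled more often.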

As a bonus, you can bulk-generate text with gpt-2-simple by setting `nsamples` (the total number of texts to generate) and `batch_size` (the number of texts to generate at a time); the Colaboratory GPUs can support a `batch_size` of up to 20. You can write the results to a text file with `gpt2.generate_to_file(file_name)`, which takes the same parameters as `gpt2.generate()`. You can then download the generated file locally via the sidebar to save and share the generated texts.
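
The bulk-generation settings can be collected in one place before calling the library. The snippet below is a sketch under stated assumptions: the helper `bulk_generation_kwargs` is hypothetical, the `destination_path` keyword and the requirement that `nsamples` be a multiple of `batch_size` reflect my understanding of gpt-2-simple and should be checked against the installed version, and the timestamped file name is just one convenient convention.

```python
from datetime import datetime


def bulk_generation_kwargs(nsamples, batch_size, length=300, temperature=0.7, prefix=None):
    """Build keyword arguments for bulk generation after a basic sanity check."""
    # Assumption: gpt-2-simple expects nsamples to divide evenly into batches.
    if nsamples % batch_size != 0:
        raise ValueError("nsamples must be a multiple of batch_size")
    # Timestamped file name so repeated runs do not overwrite each other.
    destination = "gpt2_gentext_{:%Y%m%d_%H%M%S}.txt".format(datetime.now())
    return {
        "destination_path": destination,
        "length": length,
        "temperature": temperature,
        "prefix": prefix,
        "nsamples": nsamples,
        "batch_size": batch_size,
    }


kwargs = bulk_generation_kwargs(nsamples=100, batch_size=20)
# In the notebook, with `sess` from the finetuning cell (assumed usage):
# gpt2.generate_to_file(sess, **kwargs)
# files.download(kwargs["destination_path"])
```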

In [None]:
gpt2.generate(sess,
              length=300,
              temperature=1,
              prefix="Ben Nelson, the founder and CEO of the Minerva Project, revealed himself to be a robot yesterday.",
              nsamples=2,
              batch_size=2
              )

Ben Nelson, the founder and CEO of the Minerva Project, revealed himself to be a robot yesterday. Nelson claimed, incorrectly, that he had created the self-balancing robot at his side, and that he has since worked to figure out how it works.  Other claims have him claiming to have been playing a game he created called “Defense Against The Damned.” In this game, an opponent who is clearly hurt or dead can be saved with a pivotal critical decision. However, it is unlikely that Nelson’s self-balancing self-balancing system will live up to this status. It would take active action to stop it, and only a small minority of successful robot defense mechanisms exist. Furthermore, even a small minority is never enough. In order to operate the self-balancing system, a robot has to be very, very safe, and have a significant amount of fatalities at the end.  Another obstacle may be determining which safe and reliable methods of defibrillator installation are more efficient. Since 2002, researchers at Lawrence Berkeley National Laboratory (BerkeleyNLL) have been working to develop better ways to detect when a self-balancing device is disengaging. These advancements have caused concerns among researchers that the devices could make it harder to defibrillate service members first, and that defibrillators may become a de facto hand-me-down approach.  The self-balancing claims are unfounded. Self-balancing claims vary depending on a customer’s age, dependency, and/or geographic area. 