<a href="https://colab.research.google.com/github/spatank/CIS-530/blob/master/Homework%2012/language_model_finetuning_hw.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 12 - Language Model Fine-Tuning
This colab demonstrates how to fine-tune GPT-2 on a dataset of first-person text adventures. We use the [Hugging Face Transformers](https://github.com/huggingface/transformers) library to do this.

**IMPORTANT: Make sure that you have GPU set as your Hardware Accelerator in `Runtime > Change runtime type` before running this Colab.**
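One way to confirm that a GPU is attached before running the rest of the notebook is to ask `nvidia-smi`. The helper below is a hypothetical sketch: the name `gpu_visible` is not part of this notebook, and it assumes the NVIDIA driver tools are on the `PATH`, which is typically the case on a Colab GPU runtime.

```python
import shutil
import subprocess

def gpu_visible():
    """Return True if `nvidia-smi -L` lists at least one GPU (hypothetical helper)."""
    if shutil.which("nvidia-smi") is None:
        return False  # Driver tools are missing, so no GPU runtime is attached.
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout

print("GPU visible:", gpu_visible())
```

If this prints `False`, change the runtime type as described above before continuing.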

## Setup

### Install the Hugging Face Transformers library.

In [1]:
!git clone https://github.com/huggingface/transformers

import os
os.chdir('/content/transformers')

!git checkout d6ef587a10e0d8836376a2314d8aeae36ad63263

!pip install .
!pip install -r ./examples/requirements.txt

os.chdir('/content/transformers/examples')

!pip install dict_to_obj

fatal: destination path 'transformers' already exists and is not an empty directory.
HEAD is now at d6ef587a [ci] Fixup e36bd94345af6045108a391f9ac7f4dc557548de
Processing /content/transformers
Building wheels for collected packages: transformers
  Building wheel for transformers (setup.py) ... done
  Created wheel for transformers: filename=transformers-2.5.1-cp36-none-any.whl size=505540 sha256=c38ab44a5b33df2dfd327d0c8c878162530fb2b1df351e831e4bd9b4e1691d77
  Stored in directory: /tmp/pip-ephem-wheel-cache-6tr_609a/wheels/23/19/dd/2561a4e47240cf6b307729d58e56f8077dd0c698f5992216cf
Successfully built transformers
Installing collected packages: transformers
  Found existing installation: transformers 2.5.1
    Uninstalling transformers-2.5.1:
      Successfully uninstalled transformers-2.5.1
Successfully installed transformers-2.5.1


In [0]:
import torch
import run_language_modeling
import run_generation
from dict_to_obj import DictToObj
import collections
import random
import numpy as np

from transformers import AutoConfig
from transformers import AutoTokenizer
from transformers import AutoModelWithLMHead

### Mount your Google Drive
We will be saving trained checkpoints on your Google Drive so that they can be accessed even if the Colab session dies. Make sure to log in with your UPenn credentials, as you will be saving several gigabytes of data, and Penn gives you unlimited Drive storage.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Download text adventure data.
It has already been split into train, dev, and test sets for you.

In [4]:
# Download the train, dev, and test sets.
!wget -nc -O /content/text_adventures_test.txt https://raw.githubusercontent.com/computational-linguistics-class/computational-linguistics-class.github.io/master/homework/transformers/text_adventures_test.txt
!wget -nc -O /content/text_adventures_dev.txt https://raw.githubusercontent.com/computational-linguistics-class/computational-linguistics-class.github.io/master/homework/transformers/text_adventures_dev.txt
!wget -nc -O /content/text_adventures_train.txt https://raw.githubusercontent.com/computational-linguistics-class/computational-linguistics-class.github.io/master/homework/transformers/text_adventures_train.txt

--2020-04-28 20:22:56--  https://raw.githubusercontent.com/computational-linguistics-class/computational-linguistics-class.github.io/master/homework/transformers/text_adventures_test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 309533 (302K) [text/plain]
Saving to: ‘/content/text_adventures_test.txt’


2020-04-28 20:22:57 (16.3 MB/s) - ‘/content/text_adventures_test.txt’ saved [309533/309533]

--2020-04-28 20:22:57--  https://raw.githubusercontent.com/computational-linguistics-class/computational-linguistics-class.github.io/master/homework/transformers/text_adventures_dev.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.c

In [5]:
!ls /content/

drive	     text_adventures_dev.txt   text_adventures_train.txt
sample_data  text_adventures_test.txt  transformers


In [0]:
!ls "/content/drive/My Drive/finetuned_models/text_adventures"

## Finetune and Eval
The Hugging Face library provides a script [run_language_modeling.py](https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py) which contains all of the code for training and evaluating a language model.

We will be calling this script directly from the command line in order to launch training. We will also use functions from this script to conduct evaluation and generate samples at inference time.

### Launch fine-tuning
We will be calling `run_language_modeling.py` from the command line to launch fine-tuning. **Running fine-tuning may take several hours.** Every `save_steps` steps, a checkpoint is saved to disk. The checkpoint contains all the learned weights for your model, and you can always reload the model from a saved checkpoint, even if your Colab has crashed.

Below is an explanation of some of the arguments you might want to modify in the command below. 

* `--line_by_line`: Add `--line_by_line` if distinct lines of the text should be treated as distinct training examples. For example, if your dataset contains one story/tweet/article per line, this should be set.
* `--num_train_epochs`: The number of times to iterate over the train set. Increasing the number of epochs may result in better performance, but making this number too high will cause the model to overfit on the train set.
* `--block_size`: Your training text is truncated into blocks of this length. At test time, you will only want to generate sequences that are at most this length.
* `--gradient_accumulation_steps`: Accumulate gradients over this many batches before each weight update. You should set this to >1 when the batch size is very small to improve training stability.
* `--output_dir`: This is where checkpoints will get saved. When you finetune on your own dataset, you should change this path. We recommend saving checkpoints to your Google Drive (`/content/drive/My Drive/`) so you can access them even if the Colab session dies.
* `--model_name_or_path`: The path to the model weights to use when starting fine-tuning. You can set this to `gpt2-medium` to initialize with GPT-2's 355 million parameter model, or `gpt2` to initialize with their smaller 124 million parameter model. You can also set this to one of your own checkpoints to restart your training job if it crashes.
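To make `--block_size` and `--gradient_accumulation_steps` concrete, here is a minimal sketch in plain Python. The function names are hypothetical and not part of `run_language_modeling.py`. It assumes the script cuts the tokenized text into full blocks and drops a short final block, which is an assumption about the pinned version rather than something this notebook confirms. The numbers come from this notebook: a per-GPU batch size of 2, 5 accumulation steps, and the 7771 batches per epoch shown in the training progress bar.

```python
def split_into_blocks(token_ids, block_size):
    """Cut a flat list of token ids into full blocks; a short tail is dropped."""
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)]

def effective_batch_size(per_gpu_batch_size, accumulation_steps, n_gpu=1):
    """Number of examples that contribute to each weight update."""
    return per_gpu_batch_size * accumulation_steps * max(1, n_gpu)

def optimizer_steps(num_batches, accumulation_steps):
    """Number of weight updates performed in one pass over num_batches batches."""
    return num_batches // accumulation_steps

# Ten toy token ids with block_size=4 give two full blocks; ids 8 and 9 are dropped.
print(split_into_blocks(list(range(10)), block_size=4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(effective_batch_size(2, 5))   # 10
print(optimizer_steps(7771, 5))     # 1554
```

Under these assumptions one epoch performs roughly 1554 weight updates, which lines up with checkpoints being written every 100 steps up to `checkpoint-1500`.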

**I am getting out-of-memory errors. What do I do?**

GPU memory use during training grows with the `block_size` and the batch size (`per_gpu_train_batch_size`); the number of trainable parameters in the model does not depend on either. If you are getting out-of-memory errors, try decreasing these values.
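As a rough illustration of why lowering these two values helps, the sketch below counts the attention-score values a GPT-2 style model holds for one forward pass. The function name is hypothetical, and the count is only a proxy: it ignores other activations, the weights, and optimizer state. The defaults of 12 heads and 12 layers match the `gpt2` config printed later in this notebook.

```python
def attention_score_entries(batch_size, block_size, n_head=12, n_layer=12):
    """Count attention-score values across all layers for one forward pass.

    Each layer keeps one (block_size x block_size) score matrix per head and
    per example, so this term grows linearly with batch size and
    quadratically with block size.
    """
    return batch_size * n_head * block_size * block_size * n_layer

base = attention_score_entries(batch_size=2, block_size=128)
half_block = attention_score_entries(batch_size=2, block_size=64)
half_batch = attention_score_entries(batch_size=1, block_size=128)

print(base / half_block)  # 4.0
print(base / half_batch)  # 2.0
```

Halving `block_size` shrinks this particular term by a factor of four, while halving the batch size shrinks it by a factor of two.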

**Oh no! My computer went to sleep and the Colab disconnected.**

The training job might still have completed. Check the `output_dir` in your Google Drive to see if checkpoint files have been created there.

**Training is taking foreverrrrrr.**

Try decreasing `num_train_epochs` or changing `model_name_or_path` to `gpt2` instead of `gpt2-medium`.
If your evaluation set is very large, you might also want to remove the `evaluate_during_training` flag or increase `logging_steps`.

In [7]:
!python run_language_modeling.py \
    --output_dir='/content/drive/My Drive/finetuned_models/text_adventures' \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --save_total_limit=5 \
    --num_train_epochs=1.0 \
    --do_train \
    --logging_steps=500 \
    --save_steps=100 \
    --train_data_file=/content/text_adventures_train.txt \
    --do_eval \
    --eval_data_file=/content/text_adventures_dev.txt \
    --per_gpu_train_batch_size=2 \
    --per_gpu_eval_batch_size=2 \
    --block_size=128 \
    --gradient_accumulation_steps=5 \
    --overwrite_output_dir

Streaming output truncated to the last 5000 lines.
Iteration:  38% 2956/7771 [06:56<10:29,  7.65it/s]
Iteration:  38% 2957/7771 [06:56<09:53,  8.12it/s]
Iteration:  38% 2958/7771 [06:56<09:27,  8.48it/s]
Iteration:  38% 2959/7771 [06:56<09:12,  8.72it/s]
Iteration:  38% 2960/7771 [06:56<10:08,  7.91it/s]
Iteration:  38% 2961/7771 [06:57<10:28,  7.65it/s]
Iteration:  38% 2962/7771 [06:57<09:52,  8.12it/s]
Iteration:  38% 2963/7771 [06:57<09:29,  8.45it/s]
Iteration:  38% 2964/7771 [06:57<09:15,  8.65it/s]
Iteration:  38% 2965/7771 [06:57<10:05,  7.93it/s]
Iteration:  38% 2966/7771 [06:57<10:27,  7.65it/s]
Iteration:  38% 2967/7771 [06:57<09:54,  8.09it/s]
Iteration:  38% 2968/7771 [06:57<09:27,  8.46it/s]
Iteration:  38% 2969/7771 [06:57<09:10,  8.73it/s]
Iteration:  38% 2970/7771 [06:58<10:02,  7.97it/s]
Iteration:  38% 2971/7771 [06:58<10:23,  7.69it/s]
Iteration:  38% 2972/7771 [06:58<09:47,  8.16it/s]
Iteration:  38% 2

### Compute perplexity of a dataset.
This section shows how to compute perplexity of a dataset according to either the pre-trained or your fine-tuned language model. While this is possible to do by calling `run_language_modeling.py` on the command-line as above, we'll instead call the Python functions directly.
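For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that definition in plain Python; the function name is hypothetical, and the script's own `evaluate` function may aggregate the loss slightly differently (for example, averaging per batch), so treat this as an illustration of the quantity rather than a copy of the script's code.

```python
import math

def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood), with losses in natural log."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns probability 1/4 to every correct token has a
# per-token loss of ln(4), so its perplexity is approximately 4.
uniform_nlls = [math.log(4)] * 8
print(perplexity(uniform_nlls))
```

A perplexity of about 4 here means the model is, on average, as uncertain as if it were choosing uniformly among four tokens at each step. Lower is better.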

#### Look at what checkpoints are available
Run `ls` to look at what checkpoints have been saved. You'll want to set `CHECKPOINT_PATH` below to one of these in order to evaluate the model weights saved in that checkpoint.
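If you would rather pick the most recent checkpoint programmatically, a small helper along these lines could work. The function name is hypothetical; it only assumes that checkpoint directories are named `checkpoint-<step>`, as in the listing produced by this notebook.

```python
import re

def latest_checkpoint(names):
    """Return the checkpoint-<step> name with the largest step, or None."""
    steps = []
    for name in names:
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match:
            # Compare steps as integers so that 900 sorts below 1100.
            steps.append((int(match.group(1)), name))
    return max(steps)[1] if steps else None

# In the notebook this list could come from os.listdir(output_dir).
listing = ["checkpoint-1100", "checkpoint-1500", "pytorch_model.bin",
           "vocab.json", "checkpoint-1200", "config.json"]
print(latest_checkpoint(listing))  # checkpoint-1500
```

The latest checkpoint is not necessarily the best one, so it is still worth comparing the perplexity of several checkpoints.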

In [8]:
!ls '/content/drive/My Drive/finetuned_models/text_adventures'

checkpoint-1100  checkpoint-1500   pytorch_model.bin	    vocab.json
checkpoint-1200  config.json	   special_tokens_map.json
checkpoint-1300  eval_results.txt  tokenizer_config.json
checkpoint-1400  merges.txt	   training_args.bin


#### Helper functions

In [0]:
def load_model(args):
  """Creates a model and loads in weights for it."""
  config = AutoConfig.from_pretrained(args.model_name_or_path, cache_dir=None)

  # Load the weights once. from_tf is only True for TensorFlow ".ckpt" paths.
  model = AutoModelWithLMHead.from_pretrained(
      args.model_name_or_path,
      from_tf=bool(".ckpt" in args.model_name_or_path),
      config=config,
      cache_dir=None
  )

  # Move the model to the GPU (or CPU) selected in args.
  model.to(args.device)
  return model

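def set_seed(seed):
  """Sets the random seed for Python, NumPy, and PyTorch (CPU and GPU)."""
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  # Use the seed argument here rather than relying on a global `args`.
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)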

def do_perplexity_eval(args, model, data_file_path):
  """Computes the perplexity of the text in data_file_path according to the provided model."""
  set_seed(args.seed)

  args.eval_data_file=data_file_path

  tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=None)

  args.block_size = min(args.block_size, tokenizer.max_len)

  result = run_language_modeling.evaluate(args, model, tokenizer, prefix="")
  return result

#### Compute it.

In [15]:
# Set this to the checkpoint you want to evaluate, or to "gpt2-medium" to
# evaluate the pre-trained model without finetuning.
CHECKPOINT_PATH = '/content/drive/My Drive/finetuned_models/text_adventures/checkpoint-1500'
# CHECKPOINT_PATH = "gpt2"

# Set this to the list of text files you want to evaluate the perplexity of.
DATA_PATHS = ["/content/text_adventures_dev.txt",
              "/content/text_adventures_test.txt"]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
print("Running on device: ", device)

args = collections.defaultdict(
  model_name_or_path=CHECKPOINT_PATH,
  output_dir=CHECKPOINT_PATH,
  block_size=128,
  local_rank=-1,
  eval_batch_size=2,
  per_gpu_eval_batch_size=2,
  n_gpu=n_gpu,
  mlm=False,
  device=device,
  line_by_line=False,
  overwrite_cache=None,
  model_type='gpt2',
  seed=42,
)
args = DictToObj(args)

model = load_model(args)

for data_path in DATA_PATHS:
  eval_results = do_perplexity_eval(args, model, data_path)
  perplexity = eval_results['perplexity']
  print('{} is the perplexity of {} according to {}'.format(
      perplexity, data_path, CHECKPOINT_PATH))

04/28/2020 20:50:31 - INFO - transformers.configuration_utils -   loading configuration file /content/drive/My Drive/finetuned_models/text_adventures/checkpoint-1500/config.json
04/28/2020 20:50:31 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "do_sample": false,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "eos_token_ids": null,
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_epsilon": 1e-05,
  "length_penalty": 1.0,
  "max_length": 20,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "num_beams": 1,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states":

Running on device:  cuda


04/28/2020 20:50:51 - INFO - transformers.configuration_utils -   loading configuration file /content/drive/My Drive/finetuned_models/text_adventures/checkpoint-1500/config.json
04/28/2020 20:50:51 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "do_sample": false,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "eos_token_ids": null,
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_epsilon": 1e-05,
  "length_penalty": 1.0,
  "max_length": 20,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "num_beams": 1,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states":

33.38034439086914 is the perplexity of /content/text_adventures_dev.txt according to /content/drive/My Drive/finetuned_models/text_adventures/checkpoint-1500


Evaluating: 100%|██████████| 290/290 [00:08<00:00, 34.02it/s]
04/28/2020 20:51:13 - INFO - run_language_modeling -   ***** Eval results  *****
04/28/2020 20:51:13 - INFO - run_language_modeling -     perplexity = tensor(24.9464)


24.94635581970215 is the perplexity of /content/text_adventures_test.txt according to /content/drive/My Drive/finetuned_models/text_adventures/checkpoint-1500


### Generate samples
The following code generates text samples that are continuations of a provided prompt.
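The generation code in this section exposes three sampling controls: `temperature`, `k` (top-k), and `p` (nucleus sampling). The sketch below illustrates the idea behind them on a toy list of logits, using only the standard library. The function names are hypothetical, and the library's `generate` method may differ in details such as tie handling and the exact cutoff rule for the nucleus.

```python
import math
import random

def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature rescales the logits: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # Numerically stable softmax.
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank tokens from most to least likely, then keep only the top k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)
    kept = {i: probs[i] / mass for i in ranked}

    # Nucleus sampling: keep the smallest prefix whose mass reaches top_p.
    if top_p < 1.0:
        nucleus, cumulative = {}, 0.0
        for i in ranked:
            nucleus[i] = kept[i]
            cumulative += kept[i]
            if cumulative >= top_p:
                break
        mass = sum(nucleus.values())
        kept = {i: p / mass for i, p in nucleus.items()}
    return kept

def sample(distribution, rng):
    """Draw one token index according to the filtered probabilities."""
    tokens = list(distribution)
    return rng.choices(tokens, weights=[distribution[t] for t in tokens], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
top2 = filtered_distribution(logits, top_k=2)
print(sorted(top2))  # [0, 1]
print(sample(top2, random.Random(42)) in top2)  # True
```

Setting `top_k=0` and `top_p=1.0` disables both filters, which matches the meaning of the `k=0` and `p=1.0` comments in the arguments used for generation.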

In [0]:
def generate_samples(args, model, prompt_text):
  """Generates samples for the provided prompt using the provided model."""
  set_seed(args.seed)

  _, tokenizer_class = run_generation.MODEL_CLASSES[args.model_type]
  tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path, cache_dir=None)

  # Tokenize the prompt and move it to the same device as the model.
  encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
  encoded_prompt = encoded_prompt.to(args.device)

  output_sequences = model.generate(
      input_ids=encoded_prompt,
      max_length=args.length + len(encoded_prompt[0]),
      temperature=args.temperature,
      top_k=args.k,
      top_p=args.p,
      repetition_penalty=args.repetition_penalty,
      do_sample=True,
      num_return_sequences=args.num_return_sequences,
  )

  # Remove the batch dimension when returning multiple sequences
  if len(output_sequences.shape) > 2:
    output_sequences.squeeze_()

  generated_sequences = []

  for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
    generated_sequence = generated_sequence.tolist()

    # Decode text
    text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)

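    # Remove all text after the stop token, if one is set and present.
    # str.find returns -1 when the token is absent, and slicing with -1
    # would silently cut off the last character.
    stop_index = text.find(args.stop_token) if args.stop_token else -1
    if stop_index != -1:
      text = text[:stop_index]

    # Strip the decoded prompt from the front; it is re-added verbatim below.
    text = text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :]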

    # Add the prompt at the beginning of the sequence.
    total_sequence = prompt_text + text

    generated_sequences.append(total_sequence)

  return generated_sequences

In [19]:
# Set this to the checkpoint you want to use for generation, or to "gpt2-medium"
# to generate with the pre-trained model without finetuning.
CHECKPOINT_PATH = '/content/drive/My Drive/finetuned_models/text_adventures/checkpoint-1400'
# CHECKPOINT_PATH = 'gpt2'

# You should try out other prompts as well as no prompt at all.
PROMPT = 'You are standing in at the mouth of a large cave.'


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
print("Running on device: ", device)

args = collections.defaultdict(
  model_name_or_path=CHECKPOINT_PATH,
  output_dir=CHECKPOINT_PATH,
  n_gpu=n_gpu,
  mlm=False,
  device=device,
  model_type='gpt2',
  seed=42,
  stop_token=None, # Set this if your dataset has a special word that indicates the end of a text.
  temperature=1.0,  # temperature sampling. Set this to temperature=1.0 to not use temperature.
  k=50,  # k for top-k sampling. Set this to k=0 to not use top-k.
  p=1.0,  # p for nucleus sampling. Set this to p=1.0 to not use nucleus sampling.
  repetition_penalty=None,
  length=100,  # Number of tokens to generate.
  num_return_sequences=3,  # Number of independently computed samples to generate.
)
args = DictToObj(args)

model = load_model(args)
sequences = generate_samples(args, model, PROMPT)
for idx, sequence in enumerate(sequences):
  print('\n====== GENERATION {} ======'.format(idx))
  print(sequence)

04/28/2020 20:54:01 - INFO - transformers.configuration_utils -   loading configuration file /content/drive/My Drive/finetuned_models/text_adventures/checkpoint-1400/config.json
04/28/2020 20:54:01 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "do_sample": false,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "eos_token_ids": null,
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_epsilon": 1e-05,
  "length_penalty": 1.0,
  "max_length": 20,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "num_beams": 1,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states":

Running on device:  cuda


04/28/2020 20:54:06 - INFO - transformers.configuration_utils -   loading configuration file /content/drive/My Drive/finetuned_models/text_adventures/checkpoint-1400/config.json
04/28/2020 20:54:06 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "do_sample": false,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "eos_token_ids": null,
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_epsilon": 1e-05,
  "length_penalty": 1.0,
  "max_length": 20,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "num_beams": 1,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states":


You are standing in at the mouth of a large cave. You stare down at the floor, your eyes bulging with terror. The cave holds a massive mass of broken stones, and they are filled with a thicket of plants and animal flesh. You don't have the strength to move, and despite your rage you're not fast enough to block any of the stones that pass through before you reach the mouth. You keep your eyes fixed on the cave. "What in the fag did my father steal?" 


The man doesn't give you

You are standing in at the mouth of a large cave.
> You make haste
You don't have time to waste, the beast in front of you has turned its head, and now's the perfect time to attack.

You try to dodge, and the beast's head hits you in the chest, sending you flying backwards. The creature's chest crampens, its mouth opens to reveal an opening that has been left behind.

"I'm coming." you say.

"You could leave me, or go down for

You are standing in at the mouth of a large cave.” He says, turning to you.

“Where i