# Text Generation - Fine-tuning GPT-2

In this notebook we'll tackle text generation with the [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) model. We'll look at the data preparation and fine-tuning needed for GPT-2 to produce the desired text. Our goal is to fine-tune a pretrained model to generate haikus based on provided keywords.

First things first, let's make sure we have a GPU instance in this Colab session:
- `Edit -> Notebook settings -> Hardware accelerator` must be set to GPU
- if needed, reinitialize the session by clicking `Connect` in the top right corner

After the session is initialized, we can check our assigned GPU with the following command (fingers crossed it's a Tesla P100 :P):

In [None]:
!nvidia-smi

Let's install and import everything we need:

In [171]:
%%capture
!pip install git+https://github.com/huggingface/transformers
!pip install datasets
!pip install evaluate
!pip install --upgrade --no-cache-dir gdown

In [7]:
import numpy as np
import torch

from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

## Dataset

Let's load the dataset of haikus and take a look at some of the examples:

In [None]:
dataset = load_dataset("statworx/haiku")

In [None]:
dataset = dataset.map(lambda ex: {"text": f"<SOH> {ex['keywords']}: {ex['text']}"})
dataset["train"].to_json("haiku_train.json")

In [None]:
dataset["train"]["text"][:3]

## Training (don't run during tutorial)

We can use a training script that HuggingFace provides for Causal Language Modeling:

In [None]:
!wget https://github.com/huggingface/transformers/raw/main/examples/pytorch/language-modeling/run_clm.py

In [None]:
# optional if you want to save your models to Google Drive
from google.colab import drive
drive.mount("/content/drive/")

In [None]:
!python run_clm.py \
    --model_name_or_path gpt2 \
    --train_file haiku_train.json \
    --per_device_train_batch_size 8 \
    --block_size 96 \
    --do_train \
    --output_dir /content/drive/MyDrive/NLP-workshop-materials/haiku-gpt2/

## Evaluation

We now have a fine-tuned GPT-2 model ready to generate haikus. GPT-2 outputs a probability distribution over the next token, conditioned on the previous ones. There are several ways we can go about generating text from it:
- Greedy decoding
- Beam search
- Top-k/Top-p sampling

You can read more [here](https://huggingface.co/blog/how-to-generate).

Let's first download and initialize the already fine-tuned model.

In [None]:
!mkdir -p /content/gpt2-haiku
!gdown -O /content/gpt2-haiku/config.json https://drive.google.com/uc?id=13BNZ5ZihTgs9-oq_JljJUxsW8-4dakY7
!gdown -O /content/gpt2-haiku/pytorch_model.bin https://drive.google.com/uc?id=1Pdh8tH4_vpzLPw8RnJ0FrPQE6urqr9AJ

In [175]:
# only run if you want to use the model we've already fine-tuned for you
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2-haiku").to(device)

#### Greedy decoding
This is the simplest approach: at every step we select the most probable next word, i.e. the word with the highest output probability. The drawback is that after some text the model tends to start repeating itself. This would therefore be a bad decoding scheme for producing long continuous text, but since we're producing fairly short haikus it might achieve okay results.

<div>
<img src="https://github.com/andrejmiscic/NLP-workshop/raw/master/figures/greedy.PNG" width="800"/>
</div>
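
The repetition problem is easy to reproduce on a toy example. The sketch below does not use GPT-2 at all: `TOY_BIGRAMS` is a made-up table that maps the previous token to a distribution over the next one, and `greedy_decode` is a hypothetical helper written only for this illustration.

```python
# Hypothetical bigram "model": previous token -> distribution over the next token.
# All probabilities are invented for illustration.
TOY_BIGRAMS = {
    "<s>": {"old": 0.7, "frog": 0.3},
    "old": {"pond": 0.8, "frog": 0.2},
    "pond": {"old": 0.5, "frog": 0.3, "</s>": 0.2},
    "frog": {"</s>": 0.9, "pond": 0.1},
}


def greedy_decode(bigrams, start="<s>", end="</s>", max_length=8):
    """Repeatedly append the single most probable next token."""
    tokens = [start]
    while len(tokens) < max_length and tokens[-1] != end:
        next_probs = bigrams[tokens[-1]]
        tokens.append(max(next_probs, key=next_probs.get))
    return tokens


greedy_tokens = greedy_decode(TOY_BIGRAMS)
print(" ".join(greedy_tokens))
```

Under greedy selection "old" is always followed by "pond" and "pond" by "old", so the loop never reaches the end token and only stops at `max_length`.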

In [65]:
from transformers.utils import logging
logging.set_verbosity_error()

In [166]:
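# Words the model is not allowed to generate (passed to `model.generate` as `bad_words_ids`)
bad_words = ["ices", "icespare", "ice", "iced", "urn", "vernal", "vernalis", "vernacular",
             "equinox", "vernate", "verna", "vernas", "vernier", "ver"]
bad_words_ids = tokenizer(bad_words, add_special_tokens=False)["input_ids"]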

In [176]:
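def postprocess_haiku(text: str) -> str:
    """Drop the '<SOH> keywords: ' prefix, cut at the next '<' or '>' and turn '/ ' into line breaks."""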
    colon_idx = text.find(':')
    if colon_idx < 0 or colon_idx + 2 >= len(text):
        return text.replace('/ ', '\n')
    text = text[colon_idx + 2:]
    soh_idx = text.find('<')
    if soh_idx < 0:
        soh_idx = text.find('>')
    if soh_idx < 0:
        return text.replace('/ ', '\n')
    return text[:soh_idx].replace('/ ', '\n')

In [135]:
def generate_text_greedy(prompt="", max_length=64, bad_words_ids=None):
  model.eval()
  model_prompt = "<SOH> " if len(prompt) == 0 else "<SOH> " + prompt + ": "
  input_ids = tokenizer.encode(model_prompt, return_tensors='pt').to(device)
  generated_ids = model.generate(input_ids, max_length=max_length, bad_words_ids=bad_words_ids).cpu().tolist()

  generated_text = [tokenizer.decode(g, skip_special_tokens=True) for g in generated_ids][0]
  return postprocess_haiku(generated_text)

In [None]:
print(generate_text_greedy())

In [None]:
print(generate_text_greedy(bad_words_ids=bad_words_ids))

In [None]:
print(generate_text_greedy("data science", bad_words_ids=bad_words_ids))
print()
print(generate_text_greedy("maple tree", bad_words_ids=bad_words_ids))
print()
print(generate_text_greedy("swedish summer", bad_words_ids=bad_words_ids))

#### Beam search

Beam search is also a deterministic decoding scheme, but it offers an improvement over greedy decoding. A problem with greedy decoding is that we might miss the most likely sequence, since we only pick the most probable word at each timestep. Beam search mitigates this by keeping track of the *n* most probable sequences at every step and ultimately selecting the most probable one.

<div>
<img src="https://github.com/andrejmiscic/NLP-workshop/raw/master/figures/beam.PNG" width="500"/>
</div>
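
To see how keeping several candidates can beat greedy decoding, here is a minimal sketch on a toy bigram table. It is independent of GPT-2: `TOY_BIGRAMS` and `beam_search` are hypothetical names, the probabilities are invented, and scores are summed log-probabilities over a fixed number of steps.

```python
import math

# Hypothetical bigram "model": previous token -> distribution over the next token.
# All probabilities are invented for illustration.
TOY_BIGRAMS = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.7, "drives": 0.3},
}


def beam_search(bigrams, start, num_steps, num_beams):
    """Keep the `num_beams` highest-scoring sequences after every step."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    for _ in range(num_steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in bigrams[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


# num_beams=1 is equivalent to greedy decoding
greedy_tokens, greedy_score = beam_search(TOY_BIGRAMS, "The", num_steps=2, num_beams=1)[0]
beam_tokens, beam_score = beam_search(TOY_BIGRAMS, "The", num_steps=2, num_beams=2)[0]

print(" ".join(greedy_tokens), f"{math.exp(greedy_score):.2f}")
print(" ".join(beam_tokens), f"{math.exp(beam_score):.2f}")
```

Greedy decoding commits to "nice" because it is the most probable first word, and ends with a sequence probability of 0.5 × 0.4 = 0.20. With two beams, "dog" survives the first step and leads to a sequence with probability 0.4 × 0.9 = 0.36.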

In [157]:
def generate_text_beam(prompt="", max_length=64, num_beams=4, bad_words_ids=None):
  model.eval()
  model_prompt = "<SOH> " if len(prompt) == 0 else "<SOH> " + prompt + ": "
  input_ids = tokenizer.encode(model_prompt, return_tensors='pt').to(device)
  generated_ids = model.generate(input_ids, max_length=max_length, num_beams=num_beams,
                                 no_repeat_ngram_size=2, bad_words_ids=bad_words_ids).cpu().tolist()

  generated_text = [tokenizer.decode(g, skip_special_tokens=True) for g in generated_ids][0]
  return postprocess_haiku(generated_text)

In [None]:
print(generate_text_beam(bad_words_ids=bad_words_ids))

In [None]:
print(generate_text_beam("data science", bad_words_ids=bad_words_ids))
print()
print(generate_text_beam("maple tree", bad_words_ids=bad_words_ids))
print()
print(generate_text_beam("swedish summer", bad_words_ids=bad_words_ids))

#### Top-k/Top-p sampling

We've looked at two deterministic decoding schemes; let's now focus on a non-deterministic one that is based on sampling the next word from a probability distribution. The output probability distribution is over the entire model vocabulary (on the order of tens of thousands of tokens); it has most of its mass on a small subset of the most probable words and a very long tail. Tokens in the tail would produce incoherent gibberish, so we must somehow restrict sampling to the most probable words. That's where top-k and top-p sampling come into play:

- [Top-k sampling](https://arxiv.org/abs/1805.04833) keeps only the *k* most probable words and redistributes the probability mass among them. The problem is that we must choose a fixed *k*, which might lead to suboptimal results in some scenarios.
- [Top-p sampling](https://arxiv.org/abs/1904.09751) addresses this by keeping the smallest set of most probable words whose cumulative probability exceeds *p*. The probability mass is then again redistributed among these words.

We'll use a combination of both in this notebook, but you're free to test different scenarios.
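
To make the two filters concrete, here is a minimal sketch that applies them to a small hand-written distribution. It is not how `model.generate` is implemented internally: `toy_probs`, `top_k_filter`, `top_p_filter` and `sample_token` are hypothetical names used only for this illustration, and the sketch keeps words until the cumulative probability reaches `p`.

```python
import random

# Invented next-word distribution with a short "tail" of unlikely words.
toy_probs = {"pond": 0.5, "frog": 0.3, "moon": 0.15, "xylophone": 0.04, "qwerty": 0.01}


def renormalize(items):
    """Rescale the kept (word, prob) pairs so that they sum to 1 again."""
    total = sum(prob for _, prob in items)
    return {word: prob / total for word, prob in items}


def top_k_filter(probs, k):
    """Keep the k most probable words."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalize(ranked[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


def sample_token(probs, rng):
    """Draw one word according to the (filtered) distribution."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


top_k_probs = top_k_filter(toy_probs, k=2)
top_p_probs = top_p_filter(toy_probs, p=0.9)
print(top_k_probs)
print(top_p_probs)
print(sample_token(top_p_probs, random.Random(0)))
```

With `k=2` only "pond" and "frog" survive, while `p=0.9` also keeps "moon" because the first two words cover only 0.8 of the mass. In both cases the unlikely tail words can no longer be sampled.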

There is another parameter that we haven't introduced: `temperature`, which rescales the logits before the softmax function and so controls the shape of the output distribution. Regular softmax has `temperature` = 1. As `temperature` -> 0, we give more probability mass to the most probable words (we move towards greedy decoding). Higher values produce a more uniform distribution.
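
The effect of `temperature` can be checked numerically. The sketch below assumes the usual formulation, in which the logits are divided by the temperature before the softmax; `softmax_with_temperature` and `toy_logits` are hypothetical names and the logits are invented.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by `temperature` (must be > 0)."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # invented scores for three candidate words

sharp = softmax_with_temperature(toy_logits, temperature=0.5)
regular = softmax_with_temperature(toy_logits, temperature=1.0)
flat = softmax_with_temperature(toy_logits, temperature=2.0)

for name, probs in [("T=0.5", sharp), ("T=1.0", regular), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

The probability of the top word grows as the temperature drops and shrinks as it rises, which is why low temperatures behave more like greedy decoding.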

<div>
<img src="https://github.com/andrejmiscic/NLP-workshop/raw/master/figures/topk.PNG" width="800"/>
</div>

In [162]:
def generate_text_sampling(prompt="", max_length=64, top_k=50, top_p=0.90, temp=1.0, num_return=1, bad_words_ids=None):
  model.eval()
  model_prompt = "<SOH> " if len(prompt) == 0 else "<SOH> " + prompt + ": "
  input_ids = tokenizer.encode(model_prompt, return_tensors='pt').to(device)
  generated_ids = model.generate(input_ids, do_sample=True, max_length=max_length, temperature=temp,
                                 top_k=top_k, top_p=top_p, num_return_sequences=num_return, bad_words_ids=bad_words_ids).cpu().tolist()

  generated_text = [tokenizer.decode(g, skip_special_tokens=True) for g in generated_ids]
  return [postprocess_haiku(text) for text in generated_text]

In [None]:
for haiku in generate_text_sampling(num_return=3, temp=0.7, bad_words_ids=bad_words_ids):
    print(haiku, end="\n\n")

In [None]:
for haiku in generate_text_sampling("data science", num_return=3, temp=0.7, bad_words_ids=bad_words_ids):
    print(haiku, end="\n\n")
print('-' * 20)
for haiku in generate_text_sampling("maple tree", num_return=3, temp=0.7, bad_words_ids=bad_words_ids):
    print(haiku, end="\n\n")
print('-' * 20)
for haiku in generate_text_sampling("swedish summer", num_return=3, temp=0.7, bad_words_ids=bad_words_ids):
    print(haiku, end="\n\n")