# Language Modeling & Text Generation with GPT-2

In [2]:
import re
from sklearn.model_selection import train_test_split
import pandas as pd
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, default_data_collator
from collections import Counter
import spacy
import nltk
import en_core_web_sm
import torch
import sys
sys.path.append('..')
from util.generate import FicDataset, EarlyStoppingCallback, clean_text_gen, count_custom_tokens, chunk_docs

Using HuggingFace and PyTorch, we construct a language model by fine-tuning the pretrained GPT-2 model on our corpus.

## Cleaning

We perform some basic cleaning of extraneous characters and extra whitespace before training our GPT-2 text generation model on our corpus. We don't need to do more complex processing such as lemmatization and stopword removal as GPT-2 trains on the full text.

In [None]:
df = pd.read_pickle("../data/avatar_fics_processed.pickle")

In [None]:
df["cleaned"] = df["text"].map(clean_text_gen)

In [None]:
# df.to_pickle("avatar_fics_cleaned.pickle")

## Find and Add Custom Tokens 

We find and add the most common custom tokens in our corpus to the pretrained GPT-2 tokenizer. We do so by tokenizing the text some other way (we simply split by spaces) and then comparing the most common resulting tokens to GPT-2's original vocabulary.

A few notes: the GPT-2 tokenizer treats capitalized words differently from their lowercase forms, and words that begin a sentence differently from words preceded by a space. We'll account for the former behavior by adding to the tokenizer some common capitalized words from our corpus that don't appear in the original vocabulary, e.g. "Avatar." We have to handle the latter in our custom token search in order to properly compare words in our corpus to the pretrained vocabulary, but we don't add custom tokens for this behavior at the moment.
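
The search described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `count_unknown_words` is a hypothetical name, the real `count_custom_tokens` in `util.generate` is not shown here and may differ, and the toy vocabulary stands in for `tokenizer.get_vocab()`. The one GPT-2 detail it relies on is that a word following a space is stored in the vocabulary with a `Ġ` prefix.

```python
from collections import Counter
import string


def count_unknown_words(docs, vocab, stopwords):
    """Count space-delimited words that are missing from a GPT-2-style vocab."""
    counts = Counter()
    for doc in docs:
        for raw in doc.split(" "):
            # Drop surrounding punctuation so "tea." is compared as "tea".
            word = raw.strip(string.punctuation + string.whitespace)
            if not word or word.lower() in stopwords:
                continue
            # A word that follows a space is stored as "Ġ" + word; a word with
            # no preceding space is stored bare. Check both forms.
            if word in vocab or "Ġ" + word in vocab:
                continue
            counts[word] += 1
    return counts


# Toy vocabulary (assumption) standing in for tokenizer.get_vocab().
toy_vocab = {"Ġthe": 0, "Ġtea": 1, "Ġdrank": 2, "The": 3}
toy_docs = ["Zuko drank the tea.", "Iroh and Zuko drank tea."]
counts = count_unknown_words(toy_docs, toy_vocab, {"the", "and"})
print(counts.most_common())  # [('Zuko', 2), ('Iroh', 1)]
```

Words that survive this filter are candidates only; we still pick the final custom tokens by hand from the most common ones.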

Get the pretrained tokenizer and its vocabulary.

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
vocab = tokenizer.get_vocab()

Let's load stopwords from spaCy and NLTK and exclude them from the count so the results are easier to read.

In [None]:
nlp = en_core_web_sm.load()
stopwords_nltk = set(nltk.corpus.stopwords.words('english'))
stopwords_spacy = nlp.Defaults.stop_words
stopwords = stopwords_nltk.union(stopwords_spacy)

In [None]:
count = count_custom_tokens(df["cleaned"], vocab, stopwords)

In [None]:
count.most_common(10)

From this list we grab the following common custom tokens. We also add a padding token.

In [None]:
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tokenizer.add_tokens(["Zuko", "Sokka", "Mai", "Appa", "Katara", "Kya", "Suki", "Iroh", "Aang", "Toph", "Beifong", "Agni", "Kai", "Hakoda", "Ozai", "Azula", "Ursa"])
tokenizer.add_tokens(["Avatar", "Tribe", "Uncle", "Tea", "Kingdom", "Air", "Water", "Earth", "Prince", "Fire", "Lord", "Nephew", "Temple"]);

In [None]:
tokenizer.save_pretrained("../models/tokenizer_textgen")

## Chunking Our Corpus

Since GPT-2 accepts at most 1024 tokens as input, we split our corpus into smaller chunks before training. Chunk size also determines how much GPU memory each example uses, and therefore how large a batch we can fit. We choose 500 tokens as a reasonable guess.

We make sure to cut off chunks at sentence boundaries so GPT-2 doesn't train on documents with partial sentences. We also exclude overly long sentences should they occur in the corpus.
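
A minimal sketch of sentence-boundary chunking, assuming a greedy packing strategy. `chunk_by_sentence` and `count_words` are hypothetical names, and the actual `chunk_docs` helper in `util.generate` may work differently. The token counter is passed in as a function; a whitespace word count is used here so the example runs without a tokenizer, whereas with the real tokenizer one would likely use something like `lambda s: len(tokenizer.encode(s))`.

```python
import re


def chunk_by_sentence(doc, count_tokens, max_tokens):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if n == 0 or n > max_tokens:
            continue  # skip empty sentences and ones too long for any chunk
        if current_len + n > max_tokens:
            # The next sentence would overflow, so close the current chunk.
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def count_words(text):
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


doc = ("Aang smiled. Katara laughed at the joke. "
       "Sokka ate. Toph said nothing at all today.")
chunks = chunk_by_sentence(doc, count_words, max_tokens=8)
for chunk in chunks:
    print(count_words(chunk), chunk)
```

With a subword tokenizer the token count of a joined chunk can differ slightly from the sum of its sentences, which is one reason to keep `max_tokens` well below the 1024 limit.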

In [None]:
max_tokens = 500
chunked_docs = chunk_docs(df["cleaned"], tokenizer, max_tokens)

In [None]:
chunked_df = pd.DataFrame(chunked_docs, columns=["text"])

In [85]:
chunked_df.to_pickle("../data/chunked_df.pickle")

In [4]:
# chunked_df = pd.read_pickle("../data/chunked_df.pickle")

## Train-Test Split and PyTorch Encoding

We do a train-val-test split (80/10/10) and encode the data so it can be properly processed by HuggingFace and PyTorch.
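
For causal language modeling, each training example also needs `labels`. GPT-2's language modeling head shifts the labels internally, so they can simply be a copy of `input_ids`, and positions labeled `-100` are skipped by the loss. The sketch below shows one way a dataset wrapper could derive labels from the padded encodings. It is an assumption about what such a wrapper might do, not a description of `FicDataset`, whose code is not shown here; `make_labels` and the pad ID are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips by default


def make_labels(input_ids, attention_mask):
    """Copy input_ids as labels, masking padded positions so they add no loss."""
    return [
        token_id if keep == 1 else IGNORE_INDEX
        for token_id, keep in zip(input_ids, attention_mask)
    ]


# One padded example; 9999 is a made-up pad token ID for illustration.
input_ids = [50, 51, 52, 9999, 9999]
attention_mask = [1, 1, 1, 0, 0]
labels = make_labels(input_ids, attention_mask)
print(labels)  # [50, 51, 52, -100, -100]
```

Without this masking the model is also trained to predict padding tokens, which matters more when `padding="longest"` pads short chunks up to the longest chunk in the split.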

In [None]:
train, valtest = train_test_split(chunked_df["text"], test_size = 0.2, random_state=0)
val, test = train_test_split(valtest, test_size = 0.5, random_state=0)

In [None]:
train_encodings = tokenizer(train.tolist(), truncation=True, max_length=max_tokens, padding="longest")
val_encodings = tokenizer(val.tolist(), truncation=True, max_length=max_tokens, padding="longest")
test_encodings = tokenizer(test.tolist(), truncation=True, max_length=max_tokens, padding="longest")

In [None]:
train_dataset = FicDataset(train_encodings)
val_dataset = FicDataset(val_encodings)
test_dataset = FicDataset(test_encodings)

In [None]:
torch.save(train_dataset, "../data/datasets/train.data")
torch.save(val_dataset, "../data/datasets/val.data")
torch.save(test_dataset, "../data/datasets/test.data")

## Fine-Tuning Our Model

We fine-tune our model, using early stopping to halt training when the validation loss fails to improve for 3 consecutive evaluations. Our model ran for approximately 1.5 epochs before it stopped early.
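
The stopping rule can be illustrated with a small patience counter. This is a sketch of the general idea only: `stopping_step` is a hypothetical helper, the loss values are made up, and the `EarlyStoppingCallback` used below may track a different metric or threshold.

```python
def stopping_step(eval_losses, patience=3):
    """Return the index of the evaluation at which training stops, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            # New best validation loss: reset the patience counter.
            best, bad_evals = loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Made-up validation losses, one per evaluation (every 500 steps in our setup).
losses = [3.2, 2.9, 2.8, 2.85, 2.81, 2.9, 2.7]
print(stopping_step(losses))  # 5
```

Note that the later, lower loss of 2.7 is never reached, which is the usual trade-off of early stopping. Setting `load_best_model_at_end=True` means the checkpoint with the best validation loss is the one kept.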

In [None]:
model = GPT2LMHeadModel.from_pretrained("gpt2")

In [None]:
model.resize_token_embeddings(len(tokenizer)) 
training_args = TrainingArguments(
    output_dir='../models/checkpoints',          # output directory
    overwrite_output_dir = True,
    save_total_limit = 3,
    num_train_epochs = 5,              # total # of training epochs
    per_device_train_batch_size=2,  # batch size per device during training
    per_device_eval_batch_size=2,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    save_steps=500,
    weight_decay=0.01,               # strength of weight decay
    logging_dir='../models/logs',            # directory for storing logs
    evaluation_strategy="steps",
    logging_steps=500,
    eval_steps=500,
    load_best_model_at_end=True,
)

# our util class, copied from HuggingFace code that was committed but not yet released
callback = EarlyStoppingCallback()

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    callbacks=[callback],
)

In [None]:
trainer.train()

## Text Generation

We generate text after setting various parameters for the text generation sampling. A tutorial may be found [here](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb).
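
To give a feel for what `temperature`, `top_k` and `top_p` do, here is a simplified pure-Python illustration of filtering a next-token distribution. It is not the library's implementation, the function name and logits are made up, and it leaves out `repetition_penalty`.

```python
import math
import random


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: prob} after temperature, top-k, then top-p filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-k: keep only the k most likely tokens, then renormalize.
    if top_k > 0:
        ranked = ranked[:top_k]
    norm = sum(p for p, _ in ranked)
    ranked = [(p / norm, i) for p, i in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


# Made-up logits for a four-token vocabulary.
dist = filtered_distribution([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.9)
print(dist)  # only tokens 0 and 1 survive the filters

# Sample the next token from the filtered distribution.
rng = random.Random(0)
next_token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
```

Lower `top_k` and `top_p` values restrict sampling to safer, more likely tokens, while higher values allow more varied and riskier continuations.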

To run inference on a GPU, uncomment the ```cuda()``` calls.

In [112]:
tokenizer = GPT2Tokenizer.from_pretrained("../models/tokenizer_textgen")
model = GPT2LMHeadModel.from_pretrained("../models/model_textgen")#.cuda()

In [113]:
torch.manual_seed(0)

<torch._C.Generator at 0x7fcb48844fb0>

In [114]:
output_length = 200
temperature = 0.8
top_p = 0.94
top_k = 60
rep_pen = 1.2
num_return = 3

context = "Sokka "

In [115]:
input_ids = tokenizer.encode(context, return_tensors="pt")#.cuda()

output_sequences = model.generate(
    input_ids=input_ids,
    max_length=output_length,
    temperature=temperature,
    top_k=top_k,
    top_p=top_p,
    repetition_penalty=rep_pen,
    do_sample=True,
    num_return_sequences=num_return
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [116]:
for i, output in enumerate(output_sequences):
  print("{}: {}".format(i, tokenizer.decode(output, skip_special_tokens=True)))

0: Sokka's lips parted. "I can't believe I'm getting married to someone who is going after my own children," he said, and he sounded so sad. Zuko brought a hand up to his mouth as Zuko sighed heavily before turning to look at him again. "You're kidding me!  "What does that mean?  "I can't imagine having to do it alone for the rest of your life," Sokka said honestly. Zuko didn't want to admit that. They'd already been through this together and had both gotten along well enough that he thought they were going to be good people eventually in their lives. "No, I think you don't have to deal with this right now," Toph said. She was wearing her new boyfriend's outfit all day. It was probably just her imagination that had been running through her head when they got there, but she could see the way his expression softened when he turned away from them, back towards the tree trunk. "Do you
1: Sokka's mouth twitches into a smile, and Sokka shakes his head at the sight. "Not that I'm surprised yo