<a href="https://colab.research.google.com/github/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/11_fine_tuning_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Fine-Tuning GPT-2

Predicting the next element in a sequence is exactly what a Transformer decoder
does, so it should be no surprise that GPT-2 is actually a Transformer decoder.



We’ll start our NLP journey by following the steps of Alice and Dorothy, from
[Alice’s Adventures in Wonderland](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/1476) by Lewis Carroll and [The Wonderful Wizard of Oz](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/1740) by L. Frank Baum.

![](https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-pytorch-step-by-step/Part-III-NLP/images/alice_dorothy.png?raw=1)

*Left: "Alice and the Baby Pig" illustration by John Tenniel, from "Alice's Adventures in Wonderland" (1865).*

*Right: "Dorothy meets the Cowardly Lion" illustration by W.W. Denslow, from "The Wonderful Wizard of Oz" (1900).*


##Setup

In [1]:
try:
    import google.colab
    import requests
    url = 'https://raw.githubusercontent.com/dvgodoy/PyTorchStepByStep/master/config.py'
    r = requests.get(url, allow_redirects=True)
    open('config.py', 'wb').write(r.content)
except ModuleNotFoundError:
    pass

from config import *
config_chapter11()
# This is needed to render the plots in this chapter
from plots.chapter11 import *

Downloading files from GitHub repo to Colab...
Finished!


In [2]:
%%capture

!pip install accelerate -U
!pip install datasets
!pip -q install spacy
!python -m spacy download en_core_web_sm

In [3]:
import os
import json
import errno
import requests
import numpy as np
from copy import deepcopy
from operator import itemgetter

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, Dataset

from data_generation.nlp import ALICE_URL, WIZARD_URL, download_text
from stepbystep.v4 import StepByStep
# These are the classes we built in Chapter 10
from seq2seq import *

import spacy
import nltk
from nltk.tokenize import sent_tokenize

In [4]:
from datasets import load_dataset, Split
from transformers import (
    DataCollatorForLanguageModeling,
    BertModel, BertTokenizer, BertForSequenceClassification,
    DistilBertModel, DistilBertTokenizer,
    DistilBertForSequenceClassification,
    AutoModelForSequenceClassification,
    AutoModel, AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments, pipeline, TextClassificationPipeline
)
from transformers.pipelines import SUPPORTED_TASKS

In [5]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [6]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

##Downloading Books

In [7]:
!rm -rf data

In [8]:
# let's download data
HOME_DIR = "data"
download_text(ALICE_URL, HOME_DIR)
download_text(WIZARD_URL, HOME_DIR)

In [9]:
# let's see the downloaded data
#!cat data/alice28-1476.txt

In [10]:
#!cat data/wizoz10-1740.txt

We need to remove these additions to the original texts:

In [11]:
alice_file = os.path.join(HOME_DIR, "alice28-1476.txt")
with open(alice_file, "r") as f:
  # The slice below keeps lines 105 to 3704 of the file (zero-based 104:3704), where the actual text of the book is
  alice_text = "".join(f.readlines()[104:3704])

wizard_file = os.path.join(HOME_DIR, "wizoz10-1740.txt")
with open(wizard_file, "r") as f:
  # The slice below keeps lines 311 to 5100 of the file (zero-based 310:5100), where the actual text of the book is
  wizard_text = "".join(f.readlines()[310:5100])

In [12]:
print(alice_text[:500])
print("\n", "#"*70, "\n")
print(wizard_text[:500])

                ALICE'S ADVENTURES IN WONDERLAND

                          Lewis Carroll

               THE MILLENNIUM FULCRUM EDITION 2.8




                            CHAPTER I

                      Down the Rabbit-Hole


  Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `w

 ###################################################################### 

                    THE WONDERFUL WIZARD OF OZ


                          1.  The Cyclone


    Dorothy lived in the midst of the great Kansas prairies, with
Uncle Henry, who was a farmer, and Aunt Em, who was the farmer's
wife.  Their house was small, for the lumber to build it had to be
carried by wagon many miles.  There were four walls, a floor and a
roof, which made one room; and this room contained a rusty looking

We can partially automate the removal of the extra lines by setting the real start and end lines of each text in a configuration file.

In [13]:
text_cfg = """fname,start,end
alice28-1476.txt,104,3704
wizoz10-1740.txt,310,5100"""
# Use a context manager so the file is closed (and flushed) right away
with open(os.path.join(HOME_DIR, 'lines.cfg'), 'w') as f:
    bytes_written = f.write(text_cfg)

##Sentence Tokenization

A token is a piece of a text, and to tokenize a text means to split it into pieces; that is, into a list of tokens.

The most common kind of piece is a word, so tokenizing a text usually means splitting it into words, using white space as the separator.

In [14]:
sentence = "I'm following the white rabbit"
tokens = sentence.split(" ")
tokens

["I'm", 'following', 'the', 'white', 'rabbit']
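Splitting on white space leaves punctuation glued to the neighboring word, which is one reason real tokenizers go further. The sketch below contrasts `str.split` with a simple regular expression. Both the example sentence and the regex are illustrative assumptions; neither is the tokenizer used later in this notebook.

```python
import re

# Made-up example sentence (not taken from the dataset)
sentence = "Oh dear! I shall be late, said the Rabbit."

# Whitespace splitting keeps punctuation attached to words ("dear!", "late,")
whitespace_tokens = sentence.split(" ")

# A simple alternative: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

The regex version yields separate tokens for `!`, `,`, and `.`, so the same word is no longer counted as a different token just because punctuation follows it.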

Let's do sentence tokenization, which means to split a text into its sentences.

In [15]:
corpus_alice = sent_tokenize(alice_text)
corpus_wizard = sent_tokenize(wizard_text)

len(corpus_alice), len(corpus_wizard)

(1612, 2240)

Let’s check one sentence from the first corpus of text.

In [16]:
corpus_alice[2]

'There was nothing so VERY remarkable in that; nor did Alice\nthink it so VERY much out of the way to hear the Rabbit say to\nitself, `Oh dear!'

Let’s check one sentence from the second corpus of text.

In [17]:
corpus_wizard[30]

'"There\'s a cyclone coming, Em," he called to his wife.'

Our dataset is going to be a collection of CSV files, one file for each book, with each
CSV file containing one sentence per line.

Therefore, we need to:

* clean the line breaks to make sure each sentence is on one line only;
* define an appropriate quote char to "wrap" the sentence, so that the commas and semicolons in the original text do not get misinterpreted as separation chars of the CSV file; and
* add a second column to the CSV file to identify the original source of the sentence, since we’ll be concatenating and shuffling the sentences before training a model on our corpora.

The sentence above should end up looking like this:
```log
\"There's a cyclone coming, Em," he called to his wife.\,wizoz10-1740.txt
```
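As a quick sanity check of this format, the sketch below parses that line with Python's standard `csv` module, passing the backslash as the quote char. It is only an illustration of why the wrapping works; the notebook itself reads the files with `load_dataset` later on.

```python
import csv
import io

# The example line from above, written as a Python string literal
line = '\\"There\'s a cyclone coming, Em," he called to his wife.\\,wizoz10-1740.txt\n'

# The backslash is the quote char, so commas inside the sentence are not separators
reader = csv.reader(io.StringIO(line), quotechar="\\", delimiter=",")
sentence, source = next(reader)

print(sentence)
print(source)
```

The comma after "coming" and the double quotes both survive inside the first column, while the comma that follows the closing backslash is treated as the column separator.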

The function below does the grunt work of cleaning, splitting, and saving the
sentences to a CSV file for us:

In [18]:
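def sentence_tokenize(source, quote_char="\\", sep_char=",", include_header=True, include_source=True, extensions=("txt",), **kwargs):
  # If source is a folder, goes through all files inside it that match the desired extensions ('txt' by default)
  # Note: extensions must be a tuple, ("txt",), otherwise "in" performs a substring match on the string "txt"
  if os.path.isdir(source):
    filenames = [f for f in os.listdir(source) if os.path.isfile(os.path.join(source, f)) and os.path.splitext(f)[1][1:] in extensions]
  elif isinstance(source, str):
    # If source is a single file, splits it into folder and file name so the paths built below stay valid
    source, single_fname = os.path.split(source)
    filenames = [single_fname]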

  # If there is a configuration file, builds a dictionary with the corresponding start and end lines of each text file
  config_file = os.path.join(source, "lines.cfg")
  config = {}
  if os.path.exists(config_file):
    with open(config_file, "r") as f:
      rows = f.readlines()
    for r in rows[1:]:
      fname, start, end = r.strip().split(",")
      config.update({fname: (int(start), int(end))})

  new_fnames = []
  # For each file of text
  for fname in filenames:
    # If there's a start and end line for that file, use it
    try:
        start, end = config[fname]
    except KeyError:
        start = None
        end = None

    # Opens the file, slices the configured lines (if any),
    # cleans line breaks, and uses the sentence tokenizer
    with open(os.path.join(source, fname), 'r') as f:
        contents = (''.join(f.readlines()[slice(start, end, None)]).replace('\n', ' ').replace('\r', ''))
    corpus = sent_tokenize(contents, **kwargs)

    # Builds a CSV file containing tokenized sentences
    base = os.path.splitext(fname)[0]
    new_fname = f'{base}.sent.csv'
    new_fname = os.path.join(source, new_fname)
    with open(new_fname, 'w') as f:
        # Header of the file
        if include_header:
            if include_source:
                f.write('sentence,source\n')
            else:
                f.write('sentence\n')
        # Writes one line for each sentence
        for sentence in corpus:
            if include_source:
                f.write(f'{quote_char}{sentence}{quote_char}{sep_char}{fname}\n')
            else:
                f.write(f'{quote_char}{sentence}{quote_char}\n')
    new_fnames.append(new_fname)

  # Returns list of the newly generated CSV files
  return sorted(new_fnames)

In [19]:
new_fnames = sentence_tokenize(HOME_DIR)
new_fnames

['data/alice28-1476.sent.csv', 'data/wizoz10-1740.sent.csv']

##Pipeline

In [None]:
# Let’s load the GPT-2-based text generation pipeline:
text_generator = pipeline("text-generation")

Then, let’s use the first two paragraphs from Alice’s Adventures in Wonderland as
our base text:

In [21]:
base_text = """
Alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had peeped
into the book her sister was reading, but it had no pictures or
conversations in it, `and what is the use of a book,'thought Alice
`without pictures or conversation?' So she was considering in her
own mind (as well as she could, for the hot day made her feel very
sleepy and stupid), whether the pleasure of making a daisy-chain
would be worth the trouble of getting up and picking the daisies,
when suddenly a White Rabbit with pink eyes ran close by her.
"""

In [22]:
text_generator.model.config.task_specific_params

{'text-generation': {'do_sample': True, 'max_length': 50}}

In [23]:
result = text_generator(base_text, max_length=250)
print(result[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had peeped
into the book her sister was reading, but it had no pictures or
conversations in it, `and what is the use of a book,'thought Alice
`without pictures or conversation?' So she was considering in her
own mind (as well as she could, for the hot day made her feel very
sleepy and stupid), whether the pleasure of making a daisy-chain
would be worth the trouble of getting up and picking the daisies,
when suddenly a White Rabbit with pink eyes ran close by her.
The Rabbit, looking at her, asked the
-- "How did that White Rabbit get onto the daisies?",' said Alice
`and what is the use of a daisies if that white Rabbit isn't familiar with this
--
And what is the use of being so far down a little bit as a white dog with a
white nose, when a White Rabbit goes so far up a daisie?
Then the White Rabbit's eyes looked at her, and said, `You
se


By the way, if you try using greedy decoding instead (setting `do_sample=False`), the generated text simply and annoyingly repeats the same text over and over again:

In [24]:
result = text_generator(base_text, max_length=250, do_sample=False)
print(result[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had peeped
into the book her sister was reading, but it had no pictures or
conversations in it, `and what is the use of a book,'thought Alice
`without pictures or conversation?' So she was considering in her
own mind (as well as she could, for the hot day made her feel very
sleepy and stupid), whether the pleasure of making a daisy-chain
would be worth the trouble of getting up and picking the daisies,
when suddenly a White Rabbit with pink eyes ran close by her.
'Oh, my dear, I am so glad to see you!' said Alice, 'I am so glad to see you!'
'Oh, my dear, I am so glad to see you!' said Alice, 'I am so glad to see you!'
'Oh, my dear, I am so glad to see you!' said Alice, 'I am so glad to see you!'
'Oh, my dear, I am so glad to see you!' said Alice, 'I am so glad to see
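The repetition has a simple cause: greedy decoding is deterministic, so once the model returns to a context it has already seen, it makes the same choices again. The toy sketch below illustrates this with a made-up table of next-token probabilities that only looks at the previous token. That is a strong simplification (GPT-2 conditions on the whole preceding sequence), and every name and number in it is hypothetical.

```python
import random

# Hypothetical next-token probabilities for a three-word vocabulary
next_token_probs = {
    "glad": {"to": 0.6, "see": 0.3, "glad": 0.1},
    "to": {"see": 0.7, "glad": 0.2, "to": 0.1},
    "see": {"glad": 0.5, "to": 0.4, "see": 0.1},
}

def generate(start, steps, do_sample, seed=0):
    rng = random.Random(seed)
    tokens = [start]
    for _ in range(steps):
        probs = next_token_probs[tokens[-1]]
        if do_sample:
            # Sampling: draw the next token in proportion to its probability
            next_token = rng.choices(list(probs), weights=list(probs.values()))[0]
        else:
            # Greedy decoding: always take the most likely token
            next_token = max(probs, key=probs.get)
        tokens.append(next_token)
    return tokens

greedy = generate("glad", steps=9, do_sample=False)
sampled = generate("glad", steps=9, do_sample=True)
print(" ".join(greedy))
print(" ".join(sampled))
```

The greedy sequence cycles through "glad to see" indefinitely, while the sampled one can leave the cycle whenever a less likely token is drawn.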


Wait a minute! Aren’t we fine-tuning GPT-2 so it can write text in a
given style?



##Data Preparation

In order to capture the style of Lewis Carroll’s Alice’s Adventures in Wonderland, we
need to use a dataset containing sentences from that book alone.

In [None]:
dataset = load_dataset(path="csv", data_files=["data/alice28-1476.sent.csv"], quotechar="\\", split=Split.TRAIN)

shuffled_dataset = dataset.shuffle(seed=42)
split_dataset = shuffled_dataset.train_test_split(test_size=0.2, seed=42)

train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

In [None]:
# Next, we tokenize the dataset using GPT-2's pre-trained tokenizer
auto_tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [27]:
def tokenize(row):
  return auto_tokenizer(row["sentence"])

In [None]:
tokenized_train_dataset = train_dataset.map(
    tokenize,
    remove_columns=["source", "sentence"],
    batched=True
)

tokenized_test_dataset = test_dataset.map(
    tokenize,
    remove_columns=["source", "sentence"],
    batched=True
)

In [29]:
# without padding, the sentences have varied lengths
list(map(len, tokenized_train_dataset[0:6]["input_ids"]))

[9, 28, 20, 9, 34, 29]

##Packing Dataset

"Packing" is simple: we concatenate the tokenized inputs together and then chunk the result into blocks of a fixed size.

In [30]:
def group_texts(examples, block_size=128):
  # Concatenate all texts
  concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  total_length = len(concatenated_examples[list(examples.keys())[0]])

  # We drop the small remainder; we could pad it instead if the model
  # supported padding. You can customize this part to your needs.
  total_length = (total_length // block_size) * block_size
  # Split into chunks of block_size
  result = {
      k: [t[i: i + block_size] for i in range(0, total_length, block_size)]
      for k, t in concatenated_examples.items()
  }
  result["labels"] = result["input_ids"].copy()
  return result
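To see the effect on a small scale, the sketch below applies the same concatenate-and-chunk logic to three made-up sequences with a block size of 4. The helper `pack` is a hypothetical, simplified restatement of `group_texts` written for this illustration; the token IDs are arbitrary.

```python
from itertools import chain

def pack(examples, block_size):
    # Concatenate every list under each key into one long list
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block
    total_length = (total_length // block_size) * block_size
    packed = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are copies of the inputs
    packed["labels"] = [block[:] for block in packed["input_ids"]]
    return packed

# Three "sentences" of lengths 3, 5, and 2 (10 tokens in total)
toy = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
packed = pack(toy, block_size=4)
print(packed["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Ten tokens fill two blocks of four, so the last two tokens (9 and 10) are dropped, and sentence boundaries no longer line up with block boundaries.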

In [None]:
# We can apply the function above to our datasets
lm_train_dataset = tokenized_train_dataset.map(
    group_texts,
    batched=True
)

lm_test_dataset = tokenized_test_dataset.map(
    group_texts,
    batched=True
)

lm_train_dataset.set_format(type="torch")
lm_test_dataset.set_format(type="torch")

In [32]:
# Now, the first data point actually contains the first 128 tokens of our dataset
print(lm_train_dataset[0]["input_ids"])

tensor([   63,  2437,   466,   345,   760,   314,  1101,  8805,  8348,   464,
         2677,  3114,  7296,  6819,   379,   262,  2635, 25498,    11,   508,
          531,   287,   257,  1877,  3809,    11,  4600,  7120, 25788,  1276,
         3272,    12,  1069,  9862, 12680,  4973,  2637,  1537,   611,   314,
         1101,   407,   262,   976,    11,   262,  1306,  1808,   318,    11,
         5338,   287,   262,   995,   716,   314,    30,   464,   360,   579,
         1076,  6364,  4721,   465,  2951,    13,    63,  1026,   373,   881,
        21289,   272,   353,   379,  1363,  4032,  1807,  3595, 14862,    11,
         4600, 12518,   530,  2492,   470,  1464,  3957,  4025,   290,  4833,
           11,   290,   852,  6149,   546,   416, 10693,   290, 33043,    13,
         1870, 14862,   373,   523,   881, 24776,   326,   673,  4966,   572,
          379,  1752,   287,   262,  4571,   340,  6235,   284,    11,  1231,
         2111,   284,  4727,   262,  7457,   340,   550,   925])

Consequently, the datasets get smaller, since they do not contain sentences
anymore but sequences of 128 tokens instead:

In [33]:
len(lm_train_dataset), len(lm_test_dataset)

(239, 56)

##Model Training

GPT-2 is a model for causal language modeling, and that’s the AutoModel we use to load it.

In [None]:
gpt_model = AutoModelForCausalLM.from_pretrained("gpt2")
print(gpt_model.__class__)

In [36]:
# let's override the default trainer arguments
training_args = TrainingArguments(
    output_dir="output",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,  # evaluation mini-batches have size eight
    evaluation_strategy="steps",
    eval_steps=300,
    logging_steps=300,
    gradient_accumulation_steps=8,
    prediction_loss_only=True
)

Since GPT-2 is a generative model, we won’t be running any additional metrics during training or validation, and there’s no need for anything but the loss.

In [38]:
# let’s redefine the trainer
trainer = Trainer(
    model=gpt_model,
    args=training_args,
    train_dataset=lm_train_dataset,
    eval_dataset=lm_test_dataset
)

In [39]:
# There we go—we’re 100% ready to call the glorious train() method
trainer.train()

Step,Training Loss,Validation Loss


TrainOutput(global_step=29, training_loss=3.6358197968581627, metrics={'train_runtime': 700.8899, 'train_samples_per_second': 0.341, 'train_steps_per_second': 0.041, 'total_flos': 15154937856000.0, 'train_loss': 3.6358197968581627, 'epoch': 0.97})
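The `global_step=29` and `'epoch': 0.97` values in this output are consistent with the training arguments: a per-device batch size of 1 combined with 8 gradient accumulation steps behaves like a mini-batch of size eight. The arithmetic below is a sketch that assumes a single device and that incomplete accumulation cycles are dropped; the exact rounding may differ between versions of the library.

```python
# Values taken from the cells above; a single device is assumed
num_blocks = 239            # len(lm_train_dataset)
per_device_batch_size = 1
accumulation_steps = 8

# Gradients from eight size-1 batches are accumulated before each update
effective_batch_size = per_device_batch_size * accumulation_steps

# Batches per epoch, then full accumulation cycles per epoch
batches_per_epoch = num_blocks // per_device_batch_size
optimizer_steps = batches_per_epoch // accumulation_steps

# Fraction of the dataset consumed by those optimizer steps
epoch_fraction = optimizer_steps * effective_batch_size / num_blocks

print(effective_batch_size, optimizer_steps, round(epoch_fraction, 2))  # 8 29 0.97
```

With only 29 optimizer steps, the `eval_steps=300` and `logging_steps=300` thresholds are never reached, which is why the table of training and validation losses is empty.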

In [40]:
# let's check the final validation
trainer.evaluate()

{'eval_loss': 3.419328451156616,
 'eval_runtime': 69.9444,
 'eval_samples_per_second': 0.801,
 'eval_steps_per_second': 0.1,
 'epoch': 0.97}
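A loss value like this can be easier to interpret as a perplexity, which is the exponential of the average cross-entropy. The snippet below is a small sketch of that conversion, assuming `eval_loss` is the mean per-token cross-entropy in nats (the usual convention for causal language models in this library).

```python
import math

# Validation loss reported by trainer.evaluate() above
eval_loss = 3.419328451156616

# Perplexity is the exponential of the average cross-entropy
perplexity = math.exp(eval_loss)
print(f"{perplexity:.1f}")
```

Roughly speaking, a perplexity of about 30 means the model is, on average, as uncertain as if it were choosing uniformly among about 30 tokens at each position.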

##Generating Text

Let's assign our fine-tuned model and pretrained tokenizer to a pipeline and use most of its default values.

In [41]:
device_index = (gpt_model.device.index if gpt_model.device.type != "cpu" else -1)

gpt_gen = pipeline("text-generation", model=gpt_model, tokenizer=auto_tokenizer, device=device_index)

In [43]:
# The only parameter we may have to change is, once again, the max_length
result = gpt_gen(base_text, max_length=250)
print(result[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had peeped
into the book her sister was reading, but it had no pictures or
conversations in it, `and what is the use of a book,'thought Alice
`without pictures or conversation?' So she was considering in her
own mind (as well as she could, for the hot day made her feel very
sleepy and stupid), whether the pleasure of making a daisy-chain
would be worth the trouble of getting up and picking the daisies,
when suddenly a White Rabbit with pink eyes ran close by her.
`You!'thought Alice, `and I can't look at this!'
Alice said nothing with any trepidation, and as she went back to drawing up her
`What's more?'Alice had seen the Queen, when, as she was talking, she got a
little confused about the meaning of this sentence.`Well,'thought Alice, and, without waiting for any more words,
`just as soon as I was finished with this talk--`I suppose I must go to the
'I


In [44]:
result = gpt_gen(base_text, max_length=250)
print(result[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had peeped
into the book her sister was reading, but it had no pictures or
conversations in it, `and what is the use of a book,'thought Alice
`without pictures or conversation?' So she was considering in her
own mind (as well as she could, for the hot day made her feel very
sleepy and stupid), whether the pleasure of making a daisy-chain
would be worth the trouble of getting up and picking the daisies,
when suddenly a White Rabbit with pink eyes ran close by her.
`Did you get a White Rabbit!'said Alice: `I've only seen one before!'
So Alice began by making small changes with the stick, and gave her way up the steps towards the garden:


Alice was only a little surprised to be at the first steps, and in her own way noticed that the Rabbit-in-the-Band had started all the way out,
`I had just seen what the King and Queen said when she went upstairs the other day,'sai

I tried it out several times and, in my humble opinion, the output looks more "Alice-y" now.

What do you think?