<a href="https://colab.research.google.com/github/talitmr/text_generation_with_pretrained_model/blob/main/text_generation_with_gpt2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation with GPT-2

In this notebook, I explain the steps of text generation using a pre-trained transformer model, GPT-2, fine-tuned on Shakespeare's text.

In [None]:
# installing necessary modules (datasets is needed for load_dataset below)
!pip install transformers
!pip install datasets
!pip install torch
!pip install sentencepiece
!pip install pyyaml



* GPT2Tokenizer is based on byte-level Byte-Pair Encoding. It has been trained to treat spaces as part of the tokens (a bit like sentencepiece), so a word is encoded differently depending on whether it is at the beginning of the sentence (without a leading space) or not.
* GPT2LMHeadModel is the GPT-2 model with a language modeling head on top: a linear layer that maps the hidden states to vocabulary logits.
* HfArgumentParser is an argument parser designed to play well with the native argparse. In particular, you can add more (non-dataclass-backed) arguments to the parser after initialization, and they come back after parsing as an additional namespace.
* TrainingArguments holds the subset of the arguments from the Hugging Face example scripts that relate to the training loop itself.
* Trainer is a feature-complete training and evaluation loop for PyTorch.
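
To illustrate the first point about spaces, here is a minimal sketch of how a space ends up attached to the word that follows it. It uses a simplified, ASCII-only approximation of the GPT-2 pre-tokenization pattern with Python's `re` module. It is not the real tokenizer (there are no BPE merges and no byte-level mapping), and the names `PRETOKENIZE` and `pre_tokenize` are my own.

```python
import re

# Simplified, ASCII-only approximation of the GPT-2 pre-tokenization regex
# (assumption: the real pattern uses Unicode letter/number classes instead).
PRETOKENIZE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

def pre_tokenize(text):
    """Split text into pieces, keeping a leading space glued to each word."""
    return PRETOKENIZE.findall(text)

pieces = pre_tokenize("hello world, hello")
print(pieces)  # ['hello', ' world', ',', ' hello']

# The same word appears as two different strings, so it can map to
# different token ids depending on its position in the sentence.
print(pieces[0] == pieces[3])  # False
```

Because the first `hello` has no leading space and the second one does, the two pieces are different strings before any BPE merge is applied.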


In [None]:
# importing necessary modules

from transformers import GPT2Tokenizer, GPT2LMHeadModel, HfArgumentParser, TrainingArguments, Trainer, default_data_collator
from datasets import load_dataset

In [None]:
# getting the shakespeare data
shakespeare = load_dataset("tiny_shakespeare")

Using custom data configuration default
Reusing dataset tiny_shakespeare (/root/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e)


In [None]:
# initializing a GPT-2 model with a language modeling head
model = GPT2LMHeadModel.from_pretrained('gpt2')

# initializing the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [None]:
def tokenize_function(examples):
    output = tokenizer(examples['text'])
    return output

# tokenize dataset
tokenized_shakespeare = shakespeare.map(
    tokenize_function,
    batched=True,
    remove_columns=shakespeare["train"].column_names
)

Loading cached processed dataset at /root/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e/cache-24f9224521537a34.arrow


Token indices sequence length is longer than the specified maximum sequence length for this model (18066 > 1024). Running this sequence through the model will result in indexing errors





In [None]:
shakespeare['train']


Dataset({
    features: ['text'],
    num_rows: 1
})

In [None]:
tokenized_shakespeare['train']

Dataset({
    features: ['attention_mask', 'input_ids'],
    num_rows: 1
})

In [None]:
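# using the model's maximum sequence length (1024 for gpt2) as the block size
block_size = tokenizer.model_max_length

def new_token(examples):
    # concatenating all tokenized texts of the batch into one long list per key
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # dropping the remainder so that every block has exactly block_size tokens
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # for causal language modeling the labels are a copy of the inputs;
    # the model shifts them internally when computing the loss
    result["labels"] = result["input_ids"].copy()
    return result

# splitting the whole dataset into smaller parts of block_size length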
new_shakespeare = tokenized_shakespeare.map(
    new_token,
    batched=True
)
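
To make the blocking step easier to follow, here is a small pure-Python sketch that applies the same logic to a toy batch with a block size of 4 instead of 1024. The function name `group_into_blocks` and the toy token ids are made up for illustration; no tokenizer or dataset is involved.

```python
def group_into_blocks(examples, block_size):
    """Concatenate the token lists of a batch and cut them into equal blocks."""
    # One long list per key (input_ids, attention_mask, ...).
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }
    # Labels are a copy of the inputs, block by block.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result

# Two toy "documents" with 5 and 6 tokens (11 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]],
    "attention_mask": [[1] * 5, [1] * 6],
}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(blocks["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The last three tokens (9, 10, 11) do not fill a block and are dropped, which is the same thing that happens to the tail of each split in the real dataset.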

In [None]:
# defining training and evaluating datasets
train_dataset = new_shakespeare["train"]
eval_dataset = new_shakespeare["validation"]

In [None]:
training_args = TrainingArguments(output_dir = "output/", 
                                  per_device_train_batch_size=2, 
                                  num_train_epochs=3, 
                                  save_total_limit=1)

# setting the Trainer parameters
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=default_data_collator
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


I trained the model for only 3 epochs, since training takes a lot of time.

In [None]:
train_result = trainer.train()

***** Running training *****
  Num examples = 294
  Num Epochs = 3
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 441


Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)
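
The numbers in the training log can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, which matches the "Total train batch size" and "Gradient Accumulation steps" lines of the log, and a block size of 1024 tokens; the variable names are my own.

```python
import math

num_examples = 294   # "Num examples" from the log (blocks of 1024 tokens)
batch_size = 2       # per_device_train_batch_size on a single device
num_epochs = 3       # num_train_epochs
block_size = 1024    # assumed: the gpt2 maximum sequence length

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
tokens_per_epoch = num_examples * block_size

print(steps_per_epoch)   # 147
print(total_steps)       # 441, matches "Total optimization steps"
print(tokens_per_epoch)  # 301056
```

If the default logging interval of 500 steps was in effect, 441 steps would also explain why the Step/Training Loss table above has no rows.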




Below, I generate some text with the fine-tuned model, using top-p (nucleus) sampling with p = 0.8 and a maximum length of 200 tokens (the prompt included). I find the output really enjoyable, and judging by the generated text, training for only 3 epochs does not give bad results.
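
As a rough idea of what top-p sampling does at each step, here is a minimal pure-Python sketch: keep the most probable tokens until their cumulative probability reaches p, renormalize, and sample from that reduced set. The toy vocabulary, the probabilities, and the function name `top_p_filter` are invented for illustration, and this is not the code the library runs internally.

```python
import random

def top_p_filter(probs, top_p):
    """Return {index: renormalized probability} for the top-p nucleus."""
    # Visit token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

vocab = ["king", "queen", "duke", "dragon"]   # hypothetical next tokens
probs = [0.5, 0.25, 0.125, 0.125]             # hypothetical model output

nucleus = top_p_filter(probs, top_p=0.8)
print(sorted(nucleus))  # [0, 1, 2] -> "dragon" can no longer be sampled

# Sampling one token from the nucleus with a fixed seed.
rng = random.Random(0)
choice = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(vocab[choice])
```

A smaller p keeps fewer candidates and makes the text more conservative, while p = 1.0 would sample from the full distribution.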

In [None]:
# tokenizing the beginning of the sentence and moving it to the model's device
start = tokenizer.encode('Half blood prince is a genius although he is a slytherin', return_tensors='pt').to(model.device)

# generating a sample with top-p (nucleus) sampling
output = model.generate(
    start,
    do_sample=True,
    max_length=200,
    top_p=0.8
)

# printing the generated text
print(tokenizer.decode(output[0]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Half blood prince is a genius although he is a slytherin
That will not fly to his head, and henceforth he
Will go with me to my father's death.

KING RICHARD II:
The queen is an officer of death: let her be, and let me
Call the king to my father's house.

MUMBAI:
Sir, then, what shall I do for this?

KING RICHARD II:
Thou shalt not be made an enemy.

MUMBAI:
Go, marry me and I will kill thee with thee.

KING RICHARD II:
No, no.

MUMBAI:
I have said enough:
O, so thou dost make a king:
I say to him, he will kill thee with thy
head and thy finger.

KING RICHARD II:
Thou shalt not be made an enemy


In [None]:
# tokenizing the beginning of the sentence and moving it to the model's device
start = tokenizer.encode('Rawenclaws are most ', return_tensors='pt').to(model.device)

# generating a sample with top-p (nucleus) sampling
output = model.generate(
    start,
    do_sample=True,
    max_length=200,
    top_p=0.8
)

# printing the generated text
print(tokenizer.decode(output[0]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Rawenclaws are most ixant: the duke is a man of great humility; the rest are as he: but, as he would have, he hath a son of royal descent.

DUCHESS OF YORK:
Why, 'tis, and the rest;

KING RICHARD II:
Why, 'tis a pleasure to have your child,
And, if it be so, to have him raised up to be a good king:
You, King Henry's son, that shall have it,
Are the lords of York, but to the King of France:
'Tis a pretty son, and he will not be too young:
But I am a father, so I will be.

JOHN OF YORK:
Ay, and, I know, so far as I know, a true king.

DUCHESS OF YORK:
A king, what, he would have him raise up to
