Poem Generator<br>
Dartmouth COSC 72 Final Project<br>
Authors: Zhoucai Ni and Alex Kruger<br>
Emails:
zhoucai.ni.24@dartmouth.edu | alexander.j.kruger.23@dartmouth.edu<br>
Description: Uses Pytorch's GPT2 and a dataset of poems from The Poetry Foundation to train a neural network that generates new poetry based on a prompt

Install necessary packages

The following code draws inspiration from the week 6 homework on GPT-2 Training and from  https://scottmduda.medium.com/generating-an-edgar-allen-poe-styled-poem-using-gpt-2-289801ded82c

In [3]:
!pip install tokenizer
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tokenizer
  Downloading tokenizer-3.4.2-py2.py3-none-any.whl (79 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m79.1/79.1 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: tokenizer
Successfully installed tokenizer-3.4.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m73.9 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m236.8/236.8 kB[0m [31m31.4 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Do

Import important packages and tools

In [4]:
# General modules
import numpy as np
import pandas as pd 
import random
import os

# time related modules
import time
import datetime


# PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing, primarily developed by Facebook's AI Research lab.
import torch

# Transformers provides state-of-the-art general-purpose architectures for Natural Language Understanding (NLU) and Natural Language Generation (NLG).
# GPT2Tokenizer is used to tokenize the text data for GPT-2 model.
# GPT2LMHeadModel represents the GPT-2 model with a language modeling head.
# GPT2Config represents the configuration of the GPT-2 model.
# AdamW is a class representing the Adam optimizer with weight decay.
# get_linear_schedule_with_warmup creates a schedule with a learning rate that decreases linearly after linearly increasing during a warmup period.
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config, AdamW, get_linear_schedule_with_warmup

# Dataset represents a Python iterable over a dataset.
# random_split is a function that splits the dataset into non-overlapping new datasets of given lengths.
# DataLoader combines a dataset and a sampler and provides an iterable over the given dataset.
# RandomSampler samples elements randomly.
# SequentialSampler samples elements sequentially.
from torch.utils.data import Dataset, random_split, DataLoader, RandomSampler, SequentialSampler
import plotly.express as px




Set global variables

In [5]:
RAND_SEED = 73
BATCH_SIZE = 2
EPOCHS = 1
MAX_LEN = 1024

Import data from Poetry Foundation as a pandas dataframe<br>
data can be downloaded here: https://www.kaggle.com/datasets/tgdivy/poetry-foundation-poems

In [6]:
poem_stanza_df = pd.read_csv('PoetryFoundationData.csv')
poem_stanza_df = poem_stanza_df.fillna('')

# Uncomment to train a smaller subset for specifc Poets
# Bob_frost_df = poem_stanza_df[poem_stanza_df['Poet'] == 'Robert Frost']

# Bob_frost_df['Poem']

Prepare tokenizer

In [7]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
special_tokens_dict = {
    'bos_token': '<BOS>', 
    'eos_token': '<EOS>', 
    'pad_token': '<PAD>'}
num_added_tokens = tokenizer.add_special_tokens(special_tokens_dict)

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Use a class to better represent data as an object

In [8]:
class PoemDataset(Dataset):
    """
    Custom Dataset subclass.
    The dataset reads a list of strings (data), tokenizes them using a pre-specified tokenizer, and returns
    their corresponding input_ids and attention_masks as tensors.
    
    :param data: List of strings to tokenize.
    :param tokenizer: Tokenizer object to be used to tokenize data.
    :param gpt2_type: (Optional) Type of GPT-2 used.
    :param max_length: (Optional) Maximum length of the sequences.
    """
    def __init__(self, data, tokenizer, gpt2_type='gpt2', max_length=MAX_LEN):
        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []
        
        # Iterate over data, tokenize each sequence and append its input_id and attention_mask to respective lists
        for i in data:
            encodings_dict = tokenizer('<BOS>' + i + '<EOS>',
                                     truncation=True,
                                     max_length=max_length,
                                     padding='max_length')

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        """
        Returns the number of sequences in data.
        
        :return: number of sequences in data
        """
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        """
        Returns the input_id and attention_mask tensors of the sequence at the provided index.
        
        :param idx: index to access
        :return: tensors of input_id and attention_mask of the sequence at the provided index.
        """
        return self.input_ids[idx], self.attn_masks[idx]

    


In [9]:
poem_stanza_dataset = PoemDataset(poem_stanza_df['Poem'].values, tokenizer, max_length=MAX_LEN)

Split data into training and validation sets

In [10]:
def train_val_split(split, dataset):
    """
    Calculates the size of the training and validation datasets.
    
    :param split: Float representing the proportion of data to be used for training. 
    Should be between 0 and 1, where 0 means no data for training, and 1 means all data for training.
    :param dataset: The dataset to be split into training and validation sets.
    :return: The sizes of the training and validation datasets.
    """
     # calculate the size of the training dataset
    train_size = int(split * len(dataset)) 
    # the remaining data will be used for validation
    val_size = len(dataset) - train_size    
    return train_size, val_size


# Use the function defined above to split the PoemDataset into training and validation datasets
poem_stanza_train_size, poem_stanza_val_size = train_val_split(0.8, poem_stanza_dataset)

# Use PyTorch's random_split function to randomly split the PoemDataset into training and validation datasets
poem_stanza_train_dataset, poem_stanza_val_dataset = random_split(poem_stanza_dataset, 
                                                                  [poem_stanza_train_size, poem_stanza_val_size])


Use our random seed global variable to initialize the randomizer and PyTorch

In [11]:
torch.cuda.manual_seed_all(RAND_SEED)
random.seed(RAND_SEED)
np.random.seed(RAND_SEED)
torch.manual_seed(RAND_SEED)

<torch._C.Generator at 0x7fcf648fde30>

Apply the data loader to the training and validation sets

In [12]:
poem_stanza_train_dataloader = DataLoader(poem_stanza_train_dataset,
                              sampler=RandomSampler(poem_stanza_train_dataset),
                              batch_size=BATCH_SIZE)

poem_stanza_val_dataloader = DataLoader(poem_stanza_val_dataset,
                            sampler=SequentialSampler(poem_stanza_val_dataset),
                            batch_size=BATCH_SIZE)

Log time and initialize hyperparameters

In [13]:
# helper function for logging time
def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

# hyperparameters
learning_rate = 1e-3
eps = 1e-8
warmup_steps = 50
device = torch.device('cuda')

Configure and optimize the model

In [14]:
# Create a configuration object for a GPT2 model.
# The configuration specifies the model architecture including the vocabulary size and the maximum length of position embeddings.
# 'from_pretrained' method is used to initialize the configuration with the pretrained 'gpt2' model configuration.
# output_hidden_states is set to True, which means that the model will return all hidden states.
configuration = GPT2Config(vocab_size=len(tokenizer), n_positions=MAX_LEN).from_pretrained('gpt2', output_hidden_states=True)

# Initialize a GPT2LMHeadModel with the above configuration.
# GPT2LMHeadModel is the GPT2 model with a language modeling head, i.e., a linear layer on top of the hidden states output.
# 'from_pretrained' method is used to initialize the model with the pretrained 'gpt2' weights.
poem_stanza_model = GPT2LMHeadModel.from_pretrained('gpt2', config=configuration)

# Resize the token embeddings of the model in case the current size doesn't match with the provided tokenizer's vocabulary size.
# This is necessary when you've added some special tokens or used a different tokenizer than the one originally used to train 'gpt2'.
poem_stanza_model.resize_token_embeddings(len(tokenizer))

# Move the model to GPU for faster computations.
poem_stanza_model.cuda()

# Initialize the AdamW optimizer, which is an Adam optimizer with weight decay.
# The parameters of the model will be updated by this optimizer during the training.
optimizer = AdamW(poem_stanza_model.parameters(), lr=learning_rate, eps=eps)

# Compute the total number of training steps.
# This is used by the learning rate scheduler.
total_steps = len(poem_stanza_train_dataloader) * EPOCHS

# Initialize the learning rate scheduler.
# We use a scheduler that linearly decreases the learning rate from the maximum value to 0, after a warmup period during which it linearly increases.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=warmup_steps,
                                            num_training_steps=total_steps)


# Move the model to the specific device (GPU/CPU).
poem_stanza_model = poem_stanza_model.to(device)


Downloading pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]



Train the Model

In [30]:
# Initialize an empty list to hold the losses during training
losses = []
valid_losses =[]
start_time = time.time()

for epoch_i in range(0, EPOCHS):

    print(f'Epoch {epoch_i + 1} of {EPOCHS}')

    t0 = time.time()
    
    # Reset the total training loss for this epoch
    total_train_loss = 0

    poem_stanza_model.train()

    # Loop over each batch from the training data loader
    for step, batch in enumerate(poem_stanza_train_dataloader):

        # Move the input ids, labels and masks to the GPU
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        # Clear out the gradients from the previous training step
        poem_stanza_model.zero_grad()        

        # Forward pass: compute the outputs of the model by passing in the input
        outputs = poem_stanza_model(b_input_ids, labels=b_labels, attention_mask=b_masks, token_type_ids=None)

        # Extract the loss from the outputs
        loss = outputs[0]  

        # Extract and accumulate the total loss
        batch_loss = loss.item()
        total_train_loss += batch_loss

        # Perform a backward pass to calculate gradients
        loss.backward()

        # Update parameters
        optimizer.step()

        # Update the learning rate
        scheduler.step()

        losses.append(loss.item())
        

        if step % 50 == 0:
            print(f"Step: {step}, Loss: {loss.item():.4f}")

        # Break the loop after 1000 steps.
        if step > 1000:
            break

    # Calculate the average training loss for this epoch
    avg_train_loss = total_train_loss / len(poem_stanza_train_dataloader)       

    # Calculate the time spent on this epoch
    training_time = format_time(time.time() - t0)

    # Print the average training loss and time spent on this epoch
    print(f'Average Training Loss: {avg_train_loss}. Epoch Training Time: {training_time}')

    # Set the model to 'eval' mode. This is important when the model has layers like dropout, batchnorm etc. which behave differently during training and evaluation.
    poem_stanza_model.eval()

    # Reset the total validation loss
    total_eval_loss = 0
    nb_eval_steps = 0

    # Loop over each batch from the validation data loader
    for batch in poem_stanza_val_dataloader:
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        # We don't need to track gradients for validation, so wrap in no_grad to save memory
        with torch.no_grad():        

            # Forward pass
            outputs  = poem_stanza_model(b_input_ids, attention_mask=b_masks, labels=b_labels)

            loss = outputs[0]  

        # Accumulate the validation loss
        batch_loss = loss.item()
        total_eval_loss += batch_loss     
        valid_losses.append(batch_loss)   

    # Calculate the average validation loss for this epoch
    avg_val_loss = total_eval_loss / len(poem_stanza_val_dataloader)

    print(f'Average Validation Loss: {avg_val_loss}')

print(f'Total Training Time: {format_time(time.time()-start_time)}')


Epoch 1 of 1
Step: 0, Loss: 1.9971
Step: 50, Loss: 0.8168
Step: 100, Loss: 2.1951
Step: 150, Loss: 1.9830
Step: 200, Loss: 0.8496
Step: 250, Loss: 1.0005
Step: 300, Loss: 1.1123
Step: 350, Loss: 2.7396
Step: 400, Loss: 2.1484
Step: 450, Loss: 1.1395
Step: 500, Loss: 0.4966
Step: 550, Loss: 1.8984
Step: 600, Loss: 1.3414
Step: 650, Loss: 1.2759
Step: 700, Loss: 0.7070
Step: 750, Loss: 2.0921
Step: 800, Loss: 1.5292
Step: 850, Loss: 0.6120
Step: 900, Loss: 0.5833
Step: 950, Loss: 1.7738
Step: 1000, Loss: 3.5659
Average Training Loss: 0.23474663062587794. Epoch Training Time: 0:11:47
Average Validation Loss: 1.2531718199731297
Total Training Time: 0:17:17


Loss Curve plot

In [33]:
# Create an array for the x-axis
x = np.arange(len(losses)) * (MAX_LEN * BATCH_SIZE)

# Plot the line chart
px.line(y=losses, x=np.arange(len(losses))*(MAX_LEN * BATCH_SIZE), labels={"y":"Loss", "x":"Tokens"}, title="Training curve for my tiny demo model!")



Save the trained model

In [17]:
torch.save(poem_stanza_model.state_dict(),  'poem_stanza_model.pth')

Use the newly trained model to generate poems

In [34]:
# text = "I love my dog"
# input_ids = tokenizer.encode(text, return_tensors='tf')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

poem_stanza_model = poem_stanza_model.to(device)

# create text generation seed promp
prompt = "<BOS> I miss home"
generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
generated = generated.to(device)

sample_outputs = poem_stanza_model.generate(
                                generated, 
                                do_sample=True,   
                                top_k=50, 
                                max_length=MAX_LEN,
                                top_p=0.95, 
                                num_return_sequences=3
                                )

# print(tokenizer.decode(sample_outputs[0], skip_special_tokens=True))

for i, sample_output in enumerate(sample_outputs):
    print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0: I miss home,
but I miss you too,
like the last week
until you start a new life
from my blood—
I miss you too,
like the last day you left
until I go home to my bed.
I don’t love you,
but I love you even now.
I don’t love you with a smile
as much as I do that day.
I love you with the smell of your milk
and its scent
on your lips, in all your smiles,
as if I were your lover.
I love you with the sounds you make
on your cheeks, laughing in them as though there were no words.
I love you with the sound of your music,
talking to you like an animal, saying,
and it has just got to stop. — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —