# Model Building

sources:

- https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
- https://towardsdatascience.com/generating-haiku-with-deep-learning-dbf5d18b4246
- https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/

In [2]:
import os
import sys
import pickle
import torch
import torch.nn
import torch.optim
import torch.utils.data

In [3]:
"""Google Drive"""
from google.colab import drive
drive.mount('/content/gdrive/')
root_path = '/content/gdrive/My Drive/Virginia Tech/graduate/courses/2021_spring/ece_5424/assignments/project_ece_5424/'
dataset_path = './dataset'
store_file = os.path.join(root_path,dataset_path,'lyrics.pickle')

Mounted at /content/gdrive/


In [4]:
"""Offline Usage"""
# dataset_path = '../../dataset'
# store_file = os.path.join(dataset_path,'lyrics.pickle')

'Offline Usage'

## Construct Dataset class

We construct a PyTorch dataset class which aids in loading the pre-processed song lyrics.

The pre-processed lyrics come with the following attributes:

- `index2token`: Maps token integer ID to actual token
- `token2index`: Maps token to integer ID
- `counts`: Token frequencies
- `corpus`: List of tokenized lyrics for each sentence
- `vectors`: List of token ID tensors for each sentence
- `syllables`: Syllable count for each sentence
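To make this layout concrete, here is a minimal sketch of a toy store with the same keys. Every value is made up for illustration and is not taken from the real `lyrics.pickle`. In particular, the use of a list for `index2token`, a `Counter` for `counts`, and ID 0 as a padding token are assumptions.

```python
from collections import Counter

# Hypothetical toy vocabulary (illustrative only; ID 0 as padding is an assumption).
index2token = ['<pad>', '<sos>', '<eos>', 'amazing', 'grace']
token2index = {token: index for index, token in enumerate(index2token)}

# One tokenized sentence and its integer-ID form.
corpus = [['<sos>', 'amazing', 'grace', '<eos>']]
vectors = [[token2index[token] for token in sentence] for sentence in corpus]
counts = Counter(token for sentence in corpus for token in sentence)
syllables = [4]  # Illustrative syllable count for the sentence above.

store = {
    'index2token': index2token,
    'token2index': token2index,
    'counts': counts,
    'corpus': corpus,
    'vectors': vectors,
    'syllables': syllables,
}

# The two mappings are inverses, so IDs decode back to the original tokens.
decoded = [index2token[index] for index in store['vectors'][0]]
print(store['vectors'][0])  # [1, 3, 4, 2]
print(decoded)
```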

In [5]:
class WorshipLyricDataset(torch.utils.data.Dataset):
    """Worship song dataset from Genius.
    """

    def __init__(self, path: str):

        # Load the pre-processed pickle file.
        with open(path, 'rb') as fp:
            store = pickle.load(fp)
        
        # Unpack the pickle.
        self.index2token = store['index2token']
        self.token2index = store['token2index']
        self.counts = store['counts']
        self.corpus = store['corpus']
        self.vectors = [torch.LongTensor(vec) for vec in store['vectors']]
        self.syllables = torch.nn.functional.one_hot(torch.LongTensor(store['syllables'])) # One-hot encoded syllable counts.

    def __len__(self):
        return len(self.vectors)

    def __getitem__(self, idx):
        return (self.vectors[idx], self.syllables[idx],)

In [6]:
# Construct the data object.
dataset = WorshipLyricDataset(path=store_file)

To support training with variable-length sentences we must pad the input sentences on a per-batch basis. To do this in conjunction with a data loader, we define a "collate" function which pads the sentences within each batch.

In [7]:
def pad_collate(batch):
    """Pad batches from dataloader.

    This allows for more efficient padding,
    by only padding within each batch.
    """
    sentences, syllables = zip(*batch)
    sen_lens = torch.LongTensor([len(vec) for vec in sentences])
    sen_pad = torch.nn.utils.rnn.pad_sequence(sentences, batch_first=True, padding_value=0)
    syllables = torch.stack(syllables) # Convert tuple of tensors to single 2D tensor.
    syllables = syllables.reshape(syllables.size(0),1,syllables.size(1)) # Convert to 3D.
    syllables = syllables.repeat_interleave(sen_pad.size(1), dim=1) # Duplicate syllable count for every word in each sentence.
    return (sen_pad,syllables,sen_lens,)
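As a quick check of the shapes this kind of collation produces, the sketch below repeats the same padding and repetition steps on two hand-written sentences. The token IDs and one-hot syllable vectors are hypothetical values chosen only for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two hypothetical sentences of different lengths (token IDs are made up).
sentences = [torch.tensor([1, 5, 6, 2]), torch.tensor([1, 7, 2])]
# One hypothetical one-hot syllable vector per sentence.
syllables = [torch.tensor([0, 0, 1]), torch.tensor([0, 1, 0])]

# True lengths before padding.
sen_lens = torch.tensor([len(vec) for vec in sentences])

# Pad only up to the longest sentence in this batch.
sen_pad = pad_sequence(sentences, batch_first=True, padding_value=0)

# Stack to (batch, classes), add a time axis, then repeat along it.
syl = torch.stack(syllables).unsqueeze(1)
syl = syl.repeat_interleave(sen_pad.size(1), dim=1)

print(sen_pad.tolist())   # [[1, 5, 6, 2], [1, 7, 2, 0]]
print(tuple(syl.shape))   # (2, 4, 3)
print(sen_lens.tolist())  # [4, 3]
```

Because padding happens per batch, a batch of short sentences is never padded out to the longest sentence in the whole dataset.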

With the dataset and collation function defined we can now construct a `dataloader`, which will allow us to iterate over the lyrics dataset in batches. All training will be done using this loader object.

In [8]:
# Construct data loader.
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2, collate_fn=pad_collate)

## Decoder model

We define an RNN decoder architecture, called `SentenceRNN`, to generate next-word predictions for religious music lyrics.

This architecture was adapted from the wonderful PyTorch tutorial ["NLP From Scratch: Translation with a Sequence to Sequence Network and Attention"](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html).

The anatomy of our model consists of:

1. Embedding input layer
1. Unidirectional single-layer LSTM
1. Fully-connected output layer

The model supports variable-length sequences: inputs must be **zero-padded**, and the model then **packs** the padded input so that the internal LSTM layers ignore the padding tokens.

Each forward pass returns the **current token prediction** (from the fully-connected layer), the **lengths of the sequence predictions** (required for packed padded input sequences), and the **LSTM output hidden and cell states**. Returning the states allows them to be fed back into subsequent forward passes to retain sequence memory.

In [9]:
class SentenceRNN(torch.nn.Module):
    
    def __init__(self, n_hidden: int, n_vocab: int, n_layers: int, dropout: float = 0., bidirectional: bool = False):
        super().__init__()

        self.n_hidden = n_hidden
        self.n_vocab = n_vocab
        self.n_layers = n_layers
        self.bidirectional = bidirectional
        self.n_dir = 2 if bidirectional else 1

        # Embedding layer.
        self.embed = torch.nn.Embedding(
            num_embeddings=n_vocab,
            embedding_dim=n_hidden,
        )

        # LSTM layer.
        self.lstm = torch.nn.LSTM(
            input_size=n_hidden,
            hidden_size=n_hidden,
            num_layers=n_layers,
            dropout=dropout,
            bidirectional=bidirectional,
            batch_first=True,
            )

        # Word mapping fully-connected layer.
        # A bidirectional LSTM emits 2*n_hidden features per step, so scale by n_dir.
        self.fc = torch.nn.Linear(in_features=n_hidden*self.n_dir, out_features=n_vocab)
    

    def forward(self, sentences: torch.Tensor, lens: torch.Tensor, hidden: torch.Tensor, cell: torch.Tensor):
        """
        Args:
            sentences (torch.Tensor): Sentence word vectors.
            lens (torch.Tensor): True lengths of padded sentence vectors.
            hidden (torch.Tensor): Hidden state vector.
            cell (torch.Tensor): Cell state vector.
        """
        
        # Embed the sentence vectors as floating-point.
        #
        # inputs: (batch_size, sentence_length,)
        sentences_embed = self.embed(sentences)
        # embedded: (batch_size, sentence_length, embed_dim,)

        # Pack the embedding so that the paddings are ignored.
        sentences_embed_packed = torch.nn.utils.rnn.pack_padded_sequence(
            input=sentences_embed,
            lengths=lens, 
            batch_first=True,
            enforce_sorted=False,
            )

        # Run the packed sequence through the LSTM, starting from the given hidden and cell states.
        output_packed, (hidden, cell) = self.lstm(sentences_embed_packed, (hidden,cell,))

        # Get padded output
        output_padded, output_lens = torch.nn.utils.rnn.pad_packed_sequence(output_packed, batch_first=True)

        # Obtain token-level classification.
        output_padded_fc = self.fc(output_padded)

        # Return the token predictions, the output lengths, and the updated states.
        return output_padded_fc, output_lens, (hidden, cell,)

    def init_hc(self, batch_size: int, device: str = 'cpu') -> torch.Tensor:
        """Helper to zero-initialize a hidden or cell state tensor."""
        return torch.zeros((self.n_layers*self.n_dir, batch_size, self.n_hidden), device=device)
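The effect of packing can be seen in isolation with a small sketch. It uses a standalone `torch.nn.LSTM` with arbitrary small sizes and random inputs standing in for embedded sentences; none of these values come from the trained model.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = torch.nn.LSTM(input_size=4, hidden_size=4, batch_first=True)

# Two "embedded" sentences padded to length 3; the second has true length 1.
x = torch.randn(2, 3, 4)
lens = torch.tensor([3, 1])  # Lengths are kept on the CPU.

# Pack so the LSTM skips the padded steps, then unpack to a padded tensor.
packed = pack_padded_sequence(x, lens, batch_first=True, enforce_sorted=False)
out_packed, (h, c) = lstm(packed)
out, out_lens = pad_packed_sequence(out_packed, batch_first=True)

print(tuple(out.shape))   # (2, 3, 4)
print(out_lens.tolist())  # [3, 1]

# Padded positions come back as zeros, and the final hidden state of the
# short sentence is its output at the last real step (step 0).
print(bool(torch.all(out[1, 1:] == 0)))
print(bool(torch.allclose(h[0, 1], out[1, 0])))
```

Without packing, the LSTM would keep updating its state on the padding tokens, so the final hidden state of a short sentence would depend on how much padding its batch happened to need.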

## Train

Here we define some helper classes and functions for timing the training rounds.

In [10]:
import time
from contextlib import contextmanager

class timecontext:
    """Elapsed time context manager."""
    def __enter__(self):
        self.seconds = time.time()
        return self
    
    def __exit__(self, type, value, traceback):
        self.seconds = time.time() - self.seconds

@contextmanager
def timecontextprint(description='Elapsed time'):
    """Context manager to print elapsed time from call."""
    with timecontext() as t:
        yield t
    print(f"{description}: {t.seconds} seconds")

Training begins each batch by initializing the LSTM hidden and cell states to zero. So that every sentence starts from the same context, the first token fed to the model is always the start-of-sentence (SOS) token.

To improve the next-token predictions we apply a technique known as "teacher forcing": at each step, the known target token is passed as the input to the next decoder step. The decoder then learns from the correct preceding token rather than from its own, possibly wrong, prediction. To help generalization we apply teacher forcing randomly per batch, with probability `teacher_force_ratio` (0.5 by default); the remaining batches feed the model's own predictions back in.
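The difference between the two modes is easiest to see with a toy stand-in for the decoder. In the sketch below, `toy_step` and `run_decoder` are hypothetical helpers, and the "model" is just a lookup table that makes a deliberate mistake; only the choice of the next input mirrors the technique described above.

```python
import random

def toy_step(token):
    """Hypothetical stand-in for the decoder: a fixed next-token lookup."""
    table = {'<sos>': 'your', 'your': 'love', 'love': 'love'}
    return table.get(token, 'love')

def run_decoder(step_fn, target, teacher_forcing):
    """Return the tokens fed to the decoder at each step."""
    fed = []
    current = target[0]  # Always start from the SOS token.
    for t in range(1, len(target)):
        fed.append(current)
        prediction = step_fn(current)
        # Teacher forcing feeds the known target; otherwise feed the prediction.
        current = target[t] if teacher_forcing else prediction
    return fed

target = ['<sos>', 'your', 'grace', 'endures', '<eos>']
forced = run_decoder(toy_step, target, teacher_forcing=True)
free = run_decoder(toy_step, target, teacher_forcing=False)
print(forced)  # ['<sos>', 'your', 'grace', 'endures']
print(free)    # ['<sos>', 'your', 'love', 'love']

# Per-batch coin flip with the default ratio.
teacher_force_ratio = 0.5
use_teacher_forcing = random.random() < teacher_force_ratio
```

With teacher forcing the decoder always sees the true previous token, so one wrong prediction (`love` instead of `grace`) does not derail the rest of the sequence. Without it, the mistake propagates into every later input.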

In [11]:
import random
def train(decoder, loader, epochs, optimizer_decoder, criterion, device='cpu', teacher_force_ratio=0.5):
    decoder.to(device)
    decoder.train()
    for e in range(epochs):
        running_loss = 0.0
        for b,batch in enumerate(loader):
            sentences,syllables,sen_lens = batch
            sentences = sentences.to(device)
            syllables = syllables.to(device)
            sen_lens = sen_lens.to(device)

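            # Initialize hidden and cell states, sized to the actual batch
            # (the final batch may hold fewer sentences than the loader's batch size).
            decoder_hidden = decoder.init_hc(sentences.size(0), device=device)
            decoder_cell = decoder.init_hc(sentences.size(0), device=device)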

            # Setup initial decoder inputs.
            SOS_token = dataset.token2index['<sos>']
            decoder_input = SOS_token*torch.ones((sentences.size(0), 1,), dtype=torch.long, device=device)
            decoder_input_lens = torch.ones((sentences.size(0),), dtype=torch.long)

            # Initialize batch loss to zero.
            loss = 0

            # Determine if teacher-forcing should be used for this batch.
            use_teacher_forcing = random.random() < teacher_force_ratio

            # Teacher forcing.
            # Feed the target as the next input.
            if use_teacher_forcing:
                for target_idx in range(1, sentences.size(1)):
                    outputs, out_lens, (decoder_hidden, decoder_cell) = decoder(decoder_input, decoder_input_lens, decoder_hidden, decoder_cell)

                    # Reshape outputs and targets to fit inside the criterion.
                    outputs = outputs.squeeze(dim=1)
                    targets = sentences[:,target_idx].reshape(sentences.size(0),-1)

                    # Calculate batch loss.
                    loss += criterion(
                        outputs,
                        targets.squeeze(dim=1),
                    )

                    # For teacher forcing set the input of the
                    # next round to be the current target.
                    decoder_input = targets.detach()

            # No teacher forcing.
            # Feed the RNN predictions as the next input.
            else:
                for target_idx in range(1, sentences.size(1)):
                    outputs, out_lens, (decoder_hidden, decoder_cell) = decoder(decoder_input, decoder_input_lens, decoder_hidden, decoder_cell)

                    # Get best prediction.
                    topv, topi = outputs.topk(1)

                    # Reshape outputs and targets to fit inside the criterion.
                    outputs = outputs.squeeze(dim=1)
                    targets = sentences[:,target_idx].reshape(sentences.size(0),-1)

                    # Calculate batch loss.
                    loss += criterion(
                        outputs,
                        targets.squeeze(dim=1),
                    )

                    # Set the input of the next round to be the current prediction.
                    decoder_input = topi.squeeze(dim=-1).detach()

            # Back-propagate, and step the optimizers.
            loss.backward()
            optimizer_decoder.step()
            
            # Zero the gradients
            optimizer_decoder.zero_grad()

            # Accumulate the loss for this epoch.
            running_loss += loss.item()

        # Report epoch results.
        print(f'[epoch {e}]: loss {running_loss}')

In [12]:
# Length of vocabulary.
n_words = len(dataset.index2token)

# Decoder.
decoder = SentenceRNN(
    n_vocab=n_words,
    n_hidden=256,
    n_layers=1,
)

To speed up training, PyTorch allows us to leverage a GPU, using CUDA, if one is available. Since training an RNN can be computationally intensive we prefer to use a GPU for speed, but will fall back to the CPU if necessary.

In [13]:
# Set runtime device.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

Device: cuda


With the model defined we can now train it on the lyrics dataset.

We define two training hyperparameters, `epochs` and `lr`: the number of passes over the dataset and the optimizer learning rate, respectively.

To speed up subsequent runs, we also save the trained model to a file. This allows us to train the model once, and then simply load the pre-trained model (when `load_from_file = True` and the stored file exists) if the Jupyter notebook is run multiple times during a single session.

In [14]:
# Define path to decoder model storage.
load_from_file = True
decoder_store = os.path.join(root_path, dataset_path, 'decoder.pt')

# Load model from store file.
if load_from_file and os.path.exists(decoder_store):
    decoder.load_state_dict(torch.load(decoder_store, map_location=device))

else:
    # Learning parameters.
    epochs = 10
    lr = 1e-3
    print(f'Training decoder model: epochs={epochs}, lr={lr}, batches={len(dataloader)}')

    # Train the model.
    # Display training time too.
    with timecontextprint():
        optim_decoder = torch.optim.Adam(decoder.parameters(), lr=lr)
        criterion = torch.nn.CrossEntropyLoss(reduction='mean')
        train(decoder,
            loader=dataloader,
            epochs=epochs,
            optimizer_decoder=optim_decoder,
            criterion=criterion,
            device=device,
        )

    # Store model state to file (decoder_store was defined at the top of this cell).
    torch.save(decoder.state_dict(), decoder_store)
    print(f'Saved decoder model: {decoder_store}')

Training decoder model: epochs=10, lr=0.001, batches=898
[epoch 0]: loss 35091.922761917114
[epoch 1]: loss 32093.397901535034
[epoch 2]: loss 31068.27512550354
[epoch 3]: loss 30753.926628112793
[epoch 4]: loss 30132.63254737854
[epoch 5]: loss 29897.695526123047
[epoch 6]: loss 29800.166801452637
[epoch 7]: loss 29300.294973373413
[epoch 8]: loss 28935.599615097046
[epoch 9]: loss 28534.963705062866
Elapsed time: 226.15394949913025 seconds
Saved decoder model: /content/gdrive/My Drive/Virginia Tech/graduate/courses/2021_spring/ece_5424/assignments/project_ece_5424/./dataset/decoder.pt
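The reported epoch loss is a running sum over all batches, and each batch contributes the sum, over decoding steps, of the mean cross-entropy at that step. Dividing by the batch count gives a rough per-batch figure that is easier to compare across runs. The sketch below only re-uses numbers printed in the log above.

```python
# Values copied from the training log above.
batches = 898
epoch_losses = {0: 35091.922761917114, 9: 28534.963705062866}

# Average summed-over-steps loss per batch for the first and last epoch.
per_batch = {epoch: loss / batches for epoch, loss in epoch_losses.items()}
for epoch, value in per_batch.items():
    print(f'[epoch {epoch}]: loss per batch {value:.2f}')
# [epoch 0]: loss per batch 39.08
# [epoch 9]: loss per batch 31.78
```

Since this figure still sums over decoding steps, it depends on the padded sentence length of each batch and is not a per-token loss.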


## Evaluate

Now that the model has been trained we can use it to generate song lyrics.

We define a helper function to evaluate the decoder model. Evaluation closely mirrors training, with two differences. First, at each step we sample the next token from the softmax probability distribution over the vocabulary instead of always taking the most likely token. Second, we do not employ teacher forcing, since the true next token is unknown at generation time.

The evaluation process continuously loops through next-token predictions until either an end-of-sentence (EOS) token or a maximum decoded token length is reached.

In [15]:
import numpy as np
from typing import List

def evaluate(decoder, seed='<sos>', max_length=None, device='cpu') -> List[str]:
    """Generate a sequence of tokens using the decoder and a starting seed."""
    decoder.to(device)
    decoder.eval()
    with torch.no_grad():

        # Initialize hidden output.
        decoder_hidden = decoder.init_hc(1, device=device)
        decoder_cell = decoder.init_hc(1, device=device)

        # Setup initial decoder inputs.
        EOS_index = dataset.token2index['<eos>']
        SOS_index = dataset.token2index['<sos>']
        seed_token = dataset.token2index.get(seed, SOS_index)
        decoder_input = seed_token*torch.ones((1, 1,), dtype=torch.long, device=device)
        decoder_input_lens = torch.ones((1,), dtype=torch.long)

        # Initialize the decoded tokens with the seed token (SOS by default).
        decoded_tokens = [seed]

        # Loop indefinitely until token list is generated.
        while True:

            # Run current inputs, hidden and cell states through the decoder.
            outputs, out_lens, (decoder_hidden, decoder_cell) = decoder(decoder_input, decoder_input_lens, decoder_hidden, decoder_cell)

            # Convert the logits to a probability distribution over the vocabulary.
            # Cast to float64 and renormalize so the probabilities sum to 1
            # within the tolerance that np.random.choice requires.
            probs = torch.nn.functional.softmax(outputs, dim=2).squeeze().cpu().numpy().astype(np.float64)
            probs = probs / probs.sum()

            # Sample the next token from the probability distribution above.
            topi = torch.tensor(np.random.choice(decoder.n_vocab, 1, p=probs, replace=False), dtype=torch.long, device=device).view(1,1,1)

            # Add the current token to the decoded list.
            decoded_tokens.append(dataset.index2token[topi.item()])

            # Stop if current token is EOS or maximum sentence length has been reached.
            if (topi.item() == EOS_index) or (max_length and len(decoded_tokens) == max_length):
                break

            # Set the input of the next round to be the current prediction.
            decoder_input = topi.squeeze(dim=-1).detach()

        return decoded_tokens
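To illustrate why generation samples from the distribution rather than always taking the top token, here is a minimal pure-Python sketch. The four-word vocabulary and the logits are hypothetical, and `softmax` is a local helper written for this example.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical vocabulary and decoder logits for a single step.
vocab = ['<eos>', 'grace', 'love', 'mercy']
logits = [0.1, 2.0, 1.5, 0.3]
probs = softmax(logits)

# Greedy decoding always returns the same token for the same state.
greedy = vocab[probs.index(max(probs))]
print(greedy)  # grace

# Sampling can return any token, weighted by its probability.
rng = random.Random(0)
sampled = [rng.choices(vocab, weights=probs, k=1)[0] for _ in range(5)]
print(sampled)
```

Greedy decoding from the same seed would produce the same line every time, whereas sampling yields varied lines at the cost of occasionally picking unlikely words.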

We also define a helper function that generates multi-line song lyrics from a line count, an optional maximum sentence length, and an optional list of seed tokens (one per line). It generates a sequence of tokens for each line of the song and then joins all lines into a single newline-delimited string.

In [16]:
def generate_song(num_lines: int, max_length: int = None, seeds: list = None) -> str:
    """Helper to generate song lyrics given constraints."""

    # Build list of line seeds if none were provided.
    seed = '<sos>'
    if not seeds:
        seeds = [seed]*num_lines

    # Generate predictions for each line.
    lines = []
    for i in range(num_lines):
        tokens = evaluate(decoder, max_length=max_length, seed=seeds[i], device=device)
        if tokens[0] != '<sos>': tokens = ['<sos>'] + tokens
        lines.append(' '.join(tokens))

    # Join lines together and return as single string.
    return '\n'.join(lines)

In [17]:
print(generate_song(num_lines=3, max_length=10))

<sos> are my help in my affection <eos>
<sos> me that is starting <eos>
<sos> don t still you why i do not deserve


In [18]:
print(generate_song(num_lines=3, seeds=['god', 'we', 'they']))

<sos> god wake deeper deep else <eos>
<sos> we exalt dwelling faithfulness watch generations win <eos>
<sos> they can take strangers home away <eos>


In [23]:
print(generate_song(num_lines=3, seeds=['we', 'give', 'you',]))

<sos> we bear dancing <eos>
<sos> give create disease buried there hand to know your love sings me <eos>
<sos> you breathes rest and messiah <eos>
