# TV Scripts Generator

## Project

Imagine your job is to produce a script for the next episode of a long-running TV show. There have been so many episodes that it is hard to come up with something new. The good news is that you can use deep learning to write the script for you. There are many scripts from previous episodes that a neural network can learn from. And, importantly, the audience likes what it is used to, so it sounds like a perfect solution.

In this project I'm going to train a neural network on scripts from the 9 seasons of Seinfeld. Based on the patterns it recognizes, it will be able to generate new text, and I will use it to create a new script.

## Load the data

If you look at the data file, you will see that it contains lines from the scripts, appended episode by episode. I'm loading them all as one huge string of text.

In [1]:
import helper

# Load in data
data_dir = './data/Seinfeld_Scripts.txt'
text = helper.load_data(data_dir)

Now I can access `text` character by character. Here I print the first 1000 characters.

In [2]:
text[:1000]

'jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! you wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right? where ever you are in life, its my feeling, youve gotta go. \n\njerry: (poi

You can see that new lines are represented by `\n`.

### Explore it

Using `view_line_range`, I print out the first ten lines of text. Empty lines also count as lines, so the result contains 5 quotes.

You can also see some dataset stats. The rough estimate of the number of unique words is too high, because it treats everything between whitespace as a single word. So it will count *back!* and *back* as two different words.

In [3]:
import numpy as np

# Determine which lines range to print
view_line_range = (0, 10)

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 46367
Number of lines: 109233
Average number of words in each line: 5.544240293684143

The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! y
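
To make the over-counting concrete, here is a minimal sketch on a made-up sentence (the sample string and the punctuation-stripping step are my own illustration, not part of the notebook's pipeline). It compares the whitespace-based estimate with a count taken after stripping punctuation from word edges.

```python
import string

# Made-up sample in the style of the scripts (illustrative only)
sample = "you wanna get back! once youre out, you wanna get back."

# Naive estimate: everything between whitespace counts as a word
naive_vocab = set(sample.split())

# Hypothetical refinement: strip punctuation from both ends of each word
clean_vocab = {w.strip(string.punctuation) for w in sample.split()}
clean_vocab.discard('')

print('naive:', len(naive_vocab), 'clean:', len(clean_vocab))
print(sorted(naive_vocab - clean_vocab))
```

The words that appear only in the naive set are the ones glued to a punctuation mark, which is exactly the effect that inflates the estimate.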

## Pre-process the data

When working with text data, it is useful to encode words as numbers. I am going to do it now.

### Lookup tables

I will create two dictionaries, `vocab_to_int` that will map words to integers and `int_to_vocab` that will do the opposite. These lookup tables will make it easy to translate between words and their integer codings later on.

In [4]:
import problem_unittests as tests

def create_lookup_tables(text):
    '''Create lookup tables for vocabulary.
    
    Args:
        text(str): The text of tv scripts split into words
    
    Return: A tuple of dicts (vocab_to_int, int_to_vocab)
    '''
    # Transform text into a large tuple of unique words
    vocab = tuple(set(text))
    
    int_to_vocab = dict(enumerate(vocab))
    # Invert the mapping (use `word` to avoid shadowing `vocab`)
    vocab_to_int = {word: i for i, word in int_to_vocab.items()}

    return (vocab_to_int, int_to_vocab)

# Test
tests.test_create_lookup_tables(create_lookup_tables)

Tests Passed
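
As a quick illustration of how the two tables work together, the sketch below encodes a toy word list and decodes it back. `build_lookup_tables` is a hypothetical stand-in for `create_lookup_tables`; the only difference is that it sorts the vocabulary, which I added so that the integer codes are reproducible between runs (the order of a `set` of strings is not).

```python
def build_lookup_tables(words):
    # Sorted vocabulary -> the same word always gets the same integer
    vocab = sorted(set(words))
    int_to_vocab = dict(enumerate(vocab))
    vocab_to_int = {word: i for i, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

# Toy input, already split into words
words = "jerry : hello newman newman : hello jerry".split()
vocab_to_int, int_to_vocab = build_lookup_tables(words)

# Encode words as integers and translate them back
encoded = [vocab_to_int[w] for w in words]
decoded = [int_to_vocab[i] for i in encoded]

print(encoded)
print(decoded == words)
```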


### Tokenize punctuation

Later, I'm going to split `text` into words on whitespace. However, punctuation can cause problems, for example *bye* and *bye!* being recognized as two different words. You can solve this by replacing every punctuation mark with a word token, like "." with "||period||". You might ask why the "|" separators are added. They let you distinguish between real words and tokens during analysis.

Now I'm going to create a dictionary that can map punctuation marks into tokens.

In [5]:
def token_lookup():
    '''Generate a dict to turn punctuation into a token.
    
    Return: Tokenized dictionary where the key is the punctuation and the value is the token
    '''
    punctuation_to_token = {
        '.' : '||period||',
        ',' : '||comma||',
        '"' : '||quotation_mark||',
        ';' : '||semicolon||',
        '!' : '||exclamation_mark||',
        '?' : '||question_mark||',
        '(' : '||left_parentheses||',
        ')' : '||right_parentheses||',
        '-' : '||dash||',
        '\n': '||new_line||'
    }
        
    return punctuation_to_token

# Test
tests.test_tokenize(token_lookup)

Tests Passed
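
The internals of `helper.preprocess_and_save_data` are not shown here, so the following is only a sketch of how such a dictionary is typically applied: each punctuation mark is replaced by its token padded with spaces, and the text is then split on whitespace. The `tokenize` function and the shortened dictionary are my own illustrative assumptions.

```python
# Shortened, illustrative version of the punctuation dictionary
punctuation_to_token = {
    '.': '||period||',
    '!': '||exclamation_mark||',
    '\n': '||new_line||',
}

def tokenize(text, token_dict):
    # Pad each token with spaces so that it splits off as its own word
    for mark, token in token_dict.items():
        text = text.replace(mark, ' {} '.format(token))
    return text.lower().split()

words = tokenize('bye!\nbye.', punctuation_to_token)
print(words)
```

After this step both occurrences of *bye* map to the same word, and the punctuation survives as separate tokens that the model can learn to generate.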


### Pre-process the data and save it

As a last step, I'm going to do the actual pre-processing. The code will translate punctuation marks and create the dictionaries. It will save the results to a file, so you can easily load them and continue from this point.

In [6]:
# Pre-process training data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

## Checkpoint

This code allows you to continue analysis whenever you come back to the notebook. Just run this cell and it will load the pre-processed data.

In [7]:
import helper
import problem_unittests as tests

# Load the pre-processed data
int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

In [8]:
# Count the number of unique words
print("Number of unique words: ", len(vocab_to_int))

Number of unique words:  21388


As you can see, after the pre-processing the number of unique words is more accurate. There are about 21,000 of them.

## Data pipeline

It's time to build the pipeline that will feed data to the model.

If possible, you should use a GPU during training.

In [9]:
import torch

# Check for a GPU
train_on_gpu = torch.cuda.is_available()

if not train_on_gpu:
    print('No GPU found. Please use a GPU during training.')

No GPU found. Please use a GPU during training.


### Providing data

The neural network needs its input in the form of tensors. To give the data this format, I'm going to use `TensorDataset`. Then I'm going to pass the dataset to `DataLoader`, which works as an iterator and handles shuffling and batching.

Batching will be handled by the data loader, but I still need to provide the feature and target variables. What form should they take? The generator's job will be to produce new lines of text on the basis of existing ones. So assuming that the text is encoded as integers representing words like this:

```
text = [ 2, 5, 3, 7, 1, 2, 6, 8, 9, 4, ... ]
```
it makes sense to define features and targets in the following way:
```
x1 = [ 2, 5, 3, 7 ]  y1 = [ 1 ]
x2 = [ 5, 3, 7, 1 ]  y2 = [ 2 ]
x3 = [ 3, 7, 1, 2 ]  y3 = [ 6 ]
```
As you can see, each feature consists of 4 consecutive words from the original text, while the target is the next, fifth word. This will allow the model to learn how to predict text from existing scripts. The length of a sequence, which in the example is equal to 4, is just a hyperparameter and can be modified.
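
The windowing described above can be sketched in plain Python with list slicing. `make_windows` is a hypothetical helper name used only for this illustration; it produces the same feature and target pairs as in the example, without any tensors or batching.

```python
def make_windows(words, sequence_length):
    # One (feature, target) pair for every position that still has
    # a following word to serve as the target
    features = [words[i:i + sequence_length]
                for i in range(len(words) - sequence_length)]
    targets = words[sequence_length:]
    return features, targets

text = [2, 5, 3, 7, 1, 2, 6, 8, 9, 4]
features, targets = make_windows(text, 4)

# Show the first three pairs, matching x1..x3 and y1..y3 above
for x, y in list(zip(features, targets))[:3]:
    print(x, y)
```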

Let's write a function that provides variables this way and wraps them in a data loader.

In [10]:
from torch.utils.data import TensorDataset, DataLoader
from torch import Tensor

def batch_data(words, sequence_length, batch_size, shuffle = False):
    '''Provide batches of data using data loader.
    
    Args:
        words(list): TV scripts represented as a list of words coded as integers
        sequence_length(int): length of each word sequence (feature)
        batch_size(int): number of features and targets in each batch
        shuffle(bool): whether the data loader should shuffle the samples
    
    Returns: data loader providing batches of (x, y).
    '''
    n = len(words)
    s = sequence_length
    
    features = []
    targets = []
    
    # Iterate over text, word by word
    for i in range(n):
        # As long as it is still possible to form a sequence
        if i + s < n:
            # Add each next sequence of words as a feature
            features.append([words[j] for j in range(i, i + s)])
            # And the following word as a target
            targets.append(words[i + s])
        else:
            break
    
    # Transform features and targets into tensors
    dataset = TensorDataset(Tensor(features), Tensor(targets))
    # Return a dataloader
    dataloader = DataLoader(dataset = dataset, 
                            batch_size = batch_size, 
                            drop_last = True,
                            shuffle = shuffle)
    return dataloader

I will re-create data from the example so you can compare the output.

In [11]:
# Parameters from the example
fake_text = [ 2, 5, 3, 7, 1, 2, 6, 8, 9, 4 ]
seq_length = 4
batch_size = 3

data_loader = batch_data(fake_text, seq_length, batch_size)

# Print text for comparison
print('text: ', fake_text, '\n')
# Print batches
for batch_n, (x_batch, y_batch) in enumerate(data_loader):
    print('batch ' + str(batch_n + 1) + ':')
    for i in range(batch_size):
        # Running index of the sample across batches
        j = i + batch_n * batch_size + 1
        print('x' + str(j) + ': ', x_batch[i].numpy(), 
              'y' + str(j) + ': ', y_batch[i].numpy().astype(int))
    print('\n')

text:  [2, 5, 3, 7, 1, 2, 6, 8, 9, 4] 

batch 1:
x1:  [2. 5. 3. 7.] y1:  1
x2:  [5. 3. 7. 1.] y2:  2
x3:  [3. 7. 1. 2.] y3:  6


batch 2:
x4:  [7. 1. 2. 6.] y4:  8
x5:  [1. 2. 6. 8.] y5:  9
x6:  [2. 6. 8. 9.] y6:  4




Notice that the output looks exactly like the example. The order is also preserved in the second batch.

Additionally, I created a test function. It creates a data loader 3 times, each time with a random set of parameters. If it prints *All tests passed*, the words are in the correct order.
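
It can also help to know how many batches to expect. A text of `n` words yields `n - sequence_length` samples, and with `drop_last = True` only full batches are kept. The small sketch below works through that arithmetic; `count_batches` is a hypothetical helper written just for this illustration.

```python
def count_batches(n_words, sequence_length, batch_size):
    # Every position with a following target word gives one sample
    n_samples = max(n_words - sequence_length, 0)
    # drop_last = True keeps full batches only; the remainder is dropped
    full_batches, dropped = divmod(n_samples, batch_size)
    return n_samples, full_batches, dropped

# The toy example: 10 words, sequences of 4, batches of 3
print(count_batches(10, 4, 3))
# Same data with batches of 4: two samples do not fit and are dropped
print(count_batches(10, 4, 4))
```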

In [12]:
# Test function
def test_data_loader(int_text):
    # Test data loader 3 times with a random sequence length and a batch size
    for i in range(3):
        sequence_len = np.random.randint(1, 11)
        batch_size = int(np.random.choice([8, 16, 32, 64]))
        _data_loader_single_test(int_text, sequence_len, batch_size)
    # Print a message if all tests are successful
    print("All tests passed")

def _data_loader_single_test(int_text, sequence_len, batch_size):
    # Create a data loader
    data_loader = batch_data(int_text, sequence_len, batch_size)
    # Helper function
    # `np.int` was removed from recent NumPy releases; the built-in int works
    tensor_to_int = lambda tensor: tensor.numpy().astype(int).tolist()

    # Test the first batch
    for x_batch, y_batch in data_loader:
        # Test batch size
        assert(len(x_batch) == batch_size)
        assert(len(y_batch) == batch_size)
        
        # Test sequence length
        assert(len(x_batch[0]) == sequence_len)
        
        # Test words sequences
        for i in range (0, batch_size):
            a = tensor_to_int(x_batch[i]) # sequence of feature words
            b = int_text[i : i + sequence_len] # corresponding sequence from text
            assert(a == b)

            a = tensor_to_int(y_batch[i]) # target word
            b = int_text[i + sequence_len] # corresponding target from text
            assert(a == b)
        # Stop test after first batch
        break

In [13]:
# Test
test_data_loader(int_text)

All tests passed


## Build the RNN

I'm going to define the neural network architecture. I will use a number of parameters that allow you to customize it.

It will have 3 methods:
 - `__init__` - responsible for setting class variables
 - `init_hidden` - responsible for initializing hidden state
 - `forward` - responsible for flow of data during a forward pass

Basically, what I want the network to do is:
- take a batch of word sequences
- encode them as word embeddings
- feed them into the LSTM layer
- flatten its output
- pass it through the fully-connected layer
- reshape the result back to 3D
- output a batch of predictions, one for the last word of each sequence

You may wonder why the output is only the prediction for the last word. Let's say that the input sequence consists of three words [B, F, C]. The LSTM processes them one step at a time (steps LSTM-0, LSTM-1 and LSTM-2 of the unrolled network) and produces an output at every step. Each of these outputs can be used to predict the next word in the sequence, but you already know that F comes after B and that C comes after F. You can read it from the input sequence itself. The only interesting prediction is the one made after the last word, C, because the word that follows it is not part of the input sequence.

This is the idea behind the model that I mentioned before: predict the next word given a sequence of preceding words.
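
To see what "the prediction for the last word" means in terms of tensors, here is a small standalone sketch (the sizes are arbitrary values chosen for illustration). An `nn.LSTM` with `batch_first = True` returns one output vector per position in the sequence; slicing with `[:, -1]` keeps only the last one, which for a unidirectional LSTM is the final hidden state of the top layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary small sizes, for illustration only
batch_size, seq_len, embedding_dim, hidden_dim = 2, 3, 4, 5
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim,
               num_layers=1, batch_first=True)

# Random stand-in for a batch of embedded word sequences
x = torch.randn(batch_size, seq_len, embedding_dim)
output, (h_n, c_n) = lstm(x)

# One output per time step: (batch_size, seq_len, hidden_dim)
print(output.shape)

# Keep only the last time step: (batch_size, hidden_dim)
last_step = output[:, -1]
print(last_step.shape)

# It matches the final hidden state of the last LSTM layer
print(torch.allclose(last_step, h_n[-1]))
```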

In [14]:
import torch.nn as nn

class RNN(nn.Module):
    
    def __init__(self, 
                 vocab_size, 
                 output_size, 
                 embedding_dim, 
                 hidden_dim, 
                 n_layers, 
                 dropout = 0.5):
        '''Initialize RNN's parameters and layers.
        
        Args:
            vocab_size(int): number of unique words in the vocabulary
                             (number of rows in the embedding table)
            output_size(int): number of output nodes
            embedding_dim(int): length of embedding vector
                                (how many numbers will be used to represent a word)
            hidden_dim(int): number of hidden nodes in the LSTM layers
                             (and the input size of the FC layer)
            n_layers(int): number of stacked LSTM layers
            dropout(float): dropout applied between stacked LSTM layers
        '''
        super(RNN, self).__init__()

        # Set class variables
        self.vocab_size = vocab_size
        self.output_size = output_size
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.dropout = dropout

        ## Define model layers
        # Takes in the integer index of a word (conceptually a one-hot
        # vector over the vocabulary) and outputs a dense vector of
        # length embedding_dim for that word
        self.embed = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.lstm = nn.LSTM(input_size = self.embedding_dim,
                            hidden_size = self.hidden_dim,
                            num_layers = self.n_layers,
                            dropout = self.dropout,
                            batch_first = True)
        self.fc = nn.Linear(self.hidden_dim, self.output_size)
    
    
    def forward(self, nn_input, hidden):
        '''Forward pass.
        
        Args:
            nn_input(tensor): input (batch_size, sequence_length)
                              each observation is a list of ints
                              representing words (vocab_to_int)
            hidden(tuple): the hidden state, a tuple of tensors
                           (n_layers, batch_size, hidden_dim)      
        
        Returns: tuple of output and hidden state
        '''   
        # Get the batch size (first input dimension)
        batch_size = nn_input.size(0)
        
        # Get embedding for the input, each word index is mapped to a vector
        # (batch_size, sequence_length) -> (batch_size, sequence_length, embedding_dim)
        embed_input = self.embed(nn_input.long())
        
        # Get the outputs and the new hidden state from the LSTM:
        # output (batch_size, sequence_length, hidden_dim)
        # hidden state (n_layers, batch_size, hidden_dim)
        lstm_output, hidden = self.lstm(embed_input, hidden)

        # Flatten the output before feeding it into the fc layer
        # From 3D to 2D: (batch_size * sequence_length, hidden_dim),
        # so that every time step becomes a separate row
        lstm_output = lstm_output.contiguous().view(-1, self.hidden_dim)
        # view() only reinterprets the shape of existing memory, so it
        # needs a contiguous tensor; contiguous() copies the data into
        # one contiguous block if it is not laid out that way already
        
        # Feed into fc
        # Returns a vector of scores (logits) over the vocabulary
        # for every input word
        # (batch_size * sequence_length, hidden_dim)
        #     -> (batch_size * sequence_length, output_size)
        output = self.fc(lstm_output)
        
        # Reshape the output to the desired output size
        # Basically just return to the 3D representation
        # starting with (batch_size, seq_length, ...)
        # (batch_size * sequence_length, output_size)
        #     -> (batch_size, sequence_length, output_size)
        output = output.view(batch_size, -1, self.output_size)

        # Get the output (batch_size, vocab_size)
        # Get prediction for the last word in the input sequence
        out = output[:, -1]
        # For every observation in a batch it is a vector of (vocab_size, )
        # with probability  for each word (actually logit, because it is 
        # before applying softmax by the loss function)
        
        # Return one batch of predictions and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        '''Initialize the hidden state of an LSTM.
        
        Args:
            batch_size(int)
        
        Returns: tuple (hidden state, cell state), each a zero tensor of
                 shape (n_layers, batch_size, hidden_dim), i.e. one state
                 vector per LSTM layer and per sequence in the batch.
        '''
        # Get weight matrix
        weight = next(self.parameters()).data
        # Create matrix of zero weights based on its shape
        zero_weight = weight.new(self.n_layers, batch_size, self.hidden_dim).zero_()
        
        # Initialize hidden state with zero weights, and move to GPU if available
        if train_on_gpu:
            hidden = (zero_weight.cuda(), zero_weight.cuda())
        else:
            hidden = (zero_weight, zero_weight)
        
        return hidden

# Test
tests.test_rnn(RNN, train_on_gpu)

Tests Passed
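
As a side note on the flatten and reshape steps in `forward`: `nn.Linear` operates on the last dimension of its input, so applying it directly to the last time step should give the same predictions. The standalone sketch below checks this on random data with arbitrary sizes. It is an alternative I am pointing out for comparison, not the way the class above is written.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary small sizes, for illustration only
batch_size, seq_len, hidden_dim, output_size = 2, 3, 5, 7
fc = nn.Linear(hidden_dim, output_size)

# Random stand-in for the LSTM output
lstm_output = torch.randn(batch_size, seq_len, hidden_dim)

# Route used in forward(): flatten, apply fc, restore 3D, take last step
flat = lstm_output.contiguous().view(-1, hidden_dim)
out_flat = fc(flat).view(batch_size, -1, output_size)[:, -1]

# Alternative: select the last step first, then apply fc once per sequence
out_direct = fc(lstm_output[:, -1])

print(out_flat.shape)
print(torch.allclose(out_flat, out_direct, atol=1e-5))
```

The second route computes scores only for the time step that is actually used, which saves some work for long sequences.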


### Forward and backprop

I am going to write a function that performs a forward and a backward pass through the network. I will call it later in the training loop like this:
```
loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)
```
It will make the code easier to read.

In [15]:
def forward_back_prop(rnn, 
                      optimizer, 
                      criterion, 
                      feature, 
                      target, 
                      hidden, 
                      clip_value = 1):
    '''Forward and backward propagation run.
    
    Args:
        rnn(RNN object): instance of the model object
        optimizer(object): PyTorch optimizer
        criterion(object): PyTorch loss function
        feature(tensor): feature tensor (batch_size, sequence_length)
        target(tensor): target tensor (batch_size, )
        hidden(tuple): hidden state returned by the previous call
        clip_value(float): maximum allowed norm of the gradients
    
    Returns: loss and the latest hidden state
    '''   
    # Move data to GPU, if available
    if train_on_gpu:
        feature, target = feature.cuda(), target.cuda()
    
    # Detach hidden state from its history
    # Otherwise it will backprop through entire training history
    # and take a long time
    hidden = tuple([each.data for each in hidden])
    
    # Clear the gradients
    optimizer.zero_grad()
    # Forward pass
    output, hidden = rnn.forward(feature, hidden)
    # Calculate the batch loss
    loss = criterion(output, target.long())
    # Backward pass
    loss.backward(retain_graph = True)
    # Clip gradients to prevent them from exploding
    # If the total gradient norm exceeds clip_value, all gradients
    # are rescaled so that the norm equals clip_value
    nn.utils.clip_grad_norm_(rnn.parameters(), clip_value)
    # Parameter update
    optimizer.step()

    # Return the loss over a batch and the hidden state produced by the model
    return loss.item(), hidden

# Test
tests.test_forward_back_prop(RNN, forward_back_prop, train_on_gpu)

Tests Passed
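
To illustrate what `nn.utils.clip_grad_norm_` does, the sketch below builds a tiny throwaway model (not the RNN above), deliberately produces large gradients, and measures the total gradient norm before and after clipping. The `total_grad_norm` helper is my own, written only for this check.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model with an exaggerated loss to get large gradients
model = nn.Linear(4, 2)
loss = (model(torch.randn(8, 4)) * 100).pow(2).mean()
loss.backward()

def total_grad_norm(parameters):
    # L2 norm over all gradients treated as one long vector
    grads = [p.grad.flatten() for p in parameters if p.grad is not None]
    return torch.cat(grads).norm(2).item()

before = total_grad_norm(model.parameters())
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
after = total_grad_norm(model.parameters())

print('before:', before)
print('after: ', after)
```

All gradients are scaled by the same factor, so their direction is preserved and only the size of the update step is limited.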


## Training

The architecture is ready, so it's time for training.

### Train loop

I'm going to write the training loop using the `forward_back_prop` function that you saw before.

In [16]:
def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches = 100):
    # Initialize loss tracker
    # `np.Inf` was removed in NumPy 2.0; `np.inf` works in all versions
    min_loss = np.inf
    
    losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        # Initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            # Make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset) // batch_size
            if batch_i > n_batches:
                break
            
            # Forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # Record loss
            losses.append(loss)

            # Print loss stats
            if batch_i % show_every_n_batches == 0:
                # Calculate the loss in this epoch until now
                loss = np.average(losses)
                # Print
                print('Epoch: {:>4}/{:<4}  Loss: {:.6f}\n'.format(
                    epoch_i, n_epochs, loss))
        
        # Calculate final epoch loss
        loss = np.average(losses)
        # Clear the loss record before new epoch
        losses = []
        
        # Check if loss has improved
        if loss < min_loss:
            print('Epoch loss decreased ({:.6f} --> {:.6f}).  Saving model ...\n'.format(
                min_loss,
                loss)
            )
            # Save the model
            helper.save_model('model', rnn)
            # Update min loss
            min_loss = loss

    # Return a trained rnn
    return rnn

### Hyperparameters

Now I will set the network hyperparameters. There are quite a few of them and you can always tune them later.

In [17]:
## Data params

# Sequence Length
sequence_length = 5  # of words in a sequence
# Batch Size
batch_size = 128

# Data loader - init with data params
train_loader = batch_data(int_text, sequence_length, batch_size)

In [18]:
## Training params

# Number of Epochs
num_epochs = 10
# Learning Rate
learning_rate = 0.001

## Model parameters

# Vocab size
vocab_size = len(vocab_to_int)
# Output size
output_size = len(vocab_to_int)
# Embedding Dimension
embedding_dim = 12
# Hidden Dimension
hidden_dim = 200
# Number of RNN Layers
n_layers = 2

# Show stats for every n number of batches
show_every_n_batches = 6969

### Training

Now you can actually run the training code. My goal is to reach a loss below 3.5.

In [19]:
# Create model
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout = 0.5)
# Move it to GPU if available
if train_on_gpu:
    rnn.cuda()

# Define loss and optimizer
optimizer = torch.optim.Adam(rnn.parameters(), lr = learning_rate)
criterion = nn.CrossEntropyLoss()

# Train
trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

Training for 10 epoch(s)...
Epoch:    1/10    Loss: 4.866976

Epoch loss decreased (inf --> 4.866976).  Saving model ...

Epoch:    2/10    Loss: 4.234397

Epoch loss decreased (4.866976 --> 4.234397).  Saving model ...

Epoch:    3/10    Loss: 4.003606

Epoch loss decreased (4.234397 --> 4.003606).  Saving model ...

Epoch:    4/10    Loss: 3.853343

Epoch loss decreased (4.003606 --> 3.853343).  Saving model ...

Epoch:    5/10    Loss: 3.746055

Epoch loss decreased (3.853343 --> 3.746055).  Saving model ...

Epoch:    6/10    Loss: 3.666510

Epoch loss decreased (3.746055 --> 3.666510).  Saving model ...

Epoch:    7/10    Loss: 3.601086

Epoch loss decreased (3.666510 --> 3.601086).  Saving model ...

Epoch:    8/10    Loss: 3.553352

Epoch loss decreased (3.601086 --> 3.553352).  Saving model ...

Epoch:    9/10    Loss: 3.509620

Epoch loss decreased (3.553352 --> 3.509620).  Saving model ...

Epoch:   10/10    Loss: 3.473532

Epoch loss decreased (3.509620 --> 3.473532).  Saving model ...
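
One way to put these loss values in perspective (an interpretation aid I am adding, not a step of the training itself) is to compare them with a model that guesses uniformly over the vocabulary, and to convert the loss into perplexity. The sketch assumes the reported loss is the mean natural-log cross-entropy, which is what `nn.CrossEntropyLoss` computes by default.

```python
import math

vocab_size = 21388      # number of unique words reported earlier
final_loss = 3.473532   # epoch 10 loss from the training log

# Loss of a model that assigns equal probability to every word
uniform_loss = math.log(vocab_size)

# Perplexity: the effective number of equally likely next words
perplexity = math.exp(final_loss)

print('uniform-guess loss: {:.2f}'.format(uniform_loss))
print('perplexity at final loss: {:.1f}'.format(perplexity))
```

Read this way, the trained model is on average about as uncertain as if it were choosing among a few dozen words instead of the full vocabulary.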

### Hyperparameter tuning

My general idea during model tuning was to keep the model simple. If, for example, an increase in the hidden dimension improved the loss only slightly, I would move back to the previous value.

`sequence_length` affects how much context the model sees. Just think about how a prediction based on 5 words can differ from one based on 10. Text generated by the latter model is probably more consistent, as it has more words of context to learn from. I kept this value low, at 5, as it should still perform fine.

Commonly used values for `batch_size` are 32, 64, 128 or 256. I decided to keep it at 128. Batch size and learning rate interact (a larger batch usually tolerates a larger learning rate), so instead of tuning both I fixed the batch size and focused on the learning rate.

As for `learning_rate`, I tried both 0.005 and 0.0005, but 0.001 seemed to work better.

Remember that the network works with the whole vocabulary at both ends: each input word is an index into the vocabulary (conceptually a one-hot vector) and the output holds one score per word. `vocab_size` and `output_size` simply determine these lengths, so you should set both to the number of unique words.

According to a Google Developers blog entry, there is a simple rule of thumb for a good `embedding_dim` value: the 4th root of the number of classes. Here, classes are unique words, so the result is approximately 21,000 ** 0.25 ~ 12.
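
The rule of thumb is easy to check with the exact vocabulary size reported earlier in the notebook:

```python
vocab_size = 21388  # number of unique words after pre-processing

# Rule of thumb: embedding dimension ~ 4th root of the number of classes
fourth_root = vocab_size ** 0.25
embedding_dim = round(fourth_root)

print(fourth_root, embedding_dim)
```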

I set `num_epochs` to 10. It is the minimum value at which I was able to get the loss below 3.5.

Generally, a good value of `n_layers` is between 1 and 3. The middle value, 2, was enough to get good results.

As with most parameters, you need to experiment. I tried a range of values for `hidden_dim` and the loss decreases nicely at 200.

## Checkpoint

In [20]:
import torch
import helper
import problem_unittests as tests

_, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

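# Re-check for a GPU, since this cell may run in a fresh session
train_on_gpu = torch.cuda.is_available()

if train_on_gpu: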
    map_location = lambda storage, loc: storage.cuda()
else:
    map_location = 'cpu'

trained_rnn = helper.load_model('model', map_location = map_location)