# TV Script Generation
PyTorch implementation of a **Recurrent Neural Network (LSTM)** that generates new Seinfeld TV scripts. The network is trained on scripts from the [Seinfeld dataset](https://www.kaggle.com/thec03u5/seinfeld-chronicles#scripts.csv).

This project is the fourth assignment for the [Udacity Deep Learning Nanodegree](https://eu.udacity.com/course/deep-learning-nanodegree--nd101).

### Import Dependencies

In [0]:
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader
import torch.nn.functional as F

import time
import math
import numpy as np

from collections import Counter

## Load the Data
First, we load the data from the file `Seinfeld_Scripts.txt` and explore it briefly.


In [0]:
file_name = 'Seinfeld_Scripts.txt'

def load(file_name):
    """
    Reads a file file_name and returns it as string
    
    :param file_name: name of a file to load
    :return: content of the file
    """
    with open(file_name , 'r') as file:
        text = file.read()
    return text

text = load(file_name)

### Explore the Data
To see what the data looks like, we display some **statistics** and the beginning of the text.

In [4]:
print('Dataset statistics')
print('Number of lines: {}'.format(len(text.split('\n'))))
print('Number of words: {}'.format(len(text.split())))
print('Number of unique words: {}'.format(len(set(text.split()))))

Dataset statistics
Number of lines: 109233
Number of words: 605614
Number of unique words: 46367


Print a few lines:

In [5]:
print('\n'.join(text.split('\n')[10:20]))

jerry: oh, you dont recall? 

george: (on an imaginary microphone) uh, no, not at this time. 

jerry: well, senator, id just like to know, what you knew and when you knew it. 

claire: mr. seinfeld. mr. costanza. 

george: are, are you sure this is decaf? wheres the orange indicator? 



## Preprocess the Data
Before building and training a network, we preprocess the data: 
- tokenize punctuation, 
- split the text into words,
- convert words to integers.

### Tokenize Punctuation
In order not to consider words with punctuation ('hello' vs. 'hello!') as different words, the **punctuation signs are replaced** by special words.

In [0]:
def punctuation_lookup():
    """
    Returns a dictionary of punctuation signs and special words to replace them
    
    :return: a dictionary of punctuation
    """
    punctuation = {
        '.': '<Period>',
        ',': '<Comma>',
        '"': '<QuotationMark>',
        ';': '<Semicolon>',
        '!': '<ExclamationMark>',
        '?': '<QuestionMark>',
        '(': '<LeftParentheses>',
        ')': '<RightParentheses>',
        '-': '<Dash>',
        '\n': '<Return>'}
    return punctuation
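
To see why this replacement helps, here is a minimal sketch that uses a hypothetical two-entry subset of the table above. The `tokenize` helper is not part of the notebook; it is an assumption that only mirrors the replace, lowercase and split steps applied during preprocessing.

```python
# Hypothetical two-entry subset of the punctuation table, for illustration only
punctuation = {'!': '<ExclamationMark>', ',': '<Comma>'}

def tokenize(text, table):
    """Replaces punctuation signs by special words and splits the text."""
    for sign, token in table.items():
        # Surround the token with spaces so it becomes a separate word
        text = text.replace(sign, ' {} '.format(token))
    return text.lower().split()

tokens = tokenize('Hello, hello!', punctuation)
print(tokens)
# ['hello', '<comma>', 'hello', '<exclamationmark>']
```

Both occurrences of `hello` now map to the same word, and the punctuation survives as separate tokens that the model can learn to predict.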

### Lookup Tables
We create lookup tables: **mappings for converting words to integers** and back, for use with a word embedding layer.

In [0]:
def create_lookup_tables(text):
    """
    Creates lookup tables to convert words to integers and back
    
    :param text: input text split into words
    :return: two dictionaries converting words to integers and back
    """

    counter = Counter(text)
    # Start indexing from 1 (leave 0 as a padding word)
    vocab_to_int = {word: i for i, (word, count) 
                    in enumerate(counter.most_common(), 1)}
    int_to_vocab = {vocab_to_int[word]: word for word in counter}

    return (vocab_to_int, int_to_vocab)
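
As a small worked example of the mapping, the sketch below builds both tables for a made-up six-word list (the word list is an assumption, not data from the scripts). More frequent words receive smaller indices, and index 0 stays unused so that it can serve as padding.

```python
from collections import Counter

# Toy input: 'hello' appears 3 times, 'jerry:' twice, 'george:' once
words = ['jerry:', 'hello', 'george:', 'hello', 'hello', 'jerry:']

counter = Counter(words)
# Indexing starts at 1; 0 is left free for padding
vocab_to_int = {word: i for i, (word, _) in enumerate(counter.most_common(), 1)}
int_to_vocab = {i: word for word, i in vocab_to_int.items()}

print(vocab_to_int)
# {'hello': 1, 'jerry:': 2, 'george:': 3}

# Converting to integers and back restores the original words
encoded = [vocab_to_int[word] for word in words]
decoded = [int_to_vocab[i] for i in encoded]
```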

In [10]:
def preprocess(text):
    """
    Preprocesses text to be ready as an input to the neural network
    
    :param text: input text
    :return: text converted to integers and two dictionaries mapping word to 
        integers and back
    """
    
    # Tokenize punctuation
    punctuation = punctuation_lookup()
    for sign, token in punctuation.items():
        text = text.replace(sign, ' {} '.format(token))
        
    # Convert to lowercase
    text = text.lower()
    
    # Split to words
    text = text.split()
    
    # Create Lookup Tables and convert text to integers
    vocab_to_int, int_to_vocab = create_lookup_tables(text)
    text_int = [vocab_to_int[word] for word in text]

    return text_int, vocab_to_int, int_to_vocab
    
text_int, vocab_to_int, int_to_vocab = preprocess(text)

# Print the beginning of the processed text
print(text_int[0:20])

[8, 35, 5, 28, 19, 25, 23, 51, 59, 4, 35, 5, 28, 3, 84, 121, 63, 4, 9, 55]


## Build and Train the Neural Network
Check if training on **GPU** is available.

In [7]:
# Check if CUDA is available
train_on_gpu = torch.cuda.is_available()

if train_on_gpu:
    print('CUDA is available! Training on GPU.')
else:
    print('CUDA is not available. Training on CPU.') 

CUDA is available! Training on GPU.


### Batch Input
Split the dataset into **training and validation** parts, prepare **batches** from the data, and create data loaders.

In [0]:
def batch_data(words, seq_length, batch_size, train_split = 0.9):
    """
    Makes batches of data and returns DataLoaders 
    
    :param words: processed input text converted to integer list
    :param seq_length: length of input sequence
    :param batch_size: number of examples in a batch
    :param train_split: portion of data to be used for training
    :return: DataLoaders with training and validation data
    """
    
    # Define feature and target tensors
    # Each feature is a window of seq_length words; its target is the word
    # that immediately follows the window
    feature_tensor = torch.tensor([words[i:i+seq_length]
                                   for i in range(len(words) - seq_length)])
    target_tensor = torch.tensor(words[seq_length:])
    
    # Create dataset
    data_all = TensorDataset(feature_tensor, target_tensor)
    
    # Split into training and validation parts
    n_train = int(train_split * len(data_all))
    data = dict(zip(['train', 'valid'], torch.utils.data.random_split(
        data_all, (n_train, len(data_all) - n_train))))
    
    # Define DataLoaders
    dataloaders = {
        phase: DataLoader(data[phase], shuffle = True, batch_size = batch_size)
                for phase in ['train', 'valid']}
    
    return dataloaders
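
The feature and target construction is a sliding window over the text. The following sketch shows the pairing on toy data with plain lists; the six word ids and the window length of three are illustrative values, not the ones used for training.

```python
# Toy data: six word ids and a window of three (both values are illustrative)
words = [11, 12, 13, 14, 15, 16]
seq_length = 3

# Every window of seq_length words is a feature ...
features = [words[i:i + seq_length] for i in range(len(words) - seq_length)]
# ... and the word right after the window is its target
targets = words[seq_length:]

for feature, target in zip(features, targets):
    print(feature, '->', target)
# [11, 12, 13] -> 14
# [12, 13, 14] -> 15
# [13, 14, 15] -> 16
```

A text of `n` words therefore yields `n - seq_length` training examples.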

### Model Architecture
We build the model from an **Embedding** layer that transforms words into embeddings (word vectors), `n` layers of **LSTM RNN**, and a final **Linear** output layer.

In [0]:
class TvScriptGenerator(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, 
                 dropout = 0.5):
        """
        Initializes the neural network
        
        :param vocab_size: size of the vocabulary
        :param embedding_dim: dimension of word embeddings
        :param hidden_dim: dimension of hidden state of RNN
        :param num_layers: number of layers of RNN
        :param dropout: dropout rate between layers
        
        """
        super().__init__()
        
        # Define layers        
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, 
                            batch_first = True, dropout = dropout)
        self.linear = nn.Linear(hidden_dim, vocab_size)
        
    def forward(self, nn_input, hidden):
        """
        Forward pass
        
        :param nn_input: input of neural network
        :param hidden: hidden state of RNN
        :return: model output and a new hidden state
        
        """
        
        embedding = self.embed(nn_input)
        
        lstm_output, hidden = self.lstm(embedding, hidden)
        
        # Shape of lstm_output is [batch_size, sequence_length, hidden_dim]
        # Take only the output for the last element of a sequence
        lstm_output = lstm_output[:, -1, :].squeeze()
        
        output = self.linear(lstm_output)
        
        return output, hidden
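
To get a feeling for the size of the three parts of this architecture, the sketch below counts their trainable parameters in plain Python. It assumes PyTorch's LSTM layout (four gates per layer, each with input weights, recurrent weights and two bias vectors). The helper `count_parameters` and the vocabulary size of 20000 are hypothetical; the remaining sizes match the hyperparameters defined in this notebook.

```python
def count_parameters(vocab_size, embedding_dim, hidden_dim, num_layers):
    """Counts trainable parameters of an Embedding -> LSTM -> Linear stack."""
    embedding = vocab_size * embedding_dim

    lstm = 0
    for layer in range(num_layers):
        # The first layer reads embeddings, deeper layers read hidden states
        input_dim = embedding_dim if layer == 0 else hidden_dim
        # Four gates, each with input weights, recurrent weights and two biases
        lstm += 4 * hidden_dim * (input_dim + hidden_dim + 2)

    # Weight matrix plus bias of the output layer
    linear = hidden_dim * vocab_size + vocab_size
    return {'embedding': embedding, 'lstm': lstm, 'linear': linear}

# vocab_size = 20000 is a made-up value for illustration
counts = count_parameters(vocab_size=20000, embedding_dim=512,
                          hidden_dim=256, num_layers=2)
print(counts)
# {'embedding': 10240000, 'lstm': 1314816, 'linear': 5140000}
```

Under these assumptions most parameters sit in the embedding and output layers, whose size grows linearly with the vocabulary.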
        

### Implement the Training Algorithm
**Forward and backward** propagation pass:

In [0]:
def forward_backward_prop(model, optimizer, criterion, inputs, targets, hidden,
                          clip):
    """
    Performs a forward and backward propagation pass of a model for one batch
    of data, returns training loss and new hidden state
    
    :param model: model to perform forward and backward propagations on 
    :param optimizer: optimizer used for updating parameters
    :param criterion: loss function
    :param inputs: input to the model
    :param targets: target words to compare with predicted words
    :param hidden: hidden state of LSTM network
    :param clip: maximum gradient norm used for gradient clipping
    :return: training loss and new hidden state
    """
    
    # Detach hidden state so that we don't backpropagate through entire history
    if hidden is not None:
        hidden = tuple(each.detach() for each in hidden)
        
    # Zero gradients
    optimizer.zero_grad()
    
    # Forward propagation
    output, hidden = model(inputs, hidden)
    loss = criterion(output, targets)
    
    # Backward propagation
    loss.backward()

    # Clip the gradients to prevent them from exploding; this has to happen
    # after backward() has computed them and before the optimizer step
    nn.utils.clip_grad_norm_(model.parameters(), clip)

    optimizer.step()

    return float(loss), hidden
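
Clipping by norm rescales all gradients together when their combined L2 norm exceeds a threshold, so the update keeps its direction but not its length. The sketch below illustrates the idea on a flat list of numbers. It is a simplified stand-in for illustration, not PyTorch's implementation, and the helper name `clip_by_norm` is hypothetical.

```python
import math

def clip_by_norm(grads, max_norm, eps=1e-6):
    """Rescales gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm is too large; otherwise leave values as is
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Norm 5.0 exceeds the threshold, so the values shrink to roughly [0.6, 0.8]
clipped, norm_before = clip_by_norm([3.0, 4.0], max_norm=1.0)

# Norm 0.5 is below the threshold, so the values stay unchanged
small, _ = clip_by_norm([0.3, 0.4], max_norm=1.0)

print(norm_before, clipped, small)
```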

Training for **one epoch**:

In [0]:
def train_one_epoch(model, optimizer, criterion, loaders, batch_size, clip):
    """
    Trains a model for one epoch
    
    :param model: model to train
    :param optimizer: optimizer for optimization of model parameters
    :param criterion: loss function
    :param loaders: DataLoaders with training and validation data
    :param batch_size: number of examples in a batch
    :param clip: value used to clip gradients
    :return: average training loss
    """

    # Start each epoch with clean hidden state
    hidden = None

    # Training
    model.train()
    train_loss = 0
    
    for inputs, labels in loaders['train']:
        # Take only full batches
        if len(inputs) != batch_size:
            break
        if train_on_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()

        # Forward and backward pass
        loss, hidden = forward_backward_prop(model, optimizer, criterion,
                                             inputs, labels, hidden, clip)
        train_loss += loss
        
    return train_loss / len(loaders['train'])

**Validation:**

In [0]:
def validate(model, criterion, loaders, batch_size):
    """
    Runs forward pass on validation data and returns validation loss
    
    :param model: model to validate
    :param criterion: loss function
    :param loaders: DataLoaders with validation data
    :param batch_size: number of examples in a batch
    :return: average validation loss
    """
    
    model.eval()
    valid_loss = 0
    with torch.no_grad():
        for inputs, labels in loaders['valid']:
            # Take only full batches
            if len(inputs) != batch_size:
                break
            if train_on_gpu:
                inputs, labels = inputs.cuda(), labels.cuda()

            output, _ = model(inputs, None)
            valid_loss += float(criterion(output, labels))
            
    return valid_loss / len(loaders['valid'])

Functions for **saving and loading** model parameters:

In [0]:
def save_parameters(model, epoch, loss, path):
    """
    Saves a checkpoint with state_dict of a model into a file
    
    :param model: model to save
    :param epoch: epoch number
    :param loss: validation loss of the model
    :param path: path to save the checkpoint to
    """
    checkpoint = {
        'state_dict': model.state_dict(),
        'epochs': epoch,
        'loss': loss
    }
    torch.save(checkpoint, path)

In [0]:
def load_parameters(model, path):
    """
    Loads parameters into model
    
    :param model: model to load parameters into
    :param path: filepath of a checkpoint with saved parameters
    :return: epoch and minimal validation loss of the saved model
    """
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['state_dict'])
    epoch = checkpoint['epochs']
    min_loss = checkpoint['loss']
    
    return epoch, min_loss

**Training loop:**

In [0]:
def train(model, optimizer, criterion, scheduler, num_epochs, loaders, 
          batch_size, clip = 5):
    """
    Trains a model
    
    :param model: model to train
    :param optimizer: optimizer for optimization of model parameters
    :param criterion: loss function
    :param scheduler: scheduler for learning rate decay
    :param num_epochs: number of epochs to train for
    :param loaders: DataLoaders with training and validation data
    :param batch_size: number of examples in a batch
    :param clip: value used to clip gradients
    """
    
    min_loss = math.inf
    
    # Move model to GPU
    if train_on_gpu:
        model.cuda()
        
    for epoch in range(1, num_epochs + 1):
        
        start_time = time.time()
        
        # Training
        train_loss = train_one_epoch(model, optimizer, criterion, loaders, 
                                     batch_size, clip)
        
        # Validation
        valid_loss = validate(model, criterion, loaders, batch_size)
            
        # Save model if validation loss decreased
        if valid_loss < min_loss:
            min_loss = valid_loss
            save_parameters(model, epoch, valid_loss, 'checkpoint.pth')
            
        scheduler.step()
            
        print(('Epoch {}/{}, Training loss: {:.3f}, Validation loss: {:.3f}, ' \
               'Time: {:.0f} s').format(epoch, num_epochs,
                                        train_loss, valid_loss,
                                        time.time() - start_time))

### Define Hyperparameters

In [0]:
# Batching hyperparameters
sequence_length = 32
batch_size = 128

# Training hyperparameters
num_epochs = 15
lr = 0.001
dropout = 0.5
weight_decay = 0.00001

# Model hyperparameters
vocab_size = len(vocab_to_int) + 1
embedding_dim = 512
hidden_dim = 256
num_layers = 2

# Scheduler hyperparameters
step = 12
gamma = 0.2
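
The scheduler hyperparameters describe a step decay: the learning rate is multiplied by `gamma` once every `step` scheduler steps. A minimal sketch of that rule in plain Python follows. The helper `step_lr` is hypothetical, and which training epoch corresponds to a given number of scheduler steps can depend on the PyTorch version, so the function takes the step count directly.

```python
def step_lr(initial_lr, step_size, gamma, steps_taken):
    """Learning rate after `steps_taken` steps of a step-decay schedule."""
    return initial_lr * gamma ** (steps_taken // step_size)

# Values from the hyperparameter cell: lr = 0.001, step = 12, gamma = 0.2
for steps_taken in (0, 11, 12, 14):
    print(steps_taken, step_lr(0.001, 12, 0.2, steps_taken))
```

With these values the rate stays at 0.001 for the first 12 scheduler steps and then drops to about 0.0002 for the rest of the 15 epochs.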

### Train the Model

In [0]:
# Build a model
model = TvScriptGenerator(vocab_size, embedding_dim, hidden_dim, num_layers,
                      dropout)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr = lr, 
                       weight_decay = weight_decay)
scheduler = optim.lr_scheduler.StepLR(optimizer, step, gamma)

# Move model to GPU
if train_on_gpu:
    model.cuda()

# Load and batch data
loaders = batch_data(text_int, sequence_length, batch_size)

# Train
train(model, optimizer, criterion, scheduler, num_epochs, loaders, batch_size)

Epoch 1/15, Training loss: 4.389, Validation loss: 4.021, Time: 320 s
Epoch 2/15, Training loss: 3.989, Validation loss: 3.904, Time: 320 s
Epoch 3/15, Training loss: 3.882, Validation loss: 3.842, Time: 320 s
Epoch 4/15, Training loss: 3.821, Validation loss: 3.806, Time: 320 s
Epoch 5/15, Training loss: 3.775, Validation loss: 3.775, Time: 320 s
Epoch 6/15, Training loss: 3.739, Validation loss: 3.753, Time: 320 s
Epoch 7/15, Training loss: 3.706, Validation loss: 3.737, Time: 320 s
Epoch 8/15, Training loss: 3.678, Validation loss: 3.722, Time: 321 s
Epoch 9/15, Training loss: 3.656, Validation loss: 3.711, Time: 320 s
Epoch 10/15, Training loss: 3.636, Validation loss: 3.701, Time: 319 s
Epoch 11/15, Training loss: 3.619, Validation loss: 3.690, Time: 322 s
Epoch 12/15, Training loss: 3.602, Validation loss: 3.688, Time: 322 s
Epoch 13/15, Training loss: 3.587, Validation loss: 3.680, Time: 322 s
Epoch 14/15, Training loss: 3.439, Validation loss: 3.629, Time: 322 s
Epoch 15/15, Tr

In [0]:
# Load the best model parameters
load_parameters(model, 'checkpoint.pth');

## Generate TV Scripts
Now we can use the trained model to generate **new Seinfeld scripts**. 

The words of a new script are generated **one by one** until the script reaches a given length. The current **sequence of words** is fed to the model to predict the next word, which is **randomly chosen** from the `topk` most likely words. At the start, the sequence contains only the prime word and is left-padded with zeros (index 0 is reserved for padding).

In [0]:
def generate_script(model, prime_id, length, topk = 5):
    """
    Generates a new TV script
    
    :param model: model used for generating words
    :param prime_id: prime word index
    :param length: length of the generated script
    :param topk: number of most likely words to choose the next word from
    :return: generated TV script
    """
    
    model.eval()
    
    with torch.no_grad():
        # Sequence with a starting word, padded with zeros
        current_sequence = np.full((1, sequence_length), 0)
        current_sequence[0, -1] = prime_id
        
        # List for generated indices of words
        script_ids = [prime_id]

        for _ in range(1, length):
            # Convert to torch tensor and move to GPU
            tensor = torch.tensor(current_sequence)
            if train_on_gpu:
                tensor = tensor.cuda()

            # Run the model to get probabilities of the next word
            output, _ = model(tensor, None)
            ps = F.softmax(output, dim = 0)
            top_ps, top_word_ids = ps.topk(topk)
            
            # Convert to numpy and move back to CPU
            top_word_ids = top_word_ids.cpu().numpy()
            top_ps = top_ps.cpu().numpy()
            
            # Randomly choose the next word out of top k words
            next_word_id = np.random.choice(top_word_ids, p = top_ps/top_ps.sum())

            # Update the sequence and list with the new word
            current_sequence = np.roll(current_sequence, -1)
            current_sequence[0, -1] = next_word_id
            script_ids.append(next_word_id)

        # Join generated word list into a string
        script = ' '.join([int_to_vocab[word_id] for word_id in script_ids])
        
        # Replace the special punctuation words for actual punctuation
        punctuation = punctuation_lookup()
        for sign, token in punctuation.items():
            script = script.replace(' ' + token.lower(), sign)
        script = script.replace('( ', ' (')
        script = script.replace('\n ','\n')
            
    return script
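
The sampling step can also be looked at in isolation. The sketch below reproduces top-k sampling with the standard library only; the helper `sample_top_k` and the five made-up scores are assumptions for illustration and do not come from the trained model.

```python
import math
import random

def sample_top_k(logits, k, rng=random):
    """Samples an index from the k highest-scoring entries of `logits`."""
    # Softmax, shifted by the maximum for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most likely indices
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # random.choices treats the weights as relative, so the top-k
    # probabilities do not need to be renormalized by hand
    return rng.choices(top_ids, weights=[probs[i] for i in top_ids], k=1)[0]

logits = [0.1, 2.0, -1.0, 1.5, 0.3]  # made-up scores for a five-word vocabulary
samples = {sample_top_k(logits, k=2) for _ in range(200)}
print(samples)  # only indices 1 and 3 can appear
```

A small `k` keeps the output close to the most likely continuation, while a larger `k` adds variety at the cost of more unlikely words.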

Let's now generate an example script starting with `'jerry:'`. As you can see, the script is **not perfect** and sometimes doesn't make sense. On the other hand, it's not too bad: the model **learnt the structure** of a script pretty well, along with some grammatical rules.

In [19]:
script = generate_script(model, vocab_to_int['jerry:'], 400, 10)
print(script)

jerry: the police? the police have been in there?

kramer: yeah, yeah. i mean, the only thing we got in there for a while, they have to wait.

george: well, i guess i could say something.... (she leaves)

elaine: so she said he was gonna do something like this...

jerry: yeah, yeah... (george takes it to his chest)

george: what?

jerry: you know, you can't get out of here!

elaine: oh, i just said it was a little nervous but it's not you. you know, i got a problem.

george: what do you want to do this guy in the pool?

jerry: what?

elaine: what did i do?

jerry: well, i'm sure we don't have a baby.

elaine: (pause) you know... you can have a little more coffee and a lot of pressure... (points to the table)

[setting: dealership car]

george: oh, the bubble boy.

hoyt: george?

peterman: oh, i'm sorry, i'm sure you're going back and i'm gonna see my doctor.

jerry: (to george) hey jerry, listen, you know, i'm not really sure that it's like a problem. if you could come down and get it,

In [0]:
file_name = 'generated_script1.txt'
with open(file_name, 'w') as file:
    file.write(script)