# TV Script Generation

The purpose of this project is to generate my own [Seinfeld](https://en.wikipedia.org/wiki/Seinfeld) TV script using an RNN. The network generates a new, "fake" TV script based on patterns it recognizes in the training data.

**Data source:**
- The network will be trained with part of the [Seinfeld dataset](https://www.kaggle.com/thec03u5/seinfeld-chronicles#scripts.csv) of scripts from 9 seasons.  

**Changes to project**
- 2019-12-02: Start notebook
- 2019-12-04: Finish project

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-libraries-and-load-data" data-toc-modified-id="Import-libraries-and-load-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import libraries and load data</a></span></li><li><span><a href="#Explore-data" data-toc-modified-id="Explore-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Explore data</a></span></li><li><span><a href="#Implement-Pre-processing" data-toc-modified-id="Implement-Pre-processing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Implement Pre-processing</a></span><ul class="toc-item"><li><span><a href="#Create-Lookup-Table" data-toc-modified-id="Create-Lookup-Table-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Create Lookup Table</a></span></li><li><span><a href="#Tokenize-Punctuation" data-toc-modified-id="Tokenize-Punctuation-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Tokenize Punctuation</a></span></li><li><span><a href="#Pre-process-and-save-data-(Checkpoint)" data-toc-modified-id="Pre-process-and-save-data-(Checkpoint)-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Pre-process and save data (Checkpoint)</a></span></li></ul></li><li><span><a href="#Build-the-Neural-Network" data-toc-modified-id="Build-the-Neural-Network-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Build the Neural Network</a></span><ul class="toc-item"><li><span><a href="#Prepare-input-tensors-/-batches" data-toc-modified-id="Prepare-input-tensors-/-batches-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Prepare input tensors / batches</a></span></li><li><span><a href="#Define-the-Neural-Network" data-toc-modified-id="Define-the-Neural-Network-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Define the Neural Network</a></span></li><li><span><a href="#Define-forward-and-backpropagation-functions" data-toc-modified-id="Define-forward-and-backpropagation-functions-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Define forward and backpropagation 
functions</a></span></li></ul></li><li><span><a href="#Neural-Network-Training" data-toc-modified-id="Neural-Network-Training-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Neural Network Training</a></span><ul class="toc-item"><li><span><a href="#Define-Train-Loop" data-toc-modified-id="Define-Train-Loop-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Define Train Loop</a></span></li><li><span><a href="#Set-Hyperparameters" data-toc-modified-id="Set-Hyperparameters-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Set Hyperparameters</a></span></li><li><span><a href="#Train" data-toc-modified-id="Train-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Train</a></span></li><li><span><a href="#Question:-How-did-you-decide-on-your-model-hyperparameters?" data-toc-modified-id="Question:-How-did-you-decide-on-your-model-hyperparameters?-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Question: How did you decide on your model hyperparameters?</a></span></li><li><span><a href="#Checkpoint,-re-load-trained-model" data-toc-modified-id="Checkpoint,-re-load-trained-model-5.5"><span class="toc-item-num">5.5&nbsp;&nbsp;</span>Checkpoint, re-load trained model</a></span></li></ul></li><li><span><a href="#Generate-new--TV-Script" data-toc-modified-id="Generate-new--TV-Script-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Generate new  TV Script</a></span><ul class="toc-item"><li><span><a href="#Save-your-favorite-scripts" data-toc-modified-id="Save-your-favorite-scripts-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Save your favorite scripts</a></span></li></ul></li></ul></div>

## Import libraries and load data

In [1]:
from collections import Counter
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

import helper
import problem_unittests as tests

In [2]:
# load in data
data_dir = './data/Seinfeld_Scripts.txt'
text = helper.load_data(data_dir)

## Explore data
You can play around with `view_line_range` to view different parts of the data.

In [3]:
view_line_range = (0, 10)

print('Dataset Stats\n')
print('Approx. number of unique words: {}'.format(len(set(text.split()))))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print('\nThe lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats

Approx. number of unique words: 46367
Number of lines: 109233
Average number of words in each line: 5.544240293684143

The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! you 

---
## Implement Pre-processing

1. Lookup Table
2. Tokenize Punctuation

### Create Lookup Table
Create two dictionaries:
- A dictionary that maps each word to an id, which we'll call `vocab_to_int`
- A dictionary that maps each id back to its word, which we'll call `int_to_vocab`

Return these dictionaries in the following **tuple** `(vocab_to_int, int_to_vocab)`

In [4]:
def create_lookup_tables(text):
    """ Create two lookup tables for vocabulary and return them as a tuple.
    
    Arguments:
    ----------
    - param text: The text of tv scripts split into words
    
    Returns:
    --------
    - A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    word_counter = Counter()
    for word in text:
        word_counter[word] += 1
    word_list_sorted = sorted(word_counter, key=word_counter.get, reverse=True)
    vocab_to_int = {word : pos for pos, word in enumerate(word_list_sorted)}
    int_to_vocab = {pos : word for pos, word in enumerate(word_list_sorted)}
    
    # return tuple
    return (vocab_to_int, int_to_vocab)

In [5]:
# Run tests
tests.test_create_lookup_tables(create_lookup_tables)

Tests Passed
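
To make the two tables concrete, here is a minimal standalone sketch on a toy word list. The helper name `build_tables` and the sample words are assumptions for illustration, not part of the notebook; the sketch follows the same idea of ordering words by frequency so that the most common word gets id 0.

```python
from collections import Counter

def build_tables(words):
    """Hypothetical helper: map words to ids by descending frequency."""
    counts = Counter(words)
    # sorted() is stable, so words with equal counts keep first-seen order
    ordered = sorted(counts, key=counts.get, reverse=True)
    vocab_to_int = {word: idx for idx, word in enumerate(ordered)}
    int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

toy_words = ['the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'bird']
vocab_to_int_toy, int_to_vocab_toy = build_tables(toy_words)

# Encode the words as ids and decode them again (a round trip)
encoded = [vocab_to_int_toy[w] for w in toy_words]
decoded = [int_to_vocab_toy[i] for i in encoded]
print(vocab_to_int_toy)
print(encoded)
```

The round trip through both dictionaries returns the original word list, which is the property the generation step later relies on.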


### Tokenize Punctuation
We'll be splitting the script into a word array using spaces as delimiters. However, punctuation such as periods and exclamation marks can create multiple ids for the same word. For example, "bye" and "bye!" would generate two different word ids.

Implement the function `token_lookup` to return a dict that will be used to tokenize symbols like "!" into "||Exclamation_Mark||".  Create a dictionary for the following symbols where the symbol is the key and value is the token:
- Period ( **.** )
- Comma ( **,** )
- Quotation Mark ( **"** )
- Semicolon ( **;** )
- Exclamation mark ( **!** )
- Question mark ( **?** )
- Left Parenthesis ( **(** )
- Right Parenthesis ( **)** )
- Dash ( **-** )
- Return ( **\n** )

This dictionary will be used to tokenize the symbols and add the delimiter (space) around each of them. This separates each symbol into its own word, making it easier for the neural network to predict the next word. Make sure you don't use a value that could be confused with a word; for example, instead of using the value "dash", try using something like "||dash||".
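
As a rough illustration of how such a dictionary might be applied, the sketch below replaces each symbol with its token surrounded by spaces and then splits on whitespace. The function name `tokenize` and the toy dictionary are assumptions; the actual preprocessing lives in `helper` and may differ in detail.

```python
def tokenize(text, token_dict):
    """Hypothetical helper: isolate punctuation as separate tokens."""
    for symbol, token in token_dict.items():
        # Surround the token with spaces so it becomes its own "word"
        text = text.replace(symbol, ' {} '.format(token))
    return text.lower().split()

toy_tokens = {
    '!': '||exclamation_mark||',
    ',': '||comma||',
    '\n': '||return||',
}

words = tokenize("Bye, bye!\nHello", toy_tokens)
print(words)
```

Both occurrences of "bye" now map to the same word, while the comma, exclamation mark and line break each become separate entries in the word list.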

In [6]:
def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenized dictionary where the key is the punctuation and the value is the token
    """
    
    # ':' is left untokenized so that speaker tags such as 'jerry:' stay single words
    lookup = {'.': '<PERIOD>',
              ',': '<COMMA>',
              '"': '<QUOTATION_MARK>',
              ';': '<SEMICOLON>',
              '!': '<EXCLAMATION_MARK>',
              '?': '<QUESTION_MARK>',
              '(': '<LEFT_PAREN>',
              ')': '<RIGHT_PAREN>',
              '-': '<DASH>',
              '\n': '<NEW_LINE>',
             }
        
    return lookup

In [7]:
# Run tests
tests.test_tokenize(token_lookup)

Tests Passed


### Pre-process and save data (Checkpoint)

In [8]:
## Pre-process and save training data
# helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

In [9]:
# Reload pre-processed data
int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

## Build the Neural Network

In [10]:
# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

### Prepare input tensors / batches
- Use [TensorDataset](http://pytorch.org/docs/master/data.html#torch.utils.data.TensorDataset) to provide a known format to the dataset
- Use [DataLoader](http://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader) to handle batching, shuffling, and other dataset iteration functions.

You can create data with TensorDataset by passing in feature and target tensors. Then create a DataLoader as usual.
```
data = TensorDataset(feature_tensors, target_tensors)
data_loader = torch.utils.data.DataLoader(data, 
                                          batch_size=batch_size)
```

**Batching**
Implement the `batch_data` function to batch `words` data into chunks of size `batch_size` using the `TensorDataset` and `DataLoader` classes.

>You can batch words using the DataLoader, but it will be up to you to create `feature_tensors` and `target_tensors` of the correct size and content for a given `sequence_length`.

For example, say we have these as input:
```
words = [1, 2, 3, 4, 5, 6, 7]
sequence_length = 4
```

Your first `feature_tensor` should contain the values:
```
[1, 2, 3, 4]
```
And the corresponding `target_tensor` should just be the next "word"/tokenized word value:
```
5
```
This should continue with the second `feature_tensor`, `target_tensor` being:
```
[2, 3, 4, 5]  # features
6             # target
```
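
The windowing described above can be sketched in plain Python before bringing tensors into it. The function name `sliding_windows` is an assumption for illustration; it pairs every window of `sequence_length` words with the word that follows it.

```python
def sliding_windows(words, sequence_length):
    """Hypothetical helper: build (features, targets) with a stride of 1."""
    # Every start position that still has a following word yields one pair
    n_sequences = len(words) - sequence_length
    features = [list(words[i:i + sequence_length]) for i in range(n_sequences)]
    targets = [words[i + sequence_length] for i in range(n_sequences)]
    return features, targets

toy_features, toy_targets = sliding_windows([1, 2, 3, 4, 5, 6, 7], 4)
for x, y in zip(toy_features, toy_targets):
    print(x, '->', y)
```

For seven words and a sequence length of four this yields three pairs, the last one being `[3, 4, 5, 6]` with target `7`.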

In [11]:
def batch_data(words, sequence_length, batch_size):
    """Batch the neural network data using PyTorch's DataLoader class.
    Each batch is a 2D array with size batch_size x sequence_length.
    
    Arguments:
    ----------
    words: list, the word ids of the TV scripts
    sequence_length: int, sequence length of each batch
    batch_size: int, the number of sequences in a batch
    
    Returns:
    --------
    DataLoader with batched data
    """
    
    # One sequence per start position that still has a next word to use as its label
    n_sequences = len(words) - sequence_length
    
    features = np.zeros(shape=(n_sequences, sequence_length), dtype=int)
    labels = np.zeros(n_sequences, dtype=int)
    
    for n in range(0, n_sequences):
        x = np.array(words[n: n + sequence_length])
        y = np.array(words[n + sequence_length])
        features[n, :] = x
        labels[n] = y
        
    data = TensorDataset(torch.from_numpy(features), torch.from_numpy(labels))    
    data_loader = torch.utils.data.DataLoader(data,
                                              shuffle=True,
                                              batch_size=batch_size,    
                                             )    
        
    return data_loader

In [12]:
# Test dataloader

test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[  3,   4,   5,   6,   7],
        [ 16,  17,  18,  19,  20],
        [ 30,  31,  32,  33,  34],
        [  0,   1,   2,   3,   4],
        [  4,   5,   6,   7,   8],
        [ 32,  33,  34,  35,  36],
        [ 26,  27,  28,  29,  30],
        [ 42,  43,  44,  45,  46],
        [ 21,  22,  23,  24,  25],
        [  1,   2,   3,   4,   5]])

torch.Size([10])
tensor([  8,  21,  35,   5,   9,  37,  31,  47,  26,   6])


In [13]:
t_loader = batch_data(int_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[ 12238,   1530,     20,  12239,  12240],
        [   103,     25,    289,      1,  17276],
        [     3,     11,    349,      6,   3276],
        [     5,     42,    667,      1,   1092],
        [     2,     17,      5,    292,      6],
        [     0,     14,     15,     41,   1402],
        [    33,   2515,   1070,     21,     52],
        [     8,      1,   2116,   5978,    181],
        [     2,      5,     27,     19,      1],
        [ 15658,     17,   8220,     15,      2]])

torch.Size([10])
tensor([   1,   87,  113,    2,  680,   47,  516,   70,   17,   17])


### Define the Neural Network

**Note:** The output of this model should be the *last* batch of word scores after a complete sequence has been processed. That is, for each input sequence of words, we only want to output the word scores for a single, most likely, next word.
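
Selecting the last time step can be illustrated without PyTorch. The sketch below uses nested Python lists as a stand-in for a score tensor of shape `(batch_size, seq_length, output_size)`; the numbers are made up for illustration.

```python
# Toy scores with shape (batch_size=2, seq_length=3, output_size=4)
scores = [
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.2, 0.2], [0.0, 0.9, 0.05, 0.05]],
    [[0.3, 0.3, 0.2, 0.2], [0.1, 0.1, 0.1, 0.7], [0.6, 0.2, 0.1, 0.1]],
]

# Keep only the scores at the final time step of every sequence,
# the list equivalent of indexing a tensor with [:, -1]
last_step = [sequence[-1] for sequence in scores]

# Highest-scoring word id for each sequence in the batch
best_ids = [max(range(len(row)), key=row.__getitem__) for row in last_step]
print(last_step)
print(best_ids)
```

The result has one row of word scores per input sequence, which is the shape a cross-entropy loss expects when paired with one target word per sequence.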

In [14]:
class RNN(nn.Module):
    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):
        """ Initialize the PyTorch RNN Module.
        
        Arguments:
        ----------
        - vocab_size: int, number of input dimensions (the size of the vocabulary)
        - output_size: int, the number of output dimensions of the neural network
        - embedding_dim: int, size of embeddings
        - hidden_dim: int, size of the hidden layer outputs
        - n_layers: int, number of stacked LSTM layers
        - dropout: float in [0, 1], dropout probability applied between LSTM layers
        
        Returns:
        --------
        - None
        """
        
        super(RNN, self).__init__()
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.output_size = output_size            
        
        # Define model layers
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,
                            num_layers=n_layers, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, output_size)
        
        
    def forward(self, x, hidden):
        """ Forward propagation of the neural network.
        
        Arguments:
        ----------
        - x: tensor, the input to the neural network
        - hidden: tensor, the hidden state        
        
        Returns:
        --------
        - Two Tensors, the output of the neural network and the latest hidden state
        """
        
        batch_size = x.shape[0]
        embeddings = self.embed(x)
        self.lstm.flatten_parameters()
        lstm_out, hidden = self.lstm(embeddings, hidden)
        lstm_out_stacked = lstm_out.contiguous().view(-1, self.hidden_dim)
        fc_out = self.fc(lstm_out_stacked)
        
        # To get last batch of word scores
        # reshape into (batch_size, seq_length, output_size)
        output = fc_out.view(batch_size, -1, self.output_size)
        out = output[:, -1]  # word scores at the last time step

        # return one batch of output word scores and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        """Initialize the hidden state of an LSTM: create two new tensors of
        size n_layers x batch_size x hidden_dim, initialized to zero, for the
        hidden state and the cell state of the LSTM.
        
        Arguments:
        ----------
        - batch_size: int, the batch_size of the hidden state
        
        Returns:
        --------
        - hidden: tuple of two tensors (hidden state, cell state), each of
          dims (n_layers, batch_size, hidden_dim)
        """
        
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

In [15]:
# Run tests
tests.test_rnn(RNN, train_on_gpu)

Tests Passed


### Define forward and backpropagation functions

This function will be called, iteratively, in the training loop as follows:
```
loss, hidden = forward_back_prop(decoder, decoder_optimizer, criterion, inp, target, hidden)
```

And it should return the average loss over a batch and the hidden state returned by a call to `RNN(inp, hidden)`. Recall that you can get this loss by computing it, as usual, and calling `loss.item()`.

In [16]:
def forward_back_prop(decoder, decoder_optimizer, criterion, inputs, target, hidden):
    """ Forward and backward propagation on the neural network.
    
    Arguments:
    ----------
    - decoder: the PyTorch Module that holds the neural network
    - decoder_optimizer: the PyTorch optimizer for the neural network
    - criterion: the PyTorch loss function
    - inputs: a batch of input to the neural network
    - target: the target output for the batch of input
    - hidden: tuple of tensors, the hidden and cell state from the previous batch
    
    Returns:
    --------
    - The loss and the latest hidden state Tensor
    """
    
    # Move data to GPU, if available
    if(train_on_gpu):
        decoder = decoder.cuda()
        inputs, target = inputs.cuda(), target.cuda()
        
    # Create new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in hidden])

    # Zero the accumulated gradients
    decoder_optimizer.zero_grad()
    # Get the output from the model
    output, h = decoder(inputs, h)
    # Calculate the loss and perform backpropagation
    loss = criterion(output, target)
    loss.backward()
    
    # Clip exploding gradients 
    nn.utils.clip_grad_norm_(decoder.parameters(), 5)
    decoder_optimizer.step()
    
    # Return the loss over a batch and the hidden state
    return loss.item(), h

In [17]:
# Run tests (not completely extensive)
tests.test_forward_back_prop(RNN, forward_back_prop, train_on_gpu)

Tests Passed


## Neural Network Training

### Define Train Loop

In [18]:
def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
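    # NOTE: the epoch counter starts at 10, so only epochs 10..n_epochs run;
    # use range(1, n_epochs + 1) to train for the full n_epochs from scratch
    for epoch_i in range(10, n_epochs + 1):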
        
        # Initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            # Make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            # Perform forward & back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # Record loss
            batch_losses.append(loss)

            # Print loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []

    # Return trained rnn
    return rnn

### Set Hyperparameters

Set and train the neural network with the following parameters:
- Set `sequence_length` to the length of a sequence.
- Set `batch_size` to the batch size.
- Set `num_epochs` to the number of epochs to train for.
- Set `learning_rate` to the learning rate for an Adam optimizer.
- Set `vocab_size` to the number of unique tokens in our vocabulary.
- Set `output_size` to the desired size of the output.
- Set `embedding_dim` to the embedding dimension; smaller than the vocab_size.
- Set `hidden_dim` to the hidden dimension of your RNN.
- Set `n_layers` to the number of layers/cells in your RNN.
- Set `show_every_n_batches` to the number of batches at which the neural network should print progress.

If the network isn't getting the desired results, tweak these parameters and/or the layers in the `RNN` class.
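
To get a feel for how these settings translate into model size, the sketch below counts trainable parameters for an embedding, a stacked LSTM and a linear output layer. The function name `count_parameters` and the vocabulary size of 20,000 are assumptions for illustration only (the real `vocab_size` comes from `len(vocab_to_int)`); the LSTM formula assumes PyTorch's layout of four gates, each with input weights, recurrent weights and two bias vectors.

```python
def count_parameters(vocab_size, output_size, embedding_dim, hidden_dim, n_layers):
    """Hypothetical helper: parameter counts for embedding + LSTM + linear."""
    embedding = vocab_size * embedding_dim
    lstm = 0
    for layer in range(n_layers):
        # The first layer reads embeddings, deeper layers read hidden states
        input_dim = embedding_dim if layer == 0 else hidden_dim
        # 4 gates x (input weights + recurrent weights + 2 biases)
        lstm += 4 * hidden_dim * (input_dim + hidden_dim + 2)
    linear = hidden_dim * output_size + output_size
    return {'embedding': embedding, 'lstm': lstm, 'linear': linear,
            'total': embedding + lstm + linear}

# Assumed vocabulary size; the other values mirror the settings in this section
sizes = count_parameters(vocab_size=20000, output_size=20000,
                         embedding_dim=200, hidden_dim=256, n_layers=2)
for name, value in sizes.items():
    print('{:>10}: {:,}'.format(name, value))
```

Under these assumptions the embedding and output layers dominate the parameter count, so the vocabulary size tends to matter more for model size than `hidden_dim` or `n_layers`.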

In [19]:
# Set data parameters

# Sequence Length
sequence_length = 10 # of words in a sequence
# Batch Size
batch_size = 128

# Create data loader
train_loader = batch_data(int_text, sequence_length, batch_size)

In [20]:
# Set training parameters

# Number of Epochs
num_epochs = 11
# Learning Rate
learning_rate = 0.001 

# Model parameters

# Vocab size
vocab_size = len(vocab_to_int)  
# Output size
output_size = vocab_size
# Embedding Dimension
embedding_dim = 200 
# Hidden Dimension
hidden_dim = 256
# Number of RNN Layers
n_layers = 2

# Show stats for every n number of batches
show_every_n_batches = 1000

### Train

> **You should aim for a loss less than 3.5.** You should also experiment with different sequence lengths, which determine the size of the long range dependencies that a model can learn.

In [35]:
# Caution: do not run this cell accidentally (it retrains the model and overwrites the saved checkpoint)

# Create model and move to gpu if available
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
if train_on_gpu:
    rnn.cuda()

# Define loss and optimization functions for training
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# Train model (the optimizer above is bound to the parameters of `rnn`)
trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

# Save trained model
helper.save_model('./save/trained_rnn', trained_rnn)
print('Model Trained and Saved')

Training for 11 epoch(s)...
Epoch:   10/11    Loss: 3.225506061553955

Epoch:   10/11    Loss: 3.222130899429321

Epoch:   10/11    Loss: 3.227776923418045

Epoch:   10/11    Loss: 3.2282236764431

Epoch:   10/11    Loss: 3.2316513385772705

Epoch:   10/11    Loss: 3.2096594355106354

Epoch:   11/11    Loss: 3.2157082357886115

Epoch:   11/11    Loss: 3.224860273599625

Epoch:   11/11    Loss: 3.2180999081134796

Epoch:   11/11    Loss: 3.227487102270126

Epoch:   11/11    Loss: 3.227510671377182

Epoch:   11/11    Loss: 3.223778514623642

Model Trained and Saved



### Question: How did you decide on your model hyperparameters? 
For example, did you try different sequence_lengths and find that one size made the model converge faster? What about your hidden_dim and n_layers; how did you decide on those?

**Answer:**


Batch Size = 128
- Tried values from 64 to 256 too, but 128 showed the best performance.

Number of RNN Layers = 2
- Did not try more as the results were satisfying.

Hidden Dimension = 256
- Tried 128 and 512 too, but 256 showed the best performance.

Embedding Dimension = 200
- A range between 200 and 500 seems to be common; I aimed for the lower end and it worked fine. _(Suggestion review: the vocab contains ~46,367 unique words. Try to cut this down significantly by 98% to 1000 embeddings.)_

Learning Rate = 0.001
- Standard starting point, worked fine.

Sequence Length = 10
- Did not want to go too far above the average line length of the original scripts. Higher lengths slowed training considerably. _(Suggestion review: Choose a sequence length of 50 to give the network a context of approximately 10 script lines on average to consider during training.)_

Epochs = 11
- Enough to get a loss of <3.5. Longer training could have decreased it further, but the curve got quite flat in the end.
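
A loss value can also be read as a perplexity: `nn.CrossEntropyLoss` reports the mean negative log-likelihood using the natural logarithm, so exponentiating the loss gives the effective number of equally likely words the model is choosing between. The helper name `perplexity` below is an assumption for illustration.

```python
import math

def perplexity(cross_entropy_loss):
    """Hypothetical helper: convert a mean cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)

# The target loss from the Train section and a value close to the logged losses
for loss in (3.5, 3.22):
    print('loss {:.2f} -> perplexity {:.1f}'.format(loss, perplexity(loss)))
```

A loss of 3.5 corresponds to a perplexity of roughly 33, and a loss around 3.22 to roughly 25, so a small drop in loss is a noticeable reduction in the model's uncertainty about the next word.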

### Checkpoint, re-load trained model

In [21]:
# Load model and data
_, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()
trained_rnn = helper.load_model('./save/trained_rnn')

## Generate new  TV Script
With the network trained and saved, we can use it to generate a new, "fake" Seinfeld TV script in this section.

**Generate Text:** To generate the text, the network needs to start with a single word and repeat its predictions until it reaches a set length. Also note that the function below uses top-k sampling to introduce some randomness in choosing the most likely next word, given an output set of word scores!
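
The generation loop can be sketched in plain Python: start from a padded window ending in the prime word, sample the next id from the top-k scores, then slide the window by one. Everything here is an illustrative assumption: `toy_scores` is a made-up stand-in for the trained network, and the function names are hypothetical.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(scores, k, rng):
    """Sample one index among the k highest-probability entries."""
    probs = softmax(scores)
    top_ids = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    # random.choices treats the weights as relative, so no renormalizing is needed
    return rng.choices(top_ids, weights=[probs[i] for i in top_ids], k=1)[0]

def toy_scores(window, vocab_size):
    """Hypothetical stand-in for the trained network: prefers last id + 1."""
    favorite = (window[-1] + 1) % vocab_size
    return [3.0 if i == favorite else 0.0 for i in range(vocab_size)]

def generate_ids(prime_id, pad_id, seq_len, predict_len, vocab_size, k, rng):
    # Left-pad so that the prime word sits in the last position of the window
    window = [pad_id] * (seq_len - 1) + [prime_id]
    generated = [prime_id]
    for _ in range(predict_len):
        next_id = sample_top_k(toy_scores(window, vocab_size), k, rng)
        generated.append(next_id)
        # Drop the oldest id and append the new one (a rolling window)
        window = window[1:] + [next_id]
    return generated

rng = random.Random(0)
print(generate_ids(prime_id=2, pad_id=0, seq_len=5, predict_len=10,
                   vocab_size=8, k=3, rng=rng))
```

Restricting the draw to the top k candidates keeps unlikely words out of the script while still letting repeated runs produce different text.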

In [22]:
def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    """Generate text using a trained neural network.
    
    Parameters:
    -----------
    - rnn: The PyTorch Module that holds the trained neural network
    - prime_id: The word id to start the first prediction
    - int_to_vocab: Dict of word id keys to word values
    - token_dict: Dict of punctuation keys to token values
    - pad_value: The value used to pad a sequence
    - predict_len: The length of text to generate
    
    Returns: 
    --------
    The generated text
    """
    rnn.eval()
    
    # Create a sequence (batch_size=1) with the prime_id
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        
        # Initialize the hidden state
        hidden = rnn.init_hidden(current_seq.size(0))
        
        # Get the output of the rnn
        output, _ = rnn(current_seq, hidden)
        
        # Get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # Use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        
        # Select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        
        # Retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)     
        
        # The generated word becomes the next "current sequence" and the cycle can continue
        # Move the tensor back to a NumPy array on the CPU before rolling it
        current_seq = np.roll(current_seq.cpu().numpy(), -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens
    for key, token in token_dict.items():
        ending = ' ' if key in ['\n', '(', '"'] else ''
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # Return all the sentences
    return gen_sentences

In [23]:
# Run the cell multiple times to get different results
gen_length = 400 # modify the length to your preference
prime_word = 'jerry' # name for starting the script

pad_word = helper.SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)

jerry:

jerry: i think it's all that.

jerry: i don't think you can take a ride.

elaine: i don't know.

george: well, i don't think so. i mean, i have no control on this.

jerry: oh, you got that fedex?!

jerry: i don't understand, but i don't want it to get out of here.

jerry: i think i can...(she turns around and puts his head down)

kramer:(to jerry) i don't want you to take a ride.

elaine: what is that?

kramer: yeah, yeah.

george:(to jerry) oh, yeah...(to kramer) you got that straight in the bathroom?(elaine enters)

elaine: hey, hey, hey! i got news with him!

george: what is that?

jerry: no, i didn't say anything.

jerry: oh yeah. i know what i'm gonna do, i don't have to be there.

jerry: oh, come on, let's see, what do you need?

george: no.

kramer: hey, jerry, you are gonna do that.

jerry: i don't know. i mean, you know, the guy who plays object on.

george: oh, i think i can go.

jerry: well...

kramer:(to elaine) i don't know what you want.

elaine: oh, yeah.

jerry:

### Save your favorite scripts

In [30]:
# save script to a text file
with open("generated_script_1.txt", "w") as f:
    f.write(generated_script)

---