# TV Script Generation

In this project, we'll generate our own [Seinfeld](https://en.wikipedia.org/wiki/Seinfeld) TV scripts using RNNs. We'll be using part of the [Seinfeld dataset](https://www.kaggle.com/thec03u5/seinfeld-chronicles#scripts.csv) of scripts from 9 seasons. The neural network we build will generate a new, "fake" TV script based on patterns it recognizes in this training data.

## Get the Data

The data is already provided in `./data/Seinfeld_Scripts.txt`.

In [1]:
# load in data
import helper
data_dir = './data/Seinfeld_Scripts.txt'
text = helper.load_data(data_dir)

## Explore the Data


In [2]:
view_line_range = (0, 10)

import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 46367
Number of lines: 109233
Average number of words in each line: 5.544240293684143

The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! y

---
## Implement Pre-processing Functions
The first step with any dataset is pre-processing. Implement the following pre-processing functions:
- Lookup Table
- Tokenize Punctuation

### Lookup Table


In [3]:
import problem_unittests as tests
from collections import Counter

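def create_lookup_tables(text):
    """Create lookup tables for the vocabulary.

    :param text: the script text, split into words
    :return: a tuple of dicts (vocab_to_int, int_to_vocab)
    """
    # order words from most to least frequent, so frequent words get small ids
    words = Counter(text)
    vocab = sorted(words, key = words.get, reverse = True)
    vocab_to_int = {word : ii for ii, word in enumerate(vocab)}
    int_to_vocab = {ii : word for ii, word in enumerate(vocab)}
    
    return (vocab_to_int, int_to_vocab)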



tests.test_create_lookup_tables(create_lookup_tables)

Tests Passed
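
To see what the two dictionaries look like, here is a small sketch on a toy word list. The sample words are made up for illustration, and the function body repeats the logic of the cell above only so that the snippet runs on its own.

```python
from collections import Counter

def create_lookup_tables(text):
    # same idea as the cell above: the most frequent word gets id 0
    words = Counter(text)
    vocab = sorted(words, key=words.get, reverse=True)
    vocab_to_int = {word: ii for ii, word in enumerate(vocab)}
    int_to_vocab = {ii: word for word, ii in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

# hypothetical toy input (not taken from the Seinfeld data)
toy_words = ['jerry', 'hello', 'jerry', 'newman', 'jerry', 'hello']
vocab_to_int, int_to_vocab = create_lookup_tables(toy_words)

# encode the words as ids, then decode the ids back into words
encoded = [vocab_to_int[w] for w in toy_words]
decoded = [int_to_vocab[i] for i in encoded]
print(encoded)               # [0, 1, 0, 2, 0, 1]
print(decoded == toy_words)  # True
```

The two tables are inverses of each other, which is what lets the generation step later turn predicted ids back into words.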


### Tokenize Punctuation


In [4]:
def token_lookup():
    """Return a dict that maps each punctuation symbol to a word-like token."""
    punctuation = {'.': "||Period||",
                   ',': "||Comma||",
                   '"': "||Quotation_Mark||",
                   ';': "||Semicolon||",
                   '!': "||Exclamation_Mark||",
                   '?': "||Question_Mark||",
                   '(': "||Left_Parentheses||",
                   ')': "||Right_Parentheses||",
                   '-': "||Dash||",
                   '\n': "||Return||"}    
    return punctuation

tests.test_tokenize(token_lookup)

Tests Passed
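
The point of these tokens is that punctuation becomes its own "word", so that `bye` and `bye!` share one vocabulary entry. The sketch below shows one way the dictionary might be applied. The `tokenize` function is a hypothetical stand-in, since the implementation inside `helper.preprocess_and_save_data` is not shown in this notebook; the lower-casing is likewise an assumption, suggested by the `token.lower()` call used later during generation.

```python
def token_lookup():
    # a subset of the table defined above, enough for this demo
    return {'.': '||Period||',
            ',': '||Comma||',
            '!': '||Exclamation_Mark||',
            '?': '||Question_Mark||'}

def tokenize(text, token_dict):
    # hypothetical helper: surround each token with spaces so split() isolates it
    for symbol, token in token_dict.items():
        text = text.replace(symbol, ' {} '.format(token))
    return text.lower().split()

words = tokenize('Hello, Newman!', token_lookup())
print(words)  # ['hello', '||comma||', 'newman', '||exclamation_mark||']
```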


In [5]:
# pre-process training data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

In [6]:
import helper
import problem_unittests as tests

int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

## Build the Neural Network
In this section, we'll build the components of an RNN by implementing the RNN module and the forward and backpropagation functions.



### Input


In [8]:
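import torch
from torch.utils.data import TensorDataset, DataLoader


def batch_data(words, sequence_length, batch_size):
    """Batch word ids into (sequence, next word) pairs using a DataLoader."""
    # each window of `sequence_length` ids is a feature; the id right after it is the target
    feature_size = len(words) - sequence_length
    train_x = np.zeros((feature_size, sequence_length), dtype = int)
    train_y = np.zeros(feature_size)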
    
    for i in range(0, feature_size):
        train_x[i] = words[i:i+sequence_length]
        train_y[i] = words[i+sequence_length]
    

    feature_tensor = np.asarray(train_x, np.int64)
    target_tensor = np.asarray(train_y, np.int64)
    data = TensorDataset(torch.from_numpy(feature_tensor), torch.from_numpy(target_tensor))
    dataloader = DataLoader(data, batch_size = batch_size, shuffle = True)
    
    return dataloader


words = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
length = 3
loader = batch_data(np.array(words), length, 3)

dataiter = iter(loader)
next(dataiter)

[tensor([[1, 2, 3],
         [2, 3, 4],
         [7, 8, 9]]),
 tensor([ 4,  5, 10])]
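
Because the loader shuffles, the batch above hides how features and targets line up. The following plain-Python sketch lists the windows in order, without a `DataLoader`; `make_windows` is a hypothetical name used only for this illustration.

```python
def make_windows(words, sequence_length):
    # pair every run of `sequence_length` ids with the id that follows it
    features, targets = [], []
    for i in range(len(words) - sequence_length):
        features.append(words[i:i + sequence_length])
        targets.append(words[i + sequence_length])
    return features, targets

features, targets = make_windows([1, 2, 3, 4, 5, 6, 7], sequence_length=3)
for x, y in zip(features, targets):
    print(x, '->', y)
# [1, 2, 3] -> 4
# [2, 3, 4] -> 5
# [3, 4, 5] -> 6
# [4, 5, 6] -> 7
```

A list of 7 ids with a window of 3 yields 7 - 3 = 4 pairs, which is the `feature_size` computed in `batch_data`.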

In [9]:
# testing dataloader

test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[37, 38, 39, 40, 41],
        [44, 45, 46, 47, 48],
        [43, 44, 45, 46, 47],
        [ 8,  9, 10, 11, 12],
        [ 9, 10, 11, 12, 13],
        [ 2,  3,  4,  5,  6],
        [17, 18, 19, 20, 21],
        [31, 32, 33, 34, 35],
        [ 3,  4,  5,  6,  7],
        [42, 43, 44, 45, 46]])

torch.Size([10])
tensor([42, 49, 48, 13, 14,  7, 22, 36,  8, 47])


---
## Build the Neural Network


In [None]:
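import torch
import torch.nn as nn

# flag read by the cells below (assumed definition; the cell that sets it is not shown here)
train_on_gpu = torch.cuda.is_available()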

class RNN(nn.Module):
    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):

        super(RNN, self).__init__()

        
        # set class variables
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # define model layers
        
        # embedding and lstm layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        
        # dropout layer
        self.dropout = nn.Dropout(p=0.2)
        
        # fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)
        
    
    def forward(self, nn_input, hidden):
 
        batch_size = nn_input.size(0)
        
        # embedding and lstm out
        embeds = self.embedding(nn_input)
        lstm_out, hidden = self.lstm(embeds, hidden)
        
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        
        # reshape to batch size first
        out = out.view(batch_size, -1, self.output_size)
        out = out[:,-1]

        # return one batch of output word scores and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):

        # borrow a parameter tensor so the new hidden state shares its data type
        weight = next(self.parameters()).data
        
        # initialize hidden state with zero weights, and move to GPU if available
        if train_on_gpu:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        return hidden


tests.test_rnn(RNN, train_on_gpu)
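
The reshaping at the end of `forward` is easier to follow with concrete shapes. The sketch below uses NumPy arrays as stand-ins for the tensors, and a random matrix in place of the fully connected layer; the sizes are arbitrary values chosen for illustration, not the ones used for training.

```python
import numpy as np

batch_size, seq_len, hidden_dim, output_size = 2, 4, 3, 5

# stand-in for the LSTM output: one hidden vector per time step
lstm_out = np.arange(batch_size * seq_len * hidden_dim, dtype=float)
lstm_out = lstm_out.reshape(batch_size, seq_len, hidden_dim)

# stand-in for the fully connected layer (random weights, no bias)
rng = np.random.default_rng(0)
fc_weight = rng.normal(size=(hidden_dim, output_size))

flat = lstm_out.reshape(-1, hidden_dim)               # (batch * seq, hidden)
scores = flat @ fc_weight                             # (batch * seq, output)
scores = scores.reshape(batch_size, -1, output_size)  # (batch, seq, output)
last = scores[:, -1]                                  # (batch, output)

print(flat.shape, scores.shape, last.shape)  # (8, 3) (2, 4, 5) (2, 5)
```

Only the scores for the final time step are kept, so each input sequence produces a single row of word scores for the next word.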

### Define forward and backpropagation

Use the RNN class you implemented to apply forward and back propagation. This function will be called iteratively in the training loop as follows:
```
loss, hidden = forward_back_prop(rnn, optimizer, criterion, inp, target, hidden)
```

It should return the average loss over a batch and the hidden state returned by a call to `rnn(inp, hidden)`. Recall that you can get this loss by computing it, as usual, and calling `loss.item()`.

**If a GPU is available, you should move your data to that GPU device here.**

In [12]:
def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):

    if train_on_gpu:
        inp, target = inp.cuda(), target.cuda()
    
    # perform backpropagation and optimization
    
    # detach the hidden state from its history; otherwise
    # we'd backprop through the entire training history
    hidden = tuple([each.data for each in hidden])
    
    # zero accumulated gradients
    rnn.zero_grad()
    
    output, hidden = rnn(inp, hidden)
    loss = criterion(output, target)
    loss.backward()
    
    nn.utils.clip_grad_norm_(rnn.parameters(), 5)
    optimizer.step()
    
    # return the loss over a batch and the hidden state produced by our model
    return loss.item(), hidden


tests.test_forward_back_prop(RNN, forward_back_prop, train_on_gpu)

Tests Passed
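
The `clip_grad_norm_` call in `forward_back_prop` guards against exploding gradients, a common problem when training recurrent networks. As a rough conceptual sketch (plain Python lists standing in for gradient tensors, and not PyTorch's actual implementation), clipping by global norm rescales all gradients together when their combined L2 norm exceeds the limit. The function name `clip_by_global_norm` is hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # grads: one list of gradient values per parameter (stand-in for tensors)
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # shrink only when the norm is above the limit; never scale up
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]  # global L2 norm is 13
clipped, norm = clip_by_global_norm(grads, max_norm=5)
new_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm)      # 13.0
print(new_norm)  # just under 5
```

All values are scaled by the same factor, so the direction of the update is preserved and only its size is limited.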


## Neural Network Training

With the structure of the network complete and the data ready to be fed into the neural network, it's time to train it.



In [13]:

def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        # initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            
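            # stop before a final partial batch, which would not match
            # the batch size the hidden state was initialized with
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break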
            
            # forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []

    # returns a trained rnn
    return rnn
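
The loop in `train_rnn` stops after `len(train_loader.dataset)//batch_size` batches and prints an averaged loss every `show_every_n_batches` batches. The small sketch below works through that arithmetic; the sample count is a made-up number, and `batch_counts` is a hypothetical helper.

```python
def batch_counts(n_samples, batch_size, show_every_n_batches):
    n_batches = n_samples // batch_size             # full batches per epoch
    leftover = n_samples % batch_size               # samples in the skipped partial batch
    n_reports = n_batches // show_every_n_batches   # loss lines printed per epoch
    return n_batches, leftover, n_reports

print(batch_counts(n_samples=1000, batch_size=128, show_every_n_batches=2))
# (7, 104, 3)
```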

### Hyperparameters



In [14]:
# Data params
# Sequence Length
sequence_length = 10   # of words in a sequence
# Batch Size
batch_size = 128

# data loader 
train_loader = batch_data(int_text, sequence_length, batch_size)

In [15]:
# Training parameters
# Number of Epochs
num_epochs = 20
# Learning Rate
learning_rate = 0.001

# Model parameters
# Vocab size
vocab_size = len(int_to_vocab)
# Output size
output_size = vocab_size
# Embedding Dimension
embedding_dim = 150
# Hidden Dimension
hidden_dim = 512
# Number of RNN Layers
n_layers = 3


show_every_n_batches = 500

### Train


In [16]:

rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
if train_on_gpu:
    rnn.cuda()


optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# training the model
trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

# saving the trained model
helper.save_model('./save/trained_rnn', trained_rnn)
print('Model Trained and Saved')

Training for 20 epoch(s)...
Epoch:    1/20    Loss: 5.994940968513489

Epoch:    1/20    Loss: 5.7920553541183475

Epoch:    1/20    Loss: 5.747504928588867

Epoch:    1/20    Loss: 5.725408123970031

Epoch:    1/20    Loss: 5.7490744876861575

Epoch:    1/20    Loss: 5.76370175075531

Epoch:    1/20    Loss: 5.747068878173828

Epoch:    1/20    Loss: 5.800097204208374

Epoch:    1/20    Loss: 5.817925452232361

Epoch:    1/20    Loss: 5.840303137779236

Epoch:    1/20    Loss: 5.883670455932617

Epoch:    1/20    Loss: 5.846391112327575

Epoch:    1/20    Loss: 5.813758012771607

Epoch:    2/20    Loss: 5.840143046265908

Epoch:    2/20    Loss: 5.802691124916077

Epoch:    2/20    Loss: 5.795728998184204

Epoch:    2/20    Loss: 5.605874850273132

Epoch:    2/20    Loss: 4.980691374778748

Epoch:    2/20    Loss: 4.824830937862396

Epoch:    2/20    Loss: 4.719572492599488

Epoch:    2/20    Loss: 4.654422090530396

Epoch:    2/20    Loss: 4.584916620731354

Epoch:    2/20    Loss: 4

Epoch:   15/20    Loss: 3.4772553725242616

Epoch:   15/20    Loss: 3.4850223422050477

Epoch:   15/20    Loss: 3.4627790050506593

Epoch:   15/20    Loss: 3.4993852562904357

Epoch:   15/20    Loss: 3.485968002319336

Epoch:   15/20    Loss: 3.4910743527412413

Epoch:   16/20    Loss: 3.441590551498382

Epoch:   16/20    Loss: 3.3675909323692323

Epoch:   16/20    Loss: 3.3954517765045167

Epoch:   16/20    Loss: 3.4154037742614745

Epoch:   16/20    Loss: 3.402384232521057

Epoch:   16/20    Loss: 3.41231938123703

Epoch:   16/20    Loss: 3.4150117321014406

Epoch:   16/20    Loss: 3.414239936351776

Epoch:   16/20    Loss: 3.4183632831573485

Epoch:   16/20    Loss: 3.4317506384849548

Epoch:   16/20    Loss: 3.4551423139572144

Epoch:   16/20    Loss: 3.459237895488739

Epoch:   16/20    Loss: 3.4712672204971313

Epoch:   17/20    Loss: 3.397530146550591

Epoch:   17/20    Loss: 3.323518482208252

Epoch:   17/20    Loss: 3.332956910610199

Epoch:   17/20    Loss: 3.3687782340049743

In [17]:
import torch
import helper
import problem_unittests as tests

_, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()
trained_rnn = helper.load_model('./save/trained_rnn')

## Generate TV Script
With the network trained and saved, we'll use it to generate a new, "fake" Seinfeld TV script in this section.



In [18]:

import torch.nn.functional as F

def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
   
    rnn.eval()
    
    # create a sequence (batch_size=1) with the prime_id
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        
        # initialize the hidden state
        hidden = rnn.init_hidden(current_seq.size(0))
        
        # get the output of the rnn
        output, _ = rnn(current_seq, hidden)
        
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)     
        
        # the generated word becomes the next "current sequence" and the cycle can continue
        # (move back to the CPU first so np.roll also works when running on a GPU)
        current_seq = np.roll(current_seq.cpu().numpy(), -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens
    for key, token in token_dict.items():
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # return all the sentences
    return gen_sentences
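
The sampling step inside `generate` can be looked at in isolation. The sketch below is a plain-Python version of top-k sampling under a hypothetical probability distribution: keep the `k` most likely word ids, then draw one of them in proportion to its probability. `sample_top_k` is a name introduced only for this example.

```python
import random

def sample_top_k(probs, k, rng=random):
    # indices of the k largest probabilities
    top_i = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top_p = [probs[i] for i in top_i]
    # random.choices normalises the weights, like p / p.sum() in generate
    return rng.choices(top_i, weights=top_p, k=1)[0]

probs = [0.05, 0.40, 0.10, 0.30, 0.15]  # hypothetical distribution over 5 word ids
rng = random.Random(0)
draws = [sample_top_k(probs, k=3, rng=rng) for _ in range(1000)]

# only the three most likely ids (1, 3 and 4) can ever be drawn
print(set(draws) <= {1, 3, 4})  # True
```

A smaller `k` makes the text more predictable, while a larger `k` allows more variety at the cost of more unlikely words.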

### Generate a New Script


In [19]:
# run the cell multiple times to get different results!
gen_length = 400 # modify the length to your preference
prime_word = 'jerry' # name for starting the script

pad_word = helper.SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)

jerry: students cannot score. i got witnesses.

jerry: yeah.

jerry: well, perhaps we can go out with me on the plane.

jerry: yeah, that's right.

elaine:(to george) hey, what are you doing here, you can take it!

frank:(to helen) you think she was refunding?

elaine: oh. well, i don't think so.

george: i don't know, but i think i could get some sandwiches.

george: i thought they were changing america in a lifetime, like a glove, and i think i could be held into this.

kramer:(explaining) i think you know *why* you're going to be a banker?

george: i don't understand...(laughs)

newman:(laughs) oh, no. no, no, no.

elaine: well, i think i'll see you later.

george: oh, come on. come in, come back in there for a long time?

elaine: no.

jerry:(annoyed) oh, come on, come on, come on lets go.

elaine: well, you know what? i'll hide the whole thing on thursday.

jerry: yeah, yeah.

kramer: yeah, well, actually i think i'll tell you, uh...

elaine:(laughs) oh, i don't know why i don't ha

#### Save your favorite scripts

Once you have a script that you like (or find interesting), save it to a text file!

In [20]:
# save script to a text file
with open("generated_script_1.txt", "w") as f:
    f.write(generated_script)