# Intro
The goal of this project is to build an RNN model that can generate text. We will train the model on a real movie script and then use the trained model to generate new scripts from scratch.

## Import Libraries and Load Data

The first step is to import the libraries required for this project and to load the raw movie script, which is saved in `./data/MovieScript.txt`:

In [1]:
# import libraries
import numpy as np
import helper
import problem_unittests as tests
from torch.utils.data import TensorDataset, DataLoader
import torch
import torch.nn as nn
import torch.nn.functional as F

# load data
data_dir = './data/MovieScript.txt'  
text = helper.load_data(data_dir)

## Explore Data
To understand our data better, we need to extract some statistics. We can also use the variable `explore_range` to view specific lines of the script. The output shows that the raw text has already been preprocessed as follows:
- All letters are lowercase.
- Lines of the raw text are separated by the newline character `\n`.

In [2]:
explore_range = (33, 43)

print('Data Statistics:========')
print('Number of unique words: {}'.format(len(set(text.split()))))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))
print('========================')

print()
print('The lines {} to {}:'.format(*explore_range))
print('\n'.join(text.split('\n')[explore_range[0]:explore_range[1]]))

Number of unique words: 46367
Number of lines: 109233
Average number of words in each line: 5.544240293684143

The lines 33 to 43:

george: no, you didnt! 

jerry: i thought i told you about it, yes, she teaches political science? i met her the night i did the show in lansing... 

george: ha. 

jerry: (looks in the creamer) theres no milk in here, what... 

george: wait wait wait, what is she... (takes the milk can from jerry and puts it on the table) what is she like? 


---
## Data Preprocessing
In order to prepare our raw data to be used as input to our model, we need to preprocess it. We will develop two functions that will be used to achieve this goal:
- Embedding lookup
- Tokenize Punctuation

### Embedding Lookup
This function builds the word-to-id mappings that the embedding layer relies on, converting words into integer tokens and vice versa. It accepts the text as input and returns a tuple of two dictionaries:
- `vocab_to_int`: a dictionary that maps each word to an integer id
- `int_to_vocab`: a dictionary that maps each integer id back to its word

In [3]:
def create_lookup_tables(text):

    from collections import Counter
    # text --> {text:count}
    word_counts = Counter(text)
    # sorting the words from most to least frequent in text occurrence
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    # create dictionaries
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
    
    # return tuple
    return (vocab_to_int, int_to_vocab)


# test function
tests.test_create_lookup_tables(create_lookup_tables)

Tests Passed
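
As a quick sanity check, the sketch below builds the same kind of lookup tables for a tiny, made-up word list. The function name `build_lookup_tables` and the sample words are illustrative and not part of the project; the logic is assumed to mirror `create_lookup_tables` above.

```python
from collections import Counter

def build_lookup_tables(words):
    """Illustrative stand-in for create_lookup_tables (hypothetical name)."""
    counts = Counter(words)
    # the most frequent words receive the smallest ids
    sorted_vocab = sorted(counts, key=counts.get, reverse=True)
    int_to_vocab = dict(enumerate(sorted_vocab))
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

sample_words = "jerry george jerry elaine jerry george".split()
vocab_to_int, int_to_vocab = build_lookup_tables(sample_words)

# encode the words as ids, then decode them back
encoded = [vocab_to_int[w] for w in sample_words]
decoded = [int_to_vocab[i] for i in encoded]
print(encoded)
print(decoded == sample_words)
```

Because the ids are assigned by frequency, the most common word in the sample ("jerry") ends up with id 0, and decoding the ids recovers the original word list.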


### Tokenize Punctuation
We'll be splitting the script into a word array using spaces as delimiters. However, punctuation marks like periods and exclamation marks can create multiple ids for the same word. For example, "bye" and "bye!" would generate two different word ids.

We will implement the function `token_lookup` to return a dictionary that will be used to tokenize symbols like "!" into "||Exclamation_Mark||". The dictionary acts as a lookup table for the following symbols, where the symbol is the key and the token is the value:
- Period ( **.** )
- Comma ( **,** )
- Quotation Mark ( **"** )
- Semicolon ( **;** )
- Exclamation mark ( **!** )
- Question mark ( **?** )
- Left Parentheses ( **(** )
- Right Parentheses ( **)** )
- Dash ( **-** )
- Return ( **\n** )

In [4]:
def token_lookup():
    punct = {
        '.':'||PERIOD||',
        ',':'||COMMA||',
        '"':'||QUOTATION_MARK||',
        ';':'||SEMICOLON||',
        '!':'||EXCLAMATION_MARK||',
        '?':'||QUESTION_MARK||',
        '(':'||LEFT_PAREN||',
        ')':'||RIGHT_PAREN||',
        '\n':'||NEW_LINE||',
        '-':'||DASH||'
    }

    return punct

# test function
tests.test_tokenize(token_lookup)

Tests Passed
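
To illustrate why the lookup helps, the sketch below applies such a dictionary to one line before splitting it on spaces. How `helper.preprocess_and_save_data` applies the dictionary is not shown in this notebook, so padding each token with spaces is an assumption here, and only a subset of the symbols is used. The function name `tokenize_punctuation` is hypothetical.

```python
# A subset of the punctuation lookup, for illustration only
punct = {
    '.': '||PERIOD||',
    ',': '||COMMA||',
    '!': '||EXCLAMATION_MARK||',
    '?': '||QUESTION_MARK||',
}

def tokenize_punctuation(line, lookup):
    """Replace each symbol with its token, padded by spaces (assumed behavior)."""
    for symbol, token in lookup.items():
        line = line.replace(symbol, ' {} '.format(token))
    return line.split()

words = tokenize_punctuation('bye! ok, bye.', punct)
print(words)
```

After tokenization, both occurrences of "bye" are the same word, and each punctuation mark becomes its own entry in the word array.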


## Pre-process all the data and save it

The function `preprocess_and_save_data` applies the lookups defined above to our raw data. Running the code cell below will pre-process all the data and save it to a file.

In [5]:
# pre-process training data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)

# Checkpoint
The code below loads the saved preprocessed data back into the notebook, so that we can resume from this point without repeating the preprocessing.

In [6]:
# load the preprocessed data
int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

## Build the Neural Network
To build our RNN model, we need to implement the following components:
* RNN Module 
* Forward function
* Backpropagation function

### Check Access to GPU

In [7]:
# detect cuda card
torch.cuda.current_device()
#torch.cuda.get_device_name(0)

    Found GPU1 Quadro K4000 which is of cuda capability 3.0.
    PyTorch no longer supports this GPU because it is too old.
    


0

In [8]:
# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

## Batching Input Data
Before we build our model, we need to batch our input data. To achieve this, we will build the function `batch_data` to do the following:
1. Break data: split the word ids into sequences of length `sequence_length`, such that only complete sequences are constructed.
2. Create data: convert the sequences into Tensors and wrap them with [TensorDataset](http://pytorch.org/docs/master/data.html#torch.utils.data.TensorDataset).
3. Batch data: use [DataLoader](http://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader) to handle batching, shuffling, and other dataset iteration functions.

In [9]:
def batch_data(words, sequence_length, batch_size):
    
    # calculate total number of batches
    n_batches = len(words)//batch_size
    
    # clip extra words so that only full batches remain
    words = words[:n_batches * batch_size]
    # initialize feature and target lists
    x = [] 
    y = []
    # number of complete sequences that still have a following target word
    last_seq = len(words) - sequence_length
    # slide a window of sequence_length words over "words", one word at a time
    for n in range(0, last_seq):
        # The features
        x.append(words[n:n+sequence_length])
        # The targets, shifted by one
        y.append(words[n+sequence_length])
    feature_tensors = torch.from_numpy(np.asarray(x))
    target_tensors = torch.from_numpy(np.asarray(y))

    # create and batch data
    data = TensorDataset(feature_tensors, target_tensors)
    data_loader = torch.utils.data.DataLoader(data, batch_size=batch_size, shuffle=True)

    return data_loader

### Test batch function
To test our function, we generate sample data `test_text` and check that both the input sequences (`sample_x`) and the targets (`sample_y`) are batched correctly.

#### Sizes: 
`sample_x` should be of size (batch_size, sequence_length) and `sample_y` should have just one dimension: batch_size.
#### Values: 
The targets, `sample_y`, should be the next value in the ordered `test_text` data.

In [10]:
# test dataloader

test_text = range(50)
t_loader = batch_data(test_text, sequence_length=5, batch_size=10)

data_iter = iter(t_loader)
sample_x, sample_y = next(data_iter)

print(sample_x.shape)
print(sample_x)
print()
print(sample_y.shape)
print(sample_y)

torch.Size([10, 5])
tensor([[ 7,  8,  9, 10, 11],
        [14, 15, 16, 17, 18],
        [34, 35, 36, 37, 38],
        [18, 19, 20, 21, 22],
        [15, 16, 17, 18, 19],
        [33, 34, 35, 36, 37],
        [19, 20, 21, 22, 23],
        [21, 22, 23, 24, 25],
        [27, 28, 29, 30, 31],
        [44, 45, 46, 47, 48]], dtype=torch.int32)

torch.Size([10])
tensor([12, 19, 39, 23, 20, 38, 24, 26, 32, 49], dtype=torch.int32)
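
The sliding-window logic behind these tensors can be reproduced in plain Python. The sketch below is an illustration only (the helper name `make_windows` is hypothetical); it skips the tensor conversion and shuffling and simply shows how each feature row pairs with its target.

```python
def make_windows(words, sequence_length):
    """Pair every window of `sequence_length` ids with the id that follows it."""
    features, targets = [], []
    for start in range(len(words) - sequence_length):
        features.append(words[start:start + sequence_length])
        targets.append(words[start + sequence_length])
    return features, targets

features, targets = make_windows(list(range(50)), sequence_length=5)
print(len(features))              # number of (feature, target) pairs
print(features[7], targets[7])    # the window that starts at id 7
```

With 50 ids and a sequence length of 5 this yields 45 pairs, and the window starting at id 7 matches the first row of the shuffled batch printed above.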


---
## Build the Neural Network

### RNN Module
We can implement our RNN as a subclass of PyTorch's [Module class](http://pytorch.org/docs/master/nn.html#torch.nn.Module). In this case, we will use an LSTM as the recurrent layer.

### Class functions
We will implement the following functions for the class:
 - `__init__` - The initialize function. 
 - `init_hidden` - The initialization function for LSTM hidden state
 - `forward` - Forward propagation function.
 
The initialize function should create the layers of the neural network and save them to the class. The forward propagation function will use these layers to run forward propagation and generate an output and a hidden state.

### Model output
**The output of this model should be the word scores from the *last* time step** of each sequence, after the complete sequence has been processed. That is, for each input sequence of words, we only output the scores used to pick a single, most likely, next word.
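
To make this selection concrete, the sketch below uses nested Python lists in place of tensors. The scores are made-up numbers; in the model the same idea corresponds to indexing a tensor of shape (batch_size, seq_length, output_size) along the sequence dimension.

```python
# Made-up scores with shape (batch_size=2, seq_length=3, output_size=4)
scores = [
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.2, 0.2], [0.0, 0.9, 0.05, 0.05]],
    [[0.3, 0.3, 0.2, 0.2], [0.1, 0.1, 0.1, 0.7], [0.6, 0.2, 0.1, 0.1]],
]

# keep only the scores produced after the final word of each sequence
last_step = [sequence[-1] for sequence in scores]

# the highest-scoring word id for each sequence in the batch
best_ids = [row.index(max(row)) for row in last_step]
print(last_step)
print(best_ids)
```

The result has one row of word scores per sequence in the batch, which is the shape the loss function expects when each sequence has a single target word.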



In [11]:
class RNN(nn.Module):
    
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5):

        super(RNN, self).__init__()
        
        ## set class variables
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        ## define model layers
        # define the embedding lookup table
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # define the LSTM
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        
        # define the final, fully-connected output layer
        self.fc = nn.Linear(hidden_dim, output_size)
    
    
    def forward(self, nn_input, hidden):

        # define batch size
        batch_size = nn_input.size(0)

        # embeddings and lstm_out
        nn_input = nn_input.long() 
        embeds = self.embedding(nn_input)
        lstm_out, hidden = self.lstm(embeds, hidden)
    
        # stack up lstm outputs and copy tensor
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # fully-connected layer
        output = self.fc(lstm_out)

        # reshape into (batch_size, seq_length, output_size)
        output = output.view(batch_size, -1, self.output_size)
        
        # keep only the word scores from the last time step of each sequence
        out = output[:, -1]
        
        # return one batch of output word scores and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        
        # initialize hidden state with zero weights, and move to GPU if available
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

# test function
tests.test_rnn(RNN, train_on_gpu)

Tests Passed


### Define forward and backpropagation

This function applies forward and back propagation using our RNN class. It will be called iteratively in the training loop as follows:
```
loss, hidden = forward_back_prop(rnn, optimizer, criterion, inp, target, hidden)
```

The function returns the average loss over a batch and the hidden state returned by a call to `rnn(inp, hidden)`.


In [12]:
def forward_back_prop(rnn, optimizer, criterion, inp, target, hidden):
    """
    Forward and backward propagation on the neural network
    :param rnn: The PyTorch Module that holds the neural network
    :param optimizer: The PyTorch optimizer for the neural network
    :param criterion: The PyTorch loss function
    :param inp: A batch of input to the neural network
    :param target: The target output for the batch of input
    :param hidden: The hidden state carried over from the previous batch
    :return: The loss and the latest hidden state Tensor
    """
    
    # move data to GPU, if available
    if(train_on_gpu):
        rnn.cuda()
        inp, target = inp.cuda(), target.cuda()

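    # detach the hidden state from its history so that we do not
    # backpropagate through earlier batches
    h = tuple([each.data for each in hidden])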

    # zero accumulated gradients
    rnn.zero_grad()

    # get the output from the model
    output, h = rnn(inp, h)

    # calculate the loss and perform backprop 
    loss = criterion(output, target.long()) 
    loss.backward()

    # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
    nn.utils.clip_grad_norm_(rnn.parameters(), 5) #clip = 5
    optimizer.step()

    # return the loss over a batch and the hidden state produced by our model
    return loss.item(), h


# test (only general checks on the expected outputs of our functions)
tests.test_forward_back_prop(RNN, forward_back_prop, train_on_gpu)

Tests Passed
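
The clipping step can be illustrated with plain numbers. The sketch below rescales a list of made-up gradient values so that their combined L2 norm does not exceed a maximum, which is the idea behind `clip_grad_norm_`; it is a simplified illustration with a hypothetical helper name, not PyTorch's implementation.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale `grads` down so that their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # already within the limit: leave the gradients unchanged
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

grads = [6.0, 8.0]                      # L2 norm is 10
clipped, norm_before = clip_by_norm(grads, max_norm=5)
print(norm_before)
print(clipped)
```

The direction of the gradient is preserved and only its length is reduced, so a single unusually large update cannot destabilize training.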


## Neural Network Training

With the structure of the network complete and data ready to be fed in the neural network, it's time to train it.

### Train Loop

The `train_rnn` function trains the network over all the batches for the given number of epochs. Training progress is printed every few batches; the interval is set with the `show_every_n_batches` parameter. We'll set this parameter along with the other parameters in the next section.

In [13]:
def train_rnn(rnn, batch_size, optimizer, criterion, n_epochs, show_every_n_batches=100):
    batch_losses = []
    
    rnn.train()

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        # initialize hidden state
        hidden = rnn.init_hidden(batch_size)
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            # make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            # forward, back prop
            loss, hidden = forward_back_prop(rnn, optimizer, criterion, inputs, labels, hidden)          
            # record loss
            batch_losses.append(loss)

            # printing loss stats
            if batch_i % show_every_n_batches == 0:
                print('Epoch: {:>4}/{:<4}  Loss: {}\n'.format(
                    epoch_i, n_epochs, np.average(batch_losses)))
                batch_losses = []

    # returns a trained rnn
    return rnn

### Hyperparameters

Set and train the neural network with the following parameters:
- Set `sequence_length` to the length of a sequence.
- Set `batch_size` to the batch size.
- Set `num_epochs` to the number of epochs to train for.
- Set `learning_rate` to the learning rate for an Adam optimizer.
- Set `vocab_size` to the number of unique tokens in our vocabulary.
- Set `output_size` to the desired size of the output.
- Set `embedding_dim` to the embedding dimension; smaller than the vocab_size.
- Set `hidden_dim` to the hidden dimension of your RNN.
- Set `n_layers` to the number of layers/cells in your RNN.
- Set `show_every_n_batches` to the number of batches at which the neural network should print progress.

If the network isn't getting the desired results, we can tweak these parameters and/or the layers in the `RNN` class.
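
When tuning these values, it can help to estimate how many trainable parameters a given setting implies. The sketch below counts them for an embedding layer, a stacked LSTM, and a final linear layer, the layer types used in the `RNN` class. The helper name `count_parameters` is hypothetical, and the vocabulary size of 20,000 is a placeholder assumption, not the size of this dataset.

```python
def count_parameters(vocab_size, output_size, embedding_dim, hidden_dim, n_layers):
    """Estimate trainable parameters for embedding + stacked LSTM + linear layer."""
    embedding = vocab_size * embedding_dim
    lstm = 0
    for layer in range(n_layers):
        input_dim = embedding_dim if layer == 0 else hidden_dim
        # 4 gates, each with input weights, recurrent weights, and two bias vectors
        lstm += 4 * hidden_dim * (input_dim + hidden_dim) + 8 * hidden_dim
    linear = hidden_dim * output_size + output_size
    return {'embedding': embedding, 'lstm': lstm, 'linear': linear,
            'total': embedding + lstm + linear}

# placeholder vocabulary size; substitute len(vocab_to_int) in the notebook
counts = count_parameters(vocab_size=20000, output_size=20000,
                          embedding_dim=400, hidden_dim=512, n_layers=2)
print(counts)
```

For word-level models, the embedding and output layers typically dominate the total because both scale with the vocabulary size, so changing `hidden_dim` or `n_layers` affects a smaller share of the parameters than one might expect.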

In [14]:
## Data params
# Sequence Length
sequence_length = 10   # of words in a sequence

# Batch Size
batch_size = 128

# data loader - do not change
train_loader = batch_data(int_text, sequence_length, batch_size)

In [15]:
## Training parameters
# Number of Epochs
num_epochs = 20 

# Learning Rate
learning_rate = 0.001 

# Model parameters
# Vocab size
vocab_size = len(vocab_to_int)
# Output size
output_size = vocab_size
# Embedding Dimension
embedding_dim = 400 

# Hidden Dimension
hidden_dim = 512 

# Number of RNN Layers
n_layers = 2

# Show stats for every n number of batches
show_every_n_batches = 500

### Train
In the next cell, we will train the neural network on the pre-processed data. We will experiment with different hyperparameters until the training loss reaches an acceptable level (3.5 in this case).
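
One way to interpret a cross-entropy loss value is to convert it to perplexity, the exponential of the loss, which can be read roughly as the number of equally likely words the model is choosing between. The sketch below is a side calculation, not part of the training code; the vocabulary size of 20,000 is a placeholder assumption.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the average cross-entropy (in nats)."""
    return math.exp(cross_entropy_loss)

target_loss = 3.5
print(round(perplexity(target_loss), 1))

# loss of a model that guesses uniformly over a placeholder vocabulary
uniform_loss = math.log(20000)
print(round(uniform_loss, 2))
```

Comparing the target loss with the uniform-guess loss gives a rough sense of how much the model has learned relative to picking words at random.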


In [16]:
# create model and move to gpu if available
rnn = RNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout=0.5)
if train_on_gpu:
    rnn.cuda()

# defining loss and optimization functions for training
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# training the model
trained_rnn = train_rnn(rnn, batch_size, optimizer, criterion, num_epochs, show_every_n_batches)

# saving the trained model
helper.save_model('./save/trained_rnn', trained_rnn)
print('Model Trained and Saved')

Training for 20 epoch(s)...
Epoch:    1/20    Loss: 5.267440577507019

Epoch:    1/20    Loss: 4.636298000335693

Epoch:    1/20    Loss: 4.448770101070404

Epoch:    1/20    Loss: 4.325790477275849

Epoch:    1/20    Loss: 4.25684912776947

Epoch:    1/20    Loss: 4.228826926708221

Epoch:    1/20    Loss: 4.201540566921234

Epoch:    1/20    Loss: 4.138343175411224

Epoch:    1/20    Loss: 4.125659349441528

Epoch:    1/20    Loss: 4.0906209759712215

Epoch:    1/20    Loss: 4.060240993499756

Epoch:    1/20    Loss: 4.041012619018555

Epoch:    1/20    Loss: 4.042131265163421

Epoch:    2/20    Loss: 3.902938900161381

Epoch:    2/20    Loss: 3.8041995692253114

Epoch:    2/20    Loss: 3.821500693321228

Epoch:    2/20    Loss: 3.79368452501297

Epoch:    2/20    Loss: 3.7887437047958374

Epoch:    2/20    Loss: 3.798504003047943

Epoch:    2/20    Loss: 3.7891750044822694

Epoch:    2/20    Loss: 3.8044117736816405

Epoch:    2/20    Loss: 3.776681932449341

Epoch:    2/20    Loss:

Epoch:   15/20    Loss: 2.9228800716400145

Epoch:   15/20    Loss: 2.9264612908363343

Epoch:   15/20    Loss: 2.955915428161621

Epoch:   15/20    Loss: 2.961476556777954

Epoch:   15/20    Loss: 2.9843465938568117

Epoch:   15/20    Loss: 2.9886456422805785

Epoch:   16/20    Loss: 2.8871414348606237

Epoch:   16/20    Loss: 2.811547480583191

Epoch:   16/20    Loss: 2.833255507469177

Epoch:   16/20    Loss: 2.8428696751594544

Epoch:   16/20    Loss: 2.8480278248786925

Epoch:   16/20    Loss: 2.8723080806732177

Epoch:   16/20    Loss: 2.877551052093506

Epoch:   16/20    Loss: 2.9051781878471377

Epoch:   16/20    Loss: 2.9190689840316772

Epoch:   16/20    Loss: 2.921810088157654

Epoch:   16/20    Loss: 2.931290591239929

Epoch:   16/20    Loss: 2.9598481903076173

Epoch:   16/20    Loss: 2.9709917125701906

Epoch:   17/20    Loss: 2.870816743817211

Epoch:   17/20    Loss: 2.788284170150757

Epoch:   17/20    Loss: 2.8010914664268496

Epoch:   17/20    Loss: 2.826385179996490


Model Trained and Saved


---
# Checkpoint

After running the above training cell, our model is saved under the name `trained_rnn`. We can use this checkpoint to resume our progress by running the next cell, which loads our word:id dictionaries _and_ the saved model.

In [17]:
# load checkpoint
_, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()
trained_rnn = helper.load_model('./save/trained_rnn')

## Generate Movie Script
Now we will use our trained model to generate a new, "fake" movie script in this section.

### Generate Text
We will use the `generate` function below to generate our text. The network starts with a single word and repeats its predictions until it reaches a set length. The function takes a word id to start with, `prime_id`, and generates text of a set length, `predict_len`. It also uses top-k sampling to introduce some randomness when choosing the next word from the output word scores.
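
The top-k sampling step can be sketched in plain Python: convert the scores to probabilities, keep the k most probable word ids, and draw one of them in proportion to its probability. The function name `sample_top_k` and the scores are made up for illustration; the notebook's own implementation uses PyTorch and NumPy.

```python
import math
import random

def sample_top_k(scores, k):
    """Sample one word id from the k highest-scoring entries of `scores`."""
    # softmax: turn raw scores into probabilities
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # ids of the k most probable words
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top_probs = [probs[i] for i in top_ids]
    # random.choices treats the weights as relative, so no renormalization is needed
    return random.choices(top_ids, weights=top_probs, k=1)[0]

scores = [2.0, 0.5, 1.5, -1.0, 0.1, 3.0]   # made-up word scores
picks = [sample_top_k(scores, k=3) for _ in range(1000)]
print(sorted(set(picks)))
```

Only the three highest-scoring ids can ever be drawn here, which keeps unlikely words out of the output while still letting repeated runs produce different text.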

In [18]:
def generate(rnn, prime_id, int_to_vocab, token_dict, pad_value, predict_len=100):
    """
    Generate text using the neural network
    :param rnn: The PyTorch Module that holds the trained neural network
    :param prime_id: The word id to start the first prediction
    :param int_to_vocab: Dict of word id keys to word values
    :param token_dict: Dict of punctuation token keys to punctuation values
    :param pad_value: The value used to pad a sequence
    :param predict_len: The length of text to generate
    :return: The generated text
    """
    rnn.eval()
    
    # create a sequence (batch_size=1) with the prime_id
    current_seq = np.full((1, sequence_length), pad_value)
    current_seq[-1][-1] = prime_id
    predicted = [int_to_vocab[prime_id]]
    
    
    for _ in range(predict_len):
        if train_on_gpu:
            current_seq = torch.LongTensor(current_seq).cuda()
        else:
            current_seq = torch.LongTensor(current_seq)
        
        # initialize the hidden state
        hidden = rnn.init_hidden(current_seq.size(0))
        
        # get the output of the rnn
        output, _ = rnn(current_seq, hidden)
        
        # get the next word probabilities
        p = F.softmax(output, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
         
        # use top_k sampling to get the index of the next word
        top_k = 5
        p, top_i = p.topk(top_k)
        top_i = top_i.numpy().squeeze()
        
        # select the likely next word index with some element of randomness
        p = p.numpy().squeeze()
        word_i = np.random.choice(top_i, p=p/p.sum())
        
        # retrieve that word from the dictionary
        word = int_to_vocab[word_i]
        predicted.append(word)     
        
        # the generated word becomes the next "current sequence" and the cycle can continue
        current_seq = current_seq.cpu().numpy()  # back to a NumPy array for the next iteration
        current_seq = np.roll(current_seq, -1, 1)
        current_seq[-1][-1] = word_i
    
    gen_sentences = ' '.join(predicted)
    
    # Replace punctuation tokens
    for key, token in token_dict.items():
        gen_sentences = gen_sentences.replace(' ' + token.lower(), key)
    gen_sentences = gen_sentences.replace('\n ', '\n')
    gen_sentences = gen_sentences.replace('( ', '(')
    
    # return all the sentences
    return gen_sentences
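
The final punctuation-restoring step can be tried on its own. The sketch below runs the same replacement loop on a made-up string of generated words, assuming (as the `token.lower()` call in `generate` suggests) that the tokens appear lowercased in the generated text; only a subset of the token dictionary is used.

```python
# subset of the punctuation lookup, for illustration only
token_dict = {'.': '||PERIOD||', ',': '||COMMA||',
              '!': '||EXCLAMATION_MARK||', '\n': '||NEW_LINE||'}

# made-up model output: words and lowercased tokens joined by spaces
generated = ('kramer: hey ||comma|| jerry ||exclamation_mark|| '
             '||new_line|| jerry: what ||period||')

for symbol, token in token_dict.items():
    # each token is preceded by a space, which is dropped with the token
    generated = generated.replace(' ' + token.lower(), symbol)
# remove the space that follows each restored newline
generated = generated.replace('\n ', '\n')

print(generated)
```

Dropping the leading space along with each token is what attaches the punctuation directly to the preceding word.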

### Generate a New Script
To generate the text, we set `gen_length` to the length of the script we want to generate and `prime_word` to a word included in the source data (e.g. "kramer") to start the prediction.

We can set the prime word to _any word_ in our dictionary, but it's best to start with a name for generating a script. 

In [19]:
# run the cell multiple times to get different results!
gen_length = 400 # modify the length to your preference
prime_word = 'kramer' # name for starting the script

pad_word = helper.SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)

kramer: crowded liquid.

jerry: what?

jerry: i don't know.

george:(shouting) i have to tell you that. i can't even have to do it.

jerry: well, i think i'm getting a free sample.

elaine:(to jerry) i told you not to tell you this thing!

jerry: i don't know if i was eighteen- worthy.

elaine: oh, well, that's the thing.

george: what are you talking about?(he picks it up) i was in the sauna with my cousin holly.

kramer: oh, no!

george: hey.

elaine: oh.

jerry:(to kramer) so, i guess i could help you out. i'm gonna get going.

jerry:(to jerry) hey, i gotta tell you, i'm sorry to disturb you.

jerry:(to the intercom) hello? yeah, tommy fries is...

george:(interrupting) i don't wanna see this. i don't know what to do with it.

kramer: oh.

george:(to george, to himself) you know, you should be ashamed of yourself!

elaine: oh my god, you know, i really think you're gonna be the one who wants you to do that.

jerry: i think i could.

elaine:(smiling, convincing) oh, you idiots!

jerr

#### Save scripts

Once we have a script that we like (or find interesting), we can save it to a text file!

In [21]:
# save script to a text file
with open("generated_script_1.txt", "w") as f:
    f.write(generated_script)