# TV Scripts Generator

## Project

Imagine your job is to write the script for the next episode of a long-running TV show. With so many episodes already made, it is hard to come up with something new. The good news is that you can use deep learning to write the script for you: there are many scripts from previous episodes that a neural network can learn from. And, importantly, the audience likes what it is used to, so it sounds like a perfect solution.

In this project I'm going to train a neural network on scripts from the nine seasons of Seinfeld. Based on the patterns it recognizes, it will be able to generate new text, and I will use it to create a new script.

## Load the data

If you look at the data file, you will see that it contains lines from the scripts, appended episode by episode. I'm loading them all as one huge string of text.

In [1]:
import helper

# Load in data
data_dir = './data/Seinfeld_Scripts.txt'
text = helper.load_data(data_dir)

Now I can access `text` character by character. Here I print the first 1000 characters.

In [2]:
text[:1000]

'jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! you wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right? where ever you are in life, its my feeling, youve gotta go. \n\njerry: (poi

You can see that new lines are represented by `\n`.

### Explore it

Using `view_line_range`, I print out the first ten lines of text. Empty lines also count as lines, so this prints 5 quotes.

You can also see some dataset stats. The rough estimate of unique words overshoots, because it treats everything between whitespace as a single word, so *back!* and *back* count as two different words.

In [3]:
import numpy as np

# Determine which lines range to print
view_line_range = (0, 10)

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len(set(text.split()))))

lines = text.split('\n')
print('Number of lines: {}'.format(len(lines)))
word_count_line = [len(line.split()) for line in lines]
print('Average number of words in each line: {}'.format(np.average(word_count_line)))

print()
print('The lines {} to {}:'.format(*view_line_range))
print('\n'.join(text.split('\n')[view_line_range[0]:view_line_range[1]]))

Dataset Stats
Roughly the number of unique words: 46367
Number of lines: 109233
Average number of words in each line: 5.544240293684143

The lines 0 to 10:
jerry: do you know what this is all about? do you know, why were here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! y
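
The overshoot in the rough unique-word estimate is easy to reproduce on a toy string. The sketch below is only an illustration: the `sample` string is made up, and stripping with `string.punctuation` is an assumption used to show the effect, not the approach this notebook takes later.

```python
import string

# Made-up sample, not taken verbatim from the dataset
sample = "you wanna get back! once youre out, you wanna get back"

# Naive estimate: everything between whitespace counts as a word,
# so 'back!' and 'back' are counted separately
naive_vocab = set(sample.split())

# Stripping leading/trailing punctuation merges 'back!' with 'back'
stripped_vocab = {word.strip(string.punctuation) for word in sample.split()}

print('Naive unique words:   ', len(naive_vocab))
print('Stripped unique words:', len(stripped_vocab))
```

The naive count is higher because punctuation attached to a word creates a separate entry.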

## Pre-process the data

When working with text data, it is useful to encode words as numbers. I am going to do that now.

### Lookup tables

I will create two dictionaries, `vocab_to_int` that will map words to integers and `int_to_vocab` that will do the opposite. These lookup tables will make it easy to translate between words and their integer codings later on.

In [4]:
import problem_unittests as tests

def create_lookup_tables(text):
    '''Create lookup tables for vocabulary.
    
    Args:
        text (list of str): The text of TV scripts split into words
    
    Return: A tuple of dicts (vocab_to_int, int_to_vocab)
    '''
    # Transform text into a large tuple of unique words
    vocab = tuple(set(text))
    
    int_to_vocab = dict(enumerate(vocab))
    vocab_to_int = {word: i for i, word in int_to_vocab.items()}

    return (vocab_to_int, int_to_vocab)

# Test
tests.test_create_lookup_tables(create_lookup_tables)

Tests Passed
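
To see how the two lookup tables work together, here is a small round trip on a toy word list. This is a sketch under assumptions: `build_lookup_tables` is a hypothetical stand-in that mirrors the function above, and it sorts the vocabulary only so that the integer ids are reproducible between runs.

```python
def build_lookup_tables(words):
    """Hypothetical stand-in for the notebook's create_lookup_tables."""
    # Sorting is an assumption made here for reproducible ids;
    # a plain set would give an arbitrary order
    vocab = sorted(set(words))
    int_to_vocab = dict(enumerate(vocab))
    vocab_to_int = {word: i for i, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab


# Toy input: already lower-cased and split into words
words = ['hello', 'jerry', '||period||', 'hello', 'newman', '||period||']

vocab_to_int, int_to_vocab = build_lookup_tables(words)

# Encode words as integers, then decode them back
encoded = [vocab_to_int[word] for word in words]
decoded = [int_to_vocab[i] for i in encoded]

print(encoded)
print(decoded == words)
```

Each unique word gets exactly one id, so decoding the encoded sequence gives back the original word list.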


### Tokenize punctuation

Later, I'm going to split `text` into words on whitespace. However, punctuation can cause problems, for example treating *bye* and *bye!* as two different words. You can solve this by replacing every punctuation mark with a word token, such as "." with "||period||". Why add the "|" separators? So that during analysis you can distinguish real words from the tokens.

Now I'm going to create a dictionary that maps punctuation marks to tokens.

In [5]:
def token_lookup():
    '''Generate a dict to turn punctuation into a token.
    
    Return: Tokenized dictionary where the key is the punctuation and the value is the token
    '''
    punctuation_to_token = {
        '.' : '||period||',
        ',' : '||comma||',
        '"' : '||quotation_mark||',
        ';' : '||semicolon||',
        '!' : '||exclamation_mark||',
        '?' : '||question_mark||',
        '(' : '||left_parentheses||',
        ')' : '||right_parentheses||',
        '-' : '||dash||',
        '\n': '||new_line||'
    }
        
    return punctuation_to_token

# Test
tests.test_tokenize(token_lookup)

Tests Passed
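
The dictionary on its own does nothing until it is applied to the text. A minimal sketch of how that could look is below. The `tokenize` function is hypothetical and uses a shortened copy of the dictionary. It assumes each punctuation mark is replaced by its token padded with spaces, and that the text is lower-cased before splitting. The real helper may differ.

```python
# Shortened copy of the punctuation dictionary, for illustration only
punctuation_to_token = {
    '.': '||period||',
    ',': '||comma||',
    '!': '||exclamation_mark||',
    '?': '||question_mark||',
    '\n': '||new_line||',
}


def tokenize(text, token_dict):
    """Hypothetical helper: swap punctuation for tokens, then split into words."""
    for mark, token in token_dict.items():
        # Pad the token with spaces so it becomes a separate word
        text = text.replace(mark, ' {} '.format(token))
    return text.lower().split()


tokens = tokenize("Bye!\nbye, Jerry.", punctuation_to_token)
print(tokens)
```

After this step *Bye!* and *bye,* both contribute the same word, *bye*, and the punctuation marks survive as their own tokens.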


### Pre-process the data and save it

As a last step, I'm going to do the actual pre-processing. The code will translate punctuation marks and create the dictionaries. It will save the results to a file, so you can easily load them and continue from this point.

In [6]:
# Pre-process training data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)
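
How `helper.preprocess_and_save_data` stores its results is not shown in this notebook. One common way to do it, assumed here purely for illustration, is to pickle the encoded text together with the dictionaries. The file name `preprocess.p` and the toy data below are hypothetical.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real pre-processing results
token_dict = {'.': '||period||'}
words = ['hello', 'jerry', '||period||']
int_to_vocab = dict(enumerate(sorted(set(words))))
vocab_to_int = {word: i for i, word in int_to_vocab.items()}
int_text = [vocab_to_int[word] for word in words]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the real helper may use another one
    path = os.path.join(tmp_dir, 'preprocess.p')

    # Save everything needed to resume later in a single file
    with open(path, 'wb') as f:
        pickle.dump((int_text, vocab_to_int, int_to_vocab, token_dict), f)

    # Load it back, as a checkpoint cell would
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded == (int_text, vocab_to_int, int_to_vocab, token_dict))
```

Saving all four objects together keeps the encoded text and its dictionaries in sync, which matters because the integer ids are only meaningful with the exact dictionaries that produced them.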

## Checkpoint

This code allows you to continue analysis whenever you come back to the notebook. Just run this cell and it will load the pre-processed data.

In [7]:
import helper
import problem_unittests as tests

# Load the pre-processed data
int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

In [8]:
# Count the number of unique words
print("Number of unique words: ", len(vocab_to_int))

Number of unique words:  21388


As you can see, after pre-processing the count of unique words is more accurate: 21,388, compared with the rough estimate of 46,367.