### Movie Script Generation

This notebook accompanies a [short post](http://perzan.io/projects/script-generator/) about training a character-level LSTM to write a screenplay.

This notebook has three sections:
1. Pre-processing raw text extracted from vector PDFs, mostly with awk and sed in bash
2. Encoding sequences of characters as sequences of one-hot vectors, shaped the way an LSTM expects
3. Training/fitting the LSTM and generating text

#### Section 1: Text pre-processing using sed, awk, and tr

In [1]:
%%bash
## Section 1: Text pre-processing using sed, awk, and tr

# Delete old files
rm -f tarantino.txt

# Pulp fiction
# Remove leading whitespace
sed 's/^[ \t]*//' raw_text/pulpFiction_raw.txt > pulpFiction.txt

# Reservoir dogs
# Remove scene number
sed 's/^[0-9]    //' raw_text/reservoirDogs_raw.txt > res.txt
sed 's/^[0-9][0-9]    //' -i res.txt
sed 's/^[ \t]*//' -i res.txt # Remove leading whitespace
mv res.txt reservoirDogs.txt 

# Inglourious basterds
sed 's/^[ \t]*//' raw_text/inglouriousBasterds_raw.txt > ing.txt # Remove leading whitespace
# Remove odd instances of several spaces in a row mid-sentence
tr -s ' ' < ing.txt > tmp.txt; mv tmp.txt ing.txt 
sed 's/\o14\ [[0-9]*\]//g' -i ing.txt # Remove page numbers (\o14 is the octal for Form Feed)
awk '$0 ~ /[a-zA-Z0-9]/{print}' ing.txt > tmp.txt; mv tmp.txt ing.txt # Remove all lines without alphanumeric
# Add back in blank lines above lines w/out any lowercase letters (assume that these are 
# camera instructions -- like "EXT. PATIO DAY" -- or character names -- like "COL. LANDA".)
awk '$0 !~ /[a-z]/{printf"\n"; print}; /[a-z]/{print}' ing.txt > inglouriousBasterds.txt
rm ing.txt

# Jackie Brown
# Remove leading whitespace
sed 's/^[ \t]*//' raw_text/jackieBrown_raw.txt > jb.txt
sed '/^$/d' -i jb.txt # Remove blank lines
awk '$0 !~ /[a-z]/{printf"\n"; print}; /[a-z]/{print}' jb.txt > jackieBrown.txt
rm jb.txt

# True Romance
sed 's/^[ \t]*//' raw_text/trueRomance_raw.txt > true.txt # Remove leading whitespace
tr -s ' ' < true.txt > tmp.txt; mv tmp.txt true.txt # Remove odd instances of several spaces in a row mid-sentence
sed '/^\o14/d' -i true.txt # Remove page numbers (those lines always start with \o14 -- form feed)
sed '/(MORE)/d' -i true.txt # Remove "(MORE).... (CONT'D)" that appear during page breaks
sed '/(CONT/d' -i true.txt
# Remove blank lines or lines of only punctuation (sed '/^$/d' doesn't work)
awk '$0 ~ /[a-zA-Z0-9]/{print}' true.txt > tmp.txt; mv tmp.txt true.txt
awk '$0 !~ /[a-z]/{printf"\n"; print}; /[a-z]/{print}' true.txt > trueRomance.txt
rm true.txt

# Natural Born Killers
sed 's/^[ \t]*//' raw_text/naturalBornKillers_raw.txt > naturalBornKillers.txt

# Four Rooms
sed 's/^[ \t]*//' raw_text/fourRooms_raw.txt > fourRooms.txt

# From Dusk till Dawn
sed 's/^[ \t]*//' raw_text/fromDuskTillDawn.txt > from.txt
cat -s from.txt > fromDuskTillDawn.txt
rm from.txt

# Kill Bill (1 and 2 combined)
sed 's/^[ \t]*//' raw_text/killBill_raw.txt | cat -s > kb.txt
sed 's/\*//g' kb.txt > killBill.txt 
rm kb.txt

# Django Unchained
sed 's/^[ \t]*//' raw_text/djangoUnchained_raw.txt > django.txt
awk '$0 ~ /[a-zA-Z0-9]/{print}' django.txt > tmp.txt; mv tmp.txt django.txt
awk '$0 ~ /[a-zA-Z]/{print}' django.txt > tmp.txt; mv tmp.txt django.txt
# Add back in blank lines above character names
awk '{if ($0 !~ /[a-z]/ || $0 ~ /^Dr\.SCHULTZ/){printf"\n"; print; next}}; /[a-z]/{print}' django.txt > djangoUnchained.txt
rm django.txt

# Hateful Eight
awk '$0 !~ /Page/' raw_text/hatefulEight_raw.txt > hate.txt
sed 's/\f//' -i hate.txt # Remove excess form feed
awk 'length($0) > 1' hate.txt > hatefulEight.txt
rm hate.txt

# Combine all to single file named "tarantino.txt"
rm -f clean_text/tarantino.txt # -f: no error if the file does not exist yet
mv *.txt clean_text
cat clean_text/*.txt > tarantino.txt
mv tarantino.txt clean_text
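
Most of the scripts above repeat the same three steps: strip leading whitespace, drop lines with no alphanumeric characters, and put a blank line back above lines that contain no lowercase letters. For readers who are less comfortable with sed and awk, here is a rough Python sketch of those steps. The function names and the sample lines are made up for illustration; the notebook itself only uses the bash cell above.

```python
import re

def strip_leading_whitespace(lines):
    # Same idea as: sed 's/^[ \t]*//'
    return [line.lstrip(" \t") for line in lines]

def keep_alphanumeric_lines(lines):
    # Same idea as: awk '$0 ~ /[a-zA-Z0-9]/{print}'
    return [line for line in lines if re.search(r"[a-zA-Z0-9]", line)]

def blank_line_before_headings(lines):
    # Same idea as the awk one-liner that prints an empty line above
    # any line without lowercase letters (scene headings, character names)
    out = []
    for line in lines:
        if not re.search(r"[a-z]", line):
            out.append("")
        out.append(line)
    return out

# Hypothetical sample of raw PDF-to-text output
raw = ["    EXT. PATIO - DAY", "", "  ...", "\tCOL. LANDA", "  Bonjour, monsieur."]
cleaned = blank_line_before_headings(
    keep_alphanumeric_lines(strip_leading_whitespace(raw))
)
print("\n".join(cleaned))
```

Chaining small functions like this mirrors the way the bash cell pipes each script through a few short commands.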

In [2]:
import io

# Read in file
path = "clean_text/tarantino.txt"
with io.open(path, encoding='utf-8') as f:
    text = f.read()
print('corpus length:', len(text))

# Get a sorted list of unique characters 
chars = sorted(list(set(text)))
print('total chars:', len(chars))
print(chars)

corpus length: 1704676
total chars: 82
['\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '–', '—']


So, we have a total of 1.7M characters in the corpus, with 82 unique characters: 52 letters (upper and lower case), the 10 digits 0 through 9, and 20 other characters (newline, space, and common punctuation and symbols). From this, it seems like we did a pretty good job cleaning up the text; apart from the newline, there are no control characters like form feed or carriage return, and none of the other odd symbols that pop up when you convert a PDF to text. One possible improvement would be to combine the em dash, en dash, and hyphen into a single symbol. Then again, those all have different meanings in screenplays, so perhaps it's best to leave them in.
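
As a quick sanity check on those counts, here is a small sketch that classifies a character list into letters, digits, and everything else, and flags any control character other than newline. The `corpus_chars` list is rebuilt by hand from the printed output above, and `summarize_chars` and `merge_dashes` are hypothetical helpers, not part of the notebook.

```python
import string
import unicodedata

def summarize_chars(chars):
    letters = [c for c in chars if c in string.ascii_letters]
    digits = [c for c in chars if c in string.digits]
    other = [c for c in chars if c not in letters and c not in digits]
    # Unicode category "Cc" covers control characters such as form feed
    stray_controls = [
        c for c in other if unicodedata.category(c) == "Cc" and c != "\n"
    ]
    return len(letters), len(digits), len(other), stray_controls

# Character list rebuilt from the printed output above
# (\u2013 is the en dash, \u2014 is the em dash)
corpus_chars = sorted(
    set(string.ascii_letters + string.digits + '\n !"#$%&\'(),-./:;?' + "\u2013\u2014")
)
print(len(corpus_chars), summarize_chars(corpus_chars))

# Optional normalization discussed above: fold both dashes into a hyphen
DASH_TABLE = str.maketrans({"\u2013": "-", "\u2014": "-"})

def merge_dashes(s):
    return s.translate(DASH_TABLE)

print(merge_dashes("wait \u2014 what"))
```

Running `merge_dashes` over the corpus before building `chars` would shrink the vocabulary by two symbols, at the cost of losing the distinction between the three marks.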

#### Section 2: Embedding characters as one-hot vectors

In [3]:
# Section 2: Embedding characters as one-hot vectors

# Create lookup tables that convert indices to characters and back
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# Cut the text into short, overlapping sequences
# seqlen is the number of time steps the LSTM is unrolled over, i.e. how
# much context it sees; the model will receive a sequence of 50 characters
# and try to predict the 51st. step is how far the window moves each time.
seqlen = 50
step = 3
sequences = []
next_chars = []
for i in range(0, len(text) - seqlen, step):
    sequences.append(text[i: i + seqlen])
    next_chars.append(text[i + seqlen])
print('# of sequences:', len(sequences))
print('Note that number of sequences is the total number of training examples.')

# of sequences: 568209
Note that number of sequences is the total number of training examples.
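
To make the windowing concrete, here is the same loop run on a toy string with a shorter window, plus a closed-form count of how many windows a given corpus produces. The toy string and the helper names are illustrative only.

```python
def make_windows(text, seqlen, step):
    # Same loop as the cell above, wrapped in a function
    sequences, next_chars = [], []
    for i in range(0, len(text) - seqlen, step):
        sequences.append(text[i:i + seqlen])
        next_chars.append(text[i + seqlen])
    return sequences, next_chars

def count_windows(n_chars, seqlen, step):
    # Number of start positions in range(0, n_chars - seqlen, step),
    # i.e. ceil((n_chars - seqlen) / step) using integer arithmetic
    return max(0, -(-(n_chars - seqlen) // step))

toy = "PUMPKIN AND HONEY BUNNY"
seqs, nxt = make_windows(toy, seqlen=5, step=3)
for s, n in zip(seqs, nxt):
    print(repr(s), "->", repr(n))

# The closed form agrees with the loop, and with the count printed above
print(count_windows(len(toy), 5, 3) == len(seqs))
print(count_windows(1704676, 50, 3))
```

With `step = 3`, consecutive windows overlap by 47 characters, so a smaller step gives more (but more redundant) training examples.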


In [4]:
import numpy as np

# Create zero-filled arrays of the shape expected by our LSTM
# (the np.bool alias was removed in NumPy 1.24; the built-in bool works everywhere)
x = np.zeros((len(sequences), seqlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

# Iterate through each character vector and change the value at that
# character's index to 1
for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
    
print("Embedded %s character sequences as one-hot vectors." % "{:,}".format(len(sequences)))
print('x is a 3D NumPy array of %d 2D arrays, each of shape %d (sequence length) x %d (# of characters)'
      % (len(sequences), seqlen, len(chars)))
print('y is a 2D NumPy array of shape %d (# of sequences) x %d (# of characters)'
      % (len(sequences), len(chars)))

Embedded 568,209 character sequences as one-hot vectors.
x is a 3D NumPy array of 568209 2D arrays, each of shape 50 (sequence length) x 82 (# of characters)
y is a 2D NumPy array of shape 568209 (# of sequences) x 82 (# of characters)
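
The nested Python loops are easy to read but slow for half a million sequences. One possible alternative, sketched here on a toy vocabulary, is to build an integer index array first and then set all the ones at once with NumPy fancy indexing. The toy data below is made up; the variable names simply mirror the ones used in the notebook.

```python
import numpy as np

# Toy stand-ins for the notebook's chars, sequences and next_chars
chars = sorted(set("hello world"))
char_indices = {c: i for i, c in enumerate(chars)}
sequences = ["hello", "lo wo"]
next_chars = [" ", "r"]
seqlen = 5

# Integer codes, shape (n_sequences, seqlen)
idx = np.array([[char_indices[c] for c in s] for s in sequences])

x = np.zeros((len(sequences), seqlen, len(chars)), dtype=bool)
rows = np.arange(len(sequences))[:, None]   # shape (n, 1)
steps = np.arange(seqlen)[None, :]          # shape (1, seqlen)
x[rows, steps, idx] = True                  # broadcasts to (n, seqlen)

y = np.zeros((len(sequences), len(chars)), dtype=bool)
y[np.arange(len(sequences)), [char_indices[c] for c in next_chars]] = True

print(x.shape, y.shape)
# Decoding the first sequence back recovers the original text
print("".join(chars[i] for i in x[0].argmax(axis=1)))
```

Every time step ends up with exactly one `True`, which is what makes the argmax decoding in the last line work.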


In [5]:
# Split dataset into 80% train, 20% test
# If we were developing this for real, we would want to have separate train, test, and dev sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

So, we now have separate training and test sets. The full dataset has 568,209 examples, each with 50 time steps and 82 features per step; about 80% of them go to the training set and the rest to the test set. The LSTM reads a one-hot vector of length 82 at each of the 50 steps and, after the 50th, outputs a vector of length 82 with one value per character, which is trained against the one-hot encoding of the 51st character.
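
For a sense of scale, here is a back-of-the-envelope calculation of the split sizes and the memory taken by `x`. It assumes that `train_test_split` rounds the test share up to a whole example and that NumPy stores each boolean in one byte; both are assumptions about the libraries rather than something measured in this notebook.

```python
import math

n_examples, seqlen, n_chars = 568209, 50, 82
test_size = 0.2

# Assumption: the test share is rounded up, the remainder goes to training
n_test = math.ceil(test_size * n_examples)
n_train = n_examples - n_test

# Assumption: one byte per boolean element
x_bytes = n_examples * seqlen * n_chars

print("train:", n_train, "test:", n_test)
print("x takes roughly", round(x_bytes / 1e9, 2), "GB")
```

Because the split copies the data, peak memory during `train_test_split` can be noticeably higher than the size of `x` alone.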

#### Section 3: Training the model and generating text

In [6]:
%%bash
# First, let's extract the .hdf5 weights file (at 100.94 MB, it's just over GitHub's 100 MB file size limit)
tar -xzvf output/weights.tar.gz

output/lstm_02d_2l_512n-Bi_best_weights.09-0.59.hdf5


In [7]:
# Suppress tensorflow future warnings (or downgrade NumPy to 1.16)
import warnings
warnings.filterwarnings('ignore')

# Import user-defined module
from src.charLSTM import textGenModel

# Define the model hyperparameters
#     layers (# number of LSTM layers)
#     hidden_nodes (# of nodes in each LSTM layer)
#     bidirectional (bool of whether or not to use a bidirectional LSTM)
#     dropout (recurrent dropout to use in each LSTM layer)
lstm = textGenModel(chars, layers=2, dropout=0.2, hidden_nodes=512, 
                    name="lstm_02d_2l_512n-Bi", bidirectional=True)

# To train from scratch instead of loading saved weights, we would run the
# line below (run time will vary depending on your machine):
#      lstm.fit(x_train, y_train, validation_data=(x_test[:10000, :, :], y_test[:10000, :]))

# Define certain values for the class and load trained model weights
lstm.seqlen = 50
lstm.model = lstm.build_model()
lstm.load_params("output/lstm_02d_2l_512n-Bi_best_weights.09-0.59.hdf5")

Using TensorFlow backend.


Building model with following parameters...

Layers:  2
Bidirectional:  True
Hidden Nodes:  512
Dropout:  0.2





Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.

Loaded a keras model with the following parameters:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_1 (Bidirection (None, 50, 1024)          2437120   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 1024)              6295552   
_________________________________________________________________
dense_1 (Dense)              (None, 82)                84050     
Total params: 8,816,722
Trainable params: 8,816,722
Non-trainable params: 0
_________________________________________________________________
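
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and that a bidirectional layer simply holds two such LSTMs whose outputs are concatenated; the helper names are ours.

```python
def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias) per unit
    return 4 * units * (input_dim + units + 1)

def bidirectional_lstm_params(input_dim, units):
    # One forward and one backward LSTM
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

n_chars, hidden = 82, 512

layer1 = bidirectional_lstm_params(n_chars, hidden)
# The second layer sees the concatenated forward/backward outputs (2 * 512)
layer2 = bidirectional_lstm_params(2 * hidden, hidden)
output = dense_params(2 * hidden, n_chars)

print(layer1, layer2, output, layer1 + layer2 + output)
```

Most of the weights sit in the second bidirectional layer, because its input is 1,024 wide rather than 82.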


In [8]:
# Now, feed in a 50-character seed and generate text
# (stored in `generated` so the corpus held in `text` is not overwritten)
seed = "r two years.\nYou're married. You killed a guy.\n\nCL"
generated = lstm.generate_text(seed, diversity=0.5, genlen=300)
print(seed + generated)

r two years.
You're married. You killed a guy.

CLARENCE
What happened to die?

JACKIE
I think I thought I knew what we were doin'
it with a bunch of this asses on
to the table. You could be seen a couple of 
funny. I'm gonna be a keeping to fifty
hundred needed it.

DJANGO
I don't think I was sitting to it.

JACKIE
What do you think we're gonna ma
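
The internals of `generate_text` are not shown in this notebook. A common way to implement a `diversity` setting in character-level generators is temperature sampling, so the sketch below assumes that is what it does; `apply_temperature` and `sample_index` are hypothetical names and the probabilities are made up.

```python
import math
import random

def apply_temperature(probs, temperature):
    # Rescale log-probabilities by 1/temperature, then renormalize
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, temperature, rng=random):
    # Draw one character index from the rescaled distribution
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]

# Made-up model output over a 3-character vocabulary
probs = [0.6, 0.3, 0.1]
for t in (0.2, 0.5, 1.0, 1.5):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])

print(sample_index(probs, 0.5, random.Random(0)))
```

Under this assumption, values below 1 (such as the 0.5 used above) sharpen the distribution toward the most likely character, while values above 1 flatten it and produce more surprising, and more error-prone, text.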
