# Dunder Mifflin RNNfinity

**Dunder Mifflin RNNfinity** is a recurrent neural network designed to generate new scripts for The Office. The model is trained on the transcripts of every episode across all 9 seasons of the show (74 hours) and can generate new dialogue and similar-sounding conversations between characters.

This notebook provides a walkthrough of the project, including pre-processing the training data, developing the model architecture, training the model, and using the model to generate text.



## Part 1: Preparing the training data

The scripts from every episode of The Office have already been extracted and cleaned, and are stored in the file `the-office.txt`. This training data still needs to be pre-processed before it can be used to train the model.

In [None]:
import tensorflow as tf
import numpy as np
import os

First, import the dataset from the file.

In [None]:
filename = 'the-office.txt'
# read the raw bytes and decode them as UTF-8; the file is closed automatically
with open(filename, 'rb') as file:
    text = file.read().decode('utf8')

Vectorize the text by converting the strings into a numerical representation.

In [None]:
vocab = sorted(set(text))
char2idx = {u:i for i,u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])
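
As a quick sanity check, the same mapping can be tried on a toy string. This is a minimal sketch in plain Python: the `sample` text and the `_demo` names are made up for illustration and are not part of the notebook's dataset. It mirrors the `char2idx`/`idx2char` construction above and confirms that encoding followed by decoding returns the original text.

```python
# Toy illustration of character-level vectorization (sample text is made up).
sample = "that's what she said"

vocab_demo = sorted(set(sample))                      # unique characters, sorted
char2idx_demo = {ch: i for i, ch in enumerate(vocab_demo)}
idx2char_demo = vocab_demo                            # list index -> character

encoded = [char2idx_demo[ch] for ch in sample]        # text -> integer IDs
decoded = "".join(idx2char_demo[i] for i in encoded)  # integer IDs -> text

print(encoded[:5])
print(decoded == sample)
```

Because the vocabulary is built from the text itself, every character in the training data has an ID, but a prompt containing a character that never appears in the data would raise a `KeyError`.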

Create example input & output sequences of short lengths. For a given input sequence, the corresponding output sequence is just the input sequence shifted by a single character.

In [None]:
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
  
dataset = sequences.map(split_input_target)
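
To make the one-character shift concrete, here is a small sketch that applies the same slicing to a short string instead of a tensor. The chunk text is a hypothetical example; in the notebook each chunk holds `seq_length + 1` integer IDs.

```python
def split_input_target(chunk):
    """Return (input, target), where target is the input shifted by one position."""
    return chunk[:-1], chunk[1:]

chunk = "Bears. Beets."  # hypothetical chunk standing in for seq_length + 1 characters
input_text, target_text = split_input_target(chunk)

# At each position t the model sees input_text[t] and should predict target_text[t].
for inp, tgt in list(zip(input_text, target_text))[:3]:
    print(repr(inp), "->", repr(tgt))
```

Both halves have the same length, which is why each chunk is cut to `seq_length + 1` characters before splitting.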

Produce training batches by shuffling the dataset of example sequences and packing them into batches.

In [None]:
batch_size = 64
buffer_size = 10000
dataset = dataset.shuffle(buffer_size)
dataset = dataset.batch(batch_size, drop_remainder=True)
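
The number of training steps per epoch follows directly from these settings. The sketch below works through the arithmetic with a made-up corpus length; `n_chars` is a hypothetical value, not the real size of `the-office.txt`.

```python
seq_length = 100
batch_size = 64
n_chars = 1_000_000  # hypothetical corpus length in characters

# Each example consumes seq_length + 1 characters (the input plus one shifted target).
n_sequences = n_chars // (seq_length + 1)
# drop_remainder=True discards the final partial batch.
steps_per_epoch = n_sequences // batch_size

print(n_sequences, steps_per_epoch)
```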

## Part 2: Building the model

The next step is to define the RNN's model architecture. This project uses a Sequential model with the following layers:
- an *embedding layer* that maps each character ID in the input batch onto a 256-dimensional embedding vector
- stacked *LSTM layers* that carry a hidden state across time steps, letting the model "remember" earlier characters and context
- a *dense layer* that receives the output of the LSTM layers and returns a logit (an unnormalized likelihood score) for each possible next character

In [None]:
vocab_size = len(vocab) 
embedding_dim = 256
rnn_units = 1024

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,return_sequences=True,stateful=True,recurrent_initializer='glorot_uniform'),
    tf.keras.layers.LSTM(rnn_units,return_sequences=True,stateful=True,recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)])
    return model

model = build_model(vocab_size=vocab_size, embedding_dim=embedding_dim, rnn_units=rnn_units, batch_size=batch_size)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (64, None, 256)           22784     
                                                                 
 lstm (LSTM)                 (64, None, 1024)          5246976   
                                                                 
 lstm_1 (LSTM)               (64, None, 1024)          8392704   
                                                                 
 dense (Dense)               (64, None, 89)            91225     
                                                                 
Total params: 13,753,689
Trainable params: 13,753,689
Non-trainable params: 0
_________________________________________________________________
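
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard Keras parameterization (an LSTM layer has four gates, each with input weights, recurrent weights, and a bias vector) and takes the vocabulary size of 89 from the dense layer's output shape.

```python
vocab_size, embedding_dim, rnn_units = 89, 256, 1024  # values from the summary above

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (input_dim * units + units * units + units)

embedding = vocab_size * embedding_dim
lstm_1 = lstm_params(embedding_dim, rnn_units)   # input is the embedding vector
lstm_2 = lstm_params(rnn_units, rnn_units)       # input is the first LSTM's output
dense = rnn_units * vocab_size + vocab_size      # weights plus biases
total = embedding + lstm_1 + lstm_2 + dense

print(embedding, lstm_1, lstm_2, dense, total)
```

The second LSTM layer is larger than the first only because its input is 1024-dimensional rather than 256-dimensional.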


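Configure the model for training with the Adam optimizer and a sparse categorical cross-entropy loss. Because the dense layer returns raw logits rather than probabilities, the loss is computed with `from_logits=True`.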

In [None]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
    
model.compile(optimizer='adam', loss=loss)
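
For intuition, the quantity this loss measures at a single position can be written out in plain Python. This is an illustrative sketch, not the Keras implementation: `sparse_ce_from_logits` is a hypothetical helper and the logits are made-up scores over a three-character vocabulary.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy for one position: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

logits = [2.0, 0.5, -1.0]  # hypothetical scores over a 3-character vocabulary
loss_when_right = sparse_ce_from_logits(0, logits)  # true character has the top score
loss_when_wrong = sparse_ce_from_logits(2, logits)  # true character has the lowest score

print(loss_when_right, loss_when_wrong)
```

The label is passed as an integer ID rather than a one-hot vector, which is what "sparse" refers to and why `text_as_int` can be used as the target directly.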

## Part 3: Training the model

Train the model for 50 epochs and save a checkpoint of the weights to disk after each epoch. This may take a few hours, so maybe watch The Office while you wait.

In [None]:
!mkdir -p training
checkpoint_dir = './training'
checkpoint_prefix = os.path.join(checkpoint_dir,"checkpoint_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix, save_weights_only=True)

epochs = 50
history = model.fit(dataset, epochs=epochs, callbacks=[checkpoint_callback])

Rebuild the model with a batch size of 1, so that it can generate text from a single prompt, and restore the weights from the last checkpoint. Then, save the weights to disk.


In [None]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint("./training"))
model.build(tf.TensorShape([1, None]))

model.save_weights('./training/training_data')

## Part 4: Generate text

The model is now trained and ready to be used to generate text. Just set the following parameters:

- `start_string`: str; a prompt to start with and set the context of the text generation
- `num`: int; the number of characters to generate
- `temp`: float; the "temperature" of the text generated. The lower the value, the more predictable the text will be. The higher the value, the more surprising (and random) the text will be.
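
The effect of the temperature can be seen by dividing a set of logits by `temp` before converting them to probabilities, which is the distribution the generation code samples from. The sketch below uses plain Python; `softmax_with_temperature` is a hypothetical helper and the logits are made-up scores for three characters.

```python
import math

def softmax_with_temperature(logits, temp):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temp for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
for temp in (0.25, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At low temperatures almost all of the probability mass lands on the highest-scoring character; at high temperatures the distribution flattens, so less likely characters are sampled more often.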


In [None]:
def generate_text(model, start_string, num = 1000, temp = 0.50):
    # converting start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    # empty list to store our results
    text_generated = []
    model.reset_states() # reset context from previous use
    for i in range(num):
        predictions = model(input_eval)
        # remove batch dimension
        predictions = tf.squeeze(predictions, 0)
        # using a categorical distribution to predict the character returned by the model
        predictions = predictions / temp
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
        # pass predicted character as next input to the model along with the previous (hidden) state
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return (start_string + ''.join(text_generated))

print(generate_text(model, start_string="JIM:\n", num=1000, temp=0.25))

## Part 5 (Optional): Exporting the model

Export the model to be used in your application.

Warning: *This step will likely require restarting the runtime and importing an earlier version of TensorFlow in order to use some deprecated functionality. Some of the previous cells may also need to be rerun to prevent errors.*


In [None]:
%tensorflow_version 1.x

import tensorflow as tf
import numpy as np
import os

tf.enable_eager_execution()
tf.keras.backend.clear_session()

In [None]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights('./training/training_data')
model.build(tf.TensorShape([1, None]))
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

!mkdir -p saved_model
tf.keras.experimental.export_saved_model(model,'python_model')

tf.compat.v1.disable_eager_execution()
tf.keras.backend.clear_session()
new_model=tf.keras.experimental.load_from_saved_model('python_model')

new_model.summary()

In [None]:
!pip install tensorflowjs
import tensorflowjs as tfjs

tfjs.converters.save_keras_model(new_model,'./javascript_model/model')
!zip -r javascript_model.zip javascript_model