#### RNN Shakespeare Assignment 

Following tutorial from: https://towardsdatascience.com/create-your-own-artificial-shakespeare-in-10-minutes-with-natural-language-processing-1fde5edc8f28

In [1]:
import tensorflow as tf
import numpy as np
import os

In [2]:
cwd = os.getcwd()
path_to_file = os.path.join(cwd, 'Shakespeare.txt')

In [3]:
# Read the file as raw bytes; the with-block closes the file afterwards
with open(path_to_file, 'rb') as f:
  shakespeare = f.read()

print('Total number of characters is:', len(shakespeare))
print('The first 100 characters are as follows:\n', shakespeare[:100])

Total number of characters is: 5461536
The first 100 characters are as follows:
 b'1609 \n \nTHE SONNETS \n \nby William Shakespeare \n \n \n \n                     1 \n  From fairest creature'


The file was read as raw bytes, so the output above is a bytes literal in which newlines appear as `\n` escapes.
Decoding the bytes as UTF-8 gives an ordinary string that is readable by a human and can be processed character by character.

In [4]:
shakespeare = shakespeare.decode(encoding = 'utf-8')
print('The first 100 characters are as follows:\n', shakespeare[:100])

The first 100 characters are as follows:
 1609 
 
THE SONNETS 
 
by William Shakespeare 
 
 
 
                     1 
  From fairest creature


The text file contains over five million characters, but only a small set of distinct ones. That set is the vocabulary the model will predict from, so we need to know how many unique characters there are.

In [5]:
shakespeare_unique_characters = sorted(set(shakespeare))
print('The number of unique characters is', len(shakespeare_unique_characters))

The number of unique characters is 83


Next, we have to create a mapping from the characters to indices. The unique characters are also copied to a NumPy array, so that an index can be mapped back to its character, and the whole text is converted to a vector of integer indices.

In [6]:
char_ind_map = {u:i for i, u in enumerate(shakespeare_unique_characters)}
unique_array = np.array(shakespeare_unique_characters)
text_vector = np.array([char_ind_map[c] for c in shakespeare])
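
To make the mapping concrete, here is a minimal sketch of the same encode and decode round trip on a toy string. The names `sample_text`, `sample_vocab`, and `sample_char_to_ind` are made up for this illustration and are not part of the tutorial; plain Python lists stand in for the NumPy arrays.

```python
# Toy corpus standing in for the Shakespeare text (an assumption for illustration).
sample_text = "to be or not to be"

# Same construction as above: sorted unique characters, then character -> index.
sample_vocab = sorted(set(sample_text))
sample_char_to_ind = {ch: i for i, ch in enumerate(sample_vocab)}

# Encode the text as a list of integer indices.
encoded = [sample_char_to_ind[ch] for ch in sample_text]

# Decode by looking each index up in the vocabulary list.
decoded = "".join(sample_vocab[i] for i in encoded)

print(sample_vocab)
print(encoded)
print(decoded == sample_text)
```

Because the vocabulary is sorted, the same text always produces the same indices, which matters when a saved model is reloaded later.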

Next, we have to create the training dataset: a stream of character indices, cut into chunks of `seq_length + 1` characters each.

In [7]:
char_df = tf.data.Dataset.from_tensor_slices(text_vector) 
seq_length = 100 
sequence = char_df.batch(seq_length+1, drop_remainder=True) 
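
As a rough picture of what `batch(seq_length+1, drop_remainder=True)` does to the index vector, the sketch below cuts a short list into consecutive, non-overlapping chunks and drops the incomplete tail. The helper `chunk_sequence` and the toy values are hypothetical and use plain Python rather than `tf.data`.

```python
def chunk_sequence(values, chunk_size):
    """Split values into consecutive, non-overlapping chunks, dropping any short tail."""
    n_chunks = len(values) // chunk_size
    return [values[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]

# Ten "character indices" and a toy sequence length of 3,
# so each chunk holds seq_length + 1 = 4 items.
toy_vector = list(range(10))
toy_seq_length = 3

chunks = chunk_sequence(toy_vector, toy_seq_length + 1)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]; the leftover 8, 9 are dropped
```

The extra item in each chunk is what later allows an input and a target of length `seq_length` to be cut from the same chunk.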

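The tutorial uses a mapping function to turn each chunk into an (input, target) tuple for the RNN. The target is the input shifted one character to the right, so at every position the model learns to predict the next character.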

In [8]:
def split_input_target(chunk):
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text

sequence_dataset = sequence.map(split_input_target)
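
The shift is easiest to see on a short string. This sketch applies the same slicing to the word "Hello"; the function name `split_input_target_demo` is made up so that it does not shadow the one above.

```python
def split_input_target_demo(chunk):
    # Input is everything except the last item; target is everything except the first.
    return chunk[:-1], chunk[1:]

demo_input, demo_target = split_input_target_demo("Hello")

# Each input character is paired with the character that follows it.
for step, (inp_char, tgt_char) in enumerate(zip(demo_input, demo_target)):
    print(f"step {step}: input {inp_char!r} -> expected next {tgt_char!r}")
```

With a chunk of `seq_length + 1 = 101` characters, both the input and the target come out 100 characters long.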

Now we shuffle the sequences and group them into training batches.

In [9]:
BUFFER_SIZE = 100000
BATCH_SIZE = 100
updated_dataset = sequence_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

Now we can build the model.

First we must set the parameters: the vocabulary size, the embedding dimension, and the number of GRU units.

In [10]:
char_length = len(shakespeare_unique_characters)
embedding_dim = 400
rnn_units = 1600

Now we can create a function that defines and builds the model, so that the same architecture can be rebuilt later with a different batch size.

In [11]:
def model_func(char_length, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(char_length, embedding_dim,
                              batch_input_shape = [batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences = True,
                        stateful = True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(char_length)
  ])
  return model

In [12]:
model = model_func(
    char_length = char_length,
    embedding_dim = embedding_dim,
    rnn_units = rnn_units, 
    batch_size = BATCH_SIZE)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (100, None, 400)          33200     
_________________________________________________________________
gru (GRU)                    (100, None, 1600)         9609600   
_________________________________________________________________
dense (Dense)                (100, None, 83)           132883    
Total params: 9,775,683
Trainable params: 9,775,683
Non-trainable params: 0
_________________________________________________________________
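
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic, assuming the GRU layer uses the Keras default `reset_after=True`, which keeps two bias vectors per gate; the variable names are for illustration only.

```python
vocab_size = 83        # number of unique characters reported above
embedding_dim = 400
rnn_units = 1600

# Embedding: one vector of size embedding_dim per character.
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates, each with an input kernel, a recurrent kernel,
# and (assuming reset_after=True) two bias vectors.
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: one weight per (unit, character) pair plus one bias per character.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 33200 9609600 132883 9775683
```

Almost all of the weights sit in the GRU layer, which is consistent with training being slow on this model.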


In [13]:
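# The Dense layer outputs raw scores (logits) rather than probabilities,
# so the loss is told to apply the softmax itself via from_logits=True
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer = 'adam', loss = loss)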
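
For intuition about what this loss measures, here is a minimal pure-Python sketch of sparse categorical cross-entropy computed from logits for a single prediction. It is not the Keras implementation; `sparse_ce_from_logits` is a hypothetical helper written for this illustration.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Negative log of the softmax probability assigned to the true label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Equal scores for four classes: the loss is log(4), whatever the label is.
uniform_loss = sparse_ce_from_logits(0, [0.0, 0.0, 0.0, 0.0])

# A large score on the correct class gives a loss close to zero.
confident_loss = sparse_ce_from_logits(0, [10.0, 0.0, 0.0, 0.0])

# A large score on the wrong class is penalised heavily.
wrong_loss = sparse_ce_from_logits(1, [10.0, 0.0, 0.0, 0.0])

print(uniform_loss, confident_loss, wrong_loss)
print(math.log(83))  # loss of a uniform guess over 83 characters
```

A model that guesses uniformly over the 83 characters would score about 4.42, so a training loss well below that indicates the network has learned something about the text.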

In [14]:
# create a new directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# name the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath = checkpoint_prefix,
    save_weights_only = True)

The tutorial called for 30 epochs, but training was taking far too long to run, so I reduced the number of epochs to 6.

In [15]:
history = model.fit(updated_dataset, 
                    epochs = 6, 
                    callbacks = [checkpoint_callback])

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


In [16]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_6'

In [17]:
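# Rebuild the model with batch_size = 1 so it can generate from a single prompt;
# the stateful GRU fixes the batch size when the model is created
model = model_func(char_length, 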
                   embedding_dim, 
                   rnn_units, 
                   batch_size = 1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 400)            33200     
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1600)           9609600   
_________________________________________________________________
dense_1 (Dense)              (1, None, 83)             132883    
Total params: 9,775,683
Trainable params: 9,775,683
Non-trainable params: 0
_________________________________________________________________


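The following function implements the algorithm that generates the text. It feeds the start string to the model, samples the next character from the predicted distribution, and then feeds that character back in as the next input.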

In [23]:
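def generate_text(model, num_generate, temperature, start_string):
  # Convert the start string to indices and add a batch dimension
  input_eval = [char_ind_map[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)
  text_generated = []
  # Clear any state left in the GRU from a previous call
  model.reset_states()

  for i in range(num_generate):
    predictions = model(input_eval)
    # Remove the batch dimension: (1, steps, vocab) -> (steps, vocab)
    predictions = tf.squeeze(predictions, 0)

    # Low temperature gives more predictable text, high temperature more surprising text
    predictions = predictions / temperature
    # Sample from the logits and keep the sample for the last time step
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    # The sampled character becomes the next input; the GRU state carries the context
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(unique_array[predicted_id])

  return (start_string + ''.join(text_generated))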
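
The effect of `temperature` can be seen without a trained model. The sketch below, using only the standard library, scales a toy set of logits by different temperatures, converts them to probabilities, and draws one sample. The helper names and the toy logits are assumptions made for this illustration; the function above relies on `tf.random.categorical` instead.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng):
    """Draw one index with probability given by the temperature-scaled softmax."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(toy_logits, 0.5)     # sharper distribution
neutral = softmax_with_temperature(toy_logits, 1.0)  # plain softmax
hot = softmax_with_temperature(toy_logits, 2.0)      # flatter distribution

print(cold)
print(neutral)
print(hot)

rng = random.Random(0)  # fixed seed so repeated runs draw the same sample
print(sample_index(toy_logits, 1.0, rng))
```

A temperature below 1 concentrates probability on the most likely character, while a temperature above 1 spreads it out; `temperature = 1` leaves the model's distribution unchanged.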


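Set the parameters for the generated text: the number of characters to generate, the sampling temperature, and the start string.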

In [24]:
generated_text = generate_text(
                    model, 
                    num_generate = 1000, 
                    temperature = 1, 
                    start_string = "JULI")

And now we can view our generated text. 

In [25]:
print(generated_text)

JULIO. Madam, how all this strong shrible fire; the greedier chick 
    Wherein great courtesh shall be villain. I am again; in so disgrace 
    Ay, to delibver me to be a king- 
    And I do prize myself to hazard you both, fright it, 
    Or a dog. 
  CLEOPATRA. Consurers in joy get stay in occasions and 
    requests, and died as thus 
   MANTIUS. Quick, th' unmanner's house, her son of Eglamour; for i' th' crimace. 
  FIRST SOLDIER. Wour Angile am I, none, sir; he that stay follows true 
    And suls by us he is eightle 
    Upon the streets, sonuse his wit' kiss us'd upon me here to a fault 
    Should not walk they more now but they are right. 
    I sent this vary wither'd uck for love, 
    For my hand and this time here live and think 
    The beginnant of persant'st her   less action you gain wheart away your wits! He new, 
    And sometime to kiss Hector write at morache; 
    Boansmed in speech, as to see her; we'll keep at poe.   
  Mer. But I am much upon his fortunes. 
 