### This program trains a character-level recurrent neural network on the complete Harry Potter text so that it can generate new text in a similar style.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import StringLookup,Embedding,LSTM,GRU,Dense,BatchNormalization
from tensorflow.keras import Input
import tensorflow_probability as tfp
import docx
import random
from pathlib import Path
import os

cd = Path.cwd()
filepath = os.path.join(cd,r'OneDrive\Desktop\Datasets\complete_harry_potter.docx')

doc = docx.Document(filepath)

full_text = []

for paragraph in doc.paragraphs:
    text = paragraph.text
    # Skip all-uppercase lines, author lines, empty paragraphs, and page markers.
    if not text.isupper() and 'J.K. Rowling' not in text and text != '' and 'Page | ' not in text:
        full_text.append(text)
full_text = '\n'.join(full_text)
full_text = full_text.replace('\n','')

### The complete Harry Potter text is tokenized at the character level, meaning that each individual character (letter, digit, punctuation mark, etc.) receives its own embedding vector.

In [2]:
unique_characters = sorted(set(full_text))
vocab_size = len(unique_characters)
print('There are {} unique characters. They are: '.format(vocab_size),end='')
print(unique_characters)

char_tokenizer = StringLookup(vocabulary=unique_characters)

detokenizer = StringLookup(vocabulary=char_tokenizer.get_vocabulary(),
                          invert=True)

split_text = tf.strings.unicode_split(full_text,'UTF-8')
tokenized_text = char_tokenizer(split_text)

There are 86 unique characters. They are: [' ', '!', '"', '%', '&', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '>', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '~', '–', '—', '‘', '’', '“', '”']
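
As a rough illustration of what the character-level lookup does, the sketch below rebuilds the mapping in plain Python. It assumes the default `StringLookup` behaviour of reserving index 0 for out-of-vocabulary characters, which is consistent with the 69,600 embedding parameters in the model summary (87 rows of 800 values for 86 characters). The helper names `build_vocab`, `tokenize`, and `detokenize` are hypothetical and are not part of the notebook.

```python
def build_vocab(text):
    """Return the sorted unique characters of `text`."""
    return sorted(set(text))


def tokenize(text, vocab):
    """Map each character to an integer id; unknown characters map to 0."""
    # Ids start at 1 because index 0 is reserved for unknown characters
    # (an assumption mirroring StringLookup's default configuration).
    char_to_id = {ch: i + 1 for i, ch in enumerate(vocab)}
    return [char_to_id.get(ch, 0) for ch in text]


def detokenize(ids, vocab, unknown="[UNK]"):
    """Invert `tokenize`; id 0 (or any unseen id) becomes the `unknown` marker."""
    id_to_char = {i + 1: ch for i, ch in enumerate(vocab)}
    return "".join(id_to_char.get(i, unknown) for i in ids)


sample = "Harry Potter"
vocab = build_vocab(sample)
ids = tokenize(sample, vocab)
print(vocab)
print(ids)
print(detokenize(ids, vocab))
```

Round-tripping a string through `tokenize` and `detokenize` should return the original text as long as every character is in the vocabulary.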


### Once tokenized, the text is divided into overlapping sequences of characters whose length is set by the variable sequence_length; consecutive sequences start shift characters apart. The sequences are then shuffled with a fixed seed, so that the same shuffle (and therefore the same training set) can be reproduced across runs, and split into training and validation datasets.

In [3]:
shift = 12
sequence_length = 301
sequences = []
start_point = 0
while True:
    sequence = tokenized_text[start_point:start_point+sequence_length]
    sequences.append(sequence)
    start_point += shift
    if start_point + sequence_length >= len(tokenized_text):
        break
        
sequences = np.array(sequences)
np.random.seed(100)  # seed NumPy's global RNG so the shuffle is reproducible (np.random.seed returns None)
np.random.shuffle(sequences)

validation_size = int(len(sequences)*.03)
train_sequences = sequences[:-validation_size]
validation_sequences = sequences[-validation_size:]
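
The windowing and split can be tried on a toy example. The sketch below is a plain-Python stand-in, not the notebook's exact loop: it keeps every full-length window, uses Python's `random.Random` instead of NumPy for the seeded shuffle, and the function names `make_windows` and `shuffle_and_split` are hypothetical.

```python
import random


def make_windows(tokens, sequence_length, shift):
    """Cut `tokens` into overlapping windows that start every `shift` positions."""
    windows = []
    start = 0
    # Keep only windows that fit entirely inside the token list.
    while start + sequence_length <= len(tokens):
        windows.append(tokens[start:start + sequence_length])
        start += shift
    return windows


def shuffle_and_split(windows, validation_fraction, seed):
    """Shuffle reproducibly, then hold out the last fraction for validation."""
    shuffled = list(windows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    split_at = len(shuffled) - n_val
    return shuffled[:split_at], shuffled[split_at:]


tokens = list(range(20))
windows = make_windows(tokens, sequence_length=5, shift=3)
train, val = shuffle_and_split(windows, validation_fraction=0.25, seed=100)
print(len(windows), len(train), len(val))
```

Calling `shuffle_and_split` again with the same seed returns the same split, which is the property the seeded shuffle is meant to provide.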

### The training and validation sequences are turned into dataset objects in order to create a pipeline that feeds the model. This process splits each sequence into an input, which is the entire sequence except the last character, and an output, which is the entire sequence except the first character, so that each output position holds the character that follows the corresponding input position.

In [4]:
batch_size = 128

def make_dataset(sequence_list):
    np.random.shuffle(sequence_list)
    dataset = tf.data.Dataset.from_tensor_slices(sequence_list)
    dataset = dataset.map(lambda sequence: (sequence[:-1],sequence[1:]))
    dataset = dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return dataset

train_dataset = make_dataset(train_sequences)
validation_dataset = make_dataset(validation_sequences)
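
To make the one-character offset concrete, here is a minimal plain-Python sketch of the same split applied to a short string. The function name `split_input_target` is hypothetical; it mirrors the lambda passed to `dataset.map` above.

```python
def split_input_target(sequence):
    """Input drops the last element; target drops the first.

    As a result, target[i] is the element that follows input[i].
    """
    return sequence[:-1], sequence[1:]


text = "Hogwarts"
input_text, target_text = split_input_target(text)
print(input_text)
print(target_text)

# With sequence_length = 301, both halves contain 300 tokens.
long_input, long_target = split_input_target(list(range(301)))
print(len(long_input), len(long_target))
```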

### Here are some examples of what inputs and outputs look like:

In [5]:
print('Examples of inputs and outputs:\n')
t = 12
for batch in iter(train_dataset):
    if random.randint(0,250) == 1:
        num = random.randint(0,batch_size-1)
        input_example = tf.strings.reduce_join(detokenizer(batch[0][num]),
                                             axis=-1).numpy().decode()
        output_example = tf.strings.reduce_join(detokenizer(batch[1][num]),
                                             axis=-1).numpy().decode()
        print('Input example:\n',input_example)
        print('\nOutput example:\n',output_example)
        print('-'*40)
        print('-'*40)
        print()
        t += 1
    if t==1:
        break

Examples of inputs and outputs:

Input example:
 readful teacher she is, and how we’re not going to learn any defense from her at all,” said Hermione. “Well, what can we do about that?” said Ron, yawning. “ ’S too late, isn’t it? She got the job, she’s here to stay, Fudge’ll make sure of that.” “Well,” said Hermione tentatively. “You know, I was t

Output example:
 eadful teacher she is, and how we’re not going to learn any defense from her at all,” said Hermione. “Well, what can we do about that?” said Ron, yawning. “ ’S too late, isn’t it? She got the job, she’s here to stay, Fudge’ll make sure of that.” “Well,” said Hermione tentatively. “You know, I was th
----------------------------------------
----------------------------------------

Input example:
 y. “Charms, Defense Against the Dark Arts, Herbology, Transfiguration ... all fine. I must say, I was pleased with your Transfiguration mark, Potter, very pleased. Now, why haven’t you applied to continue with Potions? I thought it 

### The model uses an embedding layer to embed the vocabulary (the list of characters) in a vector space whose dimensionality is given by the variable embedding_dim. It then uses two LSTM layers and one GRU layer, with a batch normalization layer after each one. Each LSTM layer returns its full sequence of hidden states together with its final hidden state and cell state; the GRU layer returns its full sequence of hidden states and its final hidden state.
### All too often, text generation gets stuck in a loop, repeating a short sequence of words or characters endlessly. To avoid this, the model learns a probability distribution over the next character instead of making a deterministic prediction, and characters are then randomly sampled from that distribution. This stochastic approach provides the randomness that lets generation escape such loops. Another advantage of learning a probability distribution is that the negative log likelihood, which measures how likely the true output is under the model's predicted distribution, can be used as the loss function.
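
The difference between a deterministic choice and sampling, and the meaning of the negative log likelihood, can be sketched without TensorFlow. The example below is only an illustration in plain Python: the logits are made up, and `softmax`, `sample_index`, and `negative_log_likelihood` are hypothetical helpers standing in for what the `OneHotCategorical` layer and the `nll` loss do in the model.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, rng):
    """Draw one index at random, weighted by `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def negative_log_likelihood(probs, true_index):
    """Loss for one prediction: small when the true index is likely."""
    return -math.log(probs[true_index])


logits = [2.0, 1.0, 0.1]  # made-up scores for a three-character vocabulary
probs = softmax(logits)
rng = random.Random(0)

greedy = max(range(len(probs)), key=probs.__getitem__)  # always the same index
samples = [sample_index(probs, rng) for _ in range(1000)]  # varies between draws

print(greedy)
print({i: samples.count(i) for i in sorted(set(samples))})
print(negative_log_likelihood(probs, 0), negative_log_likelihood(probs, 2))
```

The greedy choice always returns the most probable index, whereas the sampled indices cover less likely characters too, which is what lets generation leave a repeating pattern.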

In [6]:
vocab_length = len(char_tokenizer.get_vocabulary())
embedding_dim = 800
input_length = sequence_length - 1
layer_size = 1200
states = None
tfd = tfp.distributions
tfpl = tfp.layers

inputs = keras.Input(shape=(None,))
embedding = Embedding(input_dim=vocab_length,
                     output_dim=embedding_dim,
                     input_length=input_length)(inputs)
X = BatchNormalization()(embedding)  # only used below to build the initial state; the first LSTM is fed `embedding`

lstm = LSTM(layer_size,
           return_sequences=True,
           return_state=True)

if states is None:
    states = lstm.get_initial_state(X)

X,hidden_state,cell_state = lstm(embedding,initial_state=states)
X = keras.layers.BatchNormalization()(X)

states = [hidden_state,
          cell_state]
X,hidden_state,cell_state = LSTM(layer_size,
                                 return_sequences=True,
                                 return_state=True)(X,initial_state=states)
X = BatchNormalization()(X)
states = [hidden_state,
          cell_state]
X,cell_state = GRU(layer_size,
                  return_sequences=True,
                  return_state=True)(X,initial_state=states[1])
X = BatchNormalization()(X)
states = [hidden_state,
         cell_state]
X = Dense(tfpl.OneHotCategorical.params_size(vocab_length))(X)
outputs = tfpl.OneHotCategorical(event_size=vocab_length)(X)

sequence_model = keras.Model(inputs=inputs,
                            outputs=outputs)

learning_rate = 1e-4
optimizer = keras.optimizers.Adam(learning_rate=learning_rate)

nll = lambda y_true,y_pred: -y_pred.log_prob(tf.one_hot(y_true,depth=vocab_length))

sequence_model.compile(loss=nll,
                      optimizer=optimizer,
                      metrics=['accuracy'])

sequence_model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding (Embedding)          (None, None, 800)    69600       ['input_1[0][0]']                
                                                                                                  
 batch_normalization (BatchNorm  (None, None, 800)   3200        ['embedding[0][0]']              
 alization)                                                                                       
                                                                                                  
 tf.compat.v1.shape (TFOpLambda  (3,)                0           ['batch_normalization[0][0]']

### The model was trained for more than 40 hours on a GPU. The resulting weights were then downloaded and are loaded here. A small sample of the validation dataset is used to evaluate the model.

In [7]:
weights_path = os.path.join(cd,r'OneDrive\Desktop\Datasets\final-weights\harry-potter-weights.h5')
sequence_model.load_weights(weights_path)

sample_size = 512
test_sequences = random.sample(list(validation_sequences),sample_size)
test_sequences = np.array(test_sequences)
test_dataset = make_dataset(test_sequences)
sequence_model.evaluate(test_dataset)



[0.09840431809425354, 0.9711328148841858]

In [8]:
def generate_letter(seed,model):
    seed = seed.replace('\n','')
    split_seed = tf.strings.unicode_split(seed,'UTF-8')
    tokenized_seed = char_tokenizer(split_seed)
    expanded = tf.expand_dims(tokenized_seed,axis=0)
    predictions = model.predict(expanded).squeeze()
    tokens = np.argmax(predictions,axis=-1)
    predicted_str = tf.strings.reduce_join(detokenizer(tokens),axis=-1).numpy().decode()
    predicted_letter = predicted_str[-1]
    return predicted_letter

def generate(seed,num_letters=1200,model=sequence_model):
    for i in range(num_letters):
        seed += generate_letter(seed,model)
    return seed
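
The loop in `generate` is autoregressive: each new character is appended to the text, and the extended text becomes the input for the next step. The sketch below shows the same loop shape in plain Python, with a toy bigram sampler standing in for the trained network. The corpus, the bigram model, and the names `fit_bigram_counts`, `sample_next_char`, and `generate_text` are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def sample_next_char(context, counts, rng):
    """Sample the next character given only the last character of `context`."""
    options = counts.get(context[-1])
    if not options:
        # Unseen context: fall back to any character that has a successor.
        return rng.choice(sorted(counts))
    chars = sorted(options)
    weights = [options[c] for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]


def generate_text(seed, num_chars, counts, rng):
    """Append `num_chars` sampled characters to `seed`, one at a time."""
    text = seed
    for _ in range(num_chars):
        text += sample_next_char(text, counts, rng)
    return text


corpus = "Harry looked at Ron. Ron looked at Hermione. Hermione looked at Harry."
counts = fit_bigram_counts(corpus)
generated = generate_text("Ha", 20, counts, random.Random(0))
print(generated)
```

Unlike this toy sampler, the notebook's model conditions on the whole text generated so far, since `generate_letter` passes the full seed to the network at every step.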

### Finally, the model is tested on a number of "seeds", which are short pieces of text used to start the text generation. Each seed is fed into the functions above, which extend it with generated text.

In [9]:
seed = """He glared at Voldemort, staring into his snake-like eyes. Hooded Death Eaters surriounded them, jeering 
and laughing as their master taunted and tortured Harry. """
print(generate(seed))
print('-'*100+'\n')

seed = """Harry gripped his wand tightly, wondering which spell would come in most useful. """
print(generate(seed))
print('-'*100+'\n')

seed = """Harry was growing more and more frustrated by the assignment Snape had set them. How was he supposed to 
focus when he had a Goblin rebellion, a date, and a summons to the Ministry of Magic to worry about? Absentmindedly
 chewing his quill, he thought of what he might tell Sirius, and what Sirius might think. """
print(generate(seed))
print('-'*100+'\n')

seed = """Harry looked down at his History of Magic essay, his quill hanging aimlessly from his hand. 
He couldn't think of how to fill twelve inches of parchment with accounts of Goblin wars. He looked around the common
room where a few fifth years remained huddled over their O.W.L notes near the dying fire."""
print(generate(seed))
print('-'*100+'\n')

He glared at Voldemort, staring into his snake-like eyes. Hooded Death Eaters surriounded them, jeering 
and laughing as their master taunted and tortured Harry. “The servant died when I left his body, and I was left as weak as ever I had been,” Voldemort continued. “I returned to my hiding place far away, and I will not pretend to you that I didn’t then fear that I might never regain my powers. ... Yes, that was perhaps my darkest hour ... I could not hope that I would be sent another wizard to possess ... and I had given up hope, now, that any of my Death Eaters cared what had become of me. ...” One or two of the masked wizards in the circle moved uncomfortably, but Voldemort took no notice. “And then, not even a year ago, when I had almost abandoned hope, it happened at last ... a servant returned to me. Wormtail here, who had faked his own death to escape justice, was driven out of hiding by those he had once counted friends, and decided to return to his master. He sought me in the