# Natural Language Processing Project

#### For the course Complete Tensorflow 2 and Keras Deep Learning Bootcamp

In this project we will process the works of Charles Dickens and create a model capable of generating text in the style of the author.

We can obtain any free text from [Gutenberg](https://www.gutenberg.org/).

## Reading the data and creating a vocabulary

In [1]:
# Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

In [2]:
# Reading the data (the context manager closes the file once it has been read)
with open("data/dickens.txt", "r", encoding="utf8") as file:
    text = file.read()

# Let's print the first 500 characters
print(text[:500])

 CHAPTER I.
TREATS OF THE PLACE WHERE OLIVER TWIST WAS BORN AND OF THE
CIRCUMSTANCES ATTENDING HIS BIRTH


Among other public buildings in a certain town, which for many reasons
it will be prudent to refrain from mentioning, and to which I will
assign no fictitious name, there is one anciently common to most towns,
great or small: to wit, a workhouse; and in this workhouse was born; on
a day and date which I need not trouble myself to repeat, inasmuch as
it can be of no possible consequence to t


In [3]:
# How many characters do we have
print(f"We have {len(text)} characters in the text")

We have 9035263 characters in the text


In [4]:
# Let's create a vocabulary with each character in the text.
# Vocabulary is a list

vocabulary = sorted(set(text))
vocabulary

['\n',
 ' ',
 '!',
 '"',
 '&',
 "'",
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '[',
 ']',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '£',
 'æ',
 'ö',
 '—',
 '‘',
 '’',
 '“',
 '”']

In [5]:
# How long is our vocabulary
print(f"Our vocabulary has {len(vocabulary)} unique characters")

Our vocabulary has 89 unique characters


## Text processing

We will create two maps: one from character to index in the vocabulary, and another from index to character.

In [6]:
# Dictionary to obtain the index of a character

character_to_index = {character: index for index, character in enumerate(vocabulary)}
character_to_index

{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 '&': 4,
 "'": 5,
 '(': 6,
 ')': 7,
 '*': 8,
 ',': 9,
 '-': 10,
 '.': 11,
 '/': 12,
 '0': 13,
 '1': 14,
 '2': 15,
 '3': 16,
 '4': 17,
 '5': 18,
 '6': 19,
 '7': 20,
 '8': 21,
 '9': 22,
 ':': 23,
 ';': 24,
 '?': 25,
 'A': 26,
 'B': 27,
 'C': 28,
 'D': 29,
 'E': 30,
 'F': 31,
 'G': 32,
 'H': 33,
 'I': 34,
 'J': 35,
 'K': 36,
 'L': 37,
 'M': 38,
 'N': 39,
 'O': 40,
 'P': 41,
 'Q': 42,
 'R': 43,
 'S': 44,
 'T': 45,
 'U': 46,
 'V': 47,
 'W': 48,
 'X': 49,
 'Y': 50,
 'Z': 51,
 '[': 52,
 ']': 53,
 '_': 54,
 'a': 55,
 'b': 56,
 'c': 57,
 'd': 58,
 'e': 59,
 'f': 60,
 'g': 61,
 'h': 62,
 'i': 63,
 'j': 64,
 'k': 65,
 'l': 66,
 'm': 67,
 'n': 68,
 'o': 69,
 'p': 70,
 'q': 71,
 'r': 72,
 's': 73,
 't': 74,
 'u': 75,
 'v': 76,
 'w': 77,
 'x': 78,
 'y': 79,
 'z': 80,
 '£': 81,
 'æ': 82,
 'ö': 83,
 '—': 84,
 '‘': 85,
 '’': 86,
 '“': 87,
 '”': 88}

In [7]:
# NumPy array to obtain a character from its index (an array, so it can also be indexed with a whole array of indices)

index_to_character = np.array(vocabulary)
index_to_character

array(['\n', ' ', '!', '"', '&', "'", '(', ')', '*', ',', '-', '.', '/',
       '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?',
       'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
       '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
       'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
       'x', 'y', 'z', '£', 'æ', 'ö', '—', '‘', '’', '“', '”'], dtype='<U1')

In [8]:
# Example of use

print(f"Character to index of J: {character_to_index['J']}")
print(f"Index to character of 35: {index_to_character[35]}")

Character to index of J: 35
Index to character of 35: J
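
The two maps are inverses of each other, so encoding a string and decoding the result should give back the original string. The sketch below checks this round trip on a short toy string. The names `sample_text`, `encode` and `decode` are illustrative assumptions and are not part of the notebook.

```python
# Toy text standing in for the Dickens corpus (assumption for illustration)
sample_text = "to wit, a workhouse"

# Same construction as in the notebook, applied to the toy text
sample_vocabulary = sorted(set(sample_text))
sample_char_to_index = {character: index for index, character in enumerate(sample_vocabulary)}


def encode(string):
    """Map every character of the string to its integer index."""
    return [sample_char_to_index[character] for character in string]


def decode(indices):
    """Map every integer index back to its character and join the result."""
    return "".join(sample_vocabulary[index] for index in indices)


encoded = encode(sample_text)
print(encoded)
print(decode(encoded))  # prints the original toy text again
```

Any character that is missing from the vocabulary would raise a `KeyError` in `encode`, which is why the vocabulary is built from the same text that gets encoded.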


## Encoding the text

Let's encode the text using the character_to_index dictionary.

In [9]:
encoded_text = np.array([character_to_index[character] for character in text])
encoded_text

array([ 1, 28, 33, ..., 68, 59,  2])

In [10]:
# How many encoded characters do we have
print(f"We have {len(encoded_text)} encoded characters in the text")

We have 9035263 encoded characters in the text


#### Let's show the text and its encoded version

In [11]:
text[300:500]

'n to most towns,\ngreat or small: to wit, a workhouse; and in this workhouse was born; on\na day and date which I need not trouble myself to repeat, inasmuch as\nit can be of no possible consequence to t'

In [12]:
encoded_text[300:500]

array([68,  1, 74, 69,  1, 67, 69, 73, 74,  1, 74, 69, 77, 68, 73,  9,  0,
       61, 72, 59, 55, 74,  1, 69, 72,  1, 73, 67, 55, 66, 66, 23,  1, 74,
       69,  1, 77, 63, 74,  9,  1, 55,  1, 77, 69, 72, 65, 62, 69, 75, 73,
       59, 24,  1, 55, 68, 58,  1, 63, 68,  1, 74, 62, 63, 73,  1, 77, 69,
       72, 65, 62, 69, 75, 73, 59,  1, 77, 55, 73,  1, 56, 69, 72, 68, 24,
        1, 69, 68,  0, 55,  1, 58, 55, 79,  1, 55, 68, 58,  1, 58, 55, 74,
       59,  1, 77, 62, 63, 57, 62,  1, 34,  1, 68, 59, 59, 58,  1, 68, 69,
       74,  1, 74, 72, 69, 75, 56, 66, 59,  1, 67, 79, 73, 59, 66, 60,  1,
       74, 69,  1, 72, 59, 70, 59, 55, 74,  9,  1, 63, 68, 55, 73, 67, 75,
       57, 62,  1, 55, 73,  0, 63, 74,  1, 57, 55, 68,  1, 56, 59,  1, 69,
       60,  1, 68, 69,  1, 70, 69, 73, 73, 63, 56, 66, 59,  1, 57, 69, 68,
       73, 59, 71, 75, 59, 68, 57, 59,  1, 74, 69,  1, 74])

## Creating Batches

Each training example is a pair of sequences: an input and a target that is the same text shifted one character forward. We will use TensorFlow datasets to build these sequences and to shuffle and batch them.

The batches are the training sequences that the model will use to learn and then generate new text. We have to determine how long those sequences should be.

In [13]:
# Let's observe a part of the text and see if we can get a hold of how the data is structured.

print(text[:2000])

 CHAPTER I.
TREATS OF THE PLACE WHERE OLIVER TWIST WAS BORN AND OF THE
CIRCUMSTANCES ATTENDING HIS BIRTH


Among other public buildings in a certain town, which for many reasons
it will be prudent to refrain from mentioning, and to which I will
assign no fictitious name, there is one anciently common to most towns,
great or small: to wit, a workhouse; and in this workhouse was born; on
a day and date which I need not trouble myself to repeat, inasmuch as
it can be of no possible consequence to the reader, in this stage of
the business at all events; the item of mortality whose name is
prefixed to the head of this chapter.

For a long time after it was ushered into this world of sorrow and
trouble, by the parish surgeon, it remained a matter of considerable
doubt whether the child would survive to bear any name at all; in which
case it is somewhat more than probable that these memoirs would never
have appeared; or, if they had, that being comprised within a couple of
pages, they would h

In [14]:
# Paragraph length

paragraph = '''
Among other public buildings in a certain town, which for many reasons
it will be prudent to refrain from mentioning, and to which I will
assign no fictitious name, there is one anciently common to most towns,
great or small: to wit, a workhouse; and in this workhouse was born; on
a day and date which I need not trouble myself to repeat, inasmuch as
it can be of no possible consequence to the reader, in this stage of
the business at all events; the item of mortality whose name is
prefixed to the head of this chapter.
'''

len(paragraph)

# A paragraph (or maybe half of it) should help the model to understand how Dickens writes, 
# since each paragraph seems to be similar in style and length

524

In [15]:
# Let's define our sequence length as 250
sequence_length = 250

# How many sequences we have in the text
# The plus one is because each chunk needs one extra character: the target is the input shifted one character forward
total_number_sequences = len(text) // (sequence_length + 1)
total_number_sequences

35997
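
The figure above can be checked by hand. The sketch below repeats the division with the character count reported earlier and also shows how many trailing characters do not fit into a full chunk. The names `chunk_length` and `leftover_characters` are introduced here for illustration only.

```python
total_characters = 9035263           # length of the text reported earlier in the notebook
sequence_length = 250
chunk_length = sequence_length + 1   # 250 input characters plus 1 extra for the shifted target

# divmod returns the number of full chunks and the characters left over
total_number_sequences, leftover_characters = divmod(total_characters, chunk_length)

print(total_number_sequences)   # 35997
print(leftover_characters)      # 16
```

The 16 leftover characters are the ones a `drop_remainder=True` batching step would discard.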

Let's create the training sequences using the tensorflow Dataset object. We need to use our encoded_text.

In [16]:
character_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)
character_dataset

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

In [17]:
# The take method of this TensorSliceDataset gives us a dataset with at most the requested number of elements, as in the example below.

for item in character_dataset.take(10): # A dataset with 10 characters
    x = item.numpy()
    print(f"{x} -- {index_to_character[x]}")

1 --  
28 -- C
33 -- H
26 -- A
41 -- P
45 -- T
30 -- E
43 -- R
1 --  
34 -- I


In [18]:
# We can create sequences using the "batch" method of this TensorSliceDataset object
# The batch method combines consecutive elements of the dataset into batches.

# --- plus 1 because each sequence is later split into an input and a target shifted by one character
# --- drop_remainder drops the last batch when it has fewer than batch_size elements
sequences = character_dataset.batch(batch_size=sequence_length + 1, drop_remainder=True)

Let's create a function that gives us the input text and the output text. The output text (target) is the input text shifted one character forward.

In [19]:
def create_sequence_target(sequence):
    input_text = sequence[:-1]       # Hello, my nam
    target_text = sequence[1:]       # ello, my name
    return input_text, target_text

In [20]:
# Then we can have our final dataset using this function.
dataset = sequences.map(create_sequence_target)

In [21]:
# Let's see what this looks like

for input_text, target_text in dataset.take(1):
    print(input_text.numpy()) # How our network sees the dataset
    print("\n")
    print("".join(index_to_character[input_text.numpy()])) # what the dataset looks like to us
    print("\nTarget")
    print(target_text.numpy()) # target that our network sees
    print("\n")
    print("".join(index_to_character[target_text.numpy()])) # target that we see

[ 1 28 33 26 41 45 30 43  1 34 11  0 45 43 30 26 45 44  1 40 31  1 45 33
 30  1 41 37 26 28 30  1 48 33 30 43 30  1 40 37 34 47 30 43  1 45 48 34
 44 45  1 48 26 44  1 27 40 43 39  1 26 39 29  1 40 31  1 45 33 30  0 28
 34 43 28 46 38 44 45 26 39 28 30 44  1 26 45 45 30 39 29 34 39 32  1 33
 34 44  1 27 34 43 45 33  0  0  0 26 67 69 68 61  1 69 74 62 59 72  1 70
 75 56 66 63 57  1 56 75 63 66 58 63 68 61 73  1 63 68  1 55  1 57 59 72
 74 55 63 68  1 74 69 77 68  9  1 77 62 63 57 62  1 60 69 72  1 67 55 68
 79  1 72 59 55 73 69 68 73  0 63 74  1 77 63 66 66  1 56 59  1 70 72 75
 58 59 68 74  1 74 69  1 72 59 60 72 55 63 68  1 60 72 69 67  1 67 59 68
 74 63 69 68 63 68 61  9  1 55 68 58  1 74 69  1 77 62 63 57 62  1 34  1
 77 63 66 66  0 55 73 73 63 61]


 CHAPTER I.
TREATS OF THE PLACE WHERE OLIVER TWIST WAS BORN AND OF THE
CIRCUMSTANCES ATTENDING HIS BIRTH


Among other public buildings in a certain town, which for many reasons
it will be prudent to refrain from mentioning, and to whic
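
To make the relation between input and target easier to see, here is a minimal pure-Python sketch of the same two steps (chunking with the remainder dropped, then shifting) on a toy string. It only imitates what `batch` and `create_sequence_target` do. The names `sample_text` and `toy_sequence_length` are assumptions for illustration.

```python
sample_text = "Hello, my name is Oliver"   # 24 characters
toy_sequence_length = 6
chunk_length = toy_sequence_length + 1      # one extra character for the shifted target

# Cut the text into consecutive chunks and drop the incomplete last one
number_of_chunks = len(sample_text) // chunk_length
chunks = [
    sample_text[i * chunk_length:(i + 1) * chunk_length]
    for i in range(number_of_chunks)
]

# Input is everything but the last character, target is everything but the first
pairs = [(chunk[:-1], chunk[1:]) for chunk in chunks]

for input_text, target_text in pairs:
    print(repr(input_text), "->", repr(target_text))
```

At every position the target holds the character that follows the corresponding input character, which is exactly what the network is trained to predict.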

### Defining parameters

In [22]:
# Batch size is the number of training examples utilized in one iteration
batch_size = 128

# Let's shuffle the sequences so the model does not learn from a particular ordering of the text
buffer_size = 10000 # size of the shuffle buffer, so we don't need all the sequences in memory at the same time

# shuffle keeps a buffer of 10000 sequences and draws from it at random; batch then groups 128 of them
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)

In [23]:
# input sequences -- 128 sequences of sequence_length length # 250
# target sequences -- 128 and sequence_length length # 250
dataset

<BatchDataset element_spec=(TensorSpec(shape=(128, 250), dtype=tf.int32, name=None), TensorSpec(shape=(128, 250), dtype=tf.int32, name=None))>
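
The effect of `buffer_size` can be illustrated without TensorFlow. The sketch below is a simplified emulation of a buffer-based shuffle, not TensorFlow's actual implementation: it fills a buffer, emits a random element from it and puts the next incoming element in its place. The function name `buffered_shuffle` and its `seed` parameter are hypothetical.

```python
import random


def buffered_shuffle(elements, buffer_size, seed=None):
    """Yield the elements in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for element in elements:
        if len(buffer) < buffer_size:
            # Still filling the buffer, nothing is emitted yet
            buffer.append(element)
            continue
        # Buffer is full: emit a random element and replace it with the new one
        position = rng.randrange(buffer_size)
        yield buffer[position]
        buffer[position] = element
    # Input exhausted: emit what is left in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)
```

With a buffer smaller than the dataset the shuffle is only local: an element can never appear more than `buffer_size` positions earlier than its original position. Here the buffer of 10000 is smaller than the 35997 sequences, so a larger buffer would be needed for a uniform shuffle of the whole dataset.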

## Creating the model

We need to define some variables.

In [24]:
# The first one is the vocabulary size, defined by the vocabulary
vocabulary_size = len(vocabulary)
vocabulary_size

89

In [25]:
# The embedding_dimensions variable is for our Embedding layer. Ideally, an embedding captures some of the semantics of the 
# input by placing semantically similar inputs close together in the embedding space.

# This number should be on the same scale as the vocabulary size
embedding_dimensions = 64

In [26]:
# We will use a single layer with a lot of neurons
rnn_neurons = 1026

In [27]:
# Imports

from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

# We use sparse_categorical_crossentropy because each target is a single integer class index
# (one character of the vocabulary), not a one-hot vector.
# This loss function is appropriate when:
#    our classes are mutually exclusive
#    and the labels are integers, which avoids building large one-hot vectors when there are many categories.

In [28]:
# We wrap the sparse categorical loss in our own function because we need to set "from_logits" to True
# This parameter must be True because the output layer has no softmax activation: the model returns raw logits, not probabilities

def sparse_categorical_loss(y_true, y_pred):
    return sparse_categorical_crossentropy(y_true, y_pred, from_logits=True) 
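
As a rough illustration of what this loss computes for a single character, the sketch below applies a softmax to the raw logits and takes the negative log-probability of the true class. It is a plain-Python approximation of the idea, not the Keras implementation, and the function name `sparse_crossentropy_from_logits` is hypothetical.

```python
import math


def sparse_crossentropy_from_logits(label, logits):
    """Negative log of the softmax probability assigned to the integer label."""
    # Log-sum-exp with the maximum subtracted for numerical stability
    highest = max(logits)
    log_sum_exp = highest + math.log(sum(math.exp(value - highest) for value in logits))
    return log_sum_exp - logits[label]


# An untrained model that gives the same score to all 89 characters
uniform_logits = [0.0] * 89
print(sparse_crossentropy_from_logits(35, uniform_logits))   # log(89), about 4.49

# A model that strongly prefers the correct character gets a much lower loss
confident_logits = [0.0] * 89
confident_logits[35] = 10.0
print(sparse_crossentropy_from_logits(35, confident_logits))
```

The value log(89) is a useful reference point: a training loss close to it means the model is doing no better than guessing uniformly over the vocabulary.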

In [29]:
# Function to create the model. It takes
# - the vocabulary size
# - the number of rnn neurons
# - and the batch size
# (the embedding size comes from the global embedding_dimensions defined above)

def create_model(vocabulary_size, rnn_neurons, batch_size):
    
    # Creation of the model
    model = Sequential()
    
    # our first layer is an Embedding layer
    model.add(Embedding(input_dim=vocabulary_size, output_dim=embedding_dimensions, batch_input_shape=[batch_size, None]))
    
    # our second (hidden) layer is a GRU (RNN) layer
    #   return_sequences=True returns the output for every time step, not only the last one
    #   stateful=True means that the states computed for the samples in one batch will be reused as initial states for the
    # samples in the next batch
    #   "glorot_uniform" is used for the recurrent weights instead of the Keras default ("orthogonal")
    model.add(GRU(units=rnn_neurons, return_sequences=True, stateful=True, recurrent_initializer="glorot_uniform"))
    
    # output layer with vocabulary_size neurons, since we need the output to be one value of the vocabulary.
    model.add(Dense(vocabulary_size))
    
    # Compilation of the model
    model.compile(optimizer="adam", loss=sparse_categorical_loss)
    
    return model

In [30]:
# Creation of the model

model = create_model(vocabulary_size=vocabulary_size, rnn_neurons=rnn_neurons, batch_size=batch_size)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (128, None, 64)           5696      
                                                                 
 gru (GRU)                   (128, None, 1026)         3361176   
                                                                 
 dense (Dense)               (128, None, 89)           91403     
                                                                 
Total params: 3,458,275
Trainable params: 3,458,275
Non-trainable params: 0
_________________________________________________________________
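
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the TensorFlow 2 default GRU variant (`reset_after=True`), in which each of the three gates has an input kernel, a recurrent kernel and two bias vectors. The variable names ending in `_parameters` are introduced here for illustration.

```python
vocabulary_size = 89
embedding_dimensions = 64
rnn_neurons = 1026

# Embedding: one vector of 64 values per character in the vocabulary
embedding_parameters = vocabulary_size * embedding_dimensions

# GRU: 3 gates, each with input weights, recurrent weights and two bias vectors
gru_parameters = 3 * (rnn_neurons * (embedding_dimensions + rnn_neurons) + 2 * rnn_neurons)

# Dense: one weight per (neuron, character) pair plus one bias per character
dense_parameters = rnn_neurons * vocabulary_size + vocabulary_size

print(embedding_parameters)   # 5696
print(gru_parameters)         # 3361176
print(dense_parameters)       # 91403
print(embedding_parameters + gru_parameters + dense_parameters)   # 3458275
```

Almost all of the parameters sit in the GRU layer, so `rnn_neurons` is the setting that dominates model size and training time.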


## Training the model

In [31]:
# This can take a while... 
# I trained in Google Colab making use of a GPU.

# model.fit(dataset, epochs=30) 

# I saved the trained model in data/model_dickens.h5

## Loading the model

In [32]:
# Load the model to be able to use it now
model = create_model(vocabulary_size=vocabulary_size, rnn_neurons=rnn_neurons, batch_size=1)
model.load_weights("data/model_dickens.h5") # load weights based on the trained model

# we build the model using the input shape
# We set 1, None because we will have one array of an unknown number of seed characters
model.build(tf.TensorShape([1, None]))

In [33]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (1, None, 64)             5696      
                                                                 
 gru_1 (GRU)                 (1, None, 1026)           3361176   
                                                                 
 dense_1 (Dense)             (1, None, 89)             91403     
                                                                 
Total params: 3,458,275
Trainable params: 3,458,275
Non-trainable params: 0
_________________________________________________________________


## Generation of new text

Let's create a function that allows us to create new text using the model.

In [34]:
# Our function generate_text will use the following parameters
# - model: trained model to generate text
# - start_seed: initial seed (text -- string form)
# - gen_size: number of characters to generate
# - temperature: this parameter affects the randomness of the resulting text (the logits are divided by it).
    # higher temperature => more surprising / less expected
    # lower temperature => less surprising / more expected

def generate_text(model, start_seed: str, gen_size=100, temperature=1.0):

    # number of characters to generate
    num_generate = gen_size

    # vectorizing the starting seed text (using our map from character to index)
    input_eval = [character_to_index[s] for s in start_seed]

    # expand our vector to match the batch format shape
    input_eval = tf.expand_dims(input_eval, 0)

    # this list will hold the generated text
    text_generated = []

    # reset the GRU state so earlier calls do not influence this generation (batch_size is equal to 1)
    model.reset_states()

    for i in range(num_generate):

        # generation of predictions using input_eval
        predictions = model(input_eval)

        # remove the batch shape dimension
        predictions = tf.squeeze(predictions, 0)

        # use a categorical distribution to select the next character
        # so we don't always end up choosing the character with the highest probability
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # modify the input eval to take the predicted character and use it as the next input
        input_eval = tf.expand_dims([predicted_id], 0)

        # transform back to character and save it in the text_generated list
        text_generated.append(index_to_character[predicted_id])

    return (start_seed + "".join(text_generated))
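
To see what dividing the logits by the temperature does, the sketch below converts a toy set of logits into probabilities at three temperatures. It is a plain-Python illustration under assumed values, not part of the notebook: `softmax_with_temperature` and `toy_logits` are hypothetical names.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)   # subtracted for numerical stability
    exponentials = [math.exp(value - highest) for value in scaled]
    total = sum(exponentials)
    return [value / total for value in exponentials]


toy_logits = [2.0, 1.0, 0.1]   # made-up scores for three candidate characters

for temperature in (0.5, 1.0, 2.0):
    probabilities = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

A temperature below 1 concentrates the probability on the highest-scoring character, so the sampled text is more predictable. A temperature above 1 flattens the distribution, so unlikely characters are sampled more often.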

In [35]:
print(generate_text(model, "dear", gen_size=1000))

dear, had not yet dropped
upon her from address her christen Mr. Weller’s pious face appeared to descinate
me what I suffered and hold the money out.’

‘Have you any degraded crease condescended to grom the circumbsy of his
sister’s, and their lips, shows: a fleshy cavern. Such things come, and all
was in them, and to accept him in my education and pleasure), looking at him with
that persict to come off.

‘Do; boy--’

Saturiate to slighted Riderhood’s waist, the whole of the united little children proceeded, and rendered
the coachman to complete the habit of satisfactory conceition, I
went downstairs into my own head and red immense. Cruzz too much
scowning away.’

‘The dead sole hearts, poor little fowl!--and it be my vivid--now my
daughter--’

Here Mr Milvoyth as I sits to grop was foolish.

In short, the idea of all the sight twist into the house, or all that
supervite of its cause of very large or four bill bargained; and the
paper will purchases she would have done to him--very qu