# Text Generation: Male Names Generator

The purpose of this Notebook is to build a small, simple **Recurrent Neural Network** to illustrate how text generation works, and how to implement it with TensorFlow. 

We are building a male name generator. For the sake of the example, we have also **manually implemented** all kinds of word **encoding** and decoding needed, even one-hot encoding.

The data used in this notebook has been obtained from Spain's National Statistics Institute, and it corresponds to **males with Spanish citizenship**. For privacy reasons, a name appears on the list only if its frequency is at least 20. The data can be found at https://www.ine.es/tnombres/formGeneral.do

Let's begin with some initial settings:

In [1]:
# Import modules 

import os
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow.keras.layers import SimpleRNN, TimeDistributed, Dense, Masking
from tensorflow.keras.models import load_model

In [2]:
# Set paths and filenames
DATA_PATH = '../data'
OUTPUT_PATH = '../output'
MALE_NAMES_FILEPATH = '../data/male_names.csv'

*Remark:* for clarity, we will use the terms *character* and *letter* interchangeably, depending on the context.

We are going to use one of the most basic RNNs, the **Elman Network**. We are going to train the network to predict the next letter given the letters seen so far. An Elman Network considers the previous inputs for computing the next output, but it does not consider future inputs.

The idea behind a text generator with this network is that, at every timestep, for each character in the vocabulary, we compute the **probability** that it appears next, conditioned on the inputs seen so far.
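
In symbols, writing $c_t$ for the character at timestep $t$ and $T$ for the position of the final character (notation introduced here only for illustration), the probability of a whole name factorizes by the chain rule, and the network is trained to model each factor:

$$P(c_1, c_2, \dots, c_T) = \prod_{t=1}^{T} P(c_t \mid c_1, \dots, c_{t-1})$$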

First, we need to **encode** our inputs so we can feed them to the network:

### Mappings

We are going to map every character of the alphabet to an integer. Here we create the mappings **map_char_to_int** and **map_int_to_char**, which map a character to its integer representation and vice versa. We will also map the **dot** (.), which is not part of any name but will act as our **EOF** (End of File) marker, signaling the end of a name; and a **space**, since some names consist of two words.

In [3]:
# Standard latin alphabet
standard_chars = [chr(i) for i in range(97, 123)]
# Special characters for languages spoken in Spain, the apostrophe, the dot, and the space
special_chars = ['à', 'á', 'è', 'é', 'í', 'ò', 'ó', 'ú', 'ñ', 'ç', "'", '.', ' ']

chars = standard_chars + special_chars

seq = [i for i in range(len(chars))]
map_char_to_int = dict(zip(chars, seq))
map_int_to_char = dict(zip(seq, chars))

In [4]:
print("Mapping of characters to integers:")
print(map_char_to_int)

Mapping of characters to integers:
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25, 'à': 26, 'á': 27, 'è': 28, 'é': 29, 'í': 30, 'ò': 31, 'ó': 32, 'ú': 33, 'ñ': 34, 'ç': 35, "'": 36, '.': 37, ' ': 38}


#### Auxiliary functions for mapping
We now define some helper functions, using the following naming convention:
* Variables ending with `_int` will represent *something* mapped to its integer form.
* Variables ending with `_encoded` will represent *something* one-hot encoded.

In [5]:
def one_hot_encoding(word_int, features):
    '''
    word_int -- array of integers, shape (k,)
    features -- size of vocabulary
    returns -> array with integers one-hot encoded, shape (k, features)
    '''
    k = len(word_int)
    word_encoded = np.zeros((k, features), dtype = 'int8')
    for i in range(k):
        pos = word_int[i]
        word_encoded[i, pos] = 1
    return word_encoded

In [6]:
def one_hot_decoding(word_encoded):
    '''
    word_encoded -- array of shape (k, features)
    returns -> array of shape (k,)
    '''
    a, b = word_encoded.shape
    word_int = np.zeros(a, dtype = 'int32')
    for i in range(a):
        pos = np.argmax(word_encoded[i])
        word_int[i] = pos
    return word_int

In [7]:
def encode_word_to_int(word, mapping):
    '''
    word -- string
    mapping -- dictionary with characters as keys and integers as values
    returns -> array with integers, shape (k,), where k is the length of the word'''
    k = len(word)
    word_int = np.zeros(k, dtype = 'int32')
    for i, c in enumerate(word):
        word_int[i] = mapping[c]
    return word_int

In [8]:
def decode_int_to_word(word_int, mapping):
    '''
    word_int -- array with integers, shape (k,)
    mapping -- dictionary with integers as keys and characters as values
    returns -> string of length k
    '''
    word = ''
    for i in word_int:
        if i in mapping.keys():
            word += mapping[i]
        else:
            word += 'UNK'
    return word

In [9]:
def encode_list(array, mapping):
    '''
    array -- list of words
    mapping -- dictionary with characters as keys and integers as values
    returns -> list of word_encoded elements, each of which has shape (k, n), k being the length of the word'''
    result = []
    for word in array:
        word_int = encode_word_to_int(word, mapping)
        word_encoded = one_hot_encoding(word_int, len(mapping))
        result.append(word_encoded)
    return result

In [10]:
def decode_list(array, mapping):
    '''
    array -- list of word_encoded elements, each of which has shape (k, n), k being the length of the word
    mapping -- dictionary with integers as keys and characters as values
    returns -> list of words'''
    result = []
    for word_encoded in array:
        word_int = one_hot_decoding(word_encoded)
        word = decode_int_to_word(word_int, mapping)
        result.append(word)
    return result
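
As a quick sanity check of the encode/decode round trip, here is a self-contained sketch. It uses a toy vocabulary of the 26 lowercase letters only, and a vectorized one-hot encoding based on `np.eye` as an alternative to the explicit loops above; the names `toy_chars`, `toy_char_to_int`, `toy_int_to_char` and `one_hot_vectorized` are made up for this example.

```python
import numpy as np

# Toy vocabulary: the 26 lowercase letters only (an assumption for this sketch;
# the notebook's vocabulary also contains accented letters, "'", '.', and ' ').
toy_chars = [chr(i) for i in range(97, 123)]
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}

def one_hot_vectorized(word_int, features):
    # Row i of the identity matrix is the one-hot vector for integer i
    return np.eye(features, dtype='int8')[word_int]

word = 'carlos'
word_int = np.array([toy_char_to_int[c] for c in word])
word_encoded = one_hot_vectorized(word_int, len(toy_chars))

# Decoding: argmax over the feature axis recovers the integers
decoded_int = word_encoded.argmax(axis=1)
decoded = ''.join(toy_int_to_char[int(i)] for i in decoded_int)

print(word_int)            # [ 2  0 17 11 14 18]
print(word_encoded.shape)  # (6, 26)
print(decoded)             # carlos
```

The loop-based functions above make every step explicit, which is the point of the notebook; the `np.eye` indexing is just a more compact way to get the same matrix.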

### Load and Map Data
Load names and store them in **male_names_data**

In [11]:
male_names_raw = pd.read_csv(MALE_NAMES_FILEPATH, sep = ';', decimal = ',')
male_names_data = male_names_raw['Nombre'].tolist()

Store the parameters of the model. Here, **samples** are **names**, **inputs** are **characters or letters**, **features** refers to the **number of characters** in the vocabulary, and **timesteps** to the **length of a sample**, that is, the number of inputs in a sample.  
`m:` number of samples  
`n:` number of features, vocabulary size  
`timesteps:` each sample has a different number of timesteps. For computation purposes, we set this to the **maximum length of the samples**.

In [12]:
m = len(male_names_data)
n = len(map_char_to_int)
timesteps = len(max(male_names_data, key = len))

Transform everything to **lowercase**.

In [13]:
male_names_data = [x.lower() for x in male_names_data]

Create matrices $X$ and $y$. At every timestep, we want our network to predict the letter that comes next. We also want it to predict when the name ends, which is indicated by a dot. Thus, $y$ is just $X$ shifted one position to the left (the first letter is dropped) with a dot appended.

In [14]:
X_male = male_names_data
y_male = [name[1:] + '.' for name in X_male]

print("Some input examples:")
print(X_male[:3])
print("\nThe outputs for the examples above:")
print(y_male[:3])

Some input examples:
['antonio', 'jose', 'manuel']

The outputs for the examples above:
['ntonio.', 'ose.', 'anuel.']


Each letter will be encoded as an integer, which in turn, will be one-hot encoded. For example:  

$$ carlos \rightarrow [2, 0, 17, 11, 14, 18] \rightarrow [[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [...], ..., [...]]$$

Remember that we name the first representation with the suffix `_int` and the second with `_encoded`. Let's store both representations:

In [15]:
X_male_int = []
X_male_encoded = []

for i in range(m):
    word = X_male[i]
    
    word_int = encode_word_to_int(word, map_char_to_int)
    X_male_int.append(word_int)
    
    word_encoded = one_hot_encoding(word_int, n)
    X_male_encoded.append(word_encoded)

In [16]:
# Show some examples
print(f"Name Antonio mapped to integer:\n{X_male_int[0]}")
print(f"\nPrevious integers one-hot encoded, which represent Antonio one-hot-encoded:\n{X_male_encoded[0]}")

Name Antonio mapped to integer:
[ 0 13 19 14 13  8 14]

Previous integers one-hot encoded, which represent Antonio one-hot-encoded:
[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0]]


As for the labels, we could have computed them above in the same loop, but let's just use a different way: given that we already have the integers and one-hot encoded representations of each name, we just have to **shift** them one position forward and add the **mapping of the dot**. 

In [17]:
dot_encoded = one_hot_encoding([map_char_to_int['.']], n)
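# word_int is a NumPy array, so `+` would add element-wise; np.append concatenates instead
y_male_int = [np.append(word_int[1:], map_char_to_int['.']) for word_int in X_male_int]
y_male_encoded = [np.concatenate((l[1:], dot_encoded), axis = 0) for l in X_male_encoded]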
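
One detail worth checking when shifting integer arrays: on a NumPy array, `array + [value]` is element-wise addition (with broadcasting), not concatenation as it would be for Python lists. The following sketch, on a toy array holding the integers of *antonio*, shows the difference and uses `np.append` to build the shifted label.

```python
import numpy as np

dot_int = 37                                      # integer assigned to '.' in the mapping above
word_int = np.array([0, 13, 19, 14, 13, 8, 14])   # 'antonio'

# On a NumPy array, `+` broadcasts: 37 is added to every element
added = word_int[1:] + [dot_int]

# np.append returns a new array with the dot's integer placed at the end
shifted = np.append(word_int[1:], dot_int)

print(added)    # [50 56 51 50 45 51]
print(shifted)  # [13 19 14 13  8 14 37]
```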

### Padding
The fact that names have different sizes is something that needs to be addressed for computation purposes. The inputs of our model consist of single characters, but together they form a word, whose length is the number of timesteps. Batch optimization algorithms require that all the samples in a batch have the same number of timesteps.

There are several approaches to handle samples of different lengths; we are going to use **padding** and **masking**.

1. Padding adds characters to the samples so they all have the same length (the length of the longest one will do). We have computed this number at the beginning and named it `timesteps`. TensorFlow provides a built-in function that performs this padding. We are going to pad using 0s, that is, **adding** as many **zero-arrays** of shape (n,) as necessary to each sample.

2. Masking specifies that an input value **should not be considered**, since it is padding. Keras implements this with the `Masking` layer, which has a `mask_value` parameter: when all the input features of a timestep are equal to `mask_value`, that timestep is ignored.

In [18]:
padding = np.zeros(n, dtype = 'int')
X_male_padded = tf.keras.preprocessing.sequence.pad_sequences(X_male_encoded,
                                                            maxlen = timesteps,
                                                            padding = 'post',
                                                            truncating = 'post', 
                                                            value = 0)

In [19]:
y_male_padded = tf.keras.preprocessing.sequence.pad_sequences(y_male_encoded,
                                                       maxlen = timesteps,
                                                       padding = 'post',
                                                       truncating = 'post', 
                                                       value = 0)

In [20]:
X_male_input = X_male_padded
y_male_input = y_male_padded
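
To make the effect of padding and masking concrete, here is a NumPy-only sketch on toy data. It mimics post-padding with zeros and the all-zeros test described above; it is an illustration of the idea, not the Keras implementation, and the toy samples are made up.

```python
import numpy as np

# Two toy one-hot samples with 3 features and different lengths (2 and 4 timesteps)
samples = [np.eye(3, dtype='int8')[[0, 2]],
           np.eye(3, dtype='int8')[[1, 1, 0, 2]]]
max_len = max(len(s) for s in samples)

# Post-padding: copy each sample into the first rows of an all-zeros block
padded = np.zeros((len(samples), max_len, 3), dtype='int8')
for i, s in enumerate(samples):
    padded[i, :len(s)] = s

# A timestep is real if any feature is non-zero; all-zero rows are padding
mask = padded.any(axis=-1)

print(padded.shape)   # (2, 4, 3)
print(mask)
```

Because every real input is one-hot encoded, it always contains a 1, so an all-zeros timestep can only be padding. That is what makes 0 a safe `mask_value` here.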

### Model
We are going to use a very simple RNN, which is known as the **Elman Network**. Keras does not provide this network as a single built-in layer, but it is not difficult to build from existing ones. At each timestep, the Elman Network computes:

$$a_t = \sigma_a(W_x x_t + W_a a_{t-1} + b_a)$$
$$y_t = \sigma_y(W_y a_t + b_y)$$

Where:  
$W_x, W_a, b_a; W_y, b_y$ are the parameters that the network has to learn, and do not depend on the timestep.  
$\sigma_a, \sigma_y$ are activation functions, and are hyperparameters of the model. 

1. As discussed in the padding section, the `Masking` layer ignores all timesteps whose values are all 0s. 
2. Computing $a_t$ is done by the `SimpleRNN` layer, whose output is $a_t$. When `return_sequences` is set to `True`, it returns $\{a_0, a_1, ..., a_t\}$, instead of just $a_t$. We need this for the following layer:
3. At each timestep $t$, the output $y_t$ must be computed from the activation $a_t$ of that timestep only, not from all activations $\{a_0, a_1, ..., a_t\}$ at once. Wrapping the `Dense` layer in `TimeDistributed` applies the same dense layer, with shared weights, to every timestep independently.

*Remarks*:   
1. `input_shape` has the form `(timesteps, features)`. Later on, we will need to predict using unpadded samples of different lengths, so the `timesteps` dimension must be set to `None`.
2. The number of **units** is a hyperparameter of the model.
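
The recurrence defined by the two equations above can be sketched in plain NumPy. This is a minimal forward pass only (no training), assuming $\sigma_a = \tanh$ and $\sigma_y = \text{softmax}$, the activations used in the Keras model of this notebook; the weights are random and the sizes are toy values chosen for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def elman_forward(x_seq, W_x, W_a, b_a, W_y, b_y):
    '''
    x_seq -- array of shape (timesteps, features)
    returns -> array of shape (timesteps, outputs), one probability vector per timestep
    '''
    a = np.zeros(W_a.shape[0])                     # initial hidden state, all zeros
    outputs = []
    for x_t in x_seq:
        a = np.tanh(W_x @ x_t + W_a @ a + b_a)     # a_t
        outputs.append(softmax(W_y @ a + b_y))     # y_t
    return np.array(outputs)

# Toy sizes and random weights (assumptions for illustration only)
rng = np.random.default_rng(0)
features, units = 5, 8
W_x = rng.normal(size=(units, features))
W_a = rng.normal(size=(units, units))
b_a = np.zeros(units)
W_y = rng.normal(size=(features, units))
b_y = np.zeros(features)

x_seq = np.eye(features)[[0, 3, 1]]                # three one-hot inputs
y_seq = elman_forward(x_seq, W_x, W_a, b_a, W_y, b_y)
print(y_seq.shape)         # (3, 5)
print(y_seq.sum(axis=1))   # each row sums to 1
```

Note that the same weight matrices are reused at every iteration of the loop, which is why the parameters do not depend on the timestep.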

#### Option 1: define the model

In [None]:
model = tf.keras.Sequential()
model.add(Masking(input_shape = (None, n),
                  mask_value = 0))
model.add(SimpleRNN(units = 100,
                   return_sequences = True,
                   activation = 'tanh'))
model.add(TimeDistributed(Dense(units = n,
                               activation = 'softmax')))

In [None]:
model.compile(optimizer = 'Adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.summary()

In [None]:
history = model.fit(X_male_input, y_male_input, epochs = 60)
model.save(os.path.join(OUTPUT_PATH, 'model.h5'))

#### Option 2: load an existing trained model

In [21]:
model = load_model(os.path.join(OUTPUT_PATH, 'model.h5'))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking (Masking)            (None, None, 39)          0         
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, None, 100)         14000     
_________________________________________________________________
time_distributed (TimeDistri (None, None, 39)          3939      
Total params: 17,939
Trainable params: 17,939
Non-trainable params: 0
_________________________________________________________________
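
The parameter counts in the summary can be reproduced by hand from the shapes of the matrices in the Elman equations, with 39 features and 100 units:

```python
features = 39   # vocabulary size n
units = 100     # SimpleRNN units

# SimpleRNN: W_x (units x features), W_a (units x units), b_a (units)
rnn_params = units * features + units * units + units

# Dense: W_y (features x units), b_y (features)
dense_params = features * units + features

print(rnn_params, dense_params, rnn_params + dense_params)  # 14000 3939 17939
```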


In [22]:
model.evaluate(X_male_input, y_male_input)



[0.3787724329471588, 0.7473625]

### Performance
Let's check how well our model performs.

In [23]:
y_preds_encoded = model.predict(X_male_input)
y_preds = np.array(decode_list(y_preds_encoded, map_int_to_char))

Let's compare some predictions with the original labels. We must consider that the original labels have different lengths, while our model predicts for a fixed number of timesteps. Since the dot (.) is our EOF, **everything that comes after it should be ignored**. A correct prediction, though, should end with a dot, like the original labels.

In [24]:
wrong_correct_df = pd.DataFrame(zip(y_preds, y_male), columns = ["Prediction", "Original"])
wrong_correct_df

| | Prediction | Original |
|---|---|---|
| 0 | ntonio | ntonio. |
| 1 | ose | ose. |
| 2 | anuel | anuel. |
| 3 | rancisco | rancisco. |
| 4 | anid | avid. |
| ... | ... | ... |
| 4995 | rrerico jarlos....... | ederico carlos. |
| 4996 | rrnando anrusto...... | ernando augusto. |
| 4997 | urardo jrancisco..... | erardo francisco. |
| 4998 | ensai................ | ossam. |


Our inputs are single characters, so every correctly predicted character must be counted as a **success**, regardless of whether the whole sample (name) has been predicted correctly.

In [25]:
mask_correct = []
for i in range(m):
    for j, c in enumerate(y_male[i]):
        mask_correct.append(y_preds[i][j] == c)

Nevertheless, we can also find the samples (names) that have been predicted correctly as a whole (in this case we have chosen not to consider the final dot).

In [26]:
mask_correct_words = np.array([False] * m)
for i in range(m):
    pred = y_preds[i]
    orig = y_male[i][:-1]  # do not consider the final dot
    mask_correct_words[i] = orig in pred
    
correct_preds = np.array(X_male)[mask_correct_words]
correct_preds

array(['antonio', 'jose', 'manuel', 'francisco', 'jose antonio', 'daniel',
       'carlos', 'pedro', 'luis', 'ramon', 'oscar', 'santiago', 'eduardo',
       'victor', 'guillermo', 'tomas', 'hector', 'xavier', 'isaac',
       'benito', 'antoni', 'pedro antonio', 'kevin', 'eduard',
       'luis angel', 'manuel antonio', 'anton', 'carlos antonio', 'xavi',
       'ulises', 'yassin', 'zakaria', 'victor jesus', 'ramon antonio',
       'dani', 'daniel antonio', 'fran', 'eduardo jesus', 'dan', 'santi',
       'carlo', 'tomas antonio', 'guillermo jesus', 'benito jose', 'manu',
       'kevin jesus', 'francis', 'oscar alejandro', 'franc',
       'francisco alexis', 'santiago angel', 'quirino', 'tom', 'willian',
       'isaac jesus', 'edu', 'hector juan'], dtype='<U21')

Compute **Accuracy**. We can also compute the accuracy over words, that is, how many names are predicted correctly as a whole. Measured this way, the accuracy plummets.

In [27]:
acc = sum(mask_correct)/len(mask_correct)
acc_words = sum(mask_correct_words)/len(mask_correct_words)

print("Accuracy (single characters as inputs)")
print("{:.4f}".format(acc))
print("\nAccuracy over words")
print(acc_words)

Accuracy (single characters as inputs)
0.7474

Accuracy over words
0.0114
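
The gap between the two numbers is easier to see on a toy case. The sketch below uses three made-up predictions and labels; as a simplification, the word-level check is exact equality including the dot, rather than the substring test used above.

```python
# Toy predictions and labels (made up for illustration)
preds  = ['ose.', 'anid.', 'uis.']
labels = ['ose.', 'avid.', 'uis.']

# Character-level: every position counts on its own
char_hits = [pc == lc
             for pred, lab in zip(preds, labels)
             for pc, lc in zip(pred, lab)]
char_acc = sum(char_hits) / len(char_hits)

# Word-level: a name counts only if every character matches
word_acc = sum(pred == lab for pred, lab in zip(preds, labels)) / len(labels)

print(round(char_acc, 4), round(word_acc, 4))   # 0.9231 0.6667
```

A single wrong character costs one hit out of thirteen at character level, but a whole name out of three at word level. The longer the names, the more likely it is that at least one character is wrong.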


### Text Generation
The way we can generate text with a neural network trained like this is as follows; let's show it with an example:

If we choose a random letter, let's say *a*, and feed it to the network, we will obtain an output vector of length `n`. These are the probabilities of each character being the second one, given that *a* is the first one.  
Remembering that each position $0, ..., n-1$ represents a character, we can now choose an integer $i$ with probability $y_i$. Let's say we get integer *13*, which corresponds to the letter *n*. Now we feed our model with *an* as input and get the next output. We keep going until we obtain a dot. In the code below, no first letter is given: we start from an all-zeros input.  

In [28]:
x = np.zeros((1, 1, n))
word = ''
c = '-'
while c != '.':
    y = model.predict(x)
    y_n = y[0, -1, :]
    y_n_hat = np.random.choice(range(n), p = y_n)
    y_n_hat_encoded = np.reshape(one_hot_encoding([y_n_hat], n), (1, 1, -1))
    
    c = map_int_to_char[y_n_hat]
    word += c
    x = np.concatenate((x, y_n_hat_encoded), axis = 1)
    
print(word)

pedro jesus.
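
The sampling step can be looked at in isolation. The sketch below draws many indices from a made-up probability vector and checks that the observed frequencies approach the probabilities; the toy vocabulary, the probabilities and the seed are assumptions for illustration, and it uses NumPy's `Generator.choice` rather than the legacy `np.random.choice` called in the cell above.

```python
import numpy as np

rng = np.random.default_rng(42)          # seeded generator, for reproducibility

toy_chars = ['a', 'n', 'o', '.']         # toy vocabulary (assumption)
probs = np.array([0.1, 0.6, 0.2, 0.1])   # made-up network output, sums to 1

# Draw 10,000 indices, each index i with probability probs[i]
draws = rng.choice(len(toy_chars), size=10_000, p=probs)
freqs = np.bincount(draws, minlength=len(toy_chars)) / len(draws)

print(freqs)   # close to probs
```

Since each character is drawn at random, running the generation cell again will usually produce a different name.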


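If we take the most likely output at each timestep instead of choosing the output weighted by probability, the result is no longer subject to any randomness. We can think of it as the **most likely name**, although picking the best character at each step (greedy decoding) is not guaranteed to give the most probable name overall. With our model, and with no first letter given, **Eduardo Jesus** is the name obtained.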

In [31]:
x = np.zeros((1, 1, n))
word = ''
c = '-'
while c != '.':
    y = model.predict(x)
    y_n = y[0, -1, :]
    y_n_hat = np.argmax(y_n)
    y_n_hat_encoded = np.reshape(one_hot_encoding([y_n_hat], n), (1, 1, -1))
    
    c = map_int_to_char[y_n_hat]
    word += c
    x = np.concatenate((x, y_n_hat_encoded), axis = 1)
    
print(word)

eduardo jesus.
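
Both loops above run until a dot is produced, so nothing bounds the length of the generated name. A possible variant wraps the loop in a function with a maximum length. This is only a sketch: `generate`, `predict_fn` and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in for the trained network so that the example runs on its own.

```python
import numpy as np

def generate(predict_fn, chars, max_len=30, greedy=False, rng=None):
    '''
    predict_fn -- hypothetical callable: maps the characters generated so far (a string)
                  to a probability vector over `chars`
    returns -> generated string, ending with '.' unless max_len was reached
    '''
    if rng is None:
        rng = np.random.default_rng()
    word = ''
    while len(word) < max_len:
        probs = predict_fn(word)
        if greedy:
            idx = int(np.argmax(probs))                   # most likely character
        else:
            idx = int(rng.choice(len(chars), p=probs))    # weighted random draw
        word += chars[idx]
        if chars[idx] == '.':
            break
    return word

# Toy stand-in for the network: spells 'ana.' with high probability
toy_chars = ['a', 'n', '.']

def toy_predict(prefix):
    target = 'ana.'[min(len(prefix), 3)]
    return np.array([0.9 if c == target else 0.05 for c in toy_chars])

print(generate(toy_predict, toy_chars, greedy=True))   # ana.
```

With the real model, `predict_fn` would one-hot encode the prefix, call `model.predict`, and return the probabilities of the last timestep, as the cells above do.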
