# Recurrent Neural Network Pokémon Name Generator

This project demonstrates how to generate new Pokémon names using Keras and Recurrent Neural Networks (RNNs) with Long Short-Term Memory cells.

In [1]:
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import os
import random
import time

## 1. Load data and perform preprocessing

First, we load all Pokémon names from a text file. Each row contains one Pokémon name. We perform some initial preprocessing by removing names from the dataset that include unwanted characters. This will improve the quality of generated names from the neural network.

Here, we also set up `char_to_index` and `index_to_char` dictionaries to help us translate between index value and character value. This will come in handy when we encode the dataset into a format that the neural network can operate on.
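
As a rough illustration of how the two dictionaries round-trip a name, here is a minimal sketch on a hypothetical three-name sample rather than `pokemon.txt`. The names `sample_names`, `sample_char_to_index`, `sample_index_to_char`, `encode` and `decode` are assumptions made for this example and are not part of the notebook.

```python
# Hypothetical sample standing in for the contents of pokemon.txt
sample_names = ["abra", "onix", "muk"]

# Sorted list of unique characters, mirroring what load_data builds
sample_chars = sorted(set("".join(sample_names)))

sample_char_to_index = {c: i for i, c in enumerate(sample_chars)}
sample_index_to_char = {i: c for i, c in enumerate(sample_chars)}

def encode(name):
    """Translate a name into a list of character indexes."""
    return [sample_char_to_index[c] for c in name]

def decode(indexes):
    """Translate a list of character indexes back into a name."""
    return "".join(sample_index_to_char[i] for i in indexes)

encoded = encode("abra")
print(encoded)          # [0, 1, 7, 0]
print(decode(encoded))  # abra
```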

In [15]:
def load_data(file_name):
    """
    Load data from file.
    
    :param file_name: The name of the file to load data from.
    :return: A list of lower-cased names, one per line of the file, and a sorted list of the unique characters found in the file (including the newline).
    """
    with open(file_name) as f:
        names = f.readlines()

    chars = sorted(list(set("".join(name.lower() for name in names))))
    data = [name.rstrip().lower() for name in names]

    return data, chars

def is_invalid(sequence, invalid_chars):
    """
    Filter out words that contain invalid characters.
    
    :param sequence: The word to analyse.
    :param invalid_chars: An array of invalid characters.
    :return: True if the sequence contains an invalid character, else return False.
    """
    for char in sequence:
        if char in invalid_chars:
            return True

    return False

In [29]:
name_data, chars = load_data("pokemon.txt")

# Clean data
invalid_chars=[' ', "'", '-', '.', '2', '♀', '♂', 'é']
name_data = [data for data in name_data if not is_invalid(data, invalid_chars)]
chars = [char for char in chars if char not in invalid_chars]

# Get the longest and shortest names in dataset
max_sequence_length = len(max(name_data, key=len))
min_sequence_length = len(min(name_data, key=len))

# Set up dictionaries to convert characters to indexes and back, e.g. 'a' => 1 and 1 => 'a'.
char_to_index = dict((c, i) for i, c in enumerate(chars))
index_to_char = dict((i, c) for i, c in enumerate(chars))

In [17]:
print("Corpus length: {}".format(len(name_data)))
print("Max sequence length: {}".format(max_sequence_length))
print("Min sequence length: {}".format(min_sequence_length))
print("Total characters: {}\n".format(len(chars)))

print("Data:")
for i in range(5):
    print("{}. {}".format((i + 1), name_data[i]))
print("...\n")

print("Characters:\n{}".format(chars))

Corpus length: 710
Max sequence length: 11
Min sequence length: 3
Total characters: 27

Data:
1. bulbasaur
2. ivysaur
3. venusaur
4. charmander
5. charmeleon
...

Characters:
['\n', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


## 2. Generate training data with sequences

Next, we create an array of fixed-length sequences, each a substring of a Pokémon name, together with an array holding the character that follows each sequence. We train the neural network on these pairs so that it learns to predict the next character from a given sequence of characters.
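
To make the windowing concrete, here is a simplified sketch that slides a window over a single name. The function `sliding_windows` and its `end_marker` parameter are illustrative assumptions. The notebook's own `generate_sequences` differs in detail, for example in how it handles short names and which character it uses as the end-of-name target.

```python
def sliding_windows(name, window=4, end_marker="\n"):
    """Return (sequence, next_char) pairs for one name."""
    pairs = []
    for start in range(len(name) - window + 1):
        end = start + window
        # The target is the character after the window, or the end marker
        # once the window reaches the end of the name
        target = name[end] if end < len(name) else end_marker
        pairs.append((name[start:end], target))
    return pairs

for window_text, target in sliding_windows("ivysaur"):
    print(repr(window_text), "->", repr(target))
```

In this sketch, names shorter than the window produce no pairs at all.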

In [21]:
def generate_sequences(names, sequence_length=4, step=1):
    """
    Generate sequences of characters to use for training.

    :param names: Array of names to create sequences from.
    :param sequence_length: The length of each sequence generated.
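    :param step: The number of characters to advance between consecutive sequences.
    :return: An array of generated sequences, an array holding the position in the name at which each sequence ends, and an array of corresponding expected next characters.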
    """
    sequences = []
    name_lengths = []
    next_chars = []
    
    for name in names:
        curr_name_length = len(name)
        if curr_name_length <= sequence_length:
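            # The whole name fits in one window, so use it as a single sequence
            sequences.append(name)
            # chars[1] is the target used at the end of a name; in the character list above this is 'a', not the newline at chars[0]
            next_chars.append(chars[1])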
            name_lengths.append(len(name))
        else:
            for i in range(0, curr_name_length - sequence_length + 1, step):
                sequences.append(name[i : i + sequence_length])
                if (sequence_length + i) < curr_name_length:
                    next_chars.append(name[sequence_length + i])
                else:
                    next_chars.append(chars[1])
                name_lengths.append(i + sequence_length)

    
    print("Total sequences generated: {}".format(len(sequences)))
    print("Length of sequence: {}".format(sequence_length))

    return sequences, name_lengths, next_chars

In [22]:
sequence_length = 4
sequences, name_lengths, next_chars = generate_sequences(name_data, sequence_length)

print("\nSample sequences:")
for i in range(len(name_data[0]) - sequence_length + 1):
    print(" x: ['{}'], y: ['{}']".format(sequences[i], next_chars[i]))
print("...")

Total sequences generated: 3126
Length of sequence: 4

Sample sequences:
 x: ['bulb'], y: ['a']
 x: ['ulba'], y: ['s']
 x: ['lbas'], y: ['a']
 x: ['basa'], y: ['u']
 x: ['asau'], y: ['r']
 x: ['saur'], y: ['a']
...


## 3. One-hot encoding the sequences


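One-hot encoding converts each character into a vector of `0`s with a single `1` at the index of that character, so a sequence becomes a matrix with one row per character. For example, with the three-letter alphabet `a`, `b`, `c`, the string `abc` becomes `[[1, 0, 0], [0, 1, 0], [0, 0, 1]]`.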

We can achieve this using the `char_to_index` dictionary created earlier. This format makes it easy for the neural network model to learn on the data.
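
The `abc` example can be reproduced with a few lines of NumPy. This is a minimal sketch and not the notebook's `one_hot_encoding` function: the helper `one_hot` and its `alphabet` parameter are names assumed for this illustration, and it does not pad to `max_sequence_length`.

```python
import numpy as np

def one_hot(text, alphabet):
    """Encode text as a (len(text), len(alphabet)) matrix of 0s and 1s."""
    index = {c: i for i, c in enumerate(alphabet)}
    encoded = np.zeros((len(text), len(alphabet)), dtype="float32")
    for position, char in enumerate(text):
        encoded[position, index[char]] = 1
    return encoded

matrix = one_hot("abc", alphabet="abc")
print(matrix)
# Taking the argmax of each row recovers the original character indexes
print(matrix.argmax(axis=1))
```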

In [7]:
def one_hot_encoding(sequences, chars, next_chars):
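    """
    One-hot encode the sequences and their expected next characters.

    Relies on the global `char_to_index` and `max_sequence_length` defined earlier.

    :param sequences: An array of character sequences to encode as network input.
    :param chars: An array of all characters in the dataset.
    :param next_chars: An array of expected next characters, one per sequence.
    :return: An array x of shape (sequences, max_sequence_length, characters) and an array y of shape (sequences, characters).
    """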
    total_chars = len(chars)
    total_next_chars = len(next_chars)
    total_sequences = len(sequences)
    
    x = np.zeros(shape=(total_sequences, max_sequence_length, total_chars), dtype="float32")
    for i, sequence in enumerate(sequences):
        for j, char in enumerate(sequence):
            x[i, j, char_to_index[char]] = 1
                 
    y = np.zeros(shape=(total_next_chars, total_chars), dtype="float32")
    for i, char in enumerate(next_chars):
        y[i, char_to_index[next_chars[i]]] = 1
    
    print('x.shape: {}'.format(x.shape))
    print('y.shape: {}'.format(y.shape))
    
    return x, y

In [23]:
x, y = one_hot_encoding(sequences, chars, next_chars)

x.shape: (3126, 11, 27)
y.shape: (3126, 27)


## 4. Build RNN model with a single LSTM layer

In [9]:
def generate_model():
    model = keras.Sequential(
        [
            keras.Input(shape=(max_sequence_length, len(chars))),
            layers.LSTM(64),
            layers.Dense(units=len(chars), activation="softmax"),
        ]
    )
    optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
    model.compile(optimizer=optimizer, loss="categorical_crossentropy")
    model.summary()
    
    return model
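
The parameter counts that `model.summary()` reports for this architecture can be checked by hand. The sketch below assumes the default Keras configuration (biases enabled) and a vocabulary of 27 characters. The helper names `lstm_param_count` and `dense_param_count` are illustrative only.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input-output pair plus one bias per output
    return input_dim * units + units

vocab_size = 27    # len(chars) after cleaning
hidden_units = 64  # size of the LSTM layer

lstm_params = lstm_param_count(vocab_size, hidden_units)
dense_params = dense_param_count(hidden_units, vocab_size)
print(lstm_params, dense_params, lstm_params + dense_params)  # 23552 1755 25307
```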

## 5. Helper functions

In [10]:
def sample(preds, temperature):
    """
    Helper function to sample a character index from an array of probabilities.
    
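    :param preds: An array of probabilities, one per character, as produced by the softmax layer.
    :param temperature: Sampling temperature. Values below 1 sharpen the distribution, values above 1 flatten it.
    :return: The index of the sampled character.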
    """
    preds = np.asarray(preds).astype("float64")
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
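
To see what the temperature does before the random draw, the rescaling step of `sample` can be isolated. This is a sketch under assumptions: `apply_temperature` is a hypothetical helper that repeats the log, divide, exponentiate and normalise steps above, and `example_probs` is an invented distribution rather than real model output.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability distribution the same way `sample` does."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    scaled = np.exp(logits)
    return scaled / scaled.sum()

example_probs = [0.6, 0.3, 0.1]  # hypothetical softmax output
for temperature in (0.3, 1.0, 2.0):
    print(temperature, np.round(apply_temperature(example_probs, temperature), 3))
```

A temperature of 1 leaves the distribution unchanged, a lower value concentrates probability on the most likely character, and a higher value spreads it more evenly.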

## 6. Generating text

In [11]:
def generate_name(model, sequences, sequence_lengths, chars, temperature=0.3):
    """
    Generate a name using the RNN.
    
    :param model: The model to predict on.
    :param sequences: An array of sequences from which a random seed sequence is drawn.
    :param sequence_lengths: An array of integers that represent the length of each sequence.
    :param chars: An array of all characters in the dataset.
    :param temperature: Sampling temperature. Lower values make the model more conservative; higher values cause more randomness.
    :return: A generated name as a string.
    """
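    # Pick a random seed sequence and length (np.random.randint excludes the upper bound)
    sequence = sequences[np.random.randint(len(sequences))]
    sequence_length = sequence_lengths[np.random.randint(len(sequence_lengths))]
    # The seed's own length overrides the length drawn above; it caps the
    # generated name and sets the context window used in the loop below
    sequence_length = len(sequence)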
    result = ""
    
    # Initialise vector and populate with seeded sequence
    sequence_input = np.zeros(shape=(1, max_sequence_length, len(chars)))
    for i, char in enumerate(sequence):
        sequence_input[0, i, char_to_index[char]] = 1

    # Predict next character
    prediction = model.predict(sequence_input)[0]
    next_char_index = sample(prediction, temperature)
    
    while next_char_index < (len(chars) - 1) and len(result) < sequence_length:
        result += chars[next_char_index]
        
        sequence_input = np.zeros(shape=(1, max_sequence_length, len(chars)))
        for i, char in enumerate(result[(-sequence_length):]):
            sequence_input[0, i, char_to_index[char]] = 1
        
        prediction = model.predict(sequence_input)[0]
        next_char_index = sample(prediction, temperature)

    print(result.capitalize())
    return result.capitalize()

def generate_names(model, sequences, sequence_lengths, amount=5):
    """
    Generate multiple names.
    
    :param model: The model to predict new names on.
    :param sequences: An array of sequences from which random seed sequences are drawn.
    :param sequence_lengths: An array of integers that represent the length of each sequence.
    :param amount: An integer value of how many names to generate.
    :return: Return an array of generated names.
    """
    names = []
    for i in range(amount):
        name = generate_name(model, sequences, sequence_lengths, chars)
        names.append(name)
    return names

## 7. Training the model

In [12]:
def train_model(epochs=180, batch_size=128, verbose=0):
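    """
    Train the model and generate one name after each epoch.
    
    :param epochs: The number of training epochs.
    :param batch_size: The number of sequences per gradient update.
    :param verbose: Verbosity level passed to `model.fit`.
    :return: An array of generated names, one per epoch.
    """
    names, chars = load_data("pokemon.txt")
    
    names = [name for name in names if not is_invalid(name, invalid_chars)]
    chars = [char for char in chars if char not in invalid_chars]
    
    pokemon_sequences, name_lengths, pokemon_next_chars = generate_sequences(names)
    x, y = one_hot_encoding(pokemon_sequences, chars, pokemon_next_chars)
    model = generate_model()
    
    print("Generating names...")
    generated = []
    for i in range(epochs):
        # Train for one epoch per iteration; passing epochs=epochs here would train epochs * epochs times in total
        model.fit(x, y, batch_size=batch_size, epochs=1, verbose=verbose)
        generated.append(generate_name(model, pokemon_sequences, name_lengths, chars))

    return generated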

In [13]:
pokemon_names = train_model(epochs=20)

Total sequences: 3126
Length of sequences: 4
x.shape: (3126, 11, 27)
y.shape: (3126, 27)
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 64)                23552     
_________________________________________________________________
dense (Dense)                (None, 27)                1755      
Total params: 25,307
Trainable params: 25,307
Non-trainable params: 0
_________________________________________________________________
Generating names...
Kawatte
Oaaaart
Buatg
Easaa
Roisaan
Sina
Peaaa
Anart
Yacbelum
Easas
Anartwos
Aat
Kadoona
Laaffya
Anara
Rawaca
Nealyar
Aat
Iaaaat
Odoadaa
