# Pokémon Name Generation with Keras

Generate new unique Pokémon names with a LSTM using Andrej Karpathy's famous [Char-RNN](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) which he used to generate poetry. There are more information in the blog, but the concept is fairly simple. We want the build a next-character-in-text predictor. We will do this by using a window of fixed length as our input and the next char as output and then train a LSTM to perform this task. Since the network won't understand raw characters we need to encode each character to a character vectors with one-hot encoding.

In [28]:
import pandas as pd
import numpy as np
import keras
import time
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop
import numpy as np
import random
import os

## Settings

In [40]:
step_length = 1    # The step length we take to get our samples from our corpus
epochs = 30       # Number of times we train on our full data
batch_size = 32    # Data samples in each training step
latent_dim = 64    # Size of our LSTM
dropout_rate = 0.2 # Regularization with dropout
model_path = os.path.realpath('./pokemon+5_perc_digimon.h5') # Location for the model
load_model = False # Enable loading model from disk
store_model = True # Store model to disk after training
verbosity = 1      # Print result for each epoch
gen_amount = 2000    # How many 

## Loading data

I have made a .txt where I have stored the names of Pokémon as rows. I have also done some ealy preprocessing like removing special characters and only using lowercase characters. To generate other things than Pokémon names the rows in this file can simply be replaced with some other text that one wishes to generate.

In [30]:
input_path = os.path.realpath('../data/input/pokemon_and_five_perc_digimon.txt')

In [31]:
input_names = []

print('Reading Pokénames from file:')
with open(input_path) as f:
    for name in f:
        name = name.rstrip()
        if len(input_names) < 10:
            print(name)
        input_names.append(name)
    print('...')

Reading Pokénames from file:
corvisquire
yanma
zebstrika
dunsparce
grimmsnarl
kangaskhan
wigglytuff
eldegoss
hakamo
fennekin
...


## Preprocessing
- Concatenate all Pokémon names into a long string corpus.
- Build dicionaries to translate chars to indices in a binary char vector.
- Find a suitable sequence window, I base it on the longest name I find.

In [32]:
# Make it all to a long string
concat_names = '\n'.join(input_names).lower()

# Find all unique characters by using set()
chars = sorted(list(set(concat_names)))
num_chars = len(chars)

# Build translation dictionaries, 'a' -> 0, 0 -> 'a'
char2idx = dict((c, i) for i, c in enumerate(chars))
idx2char = dict((i, c) for i, c in enumerate(chars))

# Use longest name length as our sequence window
max_sequence_length = max([len(name) for name in input_names])

print('Total chars: {}'.format(num_chars))
print('Corpus length:', len(concat_names))
print('Number of names: ', len(input_names))
print('Longest name: ', max_sequence_length)

Total chars: 30
Corpus length: 9463
Number of names:  1075
Longest name:  28


Make a training set where we take samples with sequence length as our input and the next char as label.

In [33]:
sequences = []
next_chars = []

# Loop over our data and extract pairs of sequances and next chars
for i in range(0, len(concat_names) - max_sequence_length, step_length):
    sequences.append(concat_names[i: i + max_sequence_length])
    next_chars.append(concat_names[i + max_sequence_length])

num_sequences = len(sequences)

print('Number of sequences:', num_sequences)
print('First 10 sequences and next chars:')
for i in range(10):
    print('X=[{}]   y=[{}]'.replace('\n', ' ').format(sequences[i], next_chars[i]).replace('\n', ' '))

Number of sequences: 9435
First 10 sequences and next chars:
X=[corvisquire yanma zebstrika ]   y=[d]
X=[orvisquire yanma zebstrika d]   y=[u]
X=[rvisquire yanma zebstrika du]   y=[n]
X=[visquire yanma zebstrika dun]   y=[s]
X=[isquire yanma zebstrika duns]   y=[p]
X=[squire yanma zebstrika dunsp]   y=[a]
X=[quire yanma zebstrika dunspa]   y=[r]
X=[uire yanma zebstrika dunspar]   y=[c]
X=[ire yanma zebstrika dunsparc]   y=[e]
X=[re yanma zebstrika dunsparce]   y=[ ]


One-hot encoding our data into char vectors by using the translation dictionary from earlier.

#### Example

- 'a'   => [1, 0, 0, ..., 0]

- 'b'   => [0, 1, 0, ..., 0]

- 'c'   => [0, 0, 1, ..., 0]

- 'abc' => [[1, 0, 0, ..., 0], [0, 1, 0, ..., 0], [0, 0, 1, ..., 0]] 

In [34]:
X = np.zeros((num_sequences, max_sequence_length, num_chars), dtype=np.bool)
Y = np.zeros((num_sequences, num_chars), dtype=np.bool)

for i, sequence in enumerate(sequences):
    for j, char in enumerate(sequence):
        X[i, j, char2idx[char]] = 1
    Y[i, char2idx[next_chars[i]]] = 1
    
print('X shape: {}'.format(X.shape))
print('Y shape: {}'.format(Y.shape))

X shape: (9435, 28, 30)
Y shape: (9435, 30)


## Build model

Build a standard LSTM network with: 

- Input shape: (max_sequence_length x num_chars) - representing our sequences.
- Output shape: num_chars - representing the next char coming after each sequence.
- Output activation: Softmax - since only one value should be 1 in output char vector.
- Loss: Categorical cross-entrophy - standard loss for multi-class classification.

In [35]:
model = Sequential()
model.add(LSTM(latent_dim, 
               input_shape=(max_sequence_length, num_chars),  
               recurrent_dropout=dropout_rate))
model.add(Dense(units=num_chars, activation='softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer)

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 64)                24320     
_________________________________________________________________
dense_2 (Dense)              (None, 30)                1950      
Total params: 26,270
Trainable params: 26,270
Non-trainable params: 0
_________________________________________________________________


## Training

Watching the loss, doing cross-validation and all that good stuff is not that important here. The best model will not be found by optimizing some metric. We just want to strike a balance between a model that just output gibberish like 'sadsdaddddd' and model that memorizes the names it was trained on. For this it is better to just inspect the output and judge from that.

In [36]:
if load_model:
    model.load_weights(model_path)
else:
    
    start = time.time()
    print('Start training for {} epochs'.format(epochs))
    history = model.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=verbosity)
    end = time.time()
    print('Finished training - time elapsed:', (end - start)/60, 'min')
    
if store_model:
    print('Storing model at:', model_path)
    model.save(model_path)

Start training for 30 epochs
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Finished training - time elapsed: 6.506796908378601 min
Storing model at: /blue/rcstudents/smaley/pokegan/name-generation/rnn/pokemon+5_perc_digimon.h5


## Generation

Generate names by starting with a real sequence from the corpus, continuously predicting the next char while updating the sequence. To get diversity the correct char is selected from a probability distribution based on the models prediction. This can also be furthered by something called temperature, which I didn't use here.

I also added some postprocessing to remove things I did not like manually. Some of this could possibly be done by teaking the network, but I was happy with the way the names looked overall. 

In [41]:
# Start sequence generation from end of the input sequence
sequence = concat_names[-(max_sequence_length - 1):] + '\n'

new_names = []

print('{} new names are being generated'.format(gen_amount))

while len(new_names) < gen_amount:
    
    # Vectorize sequence for prediction
    x = np.zeros((1, max_sequence_length, num_chars))
    for i, char in enumerate(sequence):
        x[0, i, char2idx[char]] = 1

    # Sample next char from predicted probabilities
    probs = model.predict(x, verbose=0)[0]
    probs /= probs.sum()
    next_idx = np.random.choice(len(probs), p=probs)   
    next_char = idx2char[next_idx]   
    sequence = sequence[1:] + next_char

    # New line means we have a new name
    if next_char == '\n':

        gen_name = [name for name in sequence.split('\n')][1]

        # Never start name with two identical chars, could probably also
        if len(gen_name) > 2 and gen_name[0] == gen_name[1]:
            gen_name = gen_name[1:]

        # Discard all names that are too short
        if len(gen_name) > 2:
            
            # Only allow new and unique names
            if gen_name not in input_names + new_names:
                new_names.append(gen_name.capitalize())

        if 0 == (len(new_names) % (gen_amount/ 10)):
            print('Generated {}'.format(len(new_names)))

2000 new names are being generated
Generated 0
Generated 200
Generated 400
Generated 600
Generated 800
Generated 1000
Generated 1200
Generated 1400
Generated 1600
Generated 1800
Generated 2000


## Results

Here are the results. I personally cannot tell the difference between generated names and names of Pokémon I dont know. Sometimes there are giveaways, but overall the names are convincing and diverse!

In [42]:
print_first_n = min(100, gen_amount)

print('First {} generated names:'.format(print_first_n))
for name in new_names[:print_first_n]:
    print(name)

First 100 generated names:
Gothimon thakegom
Fwissect
Flubf
Ealexgurdramon
Amoleds
Amoleds
Zoloth
Barre
Murmagmite
Armanders
Houndon
Houndon
Miscubitu
Erhipy
Nyree
Nator
Nickeavole
Nickeavole
Heroar
Madwool
Cregishoass
Sepede
Klaink
Chyril
Cuntolipe
Slabbilet
Limpla
Goltat
Olaking
Audoris
Floawgaede
Dermie
Elermarp
Mome
Dramplan
Capbumb
Belessiker
Belessiker
Xradiois
Milpy
Meltom
Vibwion
Teragee
Stazil
Ncrowatt
Cawnilut
Sping
Bellipede
Sandoous
Regattasic
Regattasic
Tototau
Raperat
Valolty
Skoopuff
Flacels
Wughla
Bramble
Rokorita
Amproke
Argowloot
Chiboude
Narudon
Droduik
Dearemibolocr
Kickitu
Kickitu
Bentimon
Darchic
Flaucobwa
Mawnklett
Arditurp
Arditurp
Niilorr
Consurrus
Consurrus
Blidoedre
Hatmo
Elucherna
Elucherna
Angginiar
Grubin
Mefselis
Dealish
Torvizar
Imime
Panster
Ropdee
Mulphor
Magemama
Skantrit
Backruu
Gleampla
Krobuble
Sirphenh
Dusteen
Retinat
Trounchen
Ballox
Mriarcion


Storing the results

In [43]:
concat_output = '\n'.join(sorted(new_names))
output_path = os.path.realpath('./output/pokemon+5_perc_digimon_2000_generated_names.txt')

with open(output_path, 'w') as f:
    f.write(concat_output)