# Generating baby names





The architecture for the RNN is strongly inspired by the Pokemon Name Generator of Yan Gobeil [(Github link)](https://github.com/yangobeil/Pokemon-name-generator/blob/master/Generate%20Pok%C3%A9mon%20names.ipynb?utm_source=pocket_mylist). With a relatively simple model, Yan was able to produce interesting results. 

## Processing

As with any data project, getting and wrangling the data is the most work. There are endless lists of popular first names over the years, but it is nearly impossible to download a single file with a complete overview. In the Netherlands you can find an overview of baby names at the Sociale Verzekeringsbank, the agency that pays child benefit to parents. The overview of children's names is available here: [SVB kindernamen](https://www.svbkindernamen.nl/nl/kindernamen/index.htm). From that page we manually copy the tables for 2020 and 2019 and transform them into CSV files.
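The manual copy-and-convert step can be sketched roughly as follows. This is only an illustration, assuming the copied table arrives as tab-separated text with one name and one count per row; `pasted_table` and `table_to_csv` are hypothetical names, and the sample rows are taken from the 2020 preview further down.

```python
import csv
import io

# Assumption: the table copied from the website is tab-separated text.
pasted_table = "Aaliyah\t41\nAbby\t38\nAbigail\t39\n"

def table_to_csv(text: str) -> str:
    """Convert tab-separated rows into semicolon-separated CSV with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
    writer.writerow(['Naam', 'Aantal'])
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines left over from copying
        name, count = line.split('\t')
        writer.writerow([name.strip(), int(count)])
    return buffer.getvalue()

csv_text = table_to_csv(pasted_table)
print(csv_text)
```

The semicolon separator and the `Naam`/`Aantal` header match what the `read_csv` calls in this notebook expect.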

In [1]:
# Run the following lines if you are running this notebook in Google Colab
!pip install polars
!pip install Unidecode

Collecting polars
  Downloading polars-0.8.17-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (10.2 MB)
Collecting pyarrow>=4.0.*
  Downloading pyarrow-5.0.0-cp37-cp37m-manylinux2014_x86_64.whl (23.6 MB)
Installing collected packages: pyarrow, polars
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 3.0.0
    Uninstalling pyarrow-3.0.0:
      Successfully uninstalled pyarrow-3.0.0
Successfully installed polars-0.8.17 pyarrow-5.0.0
Collecting Unidecode
  Downloading Unidecode-1.2.0-py2.py3-none-any.whl (241 kB)
Installing collected packages: Unidecode
Successfully installed Unidecode-1.2.0


In [2]:
# Use Polars instead of Pandas
import polars as pl

In [4]:
df_nl20 = pl.read_csv('data/NL_MEISJESNAMEN_2020.csv', encoding='utf8', sep=';', columns=['Naam', 'Aantal'], )
print(df_nl20.head(5))
print(f'There are: {len(df_nl20)} names in the dataframe')

shape: (5, 2)
╭───────────┬────────╮
│ Naam      ┆ Aantal │
│ ---       ┆ ---    │
│ str       ┆ i64    │
╞═══════════╪════════╡
│ "Aaliyah" ┆ 41     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Abby"    ┆ 38     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Abigail" ┆ 39     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Ada"     ┆ 33     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Adriana" ┆ 59     │
╰───────────┴────────╯
There are: 518 names in the dataframe


In [5]:
df_nl19 = pl.read_csv('data/NL_MEISJESNAMEN_2019.csv', encoding='utf8', sep=';', columns=['Naam', 'Aantal'], )
print(df_nl19.head(5))
print(f'There are: {len(df_nl19)} names in the dataframe')

shape: (5, 2)
╭───────────┬────────╮
│ Naam      ┆ Aantal │
│ ---       ┆ ---    │
│ str       ┆ i64    │
╞═══════════╪════════╡
│ "Aaliyah" ┆ 61     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Aaltje"  ┆ 28     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Abby"    ┆ 36     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Abigail" ┆ 44     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Ada"     ┆ 34     │
╰───────────┴────────╯
There are: 525 names in the dataframe


We can already see that there is quite some overlap between the years. Let's create one list and see how big the overlap is.

In [6]:
# Create one single dataframe with all names
df_stacked = df_nl20.vstack(df_nl19)
df_stacked['Naam'].is_duplicated().sum()

932

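Considering we have a total of just over 1,000 rows (518 + 525 = 1,043), 932 flagged values is quite a lot. Because `is_duplicated()` flags every occurrence of a repeated name, this means that 466 names appear in both years. It makes sense when we take into account that names with a frequency lower than 25 are left out for privacy reasons.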
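To make the counting explicit, here is a small sketch in plain Python that mimics the flagging behaviour on the first five names of each year (taken from the previews above). It is a toy illustration, not the Polars implementation.

```python
from collections import Counter

# First five names of each year, taken from the previews above.
names_2020 = ['Aaliyah', 'Abby', 'Abigail', 'Ada', 'Adriana']
names_2019 = ['Aaliyah', 'Aaltje', 'Abby', 'Abigail', 'Ada']

stacked = names_2020 + names_2019
counts = Counter(stacked)

# Flag every row whose name occurs more than once in the stacked list.
flagged_rows = sum(1 for name in stacked if counts[name] > 1)
# Count each repeated name only once.
shared_names = sum(1 for c in counts.values() if c > 1)

print(flagged_rows, shared_names)

# The same arithmetic on the full data: rows in both files minus unique names.
overlap_full = (518 + 525) - 577
print(overlap_full, 2 * overlap_full)
```

The 577 used in the last step is the number of unique names reported a few cells further down.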

In [7]:
# Create a list with unique names and frequencies
df_nl = df_stacked.groupby('Naam').agg([pl.sum('Aantal')])

# Restoring the original column names
df_nl.columns = ['Naam', 'Aantal']

print(df_nl.head())
print(f'There are: {len(df_nl)} names in the dataframe')

shape: (5, 2)
╭───────────┬────────╮
│ Naam      ┆ Aantal │
│ ---       ┆ ---    │
│ str       ┆ i64    │
╞═══════════╪════════╡
│ "Jessie"  ┆ 132    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Jaylinn" ┆ 172    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Philou"  ┆ 261    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Lua"     ┆ 29     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Quinn"   ┆ 86     │
╰───────────┴────────╯
There are: 577 names in the dataframe
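What the group-by and sum does can be shown with a few rows in plain Python. This is a minimal sketch using the preview rows from both years; the dictionary-based approach is only a stand-in for the Polars aggregation above.

```python
from collections import defaultdict

# (name, count) rows taken from the 2020 and 2019 previews above.
rows_2020 = [('Aaliyah', 41), ('Abby', 38), ('Abigail', 39), ('Ada', 33), ('Adriana', 59)]
rows_2019 = [('Aaliyah', 61), ('Aaltje', 28), ('Abby', 36), ('Abigail', 44), ('Ada', 34)]

# Sum the counts per name across both years.
totals = defaultdict(int)
for name, count in rows_2020 + rows_2019:
    totals[name] += count

for name in sorted(totals):
    print(name, totals[name])
```

Names that occur in only one year simply keep their single count.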


In [8]:
# Showing a small sample from the dataset
print(df_nl.sort('Aantal', reverse=True).sample(frac=0.01))

shape: (5, 2)
╭──────────┬────────╮
│ Naam     ┆ Aantal │
│ ---      ┆ ---    │
│ str      ┆ i64    │
╞══════════╪════════╡
│ "Joëlle" ┆ 107    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Layla"  ┆ 104    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Myla"   ┆ 71     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Jinthe" ┆ 172    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Ivy"    ┆ 424    │
╰──────────┴────────╯


577 names is not a lot for a neural network, so it is unlikely that we would get useful results from it. After some searching I also found the Nederlandse Voornamenbank (maintained by the [Meertens Instituut](https://www.meertens.knaw.nl/nvb/)). They have not replied to my requests for direct access to their database, but I found another list at [Naamkunde.net](http://www.naamkunde.net/). Let's process this data as well.

We take all names from 1995 to 2006. Although these are older, we saw earlier that few new names are introduced from one year to the next.

In [9]:
df_nl9506 = pl.read_csv('data/NL_MEISJESNAMEN_19952006.csv', encoding='utf8', sep=';', columns=['Naam', 'Aantal'])
print(df_nl9506.head(5))
print(f'There are {len(df_nl9506)} names in the dataframe')

shape: (5, 2)
╭───────────────┬────────╮
│ Naam          ┆ Aantal │
│ ---           ┆ ---    │
│ str           ┆ i64    │
╞═══════════════╪════════╡
│ "Aafje (V)"   ┆ 131    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Aafke (V)"   ┆ 744    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Aagje (V)"   ┆ 272    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Aagtje (V)"  ┆ 29     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Aaliyah (V)" ┆ 257    │
╰───────────────┴────────╯
There are 5343 names in the dataframe


In [10]:
# To prevent the model from learning that names contain (V) we remove these from the names
df_nl9506['Naam'] = [(i.split(' ')[0]) for i in df_nl9506['Naam']]
print(df_nl9506.head(5))

shape: (5, 2)
╭───────────┬────────╮
│ Naam      ┆ Aantal │
│ ---       ┆ ---    │
│ str       ┆ i64    │
╞═══════════╪════════╡
│ "Aafje"   ┆ 131    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Aafke"   ┆ 744    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Aagje"   ┆ 272    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Aagtje"  ┆ 29     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Aaliyah" ┆ 257    │
╰───────────┴────────╯
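Splitting on the first space works for the names shown here, but it would also cut off the second part of a name that itself contains a space. A slightly more defensive variant is sketched below, assuming the marker is always a single capital letter in parentheses at the end of the string; `strip_marker` and the name `'Anne Marie (V)'` are hypothetical.

```python
import re

# Assumption: the marker is one capital letter in parentheses at the end, e.g. "(V)".
MARKER = re.compile(r'\s*\([A-Z]\)\s*$')

def strip_marker(name: str) -> str:
    """Remove a trailing marker such as ' (V)' and keep the rest of the name."""
    return MARKER.sub('', name)

# The last example is hypothetical and only shows the difference with split(' ').
examples = ['Aafje (V)', 'Aaliyah (V)', 'Anne Marie (V)']
cleaned = [strip_marker(name) for name in examples]
print(cleaned)

# For comparison: splitting on the first space drops 'Marie'.
print('Anne Marie (V)'.split(' ')[0])
```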


In [11]:
df_nl = df_nl.vstack(df_nl9506)

In [12]:
# Creating one single file again
df_nl = df_nl.groupby('Naam').agg([pl.sum('Aantal')])

In [13]:
# 10 most popular names from our datasets
print(df_nl.sort('Aantal_sum', reverse=True)[:10])
print(f'There are {len(df_nl)} names in the dataframe')

shape: (10, 2)
╭───────────┬────────────╮
│ Naam      ┆ Aantal_sum │
│ ---       ┆ ---        │
│ str       ┆ i64        │
╞═══════════╪════════════╡
│ "Sanne"   ┆ 22651      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Maria"   ┆ 21899      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Laura"   ┆ 20961      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Anne"    ┆ 20823      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...       ┆ ...        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Lisa"    ┆ 18078      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Anna"    ┆ 18030      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Iris"    ┆ 17285      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Johanna" ┆ 16975      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Eva"     ┆ 14845      │
╰───────────┴────────────╯
There are 5413 names in the dataframe


5400 names is a lot better than before, but after experimenting I found that this is still not enough. Fortunately, Belgium has similar names. Some web searching shows that the Belgian government actually publishes a list of first names. It can be found [here](https://statbel.fgov.be/nl/themas/bevolking/namen-en-voornamen). This list is immense and we can download all the names from 1995 to 2020. I have chosen to also include the Walloon names, even though these seem more French in general.

In [14]:
df_be = pl.read_csv('data/BE_MEISJESNAMEN_19952020.csv', encoding='utf8', sep=';', columns=['Naam', 'Aantal'])
print(df_be.head(5))
print(f'There are {len(df_be)} names in the dataframe')

shape: (5, 2)
╭──────────┬────────╮
│ Naam     ┆ Aantal │
│ ---      ┆ ---    │
│ str      ┆ i64    │
╞══════════╪════════╡
│ "Emma"   ┆ 15779  │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Laura"  ┆ 15260  │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Marie"  ┆ 13922  │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Louise" ┆ 12334  │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Sarah"  ┆ 11973  │
╰──────────┴────────╯
There are 12799 names in the dataframe


In [15]:
# Checking if we need to clean the list and remove duplicates
df_be['Naam'].is_duplicated().sum()

0

In [16]:
# Creating one overview 
df = df_be.vstack(df_nl)
df = df.groupby('Naam').agg([pl.sum('Aantal')])
df.columns = ['Naam', 'Aantal']

In [17]:
print(df.sort(by='Aantal', reverse=True).sample(frac=0.0005))
print(f'There are {len(df)} names in the dataframe')

shape: (7, 2)
╭────────────┬────────╮
│ Naam       ┆ Aantal │
│ ---        ┆ ---    │
│ str        ┆ i64    │
╞════════════╪════════╡
│ "Tya"      ┆ 72     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Lauri"    ┆ 205    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Linna"    ┆ 10     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Djara"    ┆ 7      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Trish"    ┆ 9      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Eleanore" ┆ 60     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ "Noä"      ┆ 16     │
╰────────────┴────────╯
There are 14371 names in the dataframe


I have also included a large dataset with names from the USA to experiment with. This set can be found [here](https://www.ssa.gov/oact/babynames/limits.html). More names give more interesting results from the neural network that we are going to use. You can comment out the following cell to leave these names out of the model.

In [18]:
# Processing the dataset in one cell to create a single file to work with again.
df_int = pl.read_csv('data/INT_MEISJESNAMEN_2020.csv', encoding='utf8', sep=';', columns=['Naam', 'Aantal'])

df = df.vstack(df_int)

df = df.groupby('Naam').agg([pl.sum('Aantal')])

print(df.head(5))
print(f'There are {len(df)} names in the dataframe')

shape: (5, 2)
╭────────────┬────────────╮
│ Naam       ┆ Aantal_sum │
│ ---        ┆ ---        │
│ str        ┆ i64        │
╞════════════╪════════════╡
│ "Legaciee" ┆ 5          │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Maryjo"   ┆ 11         │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Seriya"   ┆ 6          │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Shanen"   ┆ 5          │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "Serife"   ┆ 149        │
╰────────────┴────────────╯
There are 26754 names in the dataframe


In [19]:
# The neural network only needs the actual names. The frequencies can be used for further data analysis.
names = df['Naam'].to_list()

len(names)

26754

In [22]:
# We write the list of names to a text file, so that we don't have to repeat all the processing every time we want to run the model.
with open('model_input/names.txt', 'w') as f:
    # Write each name on its own line; the with-block closes the file for us
    for name in names:
        f.write(f'{name}\n')

In [23]:
df.to_csv('data/all_names.csv')

We are done with data processing. By merging several files with first names we have created a final list of 26754 names. The final list contains Dutch, Belgian (Flemish and Walloon) and North American names, which makes for an interesting mix of inputs for the model!

## Data transformation

With all the names in one file we still have to do some transformations to make sure that the model is able to process the names. 

In [24]:
# Importing packages for the transformations
import numpy as np
import tensorflow as tf
import unidecode

In [25]:
# Loading the names from the text file, so that the processing steps above can be skipped.
# The raw text is also used to fit the Tokenizer further down.
with open('model_input/names.txt', 'r') as text:
    list_of_names = text.read()

names = list_of_names.splitlines()

In [26]:
# Using a temporary variable
_ = []

# Removing hyphens and other things like umlauts
for name in names:
    # split on hyphens
    x = name.split('-')
    x = ''.join(x)
    # normalise all text to plain letters
    x = unidecode.unidecode(x)
    # add dot to indicate end of name for model
    x = str(x)+'.'
    # lower case all letters for consistency
    x = x.lower()
    # remove apostrophes
    x = x.replace("'", "")
  
    _.append(x)

# Reassigning the cleaned list to the names variable
names = _
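The cleaning steps can also be wrapped in a single function, which makes them easy to try out on individual names. The sketch below is an approximation: it uses the standard library's `unicodedata` instead of `unidecode`, so it only strips accents and does not transliterate letters such as 'æ' or 'ı'. The function name `clean_name` is hypothetical.

```python
import unicodedata

def clean_name(name: str) -> str:
    """Remove hyphens and apostrophes, strip accents, lower-case and add an end marker."""
    name = name.replace('-', '').replace("'", '')
    # Decompose accented letters into base letter + combining mark, then drop the marks.
    # Assumption: this stands in for unidecode and covers accents only.
    decomposed = unicodedata.normalize('NFKD', name)
    plain = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The trailing dot tells the model where a name ends.
    return plain.lower() + '.'

for raw in ['Joëlle', 'Noä', 'Anne-Marie']:
    print(raw, '->', clean_name(raw))
```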


In [27]:
# Define tokenizer to create mapping of all characters
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    filters='!"#$%&()*+,-/:;<=>?@[\\]^_`{|}~',
    split='\n')

In [28]:
# Fitting the tokenizer on the original name list, without cleaning, to include all possible characters
tokenizer.fit_on_texts(list_of_names)

In [29]:
# Creating mappings for the characters
char_to_index = tokenizer.word_index
index_to_char = dict((v, k) for k, v in char_to_index.items())

In [30]:
# Adding a dot to both dictionaries, so the model can use it when generating names and knows when to move on
char_to_index['.'] = 0
index_to_char[0] = '.'

print(char_to_index)

{'a': 1, 'e': 2, 'i': 3, 'n': 4, 'l': 5, 'r': 6, 'y': 7, 's': 8, 'h': 9, 'm': 10, 'o': 11, 't': 12, 'd': 13, 'k': 14, 'u': 15, 'c': 16, 'j': 17, 'b': 18, 'z': 19, 'v': 20, 'g': 21, 'f': 22, 'p': 23, 'é': 24, 'w': 25, 'x': 26, 'ï': 27, 'ë': 28, 'q': 29, 'ü': 30, 'è': 31, 'â': 32, 'ş': 33, 'ç': 34, 'í': 35, 'ö': 36, 'ó': 37, "'": 38, 'ä': 39, 'á': 40, 'î': 41, 'ı': 42, 'ğ': 43, 'ÿ': 44, 'û': 45, 'i̇': 46, 'ù': 47, 'ê': 48, 'ú': 49, 'š': 50, 'ĝ': 51, 'å': 52, 'æ': 53, 'à': 54, '.': 0}
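The printed mapping suggests that characters are numbered by how often they occur, starting at 1, with 0 reserved for the end-of-name dot. A comparable mapping can be built by hand, as sketched below on three toy names. This is an illustration of the idea with `collections.Counter`, not the Keras Tokenizer itself.

```python
from collections import Counter

# Toy input; the real notebook uses the full list of names.
names = ['anna', 'ada', 'eva']

# Count how often each character occurs across all names.
char_counts = Counter(''.join(names))

# The most frequent character gets index 1, the next one index 2, and so on.
char_to_index = {ch: i for i, (ch, _) in enumerate(char_counts.most_common(), start=1)}

# Reserve index 0 for the end-of-name marker.
char_to_index['.'] = 0

# Reverse mapping, used to turn predicted indices back into characters.
index_to_char = {i: ch for ch, i in char_to_index.items()}

print(char_to_index)
```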


In [31]:
# Maximum number of characters in a name. This is the number of time steps used in the RNN model
max_char = len(max(names, key=len))

# Number of names that are available
m = len(names)

# Number of potential characters
char_dim = len(char_to_index)

In [32]:
# Converting the list of names to a training dataset. This creates a matrix for each of the available names in 'm'.
X = np.zeros((m, max_char, char_dim))
Y = np.zeros((m, max_char, char_dim))

for i in range(m):
    name = list(names[i])
    for j in range(len(name)):
        X[i, j, char_to_index[name[j]]] = 1
        if j < len(name)-1:
            Y[i, j, char_to_index[name[j+1]]] = 1
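To see what these matrices look like, the same encoding can be run on a tiny vocabulary. The sketch below assumes a toy mapping of three characters and two already cleaned names; it shows that `Y` holds the next character for every position in `X`, and that positions after the end of a name stay all-zero.

```python
import numpy as np

# Toy vocabulary and names (assumptions for illustration only).
char_to_index = {'.': 0, 'a': 1, 'b': 2}
names = ['ab.', 'b.']

m = len(names)
max_char = max(len(name) for name in names)
char_dim = len(char_to_index)

X = np.zeros((m, max_char, char_dim))
Y = np.zeros((m, max_char, char_dim))

for i, name in enumerate(names):
    for j, ch in enumerate(name):
        # Input: the character at position j
        X[i, j, char_to_index[ch]] = 1
        # Target: the character at position j + 1, if there is one
        if j < len(name) - 1:
            Y[i, j, char_to_index[name[j + 1]]] = 1

print(X.shape)
print(X[0])
print(Y[0])
```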

## Generating names with a Recurrent Neural Network (RNN)

In [33]:
import os

from keras.models import load_model
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import LambdaCallback

Using the explanation of Yan Gobeil on how the names are generated (from his original notebook): _"The idea is to input empty characters to the trained network and use the output of the first time step as a probability distribution for the first letter of the name. We then use this distribution to decide randomly the first character, record it and update the input to pass this character as an input for the second time step. This is continued for the following time steps to create a name._

_This is where using a '.' at the end of each name becomes important, because we stop the procedure once we get a '.' as an output, meaning that the generated name is done. Also if we reach the length of the largest name in the training set we put a '.' and end the procedure."_

In [34]:
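def make_name(model):
    name = []
    # Start from an all-zero input; generated characters are filled in step by step
    x = np.zeros((1, max_char, char_dim))
    end = False
    i = 0

    while not end:
        # Predicted distribution over characters for time step i
        probs = np.asarray(model.predict(x)[0, i], dtype=np.float64)
        # Renormalise so the probabilities sum to 1 despite rounding errors
        probs = probs / np.sum(probs)
        index = np.random.choice(range(char_dim), p=probs)
        if i == max_char-2:
            # Force the end of the name when the maximum length is reached
            character = '.'
            end = True
        else:
            character = index_to_char[index]
        name.append(character)
        # Feed the sampled character back in as input for the next time step
        x[0, i+1, index] = 1
        i += 1
        if character == '.':
            end = True
    print(''.join(name))
    return ''.join(name)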

To monitor the model during training the following function is defined. After every 25 epochs we print 5 results to see what names are generated.

In [35]:
def generate_name_loop(epoch, _):
    if epoch % 25 == 0:
        
        print(f'Names generated after epoch {epoch}')

        for i in range(5):
            make_name(model)
        
        print()

In [36]:
# Convert the function to be able to use it as callback function. 
name_generator = LambdaCallback(on_epoch_end = generate_name_loop)
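A quick check of which epochs trigger the callback helps when choosing the number of epochs. Keras passes a zero-based epoch index to `on_epoch_end`, so the sketch below lists the indices for which `epoch % 25 == 0` holds, assuming a run of 276 epochs as used further down. The variable names are hypothetical.

```python
epochs = 276        # number of epochs used for training in this notebook
sample_every = 25   # print names whenever the epoch index is a multiple of 25
names_per_sample = 5

# Epoch indices are zero-based, so they run from 0 up to epochs - 1.
trigger_epochs = [epoch for epoch in range(epochs) if epoch % sample_every == 0]

print(trigger_epochs)
print(len(trigger_epochs) * names_per_sample, 'names printed in total')
```

With 275 epochs the last index would be 274 and the final batch of names would be printed at epoch 250; one extra epoch gives a print at epoch 275 as well.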

From Yan Gobeil: _"In the case of interest here we only consider one layer of recurrence, which we take to be LSTM with 128 units. We return the output of this layer and use it into a fully connected dense layer that converts the result of the LSTM layer into a vector of size char_dim using a softmax activation. We use categorical cross entropy as a cost function because of the softmax result and use Adam optimization. There is not really any useful metric to judge if the model does good so we will mostly just look at the results."_ Note that the model in this notebook deviates from this description in one respect: it stacks two LSTM layers of 128 units instead of one.
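To make the link between softmax and categorical cross entropy concrete, here is a small worked example in plain Python for a single time step with three possible characters. The raw scores are made up for illustration; Keras computes the same quantities internally over all time steps and names.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot_target, probs):
    """Negative log-probability that the model assigns to the true character."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)

scores = [2.0, 1.0, 0.1]   # hypothetical raw outputs for three characters
target = [1, 0, 0]         # one-hot vector: the first character is the correct one

probs = softmax(scores)
loss = categorical_cross_entropy(target, probs)

print([round(p, 3) for p in probs])
print(round(loss, 3))
```

The loss shrinks towards zero as the probability of the correct next character approaches 1, which is what training pushes the network towards.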


In [148]:
# Neural network architecture: two stacked LSTM layers followed by a softmax over all characters
model = Sequential()
model.add(LSTM(128, input_shape=(max_char, char_dim), return_sequences=True))
model.add(LSTM(128, return_sequences=True))
model.add(Dense(char_dim, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

# 276 epochs, so that the callback also prints names after epoch 275
model.fit(X, Y, epochs=276, batch_size=128, callbacks=[name_generator], verbose=0)

Names generated after epoch 0
rsælkèhkdagorambš.
ocabémmašáğ.
æznífæê.
byl.
oaá.

Names generated after epoch 25
annoh.
adeĝc.
eatliise.
elunrhsyehbhè.
ilièalåh.

Names generated after epoch 50
digine.
ùeæ.
apíu.
abea.
avàma.

Names generated after epoch 75
ellia.
amina.
aflyve.
anneki.
inyen.

Names generated after epoch 100
avyha.
autun.
elamari.
enicaá.
o.

Names generated after epoch 125
elynn.
a.
rane.
apena.
yaei.

Names generated after epoch 150
àel.
annet'.
akela.
herila.
adce.

Names generated after epoch 175
oûna.
aveenn.
aiden.
amiyah.
iû.

Names generated after epoch 200
ayelint.
aliyai.
olla.
ayaè.
acilea.

Names generated after epoch 225
oraya.
anyelis.
amary.
alsinah.
anye.

Names generated after epoch 250
icca.
alea.
emsia.
avia.
amtin.

Names generated after epoch 275
hilalia.
aålyaz.
amanie.
aneska.
amey.



<keras.callbacks.History at 0x7fb270000dd0>

It is clear that the model improves considerably during training. Where the first generated names are gibberish (e.g. '*ocabémmašáğ*'), plausible names start to appear from around epoch 75 onwards. During experimentation the model seemed to start overfitting after about 200 epochs. Considering our goal this might actually be good: as we saw during data processing, names don't change that much over time. But we are still looking for something novel, so there are limits to how much overfitting is useful. Looking at the results in the last printed epoch, some names seem novel (for example 'Amey' or 'Aneska').
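One rough way to keep an eye on this trade-off is to measure which share of the generated names does not occur in the training data. The sketch below is a possible helper, not part of the original notebook; `novelty_rate` is a hypothetical name and the two lists are toy data.

```python
def novelty_rate(generated, known_names):
    """Share of distinct generated names that do not occur in the known names."""
    known = {name.lower() for name in known_names}
    unique_generated = set(generated)
    if not unique_generated:
        return 0.0
    novel = {name for name in unique_generated if name not in known}
    return len(novel) / len(unique_generated)

# Toy data for illustration; the real check would use the full training list.
generated = ['amey', 'aneska', 'ada', 'ada']
known_names = ['Ada', 'Aaliyah', 'Abby']

rate = novelty_rate(generated, known_names)
print(f'{rate:.0%} of the distinct generated names are new')
```

A rate close to zero would suggest that the model mostly reproduces its training data, while a very high rate early in training usually just reflects gibberish.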

## Save generated names and the model

In [150]:
# Set path and filename to store results
path = './model_output/'
filename = 'generated_names.txt'

# Set number of names to be generated by model
number_of_names = 25

# Create directory to store names if not existing
os.makedirs(path, exist_ok=True)

# Append the names to the file so previous names don't get overwritten
# (mode 'a' creates the file if it does not exist yet)
with open(f'{path}/{filename}', 'a') as text:
    for i in range(number_of_names):
        # Removing the dot at the end and writing each name on a new line
        text.write(make_name(model)[:-1] + '\n')

olnee.
ably.
aizja.
asziá.
anyana.
ayva.
ahara.
eorna.
ezly.
aniyah.
jose.
anylahy.
anlee.
odden.
efly.
irke.
untiah.
hajve.
orra.
amtin.
arlyn.
iulica.
alaita.
ada.
evrie.


In [152]:
# Saving the model so that we can use it later to quickly generate new names
model.save('./model/model.h5')  # creates an HDF5 file

## Names, names, names

In [153]:
# Loading the list of generated names
with open(f'{path}/{filename}', 'r') as file:
    gen_names = file.read().splitlines()

gen_names = pl.Series(gen_names)

In [154]:
# Loading the list of original existing names
with open('model_input/names.txt', 'r') as file:
    original_names = file.read().splitlines()

# Lower-casing the original names to match the generated names
original_names = [name.lower() for name in original_names]

In [155]:
# Create a list of generated names and indicator if the name was in the original list
# True means the name is already present in the original list. False means the name is 'new'
check_names = gen_names.is_in(original_names)
name_existing = list(zip(list(gen_names), check_names.to_list()))

for value in name_existing:
  print(value)

('riebe', False)
('eoma', False)
('ocheda', False)
('ovee', True)
('ola', True)
('ola', True)
('ollil', False)
('ilou', False)
('vila', False)
('aleigh', True)
('enika', False)
('ionna', True)
('ashana', True)
('eliyah', True)
('emminy', False)
('aira', True)
('aviana', True)
('mou', False)
('ylia', False)
('ivere', False)
('avia', True)
('rielle', True)
('ilayqu', False)
('ambelle', False)
('ocheda', False)
('onsler', False)
('akonamo', False)
('ariah', True)
('ada', True)
('anoepa', False)
('adeni', False)
('armwi', False)
('amber', True)
('orianna', True)
('amifje', False)
('akosia', False)
('uztar', False)
('yanna', True)
('oureyda', False)
('utthel', False)
('adia', True)
('ami', True)
('amara', True)
('quer', False)
('amilah', True)
('amoni', True)
('avani', True)
('egritte', False)
('ayen', True)
('hija', False)
('olnee', False)
('ably', False)
('aizja', False)
('asziá', False)
('anyana', False)
('ayva', True)
('ahara', False)
('eorna', False)
('ezly', False)
('aniyah', True)
('

## Final remarks
It is interesting to find out that it is relatively easy to train an RNN to generate names. Fitting the model several times and generating names gave some interesting results. This use case shows the potential of a tool that generates outside-the-box content. I am convinced that this solution might give someone his (if trained on boys' names) or her name in the future.