<a href="https://colab.research.google.com/github/fmigas/Projects/blob/main/names_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The objective of this project is to check whether adding randomness to a text-generating model improves the quality of the end product.

The basis of the model is a simple LSTM network. Its task is to learn English and Polish names and later generate names using different schemes.

Let's start with the imports.

In [46]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # set before TensorFlow is imported so that the log level takes effect

import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import Bidirectional
from keras.utils import to_categorical # public import path instead of the private keras.src module
from tensorflow.keras.preprocessing.sequence import pad_sequences

Next, let's load the data: Polish and English names.

In [47]:
pnames = pd.read_csv('data/P_names.csv')
enames = pd.read_csv('data/E_names.csv')

In [48]:
print(f'Number of Polish names in a database: {len(pnames)}')
print(f'Number of English names in a database: {len(enames)}')

Number of Polish names in a database: 1711
Number of English names in a database: 18238


We have over 10 times as many English names as Polish names. We'll see how it affects the quality of what our model can produce.

A fun fact for English readers: in the USA, if you have a kid, you can use any word you wish as his/her name. The only limits are your imagination and your concern for how your kid will fare at school.

In Poland (and probably in many other countries) you can only choose from an officially approved list of names. That is probably why the list of Polish names is over 10 times shorter.

In [49]:
enames.head()

Unnamed: 0,name,len
0,michael,7
1,christopher,11
2,jessica,7
3,matthew,7
4,ashley,6


I preprocessed both files beforehand to save you time here. Both files contain no duplicates, all names are lowercased, and a 'len' column holds the char count of each name.

Our model is naturally a char-level generative model. Its core architecture and data structure can be easily copied from any deep learning manual.

What is interesting here is how we apply randomness at two levels:
1. at the learning process by replacing some "next char" labels with random chars
2. at the generating process by applying what I call a "simplified Beam" in place of a greedy search routine

Will these two result in better, richer, more interesting and natural names? Let's see!

Let's start with English names.

In [50]:
names = enames

We need a full list of chars used in English names and length of the longest name.
As a standard procedure in such models, we also build char_to_int and int_to_char dictionaries.

In [51]:
max_len = names['len'].max()

nlist = list(names['name'])
text = ' '.join(nlist)
chars = sorted(list(set(text)))
print(f'The longest name is {max_len} chars long.')
print(f"Chars list: {chars}")
print(f"Number of unique chars, including white space: {len(chars)}")

char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

The longest name is 15 chars long.
Chars list: [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Number of unique chars, including white space: 27


Now we begin to build our X and Y datasets.

In [52]:
dataX = [] # x data
dataY_original = [] # y data (labels) - simply the next char

for name in names['name']:
    for i in range(len(name)):
        if i < len(name) - 1:
            dataX.append([char_to_int[char] for char in name[0:i+1]])
            dataY_original.append(char_to_int[name[i+1]])
        else:
            dataX.append([char_to_int[char] for char in name[0:i+1]])
            dataY_original.append(0)

print('X data')
print(dataX[:20])
print('Y data')
print(dataY_original[:20])
originalY = dataY_original[:20]

X data
[[13], [13, 9], [13, 9, 3], [13, 9, 3, 8], [13, 9, 3, 8, 1], [13, 9, 3, 8, 1, 5], [13, 9, 3, 8, 1, 5, 12], [3], [3, 8], [3, 8, 18], [3, 8, 18, 9], [3, 8, 18, 9, 19], [3, 8, 18, 9, 19, 20], [3, 8, 18, 9, 19, 20, 15], [3, 8, 18, 9, 19, 20, 15, 16], [3, 8, 18, 9, 19, 20, 15, 16, 8], [3, 8, 18, 9, 19, 20, 15, 16, 8, 5], [3, 8, 18, 9, 19, 20, 15, 16, 8, 5, 18], [10], [10, 5]]
Y data
[9, 3, 8, 1, 5, 12, 0, 8, 18, 9, 19, 20, 15, 16, 8, 5, 18, 0, 5, 19]


A label (Y data) is simply the next char in a name, or 0 (the space char) after the last char of a name. That is how char-based models are typically trained.
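To make the pairing concrete, here is a small sketch for a single name. The helper name `prefix_label_pairs` is made up for illustration; it mirrors the loop above but keeps the chars themselves instead of their integer codes, and assumes a space as the end-of-name marker.

```python
def prefix_label_pairs(name, end_marker=' '):
    """Return (prefix, next_char) pairs; the full name is paired with the end marker."""
    pairs = []
    for i in range(len(name)):
        prefix = name[:i + 1]
        next_char = name[i + 1] if i < len(name) - 1 else end_marker
        pairs.append((prefix, next_char))
    return pairs

for prefix, label in prefix_label_pairs('anna'):
    print(f"{prefix!r:>8} -> {label!r}")
```

Each name of length n therefore contributes n training rows, which is why 18238 names turn into over a hundred thousand samples.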

Now we begin adding randomness to the model.
First, we declare a RAND_Y value, which says how often a label is replaced by a random char: on average, every RAND_Y-th one. If RAND_Y equals zero, there's no randomness and all labels remain unchanged.

In [53]:
def get_Y(dataY, RAND_Y = 0):
    dataY = list(dataY) # work on a copy so that the original labels stay untouched
    if RAND_Y > 0:
        for i, y in enumerate(dataY):
            if y != 0: # never overwrite the end-of-name marker
                if np.random.randint(0, RAND_Y) == 0: # on average, every RAND_Y-th char is replaced with a randomly selected char
                    dataY[i] = np.random.randint(1, len(chars))
    return dataY
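As a sanity check of the "every RAND_Y-th label" idea, here is a vectorized sketch on toy labels. The function name `randomize_labels` and the toy data are my own assumptions, not part of the notebook. Note that a randomly drawn char can coincide with the original one, so the share of labels that actually change should be slightly below 1/RAND_Y.

```python
import numpy as np

def randomize_labels(labels, rand_y, num_classes, rng):
    """Return a copy of labels in which each non-zero label is replaced,
    with probability 1/rand_y, by a random non-zero class."""
    labels = np.array(labels)  # np.array copies, so the input is not modified
    if rand_y > 0:
        hit = rng.integers(0, rand_y, size=labels.shape) == 0
        replace = (labels != 0) & hit
        labels[replace] = rng.integers(1, num_classes, size=replace.sum())
    return labels

rng = np.random.default_rng(0)
original = np.tile([9, 3, 8, 1, 5, 12, 0], 2000)  # toy labels, 0 marks the end of a name
noisy = randomize_labels(original, rand_y=7, num_classes=27, rng=rng)

non_zero = original != 0
changed = np.mean(noisy[non_zero] != original[non_zero])
print(f"Share of changed labels: {changed:.3f} (1/7 is about {1/7:.3f})")
```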

Let's see what the first 20 labels look like for various RAND_Y values.

In [54]:
print("No randomness")
dataY = get_Y(dataY_original, RAND_Y = 0)
RAND_Y = 0
print(f"Original Y: {originalY}")
print(f"Modified Y: {dataY[:20]}")

No randomness
Original Y: [9, 3, 8, 1, 5, 12, 0, 8, 18, 9, 19, 20, 15, 16, 8, 5, 18, 0, 5, 19]
Modified Y: [9, 3, 8, 1, 5, 12, 0, 8, 18, 9, 19, 20, 15, 16, 8, 5, 18, 0, 5, 19]


In [55]:
RAND_Y = 7
print(f"Every {RAND_Y}th modified")
dataY = get_Y(dataY_original, RAND_Y)

print(f"Original Y: {originalY}")
print(f"Modified Y: {dataY[:20]}")

Every 7th modified
Original Y: [9, 3, 8, 1, 5, 12, 0, 8, 18, 9, 19, 20, 15, 16, 8, 5, 18, 0, 5, 19]
Modified Y: [9, 3, 8, 1, 5, 12, 0, 8, 18, 9, 19, 20, 15, 1, 8, 5, 18, 0, 5, 19]


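Now we need to prepare the data so that it fits an LSTM network.
The steps are basic: pad every prefix to max_len, scale the integer codes to the 0-1 range, one-hot encode the labels and add a trailing feature axis to X.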

In [56]:
X = pad_sequences(dataX, maxlen=max_len, dtype='float32')
num_of_chars = max(dataY)
X = X/num_of_chars
Y = to_categorical(dataY)
X = X.reshape(X.shape[0], X.shape[1], 1)
print(f"X shape: {X.shape}")
print(f"Y shape: {Y.shape}")

X shape: (115316, 15, 1)
Y shape: (115316, 27)
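For readers who want to see what the padding and scaling do to a few prefixes, here is a NumPy-only sketch. It imitates the default behaviour of pad_sequences (zeros added on the left, overly long sequences cut from the left); the helper name `pad_and_scale` and the toy max_len of 5 are assumptions made for illustration.

```python
import numpy as np

def pad_and_scale(sequences, max_len, scale):
    """Left-pad integer sequences with zeros, divide by scale, add a feature axis."""
    X = np.zeros((len(sequences), max_len), dtype='float32')
    for row, seq in enumerate(sequences):
        seq = seq[-max_len:]                 # keep the last max_len items if too long
        X[row, max_len - len(seq):] = seq    # right-align, zeros stay on the left
    return (X / scale).reshape(len(sequences), max_len, 1)

toy = [[13], [13, 9], [13, 9, 3]]            # prefixes "m", "mi", "mic"
X_toy = pad_and_scale(toy, max_len=5, scale=26)
print(X_toy.shape)
print(np.round(X_toy[:, :, 0], 2))
```

The result has the (samples, timesteps, features) layout that the LSTM layer expects, with a single feature per timestep.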


Let's build our model: two Bidirectional LSTM layers, each followed by heavy dropout (0.5).

In [57]:
def get_model():
  model = Sequential()
  model.add(Bidirectional(LSTM(16, return_sequences=True, input_shape=(X.shape[1], 1))))
  model.add(Dropout(0.5))
  model.add(Bidirectional(LSTM(16)))
  model.add(Dropout(0.5))
  model.add(Dense(Y.shape[1], activation='softmax'))
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  return model

We've already seen how randomness is added at a training data level. Now let's see how a "simplified Beam" is added at a text generating level.

A simple choice is a greedy search. After the predict method produces a vector of "probabilities" for each char, we choose the one with the highest value.

In [58]:
# c = model.predict(wordX.reshape(1, max_len, 1), verbose = 0) # a vector of per-char probabilities
# c_max = int(c.argmax(axis=1)) # index of the char with the highest probability
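Here is the same greedy step as a runnable toy, without a trained model. The four-char alphabet and the probability values are invented for illustration; only the argmax logic corresponds to the commented lines above.

```python
import numpy as np

chars = [' ', 'a', 'b', 'c']                   # toy alphabet, ' ' ends a name
probs = np.array([[0.10, 0.55, 0.05, 0.30]])   # shape (1, n_chars), like the predict output

best = int(probs.argmax(axis=1)[0])            # index of the most probable char
print(best, repr(chars[best]))
```

Greedy search is deterministic: the same prefix always leads to the same next char, which is where the repetitiveness comes from.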

An alternative is to choose among the few chars with the highest scores returned by the predict method. We can randomly choose from the list of all chars, passing the probability vector as the p parameter of np.random.choice.

However, if the vector of probabilities is more or less flat, say 'a' = 0.05, 'b' = 0.07, 'c' = 0.08 etc., we would end up with chaotic noise rather than interesting output that sounds like a real name.

To avoid this, we can penalize lower probabilities and boost higher ones with a simple trick: raise each value of the vector to the power of MULTIPLIER, then normalize so that all values sum to 1 again and can be used as the p parameter of np.random.choice.
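A side note with a sketch: raising probabilities to a power and renormalizing gives the same result as a softmax over the log-probabilities with temperature 1/MULTIPLIER, which is usually called temperature sampling. The function names below are mine and the input vector is made up; the code only checks that the two formulations agree.

```python
import numpy as np

def sharpen(p, multiplier):
    """Raise each probability to `multiplier` and renormalize to sum to 1."""
    p = np.asarray(p, dtype=float) ** multiplier
    return p / p.sum()

def softmax_with_temperature(log_p, temperature):
    """Softmax of log_p / temperature, computed in a numerically stable way."""
    z = np.asarray(log_p, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

p = np.array([0.05, 0.07, 0.08, 0.30, 0.50])
sharp = sharpen(p, multiplier=3)
same = softmax_with_temperature(np.log(p), temperature=1 / 3)

print(np.round(sharp, 3))
print(np.allclose(sharp, same))
```

With a multiplier of 1 the vector is unchanged, and as the multiplier grows the sampling approaches greedy search.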

I usually use MULTIPLIER = 3. Let's see how it works.
Let's say we have ten chars in our alphabet. We will generate a random vector of probabilities, one per char, and normalize it so that the values sum to 1.



In [59]:
v = np.random.random(10)
v = v / sum(v)
v = np.sort(v)[::-1]
v = np.round(v, 2)
print(v)
print(np.sum(v[:3]))

[0.15 0.14 0.14 0.13 0.13 0.11 0.08 0.06 0.04 0.02]
0.43000000000000005


Let's assume that we treat the top 2-3 chars as "good", which means that by choosing among them we expect to produce a sensible name that "sounds good". We should somehow minimize the probability of choosing one of the remaining 7-8 chars, but still keep the order of the top 2-3 chars in terms of their probability. On the other hand, we do not want to simply cut off the long tail, only to diminish its probability.

Let's see how applying a "simplified Beam" with MULTIPLIER = 3 changes our vector.

In [60]:
MULTIPLIER = 3

In [61]:
c = [x ** MULTIPLIER for x in v] # reward the highest probabilities
c = [x / sum(c) for x in c]
c = np.round(c, 2)
print(c)
print(np.sum(c[:3]))

[0.22 0.18 0.18 0.14 0.14 0.09 0.03 0.01 0.   0.  ]
0.5800000000000001


As we can see, the top three values (0.15, 0.14 and 0.14) are strengthened to 0.22, 0.18 and 0.18 and now sum to 0.58 instead of 0.43, while the values in the tail become insignificant.
On the other hand, it's still much more random than greedy search.

Now let's see what it looks like with different MULTIPLIER values (from 2 to 6) on a 10-char vector.

In [62]:
def beam(vec_len, multiplier):
    v = np.random.random(vec_len)
    v = v / sum(v)
    v = np.sort(v)[::-1]
    v = np.round(v, 2)
    print('Original vector sorted:')
    print(v)
    print(f"Sum of the first three values: {np.sum(v[:3]):.2f}\n")

    for m in multiplier:
        c = [x ** m for x in v] # reward the highest probabilities
        c = [x / sum(c) for x in c]
        c = np.round(c, 2)
        print(f"Vector with multiplier {m}:")
        print(c)
        print(f"Sum of the first three values: {np.sum(c[:3]):.2f}\n")


beam(vec_len = 10, multiplier = [2, 3, 4, 5, 6])

Original vector sorted:
[0.21 0.2  0.17 0.15 0.1  0.07 0.06 0.04 0.01 0.  ]
Sum of the first three values: 0.58

Vector with multiplier 2:
[0.28 0.26 0.19 0.14 0.06 0.03 0.02 0.01 0.   0.  ]
Sum of the first three values: 0.73

Vector with multiplier 3:
[0.34 0.29 0.18 0.12 0.04 0.01 0.01 0.   0.   0.  ]
Sum of the first three values: 0.81

Vector with multiplier 4:
[0.39 0.32 0.17 0.1  0.02 0.   0.   0.   0.   0.  ]
Sum of the first three values: 0.88

Vector with multiplier 5:
[0.43 0.33 0.15 0.08 0.01 0.   0.   0.   0.   0.  ]
Sum of the first three values: 0.91

Vector with multiplier 6:
[0.46 0.34 0.13 0.06 0.01 0.   0.   0.   0.   0.  ]
Sum of the first three values: 0.93



I will try the values 3 and 6 in the model.

So, we are now ready to run the model.
We will generate 10 series of names, 10 names in each series.
The model is compiled once at the beginning and fit on the data again in each series. We will observe whether the quality of the generated names improves with each consecutive fit of EPOCHS epochs (5 in the code below).
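Before the full run, here is a stripped-down sketch of the generation loop on its own. It is not the notebook's code: `generate_name` and `toy_predict` are hypothetical names, the stand-in predictor replaces model.predict, and the sketch works on integer codes directly and caps a name at max_len chars. It only shows how greedy search and the "simplified Beam" plug into the same loop.

```python
import numpy as np

def generate_name(predict_fn, chars, max_len, rng, multiplier=None):
    """Generate one name char by char.

    predict_fn(prefix) returns a probability vector over `chars`;
    index 0 (' ') is the end-of-name marker.
    multiplier=None means greedy search, a number means sharpened sampling.
    """
    name = [int(rng.integers(1, len(chars)))]   # random first char, never the end marker
    for _ in range(max_len - 1):
        p = np.asarray(predict_fn(name), dtype=float)
        if multiplier is None:
            nxt = int(p.argmax())
        else:
            p = p ** multiplier
            nxt = int(rng.choice(len(chars), p=p / p.sum()))
        if nxt == 0:                            # end-of-name marker chosen
            break
        name.append(nxt)
    return ''.join(chars[i] for i in name)

toy_chars = [' ', 'a', 'n']

def toy_predict(prefix):
    """Stand-in for model.predict: alternate 'a' and 'n', prefer to stop after 4 chars."""
    if len(prefix) >= 4:
        return [0.8, 0.1, 0.1]
    return [0.1, 0.2, 0.7] if prefix[-1] == 1 else [0.1, 0.7, 0.2]

rng = np.random.default_rng(1)
greedy = generate_name(toy_predict, toy_chars, max_len=15, rng=rng)
sampled = [generate_name(toy_predict, toy_chars, max_len=15, rng=rng, multiplier=3)
           for _ in range(5)]
print(greedy, sampled)
```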



Let's start with a greedy search and no randomness in the training data!

In [66]:
from time import time
names_full = []

not_starting = list(np.arange(26, 33)) # only applies to Polish names
not_starting = [] # empty for English names

model = get_model()

RAND_Y = [0, 3, 5]
MULTIPLIER_true = [3, 6]
MULTIPLIER_false = [1]
BEAM = [False, True]
EPOCHS = 5
SERIES = 10
RANDOM_CHARS = False # greedy search in this first run; True switches to the "simplified Beam"

print(f"RandY: {RAND_Y}")

for i in range(SERIES):
    tic = time()
    model.fit(X, Y, epochs=EPOCHS, verbose=0, batch_size=64)
    names_ = []
    if i % 1 == 0:
        for k in range(10):
            n = np.random.randint(1, num_of_chars + 1) # we generate the first char of the name
            while n in not_starting: #used only for Polish names
                n = np.random.randint(1, num_of_chars + 1)
            n = n/num_of_chars #scaling for an LSTM model

            word = [[]]
            word[0].append(n)
            for j in range(max_len):
                wordX = pad_sequences(word, maxlen=max_len, dtype='float32')
                c = model.predict(wordX.reshape(1, max_len, 1), verbose = 0)
                if RANDOM_CHARS == True:
                    c = list(c.flatten())
                    c = [x ** MULTIPLIER for x in c]
                    c = [x / sum(c) for x in c]
                    c_max = np.random.choice(chars, p = c)
                    c_max = char_to_int[c_max]
                else:
                    c_max = int(c.argmax(axis=1))
                word[0].append(c_max/num_of_chars)
                if c_max == 0:
                    break
            imie = ''.join([int_to_char[int(round(char*num_of_chars))] for char in word[0]]) # round() guards against float error when undoing the scaling

            names_.append(imie)

        print(f"Series {i+1:>2} names: {names_} after {(time() - tic)/60:.2f} minutes.")
        names_full.append(names_)

        df = pd.DataFrame(names_full)

print(f'ALL NAMES GENERATED after {time() - tic:.2f} seconds')
df.to_csv('data/English_greedy_norand.csv')



RandY: [0, 3, 5]



In [None]:
import tensorflow as tf
import random as python_random
from time import time

def reset_seeds():
   np.random.seed(123)
   python_random.seed(123)
   tf.random.set_seed(1234)

def get_data(rand_y):
    names = pd.read_csv('data/E_names.csv')
    max_len = names['len'].max()

    nlist = list(names['name'])
    text = ' '.join(nlist)
    chars = sorted(list(set(text)))

    char_to_int = dict((c, i) for i, c in enumerate(chars))
    int_to_char = dict((i, c) for i, c in enumerate(chars))

    dataX = [] # x data
    dataY = [] # y data (labels) - simply the next char

    for name in names['name']:
        for i in range(len(name)):
            if i < len(name) - 1:
                dataX.append([char_to_int[char] for char in name[0:i+1]])
                dataY.append(char_to_int[name[i+1]])
            else:
                dataX.append([char_to_int[char] for char in name[0:i+1]])
                dataY.append(0)

    if rand_y > 0:
        for i, y in enumerate(dataY):
            if y != 0:
                if np.random.randint(0, rand_y) == 0: # on average, every rand_y-th char is replaced with a randomly selected char
                    dataY[i] = np.random.randint(1, len(chars))


    X = pad_sequences(dataX, maxlen=max_len, dtype='float32')
    num_of_chars = max(dataY)
    X = X/num_of_chars
    Y = to_categorical(dataY)
    X = X.reshape(X.shape[0], X.shape[1], 1)

    return X, Y, max_len, char_to_int, int_to_char, num_of_chars, chars


For speed, we will use a small network with 16 units in each of the two LSTM layers.
For repeatability, we will reset the seeds each time the model is rebuilt.

In [None]:
def get_model():
    reset_seeds()

    model = Sequential()
    model.add(Bidirectional(LSTM(16, return_sequences=True, input_shape=(X.shape[1], 1))))
    # model.add(Bidirectional(LSTM(16, input_shape=(X.shape[1], 1))))
    model.add(Dropout(0.5))
    model.add(Bidirectional(LSTM(16)))
    model.add(Dropout(0.3))
    model.add(Dense(Y.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [None]:
not_starting = list(np.arange(26, 33)) # only applies to Polish names
not_starting = [] # empty for English names

RAND_Y = [0, 3, 5]
MULTIPLIER_true = [3, 6]
MULTIPLIER_false = [1]
BEAM = [False, True]
EPOCHS = 5
SERIES = 10

In [None]:
for beam in BEAM:
    if beam:
        MULTIPLIER = MULTIPLIER_true
    else:
        MULTIPLIER = MULTIPLIER_false
    for rand_y in RAND_Y:
        for multiplier in MULTIPLIER:
            X, Y, max_len, char_to_int, int_to_char, num_of_chars, chars = get_data(rand_y)
            file_name = 'English__beam:' + str(beam) + '__multiplier:' + str(multiplier) + '__rand_y:' + str(rand_y)
            print(f'\nGenerating: {file_name}')

            model = get_model()
            names_full = []
            for i in range(SERIES):
                tic = time()
                model.fit(X, Y, epochs=EPOCHS, verbose=0, batch_size=64)
                names_ = []
                for k in range(10):
                    n = np.random.randint(1, num_of_chars + 1)
                    while n in not_starting:
                        n = np.random.randint(1, num_of_chars + 1)
                    n = n/num_of_chars

                    word = [[]]
                    word[0].append(n)
                    for j in range(max_len):
                        wordX = pad_sequences(word, maxlen=max_len, dtype='float32')
                        c = model.predict(wordX.reshape(1, max_len, 1), verbose = 0)
                        if beam:
                            c = list(c.flatten())
                            c = [x ** multiplier for x in c] # reward the highest probabilities
                            c = [x / sum(c) for x in c]
                            c_max = np.random.choice(chars, p = c) # new method: the next char is sampled according to the probabilities
                            c_max = char_to_int[c_max]
                        else:
                            c_max = int(c.argmax(axis=1)) # old method: always take the argmax of predict
                        word[0].append(c_max/num_of_chars)
                        if c_max == 0:
                            break
                    imie = ''.join([int_to_char[int(round(char*num_of_chars))] for char in word[0]]) # round() guards against float error when undoing the scaling
                    names_.append(imie)
                print(f"Series {i + 1:>2} names: {names_} after {(time() - tic) / 60:.2f} minutes.")
                names_full.append(names_)

            df = pd.DataFrame(names_full)
            df.to_csv('/content/drive/MyDrive/Python_data/' + file_name + '.csv')
            # print("\nSeries completed:", i+1, "\n")

print('model finished')

If we take a look at the above outcome, we can easily formulate a couple of conclusions:


1.   The less randomness, the more repetitive and "boring" the outcome.
2.   The first three results come from the "greedy search" procedure, and we can see they are more repetitive than the next six lists produced with the "simplified Beam" method.
3.   Randomness at the word generation level ("beam" with a lower multiplier value, e.g. 3) adds more flavour than randomness at the training level (randomly modified chars in the training set, rand_y > 0).
4.   In this specific task, randomness does not produce more nonsensical, worthless results. If we take a list with "beam" True and a positive rand_y, we do not find more useless propositions than in the series generated without randomness.





Obviously, at least at this level, only a human being can really assess the quality of the outcome. However, we can design some measures of the diversity of the outcome.
Let's analyze each set of names and find out:


1.   How many unique names we got
2.   How many unique endings (2-grams, 3-grams and 4-grams) we got
3.   How many different name lengths we have in a set
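The three measures can be sketched in a few lines of plain Python on a toy list. The function name `diversity_stats` and the sample names are made up; I also assume that names may carry the trailing end-of-name space, so it is stripped before counting.

```python
def diversity_stats(names):
    """Count unique names, unique endings and distinct lengths in a list of names."""
    names = [n.strip() for n in names]   # drop the trailing end-of-name space, if any
    return {
        'unique_names': len(set(names)),
        'unique_last2': len({n[-2:] for n in names}),
        'unique_last3': len({n[-3:] for n in names}),
        'unique_last4': len({n[-4:] for n in names}),
        'unique_lengths': len({len(n) for n in names}),
    }

sample = ['anna ', 'hanna ', 'joanna ', 'anna ', 'mark ']
print(diversity_stats(sample))
```

A low count of unique endings relative to the number of names is a quick hint that the generator keeps falling back on the same suffixes.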





In [None]:
def print_df_stats(df):
    # drop the index column written by to_csv, if it is present
    df = df.drop(columns=['Unnamed: 0'], errors='ignore')
    df = df.stack().reset_index()
    df = df.drop(['level_0', 'level_1'], axis=1)
    df.columns = ['name']

    # generated names usually end with the ' ' end-of-name marker; strip it
    # so that the endings and lengths below refer to the letters only
    df['name'] = df['name'].astype(str).str.strip()
    unique_len = len(df['name'].unique())

    df['len'] = df['name'].apply(len)
    df['last2'] = df['name'].apply(lambda x: x[-2:])
    df['last3'] = df['name'].apply(lambda x: x[-3:])
    df['last4'] = df['name'].apply(lambda x: x[-4:])

    print(f"Number of unique names: {unique_len}")
    print(f"Number of unique last 2 chars: {len(df['last2'].unique())}")
    print(f"Number of unique last 3 chars: {len(df['last3'].unique())}")
    print(f"Number of unique last 4 chars: {len(df['last4'].unique())}")
    print(f"Number of different name lengths: {len(df['len'].unique())}\n")

In [None]:
RAND_Y = [0, 5, 3]
MULTIPLIER = [6, 3, 1]
BEAM = [False, True]

for b in BEAM:
    for m in MULTIPLIER:
        for r in RAND_Y:
            file_name = 'English__beam:' + str(b) + '__multiplier:' + str(m) + '__rand_y:' + str(r)

            try:
                df = pd.read_csv('data/' + file_name + '.csv')
            except FileNotFoundError:
                continue # this beam/multiplier/rand_y combination was not generated
            print(file_name)
            print_df_stats(df)

A couple of observations:


1.   The first three results with a greedy search (beam: False) each have around 60 unique names (out of 100 generated) and a very limited number of unique 2-gram endings. Adding randomness at the training data level (rand_y = 3 and 5) does not change anything.
2.   What really adds value in terms of diversity (without any visible loss in quality) is adding randomness at a text generation level (beam: True).

That's it. I hope you enjoyed it. The model itself is very simple and can be copied from any ML/NLP textbook, but I hope experimenting with randomness was fun!