# Finnish Lastname Generator

<img src="sukunimet.jpg">

Have you ever wondered what would be your name or surname if you were born in another country? During last two years when I have been living in Finland this question came to my mind several times. Once I even used online generator to get myself a Finnish name and surname. Visit http://www.visitfinland.com/campaigns/finngenerator/public/en/ to check it out.

However, after getting familiar with Sequential models I thought that it should be possible to use one of those to teach it to generate new surnames which would resemble already existing ones. To train such a model we would need a dataset of original Finnish lastnames. I found such at http://tuomas.salste.net/suku/nimi/index-en.html. It consists of 23,166 family names found in Finland.

Let's get started!

In [1]:
import numpy as np
from utils import *
import random

### 1 - Dataset and Preprocessing

Run the following cell to read the dataset of Finnish surnames, create a list of unique characters (such as a-z), and compute the dataset and vocabulary size. 

In [2]:
data = open('finnish_lastnames.txt', 'r', encoding='latin-1').read()
data= data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))
print('Characters are ', chars)

There are 154642 total characters and 29 unique characters in your data.
Characters are  ['c', 'b', 'h', 'm', 'p', 'r', 'f', 'e', 'y', '\n', 'é', 'u', 'k', 'i', 't', 'ü', 'ä', 'v', '-', 'd', 'ö', 'l', 'a', 'w', 'j', 'g', 'o', 's', 'n']


The "\n" (or newline character) indicates the end of the surname rather than the end of a sentence. In the cell below, we create a python dictionary (i.e., a hash table) to map each character to an index from 0-38. We also create a second python dictionary that maps each index back to the corresponding character. This will help you figure out what index corresponds to what character in the probability distribution output of the softmax layer. Below, char_to_ix and ix_to_char are the python dictionaries. 

In [3]:
char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) }
print(ix_to_char)

{0: '\n', 1: '-', 2: 'a', 3: 'b', 4: 'c', 5: 'd', 6: 'e', 7: 'f', 8: 'g', 9: 'h', 10: 'i', 11: 'j', 12: 'k', 13: 'l', 14: 'm', 15: 'n', 16: 'o', 17: 'p', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'y', 25: 'ä', 26: 'é', 27: 'ö', 28: 'ü'}


### 2 - Overview of the model

Your model will have the following structure:

- Initialize parameters 
- Run the optimization loop
    - Forward propagation to compute the loss function
    - Backward propagation to compute the gradients with respect to the loss function
    - Clip the gradients to avoid exploding gradients
    - Using the gradients, update your parameter with the gradient descent update rule.
- Return the learned parameters 

We will write two helper functions. One that clips gradient values which helps to avoid exploding gradient problem. Another samples a sequence of characters according to a sequence of probability distributions output of the RNN.

In [4]:
def clip(gradients, maxValue):
    '''
    Clips the gradients' values between minimum and maximum.
    
    Arguments:
    gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
    maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
    
    Returns: 
    gradients -- a dictionary with the clipped gradients.
    '''
    
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
   
    # clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]. (≈2 lines)
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, maxValue, out=gradient)
    
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    
    return gradients

In [5]:
def sample(parameters, char_to_ix, seed):
    """
    Sample a sequence of characters according to a sequence of probability distributions output of the RNN

    Arguments:
    parameters -- python dictionary containing the parameters Waa, Wax, Wya, by, and b. 
    char_to_ix -- python dictionary mapping each character to an index.
    seed -- used for grading purposes. Do not worry about it.

    Returns:
    indices -- a list of length n containing the indices of the sampled characters.
    """
    
    # Retrieve parameters and relevant shapes from "parameters" dictionary
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]
    
    # Step 1: Create the one-hot vector x for the first character (initializing the sequence generation). (≈1 line)
    x = np.zeros(vocab_size)
    # Step 1': Initialize a_prev as zeros (≈1 line)
    a_prev = np.zeros((n_a))
    
    # Create an empty list of indices, this is the list which will contain the list of indices of the characters to generate (≈1 line)
    indices = []
    
    # Idx is a flag to detect a newline character, we initialize it to -1
    idx = -1 
    
    # Loop over time-steps t. At each time-step, sample a character from a probability distribution and append 
    # its index to "indices". We'll stop if we reach 50 characters (which should be very unlikely with a well 
    # trained model), which helps debugging and prevents entering an infinite loop. 
    counter = 0
    newline_character = char_to_ix['\n']
    
    while (idx != newline_character and counter != 50):
        
        # Step 2: Forward propagate x using the equations (1), (2) and (3)
        a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
        z = np.dot(Wya, a) +  by 
        y = softmax(z)
        
        # for grading purposes
        np.random.seed(counter+seed) 
        
        # Step 3: Sample the index of a character within the vocabulary from the probability distribution y
        idx = np.random.choice(list(range(vocab_size)), p = y[:, counter].ravel())

        # Append the index to "indices"
        indices.append(idx)
        
        # Step 4: Overwrite the input character as the one corresponding to the sampled index.
        x = np.zeros((vocab_size, 1))
        x[idx] = 1
        
        # Update "a_prev" to be "a"
        a_prev = a
        
        # for grading purposes
        seed += 1
        counter +=1

    if (counter == 50):
        indices.append(char_to_ix['\n'])
    
    return indices

Next we will implement a function performing one step of stochastic gradient descent (with clipped gradients). You will go through the training examples one at a time, so the optimization algorithm will be stochastic gradient descent.

In [6]:
def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
    """
    Execute one step of the optimization to train the model.
    
    Arguments:
    X -- list of integers, where each integer is a number that maps to a character in the vocabulary.
    Y -- list of integers, exactly the same as X but shifted one index to the left.
    a_prev -- previous hidden state.
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        b --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    learning_rate -- learning rate for the model.
    
    Returns:
    loss -- value of the loss function (cross-entropy)
    gradients -- python dictionary containing:
                        dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                        dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                        dWya -- Gradients of hidden-to-output weights, of shape (n_y, n_a)
                        db -- Gradients of bias vector, of shape (n_a, 1)
                        dby -- Gradients of output bias vector, of shape (n_y, 1)
    a[len(X)-1] -- the last hidden state, of shape (n_a, 1)
    """
    
    # Forward propagate through time (≈1 line)
    loss, cache = rnn_forward(X, Y, a_prev, parameters)
    
    # Backpropagate through time (≈1 line)
    gradients, a = rnn_backward(X, Y, parameters, cache)
    
    # Clip your gradients between -5 (min) and 5 (max) (≈1 line)
    gradients = clip(gradients, 5)
    
    # Update parameters (≈1 line)
    parameters = update_parameters(parameters, gradients, learning_rate)
    
    ### END CODE HERE ###
    
    return loss, gradients, a[len(X)-1]

### 3 - Training the model 

Given the dataset of Finnish surnames, we use each line of the dataset (one surname) as one training example. Every 100 steps of stochastic gradient descent, you will sample 10 randomly chosen surnames to see how the algorithm is doing. Remember to shuffle the dataset, so that stochastic gradient descent visits the examples in random order. 

In [7]:
def model(data, ix_to_char, char_to_ix, num_iterations = 500000, n_a = 50, surnames = 10, vocab_size = 29):
    """
    Trains the model and generates dinosaur names. 
    
    Arguments:
    data -- text corpus
    ix_to_char -- dictionary that maps the index to a character
    char_to_ix -- dictionary that maps a character to an index
    num_iterations -- number of iterations to train the model for
    n_a -- number of units of the RNN cell
    surnames -- number of Finnish surnames you want to sample at each iteration. 
    vocab_size -- number of unique characters found in the text, size of the vocabulary
    
    Returns:
    parameters -- learned parameters
    """
    
    print('n_a',n_a)
    
    # Retrieve n_x and n_y from vocab_size
    n_x, n_y = vocab_size, vocab_size
    
    # Initialize parameters
    parameters = initialize_parameters(n_a, n_x, n_y)
    
    # Initialize loss (this is required because we want to smooth our loss, don't worry about it)
    loss = get_initial_loss(vocab_size, surnames)
    
    # Build list of all dinosaur names (training examples).
    with open("finnish_lastnames.txt", encoding='latin-1') as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]
    
    # Shuffle list of all dinosaur names
    np.random.seed(0)
    np.random.shuffle(examples)
    
    # Initialize the hidden state of your LSTM
    a_prev = np.zeros((n_a, 1))
    
    # Optimization loop
    for j in range(num_iterations):
        
        ### START CODE HERE ###

        # Use the hint above to define one training example (X,Y) (≈ 2 lines)
        index = j % len(examples)
        X = [None] + [char_to_ix[ch] for ch in examples[index]] 
        Y = X[1:] + [char_to_ix["\n"]]
        
        # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
        # Choose a learning rate of 0.01
        
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate=0.01)
        
        ### END CODE HERE ###
        
        # Use a latency trick to keep the loss smooth. It happens here to accelerate the training.
        loss = smooth(loss, curr_loss)

        # Every 2000 Iteration, generate "n" characters thanks to sample() to check if the model is learning properly
        if j % 2000 == 0:
            
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
            
            # The number of dinosaur names to print
            seed = 0
            for name in range(surnames):
                
                # Sample indices and print them
                sampled_indices = sample(parameters, char_to_ix, seed)
                print_sample(sampled_indices, ix_to_char)
                
                seed += 1  # To get the same result for grading purposed, increment the seed by one. 
      
            print('\n')
        
    return parameters

In [8]:
parameters = model(data, ix_to_char, char_to_ix)

n_a 50
Iteration: 0, Loss: 33.686429

Nküäävcmerpeögury-ujjiyv
Knea
Küäävcmerpeögury-ujjiyv
Nea
Üäävcmerpeögury-ujjiyv
Ea
Äävcmerpeögury-ujjiyv
A
Ävcmerpeögury-ujjiyv



Iteration: 2000, Loss: 24.683657

Mivvurakenlaulolu
Kka
Käturakenlaulolu
Md
Äturakenlaulolu
E
Uttalalmaukontanilertaltiyla
-
Uralalmaukontanilertaltiyla



Iteration: 4000, Loss: 21.632903

Mewtvo
Kka
Läturaijko
Ma
Ätusalajnevanku
Ha
Uttalago
A
Usalako



Iteration: 6000, Loss: 20.293826

Nevtunen
Leja
Lätupala
Nei
Wuutalaanetalikanen
Iaahsta
Uttakali
A
Ttaikmi



Iteration: 8000, Loss: 19.629378

Netturanen
Lela
Lätto
Nel
Yuttamamkasinen
Jaalosi
Uutamanmatkolo
Aalori
Tukkanietimoranilalo
Aitti


Iteration: 10000, Loss: 19.317964

Nevvurajallatinirankalri
Liikakorha
Läusramallatimio
Neja-rula
Ysto-karnatanni
Jaalko
Vusamalkatileranjakon
Aakora
Utilakkasanmä
Ajttaa


Iteration: 12000, Loss: 18.960299

Nettuo
Leijaaro
Lävären
Nek
Ysto
Jaakko
Uusala
Aakora
Tujokko
Aistea


Iteration: 14000, Loss: 18.589429

Moutupahe
Leij

Iteration: 132000, Loss: 17.017431

Pousto
Miihaanta
Mäsvy-nikkiuojinen
Peidennel
Vääsilä
Kaalonen
Tyski-polvanniami
Haassa
Sumo
Akori


Iteration: 134000, Loss: 17.216597

Peutula
Luoi
Mätynen
Padakki
Vättilemmasalo
Kaakkoja
Särila
Haitteehirko
Sukkala
Almikaala


Iteration: 136000, Loss: 16.949653

Revpänen
Piha
Pöytä-lasneuri
Raaala
Vätäjäki
Kaajora
Tyvinen
Haaste
Sukiala
Ajonen


Iteration: 138000, Loss: 17.005093

Poutti
Mula
Mätvä
Pehakku
Väyrikaro
Laakno
Supela
Hahska
Sukkala
Almo


Iteration: 140000, Loss: 17.155461

Piätto
Melea
Mäsvä
Paehanpeija
Vustamaa
Kaamisaahio
Surimantasalo
Hakora
Sukkanen
Alopaa


Iteration: 142000, Loss: 16.977045

Pivästeljä
Muka
Mäsärehoinen
Pahajärvi
Vurtho
Kaallo
Tytlä
Haisti
Sulmankaummo
Ahtti


Iteration: 144000, Loss: 16.891704

Pousuo
Puke
Pöäsyksallakkolo
Pehakoski
Vättilä
Laamonen
Tytilä
Haaroja
Sukkamäks
Ahosaa


Iteration: 146000, Loss: 16.945478

Pousunen
Poika
Pöysta
Pehajärvi
Vurra
Laakko
Suokkala
Haasta
Suli
Ahtti


Iteration: 148000, 

Iteration: 266000, Loss: 16.803446

Petätta
Moiha-pyrjensari
Näätselmäki
Paaharmi
Vuosala
Kaakkola
Supalahti
Hahtia
Sollamäku
Airoka


Iteration: 268000, Loss: 16.813310

Phävän
Maka
Näättala
Paahanka
Vuosala
Kaalosaalimaa
Suolikka
Haesti
Sillanies
Almikaari


Iteration: 270000, Loss: 16.921769

Pouturanniemo
Moija
Nyyttela
Peidepäke
Vuorge
Kaalosaa
Surhi
Hahtoja
Sukkanen
Ahtoa


Iteration: 272000, Loss: 16.803362

Pousto
Moikahorka
Mätvä-lansasalo
Pieenome
Vuosila
Kaakkola
Suojokkkiperä
Haastela
Sulkkonen
Aistela


Iteration: 274000, Loss: 16.968128

Orvonen
Maiha
Mätänen
Oka
Uuttala
Kaakkola
Suramaa
Haerikaari
Sulajuri
Ahtola


Iteration: 276000, Loss: 16.714538

Nevästi
Miekainen
Mästäjäkky
Neiejääka
Vätäjämäki
Kaalonen
Särinen
Hahsokaarporju
Suinenmotjärvi
Airta


Iteration: 278000, Loss: 16.782660

Orvori
Moijakoski
Mätösalmi
Piedisti
Vuttela
Laakos
Syvio
Kajko
Sulikki
Aitula


Iteration: 280000, Loss: 16.802336

Opyuto
Muoka-suma
Mättö
Oka
Vustamaa
Kaakko
Surenius
Hahso
Sulkkokev

Iteration: 400000, Loss: 16.643063

Puutto
Mule
Märviemin
Peidesjo
Vustala
Laamisaa
Vuski
Hakos
Vullinieri
Ahske


Iteration: 402000, Loss: 16.649457

Pausunen
Opai
Ovuora
Pahajärse
Vurtamaa
Kaanto
Turanen
Halsta
Tiitakkaski
Ahri


Iteration: 404000, Loss: 16.715031

Piättä
Ola-ahtti
Ottonen
Peihkonen
Väätinen
Maakopaa
Tyykkylä
Kaisti
Tuitinen
Ahorahi


Iteration: 406000, Loss: 16.790271

Petävieri
Ola
Pöyttinen
Paahiniema
Vustamaa
Kaamonen
Turinen
I
Tukkamäkkimäki
Aarta


Iteration: 408000, Loss: 16.682007

Piärynen
Moija
Möttä
Pehakoski
Vuoselli
Kaalosaalin
Suukkamäki
Jaitte
Soininen
Ajusaahkorti


Iteration: 410000, Loss: 16.694335

Piärynen
Mihe
Nyyttalahti
Paahinne
Vurtamaa
Kaanto
Tyvimetsavarta
Haarraa
Sukola
Ahtula


Iteration: 412000, Loss: 16.681615

Okyvuohi
Lahd
Määtvilä
Oja
Vuosilä
Kaamonen
Suokkamo
Jahppi
Sihkennetiisi
Deste


Iteration: 414000, Loss: 16.788901

Pivärä
Piikalainen
Pötönen
Pehakoski
Tyvähti
Maakonen
Sushi
Hahtikaarmi
Sukkanen
Ahoska


Iteration: 416000, Los

#### References:

    This project took inspiration from AI SW Development Hands On workshop organized by Nokia and conducted by Tarry Singh. Implementation code uses parts of code from Andrew Ng's exercise Dinosaurus Island which is a part of Deep Learning specialization at Coursera (https://www.coursera.org/specializations/deep-learning).