Our first method uses simple counting: for every pair of input characters, we count how often each character comes next. For example, in the name jane, 'n' follows 'ja', so that occurrence adds one to the count (and therefore to the probability) of 'n' being the next character after 'ja'.

In [546]:
import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
%pip install torch torchvision
import torch

Note: you may need to restart the kernel to use updated packages.


In [547]:
# Read in words
words = open('names.txt', 'r').read().splitlines()
words[:10]


['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

In this instance, we only look at three characters at a time: two input characters and the character we are predicting next.

e.g. which characters are likely to follow 'ja'

Thanks to the start and end tokens, we also learn which characters names are likely to start and finish with.


We then do a simple count of how often each trigram occurs.

How many examples do we get from emma? Four:

<S>e -> m
em -> m
mm -> a
ma -> <E>
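The four examples above can be reproduced with a short helper. This is only a minimal sketch: the function name `trigrams_of` is hypothetical and is not used elsewhere in this notebook.

```python
def trigrams_of(word):
    """Return (two-character context, next character) examples for one word."""
    chs = ['<S>'] + list(word) + ['<E>']  # pad with start and end tokens
    # Slide a window of three tokens over the padded sequence
    return [(a + b, c) for a, b, c in zip(chs, chs[1:], chs[2:])]

examples = trigrams_of('emma')
for context, target in examples:
    print(f"{context} -> {target}")
```

A word with n letters yields n examples, which is why emma gives four.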

In [548]:
t = {}
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']  # pad with a start token and an end token
    for pair, ch2 in zip(zip(chs, chs[1:]), chs[2:]):
        trigram = (''.join(pair), ch2)  # e.g. ('em', 'm')
        t[trigram] = t.get(trigram, 0) + 1

Now let's get the count of each trigram and sort. We'll see that 'ah' followed by the end token occurs most often.

In [549]:
sorted_t = sorted(t.items(), key = lambda kv: -kv[1])
sorted_t[:10]

[(('ah', '<E>'), 1714),
 (('na', '<E>'), 1673),
 (('an', '<E>'), 1509),
 (('on', '<E>'), 1503),
 (('<S>m', 'a'), 1453),
 (('<S>j', 'a'), 1255),
 (('<S>k', 'a'), 1254),
 (('en', '<E>'), 1217),
 (('ly', 'n'), 976),
 (('yn', '<E>'), 953)]
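The same count-and-sort step could also be written with `collections.Counter` from the standard library. The sketch below is an alternative under that assumption, not the approach used in this notebook; `count_trigrams` and the three sample names are made up for illustration.

```python
from collections import Counter

def count_trigrams(words):
    """Count (two-character context, next character) occurrences."""
    counts = Counter()
    for w in words:
        chs = ['<S>'] + list(w) + ['<E>']
        for a, b, c in zip(chs, chs[1:], chs[2:]):
            counts[(a + b, c)] += 1
    return counts

sample_counts = count_trigrams(['anna', 'hannah', 'ana'])
# most_common(k) returns the k most frequent trigrams, highest count first
top = sample_counts.most_common(3)
print(top)
```

`most_common` replaces the manual `sorted(..., key=lambda kv: -kv[1])` call.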

Let's convert our trigram counts to an array. We need to work out how many rows we have: one for every combination of two characters from our list of 28.

The 28 characters are the 26 letters of the alphabet plus our <S> and <E> tokens. A character can be paired with itself (e.g. 'aa') as well as with each of the other 27, so there are at most 28*28 = 784 combinations.
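The figure of 784 is an upper bound. A context pair never contains <E>, and <S> can only appear in the first position, so fewer rows can actually occur. The sketch below counts them, assuming the names contain only the letters a-z; `contexts` and `context_to_index` are hypothetical names used for illustration.

```python
import string

letters = string.ascii_lowercase

# Contexts made of two letters: 'aa', 'ab', ..., 'zz'
letter_pairs = [a + b for a in letters for b in letters]
# Contexts at the start of a name: '<S>a', ..., '<S>z'
start_pairs = ['<S>' + a for a in letters]

contexts = letter_pairs + start_pairs
context_to_index = {pair: i for i, pair in enumerate(contexts)}

print(len(letter_pairs), len(start_pairs), len(contexts))  # 676 26 702
print(28 * 28)  # 784
```

Allocating all 784 rows would still work; the extra rows would simply stay at zero.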

In [562]:
import string

# All lowercase letters
letters = string.ascii_lowercase

# Map every possible context pair to a row index. A dict (rather than a list of tuples)
# lets us look up pairs_string_to_integer['ab'] and call .items() later on.
# Letter pairs 'aa'..'zz' come first, followed by the start pairs '<S>a'..'<S>z'.
all_pairs = [a + b for a in letters for b in letters] + ['<S>' + a for a in letters]
pairs_string_to_integer = {pair: i for i, pair in enumerate(all_pairs)}

# Print the first 10 items
print(list(pairs_string_to_integer.items())[:10])

[('aa', 0), ('ab', 1), ('ac', 2), ('ad', 3), ('ae', 4), ('af', 5), ('ag', 6), ('ah', 7), ('ai', 8), ('aj', 9)]




In [None]:
import torch

N = torch.zeros((len(pairs_string_to_integer), 28), dtype=torch.int32)  # one row per context pair, 28 columns for the next character
N[:5]

We need integer lookups so that every trigram has a position in the tensor: each character pair needs an integer (its row) and each predicted character needs an integer (its column). For example, if 'ac' maps to row 100 and 'd' maps to column 3, then the trigram 'ac' -> 'd' lives at N[100, 3]. Each time we see that trigram we add 1 to that cell, which lets us count the number of occurrences.

If I understand this correctly, we therefore need two integer lookups:
- Character pairs on the 'y' axis (rows): indices 0 to len(unique pairs) - 1, i.e. roughly 0-600.
- Single characters on the 'x' axis (columns): 28 indices (the alphabet plus <S> and <E>).

0   |------------------------------28
    |<S>aa   <S>ab  <S>ac
    |aaa     aab    aac
    |..
    |..
    |..
600 |zaa zab
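The row/column idea in the diagram above can be tried on a toy table. This sketch uses plain Python lists instead of a torch tensor so that it runs without any dependencies; the tiny vocabulary and the names `row_of`, `col_of` and `record` are hypothetical.

```python
vocab = ['a', 'b', 'c', 'd', '<E>']            # toy next-character vocabulary
pairs = ['ab', 'ac', 'bc']                     # toy set of context pairs
row_of = {p: i for i, p in enumerate(pairs)}   # pair -> row index
col_of = {ch: j for j, ch in enumerate(vocab)} # character -> column index

# One row per pair, one column per possible next character
counts = [[0] * len(vocab) for _ in pairs]

def record(pair, next_ch):
    """Add one observation of `next_ch` following `pair`."""
    counts[row_of[pair]][col_of[next_ch]] += 1

record('ac', 'd')
record('ac', 'd')
record('ab', 'c')
print(counts[row_of['ac']])  # [0, 0, 0, 2, 0]
```

The real table works the same way, only with one row per context pair and 28 columns.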

Let's get the character-to-integer index for our 'x' axis (the predicted character)

In [None]:
# Sorted list of the unique characters that appear in the names
chars = sorted(set(''.join(words)))
pred_string_to_integer = {s: i for i, s in enumerate(chars)}
# Append the start and end tokens after the 26 letters
pred_string_to_integer['<S>'] = 26
pred_string_to_integer['<E>'] = 27
print(list(pred_string_to_integer)[:10])

Now let's store our counts in a tensor

In [None]:
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']  # pad with a start token and an end token
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = pairs_string_to_integer[ch1 + ch2]  # row: the two-character context
        ix2 = pred_string_to_integer[ch3]         # column: the character that follows
        N[ix1, ix2] += 1

In [None]:
N[10]

In [None]:
%pip install matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Create the inverse lookups (integer to string)

In [None]:
pred_itos = {i: s for s, i in pred_string_to_integer.items()}
# dict(...) accepts either a dict or a list of (pair, index) tuples
pair_itos = {i: s for s, i in dict(pairs_string_to_integer).items()}

In [None]:
# Assumed fix: annotating every row at once is slow and unreadable,
# so plot only the first few rows of the count table.
rows = 10
plt.figure(figsize=(16, 8))
plt.imshow(N[:rows], cmap='Blues', aspect='auto')
for i in range(rows):
    for j in range(len(pred_itos)):
        chstr = pair_itos[i] + pred_itos[j]
        plt.text(j, i, chstr, ha="center", va="bottom", color='gray')
        plt.text(j, i, str(N[i, j].item()), ha="center", va="top", color="gray")
plt.axis('off')

In [None]:
N[10]

Now let's convert the counts to probabilities. Each row of P gives the probability of every possible next character, given that row's two-character context.

In [None]:
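P = (N + 0.01).float()  # add a small constant so unseen trigrams keep a non-zero probability
P /= P.sum(1, keepdims=True)  # normalize each row so it sums to 1
P[1]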
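To see what the small added constant does, here is the same row normalization on two toy rows, written in plain Python so that it runs without torch. The helper name `smooth_and_normalize` and the example counts are made up for illustration.

```python
def smooth_and_normalize(row, k=0.01):
    """Add k to every count, then divide by the row total."""
    smoothed = [c + k for c in row]
    total = sum(smoothed)
    return [s / total for s in smoothed]

seen = smooth_and_normalize([3, 1, 0, 0])    # a context that occurs in the data
unseen = smooth_and_normalize([0, 0, 0, 0])  # a context that never occurs
print(seen)
print(unseen)
```

A row with no observations becomes a uniform distribution instead of a division by zero, so sampling still works for contexts that never appear in the dataset.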

For sampling we will use torch.multinomial together with a generator object so that the results are reproducible.

`torch.Generator()` creates an object that holds the state of the random number generator.

`manual_seed()` sets the seed and returns the generator. With the same seed you get the same sequence of random numbers, which is useful for debugging and testing.

In [None]:
g = torch.Generator().manual_seed(214743647)
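A quick way to check the reproducibility claim is to draw twice from freshly seeded generators. This is a small sketch; the helper `draw`, the seed 42 and the weights are arbitrary choices for illustration.

```python
import torch

def draw(seed, weights, n=5):
    """Sample n indices from `weights` using a freshly seeded generator."""
    g = torch.Generator().manual_seed(seed)
    return torch.multinomial(weights, num_samples=n, replacement=True, generator=g).tolist()

weights = torch.tensor([0.6, 0.3, 0.1])
first = draw(42, weights)
second = draw(42, weights)
print(first == second)  # True: same seed, same samples
```

Reusing one generator across several calls gives different draws each time, because its internal state advances with every call.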

Logic: we predict the third character from the two before it, e.g. em -> a.
We then drop the first character of the pair and append the predicted character to form the next context, i.e. ma -> b.

- To start a prediction, we need a pair that begins with <S>, i.e. <S> followed by a letter
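The sliding-context logic can be sketched end to end on a toy table. This version uses Python's `random` module rather than torch, and the table `toy_table` and function `generate` are hypothetical; every context has exactly one continuation here, so the output is fully determined.

```python
import random

# Toy trigram table (made-up data): context pair -> {next character: weight}
toy_table = {
    '<S>e': {'m': 1},
    'em': {'m': 1},
    'mm': {'a': 1},
    'ma': {'<E>': 1},
}

def generate(table, start_pair, rng):
    """Generate one name by repeatedly sampling the next character."""
    pair = start_pair
    out = [pair[-1]]  # the name begins with the letter of the start pair
    while True:
        choices = table[pair]
        ch = rng.choices(list(choices), weights=list(choices.values()), k=1)[0]
        if ch == '<E>':  # the end token stops generation
            break
        out.append(ch)
        pair = pair[-1] + ch  # drop the first character, append the prediction
    return ''.join(out)

name = generate(toy_table, '<S>e', random.Random(0))
print(name)  # emma
```

Note that `pair[-1]` also works for a start pair such as '<S>e', because the last character of that string is the letter.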

In [None]:
start_chars = []
for key, value in pair_itos.items():
    if '<S>' in str(value):
        start_chars.append((key, value))
start_chars[:10]

We also need to know when to end the sequence: generation stops as soon as the predicted character is <E>.

In [None]:
import random
ix = random.choice(start_chars)


In [None]:
p = P[ix[0]]  # probability row for the start pair chosen above (ix is an (index, pair) tuple)
iy_pred_reference = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
iy_pred_reference

In [None]:
ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
ix

Note: when we run through predictions, the context slides forward one character at a time, like ab -> c, bc -> d, cd -> e.
Therefore we need to combine the second character of the pair with the predicted character. This is demonstrated below.

Note: [-1] gets the last character of a string

In [None]:
ix_pair = pair_itos[10][-1] + pred_itos[10] 
ix_pair

### LATEST TO DO: We need to map every possible pair from aa -> zz, so that pairs that never occur in our dataset are still accounted for
### e.g. bk does not occur here, so looking it up returns an error as the pair does not exist
### We could assign a random character when a pair does not exist, but it would be nicer to handle this inside the tensor
pairs_string_to_integer[ix_pair]

In [None]:
import random
g = torch.Generator().manual_seed(214748347)
pair_stoi = dict(pairs_string_to_integer)  # pair -> row index

for i in range(10):
    start_ix = random.choice(start_chars)  # random (index, pair) among the pairs that start with <S>
    ix_pair = start_ix[0]  # integer row for the pair
    out = [start_ix[1][-1]]  # the name begins with the letter of the start pair
    while True:
        p = P[ix_pair].clone()
        p[pred_string_to_integer['<S>']] = 0  # '<S>' can never follow a pair, so never sample it
        iy_pred_reference = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        ch = pred_itos[iy_pred_reference]
        if ch == '<E>':
            break
        out.append(ch)
        # Last letter of the pair plus the predicted char gives the next context,
        # e.g. xy predicts z, so the next pair is yz; convert it back to a row index
        ix_pair = pair_stoi[pair_itos[ix_pair][-1] + ch]
    print(''.join(out))


In [None]:
print(t)
print(pairs_string_to_integer)
# pred_string_to_integer = {s:i for i,s in enumerate(chars)}
# pred_string_to_integer['<S>'] = 26
# pred_string_to_integer['<E>'] = 27
# pred_string_to_integer

In [None]:
print(pred_itos)