A character-level language model implemented with several different architectures: Bigram, BoW (bag of words), MLP, RNN, GRU and, finally, a transformer equivalent to GPT-2.

### Reading and exploring the dataset

In [13]:
import json

with open("names.txt", "r") as f:     # despite the .txt extension, the file holds JSON
    names = json.load(f)["payload"]["blob"]["rawLines"]
names[:10]

['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

In [14]:
len(names)

32033

In [15]:
min(len(n) for n in names)

2

In [18]:
max(len(n) for n in names)

15

### Bigram Language Model

A bigram language model (BLM) is a simple model that works with two characters at a time: a given character and the character predicted to follow it. It stores the probability of each character following another in a lookup table. That probability is estimated by dividing the number of times the second character follows the first by the total number of times the first character appears in the training data.
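The estimate described above can be sketched on a toy corpus. This is only an illustration: the three-word list and the names `pair_counts`, `first_counts` and `bigram_prob` are assumptions made for the example and are not part of the notebook.

```python
# Toy corpus, chosen only for illustration (not the notebook's dataset).
words = ["emma", "ava", "mia"]

pair_counts = {}    # how often ch2 directly follows ch1
first_counts = {}   # how often ch1 appears as the first element of a bigram

for w in words:
    chs = ["<S>"] + list(w) + ["<E>"]     # same start/end tokens as in the notebook
    for ch1, ch2 in zip(chs, chs[1:]):
        pair_counts[(ch1, ch2)] = pair_counts.get((ch1, ch2), 0) + 1
        first_counts[ch1] = first_counts.get(ch1, 0) + 1


def bigram_prob(ch1, ch2):
    """Estimate P(ch2 | ch1) as count(ch1, ch2) / count(ch1)."""
    total = first_counts.get(ch1, 0)
    if total == 0:
        return 0.0     # ch1 never seen: no estimate available, return 0 by convention
    return pair_counts.get((ch1, ch2), 0) / total


print(bigram_prob("a", "<E>"))     # how likely a word is to end after 'a'
```

Because every character inside a word is followed by something (at least `<E>`), the probabilities for a given first character sum to 1.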

#### Exploring the bigrams in the dataset

In [25]:
for n in names[:3]:
    print(n)
    chs = ["<S>"] + list(n) + ["<E>"]     # <S> and <E> are special start/end tokens
    print("chs: ", chs)
    for ch1, ch2 in zip(chs, chs[1:]):     # pair each token with its successor
        print(ch1, ch2)

emma
chs:  ['<S>', 'e', 'm', 'm', 'a', '<E>']
<S> e
e m
m m
m a
a <E>
olivia
chs:  ['<S>', 'o', 'l', 'i', 'v', 'i', 'a', '<E>']
<S> o
o l
l i
i v
v i
i a
a <E>
ava
chs:  ['<S>', 'a', 'v', 'a', '<E>']
<S> a
a v
v a
a <E>


### Counting bigrams in a Python dictionary

To learn which characters are likely to follow which, the simplest approach in a bigram language model is to count how many times each character follows another. A Python dictionary keyed by the (first, second) character pair is enough for this.

In [37]:
b = {}
for n in names:
    chs = ["<S>"] + list(n) + ["<E>"]     # <S> and <E> are special start/end tokens
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = (ch1, ch2)
        b[bigram] = b.get(bigram, 0) + 1

sorted(b.items(), key=lambda kv: -kv[1])[:10]

[(('n', '<E>'), 6763),
 (('a', '<E>'), 6640),
 (('a', 'n'), 5438),
 (('<S>', 'a'), 4410),
 (('e', '<E>'), 3983),
 (('a', 'r'), 3264),
 (('e', 'l'), 3248),
 (('r', 'i'), 3033),
 (('n', 'a'), 2977),
 (('<S>', 'k'), 2963)]
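As a possible alternative to the manual `dict.get` pattern, the same counts can be collected with `collections.Counter` from the standard library. The sketch below uses a three-name stand-in list (`sample_names`) and a hypothetical helper `bigrams`, neither of which appears in the notebook.

```python
from collections import Counter

# Toy stand-in for the notebook's `names` list.
sample_names = ["emma", "olivia", "ava"]


def bigrams(name):
    """Yield (ch1, ch2) pairs for one name, including the <S>/<E> tokens."""
    chs = ["<S>"] + list(name) + ["<E>"]
    return zip(chs, chs[1:])


# Counter tallies every pair produced by the generator expression.
counts = Counter(bg for name in sample_names for bg in bigrams(name))

# most_common(k) returns the k most frequent pairs, highest count first.
print(counts.most_common(1))
```

`most_common` replaces the explicit `sorted(..., key=lambda kv: -kv[1])` call, which may be convenient when only the top few bigrams are of interest.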

### Counting bigrams in a 2D torch tensor ("training the model")

The counts are stored in a 2D array in which the rows index the first character and the columns index the second character. Each cell holds the number of times the second character follows the first.

In [46]:
import torch

N = torch.zeros((28, 28), dtype=torch.int32)     # 28 = 26 letters + <S> + <E>
N.shape

torch.Size([28, 28])

In [None]:
for n in names:
    chs = ["<S>"] + list(n) + ["<E>"]     # <S> and <E> are special start/end tokens
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = (ch1, ch2)     # still characters: they need integer indices before N can be updated

In [56]:
chars = sorted(set("".join(names)))     # unique characters in the dataset, in alphabetical order
len(chars)

26

In [58]:
chtoi = {ch: i for i, ch in enumerate(chars)}     # character -> integer index
chtoi

{'a': 0,
 'b': 1,
 'c': 2,
 'd': 3,
 'e': 4,
 'f': 5,
 'g': 6,
 'h': 7,
 'i': 8,
 'j': 9,
 'k': 10,
 'l': 11,
 'm': 12,
 'n': 13,
 'o': 14,
 'p': 15,
 'q': 16,
 'r': 17,
 's': 18,
 't': 19,
 'u': 20,
 'v': 21,
 'w': 22,
 'x': 23,
 'y': 24,
 'z': 25}
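With a character-to-index table available, the counting loop could be completed roughly as follows. This is a sketch under assumptions: it runs on a three-name stand-in list rather than the full dataset, and it assigns `<S>` and `<E>` the next two free indices, a choice the notebook has not made at this point. On the full dataset that choice would give indices 26 and 27, matching the 28 × 28 tensor.

```python
import torch

# Toy stand-in for the notebook's `names` list.
sample_names = ["emma", "olivia", "ava"]

chars = sorted(set("".join(sample_names)))
chtoi = {ch: i for i, ch in enumerate(chars)}
# Assumption: the special tokens take the next two free indices.
chtoi["<S>"] = len(chars)
chtoi["<E>"] = len(chars) + 1

V = len(chtoi)                                  # vocabulary size including <S> and <E>
N = torch.zeros((V, V), dtype=torch.int32)      # N[i, j] = count of j following i

for n in sample_names:
    chs = ["<S>"] + list(n) + ["<E>"]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[chtoi[ch1], chtoi[ch2]] += 1          # row = first char, column = second char

print(N.shape)
print(N[chtoi["a"], chtoi["<E>"]].item())       # number of sample names ending in 'a'
```

Each row of `N` then holds the raw counts for one first character; dividing a row by its sum would turn it into the probability distribution over the next character.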