# Bigram Language Model 

This is based on Andrej Karpathy's YouTube video [The spelled-out intro to language modeling: building makemore](https://www.youtube.com/watch?v=PaCmpygFfXo).

It explains how to develop a simple bigram language model using a neural network. It covers model training, sampling, and the evaluation of a loss function.

- [bigram notebook file](https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part1_bigrams.ipynb)
- [makemore repository](https://github.com/karpathy/makemore)

## The Dataset and Language Model

In [2]:
words = open("names.txt", "r").read().splitlines()

In [5]:
print(words[:10])
print(len(words))

lengths = [len(word) for word in words]
print(min(lengths), max(lengths))

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia', 'harper', 'evelyn']
32033
2 15


### Character-Level Language Model

A character-level language model tries to predict the next character from the characters that precede it. The dataset contains `32033` words, each between `2` and `15` characters long.

For example, in the word `isabella`:

- `i` is the first character.
- each middle character follows another character.
- after the last `a`, the word terminates.
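
The three cases above can be made concrete with a short sketch that lists, for every position in a word, the characters seen so far and the character to predict. The helper `next_char_examples` is hypothetical, and the `<E>` string is only an illustrative marker for "the word ends here".

```python
def next_char_examples(word, end="<E>"):
    """Return (context, target) pairs for a character-level model."""
    examples = []
    for i in range(len(word) + 1):
        context = word[:i]  # characters seen so far (empty at the start)
        # the next character, or the end marker after the last one
        target = word[i] if i < len(word) else end
        examples.append((context, target))
    return examples


for context, target in next_char_examples("isabella"):
    print(repr(context), "->", target)
```

A word of length `n` gives `n + 1` prediction examples: one per character, plus one for the end of the word.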

### Bigram Language Model

A bigram language model works with two characters at a time: it looks only at the current character and predicts the next one.

It is a deliberately simple model, used here for pedagogical purposes.

### Bigram From Words

Each word generates `n-1` pairs of adjacent characters, where `n` is the length of the word.

In [None]:
# each word generates n-1 pairs of characters
# where n is the length of the word
for w in words[:3]:
    for ch1, ch2 in zip(w, w[1:]):
        print(ch1, ch2)

### Start and End

However, each word also has two implicit tokens: `<S>` marks its start and `<E>` marks its end.

The code to generate the bigrams, including these tokens, is shown below:

In [None]:
for w in words[:3]:
    chars = ["<S>"] + list(w) + ["<E>"]
    for ch1, ch2 in zip(chars, chars[1:]):
        print(ch1, ch2)
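
With the two markers added, a word of length `n` yields `n + 1` bigrams rather than `n - 1`. A minimal sketch to check this on a single word, using a hypothetical helper `bigrams_with_markers`:

```python
def bigrams_with_markers(word, start="<S>", end="<E>"):
    """Return the list of bigrams for one word, including start/end markers."""
    chars = [start] + list(word) + [end]
    return list(zip(chars, chars[1:]))


pairs = bigrams_with_markers("emma")
print(pairs)
# [('<S>', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '<E>')]
print(len(pairs) == len("emma") + 1)
# True
```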

### Bigram Frequency

Predictions are based on how often each bigram occurs in the dataset, so the first step is to count the bigrams.


In [11]:
b = {}
for w in words:
    chars = ["<S>"] + list(w) + ["<E>"]
    for ch1, ch2 in zip(chars, chars[1:]):
        bigram = (ch1, ch2)
        b[bigram] = b.get(bigram, 0) + 1

In [20]:
from itertools import islice

# First three bigrams in insertion order.
three = islice(b.items(), 3)
print(list(three))
# Three most frequent bigrams.
sorted(b.items(), key=lambda item: item[1], reverse=True)[:3]

[(('<S>', 'e'), 1531), (('e', 'm'), 769), (('m', 'm'), 168)]


[(('n', '<E>'), 6763), (('a', '<E>'), 6640), (('a', 'n'), 5438)]
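
One way to turn these counts into a prediction is to fix the first character and divide each bigram count by the total number of bigrams that start with that character. The sketch below does this on a tiny hand-made list, `sample_words`, which is an assumption so that the block runs without `names.txt`. The helper `next_char_probs` is hypothetical and not taken from the video.

```python
from collections import Counter

# Assumption: a three-word stand-in for the full dataset.
sample_words = ["emma", "ava", "mia"]

# Count bigrams exactly as above, but with a Counter.
counts = Counter()
for w in sample_words:
    chars = ["<S>"] + list(w) + ["<E>"]
    for ch1, ch2 in zip(chars, chars[1:]):
        counts[(ch1, ch2)] += 1


def next_char_probs(counts, ch1):
    """Relative frequency of each character that follows ch1."""
    following = {b: c for (a, b), c in counts.items() if a == ch1}
    total = sum(following.values())
    return {ch: c / total for ch, c in following.items()}


probs = next_char_probs(counts, "a")
print(probs)
# {'<E>': 0.75, 'v': 0.25}
```

In this sample, `a` is followed by the end marker three times and by `v` once, so the estimated probabilities are 0.75 and 0.25.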

In [21]:
import torch

In [26]:
# Small demo: create an integer tensor, then update single entries by index.
a = torch.zeros((3, 5), dtype=torch.int32)
a[0, 0] = 5
a[1, 3] += 1
a

tensor([[5, 0, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0]], dtype=torch.int32)

In [37]:
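# Map each character to an integer index.
chars = sorted(set("".join(words)))
# print(chars)
# Letters get indices starting at 1; index 0 is reserved for ".",
# a single token that replaces both <S> and <E> as the start/end marker.
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0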

itos = {i: ch for ch, i in stoi.items()}
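
A quick round trip shows how the two lookup tables fit together. To stay runnable without `names.txt`, this sketch assumes the dataset uses exactly the 26 lowercase letters `a` to `z` and builds the tables from `string.ascii_lowercase`. The `encode` and `decode` helpers are hypothetical names.

```python
import string

# Assumption: the vocabulary is exactly a-z.
chars = list(string.ascii_lowercase)
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0  # start/end marker
itos = {i: ch for ch, i in stoi.items()}


def encode(word):
    """Wrap the word in '.' markers and map each character to its index."""
    return [stoi[ch] for ch in "." + word + "."]


def decode(ids):
    """Map indices back to characters."""
    return "".join(itos[i] for i in ids)


ids = encode("emma")
print(ids)          # [0, 5, 13, 13, 1, 0]
print(decode(ids))  # .emma.
```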

In [38]:
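# N[i, j] counts how often character j follows character i
# (27 = 26 letters plus the "." marker).
N = torch.zeros((27, 27), dtype=torch.int32)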

for w in words:
    chars = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chars, chars[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')
# Label every cell with its bigram (above) and its count (below).
for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha="center", va="bottom", color='gray')
        plt.text(j, i, N[i, j].item(), ha="center", va="top", color='gray')
plt.axis('off');