Bigram Homework

Exercises:
E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over a bigram model?



In [58]:
import torch
import torch.nn.functional as F 

names = open('names.txt', 'r').read().splitlines()
stoi = {ch:i+1 for i, ch in enumerate(sorted(list(set(''.join(names)))))}
stoi['.'] = 0
itos = {i:ch for ch, i in stoi.items()}

In [257]:
# counting method
# make dataset
class Counting:
    def __init__(self, name_list):
        self.name_list = name_list
        self.P = self.trigram_counting_p_dist()
        self.loss = self.counting_p_loss()

    def trigram_counting_p_dist(self):
        N = torch.zeros((27,27,27), dtype = torch.int32)
        for name in self.name_list:
            name = ['.', '.'] + list(name) + ['.']
            for i in range(len(name) - 2):
                N[stoi[name[i]], stoi[name[i+1]], stoi[name[i+2]]] += 1
        P = (N+1).float()
        P /= P.sum(2, keepdim=True)         # normalize over the last axis (the 3rd character); summing over axis 1 or dropping keepdim broadcasts against the wrong axis
        return P

    def counting_p_loss(self):
        log_likelihood = 0
        num_trigrams = 0
        for name in self.name_list:
            name = ['.', '.'] + list(name) + ['.']
            for i in range(len(name) - 2):
                log_likelihood += self.P[stoi[name[i]], stoi[name[i+1]], stoi[name[i+2]]].log()
                num_trigrams += 1
        loss = - log_likelihood/num_trigrams
        return loss

    def get_names(self,num_names):
        names = []
        for i in range(num_names):
            iix = ix = 0
            out = []
            while True:
                p = self.P[iix, ix]       # distribution over the next character given the previous two
                x = torch.multinomial(p, num_samples=1, replacement=True).item()
                out.append(itos[x])
                if x == 0:
                    break
                iix, ix = ix, x
            names.append(''.join(out[:-1]))
        return names

In [264]:
whole_set = Counting(names)
print(whole_set.loss)
some_names = whole_set.get_names(10)
print(some_names)

tensor(2.2120)
['mee', 'athelin', 'lyn', 'bramarsdennai', 'vi', 'pozea', 'zymeritall', 'kmxbderdo', 'tavion', 'maday']


In [200]:
# looking at some of the P distributions of certain bigrams coming up in the names

N = torch.zeros((27,27,27), dtype = torch.int32)
print(stoi)
for name in names:
    name = ['.', '.'] + list(name) + ['.']
    for i in range(len(name) - 2):
        N[stoi[name[i]], stoi[name[i+1]], stoi[name[i+2]]] += 1
P = (N+1).float()
P /= P.sum(2, keepdim=True)         # normalize over the last axis (the 3rd character)
x = (P[13,3]*1000).int()            # next-character distribution after 'm','c', scaled by 1000 and truncated to ints
print(x.sum())
print(x)

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, '.': 0}
tensor(980)
tensor([ 12,  12,  12, 115,  12,  12,  12,  25,  38,  12,  12, 474,  51,  12,
         12,  25,  12,  12,  12,  12,  12,  12,  12,  12,  12,  12,  12],
       dtype=torch.int32)


In [256]:
class Trigram_NN:
    def __init__(self, name_list):
        self.xs, self.ys = self.NN_dataset(name_list)
        self.W, self.xenc = self.initialize_parameters()


    def NN_dataset(self, names):
        xs, ys = [], []
        for name in names:
            # append bigram to xs, result to ys
            name = ['.', '.'] + list(name) + ['.']
            for i in range(len(name)-2):
                xs.append((stoi[name[i]], stoi[name[i+1]]))
                ys.append(stoi[name[i+2]])
        xs = torch.LongTensor(xs)           # int64 indices, as required by F.one_hot
        ys = torch.LongTensor(ys)           # int64 as well, so ys can be used to index p alongside torch.arange
        return xs, ys
    
    # defining NN parameters
    def initialize_parameters(self):
        W = torch.randn((54,27), requires_grad=True)
        xenc = F.one_hot(self.xs, num_classes=27).float()
        return W, xenc
    
    def grad_descent(self, cycles, step_size=0.05, print_losses=True):
        # grad descent
        xenc_reshaped = self.xenc.view(-1,54)
        for i in range(cycles):
            # forward pass
            logits = xenc_reshaped @ self.W
            counts = logits.exp()
            p = counts / counts.sum(1, keepdim=True)
            self.loss = -p[torch.arange(len(self.ys)), self.ys].log().mean()

            # backward pass
            self.W.grad = None
            self.loss.backward()

            # update params
            self.W.data += -step_size * self.W.grad             # step against the gradient; step_size is passed in per call
            if print_losses:
                print('Cycle #{0}, loss = {1}'.format(i+1, self.loss))

    def get_names(self, num_names):
        names = []
        for i in range(num_names):
            out = []
            iix = ix = 0
            while True:
                iixenc = F.one_hot(torch.LongTensor([iix]), num_classes = 27).float()
                ixenc = F.one_hot(torch.LongTensor([ix]), num_classes = 27).float()
                xenc = torch.cat((iixenc, ixenc), -1)
                logits = xenc @ self.W
                counts = logits.exp()
                p = counts / counts.sum(1, keepdim=True)
                x = torch.multinomial(p, num_samples= 1, replacement=True).item()
                # print(itos[iix], itos[ix], itos[x], p[0,x].data)
                out.append(itos[x])
                if x == 0:
                    break
                iix, ix = ix, x
            names.append(''.join(out[:-1]))
        return names


In [247]:
full_list_nn = Trigram_NN(names)
full_list_nn.grad_descent(1, 0.2)

Cycle #1, loss = 3.966810703277588


In [254]:
full_list_nn.grad_descent(200, 0.25, True)

Cycle #1, loss = 2.809159517288208
Cycle #2, loss = 2.808817148208618
Cycle #3, loss = 2.8084754943847656
Cycle #4, loss = 2.808134078979492
Cycle #5, loss = 2.807793617248535
Cycle #6, loss = 2.8074533939361572
Cycle #7, loss = 2.8071134090423584
Cycle #8, loss = 2.806774139404297
Cycle #9, loss = 2.8064351081848145
Cycle #10, loss = 2.8060972690582275
Cycle #11, loss = 2.8057591915130615
Cycle #12, loss = 2.805422306060791
Cycle #13, loss = 2.8050854206085205
Cycle #14, loss = 2.8047492504119873
Cycle #15, loss = 2.804413318634033
Cycle #16, loss = 2.8040783405303955
Cycle #17, loss = 2.803743362426758
Cycle #18, loss = 2.8034093379974365
Cycle #19, loss = 2.8030753135681152
Cycle #20, loss = 2.802741765975952
Cycle #21, loss = 2.8024089336395264
Cycle #22, loss = 2.802076578140259
Cycle #23, loss = 2.8017444610595703
Cycle #24, loss = 2.8014132976531982
Cycle #25, loss = 2.8010823726654053
Cycle #26, loss = 2.8007519245147705
Cycle #27, loss = 2.800421953201294
Cycle #28, loss = 2.8

In [255]:
full_list_nn.get_names(20)

['lari.',
 'solinakimrarl.',
 'ae.',
 'mareeovzmnnamyra.',
 'say.',
 'jocfsuefon.',
 'ka.',
 'ztvy.',
 'cozmcfzxvvie.',
 'bjnkehah.',
 'l.',
 'gana.',
 'can.',
 'duso.',
 'keriann.',
 'liallgbqsgrotfibdkgrepnynnbywvk.',
 'aaian.',
 'dambhn.',
 'lirlynn.',
 'liy.']

E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?



In [262]:
# split dataset into 80:10:10 - train 2 and 3gram on training set - evaluate on dev and test
import random
random.shuffle(names)
num_names = len(names)
training_set = names[:int(0.8*num_names)]
dev_set = names[int(0.8*num_names):int(0.9*num_names)]
test_set = names[int(0.9*num_names):]
print(len(training_set), len(dev_set), len(test_set))

25626 3203 3204
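
The cell above only produces the three splits. A minimal sketch of the evaluation step follows, assuming count-based models with add-one smoothing. It runs on a small hypothetical list (toy_names) with its own toy_stoi instead of names.txt so that it is self-contained, and build_examples, fit_counts and nll are illustrative helper names, not part of the notebook. Swapping in the real training_set, dev_set and test_set should give the numbers the exercise asks about.

```python
import random
import torch

# Hypothetical stand-in for names.txt so the sketch runs on its own.
toy_names = ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia',
             'amelia', 'harper', 'evelyn', 'abigail', 'emily', 'ella', 'elizabeth',
             'camila', 'luna', 'sofia', 'avery', 'mila', 'aria']
toy_stoi = {ch: i + 1 for i, ch in enumerate(sorted(set(''.join(toy_names))))}
toy_stoi['.'] = 0
V = len(toy_stoi)

def build_examples(words, context_len):
    # one row per (context..., target) window, padded with '.' like the notebook
    rows = []
    for w in words:
        cs = ['.'] * context_len + list(w) + ['.']
        for i in range(len(cs) - context_len):
            rows.append([toy_stoi[c] for c in cs[i:i + context_len + 1]])
    return torch.tensor(rows)

def fit_counts(examples, smoothing=1.0):
    # count table with one axis per character, normalized over the target axis
    N = torch.zeros((V,) * examples.shape[1])
    for row in examples.tolist():
        N[tuple(row)] += 1
    P = N + smoothing
    return P / P.sum(-1, keepdim=True)

def nll(P, examples):
    # average negative log likelihood of the target given its context
    return -P[tuple(examples.T)].log().mean().item()

rng = random.Random(42)            # fixed seed so the split is reproducible
shuffled = toy_names[:]
rng.shuffle(shuffled)
n = len(shuffled)
splits = {'train': shuffled[:int(0.8 * n)],
          'dev': shuffled[int(0.8 * n):int(0.9 * n)],
          'test': shuffled[int(0.9 * n):]}

results = {}
for label, context_len in [('bigram', 1), ('trigram', 2)]:
    P = fit_counts(build_examples(splits['train'], context_len))   # train split only
    results[label] = {name: nll(P, build_examples(words, context_len))
                      for name, words in splits.items()}
    print(label, results[label])
```

Fitting on the train split only and then scoring all three splits with the same table is what makes the dev and test numbers a fair estimate of generalization. The gap between train and dev loss is the thing to compare between the bigram and trigram models.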


E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
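
One way to approach this, sketched under the assumption that "smoothing" means the constant added to the trigram counts (the +1 in the counting cell). The data is a small hypothetical list (toy_names) so the block runs on its own, and the grid of strengths is arbitrary. Because the unsmoothed counts are the maximum-likelihood fit to the training data, the train loss can only go up as the smoothing grows. The dev loss is the one to watch for a minimum.

```python
import random
import torch

# Hypothetical stand-in for names.txt so the sketch runs on its own.
toy_names = ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia',
             'amelia', 'harper', 'evelyn', 'abigail', 'emily', 'ella', 'elizabeth',
             'camila', 'luna', 'sofia', 'avery', 'mila', 'aria']
toy_stoi = {ch: i + 1 for i, ch in enumerate(sorted(set(''.join(toy_names))))}
toy_stoi['.'] = 0
V = len(toy_stoi)

def trigram_examples(words):
    # rows of (first, second, target) character indices
    rows = []
    for w in words:
        cs = ['.', '.'] + list(w) + ['.']
        for a, b, c in zip(cs, cs[1:], cs[2:]):
            rows.append([toy_stoi[a], toy_stoi[b], toy_stoi[c]])
    return torch.tensor(rows)

def nll(N, smoothing, examples):
    # smooth the raw counts, normalize over the target axis, score the examples
    P = N + smoothing
    P = P / P.sum(2, keepdim=True)
    a, b, c = examples.T
    return -P[a, b, c].log().mean().item()

rng = random.Random(42)
shuffled = toy_names[:]
rng.shuffle(shuffled)
n = len(shuffled)
train_ex = trigram_examples(shuffled[:int(0.8 * n)])
dev_ex = trigram_examples(shuffled[int(0.8 * n):int(0.9 * n)])
test_ex = trigram_examples(shuffled[int(0.9 * n):])

# raw counts come from the train split only and are reused for every strength
N = torch.zeros((V, V, V))
for a, b, c in train_ex.tolist():
    N[a, b, c] += 1

strengths = [0.01, 0.1, 1.0, 10.0]          # arbitrary grid, chosen for illustration
train_losses = [nll(N, k, train_ex) for k in strengths]
dev_losses = [nll(N, k, dev_ex) for k in strengths]
for k, tr, dv in zip(strengths, train_losses, dev_losses):
    print(f'smoothing={k}: train={tr:.4f} dev={dv:.4f}')

# pick the strength by dev loss, then touch the test split exactly once
best_k = strengths[dev_losses.index(min(dev_losses))]
test_loss = nll(N, best_k, test_ex)
print(f'best smoothing={best_k}, test loss={test_loss:.4f}')
```

For the neural net version, the analogous knob would be a weight penalty such as a multiple of (W**2).mean() added to the loss, tuned on the dev set in the same way.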




E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?
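
A small sketch of the idea for the 54x27 weight layout used in Trigram_NN: the concatenated one-hot vector has exactly two ones, at position first and at position 27 + second, so the matrix product is just the sum of those two rows of W. The random W and xs below are placeholders for illustration, not the trained weights.

```python
import torch
import torch.nn.functional as F

V = 27
g = torch.Generator().manual_seed(0)
W = torch.randn((2 * V, V), generator=g)        # same shape as W in Trigram_NN
xs = torch.randint(0, V, (8, 2), generator=g)   # placeholder batch of (first, second) indices

# original route: build one-hot rows explicitly, then multiply
xenc = F.one_hot(xs, num_classes=V).float().view(-1, 2 * V)
logits_onehot = xenc @ W

# indexing route: pick the row for the first character from the top half of W
# and the row for the second character from the bottom half, then add them
logits_indexed = W[xs[:, 0]] + W[V + xs[:, 1]]

print(torch.allclose(logits_onehot, logits_indexed))
```

The indexing route never materializes the one-hot tensor, so it saves both the memory for xenc and the multiplications by zero.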




E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?
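
A sketch comparing the manual loss from grad_descent with F.cross_entropy on placeholder logits and targets (random values, not the model's). The second half shows one reason to prefer the built-in: the manual exp-then-normalize route overflows in float32 for large logits, while F.cross_entropy works in log space. Note that F.cross_entropy expects class-index targets as int64.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
logits = torch.randn((8, 27), generator=g)     # placeholder logits
ys = torch.randint(0, 27, (8,), generator=g)   # placeholder int64 targets

# manual route, as in grad_descent
counts = logits.exp()
p = counts / counts.sum(1, keepdim=True)
manual_loss = -p[torch.arange(len(ys)), ys].log().mean()

# built-in route: softmax, log and mean in a single call
ce_loss = F.cross_entropy(logits, ys)
print(manual_loss.item(), ce_loss.item())

# large logits: exp(100) overflows float32, so the manual route yields inf/inf = nan
big_logits = torch.tensor([[100.0, 0.0, 0.0]])
big_target = torch.tensor([0])
big_counts = big_logits.exp()
big_p = big_counts / big_counts.sum(1, keepdim=True)
manual_big = -big_p[0, 0].log()
ce_big = F.cross_entropy(big_logits, big_target)
print(manual_big.item(), ce_big.item())
```

Besides numerical stability, the built-in avoids creating the intermediate counts and p tensors, which makes both the forward and backward pass cheaper.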




E06: meta-exercise! Think of a fun/interesting exercise and complete it.


Useful links for practice:
- Python + Numpy tutorial from CS231n https://cs231n.github.io/python-numpy... . We use torch.tensor instead of numpy.array in this video. Their design (e.g. broadcasting, data types, etc.) is so similar that practicing one is basically practicing the other. Just be careful with some of the APIs (how various functions are named, what arguments they take, etc.), since these details can vary.
- PyTorch tutorial on Tensor https://pytorch.org/tutorials/beginne...
- Another PyTorch intro to Tensor https://pytorch.org/tutorials/beginne...
