Bigram Homework

Exercises:
E01: train a trigram language model, i.e. take two characters as input to predict the third one. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over a bigram model?



In [4]:
import torch
import torch.nn.functional as F 

names = open('names.txt', 'r').read().splitlines()
stoi = {ch:i+1 for i, ch in enumerate(sorted(list(set(''.join(names)))))}
stoi['.'] = 0
itos = {i:ch for ch, i in stoi.items()}

In [24]:
# counting method
# make dataset
class Counting_Trigrams:
    def __init__(self, name_list):
        self.name_list = name_list
        self.N, self.P = self.trigram_counting_p_dist()
        self.loss = self.counting_p_loss()

    def trigram_counting_p_dist(self):
        N = torch.zeros((27,27,27), dtype = torch.int32)
        for name in self.name_list:
            name = ['.', '.'] + list(name) + ['.']
            for i in range(len(name) - 2):
                N[stoi[name[i]], stoi[name[i+1]], stoi[name[i+2]]] += 1
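        P = N.float()
        # Normalize over the last axis so that each (ch1, ch2) context sums to 1.
        # keepdim=True keeps the sum as shape (27, 27, 1) so it broadcasts per context;
        # summing over axis 1, or dropping keepdim, normalizes the wrong entries.
        # Contexts that never occur in the data sum to 0 and become NaN rows.
        P /= P.sum(2, keepdim=True)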
        return N, P

    def counting_p_loss(self):
        log_likelihood = 0
        num_trigrams = 0
        for name in self.name_list:
            name = ['.', '.'] + list(name) + ['.']
            for i in range(len(name) - 2):
                log_likelihood += self.P[stoi[name[i]], stoi[name[i+1]], stoi[name[i+2]]].log()
                num_trigrams += 1
        loss = - log_likelihood/num_trigrams
        return loss

    def get_names(self,num_names):
        names = []
        for i in range(num_names):
            iix = ix = 0
            out = []
            while True:
                p = self.P[iix, ix]       # distribution over the next character given the previous two
                x = torch.multinomial(p, num_samples=1, replacement=True).item()
                out.append(itos[x])
                if x == 0:
                    break
                iix, ix = ix, x
            names.append(''.join(out[:-1]))
        return names
    
    def get_P_bigrams(self, ch1, ch2):
        prob = self.P[stoi[ch1], stoi[ch2]]
        return prob
    
    def get_N_bigrams(self, ch1, ch2):
        count = self.N[stoi[ch1], stoi[ch2]]
        return count

In [25]:
whole_set = Counting_Trigrams(names)
print(whole_set.loss)
some_names = whole_set.get_names(10)
print(some_names)

tensor(2.1857)
['itraci', 'kayan', 'nin', 'ma', 'fya', 'noutchaibanatse', 'palodna', 'mia', 'ni', 'magar']


In [21]:
# looking at the P distributions of some bigram contexts that come up in the names
print(stoi)
prob = whole_set.get_P_bigrams('n','n')
count = whole_set.get_N_bigrams('n','n')
print(prob)
print(count)

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, '.': 0}
tensor([0.3248, 0.3321, 0.0000, 0.0005, 0.0010, 0.1369, 0.0000, 0.0005, 0.0000,
        0.1159, 0.0000, 0.0000, 0.0168, 0.0005, 0.0000, 0.0220, 0.0000, 0.0000,
        0.0000, 0.0037, 0.0010, 0.0010, 0.0000, 0.0000, 0.0000, 0.0425, 0.0005])
tensor([619, 633,   0,   1,   2, 261,   0,   1,   0, 221,   0,   0,  32,   1,
          0,  42,   0,   0,   0,   7,   2,   2,   0,   0,   0,  81,   1],
       dtype=torch.int32)


In [86]:
class NN_Trigrams:
    def __init__(self, name_list):
        self.xs, self.ys = self.NN_dataset(name_list)
        self.W, self.xenc = self.initialize_parameters()
        self.gd_cycles = 0


    def NN_dataset(self, names):
        xs, ys = [], []
        for name in names:
            # append bigram to xs, result to ys
            name = ['.', '.'] + list(name) + ['.']
            for i in range(len(name)-2):
                xs.append((stoi[name[i]], stoi[name[i+1]]))
                ys.append(stoi[name[i+2]])
        xs = torch.LongTensor(xs)           # int64 index pairs, shape (num_trigrams, 2)
        ys = torch.LongTensor(ys)           # int64 targets: the dtype expected for class indices (e.g. by F.cross_entropy)
        return xs, ys
    
    # defining NN parameters
    def initialize_parameters(self):
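        # 54 input rows = 2 context characters x 27 one-hot classes; 27 output logits
        W = torch.randn((54,27), requires_grad=True)
        xenc = F.one_hot(self.xs, num_classes=27).float()    # shape (num_trigrams, 2, 27)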
        return W, xenc
    
    def grad_descent(self, cycles, step_size=0.05, print_losses=True):
        # flatten the (num_trigrams, 2, 27) one-hot tensor to (num_trigrams, 54)
        xenc_reshaped = self.xenc.view(-1,54)
        for i in range(cycles):
            # forward pass
            logits = xenc_reshaped @ self.W
            counts = logits.exp()
            p = counts / counts.sum(1, keepdim=True)
            self.loss = -p[torch.arange(len(self.ys)), self.ys].log().mean()

            # backward pass
            self.W.grad = None
            self.loss.backward()

            # update params
            self.W.data += -step_size * self.W.grad             # gradient descent step; step_size is chosen by the caller
            self.gd_cycles += 1
            if print_losses:
                print('Cycle #{0}, loss = {1}'.format(self.gd_cycles, self.loss))

    def get_names(self, num_names):
        names = []
        for i in range(num_names):
            out = []
            iix = ix = 0
            while True:
                iixenc = F.one_hot(torch.LongTensor([iix]), num_classes = 27).float()
                ixenc = F.one_hot(torch.LongTensor([ix]), num_classes = 27).float()
                xenc = torch.cat((iixenc, ixenc), -1)
                logits = xenc @ self.W
                counts = logits.exp()
                p = counts / counts.sum(1, keepdim=True)
                x = torch.multinomial(p, num_samples= 1, replacement=True).item()
                # print(itos[iix], itos[ix], itos[x], p[0,x].data)
                out.append(itos[x])
                if x == 0:
                    break
                iix, ix = ix, x
            names.append(''.join(out[:-1]))
        return names
    
    def get_bigram_CP(self, ch1, ch2):
        # returns the unnormalized counts and the probabilities of the next character given (ch1, ch2)
        ix = torch.LongTensor([stoi[ch1], stoi[ch2]])
        xenc = F.one_hot(ix, num_classes=27).float()
        xenc_reshaped = xenc.view(-1,54)
        logits = xenc_reshaped @ self.W
        counts = logits.exp()
        p = counts / counts.sum(1, keepdim=True)
        return counts, p


In [88]:
full_list_nn = NN_Trigrams(names)
full_list_nn.grad_descent(1, 0.2)

Cycle #1, loss = 3.971647262573242


In [92]:
full_list_nn.grad_descent(500, 0.25, True)

Cycle #502, loss = 3.0436205863952637
Cycle #503, loss = 3.042783737182617
Cycle #504, loss = 3.0419490337371826
Cycle #505, loss = 3.041116237640381
Cycle #506, loss = 3.040285348892212
Cycle #507, loss = 3.0394561290740967
Cycle #508, loss = 3.0386290550231934
Cycle #509, loss = 3.0378036499023438
Cycle #510, loss = 3.036980390548706
Cycle #511, loss = 3.036159038543701
Cycle #512, loss = 3.035339593887329
Cycle #513, loss = 3.03452205657959
Cycle #514, loss = 3.033705949783325
Cycle #515, loss = 3.0328919887542725
Cycle #516, loss = 3.0320804119110107
Cycle #517, loss = 3.0312700271606445
Cycle #518, loss = 3.030461311340332
Cycle #519, loss = 3.0296549797058105
Cycle #520, loss = 3.028850555419922
Cycle #521, loss = 3.028047561645508
Cycle #522, loss = 3.0272467136383057
Cycle #523, loss = 3.0264477729797363
Cycle #524, loss = 3.0256502628326416
Cycle #525, loss = 3.0248546600341797
Cycle #526, loss = 3.024061441421509
Cycle #527, loss = 3.0232694149017334
Cycle #528, loss = 3.0224

In [93]:
trigram_names_NN = full_list_nn.get_names(10)
print(trigram_names_NN)

['swor', 'mafzhgzie', 'jcnbzoen', 'aon', 'san', 'jjtykem', 'jsza', 'renxiw', 'ajtxnqsellyaucn', 'eh']


In [94]:
lett = sorted(itos.items())
for i in range(3):
    print(lett[i*9:(i+1)*9])
    
x_count, x_prob = full_list_nn.get_bigram_CP('j','j')
print(x_count)
print(x_prob)

[(0, '.'), (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e'), (6, 'f'), (7, 'g'), (8, 'h')]
[(9, 'i'), (10, 'j'), (11, 'k'), (12, 'l'), (13, 'm'), (14, 'n'), (15, 'o'), (16, 'p'), (17, 'q')]
[(18, 'r'), (19, 's'), (20, 't'), (21, 'u'), (22, 'v'), (23, 'w'), (24, 'x'), (25, 'y'), (26, 'z')]
tensor([[0.7665, 1.2685, 0.7532, 0.6770, 0.0859, 0.1488, 0.3046, 2.3514, 1.3209,
         1.5249, 1.8975, 0.3693, 0.1325, 1.2437, 0.5347, 0.6147, 0.0628, 0.2194,
         1.2475, 0.2340, 8.2969, 0.7160, 1.2995, 1.2223, 2.6417, 1.6705, 0.9754]],
       grad_fn=<ExpBackward0>)
tensor([[0.0235, 0.0389, 0.0231, 0.0208, 0.0026, 0.0046, 0.0093, 0.0722, 0.0405,
         0.0468, 0.0582, 0.0113, 0.0041, 0.0382, 0.0164, 0.0189, 0.0019, 0.0067,
         0.0383, 0.0072, 0.2547, 0.0220, 0.0399, 0.0375, 0.0811, 0.0513, 0.0299]],
       grad_fn=<DivBackward0>)


E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?



In [262]:
# split dataset into 80:10:10 - train 2 and 3gram on training set - evaluate on dev and test
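import random
random.seed(42)          # fixed seed (an arbitrary choice) so the split is reproducible
random.shuffle(names)    # note: shuffles the list in place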
num_names = len(names)
training_set = names[:int(0.8*num_names)]
dev_set = names[int(0.8*num_names):int(0.9*num_names)]
test_set = names[int(0.9*num_names):]
print(len(training_set), len(dev_set), len(test_set))

25626 3203 3204
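
The split above can be used as follows. This is only a minimal sketch, not a worked solution: it uses a small hypothetical `toy_names` list in place of names.txt, the helper names `ngram_counts` and `nll` are made up for illustration, and it assumes add-one smoothing so that n-grams unseen in training do not give an infinite loss on dev or test.

```python
import random
import torch

# Hypothetical toy data standing in for names.txt
toy_names = ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'mia', 'amelia',
             'harper', 'evelyn', 'ella', 'emily', 'anna', 'liam', 'noah',
             'ethan', 'mason', 'elena', 'lena', 'mila', 'nola']

chars = sorted(set(''.join(toy_names)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi['.'] = 0
V = len(stoi)  # vocabulary size, including the '.' boundary token

def ngram_counts(name_list, n):
    """Count n-grams (n = 2 or 3) in a tensor of shape (V,) * n."""
    N = torch.zeros((V,) * n, dtype=torch.int32)
    for name in name_list:
        seq = ['.'] * (n - 1) + list(name) + ['.']
        for i in range(len(seq) - n + 1):
            idx = tuple(stoi[c] for c in seq[i:i + n])
            N[idx] += 1
    return N

def nll(N_train, name_list, n, smoothing=1.0):
    """Average negative log-likelihood of name_list under smoothed training counts.

    smoothing must be > 0, otherwise unseen n-grams give log(0).
    """
    P = N_train.float() + smoothing
    P = P / P.sum(-1, keepdim=True)
    N_eval = ngram_counts(name_list, n).float()
    # weighting log-probabilities by evaluation counts equals looping over every n-gram
    return -(N_eval * P.log()).sum().item() / N_eval.sum().item()

# 80/10/10 split with a local, seeded generator (the seed is an arbitrary choice)
rng = random.Random(42)
shuffled = toy_names[:]
rng.shuffle(shuffled)
n1, n2 = int(0.8 * len(shuffled)), int(0.9 * len(shuffled))
train, dev, test = shuffled[:n1], shuffled[n1:n2], shuffled[n2:]

results = {}
for n in (2, 3):
    N = ngram_counts(train, n)  # fit on the training split only
    results[n] = {split_name: nll(N, split, n)
                  for split_name, split in [('train', train), ('dev', dev), ('test', test)]}
    print(f'{n}-gram', results[n])
```

On the real dataset the thing to look at is the gap between the train loss and the dev/test losses for each model; a larger gap for the trigram model would suggest it overfits more, which is plausible because it has 27 times as many table entries to estimate.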


E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
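
One possible way to set up this sweep for the counting model is sketched below. It is an illustration under assumptions, not the result for the real data: `toy_names` is a hypothetical stand-in for names.txt, the helper names and the list of smoothing strengths are arbitrary, and smoothing here means adding a constant k to every trigram count before normalizing.

```python
import random
import torch

# Hypothetical toy data standing in for names.txt
toy_names = ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'mia', 'amelia',
             'harper', 'evelyn', 'ella', 'emily', 'anna', 'liam', 'noah',
             'ethan', 'mason', 'elena', 'lena', 'mila', 'nola']

chars = sorted(set(''.join(toy_names)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi['.'] = 0
V = len(stoi)

def trigram_counts(name_list):
    """Float tensor of trigram counts with shape (V, V, V)."""
    N = torch.zeros((V, V, V))
    for name in name_list:
        seq = ['.', '.'] + list(name) + ['.']
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            N[stoi[a], stoi[b], stoi[c]] += 1
    return N

def avg_nll(P, N_eval):
    """Average negative log-likelihood of the trigrams counted in N_eval under P."""
    return (-(N_eval * P.log()).sum() / N_eval.sum()).item()

# seeded 80/10/10 split (the seed is an arbitrary choice)
rng = random.Random(42)
shuffled = toy_names[:]
rng.shuffle(shuffled)
n1, n2 = int(0.8 * len(shuffled)), int(0.9 * len(shuffled))
N_train, N_dev, N_test = (trigram_counts(s)
                          for s in (shuffled[:n1], shuffled[n1:n2], shuffled[n2:]))

strengths = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]  # candidate values of k
train_losses, dev_losses = [], []
for k in strengths:
    P = (N_train + k) / (N_train + k).sum(2, keepdim=True)
    train_losses.append(avg_nll(P, N_train))
    dev_losses.append(avg_nll(P, N_dev))
    print(f'k={k:<5} train={train_losses[-1]:.4f} dev={dev_losses[-1]:.4f}')

# pick k on the dev set, then touch the test set exactly once
best_k = strengths[dev_losses.index(min(dev_losses))]
P_best = (N_train + best_k) / (N_train + best_k).sum(2, keepdim=True)
test_loss = avg_nll(P_best, N_test)
print(f'best k={best_k}, test loss={test_loss:.4f}')
```

The train loss can only go up as k grows, because k = 0 is the maximum-likelihood fit to the training counts and smoothing pulls every distribution toward uniform. The dev loss typically falls first and then rises again, and the minimum of that curve is the setting to carry over to the test set.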




E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?
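
For the trigram layout used above, where W has shape (54, 27) and the input is two concatenated one-hot vectors, the matrix product just adds one row from the first 27 rows of W and one row from the last 27. The sketch below checks this on a few hypothetical index pairs; the values in `xs` and the seed are arbitrary.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)          # arbitrary seed
W = torch.randn((54, 27), generator=g)

# hypothetical (ch1, ch2) index pairs, same layout as xs in NN_Trigrams
xs = torch.tensor([[0, 0], [0, 5], [5, 13], [13, 13], [13, 1]])

# one-hot version: encode both characters, flatten to 54 columns, multiply
xenc = F.one_hot(xs, num_classes=27).float().view(-1, 54)
logits_onehot = xenc @ W

# indexing version: the first character selects a row in W[0:27],
# the second character selects a row in W[27:54]; the two rows are summed
logits_indexed = W[xs[:, 0]] + W[xs[:, 1] + 27]

print(torch.allclose(logits_onehot, logits_indexed))
```

The indexed form never builds the (num_trigrams, 54) one-hot matrix, so it avoids both the memory for that matrix and a matrix product that is mostly multiplications by zero.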




E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?
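
A small check of the equivalence is sketched below on random logits and targets; the shapes, the seed, and the offset of 100 are arbitrary choices made for illustration. The second half shows one reason to prefer the built-in: exponentiating large logits overflows in float32, while F.cross_entropy works in log space and stays finite.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)              # arbitrary seed
logits = torch.randn((8, 27), generator=g)        # stand-in for xenc @ W
ys = torch.randint(0, 27, (8,), generator=g)      # int64 class indices

# manual softmax followed by negative log-likelihood, as in grad_descent
counts = logits.exp()
p = counts / counts.sum(1, keepdim=True)
manual = -p[torch.arange(len(ys)), ys].log().mean()

# built-in equivalent, taking raw logits and integer targets
builtin = F.cross_entropy(logits, ys)
print(manual.item(), builtin.item())

# shifting every logit by a constant leaves the softmax unchanged in exact arithmetic,
# but exp(100) overflows float32, so the manual version is no longer finite
big = logits + 100.0
big_counts = big.exp()
big_p = big_counts / big_counts.sum(1, keepdim=True)
manual_big = -big_p[torch.arange(len(ys)), ys].log().mean()
builtin_big = F.cross_entropy(big, ys)
print(manual_big.item(), builtin_big.item())
```

Besides numerical stability, the built-in does not need the intermediate `counts` and `p` tensors to be written out by hand, and its targets must be int64, which is why `ys` is created as a long tensor here.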




E06: meta-exercise! Think of a fun/interesting exercise and complete it.


Useful links for practice:
- Python + Numpy tutorial from CS231n https://cs231n.github.io/python-numpy... . We use torch.tensor instead of numpy.array in this video. Their design (e.g. broadcasting, data types, etc.) is so similar that practicing one is basically practicing the other, just be careful with some of the APIs - how various functions are named, what arguments they take, etc. - these details can vary.
- PyTorch tutorial on Tensor https://pytorch.org/tutorials/beginne...
- Another PyTorch intro to Tensor https://pytorch.org/tutorials/beginne...
