# First Neural Net
Let's figure out how to take our data and use it to train a neural net. We'll focus on how to tokenize the data and how it is fed into and out of the network.

In [2]:
import numpy as np
import pandas as pd
import re
import torch
import torch.nn.functional as F

In [3]:
df = pd.read_csv("cleaned_data/clean_2.csv", keep_default_na=False)

## Vocabulary Building
Let's use torchtext (PyTorch's text library) to build up a vocabulary. The `basic_english` tokenizer is probably fine for now, but it's something we should consider changing in the future.

In [4]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

answers_vocab = build_vocab_from_iterator(yield_tokens(df['answer'].values), specials=["<unk>"])
clues_vocab = build_vocab_from_iterator(yield_tokens(df['clue'].values), specials=["<unk>"])

In [5]:
print('Tokens in clues vocab:', len(clues_vocab))
print('Tokens in answers vocab:', len(answers_vocab))

Tokens in clues vocab: 84344
Tokens in answers vocab: 62953
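
To make the vocabulary step less of a black box, here is a minimal pure-Python sketch of the same idea: tokenize every string, count the tokens, and hand out integer indices with the special tokens first. `simple_tokenize` and `build_vocab` are hypothetical stand-ins written for illustration; they are not the torchtext implementations, and the exact tokenization and ordering rules in torchtext may differ.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Hypothetical stand-in for the torchtext tokenizer: lowercase, then
    # keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(texts, specials=("<unk>",)):
    # Count every token, then assign indices: specials first, then tokens
    # by descending frequency (ties broken alphabetically).
    counts = Counter(tok for text in texts for tok in simple_tokenize(text))
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

clues = ["capital of canada", "capital of france", "nap"]
stoi, itos = build_vocab(clues)
print(itos)  # ['<unk>', 'capital', 'of', 'canada', 'france', 'nap']
print([stoi[tok] for tok in simple_tokenize("capital of canada")])  # [1, 2, 3]
```

Frequent tokens end up with small indices, which is consistent with `of` mapping to a low index in the real clue vocabulary.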


Add some helpers to make conversion from strings to tokens easy:

In [6]:
clue_pipeline = lambda x: clues_vocab(tokenizer(x))

In [7]:
clue_pipeline('capital of canada')

[112, 4, 3052]

In [8]:
answers_vocab.get_stoi()['ottawa']

4411

## Data Setup
Let's create tensors for our inputs and labels.

### Input and Output
We want to input the clue and output the potential answer. Clues are encoded as a list of ints where each int is a word in the clue vocabulary; answers are a single int which represents a word in the answer vocabulary.

We can directly input the encoded clue, but clues have different lengths and we need a consistent shape for our inputs. We can solve this by padding inputs to a fixed length.

First, we must add a padding token to our vocabulary. Second, we need to figure out the length of the input, which we can base on the longest clue in the training set (note: this will fail if any clues in the test set are longer).

In [9]:
answers_vocab = build_vocab_from_iterator(yield_tokens(df['answer'].values), specials=["<pad>"])
clues_vocab = build_vocab_from_iterator(yield_tokens(df['clue'].values), specials=["<pad>"])

In [10]:
print(answers_vocab.get_itos()[0], clues_vocab.get_itos()[0])

<pad> <pad>


In [11]:
PADDING_TOKEN_INDEX = 0
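
One thing to keep in mind: the rebuilt vocabularies only carry `<pad>` as a special token. That works here because they are built from the whole dataframe, so every token in every split is known, but a brand-new clue containing an unseen word would have no index to map to. A minimal sketch of the usual fallback, using a plain dict and a hypothetical `encode` helper (not part of the notebook), looks like this:

```python
def encode(tokens, stoi, unk_token="<unk>"):
    # Map each token to its index, falling back to the <unk> index for
    # tokens that were never seen when the vocabulary was built.
    unk_index = stoi[unk_token]
    return [stoi.get(tok, unk_index) for tok in tokens]

# Toy vocabulary with both special tokens; index 0 stays reserved for padding.
toy_stoi = {"<pad>": 0, "<unk>": 1, "capital": 2, "of": 3, "canada": 4}
print(encode(["capital", "of", "narnia"], toy_stoi))  # [2, 3, 1]
```

With torchtext, the analogous change would probably be to pass both `"<pad>"` and `"<unk>"` in `specials` and register the unknown index as the default; check the `Vocab` documentation for your version before relying on that.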

### Data Splits
Let's create train, dev, and test splits of the data. We'll use an 80-10-10 breakdown.

In [12]:
train, dev, test = np.split(df.sample(frac=1, random_state=42), [int(.8*len(df)), int(.9*len(df))])

# Print the size of each subset
print(f'Train set size: {len(train)}')
print(f'Dev set size: {len(dev)}')
print(f'Test set size: {len(test)}')

Train set size: 616288
Dev set size: 77036
Test set size: 77037
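
The split sizes follow directly from the two cut points passed to `np.split`. As a quick arithmetic check, the sketch below recomputes them from the total row count (770361, the sum of the three sizes printed above); `split_sizes` is a hypothetical helper, not something from the notebook.

```python
def split_sizes(n, fractions=(0.8, 0.9)):
    # Cut points are truncated to ints, mirroring int(.8*len(df)) and
    # int(.9*len(df)) in the cell above.
    first, second = (int(f * n) for f in fractions)
    return first, second - first, n - second

print(split_sizes(770361))  # (616288, 77036, 77037)
```

Because both cut points are truncated, the leftover row lands in the test set, which is why it is one row larger than the dev set.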


In [13]:
train.sort_index()

Unnamed: 0,answer,clue
0,pat,"action done while saying ""good dog"""
1,rascals,mischief-makers
2,pen,it might click for a writer
4,eco,kind to mother nature
6,wage,living ___
...,...,...
770352,ars,"""___ magna"" (anagrams, appropriately)"
770353,doze,nap
770356,nat,actor pendleton
770358,nea,teachers' org.


Let's figure out what the longest clue is in each split.

In [14]:
train['clue'].apply(lambda x: len(clue_pipeline(x))).max()

42

In [15]:
dev['clue'].apply(lambda x: len(clue_pipeline(x))).max()

30

In [16]:
test['clue'].apply(lambda x: len(clue_pipeline(x))).max()

33

Let's add some buffer to reduce chances that we will see a longer clue in the future.

In [18]:
PAD_TO_SIZE = 45
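
The buffer only reduces the risk. A clue longer than `PAD_TO_SIZE` would still produce a row of the wrong length, and building a tensor from rows of unequal length fails. One common way to make this robust is to clip before padding. The helper below is a hypothetical sketch of that idea, not the approach the notebook uses:

```python
def pad_or_truncate(indices, size, pad_index=0):
    # Clip sequences that are too long, then right-pad short ones so every
    # row has exactly `size` entries.
    clipped = indices[:size]
    return clipped + [pad_index] * (size - len(clipped))

print(pad_or_truncate([112, 4, 3052], 5))      # [112, 4, 3052, 0, 0]
print(pad_or_truncate(list(range(1, 8)), 5))   # [1, 2, 3, 4, 5]
```

Truncation throws away the tail of a long clue, so it is a trade-off rather than a free fix.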

### Tensor Building
Let's build X and Y tensors for our splits.

In [19]:
def build_dataset(split):
    splits = {
        'train': train,
        'dev': dev,
        'test': test
    }
    df = splits[split]
    answers_stoi = answers_vocab.get_stoi()
    
    X = []
    Y = []

    for clue in df['clue'].values:
        # encode the clue, then right-pad it to the fixed input length
        indices = clue_pipeline(clue)
        indices += [PADDING_TOKEN_INDEX] * (PAD_TO_SIZE - len(indices))
        X.append(indices)

    for answer in df['answer'].values:
        answer_index = answers_stoi[answer]
        Y.append(answer_index)
        
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    return (X, Y)


Xdev, Ydev = build_dataset('dev')
Xtest, Ytest = build_dataset('test')
Xtr, Ytr = build_dataset('train')

In [20]:
print(Xtr.shape, Ytr.shape)
print(Xdev.shape, Ydev.shape)
print(Xtest.shape, Ytest.shape)

torch.Size([616288, 45]) torch.Size([616288])
torch.Size([77036, 45]) torch.Size([77036])
torch.Size([77037, 45]) torch.Size([77037])


## Neural Network
Let's build a simple network based on the MLP discussed in [this lecture](https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part2_mlp.ipynb).

We will feed the clues into an embedding layer, which provides the inputs to a layer of hidden neurons. The output layer then gives the likelihood of each item in the answers vocabulary being correct.
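
To picture what the embedding layer hands to the hidden layer, here is a toy pure-Python sketch for a single padded clue: look up one vector per token, then lay the vectors end to end. The table values and the `embed_and_flatten` helper are made up for illustration; in the real network the table is a randomly initialized tensor, including the row for `<pad>`.

```python
def embed_and_flatten(token_ids, table):
    # Look up one embedding row per token, then join the rows end to end,
    # which is what indexing followed by a reshape does for a single clue.
    flat = []
    for token_id in token_ids:
        flat.extend(table[token_id])
    return flat

# Toy table: 4 vocabulary entries, 2 dimensions each (row 0 is <pad>).
table = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
padded_clue = [2, 3, 0]
print(embed_and_flatten(padded_clue, table))  # [0.3, 0.4, 0.5, 0.6, 0.0, 0.0]
```

The flattened length is always (padded length) x (embedding size), which is why the first weight matrix needs `PAD_TO_SIZE * EMBEDDING_DIMENSIONS` rows.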

In [21]:
# hyperparameters
EMBEDDING_DIMENSIONS = 25
HIDDEN_NEURON_COUNT = 200
BATCH_SIZE = 64

# pass the seeded generator to every init so runs are reproducible
g = torch.Generator().manual_seed(42)
C = torch.randn((len(clues_vocab), EMBEDDING_DIMENSIONS), generator=g)
W1 = torch.randn((PAD_TO_SIZE * EMBEDDING_DIMENSIONS, HIDDEN_NEURON_COUNT), generator=g)
b1 = torch.randn(HIDDEN_NEURON_COUNT, generator=g)
W2 = torch.randn((HIDDEN_NEURON_COUNT, len(answers_vocab)), generator=g)
b2 = torch.randn(len(answers_vocab), generator=g)
parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

In [22]:
sum(p.nelement() for p in parameters)

14987353
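
It helps to see where those parameters live. The sketch below recomputes the total from the vocabulary sizes and hyperparameters above; `mlp_parameter_count` is a hypothetical helper that just does the arithmetic.

```python
def mlp_parameter_count(clue_vocab, answer_vocab, pad_to, emb_dim, hidden):
    # One entry per tensor in the `parameters` list.
    return {
        "C": clue_vocab * emb_dim,
        "W1": pad_to * emb_dim * hidden,
        "b1": hidden,
        "W2": hidden * answer_vocab,
        "b2": answer_vocab,
    }

counts = mlp_parameter_count(84344, 62953, pad_to=45, emb_dim=25, hidden=200)
for name, count in counts.items():
    print(f"{name}: {count:,}")
print("total:", sum(counts.values()))  # total: 14987353
```

The output layer `W2` accounts for about 12.6 million of the roughly 15 million parameters, so the size of the answers vocabulary dominates the model size.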

Let's construct the forward pass and make sure we understand what's happening.

In [23]:
for i in range(1):
    # minibatch
    ix = torch.randint(0, Xtr.shape[0], (BATCH_SIZE,))
    
    # forward pass
    
    # do lookup
    emb = C[Xtr[ix]] # (BATCH_SIZE, PAD_TO_SIZE, EMBEDDING_DIMENSIONS)
    print(emb.shape)
    print(f'{BATCH_SIZE},{PAD_TO_SIZE}, {EMBEDDING_DIMENSIONS}')
    
    # concat embeddings together
    concat = emb.view(-1, PAD_TO_SIZE * EMBEDDING_DIMENSIONS) # (BATCH_SIZE, PAD_TO_SIZE * EMBEDDING_DIMENSIONS)
    print(concat.shape)
    print(f'{BATCH_SIZE},{PAD_TO_SIZE * EMBEDDING_DIMENSIONS}')
    
    # hidden layer
    h = torch.tanh(concat @ W1 + b1) # (BATCH_SIZE, HIDDEN_NEURON_COUNT)
    print(h.shape)
    print(f'{BATCH_SIZE},{HIDDEN_NEURON_COUNT}')
    
    # output layer
    logits = h @ W2 + b2 # (BATCH_SIZE, len(answers_vocab))
    print(logits.shape)
    print(f'{BATCH_SIZE},{len(answers_vocab)}')
    
    loss = F.cross_entropy(logits, Ytr[ix])
    print(loss.item())

torch.Size([64, 45, 25])
64,45, 25
torch.Size([64, 1125])
64,1125
torch.Size([64, 200])
64,200
torch.Size([64, 62953])
64,62953
59.47833251953125
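
For a sense of scale on that loss value: a model that spreads its probability evenly over all 62953 answers would have a cross-entropy equal to the natural log of the number of classes. A small sketch of that baseline:

```python
import math

def uniform_cross_entropy(num_classes):
    # Cross-entropy of a prediction that gives every class probability 1/N:
    # -log(1/N) = log(N).
    return math.log(num_classes)

baseline = uniform_cross_entropy(62953)
print(round(baseline, 2))  # 11.05
```

A starting loss near 59 is far above this baseline, which suggests the randomly initialized network is confidently wrong rather than merely uninformed. That is likely a side effect of the unscaled `torch.randn` initialization producing very large logits.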


Looks good! Loss is high, but we'll fix that 😅. Let's start with a simple training cycle to see if what we have works.

In [24]:
%%time
LEARNING_RATE = 0.1

for i in range(1000):
    # minibatch
    ix = torch.randint(0, Xtr.shape[0], (BATCH_SIZE,))
    
    # forward pass
    emb = C[Xtr[ix]] # (BATCH_SIZE, PAD_TO_SIZE, EMBEDDING_DIMENSIONS)
    concat = emb.view(-1, PAD_TO_SIZE * EMBEDDING_DIMENSIONS) # (BATCH_SIZE, PAD_TO_SIZE * EMBEDDING_DIMENSIONS)
    h = torch.tanh(concat @ W1 + b1) # (BATCH_SIZE, HIDDEN_NEURON_COUNT)
    logits = h @ W2 + b2 # (BATCH_SIZE, len(answers_vocab))
    loss = F.cross_entropy(logits, Ytr[ix])
    if i == 0:
        print('Initial loss:', loss.item())
    
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    
    # update
    for p in parameters:
        p.data += -LEARNING_RATE * p.grad

print('Final loss:', loss.item())

Initial loss: 59.89610290527344
Final loss: 31.12291145324707
CPU times: user 1min 27s, sys: 12.7 s, total: 1min 40s
Wall time: 25.6 s


Seems to be working! Now let's figure out how to optimize our hyperparameters. The first step is to evaluate our result against the dev set.

In [25]:
%%time
# evaluate on the full dev set in one pass; no gradients are needed here
with torch.no_grad():
    emb = C[Xdev] # (len(Xdev), PAD_TO_SIZE, EMBEDDING_DIMENSIONS)
    concat = emb.view(-1, PAD_TO_SIZE * EMBEDDING_DIMENSIONS) # (len(Xdev), PAD_TO_SIZE * EMBEDDING_DIMENSIONS)
    h = torch.tanh(concat @ W1 + b1) # (len(Xdev), HIDDEN_NEURON_COUNT)
    logits = h @ W2 + b2 # (len(Xdev), len(answers_vocab))
    loss = F.cross_entropy(logits, Ydev)

print('Dev loss:', loss.item())

Dev loss: 29.713613510131836
CPU times: user 20.2 s, sys: 9.38 s, total: 29.5 s
Wall time: 6.39 s
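
Evaluating the whole dev set in one pass means materializing a logits tensor with one row per dev example and one column per answer. The sketch below estimates its size, assuming 4-byte float32 values (PyTorch's default dtype), and shows one way to walk the dev set in chunks instead. Both helpers and the chunk size of 4096 are hypothetical choices for illustration.

```python
def logits_bytes(rows, classes, bytes_per_value=4):
    # Memory needed to hold a dense (rows, classes) tensor of logits.
    return rows * classes * bytes_per_value

def chunk_ranges(n_rows, chunk_size):
    # Yield (start, end) index pairs that cover n_rows without overlap.
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

print(logits_bytes(77036, 62953) / 1e9)  # 19.398589232 (GB)
ranges = list(chunk_ranges(77036, 4096))
print(len(ranges), ranges[-1])           # 19 (73728, 77036)
```

If you evaluate chunk by chunk, weight each chunk's mean loss by its row count before averaging, since the last chunk is smaller than the others.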


We have a solid setup for tuning our hyperparameters. Let's learn how to do that in the next notebook.