### MLP

The earlier way of using the context of the previously happening single character to predict the next one was good but we weren't producing very name like sounding things.

Problem with this approach is
The table we saw in the previous file (tells deals with predicting the next character in a sequence in context to the previous one) quickly blows up and its size grows exponentially with the length of context.

This is because if we take a single character at a time, we will have 27 different possibilities of context that is 27 different rows. But if we were to consider the previous context of 2 characters to predict the next then we will have 27*27= 729 rows/possibilities of context. and it goes on

We will have too many rows and way too many counts for each possibility.

In the research paper [A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf),

it deals with the word level language modelling and it does have a vocabulary of 17000 possible words and each word is assosciated as a 30 dimensional feature vector. Literaly 17000 points or vectors in a 30 dimensional vector space.

Initially these words are initialised randomly, spreaded out randomly. Then we tune the embeddings of these words using **backpropogation**. 

So during the course of training of these neural networks go around in the space and words with almost similar meanings or are synonyms of each other end up in a similar part of space and conversely words with different meanings go away from each other.


Similar to our previous work, in the research paper they,
- try to predict the next word given the previous ones
- to train the neural network, they maximize the log likelihood of the training data
So the modelling approach is all identical

In the paper a example is sited,

The phrase **A dog was running in a room**
is likewise to **The cat is running in a room**

Mayb the network may have realised that a and the are frequently interchangeable and therefore the network may have put the embeddings of 'a' and 'the' near in the space and therefore the similarity is drawn. Therefore we can transfer knowledge through that embedding and we can generalise in that way. In the same way they may have found similarity in the 'dog' and 'cat' too therefore a mapping happened there too causing a knowledge transfer and a generalisation to novel scenarios.

![](img/neural_architecture.png)

In this example, we take 3 previous words to predict the upcoming 4th word in a sequence. These 3 are the index of the incoming word and because there are 17k words these indices are integers between 0 and 16999. 

Along with this we have a lookup table C which is a matrix 17k x 30.

This is considered as a lookup table and every index is actually plucking a row out of this embedding matrix so that each index is converted to a 30 dimensional vector that corresponds to the embedding vector for that word.

We have the input layer of 30 neurons for 3 words making up 90 neurons and C is being shared across all the words so we are indexing to the same matrix again and again for each one of these words.

Size of the hidden layer of this neural network is a _hyper parameter_ and this can be as large or small it can be. We will be working out multiple choices of sizes/different lengths of these hidden layers and then evaluate how well they do.

Lets say we have 100 neurons out of which all of them will be fully connected to the 90 words or 90 numbers that make up these three words so this is a fully connected layer and then there is a tan h long linearity.

In case of the output layer, there are 17k possible words that could appear next and all of these are fully connected to tall the neurons in the hidden layer.  Since there are a lot of words, there are lot of parameters here and therefore a lot of computation here.

Therefore we can also say there are ```17k logits``` here and at the top we have the softmax layer which exponentiates these logits, normalised to sum of 1 to get the best probability distribution for next word in sequence. In training we do have the label/ identity of the next word in the sequence, that index is used to pluck out the probability of that word and they are maximised  the probability of that word with respect to the parameters of the neural network. 

These parameters are nothing but the ```weights``` and ```biases``` of the output layer and of the hidden kayer and the embedding lookup table of C (matrix). All of these are optimised using the back propogation

In [1]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline

In [2]:
# read in all words
words = open('../../names.txt', 'r').read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [3]:
len(words)

32033

In [15]:
# build the vocabulary
characters = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(characters)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
itos

{1: 'a',
 2: 'b',
 3: 'c',
 4: 'd',
 5: 'e',
 6: 'f',
 7: 'g',
 8: 'h',
 9: 'i',
 10: 'j',
 11: 'k',
 12: 'l',
 13: 'm',
 14: 'n',
 15: 'o',
 16: 'p',
 17: 'q',
 18: 'r',
 19: 's',
 20: 't',
 21: 'u',
 22: 'v',
 23: 'w',
 24: 'x',
 25: 'y',
 26: 'z',
 0: '.'}

In [16]:
# build the dataset
block_size = 3 # context length we use to predict the next character
X,Y = [],[] # X= input to the neural net, Y=label
for word in words[:5]:
    print(word)
    context = [0] * block_size # padded context of 0 tokens
    
    for character in word + '.':
        index = stoi[character] 
        X.append(context) # current running context
        Y.append(index)
        print(''.join(itos[i] for i in context), '-------->', itos[index])
        context =  context[1:] + [index] # crop the context and append the new character

X = torch.tensor(X)
Y = torch.tensor(Y)

emma
... --------> e
..e --------> m
.em --------> m
emm --------> a
mma --------> .
olivia
... --------> o
..o --------> l
.ol --------> i
oli --------> v
liv --------> i
ivi --------> a
via --------> .
ava
... --------> a
..a --------> v
.av --------> a
ava --------> .
isabella
... --------> i
..i --------> s
.is --------> a
isa --------> b
sab --------> e
abe --------> l
bel --------> l
ell --------> a
lla --------> .
sophia
... --------> s
..s --------> o
.so --------> p
sop --------> h
oph --------> i
phi --------> a
hia --------> .


In [17]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

In [18]:
X

tensor([[ 0,  0,  0],
        [ 0,  0,  5],
        [ 0,  5, 13],
        [ 5, 13, 13],
        [13, 13,  1],
        [ 0,  0,  0],
        [ 0,  0, 15],
        [ 0, 15, 12],
        [15, 12,  9],
        [12,  9, 22],
        [ 9, 22,  9],
        [22,  9,  1],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1, 22],
        [ 1, 22,  1],
        [ 0,  0,  0],
        [ 0,  0,  9],
        [ 0,  9, 19],
        [ 9, 19,  1],
        [19,  1,  2],
        [ 1,  2,  5],
        [ 2,  5, 12],
        [ 5, 12, 12],
        [12, 12,  1],
        [ 0,  0,  0],
        [ 0,  0, 19],
        [ 0, 19, 15],
        [19, 15, 16],
        [15, 16,  8],
        [16,  8,  9],
        [ 8,  9,  1]])

In [19]:
Y

tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0,  9, 19,
         1,  2,  5, 12, 12,  1,  0, 19, 15, 16,  8,  9,  1,  0])

In [20]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

In [21]:
C = torch.randn((27,2))

## Create a neural network It will take the input X and predict Y. 

#### 1.  Embedding lookup table C 
    We have 27 different characters and we will be embedding them in a lower level space. In the paper, they have a total of 17000 words and embed them in small dimensions as 30 (Cramping up 17k words in 30 dimensional space). 
    
    In our case similarly lets cramp them up to a 2 dimensional space.

In [97]:
#  Embedding a single integer to X C[5] 
C[5]# by taking the random number generated in C

tensor([-0.3847,  0.7616], grad_fn=<SelectBackward0>)

In [None]:
F.one_hot(torch.tensor(5), num_classes = 27) #2nd way of doing this. 
# embedding is done here in such a way that we get all 0's and just at 5 we have the value turned on as 1

tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0])

In [98]:
F.one_hot(torch.tensor(5), num_classes = 27) @ C

RuntimeError: expected m1 and m2 to have the same dtype, but got: long int != float

In [None]:
C.dtype, F.one_hot(torch.tensor(5), num_classes = 27).dtype

(torch.float32, torch.int64)

A multiplication between a 64 bit integer and a float won't happen.
@ can't do that
There we need to convert the one hot encoded one into a float first.

In [None]:
F.one_hot(torch.tensor(5), num_classes = 27).float() @ C
# we end up masking in all the rows and finally put out that 5th row.

tensor([1.0916, 0.2925])

When we look at it carefully we will find out C[5] is generated here. This is because we stripped out the 5 th row masking out the rest.

So one way to look at it is,
- an integer indexing into a lookup table C.
OR
- first layer to the bigger neural net. A layer which have linear neurons that have no nonlinearlity with the tan h and its weight matrix is C
And we are encoding those integers using one hot and feeding them into the neural net


So embedding a single integer is easy because we can easily retrieve them.
What to do if we have to embed all the 32 integers in X simultaneously?

We can index C with a single integer, a list of integers that are indices, a tensor of the integers or even a multi dimensional tensor

In [None]:
C[X], C[X].shape

(tensor([[[ 1.1696,  1.4134],
          [ 1.1696,  1.4134],
          [ 1.1696,  1.4134]],
 
         [[ 1.1696,  1.4134],
          [ 1.1696,  1.4134],
          [ 1.0916,  0.2925]],
 
         [[ 1.1696,  1.4134],
          [ 1.0916,  0.2925],
          [-0.6577, -1.7706]],
 
         [[ 1.0916,  0.2925],
          [-0.6577, -1.7706],
          [-0.6577, -1.7706]],
 
         [[-0.6577, -1.7706],
          [-0.6577, -1.7706],
          [-0.1823, -0.1603]],
 
         [[ 1.1696,  1.4134],
          [ 1.1696,  1.4134],
          [ 1.1696,  1.4134]],
 
         [[ 1.1696,  1.4134],
          [ 1.1696,  1.4134],
          [ 1.0128,  0.3325]],
 
         [[ 1.1696,  1.4134],
          [ 1.0128,  0.3325],
          [-1.5208,  0.8414]],
 
         [[ 1.0128,  0.3325],
          [-1.5208,  0.8414],
          [-0.3424,  0.1001]],
 
         [[-1.5208,  0.8414],
          [-0.3424,  0.1001],
          [-1.1273,  0.5526]],
 
         [[-0.3424,  0.1001],
          [-1.1273,  0.5526],
          

So we understand we have retained the 32 x 3 original shape of X and along with that we have kept the retrieved the embedding vector for every 32x3 values

In [None]:
X[13,2]

tensor(1)

In [None]:
C[X][13,2], C[1] # Both are same

(tensor([-0.1823, -0.1603]), tensor([-0.1823, -0.1603]))

So with the help of pytorch's amazing way of embedding, we can embed all the values simultaneously

In [None]:
embedding = C[X]
embedding.shape

torch.Size([32, 3, 2])

#### 2. Construction of the hidden layer

- Inputs to this layer is 3 times 2 because we have  3 units of 2 dimensional embeddings. 
- The number of neurons in the tan h layer is variable and lets assume it to be 100

In [None]:
W1 = torch.randn((6, 100))
b1 = torch.randn(100)

The normal way to construct this is to do <br>
```python
embedding @ W1 +b1
```

But this is not possible since embedding is 32 x 3 x 2. This is because these embeddings are stacked up in the dimensions of input tensor.

In [None]:
embedding @ W1 +b1

RuntimeError: mat1 and mat2 shapes cannot be multiplied (96x2 and 6x100)

In [None]:
# To solve this above error, we need to concatinate the all the 3 units (C)
# 32 X 3 X 2 TO 32 X 6
# The cat function in Torch gives the sequence of tensors in the given dimension

embedding[:, 0, :] # 32 x 2 embeddings of the 1st word (C  unit)
embedding[:, 1, :] # 32 x 2 embeddings of the 2nd word (C  unit)
embedding[:, 2, :] # 32 x 2 embeddings of the 3rd word (C  unit)

torch.cat([embedding[:, 0, :], embedding[:, 1, :], embedding[:, 2, :]],1)


tensor([[ 1.1696,  1.4134,  1.1696,  1.4134,  1.1696,  1.4134],
        [ 1.1696,  1.4134,  1.1696,  1.4134,  1.0916,  0.2925],
        [ 1.1696,  1.4134,  1.0916,  0.2925, -0.6577, -1.7706],
        [ 1.0916,  0.2925, -0.6577, -1.7706, -0.6577, -1.7706],
        [-0.6577, -1.7706, -0.6577, -1.7706, -0.1823, -0.1603],
        [ 1.1696,  1.4134,  1.1696,  1.4134,  1.1696,  1.4134],
        [ 1.1696,  1.4134,  1.1696,  1.4134,  1.0128,  0.3325],
        [ 1.1696,  1.4134,  1.0128,  0.3325, -1.5208,  0.8414],
        [ 1.0128,  0.3325, -1.5208,  0.8414, -0.3424,  0.1001],
        [-1.5208,  0.8414, -0.3424,  0.1001, -1.1273,  0.5526],
        [-0.3424,  0.1001, -1.1273,  0.5526, -0.3424,  0.1001],
        [-1.1273,  0.5526, -0.3424,  0.1001, -0.1823, -0.1603],
        [ 1.1696,  1.4134,  1.1696,  1.4134,  1.1696,  1.4134],
        [ 1.1696,  1.4134,  1.1696,  1.4134, -0.1823, -0.1603],
        [ 1.1696,  1.4134, -0.1823, -0.1603, -1.1273,  0.5526],
        [-0.1823, -0.1603, -1.1273,  0.5

In [None]:
torch.cat([embedding[:, 0, :], embedding[:, 1, :], embedding[:, 2, :]],1).shape
# took 32 nd squashed the the other 2 dimensions

torch.Size([32, 6])

Here we are indexing directly, but if the block size increases we need to change the code completely since I have indexed here directly. This is not what we need so we need to handle that too.

For this, we can make use of ```torch.unbind``` that removes a tensor dimension and returns a tuple of all sizes along a given dimension already without it

In [None]:
torch.unbind(embedding, 1) 
# this becomes equivalent to [embedding[:, 0, :], embedding[:, 1, :], embedding[:, 2, :]] 

(tensor([[ 1.1696,  1.4134],
         [ 1.1696,  1.4134],
         [ 1.1696,  1.4134],
         [ 1.0916,  0.2925],
         [-0.6577, -1.7706],
         [ 1.1696,  1.4134],
         [ 1.1696,  1.4134],
         [ 1.1696,  1.4134],
         [ 1.0128,  0.3325],
         [-1.5208,  0.8414],
         [-0.3424,  0.1001],
         [-1.1273,  0.5526],
         [ 1.1696,  1.4134],
         [ 1.1696,  1.4134],
         [ 1.1696,  1.4134],
         [-0.1823, -0.1603],
         [ 1.1696,  1.4134],
         [ 1.1696,  1.4134],
         [ 1.1696,  1.4134],
         [-0.3424,  0.1001],
         [-0.5990,  0.5137],
         [-0.1823, -0.1603],
         [-1.3208,  0.4055],
         [ 1.0916,  0.2925],
         [-1.5208,  0.8414],
         [ 1.1696,  1.4134],
         [ 1.1696,  1.4134],
         [ 1.1696,  1.4134],
         [-0.5990,  0.5137],
         [ 1.0128,  0.3325],
         [-1.2333,  1.2925],
         [-0.0336,  1.8229]]),
 tensor([[ 1.1696,  1.4134],
         [ 1.1696,  1.4134],
         [ 1

In [None]:
torch.cat(torch.unbind(embedding,1),1).shape # it doesn't matter what the size of the block is

# But this is a highly inefficient way to do this as it will create a lot of memory because its not manipulating the view but creating a whole new set of memory

torch.Size([32, 6])

In [None]:
# better way to do this
a = torch.arange(18)
a.view(2,9) # we manipulate the shapes of the tensor 
# The only condition is 
a.view(3,3,2) 

tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])

By creating a tensor we will have a single storage with a one dimensional sequence of 18 values

In [None]:
a.storage()

  a.storage()


 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 11
 12
 13
 14
 15
 16
 17
[torch.storage.TypedStorage(dtype=torch.int64, device=cpu) of size 18]

But .view() is a very efficient way of manipulating this storage. It manipulates this one dimensional into a n dimensional tensor. No memory is changed, copied or moved. The storage is identical. Some of the attributes related to view of the tensor are manipulated. The attributes like the storage offset, strides and shapes and they are manipulated so that this one dimensional space is seen as different n dimensional arrays.

In [None]:
embedding.shape

torch.Size([32, 3, 2])

In [None]:
embedding.view(32,6) # the 3 and 2 gets stacked up in a single row. similar to the concat and provides the exact similar output

tensor([[ 1.1696,  1.4134,  1.1696,  1.4134,  1.1696,  1.4134],
        [ 1.1696,  1.4134,  1.1696,  1.4134,  1.0916,  0.2925],
        [ 1.1696,  1.4134,  1.0916,  0.2925, -0.6577, -1.7706],
        [ 1.0916,  0.2925, -0.6577, -1.7706, -0.6577, -1.7706],
        [-0.6577, -1.7706, -0.6577, -1.7706, -0.1823, -0.1603],
        [ 1.1696,  1.4134,  1.1696,  1.4134,  1.1696,  1.4134],
        [ 1.1696,  1.4134,  1.1696,  1.4134,  1.0128,  0.3325],
        [ 1.1696,  1.4134,  1.0128,  0.3325, -1.5208,  0.8414],
        [ 1.0128,  0.3325, -1.5208,  0.8414, -0.3424,  0.1001],
        [-1.5208,  0.8414, -0.3424,  0.1001, -1.1273,  0.5526],
        [-0.3424,  0.1001, -1.1273,  0.5526, -0.3424,  0.1001],
        [-1.1273,  0.5526, -0.3424,  0.1001, -0.1823, -0.1603],
        [ 1.1696,  1.4134,  1.1696,  1.4134,  1.1696,  1.4134],
        [ 1.1696,  1.4134,  1.1696,  1.4134, -0.1823, -0.1603],
        [ 1.1696,  1.4134, -0.1823, -0.1603, -1.1273,  0.5526],
        [-0.1823, -0.1603, -1.1273,  0.5

In [None]:
# finally we can construct the hidden layer
hidden_layer = embedding.view(32,6) @ W1 + b1
hidden_layer # all the hidden states


tensor([[-1.2901,  0.0411,  0.3131,  ...,  0.3433, -2.8804, -0.3924],
        [-2.5487,  0.3683,  0.5410,  ...,  0.0306, -3.0687, -0.5475],
        [-1.4021,  2.0430, -3.5930,  ..., -0.6332, -4.6154, -2.9854],
        ...,
        [ 0.3353, -3.6396, -2.9627,  ...,  1.4435,  0.7158,  1.1763],
        [ 0.0293, -2.9802,  2.6207,  ...,  2.2869, -0.4369,  2.1615],
        [-0.8324, -2.6727, -0.9177,  ...,  0.2334, -5.3951, -0.5054]])

In [None]:
hidden_layer.shape # the 100 dimension activations for every one of the 32 examples

torch.Size([32, 100])

In [None]:
# for generalising

hidden_layer = embedding.view(embedding.shape[0], 6) @ W1 + b1
hidden_layer

tensor([[-1.2901,  0.0411,  0.3131,  ...,  0.3433, -2.8804, -0.3924],
        [-2.5487,  0.3683,  0.5410,  ...,  0.0306, -3.0687, -0.5475],
        [-1.4021,  2.0430, -3.5930,  ..., -0.6332, -4.6154, -2.9854],
        ...,
        [ 0.3353, -3.6396, -2.9627,  ...,  1.4435,  0.7158,  1.1763],
        [ 0.0293, -2.9802,  2.6207,  ...,  2.2869, -0.4369,  2.1615],
        [-0.8324, -2.6727, -0.9177,  ...,  0.2334, -5.3951, -0.5054]])

In [None]:
# OR

hidden_layer = embedding.view(-1, 6) @ W1 + b1
hidden_layer

tensor([[-1.2901,  0.0411,  0.3131,  ...,  0.3433, -2.8804, -0.3924],
        [-2.5487,  0.3683,  0.5410,  ...,  0.0306, -3.0687, -0.5475],
        [-1.4021,  2.0430, -3.5930,  ..., -0.6332, -4.6154, -2.9854],
        ...,
        [ 0.3353, -3.6396, -2.9627,  ...,  1.4435,  0.7158,  1.1763],
        [ 0.0293, -2.9802,  2.6207,  ...,  2.2869, -0.4369,  2.1615],
        [-0.8324, -2.6727, -0.9177,  ...,  0.2334, -5.3951, -0.5054]])

In [None]:
hidden_layer = torch.tanh(embedding.view(-1, 6) @ W1 + b1)
hidden_layer

tensor([[-0.8592,  0.0411,  0.3033,  ...,  0.3304, -0.9937, -0.3734],
        [-0.9878,  0.3525,  0.4937,  ...,  0.0306, -0.9957, -0.4987],
        [-0.8858,  0.9669, -0.9985,  ..., -0.5602, -0.9998, -0.9949],
        ...,
        [ 0.3233, -0.9986, -0.9947,  ...,  0.8944,  0.6143,  0.8263],
        [ 0.0292, -0.9949,  0.9895,  ...,  0.9796, -0.4111,  0.9738],
        [-0.6818, -0.9905, -0.7248,  ...,  0.2292, -1.0000, -0.4663]])

Now all the values are between -1 and 1

In [None]:
hidden_layer.shape

torch.Size([32, 100])

In [None]:
hidden_layer = (embedding.view(-1, 6) @ W1)
hidden_layer.shape, b1.shape

(torch.Size([32, 100]), torch.Size([100]))

In [None]:
# we will be having broadcasting done here
# 32 X 100 is broadcasted to 100
#      100
# boradcast will align to the right and then enter in a fake dimension in the first place.
# 32 X 100 
#  1 X 100 
# copies all the 32 rows, every one of this and then do a element wise addition


#### 3. Construction of the final output layer

We have 100 inputs and since we have 27 characters there will be 27 outputs.

In [49]:
W2 = torch.randn((100,27))
b2 = torch.randn(27)

In [50]:
logits = hidden_layer @ W2 + b2
logits

tensor([[ 3.5307e+00, -6.1167e+00, -7.7523e+00,  1.1603e+01,  1.2361e+01,
          2.1246e+01, -5.0853e+01, -2.4711e+00, -1.3251e+01, -1.5714e+01,
          5.7926e+01, -2.3510e+01,  2.2106e+01, -2.1171e+01,  1.6980e+01,
          2.9062e+01,  1.9111e+01,  2.9771e+01, -5.1988e+00,  1.3259e+01,
         -2.8121e+01, -2.1592e+01,  1.2597e+01, -1.6221e+01, -1.6764e+01,
         -1.5168e+01, -2.8787e+01],
        [-4.4348e+00, -4.0918e+00, -7.1336e+00,  1.7271e+01,  2.4175e+01,
          2.0607e+01, -2.4195e+01, -8.9282e+00, -2.3683e+00, -1.2514e+01,
          5.5166e+01, -1.6548e+01,  1.2500e+01, -5.1127e+00,  1.5303e+01,
          3.7853e+01,  4.3254e+00,  9.0689e+00, -3.5606e+00,  1.2535e+01,
         -3.8439e+01, -3.6996e+01,  2.0127e+01, -8.8383e+00, -1.5734e+01,
         -1.9581e+01, -2.5485e+01],
        [ 2.7976e+00,  2.8904e+01, -1.4060e+01,  3.6962e+01,  4.3405e+01,
          6.3184e+00,  4.1204e+01, -5.4497e+00,  1.2795e+01, -2.1973e+01,
          1.1750e+01, -2.7849e+00, -2.06

#### 4. Calculating the loss

In [51]:
counts = logits.exp() # exponentiate the logits to get the fake counts

In [52]:
# normalising the logits into a probability
probabilities = counts / counts.sum(1, keepdim=True)

In [53]:
probabilities

tensor([[2.3789e-24, 1.5367e-28, 2.9940e-29, 7.6247e-21, 1.6267e-20, 1.1753e-16,
         0.0000e+00, 5.8864e-27, 1.2250e-31, 1.0439e-32, 1.0000e+00, 4.2940e-36,
         2.7780e-16, 4.4517e-35, 1.6498e-18, 2.9137e-13, 1.3890e-17, 5.9204e-13,
         3.8478e-28, 3.9944e-20, 4.2678e-38, 2.9235e-35, 2.0593e-20, 6.2876e-33,
         3.6508e-33, 1.8023e-32, 2.1925e-38],
        [1.3059e-26, 1.8402e-26, 8.7868e-28, 3.4880e-17, 3.4754e-14, 9.8025e-16,
         3.4195e-35, 1.4604e-28, 1.0313e-25, 4.0462e-30, 1.0000e+00, 7.1677e-32,
         2.9542e-19, 6.6298e-27, 4.8765e-18, 3.0290e-08, 8.3257e-23, 9.5610e-21,
         3.1301e-26, 3.0615e-19, 2.2289e-41, 9.4349e-41, 6.0642e-16, 1.5977e-28,
         1.6172e-31, 3.4521e-33, 9.4161e-36],
        [2.7828e-20, 6.0573e-09, 1.3287e-27, 1.9141e-05, 1.2022e-02, 9.4093e-19,
         1.3316e-03, 7.2901e-24, 6.1117e-16, 4.8625e-31, 2.1505e-16, 1.0473e-22,
         1.8884e-30, 3.5827e-20, 1.0897e-16, 9.5566e-14, 7.3224e-22, 6.6958e-39,
         2.3005e-

In [54]:
probabilities[0].sum() # every row sum upto 1

tensor(1.)

Y has the identity of the next character in the sequence that we would love to predict. 

We would like to index into the rows of probabilties in each row and we would like to pluck out the probabilty assigned to the correct character as given here.

In [55]:
Y

tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0,  9, 19,
         1,  2,  5, 12, 12,  1,  0, 19, 15, 16,  8,  9,  1,  0])

In [56]:
probabilities[torch.arange(32),Y] # torch.arange(32) iterates the rows and grabs coulmns provided by Y

tensor([1.1753e-16, 6.6298e-27, 3.5827e-20, 7.1389e-08, 5.8009e-18, 2.9137e-13,
        1.3491e-18, 6.4460e-44, 9.8177e-18, 4.9655e-19, 5.6729e-16, 4.7069e-14,
        1.5367e-28, 4.6644e-04, 1.4383e-10, 4.4775e-10, 1.0439e-32, 1.0000e+00,
        8.1062e-09, 5.7010e-04, 2.5728e-14, 1.4454e-08, 3.4318e-16, 5.8110e-19,
        8.6788e-19, 3.9944e-20, 2.4725e-18, 3.8582e-29, 1.6133e-30, 2.0227e-15,
        6.7071e-40, 8.6035e-09])

The values generated are some what not ok. From the values of some of the probabilities, the network thinks that some are extremely unlikely to happen. Also these values should be 1 at the end of training then only we can say we r correctly predicting the next character

In [57]:
loss = -probabilities[torch.arange(32),Y]
loss

tensor([-1.1753e-16, -6.6298e-27, -3.5827e-20, -7.1389e-08, -5.8009e-18,
        -2.9137e-13, -1.3491e-18, -6.4460e-44, -9.8177e-18, -4.9655e-19,
        -5.6729e-16, -4.7069e-14, -1.5367e-28, -4.6644e-04, -1.4383e-10,
        -4.4775e-10, -1.0439e-32, -1.0000e+00, -8.1062e-09, -5.7010e-04,
        -2.5728e-14, -1.4454e-08, -3.4318e-16, -5.8110e-19, -8.6788e-19,
        -3.9944e-20, -2.4725e-18, -3.8582e-29, -1.6133e-30, -2.0227e-15,
        -6.7071e-40, -8.6035e-09])

This loss needs to be minimsed so that the network can correctly predict the correct character in the sequence

In [76]:
# Summarise

g = torch.Generator().manual_seed(2147483647) # for reproducability
C = torch.randn((27,2), generator=g)
W1 = torch.randn((6,100), generator=g)
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100,27), generator=g)
b2 = torch.randn(27, generator=g)

In [77]:
# just to have a peak into the number of parameters in total
parameters = [C, W1, b1, W2, b2] # clustering of parameters
sum(p.nelement() for p in parameters)

3481

In [78]:
embedding = C[X]
h = torch.tanh(embedding.view(-1,6) @ W1 + b1)
logits = h @ W2 + b2
counts = logits.exp()
probabilities = counts / counts.sum(1, keepdim=True)
loss = - probabilities[torch.arange(32), Y]
loss.log().mean() # how well the current networks work based on the parameters

tensor(nan)

In [79]:
loss = F.cross_entropy(logits, Y)
loss

tensor(17.7697)

if we loook at it one perspective, when we calculate loss based on logits, we are just doing classification and to calculate this simply we can use the cross entropy function in pytorch

```python
counts = logits.exp()
probabilities = counts / counts.sum(1, keepdim=True)
loss = - probabilities[torch.arange(32), Y].log().mean()
```
could be reduced to 

```python
F.cross_entropy(logits, Y)
```

<u>Advantages of using cross entropy </u>

- Unnecessary creation of a lot of tensors.<br>
    It makes the whole process fairly inefficient to run like this. By this single line of code we skip the creation of all these intermediate tensors and cluster up all these operations and very often have fused kernels that very efficiently evaluate these expressions.

- Backward pass can be made much more efficient<br>
    Not only because its a fused kernel but also that the expressions since they are clustered are of much simpler form mathematically. 
    

In [80]:
logits = torch.tensor([-2,-3,0,5])
counts = logits.exp()
probabilities = counts / counts.sum()
probabilities

tensor([9.0466e-04, 3.3281e-04, 6.6846e-03, 9.9208e-01])

- Cross entropy is also mathematically well behaved<br>
    Given [-2,-3,0,5] these values the probability distribution is well and good. But when these take up some extreme values which can happen because of the optimisation of neural network.

In [81]:
# for example
logits = torch.tensor([-100,-3,0,5])
counts = logits.exp()
probabilities = counts / counts.sum()
probabilities

tensor([0.0000e+00, 3.3311e-04, 6.6906e-03, 9.9298e-01])

In [82]:
counts

tensor([3.7835e-44, 4.9787e-02, 1.0000e+00, 1.4841e+02])

Having these negative numbers is also not much of a headache. it will cause a very very small number thats close to 0 something like 3.7835e-44.

In [83]:
# for example
logits = torch.tensor([-2,-3,0,100])
counts = logits.exp()
probabilities = counts / counts.sum()
probabilities

tensor([0., 0., 0., nan])

Having really big positive values, we could run into trouble even ending up with a ```nan```. The reason for these is that counts do have a value ```infinite```. We will run out of the range of floating point numbers 
(e^{100}) run out of the dynamic range of floating point number. So passing in very large logits is not possible in these expression

In [84]:
counts

tensor([0.1353, 0.0498, 1.0000,    inf])

- These operations allow us to produce the same output even after adding offsets.

In [85]:
# for example
logits = torch.tensor([-2,-3,0,5]) 
counts = logits.exp()
probabilities = counts / counts.sum()
probabilities

tensor([9.0466e-04, 3.3281e-04, 6.6846e-03, 9.9208e-01])

In [86]:
# for example
logits = torch.tensor([-2,-3,0,5]) + 7
counts = logits.exp()
probabilities = counts / counts.sum()
probabilities

tensor([9.0466e-04, 3.3281e-04, 6.6846e-03, 9.9208e-01])

Because we normalise the logits, we get the same output for the logits even after adding in some offsets to it

In [87]:
for p in parameters:
    p.requires_grad = True

In [90]:
for _ in range(100):
    # forward pass
    embedding = C[X]
    h = torch.tanh(embedding.view(-1,6) @ W1 + b1)
    logits = h @ W2 + b2
    counts = logits.exp()
    loss = F.cross_entropy(logits, Y)
    print(loss.item())
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()

    # update
    for p in parameters:
        p.data += -0.1 * p.grad #

1.7380964756011963
1.6535115242004395
1.579089879989624
1.5117665529251099
1.4496047496795654
1.3913121223449707
1.3359923362731934
1.2830528020858765
1.2321909666061401
1.1833815574645996
1.1367987394332886
1.0926642417907715
1.0510926246643066
1.0120267868041992
0.9752707481384277
0.9405564069747925
0.9076125025749207
0.876192033290863
0.8460890650749207
0.8171358108520508
0.7891990542411804
0.762174665927887
0.7359814047813416
0.7105578184127808
0.6858609914779663
0.6618651151657104
0.638565719127655
0.6159818768501282
0.5941659212112427
0.573210597038269
0.5532563924789429
0.5344881415367126
0.5171169638633728
0.5013313293457031
0.48724284768104553
0.47484058141708374
0.46399781107902527
0.4545145034790039
0.44617095589637756
0.4387665092945099
0.432133287191391
0.42613884806632996
0.4206799268722534
0.41567549109458923
0.4110615849494934
0.40678712725639343
0.4028107225894928
0.39909741282463074
0.3956180810928345
0.3923478424549103
0.3892652690410614
0.38635197281837463
0.3835917

As we run it further longer, the loss goes on to reduce further. This works because we are only over fitting 32 examples with 3481 parameters. It is calling overfitting because we are running all these examples over just 32 examples

In [94]:
print(loss.item()) # after training it again and again

0.31377431750297546


It would be surprising as to not see a 0 after all this training this is because:

In [95]:
logits.max(1) # outputs the original value and their indices 

torch.return_types.max(
values=tensor([11.4639, 13.4778, 19.0661, 17.9120, 13.2064, 11.4639, 13.2552, 11.8626,
        13.6934, 15.6432, 12.8634, 17.9044, 11.4639, 13.2158, 14.3344, 17.2696,
        11.4639, 14.0626, 11.7470, 13.5321, 15.9663, 12.5515,  8.1474,  8.1505,
        14.0189, 11.4639, 13.5286, 13.8694, 11.3024, 14.4007, 15.8964, 12.3963],
       grad_fn=<MaxBackward0>),
indices=tensor([ 1, 13, 13,  1,  0,  1, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0,  1, 19,
         1,  2,  5, 12, 12,  1,  0,  1, 15, 16,  8,  9,  1,  0]))

On keeping a closer look, we see the indices are almost equivalent to Y but there are differences

In [96]:
Y

tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0,  9, 19,
         1,  2,  5, 12, 12,  1,  0, 19, 15, 16,  8,  9,  1,  0])

Therefore loss is not 0 <br>
since in accordance with our example "..." predicts e, o, a , i , s as possible outcomes for the subsequent character and therefore not able to completely overfit. But this overfitting correctly works for unique outputs generated by unique inputs