# Exercises Part 2:
- E01: Tune the hyperparameters of the training to beat my best validation loss of 2.2
- E02: I was not careful with the initialization of the network in this video. (1) What is the loss you'd get if the predicted probabilities at initialization were perfectly uniform? What loss do we achieve? (2) Can you tune the initialization to get a starting loss that is much more similar to (1)?
- E03: Read the Bengio et al 2003 paper ([link](https://www.youtube.com/redirect?event=video_description&redir_token=QUFFLUhqbnlHNlB5V2ZGT3lSVVljeXNMMGlUa2ktbmx3d3xBQ3Jtc0tsUGdMSU5OcWVEUHFxd0RzTVJQLVZnMzY3N3UzQXlzRE93ZjEydklDR0YwYWd6OThLNDJvNFNFM3FkajhGWGNtaU9ZVXl2VmkzR0NoYWdNUGpyeTd1RG9LN1dSRklQUHdkaEs5RWlPQWZxeW1rLUM3QQ&q=https%3A%2F%2Fwww.jmlr.org%2Fpapers%2Fvolume3%2Fbengio03a%2Fbengio03a.pdf&v=TCH_1BHY58I)), implement and try any idea from the paper. Did it work?


In [2]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt #for making figures
%matplotlib inline

In [3]:
words = open("names.txt", "r").read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [4]:
len(words)


32033

In [5]:
#Create the string-to-integer and integer-to-string mappings
chars = sorted(list(set("".join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi["."] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)


{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


In [6]:
# build the dataset

block_size = 3 #Number of characters used to predict the next one 
X, Y = [], [] #X are the inputs, Y are the labels
for w in words[:5]:
    print(w)
    context = [0] * block_size
    for ch in w + ".":
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        print("".join(itos[i] for i in context), "--->", itos[ix])
        context = context[1:] + [ix]

X = torch.tensor(X)
Y = torch.tensor(Y)

emma
... ---> e
..e ---> m
.em ---> m
emm ---> a
mma ---> .
olivia
... ---> o
..o ---> l
.ol ---> i
oli ---> v
liv ---> i
ivi ---> a
via ---> .
ava
... ---> a
..a ---> v
.av ---> a
ava ---> .
isabella
... ---> i
..i ---> s
.is ---> a
isa ---> b
sab ---> e
abe ---> l
bel ---> l
ell ---> a
lla ---> .
sophia
... ---> s
..s ---> o
.so ---> p
sop ---> h
oph ---> i
phi ---> a
hia ---> .


<img src="Screenshot 2024-05-14 150859.png" width="600">


In [7]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

In [8]:
C = torch.randn((27, 2)) #We want to embed each of our 27 characters into a 2-dimensional space
C
C[5]
C[[5,6,7]]
C[torch.tensor([5,6,7,7,7,7])] #Some random ways to access only some specific rows



tensor([[-0.2428, -0.0991],
        [-1.4703, -1.4746],
        [ 1.5090, -0.4990],
        [ 1.5090, -0.4990],
        [ 1.5090, -0.4990],
        [ 1.5090, -0.4990]])

In [9]:
emb = C[X]
emb.shape
emb


tensor([[[ 0.4704, -1.8919],
         [ 0.4704, -1.8919],
         [ 0.4704, -1.8919]],

        [[ 0.4704, -1.8919],
         [ 0.4704, -1.8919],
         [-0.2428, -0.0991]],

        [[ 0.4704, -1.8919],
         [-0.2428, -0.0991],
         [ 1.1876,  0.9466]],

        [[-0.2428, -0.0991],
         [ 1.1876,  0.9466],
         [ 1.1876,  0.9466]],

        [[ 1.1876,  0.9466],
         [ 1.1876,  0.9466],
         [-1.8704,  1.8879]],

        [[ 0.4704, -1.8919],
         [ 0.4704, -1.8919],
         [ 0.4704, -1.8919]],

        [[ 0.4704, -1.8919],
         [ 0.4704, -1.8919],
         [ 1.2696, -0.6277]],

        [[ 0.4704, -1.8919],
         [ 1.2696, -0.6277],
         [ 0.3588, -1.1145]],

        [[ 1.2696, -0.6277],
         [ 0.3588, -1.1145],
         [-0.3483, -0.5151]],

        [[ 0.3588, -1.1145],
         [-0.3483, -0.5151],
         [ 0.5891, -0.9654]],

        [[-0.3483, -0.5151],
         [ 0.5891, -0.9654],
         [-0.3483, -0.5151]],

        [[ 0.5891, -0

<img src="Unbenannt.png" width="600"> <br>
The indices of the letters (0-26) are replaced by their embeddings in a 2-dimensional space. The initial embedding values are drawn at random from a standard normal distribution.
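
To make the lookup concrete, here is a minimal plain-Python sketch (no PyTorch, so it runs anywhere) of what `C[X]` does for a single context. The table values and the 4-character toy vocabulary are made up for illustration, and the helper names `embed` and `one_hot_lookup` are hypothetical. It also shows the equivalent way of reading the lookup: a one-hot vector multiplied by the table.

```python
# Toy embedding table: 4 "characters", each mapped to a 2-dimensional vector.
# The values are invented for illustration; in the notebook C is torch.randn((27, 2)).
C = [
    [0.5, -1.9],   # index 0 ('.')
    [-0.2, -0.1],  # index 1
    [1.2, 0.9],    # index 2
    [-1.9, 1.9],   # index 3
]

def embed(context):
    """Replace every index in the context by its row of C (what C[X] does for one row of X)."""
    return [C[ix] for ix in context]

def one_hot_lookup(ix):
    """The same lookup written as a one-hot vector times the table."""
    one_hot = [1.0 if j == ix else 0.0 for j in range(len(C))]
    dims = len(C[0])
    return [sum(one_hot[j] * C[j][d] for j in range(len(C))) for d in range(dims)]

context = [0, 0, 2]        # a context of 3 character indices
emb_row = embed(context)   # 3 positions x 2 dimensions
print(emb_row)
print(one_hot_lookup(2) == C[2])
```

Indexing is the cheaper of the two, which is presumably why the notebook indexes into `C` directly instead of building one-hot vectors.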

In [10]:
W1 = torch.randn((6,100)) #What does (6, 100) mean? Each example feeds the hidden layer 3 characters, and each of them is embedded in 2D space, so 3 x 2 = 6 inputs
# Each of the 100 neurons we create is fed these 6 values, and every value gets a weight that we first initialize randomly
b1 = torch.randn(100) #Every neuron also gets a random bias 

#Neuron = (input1 * weight1 + ... + input6 * weight6) + bias
#So we want emb @ W1 + b1, which multiplies all our examples by the weights and adds a bias, BUT:
#We cannot multiply shapes [32,3,2] @ [6,100]; we need to flatten emb so that we get [32,6] @ [6,100]

torch.cat(torch.unbind(emb, 1), 1)# <=> torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1), which means we take the embeddings of the 0th, 1st and 2nd letters and concatenate them along dim 1

#BUUUUUT! This is not very efficient, because torch.cat creates a new tensor... There is a better way

#Each tensor stores something called .storage(): a one-dimensional representation of the tensor, because that's how it is stored in PC memory
#The method .view() changes how this one-dimensional storage (or array, in simpler words) is interpreted by PyTorch, so no values are copied and no additional memory is needed

h = emb.view(32,6) @ W1 + b1 #Voilaaaaaa!!!

#BUUUT: We don't want to hardcode numbers...
h = emb.view(emb.shape[0], 6) @ W1 + b1
#But why is 6 still here... Shuu Shuu go away... Well, we already agreed on 3 input letters and 2-dimensional embeddings, so it's kinda okay...

h = torch.tanh(h) #Look at the image: the layer is non-linear, it is a tanh
h

tensor([[ 1.0000,  0.8471,  1.0000,  ...,  0.9914, -1.0000, -1.0000],
        [ 0.9990,  0.0188,  1.0000,  ...,  0.9999, -0.9937, -1.0000],
        [ 0.0049, -1.0000,  0.9996,  ...,  0.9991,  0.0587, -0.4351],
        ...,
        [ 0.9002,  0.9971,  0.9719,  ...,  1.0000, -0.9359, -1.0000],
        [ 0.9013,  0.8020,  0.9113,  ..., -0.4469, -0.7633, -0.9976],
        [-0.3186,  0.6468, -0.1698,  ...,  0.2997,  0.9997, -0.9540]])
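
The comments in the cell above describe `.view()` as a reinterpretation of one flat block of memory. Here is a minimal plain-Python sketch of that idea, assuming a row-major (contiguous) layout. The names `flat`, `at_3d` and `at_2d` are hypothetical and not part of the PyTorch API, and the sizes are shrunk to 2 examples so the numbers stay readable.

```python
# A [2, 3, 2] "tensor" (2 examples, 3 characters, 2 embedding dims) stored as one flat list,
# the way a contiguous tensor is laid out in memory (row-major order).
flat = [float(v) for v in range(12)]

def at_3d(i, j, k):
    """Read element [i, j, k] when the storage is interpreted with shape [2, 3, 2]."""
    return flat[i * 6 + j * 2 + k]

def at_2d(i, c):
    """Read element [i, c] when the same storage is interpreted with shape [2, 6]."""
    return flat[i * 6 + c]

# Row i of the [2, 6] interpretation is the three 2-d embeddings of example i side by side,
# i.e. the same values you get from concatenating emb[:, 0, :], emb[:, 1, :], emb[:, 2, :].
for i in range(2):
    concatenated = [at_3d(i, j, k) for j in range(3) for k in range(2)]
    viewed = [at_2d(i, c) for c in range(6)]
    assert concatenated == viewed

second_row = [at_2d(1, c) for c in range(6)]  # second example as one row of 6 inputs
print(second_row)
```

Only the index arithmetic changes between the two interpretations and `flat` itself is never copied. In PyTorch, `emb.view(-1, 6)` lets the first dimension be inferred, which is one way to avoid hardcoding the 32.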

<img src="Screenshot 2024-05-14 150859.png" width="300"><img src="tanh.jpg" width="300">
<img src="hyperbolic functions.png" width="300">
https://www.youtube.com/watch?v=HnHnEnkZpJA Video on hyperbolic trigonometric functions
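
As a quick check of the formula behind tanh, the sketch below computes it from exponentials and compares the result with Python's built-in `math.tanh`. The function name `tanh_from_exp` is hypothetical and the sample inputs are arbitrary.

```python
import math

def tanh_from_exp(x):
    """tanh(x) = (e^x - e^-x) / (e^x + e^-x)"""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

for x in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    # The two columns agree, and every value lies between -1 and 1
    print(x, tanh_from_exp(x), math.tanh(x))
```

Inputs with a large absolute value get squashed very close to -1 or 1, which matches the many 1.0000 and -1.0000 entries in the printed `h` above.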

In [12]:
#Now we create our Output layer:
W2 = torch.randn((100, 27))
b2 = torch.randn(27)
#Again, let's take a moment here. We create a layer with 27 neurons, where each neuron gets 100 inputs from the 100 neurons of the previous layer
logits = h @ W2 + b2 #What are logits? Logits are the raw outputs of a neural network. Usually we feed them into a softmax in order to get a probability distribution
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True) #Look at the image below. This is softmax

<img src="th-1496899846.png" alt="Example Image" width="300" >
<br>
We take our "counts" and transform them into probs below <br><br>
<img src="probrow.png" alt="Example Image" width="300" >
<br>
As you can see, the probabilities for each of the inputs in X sum to one.

Logits have the shape [x, 100] @ [100, 27] -> [x, 27], where x is the number of encoded examples (one per row) and 27 is the number of characters. After the softmax, each row of prob is a probability distribution over the 27 characters and sums to one.
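
The two softmax steps (exponentiate, then normalize each row) can be sketched for a single row in plain Python. The logits below are made up for a 4-character toy vocabulary. Subtracting the row maximum before `exp` is an extra step that the cell above does not do; it is a common trick that leaves the result unchanged while keeping `exp` from overflowing on large logits.

```python
import math

def softmax(logits):
    """Turn one row of logits into probabilities: exponentiate, then normalize."""
    m = max(logits)                             # extra step (not in the notebook cell):
    counts = [math.exp(v - m) for v in logits]  # shifting by the max avoids overflow in exp()
    total = sum(counts)
    return [c / total for c in counts]

row = [2.0, 1.0, 0.1, -3.0]   # made-up logits for a 4-character toy vocabulary
probs = softmax(row)
print(probs)
print(sum(probs))
```

Every entry is positive, the row sums to one (up to floating-point rounding), and the largest logit receives the largest probability.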

<br><br>
Now let us look at the following:

In [26]:
enumerator = torch.arange(prob.size(dim = 0))
print(prob[enumerator, Y])
nll = -prob[enumerator, Y].log().mean() #We take the values printed below, log them, take the average and then multiply by -1 to get a positive number
nll # <- Negative log likelihood

tensor([2.3051e-08, 3.3493e-05, 4.0166e-12, 8.4881e-04, 1.3526e-06, 4.4306e-01,
        2.7379e-01, 2.0476e-09, 2.0995e-03, 6.7271e-09, 8.9223e-11, 1.9935e-04,
        1.3739e-04, 9.3795e-04, 7.2837e-03, 7.7846e-09, 1.8155e-07, 8.7850e-09,
        1.3488e-05, 3.6653e-14, 4.5156e-10, 9.2625e-12, 2.0620e-05, 1.6007e-05,
        5.1942e-06, 3.2195e-09, 1.5942e-01, 1.7708e-07, 8.8152e-07, 6.6709e-08,
        2.8286e-05, 1.0633e-12])


tensor(14.2351)

What does it mean? For every row of prob we print the probability that the network currently assigns to the actual correct next character (Y, if you remember, holds our labels).
Then, as you might recall, we calculate the negative log likelihood, which is the loss we will try to minimize for our network.
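
The same calculation can be sketched in plain Python on a tiny made-up example. The probabilities and labels below are invented for illustration and the function name `nll` is hypothetical. The last lines add a reference point: the loss of a perfectly uniform guess over 27 characters.

```python
import math

def nll(prob_rows, labels):
    """Average negative log likelihood of the correct labels."""
    picked = [row[y] for row, y in zip(prob_rows, labels)]  # like prob[enumerator, Y]
    return -sum(math.log(p) for p in picked) / len(picked)

# Made-up probabilities for 3 examples over a 3-character toy vocabulary
prob_rows = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]
print(nll(prob_rows, labels))

# Reference point: a perfectly uniform guess over 27 characters
uniform = [[1 / 27] * 27]
print(nll(uniform, [5]))   # equals log(27), about 3.2958
```

Compared with that reference, the 14.2351 above is very high: the randomly initialized network assigns tiny probabilities (down to around 1e-14) to many of the correct characters, so it is confidently wrong rather than merely unsure.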