# Bigram Neural Net Model – Makemore (Karpathy)
This notebook implements a bigram character-level language model using a single-layer neural network, in contrast to bigram_count.ipynb, which used bigram frequencies to predict the next character.

## Basic Overview
1. The NN will receive a character as input
2. There will be some params (weights)
3. The output will be a prob distribution for the next character
4. The output of the NN will be evaluated using the loss function 
5. The weights will be modified using gradient based optimization

In [1]:
import torch
import torch.nn.functional as F


## Data Preprocessing
1. Read the names from the dataset
2. Create a vocabulary of characters
3. Create a training set of bigrams (x,y)
4. Convert the training set of bigrams into one-hot encoded vectors

### 1. Read the names from the dataset

In [2]:
# Load a list of names from the dataset (each name on a new line)
words = open("data/names.txt", 'r').read().splitlines()
print(words[:3])

['emma', 'olivia', 'ava']


### 2. Create a vocabulary of characters

In [3]:
# Build vocabulary of characters from all names
vocab = sorted(set("".join(words)))

# Add '.' as the start/end token, mapped to index 0
# create a mapping from characters to integers
stoi = {s: i+1 for i, s in enumerate(vocab)}
stoi['.'] = 0
# create a mapping from integers to characters
itos = {i: s for s, i in stoi.items()}
print(itos)


{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
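As a quick sanity check on the two mappings, the sketch below rebuilds them from a tiny stand-in word list (an assumption, so the block runs without `data/names.txt`) and round-trips a name through `stoi` and `itos`. The helper names `encode` and `decode` are hypothetical and do not appear in the notebook.

```python
# Stand-in for the real dataset (assumption: the notebook loads data/names.txt instead)
words = ["emma", "olivia", "ava"]

vocab = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(vocab)}
stoi["."] = 0  # start/end token
itos = {i: s for s, i in stoi.items()}

def encode(word):
    """Map a name to integer indices, wrapped in the start/end token."""
    return [stoi[ch] for ch in "." + word + "."]

def decode(indices):
    """Map integer indices back to a string."""
    return "".join(itos[i] for i in indices)

ids = encode("emma")
print(ids)
print(decode(ids))
```

With the stand-in list the vocabulary is smaller than 27, so the integer values differ from the ones printed above; only the round trip is the point here.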


### 3. Create a training set of bigrams (x,y)

#### Create the training set of bigrams (x, y)
x is the preceding character and y is the character that follows it. x is the input and y is the label: the neural net predicts y for a given value of x.


In [4]:
# Build the dataset of bigram (x, y) pairs: x is the current char, y is the next char
xs, ys = [], []
for w in words[:1]:
    chars = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chars, chars[1:]):
        # print(f"ch1: {ch1}, ch2: {ch2}")
        print(ch1, ch2)
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])


. e
e m
m m
m a
a .


#### Convert the training set of bigrams from list to tensors for efficient computation

In [5]:
# convert inputs and labels to tensors so we can work in PyTorch
# use torch.tensor (lowercase): it infers the dtype from the data (int64 for Python ints),
# whereas torch.Tensor always produces float32.
# since we are storing integer indices, torch.tensor is the right choice
xs = torch.tensor(xs)
ys = torch.tensor(ys)
print(f"input: {xs}")
print(f"labels:{ys}")

input: tensor([ 0,  5, 13, 13,  1])
labels:tensor([ 5, 13, 13,  1,  0])
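The dtype difference mentioned in the comments can be checked directly. This is a minimal sketch using the same five indices; the variable names `a` and `b` are only for illustration.

```python
import torch

ints = [0, 5, 13, 13, 1]

a = torch.tensor(ints)   # factory function: dtype is inferred from the data
b = torch.Tensor(ints)   # class constructor: uses the default float dtype

print(a.dtype)
print(b.dtype)
```

Integer indices matter later on, because `F.one_hot` expects an integer index tensor rather than a float one.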


We are considering only the first word "emma". 

The first word "emma" can be broken down into five examples. 

. e

e m

m m

m a

a .

So the input has 5 examples.
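The same breakdown can be written as a small helper. This is a sketch; `bigram_pairs` is a hypothetical name, not part of the notebook. A word with n characters always yields n + 1 examples, because of the start and end tokens.

```python
def bigram_pairs(word, boundary="."):
    """Return the (previous, next) character pairs for one word."""
    chars = [boundary] + list(word) + [boundary]
    return list(zip(chars, chars[1:]))

pairs = bigram_pairs("emma")
print(pairs)
print(len(pairs))  # one example per character plus one for the end token
```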

According to the input xs and labels ys above, when the input to the network is 0 (`.`), the label is 5 (`e`). The network should therefore assign its highest probability to index 5 of the output.

### 4. Convert the training set of bigrams into one-hot encoded vectors

Only the appropriate bit (index) will be 1, all the remaining indices will be 0.

This will be the input to the neural net

In [6]:
# num_classes = 27 because the vocab size is 27
# if we don't pass num_classes, one_hot infers the number of classes as 14, since the highest value in xs is 13
# with the entire corpus (not just the first word) the highest value would be 26, so
# num_classes would automatically be inferred as 27
xenc = F.one_hot(xs, num_classes=27)
print(xenc.shape)
print(xenc)
print(xenc.dtype)
# the dtype of xenc is int, but we always give floating point numbers to a nn that can take on continuous values.
# so convert the dtype of xenc to float
xenc = xenc.float()
print(xenc.dtype)

torch.Size([5, 27])
tensor([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0]])
torch.int64
torch.float32
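The effect of leaving out `num_classes` can be seen on the same five indices. A minimal sketch; `inferred` and `explicit` are illustrative names.

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])  # indices for ".emma"

inferred = F.one_hot(xs)                  # width inferred as max(xs) + 1
explicit = F.one_hot(xs, num_classes=27)  # width fixed to the vocabulary size

print(inferred.shape)
print(explicit.shape)
```

Passing `num_classes` explicitly keeps the encoding width tied to the vocabulary, so it lines up with the 27 rows of `W` regardless of which characters happen to appear in the batch.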


## Initialize Weights


#### Construct a single neuron that performs W*X + b
Here we omit the bias b and compute only W*X

1. Initialize W
2. Matrix multiplication @ between W and input vector xenc

In [7]:
# Randomly initialize weights W using randn - which draws random numbers from a normal distribution
# size = (27,1) means we create 27 weights for one neuron
# W will be a column vector of 27 numbers
# number of characters in vocab
vocab_size = len(itos)
W = torch.randn(size = (vocab_size,1))
print(f"Shape of xenc: {xenc.shape}")
print(f"\nShape of Weight matrix W: {W.shape}")
# matrix multiplication of W and input
print(f"\nShape of xenc @ W: {(xenc @ W).shape}")
print(f"\nMat mul xenc @ W: {xenc @ W}")


Shape of xenc: torch.Size([5, 27])

Shape of Weight matrix W: torch.Size([27, 1])

Shape of xenc @ W: torch.Size([5, 1])

Mat mul xenc @ W: tensor([[-1.2920],
        [-0.5032],
        [-0.4323],
        [-0.4323],
        [-0.9469]])


The input `xenc` is (5, 27).

The weights `W` are (27, 1): 27 = number of weights, 1 = number of neurons.

The output of `xenc @ W` is a (5, 1) column vector, which holds the activation of the single neuron on each of the 5 input examples.
We fed all five inputs to the neuron at once, and PyTorch evaluated them in a single matrix multiplication. In micrograd, we had to use a loop.
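To make the comparison with a loop concrete, the sketch below computes the same activations both ways. The random input `x` and the seed are stand-ins chosen for this example, not values from the notebook.

```python
import torch

g = torch.Generator().manual_seed(0)   # arbitrary seed, chosen for this sketch
x = torch.randn((5, 27), generator=g)  # 5 examples, 27 features each
W = torch.randn((27, 1), generator=g)  # one neuron with 27 weights

# Loop version: one dot product per example
looped = torch.stack([x[i] @ W[:, 0] for i in range(x.shape[0])])

# Batched version: a single matrix multiplication handles all examples
batched = (x @ W)[:, 0]

print(torch.allclose(looped, batched, atol=1e-5))
```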


#### 27 neurons instead of a single neuron

The reason we need 27 neurons instead of a single neuron is that we need a probability distribution over 27 characters. A single neuron's output was (5, 1), meaning only one number per example, whereas the output of 27 neurons is (5, 27), which gives one value per character for each example.

Another reason is that our end goal is to create a matrix equivalent to the matrix P from the simple count-based bigram model. P was of size (27, 27), where each row represented the first character of a bigram and each column represented the second character of a bigram.
The values inside each row represented the probability distribution over the next character. E.g., the row with index 1 held the probabilities of all the bigrams starting with a: aa, ab, ac, ..., az, and a`.` (`.` is the end token).

*W with shape (27,27) means we have 27 neurons with each neuron having 27 connections (connections have the weights)*

## Compute Logits (log-counts) and Counts



In [8]:
# Randomly initialize weights W using randn, which draws random numbers from a normal distribution
# size = (27,27) means 27 neurons with 27 weights each
# W is a 27x27 matrix; column j holds the weights of neuron j
vocab_size = len(itos) # 27
# initialize weights
W = torch.randn((vocab_size, vocab_size), requires_grad=True)

print(f"Shape of W: {W.shape}")
print(f"\nShape of xenc: {xenc.shape}")

# calculate logits
y = xenc @ W
print(f"\nShape of logits: {y.shape}")



Shape of W: torch.Size([27, 27])

Shape of xenc: torch.Size([5, 27])

Shape of logits: torch.Size([5, 27])


The output of xenc @ W is a (5, 27) matrix.
The first row gives the 27 activations of the 27 neurons for the first example.
Each entry is the dot product of one input row with one column of W.
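Because every input row is one-hot, each of those dot products just copies out a single weight. In other words, `xenc @ W` selects rows of `W`. The sketch below checks this with a randomly initialized `W` (the seed is arbitrary).

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)  # arbitrary seed for this sketch
xs = torch.tensor([0, 5, 13, 13, 1])
W = torch.randn((27, 27), generator=g)

xenc = F.one_hot(xs, num_classes=27).float()
logits = xenc @ W

# Each one-hot row has a single 1, so the matmul picks out one row of W
print(torch.allclose(logits, W[xs]))
```

This is why row i of `W` ends up playing the same role as row i of the count matrix: it holds the scores for "what follows character i".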

In [9]:
(xenc @ W)[3][13] # activation of the neuron at index 13 for the example at index 3 (0-based)

tensor(-0.0247, grad_fn=<SelectBackward0>)

In [10]:
print(f"row 1 of logits: {y[0]}")

row 1 of logits: tensor([-0.0340,  0.1212,  0.0355, -1.1042,  1.0493,  0.2631,  1.0342, -1.4824,
         0.7879, -0.9840, -0.0260,  0.2059,  0.1654, -1.2550, -1.3606,  0.0512,
         0.5799, -0.4467,  2.2150,  0.7303,  0.1610, -1.3518,  1.0963,  0.7427,
         0.1634, -1.0055, -2.0936], grad_fn=<SelectBackward0>)


### Interpretation of the output of Logits (xenc @ W)
xenc @ W is a (5, 27) matrix with positive and negative numbers.

The end goal is to produce the same matrix as P in the simple bigram model. P was derived from N, which held the counts of every bigram.
To achieve this goal, we want to convert the output of xenc @ W into a representation that can be treated as counts, and then normalize those counts to get probability distributions, just like we did in the simple bigram model.

`log-counts`: So, we assume that the output of xenc @ W is the *log* of counts (log counts), as log can take both positive and negative values and the output of xenc @ W has both positive and negative values. 

`Exponent to get Counts`: To get the counts out of the log counts, we exponentiate the log counts.

The exponential of a negative number lies between 0 and 1.

The exponential of a positive number is greater than 1.

##### After performing element-wise exponentiation on `xenc @ W`:

*negative numbers are transformed into positive numbers between 0 and 1*

*positive numbers become even more positive (greater than 1)*

**The exponentiated log-counts can be interpreted as counts because they are always positive.**

The output of *xenc @ W* is known as the *log-counts* or *logits*.
The exponential of the logits gives the *counts*, a matrix that plays the role of the N matrix.

Once we have the counts, we get the probabilities simply by normalizing the counts, just as was done for the simple bigram model.
Every row of the probability matrix should sum to 1.





The two steps of taking exp to get counts and then normalizing those counts to get probability distributions are collectively the *SOFTMAX* function
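As a check on that statement, the sketch below compares the two manual steps with PyTorch's built-in `torch.softmax`. The random `logits` are a stand-in for `xenc @ W`.

```python
import torch

g = torch.Generator().manual_seed(0)        # arbitrary seed for this sketch
logits = torch.randn((5, 27), generator=g)  # stand-in for xenc @ W

counts = logits.exp()                                # step 1: exponentiate
probs_manual = counts / counts.sum(1, keepdim=True)  # step 2: normalize each row
probs_builtin = torch.softmax(logits, dim=1)         # both steps in one call

print(torch.allclose(probs_manual, probs_builtin, atol=1e-6))
print(probs_manual.sum(1))
```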

### Computing counts

In [11]:
logits = xenc @ W # log-counts
counts = logits.exp() # counts, equivalent to N
probs = counts / counts.sum(1, keepdim=True) # normalize each row; exp + normalize is the softmax
print(f"logits: {logits[0]}") # positive and negative values
print(f"counts: {counts[0]}") # all positive values
print(f"probs: {probs[0]}") # each row should sum to 1
print(f"Shape of probs:{probs.shape}")

logits: tensor([-0.0340,  0.1212,  0.0355, -1.1042,  1.0493,  0.2631,  1.0342, -1.4824,
         0.7879, -0.9840, -0.0260,  0.2059,  0.1654, -1.2550, -1.3606,  0.0512,
         0.5799, -0.4467,  2.2150,  0.7303,  0.1610, -1.3518,  1.0963,  0.7427,
         0.1634, -1.0055, -2.0936], grad_fn=<SelectBackward0>)
counts: tensor([0.9666, 1.1288, 1.0362, 0.3315, 2.8557, 1.3010, 2.8128, 0.2271, 2.1987,
        0.3738, 0.9744, 1.2287, 1.1799, 0.2851, 0.2565, 1.0525, 1.7858, 0.6398,
        9.1618, 2.0757, 1.1746, 0.2588, 2.9931, 2.1016, 1.1775, 0.3659, 0.1232],
       grad_fn=<SelectBackward0>)
probs: tensor([0.0241, 0.0282, 0.0259, 0.0083, 0.0713, 0.0325, 0.0702, 0.0057, 0.0549,
        0.0093, 0.0243, 0.0307, 0.0294, 0.0071, 0.0064, 0.0263, 0.0446, 0.0160,
        0.2287, 0.0518, 0.0293, 0.0065, 0.0747, 0.0525, 0.0294, 0.0091, 0.0031],
       grad_fn=<SelectBackward0>)
Shape of probs:torch.Size([5, 27])


In [12]:
probs[1].sum()

tensor(1., grad_fn=<SumBackward0>)

#### What does the probs matrix tell us

Shape of the input (xenc) is (5,27)

Shape of probs is (5, 27)

The probs matrix gives us:

For every example in the input, it gives us the probability distribution of the next character.

`probs[0]` is the output probability distribution for the first input example. It has 27 values.

It tells us *HOW LIKELY IS EACH CHARACTER TO COME NEXT*.

The first value is the probability of the `.` token (end of word), the second value is the probability that the next character is "a", the third is the probability for "b", and so on. All 27 probabilities sum to 1, giving us a probability distribution for the next character.



In [13]:
probs[1][1] # prob of the next character to be "a" for the second input which was "e"

tensor(0.0576, grad_fn=<SelectBackward0>)

As we tune W, the values of probs will change. The ultimate goal is to reach the same values as the matrix P (in bigram_count.ipynb), which held the probabilities estimated from the counts in the dataset.

**The way to tune W is to minimize the LOSS FUNCTION.**

## Forward Pass
A forward pass performs all the steps from one-hot encoding the input to calculating the loss

1. One-hot encode input
2. Initialize weights
3. Logits (log-counts)
4. Exponentiate logits (get counts)
5. Normalize counts
6. Negative log likelihood
7. Loss

In [14]:
print(f"Raw input xs: {xs}")
print(f"Raw labels ys: {ys}")

Raw input xs: tensor([ 0,  5, 13, 13,  1])
Raw labels ys: tensor([ 5, 13, 13,  1,  0])


In [15]:
# generator for reproducibility
g = torch.Generator().manual_seed(2147483647)
# one-hot encode input
xenc = F.one_hot(xs,num_classes=27).float()
# initialize weight matrix
W = torch.randn((27,27), generator=g)
# calculate logits
logits = xenc @ W  #(5,27), (27,27) -> (5,27)
# calculate counts
counts = logits.exp()
# calculate probability
probs = counts / counts.sum(1, keepdim=True) # the exp and the normalization together constitute the softmax function
print(f"Shape of probs matrix: {probs.shape}")

Shape of probs matrix: torch.Size([5, 27])


### **Understanding Probabilities and Loss Calculation**

`probs` has 5 rows, each row representing the probability of 27 characters to be the next character. However, we are only interested in the probability of the character that is the actual label.

---

#### **Example:**

For the first input example:
- `xs[0] = 0` (`.`)
- Ground truth is `ys[0] = 5` (`e`)

So, for the output probability distribution for the first input example, there are 27 probabilities, but we are only interested in the probability at index 5, which is the probability of `e` being the next character. We want this probability to be as close to 1 as possible. The **loss** measures how far it falls short: it is the negative log of this probability, which is 0 when the probability is 1 and grows as the probability shrinks.

**Likelihood**: The probability assigned to the correct index is the **likelihood** of the model on that example. So if the probability of "e" is 0.0123, the model assigns only about 1.2 percent probability to the correct next character for that example.

Multiplying the likelihoods of all the examples gives the likelihood of the model on the entire dataset. Equivalently, adding the log-likelihoods of the examples gives the log-likelihood of the dataset.
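A small numeric sketch of the product and the sum of logs, using the (rounded) probabilities of the correct characters that this notebook prints further down for the five "emma" examples. Because the inputs are rounded, the result only approximately matches the loss computed later.

```python
import math

# Probabilities of the correct next character for the five "emma" examples,
# as printed (rounded) later in this notebook
p = [0.0123, 0.0181, 0.0267, 0.0737, 0.0150]

likelihood = math.prod(p)                        # product over examples
log_likelihood = sum(math.log(pi) for pi in p)   # sum of logs, same information
mean_nll = -log_likelihood / len(p)              # average negative log-likelihood

print(likelihood)
print(log_likelihood, math.log(likelihood))
print(mean_nll)
```

The product is a very small number even for five examples, which is one reason to work with the sum of logs instead.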

---

### **Extracting the Probability at Index 5**

We need to extract the probability at index 5 from row 0 of `probs`. This can be done in two ways:

---

#### **1. Indexing:**

For the 0th example, index the label index from the 0th row:

```python
probs[0, 5]
```

`0` is the index of the input example, corresponding to a row of `probs`.

`5` is the index of the label, which we get from `ys`.


#### **2. Dot Product of Predicted Probabilities with Actual Probabilities of Labels**

The predicted probability distribution for the first example is:

```python
probs[0] = [0.0607, 0.0100, 0.0123, 0.0042, 0.0168, 0.0123, 0.0027, 0.0232, 0.0137,
            0.0313, 0.0079, 0.0278, 0.0091, 0.0082, 0.0500, 0.2378, 0.0603, 0.0025,
            0.0249, 0.0055, 0.0339, 0.0109, 0.0029, 0.0198, 0.0118, 0.1537, 0.1459]
```

To get the actual probability of labels, we one-hot encode the labels vector `ys`:

```python
ys = [5, 13, 13, 1, 0]
```

The one-hot encoded vector for the first label `5` becomes:

```python
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Here, `1` is only at the 5th index, while all other indices are `0`.

---

##### **Dot Product Calculation:**

The dot product of `probs[0]` and the one-hot encoded vector for the first label is:

```python
[0.0607, 0.0100, 0.0123, 0.0042, 0.0168, 0.0123, 0.0027, 0.0232, 0.0137,
 0.0313, 0.0079, 0.0278, 0.0091, 0.0082, 0.0500, 0.2378, 0.0603, 0.0025,
 0.0249, 0.0055, 0.0339, 0.0109, 0.0029, 0.0198, 0.0118, 0.1537, 0.1459]
 *
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

This results in:

```python
[0, 0, 0, 0, 0, 0.0123, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Taking the sum of this vector gives:

```python
0.0123
```

## Calculate NLL and Loss

In [16]:
# Calculating NLL for the 5 examples of "emma" using indexing

nlls = torch.zeros(5)
for i in range(5):
    y = ys[i].item()
    # likelihood of the correct index
    p = probs[i, y] # using indexing
    # calculate log of p (loglikelihood)
    log_p = torch.log(p)
    # calculate neg of log_p (NLL)
    nll = -log_p
    nlls[i] = nll
print(nlls)

tensor([4.3993, 4.0146, 3.6234, 2.6081, 4.2012])


For simplicity, the loop header `for i in range(5):` and its two lines `y = ys[i].item()` and `p = probs[i, y]` can be replaced by a single indexing expression:

`p = probs[torch.arange(5), ys]`


So the code becomes:

In [17]:
# pluck out the probability from probs for the label indices in ys
p = probs[torch.arange(5), ys] 
print(f"likelihood of correct indices: {p}")
# log-likelihood
log_p = torch.log(p)
# negative log-likelihood
nll = -log_p
nlls = []
nlls.append(nll)
print(nlls)


likelihood of correct indices: tensor([0.0123, 0.0181, 0.0267, 0.0737, 0.0150])
[tensor([4.3993, 4.0146, 3.6234, 2.6081, 4.2012])]


In [18]:
# Calculating NLL for the 5 examples of "emma" using the dot product approach
# multiplying by the one-hot label and summing mirrors the cross-entropy formula
nlls = torch.zeros(5)
for i in range(5):
    yenc = F.one_hot(ys[i], num_classes=27)
    prob_i = probs[i]
    p = (yenc * prob_i).sum()
    log_p = torch.log(p)
    nll = -log_p
    nlls[i] = nll
print(nlls)

tensor([4.3993, 4.0146, 3.6234, 2.6081, 4.2012])
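The loop above can also be vectorized, in the same way the indexing version was. The sketch below uses stand-in predictions (random, with an arbitrary seed) to show that indexing and the one-hot dot product pick out the same probabilities.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)  # arbitrary seed for this sketch
ys = torch.tensor([5, 13, 13, 1, 0])
probs = torch.softmax(torch.randn((5, 27), generator=g), dim=1)  # stand-in predictions

# Indexing: pick probs[i, ys[i]] for every row i
p_index = probs[torch.arange(len(ys)), ys]

# Dot product: multiply by the one-hot labels and sum along each row
yenc = F.one_hot(ys, num_classes=27).float()
p_dot = (probs * yenc).sum(1)

print(torch.allclose(p_index, p_dot))
```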


#### Understanding the nlls vector
The nlls vector has 5 values, [4.3993, 4.0146, 3.6234, 2.6081, 4.2012]. Each value is the negative log-likelihood for one input example.
The first value, 4.3993, is the NLL for the first input (.), the second value is the NLL for the second input (e), and so on.

The NLLs are high: all but one exceed ln(27) ≈ 3.30, the NLL of a uniform guess over 27 characters. The loss will therefore be high as well.

#### Calculate Loss from NLL
Loss is simply the mean of all the nlls across the dataset. 5 examples (one word) in our case.

Changing W will change the loss.


In [19]:
# loss
loss = (nlls.sum()/len(nlls)).item() # nlls.mean().item()
loss

3.7693049907684326

#### Summarize the Forward Pass and Loss Calculation
1. One-hot encode input
2. Initialize weights
3. Logits (log-counts)
4. Exponentiate logits (get counts)
5. Normalize counts
6. Negative log likelihood
7. Loss

In [20]:
g = torch.Generator().manual_seed(2147483647)
xenc = F.one_hot(xs,num_classes=27).float()
W = torch.randn((27,27), generator=g, requires_grad= True) # set requires grad= True to make sure grads for W are stored
logits = xenc @ W  #(5,27), (27,27) -> (5,27)
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True) # predictions
loss = -probs[torch.arange(5), ys].log().mean()
loss

tensor(3.7693, grad_fn=<NegBackward0>)
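PyTorch also provides `F.cross_entropy`, which takes the raw logits and the integer labels and combines the softmax with the mean negative log-likelihood. The sketch below compares it with the manual steps; it rebuilds `xs`, `ys`, and `W` so that it runs on its own.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])
W = torch.randn((27, 27), generator=g)

# Manual forward pass, as in the cell above
logits = F.one_hot(xs, num_classes=27).float() @ W
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(len(ys)), ys].log().mean()

# Built-in: softmax + mean negative log-likelihood in one call
builtin = F.cross_entropy(logits, ys)

print(manual.item(), builtin.item())
```

The manual version is kept in this notebook because it makes each step visible; the built-in is generally the more numerically robust choice for large logits.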

In [21]:
print(f"initial values of a subset of W matrix\n {W[0, :10]}")

initial values of a subset of W matrix
 tensor([ 1.5674, -0.2373, -0.0274, -1.1008,  0.2859, -0.0296, -1.5471,  0.6049,
         0.0791,  0.9046], grad_fn=<SliceBackward0>)


## Backpropagation to Optimize the Weights Using Gradient Based Optimization

Optimize the weights using gradient descent. Consider the learning rate to be 0.5.

In [22]:

W.grad = None # set the grads to zero. in pytorch, we set grad to None which is equal to setting it to 0
loss.backward() # calculate grads



#### loss.backward() stores the gradients for all the weights (params), so W.grad is no longer None

Every element of W.grad tells us how the loss responds to a small change in the corresponding element of W.
If the gradient is positive, increasing that weight increases the loss; if the gradient is negative, increasing that weight decreases the loss.
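For this particular model (one-hot input, softmax, mean negative log-likelihood) the gradient has a simple closed form: `xenc.T @ (probs - yenc) / n`, where `yenc` is the one-hot encoding of the labels. The sketch below compares that formula with what autograd stores in `W.grad`; the seed is arbitrary and the names `yenc` and `analytic` are only for illustration.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)  # arbitrary seed for this sketch
xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])
W = torch.randn((27, 27), generator=g, requires_grad=True)

xenc = F.one_hot(xs, num_classes=27).float()
probs = torch.softmax(xenc @ W, dim=1)
loss = -probs[torch.arange(len(ys)), ys].log().mean()
loss.backward()

# Closed-form gradient of the mean NLL with respect to W
yenc = F.one_hot(ys, num_classes=27).float()
analytic = xenc.T @ (probs.detach() - yenc) / len(ys)

print(torch.allclose(W.grad, analytic, atol=1e-6))
# W[0, 5] scores "e" after ".": raising it lowers the loss, so its gradient is negative
print(W.grad[0, 5].item())
# Rows of W for characters that never appear in xs receive zero gradient
print(W.grad[2].abs().sum().item())
```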

In [23]:
print(W.grad.shape)

torch.Size([27, 27])


In [24]:
# update the weights using gradient descent
lr = 0.5
W.data += -lr * W.grad # W.data = W.data - (lr * W.grad)


In [25]:
print(f"updated values of a subset of W matrix W:\n{W[0, :10]}")

updated values of a subset of W matrix W:
tensor([ 1.5613, -0.2383, -0.0286, -1.1012,  0.2842,  0.0691, -1.5473,  0.6026,
         0.0778,  0.9015], grad_fn=<SliceBackward0>)


The values in W have changed slightly after Gradient descent. 

#### Summarizing Backprop
1. Reset the gradients. In PyTorch, setting the gradient to None has the same effect as zeroing it (`W.grad = None`)
2. Calculate the gradients (`loss.backward()`)
3. Update the weights (`W.data += -0.5 * W.grad`)
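The three steps can be wrapped into one function. This is a sketch rather than the notebook's own code: `sgd_step` is a hypothetical helper, and it performs the update inside `torch.no_grad()` as an alternative to writing to `W.data`.

```python
import torch
import torch.nn.functional as F

def sgd_step(W, xs, ys, lr):
    """One forward/backward/update cycle; returns the loss before the update."""
    xenc = F.one_hot(xs, num_classes=W.shape[0]).float()
    counts = (xenc @ W).exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(len(ys)), ys].log().mean()

    W.grad = None            # 1. reset the gradients
    loss.backward()          # 2. calculate the gradients
    with torch.no_grad():    # 3. update the weights without tracking the update
        W -= lr * W.grad
    return loss.item()

g = torch.Generator().manual_seed(0)  # arbitrary seed for this sketch
xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])
W = torch.randn((27, 27), generator=g, requires_grad=True)

losses = [sgd_step(W, xs, ys, lr=0.5) for _ in range(3)]
print(losses)
```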

### Running 10 iterations of forward and backward pass to reduce the loss

With a learning rate of 0.1, the loss decreases from 3.769 to 3.590 over the 10 iterations.

In [26]:
# initialize the weights
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27,27), generator=g, requires_grad= True) # set requires grad= True to make sure grads for W are stored

for i in range(10):
    # forward pass
    xenc = F.one_hot(xs,num_classes=27).float()
    logits = xenc @ W  #(5,27), (27,27) -> (5,27)  
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True) # predictions # counts and probs form softmax
    loss = -probs[torch.arange(5), ys].log().mean()
    print(f"loss:{loss}")

    # backward pass
    W.grad = None # set the grads to zero. in pytorch, we set grad to None which is equal to setting it to 0
    loss.backward() # calculate grads

    # update
    W.data += -0.1 * W.grad


        

loss:3.7693049907684326
loss:3.7492127418518066
loss:3.7291626930236816
loss:3.7091546058654785
loss:3.6891887187957764
loss:3.6692662239074707
loss:3.6493873596191406
loss:3.629552125930786
loss:3.6097614765167236
loss:3.5900158882141113


#### Using the entire dataset instead of the first word (5 examples)
Everything should work the same

In [27]:
# create the training set of bigrams (x,y)
xs, ys = [], []
for w in words:
    chars = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chars, chars[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = len(xs)
print("number of examples = ", num)




number of examples =  228146


In [36]:
#initialize the network
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27,27), generator=g, requires_grad= True) # set requires grad= True to make sure grads for W are stored
lr = 0.2

In [40]:
# training loop
for i in range(20000):
    # forward pass
    xenc = F.one_hot(xs,num_classes=27).float()
    logits = xenc @ W  # (228146,27) @ (27,27) -> (228146,27)
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True) # predictions, equivalent to P
    loss = -probs[torch.arange(num), ys].log().mean()
    if i % 100 == 0:
        print(f"loss at iteration {i}: {loss}")

    # backward pass
    W.grad = None # set the grads to zero. in pytorch, we set grad to None which is equal to setting it to 0
    loss.backward() # calculate grads

    # update
    W.data += -lr * W.grad




loss at iteration 0: 2.4837753772735596
loss at iteration 100: 2.483569622039795
loss at iteration 200: 2.4833669662475586
loss at iteration 300: 2.483166456222534
loss at iteration 400: 2.4829680919647217
loss at iteration 500: 2.4827725887298584
loss at iteration 600: 2.4825797080993652
loss at iteration 700: 2.482388734817505
loss at iteration 800: 2.4822006225585938
loss at iteration 900: 2.4820139408111572
loss at iteration 1000: 2.48183012008667
loss at iteration 1100: 2.4816482067108154
loss at iteration 1200: 2.481468439102173
loss at iteration 1300: 2.4812910556793213
loss at iteration 1400: 2.4811151027679443
loss at iteration 1500: 2.4809417724609375
loss at iteration 1600: 2.4807705879211426
loss at iteration 1700: 2.4806013107299805
loss at iteration 1800: 2.480433464050293
loss at iteration 1900: 2.4802677631378174
loss at iteration 2000: 2.4801039695739746
loss at iteration 2100: 2.4799420833587646
loss at iteration 2200: 2.4797821044921875
loss at iteration 2300: 2.4796

The loss levels off at about 2.47. This is almost the same loss as we achieved with the simple bigram model.
There, the loss followed directly from the counts. Here, it is reached through gradient-based optimization.

This makes sense because the network is not given any additional information: it still sees only the previous character.

**AT THE END OF OPTIMIZATION, THE EXPONENTIATED WEIGHTS exp(W) ARE BASICALLY THE SAME AS THE MATRIX N IN THE BIGRAM MODEL, UP TO A PER-ROW SCALE FACTOR THAT THE NORMALIZATION REMOVES. THE ONLY DIFFERENCE IS THAT IN THE SIMPLE MODEL WE POPULATED THAT ARRAY BY COUNTING, WHEREAS NOW WE INITIALIZE THE LOG-COUNTS RANDOMLY AND LET THE LOSS OPTIMIZE THEM TOWARD THE ACTUAL VALUES.**
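The link between W and N can be illustrated on a toy example. The sketch below uses a made-up 3x3 count matrix (not the real bigram counts): setting `W = log(N)` makes the softmax reproduce the count-based probabilities, and adding a constant to a row of `W` leaves them unchanged.

```python
import torch

# Toy 3x3 count matrix (made-up numbers, not the real bigram counts)
N = torch.tensor([[4.0, 1.0, 5.0],
                  [2.0, 2.0, 6.0],
                  [1.0, 8.0, 1.0]])

P = N / N.sum(1, keepdim=True)   # count-based probabilities

W = N.log()                      # treat the log-counts as weights
probs = torch.softmax(W, dim=1)  # what the network computes from W

# Adding a constant to a row of W rescales that row of exp(W) but not the probabilities
W_shifted = W + torch.tensor([[1.0], [-2.0], [0.5]])
probs_shifted = torch.softmax(W_shifted, dim=1)

print(torch.allclose(probs, P, atol=1e-6))
print(torch.allclose(probs_shifted, P, atol=1e-6))
```

So training can only pin down W up to such row-wise shifts; what it recovers is the probability matrix P.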

#### SAMPLE

In [49]:
g = torch.Generator().manual_seed(2147483647) 

for i in range(5):
    result = []
    ix = 0
    while True:
        # get the probs
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float() # one hot encoding the input. input will be a single character index while sampling
        logits = xenc @ W  # (1,27) @ (27,27) -> (1,27)
        counts = logits.exp() # counts, equivalent to N
        probs = counts / counts.sum(1, keepdim=True)

        # you have probs, now sample one character at a time from this distribution
        ix = torch.multinomial(probs, num_samples=1, replacement=True, generator=g).item()
        result.append(itos[ix])
        if ix == 0:
            break

    print("".join(result))



junide.
janasah.
pxzfay.
a.
nn.


The sampled output is almost the same as that of the simple bigram model with the same seed, because training has brought the network to nearly the same probabilities as the count-based model.
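The sampling loop can be packaged as a function. This is a sketch under a few assumptions: `sample_name` and its `max_len` guard are hypothetical additions, the weights here are random stand-ins rather than the trained `W`, and the function indexes `W[ix]` directly, which gives the same logits as one-hot encoding `ix` and multiplying by `W`. Unlike the cell above, it drops the trailing `.` from the returned name.

```python
import torch

def sample_name(W, itos, generator, max_len=50):
    """Sample one name from a (vocab, vocab) weight matrix; index 0 is the start/end token."""
    out = []
    ix = 0
    for _ in range(max_len):  # hypothetical guard against very long samples
        probs = torch.softmax(W[ix], dim=0)  # distribution over the next character
        ix = torch.multinomial(probs, num_samples=1, replacement=True, generator=generator).item()
        if ix == 0:
            break
        out.append(itos[ix])
    return "".join(out)

# Stand-in setup: random weights instead of the trained W
itos = {0: ".", **{i + 1: chr(ord("a") + i) for i in range(26)}}
g = torch.Generator().manual_seed(0)  # arbitrary seed for this sketch
W = torch.randn((27, 27), generator=g)

names = [sample_name(W, itos, g) for _ in range(3)]
print(names)
```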