<a href="https://colab.research.google.com/github/shihongji/aml-exercise/blob/main/Fall_Hongji_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2: Bigram Language Model and Generative Pretrained Transformer (GPT)


The objective of this assignment is to train a simplified transformer model. The primary differences between the implementation:
* tokenizer (we use a character level encoder simplicity and compute constraints)
* size (we are using 1 consumer grade gpu hosted on colab and a small dataset. in practice, the models are much larger and are trained on much more data)
* efficiency


Most modern LLMs have multiple training stages, so we won't get a model that is capable of replying to you yet. However, this is the first step towards a model like ChatGPT and Llama.




In [12]:
%matplotlib inline
import torch
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from torch import nn

## Part 1: Bigram MLP for TinyShakespeare (35 points)

1a) (1 point). Create a list `chars` that contains all unique characters in `text`

1b) (2 points). Implement `encode(s: str) -> list[int]`

1c) (2 points). Implement `decode(ids: list[int]) -> str`

1d) (5 points). Create two tensors, `inputs_one_hot` and `outputs_one_hot`. Use one hot encoding. Make sure to get every consecutive pair of characters. For example, for the word 'hello', we should create the following input-output pairs
```
he
el
ll
lo
```

1e) (10 points). Implement BigramOneHotMLP, a 2 layer MLP that predicts the next token. Specifically, implement the constructor, forward, and generate. The output dimension of the first layer should be 8. Use `torch.optim`. The activation function for the first layer should be `nn.LeakyReLU()`

Note: Use the `torch.nn.function.cross_entropy` loss. Read the [docs](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html) about how this loss function works. The logits are the output of a network WITHOUT an activation function applied to the last layer. There are activation functions are applied to every layer except the last.

1f) (5 points). Train the BigramOneHotMLP for 1000 steps.

1g) (5 points). Create two tensors, `input_ids` and `outputs_one_hot`. These `input_ids` will be used for the embedding layer.

1h) (5 points). Implement and train BigramEmbeddingMLP, a 2 layer mlp that predicts the next token. Specifically, implement the constructor, forward, and generate functions. The output dimension of the first layer should be 8. Use `torch.optim`.



Note: the output will look like gibberish


In [13]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-10-16 18:57:59--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2024-10-16 18:57:59 (31.3 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



In [None]:
# For the bigram model, let's use the first 1000 characters for the data

with open('input.txt', 'r') as f:
    text = f.read()
text = text[:1000]

In [None]:
print(text)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [None]:
chars = sorted(set(text)) # implement
print(len(chars))
# print(chars)

c2id = {c: i for i, c in enumerate(chars)} # implement
id2c = {i: c for c, i in c2id.items()} # implement
# dictionary.items() return the list of key-value pairs as tuples
print(c2id)
print(id2c)

46
{'\n': 0, ' ': 1, '!': 2, "'": 3, ',': 4, '.': 5, ':': 6, ';': 7, '?': 8, 'A': 9, 'B': 10, 'C': 11, 'F': 12, 'I': 13, 'L': 14, 'M': 15, 'N': 16, 'O': 17, 'R': 18, 'S': 19, 'W': 20, 'Y': 21, 'a': 22, 'b': 23, 'c': 24, 'd': 25, 'e': 26, 'f': 27, 'g': 28, 'h': 29, 'i': 30, 'j': 31, 'k': 32, 'l': 33, 'm': 34, 'n': 35, 'o': 36, 'p': 37, 'r': 38, 's': 39, 't': 40, 'u': 41, 'v': 42, 'w': 43, 'y': 44, 'z': 45}
{0: '\n', 1: ' ', 2: '!', 3: "'", 4: ',', 5: '.', 6: ':', 7: ';', 8: '?', 9: 'A', 10: 'B', 11: 'C', 12: 'F', 13: 'I', 14: 'L', 15: 'M', 16: 'N', 17: 'O', 18: 'R', 19: 'S', 20: 'W', 21: 'Y', 22: 'a', 23: 'b', 24: 'c', 25: 'd', 26: 'e', 27: 'f', 28: 'g', 29: 'h', 30: 'i', 31: 'j', 32: 'k', 33: 'l', 34: 'm', 35: 'n', 36: 'o', 37: 'p', 38: 'r', 39: 's', 40: 't', 41: 'u', 42: 'v', 43: 'w', 44: 'y', 45: 'z'}


In [None]:
test_text = "Let"
print(encode_p1(test_text))
print(decode_p1(encode_p1(test_text)))

[14, 26, 40]
Let


In [None]:
en1 = [0, 1, 5]
en2 = [0, 5, 1]

In [None]:
test_vector = F.one_hot(torch.tensor(en1))
test_vector2 = F.one_hot(torch.tensor(en2))
print(test_vector, test_vector2)
t = [test_vector, test_vector]
print(f"Before stack:\n{test_vector.shape}")
print(f"After stack:\n{torch.stack(t)} {torch.stack(t).shape}")

tensor([[1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1]]) tensor([[1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0, 0]])
Before stack:
torch.Size([3, 6])
After stack:
tensor([[[1, 0, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 1]],

        [[1, 0, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 1]]]) torch.Size([2, 3, 6])


In [None]:
from typing import Tuple
from torch.nn import functional as F

def encode_p1(s: str) -> list[int]:
    return [c2id[c] for c in s]

def decode_p1(ids: list[int]) -> str:
    return ''.join([id2c[i] for i in ids])

input_features = len(chars)
output_features = len(chars)

def create_one_hot_inputs_and_outputs() -> Tuple[torch.Tensor, torch.Tensor]:
    # implement
    num_classes = len(chars)
    inputs = []
    outputs = []

    encoded_text = encode_p1(text)
    for i in range(len(encoded_text) - 1):
        input_vector = F.one_hot(torch.tensor(encoded_text[i]), num_classes=num_classes).float() # convert to float for compatibility for model params
        output_vector = F.one_hot(torch.tensor(encoded_text[i + 1]), num_classes=num_classes).float()
        inputs.append(input_vector)
        outputs.append(output_vector)

    inputs_one_hot = torch.stack(inputs)
    outputs_one_hot = torch.stack(outputs)
    return (inputs_one_hot, outputs_one_hot)

inputs_one_hot, outputs_one_hot = create_one_hot_inputs_and_outputs()


In [None]:
print(inputs_one_hot.shape)
print(outputs_one_hot)

torch.Size([999, 46])
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.]])


In [None]:
class BigramOneHotMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(input_features, 8)
        self.fc2 = nn.Linear(8, output_features)
        self.activation = nn.LeakyReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x)
        return x

    def generate(self, start='a', max_new_tokens=100) -> str:
        self.eval()
        with torch.no_grad():
            current_char = start
            generate_char = current_char
            for _ in range(max_new_tokens):
                # The reason to use unsqueeze(0) is to add a new dimension at index 0, making a 1D tensor (e.g., [num_classes]) into a 2D tensor (e.g., [1, num_classes]).
                # Not necessary but it's best practice.
                input_tensor = F.one_hot(torch.tensor(c2id[current_char]), num_classes=input_features).unsqueeze(0).float()
                output_tensor = self.forward(input_tensor)
                next_char_id = torch.argmax(output_tensor).item()
                if _ == 0:
                    print(encode_p1(current_char))
                    print(decode_p1(encode_p1(current_char)))
                    print(input_tensor.shape)
                    print(next_char_id)
                    print(f"{output_tensor=}")
                next_char = [k for k, v in c2id.items() if v == next_char_id][0]
                current_char = next_char
                generate_char += current_char
        return generate_char



In [None]:
# Test cell
v1 = torch.tensor([1, 0, 0, 0])
v2 = torch.tensor([0, 1, 0, 0])
v1.unsqueeze(0)
print(v1.shape)
print(v1.unsqueeze(0).shape)

torch.Size([4])
torch.Size([4, 1])


In [None]:
# bigram_one_hot_mlp.eval()
out = bigram_one_hot_mlp.generate()
print(out)

[22]
a
torch.Size([1, 46])
5
output_tensor=tensor([[ 0.1825,  0.1354,  0.1488,  0.0085, -0.1154,  0.3481,  0.1484, -0.1705,
          0.1311, -0.4176, -0.1259, -0.0434,  0.1160, -0.2528, -0.2291,  0.2963,
         -0.2588,  0.1056,  0.0987, -0.1033,  0.3266, -0.1704, -0.1108, -0.3189,
         -0.3457, -0.1601,  0.2447, -0.0992, -0.2696, -0.1764, -0.0650,  0.2387,
         -0.1041, -0.1703,  0.1273, -0.0465,  0.2838,  0.2606, -0.0869,  0.1479,
         -0.2140, -0.4262, -0.1323,  0.2011, -0.2728, -0.1477]])
a....................................................................................................


In [None]:
# training loop
bigram_one_hot_mlp = BigramOneHotMLP()

optimizer = torch.optim.Adam(bigram_one_hot_mlp.parameters(), lr=0.01)
for _ in range(1000):
    optimizer.zero_grad()
    pred = bigram_one_hot_mlp(inputs_one_hot)
    loss = F.cross_entropy(pred, outputs_one_hot)
    loss.backward()
    optimizer.step()


print(bigram_one_hot_mlp.generate())

[22]
a
torch.Size([1, 46])
35
output_tensor=tensor([[ -5.9525,   2.7035, -11.9712,  -1.1866,  -2.9978,  -0.3760,  -1.4835,
          -4.3004, -11.0841,  -9.5644,  -8.8597,  -1.2098, -10.2075,  -3.0092,
          -3.4980,  -2.9780,  -8.8827,  -8.5291,  -8.4442, -10.3108,  -9.2844,
          -8.5049,   1.8790,   1.9435,   2.0768,   2.3626,   1.5916,   2.3646,
          -1.5555,  -0.5431,   1.7667,  -8.9618,   3.8356,   3.0190,   2.6660,
           4.4469,  -0.0996,   1.2779,   4.3095,   2.5957,   3.7881,   1.8603,
           2.3841,   2.8070,   2.8216,  -3.7654]])
an the the the the the the the the the the the the the the the the the the the the the the the the th


In [None]:
def create_embedding_inputs_and_outputs() -> list[torch.tensor, torch.tensor]:
    encoded = encode_p1(text)
    num_classes = len(chars)
    inputs = []
    outputs = []

    for i in range(len(encoded) - 1):
        input_id = torch.tensor(encoded[i])
        output_vector = F.one_hot(torch.tensor(encoded[i + 1]), num_classes=num_classes).float()
        inputs.append(input_id)
        outputs.append(output_vector)

    input_ids = torch.stack(inputs)
    outputs_one_hot = torch.stack(outputs)
    return [input_ids, outputs_one_hot]

input_ids, outputs_one_hot = create_embedding_inputs_and_outputs()

class BigramEmbeddingMLP(nn.Module):
    def __init__(self, input_dim, embedding_dim=5):
        super().__init__()
        self.token_embedding = nn.Embedding(num_embeddings=input_dim, embedding_dim=embedding_dim)
        self.fc1 = nn.Linear(embedding_dim, 8)
        self.fc2 = nn.Linear(8, input_dim)
        self.activation = nn.LeakyReLU()
    def forward(self, x):
        x = self.token_embedding(x)
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x)
        return x

    def generate(self, start='a', max_new_tokens=100) -> str:
        self.eval()
        with torch.no_grad():
            current_char = start
            generate_char = current_char
            for _ in range(max_new_tokens):
                input_tensor = torch.tensor([c2id[current_char]])
                output_tensor = self.forward(input_tensor)
                next_char_id = torch.argmax(output_tensor).item()
                next_char = [k for k, v in c2id.items() if v == next_char_id][0]
                current_char = next_char
                generate_char += current_char
        return generate_char

def train(model, input_ids, outputs_one_hot, epochs=1000, learning_rate=0.01):
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        optimizer.zero_grad()
        pred = model(input_ids)
        loss = criterion(pred, torch.argmax(outputs_one_hot, dim=1))
        loss.backward()
        optimizer.step()
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item()}")


input_dim = len(chars)
bigram_embedding_mlp = BigramEmbeddingMLP(input_dim)
print(f"Pretraining: {bigram_embedding_mlp.generate()}")

train(bigram_embedding_mlp, input_ids, outputs_one_hot)


print(bigram_embedding_mlp.generate())

Epoch 0, Loss: 3.802501678466797
Epoch 100, Loss: 2.472047805786133
Epoch 200, Loss: 2.25669264793396
Epoch 300, Loss: 2.2086939811706543
Epoch 400, Loss: 2.1860597133636475
Epoch 500, Loss: 2.1704888343811035
Epoch 600, Loss: 2.160346508026123
Epoch 700, Loss: 2.153761625289917
Epoch 800, Loss: 2.148449182510376
Epoch 900, Loss: 2.142740488052368
an t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t t 


## Part 2: Generative Pretrained Transformer (65 points)

For this part, it is best to use a gpu. In the settings at the top go to Runtime -> Change Runtime Type and select T4 GPU

In [14]:
# run nvidia-smi to check gpu usage
!nvidia-smi

Wed Oct 16 18:58:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P0              28W /  70W |    123MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [15]:
# For the gpt model, let's use the full text

with open('input.txt', 'r') as f:
    text = f.read()

Implement a character level tokenization function.

1. Create a list of unique characters in the string. (1 points)
2. Implement a function `encode(s: str) -> list[int]` that takes a string and returns a list of ids (1 point)
3. Implement a function `decode(ids: list[int]) -> str` that takes a list of ids (ints) and returns a string (1 point)


In [16]:
chars = sorted(list(set(text)))
c2id = {c: i for i, c in enumerate(chars)} # implement
id2c = {i: c for c, i in c2id.items()} # implemen

def encode(s: str) -> list[int]:
    return [c2id[c] for c in s if c in c2id]

def decode(ids: list[int]) -> str:
    return "".join([id2c[i] for i in ids])

In [17]:
data = torch.tensor(encode(text), dtype=torch.long).cuda()

In [9]:
block_size = 16
data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43],
       device='cuda:0')

To train a transformer, we feed the model `n` tokens (context) and try to predict the `n+1`th token (target) in the sequence.



In [10]:
x = data[:block_size]
y = data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18], device='cuda:0') the target: 47
when input is tensor([18, 47], device='cuda:0') the target: 56
when input is tensor([18, 47, 56], device='cuda:0') the target: 57
when input is tensor([18, 47, 56, 57], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58], device='cuda:0') the target: 1
when input is tensor([18, 47, 56, 57, 58,  1], device='cuda:0') the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47], device='cuda:0') the target: 64
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64], device='cuda:0') the target: 43
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43], device='cuda:0') the target: 52
when input is tensor([18, 47,

In [11]:
batch_size = 64
device = 'cuda' if torch.cuda.is_available() else 'cpu'
def get_batch():
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y


### Single Self Attention Head (5 points)
![](https://i.ibb.co/GWR1XG0/head.png)

In [18]:
class SelfAttentionHead(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.head_size = head_size

        # create query, key and value
        self.q_linear = nn.Linear(head_size, head_size)
        self.k_linear = nn.Linear(head_size, head_size)
        self.v_linear = nn.Linear(head_size, head_size)

        # scaled dot-product attention requires scaling factor
        self.scale = torch.sqrt(torch.FloatTensor([head_size])).to(device)

    def forward(self, x):
        query = self.q_linear(x)
        key = self.k_linear(x)
        value = self.v_linear(x)

        scores = torch.matmul(query, key.transpose(1, 2)) / self.scale
        attention_weights = torch.nn.functional.softmax(scores, dim=-1)
        atten_values = torch.matmul(attention_weights, value)
        return atten_values, attention_weights

### Multihead Self Attention (5 points)

`constructor`

- Create 4 `SelfAttentionHead` instances. Consider using `nn.ModuleList`
- Create a linear layer with n_embd input dim and n_embd output dim

`forward`

In the forward implementation, pass `x` through each head, then concatenate all the outputs along the feature dimension, then pass the concatenated output through the linear layer

![](https://i.ibb.co/y5SwyZZ/multihead.png)

In [25]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.linear = nn.Linear(num_heads * head_size, num_heads * head_size)
        self.dropout = nn.Dropout(0.1)
        self.heads = nn.ModuleList([SelfAttentionHead(head_size) for _ in range(num_heads)])

    def forward(self, x):
        head_outputs = [head(x)[0] for head in self.heads]
        concat_output = torch.cat(head_outputs, dim=-1)
        linear_output = self.linear(concat_output)
        return linear_output


## MLP (2 points)
Implement a 2 layer MLP


![](https://i.ibb.co/C0DtrF5/ff.png)

In [20]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(64, 256)
        self.linear2 = nn.Linear(256, 64)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x: torch.tensor) -> torch.tensor:
        x = self.linear1
        x = torch.relu(x)
        x = self.linear2
        x = self.dropout(x)
        return x


## Transformer block (20 points)

Layer normalization help training stability by normalizing the outputs of neurons within a single layer across all features for each individual data point, not across a full batch or a specific feature.

Dropout is a form of regularization to prevent overfitting.

This is the diagram of a transformer block:

![](https://i.ibb.co/X85C473/block.png)

In [29]:
class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.multi_head_attention = MultiHeadAttention(n_head, n_embd // n_head)
        self.mlp = MLP()

    def forward(self, x):
        x = x + self.multi_head_attention(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

## GPT

`constructor` (5 points)

1. create the token embedding table and the position embedding table
2. create variable `self.blocks` that is a series of 4 `Block`s. The data will pass through each block sequentially. Consider using `nn.Sequential`
3. create a layer norm layer
4. create a linear layer for predicting the next token

`forward(self, idx, targets=None)`. (5 points)

`forward` takes a batch of context ids as input of size (B, T) and returns the logits and the loss, if targets is not None. If targets is None, return the logits and None.
1. get the token by using the token embedding table created in the constructor
2. create the position embeddings
3. sum the token and position embeddings to get the model input
4. pass the model through the blocks, the layernorm layer, and the final linear layer
5. compute the loss

`generate(start_char, max_new_tokens, top_p, top_k, temperature) -> str` (5 points)
1. implement top p, top_k, and temperature for sampling



![](https://i.ibb.co/n8sbQ0V/Screenshot-2024-01-23-at-8-59-08-PM.png)

In [22]:
class GPT(nn.Module):
    def __init__(self, vocab_size, n_embd, n_head, block_size, n_blocks):
        super().__init__()
        self.max_len = block_size
        # token embedding table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # position embedding table
        self.pos_embedding_table = nn.Parameter(torch.zeros(1, block_size, n_embd))
        embed_dropout = 0.1
        self.dropout = nn.Dropout(embed_dropout)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_blocks)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx, targets=None):
        seq_len = idx.size(1)
        # Get the token embeddings
        token_embeddings = self.token_embedding_table(idx)
        # Create the position embeddings
        position_embeddings = self.pos_embedding_table[:, :seq_len, :]
        # sum the token and position embeddings
        x = token_embeddings + position_embeddings
        # pass througn the layernorm
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)

        # compute the loss if targets are provided
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, start_char, max_new_tokens, top_p, top_k, temperature):
        self.eval()
        with torch.no_grad():
            # start token is converted to an index
            current_idx = torch.tensor([c2id[start_char]], dtype=torch.long).unsqueeze(0).to(self.token_embedding_table.weight.device)
            generated = []
            print(current_idx)
            for _ in range(max_new_tokens):
                # Get the logits for the current sequence
                logits, _ = self(current_idx)
                logits = logits - logits.max()
                logits = logits / temperature

                # Apply top-k filtering if top_k > 0
                if top_k > 0:
                    indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
                    logits[indices_to_remove] = float('-inf')
                # Apply top-p filtering if top_p < 1.0
                if top_p < 1.0:
                    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
                    sorted_indices_to_remove = cumulative_probs > top_p
                    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                    sorted_indices_to_remove[..., 0] = 0
                    indices_to_remove = sorted_indices[sorted_indices_to_remove]
                    logits[indices_to_remove] = float('-inf')

                # compute softmax and check for validity
                probs = F.softmax(logits, dim=-1)
                # proceed with sampling
                idx_next = torch.multinomial(probs, num_samples=1)

                generated.append(idx_next.item())
                current_idx = torch.cat((current_idx, idx_next), dim=1)

            # decode the generated indices to characters
            generated_text = start_char + decode(generated)
            return generated_text




### Training loop (15 points)

implement training loop

In [30]:
max_iters = 5000

n_embd = 64
n_head = 4
learning_rate = 3e-4
vocab_size = len(chars)
block_size = 32
n_blocks = 6
model = GPT(vocab_size, n_embd, n_head, block_size, n_blocks).to('cuda') # make you are running this on the GPU

B, T = 8, 32
idx = torch.randint(0, len(chars), (B, T)).to('cuda')
targets = torch.randint(0, len(chars), (B, T)).to('cuda')

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for iter in range(max_iters):
    model.train()
    optimizer.zero_grad()
    logits, loss = model(idx, targets)
    loss.backward()
    optimizer.step()

    if iter % 100 == 0:
        print(f"Iteration {iter}, Loss: {loss.item()}")

RuntimeError: mat1 and mat2 shapes cannot be multiplied (256x64 and 16x16)

### Generate text


print some text that your model generates