# Assignment 2: Bigram Language Model and Generative Pretrained Transformer (GPT)


The objective of this assignment is to train a simplified transformer model. The primary differences between this implementation and production LLMs are:
* tokenizer (we use a character-level encoder for simplicity and because of compute constraints)
* size (we use a single consumer-grade GPU hosted on Colab and a small dataset; in practice, models are much larger and are trained on much more data)
* efficiency


Most modern LLMs have multiple training stages, so we won't get a model that is capable of replying to you yet. However, this is the first step towards a model like ChatGPT and Llama.




In [None]:
%matplotlib inline
import torch
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from torch import nn
from torch.nn import functional as F
import torch.optim as optim

## Part 1: Bigram MLP for TinyShakespeare (35 points)

1a) (1 point). Create a list `chars` that contains all unique characters in `text`

1b) (2 points). Implement `encode(s: str) -> list[int]`

1c) (2 points). Implement `decode(ids: list[int]) -> str`

1d) (5 points). Create two tensors, `inputs_one_hot` and `outputs_one_hot`. Use one hot encoding. Make sure to get every consecutive pair of characters. For example, for the word 'hello', we should create the following input-output pairs
```
he
el
ll
lo
```

1e) (10 points). Implement `BigramOneHotMLP`, a 2-layer MLP that predicts the next token. Specifically, implement the constructor, `forward`, and `generate`. The output dimension of the first layer should be 8. Use `torch.optim`. The activation function for the first layer should be `nn.LeakyReLU()`.

Note: Use the `torch.nn.functional.cross_entropy` loss. Read the [docs](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html) about how this loss function works. The logits are the output of a network WITHOUT an activation function applied to the last layer. Activation functions are applied to every layer except the last.

1f) (5 points). Train the BigramOneHotMLP for 1000 steps.

1g) (5 points). Create two tensors, `input_ids` and `outputs_one_hot`. These `input_ids` will be used for the embedding layer.

1h) (5 points). Implement and train `BigramEmbeddingMLP`, a 2-layer MLP that predicts the next token. Specifically, implement the constructor, `forward`, and `generate` functions. The output dimension of the first layer should be 8. Use `torch.optim`.



Note: the output will look like gibberish
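
The note in 1e about passing raw logits to `cross_entropy` can be illustrated with a small sketch. The toy sizes and the names `vocab_size`, `target_ids`, and `loss_manual` are hypothetical, and passing one-hot (class probability) targets assumes a PyTorch version that supports them (1.10 or later).

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 5                                  # hypothetical toy vocabulary size
logits = torch.randn(3, vocab_size)             # raw network outputs, no softmax applied
target_ids = torch.tensor([1, 0, 4])            # next-token ids for 3 examples

# cross_entropy applies log-softmax internally, so it expects raw logits.
loss_from_ids = F.cross_entropy(logits, target_ids)

# One-hot targets are treated as class probabilities; they must be floats
# with the same shape as the logits.
target_one_hot = F.one_hot(target_ids, vocab_size).float()
loss_from_one_hot = F.cross_entropy(logits, target_one_hot)

# Manual version: mean negative log-probability of the correct class.
log_probs = F.log_softmax(logits, dim=-1)
loss_manual = -log_probs[torch.arange(3), target_ids].mean()

print(loss_from_ids.item(), loss_from_one_hot.item(), loss_manual.item())
```

All three values should agree, which is why applying a softmax to the last layer before this loss would normalize the outputs twice.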


In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-10-13 03:36:46--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-10-13 03:36:46 (29.2 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
# For the bigram model, let's use the first 1000 characters for the data

with open('input.txt', 'r') as f:
    text = f.read()
text = text[:1000]

In [None]:
chars = sorted(set(text)) # implement

chars2id={c:i for i,c in enumerate(chars)}
id2chars={i:c for c,i in chars2id.items()}

def encode(s: str) -> list[int]:
    # map each character of the input string (not the global `text`) to its id
    return [chars2id[c] for c in s]

def decode(ids: list[int]) -> str:
    # map each id back to its character and join the result into a string
    return ''.join(id2chars[i] for i in ids)

def create_one_hot_inputs_and_outputs() -> tuple[torch.Tensor, torch.Tensor]:
    # every consecutive pair of characters gives one (input, target) example
    ids = torch.tensor(encode(text))
    one_hot = F.one_hot(ids, len(chars))
    # inputs are characters 0..n-2, targets are characters 1..n-1
    return one_hot[:-1], one_hot[1:]

inputs_one_hot, outputs_one_hot = create_one_hot_inputs_and_outputs()


class BigramOneHotMLP(nn.Module):
    def __init__(self):
        # implement
        super().__init__()
        self.fc1 = nn.Linear(len(chars), 8)
        self.activation = nn.LeakyReLU()
        self.fc2 = nn.Linear(8, len(chars))
        pass

    def forward(self, x):
        # implement
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x)
        return x
        pass

    def generate(self, start='a', max_new_tokens=100) -> str:
        # implement
        self.eval()
        with torch.no_grad():
            current_char = start
            word = current_char
            for _ in range(max_new_tokens):
                input_ids = F.one_hot(torch.tensor(chars2id[current_char]), len(chars)).unsqueeze(0).float()
                output = self(input_ids)
                next_char_id = torch.argmax(output).item()
                next_char = id2chars[next_char_id]  # direct id -> character lookup
                current_char = next_char
                word += current_char
            return word
        pass

bigram_one_hot_mlp = BigramOneHotMLP()

generated_word = bigram_one_hot_mlp.generate()
print(f'Generated word: {generated_word}')

learning_rate = 0.01
optimizer = optim.SGD(bigram_one_hot_mlp.parameters(), lr=learning_rate)

# training loop
for epoch in range(1000):
    # implement
    optimizer.zero_grad()
    predictions = bigram_one_hot_mlp(inputs_one_hot.float())
    loss = F.cross_entropy(predictions, outputs_one_hot.float())
    loss.backward()
    optimizer.step()
    #print(f'Epoch {epoch + 1}, Loss: {loss.item()}')
    # if epoch%100==0:
    #   print(f'Generated word:{bigram_one_hot_mlp.generate()}')
    pass



print(bigram_one_hot_mlp.generate())
print(len(bigram_one_hot_mlp.generate()))

Generated word: awYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwYwY
a                                                                                                    
101


In [None]:
def create_embedding_inputs_and_outputs() -> list[torch.tensor, torch.tensor]:
    # implement
    inputs_ids=[]
    outputs_one_hot=[]
    for c1, c2 in zip(text, text[1:]):
        inputs_ids.append(torch.tensor(chars2id[c1]))
        outputs_one_hot.append(F.one_hot(torch.tensor(chars2id[c2]), len(chars)))
    return torch.stack(inputs_ids), torch.stack(outputs_one_hot)
    pass

input_ids, outputs_one_hot = create_embedding_inputs_and_outputs()

class BigramEmbeddingMLP(nn.Module):
    def __init__(self):
        # implement
        super().__init__()
        self.token_embedding=nn.Embedding(len(chars),8)
        self.fc1 = nn.Linear(8, 8)
        self.activation = nn.LeakyReLU()
        self.fc2 = nn.Linear(8, len(chars))
        pass

    def forward(self, x):
        # implement
        x = self.token_embedding(x)
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x)
        return x
        pass

    def generate(self, start='a', max_new_tokens=100) -> str:
        # implement
        self.eval()
        with torch.no_grad():
            current_char = start
            word = current_char
            for _ in range(max_new_tokens):
                input_ids = torch.tensor(chars2id[current_char])
                output = self(input_ids)
                next_char_id = torch.argmax(output).item()
                next_char = id2chars[next_char_id]  # direct id -> character lookup
                current_char = next_char
                word += current_char
            return word
        pass

bigram_embedding_mlp = BigramEmbeddingMLP()
generated_word = bigram_embedding_mlp.generate()
print(f'Generated word: {generated_word}')

learning_rate = 0.01
# optimize the embedding model's parameters (not the one-hot model's)
optimizer = optim.SGD(bigram_embedding_mlp.parameters(), lr=learning_rate)

# training loop
bigram_embedding_mlp.train()
for _ in range(1000):
    optimizer.zero_grad()
    # feed token ids (not one-hot vectors) to the embedding model
    predictions = bigram_embedding_mlp(input_ids)
    loss = F.cross_entropy(predictions, outputs_one_hot.float())
    loss.backward()
    optimizer.step()


print(f'Generated word: {bigram_embedding_mlp.generate()}')

Generated word: agggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg
Generated word: agggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg


## Part 2: Generative Pretrained Transformer (65 points)

For this part, it is best to use a GPU. In the menu at the top, go to Runtime -> Change runtime type and select T4 GPU.

In [None]:
# run nvidia-smi to check gpu usage
!nvidia-smi

Sun Oct 13 03:37:07 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# For the gpt model, let's use the full text

with open('input.txt', 'r') as f:
    text = f.read()

Implement a character level tokenization function.

1. Create a list of unique characters in the string. (1 point)
2. Implement a function `encode(s: str) -> list[int]` that takes a string and returns a list of ids (1 point)
3. Implement a function `decode(ids: list[int]) -> str` that takes a list of ids (ints) and returns a string (1 point)
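
The three steps above can be sketched on a toy string. This is only an illustration: `sample`, `vocab`, `char_to_id`, `id_to_char`, `encode_chars`, and `decode_ids` are hypothetical names chosen to avoid clashing with the notebook's own `encode` and `decode`.

```python
sample = "hello world"                           # hypothetical stand-in for `text`
vocab = sorted(set(sample))                      # unique characters in a stable order
char_to_id = {c: i for i, c in enumerate(vocab)}
id_to_char = {i: c for c, i in char_to_id.items()}

def encode_chars(s: str) -> list[int]:
    # one id per character of the argument
    return [char_to_id[c] for c in s]

def decode_ids(ids: list[int]) -> str:
    # inverse mapping: ids back to characters
    return "".join(id_to_char[i] for i in ids)

ids = encode_chars("hello")
print(vocab)              # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(ids)                # [3, 2, 4, 4, 5]
print(decode_ids(ids))    # hello
```

A round trip such as `decode_ids(encode_chars(s)) == s` is a quick way to check that the two functions are inverses of each other.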


In [None]:
chars = sorted(set(text))
print(len(chars))
chars2id={c:i for i,c in enumerate(chars)}
print(chars2id)
id2chars={i:c for c,i in chars2id.items()}

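def encode(s: str) -> list[int]:
    # encode the argument `s`, not the global `text`
    return [chars2id[c] for c in s]

def decode(ids: list[int]) -> str:
    # join the characters themselves (joining one-element lists raises a TypeError)
    return ''.join(id2chars[i] for i in ids)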

65
{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}


In [None]:
data = torch.tensor(encode(text), dtype=torch.long).cuda()

In [None]:
block_size = 16
data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43],
       device='cuda:0')

To train a transformer, we feed the model `n` tokens (context) and try to predict the `n+1`th token (target) in the sequence.



In [None]:
x = data[:block_size]
y = data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18], device='cuda:0') the target: 47
when input is tensor([18, 47], device='cuda:0') the target: 56
when input is tensor([18, 47, 56], device='cuda:0') the target: 57
when input is tensor([18, 47, 56, 57], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58], device='cuda:0') the target: 1
when input is tensor([18, 47, 56, 57, 58,  1], device='cuda:0') the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47], device='cuda:0') the target: 58
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58], device='cuda:0') the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47], device='cuda:0') the target: 64
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64], device='cuda:0') the target: 43
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43], device='cuda:0') the target: 52
when input is tensor([18, 47,

In [None]:
batch_size = 64
device = 'cuda' if torch.cuda.is_available() else 'cpu'
def get_batch():
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
print(x)
print(y)


tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14],
       device='cuda:0')
tensor([47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43],
       device='cuda:0')


### Single Self Attention Head (5 points)
![](https://i.ibb.co/GWR1XG0/head.png)

In [None]:
class SelfAttentionHead(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        # linear projections for queries, keys, and values: n_embd (64) -> head_size
        self.q_proj = nn.Linear(64, head_size, bias=False)
        self.k_proj = nn.Linear(64, head_size, bias=False)
        self.v_proj = nn.Linear(64, head_size, bias=False)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        B, T, C = x.shape
        k = self.k_proj(x)  # (B, T, head_size)
        q = self.q_proj(x)  # (B, T, head_size)
        # scaled dot product of queries and keys; scale by the key dimension (head_size), not C
        attention = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, hs) @ (B, hs, T) -> (B, T, T)

        # causal mask: position t may only attend to positions <= t
        mask = torch.tril(torch.ones(T, T, device=x.device))
        masked_attention = attention.masked_fill(mask == 0, float('-inf'))
        # softmax over the key axis gives normalized attention weights
        masked_attention = F.softmax(masked_attention, dim=-1)  # (B, T, T)
        # apply dropout to the attention weights
        masked_attention = self.dropout(masked_attention)

        # weighted aggregation of the values
        v = self.v_proj(x)  # (B, T, head_size)
        out = masked_attention @ v  # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out
attn = SelfAttentionHead(16).to(device)
x = torch.randn((8, 32, 64)).float().to(device) #input tensor with shape (batch_size=8, sequence_length=32, input_size=64) and move it to the device
print(attn(x).shape)

torch.Size([8, 32, 16])
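
One way to sanity-check the causal mask is to perturb a later position and confirm that the outputs at earlier positions do not change. The sketch below uses a standalone function, `causal_attention`, which is a hypothetical helper written for this check (no projections, no dropout) rather than the class above.

```python
import torch
from torch.nn import functional as F

def causal_attention(q, k, v):
    """Scaled dot-product attention with a lower-triangular (causal) mask."""
    T, d_k = q.shape[-2], q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) * d_k ** -0.5           # (B, T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~mask, float("-inf"))          # hide future positions
    weights = F.softmax(scores, dim=-1)                        # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
B, T, d = 2, 6, 8                                              # toy sizes
q, k, v = torch.randn(B, T, d), torch.randn(B, T, d), torch.randn(B, T, d)
out, weights = causal_attention(q, k, v)

# Perturb only the key and value at the last position.
k2, v2 = k.clone(), v.clone()
k2[:, -1] += 1.0
v2[:, -1] += 1.0
out2, _ = causal_attention(q, k2, v2)

# Positions before the last one never attend to it, so their outputs are unchanged.
print(torch.allclose(out[:, :-1], out2[:, :-1]))
print(weights[0])   # lower-triangular attention weights for the first batch element
```

If the mask were missing or transposed, the first check would fail because earlier positions would receive information from the perturbed token.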


### Multihead Self Attention (5 points)

`constructor`

- Create 4 `SelfAttentionHead` instances. Consider using `nn.ModuleList`
- Create a linear layer with n_embd input dim and n_embd output dim

`forward`

In the forward implementation, pass `x` through each head, then concatenate all the outputs along the feature dimension, then pass the concatenated output through the linear layer

![](https://i.ibb.co/y5SwyZZ/multihead.png)

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([SelfAttentionHead(head_size) for _ in range(num_heads)])
        self.linear = nn.Linear(num_heads*head_size,64)
        self.dropout = nn.Dropout(p=0.5)
        pass

    def forward(self, x):
        head_outputs = [head(x) for head in self.heads]
        concatenated = torch.cat(head_outputs, dim=-1)

        # Pass through linear layer
        linear_output = self.linear(concatenated)
        linear_output=self.dropout(linear_output)
        return linear_output
        pass
num_heads = 4 # number of heads
head_size = 16


multi_head_attention = MultiHeadAttention(num_heads, head_size).to(device)

# Example
x = torch.randn((8, 32, 64)).float().to(device)
output = multi_head_attention(x)
print(output.shape)


torch.Size([8, 32, 64])


### MLP (2 points)
Implement a 2-layer MLP.


![](https://i.ibb.co/C0DtrF5/ff.png)

In [None]:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        # implement
        self.fc1 = nn.Linear(64, 256)
        self.activation = nn.LeakyReLU()
        self.fc2 = nn.Linear(256, 64)
        self.dropout = nn.Dropout(p=0.5)
        pass

    def forward(self, x: torch.tensor) -> torch.tensor:
        # implement
        x=self.fc1(x)
        x=self.activation(x)
        x=self.fc2(x)
        x=self.dropout(x)
        return x
        pass

x = torch.randn((8, 32, 64)).float()
mlp_model=MLP()
output = mlp_model(x)
print(output.shape)

torch.Size([8, 32, 64])


### Transformer block (20 points)

Layer normalization helps training stability by normalizing the activations of each individual data point (here, each token position) across its features, rather than normalizing each feature across the batch.

Dropout is a form of regularization to prevent overfitting.
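
The two descriptions above can be checked numerically. This is a minimal sketch with toy tensor sizes; `y_manual`, `dropped`, and `kept` are hypothetical names, and the manual formula assumes a freshly initialized `nn.LayerNorm` (weight 1, bias 0).

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(8, 32, 64) * 3.0 + 5.0           # (batch, time, features), deliberately off-center

layer_norm = nn.LayerNorm(64)
y = layer_norm(x)

# Each (batch, time) position is normalized over its own 64 features.
print(y.mean(dim=-1).abs().max().item())            # close to 0
print(y.var(dim=-1, unbiased=False).mean().item())  # close to 1

# Manual equivalent at initialization (weight = 1, bias = 0).
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

# Dropout is only active in training mode; in eval mode it is the identity.
dropout = nn.Dropout(p=0.5)
dropout.train()
dropped = dropout(torch.ones(1000))               # entries are 0 or 1 / (1 - p) = 2
dropout.eval()
kept = dropout(torch.ones(1000))                  # unchanged
print(dropped.unique(), kept.unique())
```

The statistics are computed along the last dimension only, so the result for one token does not depend on the other tokens or on the rest of the batch.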

This is the diagram of a transformer block:

![](https://i.ibb.co/X85C473/block.png)

In [None]:
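class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        # separate layer norms for the attention and MLP sub-layers
        self.norm1 = nn.LayerNorm(n_embd)
        self.norm2 = nn.LayerNorm(n_embd)
        self.heads = MultiHeadAttention(n_head, n_embd // n_head)
        # MLP() is hard-coded to 64 features, so this assumes n_embd == 64
        self.mlp = MLP()

    def forward(self, x):
        # pre-norm residual connections: x + sublayer(norm(x))
        x = x + self.heads(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x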

x = torch.randn((8, 32, 64)).float().to(device)
tblock=Block(64,4).to(device)
output=tblock(x)
print(output.shape)

torch.Size([8, 32, 64])


### GPT

`constructor` (5 points)

1. create the token embedding table and the position embedding table
2. create variable `self.blocks` that is a series of 4 `Block`s. The data will pass through each block sequentially. Consider using `nn.Sequential`
3. create a layer norm layer
4. create a linear layer for predicting the next token

`forward(self, idx, targets=None)`. (5 points)

`forward` takes a batch of context ids as input of size (B, T) and returns the logits and the loss if targets is not None. If targets is None, return the logits and None.
1. get the token embeddings using the token embedding table created in the constructor
2. create the position embeddings
3. sum the token and position embeddings to get the model input
4. pass the model input through the blocks, the layer norm layer, and the final linear layer
5. compute the loss

`generate(start_char, max_new_tokens, top_p, top_k, temperature) -> str` (5 points)
1. implement top-p, top-k, and temperature for sampling
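
As a rough illustration of the three sampling controls, the sketch below filters a toy distribution before sampling from it. It is one possible formulation under stated assumptions, not the required solution: `filter_probs` is a hypothetical helper, it expects a 1-D logits vector, and it renormalizes after each filtering step.

```python
import torch
from torch.nn import functional as F

def filter_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Return a renormalized distribution after temperature, top-k, and top-p filtering."""
    # Temperature < 1 sharpens the distribution, temperature > 1 flattens it.
    probs = F.softmax(logits / temperature, dim=-1)
    if top_k is not None:
        # Keep only the k most probable tokens.
        values, idx = torch.topk(probs, top_k)
        probs = torch.zeros_like(probs)
        probs[idx] = values
        probs = probs / probs.sum()
    if top_p is not None:
        # Keep the smallest set of most probable tokens whose total mass reaches top_p.
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        keep = (cumulative - sorted_probs) < top_p   # mass before this token is still below p
        mask = torch.zeros_like(probs, dtype=torch.bool)
        mask[sorted_idx[keep]] = True
        probs = torch.where(mask, probs, torch.zeros_like(probs))
        probs = probs / probs.sum()
    return probs

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])     # toy logits for a 4-token vocabulary
p_k = filter_probs(logits, top_k=2)              # only tokens 0 and 1 survive
p_p = filter_probs(logits, top_p=0.8)            # tokens 0 and 1 cover more than 80% of the mass
p_cold = filter_probs(logits, temperature=0.1)   # nearly all mass on token 0

next_id = torch.multinomial(p_k, num_samples=1).item()
print(p_k, p_p, p_cold, next_id)
```

Sampling from the filtered distribution with `torch.multinomial` keeps everything on the tensor's device, which avoids a round trip through NumPy for every generated token.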



![](https://i.ibb.co/n8sbQ0V/Screenshot-2024-01-23-at-8-59-08-PM.png)

In [None]:
class GPT(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.token_embedding = nn.Embedding(len(chars),n_embd)
        self.positional_embedding = nn.Embedding(32, n_embd)
        self.dropout = nn.Dropout(p=0.5)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(4)])

        self.layer_norm = nn.LayerNorm(n_embd)
        self.linear = nn.Linear(n_embd, len(chars))


    def forward(self, idx, targets=None):
        B,T=idx.shape
        token_embeddings = self.token_embedding(idx)  # (B, T, n_embd)
        # positions 0..T-1; the (T, n_embd) result broadcasts across the batch
        positional_embedding = self.positional_embedding(torch.arange(T, device=idx.device))
        x = token_embeddings + positional_embedding
        x = self.dropout(x)
        x= self.blocks(x)
        x = self.layer_norm(x)
        logits = self.linear(x)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits,loss


    def generate(self, start_char, max_new_tokens, top_p, top_k, temperature):
        was_training = self.training
        self.eval()
        with torch.no_grad():
            current_seq = torch.tensor([[chars2id[start_char]]], dtype=torch.long, device=device)
            for _ in range(max_new_tokens):
                # crop the context to the last 32 tokens (size of the position embedding table)
                logits, _ = self(current_seq[:, -32:])
                logits = logits[:, -1, :]
                scaled_logits = logits / temperature
                probabilities = F.softmax(scaled_logits, dim=-1)
                if top_k is not None:
                    sampled_index = top_k_sampling(probabilities, k=top_k)
                elif top_p is not None:
                    sampled_index = top_p_sampling(probabilities, p=top_p)
                else:
                    # neither filter given: sample from the full distribution
                    sampled_index = torch.multinomial(probabilities, num_samples=1).item()
                sampled_token = torch.tensor([[sampled_index]], device=device)
                current_seq = torch.cat([current_seq, sampled_token], dim=1)
        # restore the previous mode so dropout stays active when generate() is called mid-training
        self.train(was_training)

        generated_string = ''.join([id2chars[token.item()] for token in current_seq[0]])
        return generated_string

def top_k_sampling(probabilities, k=5):
    """
    Samples one token id from the k most probable tokens,
    after renormalizing their probabilities.
    """
    probabilities = probabilities.cpu().numpy().flatten()
    top_k_indices = np.argsort(probabilities)[-k:]
    top_k_probabilities = probabilities[top_k_indices]
    top_k_probabilities /= top_k_probabilities.sum()
    chosen_index = np.random.choice(top_k_indices, p=top_k_probabilities)

    return chosen_index

def top_p_sampling(probabilities, p=0.9):
    """
    Nucleus (top-p) sampling: samples one token id from the smallest set of
    most probable tokens whose cumulative probability exceeds the threshold p.
    """
    probabilities = probabilities.cpu().numpy().flatten()

    sorted_indices = np.argsort(probabilities)[::-1]  # most probable first
    sorted_probabilities = probabilities[sorted_indices]
    cumulative_probabilities = np.cumsum(sorted_probabilities)
    # first position where the cumulative mass exceeds p; clamp to the last index
    # in case rounding keeps the total at or below p (e.g. p=1.0)
    cutoff_index = min(int(np.searchsorted(cumulative_probabilities, p, side='right')),
                       len(sorted_probabilities) - 1)

    filtered_indices = sorted_indices[:cutoff_index + 1]
    filtered_probabilities = sorted_probabilities[:cutoff_index + 1]

    filtered_probabilities /= filtered_probabilities.sum()
    chosen_index = np.random.choice(filtered_indices, p=filtered_probabilities)

    return chosen_index


gpt_model = GPT(64, 4).to(device)
generated_text = gpt_model.generate(start_char='a', max_new_tokens=100,top_k=None,top_p=0.9, temperature=1.0)
print(generated_text)

aO!pOX.kzgDZwa!XmVhOFRhjBA$YFkSLB,iOldbuMwq:lUVrIvIo.YxFFB&K$QzzzNH,-:KVQLMO;qkstjQ,riCVIFRLlvLpMXKLW


### Training loop (15 points)

Implement the training loop.

In [None]:
# make sure you are running this on the GPU
gpt_model = GPT(64, 4).to(device)
max_iters = 5000
learning_rate = 0.001
optimizer = optim.SGD(gpt_model.parameters(), lr=learning_rate)
for iter in range(max_iters):
  optimizer.zero_grad()
  xb, yb = get_batch()
  logits, loss = gpt_model(xb, yb)
  loss.backward()
  optimizer.step()
  if iter%1000==0:
      print(f'Iteration {iter+1}, Loss: {loss.item()}')
      print(f'Generated text:{gpt_model.generate(start_char="a", max_new_tokens=100,top_k=5,top_p=None, temperature=1.0)}')



Iteration 1, Loss: 4.424371242523193
Generated text:apqQHw;lM-liCiCCvPTOqcr;cz:tT:OuMsLMLLbffuLfLNMLNb.CffLN.CC?MuuC?fs?bbtn-NNM,uBbtLbbtbLbfbLNMbLNMbfsL
Iteration 1001, Loss: 3.860344409942627
Generated text:asmzcne eloool  wneCo
Ide,  n$oDIN...CLMsfbIMbIMMbIMMuLbttLMLNMbLbffubfuLLbttINMbLLN.CCMsMLNb.CCCMLNM
Iteration 2001, Loss: 3.407477855682373
Generated text:ae t e t o too h tone th h and hon ou  tonthettttInthtneMrentortou enbfbffbhaLthaLNththatfuLthtIthant
Iteration 3001, Loss: 3.242250680923462
Generated text:a th orterooreanootoo
 ho h hilserthoue tth thoustorenotathtarsthe athenottthtthtenotnototthttthenter
Iteration 4001, Loss: 3.097386598587036
Generated text:an theot  ta
 orer has orile t  othfursthonensnbarnerthorstttotenenouerttthore hftor thatoththone the
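
A useful reference point for reading these losses is the cross-entropy of guessing uniformly over the 65-character vocabulary. The short sketch below computes that baseline and converts the logged losses (rounded to three decimals) into perplexities; `reported` is a hypothetical dictionary that simply copies the values printed above.

```python
import math

vocab_size = 65                            # number of unique characters printed by the tokenizer cell
uniform_loss = math.log(vocab_size)        # cross-entropy of predicting every character with p = 1/65
print(f"uniform baseline: {uniform_loss:.3f}")   # uniform baseline: 4.174

# Losses logged by the training loop above, rounded to three decimals.
reported = {1: 4.424, 1001: 3.860, 2001: 3.407, 3001: 3.242, 4001: 3.097}
for step, loss in reported.items():
    # perplexity = exp(loss): the effective number of equally likely next characters
    print(f"step {step}: loss {loss:.3f}, perplexity {math.exp(loss):.1f}")
```

The loss at iteration 1 sits slightly above the uniform baseline, which is typical for a randomly initialized model, and the later losses fall below it as the model picks up character statistics.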


### Generate text


Print some text that your model generates.

In [None]:
print(gpt_model.generate(start_char="a", max_new_tokens=100,top_k=5,top_p=None, temperature=1.0))

as or o or a t tthane, a inte hesnernethethensthantane othe andte hesernttonotonenthesfthouretasentab
