# Introduction

---

## Introduction to Building GPT from Scratch

Welcome to this Colab notebook, a collection of notes and insights from working through Andrej Karpathy's YouTube tutorial ["Let's build GPT: from scratch, in code, spelled out"](https://www.youtube.com/watch?v=kCc8FmEb1nY) for the second time.


# Imports

In [None]:
import math
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

<torch._C.Generator at 0x7c2240dc3b10>

# Download dataset

In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-10-04 17:25:22--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-10-04 17:25:23 (5.19 MB/s) - ‘input.txt’ saved [1115394/1115394]



# Dataset processing

In [None]:
with open('input.txt',mode='r', encoding='utf-8') as f:
    text = f.read()

In [None]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [None]:
text[:100]

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'

In [None]:
print(text[:200])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


In [None]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(vocab_size)
print("".join(chars))

65

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [None]:
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])
print(encode("hello dude"))
print(decode(encode("hello dude")))

[46, 43, 50, 50, 53, 1, 42, 59, 42, 43]
hello dude


In [None]:
stoi

{'\n': 0,
 ' ': 1,
 '!': 2,
 '$': 3,
 '&': 4,
 "'": 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '3': 9,
 ':': 10,
 ';': 11,
 '?': 12,
 'A': 13,
 'B': 14,
 'C': 15,
 'D': 16,
 'E': 17,
 'F': 18,
 'G': 19,
 'H': 20,
 'I': 21,
 'J': 22,
 'K': 23,
 'L': 24,
 'M': 25,
 'N': 26,
 'O': 27,
 'P': 28,
 'Q': 29,
 'R': 30,
 'S': 31,
 'T': 32,
 'U': 33,
 'V': 34,
 'W': 35,
 'X': 36,
 'Y': 37,
 'Z': 38,
 'a': 39,
 'b': 40,
 'c': 41,
 'd': 42,
 'e': 43,
 'f': 44,
 'g': 45,
 'h': 46,
 'i': 47,
 'j': 48,
 'k': 49,
 'l': 50,
 'm': 51,
 'n': 52,
 'o': 53,
 'p': 54,
 'q': 55,
 'r': 56,
 's': 57,
 't': 58,
 'u': 59,
 'v': 60,
 'w': 61,
 'x': 62,
 'y': 63,
 'z': 64}

In [None]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [None]:
n = int(len(data)*0.9)
train_data = data[:n]
val_data = data[n:]
print(len(train_data), len(val_data))

1003854 111540


In [None]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [None]:
torch.manual_seed(1337)

batch_size = 4
block_size = 8

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y


xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b,:t+1]
        target = yb[b, t]
        print(f"when input is {context} the target: {target}")


inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is tensor([24]) the target: 43
when input is tensor([24, 43]) the target: 58
when input is tensor([24, 43, 58]) the target: 5
when input is tensor([24, 43, 58,  5]) the target: 57
when input is tensor([24, 43, 58,  5, 57]) the target: 1
when input is tensor([24, 43, 58,  5, 57,  1]) the target: 46
when input is tensor([24, 43, 58,  5, 57,  1, 46]) the target: 43
when input is tensor([24, 43, 58,  5, 57,  1, 46, 43]) the target: 39
when input is tensor([44]) the target: 53
when input is tensor([44, 53]) the target: 56
when input is tensor([44, 53, 56]) the target: 1
when input is tenso

# Model

## Model 1: Embedding table

In [None]:

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx, targets are both (B, T) tensors
        logits = self.token_embedding_table(idx) # (B, T, C)

        if targets is None:
            loss = None
        else:
            # reshape because F.cross_entropy expects 2D logits (N, C) and 1D targets (N,)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor of indices in the current context
        for _ in range(max_new_tokens):
            logits, _ = self(idx) # inference, (B, T, C)
            # keep only the last time step; integer indexing drops that dimension, giving (B, C)
            logits = logits[:, -1, :]
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx




m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)

print(-math.log(1/vocab_size)) # loss of a uniform guess over the 65-character vocabulary
print(loss.item())
print(logits.shape)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=200)[0].tolist()))



4.174387269895637
4.725051403045654
torch.Size([256, 65])

emdhPRufnIAsF
-J':qOGMClYzOc'KTf,Z
cuRoRgYuKMCJhfto,pos.!D&SWFSiHwDba&!3CHzlKthxYq!,.qGD.qID?.tNGzX3'fdr:$-eygqHSYfPZXV
j'fvtOyL$b.-icacNVIwksq
lbf
,kFsvz,&q
tsWvtIGemEVsrOrEtSGoPh$hiHA,zZIXbWxYTnNhu&
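
The first number printed above is the loss a model would get by assigning equal probability to every character, which makes it a useful sanity check for an untrained model. A minimal sketch of that arithmetic in plain Python follows; the helper name `cross_entropy` is ours (not the PyTorch function) and it handles a single example only.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under a probability list."""
    return -math.log(probs[target])

vocab_size = 65
uniform = [1.0 / vocab_size] * vocab_size  # every character equally likely

baseline = cross_entropy(uniform, target=0)  # identical for any target index
print(round(baseline, 4))  # 4.1744
```

The untrained model's loss of about 4.73 sits above this baseline because its randomly initialised logits are not uniform; training then brings the loss down to about 2.5.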


In [None]:
# Print model's state_dict
print("Model's state_dict:")
for param_tensor in m.state_dict():
    print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 65])

Model's Structure: 
BigramLanguageModel(
  (token_embedding_table): Embedding(65, 65)
)

Total Parameters: 4225
Trainable Parameters: 4225


In [None]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
batch_size = 32

for steps in range(10000):

    # mini-batch
    xb, yb = get_batch("train")

    # forward pass
    logits, loss = m(xb, yb)

    # backward pass
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    # update step
    optimizer.step()

print(loss.item())


2.5047218799591064


In [None]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=200)[0].tolist()))


Hananfou t llaceit, br g inde, WAThyoecar, d he curor ance bur bas tot HAD arutrithat ltuanooul prthembu imy vehJTh, ucerir:

Whe uieformy t't iofonentareethabearusllin Fie thyomoordomed fesathe.
D ou


## Model 2: Add GPU, validation losses and refactor

In [None]:
%%time

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
# ------------

torch.manual_seed(1337)

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(vocab_size)
m = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# pass the context created on `device`; a CPU tensor would fail when the model is on the GPU
print(decode(m.generate(idx=context, max_new_tokens=200)[0].tolist()))


step 0: train loss 4.7305, val loss 4.7241
step 300: train loss 2.8110, val loss 2.8249
step 600: train loss 2.5434, val loss 2.5682
step 900: train loss 2.4932, val loss 2.5088
step 1200: train loss 2.4863, val loss 2.5035
step 1500: train loss 2.4665, val loss 2.4921
step 1800: train loss 2.4683, val loss 2.4936
step 2100: train loss 2.4696, val loss 2.4846
step 2400: train loss 2.4638, val loss 2.4879
step 2700: train loss 2.4738, val loss 2.4911

od nos CAy go ghanoray t, co haringoudrou clethe k,LARof fr werar,
Is fa!


Thilemel cia h hmboomyorarifrcitheviPO, tle dst f qur'dig t cof boddo y t o ar pileas h mo wierl t,
S:
STENENEat I athe thou
CPU times: user 6.22 s, sys: 22.8 ms, total: 6.24 s
Wall time: 6.45 s


In [None]:
# Print model's state_dict
print("Model's state_dict:")
for param_tensor in m.state_dict():
    print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 65])

Model's Structure: 
BigramLanguageModel(
  (token_embedding_table): Embedding(65, 65)
)

Total Parameters: 4225
Trainable Parameters: 4225


## Aside: Attention explored

### Stage 1: Averaging

In [None]:
torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.tensor(range(64), dtype=torch.float32)
x = x.view((B, T, C))
print(x.shape)
print(x[:2].shape)
print(x[1].shape)
print(x[1,:3].shape)
print(x[0])
print(x[0,:3])


torch.Size([4, 8, 2])
torch.Size([2, 8, 2])
torch.Size([8, 2])
torch.Size([3, 2])
tensor([[ 0.,  1.],
        [ 2.,  3.],
        [ 4.,  5.],
        [ 6.,  7.],
        [ 8.,  9.],
        [10., 11.],
        [12., 13.],
        [14., 15.]])
tensor([[0., 1.],
        [2., 3.],
        [4., 5.]])


In [None]:
xbow = torch.zeros((B, T, C))

for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t+1, C): all positions up to and including t
        xbow[b,t] = torch.mean(xprev,0) # mean over the time dimension, gives (C,)

In [None]:
x[0]

tensor([[ 0.,  1.],
        [ 2.,  3.],
        [ 4.,  5.],
        [ 6.,  7.],
        [ 8.,  9.],
        [10., 11.],
        [12., 13.],
        [14., 15.]])

In [None]:
xbow[0]

tensor([[0., 1.],
        [1., 2.],
        [2., 3.],
        [3., 4.],
        [4., 5.],
        [5., 6.],
        [6., 7.],
        [7., 8.]])

In [None]:
x

tensor([[[ 0.,  1.],
         [ 2.,  3.],
         [ 4.,  5.],
         [ 6.,  7.],
         [ 8.,  9.],
         [10., 11.],
         [12., 13.],
         [14., 15.]],

        [[16., 17.],
         [18., 19.],
         [20., 21.],
         [22., 23.],
         [24., 25.],
         [26., 27.],
         [28., 29.],
         [30., 31.]],

        [[32., 33.],
         [34., 35.],
         [36., 37.],
         [38., 39.],
         [40., 41.],
         [42., 43.],
         [44., 45.],
         [46., 47.]],

        [[48., 49.],
         [50., 51.],
         [52., 53.],
         [54., 55.],
         [56., 57.],
         [58., 59.],
         [60., 61.],
         [62., 63.]]])

In [None]:
xbow

tensor([[[ 0.,  1.],
         [ 1.,  2.],
         [ 2.,  3.],
         [ 3.,  4.],
         [ 4.,  5.],
         [ 5.,  6.],
         [ 6.,  7.],
         [ 7.,  8.]],

        [[16., 17.],
         [17., 18.],
         [18., 19.],
         [19., 20.],
         [20., 21.],
         [21., 22.],
         [22., 23.],
         [23., 24.]],

        [[32., 33.],
         [33., 34.],
         [34., 35.],
         [35., 36.],
         [36., 37.],
         [37., 38.],
         [38., 39.],
         [39., 40.]],

        [[48., 49.],
         [49., 50.],
         [50., 51.],
         [51., 52.],
         [52., 53.],
         [53., 54.],
         [54., 55.],
         [55., 56.]]])

### Stage 2a: Matrix Multiplication

The central idea is that the entries of the left-hand matrix determine how the rows of the right-hand matrix are aggregated. A matrix of ones sums all rows, a lower-triangular matrix of ones sums each row with the rows before it, and normalizing each row to sum to 1 turns those sums into averages.

In the Transformer's self-attention mechanism, the aggregation weights are not static ones, zeros, or uniform averages. They are attention scores computed from the tokens themselves, so each token aggregates the others according to learned, context-dependent relevance.
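
To make the contrast concrete, here is a minimal sketch in plain Python (lists instead of tensors) that aggregates the same rows once with uniform weights and once with non-uniform weights. The helper names `softmax` and `aggregate` and the `scores` values are illustrative assumptions; in a real model the scores would be computed from the tokens rather than written by hand.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(weights, rows):
    """Weighted sum of rows: one output value per column."""
    n_cols = len(rows[0])
    return [sum(w * row[c] for w, row in zip(weights, rows)) for c in range(n_cols)]

# three token vectors with two channels each
rows = [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]

uniform = [1 / 3, 1 / 3, 1 / 3]   # plain average over the three tokens
scores = [0.0, 2.0, 0.0]          # hypothetical relevance scores
attended = softmax(scores)        # the second token receives most of the weight

avg_out = aggregate(uniform, rows)
att_out = aggregate(attended, rows)
print(avg_out)
print(att_out)
```

With the uniform weights every token contributes equally, while the softmax-derived weights pull the result toward the second row.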

In [None]:
torch.manual_seed(42)
a = torch.ones(3,3)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print("a=")
print(a)
print("---")
print("b=")
print(b)
print("---")
print("c=")
print(c)
print("---")

a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
---
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
---
c=
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])
---


In [None]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3,3))
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print("a=")
print(a)
print("---")
print("b=")
print(b)
print("---")
print("c=")
print(c)
print("---")

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
---
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
---
c=
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])
---


In [None]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3,3))
a = a/torch.sum(a, dim=1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print("a=")
print(a)
print("---")
print("b=")
print(b)
print("---")
print("c=")
print(c)
print("---")

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
---
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
---
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])
---


### Stage 2b: wei

PyTorch's `@` operator (`torch.matmul`) supports batched matrix multiplication with broadcasting. When a `(T, T)` matrix multiplies a `(B, T, C)` tensor, the 2-D matrix is broadcast across the batch dimension, so the same `(T, T) @ (T, C)` product is computed for each of the `B` batch elements.
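
As a rough illustration of how a `(T, T)` matrix can multiply a `(B, T, C)` tensor, the sketch below spells out the batched product with nested Python lists: the same weight matrix multiplies each `(T, C)` slice of the batch independently. The helper names `matmul` and `batched_matmul` are hypothetical and exist only for this example; in the notebook, `wei @ x` does all of this in one call.

```python
def matmul(a, b):
    """Plain (n, k) @ (k, m) matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def batched_matmul(w, x):
    """Apply the same 2-D matrix w to every batch element of x: (T, T) @ (B, T, C)."""
    return [matmul(w, x_b) for x_b in x]

T = 3
# lower-triangular averaging weights: row t averages positions 0..t
wei = [[1.0 / (t + 1) if s <= t else 0.0 for s in range(T)] for t in range(T)]

# batch of B=2 sequences, T=3 time steps, C=2 channels
x = [
    [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]],
    [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]],
]

out = batched_matmul(wei, x)
print(out[0][1])  # [4.0, 5.5]
print(out[1][1])  # [2.0, 1.0]
```

Each output row `t` is the average of rows `0..t` of its own sequence, and the two sequences never mix.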

In [None]:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)

xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0)

wei = torch.tril(torch.ones(T, T))
wei = wei/wei.sum(dim=1, keepdim=True) # normalize each row so it sums to 1
xbow2 = wei @ x # (T, T) @ (B, T, C) --> (B, T, C); wei is broadcast over the batch

print(xbow)

torch.allclose(xbow, xbow2)

tensor([[[ 0.1808, -0.0700],
         [-0.0894, -0.4926],
         [ 0.1490, -0.3199],
         [ 0.3504, -0.2238],
         [ 0.3525,  0.0545],
         [ 0.0688, -0.0396],
         [ 0.0927, -0.0682],
         [-0.0341,  0.1332]],

        [[ 1.3488, -0.1396],
         [ 0.8173,  0.4127],
         [-0.1342,  0.4395],
         [ 0.2711,  0.4774],
         [ 0.2421,  0.0694],
         [ 0.0084,  0.0020],
         [ 0.0712, -0.1128],
         [ 0.2527,  0.2149]],

        [[-0.6631, -0.2513],
         [ 0.1735, -0.0649],
         [ 0.1685,  0.3348],
         [-0.1621,  0.1765],
         [-0.2312, -0.0436],
         [-0.1015, -0.2855],
         [-0.2593, -0.1630],
         [-0.3015, -0.2293]],

        [[ 1.6455, -0.8030],
         [ 1.4985, -0.5395],
         [ 0.4954,  0.3420],
         [ 1.0623, -0.1802],
         [ 1.1401, -0.4462],
         [ 1.0870, -0.4071],
         [ 1.0430, -0.1299],
         [ 1.1138, -0.1641]]])


True

### Stage 3: Softmax

In [None]:
tril = torch.tril(torch.ones((T,T)))
print(tril)
wei = torch.zeros((T,T))
print(wei)
wei = wei.masked_fill(tril==0, float("-inf"))
print(wei)
wei = F.softmax(wei, dim=-1)
print(wei)
xbow3 = wei @ x
print(xbow3)
torch.allclose(xbow, xbow3)


tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0.,

True
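
The softmax version matches the earlier averages because `exp(-inf)` is 0, so masked positions get zero weight, while the remaining zeros all map to `exp(0) = 1` and normalise to a uniform average over the visible positions. A small plain-Python sketch of one row is shown below; the `softmax` helper is our own and is only assumed to mirror what `F.softmax` computes per row.

```python
import math

def softmax(row):
    """Softmax of one row; -inf entries end up with weight exactly 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

neg_inf = float("-inf")
# third row of the masked 8x8 matrix: positions 0..2 visible, the rest masked
row = [0.0, 0.0, 0.0] + [neg_inf] * 5

weights = softmax(row)
print(weights[:3])
print(weights[3:])  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

Replacing the zeros with data-dependent scores, as Stage 4 does, keeps the mask but lets the visible positions receive unequal weights.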

### Stage 4: Self-Attention

In [None]:
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

tril = torch.tril(torch.ones((T,T)))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril==0, float("-inf"))
wei = F.softmax(wei, dim=-1)
out = wei @ x

print(wei.shape)
print(x.shape)
print(out.shape)
print(wei)




torch.Size([8, 8])
torch.Size([4, 8, 32])
torch.Size([4, 8, 32])
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


In [None]:
# query and key interaction

torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)

k = key(x) # (B,T,head_size)
q = query(x)
wei = q @ k.transpose(-2, -1) # (B,T,head_size) @ (B,head_size,T) --> (B,T,T)
print(wei.shape)
print(wei[0])

tril = torch.tril(torch.ones((T,T)))
# wei = torch.zeros((T,T))
wei = wei.masked_fill(tril==0, float("-inf"))
print(wei[0])
wei = F.softmax(wei, dim=-1)
print(wei[0])
out = wei @ x

print(wei.shape)
print(x.shape)
print(out.shape)
print(wei)

torch.Size([4, 8, 8])
tensor([[-1.7629, -1.3011,  0.5652,  2.1616, -1.0674,  1.9632,  1.0765, -0.4530],
        [-3.3334, -1.6556,  0.1040,  3.3782, -2.1825,  1.0415, -0.0557,  0.2927],
        [-1.0226, -1.2606,  0.0762, -0.3813, -0.9843, -1.4303,  0.0749, -0.9547],
        [ 0.7836, -0.8014, -0.3368, -0.8496, -0.5602, -1.1701, -1.2927, -1.0260],
        [-1.2566,  0.0187, -0.7880, -1.3204,  2.0363,  0.8638,  0.3719,  0.9258],
        [-0.3126,  2.4152, -0.1106, -0.9931,  3.3449, -2.5229,  1.4187,  1.2196],
        [ 1.0876,  1.9652, -0.2621, -0.3158,  0.6091,  1.2616, -0.5484,  0.8048],
        [-1.8044, -0.4126, -0.8306,  0.5899, -0.7987, -0.5856,  0.6433,  0.6303]],
       grad_fn=<SelectBackward0>)
tensor([[-1.7629,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-3.3334, -1.6556,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-1.0226, -1.2606,  0.0762,    -inf,    -inf,    -inf,    -inf,    -inf],
        [ 0.7836, -0.8014, -0.3368, -0.84

In [None]:
# add value

torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x) # (B,T,head_size)
q = query(x)
wei = q @ k.transpose(-2, -1) # (B,T,head_size) @ (B,head_size,T) --> (B,T,T)

tril = torch.tril(torch.ones((T,T)))
wei = wei.masked_fill(tril==0, float("-inf"))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v

print(wei.shape)
print(x.shape)
print(out.shape)


torch.Size([4, 8, 8])
torch.Size([4, 8, 32])
torch.Size([4, 8, 16])


In [None]:
tril = torch.tril(torch.ones((T,T)))
print(tril[:T, :T])
print(tril[:2, :2])

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])
tensor([[1., 0.],
        [1., 1.]])


## Model 3: Add Token Embeddings and Position Embeddings

In [None]:
%%time

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C); pos_emb is broadcast over the batch
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(vocab_size)
m = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# pass the context created on `device`; a CPU tensor would fail when the model is on the GPU
print(decode(m.generate(idx=context, max_new_tokens=200)[0].tolist()))


step 0: train loss 4.4801, val loss 4.4801
step 300: train loss 2.5404, val loss 2.5566
step 600: train loss 2.5160, val loss 2.5335
step 900: train loss 2.4967, val loss 2.5149
step 1200: train loss 2.5106, val loss 2.5254
step 1500: train loss 2.4853, val loss 2.5109
step 1800: train loss 2.4966, val loss 2.5198
step 2100: train loss 2.4949, val loss 2.5100
step 2400: train loss 2.4937, val loss 2.5102
step 2700: train loss 2.5040, val loss 2.5114

 ald, arhis'sho risisthanthatarend un'soto vat s kn, use he ute f whongeindd t acoe ts ansur thy ppr h.


Y:
KIIsqu pcinded chor whave o se bll owhored miner t ooon'stoume wh tomo! fifoveghind hiarnge
CPU times: user 7.01 s, sys: 25.6 ms, total: 7.04 s
Wall time: 7.08 s


In [None]:
# Print model's state_dict
print("Model's state_dict:")
for param_tensor in m.state_dict():
    print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 32])
position_embedding_table.weight 	 torch.Size([8, 32])
lm_head.weight 	 torch.Size([65, 32])
lm_head.bias 	 torch.Size([65])

Model's Structure: 
BigramLanguageModel(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)

Total Parameters: 4481
Trainable Parameters: 4481


## Model 4: Attention Head

In [None]:
%%time

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)  # Transform to Query: (B, T, C) -> (B, T, head_size)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones((block_size,block_size))))

    def forward(self, x):  # x is of shape (B, T, C)
        B, T, C = x.shape

        q = self.query(x)  # Query: (B, T, head_size)
        k = self.key(x)    # Key: (B, T, head_size)

        # Compute attention weights: (B, T, T)
        wei = q @ k.transpose(-2, -1) * C**-0.5
        wei = wei.masked_fill(self.tril[:T, :T]==0, float("-inf"))
        wei = F.softmax(wei, dim=-1)

        v = self.value(x)  # Value: (B, T, head_size)
        out = wei @ v  # Output: (B, T, head_size) -> weighted sum of value vectors

        return out  # Output tensor of shape (B, T, head_size)




# bigram model extended with a single self-attention head
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_head = Head(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.sa_head(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(vocab_size)
m = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

print(decode(m.generate(context, max_new_tokens=200)[0].tolist()))  # reuse `context` so the input is on the same device as the model


step 0: train loss 4.1990, val loss 4.2031
step 300: train loss 2.5062, val loss 2.5121
step 600: train loss 2.4679, val loss 2.4931
step 900: train loss 2.4549, val loss 2.4599
step 1200: train loss 2.4556, val loss 2.4681
step 1500: train loss 2.4336, val loss 2.4524
step 1800: train loss 2.4198, val loss 2.4474
step 2100: train loss 2.4168, val loss 2.4283
step 2400: train loss 2.3974, val loss 2.4266
step 2700: train loss 2.4021, val loss 2.4301

Ad winth odl Cod bardi ma thingro rKesell, thimledl gu berom y whashr wiantorou hd lis sous w lous inn
Farils peen ishn shime mip, cize fo thit wy; fo othur out bllow'm tato rtle hflo aty bus baturs s
CPU times: user 10.3 s, sys: 20.3 ms, total: 10.4 s
Wall time: 10.4 s


In [None]:
# Print model's state_dict
print("Model's state_dict:")
for param_tensor in m.state_dict():
    print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 32])
position_embedding_table.weight 	 torch.Size([8, 32])
sa_head.tril 	 torch.Size([8, 8])
sa_head.query.weight 	 torch.Size([32, 32])
sa_head.key.weight 	 torch.Size([32, 32])
sa_head.value.weight 	 torch.Size([32, 32])
lm_head.weight 	 torch.Size([65, 32])
lm_head.bias 	 torch.Size([65])

Model's Structure: 
BigramLanguageModel(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (sa_head): Head(
    (query): Linear(in_features=32, out_features=32, bias=False)
    (key): Linear(in_features=32, out_features=32, bias=False)
    (value): Linear(in_features=32, out_features=32, bias=False)
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)

Total Parameters: 7553
Trainable Parameters: 7553


## Model 5: Multi-head Attention

In [None]:
%%time

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)  # Transform to Query: (B, T, C) -> (B, T, head_size)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones((block_size,block_size))))

    def forward(self, x):  # x is of shape (B, T, C)
        B, T, C = x.shape

        q = self.query(x)  # Query: (B, T, head_size)
        k = self.key(x)    # Key: (B, T, head_size)

        # Compute attention weights: (B, T, T)
        wei = q @ k.transpose(-2, -1) * C**-0.5  # note: C is n_embd (32) here; standard scaled dot-product attention scales by head_size**-0.5
        wei = wei.masked_fill(self.tril[:T, :T]==0, float("-inf"))
        wei = F.softmax(wei, dim=-1)

        v = self.value(x)  # Value: (B, T, head_size)
        out = wei @ v  # Output: (B, T, head_size) -> weighted sum of value vectors

        return out  # Output tensor of shape (B, T, head_size)

class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList(Head(head_size) for _ in range(num_heads))

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return out




# bigram model extended with multi-head self-attention
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.sa_heads(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(vocab_size)
m = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

print(decode(m.generate(context, max_new_tokens=200)[0].tolist()))  # reuse `context` so the input is on the same device as the model


step 0: train loss 4.2229, val loss 4.2220
step 300: train loss 2.4194, val loss 2.4281
step 600: train loss 2.3258, val loss 2.3593
step 900: train loss 2.2979, val loss 2.3138
step 1200: train loss 2.2777, val loss 2.3133
step 1500: train loss 2.2428, val loss 2.2975
step 1800: train loss 2.2339, val loss 2.3005
step 2100: train loss 2.2230, val loss 2.2775
step 2400: train loss 2.2117, val loss 2.2859
step 2700: train loss 2.2099, val loss 2.2828

A has thave dus! Tarwingr ass, gocrtysell, to hath ris I! Thas whash wifall foulld lives
Whe ways.


But when pere ishot 't ther Cacizen: cre to you of thur thing swa'm to of tle hal waryte hablefrs s
CPU times: user 18.2 s, sys: 49.6 ms, total: 18.2 s
Wall time: 19 s


In [None]:
# Print model's state_dict
print("Model's state_dict:")
for param_tensor in m.state_dict():
    print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')

Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 32])
position_embedding_table.weight 	 torch.Size([8, 32])
sa_heads.heads.0.tril 	 torch.Size([8, 8])
sa_heads.heads.0.query.weight 	 torch.Size([8, 32])
sa_heads.heads.0.key.weight 	 torch.Size([8, 32])
sa_heads.heads.0.value.weight 	 torch.Size([8, 32])
sa_heads.heads.1.tril 	 torch.Size([8, 8])
sa_heads.heads.1.query.weight 	 torch.Size([8, 32])
sa_heads.heads.1.key.weight 	 torch.Size([8, 32])
sa_heads.heads.1.value.weight 	 torch.Size([8, 32])
sa_heads.heads.2.tril 	 torch.Size([8, 8])
sa_heads.heads.2.query.weight 	 torch.Size([8, 32])
sa_heads.heads.2.key.weight 	 torch.Size([8, 32])
sa_heads.heads.2.value.weight 	 torch.Size([8, 32])
sa_heads.heads.3.tril 	 torch.Size([8, 8])
sa_heads.heads.3.query.weight 	 torch.Size([8, 32])
sa_heads.heads.3.key.weight 	 torch.Size([8, 32])
sa_heads.heads.3.value.weight 	 torch.Size([8, 32])
lm_head.weight 	 torch.Size([65, 32])
lm_head.bias 	 torch.Size([65])

Model's Structu

## Model 6: Feed forward

In [None]:
%%time

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)  # Transform to Query: (B, T, C) -> (B, T, head_size)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones((block_size,block_size))))

    def forward(self, x):  # x is of shape (B, T, C)
        B, T, C = x.shape

        q = self.query(x)  # Query: (B, T, head_size)
        k = self.key(x)    # Key: (B, T, head_size)

        # Compute attention weights: (B, T, T)
        wei = q @ k.transpose(-2, -1) * C**-0.5  # note: C is n_embd (32) here; standard scaled dot-product attention scales by head_size**-0.5
        wei = wei.masked_fill(self.tril[:T, :T]==0, float("-inf"))
        wei = F.softmax(wei, dim=-1)

        v = self.value(x)  # Value: (B, T, head_size)
        out = wei @ v  # Output: (B, T, head_size) -> weighted sum of value vectors

        return out  # Output tensor of shape (B, T, head_size)

class FeedForward(nn.Module):

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList(Head(head_size) for _ in range(num_heads))

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return out




# bigram model extended with multi-head self-attention and a feed-forward layer
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_heads = MultiHeadAttention(4, n_embd//4)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.sa_heads(x)
        x = self.ffwd(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(vocab_size)
m = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

print("\nSample from the model:")

print(decode(m.generate(context, max_new_tokens=200)[0].tolist()))  # reuse `context` so the input is on the same device as the model

print("-----")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in m.state_dict():
    print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


step 0: train loss 4.1839, val loss 4.1819
step 300: train loss 2.3986, val loss 2.4250
step 600: train loss 2.3145, val loss 2.3607
step 900: train loss 2.2494, val loss 2.2812
step 1200: train loss 2.2289, val loss 2.2805
step 1500: train loss 2.2077, val loss 2.2451
step 1800: train loss 2.2009, val loss 2.2584
step 2100: train loss 2.1882, val loss 2.2341
step 2400: train loss 2.1832, val loss 2.2417
step 2700: train loss 2.1591, val loss 2.2284

Casto oll won se hin
Will you sucleed
S J
LUKE OF Ser,
thims aly ever ene; deed blis thall grar to the sraws hore, a nothly,-dend't thend Cit the the thot pe;
Ame to the sweree soncien to: you fa and 
Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 32])
position_embedding_table.weight 	 torch.Size([8, 32])
sa_heads.heads.0.tril 	 torch.Size([8, 8])
sa_heads.heads.0.query.weight 	 torch.Size([8, 32])
sa_heads.heads.0.key.weight 	 torch.Size([8, 32])
sa_heads.heads.0.value.weight 	 torch.Size([8, 32])
sa_heads.heads.1.tril

## Model 7: Block

In [None]:
%%time

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)  # Transform to Query: (B, T, C) -> (B, T, head_size)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones((block_size,block_size))))

    def forward(self, x):  # x is of shape (B, T, C)
        B, T, C = x.shape

        q = self.query(x)  # Query: (B, T, head_size)
        k = self.key(x)    # Key: (B, T, head_size)

        # Compute attention weights: (B, T, T)
        wei = q @ k.transpose(-2, -1) * C**-0.5  # note: C is n_embd (32) here; standard scaled dot-product attention scales by head_size**-0.5
        wei = wei.masked_fill(self.tril[:T, :T]==0, float("-inf"))
        wei = F.softmax(wei, dim=-1)

        v = self.value(x)  # Value: (B, T, head_size)
        out = wei @ v  # Output: (B, T, head_size) -> weighted sum of value vectors

        return out  # Output tensor of shape (B, T, head_size)

class FeedForward(nn.Module):

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList(Head(head_size) for _ in range(num_heads))

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return out

class Block(nn.Module):

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x




# bigram model extended with two stacked blocks (multi-head self-attention + feed-forward)
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(vocab_size)
m = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

print("\nSample from the model:")

print(decode(m.generate(context, max_new_tokens=200)[0].tolist()))  # reuse `context` so the input is on the same device as the model

print("-----")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in m.state_dict():
    print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


step 0: train loss 4.1635, val loss 4.1642
step 300: train loss 2.4873, val loss 2.4880
step 600: train loss 2.3878, val loss 2.4132
step 900: train loss 2.3357, val loss 2.3376
step 1200: train loss 2.2948, val loss 2.3032
step 1500: train loss 2.2547, val loss 2.2894
step 1800: train loss 2.2380, val loss 2.2550
step 2100: train loss 2.2237, val loss 2.2598
step 2400: train loss 2.2028, val loss 2.2489
step 2700: train loss 2.1903, val loss 2.2382

Sample from the model:


BERENETING:
Asas seeereds gearty lixe there pelop.
Th fearr is.
I know sash Beast is; hee pleious therbes.s you thee our feinnssour you couthe, To deark lincise ford wam baksy peene regatere,
So 'cin
-----

Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 32])
position_embedding_table.weight 	 torch.Size([8, 32])
blocks.0.sa.heads.0.tril 	 torch.Size([8, 8])
blocks.0.sa.heads.0.query.weight 	 torch.Size([8, 32])
blocks.0.sa.heads.0.key.weight 	 torch.Size([8, 32])
blocks.0.sa.heads.0.value.weight 

## Model 8: Residual Connections

In Model 7, the network was larger than Model 6 and had more parameters, yet its loss was higher. Deeper networks are harder to optimize: residual connections give gradients a direct path back to earlier layers, and layer normalization (added in Model 9) keeps activations in a stable range.
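
As a minimal sketch of why the residual path helps, the snippet below compares a plain layer with the same layer wrapped as `x + layer(x)`. The zero-initialised `nn.Linear` and the tensor shape are assumptions chosen only to make the effect easy to see; they are not part of the notebook's model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a sub-layer; zero weights exaggerate the effect.
layer = nn.Linear(4, 4)
nn.init.zeros_(layer.weight)
nn.init.zeros_(layer.bias)

x = torch.randn(2, 4, requires_grad=True)

# Residual form: the block starts out as the identity function.
residual = x + layer(x)
residual.sum().backward()
grad_residual = x.grad.clone()

# Plain form: no gradient reaches x through the zeroed layer.
x.grad = None
plain = layer(x)
plain.sum().backward()
grad_plain = x.grad.clone()

print(grad_residual)  # gradient arrives through the skip connection
print(grad_plain)     # gradient is blocked by the zeroed layer
```

With the skip connection the input still receives a gradient even when the sub-layer contributes nothing, which is the property that makes stacking several blocks easier to train.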

---

The extra Linear layer in the `MultiHeadAttention` and `FeedForward` modules gives the model more capacity to learn a complex function. The model can train without these layers, but they may help it capture more complex relationships in the data, especially when the dataset is large.

### FeedForward
In the `FeedForward` module, the extra linear layer gives two layers of transformation with a nonlinearity between them, which lets the model learn more complex relationships in the data. The inner layer is commonly wider than the embedding; the original Transformer uses a factor of 4. For example:
```python
self.net = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),  # Increase dimensionality
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),  # Project back to original dimensionality
)
```
This allows the model to have more capacity in the feed-forward part.
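
To make the added capacity concrete, here is a small sketch that counts the parameters of the widened feed-forward network and compares it with a single `n_embd -> n_embd` layer. It assumes `n_embd = 32`, the value used in this notebook; the variable names are illustrative.

```python
import torch.nn as nn

n_embd = 32  # embedding size used in this notebook

# Feed-forward network with the 4x inner layer
wide = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)

# A single linear layer of the same input and output size, for comparison
single = nn.Linear(n_embd, n_embd)

wide_params = sum(p.numel() for p in wide.parameters())
single_params = sum(p.numel() for p in single.parameters())

print(wide_params)    # 8352
print(single_params)  # 1056
```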

### MultiHeadAttention
In `MultiHeadAttention`, the additional Linear layer (the projection layer) mixes the concatenated head outputs and maps them back to the embedding dimensionality. This is standard practice in the original Transformer. Concatenating the outputs of the attention heads gives a tensor whose last dimension is `num_heads * head_size`; the projection maps it to `n_embd`. In this notebook `head_size = n_embd // n_head`, so the two sizes already match and the layer's main role is to mix information across heads.
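
The shapes involved can be checked with a short sketch. Random tensors stand in for the real head outputs, and the batch size `B = 2` is an arbitrary assumption; the other sizes follow this notebook's hyperparameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T = 2, 8                # batch size (assumed) and block size
n_embd, num_heads = 32, 4
head_size = n_embd // num_heads

# Random stand-ins for the output of each attention head: (B, T, head_size)
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

# Concatenate along the channel dimension: (B, T, num_heads * head_size)
concat = torch.cat(head_outputs, dim=-1)

# Project back to the embedding dimensionality: (B, T, n_embd)
proj = nn.Linear(num_heads * head_size, n_embd)
out = proj(concat)

print(concat.shape, out.shape)
```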

### Conclusion
While adding these layers increases the model's capacity to learn complex functions, it also increases the number of parameters, and therefore the risk of overfitting, especially when the amount of training data is limited. If you find that the model is overfitting, you might want to remove some of these layers or use regularization techniques like dropout. Similarly, if the model is underfitting, adding more layers or increasing the size of the existing layers could be beneficial.

In practice, the architecture of the model is usually determined based on empirical results obtained through experimenting with different configurations and observing their performance on the validation dataset. If the simpler model without additional Linear layers is performing well in terms of generalization to the validation dataset, it might be the preferable choice.

In [None]:
%%time

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)  # Transform to Query: (B, T, C) -> (B, T, head_size)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones((block_size,block_size))))

    def forward(self, x):  # x is of shape (B, T, C)
        B, T, C = x.shape

        q = self.query(x)  # Query: (B, T, head_size)
        k = self.key(x)    # Key: (B, T, head_size)

        # Compute attention weights: (B, T, T)
        # Note: C is n_embd here; scaled dot-product attention is normally scaled by head_size**-0.5
        wei = q @ k.transpose(-2, -1) * C**-0.5
        wei = wei.masked_fill(self.tril[:T, :T]==0, float("-inf"))
        wei = F.softmax(wei, dim=-1)

        v = self.value(x)  # Value: (B, T, head_size)
        out = wei @ v  # Output: (B, T, head_size) -> weighted sum of value vectors

        return out  # Output tensor of shape (B, T, head_size)

class FeedForward(nn.Module):

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd * 4),
            nn.ReLU(),
            nn.Linear(n_embd * 4, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList(Head(head_size) for _ in range(num_heads))
        self.proj = nn.Linear(n_embd, n_embd)
        # The additional Linear (projection) layer mixes the concatenated
        # head outputs and maps them back to the original embedding
        # dimensionality.

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class Block(nn.Module):

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        return x




# Transformer language model (the class keeps its original bigram name)
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(vocab_size)
m = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

print("\nSample from the model:")

# reuse `context` so the prompt lives on the same device as the model
print(decode(m.generate(idx=context, max_new_tokens=200)[0].tolist()))

print("-----")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in m.state_dict():
    print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


step 0: train loss 4.6208, val loss 4.6182
step 300: train loss 2.4846, val loss 2.4859
step 600: train loss 2.3614, val loss 2.3625
step 900: train loss 2.2820, val loss 2.2898
step 1200: train loss 2.2169, val loss 2.2440
step 1500: train loss 2.1824, val loss 2.2152
step 1800: train loss 2.1588, val loss 2.1988
step 2100: train loss 2.1231, val loss 2.1659
step 2400: train loss 2.1084, val loss 2.1568
step 2700: train loss 2.1072, val loss 2.1533

Sample from the model:

Praqows couss, my sitior the vercent ast him, the havince to wise our seame? net upeponjresconsts folsuled ever I
What of Is wertmen:
Olam; thum I lovoings:
Mane cout you the but siden and up thing,
W
-----

Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 32])
position_embedding_table.weight 	 torch.Size([8, 32])
blocks.0.sa.heads.0.tril 	 torch.Size([8, 8])
blocks.0.sa.heads.0.query.weight 	 torch.Size([8, 32])
blocks.0.sa.heads.0.key.weight 	 torch.Size([8, 32])
blocks.0.sa.heads.0.value.weight 

## Model 9: Layer Normalization
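
Model 9 applies `nn.LayerNorm` to the input of each sub-layer inside `Block` and once more after the last block. As a minimal sketch of what that layer computes, the snippet below normalizes each token's feature vector and checks the result against the formula written out by hand. The input tensor and its scale are illustrative assumptions, not values from the training run.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_embd = 32
# Illustrative activations with a non-zero mean and a large spread: (B, T, C)
x = torch.randn(2, 8, n_embd) * 5 + 3

ln = nn.LayerNorm(n_embd)
y = ln(x)

# Same computation by hand: normalize over the last (feature) dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + ln.eps)

out_mean = y.mean(dim=-1)                # close to 0 for every token
out_var = y.var(dim=-1, unbiased=False)  # close to 1 for every token
print(out_mean.abs().max(), out_var.mean())
```

Each token is normalized independently of the others in the batch, so the layer behaves the same way during training and generation.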

In [None]:
%%time

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)  # Transform to Query: (B, T, C) -> (B, T, head_size)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones((block_size,block_size))))

    def forward(self, x):  # x is of shape (B, T, C)
        B, T, C = x.shape

        q = self.query(x)  # Query: (B, T, head_size)
        k = self.key(x)    # Key: (B, T, head_size)

        # Compute attention weights: (B, T, T)
        wei = q @ k.transpose(-2, -1) * C**-0.5
        wei = wei.masked_fill(self.tril[:T, :T]==0, float("-inf"))
        wei = F.softmax(wei, dim=-1)

        v = self.value(x)  # Value: (B, T, head_size)
        out = wei @ v  # Output: (B, T, head_size) -> weighted sum of value vectors

        return out  # Output tensor of shape (B, T, head_size)

class FeedForward(nn.Module):

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd * 4),
            nn.ReLU(),
            nn.Linear(n_embd * 4, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList(Head(head_size) for _ in range(num_heads))
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class Block(nn.Module):

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x




# Transformer language model (the class keeps its original bigram name)
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            nn.LayerNorm(n_embd),
        )
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(vocab_size)
m = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

print("\nSample from the model:")

# reuse `context` so the prompt lives on the same device as the model
print(decode(m.generate(idx=context, max_new_tokens=200)[0].tolist()))

print("-----")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'\nTrainable Parameters: {trainable_params}')

# Print model's state_dict
# print("\nModel's state_dict:")
# for param_tensor in m.state_dict():
#     print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)




step 0: train loss 4.3090, val loss 4.3083
step 300: train loss 2.5221, val loss 2.5322
step 600: train loss 2.3602, val loss 2.3642
step 900: train loss 2.2725, val loss 2.2829
step 1200: train loss 2.1968, val loss 2.2269
step 1500: train loss 2.1618, val loss 2.1961
step 1800: train loss 2.1333, val loss 2.1737
step 2100: train loss 2.0997, val loss 2.1378
step 2400: train loss 2.0890, val loss 2.1313
step 2700: train loss 2.0820, val loss 2.1246

Sample from the model:

PORINGO: grust my sithou kne, that warst, to sown hy to whan wise our seam fate, upos for of,
Whear bell,
Sere his dearge-ss?

Whmongly de; that is he inestiove;

And you the but sided and up the mim 
-----

Total Parameters: 42369

Trainable Parameters: 42369

Model's Structure: 
BigramLanguageModel(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head

## Model 10: Dropout (and code cleanup)
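
Model 10 adds `nn.Dropout` to the attention weights, the attention projection, and the feed-forward output. The sketch below shows the two behaviours of that layer, assuming the same rate `p = 0.1`; the all-ones input is an illustrative assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.1)
x = torch.ones(1000)

# Training mode: elements are zeroed at random, survivors are scaled by 1 / (1 - p)
drop.train()
y_train = drop(x)

# Evaluation mode: dropout is a no-op
drop.eval()
y_eval = drop(x)

print(y_train[:10])
print(y_eval[:10])
```

This is why `estimate_loss()` calls `model.eval()` before measuring the loss and `model.train()` afterwards: the reported losses are computed with dropout switched off.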

In [None]:
%%time

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
n_head = 4
n_layer = 4
dropout = 0.1
# ------------

torch.manual_seed(1337)

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)  # Transform to Query: (B, T, C) -> (B, T, head_size)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones((block_size,block_size))))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x is of shape (B, T, C)
        B, T, C = x.shape

        q = self.query(x)  # Query: (B, T, head_size)
        k = self.key(x)    # Key: (B, T, head_size)

        # Compute attention weights: (B, T, T)
        wei = q @ k.transpose(-2, -1) * C**-0.5
        wei = wei.masked_fill(self.tril[:T, :T]==0, float("-inf"))
        wei = F.softmax(wei, dim=-1)

        wei = self.dropout(wei)

        v = self.value(x)  # Value: (B, T, head_size)
        out = wei @ v  # Output: (B, T, head_size) -> weighted sum of value vectors

        return out  # Output tensor of shape (B, T, head_size)

class FeedForward(nn.Module):

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd * 4),
            nn.ReLU(),
            nn.Linear(n_embd * 4, n_embd),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList(Head(head_size) for _ in range(num_heads))
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        out = self.dropout(out)
        return out

class Block(nn.Module):

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x




# Transformer language model (the class keeps its original bigram name)
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)  # apply the final LayerNorm defined in __init__
        logits = self.lm_head(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(vocab_size)
m = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

print("\nSample from the model:")

# reuse `context` so the prompt lives on the same device as the model
print(decode(m.generate(idx=context, max_new_tokens=200)[0].tolist()))

print("-----")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'\nTrainable Parameters: {trainable_params}')

# Print model's state_dict
# print("\nModel's state_dict:")
# for param_tensor in m.state_dict():
#     print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)




step 0: train loss 4.4066, val loss 4.3990
step 300: train loss 2.5158, val loss 2.5268
step 600: train loss 2.4060, val loss 2.4122
step 900: train loss 2.3254, val loss 2.3335
step 1200: train loss 2.2503, val loss 2.2766
step 1500: train loss 2.2171, val loss 2.2317
step 1800: train loss 2.2017, val loss 2.2102
step 2100: train loss 2.1681, val loss 2.1876
step 2400: train loss 2.1387, val loss 2.1708
step 2700: train loss 2.1250, val loss 2.1755

Sample from the model:

me orcail, at prand thine her, apnd wite's ath reas coud!

QOr not:
For.

GETUNG:
Was not it dond ang wom tros; hin BUTI:
Coche mowe ir adorlee; firlow: forks her, a wilTIO:

Fors and greand berm? I s
-----

Total Parameters: 54977

Trainable Parameters: 54977

Model's Structure: 
BigramLanguageModel(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head
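
The reported total of 54977 parameters can be reproduced by hand from the layer sizes. The sketch below is plain Python arithmetic; the helper names `linear_params` and `count_params` are hypothetical and exist only for this illustration.

```python
def linear_params(n_in, n_out, bias=True):
    """Number of parameters in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def count_params(vocab_size, block_size, n_embd, n_head, n_layer):
    head_size = n_embd // n_head
    # query, key and value projections per head, all without bias
    attention = n_head * 3 * linear_params(n_embd, head_size, bias=False)
    projection = linear_params(n_embd, n_embd)
    feed_forward = (linear_params(n_embd, 4 * n_embd)
                    + linear_params(4 * n_embd, n_embd))
    layer_norms = 2 * (2 * n_embd)  # weight and bias for ln1 and ln2
    block = attention + projection + feed_forward + layer_norms

    embeddings = vocab_size * n_embd + block_size * n_embd
    final_norm = 2 * n_embd
    lm_head = linear_params(n_embd, vocab_size)
    return embeddings + n_layer * block + final_norm + lm_head


total = count_params(vocab_size=65, block_size=8, n_embd=32, n_head=4, n_layer=4)
print(total)  # 54977
```

The `tril` masks are registered as buffers, not parameters, so they do not appear in the count. Calling the same function with `n_layer=3` gives 42369, the total reported for Model 9.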

## Saving Model to HuggingFace Hub

In [None]:
import os

# Define the path where you want to save your model
save_directory = os.getcwd()  # This gets the current working directory
model_path = os.path.join(save_directory, "model_weights.pth")

# Save the model
torch.save(model.state_dict(), model_path)
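
A saved `state_dict` holds only tensors, so loading it requires building a model of the same architecture first. The round trip can be sketched as follows; a small `nn.Linear` stands in for the trained model and a temporary directory replaces the real save path, both assumptions made to keep the example self-contained.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)

source = nn.Linear(4, 2)  # stand-in for the trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_weights.pth")
    torch.save(source.state_dict(), path)

    # Re-create the same architecture, then load the saved weights into it
    restored = nn.Linear(4, 2)
    state = torch.load(path, map_location="cpu")
    restored.load_state_dict(state)

restored.eval()  # switch to evaluation mode before generating
print(torch.equal(source.weight, restored.weight))
```

Passing `map_location="cpu"` lets weights saved on a GPU be loaded on a machine without one.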


In [None]:
import json

# Save the stoi and itos mappings
with open(f"{save_directory}/stoi.json", 'w') as f:
    json.dump(stoi, f)

with open(f"{save_directory}/itos.json", 'w') as f:
    json.dump(itos, f)
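
One detail to keep in mind when reading these files back: JSON object keys are always strings, so the integer keys of `itos` come back as `"0"`, `"1"`, and so on. A minimal sketch of the round trip, using a tiny two-entry mapping as an illustrative assumption:

```python
import json

itos = {0: "a", 1: "b"}  # tiny stand-in for the real mapping

# Serialize and parse again, as happens when the file is written and re-read
raw = json.loads(json.dumps(itos))
print(list(raw.keys()))  # ['0', '1']

# Convert the keys back to integers before using the mapping in decode()
itos_restored = {int(k): v for k, v in raw.items()}
print(itos_restored == itos)  # True
```

`stoi` is unaffected, because its keys are already strings and its values are integers.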


In [None]:
!pip install huggingface_hub

Collecting huggingface_hub
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Installing collected packages: huggingface_hub
Successfully installed huggingface_hub-0.17.3


In [None]:
from huggingface_hub import notebook_login

notebook_login()



In [None]:
from huggingface_hub import upload_file


# Define the local file path, repository ID, and path in the repository
local_file_path = "./model_weights.pth"  # replace with your local file path
repo_id = "RubyDiamond/mini-gpt"  # replace with your repository ID
path_in_repo = "model_weights.pth"  # replace with your desired path in the repository

# Upload the file
url = upload_file(
    path_or_fileobj=local_file_path,
    path_in_repo=path_in_repo,
    repo_id=repo_id
)

print(f"File uploaded to {url}")



File uploaded to https://huggingface.co/RubyDiamond/mini-gpt/blob/main/model_weights.pth


In [None]:
from huggingface_hub import upload_file


# Define the local file path, repository ID, and path in the repository
local_file_path = "./stoi.json"  # replace with your local file path
repo_id = "RubyDiamond/mini-gpt"  # replace with your repository ID
path_in_repo = "stoi.json"  # replace with your desired path in the repository

# Upload the file
url = upload_file(
    path_or_fileobj=local_file_path,
    path_in_repo=path_in_repo,
    repo_id=repo_id
)

print(f"File uploaded to {url}")

File uploaded to https://huggingface.co/RubyDiamond/mini-gpt/blob/main/stoi.json


## Saving Model to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
import os

# Define the save directory
save_directory = "/content/drive/MyDrive/projects/mini-gpt"

# Create the directory if it does not already exist
os.makedirs(save_directory, exist_ok=True)

# Now, save the model
torch.save(m.state_dict(), f"{save_directory}/model_weights.pth")


In [None]:
import json

# Define the directory in Google Drive to save the files
save_directory = '/content/drive/MyDrive/projects/mini-gpt'  # Replace with your directory

# Save Model Weights
torch.save(model.state_dict(), f"{save_directory}/model_weights.pth")

# Save itos and stoi as json files
with open(f"{save_directory}/itos.json", 'w') as f:
    json.dump(itos, f)

with open(f"{save_directory}/stoi.json", 'w') as f:
    json.dump(stoi, f)

# Save the hyperparameters the model above was actually trained with
hyperparameters = {
    'block_size': block_size,
    'n_embd': n_embd,
    'n_head': n_head,
    'n_layer': n_layer,
    'dropout': dropout,
    'device': device
}

with open(f"{save_directory}/hyperparameters.json", 'w') as f:
    json.dump(hyperparameters, f)



## Model 11: Refactor (Parameterize)

In [None]:
%%time

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
config = {
    'block_size': 8,
    'n_embd': 32,
    'n_head': 4,
    'n_layer': 4,
    'dropout': 0.1,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu'
}

block_size = config["block_size"]
batch_size = 32
max_iters = 3000
eval_interval = 300
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200

# ------------

torch.manual_seed(1337)

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class Head(nn.Module):

    def __init__(self, n_embd, head_size, block_size, dropout):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)  # Transform to Query: (B, T, C) -> (B, T, head_size)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones((block_size,block_size))))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x is of shape (B, T, C)
        B, T, C = x.shape

        q = self.query(x)  # Query: (B, T, head_size)
        k = self.key(x)    # Key: (B, T, head_size)

        # Compute attention weights: (B, T, T)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # scale by head_size (the key dimension), not the embedding size C
        wei = wei.masked_fill(self.tril[:T, :T]==0, float("-inf"))
        wei = F.softmax(wei, dim=-1)

        wei = self.dropout(wei)

        v = self.value(x)  # Value: (B, T, head_size)
        out = wei @ v  # Output: (B, T, head_size) -> weighted sum of value vectors

        return out  # Output tensor of shape (B, T, head_size)

class FeedForward(nn.Module):

    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd * 4),
            nn.ReLU(),
            nn.Linear(n_embd * 4, n_embd),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class MultiHeadAttention(nn.Module):

    def __init__(self, n_embd, num_heads, head_size, block_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList(Head(n_embd, head_size, block_size, dropout) for _ in range(num_heads))
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        out = self.dropout(out)
        return out

class Block(nn.Module):

    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_embd, n_head, head_size, block_size, dropout)
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x




# decoder-only Transformer language model; no longer a bigram model despite the class name
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size, config):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, config["n_embd"])
        self.position_embedding_table = nn.Embedding(config["block_size"], config["n_embd"])
        self.blocks = nn.Sequential(*[Block(config["n_embd"], config["n_head"],config["block_size"],config["dropout"]) for _ in range(config["n_layer"])])
        self.ln_f = nn.LayerNorm(config["n_embd"])
        self.lm_head = nn.Linear(config["n_embd"], vocab_size)
        self.device = config["device"]

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C), built on the same device as idx
        x = tok_emb + pos_emb
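        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # final layer norm; it is defined in __init__ but was never applied
        logits = self.lm_head(x) # (B,T,vocab_size)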

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens, block_size):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(vocab_size, config)
m = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

print("\nSample from the model:")

# reuse `context`, which was created on `device` above, so the input lives on the same device as the model
print(decode(m.generate(idx=context, max_new_tokens=200, block_size=config["block_size"])[0].tolist()))

print("-----")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'\nTrainable Parameters: {trainable_params}')

# Print model's state_dict
# print("\nModel's state_dict:")
# for param_tensor in m.state_dict():
#     print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)




step 0: train loss 4.4066, val loss 4.3990
step 300: train loss 2.5158, val loss 2.5268
step 600: train loss 2.4060, val loss 2.4122
step 900: train loss 2.3254, val loss 2.3335
step 1200: train loss 2.2503, val loss 2.2766
step 1500: train loss 2.2171, val loss 2.2317
step 1800: train loss 2.2017, val loss 2.2102
step 2100: train loss 2.1681, val loss 2.1876
step 2400: train loss 2.1387, val loss 2.1708
step 2700: train loss 2.1250, val loss 2.1755

Sample from the model:

me orcail, at prand thine her, apnd wite's ath reas coud!

QOr not:
For.

GETUNG:
Was not it dond ang wom tros; hin BUTI:
Coche mowe ir adorlee; firlow: forks her, a wilTIO:

Fors and greand berm? I s
-----

Total Parameters: 54977

Trainable Parameters: 54977

Model's Structure: 
BigramLanguageModel(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head

## Model 12: Scale up (and train on GPU)

In [None]:
%%time

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
config = {
    'block_size': 256,
    'n_embd': 384,
    'n_head': 6,
    'n_layer': 6,
    'dropout': 0.2,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu'
}

block_size = config["block_size"]
batch_size = 64
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200

# ------------

torch.manual_seed(1337)

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class Head(nn.Module):

    def __init__(self, n_embd, head_size, block_size, dropout):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)  # Transform to Query: (B, T, C) -> (B, T, head_size)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones((block_size,block_size))))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x is of shape (B, T, C)
        B, T, C = x.shape

        q = self.query(x)  # Query: (B, T, head_size)
        k = self.key(x)    # Key: (B, T, head_size)

        # Compute attention weights: (B, T, T)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # scale by head_size (the key dimension), not the embedding size C
        wei = wei.masked_fill(self.tril[:T, :T]==0, float("-inf"))
        wei = F.softmax(wei, dim=-1)

        wei = self.dropout(wei)

        v = self.value(x)  # Value: (B, T, head_size)
        out = wei @ v  # Output: (B, T, head_size) -> weighted sum of value vectors

        return out  # Output tensor of shape (B, T, head_size)

class FeedForward(nn.Module):

    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd * 4),
            nn.ReLU(),
            nn.Linear(n_embd * 4, n_embd),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class MultiHeadAttention(nn.Module):

    def __init__(self, n_embd, num_heads, head_size, block_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList(Head(n_embd, head_size, block_size, dropout) for _ in range(num_heads))
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        out = self.dropout(out)
        return out

class Block(nn.Module):

    def __init__(self, n_embd, n_head, block_size, dropout):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_embd, n_head, head_size, block_size, dropout)
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x




# decoder-only Transformer language model; no longer a bigram model despite the class name
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size, config):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, config["n_embd"])
        self.position_embedding_table = nn.Embedding(config["block_size"], config["n_embd"])
        self.blocks = nn.Sequential(*[Block(config["n_embd"], config["n_head"],config["block_size"],config["dropout"]) for _ in range(config["n_layer"])])
        self.ln_f = nn.LayerNorm(config["n_embd"])
        self.lm_head = nn.Linear(config["n_embd"], vocab_size)
        self.device = config["device"]

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C), built on the same device as idx
        x = tok_emb + pos_emb
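        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # final layer norm; it is defined in __init__ but was never applied
        logits = self.lm_head(x) # (B,T,vocab_size)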

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens, block_size):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModel(vocab_size, config)
m = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

print("\nSample from the model:")

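# NOTE: this idx tensor is created on the CPU. With the model on the GPU, generate() fails with the
# RuntimeError shown in the output below; the later generation cell passes device=device to fix it.
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=200, block_size=config["block_size"])[0].tolist()))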

print("-----")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'\nTrainable Parameters: {trainable_params}')

# Print model's state_dict
# print("\nModel's state_dict:")
# for param_tensor in m.state_dict():
#     print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)






step 0: train loss 4.4753, val loss 4.4709
step 500: train loss 2.0800, val loss 2.1442
step 1000: train loss 1.6669, val loss 1.8302
step 1500: train loss 1.4938, val loss 1.6808
step 2000: train loss 1.3909, val loss 1.6088
step 2500: train loss 1.3228, val loss 1.5603
step 3000: train loss 1.2662, val loss 1.5272
step 3500: train loss 1.2215, val loss 1.5059
step 4000: train loss 1.1823, val loss 1.4890
step 4500: train loss 1.1460, val loss 1.4831

Sample from the model:


RuntimeError: ignored

In [None]:
print("-----")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'\nTrainable Parameters: {trainable_params}')

# Print model's state_dict
# print("\nModel's state_dict:")
# for param_tensor in m.state_dict():
#     print(param_tensor, "\t", m.state_dict()[param_tensor].size())

# Print the model structure
print("\nModel's Structure: ")
print(m)

-----

Total Parameters: 10788929

Trainable Parameters: 10788929

Model's Structure: 
BigramLanguageModel(
  (token_embedding_table): Embedding(65, 384)
  (position_embedding_table): Embedding(256, 384)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-5): 6 x Head(
            (query): Linear(in_features=384, out_features=64, bias=False)
            (key): Linear(in_features=384, out_features=64, bias=False)
            (value): Linear(in_features=384, out_features=64, bias=False)
            (dropout): Dropout(p=0.2, inplace=False)
          )
        )
        (proj): Linear(in_features=384, out_features=384, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
      )
      (ffwd): FeedForward(
        (net): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): ReLU()
          (2): Linear(in_features=1536, out_features=384, bias=True)
          (3): Dropout(p=0.2, inp
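
The total reported above can be cross-checked by hand. The helper below is a sketch written for this note (the function name `count_params` is not part of the notebook); it adds up the weights and biases of each layer defined in the model code, using a vocabulary size of 65 as shown in the printed `Embedding(65, 384)`.

```python
def count_params(vocab_size, block_size, n_embd, n_head, n_layer):
    """Count trainable parameters of the model defined above, layer by layer."""
    head_size = n_embd // n_head
    # Each Head has bias-free query, key and value projections: n_embd -> head_size
    heads = n_head * 3 * n_embd * head_size
    # Output projection of MultiHeadAttention (weights + bias)
    proj = n_embd * n_embd + n_embd
    # FeedForward: n_embd -> 4*n_embd -> n_embd, both layers with bias
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two LayerNorms per block, each with a weight and a bias vector
    norms = 2 * (2 * n_embd)
    block = heads + proj + ffwd + norms

    embeddings = vocab_size * n_embd + block_size * n_embd  # token + position tables
    ln_f = 2 * n_embd                                       # final LayerNorm
    lm_head = n_embd * vocab_size + vocab_size              # output layer with bias
    return embeddings + n_layer * block + ln_f + lm_head

small = count_params(vocab_size=65, block_size=8, n_embd=32, n_head=4, n_layer=4)
large = count_params(vocab_size=65, block_size=256, n_embd=384, n_head=6, n_layer=6)
print(small)  # 54977
print(large)  # 10788929
```

Both values match the printed totals. The `tril` mask does not appear in the sum because it is registered as a buffer, not as a parameter.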

In [None]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)

print("\nSample from the model:")

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long, device=device), max_new_tokens=1000, block_size=config["block_size"])[0].tolist()))


print("-----")



Sample from the model:

i' tolded my hand
This lack of truth's blind your own green enforce:
We'll add, in gross choose to feelous windown down
Shephing such a crowned eating, if you our murder,
From the only gives, and you disposed, were'd England!
But which with dispatcher, likes a rich pave,
Hould far breather, to drunk and show you do forges.

GLOUCESTER:
Clarendent tickly now met did came but with the
eading of kinstrady.
BIAK:
I am lady as our throne condital.

CATESBY:
Courageous wits much, my displant?

EDWARD:
I be a progicular.

SIXINA:
O God!
O might!
What mighty seem comests a bosom, then about marr's love.

LADY ANNE:
What, by the dead!

LADY ANNE:
I met that anced but thou art an ears?

HASTINGS:
Think it me in stay sight so.

QUEEN MARGARET:
Her, Ely and Aumerly and men!
O hope for my sweet means witness bout without
Art we are prompt for that tongue remove.
Our suin the latter singly of Warwick's reposeth foots libert
The flock and Sicinatio it master'd at mark.
3 KING

# Additional Notes

## `torch.nn.Embedding`

`torch.nn.Embedding` is used to create a lookup table where each row represents the embedding of a certain index, often corresponding to a word or a token in NLP. It is initialized with a `num_embeddings` parameter specifying the number of embeddings (or the size of the vocabulary) and an `embedding_dim` parameter specifying the size of each embedding vector.

### Tensor Shapes:

- If you input a tensor of shape `(B, T)`, where `B` is the batch size and `T` is the sequence length, you will get back a tensor of shape `(B, T, C)`, where `C` is the embedding dimension (`embedding_dim`).
- Here, `B` represents the number of sequences in a batch, `T` represents the number of indices in each sequence, and `C` represents the number of elements in the embedding vector for each index.

### Example:

Let's consider a simple example, detailing the shapes of the tensors:

```python
import torch
import torch.nn as nn

# Define Vocabulary Size and Embedding Dimension
vocab_size = 5  # e.g., {'<pad>': 0, 'the': 1, 'cat': 2, 'sat': 3, 'on': 4}
embedding_dim = 3  # Each word is represented by a 3D vector

# Create an Embedding Layer
embedding_layer = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Define Input Tensor of shape (B, T), where B is batch size and T is sequence length
input_tensor = torch.tensor([[1, 2, 3], [4, 1, 0]])  # 2 sequences, each of length 3
# e.g., input_tensor:
# tensor([[1, 2, 3],  # <-- Sequence 1: 'the cat sat'
#         [4, 1, 0]])  # <-- Sequence 2: 'on the <pad>'

# Get the Embeddings
embeddings = embedding_layer(input_tensor)
# The shape of the embeddings tensor will be (B, T, C) = (2, 3, 3)

# Print the input tensor and its shape
print("Input Tensor (shape: {})".format(input_tensor.shape))
print(input_tensor)
# Input Tensor (shape: torch.Size([2, 3]))
# tensor([[1, 2, 3],
#         [4, 1, 0]])

# Print the embeddings and their shape (the values are illustrative: the table is randomly initialized)
print("Embeddings Tensor (shape: {})".format(embeddings.shape))
print(embeddings)
# Embeddings Tensor (shape: torch.Size([2, 3, 3]))
# tensor([[[ 0.3367,  0.1288, -1.4232],  # <-- Embeddings for Sequence 1: 'the cat sat'
#          [-0.2694,  0.4839, -1.0219],
#          [ 1.3312,  0.3535,  0.8394]],
#
#         [[ 1.0736, -0.7456,  1.1174],  # <-- Embeddings for Sequence 2: 'on the <pad>'
#          [ 0.3367,  0.1288, -1.4232],  # same row as 'the' in Sequence 1
#          [-0.5273, -0.1164,  0.1453]]], grad_fn=<EmbeddingBackward0>)
```

### Summary:

In this example, the input tensor of shape `(2, 3)` representing 2 sequences each of length 3 is passed through an embedding layer, and the output is a tensor of shape `(2, 3, 3)`, where each index in the input sequences is replaced by its corresponding 3-dimensional embedding.
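
To make the "lookup table" description concrete, the following sketch (same toy sizes as the example above, with a fixed seed chosen only for reproducibility) checks that calling the layer is the same as indexing rows of its weight matrix, and that a repeated index yields the same vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_layer = nn.Embedding(num_embeddings=5, embedding_dim=3)
input_tensor = torch.tensor([[1, 2, 3], [4, 1, 0]])  # (B, T) = (2, 3)

embeddings = embedding_layer(input_tensor)           # (B, T, C) = (2, 3, 3)

# The layer is a row lookup into its weight matrix of shape (5, 3)
same_as_indexing = torch.equal(embeddings, embedding_layer.weight[input_tensor])

# Index 1 appears at positions [0, 0] and [1, 1], so those two vectors are identical
same_row = torch.equal(embeddings[0, 0], embeddings[1, 1])

print(embeddings.shape, same_as_indexing, same_row)
```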

## `torch.nn.functional.cross_entropy`

`torch.nn.functional.cross_entropy` is a loss function that combines `log_softmax` and `nll_loss` in a single call. It takes raw, unnormalized logits (not probabilities) together with the target class indices, and it measures how well a classification model's predicted distribution over classes matches the targets.
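
As a quick sanity check of that decomposition, the sketch below computes the loss three ways on random logits (the shapes and seed are arbitrary choices for illustration): with `F.cross_entropy`, with `F.log_softmax` followed by `F.nll_loss`, and by hand.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 4)                  # (N, C): 5 samples, 4 classes, raw scores
targets = torch.tensor([0, 3, 1, 1, 2])     # (N,): one class index per sample

# One call on raw logits
loss_ce = F.cross_entropy(logits, targets)

# The same computation in two explicit steps
log_probs = F.log_softmax(logits, dim=-1)   # normalize to log-probabilities
loss_nll = F.nll_loss(log_probs, targets)   # mean negative log-likelihood of the targets

# Fully by hand: pick the log-probability of each target class and average
loss_manual = -log_probs[torch.arange(5), targets].mean()

print(torch.allclose(loss_ce, loss_nll), torch.allclose(loss_ce, loss_manual))  # True True
```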

### Difference between `torch.nn.functional.cross_entropy` and `torch.nn.CrossEntropyLoss`:

1. **Functional API (`torch.nn.functional.cross_entropy`):**
   - This is a plain, stateless function.
   - Options such as `weight`, `ignore_index`, and `reduction` are passed as arguments on every call.
   - This is typically used when you compute the loss inline, as in the `forward` method of the models above.

2. **Module API (`torch.nn.CrossEntropyLoss`):**
   - This is an object-oriented approach, where you first create a loss object and then call it to calculate the loss.
   - It stores options such as `weight`, `ignore_index`, and `reduction`, which are set once when the object is created. (`size_average` still exists but is deprecated in favor of `reduction`.)
   - Useful when you want to configure the loss once and reuse it throughout training. Internally it calls the functional version.
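
The two APIs compute the same value; they differ only in where the options live. A minimal sketch, assuming arbitrary random logits and using `ignore_index` and `reduction` as the example options:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 4)                      # (N, C)
targets = torch.tensor([0, 1, 2, 3, -100, 1])   # -100 marks a position to skip

# Module API: options are stored on the object at construction time
criterion = nn.CrossEntropyLoss(ignore_index=-100, reduction='sum')
loss_module = criterion(logits, targets)

# Functional API: the same options are passed at call time
loss_functional = F.cross_entropy(logits, targets, ignore_index=-100, reduction='sum')

print(torch.allclose(loss_module, loss_functional))  # True
```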

### Tensor Shapes:

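- **Input (logits):** The class dimension must come second: `(N, C)` for a flat batch, or `(B, C, T)` for sequences, where `B` is the batch size, `C` is the number of classes, and `T` is the sequence length. A model output of shape `(B, T, C)` therefore has to be flattened to `(B*T, C)` or transposed to `(B, C, T)` first.
- **Target:** The shape is `(N,)` or `(B, T)` respectively, where each element is a class index in the range `[0, C-1]`.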

### Use Case and Example:

When you have sequences of logits and corresponding sequences of targets, you can use `torch.nn.functional.cross_entropy` to compute the loss. Below is an illustrative example:

```python
import torch
import torch.nn.functional as F

# Define a batch of logits (randomly initialized), shape: (B, T, C) -> (2, 3, 4)
logits = torch.randn(2, 3, 4)
# e.g., logits:
# tensor([[[ 0.3367,  0.1288, -1.4232, -0.2694],
#          [ 0.4839, -1.0219,  1.3312,  0.3535],
#          [ 0.8394,  1.0736, -0.7456,  1.1174]],
#
#         [[ 0.1453, -0.5273, -0.1164,  0.1453],
#          [ 0.6883,  0.7456,  0.8535, -0.6341],
#          [-0.8234, -0.6341,  1.2345, -0.1234]]])

# Define a batch of corresponding target class indices, shape: (B, T) -> (2, 3)
targets = torch.tensor([[0, 2, 1], [3, 1, 0]], dtype=torch.long)
# e.g., targets:
# tensor([[0, 2, 1],
#         [3, 1, 0]])

# Compute the cross entropy loss
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

# Print the Loss
print("Cross Entropy Loss:", loss.item())
# Cross Entropy Loss: 2.3043 (Note: This value is illustrative and will vary due to the random initialization of logits)
```

Here, we flatten the logits to 2D and the targets to 1D before passing them to `F.cross_entropy`, because it expects the class dimension second: logits of shape `(N, C)`, where `N = B*T` is the number of samples and `C` is the number of classes, and targets of shape `(N,)`. The result is a single scalar, the mean loss across all `N` positions in the batch.
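
Flattening is not the only option. As a sketch (random data, arbitrary sizes), the same loss can be obtained by moving the class dimension to position 1 and leaving the targets as `(B, T)`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 2, 3, 4
logits = torch.randn(B, T, C)
targets = torch.randint(0, C, (B, T))

# Option 1: flatten batch and time into one dimension -> (B*T, C) and (B*T,)
loss_flat = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Option 2: move the class dimension to position 1 -> (B, C, T); targets stay (B, T)
loss_bct = F.cross_entropy(logits.transpose(1, 2), targets)

print(torch.allclose(loss_flat, loss_bct))  # True
```

Passing the `(B, T, C)` tensor directly would be wrong: PyTorch would treat `T` as the class dimension.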

## `torch.nn.CrossEntropyLoss`

You would typically use `torch.nn.CrossEntropyLoss` when you set up the loss once, alongside the model and optimizer, and plan to reuse it many times during training. The advantage is that you can set options like `weight`, `ignore_index`, or `reduction` at construction time and don't need to pass them every time you compute the loss.

### Example:

Let's create a simple scenario where we have a classification model and we use `torch.nn.CrossEntropyLoss` as the loss function during the training of this model.

#### 1. Define a simple model:
```python
import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 4)  # A simple linear layer with 10 input features and 4 output features (4 classes)

    def forward(self, x):
        return self.fc(x)

# Initialize the model
model = SimpleModel()
```

#### 2. Initialize CrossEntropyLoss and Optimizer:
```python
criterion = nn.CrossEntropyLoss()  # Initialize the CrossEntropyLoss
optimizer = optim.SGD(model.parameters(), lr=0.01)  # Initialize the optimizer
```

#### 3. Forward Pass, Loss Computation, and Backward Pass:
```python
# Example input tensor (Batch size: 2, Features: 10)
inputs = torch.randn(2, 10)
# Corresponding labels: one class index (0-3) for each of the 2 input samples
labels = torch.tensor([1, 2], dtype=torch.long)

# Forward pass
outputs = model(inputs)

# Compute Loss
loss = criterion(outputs, labels)

# Zero gradients, backward pass, optimizer step
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Print the Loss
print("Loss:", loss.item())
# Loss: 1.5462  (Note: This value is illustrative and will vary due to the random initialization of model parameters and inputs)
```

### Summary:
In this example, `torch.nn.CrossEntropyLoss` is used as it allows defining the loss function at the time of model initialization, and then it can be easily used to compute the loss multiple times during the training loop. The computed loss is then used to perform the backward pass and update the model parameters using the optimizer.

## The `forward` method, for both training and inference

It is common practice when designing neural networks to use the same method, typically `forward`, for both training and inference, with the execution path depending on how it is called (for example, on whether targets are supplied).

### **Training Path:**
During training, the `forward` method is used to compute the network's predictions and the loss between the predictions and the ground truth. The gradients are then backpropagated through the network to update the model's parameters. This often involves additional tensors and computations, like targets and loss, which are not needed during inference.

### **Inference Path:**
During inference, the primary goal is to compute the network's predictions based on the input, and there is no need to compute the loss or backpropagate gradients. Therefore, aspects related to computing the loss and other training-specific computations are usually bypassed, and only the raw output (like logits or activations) is computed and returned.

### **Example:**
In the `forward` method of `BigramLanguageModel` above, when `targets` is `None` the model is being used for inference (for example inside `generate`): it skips the loss and returns the raw 3D logits of shape `(B, T, C)`. When `targets` are provided, as in the training loop and in `estimate_loss`, it flattens the logits to `(B*T, C)` and the targets to `(B*T)` and computes the cross-entropy loss. Note that this branch depends only on whether `targets` is passed; it is independent of the `.train()`/`.eval()` mode described below.
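
The pattern can be reduced to a few lines. The class below, `TinyLM`, is a hypothetical stand-in written for this note (it is not one of the notebook's models): a single embedding table that maps each token straight to logits, with the same optional `targets` argument. Unlike the notebook's model, this sketch returns the logits unflattened on both paths.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Hypothetical minimal model: embedding lookup straight to logits."""

    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)                  # (B, T, C)
        if targets is None:
            return logits, None                   # inference path: logits only
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss                       # training path: scalar loss as well

torch.manual_seed(0)
model = TinyLM(vocab_size=5)
idx = torch.randint(0, 5, (2, 3))
targets = torch.randint(0, 5, (2, 3))

logits_inf, loss_inf = model(idx)            # no targets -> no loss
logits_trn, loss_trn = model(idx, targets)   # targets -> loss is computed
print(logits_inf.shape, loss_inf)            # torch.Size([2, 3, 5]) None
print(loss_trn.dim())                        # 0 (a scalar)
```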

### **Benefits:**
- **Consistency:** It maintains consistency in the computation graph, whether it is training or inference.
- **Code Reusability:** It enables the reuse of the same computation procedures for both training and inference, reducing redundancy.
- **Ease of Maintenance:** It makes it easier to maintain and understand the code as training and inference share the same flow, diverging only where necessary.

### **Switching Between Modes:**
Neural network frameworks, like PyTorch, typically provide mechanisms to switch between training and evaluation modes, such as the `.train()` and `.eval()` methods in PyTorch, which set the mode of the model and its submodules. These mechanisms help in managing aspects like dropout and batch normalization, which have different behaviors during training and inference.
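
Dropout is the easiest layer on which to see the mode switch. A small sketch, assuming a standalone `nn.Dropout` with `p=0.5` applied to a tensor of ones:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

layer.train()                  # training mode: elements are zeroed at random,
y_train = layer(x)             # and the survivors are scaled by 1 / (1 - p) = 2.0

layer.eval()                   # evaluation mode: dropout is the identity
y_eval = layer(x)

print(y_train)                 # a mix of 0.0 and 2.0 entries
print(torch.equal(y_eval, x))  # True
```

This is why `estimate_loss` in the training cells calls `model.eval()` before measuring the loss and `model.train()` afterwards.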

## Slicing and indexing in PyTorch

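Given a tensor `b` of shape `(2, 3, 4)` (created in the cell at the end of this section), `c = b[:, 2:, :]` and `d = b[:, -1, :]` contain the same values but have different shapes. The reason is the way slicing and integer indexing work in PyTorch.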

### For `c`:
```python
c = b[:, 2:, :]
```
Here, you are slicing the middle dimension from index `2` to the end. The slicing `2:` results in a sub-tensor of shape `(1, 4)` for each element in the batch dimension. Thus, `c` retains the middle dimension, resulting in a shape of `(2, 1, 4)`.

### For `d`:
```python
d = b[:, -1, :]
```
Here, you are selecting the last element `(-1)` of the middle dimension, which results in reducing the middle dimension. Hence, `d` does not retain the middle dimension and results in a shape of `(2, 4)`.

### Summary:
- When you use slicing `:`, it keeps the dimension even if it's of size `1`.
- When you use indexing with a specific value, it reduces that dimension.

Here’s a bit more visualization to help:

```
b:
[
 [[ 0,  1,  2,  3],    --> [0]
  [ 4,  5,  6,  7],    --> [1]
  [ 8,  9, 10, 11]],   --> [2]
                        
 [[12, 13, 14, 15],    --> [0]
  [16, 17, 18, 19],    --> [1]
  [20, 21, 22, 23]]    --> [2]
]

c = b[:, 2:, :]
c:
[
 [[ 8,  9, 10, 11]],   --> [2]
 [[20, 21, 22, 23]]    --> [2]
]
c.shape: (2, 1, 4)

d = b[:, -1, :]
d:
[
 [ 8,  9, 10, 11],     --> [2]
 [20, 21, 22, 23]      --> [2]
]
d.shape: (2, 4)
```

In [None]:
a = torch.arange(24)
b = a.view(2, 3, 4)
print(b)
c = b[:, 2:, :]  # slice: keeps the middle dimension (size 1)
print(c)
print(c.shape)
d = b[:, -1, :]  # integer index: drops the middle dimension
print(d)
print(d.shape)

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])
tensor([[[ 8,  9, 10, 11]],

        [[20, 21, 22, 23]]])
torch.Size([2, 1, 4])
tensor([[ 8,  9, 10, 11],
        [20, 21, 22, 23]])
torch.Size([2, 4])
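
If the size-1 dimension is needed after selecting the last time step, there are several equivalent ways to keep or restore it. The sketch below reuses the same `b`; the variable names `c1` to `c4` are introduced only for this comparison.

```python
import torch

b = torch.arange(24).view(2, 3, 4)

d = b[:, -1, :]          # integer index: shape (2, 4)
c1 = b[:, -1:, :]        # length-1 slice: shape (2, 1, 4)
c2 = b[:, [-1], :]       # list index: also keeps the dimension
c3 = d.unsqueeze(1)      # re-insert the dimension after the fact
c4 = d[:, None, :]       # None adds a size-1 dimension

print(c1.shape, c2.shape, c3.shape, c4.shape)
print(torch.equal(c1, c2) and torch.equal(c1, c3) and torch.equal(c1, c4))  # True
```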


## `torch.nn.functional.softmax` dim

The `torch.nn.functional.softmax` function can be applied to tensors of any shape, and it's quite common to use it with batches of data, e.g., `(B, C)`, where `B` is the batch size and `C` is the number of classes.

### Using `dim=-1`:
```python
probs = F.softmax(logits, dim=-1)  # (B, C)
```
Here, `dim=-1` implies applying softmax to the last dimension of the tensor, i.e., across the classes for each instance in the batch independently. This is usually the desired behavior when dealing with batches of logits, as the softmax is typically applied to the scores (logits) corresponding to different classes.

### Using `dim=1`:
```python
probs = F.softmax(logits, dim=1)  # (B, C)
```
Here, `dim=1` is equivalent to `dim=-1` for a 2D tensor with shape `(B, C)`, as it also applies softmax across the classes (the second dimension) for each instance in the batch independently.

### Advantages of Specifying Dimension:
- **Correctness:** Specifying the dimension is crucial to ensure that the softmax is applied along the correct dimension, especially when working with tensors with more than two dimensions. It ensures the softmax operation is applied independently to each group of scores corresponding to different classes.
- **Flexibility:** It provides flexibility, allowing you to apply softmax along any specific dimension of a tensor depending on the use case.
- **Clarity:** Explicitly mentioning the dimension adds to the readability and clarity of the code, making it evident to others (and to "future you") along which dimension the softmax is intended to be applied.

In summary, whether you use `dim=-1` or `dim=1` for a 2D tensor `(B, C)`, it yields the same result, but it is crucial to specify the correct dimension, and it is often good practice to explicitly state the dimension along which the operation is performed for the sake of clarity.
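
To see why the choice matters beyond 2D, here is a minimal sketch on a 3D tensor of shape `(B, T, C)`, where `dim=-1` and `dim=1` are no longer the same axis. The shape `(2, 3, 5)` is an arbitrary choice for illustration.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
logits = torch.randn(2, 3, 5)            # (B, T, C)

probs_last = F.softmax(logits, dim=-1)   # normalizes over C (the classes)
probs_dim1 = F.softmax(logits, dim=1)    # normalizes over T (the time steps)

# Each result sums to 1 along the dimension it was normalized over
print(probs_last.sum(dim=-1).shape)      # torch.Size([2, 3])
print(probs_dim1.sum(dim=1).shape)       # torch.Size([2, 5])

# The two results are different tensors
print(torch.allclose(probs_last, probs_dim1))  # False
```

For `(B, T, C)` logits, `dim=-1` is the variant that yields a distribution over classes at every position.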

## Examining the model

To examine your model, you can print the model itself to get an overview of its structure, and you can also calculate and print the number of parameters. Here's how you can do it:

### 1. Print the Model
Printing the model gives an overview of all layers and components in your model.

```python
print(m)
```

### 2. Calculate and Print the Number of Parameters
You can calculate the total number of parameters, as well as the number of trainable (`requires_grad=True`) parameters.

```python
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)

print(f'Total Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')
```

### Example:

```python
m = BigramLanguageModel(100)  # vocab_size=100 for example

# Print model's state_dict
print("Model's state_dict:")
for name, tensor in m.state_dict().items():
    print(name, "\t", tensor.size())

# Print the model structure
print("\nModel's Structure: ")
print(m)

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in m.parameters())
trainable_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')
```

### Output:
```
Model's state_dict:
token_embedding_table.weight    torch.Size([100, 100])

Model's Structure:
BigramLanguageModel(
  (token_embedding_table): Embedding(100, 100)
)

Total Parameters: 10000
Trainable Parameters: 10000
```

In this example, you would replace `100` with your actual `vocab_size`, and you should see the structure of your model, the size of the embedding weight tensor, and the total number of parameters in your model.
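
When a model has more than one layer, a per-parameter breakdown can be more informative than the total. The sketch below uses `named_parameters()` on a hypothetical two-layer stand-in (`TinyModel` is not the notebook's model; the sizes 65 and 32 are assumed values).

```python
import torch.nn as nn

class TinyModel(nn.Module):
    """Hypothetical stand-in with an embedding table and an output head."""
    def __init__(self, vocab_size, n_embd):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

m = TinyModel(vocab_size=65, n_embd=32)

# Map each parameter name to its number of elements
counts = {name: p.numel() for name, p in m.named_parameters()}
for name, n in counts.items():
    print(f"{name:35s} {n:>8,d}")

total = sum(counts.values())
print(f"{'total':35s} {total:>8,d}")  # 2080 + 2080 + 65 = 4225
```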

## `model.eval()` and `model.train()`

The `model.eval()` and `model.train()` methods in PyTorch are used to set the model in evaluation and training modes, respectively, and they are essential for ensuring that specific layers in the model behave appropriately in each mode.

### **1. `model.train()`:**
   - **Purpose:** Sets the model to training mode.
   - **Effect on Layers:**
     - **Dropout Layers:** They are active: they zero out a random subset of units and scale the remaining ones by `1 / (1 - p)`.
     - **Batch Normalization Layers:** They compute the mean and variance of the current batch and use these for normalization, and they also update the running mean and variance.
   - **When to Use:** During training.

### **2. `model.eval()`:**
   - **Purpose:** Sets the model to evaluation (or inference) mode.
   - **Effect on Layers:**
     - **Dropout Layers:** They are deactivated, and no units are zeroed out.
     - **Batch Normalization Layers:** They use the running mean and variance accumulated during training for normalization.
   - **When to Use:** During evaluation, validation, testing, or any other inference task.

### **Layers/Components Affected:**
1. **Dropout Layers:**
   - **Why Needed:** To prevent overfitting during training and to ensure all units are active during inference for a deterministic output.

2. **Batch Normalization Layers:**
   - **Why Needed:** To use batch statistics during training and running statistics during inference to correctly normalize the input.

3. **Any Custom Layers/Components:**
   - **Why Needed:** Any custom layers or components that have different behaviors during training and inference will need to be aware of these modes to behave correctly.

### **Summary:**
The `model.train()` and `model.eval()` methods are crucial for models containing layers like Dropout and Batch Normalization that have differing behaviors during training and inference, ensuring appropriate and correct functionality in each phase.
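
A minimal sketch of the difference, using a standalone `nn.Dropout` layer rather than a full model. It also illustrates that `eval()` only switches layer behavior and does not turn off gradient tracking, which is what `torch.no_grad()` is for. The layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)   # each element is either 0 or 1 / (1 - 0.5) = 2
drop.eval()
y_eval = drop(x)    # dropout is a no-op in eval mode

print((y_train == 0).float().mean())  # fraction of zeroed units, roughly 0.5
print(torch.equal(y_eval, x))         # True

# eval() does not disable autograd; no_grad() does
lin = nn.Linear(4, 2)
lin.eval()
tracked = lin(torch.ones(4)).requires_grad        # True
with torch.no_grad():
    untracked = lin(torch.ones(4)).requires_grad  # False
print(tracked, untracked)
```

This is why evaluation code typically combines both: `model.eval()` for layer behavior and `torch.no_grad()` to skip building the autograd graph.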

## Why do the positional embeddings get moved to the "device"?

When you use `model.to(device)`, it moves all the model's parameters and buffers to the specified device. This includes all the learnable parameters of your model, such as the weights in the embedding layers, linear layers, etc. This is why you don’t have to specify the device individually for each parameter like token embeddings.

However, the positional embeddings line does more than look up a learnable parameter; it also creates a new index tensor on the fly during the forward pass:

```python
pos_emb = self.position_embedding_table(torch.arange(T, device=device))
```

Here, `torch.arange(T, device=device)` is creating a new tensor representing the position indices, and this tensor needs to be on the same device as the rest of your model and data to avoid errors during the forward pass. That’s why `device=device` is explicitly specified here.

To clarify, the `self.position_embedding_table` does get moved to the correct device when you do `model.to(device)`, but the tensor created by `torch.arange(T, device=device)` needs to have its device specified at the point of creation, because it is not a parameter of the model, but a temporary tensor created during the forward pass.

The token embeddings do not need the device to be specified at the point of use because they are not creating any new tensors on-the-fly during the forward pass in the manner that the positional embeddings line is. The input indices for the token embeddings are typically already on the correct device by the time they are used in the forward pass.
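
One way to avoid depending on a global `device` variable is to read the device from the input tensor itself. The sketch below assumes a hypothetical `TinyEmbedder` module with the same two embedding tables; it is not the notebook's model, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TinyEmbedder(nn.Module):
    """Hypothetical module holding only the two embedding tables."""
    def __init__(self, vocab_size, block_size, n_embd):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)

    def forward(self, idx):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)     # (B, T, n_embd)
        # Create the position indices on whatever device the input lives on
        pos = torch.arange(T, device=idx.device)      # (T,)
        pos_emb = self.position_embedding_table(pos)  # (T, n_embd)
        return tok_emb + pos_emb                      # broadcasts over B

model = TinyEmbedder(vocab_size=65, block_size=8, n_embd=16)
idx = torch.randint(0, 65, (4, 8))
out = model(idx)
print(out.shape)  # torch.Size([4, 8, 16])
```

With this pattern the forward pass keeps working after `model.to(device)` as long as the input batch is moved to the same device.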

## `torch.mean` dimension

The error occurs because you are trying to assign a tensor of incorrect shape to `xbow[b, t]`.

### Understanding the Error:
When you do:
```python
xbow[b, t] = torch.mean(xprev, 1)
```
Here you take the mean along dimension 1 of `xprev` (the second dimension, since dimensions are zero-indexed), which corresponds to `C`. With `xprev` of shape `(t+1, C)`, this produces a tensor of shape `(t+1,)`, because the `C` dimension is averaged away. That does not match the shape `(C,)` expected for `xbow[b, t]`, so the assignment raises a `RuntimeError` whenever `(t+1,)` cannot be broadcast to `(C,)`.

### Solution:
Since `xbow[b, t]` expects a tensor of shape `(C,)`, you should compute the mean across the time dimension of `xprev` (dimension 0) to get a tensor of the correct shape `(C,)`. Therefore, you should use:
```python
xbow[b, t] = torch.mean(xprev, 0)
```
This will correctly compute the mean of all the `t+1` time steps for each feature in `C`, resulting in a tensor of shape `(C,)` which can be correctly assigned to `xbow[b, t]`.

### Summary:
- Use `torch.mean(xprev, 0)` to compute the mean across the time steps, resulting in a tensor of shape `(C,)`.
- `torch.mean(xprev, 1)` attempts to compute the mean across the feature dimension, resulting in a tensor of shape `(T,)` (or `(t+1,)` in this case), which is incompatible with the expected shape `(C,)`.
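
The sketch below puts the corrected line into the full averaging loop and checks the result against a cumulative mean. The sizes `B, T, C = 4, 8, 2` are assumed values. It also probes when the wrong dimension actually fails: with `C = 2`, the shapes `(1,)` and `(2,)` produced at `t = 0` and `t = 1` are still assignable, so the error first appears at `t = 2`.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Correct version: average over the time dimension (dim 0 of xprev)
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t + 1]               # (t+1, C)
        xbow[b, t] = torch.mean(xprev, 0)  # (C,)

# Reference: running sum divided by the number of steps seen so far
steps = torch.arange(1, T + 1).view(1, T, 1)
expected = x.cumsum(dim=1) / steps
print(torch.allclose(xbow, expected, atol=1e-6))  # True

# Wrong version: find the first t at which the assignment raises
first_failure = None
slot = torch.zeros(C)
for t in range(T):
    try:
        slot[:] = torch.mean(x[0, :t + 1], 1)  # shape (t+1,)
    except RuntimeError:
        first_failure = t
        break
print(first_failure)  # 2
```

The `t = 1` case is the dangerous one: the shapes match by coincidence, so the wrong values would be stored without any error.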

---

Let's consider a simpler example to illustrate the error:

Suppose we have a tensor `x` of shape `(3, 2)`, representing 3 time steps and 2 features:

```python
x = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])  # shape (3, 2)
```

Now, if we try to take the mean along the second dimension (features):

```python
mean_x = torch.mean(x, 1)  # Attempting to take the mean along the second dimension.
print(mean_x)  # This will result in a tensor of shape (3,)
```

This will output:

```python
tensor([1.5, 3.5, 5.5])  # shape (3,)
```

Now, if we have a target tensor `target` of shape `(2,)`:

```python
target = torch.zeros(2)  # shape (2,)
```

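If we try to copy `mean_x` into `target` element-wise, it will throw an error similar to the one you experienced, as the shapes are incompatible. (A plain `target = mean_x` would only rebind the name and raise nothing, so the example uses in-place assignment, just as `xbow[b, t] = ...` does.)

```python
target[:] = mean_x  # Raises a RuntimeError, as the shapes are (2,) and (3,) respectively.
```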

Instead, if you want to compute the mean of `x` and store it in a tensor of shape `(2,)`, you should compute the mean along the first dimension (time steps):

```python
correct_mean_x = torch.mean(x, 0)  # Taking the mean along the first dimension.
print(correct_mean_x)  # This will result in a tensor of shape (2,)
```

This will output:

```python
tensor([3., 4.])  # shape (2,)
```

Now, copying `correct_mean_x` into `target` will not throw an error, as their shapes are compatible:

```python
target[:] = correct_mean_x  # No error, as the shapes are both (2,).
```

## The `keepdim` parameter in functions like `torch.sum`

The `keepdim` parameter in functions like `torch.sum` determines whether to retain the summed dimension in the output tensor's shape.

### When `keepdim=True`:
The summed dimension is retained as a dimension of size 1 in the resulting tensor.

### When `keepdim=False` (Default):
The summed dimension is removed from the resulting tensor's shape.

### Example:
Let’s consider a simple 2x3 tensor:
```python
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
```

#### 1. **Using `keepdim=True`**:
```python
sum_x_keepdim = torch.sum(x, dim=1, keepdim=True)
```
This will keep the summed dimension, resulting in a shape of `(2, 1)`:
```python
# sum_x_keepdim
tensor([[ 6],
        [15]])
```

#### 2. **Using `keepdim=False`** (or omitting it, as `False` is the default):
```python
sum_x_no_keepdim = torch.sum(x, dim=1)
```
This will remove the summed dimension, resulting in a shape of `(2,)`:
```python
# sum_x_no_keepdim
tensor([ 6, 15])
```

### Role of `keepdim`:
- **Preserving Dimensions:** When performing subsequent operations that rely on the original dimensionality, `keepdim=True` is useful to avoid shape mismatch errors.
- **Broadcasting:** Keeping the dimension is crucial when you want to use broadcasting in subsequent operations, where matching dimensions are essential.
- **Readability:** It can make the code more readable by making explicit the intention to retain the original number of dimensions.
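
The broadcasting point is easiest to see on a square matrix, where dropping `keepdim=True` produces no error at all, only wrong numbers. This is a small sketch using a lower-triangular matrix of ones; `T = 3` is an arbitrary size.

```python
import torch

T = 3
wei = torch.tril(torch.ones(T, T))

# (T, T) / (T, 1): every row is divided by its own sum
row_norm = wei / wei.sum(dim=1, keepdim=True)

# (T, T) / (T,): the sums are broadcast as (1, T), so column j is
# divided by the sum of row j, which is not a row normalization
wrong = wei / wei.sum(dim=1)

print(row_norm.sum(dim=1))  # tensor([1., 1., 1.])
print(wrong.sum(dim=1))     # rows no longer sum to 1
```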

## Initializing tensors in PyTorch

When initializing tensors in PyTorch, it is generally best practice to be explicit about the shape of the tensor. This aids in readability and reduces the likelihood of bugs due to unintended shapes. Both `torch.ones(T, T)` and `torch.ones((T, T))` produce identical tensors, since PyTorch accepts the shape either as separate integers or as a single tuple. Choosing the tuple form is a matter of style: it keeps the shape visually separate from keyword arguments such as `dtype` and `device`.

### Best Practices for Initializing Tensors:
1. **Be Explicit with Shape:**
   ```python
   torch.ones((T, T))  # Preferred way, more readable and explicit about the shape.
   ```
   This is clearer as it explicitly denotes the shape as a single argument (a tuple), making it evident that the resulting tensor is 2-dimensional.

2. **Use `dtype` and `device` Arguments:**
   When you need a tensor of a specific data type or on a specific device, specify these using the `dtype` and `device` arguments:
   ```python
   torch.ones((T, T), dtype=torch.float32, device='cuda')
   ```
   This ensures the tensor is created with the right type and on the right device, avoiding potential type/device mismatches later in the code. Note that `device='cuda'` raises an error on a machine without a GPU.

3. **Explicitly Set `requires_grad`:**
   If the tensor will be used for gradient computation, explicitly set `requires_grad=True`:
   ```python
   torch.ones((T, T), requires_grad=True)
   ```
   This makes it clear that this tensor is part of the computation graph for gradient computation.

4. **Use `torch.zeros` for Initializing to Zero:**
   When you need a tensor initialized with zeros, use `torch.zeros` with explicit shape:
   ```python
   torch.zeros((T, T))
   ```
   This is more intuitive and readable compared to creating a tensor with another method and then zeroing it out.

5. **Use `torch.randn` for Random Initialization:**
   When you need a tensor initialized with values from a standard normal distribution, use `torch.randn` with explicit shape:
   ```python
   torch.randn((T, T))
   ```
   This is more clear and concise than creating an empty tensor and then filling it with random values.

### Summary:
Being explicit about tensor shapes and properties such as data type, device, and whether it requires gradients, can make the code more readable, understandable, and less prone to bugs and unintended behaviors.
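
A short sketch of these practices together, assuming a `device` variable chosen at runtime so that the code also runs without a GPU. The `*_like` constructors are one convenient way to inherit shape, dtype, and device from an existing tensor.

```python
import torch

T = 4
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Both call styles produce the same tensor
a = torch.ones(T, T)
b = torch.ones((T, T))
same = torch.equal(a, b)
print(same)  # True

# Explicit shape, dtype, and device in one place
mask = torch.tril(torch.ones((T, T), dtype=torch.float32, device=device))

# zeros_like copies shape, dtype, and device from `mask`
scores = torch.zeros_like(mask)
print(scores.shape, scores.dtype, scores.device)
```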

## Hugging Face Model Hub

Given that you've trained a model on Google Colab and you want to upload this model to Hugging Face Model Hub, you would typically follow these steps.

### 1. Save Model and Tokenizer

Before you can upload your model to Hugging Face, you need to save it to disk along with any other files that are required to use it, such as your tokenizer.

#### a. Save the Model:
```python
# Define the path where you want to save your model
save_directory = "/path/to/your/model"

# Save the model
torch.save(model.state_dict(), f"{save_directory}/model_weights.pth")
```
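
Before uploading, it can be worth checking that the saved weights load back into a fresh model. The sketch below does this round trip locally in a temporary directory; `TinyBigram` is a hypothetical stand-in for the notebook's model, and `vocab_size = 65` is an assumed value.

```python
import os
import tempfile

import torch
import torch.nn as nn

class TinyBigram(nn.Module):
    """Hypothetical stand-in: a single embedding table, as in a bigram model."""
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        return self.token_embedding_table(idx)

vocab_size = 65
model = TinyBigram(vocab_size)

# Save only the weights (the state_dict), not the whole Python object
save_directory = tempfile.mkdtemp()
weights_path = os.path.join(save_directory, "model_weights.pth")
torch.save(model.state_dict(), weights_path)

# Rebuild the architecture, then load the weights into it
restored = TinyBigram(vocab_size)
restored.load_state_dict(torch.load(weights_path, map_location="cpu"))
restored.eval()

idx = torch.randint(0, vocab_size, (2, 4))
same = torch.equal(model(idx), restored(idx))
print(same)  # True
```

Because only the `state_dict` is stored, anyone loading the file needs the model class definition as well.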

#### b. Save the Tokenizer:
Because you have a custom tokenizer, you should also save the files needed to load it back (e.g., vocab files). For a simple character-level tokenizer, you might want to save the `stoi` and `itos` mappings.

```python
import json

# Save the stoi and itos mappings
with open(f"{save_directory}/stoi.json", 'w') as f:
    json.dump(stoi, f)

with open(f"{save_directory}/itos.json", 'w') as f:
    json.dump(itos, f)
```
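
One detail to keep in mind when reloading: JSON object keys are always strings, so the integer keys of `itos` come back as strings. The sketch below shows the round trip in memory and converts the keys back; the tiny vocabulary built from `"hello world"` is just an illustrative assumption.

```python
import json

chars = sorted(set("hello world"))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Round trip through JSON: integer keys become strings
raw = json.loads(json.dumps(itos))
print(list(raw.keys())[:3])  # ['0', '1', '2']

# Convert the keys back to integers before using the mapping
itos_restored = {int(k): v for k, v in raw.items()}

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos_restored[i] for i in ids)
print(decode(encode("hello")))  # hello
```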

### 2. Upload to Hugging Face Model Hub

#### a. Authenticate to Hugging Face
```python
from huggingface_hub import notebook_login

notebook_login()
```

#### b. Create a Model Repository
Create a repository on [Hugging Face](https://huggingface.co/new). The name of this repository will be your `model_name` below.

#### c. Upload Model and Tokenizer Files

```python
from huggingface_hub import HfApi

# Replace with the name of your Hugging Face repository
model_name = "username/repository_name"

# Upload every file in save_directory to the repository in a single commit
api = HfApi()
api.upload_folder(
    folder_path=save_directory,
    repo_id=model_name,
    repo_type="model",
    commit_message="Initial commit",
)
```

### 3. Load Model and Tokenizer for Inference

Once your model and tokenizer are available on Hugging Face, you can load them for inference as follows:

#### a. Load the Model:
```python
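from huggingface_hub import hf_hub_download

model = BigramLanguageModel(vocab_size)
weights_path = hf_hub_download(repo_id=model_name, filename="model_weights.pth")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()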
```

#### b. Load the Tokenizer:
```python
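import json
from huggingface_hub import hf_hub_download

# Load stoi and itos mappings from the files downloaded from the Hub
with open(hf_hub_download(repo_id=model_name, filename="stoi.json"), 'r') as f:
    stoi = json.load(f)

with open(hf_hub_download(repo_id=model_name, filename="itos.json"), 'r') as f:
    # JSON keys are always strings, so convert them back to ints
    itos = {int(k): v for k, v in json.load(f).items()}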

# Now you can use stoi and itos for tokenization and detokenization
```

### Important Note:
- You may need to adjust paths and filenames based on where and how you decide to save your model and tokenizer files.
- This example assumes that you are working with a PyTorch model and a simple character-based tokenizer. If your model or tokenizer is different, you may need to adjust the saving and loading code accordingly.

## Compatibility with Hugging Face Transformers Ecosystem

Repositories and models on the Hugging Face Model Hub can vary in compatibility. Many models are indeed compatible with the Hugging Face Transformers library, as they are uploaded using the library's tools and adhere to the standard architectures (like BERT, GPT-2, etc.). These can be easily loaded using the library's `from_pretrained` methods.

However, the Model Hub also allows the hosting of models that may not strictly adhere to the standard architectures or may not be directly compatible with the Transformers library. These might require custom code to load and use.

### Uploading to Model Hub vs Google Drive

#### 1. **Public Accessibility:**
   - **Model Hub:** Offers a platform where models are publicly available to the community, and others can easily discover, download, and use them.
   - **Google Drive:** More suited for personal storage, and sharing models can be cumbersome due to access permissions.

#### 2. **Ecosystem Integration:**
   - **Model Hub:** When models are compatible with the Transformers library, they can be seamlessly integrated into the ecosystem, allowing easy loading and utilization using Hugging Face methods.
   - **Google Drive:** Requires manual download and loading, regardless of compatibility.

#### 3. **Versioning and Documentation:**
   - **Model Hub:** Supports versioning and provides a platform to document the model, its usage, and any relevant details.
   - **Google Drive:** Lacks built-in versioning and documentation features for shared models.

#### 4. **Community and Collaboration:**
   - **Model Hub:** Facilitates community interactions, feedback, and contributions, enhancing collaborative model development.
   - **Google Drive:** Is more isolated and does not inherently support community interactions around shared models.

### Utilizing Model Hub Efficiently

- **Compatible Models:** To fully leverage the benefits of the Model Hub, it is advisable to make models compatible with the Transformers library so that the community can easily use them.
- **Custom Models:** For non-standard or custom models, including clear instructions and possibly custom loading code in the repository can assist users in utilizing the models.

### Conclusion

While uploading a model to the Model Hub without compatibility is somewhat akin to storing it on Google Drive, making it compatible with the Hugging Face ecosystem and providing clear documentation can significantly enhance its usability and accessibility within the community.

## Environment variables

To avoid including sensitive information such as your Hugging Face username or any other user-specific information directly in your Colab notebook, you can use environment variables or Google Colab's "Forms" feature.

### 1. Using Environment Variables:

You can set environment variables from the notebook cell as follows:

```python
import os
os.environ['HF_USERNAME'] = 'your_username'
```

Then, you can access this environment variable whenever you need it:

```python
username = os.getenv('HF_USERNAME')
```

### 2. Using Google Colab's Forms:

Google Colab provides a feature called "Forms" which allows you to create fields in your notebook where you can input values.

You can add a text field in your Colab notebook like this:

```python
username = "your_username"  #@param {type:"string"}
```

Now, you can edit this field directly to input your username without modifying the code.

### 3. Using Google Drive:

You could also store your sensitive information in a file on Google Drive and read the file in your Colab notebook. After mounting your Google Drive in Colab, you can read the file as follows:

```python
with open('/content/drive/MyDrive/your_file.txt', 'r') as f:
    username = f.read().strip()
```

### 4. Input Prompt:

Use Python's `input` function to prompt you to enter sensitive information when running the cell:

```python
username = input("Enter your Hugging Face username: ")
```

### Recommendations:

- Do not store sensitive information, such as passwords or secret keys, directly in your notebook, even with the above methods.
- For secret keys or tokens, consider using a secure vault or secrets manager.
- After using any sensitive information in your notebook, make sure to clear the output cells before sharing the notebook with others.

Make sure to choose a method that suits your needs and ensures the security of your sensitive information.
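
Whichever method supplies the value, `os.getenv` returns `None` when the variable is missing, which can surface later as a confusing error. The sketch below wraps the lookup in a hypothetical helper, `require_env`, that fails early instead. The placeholder value is set here only so the example runs on its own.

```python
import os

def require_env(name):
    """Return the environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Placeholder for this sketch only; in practice set the value outside the notebook
os.environ.setdefault("HF_USERNAME", "your_username")

username = require_env("HF_USERNAME")
repo_id = f"{username}/repository_name"
print(repo_id)
```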

## `@dataclass`

Using a `dataclass` or a class-based configuration is indeed a good practice, especially as projects grow larger and more complex. It offers type annotations that static checkers and IDEs can use, auto-generated `__init__`, `__repr__`, and `__eq__` methods, and other utilities, making it more maintainable and error-resistant compared to a simple dictionary.

Here’s a brief explanation for why and when to use a class-based configuration:

### 1. **Type Safety:**
   - With a class, type hints document the expected type of each configuration parameter, and a static checker such as mypy can flag mismatches before the code runs. Note that dataclasses do not validate types at runtime.
   - In a dictionary, it’s easy to mistakenly assign a value of the wrong type, leading to potential issues.

### 2. **Autocompletion and Documentation:**
   - Many IDEs offer autocompletion and inline documentation for class attributes but not for dictionary keys.
   - This can make the development process smoother and reduce the risk of typos or using non-existent configuration parameters.

### 3. **Default Values:**
   - With classes, you can easily set default values for your configuration parameters.
   - This can reduce the verbosity of your configuration and make it easier to understand which parameters are essential and which are optional.

### 4. **Immutability:**
   - Classes (especially dataclasses declared with `frozen=True`) can be made immutable, preventing accidental modifications of the configuration during runtime.
   - Dictionaries are mutable by default.

### Example using `dataclass`:
```python
from dataclasses import dataclass

import torch

@dataclass
class Config:
    block_size: int
    n_embd: int
    n_head: int
    n_layer: int
    dropout: float
    device: str

# Pick the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
config = Config(block_size=8, n_embd=32, n_head=4, n_layer=4, dropout=0.1, device=device)
```

### Why it might not have been used:
- **Simplicity and Quick Prototyping:** For smaller, simpler projects or during initial prototyping stages, using a dictionary might be simpler and quicker.
- **Familiarity:** The author might be more familiar or comfortable with using dictionaries for configuration.

### Conclusion:
While using dictionaries is fine for simpler scenarios or quick prototyping, adopting a class-based approach for configurations is generally a good practice for larger and more complex projects, as it provides additional safety, clarity, and development convenience.

It would indeed be a beneficial refactor to use a `dataclass` or a similar approach for the configuration in the provided code, especially as it grows and evolves.
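
The default-value and immutability points can be sketched together. The field names below mirror the example above, while the default values and the divisibility check in `__post_init__` are assumptions added for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Config:
    block_size: int = 8
    n_embd: int = 32
    n_head: int = 4
    dropout: float = 0.1

    def __post_init__(self):
        # Runtime validation must be written by hand; type hints alone do not check values
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")

# Override only what differs from the defaults
config = Config(n_embd=64)
print(config)

# A frozen dataclass rejects attribute assignment
try:
    config.dropout = 0.5
except FrozenInstanceError:
    print("config is immutable")
```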

## Benefits of External Configuration Files

Using `@dataclass` does make it convenient to organize configuration or hyperparameter settings within your code, and it can help in keeping the code cleaner and more structured. However, it doesn't inherently eliminate the need or usefulness of an external `hyperparameters.json` file or another configuration file.

### Benefits of External Configuration Files:
1. **Easy Updates:** External configuration files allow you to change hyperparameters without modifying the code. This is particularly useful when you want to experiment with different hyperparameters or when different users need different settings.
2. **Version Control:** Keeping hyperparameters in an external file can be beneficial for version control. The code remains unchanged, and only the configuration file is modified.
3. **Automation and Scaling:** For automated experimentation and hyperparameter tuning, external configuration files are essential as they can be easily generated and modified by scripts.

### Example Workflow with `@dataclass` and External Config File:
1. **Define a `@dataclass` for Configuration:**
   ```python
   from dataclasses import dataclass

   @dataclass
   class ModelConfig:
       learning_rate: float
       batch_size: int
       num_epochs: int
   ```
   
2. **Load Hyperparameters from an External JSON File:**
   ```python
   import json

   with open('config.json', 'r') as f:
       config_dict = json.load(f)

   config = ModelConfig(**config_dict)
   ```

3. **Use the Config in your Model:**
   ```python
   model = MyModel(config)
   ```

This way, you can leverage the benefits of both `@dataclass` for structured and clean code and external configuration files for flexibility and convenience.
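
The reverse direction, writing the configuration used for a run back to disk, can be sketched with `dataclasses.asdict`. The values and the temporary file location below are illustrative assumptions.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict

@dataclass
class ModelConfig:
    learning_rate: float
    batch_size: int
    num_epochs: int

config = ModelConfig(learning_rate=3e-4, batch_size=32, num_epochs=10)

# Write the config next to the run's other outputs (a temp dir in this sketch)
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(asdict(config), f, indent=2)

# Read it back into the same dataclass
with open(path, "r") as f:
    loaded = ModelConfig(**json.load(f))

print(loaded == config)  # True
```

Unknown keys in the JSON file make `ModelConfig(**...)` raise a `TypeError`, which acts as a basic check against typos in the file.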

## Roadmap to go from a document completer/babbler to a question-answerer

Below is a simplified roadmap to go from a document completer/babbler (like GPT) to a question-answerer, based on the transcript you provided:

### 1. **Pre-training Stage**
   - **Goal:** Train a language model to understand language structure and generate coherent text.
   - **Model:** Transformer-based model (e.g., GPT-3).
   - **Training Data:** Large chunk of the internet.
   - **Task:** Predict the next word/token in a sequence.
   - **Outcome:** A model capable of completing documents, babbling coherent and diverse text, but not necessarily useful or context-aware responses.

### 2. **Fine-tuning Stage**
   - **Goal:** Refine the pre-trained model to respond to questions with coherent, contextually appropriate, and useful answers.
   - **Model:** The pre-trained model from stage 1.
   - **Training Data:** Curated datasets with a format of a question followed by an answer.
   - **Task:** Predict the appropriate answer given a question.
   - **Outcome:** A model that is more aligned to being an assistant, expecting to complete questions with coherent answers.

### 3. **Reward Modeling and Reinforcement Learning**
   - **Goal:** Further refine the model’s responses using human feedback.
   - **Model:** The fine-tuned model from stage 2.
   - **Training Data:** Model-generated responses ranked by human raters.
   - **Task:** Optimize the model’s policy to generate responses that are expected to score high rewards according to human preference.
   - **Technique:** Proximal Policy Optimization (PPO).
   - **Outcome:** A model that generates responses that are more likely to be preferred by humans, serving as an efficient question-answerer.

### How it Works:
   - **Fine-tuning Aligns the Model:** The model is fine-tuned on specific Q&A formatted data, aligning its responses to be more like answers to the questions, rather than just completing documents.
   - **Reward Model Predicts Desirability:** A separate network (reward model) is trained to predict the desirability of the model's responses based on human rankings.
   - **PPO Optimizes Policy:** The PPO algorithm uses the reward model to optimize the model's policy, making it generate responses that are expected to receive higher rewards.

### Conclusion:
Yes, the reward model is used to fine-tune the initial model. The reward model helps in guiding the reinforcement learning algorithm (PPO) to make the fine-tuned model generate more human-preferred responses, effectively transforming it from a document completer to a question-answerer.

### Additional Note:
The fine-tuning stage and the subsequent reinforcement learning stage are crucial and involve proprietary data and techniques, making them harder to replicate without access to such resources and knowledge.
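
The text above does not say how the reward model is trained from rankings. One common formulation, assumed here rather than taken from the source, is a pairwise loss that pushes the score of the human-preferred response above the score of the rejected one. The scores below are made-up numbers standing in for reward model outputs.

```python
import torch
import torch.nn.functional as F

# Hypothetical reward scores for three (preferred, rejected) response pairs
r_chosen = torch.tensor([1.2, 0.3, -0.5])
r_rejected = torch.tensor([0.4, 0.9, -1.0])

# Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)), averaged over pairs
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss.item())

# The loss shrinks as preferred responses are scored further above rejected ones
better = -F.logsigmoid((r_chosen + 5.0) - r_rejected).mean()
print(better.item() < loss.item())  # True
```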

In [None]:
list(range(10))[1::2]

[1, 3, 5, 7, 9]

In [None]:
import torch


torch.Size([4, 8, 10])

In [None]:
e = torch.randn(2,3,4)
e[:,:,:1]

tensor([[[ 0.9672],
         [-0.3810],
         [ 0.2085]],

        [[-1.9406],
         [ 2.4469],
         [-0.2574]]])