<a href="https://colab.research.google.com/github/rayxuan2000/GPT-and-LLM/blob/main/baby_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Build a GPT from scratch, teach the baby-GPT a sentence

### Prepare our simple dataset: a single sentence

In [None]:
text = "Your name is GPT-3"

In [None]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  18


In [None]:
# check our text
print(text)

Your name is GPT-3


In [None]:
# Identifies all unique characters present in the given text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Characters from the sentence:", "".join(chars))
print("vocab_size from the sentence: ", vocab_size)

Characters from the sentence:  -3GPTYaeimnorsu
vocab_size from the sentence:  16


In [None]:
# Create a vocabulary dictionary mapping each character to a unique index
vocab_table = {}
for i, char in enumerate(chars):
    vocab_table[char] = i

# Collect keys (characters) and values (indexes) from the vocabulary table
characters = list(vocab_table.keys())
indexes = list(vocab_table.values())

# Print the characters and their corresponding indexes in tab-separated format
print("Character\t", "\t".join(characters))  # Print all characters separated by tabs
print("Index\t\t", "\t".join(map(str, indexes)))  # Print all indexes separated by tabs

Character	  	-	3	G	P	T	Y	a	e	i	m	n	o	r	s	u
Index		 0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15


In [None]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
print(encode, type(encode))
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string


print(decode(encode(text)))
print(encode(f"{text}"))

<function <lambda> at 0x7f4b619de7a0> <class 'function'>
Your name is GPT-3
[6, 12, 15, 13, 0, 11, 7, 10, 8, 0, 9, 14, 0, 3, 4, 5, 1, 2]
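The same character-level tokenizer can be written as small named functions. The sketch below is illustrative only: `build_vocab`, `encode_text` and `decode_ids` are hypothetical names, not part of the notebook. Unlike the lambda version, which raises a bare `KeyError`, it reports which characters are missing from the vocabulary.

```python
def build_vocab(text):
    """Return (stoi, itos) mappings for the sorted unique characters of text."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    return stoi, itos


def encode_text(s, stoi):
    """Map a string to a list of integer ids; fail loudly on unknown characters."""
    unknown = sorted(set(s) - set(stoi))
    if unknown:
        raise ValueError(f"characters not in vocabulary: {unknown}")
    return [stoi[c] for c in s]


def decode_ids(ids, itos):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)


stoi, itos = build_vocab("Your name is GPT-3")
ids = encode_text("GPT", stoi)
print(ids)                      # [3, 4, 5], the ids of 'G', 'P', 'T'
print(decode_ids(ids, itos))    # GPT
```

Calling `encode_text("Hello", stoi)` would raise a `ValueError`, because `H` and `l` never occur in the training sentence.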


In [None]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
real_text = lambda tensor: { decode(tensor.tolist()) }
real_char = lambda tensor: { decode([tensor.item()]) }

# let's print the data
print(data.shape, data.dtype)
print(data)
print(real_text(data))

torch.Size([18]) torch.int64
tensor([ 6, 12, 15, 13,  0, 11,  7, 10,  8,  0,  9, 14,  0,  3,  4,  5,  1,  2])
{'Your name is GPT-3'}


### Use the sentence for training

Our goal here is text generation: given the text so far, predict the next token (here, a single character).

In [None]:
# The dataset is a single sentence, so we train on all of it (no validation split)
train_data = data

#### example

In [None]:
block_size = 3 # context length: how many tokens one training sequence contains
block_data = train_data[:block_size]
target_data =  train_data[1:block_size+1]
print(block_data)
print(real_text(block_data))
print(target_data)
print(real_text(target_data))

tensor([ 6, 12, 15])
{'You'}
tensor([12, 15, 13])
{'our'}


In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    real_context = real_text(context)
    real_target = real_char(target)
    print(f"when input is {context}--'{real_context}', the target is: [{target}]--'{real_target}'")


when input is tensor([6])--'{'Y'}', the target is: [12]--'{'o'}'
when input is tensor([ 6, 12])--'{'Yo'}', the target is: [15]--'{'u'}'
when input is tensor([ 6, 12, 15])--'{'You'}', the target is: [13]--'{'r'}'


#### implementation

In [None]:
train_data

tensor([ 6, 12, 15, 13,  0, 11,  7, 10,  8,  0,  9, 14,  0,  3,  4,  5,  1,  2])

In [None]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
# block_size = len(data) - 1 # what is the maximum context length for predictions?
block_size = 5 # what is the maximum context length for predictions?


def get_batch():
    # generate a small batch of data of inputs x and targets y

    data = train_data

    # generates batch_size number of random starting indices
    ix = torch.randint(len(data) - block_size, (batch_size,))

    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch()
print('inputs:')
print(xb.shape)
print(xb)
for x in xb:
    print(real_text(x))
print('targets:')
print(yb.shape)
print(yb)
for y in yb:
    print(real_text(y))


inputs:
torch.Size([4, 5])
tensor([[ 0,  9, 14,  0,  3],
        [ 0,  9, 14,  0,  3],
        [10,  8,  0,  9, 14],
        [14,  0,  3,  4,  5]])
{' is G'}
{' is G'}
{'me is'}
{'s GPT'}
targets:
torch.Size([4, 5])
tensor([[ 9, 14,  0,  3,  4],
        [ 9, 14,  0,  3,  4],
        [ 8,  0,  9, 14,  0],
        [ 0,  3,  4,  5,  1]])
{'is GP'}
{'is GP'}
{'e is '}
{' GPT-'}
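The sampling logic in `get_batch` can be mirrored without PyTorch to make the index arithmetic explicit. This is a sketch under assumptions: `sample_batch` is a hypothetical helper, and Python's `random` module stands in for `torch.randint`, so the drawn windows will differ from the ones printed above.

```python
import random


def sample_batch(data, block_size, batch_size, rng):
    """Pick random windows of length block_size and their one-step-shifted targets."""
    # valid starts are 0 .. len(data) - block_size - 1, so the target slice
    # data[i + 1 : i + block_size + 1] never runs past the end of data
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    xs = [data[i:i + block_size] for i in starts]
    ys = [data[i + 1:i + block_size + 1] for i in starts]
    return xs, ys


data = [6, 12, 15, 13, 0, 11, 7, 10, 8, 0, 9, 14, 0, 3, 4, 5, 1, 2]  # encoded sentence
xs, ys = sample_batch(data, block_size=5, batch_size=4, rng=random.Random(0))
for x, y in zip(xs, ys):
    # every target row is the input row shifted left by one position
    assert x[1:] == y[:-1] and len(y) == 5
    print(x, "->", y)
```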


In [None]:
for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        real_context = decode(context.tolist())
        real_target = decode([target.item()])
        print(f"when input is {context}--'{real_context}', the target is: [{target}]--'{real_target}'")

when input is tensor([0])--' ', the target is: [9]--'i'
when input is tensor([0, 9])--' i', the target is: [14]--'s'
when input is tensor([ 0,  9, 14])--' is', the target is: [0]--' '
when input is tensor([ 0,  9, 14,  0])--' is ', the target is: [3]--'G'
when input is tensor([ 0,  9, 14,  0,  3])--' is G', the target is: [4]--'P'
when input is tensor([0])--' ', the target is: [9]--'i'
when input is tensor([0, 9])--' i', the target is: [14]--'s'
when input is tensor([ 0,  9, 14])--' is', the target is: [0]--' '
when input is tensor([ 0,  9, 14,  0])--' is ', the target is: [3]--'G'
when input is tensor([ 0,  9, 14,  0,  3])--' is G', the target is: [4]--'P'
when input is tensor([10])--'m', the target is: [8]--'e'
when input is tensor([10,  8])--'me', the target is: [0]--' '
when input is tensor([10,  8,  0])--'me ', the target is: [9]--'i'
when input is tensor([10,  8,  0,  9])--'me i', the target is: [14]--'s'
when input is tensor([10,  8,  0,  9, 14])--'me is', the target is: [0]--' '

In [None]:
print(xb) # our input to the transformer

tensor([[ 0,  9, 14,  0,  3],
        [ 0,  9, 14,  0,  3],
        [10,  8,  0,  9, 14],
        [14,  0,  3,  4,  5]])


### Build our simple NLP model with our tokenizer

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class SimpleNLPModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        # data flow: token index -> embedding row -> logits (there is no separate linear layer)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            # each C dimension tensor is a representation of a token
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens): # used for generating new text

        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1); e.g. with probs 0.1/0.2/0.7 the third token is drawn most often, but not always (sampling, not argmax)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
            # e.g. the context "You" followed by a sampled "r" becomes "Your",
            # which is then fed back in to sample the next character
        return idx

m = SimpleNLPModel(vocab_size)
print(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)


16
torch.Size([20, 16])
tensor(3.4646, grad_fn=<NllLossBackward0>)
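A useful reference point for the initial loss is a model that predicts all 16 characters with equal probability. The plain-Python sketch below (the helper `cross_entropy` is hypothetical and handles a single position, unlike `F.cross_entropy`, which averages over a batch) computes that baseline. Randomly initialized logits are not uniform, so an untrained loss above this value, such as the 3.4646 printed above, is plausible.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)                                   # subtract max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


vocab_size = 16
uniform_loss = cross_entropy([0.0] * vocab_size, target=3)
print(round(uniform_loss, 4))   # 2.7726, i.e. ln(16)
```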


### Try to generate something with untrained model

In [None]:
print("Generate something with untrained model:")
# start with space char!
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

Generate something with untrained model:
 eT TYneP-TanroT uoGYnamGsPes-smYrYGrrrrPomT uPsm-o P-uuT PnissmT aeoPr uuomiaa333ai ia isr eTiT T oi


The text is meaningless and super random!
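The randomness comes from the sampling step in `generate`: the next token is drawn from the softmax distribution rather than chosen as the most likely one. A minimal plain-Python sketch of that step follows, using hypothetical logits for a three-token vocabulary and `random.choices` as a stand-in for `torch.multinomial`.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, rng):
    """Draw one index with probability proportional to softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [0.0, 1.0, 3.0]          # hypothetical logits for a 3-token vocabulary
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_index(logits, rng)] += 1
print(counts)   # index 2 is drawn most often, but not always
```

With an untrained model the logits are close to noise, so the sampled characters are close to noise as well.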

### Let's train our simple NLP model

In [None]:
import torch
from tqdm import tqdm

batch_size = 1
device = 'cuda' if torch.cuda.is_available() else 'cpu'
m = m.to(device)
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

# Use tqdm to add a progress bar and display intermediate loss values in the progress bar
pbar = tqdm(range(5000))
for steps in pbar:
    # Sample a batch of data
    xb, yb = get_batch()
    xb, yb = xb.to(device), yb.to(device)
    # Evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    # Update the progress bar description to show the current loss value
    pbar.set_description(f"Loss: {loss.item():.4f}")

print(loss.item())


Loss: 0.7557: 100%|██████████| 5000/5000 [00:19<00:00, 256.36it/s]

0.7556718587875366





In [None]:
print("Generate something with trained simple model:")
print(decode(m.generate(idx = torch.ones((1, 1), dtype=torch.long).to(device), max_new_tokens=100)[0].tolist()))
# print(real_char(torch.ones((1, 1))))

Generate something with trained simple model:
-3TGPT-33ami GPTouYr name Yr GPT name nr isrGPme GPT-ais name name nmeT-ur name is GPeeuouu-uour is Y


It seems it has improved a little but there is still a long way to go!
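One reason the samples stay jumbled is structural: this model sees only the current character, and in the training sentence a space is followed by `n`, `i` and `G` once each. The sketch below estimates the lowest loss any such single-character lookup table could reach. It assumes every transition in the sentence is weighted equally, which only approximates how `get_batch` samples windows, and `bigram_loss_floor` is a hypothetical helper.

```python
import math
from collections import Counter, defaultdict


def bigram_loss_floor(text):
    """Average next-character negative log-likelihood of the best possible bigram table."""
    following = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        following[cur][nxt] += 1
    total_nll, n = 0.0, 0
    for counts in following.values():
        seen = sum(counts.values())
        for c in counts.values():
            total_nll += -c * math.log(c / seen)   # c transitions, each with probability c/seen
            n += c
    return total_nll / n


floor = bigram_loss_floor("Your name is GPT-3")
print(round(floor, 4))   # about 0.19
```

Even at that floor, the model would pick one of the three words after a space at random, so longer context (attention) is needed to reproduce the sentence reliably.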

In [None]:
start_id = torch.tensor([encode('Y')], dtype=torch.long)
start_id = start_id.to(device)
# start with 'Y'!
print(decode(m.generate(idx = start_id, max_new_tokens=100)[0].tolist()))

Yr name Gre GPTYGYr i-ur nme GPPGYr GPT GPT-33Pr3Genam3 isYoui-amis our is namens is nam-3ame is GT-s


### Illustration of all shapes

In [None]:
def forward_with_log(self, idx, targets=None):

    print("The shape of idx (B,T):",idx.shape)
    print("The shape of targets (B,T):", None if targets is None else targets.shape)
    print("The shape of token_embedding_table (C,C):",self.token_embedding_table.weight.shape)
    # idx and targets are both (B,T) tensor of integers
    logits = self.token_embedding_table(idx) # (B,T,C)
    print("The batch size is:", batch_size)
    print("The length of the sequence is:", block_size)
    print("The vocab size is:", vocab_size)
    print("targets:", targets)
    if targets is None:
        loss = None
    else:
        B, T, C = logits.shape
        logits = logits.view(B*T, C)
        print("logits[0]:", logits)
        targets = targets.view(B*T)
        print("targets:", targets)
        loss = F.cross_entropy(logits, targets)  # small when, in every row, the largest logit sits at the target index

    # most likely next token for the second position (row 1 of the flattened (B*T, C) logits)
    _, predicted_labels = torch.max(logits[1], dim=-1)

    print("Predicted labels:", predicted_labels)

    return logits, loss

m.forward = forward_with_log.__get__(m)
m = m.to(device)
data = data.to(device)
print("block_size:",block_size)
print("data[:block_size]:",real_text(data[:block_size]))
m.forward(data[:block_size].unsqueeze(0), data[1:block_size+1].unsqueeze(0))
print(real_text(data))
print(data[1:block_size+1].unsqueeze(0))


block_size: 5
data[:block_size]: {'Your '}
The shape of idx (B,T): torch.Size([1, 5])
The shape of targets (B,T): torch.Size([1, 5])
The shape of token_embedding_table (C,C): torch.Size([16, 16])
The batch size is: 1
The length of the sequence is: 5
The vocab size is: 16
targets: tensor([[12, 15, 13,  0, 11]], device='cuda:0')
logits[0]: tensor([[-2.9378, -2.6484, -1.4984, -0.4172, -1.3095, -2.0510, -1.6252, -1.3233,
         -1.5096, -1.4677, -1.0579, -0.0120,  1.4445,  1.0971, -1.9537, -3.2321],
        [-1.5576, -2.0704, -0.0153, -0.4037, -0.7045, -2.3323, -2.1822, -1.1957,
         -1.7236, -3.3376, -0.6676, -1.2417, -2.9353, -2.6657, -1.8229,  1.8449],
        [-0.7712, -2.2525, -2.9893, -3.3333, -1.8029, -2.4856, -2.2264, -3.0594,
         -2.5246, -2.0379, -3.1098, -3.4164, -1.1366,  0.3925, -2.7221, -1.1737],
        [ 2.3739, -2.7966, -2.3969, -2.5359, -1.4751, -1.5105, -0.9007, -1.3898,
         -1.6100, -1.2559, -2.3605, -2.8164, -2.6586, -1.7395, -2.6319, -2.4802],
        

### The mathematical trick in self-attention

In [None]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"


# you are the best
# 1    2    3   4
torch.manual_seed(42)
wei = torch.tril(torch.ones(7, 7)) # lower triangular matrix
wei = wei / torch.sum(wei, 1, keepdim=True)
b = torch.randint(0,10,(7,2)).float()
c = wei @ b
print('wei=')
print(wei)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

     # you are the best

# you  11  21   23  23
# are
# the
# best


# Key observation: row t of wei puts equal weight on tokens 0..t and zero weight
# on later tokens, so row t of c is the average of the current token and all
# preceding ones. No row can see the future, which matches the autoregressive setting.

wei=
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.],
        [0., 4.],
        [0., 3.],
        [8., 4.],
        [0., 4.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333],
        [3.5000, 5.0000],
        [2.8000, 4.6000],
        [3.6667, 4.5000],
        [3.1429, 4.4286]])
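Multiplying by the row-normalized lower-triangular matrix is the same as taking a running mean over the rows of `b`. The plain-Python sketch below (with a hypothetical helper `running_mean`) recomputes `c` that way from the values of `b` printed above.

```python
def running_mean(rows):
    """Row t of the result is the element-wise mean of rows[0..t]."""
    sums = [0.0] * len(rows[0])
    out = []
    for t, row in enumerate(rows, start=1):
        sums = [s + v for s, v in zip(sums, row)]
        out.append([s / t for s in sums])
    return out


b = [[2, 7], [6, 4], [6, 5], [0, 4], [0, 3], [8, 4], [0, 4]]   # same values as b above
for row in running_mean(b):
    print([round(v, 4) for v in row])
```

The printed rows agree with `c`; the matrix form is preferred in practice because it runs as one batched operation and generalizes to non-uniform weights.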


In [None]:
# consider the following toy example:

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
# x = torch.ones(B,T,C)
x.shape
print(x)

tensor([[[ 0.1808, -0.0700],
         [-0.3596, -0.9152],
         [ 0.6258,  0.0255],
         [ 0.9545,  0.0643],
         [ 0.3612,  1.1679],
         [-1.3499, -0.5102],
         [ 0.2360, -0.2398],
         [-0.9211,  1.5433]],

        [[ 1.3488, -0.1396],
         [ 0.2858,  0.9651],
         [-2.0371,  0.4931],
         [ 1.4870,  0.5910],
         [ 0.1260, -1.5627],
         [-1.1601, -0.3348],
         [ 0.4478, -0.8016],
         [ 1.5236,  2.5086]],

        [[-0.6631, -0.2513],
         [ 1.0101,  0.1215],
         [ 0.1584,  1.1340],
         [-1.1539, -0.2984],
         [-0.5075, -0.9239],
         [ 0.5467, -1.4948],
         [-1.2057,  0.5718],
         [-0.5974, -0.6937]],

        [[ 1.6455, -0.8030],
         [ 1.3514, -0.2759],
         [-1.5108,  2.1048],
         [ 2.7630, -1.7465],
         [ 1.4516, -1.5103],
         [ 0.8212, -0.2115],
         [ 0.7789,  1.5333],
         [ 1.6097, -0.4032]]])


In [None]:
# self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16 # hyperparameter
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
print(wei[0])

tensor([[-1.7629, -1.3011,  0.5652,  2.1616, -1.0674,  1.9632,  1.0765, -0.4530],
        [-3.3334, -1.6556,  0.1040,  3.3782, -2.1825,  1.0415, -0.0557,  0.2927],
        [-1.0226, -1.2606,  0.0762, -0.3813, -0.9843, -1.4303,  0.0749, -0.9547],
        [ 0.7836, -0.8014, -0.3368, -0.8496, -0.5602, -1.1701, -1.2927, -1.0260],
        [-1.2566,  0.0187, -0.7880, -1.3204,  2.0363,  0.8638,  0.3719,  0.9258],
        [-0.3126,  2.4152, -0.1106, -0.9931,  3.3449, -2.5229,  1.4187,  1.2196],
        [ 1.0876,  1.9652, -0.2621, -0.3158,  0.6091,  1.2616, -0.5484,  0.8048],
        [-1.8044, -0.4126, -0.8306,  0.5899, -0.7987, -0.5856,  0.6433,  0.6303]],
       grad_fn=<SelectBackward0>)


In [None]:
tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
print("masked:",wei[0])
wei = F.softmax(wei, dim=-1)
print("masked:",wei[0])

masked: tensor([[-1.7629,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-3.3334, -1.6556,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-1.0226, -1.2606,  0.0762,    -inf,    -inf,    -inf,    -inf,    -inf],
        [ 0.7836, -0.8014, -0.3368, -0.8496,    -inf,    -inf,    -inf,    -inf],
        [-1.2566,  0.0187, -0.7880, -1.3204,  2.0363,    -inf,    -inf,    -inf],
        [-0.3126,  2.4152, -0.1106, -0.9931,  3.3449, -2.5229,    -inf,    -inf],
        [ 1.0876,  1.9652, -0.2621, -0.3158,  0.6091,  1.2616, -0.5484,    -inf],
        [-1.8044, -0.4126, -0.8306,  0.5899, -0.7987, -0.5856,  0.6433,  0.6303]],
       grad_fn=<SelectBackward0>)
masked: tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000
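The masked softmax can be checked by hand on a small corner of the matrix. This sketch uses plain Python and a hypothetical helper `causal_softmax`; instead of filling future positions with `-inf`, it simply leaves them out of the normalization, which yields the same weights.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax over a square score matrix, ignoring entries above the diagonal."""
    out = []
    for t, row in enumerate(scores):
        visible = row[:t + 1]                       # position t may only see positions 0..t
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # masked positions get weight 0, the same result as softmax over -inf entries
        out.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return out


scores = [[-1.7629, -1.3011, 0.5652],    # top-left 3x3 corner of wei[0] printed above
          [-3.3334, -1.6556, 0.1040],
          [-1.0226, -1.2606, 0.0762]]
for row in causal_softmax(scores):
    print([round(w, 4) for w in row])
```

The second printed row should match the `0.1574, 0.8426` weights in the notebook output, up to rounding of the inputs.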

In [None]:
v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Examples along the batch dimension are processed completely independently and never "talk" to each other.
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. **This block here is called a "decoder" attention block** because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). This way, when the inputs Q and K have unit variance, `wei` has unit variance too, so softmax stays diffuse and does not saturate too much. Illustration below.

In [None]:
# example to illustrate scaled attention
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1)  #* head_size**-0.5

In [None]:
k.var()

tensor(1.0449)

In [None]:
q.var()

tensor(1.0700)

In [None]:
wei.var()

tensor(17.4690)
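A variance near 17 is what a short calculation suggests. As a sketch, assume the entries of a query vector q and a key vector k are independent with zero mean and unit variance (which holds for `torch.randn` up to sampling noise), and write d for `head_size`:

$$
\operatorname{Var}(q \cdot k) \;=\; \sum_{i=1}^{d} \operatorname{Var}(q_i k_i) \;=\; \sum_{i=1}^{d} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] \;=\; d
$$

$$
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d}}\right) \;=\; \frac{d}{d} \;=\; 1
$$

With d = 16 the unscaled scores should have variance close to 16, in line with the measured 17.4690, and multiplying by `head_size**-0.5` brings the variance back to about 1.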

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])/8, dim=-1) # dividing by 8 flattens the distribution towards uniform; multiplying by 8 would make it peaky, close to one-hot

tensor([0.1999, 0.1925, 0.2049, 0.1925, 0.2101])
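The effect of the scale on softmax sharpness can be compared side by side. This plain-Python sketch applies three scale factors to the same logits; the factor 8 is just an illustrative choice.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.1, -0.2, 0.3, -0.2, 0.5]
for scale in (1 / 8, 1, 8):
    probs = softmax([scale * z for z in logits])
    print(scale, [round(p, 4) for p in probs])
```

Small scores give a nearly uniform distribution, while large scores concentrate almost all the weight on the largest entry. That second regime is what unscaled attention scores with large variance risk falling into.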

### Add attention magic

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F


text = """GPT, short for Generative Pre-trained Transformer, represents a
       groundbreaking advancement in the field of artificial intelligence
       and natural language processing. Developed by OpenAI, GPT is designed
       to understand, generate, and interpret human language with remarkable
       accuracy and fluency. It operates on the principle of machine learning,
       where the model is initially pre-trained on a vast corpus of text data.
       This pre-training enables GPT to grasp the intricacies of language,
       including grammar, context, and even subtleties like humor and sarcasm.
       Following the pre-training phase, GPT undergoes fine-tuning, where it is
       further trained on a smaller, more specialized dataset to perform
       specific tasks like translation, question-answering, and content
       creation. What sets GPT apart is its deep learning architecture,
       which consists of multiple layers of transformers—hence the name.
       These transformers allow the model to process and analyze text in a
       highly efficient and nuanced manner, making GPT capable of generating
       text that is often indistinguishable from that written by humans.
       As technology evolves, GPT continues to push the boundaries of what
       artificial intelligence can achieve in understanding and mimicking human language."""

chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Characters from the sentence:", "".join(chars))
print("vocab_size from the sentence: ", vocab_size)

Characters from the sentence: 
 ,-.ADFGIOPTWabcdefghiklmnopqrstuvwxyz—
vocab_size from the sentence:  40


In [None]:
m = nn.Linear(20, 30)
input = torch.randn(128, 20)
output = m(input)
print(output.size())

torch.Size([128, 30])


In [None]:
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

data = torch.tensor(encode(text), dtype=torch.long)
train_data = data
def get_batch():
    # generate a small batch of data of inputs x and targets y
    data = train_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = len(data) - 1 # what is the maximum context length for predictions?
# block_size = 192 # what is the maximum context length for predictions?
device = 'cuda' if torch.cuda.is_available() else 'cpu'

n_embd = 64 # embd_dim
n_head = 1
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,H)
        q = self.query(x) # (B,T,H)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # scale by head_size: (B, T, H) @ (B, H, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,H)
        out = wei @ v # (B, T, T) @ (B, T, H) -> (B, T, H)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

        # projects the concatenated output back to the original embedding size.
        self.proj = nn.Linear(n_embd, n_embd)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a small MLP: linear layer, ReLU non-linearity, linear layer, dropout """

    def __init__(self, n_embd):
        super().__init__()
        # mimic original paper: The dimensionality of input and output is d_model = 512,
        # and the inner-layer has dimensionality d_ff = 2048.
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation
    (the contents of the solid-line box in the paper's architecture figure) """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)

        # Layer normalization ensures that each feature has zero mean and unit
        # variance for each individual sample, making it effective for
        # stabilizing and accelerating the training of neural networks.
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # pre-norm variant: LayerNorm is applied before attention and feed-forward,
        # whereas the original paper applies it after the residual addition

        # don't forget the residual path
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class BabyGPT(nn.Module):

    def __init__(self):
        super().__init__()
        # tokens and positions are embedded into n_embd-dimensional vectors; lm_head maps them to logits at the end
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
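To get a feel for the size of this model, the trainable parameters can be counted from the layer shapes alone. The sketch below assumes the layer layout of the classes above (bias-free key, query and value layers, biased projection, feed-forward and output layers, and `n_head * head_size == n_embd`). `count_parameters` is a hypothetical helper, and `block_size=256` is an illustrative value rather than the notebook's `len(data) - 1`.

```python
def count_parameters(vocab_size, block_size, n_embd, n_head, n_layer):
    """Count trainable parameters for the layer layout of the BabyGPT class above."""
    head_size = n_embd // n_head
    embeddings = vocab_size * n_embd + block_size * n_embd
    attention = n_head * 3 * n_embd * head_size          # key, query, value (no bias)
    attention += n_embd * n_embd + n_embd                # output projection with bias
    feed_forward = n_embd * 4 * n_embd + 4 * n_embd      # first linear layer
    feed_forward += 4 * n_embd * n_embd + n_embd         # second linear layer
    layer_norms = 2 * 2 * n_embd                         # ln1 and ln2: weight + bias each
    block = attention + feed_forward + layer_norms
    final = 2 * n_embd + n_embd * vocab_size + vocab_size  # ln_f and lm_head
    return embeddings + n_layer * block + final


# block_size=256 is a made-up value for illustration
print(count_parameters(vocab_size=40, block_size=256, n_embd=64, n_head=1, n_layer=4))
```

The `tril` mask is registered as a buffer, not a parameter, so it is not counted. On a real model the figure can be cross-checked with `sum(p.numel() for p in m.parameters())`.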

### Let's train our BabyGPT

In [None]:
m = BabyGPT()

In [None]:
import torch
from tqdm import tqdm

batch_size = 1
device = 'cuda' if torch.cuda.is_available() else 'cpu'
m = m.to(device)
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

# Use tqdm to add a progress bar and display intermediate loss values in the progress bar
pbar = tqdm(range(1000))
for steps in pbar:
    # sample a batch of data
    xb, yb = get_batch()
    xb, yb = xb.to(device), yb.to(device)
    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    # Update the progress bar description to show the current loss value
    pbar.set_description(f"Loss: {loss.item():.4f}")

print(loss.item())

Loss: 0.0010: 100%|██████████| 1000/1000 [00:18<00:00, 53.11it/s]

0.0009724152041599154





In [None]:
start_id = torch.tensor([encode('GPT')], dtype=torch.long)
start_id = start_id.to(device)
print(decode(m.generate(idx = start_id, max_new_tokens=20)[0].tolist()))

GPT, short for Generati
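The near-zero loss and the verbatim continuation are consistent with memorization. With `block_size = len(data) - 1`, `get_batch` can only draw start index 0, so every training step sees the same window. The sketch below counts the available windows; `distinct_windows` is a hypothetical helper and the corpus length of 1000 is a made-up value.

```python
def distinct_windows(data_len, block_size):
    """Number of start indices get_batch can draw (randint's exclusive upper bound)."""
    return data_len - block_size


print(distinct_windows(data_len=18, block_size=5))     # 13 windows for the one-sentence demo
n = 1000                                               # hypothetical corpus length
print(distinct_windows(data_len=n, block_size=n - 1))  # 1: always the same window
```

A low training loss in this setting shows that the model can reproduce the paragraph, not that it generalizes to unseen text.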


### Averaged attention GPT

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F


text = 'GPT, short for Generative Pre-trained Transformer, represents a groundbreaking advancement in the field of artificial intelligence and natural language processing. Developed by OpenAI, GPT is designed to understand, generate, and interpret human language with remarkable accuracy and fluency. It operates on the principle of machine learning, where the model is initially pre-trained on a vast corpus of text data. This pre-training enables GPT to grasp the intricacies of language, including grammar, context, and even subtleties like humor and sarcasm. Following the pre-training phase, GPT undergoes fine-tuning, where it is further trained on a smaller, more specialized dataset to perform specific tasks like translation, question-answering, and content creation. What sets GPT apart is its deep learning architecture, which consists of multiple layers of transformers—hence the name. These transformers allow the model to process and analyze text in a highly efficient and nuanced manner, making GPT capable of generating text that is often indistinguishable from that written by humans. As technology evolves, GPT continues to push the boundaries of what artificial intelligence can achieve in understanding and mimicking human language.'
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Characters from the sentence:", "".join(chars))
print("vocab_size from the sentence: ", vocab_size)
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
data = torch.tensor(encode(text), dtype=torch.long)
train_data = data
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = len(data) - 1 # what is the maximum context length for predictions?
# block_size = 32 # what is the maximum context length for predictions?
device = 'cuda' if torch.cuda.is_available() else 'cpu'

n_embd = 64
n_head = 1
n_layer = 1
dropout = 0.0
# ------------

torch.manual_seed(1337)

class Head(nn.Module):
    """ one head of causal averaging: uniform weights over past tokens replace learned attention scores """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        # learned attention scores ("affinities") are disabled in this variant:
        # wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        # instead, each position gives equal weight to itself and all earlier positions
        wei = torch.tril(torch.ones(B, T, T, device=x.device)) # (B, T, T)
        wei = wei / torch.sum(wei, dim=-1, keepdim=True) # each row sums to 1
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class FoolBabyGPT(nn.Module):

    def __init__(self):
        super().__init__()
        # token embeddings plus learned position embeddings feed the transformer blocks
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

Characters from the sentence:  ,-.ADFGIOPTWabcdefghiklmnopqrstuvwxyz—
vocab_size from the sentence:  39
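
The `Head.forward` above skips the learned query-key scores and builds its weights from a normalized lower-triangular matrix. The sketch below is a minimal illustration of what that does, assuming a toy sequence length `T = 4` and made-up one-channel values `v` (neither comes from the notebook): each output row is simply the running mean of the values up to that position.

```python
import torch

T = 4  # toy sequence length, chosen only for illustration

# Same construction as in Head.forward, without the batch dimension
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=-1, keepdim=True)  # each row sums to 1

# Made-up "value" vectors with one channel per position
v = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

out = wei @ v  # (T, T) @ (T, 1) -> (T, 1)

# Running mean over positions 0..t, computed directly for comparison
prefix_mean = torch.cumsum(v, dim=0) / torch.arange(1, T + 1).unsqueeze(1)

print(wei)
print(out.flatten())
print(torch.allclose(out, prefix_mean))
```

Row `t` of `wei` holds `1/(t+1)` in its first `t+1` entries and zeros elsewhere, so the weights never depend on the token content. In this model the position embeddings are then the main signal that distinguishes one position from another.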


In [None]:
m = FoolBabyGPT()

In [None]:
import torch
from tqdm import tqdm

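# overrides the earlier batch_size = 16: with block_size = len(data) - 1 there is only
# one possible window, so extra batch rows would be identical copies
batch_size = 1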
device = 'cuda' if torch.cuda.is_available() else 'cpu'
m = m.to(device)
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

# add tqdm progress bar with loss
pbar = tqdm(range(1000))
for steps in pbar:
    # sample a batch of data
    xb, yb = get_batch('train')
    xb, yb = xb.to(device), yb.to(device)
    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    # update progress bar to display loss
    pbar.set_description(f"Loss: {loss.item():.4f}")

print(loss.item())

Loss: 0.0009: 100%|██████████| 1000/1000 [00:07<00:00, 133.63it/s]

0.0008613121462985873
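
To put the final loss in perspective: `F.cross_entropy` returns the mean negative log-probability (natural log) of the correct next character. The short sketch below, using only the loss value printed above and the vocabulary size of 39, converts that number into an average per-character probability and compares it with the loss of a uniform guess. The variable names are illustrative and do not appear elsewhere in the notebook.

```python
import math

final_loss = 0.0008613121462985873  # value printed by the training cell above
vocab_size = 39                     # value printed for this text

# exp(-loss) is the geometric-mean probability assigned to the correct character
avg_prob = math.exp(-final_loss)

# Loss of a model that assigns equal probability to every character
uniform_loss = math.log(vocab_size)

print(f"average probability of the correct character: {avg_prob:.5f}")
print(f"loss of a uniform guess: {uniform_loss:.4f}")
```

A loss this close to zero on a single training window suggests the model has essentially memorized the text, which is the point of this exercise rather than a sign of generalization.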





In [None]:
start_id = torch.tensor([encode('GPT')], dtype=torch.long)
start_id = start_id.to(device)
print(decode(m.generate(idx = start_id, max_new_tokens=500)[0].tolist()))

GPT, short for Generative Pre-trained Transformer, represents a groundbreaking advancement in the field of artificial intelligence and natural language processing. Deic oped by OpenAI, GPT is designed to understand, generate, and interpret r man language with remarkable accuracy and fluency. It operates on the principle of machine learning, where the model is initially pre-trained on a vast corpus of text data. This pre-training enables GPT to grasp the intricacies of language, including grammar, c
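
The sampled text above matches the training text almost everywhere but slips in a few places (for example "Deic oped" instead of "Developed"), which is plausible given that `generate` samples from the softmax distribution instead of always taking the most likely character. The sketch below shows one way to locate such slips; `mismatches` is a hypothetical helper written for this illustration, and the two strings are short excerpts copied from the text and output above.

```python
def mismatches(reference, generated):
    """Return (index, expected_char, generated_char) for every position that differs.

    zip stops at the shorter string, so trailing extra characters are ignored.
    """
    return [
        (i, r, g)
        for i, (r, g) in enumerate(zip(reference, generated))
        if r != g
    ]

reference = "Developed by OpenAI"  # excerpt from the training text
generated = "Deic oped by OpenAI"  # same span in the sampled output

diffs = mismatches(reference, generated)
print(diffs)
print(f"{len(diffs)} of {len(reference)} characters differ")
```

Applied to the full strings, the same helper would give a simple character-level accuracy for the memorized passage.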


### Reference
https://github.com/karpathy/nanoGPT