##### Build an LM that produces Shakespeare-like prose.
  
`Lecture`

https://www.youtube.com/watch?v=kCc8FmEb1nY&t=14s

`Workflow` 

Running notebooks in VSCode (https://www.youtube.com/watch?v=-j6y-5t37os&t=1s):

* Initiate VENV in Terminal
* Start Jupyter Notebook
* Copy url of the server into VSCode (server button on bottom-middle)
* Select Kernel: Python 3 (ipykernel) for Jupyter session

`Dataset` - 

The Shakespeare corpus

In [2]:
# Dataset
with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print("length of dataset in characters: ", len(text))

# Chars
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("vocab size (unique chars): ", vocab_size)

length of dataset in characters:  1115394
vocab size (unique chars):  65


`Tokenization` -

Mapping chars to a sequence of integers.

It is common to do sub-words.

E.g., Google uses `SentencePiece`

`TikToken (OAI)` 
* Uses sub-words
* ~50k tokens (https://news.ycombinator.com/item?id=34008839), a larger codebook size 
* Also see HuggingFace tokenizers: https://huggingface.co/docs/tokenizers/quicktour
* Note: with the gpt2 encoding, `Hi There` is encoded to 2 tokens, with ids in the range `0 - 50256`

In [3]:
import tiktoken 
enc = tiktoken.get_encoding('gpt2')
print(enc.n_vocab)
print(enc.encode('Hi There'))

50257
[17250, 1318]


`Ours`
* We use a char-level tokenizer
* Small number of tokens, a small codebook
* With char encoding, `Hi There` is encoded to 8 tokens, with ids in the range `0 - 64`

In [4]:
# char mapping
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

# encode and decode
encode = lambda s: [stoi[c] for c in s] # encode a string as a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decode a list of integers as a string

print(encode("Hi There"))
print(decode(encode("Hi There")))

[20, 47, 1, 32, 46, 43, 56, 43]
Hi There


`Dataset`

The works of Shakespeare.

In [8]:
# Encode dataset and wrap in Torch tensor
import torch, numpy as np
data = torch.tensor(encode(text),dtype=torch.long)
print(data.shape,data.dtype)
 
# Create splits
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

torch.Size([1115394]) torch.int64


`Input`
 
We work with dataset chunks (or `blocks`).

The transformer never sees more than `block_size` tokens of context when predicting the next one.

In [9]:
# First 9 chars in train set
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [33]:
# Each chunk of 9 chars has 8 examples
x = train_data[:block_size]
y = train_data[1:block_size+1]
for i in range(block_size):
    context = x[:i+1]
    target = y[i]
    print(f"when input is {context} and output is {target}")

when input is tensor([18]) and output is 47
when input is tensor([18, 47]) and output is 56
when input is tensor([18, 47, 56]) and output is 57
when input is tensor([18, 47, 56, 57]) and output is 58
when input is tensor([18, 47, 56, 57, 58]) and output is 1
when input is tensor([18, 47, 56, 57, 58,  1]) and output is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) and output is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) and output is 58


Each `4 x 8` input tensor has 32 examples.

In [10]:
torch.manual_seed(1337)
# how many independent sequences we process in each forward/backward pass of the transformer
batch_size = 4 
# what is the maximum context length for predictions
block_size = 8 
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    # 4 random start positions  
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)
print('----')
for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

`Bigram LM`
 
* The `embedding table` holds, for each char, the scores for the next char
* So, for each input char we just pluck out its row from the embedding table
* That row gives us the raw scores (logits) for the `next char`
 
`Loss`
 
* Apply `softmax` to the logits (exponentiate, then normalize), resulting in a probability distribution over the next char
* Then, just fetch the probability of picking the *correct* `next char`
* Multiply the probabilities (or, in practice, sum the log probabilities) of the correct next char across the batch
* If everything is correct, we want `0 loss`, which works out b/c `log(P=1 for correct char) = 0`
* This is the `negative log likelihood (NLL)`
* `Cross entropy loss` just rolls `softmax` normalization and `NLL` into one step
* E.g., `loss = F.cross_entropy(logits, targets)`
* By default, `F.cross_entropy` returns the mean loss over all elements of the batch
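
The steps in the list above can be checked numerically. This is a minimal sketch on made-up random logits and targets (not the notebook's data). It compares a manual softmax plus mean negative log-probability against `F.cross_entropy`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical toy batch: 32 positions, 65 possible next chars
logits = torch.randn(32, 65)
targets = torch.randint(0, 65, (32,))

# Softmax: exponentiate and normalize each row into probabilities
probs = F.softmax(logits, dim=-1)
# Fetch the probability assigned to the correct next char at each position
p_correct = probs[torch.arange(32), targets]
# NLL: mean of the negative log-probabilities
manual_loss = -p_correct.log().mean()

# Cross entropy does both steps in one call
builtin_loss = F.cross_entropy(logits, targets)
print(manual_loss.item(), builtin_loss.item())
```

The two numbers should agree up to floating-point error.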

`Class`:

* In the class `BigramLangModel`, `forward` is the method that runs when `logits, loss = self(idx)` is used in the `generate` function.
* This is because calling an `nn.Module` instance invokes its `forward` method.
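
A small sketch of that call behavior, using a hypothetical toy module (`Doubler` is made up for illustration and is not part of the notebook):

```python
import torch
import torch.nn as nn

class Doubler(nn.Module):  # hypothetical toy module, for illustration only
    def forward(self, x):
        return 2 * x

m_toy = Doubler()
x_toy = torch.tensor([1.0, 2.0])
# Calling the module goes through nn.Module.__call__, which runs forward()
out_call = m_toy(x_toy)
out_forward = m_toy.forward(x_toy)
print(out_call, out_forward)
```

Calling the instance is preferred over calling `forward` directly, since `__call__` also runs any registered hooks.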

In [35]:
import torch.nn as nn 
import torch.nn.functional as F

torch.manual_seed(1337)

class BigramLangModel(nn.Module):

    def __init__(self, vocab_size):

        super().__init__()
        # Each token just looks up scores in embedding table
        # The scores are the logits for the next char 
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B, T) tensors of integers
        # B=batch (4), T=time/context (8), C=vocab_size (65)
        logits = self.token_embedding_table(idx) # (B,T,C)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            # C logits per input char 
            logits = logits.view(B*T, C)
            # 1 output char target 
            targets = targets.view(B*T)
            # For batch of B*T = 4*8 = 32 chars 
            # Softmax: Compute probability from logits per char 
            # NLL: Look up probability of the correct char and average the negative log across the batch
            loss = F.cross_entropy(logits, targets)
        return logits, loss 

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # run forward pass to get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution, one prediction for what is next / batch 
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLangModel(vocab_size)
# Outputs 
logits, loss = m(xb,yb) 
print(logits.shape)
print(loss.item())
print(np.log(1/65))
# Generate 100 tokens from the untrained model
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([256, 65])
4.725085258483887
-4.174387269895637

SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ


Again the loss is the mean `-log(P_correct_char)`.

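For an untrained model we expect it to be close to a uniform guess, `-log(1/65) ≈ 4.17` (the cell above prints `np.log(1/65)`, the same value with the sign flipped). The observed 4.73 is a bit higher because the randomly initialized logits are not exactly uniform.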
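
As a quick check of that baseline, here is a sketch assuming a model with no preference at all (all-zero logits) and made-up targets:

```python
import math
import torch
import torch.nn.functional as F

vocab = 65  # vocab size reported earlier in these notes
# A model with no preference gives every char the same logit
uniform_logits = torch.zeros(32, vocab)
toy_targets = torch.randint(0, vocab, (32,))  # made-up targets
uniform_loss = F.cross_entropy(uniform_logits, toy_targets)
print(uniform_loss.item(), math.log(vocab))
```

Both printed values are `ln(65)`, whatever the targets are, because every char gets probability `1/65`.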

Note, we can easily train. 

The weights are in the `token_embedding_table` attribute, which is an instance of the `nn.Embedding` module.

The embedding layer is initialized with random weights, which are updated during training.

Here the lookup is not used as a generic vector representation: each row is read directly as the raw logits for the next token given the current one.
  
So the weights are inside the `token_embedding_table` and can be accessed by calling `model.token_embedding_table.weight`.
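
A minimal sketch of that lookup, using a freshly created table rather than the trained model `m` (the token ids are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = 65
table = nn.Embedding(vocab, vocab)
print(table.weight.shape)  # one row of `vocab` logits per input token

idx = torch.tensor([[20, 47]])   # (B=1, T=2) arbitrary token ids
logits = table(idx)              # (1, 2, 65)
# The lookup returns exactly the rows of the weight matrix
same = torch.equal(logits[0, 0], table.weight[20])
print(same)
```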

In [36]:

# Create pytorch optimizer
optimizer = torch.optim.Adam(m.parameters(), lr=1e-3)
# Set up training loop 
batch_size = 32
for steps in range(10000): 
    # Sample a batch of data
    xb, yb = get_batch('train')
    # Evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
print(loss.item())

2.3796486854553223


The model is making progress! 

We can see the loss is lower than before.

Run inference on 100 chars.

In [64]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))



Sel ule-
ABOLaig,
PUCLEO hoofr rde!
Whbodishat, d, igof gr.
AMy me atheangrd mbt wofoun dapimurdu's


But it is only looking at the `last char` to predict the `next char`.

We can do better.

Let's introduce `self-attention`:
 
We can start with a toy batch: `B=4` sequences of `T=8` tokens, each with `C=2` channels.

In [37]:
torch.manual_seed(1337)
B,T,C = 4,8,2
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

We want the tokens to "talk" to each other.

In this case, information only flows from prior timesteps:
  
* E.g., The token in 5th location can only talk to prior tokens. 
 
* And to accumulate, we calculate the `average` of the channels in all prior tokens.

In [39]:
# "bag of words" (term for averaging) where each location has a word
xbow = torch.zeros((B,T,C))
# iterate over batch 
for b in range(B):
    # iterate over time 
    for t in range(T):
        # everything in this sequence up to and including position t
        xprev = x[b,:t+1]
        xbow[b,t] = torch.mean(xprev, 0)

Recall:

* Each token is a char 
* The model has `block_size = 8`, so it looks at 8 chars
* Each char is embedded
* So: one example to the model will simply be: `block_size x C` 

In [42]:
# We can see one example 
print(x[0].shape)
x[0]

torch.Size([8, 2])


tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

Then, we can see the mean for all tokens up to and including the current in the `bow` tensor.

In [76]:
# BOW for the first channel of the second token
print(np.mean([0.1808,-0.3596]))
xbow[0]

-0.0894


tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [78]:
torch.manual_seed(42)
a = torch.ones(3,3)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=',a)
print('b=',b)
print('c=',c)

a= tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
b= tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c= tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


Each entry of `c` is a dot product of a row of `a` with a column of `b`. E.g., row 1 of `a` and col 1 of `b`:
  
`[1., 1., 1.]` dot `[2., 6., 6.]` = `14`

Note, we can use  `tril` to perform the `BOW` sum for all tokens up to and including the current:

A lower-triangular `a` sums a variable number of rows of `b` and deposits the result into `c`.

We can normalize `a` so that the average gets deposited into `c`!

In [82]:
a = torch.tril(torch.ones(3,3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=',a)
print('b=',b)
print('c=',c)

a= tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b= tensor([[7., 6.],
        [9., 6.],
        [3., 1.]])
c= tensor([[7.0000, 6.0000],
        [8.0000, 6.0000],
        [6.3333, 4.3333]])


So `b` is our `block_size x C` tensor.

We want to accumulate the mean up to each token.

`wei` (short for weights) is a mask that enables this.

In [85]:
# So, again we had "bag of words" (term for averaging) where each location has a word
xbow = torch.zeros((B,T,C))
# iterate over batch 
for b in range(B):
    # iterate over time 
    for t in range(T):
        # everything in this sequence up to and including position t
        xprev = x[b,:t+1]
        xbow[b,t] = torch.mean(xprev, 0)
 
# Faster way to do the same thing! 
wei = torch.tril(torch.ones(T,T))
wei = wei / wei.sum(1, keepdim=True)
print(wei)
xbow2 = wei @ x # (T, T) @ (B, T, C): wei broadcasts over the batch => (B, T, C)
print("We can see they are identical:")
torch.allclose(xbow, xbow2)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
We can see they are identical:


True

We are basically doing a weighted sum. 

The weights come from `wei`, the row-normalized lower-triangular `T x T` matrix.

They are applied to `x`.

So, each element in the output `xbow2` is just the average of all tokens up to and including that position.

We can do this a third way:

In [88]:
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
# Every element where tril == 0 becomes -inf
wei = wei.masked_fill(tril == 0, float('-inf'))
# Softmax normalizes each row (exp, then divide by the row sum)
# This yields the same weights as above
wei = F.softmax(wei, dim=1)
xbow3 = wei @ x # (T, T) @ (B, T, C): wei broadcasts over the batch => (B, T, C)
torch.allclose(xbow, xbow3)

True

`wei` has interesting properties:

* Each value is an interaction strength
* It says how much from each token in the past we want to aggregate
* And we will not aggregate anything from future tokens
* These affinities are going to be learned

In short, this is just a weighted aggregate of past tokens.

This is the preview for self-attention!

In [90]:
# T x T 
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [95]:
# T x C 
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [97]:
xbow3[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

So, the output at `token 2`, for example, will be a weighted aggregate of past tokens.

For the first channel:

`[0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]` 

`@` 

`[ 0.1808, -0.3596,  0.6258,  0.9545,  0.3612, -1.3499,  0.2360, -0.9211]`

In [99]:
# Token 2:
0.1808 * 0.5 + -0.3596 * 0.5000

-0.0894

We can see this is in the output, `-0.0894`.

So, in this simple case: we aggregate equally (`0.5` each) from tokens 1 and 2 for the output at token 2.

In [43]:
iter = 1
losses={}
losses['train']=1
losses['val']=1
print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

step 1: train loss 1.0000, val loss 1.0000
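
The print format above is the kind of line a periodic evaluation step would emit. Below is a sketch of a helper that averages the loss over several batches per split. `estimate_loss`, `get_batch_toy`, the random stand-in data and the bare embedding table are all hypothetical and only there to make the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, block, batch = 65, 8, 4
# Stand-in data: random token ids instead of the Shakespeare tensor
splits = {'train': torch.randint(0, vocab, (1000,)),
          'val': torch.randint(0, vocab, (200,))}
table = nn.Embedding(vocab, vocab)   # stand-in bigram model

def get_batch_toy(split):
    d = splits[split]
    ix = torch.randint(len(d) - block, (batch,))
    x = torch.stack([d[i:i+block] for i in ix])
    y = torch.stack([d[i+1:i+block+1] for i in ix])
    return x, y

@torch.no_grad()   # no gradients are needed for evaluation
def estimate_loss(eval_iters=20):
    out = {}
    for split in ('train', 'val'):
        batch_losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch_toy(split)
            logits = table(xb)
            batch_losses[k] = F.cross_entropy(logits.view(-1, vocab), yb.view(-1))
        out[split] = batch_losses.mean().item()
    return out

losses = estimate_loss()
print(f"step 0: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
```

Averaging over many batches gives a much less noisy number than the single-batch `loss.item()` printed at the end of the training loop.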


`Self-attention`

Start with a 4x8 arrangement of tokens, each with 32 channels.

The cell below averages all past tokens (and the current one) to produce the output at each position.

In [102]:
torch.manual_seed(1337)
B,T,C = 4,8,32
x=torch.rand(B,T,C)
# mask 
tril = torch.tril(torch.ones(T,T))
# affinity matrix, initialized uniformly to zero
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x
out.shape 

torch.Size([4, 8, 32])

Here the affinities start at zero everywhere.

So, after masking and softmax, each token pays equal attention to itself and to each prior token.

But we want to select information from the past in a data-dependent way.
 
`Self-attention`
 
`Query`: what am I looking for.

`Key`: what do I contain.

Each query does a dot product with the key at every location (future locations are masked out afterwards).

If a `Key` and `Query` are aligned, they will produce a high value.

Then, I will learn more about that token!

We simply specify a `head_size` for this self-attention layer.

Then, we forward key and query on `x`: every token at every position in `B x T` produces a key and a query.

In [None]:
head_size = 16

# Input 
B,T,C = 4,8,32
x=torch.rand(B,T,C)
 
# Per token embeddings
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

# Apply to all tokens 
k = key(x) # B, T, 16
q = query(x) # B, T, 16

So, we have a B, T tensor of tokens.

We project each token to a Key and a Query.

We then let the Key and Query communicate across tokens.

In [44]:



# Weight, measuring affinity between tokens 
# Scale by 1/sqrt(head_size) so the softmax inputs keep roughly unit variance
wei = q @ k.transpose(-2,-1) * head_size**-0.5 # (B,T,16) @ (B, 16, T) -> (B, T, T)

# Mask future tokens since we are decoding 
tril = torch.tril(torch.ones(T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

# V gets aggregated for this self-attention head
v = value(x)
out = wei @ v
out.shape 

torch.Size([4, 8, 16])

In [48]:
x.shape

torch.Size([4, 8, 2])

`wei` is now `B x T x T`: one `T x T` affinity matrix per sequence in the batch.

The attention that token N pays to prior tokens is now data-dependent.

Specifically, it comes from the dot product of `q` and `k`, whose projections are learned.

In [119]:
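# Raw (unmasked) affinities for the first sequence.
# Note: scaled dot-product attention multiplies by head_size**-0.5;
# the output recorded below was produced with head_size**0.5, as written here.
wei = q @ k.transpose(-2,-1) * head_size**0.5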
wei[0]

tensor([[ 2.2924,  2.7755,  1.7500,  2.1041,  1.1474,  2.4048,  0.1968,  1.3166],
        [ 4.6807,  5.0168,  3.5054,  3.3999,  3.2272,  5.0683,  1.7219,  2.1858],
        [ 3.4592,  3.0671,  2.8662,  2.3773,  2.2361,  3.1440,  1.5106,  1.1534],
        [ 3.4917,  2.5129,  2.2739,  2.0483,  2.1242,  3.0338,  0.8786,  0.1875],
        [ 3.8347,  3.7708,  3.5620,  2.4863,  2.4752,  4.8359,  0.8099,  1.4257],
        [ 4.1176,  3.5148,  2.5157,  2.0536,  1.9548,  3.4721,  1.4021,  0.5436],
        [ 2.7520,  1.7684,  2.0399,  1.5793,  1.4491,  2.5530,  0.4618,  0.0639],
        [ 1.3894,  1.9540,  1.2960,  0.9153,  1.1477,  1.7391, -0.1393, -0.2193]],
       grad_fn=<SelectBackward0>)
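
On the scale factor: scaled dot-product attention conventionally multiplies the affinities by `head_size**-0.5`, so that the values fed to softmax keep roughly unit variance and the softmax does not saturate. This is a rough sketch using stand-in unit-variance `q` and `k` (random tensors, not the learned projections above):

```python
import torch

torch.manual_seed(0)
B, T, head_size = 64, 8, 16
# Stand-in unit-variance queries and keys (assumption: not the notebook's q, k)
q_toy = torch.randn(B, T, head_size)
k_toy = torch.randn(B, T, head_size)

raw = q_toy @ k_toy.transpose(-2, -1)   # variance grows with head_size
scaled = raw * head_size ** -0.5        # variance brought back near 1
print(raw.var().item(), scaled.var().item())
```

The first printed variance should be on the order of `head_size`, the second on the order of 1.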

Attention is a general message passing mechanism for any directed graph.
 
E.g.:

`token 8` is pointed to by itself, and all 7 prior nodes.

So, it can aggregate information from all nodes that point to it!

`x` is private information to `token 8`.

`q` is what `token 8` is interested in.

`k` is what `token 8` has.

`v` is what `token 8` will communicate to you if you find it interesting.

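Strong attention pairs have a high value of `q @ k`!

Also, attention has no notion of position by default; position has to be encoded separately (e.g., with positional embeddings).

Also, there is no communication across the batch dimension: each sequence in the batch is processed independently.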
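
One common way to supply the missing position information is a learned positional embedding added to the token embedding. This is a sketch under assumptions: `n_embd = 32` and the table names are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, block, n_embd = 65, 8, 32   # n_embd is an assumed embedding width
tok_table = nn.Embedding(vocab, n_embd)   # what the token is
pos_table = nn.Embedding(block, n_embd)   # where the token sits

idx = torch.randint(0, vocab, (4, block))   # (B, T) token ids
tok_emb = tok_table(idx)                    # (B, T, n_embd)
pos_emb = pos_table(torch.arange(block))    # (T, n_embd)
x_in = tok_emb + pos_emb                    # pos_emb broadcasts over the batch
print(x_in.shape)
```

The same `pos_emb` is added to every sequence in the batch, so identical chars at different positions get different inputs.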

`Encoder vs Decoder`

`decoder`

Typically just self-attention (communication) and feed-forward (compute).

Uses triangular mask on future tokens.
 
It has an auto-regressive property where we can sample from it.

The example here is decoder only.

`encoder`

What if we want to `condition` the decoding (above) on additional information?

If so, we can add an encoder.

E.g., an encoder reads French and creates tokens. All tokens can talk! No masking.

The encoder feeds `k`, `v` to the decoder via `cross-attention`.

Self-attention: the same source `x` produces `K`, `Q`, `V`.

In cross-attention, `Q` still comes from `x` while `K`, `V` come from a separate source.

With `cross-attention`, we pool from a separate source of nodes.

So, we condition decoding on:

1/ The past (as normal)

2/ The full encoded input sentence (e.g., in French)
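
A minimal sketch of a single cross-attention head as described above. It assumes random stand-in tensors for the decoder tokens and the encoder output; the sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T_dec, T_enc, C, head_size = 2, 8, 5, 32, 16
x_dec = torch.randn(B, T_dec, C)   # decoder-side tokens (stand-in data)
x_enc = torch.randn(B, T_enc, C)   # encoder output, e.g. the French sentence

query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q = query(x_dec)   # queries come from the decoder
k = key(x_enc)     # keys and values come from the encoder
v = value(x_enc)

wei = q @ k.transpose(-2, -1) * head_size ** -0.5   # (B, T_dec, T_enc)
wei = F.softmax(wei, dim=-1)   # no triangular mask: every encoder token is visible
out = wei @ v                  # (B, T_dec, head_size)
print(out.shape)
```

Note that the affinity matrix is no longer square: each decoder position gets one weight per encoder token.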

`Multi-head attention`

Compute many attentions in parallel. 

Basically, we create many channels of communication between the tokens. 

Concatenate the results. 
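
A sketch of how this could look in code, assuming the single masked head from earlier is wrapped in a module. The class names and sizes here are illustrative, not taken from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One masked self-attention head (sketch)."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask stored as a non-trainable buffer
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                        # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several heads run in parallel, concatenated along the channel dimension."""
    def __init__(self, num_heads, n_embd, head_size, block_size):
        super().__init__()
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)

torch.manual_seed(0)
mha = MultiHeadAttention(num_heads=4, n_embd=32, head_size=8, block_size=8)
out = mha(torch.randn(4, 8, 32))
print(out.shape)
```

With 4 heads of size 8, the concatenated output has the same 32 channels as the input, which keeps the block easy to stack.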

`Feed Forward`

Self-attention is the communication.

FF is the "thinking" or computing of this data.
 
The FF layer is applied to each token independently.

Self-attention (communication) and compute blocks are stacked.
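
A sketch of such a per-token feed-forward layer. The 4x hidden expansion and the ReLU are assumed, commonly used choices rather than something stated in these notes.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Per-token MLP: the same weights are applied at every position."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # 4x expansion is an assumed choice
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
ff = FeedForward(32)
x_ff = torch.randn(4, 8, 32)   # (B, T, C)
out = ff(x_ff)
# Tokens do not interact here: running one token alone gives the same result
alone = ff(x_ff[:, 3:4, :])
print(out.shape, torch.allclose(out[:, 3:4, :], alone, atol=1e-5))
```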

`Overfitting`

Two innovations (skip connections and layer norm) make large, deep networks easier to optimize; a third (dropout) addresses overfitting.
 
`1. Skip (residual) connections`

We use this to make deep networks easier to optimize.

We fork off from the main path, do some computation, and add the result back to the input.

Addition distributes gradients equally to both branches during backprop.

At the start of training, the residual blocks contribute very little.

So, there is a gradient superhighway back to the input.
 
But, over time, residual blocks kick in. 
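
A tiny sketch of the "gradient superhighway". The branch is zeroed by hand here, which is an assumption for illustration; in practice it merely starts small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
branch = nn.Linear(32, 32)
# Assumption for illustration: zero the branch so it contributes nothing at the start
nn.init.zeros_(branch.weight)
nn.init.zeros_(branch.bias)

x_res = torch.randn(4, 8, 32, requires_grad=True)
out = x_res + branch(x_res)   # skip connection: input is added to the branch output
out.sum().backward()

# With a silent branch, the output equals the input and the gradient passes straight through
print(torch.equal(out, x_res), torch.equal(x_res.grad, torch.ones_like(x_res)))
```

As the branch weights move away from zero during training, the block starts to add its own contribution on top of the identity path.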

`2. Layer norm`

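Very similar to batch norm (which normalizes each neuron's output to zero mean and unit variance across the batch), except that layer norm normalizes across the features of each token instead of across the batch.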
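
A quick sketch of what `nn.LayerNorm` does to a `(B, T, C)` tensor, using made-up input with an arbitrary scale and shift:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_ln = torch.randn(4, 8, 32) * 5 + 3   # (B, T, C) with arbitrary scale and shift
ln = nn.LayerNorm(32)                   # normalizes over the last (channel) dimension
y = ln(x_ln)

# Every token's feature vector now has roughly zero mean and unit variance
print(y.mean(-1).abs().max().item(), y.var(-1, unbiased=False).mean().item())
```

Because the statistics are computed per token, the result does not depend on what else is in the batch.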

`3. Dropout`
 
Regularization technique to prevent overfitting.

Effectively, this trains an ensemble of sub-networks.
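
A sketch of dropout's behavior in training versus evaluation mode (`p=0.2` is an assumed value for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)   # p is an assumed value for illustration
x_do = torch.ones(1000)

drop.train()               # training mode: randomly zero entries, rescale the rest by 1/(1-p)
y_train = drop(x_do)
drop.eval()                # eval mode: dropout is a no-op
y_eval = drop(x_do)

print((y_train == 0).float().mean().item(), torch.equal(y_eval, x_do))
```

Each training step drops a different random subset, which is where the "ensemble of sub-networks" picture comes from.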


Model size - 

Pretraining:

Ours: ~10M params; the dataset is ~1M tokens (chars), which would be roughly 300k tokens under a GPT-2-style sub-word vocabulary.

OAI: Up to 175B params (96 heads, 128 head size) using sub-word chunks (~50k vocab) and 300B tokens.

This is a document-completing decoder model, babbling "internet" text rather than Shakespeare.

It is not aligned.
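
As a rough sanity check on the 175B figure, here is some back-of-the-envelope arithmetic. It assumes 96 layers (a number not stated in these notes) and the common rule of thumb of about `12 * d_model^2` weights per transformer block. Embeddings and biases are ignored.

```python
n_head, head_size = 96, 128    # from the notes above
d_model = n_head * head_size   # embedding width
n_layer = 96                   # assumption: not stated in these notes
# Rule of thumb: ~4*d^2 attention weights + ~8*d^2 feed-forward weights per block
approx_params = 12 * n_layer * d_model ** 2
print(d_model, f"{approx_params / 1e9:.0f}B")
```

Under these assumptions the estimate lands close to the quoted size.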

Fine-tuning:

Fine-tuning is very sample-efficient.

In the second stage, we align it:

`SFT:` Q and A documents. ~Thousands. Align model to expect question and complete answer. 

`RLHF:` Ranked responses. A reward model predicts how desirable any response would be. Then run PPO to fine-tune the sampling policy, so that answers are expected to score a high reward.

Thus, the model goes from a document completer to a question-answer-er.
