<a href="https://colab.research.google.com/github/royam0820/Build_a_Mini_GPT/blob/main/01_charLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Character-Level Language Model (Bigram)
Fast first win on *Alice in Wonderland*.


In [None]:
# Uninstall conflicting libraries
!pip uninstall -y fastai torchaudio torchvision

# Reinstall torch and compatible versions of fastai, torchaudio, and torchvision
!pip install torch==2.9.0 fastai torchaudio torchvision --upgrade

In [4]:
# Setup
!pip -q install torch --upgrade
import math, random, textwrap, requests
import torch, torch.nn as nn, torch.nn.functional as F
from tqdm.auto import tqdm # progress bar
device = "cuda" if torch.cuda.is_available() else "cpu"; torch.manual_seed(1337); print("Device:", device)

Device: cuda


NB: `torch.manual_seed(1337)` sets PyTorch's random seed to 1337. This makes random number generation repeatable from run to run on the same setup, which helps when debugging and comparing results.

In [5]:
# Dataset : data loading, preprocessing, and preparation for training the language model.

# Fetching the text of "Alice's Adventures in Wonderland" from Project Gutenberg
text = requests.get("https://www.gutenberg.org/files/11/11-0.txt").text
text = text[:300_000]
chars = sorted(list(set(text))); vocab_size = len(chars)

# dictionaries:  string to integer (stoi) and integer to string (itos)
stoi = {ch:i for i,ch in enumerate(chars)}; itos = {i:ch for ch,i in stoi.items()}

# encoding string to tensor
encode = lambda s: torch.tensor([stoi[c] for c in s], dtype=torch.long)
# decoding tensor to string
decode = lambda t: "".join(itos[int(i)] for i in t)

# convert the entire text into a PyTorch tensor
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
# splitting the dataset - 90% is used for training
n = int(0.9*len(data)); train_data, val_data = data[:n], data[n:]
batch_size, block_size = 64, 1

def get_batch(split):
    src = train_data if split=="train" else val_data
    ix = torch.randint(len(src)-block_size-1, (batch_size,))
    x = torch.stack([src[i:i+block_size] for i in ix])
    y = torch.stack([src[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

NB: `batch_size, block_size = 64, 1`
This line defines `batch_size` (the number of independent sequences processed in parallel) and `block_size` (the maximum context length for predictions). In this bigram model, `block_size` is 1, meaning the model considers only the current character when predicting the next one.

`def get_batch(split)`: This function generates a batch of data for training or validation. It randomly selects `batch_size` starting indices from the chosen split (`train_data` or `val_data`) and builds the input tensor `x` and the target tensor `y`, where `y` is `x` shifted one position to the right. With `block_size = 1`, each row of `x` holds a single character and the matching row of `y` holds the character that follows it.
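To make the shift between `x` and `y` concrete, here is a minimal sketch of the same sampling idea on a toy string, written with plain Python lists instead of tensors. The names `toy_text` and `toy_batch` are hypothetical and are not part of the notebook.

```python
import random

toy_text = "hello world"   # hypothetical stand-in for the encoded corpus
block_size = 1             # bigram setting, as in the notebook
batch_size = 4

def toy_batch(src, batch_size, block_size, rng):
    """Pick random start positions; each target is its input shifted by one."""
    # Every start must leave room for the shifted target slice.
    starts = [rng.randrange(len(src) - block_size) for _ in range(batch_size)]
    x = [src[i:i + block_size] for i in starts]
    y = [src[i + 1:i + block_size + 1] for i in starts]
    return starts, x, y

rng = random.Random(0)
starts, x, y = toy_batch(toy_text, batch_size, block_size, rng)
for i, xi, yi in zip(starts, x, y):
    print(f"start={i:2d}  x={xi!r}  y={yi!r}")
```

Each printed pair is one training example: the model sees `x` and is scored on how much probability it assigns to `y`.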

In [6]:
# dataset split (training and evaluation)
len(train_data), len(val_data)

(130226, 14470)
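As a quick sanity check, the two sizes printed above add up to 144,696 characters, and applying the notebook's `int(0.9*len(data))` rule to that total reproduces the same split. Since the total is below 300,000, the `text[:300_000]` cap did not truncate anything in this run. The variable `total_chars` is introduced here only for illustration.

```python
total_chars = 130226 + 14470   # sizes printed by the cell above
n = int(0.9 * total_chars)     # same formula as the notebook's split
print(total_chars, n, total_chars - n)
```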

Subclassing `nn.Module` is the standard way to define a neural network model in PyTorch.

In [8]:
# The Bigram Language Model
class BigramLM(nn.Module):
    def __init__(self, vocab_size):
        super().__init__(); self.token_emb = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_emb(idx); loss=None
        if targets is not None: loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens=200, temperature=1.0):
        for _ in range(max_new_tokens):
            logits,_ = self(idx); logits = logits[:,-1,:]/temperature
            probs = F.softmax(logits, dim=-1); nxt = torch.multinomial(probs, 1)
            idx = torch.cat([idx, nxt], dim=1)
        return idx

NB: `super().__init__()`: This line calls the constructor of the parent class (nn.Module), which is necessary for proper initialization.

`self.token_emb = nn.Embedding(vocab_size, vocab_size)`: This line creates an embedding layer, which is essentially a trainable lookup table. It takes an integer representing a character and returns that character's row of the table, a vector of size `vocab_size`. Because the embedding dimension equals the vocabulary size, each row can be read directly as the logits (unnormalized scores) for the next character. The rows are dense trainable vectors, not one-hot vectors: looking up row `i` is equivalent to multiplying a one-hot encoding of character `i` by a trainable `vocab_size` x `vocab_size` weight matrix.
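The lookup-table view can be checked on a tiny example. The sketch below uses plain Python lists and a hypothetical 3 x 3 table `W` (the values are made up) to show that fetching row `i` gives the same vector as multiplying a one-hot encoding of `i` by the table. It illustrates the idea only and is not how `nn.Embedding` is implemented.

```python
vocab_size = 3
# Hypothetical weight table: row i holds the next-character scores for character i.
W = [
    [0.1, 2.0, -1.0],
    [1.5, 0.0, 0.3],
    [-0.7, 0.2, 0.9],
]

def lookup(table, i):
    """What an embedding lookup does conceptually: return row i."""
    return table[i]

def one_hot_matmul(table, i):
    """Same result via a one-hot row vector multiplied by the table."""
    one_hot = [1.0 if k == i else 0.0 for k in range(len(table))]
    return [sum(one_hot[k] * table[k][j] for k in range(len(table)))
            for j in range(len(table[0]))]

for i in range(vocab_size):
    assert lookup(W, i) == one_hot_matmul(W, i)
print(lookup(W, 1))
```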

`def forward(self, idx, targets=None)`: This is the forward pass method of the model. It defines how the input data flows through the model.
- `idx`: This is the input tensor, representing the sequence of characters (as integer indices). For a bigram model, each element in the sequence is a single character.
- `targets=None`: This is an optional argument for the target characters (the characters the model is trying to predict). It's used during training to calculate the loss.
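- `logits = self.token_emb(idx)`: This line passes the input indices `idx` through the embedding layer (`self.token_emb`). It retrieves the embedding vector for each input character. The output `logits` has shape (`batch_size`, `block_size`, `vocab_size`).
- `loss = None`: Initializes the loss to `None`; it stays `None` when no targets are given.
- `loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))`: When `targets` is provided, the logits are flattened to shape (`batch_size * block_size`, `vocab_size`) and the targets to (`batch_size * block_size`,), and the cross-entropy between them becomes the training loss.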
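For intuition about the loss value, here is a minimal plain-Python sketch of cross-entropy for a single prediction; with its default settings, `F.cross_entropy` averages this quantity over all rows of the flattened batch. The helper `cross_entropy` and the 5-character vocabulary are assumptions made for illustration.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab = 5                        # hypothetical small vocabulary
uniform_logits = [0.0] * vocab   # a model with no preference
print(cross_entropy(uniform_logits, 2), math.log(vocab))

confident = [0.0, 0.0, 6.0, 0.0, 0.0]
print(cross_entropy(confident, 2))  # small loss: the target has the top score
print(cross_entropy(confident, 0))  # large loss: the target has a low score
```

A model that spreads probability evenly over the vocabulary has a loss of `ln(vocab_size)`, which gives a useful reference point when reading the training log.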

`@torch.no_grad()`: This decorator indicates that the following method (generate) should not track gradients. This is important for inference (generating new text) because we don't need to compute gradients for backpropagation during this phase.

`def generate(self, idx, max_new_tokens=200, temperature=1.0)`: This method generates new text based on an initial input sequence.
- `idx`: The initial input tensor (the context) to start generating from.
- `max_new_tokens=200`: The maximum number of new characters to generate.
- `temperature=1.0`: Controls the randomness of the generated text. Higher temperatures lead to more random outputs, while lower temperatures make the output more deterministic.
- `for _ in range(max_new_tokens)`: This loop runs for the specified number of new tokens to generate.
- `logits,_ = self(idx)`: Performs a forward pass of the model using the current context idx. It gets the logits (predictions) for the next character. We only need the logits here, so we ignore the loss.
- `logits = logits[:,-1,:]/temperature`: Takes the logits at the last position in the sequence, which score the candidates for the next character, and applies the temperature. Dividing by the temperature rescales the logits, making the probability distribution sharper (lower temperature) or smoother (higher temperature).
- `probs = F.softmax(logits, dim=-1)`: Converts the logits into a probability distribution over the vocabulary using the softmax function.
- `nxt = torch.multinomial(probs, 1)`: Samples the next character from the probability distribution using multinomial sampling. This introduces randomness based on the probabilities.
- `idx = torch.cat([idx, nxt], dim=1)`: Appends the newly generated character (`nxt`) to the input sequence idx, which becomes the new context for the next step of generation.
- `return idx`: After generating max_new_tokens, the method returns the complete generated sequence (including the initial context).
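The effect of temperature is easiest to see on a small set of scores. The sketch below is a plain-Python approximation of the softmax and sampling steps; `softmax_with_temperature` and the three example logits are hypothetical, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize with softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
for temp in [0.7, 1.0, 1.3]:
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])

# Sampling step: draw one index in proportion to its probability.
rng = random.Random(0)
probs = softmax_with_temperature(logits, 1.0)
next_index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print("sampled index:", next_index)
```

At the lower temperature the top-scoring character takes a larger share of the probability mass, and at the higher temperature the mass spreads toward the other characters.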

In summary, this BigramLM class defines a simple neural network model that learns to predict the next character based on the current character. The forward method is used for training and calculating the loss, while the generate method is used for generating new text by iteratively predicting and appending the next character.



In [9]:
# Training

# model instantiation
model = BigramLM(vocab_size).to(device)

# the optimization algorithm
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

# number of training steps
max_steps, eval_interval = 2000, 200

# estimating the loss
@torch.no_grad()
def estimate_loss():
    out={}; model.eval()
    for split in ["train","val"]:
        losses=[];
        for _ in range(20):
            xb,yb = get_batch(split); _,loss = model(xb,yb); losses.append(loss.item())
        out[split]=sum(losses)/len(losses)
    model.train(); return out

for step in range(max_steps):
    if step % eval_interval == 0:
        l = estimate_loss(); print(f"step {step:4d}: train {l['train']:.3f} | val {l['val']:.3f}")
    xb,yb = get_batch("train"); _,loss = model(xb,yb)
    opt.zero_grad(set_to_none=True); loss.backward(); opt.step()

print("Training done.")

step    0: train 4.721 | val 4.750
step  200: train 3.314 | val 3.354
step  400: train 2.768 | val 2.802
step  600: train 2.575 | val 2.539
step  800: train 2.620 | val 2.640
step 1000: train 2.429 | val 2.565
step 1200: train 2.473 | val 2.554
step 1400: train 2.451 | val 2.471
step 1600: train 2.487 | val 2.458
step 1800: train 2.426 | val 2.520
Training done.


You can tune this model by adjusting the hyperparameters used in the training cell. The most common ones to experiment with are:
- **Learning Rate (lr)**: Controls the step size of the optimizer. A smaller learning rate might lead to finer tuning, while a larger one can speed up training but might overshoot the optimal parameters.
- **Number of Training Steps (max_steps)**: Determines how long the model trains. More steps can lead to better performance but also take more time.
- **Batch Size (batch_size)**: The number of samples processed in each training iteration. Larger batch sizes can provide a more accurate estimate of the gradient but require more memory.
- **Evaluation Interval (eval_interval)**: How often the model is evaluated during training. More frequent evaluation gives better insight into the training progress but adds overhead.
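One way to keep these settings in a single place while experimenting is a small config object. This is only a sketch: `TrainConfig` is a hypothetical container that the notebook does not define, and its defaults copy the values used above.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:  # hypothetical container, not part of the notebook
    lr: float = 1e-2
    max_steps: int = 2000
    eval_interval: int = 200
    batch_size: int = 64
    block_size: int = 1

cfg = TrainConfig()

# Steps at which the training loop reports the estimated loss.
eval_steps = list(range(0, cfg.max_steps, cfg.eval_interval))
print(len(eval_steps), eval_steps[-1])

# Number of (input, target) character pairs sampled over the whole run.
print(cfg.max_steps * cfg.batch_size * cfg.block_size)
```

With the notebook's values this gives ten loss reports, the last at step 1800, which matches the training log.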

In [10]:
# Inference
for temp in [0.7,1.0,1.3]:
    # create an initial context tensor (ctx) for text generation
    ctx = torch.tensor([[torch.randint(vocab_size, (1,)).item()]], device=device)
    # generate new text
    out = model.generate(ctx, 300, temperature=temp)[0].tolist()
    print(f"\n=== SAMPLE (temp={temp}) ===\n"); print(textwrap.fill(decode(out), width=90))



=== SAMPLE (temp=0.7) ===

—oraisant y, “I re, as w it wou  waner that t it Qut oule he be t ck Whico te prlery t
owan howour, _ are t t wasainowousut’ly “Whinoulofa the t se her isthes, _bed s_ ngithis
asalery llyE touseseld ice sthintre her g ge sthing. gingh hin hat sonke id ng, rote
inghed t t athe s bor tlf tht rs shoonga

=== SAMPLE (temp=1.0) ===

3ghe win’s t wiAldd th ‘ù1xitheshaws win, shile shay bushand prok sshioimpat a waim the t
wariore t agNSusor, wa Nom: ly Qut. maitwsth!” artZ)  t  mine  the’llishitopoce hof tte
izzFims onel) t tas! stf in tin,f, hemphowiele I t htho n it satht
h-tyishentlleshougsousthorerusted an oithypot she.—” hit

=== SAMPLE (temp=1.3) ===

e-Fa n’tthand-)‘Ww g. o tCd nkùHogel deeneynclute‘_; whidg—” aid:]Allw‘Jtheen Dof
wquthe.nerve pha pats! abPugofJ” tondou Alù3*[NDifralled athelo iùD, shele.” ouceshelen
“br ac-Ca “rr h f s bea “She ile,” lleLaggo id sid t,” D” y Grn, chas. s f s—lo )
iEdiniliA whothan oosuled, d—ERIois ideno _ athid
