In [None]:
# @title
from IPython.display import display, HTML

display(HTML("""
<script>
const firstCell = document.querySelector('.cell.code_cell');
if (firstCell) {
  firstCell.querySelector('.input').style.pointerEvents = 'none';
  firstCell.querySelector('.input').style.opacity = '0.5';
}
</script>
"""))

html = """
<div style="display:flex; flex-direction:column; align-items:center; text-align:center; gap:12px; padding:8px;">
  <h1 style="margin:0;">👋 Welcome to <span style="color:#1E88E5;">Algopath Coding Academy</span>!</h1>

  <img src="https://raw.githubusercontent.com/sshariqali/mnist_pretrained_model/main/algopath_logo.jpg"
       alt="Algopath Coding Academy Logo"
       width="400"
       style="border-radius:15px; box-shadow:0 4px 12px rgba(0,0,0,0.2); max-width:100%; height:auto;" />

  <p style="font-size:16px; margin:0;">
    <em>Empowering young minds to think creatively, code intelligently, and build the future with AI.</em>
  </p>
</div>
"""

display(HTML(html))

## Day 7 - Part 1: Introduction to Bigram Language Models

---

### 🎯 **Agenda for this Notebook**

| Section | Topic | Description |
|:-------:|-------|-------------|
| 1 | **Loading Modules** | Import PyTorch |
| 2 | **Loading & Exploring the Dataset** | Load Tiny Shakespeare and explore its characters and vocabulary |
| 3 | **Building Tokenization** | Create character-level encoder and decoder |
| 4 | **Data Preparation** | Encode data, split into train/validation, create batches |
| 5 | **Bigram Language Model** | Build our first neural network for text generation |
| 6 | **Training & Generation** | Train the model and generate Shakespeare-like text |

---

### 🎓 **Learning Objectives**

By the end of this notebook, you will:
- ✅ Understand how to prepare text data for language modeling
- ✅ Build a character-level tokenizer from scratch
- ✅ Implement your first neural language model (Bigram)
- ✅ Train the model and generate text

Let's begin our journey into the world of language models! 🚀

---
## Section 1: Loading Modules

Before we dive into language modeling, we need to import **PyTorch** - the deep learning framework we'll use throughout this course.

PyTorch provides:
- 🔢 Tensor operations (like NumPy, but with GPU support)
- 🧠 Neural network building blocks
- 📉 Automatic differentiation for training

In [1]:
import torch

---
## Section 2: Loading & Exploring the Dataset

Now that we have PyTorch ready, let's load our training data!

We'll use the **Tiny Shakespeare** dataset - a collection of Shakespeare's works combined into a single text file. This is a popular dataset for learning language modeling because:

- 📝 It's small enough to train quickly on a CPU
- 🎭 The text has interesting patterns and vocabulary
- 🔤 It uses a small set of distinct characters, which keeps character-level tokenization simple

In [2]:
# Load the tiny shakespeare dataset
dataset = "tiny_shakespeare.txt"

# Load the dataset into a string
with open(dataset, "r", encoding="utf-8") as f:
    text = f.read()

Let's see how big our dataset is:

In [3]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


Here's a preview of the first 500 characters:

In [4]:
# Print the first 500 characters
print(text[:500])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


### 🔤 Understanding the Vocabulary

Neural networks work with **numbers**, not characters. So we need to know:
1. What unique characters exist in our text?
2. How many unique characters are there? (This is our **vocabulary size**)

In [5]:
# Check all unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print(vocab_size)



 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


---
## Section 3: Building Tokenization (Encoder & Decoder)

Now comes a crucial step: **tokenization**. We need to convert characters to numbers and back.

**Why Tokenization?**
- Neural networks can only process numbers
- We need a consistent mapping: character → integer → character

**Our Approach:**
- **Encoder**: Takes a string → Returns a list of integers
- **Decoder**: Takes a list of integers → Returns a string

```
"hello" → Encoder → [46, 43, 50, 50, 53] → Decoder → "hello"
```

In [6]:
# create mapping from characters to integers

stoi = { ch: i for i, ch in enumerate(chars)}
itos = { i: ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

Let's test our encoder and decoder:

In [7]:
print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
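One thing worth knowing about a lookup-table tokenizer like this: it only understands characters it saw while building the vocabulary. The sketch below shows that behavior on a tiny made-up string (`sample_text` is a hypothetical stand-in for Tiny Shakespeare), using the same `stoi`/`itos` idea as above.

```python
# Hypothetical stand-in for the real dataset, just to keep the example small.
sample_text = "hello there, world"

chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer
itos = {i: ch for ch, i in stoi.items()}       # integer -> character

def encode(s):
    """Turn a string into a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)

# Round trip works for any text made of known characters.
round_trip = decode(encode("hello world"))

# A character that never appeared in sample_text ("!") has no id.
try:
    encode("hello!")
    unseen_char_failed = False
except KeyError:
    unseen_char_failed = True

print(round_trip, unseen_char_failed)
```

For Tiny Shakespeare this is rarely a problem in practice, because we only encode text drawn from the dataset itself or from its 65 characters.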


### 🌟 Bonus: How Real LLMs Tokenize

Our character-level tokenizer is simple but not very efficient. Real models like GPT-2 use **Byte Pair Encoding (BPE)** which creates tokens from common character sequences (subwords).

Let's see how OpenAI's GPT-2 tokenizer works:

In [8]:
# Check the GPT-2 vocabulary size using tiktoken (install with: pip install tiktoken)
import tiktoken

enc = tiktoken.get_encoding("gpt2")
enc.n_vocab

50257

In [9]:
enc.encode("hii there")

[71, 4178, 612]

In [10]:
enc.decode([71, 4178, 612])

'hii there'
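To get a feel for how BPE builds subword tokens, here is a toy sketch of a single merge step: find the most frequent adjacent pair and fuse it into one token. This is a simplified illustration, not how `tiktoken` is implemented. Real GPT-2 BPE works on bytes and applies a learned table of merges, and the helper names `most_frequent_pair` and `merge_pair` are made up for this example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent pair of tokens that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("aaabdaaabac")          # start from single characters
pair = most_frequent_pair(tokens)     # ('a', 'a')
merged = merge_pair(tokens, pair)

print(pair)
print(merged)  # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Repeating this step many times grows the vocabulary while shortening the sequences, which is the trade-off that separates our 65-token vocabulary from GPT-2's 50,257.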

---
## Section 4: Data Preparation

Now that we can convert text to numbers, let's prepare our data for training!

### Step 1: Encode the Entire Dataset
Convert all the Shakespeare text into a tensor of integers:

In [11]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:500])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

### Step 2: Split into Train and Validation Sets

We split our data to:
- **Train set (90%)**: Used to train the model
- **Validation set (10%)**: Used to check if the model generalizes well

In [14]:
# Split the Data into Train and Validation Sets
n = int(len(data) * 0.9)
train_data = data[:n]
val_data = data[n:]

print(len(train_data), len(val_data))

1003854 111540


### Step 3: Chunking the Data

We don't feed the entire text to the model at once - that would be computationally expensive!

Instead, we break the data into **chunks** (also called **blocks**). The maximum chunk length is called `block_size`.

**💡 Key Insight:** Each chunk actually contains MULTIPLE training examples!

For `block_size = 8`, one chunk gives us 8 examples:

| Input Context | Target |
|:--------------|:------:|
| [char_1] | char_2 |
| [char_1, char_2] | char_3 |
| [char_1, char_2, char_3] | char_4 |
| ... | ... |
| [char_1, ..., char_8] | char_9 |

This helps the model learn to predict from contexts of varying lengths!

In [15]:
block_size = 8
train_data[:block_size+1] # This chunk has multiple examples packed into it. There are 8 examples here.

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [16]:
x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When input is {context} the target is {target}")

When input is tensor([18]) the target is 47
When input is tensor([18, 47]) the target is 56
When input is tensor([18, 47, 56]) the target is 57
When input is tensor([18, 47, 56, 57]) the target is 58
When input is tensor([18, 47, 56, 57, 58]) the target is 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58


The output above lists every training example packed into this one chunk: 8 examples from 9 characters.

### Step 4: Batching

After chunking, we **batch** multiple chunks together to train in parallel.

**Why Batching?**
- 🚀 **Speed**: GPUs can process many examples simultaneously
- 📊 **Stability**: Averaging gradients over a batch gives smoother updates

**Dimensions:**
- `batch_size`: Number of chunks processed in parallel
- `block_size`: Length of each chunk

In [17]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of input-target pairs
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i+block_size] for i in ix])
    y = torch.stack([data[i+1 : i+block_size+1] for i in ix])
    return x, y

Let's generate a batch and see its shape:

In [18]:
xb, yb = get_batch('train')

print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
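A quick way to convince yourself that the targets are just the inputs shifted by one position is to check it in code. The sketch below repeats the batching logic on a made-up tensor (`torch.arange(100)` is a stand-in for the encoded text, chosen so the shift is easy to see).

```python
import torch

torch.manual_seed(0)
data = torch.arange(100)            # stand-in for the encoded dataset
block_size, batch_size = 8, 4

# Same sampling logic as get_batch: random start positions, then slices.
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([data[i : i + block_size] for i in ix])
y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])

# Columns 1..7 of the inputs must equal columns 0..6 of the targets.
shifted_ok = torch.equal(x[:, 1:], y[:, :-1])

# Because data is 0, 1, 2, ..., each target is simply input + 1.
plus_one_ok = torch.equal(y, x + 1)

print(x.shape, y.shape, shifted_ok, plus_one_ok)
```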


Let's see all the input-target pairs in our batch (4 chunks × 8 examples each = 32 training examples!):

In [19]:
for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print("when input is ", context, "target: ", target)

when input is  tensor([24]) target:  tensor(43)
when input is  tensor([24, 43]) target:  tensor(58)
when input is  tensor([24, 43, 58]) target:  tensor(5)
when input is  tensor([24, 43, 58,  5]) target:  tensor(57)
when input is  tensor([24, 43, 58,  5, 57]) target:  tensor(1)
when input is  tensor([24, 43, 58,  5, 57,  1]) target:  tensor(46)
when input is  tensor([24, 43, 58,  5, 57,  1, 46]) target:  tensor(43)
when input is  tensor([24, 43, 58,  5, 57,  1, 46, 43]) target:  tensor(39)
when input is  tensor([44]) target:  tensor(53)
when input is  tensor([44, 53]) target:  tensor(56)
when input is  tensor([44, 53, 56]) target:  tensor(1)
when input is  tensor([44, 53, 56,  1]) target:  tensor(58)
when input is  tensor([44, 53, 56,  1, 58]) target:  tensor(46)
when input is  tensor([44, 53, 56,  1, 58, 46]) target:  tensor(39)
when input is  tensor([44, 53, 56,  1, 58, 46, 39]) target:  tensor(58)
when input is  tensor([44, 53, 56,  1, 58, 46, 39, 58]) target:  tensor(1)
when input i

---
## Section 5: Building the Bigram Language Model

Now for the exciting part - building our first neural network!

### What is a Bigram Model?

A **Bigram** model predicts the next character based ONLY on the current character. It's one of the simplest language models you can build.

```
"The" → 'T' predicts 'h', 'h' predicts 'e'
```

**Architecture:**
- Single **Embedding layer** that maps each character to prediction scores (logits)
- Each row of the embedding table tells us: "Given this character, what character is likely next?"

**Limitation:** It ignores all context except the immediate previous character. But it's a great starting point!
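To make the "lookup table" idea concrete, here is a counting-based sketch of a bigram table on a tiny made-up string. Note that this is a stand-in for illustration: the model in this notebook learns its table with gradient descent instead of counting, but each row plays the same role.

```python
from collections import Counter, defaultdict

text = "abababac"   # tiny made-up example, not the Shakespeare data

# counts[cur][nxt] = how often character `nxt` follows character `cur`
counts = defaultdict(Counter)
for cur, nxt in zip(text, text[1:]):
    counts[cur][nxt] += 1

# Normalize each row into probabilities for the next character.
probs = {
    cur: {nxt: c / sum(row.values()) for nxt, c in row.items()}
    for cur, row in counts.items()
}

print(probs["a"])  # {'b': 0.75, 'c': 0.25}
print(probs["b"])  # {'a': 1.0}
```

Reading a row answers the bigram question directly: after `'a'` this toy text continues with `'b'` three times out of four.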

In [51]:
torch.manual_seed(1337)

class BigramLanguageModel(torch.nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # Each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = torch.nn.Embedding(vocab_size, vocab_size)

    def forward(self, x, y = None):
        # x is (B, T) tensor of indices.
        logits = self.token_embedding_table(x) # (B, T, C) with C = vocab_size, e.g. (4, 8, 65)
        
        # Loss
        if y is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            y = y.view(B*T)
            loss = torch.nn.functional.cross_entropy(logits, y)

        return logits, loss

    def generate(self, x, max_new_tokens):
        # x is (B, T) tensor of indices in the current context
        for _ in range(max_new_tokens):
            # get predictions
            logits, loss = self(x)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = torch.nn.functional.softmax(logits, dim = -1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples = 1) # (B, 1)
            # append sampled index to the running sequence
            x = torch.cat((x, idx_next), dim = 1) # (B, T+1)
        return x

**Understanding the Code:**

| Method | Purpose |
|--------|---------|
| `__init__` | Creates an embedding table of size (vocab_size × vocab_size) |
| `forward` | Takes input tokens, returns logits and loss |
| `generate` | Given a starting context, generates new tokens one by one |
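The two key steps inside `generate` are turning logits into probabilities and then sampling from them. Here is a plain-Python sketch of what those steps do. The helpers `softmax` and `sample_index` are hypothetical stand-ins written for this example, playing the role of `torch.nn.functional.softmax` and `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, rng):
    """Draw one index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 0.0, -1.0]      # made-up scores for a 3-token vocabulary
probs = softmax(logits)

rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_index(probs, rng)] += 1

print([round(p, 3) for p in probs])
print(counts)
```

Sampling (rather than always picking the highest score) is what lets the model produce different text on every run.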

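Let's create the model and test it with our batch. Because targets are passed in, `forward` returns the logits flattened to shape `(B*T, C)`. The recorded output shows 256 rows, which is 32 × 8, so it comes from a run where the batch held 32 sequences (the `batch_size` used for training in Section 6) rather than the 4 shown above: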

In [52]:
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

torch.Size([256, 65])
tensor(4.7405, grad_fn=<NllLossBackward0>)
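A handy sanity check for this number, as a small sketch: a model that spreads its probability evenly over all 65 characters would score a cross-entropy of ln(65). The untrained embedding table holds random logits rather than perfectly even ones, so a starting loss somewhat above this baseline, like the 4.74 recorded here, is what we would expect.

```python
import math

vocab_size = 65  # number of unique characters found in Section 2

# Cross-entropy of a uniform guess: -ln(1 / vocab_size) = ln(vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)

print(f"{uniform_loss:.4f}")  # 4.1744
```

Any loss clearly below this value means the model is doing better than guessing uniformly.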


### Generating Text (Before Training)

Let's see what our untrained model produces. Starting from a newline character (index 0), we generate 100 characters:

**Expected:** Complete gibberish! The model hasn't learned anything yet.

In [53]:
x = torch.zeros((1, 1), dtype = torch.long) # Since idx 0 is a new line character
out = m.generate(x, max_new_tokens = 100)
print(decode(out[0].tolist()))


SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ


---
## Section 6: Training the Model

Now let's train our model! The training loop:

1. **Sample** a batch of data
2. **Forward pass**: Get predictions and compute loss
3. **Backward pass**: Compute gradients
4. **Update**: Adjust weights using optimizer
5. **Repeat!**

We use the **AdamW** optimizer - a popular choice for training neural networks.

In [54]:
optimizer = torch.optim.AdamW(m.parameters(), lr = 1e-3)

Let's train for 10,000 steps:

In [57]:
batch_size = 32

for steps in range(10000):

    # Sample a batch of data
    xb, yb = get_batch('train')

    # Evaluate the loss
    logits, loss = m(xb, yb)

    # Backpropagate: compute gradients of the loss with respect to the weights
    loss.backward()

    # Update the weights
    optimizer.step()

    # Reset the gradients
    optimizer.zero_grad(set_to_none = True) # Set to None instead of zero to free up memory

print(loss.item())

2.456190347671509
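Keep in mind that the number printed above is the loss on the single last batch, so it bounces around from run to run. A steadier reading averages the loss over several batches with gradients switched off. The sketch below shows one way to do that under stated assumptions: `val_data` is random made-up data, `model` is a bare embedding table standing in for the trained bigram model, and `estimate_loss`, `get_val_batch`, and `eval_iters` are illustrative names.

```python
import torch

torch.manual_seed(0)
vocab_size, block_size, batch_size = 65, 8, 32

# Assumptions for this sketch: random stand-in data and an untrained table.
val_data = torch.randint(vocab_size, (1000,))
model = torch.nn.Embedding(vocab_size, vocab_size)

def get_val_batch():
    """Sample a batch of inputs and shifted-by-one targets from val_data."""
    ix = torch.randint(len(val_data) - block_size, (batch_size,))
    x = torch.stack([val_data[i : i + block_size] for i in ix])
    y = torch.stack([val_data[i + 1 : i + block_size + 1] for i in ix])
    return x, y

@torch.no_grad()  # no gradients needed when we only measure the loss
def estimate_loss(eval_iters=50):
    """Average the cross-entropy loss over eval_iters random batches."""
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
        x, y = get_val_batch()
        logits = model(x)                                   # (B, T, C)
        losses[k] = torch.nn.functional.cross_entropy(
            logits.view(-1, vocab_size), y.view(-1)
        )
    return losses.mean().item()

val_loss = estimate_loss()
print(val_loss)
```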


### Generating Text (After Training)

Now let's see what our trained model produces! It should be much better, though still not perfect:

In [58]:
x = torch.zeros((1, 1), dtype = torch.long) # Since idx 0 is a new line character
out = m.generate(x, max_new_tokens = 100)
print(decode(out[0].tolist()))


BRS:

THAnrt t fa boun-s trconnou

No: ENGUS:
tepare ofo.'s ne:
We Prellothe, s;
NE: t an re, belono


---
## Summary & What's Next?

**What We Learned:**
- ✅ Loaded and explored the Tiny Shakespeare dataset
- ✅ Built a character-level tokenizer (encoder/decoder)
- ✅ Prepared data with chunking and batching
- ✅ Implemented a Bigram Language Model
- ✅ Trained the model and generated text

**Limitation of Bigram Model:**
The Bigram model only looks at the **previous character** to make predictions. It has no memory of earlier context!

**Next Up: Part 2**
We'll organize our code better with:
- Proper loss estimation (train + validation)
- GPU support
- More structured training loop

➡️ Continue to **Part 2: Complete Bigram Model**