# Understanding Large Language Models: A Step-by-Step Journey

## Our Goal

In this notebook, we will understand how language models predict text by following the complete process of predicting "we love deep learning" word by word. We'll explore four fundamental concepts that make modern language models like ChatGPT work:

1. **Forward Pass**: How models generate predictions
2. **Loss Calculation**: How we measure prediction quality
3. **Backpropagation**: How we identify what needs improvement
4. **Gradient Descent**: How we make those improvements

## Our Vocabulary and Target

We'll work with a simple vocabulary to keep things clear and manageable. Our model will learn to predict each word in our target sequence step by step.


In [1]:
# Setup our simple vocabulary and target sequence
VOCAB = ["<BOS>", "we", "love", "deep", "learning", "<EOS>", "the", "is", "great", "model", "hello", "world"]
target_sequence = ["we", "love", "deep", "learning"]

print("VOCABULARY:", VOCAB)
print("TARGET SEQUENCE:", target_sequence)
print("VOCABULARY SIZE:", len(VOCAB))

# Initialize simple model parameters (weights) - these will be updated during training
# In real models, these would be millions or billions of parameters
model_parameters = {
    'layer1_weights': [0.1, -0.3, 0.5, 0.2, -0.1, 0.4, 0.8, -0.2, 0.3, -0.5, 0.7, -0.4],
    'layer2_weights': [0.2, 0.1, -0.4, 0.6, 0.3, -0.2, -0.1, 0.5, -0.3, 0.4, -0.6, 0.1],
    'output_weights': [0.3, -0.1, 0.4, -0.2, 0.5, 0.1, -0.3, 0.2, 0.6, -0.4, 0.1, 0.3]
}

print("\nInitial model parameters (simplified representation):")
print("Layer 1 weights:", len(model_parameters['layer1_weights']), "parameters")
print("Layer 2 weights:", len(model_parameters['layer2_weights']), "parameters") 
print("Output weights:", len(model_parameters['output_weights']), "parameters")

VOCABULARY: ['<BOS>', 'we', 'love', 'deep', 'learning', '<EOS>', 'the', 'is', 'great', 'model', 'hello', 'world']
TARGET SEQUENCE: ['we', 'love', 'deep', 'learning']
VOCABULARY SIZE: 12

Initial model parameters (simplified representation):
Layer 1 weights: 12 parameters
Layer 2 weights: 12 parameters
Output weights: 12 parameters


## What Are Logits?

Logits are the raw numerical scores that a language model assigns to every word in its vocabulary when predicting the next word. Think of logits as the model's initial "gut feeling" about how likely each word is to come next, before any normalization.

**Key Properties of Logits:**

- They can be any real number (positive, negative, large, small)
- Higher logits indicate the model thinks a word is more likely
- Lower logits indicate the model thinks a word is less likely
- They are computed by passing the current context through the neural network layers

Let's see what logits look like when our model tries to predict the first word after the beginning-of-sequence token.


In [3]:
def demonstrate_logits():
    # Pre-calculated realistic logit values for predicting first word after <BOS>
    logits_after_bos = {
        "we": 2.1,      # Target word - decent score
        "the": 3.2,     # Highest - most common starter
        "hello": 2.8,   # High - common greeting
        "is": 1.5,      # Medium - possible starter
        "love": -0.5,   # Low - uncommon starter
        "deep": -1.2,   # Lower - rare starter
        "learning": -2.0, # Very low - very rare starter
        "<EOS>": -10.0, # Impossible - can't start with end token
    }
    
    print("Logit scores for predicting first word after <BOS>:")
    print("=" * 50)
    for word, logit in sorted(logits_after_bos.items(), key=lambda x: x[1], reverse=True):
        marker = " <- Our target!" if word == "we" else ""
        print(f"{word:10}: {logit:6.1f}{marker}")
    
    print(f"\nNotice how 'the' has the highest logit ({logits_after_bos['the']}) because it's")
    print("the most common way to start sentences in English.")
    print(f"Our target word 'we' has a logit of {logits_after_bos['we']}, which is decent")
    print("but not the highest - the model will need training to improve this!")
    
    return logits_after_bos

# Run the demonstration
logits_step1 = demonstrate_logits()

Logit scores for predicting first word after <BOS>:
the       :    3.2
hello     :    2.8
we        :    2.1 <- Our target!
is        :    1.5
love      :   -0.5
deep      :   -1.2
learning  :   -2.0
<EOS>     :  -10.0

Notice how 'the' has the highest logit (3.2) because it's
the most common way to start sentences in English.
Our target word 'we' has a logit of 2.1, which is decent
but not the highest - the model will need training to improve this!


## From Logits to Probabilities: The Softmax Operation

Raw logits are useful for the model internally, but they're not easy for us to interpret. We need to convert them into probabilities - numbers between 0 and 1 that sum to exactly 1.0. This conversion is done using the **softmax function**.

**The Softmax Process:**

1. Calculate e^(logit) for each word (this makes all values positive)
2. Sum all these exponential values
3. Divide each exponential value by the sum

This ensures we get valid probabilities that represent the model's confidence in each possible next word.


In [4]:
import math

def demonstrate_softmax(logits_dict):
    print("SOFTMAX CONVERSION: From Logits to Probabilities")
    print("=" * 55)
    
    words = list(logits_dict.keys())
    logits = list(logits_dict.values())
    
    print("STEP 1: Calculate e^(logit) for each word")
    print("-" * 40)
    exp_logits = []
    for word, logit in zip(words, logits):
        exp_val = math.exp(logit)
        exp_logits.append(exp_val)
        print(f"e^({logit:5.1f}) = {exp_val:8.2f}  for '{word}'")
    
    print(f"\nSTEP 2: Sum all exponential values")
    print("-" * 35)
    total = sum(exp_logits)
    print(f"Total sum = {total:.2f}")
    
    print(f"\nSTEP 3: Calculate final probabilities")
    print("-" * 38)
    probabilities = {}
    for word, exp_val in zip(words, exp_logits):
        prob = exp_val / total
        probabilities[word] = prob
        marker = " <- Our target!" if word == "we" else ""
        print(f"P({word:8}) = {exp_val:8.2f} / {total:.2f} = {prob:.4f}{marker}")
    
    print(f"\nVerification: All probabilities sum to {sum(probabilities.values()):.6f}")
    print(f"\nKey insight: '{max(logits_dict, key=logits_dict.get)}' had the highest logit")
    print(f"and now has the highest probability ({max(probabilities.values()):.4f})")
    
    return probabilities

# Convert our logits to probabilities
probabilities_step1 = demonstrate_softmax(logits_step1)

SOFTMAX CONVERSION: From Logits to Probabilities
STEP 1: Calculate e^(logit) for each word
----------------------------------------
e^(  2.1) =     8.17  for 'we'
e^(  3.2) =    24.53  for 'the'
e^(  2.8) =    16.44  for 'hello'
e^(  1.5) =     4.48  for 'is'
e^( -0.5) =     0.61  for 'love'
e^( -1.2) =     0.30  for 'deep'
e^( -2.0) =     0.14  for 'learning'
e^(-10.0) =     0.00  for '<EOS>'

STEP 2: Sum all exponential values
-----------------------------------
Total sum = 54.67

STEP 3: Calculate final probabilities
--------------------------------------
P(we      ) =     8.17 / 54.67 = 0.1494 <- Our target!
P(the     ) =    24.53 / 54.67 = 0.4488
P(hello   ) =    16.44 / 54.67 = 0.3008
P(is      ) =     4.48 / 54.67 = 0.0820
P(love    ) =     0.61 / 54.67 = 0.0111
P(deep    ) =     0.30 / 54.67 = 0.0055
P(learning) =     0.14 / 54.67 = 0.0025
P(<EOS>   ) =     0.00 / 54.67 = 0.0000

Verification: All probabilities sum to 1.000000

Key insight: 'the' had the highest logit
and now h

## Sequential Prediction: How Context Shapes Predictions

Language models don't just predict single words in isolation - they use the entire preceding context to make increasingly informed predictions. As we build up the sequence "we love deep learning," each new word provides more context that helps the model make better predictions for the next word.

Let's observe how the model's logit scores change as we provide more context at each step.

In [2]:
def demonstrate_sequence_prediction():
    print("COMPLETE SEQUENCE PREDICTION: 'we love deep learning'")
    print("=" * 60)
    
    # Pre-calculated logits for each prediction step showing how context improves predictions
    prediction_steps = [
        {
            "step": 1,
            "context": "<BOS>",
            "target": "we", 
            "logits": {"we": 2.1, "love": -0.5, "deep": -1.2, "learning": -2.0, "the": 3.2, "hello": 2.8, "is": 1.5, "<EOS>": -10.0},
            "explanation": "'we' is a common sentence starter, though 'the' is even more common in general text"
        },
        {
            "step": 2,
            "context": "we",
            "target": "love",
            "logits": {"we": -3.0, "love": 2.8, "deep": -0.5, "learning": -1.5, "the": 1.2, "hello": -5.0, "is": 2.1, "<EOS>": -8.0},
            "explanation": "After 'we', action words like 'love', 'are', 'have' become much more likely than nouns"
        },
        {
            "step": 3,
            "context": "we love",
            "target": "deep",
            "logits": {"we": -5.0, "love": -4.0, "deep": 1.9, "learning": 0.8, "the": 0.5, "hello": -6.0, "is": -2.0, "<EOS>": -7.0},
            "explanation": "After 'we love', we expect objects or concepts; 'deep' scores well as it often precedes 'learning'"
        },
        {
            "step": 4,
            "context": "we love deep",
            "target": "learning",
            "logits": {"we": -6.0, "love": -5.0, "deep": -4.0, "learning": 3.5, "the": -2.0, "hello": -8.0, "is": -3.0, "<EOS>": -6.0},
            "explanation": "'deep learning' is a common collocation - 'learning' becomes very likely after 'deep' in this context"
        }
    ]
    
    all_step_probabilities = []
    
    for step_info in prediction_steps:
        print(f"\nSTEP {step_info['step']}: Context = '{step_info['context']}' -> Predicting '{step_info['target']}'")
        print("-" * 70)
        
        # Show top 5 logits
        sorted_logits = sorted(step_info['logits'].items(), key=lambda x: x[1], reverse=True)
        print("Top 5 logit scores:")
        for i, (word, logit) in enumerate(sorted_logits[:5]):
            marker = " ** TARGET **" if word == step_info['target'] else ""
            print(f"  {i+1}. {word:10}: {logit:6.1f}{marker}")
        
        # Calculate probability for target word
        target_logit = step_info['logits'][step_info['target']]
        exp_values = [math.exp(logit) for logit in step_info['logits'].values()]
        total_exp = sum(exp_values)
        target_prob = math.exp(target_logit) / total_exp
        all_step_probabilities.append(target_prob)
        
        print(f"\nTarget word '{step_info['target']}' probability: {target_prob:.3f}")
        print(f"Context effect: {step_info['explanation']}")
    
    return prediction_steps, all_step_probabilities

# Generate predictions for the complete sequence
steps_data, step_probabilities = demonstrate_sequence_prediction()

COMPLETE SEQUENCE PREDICTION: 'we love deep learning'

STEP 1: Context = '<BOS>' -> Predicting 'we'
----------------------------------------------------------------------
Top 5 logit scores:
  1. the       :    3.2
  2. hello     :    2.8
  3. we        :    2.1 ** TARGET **
  4. is        :    1.5
  5. love      :   -0.5

Target word 'we' probability: 0.149
Context effect: 'we' is a common sentence starter, though 'the' is even more common in general text

STEP 2: Context = 'we' -> Predicting 'love'
----------------------------------------------------------------------
Top 5 logit scores:
  1. love      :    2.8 ** TARGET **
  2. is        :    2.1
  3. the       :    1.2
  4. deep      :   -0.5
  5. learning  :   -1.5

Target word 'love' probability: 0.571
Context effect: After 'we', action words like 'love', 'are', 'have' become much more likely than nouns

STEP 3: Context = 'we love' -> Predicting 'deep'
----------------------------------------------------------------------
Top 5 l

## Loss Function: Quantifying Prediction Quality

The **loss function** is how we measure how well our model is performing. Specifically, it measures how "surprised" the model is when it sees the correct answer. The mathematical formulation we use is called **cross-entropy loss** or **log-likelihood loss**.

**Key Concepts:**
- Lower loss = better predictions = less surprise when seeing the correct answer
- Higher loss = worse predictions = more surprise when seeing the correct answer  
- Loss is calculated as: Loss = -log(probability of correct word)
- Training aims to minimize the total loss across all predictions

The logarithm has a useful property: it heavily penalizes very low probabilities. If the model assigns a probability of 0.01 to the correct word, the loss is much higher than if it assigns 0.1.

In [5]:
def demonstrate_loss_calculation():
    print("UNDERSTANDING LOSS: Measuring Model Performance")
    print("=" * 55)
    
    print("Loss measures surprise: How unexpected was the correct answer?")
    print("Formula: Loss = -log(probability of correct word)")
    print("\nExample scenarios for predicting 'learning':")
    print("-" * 50)
    
    scenarios = [
        {"probability": 0.85, "surprise_level": "Very Low", "quality": "Excellent prediction"},
        {"probability": 0.45, "surprise_level": "Medium", "quality": "Decent prediction"},
        {"probability": 0.15, "surprise_level": "High", "quality": "Poor prediction"},
        {"probability": 0.02, "surprise_level": "Very High", "quality": "Terrible prediction"}
    ]
    
    for scenario in scenarios:
        prob = scenario["probability"]
        loss = -math.log(prob)
        print(f"P(learning) = {prob:4.2f} -> Loss = {loss:4.2f} -> {scenario['surprise_level']:10} surprise -> {scenario['quality']}")
    
    print(f"\nTraining Goal: Minimize total loss across all predictions!")
    print(f"This means: Assign high probabilities to correct words\n")
    
    # Calculate loss for our actual sequence
    print("LOSS CALCULATION FOR OUR SEQUENCE:")
    print("-" * 40)
    words = ["we", "love", "deep", "learning"]
    total_loss = 0
    
    for i, (word, prob) in enumerate(zip(words, step_probabilities)):
        loss = -math.log(prob)
        total_loss += loss
        print(f"Step {i+1} - P({word:8}) = {prob:.3f} -> Loss = {loss:.3f}")
    
    print(f"\nTotal Loss for sequence = {total_loss:.3f}")
    print(f"Average Loss per word = {total_loss/len(words):.3f}")
    print("\nLower total loss indicates better overall performance!")
    
    return total_loss

# Calculate loss for our predictions
sequence_loss = demonstrate_loss_calculation()

UNDERSTANDING LOSS: Measuring Model Performance
Loss measures surprise: How unexpected was the correct answer?
Formula: Loss = -log(probability of correct word)

Example scenarios for predicting 'learning':
--------------------------------------------------
P(learning) = 0.85 -> Loss = 0.16 -> Very Low   surprise -> Excellent prediction
P(learning) = 0.45 -> Loss = 0.80 -> Medium     surprise -> Decent prediction
P(learning) = 0.15 -> Loss = 1.90 -> High       surprise -> Poor prediction
P(learning) = 0.02 -> Loss = 3.91 -> Very High  surprise -> Terrible prediction

Training Goal: Minimize total loss across all predictions!
This means: Assign high probabilities to correct words

LOSS CALCULATION FOR OUR SEQUENCE:
----------------------------------------
Step 1 - P(we      ) = 0.149 -> Loss = 1.901
Step 2 - P(love    ) = 0.571 -> Loss = 0.561
Step 3 - P(deep    ) = 0.623 -> Loss = 0.472
Step 4 - P(learning) = 0.994 -> Loss = 0.006

Total Loss for sequence = 2.941
Average Loss per word 