# Graded Lab: GRPO Post Training Lab

Welcome to the second assignment of this module!

Carefully read each Markdown (text) cell, which includes instructions and hints. Start by reading the background behind your upcoming tasks.

When you are done, submit your solution by saving it, then clicking on the submit button at the top right side of the page.

## In order for your submission to be graded correctly, you **MUST**:
* **Use the provided variable names**, otherwise the autograder will not be able to locate the variable for grading. 

* **Replace any instances of `None` with your own code.** 

* **Only modify the cells that start with the comment `# GRADED CELL`**.  

* **Use the provided cells for your solution.** You can add new cells to experiment, but these will be omitted when grading. 

To submit your solution, save it, then click on the blue submit button at the top of the page.

<div style="background-color: #FAD888; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
<strong>Important notes</strong>:

- Code blocks with None will not run properly. If you run them before completing the exercise, you will likely get an error. 

- The notebooks work best in the Chrome browser. If you are having problems, please switch to Chrome.

- Make sure you always save before submitting.
</div>

## Introduction

In this hands-on tutorial, you'll learn how to improve a Large Language Model's ability to solve math problems using a technique called **GRPO**. This involves generating multiple answers to the same question, comparing answers to find which are better, and teaching the model to prefer better approaches.
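To make the "compare answers within a group" idea concrete, here is a minimal sketch of how group-relative advantages are commonly computed: each sampled answer's reward is standardized against the mean and standard deviation of its own group. The helper name `group_relative_advantages` and the `eps` term are illustrative assumptions, not the exact code used by the trainer in this lab.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of answers to the same question.

    Hypothetical helper for illustration only; `eps` avoids division by zero
    when every answer in the group receives the same reward.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled answers to one question: two correct (1.0), two wrong (0.0)
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)  # positive for the correct answers, negative for the wrong ones
```

Note that when every answer in a group receives the same reward, all advantages are zero, so that group contributes no preference signal. This is one reason a reward with more than two levels can be helpful.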

## Objectives

You will build a comprehensive reward system for training a Large Language Model to solve math problems using GRPO (Group Relative Policy Optimization). This involves creating sophisticated reward functions that can evaluate model responses across different quality levels and implementing a complete GRPO training pipeline.

* **Extract Numerical Answers from Model Responses:** Implement robust parsing to extract numerical answers from various response formats including GSM8K standard format and common answer phrases.
* **Analyze Response Quality Indicators:** Build a quality analysis system that detects mathematical reasoning, step-by-step thinking, and structured solutions in model responses.
* **Reward Unparseable Responses:** Create a reward system for responses without clear numerical answers, based on effort and reasoning quality.
* **Reward High-Quality Correct Answers:** Implement a bonus system that encourages not just correctness but also clear mathematical communication and detailed explanations.
* **Implement Partial Credit for Wrong Answers:** Develop a partial credit system that provides learning gradients for wrong answers based on proximity to correct solutions and quality of reasoning shown.

## Table of Contents

* [Setup](#setup)
* [Training Configuration](#trainingconfiguration)
* [Load the GSM8K Dataset](#loadGSM8K)
* [Create the Reward Function](#createrewardfunction) - Exercises 1-5
* [Load the Language Model](#loadthelanguagemodel)
* [Prepare Training and Validation datasets](#preparetraining)
* [Create Evaluation Callback](#createevaluation)
* [Configure GRPO Trainer](#configuregrpotrainer)
* [Train the Model with GRPO! (Ungraded Part)](#trainmodelwithgrpo)
* [Summary](#summary)

## Setup <a id="setup"></a>

Start by importing the necessary packages, seeding the random number generators for reproducibility, and setting up the device and the logger.

In [1]:
import os 
# Disable progress bars to avoid Jupyter context errors
os.environ['HF_DATASETS_DISABLE_PROGRESS_BAR'] = '1'
import re          
import random      
import logging     
import warnings    
from typing import List, Dict, Optional, Tuple  
from dataclasses import dataclass, field  

import numpy as np  
import torch 
from trl import (
    GRPOConfig,   
    GRPOTrainer 
)
from transformers import (
    AutoTokenizer,  
    AutoModelForCausalLM,    
)

from datasets import load_dataset, load_from_disk


from utils import (
    setup_logging,  
    load_and_explore_gsm8k_dataset, 
    prepare_dataset, 
    GSM8KEvaluationCallback, 
    evaluate_and_compare              
)


warnings.filterwarnings('ignore')
random.seed(42)    
np.random.seed(42)
torch.manual_seed(42)
torch.use_deterministic_algorithms(True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
logger, log_file = setup_logging(device=device)

print("✅ libraries imported")

✅ Logging configured - all detailed logs will be written to: ./grpo_logs/grpo_20251105_052057.log
Note: Detailed reward computation logs will only appear in the log file, not in console output
✅ libraries imported


## Training Configuration <a id="trainingconfiguration"></a>

You'll configure all your training settings in one place. This makes it easy to experiment with different values.

Important Trade-offs:

- **Higher batch size** = More stable but needs more memory
- **More generations** = Better comparison but slower training
- **Higher learning rate** = Faster learning but might "overshoot"
- **More epochs** = More learning but might overfit

In [2]:
# ============================================
# Define Training Configuration Class
# ============================================

@dataclass
class TrainingConfig:
    """
    Configuration for GRPO training.
    
    This class holds all the settings (hyperparameters) for training.
    Think of it as a control panel with all the knobs and switches.
    
    Each parameter has:
    - A default value (recommended value)
    - A description (what it does)
    - A type (what kind of value it expects)
    """
    
    # ========== MODEL SETTINGS ==========
    # Which model to use and where to save it
    
    model_name: str = field(
        default="/app/models/deepseek-math-7b-base",
        metadata={"help": "The pre-trained model to start with."}
    )
    
    output_dir: str = field(
        default="./grpo_finetuned_model",
        metadata={"help": "Where to save the trained model. Like a 'Save As' location."}
    )
    
    # ========== TRAINING DURATION ==========
    # How long to train for
    
    num_train_epochs: int = field(
        default=5,
        metadata={"help": "How many times to go through the training data. More = more learning."}
    )
    
    # ========== BATCH SETTINGS ==========
    # How many examples to process at once
    
    per_device_train_batch_size: int = field(
        default=2,
        metadata={"help": "How many problems to process at once. Limited by GPU memory."}
    )
    
    gradient_accumulation_steps: int = field(
        default=32,
        metadata={"help": "Accumulate gradients over multiple batches. Simulates larger batch size."}
    )
    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps = 64
    
    # ========== LEARNING SETTINGS ==========
    # How fast the model learns
    
    learning_rate: float = field(
        default=5e-6,  # 0.000005 in decimal
        metadata={"help": "How big of a step to take when learning. Too high = unstable."}
    )
    
    # ========== GRPO SPECIFIC SETTINGS ==========
    # Settings unique to GRPO algorithm
    
    num_generations: int = field(
        default=12,
        metadata={"help": "How many different answers to generate per question. More = better comparison."}
    )
    
    temperature: float = field(
        default=0.8,
        metadata={"help": "Controls randomness. 0.0 = always same answer, 1.0 = very random."}
    )
    
    max_new_tokens: int = field(
        default=400,
        metadata={"help": "Maximum length of generated answers. Needs to be long enough for full solutions."}
    )
    
    # ========== DATA SETTINGS ==========
    # How to handle the dataset
    
    max_prompt_length: int = field(
        default=512,
        metadata={"help": "Maximum length of input questions in tokens."}
    )
    
    train_split_ratio: float = field(
        default=0.8,
        metadata={"help": "What fraction of data to use for training (rest for validation)."}
    )
    
    # ========== MONITORING SETTINGS ==========
    # How often to check progress
    
    eval_steps: int = field(
        default=20,
        metadata={"help": "Evaluate model every N steps to check progress."}
    )
    
    save_steps: int = field(
        default=20,
        metadata={"help": "Save a checkpoint every N steps (for recovery if training stops)."}
    )
    
    logging_steps: int = field(
        default=20,
        metadata={"help": "Log training metrics every N steps."}
    )
    
    save_total_limit: int = field(
        default=3,
        metadata={"help": "Keep only the N most recent checkpoints to save disk space."}
    )
    
    # ========== OTHER SETTINGS ==========
    
    seed: int = field(
        default=42,
        metadata={"help": "Random seed for reproducibility."}
    )
    
    use_8bit: bool = field(
        default=False,
        metadata={"help": "Load model in 8-bit mode to save memory (slightly less accurate)."}
    )

print("✅ Configuration class defined!")

✅ Configuration class defined!


In [3]:
# ============================================
# Create and Display Configuration
# ============================================

# Create an instance of your configuration
# This uses all the default values defined above
config = TrainingConfig()

# Display all configuration values
print("TRAINING CONFIGURATION")
print("="*50)

# Group settings by category for easier reading
print("\nModel Settings:")
print(f"  Model: {config.model_name}")
print(f"  Output directory: {config.output_dir}")

print("\nTraining Duration:")
print(f"  Epochs: {config.num_train_epochs}")
print(f"  Batch size per device: {config.per_device_train_batch_size}")
print(f"  Gradient accumulation: {config.gradient_accumulation_steps}")
print(f"  Effective batch size: {config.per_device_train_batch_size * config.gradient_accumulation_steps}")

print("\nGRPO Settings:")
print(f"  Generations per prompt: {config.num_generations}")
print(f"  Temperature: {config.temperature}")
print(f"  Max new tokens: {config.max_new_tokens}")
print(f"  Learning rate: {config.learning_rate}")

print("\nMonitoring:")
print(f"  Evaluate every: {config.eval_steps} steps")
print(f"  Save every: {config.save_steps} steps")
print(f"  Log every: {config.logging_steps} steps")

# Summarize what this configuration means for training
print("\nEstimated Training Info:")
print(f"  This configuration will generate {config.num_generations} answers per question")
print(f"  The model will learn by comparing these answers")

TRAINING CONFIGURATION

Model Settings:
  Model: /app/models/deepseek-math-7b-base
  Output directory: ./grpo_finetuned_model

Training Duration:
  Epochs: 5
  Batch size per device: 2
  Gradient accumulation: 32
  Effective batch size: 64

GRPO Settings:
  Generations per prompt: 12
  Temperature: 0.8
  Max new tokens: 400
  Learning rate: 5e-06

Monitoring:
  Evaluate every: 20 steps
  Save every: 20 steps
  Log every: 20 steps

Estimated Training Info:
  This configuration will generate 12 answers per question
  The model will learn by comparing these answers
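The `temperature` value shown above rescales the model's token scores (logits) before sampling. The sketch below uses made-up logits and a hypothetical helper, `softmax_with_temperature`, to illustrate the effect: lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it out. It is a toy illustration, not the sampling code used by the trainer.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature (> 0)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.8, 1.5):
    print(t, np.round(softmax_with_temperature(toy_logits, t), 3))
```

GRPO relies on the sampled answers differing from one another, so the temperature needs to be high enough to produce varied generations for each question.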


## Load the GSM8K Dataset <a id="loadGSM8K"></a>

Here you will load the GSM8K dataset, which you are already familiar with from the previous assignments.
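Each GSM8K reference solution ends with a line of the form `#### <number>`, and that final number is what a reward function compares against. As a minimal sketch (the helper name `parse_gsm8k_reference` is hypothetical and not part of `utils`), the ground-truth value can be pulled out like this:

```python
from typing import Optional

def parse_gsm8k_reference(answer: str) -> Optional[float]:
    """Return the number after the final '####' marker, or None if absent."""
    if "####" not in answer:
        return None
    final = answer.split("####")[-1].strip().replace(",", "").replace("$", "")
    try:
        return float(final)
    except ValueError:
        return None

# A reference solution from the GSM8K training set
sample = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
print(parse_gsm8k_reference(sample))  # 72.0
```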

In [4]:
# ============================================
# Load and Explore GSM8K Dataset
# ============================================

# Load the GSM8K dataset
# This function:
# 1. Downloads the dataset (if not already downloaded)
# 2. Shows dataset statistics
# 3. Displays sample problems
dataset = load_and_explore_gsm8k_dataset()

# The dataset has two parts:
# - 'train': Problems for training (about 7,500)
# - 'test': Problems for testing (about 1,300)

Dataset Structure:
DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 1319
    })
})

Dataset splits:
- Train: 7473 examples
- Test: 1319 examples

Sample Problem from Training Set:

📝 Question:
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

✅ Answer:
Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72

🔢 Numerical Answer: 72


## Create the Reward Function <a id="createrewardfunction"></a>

How It Works

1. **Extract** the numerical answer from the text
2. **Compare** with the correct answer
3. **Check** for partial credit (showing work, being close)
4. **Assign** a reward score
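The four steps above can be sketched end to end with a deliberately simple toy. The function names, the 10% closeness threshold, and the 0.3 partial-credit value are all illustrative assumptions; the exercises below build a more careful version of each step.

```python
import re
from typing import Optional

def toy_extract(text: str) -> Optional[float]:
    """Step 1: take the last number in the text (a deliberately crude extractor)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(numbers[-1]) if numbers else None

def toy_reward(response: str, correct: float) -> float:
    """Steps 2-4: compare, check for partial credit, assign a score."""
    predicted = toy_extract(response)
    if predicted is None:
        return 0.0  # nothing to compare
    if abs(predicted - correct) < 1e-6:
        return 1.0  # exact match
    relative_error = abs(predicted - correct) / max(abs(correct), 1.0)
    # Assumed partial credit: 0.3 if within 10% of the correct answer
    return 0.3 if relative_error <= 0.1 else 0.0

print(toy_reward("48 / 2 = 24, and 48 + 24 = 72", 72.0))  # 1.0
print(toy_reward("The total is 70", 72.0))                # 0.3
print(toy_reward("I am not sure", 72.0))                  # 0.0
```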

In [5]:
# GSM8KRewardSignal class definition
class GSM8KRewardSignal:
    """
    BASELINE Reward Model for GSM8K (Simplified Version)
    
    This is a basic reward model with only 3 categories:
    - Correct answer: 1.0
    - Wrong answer: 0.0  
    - Unparseable: 0.0
    
    ⚠️ LIMITATIONS: This simple reward signal makes GRPO training difficult!
    The lack of partial credit means the model gets no signal for improvements.
    """
    
    def __init__(self):
        self.logger = logging.getLogger(__name__)

### Exercise 1: Extract Numerical Answer from Model Response

The first critical component of your reward model is the ability to extract numerical answers from the model's generated text. Language models often produce verbose responses with explanations, calculations, and natural language, but you need to identify and extract the actual numerical answer to compare it with the ground truth.

Currently, the basic implementation only looks for answers in the GSM8K standard format (marked with ####). However, models don't always follow this format perfectly. Your task is to implement a more robust extraction system that can handle various answer formats that models might produce.

Consider implementing patterns to match common answer phrases like "The answer is X", "equals X", "total is X", or "Therefore, X". Remember to handle edge cases such as numbers with commas (e.g., 1,000), dollar signs, negative numbers, and decimal points (you can use this regular expression: `r'([+-]?\d+(?:,\d{3})*(?:\.\d+)?)'`). As a fallback strategy, you might want to extract the last number mentioned in the response, but make sure your regex pattern is sophisticated enough to properly identify valid numbers.
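To see what the suggested regular expression captures, the short sketch below runs it over a few strings. The helper name `find_numbers` is an assumption made for this illustration; only the pattern itself comes from the text above.

```python
import re

NUMBER_PATTERN = r'([+-]?\d+(?:,\d{3})*(?:\.\d+)?)'  # pattern quoted in the text above

def find_numbers(text: str) -> list:
    """Return every number matched by NUMBER_PATTERN, with commas removed."""
    return [float(m.replace(",", "")) for m in re.findall(NUMBER_PATTERN, text)]

for text in ["The answer is $1,234.56", "Total: -45", "First 10, then 20, finally 30"]:
    print(f"{text!r} -> {find_numbers(text)}")
```

One caveat worth keeping in mind: because the pattern accepts an optional sign, an expression written without spaces such as `5-3` is read as the two numbers `5` and `-3`, so a last-number fallback would return `-3`.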

In [8]:
# GRADED CELL: exercise 1
 
def extract_numerical_answer(self, text: str) -> Optional[float]:
    """
    Extract the numerical answer from model's response.

    This function looks for numbers in different formats:
    - #### 42 (GSM8K format)
    - "The answer is 42"
    - "= 42"
    - Just finds the last number if nothing else works

    Args:
        text: The model's generated response

    Returns:
        The extracted number, or None if no number found
    """

    # First, check for GSM8K format (#### answer)
    if "####" in text:
        answer = text.split("####")[-1].strip()
        answer = answer.replace(',', '').replace('$', '')
        try:
            return float(answer)
        except ValueError:
            pass

    # Try various answer patterns
    # Note: \$ in regex matches a literal dollar sign
    patterns = [
        r"(?:The answer is|answer:|Answer:)\s*\$?([+-]?\d+(?:,\d{3})*(?:\.\d+)?)",
        r"(?:equals?|=)\s*\$?([+-]?\d+(?:,\d{3})*(?:\.\d+)?)",
        r"(?:total|sum|result|Total|Final answer)\s*(?:is|:|=)?\s*\$?([+-]?\d+(?:,\d{3})*(?:\.\d+)?)",
        r"Therefore,?\s*\$?([+-]?\d+(?:,\d{3})*(?:\.\d+)?)",
    ]

    ### START CODE HERE ### 
    # Use the patterns list defined above to find phrases like "The answer is X", "equals X", "total is X", "Therefore, X", etc.

    # For each of the above patterns
    for pattern in patterns:
        # Use re.findall() to match the different answer formats
        matches = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)
        if matches:
            try:
                # Take the last match, remove commas, and convert to float
                return float(matches[-1].replace(',', ''))
            except ValueError:
                continue

    # Last resort: fall back to the last number anywhere in the text.
    # The regex handles signs, thousands separators, and decimal points.
    numbers = re.findall(r'[+-]?\d+(?:,\d{3})*(?:\.\d+)?', text)
    if numbers:
        try:
            return float(numbers[-1].replace(',', ''))
        except ValueError:
            pass

    ### END CODE HERE ###

    return None

# Add the method to GSM8KRewardSignal class for grading purposes
GSM8KRewardSignal.extract_numerical_answer = extract_numerical_answer

In [9]:
# ============================================
# UNIT TEST: Exercise 1 - Extract Numerical Answer
# ============================================

def test_extract_numerical_answer():
    """Unit test for the extract_numerical_answer method."""
    print("🧪 Testing Exercise 1: Extract Numerical Answer")
    print("="*50)
    
    # Create an instance of the reward model
    test_model = GSM8KRewardSignal()
    
    # Test cases: (input_text, expected_output, description)
    test_cases = [
        # Standard GSM8K format (already works in base implementation)
        ("The calculation is 5 + 3 = 8 #### 8", 8.0, "GSM8K standard format"),
        
        # Common answer phrases (MUST work after student implementation)
        ("The answer is 42", 42.0, "Common phrase: 'The answer is'"),
        ("Answer: 100", 100.0, "Common phrase: 'Answer:'"),
        ("equals 25", 25.0, "Common phrase: 'equals'"),
        ("total is 15.5", 15.5, "Common phrase: 'total is'"),
        ("Therefore, 7", 7.0, "Common phrase: 'Therefore,'"),
        
        # Numbers with formatting (MUST handle these)
        ("The answer is $1,234.56", 1234.56, "Dollar sign and commas"),
        ("Total: -45", -45.0, "Negative number"),
        ("equals 0.003", 0.003, "Small decimal"),
        
        # Fallback to last number (minimum requirement)
        ("First we have 10, then 20, finally 30", 30.0, "Last number fallback"),
        ("No clear answer but mentions 99", 99.0, "Last number in text"),
    ]
    
    passed = 0
    failed = 0
    
    for text, expected, description in test_cases:
        try:
            result = test_model.extract_numerical_answer(text)
            
            if result is None and expected is not None:
                print(f"❌ FAILED: {description}")
                print(f"   Input: '{text}'")
                print(f"   Expected: {expected}, Got: None")
                failed += 1
            elif result is not None and expected is None:
                print(f"❌ FAILED: {description}")
                print(f"   Input: '{text}'")
                print(f"   Expected: None, Got: {result}")
                failed += 1
            elif result is not None and abs(result - expected) > 0.01:
                print(f"❌ FAILED: {description}")
                print(f"   Input: '{text}'")
                print(f"   Expected: {expected}, Got: {result}")
                failed += 1
            else:
                print(f"✅ PASSED: {description}")
                passed += 1
                
        except Exception as e:
            print(f"❌ ERROR: {description}")
            print(f"   Exception: {e}")
            failed += 1
    
    # Summary
    print("\n" + "="*50)
    print(f"Results: {passed}/{len(test_cases)} passed, {failed}/{len(test_cases)} failed")
    
    if failed == 0:
        print("🎉 All tests passed! Exercise 1 is complete.")
    else:
        print("⚠️ Some tests failed. Please review your implementation.")
        print("\nMinimum requirements:")
        print("- Extract answers from common phrases (The answer is, equals, total is)")
        print("- Handle numbers with commas and dollar signs")
        print("- Support negative numbers and decimals")
        print("- Fallback to extracting the last number in text")
    
    return passed == len(test_cases)

# Run the test
test_extract_numerical_answer()

🧪 Testing Exercise 1: Extract Numerical Answer
✅ PASSED: GSM8K standard format
✅ PASSED: Common phrase: 'The answer is'
✅ PASSED: Common phrase: 'Answer:'
✅ PASSED: Common phrase: 'equals'
✅ PASSED: Common phrase: 'total is'
✅ PASSED: Common phrase: 'Therefore,'
✅ PASSED: Dollar sign and commas
✅ PASSED: Negative number
✅ PASSED: Small decimal
✅ PASSED: Last number fallback
✅ PASSED: Last number in text

Results: 11/11 passed, 0/11 failed
🎉 All tests passed! Exercise 1 is complete.


True

### Exercise 2: Analyze Response Quality Indicators

Before assigning rewards to model responses, you need to understand what makes a good mathematical solution. This exercise focuses on building a comprehensive quality analysis system that examines various aspects of the generated response.

Your task is to implement quality indicators that can detect whether a response contains mathematical reasoning. The current implementation has basic checks, but you should enhance it to identify mathematical operations (not just symbols like +, -, *, / but also words like "multiply", "divide", "add", "subtract"), reasoning indicators (words like "first", "then", "next", "finally" that show step-by-step thinking), and structured solutions (numbered steps or bullet points).

Additionally, consider counting meaningful metrics like the number of sentences (which might indicate detailed explanation), checking for the presence of numbers (essential for math problems), and measuring response length. These quality indicators will be crucial for the reward functions in the following exercises, as they help distinguish between different types of responses even when you can't extract a final answer.
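Two of the ideas above, detecting structured solutions and matching reasoning words, can be sketched with regular expressions. The helpers `has_structured_steps` and `contains_word` are hypothetical names for this illustration and are not part of the graded solution. The second one matches whole words only, which avoids a short word like "so" firing on "also" or "some".

```python
import re

def has_structured_steps(response: str) -> bool:
    """Detect numbered steps ("1.", "2)", "Step 3:") or bullet points at line starts."""
    pattern = r"^\s*(?:\d+[.)]|step\s+\d+[:.)]?|[-*•])\s+"
    return bool(re.search(pattern, response, re.IGNORECASE | re.MULTILINE))

def contains_word(response: str, words) -> bool:
    """Whole-word, case-insensitive match against a list of words."""
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    return bool(re.search(pattern, response, re.IGNORECASE))

print(has_structured_steps("1. Add 5 and 3\n2. Multiply by 2"))  # True
print(has_structured_steps("The total is 3.5 apples"))           # False
print(contains_word("We also sold some clips", ["so"]))          # False
print(contains_word("So the total is 8", ["so"]))                # True
```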

In [12]:
# GRADED CELL: exercise 2

def analyze_response_quality(self, response: str) -> Dict[str, any]:
    """
    Analyze the quality indicators of a response.

    This method examines various aspects of the response to determine
    its quality, even if the final answer is wrong or unparseable.

    Args:
        response: The model's generated response

    Returns:
        Dictionary with quality indicators:
        - has_calculation: Boolean, if response contains math operations
        - has_steps: Boolean, if response shows multiple steps
        - has_reasoning: Boolean, if response uses reasoning words
        - has_numbers: Boolean, if response contains numbers
        - response_length: Int, character count
        - sentence_count: Int, number of sentences
    """
    ### START CODE HERE ### 

    # Check for calculations (math operators and related words)
    calculation_indicators = ['=', '+', '-', '*', '/', 'multiply', 'divide', 'add', 'subtract'] 
    # set has_calculation to True if any of the indicators are in response.lower()
    has_calculation = any(calc_ind in response.lower() for calc_ind in calculation_indicators)

    # Check for steps (multiple sentences or explicit step markers)
    sentences = [s.strip() for s in re.split(r'[.!?]+', response) if s.strip()] 
    sentence_count = len(sentences)
    # Set has_steps to True if there are at least two sentences or if the word 'step' is in response.lower()
    has_steps = sentence_count >= 2 or 'step' in response.lower()

    # Check for reasoning words
    # These words indicate logical reasoning or step-by-step thinking
    reasoning_words = ['therefore', 'because', 'since', 'so', 'thus', 'hence', 
                        'first', 'then', 'next', 'finally', 'must', 'need'] 
    # Set has_reasoning to True if ANY of these reasoning words appear in the response
    # Use any() with a generator expression to check if any word is in response.lower()
    has_reasoning = any(word in response.lower() for word in reasoning_words)

    # Check for numbers
    has_numbers = bool(re.search(r'\d', response)) 

    # Response length
    response_length = len(response)

    ### END CODE HERE ###

    quality = {
        'has_calculation': has_calculation,
        'has_steps': has_steps,
        'has_reasoning': has_reasoning,
        'has_numbers': has_numbers,
        'response_length': response_length,
        'sentence_count': sentence_count
    }

    return quality

# Add the method to GSM8KRewardSignal class for grading purposes
GSM8KRewardSignal.analyze_response_quality = analyze_response_quality

In [13]:
# ============================================
# UNIT TEST: Exercise 2 - Analyze Response Quality
# ============================================

def test_analyze_response_quality():
    """Unit test for the analyze_response_quality method."""
    print("🧪 Testing Exercise 2: Analyze Response Quality")
    print("="*50)
    
    # Create an instance of the reward model
    test_model = GSM8KRewardSignal()
    
    # Test cases with expected quality indicators
    test_cases = [
        # Test 1: Response with calculations and steps
        {
            "response": "First, let's add 5 + 3 = 8. Then multiply by 2 to get 16.",
            "description": "Response with math operations and reasoning",
            "expected": {
                "has_calculation": True,
                "has_steps": True,
                "has_numbers": True,
                "has_reasoning": True,
                "min_length": 30,
                "min_sentences": 1
            }
        },

        # Test 2: Simple answer without work
        {
            "response": "42",
            "description": "Single number response",
            "expected": {
                "has_calculation": False,
                "has_steps": False,
                "has_numbers": True,
                "has_reasoning": False,
                "max_length": 10,
                "max_sentences": 1
            }
        },

        # Test 3: Detailed step-by-step solution
        {
            "response": "Step 1: Calculate 10 * 5 = 50. Step 2: Subtract 15 from 50 to get 35. Finally, divide by 7 for the answer.",
            "description": "Numbered steps with calculations",
            "expected": {
                "has_calculation": True,
                "has_steps": True,
                "has_numbers": True,
                "has_reasoning": True,
                "min_length": 50,
                "min_sentences": 2
            }
        },

        # Test 4: "I don't know" response
        {
            "response": "I don't know",
            "description": "Give-up response",
            "expected": {
                "has_calculation": False,
                "has_steps": False,
                "has_numbers": False,
                "has_reasoning": False,
                "max_length": 20,
                "max_sentences": 1
            }
        },

        # Test 5: Response with word-based operations
        {
            "response": "We need to multiply twelve by three and then add seven.",
            "description": "Word-based math operations",
            "expected": {
                "has_calculation": True,
                "has_steps": False,
                "has_numbers": False,  # No digit characters in the response
                "has_reasoning": True,
                "min_length": 20,
                "min_sentences": 1
            }
        }
    ]
    
    passed = 0
    failed = 0
    
    for test_case in test_cases:
        response = test_case["response"]
        description = test_case["description"]
        expected = test_case["expected"]
        
        try:
            quality = test_model.analyze_response_quality(response)
            
            test_passed = True
            failures = []
            
            # Check boolean indicators
            for key in ["has_calculation", "has_steps", "has_numbers", "has_reasoning"]:
                if key in expected:
                    if quality.get(key) != expected[key]:
                        failures.append(f"{key}: expected {expected[key]}, got {quality.get(key)}")
                        test_passed = False
            
            # Check response length
            if "min_length" in expected and quality.get("response_length", 0) < expected["min_length"]:
                failures.append(f"response_length: expected >= {expected['min_length']}, got {quality.get('response_length', 0)}")
                test_passed = False
            if "max_length" in expected and quality.get("response_length", 0) > expected["max_length"]:
                failures.append(f"response_length: expected <= {expected['max_length']}, got {quality.get('response_length', 0)}")
                test_passed = False
            
            # Check sentence count
            if "min_sentences" in expected and quality.get("sentence_count", 0) < expected["min_sentences"]:
                failures.append(f"sentence_count: expected >= {expected['min_sentences']}, got {quality.get('sentence_count', 0)}")
                test_passed = False
            if "max_sentences" in expected and quality.get("sentence_count", 0) > expected["max_sentences"]:
                failures.append(f"sentence_count: expected <= {expected['max_sentences']}, got {quality.get('sentence_count', 0)}")
                test_passed = False
            
            if test_passed:
                print(f"✅ PASSED: {description}")
                passed += 1
            else:
                print(f"❌ FAILED: {description}")
                print(f"   Response: '{response[:50]}...' " if len(response) > 50 else f"   Response: '{response}'")
                for failure in failures:
                    print(f"   - {failure}")
                failed += 1
                
        except Exception as e:
            print(f"❌ ERROR: {description}")
            print(f"   Exception: {e}")
            failed += 1
    
    # Summary
    print("\n" + "="*50)
    print(f"Results: {passed}/{len(test_cases)} passed, {failed}/{len(test_cases)} failed")
    
    if failed == 0:
        print("🎉 All tests passed! Exercise 2 is complete.")
    else:
        print("⚠️ Some tests failed. Please review your implementation.")
        print("\nMinimum requirements:")
        print("- Detect math operations (symbols AND words like 'multiply', 'add')")
        print("- Identify reasoning words ('first', 'then', 'next', 'finally')")
        print("- Detect step indicators ('step', numbered lists)")
        print("- Count sentences and measure response length")
        print("- Check for presence of numbers")
    
    return passed == len(test_cases)

# Run the test
test_analyze_response_quality()

🧪 Testing Exercise 2: Analyze Response Quality
✅ PASSED: Response with math operations and reasoning
✅ PASSED: Single number response
✅ PASSED: Numbered steps with calculations
✅ PASSED: Give-up response
✅ PASSED: Word-based math operations

Results: 5/5 passed, 0/5 failed
🎉 All tests passed! Exercise 2 is complete.


True

### Exercise 3: Reward Unparseable Responses

Not all model responses will contain a clear numerical answer that you can extract. Sometimes the model might provide a detailed explanation without a final answer, give up with "I don't know", or produce garbled output. The current implementation harshly gives 0.0 reward to all unparseable responses, which doesn't help the model learn what aspects of its response were good or bad.

Your task is to implement a nuanced reward system for unparseable responses. Use the quality indicators from Exercise 2 to provide graduated rewards. For instance, if a response is lengthy (over 200 characters) and contains calculations, it shows the model is attempting to solve the problem and deserves some credit (perhaps 0.2). If it has reasoning and numbers but no clear answer, it might deserve 0.1. Very short responses (under 20 characters) are likely dismissive responses like "I don't know" and should receive 0.0. 

The goal is to guide the model toward better responses even when it doesn't produce a parseable answer, encouraging it to show its work and reasoning process.

In [14]:
# GRADED CELL: exercise 3

# This cell will be graded - compute_unparseable_reward method
def compute_unparseable_reward(self, response: str, correct_answer: float,
                                quality: Dict[str, any], question: str = None) -> float:
    """
    Compute reward for unparseable responses (no numerical answer found).

    Instead of giving 0.0 to all unparseable responses, give partial
    credit based on the quality of the attempt.

    Args:
        response: The model's generated response
        correct_answer: The correct numerical answer
        quality: Quality indicators from analyze_response_quality
        question: The original question (optional, not used in the score)

    Returns:
        Reward between 0.0 and 0.3
    """

    response_length = quality.get('response_length', 0)
    has_calculation = quality.get('has_calculation', False)
    has_steps = quality.get('has_steps', False)
    has_numbers = quality.get('has_numbers', False)

    # Very short responses get no credit
    if response_length < 20:
        return 0.0

    # Build up reward based on effort indicators
    reward = 0.05  # Base reward for any attempt
    
    ### START CODE HERE ### 

    # If response has calculations → reward plus 0.05
    if has_calculation:
        reward += 0.05

    # If response has steps → reward plus 0.05
    if has_steps:
        reward += 0.05

    # If response has numbers → reward plus 0.05
    if has_numbers:
        reward += 0.05

    # If response is long (>100 chars) → reward plus 0.05
    if response_length > 100:
        reward += 0.05

    # If response is very long (>200 chars) → reward plus another 0.05
    if response_length > 200:
        reward += 0.05

    ### END CODE HERE ### 

    # Cap reward at 0.3 before logging so the log shows the value actually returned
    reward = min(0.3, reward)

    self.logger.info("⚪ UNPARSEABLE RESPONSE:")
    self.logger.info(f"  Response: {response[:200]}...")
    self.logger.info(f"  Expected: {correct_answer}")
    self.logger.info(f"  Reward: {reward}")

    return reward

# Add the method to GSM8KRewardSignal class for grading purposes
GSM8KRewardSignal.compute_unparseable_reward = compute_unparseable_reward

In [15]:
# ============================================
# UNIT TEST: Exercise 3 - Compute Unparseable Reward
# ============================================

def test_compute_unparseable_reward():
    """Unit test for the compute_unparseable_reward method."""
    print("🧪 Testing Exercise 3: Compute Unparseable Reward")
    print("="*50)
    
    # Create an instance of the reward model
    test_model = GSM8KRewardSignal()
    
    # Test cases: (response, quality_dict, expected_reward_range, description)
    test_cases = [
        # Test 1: Long response with calculations (good attempt)
        # Note: response_length is 180 here, so only the >100 length bonus applies
        {
            "response": "Let me work through this step by step. First I need to add the numbers together, then multiply by the factor mentioned. The calculation involves several steps that I'll show below.",
            "quality": {
                "has_calculation": True,
                "has_steps": True,
                "has_numbers": False,
                "response_length": 180,
                "has_reasoning_words": True,
                "sentence_count": 3
            },
            "min_reward": 0.15,
            "max_reward": 0.25,
            "description": "Long response with calculations (>200 chars)"
        },
        
        # Test 2: Very short dismissive response
        {
            "response": "I don't know",
            "quality": {
                "has_calculation": False,
                "has_steps": False,
                "has_numbers": False,
                "response_length": 12,
                "has_reasoning_words": False,
                "sentence_count": 1
            },
            "min_reward": 0.0,
            "max_reward": 0.0,
            "description": "Very short response (<20 chars)"
        },
        
        # Test 3: Medium response with some reasoning
        {
            "response": "First calculate the total, then find the average",
            "quality": {
                "has_calculation": False,
                "has_steps": True,
                "has_numbers": False,
                "response_length": 49,
                "has_reasoning_words": True,
                "sentence_count": 1
            },
            "min_reward": 0.05,
            "max_reward": 0.15,
            "description": "Medium response with reasoning"
        },
        
        # Test 4: Response with numbers but no final answer
        {
            "response": "We have 10 apples and 5 oranges to work with",
            "quality": {
                "has_calculation": False,
                "has_steps": False,
                "has_numbers": True,
                "response_length": 45,
                "has_reasoning_words": False,
                "sentence_count": 1
            },
            "min_reward": 0.05,
            "max_reward": 0.15,
            "description": "Has numbers but no clear answer"
        },
        
        # Test 5: Long detailed attempt without parseable answer
        {
            "response": "To solve this problem, I would first identify all the given values. Then I'd set up the equation properly. Next, I'd solve for the unknown variable. Finally, I'd check my answer to make sure it makes sense in the context.",
            "quality": {
                "has_calculation": False,
                "has_steps": True,
                "has_numbers": False,
                "response_length": 225,
                "has_reasoning_words": True,
                "sentence_count": 4
            },
            "min_reward": 0.1,
            "max_reward": 0.25,
            "description": "Long detailed methodology without numbers"
        }
    ]
    
    passed = 0
    failed = 0
    
    for test_case in test_cases:
        response = test_case["response"]
        quality = test_case["quality"]
        min_reward = test_case["min_reward"]
        max_reward = test_case["max_reward"]
        description = test_case["description"]
        
        try:
            reward = test_model.compute_unparseable_reward(
                response=response,
                correct_answer=42.0,  # Arbitrary correct answer
                quality=quality,
                question="Test question"
            )
            
            if min_reward <= reward <= max_reward:
                print(f"✅ PASSED: {description}")
                print(f"   Reward: {reward:.2f} (expected: {min_reward:.2f}-{max_reward:.2f})")
                passed += 1
            else:
                print(f"❌ FAILED: {description}")
                print(f"   Response length: {quality['response_length']}")
                print(f"   Reward: {reward:.2f} (expected: {min_reward:.2f}-{max_reward:.2f})")
                failed += 1

        except Exception as e:
            print(f"❌ ERROR: {description}")
            print(f"   Exception: {e}")
            failed += 1
    
    # Summary
    print("\n" + "="*50)
    print(f"Results: {passed}/{len(test_cases)} passed, {failed}/{len(test_cases)} failed")
    
    if failed == 0:
        print("🎉 All tests passed! Exercise 3 is complete.")
    else:
        print("⚠️ Some tests failed. Please review your implementation.")
        print("\nMinimum requirements:")
        print("- Long responses (>200 chars) with calculations → 0.2 reward")
        print("- Responses with reasoning and numbers → 0.1 reward")
        print("- Very short responses (<20 chars) → 0.0 reward")
        print("- Any reasonable attempt → at least 0.05 reward")
    
    return passed == len(test_cases)

# Run the test
test_compute_unparseable_reward()

🧪 Testing Exercise 3: Compute Unparseable Reward
✅ PASSED: Long response with calculations (>200 chars)
   Reward: 0.20 (expected: 0.15-0.25)
✅ PASSED: Very short response (<20 chars)
   Reward: 0.00 (expected: 0.00-0.00)
✅ PASSED: Medium response with reasoning
   Reward: 0.10 (expected: 0.05-0.15)
✅ PASSED: Has numbers but no clear answer
   Reward: 0.10 (expected: 0.05-0.15)
✅ PASSED: Long detailed methodology without numbers
   Reward: 0.20 (expected: 0.10-0.25)

Results: 5/5 passed, 0/5 failed
🎉 All tests passed! Exercise 3 is complete.


True

### Exercise 4: Reward High-Quality Correct Answers

When the model produces the correct answer, you want to encourage not just correctness but also good mathematical communication. A response that simply states "8" is correct but less valuable than one that shows "5 + 3 = 8" with clear reasoning steps.

Your task is to implement a bonus system on top of the base 1.0 reward for correct answers. Use the quality indicators to identify exemplary responses. For example, if the response shows clear step-by-step reasoning (has_steps is true), add a 0.1 bonus. If it provides a detailed explanation (response_length > 100 characters), add 0.1. If it uses proper mathematical reasoning words, add another 0.05, and add 0.05 more if it shows calculations. Cap the total at 1.3.

This bonus system teaches the model that you value not just the right answer but also the problem-solving process, which is crucial for building trust in AI systems and helping users understand the solution.
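
To make the bonus arithmetic concrete, here is a minimal sketch that is separate from the graded cell. The helper name `correct_reward_breakdown` is hypothetical, and the bonus values (+0.1 for steps, +0.1 for length over 100, +0.05 for reasoning words, +0.05 for calculations, cap at 1.3) are the ones listed in the unit-test hints for this exercise.

```python
from typing import Any, Dict, List, Tuple


def correct_reward_breakdown(quality: Dict[str, Any]) -> Tuple[float, List[str]]:
    """Hypothetical helper: return (reward, labels of the bonuses that applied)."""
    bonuses = [
        ("has_steps", 0.10, quality.get("has_steps", False)),
        ("length>100", 0.10, quality.get("response_length", 0) > 100),
        ("has_reasoning_words", 0.05, quality.get("has_reasoning_words", False)),
        ("has_calculation", 0.05, quality.get("has_calculation", False)),
    ]
    reward = 1.0  # base reward for a correct answer
    applied = []
    for label, value, earned in bonuses:
        if earned:
            reward += value
            applied.append(label)
    return min(1.3, reward), applied  # cap at 1.3


# A correct answer with steps and calculations, but only 80 characters long
reward, applied = correct_reward_breakdown(
    {"has_steps": True, "has_calculation": True, "response_length": 80}
)
print(f"{reward:.2f}", applied)  # 1.15 ['has_steps', 'has_calculation']
```

Listing the applied bonuses alongside the total makes it easy to see why a response scored below the 1.3 maximum, for example because it was shorter than 100 characters or because a quality key was missing from the dictionary.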

In [18]:
# GRADED CELL: exercise 4

def compute_correct_reward(self, response: str, predicted: float,
                            correct_answer: float, quality: Dict[str, any], question: str = None) -> float:
    """
    Compute reward for correct answers.

    Base reward is 1.0, with bonuses for showing high-quality work.

    Args:
        response: The model's generated response
        predicted: The extracted numerical answer
        correct_answer: The correct numerical answer
        quality: Quality indicators from analyze_response_quality
        question: The original question (optional, not used in the score)

    Returns:
        Reward between 1.0 and 1.3
    """

    reward = 1.0  # Base reward for correct answer

    # Bonus for showing steps (encourages explanation)
    if quality.get('has_steps', False):
        reward += 0.1

    ### START CODE HERE ### 

    # Add +0.1 bonus for detailed explanation (response_length > 100)
    # Default to 0 so a missing key does not raise a TypeError on comparison
    if quality.get("response_length", 0) > 100:
        reward += 0.1

    # Add +0.05 bonus for using proper reasoning words
    if quality.get("has_reasoning_words", False):
        reward += 0.05

    # Add +0.05 bonus for showing calculations
    if quality.get("has_calculation", False):
        reward += 0.05

    # Cap the maximum reward at 1.3
    reward = min(1.3, reward)

    ### END CODE HERE ###
    
    self.logger.info("✅ CORRECT ANSWER:")
    self.logger.info(f"  Predicted: {predicted}")
    self.logger.info(f"  Expected: {correct_answer}")
    self.logger.info(f"  Reward: {reward}")
    
    return reward

# Add the method to GSM8KRewardSignal class for grading purposes
GSM8KRewardSignal.compute_correct_reward = compute_correct_reward

In [19]:
# ============================================
# UNIT TEST: Exercise 4 - Compute Correct Reward
# ============================================

def test_compute_correct_reward():
    """Unit test for the compute_correct_reward method."""
    print("🧪 Testing Exercise 4: Compute Correct Reward")
    print("="*50)
    
    # Create an instance of the reward model
    test_model = GSM8KRewardSignal()
    
    # Test cases with quality indicators and expected rewards
    # Stricter test cases so the bonus logic is actually exercised
    test_cases = [
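        # Test 1: Perfect answer with all quality indicators - should get bonuses
        # Note: these dicts use the key "has_reasoning", while compute_correct_reward
        # reads "has_reasoning_words", so no reasoning-word bonus applies in these tests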
        {
            "response": "Step 1: Add 5 + 3 = 8. Step 2: Multiply by 2 = 16. Therefore, the answer is 16.",
            "predicted": 16.0,
            "correct": 16.0,
            "quality": {
                "has_calculation": True,
                "has_steps": True,
                "has_reasoning": True,
                "response_length": 80,
                "sentence_count": 3
            },
            "min_reward": 1.15,  # Raised from 1.0 - must have bonuses for quality
            "max_reward": 1.3,
            "description": "Perfect answer with all quality indicators (should get bonuses)"
        },
        # Test 2: Correct but minimal response - should be exactly 1.0
        {
            "response": "16",
            "predicted": 16.0,
            "correct": 16.0,
            "quality": {
                "has_calculation": False,
                "has_steps": False,
                "has_reasoning": False,
                "response_length": 2,
                "sentence_count": 1
            },
            "min_reward": 1.0,
            "max_reward": 1.0,  # Must be exactly 1.0 - no bonuses
            "description": "Correct but minimal response (no bonuses)"
        },
        # Test 3: Correct with some work but not perfect - should get partial bonuses
        {
            "response": "The calculation is 5 + 3 = 8, then 8 * 2 = 16.",
            "predicted": 16.0,
            "correct": 16.0,
            "quality": {
                "has_calculation": True,
                "has_steps": False,
                "has_reasoning": False,
                "response_length": 50,
                "sentence_count": 1
            },
            "min_reward": 1.05,  # Should get some bonus but not max
            "max_reward": 1.15,
            "description": "Correct with calculation but no steps (partial bonus)"
        },
    ]
    
    passed = 0
    failed = 0
    
    for test_case in test_cases:
        response = test_case["response"]
        predicted = test_case["predicted"]
        correct = test_case["correct"]
        quality = test_case["quality"]
        min_reward = test_case["min_reward"]
        max_reward = test_case["max_reward"]
        description = test_case["description"]
        
        try:
            reward = test_model.compute_correct_reward(
                response=response,
                predicted=predicted,
                correct_answer=correct,
                quality=quality,
            )
            
            if min_reward <= reward <= max_reward:
                print(f"✅ PASSED: {description}")
                print(f"   Reward: {reward:.3f} (expected: {min_reward:.3f} - {max_reward:.3f})")
                passed += 1
            else:
                print(f"❌ FAILED: {description}")
                print(f"   Reward: {reward:.3f} (expected: {min_reward:.3f} - {max_reward:.3f})")
                failed += 1

        except Exception as e:
            print(f"❌ ERROR: {description}")
            print(f"   Exception: {e}")
            failed += 1
    
    # Summary
    print("\n" + "="*50)
    print(f"Results: {passed}/{len(test_cases)} passed, {failed}/{len(test_cases)} failed")
    
    if failed == 0:
        print("🎉 All tests passed! Exercise 4 is complete.")
    else:
        print("⚠️  Some tests failed. Please review your implementation:")
        print("Expected behavior:")
        print("- Base reward: 1.0 for correct answer")
        print("- +0.1 bonus for showing steps (has_steps=True)")
        print("- +0.1 bonus for detailed explanation (>100 chars)")
        print("- +0.05 bonus for using reasoning words")
        print("- +0.05 bonus for showing calculations")
        print("- Max reward: 1.3")
    
    return passed == len(test_cases)

# Run the test
test_compute_correct_reward()

🧪 Testing Exercise 4: Compute Correct Reward
✅ PASSED: Perfect answer with all quality indicators (should get bonuses)
   Reward: 1.150 (expected: 1.150 - 1.300)
✅ PASSED: Correct but minimal response (no bonuses)
   Reward: 1.000 (expected: 1.000 - 1.000)
✅ PASSED: Correct with calculation but no steps (partial bonus)
   Reward: 1.050 (expected: 1.050 - 1.150)

Results: 3/3 passed, 0/3 failed
🎉 All tests passed! Exercise 4 is complete.


True

### Exercise 5: Implement Partial Credit for Wrong Answers

This is perhaps the most critical exercise for effective GRPO training. The current implementation gives 0.0 reward to all wrong answers, which means the model learns nothing from near-misses or partially correct solutions. This binary reward system (1.0 for correct, 0.0 for wrong) makes learning extremely difficult and slow.

Your task is to implement a partial credit system based on how close the wrong answer is to the correct one. Start by calculating the relative error between the predicted and correct answers. Responses with less than 1% error should receive 0.9 reward (they're almost right!), less than 5% should get 0.7, less than 10% should get 0.5, and less than 30% should get 0.3.

Any other parsed answer should still get 0.1 reward for the attempt. The unit tests also accept, but do not require, 0.3 for answers in the right order of magnitude (between 0.1x and 10x the correct answer). Finally, add a bonus of 0.1 if the response shows work (has calculations and steps) even though the final answer is wrong, or 0.05 if it shows calculations without steps, plus 0.05 for responses longer than 200 characters. The total is capped at 0.9 so that a wrong answer never outscores a correct one. Rewarding shown work encourages the model to expose its reasoning, making it easier to debug and improve.
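
The relative-error step can be tried on its own before wiring it into the reward. The sketch below uses a hypothetical helper name, `relative_error`, and mirrors the calculation provided in the graded cell, including the fallback to absolute error when the correct answer is 0. It also shows a boundary effect worth keeping in mind: with strict `<` comparisons, an error of exactly 10% does not count as "under 10%".

```python
def relative_error(predicted: float, correct: float) -> float:
    """Relative error, falling back to absolute error when the target is 0."""
    if correct != 0:
        return abs(predicted - correct) / abs(correct)
    return abs(predicted - correct)


# (predicted, correct) pairs; the last one exercises the zero-target fallback
pairs = [(100.5, 100.0), (45.0, 50.0), (70.0, 100.0), (3.0, 0.0)]
for predicted, correct in pairs:
    err = relative_error(predicted, correct)
    print(f"predicted={predicted}, correct={correct}, "
          f"error={err:.3f}, under 10%: {err < 0.10}")
```

For the pair (45.0, 50.0) the error is exactly 0.1, so `err < 0.10` is `False` and the answer falls into the next, looser bracket.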

This graduated reward system is essential for GRPO because it provides a learning gradient - the model can learn that some wrong answers are "less wrong" than others and gradually improve toward the correct solution.
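
To see why graded rewards matter, consider how a group of sampled responses is compared. A common formulation of GRPO standardizes each reward within its group, roughly `(r_i - mean) / (std + eps)`. The sketch below assumes that formulation; the function name `group_advantages`, the `eps` value, and the use of the population standard deviation are illustrative choices, and the training library used later in this lab may differ in detail.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


binary = [0.0, 0.0, 0.0, 0.0]  # four wrong answers under a 0/1 reward
graded = [0.9, 0.3, 0.1, 0.1]  # the same four answers with partial credit

print(group_advantages(binary))  # all zeros: nothing to prefer
print(group_advantages(graded))  # the near-miss stands out as positive
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no preference signal. With partial credit, the near-miss gets a positive advantage and the poorer attempts get negative ones.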

In [20]:
# GRADED CELL: exercise 5

def compute_wrong_reward(self, response: str, predicted: float,
                        correct_answer: float, quality: Dict[str, any], question: str = None) -> float:
    """
    Compute partial credit for wrong answers.

    This is critical for learning! Instead of 0.0 for all wrong answers,
    give partial credit based on:
    1. How close the answer is
    2. Whether work was shown
    3. Quality of reasoning

    Args:
        response: The model's generated response
        predicted: The extracted numerical answer
        correct_answer: The correct numerical answer
        quality: Quality indicators from analyze_response_quality
        question: The original question (optional, not used in the score)

    Returns:
        Reward between 0.1 and 0.9
    """

    reward = 0.0  # Start from zero; the tiers below add partial credit

    # Calculate relative error
    if correct_answer != 0:
        relative_error = abs(predicted - correct_answer) / abs(correct_answer)
    else:
        relative_error = abs(predicted - correct_answer)

    ### START CODE HERE ### 
    # Base reward based on how close the answer is (strict comparisons,
    # so an error of exactly 10% falls into the <30% tier)
    # Under 1% error → 0.9 reward
    if relative_error < 0.01:
        reward += 0.9
    # Under 5% error → 0.7 reward
    elif relative_error < 0.05:
        reward += 0.7
    # Under 10% error → 0.5 reward
    elif relative_error < 0.1:
        reward += 0.5
    # Under 30% error → 0.3 reward
    elif relative_error < 0.3:
        reward += 0.3
    # Any attempt → 0.1 reward
    else:
        reward += 0.1

    # Add bonus (+0.1) for showing calculations and steps (has_calculation and has_steps)
    if quality.get("has_calculation", False) and quality.get("has_steps", False):
        reward += 0.1
    # Add bonus (+0.05) for showing calculations but no steps (has_calculation only)
    elif quality.get("has_calculation", False):
        reward += 0.05

    # Small bonus for longer, more detailed responses
    # Default to 0 so a missing key does not raise a TypeError on comparison
    if quality.get("response_length", 0) > 200:
        reward += 0.05
    
    ### END CODE HERE ###

    # Ensure minimum reward for any attempt
    reward = max(0.1, reward)

    # Cap at 0.9 (to keep it below correct answers)
    reward = min(0.9, reward)

    self.logger.info("❌ WRONG ANSWER:")
    self.logger.info(f"  Predicted: {predicted}")
    self.logger.info(f"  Expected: {correct_answer}")
    self.logger.info(f"  Reward: {reward}")

    return reward

# Add the method to GSM8KRewardSignal class for grading purposes
GSM8KRewardSignal.compute_wrong_reward = compute_wrong_reward

In [21]:
# ============================================
# UNIT TEST: Exercise 5 - Compute Wrong Reward
# ============================================

def test_compute_wrong_reward():
    """Unit test for the compute_wrong_reward method."""
    print("🧪 Testing Exercise 5: Compute Wrong Reward")
    print("="*50)
    
    # Create an instance of the reward model
    test_model = GSM8KRewardSignal()
    
    # Test cases with various error levels
    test_cases = [
        # Test 1: Almost correct (within 1%)
        {
            "response": "The answer is 100.5",
            "predicted": 100.5,
            "correct": 100.0,
            "quality": {
                "has_calculation": False,
                "has_steps": False,
                "has_numbers": True,
                "response_length": 19,
                "has_reasoning_words": False,
                "sentence_count": 1
            },
            "min_reward": 0.85,
            "max_reward": 0.95,
            "description": "Within 1% error (0.5% off)"
        },
        
        # Test 2: Close but wrong (within 10%)
        {
            "response": "After calculations, I get 45",
            "predicted": 45.0,
            "correct": 50.0,
            "quality": {
                "has_calculation": False,
                "has_steps": False,
                "has_numbers": True,
                "response_length": 28,
                "has_reasoning_words": False,
                "sentence_count": 1
            },
            "min_reward": 0.3,  # exactly 10% error is not < 0.10, so it lands in the <30% tier
            "max_reward": 0.4,
            "description": "Within 10% error"
        },
        
        # Test 3: Moderately wrong (within 30%)
        {
            "response": "The result is 70",
            "predicted": 70.0,
            "correct": 100.0,
            "quality": {
                "has_calculation": False,
                "has_steps": False,
                "has_numbers": True,
                "response_length": 16,
                "has_reasoning_words": False,
                "sentence_count": 1
            },
            "min_reward": 0.1,  # exactly 30% error is not < 0.30, so it gets the 0.1 base
            "max_reward": 0.15,
            "description": "Within 30% error"
        },
        
        # Test 4: Right order of magnitude
        {
            "response": "Approximately 500",
            "predicted": 500.0,
            "correct": 100.0,
            "quality": {
                "has_calculation": False,
                "has_steps": False,
                "has_numbers": True,
                "response_length": 17,
                "has_reasoning_words": False,
                "sentence_count": 1
            },
            "min_reward": 0.1,
            "max_reward": 0.35,
            "description": "Right order of magnitude (5x off)"
        },
        
        # Test 5: Wrong but shows work
        {
            "response": "Step 1: Add 10 + 20 = 30. Step 2: Multiply by 3 = 90. The answer is 90.",
            "predicted": 90.0,
            "correct": 60.0,
            "quality": {
                "has_calculation": True,
                "has_steps": True,
                "has_numbers": True,
                "response_length": 72,
                "has_reasoning_words": False,
                "sentence_count": 3
            },
            "min_reward": 0.15,  # 0.1 base for 50% error + 0.1 for showing work = 0.2
            "max_reward": 0.25,
            "description": "Wrong but shows detailed work"
        },
        
        # Test 6: Very wrong
        {
            "response": "The answer is 1000000",
            "predicted": 1000000.0,
            "correct": 10.0,
            "quality": {
                "has_calculation": False,
                "has_steps": False,
                "has_numbers": True,
                "response_length": 21,
                "has_reasoning_words": False,
                "sentence_count": 1
            },
            "min_reward": 0.05,
            "max_reward": 0.15,
            "description": "Very wrong answer"
        }
    ]
    
    passed = 0
    failed = 0
    
    for test_case in test_cases:
        response = test_case["response"]
        predicted = test_case["predicted"]
        correct = test_case["correct"]
        quality = test_case["quality"]
        min_reward = test_case["min_reward"]
        max_reward = test_case["max_reward"]
        description = test_case["description"]
        
        try:
            reward = test_model.compute_wrong_reward(
                response=response,
                predicted=predicted,
                correct_answer=correct,
                quality=quality
            )
            
            # Calculate actual error for debugging
            relative_error = abs(predicted - correct) / (abs(correct) + 1e-10)
            
            if min_reward <= reward <= max_reward:
                print(f"✅ PASSED: {description}")
                print(f"   Error: {relative_error:.2%}, Reward: {reward:.2f} (expected: {min_reward:.2f}-{max_reward:.2f})")
                passed += 1
            else:
                print(f"❌ FAILED: {description}")
                print(f"   Predicted: {predicted}, Correct: {correct}, Error: {relative_error:.2%}")
                print(f"   Reward: {reward:.2f} (expected: {min_reward:.2f}-{max_reward:.2f})")
                failed += 1

        except Exception as e:
            print(f"❌ ERROR: {description}")
            print(f"   Exception: {e}")
            failed += 1
    
    # Summary
    print("\n" + "="*50)
    print(f"Results: {passed}/{len(test_cases)} passed, {failed}/{len(test_cases)} failed")
    
    if failed == 0:
        print("🎉 All tests passed! Exercise 5 is complete.")
    else:
        print("⚠️ Some tests failed. Please review your implementation.")
        print("\nMinimum requirements (partial credit system):")
        print("- Under 1% error → 0.9 reward")
        print("- Under 5% error → 0.7 reward")
        print("- Under 10% error → 0.5 reward")
        print("- Under 30% error → 0.3 reward")
        print("- Any other attempt → 0.1 reward")
        print("- +0.1 bonus for showing work (calculations + steps)")
    
    return passed == len(test_cases)

# Run the test
test_compute_wrong_reward()

🧪 Testing Exercise 5: Compute Wrong Reward
✅ PASSED: Within 1% error (0.5% off)
   Error: 0.50%, Reward: 0.90 (expected: 0.85-0.95)
✅ PASSED: Within 10% error
   Error: 10.00%, Reward: 0.30 (expected: 0.30-0.40)
✅ PASSED: Within 30% error
   Error: 30.00%, Reward: 0.10 (expected: 0.10-0.15)
✅ PASSED: Right order of magnitude (5x off)
   Error: 400.00%, Reward: 0.10 (expected: 0.10-0.35)
✅ PASSED: Wrong but shows detailed work
   Error: 50.00%, Reward: 0.20 (expected: 0.15-0.25)
✅ PASSED: Very wrong answer
   Error: 9999900.00%, Reward: 0.10 (expected: 0.05-0.15)

Results: 6/6 passed, 0/6 failed
🎉 All tests passed! Exercise 5 is complete.


True

In [22]:
# This cell will be graded - compute_reward method (main orchestrator)
def compute_reward(self, response: str, correct_answer: float, question: str = None) -> float:
    """
    Main reward computation function - delegates to specialized methods.
    
    This function orchestrates the reward computation by:
    1. Extracting numerical answer from response
    2. Analyzing response quality indicators  
    3. Calling appropriate reward computation method
    4. Returning final reward value
    
    Args:
        response: The model's response text
        correct_answer: The correct numerical answer
        question: The original question (optional)
        
    Returns:
        float: Final reward value for this response
    """
    # Step 1: Try to extract numerical answer
    predicted = self.extract_numerical_answer(response)
    
    # Step 2: Analyze response quality
    quality = self.analyze_response_quality(response)
    
    # Step 3: Route to appropriate reward computation
    if predicted is None:
        # Case 1: Could not parse any numerical answer
        return self.compute_unparseable_reward(response, correct_answer, quality, question)
    elif abs(predicted - correct_answer) < 0.01:
        # Case 2: Answer is correct (within small tolerance)
        return self.compute_correct_reward(response, predicted, correct_answer, quality, question)
    else:
        # Case 3: Answer is wrong
        return self.compute_wrong_reward(response, predicted, correct_answer, quality, question)

# Add the method to GSM8KRewardSignal class for grading purposes
GSM8KRewardSignal.compute_reward = compute_reward

print("✅ Reward model class defined!")
print("This will be your 'grading system' for model responses.")

✅ Reward model class defined!
This will be your 'grading system' for model responses.


In [23]:
# ============================================
# Test the Reward Model
# ============================================

# Test your reward model with different types of answers
print("🧪 TESTING THE REWARD MODEL")
print("="*60)
print("See how it grades different types of responses:\n")

reward_model = GSM8KRewardSignal()
print("✅ Reward model created with detailed logging for debugging!")
print("Check the log file to see question, response, and reward details for each generation.")

# Test cases: (response, correct_answer)
test_cases = [
    # Perfect answer with steps
    ("Let me calculate step by step: 5 + 3 = 8. The answer is #### 8", 8.0),
    
    # Correct but no steps
    ("The answer is 8", 8.0),
    
    # Close but wrong
    ("5 + 3 = 7. So the answer is 7.", 8.0),
    
    # Very wrong
    ("The total is 100", 8.0),
    
    # Shows work but wrong
    ("First I add 5 + 3 = 9. Then I multiply by 2 = 18. Answer: 18", 8.0),
    
    # No attempt
    ("I don't know how to solve this", 8.0),
]

for i, (response, correct) in enumerate(test_cases, 1):
    print(f"Test {i}:")
    # Truncate long responses for display
    shown = f"{response[:50]}..." if len(response) > 50 else response
    print(f"  Response: \"{shown}\"")
    
    # Compute reward
    reward = reward_model.compute_reward(response, correct, "Test question")
    
    # Extract answer for display
    extracted = reward_model.extract_numerical_answer(response)
    
    print(f"  Extracted answer: {extracted}")
    print(f"  Correct answer: {correct}")
    print(f"  Reward: {reward:.2f}")
    
    # Explain the reward
    if reward >= 1.0:
        print("  ✅ Excellent!")
    elif reward >= 0.5:
        print("  🟨 Good attempt")
    elif reward >= 0.2:
        print("  🟠 Some credit")
    else:
        print("  ⚪ Minimal credit")
    print()

print(f"\n💡 Note: Detailed logs are saved to: {log_file}")

🧪 TESTING THE REWARD MODEL
See how it grades different types of responses:

✅ Reward model created with detailed logging for debugging!
Check the log file to see question, response, and reward details for each generation.
Test 1:
  Response: "Let me calculate step by step: 5 + 3 = 8. The answ..."
  Extracted answer: 8.0
  Correct answer: 8.0
  Reward: 1.15
  ✅ Excellent!

Test 2:
  Response: "The answer is 8"
  Extracted answer: 8.0
  Correct answer: 8.0
  Reward: 1.00
  ✅ Excellent!

Test 3:
  Response: "5 + 3 = 7. So the answer is 7."
  Extracted answer: 7.0
  Correct answer: 8.0
  Reward: 0.40
  🟠 Some credit

Test 4:
  Response: "The total is 100"
  Extracted answer: 100.0
  Correct answer: 8.0
  Reward: 0.10
  ⚪ Minimal credit

Test 5:
  Response: "First I add 5 + 3 = 9. Then I multiply by 2 = 18. ..."
  Extracted answer: 18.0
  Correct answer: 8.0
  Reward: 0.20
  🟠 Some credit

Test 6:
  Response: "I don't know how to solve this"
  Extracted answer: None
  Corre

## Load the Language Model <a id="loadthelanguagemodel"></a>

### About DeepSeek Math Model

You will be using the **DeepSeek Math 7B Base** model.

### What You are Loading

1. **Tokenizer**: Converts text to numbers the model understands
2. **Model**: The actual neural network with all the parameters

### Memory Requirements

- Loading the model takes about 14-28 GB of GPU memory: roughly 14 GB for 7B weights in 16-bit precision, or 28 GB in 32-bit
- Loading may take 1-2 minutes
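
The 14-28 GB range can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes the 6.91 billion parameter count that the model-loading cell reports, and counts weights only: gradients, optimizer state, and activations during training need additional memory. The helper name `weight_memory_gb` is made up for this illustration.

```python
# Rough weight-only memory estimate: parameters x bytes per parameter.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 6.91e9  # parameter count reported when the model is loaded
for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: ~{weight_memory_gb(num_params, dtype):.1f} GB")
# bfloat16: ~13.8 GB
# float32: ~27.6 GB
```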

In [24]:
# ============================================
# Load the Tokenizer
# ============================================

print("Loading tokenizer...")
print("The tokenizer converts text to tokens (numbers) that the model understands.\n")

is_local = os.path.exists(config.model_name)

# Load the tokenizer
# trust_remote_code=True allows loading custom code from the model repository
tokenizer = AutoTokenizer.from_pretrained(
    config.model_name,
    trust_remote_code=True,  # Some models have custom tokenizer code
    local_files_only=is_local
)

# Set up padding token
# Padding is used to make all inputs the same length
if tokenizer.pad_token is None:
    # If no padding token defined, use the end-of-sequence token
    tokenizer.pad_token = tokenizer.eos_token

# Set padding to left side for generation tasks
# This ensures the actual text is on the right (where model expects it)
tokenizer.padding_side = "left"

print("✅ Tokenizer loaded successfully!")
print(f"  Vocabulary size: {len(tokenizer):,} tokens")
print(f"  Padding token: {tokenizer.pad_token}")
print(f"  End-of-sequence token: {tokenizer.eos_token}")

# Example of tokenization
example_text = "What is 5 + 3?"
tokens = tokenizer.encode(example_text)
print(f"\nExample tokenization:")
print(f"  Text: \"{example_text}\"")
print(f"  Tokens: {tokens[:10]}..." if len(tokens) > 10 else f"  Tokens: {tokens}")
print(f"  Number of tokens: {len(tokens)}")

Loading tokenizer...
The tokenizer converts text to tokens (numbers) that the model understands.

✅ Tokenizer loaded successfully!
  Vocabulary size: 100,002 tokens
  Padding token: <｜end▁of▁sentence｜>
  End-of-sequence token: <｜end▁of▁sentence｜>

Example tokenization:
  Text: "What is 5 + 3?"
  Tokens: [100000, 2640, 317, 207, 20, 919, 207, 18, 30]
  Number of tokens: 9
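
To see why the tokenizer is set to pad on the left, here is a small pure-Python sketch. A decoder-only model appends new tokens at the right end of the sequence, so the last real prompt token has to sit at the right edge of every row in the batch. The function `pad_left`, the padding ID `0`, and the token IDs are all made up for illustration; the real tokenizer does this internally when `padding_side = "left"`.

```python
def pad_left(token_ids, target_len, pad_id):
    """Left-pad a list of token IDs to target_len; return (ids, attention_mask)."""
    n_pad = target_len - len(token_ids)
    if n_pad < 0:
        raise ValueError("sequence is longer than target_len")
    padded = [pad_id] * n_pad + list(token_ids)
    mask = [0] * n_pad + [1] * len(token_ids)  # 0 = padding, 1 = real token
    return padded, mask


PAD_ID = 0  # hypothetical padding ID, not the tokenizer's real one
batch = [[11, 12, 13, 14], [21, 22]]  # two made-up prompts of different lengths
max_len = max(len(seq) for seq in batch)

for seq in batch:
    ids, mask = pad_left(seq, max_len, PAD_ID)
    print(ids, mask)
# [11, 12, 13, 14] [1, 1, 1, 1]
# [0, 0, 21, 22] [0, 0, 1, 1]
```

Both rows now end with a real token, so generation continues directly from the prompt in each case.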


In [25]:
# ============================================
# Load the Language Model
# ============================================

print("Loading the DeepSeek Math model...")
print("This may take 1-2 minutes depending on your internet speed.\n")

# Set up model loading arguments
model_kwargs = {
    "trust_remote_code": True,  # Allow custom model code
    "dtype": torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    "use_cache": True,
    "local_files_only": is_local,
    # bfloat16: Uses less memory than float32 but maintains good precision
    # float32: Full precision (used on CPU)
}

# Optional: Use 8-bit quantization to save memory
if config.use_8bit:
    print("Using 8-bit mode to save memory...")
    model_kwargs["load_in_8bit"] = True
    # 8-bit mode reduces memory usage by ~50% with minimal accuracy loss

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    config.model_name,
    **model_kwargs
)
model.to(device)
print("✅ Model loaded successfully!")

# Display model information
print(f"\nModel Information:")
print(f"  Model type: {model.__class__.__name__}")
print(f"  Number of parameters: {sum(p.numel() for p in model.parameters()) / 1e9:.2f} billion")

# Check model device
if next(model.parameters()).is_cuda:
    print(f"  Location: GPU (fast training!)")
else:
    print(f"  Location: CPU (slower training)")

Loading the DeepSeek Math model...
This may take 1-2 minutes depending on your internet speed.

✅ Model loaded successfully!

Model Information:
  Model type: LlamaForCausalLM
  Number of parameters: 6.91 billion
  Location: GPU (fast training!)


In [26]:
# ============================================
# Enable Memory Optimizations
# ============================================

print("Enabling memory optimizations...\n")

# Enable gradient checkpointing
# This trades computation for memory by not storing all intermediate values
if hasattr(model, 'gradient_checkpointing_enable'):
    model.gradient_checkpointing_enable()
    print("✅ Gradient checkpointing enabled")
    print("   This saves memory by recomputing values when needed")
    print("   Training will be slightly slower but use less memory")
else:
    print("Gradient checkpointing not available for this model")

# Test the model with a simple generation
print("\nTesting model with a simple math problem...")
test_prompt = "What is 2 + 2? The answer is"
inputs = tokenizer(test_prompt, return_tensors="pt")

# Move inputs to same device as model
if torch.cuda.is_available():
    inputs = {k: v.cuda() for k, v in inputs.items()}

# Generate a short response
with torch.no_grad():  # Don't calculate gradients for this test
    outputs = model.generate(
        **inputs,
        max_new_tokens=10,
        temperature=0.1,  # Low temperature for near-deterministic sampling
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id  # Explicitly set to suppress warning
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"  Prompt: \"{test_prompt}\"")
print(f"  Model response: \"{response}\"")
print("\n✅ Model is working correctly!")

Enabling memory optimizations...

✅ Gradient checkpointing enabled
   This saves memory by recomputing values when needed
   Training will be slightly slower but use less memory

Testing model with a simple math problem...
  Prompt: "What is 2 + 2? The answer is"
  Model response: "What is 2 + 2? The answer is 4. What is 2 + 2"

✅ Model is working correctly!


## Prepare Training and Validation Datasets <a id="preparetraining"></a>

### Data Splitting

You need to split the data into two parts:
1. **Training set** (80%): Used to train the model
2. **Validation set** (20%): Used to check progress during training

### Data Preparation Steps

For each problem, you:
1. Create a prompt (the question)
2. Extract the numerical answer
3. Format it for the model

### Prompt Template

You use a specific format to help the model understand what you want:
```
Question: [math problem]
Let's solve this step-by-step and find the numerical answer:
```
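
The preparation steps above can be sketched in a few lines. This is not the lab's `prepare_dataset` implementation, which is not shown here; `build_prompt` and `extract_gsm8k_answer` are hypothetical helpers, and the exact whitespace between the question and the instruction line may differ in the real utility. The sketch relies on the GSM8K convention that each reference solution ends with `#### <number>`.

```python
import re

PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Let's solve this step-by-step and find the numerical answer:"
)


def build_prompt(question: str) -> str:
    """Wrap a raw question in the prompt template."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def extract_gsm8k_answer(solution: str):
    """Return the number after the '####' marker as a float, or None if absent."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))  # drop thousands separators


# Made-up example in GSM8K style
solution = "5 + 3 = <<5+3=8>>8\n#### 8"
print(build_prompt("What is 5 + 3?"))
print(extract_gsm8k_answer(solution))  # 8.0
```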

In [27]:
# ============================================
# Prepare Training and Validation Datasets
# ============================================

print("Preparing datasets for training...\n")

# Use the utility function to prepare the datasets
# This function:
# 1. Loads the GSM8K training data
# 2. Splits it into train/validation
# 3. Formats each problem with the prompt template
# 4. Extracts numerical answers
train_dataset, eval_dataset = prepare_dataset(config, tokenizer)

# Display dataset statistics
print("\nDataset Statistics:")
print(f"  Total examples: {len(train_dataset) + len(eval_dataset):,}")
print(f"  Training examples: {len(train_dataset):,} ({len(train_dataset)/(len(train_dataset)+len(eval_dataset))*100:.1f}%)")
print(f"  Validation examples: {len(eval_dataset):,} ({len(eval_dataset)/(len(train_dataset)+len(eval_dataset))*100:.1f}%)")

# Calculate training iterations
steps_per_epoch = len(train_dataset) // (config.per_device_train_batch_size * config.gradient_accumulation_steps)
total_steps = steps_per_epoch * config.num_train_epochs
print(f"\nTraining Iterations:")
print(f"  Steps per epoch: {steps_per_epoch}")
print(f"  Total training steps: {total_steps}")
print(f"  Evaluations during training: {total_steps // config.eval_steps}")

Preparing datasets for training...


Dataset Statistics:
  Total examples: 7,473
  Training examples: 5,978 (80.0%)
  Validation examples: 1,495 (20.0%)

Training Iterations:
  Steps per epoch: 93
  Total training steps: 465
  Evaluations during training: 23
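
The numbers above follow from integer division. The sketch below redoes the arithmetic with plain values: the effective batch size of 64 is the one reported in the training summary later in this notebook, and the epoch count of 5 is inferred from 465 total steps divided by 93 steps per epoch rather than read from `config`.

```python
train_examples = 5978       # size of the training split
effective_batch_size = 64   # per_device_train_batch_size * gradient_accumulation_steps
num_train_epochs = 5        # inferred: 465 total steps / 93 steps per epoch
eval_steps = 20

steps_per_epoch = train_examples // effective_batch_size   # floor division
leftover = train_examples % effective_batch_size           # examples past the last full batch
total_steps = steps_per_epoch * num_train_epochs
num_evals = total_steps // eval_steps

print(steps_per_epoch, leftover, total_steps, num_evals)  # 93 26 465 23
```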


In [28]:
# ============================================
# Show Sample Training Examples
# ============================================

print("📝 Sample Training Examples:")
print("="*60)

# Show 3 examples from the training set
for i in range(min(3, len(train_dataset))):
    sample = train_dataset[i]
    
    print(f"\nExample {i+1}:")
    print("-"*40)
    
    # Show the prompt (that you give the model)
    print("PROMPT (Input to model):")
    print(sample['prompt'][:300] + "..." if len(sample['prompt']) > 300 else sample['prompt'])
    
    # Show the expected answer
    print(f"\nEXPECTED NUMERICAL ANSWER: {sample['answer']}")
    
    # Show part of the solution
    if 'answer_text' in sample:
        print("\nSOLUTION STEPS (first part):")
        print(sample['answer_text'][:200] + "..." if len(sample['answer_text']) > 200 else sample['answer_text'])

print("\n" + "="*60)
print("The model will learn to generate step-by-step solutions like these!")

📝 Sample Training Examples:

Example 1:
----------------------------------------
PROMPT (Input to model):
Question: Stefan goes to a restaurant to eat dinner with his family. They order an appetizer that costs $10 and 4 entrees that are $20 each. If they tip 20% of the total for the waiter, what is the total amount of money that they spend at the restaurant?

Let's solve this step-by-step and find the n...

EXPECTED NUMERICAL ANSWER: 108.0

SOLUTION STEPS (first part):
The total cost of the entrees is 4 * $20 = $<<4*20=80>>80.
The total cost of the dinner is $80 + $10 = $<<80+10=90>>90.
The tip is $90 * 0.20 = $<<90*0.20=18>>18
The total cost with tip is $90 + $18 =...

Example 2:
----------------------------------------
PROMPT (Input to model):
Question: The gauge on a water tank shows that the tank is 1/3 full of water. To fill the tank, 16 gallons of water are added. How many gallons of water does the tank hold when full?

Let's solve this step-by-step and find the numerical answ

## Create Evaluation Callback <a id="createevaluation"></a>

### Why Evaluation Matters

During training, you want to know:
- Is the model getting better?
- What's the current accuracy?
- Should you stop training?

### What You Track

The evaluation callback measures:
1. **Accuracy**: Percentage of correct answers
2. **Average Reward**: How good the answers are overall
3. **Sample Outputs**: Actual model responses

### Evaluation Frequency

You evaluate:
- Every 20 training steps (configurable)
- On the test set (never seen during training)
- On the full test set in this lab (`sample_size=1.0`); a smaller sample can be used for speed
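
As a rough illustration of the first two metrics, the sketch below computes accuracy and average reward from parallel lists. It is not the code inside `GSM8KEvaluationCallback`, which is not shown in this notebook; the function name `evaluate_predictions` and the tolerance are assumptions. The sample values are the first five cases from the reward model test earlier.

```python
import math


def evaluate_predictions(predicted, expected, rewards, tol=1e-4):
    """Return (accuracy, average_reward) for parallel lists of equal length."""
    correct = sum(
        p is not None and math.isclose(p, e, abs_tol=tol)
        for p, e in zip(predicted, expected)
    )
    accuracy = correct / len(expected)
    average_reward = sum(rewards) / len(rewards)
    return accuracy, average_reward


predicted = [8.0, 8.0, 7.0, 100.0, 18.0]   # extracted answers
expected = [8.0] * 5                        # correct answer for every case
rewards = [1.15, 1.00, 0.40, 0.10, 0.20]    # rewards assigned by the reward model

accuracy, average_reward = evaluate_predictions(predicted, expected, rewards)
print(f"Accuracy: {accuracy:.0%}, average reward: {average_reward:.2f}")
# Accuracy: 40%, average reward: 0.57
```

Accuracy only counts exact hits, while the average reward also moves when responses earn partial credit, which is why the two are tracked side by side.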

In [29]:
# ============================================
# Set Up Evaluation System
# ============================================

print("Setting up evaluation system...\n")

# Load the test dataset for evaluation
# IMPORTANT: you should use the TEST set (not training data) to measure true performance
print("Loading GSM8K test dataset...")
gsm8k = load_from_disk("/app/data/gsm8k")
test_dataset = gsm8k["test"]
print(f"✅ Loaded {len(test_dataset):,} test examples")
print("   These are NEVER seen during training (prevents cheating)\n")

# Create the evaluation callback
# This will run periodically during training to check progress
eval_callback = GSM8KEvaluationCallback(
    tokenizer=tokenizer,
    test_dataset=test_dataset,
    batch_size=64,  # How many examples to evaluate at once
    sample_size=1.0  # Use full test set (1.0 = 100%)
)

print("✅ Evaluation system configured!")
print(f"\nEvaluation Details:")
print(f"  Will evaluate on: {len(test_dataset)} test examples")
print(f"  Evaluation frequency: Every {config.eval_steps} training steps")
print(f"  Metrics tracked: Accuracy, Average Reward, Sample Outputs")
print(f"\nTip: Watch the accuracy increase during training!")

Setting up evaluation system...

Loading GSM8K test dataset...
✅ Loaded 1,319 test examples
   These are NEVER seen during training (prevents cheating)

✅ Evaluation system configured!

Evaluation Details:
  Will evaluate on: 1319 test examples
  Evaluation frequency: Every 20 training steps
  Metrics tracked: Accuracy, Average Reward, Sample Outputs

Tip: Watch the accuracy increase during training!


## Configure GRPO Trainer <a id="configuregrpotrainer"></a>

### The GRPO Training Process

Here's how GRPO training works:

1. **Generate Multiple Answers**: For each question, generate 12 different answers
2. **Score Each Answer**: Use your reward model to grade each one
3. **Compare Within Group**: See which answers are better than others
4. **Update Model**: Teach it to prefer the better answers

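GRPO still scores every answer, but it trains on each score relative to the rest of its group rather than on the raw value. Instead of saying "this answer is worth 0.7 points", it says "this answer scored above (or below) the average of the 12 answers to this question". This relative signal tends to be more stable than training on absolute scores.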
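
A minimal sketch of the group comparison, using made-up rewards for four answers to one question. Centering on the group mean is what this lab's reward function does; the GRPO paper additionally divides by the group's standard deviation, which is shown here as an option. The function name and the use of the population standard deviation are choices made for this illustration.

```python
from statistics import mean, pstdev


def group_relative_scores(rewards, scale_by_std=False, eps=1e-8):
    """Center rewards on the group mean; optionally divide by the group std."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if scale_by_std:
        sigma = pstdev(rewards)  # population std of the group
        centered = [c / (sigma + eps) for c in centered]
    return centered


group = [1.2, 1.0, 0.4, 0.2]  # made-up rewards for four sampled answers
print([round(s, 3) for s in group_relative_scores(group)])
# [0.5, 0.3, -0.3, -0.5]
```

Answers above the group average get a positive score and are reinforced; answers below it get a negative score. If every answer in a group receives the same reward, all scores are zero and that group contributes no learning signal.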

### Key Components

1. **Trainer**: Orchestrates the training process
2. **Reward Function**: Grades the generated answers
3. **Configuration**: All your settings from earlier

Note the `GRPOConfig()` within the `create_grpo_trainer()` function. This is where you set various parameters such as the temperature, as you have seen in the video.

In [30]:
# ============================================
# Define the GRPO Trainer Creation Function
# ============================================

def create_grpo_trainer(config, model, tokenizer, train_dataset, eval_dataset, reward_model, test_dataset=None):
    """
    Create and configure the GRPO trainer.
    
    This function sets up everything needed for GRPO training:
    1. Configuration for the training process
    2. Reward computation function
    3. The trainer itself
    
    Args:
        config: Training configuration
        model: The language model to train
        tokenizer: Tokenizer for the model
        train_dataset: Training data
        eval_dataset: Validation data
        reward_model: Your reward scoring system
        test_dataset: Test data for evaluation
    
    Returns:
        A tuple (GRPO_config, prompt2ans): the GRPOConfig to pass to the trainer
        and a dict mapping each prompt to its (answer, question) pair
    """
    
    print("🔧 Creating GRPO trainer...\n")
    
    # Build a dictionary for fast answer lookup
    # This maps each prompt to its correct answer
    # O(1) lookup is much faster than searching through the dataset
    prompt2ans = {}
    for item in train_dataset:
        prompt2ans[item['prompt']] = (item['answer'], item['question'])
    print(f"Built answer lookup with {len(prompt2ans):,} entries")
    
    # Configure GRPO training parameters 
    # In most cases, you don't need to modify the parameters below
    GRPO_config = GRPOConfig(
        # ===== GRPO SPECIFIC =====
        num_generations=config.num_generations,  # Generate 12 answers per question
        temperature=config.temperature,  # Randomness in generation
        generation_kwargs={
            "max_new_tokens": config.max_new_tokens,
            "temperature": config.temperature,
            "top_p": 0.95,  # Nucleus sampling: keep the smallest token set with cumulative probability >= 0.95
            "do_sample": True,  # Enable random sampling
            "pad_token_id": tokenizer.pad_token_id,
            "eos_token_id": tokenizer.eos_token_id,
        },
        generation_batch_size=min(96, config.per_device_train_batch_size * config.num_generations),
        
        # ===== OUTPUT SETTINGS =====
        output_dir=config.output_dir,
        
        # ===== TRAINING DURATION =====
        num_train_epochs=config.num_train_epochs,
        
        # ===== BATCH SETTINGS =====
        per_device_train_batch_size=config.per_device_train_batch_size,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        
        # ===== LEARNING SETTINGS =====
        learning_rate=config.learning_rate,
        lr_scheduler_type="cosine",  # Gradually reduce learning rate
        warmup_ratio=0.1,  # Start with lower learning rate for stability
        
        # ===== OPTIMIZATION =====
        max_grad_norm=1.0,  # Clip gradients to prevent explosions
        adam_beta1=0.9,  # Momentum parameter
        adam_beta2=0.95,  # Better for RL than default 0.999
        weight_decay=0.1,  # L2 regularization to prevent overfitting
        
        # ===== MONITORING =====
        logging_steps=config.logging_steps,
        eval_steps=config.eval_steps,
        save_steps=config.save_steps,
        save_total_limit=config.save_total_limit,
        
        # ===== TECHNICAL SETTINGS =====
        seed=config.seed,
        bf16=torch.cuda.is_available(),  # Use bfloat16 on GPU
        fp16=False,  # Don't use float16 (less stable)
        remove_unused_columns=False,
        push_to_hub=False,  # Don't upload to Hugging Face
        report_to=[],  # Disable external logging
    )
    
    print("✅ GRPO configuration created")
    return GRPO_config, prompt2ans

# Create the configuration
GRPO_config, prompt2ans = create_grpo_trainer(config, model, tokenizer, train_dataset, eval_dataset, reward_model, test_dataset)
print("\nConfiguration ready for training!")

🔧 Creating GRPO trainer...

Built answer lookup with 5,978 entries
✅ GRPO configuration created

Configuration ready for training!
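
The configuration combines `lr_scheduler_type="cosine"` with `warmup_ratio=0.1`. The sketch below shows the general shape of such a schedule: a linear ramp over the first 10% of steps, then a cosine decay to zero. It is an illustration of the idea, not the library's own scheduler code, and the base learning rate of `1e-6` is a placeholder because `config.learning_rate` is not shown in this section.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.1):
    """Linear warmup to base_lr, followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 465  # total training steps computed earlier
BASE_LR = 1e-6     # placeholder value, not taken from config

for step in (0, 47, 256, 465):
    print(f"step {step:3d}: lr = {lr_at_step(step, TOTAL_STEPS, BASE_LR):.2e}")
```

With 465 steps the warmup covers 47 steps, the learning rate peaks at step 47, falls to half its peak at step 256, and reaches zero at the final step.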


### Reward computation function defined with detailed logging

This function will:
  1. Score each generated answer with detailed logs
  2. Show question, response, and reward for each generation
  3. Compare answers within groups
  4. Return relative rewards for training
  5. Log group statistics for debugging

In this function you can find the implementation of both the `mean_reward` and `normalized_rewards`, which you saw in the video.
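
Written out, the two quantities are as follows, where $G$ is the number of generations for one prompt and $r_i$ is the raw reward of the $i$-th generation (the entries of `group_rewards`). The symbols $\bar{r}$ and $\tilde{r}_i$ are notation chosen here for `mean_reward` and `normalized_rewards`.

```latex
\bar{r} = \frac{1}{G} \sum_{j=1}^{G} r_j,
\qquad
\tilde{r}_i = r_i - \bar{r}, \quad i = 1, \dots, G
```

This function only mean-centers the rewards. The advantage defined in the GRPO paper also divides by the group's standard deviation, $\hat{A}_i = (r_i - \bar{r}) / \operatorname{std}(r_1, \dots, r_G)$.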

In [31]:
# ============================================
# Define the Reward Computation Function
# ============================================

def compute_rewards(prompts: List[str], completions: List[str], **kwargs):
    """
    Compute rewards for generated completions with detailed logging.
    
    This is the heart of GRPO - it scores each generated answer
    and normalizes within groups for relative comparison.
    
    Process:
    1. Group completions by their prompt
    2. Score each completion with detailed logging
    3. Normalize scores within each group
    4. Return normalized rewards
    
    Args:
        prompts: List of prompts (with duplicates for each generation)
        completions: List of generated responses
    
    Returns:
        List of normalized rewards
    """
    
    rewards = []
    
    # Log batch information
    logger.info(f"\n{'='*80}")
    logger.info(f"GRPO REWARD COMPUTATION")
    logger.info(f"Total completions: {len(completions)}")
    logger.info(f"Total prompts: {len(prompts)}")
    logger.info(f"Expected generations per prompt: {config.num_generations}")
    logger.info(f"{'='*80}")
    
    # Get unique prompts (remove duplicates)
    seen = set()
    unique_prompts = []
    for p in prompts:
        if p not in seen:
            unique_prompts.append(p)
            seen.add(p)
    
    num_unique_prompts = len(unique_prompts)
    logger.info(f"Calculated unique prompts: {num_unique_prompts}")
    
    # Debug: Print samples
    for i in range(min(4, len(completions))):
        logger.info(f"Completion {i+1}: {completions[i][:80]}...")
    
    # Process each unique prompt and its completions
    for i, unique_prompt in enumerate(unique_prompts):
        # Find all completions for this prompt
        prompt_indices = [idx for idx, p in enumerate(prompts) if p == unique_prompt]
        group_completions = [completions[idx] for idx in prompt_indices]
        
        # Get correct answer from your lookup dictionary
        correct_answer, question = prompt2ans.get(unique_prompt, (0.0, "Unknown question"))
        
        logger.info(f"\n--- Unique Prompt {i+1}/{num_unique_prompts} ---")
        logger.info(f"Question: {question[:150]}...")
        logger.info(f"Expected Answer: {correct_answer}")
        logger.info(f"Number of generations for this prompt: {len(group_completions)}")
        
        # Compute reward for each completion with detailed logging
        group_rewards = []
        for j, completion in enumerate(group_completions):
            logger.info(f"\n  Generation {j+1}/{len(group_completions)}:")
            reward = reward_model.compute_reward(completion, correct_answer, question)
            group_rewards.append(reward)
        
        # Normalize rewards within the group (mean-centering)
        # This makes rewards relative: positive = better than average, negative = worse
        if len(group_rewards) > 1:
            mean_reward = sum(group_rewards) / len(group_rewards)
            # Mean-center only - no standard deviation scaling
            normalized_rewards = [r - mean_reward for r in group_rewards]
        else:
            # Single generation - no comparison possible
            normalized_rewards = [0.0]
        
        # Add normalized rewards in correct order
        for idx, norm_reward in zip(prompt_indices, normalized_rewards):
            if len(rewards) <= idx:
                rewards.extend([0] * (idx - len(rewards) + 1))
            rewards[idx] = norm_reward
        
        # Log group statistics matching your format
        avg_reward = sum(group_rewards) / len(group_rewards)
        max_reward = max(group_rewards)
        min_reward = min(group_rewards)
        avg_norm = sum(normalized_rewards) / len(normalized_rewards)
        logger.info(f"\n  Group Summary: Avg={avg_reward:.3f}, Max={max_reward:.3f}, Min={min_reward:.3f}")
        logger.info(f"  Normalized: Avg={avg_norm:.3f}, Rewards={[f'{r:.3f}' for r in normalized_rewards]}")
    
    overall_avg = sum(rewards) / len(rewards) if rewards else 0
    logger.info(f"\n{'='*80}")
    logger.info(f"BATCH SUMMARY: Overall Average Reward = {overall_avg:.3f}")
    logger.info(f"{'='*80}\n")
    
    return rewards

In [32]:
# ============================================
# Create the GRPO Trainer
# ============================================

print("Creating the GRPO trainer...\n")

# Initialize the GRPO trainer with all the components
trainer = GRPOTrainer(
    model=model,  # The language model to train
    reward_funcs=compute_rewards,  # Function to compute rewards
    args=GRPO_config,  # Training configuration
    train_dataset=train_dataset,  # Training data
    eval_dataset=eval_dataset,  # Validation data
    processing_class=tokenizer,  # Tokenizer
)

# Add the evaluation callback
trainer.add_callback(eval_callback)
print("✅ Added evaluation callback for progress tracking")

print("\n✅ GRPO Trainer created successfully!")
print("\nTraining Summary:")
print(f"  Model: DeepSeek Math 7B")
print(f"  Dataset: GSM8K math problems")
print(f"  Training examples: {len(train_dataset):,}")
print(f"  Generations per prompt: {config.num_generations}")
print(f"  Effective batch size: {config.per_device_train_batch_size * config.gradient_accumulation_steps}")
print(f"  Total training steps: ~{len(train_dataset) // (config.per_device_train_batch_size * config.gradient_accumulation_steps) * config.num_train_epochs}")
print(f"\nThe model will now learn to solve math problems better!")

Creating the GRPO trainer...

✅ Added evaluation callback for progress tracking

✅ GRPO Trainer created successfully!

Training Summary:
  Model: DeepSeek Math 7B
  Dataset: GSM8K math problems
  Training examples: 5,978
  Generations per prompt: 12
  Effective batch size: 64
  Total training steps: ~465

The model will now learn to solve math problems better!


## Train the Model with GRPO! (Ungraded Part) <a id="trainmodelwithgrpo"></a>

### What Happens During Training

During each training step:
1. **Select** a batch of math problems
2. **Generate** 12 different solutions for each
3. **Score** each solution with the reward model
4. **Compare** solutions within each group
5. **Update** the model to prefer better solutions

### Training Time (With GPU)

- Finishing the full training schedule takes dozens of hours. (It is generally not necessary to run it to completion.)
- With proper reward functions, the Eval Acc should show an upward trend within 1 to 2 hours of training.
- With proper reward functions, the Eval Acc can exceed 40% within 10 to 15 hours of training.

### What to Watch

- **Loss**: Should decrease (model is learning)
- **Accuracy**: Should increase (getting more problems right)
- **Rewards**: Should increase (generating better solutions)

### Important Notes

- Training will save checkpoints periodically
- You can resume if training is interrupted
- Evaluation results are reported every 20 steps
- Analyze the logs in the `grpo_logs` folder to improve the reward function
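
One way to start the log analysis is to pull out the `Group Summary` lines that `compute_rewards` writes and count how many groups had identical rewards for every generation. Such groups produce all-zero normalized rewards and teach the model nothing. The sketch below runs on a made-up excerpt in the same line format; the helper names are hypothetical.

```python
import re

# Matches the "Group Summary" line format written by compute_rewards.
SUMMARY_RE = re.compile(
    r"Group Summary: Avg=(-?\d+\.\d+), Max=(-?\d+\.\d+), Min=(-?\d+\.\d+)"
)


def parse_group_summaries(log_text):
    """Return one (avg, max, min) tuple per group summary found in the text."""
    return [tuple(float(v) for v in m.groups()) for m in SUMMARY_RE.finditer(log_text)]


def flat_group_fraction(summaries):
    """Fraction of groups whose generations all received the same reward."""
    if not summaries:
        return 0.0
    flat = sum(1 for _, highest, lowest in summaries if highest == lowest)
    return flat / len(summaries)


# Made-up log excerpt in the same format, for illustration only.
sample_log = """
  Group Summary: Avg=0.100, Max=0.100, Min=0.100
  Group Summary: Avg=0.575, Max=1.150, Min=0.100
  Group Summary: Avg=1.000, Max=1.000, Min=1.000
  Group Summary: Avg=0.300, Max=0.400, Min=0.200
"""

summaries = parse_group_summaries(sample_log)
print(f"Groups parsed: {len(summaries)}")
print(f"Groups with no learning signal: {flat_group_fraction(summaries):.0%}")
# Groups parsed: 4
# Groups with no learning signal: 50%
```

To run this on real logs, read the text of the file that `log_file` points to and pass it to `parse_group_summaries`. A high fraction of flat groups suggests the reward function is too coarse to separate the model's answers.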

In [33]:
# ============================================
# Start GRPO Training!
# ============================================

print("You have successfully built the trainer. Next, you can use it to train GRPO. However, since the training process takes over 10 hours, it will not be conducted in this lab. You can run it on other GPU resources.")


You have successfully built the trainer. Next, you can use it to train GRPO. However, since the training process takes over 10 hours, it will not be conducted in this lab. You can run it on other GPU resources.


## Summary <a id="summary"></a>

### Congratulations!

You've successfully completed the GRPO training lab! You've learned how to use reinforcement learning to improve a language model's ability to solve math problems.

### What You've Learned

1. **GRPO Fundamentals**
   - How to generate multiple responses per prompt
   - Why relative comparison works better than absolute rewards
   - How group normalization stabilizes training

2. **Practical Implementation**
   - Setting up a reward model for math problems
   - Configuring GRPO training parameters
   - Monitoring training progress
   - Evaluating model improvements

3. **Key Insights**
   - More generations per prompt = better comparison
   - Partial credit rewards help learning
   - Evaluation on held-out data is crucial

### Next Steps

To further improve your results:

1. **Adjust Temperature**: Experiment with values between 0.5-1.0
2. **Increase Generations**: Try 16 or 20 generations per prompt
3. **Fine-tune Rewards**: Adjust the partial credit system
4. **Try Other Datasets**: Apply GRPO to different tasks
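
To build intuition for the temperature setting, the sketch below applies a temperature-scaled softmax to made-up scores for three candidate tokens. Lower temperatures concentrate probability on the top token, which makes the sampled answers in a group more alike; higher temperatures spread probability out and make the group more diverse. The function name and the logits are illustrative only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (0.5, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

GRPO depends on variation within each group, so a temperature that is too low can leave every generation with the same reward, while one that is too high can degrade answer quality.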


### Additional Resources

- **TRL Documentation**: [https://huggingface.co/docs/trl](https://huggingface.co/docs/trl)
- **GRPO Paper**: [arXiv:2402.03300](https://arxiv.org/abs/2402.03300)
- **GSM8K Dataset**: [OpenAI GitHub](https://github.com/openai/grade-school-math)
- **DeepSeek Models**: [Hugging Face](https://huggingface.co/deepseek-ai)

### Achievement Unlocked!

You've successfully:
- ✅ Built the reward function and training pipeline for GRPO
- ✅ Loaded and configured a 7B parameter model
- ✅ Set up evaluation of math problem-solving accuracy
- ✅ Learned modern RL techniques for LLMs

### Final Thoughts

GRPO is just one of many post-training techniques. The same principles you've learned here apply to:
- **PPO** (Proximal Policy Optimization)
- **DPO** (Direct Preference Optimization)
- **RLHF** (Reinforcement Learning from Human Feedback)

The future of AI involves not just pre-training large models, but also fine-tuning them for specific tasks using techniques like GRPO. You're now equipped with the knowledge to be part of that future!

### Thank You!

Thank you for completing this lab. We hope you found it educational and enjoyable. Keep experimenting, keep learning, and keep pushing the boundaries of what's possible with AI!
