# Graded Lab: Constitutional AI for Mathematical Reasoning

Welcome to this assignment!

Carefully read each Markdown (text) cell, which include instructions and hints. Start by reading the background behind your upcoming tasks.

When you are done, submit your solution by saving it, then clicking on the submit button at the top right side of the page.

## In order for your submission to be graded correctly, you **MUST**:
* **Use the provided variable names**, otherwise the autograder will not be able to locate the variable for grading. 

* **Replace any instances of `None` with your own code.** 

* **Only modify the cells that start with the comment `# GRADED CELL`**.  

* **Use the provided cells for your solution.** You can add new cells to experiment, but these will be omitted when grading. 

To submit your solution, save it, then click on the blue submit button at the top of the page.

<div style="background-color: #FAD888; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
<strong>Important notes</strong>:

- Code blocks with None will not run properly. If you run them before completing the exercise, you will likely get an error. 

- The notebooks work best in Chrome browser. If you are having problems, please switch to Chrome.

- Make sure you always save before submitting.
</div>

## Introduction

In this lab, you'll build a pipeline that generates multiple solution approaches to math problems and then evaluates them using constitutional principles. This assignment will deepen your understanding of template engineering, solution diversity, and constitution alignment techniques.

## Objectives

You will implement a comprehensive constitutional AI system that generates diverse mathematical solutions and evaluates them against quality principles. Through hands-on exercises, you'll learn to engineer effective templates, create solution diversity, and apply constitutional principles to rank mathematical reasoning approaches. You will:

* **Create Solution Templates:** Design prompt templates for generating Chain-of-Thought and verification-based solutions that encourage clear mathematical reasoning and answer checking.
* **Generate Alternative Method Solutions:** Build dynamic prompts that analyze existing solutions and generate genuinely different mathematical approaches using varied problem-solving strategies.
* **Implement Accuracy Assessment:** Create a robust numerical answer extraction system that compares model solutions against ground truth values with appropriate tolerance handling.
* **Evaluate Solution Completeness:** Develop comprehensive scoring systems that assess intermediate steps, calculations, and explanatory reasoning in mathematical solutions using pattern recognition.
* **Assess Verification Quality:** Use LLM-as-judge techniques to evaluate whether solutions include proper answer verification and sanity checking procedures.
* **Measure Solution Novelty:** Implement comparative analysis to determine whether alternative solutions use genuinely different mathematical approaches rather than mere rewordings of existing methods.

## Table of Contents

* [Setup](#setup)
* [Template Engineering](#templateengineering) - Exercise 1
* [Prompt Formatting](#promptformatting)
* [Generate Chain-of-Thought Solutions](#generatecot)
* [Generate Alternative Method Solutions](#alternative) - Exercise 2
* [Constitutional Principles Foundation](#constitutional) - Exercise 3
* [Assessments](#assessments) - Exercise 4, 5, 6

## Setup <a id="setup"></a>

As usual, start by importing the necessary packages.

In [1]:
import json
import re
from typing import List, Dict, Optional

import torch
from tqdm import tqdm

# Import utility functions
from utils import setup_model_and_tokenizer, load_gsm8k_dataset, save_results, display_evaluation_results

In [2]:
# Load dataset and examine sample
dataset = load_gsm8k_dataset()
sample = dataset[0]
print("\nSample Problem:")
print(f"Question: {sample['question']}")
print(f"Ground Truth Answer: {sample['ground_truth_answer']}")
print(f"\nDataset contains {len(dataset)} problems")


Sample Problem:
Question: Maddox and Theo both bought 3 Polaroid Cameras, each sold at $20 per camera from Amazon, and decided to sell them on eBay. Maddox sold his cameras at $28 each, while Theo sold his cameras at $23 each. How much more profit did Maddox get from the sale of his cameras than Theo?
Ground Truth Answer: 15.0

Dataset contains 50 problems


You are using **Meta-Llama-3.2-8B-Instruct**, a powerful language model fine-tuned for following instructions and generating high-quality text.

In [3]:
# Load model using utility function
model, tokenizer = setup_model_and_tokenizer()

Device: cuda
GPU: AMD Instinct MI300X VF
GPU Memory: 191.7 GB

Loading /app/models/llama-3.2-8b...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Model loaded successfully in 7.8 seconds!


## Template Engineering <a id="templateengineering"></a>

Template Engineering is the foundation of generating diverse, high-quality mathematical solutions. This part focuses on creating three distinct solution approaches and implementing efficient batch processing for scalable generation. The Templates class manages different solution approaches. You start with two base templates, the third approach (alternative) will be created dynamically later using the CoT solutions as context.

### Exercise 1: Create Solution Templates

Your task here is to write a template to generate 
1. A Chain-Of-Thought solution  
2. A solution that includes a verification step 

**Note**: You only define two templates here because the alternative approach needs to see the CoT solution first to ensure it uses a genuinely different method.

In [4]:
# GRADED CELL: exercise 1

class Templates:
    """Manages different solution templates and batched generation"""
    
    def __init__(self, model=None, tokenizer=None):
        self.model = model
        self.tokenizer = tokenizer

        ### START CODE HERE ###
        # Complete the COT and verification templates to encourage different mathematical approaches
        # COT: Solve problem step by step, Verification: Verify that the answer makes sense
        
        # IMPORTANT: Include {problem} in your template where the math problem should be inserted
        # This placeholder will be replaced with the actual problem text using .format()
        self.templates = { 
            "cot": """Solve the given problem step by step showing clear reasoning\n problem: {problem}""",
            "verification": """Verify that the answer makes sense\n...problem: {problem}"""
        } 
        ### END CODE HERE ###

# Initialize templates
templates = Templates()

## Prompt Formatting <a id="promptformatting"></a>

LLaMA models require specific chat formatting. This function ensures prompts are properly structured for the model with a reliable fallback method.

In [5]:
def format_prompts_for_batch(tokenizer, prompts):
    """Format multiple prompts for batch processing"""
    formatted_prompts = []
    for p in prompts:
        try:
            formatted = tokenizer.apply_chat_template(
                [{"role": "user", "content": p}], 
                tokenize=False, 
                add_generation_prompt=True
            )
        except:
            formatted = f"<s>[INST] {p} [/INST]"
        formatted_prompts.append(formatted)
    return formatted_prompts

To process multiple prompts efficiently, you tokenize them in batches. This function converts the formatted text prompts into token IDs, applies padding for uniform length, and moves tensors to the model's device (GPU/CPU).

In [6]:
def tokenize_batch(tokenizer, formatted_prompts, model):
    """Tokenize batch of prompts and move to device"""
    inputs = tokenizer(
        formatted_prompts, 
        return_tensors="pt", 
        padding=True, 
        truncation=True,
        max_length=512
    )
    
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    
    return input_ids, attention_mask

This function performs the actual text generation using the tokenized inputs. It runs inference on the model with specified generation parameters like temperature and maximum token length, while using memory-efficient processing with torch.no_grad()

**Key Parameters**:
- `max_tokens=600`: Maximum length for generated solutions
- `temperature=0.3`: Controls randomness (lower = more consistent)
- Proper memory management with `torch.no_grad()`

In [7]:
def generate_batch(model, tokenizer, input_ids, attention_mask, max_tokens=600, temperature=0.3):
    """Generate responses for tokenized batch"""
    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    return outputs

After generation, you extract only the newly generated tokens by slicing off the original input length, then decode these tokens back to readable text. This ensures you only return the model's response, not the input prompt.

In [8]:
def decode_batch_outputs(tokenizer, outputs, input_ids):
    """Decode generated outputs to text"""
    batch_results = []
    for j, output in enumerate(outputs):
        actual_input_len = input_ids[j].shape[0]
        gen_tokens = output[actual_input_len:]
        text = tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()
        batch_results.append(text)
    return batch_results

## Generate Chain-of-Thought Solutions  <a id="generatecot"></a>

Step 1 of the dataset generation pipeline: Create detailed, step-by-step reasoning solutions for one problem to test that your prompt works well. These solutions will serve as the baseline and will be used to inform the alternative solutions.

Check the generated COT solution for your prompt. Edit section 2.1, if you are not satisfied with the output. 

In [9]:
def generate_cot_solutions(templates, dataset, num_problems=5):
    """Generate Chain-of-Thought solutions"""
    print(f"Generating CoT responses for {num_problems} problem{'s' if num_problems > 1 else ''}...")
    
    problems = dataset[:num_problems]
    cot_prompts = [templates.templates["cot"].format(problem=p["question"]) for p in problems]
    
    # Process in batches
    cot_responses = []
    batch_size = 2
    
    for i in range(0, len(cot_prompts), batch_size):
        batch_prompts = cot_prompts[i:i+batch_size]
        formatted = format_prompts_for_batch(tokenizer, batch_prompts)
        input_ids, attention_mask = tokenize_batch(tokenizer, formatted, model)
        outputs = generate_batch(model, tokenizer, input_ids, attention_mask)
        batch_results = decode_batch_outputs(tokenizer, outputs, input_ids)
        cot_responses.extend(batch_results)
    
    return cot_prompts, cot_responses

# Test CoT template on a single problem
cot_prompts, cot_responses = generate_cot_solutions(templates, dataset, 1)
print(f"\nTesting CoT template:")
print(f"Problem: {dataset[0]['question']}")
print(f"Ground Truth: {dataset[0]['ground_truth_answer']}")
print(f"\nCoT Solution:\n{cot_responses[0]}")

Generating CoT responses for 1 problem...

Testing CoT template:
Problem: Maddox and Theo both bought 3 Polaroid Cameras, each sold at $20 per camera from Amazon, and decided to sell them on eBay. Maddox sold his cameras at $28 each, while Theo sold his cameras at $23 each. How much more profit did Maddox get from the sale of his cameras than Theo?
Ground Truth: 15.0

CoT Solution:
To find out how much more profit Maddox got from the sale of his cameras than Theo, we need to calculate the profit made by each of them and then find the difference.

**Step 1: Calculate the cost price of the cameras**
Maddox and Theo bought 3 cameras each at $20 per camera.
Cost price of 3 cameras = 3 x $20 = $60

**Step 2: Calculate the profit made by Maddox**
Maddox sold his cameras at $28 each.
Selling price of 3 cameras = 3 x $28 = $84
Profit made by Maddox = Selling price - Cost price
= $84 - $60
= $24

**Step 3: Calculate the profit made by Theo**
Theo sold his cameras at $23 each.
Selling price of 3

Check the generated verification solution for your prompt. Edit section 2.1, if you are not satisfied with the output. 

In [10]:
def generate_verification_solutions(templates, dataset, num_problems=5):
    """Generate verification solutions"""
    print(f"Generating verification responses for {num_problems} problem{'s' if num_problems > 1 else ''}...")
    
    problems = dataset[:num_problems]
    verification_prompts = [templates.templates["verification"].format(problem=p["question"]) for p in problems]
    
    # Process in batches
    verification_responses = []
    batch_size = 2
    
    for i in range(0, len(verification_prompts), batch_size):
        batch_prompts = verification_prompts[i:i+batch_size]
        formatted = format_prompts_for_batch(tokenizer, batch_prompts)
        input_ids, attention_mask = tokenize_batch(tokenizer, formatted, model)
        outputs = generate_batch(model, tokenizer, input_ids, attention_mask)
        batch_results = decode_batch_outputs(tokenizer, outputs, input_ids)
        verification_responses.extend(batch_results)
    
    return verification_prompts, verification_responses

# Test verification template on the same problem
verification_prompts, verification_responses = generate_verification_solutions(templates, dataset, 1)
print(f"\nTesting Verification template:")
print(f"Verification Solution:\n{verification_responses[0]}")

Generating verification responses for 1 problem...

Testing Verification template:
Verification Solution:
To find the profit difference, we need to calculate the profit made by each person.

Maddox bought 3 cameras at $20 each, so his total cost is 3 x $20 = $60. 
He sold each camera for $28, so his total revenue is 3 x $28 = $84. 
His profit is $84 - $60 = $24.

Theo bought 3 cameras at $20 each, so his total cost is 3 x $20 = $60. 
He sold each camera for $23, so his total revenue is 3 x $23 = $69. 
His profit is $69 - $60 = $9.

The difference in profit between Maddox and Theo is $24 - $9 = $15.

Therefore, Maddox got $15 more profit from the sale of his cameras than Theo.


## Generate Alternative Method Solutions <a id="alternative"></a>

### Exercise 2: Generate Alternative Method Solutions

Create solutions using different mathematical approaches. Unlike the first two templates, these prompts are created dynamically by showing the model the CoT solution and asking for a different approach.

Complete the alternative prompt template to encourage truly different mathematical approaches. The template should encourage different mathematical approaches like working backwards, using different operations, or visual reasoning.

In [11]:
# GRADED CELL: exercise 2

def generate_alternative_solutions(dataset, cot_responses, num_problems=5):
    """Generate alternative approach solutions"""
    print(f"Generating alternative responses for {num_problems} problem{'s' if num_problems > 1 else ''}...")
    
    ### START CODE HERE ###
    # Create a template that shows the student an existing solution and asks for a different approach
    # Remember: Use {problem} and {cot_response} as placeholders that will be filled in later
    alt_prompt_template = """You are an expert in solving mathematical problems. 
    Solve the given problem step-by-step using an alternative mathematical approach like working backwards or using different operations.
    The solution must be different from the given chain-of-thought response.

    Problem: {problem}

    Response: {cot_response}
    
    """
    ### END CODE HERE ###
    
    problems = dataset[:num_problems]
    alt_prompts = []
    
    for i, problem_data in enumerate(problems):
        alt_prompt = alt_prompt_template.format(
            problem=problem_data["question"], 
            cot_response=cot_responses[i]
        )
        alt_prompts.append(alt_prompt)
    
    # Process in batches
    alt_responses = []
    batch_size = 2
    
    for i in range(0, len(alt_prompts), batch_size):
        batch_prompts = alt_prompts[i:i+batch_size]
        formatted = format_prompts_for_batch(tokenizer, batch_prompts)
        input_ids, attention_mask = tokenize_batch(tokenizer, formatted, model)
        outputs = generate_batch(model, tokenizer, input_ids, attention_mask)
        batch_results = decode_batch_outputs(tokenizer, outputs, input_ids)
        alt_responses.extend(batch_results)
    
    return alt_prompts, alt_responses

# Test alternative template on the same problem
alt_prompts, alt_responses = generate_alternative_solutions(dataset, cot_responses, 1)
print(f"\nTesting Alternative template:")
print(f"Alternative Solution:\n{alt_responses[0]}")

Generating alternative responses for 1 problem...

Testing Alternative template:
Alternative Solution:
To solve this problem using an alternative approach, let's use a different method: finding the profit percentage and then comparing it.

**Step 1: Calculate the profit percentage for each seller**
Maddox sold his cameras at $28 each, which is 140% of the cost price ($20 x 1.4 = $28).
Theo sold his cameras at $23 each, which is 115.5% of the cost price ($20 x 1.155 = $23).

**Step 2: Calculate the profit percentage difference**
The profit percentage difference between Maddox and Theo is 140% - 115.5% = 24.5%.

**Step 3: Calculate the profit difference**
Since the cost price is the same for both ($60), the profit difference is directly proportional to the profit percentage difference.
To find the actual profit difference, multiply the cost price by the profit percentage difference (as a decimal).
Profit difference = $60 x 0.245 = $14.7

However, we can also express the profit difference

This function combines all three solution types into the structured dataset format required for constitutional evaluation.

**Output Format**: Each result contains the original problem, ground truth answer, and all three solutions with their templates, prompts, and responses.

In [12]:
def assemble_solution_dataset(dataset, cot_data, verification_data, alternative_data, num_problems):
    """Combine all solution types into structured dataset"""
    cot_prompts, cot_responses = cot_data
    verification_prompts, verification_responses = verification_data
    alt_prompts, alt_responses = alternative_data
    
    results = []
    problems = dataset[:num_problems]
    
    for i, problem_data in enumerate(problems):
        problem = problem_data["question"]
        ground_truth = str(problem_data["ground_truth_answer"])
        
        solutions = [
            {"template": "cot", "prompt": cot_prompts[i], "response": cot_responses[i]},
            {"template": "verification", "prompt": verification_prompts[i], "response": verification_responses[i]},
            {"template": "alternative", "prompt": alt_prompts[i], "response": alt_responses[i]},
        ]
        
        result = {
            "problem_id": i,
            "question": problem,
            "ground_truth_answer": ground_truth,
            "solutions": solutions
        }
        results.append(result)
    
    return results

Now that you've tested all templates and verified their output quality, generate solutions for the entire dataset. Notice that `NUM_PROBLEMS` is set to 50. Try starting with 5-10 problems for initial testing, then scale up to 50 for full analysis.

It should take no more than 12 minutes to run this cell on all 50 problems.

In [13]:
NUM_PROBLEMS = 50

print(f"Generating complete dataset for {NUM_PROBLEMS} problems...")

# Generate all solution types for the full dataset
print("\n" + "="*50)
print("GENERATING FULL DATASET")
print("="*50)

# Step 1: Generate CoT solutions for full dataset
cot_prompts_full, cot_responses_full = generate_cot_solutions(templates, dataset, NUM_PROBLEMS)

# Step 2: Generate verification solutions for full dataset
verification_prompts_full, verification_responses_full = generate_verification_solutions(templates, dataset, NUM_PROBLEMS)

# Step 3: Generate alternative solutions for full dataset
alt_prompts_full, alt_responses_full = generate_alternative_solutions(dataset, cot_responses_full, NUM_PROBLEMS)

# Step 4: Assemble complete dataset
solution_results = assemble_solution_dataset(
    dataset, 
    (cot_prompts_full, cot_responses_full), 
    (verification_prompts_full, verification_responses_full), 
    (alt_prompts_full, alt_responses_full), 
    NUM_PROBLEMS
)

# Save results
save_results(solution_results, "generated_solutions.json")
print(f"\nGeneration complete! Dataset with {len(solution_results)} problems saved to generated_solutions.json")

# Display sample results for review
for i, result in enumerate(solution_results[:2]):
    print(f"\n{'='*60}")
    print(f"Problem {i+1}: {result['question']}")
    print(f"Ground Truth: {result['ground_truth_answer']}")
    
    for solution in result['solutions']:
        print(f"\n--- {solution['template'].upper()} Solution ---")
        response_preview = solution['response'][:200] + "..." if len(solution['response']) > 200 else solution['response']
        print(response_preview)

print(f"\n{'='*60}")
print("Solution generation complete! Ready for constitutional evaluation.")

Generating complete dataset for 50 problems...

GENERATING FULL DATASET
Generating CoT responses for 50 problems...
Generating verification responses for 50 problems...
Generating alternative responses for 50 problems...
Results saved to generated_solutions.json

Generation complete! Dataset with 50 problems saved to generated_solutions.json

Problem 1: Maddox and Theo both bought 3 Polaroid Cameras, each sold at $20 per camera from Amazon, and decided to sell them on eBay. Maddox sold his cameras at $28 each, while Theo sold his cameras at $23 each. How much more profit did Maddox get from the sale of his cameras than Theo?
Ground Truth: 15.0

--- COT Solution ---
To find out how much more profit Maddox got from the sale of his cameras than Theo, we need to calculate the profit made by each of them and then find the difference.

**Step 1: Calculate the cost pri...

--- VERIFICATION Solution ---
To find out how much profit each person made, we need to calculate the total revenue and su

## Constitutional Principles Foundation <a id="constitutional"></a>

The Constitutional Evaluator applies four key principles to assess solution quality. Each principle targets a specific aspect of mathematical reasoning to encourage.

**The Four Principles**:
- **Accuracy**: All three solutions must be correct
- **Completeness**: All three solutions must show all intermediate calculation steps
- **Verification**: Solution 2 should include verification or sanity checking
- **Novelty**: Solution 3 must use a different approach from solution 1 to solve the problem

In [14]:
class ConstitutionalEvaluator:
    """Constitutional AI system for evaluating mathematical reasoning"""
    
    def __init__(self, model, tokenizer):
        self.principles = {
            "accuracy": "All three solutions must be correct",
            "completeness": "All three solutions must show all intermediate calculation steps", 
            "verification": "Solution 2 should include verification or sanity checking",
            "novelty": "Solution 3 must use a different approach from solution 1 to solve the problem"
        }
        print("Constitutional Principles:")
        for name, description in self.principles.items():
            print(f"  - {name}: {description}")
        
        self.model = model
        self.tokenizer = tokenizer
        print("Using LLM as judge for verification and novelty evaluation")

# Initialize evaluator
evaluator = ConstitutionalEvaluator(model, tokenizer)

Constitutional Principles:
  - accuracy: All three solutions must be correct
  - completeness: All three solutions must show all intermediate calculation steps
  - verification: Solution 2 should include verification or sanity checking
  - novelty: Solution 3 must use a different approach from solution 1 to solve the problem
Using LLM as judge for verification and novelty evaluation


Before you can check accuracy, you need to extract the final numerical answer from each solution. This function uses multiple regex patterns to find answers in various formats.

**Challenge**: Mathematical solutions can express answers in many ways - "The answer is 42", "= 42", "Total: $42", etc.

In [15]:
def extract_numerical_answer(self, solution: str) -> Optional[float]:

    #Extract the final numerical answer from a solution using REGEX
    patterns = [
        r"(?:The answer is|answer is|Therefore|final answer,?)\s*\$?([+-]?\d+(?:,\d{3})*(?:\.\d+)?)",
        r"(?:So,?|Thus,?)\s*.*?\$?([+-]?\d+(?:,\d{3})*(?:\.\d+)?)",
        r"=\s*\$?([+-]?\d+(?:,\d{3})*(?:\.\d+)?)(?:\s|\.|\n|$)",
        r"(?:got|has|have|total|profit|difference|more|needs?)\s*\$?([+-]?\d+(?:,\d{3})*(?:\.\d+)?)\s*(?:more|profit|dollars?|units?|eggs?|oranges?|\.|\n|$)",
    ]
    
    solution_lines = solution.strip().split('\n')
    
    # Check last few lines first
    for line in reversed(solution_lines[-3:]):
        for pattern in patterns:
            matches = re.findall(pattern, line, re.IGNORECASE)
            if matches:
                try:
                    return float(matches[-1].replace(',', ''))
                except:
                    continue
    
    # Fallback: get last number
    all_numbers = re.findall(r'([+-]?\d+(?:,\d{3})*(?:\.\d+)?)', solution)
    if all_numbers:
        try:
            return float(all_numbers[-1].replace(',', ''))
        except:
            pass
    
    return None

# Add method to class
ConstitutionalEvaluator.extract_numerical_answer = extract_numerical_answer

# Test extraction
test_solution = "Step 1: 5 + 3 = 8. Step 2: 8 * 2 = 16. Therefore, the answer is 16."
extracted = evaluator.extract_numerical_answer(test_solution)
print(f"Test extraction: '{test_solution}' -> {extracted}")

Test extraction: 'Step 1: 5 + 3 = 8. Step 2: 8 * 2 = 16. Therefore, the answer is 16.' -> 16.0


### Exercise 3: Implement Accuracy Assessment

This function implements the accuracy principle by comparing extracted answers to ground truth values. It uses a small tolerance (0.01) to handle floating-point precision issues.

Binary scoring (1.0 for correct, 0.0 for incorrect) ensures accuracy is treated as non-negotiable.

In [24]:
# GRADED CELL: exercise 3

def check_accuracy(self, problem: str, solution: str, ground_truth: float) -> float:
    ### START CODE HERE ###
    # Use extract_numerical_answer function defined above to extract the numerical answer from the solution
    extracted_answer = evaluator.extract_numerical_answer(solution)

    # Return 0.0 if extraction failed (if extracted answer is None)
    if not extracted_answer:
        return 0.0
    
    # Return 1.0 if answer matches ground truth (within 0.01 tolerance), else 0.0
    if abs(extracted_answer - ground_truth) < 0.01:
        return 1.0
    else:
        return 0.0

    ### END CODE HERE ###

# Add method to class
ConstitutionalEvaluator.check_accuracy = check_accuracy

# Test accuracy checking
test_ground_truth = 16.0
accuracy_score = evaluator.check_accuracy("test problem", test_solution, test_ground_truth)
print(f"Accuracy test: extracted={extracted}, ground_truth={test_ground_truth}, score={accuracy_score}")

Accuracy test: extracted=16.0, ground_truth=16.0, score=1.0


## Assessments <a id="assessments"></a>

You will investigate the solutions that the model produced and assess them for quality.

### Exercise 4: Evaluate Solution Completeness

This function evaluates whether solutions show adequate intermediate steps and reasoning. It looks for step indicators, calculations, and explanatory text.

Implement a function that takes the solution as input and checks if intermediate steps are shown using REGEX or LLM. Assign credits for different aspects of completeness. Minimum score is 0 and maximum score is 1. 

In [19]:
# GRADED CELL: exercise 4

def check_completeness(self, problem: str, solution: str) -> float:
    
    score = 0.0

    ### START CODE HERE ###
    # Convert solution to lowercase for easier matching
    solution_lower = solution.lower()
    
    # Check for step indicators (e.g., "step 1", "first", "then", "calculate")
    # Award up to 0.4 points based on how many step indicators are found
    step_indicators = [ 
        r'step \d+', r'first', r'second', r'third', r'next', r'then', r'finally', 
        r'calculate', r'find', r'determine', r'multiply', r'divide', r'add', r'subtract' 
    ] 
    
    # Count occurrences of step indicators and add to score
    # Hint: Use re.findall() for each indicator
    step_count = sum(len(re.findall(indicator, solution_lower)) for indicator in step_indicators)
    
    # If there are 5 or more steps, award 0.4 points.
    if step_count >= 5:
        score += 0.4
    # If there are 3 or more steps, award 0.3 points.
    elif step_count >= 3:
        score += 0.3
    # If there are 1 or more steps, award 0.2 points.
    elif step_count >= 1:
        score += 0.2
    
    # Check for intermediate calculations (e.g., "5 * 3 = None
    # Award up to 0.3 points based on number of calculations shown
    calculations = re.findall(r'\d+(?:\.\d+)?\s*[+\-*/รรท]\s*\d+(?:\.\d+)?\s*=\s*\d+(?:\.\d+)?', solution) 
    # If there are 3 or more calculations, award 0.3 points.
    if len(calculations) >= 3:
        score += 0.3
    # If there are 1 or more calculations, award 0.2 points.
    elif len(calculations) >= 1:
        score += 0.2
    
    
    # Check for explanatory phrases (e.g., "because", "therefore", "this means")
    # Award up to 0.3 points based on number of explanations
    explanation_phrases = ['because', 'since', 'so', 'therefore', 'this means', 'we need to'] 
    # Count how many explanation phrases appear in the solution
    explanation_count = sum(1 for phrase in explanation_phrases if phrase in solution_lower) 
    # If there are 3 or more explanations, award 0.3 points.
    if explanation_count >= 3:
        score += 0.3
    # If there are 1 or more explanations, award 0.2 points.
    elif explanation_count >= 1:
        score += 0.2
    ### END CODE HERE ###

    # Return final score (capped at 1.0)
    return min(score, 1.0)

# Add method to class
ConstitutionalEvaluator.check_completeness = check_completeness

# Test completeness
completeness_score = evaluator.check_completeness("test", test_solution)
print(f"Completeness test: score={completeness_score}")

Completeness test: score=0.6000000000000001


For complex judgments like verification quality and solution novelty, you can use the LLM itself as a judge. This helper function handles the LLM evaluation process.

**Key Features**: 
- Low temperature for consistent scoring
- Short generation for efficiency
- Error handling with fallback scores

In [20]:
def _get_llm_score(self, prompt: str) -> float:
    """Helper to get score from LLM"""
    try:
        messages = [{"role": "user", "content": prompt}]
        formatted_prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        
        inputs = self.tokenizer(formatted_prompt, return_tensors="pt")
        input_ids = inputs.input_ids.to(self.model.device)
            
        with torch.no_grad():
            outputs = self.model.generate(
                input_ids,
                max_new_tokens=20,
                temperature=0.1,
                do_sample=True,
                pad_token_id=self.tokenizer.pad_token_id or self.tokenizer.eos_token_id
            )
        
        input_length = input_ids.shape[1]
        generated_tokens = outputs[0][input_length:]
        response = self.tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
        
        numbers = re.findall(r'(\d*\.?\d+)', response)
        if numbers:
            score = float(numbers[0])
            return min(max(score, 0.0), 1.0)
        else:
            return 0.5
            
    except Exception as e:
        print(f"LLM evaluation failed: {e}")
        return 0.5

# Add method to class
ConstitutionalEvaluator._get_llm_score = _get_llm_score

### Exercise 5: Assess Verification Quality

This function uses the LLM to evaluate whether verification solutions include proper answer checking. Human evaluation of verification quality is difficult, so you can delegate this to the LLM.

Complete the function below to rate a solution's verification from 0 to 1. Write a prompt and guide the LLM to make consistent verification assessments.

In [21]:
# GRADED CELL: exercise 5

def check_verification(self, problem: str, solution: str) -> float:
    ### START CODE HERE ###
    # Use LLM to check if solution includes proper verification
    # Make sure to pass the {problem} and the {solution} to the prompt.
    prompt = """You are an expert in assessment. Please assess the quality of verification solution for the given problem.
    Problem:
    {problem}

    Solution:
    {solution}
    """
    ### END CODE HERE ###

    return self._get_llm_score(prompt) # @REPLACE return None

# Add method to class
ConstitutionalEvaluator.check_verification = check_verification
print("Verification assessment function defined!")

Verification assessment function defined!


### Exercise 6: Measure Solution Novelty
 
This function evaluates whether alternative solutions use genuinely different approaches compared to the CoT solutions. This is crucial for ensuring solution diversity.

Complete the novelty assessment prompt to help the LLM distinguish between truly different approaches and mere rewordings.

In [22]:
# GRADED CELL: exercise 6

def check_novelty(self, cot_solution: str, alternative_solution: str) -> float:
    """Use LLM to check if alternative uses genuinely different approach"""
    
    ### START CODE HERE ###
    # Complete this prompt for novelty assessment
    # Make sure to pass both solutions to the prompt, so the model can compare them.
    prompt = """You are an expert evaluator. 
    Evaluate whether the given alternative solution uses genuinely different approach compared to the given CoT solution and not mere rewordings.
    Assign a score based on novelty and diversity of the alternate solution.
    
    Alternative solution:
    {alternative_solution}

    CoT solution:
    {cot_solution}
    
    """
    ### END CODE HERE ###

    return self._get_llm_score(prompt) # @REPLACE return None

# Add method to class
ConstitutionalEvaluator.check_novelty = check_novelty

This is the main evaluation function that applies all constitutional principles to rank the three solutions. It uses different scoring weights for different solution types.

**Scoring Strategy**:
- **CoT**: Accuracy (60%) + Completeness (40%)
- **Verification**: Accuracy (50%) + Completeness (30%) + Verification (20%)
- **Alternative**: Accuracy (50%) + Completeness (30%) + Novelty (20%)

It should take no more than 60 seconds to run this cell.


In [25]:
def evaluate_solutions(self, problem: str, ground_truth: float, solutions: List[Dict]) -> Dict:
    """Evaluate all three solutions and return ranked results"""
    evaluated_solutions = []
    
    for solution in solutions:
        template = solution['template']
        response = solution['response']
        
        accuracy = self.check_accuracy(problem, response, ground_truth)
        completeness = self.check_completeness(problem, response)
        
        if template == 'verification':
            verification = self.check_verification(problem, response)
            scores = {
                'accuracy': round(accuracy, 3),
                'completeness': round(completeness, 3),
                'verification': round(verification, 3)
            }
            composite_score = 0.5 * accuracy + 0.3 * completeness + 0.2 * verification
        
        elif template == 'alternative':
            cot_solution = next((s['response'] for s in solutions if s['template'] == 'cot'), "")
            novelty = self.check_novelty(cot_solution, response)
            scores = {
                'accuracy': round(accuracy, 3),
                'completeness': round(completeness, 3),
                'novelty': round(novelty, 3)
            }
            composite_score = 0.5 * accuracy + 0.3 * completeness + 0.2 * novelty
        
        else:  # CoT template
            scores = {
                'accuracy': round(accuracy, 3),
                'completeness': round(completeness, 3)
            }
            composite_score = 0.6 * accuracy + 0.4 * completeness
        
        evaluated_solutions.append({
            'template': template,
            'scores': scores,
            'composite_score': round(composite_score, 3)
        })
    
    evaluated_solutions.sort(key=lambda x: x['composite_score'], reverse=True)
    
    result = {
        'problem': problem,
        'ground_truth': ground_truth,
        'solutions': []
    }
    
    for i, solution in enumerate(evaluated_solutions):
        solution['rank'] = i + 1
        result['solutions'].append(solution)
    
    return result

# Add method to class
ConstitutionalEvaluator.evaluate_solutions = evaluate_solutions

# Test evaluation on generated solutions
print("Testing constitutional evaluation...")
evaluation_results = []

for i, result in enumerate(solution_results):
    print(f"\nEvaluating Problem {i+1}...")
    
    evaluation = evaluator.evaluate_solutions(
        result['question'],
        float(result['ground_truth_answer']),
        result['solutions']
    )
    evaluation_results.append(evaluation)
    
    # Display results for first problem
    if i == 0:
        print(f"Problem: {result['question'][:50]}...")
        print(f"Ground Truth: {result['ground_truth_answer']}")
        for solution in evaluation['solutions']:
            print(f"  Rank {solution['rank']}: {solution['template'].upper()}")
            print(f"    Composite Score: {solution['composite_score']}")
            print(f"    Individual Scores: {solution['scores']}")

# Save evaluation results
save_results(evaluation_results, "evaluation_results.json")
print(f"\nEvaluation complete! Results saved to evaluation_results.json")
df = display_evaluation_results("evaluation_results.json", num_rows=20)

Testing constitutional evaluation...

Evaluating Problem 1...
Problem: Maddox and Theo both bought 3 Polaroid Cameras, ea...
Ground Truth: 15.0
  Rank 1: COT
    Composite Score: 0.88
    Individual Scores: {'accuracy': 1.0, 'completeness': 0.7}
  Rank 2: VERIFICATION
    Composite Score: 0.78
    Individual Scores: {'accuracy': 1.0, 'completeness': 0.6, 'verification': 0.5}
  Rank 3: ALTERNATIVE
    Composite Score: 0.31
    Individual Scores: {'accuracy': 0.0, 'completeness': 0.7, 'novelty': 0.5}

Evaluating Problem 2...

Evaluating Problem 3...

Evaluating Problem 4...

Evaluating Problem 5...

Evaluating Problem 6...

Evaluating Problem 7...

Evaluating Problem 8...

Evaluating Problem 9...

Evaluating Problem 10...

Evaluating Problem 11...

Evaluating Problem 12...

Evaluating Problem 13...

Evaluating Problem 14...

Evaluating Problem 15...

Evaluating Problem 16...

Evaluating Problem 17...

Evaluating Problem 18...

Evaluating Problem 19...

Evaluating Problem 20...

Evaluatin

Unnamed: 0,Problem,Ground Truth,Template,Rank,Composite Score,accuracy,completeness,verification,novelty
0,1,15.0,COT,1,0.88,1.0,0.7,,
1,1,15.0,VERIFICATION,2,0.78,1.0,0.6,0.5,
2,1,15.0,ALTERNATIVE,3,0.31,0.0,0.7,,0.5
3,2,9.0,COT,1,0.88,1.0,0.7,,
4,2,9.0,ALTERNATIVE,2,0.87,1.0,0.9,,0.5
5,2,9.0,VERIFICATION,3,0.81,1.0,0.7,0.5,
6,3,90.0,COT,1,0.96,1.0,0.9,,
7,3,90.0,VERIFICATION,2,0.84,1.0,0.8,0.5,
8,3,90.0,ALTERNATIVE,3,0.81,1.0,0.7,,0.5
9,4,96.0,COT,1,0.96,1.0,0.9,,
