Chain-of-Thought Reasoning Evaluation on GSM8K Dataset
===

**Created by Dr Chao Shu (chao.shu@qmul.ac.uk)**

Submitted By

**Name**:

**QMUL ID**:

**BUPT ID**:

# Introduction
---

Chain-of-Thought (CoT) is the fundamental technique behind cutting-edge reasoning Large Language Models (LLMs). This notebook guides you through implementing and evaluating different CoT reasoning approaches on the GSM8K dataset. 

You will implement Zero-shot CoT, Few-shot CoT, and Self-consistency CoT, as well as standard Input-Output prompting as the baseline. You will evaluate their effectiveness by solving math word problems from the GSM8K dataset.

## Prerequisite: Ollama and Qwen2 1.5B Model

Before you start this project, you should already have Ollama and the Qwen2 1.5B model installed and be able to run the "qwen2:1.5b" model locally, since we have finished L01 and you should be able to run all of its examples.

If you haven't done so, this guide will help you set up Ollama and download the qwen2:1.5b model required for this notebook.

**1. Install Ollama**

Please go to the [Ollama GitHub repository](https://github.com/ollama/ollama) or the [Ollama website](https://ollama.com/) and follow the instructions to install Ollama.

**2. Download the qwen2:1.5b Model**

After installing Ollama, open a terminal (Command Prompt on Windows) and run:

```bash
ollama pull qwen2:1.5b
```

This will download the model, which may take some time depending on your internet connection.

**3. Verify Installation**

To verify that everything is working correctly, run:

```bash
ollama list
```

You should see `qwen2:1.5b` in the list of available models.
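The same check can also be done from Python. The sketch below assumes the Ollama server is listening on its default local address (`http://localhost:11434`) and that its `/api/tags` endpoint returns a JSON object with a `models` list; the helper names `model_is_listed` and `check_local_model` are hypothetical and not part of this notebook.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_is_listed(tags_payload: dict, model_name: str) -> bool:
    """Return True if `model_name` appears in an /api/tags style payload."""
    return any(m.get("name") == model_name for m in tags_payload.get("models", []))


def check_local_model(model_name: str = "qwen2:1.5b", url: str = OLLAMA_TAGS_URL) -> bool:
    """Ask the local Ollama server for its models; return False if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            payload = json.load(resp)
    except (OSError, ValueError) as err:
        # Connection problems and malformed JSON both end up here
        print(f"Could not query Ollama at {url}: {err}")
        return False
    return model_is_listed(payload, model_name)
```

Calling `check_local_model()` in a notebook cell should return `True` once the model has been pulled and the Ollama server is running.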

**4. Troubleshooting**

- If you encounter permission issues on Linux or macOS, try running commands with `sudo`
- If the model download fails, check your internet connection and try again
- For more help, visit the [Ollama GitHub repository](https://github.com/ollama/ollama)

# PART I: Implement Functions for the Evaluation
---

In this part, you will build the foundation for evaluating different reasoning approaches by implementing key functions that will:

1. **Set up the model**: Configure and prepare the language model for evaluation
2. **Process the dataset**: Load and parse the GSM8K math problem dataset
3. **Implement CoT prompting strategies**: Create prompt templates for different reasoning approaches
4. **Extract and evaluate answers**: Parse model outputs and compare with ground truth
5. **Implement Self-Consistency CoT**: Generate multiple reasoning paths and use majority voting

For each key function, you will complete a practical implementation and then test it with simplified examples to ensure everything works correctly. This careful testing is crucial, as any errors in these foundational functions would affect your final evaluation results. Please also carefully check the reasoning output from the LLM to get familiar with its reasoning behaviours and to make sure the required functions are implemented correctly.

To get started, let's import necessary libraries.

In [1]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from datasets import load_dataset
from pathlib import Path
import json
from typing import List, Dict, Any, Tuple, Optional
from collections import Counter
from IPython.display import Markdown, display
from datetime import datetime

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

## Function to Set up an LLM Model


First, let's implement a function to set up an LLM that is used to evaluate different CoT prompting strategies.

In [2]:
def setup_model(model_name="qwen2:1.5b", temperature=0.7, top_p=0.95):
    """Set up the LLM using LangChain"""
    
    # Ollama platform
    print(f"Setting up Ollama model: {model_name}, temperature: {temperature}, top_p: {top_p}")
    return ChatOllama(
        model=model_name,
        temperature=temperature,
        top_p=top_p,
        num_predict=512
    )

### Test the function

**[Q1.1]** Test the function following the TODO instructions. Fix any errors you find. **[5 Marks]**

In [None]:
## TODO: Setup an LLM using Qwen2 1.5B model with default parameters, temperature=0.7 and top_p=0.95. 
# Save the model in a variable called `test_llm`.


## Functions for GSM8K Dataset

The GSM8K dataset consists of high-quality grade school math word problems. Let's implement functions to load the GSM8K dataset and extract numerical answers from the provided ground-truth solutions.
- By default, the function will create a directory called "*data*" in the same directory as this notebook to save the dataset downloaded from Hugging Face.
- The numerical answer to each question follows the "####" marker in the "answer" text. You can see the format of the dataset after you run the test code at the end of this section.
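To make the answer format concrete, here is a minimal sketch on a made-up record laid out like a GSM8K example: a worked solution followed by a final line `#### <answer>`. The record and the helper `final_answer` are illustrative assumptions, not taken from the dataset or from this notebook's functions.

```python
import re

# Hypothetical record in the GSM8K layout: worked solution, then "#### <answer>"
sample = {
    "question": "Tom has 3 boxes with 4 pens each. How many pens does he have?",
    "answer": "Each box has 4 pens, so 3 * 4 = <<3*4=12>>12 pens.\n#### 12",
}


def final_answer(answer_text: str) -> str:
    """Return the token after the '####' marker, or '' if the marker is missing."""
    match = re.search(r"####\s*(\S+)", answer_text)
    return match.group(1) if match else ""


extracted = final_answer(sample["answer"])
print(extracted)
```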

In [4]:
def load_gsm8k_dataset(split="test", sample_size=None, random_seed=42, cache_dir="./data"):
    """
    Load the GSM8K dataset from the Hugging Face datasets library
    
    Args:
        split: Which dataset split to load ("train" or "test")
        sample_size: Optional limit on number of examples to load
        random_seed: Seed for random sampling (for reproducibility)
        cache_dir: Directory to cache the downloaded dataset
    
    Returns:
        List of examples from the GSM8K dataset
    """
    try:
        dataset = load_dataset("gsm8k", "main", cache_dir=cache_dir)
        examples = dataset[split]
        
        if sample_size and sample_size < len(examples):
            # Generate random indices for sampling
            np.random.seed(random_seed)
            random_indices = np.random.choice(
                range(len(examples)), 
                size=sample_size, 
                replace=False  # Sample without replacement
            )
            # Use random indices to select examples
            examples = examples.select(random_indices)
            print(f"Randomly selected {sample_size} examples with seed {random_seed}")
        
        return examples
    except Exception as e:
        print(f"Error loading GSM8K dataset: {e}")
        raise

def get_ground_truth_answer(example):
    """
    Extract ground truth answer from a GSM8K example
    
    Args:
        example: A single example from the GSM8K dataset
    
    Returns:
        The ground truth answer as a string
    """
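    # GSM8K answers end with "#### X"; X may contain thousands separators (e.g. "1,000")
    answer_match = re.search(r'####\s*(-?[\d,]*\.?\d+)', example["answer"])
    if answer_match:
        # Drop thousands separators so that the answer parses as a number
        return answer_match.group(1).replace(',', '').strip()
    return ""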

### Test the Functions

**[Q1.2]** Test the functions following the TODO instructions. Fix any errors you find. **[5 Marks]**

We'll randomly select **10** problems from the **"test"** split of the GSM8K dataset, using your student ID as the random seed, and explore the first of the selected problems as a test.
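Seeding the sampler is what makes the selection reproducible: the same seed always yields the same subset. The sketch below illustrates this with the standard-library `random` module as a stand-in for the NumPy-based sampling inside `load_gsm8k_dataset()`; the helper `sample_indices` is hypothetical, and 1319 is used as the size of the GSM8K test split.

```python
import random
from typing import List


def sample_indices(n_items: int, sample_size: int, seed: int) -> List[int]:
    """Draw `sample_size` distinct indices from range(n_items) using a seeded generator."""
    rng = random.Random(seed)  # independent generator, so global state is untouched
    return rng.sample(range(n_items), sample_size)


first = sample_indices(1319, 10, seed=221154321)
second = sample_indices(1319, 10, seed=221154321)

print(first == second)   # same seed, same selection
print(len(set(first)))   # sampling without replacement gives distinct indices
```

The indices produced here will not match those chosen by NumPy for the same seed; only the reproducibility property carries over.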

In [5]:
# TODO: !!IMPORTANT!! Set your Student ID. Save your student ID (e.g., '221154321') to a variable named "STUDENT_ID"
STUDENT_ID = "221154321"  # Replace with your actual student ID

In [None]:
# Convert student ID to integer for use as random seed
try:
    STUDENT_SEED = int(STUDENT_ID)
    print(f"Using student ID {STUDENT_ID} as random seed value {STUDENT_SEED}")
    # TODO: Randomly select 10 problems in the "test" split using the student ID as seed. Save the problems in a variable called `test_problems`.
    test_problems = "Your codes here"

    print(f"Selected {len(test_problems)} problems using student ID as random seed")
    print(f"Example Question: {test_problems[0]['question']}")
    print(f"Example Question's answer: {test_problems[0]['answer']}")
    print(f"Extracted answer: {get_ground_truth_answer(test_problems[0])}")
except ValueError:
    print(f"Invalid Student ID {STUDENT_ID}. Please enter a valid Student ID in the previous cell.")

## Implement CoT Prompts

Let's set prompts for Zero-Shot CoT, Few-Shot CoT as well as a standard prompt (Input-Output) as the baseline for our evaluation.
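The prompts in the next cell write `\\boxed{{}}` with doubled braces. With the default f-string style template format, single braces mark input variables such as `{question}`, so literal braces have to be escaped by doubling them. The sketch below uses plain `str.format` as a stand-in for the template engine (an assumption for illustration) to show what the rendered text looks like.

```python
# Plain str.format stands in for the prompt template engine here (illustrative assumption)
system_template = "Put your final numerical answer within $\\boxed{{}}$"
human_template = "Question: {question}"

rendered_system = system_template.format()  # "{{}}" collapses to a literal "{}"
rendered_human = human_template.format(question="What is 2 + 3?")

print(rendered_system)
print(rendered_human)
```

When you test the real prompts, the message contents should likewise contain `\boxed{}` with single braces and the question text in place of `{question}`.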

In [7]:
# New ChatPromptTemplates with simpler format
STANDARD_CHAT_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Answer the questions briefly with just a few words without reasoning. "
              "Put your final numerical answer within $\\boxed{{}}$"),
    ("human", "Question: {question}")
])

# Zero-shot Chain of Thought Chat Prompt
ZERO_SHOT_COT_CHAT_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Answer the questions by thinking step by step. For each question, carefully break down your reasoning process, provide each logical step, followed by a numerical answer. Put your final numerical answer within \\boxed{{}}. "
              "The response structure should be: \n\nThinking: [your reasoning steps] \n Answer: $\\boxed{{}}$."),
    ("human", "Question: {question}")
])

# Few-shot Chain of Thought Chat Prompt
FEW_SHOT_COT_CHAT_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Answer the questions by thinking step by step. For each question, carefully break down your reasoning process, provide each logical step, followed by a numerical answer. Put your final numerical answer within \\boxed{{}}. "
              "A few examples will be provided for you to follow and understand how to think step by step. "
              "The response structure should be: \n\nThinking: [your reasoning steps] \n Answer: $\\boxed{{}}$."),
    ("human", """
Question: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
Thinking: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6.
Answer: $\\boxed{{6}}$

Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Thinking: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5.
Answer: $\\boxed{{5}}$

Question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
Thinking: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39.
Answer: $\\boxed{{39}}$

Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
Thinking: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8.
Answer: $\\boxed{{8}}$

Question: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
Thinking: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9.
Answer: $\\boxed{{9}}$

Question: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?
Thinking: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29.
Answer: $\\boxed{{29}}$

Question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
Thinking: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls.
Answer: $\\boxed{{33}}$

Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
Thinking: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8.
Answer: $\\boxed{{8}}$

Question: {question}
""")
])

### Test the Prompts

**[Q1.3]** Please use the first test problem (`test_problems[0]`) you obtained in the previous section and the `ChatPromptTemplate.invoke()` method to test the 3 prompts. Fix any errors you find. **[5 Marks]**

> 💡 Note:
> Please check if the `content` of the `SystemMessage` and the `HumanMessage` meet the expectation as defined in `STANDARD_CHAT_PROMPT`, `ZERO_SHOT_COT_CHAT_PROMPT` and `FEW_SHOT_COT_CHAT_PROMPT`.

In [None]:
# TODO: Test the standard chat prompt with the first problem in the test_problems data


In [None]:
# TODO: Test the Zero-Shot CoT chat prompt with the first problem in the test_problems data


In [None]:
# TODO: Test the Few-Shot CoT chat prompt with the first problem in the test_problems data


## Utility Functions for Evaluation

Let's implement utility functions for parsing model responses and evaluating the correctness of answers.
- `extract_response_answer()`: A general function used to extract the numerical answer from a reasoning LLM's response.
- `evaluate_predictions()`: Compare the numerical answers from an LLM to the ground truth and calculate the accuracy (solve rate) for one round of evaluation (all problems in the test dataset).
- `calculate_statistics()`: Calculate summary statistics from multiple rounds of evaluation.
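As a small illustration of why answers are compared numerically, the sketch below extracts a number from a `\boxed{...}` answer in the same spirit as the boxed-answer branch of `extract_response_answer()`, then compares it with a reference both as a string and as a float. The helper `boxed_number` and the sample response are made up for this example.

```python
import re


def boxed_number(text: str) -> str:
    """Return the last number inside the first \\boxed{...}, or '' if there is none."""
    boxed = re.search(r"\\boxed\{([^{}]+)\}", text)
    if not boxed:
        return ""
    # Remove thousands separators before looking for numbers
    numbers = re.findall(r"[-+]?\d*\.?\d+", boxed.group(1).replace(",", ""))
    return numbers[-1] if numbers else ""


response = "Thinking: 1,000 + 250 = 1,250.\nAnswer: $\\boxed{1,250.0}$"
prediction = boxed_number(response)
reference = "1250"

print(prediction)                             # the extracted string
print(prediction == reference)                # string comparison
print(float(prediction) == float(reference))  # numerical comparison
```

Here the two strings differ while the two floats are equal, which is why a per-problem flag based on string equality can disagree with the accuracy computed by `evaluate_predictions()`.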

In [11]:
def extract_response_answer(response: str) -> str:
    """
    Extract the final numerical answer from the model's response.
    
    Args:
        response: The full text response from the LLM
    
    Returns:
        The extracted numerical answer as a string
    """
    # First check for answer tags: <answer>X</answer>
    answer_tag_match = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL)
    if answer_tag_match:
        answer_content = answer_tag_match.group(1).strip()
        # Extract number from answer content if it contains text
        numbers = re.findall(r'[-+]?\d*\.?\d+', answer_content.replace(',', ''))
        if numbers:
            return numbers[-1].strip()
        return answer_content.strip()
    
    # Then try to find boxed answer format: \boxed{X}
    boxed_match = re.search(r'\\boxed{([^{}]+)}', response)
    if boxed_match:
        box_content = boxed_match.group(1).strip()
        # Extract number from box content if it contains text
        numbers = re.findall(r'[-+]?\d*\.?\d+', box_content.replace(',', ''))
        if numbers:
            return numbers[-1].strip()
        return box_content.strip()
    
    # If no boxed answer, try to find "Answer: X" pattern
    answer_match = re.search(r'Answer:\s*([\d\.\-]+)', response)
    if answer_match:
        return answer_match.group(1).strip()
    
    # Otherwise try to find the last number in the text
    numbers = re.findall(r'(\d+\.?\d*|\.\d+)', response)
    if numbers:
        return numbers[-1].strip()
    
    return ""

def evaluate_predictions(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """
    Evaluate the model's predictions against reference answers
    
    Args:
        predictions: List of predicted answers
        references: List of reference answers
    
    Returns:
        Dictionary with evaluation metrics
    """
    if len(predictions) != len(references):
        raise ValueError("Predictions and references must have the same length")
    
    correct = 0
    for pred, ref in zip(predictions, references):
        try:
            # Clean and convert to float for numerical comparison
            pred_value = float(pred.replace(',', '').strip())
            ref_value = float(ref.replace(',', '').strip())
            if pred_value == ref_value:
                correct += 1
        except (ValueError, TypeError):
            # If conversion fails, it's wrong
            pass
    
    accuracy = correct / len(references) if references else 0
    return {
        "accuracy": accuracy,
        "correct": correct,
        "total": len(references)
    }

def calculate_statistics(accuracies):
    """Calculate summary statistics from multiple rounds of evaluation"""
    mean_accuracy = np.mean(accuracies)
    std_dev = np.std(accuracies, ddof=1)  # Sample standard deviation
    min_accuracy = np.min(accuracies)
    max_accuracy = np.max(accuracies)
    range_accuracy = max_accuracy - min_accuracy
    
    return {
        "mean": float(mean_accuracy),
        "std_dev": float(std_dev),
        "min": float(min_accuracy),
        "max": float(max_accuracy),
        "range": float(range_accuracy),
        "all_values": [float(acc) for acc in accuracies]
    }
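The `ddof=1` argument above selects the sample standard deviation (dividing by n - 1 rather than n). The sketch below shows the difference on two made-up round accuracies, using the standard-library `statistics` module as a stand-in for NumPy.

```python
import statistics

accuracies = [0.50, 0.70]  # made-up accuracies from two evaluation rounds

mean_acc = statistics.mean(accuracies)
sample_sd = statistics.stdev(accuracies)       # divides by n - 1, like np.std(..., ddof=1)
population_sd = statistics.pstdev(accuracies)  # divides by n, like np.std(...) by default

print(f"mean={mean_acc:.3f}, sample sd={sample_sd:.4f}, population sd={population_sd:.4f}")
```

Note that a sample standard deviation needs at least two values: with a single round, `np.std(..., ddof=1)` evaluates to `nan`, so the `std_dev` entry is only meaningful when `rounds` is 2 or more.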

## Functions for Different CoT Prompting Strategies


Now, let's implement the different prompting strategies for Chain-of-Thought reasoning.

### Standard, Zero-Shot and Few-Shot CoT

**[Q1.4]** Please implement the function to solve a single problem using the provided prompt template and LLM (SC-CoT not included) by completing the `TODO` parts in the function.  **[8 Marks]**

In [None]:
# Function to solve a single problem using the provided prompt template and LLM (SC-CoT not included)
def solve_single_problem(problem, prompt_template, llm):
    """
    Solve a single problem using the provided prompt template and LLM
    Args:
        problem: A single example from the GSM8K dataset
        prompt_template: The prompt template to use
        llm: The LLM to use for generating responses
    Returns:
        A dictionary with the question, reference answer, model response, prediction, and correctness
    """

    # TODO: Extract question and reference answer (ground truth). Save them in variables called `question` and `reference`
    question = "Your codes here"
    reference = "Your codes here"
    
    # Skip examples where we can't extract a ground truth
    if not reference:
        return None
    
    # TODO: Create LCEL chain
    # Your codes here
    
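    # Defaults so that `result` can still be built if the chain call fails
    response_text, prediction = "", ""

    # Use LCEL chain to get response
    try: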
        # TODO: Invoke the chain with the question to get the response. Save the response in a variable called `response_text`
        response_text = "Your codes here"
        
        # TODO: Extract answer from the LLM's response. Save the answer in a variable called `prediction`
        prediction = "Your codes here"
    except Exception as e:
        print(f"Error processing example: {e}")
        print(f"Error details: {str(e)}")
    
    # Create a dictionary to store the result
    result = {
        "question": question,
        "reference": reference,
        "response": response_text,
        "prediction": prediction,
        "correct": prediction.strip() == reference.strip()
    }
    
    return result 

#### Test the Functions

**[Q1.5]** Let's test if we can get a response with "Thinking" and "Answer" parts from the `test_llm` for the `test_problems[0]` using **Few-Shot CoT** and extract the numerical answer from the response.  **[2 Marks]**

In [None]:
# TODO: Use the `test_llm` to solve the `test_problems[0]` using the Few-Shot CoT chat prompt.
# Save the result in a variable called `test_result`
test_result = "Your codes here"

In [None]:
# Inspect the result
test_result

In [None]:
# Display the response in a pretty format
display(Markdown(test_result["response"]))

### SC-CoT based on Zero-Shot or Few-Shot CoT

**[Q1.6]** Please implement the function to solve a single problem using SC-CoT reasoning technique by completing the `TODO` parts in the function. The SC-CoT can be based on either Zero-Shot or Few-Shot prompting (both should be supported). **[20 Marks]**

As mentioned in the Lesson 01 notebook, please implement SC-CoT so that the response follows the format below, i.e., each reasoning path should start with "path_i: ". An example of a multi-path response (some parts are omitted because the response is lengthy):

```text
{'path_1': "To find out how much each of the other two pizzas cost, let's break down what we know:\n\n1. Total cost for four pizzas: $64\n2. Cost for two pizzas: $30\n\nLet's denote the price of one pizza as \\(x\\).\n\nSo, the equation representing the total cost would be:\n\\[2x + 2x = 30\\]\nSimplifying this gives us:\n\\[4x = 30\\]\n\nTo find out how much each of the other two pizzas cost (\\(2x\\) since there are two), divide both sides by \\(2\\):\n\\[x = \\frac{30}{2}\\]\n\\[x = 15\\]\n\nSo, each of the other two pizzas costs $15.", 

'path_2': "To solve this problem, ... Therefore, each of the remaining two pizzas cost \\$17.", 

'path_3': "Let's break down the problem step-by-step.... Step 6: Present the final answer.\n- Each of the other two pizzas costs $17.\n\nTherefore, the answer is $17.", 

'path_4': ... Therefore, each of the other two pizzas cost $17.", 

'path_5': 'To find out how much each of the other two pizzas cost ... Therefore, each of the two remaining pizzas cost $1.33.'}
```

The implementation of `solve_single_problem_sc_cot()` does not have to strictly follow the TODO instructions inside. You can develop your own implementation with the help from any AI tools. **However, you MUST obtain the required values of the specified variables to fill in the `result` dictionary as the return value**.
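The voting step of self-consistency can be sketched independently of the LLM calls. The example below is one possible way to pick the majority answer with `collections.Counter`; the helper `majority_vote` is hypothetical, and the answers are the ones that could be read off the five example paths above.

```python
from collections import Counter
from typing import List


def majority_vote(path_answers: List[str]) -> str:
    """Return the most frequent non-empty answer, or '' if no path produced one."""
    valid = [a.strip() for a in path_answers if a and a.strip()]
    if not valid:
        return ""
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first
    return Counter(valid).most_common(1)[0][0]


path_answers = ["15", "17", "17", "17", "1.33"]  # one extracted answer per reasoning path
print(majority_vote(path_answers))
```

This sketch votes on the answer strings as extracted; normalising them first (for example, treating "17" and "17.0" as the same value) is a design choice left to your implementation.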

> 💡 Note:
> 
> You will get 10 marks IF the SC-CoT is correctly implemented (All fields in the `result` dictionary are correct, showing multiple reasoning paths and a correct majority voting answer)
> 
> You will get 20 marks IF the SC-CoT is correctly implemented with parallel inference (without using the `for` loop)


In [None]:
# Function to solve a single problem using SC-CoT 
def solve_single_problem_sc_cot(problem, prompt_template, llm, num_paths=3, temperatures=[0.7, 0.7, 0.7]):
    """
    Solve a single problem using SC-CoT based on the provided prompt template and LLM
    Args:
        problem: A single example from the GSM8K dataset
        prompt_template: The prompt template to use
        llm: The LLM to use for generating responses
        num_paths: Number of reasoning paths to generate for each question
        temperatures: List of temperature values, one per reasoning path
    Returns:
        A dictionary with the question, reference answer, model response, prediction, and correctness
    """

    # TODO: Extract question and reference answer (ground truth). Save them in variables called `question` and `reference`
    question = "Your codes here"
    reference = "Your codes here"
    
    # Skip examples where we can't extract a ground truth
    if not reference:
        return None

    # Truncate if too many temperatures provided
    # Supports sampling with diverse temperatures (you do not have to use this variable in this project)
    temperatures = temperatures[:num_paths] 
    
    # TODO: Create SC-CoT chain (optional)
    # Your codes here

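    # Defaults so that `result` can still be built if generation fails or no answer is extracted
    paths, path_answers = {}, []
    majority_answer, consolidated_response = "", ""

    # Generate multiple reasoning paths with different temperatures
    try: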
        # TODO: Get all reasoning paths
        paths = "Your codes here"
        
        # TODO: Extract answers from each path
        path_answers = []
        # Your codes here
        
        # TODO: Determine majority vote answer
        if path_answers:
            majority_answer = "Your codes here"
            
            # For debugging
            # print(f"Path answers: {path_answers}, Majority: {majority_answer}")
            
            # TODO: Create consolidated response text showing all paths and the majority answer
            consolidated_response = "Your codes here"
    except Exception as e:
        print(f"Error processing example with SC-CoT: {e}")
        print(f"Error details: {str(e)}")
                
    # Create a dictionary to store the result
    result = {
        "question": question,
        "reference": reference,
        "response": consolidated_response,
        "paths": paths,
        "path_answers": path_answers,
        "prediction": majority_answer,
        "correct": majority_answer.strip() == reference.strip()
    }
    
    return result 

#### Test the Functions

**[Q1.7]** Let's test if we can get a response with **3** reasoning paths from the `test_llm` for the `test_problems[0]` based on **Zero-Shot CoT** prompting and correctly extract the majority answer from the multiple reasoning paths.  **[5 Marks]**

In [None]:
# TODO: Use the `test_llm` to solve the `test_problems[0]` using SC-CoT based on the Zero-Shot CoT chat prompt.
# Save the result in a variable called `test_result`
num_paths = 3
temperatures = [0.7] * num_paths
test_result = "Your codes here"

In [None]:
# Inspect the result
test_result

In [None]:
# Display the response in a pretty format
display(Markdown(test_result["response"]))

## Functions for CoT Evaluation

**[Q1.8]** Now let's implement the functions to evaluate different CoT strategies.  **[4 Marks]**

In [None]:
def run_evaluation(llm, problems, prompt_template, prompt_name, rounds=1,
                   sc_cot_flag=False, num_paths=3, temperatures=[0.7, 0.7, 0.7]):
    """
    Run evaluation with a specific prompting strategy for multiple rounds
    
    Args:
        llm: The language model to use
        problems: Dataset examples to evaluate
        prompt_template: The prompt template to use
        prompt_name: Name of the prompting strategy for logging
        rounds: Number of evaluation rounds to run
        sc_cot_flag: Whether to solve each problem with SC-CoT (multiple paths and majority voting)
        num_paths: Number of reasoning paths to generate for each question
        temperatures: List of temperature values to use for each path
        
    Returns:
        Tuple of (summary_metrics, all_round_metrics, all_round_responses)
    """

    # Initialize lists to store results for all rounds
    all_metrics = []
    all_responses = []
    all_accuracies = []
    
    if sc_cot_flag:
        print(f"Running {rounds} rounds of SC-CoT evaluation with {prompt_name}...")
    else:
        print(f"Running {rounds} rounds of evaluation with {prompt_name}...")
    
    for round_num in range(1, rounds + 1):
        # Initialize lists to store predictions and references for this round
        predictions = []
        references = []
        round_responses = []
        
        if sc_cot_flag:
            print(f"Round {round_num}/{rounds} - {prompt_name} (Self-Consistency)")
        else:
            print(f"Round {round_num}/{rounds} - {prompt_name}")
        
        for problem in tqdm(problems, desc=f"Round {round_num}"):
            # TODO: Solve the problem using the LLM and the specified CoT technique
            # Your codes here
            result = "Your codes here"  # This is only a placeholder, please replace it with your codes and add more lines if needed

            # Skip if result is None (e.g., no ground truth)
            if result is None:
                continue

            # Save the results
            predictions.append(result["prediction"])
            references.append(result["reference"])
            round_responses.append(result)
        
        # Calculate metrics for this round
        round_metrics = evaluate_predictions(predictions, references)
        if sc_cot_flag:
            print(f"Round {round_num} - {prompt_name} SC-CoT Results: {round_metrics}")
        else:
            print(f"Round {round_num} - {prompt_name} Results: {round_metrics}")
        
        # Store the round results
        all_metrics.append({
            "round_number": round_num,
            "metrics": round_metrics
        })
        all_responses.append(round_responses)
        all_accuracies.append(round_metrics["accuracy"])
        
        # Print round results
        print(f"  Round {round_num}: {round_metrics['accuracy']:.2%}")
    
    # Calculate summary statistics across all rounds
    summary_stats = calculate_statistics(all_accuracies)
    
    return summary_stats, all_metrics, all_responses

### Test the Function

Test the function following the `TODO` instructions. Fix any errors you find.

**[Q1.9]** Run the evaluation for standard prompt for 2 rounds using `test_llm` and the 10 problems in the `test_problems` data. Please set the value of the `prompt_name` argument when you call `run_evaluation()` to `"Standard Prompt"`.  **[3 Marks]**

In [None]:
# TODO: Run the standard prompt evaluation for 2 rounds
print("\nEvaluating Standard Prompt:")
# Your codes here

In [None]:
# TODO: Inspect the summary statistics and metrics for the two rounds
# Your codes here

In [None]:
# TODO: Inspect the responses to make sure the results and the statistics are consistent for each round
# Your codes here

**[Q1.10]** Run the evaluation for few-shot SC-CoT with 3 reasoning paths and temperature 0.7 for 2 rounds using `test_llm` and the 10 problems in the `test_problems` data. Please set the value of the `prompt_name` argument when you call `run_evaluation()` to `"Few-shot SC-CoT"`.  **[3 Marks]**

In [None]:
# TODO: Run few-shot SC-CoT evaluation with 3 reasoning paths and temperature 0.7 for 2 rounds
print("\nEvaluating Few-shot Self-Consistency CoT:")
# Your codes here

In [None]:
# TODO: Inspect the summary statistics and metrics for the two rounds
# Your codes here

In [None]:
# TODO: Inspect the responses to make sure the results and the statistics are consistent for each round
# Your codes here

# PART II: Run Evaluations
---

In this part, we'll assemble all components we've built and run the evaluation on 100 randomly selected GSM8K problems using all CoT methods. Let's find out how each CoT technique performs in comparison with others.

## Configure the Evaluation

Let's configure the parameters for the evaluation.
-  Please make sure you have run the cells that set the `STUDENT_SEED` variable from your QM Student ID (the "**Test the Functions**" cells in the "**Functions for GSM8K Dataset**" section).
-  Please do not change the values of the other parameters when producing the results you are going to submit with the notebook. You can play with different configurations (e.g., different models, temperatures, numbers of rounds, numbers of paths for SC-CoT, etc.) for your own experiments if you are interested.
-  We choose a relatively low temperature (0.25) to reduce the variation of the results since our sample size is small.
-  Typically, the number of reasoning paths for SC-CoT is a power of 2. We use 5 to save evaluation time.

In [31]:
# Configuration for evaluation
eval_config = {
    "platform": "ollama",
    "model": "qwen2:1.5b",
    "temperature": 0.25,
    "top_p": 0.95,
    "sample_size": 100,
    "random_seed": STUDENT_SEED,
    "rounds": 5,
    "num_sc_paths": 5,
    "output_dir": "./output"
} 

## Set up an LLM

**[Q2.1]** Please setup an LLM based on the evaluation configuration. Save the model in a variable called `llm`.  **[1 Mark]**

In [None]:
## TODO: Setup an LLM based on the evaluation configuration. Save the model in a variable called `llm`


## Load GSM8K Dataset

**[Q2.2]** Please load the GSM8K dataset with the specified sample size and random seed in the `eval_config`. Save the problems in a variable called `eval_problems`.  **[1 Mark]**

In [None]:
# TODO: Load the GSM8K dataset with the specified sample size and random seed in the `eval_config`.
# Save the problems in a variable called `eval_problems`


## Run CoT Evaluations

**[Q2.3]** Please complete the script below following the TODO instructions and run the evaluations for the standard, zero-shot and few-shot CoT prompts for the number of rounds specified in `eval_config`.  **[3 Marks]**

Please set the value of the `prompt_name` argument when you call `run_evaluation()` to `"Standard Prompt"`, `"Zero-shot CoT"` and `"Few-shot CoT"`.

> 💡 Note:
> 
> This evaluation takes about 35 min to complete on a MacBook Pro M2 with 8GB RAM. Please allow enough time to run this evaluation.

In [None]:
# Create output directory if it doesn't exist
os.makedirs(eval_config["output_dir"], exist_ok=True)

# Initialize results structure
all_results = {
    "config": eval_config,
    "standard": {},
    "zero_shot_cot": {},
    "few_shot_cot": {},
    "zero_shot_sc_cot": {},
    "few_shot_sc_cot": {}
}

# TODO: Run standard prompt evaluation
print("\nEvaluating Standard Prompt:")
standard_summary, standard_rounds, standard_responses = "Your", "codes", "here"
all_results["standard"] = {
    "summary": standard_summary,
    "rounds": standard_rounds
}

# TODO: Run zero-shot CoT evaluation
print("\nEvaluating Zero-shot CoT:")
zero_shot_summary, zero_shot_rounds, zero_shot_responses = "Your", "codes", "here"
all_results["zero_shot_cot"] = {
    "summary": zero_shot_summary,
    "rounds": zero_shot_rounds
}

# TODO: Run few-shot CoT evaluation
print("\nEvaluating Few-shot CoT:")
few_shot_summary, few_shot_rounds, few_shot_responses = "Your", "codes", "here"
all_results["few_shot_cot"] = {
    "summary": few_shot_summary,
    "rounds": few_shot_rounds
}

**[Q2.4]** Please complete the script below following the TODO instructions and run the evaluations for the SC-CoT with zero-shot and few-shot CoT prompts for the number of rounds specified in `eval_config`.  **[2 Marks]**

Please set the value of the `prompt_name` argument when you call `run_evaluation()` to `"Zero-shot SC-CoT"` and `"Few-shot SC-CoT"`.
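
For intuition, self-consistency samples several reasoning paths for the same problem and keeps the answer that most paths agree on. The sketch below shows only the voting step, with made-up extracted answers. `majority_vote` is a hypothetical helper, and the tie-breaking rule (first answer seen wins) is an assumption rather than the behaviour of your Part I code.

```python
from collections import Counter

def majority_vote(answers: list):
    """Hypothetical helper: return the most frequent non-missing answer, or None."""
    valid = [a for a in answers if a is not None]  # drop paths where no answer was extracted
    if not valid:
        return None
    # most_common(1) returns [(answer, count)]; ties resolve to the answer seen first
    return Counter(valid).most_common(1)[0][0]

# Illustrative config values and one made-up extracted answer per reasoning path
demo_config = {"temperature": 0.25, "num_sc_paths": 5}
temperatures = [demo_config["temperature"]] * demo_config["num_sc_paths"]
path_answers = [18.0, 20.0, 18.0, None, 18.0]

print(temperatures, "->", majority_vote(path_answers))
```

Because each problem is answered `num_sc_paths` times, the runtime grows roughly in proportion to the number of paths.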

> 💡 Note:
> 
> This evaluation takes about 150 minutes to complete on a MacBook Pro M2 with 8GB RAM. Please allow enough time to run it. If necessary, you can put the following two evaluations in two separate cells and run them one at a time.

In [None]:
# Set temperatures for SC-CoT evaluations
temperatures = [eval_config["temperature"]] * eval_config["num_sc_paths"]

# TODO: Run zero-shot SC-CoT evaluation
print("\nEvaluating Zero-shot Self-Consistency CoT:")
zero_shot_sc_summary, zero_shot_sc_rounds, zero_shot_sc_responses = "Your", "codes", "here"
all_results["zero_shot_sc_cot"] = {
    "summary": zero_shot_sc_summary,
    "rounds": zero_shot_sc_rounds
}

# TODO: Run few-shot SC-CoT evaluation
print("\nEvaluating Few-shot Self-Consistency CoT:")
few_shot_sc_summary, few_shot_sc_rounds, few_shot_sc_responses = "Your", "codes", "here"
all_results["few_shot_sc_cot"] = {
    "summary": few_shot_sc_summary,
    "rounds": few_shot_sc_rounds
}

## Save Results

**[Q2.5]** Save the metrics/statistics file and the response files for the 5 reasoning techniques.  **[18 Marks]**
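
The output file names are built from the platform, the model name and a timestamp. The model name contains `:` (and may contain `/` for other models), which is awkward or invalid in file names on some systems, so these characters are replaced first. A small sketch of that naming scheme, with `make_model_identifier` as a hypothetical helper and example values filled in:

```python
from datetime import datetime

def make_model_identifier(platform: str, model: str) -> str:
    """Hypothetical helper: join platform and model, replacing characters that are awkward in file names."""
    return f"{platform}_{model}".replace("/", "-").replace(":", "_")

identifier = make_model_identifier("ollama", "qwen2:1.5b")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # year-month-day_hour-minute-second

# Example path for a 5-round run written to ./output
print(f"./output/metrics_{identifier}_5rounds_{timestamp}.json")
```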

In [None]:
# Save consolidated metrics
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
model_identifier = f"{eval_config['platform']}_{eval_config['model']}".replace("/", "-").replace(":", "_")
metrics_file = f"{eval_config['output_dir']}/metrics_{model_identifier}_{eval_config['rounds']}rounds_{timestamp}.json"
with open(metrics_file, "w") as f:
    json.dump(all_results, f, indent=2)

# Save responses for each prompt type and round
os.makedirs(f"{eval_config['output_dir']}/responses", exist_ok=True)

# Map each file-name prefix to the per-round responses of that prompt type
responses_by_prompt = {
    "standard": standard_responses,
    "zero_shot": zero_shot_responses,
    "few_shot": few_shot_responses,
    "zero_shot_sc": zero_shot_sc_responses,
    "few_shot_sc": few_shot_sc_responses,
}

# Write one JSON file per prompt type and round
for prefix, rounds_responses in responses_by_prompt.items():
    for round_num, responses in enumerate(rounds_responses, 1):
        responses_file = f"{eval_config['output_dir']}/responses/{prefix}_{model_identifier}_round{round_num}_{timestamp}.json"
        with open(responses_file, "w") as f:
            json.dump(responses, f, indent=2)

print(f"All results saved to {metrics_file}")

# Print final summary with statistics
print("\n===== EVALUATION SUMMARY =====")
print(f"Platform: {eval_config['platform']}")
print(f"Model: {eval_config['model']}")
print(f"Temperature: {eval_config['temperature']}")
print(f"Top-p: {eval_config['top_p']}")
print(f"Sample size: {eval_config['sample_size']}")
print(f"Rounds: {eval_config['rounds']}")
print(f"SC-CoT paths: {eval_config['num_sc_paths']}")

print("\nStandard Prompt:")
print(f"  Mean: {standard_summary['mean']:.2%}")
print(f"  Range: {standard_summary['min']:.2%} - {standard_summary['max']:.2%} (±{standard_summary['range']/2:.2%})")
print(f"  Std Dev: {standard_summary['std_dev']:.4f}")

print("\nZero-shot CoT:")
print(f"  Mean: {zero_shot_summary['mean']:.2%}")
print(f"  Range: {zero_shot_summary['min']:.2%} - {zero_shot_summary['max']:.2%} (±{zero_shot_summary['range']/2:.2%})")
print(f"  Std Dev: {zero_shot_summary['std_dev']:.4f}")

print("\nFew-shot CoT:")
print(f"  Mean: {few_shot_summary['mean']:.2%}")
print(f"  Range: {few_shot_summary['min']:.2%} - {few_shot_summary['max']:.2%} (±{few_shot_summary['range']/2:.2%})")
print(f"  Std Dev: {few_shot_summary['std_dev']:.4f}")

print("\nZero-shot Self-Consistency CoT:")
print(f"  Mean: {zero_shot_sc_summary['mean']:.2%}")
print(f"  Range: {zero_shot_sc_summary['min']:.2%} - {zero_shot_sc_summary['max']:.2%} (±{zero_shot_sc_summary['range']/2:.2%})")
print(f"  Std Dev: {zero_shot_sc_summary['std_dev']:.4f}")


print("\nFew-shot Self-Consistency CoT:")
print(f"  Mean: {few_shot_sc_summary['mean']:.2%}")
print(f"  Range: {few_shot_sc_summary['min']:.2%} - {few_shot_sc_summary['max']:.2%} (±{few_shot_sc_summary['range']/2:.2%})")
print(f"  Std Dev: {few_shot_sc_summary['std_dev']:.4f}")

# PART III: Analysis and Discussions
---

**[Q3.1]** Based on the metrics result file you obtained from the evaluations, visualise the mean accuracies/solve rates of the different CoT approaches in ONE bar chart for comparison using matplotlib (already imported). Please include error bars representing the standard deviation. (Use GenAI tools to help you generate the code.)  **[5 Marks]**
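
As a starting point, the sketch below pulls the values a bar chart needs out of a dictionary shaped like the saved metrics file (top-level keys as in `all_results`, each with a `summary` holding `mean` and `std_dev`). The helper `extract_chart_data`, the display labels and the demo numbers are hypothetical; the plotting itself is left to you.

```python
import json

# Keys used in all_results, paired with display labels for the chart
APPROACHES = {
    "standard": "Standard",
    "zero_shot_cot": "Zero-shot CoT",
    "few_shot_cot": "Few-shot CoT",
    "zero_shot_sc_cot": "Zero-shot SC-CoT",
    "few_shot_sc_cot": "Few-shot SC-CoT",
}

def extract_chart_data(results: dict):
    """Hypothetical helper: return (labels, means, std_devs) for the approaches present in results."""
    labels, means, std_devs = [], [], []
    for key, label in APPROACHES.items():
        summary = results.get(key, {}).get("summary")
        if not summary:  # skip approaches that were not evaluated
            continue
        labels.append(label)
        means.append(summary["mean"])
        std_devs.append(summary["std_dev"])
    return labels, means, std_devs

# Made-up numbers in the same shape as the saved metrics file
demo_results = json.loads("""
{"standard": {"summary": {"mean": 0.30, "std_dev": 0.02}},
 "zero_shot_cot": {"summary": {"mean": 0.50, "std_dev": 0.03}},
 "few_shot_cot": {}}
""")
labels, means, std_devs = extract_chart_data(demo_results)
print(labels, means, std_devs)
```

The three lists can then be passed to matplotlib, for example `plt.bar(labels, means, yerr=std_devs, capsize=4)`, where `yerr` draws the error bars.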

In [None]:
# Your visualisation code here

**[Q3.2]** Based on the evaluation results, which CoT approach performed best? Please analyse the EVALUATION SUMMARY results and summarise your findings.  **[6 Marks]**

*Your answer here*


**[Q3.3]** Considering the available parameters in `eval_config`, which parameters do you think could be adjusted to further enhance the performance of (Zero-shot/Few-shot) SC-CoT? Please provide your reasoning. (You can get some ideas by observing the saved responses or the responses generated in the test evaluation for SC-CoT in Q1.10. You do not need to verify your speculations in this project.)

*Your answer here*
