# **LLM Math Reasoning: Experiment Runner & Analyzer**

This notebook serves two purposes:

1.  **Run Experiments:** Execute the Python scripts to generate the JSON output files for each experiment.
2.  **Analyze Results:** Load the generated JSON files, calculate final accuracies, and display sample responses to analyze model behavior.

## 1. **Setup**

First, let's install the requirements and log in to Hugging Face. You only need to run this once.

In [None]:
!pip install -r ../requirements.txt

In [None]:
# You will need a Hugging Face token to access the Mistral model.
# Run this cell and paste your token when prompted.
!huggingface-cli login

## 2. **Run Experiments**

The following cells run all 7 experiments and their variants (17 runs in total). This will take a **very long time** (potentially several hours) and requires a powerful GPU.

**Note:** The main script uses `argparse`, so we call it from the command line using `!python ...`. We must navigate to the root directory first (`cd ..`).
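For orientation, here is a minimal sketch of what an `argparse` interface accepting the commands in this notebook could look like. Only the flag names and the experiment names are taken from the commands below; the types, defaults, and the `build_parser` helper are assumptions, not the actual contents of `scripts/run_experiment.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; flag names mirror the commands used in this notebook."""
    parser = argparse.ArgumentParser(description="Run one prompting experiment.")
    parser.add_argument(
        "--experiment",
        required=True,
        choices=["direct", "zero_shot", "few_shot", "wrong_shot",
                 "self_consistency", "verbalized", "subquestion"],
    )
    # Defaults below are assumptions made for this sketch.
    parser.add_argument("--variant", type=int, default=1)
    parser.add_argument("--shots", type=int, default=2)
    parser.add_argument("--k_samples", type=int, default=3)
    parser.add_argument("--sample_numbers", type=int, default=50)
    return parser


# Parse an explicit argument list so the sketch also runs inside a notebook.
args = build_parser().parse_args(
    ["--experiment", "few_shot", "--shots", "4", "--sample_numbers", "50"]
)
print(args.experiment, args.shots, args.sample_numbers)
```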

In [None]:
%cd ..

### **Experiment 1: Direct Prompting**

This is the most basic method and serves as our baseline. We give the model the math problem and ask it to provide the answer directly, without any examples or instructions to "think step-by-step." 

We test two variants:
* `v1`: Asks for a concise answer with no extra commentary.
* `v2`: Asks for *only* the final numeric answer.

In [None]:
!python scripts/run_experiment.py --experiment direct --variant 1 --sample_numbers 50
!python scripts/run_experiment.py --experiment direct --variant 2 --sample_numbers 50

### **Experiment 2: Zero-Shot CoT**

This method tests the "Chain-of-Thought" (CoT) concept. We don't provide any examples, but we add a simple phrase like "Let's think step by step" to the prompt. This encourages the model to generate a reasoning process *before* giving the final answer, which often improves accuracy on complex problems.

* `v1`: Uses the classic "Let's think step by step."
* `v2`: Uses a slightly different prompt: "Reason through it carefully and stepwise."
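
As a rough illustration, the two variants could be assembled as shown below. The trigger phrases come from the list above; the `Question:`/`Answer:` template and the `build_zero_shot_prompt` helper are assumptions for this sketch and may differ from what the script actually sends to the model.

```python
# Trigger phrases for the two variants described above.
COT_TRIGGERS = {
    1: "Let's think step by step.",
    2: "Reason through it carefully and stepwise.",
}


def build_zero_shot_prompt(question: str, variant: int = 1) -> str:
    """Append the CoT trigger phrase after the question (hypothetical template)."""
    return f"Question: {question}\nAnswer: {COT_TRIGGERS[variant]}"


# Illustrative question, not taken from the dataset.
prompt_v1 = build_zero_shot_prompt("What is 12 * 7?", variant=1)
prompt_v2 = build_zero_shot_prompt("What is 12 * 7?", variant=2)
print(prompt_v1)
```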

In [None]:
!python scripts/run_experiment.py --experiment zero_shot --variant 1 --sample_numbers 50
!python scripts/run_experiment.py --experiment zero_shot --variant 2 --sample_numbers 50

### **Experiment 3: Few-Shot (Positive)**

In this experiment, we provide the model with a few *correct* examples (shots) of questions and their step-by-step solutions before giving it the new problem. The idea is that the model picks up the correct format and reasoning pattern from these examples, a behavior known as *in-context learning*.

We test this with 2, 4, and 8 shots loaded from `data/prompts/`.
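
A minimal sketch of how such a prompt might be assembled is shown below. The `Q:`/`A:` layout, the dictionary keys, and the `build_few_shot_prompt` helper are assumptions, and the examples are made up for illustration rather than loaded from `data/prompts/`.

```python
from typing import Dict, List


def build_few_shot_prompt(examples: List[Dict[str, str]], question: str, shots: int) -> str:
    """Prepend the first `shots` worked examples, then pose the new question."""
    blocks = [f"Q: {ex['question']}\nA: {ex['solution']}" for ex in examples[:shots]]
    blocks.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(blocks)


# Illustrative examples only.
examples = [
    {"question": "What is 2 + 3?", "solution": "2 + 3 = 5. The answer is 5."},
    {"question": "What is 4 * 6?", "solution": "4 * 6 = 24. The answer is 24."},
    {"question": "What is 10 - 7?", "solution": "10 - 7 = 3. The answer is 3."},
]

prompt = build_few_shot_prompt(examples, "What is 9 + 8?", shots=2)
print(prompt)
```

The same builder would also cover the negative few-shot setting in the next experiment, with deliberately incorrect solutions passed in as `examples`.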

In [None]:
!python scripts/run_experiment.py --experiment few_shot --shots 2 --sample_numbers 50
!python scripts/run_experiment.py --experiment few_shot --shots 4 --sample_numbers 50
!python scripts/run_experiment.py --experiment few_shot --shots 8 --sample_numbers 50

### **Experiment 4: Few-Shot (Negative)**

This is a counter-intuitive test. We provide the model with examples of questions that are solved *incorrectly*. The goal is to see if the model learns the *format* of reasoning (Q&A) while being smart enough to *ignore* the flawed logic, or if it gets confused and copies the mistakes.

We test this with 2, 4, and 8 "wrong" shots loaded from `data/prompts/`.

In [None]:
!python scripts/run_experiment.py --experiment wrong_shot --shots 2 --sample_numbers 50
!python scripts/run_experiment.py --experiment wrong_shot --shots 4 --sample_numbers 50
!python scripts/run_experiment.py --experiment wrong_shot --shots 8 --sample_numbers 50

### **Experiment 5: Self-Consistency**

This method uses a "majority vote" approach. We run the same prompt K times (for K = 3, 4, and 5) with sampling enabled (`do_sample=True`), so each run can follow a different reasoning path. We then extract the final answer from each path and choose the answer that appears most frequently.

This helps to reduce the impact of a single flawed reasoning path.
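
The voting step can be sketched as follows. This assumes the final answer is the last number in each response, which is a simplification; `extract_final_number` and `majority_vote` are hypothetical helpers and not the script's own functions.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_final_number(text: str) -> Optional[str]:
    """Return the last number in the text (assumed to be the final answer)."""
    # Drop thousands separators, then collect integers and decimals.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def majority_vote(responses: List[str]) -> Optional[str]:
    """Pick the most frequent extracted answer; ties go to the answer seen first."""
    answers = [a for a in map(extract_final_number, responses) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Illustrative sampled responses (K=3), not real model output.
responses = [
    "3 + 15 = 18, so the answer is 18",
    "I get 20.",
    "Total: 18",
]
voted = majority_vote(responses)
print(voted)
```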

In [None]:
!python scripts/run_experiment.py --experiment self_consistency --k_samples 3 --sample_numbers 50
!python scripts/run_experiment.py --experiment self_consistency --k_samples 4 --sample_numbers 50
!python scripts/run_experiment.py --experiment self_consistency --k_samples 5 --sample_numbers 50

### **Experiment 6: Verbalized Confidence**

Here, we ask the model to do two things for each of its K generated answers: 
1.  Solve the problem.
2.  State a "Confidence: X%" score for its own answer.

We then select the single answer that the model claims to be the *most confident* about. This tests the model's self-evaluation capabilities.
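
A minimal sketch of the selection step, assuming each response contains a line of the form `Confidence: X%`. Treating a missing score as 0 is an assumption made here, and the helper names are hypothetical.

```python
import re
from typing import List, Optional

CONFIDENCE_RE = re.compile(r"Confidence:\s*(\d+(?:\.\d+)?)\s*%", re.IGNORECASE)


def parse_confidence(text: str) -> float:
    """Return the stated confidence in percent, or 0.0 if none is found."""
    match = CONFIDENCE_RE.search(text)
    return float(match.group(1)) if match else 0.0


def pick_most_confident(responses: List[str]) -> Optional[str]:
    """Return the response with the highest stated confidence (first one on ties)."""
    if not responses:
        return None
    return max(responses, key=parse_confidence)


# Illustrative responses (K=3), not real model output.
responses = [
    "Answer: 12\nConfidence: 70%",
    "Answer: 15\nConfidence: 95%",
    "Answer: 12",
]
best = pick_most_confident(responses)
print(best)
```

Unlike majority voting, this rule can select an answer that only one sample produced, so it depends entirely on how well calibrated the stated scores are.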

In [None]:
!python scripts/run_experiment.py --experiment verbalized --k_samples 3 --sample_numbers 50
!python scripts/run_experiment.py --experiment verbalized --k_samples 4 --sample_numbers 50
!python scripts/run_experiment.py --experiment verbalized --k_samples 5 --sample_numbers 50

### **Experiment 7: Subquestion Decomposition**

This is the most complex method, also known as "Least-to-Most" prompting. 

1.  We first ask the model to break the main problem down into a list of simpler sub-questions.
2.  Then, we feed each sub-question back to the model one by one, building a chain of reasoning by adding the previous answers to the context.
3.  Finally, we ask for the final answer based on all the intermediate steps.
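
The three steps above can be sketched as a loop around a text-generation callable. Everything here is an assumption for illustration: the prompt wording, the `solve_by_decomposition` helper, and the `fake_generate` stub, which stands in for a real model call so that the sketch runs without a GPU.

```python
import re
from typing import Callable, List


def solve_by_decomposition(problem: str, generate: Callable[[str], str]) -> str:
    """Decompose the problem, answer each sub-question in turn, then ask for the result."""
    # Step 1: ask for sub-questions, one per line, and strip list numbering.
    raw = generate(f"Break this problem into simpler sub-questions, one per line.\nProblem: {problem}")
    sub_questions = [
        re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        for line in raw.splitlines()
        if line.strip()
    ]

    # Step 2: answer sub-questions one by one, carrying earlier answers in the context.
    context = f"Problem: {problem}"
    for sub_question in sub_questions:
        answer = generate(f"{context}\nSub-question: {sub_question}\nAnswer:")
        context += f"\nSub-question: {sub_question}\nAnswer: {answer.strip()}"

    # Step 3: ask for the final answer given all intermediate steps.
    return generate(f"{context}\nBased on the steps above, the final answer is:").strip()


calls: List[str] = []


def fake_generate(prompt: str) -> str:
    """Stub model: records each prompt and returns canned text."""
    calls.append(prompt)
    if prompt.startswith("Break this problem"):
        return "1. How many eggs are left after breakfast?\n2. How much money do the remaining eggs earn?"
    return "placeholder answer"


# Illustrative problem, not taken from the dataset.
result = solve_by_decomposition(
    "A hen lays 16 eggs. 3 are eaten and the rest are sold for $2 each. How much is earned?",
    fake_generate,
)
print(len(calls), result)
```

With two sub-questions the model is called four times (one decomposition, two sub-answers, one final answer), which is part of why this experiment is slower per problem than the single-prompt methods.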

In [None]:
!python scripts/run_experiment.py --experiment subquestion --sample_numbers 50

## 3. **Analyze Results**

If you have already run the experiments (or downloaded the results), you can run the cells below to analyze the outputs.

We will import the helper functions from our `src` module to calculate accuracy and print samples.

In [None]:
import sys
import pandas as pd
from pathlib import Path

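# Add the src directory to the Python path to import our modules.
# The '../' paths in this section assume the working directory is the notebook's folder;
# if you ran `%cd ..` above in this session, use 'src' and 'outputs' instead.
sys.path.append('../src')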

from llm_math_eval.evaluation import get_accuracy, print_sample_responses

In [None]:
OUTPUT_DIR = Path('../outputs')
all_results = []

# Define all expected result files
result_files = {
    'Direct Prompt (v1)': OUTPUT_DIR / 'direct_prompt_v1.json',
    'Direct Prompt (v2)': OUTPUT_DIR / 'direct_prompt_v2.json',
    'Zero-Shot CoT (v1)': OUTPUT_DIR / 'zero_shot_v1.json',
    'Zero-Shot CoT (v2)': OUTPUT_DIR / 'zero_shot_v2.json',
    'Few-Shot (2-shot)': OUTPUT_DIR / 'few_shot_2.json',
    'Few-Shot (4-shot)': OUTPUT_DIR / 'few_shot_4.json',
    'Few-Shot (8-shot)': OUTPUT_DIR / 'few_shot_8.json',
    'Wrong-Shot (2-shot)': OUTPUT_DIR / 'wrong_shot_2.json',
    'Wrong-Shot (4-shot)': OUTPUT_DIR / 'wrong_shot_4.json',
    'Wrong-Shot (8-shot)': OUTPUT_DIR / 'wrong_shot_8.json',
    'Self-Consistency (K=3)': OUTPUT_DIR / 'self_consistency_3.json',
    'Self-Consistency (K=4)': OUTPUT_DIR / 'self_consistency_4.json',
    'Self-Consistency (K=5)': OUTPUT_DIR / 'self_consistency_5.json',
    'Verbalized Conf. (K=3)': OUTPUT_DIR / 'verbalized_confidence_3.json',
    'Verbalized Conf. (K=4)': OUTPUT_DIR / 'verbalized_confidence_4.json',
    'Verbalized Conf. (K=5)': OUTPUT_DIR / 'verbalized_confidence_5.json',
    'Subquestion Decomp.': OUTPUT_DIR / 'subquestion.json',
}

print("--- Accuracy Results ---")
for name, path in result_files.items():
    if path.exists():
        acc = get_accuracy(path)
        all_results.append({'Method': name, 'Accuracy': acc})
    else:
        print(f"File not found, skipping: {path}")
        all_results.append({'Method': name, 'Accuracy': None})

# Display results in a clean table
df_results = pd.DataFrame(all_results)
df_results = df_results.dropna()
df_results['Accuracy'] = pd.to_numeric(df_results['Accuracy'], errors='coerce')
df_results = df_results.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)

display(df_results.style.format({'Accuracy': '{:.2%}'}).hide(axis='index'))

### **Sample Analysis**

Let's look at some samples from the best-performing and worst-performing methods.

In [None]:
# Get the name of the best and worst method from our DataFrame
if not df_results.empty and pd.notna(df_results.iloc[0]['Accuracy']):
    best_method_name = df_results.iloc[0]['Method']
    worst_method_name = df_results.iloc[-1]['Method']

    best_file = result_files[best_method_name]
    worst_file = result_files[worst_method_name]

    print(f"\n\n--- Samples from BEST Method: {best_method_name} ---")
    print_sample_responses(best_file, num_samples=3)

    print(f"\n\n--- Samples from WORST Method: {worst_method_name} ---")
    print_sample_responses(worst_file, num_samples=3)
else:
    print("Could not determine best/worst methods. Please ensure result files exist.")

In [None]:
# You can also inspect any specific file
print("\n\n--- Samples from Zero-Shot CoT (v1) ---")
if (OUTPUT_DIR / 'zero_shot_v1.json').exists():
    print_sample_responses(OUTPUT_DIR / 'zero_shot_v1.json', num_samples=3)
else:
    print("File 'zero_shot_v1.json' not found.")