<a href="https://colab.research.google.com/github/ywangumichigan/EECS595-Project/blob/main/Aliyun/sft/evaluate_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluate Fine-tuned Model on MATH Test Dataset

This notebook evaluates the fine-tuned Qwen3 model downloaded from Aliyun on the MATH test dataset.

## Setup
- Model location: `model/` directory
- Test dataset: `test_math.json`
- Evaluation metric: Exact Match (EM) using SymPy normalization


In [19]:
# Install required packages
# If you encounter version compatibility errors between transformers and peft,
# you may need to install compatible versions. Try this cell first, and if it fails,
# run the troubleshooting cell below.

%pip install -q transformers accelerate peft sympy torch bitsandbytes


KeyboardInterrupt: 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [21]:
%cd /content/drive/MyDrive/Colab_Notebooks/CSE595_Proj/Aliyun/sft
%pwd

/content/drive/MyDrive/Colab_Notebooks/CSE595_Proj/Aliyun/sft


'/content/drive/MyDrive/Colab_Notebooks/CSE595_Proj/Aliyun/sft'

## Troubleshooting: Version Compatibility

If you get an error like `ModuleNotFoundError: No module named 'transformers.modeling_layers'`,
run the cell below to fix the version compatibility issue.


In [None]:
# TROUBLESHOOTING CELL - Run this if you get version compatibility errors
# This installs compatible versions of transformers and peft

# Option 1: Try upgrading to latest compatible versions
%pip install --upgrade transformers peft accelerate

# Option 2: Install specific compatible versions (recommended)
# Uncomment the line below if Option 1 doesn't work:
# %pip install "transformers>=4.37.0,<4.52.0" "peft>=0.7.0" accelerate sympy torch bitsandbytes

# Option 3: Latest versions (may work, try if others fail)
# %pip install "transformers>=4.40.0" "peft>=0.10.0" accelerate sympy torch bitsandbytes

print("If you see version errors, uncomment one of the options above and run this cell.")

If you see version errors, uncomment one of the options above and run this cell.


In [35]:
import os
import json
import re
import sympy
import torch
import random
from pathlib import Path
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

# Try to import PeftModel (only needed if using LoRA adapter)
# If import fails, we'll handle it when loading the adapter
try:
    from peft import PeftModel
    PEFT_AVAILABLE = True
except ImportError as e:
    print(f"Warning: peft not available or version incompatible: {e}")
    print("You can only use full models (not LoRA adapters) without peft.")
    PEFT_AVAILABLE = False
    PeftModel = None

# Configuration
MODEL_DIR = "model"  # Path to the model DIRECTORY (folder containing model files, not the weights file itself)
# MODEL_DIR should contain: config.json, tokenizer files, and either:
#   - model.safetensors (full model), OR
#   - adapter/ folder (LoRA adapter)
TEST_DATA_PATH = "/content/drive/MyDrive/Colab_Notebooks/CSE595_Proj/test_math.json"  # Path to test dataset (should have level and type fields)
USE_8BIT = True  # Use 8-bit quantization to save memory (requires bitsandbytes)
MAX_NEW_TOKENS = 128 # Reduced for faster generation
NUM_SAMPLES = 10  # None means evaluate on all test samples, or set a number for quick testing
DO_SAMPLE = False # Set to False for faster (greedy) generation
TEMPERATURE = 0.7 # Lower temperature for less randomness when do_sample is True

# Filtering options (set to None or [] to disable filtering)
FILTER_LEVELS = ["Level 2"]  # e.g., ["Level 1", "Level 2"] or None for all levels
FILTER_TYPES = None  # e.g., ["algebra", "geometry"] or None for all types

## Load Evaluation Functions

These functions are used to extract answers from model output and compare them.


In [None]:
import sympy
import re

def _parse_latex_braces(s, start_index):
    """Parses content within balanced braces, handling nesting."""
    if not s or start_index >= len(s) or s[start_index] != '{':
        return None, -1

    brace_count = 1
    content_chars = []
    current_index = start_index + 1

    while current_index < len(s):
        char = s[current_index]
        if char == '{':
            brace_count += 1
            content_chars.append(char)
        elif char == '}':
            brace_count -= 1
            if brace_count == 0:
                return "".join(content_chars), current_index
            else:
                content_chars.append(char)
        else:
            content_chars.append(char)
        current_index += 1

    return None, -1 # Unbalanced braces

def _insert_multiplication_stars(s):
    """Inserts '*' for implicit multiplication in mathematical strings for SymPy."""
    if not s:
        return s

    # Regex patterns for implicit multiplication
    # 1. Number followed by a variable (e.g., "2x" -> "2*x")
    s = re.sub(r'(\d)([a-zA-Z])', r'\1*\2', s)
    # 2. Variable followed by a variable (e.g., "xy" -> "x*y")
    #    This is a simplification for typical single-letter variables and may over-correct for function names (e.g., 'sin(x)').
    #    For MATH dataset problems, variables are usually single letters, making this generally safe.
    s = re.sub(r'([a-zA-Z])([a-zA-Z])', r'\1*\2', s)
    # 3. Closing parenthesis followed by an opening parenthesis or a variable
    #    (e.g., "(a+b)(c+d)" -> "(a+b)*(c+d)", "(a+b)x" -> "(a+b)*x")
    s = re.sub(r'(\))([a-zA-Z(])', r'\1*\2', s)
    # 4. Number followed by an opening parenthesis (e.g., "2(x+y)" -> "2*(x+y)")
    s = re.sub(r'(\d)(\()', r'\1*\2', s)
    return s

def extract_boxed(latex_string):
    """Extract content from the LAST \boxed{} in LaTeX string, handling nested braces."""
    if not latex_string:
        return None

    last_boxed_content = None

    # Find all occurrences of \boxed{ using re.finditer
    matches = list(re.finditer(r'\\boxed\s*\{', latex_string, re.IGNORECASE))

    for match in matches:
        # Start parsing content from after the opening brace of \boxed{}
        start_index = match.end()
        brace_content, end_index = _parse_latex_braces(latex_string, start_index - 1) # Pass the index of '{'
        if brace_content is not None:
            last_boxed_content = brace_content
    return last_boxed_content

def normalize_sympy(s):
    """Normalize mathematical expression using sympy, preprocessing LaTeX fractions and implicit multiplication."""
    if not s:
        return None

    processed_s = s

    # Step 1: Unescape double backslashes for \frac, \dfrac if they exist
    processed_s = processed_s.replace('\\\\frac', '\\frac')
    processed_s = processed_s.replace('\\\\dfrac', '\\dfrac')

    # Step 2: Iteratively convert LaTeX fractions to SymPy compatible format
    while True:
        # Search for \frac or \dfrac pattern
        match = re.search(r'\\(?:d?frac)\s*\{', processed_s)
        if not match:
            break

        latex_frac_keyword_start = match.start()
        num_brace_start = match.end() - 1 # Index of '{' for numerator

        numerator, num_brace_end = _parse_latex_braces(processed_s, num_brace_start)
        if numerator is None:
            break # Malformed

        denom_brace_start = num_brace_end + 1
        if denom_brace_start >= len(processed_s) or processed_s[denom_brace_start] != '{':
            break # Malformed

        denominator, denom_brace_end = _parse_latex_braces(processed_s, denom_brace_start)
        if denominator is None:
            break # Malformed

        # DO NOT call _insert_multiplication_stars here for numerator/denominator
        # This will be handled in Step 3 on the fully processed string.

        sympy_frac_str = f"({numerator})/({denominator})"

        original_latex_frac = processed_s[latex_frac_keyword_start : denom_brace_end + 1]
        processed_s = processed_s.replace(original_latex_frac, sympy_frac_str, 1)

    # Step 3: Insert multiplication stars for implicit multiplication in the final string
    processed_s = _insert_multiplication_stars(processed_s)

    try:
        return sympy.sympify(processed_s)
    except (sympy.SympifyError, TypeError, SyntaxError):
        return None

print("Evaluation functions updated with LaTeX fraction preprocessing and implicit multiplication handling (corrected)!")

Evaluation functions updated with LaTeX fraction preprocessing and implicit multiplication handling (corrected)!


**Reasoning**:
The previous fix to `normalize_sympy` has been applied, and now I need to re-run the test cases to verify that the nested fraction test and all other tests pass as expected. This will confirm the correctness of the implicit multiplication handling and LaTeX fraction preprocessing.



In [None]:
print("\n--- Verifying normalize_sympy and sympy.equals() ---")

test_strings = [
    r"\frac{1}{2}",
    r"\dfrac{1}{2}",
    r"\\frac{1}{2}", # Example with double backslash if that's how it appears from some source
    "0.5",
    "0.5000"
]

sympy_expressions = []
for s in test_strings:
    normalized_expr = normalize_sympy(s)
    sympy_expressions.append(normalized_expr)
    print(f"Original: '{s}' -> Normalized: {normalized_expr}")

# Compare all expressions to the first one (which should be 1/2)
# Assuming the first expression is valid for comparison
if sympy_expressions[0] is not None:
    base_expr = sympy_expressions[0]
    print(f"\nComparing all to base expression: {base_expr}")
    for i, expr in enumerate(sympy_expressions):
        if expr is not None:
            is_equal = base_expr.equals(expr)
            print(f"  '{test_strings[i]}' ({expr}) equals base expression? {is_equal}")
        else:
            print(f"  '{test_strings[i]}' ({expr}) could not be normalized, cannot compare.")
else:
    print("Base expression could not be normalized, cannot perform comparisons.")

# Specific test: \frac{1}{2} should equal 0.5
expr_frac = normalize_sympy(r"\frac{1}{2}")
expr_decimal = normalize_sympy("0.5")
print(f"\nTest: \"\\frac{{1}}{{2}}\" equals \"0.5\"? {expr_frac.equals(expr_decimal) if expr_frac and expr_decimal else 'Cannot compare'}")

# Test with nested fractions (optional, but good for robustness)
expr_nested_frac = normalize_sympy(r"\frac{1}{\frac{2}{3}}")
expr_simplified_nested = normalize_sympy("3/2")
print(f"\nTest: \"\\frac{{1}}{{\\frac{{2}}{{3}}}}\" equals \"3/2\"? {expr_nested_frac.equals(expr_simplified_nested) if expr_nested_frac and expr_simplified_nested else 'Cannot compare'}")

# Test with complex expression
expr_complex_latex = normalize_sympy(r"\frac{x+y}{2x-y}")
expr_complex_sympy = normalize_sympy("(x+y)/(2x-y)")
print(f"\nTest: \"\\frac{{x+y}}{{2x-y}}\" equals \"(x+y)/(2x-y)\" (symbolic)? {expr_complex_latex.equals(expr_complex_sympy) if expr_complex_latex and expr_complex_sympy else 'Cannot compare'}")


--- Verifying normalize_sympy and sympy.equals() ---
Original: '\frac{1}{2}' -> Normalized: 1/2
Original: '\dfrac{1}{2}' -> Normalized: 1/2
Original: '\\frac{1}{2}' -> Normalized: 1/2
Original: '0.5' -> Normalized: 0.500000000000000
Original: '0.5000' -> Normalized: 0.500000000000000

Comparing all to base expression: 1/2
  '\frac{1}{2}' (1/2) equals base expression? True
  '\dfrac{1}{2}' (1/2) equals base expression? True
  '\\frac{1}{2}' (1/2) equals base expression? True
  '0.5' (0.500000000000000) equals base expression? True
  '0.5000' (0.500000000000000) equals base expression? True

Test: "\frac{1}{2}" equals "0.5"? True

Test: "\frac{1}{\frac{2}{3}}" equals "3/2"? True

Test: "\frac{x+y}{2x-y}" equals "(x+y)/(2x-y)" (symbolic)? True


In [None]:
# extract_boxed("\\boxed{\\frac{\\boldsymbol{x}}{1}}")
normalize_sympy("(2+x)").equals(normalize_sympy("2+x"))

True

## Load Model and Tokenizer

The notebook will automatically detect if the model is a full model or a LoRA adapter.


In [None]:
# Check if we have a full model or LoRA adapter
model_path = Path(MODEL_DIR)
has_full_model = (model_path / "model.safetensors").exists() or (model_path / "pytorch_model.bin").exists()
has_adapter = (model_path / "adapter" / "adapter_config.json").exists()

print(f"Model directory: {model_path.absolute()}")
print(f"Has full model: {has_full_model}")
print(f"Has adapter: {has_adapter}")

if not has_full_model and not has_adapter:
    raise ValueError(f"No model found in {MODEL_DIR}. Please check the path.")

# Check if adapter requires peft
if has_adapter and not PEFT_AVAILABLE:
    raise ValueError(
        "LoRA adapter detected but peft is not available or incompatible.\n"
        "Please fix the version compatibility issue:\n"
        "Option 1: Upgrade packages: pip install --upgrade transformers peft\n"
        "Option 2: Install compatible versions: pip install 'transformers>=4.37.0,<4.52.0' 'peft>=0.7.0'\n"
        "Option 3: Use a full model instead of a LoRA adapter"
    )

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_DIR,
    trust_remote_code=True
)

# Set pad token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Tokenizer loaded: {tokenizer.__class__.__name__}")
print(f"Vocab size: {tokenizer.vocab_size}")

Model directory: /content/drive/MyDrive/Colab_Notebooks/CSE595_Proj/Aliyun/sft/model
Has full model: True
Has adapter: True
Tokenizer loaded: Qwen2TokenizerFast
Vocab size: 151643


## Preprocess Test Dataset (Extract Ground Truth Answers)

Preprocess the test dataset to extract and normalize ground truth boxed answers in advance.
This speeds up evaluation since we don't need to extract answers for each example during evaluation.


In [22]:
# Preprocess test dataset: extract and normalize ground truth answers
import json
from tqdm import tqdm
import sys

PREPROCESSED_TEST_PATH = "test_math_preprocessed.json"  # Output path for preprocessed data

# Check if preprocessed file already exists
if os.path.exists(PREPROCESSED_TEST_PATH):
    print(f"Preprocessed file {PREPROCESSED_TEST_PATH} already exists.")
    print("Loading preprocessed data...")
    with open(PREPROCESSED_TEST_PATH, 'r', encoding='utf-8') as f:
        test_data = json.load(f)
    print(f"Loaded {len(test_data)} preprocessed examples")

    # Show an example to verify format
    if len(test_data) > 0:
        example = test_data[0]
        print("\nExample preprocessed item:")
        print(f"  Instruction: {example['instruction'][:100]}...")
        print(f"  Gold boxed answer (raw): {example.get('gold_answer_str', 'N/A')}")
        print(f"  Gold answer (normalized): {example.get('gold_answer_sympy', 'N/A')}")
else:
    print("Preprocessing test dataset...")
    print("This may take a few minutes for large datasets.")

    # Temporarily increase the integer string conversion limit for SymPy serialization
    # This helps with extremely large numbers that might appear in MATH dataset solutions (e.g., factorials)
    original_int_limit = sys.get_int_max_str_digits()
    sys.set_int_max_str_digits(0) # Set to 0 to disable the limit

    try:
        # Load original test data
        with open(TEST_DATA_PATH, 'r', encoding='utf-8') as f:
            original_test_data = json.load(f)

        print(f"Processing {len(original_test_data)} examples...")

        # Preprocess each example
        preprocessed_data = []
        for item in tqdm(original_test_data, desc="Preprocessing"):
            instruction = item["instruction"]
            gold_output = item["output"]

            # Extract boxed answer from gold output
            gold_ans_str = extract_boxed(gold_output)

            # Normalize using sympy
            gold_ans_sym = normalize_sympy(gold_ans_str)

            # Store normalized answer as string for comparison
            # Wrap in try-except to handle potentially huge numbers that can't be stringified
            gold_ans_sym_str = None
            if gold_ans_sym is not None:
                try:
                    gold_ans_sym_str = str(gold_ans_sym)
                except ValueError:
                    # Fallback if the number is too large to stringify, store type or a placeholder
                    gold_ans_sym_str = f"SymPy object (too large to stringify: {type(gold_ans_sym)})"

                        # Create preprocessed item
            preprocessed_item = {
                "instruction": instruction,
                "output": gold_output,  # Keep original output for reference
                "gold_answer_str": gold_ans_str,  # Extracted boxed answer (raw string)
                "gold_answer_sympy": gold_ans_sym_str,  # Normalized answer (as string for JSON)
                "gold_answer_sympy_obj": None,  # Will be None, we'll reconstruct from string
                # Preserve level and type if present in original data
                "level": item.get("level", None),
                "type": item.get("type", None)
            }

            preprocessed_data.append(preprocessed_item)

        # Save preprocessed data
        with open(PREPROCESSED_TEST_PATH, 'w', encoding='utf-8') as f:
            json.dump(preprocessed_data, f, indent=2, ensure_ascii=False)

        print(f"✓ Preprocessing complete! Saved to {PREPROCESSED_TEST_PATH}")
        print(f"  Processed {len(preprocessed_data)} examples")

        # Statistics
        valid_answers = sum(1 for item in preprocessed_data if item["gold_answer_sympy"] is not None)
        print(f"  Valid boxed answers extracted: {valid_answers}/{len(preprocessed_data)}")

        # Load the preprocessed data
        test_data = preprocessed_data

    finally:
        # Restore original integer string conversion limit
        sys.set_int_max_str_digits(original_int_limit)

print("\nPreprocessing ready!")

Preprocessing test dataset...
This may take a few minutes for large datasets.
Processing 5000 examples...


Preprocessing: 100%|██████████| 5000/5000 [00:02<00:00, 2237.97it/s]


✓ Preprocessing complete! Saved to test_math_preprocessed.json
  Processed 5000 examples
  Valid boxed answers extracted: 4114/5000

Preprocessing ready!


In [23]:
import json
import sympy
from tqdm import tqdm

# Ensure PREPROCESSED_TEST_PATH is defined from previous cells

print(f"Loading preprocessed data from: {PREPROCESSED_TEST_PATH}")
with open(PREPROCESSED_TEST_PATH, 'r', encoding='utf-8') as f:
    preprocessed_data = json.load(f)

incompatible_items = []

for idx, item in enumerate(tqdm(preprocessed_data, desc="Checking inconsistencies")):
    gold_ans_str_from_json = item.get("gold_answer_str")
    gold_ans_sympy_str_from_json = item.get("gold_answer_sympy")

    # Case 1: gold_answer_str is present, but gold_answer_sympy is None
    if gold_ans_str_from_json is not None and gold_ans_sympy_str_from_json is None:
        # Re-normalize to confirm it should be None or if there was an error
        re_normalized_sympy_obj = normalize_sympy(gold_ans_str_from_json)
        if re_normalized_sympy_obj is not None:
            try:
                re_normalized_sympy_obj_to_store = str(re_normalized_sympy_obj)
            except ValueError:
                re_normalized_sympy_obj_to_store = f"SymPy object (too large to stringify: {type(re_normalized_sympy_obj)})"
            incompatible_items.append({
                "index": idx,
                "reason": "gold_answer_sympy was None, but gold_answer_str could be normalized",
                "gold_str": gold_ans_str_from_json,
                "stored_sympy_str": gold_ans_sympy_str_from_json,
                "re_normalized_sympy_obj": re_normalized_sympy_obj_to_store
            })
        continue

    # Case 2: gold_answer_str is None, but gold_answer_sympy is present
    if gold_ans_str_from_json is None and gold_ans_sympy_str_from_json is not None:
        try:
            re_sympified_from_stored_str = sympy.sympify(gold_ans_sympy_str_from_json)
            if re_sympified_from_stored_str is not None:
                try:
                    sympified_from_stored_str_to_store = str(re_sympified_from_stored_str)
                except ValueError:
                    sympified_from_stored_str_to_store = f"SymPy object (too large to stringify: {type(re_sympified_from_stored_str)})"

                incompatible_items.append({
                    "index": idx,
                    "reason": "gold_answer_sympy was not None, but gold_answer_str was None",
                    "gold_str": gold_ans_str_from_json,
                    "stored_sympy_str": gold_ans_sympy_str_from_json,
                    "sympified_from_stored": sympified_from_stored_str_to_store
                })
        except (sympy.SympifyError, TypeError, SyntaxError):
            incompatible_items.append({
                "index": idx,
                "reason": "gold_answer_str was None, and stored gold_answer_sympy was malformed",
                "gold_str": gold_ans_str_from_json,
                "stored_sympy_str": gold_ans_sympy_str_from_json
            })
        continue

    # Case 3: Both are present, check for actual equivalence
    if gold_ans_str_from_json is not None and gold_ans_sympy_str_from_json is not None:
        re_normalized_sympy_obj = normalize_sympy(gold_ans_str_from_json)
        try:
            stored_sympy_obj = sympy.sympify(gold_ans_sympy_str_from_json)
        except (sympy.SympifyError, TypeError, SyntaxError):
            stored_sympy_obj = None # Malformed stored sympy string

        # Robust comparison logic
        is_equal = False
        if re_normalized_sympy_obj is None or stored_sympy_obj is None:
            # If either could not be normalized, and they are not both None, then they are not equal
            # If both are None, they are implicitly compatible for this check
            is_equal = (re_normalized_sympy_obj is None and stored_sympy_obj is None)
        elif isinstance(re_normalized_sympy_obj, sympy.Expr) and isinstance(stored_sympy_obj, sympy.Expr):
            # Both are SymPy expressions, use .equals()
            is_equal = re_normalized_sympy_obj.equals(stored_sympy_obj)
        else:
            # One or both are not SymPy expressions (e.g., Python int, float, tuple, list)
            # In this case, we can attempt a direct Python equality check.
            is_equal = (re_normalized_sympy_obj == stored_sympy_obj)

        if not is_equal:
            # Safely convert to string for storing in the report
            try:
                re_normalized_sympy_obj_to_store = str(re_normalized_sympy_obj) if re_normalized_sympy_obj is not None else None
            except ValueError:
                re_normalized_sympy_obj_to_store = f"SymPy object (too large to stringify: {type(re_normalized_sympy_obj)})" if re_normalized_sympy_obj is not None else None

            try:
                stored_sympy_obj_to_store = str(stored_sympy_obj) if stored_sympy_obj is not None else None
            except ValueError:
                stored_sympy_obj_to_store = f"SymPy object (too large to stringify: {type(stored_sympy_obj)})" if stored_sympy_obj is not None else None

            incompatible_items.append({
                "index": idx,
                "reason": "Normalized gold_answer_str does not equal stored gold_answer_sympy",
                "gold_str": gold_ans_str_from_json,
                "stored_sympy_str": stored_sympy_obj_to_store,
                "re_normalized_sympy_obj": re_normalized_sympy_obj_to_store,
                "stored_sympy_obj_type": str(type(stored_sympy_obj)),
                "re_normalized_sympy_obj_type": str(type(re_normalized_sympy_obj))
            })


print(f"\nFound {len(incompatible_items)} items with potential inconsistencies.")
if incompatible_items:
    for item in incompatible_items:
        print(json.dumps(item, indent=2, ensure_ascii=False))
else:
    print("No obvious inconsistencies found between 'gold_answer_str' and 'gold_answer_sympy'.")

Loading preprocessed data from: test_math_preprocessed.json


Checking inconsistencies: 100%|██████████| 5000/5000 [00:03<00:00, 1391.33it/s]


Found 1 items with potential inconsistencies.
{
  "index": 2854,
  "reason": "Normalized gold_answer_str does not equal stored gold_answer_sympy",
  "gold_str": "\\frac{1}{2004!}",
  "stored_sympy_str": null,
  "re_normalized_sympy_obj": "SymPy object (too large to stringify: <class 'sympy.core.numbers.Rational'>)",
  "stored_sympy_obj_type": "<class 'NoneType'>",
  "re_normalized_sympy_obj_type": "<class 'sympy.core.numbers.Rational'>"
}





In [24]:
# Load model based on type
if has_adapter and PEFT_AVAILABLE:
    # Load base model first (need to check config for base model path)
    adapter_config_path = model_path / "adapter" / "adapter_config.json"
    with open(adapter_config_path, 'r') as f:
        adapter_config = json.load(f)

    # Note: If base_model_name_or_path is a local path that doesn't exist,
    # you may need to manually specify the base model name
    base_model_path = adapter_config.get("base_model_name_or_path", "")

    print(f"Loading base model from: {base_model_path}")

    # Try to load base model - if path doesn't exist, we'll try loading from MODEL_DIR
    if os.path.exists(base_model_path):
        base_model_path_to_use = base_model_path
    else:
        # Assume the full model is in MODEL_DIR (if it exists)
        if has_full_model:
            base_model_path_to_use = str(model_path)
            print(f"Base model path not found, using model directory: {base_model_path_to_use}")
        else:
            # If no base model found, we need the HuggingFace model name
            # This should be provided by the user or detected from config
            raise ValueError(f"Cannot find base model. Please check the adapter config.")

    # Import BitsAndBytesConfig for quantization (if available)
    try:
        from transformers import BitsAndBytesConfig
        quantization_config = None
        if USE_8BIT:
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0,
            )
    except ImportError:
        print("Warning: bitsandbytes not available. Setting USE_8BIT=False")
        USE_8BIT = False
        quantization_config = None

    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_path_to_use if os.path.exists(base_model_path_to_use) else MODEL_DIR,
        device_map="auto",
        quantization_config=quantization_config,
        torch_dtype=torch.float16 if not USE_8BIT else None,
        trust_remote_code=True
    )

    # Load adapter
    adapter_path = model_path / "adapter"
    print(f"Loading LoRA adapter from: {adapter_path}")
    model = PeftModel.from_pretrained(base_model, str(adapter_path))
    print("LoRA adapter loaded successfully!")

else:
    # Load full model
    # Import BitsAndBytesConfig for quantization (if available)
    try:
        from transformers import BitsAndBytesConfig
        quantization_config = None
        if USE_8BIT:
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0,
            )
    except ImportError:
        print("Warning: bitsandbytes not available. Setting USE_8BIT=False")
        USE_8BIT = False
        quantization_config = None

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_DIR,
        device_map="auto",
        quantization_config=quantization_config,
        torch_dtype=torch.float16 if not USE_8BIT else None,
        trust_remote_code=True
    )
    print("Full model loaded successfully!")

model.eval()
print(f"Model device: {next(model.parameters()).device}")
print(f"Model dtype: {next(model.parameters()).dtype}")


Loading base model from: /tmp/input_model/
Base model path not found, using model directory: model
Loading LoRA adapter from: model/adapter
LoRA adapter loaded successfully!
Model device: cuda:0
Model dtype: torch.float16


## Load Test Dataset

Load the test dataset in JSON format (instruction/output pairs).


In [36]:
# Load preprocessed test dataset
# The preprocessed data includes 'gold_answer_str' and 'gold_answer_sympy'
with open(PREPROCESSED_TEST_PATH, 'r', encoding='utf-8') as f:
    preprocessed_test_data = json.load(f)

print(f"Loaded {len(preprocessed_test_data)} preprocessed test examples")

# Apply filtering by level and type if specified
filtered_data = preprocessed_test_data

if FILTER_LEVELS is not None and len(FILTER_LEVELS) > 0:
    filtered_data = [item for item in filtered_data if item.get("level") in FILTER_LEVELS]
    print(f"After filtering by levels {FILTER_LEVELS}: {len(filtered_data)} examples")

if FILTER_TYPES is not None and len(FILTER_TYPES) > 0:
    filtered_data = [item for item in filtered_data if item.get("type") in FILTER_TYPES]
    print(f"After filtering by types {FILTER_TYPES}: {len(filtered_data)} examples")

if len(filtered_data) == 0:
    raise ValueError("No examples remaining after filtering! Please check your FILTER_LEVELS and FILTER_TYPES settings.")

# Limit number of samples if specified and draw randomly
if NUM_SAMPLES is not None:
    if NUM_SAMPLES <= len(filtered_data):
        # random.seed(42) # For reproducibility
        test_data = random.sample(filtered_data, NUM_SAMPLES)
        print(f"Randomly selected {NUM_SAMPLES} samples for evaluation")
    else:
        print(f"NUM_SAMPLES ({NUM_SAMPLES}) is greater than filtered examples ({len(filtered_data)}). Using all filtered examples.")
        test_data = filtered_data
else:
    test_data = filtered_data

# Show statistics
if test_data:
    levels = {}
    types = {}
    for item in test_data:
        level = item.get("level", "Unknown")
        type_field = item.get("type", "Unknown")
        levels[level] = levels.get(level, 0) + 1
        types[type_field] = types.get(type_field, 0) + 1

    print(f"\nStatistics for selected test data:")
    print(f"  Total examples: {len(test_data)}")
    print(f"  Levels: {dict(sorted(levels.items()))}")
    print(f"  Types: {dict(sorted(types.items()))}")

# Show an example
print("\nExample preprocessed test item:")
print(json.dumps(test_data[0], indent=2, ensure_ascii=False))


Loaded 5000 preprocessed test examples
After filtering by levels ['Level 2']: 894 examples
Randomly selected 10 samples for evaluation

Statistics for selected test data:
  Total examples: 10
  Levels: {'Level 2': 10}
  Types: {'Algebra': 4, 'Geometry': 1, 'Intermediate Algebra': 4, 'Prealgebra': 1}

Example preprocessed test item:
{
  "instruction": "You are a math assistant. Solve the problem step by step, explain your reasoning, and box the final answer using \\boxed{}.\n\nThe sum of seven consecutive integers is 49. What is the smallest of the seven integers?",
  "output": "Let the smallest of these consecutive integers be $a-3$, and thus the largest will be $a+3$. The sum of all seven  integers is equal to the average of the first and last term, multiplied by the number of terms, which is $7a = 49$. Thus, $a=7$. The smallest of the seven integers is $a-3=7-3=\\boxed{4}$.",
  "gold_answer_str": "4",
  "gold_answer_sympy": "4",
  "gold_answer_sympy_obj": null,
  "level": "Level 2",


## Evaluation Function

This function evaluates the model on the test dataset and computes exact match accuracy.


In [33]:
def evaluate_model(model, tokenizer, test_data, max_new_tokens=512, do_sample=True, temperature=0.7, verbose=True):
    """
    Evaluate model on test dataset.
    Returns: exact_match_score, detailed_results
    """
    correct = 0
    total = 0
    results = []

    # Extract system prompt and user question from instruction
    # Format: "You are a math assistant...\n\n<problem>"
    SYSTEM_PROMPT = "You are a math assistant. Solve the problem step by step, explain your reasoning, and box the final answer using \boxed{}."

    for idx, item in enumerate(tqdm(test_data, desc="Evaluating")):
        instruction = item["instruction"]
        problem = instruction.split("\n\n", 1)[1] if "\n\n" in instruction else instruction
        system_msg = instruction.split("\n\n", 1)[0] if "\n\n" in instruction else SYSTEM_PROMPT

        # Use pre-extracted gold answers
        gold_ans_str = item["gold_answer_str"]
        gold_ans_sympy_str = item["gold_answer_sympy"]

        gold_ans_sym = None
        if gold_ans_sympy_str is not None and not gold_ans_sympy_str.startswith("SymPy object (too large to stringify:"):
            try:
                gold_ans_sym = sympy.sympify(gold_ans_sympy_str)
            except (sympy.SympifyError, TypeError, SyntaxError):
                gold_ans_sym = None

        # Construct prompt in Qwen3 format
        # Format: <|im_start|>system\n<system_prompt><|im_end|>\n<|im_start|>user\n<problem><|im_end|>\n<|im_start|>assistant\n
        # Handle potential escaped newlines in system_msg for Qwen
        system_msg_for_prompt = system_msg.replace('\\n', '\n')
        problem_for_prompt = problem.replace('\\n', '\n')

        prompt = f"<|im_start|>system\n{system_msg_for_prompt}<|im_end|>\n<|im_start|>user\n{problem_for_prompt}<|im_end|>\n<|im_start|>assistant\n"

        # Tokenize
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
        input_length = inputs["input_ids"].shape[1]

        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=do_sample,
                temperature=temperature,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id
            )

        # Decode only the newly generated tokens
        generated_tokens = outputs[0][input_length:]
        pred_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)

        # Extract predicted answer
        pred_ans_str = extract_boxed(pred_text)
        pred_ans_sym = normalize_sympy(pred_ans_str)

        # Check if correct
        is_correct = False
        if pred_ans_sym is not None and gold_ans_sym is not None:
            # Robust comparison for symbolic expressions
            if isinstance(pred_ans_sym, sympy.Expr) and isinstance(gold_ans_sym, sympy.Expr):
                is_correct = gold_ans_sym.equals(pred_ans_sym)
            else:
                # Fallback for non-sympy expressions like tuples or numbers
                is_correct = (gold_ans_sym == pred_ans_sym)
        elif (pred_ans_str == "" or pred_ans_str is None) and (gold_ans_str == "" or gold_ans_str is None):
            # Both are empty/None, consider them equal
            is_correct = True

        results.append({
            "problem": problem[:100] + "..." if len(problem) > 100 else problem,
            "predicted": pred_ans_str,
            "gold": gold_ans_str,
            "predicted_full": pred_text[:200] + "..." if len(pred_text) > 200 else pred_text,
            "correct": is_correct
        })

        if verbose and idx < 5:  # Show first 5 examples
            print(f"\n--- Example {idx + 1} ---")
            print(f"Problem: {problem[:150]}...")
            print(f"Predicted answer: {pred_ans_str}")
            print(f"Gold answer: {gold_ans_str}")
            print(f"Correct: {is_correct}")
            print(f"Generated text (first 200 chars): {pred_text[:200]}...")

    exact_match = correct / total if total > 0 else 0.0
    return exact_match, results

print("Evaluation function defined!")

Evaluation function defined!


## Run Evaluation

Evaluate the model on the test dataset and compute exact match accuracy.


In [37]:
print("Starting evaluation...")
print(f"Total test examples: {len(test_data)}")
print(f"Max new tokens: {MAX_NEW_TOKENS}")
print(f"Do sampling: {DO_SAMPLE}")
print(f"Temperature: {TEMPERATURE}")
print("-" * 60)

exact_match_score, detailed_results = evaluate_model(
    model,
    tokenizer,
    test_data,
    max_new_tokens=MAX_NEW_TOKENS,
    do_sample=DO_SAMPLE,
    temperature=TEMPERATURE,
    verbose=True
)

print("\n" + "=" * 60)
print(f"EVALUATION RESULTS")
print("=" * 60)
print(f"Exact Match Score: {exact_match_score:.4f} ({exact_match_score*100:.2f}%)")
print(f"Correct: {sum(r['correct'] for r in detailed_results)}")
print(f"Total: {len(detailed_results)}")
print("=" * 60)

Starting evaluation...
Total test examples: 10
Max new tokens: 128
Do sampling: False
Temperature: 0.7
------------------------------------------------------------


Evaluating:  10%|█         | 1/10 [00:25<03:48, 25.35s/it]


--- Example 1 ---
Problem: The sum of seven consecutive integers is 49. What is the smallest of the seven integers?...
Predicted answer: 4
Gold answer: 4
Correct: True
Generated text (first 200 chars): Let the seven consecutive integers be $n-3$, $n-2$, $n-1$, $n$, $n+1$, $n+2$, and $n+3$.  The sum of these seven integers is $7n$, and we are given that this sum is 49.  Therefore, $7n=49$, so $n=\box...


Evaluating:  20%|██        | 2/10 [00:50<03:22, 25.31s/it]


--- Example 2 ---
Problem: Compute $\sin 330^\circ$....
Predicted answer: -\frac{1}{2}
Gold answer: -\frac{1}{2}
Correct: True
Generated text (first 200 chars): We have that
\[\sin 330^\circ = \sin (360^\circ - 30^\circ) = -\sin 30^\circ = \boxed{-\frac{1}{2}}.\]Note that $\sin 330^\circ$ is the same as $\sin (-30^\circ),$ so we can also use the fact that $\s...


Evaluating:  30%|███       | 3/10 [01:15<02:56, 25.27s/it]


--- Example 3 ---
Problem: If $\&x$ is defined as $\&x = x + 5$ and $\#x$ is defined as $\#x = x^2$ , what is the value of $\#(\&4)$?...
Predicted answer: 81
Gold answer: 81
Correct: True
Generated text (first 200 chars): We have that $\&4 = 4+5 = 9$ and $\#9 = 9^2 = 81$.  Therefore, $\#(\&4) = \#9 = \boxed{81}$.  (We could have also done this by first finding $\#4 = 4^2 = 16$ and then finding $\&16 = 16+5 = 21$ and th...


Evaluating:  40%|████      | 4/10 [01:41<02:31, 25.30s/it]


--- Example 4 ---
Problem: Find $w$ if $|3-2i+w|=0$....
Predicted answer: -3+2i
Gold answer: -3+2i
Correct: True
Generated text (first 200 chars): If $|3-2i+w|=0$, then $3-2i+w=0$.  Therefore, $w=-3+2i$.  So, $w=\boxed{-3+2i}$.  (Note that $w$ is a complex number, not a real number.)  (The absolute value of a complex number is the distance from ...


Evaluating:  50%|█████     | 5/10 [02:06<02:06, 25.24s/it]


--- Example 5 ---
Problem: What is the largest prime factor of 78?...
Predicted answer: 13
Gold answer: 13
Correct: True
Generated text (first 200 chars): We can factor 78 as $2\cdot39$.  We can factor 39 as $3\cdot13$.  Therefore, the largest prime factor of 78 is $\boxed{13}$.  (We could also have factored 78 as $2\cdot3\cdot13$ and gotten the same an...


Evaluating: 100%|██████████| 10/10 [04:12<00:00, 25.22s/it]


EVALUATION RESULTS
Exact Match Score: 0.0000 (0.00%)
Correct: 7
Total: 10





## Save Results

Save the evaluation results to a file for later analysis.


In [None]:
# Save results to JSON file
results_file = "evaluation_results.json"
results_summary = {
    "exact_match_score": exact_match_score,
    "total_examples": len(detailed_results),
    "correct": sum(r['correct'] for r in detailed_results),
    "detailed_results": detailed_results
}

with open(results_file, 'w', encoding='utf-8') as f:
    json.dump(results_summary, f, indent=2, ensure_ascii=False)

print(f"Results saved to {results_file}")

# Show some statistics
print(f"\nStatistics:")
print(f"  Accuracy: {exact_match_score*100:.2f}%")
correct_count = sum(r['correct'] for r in detailed_results)
incorrect_count = len(detailed_results) - correct_count
print(f"  Correct: {correct_count}")
print(f"  Incorrect: {incorrect_count}")
