# Finetuned checkpoint step evaluation

The `unsloth/gemma-3-1b-it-bnb-4bit` model was fine-tuned with GRPO for 100 steps, and the LoRA adapters were saved every 10 steps. This notebook evaluates each saved checkpoint on the GSM8K test split and reports its accuracy.

## Framework/Libraries Installation

In [None]:
# Installs unsloth and other dependencies optimized for colab
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --upgrade transformers accelerate bitsandbytes

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-9hjmm34q/unsloth_def170a9266940a2abe57ef58e18edd5
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-9hjmm34q/unsloth_def170a9266940a2abe57ef58e18edd5
  Resolved https://github.com/unslothai/unsloth.git to commit 9390bd528d4126840b142d5c354b8c1d7461f41e
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Imports and Parameters

In [None]:
import unsloth
from unsloth import FastLanguageModel
import re
import torch
import pandas as pd
import os
import json
from datasets import load_dataset
from tqdm.notebook import tqdm
import math
import glob


BASE_MODEL_NAME = "unsloth/gemma-3-1b-it-bnb-4bit"
DATASET_NAME = "openai/gsm8k"
DATASET_SPLIT = "test"
CHECKPOINTS_PARENT_DIR = "/content/drive/MyDrive/info621_models/checkpoints"
# Directory to save CSVs and the JSON accuracy summary
RESULTS_DIR = "/content/drive/MyDrive/info621_models/evaluation_results_gsm8k"
MAX_NEW_TOKENS = 256
EVALUATION_BATCH_SIZE = 128
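
The evaluation loop later in this notebook discovers `checkpoint-<step>` directories under `CHECKPOINTS_PARENT_DIR` and sorts them by step number. The sketch below illustrates why that sort needs a numeric key: a plain string sort would place `checkpoint-100` before `checkpoint-20`. The `/ckpts/...` paths and the helper name `step_number` are made up for illustration.

```python
import os

def step_number(path: str) -> int:
    """Extract the trailing step number from a 'checkpoint-<step>' directory path."""
    return int(os.path.basename(path).split("-")[-1])

# Hypothetical checkpoint paths, listed in arbitrary order
paths = [
    "/ckpts/checkpoint-100",
    "/ckpts/checkpoint-20",
    "/ckpts/checkpoint-1",
    "/ckpts/checkpoint-10",
]

lexicographic = sorted(paths)               # string order: 1, 10, 100, 20
by_step = sorted(paths, key=step_number)    # numeric order: 1, 10, 20, 100

print([os.path.basename(p) for p in lexicographic])
print([os.path.basename(p) for p in by_step])
```

Sorting by step keeps the per-checkpoint results in training order, which makes the final accuracy summary easier to read as a learning curve.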

## Helper Functions

In [None]:
def load_model_from_checkpoint(checkpoint_path: str):
    """
    Load the Unsloth model from a specific checkpoint directory.
    """
    print(f"Loading model from checkpoint: {checkpoint_path}")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=checkpoint_path,
        load_in_4bit=True,
        device_map="auto",
    )
    FastLanguageModel.for_inference(model)

    # Decoder-only models need left padding for batched generation,
    # so set the padding side unconditionally.
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Keep the model config in sync with the tokenizer's pad token
    if model.config.pad_token_id is None:
        model.config.pad_token_id = tokenizer.pad_token_id


    print(f"Model and tokenizer loaded successfully from {checkpoint_path}.")
    return model, tokenizer
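
To illustrate why the padding side matters for batched generation, here is a minimal sketch in plain Python. A decoder-only model continues each sequence from its last position, so with right padding the shorter prompts end in pad tokens rather than real prompt tokens. The token IDs, the `PAD` value, and the helper `pad_batch` are toy assumptions and do not come from the Gemma tokenizer.

```python
PAD = 0  # hypothetical pad token id for this toy example

def pad_batch(sequences, side):
    """Pad lists of token ids to a common width on the given side."""
    width = max(len(seq) for seq in sequences)
    if side == "left":
        return [[PAD] * (width - len(seq)) + seq for seq in sequences]
    return [seq + [PAD] * (width - len(seq)) for seq in sequences]

prompts = [[5, 6, 7], [8, 9]]  # two toy prompts of different lengths

left = pad_batch(prompts, "left")
right = pad_batch(prompts, "right")

print(left)   # every row ends with a real prompt token
print(right)  # the shorter row ends with PAD

# Generation starts after the last column, so inspect what sits there
print([row[-1] for row in left])
print([row[-1] for row in right])
```

With left padding, every prompt ends in the same column, which is also what allows the generated tokens to be sliced off with a single shared input length.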

In [None]:
def prepare_dataset(split: str = "test", dataset_name="openai/gsm8k"):
    """
    Loads the gsm8k dataset and prepares it for evaluation.
    Formats questions into prompts suitable for the Gemma instruction-tuned model
    and extracts the final numeric gold answer.
    """
    print(f"Loading and preparing {dataset_name} dataset (split: {split})...")
    dataset_hf = load_dataset(dataset_name, "main")
    dataset_split_hf = dataset_hf[split]

    formatted_data = []
    for example in tqdm(dataset_split_hf, desc=f"Formatting {split} data for {dataset_name}"):
        question = example["question"].strip()
        answer_full = example["answer"].strip()

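        # re.split never raises here, so check for the marker explicitly.
        # Commas are stripped so a gold "1,000" compares equal to a predicted "1000".
        if "####" in answer_full:
            gold_answer_numeric = re.split(r"####\s*", answer_full)[-1].strip().replace(",", "")
        else:
            print(f"Warning: Could not parse gold answer for question: '{question[:50]}...'")
            gold_answer_numeric = None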

        # Gemma instruction formatting
        prompt = f"<start_of_turn>user\nSolve the following math problem step-by-step:\n{question}<end_of_turn>\n<start_of_turn>model\n"

        formatted_data.append({
            "prompt": prompt,
            "gold_answer_numeric": gold_answer_numeric,
            "original_question": question,
            "original_answer": answer_full
        })
    print(f"Dataset preparation complete. {len(formatted_data)} examples formatted.")
    return formatted_data
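
As a quick illustration of the gold-answer parsing, the sketch below applies the `####` split to a made-up GSM8K-style answer string. The sample text and the helper name `parse_gold_answer` are assumptions for illustration; stripping commas from the gold value is an extra step assumed here so that it compares equal to a comma-free prediction.

```python
import re

def parse_gold_answer(answer_full: str):
    """Return the text after the final '####' marker, or None if the marker is absent."""
    if "####" not in answer_full:
        return None
    final_part = re.split(r"####\s*", answer_full)[-1].strip()
    return final_part.replace(",", "")

# Made-up answer text in the GSM8K style: reasoning lines, then '#### <answer>'
sample = "She has 16 - 3 - 4 = 9 eggs left.\nShe earns 9 * 2 = 18 dollars.\n#### 18"

print(parse_gold_answer(sample))
print(parse_gold_answer("The total is large.\n#### 1,250"))
print(parse_gold_answer("an answer without the marker"))
```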

In [None]:
def extract_final_number(text: str):
    """
    Extracts the last occurring number (integer or float) from a string.
    Removes commas from numbers before extraction.
    """
    if text is None:
        return None
    text_no_commas = text.replace(',', '')
    numbers = re.findall(r"[-+]?\d*\.\d+|[-+]?\d+", text_no_commas)
    if numbers:
        return numbers[-1]
    return None
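
The following sketch exercises the same extraction logic on a few made-up model outputs. The function body repeats the regex so the block runs on its own; the example strings are assumptions. The last case shows a limitation of the last-number heuristic: any number that appears after the real answer is picked up instead.

```python
import re

def extract_final_number(text):
    """Return the last integer or decimal in text as a string, or None."""
    if text is None:
        return None
    text_no_commas = text.replace(",", "")
    numbers = re.findall(r"[-+]?\d*\.\d+|[-+]?\d+", text_no_commas)
    return numbers[-1] if numbers else None

examples = [
    "The total is 1,250 dollars.",        # comma is removed before matching
    "So x = -3.5",                        # sign and decimal part are kept
    "Step 1: 4 apples. Step 2: 7 apples.",
    "No digits here",
    "The answer is 18 (from step 2).",    # trailing number wins over the answer
]

for text in examples:
    print(repr(text), "->", extract_final_number(text))
```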

In [None]:
@torch.no_grad()
def evaluate_model_on_dataset(model, tokenizer, dataset: list, batch_size: int, max_new_tokens: int):
    """
    Evaluates the loaded model's performance on the prepared dataset.
    """
    print(f"Evaluating on {len(dataset)} samples with batch size {batch_size}.")
    correct_count = 0
    results_list = []
    total_count = len(dataset)

    # Determine pad_token_id for generation, falling back to eos_token_id if not set
    gen_pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
    if gen_pad_token_id is None:
        raise ValueError("Cannot determine a pad_token_id for generation: pad_token_id and eos_token_id are both None.")


    num_batches = math.ceil(total_count / batch_size)

    for i in tqdm(range(num_batches), desc=f"Evaluating Batches (Size {batch_size})"):
        start_index = i * batch_size
        end_index = min((i + 1) * batch_size, total_count)
        batch_data = dataset[start_index:end_index]

        batch_prompts = [example["prompt"] for example in batch_data]
        batch_gold_numeric = [example["gold_answer_numeric"] for example in batch_data]

        inputs = tokenizer(
            batch_prompts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=model.config.max_position_embeddings - max_new_tokens - 10,
        ).to(model.device)

        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Greedy decoding for deterministic output
            pad_token_id=gen_pad_token_id
        )

        # Decoding the generated text
        input_lengths = inputs.input_ids.shape[1]
        generated_texts_answer_only = tokenizer.batch_decode(outputs[:, input_lengths:], skip_special_tokens=True)

        for j in range(len(batch_data)):
            original_example = batch_data[j]
            generated_text_ans = generated_texts_answer_only[j]
            gold_numeric = batch_gold_numeric[j]

            generated_numeric = extract_final_number(generated_text_ans)

            is_correct = False
            if gold_numeric is not None and generated_numeric is not None:
                try:
                    # comparing floats for numerical equivalence
                    if abs(float(generated_numeric) - float(gold_numeric)) < 1e-6:
                        is_correct = True
                except ValueError:
                    if generated_numeric == gold_numeric:
                        is_correct = True
            elif gold_numeric == generated_numeric:
                is_correct = (gold_numeric is None and generated_numeric is None)


            if is_correct:
                correct_count += 1

            results_list.append({
                "question": original_example["original_question"],
                "full_gold_answer": original_example["original_answer"],
                "gold_numeric_expected": gold_numeric,
                "model_generated_full_text": generated_text_ans,
                "extracted_prediction_numeric": generated_numeric,
                "is_correct": is_correct
            })

    accuracy = (correct_count / total_count) if total_count > 0 else 0.0
    print(f"Evaluation Complete for this checkpoint: Correct = {correct_count}, Total = {total_count}, Accuracy = {accuracy:.4f}")

    return {
        "accuracy": accuracy,
        "correct_count": correct_count,
        "total_count": total_count,
        "results": results_list
    }
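
The batching arithmetic and the answer comparison used in the evaluation loop can be checked in isolation. This is a minimal sketch: `batch_bounds` and `answers_match` are hypothetical helper names that mirror the inline logic, and the sizes (1319 test examples, batch size 128) are the values used in this notebook.

```python
import math

def batch_bounds(total_count, batch_size):
    """Return (start, end) index pairs that cover total_count items."""
    num_batches = math.ceil(total_count / batch_size)
    return [
        (i * batch_size, min((i + 1) * batch_size, total_count))
        for i in range(num_batches)
    ]

def answers_match(predicted, gold, tol=1e-6):
    """Compare two numeric strings as floats, falling back to string equality."""
    try:
        return abs(float(predicted) - float(gold)) < tol
    except ValueError:
        return predicted == gold

bounds = batch_bounds(1319, 128)
print(len(bounds), bounds[0], bounds[-1])   # 11 batches; the last one is partial

print(answers_match("18.0", "18"))  # numerically equal despite different text
print(answers_match("18", "81"))
```

Comparing as floats means a prediction such as `18.0` still counts as correct against a gold answer of `18`.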

In [None]:
def main():
    os.makedirs(RESULTS_DIR, exist_ok=True)
    all_checkpoint_accuracies = {}

    # 1. prepare dataset
    print("Preparing dataset...")
    evaluation_dataset = prepare_dataset(split=DATASET_SPLIT, dataset_name=DATASET_NAME)
    if not evaluation_dataset:
        print(f"Dataset '{DATASET_NAME}' (split '{DATASET_SPLIT}') could not be loaded or is empty. Exiting.")
        return

    # 2. finding checkpoint directories
    # Assumes checkpoints are named like 'checkpoint-10', 'checkpoint-20', etc.
    # and are direct subdirectories of CHECKPOINTS_PARENT_DIR
    checkpoint_dirs_found = sorted(
        [d for d in glob.glob(os.path.join(CHECKPOINTS_PARENT_DIR, "checkpoint-*")) if os.path.isdir(d)],
        key=lambda x: int(x.split('-')[-1]) # Sort numerically by step number
    )

    if not checkpoint_dirs_found:
        print(f"No checkpoint directories found in '{CHECKPOINTS_PARENT_DIR}' matching 'checkpoint-*'.")
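        # os.listdir raises if the directory is missing, so check first
        if os.path.isdir(CHECKPOINTS_PARENT_DIR):
            print(f"Please ensure CHECKPOINTS_PARENT_DIR is set correctly. Current content: {os.listdir(CHECKPOINTS_PARENT_DIR)}")
        else:
            print("Please ensure CHECKPOINTS_PARENT_DIR is set correctly. The directory does not exist.")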
        return

    print(f"Found the following checkpoint directories to evaluate: {checkpoint_dirs_found}")

    # 3. iterating through checkpoints
    for checkpoint_dir_path in checkpoint_dirs_found:
        checkpoint_name = os.path.basename(checkpoint_dir_path)
        print(f"\n--- Evaluating Checkpoint: {checkpoint_name} ---")

        current_model = None
        current_tokenizer = None
        try:
            current_model, current_tokenizer = load_model_from_checkpoint(checkpoint_dir_path)
        except Exception as e:
            print(f"Error loading model/tokenizer for {checkpoint_name}: {e}")
            print("Skipping this checkpoint.")
            all_checkpoint_accuracies[checkpoint_name] = {"error": str(e), "accuracy": 0.0, "correct": 0, "total": len(evaluation_dataset)}
            if current_model is not None: del current_model
            if current_tokenizer is not None: del current_tokenizer
            torch.cuda.empty_cache()
            continue

        eval_summary = evaluate_model_on_dataset(
            current_model,
            current_tokenizer,
            evaluation_dataset,
            batch_size=EVALUATION_BATCH_SIZE,
            max_new_tokens=MAX_NEW_TOKENS
        )

        all_checkpoint_accuracies[checkpoint_name] = {
            "accuracy": eval_summary["accuracy"],
            "correct": eval_summary["correct_count"],
            "total": eval_summary["total_count"]
        }

        # saving individual checkpoint results to CSV
        csv_filename = os.path.join(RESULTS_DIR, f"predictions_{DATASET_NAME.replace('/', '_')}_{DATASET_SPLIT}_{checkpoint_name}.csv")
        try:
            results_df = pd.DataFrame(eval_summary['results'])
            results_df.to_csv(csv_filename, index=False)
            print(f"Detailed predictions for {checkpoint_name} saved to {csv_filename}")
        except Exception as e:
            print(f"Error saving CSV for {checkpoint_name}: {e}")

        # cleaning up model and tokenizer to free GPU memory for the next checkpoint
        del current_model
        del current_tokenizer
        torch.cuda.empty_cache()
        print(f"Cleaned up model and tokenizer for {checkpoint_name} to free memory.")


    # 4. save all accuracies to a json
    accuracies_json_path = os.path.join(RESULTS_DIR, f"all_checkpoints_accuracies_{DATASET_NAME.replace('/', '_')}.json")
    with open(accuracies_json_path, 'w') as f:
        json.dump(all_checkpoint_accuracies, f, indent=4)
    print(f"\nSummary of all checkpoint accuracies saved to {accuracies_json_path}")

    print("\n--- Final Summary of Accuracies ---")
    for ckpt, data in all_checkpoint_accuracies.items():
        if "error" in data and data["error"]:
            print(f"{ckpt}: Error - {data['error']}")
        else:
            print(f"{ckpt}: Accuracy = {data['accuracy']:.4f} ({data['correct']}/{data['total']})")

if __name__ == '__main__':
    main()

Preparing dataset...
Loading and preparing openai/gsm8k dataset (split: test)...



Dataset preparation complete. 1319 examples formatted.
Found the following checkpoint directories to evaluate: ['/content/drive/MyDrive/info621_models/checkpoints/checkpoint-1', '/content/drive/MyDrive/info621_models/checkpoints/checkpoint-10', '/content/drive/MyDrive/info621_models/checkpoints/checkpoint-20', '/content/drive/MyDrive/info621_models/checkpoints/checkpoint-30', '/content/drive/MyDrive/info621_models/checkpoints/checkpoint-40', '/content/drive/MyDrive/info621_models/checkpoints/checkpoint-50', '/content/drive/MyDrive/info621_models/checkpoints/checkpoint-60', '/content/drive/MyDrive/info621_models/checkpoints/checkpoint-70', '/content/drive/MyDrive/info621_models/checkpoints/checkpoint-80', '/content/drive/MyDrive/info621_models/checkpoints/checkpoint-90', '/content/drive/MyDrive/info621_models/checkpoints/checkpoint-100']

--- Evaluating Checkpoint: checkpoint-1 ---
Loading model from checkpoint: /content/drive/MyDrive/info621_models/checkpoints/checkpoint-1
==((====))==


Model and tokenizer loaded successfully from /content/drive/MyDrive/info621_models/checkpoints/checkpoint-1.
Evaluating on 1319 samples with batch size 128.


Evaluating Batches (Size 128):   0%|          | 0/11 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Evaluation Complete for this checkpoint: Correct = 378, Total = 1319, Accuracy = 0.2866
Detailed predictions for checkpoint-1 saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/predictions_openai_gsm8k_test_checkpoint-1.csv
Cleaned up model and tokenizer for checkpoint-1 to free memory.

--- Evaluating Checkpoint: checkpoint-10 ---
Loading model from checkpoint: /content/drive/MyDrive/info621_models/checkpoints/checkpoint-10
==((====))==  Unsloth 2025.4.8: Fast Gemma3 patching. Transformers: 4.51.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model and tokenizer loaded successfully from /content/drive/MyDrive/info621_models/checkp

Evaluating Batches (Size 128):   0%|          | 0/11 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Evaluation Complete for this checkpoint: Correct = 378, Total = 1319, Accuracy = 0.2866
Detailed predictions for checkpoint-10 saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/predictions_openai_gsm8k_test_checkpoint-10.csv
Cleaned up model and tokenizer for checkpoint-10 to free memory.

--- Evaluating Checkpoint: checkpoint-20 ---
Loading model from checkpoint: /content/drive/MyDrive/info621_models/checkpoints/checkpoint-20
Model and tokenizer loaded successfully from /content/drive/MyDrive/info621_models/che

Evaluating Batches (Size 128):   0%|          | 0/11 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Evaluation Complete for this checkpoint: Correct = 378, Total = 1319, Accuracy = 0.2866
Detailed predictions for checkpoint-20 saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/predictions_openai_gsm8k_test_checkpoint-20.csv
Cleaned up model and tokenizer for checkpoint-20 to free memory.

--- Evaluating Checkpoint: checkpoint-30 ---
Loading model from checkpoint: /content/drive/MyDrive/info621_models/checkpoints/checkpoint-30
Model and tokenizer loaded successfully from /content/drive/MyDrive/info621_models/che

Evaluating Batches (Size 128):   0%|          | 0/11 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Evaluation Complete for this checkpoint: Correct = 378, Total = 1319, Accuracy = 0.2866
Detailed predictions for checkpoint-30 saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/predictions_openai_gsm8k_test_checkpoint-30.csv
Cleaned up model and tokenizer for checkpoint-30 to free memory.

--- Evaluating Checkpoint: checkpoint-40 ---
Loading model from checkpoint: /content/drive/MyDrive/info621_models/checkpoints/checkpoint-40
Model and tokenizer loaded successfully from /content/drive/MyDrive/info621_models/che

Evaluating Batches (Size 128):   0%|          | 0/11 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Evaluation Complete for this checkpoint: Correct = 378, Total = 1319, Accuracy = 0.2866
Detailed predictions for checkpoint-40 saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/predictions_openai_gsm8k_test_checkpoint-40.csv
Cleaned up model and tokenizer for checkpoint-40 to free memory.

--- Evaluating Checkpoint: checkpoint-50 ---
Loading model from checkpoint: /content/drive/MyDrive/info621_models/checkpoints/checkpoint-50
Model and tokenizer loaded successfully from /content/drive/MyDrive/info621_models/che

Evaluating Batches (Size 128):   0%|          | 0/11 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Evaluation Complete for this checkpoint: Correct = 378, Total = 1319, Accuracy = 0.2866
Detailed predictions for checkpoint-50 saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/predictions_openai_gsm8k_test_checkpoint-50.csv
Cleaned up model and tokenizer for checkpoint-50 to free memory.

--- Evaluating Checkpoint: checkpoint-60 ---
Loading model from checkpoint: /content/drive/MyDrive/info621_models/checkpoints/checkpoint-60
Model and tokenizer loaded successfully from /content/drive/MyDrive/info621_models/che

Evaluating Batches (Size 128):   0%|          | 0/11 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Evaluation Complete for this checkpoint: Correct = 378, Total = 1319, Accuracy = 0.2866
Detailed predictions for checkpoint-60 saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/predictions_openai_gsm8k_test_checkpoint-60.csv
Cleaned up model and tokenizer for checkpoint-60 to free memory.

--- Evaluating Checkpoint: checkpoint-70 ---
Loading model from checkpoint: /content/drive/MyDrive/info621_models/checkpoints/checkpoint-70
Model and tokenizer loaded successfully from /content/drive/MyDrive/info621_models/che

Evaluating Batches (Size 128):   0%|          | 0/11 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Evaluation Complete for this checkpoint: Correct = 378, Total = 1319, Accuracy = 0.2866
Detailed predictions for checkpoint-70 saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/predictions_openai_gsm8k_test_checkpoint-70.csv
Cleaned up model and tokenizer for checkpoint-70 to free memory.

--- Evaluating Checkpoint: checkpoint-80 ---
Loading model from checkpoint: /content/drive/MyDrive/info621_models/checkpoints/checkpoint-80
Model and tokenizer loaded successfully from /content/drive/MyDrive/info621_models/che

Evaluating Batches (Size 128):   0%|          | 0/11 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Evaluation Complete for this checkpoint: Correct = 378, Total = 1319, Accuracy = 0.2866
Detailed predictions for checkpoint-80 saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/predictions_openai_gsm8k_test_checkpoint-80.csv
Cleaned up model and tokenizer for checkpoint-80 to free memory.

--- Evaluating Checkpoint: checkpoint-90 ---
Loading model from checkpoint: /content/drive/MyDrive/info621_models/checkpoints/checkpoint-90
==((====))==  Unsloth 2025.4.8: Fast Gemma3 patching. Transformers: 4.51.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model and tokenizer loaded successfully from /content/drive/MyDrive/info621_models/che

Evaluating Batches (Size 128):   0%|          | 0/11 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Evaluation Complete for this checkpoint: Correct = 378, Total = 1319, Accuracy = 0.2866
Detailed predictions for checkpoint-90 saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/predictions_openai_gsm8k_test_checkpoint-90.csv
Cleaned up model and tokenizer for checkpoint-90 to free memory.

--- Evaluating Checkpoint: checkpoint-100 ---
Loading model from checkpoint: /content/drive/MyDrive/info621_models/checkpoints/checkpoint-100
==((====))==  Unsloth 2025.4.8: Fast Gemma3 patching. Transformers: 4.51.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model and tokenizer loaded successfully from /content/drive/MyDrive/info621_models/c

Evaluating Batches (Size 128):   0%|          | 0/11 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

Evaluation Complete for this checkpoint: Correct = 378, Total = 1319, Accuracy = 0.2866
Detailed predictions for checkpoint-100 saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/predictions_openai_gsm8k_test_checkpoint-100.csv
Cleaned up model and tokenizer for checkpoint-100 to free memory.
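
Every checkpoint run above emits the right-padding warning, which asks for `padding_side='left'` on the tokenizer before batched generation. A minimal sketch of that change follows. The helper name `use_left_padding` is hypothetical and not part of this notebook, and the stand-in object only mimics the three tokenizer attributes involved, so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def use_left_padding(tokenizer):
    """Configure a Hugging Face-style tokenizer for batched decoder-only generation.

    Hypothetical helper: it only sets attributes that such tokenizers expose.
    """
    # Pad on the left so each prompt ends right where generation begins.
    tokenizer.padding_side = "left"
    # Assumption: reuse the EOS token for padding only when no pad token is defined.
    if getattr(tokenizer, "pad_token", None) is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in for a real tokenizer, used here only to show the effect.
demo = SimpleNamespace(padding_side="right", pad_token=None, eos_token="<eos>")
use_left_padding(demo)
print(demo.padding_side, demo.pad_token)
```

In the notebook, the same call would presumably go right after the tokenizer is loaded for each checkpoint and before the first batch is tokenized.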

Summary of all checkpoint accuracies saved to /content/drive/MyDrive/info621_models/evaluation_results_gsm8k/all_checkpoints_accuracies_openai_gsm8k.json

--- Final Summary of Accuracies ---
checkpoint-1: Accuracy = 0.2866 (378/1319)
checkpoint-10: Accuracy = 0.2866 (378/1319)
checkpoint-20: Accuracy = 0.2866 (378/1319)
checkpoint-30: Accuracy = 0.2866 (378/1319)
checkpoint-40: Accuracy = 0.2866 (378/1319)
checkpoint-50: Accuracy = 0.2866 (378/1319)
checkpoint-60: Accuracy = 0.2866 (378/1319)
checkpoint-70: Accuracy = 0.2866 (378/1319)
checkpoint-80: Accuracy = 0.2866 (378/1319)
checkpoint-90: Accuracy = 0.2866 (378/1319)
checkpoint-100: Accuracy = 0.2866 (378/1319)
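
The summary above reports the same 378 correct answers out of 1319 for every checkpoint. The log does not say why, so the sketch below only flags the pattern so it can be investigated. It parses lines in the printed format shown above; the names `parse_summary` and `all_identical` are hypothetical and not part of this notebook.

```python
import re

# Matches the printed summary format, e.g.
# "checkpoint-10: Accuracy = 0.2866 (378/1319)"
SUMMARY_LINE = re.compile(r"^(checkpoint-\d+): Accuracy = ([\d.]+) \((\d+)/(\d+)\)$")


def parse_summary(lines):
    """Return {checkpoint_name: (correct, total)} for lines that match the format."""
    results = {}
    for line in lines:
        match = SUMMARY_LINE.match(line.strip())
        if match:
            name, _accuracy, correct, total = match.groups()
            results[name] = (int(correct), int(total))
    return results


def all_identical(results):
    """True when several checkpoints were parsed and all share the same counts."""
    return len(results) > 1 and len(set(results.values())) == 1


summary_lines = [
    "checkpoint-1: Accuracy = 0.2866 (378/1319)",
    "checkpoint-50: Accuracy = 0.2866 (378/1319)",
    "checkpoint-100: Accuracy = 0.2866 (378/1319)",
]
results = parse_summary(summary_lines)
if all_identical(results):
    print("All parsed checkpoints report identical counts; "
          "compare the per-checkpoint prediction CSVs before reading a trend.")
```

If the flag fires, comparing the generated answers across the saved `predictions_..._checkpoint-N.csv` files would show whether the checkpoints produce different text that happens to score the same, or identical text.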
