### Project: Domain Adaptation of Portuguese SLMs via Self-Supervised Fine-Tuning with LoRA
MO436C - Introduction to Self-Supervised Learning (SSRL)

**Team Members:**
- Alejandro Núñez Arroyo. <a href="mailto:a299215@dac.unicamp.br">a299215@dac.unicamp.br</a>  
- Daniel da Costa Nunes Resende Neto. <a href="mailto:d169408@dac.unicamp.br">d169408@dac.unicamp.br</a>  
- José Augusto de Almeida Neto. <a href="mailto:j299218@dac.unicamp.br">j299218@dac.unicamp.br</a>  

*Instituto de Computação (IC), Universidade Estadual de Campinas (UNICAMP)*  
*Campinas, November 2025*

---

#### License

This notebook and its source code are released under the **GNU General Public License v3.0 (GPLv3)**.  
You are free to use, modify, and redistribute this work under the following terms:

> **GNU General Public License v3.0**  
> Copyright © 2025 The Authors listed above  
>
> This program is free software: you can redistribute it and/or modify  
> it under the terms of the GNU General Public License as published by  
> the Free Software Foundation, either version 3 of the License, or  
> (at your option) any later version.  
>
> This program is distributed in the hope that it will be useful,  
> but **without any warranty**; without even the implied warranty of  
> merchantability or fitness for a particular purpose. See the  
> GNU General Public License for more details.  
>
> You should have received a copy of the GNU General Public License  
> along with this program. If not, see  
> [https://www.gnu.org/licenses/gpl-3.0.en.html](https://www.gnu.org/licenses/gpl-3.0.en.html).

---

# Notebook 4: Evaluation

This notebook focuses on the **evaluation phase** of the project, in which all fine-tuned and baseline models are systematically assessed on the **Portuguese MMLU test set**.  
It measures model accuracy under multiple answer-derivation strategies, quantifying reasoning and knowledge retention in a zero-shot setting.

---

**Overview**

The main objectives of this notebook are:

1. **Setup & Imports**  
   Load all core dependencies, configure logging, and define global settings for reproducibility.

2. **Evaluation Framework**  
   - Implement a configurable evaluation pipeline for MMLU-style multiple-choice questions.  
   - Support both **standard models** and **LoRA-adapted checkpoints**.  
   - Include multiple evaluation strategies: greedy decoding, free-form generation, and sequence log-probability scoring.

3. **Models**  
   - Execute and compare evaluations across all variants of the *Gemma-3-1b* family:  

   1. ***Gemma-3-1b-it* (Baseline)**
   2. ***Gemma-3-1b-pt* + LoRA SFT on MMLU**
   3. ***Gemma-3-1b-pt* + Wiki Context + LoRA SFT on MMLU**

4. **Evaluation Runs**
   - All evaluations were conducted on the **held-out MMLU test set (1,038 questions)**,
   following a **zero-shot** evaluation protocol.
   - Accuracy was computed for each of the three inference strategies:

   1. **Greedy Decoding (Single Token)** — direct deterministic prediction.
   2. **Free-Form Generation (Sampling)** — natural language reasoning and response.
   3. **Sequence Log-Probability Scoring** — log-likelihood ranking across all answer options.

**Output Artifacts**  
   - `mmlu_eval.log` — runtime logs for all evaluations.  
   - `results/*.csv` — per-model detailed results with predicted and correct answers.  
   - Printed accuracy summaries by decoding strategy.


## Table of Contents

* [Part 1: Setup & Imports](#part_01)
* [Part 2: Evaluation Framework](#part_02)
  - [2.1 Evaluation Configuration](#21-evaluation-configuration)
  - [2.2 MMLU Evaluator Class](#22-mmlu-evaluator-class)
* [Part 3: Model Evaluation Runs](#part_03)
  - [3.1 Gemma-3-1b-it (Baseline)](#31-gemma-3-1b-it)
  - [3.2 Gemma-3-1b-pt + LoRA SFT on MMLU](#32-gemma-3-1b-pt--lora-sft-on-mmlu)
  - [3.3 Gemma-3-1b-pt + Wiki Context + LoRA SFT on MMLU](#33-gemma-3-1b-pt--wiki-context--lora-sft-on-mmlu)

## 1. Setup & Imports <a id="part_01"></a>


This section loads the main libraries used throughout the notebook.

It also configures the logger for consistent output formatting and defines a seeding helper for reproducibility across experiments.

In [None]:
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Optional

import numpy as np
import pandas as pd
import torch
from loguru import logger
from peft import PeftModel
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# Configure logger
logger.remove()  # Remove default handler
logger.add(
    sink=lambda msg: print(msg, end=""),
    colorize=True,
    format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | "
           "<level>{level: <8}</level> | "
           "<cyan>{name}</cyan>:<cyan>{function}</cyan>:<cyan>{line}</cyan> - "
           "<level>{message}</level>",
)
# Also write INFO-level messages to a rotating log file
logger.add("mmlu_eval.log", rotation="5 MB",
           retention="10 days", level="INFO")

In [None]:
# Set random seed for reproducibility
def set_seed(seed: int) -> None:
    """Seed NumPy and PyTorch (CPU and all CUDA devices) and request deterministic cuDNN kernels."""
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

## 2. Evaluation Framework <a id="part_02"></a>

This section defines the **evaluation architecture** used to benchmark language models on the MMLU dataset.
It includes configuration structures, model loading routines, prompt formatting, and metric computation methods.

### 2.1 Evaluation Configuration <a id="21-evaluation-configuration"></a>

The configuration class centralizes all parameters controlling evaluation behavior — including model paths,
sampling settings, generation parameters, and output file structure.

This modular design ensures repeatability and allows for quick adaptation between model variants.


In [None]:
@dataclass
class EvaluationConfig:
    """Configuration for model evaluation."""

    # Model parameters
    model_path: str
    base_model_path: Optional[str] = None
    lora: bool = False

    # Test data parameters
    test_data_path: str = "https://drive.google.com/uc?export=download&id=1PzCRCQZtIGV6GO4C-oq76DeIMck-t6P-"
    output_dir: str = "results"

    # Sampling configuration
    sample_size: Optional[int] = None
    random_seed: int = 42

    # Generation parameters
    max_new_tokens_short: int = 1   # Short generation for single token - A, B, C, or D
    max_new_tokens_long: int = 32   # Long generation for full answer explanation
    do_sample_short: bool = False   # Use greedy decoding for short answers
    do_sample_long: bool = True     # Use sampling for long answers
    temperature: float = 0.5        # Temperature for sampling

    # System prompt
    system_prompt: str = (
        "Você é um assistente especialista que responde questões de múltipla escolha em português do Brasil.\n"
        "Responda apenas com UMA opção correta (A, B, C ou D).\n"
    )

    @property
    def model_name(self) -> str:
        """Extract model name from path."""
        return f"{Path(self.model_path).parent.name}_{Path(self.model_path).name}"

    @property
    def output_path(self) -> Path:
        """Generate output path for results."""
        output_dir = Path(self.output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        return output_dir / f"{self.model_name}_results.csv"
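
As a quick check of how result files are named, the sketch below mirrors the `model_name` and `output_path` logic outside the dataclass. The helper names `derive_model_name` and `derive_output_path` are hypothetical and exist only for illustration; unlike the property above, this version does not create the output directory.

```python
from pathlib import Path


def derive_model_name(model_path: str) -> str:
    """Join the last two path components with an underscore (illustrative helper)."""
    path = Path(model_path)
    return f"{path.parent.name}_{path.name}"


def derive_output_path(model_path: str, output_dir: str = "results") -> Path:
    """Build the CSV path for a model without touching the filesystem."""
    return Path(output_dir) / f"{derive_model_name(model_path)}_results.csv"


# A Hub identifier and a local checkpoint directory are handled the same way.
print(derive_model_name("google/gemma-3-1b-it"))
print(derive_output_path("models/gemma-3-1b-it"))
# A path with a single component has an empty parent name, so the derived
# name starts with an underscore.
print(derive_model_name("gemma-3-1b-it"))
```

One consequence of this naming rule is that two checkpoints stored under different roots but sharing their last two path components would be written to the same CSV file.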

### 2.2 MMLU Evaluator Class <a id="22-mmlu-evaluator-class"></a>

The `MMLUEvaluator` class encapsulates all evaluation procedures, including:

1. **Model Loading**

   * Supports both regular checkpoints and LoRA-adapted models.
   * Automatically detects device configuration and precision mode.

2. **Prompt Formatting**

   * Standardizes multiple-choice question input format, ensuring alignment between training and evaluation.

3. **Generation Strategies**
   Implements three complementary approaches:

   * **Greedy Decoding (1 token):** measures deterministic accuracy.
   * **Free-Form Sampling:** allows longer, natural completions.
   * **Sequence Log-Probability Scoring:** evaluates likelihood of full answer sequences.

4. **Metrics and Logging**

   * Computes accuracy and distribution statistics per strategy.
   * Saves detailed predictions to CSV and logs summaries for reproducibility.
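
The prompt layout named in item 2 can be sketched with plain strings. This is a standalone function that mirrors the `format_question` method of the evaluator class; the example row is invented for illustration and does not come from the test set.

```python
SYSTEM_PROMPT = (
    "Você é um assistente especialista que responde questões de múltipla escolha em português do Brasil.\n"
    "Responda apenas com UMA opção correta (A, B, C ou D).\n"
)


def format_question(row: dict, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Render one multiple-choice row using the evaluator's prompt layout."""
    # "high_school_geography" becomes "High School Geography"
    subject = row["Subject"].replace("_", " ").title()
    return (
        f"{system_prompt}"
        f"Assunto: {subject}\n\n"
        f"Pergunta: {row['Question']}\n"
        f"A) {row['A']}\n"
        f"B) {row['B']}\n"
        f"C) {row['C']}\n"
        f"D) {row['D']}\n\n"
        "Resposta correta:"
    )


# Invented example row (assumption: same column names as the test CSV).
example_row = {
    "Subject": "high_school_geography",
    "Question": "Qual é a capital do Brasil?",
    "A": "São Paulo",
    "B": "Brasília",
    "C": "Rio de Janeiro",
    "D": "Salvador",
}
prompt = format_question(example_row)
print(prompt)
```

Ending the prompt with `Resposta correta:` is what lets the single-token strategy read the answer letter directly from the first generated token.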

> This framework directly implements the methodology described in the report, evaluating each model
> using three independent answer extraction methods to ensure a comprehensive performance comparison.
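
To make the third strategy concrete, the following sketch ranks options by their summed token log-probabilities. The numbers are invented, and `score_options` and `pick_answer` are hypothetical helpers; the evaluator itself obtains the per-token values from the model's logits. The length-normalised variant at the end is shown only for contrast and is not part of the evaluator.

```python
def score_options(token_logprobs: dict[str, list[float]]) -> dict[str, float]:
    """Sum per-token log-probabilities to get one sequence score per option."""
    return {letter: sum(lps) for letter, lps in token_logprobs.items()}


def pick_answer(scores: dict[str, float]) -> str:
    """Return the option letter with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical per-token log-probabilities for the texts "A) ...", "B) ...", etc.
# Option C is the longest answer, so it accumulates more negative terms.
toy_logprobs = {
    "A": [-1.2, -0.4, -0.9],
    "B": [-0.7, -0.3, -0.5],
    "C": [-0.4, -0.4, -0.4, -0.4, -0.4],
    "D": [-2.0, -1.1],
}

scores = score_options(toy_logprobs)
print("summed:", pick_answer(scores))

# Alternative for contrast: average per token instead of summing.
mean_scores = {letter: scores[letter] / len(lps) for letter, lps in toy_logprobs.items()}
print("mean per token:", pick_answer(mean_scores))
```

With these toy values the summed score prefers the shorter option B, while the per-token mean prefers C. This illustrates that an unnormalised sum tends to favour shorter answer texts.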

In [None]:
class MMLUEvaluator:
    """
    MMLU-style multiple choice question evaluator for language models.

    Handles model loading, inference, answer extraction, and metric calculation.
    """

    def __init__(self, config: EvaluationConfig):
        """
        Initialize the evaluator with configuration.

        Args:
            config: EvaluationConfig instance with model and test parameters
        """
        self.config = config
        set_seed(self.config.random_seed)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = None
        self.tokenizer = None
        self.results_df = None

    # -----------------------------------------------------------------
    # Model loading
    # -----------------------------------------------------------------
    def load_model(self) -> None:
        """Load the model and tokenizer onto the device."""
        logger.info(f"Device: {self.device}")
        logger.info(f"Loading model: {self.config.model_path}")

        self.tokenizer = AutoTokenizer.from_pretrained(self.config.model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            self.config.model_path,
            dtype=torch.bfloat16 if self.device == "cuda" else torch.float32,
            device_map="auto",
        ).eval()

        logger.info("Model loaded successfully")

    def load_model_lora(self) -> None:
        """Load the base model and apply LoRA adapter."""
        logger.info(f"Device: {self.device}")
        logger.info("Loading BASE MODEL + LORA ADAPTER")

        # 1. Load base model
        logger.info(f"Base model: {self.config.base_model_path}")

        self.model = AutoModelForCausalLM.from_pretrained(
            self.config.base_model_path,
            dtype=torch.bfloat16 if self.device == "cuda" else torch.float32,
            device_map="auto",
        )

        # 2. Load LoRA adapter
        logger.info(f"Loading LoRA adapter: {self.config.model_path}")

        # Load tokenizer from LoRA model path
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.config.model_path)

        # Apply LoRA adapter
        self.model = PeftModel.from_pretrained(
            self.model,
            self.config.model_path,
            dtype=torch.bfloat16 if self.device == "cuda" else torch.float32,
            base_model_name_or_path=self.config.base_model_path
        )

        self.model = self.model.eval()
        logger.info(f"Model + LoRA loaded successfully on {self.device}")

    # -----------------------------------------------------------------
    # Prompt formatting
    # -----------------------------------------------------------------
    def format_question(self, row: pd.Series) -> str:
        """
        Format a question row into a prompt.

        Args:
            row: DataFrame row containing question and options

        Returns:
            Formatted prompt string
        """
        subject = row['Subject'].replace('_', ' ').title()
        user_prompt = (
            f"{self.config.system_prompt}"
            f"Assunto: {subject}\n\n"
            f"Pergunta: {row['Question']}\n"
            f"A) {row['A']}\n"
            f"B) {row['B']}\n"
            f"C) {row['C']}\n"
            f"D) {row['D']}\n\n"
            f"Resposta correta:"
        )

        return user_prompt

    # -----------------------------------------------------------------
    # Generation methods
    # -----------------------------------------------------------------
    def generate_response_with_chat_template(self, prompt: str) -> tuple[str, str]:
        """
        Generate a response from the model for a given prompt using chat template.

        Args:
            prompt: The input prompt string

        Returns:
            A tuple containing the generated long and short text responses (continuation only)
        """
        messages = [{"role": "user", "content": prompt}]
        model_input = self.tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt").to(self.device)

        # Generation kwargs
        generation_kwargs_long = {
            "max_new_tokens": self.config.max_new_tokens_long,
            "do_sample": self.config.do_sample_long,
            "temperature": self.config.temperature,
        }

        generation_kwargs_short = {
            "max_new_tokens": self.config.max_new_tokens_short,
            "do_sample": self.config.do_sample_short,
        }

        # Generate long and short outputs
        with torch.no_grad():
            output_long = self.model.generate(
                model_input, **generation_kwargs_long)

        with torch.no_grad():
            output_short = self.model.generate(
                model_input, **generation_kwargs_short)

        # Decode only the continuation beyond the prompt
        out_decoded_long = self.tokenizer.decode(
            output_long[0][model_input.shape[1]:],
            skip_special_tokens=True
        ).strip()

        out_decoded_short = self.tokenizer.decode(
            output_short[0][model_input.shape[1]:],
            skip_special_tokens=True
        ).strip()

        return out_decoded_long, out_decoded_short

    def generate_response_seqlogprob(self, prompt: str, row: pd.Series) -> str:
        """
        Score each answer option by its sequence log-probability given the prompt and return the best one.

        Args:
            prompt: The input prompt string
            row: DataFrame row containing question and options

        Returns:
            The predicted answer letter (A, B, C, or D)
        """
        messages = [{"role": "user", "content": prompt}]
        prefix_ids = self.tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt").to(self.device)

        options = {}
        for letter in "ABCD":
            target_text = f"{letter}) {row[letter]}"
            target_ids = self.tokenizer(
                target_text,
                add_special_tokens=False,
                return_tensors="pt").to(self.device)["input_ids"]

            with torch.no_grad():
                # feed prefix + target[:-1]; the last T logits (T = target length) predict every target token
                inp = torch.cat([prefix_ids, target_ids[:, :-1]], dim=1)
                out = self.model(inp)
                lp = torch.log_softmax(
                    out.logits[:, -target_ids.shape[1]:, :], dim=-1)
                # gather log-probs of the true next tokens
                tgt = target_ids
                token_lp = lp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
                seq_lp = token_lp.sum().item()
            options[letter] = seq_lp

        pred = max(options, key=options.get)
        return pred

    # -----------------------------------------------------------------
    # Evaluation
    # -----------------------------------------------------------------
    @staticmethod
    def extract_answer(text: str) -> Optional[str]:
        """
        Extract the multiple choice answer letter from model output.

        Args:
            text: The model's generated text

        Returns:
            The extracted letter (A-D) in uppercase, or None if not found
        """
        match = re.search(r'\b([a-dA-D])\b', text)
        if match:
            return match.group(1).upper()
        return None

    def evaluate(self) -> pd.DataFrame:
        """
        Run the full evaluation pipeline.

        Returns:
            DataFrame with evaluation results
        """
        if self.model is None:
            raise RuntimeError("Model not loaded. Call load_model() or load_model_lora() first.")

        # Load test data
        logger.info(f"Loading test data from: {self.config.test_data_path}")
        df_test = pd.read_csv(self.config.test_data_path)

        # Sample if configured
        if self.config.sample_size:
            df_test = df_test.sample(
                n=min(self.config.sample_size, len(df_test)),
                random_state=self.config.random_seed
            )

        logger.info(f"Evaluating on {len(df_test)} questions...")

        results = []
        for _, row in tqdm(df_test.iterrows(), total=len(df_test), desc="Answering MMLU"):
            # Generate prompt
            prompt = self.format_question(row)

            # Generate responses
            out_chat_long, out_chat_short = self.generate_response_with_chat_template(
                prompt)
            out_seqlogprob = self.generate_response_seqlogprob(prompt, row)

            results.append({
                "subject": row["Subject"],
                "question": row["Question"],
                "A": row["A"],
                "B": row["B"],
                "C": row["C"],
                "D": row["D"],
                "correct": row["Answer"],
                "out_chat_long": out_chat_long,
                "out_extracted_chat_long": self.extract_answer(out_chat_long),
                "out_chat_short": out_chat_short,
                "out_seqlogprob": out_seqlogprob,
            })

        # Process results
        self.results_df = pd.DataFrame(results)

        return self.results_df

    # -----------------------------------------------------------------
    # Results and metrics
    # -----------------------------------------------------------------
    def save_results(self, output_path: Optional[Path] = None) -> None:
        """
        Save evaluation results to CSV.

        Args:
            output_path: Optional custom output path. Uses config default if None.
        """
        if self.results_df is None:
            raise RuntimeError("No results to save. Run evaluate() first.")

        path = output_path or self.config.output_path
        self.results_df.to_csv(path, index=False)
        logger.info(f"Results saved to: {path}")

    def _compute_accuracy(self, col: str) -> Dict[str, float]:
        """
        Compute accuracy for a given output column.

        Args:
            col: Column name in results_df to evaluate

        Returns:
            Dictionary with accuracy and other metrics
        """
        df = self.results_df
        correct = (df[col].str.upper() == df["correct"].str.upper()).sum()
        total = len(df)
        empty = df[col].isna().sum()
        return {
            "accuracy": 100 * correct / total,
            "correct": correct,
            "total": total,
            "empty": empty,
            "distribution": df[col].value_counts().to_dict(),
        }

    def calculate_metrics(self) -> Dict[str, Dict[str, float]]:
        """
        Calculate evaluation metrics.

        Returns:
            Dictionary with metrics for each evaluation mode
        """
        if self.results_df is None:
            raise RuntimeError("No results to compute metrics on.")
        return {
            "with_chat_long_answer": self._compute_accuracy("out_extracted_chat_long"),
            "with_chat_short_answer": self._compute_accuracy("out_chat_short"),
            "seqlogprob": self._compute_accuracy("out_seqlogprob"),
        }

    def print_metrics(self) -> None:
        """Print evaluation metrics in a formatted way."""
        metrics = self.calculate_metrics()
        for mode, data in metrics.items():
            logger.info(f"\n{'='*50}\nRESULTS ({mode})\n{'='*50}")
            logger.info(f"Accuracy: {data['accuracy']:.2f}% "
                        f"({data['correct']}/{data['total']})")
            logger.info(f"Empty answers: {data['empty']}")
            logger.info(f"Distribution: {data['distribution']}")

    def run(self) -> dict:
        """
        Convenience method to run the complete evaluation pipeline.

        Returns:
            Dictionary with evaluation metrics
        """
        logger.info(
            f"\n{'='*50}\nEVALUATION ({self.config.model_name})\n{'='*50}")
        if self.config.lora:
            self.load_model_lora()
        else:
            self.load_model()
        self.evaluate()
        self.save_results()
        self.print_metrics()

        return self.calculate_metrics()
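
The behaviour of the answer-extraction rule on free-form Portuguese output can be checked in isolation. The sketch below copies the regular expression from `extract_answer` and compares it with `extract_answer_strict`, a hypothetical stricter rule that is not part of the evaluator. The sample strings are invented.

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    """Same rule as the evaluator: first standalone letter A-D, case-insensitive."""
    match = re.search(r"\b([a-dA-D])\b", text)
    return match.group(1).upper() if match else None


def extract_answer_strict(text: str) -> Optional[str]:
    """Hypothetical alternative: a leading letter followed by ')', '.', ':' or end of text."""
    match = re.match(r"\s*([a-dA-D])\s*(?:[).:]|$)", text)
    return match.group(1).upper() if match else None


# Invented model outputs used only to exercise both rules.
samples = ["B) Brasília", "A resposta correta é B", "C", "The answer is unclear"]
for text in samples:
    print(repr(text), extract_answer(text), extract_answer_strict(text))
```

Because the Portuguese article "a" is itself a standalone letter, the evaluator's rule reads "A resposta correta é B" as answer A. The stricter rule returns `None` for that sentence, trading some recall for fewer false matches.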


## 3. Model Evaluation Runs <a id="part_03"></a>

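This section runs the evaluation pipeline on three *Gemma-3-1b* configurations.
Each run loads the corresponding model (with or without a LoRA adapter), evaluates it on the
*MMLU Portuguese test set*, and saves the results to the `results/` directory.
The logs reproduced after each cell show local model and data paths (for example `data/mmlu_test.csv`), which differ from the identifiers used in the cells.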

### 3.1 *Gemma-3-1b-it* (Baseline) <a id="31-gemma-3-1b-it"></a>

This experiment evaluates the **instruction-tuned baseline model** (`gemma-3-1b-it`) without any additional fine-tuning.
It serves as a reference point to assess the impact of domain-specific adaptation.


> ⚠️ **Warning:** The next Python cell is computationally expensive.  
> Run it on a dedicated or cloud machine, ideally with a CUDA GPU, rather than on a standard desktop or laptop.  
> The logged run below used a CUDA device and took about 12 minutes to evaluate the 1,038 questions.


In [None]:
config = EvaluationConfig(
    model_path="google/gemma-3-1b-it",
    lora=False
)

evaluator = MMLUEvaluator(config)
evaluator.run()

```log
2025-11-25 04:16:41.833 | INFO     | __main__:run:435 - 
==================================================
EVALUATION (models_gemma-3-1b-it)
==================================================
2025-11-25 04:16:41.834 | INFO     | __main__:load_model:129 - Device: cuda
2025-11-25 04:16:41.834 | INFO     | __main__:load_model:130 - Loading model: models/gemma-3-1b-it
2025-11-25 04:16:44.174 | INFO     | __main__:load_model:139 - Model loaded successfully
2025-11-25 04:16:44.174 | INFO     | __main__:evaluate:323 - Loading test data from: data/mmlu_test.csv
2025-11-25 04:16:44.183 | INFO     | __main__:evaluate:333 - Evaluating on 1038 questions...
2025-11-25 04:28:38.244 | INFO     | __main__:save_results:379 - Results saved to: results/models_gemma-3-1b-it_results.csv
2025-11-25 04:28:38.248 | INFO     | __main__:print_metrics:422 - 
==================================================
RESULTS (with_chat_long_answer)
==================================================
2025-11-25 04:28:38.249 | INFO     | __main__:print_metrics:423 - Accuracy: 30.64% (318/1038)
2025-11-25 04:28:38.249 | INFO     | __main__:print_metrics:425 - Empty answers: 0
2025-11-25 04:28:38.249 | INFO     | __main__:print_metrics:426 - Distribution: {'C': 633, 'A': 200, 'B': 110, 'D': 95}
2025-11-25 04:28:38.249 | INFO     | __main__:print_metrics:422 - 
==================================================
RESULTS (with_chat_short_answer)
==================================================
2025-11-25 04:28:38.249 | INFO     | __main__:print_metrics:423 - Accuracy: 30.64% (318/1038)
2025-11-25 04:28:38.249 | INFO     | __main__:print_metrics:425 - Empty answers: 0
2025-11-25 04:28:38.249 | INFO     | __main__:print_metrics:426 - Distribution: {'C': 636, 'A': 197, 'B': 109, 'D': 93, 'The': 3}
2025-11-25 04:28:38.249 | INFO     | __main__:print_metrics:422 - 
==================================================
RESULTS (seqlogprob)
==================================================
2025-11-25 04:28:38.249 | INFO     | __main__:print_metrics:423 - Accuracy: 30.35% (315/1038)
2025-11-25 04:28:38.249 | INFO     | __main__:print_metrics:425 - Empty answers: 0
2025-11-25 04:28:38.250 | INFO     | __main__:print_metrics:426 - Distribution: {'C': 603, 'A': 200, 'B': 141, 'D': 94}

```
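
The accuracy and prediction distribution reported in the log can be re-derived from the raw counts. The sketch below uses the numbers from the `with_chat_long_answer` block; the variable names are illustrative.

```python
# Counts copied from the with_chat_long_answer block of the baseline log.
distribution = {"C": 633, "A": 200, "B": 110, "D": 95}
correct, total = 318, 1038

# Accuracy as a percentage, matching the evaluator's formula.
accuracy = 100 * correct / total

# Share of all predictions assigned to each option letter.
shares = {letter: 100 * count / total for letter, count in distribution.items()}

print(f"accuracy: {accuracy:.2f}%")
for letter, share in sorted(shares.items()):
    print(f"{letter}: {share:.1f}% of predictions")
```

The shares show that the baseline selects option C for roughly 61% of the questions. The log alone does not reveal whether this reflects the answer-key distribution or a preference for that option.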

### 3.2 *Gemma-3-1b-pt* + LoRA SFT on MMLU <a id="32-gemma-3-1b-pt--lora-sft-on-mmlu"></a>

This configuration evaluates the pre-trained base model `gemma-3-1b-pt` after **LoRA-based supervised fine-tuning (SFT)**
on the MMLU dataset.
It measures whether task-specific adaptation improves factual and reasoning performance.

> ⚠️ **Warning:** The next Python cell is computationally expensive.  
> Run it on a dedicated or cloud machine, ideally with a CUDA GPU, rather than on a standard desktop or laptop.  
> The logged run below used a CUDA device and took about 67 minutes to evaluate the 1,038 questions.


In [None]:
config = EvaluationConfig(
    model_path="models/gemma-3-1b-pt-sft",
    base_model_path="google/gemma-3-1b-pt",
    lora=True
)

evaluator = MMLUEvaluator(config)
evaluator.run()

```log
2025-11-25 04:28:38.312 | INFO     | __main__:run:435 - 
==================================================
EVALUATION (gemma-3-1b-pt-sft105_best_eval)
==================================================
2025-11-25 04:28:38.312 | INFO     | __main__:load_model_lora:143 - Device: cuda
2025-11-25 04:28:38.312 | INFO     | __main__:load_model_lora:144 - Loading BASE MODEL + LORA ADAPTER
2025-11-25 04:28:38.312 | INFO     | __main__:load_model_lora:147 - Base model: models/gemma-3-1b-pt
2025-11-25 04:28:38.892 | INFO     | __main__:load_model_lora:156 - Loading LoRA adapter: old/models/gemma-3-1b-pt-sft105/best_eval
2025-11-25 04:28:43.528 | INFO     | __main__:load_model_lora:171 - Model + LoRA loaded successfully on cuda
2025-11-25 04:28:43.528 | INFO     | __main__:evaluate:323 - Loading test data from: data/mmlu_test.csv
2025-11-25 04:28:43.537 | INFO     | __main__:evaluate:333 - Evaluating on 1038 questions...
2025-11-25 05:35:18.616 | INFO     | __main__:save_results:379 - Results saved to: results/gemma-3-1b-pt-sft105_best_eval_results.csv
2025-11-25 05:35:18.619 | INFO     | __main__:print_metrics:422 - 
==================================================
RESULTS (with_chat_long_answer)
==================================================
2025-11-25 05:35:18.619 | INFO     | __main__:print_metrics:423 - Accuracy: 24.95% (259/1038)
2025-11-25 05:35:18.620 | INFO     | __main__:print_metrics:425 - Empty answers: 0
2025-11-25 05:35:18.620 | INFO     | __main__:print_metrics:426 - Distribution: {'C': 301, 'B': 258, 'A': 246, 'D': 233}
2025-11-25 05:35:18.620 | INFO     | __main__:print_metrics:422 - 
==================================================
RESULTS (with_chat_short_answer)
==================================================
2025-11-25 05:35:18.620 | INFO     | __main__:print_metrics:423 - Accuracy: 27.94% (290/1038)
2025-11-25 05:35:18.620 | INFO     | __main__:print_metrics:425 - Empty answers: 0
2025-11-25 05:35:18.620 | INFO     | __main__:print_metrics:426 - Distribution: {'C': 435, 'A': 277, 'D': 207, 'B': 116, 'Você': 3}
2025-11-25 05:35:18.621 | INFO     | __main__:print_metrics:422 - 
==================================================
RESULTS (seqlogprob)
==================================================
2025-11-25 05:35:18.621 | INFO     | __main__:print_metrics:423 - Accuracy: 25.63% (266/1038)
2025-11-25 05:35:18.621 | INFO     | __main__:print_metrics:425 - Empty answers: 0
2025-11-25 05:35:18.621 | INFO     | __main__:print_metrics:426 - Distribution: {'C': 342, 'B': 248, 'A': 230, 'D': 218}
```

### 3.3 *Gemma-3-1b-pt* + Wiki Context + LoRA SFT on MMLU <a id="33-gemma-3-1b-pt--wiki-context--lora-sft-on-mmlu"></a>

In this final evaluation, the model integrates **contextual knowledge** from the curated
Wikipedia subset (*Law, Governance, and Ethics*) into its LoRA fine-tuning process.
This variant assesses whether combining factual domain data with reasoning QA pairs
enhances zero-shot performance on MMLU.

> ⚠️ **Warning:** The next Python cell is computationally expensive.  
> Run it on a dedicated or cloud machine, ideally with a CUDA GPU, rather than on a standard desktop or laptop.  
> The logged run below used a CUDA device and took about 78 minutes to evaluate the 1,038 questions.


In [None]:
config = EvaluationConfig(
    model_path="AleNunezArroyo/gemma-3-1b-pt-contextual-e1-ckpt1600",
    base_model_path="google/gemma-3-1b-pt",
    lora=True
)

evaluator = MMLUEvaluator(config)
evaluator.run()

```log
2025-11-25 08:12:24.251 | INFO     | __main__:run:435 - 
==================================================
EVALUATION (gemma-3-1b-pt-contextual-e1-ckpt1600-sft115_best_eval)
==================================================
2025-11-25 08:12:24.252 | INFO     | __main__:load_model_lora:143 - Device: cuda
2025-11-25 08:12:24.252 | INFO     | __main__:load_model_lora:144 - Loading BASE MODEL + LORA ADAPTER
2025-11-25 08:12:24.252 | INFO     | __main__:load_model_lora:147 - Base model: models/gemma-3-1b-pt
2025-11-25 08:12:24.930 | INFO     | __main__:load_model_lora:156 - Loading LoRA adapter: models/gemma-3-1b-pt-contextual-e1-ckpt1600-sft115/best_eval
2025-11-25 08:12:26.649 | INFO     | __main__:load_model_lora:171 - Model + LoRA loaded successfully on cuda
2025-11-25 08:12:26.649 | INFO     | __main__:evaluate:323 - Loading test data from: data/mmlu_test.csv
2025-11-25 08:12:26.659 | INFO     | __main__:evaluate:333 - Evaluating on 1038 questions...
2025-11-25 09:30:42.335 | INFO     | __main__:save_results:379 - Results saved to: results/gemma-3-1b-pt-contextual-e1-ckpt1600-sft115_best_eval_results.csv
2025-11-25 09:30:42.338 | INFO     | __main__:print_metrics:422 - 
==================================================
RESULTS (with_chat_long_answer)
==================================================
2025-11-25 09:30:42.339 | INFO     | __main__:print_metrics:423 - Accuracy: 26.30% (273/1038)
2025-11-25 09:30:42.339 | INFO     | __main__:print_metrics:425 - Empty answers: 0
2025-11-25 09:30:42.339 | INFO     | __main__:print_metrics:426 - Distribution: {'B': 327, 'D': 288, 'C': 264, 'A': 159}
2025-11-25 09:30:42.339 | INFO     | __main__:print_metrics:422 - 
==================================================
RESULTS (with_chat_short_answer)
==================================================
2025-11-25 09:30:42.339 | INFO     | __main__:print_metrics:423 - Accuracy: 26.01% (270/1038)
2025-11-25 09:30:42.339 | INFO     | __main__:print_metrics:425 - Empty answers: 0
2025-11-25 09:30:42.339 | INFO     | __main__:print_metrics:426 - Distribution: {'B': 433, 'D': 322, 'C': 276, 'A': 7}
2025-11-25 09:30:42.339 | INFO     | __main__:print_metrics:422 - 
==================================================
RESULTS (seqlogprob)
==================================================
2025-11-25 09:30:42.339 | INFO     | __main__:print_metrics:423 - Accuracy: 25.92% (269/1038)
2025-11-25 09:30:42.340 | INFO     | __main__:print_metrics:425 - Empty answers: 0
2025-11-25 09:30:42.340 | INFO     | __main__:print_metrics:426 - Distribution: {'B': 526, 'D': 340, 'C': 165, 'A': 7}
```
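
For a side-by-side view, the correct-answer counts from the three logs in this section can be collected into one table. This is a small sketch that only re-uses the logged counts; the dictionary keys are shorthand labels chosen for illustration.

```python
TOTAL = 1038  # questions in the held-out test set

# Correct-answer counts copied from the three logs in this section.
correct_counts = {
    "gemma-3-1b-it (baseline)": {"long": 318, "short": 318, "seqlogprob": 315},
    "gemma-3-1b-pt + LoRA SFT": {"long": 259, "short": 290, "seqlogprob": 266},
    "gemma-3-1b-pt + Wiki + LoRA SFT": {"long": 273, "short": 270, "seqlogprob": 269},
}

# Convert counts to accuracy percentages, rounded as in the logs.
accuracy = {
    model: {mode: round(100 * n / TOTAL, 2) for mode, n in modes.items()}
    for model, modes in correct_counts.items()
}

print(f"{'model':<34}{'long':>8}{'short':>8}{'seqlogprob':>12}")
for model, modes in accuracy.items():
    print(f"{model:<34}{modes['long']:>8.2f}{modes['short']:>8.2f}{modes['seqlogprob']:>12.2f}")
```

In these logged runs the instruction-tuned baseline scores highest under all three strategies, while both fine-tuned variants stay within a few points of the 25% expected from uniform guessing among four options.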