### Project: Domain Adaptation of Portuguese SLMs via Self-Supervised Fine-Tuning with LoRA
MO436C - Introduction to Self-Supervised Learning (SSRL)

**Team Members:**
- Alejandro Núñez Arroyo. <a href="mailto:a299215@dac.unicamp.br">a299215@dac.unicamp.br</a>  
- Daniel da Costa Nunes Resende Neto. <a href="mailto:d169408@dac.unicamp.br">d169408@dac.unicamp.br</a>  
- José Augusto de Almeida Neto. <a href="mailto:j299218@dac.unicamp.br">j299218@dac.unicamp.br</a>  

*Instituto de Computação (IC), Universidade Estadual de Campinas (UNICAMP)*  
*Campinas, November 2025*

---

#### License

This notebook and its source code are released under the **GNU General Public License v3.0 (GPLv3)**.  
You are free to use, modify, and redistribute this work under the following terms:

> **GNU General Public License v3.0**  
> Copyright © 2025 The Authors listed above  
>
> This program is free software: you can redistribute it and/or modify  
> it under the terms of the GNU General Public License as published by  
> the Free Software Foundation, either version 3 of the License, or  
> (at your option) any later version.  
>
> This program is distributed in the hope that it will be useful,  
> but **without any warranty**; without even the implied warranty of  
> merchantability or fitness for a particular purpose. See the  
> GNU General Public License for more details.  
>
> You should have received a copy of the GNU General Public License  
> along with this program. If not, see  
> [https://www.gnu.org/licenses/gpl-3.0.en.html](https://www.gnu.org/licenses/gpl-3.0.en.html).

---

# Notebook 3b: Supervised Fine-Tuning (Instruction Tuning on Wiki Context model)

This notebook documents the **supervised fine-tuning (SFT)** stage of the experimental pipeline,  
where the **contextually pre-trained model** `gemma-3-1b-pt-contextual-e1-ckpt1600` — obtained after continued self-supervised training on the Portuguese Wikipedia subset (*Law, Governance, and Ethics*) —  
is adapted to the **MMLU multiple-choice question answering task** using **Low-Rank Adaptation (LoRA)**.  

This corresponds to the **Context-Adapted + Instruction-Tuned Model** described in the experimental report —  
representing the *third* model variant, which combines **domain adaptation** (from the SSRL stage)  
with **task-specific supervised fine-tuning** on MMLU.

---

**Overview**

The main objectives of this notebook are:

1. **Setup & Environment Initialization**  
   Load all necessary dependencies, configure GPU acceleration, and prepare model and dataset paths.

2. **Data Preparation**  
   - Load and explore the *MMLU Portuguese training dataset* (`mmlu_train.csv`).  
   - Perform stratified splits into train and validation subsets.  
   - Convert the data into **instruction-tuning format** following the *Gemma 3 chat schema*.

3. **LoRA Configuration & SFT Setup**  
   - Define the LoRA adaptation modules and parameters (`r`, `alpha`, `dropout`).  
   - Configure the SFT trainer and optimization strategy using the **Hugging Face TRL** framework.  
   - Run pre-training sanity checks to validate dataset tokenization, masking, and loss targets.

4. **Model Training & Saving**  
   - Execute the full supervised fine-tuning routine.  
   - Log metrics such as training loss, token accuracy, and gradient norms.  
   - Save the resulting adapter weights and tokenizer for downstream evaluation.

**Output Artifacts**
   - `gemma-3-1b-pt-contextual-e1-ckpt1600-sft/` — directory containing the new **instruction LoRA adapters** trained on top of the contextual model.  
   - `trainer_state.json` — logs of loss, accuracy, and gradient statistics.  
   - Model checkpoints saved every 50 steps.

---


## Summary

* [Part 1: Setup & Imports](#1-setup--imports)
  - [1.1 Load Data](#11-load-data)
  - [1.2 Load Model](#12-load-model)
* [Part 2: Data](#2-data)
  - [2.1 Prepare Data](#21-prepare-data)
  - [2.2 Convert to Instruction-Tuning Format](#22-convert-to-instruction-tuning-format)
* [Part 3: LoRA Instruction-Tuning](#3-lora-instruction-tuning)
  - [3.1 LoRA Configuration](#31-lora-configuration)
  - [3.2 Training Setup](#32-training-setup)
  - [3.3 Sanity Checks](#33-sanity-checks)
  - [3.4 Training](#34-training)

## 1. Setup & Imports
This section initializes the working environment, importing all required libraries for dataset handling,  
model loading, fine-tuning, and monitoring. It also verifies CUDA availability and frees GPU memory.


In [1]:
import gc
from pathlib import Path

import pandas as pd
import torch
from datasets import Dataset
from peft import LoraConfig, PeftModel
from sklearn.model_selection import train_test_split
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

gc.collect()
torch.cuda.empty_cache()
print("Garbage collection done and CUDA cache emptied.")

PyTorch version: 2.9.0+cu128
CUDA available: True
GPU: NVIDIA GeForce RTX 4070 Laptop GPU
Garbage collection done and CUDA cache emptied.


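After confirming system readiness, the next cell defines the **random seed** used for training and three model paths:

* the **original base model** (`gemma-3-1b-pt`),
* the **contextual adapter** (`gemma-3-1b-pt-contextual-e1-ckpt1600`),
  which will be **merged** into the base model before fine-tuning, and
* the **instruction-tuned model** (`gemma-3-1b-it`), used only as the source of the chat template.

Note that `BASE_MODEL_ID` points to the contextual adapter, while `BASE_MODEL_ORIGINAL_ID` points to the original base weights.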

In [None]:
# Define constants
RANDOM_SEED = 27

# Model paths
BASE_MODEL_ID = "../models/gemma-3-1b-pt-contextual-e1-ckpt1600"
BASE_MODEL_ORIGINAL_ID = "../models/gemma-3-1b-pt"
BASE_MODEL_FOR_CHAT_TEMPLATE = "../models/gemma-3-1b-it"

# Directories and file paths
TRAIN_PATH = Path("../../data/mmlu_train.csv")
OUTPUT_MODEL_DIR = Path(f"{BASE_MODEL_ID}-sft")

### 1.1 Load Data <a id="11-load-data"></a>

Load the **Portuguese MMLU training dataset** used for instruction fine-tuning.
Basic exploratory statistics verify subject balance and label distribution.

In [3]:
# Load training data
df_train = pd.read_csv(TRAIN_PATH)

# Explore data
print("=" * 60)
print(f"TRAIN SET OVERVIEW - Total: {len(df_train):,} samples")
print("=" * 60)

# Subject and Answer distributions
print("\n" + "=" * 60)
subject_counts = df_train['Subject'].value_counts()
print(f"SUBJECT DISTRIBUTION - Total subjects: {len(subject_counts)}")
print("=" * 60)
print(subject_counts)
print("\n" + "=" * 60)
answer_counts = df_train['Answer'].value_counts()
print(f"ANSWER DISTRIBUTION - Total answers: {len(answer_counts)}")
print("=" * 60)
print(answer_counts)
print("- With this distribution, there's no strong statistical incentive to always answer a specific letter.")

# Check for duplicates
duplicates = df_train.duplicated(subset=['Question']).sum()
print(f"\nDuplicate questions in train: {duplicates}")

TRAIN SET OVERVIEW - Total: 2,419 samples

SUBJECT DISTRIBUTION - Total subjects: 7
Subject
professional_law     1073
moral_scenarios       626
moral_disputes        242
philosophy            218
logical_fallacies     114
jurisprudence          76
business_ethics        70
Name: count, dtype: int64

ANSWER DISTRIBUTION - Total answers: 4
Answer
C    639
B    610
D    590
A    580
Name: count, dtype: int64
- With this distribution, there's no strong statistical incentive to always answer a specific letter.

Duplicate questions in train: 0


#### Dataset Summary

* **Total Samples:** 2,419 unique question–answer pairs.
* **Main Domains:** Law, Ethics, and Philosophy — mirroring the project’s focus on
  *Law, Governance, and Ethics* macrodomain.
* **Duplicate Check:** 0 duplicates found.

| Subject           | Count |
| :---------------- | :---- |
| professional_law  | 1,073 |
| moral_scenarios   | 626   |
| moral_disputes    | 242   |
| philosophy        | 218   |
| logical_fallacies | 114   |
| jurisprudence     | 76    |
| business_ethics   | 70    |

Answer Distribution

| Answer | Count |
| :----- | :---- |
| C      | 639   |
| B      | 610   |
| D      | 590   |
| A      | 580   |

With answers spread almost evenly across the four options (roughly 24–26% each), always predicting the most frequent letter would score only about 26%,
so accuracy gains must come from **semantic understanding**, not frequency bias.
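
As a rough illustration of what this balance implies, the majority-class baseline can be computed directly from the counts printed above. The helper name `majority_baseline` is hypothetical and not part of the notebook.

```python
# Answer counts copied from the cell output above.
answer_counts = {"C": 639, "B": 610, "D": 590, "A": 580}


def majority_baseline(counts):
    """Return the most frequent label and the accuracy of always predicting it."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


label, acc = majority_baseline(answer_counts)
print(f"Always answering '{label}' yields {acc:.1%} accuracy")
```

Any fine-tuned model should clear this baseline by a clear margin before its accuracy is read as evidence of learning.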

### 1.2 Load Model <a id="12-load-model"></a>

For this experiment, we must **merge the contextual adapter** (obtained from SSRL on Wikipedia PT-BR)
into the original `gemma-3-1b-pt` base weights before starting SFT.

The merge-and-unload process permanently fuses the domain-adapted LoRA into the model,
yielding a **single contextualized transformer** ready for new instruction adapters.
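
To make the idea of merging concrete, the following is a minimal sketch of the arithmetic behind folding a LoRA update into a frozen weight, using toy nested lists instead of real tensors. All values and helper names are made up for illustration; it is not the PEFT implementation. The scaling rule is an assumption based on the usual definitions: `alpha / r` for standard LoRA and `alpha / sqrt(r)` for rank-stabilized LoRA, which is relevant because the contextual adapter's config printed later shows `r=128`, `lora_alpha=32`, and `use_rslora=True`.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_scale(alpha, r, use_rslora=False):
    """Standard LoRA scales by alpha / r; rank-stabilized LoRA by alpha / sqrt(r)."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


def merge_lora(W, A, B, scale):
    """Return W + scale * (B @ A), the weight a merged layer would use."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy 2x2 layer with a rank-1 adapter (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]   # shape (r, in_features)
B = [[1.0], [2.0]]  # shape (out_features, r)
scale = lora_scale(alpha=2.0, r=1)

W_merged = merge_lora(W, A, B, scale)
x = [[3.0], [4.0]]  # column-vector input

# Adapter kept separate: W x + scale * B (A x)
separate = [[wx[0] + scale * bax[0]]
            for wx, bax in zip(matmul(W, x), matmul(B, matmul(A, x)))]
merged_out = matmul(W_merged, x)
print(merged_out, separate)
```

Both paths give the same output, which is why a merged model can be treated as a plain transformer and wrapped with a fresh adapter.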


In [5]:
def load_model_sequential(base_model_id, adapter_path):
    """
    1. Loads Base Model (Gemma-3-1b-pt)
    2. Loads Contextual Adapter
    3. Merges Adapter into Base
    4. Unloads Adapter to free resources
    5. Returns a 'clean' merged model ready for a NEW LoRA
    """
    print(f"--- STRATEGY: MERGE & RE-LORA ---")
    print(f"1. Loading Base: {base_model_id}")
    
    # Load Base
    model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        dtype=torch.bfloat16, # 40-series GPUs love bfloat16
        device_map="auto",
        low_cpu_mem_usage=True
    )
    
    print(f"2. Loading Context Adapter: {adapter_path}")
    # Load the Contextual Adapter
    model = PeftModel.from_pretrained(model, adapter_path)
    
    print("3. Merging Context weights into Base Model...")
    # THE CRITICAL STEP: Merge weights permanently
    model = model.merge_and_unload()
    
    # Verify we are back to a standard model architecture
    print(f"Model Type after merge: {type(model)}")
    
    # Load Tokenizer (Usually from the base is fine, unless you added tokens)
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id
    
    return tokenizer, model

In [None]:
# Load model
tokenizer, model = load_model_sequential(BASE_MODEL_ORIGINAL_ID, BASE_MODEL_ID)

--- STRATEGY: MERGE & RE-LORA ---
1. Loading Base: ../../models/gemma-3-1b-pt
2. Loading Context Adapter: ../../models/gemma-3-1b-pt-contextual-e1-ckpt1600




3. Merging Context weights into Base Model...


```python
from transformers import AutoModelForCausalLM

# Load original tied model
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", tie_word_embeddings=False)

# Set the randomly initialized lm_head to the previously tied embeddings
model.lm_head.weight.data = model.model.embed_tokens.weight.data.clone()

# Save the untied model
untied_model_dir = "dir/for/untied/model"
model.save_pretrained(untied_model_dir)
model.config.save_pretrained(untied_model_dir)

# Now use the original model but in untied format
model = AutoModelForCausalLM.from_pretrained(untied_model_dir)
```



Model Type after merge: <class 'transformers.models.gemma3.modeling_gemma3.Gemma3ForCausalLM'>


In [7]:
# Verify model loading and print some information
print("=" * 60)
print("MODEL INFORMATION")
print("=" * 60)
print(f"Model type: {type(model)}")
print(f"Model device: {next(model.parameters()).device}")
print(f"Model dtype: {next(model.parameters()).dtype}")
print(f"Tokenizer pad token: {tokenizer.pad_token}")
print(f"Model config pad token id: {model.config.pad_token_id}")

# Check whether the model still exposes a PEFT config.
# Note: as the output below shows, a `peft_config` attribute can remain after
# `merge_and_unload()`, so `hasattr` alone cannot distinguish a merged model
# from one with active adapters.
if hasattr(model, 'peft_config'):
    print("Model has existing PEFT adapters")
    print(f"PEFT config: {model.peft_config}")
else:
    print("Model does not have existing PEFT adapters")

print("=" * 60)

MODEL INFORMATION
Model type: <class 'transformers.models.gemma3.modeling_gemma3.Gemma3ForCausalLM'>
Model device: cuda:0
Model dtype: torch.bfloat16
Tokenizer pad token: <pad>
Model config pad token id: 0
Model has existing PEFT adapters
PEFT config: {'default': LoraConfig(task_type='CAUSAL_LM', peft_type=<PeftType.LORA: 'LORA'>, auto_mapping={'base_model_class': 'Gemma3ForCausalLM', 'parent_library': 'transformers.models.gemma3.modeling_gemma3', 'unsloth_fixed': True}, base_model_name_or_path='/mnt/shared_models/_home/aarroyo/models/gemma-3-1b-pt', revision=None, inference_mode=True, r=128, target_modules={'embed_tokens', 'down_proj', 'gate_proj', 'k_proj', 'lm_head', 'o_proj', 'q_proj', 'up_proj', 'v_proj'}, exclude_modules=None, lora_alpha=32, lora_dropout=0, fan_in_fan_out=False, bias='none', use_rslora=True, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.

**Model Summary**

* **Architecture:** Gemma 3 1B pre-trained checkpoint (`gemma-3-1b-pt`), contextually adapted on Portuguese Wikipedia
* **Precision:** bfloat16
* **Adapter Layers:** Merged (Wiki context permanently integrated)
* **Tokenizer:** Pad token is `<pad>` (id 0); the `<eos>` fallback in `load_model_sequential` was not needed

## 2. Data <a id="part_02"></a>

This stage prepares the dataset for supervised fine-tuning by splitting,
formatting, and converting it into the **Gemma-compatible chat format**.

### 2.1 Prepare Data <a id="21-prepare-data"></a>

A **stratified 90/10 split** preserves subject balance between training and validation sets.

| Subset     | Samples |
| :--------- | :------ |
| Train      | 2,177   |
| Validation | 242     |

This structure ensures each subject domain contributes proportionally during training and evaluation.

In [8]:
# Split dataset into train and validation sets
train_df, val_df = train_test_split(
    df_train,
    test_size=0.1,
    random_state=42,
    stratify=df_train['Subject']
)

print(f"Train: {len(train_df)}  |  Val: {len(val_df)}")

subject_counts = train_df['Subject'].value_counts()
print(f"\nTrain set subject distribution:")
print(subject_counts)
subject_counts = val_df['Subject'].value_counts()
print(f"\nValidation set subject distribution:")
print(subject_counts)

Train: 2177  |  Val: 242

Train set subject distribution:
Subject
professional_law     966
moral_scenarios      563
moral_disputes       218
philosophy           196
logical_fallacies    103
jurisprudence         68
business_ethics       63
Name: count, dtype: int64

Validation set subject distribution:
Subject
professional_law     107
moral_scenarios       63
moral_disputes        24
philosophy            22
logical_fallacies     11
jurisprudence          8
business_ethics        7
Name: count, dtype: int64
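
The stratification can be checked by hand from the two distributions printed above. The sketch below copies the full-dataset and validation counts from the cell outputs and computes the validation share per subject; the variable names are illustrative only.

```python
# Counts copied from the cell outputs in Sections 1.1 and 2.1.
full_counts = {
    "professional_law": 1073, "moral_scenarios": 626, "moral_disputes": 242,
    "philosophy": 218, "logical_fallacies": 114, "jurisprudence": 76,
    "business_ethics": 70,
}
val_counts = {
    "professional_law": 107, "moral_scenarios": 63, "moral_disputes": 24,
    "philosophy": 22, "logical_fallacies": 11, "jurisprudence": 8,
    "business_ethics": 7,
}

# Share of each subject that ended up in the validation set.
val_share = {s: val_counts[s] / full_counts[s] for s in full_counts}
for subject, share in val_share.items():
    print(f"{subject:<18} {share:.1%}")
```

Every subject lands close to the 10% target, with small deviations for the smallest subjects where rounding to whole samples matters most.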


### 2.2 Convert to Instruction-Tuning Format <a id="22-convert-to-instruction-tuning-format"></a>

The dataset is reformatted into conversational structure following the *Gemma 3 chat schema*.
Each entry becomes a **(user, assistant)** message pair, where:

* The **user prompt** includes system instructions and the multiple-choice question.
* The **assistant response** provides the correct letter and answer text.

In [9]:
def build_user_prompt(row: pd.Series) -> str:
    """
    Build the user prompt for a given row in the dataset.
    """

    # System instruction
    system_instruction = (
        "Você é um assistente especialista que responde questões de múltipla escolha em português do Brasil.\n"
        "Responda apenas com UMA opção correta (A, B, C ou D).\n"
    )

    # Input - User turn
    subject = row['Subject'].replace('_', ' ').title()
    
    return (
        f"{system_instruction}"
        f"Assunto: {subject}\n\n"
        f"Pergunta: {row['Question']}\n"
        f"A) {row['A']}\n"
        f"B) {row['B']}\n"
        f"C) {row['C']}\n"
        f"D) {row['D']}\n\n"
        "Resposta correta:" 
    )

def format_gemma_instruction(row: pd.Series) -> dict:
    """
    Convert MMLU row to Gemma 3 chat format.
    """

    # Input - User turn
    user_content = build_user_prompt(row)

    # Output - Model turn
    answer = row['Answer'] 
    model_content = f"{answer}) {row[answer]}"
    
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": model_content}
        ]
    }

In [10]:
# Apply formatting
train_dataset = Dataset.from_pandas(train_df).map(format_gemma_instruction)
val_dataset = Dataset.from_pandas(val_df).map(format_gemma_instruction)

# Remove columns we don't need so TRL doesn't get confused
columns_to_keep = ["messages"]
train_dataset = train_dataset.remove_columns([c for c in train_dataset.column_names if c not in columns_to_keep])
val_dataset = val_dataset.remove_columns([c for c in val_dataset.column_names if c not in columns_to_keep])

# Check the formatted text
print(train_dataset[100])

{'messages': [{'content': 'Você é um assistente especialista que responde questões de múltipla escolha em português do Brasil.\nResponda apenas com UMA opção correta (A, B, C ou D).\nAssunto: Business Ethics\n\nPergunta: Os gerentes têm a confiança de administrar a empresa aos melhores interesses dos _______. Especificamente, eles têm o dever de agir ao benefício da empresa, além do dever de __________ e de ____________.\nA) Acionistas, Cuidado e habilidade, Diligência.\nB) Partes interessadas, cuidado e habilidade, Diligência\nC) Acionistas, Interesse próprio, Diligência\nD) Acionistas, Cuidado e habilidade, Interesse próprio\n\nResposta correta:', 'role': 'user'}, {'content': 'A) Acionistas, Cuidado e habilidade, Diligência.', 'role': 'assistant'}]}


In [None]:
# Get Gemma 3 chat template from "gemma-3-1b-it"
tokenizer_it = AutoTokenizer.from_pretrained(BASE_MODEL_FOR_CHAT_TEMPLATE)

print(tokenizer_it.chat_template)

{{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
    {%- if messages[0]['content'] is string -%}
        {%- set first_user_prefix = messages[0]['content'] + '

' -%}
    {%- else -%}
        {%- set first_user_prefix = messages[0]['content'][0]['text'] + '

' -%}
    {%- endif -%}
    {%- set loop_messages = messages[1:] -%}
{%- else -%}
    {%- set first_user_prefix = "" -%}
    {%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
    {%- endif -%}
    {%- if (message['role'] == 'assistant') -%}
        {%- set role = "model" -%}
    {%- else -%}
        {%- set role = message['role'] -%}
    {%- endif -%}
    {{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
    {%- if message['content'] is string -%}
        {{ message['content'] | trim }}


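The `chat_template_with_gen` defined below copies the Gemma 3 chat template, including its control tokens (`<bos>`, `<start_of_turn>`, `<end_of_turn>`),
and adds `{% generation %}` / `{% endgeneration %}` markers around message content. TRL uses these markers to decide which tokens contribute to the loss when `assistant_only_loss=True`.

A small sample is then rendered to visually confirm the final training input structure before tokenization.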

In [None]:
# Add {% generation %} markers - TRL uses them to build the loss mask when assistant_only_loss=True
chat_template_with_gen ="""{{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
    {%- if messages[0]['content'] is string -%}
        {%- set first_user_prefix = messages[0]['content'] + '

' -%}
    {%- else -%}
        {%- set first_user_prefix = messages[0]['content'][0]['text'] + '

' -%}
    {%- endif -%}
    {%- set loop_messages = messages[1:] -%}
{%- else -%}
    {%- set first_user_prefix = "" -%}
    {%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
    {%- endif -%}
    {%- if (message['role'] == 'assistant') -%}
        {%- set role = "model" -%}
    {%- else -%}
        {%- set role = message['role'] -%}
    {%- endif -%}
    {{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
    {%- if message['content'] is string -%}
        {% generation %}{{ message['content'] | trim }}{% endgeneration %}
    {%- elif message['content'] is iterable -%}
        {%- for item in message['content'] -%}
            {%- if item['type'] == 'image' -%}
                {{ '<start_of_image>' }}
            {%- elif item['type'] == 'text' -%}
                {{ item['text'] | trim }}
            {%- endif -%}
        {%- endfor -%}
    {%- else -%}
        {{ raise_exception("Invalid content type") }}
    {%- endif -%}
    {{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{'<start_of_turn>model
'}}
{%- endif -%}
"""

# "Copy" Gemma 3 chat template
tokenizer.chat_template = chat_template_with_gen

# Verify formatting remains visually correct (tokenize=False returns a plain string)
print(tokenizer.apply_chat_template(train_dataset[100]["messages"], tokenize=False, add_generation_prompt=False))

<bos><start_of_turn>user
Você é um assistente especialista que responde questões de múltipla escolha em português do Brasil.
Responda apenas com UMA opção correta (A, B, C ou D).
Assunto: Business Ethics

Pergunta: Os gerentes têm a confiança de administrar a empresa aos melhores interesses dos _______. Especificamente, eles têm o dever de agir ao benefício da empresa, além do dever de __________ e de ____________.
A) Acionistas, Cuidado e habilidade, Diligência.
B) Partes interessadas, cuidado e habilidade, Diligência
C) Acionistas, Interesse próprio, Diligência
D) Acionistas, Cuidado e habilidade, Interesse próprio

Resposta correta:<end_of_turn>
<start_of_turn>model
A) Acionistas, Cuidado e habilidade, Diligência.<end_of_turn>
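
Because the target format is `<letter>) <answer text>`, downstream evaluation has to recover the chosen letter from a generated string. The following is a minimal sketch of one way to do that; `extract_choice` and its regular expression are hypothetical and are not taken from the project's evaluation code.

```python
import re
from typing import Optional

# Matches a leading option letter followed by ")", e.g. "A) Acionistas, ...".
ANSWER_PATTERN = re.compile(r"^\s*([ABCD])\s*\)")


def extract_choice(generated: str) -> Optional[str]:
    """Return the option letter at the start of a generated answer, if any."""
    match = ANSWER_PATTERN.match(generated)
    return match.group(1) if match else None


print(extract_choice("A) Acionistas, Cuidado e habilidade, Diligência."))
print(extract_choice("Não tenho certeza."))
```

Returning `None` for unparseable outputs keeps malformed generations visible instead of silently counting them as a valid choice.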



## 3. LoRA Instruction-Tuning <a id="part_03"></a>

This section defines the LoRA configuration and training procedure for supervised fine-tuning.

### 3.1 LoRA Configuration <a id="31-lora-configuration"></a>

The **Low-Rank Adaptation (LoRA)** parameters determine which layers are trained and how
the adapter weights are integrated with the base model.

| Parameter        | Value                               | Description                                      |
| :--------------- | :---------------------------------- | :----------------------------------------------- |
| `r`              | 32                                  | Low-rank dimension of trainable adapter matrices |
| `alpha`          | 64                                  | Scaling factor (≈ 2× rank)                       |
| `dropout`        | 0.05                                | Regularization term                              |
| `target_modules` | q/k/v/o, gate, up, down projections | Attention and MLP projection layers              |

These settings follow standard SFT practices for instruction tuning small-to-medium models
and align with configurations used in recent studies (e.g., *LIMIT*, *Unveiling the Secret Recipe*).

In [14]:
lora_config = LoraConfig(
    r=32,                           # Rank - using different rank than existing (which was 128)
    lora_alpha=64,                  # Alpha = Scaling factor - usually 2x the rank
    lora_dropout=0.05,              # Dropout - regularization for instruction tuning
    bias="none",                    # Whether to train bias parameters
    task_type="CAUSAL_LM",          # Task type for the model
    target_modules=[                # Target modules for LoRA
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ]
)
print("LoRA Config for Instruction Tuning:")
print(f"  Rank: {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Scaling factor: {lora_config.lora_alpha / lora_config.r}")  # Should be ~2
print(f"  Target modules: {lora_config.target_modules}")

LoRA Config for Instruction Tuning:
  Rank: 32
  Alpha: 64
  Scaling factor: 2.0
  Target modules: {'down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj'}
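
To give a sense of how the rank controls adapter size, the sketch below counts the parameters one LoRA pair adds to a single linear layer. The layer shape used here is hypothetical; real shapes should be read from the loaded model rather than taken from this example.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r, in_features), B is (out_features, r)."""
    return r * (in_features + out_features)


# Hypothetical square projection, not an actual Gemma layer shape.
in_f, out_f, r = 1024, 1024, 32
added = lora_param_count(in_f, out_f, r)
full = in_f * out_f
print(f"LoRA adds {added:,} parameters vs. {full:,} in the frozen weight ({added / full:.2%})")
```

The added parameter count grows linearly with `r`, so moving from rank 128 (contextual adapter) to rank 32 (instruction adapter) cuts the per-layer adapter size by a factor of four for the same target modules.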


### 3.2 Training Setup <a id="32-training-setup"></a>

Fine-tuning is implemented with the **SFTTrainer** class from the TRL library.
The following configuration controls optimization behavior, checkpointing, and memory usage.

| Parameter                   | Value                    |
| :-------------------------- | :----------------------- |
| Epochs                      | 2.5                      |
| Max Sequence Length         | 768 tokens               |
| Effective Batch Size        | 16 (1 × 16 accumulation) |
| Learning Rate               | 1e-4                     |
| Scheduler                   | Cosine                   |
| Warmup Ratio                | 0.1                      |
| Gradient Clipping           | 1.0                      |
| Precision                   | bfloat16                 |
| Evaluation & Save Frequency | every 50 steps           |

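`assistant_only_loss=True` instructs TRL to compute the loss **only on tokens inside the template's `{% generation %}` spans**,
which are intended to cover the assistant responses. Whether the mask actually matches that intent depends on the chat template and is checked in Section 3.3.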

In [None]:
# Training arguments
training_args = SFTConfig(
    output_dir=str(OUTPUT_MODEL_DIR),

    # Training hyperparameters
    num_train_epochs=2.5,               # Slightly more epochs for instruction tuning
    max_length=768,                     # Maximum sequence length for the model inputs

    # Data & Loss Handling
    assistant_only_loss=True,
    packing=False,
    dataset_text_field="messages",

    # Batch size 1 X 16 = 16 examples per step (adjust based on GPU memory)
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,

    # Learning rate - slightly lower for fine-tuning on top of existing adapters
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,

    # Optimization
    optim="adamw_torch",
    weight_decay=0.01,
    max_grad_norm=1.0,

    # Evaluation and logging
    save_total_limit=4,
    load_best_model_at_end=False,        # Causes issues with multi-adapter
    metric_for_best_model="eval_loss",
    eval_strategy="steps",
    save_strategy="steps",
    eval_steps=50,
    save_steps=50,
    logging_steps=25,
    
    # Performance
    bf16=torch.cuda.is_available(),     # Use bfloat16 if available
    dataloader_num_workers=0,           # Dataloader safety (Windows/WSL sometimes has problem with workers)
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={     # Critical for LoRA stability
        'use_reentrant':False
    },

    # Other
    report_to="none",
    seed=RANDOM_SEED,
)

print("Training arguments configured:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(
    f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Output dir: {OUTPUT_MODEL_DIR}")

Training arguments configured:
  Epochs: 2.5
  Effective batch size: 16
  Learning rate: 0.0001
  Output dir: ../../models/gemma-3-1b-pt-contextual-e1-ckpt1600-sft115
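
The step budget implied by this configuration can be estimated by hand. The sketch below assumes a single device and ceiling rounding for both the per-epoch step count and the warmup length; the exact rounding inside the trainer may differ between library versions. Under these assumptions the total agrees with `global_step=343` reported in the training output of Section 3.4.

```python
import math

train_samples = 2177      # training split size from Section 2.1
effective_batch = 1 * 16  # per-device batch size x gradient accumulation steps
epochs = 2.5
warmup_ratio = 0.1

# Optimizer updates per epoch and over the whole run (assumed ceiling rounding).
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = math.ceil(epochs * steps_per_epoch)
warmup_steps = math.ceil(warmup_ratio * total_steps)

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Warmup steps (assumed rounding): {warmup_steps}")
```

With evaluation and saving every 50 steps, this budget yields six evaluation points (steps 50 through 300) before training ends.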


In [16]:
# Initialize Trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=lora_config,        # LoRA configuration
    processing_class=tokenizer      # Argument for tokenizer
)

print("Trainer initialized. Ready to start training!")



The model is already on multiple devices. Skipping the move to device specified in `args`.


Trainer initialized. Ready to start training!


### 3.3 Sanity Checks <a id="33-sanity-checks"></a>

Before training begins, two diagnostic steps validate dataset integrity and loss computation.

The first cell performs a **sanity check** on the training dataset to ensure the model inputs are formatted correctly.

Given a sample, it **decodes the input IDs** back into text using the tokenizer.  
This reveals the **actual prompt structure** the model sees, including special control tokens (e.g., `<start_of_turn>`, `<bos>`) and the formatting of the Q&A pair.

In [17]:
# Verify the trainer processed the dataset
processed_sample = trainer.train_dataset[100]

# Decode the Input IDs back to text to see what the model actually sees
decoded_text = tokenizer.decode(processed_sample['input_ids'])

print("--- WHAT THE MODEL SEES ---")
print(decoded_text)
print("---------------------------")

# Check Length
print(f"Token count: {len(processed_sample['input_ids'])}")

--- WHAT THE MODEL SEES ---
<bos><start_of_turn>user
Você é um assistente especialista que responde questões de múltipla escolha em português do Brasil.
Responda apenas com UMA opção correta (A, B, C ou D).
Assunto: Business Ethics

Pergunta: Os gerentes têm a confiança de administrar a empresa aos melhores interesses dos _______. Especificamente, eles têm o dever de agir ao benefício da empresa, além do dever de __________ e de ____________.
A) Acionistas, Cuidado e habilidade, Diligência.
B) Partes interessadas, cuidado e habilidade, Diligência
C) Acionistas, Interesse próprio, Diligência
D) Acionistas, Cuidado e habilidade, Interesse próprio

Resposta correta:<end_of_turn>
<start_of_turn>model
A) Acionistas, Cuidado e habilidade, Diligência.<end_of_turn>

---------------------------
Token count: 180


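The second cell checks **how tokens and labels are prepared** by the data collator before training.  
It prints each token with its label and reports which ones are masked (`label = -100`), meaning they don’t contribute to the loss.

A **good result** shows that only the **assistant’s response tokens** are labeled (non -100), and everything else (user prompt, padding) is ignored.

The result here is a **bad one** in two respects. First, only 11 tokens are masked while 302 are unmasked, and the unmasked tokens include the user prompt (for example `Você` at position 4), so the loss is not restricted to the assistant response. This is consistent with the template placing `{% generation %}` around the content of every message rather than only around assistant turns. Second, the special token `<end_of_turn>` is masked (`-100`), meaning the model **never learns when to stop** its answer.  
That can lead to repeated or runaway outputs during inference.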


In [18]:
# Inspecting Token Labels and Masking Behavior
sample = trainer.train_dataset[0]   # Get one sample

# Use the Data Collator (which applies the masking)
collator = trainer.data_collator
batch = collator([sample])

# Extract Input IDs and Labels
input_ids = batch['input_ids'][0]
labels = batch['labels'][0]

print("Masked tokens:", sum(l==-100 for l in batch['labels'][0]))
print("Unmasked tokens:", sum(l!=-100 for l in batch['labels'][0]))

# Walk through every token and report whether it contributes to the loss
for i, label_id in enumerate(labels):
    # Decode the actual input token for inspection
    input_token_str = tokenizer.decode([input_ids[i].item()])
    if label_id != -100:
        # Label is a real token id: this position contributes to the loss
        decoded_token = tokenizer.decode([label_id.item()])
        print(f"Token at {i}: '{decoded_token}' | Label ID: {label_id.item()}")
    else:
        # Label is -100: the token is IGNORED in the loss calculation
        print(f"--> THIS Token is IGNORED '{input_token_str}' | Label ID: {label_id.item()}")

Masked tokens: tensor(11)
Unmasked tokens: tensor(302)
--> THIS Token is IGNORED '<bos>' | Label ID: -100
--> THIS Token is IGNORED '<start_of_turn>' | Label ID: -100
--> THIS Token is IGNORED 'user' | Label ID: -100
--> THIS Token is IGNORED '
' | Label ID: -100
Token at 4: 'Você' | Label ID: 88270
Token at 5: ' é' | Label ID: 1559
Token at 6: ' um' | Label ID: 1983
Token at 7: ' assist' | Label ID: 6361
Token at 8: 'ente' | Label ID: 3194
Token at 9: ' especialista' | Label ID: 127835
Token at 10: ' que' | Label ID: 929
Token at 11: ' responde' | Label ID: 106451
Token at 12: ' questões' | Label ID: 82829
Token at 13: ' de' | Label ID: 569
Token at 14: ' múlti' | Label ID: 80642
Token at 15: 'pla' | Label ID: 30635
Token at 16: ' escolha' | Label ID: 97358
Token at 17: ' em' | Label ID: 1092
Token at 18: ' português' | Label ID: 130383
Token at 19: ' do' | Label ID: 776
Token at 20: ' Brasil' | Label ID: 23463
Token at 21: '.' | Label ID: 236761
Token at 22: '
' | Label ID: 107
Token
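
One possible way to address the masking seen above is to restrict the `{% generation %}` span to assistant turns and to include `<end_of_turn>` inside it. The template below is a hypothetical, simplified sketch and not the authors' template: it assumes conversations contain only user and assistant turns with plain-string content (no system or image handling), and that TRL derives the loss mask from the generation span. It has not been validated against this notebook's library versions.

```python
# Hypothetical text-only chat template (assumption: user/assistant turns only).
# Only the assistant content and its closing <end_of_turn> sit inside the
# generation span, so only those tokens would be expected to receive labels.
chat_template_assistant_only = r"""{{ bos_token }}
{%- for message in messages -%}
    {%- if message['role'] == 'assistant' -%}
        {{ '<start_of_turn>model\n' }}{% generation %}{{ message['content'] | trim }}{{ '<end_of_turn>' }}{% endgeneration %}{{ '\n' }}
    {%- else -%}
        {{ '<start_of_turn>' + message['role'] + '\n' + (message['content'] | trim) + '<end_of_turn>\n' }}
    {%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{ '<start_of_turn>model\n' }}
{%- endif -%}
"""

print(chat_template_assistant_only)
```

If such a template were adopted, it would be assigned to `tokenizer.chat_template` before the trainer is built, and the label inspection above would need to be rerun to confirm that the user prompt is masked and `<end_of_turn>` is not.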

### 3.4 Training <a id="34-training"></a>

The model is fine-tuned using LoRA adapters with the configured SFTTrainer.
Each step logs loss, token accuracy, and gradient statistics.

**Logged Metrics:**

* `step`: global optimizer step at which the metrics were recorded
* `training_loss`: cross-entropy loss on the training split (unmasked tokens)
* `validation_loss`: cross-entropy loss on the validation split
* `entropy`: average uncertainty of model outputs
* `num_tokens`: cumulative number of tokens processed up to that step
* `mean_token_accuracy`: token-level prediction accuracy
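
For intuition about the last two metrics, the sketch below computes token accuracy and mean predictive entropy over unmasked positions for a toy example. It is an illustration in plain Python, not TRL's implementation: it assumes logits are already aligned with their labels, and TRL may differ in details such as which positions are averaged.

```python
import math

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def token_metrics(logits, labels):
    """Return (mean token accuracy, mean entropy in nats) over unmasked positions.

    logits: list of per-position score lists; labels: list of target ids.
    """
    correct, entropy_sum, counted = 0, 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # Numerically stable softmax.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        probs = [e / norm for e in exps]
        entropy_sum += -sum(p * math.log(p) for p in probs if p > 0)
        correct += int(scores.index(top) == label)
        counted += 1
    if counted == 0:
        raise ValueError("no unmasked positions")
    return correct / counted, entropy_sum / counted


# Toy vocabulary of size 3; the first position is masked.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
toy_labels = [IGNORE_INDEX, 1, 1]
accuracy, entropy = token_metrics(toy_logits, toy_labels)
print(f"accuracy={accuracy:.2f}, entropy={entropy:.3f} nats")
```

Because both metrics are averaged only over unmasked tokens, their values depend directly on which tokens the loss mask includes.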

> ⚠️ **Warning:** The next Python cell is **compute-intensive**.  
> Execution is best done on a **dedicated or cloud machine with a capable GPU**; the run reported below took about 79 minutes (`train_runtime` ≈ 4,712 s).  
> Runtime and cell outputs are reported below.


In [19]:
# Start training
print("Starting training...")
trainer.train()

Starting training...


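| Step | Training Loss | Validation Loss | Entropy  | Num Tokens | Mean Token Accuracy |
| :--- | :------------ | :-------------- | :------- | :--------- | :------------------ |
| 50   | 1.4074        | 1.220184        | 1.216521 | 220148.0   | 0.743835            |
| 100  | 1.3251        | 1.174997        | 1.203234 | 434520.0   | 0.752394            |
| 150  | 1.245         | 1.166622        | 1.105936 | 645252.0   | 0.755691            |
| 200  | 1.2222        | 1.159029        | 1.118751 | 858136.0   | 0.756214            |
| 250  | 1.2058        | 1.155876        | 1.086265 | 1074267.0  | 0.757525            |
| 300  | 1.1356        | 1.165259        | 1.04083  | 1289655.0  | 0.756023            |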


TrainOutput(global_step=343, training_loss=1.303121566772461, metrics={'train_runtime': 4712.0263, 'train_samples_per_second': 1.155, 'train_steps_per_second': 0.073, 'total_flos': 6406230200577024.0, 'train_loss': 1.303121566772461, 'entropy': 1.009279995639291, 'num_tokens': 1474756.0, 'mean_token_accuracy': 0.7776619887186421, 'epoch': 2.5071198897565456})
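
The logged validation losses can be scanned for their minimum to see where the run generalized best. The sketch below copies the (step, validation loss) pairs from the log above; the variable names are illustrative.

```python
# (step, validation loss) pairs copied from the training log above.
eval_log = [
    (50, 1.220184), (100, 1.174997), (150, 1.166622),
    (200, 1.159029), (250, 1.155876), (300, 1.165259),
]

# Pick the evaluation point with the lowest validation loss.
best_step, best_loss = min(eval_log, key=lambda item: item[1])
last_step, last_loss = eval_log[-1]

print(f"Lowest validation loss: {best_loss:.6f} at step {best_step}")
print(f"Change from best to step {last_step}: {last_loss - best_loss:+.6f}")
```

Validation loss bottoms out at step 250 and rises slightly by step 300 while training loss keeps falling, which may indicate the onset of overfitting. A specific checkpoint directory, if still retained under `save_total_limit=4`, could be loaded instead of the final weights.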

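After training completes, the final adapter weights are saved. Because `load_best_model_at_end=False`, these are the weights from the last step (343), not necessarily the checkpoint with the lowest validation loss, despite the `best_eval` directory name: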

In [20]:
# Save the final LoRA adapters
final_model_path = OUTPUT_MODEL_DIR / "best_eval"
trainer.save_model(str(final_model_path))

print(f"Final LoRA adapters saved to: {final_model_path}")

# Also save the tokenizer to the same location
tokenizer.save_pretrained(str(final_model_path))
print(f"Tokenizer saved to: {final_model_path}")

Final LoRA adapters saved to: ../../models/gemma-3-1b-pt-contextual-e1-ckpt1600-sft115/best_eval
Tokenizer saved to: ../../models/gemma-3-1b-pt-contextual-e1-ckpt1600-sft115/best_eval


**Final Outputs:**

* `gemma-3-1b-pt-contextual-e1-ckpt1600-sft/best_eval/`: fine-tuned LoRA adapter weights (`adapter_model.safetensors` in recent PEFT versions, `adapter_model.bin` in older ones)
* `tokenizer.json`: tokenizer configuration for inference and evaluation
* `trainer_state.json`: loss and accuracy progression logs