# Fine-tuning Gemma 3 4B IT on SQuAD using PEFT LoRA on Kaggle

This notebook demonstrates how to fine-tune the `google/gemma-3-4b-it` model on the Stanford Question Answering Dataset (SQuAD) using Parameter-Efficient Fine-Tuning (PEFT) with LoRA and 4-bit quantization. We'll leverage the `transformers`, `peft`, `trl`, `accelerate`, `bitsandbytes`, and `datasets` libraries within a Kaggle environment equipped with 2 T4 GPUs.

**Goal:** Adapt the Gemma 3 instruction-tuned model to better handle question-answering tasks based on provided context, while optimizing resource usage.

**Key Techniques:**

1.  **Gemma 3 4B IT:** Utilizing Google's latest instruction-tuned model.
2.  **SQuAD Dataset:** Training on a standard question-answering benchmark.
3.  **PEFT & LoRA:** Significantly reducing the number of trainable parameters for efficient fine-tuning.
4.  **4-bit Quantization:** Loading the base model in 4-bit precision to conserve GPU memory.
5.  **SFTTrainer:** Using the `trl` library's trainer designed for supervised fine-tuning tasks.
6.  **Kaggle Environment:** Utilizing the free T4 GPU resources.
7.  **Loss Tracking:** Monitoring both training and validation loss during the process.

## 1. Setup Kaggle Environment

* **Enable GPUs:** Make sure to enable the GPU accelerator in your Kaggle notebook settings. Go to "Settings" -> "Accelerator" and select "GPU T4 x2".
* **Hugging Face Token (Optional but Recommended):** While Gemma models might be accessible without login now, it's good practice to use a Hugging Face token, especially for potentially gated models or private use.
    * Go to your Hugging Face account settings -> Access Tokens -> New token.
    * In your Kaggle notebook, go to "Add-ons" -> "Secrets" and add your Hugging Face token with the name `HF_TOKEN`. The code below will attempt to log in using this secret.

## 2. Install Libraries

We need to install the necessary libraries from Hugging Face and other dependencies.

In [1]:
# --- Code Cell: Install Libraries ---
!pip install -q -U transformers datasets accelerate peft trl bitsandbytes huggingface_hub packaging ninja

# The ninja package is often needed as a build dependency for bitsandbytes

In [2]:
!pip freeze | grep transformers

sentence-transformers==3.4.1
transformers==4.51.3


## 3. Import Libraries & Login

In [3]:
# --- Code Cell: Imports and Login ---
import os
import torch
import transformers
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from trl import SFTTrainer, SFTConfig
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient # To access HF_TOKEN

# --- Configuration ---
# Attempt to log in to Hugging Face Hub
try:
    user_secrets = UserSecretsClient()
    hf_token = user_secrets.get_secret("HF_TOKEN") # Must match the secret name added under "Add-ons" -> "Secrets"
    login(token=hf_token)
    print("Successfully logged into Hugging Face Hub.")
except Exception as e:
    print(f"Could not log in to Hugging Face Hub. Using public access. Error: {e}")

# Model ID
model_id = "google/gemma-3-4b-it"

# Dataset ID (Using Hugging Face's dataset identifier for SQuAD)
dataset_name = "squad"

# PEFT/LoRA Configuration
lora_r = 16 # Rank of the LoRA matrices
lora_alpha = 32 # Alpha parameter for LoRA scaling
lora_dropout = 0.05 # Dropout probability for LoRA layers
# Target modules can vary based on the model architecture.
# Common targets for Gemma-like models include query, key, value, and output projections.
# Inspecting model.named_modules() can help identify them.
# Let's target common projection layers.
lora_target_modules = [
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
]

# BitsAndBytes Configuration (for 4-bit quantization)
use_4bit = True
bnb_4bit_compute_dtype = "float16" # Compute dtype; float16 suits the T4 (no native bfloat16), use "bfloat16" on Ampere or newer GPUs
bnb_4bit_quant_type = "nf4" # Use NF4 quantization type for better precision
use_nested_quant = False # Set to True to activate nested quantization for more memory savings

# TrainingArguments Configuration
output_dir = "./gemma3-squad-finetuned" # Directory to save the trained model adapters
num_train_epochs = 1 # Number of training epochs (adjust as needed)
per_device_train_batch_size = 4 # Batch size per GPU
per_device_eval_batch_size = 4 # Batch size for evaluation
gradient_accumulation_steps = 2 # Number of steps to accumulate gradients before updating weights (effective batch size = 4 * 2 = 8 per model replica; 16 only with two data-parallel replicas)
gradient_checkpointing = True # Use gradient checkpointing to save memory
optim = "paged_adamw_32bit" # Use paged AdamW optimizer for memory efficiency
learning_rate = 2e-5 # Learning rate
weight_decay = 0.001 # Weight decay for regularization
max_grad_norm = 0.3 # Gradient clipping threshold
max_steps = -1 # Maximum number of training steps (-1 for epoch-based training)
warmup_ratio = 0.03 # Ratio of steps for learning rate warmup
lr_scheduler_type = "constant" # Learning rate scheduler type ("constant" ignores warmup_ratio; "constant_with_warmup" applies it)
logging_steps = 25 # Log training loss every 25 steps
eval_steps = 50 # Evaluate on validation set every 50 steps
save_steps = 50 # Save checkpoint every 50 steps
evaluation_strategy = "steps" # Evaluate during training at `eval_steps`
save_strategy = "steps" # Save checkpoint during training at `save_steps`
save_total_limit = 2 # Keep only the last 2 checkpoints
load_best_model_at_end = True # Load the best model found during training at the end
report_to = "tensorboard" # Log metrics to TensorBoard (useful in Kaggle)

# SFTTrainer Configuration
max_seq_length = 512 # Maximum sequence length for tokenization
packing = False # Whether to pack multiple sequences into one sample (set to False for QA)

2025-04-28 09:20:18.765405: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745832018.975578      31 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1745832019.038435      31 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Successfully logged into Hugging Face Hub.


## 4. Load Dataset and Preprocess

We'll load the SQuAD dataset and preprocess it into a format suitable for `SFTTrainer`. SFTTrainer expects a single text column containing the full input and output sequence. We will format each SQuAD example (context, question, answer) into a prompt that instructs the model to answer the question based on the context.

**Formatting Strategy:**

We'll use a simple template:

```
<start_of_turn>user
Context: [context]
Question: [question]
Answer:<end_of_turn>
<start_of_turn>model
[answer]<end_of_turn>
```

This follows the instruction-following format Gemma expects.
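As an alternative to hand-written strings, the same example can be kept as a list of chat messages and rendered afterwards. The sketch below is illustrative only: `to_chat_messages` and `render_gemma_turns` are hypothetical helper names, and the mapping of the `assistant` role to Gemma's `model` turn is an assumption. The output of the tokenizer's own `apply_chat_template` may differ in details such as a leading BOS token.

```python
def to_chat_messages(context: str, question: str, answer: str) -> dict:
    """Build one conversational record from SQuAD-style fields."""
    user_content = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": answer},
        ]
    }


def render_gemma_turns(record: dict) -> str:
    """Render a record by hand into the turn template shown above.

    Assumption: the "assistant" role corresponds to Gemma's "model" turn.
    """
    role_names = {"user": "user", "assistant": "model"}
    turns = [
        f"<start_of_turn>{role_names[m['role']]}\n{m['content']}<end_of_turn>"
        for m in record["messages"]
    ]
    return "\n".join(turns)


# Toy example, not taken from SQuAD
record = to_chat_messages(
    context="Paris is the capital of France.",
    question="What is the capital of France?",
    answer="Paris",
)
print(render_gemma_turns(record))
```

Keeping the data as messages makes it easier to switch templates later without touching the dataset itself.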

In [4]:
# --- Code Cell: Load and Preprocess Dataset ---
from datasets import load_dataset

# Load SQuAD dataset
dataset =  load_dataset(dataset_name, split='train') # Load the training split

# SQuAD answers are structured. We need the first answer text.
def format_squad_example(example):
    # Ensure 'answers' field exists and has 'text'
    if 'answers' in example and 'text' in example['answers'] and len(example['answers']['text']) > 0:
        context = example['context']
        question = example['question']
        answer = example['answers']['text'][0] # Take the first answer

        # Create the formatted text string using the model's chat template structure
        # Note: We are using a simplified template here. For best results,
        # using the tokenizer's apply_chat_template might be preferable if available
        # and correctly configured for the task.
        formatted_text = f"<start_of_turn>user\nContext: {context}\nQuestion: {question}\nAnswer:<end_of_turn>\n<start_of_turn>model\n{answer}<end_of_turn>"
        return {"text": formatted_text}
    else:
        # Handle cases where 'answers' might be missing or empty.
        # map() does not drop rows, so mark the example with text=None
        # and let the filter step below remove it.
        return {"text": None}


# Apply the formatting function
# We use map with batched=False as the logic is per-example.
# remove_columns keeps only the new 'text' column SFTTrainer needs.
# We also filter out examples that couldn't be formatted (returned None).
formatted_dataset = dataset.map(format_squad_example, remove_columns=list(dataset.features))
formatted_dataset = formatted_dataset.filter(lambda example: example['text'] is not None)

# Take a subset for faster experimentation (keeps the run short on the T4 GPUs)
formatted_dataset = formatted_dataset.select(range(5000))


# Split the dataset into training and validation sets
# SQuAD ships its own validation split, but it is not formatted here,
# so we hold out part of the formatted training subset instead.
train_test_split = formatted_dataset.train_test_split(test_size=0.1) # 10% for validation
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

print(f"Training dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(eval_dataset)}")
print("\nSample formatted example:")
print(train_dataset[0]['text'])

Training dataset size: 4500
Validation dataset size: 500

Sample formatted example:
<start_of_turn>user
Context: The two became friends, and for many years lived in close proximity in Paris, Chopin at 38 Rue de la Chaussée-d'Antin, and Liszt at the Hôtel de France on the Rue Lafitte, a few blocks away. They performed together on seven occasions between 1833 and 1841. The first, on 2 April 1833, was at a benefit concert organized by Hector Berlioz for his bankrupt Shakespearean actress wife Harriet Smithson, during which they played George Onslow's Sonata in F minor for piano duet. Later joint appearances included a benefit concert for the Benevolent Association of Polish Ladies in Paris. Their last appearance together in public was for a charity concert conducted for the Beethoven Memorial in Bonn, held at the Salle Pleyel and the Paris Conservatory on 25 and 26 April 1841.
Question: For whose benefit was the first of these concerts performed for on 2 April 1833?
Answer:<end_of_turn>
<

## 5. Load Model and Tokenizer

Now, we load the Gemma 3 4B IT model and its tokenizer. We apply the 4-bit quantization configuration during model loading using `BitsAndBytesConfig`.
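To get a feel for why 4-bit loading matters on a 16 GB T4, here is a rough back-of-the-envelope estimate of weight memory. It is a sketch under simplifying assumptions: the parameter count is rounded to 4.3 billion, and the estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state. `weight_memory_gib` is a hypothetical helper, not a library function.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 4.3 billion parameters for the 4B model
approx_params = 4_300_000_000

fp16_gib = weight_memory_gib(approx_params, 16)
nf4_gib = weight_memory_gib(approx_params, 4)

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{nf4_gib:.1f} GiB")
print(f"Reduction factor: {fp16_gib / nf4_gib:.0f}x")
```

The real footprint will be somewhat higher than the 4-bit figure, but the estimate shows how much headroom quantization leaves for activations and the LoRA optimizer state.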

In [6]:
# --- Code Cell: Load Model and Tokenizer ---

# Load BitsAndBytes configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
# Custom device map (kept for reference; it only takes effect if passed as device_map below)
custom_device_map = {
    # Vision Tower and Projector on GPU 0
    "vision_tower": 0,
    "multi_modal_projector": 0,

    # Language Model Components (assuming 34 decoder layers, indices 0-33)
    "language_model.model.embed_tokens": 0,
    **{f"language_model.model.layers.{i}": 0 for i in range(24)}, # LLM layers 0-23 on GPU 0
    **{f"language_model.model.layers.{i}": 1 for i in range(24, 34)}, # LLM layers 24-33 on GPU 1
    "language_model.model.norm": 1,
    "language_model.lm_head": 1,
}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto", # Automatically distribute the model across GPUs (to revisit: auto placement might cause a tensor device error at validation)
    trust_remote_code=True, # Gemma 3 might require this
    attn_implementation='eager' # 'sdpa' not compatible
)

model.config.use_cache = False  # Disable cache for training
model.config.pretraining_tp = 1  # Set tensor parallelism degree (1 for no TP)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Set padding token and side. For Gemma, the EOS token is often used as the PAD token.
# Check tokenizer config or documentation if unsure. Let's assume EOS for now.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # "right" is common for causal models, ensures labels are not padded

## 6. Configure PEFT (LoRA)

We integrate PEFT into the loaded model. `prepare_model_for_kbit_training` prepares the quantized model for PEFT, and `get_peft_model` applies the LoRA configuration.
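To illustrate why LoRA trains so few parameters, the sketch below counts the weights added to a single linear layer. For a layer with `in_features` inputs and `out_features` outputs, LoRA adds two low-rank matrices of shapes `(r, in_features)` and `(out_features, r)`, and the update is scaled by `lora_alpha / r`. The layer shape used here is an illustrative assumption, not a value read from the model, and both helper functions are hypothetical.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


# Illustrative layer shape (assumption), with r=16 and alpha=32 as in the config above
in_features, out_features = 2560, 2048
rank, alpha = 16, 32

full = in_features * out_features
added = lora_param_count(in_features, out_features, rank)

print(f"Full weight matrix: {full:,} parameters")
print(f"LoRA adapter:       {added:,} parameters ({100 * added / full:.2f}% of the layer)")
print(f"Scaling factor:     {lora_scaling(alpha, rank)}")
```

Raising the rank grows the adapter linearly, so doubling `r` doubles the number of trainable parameters per targeted layer.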

In [7]:
# --- Code Cell: Configure PEFT ---

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing)

# Configure LoRA
peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=lora_target_modules,
    bias="none", # Typically set to 'none' for LoRA
    task_type="CAUSAL_LM",
)

# Get PEFT model
model = get_peft_model(model, peft_config)

# Freeze unnecessary LoRA layers for Vision Model
for count, (name,param) in enumerate(model.named_parameters()):
    if "vision" in name and "lora" in name:
        param.requires_grad = False
    # print(count,name,param.requires_grad)

# Print trainable parameters
model.print_trainable_parameters()

trainable params: 29,802,496 || all params: 4,332,867,952 || trainable%: 0.6878


In [8]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma3ForConditionalGeneration(
      (vision_tower): SiglipVisionModel(
        (vision_model): SiglipVisionTransformer(
          (embeddings): SiglipVisionEmbeddings(
            (patch_embedding): Conv2d(3, 1152, kernel_size=(14, 14), stride=(14, 14), padding=valid)
            (position_embedding): Embedding(4096, 1152)
          )
          (encoder): SiglipEncoder(
            (layers): ModuleList(
              (0-26): 27 x SiglipEncoderLayer(
                (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
                (self_attn): SiglipAttention(
                  (k_proj): lora.Linear4bit(
                    (base_layer): Linear4bit(in_features=1152, out_features=1152, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Lin

## 7. Configure Training Arguments

We define the training parameters using `trl.SFTConfig`, which extends `transformers.TrainingArguments` with SFT-specific options. This includes settings for batch size, learning rate, number of epochs, saving frequency, evaluation frequency, and more.
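The step-based settings are easier to choose once the approximate number of optimizer steps is known. The sketch below estimates it from the values used in this notebook (4,500 training examples, per-device batch size 4, gradient accumulation 2, evaluation every 50 steps). `approx_optimizer_steps` is a hypothetical helper; the trainer's exact count may differ by one depending on how it handles the last partial accumulation, and `num_replicas` assumes that a model sharded with `device_map="auto"` runs as a single replica.

```python
import math


def approx_optimizer_steps(num_examples: int, per_device_batch: int,
                           grad_accum: int, num_replicas: int = 1,
                           epochs: int = 1) -> int:
    """Rough number of weight updates for a training run."""
    effective_batch = per_device_batch * grad_accum * num_replicas
    return math.ceil(num_examples / effective_batch) * epochs


steps = approx_optimizer_steps(num_examples=4500, per_device_batch=4, grad_accum=2)
evaluations = steps // 50  # eval_steps = 50

print(f"Approximate optimizer steps per epoch: {steps}")
print(f"Approximate evaluation runs: {evaluations}")
```

If the evaluation count looks too high for the time budget, increasing `eval_steps` and `save_steps` together keeps checkpoints aligned with evaluations.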

In [9]:
# --- Code Cell: Configure Training Arguments ---

training_arguments = SFTConfig(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=False, # fp16 not used with 4-bit, bf16 determined by compute_dtype
    bf16=True if bnb_4bit_compute_dtype == 'bfloat16' and torch.cuda.is_bf16_supported() else False, # Use bf16 if supported for compute
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True, # Group sequences of similar length for efficiency
    lr_scheduler_type=lr_scheduler_type,
    logging_steps=logging_steps,
    eval_strategy=evaluation_strategy,
    eval_steps=eval_steps,
    save_strategy=save_strategy,
    save_steps=save_steps,
    save_total_limit=save_total_limit,
    load_best_model_at_end=load_best_model_at_end,
    metric_for_best_model="eval_loss", # Use eval loss to determine the best model
    greater_is_better=False, # Lower eval loss is better
    report_to=report_to,
    push_to_hub=False, # Set to True if you want to push the adapter to Hub
    gradient_checkpointing=gradient_checkpointing,
    # ddp_find_unused_parameters=False, # Might be needed in some multi-GPU setups
    dataset_text_field="text", # The column containing our formatted text
    max_seq_length=max_seq_length, # Truncate tokenized sequences to this length
    packing=packing # Whether to pack multiple short sequences into one sample
)

## 8. Initialize SFTTrainer and Start Training

We create an instance of `SFTTrainer`, passing the model, datasets, PEFT config, tokenizer, training arguments, and other relevant parameters. Then, we start the training process. The trainer will automatically handle the training loop, evaluation, logging (including training and validation loss), and saving checkpoints.

In [10]:
# --- Code Cell: Initialize Trainer and Train ---

# Initialize SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    # dataset_text_field="text", # The column containing our formatted text
    # max_seq_length=max_seq_length,
    processing_class=tokenizer,
    args=training_arguments,
    # packing=packing,
)

# Start training
print("Starting training...")
trainer.train()
print("Training finished.")

# Save the final adapter model
final_adapter_dir = os.path.join(output_dir, "final_adapter")
trainer.model.save_pretrained(final_adapter_dir)
print(f"Final PEFT adapter model saved to {final_adapter_dir}")

# --- Optional: Clean up memory ---
# del model
# del trainer
# torch.cuda.empty_cache()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Starting training...


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss,Validation Loss


KeyboardInterrupt: 

## 9. Evaluation and Inference (Optional)

After training, the `trainer` object holds the training history, including training and validation loss, which can be accessed via `trainer.state.log_history`. TensorBoard event files are also written under `output_dir` (by default in a `runs/` subdirectory).
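A minimal sketch of how the logged history might be split into training and validation curves. `split_loss_history` is a hypothetical helper, and the entries in `example_history` are invented for illustration (they are not results from this notebook). In a real run you would pass `trainer.state.log_history` instead.

```python
def split_loss_history(log_history):
    """Separate (step, loss) pairs for training and evaluation entries."""
    train_points, eval_points = [], []
    for entry in log_history:
        step = entry.get("step")
        if "loss" in entry:
            train_points.append((step, entry["loss"]))
        if "eval_loss" in entry:
            eval_points.append((step, entry["eval_loss"]))
    return train_points, eval_points


# Made-up entries shaped like Trainer logs (assumption, for illustration only)
example_history = [
    {"loss": 1.9, "learning_rate": 2e-5, "epoch": 0.04, "step": 25},
    {"loss": 1.4, "learning_rate": 2e-5, "epoch": 0.09, "step": 50},
    {"eval_loss": 1.5, "eval_runtime": 12.0, "epoch": 0.09, "step": 50},
    {"train_runtime": 100.0, "train_loss": 1.65, "epoch": 0.09, "step": 50},
]

train_points, eval_points = split_loss_history(example_history)
print("Training loss:", train_points)
print("Validation loss:", eval_points)
```

The two lists can then be plotted against the step number to check whether validation loss starts rising while training loss keeps falling.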

You can also load the trained adapter model and perform inference.


In [None]:
# --- Code Cell: Load Trained Model and Inference (Example) ---
from peft import PeftModel
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# --- Reload Base Model and Tokenizer with Quantization ---
# Ensure you reload the model in the same way you trained it (with quantization)
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# --- Load the PEFT Adapter ---
# Use the path where the best model adapter was saved by the trainer
# If load_best_model_at_end=True, the trainer saves it automatically.
# Check the output_dir for checkpoint folders. Let's assume the final one is best for this example.
adapter_path = "/kaggle/working/gemma3-squad-finetuned/final_adapter" # Or path to the best checkpoint, e.g., f"{output_dir}/checkpoint-XXX"

# Load the LoRA adapter onto the base model
model_with_adapter = PeftModel.from_pretrained(base_model, adapter_path)
model_with_adapter = model_with_adapter.eval() # Set to evaluation mode

print("Loaded base model and adapter for inference.")

# --- Create Inference Pipeline ---
pipe = pipeline(
    task="text-generation",
    model=model_with_adapter,
    tokenizer=tokenizer,
    max_new_tokens=50, # Limit the number of generated tokens (answer length)
    # temperature=0.7, # Adjust creativity
    # top_p=0.9,       # Use nucleus sampling
    # repetition_penalty=1.1 # Penalize repetition
)

# --- Example Inference ---
# Take an example from the original SQuAD validation set (or create one).
# Note: Start from the raw SQuAD fields (context and question), then wrap them
# in the same prompt format we used during fine-tuning, since the model expects it.

# Example from SQuAD dev set (you might need to load it separately)
context_example = "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ('Norman' comes from 'Norseman') raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia."
question_example = "Who were the Normans descended from?"

# Format the prompt like in training
prompt = f"<start_of_turn>user\nContext: {context_example}\nQuestion: {question_example}\nAnswer:<end_of_turn>\n<start_of_turn>model\n"

# Generate the answer
result = pipe(prompt)

print("\n--- Inference Example ---")
print(f"Context: {context_example}")
print(f"Question: {question_example}")
print("\nGenerated Answer (following prompt):")
# The output includes the prompt, so we display the full generated text.
# You might want to parse out just the answer part based on the <end_of_turn> token.
print(result[0]['generated_text'])

# Parse the generated answer (simple approach)
generated_full_text = result[0]['generated_text']
answer_start_tag = "<start_of_turn>model\n"
answer_end_tag = "<end_of_turn>"

start_index = generated_full_text.find(answer_start_tag)
if start_index != -1:
    start_index += len(answer_start_tag)
    end_index = generated_full_text.find(answer_end_tag, start_index)
    if end_index != -1:
        parsed_answer = generated_full_text[start_index:end_index].strip()
        print(f"\nParsed Answer: {parsed_answer}")
    else:
        # If end token not found, maybe take text after start token
        parsed_answer = generated_full_text[start_index:].strip()
        print(f"\nParsed Answer (end token not found): {parsed_answer}")
else:
    print("\nCould not parse the answer using the expected format.")

## 10. Conclusion

This notebook provided a comprehensive walkthrough for fine-tuning the Gemma 3 4B IT model on the SQuAD dataset using PEFT LoRA and 4-bit quantization within a Kaggle environment.

**Key Takeaways:**

* We configured and launched a fine-tuning job using `SFTTrainer`.
* PEFT LoRA allowed training with significantly fewer parameters.
* 4-bit quantization drastically reduced memory requirements, making it feasible on T4 GPUs.
* The trainer is configured to track both training and validation loss, which allows monitoring of overfitting and model performance.
* The final adapter model can be saved and used for inference on question-answering tasks formatted similarly to the training data.

**Next Steps:**

* Experiment with hyperparameters (learning rate, batch size, LoRA rank/alpha, number of epochs).
* Try different prompt formatting strategies.
* Evaluate the fine-tuned model using standard QA metrics (F1, Exact Match) on the SQuAD development set (requires custom evaluation logic).
* Push the final adapter model to the Hugging Face Hub for easy sharing and reuse.
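
As a starting point for the custom evaluation logic mentioned above, here is a minimal sketch of Exact Match and token-level F1 in the style of the usual SQuAD evaluation (lowercasing, stripping punctuation and articles). It compares a prediction against a single reference; SQuAD evaluation normally takes the best score over all reference answers, which this sketch leaves out. The function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-overlap F1 between normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Norse raiders.", "Norse raiders"))
print(f1_score("Norse raiders and pirates", "Norse raiders"))
```

Averaging these scores over the parsed answers for the development set gives figures comparable in spirit to the standard SQuAD metrics.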