# Real-Time Progress Tracking with TransformersTrainer

This notebook demonstrates how to monitor distributed training progress in real time using `TransformersTrainer` from Kubeflow Trainer v2 on Red Hat OpenShift AI.

## Overview

In this example, we fine-tune the **Qwen 2.5 1.5B Instruct** model on the **Stanford Alpaca** instruction-following dataset. Training runs on 2 GPU nodes with automatic progress tracking enabled, so you can monitor training metrics in real time from the OpenShift AI Dashboard.

### What You'll Learn

| Feature | Description |
|---------|-------------|
| **Automatic Progress Tracking** | TransformersTrainer auto-injects a `KubeflowProgressCallback` that exposes training metrics via HTTP |
| **Real-Time Metrics** | View current step, epoch, loss, and estimated time remaining in the OpenShift AI Dashboard |
| **PVC-Based Checkpointing** | Save model checkpoints to a shared PersistentVolumeClaim for durability and resume capability |
| **Distributed Training** | Run training across multiple GPU nodes using PyTorch's DistributedDataParallel (DDP) |

### Model Details

**Qwen 2.5 1.5B Instruct** is a compact instruction-tuned language model from the Qwen family:
- **Parameters:** 1.5 billion
- **Context Length:** 32K tokens
- **Languages:** Multilingual with strong English and Chinese support
- **Use Case:** Ideal for instruction-following, chat, and text generation tasks
- **Why this model?** Small enough to train quickly for demonstration, yet powerful enough for real-world tasks

### Prerequisites

Before running this notebook, ensure you have:

1. **OpenShift AI Cluster** with Kubeflow Trainer v2 enabled
2. **Workbench** running Python 3.12+ with GPU access
3. **Environment Variables:**
   - `OPENSHIFT_API_URL` - Your OpenShift API server URL
   - `NOTEBOOK_USER_TOKEN` - Authentication token for API access
4. **Shared PVC** named `shared` mounted at `/opt/app-root/src/shared` in the workbench
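
A quick preflight check can catch a missing variable or mount before any job is submitted. The sketch below is a hypothetical helper (`check_prerequisites` is not part of the Kubeflow SDK); it relies only on the environment variable names and mount path listed above.

```python
import os


def check_prerequisites(env, shared_path):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Both variables are needed to authenticate against the OpenShift API.
    for name in ("OPENSHIFT_API_URL", "NOTEBOOK_USER_TOKEN"):
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    # The shared PVC must be mounted in the workbench before downloading anything.
    if not os.path.isdir(shared_path):
        problems.append(f"shared PVC mount not found at {shared_path}")
    return problems
```

In the workbench, calling `check_prerequisites(os.environ, "/opt/app-root/src/shared")` and printing each returned problem gives a short checklist of what still needs fixing.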

## Setup and Imports

Install the Kubeflow SDK and required packages.

In [None]:
!python3 -m pip install datasets transformers accelerate huggingface_hub
!python3 -m pip install --force-reinstall --no-cache-dir -U "kubeflow @ git+https://github.com/opendatahub-io/kubeflow-sdk.git@v0.2.1+rhai0"
!python3 -m pip install --force-reinstall --no-cache-dir -U ipykernel

In [None]:
import os

import kubeflow
import torch
from datasets import load_dataset
from kubeflow.common.types import KubernetesBackendConfig
from kubeflow.trainer import TrainerClient
from kubeflow.trainer.rhai import TransformersTrainer
from kubernetes import client as k8s
from transformers import AutoModelForCausalLM, AutoTokenizer

print(f"Kubeflow SDK version: {kubeflow.__version__}")
print("✅ All imports successful")

## Configuration

Configure authentication and paths. TransformersTrainer has **progress tracking enabled by default**.

In [None]:
# Authentication configuration
api_server = os.getenv("OPENSHIFT_API_URL")
token = os.getenv("NOTEBOOK_USER_TOKEN")

if not api_server or not token:
    raise RuntimeError(
        "OPENSHIFT_API_URL and NOTEBOOK_USER_TOKEN environment variables are required"
    )

# Configure Kubernetes client
configuration = k8s.Configuration()
configuration.host = api_server
configuration.verify_ssl = False  # Set to True if using trusted certificates
configuration.api_key = {"authorization": f"Bearer {token}"}

# PVC Configuration
PVC_NAME = "shared"
# In the notebook, the shared PVC is mounted at /opt/app-root/src/shared
# In training pods, we mount it at /opt/app-root/src
NOTEBOOK_SHARED_PATH = "/opt/app-root/src/shared"  # Where notebook sees the shared PVC
TRAINING_POD_PATH = "/opt/app-root/src"  # Where training pods will mount it

# Model Configuration - use notebook path for downloading, training path for train_func
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
# Paths for notebook (downloading)
MODEL_PATH = f"{NOTEBOOK_SHARED_PATH}/models/qwen2.5-1.5b-instruct"
DATA_PATH = f"{NOTEBOOK_SHARED_PATH}/data/alpaca_processed"
CHECKPOINTS_PATH = f"{NOTEBOOK_SHARED_PATH}/checkpoints/progress-tracking"

# Paths for training pods (used in train_func and PodTemplateOverrides)
TRAINING_MODEL_PATH = f"{TRAINING_POD_PATH}/models/qwen2.5-1.5b-instruct"
TRAINING_DATA_PATH = f"{TRAINING_POD_PATH}/data/alpaca_processed"
TRAINING_CHECKPOINTS_PATH = f"{TRAINING_POD_PATH}/checkpoints/progress-tracking"

print(f"API Server: {api_server}")
print(f"Model: {MODEL_NAME}")
print(f"Notebook Model Path: {MODEL_PATH}")
print(f"Training Pod Model Path: {TRAINING_MODEL_PATH}")
print(f"Data Path: {DATA_PATH}")
print(f"Checkpoints Path: {CHECKPOINTS_PATH}")

## Download Model and Dataset to Shared PVC

Before submitting the training job, we pre-download the model and dataset to the shared PVC. This ensures:
- **Offline Training:** Training pods don't need internet access during training
- **Faster Startup:** No download delays when training pods start
- **Consistency:** All nodes use the same model weights and data

### About the Dataset

We use the **Stanford Alpaca** dataset (`tatsu-lab/alpaca`), a widely-used instruction-following dataset:

| Property | Value |
|----------|-------|
| **Source** | Stanford University |
| **Size** | 52,000 instruction-response pairs |
| **Format** | Instruction, optional input, and response |
| **Use Case** | Instruction-tuning language models |

Each sample follows this structure:
```
### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced diet...
2. Exercise regularly...
3. Get enough sleep...
```

For this demo, we use a **500-sample subset** to enable quick training (~1 minute).

In [None]:
# Download model to PVC
if os.path.exists(MODEL_PATH) and os.listdir(MODEL_PATH):
    print(f"✅ Model already exists at {MODEL_PATH}")
else:
    print(f"🔄 Downloading model {MODEL_NAME} to {MODEL_PATH}...")
    os.makedirs(MODEL_PATH, exist_ok=True)

    # Use fast tokenizer for compatibility
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_NAME, use_fast=True, trust_remote_code=True
    )
    tokenizer.save_pretrained(MODEL_PATH)

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    model.save_pretrained(MODEL_PATH, safe_serialization=True)
    print(f"✅ Model saved to {MODEL_PATH}")
    print(f"📁 Files: {os.listdir(MODEL_PATH)}")

In [None]:
# Download and prepare dataset
if os.path.exists(DATA_PATH) and os.listdir(DATA_PATH):
    print(f"✅ Dataset already exists at {DATA_PATH}")
else:
    print("🔄 Downloading and processing Alpaca dataset...")
    os.makedirs(DATA_PATH, exist_ok=True)

    # Load subset of Alpaca dataset
    dataset = load_dataset("tatsu-lab/alpaca", split="train[:500]")

    # Load tokenizer for preprocessing
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_PATH, use_fast=True, trust_remote_code=True
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    def format_instruction(example):
        if example.get("input"):
            text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
        else:
            text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
        return {"text": text}

    dataset = dataset.map(format_instruction, remove_columns=dataset.column_names)

    def tokenize_function(examples):
        tokenized = tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=512,
        )
        tokenized["labels"] = tokenized["input_ids"].copy()
        return tokenized

    tokenized_dataset = dataset.map(
        tokenize_function, batched=True, remove_columns=["text"]
    )
    tokenized_dataset.save_to_disk(DATA_PATH)
    print(f"✅ Dataset saved to {DATA_PATH}")

print("\n✅ Model and dataset ready on PVC!")

## Define the Training Function

The training function runs inside each training pod as a distributed PyTorch process. TransformersTrainer serializes this function and executes it via `torchrun` on each node.
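
Each process launched by `torchrun` learns its role from environment variables such as `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. The following is a minimal sketch of reading them, with single-process defaults so the same code also runs outside a distributed job. The `DistEnv` dataclass and `read_dist_env` helper are illustrative names, not part of any library.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global process index across all nodes
    local_rank: int  # process index on this node, used to pick the GPU
    world_size: int  # total number of processes in the job

    @property
    def is_main(self):
        # Rank 0 conventionally handles one-off work such as the final save.
        return self.rank == 0


def read_dist_env(env=None):
    """Read torchrun's variables, falling back to single-process defaults."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

With 2 nodes and 1 GPU per node, as in this notebook, one would expect ranks 0 and 1, each with local rank 0.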

### How Progress Tracking Works

When you use `TransformersTrainer` with `enable_progression_tracking=True` (the default):

1. **Automatic Instrumentation:** TransformersTrainer injects a `KubeflowProgressCallback` into your HuggingFace `Trainer`
2. **HTTP Metrics Server:** A lightweight HTTP server starts on port 28080, exposing metrics as JSON
3. **Dashboard Integration:** The Kubeflow controller polls these metrics and writes them to TrainJob annotations, which the OpenShift AI Dashboard displays as real-time progress
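
The HTTP metrics server in step 2 can be pictured with a small standard-library sketch. This is not the actual `KubeflowProgressCallback` implementation: the class and function names are hypothetical, and the JSON field names are borrowed from the annotation fields read later in this notebook on the assumption that the endpoint uses a similar shape.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ProgressState:
    """Thread-safe holder for the latest training progress."""

    def __init__(self, total_steps):
        self._lock = threading.Lock()
        self._data = {"currentStep": 0, "totalSteps": total_steps, "progressPercentage": 0}

    def update(self, step, **metrics):
        with self._lock:
            total = self._data["totalSteps"]
            self._data["currentStep"] = step
            self._data["progressPercentage"] = round(100 * step / total) if total else 0
            self._data["trainMetrics"] = dict(metrics)

    def snapshot(self):
        with self._lock:
            return dict(self._data)


def start_metrics_server(state, port=28080):
    """Serve state.snapshot() as JSON from a daemon thread; returns the server."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps(state.snapshot()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep training logs free of per-request noise

    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In a training loop, `state.update(step, loss=...)` would be called from a hook such as `TrainerCallback.on_log`, while a poller issues plain `GET` requests against the port.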

### Training Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| `num_train_epochs` | 1 | Complete one pass through the dataset |
| `per_device_train_batch_size` | 2 | Samples per GPU per step |
| `gradient_accumulation_steps` | 4 | Effective batch size = 2 × 4 × 2 nodes = 16 |
| `learning_rate` | 2e-5 | Standard fine-tuning rate |
| `save_steps` | 20 | Checkpoint every 20 steps |
| `bf16` | True | Use bfloat16 mixed precision |
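
The effective batch size and the resulting step count follow from simple arithmetic. The sketch below works it through for the 500-sample subset; the helper names are illustrative, and the step count is an approximation because the real `Trainer` rounds per process and may differ by one.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_processes):
    """Samples consumed per optimizer step across all processes."""
    return per_device_batch * grad_accum_steps * num_processes


def approx_optimizer_steps(num_samples, per_device_batch, grad_accum_steps,
                           num_processes, epochs=1):
    """Rough optimizer step count for the whole run."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_processes)
    return math.ceil(num_samples / batch) * epochs


batch = effective_batch_size(2, 4, 2)
steps = approx_optimizer_steps(500, 2, 4, 2)
print(f"effective batch size: {batch}, approx steps per epoch: {steps}")
# prints: effective batch size: 16, approx steps per epoch: 32
```

With `save_steps=20`, a run of roughly 32 steps produces one intermediate checkpoint, at step 20.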

### Key Points

- **Supported Trainers:** Use `transformers.Trainer` or `trl.SFTTrainer` - both are auto-instrumented
- **No Manual Setup:** Progress tracking callback is injected automatically
- **Local Files Only:** Model and data are loaded from the mounted PVC (no network access needed)

In [None]:
def train_func():
    """SFT training function using HuggingFace Trainer.

    TransformersTrainer automatically:
    - Injects KubeflowProgressCallback for real-time metrics
    - Applies checkpoint configuration from periodic_checkpoint_config
    - Enables auto-resume from latest checkpoint
    """
    import os

    # IMPORTANT: Set offline mode BEFORE importing transformers/huggingface_hub
    # This prevents the newer huggingface_hub from validating local paths as repo IDs
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"

    import torch
    from datasets import load_from_disk
    from transformers import (
        AutoConfig,
        AutoModelForCausalLM,
        DataCollatorForLanguageModeling,
        PreTrainedTokenizerFast,
        Trainer,
        TrainingArguments,
    )

    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # Paths on the shared PVC, which the job mounts at /opt/app-root/src
    model_path = "/opt/app-root/src/models/qwen2.5-1.5b-instruct"
    data_path = "/opt/app-root/src/data/alpaca_processed"
    # Checkpoints and the final model are written here, on the shared PVC
    output_dir = "/opt/app-root/src/checkpoints/progress-tracking"

    print(f"🚀 Starting training on rank {rank}")

    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        print(f"🔧 GPU: {torch.cuda.get_device_name(local_rank)}")

    # Load tokenizer directly from tokenizer.json file
    # This bypasses AutoTokenizer's hub validation that fails with local paths
    print(f"📥 Loading tokenizer from: {model_path}")
    tokenizer_file = os.path.join(model_path, "tokenizer.json")
    tokenizer_config_file = os.path.join(model_path, "tokenizer_config.json")

    # Load tokenizer config to get special tokens
    import json

    with open(tokenizer_config_file) as f:
        tokenizer_config = json.load(f)

    tokenizer = PreTrainedTokenizerFast(
        tokenizer_file=tokenizer_file,
        eos_token=tokenizer_config.get("eos_token", "<|endoftext|>"),
        pad_token=tokenizer_config.get("pad_token"),
        bos_token=tokenizer_config.get("bos_token"),
        unk_token=tokenizer_config.get("unk_token"),
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load model config first, then model - avoids hub validation issues
    print(f"📥 Loading model from: {model_path}")
    config_file = os.path.join(model_path, "config.json")
    with open(config_file) as f:
        model_config_dict = json.load(f)

    # Extract model_type and pass remaining config
    model_type = model_config_dict.pop("model_type")
    config = AutoConfig.for_model(model_type, **model_config_dict)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        config=config,
        torch_dtype=torch.bfloat16,
        device_map={"": local_rank},
        local_files_only=True,
    )

    # Load dataset
    print(f"📥 Loading dataset from: {data_path}")
    tokenized_dataset = load_from_disk(data_path)

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )

    # Create output directory
    os.makedirs(output_dir, exist_ok=True)

    # Training arguments - TransformersTrainer will override save_* settings
    # from periodic_checkpoint_config if provided
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        logging_steps=5,
        save_strategy="steps",
        save_steps=20,
        save_total_limit=2,
        report_to="none",
        bf16=True,
        ddp_find_unused_parameters=False,
    )

    # Trainer - TransformersTrainer automatically injects KubeflowProgressCallback
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        data_collator=data_collator,
    )

    # Train - auto-resumes from latest checkpoint if available
    trainer.train()

    # Save final model (only on rank 0)
    if rank == 0:
        final_path = f"{output_dir}/final"
        trainer.save_model(final_path)
        tokenizer.save_pretrained(final_path)
        print(f"✅ Final model saved to {final_path}")

    print(f"✅ Training complete on rank {rank}")


print("✅ Training function defined")

## Create the Trainer Client

Initialize the TrainerClient with authentication configuration.

In [None]:
# Create client with authentication
api_client = k8s.ApiClient(configuration)

backend_config = KubernetesBackendConfig(
    client_configuration=api_client.configuration,
)

client = TrainerClient(backend_config)
print("✅ TrainerClient created")

# Get the torch-distributed runtime
runtime = client.backend.get_runtime("torch-distributed")
print(f"✅ Using runtime: {runtime.name}")

## Submit the Training Job with TransformersTrainer

Now we create and submit the distributed training job. The `TransformersTrainer` wraps your training function and handles all the distributed training setup.

### Job Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| `num_nodes` | 2 | Number of GPU nodes for distributed training |
| `nvidia.com/gpu` | 1 | GPUs per node |
| `cpu` | 4 | CPU cores per node |
| `memory` | 16Gi | Memory per node |
| `enable_progression_tracking` | True | Enable real-time progress monitoring (default) |
| `metrics_poll_interval_seconds` | 30 | How often the controller polls the training pods for metrics |

### PVC Mounting

We use `PodTemplateOverrides` to mount the shared PVC in training pods:
- **Mount Path:** `/opt/app-root/src` - Where training pods access model, data, and checkpoints
- **PVC Name:** `shared` - The ReadWriteMany PVC containing our data

This ensures all training nodes can access the same model weights, dataset, and can write checkpoints to a shared location.
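
Because the same PVC appears under two different mount points, any path prepared in the workbench has to be translated before it is used inside a training pod. The following is a small sketch of that mapping; `to_training_path` is a hypothetical helper, and the two mount paths are the ones used in this notebook.

```python
import posixpath

NOTEBOOK_MOUNT = "/opt/app-root/src/shared"  # where the workbench mounts the PVC
TRAINING_MOUNT = "/opt/app-root/src"         # where training pods mount the same PVC


def to_training_path(notebook_path, notebook_mount=NOTEBOOK_MOUNT,
                     training_mount=TRAINING_MOUNT):
    """Translate a workbench path on the shared PVC to its path inside a training pod."""
    rel = posixpath.relpath(notebook_path, notebook_mount)
    # A relative path that climbs upward means the input is outside the PVC mount.
    if rel == ".." or rel.startswith("../"):
        raise ValueError(f"{notebook_path} is not on the shared PVC ({notebook_mount})")
    return posixpath.normpath(posixpath.join(training_mount, rel))
```

For example, `/opt/app-root/src/shared/models/qwen2.5-1.5b-instruct` in the workbench maps to `/opt/app-root/src/models/qwen2.5-1.5b-instruct` in the pods, which matches the hard-coded paths in `train_func`.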

In [None]:
from kubeflow.trainer.options.kubernetes import (
    ContainerOverride,
    PodSpecOverride,
    PodTemplateOverride,
    PodTemplateOverrides,
)

# Create TransformersTrainer with progress tracking (enabled by default)
# Checkpointing is handled in the training function - saves to shared PVC
trainer = TransformersTrainer(
    func=train_func,
    num_nodes=2,
    resources_per_node={
        "nvidia.com/gpu": 1,
        "cpu": "4",
        "memory": "16Gi",
    },
    # Progress tracking is enabled by default
    enable_progression_tracking=True,
    metrics_poll_interval_seconds=30,
)

# Submit the training job with PVC mount for model, data, and checkpoints
job_name = client.train(
    trainer=trainer,
    runtime=runtime,
    options=[
        PodTemplateOverrides(
            PodTemplateOverride(
                target_jobs=["node"],
                spec=PodSpecOverride(
                    volumes=[
                        {
                            "name": "shared",
                            "persistentVolumeClaim": {"claimName": PVC_NAME},
                        },
                    ],
                    containers=[
                        ContainerOverride(
                            name="node",
                            volume_mounts=[
                                {"name": "shared", "mountPath": TRAINING_POD_PATH},
                            ],
                        )
                    ],
                ),
            )
        )
    ],
)
print(f"✅ Training job submitted: {job_name}")
print("📊 Progress tracking: ENABLED (auto-injected by TransformersTrainer)")
print(f"💾 Checkpoints saved to: {TRAINING_CHECKPOINTS_PATH}")

## Follow Job Logs

Let's fetch the job logs to confirm that training is proceeding as expected. The logs stream in real time as training progresses.

In [None]:
# Stream logs (press Ctrl+C to stop if you want to continue with other cells)
for logline in client.get_job_logs(job_name, follow=True):
    print(logline, end="")

## Get Job Status

Check the final status of the training job after completion.

In [None]:
# Check job status
job = client.get_job(job_name)
print("Final TrainJob Status:")
print(f"   Name: {job.name}")
print(f"   Status: {job.status}")
print(f"   Created: {job.creation_timestamp}")
print(f"   Nodes: {job.num_nodes}")
print(f"   Runtime: {job.runtime.name}")

if job.steps:
    print("   Steps:")
    for step in job.steps:
        print(f"     - {step.name}: {step.status}")
    print()

## Check Training Progress

View detailed progress metrics from the TrainJob annotations. The Kubeflow controller polls the training pods every 30 seconds and writes metrics to these annotations.
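
The `trainer.opendatahub.io/trainerStatus` annotation holds a JSON string, so a small helper can turn it into a one-line summary. The sketch below assumes the field names used in this notebook (`currentStep`, `totalSteps`, `progressPercentage`, `estimatedRemainingSeconds`); `format_duration` and `summarize_progress` are hypothetical helpers, not SDK functions.

```python
import json

STATUS_KEY = "trainer.opendatahub.io/trainerStatus"


def format_duration(seconds):
    """Render a number of seconds as e.g. '1m 05s' or '2h 03m 04s'."""
    seconds = int(seconds)
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h {minutes:02d}m {secs:02d}s"
    return f"{minutes}m {secs:02d}s"


def summarize_progress(annotations):
    """Return a one-line summary, or None if no status has been written yet."""
    raw = annotations.get(STATUS_KEY)
    if not raw:
        return None
    status = json.loads(raw)
    line = f"step {status.get('currentStep', '?')}/{status.get('totalSteps', '?')}"
    if status.get("progressPercentage") is not None:
        line += f" ({status['progressPercentage']}%)"
    remaining = status.get("estimatedRemainingSeconds")
    if remaining is not None:
        line += f", about {format_duration(remaining)} remaining"
    return line
```

Passing the `annotations` dictionary from the TrainJob's metadata to `summarize_progress` gives a compact status line that is convenient to print inside a polling loop.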

In [None]:
import json

from kubernetes import client as k8s

custom_api = k8s.CustomObjectsApi(api_client)

# Get current namespace
try:
    with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace") as f:
        job_namespace = f.read().strip()
except FileNotFoundError:
    # Fallback: use oc project -q
    import subprocess

    result = subprocess.run("oc project -q", shell=True, capture_output=True, text=True)
    job_namespace = result.stdout.strip() if result.returncode == 0 else "default"

print(f"📍 Namespace: {job_namespace}")

try:
    trainjob = custom_api.get_namespaced_custom_object(
        group="trainer.kubeflow.org",
        version="v1alpha1",
        namespace=job_namespace,
        plural="trainjobs",
        name=job_name,
    )

    annotations = trainjob.get("metadata", {}).get("annotations", {})

    print(f"\nTrainJob Progress for {job_name}:\n")

    # Progression tracking config
    print("Config:")
    print(
        f"  tracking-enabled: {annotations.get('trainer.opendatahub.io/progression-tracking', 'N/A')}"
    )
    print(
        f"  metrics-port: {annotations.get('trainer.opendatahub.io/metrics-port', 'N/A')}"
    )
    print(
        f"  poll-interval: {annotations.get('trainer.opendatahub.io/metrics-poll-interval', 'N/A')}s"
    )

    # Progression metrics from trainerStatus annotation
    trainer_status = annotations.get("trainer.opendatahub.io/trainerStatus")
    if trainer_status:
        progress = json.loads(trainer_status)
        print("\nMetrics:")
        print(f"  progress: {progress.get('progressPercentage', 'N/A')}%")
        print(
            f"  step: {progress.get('currentStep', 'N/A')}/{progress.get('totalSteps', 'N/A')}"
        )
        print(
            f"  epoch: {progress.get('currentEpoch', 'N/A')}/{progress.get('totalEpochs', 'N/A')}"
        )
        print(f"  remaining: {progress.get('estimatedRemainingSeconds', 'N/A')}s")

        train_metrics = progress.get("trainMetrics", {})
        if train_metrics:
            print(f"  loss: {train_metrics.get('loss', 'N/A')}")
            print(f"  learning_rate: {train_metrics.get('learning_rate', 'N/A')}")

    print(f"\nAll annotations:\n{json.dumps(annotations, indent=2)}")

except Exception as e:
    print(f"Error: {e}")

## Test the Trained Model

After training completes, we load the fine-tuned model from the checkpoint saved on the shared PVC.

### Checkpoint Structure

The training function saves checkpoints with this structure:
```
/opt/app-root/src/shared/checkpoints/progress-tracking/
├── checkpoint-20/     # Intermediate checkpoint at step 20
├── checkpoint-32/     # Checkpoint at final step
└── final/             # Final model saved by rank 0, ready for inference
```
```
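
When no `final/` directory exists (for example, after an interrupted run), the newest `checkpoint-N` directory is the natural fallback. Creation timestamps can be unreliable on shared storage, and a plain string sort would place `checkpoint-100` before `checkpoint-20`, so the sketch below compares step numbers numerically instead. `latest_checkpoint_by_step` is a hypothetical helper offered as an alternative approach.

```python
import os
import re

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint_by_step(output_dir):
    """Return the checkpoint-N directory with the highest N, or None if there is none."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Ignore 'final', stray files, and anything not named checkpoint-<number>.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```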

### Testing the Model

We'll load the fine-tuned model and test it with an instruction prompt using the same format as the training data:

```
### Instruction:
<your instruction here>

### Response:
<model generates response>
```
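
Building the prompt and stripping it from the generated text are both simple string operations. The following is a minimal sketch with two hypothetical helpers, `build_prompt` and `extract_response`; the template mirrors the `format_instruction` function used during dataset preparation, including the optional `### Input:` section.

```python
RESPONSE_MARKER = "### Response:"


def build_prompt(instruction, input_text=""):
    """Build an Alpaca-style prompt that ends where the model should start answering."""
    if input_text:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            f"{RESPONSE_MARKER}\n"
        )
    return f"### Instruction:\n{instruction}\n\n{RESPONSE_MARKER}\n"


def extract_response(generated_text, marker=RESPONSE_MARKER):
    """Return the text after the first response marker, trimmed of whitespace."""
    _, found, tail = generated_text.partition(marker)
    # If the marker is missing, fall back to the whole text.
    return tail.strip() if found else generated_text.strip()
```

Splitting on the marker rather than replacing the full prompt string stays robust if decoding normalizes whitespace in the prompt.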

In [None]:
def find_most_recent_checkpoint(output_dir):
    """Find the most recently created checkpoint directory."""
    if not os.path.exists(output_dir):
        raise FileNotFoundError(f"Output directory not found: {output_dir}")

    checkpoint_dirs = [
        os.path.join(output_dir, d)
        for d in os.listdir(output_dir)
        if os.path.isdir(os.path.join(output_dir, d))
        and (d.startswith("checkpoint-") or d == "final")
    ]

    if not checkpoint_dirs:
        raise FileNotFoundError(f"No checkpoints found in {output_dir}")

    # Prefer 'final' if it exists
    final_path = os.path.join(output_dir, "final")
    if os.path.exists(final_path):
        return final_path

    return max(checkpoint_dirs, key=os.path.getctime)


print("✅ Checkpoint utility defined")

In [None]:
# Find and load the trained model
final_checkpoint = find_most_recent_checkpoint(CHECKPOINTS_PATH)
print(f"📂 Loading checkpoint from: {final_checkpoint}")

trained_tokenizer = AutoTokenizer.from_pretrained(
    final_checkpoint, trust_remote_code=True
)
trained_model = AutoModelForCausalLM.from_pretrained(
    final_checkpoint,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    trust_remote_code=True,
)

print("✅ Model loaded successfully")
print(f"📊 Model parameters: {trained_model.num_parameters():,}")

# Test the model
test_prompt = "### Instruction:\nExplain what machine learning is in one sentence.\n\n### Response:"

print("\n📝 Testing model with prompt:")
print(test_prompt)
print("\n🤖 Model response:")

inputs = trained_tokenizer(test_prompt, return_tensors="pt").to(trained_model.device)

# Remove token_type_ids if present (not used by some models like Qwen)
if "token_type_ids" in inputs:
    del inputs["token_type_ids"]

with torch.no_grad():
    outputs = trained_model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
    )

response = trained_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.replace(test_prompt, "").strip())

print("\n✅ Model test completed!")

## Cleanup

Delete the training job and free resources.

In [None]:
# Delete the training job
client.delete_job(name=job_name)
print(f"✅ Job {job_name} deleted")

In [None]:
import gc

# Clear CUDA cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()

gc.collect()
print("✅ Resources freed, CUDA cache cleared")

## Summary

Congratulations! You've successfully completed a distributed fine-tuning job with real-time progress tracking on OpenShift AI.

### What You Accomplished

| Step | Description |
|------|-------------|
| ✅ Model Download | Downloaded Qwen 2.5 1.5B Instruct to shared PVC |
| ✅ Dataset Preparation | Processed Stanford Alpaca dataset for instruction-tuning |
| ✅ Distributed Training | Ran 2-node distributed training with PyTorch DDP |
| ✅ Progress Tracking | Monitored real-time metrics via SDK and Dashboard |
| ✅ Checkpointing | Saved model checkpoints to shared PVC |
| ✅ Model Testing | Loaded and tested the fine-tuned model |

### Key Takeaways

1. **TransformersTrainer** automatically instruments your HuggingFace `Trainer` with:
   - `KubeflowProgressCallback` for real-time metrics (HTTP endpoint on port 28080)
   - Progress visible in OpenShift AI Dashboard without any code changes

2. **Shared PVC Strategy** for distributed training:
   - Pre-download model and dataset to shared RWX PVC from workbench
   - Training pods mount the same PVC and access data locally (offline mode)
   - Checkpoints are written to shared storage for durability

3. **Supported Trainers**:
   - `transformers.Trainer` - Standard HuggingFace trainer
   - `trl.SFTTrainer` - TRL's supervised fine-tuning trainer

### TransformersTrainer Quick Reference

| Parameter | Description | Default |
|-----------|-------------|---------|
| `func` | Training function using `transformers.Trainer` | Required |
| `num_nodes` | Number of distributed training nodes | Required |
| `resources_per_node` | GPU, CPU, memory per node | Required |
| `enable_progression_tracking` | Enable real-time metrics server | `True` |
| `metrics_poll_interval_seconds` | How often the controller polls the training pods for metrics (seconds) | `30` |

### Next Steps

- **Scale Up:** Increase `num_nodes` for larger models or datasets
- **Use LoRA:** Add PEFT/LoRA for memory-efficient fine-tuning
- **Try Other Models:** This pattern works with any HuggingFace model
- **Enable JIT Checkpointing:** Use `enable_jit_checkpoint=True` for automatic checkpoint saving on preemption

### Resources

- [Kubeflow Trainer Documentation](https://www.kubeflow.org/docs/components/trainer/)
- [HuggingFace Transformers](https://huggingface.co/docs/transformers/)
- [Stanford Alpaca Dataset](https://huggingface.co/datasets/tatsu-lab/alpaca)