# Fine-tune Qwen2.5-1.5B with the Alpaca Dataset

This example demonstrates how to fine-tune the Qwen2.5-1.5B model on the Alpaca dataset using the TorchTune `BuiltinTrainer` from the Kubeflow Trainer SDK.

This notebook walks you through the prerequisites for using the TorchTune `BuiltinTrainer` and shows how to submit a TrainJob to bootstrap the fine-tuning workflow.

Qwen2.5-1.5B: https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct

Alpaca Dataset: https://huggingface.co/datasets/tatsu-lab/alpaca

## Install the Kubeflow SDK

You need to install the Kubeflow SDK to interact with Kubeflow Trainer APIs:

In [None]:
!pip install kubeflow

## Prerequisites

### Install Official Training Runtimes

Make sure that the Kubeflow Trainer Controller Manager and the Kubeflow Training Runtimes are installed, as described in the [installation guide](https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/).

In [None]:
# List all available Kubeflow Training Runtimes.
from kubeflow.trainer import *
from kubeflow_trainer_api import models

client = TrainerClient()
for runtime in client.list_runtimes():
    print(runtime)

### Create PVCs for Models and Datasets

Currently, the volume claim in a (Cluster)TrainingRuntime is not orchestrated automatically.

So you need to manually create a PVC for each model you want to fine-tune. Note that **the PVC name must be equal to the ClusterTrainingRuntime name**. In this example, it is `torchtune-qwen2.5-1.5b`.

REF: https://github.com/kubeflow/trainer/issues/2630

In [None]:
# Create a PersistentVolumeClaim for the TorchTune Qwen 2.5 1.5B model.
client.backend.core_api.create_namespaced_persistent_volume_claim(
    namespace="default",
    body=models.IoK8sApiCoreV1PersistentVolumeClaim(
        apiVersion="v1",
        kind="PersistentVolumeClaim",
        metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
            name="torchtune-qwen2.5-1.5b"
        ),
        spec=models.IoK8sApiCoreV1PersistentVolumeClaimSpec(
            accessModes=["ReadWriteOnce"],
            resources=models.IoK8sApiCoreV1VolumeResourceRequirements(
                requests={
                    "storage": models.IoK8sApimachineryPkgApiResourceQuantity("200Gi")
                }
            ),
        ),
    ).to_dict(),
)
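
The naming rule above can be made explicit with a small helper. This is only a sketch: `build_pvc_manifest` is a hypothetical function, not part of the Kubeflow SDK. It builds the same PVC as a plain dictionary, so the PVC name is always derived from the runtime name in one place.

```python
def build_pvc_manifest(runtime_name: str, storage: str = "200Gi") -> dict:
    """Return a PVC manifest whose name equals the given runtime name.

    Hypothetical helper for illustration; not part of the Kubeflow SDK.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        # The PVC name must match the ClusterTrainingRuntime name.
        "metadata": {"name": runtime_name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": storage}},
        },
    }


manifest = build_pvc_manifest("torchtune-qwen2.5-1.5b")
print(manifest["metadata"]["name"])
```

A dictionary like this should be usable as the `body` argument of `create_namespaced_persistent_volume_claim`, in place of the typed models used in the cell above.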

## Bootstrap LLM Fine-tuning Workflow

Kubeflow TrainJob will train the model in the referenced (Cluster)TrainingRuntime.

In [None]:
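job_name = client.train(
    # Use the runtime whose name matches the PVC created above.
    runtime=client.get_runtime(name="torchtune-qwen2.5-1.5b"),
    # Download the dataset and the model from Hugging Face before training starts.
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://Qwen/Qwen2.5-1.5B-Instruct",
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            # Read the Parquet files and train on the first 1,000 examples.
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET, split="train[:1000]"
            ),
            # Resources requested for each training node.
            resources_per_node={
                "memory": "128G",
                "gpu": 1,
            },
        )
    )
)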
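
The `split="train[:1000]"` value appears to follow the slicing syntax of the Hugging Face `datasets` library, where a bracketed range selects a subset of a named split. The sketch below emulates only the absolute-index form of that syntax on a plain Python list, to show which rows such a string selects. `apply_split` is a hypothetical helper, and percentage slices such as `train[:10%]` are deliberately not handled.

```python
import re


def apply_split(rows: list, split: str) -> list:
    """Select rows according to a split string such as "train[:1000]".

    Hypothetical helper: it ignores the split name and supports only
    absolute start/stop indices, not percentages.
    """
    match = re.fullmatch(r"(\w+)(?:\[(-?\d+)?:(-?\d+)?\])?", split)
    if match is None:
        raise ValueError(f"unsupported split string: {split!r}")
    _, start, stop = match.groups()
    start_idx = int(start) if start else None
    stop_idx = int(stop) if stop else None
    return rows[start_idx:stop_idx]


# A toy stand-in for the rows of the train split.
rows = list(range(5000))
subset = apply_split(rows, "train[:1000]")
print(len(subset))
```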

## Wait for the Running Status

In [None]:
# Wait for the TrainJob to reach the Running status.
client.wait_for_job_status(name=job_name, status={"Running"})
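
Conceptually, waiting for a status is a polling loop with a timeout. The sketch below illustrates that idea only and is not the SDK's implementation: `wait_for_status`, its parameters, and the `"Created"` status used in the demo are all hypothetical, and the status source is a stand-in callable rather than a real TrainJob.

```python
import time
from typing import Callable, Set


def wait_for_status(
    get_status: Callable[[], str],
    wanted: Set[str],
    timeout: float = 600.0,
    interval: float = 2.0,
) -> str:
    """Poll get_status() until it returns a wanted status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in wanted:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status {status!r} is not in {sorted(wanted)}")
        time.sleep(interval)


# Demo with a fake status source (hypothetical status names).
statuses = iter(["Created", "Created", "Running"])
result = wait_for_status(lambda: next(statuses), {"Running"}, timeout=5.0, interval=0.01)
print(result)
```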


## Watch the TrainJob Logs

We can use the `get_job_logs()` API to get the TrainJob logs.

### Dataset Initializer

In [None]:
from kubeflow.trainer.constants import constants

for line in client.get_job_logs(job_name, follow=True, step=constants.DATASET_INITIALIZER):
    print(line)

### Model Initializer

In [None]:
for line in client.get_job_logs(job_name, follow=True, step=constants.MODEL_INITIALIZER):
    print(line)

### Trainer Node

In [None]:
for c in client.get_job(name=job_name).steps:
    print(f"Step: {c.name}, Status: {c.status}, Devices: {c.device} x {c.device_count}\n")

for line in client.get_job_logs(job_name, follow=True):
    print(line)

## Get the Fine-tuned Model

After the Trainer node completes the fine-tuning task, the fine-tuned model is stored in the `/workspace/output` directory, which can be shared across Pods by mounting the PVC. If you mount the PVC under `/<mountDir>` in another Pod, you can find the model in that Pod's `/<mountDir>/output` directory.
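
As an illustration of the mounting described above, the following sketch builds a Pod manifest as a plain dictionary that mounts the PVC and lists the output directory. The helper `build_reader_pod`, the Pod name `model-reader`, the `busybox` image, and the `/models` mount directory are all assumptions chosen for the example.

```python
from pathlib import PurePosixPath


def build_reader_pod(
    pvc_name: str,
    mount_dir: str,
    pod_name: str = "model-reader",  # hypothetical Pod name
    image: str = "busybox",  # hypothetical image
) -> dict:
    """Return a Pod manifest that mounts the PVC and lists <mount_dir>/output."""
    output_dir = str(PurePosixPath(mount_dir) / "output")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "reader",
                    "image": image,
                    "command": ["ls", "-l", output_dir],
                    # Mount the shared volume under the chosen directory.
                    "volumeMounts": [{"name": "model", "mountPath": mount_dir}],
                }
            ],
            "volumes": [
                {"name": "model", "persistentVolumeClaim": {"claimName": pvc_name}}
            ],
        },
    }


pod = build_reader_pod("torchtune-qwen2.5-1.5b", "/models")
print(pod["spec"]["containers"][0]["command"])
```

Because the PVC in this example uses the `ReadWriteOnce` access mode, such a Pod needs to run on the same node as any other Pod that still has the volume mounted.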