Real-Life Pipeline Example: Automation and Orchestration of a Supervised Tuning Pipeline

This example demonstrates how to automate the orchestration of a machine learning pipeline using Kubeflow Pipelines for parameter-efficient fine-tuning (PEFT). The pipeline reuses an existing pipeline template from Google for tuning the PaLM 2 foundation model, allowing users to specify parameters without building the pipeline from scratch.

Step 1: Specify Data URIs
In this step, we define the training and evaluation data URIs, which point to dataset files in JSONL format. These files are used to tune and evaluate the model.
Every learner works from the same data files, which keeps results reproducible.

In [None]:
### Specify the URIs for training and evaluation data.
### These are jsonl files that contain the question-answer pairs for tuning.
TRAINING_DATA_URI = "./tune_data_stack_overflow_python_qa.jsonl" 
EVAUATION_DATA_URI = "./tune_eval_data_stack_overflow_python_qa.jsonl"  
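
Before launching a tuning job, it can help to confirm that each line of a JSONL file parses as a JSON object with the expected fields. The sketch below is illustrative only: the `validate_jsonl_lines` helper is hypothetical, and the field names `input_text` and `output_text` are an assumption about the dataset schema, so adjust them to match your files.

```python
import json

# Assumed field names for each question-answer record; adjust to your schema.
REQUIRED_FIELDS = ("input_text", "output_text")


def validate_jsonl_lines(lines, required_fields=REQUIRED_FIELDS):
    """Return a list of (line_number, problem) tuples for malformed records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Skip blank lines, e.g. a trailing newline at end of file.
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = [field for field in required_fields if field not in record]
        if missing:
            problems.append((number, f"missing fields: {missing}"))
    return problems


# In-memory sample standing in for a real file: one good record, two bad ones.
sample_lines = [
    '{"input_text": "How do I reverse a list?", "output_text": "Use reversed()."}',
    '{"input_text": "What is a tuple?"}',
    "not json",
]
sample_problems = validate_jsonl_lines(sample_lines)
print(sample_problems)
```

To check a real file, an open file handle can be passed in place of the list, for example `validate_jsonl_lines(open(TRAINING_DATA_URI))`, since iterating over a file yields its lines.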


Step 2: Provide a Model Version
Versioning is important for reproducibility, auditing, and rollback capabilities. By providing a unique version name for the model (using a timestamp), we can track changes and restore previous versions if necessary.
This step builds the model name from a fixed prefix and a timestamp made of the current hour, day, month, and year.

In [None]:
### Import datetime to generate a versioned model name.
import datetime

### Generate a timestamp for versioning the model.
date = datetime.datetime.now().strftime("%H:%d:%m:%Y")

### Create a unique model name with the current date and time.
MODEL_NAME = f"deep-learning-ai-model-{date}"
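
The format string above records only the hour, day, month, and year, so two runs started within the same hour receive the same name, and the names do not sort chronologically. One possible alternative, shown as a sketch rather than as the format this notebook uses, puts the largest unit first and includes minutes and seconds. The `versioned_name` helper and its arguments are hypothetical.

```python
import datetime


def versioned_name(prefix, now=None):
    """Build a name whose timestamp suffix sorts chronologically."""
    if now is None:
        now = datetime.datetime.now()
    # Year-first ordering with seconds: later runs sort after earlier ones.
    return f"{prefix}-{now.strftime('%Y%m%d-%H%M%S')}"


# A fixed datetime is used here so the result is reproducible.
example_name = versioned_name(
    "deep-learning-ai-model",
    now=datetime.datetime(2024, 3, 5, 14, 7, 9),
)
print(example_name)  # deep-learning-ai-model-20240305-140709
```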


Step 3: Define Model Tuning Parameters
In this step, two key parameters are defined:
TRAINING_STEPS: The number of training steps for fine-tuning the model. For extractive QA, it is usually between 100 and 500.
EVALUATION_INTERVAL: Defines how frequently the model is evaluated during the training process (in this case, every 20 steps).


In [None]:
### Define the number of training steps for fine-tuning the model.
TRAINING_STEPS = 200

### Define the evaluation interval to assess the model's performance during training.
EVALUATION_INTERVAL = 20
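
To see how the two values interact, the sketch below lists the steps at which an evaluation would run, assuming one evaluation at every multiple of the interval up to the final step. That schedule is an assumption made for illustration, and the pipeline template may schedule evaluations differently. `evaluation_steps` is a hypothetical helper.

```python
def evaluation_steps(training_steps, evaluation_interval):
    """List the training steps at which an evaluation would run."""
    if training_steps <= 0 or evaluation_interval <= 0:
        raise ValueError("both values must be positive integers")
    return list(range(evaluation_interval, training_steps + 1, evaluation_interval))


steps = evaluation_steps(training_steps=200, evaluation_interval=20)
print(len(steps))  # 10
print(steps[:3])   # [20, 40, 60]
```

Under this assumption, 200 training steps with an interval of 20 give ten evaluation points, the last one at step 200.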


Step 4: Load Project ID and Credentials
This cell loads the necessary credentials and project ID to authenticate and connect with the Google Cloud Platform, which is required for executing the Kubeflow Pipeline.

In [None]:
### Import the authenticate function to load credentials and project ID.
from utils import authenticate

### Load credentials and project ID for Google Cloud.
credentials, PROJECT_ID = authenticate()

### Specify the region for pipeline execution.
REGION = "us-central1"


Step 5: Define Pipeline Arguments
Here, we define the pipeline arguments that will be passed to the pipeline when it's executed. These include:
model_display_name: The unique name for the model being tuned.
location: The region where the pipeline will be executed.
large_model_reference: The reference to the foundation model being fine-tuned (PaLM 2 in this case).
project: The Google Cloud project ID under which the pipeline runs.
train_steps: The number of training steps.
dataset_uri: The location of the training data.
evaluation_interval: The interval for evaluating the model.
evaluation_data_uri: The location of the evaluation data.

In [None]:
### Define the input arguments that will be passed to the pipeline.
pipeline_arguments = {
    "model_display_name": MODEL_NAME,  # Versioned model name
    "location": REGION,  # Execution region
    "large_model_reference": "text-bison@001",  # PaLM 2 model reference
    "project": PROJECT_ID,  # Google Cloud project ID
    "train_steps": TRAINING_STEPS,  # Number of training steps
    "dataset_uri": TRAINING_DATA_URI,  # URI for training data
    "evaluation_interval": EVALUATION_INTERVAL,  # Evaluation interval
    "evaluation_data_uri": EVAUATION_DATA_URI,  # URI for evaluation data
}
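
A misspelled key or a value of the wrong type in this dictionary would otherwise surface only after the job is submitted, so a quick local check can save a round trip. The sketch below is one way to do that. `check_pipeline_arguments` is a hypothetical helper, and the expected keys and types are copied from the dictionary above, not taken from the template's own schema. The example values are placeholders.

```python
# Expected keys and Python types, mirroring the dictionary defined above.
# This is an illustrative local check, not the template's official schema.
EXPECTED_ARGUMENT_TYPES = {
    "model_display_name": str,
    "location": str,
    "large_model_reference": str,
    "project": str,
    "train_steps": int,
    "dataset_uri": str,
    "evaluation_interval": int,
    "evaluation_data_uri": str,
}


def check_pipeline_arguments(arguments, expected=EXPECTED_ARGUMENT_TYPES):
    """Return a list of human-readable problems; empty means the check passed."""
    problems = []
    for key, expected_type in expected.items():
        if key not in arguments:
            problems.append(f"missing argument: {key}")
        elif not isinstance(arguments[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    for key in arguments:
        if key not in expected:
            problems.append(f"unexpected argument: {key}")
    return problems


# Placeholder values: train_steps is a string and one key is missing on purpose.
example_arguments = {
    "model_display_name": "example-model",
    "location": "us-central1",
    "large_model_reference": "text-bison@001",
    "project": "example-project-id",
    "train_steps": "200",
    "dataset_uri": "./train.jsonl",
    "evaluation_interval": 20,
}
argument_problems = check_pipeline_arguments(example_arguments)
print(argument_problems)
```

Calling `check_pipeline_arguments(pipeline_arguments)` on the real dictionary should return an empty list if every key is present with the expected type.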


Step 6: Specify Pipeline Template Path
The pipeline reuses an existing template rather than defining each component by hand. The path below points to version v2.0.0 of a pre-built template for tuning large models, so the run only needs the arguments defined in the previous step.

In [None]:
### Specify the path to the pipeline template to reuse.
### This template is a pre-configured pipeline for large model tuning.
template_path = 'https://us-kfp.pkg.dev/ml-pipeline/large-language-model-pipelines/tune-large-model/v2.0.0'


Step 7: Submit Pipeline Job for Execution
This step prepares and submits the pipeline job for execution using PipelineJob.
The pipeline job includes:
template_path: The YAML file or template that defines the pipeline.
display_name: A unique name for the pipeline run.
parameter_values: The pipeline arguments (input data and configurations).
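pipeline_root: The root directory where the pipeline's intermediate files and artifacts are stored during execution. It is set to "./" in this environment; on Vertex AI it is typically a Cloud Storage path.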
enable_caching: This option enables caching of pipeline components, so only changed components need to be re-executed.

In [None]:
### Import PipelineJob to create a pipeline execution job.
from google.cloud.aiplatform import PipelineJob

### Specify the pipeline root directory.
pipeline_root = "./"

### Create a pipeline job with the specified parameters and submit it for execution.
job = PipelineJob(
    template_path=template_path,  # Path to the pipeline YAML template
    display_name=f"deep_learning_ai_pipeline-{date}",  # Unique name for the pipeline run
    parameter_values=pipeline_arguments,  # Input parameters for the pipeline
    location=REGION,  # Execution region
    pipeline_root=pipeline_root,  # Root directory for storing intermediate files
    enable_caching=True,  # Enable caching to avoid re-running unchanged components
)

### Submit the pipeline job for execution.
job.submit()


Step 8: Check Job Status
After submitting the pipeline job, this cell allows us to check the status of the job to ensure it is running as expected.

In [None]:
### Check the current state of the submitted job.
job.state
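
Reading job.state reports the state only at that moment, so following a run to completion means reading it repeatedly. The sketch below shows a generic polling loop for that purpose. It is deliberately decoupled from the SDK so it runs anywhere: `wait_for_terminal_state` is a hypothetical helper, `get_state` is any zero-argument callable that returns a state name, and the terminal state names are assumptions modeled on Vertex AI's pipeline states.

```python
import time

# Assumed names of states after which the job no longer changes.
TERMINAL_STATES = frozenset({
    "PIPELINE_STATE_SUCCEEDED",
    "PIPELINE_STATE_FAILED",
    "PIPELINE_STATE_CANCELLED",
})


def wait_for_terminal_state(get_state, poll_seconds=60.0, max_polls=120):
    """Poll get_state() until it returns a terminal state or polls run out."""
    state = get_state()
    for _ in range(max_polls - 1):
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
        state = get_state()
    return state


# Simulated job that succeeds on the third poll, used here instead of a real job.
simulated_states = iter([
    "PIPELINE_STATE_PENDING",
    "PIPELINE_STATE_RUNNING",
    "PIPELINE_STATE_SUCCEEDED",
])
final_state = wait_for_terminal_state(lambda: next(simulated_states), poll_seconds=0.0)
print(final_state)  # PIPELINE_STATE_SUCCEEDED
```

With a real job, something like `lambda: job.state.name` could serve as `get_state`, assuming the state object exposes its name that way. The `max_polls` limit keeps the loop from running forever if the job never reaches a terminal state.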


Conclusion
This pipeline automates the tuning of a foundation model (PaLM 2) using Kubeflow Pipelines. Reusing an existing pipeline template removes most of the work of building the pipeline from scratch. The workflow covers model versioning, training, evaluation, and orchestration, which together make parameter-efficient fine-tuning (PEFT) repeatable.