Model Tuning and Building a Machine Learning Pipeline
In machine learning, model tuning is a crucial process that is often performed multiple times to optimize model performance. Building a machine learning (ML) pipeline using an open-source workflow facilitates this process, allowing for streamlined and repeatable model training and evaluation.

MLOps Workflow
An MLOps workflow integrates various stages of the machine learning lifecycle. During model development, we focus on three key components: training data, training and evaluation, and the resulting trained model. Once the model is deployed to production, we use production data alongside the trained model to make predictions and continue the evaluation process. Automating this MLOps workflow ensures continuous and efficient model improvement and deployment.

Automation, Orchestration, and Deployment
Reaching this level of automation depends on three things: orchestration, automation, and deployment. Orchestration organizes the tasks in the pipeline and ensures that each step, from data processing to model training, runs in the right order. Tools such as Airflow and Kubeflow are widely used for orchestrating ML pipelines. In this context:

Orchestration involves coordinating different tasks.
Automation ensures tasks are executed without manual intervention.
Deployment makes the model available in production for real-world use.

Components of Orchestration
Orchestration in ML pipelines can be broken down into two primary components:

Data Processing: The first component of the pipeline, responsible for collecting, cleaning, and transforming raw data into a suitable format for model training.
Model Training: The second component, where the machine learning model is trained on the processed data and fine-tuned as needed.

These components work together in an automated workflow, enabling continuous integration and deployment of machine learning models.
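
The two-component flow described above can be sketched in plain Python. This is only an illustration of steps running in sequence, not Kubeflow code; the function names (process_data, train_model, run_in_sequence) and the toy "model" that predicts the mean are assumptions made for the example.

```python
from typing import Callable, Dict, List


def process_data(raw_rows: List[str]) -> List[float]:
    """Data processing step: drop blank rows, strip whitespace, convert to numbers."""
    return [float(row.strip()) for row in raw_rows if row.strip()]


def train_model(values: List[float]) -> Dict[str, float]:
    """Model training step: 'train' a trivial model that always predicts the mean."""
    return {"mean": sum(values) / len(values)}


def run_in_sequence(steps: List[Callable], initial_input):
    """Toy orchestrator: run each step in order, feeding each output to the next step."""
    result = initial_input
    for step in steps:
        result = step(result)
    return result


model = run_in_sequence([process_data, train_model], [" 1.0", "", "2.0 ", "3.0"])
print(model)  # {'mean': 2.0}
```

A real orchestrator adds scheduling, retries, and isolation between steps, but the core idea of chaining outputs to inputs is the same.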

DSL (Domain-Specific Language) for Pipelines
A domain-specific language (DSL) is a language or syntax designed for one particular kind of task, in this case defining machine learning pipelines. In Kubeflow Pipelines, the DSL is exposed through the Python kfp.dsl module, which lets users define and configure pipeline steps (such as data preprocessing and model training). Kubeflow Pipelines then compiles these definitions into executable workflows.

Kubeflow workflows operate within a containerized environment, where each step of the pipeline runs inside a container. A container packages the code together with its dependencies and the operating-system libraries it needs, while sharing the kernel of the host machine. This allows for scalability and portability, as containers run consistently across different environments. On a managed service, users also do not have to manage servers, operating systems, or other infrastructure themselves.

Orchestration in Machine Learning
In machine learning, orchestration is vital for automating tasks such as data processing, model training, and evaluation. URIs (Uniform Resource Identifiers) are often used to pass data by location, ensuring that different pipeline components can easily access the required data. Additionally, pipelines can be parameterized to enable reuse across different datasets and tasks, such as in supervised fine-tuning.
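
As a rough illustration of passing data by location, the sketch below has one step write its result to storage and return a URI, while the next step reads from that URI. It uses only the Python standard library with local file:// URIs; the function names (preprocess, count_tokens, run_pipeline) are hypothetical, and a real pipeline would typically point at cloud storage instead.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def preprocess(raw_text: str, output_dir: Path) -> str:
    """Write cleaned data to storage and return its URI (the location, not the data)."""
    output_path = output_dir / "cleaned.txt"
    output_path.write_text(raw_text.strip().lower())
    return output_path.as_uri()


def count_tokens(data_uri: str) -> int:
    """Resolve the URI back to a local file, then read the data from that location."""
    local_path = Path(url2pathname(urlparse(data_uri).path))
    return len(local_path.read_text().split())


def run_pipeline(raw_text: str) -> int:
    """Parameterized pipeline: the same steps are reused for any input text."""
    with tempfile.TemporaryDirectory() as tmp:
        data_uri = preprocess(raw_text, Path(tmp))
        return count_tokens(data_uri)


print(run_pipeline("  Hello Pipeline World  "))          # 3
print(run_pipeline("Another dataset with five tokens"))  # 5
```

Only the short URI string travels between the steps, which is what keeps components loosely coupled when the data itself is large.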

Purpose of YAML File in Kubeflow Pipelines
Kubeflow pipelines can be compiled into a YAML file, which is a human-readable format used for configuration. The purpose of generating a YAML file in a machine learning pipeline is to provide a declarative specification of the pipeline. This file defines the structure of the pipeline, including its components, inputs, and outputs.

By using YAML files:

Reusability is enhanced, as the pipeline can be re-executed with different parameters without modifying the core logic.
Portability is achieved, as the pipeline configuration can be shared across environments.
Scalability is ensured, as the pipeline steps are containerized and can run in distributed environments, without worrying about infrastructure details.
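
The reusability point can be sketched with a toy declarative specification that is executed twice with different parameters while the step logic stays untouched. JSON is used here only because the Python standard library has no YAML parser; the spec layout, STEP_REGISTRY, and run_spec are all hypothetical and do not reflect the format that Kubeflow Pipelines generates.

```python
import json

# Hypothetical step implementations, looked up by name from the spec.
STEP_REGISTRY = {
    "greet": lambda name: f"Hello, {name}!",
    "shout": lambda text: text.upper(),
}

# Declarative description: which steps run and where each input comes from.
# (Illustrative structure only; not the file format produced by KFP.)
SPEC_TEXT = json.dumps({
    "parameters": ["recipient"],
    "steps": [
        {"name": "greet", "input": "param:recipient"},
        {"name": "shout", "input": "step:greet"},
    ],
})


def run_spec(spec_text: str, parameter_values: dict) -> str:
    """Execute the steps listed in the spec, in order, and return the last result."""
    spec = json.loads(spec_text)
    missing = [p for p in spec["parameters"] if p not in parameter_values]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    results = {}
    last_result = None
    for step in spec["steps"]:
        source, key = step["input"].split(":", 1)
        value = parameter_values[key] if source == "param" else results[key]
        last_result = STEP_REGISTRY[step["name"]](value)
        results[step["name"]] = last_result
    return last_result


print(run_spec(SPEC_TEXT, {"recipient": "World"}))  # HELLO, WORLD!
print(run_spec(SPEC_TEXT, {"recipient": "Erwin"}))  # HELLO, ERWIN!
```

The same specification text can be stored, shared, and re-run; only the parameter values change between executions.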




Kubeflow Pipelines Automation Example

This notebook demonstrates how to use Kubeflow Pipelines to orchestrate and automate a workflow. It walks through the process of defining components, building a pipeline, compiling it into a YAML file, and submitting it for execution.



Step 1: Import Necessary Libraries and Ignore Warnings
The purpose of this cell is to import the required libraries for working with Kubeflow Pipelines and set up the environment. It also ignores future warnings to prevent clutter in the output.

In [None]:
# Import required libraries from kfp
from kfp import dsl
from kfp import compiler

# Ignore FutureWarnings from Kubeflow Pipelines (kfp)
import warnings
warnings.filterwarnings("ignore", 
                        category=FutureWarning, 
                        module='kfp.*')


Step 2: Define the First Pipeline Component
This component (say_hello) is a simple function that takes a name and returns a greeting. The @dsl.component decorator makes it a pipeline component.

In [None]:
### Simple example: component 1
@dsl.component
def say_hello(name: str) -> str:
    hello_text = f'Hello, {name}!'  # Create a greeting message
    return hello_text  # Return the greeting message


Step 3: Test the First Component
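Here, we call the say_hello component and print the resulting PipelineTask object. Note that hello_task.output is a placeholder for the value the task will produce at run time, not the greeting string itself. Newer KFP releases may instead require local execution to be initialized before a component can be called outside a pipeline.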

In [None]:
# Call the component
hello_task = say_hello(name="Erwin")

# Print the PipelineTask object
print(hello_task)

# Print the task's output reference (a placeholder, not the greeting string itself)
print(hello_task.output)


Step 4: Define the Second Pipeline Component
This component (how_are_you) takes the greeting from the first component and appends a follow-up question. It also demonstrates passing the output of one component to another.

In [None]:
### Simple example: component 2
@dsl.component
def how_are_you(hello_text: str) -> str:
    how_are_you_text = f"{hello_text}. How are you?"  # Append a question
    return how_are_you_text  # Return the modified message


Step 5: Pass the Output of One Component to Another
This step passes the output of the say_hello component to the how_are_you component. Pass hello_task.output rather than hello_task itself: the component expects a string input, so passing the PipelineTask object is expected to raise a type error.

In [None]:
# Pass the output of say_hello to how_are_you
how_task = how_are_you(hello_text=hello_task.output)

# Print the PipelineTask object and its output
print(how_task)
print(how_task.output)
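
Outside a running pipeline, the task objects printed above are not expected to contain the final strings. One way to check the string logic itself is to call undecorated copies of the function bodies as ordinary Python. The names say_hello_fn and how_are_you_fn are hypothetical copies introduced for this check; they are not part of the pipeline.

```python
def say_hello_fn(name: str) -> str:
    """Same body as the say_hello component, without the @dsl.component decorator."""
    return f'Hello, {name}!'


def how_are_you_fn(hello_text: str) -> str:
    """Same body as the how_are_you component, without the decorator."""
    return f"{hello_text}. How are you?"


greeting = say_hello_fn("Erwin")
print(greeting)                  # Hello, Erwin!
print(how_are_you_fn(greeting))  # Hello, Erwin!. How are you?
```

The second line of output shows that the greeting's exclamation mark is followed by the period added in how_are_you, which is easy to miss when only placeholders are printed.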


Step 6: Define the Pipeline
This cell defines a simple pipeline (hello_pipeline) that orchestrates the execution of the two components. The pipeline first executes say_hello and then passes its output to how_are_you.



In [None]:
### Simple example: pipeline
@dsl.pipeline
def hello_pipeline(recipient: str) -> str:
    # First task: say hello
    hello_task = say_hello(name=recipient)
    
    # Second task: ask how the recipient is, using the output from hello_task
    how_task = how_are_you(hello_text=hello_task.output)
    
    # Return the output of how_task
    return how_task.output


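Step 7: Avoid Returning the Task Instead of Its Output
This step demonstrates what happens when a pipeline returns a PipelineTask object instead of the task's output. Because the pipeline is declared to return str, returning how_task rather than how_task.output is expected to raise an error when this cell runs.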

In [None]:
### Pipeline with wrong return value type
@dsl.pipeline
def hello_pipeline_with_error(recipient: str) -> str:
    hello_task = say_hello(name=recipient)
    how_task = how_are_you(hello_text=hello_task.output)
    
    # Returning the PipelineTask object instead of the output will give an error
    return how_task


Step 8: Compile the Pipeline
This cell compiles the hello_pipeline into a YAML file (pipeline.yaml). The YAML file defines the pipeline structure and is used to execute it in a managed environment.

In [None]:
# Compile the pipeline into a YAML file
compiler.Compiler().compile(hello_pipeline, 'pipeline.yaml')


Step 9: Define Pipeline Arguments
Define the input arguments that will be passed to the pipeline when it is executed. In this case, we pass "World!" as the pipeline's recipient parameter, which the pipeline forwards to the say_hello component as name.

In [None]:
# Define pipeline arguments
pipeline_arguments = {
    "recipient": "World!",
}
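
Before submitting, it can help to confirm that the argument names match the pipeline's parameters. The sketch below does this with the standard inspect module against a plain function that mirrors the signature of hello_pipeline. Both hello_pipeline_signature and check_arguments are hypothetical helpers introduced here, not part of KFP, and the check treats every parameter as required.

```python
import inspect


def hello_pipeline_signature(recipient: str) -> str:
    """Plain stand-in with the same parameters as hello_pipeline (hypothetical)."""
    ...


def check_arguments(func, arguments: dict) -> None:
    """Raise ValueError if argument names do not match the function's parameters."""
    expected = set(inspect.signature(func).parameters)
    provided = set(arguments)
    missing = expected - provided
    unexpected = provided - expected
    if missing or unexpected:
        raise ValueError(
            f"missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )


pipeline_arguments = {"recipient": "World!"}
check_arguments(hello_pipeline_signature, pipeline_arguments)  # passes silently
```

A misspelled key such as "recipent" would raise a ValueError here, before any job is submitted.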


Step 10: View the Generated Pipeline YAML
This cell displays the contents of the pipeline.yaml file, which was generated from the pipeline definition.

In [None]:
# View the generated pipeline.yaml file
!cat pipeline.yaml


Step 11: Submit the Pipeline to Vertex AI
This cell provides the code needed to submit the pipeline to Vertex AI Pipelines for execution. It submits the pipeline YAML and checks the job status.

In [None]:
### import `PipelineJob`
from google.cloud.aiplatform import PipelineJob

# Create a PipelineJob object to execute the YAML pipeline
job = PipelineJob(
    ### Path of the YAML file to execute
    template_path="pipeline.yaml",
    ### Name of the pipeline
    display_name="deep_learning_ai_pipeline",
    ### Pipeline arguments (inputs)
    parameter_values=pipeline_arguments,
    ### Region of execution
    location="us-central1",
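    ### Root location where pipeline outputs and artifacts are stored
    ### (Vertex AI normally expects a Cloud Storage URI of the form gs://...)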
    pipeline_root="./",
)

# Submit the job for execution
job.submit()

# Check the job status
job.state


Conclusion
This notebook demonstrated the creation of a simple Kubeflow Pipeline. It showed how to define reusable components, create a pipeline, and compile it into a YAML file. The YAML file can then be submitted to Vertex AI Pipelines for managed, serverless execution.