<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/MLOps/Pipelines/Vertex%20AI%20Pipelines%20-%20GCS%20Read%20and%20Write.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FMLOps%2FPipelines%2FVertex%2520AI%2520Pipelines%2520-%2520GCS%2520Read%2520and%2520Write.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/MLOps/Pipelines/Vertex%20AI%20Pipelines%20-%20GCS%20Read%20and%20Write.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/MLOps/Pipelines/Vertex%20AI%20Pipelines%20-%20GCS%20Read%20and%20Write.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

---
This is part of a [series of notebook-based workflows](./readme.md) that teach all the ways to use pipelines within Vertex AI. The suggested order and description/reason is:

||Notebook Workflow|Description|
|---|---|---|
||[Vertex AI Pipelines - Start Here](./Vertex%20AI%20Pipelines%20-%20Start%20Here.ipynb)|What are pipelines? Start here to go from code to pipeline and see it in action.|
||[Vertex AI Pipelines - Introduction](./Vertex%20AI%20Pipelines%20-%20Introduction.ipynb)|Introduction to pipelines with the console and Vertex AI SDK|
||[Vertex AI Pipelines - Components](./Vertex%20AI%20Pipelines%20-%20Components.ipynb)|An introduction to all the ways to create pipeline components from your code|
||[Vertex AI Pipelines - IO](./Vertex%20AI%20Pipelines%20-%20IO.ipynb)|An overview of all the types of inputs and outputs for pipeline components|
||[Vertex AI Pipelines - Control](./Vertex%20AI%20Pipelines%20-%20Control.ipynb)|An overview of controlling the flow of execution for pipelines|
||[Vertex AI Pipelines - Secret Manager](./Vertex%20AI%20Pipelines%20-%20Secret%20Manager.ipynb)|How to pass sensitive information to pipelines and components|
|_**This Notebook**_|[Vertex AI Pipelines - GCS Read and Write](./Vertex%20AI%20Pipelines%20-%20GCS%20Read%20and%20Write.ipynb)|How to read/write to GCS from components, including container components.|
||[Vertex AI Pipelines - Scheduling](./Vertex%20AI%20Pipelines%20-%20Scheduling.ipynb)|How to schedule pipeline execution|
||[Vertex AI Pipelines - Notifications](./Vertex%20AI%20Pipelines%20-%20Notifications.ipynb)|How to send email notifications of pipeline status.|
||[Vertex AI Pipelines - Management](./Vertex%20AI%20Pipelines%20-%20Management.ipynb)|Managing, Reusing, and Storing pipelines and components|
||[Vertex AI Pipelines - Testing](./Vertex%20AI%20Pipelines%20-%20Testing.ipynb)|Strategies for testing components and pipelines locally and remotely to aid development.|
||[Vertex AI Pipelines - Managing Pipeline Jobs](./Vertex%20AI%20Pipelines%20-%20Managing%20Pipeline%20Jobs.ipynb)|Manage runs of pipelines in an environment: list, check status, filtered list, cancel and delete jobs.|

To discover these notebooks as part of an introduction to MLOps orchestration [start here](./readme.md).  To read more about MLOps also check out [the parent folder](../readme.md).

---

# Vertex AI Pipelines - GCS Read and Write With Fuse Mount

A pipeline job executes each component instance (task) as a Vertex AI Custom Training Job.  A core feature of these training jobs is that they automatically set up [Cloud Storage as a mounted file system](https://cloud.google.com/vertex-ai/docs/training/cloud-storage-file-system). This workflow examines how to interact with data stored in GCS from these jobs.

**Workflow:**
- Create a lightweight Python component, essentially a Python function, that reads and writes to GCS during execution
- Create a container component that runs an input Python script which reads and writes to GCS during execution
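
Both components in this workflow address files through the `/gcs/<bucket>/<object path>` convention of the mounted file system. As a minimal sketch of that mapping, the helper below translates a `gs://` URI into the matching mount path. The function name `gcs_uri_to_fuse_path` and the bucket name `my-bucket` are hypothetical and used for illustration only.

```python
def gcs_uri_to_fuse_path(uri: str, mount_root: str = "/gcs") -> str:
    """Translate gs://bucket/object into the matching path under the mount root."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a gs:// URI, got: {uri!r}")
    # everything after gs:// is <bucket>/<object path>, which maps directly under the mount root
    return f"{mount_root}/{uri[len(prefix):]}"


print(gcs_uri_to_fuse_path("gs://my-bucket/mlops/pipeline-gcs-data/example_instance.txt"))
```

The reverse direction is the same string operation, which can be handy when a component needs to report the `gs://` location of a file it wrote through the mount.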

---
## Colab Setup

To run this notebook in Colab run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
    print('Colab authorized to GCP')
except Exception:
    print('Not a Colab Environment')
    pass

Not a Colab Environment


---
## Installs

The list `packages` contains tuples of package import names and install names.  If the import name is not found, the install name is used to install the package quietly for the current user.

In [3]:
# tuples of (import name, install name, optional min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform'),
    ('google.cloud.storage', 'google-cloud-storage'),
    ('kfp', 'kfp'),
]

import importlib.util, importlib.metadata
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # look up the installed version by install (distribution) name
        # note: this is a string comparison, so treat it as a rough check
        if importlib.metadata.version(package[1]) < package[2]:
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user
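
The optional minimum-version check in the cell above compares version strings directly, which can misorder releases such as `1.9.0` and `1.10.0`. The sketch below shows one way to compare numerically instead. The helper `version_tuple` is hypothetical, is not part of the notebook, and only handles plain dotted numeric versions.

```python
def version_tuple(version: str) -> tuple:
    """Convert '1.78.0' to (1, 78, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


# string comparison goes character by character, so '9' sorts after '1'
print("1.9.0" < "1.10.0")

# tuple comparison goes number by number
print(version_tuple("1.9.0") < version_tuple("1.10.0"))
```

The first comparison prints `False` and the second prints `True`. For versions with pre-release or other suffixes, a dedicated version parser would be a safer choice than this sketch.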

### API Enablement

In [4]:
!gcloud services enable aiplatform.googleapis.com

### Restart Kernel (If Installs Occurred)

After the kernel restarts, continue from the cell that follows this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
SERIES = 'mlops'
EXPERIMENT = 'pipeline-gcs-data'

# gcs bucket
GCS_BUCKET = PROJECT_ID

Packages

In [8]:
import os, datetime

from google.cloud import aiplatform
from google.cloud import storage
import kfp

In [9]:
kfp.__version__

'2.12.1'

In [10]:
aiplatform.__version__

'1.78.0'

Clients

In [11]:
# vertex ai clients
aiplatform.init(project = PROJECT_ID, location = REGION)

# gcs clients
gcs = storage.Client(project = PROJECT_ID)
bucket = gcs.bucket(GCS_BUCKET)

Parameters

In [12]:
DIR = f"temp/{SERIES}-{EXPERIMENT}"

In [13]:
SERVICE_ACCOUNT = !gcloud config list --format='value(core.account)' 
SERVICE_ACCOUNT = SERVICE_ACCOUNT[0]
SERVICE_ACCOUNT

'1026793852137-compute@developer.gserviceaccount.com'

Environment

In [14]:
if not os.path.exists(DIR):
    os.makedirs(DIR)

---
## Example File In GCS

Your component might need to read data, like training records, from a GCS bucket.  The following code creates an example file `example_instance.txt` to use in this workflow.

In [15]:
example_str = 'This is my example text instance as of ' + datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
example_str

'This is my example text instance as of 2025-04-02 19:26:20'

In [16]:
blob = bucket.blob(f'{SERIES}/{EXPERIMENT}/example_instance.txt')
blob.upload_from_string(example_str)
blob.name

'mlops/pipeline-gcs-data/example_instance.txt'

In [17]:
[b.name for b in bucket.list_blobs(prefix = f'{SERIES}/{EXPERIMENT}/example')]

['mlops/pipeline-gcs-data/example_instance.txt']

---
## Lightweight Component That Reads/Writes To GCS with Fuse Mount

Components run as Vertex AI custom training jobs which already have [Cloud Storage as a mounted file system](https://cloud.google.com/vertex-ai/docs/training/cloud-storage-file-system).

> **Note:** a Fuse Mount is not a POSIX file system (see [limitations](https://cloud.google.com/storage/docs/cloud-storage-fuse/overview#differences-and-limitations)).  Some methods may not work correctly with direct read/write, such as exporting model files in some frameworks.  A potential workaround is to first write locally, then copy to the fuse mount location.
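
The write-locally-then-copy workaround from the note can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name `write_then_copy` is hypothetical, and in the demo a temporary directory stands in for the `/gcs/<bucket>/...` location so the code can run anywhere.

```python
import os
import shutil
import tempfile


def write_then_copy(data: bytes, destination: str) -> str:
    """Write data to local scratch space first, then copy the finished file to destination."""
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    with tempfile.TemporaryDirectory() as scratch:
        local_path = os.path.join(scratch, os.path.basename(destination))
        # any framework-specific export would write here, on the local disk
        with open(local_path, "wb") as f:
            f.write(data)
        # the destination only ever receives one copy of the finished file
        shutil.copyfile(local_path, destination)
    return destination


# demo: a temporary directory stands in for the fuse mount location
with tempfile.TemporaryDirectory() as fake_mount:
    target = os.path.join(fake_mount, "my-bucket", "models", "model.bin")
    write_then_copy(b"model bytes", target)
    with open(target, "rb") as f:
        print(f.read())
```

Inside a component, `destination` would be a path under `/gcs/`, for example one built from the bucket and path parameters used in this notebook.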

### Create Component: Read/Write GCS With Fuse Mount

In [18]:
name_str = f'{SERIES}-{EXPERIMENT}-gcs-fuse'
name_str

'mlops-pipeline-gcs-data-gcs-fuse'

In [19]:
@kfp.dsl.component(
    base_image = "python:3.11"
)
def gcs_fuse(
    instance_bucket: str,
    instance_path: str,
    instance_file: str
) -> str:
    
    import datetime
    
    # read from GCS
    with open(f'/gcs/{instance_bucket}/{instance_path}/{instance_file}', 'r') as f:
        instance = f.read()
    
    # write to GCS
    with open(f'/gcs/{instance_bucket}/{instance_path}/gcs_fuse.txt', 'w') as f:
        f.write(
            'Successfully used GCS as a mounted file system to create this file at ' + datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        )
        
    return instance
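
Because the component only uses ordinary file operations, the same read/write steps can be tried locally before submitting a pipeline. The sketch below assumes a hypothetical function `read_and_write` that takes the mount root as a parameter, with a temporary directory standing in for `/gcs`. It is not part of the component above, and the bucket name `my-bucket` is a placeholder.

```python
import datetime
import os
import tempfile


def read_and_write(mount_root: str, bucket: str, path: str, file: str) -> str:
    """Same read/write steps as the component, with the mount root as a parameter."""
    folder = os.path.join(mount_root, bucket, path)
    # read the input file
    with open(os.path.join(folder, file), "r") as f:
        instance = f.read()
    # write an output file next to it
    with open(os.path.join(folder, "gcs_fuse.txt"), "w") as f:
        f.write("Created at " + datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    return instance


with tempfile.TemporaryDirectory() as fake_mount:
    folder = os.path.join(fake_mount, "my-bucket", "mlops", "pipeline-gcs-data")
    os.makedirs(folder)
    with open(os.path.join(folder, "example_instance.txt"), "w") as f:
        f.write("hello from a local file")
    print(read_and_write(fake_mount, "my-bucket", "mlops/pipeline-gcs-data", "example_instance.txt"))
    print(sorted(os.listdir(folder)))
```

A local run like this checks the path construction and file handling only. It does not exercise the fuse mount itself or the permissions of the service account.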

### Compile and Run Component On Vertex AI Pipelines

In [20]:
kfp.compiler.Compiler().compile(
    pipeline_func = gcs_fuse,
    package_path = f'{DIR}/{name_str}.yaml',
    pipeline_name = name_str
)

In [21]:
pipeline_job = aiplatform.PipelineJob(
    display_name = name_str,
    template_path = f"{DIR}/{name_str}.yaml",
    parameter_values = dict(
        instance_bucket = GCS_BUCKET,
        instance_path = f'{SERIES}/{EXPERIMENT}',
        instance_file = 'example_instance.txt'
    ),
    pipeline_root = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/pipeline_root',
    enable_caching = False # True (enabled), False (disable), None (defer to component level caching) 
)

In [22]:
response = pipeline_job.submit(
    service_account = SERVICE_ACCOUNT
)

Creating PipelineJob
PipelineJob created. Resource name: projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-20250402192620
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-20250402192620')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/mlops-pipeline-gcs-data-gcs-fuse-20250402192620?project=1026793852137


In [23]:
pipeline_job.wait()

PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-20250402192620 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-20250402192620 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-20250402192620 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-20250402192620 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob run completed. Resource name: projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-20250402192620


### Review Outputs From Run:

- The pipeline (component) should output the text of the example instance (created earlier) that it reads.
- The pipeline should create/update an output file in GCS.

In [24]:
aiplatform.get_pipeline_df(pipeline = name_str)['param.output:Output'][0]

'This is my example text instance as of 2025-04-02 19:26:20'

In [25]:
output_blob = bucket.blob(f'{SERIES}/{EXPERIMENT}/gcs_fuse.txt')
output_blob.download_as_bytes().decode('utf-8')

'Successfully used GCS as a mounted file system to create this file at 2025-04-02 19:27:36'

---
## Container Components That Read/Write To GCS Fuse Mount

One way or the other, code ends up in the container to be executed.  It might be built into the container with `docker build` or a service like Cloud Build.  Or, it might be provided as an input via commands or args.  The approach here supplies a Python script to a container at run time.  The script expects the fuse mount at the `/gcs` path.

### Create Script That Uses Fuse Mount To Read/Write

In [26]:
name_str = f'{SERIES}-{EXPERIMENT}-gcs-fuse-container'
name_str

'mlops-pipeline-gcs-data-gcs-fuse-container'

In [37]:
example_script = """
import argparse
import os
import datetime

# parse command line arguments into local variables
parser = argparse.ArgumentParser()
# dest: attribute name for the parsed value, type: type to convert to, help: description of the argument
parser.add_argument('--bucket', dest = 'instance_bucket', type = str, help = 'GCS Bucket name')
parser.add_argument('--path', dest = 'instance_path', type = str, help = 'Path to file')
parser.add_argument('--name', dest = 'instance_name', type = str, help = 'Filename')
args = parser.parse_args()

# read from GCS
with open(f'/gcs/{args.instance_bucket}/{args.instance_path}/{args.instance_name}', 'r') as f:
    instance = f.read()

# write to GCS
with open(f'/gcs/{args.instance_bucket}/{args.instance_path}/gcs_fuse.txt', 'w') as f:
    f.write(
        'Successfully used GCS as a mounted file system to create this file at ' + datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    )
    
print(instance)
"""

### Create Pipeline: Run Python Script That Uses GCS With Fuse Mount



In [143]:
@kfp.dsl.component(
    base_image = 'python:3.11'
)
def example_job(
    args: list,
    libs: list,
    script: str
) -> kfp.dsl.Artifact:
    
    import base64
    
    # libs
    if libs:
        install_command = f"python -m pip install --upgrade pip && python -m pip install {' '.join(libs)}"
    else:
        install_command = ''
    
    # args
    script_args = ' '.join(args)
    
    # script
    script_bytes = script.encode('utf-8')
    encoded_script = base64.b64encode(script_bytes).decode('utf-8')
    
    # output artifact
    job = kfp.dsl.Artifact(
        metadata = dict(
            install_command = install_command,
            script_args = script_args,
            encoded_script = encoded_script
        )
    )
    
    return job

In [216]:
@kfp.dsl.container_component
def example_container(
    job: kfp.dsl.Input[kfp.dsl.Artifact],
    instance: kfp.dsl.OutputPath(str)
):

    return kfp.dsl.ContainerSpec(
        image = 'python:3.12-alpine3.19',
        command = [
            'sh',
            '-c',
            f'''
            {job.metadata['install_command']}\
            && mkdir -p $(dirname $0)\
            && echo {job.metadata['encoded_script']} | base64 -d > script.py\
            && python script.py {job.metadata['script_args']} > $0
            '''
        ],
        args = [instance]
    )
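
The two components above pass the script as base64 text and decode it back into `script.py` inside the container. Base64 output contains no quotes or newlines, which presumably is why it is used here: the text can be dropped into a shell command without escaping. The round trip can be checked locally with a sketch like the one below, where the sample script and the `--bucket=my-bucket` argument are placeholders rather than the notebook's values.

```python
import base64
import os
import subprocess
import sys
import tempfile

script = "import sys\nprint('args:', sys.argv[1:])\n"

# encode: the same steps the first component uses before storing the script in metadata
encoded_script = base64.b64encode(script.encode("utf-8")).decode("utf-8")

# decode and run: a local stand-in for `echo ... | base64 -d > script.py && python script.py ...`
with tempfile.TemporaryDirectory() as workdir:
    script_path = os.path.join(workdir, "script.py")
    with open(script_path, "wb") as f:
        f.write(base64.b64decode(encoded_script))
    result = subprocess.run(
        [sys.executable, script_path, "--bucket=my-bucket"],
        capture_output=True,
        text=True,
        check=True,
    )

print(result.stdout)
```

The decoded file is byte-for-byte the original script, so anything that runs locally this way should behave the same once decoded in the container, given the same Python version and installed libraries.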

In [217]:
@kfp.dsl.pipeline()
def example_pipeline(
    libs: list,
    args: list,
    script: str
) -> str:
    
    job_op = example_job(
        args = args,
        libs = libs,
        script = script
    )
    
    run_op = example_container(job = job_op.output)
    
    return run_op.outputs['instance']

### Compile and Run Pipeline On Vertex AI Pipelines

In [218]:
kfp.compiler.Compiler().compile(
    pipeline_func = example_pipeline,
    package_path = f'{DIR}/{name_str}.yaml',
    pipeline_name = name_str
)

In [219]:
CMDARGS = [
    f"--bucket='{GCS_BUCKET}'",
    f"--path='{SERIES}/{EXPERIMENT}'",
    f"--name='example_instance.txt'"
]
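
The values in `CMDARGS` are wrapped in single quotes because the list is joined into one string and passed through `sh -c`, where the shell strips the quotes before the script sees the arguments. As a rough local approximation, `shlex.split` applies POSIX-style quoting rules to the joined string. The bucket name `my-bucket` below is a placeholder, and the parser mirrors the one in the example script.

```python
import argparse
import shlex

cmdargs = [
    "--bucket='my-bucket'",
    "--path='mlops/pipeline-gcs-data'",
    "--name='example_instance.txt'",
]

# the first component joins the list with spaces before it reaches the shell
script_args = " ".join(cmdargs)

# split the way a POSIX shell would, which removes the single quotes
argv = shlex.split(script_args)
print(argv)

parser = argparse.ArgumentParser()
parser.add_argument("--bucket", dest="instance_bucket", type=str)
parser.add_argument("--path", dest="instance_path", type=str)
parser.add_argument("--name", dest="instance_name", type=str)
args = parser.parse_args(argv)
print(args.instance_bucket, args.instance_path, args.instance_name)
```

Quoting matters most when a value contains spaces or shell metacharacters. Without the quotes, such a value would be split or interpreted by the shell before `argparse` runs.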

In [220]:
pipeline_job = aiplatform.PipelineJob(
    display_name = name_str,
    template_path = f"{DIR}/{name_str}.yaml",
    parameter_values = dict(
        libs = ['numpy'],
        args = CMDARGS,
        script = example_script
    ),
    pipeline_root = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/pipeline_root',
    enable_caching = False # True (enabled), False (disable), None (defer to component level caching) 
)

In [221]:
response = pipeline_job.submit(
    service_account = SERVICE_ACCOUNT
)

Creating PipelineJob
PipelineJob created. Resource name: projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-container-20250403001252
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-container-20250403001252')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/mlops-pipeline-gcs-data-gcs-fuse-container-20250403001252?project=1026793852137


In [222]:
pipeline_job.wait()

PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-container-20250403001252 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-container-20250403001252 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-container-20250403001252 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob run completed. Resource name: projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-gcs-data-gcs-fuse-container-20250403001252


### Review Outputs From Run:

- The pipeline (component) should output the text of the example instance (created earlier) that it reads.
- The pipeline should create/update an output file in GCS.

In [224]:
aiplatform.get_pipeline_df(pipeline = name_str)['param.output:Output'][0]

'This is my example text instance as of 2025-04-02 19:26:20\n'

In [225]:
output_blob = bucket.blob(f'{SERIES}/{EXPERIMENT}/gcs_fuse.txt')
output_blob.download_as_bytes().decode('utf-8')

'Successfully used GCS as a mounted file system to create this file at 2025-04-03 00:13:58'