
How to propagate mlpipeline-metrics from custom Python function TFX component? #3094

Open
axeltidemann opened this issue Jan 20, 2021 · 22 comments

@axeltidemann
Contributor

I want to export mlpipeline-metrics from my custom Python function TFX component so that it is displayed in the KubeFlow UI, as described here: https://www.kubeflow.org/docs/pipelines/sdk/pipelines-metrics/

This is a minimal example of what I am trying to do:

import json

from tfx.dsl.component.experimental.annotations import OutputArtifact
from tfx.dsl.component.experimental.decorators import component
from tfx.types.standard_artifacts import Artifact

class Metric(Artifact):
    TYPE_NAME = 'Metric'

@component
def ShowMetric(MLPipeline_Metrics: OutputArtifact[Metric]):

    rmse_eval = 333.33

    metrics = {
        'metrics':[
            {
                'name': 'RMSE-validation',
                'numberValue': rmse_eval,
                'format': 'RAW'
            }
        ]
    }

    path = '/tmp/mlpipeline-metrics.json'
    
    with open(path, 'w') as _file:
        json.dump(metrics, _file)

    MLPipeline_Metrics.uri = path

In the KubeFlow UI, the "Run output" tab says "No metrics found for this run." However, the output artifact does show up in ML Metadata (see screenshot). Any help on how to accomplish this would be greatly appreciated. Thanks!

(Screenshot, 2021-01-20: the Metric output artifact shown in ML Metadata)
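One thing worth ruling out before patching anything: the Kubeflow metrics docs linked above constrain the metric name with a regex that, to my reading, allows only lowercase names, which would reject RMSE-validation even if the file were collected. A small validator sketch (the name pattern and the RAW/PERCENTAGE format values are assumptions taken from my reading of those docs; verify them against your KFP version):

```python
import re

# Assumed name pattern from the Kubeflow pipelines-metrics docs linked above;
# verify against the docs for your KFP version.
METRIC_NAME_RE = re.compile(r'^[a-z]([-a-z0-9]{0,62}[a-z0-9])?$')


def validate_metrics(metrics: dict) -> list:
    """Return a list of problems found in an mlpipeline-metrics dict."""
    problems = []
    for m in metrics.get('metrics', []):
        name = m.get('name', '')
        if not METRIC_NAME_RE.match(name):
            problems.append('invalid metric name: %r' % name)
        if not isinstance(m.get('numberValue'), (int, float)):
            problems.append('numberValue must be numeric: %r' % m.get('numberValue'))
        if m.get('format') not in (None, 'RAW', 'PERCENTAGE'):
            problems.append('unknown format: %r' % m.get('format'))
    return problems
```

Under this pattern, the RMSE-validation name used in the snippet above would be flagged, while rmse-validation would pass.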

@axeltidemann axeltidemann changed the title How to propagate mlpipeline-metrics from custom Python function component? How to propagate mlpipeline-metrics from custom Python function TFX component? Jan 20, 2021
@arghyaganguly arghyaganguly self-assigned this Jan 20, 2021
@arghyaganguly
Contributor

@axeltidemann, this issue seems more relevant to Kubeflow than to TFX.
Please raise it in the Kubeflow issue tracker, and confirm if that seems okay to you.
Thanks.

@axeltidemann
Contributor Author

axeltidemann commented Jan 21, 2021

@arghyaganguly But I see mlpipeline-ui-metadata showing up automatically in the KubeFlow UI, and it also comes from TFX (see https://github.com/tensorflow/tfx/blob/e0cb043ff5d3a9fc33f20b1ce6348518e68352ff/tfx/orchestration/kubeflow/base_component.py). Given that TFX is built on top of KubeFlow and I am using a TFX custom component, it must be a TFX-relevant issue, no? How would the KubeFlow team know the answers to TFX-specific questions? (There could be some overlap, of course; I am happy to stand corrected.)

@axeltidemann
Contributor Author

Sorry to bother you @jiyongjung0, but I'd really appreciate your input when you have the time. Thanks.

@jiyongjung0

I'm sorry for the late response. I'm not very familiar with the Kubeflow side and was trying to find a better person to respond. @neuromage, could you give some help on this issue?

@axeltidemann
Contributor Author

It seems mlpipeline-metrics does not get propagated at all; if it were, it would have been added to the output_artifact_paths dictionary:

'mlpipeline-ui-metadata': '/mlpipeline-ui-metadata.json',

In addition, it should have been dealt with in the container entry point, like mlpipeline-ui-metadata is:

with open('/mlpipeline-ui-metadata.json', 'w') as f:

Is there a specific reason for this omission? Or maybe something for a pull request?

@axeltidemann
Contributor Author

I tried to make changes to the source code of TFX itself (following the instructions here), where I basically implemented the changes above, i.e.

output_artifact_paths={
    'mlpipeline-ui-metadata': '/mlpipeline-ui-metadata.json',
    'mlpipeline-metrics': '/mlpipeline-metrics.json'
}

in tfx/tfx/orchestration/kubeflow/base_component.py and also hardcoded metrics and file output in tfx/tfx/orchestration/kubeflow/container_entrypoint.py like so:

metrics = {
    'metrics': [
        {
            'name': 'RMSE-validation',
            'numberValue': 777.77,
            'format': 'RAW'
        }
    ]
}

with open('/mlpipeline-metrics.json', 'w') as _file:
    json.dump(metrics, _file)

This was still not picked up by the KubeFlow UI. I assume there are some deeper changes needed, then. Maybe @neuromage can shed some light on this?
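For what it's worth, the same output_artifact_paths change can be sketched without forking TFX: KubeflowDagRunnerConfig accepts pipeline_operator_funcs, which are applied to every compiled ContainerOp. The wiring below assumes the KFP 1.x SDK (where ContainerOp.output_artifact_paths is a mutable dict) and uses names from TFX's kubeflow_dag_runner module; treat it as a sketch to verify, not a confirmed fix.

```python
# Sketch: inject the metrics artifact path from kubeflow_runner.py instead of
# forking TFX. Assumes the KFP 1.x SDK, where ContainerOp.output_artifact_paths
# is a mutable dict -- verify against your kfp version.
def add_mlpipeline_metrics(container_op):
    container_op.output_artifact_paths['mlpipeline-metrics'] = (
        '/mlpipeline-metrics.json')
    return container_op

# Hypothetical wiring in kubeflow_runner.py (names from TFX's
# kubeflow_dag_runner module; verify they exist in your TFX version):
#
# config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
#     pipeline_operator_funcs=(
#         kubeflow_dag_runner.get_default_pipeline_operator_funcs()
#         + [add_mlpipeline_metrics]),
#     ...)
```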

@neuromage

Hi @axeltidemann, those changes look correct to me.

/cc @numerology and @chensun, any ideas why the above may not be working?

@numerology

Changing output_artifact_paths in base_component.py should suffice. If that is not picked up by the UI, then it seems like a bug to me.

May I ask which KFP version you are using (both SDK and deployment)?

@axeltidemann
Contributor Author

Good question. I don't specify which KFP version to use in deployment; I use the tfx CLI. It was my assumption that it creates a Docker image from my local installation and uploads it to eu.gcr.io, and would therefore use my local KFP version, but I can't figure out how to determine which KFP version is actually used on the cluster. Is there a way to find that out?

These are my local versions, in any case:

>python
Python 3.7.9 (default, Nov 20 2020, 18:45:38) 
[Clang 12.0.0 (clang-1200.0.32.27)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tfx
>>> tfx.__version__
'0.28.0.dev'
>>> import kfp
>>> kfp.__version__
'1.3.0'

@chensun

chensun commented Feb 5, 2021

but I can't figure out how to determine which KFP version is actually used on the cluster. Is there a way to find that out?

If you have access to KFP UI, it's shown in the bottom left corner, for instance:
(screenshot of the KFP UI showing the version in the bottom left corner)

or if you have kubectl connected to your cluster, you can describe any KFP pod, for example:
kubectl describe pod ml-pipeline-76fddff986-h7hsh -n kubeflow

and the container image label (1.2.0 shown in the below output) is the KFP backend version.

Containers:
  ml-pipeline-api-server:
    Container ID:   docker://a84dc475d6b6fb6e9dc58204e58e6c606498239f38fa12145e93953458bdd045
    Image:          gcr.io/ml-pipeline/api-server:1.2.0
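The same tag can be pulled out of the describe output programmatically; a small sketch that extracts it from the first Image: line (pod names and the kubeflow namespace are deployment-specific):

```python
def kfp_backend_version(describe_output: str) -> str:
    """Extract the image tag from the first 'Image:' line of
    `kubectl describe pod` output."""
    for line in describe_output.splitlines():
        line = line.strip()
        if line.startswith('Image:'):
            # e.g. 'Image:  gcr.io/ml-pipeline/api-server:1.2.0' -> '1.2.0'
            image = line.split(None, 1)[1].strip()
            return image.rsplit(':', 1)[-1]
    raise ValueError('no Image: line found')
```

Fed the sample output above, this returns '1.2.0'.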

@axeltidemann
Contributor Author

Thanks, @chensun. The version displayed is indeed 1.0.4, and the container image label is in the YAML file in the KubeFlow UI: ml-pipeline-api-server: gcr.io/cloud-marketplace/google-cloud-ai-platform/kubeflow-pipelines/apiserver:1.0.4.

However, could it be that local changes I make to TFX are not packaged and uploaded to the KubeFlow cluster in any case?

@axeltidemann
Contributor Author

@numerology I suppose I should create a separate Docker image with my changes to TFX, push that to Docker Hub, and make the tfx CLI use that image. I see when running

tfx pipeline update --pipeline-path=kubeflow_runner.py --endpoint=$ENDPOINT

that the tensorflow/tfx:0.25.0 image is used:

[truncated]
[Skaffold] #3 [internal] load metadata for docker.io/tensorflow/tfx:0.25.0
[Skaffold] #3 sha256:0de1d35ca0abce93f6f1d57543269f062bb56777e77abd8be41593a801cd2d61
[Skaffold] #3 DONE 2.8s
[Skaffold]
[Skaffold] #7 [1/3] FROM docker.io/tensorflow/tfx:0.25.0@sha256:0700c27c6492b8b2998e7d543ca13088db8d40ef26bd5c6eec58245ff8cdec35
[Skaffold] #7 sha256:8e5e2c00eb5ed31ca14860fd9aa40e783fe78ad12be31dc9da89ddad19876dc9
[Skaffold] #7 DONE 0.0s
[truncated]

However, I cannot figure out where to set which Docker image to use. I have even tried searching the repository for "load metadata for", but no results came up. Any ideas?

@numerology

@axeltidemann Indeed, in order to do that I believe you'll need to specify the base image when running the CLI command. For example:

tfx pipeline create --pipeline-path=kubeflow_runner.py --endpoint=$ENDPOINT --build_base_image your-docker-hub-repo/your-tfx-image --build_target_image your-docker-hub-repo/your-image-for-this-pipeline

Also please refer to the help message for --build_target_image option in https://github.com/tensorflow/tfx/blob/HEAD/tfx/tools/cli/commands/pipeline.py for advanced image building options.

@easadler

I wanted to mention how important getting Kubeflow metrics support into TFX is for my team. I'm curious whether this is still an issue with the Kubeflow v2 runner? I haven't been able to try it out.

@numerology

@easadler

The Kubeflow v2 runner is still being developed. Currently it only compiles TFX DSL objects into the KFP IR spec. The visualization story for the Kubeflow v2 runner is still being discussed.

/cc @neuromage

@axeltidemann
Contributor Author

axeltidemann commented Feb 24, 2021

@numerology I was able to create a custom build-base-image of TFX with the changes I referenced above:

  1. Pulled the TFX code from GitHub.
  2. Made the changes described above.
  3. Built the image by following these instructions (in essence ./tfx/tools/docker/build_docker_image.sh).
  4. Tagged the image eu.gcr.io/my-project/custom-tfx-image and pushed it to GCR.
  5. Specified both build and target image:
    tfx pipeline create --engine kubeflow --build-target-image eu.gcr.io/my-project/my-tfx-pipeline --build-base-image eu.gcr.io/my-project/custom-tfx-image --endpoint $ENDPOINT --pipeline-path kubeflow_runner.py
  6. Verified that the changes above are applied, because I can see
    [Skaffold] Step 1/4 : FROM eu.gcr.io/my-project/custom-tfx-image
    when creating the pipeline.
  7. As another verification step, I print out the hardcoded JSON file in container_entrypoint.py after writing it, so I am sure it is successfully written.
  8. But alas, no mlpipeline-metrics in the KubeFlow UI.

The KFP SDK version is 1.4, and the KubeFlow deployment is 1.0.4; could this be an issue? It was my understanding that KubeFlow Pipelines (1.4) and the deployment of KubeFlow on Kubernetes (1.0.4) are two different things, and that comparing the version numbers is meaningless (but please correct me if I am wrong). Do you have any other ideas?

@axeltidemann
Contributor Author

@neuromage maybe you have some ideas why the above approach does not work?

@axeltidemann
Contributor Author

@neuromage @numerology sorry to bother you again, but do you have any thoughts on this?

@ConverJens
Contributor

@neuromage @numerology @axeltidemann
What is the status of this? Is it possible nowadays to export metrics and custom metadata with TFX in KubeFlow?

@axeltidemann
Contributor Author

No progress from my side; when I have time I'd like to re-try the suggestions I outlined above, just to verify.

@github-actions

github-actions bot commented Apr 6, 2023

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale label Apr 6, 2023
@axeltidemann
Contributor Author

I still haven't had the time, but I'd very much like to keep this issue open.
