
Kubernetes cluster resolver fails when running from within a K8S cluster. #70581

Open
msteiner-google opened this issue Jun 28, 2024 · 0 comments · May be fixed by #70691
Labels: comp:dist-strat (Distribution Strategy related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.16, type:bug
msteiner-google commented Jun 28, 2024

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

2.16.1

Custom code

Yes

OS platform and distribution

linux

Mobile device

No response

Python version

3.11

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

When trying to create a cluster spec from a pod running within a K8s cluster, this try block fails because it cannot find the kubectl config file.

The fix is rather straightforward:

if override_client is None:
  try:
    from kubernetes import config as k8sconfig  # pylint: disable=g-import-not-at-top

    # Only load kubeconfig when no client was supplied; in-cluster callers
    # pass override_client and never need a local kubeconfig file.
    k8sconfig.load_kube_config()
  except ImportError:
    raise ImportError('The Kubernetes Python client must be installed '
                      'before using the Kubernetes Cluster Resolver. '
                      'To install the Kubernetes Python client, run '
                      '`pip install kubernetes` on your command line.')

...

Happy to open an MR for this.
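For reference, the guarded initialization can be sketched as a standalone function (the name `make_k8s_client` is hypothetical; the actual change belongs in `KubernetesClusterResolver.__init__`). The point is that config loading is skipped entirely when an override client is supplied, so in-cluster callers never reach `load_kube_config()`:

```python
def make_k8s_client(override_client=None):
  """Return a Kubernetes API client, loading kubeconfig only when needed.

  Sketch of the proposed guard, not the actual TF implementation.
  """
  if override_client is not None:
    # Caller already configured a client, e.g. via
    # kubernetes.config.load_incluster_config(); use it as-is.
    return override_client
  try:
    # Deferred import so TF does not hard-depend on the kubernetes package.
    from kubernetes import client, config  # pylint: disable=g-import-not-at-top
  except ImportError:
    raise ImportError('The Kubernetes Python client must be installed '
                      'before using the Kubernetes Cluster Resolver. '
                      'Run `pip install kubernetes` on your command line.')
  config.load_kube_config()
  return client.CoreV1Api()
```

With this shape, the in-cluster repro below would take the early-return branch and never touch kubeconfig.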

Standalone code to reproduce the issue

main.py

import os
import tensorflow as tf

from absl import logging
from kubernetes import client, config

logging.set_verbosity(logging.DEBUG)
logging.info("TF version: %s", tf.__version__)

config.load_incluster_config()
k8s_cli = client.CoreV1Api()

# Fails here despite providing an override client for talking with the k8s APIs.
cluster_resolver = tf.distribute.cluster_resolver.KubernetesClusterResolver(
    {"worker": ["job-name=mobileye-0", "job-name=mobileye-1"]}, override_client=k8s_cli
)
task_index = int(os.environ.get("TASK_INDEX"))
cluster_resolver.task_type = "worker"
cluster_resolver.task_id = task_index

logging.info("Cluster spec: %s", cluster_resolver.cluster_spec().as_dict())
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    cluster_resolver=cluster_resolver
)

job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: mobileye-0
spec:
  template:
    metadata:
      name: mobileye-training
    spec:
      containers:
      - name: tensorflow
        image: europe-west4-docker.pkg.dev/msteiner-kubeflow/mobileye-test/test-tf-image:latest 
        resources:
          limits:
            cpu: "1"
            memory: 3Gi
        env:
          - name: TASK_INDEX
            value: "0"
      restartPolicy: Never
  parallelism: 1
---
apiVersion: batch/v1
kind: Job
metadata:
  name: mobileye-1
spec:
  template:
    metadata:
      name: mobileye-training
    spec:
      containers:
      - name: tensorflow
        image: europe-west4-docker.pkg.dev/msteiner-kubeflow/mobileye-test/test-tf-image:latest 
        resources:
          limits:
            cpu: "1"
            memory: 3Gi
        env:
          - name: TASK_INDEX
            value: "1"
      restartPolicy: Never
  parallelism: 1


Relevant log output

```shell
2024-06-28 11:23:10.979832: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-28 11:23:10.991166: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-28 11:23:11.108950: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-28 11:23:14.056964: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
INFO:absl:TF version: 2.16.1
INFO:absl:PATH: /usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
INFO:absl:HOSTNAME: mobileye-0-v7lkr
INFO:absl:LANG: C.UTF-8
INFO:absl:GPG_KEY: A035C8C19219BA821ECEA86B64E628F8D684696D
INFO:absl:PYTHON_VERSION: 3.11.8
INFO:absl:PYTHON_PIP_VERSION: 24.0
INFO:absl:PYTHON_SETUPTOOLS_VERSION: 65.5.1
INFO:absl:PYTHON_GET_PIP_URL: https://github.com/pypa/get-pip/raw/dbf0c85f76fb6e1ab42aa672ffca6f0a675d9ee4/public/get-pip.py
INFO:absl:PYTHON_GET_PIP_SHA256: dfe9fd5c28dc98b5ac17979a953ea550cec37ae1b47a5116007395bfacff2ab9
INFO:absl:TASK_INDEX: 0
INFO:absl:KUBERNETES_SERVICE_PORT: 443
INFO:absl:KUBERNETES_SERVICE_PORT_HTTPS: 443
INFO:absl:KUBERNETES_PORT: tcp://34.118.224.1:443
INFO:absl:KUBERNETES_PORT_443_TCP: tcp://34.118.224.1:443
INFO:absl:KUBERNETES_PORT_443_TCP_PROTO: tcp
INFO:absl:KUBERNETES_PORT_443_TCP_PORT: 443
INFO:absl:KUBERNETES_PORT_443_TCP_ADDR: 34.118.224.1
INFO:absl:KUBERNETES_SERVICE_HOST: 34.118.224.1
INFO:absl:HOME: /root
INFO:absl:TF2_BEHAVIOR: 1
INFO:absl:TPU_ML_PLATFORM: Tensorflow
Traceback (most recent call last):
  File "//src/main.py", line 20, in <module>
    cluster_resolver = tf.distribute.cluster_resolver.KubernetesClusterResolver(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/tensorflow/python/distribute/cluster_resolver/kubernetes_cluster_resolver.py", line 93, in __init__
    k8sconfig.load_kube_config()
  File "/usr/local/lib/python3.11/site-packages/kubernetes/config/kube_config.py", line 819, in load_kube_config
    loader = _get_kube_config_loader(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/config/kube_config.py", line 776, in _get_kube_config_loader
    raise ConfigException(
kubernetes.config.config_exception.ConfigException: Invalid kube-config file. No configuration found.
```
@google-ml-butler google-ml-butler bot added the type:bug Bug label Jun 28, 2024
@Venkat6871 Venkat6871 added TF 2.16 comp:dist-strat Distribution Strategy related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Jul 1, 2024
msteiner-google added a commit to msteiner-google/tensorflow that referenced this issue Jul 1, 2024
@msteiner-google msteiner-google linked a pull request Jul 1, 2024 that will close this issue