tensorflow[and-cuda] 2.15.0/2.15.1 compatibility with jax[cuda12] #68290

Closed
attaluris opened this issue May 20, 2024 · 8 comments
Labels: stale, stat:awaiting response, TF 2.15, type:bug

@attaluris

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

2.15.0/2.15.1

Custom code

Yes

OS platform and distribution

Debian Bullseye

Mobile device

No response

Python version

3.9/3.10

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

12.2

GPU model and memory

V100

Current behavior?

Hey y'all! I think tensorflow[and-cuda] is incompatible with jax[cuda12] and just wanted to clarify if this was expected.
The solve error I'm getting is:

Because tensorflow[and-cuda] (2.15.0) depends on nvidia-nccl-cu12 (2.16.5)
 and jax[cuda12] (0.4.23) depends on nvidia-nccl-cu12 (>=2.18.3), tensorflow[and-cuda] (2.15.0) is incompatible with jax[cuda12] (0.4.23).
So, because hex-packages depends on both jax[cuda12] (0.4.23) and tensorflow[and-cuda] (2.15.0), version solving failed.

None of the jax[cuda12] versions with GPU support accept nvidia-nccl-cu12==2.16.5. Does this requirement need to be a hard pin, or can it be loosened to accommodate higher versions of nvidia-nccl-cu12?
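
To make the conflict concrete, here is a minimal sketch of the comparison the resolver is effectively making. The two version strings are copied from the solver message above; the helper names (`parse_version`, `pin_satisfies_minimum`) are hypothetical and only handle plain dotted numeric versions, not the full PEP 440 grammar.

```python
def parse_version(text):
    """Turn '2.16.5' into (2, 16, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def pin_satisfies_minimum(pinned, minimum):
    """Return True if an exact pin meets a '>=' lower bound."""
    return parse_version(pinned) >= parse_version(minimum)


# Values taken from the solver message, not from package metadata.
TF_NCCL_PIN = "2.16.5"       # tensorflow[and-cuda] 2.15.0 pins this exactly
JAX_NCCL_MINIMUM = "2.18.3"  # jax[cuda12] 0.4.23 requires at least this

# The exact pin sits below the lower bound, so no single version fits both.
print(pin_satisfies_minimum(TF_NCCL_PIN, JAX_NCCL_MINIMUM))
```

Because one side is an exact pin, the only way for the resolver to succeed is for that pinned version to fall inside the other package's allowed range.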

Standalone code to reproduce the issue

[tool.poetry]
name = "test-jax-and-poetry"
version = "0.1.0"
description = ""
authors = ["Tim Nonet <tnonet@hex.tech>"]
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.9,<3.11"
tensorflow = { "version" = "2.15.1", extras = ["and-cuda"] }
jax = {"version" = "*", extras = ["cuda12"] }


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Relevant log output

Because tensorflow[and-cuda] (2.15.0) depends on nvidia-nccl-cu12 (2.16.5)
 and jax[cuda12] (0.4.23) depends on nvidia-nccl-cu12 (>=2.18.3), tensorflow[and-cuda] (2.15.0) is incompatible with jax[cuda12] (0.4.23).
So, because hex-packages depends on both jax[cuda12] (0.4.23) and tensorflow[and-cuda] (2.15.0), version solving failed.

whenever I use a version of jax that has the cuda12 extra

@google-ml-butler google-ml-butler bot added the type:bug Bug label May 20, 2024
@sushreebarsa sushreebarsa added the TF 2.15 For issues related to 2.15.x label May 21, 2024
@sushreebarsa
Contributor

@attaluris TensorFlow[and-cuda] 2.15.0/2.15.1 is likely not compatible with jax[cuda12]. There's a version mismatch with respect to the NVIDIA NCCL library, a component needed for GPU support in both TensorFlow and JAX.
TensorFlow 2.15.0/2.15.1 might depend on an older NCCL version (e.g., nvidia-nccl-cu12 version 2.16.5).
For any further queries, please raise an issue in the JAX repository.

Thank you!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label May 21, 2024
@attaluris
Author

attaluris commented May 21, 2024

@sushreebarsa Is there a way we could loosen the strict requirement in tensorflow from =2.16.5 to >2.16.5?

I also raised an issue in the Jax repo

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label May 21, 2024
@sushreebarsa
Contributor

@attaluris Could you try to update tensorflow to 2.16.1 and jax to 0.4.28?
TF 2.16.1 uses nvidia-nccl-cu12==2.19.3 (from tensorflow[and-cuda]) which is compatible with JAX's requirement of >=2.18.3.

Thank you!
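
Before and after an upgrade like this, it can help to check which versions actually ended up in the environment. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution names are the ones mentioned in this thread plus `jaxlib`, which is an assumption about what is installed alongside `jax`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names as they appear on PyPI.
for name in ("tensorflow", "jax", "jaxlib", "nvidia-nccl-cu12"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If `nvidia-nccl-cu12` reports a version below 2.18.3, the JAX requirement quoted in the solver message is not met.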

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label May 22, 2024
@attaluris
Author

@sushreebarsa thanks for the detail! 👀
I'm seeing the same issue as #63362 with tensorflow 2.16.1, so I'm using 2.15.1.
Can I get confirmation that the tensorflow 2.16.1 installation will be fixed, and maybe a timeline? Thanks in advance!

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label May 22, 2024
@sushreebarsa
Contributor

@attaluris On an Ubuntu 20.04 LTS machine with a Tesla P100 GPU, we installed the CUDA build of JAX 0.4.28 first and then installed TensorFlow 2.15.1. Both installed successfully, and both JAX and TensorFlow were able to detect the GPU. Please have a look at the screenshot below for reference.
[Screenshot: image (4)]

Thank you!
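
For anyone wanting to repeat the GPU detection check, a rough sketch follows. It is not the exact check shown in the screenshot, which is not reproduced here. It relies on `tf.config.list_physical_devices("GPU")` and `jax.devices()`; the wrapper functions `tensorflow_gpus` and `jax_gpus` are hypothetical names, and each returns None when its library is not importable.

```python
def tensorflow_gpus():
    """List GPU device names visible to TensorFlow, or None if TF is absent."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


def jax_gpus():
    """List GPU devices visible to JAX, or None if JAX is absent."""
    try:
        import jax
    except ImportError:
        return None
    # jax.devices() falls back to CPU devices when no GPU backend is usable,
    # so filter on the platform name.
    return [str(device) for device in jax.devices() if device.platform == "gpu"]


if __name__ == "__main__":
    print("TensorFlow GPUs:", tensorflow_gpus())
    print("JAX GPUs:", jax_gpus())
```

An empty list means the library imported but found no GPU, which usually points to a CUDA library mismatch rather than a missing package.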

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label May 30, 2024

github-actions bot commented Jun 7, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jun 7, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

