[tune] AMD-Instinct-MI250X falsely shown as unused #45684

Open
gregordecristoforo opened this issue Jun 3, 2024 · 2 comments
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling P3 Issue moderate in impact or severity

Comments

@gregordecristoforo

What happened + What you expected to happen

I am using Ray Tune on the LUMI supercomputer, on one whole GPU node. The node contains four AMD MI250X GPUs with two GPU dies each, so Ray sees 8 GPUs.

The output of the script contains the following:

Logical resource usage: 56.0/56 CPUs, 8.0/8 GPUs (0.0/1.0 accelerator_type:AMD-Instinct-MI250X)

which shows that all the GPUs are utilized as intended (checking with rocm-smi gives the same result). However, the statement 0.0/1.0 accelerator_type:AMD-Instinct-MI250X is clearly incorrect. Shouldn't it show 1/1 or even 8/8 accelerator_type:AMD-Instinct-MI250X? Please let me know if any additional information is required.

Versions / Dependencies

Ray: 2.12.0
Python: 3.11.9
OS: Linux 5.14.21-150400.24.81_12.0.75-cray_shasta_c x86_64

The whole conda environment looks as follows:

# packages in environment at /opt/conda/envs/conda_container_env:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
accelerate 0.29.0 pypi_0 pypi
aiohttp 3.9.5 py311h459d7ec_0 conda-forge
aiohttp-cors 0.7.0 py_0 conda-forge
aiosignal 1.3.1 pyhd8ed1ab_0 conda-forge
annotated-types 0.7.0 pyhd8ed1ab_0 conda-forge
async-timeout 4.0.3 pyhd8ed1ab_0 conda-forge
attrs 23.2.0 pyh71513ae_0 conda-forge
aws-c-auth 0.7.11 h0b4cabd_1 conda-forge
aws-c-cal 0.6.9 h14ec70c_3 conda-forge
aws-c-common 0.9.12 hd590300_0 conda-forge
aws-c-compression 0.2.17 h572eabf_8 conda-forge
aws-c-event-stream 0.4.1 h97bb272_2 conda-forge
aws-c-http 0.8.0 h9129f04_2 conda-forge
aws-c-io 0.14.0 hf8f278a_1 conda-forge
aws-c-mqtt 0.10.1 h2b97f5f_0 conda-forge
aws-c-s3 0.4.9 hca09fc5_0 conda-forge
aws-c-sdkutils 0.1.13 h572eabf_1 conda-forge
aws-checksums 0.1.17 h572eabf_7 conda-forge
aws-crt-cpp 0.26.0 h04327c0_8 conda-forge
aws-sdk-cpp 1.11.210 hba3e011_10 conda-forge
brotli-python 1.1.0 py311hb755f60_1 conda-forge
bzip2 1.0.8 hd590300_5 conda-forge
c-ares 1.28.1 hd590300_0 conda-forge
ca-certificates 2024.2.2 hbcca054_0 conda-forge
cachetools 5.3.3 pyhd8ed1ab_0 conda-forge
certifi 2024.2.2 pyhd8ed1ab_0 conda-forge
cffi 1.16.0 py311hb3a22ac_0 conda-forge
charset-normalizer 3.3.2 pyhd8ed1ab_0 conda-forge
click 8.1.7 unix_pyh707e725_0 conda-forge
colorama 0.4.6 pyhd8ed1ab_0 conda-forge
colorful 0.5.6 pyhd8ed1ab_0 conda-forge
cryptography 42.0.7 py311h4a61cc7_0 conda-forge
datasets 2.18.0 pyhd8ed1ab_0 conda-forge
dill 0.3.8 pyhd8ed1ab_0 conda-forge
distlib 0.3.8 pyhd8ed1ab_0 conda-forge
filelock 3.13.4 pyhd8ed1ab_0 conda-forge
freetype 2.12.1 h267a509_2 conda-forge
frozenlist 1.4.1 py311h459d7ec_0 conda-forge
fsspec 2024.2.0 pyhca7485f_0 conda-forge
gflags 2.2.2 he1b5a44_1004 conda-forge
glog 0.6.0 h6f12383_0 conda-forge
gmp 6.3.0 h59595ed_1 conda-forge
gmpy2 2.1.5 py311hc4f1f91_1 conda-forge
google-api-core 2.19.0 pyhd8ed1ab_0 conda-forge
google-auth 2.29.0 pyhca7485f_0 conda-forge
googleapis-common-protos 1.63.0 pyhd8ed1ab_0 conda-forge
grpcio 1.59.3 py311ha6695c7_0 conda-forge
huggingface_hub 0.22.2 pyhd8ed1ab_0 conda-forge
icu 73.2 h59595ed_0 conda-forge
idna 3.7 pyhd8ed1ab_0 conda-forge
importlib-metadata 7.1.0 pyha770c72_0 conda-forge
importlib_resources 6.4.0 pyhd8ed1ab_0 conda-forge
jinja2 3.1.3 pyhd8ed1ab_0 conda-forge
jsonschema 4.22.0 pyhd8ed1ab_0 conda-forge
jsonschema-specifications 2023.12.1 pyhd8ed1ab_0 conda-forge
keyutils 1.6.1 h166bdaf_0 conda-forge
krb5 1.21.2 h659d440_0 conda-forge
lcms2 2.16 hb7c19ff_0 conda-forge
ld_impl_linux-64 2.40 h55db66e_0 conda-forge
lerc 4.0.0 h27087fc_0 conda-forge
libabseil 20230802.1 cxx17_h59595ed_0 conda-forge
libarrow 15.0.0 h84dd17c_0_cpu conda-forge
libarrow-acero 15.0.0 h59595ed_0_cpu conda-forge
libarrow-dataset 15.0.0 h59595ed_0_cpu conda-forge
libarrow-flight 15.0.0 h120cb0d_0_cpu conda-forge
libarrow-flight-sql 15.0.0 h61ff412_0_cpu conda-forge
libarrow-gandiva 15.0.0 hacb8726_0_cpu conda-forge
libarrow-substrait 15.0.0 h61ff412_0_cpu conda-forge
libblas 3.9.0 22_linux64_openblas conda-forge
libbrotlicommon 1.1.0 hd590300_1 conda-forge
libbrotlidec 1.1.0 hd590300_1 conda-forge
libbrotlienc 1.1.0 hd590300_1 conda-forge
libcblas 3.9.0 22_linux64_openblas conda-forge
libcrc32c 1.1.2 h9c3ff4c_0 conda-forge
libcurl 8.7.1 hca28451_0 conda-forge
libdeflate 1.20 hd590300_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 hd590300_2 conda-forge
libevent 2.1.12 hf998b51_1 conda-forge
libexpat 2.6.2 h59595ed_0 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 13.2.0 h77fa898_7 conda-forge
libgfortran-ng 13.2.0 h69a702a_7 conda-forge
libgfortran5 13.2.0 hca663fb_7 conda-forge
libgomp 13.2.0 h77fa898_7 conda-forge
libgoogle-cloud 2.12.0 h5206363_4 conda-forge
libgrpc 1.59.3 hd6c4280_0 conda-forge
libiconv 1.17 hd590300_2 conda-forge
libjpeg-turbo 3.0.0 hd590300_1 conda-forge
liblapack 3.9.0 22_linux64_openblas conda-forge
libllvm15 15.0.7 hb3ce162_4 conda-forge
libnghttp2 1.58.0 h47da74e_1 conda-forge
libnl 3.9.0 hd590300_0 conda-forge
libnsl 2.0.1 hd590300_0 conda-forge
libopenblas 0.3.27 pthreads_h413a1c8_0 conda-forge
libparquet 15.0.0 h352af49_0_cpu conda-forge
libpng 1.6.43 h2797004_0 conda-forge
libprotobuf 4.24.4 hf27288f_0 conda-forge
libre2-11 2023.09.01 h7a70373_1 conda-forge
libsqlite 3.45.3 h2797004_0 conda-forge
libssh2 1.11.0 h0841786_0 conda-forge
libstdcxx-ng 13.2.0 hc0a3c3a_7 conda-forge
libthrift 0.19.0 hb90f79a_1 conda-forge
libtiff 4.6.0 h1dd3fc0_3 conda-forge
libunwind 1.6.2 h9c3ff4c_0 conda-forge
libutf8proc 2.8.0 h166bdaf_0 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libuv 1.48.0 hd590300_0 conda-forge
libwebp-base 1.4.0 hd590300_0 conda-forge
libxcb 1.15 h0b41bf4_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libxml2 2.12.7 hc051c1a_0 conda-forge
libzlib 1.2.13 hd590300_5 conda-forge
lz4-c 1.9.4 hcb278e6_0 conda-forge
markdown-it-py 3.0.0 pyhd8ed1ab_0 conda-forge
markupsafe 2.1.5 py311h459d7ec_0 conda-forge
mdurl 0.1.2 pyhd8ed1ab_0 conda-forge
memray 1.12.0 py311h259950f_0 conda-forge
mpc 1.3.1 hfe3b2da_0 conda-forge
mpfr 4.2.1 h9458935_1 conda-forge
mpmath 1.3.0 pyhd8ed1ab_0 conda-forge
msgpack-python 1.0.8 py311h52f7536_0 conda-forge
multidict 6.0.5 py311h459d7ec_0 conda-forge
multiprocess 0.70.16 py311h459d7ec_0 conda-forge
ncurses 6.5 h59595ed_0 conda-forge
networkx 3.3 pyhd8ed1ab_1 conda-forge
nodejs 20.12.2 hb753e55_0 conda-forge
numpy 1.26.4 py311h64a7726_0 conda-forge
opencensus 0.11.3 pyhd8ed1ab_0 conda-forge
opencensus-context 0.1.3 py311h38be061_2 conda-forge
openjpeg 2.5.2 h488ebb8_0 conda-forge
openssl 3.3.0 h4ab18f5_2 conda-forge
orc 1.9.2 h4b38347_0 conda-forge
packaging 24.0 pyhd8ed1ab_0 conda-forge
pandas 2.2.2 py311h14de704_1 conda-forge
pillow 10.3.0 py311h18e6fac_0 conda-forge
pip 24.0 pyhd8ed1ab_0 conda-forge
pkgutil-resolve-name 1.3.10 pyhd8ed1ab_1 conda-forge
platformdirs 3.11.0 pyhd8ed1ab_0 conda-forge
prometheus_client 0.20.0 pyhd8ed1ab_0 conda-forge
proto-plus 1.23.0 pyhd8ed1ab_0 conda-forge
protobuf 4.24.4 py311h46cbc50_0 conda-forge
psutil 5.9.8 py311h459d7ec_0 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
py-spy 0.3.14 h87a5ac0_0 conda-forge
pyarrow 15.0.0 py311h39c9aba_0_cpu conda-forge
pyarrow-hotfix 0.6 pyhd8ed1ab_0 conda-forge
pyasn1 0.6.0 pyhd8ed1ab_0 conda-forge
pyasn1-modules 0.4.0 pyhd8ed1ab_0 conda-forge
pycparser 2.22 pyhd8ed1ab_0 conda-forge
pydantic 2.7.1 pyhd8ed1ab_0 conda-forge
pydantic-core 2.18.2 py311h5ecf98a_0 conda-forge
pygments 2.18.0 pyhd8ed1ab_0 conda-forge
pyopenssl 24.0.0 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.11.9 hb806964_0_cpython conda-forge
python-dateutil 2.9.0 pyhd8ed1ab_0 conda-forge
python-tzdata 2024.1 pyhd8ed1ab_0 conda-forge
python-xxhash 3.4.1 py311h459d7ec_0 conda-forge
python_abi 3.11 4_cp311 conda-forge
pytorch-triton-rocm 2.2.0 pypi_0 pypi
pytz 2024.1 pyhd8ed1ab_0 conda-forge
pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge
pyyaml 6.0.1 py311h459d7ec_1 conda-forge
ray-core 2.12.0 py311h3a73429_0 conda-forge
ray-default 2.12.0 py311h48098de_0 conda-forge
ray-tune 2.12.0 py311h38be061_0 conda-forge
rdma-core 51.0 hd3aeb46_0 conda-forge
re2 2023.09.01 h7f4b329_1 conda-forge
readline 8.2 h8228510_1 conda-forge
referencing 0.35.1 pyhd8ed1ab_0 conda-forge
regex 2024.5.15 py311h331c9d8_0 conda-forge
requests 2.32.1 pyhd8ed1ab_0 conda-forge
rich 13.7.1 pyhd8ed1ab_0 conda-forge
rpds-py 0.18.1 py311h5ecf98a_0 conda-forge
rsa 4.9 pyhd8ed1ab_0 conda-forge
s2n 1.4.1 h06160fa_0 conda-forge
safetensors 0.4.3 py311h46250e7_0 conda-forge
setproctitle 1.3.3 py311h459d7ec_0 conda-forge
setuptools 69.5.1 pyhd8ed1ab_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
smart_open 7.0.4 pyhd8ed1ab_0 conda-forge
snappy 1.1.10 hdb0a2a9_1 conda-forge
sympy 1.12 pypyh9d50eac_103 conda-forge
tensorboardx 2.6.2.2 pyhd8ed1ab_0 conda-forge
textual 0.62.0 pyhd8ed1ab_0 conda-forge
tk 8.6.13 noxft_h4845f30_101 conda-forge
tokenizers 0.19.1 py311h6640629_0 conda-forge
torch 2.2.2+rocm5.6 pypi_0 pypi
torchaudio 2.2.2+rocm5.6 pypi_0 pypi
torchvision 0.17.2+rocm5.6 pypi_0 pypi
tqdm 4.66.4 pyhd8ed1ab_0 conda-forge
transformers 4.40.2 pyhd8ed1ab_0 conda-forge
typing-extensions 4.11.0 hd8ed1ab_0 conda-forge
typing_extensions 4.11.0 pyha770c72_0 conda-forge
tzdata 2024a h0c530f3_0 conda-forge
ucx 1.15.0 ha691c75_8 conda-forge
urllib3 2.2.1 pyhd8ed1ab_0 conda-forge
virtualenv 20.21.0 pyhd8ed1ab_0 conda-forge
wheel 0.43.0 pyhd8ed1ab_1 conda-forge
wrapt 1.16.0 py311h459d7ec_0 conda-forge
xorg-libxau 1.0.11 hd590300_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xxhash 0.8.2 hd590300_0 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge
yaml 0.2.5 h7f98852_2 conda-forge
yarl 1.9.4 py311h459d7ec_0 conda-forge
zipp 3.17.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.13 hd590300_5 conda-forge
zstd 1.5.6 ha6fb4c9_0 conda-forge

Reproduction script

This example fine-tunes an LLM for 8 different learning rates. If required, I can provide the whole Python script, which contains the trainable, and the run.sh script that specifies the SLURM parameters (even though I am pretty sure that SLURM has nothing to do with the problem).

    import ray
    from ray import tune

    # `trainable` is the fine-tuning function from the full script (not shown here).
    ray.init(num_cpus=56, num_gpus=8, log_to_driver=False)

    # Search space: sample the learning rate uniformly from [1e-6, 1e-3]
    config = {"learning_rate": tune.uniform(1e-6, 1e-3)}

    # Create a Tuner object
    tuner = tune.Tuner(
        tune.with_resources(
            trainable,
            resources={"cpu": 7, "gpu": 1},  # Set resources for every trial run
        ),
        param_space=config,
        tune_config=tune.TuneConfig(
            num_samples=8,  # Number of samples
            metric="perplexity",  # Metric to optimize
            mode="min",  # Minimize the metric
        ),
    )
    # Run the tuning process
    results = tuner.fit()

Issue Severity

Low: It annoys or frustrates me.

@gregordecristoforo gregordecristoforo added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 3, 2024
@gregordecristoforo gregordecristoforo changed the title [tune] AMD-Instinct-MI250X falsly shown as unused [tune] AMD-Instinct-MI250X falsely shown as unused Jun 3, 2024
@anyscalesam anyscalesam added the tune Tune-related issues label Jun 3, 2024
@woshiyyya woshiyyya added core Issues that should be addressed in Ray Core and removed tune Tune-related issues labels Jun 5, 2024
@jjyao
Collaborator

jjyao commented Jun 18, 2024

Hi @gregordecristoforo, this is just a side effect of how we implement accelerator type as custom resources. We will re-implement accelerator type as node labels so that you no longer see 0.0/1.0.
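
To illustrate the side effect described above, here is a toy model of logical resource accounting in plain Python. It is only a sketch and not Ray's actual code: `logical_usage` is a hypothetical helper, and the 0.001 fraction for a trial that explicitly asks for an accelerator type is an assumption made for illustration. The node totals and per-trial requests are taken from the output and the reproduction script in this issue.

```python
def logical_usage(node_total, trial_requests):
    """Sum the requests of all running trials and pair them with node totals.

    Hypothetical helper for illustration; returns {resource: (used, total)}.
    """
    used = {name: 0.0 for name in node_total}
    for request in trial_requests:
        for name, amount in request.items():
            used[name] = used.get(name, 0.0) + amount
    return {name: (used[name], cap) for name, cap in node_total.items()}


ACCEL = "accelerator_type:AMD-Instinct-MI250X"

# Node totals as reported in the issue: the accelerator type is advertised
# as a custom resource with a total of 1.0, next to 56 CPUs and 8 GPUs.
node_total = {"CPU": 56.0, "GPU": 8.0, ACCEL: 1.0}

# Eight concurrent trials as in the reproduction script: 7 CPUs and 1 GPU
# each. None of them requests the accelerator_type resource.
plain_trials = [{"CPU": 7, "GPU": 1}] * 8
usage_plain = logical_usage(node_total, plain_trials)

# Assumption: a trial that does ask for the accelerator type reserves only
# a tiny fraction of the custom resource (0.001 here, for illustration).
tagged_trials = [{"CPU": 7, "GPU": 1, ACCEL: 0.001}] * 8
usage_tagged = logical_usage(node_total, tagged_trials)

for label, usage in (("plain", usage_plain), ("tagged", usage_tagged)):
    line = ", ".join(f"{u:.1f}/{t:.1f} {name}" for name, (u, t) in usage.items())
    print(f"{label}: {line}")
```

In this model the GPUs are fully booked while the accelerator_type entry stays at 0.0, simply because no trial requests that resource by name. Under the assumed fraction, even trials that do request it would still round to 0.0/1.0 when shown with one decimal.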

@jjyao jjyao added P2 Important issue, but not time-critical observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 18, 2024
@gregordecristoforo
Author

OK, good to know. Thank you for the explanation!

@jjyao jjyao added P3 Issue moderate in impact or severity and removed P2 Important issue, but not time-critical labels Oct 30, 2024