[doc][EM] Add a brief introduction to NUMA. #11538

trivialfis · 2025-06-27T08:59:44Z

~~Add a utility to help set the CPU affinity.~~

demo/guide-python/external_memory.py

demo/guide-python/distributed_extmem_basic.py

python-package/xgboost/utils.py

pentschev · 2025-06-30T19:22:28Z

python-package/xgboost/utils.py

+def _get_uuid(ordinal: int) -> str:
+    """Construct a string representation of UUID."""
+    from cuda.bindings import runtime as cudart
+
+    status, prop = cudart.cudaGetDeviceProperties(ordinal)
+    _checkcu(status)
+
+    dash_pos = {0, 4, 6, 8, 10}
+    uuid = "GPU"
+
+    for i in range(16):
+        if i in dash_pos:
+            uuid += "-"
+        h = hex(0xFF & np.int32(prop.uuid.bytes[i]))
+        assert h[:2] == "0x"
+        h = h[2:]
+
+        while len(h) < 2:
+            h = "0" + h
+        uuid += h
+    return uuid


Why? https://github.com/rapidsai/dask-cuda/blob/6c3223d7f82f4b2e7eb209a8485b1e32cb5357a7/dask_cuda/utils.py#L752-L775

Thank you for pointing it out. Got a bit too used to cudart.

Thinking again, I need this cudart version as XGBoost should prefer the CUDA device enumeration instead of the nvml device enumeration.
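For reference, the string formatting done by `_get_uuid` above can be sketched without any CUDA dependency. This is only an illustration, assuming the 16 UUID bytes have already been read from the device properties; the helper name `format_gpu_uuid` is hypothetical and not part of the PR.

```python
from typing import Iterable


def format_gpu_uuid(raw: Iterable[int]) -> str:
    """Format 16 UUID bytes as GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.

    Hypothetical helper: accepts `bytes` or any iterable of ints. Each value
    is masked to 8 bits, so signed char values are handled as well.
    """
    data = bytes(b & 0xFF for b in raw)
    if len(data) != 16:
        raise ValueError(f"expected 16 bytes, got {len(data)}")
    h = data.hex()  # 32 lowercase hex digits, zero-padded per byte
    # Dashes fall after bytes 4, 6, 8 and 10, matching the loop in the diff.
    return "GPU-" + "-".join([h[0:8], h[8:12], h[12:16], h[16:20], h[20:32]])
```

The result uses the same `GPU-` prefix and dash layout as the loop in the diff, which makes it usable as a key for matching a device from one enumeration against another.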

python-package/xgboost/utils.py

trivialfis · 2025-07-01T09:09:27Z

@pentschev I have simplified the utility using nvml and changed the package name in the documentation.

hcho3

Let's document that set_device_cpu_affinity is not available on other OSes like Windows.

trivialfis · 2025-07-02T08:51:28Z

Update:

- Documented that this is Linux-only.
- Use CUDA for device enumeration. We obtain the device ordinal and corresponding UUID from CUDA, then use the UUID to get the affinity. This way, we can honor the `CUDA_VISIBLE_DEVICES` environment variable.
- Manually tested on an NVL system.
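The last step of that flow, applying an affinity once it has been looked up, might look roughly like the sketch below. It assumes the affinity arrives as a sequence of bitmask words, which is an assumption about the query's output rather than something shown in this thread; the device lookup itself is omitted, and both function names are hypothetical.

```python
import os
from typing import Iterable, Set


def mask_words_to_cpus(words: Iterable[int], bits_per_word: int = 64) -> Set[int]:
    """Expand bitmask words into a set of CPU indices (hypothetical helper).

    Bit `b` of word `w` corresponds to CPU index `w * bits_per_word + b`.
    """
    cpus: Set[int] = set()
    for word_idx, word in enumerate(words):
        for bit in range(bits_per_word):
            if (word >> bit) & 1:
                cpus.add(word_idx * bits_per_word + bit)
    return cpus


def bind_current_process(cpus: Set[int]) -> bool:
    """Pin the calling process to `cpus`.

    Returns False when the set is empty or when `os.sched_setaffinity` is
    missing, as is the case on Windows and macOS.
    """
    if not cpus or not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return True
```

Guarding on `hasattr(os, "sched_setaffinity")` is one way to reflect the Linux-only restriction noted above without raising on other platforms.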

trivialfis · 2025-07-02T20:24:52Z

Hold this PR for now. I think the CPU affinity alone is not sufficient when the memory is under pressure.

trivialfis · 2025-07-03T13:25:44Z

Expanded the document on using numactl and removed all the utilities, since CPU affinity alone is not sufficient when memory is under pressure. I don't want to maintain specialized code that is not effective.
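As a rough illustration of the numactl approach, a launcher script could build the command line as follows. The `--membind` and `--cpunodebind` flags are the ones quoted from the tutorial later in this thread; the helper name `numactl_command` and the choice of node ID are hypothetical.

```python
import shlex
from typing import List, Sequence


def numactl_command(node: int, app: Sequence[str]) -> List[str]:
    """Prefix `app` with numactl flags that bind memory and CPUs to one node.

    Hypothetical helper: `node` is the NUMA node ID chosen by the caller,
    for example the node closest to the GPU the worker will use.
    """
    if node < 0:
        raise ValueError("NUMA node ID must be non-negative")
    return ["numactl", f"--membind={node}", f"--cpunodebind={node}", *app]


cmd = numactl_command(0, ["./myapp"])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where numactl is installed. Unlike CPU affinity alone, this also constrains where memory is allocated.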

Copilot

Pull Request Overview

This PR enhances the External Memory tutorial by introducing a section on NUMA configuration and updates the Python demos to reference the new cuda-python package and the NUMA guidance.

- Added a table of contents and a new NUMA section explaining how to set CPU and memory affinity.
- Updated demo scripts to switch from python-cuda to cuda-python, adjust the cudart import, and reference the NUMA tutorial.
- Minor history update in the release notes to include support for the Grace Blackwell decompression engine.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| doc/tutorials/external_memory.rst | Added TOC and a detailed NUMA section with examples for `numactl`. |
| demo/guide-python/external_memory.py | Swapped `python-cuda` to `cuda-python`, updated import, added NUMA note. |
| demo/guide-python/distributed_extmem_basic.py | Same updates as above: package name, import path, and NUMA reference. |

Comments suppressed due to low confidence (3)

doc/tutorials/external_memory.rst:297

[nitpick] Consider adding a brief note or link on installing numactl (e.g., via apt-get install numactl), so readers know how to obtain the tool before using it.

    numactl --membind=${NODEID} --cpunodebind=${NODEID} ./myapp

demo/guide-python/external_memory.py:50

[nitpick] The device_mem_total function is duplicated across demos; consider extracting it into a shared utility module to reduce repetition.

    import cuda.bindings.runtime as cudart

demo/guide-python/distributed_extmem_basic.py:44

[nitpick] Same helper appears here; extracting the GPU memory query into a common helper would improve consistency and reduce maintenance overhead.

    import cuda.bindings.runtime as cudart

[doc][EM] Add a brief introduction to NUMA.

1b5be09

trivialfis requested review from Copilot and hcho3 June 27, 2025 08:59

note.

5827f66

pentschev reviewed Jun 27, 2025

trivialfis added 10 commits June 28, 2025 00:58

Add utils.

99fca48

Use os sched.

7a7b3a0

cleanup.

e8aab28

Add api doc.

45d30be

lint.

d5ba2aa

import error.

933cf7c

init.

a0b4752

return empty set.

39ea425

Rename, unify the parameters.

ce7653b

set.

23c5d0c

pentschev reviewed Jun 30, 2025

trivialfis added 8 commits July 1, 2025 14:24

Package name.

681c516

Fix demo.

7e8f561

Use nvml to get the UUID.

56e927f

doc.

238d3e0

No need for UUID.

de1a01a

lint.

85a891d

cleanup.

03f1a5c

cleanup.

2ac61ff

trivialfis changed the title ~~[doc][EM] Add a brief introduction to NUMA.~~ [EM] Add utility for NUMA. Jul 1, 2025

hcho3 requested changes Jul 1, 2025

Use cudart for uuid.

cfd9d21

trivialfis added 3 commits July 2, 2025 14:38

simple tests.

c380bb1

Linux CUDA only.

a9b8f17

Use the current device.

c7c0309

trivialfis added 2 commits July 3, 2025 21:19

Expand on the use numactl.

7500e13

Revert changes.

c3ca7fb

trivialfis changed the title ~~[EM] Add utility for NUMA.~~ [doc][EM] Add a brief introduction to NUMA. Jul 3, 2025

trivialfis requested a review from hcho3 July 3, 2025 13:35

tabs.

a838678

trivialfis requested a review from Copilot July 3, 2025 17:18

Copilot AI reviewed Jul 3, 2025

hcho3 approved these changes Jul 3, 2025

trivialfis merged commit eabb5ed into dmlc:master Jul 3, 2025
61 checks passed

trivialfis deleted the ext-numa-doc branch July 3, 2025 20:39
