[doc][EM] Add a brief introduction to NUMA. #11538


Merged: 27 commits into dmlc:master from ext-numa-doc on Jul 3, 2025

Conversation

@trivialfis (Member) commented on Jun 27, 2025

Add a utility to help set the CPU affinity.
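
For background, pinning CPU affinity on Linux needs nothing beyond the Python standard library; the sketch below only illustrates the underlying mechanism and is not the utility added in this PR (the CPU set is a hypothetical example).

```python
import os


def pin_to_cpus(cpus: set[int]) -> None:
    """Restrict the calling process to the given CPUs (Linux-only)."""
    # 0 means "the calling process"; os.sched_setaffinity exists only on Linux.
    os.sched_setaffinity(0, cpus)
    print("Now restricted to CPUs:", sorted(os.sched_getaffinity(0)))


if __name__ == "__main__":
    # Hypothetical example: pin to the first four CPUs.
    pin_to_cpus({0, 1, 2, 3})
```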

@trivialfis trivialfis requested review from Copilot and hcho3 June 27, 2025 08:59
Copilot's review comment was marked as outdated.

Comment on lines 52 to 72
def _get_uuid(ordinal: int) -> str:
    """Construct a string representation of UUID."""
    from cuda.bindings import runtime as cudart

    # `np` (numpy) and `_checkcu` (CUDA status check) are defined earlier in the module.
    status, prop = cudart.cudaGetDeviceProperties(ordinal)
    _checkcu(status)

    dash_pos = {0, 4, 6, 8, 10}
    uuid = "GPU"

    for i in range(16):
        if i in dash_pos:
            uuid += "-"
        # Render each UUID byte as a zero-padded two-character hex value.
        h = hex(0xFF & np.int32(prop.uuid.bytes[i]))
        assert h[:2] == "0x"
        h = h[2:]

        while len(h) < 2:
            h = "0" + h
        uuid += h
    return uuid

@trivialfis (Member Author):

Thank you for pointing it out. Got a bit too used to cudart.

@trivialfis (Member Author):

Thinking again, I need this cudart version as XGBoost should prefer the CUDA device enumeration instead of the nvml device enumeration.
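
For illustration, a hedged sketch of the enumeration difference, assuming the cuda-python and nvidia-ml-py packages are installed: CUDA ordinals respect CUDA_VISIBLE_DEVICES while raw NVML indices enumerate physical devices, so resolving a device through its CUDA-reported UUID keeps the two views consistent.

```python
import pynvml  # provided by the nvidia-ml-py package
from cuda.bindings import runtime as cudart

# CUDA ordinals respect CUDA_VISIBLE_DEVICES, NVML indices are physical:
# with CUDA_VISIBLE_DEVICES=1, CUDA device 0 and NVML device 0 are different
# GPUs, so the two UUIDs printed below would differ.
status, prop = cudart.cudaGetDeviceProperties(0)
assert status == cudart.cudaError_t.cudaSuccess
cuda_uuid = "".join(f"{b & 0xFF:02x}" for b in prop.uuid.bytes)

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    # Older pynvml releases return bytes here instead of str.
    nvml_uuid = pynvml.nvmlDeviceGetUUID(handle)
    print("CUDA device 0 UUID:", cuda_uuid)
    print("NVML device 0 UUID:", nvml_uuid)
finally:
    pynvml.nvmlShutdown()
```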

@trivialfis trivialfis changed the title [doc][EM] Add a brief introduction to NUMA. [EM] Add utility for NUMA. Jul 1, 2025
@trivialfis (Member Author):

@pentschev I have simplified the utility using nvml and changed the package name in the documentation.

hcho3's review comment was marked as outdated.

@hcho3 (Collaborator) left a comment:

Let's document that set_device_cpu_affinity is not available on other OSes like Windows.

@trivialfis (Member Author) commented on Jul 2, 2025:

Update:

  • Documented that this is Linux-only.
  • Used CUDA for device enumeration. We obtain the device ordinal and the corresponding UUID from CUDA, then use the UUID to get the affinity. This way, we honor the CUDA_VISIBLE_DEVICES environment variable (see the sketch after this list).
  • Manually tested on an NVL system.
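
For reference, a minimal sketch of the flow described in the second bullet, reusing the _get_uuid helper quoted earlier in this thread; the NVML calls and error handling here are an illustration of the approach, not the exact code from the PR.

```python
import pynvml  # provided by the nvidia-ml-py package


def set_device_cpu_affinity(ordinal: int) -> None:
    """Pin the calling process to the CPUs closest to the given CUDA device (Linux-only)."""
    # `_get_uuid` (quoted earlier) maps the CUDA ordinal, which honors
    # CUDA_VISIBLE_DEVICES, to a "GPU-..." UUID string.
    uuid = _get_uuid(ordinal)

    pynvml.nvmlInit()
    try:
        # Resolve the device by UUID so the CUDA and NVML views agree.
        # Older pynvml releases may require uuid.encode() here.
        handle = pynvml.nvmlDeviceGetHandleByUUID(uuid)
        # Ask the driver to bind the calling process to the NUMA-local CPU set.
        pynvml.nvmlDeviceSetCpuAffinity(handle)
    finally:
        pynvml.nvmlShutdown()
```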

@trivialfis (Member Author):

Hold this PR for now. I think the CPU affinity alone is not sufficient when the memory is under pressure.

@trivialfis trivialfis changed the title [EM] Add utility for NUMA. [doc][EM] Add a brief introduction to NUMA. Jul 3, 2025
@trivialfis (Member Author):

Expanded the document on using numactl and removed all the utilities. The utilities are gone because CPU affinity alone is not sufficient when memory is under pressure, and I don't want to maintain specialized code that is not effective.
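
As a quick sanity check, a process launched under numactl can inspect its own binding from Python; a minimal sketch (Linux-only, standard library and procfs only, nothing XGBoost-specific):

```python
import os


def show_numa_binding() -> None:
    """Print the CPU and memory-node binding of the current process (Linux-only)."""
    # CPUs the scheduler may use (reflects --cpunodebind / --physcpubind).
    print("CPUs allowed:", sorted(os.sched_getaffinity(0)))

    # Memory nodes allocations may come from (reflects --membind).
    with open("/proc/self/status") as fd:
        for line in fd:
            if line.startswith(("Cpus_allowed_list", "Mems_allowed_list")):
                print(line.strip())


if __name__ == "__main__":
    # Example launch: numactl --membind=0 --cpunodebind=0 python this_script.py
    show_numa_binding()
```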

@trivialfis trivialfis requested a review from hcho3 July 3, 2025 13:35
@trivialfis trivialfis requested a review from Copilot July 3, 2025 17:18
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR enhances the External Memory tutorial by introducing a section on NUMA configuration and updates the Python demos to reference the new cuda-python package and the NUMA guidance.

  • Added a table of contents and a new NUMA section explaining how to set CPU and memory affinity.
  • Updated demo scripts to switch from python-cuda to cuda-python, adjust the cudart import, and reference the NUMA tutorial.
  • Minor history update in the release notes to include support for the Grace Blackwell decompression engine.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File | Description
doc/tutorials/external_memory.rst | Added TOC and a detailed NUMA section with examples for numactl.
demo/guide-python/external_memory.py | Swapped python-cuda to cuda-python, updated import, added NUMA note.
demo/guide-python/distributed_extmem_basic.py | Same updates as above: package name, import path, and NUMA reference.
Comments suppressed due to low confidence (3)

doc/tutorials/external_memory.rst:297

  • [nitpick] Consider adding a brief note or link on installing numactl (e.g., via apt-get install numactl), so readers know how to obtain the tool before using it.
    numactl --membind=${NODEID} --cpunodebind=${NODEID} ./myapp

demo/guide-python/external_memory.py:50

  • [nitpick] The device_mem_total function is duplicated across demos; consider extracting it into a shared utility module to reduce repetition (a sketch of such a module follows after this list).
    import cuda.bindings.runtime as cudart

demo/guide-python/distributed_extmem_basic.py:44

  • [nitpick] Same helper appears here; extracting the GPU memory query into a common helper would improve consistency and reduce maintenance overhead.
    import cuda.bindings.runtime as cudart
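
To illustrate the suggestion about the duplicated helper: a hedged sketch of what a shared module might look like, assuming device_mem_total only needs the total memory of the active device from the CUDA runtime. Neither the module path nor the function bodies below come from the actual demos.

```python
# Hypothetical shared module, e.g. demo/guide-python/_gpu_utils.py.
from cuda.bindings import runtime as cudart


def _checkcu(status: cudart.cudaError_t) -> None:
    """Raise if a CUDA runtime call failed."""
    if status != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(cudart.cudaGetErrorString(status)[1])


def device_mem_total() -> int:
    """Total memory of the currently active CUDA device, in bytes."""
    status, _free, total = cudart.cudaMemGetInfo()
    _checkcu(status)
    return total
```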

@trivialfis trivialfis merged commit eabb5ed into dmlc:master Jul 3, 2025
61 checks passed
@trivialfis trivialfis deleted the ext-numa-doc branch July 3, 2025 20:39