Update RMMNumbaManager to handle NUMBA_CUDA_USE_NVIDIA_BINDING=1 #1004

Conversation

brandon-b-miller (Contributor)

Fixes #1003
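
For context, here is a minimal sketch of using RMM as Numba's memory manager with the NVIDIA cuda-python binding enabled. It is illustrative only: the rmm.RMMNumbaManager import path is assumed to match the RMM Python API of this era, and a working GPU plus the rmm and cuda-python packages are assumed.

import os

# Opt in to the NVIDIA cuda-python binding before Numba initializes CUDA
# (the programmatic counterpart of exporting NUMBA_CUDA_USE_NVIDIA_BINDING=1).
os.environ["NUMBA_CUDA_USE_NVIDIA_BINDING"] = "1"

import numpy as np
from numba import cuda
import rmm

# Route Numba device allocations through RMM via its EMM plugin.
cuda.set_memory_manager(rmm.RMMNumbaManager)

d_arr = cuda.to_device(np.arange(3, dtype=np.float64))  # backed by an RMM allocation

Setting NUMBA_CUDA_MEMORY_MANAGER=rmm on the command line, as in the test invocations below, should be the environment-variable equivalent of the set_memory_manager call.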

brandon-b-miller requested a review from a team as a code owner on March 22, 2022.
brandon-b-miller changed the title from "update rmm numba mem manager to handle new bindings" to "Update RMMNumbaManager to handle NUMBA_CUDA_USE_NVIDIA_BINDING=1" on Mar 22, 2022.
The github-actions bot added the "Python (Related to RMM Python API)" label on Mar 22, 2022.
harrism (Member) commented on Mar 22, 2022

@brandon-b-miller we are already in burndown for 22.04, so unless this is urgent for 22.04 we should push to the next release. From the bug description this doesn't sound like something we would hotfix for, so I think we can push it.

harrism added the "bug (Something isn't working)" label on Mar 22, 2022.
gmarkall (Contributor)

Have you run the Numba test suite with this branch? e.g.:

NUMBA_CUDA_USE_NVIDIA_BINDING=1 NUMBA_CUDA_MEMORY_MANAGER=rmm python -m numba.runtests numba.cuda.tests

brandon-b-miller (Contributor, Author)

> Have you run the Numba test suite with this branch? e.g.:
>
> NUMBA_CUDA_USE_NVIDIA_BINDING=1 NUMBA_CUDA_MEMORY_MANAGER=rmm python -m numba.runtests numba.cuda.tests

This revealed further changes that were needed, which have now been pushed.
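
To make the shape of the required change concrete, here is a schematic sketch, not the actual diff from this PR: with NUMBA_CUDA_USE_NVIDIA_BINDING=1, Numba represents device pointers using cuda-python's CUdeviceptr type rather than ctypes integers, so an EMM plugin handing RMM allocations to Numba has to wrap the raw pointer accordingly. Import paths follow Numba's EMM plugin documentation; the env-var check stands in for however the plugin actually detects the active binding.

import ctypes
import os

import rmm
from numba import cuda
from numba.cuda import MemoryPointer

# Simplified binding detection for illustration; Numba keeps its own config flag.
USE_NV_BINDING = os.environ.get("NUMBA_CUDA_USE_NVIDIA_BINDING", "0") == "1"

buf = rmm.DeviceBuffer(size=1024)             # RMM owns the device memory
if USE_NV_BINDING:
    from cuda.cuda import CUdeviceptr         # NVIDIA cuda-python driver binding
    ptr = CUdeviceptr(int(buf.ptr))           # pointer type Numba expects with the new binding
else:
    ptr = ctypes.c_uint64(int(buf.ptr))       # pointer type Numba expects with the ctypes binding

# Wrap the pointer for Numba; the finalizer closure keeps the DeviceBuffer alive
# until Numba releases the memory.
mem = MemoryPointer(cuda.current_context(), ptr, 1024, finalizer=lambda: buf)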

brandon-b-miller changed the base branch from branch-22.04 to branch-22.06 on March 28, 2022.
galipremsagar added the "non-breaking (Non-breaking change)" label on Mar 28, 2022.
brandon-b-miller (Contributor, Author)

rerun tests

gmarkall (Contributor) left a review comment:

I'm seeing the following failures when running

NUMBA_CUDA_USE_NVIDIA_BINDING=1 NUMBA_CUDA_MEMORY_MANAGER=rmm python -m numba.runtests numba.cuda.tests -v -m

with this PR and Numba main:

======================================================================
FAIL: test_ipc_array (numba.cuda.tests.cudapy.test_ipc.TestIpcStaged)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/numba/numba/cuda/tests/cudapy/test_ipc.py", line 293, in test_ipc_array
    self.fail(out)
AssertionError: Traceback (most recent call last):
  File "/home/gmarkall/numbadev/numba/numba/cuda/tests/cudapy/test_ipc.py", line 215, in staged_ipc_array_test
    with cuda.gpus[device_num]:
  File "/home/gmarkall/numbadev/numba/numba/cuda/cudadrv/devices.py", line 84, in __exit__
    self._device.get_primary_context().pop()
  File "/home/gmarkall/numbadev/numba/numba/cuda/cudadrv/driver.py", line 1355, in pop
    assert int(popped) == int(self.handle)
AssertionError


======================================================================
FAIL: test_staged (numba.cuda.tests.cudapy.test_ipc.TestIpcStaged)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/numba/numba/cuda/tests/cudapy/test_ipc.py", line 273, in test_staged
    self.fail(out)
AssertionError: Traceback (most recent call last):
  File "/home/gmarkall/numbadev/numba/numba/cuda/tests/cudapy/test_ipc.py", line 18, in core_ipc_handle_test
    arr = the_work()
  File "/home/gmarkall/numbadev/numba/numba/cuda/tests/cudapy/test_ipc.py", line 199, in the_work
    with cuda.gpus[device_num]:
  File "/home/gmarkall/numbadev/numba/numba/cuda/cudadrv/devices.py", line 84, in __exit__
    self._device.get_primary_context().pop()
  File "/home/gmarkall/numbadev/numba/numba/cuda/cudadrv/driver.py", line 1355, in pop
    assert int(popped) == int(self.handle)
AssertionError


----------------------------------------------------------------------
Ran 1278 tests in 111.982s

FAILED (failures=2, skipped=20, expected failures=8)

This is with multiple devices:

$ python -c "from numba import cuda; cuda.detect()"
Found 3 CUDA devices
id 0     b'NVIDIA RTX A6000'                              [SUPPORTED]
                      Compute Capability: 8.6
                           PCI Device ID: 0
                              PCI Bus ID: 21
                                    UUID: GPU-842b25ad-db82-ba9d-0380-e65fe57189eb
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
id 1     b'NVIDIA RTX A6000'                              [SUPPORTED]
                      Compute Capability: 8.6
                           PCI Device ID: 0
                              PCI Bus ID: 45
                                    UUID: GPU-af183771-f998-7235-c638-b407c81bf3f7
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
id 2         b'Quadro P2200'                              [SUPPORTED]
                      Compute Capability: 6.1
                           PCI Device ID: 0
                              PCI Bus ID: 11
                                    UUID: GPU-321c7ee1-375f-7c11-a413-b0aab3ec4756
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
Summary:
	3/3 devices are supported

(I suspect it does not occur with a single GPU)
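
For reference, a minimal sketch of the device-switching pattern these IPC tests exercise (illustrative only, not a verified reproducer: it assumes at least two GPUs and the same NUMBA_CUDA_USE_NVIDIA_BINDING=1 / NUMBA_CUDA_MEMORY_MANAGER=rmm environment):

import numpy as np
from numba import cuda

with cuda.gpus[1]:                        # entering pushes device 1's primary context
    d_arr = cuda.to_device(np.arange(3))  # allocation happens on device 1
# Exiting the block pops the primary context; the reported failure is the
# assertion int(popped) == int(self.handle) firing during this pop.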

gmarkall (Contributor)

Turns out this is something that started happening between 21.12 and 22.02, unrelated to this PR - I'll look and see if there's a fix we can roll into this PR so it passes tests again.

gmarkall (Contributor) commented on Apr 7, 2022

> Turns out this is something that started happening between 21.12 and 22.02, unrelated to this PR - I'll look and see if there's a fix we can roll into this PR so it passes tests again.

This will be hard to track down and is unrelated to this PR, so let's not attempt to address it here.

gmarkall (Contributor) left a review comment:

Based on the fact that the issue I previously identified is unrelated to this PR and was introduced earlier, I now think this looks good.

brandon-b-miller (Contributor, Author)

@gpucibot merge

1 similar comment
shwina (Contributor) commented on Apr 8, 2022

@gpucibot merge

The rapids-bot merged commit a067498 into rapidsai:branch-22.06 on Apr 8, 2022.
Labels
bug (Something isn't working), non-breaking (Non-breaking change), Python (Related to RMM Python API)
Development
Successfully merging this pull request may close these issues:

[BUG] RMMNumbaManager broken when Numba is using NV CUDA bindings