Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG][JNI] MemoryBuffer.onClosed needs to be called after the buffer is deallocated #15350

Closed
abellina opened this issue Mar 20, 2024 · 0 comments · Fixed by #15351
Closed

[BUG][JNI] MemoryBuffer.onClosed needs to be called after the buffer is deallocated #15350

abellina opened this issue Mar 20, 2024 · 0 comments · Fixed by #15351
Assignees
Labels
bug Something isn't working Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS

Comments

@abellina
Copy link
Contributor

abellina commented Mar 20, 2024

We are seeing an odd race condition in spark-rapids here NVIDIA/spark-rapids#10585 that seems to be very dependent on timing. For example, we enabled debug logging to try and find the root cause but the logging looks to "fix" the problem.

I was able to narrow it to state being tracked in a class in spark-rapids called HostAlloc where we are keeping a counter of pinned memory allocated and triggering state transitions for a state machine we use for our OOM handling. These state changes are happening too soon, when MemoryBuffer.onClosed is called, because onClosed is getting triggered before we actually free the buffer: so we update state thinking that the memory is free but we haven't actually freed it yet. As a result, we are blocking a thread and it will never come out of that state because the event that would have woken it up happened too soon (triggered by onClosed).

The fix we are looking at unfortunately is in cuDF and is very late for 24.04, but I think it's needed. We would change the order for the onClosed callback to right after the free.

eventHandler.onClosed(refCount);
moves a few lines down to .

I should have a PR today. The main thing I am worried about is that our spill framework uses this callback and the timing will be changed by this work, so I want to make sure we don't cause issues while spilling.

@abellina abellina added bug Something isn't working Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS labels Mar 20, 2024
rapids-bot bot pushed a commit that referenced this issue Mar 20, 2024
Closes #15350. This PR changes the order of the callback `MemoryBuffer.onClosed` to happen after our `MemoryCleaner` finishes. This is done so that we can accurately, and safely, reflect the state of the memory resource (be it device or host). This PR is needed to address a bug found in spark-rapids here: NVIDIA/spark-rapids#10585.

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Gera Shegalov (https://github.com/gerashegalov)

URL: #15351
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS
Projects
Archived in project
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant