-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spilling to host memory #12106
Spilling to host memory #12106
Conversation
Codecov ReportBase: 86.61% // Head: 88.24% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-22.12 #12106 +/- ##
================================================
+ Coverage 86.61% 88.24% +1.63%
================================================
Files 135 137 +2
Lines 22934 22501 -433
================================================
- Hits 19864 19857 -7
+ Misses 3070 2644 -426
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty happy with the code at this point. The main thing that I haven't reviewed much yet is the tests, which I'll look at next week. I'd like to hear some other reviewer opinions at this point, so let's mark this PR ready for review (or wait until @galipremsagar has a chance to look through, either is fine with me) and get some more feedback.
mark benchmark a good candidate to run with `CUDF_SPILL=ON`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome work @madsbk! 🔥
Thanks for the @galipremsagar, I have added a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me! I have a couple of requests for improving the code in the testing module, but I don't think they're blockers and can be addressed in a follow-up. I'm going to go ahead and merge as is. Thanks for the great work here Mads!
def gen_df(target="gpu") -> cudf.DataFrame: | ||
ret = cudf.DataFrame({"a": [1, 2, 3]}) | ||
if target != "gpu": | ||
gen_df.buffer(ret).spill(target=target) | ||
return ret | ||
|
||
|
||
gen_df.buffer = lambda df: df._data._data["a"].data | ||
gen_df.is_spilled = lambda df: gen_df.buffer(df).is_spilled | ||
gen_df.is_spillable = lambda df: gen_df.buffer(df).spillable | ||
gen_df.buffer_size = gen_df.buffer(gen_df()).size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not assign attributes to a function this way. I would rather either define these as standalone helper functions or define a helper class as a subclass of cudf.DataFrame that implements these as properties.
def test_from_pandas(manager: SpillManager): | ||
pdf1 = pandas.DataFrame({"x": [1, 2, 3]}) | ||
df = cudf.from_pandas(pdf1) | ||
assert df._data._data["x"].data.spillable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that you need to write this out again anyway definitely suggests we want to expose some of those lambdas attached to get_df as functions.
@gpucibot merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving ops-codeowner
file changes
Support of the [new built-in spilling in cuDF](rapidsai/cudf#12106) so that `device_memory_limit` and `memory_limit` ignores cuDF's device buffers. This is only implemented for `DeviceHostFile`. Since jit-unspill also targets cuDF and libraries such as cupy isn't supported, I don't think it is important to support cuDF's built-in spilling in `ProxifyHostFile`. For now, `DeviceHostFile` simply ignores cuDF's device buffers and let cuDF handle the spilling. This means that `DeviceHostFile` might estimate the device and host memory usage incorrectly ([or more incorrectly than usually](dask/distributed#4568 (comment))). Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Peter Andreas Entschev (https://github.com/pentschev) URL: #984
Based on @vyasr's [review](#12106 (review)), I am cleaning up the spilling tests. Hopefully making them more readable. Notice, I am targeting `branch-23.02` because of code freeze but maybe we should target `branch-22.12` since it implements spilling? Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #12220
Now that #12106 is in, wrapping calls to `libcudf` is an ongoing effort. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #12232
) Now that #12106 is in, wrapping calls to `libcudf` and setting `as_buffer(..., exposed=False)` is an ongoing effort. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #12535
This PR implementing spilling of device to host memory, which is based on #11553.
Spilling can be enabled in two ways (it is disabled by default):
CUDF_SPILL=on
, orspill
option incudf
by doingcudf.set_option("spill", True)
.Additionally, parameters are:
CUDF_SPILL_ON_DEMAND=ON
/cudf.set_option("spill_on_demand", True)
, which registers an RMM out-of-memory error handler that spills buffers in order to free up memory.CUDF_SPILL_DEVICE_LIMIT=...
/cudf.set_option("spill_device_limit", ...)
, which sets a device memory limit in bytes.I have limited the scope of this PR. In a follow-up PR, I will port the statistics, logging, and partial unspill from #11553.
Design
Spilling consists of two components:
A new buffer sub-class,
SpillableBuffer
, that implements moving of its data from host to device memory in-place.A spill manager that tracks all instances of
SpillableBuffer
and spills them on demand.A global spill manager is used throughout cudf when spilling is enabled, which makes
as_buffer()
returnSpillableBuffer
instead of the defaultBuffer
instances.Challenges
Accessing
Buffer.ptr
, we get the device memory pointer of the buffer. This is unproblematic in the case ofBuffer
but what happens when accessingSpillableBuffer.ptr
, which might have spilled its device memory? In this case,SpillableBuffer
needs to unspill the memory before returning its device memory pointer. Furthermore, while this device memory pointer is being used (or could be used),SpillableBuffer
cannot spill its memory back to host memory because doing so would invalidate the device pointer.To address this, we mark the
SpillableBuffer
as unspillable, we say that the buffer has been exposed. This can be either permanent if the device pointer is exposed to external projects or temporary whilelibcudf
accesses the device memory.The
SpillableBuffer.get_ptr()
returns the device pointer of the buffer memory just like.ptr
but if given an instance ofSpillLock
, the buffer is only unspillable as long as the instance ofSpillLock
is alive.For convenience, one can use the decorator/context
with_spill_lock
to associate aSpillLock
with a lifetime bound to the context automatically.Overhead
When spilling is disabled, the overhead of this PR comes from the decorator
with_spill_lock
. However, this is small https://gist.github.com/madsbk/da6520e7583cf5d728a1b5a1b09200f3:Checklist
New or existing tests cover these changes.
Avoid changes to
libcudf
The documentation is up to date with these changes.