
[dataset] Reduce memory usage during .to_pandas() #20921

Merged 3 commits into ray-project:master on Dec 7, 2021

Conversation

@ericl (Contributor) commented Dec 6, 2021

Why are these changes needed?

We shouldn't ray.get() all the blocks immediately during the to_pandas() call; it's better to fetch them one by one. That's a little slower, but to_pandas() isn't expected to be fast anyway.
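For reference, the pre-PR bulk flow looks roughly like the following sketch (a paraphrase based on the stages discussed in the review below, reusing the blocks and DelegatingArrowBlockBuilder names from the snippet quoted later in the thread, not the literal diff); the streaming replacement is shown verbatim further down:

    # Bulk fetch (previous behavior): all blocks are pulled and pinned at once.
    tables = ray.get(blocks)
    builder = DelegatingArrowBlockBuilder()
    for table in tables:
        builder.add_block(table)
    return builder.build().to_pandas()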

@clarkzinzow (Contributor) left a comment

To check my understanding: in the flow from the network through Plasma to building the DataFrame, where was the large memory inflation happening? My understanding of the flow is:

  1. ray.get(): Plasma buffers are allocated and filled as objects are pulled, with the Arrow tables being deserialized directly into the Plasma buffers.
  2. builder.add_block(): The block builder holds pointers to those Plasma buffers, so no new copies are made while adding blocks to the builder.
  3. builder.build().to_pandas(): The .build() and .to_pandas() calls create a copy or two during the Arrow table concatenation and the DataFrame construction, respectively.

This PR deals with stages (1) and (2) rather than the copy-heavy stage (3), and AFAICT wouldn't result in a lower peak during those two stages. Am I missing the creation of an extra copy or two in (1) and (2), maybe receiver-side queueing of byte chunks during pulls, or something else?
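To make stage (3) concrete, here is a minimal standalone pyarrow sketch (plain pyarrow, no Ray; the tiny tables are made up) of where that stage's copies can come from:

    import pyarrow as pa

    t1 = pa.table({"x": [1, 2, 3]})
    t2 = pa.table({"x": [4, 5, 6]})

    # build(): concatenate the accumulated tables. pa.concat_tables produces
    # chunked columns that still reference the input buffers; any later chunk
    # combining is where a concatenation copy would come from.
    combined = pa.concat_tables([t1, t2])

    # to_pandas(): the column data is copied out of Arrow memory into
    # pandas/NumPy-owned memory when the DataFrame is constructed.
    df = combined.to_pandas()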

@ericl (Contributor, Author) commented Dec 6, 2021

It's mostly about reducing peak plasma memory usage, which is much more constrained than heap memory usage. Streaming through plasma solves that bottleneck.

@ericl ericl merged commit 5d5ca8f into ray-project:master Dec 7, 2021
@clarkzinzow (Contributor) commented, quoting ericl's reply above:

It's mostly about reducing peak plasma memory usage... Streaming through plasma solves that bottleneck.

Ah, so I think what I was forgetting is that output.add_block(ray.get(block)) isn't just appending a Plasma buffer pointer to the builder._tables list; it's actually copying data out of Plasma memory and into the worker heap, since we don't have zero-copy deserialization for Arrow tables in Plasma. If these blocks were instead NumPy arrays, then the streaming output.add_block(ray.get(block)) would be zero-copy and would just accumulate the NumPy arrays in Plasma, the same as if we did the bulk ray.get(blocks) and then added them to the builder.
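For comparison, the NumPy zero-copy behavior is easy to check in isolation; a minimal sketch (assuming ray is installed and a local cluster can be started; make_block is a made-up task name):

    import numpy as np
    import ray

    ray.init()

    @ray.remote
    def make_block():
        return np.zeros(1_000_000)

    arr = ray.get(make_block.remote())
    # The array is deserialized zero-copy, backed by the object-store (Plasma)
    # buffer rather than copied onto the worker heap, which is why Ray returns
    # it read-only.
    print(arr.flags.writeable)  # False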

@clarkzinzow (Contributor) commented Dec 7, 2021

Hmm, actually, is that true? Are serialized Arrow tables currently zero-copy? You've asserted as much elsewhere: #20242 (comment)

@clarkzinzow (Contributor) commented:

It looks like all Arrow buffers implement the Pickle5 out-of-band protocol (__reduce_ex__ returning a PickleBuffer instance), so Arrow tables should deserialize almost entirely zero-copy, which was my initial assumption. Given that, I still don't understand how the streaming output.add_block(ray.get(block)) results in a lower peak Plasma usage: everything should be pointers to Plasma buffers until output.build() is invoked, so the streaming ray.get()s should yield the same peak Plasma usage as the bulk ray.get(blocks).

        output = DelegatingArrowBlockBuilder()
        for block in blocks:
            output.add_block(ray.get(block))  # <-- Block deserialized with zero-copy, pointers to Plasma buffers
                                              #     added to output._tables, very little memory moved to the worker heap.
        # <-- All buffers underlying all blocks should still be in Plasma here.
        return output.build().to_pandas()  # <-- .build() concatenates all tables into worker heap memory,
                                           #     releasing Plasma buffers.

Am I missing something here? Just making sure that I understand this.
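One quick, self-contained way to check the out-of-band claim for a single Arrow buffer (pa.py_buffer and the protocol-5 buffer_callback argument are standard pyarrow/pickle APIs; whether out_of_band ends up non-empty is exactly the zero-copy question here):

    import pickle
    import pyarrow as pa

    buf = pa.py_buffer(b"x" * 1024)

    out_of_band = []
    payload = pickle.dumps(buf, protocol=5, buffer_callback=out_of_band.append)

    # If Arrow buffers implement __reduce_ex__ via PickleBuffer as described
    # above, the raw bytes are handed out via out_of_band instead of being
    # copied into the pickle stream, which is what lets Ray leave them in
    # Plasma on deserialization.
    print(len(payload), len(out_of_band))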

@ericl (Contributor, Author) commented Dec 7, 2021

Yes, this is a good point. I think we'd need to copy in that case to reduce Plasma memory usage.

But maybe this isn't that important, since to_pandas() is more of a debugging utility anyway, and we use it only for small amounts of data.
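If bounding peak Plasma usage ever does matter for this path, one hypothetical approach (not what this PR does) would be to materialize each block on the worker heap as soon as it is fetched, so its Plasma buffers can be released before the next block is pulled:

    import pandas as pd
    import ray

    def blocks_to_pandas(blocks):
        # `blocks` is assumed to be a list of ObjectRefs to Arrow tables.
        dfs = []
        for block in blocks:
            table = ray.get(block)         # zero-copy view over Plasma buffers
            dfs.append(table.to_pandas())  # explicit copy onto the worker heap
            del table                      # drop the reference so Plasma can free it
        return pd.concat(dfs, ignore_index=True)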

@clarkzinzow (Contributor) commented:

OK, cool. I'll be sure to remember this if we do this kind of streaming optimization on more critical paths.
