
[data] Optimization to reduce ArrowBlock building time for blocks of size 1 #38833

Merged
merged 2 commits into ray-project:master from stephanie-wang:optimize-one-record on Aug 26, 2023

Conversation

@stephanie-wang (Contributor) commented Aug 24, 2023:

Why are these changes needed?

Many Data ops depend on converting numpy batches to Arrow blocks. Converting a single np array to pyarrow is normally zero-copy, but blocks with multiple rows require a copy to combine the column of np arrays into one contiguous ndarray. This PR avoids that copy for single-row blocks by using np.expand_dims to reshape the array instead of copying it.
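
A minimal numpy-only sketch of the idea (the array name and size here are hypothetical, not taken from the PR): stacking several row arrays into one contiguous ndarray forces a copy, while a single-row block can simply gain a leading axis as a zero-copy view.

    import numpy as np

    # Hypothetical single-row column value: one ~100 MB float64 array.
    row = np.random.rand(100 * 1024 * 1024 // 8)

    # Multi-row blocks need one contiguous (num_rows, ...) ndarray, which copies the data:
    multi = np.stack([row, row])
    print(np.shares_memory(multi, row))   # False -- the rows were copied

    # A single-row block can instead be reshaped with a leading axis, with no copy:
    single = np.expand_dims(row, axis=0)
    print(np.shares_memory(single, row))  # True -- same buffer, zero-copy
    print(single.shape)                   # (1, 13107200)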

Related issue number

Needed for #37474.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@stephanie-wang (Contributor, Author) commented:

For a single-row block [{"field": 100MB np.array}]:

Before:
In [11]: %timeit b.build()
385 ms ± 8.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:
In [13]: %timeit b.build()
116 µs ± 3.59 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
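
A hypothetical way to reproduce this measurement, assuming Ray Data's internal DelegatingBlockBuilder (the module path and exact API are not shown in this PR and may differ across Ray versions):

    import numpy as np
    from ray.data._internal.delegating_block_builder import DelegatingBlockBuilder

    # Build a single-row block whose only column is a ~100 MB numpy array.
    b = DelegatingBlockBuilder()
    b.add({"field": np.zeros(100 * 1024 * 1024 // 8)})
    block = b.build()  # time this call, e.g. with %timeit in IPython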

@ericl (Contributor) commented Aug 24, 2023 via email

@c21 (Contributor) left a comment:

LGTM

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 24, 2023
@zhe-thoughts (Collaborator) commented:

Please verify tests-ok and then I will review / merge. Thanks!

@stephanie-wang stephanie-wang added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Aug 24, 2023
@stephanie-wang (Contributor, Author) commented:

Hey @zhe-thoughts this is ready to merge.

@zhe-thoughts (Collaborator) commented:

"buildkite/premerge" is a x

@stephanie-wang (Contributor, Author) commented:

"buildkite/premerge" is a x

The failure is unrelated (it's flaky on master)

@c21 (Contributor) commented Aug 25, 2023:

I guess buildkite/premerge is mandatory to pass before merging. Let me retry it.

@zhe-thoughts (Collaborator) commented:

Yes, for the other checks we can use manual judgement, but buildkite/premerge must pass. cc @aslonnie

@stephanie-wang (Contributor, Author) commented:

Yes, for the other checks we can use manual judgement, but buildkite/premerge must pass. cc @aslonnie

It's passing now.

@stephanie-wang stephanie-wang merged commit 621ef89 into ray-project:master Aug 26, 2023
51 of 54 checks passed
@stephanie-wang stephanie-wang deleted the optimize-one-record branch August 26, 2023 01:18
The following commits referenced this pull request; each carries the same commit message as the PR description above (Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>):

  • stephanie-wang added a commit to stephanie-wang/ray, Aug 28, 2023 (…size 1 ray-project#38833)
  • GeneDer pushed a commit, Aug 28, 2023 (…size 1 #38833 (#38988))
  • arvind-chandra pushed a commit to lmco/ray, Aug 31, 2023 (…size 1 ray-project#38833); additionally Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
  • LeonLuttenberger pushed a commit to jaidisido/ray, Sep 5, 2023 (…size 1 ray-project#38833)
  • jimthompson5802 pushed a commit to jimthompson5802/ray, Sep 12, 2023 (…size 1 ray-project#38833); additionally Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
  • vymao pushed a commit to vymao/ray, Oct 11, 2023 (…size 1 ray-project#38833); additionally Signed-off-by: Victor <vctr.y.m@example.com>
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.

5 participants