-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[data] Optimization to reduce ArrowBlock building time for blocks of size 1 #38833
[data] Optimization to reduce ArrowBlock building time for blocks of size 1 #38833
Conversation
For a single-row block Before: After: |
Nice!
…On Thu, Aug 24, 2023, 10:43 AM Stephanie Wang ***@***.***> wrote:
For a single-row block [{"field": 100MB np.array}]:
Before:
In [11]: %timeit b.build()
385 ms ± 8.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
After:
In [13]: %timeit b.build()
116 µs ± 3.59 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
—
Reply to this email directly, view it on GitHub
<#38833 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAADUSTVV4C5L4HF7YBA77LXW6HCXANCNFSM6AAAAAA35LQNKM>
.
You are receiving this because your review was requested.Message ID:
***@***.***>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Please verify |
Hey @zhe-thoughts this is ready to merge. |
"buildkite/premerge" is a x |
The failure is unrelated (it's flaky on master) |
I guess |
Yes other checks we can use manual judgement, |
It's passing now. |
…size 1 ray-project#38833 Many Data ops depend on converting numpy batches to Arrow blocks. A single np array -> pyarrow is normally zero-copy, but blocks with multiple rows will need a copy to make the column of np arrays into one contiguous ndarray. This PR avoids this step for blocks of a single row by using np.expand_dims to reshape the array instead of copying it. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
…size 1 #38833 (#38988) Many Data ops depend on converting numpy batches to Arrow blocks. A single np array -> pyarrow is normally zero-copy, but blocks with multiple rows will need a copy to make the column of np arrays into one contiguous ndarray. This PR avoids this step for blocks of a single row by using np.expand_dims to reshape the array instead of copying it. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
…size 1 ray-project#38833 Many Data ops depend on converting numpy batches to Arrow blocks. A single np array -> pyarrow is normally zero-copy, but blocks with multiple rows will need a copy to make the column of np arrays into one contiguous ndarray. This PR avoids this step for blocks of a single row by using np.expand_dims to reshape the array instead of copying it. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
…size 1 ray-project#38833 Many Data ops depend on converting numpy batches to Arrow blocks. A single np array -> pyarrow is normally zero-copy, but blocks with multiple rows will need a copy to make the column of np arrays into one contiguous ndarray. This PR avoids this step for blocks of a single row by using np.expand_dims to reshape the array instead of copying it. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
…size 1 ray-project#38833 Many Data ops depend on converting numpy batches to Arrow blocks. A single np array -> pyarrow is normally zero-copy, but blocks with multiple rows will need a copy to make the column of np arrays into one contiguous ndarray. This PR avoids this step for blocks of a single row by using np.expand_dims to reshape the array instead of copying it. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
…size 1 ray-project#38833 Many Data ops depend on converting numpy batches to Arrow blocks. A single np array -> pyarrow is normally zero-copy, but blocks with multiple rows will need a copy to make the column of np arrays into one contiguous ndarray. This PR avoids this step for blocks of a single row by using np.expand_dims to reshape the array instead of copying it. Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: Victor <vctr.y.m@example.com>
Why are these changes needed?
Many Data ops depend on converting numpy batches to Arrow blocks. A single np array -> pyarrow is normally zero-copy, but blocks with multiple rows will need a copy to make the column of np arrays into one contiguous ndarray. This PR avoids this step for blocks of a single row by using
np.expand_dims
to reshape the array instead of copying it.Related issue number
Needed for #37474.
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.