Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Add vectorized global and grouped aggregations. #23478

Conversation

clarkzinzow
Copy link
Contributor

@clarkzinzow clarkzinzow commented Mar 24, 2022

This PR adds support for vectorized global and grouped aggregations, porting the built-in aggregations to vectorized block aggregations for tabular datasets.

The AggregateFn abstraction is extended to have an optional vectorized_aggregate function that performs a vectorized aggregation on a single block, allowing aggregations to opt-in to vectorized block aggregation. The AggregateFn also exposes a can_vectorize_for_block() API, which allows aggregations to opt-in for vectorized block aggregation for only certain block types, e.g. only for Arrow and Pandas blocks. The built-in set of aggregations currently only opts-in to vectorized block aggregation on tabular datasets, i.e. only for Arrow and Pandas blocks, since vectorized aggregation of simple blocks will amount to the accumulator loop with extra copying for each group slice (no zero-copy views possible for Python lists).

For Arrow blocks, vectorized block aggregation is supported by creating zero-copy views on each group slice within each partition and applying the vectorized aggregation on these group slice views. This currently entails two scans of each block partition: one to determine the group view boundaries, and one to process each group. As a future optimization, we could eliminate this extra scan by gathering group boundaries while sorting and partitioning each block along the sample boundaries.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
python/ray/data/dataset.py Outdated Show resolved Hide resolved
@clarkzinzow clarkzinzow changed the title [Datasets] Add prototype for vectorized global aggregations. [Datasets] Add vectorized global and grouped aggregations. Mar 29, 2022
@clarkzinzow clarkzinzow requested a review from ericl March 29, 2022 19:23
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level looks good! @jianoaix @jjyao can you review for implementation?

Btw, I think we should consider (micro)benchmarks for agg / groupby here, similar to the existing ray_perf tests. It will be easy to regress by accident.

@clarkzinzow clarkzinzow force-pushed the datasets/feat/vectorized-global-aggregations branch 4 times, most recently from a37d1bd to 3622a0a Compare March 29, 2022 20:30
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/impl/null_aggregate.py Outdated Show resolved Hide resolved
python/ray/data/impl/null_aggregate.py Outdated Show resolved Hide resolved
python/ray/data/impl/arrow_block.py Show resolved Hide resolved
python/ray/data/impl/null_aggregate.py Show resolved Hide resolved
python/ray/data/impl/pandas_block.py Outdated Show resolved Hide resolved
python/ray/data/tests/test_dataset.py Show resolved Hide resolved
@clarkzinzow clarkzinzow force-pushed the datasets/feat/vectorized-global-aggregations branch 2 times, most recently from 4f44cb1 to bbb480c Compare March 30, 2022 23:54
Copy link
Contributor

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lg! I'll take a closer look tomorrow.

python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Outdated Show resolved Hide resolved
python/ray/data/impl/null_aggregate.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Show resolved Hide resolved
python/requirements.txt Outdated Show resolved Hide resolved
python/ray/data/impl/pandas_block.py Outdated Show resolved Hide resolved
python/ray/data/impl/pandas_block.py Outdated Show resolved Hide resolved
python/ray/data/aggregate.py Show resolved Hide resolved
Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good!

python/ray/data/impl/arrow_block.py Show resolved Hide resolved
python/ray/data/impl/arrow_block.py Outdated Show resolved Hide resolved
python/ray/data/impl/arrow_block.py Outdated Show resolved Hide resolved
)

if self.num_rows() == 0:
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why None instead of 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be able to delineate between a non-empty table with all nulls (0) and an empty table (None), in case ignore_nulls=False, in which case we'd want to propagate that None in the latter case. See the Mean and Std agg function definitions.

python/ray/data/impl/null_aggregate.py Outdated Show resolved Hide resolved
python/ray/data/impl/null_aggregate.py Outdated Show resolved Hide resolved
python/ray/data/impl/null_aggregate.py Outdated Show resolved Hide resolved
# column will result in an all-None column of object type, which will raise
# a type error when attempting to do most binary operations. We explicitly
# check for this type failure here so we can properly propagate a null.
if np.issubdtype(col.dtype, np.object_) and col.isnull().all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just check beforehand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm purposefully not checking beforehand since this check is expensive: col.isnull().all() creates a fully copy of the column and then traverses it (2 scans). By only checking it on such a TypeError, we keep this expensive check off of the common critical path.

python/ray/data/impl/pandas_block.py Show resolved Hide resolved
python/ray/data/impl/simple_block.py Show resolved Hide resolved
@clarkzinzow clarkzinzow force-pushed the datasets/feat/vectorized-global-aggregations branch from ae74504 to 69e10f4 Compare April 13, 2022 20:32
@clarkzinzow
Copy link
Contributor Author

@jjyao I addressed your feedback, ready for another pass.

@clarkzinzow clarkzinzow requested a review from jjyao April 13, 2022 20:51
@clarkzinzow clarkzinzow force-pushed the datasets/feat/vectorized-global-aggregations branch from 69e10f4 to 55bac7b Compare April 13, 2022 23:39
@clarkzinzow
Copy link
Contributor Author

@jjyao Ping on this!

Copy link
Contributor

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

python/ray/data/impl/arrow_block.py Show resolved Hide resolved
@jianoaix
Copy link
Contributor

Can we have a representative perf test to measure the improvement? (can be a follow-up after this PR)

@clarkzinzow clarkzinzow merged commit 3b09fdd into ray-project:master Apr 15, 2022
@clarkzinzow
Copy link
Contributor Author

@jianoaix Added an issue here! #23936

@clarkzinzow clarkzinzow deleted the datasets/feat/vectorized-global-aggregations branch April 15, 2022 01:58
@clarkzinzow clarkzinzow restored the datasets/feat/vectorized-global-aggregations branch April 1, 2023 02:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants