[Datasets] Add Pandas-native groupby and sorting. #26313

Merged

clarkzinzow merged 2 commits into ray-project:master from clarkzinzow:datasets/feat/pandas-groupby-sort-impl

Jul 21, 2022

Contributor

clarkzinzow commented Jul 5, 2022

This PR adds a Pandas-native implementation of groupby and sorting for Pandas blocks. Before this PR, we converted to Arrow, did the groupby, aggregation, and sorting in Arrow, and then converted back to Pandas. This round-trip conversion happened on both the map side and the reduce side, which was very inefficient for Pandas blocks (many extra table copies). With Pandas-native groupby and sorting, we should see lower memory utilization and faster performance when using the AIR preprocessors.
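To make the idea concrete, here is a minimal sketch of sorting and aggregating a Pandas block directly with pandas, with no conversion to Arrow in between. The helper names `sort_block` and `sum_by_key` are hypothetical and are not taken from this PR; the actual `PandasBlockAccessor` methods may look quite different.

```python
import pandas as pd


def sort_block(block: pd.DataFrame, key: str, descending: bool = False) -> pd.DataFrame:
    # Hypothetical helper: sort the DataFrame in pandas itself rather than
    # converting to an Arrow table, sorting there, and converting back.
    return block.sort_values(by=key, ascending=not descending).reset_index(drop=True)


def sum_by_key(block: pd.DataFrame, key: str, value: str) -> pd.DataFrame:
    # Hypothetical helper: one pandas groupby + sum, again without leaving pandas.
    return block.groupby(key, sort=True, as_index=False)[value].sum()


block = pd.DataFrame({"k": ["b", "a", "b", "a"], "v": [1, 2, 3, 4]})
sorted_block = sort_block(block, "k")
summed = sum_by_key(block, "k", "v")
```

The point of the sketch is only that each step stays within pandas, so no intermediate Arrow tables are materialized.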

Related issue number

Closes #21296

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

clarkzinzow marked this pull request as ready for review

July 5, 2022 23:41

clarkzinzow requested review from ericl, scv119, jjyao and jianoaix as code owners

July 5, 2022 23:41

clarkzinzow changed the title ~~[Datasets[ Add Pandas-native groupby and sorting.~~ [Datasets] Add Pandas-native groupby and sorting.

clarkzinzow force-pushed the datasets/feat/pandas-groupby-sort-impl branch from 0e30293 to f7d7d93 Compare

July 6, 2022 15:26

jjyao self-assigned this

jjyao approved these changes

python/ray/data/_internal/pandas_block.py Outdated

    
        bounds = table[col].searchsorted(boundaries)
    last_idx = 0
    for idx in bounds:
        # Slices need to be copied to avoid including the base table

Collaborator

jjyao Jul 6, 2022

Is this comment relevant for pandas table as well?

Contributor Author

clarkzinzow Jul 19, 2022

It's not, let me remove that!
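For readers following the hunk above, this is a rough sketch of how a sorted Pandas block might be split at a list of boundary values using `Series.searchsorted`. The function name `partition_sorted` and the use of `iloc` slicing are assumptions for illustration; only the `searchsorted` call and the `last_idx` loop shape come from the diff.

```python
from typing import List

import pandas as pd


def partition_sorted(table: pd.DataFrame, col: str, boundaries: list) -> List[pd.DataFrame]:
    # Assumes `table` is already sorted ascending by `col`.
    # searchsorted returns, for each boundary, the insertion position in the column.
    bounds = table[col].searchsorted(boundaries)
    partitions = []
    last_idx = 0
    for idx in bounds:
        idx = int(idx)
        partitions.append(table.iloc[last_idx:idx])
        last_idx = idx
    # Everything past the last boundary forms the final partition.
    partitions.append(table.iloc[last_idx:])
    return partitions


table = pd.DataFrame({"x": [1, 3, 5, 7, 9]})
parts = partition_sorted(table, "x", [4, 8])
```

With `N` boundaries this yields `N + 1` partitions, some of which may be empty.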

python/ray/data/_internal/pandas_block.py Outdated

    
    """
    if key is not None and not isinstance(key, str):
        raise ValueError(
            "key must be a string or None when aggregating on Arrow blocks, but "

Collaborator

jjyao Jul 6, 2022

no longer Arrow :)

clarkzinzow mentioned this pull request

[Datasets] [Pandas Block] Implement PandasBlockAccessor in pandas-native ways #21296

Closed

2 tasks

clarkzinzow added 2 commits

July 19, 2022 20:16


          Add Pandas-native groupby and sorting.

3b3a6d2


          PR feedback.

b116873

clarkzinzow force-pushed the datasets/feat/pandas-groupby-sort-impl branch from f7d7d93 to b116873 Compare

July 19, 2022 20:18

c21 approved these changes

Contributor

c21 left a comment

LGTM. I have some minor comments/questions. Curious how much performance improvement we saw during testing?

python/ray/data/_internal/pandas_block.py

+                          )
+                      if self._table.shape[0] == 0:
+                          # If the pyarrow table is empty we may not have schema

Contributor

c21 Jul 20, 2022

pyarrow table -> pandas DataFrame?

python/ray/data/_internal/pandas_block.py

                   def combine(self, key: KeyFn, aggs: Tuple[AggregateFn]) -> "pandas.DataFrame":
-                      # TODO (kfstorm): A workaround to pass tests. Not efficient.
-                      return BlockAccessor.for_block(self.to_arrow()).combine(key, aggs).to_pandas()
+                      """Combine rows with the same key into an accumulator.

Contributor

c21 Jul 20, 2022

This looks very similar to ArrowBlockAccessor.combine, except that the builder is different. Shall we refactor them into a single method?

This is a non-blocking comment.
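One possible shape for such a refactor, sketched in plain Python: the combine logic takes the builder as a parameter, so that only the builder differs between block formats. `ListBuilder` and `combine_sorted_rows` are hypothetical stand-ins and do not reflect Ray's actual builder or accessor interfaces.

```python
from itertools import groupby
from typing import Any, Dict, Iterable, List


class ListBuilder:
    """Hypothetical stand-in for a format-specific block builder."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, Any]] = []

    def add(self, row: Dict[str, Any]) -> None:
        self.rows.append(row)

    def build(self) -> List[Dict[str, Any]]:
        return list(self.rows)


def combine_sorted_rows(
    rows: Iterable[Dict[str, Any]], key: str, value: str, builder: ListBuilder
) -> List[Dict[str, Any]]:
    # Assumes `rows` is sorted by `key`, so equal keys are adjacent.
    for k, group in groupby(rows, key=lambda r: r[key]):
        # A sum stands in for the general accumulator here.
        builder.add({key: k, "sum": sum(r[value] for r in group)})
    return builder.build()


rows = [{"k": "a", "v": 1}, {"k": "a", "v": 2}, {"k": "b", "v": 5}]
combined = combine_sorted_rows(rows, "k", "v", ListBuilder())
```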

python/ray/data/_internal/pandas_block.py

+                      stats = BlockExecStats.builder()
+                      blocks = [b for b in blocks if b.shape[0] > 0]
+                      if len(blocks) == 0:
+                          ret = PandasBlockAccessor._empty_table()

Contributor

c21 Jul 20, 2022

Why not just self._empty_table() as above?

python/ray/data/_internal/pandas_block.py

                   @staticmethod
                   def merge_sorted_blocks(
-                      blocks: List["pandas.DataFrame"], key: "SortKeyT", _descending: bool
+                      blocks: List[Block[T]], key: "SortKeyT", _descending: bool
                   ) -> Tuple["pandas.DataFrame", BlockMetadata]:

Contributor

c21 Jul 20, 2022

What's blocking us from changing the return type to Tuple[Block[T], BlockMetadata], the same as ArrowBlockAccessor?
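As a rough illustration of the two hunks above, the sketch below drops empty blocks, returns an empty DataFrame when nothing is left, and otherwise concatenates and re-sorts. It assumes the empty-block filter belongs to the same function as the `merge_sorted_blocks` signature, uses `pd.DataFrame()` in place of `_empty_table()`, and omits the `BlockMetadata` part of the return value; the function name is hypothetical.

```python
from typing import List

import pandas as pd


def merge_sorted_blocks_sketch(
    blocks: List[pd.DataFrame], key: str, descending: bool = False
) -> pd.DataFrame:
    # Drop empty blocks first; they may not carry a usable schema.
    blocks = [b for b in blocks if b.shape[0] > 0]
    if len(blocks) == 0:
        # Stand-in for the accessor's empty-table helper.
        return pd.DataFrame()
    merged = pd.concat(blocks, ignore_index=True)
    # Re-sorting the concatenation is the simplest approach; a k-way merge
    # of the already-sorted inputs would be an alternative.
    merged = merged.sort_values(by=key, ascending=not descending, kind="mergesort")
    return merged.reset_index(drop=True)


a = pd.DataFrame({"x": [1, 4]})
empty = pd.DataFrame({"x": []})
c = pd.DataFrame({"x": [2, 3]})
merged = merge_sorted_blocks_sketch([a, empty, c], "x")
nothing = merge_sorted_blocks_sketch([empty], "x")
```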

python/ray/data/_internal/pandas_block.py

+                      )
+                      next_row = None
+                      builder = PandasBlockBuilder()
+                      while True:

Contributor

c21 Jul 20, 2022

nit: this merge-aggregate loop is also mostly the same as Arrow's function. Non-blocking comment as well.
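For context, a merge-aggregate loop of this kind can be sketched as a k-way merge over key-sorted blocks followed by accumulating runs of equal keys. The version below uses `heapq.merge` and `itertools.groupby` from the standard library and a plain sum as the accumulator; it is an illustration under those assumptions, not the loop from this PR, and `merge_and_sum` is a hypothetical name.

```python
import heapq
from itertools import groupby
from operator import itemgetter
from typing import List

import pandas as pd


def merge_and_sum(blocks: List[pd.DataFrame], key: str, value: str) -> pd.DataFrame:
    # Each block is assumed to be sorted by `key` already.
    row_lists = [b.to_dict("records") for b in blocks]
    # k-way merge of the per-block rows, ordered by key.
    merged = heapq.merge(*row_lists, key=itemgetter(key))
    out = []
    # Rows with equal keys are now adjacent, so accumulate each run.
    for k, rows in groupby(merged, key=itemgetter(key)):
        out.append({key: k, value: sum(r[value] for r in rows)})
    return pd.DataFrame(out, columns=[key, value])


left = pd.DataFrame({"k": ["a", "c"], "v": [1, 2]})
right = pd.DataFrame({"k": ["a", "b"], "v": [10, 20]})
result = merge_and_sum([left, right], "k", "v")
```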

Contributor

c21 commented Jul 21, 2022

Discussed offline; feel free to merge it as is, and I can open a follow-up PR to address the comments.

Contributor Author

clarkzinzow commented Jul 21, 2022

Synced with @c21 offline; agreed with all feedback, but @c21 is going to take it on in a follow-up PR so we can get this in.

clarkzinzow merged commit da97efb into ray-project:master

Rohan138 pushed a commit to Rohan138/ray that referenced this pull request


          [Datasets] Add Pandas-native groupby and sorting. (ray-project#26313)

4fd2f48

Signed-off-by: Rohan138 <rapotdar@purdue.edu>

Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request


          [Datasets] Add Pandas-native groupby and sorting. (ray-project#26313)

0d511b8

Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>

Reviewers

c21 c21 approved these changes

jjyao jjyao approved these changes

ericl Awaiting requested review from ericl

scv119 Awaiting requested review from scv119

jianoaix Awaiting requested review from jianoaix

Labels

None yet