
[Data] Enable sort over multiple keys in datasets #37124

Merged
merged 21 commits into ray-project:master on Aug 10, 2023

Conversation

jaidisido
Contributor

@jaidisido jaidisido commented Jul 5, 2023

Why are these changes needed?

This is a first draft at enabling sorting of Ray datasets over multiple keys (i.e. passing a list of keys to .sort()).

A new unit test test_sort_with_multiple_keys showcases an example.
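For reference, a minimal usage sketch of what multi-key sorting looks like with this change (the column names and data here are illustrative, not taken from the test):

```python
import ray

ds = ray.data.from_items(
    [
        {"group": "B", "value": 2},
        {"group": "A", "value": 3},
        {"group": "A", "value": 1},
    ]
)

# Sort by "group" first, then break ties by "value".
sorted_ds = ds.sort(["group", "value"])
print(sorted_ds.take_all())
# Rows come back ordered ("A", 1), ("A", 3), ("B", 2).
```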

Related issue number

#25732

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor Author

@jaidisido jaidisido left a comment


Clarifications

table = sort(self._table, key, descending)
if len(boundaries) == 0:
return [table]

partitions = []
# For each boundary value, count the number of items that are less
# than it. Since the block is sorted, these counts partition the items
# such that boundaries[i] <= x < boundaries[i + 1] for each x in
# partition[i]. If `descending` is true, `boundaries` would also be
# in descending order and we only need to count the number of items
# *greater than* the boundary value instead.
Contributor Author

@jaidisido jaidisido Jul 5, 2023


np.searchsorted does not support multi-dimensional arrays, so this implementation over multiple keys is not vectorised and therefore less performant. I would appreciate pointers on how to improve it.

Note that the pyarrow Table.to_pylist method was only introduced in pyarrow >= 7, so a custom _to_pylist function with the same logic is used to support pyarrow 6.
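For illustration, a minimal sketch of what such a fallback could look like (the helper name mirrors the one mentioned above; the exact body in the PR may differ):

```python
from typing import Any, Dict, List

import pyarrow as pa


def _to_pylist(table: pa.Table) -> List[Dict[str, Any]]:
    """Equivalent of Table.to_pylist (available in pyarrow >= 7) for older pyarrow versions."""
    pydict = table.to_pydict()  # {column name: list of Python values}
    columns = list(pydict.keys())
    return [dict(zip(columns, row)) for row in zip(*pydict.values())]
```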

Contributor


It's odd that I see we are already using to_pylist and our CI also tests against pyarrow 6; not sure why that works. I'll confirm. Also, if this is indeed needed, we can move the definition to python/ray/data/_internal/arrow_ops/transform_pyarrow.py.

Contributor


CI for pyarrow 6 was failing. Thanks, moved to arrow_ops.

Signed-off-by: Abdel Jaidi <jaidisido@gmail.com>
][1:]
for k, v in sample_dict.items()
}
return [
Contributor Author


sample_boundaries now returns a list of tuples sorted across the keys. For example:

[("A", 1, True), ("A", 2, False), ("B", 1, False)...]

Contributor


Could you add this to the function comment? And could you also add some comments explaining the above code? It's not very easy to understand.

Contributor


Thanks, that dict/list comprehension was a bit heavy. I've refactored it and removed the sorting inside the loop. Added clarifying comments as well.

Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
@kukushking kukushking force-pushed the feat/ray-data-sort-multi-keys branch from 49729c5 to 7715bfb on July 7, 2023 15:32
@zhe-thoughts
Collaborator

Question from live discussion: is it OK to start with PyArrow blocks and later extend to more generic cases? @raulchen

@raulchen raulchen self-assigned this Jul 7, 2023
@raulchen
Contributor

raulchen commented Jul 7, 2023

@zhe-thoughts I think that is fine as long as the code structure is generic and extensible in the future. E.g., we should define some abstraction in the base class and leave some subclasses as NotImplemented. Will look at this PR later today.

@kukushking
Contributor

Hi @zhe-thoughts @raulchen, FYI this PR now also includes the pandas blocks implementation, added on Friday.

Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
@raulchen
Contributor

Sorry for the delay. Just got a chance to fully read through the whole PR today.

One major concern is that the type definition of the sort key has been messy and inconsistent across the code base. Of course, this is an existing issue, but extending to multiple sort columns would make the situation worse. I'm wondering if you'd be able to do a sanitization refactor around this (if you don't have time, we can work on this as well).

Here is the current situation:

  • For Dataset.sort, SortStage, _validate_key_fn, and the Sort logical op, we use key: Optional[str], descending: bool.
  • But for generate_sort_fn, SortTaskSpec, _internal/sort.py, and BlockAccessor.sample, we use key: SortKeyT, which is Union[None, List[Tuple[str, str]], Callable[[T], Any]].
    • For SortKeyT, if it's List[Tuple[str, str]], it'd be something like [("column1", "ascending"), ("column2", "descending")]. Because of this, we have a lot of code doing something like col = key[0][0] or isinstance(key, list), which is very hard to read. I believe the reason it's defined this way is to match the definition of pyarrow.compute.sort_indices. Also, the Callable case doesn't seem to be implemented at all (@c21 correct me if I'm wrong).

To sanitize this, I'm thinking of the following approach:

  • At the API level (only Dataset.sort), we still use key: Optional[str], descending: bool, and extend them to lists after this PR.
  • For everything else, we should define a SortKey class to encapsulate all the related logic. The class will provide methods like get_columns, to_arrow_sort_args, to_pandas_sort_args, etc. (a rough sketch follows below).
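A rough sketch of what such a SortKey class could look like (an assumption for discussion, not the merged implementation):

```python
from typing import List, Tuple, Union


class SortKey:
    """Encapsulates sort columns and per-column sort directions."""

    def __init__(
        self,
        key: Union[str, List[str]],
        descending: Union[bool, List[bool]] = False,
    ):
        self._columns = [key] if isinstance(key, str) else list(key)
        if isinstance(descending, bool):
            descending = [descending] * len(self._columns)
        if len(descending) != len(self._columns):
            raise ValueError("Number of descending flags must match the number of key columns.")
        self._descending = list(descending)

    def get_columns(self) -> List[str]:
        return self._columns

    def to_arrow_sort_args(self) -> List[Tuple[str, str]]:
        # pyarrow.compute.sort_indices expects [("col", "ascending" | "descending"), ...].
        return [
            (col, "descending" if desc else "ascending")
            for col, desc in zip(self._columns, self._descending)
        ]

    def to_pandas_sort_args(self) -> Tuple[List[str], List[bool]]:
        # pandas.DataFrame.sort_values takes `by=` and `ascending=`.
        return self._columns, [not d for d in self._descending]
```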

python/ray/data/tests/test_sort.py (resolved review thread)
@@ -2014,7 +2014,9 @@ def std(
ret = self._aggregate_on(Std, on, ignore_nulls, ddof=ddof)
return self._aggregate_result(ret)

def sort(self, key: Optional[str] = None, descending: bool = False) -> "Dataset":
def sort(
self, key: Optional[Union[str, List[str]]] = None, descending: bool = False
Contributor


  • Same as pandas.DataFrame.sort_values, should descending also accept a list of booleans? (A small pandas illustration follows below.)
  • Can you also update the example and comment below?
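For reference, this is the pandas behaviour the first point refers to: sort_values accepts one direction flag per sort column (the DataFrame here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 1, 2]})

# One flag per sort column: "a" ascending, "b" descending.
print(df.sort_values(by=["a", "b"], ascending=[True, False]))
```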

Contributor


Thanks, that makes sense. I was wondering what a high-level API that allows multi-directional sort should look like for the user - whether it's a list of tuples, as in some parts of the existing codebase, or something else. sort_values is a good example.

Contributor Author


Multi-directional sort might be ambitious, in my opinion. It would require a completely different approach, as bisect does not support it.

Contributor


I think that can be done with a custom key function for bisect. Did you try that?
Also, by using a custom key function, I think we can avoid reversing the order of the items.

Contributor Author


AFAIK the key argument to bisect was only introduced in Python 3.10+, so it isn't available in 3.8 or 3.9:
https://docs.python.org/3.10/library/bisect.html
https://docs.python.org/3.8/library/bisect.html

Contributor


Got it. But then I think it's worth implementing our own bisect to avoid the copy.

Contributor


A custom multi-order key function works well when the values are simple types like ints, e.g. lambda x: (x[0], -x[1]), but it gets more complicated with other types.
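For reference, a minimal sketch of a key-aware bisect_left that works on Python 3.8/3.9, as an illustration of the idea discussed above rather than the PR's code:

```python
from typing import Any, Callable, Sequence


def bisect_left_by_key(
    seq: Sequence[Any], boundary: Any, key: Callable[[Any], Any]
) -> int:
    """Return the first index i with key(seq[i]) >= boundary, assuming seq is sorted by key."""
    lo, hi = 0, len(seq)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(seq[mid]) < boundary:
            lo = mid + 1
        else:
            hi = mid
    return lo
```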

# For each boundary value, count the number of items that are less
# than it. Since the block is sorted, these counts partition the items
# such that boundaries[i] <= x < boundaries[i + 1] for each x in
# partition[i]. If `descending` is true, `boundaries` would also be
# in descending order and we only need to count the number of items
# *greater than* the boundary value instead.
table_items = [tuple(d.values()) for d in _to_pylist(table.select(columns))]
Contributor


This looks heavy. I guess it's possible to bisect on the original arrow table, without converting it to a pylist.

Contributor Author


I would obviously prefer to bisect on the entire arrow table, but I'm not sure how you're suggesting to do it here?

Contributor


I was thinking of doing the bisect with the original arrow table, with a custom key argument.
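A hedged sketch of that idea: binary-search over row indices and let a key function read the sort columns lazily from the Arrow table, so no pylist copy is materialized (the `columns` and `boundary` arguments are illustrative assumptions):

```python
from typing import Any, Sequence, Tuple

import pyarrow as pa


def searchsorted_table(
    table: pa.Table, columns: Sequence[str], boundary: Tuple[Any, ...]
) -> int:
    """Return the first row index whose key tuple is >= boundary, assuming the table is sorted."""

    def row_key(i: int) -> Tuple[Any, ...]:
        # Indexing a ChunkedArray yields a scalar; .as_py() converts one value at a time.
        return tuple(table.column(c)[i].as_py() for c in columns)

    lo, hi = 0, table.num_rows
    while lo < hi:
        mid = (lo + hi) // 2
        if row_key(mid) < boundary:
            lo = mid + 1
        else:
            hi = mid
    return lo
```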

Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
@kukushking
Contributor

Hi @raulchen yeah that makes sense. I'll give the approach a shot.

On another note, looking at the last CI run there are two failures seemingly unrelated to the PR: Ray Train (python/ray/train/tests/test_lightning_deepspeed.py::test_deepspeed_stages) and the link check. Do you know what to do with those? In particular the train test - is it just flaky, or does it potentially show a performance drop?

@raulchen
Contributor

hey @kukushking, the code looks good to me. But I don't fully understand your proposal. Are you proposing to keep SortKeyT for some APIs?

@kukushking
Contributor

@raulchen Yes, that's right. It seems that BlockAccessor methods are also user-facing, so it would make sense to keep the same definition as Dataset.sort; I wanted to hear your opinion here. Happy to merge my branch into this PR as-is, though.

@raulchen
Contributor

@kukushking I see. But BlockAccessor isn't supposed to be user-facing. If any docs/comments cause confusion, please let us know; we should update them. By the way, if possible, it'd be preferable to submit the refactor as a separate PR, so the change is easier to review and track. But if that would bring significant overhead, using one PR is acceptable as well.

@kukushking
Contributor

@raulchen sounds good, I will submit a separate PR. Thanks!

@PhysicsACE

@raulchen I was going to submit this PR before but due to hardware issues, I was unable to access my local repo as I did not have access to my computer. I have a working prototype of multikey sorting/grouping/aggregating on my forked branch https://github.com/PhysicsACE/ray/tree/multikey and if it looks good, I would be happy to merge the changes into this PR for the requested feature.

@raulchen
Contributor

> @raulchen I was going to submit this PR before but due to hardware issues, I was unable to access my local repo as I did not have access to my computer. I have a working prototype of multikey sorting/grouping/aggregating on my forked branch https://github.com/PhysicsACE/ray/tree/multikey and if it looks good, I would be happy to merge the changes into this PR for the requested feature.

Sorry, I am confused. How is that branch different from this PR? Do you mean that branch not only implements sort, but also groupby and agg? I think it'd be better to submit different PRs if possible.

@kukushking kukushking mentioned this pull request Jul 24, 2023
@PhysicsACE

> @raulchen I was going to submit this PR before but due to hardware issues, I was unable to access my local repo as I did not have access to my computer. I have a working prototype of multikey sorting/grouping/aggregating on my forked branch https://github.com/PhysicsACE/ray/tree/multikey and if it looks good, I would be happy to merge the changes into this PR for the requested feature.
>
> Sorry, I am confused. How is that branch different from this PR? Do you mean that branch not only implements sort, but also groupby and agg? I think it'd be better to submit different PRs if possible.

It implements both sort and groupby. It is different than this branch in terms of implementation as it has a custom searchsorted function to support directions for each column that can be integrated with this PR. Once sort is supported, I can create a separate PR for groupby as it requires the same sort_and_partition function.

@raulchen
Contributor

> @raulchen I was going to submit this PR before but due to hardware issues, I was unable to access my local repo as I did not have access to my computer. I have a working prototype of multikey sorting/grouping/aggregating on my forked branch https://github.com/PhysicsACE/ray/tree/multikey and if it looks good, I would be happy to merge the changes into this PR for the requested feature.
>
> Sorry, I am confused. How is that branch different from this PR? Do you mean that branch not only implements sort, but also groupby and agg? I think it'd be better to submit different PRs if possible.
>
> It implements both sort and groupby. It is different than this branch in terms of implementation as it has a custom searchsorted function to support directions for each column that can be integrated with this PR. Once sort is supported, I can create a separate PR for groupby as it requires the same sort_and_partition function.

I see. Thanks for the clarification. Maybe let's wait until this PR is merged, and then you can submit a separate one?

LeonLuttenberger and others added 4 commits July 28, 2023 15:18
Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
descending: Whether to sort in descending order.
key: The column or a list of columns to sort by.
descending: Whether to sort in descending order. Must be a boolean or a list
of booleans matching the number of the colunns.
Contributor


typo columns

Contributor

@kukushking kukushking Aug 8, 2023


Thank you! Fixed

if len(descending) != len(key):
raise ValueError(
f"Descending must be a boolean or a list of booleans, "
f"but got {descending}."
Contributor


This error message is a bit misleading; just say the lengths don't match?

Contributor


Agreed, thanks!


Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
@raulchen raulchen merged commit bbc2df7 into ray-project:master Aug 10, 2023
56 of 60 checks passed
@LeonLuttenberger LeonLuttenberger deleted the feat/ray-data-sort-multi-keys branch August 10, 2023 18:18
shrekris-anyscale pushed a commit to shrekris-anyscale/ray that referenced this pull request Aug 10, 2023
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023