[Data] Enable sort over multiple keys in datasets #37124
Conversation
Signed-off-by: Abdel Jaidi <jaidisido@gmail.com>
Clarifications
table = sort(self._table, key, descending)
if len(boundaries) == 0:
    return [table]

partitions = []
# For each boundary value, count the number of items that are less
# than it. Since the block is sorted, these counts partition the items
# such that boundaries[i] <= x < boundaries[i + 1] for each x in
# partition[i]. If `descending` is true, `boundaries` would also be
# in descending order and we only need to count the number of items
# *greater than* the boundary value instead.
np.searchsorted does not support multi-dimensional arrays, so this implementation over multiple keys is not vectorised and therefore not as performant. Would appreciate pointers on how to improve it.
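The non-vectorised approach being discussed can be sketched with the standard-library `bisect` over row tuples — a simplified illustration of the technique, not the PR's exact code (the name `partition_sorted_rows` is hypothetical):

```python
import bisect


def partition_sorted_rows(rows, boundaries):
    """Split an already-sorted list of key tuples at each boundary tuple.

    rows: list of tuples, sorted ascending across all sort keys.
    boundaries: sorted list of boundary tuples.
    Returns len(boundaries) + 1 partitions, where every row in
    partition[i] compares >= the previous boundary and < boundaries[i].
    """
    partitions = []
    prev = 0
    for b in boundaries:
        # Count rows strictly less than the boundary tuple; tuples
        # compare lexicographically, which matches multi-key sort order.
        idx = bisect.bisect_left(rows, b)
        partitions.append(rows[prev:idx])
        prev = idx
    partitions.append(rows[prev:])
    return partitions
```

Each boundary costs one O(log n) search, but the row tuples must already be materialized, which is the overhead the reviewers discuss below.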
Note that the pyarrow Table to_pylist method was only introduced in pyarrow >= 7, so a custom _to_pylist function with the same logic is used to support pyarrow 6.
It's weird that I find we are already using to_pylist elsewhere, and our CI also tests against pyarrow 6. Not sure why that works; I'll confirm. Also, if this is indeed needed, we can move the definition to python/ray/data/_internal/arrow_ops/transform_pyarrow.py.
CI for pyarrow 6 was failing. Thanks, moved to arrow_ops.
Signed-off-by: Abdel Jaidi <jaidisido@gmail.com>
][1:]
for k, v in sample_dict.items()
}
return [
sample_boundaries now returns a list of tuples sorted across the keys. For example: [("A", 1, True), ("A", 2, False), ("B", 1, False)...]
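Conceptually, once the sampled key tuples are globally sorted, the boundary tuples are evenly spaced picks from that list. This is a simplified sketch of the idea (the name `choose_boundaries` is hypothetical, not the PR's `sample_boundaries` implementation):

```python
def choose_boundaries(sorted_samples, num_partitions):
    """Pick num_partitions - 1 evenly spaced boundary tuples from a
    globally sorted list of sampled key tuples. Each boundary marks the
    start of a new output partition."""
    n = len(sorted_samples)
    step = n / num_partitions
    # Start at i = 1: the first boundary separates partition 0 from
    # partition 1, so there is no boundary before partition 0.
    return [sorted_samples[int(step * i)] for i in range(1, num_partitions)]
```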
Could you add this to the function comment? and could you also add some comments explaining the above code? It's not very easy to understand.
Thanks, that dict/list comprehension was a bit heavy. I've refactored it & removed sorting inside of the loop. Added clarifying comments as well.
Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
Force-pushed from 49729c5 to 7715bfb
Question from live discussion: is it OK to start with PyArrow blocks and later extend to more generic cases? @raulchen
@zhe-thoughts I think that is fine as long as the code structure is generic and extensible in the future. E.g., we should define some abstraction in the base class and leave some subclasses as NotImplemented. Will look at this PR later today.
Hi @zhe-thoughts @raulchen fyi this PR also includes pandas blocks implementation now - added on Friday
Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
Sorry for the delay. Just got a chance to fully read through the whole PR today. One major concern is that the type definition of the sort key has been messy and inconsistent across the code base. Of course, this is an existing issue, but extending to multiple sort columns would make the situation worse. I'm wondering if you'd be able to do a sanitization refactor around this (if you don't have time, we can work on this as well). Here is the current situation:
To sanitize this, I'm thinking of the following approach:
python/ray/data/dataset.py (Outdated)
@@ -2014,7 +2014,9 @@ def std(
        ret = self._aggregate_on(Std, on, ignore_nulls, ddof=ddof)
        return self._aggregate_result(ret)

-    def sort(self, key: Optional[str] = None, descending: bool = False) -> "Dataset":
+    def sort(
+        self, key: Optional[Union[str, List[str]]] = None, descending: bool = False
- Same as pandas.DataFrame.sort_values, descending should also accept a list of booleans?
- Can you also update the example and comment below?
Thanks, that makes sense. I was wondering what a high-level API that allows multi-directional sort should look like for the user - whether it's a list of tuples as in some parts of the existing codebase, or something else. sort_values is a good example.
Multi-directional sort might be ambitious in my opinion. It would require a completely different approach, as bisect does not support it.
I think that can be done with a custom key function for bisect. Did you try that?
Also, by using the custom key function, I think we can avoid reversing the order of the items.
AFAIK the key argument to bisect was only introduced in Python 3.10, so it isn't available in 3.8 or 3.9:
https://docs.python.org/3.10/library/bisect.html
https://docs.python.org/3.8/library/bisect.html
Got it. But then I think it's worth implementing our own bisect to avoid the copy.
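A hand-rolled `bisect_left` that accepts a key callable is short and avoids materializing a transformed copy of the data. On Python 3.10+, `bisect.bisect_left(..., key=...)` provides the same behavior; this sketch works on 3.8/3.9 as well (the function name is illustrative):

```python
def bisect_left_with_key(n, target, key):
    """Locate the leftmost insertion point for `target` among indices
    0..n-1, where key(i) yields the comparable value at index i.
    Assumes the underlying data is sorted ascending under `key`."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        # Only the probed index is ever passed through `key`, so no
        # transformed copy of the whole sequence is built.
        if key(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo
```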
A custom multi-order key function would work well for simple types like ints, i.e. lambda x: (x[0], -x[1]), but it gets more complicated with other types.
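Negation only works for numeric columns. For strings and other comparable types, one workaround is a small wrapper that inverts comparisons, so a single ascending sort or bisect can honor per-column descending order (an illustrative sketch, not the PR's code):

```python
from functools import total_ordering


@total_ordering
class Desc:
    """Wraps a value so comparisons are reversed. This lets one
    ascending sort handle per-column descending order for any
    comparable type, not just ints that could use negation."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        # Reversed: Desc(a) < Desc(b) exactly when a > b.
        return self.value > other.value


def sort_key(row, descending_flags):
    """Build a mixed-direction comparison tuple for one row."""
    return tuple(
        Desc(v) if desc else v for v, desc in zip(row, descending_flags)
    )
```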
# For each boundary value, count the number of items that are less
# than it. Since the block is sorted, these counts partition the items
# such that boundaries[i] <= x < boundaries[i + 1] for each x in
# partition[i]. If `descending` is true, `boundaries` would also be
# in descending order and we only need to count the number of items
# *greater than* the boundary value instead.
table_items = [tuple(d.values()) for d in _to_pylist(table.select(columns))]
This looks heavy. I guess it's possible to bisect on the original arrow table, without converting it to a pylist.
Would obviously prefer to bisect on the entire arrow table but not sure how you are suggesting to do it here?
I was thinking of doing the bisect with the original arrow table, with a custom key argument.
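That idea can be sketched against a columnar structure: build the comparison tuple lazily for each probed index, so only O(log n) rows are ever materialized instead of converting the whole table to a pylist. The sketch below uses a plain dict of lists as a stand-in for an arrow table (names are illustrative, and an actual implementation would index arrow ChunkedArrays instead):

```python
def bisect_columnar(columns, key_names, boundary):
    """Binary-search a sorted columnar table {name: [values]} for the
    index of the first row whose key tuple compares >= boundary.

    Only the rows probed by the binary search are materialized as
    tuples, avoiding a full row-wise conversion of the table."""
    num_rows = len(columns[key_names[0]])
    lo, hi = 0, num_rows
    while lo < hi:
        mid = (lo + hi) // 2
        # Lazily assemble just this row's key tuple from the columns.
        row = tuple(columns[name][mid] for name in key_names)
        if row < boundary:
            lo = mid + 1
        else:
            hi = mid
    return lo
```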
Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
Hi @raulchen yeah that makes sense. I'll give the approach a shot. On another note - looking at the last CI there are two failures seemingly unrelated to the PR - ray train (
hey @kukushking, the code looks good to me. But I don't fully understand your proposal. Are you proposing to keep
@raulchen Yes that's right. It seems that
@kukushking I see. But
@raulchen sounds good, I will submit a separate PR. Thanks!
@raulchen I was going to submit this PR before but, due to hardware issues, I was unable to access my local repo as I did not have access to my computer. I have a working prototype of multikey sorting/grouping/aggregating on my forked branch https://github.com/PhysicsACE/ray/tree/multikey and, if it looks good, I would be happy to merge the changes into this PR for the requested feature.
Sorry, I am confused. How is that branch different from this PR? Do you mean that branch not only implements sort, but also groupby and agg? I think it'd be better to submit different PRs if possible.
It implements both sort and groupby. It differs from this branch in implementation: it has a custom searchsorted function that supports directions for each column, which can be integrated with this PR. Once sort is supported, I can create a separate PR for groupby, as it requires the same sort_and_partition function.
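A custom searchsorted with per-column directions might look like the following sketch (hypothetical code, not the linked branch's implementation): it compares key tuples column by column, flipping the comparison sense for descending columns.

```python
def compare_rows(a, b, descending_flags):
    """Compare two key tuples column by column, returning -1/0/1.
    The comparison sense is flipped for descending columns."""
    for x, y, desc in zip(a, b, descending_flags):
        if x == y:
            continue
        less = x < y
        if desc:
            less = not less
        return -1 if less else 1
    return 0


def searchsorted(rows, boundary, descending_flags):
    """First index in `rows` (assumed sorted under the given per-column
    directions) whose row compares >= boundary."""
    lo, hi = 0, len(rows)
    while lo < hi:
        mid = (lo + hi) // 2
        if compare_rows(rows[mid], boundary, descending_flags) < 0:
            lo = mid + 1
        else:
            hi = mid
    return lo
```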
I see, thanks for the clarification. Maybe let's wait until this PR is merged, and then you can submit a separate one?
Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
python/ray/data/dataset.py (Outdated)
descending: Whether to sort in descending order.
key: The column or a list of columns to sort by.
descending: Whether to sort in descending order. Must be a boolean or a list
    of booleans matching the number of the colunns.
typo: "colunns" should be "columns"
Thank you! Fixed
python/ray/data/_internal/sort.py (Outdated)
if len(descending) != len(key):
    raise ValueError(
        f"Descending must be a boolean or a list of booleans, "
        f"but got {descending}."
this error message is a bit misleading. just say lengths don't match?
Agreed, thanks!
Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com>
This is a first draft at enabling sorting Ray datasets over multiple keys (i.e. passing a list of keys to .sort()). A new unit test test_sort_with_multiple_keys showcases an example. --------- Signed-off-by: Abdel Jaidi <jaidisido@gmail.com> Signed-off-by: Anton Kukushkin <kukushkin.anton@gmail.com> Co-authored-by: Anton Kukushkin <kukushkin.anton@gmail.com> Co-authored-by: Leon Luttenberger <luttenberger.leon@gmail.com> Co-authored-by: Hao Chen <chenh1024@gmail.com> Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Why are these changes needed?

This is a first draft at enabling sorting Ray datasets over multiple keys (i.e. passing a list of keys to .sort()). A new unit test test_sort_with_multiple_keys showcases an example.

Related issue number

#25732

Checks

- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- …method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.