-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Datasets] Add Pandas-native groupby and sorting. #26313
[Datasets] Add Pandas-native groupby and sorting. #26313
Conversation
0e30293
to
f7d7d93
Compare
bounds = table[col].searchsorted(boundaries) | ||
last_idx = 0 | ||
for idx in bounds: | ||
# Slices need to be copied to avoid including the base table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment relevant for pandas table as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not, let me remove that!
""" | ||
if key is not None and not isinstance(key, str): | ||
raise ValueError( | ||
"key must be a string or None when aggregating on Arrow blocks, but " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no longer Arrow :)
f7d7d93
to
b116873
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Having some minor comments/questions. Curious how much performance improvement we saw during testing?
) | ||
|
||
if self._table.shape[0] == 0: | ||
# If the pyarrow table is empty we may not have schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pyarrow table
-> pandas DataFrame
?
|
||
def combine(self, key: KeyFn, aggs: Tuple[AggregateFn]) -> "pandas.DataFrame": | ||
# TODO (kfstorm): A workaround to pass tests. Not efficient. | ||
return BlockAccessor.for_block(self.to_arrow()).combine(key, aggs).to_pandas() | ||
"""Combine rows with the same key into an accumulator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like very similar to ArrowBlockAccessor.combine
, except the builder
is different. Shall we refactor them into a single method?
This is a non-blocking comment.
stats = BlockExecStats.builder() | ||
blocks = [b for b in blocks if b.shape[0] > 0] | ||
if len(blocks) == 0: | ||
ret = PandasBlockAccessor._empty_table() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just self._empty_table()
as above?
|
||
@staticmethod | ||
def merge_sorted_blocks( | ||
blocks: List["pandas.DataFrame"], key: "SortKeyT", _descending: bool | ||
blocks: List[Block[T]], key: "SortKeyT", _descending: bool | ||
) -> Tuple["pandas.DataFrame", BlockMetadata]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's blocking us to change return type to Tuple[Block[T], BlockMetadata]
same as ArrowBlockAccessor?
) | ||
next_row = None | ||
builder = PandasBlockBuilder() | ||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this merge-aggregate loop is also mostly same as Arrow's function. Non-blocking comment as well.
Discussed offline, feel free to merge it as it is now, and I can add a followup PR to address the comments. |
This PR adds a Pandas-native implementation of groupby and sorting for Pandas blocks. Before this PR, we were converting to Arrow, doing groupbys + aggregations and sorting in Arrow land, and then converting back to Pandas; this to-from-Arrow conversion was happening both on the map side and the reduce side, which was very inefficient for Pandas blocks (many extra table copies). By adding Pandas-native groupby + sorting, we should see a decrease in memory utilization and faster performance when using the AIR preprocessors. Signed-off-by: Rohan138 <rapotdar@purdue.edu>
This PR adds a Pandas-native implementation of groupby and sorting for Pandas blocks. Before this PR, we were converting to Arrow, doing groupbys + aggregations and sorting in Arrow land, and then converting back to Pandas; this to-from-Arrow conversion was happening both on the map side and the reduce side, which was very inefficient for Pandas blocks (many extra table copies). By adding Pandas-native groupby + sorting, we should see a decrease in memory utilization and faster performance when using the AIR preprocessors. Signed-off-by: Stefan van der Kleij <s.vanderkleij@viroteq.com>
This PR adds a Pandas-native implementation of groupby and sorting for Pandas blocks. Before this PR, we were converting to Arrow, doing groupbys + aggregations and sorting in Arrow land, and then converting back to Pandas; this to-from-Arrow conversion was happening both on the map side and the reduce side, which was very inefficient for Pandas blocks (many extra table copies). By adding Pandas-native groupby + sorting, we should see a decrease in memory utilization and faster performance when using the AIR preprocessors.
Related issue number
Closes #21296
Checks
scripts/format.sh
to lint the changes in this PR.