[Data] Fix bug where `_StatsActor` errors with `PandasBlock`
#40481
Merged
Why are these changes needed?
tl;dr: `StatsActor` errors if your dataset contains pandas blocks.
Results from write tasks are stored in pandas blocks (`ray/python/ray/data/_internal/planner/plan_write_op.py`, lines 22 to 30 in 7c44833).
The problem is that Ray expects stats to be an `int` or `float` (`ray/python/ray/data/_internal/stats.py`, line 225 in 7c44833).
But `PandasBlock.size_bytes()` returns a NumPy scalar (`ray/python/ray/data/_internal/pandas_block.py`, line 241 in 7c44833).
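The type mismatch can be reproduced outside Ray. The snippet below is a minimal sketch, not the code from this PR: `validate_stat` is a hypothetical stand-in for a check that accepts only `int` or `float`, and it assumes the block size is computed with something like `DataFrame.memory_usage(...).sum()`, which may differ from the actual line in `pandas_block.py`.

```python
import numpy as np
import pandas as pd


def validate_stat(value):
    """Hypothetical stand-in for a stats check that accepts only int or float."""
    if not isinstance(value, (int, float)):
        raise TypeError(f"expected int or float, got {type(value).__name__}")
    return value


df = pd.DataFrame({"a": [1, 2, 3]})

# Assumption: the size is a sum over per-column memory usage.
# The sum is a NumPy integer scalar, which does not subclass Python's int.
size = df.memory_usage(index=True, deep=True).sum()
is_numpy_scalar = isinstance(size, np.integer)
is_plain_int = isinstance(size, int)

# The strict check rejects the NumPy scalar.
try:
    validate_stat(size)
    rejected = False
except TypeError:
    rejected = True

# Converting to a built-in int first satisfies the check.
normalized = validate_stat(int(size))
```

Note that a NumPy `float64` would pass such a check, because it subclasses Python's `float`; the integer scalar is the case that fails.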
This PR fixes the issue by converting the NumPy scalar to an `int`.
Related issue number
Fixes #40480
Checks
... (`git commit -s`) in this PR.
... `scripts/format.sh` to lint the changes in this PR.
... method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.