[data] Normalize upstream blocks for zip/map/reduce/etc. operations using inferred block accessors #39960

michaelhly · 2023-09-28T23:20:42Z

Signed-off-by: Michael Huang <michaelhly@gmail.com>

Why are these changes needed?

When reducing mapped outputs (or zipping datasets), the BlockAccessor is inferred from the first block of mapper_outputs. However, mapper_outputs can inadvertently contain heterogeneous block types due to:

1. The user setting an inconsistent batch format in the task graph.

2. Internal fallback conversion when encountering data types not supported by pyarrow:
a.)

ray/python/ray/data/block.py

Lines 366 to 373 in 3135323

    try:
        return ArrowBlockAccessor.numpy_to_block(batch)
    except (pa.ArrowNotImplementedError, pa.ArrowInvalid, pa.ArrowTypeError):
        import pandas as pd

        # TODO(ekl) once we support Python objects within Arrow blocks, we
        # don't need this fallback path.
        return pd.DataFrame(dict(batch))

b.)

ray/python/ray/data/_internal/delegating_block_builder.py

Lines 20 to 28 in 08aa138

    if self._builder is None:
        try:
            check = ArrowBlockBuilder()
            check.add(item)
            check.build()
            self._builder = ArrowBlockBuilder()
        except (TypeError, pyarrow.lib.ArrowInvalid):
            # Can also handle nested Python objects, which Arrow cannot.
            self._builder = PandasBlockBuilder()

We want to normalize heterogeneous blocks in mapper_outputs to Arrow blocks before using the BlockAccessor inferred from the first block to compute the reduced results.

Related issue number

Closes #39155
Closes #39206
Closes #39291
Closes #31550

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(


scottjlee

this will make our outputs much more consistent, thanks!

Also left a related comment here, we could even move the implementation from that PR into this one if it's easier (or wait for this one to merge, then update the other PR).


michaelhly · 2023-09-30T14:28:54Z

python/ray/data/_internal/planner/exchange/sort_task_spec.py

@@ -95,7 +97,7 @@ def sample_boundaries(
        samples = sample_bar.fetch_until_complete(sample_results)
        sample_bar.close()
        del sample_results
-        samples = [s for s in samples if len(s) > 0]
+        samples = normalize_blocks([s for s in samples if len(s) > 0])


normalize here b/c later to_numpy call can be ambiguous

michaelhly · 2023-09-30T14:29:39Z

python/ray/data/_internal/planner/exchange/shuffle_task_spec.py

@@ -82,7 +83,7 @@ def reduce(
        # TODO: Support fusion with other downstream operators.
        stats = BlockExecStats.builder()
        builder = DelegatingBlockBuilder()
-        for block in mapper_outputs:
+        for block in normalize_blocks(mapper_outputs):


normalize here b/c later random_shuffle call can be ambiguous
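
To illustrate the ambiguity in isolation, here is a minimal sketch that uses stand-in classes only. None of these types are Ray's, pyarrow's, or pandas'; all names are hypothetical. It mirrors the pattern from the PR description, where the accessor is picked from the first block and then applied to every block, which fails as soon as a later block has a different type.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ArrowLikeBlock:
    """Stand-in for an Arrow table; exposes num_columns."""
    columns: Dict[str, List[int]]

    @property
    def num_columns(self) -> int:
        return len(self.columns)


@dataclass
class PandasLikeBlock:
    """Stand-in for a DataFrame; deliberately has no num_columns."""
    data: Dict[str, List[int]]


class ArrowLikeAccessor:
    @staticmethod
    def count_columns(block) -> int:
        return block.num_columns


class PandasLikeAccessor:
    @staticmethod
    def count_columns(block) -> int:
        return len(block.data)


def accessor_for(block):
    # Pick an accessor class based on the block's type.
    if isinstance(block, ArrowLikeBlock):
        return ArrowLikeAccessor
    return PandasLikeAccessor


def count_all_columns(blocks) -> List[int]:
    # The accessor is inferred from the first block only.
    accessor = accessor_for(blocks[0])
    return [accessor.count_columns(b) for b in blocks]


def to_arrow_like(block):
    # Normalize: convert the pandas-like stand-in to the arrow-like one.
    if isinstance(block, PandasLikeBlock):
        return ArrowLikeBlock(columns=dict(block.data))
    return block


mixed = [ArrowLikeBlock({"x": [1]}), PandasLikeBlock({"x": [2]})]
try:
    count_all_columns(mixed)
    failed = False
except AttributeError:
    # The arrow-like accessor was applied to a pandas-like block.
    failed = True

counts = count_all_columns([to_arrow_like(b) for b in mixed])
```

With the mixed list the call raises `AttributeError`; after normalizing, the same call succeeds on every block.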

michaelhly · 2023-09-30T14:30:13Z

python/ray/data/_internal/shuffle_and_partition.py

@@ -80,7 +81,7 @@ def reduce(
    ) -> (Block, BlockMetadata):
        stats = BlockExecStats.builder()
        builder = DelegatingBlockBuilder()
-        for block in mapper_outputs:
+        for block in normalize_blocks(mapper_outputs):


normalize here b/c later random_shuffle call can be ambiguous

michaelhly · 2023-09-30T14:30:38Z

python/ray/data/_internal/sort.py

@@ -165,7 +167,7 @@ def sample_boundaries(
    if should_close_bar:
        sample_bar.close()
    del sample_results
-    samples = [s for s in samples if len(s) > 0]
+    samples = normalize_blocks([s for s in samples if len(s) > 0])


normalize here b/c later to_numpy call can be ambiguous

michaelhly · 2023-10-16T17:52:18Z

Hi @c21. Would you mind taking a look at this?

stale · 2023-12-15T05:59:58Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

michaelhly added 3 commits September 28, 2023 19:14

Normalize mapper outputs when reducing blocks

b9874fb

Signed-off-by: Michael Huang <michaelhly@gmail.com>

Add option to skip type check

169c1bd

Signed-off-by: Michael Huang <michaelhly@gmail.com>

Fixes

80c113f

Signed-off-by: Michael Huang <michaelhly@gmail.com>

michaelhly changed the title ~~[data] Normalize mapper outputs when reducing blocks~~ [data] Normalize mapper_outputs when reducing blocks Sep 28, 2023

Skip type check

9af5484

Signed-off-by: Michael Huang <michaelhly@gmail.com>

michaelhly changed the title ~~[data] Normalize mapper_outputs when reducing blocks~~ [data] Normalize mapper_outputs if block reduction fails on AttributeError Sep 29, 2023

Remove try/except

0b2ceee

Signed-off-by: Michael Huang <michaelhly@gmail.com>

michaelhly changed the title ~~[data] Normalize mapper_outputs if block reduction fails on AttributeError~~ [data] Normalize mapper_outputs before reducing blocks Sep 29, 2023

Clean up

79ca1bc

Signed-off-by: Michael Huang <michaelhly@gmail.com>

michaelhly force-pushed the normalize-mapper-outputs branch 3 times, most recently from 036d720 to 1409bc1 Compare September 29, 2023 03:08

fixup

1dc4193

Signed-off-by: Michael Huang <michaelhly@gmail.com>

michaelhly force-pushed the normalize-mapper-outputs branch from 1409bc1 to 1dc4193 Compare September 29, 2023 03:10

fixup

e1a840d

Signed-off-by: Michael Huang <michaelhly@gmail.com>

michaelhly force-pushed the normalize-mapper-outputs branch from e9b9029 to e1a840d Compare September 29, 2023 03:14

michaelhly mentioned this pull request Sep 29, 2023

[data] Typecheck + fallback to ArrowBlockAccessor when zipping datasets #39817

Closed

8 tasks

scottjlee approved these changes Sep 29, 2023


python/ray/data/_internal/util.py (resolved)

scottjlee assigned raulchen, amogkam and scottjlee Sep 29, 2023

michaelhly added 3 commits September 29, 2023 20:28

Add test

583e1a7

Signed-off-by: Michael Huang <michaelhly@gmail.com>

Add doc string

6120464

Signed-off-by: Michael Huang <michaelhly@gmail.com>

Normalize blocks before zipping

b451ddb

Signed-off-by: Michael Huang <michaelhly@gmail.com>

michaelhly changed the title ~~[data] Normalize mapper_outputs before reducing blocks~~ [data] Normalize blocks before zipping/reducing outputs Sep 30, 2023

michaelhly changed the title ~~[data] Normalize blocks before zipping/reducing outputs~~ [data] Normalize mismatched blocks before zipping/reducing outputs Sep 30, 2023

michaelhly changed the title ~~[data] Normalize mismatched blocks before zipping/reducing outputs~~ [data] Normalize mismatched blocks before zip/reduce operations Sep 30, 2023

Consistency

89f98e6

Signed-off-by: Michael Huang <michaelhly@gmail.com>

michaelhly force-pushed the normalize-mapper-outputs branch from 423661d to 89f98e6 Compare September 30, 2023 02:01

normalize blcoks for in shuffle_and_partition

8fde5bd

Signed-off-by: Michael Huang <michaelhly@gmail.com>

michaelhly added 2 commits September 30, 2023 09:58

Normalize samples before calling .to_numpy

9a687dd

Signed-off-by: Michael Huang <michaelhly@gmail.com>

Fix typo

ecbcaaa

Signed-off-by: Michael Huang <michaelhly@gmail.com>

michaelhly changed the title ~~[data] Normalize upstream blocks before zip/reduce/map operations~~ [data] Normalize upstream blocks before zip/reduce/map operations Sep 30, 2023

michaelhly changed the title ~~[data] Normalize upstream blocks before zip/reduce/map operations~~ [data] Normalize upstream blocks for zip/reduce/map operations with inferred block accessor or delayed block building Sep 30, 2023

michaelhly changed the title ~~[data] Normalize upstream blocks for zip/reduce/map operations with inferred block accessor or delayed block building~~ [data] Normalize upstream blocks for zip/reduce/map operations with inferred block accessor or delegated block building Sep 30, 2023

michaelhly changed the title ~~[data] Normalize upstream blocks for zip/reduce/map operations with inferred block accessor or delegated block building~~ [data] Normalize upstream blocks for zip/reduce/map operations using inferred block accessors Sep 30, 2023

michaelhly commented Sep 30, 2023


michaelhly changed the title ~~[data] Normalize upstream blocks for zip/reduce/map operations using inferred block accessors~~ [data] Normalize upstream blocks for zip/map/reduce operations using inferred block accessors Oct 1, 2023

michaelhly changed the title ~~[data] Normalize upstream blocks for zip/map/reduce operations using inferred block accessors~~ [data] Normalize upstream blocks for zip/map/reduce/etc. operations using inferred block accessors Oct 1, 2023

PRESIDENT810 mentioned this pull request Oct 1, 2023

[data]'DataFrame' object has no attribute 'num_columns' using StandardScaler #39206

Closed

michaelhly marked this pull request as ready for review October 2, 2023 15:28

michaelhly requested review from ericl, scv119, c21, amogkam, bveeramani, raulchen and stephanie-wang as code owners October 2, 2023 15:28

scottjlee assigned c21 and unassigned amogkam Oct 3, 2023

danielezhu mentioned this pull request Nov 15, 2023

fix: replace all calls to add_column with map to avoid setting inconsistent batch format in ray task graph aws/fmeval#115

Merged

stale bot added the stale label Dec 15, 2023

danielezhu mentioned this pull request Dec 15, 2023

fix: replace add_column with map in _generate_prompt_column aws/fmeval#161

Merged

michaelhly requested a review from Zandew as a code owner December 25, 2023 04:13

stale bot removed the stale label Dec 25, 2023

scottjlee mentioned this pull request Mar 6, 2024

[Data] Normalize block types before internal multi-block operations #43764

Merged

8 tasks

anyscalesam added the triage and data labels May 15, 2024
