[Data] Perform incremental writes to Parquet files #43563

bveeramani · 2024-02-29T17:36:37Z

Why are these changes needed?

When we write to Parquet files, we combine all of the input blocks into one big block. This doubles our heap memory usage, because we store the input blocks as well as the combined big block. To avoid OOM issues, this PR updates the implementation to incrementally write one block at a time.

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

bveeramani · 2024-02-29T17:37:59Z

python/ray/data/datasource/parquet_datasink.py



-class _ParquetDatasink(BlockBasedFileDatasink):
+class _ParquetDatasink(_FileDatasink):


@raulchen I decided to subclass _FileDatasink rather than change write_block_to_file to minimize API churn (it's not a public API, but it's still a documented developer API).
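The trade-off in this comment can be sketched with stand-in classes. These are not Ray's actual classes or signatures: `FileSink`, `BlockBasedFileSink`, `CombinedSink`, and `IncrementalSink` are hypothetical names. Blocks are modelled as lists of dicts, and the "file" is an in-memory text buffer. Under those assumptions, the sketch shows how overriding the lower-level `write` gives a sink access to the block iterator, while the `write_block_to_file` hook keeps its signature for existing subclasses.

```python
import io
from typing import Dict, Iterable, List

Block = List[Dict[str, int]]  # stand-in for a real block type


class FileSink:
    """Hypothetical low-level base: subclasses see the block iterator."""

    def write(self, blocks: Iterable[Block], file: io.StringIO) -> None:
        raise NotImplementedError


class BlockBasedFileSink(FileSink):
    """Hypothetical convenience base: combines blocks before writing."""

    def write(self, blocks: Iterable[Block], file: io.StringIO) -> None:
        # All rows are gathered into one big block first.
        combined = [row for block in blocks for row in block]
        self.write_block_to_file(combined, file)

    def write_block_to_file(self, block: Block, file: io.StringIO) -> None:
        raise NotImplementedError


class CombinedSink(BlockBasedFileSink):
    """Existing-style subclass: only ever sees the combined block."""

    def write_block_to_file(self, block: Block, file: io.StringIO) -> None:
        for row in block:
            file.write(f"{row['id']}\n")


class IncrementalSink(FileSink):
    """New-style subclass: writes each block as it arrives."""

    def write(self, blocks: Iterable[Block], file: io.StringIO) -> None:
        for block in blocks:
            for row in block:
                file.write(f"{row['id']}\n")


data = [[{"id": 1}], [{"id": 2}, {"id": 3}]]

combined_buf = io.StringIO()
CombinedSink().write(iter(data), combined_buf)

incremental_buf = io.StringIO()
IncrementalSink().write(iter(data), incremental_buf)

combined_output = combined_buf.getvalue()
incremental_output = incremental_buf.getvalue()
```

Both sinks produce the same output. The difference is that `IncrementalSink` never builds the combined block, and `BlockBasedFileSink.write_block_to_file` is left unchanged.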


bveeramani added 2 commits February 29, 2024 09:34

Initial commit

c962108

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

Appease lint

498f073

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

bveeramani requested review from ericl, scv119, c21, amogkam, scottjlee, raulchen, stephanie-wang and omatthew98 as code owners February 29, 2024 17:36


raulchen approved these changes Feb 29, 2024


can-anyscale merged commit 5da4795 into ray-project:master Feb 29, 2024
8 of 9 checks passed
