
[Data] Dedupe repeated schema during ParquetDatasource metadata prefetching #44750

Merged

merged 4 commits into ray-project:master on Apr 19, 2024

Conversation

@scottjlee (Contributor) commented Apr 15, 2024

Why are these changes needed?

For large datasets consisting of many files and many columns, the metadata for individual Parquet file fragments can be quite large, using up memory resources. Oftentimes, the metadata is identical across multiple Parquet file fragments, so we can save memory by maintaining a set of unique MetaData objects and referencing that common set in ParquetDatasource.
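A minimal sketch of this idea (illustrative names only, not the actual ParquetDatasource code; the serializer is assumed to be supplied by the caller):

    # Keep one copy of each distinct metadata object; fragments reference it by index.
    unique_metadata = []          # (serialized_bytes, metadata) per distinct object
    fragment_to_unique_idx = []   # per-fragment index into unique_metadata

    def register_fragment_metadata(md, serialize):
        """Record a fragment's metadata, reusing an existing entry if identical."""
        # FileMetaData is not hashable, so equality is checked via the serialized form.
        blob = serialize(md)
        for idx, (existing_blob, _) in enumerate(unique_metadata):
            if existing_blob == blob:
                fragment_to_unique_idx.append(idx)
                return
        unique_metadata.append((blob, md))
        fragment_to_unique_idx.append(len(unique_metadata) - 1)

When many fragments share identical metadata, only one copy of each distinct object is kept in memory.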

Related issue number

Closes #44754

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

A reviewer (Contributor) commented on the following lines of the diff:

    # pyarrow.parquet.FileMetaData is not hashable, so we need to
    # serialize it to compare equality and store in dict.
    fragment_md = _SerializedFragmentMetadata(fragment_md)
    md_exists = False
    for unique_md, md_indices in self._unique_metadata.items():
Since we've already serialized the FileMetaData objects to binaries, we can simply put the binaries in a set to dedupe them; there is no need for this linear scan.
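A sketch of that suggestion (same illustrative names as above, not the merged code): keying a dict on the serialized bytes makes each fragment's lookup constant time instead of a scan over all unique entries.

    # Illustrative sketch of the reviewer's suggestion.
    blob_to_idx = {}              # serialized metadata bytes -> index in unique_metadata
    unique_metadata = []          # distinct metadata objects, in insertion order
    fragment_to_unique_idx = []   # per-fragment index into unique_metadata

    def register_fragment_metadata(md, serialize):
        blob = serialize(md)
        idx = blob_to_idx.get(blob)
        if idx is None:
            idx = len(unique_metadata)
            blob_to_idx[blob] = idx
            unique_metadata.append(md)
        fragment_to_unique_idx.append(idx)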

A second review comment from a Contributor, on these lines of the diff:

    for fragment_idx, fragment_md in enumerate(metadata_list):
        # pyarrow.parquet.FileMetaData is not hashable, so we need to
        # serialize it to compare equality and store in dict.
        fragment_md = _SerializedFragmentMetadata(fragment_md)

It looks like FileMetaData also contains attributes like num_rows, which differ across fragments.
Maybe we should only dedupe the few attributes that are likely to be identical (e.g., the schema).
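A rough sketch of that schema-only dedup idea (illustrative names, not the merged code): per-fragment attributes such as num_rows stay on each fragment, while the serialized schema, which is usually identical across fragments, is stored once and referenced by index.

    # Illustrative sketch: share only the schema; keep per-fragment fields as-is.
    unique_schemas = []           # distinct serialized schemas, in insertion order
    schema_blob_to_idx = {}       # serialized schema bytes -> index in unique_schemas

    class SlimFragmentMetadata:
        """Hypothetical container: per-fragment fields plus a shared-schema index."""

        def __init__(self, num_rows, schema_idx):
            self.num_rows = num_rows      # differs per fragment, so not deduped
            self.schema_idx = schema_idx  # index into the shared unique_schemas list

    def slim_down(num_rows, schema_blob):
        idx = schema_blob_to_idx.get(schema_blob)
        if idx is None:
            idx = len(unique_schemas)
            schema_blob_to_idx[schema_blob] = idx
            unique_schemas.append(schema_blob)
        return SlimFragmentMetadata(num_rows, idx)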

@scottjlee (Contributor, Author) commented:
For the following Parquet/metadata-related release tests, I checked the release test results and found no regression from the nightly results: https://buildkite.com/ray-project/release/builds/13878#_

  • torch_batch_inference_1_gpu_10gb_parquet
  • parquet_metadata_resolution
  • read_parquet_benchmark_single_node
  • read_parquet_train_4_gpu

@scottjlee scottjlee marked this pull request as ready for review April 19, 2024 20:21
@scottjlee scottjlee changed the title from "[Data] Dedupe repeated metadata during ParquetDatasource metadata prefetching" to "[Data] Dedupe repeated schema during ParquetDatasource metadata prefetching" Apr 19, 2024
@raulchen raulchen merged commit a0b0c9d into ray-project:master Apr 19, 2024
5 checks passed

Successfully merging this pull request may close these issues.

[Data] High memory usage for large Datasets created from Parquet files