
[Data] Dedupe repeated schema during ParquetDatasource metadata prefetching #44750

Merged

merged 4 commits into ray-project:master on Apr 19, 2024

Conversation

@scottjlee (Contributor) commented Apr 15, 2024

Why are these changes needed?

For large datasets consisting of many files and many columns, the metadata for individual Parquet file fragments can be quite large, using up memory resources. Oftentimes, the metadata is identical across multiple Parquet file fragments, so we can save memory by maintaining a set of unique MetaData objects and referencing that common set in ParquetDatasource.
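A minimal sketch of this idea (illustrative names only, not the actual ParquetDatasource code; the serializer is assumed to be supplied by the caller):

    # Keep one copy of each distinct metadata object; fragments reference it by index.
    unique_metadata = []          # (serialized_bytes, metadata) per distinct object
    fragment_to_unique_idx = []   # per-fragment index into unique_metadata

    def register_fragment_metadata(md, serialize):
        """Record a fragment's metadata, reusing an existing entry if identical."""
        # FileMetaData is not hashable, so equality is checked via the serialized form.
        blob = serialize(md)
        for idx, (existing_blob, _) in enumerate(unique_metadata):
            if existing_blob == blob:
                fragment_to_unique_idx.append(idx)
                return
        unique_metadata.append((blob, md))
        fragment_to_unique_idx.append(len(unique_metadata) - 1)

When many fragments share identical metadata, only one copy of each distinct object is kept in memory.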

Related issue number

Closes #44754

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

A reviewer (Contributor) commented on the following lines of the diff:

    # pyarrow.parquet.FileMetaData is not hashable, so we need to
    # serialize it to compare equality and store in dict.
    fragment_md = _SerializedFragmentMetadata(fragment_md)
    md_exists = False
    for unique_md, md_indices in self._unique_metadata.items():
Since we've already serialized the FileMetaData objects to binaries, we can simply put the binaries in a set to dedupe them; there is no need for this linear scan.
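A sketch of that suggestion (same illustrative names as above, not the merged code): keying a dict on the serialized bytes makes each fragment's lookup constant time instead of a scan over all unique entries.

    # Illustrative sketch of the reviewer's suggestion.
    blob_to_idx = {}              # serialized metadata bytes -> index in unique_metadata
    unique_metadata = []          # distinct metadata objects, in insertion order
    fragment_to_unique_idx = []   # per-fragment index into unique_metadata

    def register_fragment_metadata(md, serialize):
        blob = serialize(md)
        idx = blob_to_idx.get(blob)
        if idx is None:
            idx = len(unique_metadata)
            blob_to_idx[blob] = idx
            unique_metadata.append(md)
        fragment_to_unique_idx.append(idx)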

A second review comment from a Contributor, on these lines of the diff:

    for fragment_idx, fragment_md in enumerate(metadata_list):
        # pyarrow.parquet.FileMetaData is not hashable, so we need to
        # serialize it to compare equality and store in dict.
        fragment_md = _SerializedFragmentMetadata(fragment_md)

It looks like FileMetaData also contains attributes like num_rows, which differ across fragments.
Maybe we should only dedupe the few attributes that are likely to be identical (e.g., the schema).
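A rough sketch of that schema-only dedup idea (illustrative names, not the merged code): per-fragment attributes such as num_rows stay on each fragment, while the serialized schema, which is usually identical across fragments, is stored once and referenced by index.

    # Illustrative sketch: share only the schema; keep per-fragment fields as-is.
    unique_schemas = []           # distinct serialized schemas, in insertion order
    schema_blob_to_idx = {}       # serialized schema bytes -> index in unique_schemas

    class SlimFragmentMetadata:
        """Hypothetical container: per-fragment fields plus a shared-schema index."""

        def __init__(self, num_rows, schema_idx):
            self.num_rows = num_rows      # differs per fragment, so not deduped
            self.schema_idx = schema_idx  # index into the shared unique_schemas list

    def slim_down(num_rows, schema_blob):
        idx = schema_blob_to_idx.get(schema_blob)
        if idx is None:
            idx = len(unique_schemas)
            schema_blob_to_idx[schema_blob] = idx
            unique_schemas.append(schema_blob)
        return SlimFragmentMetadata(num_rows, idx)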

@scottjlee (Contributor, Author) commented:
For the following Parquet/metadata-related release tests, I checked the release test results and found no regression from the nightly results: https://buildkite.com/ray-project/release/builds/13878#_

  • torch_batch_inference_1_gpu_10gb_parquet
  • parquet_metadata_resolution
  • read_parquet_benchmark_single_node
  • read_parquet_train_4_gpu

@scottjlee scottjlee marked this pull request as ready for review April 19, 2024 20:21
@scottjlee scottjlee changed the title from "[Data] Dedupe repeated metadata during ParquetDatasource metadata prefetching" to "[Data] Dedupe repeated schema during ParquetDatasource metadata prefetching" Apr 19, 2024
@raulchen raulchen merged commit a0b0c9d into ray-project:master Apr 19, 2024
5 checks passed

Successfully merging this pull request may close these issues.

[Data] High memory usage for large Datasets created from Parquet files