[Data] Dedupe repeated schema during ParquetDatasource metadata prefetching #44750

Conversation
Signed-off-by: Scott Lee <sjl@anyscale.com>
```python
    # serialize it to compare equality and store in dict.
    fragment_md = _SerializedFragmentMetadata(fragment_md)
    md_exists = False
    for unique_md, md_indices in self._unique_metadata.items():
```
Since we've already serialized the FileMetaData objects to binaries, we can simply put the binaries in a set to dedupe them; there's no need for this linear scan.
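A minimal sketch of that suggestion (the function name and the use of pickle for serialization are illustrative assumptions, not the PR's actual code):

```python
import pickle

def find_duplicate_fragments(metadata_list):
    """Dedupe via a set of serialized bytes: an O(1) membership test
    per fragment instead of a linear scan over previously seen entries."""
    seen = set()
    duplicate_indices = []
    for fragment_idx, fragment_md in enumerate(metadata_list):
        # FileMetaData is not hashable, but its serialized bytes are,
        # so the bytes can go straight into a set.
        key = pickle.dumps(fragment_md)
        if key in seen:
            duplicate_indices.append(fragment_idx)
        else:
            seen.add(key)
    return duplicate_indices
```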
```python
for fragment_idx, fragment_md in enumerate(metadata_list):
    # pyarrow.parquet.FileMetaData is not hashable, so we need to
    # serialize it to compare equality and store in dict.
    fragment_md = _SerializedFragmentMetadata(fragment_md)
```
It looks like FileMetaData also contains attributes like `num_rows`, which differ across fragments. Maybe we should only dedupe the few attributes that are likely to be identical (e.g. `schema`).
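A rough sketch of schema-only deduping (illustrative names, not the PR's code; it assumes the fragments' Arrow schemas are picklable, which holds for pyarrow Schema objects):

```python
import pickle

def dedupe_schemas(metadata_list):
    """Share identical schemas across fragments while keeping
    per-fragment stats such as num_rows, which legitimately differ."""
    schema_to_id = {}    # serialized schema -> index into unique_schemas
    unique_schemas = []  # one entry per distinct schema
    per_fragment = []    # (num_rows, schema_id) for each fragment
    for fragment_md in metadata_list:
        key = pickle.dumps(fragment_md.schema.to_arrow_schema())
        schema_id = schema_to_id.setdefault(key, len(unique_schemas))
        if schema_id == len(unique_schemas):
            # First time this schema is seen; store it once.
            unique_schemas.append(fragment_md.schema)
        per_fragment.append((fragment_md.num_rows, schema_id))
    return unique_schemas, per_fragment
```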
For the following Parquet/metadata-related release tests, I checked the release test results; there is no regression from the nightly results: https://buildkite.com/ray-project/release/builds/13878#_
Why are these changes needed?
For large datasets consisting of many files and many columns, the metadata for individual Parquet file fragments can be significantly large, using up memory. Often, this metadata is identical across multiple Parquet file fragments, so we can save memory by maintaining a set of unique metadata objects and referencing that common set in ParquetDatasource. A sketch of the idea follows.
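A minimal sketch of the scheme (class and attribute names are hypothetical, not the actual ParquetDatasource implementation; pickle stands in for whatever serialization is used):

```python
import pickle

class DedupedMetadataStore:
    """Toy version of the idea above: N fragments with identical
    metadata share one stored object instead of N copies."""

    def __init__(self):
        self._unique = []         # distinct metadata objects
        self._key_to_pos = {}     # serialized bytes -> index into _unique
        self._fragment_refs = []  # per-fragment index into _unique

    def add(self, fragment_md):
        key = pickle.dumps(fragment_md)
        pos = self._key_to_pos.setdefault(key, len(self._unique))
        if pos == len(self._unique):
            # New metadata: store it once.
            self._unique.append(fragment_md)
        self._fragment_refs.append(pos)

    def get(self, fragment_idx):
        return self._unique[self._fragment_refs[fragment_idx]]
```

With many fragments sharing one schema, memory then grows with the number of distinct metadata objects rather than the number of files.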
Related issue number
Closes #44754
Checks

- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.