fix: notebook large outputs #69
Merged
Pull request overview
This PR aims to reduce memory issues when ingesting notebook outputs by writing Parquet with controlled row-group sizing and disabled statistics, helping avoid problematic Parquet metadata/footers during downstream reads/ingest.
Changes:
- Add `get_parquet_rows_per_group()` to estimate an appropriate `row_group_size` (~128MB target).
- Apply `row_group_size` and disable Parquet statistics across multiple dataframe/arrow writers in `convert_data_to_parquet()`.
- Bump package version to `0.20.10`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/redivis/common/util.py` | Adds row-group sizing heuristic and updates Parquet writers to reduce ingest-time memory/metadata issues. |
| `src/redivis/_version.py` | Version bump to reflect the fix release. |
Comments suppressed due to low confidence (1)
src/redivis/common/util.py:152
- The `pa_dataset.Dataset` branch writes a directory via `write_dataset()` but then returns only `part-0.parquet`. If the input dataset has multiple fragments/files, this can result in multiple output parquet files and silently upload only the first one (data loss). Consider writing a single Parquet file by streaming record batches from a scanner into a `ParquetWriter`.
```python
elif isinstance(data, pa_dataset.Dataset):
    pa_dataset.write_dataset(
        data,
        temp_file_path,
        format="parquet",
```
Properly construct parquet files to avoid memory issues on ingest due to bad footers