
fix: notebook large outputs #69

Merged
imathews merged 3 commits into main from fix-notebook_large_outputs on Jun 17, 2026

Conversation

@imathews

Member

Properly construct Parquet files to avoid memory issues on ingest caused by bad footers.

Copilot AI left a comment

Contributor


Pull request overview

This PR aims to reduce memory issues when ingesting notebook outputs by writing Parquet with controlled row-group sizing and disabled statistics, helping avoid problematic Parquet metadata/footers during downstream reads/ingest.

Changes:

  • Add get_parquet_rows_per_group() to estimate an appropriate row_group_size (~128MB target).
  • Apply row_group_size and disable Parquet statistics across multiple dataframe/arrow writers in convert_data_to_parquet().
  • Bump package version to 0.20.10.
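The row-group sizing idea in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's actual `get_parquet_rows_per_group()`: the function name `estimate_rows_per_group`, its parameters, and the `max_rows` cap are assumptions made for the example. Only the ~128MB target comes from the summary above.

```python
# Hypothetical sketch of a row-group sizing heuristic; names and the
# max_rows cap are illustrative assumptions, not the PR's implementation.
TARGET_ROW_GROUP_BYTES = 128 * 1024 * 1024  # ~128MB target from the PR summary


def estimate_rows_per_group(
    total_bytes: int,
    num_rows: int,
    target_bytes: int = TARGET_ROW_GROUP_BYTES,
    max_rows: int = 1_000_000,  # assumed upper bound, not from the PR
) -> int:
    """Estimate how many rows fit in one row group of roughly target_bytes."""
    if num_rows <= 0 or total_bytes <= 0:
        # Nothing to measure: fall back to the cap.
        return max_rows
    # Integer arithmetic: target_bytes / (total_bytes / num_rows), floored.
    rows = (target_bytes * num_rows) // total_bytes
    # Always write at least one row per group, never more than the cap.
    return max(1, min(rows, max_rows))
```

The result would then be passed as `row_group_size` to the writer, for example `pyarrow.parquet.write_table(table, path, row_group_size=n, write_statistics=False)`, which is the combination of settings the second bullet describes.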

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/redivis/common/util.py | Adds row-group sizing heuristic and updates Parquet writers to reduce ingest-time memory/metadata issues. |
| src/redivis/_version.py | Version bump to reflect the fix release. |

Comments suppressed due to low confidence (1)

src/redivis/common/util.py:152

  • The pa_dataset.Dataset branch writes a directory via write_dataset() but then returns only part-0.parquet. If the input dataset has multiple fragments/files, this can result in multiple output parquet files and silently upload only the first one (data loss). Consider writing a single Parquet file by streaming record batches from a scanner into a ParquetWriter.
    elif isinstance(data, pa_dataset.Dataset):
        pa_dataset.write_dataset(
            data,
            temp_file_path,
            format="parquet",


@imathews imathews merged commit 1170e4c into main Jun 17, 2026
1 check passed

2 participants