Add chunk processing support for PolarsCursor #637
Merged
laughingman7743 merged 9 commits into master on Jan 4, 2026
Conversation
Implement memory-efficient chunked iteration for PolarsCursor and AsyncPolarsCursor using Polars' native lazy evaluation APIs.

Features:
- Add chunksize parameter to PolarsCursor and AsyncPolarsCursor
- Add iter_chunks() method for memory-efficient chunk iteration
- Use pl.scan_csv() and pl.scan_parquet() with collect_batches() for lazy evaluation
- Support both CSV and Parquet (UNLOAD) result formats

This follows the same pattern as PandasCursor's chunksize option, allowing users to process large datasets without loading the entire result set into memory.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Following the PandasCursor test patterns, add:
- test_iter_chunks_data_consistency: Verify chunked and regular reading produce the same data
- test_iter_chunks_chunk_sizes: Verify each chunk respects the specified chunksize limit
Consolidate CSV parameter extraction logic by reusing the _get_csv_params() method that was added for iter_chunks. This reduces code duplication between _read_csv and _iter_csv_chunks.
- Extract _is_csv_readable() helper for CSV validation
- Extract _prepare_parquet_location() helper for Parquet setup
- Refactor _read_csv to use _get_csv_params() helper
- Skip eager loading in __init__ when chunksize is set (avoid double reads)
- Allow iter_chunks() without chunksize (yields entire DataFrame as single chunk)
- Update docstrings to reflect new behavior

This provides a consistent interface matching PandasCursor behavior where iter_chunks() works with or without chunksize configuration.
Introduce DataFrameIterator class following PandasCursor's pattern to provide a unified interface for both chunked and non-chunked DataFrame iteration. This eliminates the need for flag-based lazy loading and provides a more transparent API.

Key changes:
- Add DataFrameIterator class with iterrows() and as_polars() methods
- Replace _df and _row_index with _df_iter iterator
- Update fetchone() to use iterator-based row access
- Update as_polars() and as_arrow() to use the wrapper
- Update iter_chunks() to return the iterator directly
- Remove _ensure_data_loaded() method
- Remove unused _current_df and _row_index instance variables
- Simplify __next__ method to return directly without intermediate variables
- Consolidate column name extraction in _get_csv_params to use _get_column_names()
- Update class docstring to document chunked iteration feature
- Update documentation to explain that standard fetch methods (fetchone, fetchmany) also benefit from chunked loading when chunksize is set
- Update docstrings in PolarsCursor, AsyncPolarsCursor, and ResultSet to reflect this behavior
- Add examples showing both iteration patterns in docs/polars.rst
- Add tests for fetchone, fetchmany, fetchall, and iterator with chunksize
- Add tests for fetch methods with UNLOAD mode and chunksize
- Remove redundant iter_chunks tests from AsyncPolarsCursor since both cursor types share the same AthenaPolarsResultSet implementation
Change `import abc` to `from collections import abc` to fix an AttributeError where `abc.Iterator` was not found (the standard library abc module doesn't have Iterator; it lives in collections.abc). Also add a cast to fix mypy no-any-return errors.
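The distinction this fix relies on can be checked directly with the standard library alone (aliased here so both modules are visible side by side):

```python
import abc  # stdlib module for abstract base class machinery (ABC, abstractmethod)
from collections import abc as collections_abc  # home of Iterator, Iterable, etc.

# The top-level `abc` module has no `Iterator` attribute...
print(hasattr(abc, "Iterator"))  # False

# ...but `collections.abc.Iterator` exists and supports subclass/instance
# checks via its structural __subclasshook__ (any type with __iter__ and
# __next__ counts).
print(issubclass(type(iter([])), collections_abc.Iterator))  # True
```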
Summary
- Add chunksize parameter to PolarsCursor and AsyncPolarsCursor
- When chunksize is set, all data access methods (fetchone(), fetchmany(), fetchall(), iter_chunks()) load data lazily in chunks
- Use Polars' native lazy evaluation APIs (scan_csv(), scan_parquet() with collect_batches())

Motivation
This follows the same pattern as PandasCursor's chunksize option, allowing users to process large datasets without loading the entire result set into memory.

Usage
Standard fetch methods with chunked loading
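The code block that originally accompanied this caption did not survive extraction; the sketch below reconstructs the idea in hedged form. The S3 staging directory, region, and query are placeholders, and passing `chunksize` through `connection.cursor()` is an assumption based on this PR's description. The pyathena import is deferred into the function so the snippet can be defined without pyathena installed.

```python
def fetch_with_chunked_loading():
    """Sketch: standard fetch methods with chunked loading.

    Placeholder values: s3_staging_dir, region_name, and the query.
    """
    from pyathena import connect
    from pyathena.polars.cursor import PolarsCursor

    cursor = connect(
        s3_staging_dir="s3://your-bucket/path/",  # placeholder
        region_name="us-west-2",                  # placeholder
        cursor_class=PolarsCursor,
    ).cursor(chunksize=10_000)  # assumed: cursor kwargs per this PR

    cursor.execute("SELECT * FROM many_rows")
    # fetchone()/fetchmany()/fetchall() now draw rows from lazily
    # collected chunks instead of one eagerly loaded DataFrame.
    while (row := cursor.fetchone()) is not None:
        process(row)  # placeholder for user code


def process(row):
    pass
```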
Explicit chunk iteration
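As with the previous caption, the original example is missing; this is a hedged reconstruction. Connection parameters are placeholders, `iter_chunks()` is the method named by this PR, and the import is again deferred so the function can be defined without pyathena installed.

```python
def iterate_chunks():
    """Sketch: explicit chunk iteration via iter_chunks().

    Placeholder connection parameters; chunksize handling assumed
    from this PR's description.
    """
    from pyathena import connect
    from pyathena.polars.cursor import PolarsCursor

    cursor = connect(
        s3_staging_dir="s3://your-bucket/path/",  # placeholder
        region_name="us-west-2",                  # placeholder
        cursor_class=PolarsCursor,
    ).cursor(chunksize=10_000)

    cursor.execute("SELECT * FROM many_rows")
    # Each chunk is a polars DataFrame of at most `chunksize` rows,
    # so peak memory stays bounded regardless of result size.
    for chunk in cursor.iter_chunks():
        print(chunk.height)
```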
Implementation Details
- DataFrameIterator: Wrapper class that provides unified iteration over chunked or non-chunked DataFrames
- When chunksize is None (default): Eager loading (existing behavior)
- When chunksize is set: Lazy evaluation with bounded memory usage

Changes
- pyathena/polars/result_set.py: Add DataFrameIterator class, chunksize parameter, lazy chunk loading
- pyathena/polars/cursor.py: Add chunksize parameter and iter_chunks() method
- pyathena/polars/async_cursor.py: Add chunksize parameter
- docs/polars.rst: Add documentation for chunksize options with examples
- tests/pyathena/polars/test_cursor.py: Add comprehensive tests for chunk processing

Test Plan
- test_iter_chunks - Basic chunk iteration
- test_iter_chunks_without_chunksize - Yields entire DataFrame as single chunk
- test_iter_chunks_many_rows - Large dataset chunk iteration
- test_iter_chunks_unload - Chunk iteration with Parquet/UNLOAD
- test_iter_chunks_data_consistency - Chunked vs regular reading produce same data
- test_iter_chunks_chunk_sizes - Chunk size validation
- test_fetchone_with_chunksize - fetchone with lazy loading
- test_fetchmany_with_chunksize - fetchmany with lazy loading
- test_fetchall_with_chunksize - fetchall with lazy loading
- test_iterator_with_chunksize - Iterator protocol with lazy loading