
Add chunk processing support for PolarsCursor#637

Merged
laughingman7743 merged 9 commits into master from feature/polars-chunk-processing
Jan 4, 2026

Conversation


@laughingman7743 laughingman7743 commented Jan 3, 2026

Summary

  • Add chunksize parameter to PolarsCursor and AsyncPolarsCursor
  • Implement lazy chunk loading for memory-efficient data processing
  • When chunksize is set, all data access methods (fetchone(), fetchmany(), fetchall(), iter_chunks()) load data lazily in chunks
  • Use Polars' native lazy evaluation APIs (scan_csv(), scan_parquet() with collect_batches())
  • Support both CSV and Parquet (UNLOAD) result formats
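The lazy loading described above can be sketched with a plain-Python generator. This is a simplified stand-in for what `scan_csv()` with `collect_batches()` does in Polars, not the actual PR code: the `read_csv_chunks` helper and its chunk size are illustrative, and plain row lists stand in for DataFrames.

```python
import csv
import io
from typing import Iterator, List


def read_csv_chunks(fileobj, chunksize: int) -> Iterator[List[List[str]]]:
    """Yield lists of rows, at most ``chunksize`` rows each.

    Stand-in for lazy scanning: rows are pulled from the reader
    on demand, so only one chunk is materialized at a time.
    """
    reader = csv.reader(fileobj)
    next(reader)  # skip the header row
    chunk: List[List[str]] = []
    for row in reader:
        chunk.append(row)
        if len(chunk) >= chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly short, chunk


data = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n")
chunks = list(read_csv_chunks(data, chunksize=2))
print([len(c) for c in chunks])  # chunk sizes: [2, 2, 1]
```

The key property, which the real implementation gets from Polars' lazy evaluation, is that nothing downstream of the scan forces the whole result set into memory at once.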

Motivation

This follows the same pattern as PandasCursor's chunksize option, allowing users to process large datasets without loading the entire result set into memory.

Usage

Standard fetch methods with chunked loading

from pyathena import connect
from pyathena.polars.cursor import PolarsCursor

# With chunksize, data is loaded lazily in chunks
cursor = connect(s3_staging_dir="s3://bucket/path/",
                 region_name="us-west-2",
                 cursor_class=PolarsCursor).cursor(chunksize=50_000)

cursor.execute("SELECT * FROM large_table")

# fetchone/fetchmany load data chunk by chunk as needed
for row in cursor:
    process_row(row)

Explicit chunk iteration

cursor = connect(...).cursor(PolarsCursor, chunksize=50_000)
cursor.execute("SELECT * FROM large_table")

for chunk in cursor.iter_chunks():
    # Each chunk is a polars.DataFrame
    process_chunk(chunk)

Implementation Details

  • DataFrameIterator: Wrapper class that provides unified iteration over chunked or non-chunked DataFrames
  • When chunksize is None (default): Eager loading (existing behavior)
  • When chunksize is set: Lazy evaluation with bounded memory usage
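A minimal sketch of what such a wrapper can look like (a hypothetical simplification, not the actual pyathena code; plain lists stand in for `polars.DataFrame` chunks):

```python
from typing import Any, Iterator, List, Optional


class DataFrameIterator:
    """Unified iteration over chunked or non-chunked results.

    Wraps an iterator of chunks; rows are handed out one at a
    time regardless of how the chunks were produced.
    """

    def __init__(self, chunks: Iterator[List[Any]]):
        self._chunks = chunks
        self._rows: Iterator[Any] = iter(())

    def __iter__(self) -> "DataFrameIterator":
        return self

    def __next__(self) -> Any:
        while True:
            try:
                return next(self._rows)
            except StopIteration:
                # Current chunk exhausted: pull the next one lazily.
                # When no chunks remain, next() raises StopIteration,
                # which ends iteration as usual.
                self._rows = iter(next(self._chunks))


def fetchone(it: DataFrameIterator) -> Optional[Any]:
    """fetchone() built on the iterator: returns None once exhausted."""
    try:
        return next(it)
    except StopIteration:
        return None


it = DataFrameIterator(iter([[1, 2], [3]]))
print([fetchone(it) for _ in range(4)])  # [1, 2, 3, None]
```

The non-chunked case falls out for free: wrapping a single-element iterator (`iter([whole_df])`) gives the same interface over an eagerly loaded result.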

Changes

  • pyathena/polars/result_set.py: Add DataFrameIterator class, chunksize parameter, lazy chunk loading
  • pyathena/polars/cursor.py: Add chunksize parameter and iter_chunks() method
  • pyathena/polars/async_cursor.py: Add chunksize parameter
  • docs/polars.rst: Add documentation for chunksize options with examples
  • tests/pyathena/polars/test_cursor.py: Add comprehensive tests for chunk processing

Test Plan

  • test_iter_chunks - Basic chunk iteration
  • test_iter_chunks_without_chunksize - Yields entire DataFrame as single chunk
  • test_iter_chunks_many_rows - Large dataset chunk iteration
  • test_iter_chunks_unload - Chunk iteration with Parquet/UNLOAD
  • test_iter_chunks_data_consistency - Chunked vs regular reading produce same data
  • test_iter_chunks_chunk_sizes - Chunk size validation
  • test_fetchone_with_chunksize - fetchone with lazy loading
  • test_fetchmany_with_chunksize - fetchmany with lazy loading
  • test_fetchall_with_chunksize - fetchall with lazy loading
  • test_iterator_with_chunksize - Iterator protocol with lazy loading

🤖 Generated with Claude Code

laughingman7743 and others added 9 commits January 3, 2026 18:23
Implement memory-efficient chunked iteration for PolarsCursor and
AsyncPolarsCursor using Polars' native lazy evaluation APIs.

Features:
- Add chunksize parameter to PolarsCursor and AsyncPolarsCursor
- Add iter_chunks() method for memory-efficient chunk iteration
- Use pl.scan_csv() and pl.scan_parquet() with collect_batches()
  for lazy evaluation
- Support both CSV and Parquet (UNLOAD) result formats

This follows the same pattern as PandasCursor's chunksize option,
allowing users to process large datasets without loading the entire
result set into memory.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Following the PandasCursor test patterns, add:
- test_iter_chunks_data_consistency: Verify chunked and regular
  reading produce the same data
- test_iter_chunks_chunk_sizes: Verify each chunk respects the
  specified chunksize limit

Consolidate CSV parameter extraction logic by reusing the
_get_csv_params() method that was added for iter_chunks.
This reduces code duplication between _read_csv and _iter_csv_chunks.

- Extract _is_csv_readable() helper for CSV validation
- Extract _prepare_parquet_location() helper for Parquet setup
- Refactor _read_csv to use _get_csv_params() helper
- Skip eager loading in __init__ when chunksize is set (avoid double reads)
- Allow iter_chunks() without chunksize (yields entire DataFrame as single chunk)
- Update docstrings to reflect new behavior

This provides a consistent interface matching PandasCursor behavior where
iter_chunks() works with or without chunksize configuration.

Introduce DataFrameIterator class following PandasCursor's pattern to
provide a unified interface for both chunked and non-chunked DataFrame
iteration. This eliminates the need for flag-based lazy loading and
provides a more transparent API.

Key changes:
- Add DataFrameIterator class with iterrows() and as_polars() methods
- Replace _df and _row_index with _df_iter iterator
- Update fetchone() to use iterator-based row access
- Update as_polars() and as_arrow() to use the wrapper
- Update iter_chunks() to return the iterator directly
- Remove _ensure_data_loaded() method

- Remove unused _current_df and _row_index instance variables
- Simplify __next__ method to return directly without intermediate variables
- Consolidate column name extraction in _get_csv_params to use _get_column_names()
- Update class docstring to document chunked iteration feature

- Update documentation to explain that standard fetch methods (fetchone,
  fetchmany) also benefit from chunked loading when chunksize is set
- Update docstrings in PolarsCursor, AsyncPolarsCursor, and ResultSet
  to reflect this behavior
- Add examples showing both iteration patterns in docs/polars.rst

- Add tests for fetchone, fetchmany, fetchall, and iterator with chunksize
- Add tests for fetch methods with UNLOAD mode and chunksize
- Remove redundant iter_chunks tests from AsyncPolarsCursor since both
  cursor types share the same AthenaPolarsResultSet implementation

Change `import abc` to `from collections import abc` to fix an AttributeError where `abc.Iterator` was not found (the standard library `abc` module doesn't define `Iterator`; it lives in `collections.abc`).

Also add cast to fix mypy no-any-return errors.

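The distinction this commit describes is easy to verify: `Iterator` is defined in `collections.abc`, not in the top-level `abc` module (which only provides `ABC`, `ABCMeta`, and related helpers):

```python
import abc
import collections.abc

# The top-level abc module has no Iterator attribute...
print(hasattr(abc, "Iterator"))  # False
# ...while collections.abc does, and it supports isinstance checks
# via its __subclasshook__ (any object with __iter__ and __next__).
print(isinstance(iter([]), collections.abc.Iterator))  # True
```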
@laughingman7743 laughingman7743 marked this pull request as ready for review January 4, 2026 06:48
@laughingman7743 laughingman7743 merged commit 1651786 into master Jan 4, 2026
5 checks passed
@laughingman7743 laughingman7743 deleted the feature/polars-chunk-processing branch January 4, 2026 06:48