
Add chunk processing support for PolarsCursor#637

Merged
laughingman7743 merged 9 commits into master from feature/polars-chunk-processing
Jan 4, 2026

Conversation


@laughingman7743 laughingman7743 commented Jan 3, 2026

Summary

  • Add chunksize parameter to PolarsCursor and AsyncPolarsCursor
  • Implement lazy chunk loading for memory-efficient data processing
  • When chunksize is set, all data access methods (fetchone(), fetchmany(), fetchall(), iter_chunks()) load data lazily in chunks
  • Use Polars' native lazy evaluation APIs (scan_csv(), scan_parquet() with collect_batches())
  • Support both CSV and Parquet (UNLOAD) result formats
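The lazy loading described above can be sketched with a plain-Python generator. This is a simplified stand-in for what `scan_csv()` with `collect_batches()` does in Polars, not the actual PR code: the `read_csv_chunks` helper and its chunk size are illustrative, and plain row lists stand in for DataFrames.

```python
import csv
import io
from typing import Iterator, List


def read_csv_chunks(fileobj, chunksize: int) -> Iterator[List[List[str]]]:
    """Yield lists of rows, at most ``chunksize`` rows each.

    Stand-in for lazy scanning: rows are pulled from the reader
    on demand, so only one chunk is materialized at a time.
    """
    reader = csv.reader(fileobj)
    next(reader)  # skip the header row
    chunk: List[List[str]] = []
    for row in reader:
        chunk.append(row)
        if len(chunk) >= chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly short, chunk


data = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n")
chunks = list(read_csv_chunks(data, chunksize=2))
print([len(c) for c in chunks])  # chunk sizes: [2, 2, 1]
```

The key property, which the real implementation gets from Polars' lazy evaluation, is that nothing downstream of the scan forces the whole result set into memory at once.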

Motivation

This follows the same pattern as PandasCursor's chunksize option, allowing users to process large datasets without loading the entire result set into memory.

Usage

Standard fetch methods with chunked loading

from pyathena import connect
from pyathena.polars.cursor import PolarsCursor

# With chunksize, data is loaded lazily in chunks
cursor = connect(s3_staging_dir="s3://bucket/path/",
                 region_name="us-west-2",
                 cursor_class=PolarsCursor).cursor(chunksize=50_000)

cursor.execute("SELECT * FROM large_table")

# fetchone/fetchmany load data chunk by chunk as needed
for row in cursor:
    process_row(row)

Explicit chunk iteration

cursor = connect(...).cursor(PolarsCursor, chunksize=50_000)
cursor.execute("SELECT * FROM large_table")

for chunk in cursor.iter_chunks():
    # Each chunk is a polars.DataFrame
    process_chunk(chunk)

Implementation Details

  • DataFrameIterator: Wrapper class that provides unified iteration over chunked or non-chunked DataFrames
  • When chunksize is None (default): Eager loading (existing behavior)
  • When chunksize is set: Lazy evaluation with bounded memory usage
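A minimal sketch of what such a wrapper can look like (a hypothetical simplification, not the actual pyathena code; plain lists stand in for `polars.DataFrame` chunks):

```python
from typing import Any, Iterator, List, Optional


class DataFrameIterator:
    """Unified iteration over chunked or non-chunked results.

    Wraps an iterator of chunks; rows are handed out one at a
    time regardless of how the chunks were produced.
    """

    def __init__(self, chunks: Iterator[List[Any]]):
        self._chunks = chunks
        self._rows: Iterator[Any] = iter(())

    def __iter__(self) -> "DataFrameIterator":
        return self

    def __next__(self) -> Any:
        while True:
            try:
                return next(self._rows)
            except StopIteration:
                # Current chunk exhausted: pull the next one lazily.
                # When no chunks remain, next() raises StopIteration,
                # which ends iteration as usual.
                self._rows = iter(next(self._chunks))


def fetchone(it: DataFrameIterator) -> Optional[Any]:
    """fetchone() built on the iterator: returns None once exhausted."""
    try:
        return next(it)
    except StopIteration:
        return None


it = DataFrameIterator(iter([[1, 2], [3]]))
print([fetchone(it) for _ in range(4)])  # [1, 2, 3, None]
```

The non-chunked case falls out for free: wrapping a single-element iterator (`iter([whole_df])`) gives the same interface over an eagerly loaded result.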

Changes

  • pyathena/polars/result_set.py: Add DataFrameIterator class, chunksize parameter, lazy chunk loading
  • pyathena/polars/cursor.py: Add chunksize parameter and iter_chunks() method
  • pyathena/polars/async_cursor.py: Add chunksize parameter
  • docs/polars.rst: Add documentation for chunksize options with examples
  • tests/pyathena/polars/test_cursor.py: Add comprehensive tests for chunk processing

Test Plan

  • test_iter_chunks - Basic chunk iteration
  • test_iter_chunks_without_chunksize - Yields entire DataFrame as single chunk
  • test_iter_chunks_many_rows - Large dataset chunk iteration
  • test_iter_chunks_unload - Chunk iteration with Parquet/UNLOAD
  • test_iter_chunks_data_consistency - Chunked vs regular reading produce same data
  • test_iter_chunks_chunk_sizes - Chunk size validation
  • test_fetchone_with_chunksize - fetchone with lazy loading
  • test_fetchmany_with_chunksize - fetchmany with lazy loading
  • test_fetchall_with_chunksize - fetchall with lazy loading
  • test_iterator_with_chunksize - Iterator protocol with lazy loading

🤖 Generated with Claude Code

laughingman7743 and others added 9 commits January 3, 2026 18:23
Implement memory-efficient chunked iteration for PolarsCursor and
AsyncPolarsCursor using Polars' native lazy evaluation APIs.

Features:
- Add chunksize parameter to PolarsCursor and AsyncPolarsCursor
- Add iter_chunks() method for memory-efficient chunk iteration
- Use pl.scan_csv() and pl.scan_parquet() with collect_batches()
  for lazy evaluation
- Support both CSV and Parquet (UNLOAD) result formats

This follows the same pattern as PandasCursor's chunksize option,
allowing users to process large datasets without loading the entire
result set into memory.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Following the PandasCursor test patterns, add:
- test_iter_chunks_data_consistency: Verify chunked and regular
  reading produce the same data
- test_iter_chunks_chunk_sizes: Verify each chunk respects the
  specified chunksize limit

Consolidate CSV parameter extraction logic by reusing the
_get_csv_params() method that was added for iter_chunks.
This reduces code duplication between _read_csv and _iter_csv_chunks.

- Extract _is_csv_readable() helper for CSV validation
- Extract _prepare_parquet_location() helper for Parquet setup
- Refactor _read_csv to use _get_csv_params() helper
- Skip eager loading in __init__ when chunksize is set (avoid double reads)
- Allow iter_chunks() without chunksize (yields entire DataFrame as single chunk)
- Update docstrings to reflect new behavior

This provides a consistent interface matching PandasCursor behavior where
iter_chunks() works with or without chunksize configuration.

Introduce DataFrameIterator class following PandasCursor's pattern to
provide a unified interface for both chunked and non-chunked DataFrame
iteration. This eliminates the need for flag-based lazy loading and
provides a more transparent API.

Key changes:
- Add DataFrameIterator class with iterrows() and as_polars() methods
- Replace _df and _row_index with _df_iter iterator
- Update fetchone() to use iterator-based row access
- Update as_polars() and as_arrow() to use the wrapper
- Update iter_chunks() to return the iterator directly
- Remove _ensure_data_loaded() method

- Remove unused _current_df and _row_index instance variables
- Simplify __next__ method to return directly without intermediate variables
- Consolidate column name extraction in _get_csv_params to use _get_column_names()
- Update class docstring to document chunked iteration feature

- Update documentation to explain that standard fetch methods (fetchone,
  fetchmany) also benefit from chunked loading when chunksize is set
- Update docstrings in PolarsCursor, AsyncPolarsCursor, and ResultSet
  to reflect this behavior
- Add examples showing both iteration patterns in docs/polars.rst

- Add tests for fetchone, fetchmany, fetchall, and iterator with chunksize
- Add tests for fetch methods with UNLOAD mode and chunksize
- Remove redundant iter_chunks tests from AsyncPolarsCursor since both
  cursor types share the same AthenaPolarsResultSet implementation

Change `import abc` to `from collections import abc` to fix an AttributeError where `abc.Iterator` was not found (the standard library `abc` module doesn't define `Iterator`; it lives in `collections.abc`).

Also add cast to fix mypy no-any-return errors.

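The distinction this commit describes is easy to verify: `Iterator` is defined in `collections.abc`, not in the top-level `abc` module (which only provides `ABC`, `ABCMeta`, and related helpers):

```python
import abc
import collections.abc

# The top-level abc module has no Iterator attribute...
print(hasattr(abc, "Iterator"))  # False
# ...while collections.abc does, and it supports isinstance checks
# via its __subclasshook__ (any object with __iter__ and __next__).
print(isinstance(iter([]), collections.abc.Iterator))  # True
```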
@laughingman7743 laughingman7743 marked this pull request as ready for review January 4, 2026 06:48
@laughingman7743 laughingman7743 merged commit 1651786 into master Jan 4, 2026
5 checks passed
@laughingman7743 laughingman7743 deleted the feature/polars-chunk-processing branch January 4, 2026 06:48