Skip to content

feat: QuerySet::iterator(chunk_size) — chunked streaming (closes #23)#113

Merged
ujeenet merged 3 commits into
mainfrom
issue-23-iterator
May 15, 2026
Merged

feat: QuerySet::iterator(chunk_size) — chunked streaming (closes #23)#113
ujeenet merged 3 commits into
mainfrom
issue-23-iterator

Conversation

@ujeenet
Copy link
Copy Markdown
Owner

@ujeenet ujeenet commented May 15, 2026

Summary

Closes #23. Django parity for QuerySet.iterator(chunk_size=2000). Returns a ChunkedIter<T> that fetches chunk_size rows at a time via LIMIT N OFFSET M — never buffers the full result set. Right tool for million-row exports / ETL pipelines / batch processors that would OOM with fetch_pool returning Vec<T>.

```rust
let mut iter = Post::objects()
.where_(Post::published.eq(true))
.order_by(&[("id", false)])
.iterator(2_000)?;

// 1. Whole-chunk loop:
while let Some(chunk) = iter.next_chunk(&pool).await? {
for post in chunk { /* … */ }
}

// 2. Row-by-row loop (internal VecDeque buffer, O(1) per row):
let mut iter = Post::objects().order_by(&[("id", false)]).iterator(2_000)?;
while let Some(post) = iter.next_row(&pool).await? { /* … */ }
```

API

Method Returns Description
QuerySet::iterator(chunk_size) Result<ChunkedIter<T>> Compiles eagerly — schema errors surface here, not mid-stream
ChunkedIter::next_chunk(&pool) Result<Option<Vec<T>>> Whole-chunk; None = done
ChunkedIter::next_row(&pool) Result<Option<T>> Row-by-row; buffers one chunk internally
ChunkedIter::rows_seen() i64 Cumulative count (progress reporting)
ChunkedIter::is_exhausted() bool Post-drain flag

Mixing next_chunk and next_row on the same iterator is safe — the buffer drains in row order before any new DB fetch.

Design notes

  • LIMIT/OFFSET chunker, not a server-side cursor. Portable across PG / MySQL / SQLite with no transaction overhead. Trade-off documented in the cookbook: O(n²) total work on deep pagination (each chunk reseeks). For truly streaming PG reads, drop into transaction() + raw sqlx::query(...).fetch Stream directly.
  • VecDeque buffer for row-by-row path. O(1) pop_front per row; a Vec::remove(0) would be O(n).
  • Exhaustion is sticky. Once a chunk comes back smaller than chunk_size (or empty), the iterator marks itself exhausted and every subsequent call returns Ok(None) without re-querying.
  • No new dependencies. Considered exposing a Stream impl via futures-core but went with the simpler next_chunk / next_row async-method API — zero new deps, two ergonomic styles, no Pin/Unpin gymnastics in user code.

Tri-dialect

Works on every backend — LIMIT/OFFSET is the portable common denominator. Cookbook flags the deep-pagination trade-off + the PG escape hatch.

What landed

Tests

Suite Count Status
`tests/iterator_live.rs` (live PG, 1000-row fixture) 5 passing
`cargo test -p rustango --features tenancy,mysql,sqlite --tests` (full suite) all passing

Live tests cover:

  • next_chunk_yields_all_rows_in_order — 5 chunks of 200, row order preserved.
  • next_row_yields_every_row_with_internal_buffering — chunk_size that doesn't divide evenly (150 → 7 chunks: 6×150 + 100).
  • mixed_next_row_and_next_chunk_preserve_order — partial drain via next_row, then next_chunk yields buffered remainder before refetching.
  • iterator_respects_where_clause_and_short_final_chunk — narrowed filter, short final chunk (4 chunks: 30, 30, 30, 10).
  • empty_result_yields_none_first_call — no DB hit on subsequent calls after exhaustion.

Verification

```
cargo test -p rustango --features tenancy,mysql,sqlite --tests → all passing
DATABASE_URL=... cargo test --features tenancy --test iterator_live → 5 passed
cargo test --workspace --all-features --no-run → clean
cargo build --no-default-features --features sqlite,tenancy → clean
```

Test plan

  • 1000-row fixture iterated chunk-by-chunk in order.
  • 1000-row fixture iterated row-by-row in order.
  • Mixed next_chunk + next_row preserves order.
  • WHERE filter respected.
  • Short final chunk.
  • Empty result short-circuits.
  • rows_seen() tracks correctly across both paths.
  • is_exhausted() becomes true once drained.
  • Workspace --all-features --no-run gate passes.
  • SQLite-tenancy litmus build passes.
  • CI green on `postgres_test` + `mysql_live` + `sqlite_live` + `sqlite_litmus`.

ujeenet added a commit that referenced this pull request May 15, 2026
Self-review tweaks on PR #113:

1. `iterator(chunk_size)` now asserts `chunk_size > 0`. Silently
   producing an immediately-exhausted iterator on a 0 or negative
   value was a footgun for `iterator(unchecked_input as i64)` style
   callers. Three new no-DB validation tests pin the panic message +
   the positive-path success.

2. Doc-comment + cookbook now flag two concurrency hazards:
   - **Concurrent-write hazard**: each chunk is its own query, so
     INSERT/DELETE between chunks can skip or duplicate rows.
     Recommend a snapshot-isolation transaction (REPEATABLE READ) on
     write-concurrent tables.
   - **select_for_update doesn't propagate**: row locks held by a
     `.select_for_update()` queryset are released between chunks
     because each runs in its own implicit transaction. Redirects to
     `pool.begin()` + `.fetch_on(&mut *tx)` for full-drain locking.

3. Extracted shared fetch path into private `fetch_next_chunk(pool)`.
   `next_chunk` + `next_row` both call it instead of duplicating the
   clone-query / set-limit / set-offset / check-exhaustion block.
   Caller-side `seen` accounting stays per-method because the
   counting semantics differ (next_chunk counts the whole chunk
   immediately; next_row counts rows as they pop out of the buffer).

4. Live test `mixed_next_row_and_next_chunk_preserve_order` now
   asserts `rows_seen() == 50` mid-iteration after 50 next_row calls
   on a chunk_size=100 iterator — pins the buffer-aware accounting.

All gates green: 5 live + 3 validation tests pass, all-features
build clean, sqlite-tenancy litmus clean.
ujeenet added a commit that referenced this pull request May 15, 2026
Second polish pass on PR #113. The first polish (commit 8c58dca)
added concurrency caveats but referenced a `next_chunk_on(&mut *tx)`
method that doesn't exist — a reader copying the cookbook snippet
verbatim hits a compile error.

Three fixes:

1. Cookbook concurrent-write snippet replaced with an honest
   hand-rolled `select_rows_on` loop inside the snapshot-isolation
   transaction. Explicitly states "ChunkedIter takes `&Pool`, not a
   `&mut Transaction`, so the chunker API can't be used inside the
   tx directly."

2. Cookbook `select_for_update` redirect rewritten with the
   trade-off pair: `.fetch_on(&mut *tx)` for in-memory results,
   hand-rolled LIMIT/OFFSET for streamed ones. Both call out that
   you're outside the ChunkedIter API.

3. Same correction in the `.iterator()` rustdoc — points at
   [`select_rows_on`] as the actual primitive for tx-scoped chunked
   reads. Notes the future `iterator_on` companion as out-of-scope
   for issue #23.

No behavior change. All gates green: 5 live + 3 validation tests
pass, build clean across tri-feature + sqlite-tenancy litmus.
ujeenet added 3 commits May 15, 2026 19:45
Django parity for `QuerySet.iterator(chunk_size=2000)`. Returns a
`ChunkedIter<T>` that fetches `chunk_size` rows at a time via
`LIMIT N OFFSET M`, never buffering the full result set. Right tool
for million-row exports / ETL / batch processors that would OOM with
`fetch_pool` returning `Vec<T>`.

```rust
let mut iter = Post::objects()
    .order_by(&[("id", false)])
    .iterator(2_000)?;

// Whole-chunk loop:
while let Some(chunk) = iter.next_chunk(&pool).await? {
    for post in chunk { /* … */ }
}

// Row-by-row loop (buffer one chunk internally via VecDeque):
while let Some(post) = iter.next_row(&pool).await? { /* … */ }
```

- `ChunkedIter<T>` struct in `sql/executor.rs` — holds the compiled
  `SelectQuery`, rotating `OFFSET`, exhaustion flag, and a
  `VecDeque<T>` buffer for the row-by-row path.
- `QuerySet::iterator(chunk_size)` entry point. Compiles eagerly so
  schema-validation errors surface here rather than mid-stream.
- `next_chunk(&pool)` — `Option<Vec<T>>`. Returns `None` when
  exhausted. Drains internal buffer first if mixed with `next_row`.
- `next_row(&pool)` — `Option<T>`. O(1) per row via `VecDeque::pop_front`.
- `rows_seen()` / `is_exhausted()` — progress + termination state.

Works on every backend — LIMIT/OFFSET is the portable common
denominator. Documented trade-off: O(n²) total work on deep
pagination (each chunk reseeks); for truly streaming PG reads,
callers can drop into `transaction()` + raw `sqlx::query(...).fetch`
Stream directly.

5 live PG tests against a 1000-row fixture:
- `next_chunk_yields_all_rows_in_order` — 5 chunks of 200, row order
  preserved across chunks.
- `next_row_yields_every_row_with_internal_buffering` — chunk_size
  that doesn't divide evenly (150 → 7 chunks: 6×150 + 100).
- `mixed_next_row_and_next_chunk_preserve_order` — partial drain via
  next_row, then `next_chunk` yields buffered remainder before
  refetching.
- `iterator_respects_where_clause_and_short_final_chunk` — narrowed
  filter, short final chunk (4 chunks: 30, 30, 30, 10).
- `empty_result_yields_none_first_call` — no DB hit on subsequent
  calls after exhaustion.

Cookbook section added under "Row-level locks".
Self-review tweaks on PR #113:

1. `iterator(chunk_size)` now asserts `chunk_size > 0`. Silently
   producing an immediately-exhausted iterator on a 0 or negative
   value was a footgun for `iterator(unchecked_input as i64)` style
   callers. Three new no-DB validation tests pin the panic message +
   the positive-path success.

2. Doc-comment + cookbook now flag two concurrency hazards:
   - **Concurrent-write hazard**: each chunk is its own query, so
     INSERT/DELETE between chunks can skip or duplicate rows.
     Recommend a snapshot-isolation transaction (REPEATABLE READ) on
     write-concurrent tables.
   - **select_for_update doesn't propagate**: row locks held by a
     `.select_for_update()` queryset are released between chunks
     because each runs in its own implicit transaction. Redirects to
     `pool.begin()` + `.fetch_on(&mut *tx)` for full-drain locking.

3. Extracted shared fetch path into private `fetch_next_chunk(pool)`.
   `next_chunk` + `next_row` both call it instead of duplicating the
   clone-query / set-limit / set-offset / check-exhaustion block.
   Caller-side `seen` accounting stays per-method because the
   counting semantics differ (next_chunk counts the whole chunk
   immediately; next_row counts rows as they pop out of the buffer).

4. Live test `mixed_next_row_and_next_chunk_preserve_order` now
   asserts `rows_seen() == 50` mid-iteration after 50 next_row calls
   on a chunk_size=100 iterator — pins the buffer-aware accounting.

All gates green: 5 live + 3 validation tests pass, all-features
build clean, sqlite-tenancy litmus clean.
Second polish pass on PR #113. The first polish (commit 8c58dca)
added concurrency caveats but referenced a `next_chunk_on(&mut *tx)`
method that doesn't exist — a reader copying the cookbook snippet
verbatim hits a compile error.

Three fixes:

1. Cookbook concurrent-write snippet replaced with an honest
   hand-rolled `select_rows_on` loop inside the snapshot-isolation
   transaction. Explicitly states "ChunkedIter takes `&Pool`, not a
   `&mut Transaction`, so the chunker API can't be used inside the
   tx directly."

2. Cookbook `select_for_update` redirect rewritten with the
   trade-off pair: `.fetch_on(&mut *tx)` for in-memory results,
   hand-rolled LIMIT/OFFSET for streamed ones. Both call out that
   you're outside the ChunkedIter API.

3. Same correction in the `.iterator()` rustdoc — points at
   [`select_rows_on`] as the actual primitive for tx-scoped chunked
   reads. Notes the future `iterator_on` companion as out-of-scope
   for issue #23.

No behavior change. All gates green: 5 live + 3 validation tests
pass, build clean across tri-feature + sqlite-tenancy litmus.
@ujeenet ujeenet force-pushed the issue-23-iterator branch from 4dbd581 to 108ad74 Compare May 15, 2026 22:48
@ujeenet ujeenet merged commit 67baee1 into main May 15, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[orm] QuerySet .iterator(chunk_size) — chunked server-side streaming

1 participant