Exact table statistics for the optimizer (DataFusion 54) by alxmrs · Pull Request #201 · xqlsystems/xarray-sql

alxmrs · 2026-06-30T21:05:07Z

Report exact statistics from the xarray scan so DataFusion's cost-based optimizer can plan joins and aggregations well — on the ordinary engine, with no second engine and the full datafusion-python DataFrame surface preserved.

Why this works now

Earlier exploration went down a "native in-process engine" path because datafusion-ffi (52/53) dropped Statistics across the FFI boundary — a foreign scan reported unknown cardinality. DataFusion 54 forwards ExecutionPlan statistics across FFI (FFI_ExecutionPlan gained partition_statistics), so the statistics our scan reports now reach the optimizer through the normal FFI table-provider path. The native engine is gone.

What's here

XarrayScanExec wraps the StreamingTableExec from scan() and reports exact Statistics. Every stat is derived from coordinate metadata xarray already knows — none of it scans the data — and each is exact, not an estimate:
- num_rows — summed product of each surviving chunk's dimension sizes. Drives JoinSelection's build-side choice and lets COUNT(*) skip the scan.
- total_byte_size — num_rows × fixed row width (derived in Rust from the projected schema's primitive widths).
- per dimension-coordinate column — exact min/max (the join/filter keys) and null_count = 0 (grid axes are always fully populated). Data variables are left unknown; their bounds would require a scan.
Why the wrapper (and not just TableProvider::statistics()): the physical JoinSelection rule reads statistics off the ExecutionPlan, not off the TableProvider — so injecting them at the plan node is what actually drives build-side selection, even under DF 54's FFI forwarding. Confirmed empirically: TableProvider::statistics() alone leaves the join as mode=Partitioned.
Where the numbers come from: num_rows is plumbed from Python as an optional third tuple element (factory, metadata, num_rows) — _block_len(block), the product of the chunk's slice extents (the 2-tuple form still works). The min/max bounds reuse the same coordinate metadata already computed for partition pruning. Nothing is recomputed from the data.
Dependency upgrade: datafusion + datafusion-ffi 52 → 54 (and arrow 57 → 58, pyo3 0.26 → 0.28 to match), plus the datafusion Python dep → 54.
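
The row-count plumbing described above can be sketched in plain Python. This is a minimal illustration, not the PR's code: `block_len` and `total_rows` are hypothetical stand-ins (the real helper is `_block_len`), a block is assumed to be a mapping from dimension name to `slice`, and treating a single 2-tuple partition as making the total unknown is an assumption the description does not confirm.

```python
import math
from typing import Optional, Sequence


def block_len(block: dict[str, slice]) -> int:
    """Rows in one chunk: the product of its slice extents."""
    return math.prod(s.stop - s.start for s in block.values())


def total_rows(partitions: Sequence[tuple]) -> Optional[int]:
    """Sum per-partition row counts, or return None if any count is missing."""
    total = 0
    for part in partitions:
        # Accept both (factory, metadata) and (factory, metadata, num_rows).
        num_rows = part[2] if len(part) == 3 else None
        if num_rows is None:
            # Assumption: one unknown count makes the whole total unknown.
            return None
        total += num_rows
    return total


# Two hypothetical chunks of a (time, lat, lon) grid.
blocks = [
    {"time": slice(0, 10), "lat": slice(0, 4), "lon": slice(0, 25)},
    {"time": slice(10, 20), "lat": slice(0, 4), "lon": slice(0, 25)},
]
partitions = [(lambda: None, {}, block_len(b)) for b in blocks]
print(total_rows(partitions))  # 2000
```

Because the count comes from slice extents alone, no chunk is opened to compute it.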

Verified

A big-vs-small join plans as HashJoinExec: mode=CollectLeft with the small side's Rows=Exact(64) carried through the FFI boundary (FFI_ExecutionPlan: XarrayScanExec) — the cost-based build-side choice statistics unlock.
A projected scan's EXPLAIN shows Rows=Exact(2000), Bytes=Exact(48000), [(Col[0]: Min=Exact(Int64(0)) Max=Exact(Int64(3)) Null=Exact(0)), …], with data-variable columns carrying no bounds.
COUNT(*) is answered from the exact statistics without scanning.
Full suite green (tests/test_stats.py pins the statistics contract; reader tests that used COUNT(*) to force a scan now use SELECT *, since COUNT(*) is metadata-only once statistics are exact).
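
To see why exact cardinality changes the join plan, here is a toy model of a cost-based build-side choice. It is a deliberate simplification and not DataFusion's JoinSelection rule: `plan_hash_join` and `COLLECT_LEFT_MAX_ROWS` are hypothetical, and the cut-off value is illustrative rather than taken from DataFusion's configuration.

```python
from typing import Optional

# Illustrative cut-off only; the real optimizer's thresholds are configurable.
COLLECT_LEFT_MAX_ROWS = 100_000


def plan_hash_join(left_rows: Optional[int], right_rows: Optional[int]) -> tuple[str, str]:
    """Return (join mode, input used as the build side) for a toy cost model."""
    if left_rows is None or right_rows is None:
        # Without cardinality there is no basis for collecting one side.
        return ("Partitioned", "left")
    # Build the hash table from the smaller input.
    build = "left" if left_rows <= right_rows else "right"
    if min(left_rows, right_rows) <= COLLECT_LEFT_MAX_ROWS:
        # Small enough to collect into a single hash table.
        return ("CollectLeft", build)
    return ("Partitioned", build)


print(plan_hash_join(None, None))      # ('Partitioned', 'left')
print(plan_hash_join(2_000_000, 64))   # ('CollectLeft', 'right')
```

In the second call the small input is on the right, so a planner following this model would swap the inputs before building, which is consistent with the small side's Rows=Exact(64) driving the CollectLeft plan reported above.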

🤖 Generated with Claude Code

https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N

alxmrs · 2026-06-30T21:32:58Z

+      shuffle) and, in future, join-driven partition pruning. The native
+      engine is **eager** (it materialises the result) and does not register
+      Python scalar UDFs; use the default engine for ``cftime`` filters or
+      multi-dimension-group datasets.


That the native engine is eager and not lazy is a merge blocker. Laziness is a must.

Fixed in a4258a6. The native engine is now lazy: NativeContext.sql() returns a lazy NativeDataFrame (planned, not executed) that streams result batches instead of collecting, and it round-trips through the same to_pandas/to_dataset — including the chunked, dask-backed path — as the FFI engine. A reduction or chunked scan over a larger-than-memory store never materialises the whole result. Covered by test_native_lazy_chunked_roundtrip and test_native_sql_returns_lazy_frame.

Generated by Claude Code

alxmrs · 2026-06-30T21:33:44Z

+        if self._native is not None:
+            raise NotImplementedError(
+                "The native engine does not yet support datasets whose "
+                "variables span multiple dimension groups (which register as "
+                f"a SQL namespace). {name!r} has groups {list(groups)}. Use "
+                "engine='ffi' for namespaced datasets."
+            )


We also needed this support.

Added in ab7a847. The native engine now supports multi-dimension-group datasets as SQL namespaces (NativeContext.register_table accepts a schema-qualified name like era5.surface and creates the schema on demand) and the cftime() filter UDF (a native ScalarUDFImpl that calls the Python cftime callable, since a datafusion-python UDF can't cross the FFI boundary). Tests: test_native_multigroup_namespace and test_native_cftime_udf. Let me know if "this support" meant something else.
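
As a rough illustration of registering a schema-qualified name and creating the schema on demand, consider the sketch below. `Catalog` is a hypothetical in-memory stand-in, not `NativeContext`, and the default schema name `public` is an assumption made for the example.

```python
class Catalog:
    """Toy catalog: schema name -> {table name -> table object}."""

    def __init__(self, default_schema: str = "public") -> None:
        self.default_schema = default_schema
        self.schemas: dict[str, dict[str, object]] = {default_schema: {}}

    def register_table(self, name: str, table: object) -> None:
        # Split "era5.surface" into schema "era5" and table "surface".
        schema, _, bare = name.rpartition(".")
        schema = schema or self.default_schema
        # Create the schema the first time a table is registered under it.
        self.schemas.setdefault(schema, {})[bare] = table


catalog = Catalog()
catalog.register_table("era5.surface", "surface-table")
catalog.register_table("era5.pressure", "pressure-table")
catalog.register_table("air", "air-table")
print(sorted(catalog.schemas))          # ['era5', 'public']
print(sorted(catalog.schemas["era5"]))  # ['pressure', 'surface']
```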

Generated by Claude Code

alxmrs · 2026-06-30T21:37:34Z

+        inner = SessionContext().from_arrow(table)
+        return XarrayDataFrame(inner, templates=self._registered_datasets)
+
+    def explain_native(self, query: str) -> str:


I don't think this should be a public API.

Agreed — made it private in a4258a6. It's now _explain_native (used only by tests to confirm stats/plan shape), so it's no longer part of the public surface.

Generated by Claude Code

alxmrs · 2026-06-30T21:38:18Z

+        the existing :class:`XarrayDataFrame` round-trip — ``to_pandas`` and
+        ``to_dataset`` — works unchanged.
+        """
+        batches = self._native.sql(query)


This needs to be lazy. What if we wanted to load ARCO-ERA5 (several petabytes)?

Fixed in a4258a6 (same lazy-engine change). The result is no longer collected — NativeDataFrame.execute_stream() pulls batches on demand (releasing the GIL per batch), and the chunked to_dataset path re-plans per chunk with a coordinate filter that pushes into the scan and prunes source partitions. So a SELECT over ARCO-ERA5-scale data reads chunk-by-chunk and never holds the whole input or output in memory.
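
The difference between collecting and streaming can be shown with a plain-Python analogy. This is not the `NativeDataFrame` API: `scan_chunks` and `streaming_sum` are hypothetical, and a generator stands in for a batch stream that reads one source chunk per pull.

```python
from typing import Iterator, List


def scan_chunks(n_chunks: int, rows_per_chunk: int, read_log: List[int]) -> Iterator[List[int]]:
    """Yield one batch per source chunk, recording each read as it happens."""
    for chunk_id in range(n_chunks):
        read_log.append(chunk_id)  # stands in for reading one chunk from storage
        yield [chunk_id] * rows_per_chunk


def streaming_sum(batches: Iterator[List[int]]) -> int:
    """Reduce batch by batch; only one batch is alive at a time."""
    total = 0
    for batch in batches:
        total += sum(batch)
    return total


read_log: List[int] = []
stream = scan_chunks(n_chunks=4, rows_per_chunk=3, read_log=read_log)
print(read_log)               # []  (building the stream reads nothing)
print(streaming_sum(stream))  # 18
print(read_log)               # [0, 1, 2, 3]
```

The reduction holds a single batch plus a running total, which is the property that matters for a larger-than-memory store.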

Generated by Claude Code

alxmrs · 2026-07-01T11:20:55Z

+        return self._batch
+
+
+class NativeFrame:


Does this serve the entire DataFrame API surface? Can we subclass it so it does?

alxmrs · 2026-07-01T13:50:17Z

+/// Execution, schema, ordering, and partitioning are delegated verbatim to the
+/// inner plan (so projection mechanics are reused unchanged); the only thing
+/// this node adds is real cardinality. When consumed natively (not across the
+/// FFI boundary, which drops statistics entirely), this is what lets


I think this comment is stale

alxmrs · 2026-07-01T14:31:27Z

+/// what lets `COUNT(*)` be answered without a scan. Column min/max would add
+/// range hints but do not change those decisions (WHERE filters are already
+/// handled by partition pruning), so they are omitted to keep this simple.
+fn build_scan_statistics(output_schema: &Schema, metas: &[&PartitionMetadata]) -> Statistics {
+    let mut stats = Statistics::new_unknown(output_schema);
+    stats.num_rows = sum_row_counts(metas.iter().copied());
+    stats


This may be too simple. I think adding all the relevant stats that we have is good. It may help us later.

Agreed — expanded in c7550c5. build_scan_statistics now reports every stat we can derive from coordinate metadata without touching the data, each exact:

num_rows (as before)

total_byte_size = num_rows × fixed row width (derived in Rust from the projected schema's primitive widths; Absent if any column is variable-width)

per dimension-coordinate column: exact min/max (folded coordinate bounds — the join/filter keys) and null_count = 0 (grid axes are always fully populated)

Verified through the FFI boundary — EXPLAIN on a projected scan now shows Rows=Exact(2000), Bytes=Exact(48000), [(Col[0]: Min=Exact(Int64(0)) Max=Exact(Int64(3)) Null=Exact(0)), …], with data-variable columns correctly carrying no bounds. Pinned by test_exact_byte_size_in_scan_statistics and test_dimension_column_min_max_in_scan_statistics.

I left distinct_count and sum_value as Absent: distinct count would need per-dimension cardinality (not present in the min/max metadata) plus a coordinate-uniqueness guarantee we don't enforce, and reporting an inexact value as Exact would mislead the optimizer.
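
The two metadata-only derivations above can be sketched as follows. This is an illustration in Python rather than the Rust in the PR: `FIXED_WIDTHS`, `row_width`, `total_byte_size`, and `fold_bounds` are hypothetical names, `None` stands in for Absent, and the three-column schema is invented so the arithmetic lands on the figures quoted in this thread.

```python
from typing import Optional

# Byte widths for a few fixed-width types; anything else counts as variable-width.
FIXED_WIDTHS = {"int32": 4, "int64": 8, "float32": 4, "float64": 8}


def row_width(schema: dict[str, str]) -> Optional[int]:
    """Bytes per row, or None if any projected column is variable-width."""
    total = 0
    for dtype in schema.values():
        width = FIXED_WIDTHS.get(dtype)
        if width is None:
            return None
        total += width
    return total


def total_byte_size(num_rows: int, schema: dict[str, str]) -> Optional[int]:
    width = row_width(schema)
    return None if width is None else num_rows * width


def fold_bounds(partition_bounds: list[dict[str, tuple]]) -> dict[str, tuple]:
    """Fold per-partition (min, max) pairs into one bound per column."""
    folded: dict[str, tuple] = {}
    for bounds in partition_bounds:
        for col, (lo, hi) in bounds.items():
            if col in folded:
                cur_lo, cur_hi = folded[col]
                folded[col] = (min(cur_lo, lo), max(cur_hi, hi))
            else:
                folded[col] = (lo, hi)
    return folded


schema = {"time": "int64", "lat": "float64", "temperature": "float64"}
print(total_byte_size(2000, schema))                      # 48000
print(total_byte_size(2000, {"station": "utf8"}))         # None
print(fold_bounds([{"time": (0, 1)}, {"time": (2, 3)}]))  # {'time': (0, 3)}
```

Folding keeps the bound exact only because each partition's own min and max are exact coordinate values, not estimates.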

Generated by Claude Code

alxmrs

Another idea

Report exact statistics from the xarray scan so DataFusion's cost-based optimizer can plan joins and aggregations well, without a second engine. DataFusion 54's datafusion-ffi forwards ExecutionPlan statistics across the FFI boundary (52/53 dropped them), so the statistics the scan reports now reach the optimizer on the ordinary path.

- XarrayScanExec wraps the StreamingTableExec from scan() and reports exact Statistics: num_rows is the summed product of each chunk's dimension sizes (exact, not an estimate), plus exact min/max for numeric dimension columns. Per-partition row counts are plumbed from Python as a third tuple element (factory, metadata, num_rows); the 2-tuple form still works.
- Upgrade datafusion + datafusion-ffi 52 -> 54 (and arrow 57 -> 58, pyo3 0.26 -> 0.28 to match), and the datafusion Python dep to 54.

Verified: a big-vs-small join now plans as HashJoinExec mode=CollectLeft with the small side's Rows=Exact(64) carried through the FFI boundary (FFI_ExecutionPlan: XarrayScanExec), and COUNT(*) is answered from the exact statistics without scanning. The reader tests that used COUNT(*) to force a scan now use SELECT * (COUNT(*) is metadata-only once statistics are exact).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_019VuSeCio99NcME5eubcN3N

alxmrs commented Jun 30, 2026


alxmrs commented Jul 1, 2026


alxmrs force-pushed the claude/datafusion-geospatial-perf-fs1bqv branch from ab7a847 to 913707e Compare July 1, 2026 13:41

alxmrs changed the title ~~Add native DataFusion context with exact table statistics~~ Exact table statistics for the optimizer (DataFusion 54) Jul 1, 2026

alxmrs commented Jul 1, 2026


alxmrs force-pushed the claude/datafusion-geospatial-perf-fs1bqv branch from 913707e to 997f73d Compare July 1, 2026 14:18

alxmrs mentioned this pull request Jul 1, 2026

Add native DataFusion context with exact table statistics #202

Closed

alxmrs commented Jul 1, 2026


Comment thread src/lib.rs Outdated

alxmrs force-pushed the claude/datafusion-geospatial-perf-fs1bqv branch 2 times, most recently from c7550c5 to 24f2a8d Compare July 1, 2026 15:11

alxmrs force-pushed the claude/datafusion-geospatial-perf-fs1bqv branch from 24f2a8d to 5f05dce Compare July 1, 2026 15:26

alxmrs merged commit f085271 into main Jul 1, 2026
12 checks passed

alxmrs deleted the claude/datafusion-geospatial-perf-fs1bqv branch July 1, 2026 15:48

alxmrs mentioned this pull request Jul 2, 2026

Dictionary-encode coordinate columns #217

Draft
