
fix: Implement file writing directly for VortexSink instead of DataFusion demuxer#23

Merged
lukekim merged 24 commits into spiceai-52 from peasee/260211-target-vortex-size
Feb 27, 2026
@peasee peasee commented Feb 12, 2026

🗣 Description

  • When a target file size is specified, bypasses the DataFusion FileSink and Demuxer and writes files directly from VortexSink. The DataFusion demuxer only bounds files by row count and gives no guarantee on file size. Because the number of rows needed to reach a given file size depends on the contents of those rows, the Demuxer cannot be used to reliably reach a target file size.

sgrebnov and others added 16 commits January 14, 2026 14:24
…UE" as in keep for that node, up to any `and` or `or` node and handle empty IN list. (#8)
* Fix session get-or-default (vortex-data#5662)

The comments described this as get-or-default, but the code panicked instead.

---------

Signed-off-by: Nicholas Gates <nick@nickgates.com>

* feat: Support retrieving writer strategy builder from vortex session

---------

Signed-off-by: Nicholas Gates <nick@nickgates.com>
Co-authored-by: Nicholas Gates <gatesn@users.noreply.github.com>
* fix: Ensure CastExpr/CastColumnExpr/ScalarFunctionExpr check children in can_be_pushed_down

The can_be_pushed_down function was returning true for CastExpr and CastColumnExpr
without checking if their child expressions are convertible. This caused runtime
errors when the child contained expressions like CaseExpr that convert() cannot
handle.

Also fixed ScalarFunctionExpr to recursively check its arguments.
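
The recursive check described above can be sketched as follows. This is a minimal illustration on a hypothetical, simplified `Expr` enum; it does not use the actual DataFusion physical expression types or the real `can_be_pushed_down` signature, and `Unsupported` stands in for any expression that `convert()` cannot handle.

```rust
/// Hypothetical, simplified expression tree (an assumption for illustration,
/// not the DataFusion `PhysicalExpr` types).
#[derive(Debug)]
enum Expr {
    Column(String),
    Literal(i64),
    Cast(Box<Expr>),
    ScalarFunction { name: String, args: Vec<Expr> },
    /// Stand-in for an expression the converter cannot handle.
    Unsupported,
}

/// Returns true only if the expression and all of its children are convertible.
fn can_be_pushed_down(expr: &Expr) -> bool {
    match expr {
        Expr::Column(_) | Expr::Literal(_) => true,
        // A cast is convertible only if its child is.
        Expr::Cast(child) => can_be_pushed_down(child),
        // A function call is convertible only if every argument is.
        Expr::ScalarFunction { args, .. } => args.iter().all(can_be_pushed_down),
        Expr::Unsupported => false,
    }
}
```

With this shape, a cast wrapping an unsupported child is rejected up front instead of failing later during conversion.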

Fixes spiceai/spiceai#9037

* Add Case-When Expression and tests

* Implement execute()

* Add additional tests and fix type issue

* Fix toml lint

* feat(case_when): implement lazy evaluation to avoid side effects in unevaluated branches

This implements proper lazy evaluation in CaseWhen expression to ensure
that THEN/ELSE branches are only evaluated for rows where they apply.
This is critical for correctness when expressions have side effects like
divide-by-zero panics.

The implementation:
1. Evaluates conditions in order, tracking which rows have been matched
2. For each condition, computes an effective mask (condition AND NOT matched)
3. Uses filter() to create a scoped array with only matching rows
4. Evaluates THEN expression only on the filtered scope
5. Uses scatter_with_mask() to expand results back to original positions
6. Short-circuits when all rows are matched or all conditions fail

This fixes TPC-DS Q73 which has a pattern like:
  CASE WHEN hd_vehicle_count > 0
       THEN hd_dep_count/hd_vehicle_count
       ELSE NULL END

Previously, the division would be evaluated for all rows including those
where hd_vehicle_count=0, causing a divide-by-zero panic. Now the division
is only evaluated for rows where the condition is true.

Added test: test_evaluate_divide_by_zero_protected_by_case_when
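
The lazy-evaluation steps listed above can be sketched on plain Rust vectors. This is an illustrative model only: `Row`, `Branch`, `eval_case_when`, and the two helper functions are hypothetical names, not Vortex APIs, and the filter and scatter steps are written as ordinary index operations rather than calls to `filter()` or `scatter_with_mask()`. The sketch assumes `ELSE NULL`, so unmatched rows stay `None`.

```rust
/// Hypothetical row layout: (hd_dep_count, hd_vehicle_count).
type Row = (i64, i64);

/// One WHEN/THEN branch. `then` only ever sees the filtered rows.
struct Branch {
    when: fn(&Row) -> bool,
    then: fn(&[Row]) -> Vec<Option<i64>>,
}

/// Evaluates the branches lazily; rows matched by no branch stay NULL (None).
fn eval_case_when(rows: &[Row], branches: &[Branch]) -> Vec<Option<i64>> {
    let mut out = vec![None; rows.len()];
    let mut matched = vec![false; rows.len()];

    for branch in branches {
        // Short-circuit once every row has been matched by an earlier branch.
        if matched.iter().all(|m| *m) {
            break;
        }
        // Effective mask: condition AND NOT already matched.
        let selected: Vec<usize> = (0..rows.len())
            .filter(|&i| (branch.when)(&rows[i]) && !matched[i])
            .collect();
        if selected.is_empty() {
            continue; // the condition holds for no remaining row
        }
        // Filter: build the scoped input containing only the matching rows.
        let scoped: Vec<Row> = selected.iter().map(|&i| rows[i]).collect();
        // Evaluate THEN on the scoped rows only.
        let values = (branch.then)(&scoped);
        // Scatter: write the results back to their original positions.
        for (&i, value) in selected.iter().zip(values) {
            out[i] = value;
            matched[i] = true;
        }
    }
    out
}

/// WHEN hd_vehicle_count > 0
fn vehicle_count_positive(row: &Row) -> bool {
    row.1 > 0
}

/// THEN hd_dep_count / hd_vehicle_count (would panic on a zero divisor).
fn dep_per_vehicle(rows: &[Row]) -> Vec<Option<i64>> {
    rows.iter().map(|&(dep, vehicles)| Some(dep / vehicles)).collect()
}
```

Passing rows that include a zero `hd_vehicle_count` through a single branch built from `vehicle_count_positive` and `dep_per_vehicle` yields `None` for those rows, because the division never sees them.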

* Formatting
@peasee peasee self-assigned this Feb 12, 2026
@peasee peasee added the bug Something isn't working label Feb 12, 2026
@lukekim lukekim changed the base branch from spiceai-51 to spiceai-52 February 27, 2026 01:58
@lukekim lukekim changed the base branch from spiceai-52 to spiceai-52-patches February 27, 2026 01:58
Base automatically changed from spiceai-52-patches to spiceai-52 February 27, 2026 02:18
- Introduced `target_file_size_mb` option to `VortexFormat` for controlling the size of output files in megabytes.
- Updated `VortexSink` to handle writing files based on the specified target size, bypassing DataFusion's default row count-based splitting.
- Implemented logic to split output files when the buffered data exceeds the target file size during the write process.
- Added tests to verify the new target file size configuration.
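
The size-based splitting described in these notes can be sketched as below. This is a simplified model under stated assumptions, not the `VortexSink` implementation: `Batch` and `split_by_target_size` are hypothetical names, batch sizes are given up front rather than measured during encoding, and the target is expressed in bytes (a `target_file_size_mb` value would first be multiplied by 1024 * 1024).

```rust
/// Hypothetical stand-in for an encoded record batch.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Batch {
    rows: usize,
    bytes: usize,
}

/// Groups batches into files, closing the current file as soon as the
/// buffered bytes exceed the target. A file may overshoot the target by
/// up to one batch, since the check happens after the batch is buffered.
fn split_by_target_size(batches: &[Batch], target_bytes: usize) -> Vec<Vec<Batch>> {
    let mut files = Vec::new();
    let mut current = Vec::new();
    let mut buffered = 0usize;

    for batch in batches {
        current.push(*batch);
        buffered += batch.bytes;
        if buffered > target_bytes {
            // Finish this file and start a new, empty one.
            files.push(std::mem::take(&mut current));
            buffered = 0;
        }
    }
    // Flush whatever is left as the final (possibly undersized) file.
    if !current.is_empty() {
        files.push(current);
    }
    files
}
```

Because the split decision depends on bytes rather than rows, batches with identical row counts but different encoded sizes end up in files with different row counts, which is the behavior a row-count-based demuxer cannot provide.
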
Comment thread vortex-datafusion/src/convert/exprs.rs
Comment thread vortex-datafusion/src/convert/exprs.rs
Comment thread vortex-datafusion/src/persistent/source.rs Outdated
@lukekim lukekim merged commit 2b65b68 into spiceai-52 Feb 27, 2026
8 of 42 checks passed
@lukekim lukekim deleted the peasee/260211-target-vortex-size branch February 27, 2026 03:34

Labels

bug (Something isn't working), changelog/feature


4 participants