fix: Allow named arguments to skip optional parameter positions by sgrebnov · Pull Request #123 · spiceai/datafusion

sgrebnov · 2026-01-20T12:36:57Z

Which issue does this PR close?

Related to this PR:

Add PostgreSQL-style named arguments support for scalar functions apache/datafusion#18019

Fixes named argument resolution to follow PostgreSQL semantics where named arguments can target any parameter position without requiring intermediate optional positions to be filled.

Problem

Before this fix, resolve_function_arguments required positions 0..N to be filled when N arguments were provided. This incorrectly rejected valid PostgreSQL-style calls:

-- Signature: func(a, b DEFAULT 10, c DEFAULT 20, d DEFAULT 30)
SELECT func(1, d => 40);  -- Valid in PostgreSQL, was rejected

Motivation for this in existing DF logic is that DF does not provide default mechanism to attach args to specific supported arguments so they are provided consequently (first N args). Spice implements own mechanism for this (extra metadata) and can benefit from full Postgres dialect.

PR updates ordering logic to only validate that positional arguments fill consecutive slots from position 0. Named arguments can target any parameter position. This does NOT break existing functionality (no changes if arguments are specified in more conservative way) but also unblock Spice scenario:

// Before: for i in 0..args.len()        -- checked positions 0,1,2 for 3 args
// After:  for i in 0..positional_count  -- only checks positional arg positions

Example

vector_search(foo, 'bar', rank_weight => 100)
-- params: [tbl, query, column, limit, include_score, rank_weight]
-- 2 positional fill positions 0,1 ✓
-- Named 'rank_weight' fills position 5 ✓
-- Positions 2,3,4 skipped (optional) ✓

Important:

DataFusion only validates argument names and reorders to signature order. Functions are responsible for their own required/optional argument validation and must use metadata or other mechanisms to identify named arguments when positions are skipped.

PostgreSQL Reference

From PostgreSQL Docs - Calling Functions:

"In either notation, parameters that have default values given in the function
declaration need not be written in the call at all. But this is particularly
useful in named notation, since any combination of parameters can be omitted."

…pache#19659) (apache#19705) ## Which issue does this PR close?  - part of apache#18566 ## Rationale for this change I propose back porting the fix for apache#19641 to 52 release ## What changes are included in this PR? - Backport apache#19659 ## Are these changes tested? eYes ## Are there any user-facing changes? bug fix Co-authored-by: Kumar Ujjawal <ujjawalpathak6@gmail.com>

…he#19661) ## Which issue does this PR close?  - Part of apache#18566 ## Rationale for this change  ## What changes are included in this PR?  ## Are these changes tested?  ## Are there any user-facing changes?

## Which issue does this PR close?  - part of apache#19784. ## Rationale for this change  This is a follow-up of apache#19573 ## What changes are included in this PR?  Backport - apache#19804 ## Are these changes tested?  Existing tests ## Are there any user-facing changes?  no

…ache#20289) ## Which issue does this PR close? - related to apache#20287 ## Rationale for this change see issue apache#20109 ## What changes are included in this PR? 1. Remap parent filter expressions: When a FilterExec has a projection, remap unsupported parent filter expressions from output schema coordinates to input schema coordinates using `reassign_expr_columns()` before combining them with the current filter's predicates. 2. Preserve projection: When creating the merged FilterExec, preserve the original projection instead of discarding it . ## Are these changes tested? yes, add some test case ## Are there any user-facing changes? --------- ## Which issue does this PR close?  - Closes #. ## Rationale for this change  ## What changes are included in this PR?  ## Are these changes tested?  ## Are there any user-facing changes?   --------- Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>

- Closes apache#17508 The previous implementation used UUID-based aliasing as a workaround to prevent duplicate names for literals in Substrait plans. This approach had several drawbacks: - Non-deterministic plan names that made testing difficult (requiring UUID regex filters) - Only addressed literal naming conflicts, not the broader issue of name deduplication - Added unnecessary dependency on the `uuid` crate - Didn't properly handle cases where the same qualified name could appear with different schema representations 1. Enhanced NameTracker: Refactored to detect two types of conflicts: - Duplicate schema names: Tracked via schema_name() to prevent validate_unique_names failures (e.g., two Utf8(NULL) literals) - Ambiguous references: Tracked via qualified_name() to prevent DFSchema::check_names failures when a qualified field (e.g., left.Utf8(NULL)) and unqualified field (e.g., Utf8(NULL)) share the same column name 2. **Removed UUID dependency**: Eliminated the `uuid` crate from `datafusion/substrait` 3. **Removed literal-specific aliasing**: The UUID-based workaround in `project_rel.rs` is no longer needed as the improved NameTracker handles all naming conflicts consistently 4. **Deterministic naming**: Name conflicts now use predictable `__temp__N` suffixes instead of random UUIDs Note: This doesn't fully fix all the issues in apache#17508 which allow some special casing of `CAST` which are not included here. Yes: - Updated snapshot tests to reflect the new deterministic naming (e.g., `Utf8("people")__temp__0` instead of UUID-based names) - Modified some roundtrip tests to verify semantic equivalence (schema matching and execution) rather than exact string matching, which is more robust - All existing integration tests pass with the new naming scheme Minimal. The generated plan names are now deterministic and more readable (using `__temp__N` suffixes instead of UUIDs), but this is primarily an internal representation change. The functional behavior and query results remain unchanged. ## Which issue does this PR close?  - Closes #. ## Rationale for this change  ## What changes are included in this PR?  ## Are these changes tested?  ## Are there any user-facing changes?   Co-authored-by: Xander <zander181@googlemail.com>

…#19714) ## Which issue does this PR close?  - Part of apache#7013 ## Rationale for this change I was speaking with someone who wanted to get more involved with DataFusion recently, and it was not clear from the website that discord is far more active than Slack. I think it would help new new users to make this clearer in the documentation to lower the barrier to joining the communitu ## What changes are included in this PR? Improve the communication documentation here: https://datafusion.apache.org/contributor-guide/communication.html and mention that discord is more active than slack <img width="1191" height="856" alt="Screenshot 2026-01-09 at 8 54 39 AM" src="https://github.com/user-attachments/assets/9bbef5d4-645d-45f0-85aa-6b12a2a7eea5" /> While I was reviewing the communication documentation, I noticed some other minor issues that could be improved which I will call out inline ## Are these changes tested?  ## Are there any user-facing changes?

## Which issue does this PR close?  - N/A ## Rationale for this change  Occasionally we do PRs to bump a bunch of versions in `Cargo.lock`, recent example: - apache#19667 Ideally this should be handled by dependabot already. I suspect the default limit of 5 is preventing dependabot from creating all the PRs it needs; this causes it to error and those version bumps are "lost". - I say "lost" but dependabot can pick it up on the next run... if it doesn't error again See an example from a dependabot run: <img width="703" height="118" alt="image" src="https://github.com/user-attachments/assets/f8ae6b70-6fe1-4613-8c60-28564c23188b" /> From the logs: ```text +-------------------------------------------------------+ | Changes to Dependabot Pull Requests | +---------+---------------------------------------------+ | created | insta ( from 1.45.0 to 1.46.0 ) | | created | tracing ( from 0.1.43 to 0.1.44 ) | | created | syn ( from 2.0.111 to 2.0.113 ) | | created | async-compression ( from 0.4.35 to 0.4.36 ) | | created | object_store ( from 0.12.4 to 0.13.0 ) | | created | serde_json ( from 1.0.145 to 1.0.148 ) | | created | bigdecimal ( from 0.4.9 to 0.4.10 ) | | created | clap ( from 4.5.53 to 4.5.54 ) | | created | libc ( from 0.2.177 to 0.2.179 ) | | created | tokio ( from 1.48.0 to 1.49.0 ) | | created | tokio-util ( from 0.7.17 to 0.7.18 ) | | created | sqllogictest ( from 0.28.4 to 0.29.0 ) | +---------+---------------------------------------------+ ``` We expect to have all these PRs created (this was for the run 5 days ago), but only these were created on that day: - apache#19645 - apache#19644 - apache#19643 And considering these PRs were still open at the time: - apache#19544 - apache#19325 We can see it hit the 5 limit. > Dependabot default behavior: > > - If five pull requests with version updates are open, no further pull requests are raised until some of those open requests are merged or closed. > - Security updates have a separate, internal limit of ten open pull requests which cannot be changed. - https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference#open-pull-requests-limit- ## What changes are included in this PR?  Bump the limit to 15. We might have an appetite for increasing it more, 15 was chosen arbitrarily. ## Are these changes tested?  Dependabot only. ## Are there any user-facing changes?  No.

## Which issue does this PR close?  - Closes apache#19720. ## Rationale for this change - Choosing LZO compression errors, I think it might never get supported so the best option moving forward is to remove it algother and update the docs.  ## What changes are included in this PR? - Removed LZO from parse_compression_string() function - Removed docs - Updated exptected test output  ## Are these changes tested? Yes  ## Are there any user-facing changes? User choosing LZO as compression will get a clear error message: ``` Unknown or unsupported parquet compression: lzo. Valid values are: uncompressed, snappy, gzip(level), brotli(level), lz4, zstd(level), and lz4_raw. ```

## Which issue does this PR close?  - part of 18566 ## Rationale for this change  ## What changes are included in this PR?  ## Are these changes tested?  ## Are there any user-facing changes?

## Which issue does this PR close?  - Closes apache#19715. ## Rationale for this change The:is_used() API incorrectly returned false for custom `DataSource` implementations that didn't call reassign_expr_columns() -> with_new_children() . This caused `HashJoinExec` to skip computing dynamic filters even when they were actually being used. ## What changes are included in this PR? Updated is_used() to check both outer and inner Arc counts ## Are these changes tested? Functionality is covered by existing test `test_hashjoin_dynamic_filter_pushdown_is_used`. I was not sure if to add a repro since it would require adding a custom `DataSource`, the current tests in datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs use `FileScanConfig` ## Are there any user-facing changes? no

## Which issue does this PR close?  - Part of apache/datafusion-comet#2986 ## Rationale for this change - The current floor and ceil implementations always convert scalar inputs to arrays via `values_to_arrays()`, which introduces unnecessary overhead when processing single values.  ## What changes are included in this PR? - Add scalar fast path for floor and ceil - Add criterion benchmark for floor/ceil to measure performance  ## Are these changes tested? - All sqllogictest pass - Addec benchmark | Benchmark | Array Time | Scalar Time | Speedup | |---------------------|------------|-------------|---------| | floor_f64 (1024) | ~396 ns | ~147 ns | 2.7× | | ceil_f64 (1024) | ~363 ns | ~150 ns | 2.4× | | floor_f64 (4096) | ~681 ns | ~144 ns | 4.7× | | ceil_f64 (4096) | ~667 ns | ~144 ns | 4.6× | | floor_f64 (8192) | ~1,638 ns | ~144 ns | 11.4× | | ceil_f64 (8192) | ~1,634 ns | ~144 ns | 11.3× |  ## Are there any user-facing changes? No

## Which issue does this PR close?  - Closes apache#19671. ## Rationale for this change  ## What changes are included in this PR?  ## Are these changes tested?  ## Are there any user-facing changes?   --------- Signed-off-by: StandingMan <jmtangcs@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>

## Which issue does this PR close?  - Closes #. ## Rationale for this change  Running ` ./bench.sh run tpcds` with a freshly created `./bench.sh data tpcds` fails with the following error: ``` Please prepare TPC-DS data first by following instructions: ./bench.sh data tpcds ``` This PR fixes it ## What changes are included in this PR?  Fixes the `TPCDS_DIR` variable in `run_tpcds` ## Are these changes tested?  just benchmark scripts ## Are there any user-facing changes?   no need

…pache#19731) ## Which issue does this PR close?  - Closes #. ## Rationale for this change  Add a simplifier rule to remove a comparison. It's probably unlikely that a human would write such an expression, but a generated query, or other optimizations, may lead to such an expression. ## What changes are included in this PR?  Simplify `expr = L1 AND expr != L2` to `expr = L1` when `L1 != L2` ## Are these changes tested?  Added unit tests and SLT. The first commit shows the plan before the optimization. ## Are there any user-facing changes? Fewer runtime-ops if the plan contains the pattern.

## Which issue does this PR close?  - Closes apache#18932 ## Rationale for this change - Quoted CTE references (e.g., `WITH barbaz AS (...) SELECT * FROM "barbaz"`) were being collected as table refs and could fall through to catalog lookup instead of resolving to the CTE, especially with custom providers.  ## What changes are included in this PR? - Normalize CTE and table references consistently during collection (preserve quote style; only lowercase unquoted idents when enabled). - Avoid double-normalization when converting to `TableReference` in the resolve path.  ## Are these changes tested? Yes  ## Are there any user-facing changes? - Behavior fix for quoted CTE references; no API changes.   --------- Co-authored-by: jonahgao <jonahgao@msn.com>

## Which issue does this PR close?  - Closes apache#10583. ## Rationale for this change  ## What changes are included in this PR?  This patch implements null-aware anti join support for HashJoin LeftAnti operations, enabling correct SQL NOT IN subquery semantics with NULL values. ## Are these changes tested?  ## Are there any user-facing changes?   --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

## Which issue does this PR close?  - Closes apache#19697 ## Rationale for this change  ## What changes are included in this PR?  ## Are these changes tested?  ## Are there any user-facing changes?

## Which issue does this PR close?  - part of apache/datafusion-comet#2986 ## Rationale for this change  The current `hex` function implementation uses `format!` macro and `StringArray::from` iterator pattern, which causes: 1. Per-element String allocations: Each value allocates a new `String` via `format!` 2. Unnecessary conversions: Multiple intermediate conversions between types 3. Inefficient dictionary type handling: Collects all values into vectors before building the result ## What changes are included in this PR?  This PR optimizes the `hex` encoding by: - Replacing `format!("{num:X}")` with a fast lookup table approach - Building results directly using `StringBuilder` - Reusing a single vec buffer per iteration to avoid re-allocation - Optimizing dictionary array handling by building results iteratively ### Benchmark Results | Group | Size | Before | After | Speedup | | :--- | :----: | :--------: | :-------: | -------: | | hex_binary | 1024 | 89.9 µs | 51.9 µs | 1.73x | | hex_binary | 4096 | 385.2 µs | 218.7 µs | 1.76x | | hex_binary | 8192 | 741.6 µs | 451.6 µs | 1.64x | | hex_int64 | 1024 | 32.0 µs | 12.4 µs | 2.57x | | hex_int64 | 4096 | 132.4 µs | 59.7 µs | 2.22x | | hex_int64 | 8192 | 258.5 µs | 120.6 µs | 2.14x | | hex_int64_dict | 1024 | 75.2 µs | 12.4 µs | 6.04x | | hex_int64_dict | 4096 | 313.2 µs | 60.5 µs | 5.18x | | hex_int64_dict | 8192 | 614.7 µs | 129.0 µs | 4.76x | | hex_utf8 | 1024 | 88.5 µs | 53.5 µs | 1.66x | | hex_utf8 | 4096 | 357.6 µs | 211.1 µs | 1.69x | | hex_utf8 | 8192 | 698.7 µs | 424.8 µs | 1.64x | ## Are these changes tested?  Yes. Existing units and sqllogictest tests pass. New benchmarks added. ## Are there any user-facing changes? No.

## Which issue does this PR close?  Part of apache#19227 ## Rationale for this change  See issue for details. The existing script `./dev/rust_lint.sh` do checks for all non-functional tests include formater/clippy checks. Some check tools support auto fix options, so this PR add an option to the lint scripts to perform auto-fixes. Now `./dev/rust_lint.sh --write --allow-dirty` can perform auto-fixes for all linter etc. violations ``` yongting@Yongtings-MacBook-Pro-2 ~/C/datafusion (auto-fix)> ./dev/rust_lint.sh --help Usage: ./dev/rust_lint.sh [--write] [--allow-dirty] Runs the local Rust lint suite similar to CI. --write Run formatters, clippy and other non-functional checks in best-effort write/fix mode (requires a clean git worktree, no uncommitted changes; some checks are test-only and ignore this flag). --allow-dirty Allow `--write` to run even when the git worktree has uncommitted changes. ``` ## What changes are included in this PR?  - Adds `[--write] [--allow-dirty]` flag to `rust_lint.sh` to support auto fixes - `rust_lint.sh` consists of several sub-scripts like `rust_fmt.sh`, they're all extended with auto-fix feature through `--write` flag, and the `rust_lint.sh` is optionally calling them with an additional flag for auto fixes. - Clean up `rust_lint.sh` ## Are these changes tested?  Yes, commit 8c99417 intentionally introduced one violation for each available lint check, and the auto-fix command is able to fix all of them. The test may not be comprehensive, but it provides a reasonable starting point. We can begin using this script now and iterate on it if we discover cases where the auto-fix does not behave correctly. ## Are there any user-facing changes? No

…he#19280) ## Which issue does this PR close?  - Closes #. ## Rationale for this change  A dependency graph for workspace member crates are often needed when doing refactors, I want it to be included in the doc, and have a script to update it automatically. Here is the preview: <img width="1203" height="951" alt="image" src="https://github.com/user-attachments/assets/527c18fc-258e-465f-a150-f2aafe3e6db9" /> ## What changes are included in this PR?  - adds a script to generate the dependency graph `deps.svg`, and verify if the existing one is up to date. - adds a documentation page in `Contributor Guide` to show this graph - adds a CI job to check if the generated dependency graph is up to date with the code. ## Are these changes tested?  I tested the dependency graph display locally, see above. Is it possible to see the preview from this PR's change online? I also included a dummy crate in the initial commit, to test if the CI can catch it and throw understandable error message. ## Are there any user-facing changes? No   --------- Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com> Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>

…uct `PushdownChecker` (apache#19678) ## Which issue does this PR close?  - Closes apache#19673. ## Rationale for this change  ## What changes are included in this PR? - Changed [row_filter.rs](https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/row_filter.rs) to use Vec instead of the BTreeSet  ## Are these changes tested? Yes  ## Are there any user-facing changes?

## Which issue does this PR close?  - Closes apache#19082. ## Rationale for this change  Support Set Comparison Subquery (or Quantified Comparison Subquery in some systems), it looks like ```sql ... WHERE t1.a > ANY(SELECT ... FROM t2) ... ``` or ```sql ... WHERE t1.a < ALL(SELECT ... FROM t2) ... ``` ## What changes are included in this PR?  - New `Expr` branch `Expr::SetComparison` and corresponding `SetComparison` struct - New optimizer rule `RewriteSetComparison` to translate this kind of subquery using existing components - Corresponding substrait and proto changes `RewriteSetComparison` is a very naive implementation. We can optimize it in various ways in the follow-up PRs, like using min/max for non-equal comparison operators etc. ## Are these changes tested?  Yes, by unit test and slt ## Are there any user-facing changes?  To SQL user, this is a new feature. To API user, `ExprPlanner::plan_any` is removed (not needed)  --------- Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

## Which issue does this PR close?  - Closes apache#19820. ## Rationale for this change `isoyear` part is available in both PG and Spark EXTRACT functions. https://www.postgresql.org/docs/current/functions-datetime.html#:~:text=the%20week%20numbering.-,isoyear,-The%20ISO%208601 https://github.com/apache/spark/blob/a03bedb6c1281c5263a42bfd20608d2ee005ab05/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L3360 ## What changes are included in this PR? Support for part `isoyear` in date_part function. ## Are these changes tested? yes in SLT ## Are there any user-facing changes? yes

…ultiple runs (apache#19757) ## Which issue does this PR close?  - Closes apache#19756. ## Rationale for this change  The previous logic of `EnsureCooperative` optimizer lacked context awareness regarding ancestor nodes, making it not idempotent across multiple runs. Specifically, we need to ensure that: 1. **Idempotency**: Running the rule multiple times does not produce nested `CooperativeExec` wrappers. 2. **Context Awareness**: If a subtree is already protected by a `CooperativeExec`, we should skip and not double-wrap its children. ## What changes are included in this PR? To solve this, we cannot rely solely on `transform_up` (which lacks parent context) or `transform_down` (which makes safe mutation difficult). This PR adopts `transform_down_up` with a depth counter to strictly enforce that nodes are only wrapped when they are not currently under a `CooperativeExec` scope.  ## Are these changes tested? More unit tests coverage  ## Are there any user-facing changes? No

…e#19683) ## Which issue does this PR close?  - Closes apache#19682 ## Rationale for this change  Explained in issue. ## What changes are included in this PR?  Change array_except, array_intersect and array_union UDFs to return null if either input is null. ## Are these changes tested?  Added & fixed tests. ## Are there any user-facing changes?  Behaviour change to a function output.

## Which issue does this PR close?  - Part of apache/datafusion-comet#2986. ## Rationale for this change The current `trunc` implementation always converts scalar inputs to arrays via `make_scalar_function`, which introduces unnecessary overhead when processing single values.  ## What changes are included in this PR? - Add scalar fast path for `trunc` function to process Float32/Float64 scalar inputs directly - Handle optional precision argument for scalar inputs - Add scalar benchmarks to measure performance  ## Are these changes tested? Yes all sqllogictest pass ## Benchmark Results | Type | Before | After | Speedup | |------|--------|-------|---------| | f64 scalar | 256 ns | 55 ns | **4.6x** | | f32 scalar | 247 ns | 56 ns | **4.4x** |  ## Are there any user-facing changes? No   --------- Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>

…ansion (apache#19832) ## Which issue does this PR close?  Follow up to apache#19738 ## Rationale for this change  The current hex implementation expands `DictionaryArray` inputs into a regular array, which causes loss of dictionary encoding and redundant hex computation for repeated values. ## What changes are included in this PR? - Apply hex encoding only to dictionary values - Avoid expanding dictionary arrays during execution ### Benchmark | Size | Before | After | Speedup | | ---- | ------ | ----- | ------- | | 1024 | 8.3 µs | 7.2 µs | 1.15× | | 4096 | 42.9 µs | 34.5 µs | 1.24× | | 8192 | 91.6 µs | 71.7 µs | 1.28× |  ## Are these changes tested? Yes. Existing unit tests and `sqllogictest` tests pass.  ## Are there any user-facing changes? No.

## Which issue does this PR close?  - Closes apache#19822 . - Part of apache#15914 ## Rationale for this change The current date_part function in datafusion have a few differences with the spark implementation: - day of week parts are 1 indexed in spark but 0 indexed in datafusion - spark supports a few more aliases for certain parts Full list of spark supported aliases: https://github.com/apache/spark/blob/a03bedb6c1281c5263a42bfd20608d2ee005ab05/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L3356-L3371 ## What changes are included in this PR? New date_part function in spark crate. ## Are these changes tested? Yes with SLT ## Are there any user-facing changes? yes

…9829) ## Which issue does this PR close?  - Closes apache#19828. - Part of apache#15914 ## Rationale for this change implement spark: - https://spark.apache.org/docs/latest/api/sql/index.html#trunc - https://spark.apache.org/docs/latest/api/sql/index.html#date_trunc - https://spark.apache.org/docs/latest/api/sql/index.html#time_trunc ## What changes are included in this PR? Add spark compatible wrappers around datafusion date_trunc function to handle spark specificities. ## Are these changes tested? Yes in SLT ## Are there any user-facing changes? Yes

## Which issue does this PR close?  - Closes apache#19844 - Part of apache#15914 ## Rationale for this change Add support for spark https://spark.apache.org/docs/latest/api/sql/index.html#date_diff function ## What changes are included in this PR?  ## Are these changes tested? yes in SLT ## Are there any user-facing changes? yes

## Which issue does this PR close?  - Part of apache/datafusion-comet#2986 ## Rationale for this change The round function currently converts scalar inputs to arrays before processing, even when both value and decimal_places are scalar values. This adds unnecessary overhead for constant folding scenarios like  ## What changes are included in this PR? - Add scalar fast path in RoundFunc::invoke_with_args for Float64 and Float32 inputs - Directly compute the result when both inputs are scalars, avoiding array conversion overhead - Add benchmark  ## Are these changes tested? Yes Type | Before | After | Speedup -- | -- | -- | -- round_f64_scalar | 570 ns | 195 ns | 2.9x round_f32_scalar | 564 ns | 192 ns | 2.9x  ## Are there any user-facing changes? No

- Merged CTE reference resolution tests into main cte.slt file - Added CREATE TABLE statement for orders table used in tests - Fixed 3 case-sensitivity issues: changed 'WHERE N < 10' to 'WHERE n < 10' Fixes apache#19786 ## Which issue does this PR close?  - Closes apache#19786 ## Rationale for this change  This PR consolidates the small test file `cte_quoted_reference.slt` into the main `cte.slt` file as requested in issue apache#19786. The separate file was small enough that it didn't need to be maintained independently, and consolidating it improves code organization and maintainability. ## What changes are included in this PR?  - Merged CTE reference resolution tests from `cte_quoted_reference.slt` into `cte.slt` - Added `CREATE TABLE orders AS VALUES (1), (2);` statement to support the merged tests - Fixed 3 pre-existing case-sensitivity bugs by changing `WHERE N < 10` to `WHERE n < 10` in recursive CTE tests - Deleted `cte_quoted_reference.slt` as it's now consolidated into the main file ## Are these changes tested?  Yes, all changes are tested: - The consolidated tests from `cte_quoted_reference.slt` are now part of the `cte.slt` test suite - All sqllogictest tests pass successfully: `cargo test --test sqllogictests -- cte` - The case-sensitivity fixes ensure that field references match their definitions (lowercase `n`) ## Are there any user-facing changes?  No, there are no user-facing changes. This is purely a test file reorganization that improves code maintainability without affecting DataFusion's functionality or API.  Co-authored-by: Anjali Choudhary <anjalicy@amazon.com>

…e#19904) ## Which issue does this PR close?  - Part of apache#19798 ## Rationale for this change  This is an edge case, would prefer to fix upstream in arrow-rs instead of having handling code here, so just disable test for now. - arrow-rs issue: apache/arrow-rs#9227 ## What changes are included in this PR?  Disable edge-case array_union SLT ## Are these changes tested?  Test related change ## Are there any user-facing changes?  No.

## Which issue does this PR close?  - Part of apache#15914 - Closes apache#19710 ## Rationale for this change Implementation of spark `add_months` function. ## What changes are included in this PR?  ## Are these changes tested? yes ## Are there any user-facing changes? yes

## Which issue does this PR close?  - Part of apache/datafusion-comet#2986 ## Rationale for this change The signum function currently converts scalar inputs to arrays before processing, even for single scalar values. This adds unnecessary overhead from array allocation and conversion. Adding a scalar fast path avoids this overhead and improves performance for constant folding and scalar expression evaluation.  ## What changes are included in this PR? - Added scalar fast path for `float32` and `float64` | Type | Before | After | Speedup | |------|--------|-------|---------| | **signum_f64_scalar** | 266 ns | 54 ns | **4.9x** | | **signum_f32_scalar** | 263 ns | 55 ns | **4.8x** |  ## Are these changes tested? Yes  ## Are there any user-facing changes? No   --------- Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>

…ow frame queries (apache#19887) ## Which issue does this PR close?  - Closes apache#19612. ## Rationale for this change The `DistinctMedianAccumulator::evaluate()` method was using `std::mem::take()` which consumed the internal state, causing subsequent calls to return incorrect results. This was the last remaining item from apache#19612  ## What changes are included in this PR? 1. Changed `DistinctMedianAccumulator::evaluate()` to use `.iter()` instead of `std::mem::take()` to preserve internal state across multiple calls 2. Added sqllogictest case for distinct median with window frames  ## Are these changes tested? Yes, added a sqllogictest  ## Are there any user-facing changes? No

## Which issue does this PR close?  - Part of apache/datafusion-comet#2986. ## Rationale for this change The cot function currently converts scalar inputs to arrays before processing, even for single scalar values. This adds unnecessary overhead from array allocation and conversion. Adding a scalar fast path avoids this overhead.  ## What changes are included in this PR? - Added scalar fast path - Added benchmark - Update tests | Type | Before | After | Speedup | |------|--------|-------|---------| | **cot_f64_scalar** | 229 ns | 67 ns | **3.4x** | | **cot_f32_scalar** | 247 ns | 59 ns | **4.2x** |  ## Are these changes tested?  ## Are there any user-facing changes?

## Which issue does this PR close?  - Part of apache#18092 ## Rationale for this change  Main value add here is ensure UDAFs encode their actual accepted types in their signature instead of internally casting to the actual types they support from a wider signature. Also doing some driveby refactoring of removing manual Debug impls. ## What changes are included in this PR?  See rationale. ## Are these changes tested?  Existing tests. ## Are there any user-facing changes?  No.

## Which issue does this PR close?  - Part of apache/datafusion-comet#2986. ## Rationale for this change The `iszero` function currently converts scalar inputs to arrays before processing, even for single scalar values. This adds unnecessary overhead from array allocation and conversion. Adding a scalar fast path avoids this overhead and improves performance  ## What changes are included in this PR? - Added scalar fast path for `Float16`, `Float32`, and `Float64` scalar inputs - Updated tests | Type | Before | After | Speedup | |------|--------|-------|---------| | **iszero f32 scalar** | 244 ns | 62 ns | **3.9x** | | **iszero f64 scalar** | 245 ns | 62 ns | **4.0x** |  ## Are these changes tested? Yes  ## Are there any user-facing changes? No

…che#19819) ## Which issue does this PR close?  - Related to apache#19004 and apache#19809. ## What changes are included in this PR?  Change internal error to user-facing error when function type coercion fails. Add helpful hint when Binary types are used with string functions. Before: ``` Internal error: Expect TypeSignatureClass::Native(...) but received NativeType::Binary, DataType: Binary ``` After: ``` Error: Function 'split_part' requires String, but received Binary (DataType: Binary). Hint: Binary types are not automatically coerced to String. Use CAST(column AS VARCHAR) to convert Binary data to String. ```

## Which issue does this PR close?  N/A ## Rationale for this change  When hashing list arrays we hash all bytes of the child array, even if we slice to a certain range of values. Refactor to slice only the needed bytes; also do some general refactors. ## What changes are included in this PR?  Refactor list array hashing. ## Are these changes tested?  Added test. ## Are there any user-facing changes?  No.

…pache#19909) ## Which issue does this PR close?  Part of: apache#15914 ## Rationale for this change  Implement spark compatible unhex functions: https://spark.apache.org/docs/latest/api/sql/index.html#unhex ## What changes are included in this PR?  ## Are these changes tested? Yes. UTs and SLT added.  ## Are there any user-facing changes? No.

## Which issue does this PR close?  - Closes apache#19912 ## Rationale for this change Just a couple of optimizations for hash table lookups usage in hash aggregate. ## What changes are included in this PR?  ## Are these changes tested?  ## Are there any user-facing changes?

## Which issue does this PR close?  - Part of apache/datafusion-comet#2986. ## Rationale for this change The `ascii` function currently converts scalar inputs to arrays before processing via `make_scalar_function`. Adding a scalar fast path avoids this overhead and improves performance.  ## What changes are included in this PR? 1. Refactored `invoke_with_args` to use `match` statement for handling both scalar and array inputs 2. Added scalar fast path for `Utf8`, `LargeUtf8`, and `Utf8View` scalar inputs | Type | Before | After | Speedup | |------|--------|-------|---------| | **ascii/scalar_utf8** | 234 ns | 56 ns | **4.2x** | | **ascii/scalar_utf8view** | 206 ns | 57 ns | **3.6x** |  ## Are these changes tested? Yes  ## Are there any user-facing changes? No

## Which issue does this PR close?  - Part of apache/datafusion-comet#2986. ## Rationale for this change The `factorial` function currently converts scalar inputs to arrays before processing. Adding a scalar fast path avoids this overhead and improves performance.  ## What changes are included in this PR? 1. Refactored `invoke_with_args` to use `match` statement for handling both scalar and array inputs 2. Inlined array processing logic, removing `make_scalar_function` usage. | Type | Before | After | Speedup | |------|--------|-------|---------| | **factorial_scalar** | 244 ns | 102 ns | **2.4x** |  ## Are these changes tested? Yes, covered by existing SLT tests: - `scalar.slt` lines 461-477 - `math.slt` line 797  ## Are there any user-facing changes?

## Which issue does this PR close?  - Closes apache#19947 ## Rationale for this change  apache#17657 introduced a regression since it cloned the inner field in the execution path but in the `return_type` function it still set nullability to true. Fix to ensure we maintain the field of the inner field as is. ## What changes are included in this PR?  Change `return_type` to just pass through the input datatype as is. Also refactor away usage of a null buffer builder in favour of copying the input array null buffer. ## Are these changes tested?  Added tests. ## Are there any user-facing changes?  No.

## Which issue does this PR close?  - Closes #. ## Rationale for this change  Prerequisite for the following PRs: - apache#19760 - apache#19761 Even if the api on the `MemoryPool` does not require `&mut self` for growing/shrinking the reserved size, the api in `MemoryReservation` does, making simple implementations irrepresentable without synchronization primitives. For example, the following would require a `Mutex` for concurrent access to the `MemoryReservation` in different threads, even though the `MemoryPool` doesn't: ```rust let mut stream: SendableRecordBatchStream = SendableRecordBatchStream::new(); let mem: Arc<MemoryReservation> = Arc::new(MemoryReservation::new_empty()); let mut builder = ReceiverStreamBuilder::new(10); let tx = builder.tx(); { let mem = mem.clone(); builder.spawn(async move { while let Some(msg) = stream.next().await { mem.try_grow(msg.unwrap().get_array_memory_size()); // ❌ `mem` is not mutable tx.send(msg).unwrap(); } }); } builder .build() .inspect_ok(|msg| mem.shrink(msg.get_array_memory_size())); // ❌ `mem` is not mutable ``` ## What changes are included in this PR?  Make the methods in `MemoryReservation` require `&self` instead of `&mut self` for allowing concurrent shrink/grows from different tasks for the same reservation. ## Are these changes tested?  yes, by current tests ## Are there any user-facing changes? Users can now safely call methods of `MemoryReservation` from different tasks without synchronization primitives. This is a backwards compatible API change, as it will work out of the box for current users, however, depending on their clippy configuration, they might see some new warnings about "unused muts" in their codebase.

## Which issue does this PR close?  - Closes apache#20005 PR: ``` SELECT COUNT(*) FROM hits WHERE "AdvEngineID" <> 0; Query 1 iteration 0 took 30.5 ms and returned 1 rows Query 1 avg time: 30.48 ms ``` Main: ``` SELECT COUNT(*) FROM hits WHERE "AdvEngineID" <> 0; Query 1 iteration 0 took 39.6 ms and returned 1 rows Query 1 avg time: 39.61 ms ``` ## Rationale for this change Improving cold starts. ## What changes are included in this PR? ## Are these changes tested? Existing tests ## Are there any user-facing changes? No --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

## Which issue does this PR close?  - Closes apache#19967 - Part of apache#15914 ## Rationale for this change Add spark compatible base64/unbase64 functions ## What changes are included in this PR? - new encoding mode in DF encoding UDF for padded base64 - spark udfs for base64/unbase64 ## Are these changes tested? yes in SLT ## Are there any user-facing changes? yes

…ray_remove_all` functions (apache#19996) ## Which issue does this PR close?  - Part of apache/datafusion-comet#2986 ## Rationale for this change  The current implementation of `general_remove` is based on `filter + concat`, which creates intermediate arrays for each list row and can be relatively expensive. This PR introduces an alternative implementation based on `MutableArrayData`, which copies contiguous ranges from the original values buffer directly into the output array. The new approach is semantically equivalent to the existing implementation but reduces intermediate allocations and per-element overhead. ## What changes are included in this PR?  - Replaced `general_remove`'s filter-based implementation with `MutableArrayData` for more efficient memory usage. - Optimized the removal process by adding fast paths for rows where no matching elements need removal. ### Benchmark ``` group after before ----- ----- ------ array_remove_binary/remove/10 1.00 4.6±0.14ms ? ?/sec 2.41 11.0±0.34ms ? ?/sec array_remove_binary/remove/100 1.00 8.5±0.19ms ? ?/sec 1.95 16.6±0.42ms ? ?/sec array_remove_binary/remove/500 1.00 35.9±0.78ms ? ?/sec 1.43 51.4±1.10ms ? ?/sec array_remove_boolean/remove/10 1.00 3.7±0.05ms ? ?/sec 3.23 11.8±0.30ms ? ?/sec array_remove_boolean/remove/100 1.00 8.1±0.15ms ? ?/sec 2.18 17.6±0.35ms ? ?/sec array_remove_boolean/remove/500 1.00 26.6±0.43ms ? ?/sec 1.52 40.3±0.81ms ? ?/sec array_remove_decimal64/remove/10 1.00 3.9±0.07ms ? ?/sec 2.41 9.4±0.18ms ? ?/sec array_remove_decimal64/remove/100 1.00 6.7±0.19ms ? ?/sec 2.12 14.2±0.34ms ? ?/sec array_remove_decimal64/remove/500 1.00 40.3±0.75ms ? ?/sec 1.52 61.1±1.46ms ? ?/sec array_remove_f64/remove/10 1.00 3.8±0.10ms ? ?/sec 1.32 5.0±0.16ms ? ?/sec array_remove_f64/remove/100 1.00 4.8±0.34ms ? ?/sec 1.24 5.9±0.18ms ? ?/sec array_remove_f64/remove/500 1.00 22.3±0.68ms ? ?/sec 1.15 25.5±0.86ms ? ?/sec array_remove_fixed_size_binary/remove/10 1.00 4.7±0.09ms ? ?/sec 1.52 7.2±0.26ms ? ?/sec array_remove_fixed_size_binary/remove/100 1.00 8.0±0.32ms ? ?/sec 1.40 11.1±0.40ms ? ?/sec array_remove_fixed_size_binary/remove/500 1.00 45.4±0.97ms ? ?/sec 1.16 52.6±1.43ms ? ?/sec array_remove_int64/remove/10 1.00 3.9±0.11ms ? ?/sec 2.24 8.8±0.24ms ? ?/sec array_remove_int64/remove/100 1.00 5.5±0.18ms ? ?/sec 2.32 12.8±0.44ms ? ?/sec array_remove_int64/remove/500 1.00 25.5±1.06ms ? ?/sec 1.61 40.9±1.25ms ? ?/sec array_remove_strings/remove/10 1.00 4.5±0.10ms ? ?/sec 2.41 10.9±0.28ms ? ?/sec array_remove_strings/remove/100 1.00 8.5±0.37ms ? ?/sec 2.00 17.0±0.71ms ? ?/sec array_remove_strings/remove/500 1.00 35.9±0.84ms ? ?/sec 1.48 53.1±1.91ms ? ?/sec ``` ## Are these changes tested? Yes. Existing SLT for `array` continue to pass without modification. Benchmarks were added.  ## Are there any user-facing changes? No.   --------- Co-authored-by: Oleks V <comphead@users.noreply.github.com>

## Which issue does this PR close?  - Part of apache/datafusion-comet#2986. ## Rationale for this change The `repeat` function currently converts scalar inputs to arrays before processing via `make_scalar_function`. Adding a scalar fast path avoids this overhead and improves performance for constant folding and scalar expression evaluation.  ## What changes are included in this PR? 1. Refactored `invoke_with_args` to handle scalar inputs directly without array conversion 2. Added scalar fast path for `Utf8`, `LargeUtf8`, and `Utf8View` scalar inputs 3. Added array fast path that skips per-element null checks when `null_count() == 0` | Type | Before | After | Speedup | |------|--------|-------|---------| | **repeat/scalar_utf8** | 519 ns | 91 ns | **5.7x** | | **repeat/scalar_utf8view** | 437 ns | 91 ns | **4.8x** |  ## Are these changes tested? Yes  ## Are there any user-facing changes? No

## Which issue does this PR close?  - This PR is part of the [Utf8View support](apache#10918) epic. It adds `Utf8View` support in the Spark-compat layer. ## Rationale for this change In our internal project we're only suppporting `Utf8View` _(because of design constraints)_ and the current implementation of `SparkConcat` only supports `Utf8`. The `SparkConcat` function should accept `Utf8View` and mixed string types in line with the main DataFusion concat. This PR adds that support and follows the same patterns as [DataFusion’s concat](https://github.com/apache/datafusion/blob/main/datafusion/functions/src/string/concat.rs). Prevents errors like : > The type of Utf8 AND Utf8View of like physical should be same. > This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues from a query like:- ```sql select i_item_sk, item_info from (select i_item_sk, CONCAT('Item: ', i_item_desc) as item_info from item) sub where item_info LIKE 'Item: Electronic%' order by 1; ``` ## What changes are included in this PR? - Extend the type signature to accept `Utf8View` in addition to `Utf8` and `LargeUtf8` via `TypeSignature::Variadic(vec![Utf8View, Utf8, LargeUtf8])` matching DataFusion’s concat. - In `return_field_from_args`, compute the result type with precedence Utf8View > LargeUtf8 > Utf8. In spark_concat, handle Utf8View and LargeUtf8 in scalar paths (zero-argument and all-NULL). ## Are these changes tested? Yes. - Unit tests: `cargo test --package datafusion-spark function::string::concat::tests`, including `test_concat_utf8view`. - Sqllogictest: `spark/string/concat.slt` includes a “**Utf8View: no extra CAST in plan**” case that uses EXPLAIN and a temporary table to ensure no extra CASTs when using arrow_cast(..., 'Utf8View') with table columns. ## Are there any user-facing changes? - **API:** SparkConcat’s signature is extended to include Utf8View in the variadic list. No breaking changes. _used gpt to rephrase some of these points_

fix: Allow named arguments to skip optional parameter positions

bd25ad6

Jeadie approved these changes Jan 20, 2026

View reviewed changes

sgrebnov merged commit 6f22aae into spiceai-51 Jan 20, 2026

sgrebnov deleted the sgrebnov/0120-df51-argument-fix branch January 20, 2026 13:51

sgrebnov self-assigned this Jan 21, 2026

lukekim pushed a commit that referenced this pull request Feb 13, 2026

fix: Allow named arguments to skip optional parameter positions (#123)

3fe7811

lukekim pushed a commit that referenced this pull request Feb 19, 2026

fix: Allow named arguments to skip optional parameter positions (#123)

be0114b

lukekim pushed a commit that referenced this pull request Feb 26, 2026

fix: Allow named arguments to skip optional parameter positions (#123)

c0da56a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Allow named arguments to skip optional parameter positions#123

fix: Allow named arguments to skip optional parameter positions#123
sgrebnov merged 1 commit intospiceai-51from
sgrebnov/0120-df51-argument-fix

sgrebnov commented Jan 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sgrebnov commented Jan 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Problem

Important:

PostgreSQL Reference

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sgrebnov commented Jan 20, 2026 •

edited

Loading