Skip to content

fix: Allow named arguments to skip optional parameter positions#123

Merged
sgrebnov merged 1 commit intospiceai-51from
sgrebnov/0120-df51-argument-fix
Jan 20, 2026
Merged

fix: Allow named arguments to skip optional parameter positions#123
sgrebnov merged 1 commit intospiceai-51from
sgrebnov/0120-df51-argument-fix

Conversation

@sgrebnov
Copy link
Copy Markdown

@sgrebnov sgrebnov commented Jan 20, 2026

Which issue does this PR close?

Related to this PR:

Fixes named argument resolution to follow PostgreSQL semantics where named arguments can target any parameter position without requiring intermediate optional positions to be filled.

Problem

Before this fix, resolve_function_arguments required positions 0..N to be filled when N arguments were provided. This incorrectly rejected valid PostgreSQL-style calls:

-- Signature: func(a, b DEFAULT 10, c DEFAULT 20, d DEFAULT 30)
SELECT func(1, d => 40);  -- Valid in PostgreSQL, was rejected

Motivation for this in existing DF logic is that DF does not provide default mechanism to attach args to specific supported arguments so they are provided consequently (first N args). Spice implements own mechanism for this (extra metadata) and can benefit from full Postgres dialect.

PR updates ordering logic to only validate that positional arguments fill consecutive slots from position 0. Named arguments can target any parameter position. This does NOT break existing functionality (no changes if arguments are specified in more conservative way) but also unblock Spice scenario:

// Before: for i in 0..args.len()        -- checked positions 0,1,2 for 3 args
// After:  for i in 0..positional_count  -- only checks positional arg positions

Example

vector_search(foo, 'bar', rank_weight => 100)
-- params: [tbl, query, column, limit, include_score, rank_weight]
-- 2 positional fill positions 0,1 ✓
-- Named 'rank_weight' fills position 5 ✓
-- Positions 2,3,4 skipped (optional) ✓

Important:

DataFusion only validates argument names and reorders to signature order. Functions are responsible for their own required/optional argument validation and must use metadata or other mechanisms to identify named arguments when positions are skipped.

PostgreSQL Reference

From PostgreSQL Docs - Calling Functions:

"In either notation, parameters that have default values given in the function
declaration need not be written in the call at all. But this is particularly
useful in named notation, since any combination of parameters can be omitted."

@sgrebnov sgrebnov merged commit 6f22aae into spiceai-51 Jan 20, 2026
@sgrebnov sgrebnov deleted the sgrebnov/0120-df51-argument-fix branch January 20, 2026 13:51
@sgrebnov sgrebnov self-assigned this Jan 21, 2026
lukekim pushed a commit that referenced this pull request Feb 12, 2026
…pache#19659) (apache#19705)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- part of apache#18566

## Rationale for this change

I propose back porting the fix for
apache#19641 to 52 release

## What changes are included in this PR?

- Backport apache#19659

## Are these changes tested?
eYes

## Are there any user-facing changes?

bug fix

Co-authored-by: Kumar Ujjawal <ujjawalpathak6@gmail.com>
lukekim pushed a commit that referenced this pull request Feb 12, 2026
…he#19661)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache#18566

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
lukekim pushed a commit that referenced this pull request Feb 19, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- part of apache#19784.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

This is a follow-up of apache#19573

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Backport 
- apache#19804

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Existing tests

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->
no
<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
lukekim pushed a commit that referenced this pull request Mar 3, 2026
…ache#20289)

## Which issue does this PR close?

- related to apache#20287

## Rationale for this change

see issue apache#20109

## What changes are included in this PR?

1. Remap parent filter expressions: When a FilterExec has a projection,
remap unsupported parent filter expressions from output schema
coordinates to input schema coordinates using `reassign_expr_columns()`
before combining them with the current filter's predicates.

2. Preserve projection: When creating the merged FilterExec, preserve
the original projection instead of discarding it .

## Are these changes tested?

yes, add some test case

## Are there any user-facing changes?

---------

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
lukekim pushed a commit that referenced this pull request Mar 3, 2026
- Closes apache#17508

The previous implementation used UUID-based aliasing as a workaround to
prevent duplicate names for literals in Substrait plans. This approach
had several drawbacks:
- Non-deterministic plan names that made testing difficult (requiring
UUID regex filters)
- Only addressed literal naming conflicts, not the broader issue of name
deduplication
- Added unnecessary dependency on the `uuid` crate
- Didn't properly handle cases where the same qualified name could
appear with different schema representations

  1. Enhanced NameTracker: Refactored to detect two types of conflicts:
- Duplicate schema names: Tracked via schema_name() to prevent
validate_unique_names failures (e.g., two Utf8(NULL) literals)
- Ambiguous references: Tracked via qualified_name() to prevent
DFSchema::check_names failures when a qualified field (e.g.,
left.Utf8(NULL)) and unqualified field (e.g., Utf8(NULL)) share the same
column name
2. **Removed UUID dependency**: Eliminated the `uuid` crate from
`datafusion/substrait`
3. **Removed literal-specific aliasing**: The UUID-based workaround in
`project_rel.rs` is no longer needed as the improved NameTracker handles
all naming conflicts consistently
4. **Deterministic naming**: Name conflicts now use predictable
`__temp__N` suffixes instead of random UUIDs

Note: This doesn't fully fix all the issues in apache#17508 which allow some
special casing of `CAST` which are not included here.

Yes:
- Updated snapshot tests to reflect the new deterministic naming (e.g.,
`Utf8("people")__temp__0` instead of UUID-based names)
- Modified some roundtrip tests to verify semantic equivalence (schema
matching and execution) rather than exact string matching, which is more
robust
- All existing integration tests pass with the new naming scheme

Minimal. The generated plan names are now deterministic and more
readable (using `__temp__N` suffixes instead of UUIDs), but this is
primarily an internal representation change. The functional behavior and
query results remain unchanged.

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

Co-authored-by: Xander <zander181@googlemail.com>
krinart pushed a commit that referenced this pull request Mar 3, 2026
…#19714)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache#7013 

## Rationale for this change

I was speaking with someone who wanted to get more involved with
DataFusion recently, and it was not clear from the website that discord
is far more active than Slack. I think it would help new new users to
make this clearer in the documentation to lower the barrier to joining
the communitu


## What changes are included in this PR?


Improve the communication documentation here:
https://datafusion.apache.org/contributor-guide/communication.html and
mention that discord is more active than slack

<img width="1191" height="856" alt="Screenshot 2026-01-09 at 8 54 39 AM"
src="https://github.com/user-attachments/assets/9bbef5d4-645d-45f0-85aa-6b12a2a7eea5"
/>


While I was reviewing the communication documentation, I noticed some
other minor issues that could be improved which I will call out inline

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- N/A

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Occasionally we do PRs to bump a bunch of versions in `Cargo.lock`,
recent example:

- apache#19667

Ideally this should be handled by dependabot already. I suspect the
default limit of 5 is preventing dependabot from creating all the PRs it
needs; this causes it to error and those version bumps are "lost".

- I say "lost" but dependabot can pick it up on the next run... if it
doesn't error again

See an example from a dependabot run:

<img width="703" height="118" alt="image"
src="https://github.com/user-attachments/assets/f8ae6b70-6fe1-4613-8c60-28564c23188b"
/>

From the logs:

```text
 +-------------------------------------------------------+
|          Changes to Dependabot Pull Requests          |
+---------+---------------------------------------------+
| created | insta ( from 1.45.0 to 1.46.0 )             |
| created | tracing ( from 0.1.43 to 0.1.44 )           |
| created | syn ( from 2.0.111 to 2.0.113 )             |
| created | async-compression ( from 0.4.35 to 0.4.36 ) |
| created | object_store ( from 0.12.4 to 0.13.0 )      |
| created | serde_json ( from 1.0.145 to 1.0.148 )      |
| created | bigdecimal ( from 0.4.9 to 0.4.10 )         |
| created | clap ( from 4.5.53 to 4.5.54 )              |
| created | libc ( from 0.2.177 to 0.2.179 )            |
| created | tokio ( from 1.48.0 to 1.49.0 )             |
| created | tokio-util ( from 0.7.17 to 0.7.18 )        |
| created | sqllogictest ( from 0.28.4 to 0.29.0 )      |
+---------+---------------------------------------------+
```

We expect to have all these PRs created (this was for the run 5 days
ago), but only these were created on that day:

- apache#19645
- apache#19644
- apache#19643

And considering these PRs were still open at the time:

- apache#19544
- apache#19325

We can see it hit the 5 limit.

> Dependabot default behavior:
>
> - If five pull requests with version updates are open, no further pull
requests are raised until some of those open requests are merged or
closed.
> - Security updates have a separate, internal limit of ten open pull
requests which cannot be changed.

-
https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference#open-pull-requests-limit-

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Bump the limit to 15. We might have an appetite for increasing it more,
15 was chosen arbitrarily.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Dependabot only.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

No.

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19720.

## Rationale for this change

- Choosing LZO compression errors, I think it might never get supported
so the best option moving forward is to remove it algother and update
the docs.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

- Removed LZO from parse_compression_string() function
- Removed docs
- Updated exptected test output

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

User choosing LZO as compression will get a clear error message:

```
Unknown or unsupported parquet compression: lzo. Valid values are: uncompressed, snappy, gzip(level), brotli(level), lz4, zstd(level), and lz4_raw.
```

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- part of 18566 

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19715.

## Rationale for this change

The:is_used() API incorrectly returned false for custom `DataSource`
implementations that didn't call reassign_expr_columns() ->
with_new_children() . This caused `HashJoinExec` to skip computing
dynamic filters even when they were actually being used.

## What changes are included in this PR?


Updated is_used() to check both outer and inner Arc counts

## Are these changes tested?

Functionality is covered by existing test
`test_hashjoin_dynamic_filter_pushdown_is_used`. I was not sure if to
add a repro since it would require adding a custom `DataSource`, the
current tests in
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs use
`FileScanConfig`

## Are there any user-facing changes?

no
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache/datafusion-comet#2986

## Rationale for this change


- The current floor and ceil implementations always convert scalar
inputs to arrays via `values_to_arrays()`, which introduces unnecessary
overhead when processing single values.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

- Add scalar fast path for floor and ceil
- Add criterion benchmark for floor/ceil to measure performance

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?
 
- All sqllogictest pass
- Addec benchmark

| Benchmark           | Array Time | Scalar Time | Speedup |
|---------------------|------------|-------------|---------|
| floor_f64 (1024)    | ~396 ns    | ~147 ns     | 2.7×    |
| ceil_f64 (1024)     | ~363 ns    | ~150 ns     | 2.4×    |
| floor_f64 (4096)    | ~681 ns    | ~144 ns     | 4.7×    |
| ceil_f64 (4096)     | ~667 ns    | ~144 ns     | 4.6×    |
| floor_f64 (8192)    | ~1,638 ns  | ~144 ns     | 11.4×   |
| ceil_f64 (8192)     | ~1,634 ns  | ~144 ns     | 11.3×   |


<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

No

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19671.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Signed-off-by: StandingMan <jmtangcs@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Running ` ./bench.sh run tpcds` with a freshly created `./bench.sh data
tpcds` fails with the following error:

```
Please prepare TPC-DS data first by following instructions:
  ./bench.sh data tpcds
```

This PR fixes it

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Fixes the `TPCDS_DIR` variable in `run_tpcds`

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

just benchmark scripts

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

no need
krinart pushed a commit that referenced this pull request Mar 3, 2026
…pache#19731)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Add a simplifier rule to remove a comparison.
It's probably unlikely that a human would write such an expression, but
a generated query, or other optimizations, may lead to such an
expression.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Simplify `expr = L1 AND expr != L2` to `expr = L1` when `L1 != L2`

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Added unit tests and SLT.
The first commit shows the plan before the optimization.

## Are there any user-facing changes?

Fewer runtime-ops if the plan contains the pattern.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#18932 

## Rationale for this change
- Quoted CTE references (e.g., `WITH barbaz AS (...) SELECT * FROM
"barbaz"`) were being collected as table refs and could fall through to
catalog lookup instead of resolving to the CTE, especially with custom
providers.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?
- Normalize CTE and table references consistently during collection
(preserve quote style; only lowercase unquoted idents when enabled).
- Avoid double-normalization when converting to `TableReference` in the
resolve path.

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?
Yes
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?
- Behavior fix for quoted CTE references; no API changes.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: jonahgao <jonahgao@msn.com>
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#10583.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

This patch implements null-aware anti join support for HashJoin LeftAnti
operations, enabling correct SQL NOT IN subquery semantics with NULL
values.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19697 

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- part of apache/datafusion-comet#2986

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

The current `hex` function implementation uses `format!` macro and
`StringArray::from` iterator pattern, which causes:
1. Per-element String allocations: Each value allocates a new `String`
via `format!`
2. Unnecessary conversions: Multiple intermediate conversions between
types
3. Inefficient dictionary type handling: Collects all values into
vectors before building the result

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

This PR optimizes the `hex` encoding by:
- Replacing `format!("{num:X}")` with a fast lookup table approach
- Building results directly using `StringBuilder`
- Reusing a single vec buffer per iteration to avoid re-allocation
- Optimizing dictionary array handling by building results iteratively

### Benchmark Results
| Group               | Size  | Before    | After     | Speedup |
| :---                | :----: | :--------: | :-------: | -------: |
| hex_binary           | 1024   | 89.9 µs   | 51.9 µs  | 1.73x |
| hex_binary           | 4096   | 385.2 µs  | 218.7 µs | 1.76x |
| hex_binary           | 8192   | 741.6 µs  | 451.6 µs | 1.64x |
| hex_int64            | 1024   | 32.0 µs   | 12.4 µs  | 2.57x |
| hex_int64            | 4096   | 132.4 µs  | 59.7 µs  | 2.22x |
| hex_int64            | 8192   | 258.5 µs  | 120.6 µs | 2.14x |
| hex_int64_dict       | 1024   | 75.2 µs   | 12.4 µs  | 6.04x |
| hex_int64_dict       | 4096   | 313.2 µs  | 60.5 µs  | 5.18x |
| hex_int64_dict       | 8192   | 614.7 µs  | 129.0 µs | 4.76x |
| hex_utf8             | 1024   | 88.5 µs   | 53.5 µs  | 1.66x |
| hex_utf8             | 4096   | 357.6 µs  | 211.1 µs | 1.69x |
| hex_utf8             | 8192   | 698.7 µs  | 424.8 µs | 1.64x |

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Yes. Existing units and sqllogictest tests pass. New benchmarks added.

## Are there any user-facing changes?

No.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

Part of apache#19227

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->
See issue for details.

The existing script `./dev/rust_lint.sh` do checks for all
non-functional tests include formater/clippy checks.

Some check tools support auto fix options, so this PR add an option to
the lint scripts to perform auto-fixes.

Now `./dev/rust_lint.sh --write --allow-dirty` can perform auto-fixes
for all linter etc. violations

```
yongting@Yongtings-MacBook-Pro-2 ~/C/datafusion (auto-fix)> ./dev/rust_lint.sh --help
Usage: ./dev/rust_lint.sh [--write] [--allow-dirty]

Runs the local Rust lint suite similar to CI.
--write        Run formatters, clippy and other non-functional checks in best-effort write/fix mode (requires a clean git worktree, no uncommitted changes; some checks are test-only and ignore this flag).
--allow-dirty  Allow `--write` to run even when the git worktree has uncommitted changes.
```

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->
- Adds `[--write] [--allow-dirty]` flag to `rust_lint.sh` to support
auto fixes
- `rust_lint.sh` consists of several sub-scripts like `rust_fmt.sh`,
they're all extended with auto-fix feature through `--write` flag, and
the `rust_lint.sh` is optionally calling them with an additional flag
for auto fixes.
- Clean up `rust_lint.sh`

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Yes, commit 8c99417 intentionally
introduced one violation for each available lint check, and the auto-fix
command is able to fix all of them.

The test may not be comprehensive, but it provides a reasonable starting
point. We can begin using this script now and iterate on it if we
discover cases where the auto-fix does not behave correctly.

## Are there any user-facing changes?
No
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
…he#19280)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->
A dependency graph for workspace member crates are often needed when
doing refactors, I want it to be included in the doc, and have a script
to update it automatically.

Here is the preview:
<img width="1203" height="951" alt="image"
src="https://github.com/user-attachments/assets/527c18fc-258e-465f-a150-f2aafe3e6db9"
/>


## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->
- adds a script to generate the dependency graph `deps.svg`, and verify
if the existing one is up to date.
- adds a documentation page in `Contributor Guide` to show this graph
- adds a CI job to check if the generated dependency graph is up to date
with the code.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->
I tested the dependency graph display locally, see above.
Is it possible to see the preview from this PR's change online?

I also included a dummy crate in the initial commit, to test if the CI
can catch it and throw understandable error message.

## Are there any user-facing changes?
No
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
krinart pushed a commit that referenced this pull request Mar 3, 2026
…uct `PushdownChecker` (apache#19678)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19673.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

- Changed
[row_filter.rs](https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/row_filter.rs)
to use Vec instead of the BTreeSet

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19082.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Support Set Comparison Subquery (or Quantified Comparison Subquery in
some systems), it looks like

```sql
... WHERE t1.a > ANY(SELECT ... FROM t2) ...
```
or 
```sql
... WHERE t1.a < ALL(SELECT ... FROM t2) ...
```

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

- New `Expr` branch `Expr::SetComparison` and corresponding
`SetComparison` struct
- New optimizer rule `RewriteSetComparison` to translate this kind of
subquery using existing components
- Corresponding substrait and proto changes

`RewriteSetComparison` is a very naive implementation. We can optimize
it in various ways in the follow-up PRs, like using min/max for
non-equal comparison operators etc.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Yes, by unit test and slt

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

To SQL user, this is a new feature. To API user, `ExprPlanner::plan_any`
is removed (not needed)

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19820.

## Rationale for this change

`isoyear` part is available in both PG and Spark EXTRACT functions. 


https://www.postgresql.org/docs/current/functions-datetime.html#:~:text=the%20week%20numbering.-,isoyear,-The%20ISO%208601

https://github.com/apache/spark/blob/a03bedb6c1281c5263a42bfd20608d2ee005ab05/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L3360

## What changes are included in this PR?

Support for part `isoyear` in date_part function.

## Are these changes tested?
yes in SLT

## Are there any user-facing changes?

yes
<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
…ultiple runs (apache#19757)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19756.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

The previous logic of `EnsureCooperative` optimizer lacked context
awareness regarding ancestor nodes, making it not idempotent across
multiple runs.

Specifically, we need to ensure that:
1. **Idempotency**: Running the rule multiple times does not produce
nested `CooperativeExec` wrappers.
2. **Context Awareness**: If a subtree is already protected by a
`CooperativeExec`, we should skip and not double-wrap its children.

## What changes are included in this PR?

To solve this, we cannot rely solely on `transform_up` (which lacks
parent context) or `transform_down` (which makes safe mutation
difficult). This PR adopts `transform_down_up` with a depth counter to
strictly enforce that nodes are only wrapped when they are not currently
under a `CooperativeExec` scope.

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

More unit tests coverage

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

No

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
…e#19683)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19682

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Explained in issue.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Change array_except, array_intersect and array_union UDFs to return null
if either input is null.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Added & fixed tests.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

Behaviour change to a function output.

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache/datafusion-comet#2986.

## Rationale for this change

The current `trunc` implementation always converts scalar inputs to
arrays via `make_scalar_function`, which introduces unnecessary overhead
when processing single values.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

- Add scalar fast path for `trunc` function to process Float32/Float64
scalar inputs directly
- Handle optional precision argument for scalar inputs
- Add scalar benchmarks to measure performance

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes all sqllogictest pass

## Benchmark Results

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| f64 scalar | 256 ns | 55 ns | **4.6x** |
| f32 scalar | 247 ns | 56 ns | **4.4x** |

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

No

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
krinart pushed a commit that referenced this pull request Mar 3, 2026
…ansion (apache#19832)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

Follow up to
apache#19738

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

The current hex implementation expands `DictionaryArray` inputs into a
regular array, which causes loss of dictionary encoding and redundant
hex computation for repeated values.

## What changes are included in this PR?

- Apply hex encoding only to dictionary values 
- Avoid expanding dictionary arrays during execution

### Benchmark
| Size | Before | After | Speedup |
| ---- | ------ | ----- | ------- |
| 1024 | 8.3 µs | 7.2 µs | 1.15× |
| 4096 | 42.9 µs | 34.5 µs | 1.24× |
| 8192 | 91.6 µs | 71.7 µs | 1.28× |

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?
Yes. Existing unit tests and `sqllogictest` tests pass.

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?
No.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19822 .
- Part of apache#15914 

## Rationale for this change

The current date_part function in datafusion have a few differences with
the spark implementation:

- day of week parts are 1 indexed in spark but 0 indexed in datafusion
- spark supports a few more aliases for certain parts

Full list of spark supported aliases:
https://github.com/apache/spark/blob/a03bedb6c1281c5263a42bfd20608d2ee005ab05/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L3356-L3371

## What changes are included in this PR?

New date_part function in spark crate.

## Are these changes tested?

Yes with SLT

## Are there any user-facing changes?
yes
krinart pushed a commit that referenced this pull request Mar 3, 2026
…9829)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19828.
- Part of apache#15914 

## Rationale for this change

implement spark:
- https://spark.apache.org/docs/latest/api/sql/index.html#trunc
- https://spark.apache.org/docs/latest/api/sql/index.html#date_trunc
- https://spark.apache.org/docs/latest/api/sql/index.html#time_trunc

## What changes are included in this PR?

Add spark compatible wrappers around datafusion date_trunc function to
handle spark specificities.

## Are these changes tested?

Yes in SLT

## Are there any user-facing changes?

Yes
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19844
- Part of apache#15914 

## Rationale for this change

Add support for spark
https://spark.apache.org/docs/latest/api/sql/index.html#date_diff
function

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

yes in SLT

## Are there any user-facing changes?

 yes
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache/datafusion-comet#2986

## Rationale for this change

The round function currently converts scalar inputs to arrays before
processing, even when both value and decimal_places are scalar values.
This adds unnecessary overhead for constant folding scenarios like

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

- Add scalar fast path in RoundFunc::invoke_with_args for Float64 and
Float32 inputs
- Directly compute the result when both inputs are scalars, avoiding
array conversion overhead
- Add benchmark

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes


Type | Before | After | Speedup
-- | -- | -- | --
round_f64_scalar | 570 ns | 195 ns | 2.9x
round_f32_scalar | 564 ns | 192 ns | 2.9x

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

No

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
- Merged CTE reference resolution tests into main cte.slt file
- Added CREATE TABLE statement for orders table used in tests
- Fixed 3 case-sensitivity issues: changed 'WHERE N < 10' to 'WHERE n <
10'

Fixes apache#19786

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->
- Closes apache#19786

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->
This PR consolidates the small test file `cte_quoted_reference.slt` into
the main `cte.slt` file as requested in issue apache#19786. The separate file
was small enough that it didn't need to be maintained independently, and
consolidating it improves code organization and maintainability.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->
- Merged CTE reference resolution tests from `cte_quoted_reference.slt`
into `cte.slt`
- Added `CREATE TABLE orders AS VALUES (1), (2);` statement to support
the merged tests
- Fixed 3 pre-existing case-sensitivity bugs by changing `WHERE N < 10`
to `WHERE n < 10` in recursive CTE tests
- Deleted `cte_quoted_reference.slt` as it's now consolidated into the
main file

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Yes, all changes are tested:
- The consolidated tests from `cte_quoted_reference.slt` are now part of
the `cte.slt` test suite
- All sqllogictest tests pass successfully: `cargo test --test
sqllogictests -- cte`
- The case-sensitivity fixes ensure that field references match their
definitions (lowercase `n`)


## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->
No, there are no user-facing changes. This is purely a test file
reorganization that improves code maintainability without affecting
DataFusion's functionality or API.


<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

Co-authored-by: Anjali Choudhary <anjalicy@amazon.com>
krinart pushed a commit that referenced this pull request Mar 3, 2026
…e#19904)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache#19798

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

This is an edge case, would prefer to fix upstream in arrow-rs instead
of having handling code here, so just disable test for now.

- arrow-rs issue: apache/arrow-rs#9227

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Disable edge-case array_union SLT

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Test related change

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

No.

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache#15914
- Closes apache#19710

## Rationale for this change

Implementation of spark `add_months` function.



## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?
yes

## Are there any user-facing changes?

yes
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache/datafusion-comet#2986

## Rationale for this change

The signum function currently converts scalar inputs to arrays before
processing, even for single scalar values. This adds unnecessary
overhead from array allocation and conversion. Adding a scalar fast path
avoids this overhead and improves performance for constant folding and
scalar expression evaluation.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

- Added scalar fast path for `float32` and `float64`

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **signum_f64_scalar** | 266 ns | 54 ns | **4.9x** |
| **signum_f32_scalar** | 263 ns | 55 ns | **4.8x** |

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

No

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
krinart pushed a commit that referenced this pull request Mar 3, 2026
…ow frame queries (apache#19887)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19612.

## Rationale for this change

The `DistinctMedianAccumulator::evaluate()` method was using
`std::mem::take()` which consumed the internal state, causing subsequent
calls to return incorrect results. This was the last remaining item from
apache#19612

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

1. Changed `DistinctMedianAccumulator::evaluate()` to use `.iter()`
instead of `std::mem::take()` to preserve internal state across multiple
calls
2. Added sqllogictest case for distinct median with window frames

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes, added a sqllogictest

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

No

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of  apache/datafusion-comet#2986.

## Rationale for this change

The cot function currently converts scalar inputs to arrays before
processing, even for single scalar values. This adds unnecessary
overhead from array allocation and conversion. Adding a scalar fast path
avoids this overhead.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

- Added scalar fast path
- Added benchmark
- Update tests


| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **cot_f64_scalar** | 229 ns | 67 ns | **3.4x** |
| **cot_f32_scalar** | 247 ns | 59 ns | **4.2x** |

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?



<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache#18092

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Main value add here is ensure UDAFs encode their actual accepted types
in their signature instead of internally casting to the actual types
they support from a wider signature. Also doing some driveby refactoring
of removing manual Debug impls.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

See rationale.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Existing tests.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

No.

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache/datafusion-comet#2986.

## Rationale for this change

The `iszero` function currently converts scalar inputs to arrays before
processing, even for single scalar values. This adds unnecessary
overhead from array allocation and conversion. Adding a scalar fast path
avoids this overhead and improves performance

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

- Added scalar fast path for `Float16`, `Float32`, and `Float64` scalar
inputs
- Updated tests

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **iszero f32 scalar** | 244 ns | 62 ns | **3.9x** |
| **iszero f64 scalar** | 245 ns | 62 ns | **4.0x** |

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

No
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
…che#19819)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Related to apache#19004 and apache#19809.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Change internal error to user-facing error when function type coercion
fails. Add helpful hint when Binary types are used with string
functions.

Before:
```
Internal error: Expect TypeSignatureClass::Native(...) but received NativeType::Binary, DataType: Binary
```

After:
```
Error: Function 'split_part' requires String, but received Binary (DataType: Binary).
Hint: Binary types are not automatically coerced to String. Use CAST(column AS VARCHAR) to convert Binary data to String.
```
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

N/A

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

When hashing list arrays we hash all bytes of the child array, even if
we slice to a certain range of values. Refactor to slice only the needed
bytes; also do some general refactors.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Refactor list array hashing.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Added test.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

No.

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
…pache#19909)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

Part of: apache#15914

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Implement spark compatible unhex functions:
https://spark.apache.org/docs/latest/api/sql/index.html#unhex

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?
Yes. UTs and SLT added.

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?
No.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19912

## Rationale for this change

Just a couple of optimizations for hash table lookups usage in hash
aggregate.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache/datafusion-comet#2986.

## Rationale for this change


The `ascii` function currently converts scalar inputs to arrays before
processing via `make_scalar_function`. Adding a scalar fast path avoids
this overhead and improves performance.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

1. Refactored `invoke_with_args` to use `match` statement for handling
both scalar and array inputs
2. Added scalar fast path for `Utf8`, `LargeUtf8`, and `Utf8View` scalar
inputs

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **ascii/scalar_utf8** | 234 ns | 56 ns | **4.2x** |
| **ascii/scalar_utf8view** | 206 ns | 57 ns | **3.6x** |

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

No

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache/datafusion-comet#2986.

## Rationale for this change

The `factorial` function currently converts scalar inputs to arrays
before processing. Adding a scalar fast path avoids this overhead and
improves performance.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

1. Refactored `invoke_with_args` to use `match` statement for handling
both scalar and array inputs
2. Inlined array processing logic, removing `make_scalar_function`
usage.

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **factorial_scalar** | 244 ns | 102 ns | **2.4x** |

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes, covered by existing SLT tests:
- `scalar.slt` lines 461-477
- `math.slt` line 797

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19947

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

apache#17657 introduced a regression since it cloned the inner field in the
execution path but in the `return_type` function it still set
nullability to true. Fix to ensure we maintain the field of the inner
field as is.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Change `return_type` to just pass through the input datatype as is.

Also refactor away usage of a null buffer builder in favour of copying
the input array null buffer.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Added tests.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

No.

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Prerequisite for the following PRs:
- apache#19760
- apache#19761

Even if the api on the `MemoryPool` does not require `&mut self` for
growing/shrinking the reserved size, the api in `MemoryReservation`
does, making simple implementations irrepresentable without
synchronization primitives. For example, the following would require a
`Mutex` for concurrent access to the `MemoryReservation` in different
threads, even though the `MemoryPool` doesn't:

```rust
let mut stream: SendableRecordBatchStream = SendableRecordBatchStream::new();
let mem: Arc<MemoryReservation> = Arc::new(MemoryReservation::new_empty());

let mut builder = ReceiverStreamBuilder::new(10);
let tx = builder.tx();
{
    let mem = mem.clone();
    builder.spawn(async move {
        while let Some(msg) = stream.next().await {
            mem.try_grow(msg.unwrap().get_array_memory_size()); // ❌ `mem` is not mutable
            tx.send(msg).unwrap();
        }
    });
}
builder
    .build()
    .inspect_ok(|msg| mem.shrink(msg.get_array_memory_size()));  // ❌ `mem` is not mutable
```


## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Make the methods in `MemoryReservation` require `&self` instead of `&mut
self` for allowing concurrent shrink/grows from different tasks for the
same reservation.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

yes, by current tests

## Are there any user-facing changes?

Users can now safely call methods of `MemoryReservation` from different
tasks without synchronization primitives.

This is a backwards compatible API change, as it will work out of the
box for current users, however, depending on their clippy configuration,
they might see some new warnings about "unused muts" in their codebase.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#20005

PR:
```
SELECT COUNT(*) FROM hits WHERE "AdvEngineID" <> 0;

Query 1 iteration 0 took 30.5 ms and returned 1 rows
Query 1 avg time: 30.48 ms
```

Main:

```
SELECT COUNT(*) FROM hits WHERE "AdvEngineID" <> 0;

Query 1 iteration 0 took 39.6 ms and returned 1 rows
Query 1 avg time: 39.61 ms
```

## Rationale for this change

Improving cold starts.

## What changes are included in this PR?

## Are these changes tested?

Existing tests

## Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes apache#19967
- Part of apache#15914 

## Rationale for this change

Add spark compatible base64/unbase64 functions

## What changes are included in this PR?

- new encoding mode in DF encoding UDF for padded base64
- spark udfs for base64/unbase64

## Are these changes tested?

yes in SLT

## Are there any user-facing changes?

yes
krinart pushed a commit that referenced this pull request Mar 3, 2026
…ray_remove_all` functions (apache#19996)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of apache/datafusion-comet#2986

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

The current implementation of `general_remove` is based on `filter +
concat`, which creates intermediate arrays for each list row and can be
relatively expensive.

This PR introduces an alternative implementation based on
`MutableArrayData`, which copies contiguous ranges from the original
values buffer directly into the output array. The new approach is
semantically equivalent to the existing implementation but reduces
intermediate allocations and per-element overhead.


## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

- Replaced `general_remove`'s filter-based implementation with
`MutableArrayData` for more efficient memory usage.
- Optimized the removal process by adding fast paths for rows where no
matching elements need removal.

### Benchmark
```
group                                        after                                  before
-----                                        -----                                  ------
array_remove_binary/remove/10                1.00      4.6±0.14ms        ? ?/sec    2.41     11.0±0.34ms        ? ?/sec
array_remove_binary/remove/100               1.00      8.5±0.19ms        ? ?/sec    1.95     16.6±0.42ms        ? ?/sec
array_remove_binary/remove/500               1.00     35.9±0.78ms        ? ?/sec    1.43     51.4±1.10ms        ? ?/sec
array_remove_boolean/remove/10               1.00      3.7±0.05ms        ? ?/sec    3.23     11.8±0.30ms        ? ?/sec
array_remove_boolean/remove/100              1.00      8.1±0.15ms        ? ?/sec    2.18     17.6±0.35ms        ? ?/sec
array_remove_boolean/remove/500              1.00     26.6±0.43ms        ? ?/sec    1.52     40.3±0.81ms        ? ?/sec
array_remove_decimal64/remove/10             1.00      3.9±0.07ms        ? ?/sec    2.41      9.4±0.18ms        ? ?/sec
array_remove_decimal64/remove/100            1.00      6.7±0.19ms        ? ?/sec    2.12     14.2±0.34ms        ? ?/sec
array_remove_decimal64/remove/500            1.00     40.3±0.75ms        ? ?/sec    1.52     61.1±1.46ms        ? ?/sec
array_remove_f64/remove/10                   1.00      3.8±0.10ms        ? ?/sec    1.32      5.0±0.16ms        ? ?/sec
array_remove_f64/remove/100                  1.00      4.8±0.34ms        ? ?/sec    1.24      5.9±0.18ms        ? ?/sec
array_remove_f64/remove/500                  1.00     22.3±0.68ms        ? ?/sec    1.15     25.5±0.86ms        ? ?/sec
array_remove_fixed_size_binary/remove/10     1.00      4.7±0.09ms        ? ?/sec    1.52      7.2±0.26ms        ? ?/sec
array_remove_fixed_size_binary/remove/100    1.00      8.0±0.32ms        ? ?/sec    1.40     11.1±0.40ms        ? ?/sec
array_remove_fixed_size_binary/remove/500    1.00     45.4±0.97ms        ? ?/sec    1.16     52.6±1.43ms        ? ?/sec
array_remove_int64/remove/10                 1.00      3.9±0.11ms        ? ?/sec    2.24      8.8±0.24ms        ? ?/sec
array_remove_int64/remove/100                1.00      5.5±0.18ms        ? ?/sec    2.32     12.8±0.44ms        ? ?/sec
array_remove_int64/remove/500                1.00     25.5±1.06ms        ? ?/sec    1.61     40.9±1.25ms        ? ?/sec
array_remove_strings/remove/10               1.00      4.5±0.10ms        ? ?/sec    2.41     10.9±0.28ms        ? ?/sec
array_remove_strings/remove/100              1.00      8.5±0.37ms        ? ?/sec    2.00     17.0±0.71ms        ? ?/sec
array_remove_strings/remove/500              1.00     35.9±0.84ms        ? ?/sec    1.48     53.1±1.91ms        ? ?/sec
```

## Are these changes tested?

Yes. Existing SLT for `array` continue to pass without modification.
Benchmarks were added.

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

No.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Part of  apache/datafusion-comet#2986.

## Rationale for this change

The `repeat` function currently converts scalar inputs to arrays before
processing via `make_scalar_function`. Adding a scalar fast path avoids
this overhead and improves performance for constant folding and scalar
expression evaluation.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

1. Refactored `invoke_with_args` to handle scalar inputs directly
without array conversion
2. Added scalar fast path for `Utf8`, `LargeUtf8`, and `Utf8View` scalar
inputs
3. Added array fast path that skips per-element null checks when
`null_count() == 0`

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **repeat/scalar_utf8** | 519 ns | 91 ns | **5.7x** |
| **repeat/scalar_utf8view** | 437 ns | 91 ns | **4.8x** |

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

Yes

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

No

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
krinart pushed a commit that referenced this pull request Mar 3, 2026
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- This PR is part of the [Utf8View
support](apache#10918) epic. It
adds `Utf8View` support in the Spark-compat layer.

## Rationale for this change

In our internal project we're only suppporting `Utf8View` _(because of
design constraints)_ and the current implementation of `SparkConcat`
only supports `Utf8`. The `SparkConcat` function should accept
`Utf8View` and mixed string types in line with the main DataFusion
concat. This PR adds that support and follows the same patterns as
[DataFusion’s
concat](https://github.com/apache/datafusion/blob/main/datafusion/functions/src/string/concat.rs).

Prevents errors like : 

> The type of Utf8 AND Utf8View of like physical should be same.
> This issue was likely caused by a bug in DataFusion's code. Please
help us to resolve this by filing a bug report in our issue tracker:
https://github.com/apache/datafusion/issues

from a query like:-

```sql
select i_item_sk,
       item_info
from
  (select i_item_sk,
          CONCAT('Item: ', i_item_desc) as item_info
   from item) sub
where item_info LIKE 'Item: Electronic%'
order by 1;
```

 

## What changes are included in this PR?

- Extend the type signature to accept `Utf8View` in addition to `Utf8`
and `LargeUtf8` via `TypeSignature::Variadic(vec![Utf8View, Utf8,
LargeUtf8])` matching DataFusion’s concat.

- In `return_field_from_args`, compute the result type with precedence
Utf8View &gt; LargeUtf8 &gt; Utf8.
In spark_concat, handle Utf8View and LargeUtf8 in scalar paths
(zero-argument and all-NULL).


## Are these changes tested?

Yes.
- Unit tests: `cargo test --package datafusion-spark
function::string::concat::tests`, including `test_concat_utf8view`.
- Sqllogictest: `spark/string/concat.slt` includes a “**Utf8View: no
extra CAST in plan**” case that uses EXPLAIN and a temporary table to
ensure no extra CASTs when using arrow_cast(..., 'Utf8View') with table
columns.

## Are there any user-facing changes?

- **API:** SparkConcat’s signature is extended to include Utf8View in
the variadic list. No breaking changes.

_used gpt to rephrase some of these points_
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants