fix: Allow named arguments to skip optional parameter positions #123
Merged
sgrebnov merged 1 commit into spiceai-51 on Jan 20, 2026
Conversation
Jeadie approved these changes on Jan 20, 2026
lukekim pushed a commit that referenced this pull request on Feb 12, 2026
…pache#19659) (apache#19705)

## Which issue does this PR close?

- Part of apache#18566

## Rationale for this change

I propose backporting the fix for apache#19641 to the 52 release.

## What changes are included in this PR?

- Backport apache#19659

## Are these changes tested?

Yes

## Are there any user-facing changes?

Bug fix

Co-authored-by: Kumar Ujjawal <ujjawalpathak6@gmail.com>
lukekim pushed a commit that referenced this pull request on Feb 12, 2026
…he#19661)

## Which issue does this PR close?

- Part of apache#18566
lukekim pushed a commit that referenced this pull request on Feb 13, 2026
lukekim pushed a commit that referenced this pull request on Feb 19, 2026
## Which issue does this PR close?

- Part of apache#19784

## Rationale for this change

This is a follow-up of apache#19573

## What changes are included in this PR?

Backport apache#19804

## Are these changes tested?

Existing tests

## Are there any user-facing changes?

No
lukekim pushed a commit that referenced this pull request on Feb 19, 2026
lukekim pushed a commit that referenced this pull request on Feb 26, 2026
lukekim pushed a commit that referenced this pull request on Mar 3, 2026
…ache#20289)

## Which issue does this PR close?

- Related to apache#20287

## Rationale for this change

See issue apache#20109

## What changes are included in this PR?

1. Remap parent filter expressions: when a FilterExec has a projection, remap unsupported parent filter expressions from output-schema coordinates to input-schema coordinates using `reassign_expr_columns()` before combining them with the current filter's predicates.
2. Preserve projection: when creating the merged FilterExec, preserve the original projection instead of discarding it.

## Are these changes tested?

Yes, test cases added

## Are there any user-facing changes?

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
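The remapping step above can be sketched in plain Rust. This is a hypothetical illustration of the coordinate change, not DataFusion's actual `reassign_expr_columns()`: a filter's projection maps output column `i` to input column `projection[i]`, so a parent predicate written against the output schema must have its column indices rewritten before it can be merged with the filter's own predicate.

```rust
/// Hypothetical sketch: remap column indices of a parent predicate from
/// the filter's output-schema coordinates to its input-schema coordinates.
/// `projection[i]` is the input column exposed as output column `i`.
fn remap_columns(parent_pred_cols: &[usize], projection: &[usize]) -> Vec<usize> {
    parent_pred_cols
        .iter()
        .map(|&out_idx| projection[out_idx])
        .collect()
}

fn main() {
    // The filter projects input columns [2, 0] as its outputs [0, 1].
    let projection = [2, 0];
    // A parent predicate references output columns 1 and 0 ...
    let remapped = remap_columns(&[1, 0], &projection);
    // ... which correspond to input columns 0 and 2.
    assert_eq!(remapped, vec![0, 2]);
}
```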
lukekim pushed a commit that referenced this pull request on Mar 3, 2026
## Which issue does this PR close?

- Closes apache#17508

## Rationale for this change

The previous implementation used UUID-based aliasing as a workaround to prevent duplicate names for literals in Substrait plans. This approach had several drawbacks:

- Non-deterministic plan names that made testing difficult (requiring UUID regex filters)
- Only addressed literal naming conflicts, not the broader issue of name deduplication
- Added an unnecessary dependency on the `uuid` crate
- Didn't properly handle cases where the same qualified name could appear with different schema representations

## What changes are included in this PR?

1. **Enhanced NameTracker**: Refactored to detect two types of conflicts:
   - Duplicate schema names: tracked via `schema_name()` to prevent `validate_unique_names` failures (e.g., two `Utf8(NULL)` literals)
   - Ambiguous references: tracked via `qualified_name()` to prevent `DFSchema::check_names` failures when a qualified field (e.g., `left.Utf8(NULL)`) and an unqualified field (e.g., `Utf8(NULL)`) share the same column name
2. **Removed UUID dependency**: Eliminated the `uuid` crate from `datafusion/substrait`
3. **Removed literal-specific aliasing**: The UUID-based workaround in `project_rel.rs` is no longer needed, as the improved NameTracker handles all naming conflicts consistently
4. **Deterministic naming**: Name conflicts now use predictable `__temp__N` suffixes instead of random UUIDs

Note: this doesn't fully fix all the issues in apache#17508, which allow some special-casing of `CAST` not included here.

## Are these changes tested?

Yes:

- Updated snapshot tests to reflect the new deterministic naming (e.g., `Utf8("people")__temp__0` instead of UUID-based names)
- Modified some roundtrip tests to verify semantic equivalence (schema matching and execution) rather than exact string matching, which is more robust
- All existing integration tests pass with the new naming scheme

## Are there any user-facing changes?

Minimal. The generated plan names are now deterministic and more readable (using `__temp__N` suffixes instead of UUIDs), but this is primarily an internal representation change. The functional behavior and query results remain unchanged.

Co-authored-by: Xander <zander181@googlemail.com>
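The deterministic `__temp__N` suffix scheme described above can be sketched as follows. This is a minimal illustration of the idea, not the actual NameTracker from `datafusion/substrait` (which also distinguishes schema names from qualified names):

```rust
use std::collections::HashSet;

/// Hypothetical sketch of deterministic name deduplication: on a name
/// collision, append a predictable `__temp__N` suffix instead of a
/// random UUID, so plan output is stable across runs.
struct NameTracker {
    seen: HashSet<String>,
}

impl NameTracker {
    fn new() -> Self {
        NameTracker { seen: HashSet::new() }
    }

    fn get_unique(&mut self, name: &str) -> String {
        // First occurrence keeps its original name.
        if self.seen.insert(name.to_string()) {
            return name.to_string();
        }
        // Collisions get __temp__0, __temp__1, ... in order.
        for n in 0.. {
            let candidate = format!("{name}__temp__{n}");
            if self.seen.insert(candidate.clone()) {
                return candidate;
            }
        }
        unreachable!()
    }
}

fn main() {
    let mut t = NameTracker::new();
    assert_eq!(t.get_unique("Utf8(NULL)"), "Utf8(NULL)");
    assert_eq!(t.get_unique("Utf8(NULL)"), "Utf8(NULL)__temp__0");
    assert_eq!(t.get_unique("Utf8(NULL)"), "Utf8(NULL)__temp__1");
}
```

Because the suffix counter is derived only from previously seen names, repeated planning of the same query yields identical names, which is what makes snapshot testing possible without UUID regex filters.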
krinart pushed a commit that referenced this pull request on Mar 3, 2026
…#19714)

## Which issue does this PR close?

- Part of apache#7013

## Rationale for this change

I was speaking with someone who wanted to get more involved with DataFusion recently, and it was not clear from the website that Discord is far more active than Slack. I think it would help new users to make this clearer in the documentation, to lower the barrier to joining the community.

## What changes are included in this PR?

Improve the communication documentation here: https://datafusion.apache.org/contributor-guide/communication.html and mention that Discord is more active than Slack.

<img width="1191" height="856" alt="Screenshot 2026-01-09 at 8 54 39 AM" src="https://github.com/user-attachments/assets/9bbef5d4-645d-45f0-85aa-6b12a2a7eea5" />

While I was reviewing the communication documentation, I noticed some other minor issues that could be improved, which I will call out inline.
krinart pushed a commit that referenced this pull request on Mar 3, 2026
## Which issue does this PR close?

- N/A

## Rationale for this change

Occasionally we do PRs to bump a bunch of versions in `Cargo.lock`; a recent example:

- apache#19667

Ideally this should be handled by dependabot already. I suspect the default limit of 5 open PRs is preventing dependabot from creating all the PRs it needs; this causes it to error, and those version bumps are "lost".

- I say "lost", but dependabot can pick them up on the next run... if it doesn't error again

See an example from a dependabot run:

<img width="703" height="118" alt="image" src="https://github.com/user-attachments/assets/f8ae6b70-6fe1-4613-8c60-28564c23188b" />

From the logs:

```text
+-------------------------------------------------------+
| Changes to Dependabot Pull Requests                   |
+---------+---------------------------------------------+
| created | insta ( from 1.45.0 to 1.46.0 )             |
| created | tracing ( from 0.1.43 to 0.1.44 )           |
| created | syn ( from 2.0.111 to 2.0.113 )             |
| created | async-compression ( from 0.4.35 to 0.4.36 ) |
| created | object_store ( from 0.12.4 to 0.13.0 )      |
| created | serde_json ( from 1.0.145 to 1.0.148 )      |
| created | bigdecimal ( from 0.4.9 to 0.4.10 )         |
| created | clap ( from 4.5.53 to 4.5.54 )              |
| created | libc ( from 0.2.177 to 0.2.179 )            |
| created | tokio ( from 1.48.0 to 1.49.0 )             |
| created | tokio-util ( from 0.7.17 to 0.7.18 )        |
| created | sqllogictest ( from 0.28.4 to 0.29.0 )      |
+---------+---------------------------------------------+
```

We expect to have all these PRs created (this was for the run 5 days ago), but only these were created on that day:

- apache#19645
- apache#19644
- apache#19643

And considering these PRs were still open at the time:

- apache#19544
- apache#19325

we can see it hit the limit of 5.

> Dependabot default behavior:
>
> - If five pull requests with version updates are open, no further pull requests are raised until some of those open requests are merged or closed.
> - Security updates have a separate, internal limit of ten open pull requests which cannot be changed.

- https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference#open-pull-requests-limit-

## What changes are included in this PR?

Bump the limit to 15. We might have an appetite for increasing it more; 15 was chosen arbitrarily.

## Are these changes tested?

Dependabot only.

## Are there any user-facing changes?

No.
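For reference, the option lives in `.github/dependabot.yml`, per the GitHub docs linked above. A fragment like the following (ecosystem and schedule values are illustrative, not taken from this repo's actual config) shows where the limit is set:

```yaml
# .github/dependabot.yml (illustrative fragment)
version: 2
updates:
  - package-ecosystem: "cargo"
    directory: "/"
    schedule:
      interval: "daily"
    # Raise the default cap of 5 open version-update PRs.
    open-pull-requests-limit: 15
```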
krinart pushed a commit that referenced this pull request on Mar 3, 2026
## Which issue does this PR close?

- Closes apache#19720

## Rationale for this change

Choosing LZO compression currently errors, and I think it might never get supported, so the best option moving forward is to remove it altogether and update the docs.

## What changes are included in this PR?

- Removed LZO from the `parse_compression_string()` function
- Removed docs
- Updated expected test output

## Are these changes tested?

Yes

## Are there any user-facing changes?

A user choosing LZO as compression will get a clear error message:

```
Unknown or unsupported parquet compression: lzo. Valid values are: uncompressed, snappy, gzip(level), brotli(level), lz4, zstd(level), and lz4_raw.
```
krinart pushed a commit that referenced this pull request on Mar 3, 2026
## Which issue does this PR close?

- Part of 18566
krinart pushed a commit that referenced this pull request on Mar 3, 2026
## Which issue does this PR close?

- Closes apache#19715

## Rationale for this change

The `is_used()` API incorrectly returned false for custom `DataSource` implementations that didn't call `reassign_expr_columns()` -> `with_new_children()`. This caused `HashJoinExec` to skip computing dynamic filters even when they were actually being used.

## What changes are included in this PR?

Updated `is_used()` to check both outer and inner Arc counts.

## Are these changes tested?

Functionality is covered by the existing test `test_hashjoin_dynamic_filter_pushdown_is_used`. I was not sure whether to add a repro, since it would require adding a custom `DataSource`; the current tests in datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs use `FileScanConfig`.

## Are there any user-facing changes?

No
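The "check both outer and inner Arc counts" idea can be illustrated with plain `Arc` reference counting. This is a hypothetical sketch, not DataFusion's `DynamicFilterPhysicalExpr`: the point is that a consumer may hold a clone of the inner shared state without holding the outer wrapper, so checking only the outer count reports the filter as unused.

```rust
use std::sync::Arc;

/// Stand-in for a dynamic filter whose predicate state is shared via an
/// inner Arc that consumers may clone independently of the outer Arc.
struct DynamicFilter {
    inner: Arc<String>,
}

/// A filter counts as "used" if anyone besides us holds either the outer
/// wrapper or the inner shared state.
fn is_used(outer: &Arc<DynamicFilter>) -> bool {
    Arc::strong_count(outer) > 1 || Arc::strong_count(&outer.inner) > 1
}

fn main() {
    let f = Arc::new(DynamicFilter { inner: Arc::new("pred".into()) });
    assert!(!is_used(&f)); // no consumers yet

    // A custom DataSource keeps only the inner state alive; an
    // outer-count-only check would miss this consumer.
    let consumer = Arc::clone(&f.inner);
    assert!(is_used(&f));
    drop(consumer);
    assert!(!is_used(&f));
}
```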
krinart pushed a commit that referenced this pull request on Mar 3, 2026
## Which issue does this PR close?

- Part of apache/datafusion-comet#2986

## Rationale for this change

The current floor and ceil implementations always convert scalar inputs to arrays via `values_to_arrays()`, which introduces unnecessary overhead when processing single values.

## What changes are included in this PR?

- Add a scalar fast path for floor and ceil
- Add a criterion benchmark for floor/ceil to measure performance

## Are these changes tested?

- All sqllogictest pass
- Added benchmark:

| Benchmark | Array Time | Scalar Time | Speedup |
|---------------------|------------|-------------|---------|
| floor_f64 (1024) | ~396 ns | ~147 ns | 2.7× |
| ceil_f64 (1024) | ~363 ns | ~150 ns | 2.4× |
| floor_f64 (4096) | ~681 ns | ~144 ns | 4.7× |
| ceil_f64 (4096) | ~667 ns | ~144 ns | 4.6× |
| floor_f64 (8192) | ~1,638 ns | ~144 ns | 11.4× |
| ceil_f64 (8192) | ~1,634 ns | ~144 ns | 11.3× |

## Are there any user-facing changes?

No
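The scalar fast path above can be sketched with a simplified value type. This is an illustration of the dispatch pattern, not DataFusion's actual `ColumnarValue` API: a lone scalar is floored directly rather than being wrapped into a one-element array, processed element-wise, and unwrapped again.

```rust
/// Simplified stand-in for a columnar value that is either one scalar or
/// an array of values.
enum Value {
    Scalar(f64),
    Array(Vec<f64>),
}

fn floor_value(v: &Value) -> Value {
    match v {
        // Fast path: no array allocation or iteration for a single value.
        Value::Scalar(x) => Value::Scalar(x.floor()),
        // Array path: element-wise as before.
        Value::Array(xs) => Value::Array(xs.iter().map(|x| x.floor()).collect()),
    }
}

fn main() {
    match floor_value(&Value::Scalar(2.7)) {
        Value::Scalar(x) => assert_eq!(x, 2.0),
        _ => unreachable!(),
    }
    match floor_value(&Value::Array(vec![1.2, -0.5])) {
        Value::Array(xs) => assert_eq!(xs, vec![1.0, -1.0]),
        _ => unreachable!(),
    }
}
```

The benchmark table is consistent with this shape: the scalar path's cost is flat (~144-150 ns) regardless of the nominal batch size, while the array path grows with it.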
krinart pushed a commit that referenced this pull request on Mar 3, 2026
## Which issue does this PR close?

- Closes apache#19671

Signed-off-by: StandingMan <jmtangcs@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
krinart pushed a commit that referenced this pull request on Mar 3, 2026
## Which issue does this PR close?

- Closes #.

## Rationale for this change

Running `./bench.sh run tpcds` with freshly created data from `./bench.sh data tpcds` fails with the following error:

```
Please prepare TPC-DS data first by following instructions: ./bench.sh data tpcds
```

This PR fixes it.

## What changes are included in this PR?

Fixes the `TPCDS_DIR` variable in `run_tpcds`

## Are these changes tested?

Just benchmark scripts

## Are there any user-facing changes?

No need
krinart pushed a commit that referenced this pull request on Mar 3, 2026
…pache#19731)

## Which issue does this PR close?

- Closes #.

## Rationale for this change

Add a simplifier rule to remove a redundant comparison. It's probably unlikely that a human would write such an expression, but a generated query, or other optimizations, may lead to one.

## What changes are included in this PR?

Simplify `expr = L1 AND expr != L2` to `expr = L1` when `L1 != L2`

## Are these changes tested?

Added unit tests and SLT. The first commit shows the plan before the optimization.

## Are there any user-facing changes?

Fewer runtime ops if the plan contains the pattern.
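The rewrite is sound because any value equal to `L1` automatically satisfies `!= L2` when the two literals differ. A hypothetical sketch of the rule on a toy expression type (not DataFusion's `Expr` or its simplifier API), with literals modeled as `i64`:

```rust
/// Toy expression type: `col = lit`, `col != lit`, and conjunction.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Eq(String, i64),
    NotEq(String, i64),
    And(Box<Expr>, Box<Expr>),
}

/// Rewrite `expr = L1 AND expr != L2` to `expr = L1` when L1 != L2:
/// the inequality is implied by the equality and can be dropped.
fn simplify(e: Expr) -> Expr {
    if let Expr::And(l, r) = &e {
        if let (Expr::Eq(col, l1), Expr::NotEq(col2, l2)) = (l.as_ref(), r.as_ref()) {
            if col == col2 && l1 != l2 {
                return Expr::Eq(col.clone(), *l1);
            }
        }
    }
    e
}

fn main() {
    // a = 1 AND a != 2  =>  a = 1
    let e = Expr::And(
        Box::new(Expr::Eq("a".into(), 1)),
        Box::new(Expr::NotEq("a".into(), 2)),
    );
    assert_eq!(simplify(e), Expr::Eq("a".into(), 1));

    // a = 1 AND a != 1 is NOT rewritten by this rule (L1 == L2).
    let e = Expr::And(
        Box::new(Expr::Eq("a".into(), 1)),
        Box::new(Expr::NotEq("a".into(), 1)),
    );
    assert!(matches!(simplify(e), Expr::And(_, _)));
}
```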
krinart pushed a commit that referenced this pull request on Mar 3, 2026
## Which issue does this PR close?

- Closes apache#18932

## Rationale for this change

Quoted CTE references (e.g., `WITH barbaz AS (...) SELECT * FROM "barbaz"`) were being collected as table refs and could fall through to catalog lookup instead of resolving to the CTE, especially with custom providers.

## What changes are included in this PR?

- Normalize CTE and table references consistently during collection (preserve quote style; only lowercase unquoted idents when enabled).
- Avoid double-normalization when converting to `TableReference` in the resolve path.

## Are these changes tested?

Yes

## Are there any user-facing changes?

Behavior fix for quoted CTE references; no API changes.

Co-authored-by: jonahgao <jonahgao@msn.com>
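The normalization rule described above ("preserve quote style; only lowercase unquoted idents when enabled") can be sketched as a small function. This is a hypothetical illustration, not DataFusion's actual identifier normalizer: the key property is that the CTE definition and its later reference normalize to the same key, whether or not the reference is quoted.

```rust
/// Hypothetical sketch: normalize a SQL identifier for lookup. Quoted
/// identifiers keep their exact spelling (quotes stripped); unquoted
/// identifiers are lowercased only when normalization is enabled.
fn normalize_ident(ident: &str, lowercase_unquoted: bool) -> String {
    if let Some(inner) = ident.strip_prefix('"').and_then(|s| s.strip_suffix('"')) {
        inner.to_string()
    } else if lowercase_unquoted {
        ident.to_lowercase()
    } else {
        ident.to_string()
    }
}

fn main() {
    // CTE defined as unquoted `barbaz`, referenced as quoted `"barbaz"`:
    // both normalize to the same key, so the reference resolves to the
    // CTE instead of falling through to a catalog lookup.
    assert_eq!(normalize_ident("barbaz", true), "barbaz");
    assert_eq!(normalize_ident("\"barbaz\"", true), "barbaz");
    // Quoted mixed case is preserved, so `"BarBaz"` stays distinct.
    assert_eq!(normalize_ident("\"BarBaz\"", true), "BarBaz");
}
```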
krinart pushed a commit that referenced this pull request on Mar 3, 2026
## Which issue does this PR close?

- Closes apache#10583

## What changes are included in this PR?

This patch implements null-aware anti join support for HashJoin LeftAnti operations, enabling correct SQL NOT IN subquery semantics with NULL values.

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
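The NULL semantics a null-aware anti join must reproduce are worth spelling out. A sketch of SQL `NOT IN` behavior with `Option` standing in for nullable values (this models the semantics only, not the hash-join implementation): if the subquery list contains any NULL, no probe row qualifies; a NULL probe value never qualifies either.

```rust
/// SQL `probe NOT IN (list)` with three-valued logic collapsed to
/// "does this row pass the filter" (UNKNOWN filters the row out).
fn not_in(probe: Option<i32>, list: &[Option<i32>]) -> bool {
    let probe = match probe {
        None => return false, // NULL NOT IN (...) is never true
        Some(v) => v,
    };
    if list.iter().any(|x| x.is_none()) {
        return false; // x NOT IN (.., NULL, ..) is never true
    }
    !list.iter().any(|x| *x == Some(probe))
}

fn main() {
    assert!(not_in(Some(3), &[Some(1), Some(2)]));   // true: no match
    assert!(!not_in(Some(1), &[Some(1), Some(2)]));  // false: match
    assert!(!not_in(Some(3), &[Some(1), None]));     // NULL in list
    assert!(!not_in(None, &[Some(1), Some(2)]));     // NULL probe
}
```

A plain LeftAnti join would incorrectly return the `Some(3)` row in the third case, which is why the join needs to be null-aware.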
krinart pushed a commit that referenced this pull request on Mar 3, 2026
## Which issue does this PR close?

- Closes apache#19697
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

## Which issue does this PR close?

- part of apache/datafusion-comet#2986

## Rationale for this change

The current `hex` function implementation uses the `format!` macro and the `StringArray::from` iterator pattern, which causes:

1. Per-element `String` allocations: each value allocates a new `String` via `format!`
2. Unnecessary conversions: multiple intermediate conversions between types
3. Inefficient dictionary type handling: collects all values into vectors before building the result

## What changes are included in this PR?

This PR optimizes `hex` encoding by:

- Replacing `format!("{num:X}")` with a fast lookup-table approach
- Building results directly using `StringBuilder`
- Reusing a single `Vec` buffer per iteration to avoid re-allocation
- Optimizing dictionary array handling by building results iteratively

### Benchmark Results

| Group | Size | Before | After | Speedup |
| :--- | :----: | :--------: | :-------: | -------: |
| hex_binary | 1024 | 89.9 µs | 51.9 µs | 1.73x |
| hex_binary | 4096 | 385.2 µs | 218.7 µs | 1.76x |
| hex_binary | 8192 | 741.6 µs | 451.6 µs | 1.64x |
| hex_int64 | 1024 | 32.0 µs | 12.4 µs | 2.57x |
| hex_int64 | 4096 | 132.4 µs | 59.7 µs | 2.22x |
| hex_int64 | 8192 | 258.5 µs | 120.6 µs | 2.14x |
| hex_int64_dict | 1024 | 75.2 µs | 12.4 µs | 6.04x |
| hex_int64_dict | 4096 | 313.2 µs | 60.5 µs | 5.18x |
| hex_int64_dict | 8192 | 614.7 µs | 129.0 µs | 4.76x |
| hex_utf8 | 1024 | 88.5 µs | 53.5 µs | 1.66x |
| hex_utf8 | 4096 | 357.6 µs | 211.1 µs | 1.69x |
| hex_utf8 | 8192 | 698.7 µs | 424.8 µs | 1.64x |

## Are these changes tested?

Yes. Existing unit and sqllogictest tests pass. New benchmarks added.

## Are there any user-facing changes?

No.
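The lookup-table idea behind the speedup is easy to demonstrate: precompute the two-character hex pair for every possible byte once, then encoding becomes pure table indexing with no per-value formatting. A minimal Python sketch (names like `HEX_PAIRS` and `hex_bytes` are illustrative, not the PR's Rust code):

```python
# Precompute the hex pair for every byte value once.
HEX_PAIRS = [format(b, "02X") for b in range(256)]

def hex_bytes(data: bytes) -> str:
    out = []
    for b in data:
        out.append(HEX_PAIRS[b])  # table lookup, no per-byte formatting
    return "".join(out)

hex_bytes(b"Spark")  # "537061726B"
```

The PR applies the same trade in Rust: the formatting cost is paid once at table construction, and the hot loop only copies bytes into a reused buffer.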
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

## Which issue does this PR close?

Part of apache#19227

## Rationale for this change

See the issue for details. The existing script `./dev/rust_lint.sh` runs all non-functional checks, including formatter/clippy checks. Some of these tools support auto-fix options, so this PR adds an option to the lint scripts to perform auto-fixes. Now `./dev/rust_lint.sh --write --allow-dirty` can perform auto-fixes for all linter and similar violations:

```
yongting@Yongtings-MacBook-Pro-2 ~/C/datafusion (auto-fix)> ./dev/rust_lint.sh --help
Usage: ./dev/rust_lint.sh [--write] [--allow-dirty]

Runs the local Rust lint suite similar to CI.

  --write        Run formatters, clippy and other non-functional checks in
                 best-effort write/fix mode (requires a clean git worktree,
                 no uncommitted changes; some checks are test-only and
                 ignore this flag).
  --allow-dirty  Allow `--write` to run even when the git worktree has
                 uncommitted changes.
```

## What changes are included in this PR?

- Adds `[--write] [--allow-dirty]` flags to `rust_lint.sh` to support auto-fixes
- `rust_lint.sh` consists of several sub-scripts like `rust_fmt.sh`; they are all extended with the auto-fix feature through the `--write` flag, and `rust_lint.sh` optionally calls them with the additional flag for auto-fixes
- Cleans up `rust_lint.sh`

## Are these changes tested?

Yes. Commit 8c99417 intentionally introduced one violation for each available lint check, and the auto-fix command was able to fix all of them. The test may not be comprehensive, but it provides a reasonable starting point. We can begin using this script now and iterate on it if we discover cases where the auto-fix does not behave correctly.

## Are there any user-facing changes?

No
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

## Which issue does this PR close?

- Closes #.

## Rationale for this change

A dependency graph for workspace member crates is often needed when doing refactors. I want it to be included in the docs, with a script to update it automatically. Here is the preview:

<img width="1203" height="951" alt="image" src="https://github.com/user-attachments/assets/527c18fc-258e-465f-a150-f2aafe3e6db9" />

## What changes are included in this PR?

- Adds a script to generate the dependency graph `deps.svg` and verify that the existing one is up to date
- Adds a documentation page in the Contributor Guide to show this graph
- Adds a CI job to check that the generated dependency graph is up to date with the code

## Are these changes tested?

I tested the dependency graph display locally; see above. Is it possible to see the preview from this PR's change online? I also included a dummy crate in the initial commit, to test whether CI can catch it and throw an understandable error message.

## Are there any user-facing changes?

No

---------

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

…uct `PushdownChecker` (apache#19678)

## Which issue does this PR close?

- Closes apache#19673.

## What changes are included in this PR?

- Changed [row_filter.rs](https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/row_filter.rs) to use a `Vec` instead of the `BTreeSet`

## Are these changes tested?

Yes
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

## Which issue does this PR close?

- Closes apache#19082.

## Rationale for this change

Support set comparison subqueries (also called quantified comparison subqueries in some systems), which look like:

```sql
... WHERE t1.a > ANY(SELECT ... FROM t2) ...
```

or

```sql
... WHERE t1.a < ALL(SELECT ... FROM t2) ...
```

## What changes are included in this PR?

- New `Expr` branch `Expr::SetComparison` and corresponding `SetComparison` struct
- New optimizer rule `RewriteSetComparison` to translate this kind of subquery using existing components
- Corresponding substrait and proto changes

`RewriteSetComparison` is a very naive implementation. We can optimize it in various ways in follow-up PRs, for example by using min/max for non-equality comparison operators.

## Are these changes tested?

Yes, by unit tests and slt

## Are there any user-facing changes?

To SQL users, this is a new feature.
To API users, `ExprPlanner::plan_any` is removed (no longer needed)

---------

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
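The min/max optimization mentioned for follow-up work rests on a simple equivalence. Assuming a non-empty subquery result with no NULLs (the general three-valued case needs more care), `x > ANY(s)` only needs the smallest element and `x > ALL(s)` only the largest. A minimal sketch under those assumptions; `gt_any`/`gt_all` are hypothetical names:

```python
# x > ANY(s)  <=>  x > MIN(s)   (non-empty s, no NULLs)
def gt_any(x, values):
    return x > min(values)

# x > ALL(s)  <=>  x > MAX(s)   (non-empty s, no NULLs)
def gt_all(x, values):
    return x > max(values)

gt_any(5, [4, 9])  # True: 5 > min = 4
gt_all(5, [4, 9])  # False: 5 is not > max = 9
```

This turns a per-row membership scan into a single aggregate over the subquery, which is why it is attractive for the non-equality comparison operators.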
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

## Which issue does this PR close?

- Closes apache#19820.

## Rationale for this change

The `isoyear` part is available in both the PostgreSQL and Spark EXTRACT functions.

https://www.postgresql.org/docs/current/functions-datetime.html#:~:text=the%20week%20numbering.-,isoyear,-The%20ISO%208601
https://github.com/apache/spark/blob/a03bedb6c1281c5263a42bfd20608d2ee005ab05/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L3360

## What changes are included in this PR?

Support for the `isoyear` part in the `date_part` function.

## Are these changes tested?

Yes, in SLT

## Are there any user-facing changes?

Yes
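What makes `isoyear` distinct from `year` is that the ISO 8601 week-numbering year can differ from the calendar year near year boundaries. Python's stdlib exposes the same notion, which makes the behaviour easy to illustrate (this is a reference sketch, not the PR's code):

```python
from datetime import date

def isoyear(d: date) -> int:
    # First element of the ISO calendar triple (year, week, weekday).
    return d.isocalendar()[0]

isoyear(date(2021, 1, 1))  # 2020: Jan 1, 2021 falls in ISO week 53 of 2020
isoyear(date(2021, 7, 1))  # 2021: mid-year dates agree with the calendar year
```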
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

…ultiple runs (apache#19757)

## Which issue does this PR close?

- Closes apache#19756.

## Rationale for this change

The previous logic of the `EnsureCooperative` optimizer lacked context awareness regarding ancestor nodes, making it non-idempotent across multiple runs. Specifically, we need to ensure that:

1. **Idempotency**: running the rule multiple times does not produce nested `CooperativeExec` wrappers.
2. **Context awareness**: if a subtree is already protected by a `CooperativeExec`, we should skip it and not double-wrap its children.

## What changes are included in this PR?

To solve this, we cannot rely solely on `transform_up` (which lacks parent context) or `transform_down` (which makes safe mutation difficult). This PR adopts `transform_down_up` with a depth counter to strictly enforce that nodes are only wrapped when they are not currently under a `CooperativeExec` scope.

## Are these changes tested?

More unit test coverage

## Are there any user-facing changes?

No
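The ancestor-awareness idea can be sketched with a toy plan tree: carry an "already under a wrapper" flag down the recursion and only wrap nodes that are not covered. This is an illustrative sketch with made-up node types, not DataFusion's `ExecutionPlan` or the actual `transform_down_up` rule:

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def ensure_cooperative(node, under_coop=False):
    # Ancestor context: are we already inside a CooperativeExec scope?
    inside = under_coop or node.name == "CooperativeExec"
    node.children = [ensure_cooperative(c, inside) for c in node.children]
    if not inside and not node.children:  # wrap only unprotected leaves
        return Node("CooperativeExec", [node])
    return node

def count_wrappers(node):
    return (node.name == "CooperativeExec") + sum(
        count_wrappers(c) for c in node.children
    )

plan = ensure_cooperative(Node("union", [Node("scan_a"), Node("scan_b")]))
count_wrappers(plan)                      # 2: one wrapper per leaf
count_wrappers(ensure_cooperative(plan))  # still 2: re-running is a no-op
```

Without the `inside` flag, a second run would wrap each existing `CooperativeExec` again, which is exactly the nesting the PR guards against.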
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

…e#19683)

## Which issue does this PR close?

- Closes apache#19682

## Rationale for this change

Explained in the issue.

## What changes are included in this PR?

Change the `array_except`, `array_intersect` and `array_union` UDFs to return null if either input is null.

## Are these changes tested?

Added and fixed tests.

## Are there any user-facing changes?

Behaviour change to a function's output.
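The new behaviour is plain NULL propagation: a NULL input yields a NULL result rather than being treated as an empty array. A minimal sketch of the semantics using `array_intersect` as the example (Python lists stand in for Arrow arrays; this is not the UDF code):

```python
def array_intersect(left, right):
    if left is None or right is None:
        return None  # NULL input propagates to a NULL result
    seen, out = set(right), []
    for v in left:
        if v in seen and v not in out:  # keep first occurrence, dedup
            out.append(v)
    return out

array_intersect([1, 2, 3], [2, 3, 4])  # [2, 3]
array_intersect([1, 2], None)          # None, not []
```

`array_except` and `array_union` get the same NULL-in, NULL-out treatment.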
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

## Which issue does this PR close?

- Part of apache/datafusion-comet#2986.

## Rationale for this change

The current `trunc` implementation always converts scalar inputs to arrays via `make_scalar_function`, which introduces unnecessary overhead when processing single values.

## What changes are included in this PR?

- Add a scalar fast path for the `trunc` function to process Float32/Float64 scalar inputs directly
- Handle the optional precision argument for scalar inputs
- Add scalar benchmarks to measure performance

## Are these changes tested?

Yes, all sqllogictest tests pass.

## Benchmark Results

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| f64 scalar | 256 ns | 55 ns | **4.6x** |
| f32 scalar | 247 ns | 56 ns | **4.4x** |

## Are there any user-facing changes?

No

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
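The fast-path pattern is dispatch-by-shape: compute a scalar directly instead of packing it into a one-element array and back. An illustrative sketch with hypothetical names (`trunc_scalar`, `trunc`), not the PR's Rust code:

```python
import math

def trunc_scalar(x: float, precision: int = 0) -> float:
    # Truncate toward zero, keeping `precision` decimal places.
    factor = 10.0 ** precision
    return math.trunc(x * factor) / factor

def trunc(values, precision=0):
    if isinstance(values, float):  # scalar fast path: no array round-trip
        return trunc_scalar(values, precision)
    return [trunc_scalar(v, precision) for v in values]  # array path

trunc(3.14159, 2)   # 3.14
trunc([1.9, -1.9])  # [1.0, -1.0]: truncation goes toward zero
```

The array path still exists; the win is that constant arguments skip the allocation and conversion entirely.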
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

…ansion (apache#19832)

## Which issue does this PR close?

Follow-up to apache#19738

## Rationale for this change

The current hex implementation expands `DictionaryArray` inputs into a regular array, which loses the dictionary encoding and causes redundant hex computation for repeated values.

## What changes are included in this PR?

- Apply hex encoding only to the dictionary values
- Avoid expanding dictionary arrays during execution

### Benchmark

| Size | Before | After | Speedup |
| ---- | ------ | ----- | ------- |
| 1024 | 8.3 µs | 7.2 µs | 1.15× |
| 4096 | 42.9 µs | 34.5 µs | 1.24× |
| 8192 | 91.6 µs | 71.7 µs | 1.28× |

## Are these changes tested?

Yes. Existing unit tests and `sqllogictest` tests pass.

## Are there any user-facing changes?

No.
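A dictionary array stores each distinct value once plus an array of keys pointing into it, so applying hex to the values alone does the conversion once per distinct value instead of once per row. A toy sketch of that idea (plain Python lists standing in for Arrow's `DictionaryArray`; `hex_dictionary` is a made-up name):

```python
def hex_dictionary(keys, values):
    # One hex conversion per DISTINCT value...
    hexed = [format(v, "X") for v in values]
    # ...then rows are materialized through the untouched keys.
    return [hexed[k] for k in keys]

# 5 rows but only 2 distinct values -> 2 conversions, not 5.
hex_dictionary([0, 1, 0, 0, 1], [255, 16])  # ['FF', '10', 'FF', 'FF', '10']
```

The PR goes further and keeps the output dictionary-encoded too, so downstream operators retain the compact representation.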
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

## Which issue does this PR close?

- Closes apache#19822.
- Part of apache#15914

## Rationale for this change

The current `date_part` function in DataFusion has a few differences from the Spark implementation:

- day-of-week parts are 1-indexed in Spark but 0-indexed in DataFusion
- Spark supports a few more aliases for certain parts

Full list of Spark-supported aliases: https://github.com/apache/spark/blob/a03bedb6c1281c5263a42bfd20608d2ee005ab05/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L3356-L3371

## What changes are included in this PR?

A new `date_part` function in the spark crate.

## Are these changes tested?

Yes, with SLT

## Are there any user-facing changes?

Yes
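The day-of-week off-by-one is easy to get wrong because conventions disagree on both the base index and the first day. Spark's `dayofweek` is 1-based with Sunday = 1; Python's `weekday()` is 0-based with Monday = 0. A small sketch of the mapping (an illustration of the indexing difference, not DataFusion's implementation):

```python
from datetime import date

def spark_dayofweek(d: date) -> int:
    # Python weekday(): Mon=0..Sun=6  ->  Spark dayofweek: Sun=1..Sat=7
    return (d.weekday() + 1) % 7 + 1

spark_dayofweek(date(2026, 3, 1))  # 1: a Sunday
spark_dayofweek(date(2026, 3, 2))  # 2: a Monday
```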
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

…9829)

## Which issue does this PR close?

- Closes apache#19828.
- Part of apache#15914

## Rationale for this change

Implement the Spark functions:

- https://spark.apache.org/docs/latest/api/sql/index.html#trunc
- https://spark.apache.org/docs/latest/api/sql/index.html#date_trunc
- https://spark.apache.org/docs/latest/api/sql/index.html#time_trunc

## What changes are included in this PR?

Add Spark-compatible wrappers around the DataFusion `date_trunc` function to handle Spark specificities.

## Are these changes tested?

Yes, in SLT

## Are there any user-facing changes?

Yes
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

## Which issue does this PR close?

- Closes apache#19844
- Part of apache#15914

## Rationale for this change

Add support for the Spark https://spark.apache.org/docs/latest/api/sql/index.html#date_diff function

## Are these changes tested?

Yes, in SLT

## Are there any user-facing changes?

Yes
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

## Which issue does this PR close?

- Part of apache/datafusion-comet#2986

## Rationale for this change

The `round` function currently converts scalar inputs to arrays before processing, even when both the value and `decimal_places` are scalar values. This adds unnecessary overhead in constant-folding scenarios.

## What changes are included in this PR?

- Add a scalar fast path in `RoundFunc::invoke_with_args` for Float64 and Float32 inputs
- Directly compute the result when both inputs are scalars, avoiding array conversion overhead
- Add a benchmark

## Are these changes tested?

Yes

| Type | Before | After | Speedup |
| -- | -- | -- | -- |
| round_f64_scalar | 570 ns | 195 ns | 2.9x |
| round_f32_scalar | 564 ns | 192 ns | 2.9x |

## Are there any user-facing changes?

No
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

- Merged CTE reference resolution tests into the main cte.slt file
- Added a CREATE TABLE statement for the orders table used in tests
- Fixed 3 case-sensitivity issues: changed 'WHERE N < 10' to 'WHERE n < 10'

Fixes apache#19786

## Which issue does this PR close?

- Closes apache#19786

## Rationale for this change

This PR consolidates the small test file `cte_quoted_reference.slt` into the main `cte.slt` file, as requested in issue apache#19786. The separate file was small enough that it didn't need to be maintained independently, and consolidating it improves code organization and maintainability.

## What changes are included in this PR?

- Merged the CTE reference resolution tests from `cte_quoted_reference.slt` into `cte.slt`
- Added a `CREATE TABLE orders AS VALUES (1), (2);` statement to support the merged tests
- Fixed 3 pre-existing case-sensitivity bugs by changing `WHERE N < 10` to `WHERE n < 10` in recursive CTE tests
- Deleted `cte_quoted_reference.slt`, as it is now consolidated into the main file

## Are these changes tested?

Yes, all changes are tested:

- The consolidated tests from `cte_quoted_reference.slt` are now part of the `cte.slt` test suite
- All sqllogictest tests pass successfully: `cargo test --test sqllogictests -- cte`
- The case-sensitivity fixes ensure that field references match their definitions (lowercase `n`)

## Are there any user-facing changes?

No, there are no user-facing changes. This is purely a test file reorganization that improves code maintainability without affecting DataFusion's functionality or API.

Co-authored-by: Anjali Choudhary <anjalicy@amazon.com>
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

…e#19904)

## Which issue does this PR close?

- Part of apache#19798

## Rationale for this change

This is an edge case; we would prefer to fix it upstream in arrow-rs instead of having handling code here, so just disable the test for now.

- arrow-rs issue: apache/arrow-rs#9227

## What changes are included in this PR?

Disable the edge-case `array_union` SLT

## Are these changes tested?

Test-related change

## Are there any user-facing changes?

No.
krinart pushed a commit that referenced this pull request (Mar 3, 2026)

## Which issue does this PR close?

- Part of apache#15914
- Closes apache#19710

## Rationale for this change

Implementation of the Spark `add_months` function.

## Are these changes tested?

Yes

## Are there any user-facing changes?

Yes
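The subtlety in Spark's `add_months` is day clamping: when the source day does not exist in the target month, the result falls on the last day of that month. A stdlib sketch of those semantics (illustrative, not the PR's code):

```python
import calendar
from datetime import date

def add_months(d: date, months: int) -> date:
    # Count months since year 0, shift, then split back into year/month.
    total = d.year * 12 + (d.month - 1) + months
    year, month = divmod(total, 12)
    month += 1
    # Clamp the day to the length of the target month.
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

add_months(date(2021, 1, 31), 1)   # date(2021, 2, 28): day clamped
add_months(date(2021, 3, 15), -1)  # date(2021, 2, 15): negative offsets work
```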
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Part of apache/datafusion-comet#2986

## Rationale for this change

The signum function currently converts scalar inputs to arrays before processing, even for single scalar values. This adds unnecessary overhead from array allocation and conversion. Adding a scalar fast path avoids this overhead and improves performance for constant folding and scalar expression evaluation.

## What changes are included in this PR?

- Added scalar fast path for `float32` and `float64`

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **signum_f64_scalar** | 266 ns | 54 ns | **4.9x** |
| **signum_f32_scalar** | 263 ns | 55 ns | **4.8x** |

## Are these changes tested?

Yes

## Are there any user-facing changes?

No

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
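The scalar fast path described above can be sketched in plain Rust. The `Value` enum below is a hypothetical stand-in for DataFusion's `ColumnarValue` (the real UDF operates on Arrow arrays), but the dispatch shape is the same: a scalar input never touches an array.

```rust
// Minimal sketch of a scalar fast path, assuming a simplified `Value`
// type in place of DataFusion's `ColumnarValue`. Spark/DataFusion-style
// signum maps 0.0 to 0.0 (unlike `f64::signum`, which maps +0.0 to 1.0).
enum Value {
    ScalarF64(Option<f64>),
    ArrayF64(Vec<Option<f64>>),
}

fn signum_f64(x: f64) -> f64 {
    if x == 0.0 { 0.0 } else { x.signum() }
}

fn signum(input: Value) -> Value {
    match input {
        // Fast path: a single scalar is computed directly,
        // with no array allocation or conversion.
        Value::ScalarF64(v) => Value::ScalarF64(v.map(signum_f64)),
        // Array path: element-wise over the buffer, preserving nulls.
        Value::ArrayF64(vs) => {
            Value::ArrayF64(vs.into_iter().map(|v| v.map(signum_f64)).collect())
        }
    }
}

fn main() {
    match signum(Value::ScalarF64(Some(-3.5))) {
        Value::ScalarF64(v) => println!("{v:?}"), // Some(-1.0)
        _ => unreachable!(),
    }
}
```

The speedups in the table come precisely from taking the first `match` arm: constant folding evaluates the function once on a scalar, so skipping the array round-trip removes almost all of the cost.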
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
…ow frame queries (apache#19887)

## Which issue does this PR close?

- Closes apache#19612.

## Rationale for this change

The `DistinctMedianAccumulator::evaluate()` method was using `std::mem::take()`, which consumed the internal state, causing subsequent calls to return incorrect results. This was the last remaining item from apache#19612.

## What changes are included in this PR?

1. Changed `DistinctMedianAccumulator::evaluate()` to use `.iter()` instead of `std::mem::take()` to preserve internal state across multiple calls
2. Added sqllogictest case for distinct median with window frames

## Are these changes tested?

Yes, added a sqllogictest

## Are there any user-facing changes?

No
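The underlying bug pattern is easy to reproduce outside DataFusion: an accumulator whose `evaluate` uses `std::mem::take` returns a correct result once, then an empty result on every repeated call. The struct below is illustrative (a sum instead of a distinct median), not the actual DataFusion type:

```rust
// Illustrative accumulator; not the real DistinctMedianAccumulator.
struct SumAcc {
    values: Vec<i64>,
}

impl SumAcc {
    // Buggy: `std::mem::take` swaps the Vec out for an empty one,
    // so a second call to evaluate sees no state at all.
    fn evaluate_take(&mut self) -> i64 {
        std::mem::take(&mut self.values).into_iter().sum()
    }

    // Fixed: iterate without consuming, so repeated calls agree.
    // Window frame queries re-evaluate the accumulator per frame,
    // which is exactly the repeated-call pattern that exposed the bug.
    fn evaluate_iter(&self) -> i64 {
        self.values.iter().sum()
    }
}

fn main() {
    let mut acc = SumAcc { values: vec![1, 2, 3] };
    assert_eq!(acc.evaluate_iter(), 6);
    assert_eq!(acc.evaluate_iter(), 6); // stable across calls
    assert_eq!(acc.evaluate_take(), 6);
    assert_eq!(acc.evaluate_take(), 0); // state consumed: wrong on re-evaluation
    println!("ok");
}
```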
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Part of apache/datafusion-comet#2986.

## Rationale for this change

The cot function currently converts scalar inputs to arrays before processing, even for single scalar values. This adds unnecessary overhead from array allocation and conversion. Adding a scalar fast path avoids this overhead.

## What changes are included in this PR?

- Added scalar fast path
- Added benchmark
- Updated tests

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **cot_f64_scalar** | 229 ns | 67 ns | **3.4x** |
| **cot_f32_scalar** | 247 ns | 59 ns | **4.2x** |

## Are these changes tested?

## Are there any user-facing changes?
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Part of apache#18092

## Rationale for this change

The main value-add here is ensuring UDAFs encode their actual accepted types in their signature, instead of internally casting to the types they support from a wider signature. Also some drive-by refactoring to remove manual Debug impls.

## What changes are included in this PR?

See rationale.

## Are these changes tested?

Existing tests.

## Are there any user-facing changes?

No.
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Part of apache/datafusion-comet#2986.

## Rationale for this change

The `iszero` function currently converts scalar inputs to arrays before processing, even for single scalar values. This adds unnecessary overhead from array allocation and conversion. Adding a scalar fast path avoids this overhead and improves performance.

## What changes are included in this PR?

- Added scalar fast path for `Float16`, `Float32`, and `Float64` scalar inputs
- Updated tests

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **iszero f32 scalar** | 244 ns | 62 ns | **3.9x** |
| **iszero f64 scalar** | 245 ns | 62 ns | **4.0x** |

## Are these changes tested?

Yes

## Are there any user-facing changes?

No
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
…che#19819)

## Which issue does this PR close?

- Related to apache#19004 and apache#19809.

## What changes are included in this PR?

Change internal error to user-facing error when function type coercion fails. Add helpful hint when Binary types are used with string functions.

Before:

```
Internal error: Expect TypeSignatureClass::Native(...) but received NativeType::Binary, DataType: Binary
```

After:

```
Error: Function 'split_part' requires String, but received Binary (DataType: Binary).
Hint: Binary types are not automatically coerced to String. Use CAST(column AS VARCHAR) to convert Binary data to String.
```
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

N/A

## Rationale for this change

When hashing list arrays we hash all bytes of the child array, even if we slice to a certain range of values. Refactor to slice only the needed bytes; also some general refactoring.

## What changes are included in this PR?

Refactor list array hashing.

## Are these changes tested?

Added test.

## Are there any user-facing changes?

No.
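The slicing idea can be illustrated with plain offsets: hash only the child range each list row actually covers, not the whole child buffer. The struct below is a hypothetical simplification of Arrow's `ListArray` layout (one flat child buffer plus per-row offsets):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A list array flattened into one child buffer plus row offsets,
// mirroring Arrow's ListArray layout (hypothetical simplified form).
struct ListRows {
    child: Vec<i64>,
    offsets: Vec<usize>, // len = rows + 1
}

impl ListRows {
    // Hash each row by feeding only its own child slice to the hasher,
    // instead of hashing all bytes of the child buffer. Two arrays with
    // identical logical rows then hash identically, even when one is a
    // slice into a larger buffer.
    fn row_hashes(&self) -> Vec<u64> {
        self.offsets
            .windows(2)
            .map(|w| {
                let mut h = DefaultHasher::new();
                self.child[w[0]..w[1]].hash(&mut h);
                h.finish()
            })
            .collect()
    }
}

fn main() {
    let a = ListRows { child: vec![1, 2, 3, 4], offsets: vec![0, 2, 4] };
    // Same logical rows [1,2] and [3,4], embedded in a larger buffer:
    let b = ListRows { child: vec![9, 1, 2, 3, 4, 9], offsets: vec![1, 3, 5] };
    assert_eq!(a.row_hashes(), b.row_hashes());
    println!("ok");
}
```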
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
…pache#19909)

## Which issue does this PR close?

Part of: apache#15914

## Rationale for this change

Implement Spark-compatible unhex function: https://spark.apache.org/docs/latest/api/sql/index.html#unhex

## What changes are included in this PR?

## Are these changes tested?

Yes. UTs and SLT added.

## Are there any user-facing changes?

No.
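A stdlib-only sketch of the semantics involved (assuming, as Spark's unhex does, that an odd-length input is read with an implicit leading zero nibble and that any non-hex character makes the whole result NULL, here modeled as `None`):

```rust
// Sketch of Spark-style unhex. Assumptions are noted in the lead-in;
// the real DataFusion implementation works on Arrow string/binary arrays.
fn unhex(s: &str) -> Option<Vec<u8>> {
    let nib = |c: u8| -> Option<u8> {
        match c {
            b'0'..=b'9' => Some(c - b'0'),
            b'a'..=b'f' => Some(c - b'a' + 10),
            b'A'..=b'F' => Some(c - b'A' + 10),
            _ => None, // any invalid character nullifies the result
        }
    };
    let bytes = s.as_bytes();
    let mut out = Vec::with_capacity(bytes.len() / 2 + 1);
    let mut i = 0;
    if bytes.len() % 2 == 1 {
        out.push(nib(bytes[0])?); // implicit leading zero nibble
        i = 1;
    }
    while i < bytes.len() {
        out.push(nib(bytes[i])? << 4 | nib(bytes[i + 1])?);
        i += 2;
    }
    Some(out)
}

fn main() {
    assert_eq!(unhex("537061726B"), Some(b"Spark".to_vec()));
    assert_eq!(unhex("F"), Some(vec![0x0F])); // odd length: left-padded
    assert_eq!(unhex("GG"), None); // invalid hex digit
    println!("ok");
}
```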
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Closes apache#19912

## Rationale for this change

Just a couple of optimizations for hash table lookup usage in hash aggregate.

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Part of apache/datafusion-comet#2986.

## Rationale for this change

The `ascii` function currently converts scalar inputs to arrays before processing via `make_scalar_function`. Adding a scalar fast path avoids this overhead and improves performance.

## What changes are included in this PR?

1. Refactored `invoke_with_args` to use a `match` statement for handling both scalar and array inputs
2. Added scalar fast path for `Utf8`, `LargeUtf8`, and `Utf8View` scalar inputs

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **ascii/scalar_utf8** | 234 ns | 56 ns | **4.2x** |
| **ascii/scalar_utf8view** | 206 ns | 57 ns | **3.6x** |

## Are these changes tested?

Yes

## Are there any user-facing changes?

No
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Part of apache/datafusion-comet#2986.

## Rationale for this change

The `factorial` function currently converts scalar inputs to arrays before processing. Adding a scalar fast path avoids this overhead and improves performance.

## What changes are included in this PR?

1. Refactored `invoke_with_args` to use a `match` statement for handling both scalar and array inputs
2. Inlined array processing logic, removing `make_scalar_function` usage

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **factorial_scalar** | 244 ns | 102 ns | **2.4x** |

## Are these changes tested?

Yes, covered by existing SLT tests:
- `scalar.slt` lines 461-477
- `math.slt` line 797

## Are there any user-facing changes?
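For a sense of what the kernel behind the fast path computes, here is a stdlib-only sketch: an `i64` factorial with overflow checking, where out-of-range inputs are modeled as `None` (the real UDF surfaces an error instead):

```rust
// Sketch of a checked factorial kernel. With a scalar fast path, the
// `match` on a scalar input calls a kernel like this directly instead
// of building a one-element array first.
fn factorial(n: i64) -> Option<i64> {
    if n < 0 {
        return None; // undefined for negative inputs in this sketch
    }
    // checked_mul returns None on overflow, which try_fold propagates.
    (2..=n).try_fold(1i64, |acc, x| acc.checked_mul(x))
}

fn main() {
    assert_eq!(factorial(0), Some(1));
    assert_eq!(factorial(5), Some(120));
    assert_eq!(factorial(20), Some(2_432_902_008_176_640_000));
    assert_eq!(factorial(21), None); // 21! overflows i64
    println!("ok");
}
```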
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Closes apache#19947

## Rationale for this change

apache#17657 introduced a regression: it cloned the inner field in the execution path, but in the `return_type` function it still set nullability to true. Fix to ensure we maintain the inner field as is.

## What changes are included in this PR?

Change `return_type` to just pass through the input datatype as is. Also refactor away usage of a null buffer builder in favour of copying the input array's null buffer.

## Are these changes tested?

Added tests.

## Are there any user-facing changes?

No.
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Closes #.

## Rationale for this change

Prerequisite for the following PRs:

- apache#19760
- apache#19761

Even if the API on the `MemoryPool` does not require `&mut self` for growing/shrinking the reserved size, the API in `MemoryReservation` does, making simple implementations irrepresentable without synchronization primitives. For example, the following would require a `Mutex` for concurrent access to the `MemoryReservation` from different threads, even though the `MemoryPool` doesn't:

```rust
let mut stream: SendableRecordBatchStream = SendableRecordBatchStream::new();
let mem: Arc<MemoryReservation> = Arc::new(MemoryReservation::new_empty());

let mut builder = ReceiverStreamBuilder::new(10);
let tx = builder.tx();

{
    let mem = mem.clone();
    builder.spawn(async move {
        while let Some(msg) = stream.next().await {
            mem.try_grow(msg.unwrap().get_array_memory_size()); // ❌ `mem` is not mutable
            tx.send(msg).unwrap();
        }
    });
}

builder
    .build()
    .inspect_ok(|msg| mem.shrink(msg.get_array_memory_size())); // ❌ `mem` is not mutable
```

## What changes are included in this PR?

Make the methods in `MemoryReservation` take `&self` instead of `&mut self`, allowing concurrent shrinks/grows from different tasks on the same reservation.

## Are these changes tested?

Yes, by current tests.

## Are there any user-facing changes?

Users can now safely call methods of `MemoryReservation` from different tasks without synchronization primitives. This is a backwards-compatible API change: it will work out of the box for current users, although, depending on their clippy configuration, they might see new warnings about unused `mut`s in their codebase.
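The reason `&self` methods suffice is interior mutability: if the reservation's counter is an atomic, concurrent grows and shrinks need no `Mutex` and no `&mut`. The type below is an illustrative stand-in for that idea, not DataFusion's actual `MemoryReservation`:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Illustrative reservation with an atomic counter, showing how
// grow/shrink can take `&self` and still be safe across threads.
struct Reservation {
    size: AtomicUsize,
}

impl Reservation {
    fn new() -> Self {
        Self { size: AtomicUsize::new(0) }
    }
    fn grow(&self, n: usize) {
        self.size.fetch_add(n, Ordering::Relaxed);
    }
    fn shrink(&self, n: usize) {
        self.size.fetch_sub(n, Ordering::Relaxed);
    }
    fn size(&self) -> usize {
        self.size.load(Ordering::Relaxed)
    }
}

fn main() {
    let mem = Arc::new(Reservation::new());
    // Several tasks grow and shrink the same reservation concurrently,
    // exactly the pattern the PR's stream example needs.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let mem = Arc::clone(&mem);
            thread::spawn(move || {
                for _ in 0..1000 {
                    mem.grow(8);
                    mem.shrink(8);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(mem.size(), 0); // balanced grows/shrinks cancel out
    println!("ok");
}
```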
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Closes apache#20005

PR:

```
SELECT COUNT(*) FROM hits WHERE "AdvEngineID" <> 0;
Query 1 iteration 0 took 30.5 ms and returned 1 rows
Query 1 avg time: 30.48 ms
```

Main:

```
SELECT COUNT(*) FROM hits WHERE "AdvEngineID" <> 0;
Query 1 iteration 0 took 39.6 ms and returned 1 rows
Query 1 avg time: 39.61 ms
```

## Rationale for this change

Improving cold starts.

## What changes are included in this PR?

## Are these changes tested?

Existing tests

## Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Closes apache#19967
- Part of apache#15914

## Rationale for this change

Add Spark-compatible base64/unbase64 functions.

## What changes are included in this PR?

- New encoding mode in the DataFusion encoding UDF for padded base64
- Spark UDFs for base64/unbase64

## Are these changes tested?

Yes, in SLT.

## Are there any user-facing changes?

Yes
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
…ray_remove_all` functions (apache#19996)

## Which issue does this PR close?

- Part of apache/datafusion-comet#2986

## Rationale for this change

The current implementation of `general_remove` is based on `filter + concat`, which creates intermediate arrays for each list row and can be relatively expensive. This PR introduces an alternative implementation based on `MutableArrayData`, which copies contiguous ranges from the original values buffer directly into the output array. The new approach is semantically equivalent to the existing implementation but reduces intermediate allocations and per-element overhead.

## What changes are included in this PR?

- Replaced `general_remove`'s filter-based implementation with `MutableArrayData` for more efficient memory usage.
- Optimized the removal process by adding fast paths for rows where no matching elements need removal.

### Benchmark

```
group                                       after                      before
-----                                       -----                      ------
array_remove_binary/remove/10               1.00   4.6±0.14ms ? ?/sec  2.41  11.0±0.34ms ? ?/sec
array_remove_binary/remove/100              1.00   8.5±0.19ms ? ?/sec  1.95  16.6±0.42ms ? ?/sec
array_remove_binary/remove/500              1.00  35.9±0.78ms ? ?/sec  1.43  51.4±1.10ms ? ?/sec
array_remove_boolean/remove/10              1.00   3.7±0.05ms ? ?/sec  3.23  11.8±0.30ms ? ?/sec
array_remove_boolean/remove/100             1.00   8.1±0.15ms ? ?/sec  2.18  17.6±0.35ms ? ?/sec
array_remove_boolean/remove/500             1.00  26.6±0.43ms ? ?/sec  1.52  40.3±0.81ms ? ?/sec
array_remove_decimal64/remove/10            1.00   3.9±0.07ms ? ?/sec  2.41   9.4±0.18ms ? ?/sec
array_remove_decimal64/remove/100           1.00   6.7±0.19ms ? ?/sec  2.12  14.2±0.34ms ? ?/sec
array_remove_decimal64/remove/500           1.00  40.3±0.75ms ? ?/sec  1.52  61.1±1.46ms ? ?/sec
array_remove_f64/remove/10                  1.00   3.8±0.10ms ? ?/sec  1.32   5.0±0.16ms ? ?/sec
array_remove_f64/remove/100                 1.00   4.8±0.34ms ? ?/sec  1.24   5.9±0.18ms ? ?/sec
array_remove_f64/remove/500                 1.00  22.3±0.68ms ? ?/sec  1.15  25.5±0.86ms ? ?/sec
array_remove_fixed_size_binary/remove/10    1.00   4.7±0.09ms ? ?/sec  1.52   7.2±0.26ms ? ?/sec
array_remove_fixed_size_binary/remove/100   1.00   8.0±0.32ms ? ?/sec  1.40  11.1±0.40ms ? ?/sec
array_remove_fixed_size_binary/remove/500   1.00  45.4±0.97ms ? ?/sec  1.16  52.6±1.43ms ? ?/sec
array_remove_int64/remove/10                1.00   3.9±0.11ms ? ?/sec  2.24   8.8±0.24ms ? ?/sec
array_remove_int64/remove/100               1.00   5.5±0.18ms ? ?/sec  2.32  12.8±0.44ms ? ?/sec
array_remove_int64/remove/500               1.00  25.5±1.06ms ? ?/sec  1.61  40.9±1.25ms ? ?/sec
array_remove_strings/remove/10              1.00   4.5±0.10ms ? ?/sec  2.41  10.9±0.28ms ? ?/sec
array_remove_strings/remove/100             1.00   8.5±0.37ms ? ?/sec  2.00  17.0±0.71ms ? ?/sec
array_remove_strings/remove/500             1.00  35.9±0.84ms ? ?/sec  1.48  53.1±1.91ms ? ?/sec
```

## Are these changes tested?

Yes. Existing SLT tests for `array` continue to pass without modification. Benchmarks were added.

## Are there any user-facing changes?

No.

---------

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
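The range-copy idea can be shown on a flat `Vec`: instead of filtering element by element, copy contiguous runs of non-matching values in bulk. This is a hypothetical simplification of what `MutableArrayData` does with Arrow buffers:

```rust
// Sketch of removal by copying contiguous ranges. Each run of values
// between matches is flushed with a single extend_from_slice, which is
// the Vec analogue of MutableArrayData's bulk range copies.
fn remove_all(values: &[i64], target: i64) -> Vec<i64> {
    let mut out = Vec::with_capacity(values.len());
    let mut run_start = 0;
    for (i, v) in values.iter().enumerate() {
        if *v == target {
            // Flush the contiguous run before the match in one copy.
            out.extend_from_slice(&values[run_start..i]);
            run_start = i + 1;
        }
    }
    // Fast path falls out naturally: if nothing matched, this single
    // call copies the entire input in one shot.
    out.extend_from_slice(&values[run_start..]);
    out
}

fn main() {
    assert_eq!(remove_all(&[1, 2, 2, 3, 2, 4], 2), vec![1, 3, 4]);
    assert_eq!(remove_all(&[5, 5], 5), Vec::<i64>::new());
    println!("ok");
}
```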
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- Part of apache/datafusion-comet#2986.

## Rationale for this change

The `repeat` function currently converts scalar inputs to arrays before processing via `make_scalar_function`. Adding a scalar fast path avoids this overhead and improves performance for constant folding and scalar expression evaluation.

## What changes are included in this PR?

1. Refactored `invoke_with_args` to handle scalar inputs directly without array conversion
2. Added scalar fast path for `Utf8`, `LargeUtf8`, and `Utf8View` scalar inputs
3. Added array fast path that skips per-element null checks when `null_count() == 0`

| Type | Before | After | Speedup |
|------|--------|-------|---------|
| **repeat/scalar_utf8** | 519 ns | 91 ns | **5.7x** |
| **repeat/scalar_utf8view** | 437 ns | 91 ns | **4.8x** |

## Are these changes tested?

Yes

## Are there any user-facing changes?

No
krinart
pushed a commit
that referenced
this pull request
Mar 3, 2026
## Which issue does this PR close?

- This PR is part of the [Utf8View support](apache#10918) epic. It adds `Utf8View` support in the Spark-compat layer.

## Rationale for this change

Our internal project supports only `Utf8View` _(because of design constraints)_, and the current implementation of `SparkConcat` only supports `Utf8`. The `SparkConcat` function should accept `Utf8View` and mixed string types, in line with the main DataFusion concat. This PR adds that support and follows the same patterns as [DataFusion's concat](https://github.com/apache/datafusion/blob/main/datafusion/functions/src/string/concat.rs).

It prevents errors like:

> The type of Utf8 AND Utf8View of like physical should be same.
> This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues

from a query like:

```sql
select i_item_sk, item_info
from (select i_item_sk, CONCAT('Item: ', i_item_desc) as item_info from item) sub
where item_info LIKE 'Item: Electronic%'
order by 1;
```

## What changes are included in this PR?

- Extend the type signature to accept `Utf8View` in addition to `Utf8` and `LargeUtf8` via `TypeSignature::Variadic(vec![Utf8View, Utf8, LargeUtf8])`, matching DataFusion's concat.
- In `return_field_from_args`, compute the result type with precedence `Utf8View` > `LargeUtf8` > `Utf8`.
- In `spark_concat`, handle `Utf8View` and `LargeUtf8` in the scalar paths (zero-argument and all-NULL).

## Are these changes tested?

Yes.

- Unit tests: `cargo test --package datafusion-spark function::string::concat::tests`, including `test_concat_utf8view`.
- Sqllogictest: `spark/string/concat.slt` includes a "**Utf8View: no extra CAST in plan**" case that uses `EXPLAIN` and a temporary table to ensure no extra CASTs appear when using `arrow_cast(..., 'Utf8View')` with table columns.

## Are there any user-facing changes?

- **API:** `SparkConcat`'s signature is extended to include `Utf8View` in the variadic list. No breaking changes.
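The result-type precedence described above can be sketched as follows. This is a minimal illustration only; the enum and function names here are hypothetical and do not reflect DataFusion's actual API.

```rust
// Hypothetical sketch of the Utf8View > LargeUtf8 > Utf8 result-type
// precedence described above. Names are illustrative, not DataFusion's API.

#[derive(Debug, PartialEq, Clone, Copy)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

fn pick_concat_return_type(args: &[StrType]) -> StrType {
    if args.contains(&StrType::Utf8View) {
        // Any Utf8View argument promotes the whole result to Utf8View.
        StrType::Utf8View
    } else if args.contains(&StrType::LargeUtf8) {
        StrType::LargeUtf8
    } else {
        StrType::Utf8
    }
}

fn main() {
    use StrType::*;
    assert_eq!(pick_concat_return_type(&[Utf8, Utf8View]), Utf8View);
    assert_eq!(pick_concat_return_type(&[Utf8, LargeUtf8]), LargeUtf8);
    assert_eq!(pick_concat_return_type(&[Utf8, Utf8]), Utf8);
}
```

This mirrors how mixed `Utf8`/`Utf8View` inputs avoid the "type of Utf8 AND Utf8View ... should be same" error: the widest view type wins and no extra casts are needed.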
Which issue does this PR close?
Related to this PR:
Fixes named argument resolution to follow PostgreSQL semantics, where named arguments can target any parameter position without requiring intermediate optional positions to be filled.
Problem
Before this fix, `resolve_function_arguments` required positions `0..N` to be filled when N arguments were provided. This incorrectly rejected valid PostgreSQL-style calls. The motivation for this in the existing DataFusion logic is that DataFusion provides no default mechanism to attach arguments to specific supported parameters, so they must be supplied consecutively (the first N parameters). Spice implements its own mechanism for this (extra metadata) and can benefit from the full PostgreSQL dialect.
This PR updates the ordering logic to only validate that positional arguments fill consecutive slots from position 0; named arguments can target any parameter position. This does NOT break existing functionality (nothing changes if arguments are specified in the more conservative way), but it unblocks the Spice scenario:
Example
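The relaxed rule can be sketched as follows: positional arguments must still fill consecutive slots starting at 0, while named arguments may target any parameter position, skipping over unfilled optional positions. This is an illustrative sketch only; the function and parameter names are hypothetical, not the actual `resolve_function_arguments` implementation.

```rust
// Hypothetical sketch of the relaxed resolution rule: positional args fill
// slots 0..N consecutively; named args may target any remaining slot, and
// gaps (unfilled optional positions) are allowed.

fn resolve_arguments(
    params: &[&str],        // declared parameter names, in signature order
    positional: &[&str],    // positional argument values
    named: &[(&str, &str)], // (name, value) pairs
) -> Result<Vec<Option<String>>, String> {
    let mut slots: Vec<Option<String>> = vec![None; params.len()];

    // Positional arguments fill slots 0..positional.len() consecutively.
    for (i, v) in positional.iter().enumerate() {
        if i >= params.len() {
            return Err("too many positional arguments".into());
        }
        slots[i] = Some(v.to_string());
    }

    // Named arguments may target any slot; intermediate gaps stay empty.
    for (name, v) in named {
        let idx = params
            .iter()
            .position(|p| p == name)
            .ok_or_else(|| format!("unknown argument name: {name}"))?;
        if slots[idx].is_some() {
            return Err(format!("argument {name} provided twice"));
        }
        slots[idx] = Some(v.to_string());
    }

    Ok(slots)
}

fn main() {
    // `b` is skipped entirely; `c` is set by name - valid under the new rule.
    let resolved = resolve_arguments(&["a", "b", "c"], &["1"], &[("c", "3")]).unwrap();
    assert_eq!(resolved, vec![Some("1".into()), None, Some("3".into())]);

    // The old rule would have rejected this call because slot 1 is unfilled.
}
```

Under the previous logic, any `None` gap below the highest filled position was treated as an error; the fix makes such gaps legal as long as the positional prefix is contiguous.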
Important:
DataFusion only validates argument names and reorders arguments into signature order. Functions are responsible for their own required/optional argument validation and must use metadata or other mechanisms to identify named arguments when positions are skipped.
PostgreSQL Reference
From PostgreSQL Docs - Calling Functions: