Panic in datafusion_expr::window_state::WindowAggState::update #16308

@andygrove (Member)

Describe the bug

Upgrading Comet to use 48.0.0-rc2 causes tests to fail with an "attempt to subtract with overflow" panic. This did not happen with rc1. I have not yet debugged this to find the root cause.

PR: apache/datafusion-comet#1853

failing build: https://github.com/apache/datafusion-comet/actions/runs/15491877086/job/43619110943?pr=1853

The relevant part of the stack trace is:

2025-06-06T13:57:54.1903145Z         at datafusion_expr::window_state::WindowAggState::update(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/expr/src/window_state.rs:95)
2025-06-06T13:57:54.1905310Z         at datafusion_physical_expr::window::window_expr::AggregateWindowExpr::aggregate_evaluate_stateful(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-expr/src/window/window_expr.rs:260)
2025-06-06T13:57:54.1920612Z         at <datafusion_physical_expr::window::aggregate::PlainAggregateWindowExpr as datafusion_physical_expr::window::window_expr::WindowExpr>::evaluate_stateful(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-expr/src/window/aggregate.rs:148)
2025-06-06T13:57:54.1924024Z         at datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::compute_aggregates(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:983)
2025-06-06T13:57:54.1927398Z         at datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::poll_next_inner(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:1033)
2025-06-06T13:57:54.1930653Z         at <datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream as futures_core::stream::Stream>::poll_next(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:949)
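
For readers unfamiliar with this panic: "attempt to subtract with overflow" is the message Rust emits when an integer subtraction underflows while overflow checks are enabled (the default for debug builds). The expression at window_state.rs:95 is not shown in this issue, so the following is only a minimal sketch of the failure class and of the usual non-panicking alternatives. The function names are hypothetical and are not DataFusion APIs.

```rust
/// Hypothetical helper (not DataFusion code): distance between two row
/// offsets using plain subtraction. With overflow checks enabled this
/// panics with "attempt to subtract with overflow" whenever `start > end`.
pub fn span_unchecked(end: usize, start: usize) -> usize {
    end - start
}

/// Same computation, but an underflow is reported as `None` so the caller
/// can turn it into an error instead of a panic.
pub fn span_checked(end: usize, start: usize) -> Option<usize> {
    end.checked_sub(start)
}

/// Same computation, but an underflow is clamped to zero. This is only
/// appropriate when an empty span is a meaningful result.
pub fn span_saturating(end: usize, start: usize) -> usize {
    end.saturating_sub(start)
}
```

Whether `checked_sub` or `saturating_sub` would be the right choice here depends on whether `start > end` is a legitimate state or a sign of an earlier bookkeeping bug; this issue does not settle that.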

There was one PR between rc1 and rc2 specifically related to evaluating window expressions, so I wonder if that is the issue. I will try and confirm.

#16234

Full stack trace:

2025-06-06T13:57:54.1864287Z - aggregate window function for all types *** FAILED *** (406 milliseconds)
2025-06-06T13:57:54.1871363Z   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2045.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2045.0 (TID 5401) (62bae2d9d85a executor driver): org.apache.comet.CometNativeException: attempt to subtract with overflow
2025-06-06T13:57:54.1873529Z         at comet::errors::init::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/errors.rs:151)
2025-06-06T13:57:54.1883399Z         at <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/alloc/src/boxed.rs:1980)
2025-06-06T13:57:54.1894489Z         at std::panicking::rust_panic_with_hook(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:841)
2025-06-06T13:57:54.1895884Z         at std::panicking::begin_panic_handler::{{closure}}(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:699)
2025-06-06T13:57:54.1897662Z         at std::sys::backtrace::__rust_end_short_backtrace(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/sys/backtrace.rs:168)
2025-06-06T13:57:54.1899012Z         at __rustc::rust_begin_unwind(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:697)
2025-06-06T13:57:54.1900180Z         at core::panicking::panic_fmt(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/core/src/panicking.rs:75)
2025-06-06T13:57:54.1901495Z         at core::panicking::panic_const::panic_const_sub_overflow(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/core/src/panicking.rs:178)
2025-06-06T13:57:54.1903145Z         at datafusion_expr::window_state::WindowAggState::update(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/expr/src/window_state.rs:95)
2025-06-06T13:57:54.1905310Z         at datafusion_physical_expr::window::window_expr::AggregateWindowExpr::aggregate_evaluate_stateful(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-expr/src/window/window_expr.rs:260)
2025-06-06T13:57:54.1920612Z         at <datafusion_physical_expr::window::aggregate::PlainAggregateWindowExpr as datafusion_physical_expr::window::window_expr::WindowExpr>::evaluate_stateful(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-expr/src/window/aggregate.rs:148)
2025-06-06T13:57:54.1924024Z         at datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::compute_aggregates(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:983)
2025-06-06T13:57:54.1927398Z         at datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::poll_next_inner(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:1033)
2025-06-06T13:57:54.1930653Z         at <datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream as futures_core::stream::Stream>::poll_next(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:949)
2025-06-06T13:57:54.1933599Z         at <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-core-0.3.31/src/stream.rs:130)
2025-06-06T13:57:54.1935713Z         at futures_util::stream::stream::StreamExt::poll_next_unpin(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/mod.rs:1638)
2025-06-06T13:57:54.1938604Z         at <datafusion_physical_plan::projection::ProjectionStream as futures_core::stream::Stream>::poll_next(/usr/local/cargo/git/checkouts/datafusion-11a8b534adb6bd68/85f6621/datafusion/physical-plan/src/projection.rs:354)
2025-06-06T13:57:54.1940894Z         at <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-core-0.3.31/src/stream.rs:130)
2025-06-06T13:57:54.1942871Z         at futures_util::stream::stream::StreamExt::poll_next_unpin(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/mod.rs:1638)
2025-06-06T13:57:54.1945055Z         at <futures_util::stream::stream::next::Next<St> as core::future::future::Future>::poll(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/next.rs:32)
2025-06-06T13:57:54.1947663Z         at futures_util::future::future::FutureExt::poll_unpin(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/future/future/mod.rs:558)
2025-06-06T13:57:54.1949835Z         at <futures_util::async_await::poll::PollOnce<F> as core::future::future::Future>::poll(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/async_await/poll.rs:37)
2025-06-06T13:57:54.1952041Z         at comet::execution::jni_api::Java_org_apache_comet_Native_executePlan::{{closure}}::{{closure}}::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/execution/jni_api.rs:438)
2025-06-06T13:57:54.1954070Z         at tokio::runtime::park::CachedParkThread::block_on::{{closure}}(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/park.rs:284)
2025-06-06T13:57:54.1955846Z         at tokio::task::coop::with_budget(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/task/coop/mod.rs:167)
2025-06-06T13:57:54.1957636Z         at tokio::task::coop::budget(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/task/coop/mod.rs:133)
2025-06-06T13:57:54.1959325Z         at tokio::runtime::park::CachedParkThread::block_on(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/park.rs:284)
2025-06-06T13:57:54.1961375Z         at tokio::runtime::context::blocking::BlockingRegionGuard::block_on(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/context/blocking.rs:66)
2025-06-06T13:57:54.1963697Z         at tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/scheduler/multi_thread/mod.rs:87)
2025-06-06T13:57:54.1965887Z         at tokio::runtime::context::runtime::enter_runtime(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/context/runtime.rs:65)
2025-06-06T13:57:54.1968188Z         at tokio::runtime::scheduler::multi_thread::MultiThread::block_on(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/scheduler/multi_thread/mod.rs:86)
2025-06-06T13:57:54.1970246Z         at tokio::runtime::runtime::Runtime::block_on_inner(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/runtime.rs:358)
2025-06-06T13:57:54.1972087Z         at tokio::runtime::runtime::Runtime::block_on(/usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.45.1/src/runtime/runtime.rs:330)
2025-06-06T13:57:54.1974189Z         at comet::execution::jni_api::Java_org_apache_comet_Native_executePlan::{{closure}}::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/execution/jni_api.rs:438)
2025-06-06T13:57:54.1975895Z         at comet::execution::tracing::with_trace(/__w/datafusion-comet/datafusion-comet/native/core/src/execution/tracing.rs:117)
2025-06-06T13:57:54.1977694Z         at comet::execution::jni_api::Java_org_apache_comet_Native_executePlan::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/execution/jni_api.rs:395)
2025-06-06T13:57:54.1979212Z         at comet::errors::curry::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/errors.rs:485)
2025-06-06T13:57:54.1980462Z         at std::panicking::try::do_call(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:589)
2025-06-06T13:57:54.1981370Z         at __rust_try(__internal__:0)
2025-06-06T13:57:54.1982193Z         at std::panicking::try(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panicking.rs:552)
2025-06-06T13:57:54.1983410Z         at std::panic::catch_unwind(/rustc/17067e9ac6d7ecb70e50f92c1944e545188d2359/library/std/src/panic.rs:359)
2025-06-06T13:57:54.1984614Z         at comet::errors::try_unwrap_or_throw(/__w/datafusion-comet/datafusion-comet/native/core/src/errors.rs:499)
2025-06-06T13:57:54.1985938Z         at Java_org_apache_comet_Native_executePlan(/__w/datafusion-comet/datafusion-comet/native/core/src/execution/jni_api.rs:375)
2025-06-06T13:57:54.1987315Z         at <unknown>(__internal__:0)

To Reproduce

No response

Expected behavior

No response

Additional context

No response

Activity

andygrove (Member, Author) commented on Jun 6, 2025

I also see a correctness issue in another test related to windowed aggregates:

2025-06-06T14:15:31.2495550Z [info] - postgreSQL/window_part1.sql *** FAILED *** (4 seconds, 628 milliseconds)
2025-06-06T14:15:31.2496326Z [info]   postgreSQL/window_part1.sql
2025-06-06T14:15:31.2496774Z [info]   Expected "...10
2025-06-06T14:15:31.2498107Z [info]   10
2025-06-06T14:15:31.2499258Z [info]   10
2025-06-06T14:15:31.2501018Z [info]   10
2025-06-06T14:15:31.2502765Z [info]   10
2025-06-06T14:15:31.2503832Z [info]   10
2025-06-06T14:15:31.2505296Z [info]   10[]", but got "...10
2025-06-06T14:15:31.2506923Z [info]   10
2025-06-06T14:15:31.2507701Z [info]   10
2025-06-06T14:15:31.2509163Z [info]   10
2025-06-06T14:15:31.2510556Z [info]   10
2025-06-06T14:15:31.2515385Z [info]   10
2025-06-06T14:15:31.2515856Z [info]   10[
2025-06-06T14:15:31.2516281Z [info]   20
2025-06-06T14:15:31.2517702Z [info]   20
2025-06-06T14:15:31.2520319Z [info]   20
2025-06-06T14:15:31.2521616Z [info]   20
2025-06-06T14:15:31.2523696Z [info]   20
2025-06-06T14:15:31.2525531Z [info]   20
2025-06-06T14:15:31.2527524Z [info]   20
2025-06-06T14:15:31.2529429Z [info]   20
2025-06-06T14:15:31.2536478Z [info]   20
2025-06-06T14:15:31.2536985Z [info]   20]" Result did not match for query #2
2025-06-06T14:15:31.2538025Z [info]   SELECT COUNT(*) OVER () FROM tenk1 WHERE unique2 < 10 (SQLQueryTestSuite.scala:663)
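
To make the expected result concrete: with an empty OVER () clause the whole filtered input forms a single partition, so COUNT(*) should return the total row count on every row. The sketch below is a toy model of that semantics only, assuming the filter leaves 10 rows as the expected output suggests. The function name is hypothetical and it does not reflect how DataFusion evaluates window expressions.

```rust
/// Toy model of `COUNT(*) OVER ()`: every input row receives the total
/// number of rows, because an empty OVER clause puts all rows in one
/// partition.
pub fn count_star_over_all<T>(rows: &[T]) -> Vec<usize> {
    vec![rows.len(); rows.len()]
}
```

Under this model, 10 input rows produce 10 output rows that each contain 10. The failing run instead also produced rows containing 20.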

andygrove (Member, Author) commented on Jun 6, 2025

I confirmed that reverting #16234 fixes the issue.

alamb (Contributor) commented on Jun 6, 2025

We reverted the change in DF 48:

FYI @suibianwanwank would you be willing to take this issue?

Label added on Jun 6, 2025: regression (Something that used to work no longer does)

alamb (Contributor) commented on Jun 6, 2025

I also added this ticket to the list of things we need to do on DataFusion 49 prior to release

suibianwanwank (Contributor) commented on Jun 10, 2025

> FYI @suibianwanwank would you be willing to take this issue?

Sure, I'd be happy to take a look. Things have been a bit busy on my end, but I’ll review it over the weekend.

alamb (Contributor) commented on Jun 10, 2025

Thank you @suibianwanwank

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working), regression (Something that used to work no longer does), spark
Participants: @findepi, @alamb, @andygrove, @suibianwanwank