Chaos Mesh networkchaos meta inject barrier failed #16884

Closed
huangjw806 opened this issue May 22, 2024 · 1 comment · Fixed by #16901
Labels: found-by-chaos-mesh, type/bug

Comments

@huangjw806
Contributor

The nightly-20240521 image hit this issue many times, so it is likely a newly introduced bug.

================================================================================
chaos-mesh Result
================================================================================
Result               FAIL                
Pipeline Message     Nightly nexmark     
Namespace            longcmkf-20240521-153227
TestBed              medium-arm-3cn-all-affinity
RW Version           nightly-20240521    
Test Start time      2024-05-21 15:35:48 
Test End time        2024-05-21 15:54:02 
Test Queries         q0,q1,q2,q3,q4,q5,q7,q8,q9,q10,q14,q15,q16,q17,q18,q20,q21,q22,q101,q102,q103,q104,q105
Grafana Metric       https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&var-namespace=longcmkf-20240521-153227&from=1716305748000&to=1716306842000
Grafana Logs         https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&var-namespace=longcmkf-20240521-153227&from=1716305748000&to=1716306842000
Memory Dumps         https://s3.console.aws.amazon.com/s3/buckets/test-useast1-mgmt-bucket-archiver?region=us-east-1&bucketType=general&prefix=k8s/longcmkf-20240521-153227/&showversions=false
Buildkite Job        https://buildkite.com/risingwave-test/chaos-mesh/builds/843
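
As a quick cross-check of the report above, the `from` and `to` parameters in the Grafana links look like Unix timestamps in milliseconds. The sketch below decodes them, assuming the report's start and end times are in UTC; the helper name `grafana_window` is made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

# Grafana Metric link copied verbatim from the report above.
GRAFANA_URL = (
    "https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard"
    "?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a"
    "&var-namespace=longcmkf-20240521-153227"
    "&from=1716305748000&to=1716306842000"
)


def grafana_window(url):
    """Return the `from` and `to` query parameters as UTC timestamp strings."""
    query = parse_qs(urlparse(url).query)

    def to_utc(ms):
        # Assumption: the values are epoch milliseconds.
        moment = datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)
        return moment.strftime("%Y-%m-%d %H:%M:%S")

    return to_utc(query["from"][0]), to_utc(query["to"][0])


print(grafana_window(GRAFANA_URL))
# -> ('2024-05-21 15:35:48', '2024-05-21 15:54:02')
```

Under that assumption, the decoded window matches the Test Start time and Test End time rows, so the dashboards are already scoped to this run.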

Meta log:

2024-05-21T17:30:29.046813024Z ERROR risingwave_meta::stream::stream_manager: failed to run drop command error=failed to inject barrier

Backtrace:
   0: std::backtrace_rs::backtrace::libunwind::trace
             at ./rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
   1: std::backtrace_rs::backtrace::trace_unsynchronized
             at ./rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: std::backtrace::Backtrace::create
             at ./rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/backtrace.rs:331:13
   3: anyhow::context::<impl anyhow::Context<T,core::convert::Infallible> for core::option::Option<T>>::context
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.81/src/context.rs:99:54
   4: risingwave_meta::barrier::schedule::BarrierScheduler::run_multiple_commands::{{closure}}
             at ./risingwave/src/meta/src/barrier/schedule.rs:283:24
   5: risingwave_meta::barrier::schedule::BarrierScheduler::run_command::{{closure}}
             at ./risingwave/src/meta/src/barrier/schedule.rs:325:14
   6: risingwave_meta::stream::stream_manager::GlobalStreamManager::drop_streaming_jobs_v2::{{closure}}
             at ./risingwave/src/meta/src/stream/stream_manager.rs:571:18
   7: risingwave_meta::rpc::ddl_controller_v2::<impl risingwave_meta::rpc::ddl_controller::DdlController>::drop_object::{{closure}}
             at ./risingwave/src/meta/src/rpc/ddl_controller_v2.rs:427:14
   8: risingwave_meta::rpc::ddl_controller::DdlController::drop_streaming_job::{{closure}}
             at ./risingwave/src/meta/src/rpc/ddl_controller.rs:1285:22
   9: risingwave_meta::rpc::ddl_controller::DdlController::run_command::{{closure}}::{{closure}}
             at ./risingwave/src/meta/src/rpc/ddl_controller.rs:311:26
  10: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
  11: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
  12: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:328:17
  13: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/loom/std/unsafe_cell.rs:16:9
  14: tokio::runtime::task::core::Core<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:317:30
  15: tokio::runtime::task::harness::poll_future::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:485:19
  16: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/panic/unwind_safe.rs:272:9
  17: std::panicking::try::do_call
             at ./rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:552:40
  18: std::panicking::try
             at ./rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:516:19
  19: std::panic::catch_unwind
             at ./rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panic.rs:146:14
  20: tokio::runtime::task::harness::poll_future
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:473:18
  21: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:208:27
  22: tokio::runtime::task::harness::Harness<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:153:15
  23: tokio::runtime::task::raw::RawTask::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/raw.rs:201:18
  24: tokio::runtime::task::LocalNotified<S>::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/mod.rs:427:9
  25: tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:639:22
  26: tokio::runtime::coop::with_budget
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/coop.rs:107:5
  27: tokio::runtime::coop::budget
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/coop.rs:73:5

@huangjw806 huangjw806 added the type/bug Something isn't working label May 22, 2024
@github-actions github-actions bot added this to the release-1.10 milestone May 22, 2024
@yezizp2012 yezizp2012 self-assigned this May 22, 2024
@yezizp2012
Contributor

It seems the cluster never recovered: all compute nodes were kicked out and never registered back with meta. Will keep investigating.

2024-05-21T15:54:00.712888564Z  INFO failure_recovery{error=Hummock error: SST 180 is invalid prev_epoch=6492695734583296}:recovery_attempt: risingwave_meta::barrier::recovery: recovering mview progress
2024-05-21T15:54:00.713464343Z  INFO failure_recovery{error=Hummock error: SST 180 is invalid prev_epoch=6492695734583296}:recovery_attempt: risingwave_meta::barrier::recovery: recovered mview progress
2024-05-21T15:54:00.761337526Z  WARN failure_recovery{error=Hummock error: SST 180 is invalid prev_epoch=6492695734583296}:recovery_attempt: risingwave_meta::barrier::recovery: scale actors failed error=No schedulable ParallelUnits available for fragment 3
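
For anyone sifting through the full meta log, here is a rough sketch of splitting lines like the ones above into timestamp, level, span, target, and message. The regular expression is inferred only from the log lines quoted in this issue, not from any documented log format, and `parse_line` is a hypothetical helper.

```python
import re

# Layout assumed from the quoted lines:
#   <timestamp> <LEVEL> [<span context>: ]<rust::module::path>: <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+"
    r"(?:(?P<span>.+?): )?"            # optional span context, e.g. failure_recovery{...}
    r"(?P<target>\w+(?:::\w+)+): "     # module path such as risingwave_meta::barrier::recovery
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Return the named parts of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


# The WARN line quoted above, copied verbatim.
SAMPLE = (
    "2024-05-21T15:54:00.761337526Z  WARN failure_recovery{error=Hummock error: "
    "SST 180 is invalid prev_epoch=6492695734583296}:recovery_attempt: "
    "risingwave_meta::barrier::recovery: scale actors failed "
    "error=No schedulable ParallelUnits available for fragment 3"
)

parsed = parse_line(SAMPLE)
print(parsed["level"], "|", parsed["target"], "|", parsed["msg"])
```

Filtering on `level == "WARN"` or `"ERROR"` within the `failure_recovery` span is one way to see whether every recovery attempt fails at the same step.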
