CI Failure (4GB allocation) in RandomNodeOperationsTest.test_node_operations #18130

Closed
vbotbuildovich opened this issue Apr 28, 2024 · 7 comments
Labels: area/replication, auto-triaged, ci-failure, ci-rca/redpanda, sev/high

Comments

@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 28, 2024

https://buildkite.com/redpanda/vtools/builds/13269

Module: rptest.tests.random_node_operations_test
Class: RandomNodeOperationsTest
Method: test_node_operations
Arguments: {
    "num_to_upgrade": 3,
    "enable_failures": true,
    "with_tiered_storage": false
}
test_id:    RandomNodeOperationsTest.test_node_operations
status:     FAIL
run time:   1808.561 seconds

<BadLogLines nodes=ip-172-31-7-144(4) example="WARN  2024-04-27 23:14:58,785 [shard 0:main] seastar_memory - oversized allocation: 4868616192 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x7ef429b 0x7bfcb07 0x7c084a3 0x5e8e293 0x5e8e1eb 0x5e9abef 0x5e91d77 0x2c85d1f 0x7caa1cb 0x7cac87f 0x7caaa53 0x7bcfca7 0x7bce7fb 0x2b76567 0x7f46f43 /opt/redpanda/lib/libc.so.6+0x2b1c7 /opt/redpanda/lib/libc.so.6+0x2b29f 0x2b6f5ef">
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 182, in wrapped
    redpanda.raise_on_bad_logs(
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1461, in raise_on_bad_logs
    lsearcher.search_logs(_searchable_nodes)
  File "/home/ubuntu/redpanda/tests/rptest/services/utils.py", line 166, in search_logs
    raise BadLogLines(bad_loglines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-7-144(4) example="WARN  2024-04-27 23:14:58,785 [shard 0:main] seastar_memory - oversized allocation: 4868616192 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x7ef429b 0x7bfcb07 0x7c084a3 0x5e8e293 0x5e8e1eb 0x5e9abef 0x5e91d77 0x2c85d1f 0x7caa1cb 0x7cac87f 0x7caaa53 0x7bcfca7 0x7bce7fb 0x2b76567 0x7f46f43 /opt/redpanda/lib/libc.so.6+0x2b1c7 /opt/redpanda/lib/libc.so.6+0x2b29f 0x2b6f5ef">

JIRA Link: CORE-2694

@vbotbuildovich vbotbuildovich added auto-triaged used to know which issues have been opened from a CI job ci-failure labels Apr 28, 2024
@piyushredpanda piyushredpanda added the sev/high loss of availability, pathological performance degradation, recoverable corruption label May 2, 2024
@travisdowns travisdowns changed the title CI Failure (key symptom) in RandomNodeOperationsTest.test_node_operations CI Failure (4GB allocation) in RandomNodeOperationsTest.test_node_operations May 2, 2024
@StephanDollberg
Member

StephanDollberg commented May 2, 2024

NB: This is in version 23.3.13

0x7ef429b 0x7bfcb07 0x7c084a3 0x5e8e293 0x5e8e1eb 0x5e9abef 0x5e91d77 0x2c85d1f 0x7caa1cb 0x7cac87f 0x7caaa53 0x7bcfca7 0x7bce7fb 0x2b76567 0x7f46f43 /opt/redpanda/lib/libc.so.6+0x2b1c7 /opt/redpanda/lib/libc.so.6+0x2b29f 0x2b6f5ef
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/include/seastar/util/backtrace.hh:64
 (inlined by) seastar::current_backtrace_tasklocal() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/util/backtrace.cc:98
 (inlined by) seastar::current_tasktrace() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/util/backtrace.cc:149
 (inlined by) seastar::current_backtrace() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/util/backtrace.cc:182
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:839
 (inlined by) seastar::memory::cpu_pages::check_large_allocation(unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:904
 (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:911
 (inlined by) seastar::memory::allocate_large(unsigned long, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:1521
 (inlined by) seastar::memory::allocate_slowpath(unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:1645
seastar::memory::allocate(unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:1658
 (inlined by) operator new(unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:2355
void* std::__1::__libcpp_operator_new[abi:v160004]<unsigned long>(unsigned long) at /vectorized/llvm/bin/../include/c++/v1/new:266
 (inlined by) std::__1::__libcpp_allocate[abi:v160004](unsigned long, unsigned long) at /vectorized/llvm/bin/../include/c++/v1/new:292
 (inlined by) std::__1::allocator<model::broker_shard>::allocate[abi:v160004](unsigned long) at /vectorized/llvm/bin/../include/c++/v1/__memory/allocator.h:115
 (inlined by) std::__1::__allocation_result<std::__1::allocator_traits<std::__1::allocator<model::broker_shard> >::pointer> std::__1::__allocate_at_least[abi:v160004]<std::__1::allocator<model::broker_shard> >(std::__1::allocator<model::broker_shard>&, unsigned long) at /vectorized/llvm/bin/../include/c++/v1/__memory/allocate_at_least.h:55
 (inlined by) std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> >::__vallocate[abi:v160004](unsigned long) at /vectorized/llvm/bin/../include/c++/v1/vector:688
 (inlined by) std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> >::vector(std::__1::vector<model::broker_shard, std::__1::allocator<model::broker_shard> > const&) at /vectorized/llvm/bin/../include/c++/v1/vector:1193
 (inlined by) cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}::operator()(cluster::partition_balancer_planner::reassignable_partition&) const at /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-061f17d193e1f876c-1/redpanda/redpanda/src/v/cluster/partition_balancer_planner.cc:1754
 (inlined by) decltype (((std::declval<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}> >)())((std::declval<cluster::partition_balancer_planner::reassignable_partition&>)())) std::__1::__invoke[abi:v160004]<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}>, cluster::partition_balancer_planner::reassignable_partition&>(seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}>&&, cluster::partition_balancer_planner::reassignable_partition&) at /vectorized/llvm/bin/../include/c++/v1/__functional/invoke.h:394
 (inlined by) decltype(auto) std::__1::__variant_detail::__visitation::__variant::__value_visitor<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}> >::operator()[abi:v160004]<std::__1::__variant_detail::__alt<0ul, cluster::partition_balancer_planner::reassignable_partition>&>(std::__1::__variant_detail::__alt<0ul, cluster::partition_balancer_planner::reassignable_partition>&) const at /vectorized/llvm/bin/../include/c++/v1/variant:689
 (inlined by) decltype (((std::declval<std::__1::__variant_detail::__visitation::__variant::__value_visitor<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}> > >)())((std::declval<std::__1::__variant_detail::__alt<0ul, cluster::partition_balancer_planner::reassignable_partition>&>)())) std::__1::__invoke[abi:v160004]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}> >, std::__1::__variant_detail::__alt<0ul, cluster::partition_balancer_planner::reassignable_partition>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, 
cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}> >&&, std::__1::__variant_detail::__alt<0ul, cluster::partition_balancer_planner::reassignable_partition>&) at /vectorized/llvm/bin/../include/c++/v1/__functional/invoke.h:394
 (inlined by) decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<0ul>::__dispatch[abi:v160004]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}> >&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}> >&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&) at 
/vectorized/llvm/bin/../include/c++/v1/variant:569
decltype(auto) std::__1::__variant_detail::__visitation::__base::__visit_alt[abi:v160004]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}> >, std::__1::__variant_detail::__impl<cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}> >&&, std::__1::__variant_detail::__impl<cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&) at /vectorized/llvm/bin/../include/c++/v1/variant:532
 (inlined by) decltype(auto) std::__1::__variant_detail::__visitation::__variant::__visit_alt[abi:v160004]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}> >, std::__1::variant<cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}> >&&, std::__1::variant<cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&) at /vectorized/llvm/bin/../include/c++/v1/variant:639
 (inlined by) decltype(auto) std::__1::__variant_detail::__visitation::__variant::__visit_value[abi:v160004]<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}>, std::__1::variant<cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&>(seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}>&&, std::__1::variant<cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&) at /vectorized/llvm/bin/../include/c++/v1/variant:658
 (inlined by) decltype(auto) std::__1::visit[abi:v160004]<seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}>, std::__1::variant<cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&, void>(seastar::internal::variant_visitor<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}>&&, std::__1::variant<cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&) at /vectorized/llvm/bin/../include/c++/v1/variant:1756
 (inlined by) auto seastar::visit<std::__1::variant<cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}>(std::__1::variant<cluster::partition_balancer_planner::reassignable_partition, cluster::partition_balancer_planner::force_reassignable_partition, cluster::partition_balancer_planner::moving_partition, cluster::partition_balancer_planner::immutable_partition>&, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}&&, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}&&) at /vectorized/include/seastar/util/variant_utils.hh:71
 (inlined by) auto cluster::partition_balancer_planner::partition::match_variant<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}>(cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(cluster::partition_balancer_planner::reassignable_partition&)#1}&&, cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const::{lambda(auto:1&)#1}&&) at /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-061f17d193e1f876c-1/redpanda/redpanda/src/v/cluster/partition_balancer_planner.cc:797
 (inlined by) cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0::operator()(cluster::partition_balancer_planner::partition&) const at /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-061f17d193e1f876c-1/redpanda/redpanda/src/v/cluster/partition_balancer_planner.cc:1751
 (inlined by) seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> (cluster::partition_balancer_planner::partition&)>::direct_vtable_for<cluster::partition_balancer_planner::get_counts_rebalancing_actions(cluster::partition_balancer_planner::request_context&)::$_0>::call(seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> (cluster::partition_balancer_planner::partition&)> const*, cluster::partition_balancer_planner::partition&) at /vectorized/include/seastar/util/noncopyable_function.hh:129
seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> (cluster::partition_balancer_planner::partition&)>::operator()(cluster::partition_balancer_planner::partition&) const at /vectorized/include/seastar/util/noncopyable_function.hh:215
 (inlined by) auto cluster::partition_balancer_planner::request_context::do_with_partition<seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> (cluster::partition_balancer_planner::partition&)> >(model::ntp const&, cluster::partition_assignment const&, seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> (cluster::partition_balancer_planner::partition&)>&) at /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-061f17d193e1f876c-1/redpanda/redpanda/src/v/cluster/partition_balancer_planner.cc:969
cluster::partition_balancer_planner::request_context::for_each_partition_random_order(seastar::noncopyable_function<seastar::bool_class<seastar::stop_iteration_tag> (cluster::partition_balancer_planner::partition&)>) at /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-061f17d193e1f876c-1/redpanda/redpanda/src/v/cluster/partition_balancer_planner.cc:1016
std::__1::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume[abi:v160004]() const at /vectorized/llvm/bin/../include/c++/v1/__coroutine/coroutine_handle.h:169
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at /vectorized/include/seastar/core/coroutine.hh:125
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2750
 (inlined by) seastar::reactor::run_some_tasks() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3213
seastar::reactor::do_run() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3397
seastar::reactor::run() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3265
seastar::app_template::run_deprecated(int, char**, std::__1::function<void ()>&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::__1::function<seastar::future<int> ()>&&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/app-template.cc:167
application::run(int, char**) at /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-061f17d193e1f876c-1/redpanda/redpanda/src/v/redpanda/application.cc:414
main at /var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-061f17d193e1f876c-1/redpanda/redpanda/src/v/redpanda/main.cc:22
/opt/redpanda/lib/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=6e7b96dfb83f0bdcb6a410469b82f86415e5ada3, for GNU/Linux 3.2.0, stripped

@travisdowns
Member

travisdowns commented May 2, 2024

@ztlpn - I'm looking at the behavior of partition_balancer_planner::request_context::for_each_partition_random_order where this crash occurs. As I understand it, any time this code suspends, it relies on topics_map_revision changing so that a concurrent modification can be detected before potentially accessing now-invalid references, right?

So this means that topic table has to reliably increment the version on any modification (including non-top level modifications: changes not to the topic table objects themselves, but of objects pointed to by it, e.g., of the replicas for a partition), is that right?
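For readers following along, the pattern in question can be sketched roughly as below. This is a toy model, not the Redpanda code: `toy_table`, `for_each_checked`, `concurrent_modification` and the `yield` callback (standing in for a coroutine suspension point) are all hypothetical names chosen for illustration. The reader records the revision, and after every possible suspension it re-checks the revision before touching the table again.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a topic table: a map plus a revision counter
// that every mutation is expected to bump.
struct toy_table {
    std::map<std::string, std::vector<int>> replicas;
    std::uint64_t revision = 0;

    void set_replicas(const std::string& key, std::vector<int> r) {
        replicas[key] = std::move(r);
        ++revision; // a mutation that skips this goes unnoticed by readers
    }
};

struct concurrent_modification : std::runtime_error {
    concurrent_modification()
      : std::runtime_error("table changed during iteration") {}
};

// Visits every entry. `yield` models a suspension point; after it returns,
// the revision is compared before any reference into the table is used.
template <typename Yield, typename Visit>
void for_each_checked(const toy_table& t, Yield yield, Visit visit) {
    std::vector<std::string> keys;
    for (const auto& kv : t.replicas) {
        keys.push_back(kv.first);
    }
    const std::uint64_t start_revision = t.revision;
    for (const auto& key : keys) {
        yield();
        if (t.revision != start_revision) {
            throw concurrent_modification{};
        }
        visit(key, t.replicas.at(key));
    }
}

inline void usage_example() {
    toy_table t;
    t.set_replicas("a", {1, 2, 3});
    bool detected = false;
    try {
        // The "yield" mutates the table, as a concurrent fiber might.
        for_each_checked(
          t,
          [&t] { t.set_replicas("a", {9}); },
          [](const std::string&, const std::vector<int>&) {});
    } catch (const concurrent_modification&) {
        detected = true;
    }
    assert(detected);
}
```

The check is only as good as the counter: if any code path changes `replicas` without touching `revision`, the reader proceeds with stale references.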

@StephanDollberg
Member

Similar issue as #17975 ?

@travisdowns
Member

travisdowns commented May 2, 2024

Just scanning through the topic table code, here's one place where it appears to me that we modify-then-suspend (in topic_table::apply_snapshot) without immediately updating the version:

                for (auto as_it = md_item.get_assignments().begin();
                     as_it != md_item.get_assignments().end();) {
                    auto as_it_copy = as_it++;
                    if (!topic_snapshot.partitions.contains(as_it_copy->id)) {
                        applier.delete_ntp(ns_tp, *as_it_copy);
                        md_item.get_assignments().erase(as_it_copy);
                    }
                    co_await ss::coroutine::maybe_yield();
                }
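As a rough illustration of the hazard, here is a self-contained toy version of the same erase loop in which the revision is bumped at the point of each erase, so that anything resuming at the yield can notice the change. The names (`toy_assignments`, `prune_missing`) and the `yield` callback standing in for `co_await ss::coroutine::maybe_yield()` are hypothetical, and this is not necessarily how the actual fix is structured.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

// Hypothetical stand-in for one topic's assignments plus the table revision.
struct toy_assignments {
    std::map<int, int> by_partition; // partition id -> some replica info
    std::uint64_t revision = 0;
};

// Erases partitions that are absent from `keep`. `yield` models a suspension
// point at which other code may run and inspect the table.
template <typename Yield>
void prune_missing(toy_assignments& a, const std::set<int>& keep, Yield yield) {
    for (auto it = a.by_partition.begin(); it != a.by_partition.end();) {
        if (keep.count(it->first) == 0) {
            it = a.by_partition.erase(it);
            ++a.revision; // bump before the next suspension point
        } else {
            ++it;
        }
        yield();
    }
}

inline void usage_example() {
    toy_assignments a;
    a.by_partition = {{0, 10}, {1, 11}, {2, 12}};
    const std::set<int> keep = {1};
    std::uint64_t revision_at_yield = 0;
    prune_missing(a, keep, [&] { revision_at_yield = a.revision; });
    assert(a.by_partition.size() == 1);
    assert(revision_at_yield == 2); // two erases, each visible at a yield
}
```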

@ztlpn
Contributor

ztlpn commented May 3, 2024

@travisdowns great catch! I managed to reproduce it in a unit test that concurrently calls planner and applies controller snapshots (i.e. ASan sees it).

> So this means that topic table has to reliably increment the version on any modification (including non-top level modifications: changes not to the topic table objects themselves, but of objects pointed to by it, e.g., of the replicas for a partition), is that right?

That's correct. Quite error-prone of course, ideally we'd use some kind of persistent data structure here, but that's what we have for now.

@travisdowns
Member

travisdowns commented May 4, 2024

> ideally we'd use some kind of persistent data structure here, but that's what we have for now.

Nice, thanks for the quick fix!

> Quite error-prone of course, ideally we'd use some kind of persistent data structure...

One helpful approach I've used in the past for things like this is to let the type system help you: e.g., wrap the map in another struct which gives out const access freely (which should generally be safe as long as the nested structures are "deep const", as containers are). If you need non-const access, it comes through an accessor that always increments the version when it hands out the mutable reference.
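A minimal sketch of that idea, assuming a plain `std::map` underneath; the `versioned` template, its `mutate()` accessor and the `toy_topics` alias are hypothetical names for illustration, not anything in the Redpanda tree:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical wrapper: reads are free, mutable access always bumps the
// revision.
template <typename Map>
class versioned {
public:
    // Const access never changes the revision.
    const Map& get() const { return _map; }

    // Mutable access bumps the revision up front, on the assumption that
    // the caller is about to modify something.
    Map& mutate() {
        ++_revision;
        return _map;
    }

    std::uint64_t revision() const { return _revision; }

private:
    Map _map;
    std::uint64_t _revision = 0;
};

using toy_topics = versioned<std::map<std::string, std::vector<int>>>;

inline void usage_example() {
    toy_topics t;
    const auto before = t.revision();
    t.mutate()["a"] = {1, 2, 3}; // write path: revision moves
    assert(t.revision() == before + 1);
    assert(t.get().at("a").size() == 3); // read path: revision unchanged
    assert(t.revision() == before + 1);
}
```

Bumping on every `mutate()` call is deliberately pessimistic: a spurious bump only costs a reader a retry, while a missed bump leaves it holding stale references.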

This still leaves open a hole if you get such a reference, hold it over a suspension point, and then modify the map further, though this is already a potentially crashing bug that must be avoided. One way to fix that aspect is to not give out a map reference at all but only implement mutating operations inside the wrapper. Another way to close that last hole is to have a way of declaring objects which cannot be held across a possible suspension point; I can think of a few ways to do that.
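The "mutations only inside the wrapper" variant might look roughly like this. Again a sketch under assumptions: `closed_topics` and its two operations are hypothetical, and a real table would need a much wider set of mutators.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical wrapper that never hands out a mutable reference, so no
// caller can hold one across a suspension point.
class closed_topics {
public:
    using replicas = std::vector<int>;

    const std::map<std::string, replicas>& get() const { return _map; }
    std::uint64_t revision() const { return _revision; }

    void set_replicas(const std::string& key, replicas r) {
        _map[key] = std::move(r);
        ++_revision;
    }

    // Returns true if an entry was removed; bumps the revision only then.
    bool erase(const std::string& key) {
        if (_map.erase(key) == 0) {
            return false;
        }
        ++_revision;
        return true;
    }

private:
    std::map<std::string, replicas> _map;
    std::uint64_t _revision = 0;
};

inline void usage_example() {
    closed_topics t;
    t.set_replicas("a", {1, 2});
    assert(t.revision() == 1);
    assert(!t.erase("missing")); // nothing changed, revision stays put
    assert(t.revision() == 1);
    assert(t.erase("a"));
    assert(t.revision() == 2);
}
```

Because each mutator completes and bumps the revision before returning, a modification and its version change cannot be separated by a suspension point.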

@ztlpn
Contributor

ztlpn commented May 14, 2024

Likely fixed by #18305

@ztlpn ztlpn closed this as completed May 14, 2024
@piyushredpanda piyushredpanda added the ci-rca/redpanda CI Root Cause Analysis - Redpanda Issue label May 16, 2024