DHCP timeout on AWS guest #20

vladzcloudius · 2015-01-28T17:04:09Z

When running with DHCP on AWS we hit the assertion below after about 30 seconds.
This happens only in an SMP configuration and does not reproduce in a UP configuration.
The master hash is 24d5c31.

DHCP timeout
httpd: ./core/future.hh:145: void future_state<T>::set(A&& ...) [with A = {bool, net::dhcp::lease}; T = {bool, net::dhcp::lease}]: Assertion `_state == state::future' failed.

The bisect shows that the patch responsible for the breakage is:

ff4aca2ee0787b98d64090546adb63ef23b4dc7d is the first bad commit
commit ff4aca2ee0787b98d64090546adb63ef23b4dc7d
Author: Gleb Natapov <gleb@cloudius-systems.com>
Date:   Sun Jan 25 14:35:28 2015 +0200

    core: prefetch work items before processing

:040000 040000 2a28c23f48931e81723d025bc496a3a8a368e9cd 76f92f274bdf437f67c2842654460816f9fb4672 M      core

To reproduce run:
sudo ./build/release/apps/httpd/httpd --network-stack native --dpdk-pmd -m 512M -c 4

And wait for about 30 seconds.
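
To illustrate what the failed assertion guards against, here is a minimal standalone sketch. It is not Seastar's `future_state`: `one_shot_state`, `try_set`, `resolved`, and the `result` enumerator are hypothetical names, and the idea that two code paths (for example a timeout and a late reply) both try to resolve the same state is an assumption that the report does not confirm. The sketch models a slot that may be resolved only once and reports a second attempt instead of aborting.

```cpp
#include <optional>
#include <utility>

// Hypothetical stand-in for a one-shot result slot; not Seastar's future_state.
enum class state { future, result };

template <typename T>
class one_shot_state {
    state _state = state::future;   // "future" means: still waiting for a value
    std::optional<T> _value;
public:
    // Stores the value if the slot is still waiting and returns true.
    // Returns false if the slot was already resolved; an
    // assert(_state == state::future) at this point would abort instead.
    bool try_set(T v) {
        if (_state != state::future) {
            return false;
        }
        _value = std::move(v);
        _state = state::result;
        return true;
    }

    bool resolved() const { return _state == state::result; }

    // Only valid once resolved() is true.
    const T& get() const { return *_value; }
};
```

With this shape, two competing paths can both call `try_set`; whichever runs second sees `false` and the first value is preserved.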

avikivity · 2015-01-28T17:56:01Z

Can you try the debug version?

avikivity · 2015-01-28T17:56:20Z

Adding @gleb-cloudius

vladzcloudius · 2015-01-28T18:04:14Z

On 01/28/15 19:56, Avi Kivity wrote:

> Can you try the debug version?

Yes. The debug build seems to report garbage. Note that the error below is
reported before DHCP discovery completes, whereas in a release build we know
that discovery succeeds.

DHCP sending discover

ASAN:SIGSEGV

==7801==ERROR: AddressSanitizer: SEGV on unknown address 0x602000076480 (pc 0x00000074e165 sp 0x7fff92516e58 bp 0x7fff92516ee0 T0)
#0 0x74e164 in ixgbe_xmit_pkts (/home/ubuntu/seastar/build/debug/apps/httpd/httpd+0x74e164)
#1 0x4b8527 in rte_eth_tx_burst /home/ubuntu/dpdk/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2546
#2 0x4bdc64 in dpdk::dpdk_qp::send(circular_buffer<net::packet, std::allocator<net::packet> >&) net/dpdk.cc:685
#3 0x4d46b4 in net::qp::poll_tx() net/net.cc:35
#4 0x4c96dd in operator() net/net.cc:44
#5 0x4ce7eb in std::unique_ptr<reactor::pollfn, std::default_delete<std::unique_ptr> > reactor::make_pollfn<net::qp::qp()::{lambda()#1}>(net::qp::qp()::{lambda()#1}&&)::the_pollfn::poll_and_check_more_work() (/home/ubuntu/seastar/build/debug/apps/httpd/httpd+0x4ce7eb)
#6 0x5d66cf in reactor::poll_once() core/reactor.cc:795
#7 0x5d623d in reactor::run() core/reactor.cc:774
#8 0x6ab047 in app_template::run(int, char**, std::function<void ()>&&) core/app-template.cc:73
#9 0x40ef77 in main apps/httpd/httpd.cc:245
#10 0x7f90f7373ec4 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21ec4)
#11 0x40def2 (/home/ubuntu/seastar/build/debug/apps/httpd/httpd+0x40def2)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV ??:0 ixgbe_xmit_pkts
==7801==ABORTING

avikivity · 2015-01-28T18:07:44Z

On 01/28/2015 08:04 PM, vladzcloudius wrote:

> On 01/28/15 19:56, Avi Kivity wrote:
>
> > Can you try the debug version?
>
> Yes. Debug seems to report trash. Note that the below is reported before
> DHCP discovery is over which we know is ending with success in a release
> version.

Can you bisect the debug build to find where the error started?

It seems unrelated to the patch.

slivne · 2015-02-03T11:09:13Z

duplicate of #18

…o_with

Fixes failures in debug mode:
```
$ build/debug/tests/unit/closeable_test -l all -t deferred_close_test
WARNING: debug mode. Not for benchmarking or production
random-seed=3064133628
Running 1 test case...
Entering test module "../../tests/unit/closeable_test.cc"
../../tests/unit/closeable_test.cc(0): Entering test case "deferred_close_test"
../../src/testing/seastar_test.cc(43): info: check true has passed
==9449==WARNING: ASan doesn't fully support makecontext/swapcontext functions and may produce false positives in some cases!
terminate called after throwing an instance of 'seastar::broken_promise'
what(): broken promise
==9449==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 0x7fbf1f49f000; bottom 0x7fbf40971000; size: 0xffffffffdeb2e000 (-558702592)
False positive error reports may follow
For details see google/sanitizers#189
=================================================================
==9449==AddressSanitizer CHECK failed: ../../../../libsanitizer/asan/asan_thread.cpp:356 "((ptr[0] == kCurrentStackFrameMagic)) != (0)" (0x0, 0x0)
#0 0x7fbf45f39d0b (/lib64/libasan.so.6+0xb3d0b)
#1 0x7fbf45f57d4e (/lib64/libasan.so.6+0xd1d4e)
#2 0x7fbf45f3e724 (/lib64/libasan.so.6+0xb8724)
#3 0x7fbf45eb3e5b (/lib64/libasan.so.6+0x2de5b)
#4 0x7fbf45eb51e8 (/lib64/libasan.so.6+0x2f1e8)
#5 0x7fbf45eb7694 (/lib64/libasan.so.6+0x31694)
#6 0x7fbf45f39398 (/lib64/libasan.so.6+0xb3398)
#7 0x7fbf45f3a00b in __asan_report_load8 (/lib64/libasan.so.6+0xb400b)
#8 0xfe6d52 in bool __gnu_cxx::operator!=<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > >(__gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&, __gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&) /usr/include/c++/10/bits/stl_iterator.h:1116
#9 0xfe615c in dl_iterate_phdr ../../src/core/exception_hacks.cc:121
#10 0x7fbf44bd1810 in _Unwind_Find_FDE (/lib64/libgcc_s.so.1+0x13810)
#11 0x7fbf44bcd897 (/lib64/libgcc_s.so.1+0xf897)
#12 0x7fbf44bcea5f (/lib64/libgcc_s.so.1+0x10a5f)
#13 0x7fbf44bcefd8 in _Unwind_RaiseException (/lib64/libgcc_s.so.1+0x10fd8)
#14 0xfe6281 in _Unwind_RaiseException ../../src/core/exception_hacks.cc:148
#15 0x7fbf457364bb in __cxa_throw (/lib64/libstdc++.so.6+0xaa4bb)
#16 0x7fbf45e10a21 (/lib64/libboost_unit_test_framework.so.1.73.0+0x1aa21)
#17 0x7fbf45e20fe0 in boost::execution_monitor::execute(boost::function<int ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2afe0)
#18 0x7fbf45e21094 in boost::execution_monitor::vexecute(boost::function<void ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2b094)
#19 0x7fbf45e43921 in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned long) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d921)
#20 0x7fbf45e5eae1 (/lib64/libboost_unit_test_framework.so.1.73.0+0x68ae1)
#21 0x7fbf45e5ed31 (/lib64/libboost_unit_test_framework.so.1.73.0+0x68d31)
#22 0x7fbf45e2e547 in boost::unit_test::framework::run(unsigned long, bool) (/lib64/libboost_unit_test_framework.so.1.73.0+0x38547)
#23 0x7fbf45e43618 in boost::unit_test::unit_test_main(bool (*)(), int, char**) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d618)
#24 0x44798d in seastar::testing::entry_point(int, char**) ../../src/testing/entry_point.cc:77
#25 0x4134b5 in main ../../include/seastar/testing/seastar_test.hh:65
#26 0x7fbf44a1b1e1 in __libc_start_main (/lib64/libc.so.6+0x281e1)
#27 0x4133dd in _start (/home/bhalevy/dev/seastar/build/debug/tests/unit/closeable_test+0x4133dd)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210406100911.12278-1-bhalevy@scylladb.com>

When `posix_server_socket_impl::accept()` runs it may start a cross-core background fiber that inserts a pending connection into the thread-local container posix_ap_server_socket_impl::conn_q.

However, the continuation that enqueues the pending connection may not actually run until after the target core calls abort_accept() (e.g. parallel shutdown via a seastar::sharded<server>::stop). This can leave an entry in the conn_q container that is destroyed when the reactor thread exits. Unfortunately the conn_q container holds entries of type conntrack::handle, whose destructor schedules additional work.

```
class handle {
    foreign_ptr<lw_shared_ptr<load_balancer>> _lb;
    ~handle() {
        (void)smp::submit_to(_host_cpu, [cpu = _target_cpu, lb = std::move(_lb)] {
            lb->closed_cpu(cpu);
        });
    }
    ...
```

When this race occurs and the destructor runs, the reactor is no longer available, so the continuation that is scheduled onto the reactor is leaked, as the following report shows:

Direct leak of 88 byte(s) in 1 object(s) allocated from:
#0 0x557c91ca5b7d in operator new(unsigned long) /v/llvm/llvm/src/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
#1 0x557ca3e3cc08 in void seastar::future<void>::schedule<seastar::internal::promise_ba...
...
// the unordered map here is conn_q
#19 0x557ca47034d8 in std::__1::unordered_multimap<std::__1::tuple<int, seastar::socket...
#20 0x7f98dcaf238e in __call_tls_dtors (/lib64/libc.so.6+0x4038e) (BuildId: 6e3c087aca9...

fixes: scylladb#738
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

gleb-cloudius mentioned this issue Oct 18, 2015

memcached crashes on setting first key #64

Closed

nlfox mentioned this issue Jan 13, 2018

Memcached bad performance (both using DPDK and posix network stack) #383

Open

frank8989 mentioned this issue Feb 26, 2018

Task block detector deadlocks with exception throwing #298

Open

closertotheheart mentioned this issue Jan 30, 2019

1/47 Test #1: Seastar.dist.consumer .........................***Failed 86.95 sec #596

Closed

Howe2015 mentioned this issue Aug 31, 2020

http server dump #781

Open

nlescoua mentioned this issue Mar 28, 2023

Abort called when writing a temporary_buffer to a file_output_stream #1581

Open
