DHCP timeout on AWS guest #20

Open
vladzcloudius opened this issue Jan 28, 2015 · 5 comments

@vladzcloudius
Contributor

When running with DHCP on AWS, we hit the assertion below after about 30 seconds.
This happens only in an SMP configuration and doesn't reproduce in a UP configuration.
master hash is: 24d5c31

DHCP timeout
httpd: ./core/future.hh:145: void future_state<T>::set(A&& ...) [with A = {bool, net::dhcp::lease}; T = {bool, net::dhcp::lease}]: Assertion `_state == state::future' failed.

The bisect shows that the patch responsible for the breakage is:

ff4aca2ee0787b98d64090546adb63ef23b4dc7d is the first bad commit
commit ff4aca2ee0787b98d64090546adb63ef23b4dc7d
Author: Gleb Natapov <gleb@cloudius-systems.com>
Date:   Sun Jan 25 14:35:28 2015 +0200

    core: prefetch work items before processing

:040000 040000 2a28c23f48931e81723d025bc496a3a8a368e9cd 76f92f274bdf437f67c2842654460816f9fb4672 M      core

To reproduce, run:
sudo ./build/release/apps/httpd/httpd --network-stack native --dpdk-pmd -m 512M -c 4

And wait for about 30 seconds.
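
The failed assertion in `future_state<T>::set` reads as a check that the state is still pending when a value is stored, which would fail if the same state were completed twice. The following is a minimal sketch of that invariant in plain C++, not Seastar code: `one_shot_state`, its members, and `second_set_is_rejected` are hypothetical names, and the thread does not establish which two code paths completed the state, so the comments naming a reply and a timeout are illustrative only.

```cpp
#include <stdexcept>
#include <utility>

// Hypothetical, simplified model of a one-shot result slot (not Seastar's future_state).
template <typename T>
class one_shot_state {
    enum class state { pending, result };
    state _state = state::pending;
    T _value{};
public:
    bool pending() const { return _state == state::pending; }

    // Stores the result exactly once. Throws where the real code asserts,
    // so that the misuse can be observed without aborting the process.
    void set(T v) {
        if (_state != state::pending) {
            throw std::logic_error("state already completed");
        }
        _value = std::move(v);
        _state = state::result;
    }

    // Guarded variant: completes the state only if nobody has done so yet.
    bool try_set(T v) {
        if (!pending()) {
            return false;
        }
        set(std::move(v));
        return true;
    }
};

// Returns true if completing the same state a second time is rejected.
bool second_set_is_rejected() {
    one_shot_state<int> s;
    s.set(1);                  // first completion (illustrative: a reply arrives)
    try {
        s.set(2);              // second completion (illustrative: a timeout fires)
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}

// Returns true if the guarded variant turns the second completion into a no-op.
bool guarded_second_set_is_ignored() {
    one_shot_state<int> s;
    return s.try_set(1) && !s.try_set(2);
}
```

In this model, two completion paths racing for the same state either trip the check or, with the guarded variant, leave the first result in place.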

@avikivity
Member

Can you try the debug version?

@avikivity
Member

Adding @gleb-cloudius

@vladzcloudius
Contributor Author

On 01/28/15 19:56, Avi Kivity wrote:

> Can you try the debug version?

Yes. The debug build seems to report garbage. Note that the report below
appears before DHCP discovery finishes, and we know that discovery ends
successfully in the release build.

DHCP sending discover

ASAN:SIGSEGV

==7801==ERROR: AddressSanitizer: SEGV on unknown address 0x602000076480 (pc 0x00000074e165 sp 0x7fff92516e58 bp 0x7fff92516ee0 T0)
#0 0x74e164 in ixgbe_xmit_pkts (/home/ubuntu/seastar/build/debug/apps/httpd/httpd+0x74e164)
#1 0x4b8527 in rte_eth_tx_burst /home/ubuntu/dpdk/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2546
#2 0x4bdc64 in dpdk::dpdk_qp::send(circular_buffer<net::packet, std::allocator<net::packet> >&) net/dpdk.cc:685
#3 0x4d46b4 in net::qp::poll_tx() net/net.cc:35
#4 0x4c96dd in operator() net/net.cc:44
#5 0x4ce7eb in std::unique_ptr<reactor::pollfn, std::default_deletestd::unique_ptr > reactor::make_pollfnnet::qp::qp()::{lambda()#1}(net::qp::qp()::{lambda()#1}&&)::the_pollfn::poll_and_check_more_work() (/home/ubuntu/seastar/build/debug/apps/httpd/httpd+0x4ce7eb)
#6 0x5d66cf in reactor::poll_once() core/reactor.cc:795
#7 0x5d623d in reactor::run() core/reactor.cc:774
#8 0x6ab047 in app_template::run(int, char**, std::function<void ()>&&) core/app-template.cc:73
#9 0x40ef77 in main apps/httpd/httpd.cc:245
#10 0x7f90f7373ec4 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21ec4)
#11 0x40def2 (/home/ubuntu/seastar/build/debug/apps/httpd/httpd+0x40def2)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV ??:0 ixgbe_xmit_pkts
==7801==ABORTING



@avikivity
Member

Can you bisect the debug build to find where the error started?

It seems unrelated to the patch.


@slivne
Contributor

slivne commented Feb 3, 2015

duplicate of #18

avikivity pushed a commit that referenced this issue Apr 6, 2021
…o_with

Fixes failures in debug mode:
```
$ build/debug/tests/unit/closeable_test -l all -t deferred_close_test
WARNING: debug mode. Not for benchmarking or production
random-seed=3064133628
Running 1 test case...
Entering test module "../../tests/unit/closeable_test.cc"
../../tests/unit/closeable_test.cc(0): Entering test case "deferred_close_test"
../../src/testing/seastar_test.cc(43): info: check true has passed
==9449==WARNING: ASan doesn't fully support makecontext/swapcontext functions and may produce false positives in some cases!
terminate called after throwing an instance of 'seastar::broken_promise'
  what():  broken promise
==9449==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 0x7fbf1f49f000; bottom 0x7fbf40971000; size: 0xffffffffdeb2e000 (-558702592)
False positive error reports may follow
For details see google/sanitizers#189
=================================================================
==9449==AddressSanitizer CHECK failed: ../../../../libsanitizer/asan/asan_thread.cpp:356 "((ptr[0] == kCurrentStackFrameMagic)) != (0)" (0x0, 0x0)
    #0 0x7fbf45f39d0b  (/lib64/libasan.so.6+0xb3d0b)
    #1 0x7fbf45f57d4e  (/lib64/libasan.so.6+0xd1d4e)
    #2 0x7fbf45f3e724  (/lib64/libasan.so.6+0xb8724)
    #3 0x7fbf45eb3e5b  (/lib64/libasan.so.6+0x2de5b)
    #4 0x7fbf45eb51e8  (/lib64/libasan.so.6+0x2f1e8)
    #5 0x7fbf45eb7694  (/lib64/libasan.so.6+0x31694)
    #6 0x7fbf45f39398  (/lib64/libasan.so.6+0xb3398)
    #7 0x7fbf45f3a00b in __asan_report_load8 (/lib64/libasan.so.6+0xb400b)
    #8 0xfe6d52 in bool __gnu_cxx::operator!=<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > >(__gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&, __gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&) /usr/include/c++/10/bits/stl_iterator.h:1116
    #9 0xfe615c in dl_iterate_phdr ../../src/core/exception_hacks.cc:121
    #10 0x7fbf44bd1810 in _Unwind_Find_FDE (/lib64/libgcc_s.so.1+0x13810)
    #11 0x7fbf44bcd897  (/lib64/libgcc_s.so.1+0xf897)
    #12 0x7fbf44bcea5f  (/lib64/libgcc_s.so.1+0x10a5f)
    #13 0x7fbf44bcefd8 in _Unwind_RaiseException (/lib64/libgcc_s.so.1+0x10fd8)
    #14 0xfe6281 in _Unwind_RaiseException ../../src/core/exception_hacks.cc:148
    #15 0x7fbf457364bb in __cxa_throw (/lib64/libstdc++.so.6+0xaa4bb)
    #16 0x7fbf45e10a21  (/lib64/libboost_unit_test_framework.so.1.73.0+0x1aa21)
    #17 0x7fbf45e20fe0 in boost::execution_monitor::execute(boost::function<int ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2afe0)
    #18 0x7fbf45e21094 in boost::execution_monitor::vexecute(boost::function<void ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2b094)
    #19 0x7fbf45e43921 in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned long) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d921)
    #20 0x7fbf45e5eae1  (/lib64/libboost_unit_test_framework.so.1.73.0+0x68ae1)
    #21 0x7fbf45e5ed31  (/lib64/libboost_unit_test_framework.so.1.73.0+0x68d31)
    #22 0x7fbf45e2e547 in boost::unit_test::framework::run(unsigned long, bool) (/lib64/libboost_unit_test_framework.so.1.73.0+0x38547)
    #23 0x7fbf45e43618 in boost::unit_test::unit_test_main(bool (*)(), int, char**) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d618)
    #24 0x44798d in seastar::testing::entry_point(int, char**) ../../src/testing/entry_point.cc:77
    #25 0x4134b5 in main ../../include/seastar/testing/seastar_test.hh:65
    #26 0x7fbf44a1b1e1 in __libc_start_main (/lib64/libc.so.6+0x281e1)
    #27 0x4133dd in _start (/home/bhalevy/dev/seastar/build/debug/tests/unit/closeable_test+0x4133dd)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210406100911.12278-1-bhalevy@scylladb.com>
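
For readers unfamiliar with the exception named in the log above: a broken promise generally means that a promise was destroyed before any value or exception was set on it. The sketch below shows the analogous behaviour with the C++ standard library's `std::promise`. It is an analogue under that assumption, not Seastar code and not the test from the commit, and `abandoned_promise_is_reported` is a hypothetical helper name.

```cpp
#include <future>

// Returns true if abandoning the promise surfaces as broken_promise on the future.
bool abandoned_promise_is_reported() {
    std::future<int> f;
    {
        std::promise<int> p;
        f = p.get_future();
    }   // p is destroyed here without set_value(), so the shared state is abandoned
    try {
        f.get();               // rethrows the error stored when the promise was abandoned
    } catch (const std::future_error& e) {
        return e.code() == std::future_errc::broken_promise;
    }
    return false;
}
```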
dotnwat added a commit to dotnwat/seastar that referenced this issue Oct 24, 2022
When `posix_server_socket_impl::accept()` runs it may start a cross-core
background fiber that inserts a pending connection into the thread local
container posix_ap_server_socket_impl::conn_q.

However, the continuation that enqueues the pending connection may not
actually run until after the target core calls abort_accept() (e.g.
parallel shutdown via a seastar::sharded<server>::stop).

This can leave an entry in the conn_q container that is destroyed when
the reactor thread exits. Unfortunately, the conn_q container holds values
of type conntrack::handle, whose destructor schedules additional work.

```
   class handle {
       foreign_ptr<lw_shared_ptr<load_balancer>> _lb;
       ~handle() {
           (void)smp::submit_to(_host_cpu, [cpu = _target_cpu, lb = std::move(_lb)] {
               lb->closed_cpu(cpu);
           });
       }
       ...
```

When this race occurs and the destructor runs, the reactor is no longer
available, so the continuation that is scheduled onto the reactor is leaked,
as the following report shows:

Direct leak of 88 byte(s) in 1 object(s) allocated from:
    #0 0x557c91ca5b7d in operator new(unsigned long) /v/llvm/llvm/src/compiler-rt/lib/asan/asan_new_delete.cpp:95:3

    #1 0x557ca3e3cc08 in void seastar::future<void>::schedule<seastar::internal::promise_ba...
    ...
    // the unordered map here is conn_q
    #19 0x557ca47034d8 in std::__1::unordered_multimap<std::__1::tuple<int, seastar::socket...
    #20 0x7f98dcaf238e in __call_tls_dtors (/lib64/libc.so.6+0x4038e) (BuildId: 6e3c087aca9...

fixes: scylladb#738

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
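
The ordering problem described in the commit message above can be modelled without Seastar. The following is a simplified sketch in plain C++ under stated assumptions: `task_queue` is a hypothetical stand-in for the reactor, this `handle` only mirrors the shape of `conntrack::handle` (cleanup scheduled from the destructor), and the model counts stranded tasks rather than leaking them.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a reactor's task queue; not Seastar's API.
class task_queue {
    std::vector<std::function<void()>> _tasks;
    bool _running = true;
    std::size_t _stranded = 0;
public:
    // Queues work. Anything submitted after stop() can never run,
    // so it is only counted as stranded.
    void submit(std::function<void()> fn) {
        if (!_running) {
            ++_stranded;
            return;
        }
        _tasks.push_back(std::move(fn));
    }

    // Runs all queued tasks, then refuses further work.
    void stop() {
        while (!_tasks.empty()) {
            auto fn = std::move(_tasks.back());
            _tasks.pop_back();
            fn();
        }
        _running = false;
    }

    std::size_t stranded() const { return _stranded; }
};

// Schedules its cleanup from the destructor, like the handle in the commit message.
class handle {
    task_queue* _q;
    int* _closed;
public:
    handle(task_queue& q, int& closed) : _q(&q), _closed(&closed) {}
    handle(const handle&) = delete;
    handle& operator=(const handle&) = delete;
    ~handle() {
        int* closed = _closed;
        _q->submit([closed] { ++*closed; });
    }
};

// The handle outlives the queue's shutdown: its cleanup is submitted too late.
std::size_t stranded_when_destroyed_after_stop() {
    task_queue q;
    int closed = 0;
    {
        handle h(q, closed);
        q.stop();              // the "reactor" shuts down while the handle is alive
    }                          // ~handle() submits work that can no longer run
    return q.stranded();
}

// The handle is destroyed first: its cleanup is queued and runs during stop().
bool cleanup_runs_when_destroyed_before_stop() {
    task_queue q;
    int closed = 0;
    {
        handle h(q, closed);
    }
    q.stop();
    return q.stranded() == 0 && closed == 1;
}
```

In this model the fix direction is visible as an ordering constraint: every handle has to be destroyed, and its cleanup drained, before the queue stops accepting work.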
dotnwat added a commit to dotnwat/seastar that referenced this issue Dec 21, 2022