LeakSanitizer: detected memory leaks from seastar::net::conntrack::load_balancer #738
@gleb-cloudius can you please look into this issue?
@bhalevy we also see this occasionally. Here is a more recent trace from a few days ago. It sort of looks like when …
Here is a reproducer. Basically, if a socket is to be handled on some non-main shard, then a connection is added to that shard's conn_q from the main shard. If the target shard never calls accept() or abort_accept() for some reason, the entry is never removed from conn_q. Then, when the reactor threads are exiting, the conn_q thread-local is destroyed and a background future is scheduled late in the shutdown sequence; it appears this happens before the thread-local reactor destructor. It seems either the reactor isn't cleaning up at this level of detail, or LSan is having trouble tracking something.

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/reactor.hh>
#include <seastar/core/sharded.hh>
#include <seastar/net/api.hh>
#include <seastar/net/inet_address.hh>
#include <seastar/testing/thread_test_case.hh>
#include <seastar/util/log.hh>

namespace ss = seastar;

ss::logger lg("ok");

struct server {
    ss::future<> start() {
        ss::listen_options lo;
        lo.reuse_address = true;
        // Route accepted connections to the last shard, so accepting on
        // shard 0 goes through the cross-shard conn_q path.
        lo.set_fixed_cpu(ss::smp::count - 1);
        s = ss::engine().listen(
          ss::socket_address(ss::net::inet_address("127.0.0.1"), 9092), lo);
        if (ss::this_shard_id() == 0) {
            // Only shard 0 accepts; the target shard never calls accept()
            // or abort_accept(), so its conn_q entry is never consumed.
            (void)ss::with_gate(g, [this] {
                return s.accept().then_wrapped([](auto far) {
                    try {
                        auto ar = far.get();
                        lg.info("accepted");
                    } catch (...) {
                        lg.info(
                          "accepted (error): {}", std::current_exception());
                    }
                });
            });
        }
        return ss::now();
    }

    ss::future<> stop() {
        if (ss::this_shard_id() == 0) {
            s.abort_accept();
        }
        return g.close();
    }

    ss::server_socket s;
    ss::gate g;
};

SEASTAR_THREAD_TEST_CASE(memleak) {
    ss::sharded<server> service;
    service.start().get();
    service.invoke_on_all([](server& s) { return s.start(); }).get();
    auto c = ss::make_socket();
    c.connect(ss::socket_address(ss::net::inet_address("127.0.0.1"), 9092))
      .then_wrapped([](auto f) {
          try {
              f.get();
          } catch (...) {
          }
      })
      .get();
    c.shutdown();
    service.stop().get();
}
```
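Note that for this to reproduce, the test presumably has to run with at least two shards (e.g. passing `--smp=2` to the test binary) so that `set_fixed_cpu(ss::smp::count - 1)` targets a shard other than the accepting shard 0, and the binary has to be built with LeakSanitizer enabled for the leak to be reported.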
It also appears this race could occur even if a core calls abort_accept() on its socket, say in some ss::sharded<>::stop() sequence. If abort_accept() is called on some non-main core, the main core could still race with an accepted connection and add an entry to the other core's conn_q thread-local container. This would lead to the same leak.
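To make the shutdown-ordering hazard concrete outside of Seastar, here is a minimal, self-contained sketch; every name in it is hypothetical and only mirrors the roles of conn_q, conntrack::handle, and the reactor:

```cpp
#include <cstdio>
#include <tuple>
#include <unordered_map>
#include <utility>

// Stands in for the per-thread reactor: alive only between "startup"
// and "shutdown" in main() below.
thread_local bool reactor_alive = false;

struct handle {
    // Mirrors conntrack::~handle(): in real Seastar this would submit
    // cross-shard cleanup and ignore the returned future. When it runs
    // from __call_tls_dtors the reactor is already gone, so the
    // continuation it allocates is never executed or freed.
    ~handle() {
        if (!reactor_alive) {
            std::puts("~handle() after reactor exit: scheduled work is lost");
        }
    }
};

// Stands in for posix_ap_server_socket_impl::conn_q. Thread-local
// destructors run after the reactor has exited, so any handle still
// queued here fires its destructor too late.
thread_local std::unordered_multimap<int, handle> conn_q;

int main() {
    reactor_alive = true;
    // A cross-core accept enqueues a pending connection that nobody
    // ever accepts or aborts:
    conn_q.emplace(std::piecewise_construct,
                   std::forward_as_tuple(0), std::forward_as_tuple());
    reactor_alive = false; // the "reactor" exits first...
    return 0;              // ...then TLS destruction tears down conn_q
}
```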
@gleb-cloudius please look into this issue. Cc @xemul
@dotnwat if you're able to reproduce the issue, it would be ideal if you could send a fix for it that you've validated.
@bhalevy I had several solutions that seemed to work, but they were all rather ugly. It wasn't until after I hacked these together that I had a full understanding of the race condition. I'll post a message to seastar-dev for further discussion.
When `posix_server_socket_impl::accept()` runs it may start a cross-core background fiber that inserts a pending connection into the thread-local container `posix_ap_server_socket_impl::conn_q`. However, the continuation that enqueues the pending connection may not actually run until after the target core calls `abort_accept()` (e.g. parallel shutdown via `seastar::sharded<server>::stop()`). This can leave an entry in the conn_q container that is destroyed when the reactor thread exits. Unfortunately, the conn_q container holds a `conntrack::handle` type that schedules additional work in its destructor:

```
class handle {
    foreign_ptr<lw_shared_ptr<load_balancer>> _lb;

    ~handle() {
        (void)smp::submit_to(_host_cpu, [cpu = _target_cpu, lb = std::move(_lb)] {
            lb->closed_cpu(cpu);
        });
    }
    ...
```

When this race occurs and the destructor runs, the reactor is no longer available, leading to the following memory leak in which the continuation scheduled onto the reactor is leaked:

```
Direct leak of 88 byte(s) in 1 object(s) allocated from:
    #0 0x557c91ca5b7d in operator new(unsigned long) /v/llvm/llvm/src/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
    #1 0x557ca3e3cc08 in void seastar::future<void>::schedule<seastar::internal::promise_ba...
    ...
    // the unordered map here is conn_q
    #19 0x557ca47034d8 in std::__1::unordered_multimap<std::__1::tuple<int, seastar::socket...
    #20 0x7f98dcaf238e in __call_tls_dtors (/lib64/libc.so.6+0x4038e) (BuildId: 6e3c087aca9...
```

fixes: #738

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
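The destructor quoted above is also why simple guards are unsatisfying (the "rather ugly" solutions mentioned earlier). As a hypothetical sketch only, using stand-in types rather than the actual Seastar change (see the discussion linked below), one could skip the submit when `seastar::engine_is_ready()` reports the local reactor is gone, but that merely shifts the cross-shard problem onto `foreign_ptr`'s own destructor:

```cpp
#include <seastar/core/reactor.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/smp.hh>

namespace sketch {

// Stand-in for seastar::net::conntrack::load_balancer.
struct load_balancer {
    void closed_cpu(unsigned) {}
};

class handle {
    unsigned _host_cpu = 0;
    unsigned _target_cpu = 0;
    seastar::foreign_ptr<seastar::lw_shared_ptr<load_balancer>> _lb;

public:
    ~handle() {
        if (!seastar::engine_is_ready()) {
            // Running from __call_tls_dtors after reactor exit: nothing
            // would ever execute a submitted continuation, so don't
            // allocate one. Caveat: _lb is then released on this shard,
            // and foreign_ptr's destructor wants to submit cross-shard
            // too -- the same problem, one layer down.
            return;
        }
        (void)seastar::smp::submit_to(_host_cpu,
            [cpu = _target_cpu, lb = std::move(_lb)] {
                lb->closed_cpu(cpu);
            });
    }
};

} // namespace sketch
```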
Any progress on this issue?
@niekbouman there is a discussion going on here: #1265
Some critical tasks can be missed during engine exit, causing LeakSanitizer failures. Notably, there is a chance to lose the destruction task of `foreign_ptr<lw_shared_ptr<conntrack::load_balancer>>` submitted by `conntrack::~handle()`, because the future returned by the submit is ignored and so nothing keeps the task alive through shutdown; see `seastar/include/seastar/net/posix-stack.hh`, lines 87 to 90 at 65a8bed (presumably the `~handle()` body quoted in the commit message above).
Detailed leak report: