Native stack: cannot find sent record in _tcbs. #750

Open
WJTian opened this issue May 25, 2020 · 9 comments

Comments

@WJTian

WJTian commented May 25, 2020

When I benchmark the ceph crimson messenger using perf_crimson_msgr, the client uses the native network stack with 1 job (i.e. shard 1 sends the TCP SYN packet). The returning SYN+ACK packet may be hashed to another shard, which has no corresponding record in the _tcbs structure. This causes the TCP handshake to fail.

The tcpdump output is:
15:21:47.400523 IP 192.168.122.111.53010 > 192.168.122.122.distinct: Flags [S], seq 2210197188, win 29200, options [mss 1460,wscale 7,eol], length 0
15:21:47.400735 IP 192.168.122.122.distinct > 192.168.122.111.53010: Flags [S.], seq 2180376672, ack 2210197189, win 29200, options [mss 1460,wscale 7,eol], length 0
15:21:47.401020 IP 192.168.122.111.53010 > 192.168.122.122.distinct: Flags [R.], seq 1, ack 1, win 0, length 0
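
For illustration only, here is a hypothetical sketch of the failure mode (invented names, not the actual Seastar code): each shard owns its own TCB table, and an inbound segment is only looked up on the shard the RSS hash steered it to.

// Hypothetical sketch of the failure mode; not Seastar's actual code.
// Each shard keeps its own TCB table, and an inbound segment is only
// looked up on the shard that the RSS hash steered it to.
#include <cstdint>
#include <map>
#include <tuple>

struct connid {
    uint32_t local_ip, foreign_ip;
    uint16_t local_port, foreign_port;
    bool operator<(const connid& o) const {
        return std::tie(local_ip, foreign_ip, local_port, foreign_port)
             < std::tie(o.local_ip, o.foreign_ip, o.local_port, o.foreign_port);
    }
};

struct tcb;                                  // per-connection state, elided
thread_local std::map<connid, tcb*> _tcbs;   // one table per shard

void on_received_segment(const connid& id /*, const segment& seg */) {
    auto it = _tcbs.find(id);
    if (it == _tcbs.end()) {
        // The SYN was sent from shard 1, but the SYN+ACK was hashed to a
        // different shard whose table has no entry, so the segment looks
        // like it belongs to no connection and is answered with a reset,
        // i.e. the "Flags [R.]" seen in the capture above.
        // send_rst(id);
        return;
    }
    // it->second->handle(seg);
}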

@avikivity
Member

avikivity commented May 25, 2020

This can happen if the NIC uses a hash function different from what we think it is.

What NIC are you using?

Please test with --smp 1, just to validate.

@WJTian
Author

WJTian commented May 25, 2020

The NIC I use is Mellanox ConnectX-4 Lx.
perf_crimson_msgr needs at least 2 threads: one for the main thread and the other for the job. With --smp 2, the probability of a successful TCP handshake becomes much higher (around 50%, I estimate), which would be consistent with the returning SYN+ACK being steered to one of the two shards essentially at random.

@avikivity
Member

avikivity commented May 25, 2020

Try changing if (smp::count > 1) to if (false) in dpdk_device::init_port_start(). If it works, we know it's a hash function mismatch.
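
For context, that condition guards the port's multi-queue (RSS) configuration. The block below is only an approximate paraphrase (standard DPDK names of that era, not the verbatim Seastar source); the point of the experiment is that forcing the condition to false leaves the port in single-queue mode, so no hardware hash steering happens.

// Approximate paraphrase of the RSS-related part of dpdk_device::init_port_start();
// not the verbatim Seastar source.  With more than one shard the port is put into
// RSS mode so inbound packets are hash-steered across the RX queues.
if (smp::count > 1) {                        // change to `if (false)` for the experiment
    port_conf.rxmode.mq_mode = ETH_MQ_RX_RSS;             // hash-based RX steering
    port_conf.rx_adv_conf.rss_conf.rss_hf = ETH_RSS_TCP;  // assumed hash fields
} else {
    port_conf.rxmode.mq_mode = ETH_MQ_RX_NONE;            // single RX queue, no steering
}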

/cc @vladzcloudius

@WJTian
Author

WJTian commented May 25, 2020

Changing if (smp::count > 1) to if (false) in dpdk_device::init_port_start() does not work, and the client always aborts:

Aborting on shard 1.
Backtrace:
  0x0000000000b94768
  0x0000000000b50621
  0x0000000000b508ed
  0x0000000000b509b2
  /lib64/libpthread.so.0+0x000000000000f5df
  /lib64/libc.so.6+0x00000000000351f6
  /lib64/libc.so.6+0x00000000000368e7
  0x000000000060fbbf
  0x000000000060fc1a
  0x00000000006653cf
  0x0000000000b4bda8
  0x0000000000b4c0c7
  0x0000000000b7bb35
  0x0000000000b865fb
  0x0000000000b4480d
  /lib64/libpthread.so.0+0x0000000000007e24
  /lib64/libc.so.6+0x00000000000f834c

seastar-addr2line decodes the backtrace as:
[Backtrace #0]
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at /home/tianwenjie/ceph/ceph/src/seastar/include/seastar/util/backtrace.hh:56
seastar::backtrace_buffer::append_backtrace() at /home/tianwenjie/ceph/ceph/src/seastar/src/core/reactor.cc:741
 (inlined by) print_with_backtrace at /home/tianwenjie/ceph/ceph/src/seastar/src/core/reactor.cc:762
seastar::print_with_backtrace(char const*) at /home/tianwenjie/ceph/ceph/src/seastar/src/core/reactor.cc:769
sigabrt_action at /home/tianwenjie/ceph/ceph/src/seastar/src/core/reactor.cc:3473
 (inlined by) operator() at /home/tianwenjie/ceph/ceph/src/seastar/src/core/reactor.cc:3455
 (inlined by) _FUN at /home/tianwenjie/ceph/ceph/src/seastar/src/core/reactor.cc:3451
__ftrylockfile at :?
__GI___open_catalog at :?
__sigblock at :?
ceph::__ceph_assert_fail(char const*, char const*, int, char const*) at /home/tianwenjie/ceph/ceph/src/crimson/common/assert.cc:27
ceph::__ceph_assert_fail(ceph::assert_data const&) at /home/tianwenjie/ceph/ceph/src/crimson/common/assert.cc:14
operator() at /home/tianwenjie/ceph/ceph/src/tools/crimson/perf_crimson_msgr.cc:374
 (inlined by) apply at /home/tianwenjie/ceph/ceph/src/seastar/include/seastar/core/apply.hh:36
 (inlined by) apply<(anonymous namespace)::run((anonymous namespace)::perf_mode_t, const (anonymous namespace)::client_config&, const (anonymous namespace)::server_config&)::test_state::Client::connect_wait_verify(const entity_addr_t&)::<lambda(auto:72&)> [with auto:72 = (anonymous namespace)::run((anonymous namespace)::perf_mode_t, const (anonymous namespace)::client_config&, const (anonymous namespace)::server_config&)::test_state::Client]::<lambda()> > at /home/tianwenjie/ceph/ceph/src/seastar/include/seastar/core/apply.hh:44
 (inlined by) apply<(anonymous namespace)::run((anonymous namespace)::perf_mode_t, const (anonymous namespace)::client_config&, const (anonymous namespace)::server_config&)::test_state::Client::connect_wait_verify(const entity_addr_t&)::<lambda(auto:72&)> [with auto:72 = (anonymous namespace)::run((anonymous namespace)::perf_mode_t, const (anonymous namespace)::client_config&, const (anonymous namespace)::server_config&)::test_state::Client]::<lambda()> > at /home/tianwenjie/ceph/ceph/src/seastar/include/seastar/core/future.hh:1647
 (inlined by) operator() at /home/tianwenjie/ceph/ceph/src/seastar/include/seastar/core/future.hh:1226
 (inlined by) run_and_dispose at /home/tianwenjie/ceph/ceph/src/seastar/include/seastar/core/future.hh:504
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /home/tianwenjie/ceph/ceph/src/seastar/src/core/reactor.cc:2151
seastar::reactor::run_some_tasks() at /home/tianwenjie/ceph/ceph/src/seastar/src/core/reactor.cc:2566
seastar::reactor::run_some_tasks() at /home/tianwenjie/ceph/ceph/src/seastar/src/core/reactor.cc:2549
 (inlined by) seastar::reactor::run() at /home/tianwenjie/ceph/ceph/src/seastar/src/core/reactor.cc:2721
seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config)::{lambda()#3}::operator()() const at /home/tianwenjie/ceph/ceph/src/seastar/src/core/reactor.cc:3888
std::function<void ()>::operator()() const at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/std_function.h:690
 (inlined by) seastar::posix_thread::start_routine(void*) at /home/tianwenjie/ceph/ceph/src/seastar/src/core/posix.cc:60
start_thread at pthread_create.c:?

@avikivity
Member

This looks like a ceph failure, not seastar. Of course it can be caused by a seastar bug, but it's not possible for me to diagnose ceph assertion failures.

Maybe you can try reproducing the problem with seastar's httpd (and then trying the dpdk.cc change).

@WJTian
Author

WJTian commented May 25, 2020

Actually, the ceph assertion failure is caused by the TCP connection failing at line 374, and it is the same error as in the previous test, where the SYN+ACK packet was hashed to the wrong shard:

364       seastar::future<> connect_wait_verify(const entity_addr_t& peer_addr) {
365         return container().invoke_on_all([peer_addr] (auto& client) {
366           // start clients in active cores (#1 ~ #jobs)
367           if (client.is_active()) {
368             mono_time start_time = mono_clock::now();
369             client.active_conn = client.msgr->connect(peer_addr, entity_name_t::TYPE_OSD);
370             // make sure handshake won't hurt the performance
371             return seastar::sleep(1s).then([&client, start_time] {
372               if (client.conn_stats.connected_time == mono_clock::zero()) {
373                 logger().error("\n{} not connected after 1s!\n", client.lname);
374                 ceph_assert(false);
375               }
376               client.conn_stats.connecting_time = start_time;
377             });
378           }
379           return seastar::now();
380         });
381       }

In short, changing only if (smp::count > 1) to if (false) in dpdk_device::init_port_start() does not work. Maybe more changes are needed to verify that it's a hash function mismatch?

@avikivity
Member

client connect()s have extra hashing logic:

template <typename InetTraits>
auto tcp<InetTraits>::connect(socket_address sa) -> connection {
    uint16_t src_port;
    connid id;
    auto src_ip = _inet._inet.host_address();
    auto dst_ip = ipv4_address(sa);
    auto dst_port = net::ntoh(sa.u.in.sin_port);

    do {
        src_port = _port_dist(_e);
        id = connid{src_ip, dst_ip, src_port, dst_port};
    } while (_inet._inet.netif()->hw_queues_count() > 1 &&
             (_inet._inet.netif()->hash2cpu(id.hash(_inet._inet.netif()->rss_key())) != this_shard_id()
              || _tcbs.find(id) != _tcbs.end()));

So we may be using the wrong hash function here.

Please try with apps/seawreck, with smp=1 and smp>2, to validate that this is the problem.
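
For a quick independent check, one can compute the Toeplitz RSS hash of the observed 4-tuple by hand and see which RX queue it would select for the inbound SYN+ACK. Below is a minimal standalone sketch; it is not Seastar's hash2cpu(), the 40-byte key is the widely used Microsoft RSS test key, and the queue count of 2 is only an assumption. Substitute whatever key and queue count are actually in use (e.g. the key reported by ethtool -x).

// Standalone Toeplitz RSS hash check; NOT Seastar's hash2cpu().  The key below
// is the common Microsoft RSS test key and is an assumption: use the key that
// is actually programmed into the NIC / native stack.
#include <cstdint>
#include <cstdio>
#include <cstddef>

static const uint8_t rss_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

static uint32_t toeplitz(const uint8_t* key, size_t key_len,
                         const uint8_t* data, size_t data_len) {
    // 64-bit sliding window over the key; the top 32 bits are the key slice
    // aligned with the input bit currently being processed.
    uint64_t window = 0;
    for (size_t i = 0; i < 8; ++i) {
        window = (window << 8) | (i < key_len ? key[i] : 0);
    }
    uint32_t hash = 0;
    for (size_t i = 0; i < data_len; ++i) {
        for (int bit = 7; bit >= 0; --bit) {
            if (data[i] & (1u << bit)) {
                hash ^= uint32_t(window >> 32);
            }
            window <<= 1;
        }
        if (i + 8 < key_len) {
            window |= key[i + 8];   // refill the 8 key bits shifted out above
        }
    }
    return hash;
}

int main() {
    // Inbound SYN+ACK from the capture above: src 192.168.122.122, dst
    // 192.168.122.111, dst port 53010.  The numeric server port (printed as
    // "distinct" by tcpdump) is not in the capture; substitute the real value.
    const uint16_t server_port = 0;     // TODO: actual server port
    const uint16_t client_port = 53010;
    const uint8_t tuple[12] = {
        192, 168, 122, 122,                               // source IP (server)
        192, 168, 122, 111,                               // destination IP (client)
        uint8_t(server_port >> 8), uint8_t(server_port),  // source port, big endian
        uint8_t(client_port >> 8), uint8_t(client_port),  // destination port, big endian
    };
    uint32_t h = toeplitz(rss_key, sizeof(rss_key), tuple, sizeof(tuple));
    // Real hardware indexes an indirection table with the low bits of the hash;
    // a plain modulo is a simplification that is close enough for a sanity check.
    unsigned queues = 2;                // assumed number of RX queues (one per shard)
    std::printf("hash=0x%08x -> queue %u of %u\n",
                (unsigned)h, (unsigned)(h % queues), queues);
    return 0;
}

If the queue computed for the inbound SYN+ACK is not the shard that issued the connect() (shard 1 in this test), then the hardware steering disagrees with what the source-port selection loop assumed, i.e. a hash function or key mismatch.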

@vladzcloudius
Contributor

vladzcloudius commented May 28, 2020

Actually, the ceph assertion failure is caused by the TCP connection failing at line 374, and it is the same error as in the previous test, where the SYN+ACK packet was hashed to the wrong shard:

364       seastar::future<> connect_wait_verify(const entity_addr_t& peer_addr) {
365         return container().invoke_on_all([peer_addr] (auto& client) {
366           // start clients in active cores (#1 ~ #jobs)
367           if (client.is_active()) {
368             mono_time start_time = mono_clock::now();
369             client.active_conn = client.msgr->connect(peer_addr, entity_name_t::TYPE_OSD);
370             // make sure handshake won't hurt the performance
371             return seastar::sleep(1s).then([&client, start_time] {
372               if (client.conn_stats.connected_time == mono_clock::zero()) {
373                 logger().error("\n{} not connected after 1s!\n", client.lname);
374                 ceph_assert(false);
375               }
376               client.conn_stats.connecting_time = start_time;
377             });
378           }
379           return seastar::now();
380         });
381       }

In short, changing only if (smp::count > 1) to if (false) in dpdk_device::init_port_start() does not work. Maybe more changes are needed to verify that it's a hash function mismatch?

@WJTian In order to see why the code above doesn't work I'll need to see the whole thing. A link to a github repo + branch would do nicely.

As @avikivity has already mentioned, you can see how a TCP client and server may be implemented by looking at the httpd (server) and seawreck (client) demo apps.

I tested them not long ago with DPDK (+native stack) and I definitely used a multi-queue/multi-shard configuration (I played with both ena and experimental user-virtio backends).

So, there is a good chance that our TCP code is healthy. Although I don't deny for a second that there is always a chance for a bug... ;)

There were issues with a reactor backend however: I had to use --reactor-backend epoll instead of the default linux-aio one.

@WJTian
Author

WJTian commented Jun 18, 2020

I just remembered that the device I used is a tap device rather than a DPDK one (the physical NIC is a Mellanox ConnectX-4 Lx), so modifying dpdk_device::init_port_start() should not have any effect. Any ideas on how to validate that this is a hash function mismatch for a tap device?
