Assertion failure in gossiper during node reconnect #795

Closed · 14 comments

@gleb-cloudius (Contributor) opened this issue Jan 14, 2016

Got this after killing scylla with "kill -9" and restarting:

scylla: gms/gossiper.cc:1393: gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> mutable: Assertion `endpoint_state_map.count(ep_addr)' failed.

#0  0x00007f0b18d6c8d7 in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f0b18d6e53a in __GI_abort () at abort.c:89
#2  0x00007f0b18d6547d in __assert_fail_base (
    fmt=0x7f0b18ebecb8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x14316d8 "endpoint_state_map.count(ep_addr)", 
    file=file@entry=0x143203d "gms/gossiper.cc", line=line@entry=1393, 
    function=function@entry=0x1434b00 <gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::{lambda()#1}::operator()()::__PRETTY_FUNCTION__> "gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> mutable") at assert.c:92
#3  0x00007f0b18d65532 in __GI___assert_fail (
    assertion=0x14316d8 "endpoint_state_map.count(ep_addr)", 
    file=0x143203d "gms/gossiper.cc", line=1393, 
    function=0x1434b00 <gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::{lambda()#1}::operator()()::__PRETTY_FUNCTION__> "gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> mutable") at assert.c:101
#4  0x0000000000bbfb3d in operator() (__closure=0x60000892f620)
    at gms/gossiper.cc:1393
#5  apply (args=<optimized out>, 
    func=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x9fe3e62, DIE 0xa26bac2>) at /home/gleb/work/seastar/seastar/core/apply.hh:34
#6  apply<gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> > (args=<optimized out>, 
    func=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x9fe3e62, DIE 0xa26baff>) at /home/gleb/work/seastar/seastar/core/apply.hh:42
#7  do_void_futurize_apply_tuple<gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> > (args=<optimized out>, 
    func=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x9fe3e62, DIE 0xa26bb46>) at /home/gleb/work/seastar/seastar/core/future.hh:1161
#8  apply<gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> > (args=<optimized out>, 
    func=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x9fe3e62, DIE 0xa26bb85>) at /home/gleb/work/seastar/seastar/core/future.hh:1181
#9  operator() (__closure=<optimized out>)
    at /home/gleb/work/seastar/seastar/core/thread.hh:258
#10 std::_Function_handler<void(), seastar::async(Func&&, Args&& ...)::<lambda(seastar::async(Func&&, Args&& ...)::work&)> mutable [with Func = gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()>; Args = {}]::<lambda()> >::_M_invoke(const std::_Any_data &) (
    __functor=...) at /usr/include/c++/4.9.2/functional:2039
#11 0x0000000000422408 in operator() (this=0x6000000bf350)
    at /usr/include/c++/4.9.2/functional:2439
#12 seastar::thread_context::main (this=0x6000000bf340) at core/thread.cc:139
#13 0x000000000051aa92 in seastar::thread_context::s_main (lo=<optimized out>, 
    hi=<optimized out>) at core/thread.cc:130
#14 0x00007f0b18d80000 in ?? () from /lib64/libc.so.6
#15 0x0000000000000000 in ?? ()
@slivne (Contributor) commented Feb 18, 2016

@asias did we fix this?

@slivne slivne added this to the GA milestone Feb 18, 2016
@asias (Contributor) commented Feb 18, 2016

Worth retesting, since we now drain on shutdown. @gleb-cloudius can you verify?

@gleb-cloudius (Contributor, Author)

On Thu, Feb 18, 2016 at 03:31:08AM -0800, Asias He wrote:

> Worth retesting, since we now drain on shutdown. @gleb-cloudius can you verify?

kill -9 will not drain.

        Gleb.

@asias (Contributor) commented Feb 19, 2016

On Thu, Feb 18, 2016 at 7:35 PM, Gleb Natapov wrote:

> kill -9 will not drain.

You are right.

I tried to reproduce it myself:

start scylla node1
start scylla node2
run c-s for a while
kill -9 node2
start node2

I repeated the above a few times and did not see the assert.

Gleb, what did you do exactly?

Asias

@gleb-cloudius (Contributor, Author)

On Thu, Feb 18, 2016 at 07:19:39PM -0800, Asias He wrote:

> I repeated the above a few times and did not see the assert.
>
> Gleb, what did you do exactly?

I saw it only once. I usually run 3 nodes. Try killing a node while another
node is connecting.

        Gleb.

@asias (Contributor) commented Feb 19, 2016

On Fri, Feb 19, 2016 at 3:03 PM, Gleb Natapov wrote:

> I saw it only once. I usually run 3 nodes. Try killing a node while another
> node is connecting.

Did you run c-s when you added the new node?

Did you do the following?

start node1
start node2
start node3, and in the meantime kill node2

Did you insert data before starting node3?


Asias

@gleb-cloudius (Contributor, Author)

On Thu, Feb 18, 2016 at 11:22:22PM -0800, Asias He wrote:

> Did you run c-s when you added the new node?

Likely.

> Did you do the following?
>
> start node1
> start node2
> start node3, and in the meantime kill node2
>
> Did you insert data before starting node3?

I do not really remember all the steps unfortunately.

        Gleb.

@asias (Contributor) commented Feb 22, 2016

The assert means we could not find the node's own entry in endpoint_state_map. I checked the code, and I cannot really find any place where add_local_application_state could be called before gossiper::start_gossiping(), which is where the broadcast address is inserted into endpoint_state_map.
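
For illustration, here is a minimal standalone sketch of the invariant the assert encodes: start_gossiping() must register the local broadcast address in endpoint_state_map before any local application state is added. The mini_gossiper type and its members are invented for this example; this is not the ScyllaDB code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Invented stand-ins for the real gms types, just to show the ordering invariant.
struct endpoint_state {
    std::map<std::string, std::string> application_state;
};

struct mini_gossiper {
    std::string broadcast_address = "127.0.0.1";
    std::map<std::string, endpoint_state> endpoint_state_map;

    void start_gossiping() {
        // The local endpoint is registered here; later code relies on it existing.
        endpoint_state_map.emplace(broadcast_address, endpoint_state{});
    }

    void add_local_application_state(const std::string& key, std::string value) {
        // The invariant that fired in gms/gossiper.cc:1393.
        assert(endpoint_state_map.count(broadcast_address));
        endpoint_state_map[broadcast_address].application_state[key] = std::move(value);
    }
};

int main() {
    mini_gossiper g;
    g.start_gossiping();   // skip this call and the assert below aborts
    g.add_local_application_state("STATUS", "NORMAL");
}
```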

@penberg (Contributor) commented Feb 24, 2016

@asias, endpoint_state_map is populated only on CPU0, right? Perhaps we're calling add_local_application_state from the wrong CPU?

@asias (Contributor) commented Feb 24, 2016

On Wed, Feb 24, 2016 at 4:55 PM, Pekka Enberg wrote:

> @asias, endpoint_state_map is populated only on CPU0, right? Perhaps we're
> calling add_local_application_state from the wrong CPU?

It is replicated to all the other cores from cpu 0, but we should only call
it on cpu 0.



Asias

@penberg (Contributor) commented Feb 24, 2016

@asias The replication of endpoint_state happens in gossiper::run(), right? Is it possible that someone calls add_local_application_state() before _scheduled_gossip_task is actually scheduled and run?

@asias (Contributor) commented Feb 24, 2016

On Wed, Feb 24, 2016 at 5:05 PM, Pekka Enberg wrote:

> @asias The replication of endpoint_state happens in gossiper::run(), right?
> Is it possible that someone calls add_local_application_state() before
> _scheduled_gossip_task is actually scheduled and run?

It is possible, but no one should call add_local_application_state on a non-zero core.


Asias
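
A minimal sketch of what that contract ("callers must already be on cpu 0") looks like when made explicit as a shard guard. This assumes current Seastar headers and seastar::this_shard_id(); the function below is illustrative, not the actual gossiper code.

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/reactor.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/future.hh>
#include <cassert>

// Stand-in for the shard-0-only update path: assert the contract instead of
// silently relying on it.
static void add_local_application_state_on_shard0_only() {
    assert(seastar::this_shard_id() == 0);   // programming error if violated
    // ... update endpoint_state_map here ...
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // app.run()'s lambda starts on shard 0, so the guard passes here.
        add_local_application_state_on_shard0_only();
        return seastar::make_ready_future<>();
    });
}
```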

@asias (Contributor) commented Feb 25, 2016

I've sent a patch to the list so that add_local_application_state can be called on all cores.

penberg pushed a commit that referenced this issue Feb 25, 2016
Gleb saw once:

scylla: gms/gossiper.cc:1393:
gms::gossiper::add_local_application_state(gms::application_state,
gms::versioned_value)::<lambda()> mutable: Assertion
`endpoint_state_map.count(ep_addr)' failed.

The assert means we could not find the node's own entry in
endpoint_state_map. I cannot really find any place where
add_local_application_state could be called before
gossiper::start_gossiping(), which is where the broadcast address is
inserted into endpoint_state_map.

I cannot reproduce the issue, so let's log the error to narrow down
which application state triggered the assert.

Refs: #795
Message-Id: <f4433be0a0d4f23470a5e24e528afdb67b74c7ef.1456315043.git.asias@scylladb.com>
penberg pushed a commit that referenced this issue Feb 25, 2016
add_local_application_state is used in various places. Before this
patch, it could only be called on cpu zero. To make it safer to use, use
invoke_on() to forward the code to run on cpu zero, so that callers can
call it on any cpu.

Refs: #795
Message-Id: <d69b81c5561622078dbe887d87209c4ea2e3bf46.1456315043.git.asias@scylladb.com>
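
A minimal standalone sketch of the "forward to cpu zero" pattern this commit describes, assuming current Seastar headers and using seastar::smp::submit_to() rather than the invoke_on() call named in the commit; the function and map names below are illustrative, not the ScyllaDB code.

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/future.hh>
#include <iostream>
#include <string>
#include <unordered_map>

// Shard-0-only state, standing in for gossiper::endpoint_state_map.
static std::unordered_map<std::string, std::string> local_state_map;

// Safe to call from any shard: hop to shard 0 before touching the map.
static seastar::future<> add_local_application_state(std::string key, std::string value) {
    return seastar::smp::submit_to(0, [key = std::move(key), value = std::move(value)] {
        local_state_map[key] = value;   // always executed on shard 0
    });
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // Issue the call from another shard (if any) to show the forwarding.
        unsigned src = seastar::smp::count > 1 ? 1u : 0u;
        return seastar::smp::submit_to(src, [] {
            return add_local_application_state("STATUS", "NORMAL");
        }).then([] {
            std::cout << "entries on shard 0: " << local_state_map.size() << "\n";
        });
    });
}
```

The idea is that the authoritative copy of the map lives on shard 0 (and, as noted above, is replicated outward from there), so funneling all writers through shard 0 makes the call safe from any shard without locking.
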
@slivne (Contributor) commented Feb 25, 2016

I suggest we close this, unless there is anything else we can do.

@penberg penberg closed this as completed Feb 25, 2016
@penberg penberg changed the title assertoin in gossiper during node reconnect Assertion failure in gossiper during node reconnect Feb 25, 2016