Assertion failure in gossiper during node reconnect #795

Closed · 14 comments

@gleb-cloudius (Contributor) opened this issue Jan 14, 2016

Got this after killing scylla with "kill -9" and restarting:

scylla: gms/gossiper.cc:1393: gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> mutable: Assertion `endpoint_state_map.count(ep_addr)' failed.

#0  0x00007f0b18d6c8d7 in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f0b18d6e53a in __GI_abort () at abort.c:89
#2  0x00007f0b18d6547d in __assert_fail_base (
    fmt=0x7f0b18ebecb8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x14316d8 "endpoint_state_map.count(ep_addr)", 
    file=file@entry=0x143203d "gms/gossiper.cc", line=line@entry=1393, 
    function=function@entry=0x1434b00 <gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::{lambda()#1}::operator()()::__PRETTY_FUNCTION__> "gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> mutable") at assert.c:92
#3  0x00007f0b18d65532 in __GI___assert_fail (
    assertion=0x14316d8 "endpoint_state_map.count(ep_addr)", 
    file=0x143203d "gms/gossiper.cc", line=1393, 
    function=0x1434b00 <gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::{lambda()#1}::operator()()::__PRETTY_FUNCTION__> "gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> mutable") at assert.c:101
#4  0x0000000000bbfb3d in operator() (__closure=0x60000892f620)
    at gms/gossiper.cc:1393
#5  apply (args=<optimized out>, 
    func=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x9fe3e62, DIE 0xa26bac2>) at /home/gleb/work/seastar/seastar/core/apply.hh:34
#6  apply<gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> > (args=<optimized out>, 
    func=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x9fe3e62, DIE 0xa26baff>) at /home/gleb/work/seastar/seastar/core/apply.hh:42
#7  do_void_futurize_apply_tuple<gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> > (args=<optimized out>, 
    func=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x9fe3e62, DIE 0xa26bb46>) at /home/gleb/work/seastar/seastar/core/future.hh:1161
#8  apply<gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()> > (args=<optimized out>, 
    func=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x9fe3e62, DIE 0xa26bb85>) at /home/gleb/work/seastar/seastar/core/future.hh:1181
#9  operator() (__closure=<optimized out>)
    at /home/gleb/work/seastar/seastar/core/thread.hh:258
#10 std::_Function_handler<void(), seastar::async(Func&&, Args&& ...)::<lambda(seastar::async(Func&&, Args&& ...)::work&)> mutable [with Func = gms::gossiper::add_local_application_state(gms::application_state, gms::versioned_value)::<lambda()>; Args = {}]::<lambda()> >::_M_invoke(const std::_Any_data &) (
    __functor=...) at /usr/include/c++/4.9.2/functional:2039
#11 0x0000000000422408 in operator() (this=0x6000000bf350)
    at /usr/include/c++/4.9.2/functional:2439
#12 seastar::thread_context::main (this=0x6000000bf340) at core/thread.cc:139
#13 0x000000000051aa92 in seastar::thread_context::s_main (lo=<optimized out>, 
    hi=<optimized out>) at core/thread.cc:130
#14 0x00007f0b18d80000 in ?? () from /lib64/libc.so.6
#15 0x0000000000000000 in ?? ()
@slivne (Contributor) commented Feb 18, 2016

@asias did we fix this?

@slivne slivne added this to the GA milestone Feb 18, 2016
@asias (Contributor) commented Feb 18, 2016

Worth retesting, since we now drain on shutdown. @gleb-cloudius can you verify?

@gleb-cloudius (Contributor, Author)

On Thu, Feb 18, 2016 at 03:31:08AM -0800, Asias He wrote:

> Worth retesting, since we now drain on shutdown. @gleb-cloudius can you verify?

kill -9 will not drain.

        Gleb.

@asias (Contributor) commented Feb 19, 2016

On Thu, Feb 18, 2016 at 7:35 PM, Gleb Natapov wrote:

> kill -9 will not drain.

You are right.

I tried to reproduce it myself:

start scylla node1
start scylla node2
run c-s for a while
kill -9 node2
start node2

I repeated the above a few times and did not see the assert.

Gleb, what did you do exactly?

Asias

@gleb-cloudius (Contributor, Author)

On Thu, Feb 18, 2016 at 07:19:39PM -0800, Asias He wrote:

> I repeated the above a few times and did not see the assert.
>
> Gleb, what did you do exactly?

I saw it only once. I usually run 3 nodes. Try killing a node while another
node is connecting.

        Gleb.

@asias (Contributor) commented Feb 19, 2016

On Fri, Feb 19, 2016 at 3:03 PM, Gleb Natapov wrote:

> I saw it only once. I usually run 3 nodes. Try killing a node while another
> node is connecting.

Did you run c-s when you added the new node?

Did you do the following?

start node1
start node2
start node3, and in the meantime kill node2

Did you insert data before starting node3?


Asias

@gleb-cloudius (Contributor, Author)

On Thu, Feb 18, 2016 at 11:22:22PM -0800, Asias He wrote:

> Did you run c-s when you added the new node?

Likely.

> Did you do the following?
>
> start node1
> start node2
> start node3, and in the meantime kill node2
>
> Did you insert data before starting node3?

I do not really remember all the steps unfortunately.

        Gleb.

@asias (Contributor) commented Feb 22, 2016

The assert means we could not find the node's own entry in endpoint_state_map. I checked the code, and I cannot really find any place where add_local_application_state could be called before gossiper::start_gossiping(), which is where the broadcast address is inserted into endpoint_state_map.
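
For illustration, here is a minimal standalone sketch of the invariant the assert encodes: start_gossiping() must register the local broadcast address in endpoint_state_map before any local application state is added. The mini_gossiper type and its members are invented for this example; this is not the ScyllaDB code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Invented stand-ins for the real gms types, just to show the ordering invariant.
struct endpoint_state {
    std::map<std::string, std::string> application_state;
};

struct mini_gossiper {
    std::string broadcast_address = "127.0.0.1";
    std::map<std::string, endpoint_state> endpoint_state_map;

    void start_gossiping() {
        // The local endpoint is registered here; later code relies on it existing.
        endpoint_state_map.emplace(broadcast_address, endpoint_state{});
    }

    void add_local_application_state(const std::string& key, std::string value) {
        // The invariant that fired in gms/gossiper.cc:1393.
        assert(endpoint_state_map.count(broadcast_address));
        endpoint_state_map[broadcast_address].application_state[key] = std::move(value);
    }
};

int main() {
    mini_gossiper g;
    g.start_gossiping();   // skip this call and the assert below aborts
    g.add_local_application_state("STATUS", "NORMAL");
}
```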

@penberg (Contributor) commented Feb 24, 2016

@asias, endpoint_state_map is populated only on CPU0, right? Perhaps we're calling add_local_application_state from the wrong CPU?

@asias (Contributor) commented Feb 24, 2016

On Wed, Feb 24, 2016 at 4:55 PM, Pekka Enberg wrote:

> @asias, endpoint_state_map is populated only on CPU0, right? Perhaps we're
> calling add_local_application_state from the wrong CPU?

It is replicated to all the other cores from cpu 0, but we should only call
it on cpu 0.



Asias

@penberg (Contributor) commented Feb 24, 2016

@asias The replication of endpoint_state happens in gossiper::run(), right? Is it possible that someone calls add_local_application_state() before _scheduled_gossip_task is actually scheduled and run?

@asias (Contributor) commented Feb 24, 2016

On Wed, Feb 24, 2016 at 5:05 PM, Pekka Enberg wrote:

> @asias The replication of endpoint_state happens in gossiper::run(), right?
> Is it possible that someone calls add_local_application_state() before
> _scheduled_gossip_task is actually scheduled and run?

It is possible, but no one should call add_local_application_state on a non-zero core.


Asias
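
A minimal sketch of what that contract ("callers must already be on cpu 0") looks like when made explicit as a shard guard. This assumes current Seastar headers and seastar::this_shard_id(); the function below is illustrative, not the actual gossiper code.

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/reactor.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/future.hh>
#include <cassert>

// Stand-in for the shard-0-only update path: assert the contract instead of
// silently relying on it.
static void add_local_application_state_on_shard0_only() {
    assert(seastar::this_shard_id() == 0);   // programming error if violated
    // ... update endpoint_state_map here ...
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // app.run()'s lambda starts on shard 0, so the guard passes here.
        add_local_application_state_on_shard0_only();
        return seastar::make_ready_future<>();
    });
}
```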

@asias (Contributor) commented Feb 25, 2016

I've sent a patch to the list so that add_local_application_state can be called on all cores.

penberg pushed a commit that referenced this issue Feb 25, 2016
Gleb saw once:

scylla: gms/gossiper.cc:1393:
gms::gossiper::add_local_application_state(gms::application_state,
gms::versioned_value)::<lambda()> mutable: Assertion
`endpoint_state_map.count(ep_addr)' failed.

The assert means we could not find the node's own entry in
endpoint_state_map. I cannot really find any place where
add_local_application_state could be called before
gossiper::start_gossiping(), which is where the broadcast address is
inserted into endpoint_state_map.

I cannot reproduce the issue, so let's log the error to narrow down
which application state triggered the assert.

Refs: #795
Message-Id: <f4433be0a0d4f23470a5e24e528afdb67b74c7ef.1456315043.git.asias@scylladb.com>
penberg pushed a commit that referenced this issue Feb 25, 2016
add_local_application_state is used in various places. Before this
patch, it could only be called on cpu zero. To make it safer to use, use
invoke_on() to forward the code to run on cpu zero, so that callers can
call it on any cpu.

Refs: #795
Message-Id: <d69b81c5561622078dbe887d87209c4ea2e3bf46.1456315043.git.asias@scylladb.com>
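
A minimal standalone sketch of the "forward to cpu zero" pattern this commit describes, assuming current Seastar headers and using seastar::smp::submit_to() rather than the invoke_on() call named in the commit; the function and map names below are illustrative, not the ScyllaDB code.

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/future.hh>
#include <iostream>
#include <string>
#include <unordered_map>

// Shard-0-only state, standing in for gossiper::endpoint_state_map.
static std::unordered_map<std::string, std::string> local_state_map;

// Safe to call from any shard: hop to shard 0 before touching the map.
static seastar::future<> add_local_application_state(std::string key, std::string value) {
    return seastar::smp::submit_to(0, [key = std::move(key), value = std::move(value)] {
        local_state_map[key] = value;   // always executed on shard 0
    });
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        // Issue the call from another shard (if any) to show the forwarding.
        unsigned src = seastar::smp::count > 1 ? 1u : 0u;
        return seastar::smp::submit_to(src, [] {
            return add_local_application_state("STATUS", "NORMAL");
        }).then([] {
            std::cout << "entries on shard 0: " << local_state_map.size() << "\n";
        });
    });
}
```

The idea is that the authoritative copy of the map lives on shard 0 (and, as noted above, is replicated outward from there), so funneling all writers through shard 0 makes the call safe from any shard without locking.
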
@slivne (Contributor) commented Feb 25, 2016

I suggest we close this, unless there is anything else we can do.

@penberg penberg closed this as completed Feb 25, 2016
@penberg penberg changed the title assertoin in gossiper during node reconnect Assertion failure in gossiper during node reconnect Feb 25, 2016