gossip cannot rejoin a node #403
I ran a 3-instance cluster with ami-afb6c8ca and loaded it with writes (RF=3, CL=ALL) until some node fell out of the cluster due to big networking latencies, but after the traffic stopped it never rejoined the cluster. Now I see a very strange picture: there are three nodes, 81, 82 and 83. Nodes 81 and 83 see only 2 nodes in the cluster (nodes 81 and 83), but node 82 sees all three!

Comments
After restarting node 82, everything is back to normal.
@gleb-cloudius This can happen when node 82 is not running gossiper::run() anymore because it is stuck in
I saw this happen myself once. In this state, gossiper::do_status_check() will not be called and node 82 will not update the other nodes' status information (up or down), which is why you saw node 82 think all the nodes were up. However, nodes 83 and 81 cannot connect to node 82, so they think 82 is down, which is correct. When you restart node 82, it recovers from the hang, nodes 83 and 81 can talk to it again, and the cluster becomes normal again. I will send a patch to expose the heartbeat version info through our HTTP API, so that we can check whether gossiper::run() is still running when this issue happens.
@gleb-cloudius I verified my theory with cl=ALL, rf=3 on a 3-node Scylla cluster + 2 c-s nodes, after a 1h run. 3547 is the heartbeat version for node 155, and it does not change over time:

[fedora@ip-172-31-47-55 c-s]$ date

a few seconds later:

[fedora@ip-172-31-47-55 c-s]$ date

Let's take a closer look at node 155. In node 155's eyes, node 156's and node 157's heartbeat versions are increasing. This means node 155 can receive gossip data from them and send node 155's info back; however, node 155's own heartbeat version is not updated, so the other nodes will think node 155 is not alive.

[fedora@ip-172-31-47-55 c-s]$ curl --silent -X GET "http://172.31.33.155:10000/gossiper/heart_beat_version/172.31.33.155";echo

On node 155, the last 100 lines of the journal:
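A quick way to detect this state from the outside is to sample a node's own heartbeat version twice and check whether it advances. A minimal sketch, assuming the /gossiper/heart_beat_version/{ip} REST endpoint shown above on port 10000 (the script name and the 10-second window are illustrative):

#!/bin/bash
# check_heartbeat.sh <node-ip>
# Sample the node's view of its own heartbeat version twice; gossip
# normally bumps this value about once per second, so a value that
# does not advance suggests gossiper::run() is stuck on that node.
NODE=$1
URL="http://$NODE:10000/gossiper/heart_beat_version/$NODE"
V1=$(curl --silent -X GET "$URL")
sleep 10
V2=$(curl --silent -X GET "$URL")
if [ "$V1" = "$V2" ]; then
    echo "heartbeat version stuck at $V1 - gossiper::run() may be hung"
else
    echo "heartbeat version advanced from $V1 to $V2 - gossiper looks alive"
fi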
I added debug code to check whether replicating data to the other cores blocks:
This is on a 32-core system.
@avikivity ideas?
$ sudo perf record --call-graph dwarf -p
scylla$ addr2line -Cfpi -e build/release/scylla.aws.symbol 0x00000000000d7dfe
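For reference, the overall workflow here is: record stacks from the running process with perf, then resolve raw addresses against the unstripped binary with addr2line. A sketch, assuming the scylla process is found via pgrep (the PID in the original command was elided) and using the symbol file path mentioned above:

# Record call graphs from the running scylla process
# (stop with Ctrl-C after a few seconds of load)
sudo perf record --call-graph dwarf -p $(pgrep -x scylla)
# Browse the recorded samples
sudo perf report
# Resolve a raw address (demangled, with function names and
# inlined frames) against the unstripped binary
addr2line -Cfpi -e build/release/scylla.aws.symbol 0x00000000000d7dfe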
2015-09-28 13:55 GMT+02:00 Asias He notifications@github.com:
I did see compaction in the log around the replication timeout. @raphaelsc Can you take a look at do_recursive_touch_directory?
On Mon, Sep 28, 2015 at 06:13:20AM -0700, Tomasz Grabiec wrote:
On Mon, Sep 28, 2015 at 06:16:20PM -0700, Asias He wrote:
On Tue, Sep 29, 2015 at 3:49 PM, Gleb Natapov notifications@github.com wrote:
No, I did not. It looks like the symbol is bogus. BTW, I used unstripped
On Tue, Sep 29, 2015 at 12:51:57AM -0700, Asias He wrote:
On Tue, Sep 29, 2015 at 4:03 PM, Gleb Natapov notifications@github.com wrote:
No, I cannot. The node is terminated.
On Tue, Sep 29, 2015 at 01:15:53AM -0700, Asias He wrote:
On Tue, Sep 29, 2015 at 4:21 PM, Gleb Natapov notifications@github.com wrote:
Gleb, can you do it yourself this time? I'm not familiar with the code with
On Tue, Sep 29, 2015 at 01:32:03AM -0700, Asias He wrote:
On Tue, Sep 29, 2015 at 4:36 PM, Gleb Natapov notifications@github.com wrote:
The code is here:
. I think you saw this problem originally. Anyway, start 3 Scylla nodes using
I will send you my scripts to generate the load. If it turns out it is not an endless loop in gdb, you might want to check
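One way to test the endless-loop hypothesis mentioned here is to attach gdb to the stuck process and sample its stacks a couple of times. A sketch, assuming gdb is available on the node and the binary has symbols (the pgrep lookup is illustrative):

# Dump all thread backtraces from the running scylla process
sudo gdb -p $(pgrep -x scylla) -batch -ex 'thread apply all bt'
# Take a second sample a few seconds later; if the same frames keep
# appearing at the top of the reactor thread's stack, it is likely
# spinning in a loop rather than blocked
sleep 5
sudo gdb -p $(pgrep -x scylla) -batch -ex 'thread apply all bt'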
On Tue, Sep 29, 2015 at 01:49:41AM -0700, Asias He wrote:
Here is the stack of the infinite loop:
Looking at the GDB dump, it looks like some sort of segment heap corruption. This happens when a segment is freed. The segment descriptor looks sane; I ruled out a double-free. I'm not yet sure what could cause the corruption. I tried to reproduce it on EC2 using 3 Scylla machines and 5 cassandra-stress machines (config as Asias mentioned). It survived a 5-hour run with RF=3 CL=ALL, and also a couple of shorter 1-2 hour runs. Maybe it's related to how the c-s clients are started. @asias @gleb-cloudius, how exactly do you start the load? Was it on a fresh database, or an already populated one?
On Tue, Oct 06, 2015 at 12:45:04AM -0700, Tomasz Grabiec wrote:
cassandra-stress write no-warmup duration=1h cl=ALL -mode native cql3
cassandra-stress write no-warmup duration=1h cl=ALL -mode native cql3
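The tail of these command lines looks clipped; judging by the question below about $IP, each invocation also pointed at the cluster with a -node argument. A hypothetical reconstruction:

# hypothetical reconstruction - the tail of the original command was
# truncated; -node takes one contact point or a comma-separated list
cassandra-stress write no-warmup duration=1h cl=ALL -mode native cql3 -node $IP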
@gleb-cloudius Is $IP an IP of one of the servers, or a list of IPs?
On Tue, Oct 06, 2015 at 01:00:17AM -0700, Tomasz Grabiec wrote:
@gleb-cloudius @asias Which AMIs did you use for Scylla and c-s?
On Mon, Oct 19, 2015 at 05:12:14AM -0700, Tomasz Grabiec wrote:
On 19 Oct 2015, 2:15 PM, "Gleb Natapov" notifications@github.com wrote:
That's the instance type; I'm asking about the image.
On Mon, Oct 19, 2015 at 05:31:06AM -0700, Tomasz Grabiec wrote:
On Mon, Oct 19, 2015 at 8:35 PM, Gleb Natapov notifications@github.com wrote:
I'm not 100% sure, but it probably is:
I managed to reproduce it with 2ccb5fe when starting the clients exactly as Gleb mentioned.