
Repair fails - server logs reports finished after a while #1428

Closed
dahankzter opened this issue Jul 5, 2016 · 13 comments
@dahankzter

Installation details
Scylla version (or git commit hash): 1.2.1
Cluster size: 3
OS (RHEL/CentOS/Ubuntu/AWS AMI): RHEL

After a successful repair of our app tables, it failed after the system keyspace:

[2016-07-05 10:18:23,380] Starting repair command #1, repairing 1 ranges for keyspace transform (parallelism=SEQUENTIAL, full=true)
[2016-07-05 10:29:01,686] Repair session 1 finished
[2016-07-05 10:29:02,213] Starting repair command #2, repairing 1 ranges for keyspace collector (parallelism=SEQUENTIAL, full=true)
[2016-07-05 10:29:27,329] Repair session 2 finished
[2016-07-05 10:29:27,409] Starting repair command #3, repairing 1 ranges for keyspace input (parallelism=SEQUENTIAL, full=true)
[2016-07-05 10:44:02,978] Repair session 3 failed
[2016-07-05 10:44:02,978] Repair session 3 finished
[2016-07-05 10:44:03,219] Starting repair command #4, repairing 1 ranges for keyspace system (parallelism=SEQUENTIAL, full=true)
[2016-07-05 10:44:03,325] Repair session 4 finished
error: nodetool failed, check server logs
-- StackTrace --
java.lang.RuntimeException: nodetool failed, check server logs
    at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:290)
    at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:202)

The logs say very little:

Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] stream_session - [Stream #165c64a1-428b-11e6-a11d-000000000000] Session with 172.27.164.164 is complete, state=FAILED
Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] stream_session - [Stream #165c64a1-428b-11e6-a11d-000000000000] bytes_sent = 0, bytes_received = 9078598
Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] stream_session - [Stream #165c64a1-428b-11e6-a11d-000000000000] Stream failed, peers={172.27.164.147, 172.27.164.164}
Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed)
Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - Failed sync of range (6570231404966192696, 6583873856864803129]: streaming::stream_exception (Stream failed)
Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - repair 3 failed - std::runtime_error (Checksum or sync of partial range failed)
Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - starting user-requested repair for keyspace system, repair id 4, options {incremental=false, parallelism=0, primaryRange=false}
Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - [repair #a015b290-428c-11e6-a11d-000000000000] new session: will sync {} on range (-inf, -9214362364061060143] for system.{size
Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - [repair #a015b291-428c-11e6-a11d-000000000000] new session: will sync {} on range (-9214362364061060143, -9204834331903101080]

This goes on for some time and then:

Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - [repair #a036cf21-428c-11e6-a11d-000000000000] new session: will sync {} on range (9197064076111319020, 9215082167558200096] fo
Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - [repair #a036cf22-428c-11e6-a11d-000000000000] new session: will sync {} on range (9215082167558200096, +inf) for system.{size_
Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - repair 4 completed sucessfully
@dahankzter
Author

Is this ok and just a bug in the nodetool error handling?

@dahankzter
Author

Rerunning a repair on node A again.

Now I see this in the logs on the other nodes:

The session 0x6030008ce600 made no progress with peer A

A is the IP of the node I am running repair on. Any idea what this could be, and whether it is of any consequence?
The repair seems stalled on command 3, but I guess it can just take time.

@dahankzter
Author

I started a repair on another node and after a while I see this on the suspect node (I am starting to think one of them is in bad shape):

Prepare completed with 172.27.164.147. Receiving 0, sending 1
Jul 05 21:55:07 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] stream_transfer_task: Fail to send STREAM_MUTATION to 172.27.164.147:0: std::runtime_error (std::bad_alloc)
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] stream_transfer_task: Fail to send to 172.27.164.147:0: broken_semaphore (Semaphore broken)
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] Streaming error occurred
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] Session with 172.27.164.147 is complete, state=FAILED
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] bytes_sent = 14993368, bytes_received = 0
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] Stream failed, peers={172.27.164.147}

Seems a bit serious?

@avikivity
Member

You shouldn't repair the system keyspace, it stores local information.

But (1) we should have prevented it and (2) it should have succeeded (destroying your cluster in the process).
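For readers hitting the same situation before a fix lands, one hedged workaround sketch is to repair user keyspaces one at a time so that `system` is never touched. The keyspace names below are just the ones from the log above; in practice you would substitute your own list (e.g. from `cqlsh -e 'DESCRIBE KEYSPACES'`), and the `echo` stands in for the real `nodetool repair "$ks"` call:

```shell
# Sketch: repair keyspaces individually, skipping local-only system keyspaces.
# Keyspace names are taken from the log above; replace them with your own.
# Swap the echo for: nodetool repair "$ks"
for ks in transform collector input system; do
  case "$ks" in
    system|system_*) echo "skipping $ks" ;;          # local-only, nothing to repair
    *)               echo "would repair $ks" ;;      # placeholder for nodetool repair
  esac
done
```

This only avoids the symptom; the underlying `input`-keyspace stream failure would still need investigating separately.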

@avikivity
Member

@nyh

@slivne
Contributor

slivne commented Jul 6, 2016

There is already an issue for not repairing the system keyspace:
#1380

What we concluded was that we don't really repair the system keyspace with
other nodes; it stays local.

@nyh https://github.com/nyh



@dahankzter
Author

I don't explicitly repair the system keyspace, I just ran

nodetool repair

Is the cluster broken now? Can it be fixed somehow if that is the case?
The nodes seem to connect, and the gossip logs say they see peers.

@dahankzter
Author

nodetool repair consistently reports the initial error after a while though.
Is that related to the system keyspace repair issue?

@avikivity
Member

I don't think it's broken if you just ran nodetool repair. But we need to fix it not to attempt to repair the system keyspace.

@avikivity
Member

(it would probably have broken the cluster if it did manage to repair it, but since it didn't, it's fine)

@gleb-cloudius
Contributor

On Tue, Jul 05, 2016 at 11:41:57PM -0700, Avi Kivity wrote:

(it would probably have broken the cluster if it did manage to repair it, but since it didn't, it's fine)

The system keyspace has only the local node as a replica, so repair will just
do nothing, since there is no one to repair with. Why it sometimes fails I
do not know (out of memory? read error?).

        Gleb.

@slivne slivne added this to the 1.5 milestone Sep 25, 2016
@slivne
Contributor

slivne commented Nov 6, 2016

@asias ping

@asias
Contributor

asias commented Nov 7, 2016

As mentioned in #1452 (comment), I think the case where repair hits an error but is reported as successful will not happen now.

@slivne slivne closed this as completed Nov 9, 2016