
Repair fails - server logs reports finished after a while #1428

Closed
dahankzter opened this issue Jul 5, 2016 · 13 comments
@dahankzter

Installation details
Scylla version (or git commit hash): 1.2.1
Cluster size: 3
OS (RHEL/CentOS/Ubuntu/AWS AMI): RHEL

After a successful repair of our app tables, it failed after the system keyspace:

[2016-07-05 10:18:23,380] Starting repair command #1, repairing 1 ranges for keyspace transform (parallelism=SEQUENTIAL, full=true)
[2016-07-05 10:29:01,686] Repair session 1 finished
[2016-07-05 10:29:02,213] Starting repair command #2, repairing 1 ranges for keyspace collector (parallelism=SEQUENTIAL, full=true)
[2016-07-05 10:29:27,329] Repair session 2 finished
[2016-07-05 10:29:27,409] Starting repair command #3, repairing 1 ranges for keyspace input (parallelism=SEQUENTIAL, full=true)
[2016-07-05 10:44:02,978] Repair session 3 failed
[2016-07-05 10:44:02,978] Repair session 3 finished
[2016-07-05 10:44:03,219] Starting repair command #4, repairing 1 ranges for keyspace system (parallelism=SEQUENTIAL, full=true)
[2016-07-05 10:44:03,325] Repair session 4 finished
error: nodetool failed, check server logs
-- StackTrace --
java.lang.RuntimeException: nodetool failed, check server logs
    at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:290)
    at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:202)

The logs say very little:

Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] stream_session - [Stream #165c64a1-428b-11e6-a11d-000000000000] Session with 172.27.164.164 is complete, state=FAILED
Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] stream_session - [Stream #165c64a1-428b-11e6-a11d-000000000000] bytes_sent = 0, bytes_received = 9078598
Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] stream_session - [Stream #165c64a1-428b-11e6-a11d-000000000000] Stream failed, peers={172.27.164.147, 172.27.164.164}
Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed)
Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - Failed sync of range (6570231404966192696, 6583873856864803129]: streaming::stream_exception (Stream failed)
Jul 05 10:44:02 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - repair 3 failed - std::runtime_error (Checksum or sync of partial range failed)
Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - starting user-requested repair for keyspace system, repair id 4, options {incremental=false, parallelism=0, primaryRange=false}
Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - [repair #a015b290-428c-11e6-a11d-000000000000] new session: will sync {} on range (-inf, -9214362364061060143] for system.{size
Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - [repair #a015b291-428c-11e6-a11d-000000000000] new session: will sync {} on range (-9214362364061060143, -9204834331903101080]

This goes on for some time and then:

Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - [repair #a036cf21-428c-11e6-a11d-000000000000] new session: will sync {} on range (9197064076111319020, 9215082167558200096] fo
Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - [repair #a036cf22-428c-11e6-a11d-000000000000] new session: will sync {} on range (9215082167558200096, +inf) for system.{size_
Jul 05 10:44:03 app66.prod.content.eniro scylla[10290]:  [shard 0] repair - repair 4 completed sucessfully
@dahankzter
Author

Is this ok and just a bug in the nodetool error handling?

@dahankzter
Author

Rerunning a repair on node A again.

Now I see this in the logs on the other nodes:

The session 0x6030008ce600 made no progress with peer A

A is the IP of the node I am running repair on. Any idea what this could be, and whether it is of any consequence?
The repair seems stalled on command 3, but I guess it can just take time.

@dahankzter
Author

I started a repair on another node and after a while I see this on the suspect node (I am starting to think one of them is in bad shape):

Prepare completed with 172.27.164.147. Receiving 0, sending 1
Jul 05 21:55:07 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] stream_transfer_task: Fail to send STREAM_MUTATION to 172.27.164.147:0: std::runtime_error (std::bad_alloc)
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] stream_transfer_task: Fail to send to 172.27.164.147:0: broken_semaphore (Semaphore broken)
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] Streaming error occurred
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] Session with 172.27.164.147 is complete, state=FAILED
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] bytes_sent = 14993368, bytes_received = 0
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]:  [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] Stream failed, peers={172.27.164.147}

Seems a bit serious?

@avikivity
Member

You shouldn't repair the system keyspace, it stores local information.

But (1) we should have prevented it and (2) it should have succeeded (destroying your cluster in the process).
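For readers hitting the same situation before a fix lands, one hedged workaround sketch is to repair user keyspaces one at a time so that `system` is never touched. The keyspace names below are just the ones from the log above; in practice you would substitute your own list (e.g. from `cqlsh -e 'DESCRIBE KEYSPACES'`), and the `echo` stands in for the real `nodetool repair "$ks"` call:

```shell
# Sketch: repair keyspaces individually, skipping local-only system keyspaces.
# Keyspace names are taken from the log above; replace them with your own.
# Swap the echo for: nodetool repair "$ks"
for ks in transform collector input system; do
  case "$ks" in
    system|system_*) echo "skipping $ks" ;;          # local-only, nothing to repair
    *)               echo "would repair $ks" ;;      # placeholder for nodetool repair
  esac
done
```

This only avoids the symptom; the underlying `input`-keyspace stream failure would still need investigating separately.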

@avikivity
Member

@nyh

@slivne
Contributor

slivne commented Jul 6, 2016

There is already an issue for not repairing the system keyspace:
#1380

What we concluded was that we don't really repair the system keyspace with
other nodes; it stays local.

@nyh https://github.com/nyh



@dahankzter
Author

I don't explicitly repair the system keyspace, I just ran

nodetool repair

Is the cluster broken now? Can it be fixed somehow if that is the case?
The nodes seem to connect, and the gossip logs say they see peers.

@dahankzter
Author

nodetool repair consistently reports the initial error after a while though.
Is that related to the system keyspace repair issue?

@avikivity
Member

I don't think it's broken if you just ran nodetool repair. But we need to fix it not to attempt to repair the system keyspace.

@avikivity
Member

(it would probably have broken the cluster if it did manage to repair it, but since it didn't, it's fine)

@gleb-cloudius
Contributor

On Tue, Jul 05, 2016 at 11:41:57PM -0700, Avi Kivity wrote:

(it would probably have broken the cluster if it did manage to repair it, but since it didn't, it's fine)

The system keyspace has only the local node as a replica, so repair will just
do nothing, since there is no one to repair with. Why it sometimes fails I
do not know (out of memory? read error?).

        Gleb.

@slivne slivne added this to the 1.5 milestone Sep 25, 2016
@slivne
Contributor

slivne commented Nov 6, 2016

@asias ping

@asias
Contributor

asias commented Nov 7, 2016

As mentioned in #1452 (comment), I think the case where repair hits an error but is reported as successful will not happen now.

@slivne slivne closed this as completed Nov 9, 2016