Repair fails - server logs reports finished after a while #1428
Comments
Is this OK, and just a bug in the nodetool error handling?
I reran the repair on node A. Now I see this in the logs on the other nodes:
The session 0x6030008ce600 made no progress with peer A
(A is the IP of the node I ran the repair on.) Any idea what this could be, and whether it is of any consequence?
I started a repair on another node, and after a while I see this on the suspect node (I am starting to think one of them is in bad shape):
Prepare completed with 172.27.164.147. Receiving 0, sending 1
Jul 05 21:55:07 app66.prod.content.eniro scylla[9703]: [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] stream_transfer_task: Fail to send STREAM_MUTATION to 172.27.164.147:0: std::runtime_error (std::bad_alloc)
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]: [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] stream_transfer_task: Fail to send to 172.27.164.147:0: broken_semaphore (Semaphore broken)
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]: [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] Streaming error occurred
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]: [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] Session with 172.27.164.147 is complete, state=FAILED
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]: [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] bytes_sent = 14993368, bytes_received = 0
Jul 05 21:55:14 app66.prod.content.eniro scylla[9703]: [shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] Stream failed, peers={172.27.164.147}
Seems a bit serious?
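For anyone triaging similar output, here is a minimal Python sketch that groups stream_session lines by stream ID and reports the peer, final state, and byte counts. It is not part of Scylla or nodetool. The regular expressions are assumptions derived only from the lines quoted above, and `summarize_streams` is a hypothetical helper name.

```python
import re

# Sample lines copied from the log excerpt above (journal prefix omitted).
LOG = """\
[shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] stream_transfer_task: Fail to send STREAM_MUTATION to 172.27.164.147:0: std::runtime_error (std::bad_alloc)
[shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] Session with 172.27.164.147 is complete, state=FAILED
[shard 2] stream_session - [Stream #5e5e7d11-42ea-11e6-b219-000000000000] bytes_sent = 14993368, bytes_received = 0
"""

# Patterns are assumptions based on the quoted lines, not a documented format.
STREAM_ID = re.compile(r"\[Stream #([0-9a-f-]+)\]")
SESSION_DONE = re.compile(r"Session with (\S+) is complete, state=(\w+)")
BYTES = re.compile(r"bytes_sent = (\d+), bytes_received = (\d+)")


def summarize_streams(text):
    """Return a dict keyed by stream ID with peer, state, byte counts, errors."""
    streams = {}
    for line in text.splitlines():
        sid = STREAM_ID.search(line)
        if not sid:
            continue  # not a stream_session line
        info = streams.setdefault(sid.group(1), {
            "peer": None, "state": None,
            "bytes_sent": None, "bytes_received": None,
            "errors": [],
        })
        done = SESSION_DONE.search(line)
        if done:
            info["peer"], info["state"] = done.groups()
        counts = BYTES.search(line)
        if counts:
            info["bytes_sent"], info["bytes_received"] = map(int, counts.groups())
        if "Fail to send" in line:
            # Keep the message text that follows the stream ID.
            info["errors"].append(line[sid.end():].strip())
    return streams


summary = summarize_streams(LOG)
for stream_id, info in summary.items():
    print(stream_id, info["peer"], info["state"], info["bytes_sent"], len(info["errors"]))
```

The same function should work on lines that still carry the journal prefix, since it only searches for the stream ID and the message fragments within each line.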
You shouldn't repair the system keyspace, it stores local information. But (1) we should have prevented it and (2) it should have succeeded (destroying your cluster in the process).
There is already an issue for not repairing the system keyspace. What we concluded was that we don't really repair the system keyspace with —
I don't explicitly repair the system keyspace; I just ran nodetool repair. Is the cluster broken now? Can it be fixed somehow if that is the case?
nodetool repair consistently reports the initial error after a while though.
I don't think it's broken if you just ran nodetool repair. But we need to fix it not to attempt to repair the system keyspace.
(it would probably have broken the cluster if it did manage to repair it, but since it didn't, it's fine)
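Until that is fixed, one possible way to keep the system keyspace out of a repair run is to name each keyspace explicitly, since nodetool repair accepts a keyspace argument. The Python sketch below only builds the command lines and does not run them. The keyspace names `app_data` and `app_events` are hypothetical, and treating only `system` as local-only is an assumption taken from this thread.

```python
# Assumption from this thread: only "system" holds node-local data.
LOCAL_ONLY_KEYSPACES = {"system"}


def repair_commands(keyspaces):
    """Build one `nodetool repair <keyspace>` command per non-local keyspace."""
    return [
        ["nodetool", "repair", ks]
        for ks in keyspaces
        if ks not in LOCAL_ONLY_KEYSPACES
    ]


# Hypothetical keyspace list; replace with the keyspaces in your cluster.
for cmd in repair_commands(["system", "app_data", "app_events"]):
    print(" ".join(cmd))
```

Each resulting list can be passed to `subprocess.run` on the node being repaired if you want to execute it rather than print it.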
On Tue, Jul 05, 2016 at 11:41:57PM -0700, Avi Kivity wrote:
@asias ping
As mentioned in #1452 (comment), I think the case where a repair hits an error but is reported as "repair successfully" will no longer happen.
Installation details
Scylla version (or git commit hash): 1.2.1
Cluster size: 3
OS (RHEL/CentOS/Ubuntu/AWS AMI): RHEL
After successful repair of our app tables it failed after the system keyspace:
The logs say very little:
This goes on for some time and then: