
Allow cluster health check to be optional #35

Closed
lukaszgielecinski opened this issue Aug 23, 2016 · 6 comments

@lukaszgielecinski

At the moment cqlmigrate requires that all nodes in all datacentres are up before it will run. We would like to make this check optional so that it still runs when a node is down and the migration is attempted with the supplied consistency levels, which we plan to set to LOCAL_QUORUM for reads and EACH_QUORUM for writes for our migrations.
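
For context, here is a minimal sketch of the consistency levels we have in mind, using the DataStax Java driver 3.x directly rather than cqlmigrate's own configuration API; the contact point, keyspace and statement are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Default consistency for reads: LOCAL_QUORUM.
Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")   // placeholder contact point
        .withQueryOptions(new QueryOptions()
                .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
        .build();
Session session = cluster.connect("my_keyspace"); // placeholder keyspace

// Migration writes overridden to EACH_QUORUM on a per-statement basis.
Statement write = new SimpleStatement(
        "INSERT INTO example_table (id, value) VALUES (1, 'migrated')") // placeholder statement
        .setConsistencyLevel(ConsistencyLevel.EACH_QUORUM);
session.execute(write);

cluster.close();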

@adamdougal
Contributor

Can this now be closed? Did #37 fix this?

@sebbonnet

sebbonnet commented Oct 27, 2016

I don't think the fix in #37 addresses this issue when schema upgrades are required.

When a node is down, schema upgrades can still be performed, although Cassandra will issue a warning stating that a "schema version mismatch was detected". When the node comes back up, the schema upgrade is applied to it and the schema version is synced.
Here is the message I got when updating the schema with one node down and a replication factor of 3 in a 4-node cluster. I wonder if doing the same via the CQL driver would throw a specific exception for this.

cirrus_admin@cqlsh:platform_tests> alter table mycqlmigratetable add colc text;

Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
OperationTimedOut: errors={}, last_host=10.50.98.145
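
Presumably the Java driver would surface this as a client-side timeout too. A minimal sketch (driver 3.x, unverified against this exact scenario, placeholder contact point), catching OperationTimedOutException or NoHostAvailableException:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.NoHostAvailableException;
import com.datastax.driver.core.exceptions.OperationTimedOutException;

Cluster cluster = Cluster.builder()
        .addContactPoint("10.50.98.145")   // placeholder contact point
        .build();
try {
    Session session = cluster.connect("platform_tests");
    session.execute("ALTER TABLE mycqlmigratetable ADD colc text");
} catch (OperationTimedOutException | NoHostAvailableException e) {
    // Client-side timeout or no host able to answer; the ALTER may still have been
    // applied by the reachable nodes, as the nodetool output below shows.
    System.err.println("Schema change did not complete cleanly: " + e.getMessage());
} finally {
    cluster.close();
}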

New schema version applied to the 3 nodes that are up:

ubuntu@ip-10-50-99-79:~$ nodetool describecluster
Cluster Information:
    Name: sandbox
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        1ef20cc1-bed4-32d3-a323-36b4b0a99e3e: [10.50.100.67, 10.50.98.145, 10.50.99.79]

        UNREACHABLE: [10.50.102.20]

Schema version synced after restarting the down node:

ubuntu@ip-10-50-99-79:~$ nodetool describecluster
Cluster Information:
    Name: sandbox
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        1ef20cc1-bed4-32d3-a323-36b4b0a99e3e: [10.50.100.67, 10.50.98.145, 10.50.102.20, 10.50.99.79]
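
For what it's worth, the driver also exposes a programmatic equivalent of this nodetool check; a small sketch (Java driver 3.x, placeholder contact point):

import com.datastax.driver.core.Cluster;

Cluster cluster = Cluster.builder()
        .addContactPoint("10.50.99.79")   // placeholder contact point
        .build();
// True only when every reachable node reports the same schema version,
// i.e. the single-UUID case in the `nodetool describecluster` output above.
boolean inAgreement = cluster.getMetadata().checkSchemaAgreement();
System.out.println("Schema in agreement: " + inAgreement);
cluster.close();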

@adamdougal
Contributor

@sebbonnet - in a previous comment @chbatey said "I am hesitant to allow schema changes when nodes are down. It really isn't a good idea and it'll allow people to shoot themselves in the foot."

PR #37 means we now don't check cluster health if there are no schema changes, which is a definite improvement.

Unless we disagree with @chbatey's comment, presumably any further changes towards this issue are inadvisable? A rough illustration of the behaviour we now have follows below.
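
To make sure we're talking about the same behaviour, here is a rough illustration of what #37 gives us. This is not cqlmigrate's actual code; pendingMigrations, allNodesUp and applyMigrations are hypothetical helpers used only to show the flow:

// Hypothetical helpers, for illustration of the flow after #37 only.
List<String> pending = pendingMigrations(session, migrationsDirectory);
if (pending.isEmpty()) {
    // Nothing to apply, so the cluster health check is skipped entirely.
    return;
}
// Schema changes are needed: insist on a fully healthy cluster before applying them.
if (!allNodesUp(cluster)) {
    throw new IllegalStateException("Cluster unhealthy: refusing to apply schema changes");
}
applyMigrations(session, pending);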

@sebbonnet

@adamdougal sounds like we need to discuss this further with @chbatey and find out his reasons for being "hesitant". I searched the web but could not find anything that would prove or disprove whether it's OK, so I've asked the question on the IRC channel. Granted, the test I've done above is not realistic, so I'd like to try others on a cluster that's under load and bring up a node that is on a previous schema version.
#37 is definitely an improvement as it means most app deployments/restarts will not be impacted by a node being down.

@lukaszgielecinski
Author

From our perspective it can be closed; we approached the problem with a different solution.

@sebbonnet

I've asked the question a couple of times on IRC, but had no response... I've discussed this with @chbatey, who said there used to be quite a few bugs, reported and unreported, in this area where node schemas would not be in agreement, so if we were to try this we would need to make sure nodes can recover from schema disagreement too.
So doing a schema upgrade with a node down is probably not worth the risk at this stage, especially given that the majority of deployments will not be impacted following #37.
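
If we ever did go down that route, here is a minimal sketch of the kind of safeguard I'd want first: waiting for schema agreement via the Java driver before and after each change. The class name and timeout parameters are illustrative, not cqlmigrate settings:

import com.datastax.driver.core.Cluster;

final class SchemaAgreementWaiter {
    // Poll until all reachable nodes report the same schema version, or the deadline passes.
    static boolean waitForSchemaAgreement(Cluster cluster, long maxWaitMillis, long pollIntervalMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        while (System.currentTimeMillis() < deadline) {
            if (cluster.getMetadata().checkSchemaAgreement()) {
                return true;
            }
            Thread.sleep(pollIntervalMillis);
        }
        return false;
    }
}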
