CA-287343: Update HA failure tolerance plan for corosync/GFS2 and add… #3560
Conversation
I think the calculation in xapi_ha_vm_failover.ml is good if we don't consider cleanly shut down hosts.
The tests don't seem to exercise the formula, because they allow more than half of the hosts to be down; I think we need some actual VMs in the test scenarios to test the corosync formula.
ocaml/xapi/xapi_ha_vm_failover.ml
Outdated
let nhosts = List.length (Db.Host.get_all ~__context) in
let total_hosts = List.length (Db.Host.get_all ~__context) in
(* For corosync HA less than half of the pool can fail whilst maintaining quorum *)
let corosync_ha_max_hosts = ((total_hosts - 1) / 2) in
I think that for corosync we need to consider losing only the enabled hosts. If you have 10 hosts, with 5 of them cleanly shut down, we can tolerate the loss of 2 additional hosts, so the maximum number of failures in that case is 7.
So the failures to tolerate would be: count_of_disabled_hosts (* these already "failed" *) + ((count_of_enabled_hosts - 1) / 2).
Otherwise, if you want to shut down hosts (which implies increasing the host failures you tolerate if you're already at the maximum), you'll be told that you can't shut down more hosts because you have reached the limit of HA failures to tolerate.
Calculating this would need to be done while holding the clustering lock with Xapi_clustering.with_clustering_lock_if_needed, and every time we enable or disable a cluster host we will need to update the HA plan. To avoid deadlock the function in xapi_ha_vm_failover.ml won't take any locks, but all its callers will have to (Xapi_cluster_host.enable/disable already does; locks need to be added to other callers).
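A minimal sketch of the suggested calculation, assuming the counts of enabled and disabled cluster hosts are gathered by the caller under the clustering lock (the function name and parameters here are illustrative, not part of the PR):

(* Sketch only: [enabled] and [disabled] are counts of cluster hosts, assumed
   to be computed by callers holding Xapi_clustering.with_clustering_lock_if_needed. *)
let corosync_max_host_failures ~enabled ~disabled =
  (* Disabled hosts count as already failed; of the enabled hosts, strictly
     fewer than half may fail while corosync quorum is maintained. *)
  disabled + ((enabled - 1) / 2)

For the 10-host example above: 5 + ((5 - 1) / 2) = 7.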
So the check we need is to see if the host is down and clustering is disabled? Or can we just check if clustering is disabled?
You can check whether the associated Cluster_host is enabled or disabled. There is a helper, Xapi_clustering.is_clustering_disabled_on_host, that you can use.
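A hypothetical usage sketch, assuming the helper takes ~__context and a host ref and returns true when clustering is disabled for that host:

(* Assumption: Xapi_clustering.is_clustering_disabled_on_host ~__context host returns a bool *)
let counts_as_already_failed ~__context host =
  Xapi_clustering.is_clustering_disabled_on_host ~__context host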
OK, I will need to make some additional updates to the unit test framework to get this to work, because currently we don't enable clustering on all hosts.
let corosync_ha_max_hosts = ((total_hosts - 1) / 2) in
let nhosts = match Db.Cluster.get_all ~__context with
  | [] -> total_hosts
  | _ -> corosync_ha_max_hosts in
Could you make this call a function in Xapi_clustering.ml that calculates the failures to tolerate? As mentioned above, it needs to know about enabled/disabled hosts.
{memory_total = gib 256L; name_label = "slave1"; vms = []};
{memory_total = gib 256L; name_label = "slave2"; vms = []}
];
ha_host_failures_to_tolerate = 3L;
Is this right? I think corosync could only tolerate the loss of one host in a 3-host pool.
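With the quorum formula from the diff above, a 3-host pool gives (3 - 1) / 2 = 1 tolerated host failure.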
This is set as part of the test setup but is overridden when we call the compute HA failover function. We are expecting 1 host here (as per the current algorithm), which is the number three lines below. I will add a comment to make it clearer what is going on, as this framework makes things a bit obscure.
slaves = [
  {memory_total = gib 256L; name_label = "slave1"; vms = []}
];
ha_host_failures_to_tolerate = 2L;
If your pool only has 2 hosts, how can you lose them all? I think you need some actual VMs for the tests to make sense; indeed, if you have no VMs you might as well turn the whole pool off, but that's not a realistic scenario.
See comment above. This number is inserted into the database as part of the setup but is overridden when we call the compute function. For 2 hosts we currently expect to tolerate 0 host failures, as per the expected result three lines below.
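For reference, the quorum formula gives (2 - 1) / 2 = 0 tolerated failures for a 2-host pool once clustering is enabled.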
I can't seem to respond to your comments on the setting of the ha_host_failures value, so commenting here instead:
This is set as part of the test setup but is overridden when we call the compute HA failover function. We are expecting 1 host here (as per the current algorithm), which is the number three lines below. I will add a comment to make it clearer what is going on, as this framework makes things a bit obscure.
{memory_total = gib 256L; name_label = "slave1"; vms = []};
{memory_total = gib 256L; name_label = "slave2"; vms = []}
];
ha_host_failures_to_tolerate = 3L;
This is set as part of the test setup but is overridden when we call the compute HA failover function. We are expecting 1 host here (as per the current algorithm), which is the number three lines below. I will add a comment to make it clearer what is going on, as this framework makes things a bit obscure.
slaves = [
  {memory_total = gib 256L; name_label = "slave1"; vms = []}
];
ha_host_failures_to_tolerate = 2L;
See comment above. This number is inserted into the database as part of the setup but is overridden when we call the compute function. For 2 hosts we currently expect to tolerate 0 host failures, as per the expected result three lines below.
{memory_total = gib 256L; name_label = "slave1"; vms = []}
];
ha_host_failures_to_tolerate = 2L;
cluster = 1;
Updated so that cluster now dictates how many hosts to enable clustering on.
I think this looks correct now; please squash the commits.
Force-pushed from 7a7e3ef to 49ef424.
… unit tests
Signed-off-by: Thomas Mckelvey <thomas.mckelvey@citrix.com>