Conversation

thomasmck

… unit tests

Signed-off-by: Thomas Mckelvey <thomas.mckelvey@citrix.com>

@coveralls

coveralls commented Apr 12, 2018

Coverage Status

Coverage increased (+0.03%) to 20.385% when pulling 49ef424 on thomasmck:private/thomasmc/max_failures into 3df3276 on xapi-project:master.

Contributor

@edwintorok edwintorok left a comment

I think the calculation in xapi_ha_vm_failover.ml is good if we don't consider cleanly shut down hosts.
The tests don't seem to exercise the formula, because they allow more than half of the hosts to be down; I think we need some actual VMs in the test scenarios to test the corosync formula.

let nhosts = List.length (Db.Host.get_all ~__context) in
let total_hosts = List.length (Db.Host.get_all ~__context) in
(* For corosync HA less than half of the pool can fail whilst maintaining quorum *)
let corosync_ha_max_hosts = ((total_hosts - 1) / 2) in
Contributor

I think that for corosync we need to look at losing only the enabled hosts. If you have 10 hosts, with 5 of them cleanly shut down, we can tolerate the loss of 2 additional hosts, so the maximum number of failures in that case is 7.
So the failures to tolerate would be: count_of_disabled_hosts (* these already "failed" *) + ((count_of_enabled_hosts - 1) / 2).

Otherwise, if you want to shut down hosts (which implies increasing the host failures you tolerate if you're already at the maximum), you'll be told that you can't shut down more hosts because you have reached the limit of HA failures to tolerate.

Calculating this would need to be done while holding the clustering lock with Xapi_clustering.with_clustering_lock_if_needed, and every time we enable or disable a cluster host we will need to update the HA plan. To avoid deadlock, the function in xapi_ha_vm_failover.ml won't take any locks, but all its callers will have to (Xapi_cluster_host.enable/disable already does; locks need to be added to the other callers).
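
A minimal sketch of the arithmetic proposed above, assuming the hosts have already been partitioned into enabled and disabled lists (the function name and parameters here are illustrative, not the PR's code):

(* Hedged sketch of the proposed formula, not the PR's implementation. *)
let corosync_max_host_failures ~enabled_hosts ~disabled_hosts =
  let n_disabled = List.length disabled_hosts in
  let n_enabled = List.length enabled_hosts in
  (* cleanly disabled hosts have already "failed"; of the enabled hosts,
     fewer than half may be lost while corosync keeps quorum *)
  n_disabled + ((n_enabled - 1) / 2)

(* Worked example from the comment above: 10 hosts, 5 cleanly shut down,
   gives 5 + (5 - 1) / 2 = 7 tolerated failures. *)
let () =
  assert (corosync_max_host_failures ~enabled_hosts:[1; 2; 3; 4; 5]
            ~disabled_hosts:[6; 7; 8; 9; 10] = 7)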

Author

So is the check we need to see whether the host is down and clustering is disabled, or can we just check whether clustering is disabled?

Contributor

You can check whether the associated Cluster_host is enabled or disabled. There is a helper in Xapi_clustering.is_clustering_disabled_on_host that you can use.

Author

OK, I will need to make some additional updates to the unit test framework to get this to work, because currently we don't enable clustering on all hosts.

let corosync_ha_max_hosts = ((total_hosts - 1) / 2) in
let nhosts = match Db.Cluster.get_all ~__context with
| [] -> total_hosts
| _ -> corosync_ha_max_hosts in
Contributor

Could you turn this into a call to a function in Xapi_clustering.ml that calculates the failures to tolerate? As mentioned above, it needs to know about the enabled/disabled hosts.
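
One possible shape for such a helper in Xapi_clustering.ml, sketched under the assumption that is_clustering_disabled_on_host takes ~__context and a host reference and returns a bool; the function name and exact signatures are assumptions, not the merged code:

(* Hedged sketch only; names and signatures are assumptions. As discussed
   above, this function takes no locks itself; callers would hold the
   clustering lock via with_clustering_lock_if_needed. *)
let compute_corosync_max_host_failures ~__context =
  let hosts = Db.Host.get_all ~__context in
  let disabled, enabled =
    List.partition (fun h -> is_clustering_disabled_on_host ~__context h) hosts
  in
  (* disabled hosts already count as failed; fewer than half of the enabled
     hosts may additionally fail while corosync keeps quorum *)
  List.length disabled + ((List.length enabled - 1) / 2)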

{memory_total = gib 256L; name_label = "slave1"; vms = []};
{memory_total = gib 256L; name_label = "slave2"; vms = []}
];
ha_host_failures_to_tolerate = 3L;
Contributor

Is this right? I think corosync could only tolerate the loss of one host in a 3-host pool.

Author

This is set as part of the test setup but is overridden when we call the compute HA failover function. We are expecting 1 host here (as per the current algorithm), which is the number three lines below. I will add a comment to make it clearer what is going on, since this framework makes things a bit obscure.

slaves = [
{memory_total = gib 256L; name_label = "slave1"; vms = []}
];
ha_host_failures_to_tolerate = 2L;
Contributor

If your pool only has 2 hosts, how can you lose them all? I think you need some actual VMs for the tests to make sense; indeed, if you have no VMs you might as well turn the whole pool off, but that's not a realistic scenario.

Author

See comment above. This number is inserted into the database as part of the setup but is overridden when we call the compute function. For 2 hosts we are currently expecting to tolerate 0 host failures, as per the expected result three lines below.

Author

@thomasmck thomasmck left a comment

I can't seem to respond to your comments on the setting of the ha_host_failures value, so I'm commenting here instead:
This is set as part of the test setup but is overridden when we call the compute HA failover function. We are expecting 1 host here (as per the current algorithm), which is the number three lines below. I will add a comment to make it clearer what is going on, since this framework makes things a bit obscure.

{memory_total = gib 256L; name_label = "slave1"; vms = []};
{memory_total = gib 256L; name_label = "slave2"; vms = []}
];
ha_host_failures_to_tolerate = 3L;
Author

This is set as part of the test setup but is overridden when we call the compute HA failover function. We are expecting 1 host here (as per the current algorithm), which is the number three lines below. I will add a comment to make it clearer what is going on, since this framework makes things a bit obscure.

slaves = [
{memory_total = gib 256L; name_label = "slave1"; vms = []}
];
ha_host_failures_to_tolerate = 2L;
Author

See comment above. This number is inserted into the database as part of the setup but is overridden when we call the compute function. For 2 hosts we are currently expecting to tolerate 0 host failures, as per the expected result three lines below.

{memory_total = gib 256L; name_label = "slave1"; vms = []}
];
ha_host_failures_to_tolerate = 2L;
cluster = 1;
Author

Updated so that cluster now dictates how many hosts to enable clustering on.
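
A hypothetical sketch of that idea, just to make the mechanism concrete (none of these names come from the actual test framework; enable stands in for whatever the setup uses to create an enabled Cluster_host):

(* Hypothetical sketch: enable clustering on the first [cluster] hosts of the
   scenario and leave the remaining hosts without clustering. *)
let setup_clustering ~enable ~hosts ~cluster =
  List.iteri (fun i host -> if i < cluster then enable host) hosts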

@edwintorok
Contributor

edwintorok commented Apr 16, 2018

I think this looks correct now; please squash the commits.

… unit tests

Signed-off-by: Thomas Mckelvey <thomas.mckelvey@citrix.com>
@thomasmck thomasmck force-pushed the private/thomasmc/max_failures branch from 7a7e3ef to 49ef424 on April 17, 2018 10:04
@thomasmck thomasmck changed the title from "WIP) CA-287343: Update HA failure tolerance plan for corosync/GFS2 and add…" to "CA-287343: Update HA failure tolerance plan for corosync/GFS2 and add…" on Apr 17, 2018
@edwintorok edwintorok merged commit 947da4c into xapi-project:master Apr 17, 2018