CA-287343: Update HA failure tolerance plan for corosync/GFS2 and add… #3560
Conversation
I think the calculation in xapi_ha_vm_failover.ml is good if we don't consider cleanly shut down hosts.
The tests don't seem to exercise the formula, because they allow more than half of the hosts to be down; I think we need some actual VMs in the test scenarios to test the corosync formula.
ocaml/xapi/xapi_ha_vm_failover.ml
Outdated
let nhosts = List.length (Db.Host.get_all ~__context) in
let total_hosts = List.length (Db.Host.get_all ~__context) in
(* For corosync HA less than half of the pool can fail whilst maintaining quorum *)
let corosync_ha_max_hosts = ((total_hosts - 1) / 2) in
I think that for corosync we need to consider losing only the enabled hosts. If you have 10 hosts, with 5 of them cleanly shut down, we can tolerate the loss of 2 additional hosts, so the maximum number of failures in that case is 7.
So the failures to tolerate would be: count_of_disabled_hosts (* these already "failed" *) + ((count_of_enabled_hosts - 1) / 2).
Otherwise, if you want to shut down hosts (which implies increasing the host failures you tolerate if you're already at the maximum), you'll be told that you can't shut down more hosts because you have reached the limit of HA failures to tolerate.
Calculating this would need to be done while holding the clustering lock with Xapi_clustering.with_clustering_lock_if_needed, and every time we enable or disable a cluster host we will need to update the HA plan. To avoid deadlock the function in xapi_ha_vm_failover.ml won't take any locks, but all its callers will have to (Xapi_cluster_host.enable/disable already does; locks need to be added to other callers).
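A minimal sketch of the suggested calculation, assuming the counts of enabled and disabled cluster hosts are gathered by the caller under the clustering lock (the function name and parameters here are illustrative, not part of the PR):

(* Sketch only: [enabled] and [disabled] are counts of cluster hosts, assumed
   to be computed by callers holding Xapi_clustering.with_clustering_lock_if_needed. *)
let corosync_max_host_failures ~enabled ~disabled =
  (* Disabled hosts count as already failed; of the enabled hosts, strictly
     fewer than half may fail while corosync quorum is maintained. *)
  disabled + ((enabled - 1) / 2)

For the 10-host example above: 5 + ((5 - 1) / 2) = 7.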
So the check we need is to see if the host is down and clustering is disabled? Or can we just check if clustering is disabled?
You can check whether the associated Cluster_host is enabled or disabled. There is a helper, Xapi_clustering.is_clustering_disabled_on_host, that you can use.
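A hypothetical usage sketch, assuming the helper takes ~__context and a host ref and returns true when clustering is disabled for that host:

(* Assumption: Xapi_clustering.is_clustering_disabled_on_host ~__context host returns a bool *)
let counts_as_already_failed ~__context host =
  Xapi_clustering.is_clustering_disabled_on_host ~__context host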
OK, I will need to make some additional updates to the unit test framework to get this to work, because currently we don't enable clustering on all hosts.
let corosync_ha_max_hosts = ((total_hosts - 1) / 2) in
let nhosts = match Db.Cluster.get_all ~__context with
  | [] -> total_hosts
  | _ -> corosync_ha_max_hosts in
Could you make this call a function in Xapi_clustering.ml that calculates the failures to tolerate? As mentioned above, it needs to know about enabled/disabled hosts.
{memory_total = gib 256L; name_label = "slave1"; vms = []};
{memory_total = gib 256L; name_label = "slave2"; vms = []}
];
ha_host_failures_to_tolerate = 3L;
Is this right? I think corosync could only tolerate the loss of one host in a 3-host pool.
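With the quorum formula from the diff above, a 3-host pool gives (3 - 1) / 2 = 1 tolerated host failure.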
This is set as part of the test setup but is overridden when we call the compute HA failover function. We are expecting 1 host here (as per the current algorithm), which is the number three lines below. I will add a comment to make it clearer what is going on, as this framework makes things a bit obscure.
slaves = [
  {memory_total = gib 256L; name_label = "slave1"; vms = []}
];
ha_host_failures_to_tolerate = 2L;
If your pool only has 2 hosts, how can you lose them all? I think you need some actual VMs for the tests to make sense; indeed, if you have no VMs you might as well turn the whole pool off, but that's not a realistic scenario.
See comment above. This number is inserted into the database as part of the setup but is overridden when we call the compute function. For 2 hosts we currently expect to tolerate 0 host failures, as per the expected result three lines below.
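For reference, the quorum formula gives (2 - 1) / 2 = 0 tolerated failures for a 2-host pool once clustering is enabled.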
I can't seem to respond to your comments on the setting of the ha_host_failures value, so commenting here instead:
This is set as part of the test setup but is overridden when we call the compute HA failover function. We are expecting 1 host here (as per the current algorithm), which is the number three lines below. I will add a comment to make it clearer what is going on, as this framework makes things a bit obscure.
{memory_total = gib 256L; name_label = "slave1"; vms = []};
{memory_total = gib 256L; name_label = "slave2"; vms = []}
];
ha_host_failures_to_tolerate = 3L;
This is set as part of the test setup but is overridden when we call the compute HA failover function. We are expecting 1 host here (as per the current algorithm), which is the number three lines below. I will add a comment to make it clearer what is going on, as this framework makes things a bit obscure.
slaves = [
  {memory_total = gib 256L; name_label = "slave1"; vms = []}
];
ha_host_failures_to_tolerate = 2L;
See comment above. This number is inserted into the database as part of the setup but is overridden when we call the compute function. For 2 hosts we currently expect to tolerate 0 host failures, as per the expected result three lines below.
{memory_total = gib 256L; name_label = "slave1"; vms = []}
];
ha_host_failures_to_tolerate = 2L;
cluster = 1;
Updated so that cluster now dictates how many hosts to enable clustering on.
I think this looks correct now; please squash the commits.
Force-pushed from 7a7e3ef to 49ef424.
… unit tests
Signed-off-by: Thomas Mckelvey <thomas.mckelvey@citrix.com>