[BOSTON] Fix data races in HA when slaves return in the pool. #1514

Closed

@jeromemaloberti
Contributor

Backport of #1387 for Boston-lcm.

In HA, during recovery of failed hosts, there were data races caused by conflicts between the liveness information reported for a returning host and its live status as determined by the xha daemon.
These commits fix the problems found when Xapi_hooks.host_post_declare_dead, which is called on every dead host, took a few minutes to complete, leaving a large window for the dead hosts to come back into the pool.

jeromemaloberti added some commits Jul 25, 2013
@jeromemaloberti jeromemaloberti CA-111637: Change PBD.plug to be a localhost operation.
When a slave host that rebooted returns to the pool,
PBD.plug fails if the slave is not yet set as alive.
The operation is performed on the master anyway.

Signed-off-by: Jerome Maloberti <jerome.maloberti@citrix.com>
dd220c4
@jeromemaloberti jeromemaloberti CA-111637: Reorder Hosts and VMs recovery after a host crash in HA.
When some hosts are considered dead in HA, restart_auto_run_vms was
processing them as follows:
 - for each dead host
   - List all resident VMs
   - Host.set_live=false
   - call Xapi_hooks.host_post_declare_dead which can be very long
   - set all resident VMs to `Halted (including Control Domain)
This process conflicted with db_sync if a host had the bad taste
of coming back to life while restart_auto_run_vms was stuck in
host_post_declare_dead.
This commit reorders the actions to put the shortest first:
 - for each dead host
   - set all resident VMs excluding Control Domain to `Halted
   - Host.set_live=false
 - for each dead host
   - call Xapi_hooks.host_post_declare_dead

Signed-off-by: Jerome Maloberti <jerome.maloberti@citrix.com>
0777851
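
The reordering in the commit above can be illustrated with a small model. This is a sketch in Python rather than xapi's OCaml, and every name in it (`VM`, `Host`, `declare_dead`, `slow_hook`) is hypothetical. It only shows the two-pass structure: the quick state updates run for every dead host before any potentially slow hook is called.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    is_control_domain: bool = False
    power_state: str = "Running"


@dataclass
class Host:
    name: str
    live: bool = True
    resident_vms: list = field(default_factory=list)


def declare_dead(dead_hosts, slow_hook):
    """Two-pass recovery: fast state updates first, slow hooks last.

    slow_hook stands in for Xapi_hooks.host_post_declare_dead (assumption:
    modelled here as a plain callable taking the host).
    """
    # Pass 1: short operations only, so the window in which a returning
    # host can race with these updates stays small.
    for host in dead_hosts:
        for vm in host.resident_vms:
            if not vm.is_control_domain:
                vm.power_state = "Halted"
        host.live = False
    # Pass 2: the potentially long-running hook, called once every dead
    # host has already been marked.
    for host in dead_hosts:
        slow_hook(host)
```

With this ordering, the hook for the first host already sees every dead host marked as not live, however long the hook itself takes.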
@jeromemaloberti jeromemaloberti CA-111637: Use the HA live_set when computing a restart plan for VMs.
Previously, after a host failure in HA, the function
Xapi_ha_vm_failover.compute_restart_plan, which chooses which hosts
should restart the failed VMs, would pick up hosts that are live and
enabled. In some cases, a host may be live but not in the HA
live_set, for example if it returned to the pool before the HA
recovery process finished.
This situation is bad since the live host would have new VMs
started on it, while it may later be marked as dead once the
HA recovery process finishes.
This commit adds the live_set parameter to compute_restart_plan
and to all functions that need it by transitivity.
Some functions are called during HA recovery, where the
live_set is available, and others at startup, in which case the
live_set is created from all live and enabled hosts.

Signed-off-by: Jerome Maloberti <jerome.maloberti@citrix.com>
ee80100
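
As a rough illustration of the live_set idea, the sketch below models host selection in Python (not xapi's OCaml). The function name `candidate_hosts`, the dictionary layout, and the exact filtering rule (live, enabled, and in the live_set) are assumptions made for the example; the real compute_restart_plan does considerably more than this.

```python
def candidate_hosts(hosts, live_set=None):
    """Return the names of hosts eligible to restart failed VMs.

    hosts: dict mapping host name -> {"live": bool, "enabled": bool}
    live_set: set of host names reported live by HA, or None at startup.
    """
    live_and_enabled = {
        name for name, h in hosts.items() if h["live"] and h["enabled"]
    }
    if live_set is None:
        # Startup: no HA live_set is available, so build one from all
        # live and enabled hosts.
        live_set = live_and_enabled
    # Recovery: a host that is live in the database but absent from the
    # HA live_set is excluded.
    return sorted(live_and_enabled & set(live_set))
```

A host that came back before recovery finished would be live and enabled in the database yet missing from the live_set, so it is not offered any restarted VMs.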
@jeromemaloberti jeromemaloberti CA-111637: Remove set_live in pool.hello in HA.
In HA, the host live value can differ from the liveset determined
by xha, allowing a user to start a VM on a host that has just come back
into a pool but whose recovery process has not finished.
This commit fixes the problem by forbidding the live value from being
changed outside of the HA recovery process; the HA liveset becomes the
only liveness state of a host in HA.

Signed-off-by: Jerome Maloberti <jerome.maloberti@citrix.com>
18f7c2c
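
The rule introduced by the commit above can be modelled as a simple guard. The sketch is Python, not xapi's OCaml, and `HostRecord`, `set_live`, and the `from_ha_recovery` flag are hypothetical names chosen for illustration; how xapi actually distinguishes the callers is not shown on this page.

```python
class HostRecord:
    def __init__(self, name, live=False):
        self.name = name
        self.live = live


def set_live(host, value, ha_enabled, from_ha_recovery=False):
    """Update host.live unless HA owns the liveness state.

    Returns True if the value was applied, False if the request was ignored.
    """
    if ha_enabled and not from_ha_recovery:
        # With HA on, only the recovery process may change liveness, so a
        # request coming from elsewhere (for example pool.hello) is ignored.
        return False
    host.live = value
    return True
```

Under this model a returning slave stays marked as not live until HA recovery itself flips the value, while pools without HA behave as before.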
@jonludlam jonludlam added the boston label Jul 28, 2014
@simonjbeaumont simonjbeaumont removed the boston label Feb 25, 2015
@simonjbeaumont
Collaborator

Replaced with xenserver#1.
