Skip to content

Conversation

mcintyre94
Copy link
Contributor

** Do not merge before #3470, that adds an mli file that I need to make a change to here, will need to fix after rebasing after that merges**. This builds because that mli file doesn't exist on master, once it's there I'll need to add the change to it. It's added over a few commits in #3470 though, more straightforward to rebase later.

Again, adding this one despite the rebase and small fix needed because it's significant changes and would appreciate feedback! Needed a fair bit of manual tweaking around the pbd.ml (untangling clustering related changes from the bug fixes) and the datamodel due to the split.

  • CA-274585: allow unplugging statefile VDI
  • CA-274585: unplug all local PBDs on shutdown/reboot
  • CP-24693: Tasks.with_tasks_destroy: add a function that waits for tasks with a timeout
  • CA-277346: do not get stuck detaching metadata VDI
  • CA-277346: stop (DB) requests when shutting down the master
  • Convert host evacuation script into ocaml
  • Remove scripts/examples/python/shutdown.py
  • Don't do prechecks in Host.prepare_for_poweroff

@mcintyre94 mcintyre94 force-pushed the CP-26970-master-shutdown branch from 5ca00cb to b769833 Compare February 23, 2018 17:55
@mcintyre94
Copy link
Contributor Author

Removed 'Remove scripts/examples/python/shutdown.py' - that's self-contained and will require a synchronised xenserver-specs PR, we'll make it separately later.

@coveralls
Copy link

coveralls commented Feb 23, 2018

Coverage Status

Coverage decreased (-0.02%) to 18.161% when pulling b5cded1 on mcintyre94:CP-26970-master-shutdown into abe969c on xapi-project:master.

@lindig
Copy link
Contributor

lindig commented Feb 27, 2018

Is this ready to go? Do we need to clean up the history a bit?

@mcintyre94 mcintyre94 force-pushed the CP-26970-master-shutdown branch from b769833 to 36d113b Compare February 27, 2018 15:17
@edwintorok
Copy link
Contributor

Tidied up the commits a bit, and we've reviewed and tested this in the feature branch. Might be useful if someone else takes a quick look as this affects the shutdown code of XAPI and is not feature flagged.

SM has implemented various workarounds (like introducing another service that gets run after xapi is shutdown and unmounts the SR) for unmounts getting stuck on shutdown.
With clustered filesystems XAPI really needs to ensure that it unplugs the PBDs before stopping clustering though, otherwise you either delay other hosts until the fencing timeout expires everytime you shutdown a host, and other hosts will treat the shut down host as crashed for the purposes of quorum calculation which is not what we wanted.
If we cleanly unmount and leave the cluster on shutdown that is a better user experience, unifies the shutdown behaviour (regardless if you do xe host-shutdown or /sbin/shutdown you execute the same code), and it removes the need for some of the workaround in SM, hence this PR is not feature flagged.

@edwintorok
Copy link
Contributor

edwintorok commented Feb 27, 2018

Just noticed that its missing this commit: 4f497c8
(edit) added the relevant parts of that commit, it actually simplifies the code (it removes the need for unplug_inner ~soft:true, makes no change to the unplug code at all).

(public_name xapi-client)
(libraries (
mtime
mtime.clock.os
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to update the opam file as well?

let tasks_str = tasks |> List.map Ref.really_pretty_and_small |> String.concat "," in
D.info "Waiting for tasks timed out on %s" tasks_str;
false
| _ ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should run ocp-indent on this file after these changes.

@mseri
Copy link
Contributor

mseri commented Feb 28, 2018

ensure_no_vms contains a few XXX comments. Do we need to do anything about that?

@edwintorok
Copy link
Contributor

edwintorok commented Feb 28, 2018

You can still unplug the PBDs when the machines are suspended, I've checked that, and fixed the handling of paused VMs.

@edwintorok edwintorok force-pushed the CP-26970-master-shutdown branch from b621150 to 0d5c7db Compare February 28, 2018 15:19
@mseri
Copy link
Contributor

mseri commented Feb 28, 2018

Please double check. Aside that, the current fixup version looks good to me.

@edwintorok
Copy link
Contributor

edwintorok commented Feb 28, 2018

Might be useful to cancel the tasks as the python script did (the other XXX comment), e.g. if you were attempting to migrate a VM you probably can't hard shutdown it, will try to fix that.

@mseri
Copy link
Contributor

mseri commented Feb 28, 2018

Thanks

@edwintorok
Copy link
Contributor

The shutdown.py script used current_operations field, did the same in my fixup. Unfortunately the type here is a string to enum map instead of task ref to enum map and we can't change that in the datamodel without breaking compatibility with existing API users.
Might be useful if there was a way to use the correct type internally at least, without breaking the external API.

Copy link
Contributor

@mseri mseri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@edwintorok edwintorok force-pushed the CP-26970-master-shutdown branch from a1aca01 to ac5ad51 Compare February 28, 2018 16:18
@edwintorok
Copy link
Contributor

Thanks, squashed the fixups.

@mseri
Copy link
Contributor

mseri commented Feb 28, 2018

Btw, I saw your discussion on slack and I was looking at the docs and the code. What you do is not different from how we use the current operations in other parts of xapi. It is unfortunate but we cannot really do much about it with a reasonable effort.

@mseri
Copy link
Contributor

mseri commented Feb 28, 2018

Feel free to merge when appropriate

@lindig
Copy link
Contributor

lindig commented Mar 1, 2018

This needs to be merged after #3470.

@mseri
Copy link
Contributor

mseri commented Mar 1, 2018

I am half way through that PR

@edwintorok
Copy link
Contributor

I've pushed a fixup for the indentation, I think that was the only outstanding action on this PR.

@mseri
Copy link
Contributor

mseri commented Mar 1, 2018

Yes, I think this is fine. Once #3470 is ready and merged, we can merge this as well. Feel free to squash the fixup commits

If the HA daemon is not running it is safe to unplug the statefile VDI.
Need to allow unplugging the statefile VDI if we want to unplug all PBDs on
shutdown.

Note: this fixes shutdown on HA slaves, but not on HA pool master.
There the metadata VDI will stay plugged: CA-276993.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
We need to unplug the PBDs before shutdown/reboot, while we still have
functional networking.

We should not rely only on systemd to unmount the filesystems, because
SM would also expect PBD.unplug to get called on clean shutdown/reboot.
For some SRs (like GFS2) we also need to perform additional cleanup operations
that can only be done after all the SRs are unplugged.

The unplug must be done after HA is disabled, otherwise the statefile might
still be using it.

There are 3 ways to shutdown:
 * stop xapi-domains.service, xe host-disable, xe host-shutdown
 * stop from XC which shuts down VMs, and then does the host shutdown
 * xe host-shutdown directly with VMs running (not prevented, but the docs say you shouldn't do this)

Call the unplug in both places, this is a best effort unplug, i.e.
unplug as many PBDs as we can and ignore errors.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
We were trying to detach the metadata VDI while it was still in use.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Slaves might still be trying to write to the master DB via RPC.
If we turned off HA and the redo log in order to detach the static VDIs and
unplug all PBDs then we must not allow more writes to the DB.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
…timeout

For CP-24693.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>

Changes made to cherry-pick: Part of the xapi_pbd.ml change was clustering-specific/dependent, will have to go in separately
update Stdext to Xapi_stdext_pervasives

Signed-off-by: Callum McIntyre <callum.mcintyre@citrix.com>
@edwintorok edwintorok force-pushed the CP-26970-master-shutdown branch from 7ab9c94 to 10d7fbb Compare March 1, 2018 18:09
I've copied the code written by @edwintorok in an older PR into
vm_evacuation.ml.

Changes made for cherry-pick:
- Removed the change to ocaml/xapi/xapi_host_helpers.mli, will need to be fixed on a rebase (added in the iscsi changes)
- Tweaked the datamodel changes to put them in datamodel_host.ml

Don't do prechecks in Host.prepare_for_poweroff

At that point we've already reached to point of no return: it's best if
we carry on to evacuate and cleanly shut down our VMs and HA.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Signed-off-by: Callum McIntyre <callum.mcintyre@citrix.com>
Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
@edwintorok edwintorok force-pushed the CP-26970-master-shutdown branch from 10d7fbb to b5cded1 Compare March 1, 2018 18:09
@edwintorok edwintorok merged commit 85f128a into xapi-project:master Mar 1, 2018
@edwintorok edwintorok deleted the CP-26970-master-shutdown branch March 1, 2018 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants