
Conversation

@minishrink (Contributor) commented Jun 12, 2018:

This PR aims to make Cluster.pool_force_destroy more robust by implementing the following changes (a rough sketch of the resulting flow follows the list):

  • Cluster_host.forget deletes the cluster_host if successful
  • Cluster.pool_force_destroy now does the following:
    • Cluster_host.destroy (best-effort) on all cluster_hosts, with successive rounds of Cluster_host.force_destroy, Cluster_host.forget, and finally database deletion of surviving cluster_hosts
    • raise an API error if any cluster_hosts survive
  • Prevent new cluster_hosts from joining a cluster after attempting a pool_destroy
  • Cluster.pool_destroy no longer requires a cluster_host on pool master
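
A rough sketch of that flow, for orientation only: the helper structure, the error-code name, and the final Db calls are assumptions of this sketch, not the PR's actual code.

let pool_force_destroy ~__context ~self =
  let surviving () = Db.Cluster.get_cluster_hosts ~__context ~self in
  (* Run one best-effort round of [op] over every surviving cluster_host *)
  let round name op =
    List.iter
      (fun ch ->
         try op ch
         with e ->
           debug "%s failed on cluster_host %s: %s"
             name (Ref.string_of ch) (ExnHelper.string_of_exn e))
      (surviving ())
  in
  round "destroy"       (fun ch -> Xapi_cluster_host.destroy ~__context ~self:ch);
  round "force_destroy" (fun ch -> Xapi_cluster_host.force_destroy ~__context ~self:ch);
  round "forget"        (fun ch -> Xapi_cluster_host.forget ~__context ~self:ch);
  (* Last resort: drop the survivors from the database *)
  round "db delete"     (fun ch -> Db.Cluster_host.destroy ~__context ~self:ch);
  match surviving () with
  | [] -> Db.Cluster.destroy ~__context ~self
  | _  -> (* error code name assumed for illustration *)
    raise Api_errors.(Server_error (cluster_force_destroy_failed, [ Ref.string_of self ]))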

@coveralls commented:

Coverage decreased (-0.003%) to 20.81% when pulling bef0e1d on minishrink:feature/REQ477/CP-28477 into 27c921f on xapi-project:feature/REQ477/master.

let cluster = forward_cluster_op ~__context ~local_fn
    (fun session_id rpc ->
       Client.Cluster.create rpc session_id pIF cluster_stack pool_auto_join token_timeout token_timeout_coefficient
    ) in
Contributor commented:

I know I've said we should add this to all operations in the Cluster module (except the pool ones), but now that I see it I realise it's redundant for create: you cannot have an existing cluster when you call create (it will fail if you do), so we do not need to redirect it anywhere.

We do need to redirect destroy, of course.
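
For destroy, the redirect would mirror the create call above; a minimal sketch, assuming the usual message_forwarding Local/Client pattern:

let destroy ~__context ~self =
  let local_fn = Local.Cluster.destroy ~self in
  forward_cluster_op ~__context ~local_fn
    (fun session_id rpc -> Client.Cluster.destroy rpc session_id self)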


module Cluster = struct

  let forward_cluster_op ~local_fn ~__context op =
Contributor commented:

Minor: the commit title should say something like: forward Cluster.destroy to a cluster member, localhost if possible.
The goal is to forward to a host that is a member of the cluster; localhost is only an optimisation.
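
A hedged sketch of that host choice (choose_cluster_member is an illustrative name; find_cluster_host is the helper used later in this PR):

(* Forward to a cluster member, preferring localhost to save a hop *)
let choose_cluster_member ~__context ~self =
  let localhost = Helpers.get_localhost ~__context in
  match Xapi_clustering.find_cluster_host ~__context ~host:localhost with
  | Some _ -> localhost
  | None ->
    (match Db.Cluster.get_cluster_hosts ~__context ~self with
     | ch :: _ -> Db.Cluster_host.get_host ~__context ~self:ch
     | [] -> failwith "no cluster members left to forward to")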

info "No cluster_hosts found, cannot run Cluster.destroy without force"
| [ cluster_host ] -> begin
assert_cluster_host_has_no_attached_sr_which_requires_cluster_stack ~__context ~self:cluster_host;
Xapi_cluster_host.force_destroy ~__context ~self:cluster_host;
@edwintorok (Contributor) commented Jun 12, 2018:

Actually, another possible way of implementing this is not to do the forwarding in message_forwarding but to use call_api_functions and call Client.Cluster_host.force_destroy instead. It is usually better if forwarding is done inside message_forwarding, though, so this is fine. If the last host is down, it will have to be declared dead before we are able to destroy the Cluster.
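
The call_api_functions alternative mentioned here would look roughly like this (a sketch only):

(* Call through the API instead of forwarding in message_forwarding *)
Helpers.call_api_functions ~__context (fun rpc session_id ->
    Client.Client.Cluster_host.force_destroy ~rpc ~session_id ~self:cluster_host)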

let uuid = Client.Client.Cluster_host.get_uuid ~rpc ~session_id ~self:cluster_host in
debug "Ignoring exception while trying to force destroy cluster host %s: %s" uuid (ExnHelper.string_of_exn e);
Db.Cluster_host.destroy ~__context ~self:cluster_host;
debug "Cluster_host %s destroyed" uuid;
Contributor commented:

The pool* APIs have to be just convenience wrappers around the low-level APIs; they must not manipulate the DB directly.
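
That is, the direct Db.Cluster_host.destroy above would become an API call, so the low-level operation owns its own DB update; a sketch (using forget per the later discussion):

(* Go through the API; do not touch the DB from the pool wrapper *)
Helpers.call_api_functions ~__context (fun rpc session_id ->
    Client.Client.Cluster_host.forget ~rpc ~session_id ~self:cluster_host)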

| [ cluster_host ] -> Some (cluster_host)
match cluster_hosts with
| [] ->
info "No cluster_hosts found, cannot run Cluster.destroy without force"
@edwintorok (Contributor) commented Jun 12, 2018:

I don't think the rest of the changes in this commit are needed: if there is no cluster host, we should just succeed in destroying the Db object as part of Cluster.destroy, as we did before. With this commit, Cluster.destroy succeeds without actually deleting the object from the DB, which is not good.

Instead of this commit, we should probably disable the auto-join flag on the Cluster object as soon as we start pool-destroy/pool-force-destroy (so other nodes will not attempt to join any more). I'll update the CP ticket.
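
A sketch of that suggestion, assuming the generated setter for the Cluster object's pool_auto_join field (the pool_destroy_common name comes from the later commits in this PR):

(* First step of any pool*_destroy: stop new cluster_hosts joining *)
let pool_destroy_common ~__context ~self =
  Db.Cluster.set_pool_auto_join ~__context ~self ~value:false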

Contributor commented:

And in Cluster.pool_force_destroy we should try to call Xapi_cluster_host.forget on all the remaining cluster hosts, after we've tried destroy and force_destroy on them and they have failed.

@edwintorok (Contributor) commented:

Note that the API timeout should already be available since xapi-project/xen-api-libs-transitional#41 and #3548, so it should "just work"; we should just try it and see what exception we get.

@edwintorok (Contributor) commented:

Another thing we should improve is Xapi_cluster_host.forget, which does not delete the cluster host. If you use Host.destroy that's fine, the GC will delete it, but with pool_force_destroy we won't have that, so it's better if Xapi_cluster_host does the right thing.
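
A sketch of the improvement being asked for; the first call stands in for whatever forget already does on the cluster side (an assumption, not the real body):

let forget ~__context ~self =
  (* existing behaviour: declare the node dead to the cluster stack *)
  declare_dead ~__context ~self;  (* hypothetical stand-in *)
  (* new: on success, delete the record instead of waiting for the GC *)
  Db.Cluster_host.destroy ~__context ~self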

@edwintorok force-pushed the feature/REQ477/master branch from 27c921f to 53404c7 on June 12, 2018 13:41
@minishrink force-pushed the feature/REQ477/CP-28477 branch from bef0e1d to 5586669 on June 13, 2018 14:18
let master = Helpers.get_master ~__context in
let master_cluster_host =
  Xapi_clustering.find_cluster_host ~__context ~host:master
  |> Xapi_stdext_monadic.Opt.unbox
Contributor commented:

This would fail if the master has no cluster_host: Opt.unbox raises on None.
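
A sketch of the safer shape, matching on the option instead of unboxing it (destroy_via is a hypothetical helper for whatever follows, and self is the Cluster):

match Xapi_clustering.find_cluster_host ~__context ~host:master with
| Some master_cluster_host -> destroy_via ~__context master_cluster_host
| None ->
  (* master has no cluster_host: fall back to any remaining member *)
  (match Db.Cluster.get_cluster_hosts ~__context ~self with
   | ch :: _ -> destroy_via ~__context ch
   | [] -> Db.Cluster.destroy ~__context ~self)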

all_remaining_cluster_hosts;

(* Finally, any cluster_hosts we couldn't force destroy will be forgotten *)
begin match Db.Cluster_host.get_all ~__context with
Contributor commented:

You probably don't need the match here; just call foreach with an empty list, which should be a no-op.
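
That is, using the foreach_cluster_host helper from this PR directly; List.iter over an empty list already does nothing:

foreach_cluster_host ~__context ~self ~log:true
  ~fn:(fun ~rpc ~session_id ~self ->
      Client.Client.Cluster_host.forget ~rpc ~session_id ~self)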

@minishrink force-pushed the feature/REQ477/CP-28477 branch from 4baff1f to 0fe3742 on June 13, 2018 21:44
@edwintorok (Contributor) commented:

Could you make this change too?

Another thing we should improve is Xapi_cluster_host.forget, which does not delete the cluster host. If you use Host.destroy that's fine, the GC will delete it, but with pool_force_destroy we won't have that, so it's better if Xapi_cluster_host does the right thing.

@edwintorok force-pushed the feature/REQ477/master branch from a4821df to 53404c7 on June 14, 2018 16:15
@minishrink changed the title from "[WIP] CP-28477" to "CP-28477: Cluster.pool_force_destroy succeeds even when host offline" on Jun 14, 2018
@minishrink changed the title from "CP-28477: Cluster.pool_force_destroy succeeds even when host offline" to "CP-28477: Cluster.pool_force_destroy succeeds when host offline" on Jun 14, 2018
@minishrink changed the title from "CP-28477: Cluster.pool_force_destroy succeeds when host offline" to "[WIP] CP-28477: Cluster.pool_force_destroy succeeds when host offline" on Jun 14, 2018
  | Some x -> List.filter ((<>) x) xs

let foreach_cluster_host ~__context ~self
    ~(fn : rpc:(Rpc.call -> Rpc.response) ->
           session_id:API.ref_session -> self:API.ref_Cluster_host -> unit)
    ~log =
  let wrapper = if log then log_and_ignore_exn else (fun _ -> ()) in
Contributor commented:

The bug is here: if log is false then this does not run fn at all; the correct way would be:
let wrapper = if log then log_and_ignore_exn else (fun f -> f ()) in

@minishrink (Contributor, author) replied:

time to hang my head in shame
or borrow the cone of shame from my dog

Akanksha Mathur added 4 commits June 15, 2018 12:39
Quicktest uses an outdated framework and has since been converted to a unit
test in test_clustering (test_assert_no_clustering_on_pif)

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
@minishrink force-pushed the feature/REQ477/CP-28477 branch from 5555436 to be35abb on June 15, 2018 12:46
@edwintorok (Contributor) left a review:

Looks good now; let's wait until the xenrt tests complete.

@edwintorok changed the title from "[WIP] CP-28477: Cluster.pool_force_destroy succeeds when host offline" to "CP-28477: Cluster.pool_force_destroy succeeds when host offline" on Jun 15, 2018
master cluster_host

CP-28477: Forward Cluster.destroy to cluster member (maybe localhost)

CP-26179: make pool_destroy work if master isn't in cluster

CP-28477: Prevent new hosts joining cluster during Cluster.pool*_destroy

- Factor out common code between pool_destroy and pool_force_destroy
  into helper pool_destroy_common

CP-28477: Cluster.pool_force_destroy succeeds without cluster_host on master

CP-28744: Make Cluster.pool_force_destroy more robust

CP-28477: Update clustering tests using Cluster.destroy
@minishrink force-pushed the feature/REQ477/CP-28477 branch from be35abb to 259c401 on June 19, 2018 10:48
@minishrink changed the title from "CP-28477: Cluster.pool_force_destroy succeeds when host offline" to "CP-28477, CP-26179: Cluster.pool_force_destroy succeeds when host offline" on Jun 19, 2018
@edwintorok merged commit e595fc7 into xapi-project:feature/REQ477/master on Jun 19, 2018
@minishrink deleted the feature/REQ477/CP-28477 branch on June 19, 2018 12:31