-
Notifications
You must be signed in to change notification settings - Fork 292
CP-28477, CP-26179: Cluster.pool_force_destroy succeeds when host offline #3627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CP-28477, CP-26179: Cluster.pool_force_destroy succeeds when host offline #3627
Conversation
ocaml/xapi/message_forwarding.ml
Outdated
let cluster = forward_cluster_op ~__context ~local_fn | ||
(fun session_id rpc -> | ||
Client.Cluster.create rpc session_id pIF cluster_stack pool_auto_join token_timeout token_timeout_coefficient | ||
) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know I've said we should add this to all operations in Cluster module (except pool), but now that I see it I realised its redundant for create
: you cannot have an existing cluster when you call create (it will fail if you do), so we do not need to redirect it anywhere.
We do need to redirect destroy of course.
|
||
module Cluster = struct | ||
|
||
let forward_cluster_op ~local_fn ~__context op = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: The commit title should say something like: forward Cluster.destroy to a cluster member, localhost if possible.
The goal is to forward to a host that is a member of the cluster, localhost is only an optimization.
ocaml/xapi/xapi_cluster.ml
Outdated
info "No cluster_hosts found, cannot run Cluster.destroy without force" | ||
| [ cluster_host ] -> begin | ||
assert_cluster_host_has_no_attached_sr_which_requires_cluster_stack ~__context ~self:cluster_host; | ||
Xapi_cluster_host.force_destroy ~__context ~self:cluster_host; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually another possible way of implementing this is to not do the forwarding in message-forwarding, but use call_api_functions
and call Client.Cluster_host.force_destroy
instead. It is usually better if forwarding is done inside message-forwarding, so this is fine. If the last host is down then it will have to be declared as dead before we are able to destroy the Cluster.
ocaml/xapi/xapi_cluster.ml
Outdated
let uuid = Client.Client.Cluster_host.get_uuid ~rpc ~session_id ~self:cluster_host in | ||
debug "Ignoring exception while trying to force destroy cluster host %s: %s" uuid (ExnHelper.string_of_exn e); | ||
Db.Cluster_host.destroy ~__context ~self:cluster_host; | ||
debug "Cluster_host %s destroyed" uuid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pool* APIs have to be just convenience wrappers around the low-level APIs, they must not manipulate the DB directly.
ocaml/xapi/xapi_cluster.ml
Outdated
| [ cluster_host ] -> Some (cluster_host) | ||
match cluster_hosts with | ||
| [] -> | ||
info "No cluster_hosts found, cannot run Cluster.destroy without force" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the rest the changes in this commit are needed: if there is no cluster host we should just succeed in destroying the Db object as part of Cluster.destroy, as we did before, with this commit we succeed the Cluster.destroy without actually deleting the object from the DB, which is not good.
Instead of this commit we should probably disable auto-join flag on the Cluster object as soon as we start pool-destroy/pool-force-destroy (so other nodes will not attempt to join anymore), I'll update the CP ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And in Cluster.pool_force_destroy we should try to call Xapi_cluster_host.forget on all the remaining cluster hosts, after we've tried destroy and force destroy on them and they failed.
Note that the API timeout should already be available since xapi-project/xen-api-libs-transitional#41 and #3548, so it should "just work", we should just try and see what exception we get. |
Another thing we should improve is Xapi_cluster_host.forget, which does not delete the cluster host. If you use Host.destroy thats fine, the GC will delete it, but with pool_force_destroy we won't have that, so its better if Xapi_cluster_host does the right thing. |
27c921f
to
53404c7
Compare
bef0e1d
to
5586669
Compare
ocaml/xapi/xapi_cluster.ml
Outdated
let master = Helpers.get_master ~__context in | ||
let master_cluster_host = | ||
Xapi_clustering.find_cluster_host ~__context ~host:master | ||
|> Xapi_stdext_monadic.Opt.unbox |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would fail if there is no master
ocaml/xapi/xapi_cluster.ml
Outdated
all_remaining_cluster_hosts; | ||
|
||
(* Finally, any cluster_hosts we couldn't force destroy will be forgotten *) | ||
begin match Db.Cluster_host.get_all ~__context with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably don't need the match here, just call foreach with an empty list, it should be a no-op.
4baff1f
to
0fe3742
Compare
Could you make this change too?
|
a4821df
to
53404c7
Compare
ocaml/xapi/xapi_cluster.ml
Outdated
| Some x -> List.filter ((<>) x) xs | ||
let foreach_cluster_host ~__context ~self ~(fn : rpc:(Rpc.call -> Rpc.response) -> | ||
session_id:API.ref_session -> self:API.ref_Cluster_host -> unit) ~log = | ||
let wrapper = if log then log_and_ignore_exn else (fun _ -> ()) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bug is here: if log is false then this does not run fn
at all, the correct way would be:
let wrapper = if log then log_and_ignore_exn else (fun f -> f ()) in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
time to hang my head in shame
or borrow the cone of shame from my dog
Quicktest uses outdated framework and has since been converted to a unit test in test_clustering (test_assert_no_clustering_on_pif) Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
5555436
to
be35abb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now, lets wait until xenrt tests complete
master cluster_host CP-28477: Forward Cluster.destroy to cluster member (maybe localhost) CP-26179: make pool_destroy work if master isn't in cluster CP-28477: Prevent new hosts joining cluster during Cluster.pool*_destroy - Factor out common code between pool_destroy and pool_force_destroy into helper pool_destroy_common CP-28477: Cluster.pool_force_destroy succeeds without cluster_host on master CP-28744: Make Cluster.pool_force_destroy more robust CP-28477: Update clustering tests using Cluster.destroy
be35abb
to
259c401
Compare
This PR aims to make Cluster.pool_force_destroy more robust by implementing the following changes: