
Conversation

@minishrink (Contributor) commented Jun 12, 2018:

This PR aims to make Cluster.pool_force_destroy more robust by implementing the following changes (a rough sketch of the resulting flow follows the list):

  • Cluster_host.forget deletes the cluster_host if successful
  • Cluster.pool_force_destroy now does the following:
    • Cluster_host.destroy (best-effort) on all cluster_hosts, with successive rounds of Cluster_host.force_destroy, Cluster_host.forget, and finally database deletion of surviving cluster_hosts
    • raise an API error if any cluster_hosts survive
  • Prevent new cluster_hosts from joining a cluster after attempting a pool_destroy
  • Cluster.pool_destroy no longer requires a cluster_host on pool master
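
A rough sketch of that flow, for orientation only: the helper structure, the error-code name, and the final Db calls are assumptions of this sketch, not the PR's actual code.

let pool_force_destroy ~__context ~self =
  let surviving () = Db.Cluster.get_cluster_hosts ~__context ~self in
  (* Run one best-effort round of [op] over every surviving cluster_host *)
  let round name op =
    List.iter
      (fun ch ->
         try op ch
         with e ->
           debug "%s failed on cluster_host %s: %s"
             name (Ref.string_of ch) (ExnHelper.string_of_exn e))
      (surviving ())
  in
  round "destroy"       (fun ch -> Xapi_cluster_host.destroy ~__context ~self:ch);
  round "force_destroy" (fun ch -> Xapi_cluster_host.force_destroy ~__context ~self:ch);
  round "forget"        (fun ch -> Xapi_cluster_host.forget ~__context ~self:ch);
  (* Last resort: drop the survivors from the database *)
  round "db delete"     (fun ch -> Db.Cluster_host.destroy ~__context ~self:ch);
  match surviving () with
  | [] -> Db.Cluster.destroy ~__context ~self
  | _  -> (* error code name assumed for illustration *)
    raise Api_errors.(Server_error (cluster_force_destroy_failed, [ Ref.string_of self ]))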

@coveralls commented:

Coverage decreased (-0.003%) to 20.81% when pulling bef0e1d on minishrink:feature/REQ477/CP-28477 into 27c921f on xapi-project:feature/REQ477/master.

let cluster = forward_cluster_op ~__context ~local_fn
    (fun session_id rpc ->
       Client.Cluster.create rpc session_id pIF cluster_stack pool_auto_join token_timeout token_timeout_coefficient
    ) in
Contributor commented:

I know I've said we should add this to all operations in the Cluster module (except the pool ones), but now that I see it I realise it's redundant for create: you cannot have an existing cluster when you call create (it will fail if you do), so we do not need to redirect it anywhere.

We do need to redirect destroy, of course.
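
For destroy, the redirect would mirror the create call above; a minimal sketch, assuming the usual message_forwarding Local/Client pattern:

let destroy ~__context ~self =
  let local_fn = Local.Cluster.destroy ~self in
  forward_cluster_op ~__context ~local_fn
    (fun session_id rpc -> Client.Cluster.destroy rpc session_id self)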


module Cluster = struct

  let forward_cluster_op ~local_fn ~__context op =
Contributor commented:

Minor: the commit title should say something like: forward Cluster.destroy to a cluster member, localhost if possible.
The goal is to forward to a host that is a member of the cluster; localhost is only an optimisation.
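
A hedged sketch of that host choice (choose_cluster_member is an illustrative name; find_cluster_host is the helper used later in this PR):

(* Forward to a cluster member, preferring localhost to save a hop *)
let choose_cluster_member ~__context ~self =
  let localhost = Helpers.get_localhost ~__context in
  match Xapi_clustering.find_cluster_host ~__context ~host:localhost with
  | Some _ -> localhost
  | None ->
    (match Db.Cluster.get_cluster_hosts ~__context ~self with
     | ch :: _ -> Db.Cluster_host.get_host ~__context ~self:ch
     | [] -> failwith "no cluster members left to forward to")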

info "No cluster_hosts found, cannot run Cluster.destroy without force"
| [ cluster_host ] -> begin
assert_cluster_host_has_no_attached_sr_which_requires_cluster_stack ~__context ~self:cluster_host;
Xapi_cluster_host.force_destroy ~__context ~self:cluster_host;
@edwintorok (Contributor) commented Jun 12, 2018:

Actually, another possible way of implementing this is not to do the forwarding in message_forwarding but to use call_api_functions and call Client.Cluster_host.force_destroy instead. It is usually better if forwarding is done inside message_forwarding, though, so this is fine. If the last host is down, it will have to be declared dead before we are able to destroy the Cluster.
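
The call_api_functions alternative mentioned here would look roughly like this (a sketch only):

(* Call through the API instead of forwarding in message_forwarding *)
Helpers.call_api_functions ~__context (fun rpc session_id ->
    Client.Client.Cluster_host.force_destroy ~rpc ~session_id ~self:cluster_host)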

let uuid = Client.Client.Cluster_host.get_uuid ~rpc ~session_id ~self:cluster_host in
debug "Ignoring exception while trying to force destroy cluster host %s: %s" uuid (ExnHelper.string_of_exn e);
Db.Cluster_host.destroy ~__context ~self:cluster_host;
debug "Cluster_host %s destroyed" uuid;
Contributor commented:

The pool* APIs have to be just convenience wrappers around the low-level APIs; they must not manipulate the DB directly.
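
That is, the direct Db.Cluster_host.destroy above would become an API call, so the low-level operation owns its own DB update; a sketch (using forget per the later discussion):

(* Go through the API; do not touch the DB from the pool wrapper *)
Helpers.call_api_functions ~__context (fun rpc session_id ->
    Client.Client.Cluster_host.forget ~rpc ~session_id ~self:cluster_host)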

| [ cluster_host ] -> Some (cluster_host)
match cluster_hosts with
| [] ->
info "No cluster_hosts found, cannot run Cluster.destroy without force"
@edwintorok (Contributor) commented Jun 12, 2018:

I don't think the rest of the changes in this commit are needed: if there is no cluster host, we should just succeed in destroying the Db object as part of Cluster.destroy, as we did before. With this commit, Cluster.destroy succeeds without actually deleting the object from the DB, which is not good.

Instead of this commit, we should probably disable the auto-join flag on the Cluster object as soon as we start pool-destroy/pool-force-destroy (so other nodes will not attempt to join any more). I'll update the CP ticket.
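
A sketch of that suggestion, assuming the generated setter for the Cluster object's pool_auto_join field (the pool_destroy_common name comes from the later commits in this PR):

(* First step of any pool*_destroy: stop new cluster_hosts joining *)
let pool_destroy_common ~__context ~self =
  Db.Cluster.set_pool_auto_join ~__context ~self ~value:false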

Contributor commented:

And in Cluster.pool_force_destroy we should try to call Xapi_cluster_host.forget on all the remaining cluster hosts, after we've tried destroy and force_destroy on them and they have failed.

@edwintorok (Contributor) commented:

Note that the API timeout should already be available since xapi-project/xen-api-libs-transitional#41 and #3548, so it should "just work"; we should just try it and see what exception we get.

@edwintorok (Contributor) commented:

Another thing we should improve is Xapi_cluster_host.forget, which does not delete the cluster host. If you use Host.destroy that's fine, the GC will delete it, but with pool_force_destroy we won't have that, so it's better if Xapi_cluster_host does the right thing.
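
A sketch of the improvement being asked for; the first call stands in for whatever forget already does on the cluster side (an assumption, not the real body):

let forget ~__context ~self =
  (* existing behaviour: declare the node dead to the cluster stack *)
  declare_dead ~__context ~self;  (* hypothetical stand-in *)
  (* new: on success, delete the record instead of waiting for the GC *)
  Db.Cluster_host.destroy ~__context ~self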

@edwintorok force-pushed the feature/REQ477/master branch from 27c921f to 53404c7 on June 12, 2018 13:41
@minishrink force-pushed the feature/REQ477/CP-28477 branch from bef0e1d to 5586669 on June 13, 2018 14:18
let master = Helpers.get_master ~__context in
let master_cluster_host =
  Xapi_clustering.find_cluster_host ~__context ~host:master
  |> Xapi_stdext_monadic.Opt.unbox
Contributor commented:

This would fail if the master has no cluster_host: Opt.unbox raises on None.
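
A sketch of the safer shape, matching on the option instead of unboxing it (destroy_via is a hypothetical helper for whatever follows, and self is the Cluster):

match Xapi_clustering.find_cluster_host ~__context ~host:master with
| Some master_cluster_host -> destroy_via ~__context master_cluster_host
| None ->
  (* master has no cluster_host: fall back to any remaining member *)
  (match Db.Cluster.get_cluster_hosts ~__context ~self with
   | ch :: _ -> destroy_via ~__context ch
   | [] -> Db.Cluster.destroy ~__context ~self)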

all_remaining_cluster_hosts;

(* Finally, any cluster_hosts we couldn't force destroy will be forgotten *)
begin match Db.Cluster_host.get_all ~__context with
Contributor commented:

You probably don't need the match here; just call foreach with an empty list, which should be a no-op.
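
That is, using the foreach_cluster_host helper from this PR directly; List.iter over an empty list already does nothing:

foreach_cluster_host ~__context ~self ~log:true
  ~fn:(fun ~rpc ~session_id ~self ->
      Client.Client.Cluster_host.forget ~rpc ~session_id ~self)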

@minishrink force-pushed the feature/REQ477/CP-28477 branch from 4baff1f to 0fe3742 on June 13, 2018 21:44
@edwintorok (Contributor) commented:

Could you make this change too?

Another thing we should improve is Xapi_cluster_host.forget, which does not delete the cluster host. If you use Host.destroy that's fine, the GC will delete it, but with pool_force_destroy we won't have that, so it's better if Xapi_cluster_host does the right thing.

@edwintorok force-pushed the feature/REQ477/master branch from a4821df to 53404c7 on June 14, 2018 16:15
@minishrink changed the title from "[WIP] CP-28477" to "CP-28477: Cluster.pool_force_destroy succeeds even when host offline" on Jun 14, 2018
@minishrink changed the title from "CP-28477: Cluster.pool_force_destroy succeeds even when host offline" to "CP-28477: Cluster.pool_force_destroy succeeds when host offline" on Jun 14, 2018
@minishrink changed the title from "CP-28477: Cluster.pool_force_destroy succeeds when host offline" to "[WIP] CP-28477: Cluster.pool_force_destroy succeeds when host offline" on Jun 14, 2018
  | Some x -> List.filter ((<>) x) xs

let foreach_cluster_host ~__context ~self
    ~(fn : rpc:(Rpc.call -> Rpc.response) ->
           session_id:API.ref_session -> self:API.ref_Cluster_host -> unit)
    ~log =
  let wrapper = if log then log_and_ignore_exn else (fun _ -> ()) in
Contributor commented:

The bug is here: if log is false then this does not run fn at all; the correct way would be:
let wrapper = if log then log_and_ignore_exn else (fun f -> f ()) in

@minishrink (Contributor, author) replied:

time to hang my head in shame
or borrow the cone of shame from my dog

Akanksha Mathur added 4 commits June 15, 2018 12:39
Quicktest uses an outdated framework and has since been converted to a unit
test in test_clustering (test_assert_no_clustering_on_pif)

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
@minishrink force-pushed the feature/REQ477/CP-28477 branch from 5555436 to be35abb on June 15, 2018 12:46
@edwintorok (Contributor) left a review:

Looks good now; let's wait until the xenrt tests complete.

@edwintorok changed the title from "[WIP] CP-28477: Cluster.pool_force_destroy succeeds when host offline" to "CP-28477: Cluster.pool_force_destroy succeeds when host offline" on Jun 15, 2018
master cluster_host

CP-28477: Forward Cluster.destroy to cluster member (maybe localhost)

CP-26179: make pool_destroy work if master isn't in cluster

CP-28477: Prevent new hosts joining cluster during Cluster.pool*_destroy

- Factor out common code between pool_destroy and pool_force_destroy
  into helper pool_destroy_common

CP-28477: Cluster.pool_force_destroy succeeds without cluster_host on master

CP-28744: Make Cluster.pool_force_destroy more robust

CP-28477: Update clustering tests using Cluster.destroy
@minishrink force-pushed the feature/REQ477/CP-28477 branch from be35abb to 259c401 on June 19, 2018 10:48
@minishrink changed the title from "CP-28477: Cluster.pool_force_destroy succeeds when host offline" to "CP-28477, CP-26179: Cluster.pool_force_destroy succeeds when host offline" on Jun 19, 2018
@edwintorok merged commit e595fc7 into xapi-project:feature/REQ477/master on Jun 19, 2018
@minishrink deleted the feature/REQ477/CP-28477 branch on June 19, 2018 12:31