edwintorok (Contributor) commented:

The changes here have already been reviewed on the feature branch; ticket numbers are in the commit messages.

gaborigloi and others added 30 commits June 7, 2018 15:38
Unlike sr-probe, this one invokes the XenAPI call, not the SMAPIv2 call.
It requires an SR that supports this call.

Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
Add lima_release, bump schema minor: 142 -> 200
Add comment advising to leave interval for hotfixes

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
(Not strictly necessary for the ticket, but this change precedes the
relevant fix.)

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
CP-26912: Add API error pif_not_attached_to_host for cluster_host creation

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
CP-26912: Cluster.pool_create still takes a network parameter

CP-26912: Refactor Cluster.pool_create

CP-26912: Refactor Cluster_host.create_as_necessary

CP-26912: Cluster.create takes PIF parameter for local cluster_host

CP-26912: Add internal Cluster.get_network call for XenCenter

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
…etwork changes

CP-26912: Update uses of Cluster.(pool_)create in tests

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Bond.create: if the cluster_host.PIF is a bond member, the
  cluster_host will be moved to the bond master PIF.
Bond.destroy: if the cluster_host.PIF is in the bond, it will be moved
  to the primary slave PIF.

No PIF prerequisites need to be toggled, as bond creation already
enforces the following:
- if disallow_unplug is set for one bond member, it will be set for all
  of them
- only one PIF (the primary slave) may have a valid (!= None) IP
  configuration mode, which is then copied to the bond master

This means that on bond creation the bond master already meets the
clustering PIF prerequisites, as does the primary slave on bond
destruction.

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
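
A minimal OCaml sketch of the PIF-move rule above; the types and names here are hypothetical stand-ins, not xapi's real API, which operates on database records:

```ocaml
(* Hypothetical stub types; xapi really works with DB refs/records. *)
type pif = { name : string }
type bond = { master : pif; slaves : pif list; primary_slave : pif }

(* Bond.create: a cluster_host sitting on a bond member moves to the
   bond master, which already meets the clustering PIF prerequisites. *)
let pif_after_bond_create bond cluster_pif =
  if List.mem cluster_pif bond.slaves then bond.master else cluster_pif

(* Bond.destroy: a cluster_host sitting on the bond moves to the
   primary slave, which keeps the valid IP configuration. *)
let pif_after_bond_destroy bond cluster_pif =
  if cluster_pif = bond.master then bond.primary_slave else cluster_pif

let () =
  let eth0 = { name = "eth0" } and eth1 = { name = "eth1" } in
  let bond0 =
    { master = { name = "bond0" }; slaves = [eth0; eth1]; primary_slave = eth0 }
  in
  assert ((pif_after_bond_create bond0 eth1).name = "bond0");
  assert ((pif_after_bond_destroy bond0 bond0.master).name = "eth0")
```
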
This is in order to add a unit test for it.

CP-28213: Modify PIF assertion error

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
concurrently

CP-25121: forbid clustering operations during rolling pool upgrade

- Cluster calls such as create, destroy, add, and leave should not occur
  during an RPU as this may lead to mismatched versions or errors

- Cluster_host calls such as enable and disable are fine as they simply
  toggle clustering

- Pool clustering operations don't need a lock, as they simply iterate
  over other (locked) cluster calls

CP-26147: prevent parallel Cluster.create and Cluster.destroy calls

- wrapping clustering calls in `with_clustering_operation`
currently blocks any and all concurrent clustering operations

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
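
A rough sketch of the coarse-grained serialisation this describes; the real `with_clustering_operation` takes more context inside xapi, this only shows the shape (uses the stdlib `Mutex`, which needs the threads library on OCaml < 5):

```ocaml
(* One global lock: while it is held, no other clustering operation can
   start, so e.g. Cluster.create and Cluster.destroy cannot race. *)
let clustering_lock = Mutex.create ()

let with_clustering_operation f =
  Mutex.lock clustering_lock;
  Fun.protect ~finally:(fun () -> Mutex.unlock clustering_lock) f

let () =
  with_clustering_operation (fun () ->
      print_endline "clustering operations run one at a time")
```
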
…ng RPU

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
clustering

`PIF.disallow_unplug:true` is necessary for the entire time a cluster
host is enabled, as well as being a prerequisite for the API calls
`Cluster_host.{enable, create}`.

If `disallow_unplug` is set to false during either of these two
clustering operations, it could create a race condition where both
operations are allowed, but we then have a cluster_host that doesn't
meet networking requirements and will therefore fail later on.

This PR introduces a condition wherein you cannot allow PIF.unplug if
the pool's cluster is currently creating or enabling a cluster_host on
*any* PIF, regardless of which PIF you are allowing unplug for. This
should rarely cause an issue, as `PIF.set_disallow_unplug` is rarely
called, although an operation like `Cluster.pool_{create, resync}` could
block it for a while, depending on the pool size.

Additionally, this commit removes the obsolete
`CLUSTERING_ENABLED_ON_NETWORK` API error in favour of
`CLUSTERING_ENABLED`, as cluster_hosts correspond more closely to PIFs.

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
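
A hypothetical sketch of the check this introduces: allowing unplug is refused while any cluster_host, on any PIF, is being created or enabled. The counter and error mapping below are illustrative, not xapi's real implementation:

```ocaml
exception Clustering_enabled  (* stands in for the CLUSTERING_ENABLED error *)

(* Cluster_host.{create, enable} calls currently in flight, pool-wide. *)
let cluster_host_ops_in_flight = ref 0

let set_disallow_unplug ~value =
  if (not value) && !cluster_host_ops_in_flight > 0 then
    raise Clustering_enabled
  else Printf.printf "disallow_unplug := %b\n" value

let () = set_disallow_unplug ~value:true
```
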
Cluster_host.resync_host is called for every host in the pool by
Cluster.pool_resync, and is also called at startup as part of the task
"resync cluster host state".

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
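
A toy sketch of that shape, with hosts reduced to plain strings for illustration:

```ocaml
let resync_host host = Printf.printf "resyncing cluster host on %s\n" host

(* Cluster.pool_resync simply runs resync_host across the pool. *)
let pool_resync hosts = List.iter resync_host hosts

let () = pool_resync ["host1"; "host2"; "host3"]
```
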
We'll make a best-effort attempt at declaring hosts as dead; however, this
can fail if there are multiple hosts that are offline.
We do not want to keep broken cluster host objects around because they
would prevent disabling clustering too.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Declare dead requires the rest of the hosts in the cluster to be alive
(otherwise data corruption could occur as we are about to forcibly kick
 hosts from the cluster).
The current Xen API provides a way to kick hosts out of a pool one by
one, or to mark them as dead. We could introduce a batched API, but that
would be cumbersome for the user; it is better if we can declare dead
hosts one at a time, and only modify the cluster once we have declared
them all dead.

The semantics of "declare dead" differ between XenAPI and the clustering
daemon: for XenAPI it means the host is dead now and you can reset VM
power state, etc. In fact, HA automatically detects dead hosts.

Then there is Host.destroy (or `host-forget` on the CLI), which considers
the host permanently dead. This is the semantics the clustering daemon
wants: if a host that has been declared dead to the clustering daemon
comes back alive, data corruption could occur.

To avoid confusion, call the new method `Cluster_host.forget`, and call
it automatically from `Host.destroy`.
`Host.declare_dead` will have no effect on clustering.

For this to work we need to remember all the IPs of the hosts we have
attempted to declare dead in the past but failed, and retry declaring
them as dead again on the next API call.

Declaring a host as dead is permanent; the host can never be part of the
cluster again, so we do not need an undo API here.

TODO: things could still go wrong if we try to join a host with the same IP
as one that we are trying to declare as dead... (which can happen due to DHCP).

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
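
A hypothetical sketch of the retry bookkeeping described above: remember the IPs we failed to declare dead and retry them on the next call. Names and the stub call are made up for illustration:

```ocaml
(* IPs we tried and failed to declare dead; retried on the next call. *)
let pending : string list ref = ref []

(* Stub for the clusterd call; in reality it fails unless the rest of
   the hosts in the cluster are alive. *)
let declare_dead_exn ip = Printf.printf "%s is now permanently dead\n" ip

let cluster_host_forget ip =
  let to_do = ip :: List.filter (fun p -> p <> ip) !pending in
  let failed =
    List.filter (fun ip -> try declare_dead_exn ip; false with _ -> true) to_do
  in
  pending := failed;
  if failed <> [] then failwith "could not declare every host dead; will retry"
```
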
* CA-290471: fix wrong token timeout in XenCenter

Use the same units in the object and in the API.
The timeout set by XenCenter was 20000s, which is wrong; it should have
been 20s (which is what you get from the CLI).
XenCenter creates an object and then reads the field back to find out the
default, because the SDK doesn't expose the defaults for parameters.

So we should really use the same units for API parameters and internal
object fields, especially since this field is not used by XAPI at all;
it is only there so that the admin can inspect the values that were used
to create the cluster.

Changing this field will not change the actual timeouts used by the
cluster, because the timeout values cannot be changed at runtime.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
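
A small illustration of the unit mismatch being fixed here, assuming the stored field was in milliseconds while the API took seconds (which the 20000-vs-20 figures suggest):

```ocaml
let token_timeout_s = 20.

(* If the object field stores milliseconds... *)
let stored_field = token_timeout_s *. 1000.

let () =
  (* ...a client that reads the field back as a "default" and passes it
     to an API that takes seconds asks for 20000s instead of 20s. *)
  Printf.printf "client resubmits %.0fs instead of %.0fs\n"
    stored_field token_timeout_s
```
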
Split `Xapi_cluster_host.create` into two functions:
- `create_internal` creates the cluster_host object in the xapi DB
- `resync_host` now creates and enables the cluster_host in the client
  layer via clusterd

`create` will work as before, but these two functions will be called
separately within the xapi startup sequence.

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
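
A sketch of how the split composes; the function names follow the commit, the bodies are stand-ins:

```ocaml
let create_internal ~host =
  Printf.printf "creating cluster_host record for %s in the xapi DB\n" host;
  "OpaqueRef:cluster_host-for-" ^ host

let resync_host ~self =
  Printf.printf "joining and enabling %s via clusterd\n" self

(* create keeps its old behaviour by running both steps; the xapi
   startup sequence calls them separately instead. *)
let create ~host =
  let self = create_internal ~host in
  resync_host ~self;
  self

let () = ignore (create ~host:"host1")
```
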
- don't call Cluster_host.enable before cluster join finishes
- attempt to plug PBDs after resyncing host

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
- separate Cluster_host.create into internal db and client calls
- refactor assertions, helpers, tests, and locks
- block Cluster_host operations if not joined, except leave

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
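
A hypothetical sketch of the last rule in the list above ("block Cluster_host operations if not joined, except leave"):

```ocaml
type op = Enable | Disable | Destroy | Leave

let assert_cluster_host_operation_valid ~joined op =
  match op with
  | Leave -> ()  (* always allowed, so a half-joined host can back out *)
  | Enable | Disable | Destroy ->
      if not joined then failwith "cluster_host has not joined the cluster"

let () = assert_cluster_host_operation_valid ~joined:false Leave
```
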
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Locking_helpers provides a nice list of active locks as part of
`xe host-get-thread-diagnostics`, which helps debugging.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
This is quite useful for tracing which Xapi_cluster_host methods are
called and, together with the previous commit, for tracking deadlocks.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
…ustering daemon is down

Cluster.destroy did not work if you destroyed the last cluster host
(as some of the unit tests actually did).
Cluster_host.destroy on the last node is special: there is no cluster
left after you leave (leave would fail in reality), so just destroy the
cluster. Refactor all three destroy operations into one and choose
automatically based on the number of hosts.

We could have introduced a new API error to forbid destroying the last
cluster host, but it is better for XAPI to do the right thing
automatically than to tell the user to call some other API instead.

Also, with Cluster_host.force_destroy you could already have ended up in
a situation where you have no cluster hosts but still want to destroy the
cluster, which would hang indefinitely because the daemon was stopped.

We always try to enable the daemon on startup, so keep track of whether
we think it should be running; if we know we stopped it, just raise an
error when trying to do an RPC.

This is also useful in debugging situations where we try to send RPCs
too early (e.g. before we started the daemon).

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
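
A sketch of the two mechanisms described above, with made-up names: pick the destroy path from the number of remaining cluster hosts, and refuse RPCs once we know we stopped the daemon:

```ocaml
let daemon_enabled = ref true  (* set at startup; cleared when we stop it *)

let rpc call =
  if not !daemon_enabled then failwith "clustering daemon is not running"
  else Printf.printf "rpc: %s\n" call

let destroy_cluster_host ~n_cluster_hosts =
  if n_cluster_hosts <= 1 then begin
    (* Last node: `leave` would fail because no cluster remains after it,
       so destroy the whole cluster instead. *)
    rpc "destroy";
    daemon_enabled := false
  end else rpc "leave"

let () = destroy_cluster_host ~n_cluster_hosts:3
```
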
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Define constants for checking cluster token values
edwintorok requested a review from lindig June 8, 2018 16:12
lindig (Contributor) left a comment:

This is a lot of code that has been reviewed before. I had a cursory look and I think it looks good.

@coveralls

Coverage Status

Coverage increased (+0.2%) to 20.815% when pulling 53404c7 on feature/REQ477/master into 48ea5eb on master.

lindig merged commit ac329ec into master Jun 8, 2018
edwintorok pushed a commit to edwintorok/xen-api that referenced this pull request Jun 13, 2018