edwintorok (Contributor) commented:

The changes here have already been reviewed on the feature branch; ticket numbers are in the commit messages.

gaborigloi and others added 30 commits June 7, 2018 15:38
Unlike sr-probe, this one invokes the XenAPI call, not the SMAPIv2 call.
It requires an SR that supports this call.

Signed-off-by: Gabor Igloi <gabor.igloi@citrix.com>
Add lima_release, bump schema minor: 142 -> 200
Add comment advising to leave interval for hotfixes

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
(Not strictly necessary for the ticket, but this change precedes the
relevant fix.)

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
CP-26912: Add API error pif_not_attached_to_host for cluster_host creation

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
CP-26912: Cluster.pool_create still takes a network parameter

CP-26912: Refactor Cluster.pool_create

CP-26912: Refactor Cluster_host.create_as_necessary

CP-26912: Cluster.create takes PIF parameter for local cluster_host

CP-26912: Add internal Cluster.get_network call for XenCenter

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
…etwork changes

CP-26912: Update uses of Cluster.(pool_)create in tests

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Bond.create: if the cluster_host.PIF is a bond member, the
  cluster_host will be moved to the bond master PIF.
Bond.destroy: if the cluster_host.PIF is in the bond, it will be moved
  to the primary slave PIF.

No PIF prerequisites need to be toggled, as bond creation already
enforces the following:
- if disallow_unplug is set for one bond member, it will be set for all
  of them
- only one PIF (the primary slave) may have a valid (!= None) IP
  configuration mode, which is then copied to the bond master

This means that on bond creation the bond master already meets the
clustering PIF prerequisites, as does the primary slave on bond
destruction.

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
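
A minimal OCaml sketch of the PIF-move rule above; the types and names here are hypothetical stand-ins, not xapi's real API, which operates on database records:

```ocaml
(* Hypothetical stub types; xapi really works with DB refs/records. *)
type pif = { name : string }
type bond = { master : pif; slaves : pif list; primary_slave : pif }

(* Bond.create: a cluster_host sitting on a bond member moves to the
   bond master, which already meets the clustering PIF prerequisites. *)
let pif_after_bond_create bond cluster_pif =
  if List.mem cluster_pif bond.slaves then bond.master else cluster_pif

(* Bond.destroy: a cluster_host sitting on the bond moves to the
   primary slave, which keeps the valid IP configuration. *)
let pif_after_bond_destroy bond cluster_pif =
  if cluster_pif = bond.master then bond.primary_slave else cluster_pif

let () =
  let eth0 = { name = "eth0" } and eth1 = { name = "eth1" } in
  let bond0 =
    { master = { name = "bond0" }; slaves = [eth0; eth1]; primary_slave = eth0 }
  in
  assert ((pif_after_bond_create bond0 eth1).name = "bond0");
  assert ((pif_after_bond_destroy bond0 bond0.master).name = "eth0")
```
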
This is in order to add a unit test for it.

CP-28213: Modify PIF assertion error

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
concurrently

CP-25121: forbid clustering operations during rolling pool upgrade

- Cluster calls such as create, destroy, add, and leave should not occur
  during an RPU as this may lead to mismatched versions or errors

- Cluster_host calls such as enable and disable are fine as they simply
  toggle clustering

- Pool clustering operations don't need a lock, as they simply iterate
  over other (locked) cluster calls

CP-26147: prevent parallel Cluster.create and Cluster.destroy calls

- wrapping clustering calls in `with_clustering_operation`
currently blocks any and all concurrent clustering operations

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
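
A rough sketch of the coarse-grained serialisation this describes; the real `with_clustering_operation` takes more context inside xapi, this only shows the shape (uses the stdlib `Mutex`, which needs the threads library on OCaml < 5):

```ocaml
(* One global lock: while it is held, no other clustering operation can
   start, so e.g. Cluster.create and Cluster.destroy cannot race. *)
let clustering_lock = Mutex.create ()

let with_clustering_operation f =
  Mutex.lock clustering_lock;
  Fun.protect ~finally:(fun () -> Mutex.unlock clustering_lock) f

let () =
  with_clustering_operation (fun () ->
      print_endline "clustering operations run one at a time")
```
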
…ng RPU

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
clustering

`PIF.disallow_unplug:true` is necessary for the entire time a cluster
host is enabled, as well as being a prerequisite for the API calls
`Cluster_host.{enable, create}`.

If `disallow_unplug` is set to false during either of these two
clustering operations, it could create a race condition where both
operations are allowed, but we then have a cluster_host that doesn't
meet networking requirements and will therefore fail later on.

This PR introduces a condition wherein you cannot allow PIF.unplug if
the pool's cluster is currently creating or enabling a cluster_host on
*any* PIF, regardless of which PIF you are allowing unplug for. This
should rarely cause an issue, as `PIF.set_disallow_unplug` is rarely
called, although an operation like `Cluster.pool_{create, resync}` could
block it for a while, depending on the pool size.

Additionally, this commit removes the obsolete
`CLUSTERING_ENABLED_ON_NETWORK` API error in favour of
`CLUSTERING_ENABLED`, as cluster_hosts correspond more closely to PIFs.

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
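
A hypothetical sketch of the check this introduces: allowing unplug is refused while any cluster_host, on any PIF, is being created or enabled. The counter and error mapping below are illustrative, not xapi's real implementation:

```ocaml
exception Clustering_enabled  (* stands in for the CLUSTERING_ENABLED error *)

(* Cluster_host.{create, enable} calls currently in flight, pool-wide. *)
let cluster_host_ops_in_flight = ref 0

let set_disallow_unplug ~value =
  if (not value) && !cluster_host_ops_in_flight > 0 then
    raise Clustering_enabled
  else Printf.printf "disallow_unplug := %b\n" value

let () = set_disallow_unplug ~value:true
```
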
Cluster_host.resync_host is called for every host in the pool by
Cluster.pool_resync, and is also called at startup as part of the task
"resync cluster host state".

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
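
A toy sketch of that shape, with hosts reduced to plain strings for illustration:

```ocaml
let resync_host host = Printf.printf "resyncing cluster host on %s\n" host

(* Cluster.pool_resync simply runs resync_host across the pool. *)
let pool_resync hosts = List.iter resync_host hosts

let () = pool_resync ["host1"; "host2"; "host3"]
```
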
We'll make a best-effort attempt at declaring hosts as dead; however, this
can fail if there are multiple hosts that are offline.
We do not want to keep broken cluster host objects around because they
would prevent disabling clustering too.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Declare dead requires the rest of the hosts in the cluster to be alive
(otherwise data corruption could occur as we are about to forcibly kick
 hosts from the cluster).
The current Xen API provides a way to kick hosts out of a pool one by
one, or to mark them as dead. We could introduce a batched API, but that
would be cumbersome for the user; it is better if we can declare dead
hosts one at a time, and only modify the cluster once we have declared
them all dead.

The semantics of "declare dead" differ between XenAPI and the clustering
daemon: for XenAPI it means the host is dead now and you can reset VM
power state, etc. In fact, HA automatically detects dead hosts.

Then there is Host.destroy (or `host-forget` on the CLI), which considers
the host permanently dead. This is the semantics the clustering daemon
wants: if a host that has been declared dead to the clustering daemon
comes back alive, data corruption could occur.

To avoid confusion, call the new method `Cluster_host.forget`, and call
it automatically from `Host.destroy`.
`Host.declare_dead` will have no effect on clustering.

For this to work we need to remember all the IPs of the hosts we have
attempted to declare dead in the past but failed, and retry declaring
them as dead again on the next API call.

Declaring a host as dead is permanent; the host can never be part of the
cluster again, so we do not need an undo API here.

TODO: things could still go wrong if we try to join a host with the same IP
as one that we are trying to declare as dead... (which can happen due to DHCP).

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
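
A hypothetical sketch of the retry bookkeeping described above: remember the IPs we failed to declare dead and retry them on the next call. Names and the stub call are made up for illustration:

```ocaml
(* IPs we tried and failed to declare dead; retried on the next call. *)
let pending : string list ref = ref []

(* Stub for the clusterd call; in reality it fails unless the rest of
   the hosts in the cluster are alive. *)
let declare_dead_exn ip = Printf.printf "%s is now permanently dead\n" ip

let cluster_host_forget ip =
  let to_do = ip :: List.filter (fun p -> p <> ip) !pending in
  let failed =
    List.filter (fun ip -> try declare_dead_exn ip; false with _ -> true) to_do
  in
  pending := failed;
  if failed <> [] then failwith "could not declare every host dead; will retry"
```
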
* CA-290471: fix wrong token timeout in XenCenter

Use the same units in the object and in the API.
The timeout set by XenCenter was 20000s, which is wrong; it should have
been 20s (which is what you get from the CLI).
XenCenter creates an object and then reads the field back to find out the
default, because the SDK doesn't expose the defaults for parameters.

So we should really use the same units for API parameters and internal
object fields, especially since this field is not used by XAPI at all;
it is only there so that the admin can inspect the values that were used
to create the cluster.

Changing this field will not change the actual timeouts used by the
cluster, because the timeout values cannot be changed at runtime.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
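
A small illustration of the unit mismatch being fixed here, assuming the stored field was in milliseconds while the API took seconds (which the 20000-vs-20 figures suggest):

```ocaml
let token_timeout_s = 20.

(* If the object field stores milliseconds... *)
let stored_field = token_timeout_s *. 1000.

let () =
  (* ...a client that reads the field back as a "default" and passes it
     to an API that takes seconds asks for 20000s instead of 20s. *)
  Printf.printf "client resubmits %.0fs instead of %.0fs\n"
    stored_field token_timeout_s
```
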
Split `Xapi_cluster_host.create` into two functions:
- `create_internal` creates the cluster_host object in the xapi DB
- `resync_host` now creates and enables the cluster_host in the client
  layer via clusterd

`create` will work as before, but these two functions will be called
separately within the xapi startup sequence.

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
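
A sketch of how the split composes; the function names follow the commit, the bodies are stand-ins:

```ocaml
let create_internal ~host =
  Printf.printf "creating cluster_host record for %s in the xapi DB\n" host;
  "OpaqueRef:cluster_host-for-" ^ host

let resync_host ~self =
  Printf.printf "joining and enabling %s via clusterd\n" self

(* create keeps its old behaviour by running both steps; the xapi
   startup sequence calls them separately instead. *)
let create ~host =
  let self = create_internal ~host in
  resync_host ~self;
  self

let () = ignore (create ~host:"host1")
```
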
- don't call Cluster_host.enable before cluster join finishes
- attempt to plug PBDs after resyncing host

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
- separate Cluster_host.create into internal db and client calls
- refactor assertions, helpers, tests, and locks
- block Cluster_host operations if not joined, except leave

Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
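
A hypothetical sketch of the last rule in the list above ("block Cluster_host operations if not joined, except leave"):

```ocaml
type op = Enable | Disable | Destroy | Leave

let assert_cluster_host_operation_valid ~joined op =
  match op with
  | Leave -> ()  (* always allowed, so a half-joined host can back out *)
  | Enable | Disable | Destroy ->
      if not joined then failwith "cluster_host has not joined the cluster"

let () = assert_cluster_host_operation_valid ~joined:false Leave
```
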
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Locking_helpers provides a nice list of active locks as part of
`xe host-get-thread-diagnostics`, which helps debugging.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
This is quite useful for tracing which Xapi_cluster_host methods are
called and, together with the previous commit, for tracking deadlocks.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
…ustering daemon is down

Cluster.destroy did not work if you destroyed the last cluster host
(as some of the unit tests actually did).
Cluster_host.destroy on the last node is special: there is no cluster
left after you leave (leave would fail in reality), so just destroy the
cluster. Refactor all three destroy operations into one and choose
automatically based on the number of hosts.

We could have introduced a new API error to forbid destroying the last
cluster host, but it is better for XAPI to do the right thing
automatically than to tell the user to call some other API instead.

Also, with Cluster_host.force_destroy you could already have ended up in
a situation where you have no cluster hosts but still want to destroy the
cluster, which would hang indefinitely because the daemon was stopped.

We always try to enable the daemon on startup, so keep track of whether
we think it should be running; if we know we stopped it, just raise an
error when trying to do an RPC.

This is also useful in debugging situations where we try to send RPCs
too early (e.g. before we started the daemon).

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
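
A sketch of the two mechanisms described above, with made-up names: pick the destroy path from the number of remaining cluster hosts, and refuse RPCs once we know we stopped the daemon:

```ocaml
let daemon_enabled = ref true  (* set at startup; cleared when we stop it *)

let rpc call =
  if not !daemon_enabled then failwith "clustering daemon is not running"
  else Printf.printf "rpc: %s\n" call

let destroy_cluster_host ~n_cluster_hosts =
  if n_cluster_hosts <= 1 then begin
    (* Last node: `leave` would fail because no cluster remains after it,
       so destroy the whole cluster instead. *)
    rpc "destroy";
    daemon_enabled := false
  end else rpc "leave"

let () = destroy_cluster_host ~n_cluster_hosts:3
```
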
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Signed-off-by: Akanksha Mathur <akanksha.mathur@citrix.com>
Define constants for checking cluster token values
edwintorok requested a review from lindig June 8, 2018 16:12
lindig (Contributor) left a comment:

This is a lot of code that has been reviewed before. I had a cursory look and I think it looks good.

@coveralls

Coverage Status

Coverage increased (+0.2%) to 20.815% when pulling 53404c7 on feature/REQ477/master into 48ea5eb on master.

lindig merged commit ac329ec into master Jun 8, 2018
edwintorok pushed a commit to edwintorok/xen-api that referenced this pull request Jun 13, 2018