ceph: ability to abort orchestration #6693
Conversation
More testing to do on my end, but good for an initial review.
Force-pushed from 26080ea to 45b72f5
In the commit message. Also removing one used variable.
Did you mean "unused" variable?
Edit: I was hoping this would be attached to the commit I was looking at but I guess not. This is the first commit message.
With a dedicated predicate added for the CephCluster object, the thing I have concerns about is what happens if any of the other cases (CephObjectStore, CephBlockPool, CephFilesystem, CephNFS, CephRBDMirror, etc.) have CRDs created before the CephCluster is created. Assuming I don't find anything wrong, I will likely not request changes and trust that you'll test this before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small nits.
pkg/clusterd/context.go (Outdated)

```diff
@@ -66,4 +67,7 @@ type Context struct {

 	// The local devices detected on the node
 	Devices []*sys.LocalDisk
+
+	// OrchestrationCanceller manages the orchestration and its possible cancellation
+	OrchestrationCanceller *abool.AtomicBool
```
As a nit, the methods on this are `Set`, `Unset`, and `IsSet`, and I don't read "(un)set canceller" as having the right semantic meaning. I think renaming this to "RequestCancelOrchestration" or "OrchestrationCancellationRequest" or something similar makes the method use more grammatically specific to what's going on in the code.
Will go with `RequestCancelOrchestration`, thanks for the suggestion.
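For illustration, a minimal sketch of how the renamed flag reads at call sites. `AtomicBool` here is a self-contained stand-in built on the standard library's `sync/atomic`, mirroring the `Set`/`Unset`/`IsSet` surface of `abool.AtomicBool` discussed above; it is not the real package.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// AtomicBool is a stand-in for abool.AtomicBool with the same
// Set/Unset/IsSet surface named in the review.
type AtomicBool struct{ v atomic.Bool }

func (b *AtomicBool) Set()        { b.v.Store(true) }
func (b *AtomicBool) Unset()      { b.v.Store(false) }
func (b *AtomicBool) IsSet() bool { return b.v.Load() }

func main() {
	// With the agreed name, call sites read as a sentence:
	var requestCancelOrchestration AtomicBool

	requestCancelOrchestration.Set() // watcher caught an upgrade/deletion event
	if requestCancelOrchestration.IsSet() {
		fmt.Println("cancellation requested")
	}
	requestCancelOrchestration.Unset() // a new reconcile starts with a clean flag
}
```

The point of the rename is visible in the `if` line: "is a cancel-orchestration request set?" reads naturally, whereas "is the canceller set?" does not.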
```go
// If spec Images are different we cancel any on-going orchestration
if objNew.Spec.CephVersion.Image != objOld.Spec.CephVersion.Image {
	logger.Info("upgrade requested, cancelling any on-going orchestration")
```
nit: "on-going" -> "ongoing"
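As a standalone sketch of the comparison in the quoted diff: the trimmed-down types below are hypothetical stand-ins (just enough of the CephCluster CR to compile), not Rook's real structs, and `shouldCancelOnUpdate` is an illustrative name rather than the actual predicate function.

```go
package main

import "fmt"

// Minimal stand-ins for the CephCluster CR types, for illustration only.
type CephVersionSpec struct{ Image string }
type ClusterSpec struct{ CephVersion CephVersionSpec }
type CephCluster struct{ Spec ClusterSpec }

// shouldCancelOnUpdate mirrors the quoted predicate logic: a changed
// ceph image means an upgrade was requested, so any ongoing
// orchestration should be cancelled before reconciling the new spec.
func shouldCancelOnUpdate(objOld, objNew *CephCluster) bool {
	if objNew.Spec.CephVersion.Image != objOld.Spec.CephVersion.Image {
		fmt.Println("upgrade requested, cancelling any ongoing orchestration")
		return true
	}
	return false
}

func main() {
	oldC := &CephCluster{Spec: ClusterSpec{CephVersion: CephVersionSpec{Image: "ceph/ceph:v15.2.5"}}}
	newC := &CephCluster{Spec: ClusterSpec{CephVersion: CephVersionSpec{Image: "ceph/ceph:v15.2.6"}}}
	fmt.Println(shouldCancelOnUpdate(oldC, newC))
}
```

In the real controller this check would live in a controller-runtime update predicate (comparing the old and new objects of the event), with the `true` branch setting the cancellation flag instead of printing.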
Since we moved to the controller-runtime, events are processed one by one and so are reconciles. This means we won't have multiple orchestrations happening at the same time. Thus removing this code. Also removing one unused variable.

Signed-off-by: Sébastien Han <seb@redhat.com>
Since the predicate for the CephCluster object will soon move into its own dedicated predicate, we need to export isDoNotReconcile so that it can be consumed by the "cluster" package.

Signed-off-by: Sébastien Han <seb@redhat.com>
Since we want to pass a context to it, let's extract the logic into its own predicate.

Signed-off-by: Sébastien Han <seb@redhat.com>
Conceptually, it makes more sense to have a dedicated predicate for each controller. I had previously homogenized the code to end up with a single one, but because this cancellation approach is only for the CephCluster, I didn't want to update all the other controllers' definitions for something we might never use. Does that clarify?
Force-pushed from 45b72f5 to 34bb41d
Yes! That seems obvious now that you have said it. I still haven't had coffee yet today. 😬 ☕
Force-pushed from f26218d to 7a76ed9
We can now prioritize orchestrations on certain events. Today only two events will cancel ongoing orchestrations (if any):

* request for cluster deletion
* request for cluster upgrade

If one of the two is caught by the watcher, we will cancel the ongoing orchestration. For that we implemented a simple approach based on checkpoints, where we check for a cancellation request at certain points of the orchestration, mainly before each of the mon/mgr/osd orchestration loops.

This solution is not perfect, but we are waiting for the controller-runtime to release its 0.7 version, which will embed context support. With that we will be able to cancel reconciles more precisely and rapidly.

Operator log example:

```
2020-11-24 13:54:59.499719 I | op-mon: parsing mon endpoints: a=10.109.126.120:6789
2020-11-25 12:59:12.986264 I | ceph-cluster-controller: done reconciling ceph cluster in namespace "rook-ceph"
2020-11-25 13:07:33.776947 I | ceph-cluster-controller: CR has changed for "rook-ceph". diff=
  v1.ClusterSpec{
  	CephVersion: v1.CephVersionSpec{
  		Image:            "ceph/ceph:v15.2.5",
- 		AllowUnsupported: true,
+ 		AllowUnsupported: false,
  	},
  	DriveGroups: nil,
  	Storage:     {UseAllNodes: true, Selection: {UseAllDevices: &true}},
  	... // 20 identical fields
  }
2020-11-25 13:07:33.777039 I | ceph-cluster-controller: reconciling ceph cluster in namespace "rook-ceph"
2020-11-25 13:07:33.785088 I | op-mon: parsing mon endpoints: a=10.107.242.49:6789,b=10.109.71.30:6789,c=10.98.93.224:6789
2020-11-25 13:07:33.788626 I | ceph-cluster-controller: detecting the ceph image version for image ceph/ceph:v15.2.5...
2020-11-25 13:07:35.280789 I | ceph-cluster-controller: detected ceph image version: "15.2.5-0 octopus"
2020-11-25 13:07:35.280806 I | ceph-cluster-controller: validating ceph version from provided image
2020-11-25 13:07:35.285888 I | op-mon: parsing mon endpoints: a=10.107.242.49:6789,b=10.109.71.30:6789,c=10.98.93.224:6789
2020-11-25 13:07:35.287828 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2020-11-25 13:07:35.288082 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2020-11-25 13:07:35.621625 I | ceph-cluster-controller: cluster "rook-ceph": version "15.2.5-0 octopus" detected for image "ceph/ceph:v15.2.5"
2020-11-25 13:07:35.642688 I | op-mon: start running mons
2020-11-25 13:07:35.646323 I | op-mon: parsing mon endpoints: a=10.107.242.49:6789,b=10.109.71.30:6789,c=10.98.93.224:6789
2020-11-25 13:07:35.654070 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.107.242.49:6789","10.109.71.30:6789","10.98.93.224:6789"]}] data:a=10.107.242.49:6789,b=10.109.71.30:6789,c=10.98.93.224:6789 mapping:{"node":{"a":{"Name":"minikube","Hostname":"minikube","Address":"192.168.39.3"},"b":{"Name":"minikube","Hostname":"minikube","Address":"192.168.39.3"},"c":{"Name":"minikube","Hostname":"minikube","Address":"192.168.39.3"}}} maxMonId:2]
2020-11-25 13:07:35.868253 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2020-11-25 13:07:35.868573 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2020-11-25 13:07:37.074353 I | op-mon: targeting the mon count 3
2020-11-25 13:07:38.153435 I | op-mon: checking for basic quorum with existing mons
2020-11-25 13:07:38.178029 I | op-mon: mon "a" endpoint is [v2:10.107.242.49:3300,v1:10.107.242.49:6789]
2020-11-25 13:07:38.670191 I | op-mon: mon "b" endpoint is [v2:10.109.71.30:3300,v1:10.109.71.30:6789]
2020-11-25 13:07:39.477820 I | op-mon: mon "c" endpoint is [v2:10.98.93.224:3300,v1:10.98.93.224:6789]
2020-11-25 13:07:39.874094 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.107.242.49:6789","10.109.71.30:6789","10.98.93.224:6789"]}] data:a=10.107.242.49:6789,b=10.109.71.30:6789,c=10.98.93.224:6789 mapping:{"node":{"a":{"Name":"minikube","Hostname":"minikube","Address":"192.168.39.3"},"b":{"Name":"minikube","Hostname":"minikube","Address":"192.168.39.3"},"c":{"Name":"minikube","Hostname":"minikube","Address":"192.168.39.3"}}} maxMonId:2]
2020-11-25 13:07:40.467999 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2020-11-25 13:07:40.469733 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2020-11-25 13:07:41.071710 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2020-11-25 13:07:41.078903 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2020-11-25 13:07:41.125233 I | op-mon: deployment for mon rook-ceph-mon-a already exists. updating if needed
2020-11-25 13:07:41.327778 I | op-k8sutil: updating deployment "rook-ceph-mon-a" after verifying it is safe to stop
2020-11-25 13:07:41.327895 I | op-mon: checking if we can stop the deployment rook-ceph-mon-a
2020-11-25 13:07:44.045644 I | op-k8sutil: finished waiting for updated deployment "rook-ceph-mon-a"
2020-11-25 13:07:44.045706 I | op-mon: checking if we can continue the deployment rook-ceph-mon-a
2020-11-25 13:07:44.045740 I | op-mon: waiting for mon quorum with [a b c]
2020-11-25 13:07:44.109159 I | op-mon: mons running: [a b c]
2020-11-25 13:07:44.474596 I | op-mon: Monitors in quorum: [a b c]
2020-11-25 13:07:44.478565 I | op-mon: deployment for mon rook-ceph-mon-b already exists. updating if needed
2020-11-25 13:07:44.493374 I | op-k8sutil: updating deployment "rook-ceph-mon-b" after verifying it is safe to stop
2020-11-25 13:07:44.493403 I | op-mon: checking if we can stop the deployment rook-ceph-mon-b
2020-11-25 13:07:47.135524 I | op-k8sutil: finished waiting for updated deployment "rook-ceph-mon-b"
2020-11-25 13:07:47.135542 I | op-mon: checking if we can continue the deployment rook-ceph-mon-b
2020-11-25 13:07:47.135551 I | op-mon: waiting for mon quorum with [a b c]
2020-11-25 13:07:47.148820 I | op-mon: mons running: [a b c]
2020-11-25 13:07:47.445946 I | op-mon: Monitors in quorum: [a b c]
2020-11-25 13:07:47.448991 I | op-mon: deployment for mon rook-ceph-mon-c already exists. updating if needed
2020-11-25 13:07:47.462041 I | op-k8sutil: updating deployment "rook-ceph-mon-c" after verifying it is safe to stop
2020-11-25 13:07:47.462060 I | op-mon: checking if we can stop the deployment rook-ceph-mon-c
2020-11-25 13:07:48.853118 I | ceph-cluster-controller: CR has changed for "rook-ceph". diff=
  v1.ClusterSpec{
  	CephVersion: v1.CephVersionSpec{
- 		Image:            "ceph/ceph:v15.2.5",
+ 		Image:            "ceph/ceph:v15.2.6",
  		AllowUnsupported: false,
  	},
  	DriveGroups: nil,
  	Storage:     {UseAllNodes: true, Selection: {UseAllDevices: &true}},
  	... // 20 identical fields
  }
2020-11-25 13:07:48.853140 I | ceph-cluster-controller: upgrade requested, cancelling any ongoing orchestration
2020-11-25 13:07:50.119584 I | op-k8sutil: finished waiting for updated deployment "rook-ceph-mon-c"
2020-11-25 13:07:50.119606 I | op-mon: checking if we can continue the deployment rook-ceph-mon-c
2020-11-25 13:07:50.119619 I | op-mon: waiting for mon quorum with [a b c]
2020-11-25 13:07:50.130860 I | op-mon: mons running: [a b c]
2020-11-25 13:07:50.431341 I | op-mon: Monitors in quorum: [a b c]
2020-11-25 13:07:50.431361 I | op-mon: mons created: 3
2020-11-25 13:07:50.734156 I | op-mon: waiting for mon quorum with [a b c]
2020-11-25 13:07:50.745763 I | op-mon: mons running: [a b c]
2020-11-25 13:07:51.045108 I | op-mon: Monitors in quorum: [a b c]
2020-11-25 13:07:51.054497 E | ceph-cluster-controller: failed to reconcile. failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed to create cluster: CANCELLING CURRENT ORCHESTATION
2020-11-25 13:07:52.055208 I | ceph-cluster-controller: reconciling ceph cluster in namespace "rook-ceph"
2020-11-25 13:07:52.070690 I | op-mon: parsing mon endpoints: a=10.107.242.49:6789,b=10.109.71.30:6789,c=10.98.93.224:6789
2020-11-25 13:07:52.088979 I | ceph-cluster-controller: detecting the ceph image version for image ceph/ceph:v15.2.6...
2020-11-25 13:07:53.904811 I | ceph-cluster-controller: detected ceph image version: "15.2.6-0 octopus"
2020-11-25 13:07:53.904862 I | ceph-cluster-controller: validating ceph version from provided image
```

Closes: rook#6587
Signed-off-by: Sébastien Han <seb@redhat.com>
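The checkpoint approach described above can be sketched as follows. This is a simplified stand-alone illustration, not Rook's code: the flag uses the standard library's `sync/atomic.Bool` instead of `abool.AtomicBool`, and `orchestrate`/`checkCancelled` and the simulated mid-run event are hypothetical names invented for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// requestCancelOrchestration plays the role of the cancellation flag:
// the event watcher sets it, the orchestration polls it at checkpoints.
var requestCancelOrchestration atomic.Bool

var errCancelled = errors.New("CANCELLING CURRENT ORCHESTRATION")

// checkCancelled is one checkpoint, placed before each orchestration loop.
func checkCancelled() error {
	if requestCancelOrchestration.Load() {
		return errCancelled
	}
	return nil
}

func orchestrate() error {
	for _, phase := range []string{"mons", "mgr", "osds"} {
		// Checkpoint: bail out early if a deletion/upgrade was requested.
		if err := checkCancelled(); err != nil {
			return err
		}
		fmt.Println("orchestrating", phase)
		// ... long-running work for this phase ...
		if phase == "mgr" {
			// Simulate the watcher catching an upgrade event mid-run.
			requestCancelOrchestration.Store(true)
		}
	}
	return nil
}

func main() {
	fmt.Println("result:", orchestrate())
}
```

Because the flag is only polled at checkpoints, cancellation takes effect at the next phase boundary rather than immediately, which matches the "not perfect, waiting for context support in controller-runtime 0.7" caveat above.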
Force-pushed from 7a76ed9 to ad24990
ceph: ability to abort orchestration (bp #6693)
Description of your changes:
We can now prioritize orchestrations on certain events. Today only two events will cancel ongoing orchestrations (if any):

* request for cluster deletion
* request for cluster upgrade

If one of the two is caught by the watcher, we will cancel the ongoing orchestration. For that we implemented a simple approach based on checkpoints, where we check for a cancellation request at certain points of the orchestration, mainly before each of the mon/mgr/osd orchestration loops.

This solution is not perfect, but we are waiting for the controller-runtime to release its 0.7 version, which will embed context support. With that we will be able to cancel reconciles more precisely and rapidly.

Closes: #6587
Which issue is resolved by this Pull Request:
Resolves #6587
Checklist:

* Code generation (`make codegen`) has been run to update object specifications, if necessary.