-
Notifications
You must be signed in to change notification settings - Fork 306
Reconcilable status condition #155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Unit tests were not expecting the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried deploying changes from the branch and triggered a Reconcilable error and saw the status.Conditions updated successfully. However I didn't see a FailedCreate event published. Is that intentional? I think line 222 suggests that a event should be published when the controller failed to create or update. Have you seen this behavior?
// to child resources returned no error | ||
rabbitmqCluster.Status.SetCondition(status.Reconcilable, corev1.ConditionTrue, "NoErrors", "Created or Updated all child resources") | ||
if writerErr := r.Status().Update(ctx, rabbitmqCluster); writerErr != nil { | ||
r.Log.Error(writerErr, "Error trying to Update Custom Resource status") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think r.Log.Error()
does not include request's namespace or cr name. I think we should specify the object name and its namespace in the msg here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can also see that we're not mentioning the request's namespace or cr name in our deletion log messages as well (in all other places we are :)). Can we make this change here too? 😱 Creating a new issue, etc, seems not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make the change to include that information.
@ferozjilla we do log name and namespace in here:
Do you want to add this information too in prepareForDeletion()
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, inside the prepareForDeletion()
is what I was thinking. I think it's nice to add it there too since it's cheap and takes out all doubt about where it's from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, inside the
prepareForDeletion()
is what I was thinking. I think it's nice to add it there too since it's cheap and takes out all doubt about where it's from.
I think the information for the name and namespace is there, perhaps not right-in-your-face. Here is an example of an error in that piece of code:
2020-06-05T10:13:48.084Z INFO rabbitmqcluster-controller Deleting RabbitmqCluster {"namespace": "rabbitmq-system", "name": "my-rabbitmqcluster"}
2020-06-05T10:13:48.099Z ERROR rabbitmqcluster-controller Failed to remove finalizer for deletion {"error": "Operation cannot be fulfilled on rabbitmqclusters.rabbitmq.com \"my-rabbitmqcluster\": StorageError: invalid object, Code: 4, Key: /registry/rabbitmq.com/rabbitmqclusters/rabbitmq-system/my-rabbitmqcluster, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 84c5f776-435a-4f2c-ac07-1acd63245f4b, UID in object meta: "}
github.com/go-logr/zapr.(*zapLogger).Error
[stack trace]
In the error message returned, we have the name of the CR in 'Operation cannot be fulfilled on rabbitmqclusters.rabbitmq.com "my-rabbitmqcluster"'; the namespace is included in 'Key: /registry/rabbitmq.com/rabbitmqclusters/rabbitmq-system/my-rabbitmqcluster'. Perhaps is not evident-right-in-ya-face. Do we want to change the format to be more explicit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general :)
Although, please see the comment about testing
I'm making this a draft PR because I'm not happy with the status of the code, even tho it does what it's asked. I'll address the review comments as part of this code revisit. |
16f85f2
to
b935914
Compare
New changes pushed addressing the comments and changing slightly the logic. It is not consistent with how other conditions are set, however this condition is not similar to other conditions. All replicas ready, no warning and cluster available have in common that we determine their state from child resources. We determine Reconcilable condition state based on an error that may or may not occur. For this reason, Reconcilable condition can't be determined at the same time as the others, so it's slightly different than the others and it justifies having a different logic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the name Reconcilable
best captures the implementation for this condition as chosen in this PR. I believe calling this Reconciled
would better reflect the implementation approach chosen here. Two reasons:
- This condition is set to
false
at the beginning of child resource reconciliation. If I read the condition at that point in time I would assume the CRD is not reconcilable (i.e.{type: Reconcilable, status: false}
) In 99% of the cases that's simply wrong. The CRD is totally reconcilable, it just hasn't been reconciled yet. - When we get an error from
CreateOrUpdate
we leaveReconcilable
atfalse
. But we have no idea what type of error we got. I mentioned this in the original issue. This might be some sort of transient error and thus does not automatically mean that this CRD is not reconcilable going forward. It simply means it hasn't been reconciled yet due to the error mentioned in the message.
With this in mind, I suggest to rename the condition to Reconciled
. Set it to false
with message reconciliation in progress
at the beginning of the reconcile function. In case of an error from CreateAndUpdate
leave the status at false
but change the message to reconciliation failed with error: ...
. If everything went well switch the status to true
before exiting the reconcile function.
I'm not sure I follow. Since b935914, we initialise the condition to "unknown" the very first time we are creating child resources. In subsequent runs we preserve the previous condition. The comment is not correct tho, it says we initialise to true, but we actually do to unknown.
Having transient errors and temporary false state is one of the challenges impleneting this kind of condition. The condition might flicker between false and true if we hit transient errors. Given that in the event of a reconcile error, the request is requeued, a false state due to a transient error will be corrected on next reconcile, which should happen in a matter of milliseconds from what I've seen in my testing. I don't think we have a way to determine the "reconcilability" of the CR from an error unless we classify before hand all kinds of error we consider "irreconcilable"; and even if we do this, we will have to keep this classification updated, which doesn't feel right somehow. I think the current approach of "if we keep erroring trying to create or update, then it is probably irreconcilable" is good enough because incorrect
I think the behaviour you are describing is the current implementation, except with the initial state (unknown the very first time, previous value subsequently) and just the name change. I'm ok either way. What @ferozjilla and @ChunyiLyu think about renaming to |
I am tending towards renaming to
but still feel we fall short in terms of -
Feels like |
Before I do any further changes, can I get a confirmation (and agreement) from @ChunyiLyu, @st3v and @j4mcs that this is the logic we want for this condition? For context, the current behaviour is as follows:
During the state transitions (e.g. from unknown to true), the last transition time is handled accordingly. Can I get a green light that we want Stev's implementation logic over the current one? Further context, Stev's full comment #155 (review) and my response #155 (comment) |
This is a building block that the controller can use to create and expose a condition of type Reconcilable. Intended to indicate whether RabbitmqCluster resource can be reconciled.
Expose a condition to inform whether the custom resource can be reconciled. We consider the custom resource to be reconcilable if we are able to send a create or update request to KubeAPI and receive a success response.
- Moved the time transition logic to SetCondition - ReconcilableCondition() is intended to be used as a initialiser - Log CR Name and Namespace when there is an error - Update unit tests to expect Reconcilable condition - Rename Reconciable condition to Reconciled
Integration tests started to error as a side effect to update the CR Status more frequently; those are fixed by adding a retry logic to Update calls in tests.
3faf420
to
58550f5
Compare
@@ -0,0 +1,53 @@ | |||
package status_test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this is a generic test on behaviors of all status conditions. As part of this, can be de duplicate tests that are in all_replicas_ready
, and cluster_available
and no_warnings
? There are many duplicated tests that are about ensuing the lastTransitionTime is changed or preserved...etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The status_test.go
is testing the package functions UpdateState()
and UpdateReason()
, which are used by the reconciler as it needs to mutate the reconcile_success
condition. The conditions all_replicas_ready
, cluster_available
and no_warnings
do not use those functions to update their attributes, their "builder" takes care of it.
I agree there is duplicated logic, specially with regards to updating last transition time, however I'm not sure we can de-duplicate the tests in the current state of the conditions. In their current state, every condition is "unique" in terms each condition is its own type, it has nothing in common with the others, and they do not necessarily share the same logic (even tho they do), so I feel they need to be tested separately and as individual components right now.
I'm in favour of refactoring the code around conditions, potentially building a "general case" interface, then every condition would implement this interface, which can be tested; then every condition would have specific tests, or not at all if they don't have specific behaviour.
I remember @j4mcs suggested to me a potential approach for this, but I didn't find the way to put it in practice.
Regardless of our decision to refactor or not, I think it should be part of another issue/pr since it would involve all the status conditions, not just the reconcile_success.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without generics you'll still need to test every concrete implementation of the interface. Otherwise you have to build something like knative's duck library (https://github.com/knative/pkg/tree/master/apis/duck)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the small things, the status and its test file names need to be changed to reconcile_success.go
instead of reconcilable.go
. Other than that, please see my comment on the internal/status/status_test.go
. If we introduce this generic test on general status condition behavior, I would like us to de duplicate tests that are in other status conditions as part of this. I would also suggest squash the commits before merging.
Everything else looks good!
if existingReconcileCondition != nil { | ||
reconciledCondition = *existingReconcileCondition | ||
} else { | ||
reconciledCondition = status.ReconcileSuccessCondition(corev1.ConditionUnknown, "Initialising", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? I'm not sure I see what we gain from it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The initialisation is necessary because status.UpdateStatus()
and status.UpdateReason()
do not create the condition if it does not exist. The very first time we need to set it to something, every subsequent run we just keep the previous value.
I'm not sure how else I can express this in code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🦖
Got two approvals, merging! |
This closes #10
Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed
Summary Of Changes
Reconcilable
to expose create or update errors from Kube API.SetCondition
. This function helps to mutate an existing condition to an specific status, reason and message.Reconcilable
condition according to Status field should tell me about reconciliation failure #10.go mod tidy
.Additional Context
The implementation has an implicit decision to not consider the
Reconcile()
a failure if there is an error trying to udpdate theReconcilable
condition toTrue
after the "child resource loop" has succeeded. It felt that returningReconcile()
with error and causing another reconcile loop was unnecessary. Other runs of reconcile should eventually set the condition to its desired state, given that for every CR event we trigger from 2 to 7Reconcile()
, expecting the condition to eventually reach the desired state feels ok. I'm up to debate this implementation detail here. 🐇The function
RabbitmqClusterStatus.SetCondition()
silently does nothing if the condition you are trying to set does not exist in the conditions array. This situation should happen almost never because we initialise all conditions duringRabbitmqClusterStatus.SetConditions()
(note the 's') and we invoke this function earlier in the reconcile function. We can changeSetCondition()
to initialise the condition if it does not exist, but the function would look a bit weird IMO. I would like feedback on this 🐰Local Testing
This PR includes new tests at unit and integration level. Run them with:
Manually verified this is working as expected with kind and a locally running operator with:
On a different tab, apply this YAML.