Skip to content

Conversation

Zerpet
Copy link
Member

@Zerpet Zerpet commented Jun 2, 2020

This closes #10

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

  • Introduced a new condition type Reconcilable to expose create or update errors from Kube API.
  • Rabbitmq Status type has a new public function SetCondition. This function helps to mutate an existing condition to an specific status, reason and message.
  • Reconcile function now sets Reconcilable condition according to Status field should tell me about reconciliation failure #10.
  • Removed unnecessary go module thanks to go mod tidy.

Additional Context

The implementation has an implicit decision to not consider the Reconcile() a failure if there is an error trying to udpdate the Reconcilable condition to True after the "child resource loop" has succeeded. It felt that returning Reconcile() with error and causing another reconcile loop was unnecessary. Other runs of reconcile should eventually set the condition to its desired state, given that for every CR event we trigger from 2 to 7 Reconcile(), expecting the condition to eventually reach the desired state feels ok. I'm up to debate this implementation detail here. 🐇

The function RabbitmqClusterStatus.SetCondition() silently does nothing if the condition you are trying to set does not exist in the conditions array. This situation should happen almost never because we initialise all conditions during RabbitmqClusterStatus.SetConditions() (note the 's') and we invoke this function earlier in the reconcile function. We can change SetCondition() to initialise the condition if it does not exist, but the function would look a bit weird IMO. I would like feedback on this 🐰

Local Testing

This PR includes new tests at unit and integration level. Run them with:

$ make unit-tests integration-tests

Manually verified this is working as expected with kind and a locally running operator with:

kind create cluster
make run

On a different tab, apply this YAML.

apiVersion: rabbitmq.pivotal.io/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-rabbitmqcluster
spec:
  replicas: 1
  resources:
    limits: {}
    requests: {}
  service:
    annotations:
      imgonna-: "break-it"

@Zerpet Zerpet added the status conditions CRD Status Conditions label Jun 2, 2020
@Zerpet
Copy link
Member Author

Zerpet commented Jun 2, 2020

Unit tests were not expecting the Reconcilable condition. Shame on me for running focused tests 😳 Commit 16f85f2 should go green through the PR pipeline.

Copy link
Contributor

@ChunyiLyu ChunyiLyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried deploying changes from the branch and triggered a Reconcilable error and saw the status.Conditions updated successfully. However I didn't see a FailedCreate event published. Is that intentional? I think line 222 suggests that a event should be published when the controller failed to create or update. Have you seen this behavior?

// to child resources returned no error
rabbitmqCluster.Status.SetCondition(status.Reconcilable, corev1.ConditionTrue, "NoErrors", "Created or Updated all child resources")
if writerErr := r.Status().Update(ctx, rabbitmqCluster); writerErr != nil {
r.Log.Error(writerErr, "Error trying to Update Custom Resource status")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think r.Log.Error() does not include request's namespace or cr name. I think we should specify the object name and its namespace in the msg here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can also see that we're not mentioning the request's namespace or cr name in our deletion log messages as well (in all other places we are :)). Can we make this change here too? 😱 Creating a new issue, etc, seems not needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make the change to include that information.

@ferozjilla we do log name and namespace in here:

https://github.com/pivotal/rabbitmq-for-kubernetes/blob/74b78fee02f55014030963ba7c6fd9ba8f5fafa0/controllers/rabbitmqcluster_controller.go#L124-L131

Do you want to add this information too in prepareForDeletion() function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, inside the prepareForDeletion() is what I was thinking. I think it's nice to add it there too since it's cheap and takes out all doubt about where it's from.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, inside the prepareForDeletion() is what I was thinking. I think it's nice to add it there too since it's cheap and takes out all doubt about where it's from.

I think the information for the name and namespace is there, perhaps not right-in-your-face. Here is an example of an error in that piece of code:

2020-06-05T10:13:48.084Z        INFO    rabbitmqcluster-controller      Deleting RabbitmqCluster        {"namespace": "rabbitmq-system", "name": "my-rabbitmqcluster"}
2020-06-05T10:13:48.099Z        ERROR   rabbitmqcluster-controller      Failed to remove finalizer for deletion {"error": "Operation cannot be fulfilled on rabbitmqclusters.rabbitmq.com \"my-rabbitmqcluster\": StorageError: invalid object, Code: 4, Key: /registry/rabbitmq.com/rabbitmqclusters/rabbitmq-system/my-rabbitmqcluster, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 84c5f776-435a-4f2c-ac07-1acd63245f4b, UID in object meta: "}
github.com/go-logr/zapr.(*zapLogger).Error
   [stack trace]

In the error message returned, we have the name of the CR in 'Operation cannot be fulfilled on rabbitmqclusters.rabbitmq.com "my-rabbitmqcluster"'; the namespace is included in 'Key: /registry/rabbitmq.com/rabbitmqclusters/rabbitmq-system/my-rabbitmqcluster'. Perhaps is not evident-right-in-ya-face. Do we want to change the format to be more explicit?

Copy link
Contributor

@ferozjilla ferozjilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general :)
Although, please see the comment about testing

@Zerpet
Copy link
Member Author

Zerpet commented Jun 3, 2020

I'm making this a draft PR because I'm not happy with the status of the code, even tho it does what it's asked. I'll address the review comments as part of this code revisit.

@Zerpet Zerpet marked this pull request as draft June 3, 2020 12:57
@Zerpet Zerpet force-pushed the rabbitmq-operator-status-reconcilable-10 branch from 16f85f2 to b935914 Compare June 3, 2020 17:54
@Zerpet Zerpet marked this pull request as ready for review June 3, 2020 17:55
@Zerpet
Copy link
Member Author

Zerpet commented Jun 3, 2020

New changes pushed addressing the comments and changing slightly the logic. It is not consistent with how other conditions are set, however this condition is not similar to other conditions. All replicas ready, no warning and cluster available have in common that we determine their state from child resources. We determine Reconcilable condition state based on an error that may or may not occur. For this reason, Reconcilable condition can't be determined at the same time as the others, so it's slightly different than the others and it justifies having a different logic.

@ferozjilla ferozjilla self-requested a review June 4, 2020 21:00
Copy link
Contributor

@ferozjilla ferozjilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@st3v st3v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the name Reconcilable best captures the implementation for this condition as chosen in this PR. I believe calling this Reconciled would better reflect the implementation approach chosen here. Two reasons:

  1. This condition is set to false at the beginning of child resource reconciliation. If I read the condition at that point in time I would assume the CRD is not reconcilable (i.e. {type: Reconcilable, status: false}) In 99% of the cases that's simply wrong. The CRD is totally reconcilable, it just hasn't been reconciled yet.
  2. When we get an error from CreateOrUpdate we leave Reconcilable at false. But we have no idea what type of error we got. I mentioned this in the original issue. This might be some sort of transient error and thus does not automatically mean that this CRD is not reconcilable going forward. It simply means it hasn't been reconciled yet due to the error mentioned in the message.

With this in mind, I suggest to rename the condition to Reconciled. Set it to false with message reconciliation in progress at the beginning of the reconcile function. In case of an error from CreateAndUpdate leave the status at false but change the message to reconciliation failed with error: .... If everything went well switch the status to true before exiting the reconcile function.

@Zerpet
Copy link
Member Author

Zerpet commented Jun 5, 2020

I'm not sure the name Reconcilable best captures the implementation for this condition as chosen in this PR. I believe calling this Reconciled would better reflect the implementation approach chosen here. Two reasons:

  1. This condition is set to false at the beginning of child resource reconciliation. If I read the condition at that point in time I would assume the CRD is not reconcilable (i.e. {type: Reconcilable, status: false}) In 99% of the cases that's simply wrong. The CRD is totally reconcilable, it just hasn't been reconciled yet.

I'm not sure I follow. Since b935914, we initialise the condition to "unknown" the very first time we are creating child resources. In subsequent runs we preserve the previous condition. The comment is not correct tho, it says we initialise to true, but we actually do to unknown.

https://github.com/pivotal/rabbitmq-for-kubernetes/blob/b93591486925f3641170828899a0527e5e8252ea/api/v1beta1/rabbitmqcluster_types.go#L162-L169

  1. When we get an error from CreateOrUpdate we leave Reconcilable at false. But we have no idea what type of error we got. I mentioned this in the original issue. This might be some sort of transient error and thus does not automatically mean that this CRD is not reconcilable going forward. It simply means it hasn't been reconciled yet due to the error mentioned in the message.

Having transient errors and temporary false state is one of the challenges impleneting this kind of condition. The condition might flicker between false and true if we hit transient errors. Given that in the event of a reconcile error, the request is requeued, a false state due to a transient error will be corrected on next reconcile, which should happen in a matter of milliseconds from what I've seen in my testing.

I don't think we have a way to determine the "reconcilability" of the CR from an error unless we classify before hand all kinds of error we consider "irreconcilable"; and even if we do this, we will have to keep this classification updated, which doesn't feel right somehow.

I think the current approach of "if we keep erroring trying to create or update, then it is probably irreconcilable" is good enough because incorrect False states will be corrected within milliseconds and it requires consistent erroring to maintain a false state.

With this in mind, I suggest to rename the condition to Reconciled. Set it to false with message reconciliation in progress at the beginning of the reconcile function. In case of an error from CreateAndUpdate leave the status at false but change the message to reconciliation failed with error: .... If everything went well switch the status to true before exiting the reconcile function.

I think the behaviour you are describing is the current implementation, except with the initial state (unknown the very first time, previous value subsequently) and just the name change. I'm ok either way. What @ferozjilla and @ChunyiLyu think about renaming to Reconciled ?

@ferozjilla
Copy link
Contributor

I am tending towards renaming to Reconciled. I hear that

I think the current approach of "if we keep erroring trying to create or update, then it is probably irreconcilable" is good enough because incorrect False states will be corrected within milliseconds and it requires consistent erroring to maintain a false state.

but still feel we fall short in terms of -

  • point in time reading of these values may not be reliable
  • we're using a best effort approach to predict "irreconcilable", so this may actually not be the case

Feels like Reconciled is a step back from making this conclusion that let's the human operator decide based on the status of Reconciled. This would also ensure the point in time readings are accurate.

@Zerpet
Copy link
Member Author

Zerpet commented Jun 15, 2020

With this in mind, I suggest to rename the condition to Reconciled. Set it to false with message reconciliation in progress at the beginning of the reconcile function. In case of an error from CreateAndUpdate leave the status at false but change the message to reconciliation failed with error: .... If everything went well switch the status to true before exiting the reconcile function.

Before I do any further changes, can I get a confirmation (and agreement) from @ChunyiLyu, @st3v and @j4mcs that this is the logic we want for this condition?

For context, the current behaviour is as follows:

  1. If Reconciable condition is not set, initialise to unknown. Else preserve previous state.
  2. If there is an error during CreateOrUpdate, set condition state to false (reconcile returns at this point)
  3. At the end of the reconcile loop, set state to true

During the state transitions (e.g. from unknown to true), the last transition time is handled accordingly.

Can I get a green light that we want Stev's implementation logic over the current one?

Further context, Stev's full comment #155 (review) and my response #155 (comment)

Zerpet added 4 commits June 18, 2020 15:32
This is a building block that the controller can use to create and
expose a condition of type Reconcilable. Intended to indicate whether
RabbitmqCluster resource can be reconciled.
Expose a condition to inform whether the custom resource can be
reconciled. We consider the custom resource to be reconcilable if we are
able to send a create or update request to KubeAPI and receive a success
response.
- Moved the time transition logic to SetCondition
- ReconcilableCondition() is intended to be used as a initialiser
- Log CR Name and Namespace when there is an error
- Update unit tests to expect Reconcilable condition
- Rename Reconciable condition to Reconciled
Integration tests started to error as a side effect to update the CR
Status more frequently; those are fixed by adding a retry logic to
Update calls in tests.
@Zerpet Zerpet force-pushed the rabbitmq-operator-status-reconcilable-10 branch from 3faf420 to 58550f5 Compare June 18, 2020 15:33
@@ -0,0 +1,53 @@
package status_test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is a generic test on behaviors of all status conditions. As part of this, can be de duplicate tests that are in all_replicas_ready, and cluster_available and no_warnings? There are many duplicated tests that are about ensuing the lastTransitionTime is changed or preserved...etc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The status_test.go is testing the package functions UpdateState() and UpdateReason(), which are used by the reconciler as it needs to mutate the reconcile_success condition. The conditions all_replicas_ready, cluster_available and no_warnings do not use those functions to update their attributes, their "builder" takes care of it.

I agree there is duplicated logic, specially with regards to updating last transition time, however I'm not sure we can de-duplicate the tests in the current state of the conditions. In their current state, every condition is "unique" in terms each condition is its own type, it has nothing in common with the others, and they do not necessarily share the same logic (even tho they do), so I feel they need to be tested separately and as individual components right now.

I'm in favour of refactoring the code around conditions, potentially building a "general case" interface, then every condition would implement this interface, which can be tested; then every condition would have specific tests, or not at all if they don't have specific behaviour.

I remember @j4mcs suggested to me a potential approach for this, but I didn't find the way to put it in practice.

Regardless of our decision to refactor or not, I think it should be part of another issue/pr since it would involve all the status conditions, not just the reconcile_success.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without generics you'll still need to test every concrete implementation of the interface. Otherwise you have to build something like knative's duck library (https://github.com/knative/pkg/tree/master/apis/duck)

Copy link
Contributor

@ChunyiLyu ChunyiLyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the small things, the status and its test file names need to be changed to reconcile_success.go instead of reconcilable.go. Other than that, please see my comment on the internal/status/status_test.go. If we introduce this generic test on general status condition behavior, I would like us to de duplicate tests that are in other status conditions as part of this. I would also suggest squash the commits before merging.

Everything else looks good!

if existingReconcileCondition != nil {
reconciledCondition = *existingReconcileCondition
} else {
reconciledCondition = status.ReconcileSuccessCondition(corev1.ConditionUnknown, "Initialising", "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? I'm not sure I see what we gain from it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initialisation is necessary because status.UpdateStatus() and status.UpdateReason() do not create the condition if it does not exist. The very first time we need to set it to something, every subsequent run we just keep the previous value.

I'm not sure how else I can express this in code?

Copy link
Contributor

@ChunyiLyu ChunyiLyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🦖

@Zerpet
Copy link
Member Author

Zerpet commented Jun 24, 2020

Got two approvals, merging! :shipit: I think that squash merging will generate one gigangic commit that will be hard to understand. I will "classic" merge to keep the 5 commits in the history.

@Zerpet Zerpet merged commit 538b938 into master Jun 24, 2020
@Zerpet Zerpet deleted the rabbitmq-operator-status-reconcilable-10 branch June 24, 2020 11:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status conditions CRD Status Conditions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Status field should tell me about reconciliation failure

5 participants