Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding active operator feature #1531

Merged
merged 2 commits into from
Oct 1, 2021
Merged

Adding active operator feature #1531

merged 2 commits into from
Oct 1, 2021

Conversation

tmjd
Copy link
Member

@tmjd tmjd commented Sep 22, 2021

Description

This feature allows the operator to be deployed to alternate namespaces and the ability to designate which will be the active one.

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

Copy link
Member

@caseydavenport caseydavenport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks pretty good - I haven't reviewed tests yet so will do that next.

Did we decide that we don't need to handle cached secrets during rolling update when switching between operators?


// IsThisOperatorActive will process the passed in ConfigMap and check that
// this running operator is the active one based on the OperatorNamespace.
// The first argument is if this running operator is active.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"argument" -> "return value"

func (r *ReconcileInstallation) checkActive(log logr.Logger) (*corev1.ConfigMap, error) {
cm, err := active.GetActiveConfigMap(r.client)
if err != nil {
r.SetDegraded(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If multiple operators hit this, I'm guessing they will fight with each other to maintain this active operator status?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would only be an initial race. Both (or multiple) if running at the same time one could create the map and then a second could overwrite the map. But once either re-reads the config they would not touch it again since they only create the configMap if it doesn't already exist.
So I wouldn't say they fight because an operator will only attempt to create this ConfigMap once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant if two or more operators fail to get the active configmap, they will both try to set degraded on the same tigera status resource.

Probably not an issue? Will they try to do anything differently from each other?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. Sorry about misunderstanding that.
Yeah I see what you mean. Maybe we should not SetDegraded in this specific case and just log it? Or because of the possible fight should we include the namespace in the message, that way it would be an indicator if it was coming from an unexpected operator?

@tmjd
Copy link
Member Author

tmjd commented Sep 23, 2021

Did we decide that we don't need to handle cached secrets during rolling update when switching between operators?

I forgot about this consideration.
It is already a manual separate step to change the active designation, (1) maybe part of the process for changing the active operator includes copying over the secrets to the new active namespace.
Other options I can imagine:
2. Always use tigera-operator namespace for storing the secrets/config that the operator writes to "its namespace".
3. Include a namespace specification in the active-operator ConfigMap for where the secrets/config will be located.

If we did 2 then the operator would need to create tigera-operator namsepace if it didn't exist. And I think if that was the case then we could move the 'active-operator' ConfigMap there.
Always having the operator secrets/config (and the 'active-operator' CM) in the tigera-operator namespace is appealing to me.
Are there downsides to 2? The only thing I can think of is that then the tigera-operator namespace always must exist but I'm not sure how that is different from always having calico-system.

1 is appealing to me since I already have that done and then we could always expand on that to 3 if we didn't want the extra step of copying the secrets/config.

@caseydavenport
Copy link
Member

maybe part of the process for changing the active operator includes copying over the secrets to the new active namespace.

I think I'm happy with this for now.

I don't see any immediate downsides to (2) except that it's maybe a bit weird? Not sure if some day we might want the ability to have separate secrets for the different operators.

So, I think my pref. is (1), which means no code changes here which is doubly good.

@tmjd
Copy link
Member Author

tmjd commented Sep 23, 2021

Even though you're pref is 1 I just thought I'd mention a downside for 2 that I thought of. 2 would be a breaking change for anyone already running the operator in a different namespace, or at least it would require special steps on upgrade.

1 it is.

@tmjd
Copy link
Member Author

tmjd commented Sep 24, 2021

I verified in a cluster that if I deployed a 2nd operator in a new namespace tigera-operator-alt (along with a new SA and binding) that if I copied over the typha and node cert secrets and the typha-ca configmap from the tigera-operator namespace, I was able to switch the active-operator config map to the tigera-operator-alt namespace and there were no typha or node restarts. The alt operator took over and the tigera-operator operator restarted and began waiting to become active.
Switching the active operator back worked as expected too.

@tmjd tmjd merged commit 7af72ae into tigera:master Oct 1, 2021
@tmjd tmjd deleted the actv-oprtr branch October 1, 2021 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants