
Only listen to configured k8s namespaces. #1895

Merged

Conversation

timoreimann
Contributor

Watch configured namespaces only.

We previously watched all namespaces, which was causing issues for users that want to apply restrictive RBAC rules. Hence, we change the logic to hook up one informer per namespace and kind we observe.

Note that this moves the namespace filtering into the client/informers (where it properly belongs). A nice side effect of this is that we stop receiving tons of events from unrelated namespaces (most notably kube-system), which has been a source of confusion for users in the past. We also don't need the extra event notifications as "retry signals" in case of ephemeral API failures anymore since we can now safely rely on a fairly low sync period for recovery purposes. This is preferable to getting hammered with event updates at sub-second intervals, only to have Traefik's back-off sleep of 2 seconds kicking in very often.
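For illustration only, here is a minimal sketch of the per-namespace wiring described above, using a namespace-scoped ListWatch from client-go (the function name, parameters, and the use of the networking/v1 Ingress API are assumptions, not Traefik's actual code):

```go
package k8ssketch

import (
	"time"

	networkingv1 "k8s.io/api/networking/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newIngressInformers is illustrative only: it creates one Ingress informer
// per configured namespace. Because each ListWatch is scoped to a single
// namespace, restrictive RBAC rules only need to grant list/watch access to
// those namespaces, and events from unrelated namespaces (e.g. kube-system)
// never reach us.
func newIngressInformers(cs *kubernetes.Clientset, namespaces []string, resync time.Duration, eventCh chan<- interface{}) []cache.SharedInformer {
	var informers []cache.SharedInformer
	for _, ns := range namespaces {
		lw := cache.NewListWatchFromClient(cs.NetworkingV1().RESTClient(), "ingresses", ns, fields.Everything())
		inf := cache.NewSharedInformer(lw, &networkingv1.Ingress{}, resync)
		inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc:    func(obj interface{}) { eventCh <- obj },
			UpdateFunc: func(_, obj interface{}) { eventCh <- obj },
			DeleteFunc: func(obj interface{}) { eventCh <- obj },
		})
		informers = append(informers, inf)
	}
	return informers
}
```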

A few additional improvements that made it in:

  • Wait for the store synchronization to happen during WatchAll instead of when we process events. The wait needs to happen exactly once.
  • Do not close the event channel until all controllers have stopped processing. This was a pending race condition that could lead to a crash.
  • Remove one superfluous event/watch channel (name changes depending on which method you're looking at).
  • Remove the namespace-related unit tests. (Must be replaced by to-be-written integration tests.)
  • Group methods logically.
  • Improve GoDocs.

Finally, we stop logging complete events as we receive them to reduce noise and stop leaking sensitive secrets data into the logs.
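As an aside, a hedged sketch of what "log the object key instead of the whole object" can look like with client-go (the helper name is hypothetical, not Traefik's code):

```go
package k8ssketch

import (
	"log"

	"k8s.io/client-go/tools/cache"
)

// logEventKey is illustrative: instead of dumping the full object (which for
// Secrets would leak sensitive data into the logs), log only its
// namespace/name key.
func logEventKey(obj interface{}) {
	key, err := cache.MetaNamespaceKeyFunc(obj)
	if err != nil {
		log.Printf("received event for object of type %T", obj)
		return
	}
	log.Printf("received event for %q", key)
}
```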

Fixes #1626.

@timoreimann
Contributor Author

timoreimann commented Jul 25, 2017

This is still WIP, mostly because a proper integration test is still missing (#1889). Nevertheless, I'd be happy to hear some early feedback (from everyone in general and in particular from @containous/kubernetes).

@ldez ldez requested a review from a team July 25, 2017 18:46
@timoreimann
Contributor Author

It looks like getting Kubernetes integration tests up and running will still require a bit of effort. Nevertheless, merging the PR is advisable IMO as Traefik can be operated in a much more secure way on Kubernetes with namespace-specific watches. Plus, we stop spamming the logs with tons of unrelated events.

I'll remove the WIP label for now and suggest we test this manually like we did with the previous changes.

@timoreimann
Contributor Author

Note to myself: The PR currently sets the default namespace by default. Depending on how #1961 resolves, I may need to change that to all namespaces.

@ldez ldez added the size/L label Aug 16, 2017
@emilevauge
Member

Wow @timoreimann!
Design LGTM
As you completely refactored this provider, I would strongly recommend running some benchmarks and profiling to ensure no memory leaks have been introduced :)

@errm
Contributor

errm commented Aug 18, 2017

@timoreimann this is on my radar and I will have a thorough review done by early next week.

RE the all-namespaces thing: my setup currently depends on this, so we should certainly keep it as the default IMO.

@timoreimann timoreimann force-pushed the k8s-only-listen-to-configured-namespaces branch from 24172a9 to 98e6ba8 on August 18, 2017 23:34
@timoreimann
Contributor Author

@errm I have pushed a commit that makes NamespaceAll the default (which should be in line with the now-merged #1961).
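For reference, a minimal sketch of such a fallback, assuming a hypothetical helper name; metav1.NamespaceAll is client-go's constant for the empty string, which selects all namespaces:

```go
package k8ssketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// effectiveNamespaces is a hypothetical helper: when no namespaces are
// configured, fall back to watching all of them via metav1.NamespaceAll
// (the empty string).
func effectiveNamespaces(configured []string) []string {
	if len(configured) == 0 {
		return []string{metav1.NamespaceAll}
	}
	return configured
}
```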

Contributor

@errm left a comment

Looks very nice and clean to me...

Subject to @emilevauge's concerns about benchmarking / soak testing, LGTM

@timoreimann
Contributor Author

Yeah I'll definitely try to work out some kind of benchmark.

Probably not going to happen before the 1.4 freeze, but not critical either.

Contributor

@dtomcej left a comment

This looks super clean and boss.

LGTM
:shipit:

@ldez ldez added this to the 1.5 milestone Aug 25, 2017
@@ -19,31 +20,55 @@ import (
	"k8s.io/client-go/tools/cache"
)

-const resyncPeriod = time.Minute * 5
+const resyncPeriod = 10 * time.Second
@aledbf

@timoreimann this is too low; it should be more like every 10 minutes. Just so you know: with this you are forcing a full sync (from the API server) every 10 seconds.

Contributor Author

AFAIU, there will be syncs only for the restricted items each informer is watching. Or am I missing something?

I'm okay with increasing the period in general. Syncing every 10 minutes though sounds like we'd not be catching up on missed Ingresses, Services, and Endpoints for a fairly long time. I wonder if that would still be acceptable?

@aledbf Aug 25, 2017

full resync != delta updates (like Ingresses, Services, and Endpoints)

Edit: please check the gce ingress controller
https://github.com/kubernetes/kubernetes/blob/master/cluster/saltbase/salt/l7-gcp/glbc.manifest#L47

Contributor Author

AFAIK, the glbc watches all namespaces. With this PR, Traefik will be watching selective namespaces only. What's being pulled in by a full sync should be subject to the list part of the ListWatch interface. So to my understanding, the sync volume should be much more restricted.

Please let me know if my assumptions are wrong somehow.

@aledbf

So to my understanding, the sync volume should be much more restricted.

Yes, that's correct, but you are assuming the namespace only contains a few objects (Ingresses, Services, Secrets, etc.). What happens if the user has hundreds of objects in the namespace(s)?

Contributor Author

I think it's a trade-off between not overloading the cluster and being able to compensate for ephemeral errors in time. Maybe there's no one-size-fits-all, and it'd be best to make the parameter configurable.

I agree though that we can probably bump the interval somewhat for now. Will go with a minute or so.

Contributor Author

@aledbf you might be interested in this thread I started on the kubernetes-sig-api-machinery mailing list. Some takeaways for me:

  • The resync period only causes the local cache content to get replayed; in particular, it does not hit the API server (that's what an explicit relist is for).
  • First and foremost, the resync period is a means to compensate for controller (not watch) bugs, though re-queuing techniques available in more recent informer versions take away a lot of that rationale. That said, resyncs may also be used for other cases. (Daniel describes how you can use them as a clock to initiate polls for external APIs periodically.)
  • It's a viable option to disable resyncs completely in order to surface controller bugs more quickly instead of risking hiding them. One example component that does this is the Kubernetes scheduler.

Hope this helps.
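To make the trade-off concrete, here is a small sketch of where the resync period plugs into a client-go shared informer (the helper name and parameters are illustrative, not Traefik's API):

```go
package k8ssketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// newServiceInformer shows where the resync period plugs in. Per the
// mailing-list takeaways above, the period only controls how often the
// informer replays its local cache to the event handlers; it does not
// trigger a relist against the API server. A value of 0 disables
// periodic resyncs entirely.
func newServiceInformer(lw cache.ListerWatcher, resync time.Duration) cache.SharedInformer {
	// e.g. resync = 0           -> rely purely on watch events
	//      resync = time.Minute -> handlers periodically re-see cached objects
	return cache.NewSharedInformer(lw, &corev1.Service{}, resync)
}
```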

The lookup key is now dependent on whether we use NamespaceAll or more specific namespaces.

Shared informers are the standard informer types in more modern releases of client-go, so we can only benefit by using them now as well (even though we don't share any caches and presumably won't do so in the near future).

Additionally, we replace the individual Watch* methods by a single WatchObjects method.
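Purely as an illustration of the "wait for the store synchronization exactly once during WatchAll" point, a minimal sketch using client-go's WaitForCacheSync (names are hypothetical, not the actual implementation):

```go
package k8ssketch

import "k8s.io/client-go/tools/cache"

// startAndSyncOnce is a hypothetical helper: start all informers, then
// block a single time until every local store has synced, instead of
// checking synchronization while processing each event.
func startAndSyncOnce(informers []cache.SharedInformer, stopCh <-chan struct{}) bool {
	var synced []cache.InformerSynced
	for _, inf := range informers {
		go inf.Run(stopCh)
		synced = append(synced, inf.HasSynced)
	}
	return cache.WaitForCacheSync(stopCh, synced...)
}
```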
@timoreimann timoreimann force-pushed the k8s-only-listen-to-configured-namespaces branch from 6c85c79 to d0ac8f5 on October 4, 2017 23:14
@timoreimann
Contributor Author

timoreimann commented Oct 5, 2017

I finally got around to running benchmarks. This is what I did:

  1. Submitted 30 deployments into 3 namespaces with 10 deployments per namespace.
  2. Added Ingresses and Services for each Deployment.
  3. Started Traefik with --debug.
  4. Fired off a script that continuously rescaled all Deployments to a replica count between 1 and 3.
  5. Observed the memory consumption and goroutine count via expvarmon (a minimal expvar sketch follows right after this list).
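For context on step 5, a hedged sketch of the counters expvarmon reads: importing expvar exposes /debug/vars with memstats out of the box, while a goroutine gauge has to be published explicitly. This is generic Go, not Traefik's code:

```go
package main

import (
	"expvar"
	"log"
	"net/http"
	"runtime"
)

func main() {
	// Importing expvar registers /debug/vars on http.DefaultServeMux and
	// publishes runtime.MemStats automatically; goroutine counts are not
	// included by default, so publish them explicitly.
	expvar.Publish("goroutines", expvar.Func(func() interface{} {
		return runtime.NumGoroutine()
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```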

Here's what I observed:

  • The number of goroutines varied in a low range between ~110 and ~120. This is higher than the pre-PR count where I could measure ~45. However, this makes total sense given that we now run 4 informers per namespace versus 4 informers overall. The fact that watching a single namespace with the new implementation consumes ~45 goroutines as well matches up nicely with the expected math.
  • The memory consumption varied between 5 and 10 MB. It would rise and fall even during my load tests, which shows that GC manages to reclaim memory successfully.
  • After stopping the rescaling script, memory consumption dropped as well after a short while. This shows we're not leaking memory.

I also verified that the number of backend servers matches the pod count.

Overall, this looks good to me. We surely want to gain more real world experience to solidify the new implementation. I do think though that the tests demonstrate basic correctness.

Happy to run more tests if desired.

@timoreimann
Contributor Author

@dtomcej @errm since you first reviewed, I pushed a few more commits (mostly just refactorings). Are you still okay with the status quo of this PR?

Asking bot not to merge until I get another approval confirmation.

@timoreimann timoreimann added the WIP label Oct 7, 2017
@timoreimann
Contributor Author

timoreimann commented Oct 7, 2017

Briefly going back to WIP as I realized RBAC documentation updates are still missing...

@timoreimann timoreimann removed the WIP label Oct 7, 2017
@timoreimann
Contributor Author

timoreimann commented Oct 7, 2017

...aaaaand it's done.

@timoreimann
Contributor Author

timoreimann commented Oct 8, 2017

To all involved reviewers: Please indicate by Monday midnight CEST (i.e., in a good 22 hours from now) if your existing review should no longer hold; I'm happy to address any further comments.

Otherwise, I'll proceed with merging the PR.

@emilevauge
Member

emilevauge commented Oct 9, 2017

@timoreimann I think this is a good example that WIP can be confusing sometimes :) Do we need to dismiss all reviews done before you removed the WIP label?

@timoreimann
Contributor Author

@emilevauge strictly speaking, I guess every new commit should invalidate all existing LGTMs.

Practically speaking, I'm not sure if we want to be that strict, especially since it'd take another three approvals.

For the PR at hand, the changes that happened after the LGTMs were received are rather simple / insignificant. So personally I feel confident moving forward even without another round of eyeballs.

Either way, we may want to consider tuning the process to add clarity.

@emilevauge
Member

@timoreimann

strictly speaking, I guess every new commit should invalidate all existing LGTMs.

Yes, but here I'm talking about the fact that this PR was already reviewed by 3 maintainers while still in WIP. Having commits follow a review is normal and easy to follow in GitHub (especially if the contributor produces well-named commits). In this case, I'm not able to say if it's safe to merge or not.

Just saying we should really be cautious with WIP PRs in the future ;)

Development

Successfully merging this pull request may close these issues.

Kubernetes: Problems with RBAC & multiple Namespaces
8 participants