
Introduce k8s informer factory #2867

Merged

Conversation

@yue9944882 (Contributor)

What does this PR do?

Use k8s.io/client-go/informers to manage informers and their indexers/stores.

Motivation

Currently we manage the stores of informers explicitly, which may be unnecessary after bumping client-go. The informer factory will help sort out and clarify the implementation of the Kubernetes client.

This PR will simplify the Kubernetes client a lot.
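
For readers unfamiliar with the pattern, here is a minimal sketch of how such a factory is typically wired up with client-go (the function name, resync period, and namespace scoping below are illustrative, not this PR's exact code):

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
)

// newFactory builds a SharedInformerFactory scoped to one namespace. The
// factory owns every informer it hands out, along with the informer's
// store and indexer, so the client no longer tracks stores by hand.
func newFactory(clientset kubernetes.Interface, ns string, stopCh <-chan struct{}) informers.SharedInformerFactory {
    factory := informers.NewFilteredSharedInformerFactory(clientset, 10*time.Minute, ns, nil)

    // Requesting an informer registers it with the factory.
    factory.Core().V1().Services().Informer()
    factory.Extensions().V1beta1().Ingresses().Informer()

    // Start is non-blocking: one goroutine per registered informer.
    factory.Start(stopCh)
    return factory
}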

More

  • Added/updated tests
  • Added/updated documentation

Additional Notes

@ldez (Member) commented Feb 16, 2018

@yue9944882 like last time, could you organize your commits to facilitate the review (please respect this order):

  1. one commit for Gopkg.*
  2. one commit for Træfik code.
  3. one commit for the vendor.

Could you also use make dep-ensure?

Thanks.

@ldez changed the title from "Refactor kubernetes client: introduce informer factory" to "Introduce k8s informer factory" on Feb 16, 2018
@yue9944882 force-pushed the refactor-kubernetes-client-informer-mgmt branch from ba86d19 to 1c602b4 on February 16, 2018 08:41
@ldez (Member) commented Feb 16, 2018

GitHub orders by commit date: as you altered your commits (rebase -i), the commit dates have not changed.

@yue9944882 force-pushed the refactor-kubernetes-client-informer-mgmt branch from 1c602b4 to 8be009e on February 16, 2018 09:09
    return result
}

// GetService returns the named service from the given namespace.
func (c *clientImpl) GetService(namespace, name string) (*corev1.Service, bool, error) {
    var service *corev1.Service
-    item, exists, err := c.svcStores[c.lookupNamespace(namespace)].GetByKey(namespace + "/" + name)
+    item, exists, err := c.factories[namespace].Core().V1().Services().Informer().GetStore().GetByKey(namespace + "/" + name)
Contributor Author:

Ideally this line could be svc, err := c.factories[namespace].Core().V1().Services().Lister().Services(namespace).Get(name), which uses the Lister to get resources from the indexer inside the informer. I tried it, but it would require changing the return type of GetService to (namespace, name string) (*corev1.Service, error) in the Client interface, and lots of test facilities would have to change too. Maybe it can be done in a later PR to keep this one clearer.
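
A sketch of what that Lister-based variant could look like (hypothetical, since it implies the interface change described above):

// GetService via the informer's Lister instead of a raw store key. Note
// the two-value return: a missing object surfaces as a NotFound error
// from the Lister rather than as an exists boolean.
func (c *clientImpl) GetService(namespace, name string) (*corev1.Service, error) {
    return c.factories[namespace].Core().V1().Services().Lister().Services(namespace).Get(name)
}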

Contributor Author:

Note that the two are actually equivalent, but eliminating the hand-built namespace + "/" + name key would help maintenance.

@yue9944882 (Contributor, author)

@ldez @timoreimann PTAL thx 😉

@ldez (Member) commented Feb 16, 2018

@yue9944882 please stop asking PTAL; we already get notifications of commits, comments, etc.

@yue9944882 (Contributor, author) commented Feb 16, 2018

@ldez Oops, sorry for the PTAL 🤭 I just meant that it is ready for review.

@timoreimann (Contributor) left a comment

The code panics if you don't specify a namespace to watch on:

./dist/traefik --defaultentrypoints=http --entryPoints='Name:http Address::8000' --web --web.readonly --web.address=":8080" --logLevel=INFO --checknewversion=false --kubernetes --kubernetes.endpoint=http://127.0.0.1:8001 --kubernetes.watch=true                                
WARN[2018-02-16T23:36:17+01:00] web provider configuration is deprecated, you should use these options : api, rest provider, ping and metrics 
INFO[2018-02-16T23:36:17+01:00] Traefik version dev built on I don't remember exactly 
INFO[2018-02-16T23:36:17+01:00] 
Stats collection is disabled.
Help us improve Traefik by turning this feature on :)
More details on: https://docs.traefik.io/basics/#collected-data
 
INFO[2018-02-16T23:36:17+01:00] Preparing server http &{Address::8000 TLS:<nil> Redirect:<nil> Auth:<nil> WhitelistSourceRange:[] Compress:false ProxyProtocol:<nil> ForwardedHeaders:0xc4205f1680} with readTimeout=0s writeTimeout=0s idleTimeout=3m0s 
INFO[2018-02-16T23:36:17+01:00] Preparing server traefik &{Address::8080 TLS:<nil> Redirect:<nil> Auth:<nil> WhitelistSourceRange:[] Compress:false ProxyProtocol:<nil> ForwardedHeaders:0xc4205f1880} with readTimeout=0s writeTimeout=0s idleTimeout=3m0s 
INFO[2018-02-16T23:36:17+01:00] Starting server on :8000                     
INFO[2018-02-16T23:36:17+01:00] Starting provider *kubernetes.Provider {"Watch":true,"Filename":"","Constraints":[],"Trace":false,"DebugLogGeneratedTemplate":false,"Endpoint":"http://127.0.0.1:8001","Token":"","CertAuthFilePath":"","DisablePassHostHeaders":false,"EnablePassTLSCert":false,"Namespaces":null,"LabelSelector":"","IngressClass":""} 
INFO[2018-02-16T23:36:17+01:00] Starting server on :8080                     
INFO[2018-02-16T23:36:17+01:00] Creating cluster-external Provider client with endpoint http://127.0.0.1:8001 
ERRO[2018-02-16T23:36:17+01:00] Error in Go routine: runtime error: invalid memory address or nil pointer dereference 
goroutine 66 [running]:
runtime/debug.Stack(0x1363b1f, 0xc42008e2d0, 0x2d5c1b3)
	/usr/local/Cellar/go/1.9.4/libexec/src/runtime/debug/stack.go:24 +0xa7
runtime/debug.PrintStack()
	/usr/local/Cellar/go/1.9.4/libexec/src/runtime/debug/stack.go:16 +0x22
github.com/containous/traefik/safe.defaultRecoverGoroutine(0x29ca8e0, 0x3fa68f0)
	/Users/treimann/Development/traefik/src/github.com/containous/traefik/safe/routine.go:148 +0x7d
github.com/containous/traefik/safe.OperationWithRecover.func1.1(0xc4207c9df8)
	/Users/treimann/Development/traefik/src/github.com/containous/traefik/safe/routine.go:156 +0x56
panic(0x29ca8e0, 0x3fa68f0)
	/usr/local/Cellar/go/1.9.4/libexec/src/runtime/panic.go:491 +0x283
github.com/containous/traefik/provider/kubernetes.(*clientImpl).GetService(0xc42051f400, 0xc4206cd3f0, 0xa, 0xc420308e00, 0x1f, 0x50, 0x0, 0x0, 0xc420681c18)
	/Users/treimann/Development/traefik/src/github.com/containous/traefik/provider/kubernetes/client.go:183 +0x5f
github.com/containous/traefik/provider/kubernetes.(*Provider).loadIngresses(0xc42039dad0, 0x3d73ea0, 0xc42051f400, 0x1, 0x1, 0x0)
	/Users/treimann/Development/traefik/src/github.com/containous/traefik/provider/kubernetes/kubernetes.go:235 +0x102f
github.com/containous/traefik/provider/kubernetes.(*Provider).Provide.func1.1(0x0, 0x0)
	/Users/treimann/Development/traefik/src/github.com/containous/traefik/provider/kubernetes/kubernetes.go:115 +0x50c
github.com/containous/traefik/safe.OperationWithRecover.func1(0x0, 0x0)
	/Users/treimann/Development/traefik/src/github.com/containous/traefik/safe/routine.go:160 +0x6c
github.com/containous/traefik/vendor/github.com/cenk/backoff.RetryNotify(0xc42051f410, 0x3d55860, 0xc42051f420, 0x2df4fa0, 0xc420066080, 0xc4201556c0)
	/Users/treimann/Development/traefik/src/github.com/containous/traefik/vendor/github.com/cenk/backoff/retry.go:37 +0x88
github.com/containous/traefik/provider/kubernetes.(*Provider).Provide.func1(0xc420292150)
	/Users/treimann/Development/traefik/src/github.com/containous/traefik/provider/kubernetes/kubernetes.go:136 +0x162
github.com/containous/traefik/safe.(*Pool).Go.func1()
	/Users/treimann/Development/traefik/src/github.com/containous/traefik/safe/routine.go:78 +0x3a
github.com/containous/traefik/safe.GoWithRecover.func1(0x2df5140, 0xc420559230)
	/Users/treimann/Development/traefik/src/github.com/containous/traefik/safe/routine.go:142 +0x4d
created by github.com/containous/traefik/safe.GoWithRecover
	/Users/treimann/Development/traefik/src/github.com/containous/traefik/safe/routine.go:136 +0x49
ERRO[2018-02-16T23:36:17+01:00] Provider connection error: Panic in operation: %!s(<nil>); retrying in 726.896979ms

It fails on client.go:183. I suppose it's due to the removal of the lookupNamespace method, causing the lookup at c.factories[namespace] to yield nil.

(Sorry, we are still missing proper integration tests for the Kubernetes provider.)
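
For context, the removed helper presumably looked something like the sketch below (a reconstruction, not the verbatim original; the isNamespaceAll field is assumed). When Traefik watches all namespaces, the single factory is registered under the all-namespaces key, so a concrete namespace has to be translated back before the map lookup:

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// lookupNamespace maps a resource's namespace to the key under which its
// factory was registered. metav1.NamespaceAll is the empty string.
func (c *clientImpl) lookupNamespace(ns string) string {
    if c.isNamespaceAll {
        return metav1.NamespaceAll
    }
    return ns
}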

@yue9944882 (Contributor, author) commented Feb 17, 2018

@timoreimann Good catch, no doubt this would have been a horrible BUG 😨


for t, ok := range factory.WaitForCacheSync(stopCh) {
    if !ok {
        return nil, fmt.Errorf("timed out waiting for controller caches to sync %s in namespace %s", t.String(), ns)
Contributor:

What's an example output of t.String()?

for t, ok := range factory.WaitForCacheSync(stopCh) {
    if !ok {
        return nil, fmt.Errorf("timed out waiting for controller caches to sync %s in namespace %s", t.String(), ns)
Contributor:

Let's use %q for printing ns as it may be the empty string when no namespaces are specified.
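
I.e., a sketch of the adjusted call:

return nil, fmt.Errorf("timed out waiting for controller caches to sync %s in namespace %q", t.String(), ns)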

// users having granted RBAC permissions for this object.
// https://github.com/containous/traefik/issues/1784 should improve the
// situation here in the future.
informManager.extend(c.WatchObjects(ns, kindSecrets, &corev1.Secret{}, c.secStores, eventCh), false)
Contributor:

We pass false as the last parameter so that we don't wait for the Secrets store to become synchronized for the described reason.

Am I missing something, or did we not carry over that behavior?

Contributor:

Was this addressed?

Contributor Author:

Yes, and we skip the synchronization of the Secrets store by delaying the invocation of AddEventHandler:

factory.Core().V1().Endpoints().Informer().AddEventHandler(eventHandler)
factory.Core().V1().Secrets().Informer().AddEventHandler(eventHandler)
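
A sketch of the ordering this describes (a fragment under stated assumptions; surrounding scaffolding, imports, and the eventHandler value are elided):

// Register and sync everything except Secrets first.
factory.Extensions().V1beta1().Ingresses().Informer().AddEventHandler(eventHandler)
factory.Core().V1().Services().Informer().AddEventHandler(eventHandler)
factory.Core().V1().Endpoints().Informer().AddEventHandler(eventHandler)
factory.Start(stopCh)
for t, ok := range factory.WaitForCacheSync(stopCh) {
    if !ok {
        return fmt.Errorf("timed out waiting for caches to sync %s in namespace %q", t.String(), ns)
    }
}

// Registered only after the barrier, so the Secrets cache sync is never
// awaited and a missing RBAC grant cannot block startup.
factory.Core().V1().Secrets().Informer().AddEventHandler(eventHandler)
factory.Start(stopCh) // safe to call again; starts only new informers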

for t, ok := range factory.WaitForCacheSync(stopCh) {
Contributor:

The previous implementation first started all informers and then waited for the caches to sync. The new implementation does it the other way around.

Are we certain this doesn't change the observable behavior?
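
For reference, the canonical client-go sequence starts the informers first and only then blocks on the barrier; WaitForCacheSync reports only on informers that have already been started, so reversing the two calls would make the wait a silent no-op. A minimal sketch:

factory.Start(stopCh) // non-blocking
for t, ok := range factory.WaitForCacheSync(stopCh) {
    if !ok {
        return fmt.Errorf("cache for %s failed to sync", t)
    }
}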

"listers/settings/v1alpha1",
"listers/storage/v1",
"listers/storage/v1alpha1",
"listers/storage/v1beta1",
Contributor:

Possibly more a question for @ldez: Is it normal that dep references packages which are not used? I thought dep plus our (custom) pruning causes us to only pull in what is really needed.

Contributor:

👌

safe.Go(func() {
    wg.Add(1)
-    informer.Run(stopCh)
+    c.factories[ns].Start(stopCh)
Contributor:

We need to insert ns := ns after line 141 in order to capture the variable. Otherwise, we'd always end up with the last namespace being referenced in the anonymous goroutine (see also https://github.com/golang/go/wiki/CommonMistakes#using-goroutines-on-loop-iterator-variables).
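
The pitfall in question, sketched (assuming the surrounding loop ranges over c.factories):

for ns := range c.factories {
    ns := ns // capture a per-iteration copy for the closure below
    safe.Go(func() {
        c.factories[ns].Start(stopCh)
    })
}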

Contributor Author:

I just realised that informerFactory.Start is an asynchronous call itself, so we don't have to invoke it via safe.Go(...).
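
Which also sidesteps the capture issue entirely, e.g.:

for _, factory := range c.factories {
    factory.Start(stopCh) // non-blocking; spawns a goroutine per informer
}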

// users having granted RBAC permissions for this object.
// https://github.com/containous/traefik/issues/1784 should improve the
// situation here in the future.
if t == reflect.TypeOf(&corev1.Secret{}) && !ok {
Contributor:

Nit-pick: please swap the expression terms for better comprehensibility (and, to a negligible degree, performance).
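
I.e., a sketch of the swapped condition, which short-circuits on the cheap boolean first:

if !ok && t == reflect.TypeOf(&corev1.Secret{}) {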

for ns, factory := range c.factories {
    ings, err := factory.Extensions().V1beta1().Ingresses().Lister().List(labels.Everything())
    if err != nil {
        log.Errorf("fail to list ingresses in namespace %s: %s", ns, err)
Contributor:

s/fail/Failed/

// users having granted RBAC permissions for this object.
// https://github.com/containous/traefik/issues/1784 should improve the
// situation here in the future.
if t == reflect.TypeOf(&corev1.Secret{}) && !ok {
Contributor:

In the previous implementation, we did not wait for the Secrets' cache to sync at all. The new implementation seems to wait for it but does not consider an error to be critical.

Assuming I got it right: Does this possibly mean that users without permitted RBAC access now see Secrets RBAC errors in their logs? I guess I'm okay with a single log occurrence, but multiple ones would be irritating IMHO.

"listers/settings/v1alpha1",
"listers/storage/v1",
"listers/storage/v1alpha1",
"listers/storage/v1beta1",
Contributor:

👌

@timoreimann (Contributor)

Apologies for the laggy response; I will try hard to get back to this PR and do some final testing within the next 2 days.

Sorry!

@yue9944882 (Contributor, author)

@timoreimann No worries. The refactoring in this PR is a big change, so we ought to spend more time proving its correctness. I'd also like to help enhance the tests for this PR :-)

@timoreimann (Contributor) left a comment

I finally got around to testing the PR against minikube. Everything seems to be working as intended. 🎉

Regarding the tests: IMHO, unit tests aren't very helpful for the changed files in question. What we need is a proper integration test. Fortunately, I started work in this regard some time ago. It's (still) not done yet, but I'm committed to moving it to completion. :-)

Thanks again for the PR!

@ldez added this to the 1.6 milestone on Mar 13, 2018
@mmatur (Member) left a comment

LGTM

@juliens (Member) left a comment

LGTM

@mmatur force-pushed the refactor-kubernetes-client-informer-mgmt branch from fba5f22 to aceff7b on March 22, 2018 08:58
@traefiker merged commit f94fa78 into traefik:master on Mar 22, 2018
@yue9944882 (Contributor, author)

@mmatur What does the need-human-merge label mean? Am I missing something? (facepalm)

@mmatur (Member) commented Mar 23, 2018

@yue9944882, you haven't missed anything. We have bots that help us manage PRs.

Explanation available here
