fix for kubeapps-apis CrashLoopBackoff #4329 and [fluxv2] non-FQDN chart url fails on chart view #4381 #4382
Conversation
Thanks for the fix for the CrashLoopBackOff! See the inline thoughts.
if !u.IsAbs() {
	path := u.Path
	u, err = url.Parse(chart.Repo.URL)
	if err != nil {
		return fmt.Errorf("invalid URL format for chart repo [%s]: %v", chart.ID, err)
	}
	u.Path = u.Path + path
}
It's not clear to me if this is doing what you intend? In particular, we shouldn't normally just join a URL path via concatenation like that.
Actually, checking the link from the issue you've referenced, it looks like you've unintentionally added the change which broke helm (listed in the description of helm/helm#3065), rather than the fix from 3065 itself? Is that possible?
Well, I am looking at the latest helm code
https://github.com/cjauvin/helm/blob/master/pkg/downloader/chart_downloader.go#L219
It is exactly as I have it
Well, I am looking at the latest helm code https://github.com/cjauvin/helm/blob/master/pkg/downloader/chart_downloader.go#L219 It is exactly as I have it
Gah, never mind. The latest helm code does indeed look very different
https://github.com/helm/helm/blob/65d8e72504652e624948f74acbba71c51ac2e342/pkg/downloader/chart_downloader.go#L303
Let me fix this accordingly. BTW, the fix I made based on the earlier version was fine as well; it only didn't handle weird cases like URLs with trailing slashes or query params, which is what helm #3065 was about, not what our #4381 was about. Nevertheless, I will make it consistent with the latest helm code.
Yep, I saw. Just didn't make sense to include the code that broke helm/helm#3065 rather than the code that fixed it. And I imagine the slash issue would hit us at some point if it affected helm users. Thanks.
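For reference, below is a minimal sketch of that style of URL resolution: a relative chart reference is joined onto the repo's base URL with url.ResolveReference rather than by string concatenation, so trailing slashes and query parameters survive. The helper name resolveChartURL and the example URLs are hypothetical; this is not the exact helm or kubeapps code.

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveChartURL joins a possibly-relative chart reference onto the repo's
// base URL. It uses url.ResolveReference instead of string concatenation so
// that trailing slashes and query parameters are handled correctly.
func resolveChartURL(repoURL, chartRef string) (string, error) {
	ref, err := url.Parse(chartRef)
	if err != nil {
		return "", fmt.Errorf("invalid chart URL format [%s]: %v", chartRef, err)
	}
	if ref.IsAbs() {
		// Already a fully-qualified URL, nothing to resolve.
		return ref.String(), nil
	}
	base, err := url.Parse(repoURL)
	if err != nil {
		return "", fmt.Errorf("invalid URL format for chart repo [%s]: %v", repoURL, err)
	}
	// Keep the repo's query parameters (e.g. auth tokens) on the final URL.
	q := base.Query()
	// ResolveReference needs exactly one trailing slash on the base path.
	base.Path = strings.TrimSuffix(base.Path, "/") + "/"
	resolved := base.ResolveReference(ref)
	resolved.RawQuery = q.Encode()
	return resolved.String(), nil
}

func main() {
	u, _ := resolveChartURL("https://charts.example.com/stable", "charts/podinfo-6.0.0.tgz")
	fmt.Println(u) // https://charts.example.com/stable/charts/podinfo-6.0.0.tgz
}

Resolved this way, a non-FQDN chart URL is interpreted relative to the repo URL instead of being treated as a broken absolute URL.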
"[%s]: Initial resync failed after [%d] retries were exhausted, last error: %v", | ||
c.queue.Name(), maxWatcherCacheRetries, err) | ||
// yes, I really want this to panic. Something is seriously wrong and | ||
// possibly restarting kubeapps-apis server is needed... |
Is it possible that we'll panic here simply because Redis isn't yet ready? (That's what was happening when I created this issue: Redis seems to take a while to come up on my local cluster, which was causing kubeapps-apis to fail to be ready prior to this change.) With this change, I assume kubeapps-apis will become ready faster, but this code will still cause the restart if Redis isn't ready in time? Not sure how we could handle it... perhaps explicitly waiting for Redis to become ready initially, in this go-routine, but before we do the resync with panic?
Well, I could increase maxWatcherCacheResyncBackoff from 2 to, say, 8. That way, we will wait a LONG time before giving up (2^8 = 256 seconds). Other than that, I would say let's cross that bridge if/when we come to it.
I will point out that existing code
https://github.com/kubeapps/kubeapps/blob/954f8d62add7f03ae440922e3825be49a29bf4db/cmd/kubeapps-apis/plugins/fluxv2/packages/v1alpha1/common/utils.go#L142
already does a "PING" to redis before any of the resync() code even takes place. And if the ping fails, the whole thing fails, again, before resync(). We may or may not want to change that behavior, e.g. by introducing retries w/ exponential back off in that routine (NewRedisClientFromEnv) or maybe some kind of loop to wait for redis to come up. I'd like to see some evidence first that things are broken (log files will suffice).
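For context, the pattern being discussed is roughly "retry the initial resync with exponential backoff, then panic once the retries are exhausted". The sketch below is only illustrative: the constant names come from this PR, but their values and the surrounding wiring are assumptions, not the actual kubeapps-apis cache code.

package main

import (
	"fmt"
	"log"
	"time"
)

const (
	maxWatcherCacheRetries       = 5 // hypothetical value, for illustration only
	maxWatcherCacheResyncBackoff = 2 // exponent cap: waits grow up to 2^2 seconds
)

// resyncWithRetries retries fn with exponential backoff and panics when the
// retries are exhausted, mirroring the "I really want this to panic" comment
// in the diff above. The exact behavior in kubeapps-apis may differ.
func resyncWithRetries(name string, fn func() error) {
	var err error
	for i := 0; i < maxWatcherCacheRetries; i++ {
		if err = fn(); err == nil {
			return
		}
		exp := i
		if exp > maxWatcherCacheResyncBackoff {
			exp = maxWatcherCacheResyncBackoff
		}
		wait := time.Duration(1<<exp) * time.Second
		log.Printf("[%s]: resync attempt %d failed (%v), retrying in %s", name, i+1, err, wait)
		time.Sleep(wait)
	}
	panic(fmt.Sprintf(
		"[%s]: Initial resync failed after [%d] retries were exhausted, last error: %v",
		name, maxWatcherCacheRetries, err))
}

func main() {
	attempts := 0
	resyncWithRetries("demo-queue", func() error {
		attempts++
		if attempts < 3 {
			return fmt.Errorf("not ready yet")
		}
		return nil
	})
	fmt.Println("resync succeeded after", attempts, "attempts")
}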
Ah great. I missed the ping.
We may or may not want to change that behavior, e.g. by introducing retries w/ exponential back off in that routine (NewRedisClientFromEnv) or maybe some kind of loop to wait for redis to come up. I'd like to see some evidence first that things are broken (log files will suffice).
Yep, I'd be keen (eventually, if/when you agree) to do a loop waiting for redis to come up with some timeout, so it only affects the initial load. As for evidence, what I see whenever I enable flux is first this:
k -n kubeapps get po
NAME READY STATUS RESTARTS AGE
kubeapps-c49cb4c5d-vfxmf 2/2 Running 0 20s
kubeapps-internal-dashboard-5b77d89ff-hxf4g 1/1 Running 0 20s
kubeapps-internal-kubeappsapis-5465c7b6d4-mzmjh 1/1 Running 0 2d23h
kubeapps-internal-kubeappsapis-c8c597f5f-jw5rs 0/1 CrashLoopBackOff 1 20s
kubeapps-internal-kubeops-7646db7767-xd6tx 1/1 Running 0 20s
kubeapps-redis-master-0 0/1 Running 0 20s
kubeapps-redis-replicas-0 0/1 Running 0 20s
So here Redis is still coming up and kubeapps-apis has already restarted once. The logs show:
k -n kubeapps logs kubeapps-internal-kubeappsapis-c8c597f5f-jw5rs
I0307 05:13:06.075910 1 root.go:37] kubeapps-apis has been configured with: core.ServeOptions{Port:50051, PluginDirs:[]string{"/plugins/fluxv2", "/plugins/resources"}, ClustersConfigPath:"/config/clusters.conf", PluginConfigPath:"/config/kubeapps-apis/plugins.conf", PinnipedProxyURL:"http://kubeapps-internal-pinniped-proxy.kubeapps:3333", GlobalReposNamespace:"kubeapps", UnsafeLocalDevKubeconfig:false, QPS:50, Burst:100}
I0307 05:13:06.671936 1 main.go:22] +fluxv2 RegisterWithGRPCServer
I0307 05:13:06.671984 1 server.go:60] +fluxv2 NewServer(kubeappsCluster: [default], pluginConfigPath: [/config/kubeapps-apis/plugins.conf]
Error: failed to initialize plugins server: failed to register plugins: plug-in "name:\"fluxv2.packages\" version:\"v1alpha1\"" failed to register due to: dial tcp 10.96.16.162:6379: connect: connection refused
...
Once redis is ready, it settles, but is always left with a number of restarts (3 for me), which makes it look like there's been some unexpected problem:
k -n kubeapps get po
NAME READY STATUS RESTARTS AGE
kubeapps-c49cb4c5d-vfxmf 2/2 Running 0 2m23s
kubeapps-internal-dashboard-5b77d89ff-hxf4g 1/1 Running 0 2m23s
kubeapps-internal-kubeappsapis-c8c597f5f-jw5rs 1/1 Running 3 2m23s
kubeapps-internal-kubeops-7646db7767-xd6tx 1/1 Running 0 2m23s
kubeapps-redis-master-0 1/1 Running 0 2m23s
kubeapps-redis-replicas-0 1/1 Running 0 2m23s
Wow, that looks pretty convincing. Thanks. Wonder why I have not come across this myself with all the testing I've done?
I will point out that if this is the case, then the resync() issue was a red herring. Meaning, this was what was causing the CrashLoopBackOff, not the indexing of the bitnami repo on startup.
Anyway, I will introduce a similar loop with exponential back off into NewRedisClientFromEnv to work around this.
Wow, that looks pretty convincing. Thanks. Wonder why I have not come across this myself with all the testing I've done?
Maybe you don't start without Redis each time (so it's already running)? In my case I'm switching between flux, carvel and helm support (for demoing).
I will point out that if this is the case, then the resync() issue was a red herring. Meaning, this was what was causing the CrashLoopBackOff, not the indexing of the bitnami repo on startup.
Two separate issues, I think. Those 3 restarts are due to Redis not being ready, but then once Redis is ready, the plugin would take 30s to sync before the plugin itself completed its registration. And because the server itself doesn't start serving until all plugins have been registered successfully, the readiness check would continue to fail during that time, which meant more restarts (I saw some runs where I had 6 or 7 restarts, but not every time).
Anyway, I will introduce a similar loop with exponential back off into NewRedisClientFromEnv to work around this.
Great, thanks, though in this case we may not want exponential backoff, given that we want to be able to start with minimal delay - maybe just keep pinging every second or two or something. See what you think.
No problem. I will try pinging a total of, say, 10 times, once every second, and then give up?
I changed chart_cache.go to correspond to what the latest is in helm.
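For illustration, here is a minimal sketch of that kind of bounded ping loop, using the go-redis v8 client as an example. The function name waitForRedis, the Redis address, and how this would be wired into NewRedisClientFromEnv are assumptions, not the actual kubeapps code.

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

// waitForRedis pings the server up to maxAttempts times, once per interval,
// and gives up with an error if it never answers. Waiting here keeps the
// initial startup from crashing just because Redis is still coming up.
func waitForRedis(ctx context.Context, client *redis.Client, maxAttempts int, interval time.Duration) error {
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		if lastErr = client.Ping(ctx).Err(); lastErr == nil {
			return nil
		}
		if i < maxAttempts-1 {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("redis not ready after %d attempts: %v", maxAttempts, lastErr)
}

func main() {
	// Example address only; the real address comes from the environment.
	client := redis.NewClient(&redis.Options{Addr: "kubeapps-redis-master.kubeapps:6379"})
	if err := waitForRedis(context.Background(), client, 10, time.Second); err != nil {
		panic(err)
	}
	fmt.Println("redis is ready")
}

Whether 10 one-second attempts is enough depends on how long Redis actually takes to come up in the pod listings above, so the numbers are worth tuning.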
Great, thanks Greg.
@@ -110,6 +110,7 @@ require (
	github.com/beorn7/perks v1.0.1 // indirect
	github.com/cenkalti/backoff/v4 v4.1.2 // indirect
	github.com/cespare/xxhash/v2 v2.1.2 // indirect
	github.com/chai2010/gettext-go v0.0.0-20160711120539-c6fed771bfd5 // indirect
Hmm, did I miss where you started using these? (possibly in your test code - though you said that was just moving things around).
This was a result of having to import "k8s.io/kubectl/pkg/cmd/cp" for my new integration test, which tests the auto-update feature of flux. I need to be able to copy files into a running pod.
see #4329 for discussion
I had some extra time and also fixed [fluxv2] non-FQDN chart url fails on chart view #4381 in this PR