
fix for kubeapps-apis CrashLoopBackoff #4329 and [fluxv2] non-FQDN chart url fails on chart view #4381 #4382

Merged

Conversation

@gfichtenholt (Contributor) commented Mar 4, 2022

see #4329 for discussion
I had some extra time and also fixed [fluxv2] non-FQDN chart url fails on chart view #4381 in this PR

@gfichtenholt self-assigned this Mar 4, 2022
@gfichtenholt linked an issue Mar 4, 2022 that may be closed by this pull request
@gfichtenholt changed the title from "fix for kubeapps-apis CrashLoopBackoff #4329" to "fix for kubeapps-apis CrashLoopBackoff #4329 and [fluxv2] non-FQDN chart url fails on chart view #4381" Mar 5, 2022
@gfichtenholt linked an issue Mar 5, 2022 that may be closed by this pull request

@absoludity (Contributor) left a comment

Thanks for the fix for the crashloopbackoff! See the inline thoughts.

Comment on lines 152 to 159
if !u.IsAbs() {
path := u.Path
u, err = url.Parse(chart.Repo.URL)
if err != nil {
return fmt.Errorf("invalid URL format for chart repo [%s]: %v", chart.ID, err)
}
u.Path = u.Path + path
}

@absoludity (Contributor)

It's not clear to me if this is doing what you intend? In particular, we shouldn't normally just join a URL path via concatenation like that.

Actually, checking the link from the issue you've referenced, it looks like you've unintentionally added the change which broke helm (listed in the description of helm/helm#3065), rather than the fix from 3065 itself? Is that possible?

@gfichtenholt (Contributor, Author)

Well, I am looking at the latest helm code
https://github.com/cjauvin/helm/blob/master/pkg/downloader/chart_downloader.go#L219
It is exactly as I have it

@gfichtenholt (Contributor, Author) commented Mar 7, 2022

> Well, I am looking at the latest helm code https://github.com/cjauvin/helm/blob/master/pkg/downloader/chart_downloader.go#L219 It is exactly as I have it

Gah, never mind. The latest helm code does indeed look very different:
https://github.com/helm/helm/blob/65d8e72504652e624948f74acbba71c51ac2e342/pkg/downloader/chart_downloader.go#L303
Let me fix this accordingly. BTW, the fix I made based on the earlier version was fine as well; it just didn't handle edge cases like URLs with trailing slashes or query params, which is what helm #3065 was about, not what our #4381 was about. Nevertheless, I will make it consistent with the latest helm code.
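
For reference, a minimal sketch of the ResolveReference-based approach the current helm chart_downloader takes; this is an illustration, not the actual helm or kubeapps code, and the helper name and example URLs are made up:

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveChartURL resolves a possibly-relative chart URL against the repo URL,
// roughly following helm's ResolveChartVersion. Illustrative helper only.
func resolveChartURL(repoURL, chartURL string) (string, error) {
	u, err := url.Parse(chartURL)
	if err != nil {
		return "", fmt.Errorf("invalid chart URL [%s]: %v", chartURL, err)
	}
	if u.IsAbs() {
		return u.String(), nil
	}
	base, err := url.Parse(repoURL)
	if err != nil {
		return "", fmt.Errorf("invalid URL format for chart repo [%s]: %v", repoURL, err)
	}
	// keep any query params carried by the repo URL (e.g. auth tokens)
	q := base.Query()
	// ResolveReference needs a trailing slash on the base path, otherwise the
	// last path segment of the repo URL gets replaced instead of extended
	base.Path = strings.TrimSuffix(base.Path, "/") + "/"
	resolved := base.ResolveReference(u)
	resolved.RawQuery = q.Encode()
	return resolved.String(), nil
}

func main() {
	resolved, err := resolveChartURL("https://example.com/charts", "airflow-1.2.3.tgz")
	if err != nil {
		panic(err)
	}
	fmt.Println(resolved) // https://example.com/charts/airflow-1.2.3.tgz
}

Unlike plain string concatenation, this copes with trailing slashes and query params on the repo URL, which is roughly what helm #3065 was about.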

@absoludity (Contributor)

Yep, I saw. It just didn't make sense to include the code that caused helm/helm#3065 rather than the code that fixed it. And I imagine the slash issue would hit us at some point if it affected helm users. Thanks.

"[%s]: Initial resync failed after [%d] retries were exhausted, last error: %v",
c.queue.Name(), maxWatcherCacheRetries, err)
// yes, I really want this to panic. Something is seriously wrong and
// possibly restarting kubeapps-apis server is needed...

@absoludity (Contributor)

Is it possible that we'll panic here simply because Redis isn't yet ready? That's what was happening when I created this issue - Redis seems to take a while to come up on my local cluster, which was causing kubeapps-apis to fail to be ready prior to this change. With this change, I assume kubeapps-apis will become ready faster, but this code will still cause the restart if Redis isn't ready in time? Not sure how we could handle it... perhaps by explicitly waiting for Redis to become ready initially, in this go-routine, but before we do the resync that can panic?

@gfichtenholt (Contributor, Author) commented Mar 7, 2022

Well, I could increase maxWatcherCacheResyncBackoff from 2 to, say, 8. That way, we will wait a LONG time before giving up (2^8 = 256 seconds). Other than that, I would say let's cross that bridge if/when we come to it.
I will point out that the existing code
https://github.com/kubeapps/kubeapps/blob/954f8d62add7f03ae440922e3825be49a29bf4db/cmd/kubeapps-apis/plugins/fluxv2/packages/v1alpha1/common/utils.go#L142
already does a "PING" to Redis before any of the resync() code even takes place. And if the ping fails, the whole thing fails, again before resync(). We may or may not want to change that behavior, e.g. by introducing retries w/ exponential back off in that routine (NewRedisClientFromEnv) or maybe some kind of loop to wait for Redis to come up. I'd like to see some evidence first that things are broken (log files will suffice)
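
For context, a rough sketch of the kind of retry-then-panic loop being discussed around the reviewed snippet above; the constant values, function shape, and stub resync are illustrative, not the actual kubeapps-apis code:

package main

import (
	"fmt"
	"math"
	"time"
)

const (
	maxWatcherCacheRetries       = 5 // illustrative values, not the real ones
	maxWatcherCacheResyncBackoff = 2
)

// resyncWithRetry retries resync with exponential back-off (1s, 2s, 4s, ...),
// capped at 2^maxWatcherCacheResyncBackoff seconds per wait, and panics once
// the retries are exhausted, since at that point something is seriously wrong.
func resyncWithRetry(resync func() error) {
	var err error
	for attempt := 0; attempt < maxWatcherCacheRetries; attempt++ {
		if err = resync(); err == nil {
			return
		}
		backoff := math.Min(math.Pow(2, float64(attempt)), math.Pow(2, maxWatcherCacheResyncBackoff))
		time.Sleep(time.Duration(backoff) * time.Second)
	}
	panic(fmt.Errorf("initial resync failed after [%d] retries were exhausted, last error: %v",
		maxWatcherCacheRetries, err))
}

func main() {
	calls := 0
	resyncWithRetry(func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("cache not ready yet")
		}
		return nil
	})
	fmt.Println("resync succeeded after", calls, "attempts")
}

How the real constants interact is defined in the actual watcher cache code; the sketch is only meant to show where the "retries exhausted" panic from the reviewed snippet would sit.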

@absoludity (Contributor)

Ah great. I missed the ping.

> We may or may not want to change that behavior, e.g. by introducing retries w/ exponential back off in that routine (NewRedisClientFromEnv) or maybe some kind of loop to wait for Redis to come up. I'd like to see some evidence first that things are broken (log files will suffice)

Yep, I'd be keen (eventually, if/when you agree) to do a loop waiting for redis to come up with some timeout, so it only affects the initial load. As for evidence, what I see whenever I enable flux is first this:

k -n kubeapps get po
NAME                                              READY   STATUS             RESTARTS   AGE
kubeapps-c49cb4c5d-vfxmf                          2/2     Running            0          20s
kubeapps-internal-dashboard-5b77d89ff-hxf4g       1/1     Running            0          20s
kubeapps-internal-kubeappsapis-5465c7b6d4-mzmjh   1/1     Running            0          2d23h
kubeapps-internal-kubeappsapis-c8c597f5f-jw5rs    0/1     CrashLoopBackOff   1          20s
kubeapps-internal-kubeops-7646db7767-xd6tx        1/1     Running            0          20s
kubeapps-redis-master-0                           0/1     Running            0          20s
kubeapps-redis-replicas-0                         0/1     Running            0          20s

So here Redis is still coming up, and kubeapps-apis has already restarted once. The logs show:

k -n kubeapps logs kubeapps-internal-kubeappsapis-c8c597f5f-jw5rs
I0307 05:13:06.075910       1 root.go:37] kubeapps-apis has been configured with: core.ServeOptions{Port:50051, PluginDirs:[]string{"/plugins/fluxv2", "/plugins/resources"}, ClustersConfigPath:"/config/clusters.conf", PluginConfigPath:"/config/kubeapps-apis/plugins.conf", PinnipedProxyURL:"http://kubeapps-internal-pinniped-proxy.kubeapps:3333", GlobalReposNamespace:"kubeapps", UnsafeLocalDevKubeconfig:false, QPS:50, Burst:100}
I0307 05:13:06.671936       1 main.go:22] +fluxv2 RegisterWithGRPCServer
I0307 05:13:06.671984       1 server.go:60] +fluxv2 NewServer(kubeappsCluster: [default], pluginConfigPath: [/config/kubeapps-apis/plugins.conf]
Error: failed to initialize plugins server: failed to register plugins: plug-in "name:\"fluxv2.packages\" version:\"v1alpha1\"" failed to register due to: dial tcp 10.96.16.162:6379: connect: connection refused
...

Once redis is ready, it settles, but is always left with a number of restarts (3 for me), which makes it look like there's been some unexpected problem:

k -n kubeapps get po                                             
NAME                                             READY   STATUS    RESTARTS   AGE
kubeapps-c49cb4c5d-vfxmf                         2/2     Running   0          2m23s
kubeapps-internal-dashboard-5b77d89ff-hxf4g      1/1     Running   0          2m23s
kubeapps-internal-kubeappsapis-c8c597f5f-jw5rs   1/1     Running   3          2m23s
kubeapps-internal-kubeops-7646db7767-xd6tx       1/1     Running   0          2m23s
kubeapps-redis-master-0                          1/1     Running   0          2m23s
kubeapps-redis-replicas-0                        1/1     Running   0          2m23s

@gfichtenholt (Contributor, Author)

Wow, that looks pretty convincing. Thanks. Wonder why I have not come across this myself with all the testing I've done?

I will point out that if this is the case, then resync() issue was a red herring. Meaning, this was what was causing the CrashLoopBackOff, not the indexing of the bitnami repo on startup

Anyway, I will introduce a similar loop with exponential back off into NewRedisClientFromEnv to work around this

@absoludity (Contributor)

> Wow, that looks pretty convincing. Thanks. Wonder why I have not come across this myself with all the testing I've done?

Maybe you don't start without Redis each time? (so it's already running). In my case I'm switching between flux, carvel and helm support (for demoing).

> I will point out that if this is the case, then resync() issue was a red herring. Meaning, this was what was causing the CrashLoopBackOff, not the indexing of the bitnami repo on startup

Two separate issues, I think. Those 3 restarts are due to Redis not being ready, but then once Redis is ready, the plugin would take 30s to sync before the plugin itself completed its registration. And because the server itself doesn't start serving until all plugins have been registered successfully, the readiness check would continue to fail during that time, which meant more restarts (I saw some runs with 6 or 7 restarts, but not every time).

> Anyway, I will introduce a similar loop with exponential back off into NewRedisClientFromEnv to work around this

Great, thanks, though in this case we may not want exponential backoff, given that we want to be able to start with minimal delay - maybe just keep pinging every second or two or something. See what you think.

@gfichtenholt (Contributor, Author)

No problem. I will try pinging a total of, say, 10 times, once every second, and then give up?
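
A minimal sketch of what that wait-for-Redis loop could look like, assuming a go-redis v8 client like the one NewRedisClientFromEnv returns; the attempt count, interval, address, and helper name are illustrative, not the final implementation:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

// waitForRedis pings the server up to maxAttempts times, once per second,
// and gives up with an error if it never answers. Illustrative helper only.
func waitForRedis(ctx context.Context, client *redis.Client, maxAttempts int) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = client.Ping(ctx).Err(); err == nil {
			return nil // Redis is up
		}
		time.Sleep(1 * time.Second)
	}
	return fmt.Errorf("redis not ready after %d attempts, last error: %v", maxAttempts, err)
}

func main() {
	// the address is illustrative; in kubeapps-apis it would come from the environment
	client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	if err := waitForRedis(context.Background(), client, 10); err != nil {
		panic(err) // here this would surface as a failed plugin registration
	}
	fmt.Println("redis is ready")
}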

@gfichtenholt (Contributor, Author)

I changed chart_cache.go to correspond to what the latest helm code does.
Please ignore the whole bunch of test files I just moved around; I had some spare time and wanted to clean up the testdata directory a bit.

@absoludity (Contributor) left a comment

Great, thanks Greg.

@@ -110,6 +110,7 @@ require (
github.com/beorn7/perks v1.0.1 // indirect
github.com/cenkalti/backoff/v4 v4.1.2 // indirect
github.com/cespare/xxhash/v2 v2.1.2 // indirect
github.com/chai2010/gettext-go v0.0.0-20160711120539-c6fed771bfd5 // indirect

@absoludity (Contributor)

Hmm, did I miss where you started using these? (possibly in your test code - though you said that was just moving things around).

@gfichtenholt (Contributor, Author)

This was a result of having to import "k8s.io/kubectl/pkg/cmd/cp" for my new integration test that exercises the auto-update feature of flux. I need to be able to copy files into a running pod.

@gfichtenholt merged commit f439918 into vmware-tanzu:main on Mar 7, 2022
@gfichtenholt deleted the further-fluxv2-plugin-features-30 branch on March 7, 2022 07:09