fix for kubeapps-apis CrashLoopBackoff #4329 and [fluxv2] non-FQDN chart url fails on chart view #4381 #4382
Conversation
Thanks for the fix for the CrashLoopBackOff! See the inline thoughts.
if !u.IsAbs() {
	path := u.Path
	u, err = url.Parse(chart.Repo.URL)
	if err != nil {
		return fmt.Errorf("invalid URL format for chart repo [%s]: %v", chart.ID, err)
	}
	u.Path = u.Path + path
}
It's not clear to me if this is doing what you intend? In particular, we shouldn't normally just join a URL path via concatenation like that.
Actually, checking the link from the issue you've referenced, it looks like you've unintentionally added the change which broke helm (listed in the description of helm/helm#3065), rather than the fix from 3065 itself? Is that possible?
Well, I am looking at the latest helm code
https://github.com/cjauvin/helm/blob/master/pkg/downloader/chart_downloader.go#L219
It is exactly as I have it
Well, I am looking at the latest helm code https://github.com/cjauvin/helm/blob/master/pkg/downloader/chart_downloader.go#L219 It is exactly as I have it
Gah, never mind. The latest helm code does indeed look very different
https://github.com/helm/helm/blob/65d8e72504652e624948f74acbba71c51ac2e342/pkg/downloader/chart_downloader.go#L303
Let me fix this accordingly. BTW, the fix I made based on the earlier version was fine as well; it only didn't handle weird cases like URLs with trailing slashes or query params, which is what helm #3065 was about, not what our #4381 was about. Nevertheless, I will make it consistent with the latest helm code.
Yep, I saw. Just didn't make sense to include the code that broke helm/helm#3065 rather than the code that fixed it. And I imagine the slash issue would hit us at some point if it affected helm users. Thanks.
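For reference, below is a minimal sketch of that style of URL resolution: a relative chart reference is joined onto the repo's base URL with url.ResolveReference rather than by string concatenation, so trailing slashes and query parameters survive. The helper name resolveChartURL and the example URLs are hypothetical; this is not the exact helm or kubeapps code.

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveChartURL joins a possibly-relative chart reference onto the repo's
// base URL. It uses url.ResolveReference instead of string concatenation so
// that trailing slashes and query parameters are handled correctly.
func resolveChartURL(repoURL, chartRef string) (string, error) {
	ref, err := url.Parse(chartRef)
	if err != nil {
		return "", fmt.Errorf("invalid chart URL format [%s]: %v", chartRef, err)
	}
	if ref.IsAbs() {
		// Already a fully-qualified URL, nothing to resolve.
		return ref.String(), nil
	}
	base, err := url.Parse(repoURL)
	if err != nil {
		return "", fmt.Errorf("invalid URL format for chart repo [%s]: %v", repoURL, err)
	}
	// Keep the repo's query parameters (e.g. auth tokens) on the final URL.
	q := base.Query()
	// ResolveReference needs exactly one trailing slash on the base path.
	base.Path = strings.TrimSuffix(base.Path, "/") + "/"
	resolved := base.ResolveReference(ref)
	resolved.RawQuery = q.Encode()
	return resolved.String(), nil
}

func main() {
	u, _ := resolveChartURL("https://charts.example.com/stable", "charts/podinfo-6.0.0.tgz")
	fmt.Println(u) // https://charts.example.com/stable/charts/podinfo-6.0.0.tgz
}

Resolved this way, a non-FQDN chart URL is interpreted relative to the repo URL instead of being treated as a broken absolute URL.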
"[%s]: Initial resync failed after [%d] retries were exhausted, last error: %v", | ||
c.queue.Name(), maxWatcherCacheRetries, err) | ||
// yes, I really want this to panic. Something is seriously wrong and | ||
// possibly restarting kubeapps-apis server is needed... |
Is it possible that we'll panic here simply because Redis isn't yet ready? (That's what was happening when I created this issue: Redis seems to take a while to come up on my local cluster, which was causing kubeapps-apis to fail to be ready prior to this change.) With this change, I assume kubeapps-apis will become ready faster, but this code will still cause the restart if Redis isn't ready in time? Not sure how we could handle it... perhaps explicitly waiting for Redis to become ready initially, in this go-routine, but before we do the resync with panic?
Well, I could increase maxWatcherCacheResyncBackoff from 2 to, say, 8. That way, we will wait a LONG time before giving up (2^8 = 256 seconds). Other than that, I would say let's cross that bridge if/when we come to it.
I will point out that existing code
https://github.com/kubeapps/kubeapps/blob/954f8d62add7f03ae440922e3825be49a29bf4db/cmd/kubeapps-apis/plugins/fluxv2/packages/v1alpha1/common/utils.go#L142
already does a "PING" to redis before any of the resync() code even takes place. And if the ping fails, the whole thing fails, again, before resync(). We may or may not want to change that behavior, e.g. by introducing retries w/ exponential back off in that routine (NewRedisClientFromEnv) or maybe some kind of loop to wait for redis to come up. I'd like to see some evidence first that things are broken (log files will suffice).
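For context, the pattern being discussed is roughly "retry the initial resync with exponential backoff, then panic once the retries are exhausted". The sketch below is only illustrative: the constant names come from this PR, but their values and the surrounding wiring are assumptions, not the actual kubeapps-apis cache code.

package main

import (
	"fmt"
	"log"
	"time"
)

const (
	maxWatcherCacheRetries       = 5 // hypothetical value, for illustration only
	maxWatcherCacheResyncBackoff = 2 // exponent cap: waits grow up to 2^2 seconds
)

// resyncWithRetries retries fn with exponential backoff and panics when the
// retries are exhausted, mirroring the "I really want this to panic" comment
// in the diff above. The exact behavior in kubeapps-apis may differ.
func resyncWithRetries(name string, fn func() error) {
	var err error
	for i := 0; i < maxWatcherCacheRetries; i++ {
		if err = fn(); err == nil {
			return
		}
		exp := i
		if exp > maxWatcherCacheResyncBackoff {
			exp = maxWatcherCacheResyncBackoff
		}
		wait := time.Duration(1<<exp) * time.Second
		log.Printf("[%s]: resync attempt %d failed (%v), retrying in %s", name, i+1, err, wait)
		time.Sleep(wait)
	}
	panic(fmt.Sprintf(
		"[%s]: Initial resync failed after [%d] retries were exhausted, last error: %v",
		name, maxWatcherCacheRetries, err))
}

func main() {
	attempts := 0
	resyncWithRetries("demo-queue", func() error {
		attempts++
		if attempts < 3 {
			return fmt.Errorf("not ready yet")
		}
		return nil
	})
	fmt.Println("resync succeeded after", attempts, "attempts")
}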
Ah great. I missed the ping.
We may or may not want to change that behavior, e.g. by introducing retries w/ exponential back off in that routine (NewRedisClientFromEnv) or maybe some kind of loop to wait for redis to come up. I'd like to see some evidence first that things are broken (log files will suffice).
Yep, I'd be keen (eventually, if/when you agree) to do a loop waiting for redis to come up with some timeout, so it only affects the initial load. As for evidence, what I see whenever I enable flux is first this:
k -n kubeapps get po
NAME READY STATUS RESTARTS AGE
kubeapps-c49cb4c5d-vfxmf 2/2 Running 0 20s
kubeapps-internal-dashboard-5b77d89ff-hxf4g 1/1 Running 0 20s
kubeapps-internal-kubeappsapis-5465c7b6d4-mzmjh 1/1 Running 0 2d23h
kubeapps-internal-kubeappsapis-c8c597f5f-jw5rs 0/1 CrashLoopBackOff 1 20s
kubeapps-internal-kubeops-7646db7767-xd6tx 1/1 Running 0 20s
kubeapps-redis-master-0 0/1 Running 0 20s
kubeapps-redis-replicas-0 0/1 Running 0 20s
So here Redis is still coming up and kubeapps-apis has already restarted once. The logs show:
k -n kubeapps logs kubeapps-internal-kubeappsapis-c8c597f5f-jw5rs
I0307 05:13:06.075910 1 root.go:37] kubeapps-apis has been configured with: core.ServeOptions{Port:50051, PluginDirs:[]string{"/plugins/fluxv2", "/plugins/resources"}, ClustersConfigPath:"/config/clusters.conf", PluginConfigPath:"/config/kubeapps-apis/plugins.conf", PinnipedProxyURL:"http://kubeapps-internal-pinniped-proxy.kubeapps:3333", GlobalReposNamespace:"kubeapps", UnsafeLocalDevKubeconfig:false, QPS:50, Burst:100}
I0307 05:13:06.671936 1 main.go:22] +fluxv2 RegisterWithGRPCServer
I0307 05:13:06.671984 1 server.go:60] +fluxv2 NewServer(kubeappsCluster: [default], pluginConfigPath: [/config/kubeapps-apis/plugins.conf]
Error: failed to initialize plugins server: failed to register plugins: plug-in "name:\"fluxv2.packages\" version:\"v1alpha1\"" failed to register due to: dial tcp 10.96.16.162:6379: connect: connection refused
...
Once redis is ready, it settles, but is always left with a number of restarts (3 for me), which makes it look like there's been some unexpected problem:
k -n kubeapps get po
NAME READY STATUS RESTARTS AGE
kubeapps-c49cb4c5d-vfxmf 2/2 Running 0 2m23s
kubeapps-internal-dashboard-5b77d89ff-hxf4g 1/1 Running 0 2m23s
kubeapps-internal-kubeappsapis-c8c597f5f-jw5rs 1/1 Running 3 2m23s
kubeapps-internal-kubeops-7646db7767-xd6tx 1/1 Running 0 2m23s
kubeapps-redis-master-0 1/1 Running 0 2m23s
kubeapps-redis-replicas-0 1/1 Running 0 2m23s
Wow, that looks pretty convincing. Thanks. Wonder why I have not come across this myself with all the testing I've done?
I will point out that if this is the case, then the resync() issue was a red herring. Meaning, this was what was causing the CrashLoopBackOff, not the indexing of the bitnami repo on startup.
Anyway, I will introduce a similar loop with exponential back off into NewRedisClientFromEnv to work around this.
Wow, that looks pretty convincing. Thanks. Wonder why I have not come across this myself with all the testing I've done?
Maybe you don't start without Redis each time (so it's already running)? In my case I'm switching between flux, carvel and helm support (for demoing).
I will point out that if this is the case, then the resync() issue was a red herring. Meaning, this was what was causing the CrashLoopBackOff, not the indexing of the bitnami repo on startup.
Two separate issues, I think. Those 3 restarts are due to Redis not being ready, but then once Redis is ready, the plugin would take 30s to sync before the plugin itself completed its registration. And because the server itself doesn't start serving until all plugins have been registered successfully, the readiness check would continue to fail during that time, which meant more restarts (I saw some runs where I had 6 or 7 restarts, but not every time).
Anyway, I will introduce a similar loop with exponential back off into NewRedisClientFromEnv to work around this.
Great, thanks, though in this case we may not want exponential backoff, given that we want to be able to start with minimal delay - maybe just keep pinging every second or two or something. See what you think.
No problem. I will try pinging a total of, say, 10 times, once every second, and then give up?
I changed chart_cache.go to correspond to what the latest is in helm.
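For illustration, here is a minimal sketch of that kind of bounded ping loop, using the go-redis v8 client as an example. The function name waitForRedis, the Redis address, and how this would be wired into NewRedisClientFromEnv are assumptions, not the actual kubeapps code.

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

// waitForRedis pings the server up to maxAttempts times, once per interval,
// and gives up with an error if it never answers. Waiting here keeps the
// initial startup from crashing just because Redis is still coming up.
func waitForRedis(ctx context.Context, client *redis.Client, maxAttempts int, interval time.Duration) error {
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		if lastErr = client.Ping(ctx).Err(); lastErr == nil {
			return nil
		}
		if i < maxAttempts-1 {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("redis not ready after %d attempts: %v", maxAttempts, lastErr)
}

func main() {
	// Example address only; the real address comes from the environment.
	client := redis.NewClient(&redis.Options{Addr: "kubeapps-redis-master.kubeapps:6379"})
	if err := waitForRedis(context.Background(), client, 10, time.Second); err != nil {
		panic(err)
	}
	fmt.Println("redis is ready")
}

Whether 10 one-second attempts is enough depends on how long Redis actually takes to come up in the pod listings above, so the numbers are worth tuning.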
Great, thanks Greg.
@@ -110,6 +110,7 @@ require (
	github.com/beorn7/perks v1.0.1 // indirect
	github.com/cenkalti/backoff/v4 v4.1.2 // indirect
	github.com/cespare/xxhash/v2 v2.1.2 // indirect
	github.com/chai2010/gettext-go v0.0.0-20160711120539-c6fed771bfd5 // indirect
Hmm, did I miss where you started using these? (possibly in your test code - though you said that was just moving things around).
This was a result of having to import "k8s.io/kubectl/pkg/cmd/cp" for my new integration test, which tests the auto-update feature of flux. I need to be able to copy files into a running pod.
see #4329 for discussion
I had some extra time and also fixed [fluxv2] non-FQDN chart url fails on chart view #4381 in this PR