Kubernetes healthcheck gives access denied #386

Closed
pbwur opened this issue May 2, 2024 · 12 comments

@pbwur

pbwur commented May 2, 2024

Hi,

I'm using the 2.0.0 version of VerneMQ with the Helm chart. Unfortunately, the pod in Kubernetes remains unhealthy. The error message is:

Readiness probe failed: Get "http://10.244.76.200:8888/health": dial tcp 10.244.76.200:8888: connect: connection refused

From within the pod, curl against the URL http://localhost:8888/health gives the expected response: {"status":"OK"}.
It seems the IP address used by the probe is the problem.

Version 2.0.0-rc1 works fine, so I'm looking for the difference here.
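
For anyone hitting this: the mismatch can be reproduced by comparing a localhost request with one against the pod IP from inside the container (curl is available in the pod, as used above). A minimal debugging sketch; pod name and namespace are placeholders, not from this thread:

# The healthcheck answers on localhost (as observed above).
kubectl exec -n <namespace> vernemq-0 -- curl -s http://localhost:8888/health
# The probe, however, targets the pod IP, where the connection is refused.
kubectl exec -n <namespace> vernemq-0 -- sh -c 'curl -s http://$(hostname -i):8888/health'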

@ioolkos
Contributor

ioolkos commented May 2, 2024

@pbwur Thanks. The change must be in PR #380, #382, #384 or #385 then. What does the Verne log tell you?

@ashtonian does this ring a bell to you, from the changes to add optional listeners?


👉 Thank you for supporting VerneMQ: https://github.com/sponsors/vernemq
👉 Using the binary VerneMQ packages commercially (.deb/.rpm/Docker) requires a paid subscription.

@pbwur
Author

pbwur commented May 2, 2024

I don't see anything in the logging that points to a problem with the healthcheck. When the first pod (of 3) starts, there are a lot of log statements like:

vmq_swc_store:handle_info/2:555: Replica meta4: Can't initialize AE exchange due to no peer available

After a while, VerneMQ exits. But before that, I'm able to execute the healthcheck via http://localhost:8888/health successfully.

2024-05-02T08:53:35.711676+00:00 [debug] <0.292.0> vmq_swc_store:handle_info/2:555: Replica meta9: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:36.920696+00:00 [debug] <0.247.0> vmq_swc_store:handle_info/2:555: Replica meta4: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:37.434670+00:00 [debug] <0.238.0> vmq_swc_store:handle_info/2:555: Replica meta3: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:37.790656+00:00 [debug] <0.283.0> vmq_swc_store:handle_info/2:555: Replica meta8: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:38.419727+00:00 [debug] <0.301.0> vmq_swc_store:handle_info/2:555: Replica meta10: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:38.744695+00:00 [debug] <0.229.0> vmq_swc_store:handle_info/2:555: Replica meta2: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:40.392832+00:00 [debug] <0.265.0> vmq_swc_store:handle_info/2:555: Replica meta6: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:41.044680+00:00 [debug] <0.256.0> vmq_swc_store:handle_info/2:555: Replica meta5: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:41.835692+00:00 [debug] <0.220.0> vmq_swc_store:handle_info/2:555: Replica meta1: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.212673+00:00 [debug] <0.292.0> vmq_swc_store:handle_info/2:555: Replica meta9: Can't initialize AE exchange due to no peer available
I'm the only pod remaining. Not performing leave and/or state purge.
2024-05-02T08:53:42.465663+00:00 [debug] <0.274.0> vmq_swc_store:handle_info/2:555: Replica meta7: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.839671+00:00 [debug] <0.283.0> vmq_swc_store:handle_info/2:555: Replica meta8: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.944858+00:00 [notice] <0.44.0> application_controller:info_exited/3:2129: Application: vmq_server. Exited: stopped. Type: permanent.
2024-05-02T08:53:42.945013+00:00 [notice] <0.44.0> application_controller:info_exited/3:2129: Application: stdout_formatter. Exited: stopped. Type: permanent.

@ioolkos
Contributor

ioolkos commented May 2, 2024

Those "Replica" logs are normal when you have debug log level on.
I guess Kubernetes terminates the pods here, since it cannot reach the health endpoint.
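
To confirm that from the Kubernetes side, the pod's events should show the failed probes and the resulting restarts. A minimal sketch; pod name and namespace are placeholders:

kubectl describe pod vernemq-0 -n <namespace>
# Look for "Readiness probe failed" / "Liveness probe failed" entries under Events.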



@ashtonian
Contributor

Probably need to add this back:
https://github.com/vernemq/docker-vernemq/pull/382/files#diff-95359b2d5d846bb085015977b06cde6a1facdc4ac553c06adb7d12e47aa39373L224-L226
May need to add the cluster port back as well.
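
For context, those removed lines made the start script bind the HTTP status listener to the pod's IP instead of localhost, so the kubelet's probe can actually connect. A rough sketch of the idea, not the exact diff; the listener name and conf path are assumptions:

# Bind the HTTP listener serving /health to the pod IP rather than 127.0.0.1.
IP_ADDRESS=$(hostname -i | awk '{print $1}')
echo "listener.http.default = ${IP_ADDRESS}:8888" >> /vernemq/etc/vernemq.conf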

@ioolkos
Contributor

ioolkos commented May 2, 2024

@ashtonian Thanks, I reverted this here: #387
cc @pbwur let's see whether this resolves the issue. I can build new images tomorrow.



@ioolkos
Contributor

ioolkos commented May 2, 2024

@pbwur I have now uploaded 2.0.0 images with a tentative fix to Docker Hub. Can you test one of those to check whether the Kubernetes healthcheck works now?
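
Something like the following should pick up the rebuilt image; the StatefulSet name and namespace are guesses based on the logs later in this thread, and with a reused 2.0.0 tag the pod spec also needs imagePullPolicy: Always to re-pull:

# Verify the rebuilt tag exists on Docker Hub, then roll the pods so the nodes re-pull it.
docker pull vernemq/vernemq:2.0.0
kubectl rollout restart statefulset vernemq -n mdtis-poc-mqtt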



@pbwur
Author

pbwur commented May 3, 2024

@ioolkos, it seems to work now: all 3 nodes of the cluster are starting. Thanks for the great response!

Although probably not related, I do get an error with the second node after the first node starts successfully. After I delete the PersistentVolumeClaim and start the cluster again, everything is OK.

This is part of the logging:

2024-05-03T09:00:36.793105+00:00 [info] <0.686.0> vmq_diversity_app:start/2:85: enable auth script for postgres "./share/lua/auth/postgres.lua"
Error! Failed to eval: vmq_server_cmd:node_join('VerneMQ@vernemq-0.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local')

Runtime terminating during boot ({{badkey,{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>}},[{erlang,map_get,[{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>},#{}],[{error_info,#{module=>erl_erts_errors}}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,history,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,230}]},{vmq_swc_peer_service,attempt_join,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_peer_service.erl"},{line,57}]},{vmq_server_cli,'-vmq_cluster_join_cmd/0-fun-1-',3,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,516}]},{clique_command,run,1,[{file,"/opt/vernemq/_build/default/
2024-05-03T09:00:37.798996+00:00 [error] <0.9.0>: Error in process <0.9.0> on node 'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local' with exit value:, {{badkey,{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>}},[{erlang,map_get,[{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>},#{}],[{error_info,#{module => erl_erts_errors}}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,history,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,230}]},{vmq_swc_peer_service,attempt_join,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_peer_service.erl"},{line,57}]},{vmq_server_cli,'-vmq_cluster_join_cmd/0-fun-1-',3,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,516}]},{clique_command,run,1,[{file,"/opt/vernemq/_build/default/lib/clique/src/clique_command.erl"},{line,87}]},{vmq_server_cli,command,2,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,45}]}]}

Crash dump is being written to: /erl_crash.dump...[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
Stream closed EOF for mdtis-poc-mqtt/vernemq-1 (vernemq)
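
For reference, that workaround amounts to wiping the failed node's persisted state so it rejoins the cluster fresh. A sketch, assuming the chart's volume claim template is named data (the exact PVC name is a guess):

kubectl delete pvc data-vernemq-1 -n mdtis-poc-mqtt
kubectl delete pod vernemq-1 -n mdtis-poc-mqtt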

@ioolkos ioolkos closed this as completed May 6, 2024
@hsudbrock
Contributor

hsudbrock commented May 24, 2024

@pbwur I have the same issue as the one you describe in your last comment above: when restarting a pod of the VerneMQ StatefulSet, I get the exact same error; only after deleting the PVC (and the underlying PV) and restarting the pod does it come up again. This issue started with 2.0.0; I did not have it with 1.13.

Did you, by any chance, resolve that issue on your side? If yes, I would be thankful to hear how :)

@ioolkos ioolkos reopened this May 24, 2024
@ioolkos
Contributor

ioolkos commented May 30, 2024

@pbwur @hsudbrock I'm currently looking into the PVC-related start error; it looks like some sort of regression.

The following setting in vernemq.conf should prevent it (by switching back to the previous join logic):

vmq_swc.prevent_nonempty_join = off

@pbwur
Author

pbwur commented May 30, 2024

Hi @hsudbrock and @ioolkos, apologies for the late response. That issue did still happen here as well.
It would be great if that setting fixes it. What would be the correct environment variable to set it? DOCKER_VERNEMQ_VMQ_SWC__PREVENT__NONEMPTY__JOIN?

@ioolkos
Contributor

ioolkos commented May 30, 2024

@pbwur DOCKER_VERNEMQ_VMQ_SWC__PREVENT_NONEMPTY_JOIN

(translate . to __, keep _ as _)
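
Applied to the setting above, as it could appear in the container environment:

# vernemq.conf: vmq_swc.prevent_nonempty_join = off
# '.' becomes '__', each existing '_' stays a single '_', prefix with DOCKER_VERNEMQ_:
DOCKER_VERNEMQ_VMQ_SWC__PREVENT_NONEMPTY_JOIN=off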

@hsudbrock
Contributor

Thanks for the hint and the PR fixing the issue! So far it looks good for me, i.e., disabling the nonempty join check has resulted in no errors when restarting my VerneMQ cluster.

@ioolkos ioolkos closed this as completed Jun 11, 2024