Kubernetes healthcheck gives access denied #386

Closed
pbwur opened this issue May 2, 2024 · 12 comments

@pbwur

pbwur commented May 2, 2024

Hi,

I'm using the 2.0.0 version of VerneMQ with the Helm chart. Unfortunately, the pod in Kubernetes remains unhealthy. The error message is:

Readiness probe failed: Get "http://10.244.76.200:8888/health": dial tcp 10.244.76.200:8888: connect: connection refused

From within the pod, curl against the URL http://localhost:8888/health gives the expected response: {"status":"OK"}.
It seems the IP address used by the probe is the problem.

Version 2.0.0-rc1 works fine, so I'm looking for the difference here.
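
For anyone hitting this: the mismatch can be reproduced by comparing a localhost request with one against the pod IP from inside the container (curl is available in the pod, as used above). A minimal debugging sketch; pod name and namespace are placeholders, not from this thread:

# The healthcheck answers on localhost (as observed above).
kubectl exec -n <namespace> vernemq-0 -- curl -s http://localhost:8888/health
# The probe, however, targets the pod IP, where the connection is refused.
kubectl exec -n <namespace> vernemq-0 -- sh -c 'curl -s http://$(hostname -i):8888/health'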

@ioolkos
Contributor

ioolkos commented May 2, 2024

@pbwur Thanks. The change must be in PR #380, #382, #384 or #385 then. What does the Verne log tell you?

@ashtonian does this ring a bell to you, from the changes to add optional listeners?


👉 Thank you for supporting VerneMQ: https://github.com/sponsors/vernemq
👉 Using the binary VerneMQ packages commercially (.deb/.rpm/Docker) requires a paid subscription.

@pbwur
Author

pbwur commented May 2, 2024

I don't see anything in the logging that points to a problem with the healthcheck. When the first pod (of 3) starts, there are a lot of log statements like:

vmq_swc_store:handle_info/2:555: Replica meta4: Can't initialize AE exchange due to no peer available

After a while, VerneMQ exits. But before that, I'm able to execute the healthcheck via http://localhost:8888/health successfully.

2024-05-02T08:53:35.711676+00:00 [debug] <0.292.0> vmq_swc_store:handle_info/2:555: Replica meta9: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:36.920696+00:00 [debug] <0.247.0> vmq_swc_store:handle_info/2:555: Replica meta4: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:37.434670+00:00 [debug] <0.238.0> vmq_swc_store:handle_info/2:555: Replica meta3: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:37.790656+00:00 [debug] <0.283.0> vmq_swc_store:handle_info/2:555: Replica meta8: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:38.419727+00:00 [debug] <0.301.0> vmq_swc_store:handle_info/2:555: Replica meta10: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:38.744695+00:00 [debug] <0.229.0> vmq_swc_store:handle_info/2:555: Replica meta2: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:40.392832+00:00 [debug] <0.265.0> vmq_swc_store:handle_info/2:555: Replica meta6: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:41.044680+00:00 [debug] <0.256.0> vmq_swc_store:handle_info/2:555: Replica meta5: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:41.835692+00:00 [debug] <0.220.0> vmq_swc_store:handle_info/2:555: Replica meta1: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.212673+00:00 [debug] <0.292.0> vmq_swc_store:handle_info/2:555: Replica meta9: Can't initialize AE exchange due to no peer available
I'm the only pod remaining. Not performing leave and/or state purge.
2024-05-02T08:53:42.465663+00:00 [debug] <0.274.0> vmq_swc_store:handle_info/2:555: Replica meta7: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.839671+00:00 [debug] <0.283.0> vmq_swc_store:handle_info/2:555: Replica meta8: Can't initialize AE exchange due to no peer available
2024-05-02T08:53:42.944858+00:00 [notice] <0.44.0> application_controller:info_exited/3:2129: Application: vmq_server. Exited: stopped. Type: permanent.
2024-05-02T08:53:42.945013+00:00 [notice] <0.44.0> application_controller:info_exited/3:2129: Application: stdout_formatter. Exited: stopped. Type: permanent.

@ioolkos
Contributor

ioolkos commented May 2, 2024

Those "Replica" logs are normal when you have debug log level on.
I guess Kubernetes terminates the pods here, since it cannot reach the health endpoint.
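
To confirm that from the Kubernetes side, the pod's events should show the failed probes and the resulting restarts. A minimal sketch; pod name and namespace are placeholders:

kubectl describe pod vernemq-0 -n <namespace>
# Look for "Readiness probe failed" / "Liveness probe failed" entries under Events.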



@ashtonian
Contributor

Probably need to add this back:
https://github.com/vernemq/docker-vernemq/pull/382/files#diff-95359b2d5d846bb085015977b06cde6a1facdc4ac553c06adb7d12e47aa39373L224-L226
May need to add the cluster port back as well.
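
For context, those removed lines made the start script bind the HTTP status listener to the pod's IP instead of localhost, so the kubelet's probe can actually connect. A rough sketch of the idea, not the exact diff; the listener name and conf path are assumptions:

# Bind the HTTP listener serving /health to the pod IP rather than 127.0.0.1.
IP_ADDRESS=$(hostname -i | awk '{print $1}')
echo "listener.http.default = ${IP_ADDRESS}:8888" >> /vernemq/etc/vernemq.conf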

@ioolkos
Contributor

ioolkos commented May 2, 2024

@ashtonian Thanks, I reverted this here: #387
cc @pbwur let's see whether this resolves the issue. I can build new images tomorrow.



@ioolkos
Contributor

ioolkos commented May 2, 2024

@pbwur I have now uploaded 2.0.0 images with a tentative fix to Docker Hub. Can you test one of those to check whether the Kubernetes healthcheck works now?
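
Something like the following should pick up the rebuilt image; the StatefulSet name and namespace are guesses based on the logs later in this thread, and with a reused 2.0.0 tag the pod spec also needs imagePullPolicy: Always to re-pull:

# Verify the rebuilt tag exists on Docker Hub, then roll the pods so the nodes re-pull it.
docker pull vernemq/vernemq:2.0.0
kubectl rollout restart statefulset vernemq -n mdtis-poc-mqtt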



@pbwur
Author

pbwur commented May 3, 2024

@ioolkos, it seems to work now: all 3 nodes of the cluster are starting. Thanks for the great response!

Although probably not related, I do get an error with the second node after the first node starts successfully. After I delete the PersistentVolumeClaim and start the cluster again, everything is OK.

This is part of the logging:

2024-05-03T09:00:36.793105+00:00 [info] <0.686.0> vmq_diversity_app:start/2:85: enable auth script for postgres "./share/lua/auth/postgres.lua"
Error! Failed to eval: vmq_server_cmd:node_join('VerneMQ@vernemq-0.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local')

Runtime terminating during boot ({{badkey,{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>}},[{erlang,map_get,[{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>},#{}],[{error_info,#{module=>erl_erts_errors}}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,history,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,230}]},{vmq_swc_peer_service,attempt_join,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_peer_service.erl"},{line,57}]},{vmq_server_cli,'-vmq_cluster_join_cmd/0-fun-1-',3,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,516}]},{clique_command,run,1,[{file,"/opt/vernemq/_build/default/
2024-05-03T09:00:37.798996+00:00 [error] <0.9.0>: Error in process <0.9.0> on node 'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local' with exit value:, {{badkey,{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>}},[{erlang,map_get,[{'VerneMQ@vernemq-1.vernemq-headless.mdtis-poc-mqtt.svc.cluster.local',<<34,100,99,27,209,16,239,117,147,202,59,36,181,234,60,253,91,83,95,77>>},#{}],[{error_info,#{module => erl_erts_errors}}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,'-summary/1-lc$^1/1-1-',3,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,220}]},{vmq_swc_plugin,history,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_plugin.erl"},{line,230}]},{vmq_swc_peer_service,attempt_join,1,[{file,"/opt/vernemq/apps/vmq_swc/src/vmq_swc_peer_service.erl"},{line,57}]},{vmq_server_cli,'-vmq_cluster_join_cmd/0-fun-1-',3,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,516}]},{clique_command,run,1,[{file,"/opt/vernemq/_build/default/lib/clique/src/clique_command.erl"},{line,87}]},{vmq_server_cli,command,2,[{file,"/opt/vernemq/apps/vmq_server/src/vmq_server_cli.erl"},{line,45}]}]}

Crash dump is being written to: /erl_crash.dump...[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
Stream closed EOF for mdtis-poc-mqtt/vernemq-1 (vernemq)
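
For reference, that workaround amounts to wiping the failed node's persisted state so it rejoins the cluster fresh. A sketch, assuming the chart's volume claim template is named data (the exact PVC name is a guess):

kubectl delete pvc data-vernemq-1 -n mdtis-poc-mqtt
kubectl delete pod vernemq-1 -n mdtis-poc-mqtt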

@ioolkos ioolkos closed this as completed May 6, 2024
@hsudbrock
Contributor

hsudbrock commented May 24, 2024

@pbwur I have the same issue as the one you describe in your last comment above: when restarting a pod of the VerneMQ StatefulSet, I get the exact same error; only after deleting the PVC (and the underlying PV) and restarting the pod does it come up again. This issue started with 2.0.0; I did not have it with 1.13.

Did you, by any chance, resolve that issue on your side? If yes, I would be thankful to hear how :)

@ioolkos ioolkos reopened this May 24, 2024
@ioolkos
Contributor

ioolkos commented May 30, 2024

@pbwur @hsudbrock I'm currently looking into the PVC-related start error; it looks like some sort of regression.

The following setting in vernemq.conf should prevent it (by switching back to the previous join logic):

vmq_swc.prevent_nonempty_join = off

@pbwur
Author

pbwur commented May 30, 2024

Hi @hsudbrock and @ioolkos, apologies for the late response. That issue did still happen here as well.
It would be great if that setting fixes it. What would be the correct environment variable to set it? DOCKER_VERNEMQ_VMQ_SWC__PREVENT__NONEMPTY__JOIN?

@ioolkos
Contributor

ioolkos commented May 30, 2024

@pbwur DOCKER_VERNEMQ_VMQ_SWC__PREVENT_NONEMPTY_JOIN

(translate . to __, keep _ as _)
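
Applied to the setting above, as it could appear in the container environment:

# vernemq.conf: vmq_swc.prevent_nonempty_join = off
# '.' becomes '__', each existing '_' stays a single '_', prefix with DOCKER_VERNEMQ_:
DOCKER_VERNEMQ_VMQ_SWC__PREVENT_NONEMPTY_JOIN=off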

@hsudbrock
Contributor

Thanks for the hint and the PR fixing the issue! So far it looks good for me, i.e., disabling the nonempty join check has resulted in no errors when restarting my VerneMQ cluster.

@ioolkos ioolkos closed this as completed Jun 11, 2024