allow configuration of health check for livenessProbe #1479
Comments
We don't support custom health checks for connectors in a Kafka Connect deployment. The health check for the deployment is based on the Connect REST endpoint, rather than on the individual connectors deployed within it. However, what you're trying to do makes sense, so we should try to support this use case somehow.
Right.
TBH, I don't think it does. You cannot bind the pod health to any particular connector. There might be many tasks of many connectors running on the same pod, and you cannot restart it just because a single connector / task is failing. So I think the health checks are correct in not including the connectors. There is for sure space for additional tooling to monitor / restart the connectors (such as #1468). But the health check is not where this should be addressed.
Yes, I agree that #1468, as a completely different way of using connectors within Kubernetes, would be able to handle this better. But TBH a lot depends on how people choose to use the existing Kafka Connect support. If they're deploying a single connector, then restarting the Connect deployment is a perfectly valid thing to do. Of course it doesn't work if you're using the Connect deployment to host a bunch of unrelated connectors. But users might well decide to have multiple separate Kafka Connect deployments when they have unrelated connectors.
Yeah, users might decide to run only a single connector, and it might be a wise decision. But they might also not do it. The current implementation takes Kafka Connect as it was designed: as a framework for running many connectors.
Thanks for the information and feedback. I could be wrong, but it does seem like an easy thing to implement a way to customize connector health checks, especially since a default is available if you do want to use the same Connect deployment for multiple connectors. I would guess that in many if not most situations, even if the deployment hosts multiple connectors, the default health check would still need to be made more specific to the deployment. Simply checking that the root path of the API endpoint is available only confirms that it's responding; it does not account for the many variations of what its functionality might be. In our situation, since the health check is not sufficient to detect a failed connection from the plugin to MySQL, I'll need to separate the connector deployment from the operator entirely, which I think is worse than customizing a health check.
We also find this health-check issue very serious in our deployments, and we're looking for a way to bypass it.
As explained above, you cannot configure the actual probe, just the timings.
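For reference, those timings are set on the `KafkaConnect` custom resource. A minimal sketch (the cluster name and values here are illustrative; the field names follow Strimzi's `Probe` schema):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect-cluster
spec:
  # Only the probe timings are configurable; the probe itself
  # (the check against the Connect REST endpoint) is fixed by the operator.
  livenessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 5
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 5
```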
If I can't configure the actual probe (but only the timings), then the default probe will not detect the situation where a connector is not running, right?
This is a consequence of two different ways of handling failure. On the one hand, the Kubernetes approach is that when a container fails it's better for the process to exit, and Kube will spin up a new one. On the other hand, Kafka Connect is designed as a runtime which hosts a number of connectors, some of which might fail. In general it doesn't make sense for the Kafka Connect process to exit just because one connector task encountered an error and stopped, because all the other connectors would then take a hit too. The availability of each connector would be constrained by the availability of all the others. The pattern of running a single connector per Kafka Connect cluster doesn't help here. There's an existing proposal which seeks to address this: https://github.com/strimzi/proposals/blob/master/007-restarting-kafka-connect-connectors-and-tasks.md. @tomasz-zylka-fluke it would be useful to know whether or not that would help with your specific use case.
If the thing you quoted is right:
This would be a bug in the Connect framework (and possibly in the connector) and should be raised in Apache Kafka in the first place. It should not happen that a connector kills the whole Connect cluster. If it is just the connector that fails, you should restart the connector. Restarting the whole Connect cluster is not the right thing to do. It is as if, when a Kubernetes pod fails, you said you had to restart the whole worker node instead of just the pod. So I can see how in your case that might be useful, but that is IMHO not the right pattern. So I do not agree that this is a serious issue.
@scholzj sorry, my bad - you're right that only the connector is stopped, not the whole Connect cluster.
@tombentley I will read your link and get back to you - thanks!
Yes, that is correct: you have to monitor the connector status and restart it. As described in the link shared by Tom, the connector operator is still a work in progress. Once it is finished, it should allow you to run connectors like that, including restarts.
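As a sketch of the kind of external monitoring tooling discussed above (this is not part of Strimzi; the connector name and the `needs_restart` helper are hypothetical), you can poll Kafka Connect's standard `GET /connectors/<name>/status` REST endpoint and trigger a restart when a connector or one of its tasks reports `FAILED`. The payload below follows the shape Connect's status endpoint returns:

```python
import json

def needs_restart(status: dict) -> bool:
    """Return True if the connector itself or any of its tasks has FAILED,
    based on the JSON from Kafka Connect's GET /connectors/<name>/status."""
    if status["connector"]["state"] == "FAILED":
        return True
    return any(task["state"] == "FAILED" for task in status["tasks"])

# Example payload: the worker is healthy and responds on its REST root,
# but one task of this connector has failed (worker_id values are made up).
example_status = json.loads("""
{
  "name": "my-mysql-connector",
  "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
  "tasks": [
    {"id": 0, "state": "FAILED", "worker_id": "10.0.0.1:8083",
     "trace": "org.apache.kafka.connect.errors.ConnectException: ..."}
  ]
}
""")

if needs_restart(example_status):
    # Real tooling would now POST to /connectors/my-mysql-connector/restart
    # on the Connect REST API instead of printing.
    print("restart required")
```

This also illustrates the point made earlier in the thread: the worker process (and therefore the default liveness probe) can be perfectly healthy while an individual connector task has failed.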
It is exactly what we need! A clean and beautiful solution. What is the process of handling the proposal?
It is approved, just waiting for implementation ... I think we should check with @ajborley whether this is something he plans to work on or something someone else should pick up.
Great news! Thanks, looking forward to the implementation.
I began an implementation of proposal 007, but got sidetracked onto other things. I plan to progress my implementation, but it's not currently part of my day job, so I'm afraid I don't know how long it will take.
@ajborley No pressure ;-). I know that sometimes it is not easy to find the time when other work comes up.
Hi @ajborley, could you please update us on the status of proposal 007? Thanks!
I do not remember anymore if it covers absolutely everything, but there is #4114, which should hopefully be merged soon.
Thank you for the update @scholzj!
Triaged on 22.02.2022: The exact issue described here should be handled by restarting the connectors automatically, which is being considered separately. The health checks are an important part of the operator and are relied on for rolling updates and similar. If there are suggestions on how to improve them, they can be discussed separately, but we do not want to make them configurable.
Reading through the documentation, I'm either misunderstanding it or not configuring the parameters properly.
I have a debezium-mysql plugin I'm using in my KafkaConnect deployment. If the MySQL server gets restarted or otherwise becomes unavailable, the Connect pod throws an exception and stops, waiting to be restarted manually. I need to be able to configure a livenessProbe for the Connect deployment I made that will kill the pod when this happens.
I have this in the CRD but it doesn't seem to be reaching the Deployment: https://gist.github.com/jslusher/0f8a8979a844875830d2a0e1cf250595#file-kafka-connect-yaml-L16-L24
I was under the impression, based on the wording and the link (the example doesn't show an alternate probe configuration), that configuring a custom check was possible: https://strimzi.io/docs/latest/#proc-configuring-healthchecks-deployment-configuration-kafka
Is it just the parameters listed there that are configurable then?
If so, does that mean I'm out of luck if I want a custom health check?