
allow configuration of health check for livenessProbe #1479

Closed
jslusher opened this issue Mar 26, 2019 · 23 comments

Comments

@jslusher

jslusher commented Mar 26, 2019

I've been reading through the documentation, and I'm either misunderstanding it or not configuring the parameters properly.

I have a debezium-mysql plugin I'm using in my KafkaConnect deployment. If the MySQL server gets restarted or otherwise becomes unavailable, the connect pod throws an exception and stops, waiting to be restarted manually. I need to be able to configure a livenessProbe for the Connect Deployment I made that will kill the pod if this happens.

I have this in the CRD but it doesn't seem to be reaching the Deployment: https://gist.github.com/jslusher/0f8a8979a844875830d2a0e1cf250595#file-kafka-connect-yaml-L16-L24

I was under the impression based on the wording and the link (the example doesn't show an alternate probe configuration) that configuring a custom check was possible. https://strimzi.io/docs/latest/#proc-configuring-healthchecks-deployment-configuration-kafka

Is it just the parameters listed there that are configurable then?

If so, does that mean I'm out of luck if I want a custom health check?

@tombentley
Member

We don't support custom health checks for connectors in a Kafka Connect deployment. The health checks for the deployment are based on the Connect REST endpoint, rather than on the individual connectors deployed within it.
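For reference, here is a sketch of roughly what the operator-generated probe section of the Connect Deployment looks like. The exact path and port are assumptions based on Kafka Connect's default REST listener (8083); the point is that the probe targets the REST API as a whole:

```yaml
# Sketch of the operator-generated probes (not user-editable).
# The httpGet target is an assumption based on Connect's default REST port.
livenessProbe:
  httpGet:
    path: /
    port: 8083
  initialDelaySeconds: 60
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /
    port: 8083
  initialDelaySeconds: 60
  timeoutSeconds: 5
```

A probe like this only confirms that the REST API is responding; a failed connector task leaves it passing.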

However, what you're trying to do makes sense, so we should try to support this use case somehow.

Is it just the parameters listed there that are configurable then?

Right
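In other words, only the probe timings can be set in the KafkaConnect resource. A minimal sketch, assuming the parameter names from the linked healthcheck docs (`initialDelaySeconds`, `timeoutSeconds`):

```yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaConnect
metadata:
  name: my-connect
spec:
  # Only the timings are configurable; the probe action itself
  # (an HTTP check of the Connect REST API) is fixed by the operator.
  livenessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 5
  readinessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 5
```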

@scholzj
Member

scholzj commented Mar 27, 2019

However, what you're trying to do makes sense, so we should try to support this use case somehow.

TBH, I don't think it does. You cannot bind the pod's health to any particular connector. There might be many tasks of many connectors running on the same pod, and you cannot restart it just because a single connector/task fails. So I think the health checks are correct in not including the connectors.

There is for sure space for additional tooling to monitor/restart the connectors (such as #1468). But the health check is not where this should be addressed.

@tombentley
Member

Yes, I agree that #1468, as a completely different way of using connectors within Kubernetes, would be able to handle this better. But TBH a lot depends on how people choose to use the existing Kafka Connect support. If they're deploying a single connector, then restarting the Connect deployment is a perfectly valid thing to do. Of course it doesn't work if you're using the Connect deployment to host a bunch of unrelated connectors. But users might well decide to have multiple separate Kafka Connect deployments when they have unrelated connectors.

@scholzj
Member

scholzj commented Mar 27, 2019

Yeah, users might decide to run only a single connector and it might be a wise decision. But they might also not do it. The current implementation takes Kafka Connect as it was designed - as a framework for running many connectors.

@jslusher
Author

Thanks for the information and feedback. I could be wrong, but it seems like it would be easy to implement a way to customize connector health checks, especially since a default is available for anyone who does want to use the same Connect deployment for multiple connections. I would guess that in many if not most situations, even if the deployment hosts multiple connections, the default health check would still need to be made more specific to the deployment. Simply checking that the root path of the API endpoint is available only confirms that it's responding; it does not account for the many variations of what its functionality might be. In our situation, since the health check is not sufficient to detect a failed connection from the plugin to MySQL, I'll need to separate the connector deployment from the operator entirely, which I think is worse than customizing a health check.

@tomasz-zylka-fluke

tomasz-zylka-fluke commented Oct 7, 2020

We also find this health-check issue very serious in our deployments, and we're looking for a way to bypass it.
Based on this conversation, we're going to try the solution of one KafkaConnect cluster per KafkaConnector.
We have three Debezium PostgreSQL connectors, so my understanding is that we will have three KafkaConnect resources and three KafkaConnector resources, right?
@scholzj, can you help me understand where I should put the liveness probe to make sure we restart the pod when the connector is not running?
In here?

apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaConnect
metadata:
  name: kafka-connect-cluster
  annotations:
  # use-connector-resources configures this KafkaConnect
  # to use KafkaConnector resources to avoid
  # needing to call the Connect REST API directly
    strimzi.io/use-connector-resources: "true"
spec:
  image: address
  replicas: 1
  bootstrapServers: kafka-cluster-kafka-bootstrap:9092
  config:
    config.storage.replication.factor: 1
    offset.storage.replication.factor: 1
    status.storage.replication.factor: 1
    config.providers: file
    config.providers.file.class: org.apache.kafka.common.config.provider.FileConfigProvider
  livenessProbe:
    exec:
      command:
      - sh
      - ...

@scholzj
Member

scholzj commented Oct 7, 2020

As explained above, you cannot configure the actual probe, just the timings.

@tomasz-zylka-fluke

tomasz-zylka-fluke commented Oct 7, 2020

like @jslusher wrote:

If the mysql server gets restarted or otherwise becomes unavailable, the connect pod throws an exception and stops, waiting to be restarted manually.

@scholzj you're saying there is no way to bypass it, i.e. to restart the pod automagically?
This is a really serious issue, though.

@tomasz-zylka-fluke

If I can't configure the actual probe (only the timings), then the default probe will not detect that the connector is not running, right?

@tombentley
Member

This is a consequence of two different ways of handling failure. On the one hand, the Kubernetes approach is that when a container fails, it's better for the process to exit, and Kube will spin up a new one. On the other hand, Kafka Connect is designed as a runtime which hosts a number of connectors, some of which might fail. In general it doesn't make sense for the Kafka Connect process to exit just because one connector task encountered an error and stopped, because all the other connectors would then take a hit too. The availability of each connector would be constrained by the availability of all the others. The pattern of running a single connector per Kafka Connect cluster doesn't help here.

There's an existing proposal which seeks to address this: https://github.com/strimzi/proposals/blob/master/007-restarting-kafka-connect-connectors-and-tasks.md. @tomasz-zylka-fluke it would be useful to know whether or not that would help with your specific use case.

@scholzj
Member

scholzj commented Oct 7, 2020

If the thing you quoted is right:

If the mysql server gets restarted or otherwise becomes unavailable, the connect pod throws an exception and stops, waiting to be restarted manually.

This would be a bug in the Connect framework (and possibly in the connector) and should be raised with Apache Kafka in the first place. A connector should not be able to kill the whole Connect worker.

If it is just the connector that fails, you should restart the connector. Restarting the whole Connect deployment is not the right thing to do. It is as if, when a Kubernetes pod failed, you had to restart the whole worker node instead of just the pod. So I can see how in your case that might be useful, but it is IMHO not the right pattern. So I do not agree that this is a serious issue.

@tomasz-zylka-fluke

tomasz-zylka-fluke commented Oct 7, 2020

@scholzj sorry, my bad - you're right that only the connector is stopped, not the whole Connect cluster.
But still, in our particular case it is a serious issue: the connector's functionality stops working and our system has no clue about it, so it can no longer be a self-healing system, and someone must monitor the connector status and manually restart it.

@tomasz-zylka-fluke

@tombentley I will read your link and get back to you - thanks!

@scholzj
Member

scholzj commented Oct 7, 2020

But still, in our particular case it is a serious issue: the connector's functionality stops working and our system has no clue about it, so it can no longer be a self-healing system, and someone must monitor the connector status and manually restart it.

Yes, that is correct: you have to monitor the connector status and restart it yourself. As described in the link shared by Tom, the connector operator is still a work in progress. Once it is finished, it should allow you to run connectors like that, including restarts.
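Until restarts are automated by the operator, monitoring can be scripted against Kafka Connect's standard REST API (`GET /connectors/{name}/status`, `POST /connectors/{name}/restart`). A hedged sketch as a Kubernetes CronJob; the service name (`kafka-connect-cluster-connect-api`), connector name, image, and schedule are placeholders to adapt to your deployment:

```yaml
apiVersion: batch/v1beta1  # batch/v1 on newer clusters
kind: CronJob
metadata:
  name: connector-watchdog
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: check
            image: curlimages/curl  # any image with curl and sh
            command:
            - sh
            - -c
            # If any task of the connector reports FAILED, ask the
            # Connect REST API to restart the connector. The service
            # name below is a placeholder for the Strimzi-created
            # Connect API service in your namespace.
            - >
              if curl -s http://kafka-connect-cluster-connect-api:8083/connectors/my-connector/status
              | grep -q FAILED; then
              curl -s -X POST http://kafka-connect-cluster-connect-api:8083/connectors/my-connector/restart;
              fi
```

This is a stopgap, not a replacement for proper connector-level operator support.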

@tomasz-zylka-fluke

@scholzj @tombentley

There's an existing proposal which seeks to address this: https://github.com/strimzi/proposals/blob/master/007-restarting-kafka-connect-connectors-and-tasks.md. @tomasz-zylka-fluke it would be useful to know whether or not that would help with your specific use case.

It is exactly what we need!

  annotations:
    strimzi.io/restart: "true"

clean and beautiful solution.

What is the process for handling strimzi/proposals? Can I vote on proposals? Do you find this 007 proposal meaningful?
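For context, the proposal puts the annotation on the KafkaConnector resource itself, so a restart could be triggered declaratively. A sketch of what that might look like once implemented; the connector name, cluster label, and connector class are placeholders:

```yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaConnector
metadata:
  name: my-connector
  labels:
    strimzi.io/cluster: kafka-connect-cluster
  annotations:
    # Per the proposal, the operator restarts the connector
    # and then removes this annotation.
    strimzi.io/restart: "true"
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
```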

@scholzj
Member

scholzj commented Oct 7, 2020

It is approved, just waiting for implementation... I think we should check with @ajborley whether this is something he plans to work on or something that someone else should pick up.

@tomasz-zylka-fluke

Great news, thanks! Looking forward to the implementation.

@ajborley
Member

ajborley commented Oct 7, 2020

I began an implementation of proposal 007 but got sidetracked onto other things. I plan to progress my implementation, but it's not currently part of my day job, so I'm afraid I don't know how long it will take.

@scholzj
Member

scholzj commented Oct 7, 2020

@ajborley No pressure ;-). I know that sometimes it is not easy to find the time when other work comes up.

@vmironichev-nix

Hi @ajborley, could you please update us with the status of proposal 007? Thanks!

@scholzj
Member

scholzj commented Jan 12, 2021

I do not remember anymore if it covers absolutely everything, but there is #4114, which should hopefully be merged soon.

@vmironichev-nix

thank you for update @scholzj!

@scholzj
Member

scholzj commented Feb 22, 2022

Triaged on 22.02.2022: The exact issue described here should be handled by restarting the connectors automatically, which is something being considered separately. The health checks are an important part of the operator and are relied on for rolling updates and similar operations. If there are suggestions for how to improve them, they can be discussed separately, but we do not want to make them configurable.

@scholzj scholzj closed this as completed Feb 22, 2022