
allow configuration of health check for livenessProbe #1479

Closed
jslusher opened this issue Mar 26, 2019 · 23 comments

Comments

@jslusher

jslusher commented Mar 26, 2019

I've been reading through the documentation, and I'm either misunderstanding it or not configuring the parameters properly.

I have a debezium-mysql plugin I'm using in my KafkaConnect deployment. If the MySQL server gets restarted or otherwise becomes unavailable, the connect pod throws an exception and stops, waiting to be restarted manually. I need to be able to configure a livenessProbe for the Connect Deployment I made that will kill the pod if this happens.

I have this in the CRD but it doesn't seem to be reaching the Deployment: https://gist.github.com/jslusher/0f8a8979a844875830d2a0e1cf250595#file-kafka-connect-yaml-L16-L24

I was under the impression based on the wording and the link (the example doesn't show an alternate probe configuration) that configuring a custom check was possible. https://strimzi.io/docs/latest/#proc-configuring-healthchecks-deployment-configuration-kafka

Is it just the parameters listed there that are configurable then?

If so, does that mean I'm out of luck if I want a custom health check?

@tombentley
Member

We don't support custom health checks for connectors in a Kafka Connect deployment. The health checks for the deployment are based on the Connect REST endpoint, rather than on the individual connectors deployed within it.
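For reference, here is a sketch of roughly what the operator-generated probe section of the Connect Deployment looks like. The exact path and port are assumptions based on Kafka Connect's default REST listener (8083); the point is that the probe targets the REST API as a whole:

```yaml
# Sketch of the operator-generated probes (not user-editable).
# The httpGet target is an assumption based on Connect's default REST port.
livenessProbe:
  httpGet:
    path: /
    port: 8083
  initialDelaySeconds: 60
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /
    port: 8083
  initialDelaySeconds: 60
  timeoutSeconds: 5
```

A probe like this only confirms that the REST API is responding; a failed connector task leaves it passing.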

However, what you're trying to do makes sense, so we should try to support this use case somehow.

Is it just the parameters listed there that are configurable then?

Right
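In other words, only the probe timings can be set in the KafkaConnect resource. A minimal sketch, assuming the parameter names from the linked healthcheck docs (`initialDelaySeconds`, `timeoutSeconds`):

```yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaConnect
metadata:
  name: my-connect
spec:
  # Only the timings are configurable; the probe action itself
  # (an HTTP check of the Connect REST API) is fixed by the operator.
  livenessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 5
  readinessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 5
```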

@scholzj
Member

scholzj commented Mar 27, 2019

However, what you're trying to do makes sense, so we should try to support this use case somehow.

TBH, I don't think it does. You cannot bind the pod's health to any particular connector. There might be many tasks of many connectors running on the same pod, and you cannot restart it just because a single connector/task fails. So I think the health checks are correct in not including the connectors.

There is for sure space for additional tooling to monitor/restart the connectors (such as #1468). But the health check is not where this should be addressed.

@tombentley
Member

Yes, I agree that #1468, as a completely different way of using connectors within Kubernetes, would be able to handle this better. But TBH a lot depends on how people choose to use the existing Kafka Connect support. If they're deploying a single connector, then restarting the Connect deployment is a perfectly valid thing to do. Of course it doesn't work if you're using the Connect deployment to host a bunch of unrelated connectors. But users might well decide to have multiple separate Kafka Connect deployments when they have unrelated connectors.

@scholzj
Member

scholzj commented Mar 27, 2019

Yeah, users might decide to run only a single connector and it might be a wise decision. But they might also not do it. The current implementation takes Kafka Connect as it was designed - as a framework for running many connectors.

@jslusher
Author

Thanks for the information and feedback. I could be wrong, but it seems like it would be easy to implement a way to customize connector health checks, especially since a default is available for anyone who does want to use the same Connect deployment for multiple connections. I would guess that in many if not most situations, even if the deployment hosts multiple connections, the default health check would still need to be made more specific to the deployment. Simply checking that the root path of the API endpoint is available only confirms that it's responding; it does not account for the many variations of what its functionality might be. In our situation, since the health check is not sufficient to detect a failed connection from the plugin to MySQL, I'll need to separate the connector deployment from the operator entirely, which I think is worse than customizing a health check.

@tomasz-zylka-fluke

tomasz-zylka-fluke commented Oct 7, 2020

We also find this health-check issue very serious in our deployments, and we're looking for a way to bypass it.
Based on this conversation, we're going to try the solution of one KafkaConnect cluster per KafkaConnector.
We have three Debezium PostgreSQL connectors, so my understanding is that we will have three KafkaConnect resources and three KafkaConnector resources, right?
@scholzj, can you help me understand where I should put the liveness probe to make sure we restart the pod when the connector is not running?
In here?

apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaConnect
metadata:
  name: kafka-connect-cluster
  annotations:
  # use-connector-resources configures this KafkaConnect
  # to use KafkaConnector resources to avoid
  # needing to call the Connect REST API directly
    strimzi.io/use-connector-resources: "true"
spec:
  image: address
  replicas: 1
  bootstrapServers: kafka-cluster-kafka-bootstrap:9092
  config:
    config.storage.replication.factor: 1
    offset.storage.replication.factor: 1
    status.storage.replication.factor: 1
    config.providers: file
    config.providers.file.class: org.apache.kafka.common.config.provider.FileConfigProvider
  livenessProbe:
    exec:
      command:
      - sh
      - ...

@scholzj
Member

scholzj commented Oct 7, 2020

As explained above, you cannot configure the actual probe, just the timings.

@tomasz-zylka-fluke

tomasz-zylka-fluke commented Oct 7, 2020

like @jslusher wrote:

If the mysql server gets restarted or otherwise becomes unavailable, the connect pod throws an exception and stops, waiting to be restarted manually.

@scholzj you're saying there is no way to bypass it, i.e. to restart the pod automagically?
This is a really serious issue, though.

@tomasz-zylka-fluke

If I can't configure the actual probe (only the timings), then the default probe will not detect that the connector is not running, right?

@tombentley
Member

This is a consequence of two different ways of handling failure. On the one hand, the Kubernetes approach is that when a container fails, it's better for the process to exit, and Kube will spin up a new one. On the other hand, Kafka Connect is designed as a runtime which hosts a number of connectors, some of which might fail. In general it doesn't make sense for the Kafka Connect process to exit just because one connector task encountered an error and stopped, because all the other connectors would then take a hit too. The availability of each connector would be constrained by the availability of all the others. The pattern of running a single connector per Kafka Connect cluster doesn't help here.

There's an existing proposal which seeks to address this: https://github.com/strimzi/proposals/blob/master/007-restarting-kafka-connect-connectors-and-tasks.md. @tomasz-zylka-fluke it would be useful to know whether or not that would help with your specific use case.

@scholzj
Member

scholzj commented Oct 7, 2020

If the thing you quoted is right:

If the mysql server gets restarted or otherwise becomes unavailable, the connect pod throws an exception and stops, waiting to be restarted manually.

This would be a bug in the Connect framework (and possibly in the connector) and should be raised with Apache Kafka in the first place. A connector should not be able to kill the whole Connect worker.

If it is just the connector that fails, you should restart the connector. Restarting the whole Connect deployment is not the right thing to do. It is as if, when a Kubernetes pod failed, you had to restart the whole worker node instead of just the pod. So I can see how in your case that might be useful, but it is IMHO not the right pattern. So I do not agree that this is a serious issue.

@tomasz-zylka-fluke

tomasz-zylka-fluke commented Oct 7, 2020

@scholzj sorry, my bad - you're right that only the connector is stopped, not the whole Connect cluster.
But still, in our particular case it is a serious issue: the connector's functionality stops working and our system has no clue about it, so it can no longer be a self-healing system, and someone must monitor the connector status and manually restart it.

@tomasz-zylka-fluke

@tombentley I will read your link and get back to you - thanks!

@scholzj
Member

scholzj commented Oct 7, 2020

But still, in our particular case it is a serious issue: the connector's functionality stops working and our system has no clue about it, so it can no longer be a self-healing system, and someone must monitor the connector status and manually restart it.

Yes, that is correct: you have to monitor the connector status and restart it yourself. As described in the link shared by Tom, the connector operator is still a work in progress. Once it is finished, it should allow you to run connectors like that, including restarts.
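Until restarts are automated by the operator, monitoring can be scripted against Kafka Connect's standard REST API (`GET /connectors/{name}/status`, `POST /connectors/{name}/restart`). A hedged sketch as a Kubernetes CronJob; the service name (`kafka-connect-cluster-connect-api`), connector name, image, and schedule are placeholders to adapt to your deployment:

```yaml
apiVersion: batch/v1beta1  # batch/v1 on newer clusters
kind: CronJob
metadata:
  name: connector-watchdog
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: check
            image: curlimages/curl  # any image with curl and sh
            command:
            - sh
            - -c
            # If any task of the connector reports FAILED, ask the
            # Connect REST API to restart the connector. The service
            # name below is a placeholder for the Strimzi-created
            # Connect API service in your namespace.
            - >
              if curl -s http://kafka-connect-cluster-connect-api:8083/connectors/my-connector/status
              | grep -q FAILED; then
              curl -s -X POST http://kafka-connect-cluster-connect-api:8083/connectors/my-connector/restart;
              fi
```

This is a stopgap, not a replacement for proper connector-level operator support.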

@tomasz-zylka-fluke

@scholzj @tombentley

There's an existing proposal which seeks to address this: https://github.com/strimzi/proposals/blob/master/007-restarting-kafka-connect-connectors-and-tasks.md. @tomasz-zylka-fluke it would be useful to know whether or not that would help with your specific use case.

It is exactly what we need!

  annotations:
    strimzi.io/restart: "true"

clean and beautiful solution.

What is the process for handling strimzi/proposals? Can I vote on proposals? Do you find this 007 proposal meaningful?
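For context, the proposal puts the annotation on the KafkaConnector resource itself, so a restart could be triggered declaratively. A sketch of what that might look like once implemented; the connector name, cluster label, and connector class are placeholders:

```yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaConnector
metadata:
  name: my-connector
  labels:
    strimzi.io/cluster: kafka-connect-cluster
  annotations:
    # Per the proposal, the operator restarts the connector
    # and then removes this annotation.
    strimzi.io/restart: "true"
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
```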

@scholzj
Member

scholzj commented Oct 7, 2020

It is approved, just waiting for implementation... I think we should check with @ajborley whether this is something he plans to work on or something that someone else should pick up.

@tomasz-zylka-fluke

Great news, thanks! Looking forward to the implementation.

@ajborley
Member

ajborley commented Oct 7, 2020

I began an implementation of proposal 007 but got sidetracked onto other things. I plan to progress my implementation, but it's not currently part of my day job, so I'm afraid I don't know how long it will take.

@scholzj
Member

scholzj commented Oct 7, 2020

@ajborley No pressure ;-). I know that sometimes it is not easy to find the time when other work comes up.

@vmironichev-nix

Hi @ajborley, could you please update us with the status of proposal 007? Thanks!

@scholzj
Member

scholzj commented Jan 12, 2021

I do not remember anymore if it covers absolutely everything, but there is #4114, which should hopefully be merged soon.

@vmironichev-nix

thank you for update @scholzj!

@scholzj
Member

scholzj commented Feb 22, 2022

Triaged on 22.02.2022: The exact issue described here should be handled by restarting the connectors automatically, which is something being considered separately. The health checks are an important part of the operator and are relied on for rolling updates and similar operations. If there are suggestions for how to improve them, they can be discussed separately, but we do not want to make them configurable.

@scholzj scholzj closed this as completed Feb 22, 2022