Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Kafka Health Indicator #14088

Closed
ST-DDT opened this issue Aug 17, 2018 · 22 comments
Closed

Add Kafka Health Indicator #14088

ST-DDT opened this issue Aug 17, 2018 · 22 comments
Labels
status: declined A suggestion or change that we don't feel we should currently apply

Comments

@ST-DDT
Copy link

ST-DDT commented Aug 17, 2018

In previous versions of Spring-Boot there was an inbuild health indicator for Kafka, however somewhere along the way it was lost.

Refs:

Please add the HealthIndicator for Kafka again and add metrics as well.
This can be achieved using the following code:

(includes both metrics and health)

@Configuration
public class KafkaConfig {

	@Autowired
	private KafkaAdmin admin;

	@Autowired
	private MeterRegistry meterRegistry;

	@Autowired
	private Map<String, KafkaTemplate<?, ?>> kafkaTemplates;

	@Bean
	public AdminClient kafkaAdminClient() {
		return AdminClient.create(admin.getConfig());
	}

	@SuppressWarnings("deprecation") // Can be avoided by relying on Double.NaN for non doubles.
	@PostConstruct
	private void initMetrics() {
		final String kafkaPrefix = "kafka.";
		for (Entry<String, KafkaTemplate<?, ?>> templateEntry : kafkaTemplates.entrySet()) {
			final String name = templateEntry.getKey();
			final KafkaTemplate<?, ?> kafkaTemplate = templateEntry.getValue();
			for (Metric metric : kafkaTemplate.metrics().values()) {
				final MetricName metricName = metric.metricName();
				final Builder<Metric> gaugeBuilder = Gauge
						.builder(kafkaPrefix + metricName.name(), metric, Metric::value) // <-- Here
						.description(metricName.description());
				for (Entry<String, String> tagEntry : metricName.tags().entrySet()) {
					gaugeBuilder.tag(kafkaPrefix + tagEntry.getKey(), tagEntry.getValue());
				}
				gaugeBuilder.tag("bean", name);
				gaugeBuilder.register(meterRegistry);
			}
		}
	}

	@Bean
	public HealthIndicator kafkaHealthIndicator() {
		final DescribeClusterOptions describeClusterOptions = new DescribeClusterOptions().timeoutMs(1000);
		final AdminClient adminClient = kafkaAdminClient();
		return () -> {
			final DescribeClusterResult describeCluster = adminClient.describeCluster(describeClusterOptions);
			try {
				final String clusterId = describeCluster.clusterId().get();
				final int nodeCount = describeCluster.nodes().get().size();
				return Health.up()
						.withDetail("clusterId", clusterId)
						.withDetail("nodeCount", nodeCount)
						.build();
			} catch (InterruptedException | ExecutionException e) {
				return Health.down()
						.withException(e)
						.build();
			}
		};

	}

}

Feel free to use or modify the code as you see fit.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Aug 17, 2018
@snicoll snicoll changed the title [Actuator] Add KafkaHealthIndicator again + Metrics Add KafkaHealthIndicator again + Metrics Aug 17, 2018
@snicoll
Copy link
Member

snicoll commented Aug 17, 2018

However somewhere along the way it was lost.

It wasn't lost, It was reverted for the reason exposed in #12225. If you have something that address the concern expressed there, I am more than happy to hear from you. Thanks for sharing but a piece of code with no tests is not something we can use.

As for the metrics support, this is unrelated and we don't deal with several topics in a single issue. There is already an issue in the micrometers project that you could subscribe to.

@snicoll snicoll changed the title Add KafkaHealthIndicator again + Metrics Add Kafka Health Indicator Aug 17, 2018
@snicoll snicoll added the status: waiting-for-feedback We need additional information before we can continue label Aug 17, 2018
@ST-DDT
Copy link
Author

ST-DDT commented Aug 17, 2018

Thanks for sharing but a piece of code with no tests is not something we can use.

Thats why its a feature request and not a pull request.
The code is just an example that might help someone, who knows the internals of Spring, but not the internals of Kafka, to implement this feature. (Or as a snippet to copy for anybody else who wants to use it)

As for the metrics support, this is unrelated and we don't deal with several topics in a single issue.

Sorry. I thought both of them would be monitoring, but I'll use separate issues for that in the future.

@snicoll
Copy link
Member

snicoll commented Aug 17, 2018

As I've already indicated we've tried to implement it already. See #12225 and the reasons why it got reverted. If you can help in that area we're most certainly interested.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Aug 17, 2018
@snicoll snicoll added status: waiting-for-feedback We need additional information before we can continue and removed status: feedback-provided Feedback has been provided labels Aug 17, 2018
@spring-projects-issues
Copy link
Collaborator

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

@spring-projects-issues spring-projects-issues added the status: feedback-reminder We've sent a reminder that we need additional information before we can continue label Aug 24, 2018
@ST-DDT
Copy link
Author

ST-DDT commented Aug 24, 2018

I currently cannot help you with that.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue status: feedback-reminder We've sent a reminder that we need additional information before we can continue labels Aug 24, 2018
@MartinX3
Copy link

That's my personal solution.
If the kafka bus is down (standard timeout 60 seconds) it will show the kafka as "down" at the actuator health endpoint.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

@Component
public class KafkaHealthIndicator implements HealthIndicator {
    private final Logger log = LoggerFactory.getLogger(KafkaHealthIndicator.class);

    private KafkaTemplate<String, String> kafka;

    public KafkaHealthIndicator(KafkaTemplate<String, String> kafka) {
        this.kafka = kafka;
    }

    /**
     * Return an indication of health.
     *
     * @return the health for
     */
    @Override
    public Health health() {
        try {
            kafka.send("kafka-health-indicator", "❥").get(100, TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            return Health.down(e).build();
        }
        return Health.up().build();
    }
}

@vspiliopoulos

This comment has been minimized.

@philwebb

This comment has been minimized.

@philwebb philwebb added type: enhancement A general enhancement and removed status: feedback-provided Feedback has been provided status: waiting-for-triage An issue we've not yet triaged labels Nov 16, 2018
@philwebb philwebb added this to the 2.x milestone Nov 16, 2018
@zacyang
Copy link

zacyang commented Dec 31, 2018

I have been looking into some out-of-box solution for health indicator for Kafka.

It worth to notice that @MartinX3's solution can only provide connectivity health check while @ST-DDT 's solution can provide connectivity health check and some meta info of the cluster.

It would have to combine acks min.insync.replicas and nodeCount to give a meaningful health indicator, (as it how Kafka consider a message is committed). Otherwise, we could end up with health up but any message sent would end up with a failure.

@otavioprado
Copy link

Any update about it?

@edgardrp
Copy link

I've seen some examples of the /health endpoint in which it seems there's a "broker" component, it would be terrific if that is filled with something like @MartinX3 or @ST-DDT have pointed out.
Is there any update on this?

@onobc
Copy link
Contributor

onobc commented Apr 13, 2021

@snicoll @wilkinsona I would ❤️ to pick this work back up.

At my company we have many many apps all implementing their own copy/pastad variant of health checks for Kafka and KafkaStreams. I was getting ready to write an internal starter but of course wanted to first see if we could add it back into Spring Boot.

I came across this issue (and the others related to the revert) and understand the reason for reverting. What are your thoughts about this in 2021? 😸

@wilkinsona
Copy link
Member

I'm afraid I don't know enough about Kafka and assessing its health to know for certain what we should do here.

@onobc
Copy link
Contributor

onobc commented Apr 16, 2021

Yeh, It is really unfortunate that Kafka does not provide a mechanism to check health or at least give an opinion on a recommended approach.

Looking back through the revert ticket it seems like w/ the opt-in repl factor check and the updates to the admin client that it was really close to being usable. It also seems that you and @snicoll came across another roadblock that made a case for reverting. I know there are 1001 other things to be dealing w/ so I'm not trying to re-hash the past - but rather understand the limitations and see if there is something that would make sense out of the box.

I am not super familiar w/ Cassandra but I see in the CassandraDriverHealthIndicator that it is considered healthy if at least 1 node reports as "up". How is this case different?

I am curious to get @garyrussell opinion on what a "good" Kafka health indicator would be. He seems to dabble in Kafka from time to time ;)

@garyrussell
Copy link
Contributor

garyrussell commented Apr 16, 2021

One of the stumbling blocks is when using transactions - the number of active brokers and in-sync replicas are broker configuration properties, which are not available on the client.

If an application is using transactions and there are not enough brokers to publish a record to a particular partition, the producer hangs until a timeout occurs. It is made more complicated because min.insync.replicas is also a topic-level configuration.

There are just too many of these corner cases to come up with a single robust health indicator.

@onobc
Copy link
Contributor

onobc commented Apr 17, 2021

the number of active brokers and in-sync replicas are broker configuration properties, which are not available on the client.

It is made more complicated because min.insync.replicas is also a topic-level configuration.

Thanks for clarifying @garyrussell. I understand the concern in those other issues now.

@onobc
Copy link
Contributor

onobc commented Apr 17, 2021

I promise not to turn this into a KafkaStreams thread discussion but am curious if there are hidden "stumbling blocks" for KafkaStreams in this area as well?

For KafkaStreams health indicators at my company we have been using the KafkaStreams instance (as returned from StreamsBuilderFactoryBean state, specifically KafkaStreams.state().isRunningOrRebalancing() and it has been working well for us. Is there some other underlying issue w/ this approach that I am not aware of that would make this not a good choice for a robust KafkaStreams health indicator?

@garyrussell
Copy link
Contributor

I you are using exactly once semantics, KafkaStreams will be affected by the insufficient in-sync replicas problem too.

@onobc
Copy link
Contributor

onobc commented Apr 19, 2021

I you are using exactly once semantics, KafkaStreams will be affected by the insufficient in-sync replicas problem too.

Thanks for that info @garyrussell , good to know. We are not using that currently in our streams app.

It seems that "transaction" / "exactly once semantics" case is what adds a good deal of complexity to this (via need for in-sync replicas check). I wonder if it would make sense to add a simple health check that does not cover that case. Although, I don't think there is a good way to conditionally auto-configure that based on whether or not the app is configuring/using that feature of Kafka/KafkaStreams and it could be confusing to users if it works for all but that case.

@wilkinsona
Copy link
Member

Although, I don't think there is a good way to conditionally auto-configure that based on whether or not the app is configuring/using that feature of Kafka/KafkaStreams and it could be confusing to users if it works for all but that case.

I think that's the crux of this one and, as such, I don't think we should try to provide one. I think the risk of giving an inaccurate status is too high.

IMO, we should close this one. Let's see what the rest of the team thinks.

@wilkinsona wilkinsona added the for: team-attention An issue we'd like other members of the team to review label Apr 30, 2021
@onobc
Copy link
Contributor

onobc commented Apr 30, 2021

After digging in more, I agree that there is not an easy way to provide a one-size-fits-all solution. Closing this ticket would probably solidify that decision.

If something is made available by Kafka in the future that makes this feasible, then a ticket can be created at that time.

@philwebb
Copy link
Member

philwebb commented May 4, 2021

+1 to closing. I'm going to go ahead and do that. Thanks for your efforts @bono007

@philwebb philwebb closed this as completed May 4, 2021
@philwebb philwebb added status: declined A suggestion or change that we don't feel we should currently apply and removed for: team-attention An issue we'd like other members of the team to review type: enhancement A general enhancement labels May 4, 2021
@philwebb philwebb removed this from the 2.x milestone May 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: declined A suggestion or change that we don't feel we should currently apply
Projects
None yet
Development

No branches or pull requests