GH 920 - Topic-based retry support #1664
Conversation
Thanks, @tomazfernandes; we prefer rebasing PRs rather than adding merge commits (we'll rebase at the end anyway). |
Ok @garyrussell, I actually tried that, but couldn't push or force-push to this branch, seemingly because of the open PR (I'm not really experienced with rebasing). Should I try squashing this merge / conflict resolution and rebasing again? Or maybe just leave the conflict there as it was? |
No, don't worry; we'll squash and rebase before merging to master anyway. Weird that you couldn't force-push, though. |
Hi @tomazfernandes I finally found some time to take a quick look at this. It is very impressive, seems to be a complete solution, and will be a valuable addition to the framework. I like the approach. It's a huge amount of code to review, though, so it will take some time. That said, I am inclined to merge it as-is and, perhaps, document it as an "experimental" feature, at least initially; I am sure we can get it into next month's 2.7.0-M2 milestone, as long as you have time to address a few issues.
We would not need the docs for M2, but the sooner, the better. Our only strict asciidoctor rule is … Thanks again for such a significant contribution! |
Hi @garyrussell, that's awesome news! I'm really glad you liked it; thank you very much, I'm really excited about this. Sure, I'll address these issues and implement the tests, no problem. There are a couple of things I've been meaning to improve as well, so I'll get to it. Also, if you have any suggestions please let me know. What would be a good timeframe for me to finish the changes to get into M2? |
M2 is currently scheduled for Feb 17: https://github.com/spring-projects/spring-kafka/milestone/135 So, we have a few weeks. |
Sounds perfect @garyrussell. Thank you very much for the opportunity. |
Hi @garyrussell, just a quick update. I've implemented some changes and improvements, as well as more than 90% test coverage, both integration and unit tests. What's missing is updating the javadocs and addressing the style changes, as well as the documentation, which I plan on doing this week. Unfortunately I had covid and that set me back a couple of weeks; otherwise I'd probably have everything ready by now. I'll commit the code as-is so you can take a look if you want - it won't build with Gradle due to checkstyle issues, but it should build and run normally in IntelliJ or with checkstyle disabled. What would be the deadline for committing the javadoc / style adjustments in order to make it into M2? Thanks! |
@tomazfernandes Thanks. The PR needs to be clean and reviews complete by end of day February 16. The reference manual (which has now moved to …). FYI, we are not working this Friday (12th) or next Monday (15th). |
Sorry to read about your Covid encounter; I hope you are and remain well. |
Thanks for your concern Gary, it wasn't a fun ride at all, but thankfully not as bad as it could have been. I'm well now. About the code: I'll clean up the PR; I think I should have it done by tomorrow. Then I'll work on the documentation until the 16th, but hopefully I'll have it ready sooner. Of course, if you feel that's too close to the M2 release date we can push it back to the following release, although I'd really like it if we can make it for M2. |
Hi @garyrussell, I've cleaned the PR and formatted the javadocs and code the way you asked. If you could take a look, I think it should be fine now. I'll start working on the documentation soon to have it ready for M2, please let me know if there's anything else that needs to be done or anything I might have missed. Thanks again for the opportunity! |
Please rebase to master; thanks. (Or merge; you said you had problems rebasing before). |
83c200d to e5af303
@garyrussell, seems like this time I got the rebase right. I'll start working on the documentation. Thanks. |
Hi @garyrussell, I've created the documentation for the non-blocking retry functionality and updated the relevant parts of the previous documentations. Hope it's ok, please let me know if there's anything else needed. Thanks! |
Excellent work; thanks!
Just a few doc polishing comments and a suggestion to change one constant name.
Thanks again; I think this can go into next week's M2.
Thanks a lot @garyrussell! I'll make these adjustments today after my 9-5. Also, there are a few other improvements / functionalities I'd like to add, such as: …
Do you think it's worth trying to get those into M2, or maybe we should go with what we have now and create a new PR afterwards for the next release (considering all goes well)? I'd probably have it done by Tuesday (since you won't be working tomorrow and next Monday). |
make DLT and DLT processing optional
I thought about that too, while reading the docs, but then I thought it doesn't hurt to log it - they can always use a different application to consume from the DLT. But I guess it would be nice to have it configurable.
include the ability to configure retries from the application.properties files
That would have to be a PR submitted to the Boot team, after they upgrade to 2.7.0.M2 (next Wednesday).
Do you think it's worth trying to get those into M2
It's up to you; if they are small enough, I still have Tuesday to review.
I would, however, suggest a new PR(s) after we merge this, rather than adding to this (rather large) one.
Another thought I just had: we should at least offer an option to use a different groupId for the retry container vs. the main container; otherwise a rebalance on a retry topic will cause an unnecessary rebalance on the main topic.
We also may want to use a different container for each tier of retries, for the same reason.
But we can discuss these after M2.
Thanks again.
...kafka/src/main/java/org/springframework/kafka/retrytopic/RetryTopicConfigurationBuilder.java
@garyrussell thanks for your comments, I agree on the DLT part. For the group id, it gets suffixed with the retry topic's suffix as long as the user provides a groupId for the main topic. You can check it out in RetryTopicConfigurer.java:431; that's one of the functions for the customiser added to the KafkaListenerABPP processListener method. So each retry topic / DLT should get its own consumer group. I think there might be some minor adjustments to make in that logic to cover all possibilities of replicating the main endpoint's configuration; maybe we can look into that some time after M2. But it should be good for most use cases.
Also, I think I should be able to retrieve the application.properties from the ApplicationContext in order to create a retry configuration, shouldn't I? Or even create a bean for that purpose. Maybe there's a more "Springy" way to do that, which probably is the case for other places of the code as well; feel free to point them out as you see them. I'll take a look into how they're handled internally by the Spring Boot app.
As for the other improvements, I think I'll try to code them by Tuesday and maybe commit them to a separate branch, if this PR hasn't been merged by then. Then you can decide whether or not it's small enough for you to review and merge for M2, or if it goes to the release after that - I'll be good with either. Also maybe I can separate the commits so it doesn't have to be an all-or-nothing decision. Thanks a lot again! |
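A minimal sketch of the groupId suffixing behavior described here. This is an illustration only, not framework code: the helper name is hypothetical, and the real logic lives in RetryTopicConfigurer.

```java
public class GroupIdSuffix {

    // Hypothetical helper illustrating the behavior described above: the
    // main listener's groupId gets the retry topic's suffix appended, so
    // each retry topic / DLT consumer ends up in its own consumer group.
    static String retryGroupId(String mainGroupId, String topicSuffix) {
        return mainGroupId + topicSuffix;
    }

    public static void main(String[] args) {
        System.out.println(retryGroupId("orders", "-retry-1000")); // orders-retry-1000
        System.out.println(retryGroupId("orders", "-dlt"));        // orders-dlt
    }
}
```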
I'll add this groupId part to the documentation when I make the adjustments later today. |
Right, the properties are in the environment, but it's a separation of concerns; application.properties/.yml is processed by Boot's auto configuration in … Also, Boot has much goodness in its property mapping, e.g. … I am sure the Boot team will be receptive to a PR submitted to enhance the auto configuration for this important enhancement. |
Sounds great @garyrussell, I'll look into adding it to Spring Boot's properties then, probably after M2. Thank you very much for your support. |
…orresponding events to MessageListenerContainer interface and its implementations.
…nd a ListenerAdapter that makes use of it.
…picConfigurer.java in the retrytopic package for more information.
…oc overview of the functionality.
- Changes in dependency management - RetryTopicBootstrapper - Some refactorings
Many improvements
Test implementations
Hi Tomaz, thanks for the quick response! I am using spring-kafka 2.7.0-M2. This is not a requirement (the requirement is far less frequent TBH); I just wanted to check how this behaves and stumbled across this weird, nearly linear 'magic' 10s delay, which I don't know where it comes from. I've tried to set this to 60s and at the start it seems quite fine, but sometimes it happened to be quite spontaneous; and when message throughput is less 'quiet' (seems like less than 10s in between), I get an almost exact 10s delay for each. Maybe my configuration is not sufficient somehow? I've provided solely the @Backoff(60000L) for this case. Is there something to be configured for Kafka perhaps, or more in the retry feature itself? |
@Duncol Try upgrading to 2.7.0-RC1; with M2, it was tightly coupled to the poll timeout and idle interval; I saw similar issues with M2 and @tomazfernandes made some improvements that are included in RC1. |
@garyrussell @tomazfernandes I've upgraded to RC1, but now it does not seem to retry at all. I still get the KafkaBackoffException (more concise in RC1), logging the approximate backoff, but nothing happens after that (it needs a restart of the service to kick off the next (single) retry). M2 works well for my configuration (except the aforementioned issues with backoff times) and the only things I've changed were M2 -> RC1 and the config data types (int/long -> String) in @RetryableTopic.
More info about my approach: my retry is based on RuntimeExceptions (exception thrown -> should retry; no exception -> no further retries). The first retry is initiated in a try/catch around the core service; the catch block sends the wrapped message to the first retry topic via KafkaTemplate. When the first retry topic is exhausted and the wrapped msg lands in the DLT, I pass that msg to another retry topic (via KafkaTemplate, as for the initial retry topic). After the second retry topic 'completes', the DLT just logs the msg, with no further passing. |
Hi @Duncol, thanks a lot for bringing this up! There's indeed a bug when we use the same factory for the KafkaListener and RetryableTopic annotations. It'll be fixed ASAP, but for now as a workaround if you specify a different factory instance for the RetryableTopic annotation it should work. This scenario will be added to our integration tests so that it doesn't happen again. Please let us know how it turns out. Thanks again! |
@tomazfernandes There still seems to be something amiss - when I changed my test app to use a different factory, I see this
@RetryableTopic(attempts = "5", backoff = @Backoff(delay = 1000, multiplier = 2.0),
listenerContainerFactory = "retryFactory")
@KafkaListener(id = "kgh920", topics = "kgh920")
public void listen(String in, @Header(KafkaHeaders.RECEIVED_TOPIC) String topic) {
LOG.info(in + " from " + topic);
throw new RuntimeException("test");
}

The first retry is +500ms instead of 1s; the next retry is +2s (correct); the next is +2.5s instead of 4s; the next retry is +4s instead of 8s. I then see 8 seconds before it goes to the DLT - in earlier versions, I am sure that it went straight to the DLT after the 8-second delivery attempt failed. |
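For reference, with attempts = 5, delay = 1000 and multiplier = 2.0, the expected inter-retry delays follow standard exponential backoff. This sketch just illustrates the arithmetic behind the expectation above; it is not framework code.

```java
import java.util.Arrays;

// Illustrative only: computes the delays implied by
// @Backoff(delay = 1000, multiplier = 2.0) with attempts = "5"
// (the initial delivery attempt plus 4 retries).
public class BackoffSchedule {

    static long[] delays(long initialDelayMs, double multiplier, int attempts) {
        long[] delays = new long[attempts - 1]; // one delay before each retry
        double d = initialDelayMs;
        for (int i = 0; i < delays.length; i++) {
            delays[i] = (long) d;
            d *= multiplier;
        }
        return delays;
    }

    public static void main(String[] args) {
        // Expected: [1000, 2000, 4000, 8000]
        System.out.println(Arrays.toString(delays(1000, 2.0, 5)));
    }
}
```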
@garyrussell, there are two things going on there. The first is a bug I've found now where the current topic's delay is being used instead of the next topic's - so it'd wrongly be 0, 1s, 2s, 4s, 8s. The ±500ms differences are related to the poll timeout, which has to be a lot smaller in the retry topics if we have low backoffs such as 1s - there's not much time to go through the whole pause -> partition idle event -> resume container -> resume consumer process, considering it takes about 500ms to get there in the first place. I've changed the default configuration to do that in the latest PR regarding this. I've already fixed this bug and will submit a PR after I run the tests. If you can test this out again tomorrow with the fixes, that would be great; I think everything should work as expected. I'll also open the PR for the factories bug. Thanks! |
@tomazfernandes The separate ListenerFactory works; thanks for the quick response, I can move forward now :). Just one side question (maybe more for you @garyrussell) - what is the planned release date of spring-kafka 2.7.0? |
Hi @Duncol! Can you share the code where you’re getting this list from? Is it a batch consumer? Thanks for mentioning, your feedback is very important for us! |
@tomazfernandes (just mind it's a constantly improved WIP and I'm looking for a cleaner way to register a RecordInterceptor just for tests :) ) This is defined in a @TestConfiguration class. Everything starts as a single message sent via KafkaTemplate. |
Hmm, that's strange. What's the "callbackKafkaListenerContainerFactory" for? The retry topic mechanism relies on the battle-tested dead letter publishing recoverer to forward messages, and we didn't see any behavior like this before. The only possibility I see from the feature side would be if we're registering two consumers per topic, which again never happened in our tests. So what comes to mind is this: … The other possibility I see would be if you're for some reason registering two interceptors for the same factory instance; I'm not really sure how to check for that, but probably a breakpoint in the interceptor assignment will do. @garyrussell, any thoughts on this? |
According to your screenshot, all the interceptors are placing their records into the same … Does that ring a bell?
@tomazfernandes 'callbackKafkaListenerContainerFactory' is for our main @KafkaListener. I'm not placing the @RetryableTopic directly over this listener - instead, I have separate handler with two listening methods (each having @KafkaListener + @RetryableTopic over them). I also have one method with @DltHandler.
callbackRetryKafkaListenerContainerFactory and callbackRetryAuxKafkaListenerContainerFactory share the same ConsumerFactory. |
@artembilan so in each retry, the message goes through both the @KafkaListener's and the @RetryableTopic's listenerContainerFactory (and thus both configured consumers)? Such a scenario would match my outcome. I would expect that only the initial msg arrival is consumed by the consumer configured for callbackRetryKafkaListenerContainerFactory (i.e. @KafkaListener), and that the retries just utilize callbackRetryAuxKafkaListenerContainerFactory's (i.e. @RetryableTopic) consumer for single message consumption. Is this the way the retry works? |
Hmm, well, then you have two listeners per topic, hence two instances of the same message in your collection... That's the expected behavior, right? I didn't understand what exactly you're trying to achieve with this pattern; a single @KafkaListener method with @RetryableTopic should suffice. Also, unless that's somehow a requirement, you don't need to handle forwarding to the next topic manually in the DLT method; instead you should let the exception go all the way back to the listener (outside of any try/catch) and the framework will handle message forwarding for you. Does that make sense? |
@Duncol Perhaps you misunderstood @tomazfernandes when we had that bug (needing two factories). This is what he meant:

@SpringBootApplication
public class Kgh920Application {
private static final Logger LOG = LoggerFactory.getLogger(Kgh920Application.class);
public static void main(String[] args) {
SpringApplication.run(Kgh920Application.class, args);
}
@RetryableTopic(attempts = "5", backoff = @Backoff(delay = 1000, multiplier = 2.0),
listenerContainerFactory = "retryFactory")
@KafkaListener(id = "kgh920", topics = "kgh920")
public void listen(String in, @Header(KafkaHeaders.RECEIVED_TOPIC) String topic) {
LOG.info(in + " from " + topic);
throw new RuntimeException("test");
}
@DltHandler
public void dlt(String in, @Header(KafkaHeaders.RECEIVED_TOPIC) String topic) {
LOG.info(in + " from " + topic);
}
@Bean
ConcurrentKafkaListenerContainerFactory<?, ?> retryFactory(
ConcurrentKafkaListenerContainerFactoryConfigurer configurer,
ObjectProvider<ConsumerFactory<Object, Object>> kafkaConsumerFactory,
KafkaProperties kafkaProps) {
ConcurrentKafkaListenerContainerFactory<Object, Object> factory = new ConcurrentKafkaListenerContainerFactory<>();
configurer.configure(factory, kafkaConsumerFactory
.getIfAvailable(() -> new DefaultKafkaConsumerFactory<>(kafkaProps.buildConsumerProperties())));
return factory;
}
}

i.e. specify a different factory on the retry annotation. Using a different factory is no longer needed now that the bug has been fixed, but if you still want to do it, it needs to go on the retry annotation. |
Given that there is a '-retry' topic created (single topic strategy) for the retry, I assumed that the listener for the initial topic (without the '-retry' suffix) was somehow consuming messages from the '-retry' topics, thus the duplicate msg. I'll check it with the fixes; thanks a lot again, I much appreciate the work you are doing! The manual forward to the next topic is due to a different backoff/attempt requirement. Can it be reconfigured on the same retry somehow, as a 'second tier approach' maybe? |
Hi @Duncol, sorry, I totally missed this message, was cleaning up the inbox and found it now. How is the feature working for you, is it behaving as expected? Can you share more details on your retrial requirements, such as number of attempts, delays, etc? Maybe we can work something out to include this second tier. Thanks and sorry again for taking this long to reply. |
@tomazfernandes No worries. In the simplest words: we want to initialize a retry on a particular issue (exception). The retry should happen every 1m for the first 30m and then switch to an 'every 1h' mode for the next 48h. After that, it should land in the DLT. It works nearly as expected with my current solution (incorporating the DLT with a simple 'if statement') for the interval switch (next topic), but one thing that is not working right is that the first retry happens immediately when the message reaches each topic (i.e. the backoff seems to be applied only in between retries). I assume it's due to the fact that I am sending a new msg to the separate topic with @RetryableTopic. I've done it this way to decouple the logic a bit + have an entirely different msg model for passing some metadata. Do you see maybe some solutions for this case? Thanks a lot for this feature and your persistent help again! |
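To make the requirement above concrete, here is the schedule it implies, worked out as plain arithmetic (not framework configuration): 30 one-minute retries followed by 48 one-hour retries.

```java
// Sketch of the two-tier retry schedule described above: every 1 minute
// for the first 30 minutes, then every 1 hour for the next 48 hours.
public class TieredSchedule {

    static long[] scheduleMs() {
        long[] delays = new long[30 + 48];
        for (int i = 0; i < 30; i++) {
            delays[i] = 60_000L;        // tier 1: 1 minute
        }
        for (int i = 30; i < delays.length; i++) {
            delays[i] = 3_600_000L;     // tier 2: 1 hour
        }
        return delays;
    }

    public static void main(String[] args) {
        long total = 0;
        for (long d : scheduleMs()) {
            total += d;
        }
        // 78 retries spanning 30 min + 48 h = 174,600,000 ms
        System.out.println(scheduleMs().length + " retries, " + total + " ms total");
    }
}
```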
Hi @Duncol, I'm glad the feature is working out for you! The feature is not prepared to handle 2 tiers of delays, but it can handle delaying the message arrival in the first topic of the second tier very easily. All consumers within the Retry Topic feature are wrapped in a KafkaBackoffAwareMessageListenerAdapter, which looks for a back off timestamp header and backs off until that timestamp if it's in the future. So pretty much all you have to do is add the back off timestamp header to the message you send to the other topic, and it will back off until the time you specify in the header. It should be something like this: … Let me know if that works out for you! And thanks again for the feedback! |
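A minimal sketch of the idea, computing the timestamp value for the back off header (pure Java; actually attaching the bytes as a record header via KafkaTemplate is left out). The header name is spring-kafka's RetryTopicHeaders.DEFAULT_HEADER_BACKOFF_TIMESTAMP, but the value encoding assumed here (epoch millis as a UTF-8 string) should be verified against KafkaBackoffAwareMessageListenerAdapter in the version you use.

```java
import java.nio.charset.StandardCharsets;
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;

public class BackoffHeader {

    // Computes the "due" timestamp for the back off header: now + delay.
    // Encoding as an epoch-millis UTF-8 string is an assumption, not a
    // confirmed spring-kafka contract -- verify before relying on it.
    static byte[] backoffTimestampValue(Clock clock, long delayMs) {
        long dueAt = Instant.now(clock).toEpochMilli() + delayMs;
        return String.valueOf(dueAt).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // With a fixed clock at epoch 1,000,000 ms and a 60s delay,
        // the consumer should back off until 1,060,000 ms.
        Clock fixed = Clock.fixed(Instant.ofEpochMilli(1_000_000L), ZoneOffset.UTC);
        System.out.println(new String(backoffTimestampValue(fixed, 60_000L), StandardCharsets.UTF_8));
    }
}
```

In a real producer you would add this value under the back off timestamp header on the ProducerRecord sent to the second tier's first topic, so the wrapped adapter pauses the partition until the timestamp passes.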
Hi @tomazfernandes, sorry for the lack of response, but I was quite occupied with some other tasks. I'll try to implement your suggestion (which looks quite nice and seems nearly a silver bullet for our case - and for any further granular adjustments we might need regarding changing the back-off) and let you know ASAP. One other thing I've stumbled across which might be interesting: it seems that @RetryableTopic creates an additional consumer with the same clientId. This causes the app to throw an exception (logged as WARN) regarding initializing an MBean ('Already Exists'):
I've tried a separate listenerContainerFactory for the @RetryableTopic with an explicit clientId, different from the one I set for the particular @KafkaListener, but the one from @KafkaListener takes precedence. When I comment out the @RetryableTopic, those additional consumers are not registered and the problem disappears. Have you encountered something similar maybe? Those WARNs do not break anything, but as I've read, it may cut off some metrics, and it pollutes the logfile a bit :) |
Hi @Duncol, thanks for bringing this up! I'm having trouble reproducing your issue. When we specify a clientIdPrefix in the @KafkaListener annotation, the prefix gets suffixed by the topic's suffix (e.g. retry-250). And when we don't, Kafka's ConsumerConfig class has a monotonically increasing number that it appends to the consumer's id so that they're unique at least within the same app instance (line 576). In my tests all consumers end up with different client ids. Which Spring Kafka version are you using? Can you share more details on how we can reproduce the issue? Maybe @garyrussell has something to add? Thanks again for your input! |
I can't reproduce it either; when I enable info logging for sample-04, I get these client ids when I add …
The exception seems to indicate you have multiple listeners with the same … If you can put together a minimal example that exhibits the behavior, we can take a look. |
Please refer to the RetryTopicConfigurer class' JavaDoc for an overview of the functionalities: https://github.com/tomazfernandes/spring-kafka/blob/GH-920/spring-kafka/src/main/java/org/springframework/kafka/retrytopic/RetryTopicConfigurer.java
I've separated the code into 5 commits: