
Enforcing container to stop on transaction fencing #1612

Closed
Rajh opened this issue Nov 9, 2020 · 12 comments · Fixed by #1614

Comments

@Rajh

Rajh commented Nov 9, 2020

Affects Version(s): 2.5.6.RELEASE

Regarding this Stack Overflow question:
https://stackoverflow.com/questions/64665725/why-i-lost-messages-when-using-kafka-with-eos-beta/64665928

When a ProducerFencedException | FencedInstanceIdException occurs, the framework assumes that the consumer will stop, and thus only logs it:

catch (ProducerFencedException | FencedInstanceIdException e) {
    this.logger.error(e, "Producer or '"
            + ConsumerConfig.GROUP_INSTANCE_ID_CONFIG
            + "' fenced during transaction");
}

It would be useful to have a safeguard that enforces stopping the consumer container, to prevent losing messages through unexpected behavior around this exception.

@garyrussell
Contributor

The consumer cannot tell if the fenced exception is caused by a rebalance or timeout. This is a known limitation of Kafka itself, and will be addressed in a future release.

We can certainly add an option to stop the container on these exceptions, but the proper fix can only be done when the new exception is thrown for a timeout.

https://issues.apache.org/jira/browse/KAFKA-9803

@garyrussell
Contributor

In the meantime, you should make sure the transaction.timeout.ms is long enough to process a batch of records without timing out.
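
For reference, a minimal sketch of raising transaction.timeout.ms on the producer factory; the broker address, serializers, transaction-id prefix, and the 60-second value are placeholders, not recommendations from this thread:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.ProducerFactory;

@Bean
public ProducerFactory<String, String> producerFactory() {
    Map<String, Object> props = new HashMap<>();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    // Give a whole batch enough time to be processed inside one transaction
    // (illustrative value; size it to your actual batch processing time).
    props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 60_000);
    DefaultKafkaProducerFactory<String, String> pf = new DefaultKafkaProducerFactory<>(props);
    pf.setTransactionIdPrefix("tx-"); // makes the factory transactional
    return pf;
}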

@Rajh
Author

Rajh commented Nov 9, 2020

Reading https://issues.apache.org/jira/browse/KAFKA-9803, I can see:

Currently the coordinator does not distinguish these two cases. Both will end up as a ProducerFencedException, which means the producer needs to shut itself down.

Actually, the producer is not shutting down, is it? Or if it is, what happens to the container? Since it is not rolling back, does it try to re-produce the message, with the ProducerFactory creating a new producer?

@garyrussell
Contributor

The producer factory should recycle the producer (it does for me with your test application with the short tx timeout). Any error on begin/abort/commitTransaction() closes the producer and a new one is created.

@Rajh
Author

Rajh commented Nov 9, 2020

Yes, but since the exception is ignored, the consumer skips the messages from the failed producer and continues with a new batch and a new producer.

@garyrussell
Contributor

Right, so until they fix it, the only thing we can do is stop the container.

Here's another work-around:

Set the container's idleEventInterval. Add an event listener to catch ListenerContainerIdleEvent; then perform a dummy (no-op) transaction; e.g. once a day.

@Rajh
Author

Rajh commented Nov 9, 2020

What kind of no-op can be done to keep a transaction alive? (without producing a message :D)

@garyrussell
Contributor

template.executeInTransaction(t -> {
    return null;
});

It will just do beginTransaction(), commitTransaction().
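
For reference, a minimal sketch of that workaround end to end: an event listener bean that runs the no-op transaction whenever the container goes idle. The bean name and the once-a-day interval are illustrative, not from this thread:

import org.springframework.context.event.EventListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.event.ListenerContainerIdleEvent;
import org.springframework.stereotype.Component;

@Component
public class IdleTransactionKeepAlive {

    private final KafkaTemplate<String, String> template;

    public IdleTransactionKeepAlive(KafkaTemplate<String, String> template) {
        this.template = template;
    }

    // Requires the container's idleEventInterval to be set, e.g.
    // factory.getContainerProperties().setIdleEventInterval(86_400_000L); // roughly once a day
    @EventListener
    public void onIdle(ListenerContainerIdleEvent event) {
        // No-op transaction: just beginTransaction()/commitTransaction(),
        // keeping the transactional producer from timing out while idle.
        this.template.executeInTransaction(t -> null);
    }
}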

garyrussell added a commit to garyrussell/spring-kafka that referenced this issue Nov 9, 2020
@Rajh
Author

Rajh commented Nov 10, 2020

Can handling ContainerStoppedEvent in order to restart the container have side effects?

Say we have 3 instances of the application running against a topic with 3 partitions (1 partition for each consumer).
When a consumer's transaction times out, the container is stopped, which triggers a rebalance for the others.
Then the container is restarted, which triggers another rebalance.

Can those rebalances trigger a ProducerFencedException, this time due to a rebalance rather than a transaction timeout?
And if so, can restarting every time cause a sort of deadlock, or disturb the rebalances?
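
For reference, a minimal sketch of the restart approach being asked about here, assuming the event's source is the container to restart and handing the restart off to another thread. The class name and executor choice are illustrative only; with a concurrent container you may instead want to locate and restart the parent container:

import org.springframework.context.event.EventListener;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.kafka.event.ContainerStoppedEvent;
import org.springframework.kafka.listener.MessageListenerContainer;
import org.springframework.stereotype.Component;

@Component
public class RestartAfterFence {

    private final SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor();

    @EventListener
    public void onStopped(ContainerStoppedEvent event) {
        // Assumption: the event source is the stopped container; restart it
        // off the publishing thread so the caller is not blocked.
        MessageListenerContainer container = (MessageListenerContainer) event.getSource();
        this.executor.execute(container::start);
    }
}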

@garyrussell
Contributor

Yes, it will cause a rebalance, although that can be avoided by setting a unique ConsumerConfig.GROUP_INSTANCE_ID_CONFIG on each instance (using static membership).
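
For reference, static membership just means giving each instance a unique, stable group.instance.id in its consumer configuration; a minimal sketch, where the broker address, group id, and the environment variable supplying the per-instance id are placeholders:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

// Sketch: each application instance supplies its own stable id
// (e.g. from an env var) so a restart does not trigger a rebalance.
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, System.getenv("INSTANCE_ID")); // must be unique per instance
DefaultKafkaConsumerFactory<String, String> cf = new DefaultKafkaConsumerFactory<>(props);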

@Rajh
Author

Rajh commented Nov 10, 2020

We are using dynamic auto-scaling, which means our consumer partition assignments are not static.

Causing a rebalance does not really matter; this FencedException should be rare, and causing a rebalance to recover is acceptable.
My concern was more about a deadlock: a rebalance causing a FencedException, causing another rebalance, and so on.

@garyrussell
Contributor

It won't deadlock, even if the other instances are not well-behaved (if they don't handle the rebalance in a timely manner).

artembilan pushed a commit that referenced this issue Nov 10, 2020
Resolves #1612

**cherry-pick to 2.5.x**

* Add @since to javadocs; retain root cause of `StopAfterFenceException`.

* Add reason to `ConsumerStoppedEvent`.

Resolves #1618

Also provide access to the actual container that stopped the consumer, for
example to allow restarting after stopping due to a producer fenced exception.

* Add `@Nullable`s.

* Test polishing.
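
For reference, a hedged sketch of enabling the behavior added by this fix, assuming the container property is named stopContainerWhenFenced (check the release notes of your version for the exact name):

import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // Stop the container instead of only logging when the producer is fenced
    // (assumed property name, introduced by the fix for this issue).
    factory.getContainerProperties().setStopContainerWhenFenced(true);
    return factory;
}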
garyrussell added a commit that referenced this issue Nov 10, 2020