
Enforcing container to stop on transaction fencing #1612

Closed
Rajh opened this issue Nov 9, 2020 · 12 comments · Fixed by #1614

Comments

@Rajh

Rajh commented Nov 9, 2020

Affects Version(s): 2.5.6.RELEASE

Regarding this Stack Overflow question:
https://stackoverflow.com/questions/64665725/why-i-lost-messages-when-using-kafka-with-eos-beta/64665928

When a ProducerFencedException | FencedInstanceIdException occurs, the framework assumes that the consumer will stop, and thus only logs it:

catch (ProducerFencedException | FencedInstanceIdException e) {
    this.logger.error(e, "Producer or '"
            + ConsumerConfig.GROUP_INSTANCE_ID_CONFIG
            + "' fenced during transaction");
}

It would be useful to have a safeguard that enforces stopping the consumer container, to prevent losing messages through unexpected behavior around this exception.

@garyrussell
Contributor

The consumer cannot tell if the fenced exception is caused by a rebalance or timeout. This is a known limitation of Kafka itself, and will be addressed in a future release.

We can certainly add an option to stop the container on these exceptions, but the proper fix can only be done when the new exception is thrown for a timeout.

https://issues.apache.org/jira/browse/KAFKA-9803

@garyrussell
Contributor

In the meantime, you should make sure the transaction.timeout.ms is long enough to process a batch of records without timing out.
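
For reference, a minimal sketch of raising transaction.timeout.ms on the producer factory; the broker address, serializers, transaction-id prefix, and the 60-second value are placeholders, not recommendations from this thread:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.ProducerFactory;

@Bean
public ProducerFactory<String, String> producerFactory() {
    Map<String, Object> props = new HashMap<>();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    // Give a whole batch enough time to be processed inside one transaction
    // (illustrative value; size it to your actual batch processing time).
    props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 60_000);
    DefaultKafkaProducerFactory<String, String> pf = new DefaultKafkaProducerFactory<>(props);
    pf.setTransactionIdPrefix("tx-"); // makes the factory transactional
    return pf;
}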

@Rajh
Author

Rajh commented Nov 9, 2020

Reading https://issues.apache.org/jira/browse/KAFKA-9803, I can see:

Currently the coordinator does not distinguish these two cases. Both will end up as a ProducerFencedException, which means the producer needs to shut itself down.

Actually, the producer is not shutting down, is it? Or if it is, what happens to the container? Since it is not rolling back, does it try to re-produce the message, with the ProducerFactory creating a new producer?

@garyrussell
Contributor

The producer factory should recycle the producer (it does for me with your test application with the short tx timeout). Any error on begin/abort/commitTransaction() closes the producer and a new one is created.

@Rajh
Author

Rajh commented Nov 9, 2020

Yes, but since the exception is ignored, the consumer skips the messages from the failed producer and continues with a new batch and a new producer.

@garyrussell
Contributor

Right, so until they fix it, the only thing we can do is stop the container.

Here's another work-around:

Set the container's idleEventInterval. Add an event listener to catch ListenerContainerIdleEvent; then perform a dummy (no-op) transaction; e.g. once a day.

@Rajh
Author

Rajh commented Nov 9, 2020

What kind of no-op can be done to keep a transaction alive? (without producing a message :D)

@garyrussell
Contributor

template.executeInTransaction(t -> {
    return null;
});

It will just do beginTransaction(), commitTransaction().
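
For reference, a minimal sketch of that workaround end to end: an event listener bean that runs the no-op transaction whenever the container goes idle. The bean name and the once-a-day interval are illustrative, not from this thread:

import org.springframework.context.event.EventListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.event.ListenerContainerIdleEvent;
import org.springframework.stereotype.Component;

@Component
public class IdleTransactionKeepAlive {

    private final KafkaTemplate<String, String> template;

    public IdleTransactionKeepAlive(KafkaTemplate<String, String> template) {
        this.template = template;
    }

    // Requires the container's idleEventInterval to be set, e.g.
    // factory.getContainerProperties().setIdleEventInterval(86_400_000L); // roughly once a day
    @EventListener
    public void onIdle(ListenerContainerIdleEvent event) {
        // No-op transaction: just beginTransaction()/commitTransaction(),
        // keeping the transactional producer from timing out while idle.
        this.template.executeInTransaction(t -> null);
    }
}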

garyrussell added a commit to garyrussell/spring-kafka that referenced this issue Nov 9, 2020
@Rajh
Author

Rajh commented Nov 10, 2020

Can handling ContainerStoppedEvent in order to restart the container have side effects?

Say we have 3 instances of the application running against a topic with 3 partitions (1 partition for each consumer).
When a consumer's transaction times out, the container is stopped, which triggers a rebalance for the others.
Then the container is restarted, which triggers another rebalance.

Can those rebalances trigger a ProducerFencedException, this time due to a rebalance rather than a transaction timeout?
And if so, can restarting every time cause a sort of deadlock, or disturb the rebalances?
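
For reference, a minimal sketch of the restart approach being asked about here, assuming the event's source is the container to restart and handing the restart off to another thread. The class name and executor choice are illustrative only; with a concurrent container you may instead want to locate and restart the parent container:

import org.springframework.context.event.EventListener;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.kafka.event.ContainerStoppedEvent;
import org.springframework.kafka.listener.MessageListenerContainer;
import org.springframework.stereotype.Component;

@Component
public class RestartAfterFence {

    private final SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor();

    @EventListener
    public void onStopped(ContainerStoppedEvent event) {
        // Assumption: the event source is the stopped container; restart it
        // off the publishing thread so the caller is not blocked.
        MessageListenerContainer container = (MessageListenerContainer) event.getSource();
        this.executor.execute(container::start);
    }
}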

@garyrussell
Contributor

Yes, it will cause a rebalance, although that can be avoided by setting a unique ConsumerConfig.GROUP_INSTANCE_ID_CONFIG on each instance (using static membership).
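
For reference, static membership just means giving each instance a unique, stable group.instance.id in its consumer configuration; a minimal sketch, where the broker address, group id, and the environment variable supplying the per-instance id are placeholders:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

// Sketch: each application instance supplies its own stable id
// (e.g. from an env var) so a restart does not trigger a rebalance.
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, System.getenv("INSTANCE_ID")); // must be unique per instance
DefaultKafkaConsumerFactory<String, String> cf = new DefaultKafkaConsumerFactory<>(props);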

@Rajh
Author

Rajh commented Nov 10, 2020

We are using dynamic auto-scaling, which means our consumer partition assignments are not static.

Causing a rebalance does not really matter; this FencedException should be rare, and causing a rebalance to recover is acceptable.
My concern was more about a deadlock: a rebalance causing a FencedException, causing another rebalance, and so on.

@garyrussell
Contributor

It won't deadlock, even if the other instances are not well-behaved (if they don't handle the rebalance in a timely manner).

artembilan pushed a commit that referenced this issue Nov 10, 2020
Resolves #1612

**cherry-pick to 2.5.x**

* Add @since to javadocs; retain root cause of `StopAfterFenceException`.

* Add reason to `ConsumerStoppedEvent`.

Resolves #1618

Also provide access to the actual container that stopped the consumer, for
example to allow restarting after stopping due to a producer fenced exception.

* Add `@Nullable`s.

* Test polishing.
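
For reference, a hedged sketch of enabling the behavior added by this fix, assuming the container property is named stopContainerWhenFenced (check the release notes of your version for the exact name):

import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // Stop the container instead of only logging when the producer is fenced
    // (assumed property name, introduced by the fix for this issue).
    factory.getContainerProperties().setStopContainerWhenFenced(true);
    return factory;
}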
garyrussell added a commit that referenced this issue Nov 10, 2020