Transaction stuck #8236

Closed
darkdarkdragon opened this issue Jan 14, 2023 · 10 comments · Fixed by #8566
@darkdarkdragon

Version & Environment

Redpanda version (rpk version): v22.3.10 (rev 1f78ad9)
OS: Ubuntu 22.04.1 LTS
Client: franz-go v1.9.0

What went wrong?

The producer can't start; it fails with this error:

unable to commit transaction offsets: UNKNOWN_SERVER_ERROR: The server experienced an unexpected error when processing the request.

What should have happened instead?

The producer should have started.

How to reproduce the issue?

Additional information

Server logs:

Jan 14 05:06:13 kafka-prod-2 rpk[1727]: WARN  2023-01-14 05:06:13,174 [shard 13] tx - [N:nomad_topic S:Stable G:516] group.cc:1808 - can't begin a tx {producer_identity: id=2242, epoch=258} with tx_seq 21969: a producer id is already involved in a tx with tx_seq 21222
Jan 14 05:06:13 kafka-prod-2 rpk[1727]: WARN  2023-01-14 05:06:13,174 [shard 15] tx - tx_gateway_frontend.cc:1451 - error on begining group tx: tx_errc::unknown_server_error
@darkdarkdragon added the kind/bug (Something isn't working) label on Jan 14, 2023
@bharathv
Contributor

Is this a transient issue, or does it persist for a long period? It helps if you have a simple repro.

A quick scan of the code suggests this may be possible if the consumer offsets topic is rebalancing / changing leadership, etc., in which case we could throw a retryable error, but we need to validate whether that is the case. It helps to have lower log levels or a simpler repro. I think we may already have an issue open for this... @rystsov may know?

@rystsov
Contributor

rystsov commented Jan 17, 2023

@bharathv it looks like the consumer group elected a new leader and it either started processing new requests before it had replayed the log, or there is something in the log replay that doesn't update some state. Looking...

@rystsov
Contributor

rystsov commented Jan 18, 2023

Ivan, we're going to get to the bottom of this and cover this edge case.

However, even when we fix it, there is always a slight chance of running into a fatal error with an indecisive outcome (unknown server error, invalid txn state, or timeout). An application using a Kafka client should anticipate this and be ready to recreate the producer when it happens (see the sketch below). On init_producer_id, or when the last tx expires (see transaction.timeout.ms), Redpanda either rolls the last tx forward or backward (depending on its state) and recovers the system to a consistent state.

It's easy to illustrate the need to handle fatal errors with timeouts: because the network isn't 100% reliable, a commit request may time out, and that can happen on the request path as well as on the response path, so the outcome of the operation is unknown. Unknown server error (USE) sounds scary, but RP returns it when it runs into a rare, non-critical, unexpected situation. And it isn't specific to Redpanda; Kafka too may return USE for any API request: it intercepts all exceptions, and when an exception isn't mapped to a Kafka error, it returns USE.
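
For illustration, a minimal sketch of that recreate-on-fatal-error pattern with franz-go's kgo package (the broker address, topic, transactional id, and the exact set of errors treated as fatal are assumptions for the example, not taken from the application reported here):

package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kerr"
	"github.com/twmb/franz-go/pkg/kgo"
)

// newTxClient builds a transactional producer; broker and id are placeholders.
func newTxClient() (*kgo.Client, error) {
	return kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.TransactionalID("example-tx"),
		kgo.TransactionTimeout(10*time.Second), // client-side transaction.timeout.ms
	)
}

// fatal reports whether the transaction outcome is indecisive, in which case
// the producer should be closed and recreated as described above.
func fatal(err error) bool {
	return errors.Is(err, kerr.UnknownServerError) ||
		errors.Is(err, kerr.InvalidTxnState) ||
		errors.Is(err, context.DeadlineExceeded)
}

// produceOnce runs a single transaction: begin, produce, commit (or abort).
func produceOnce(ctx context.Context, cl *kgo.Client) error {
	if err := cl.BeginTransaction(); err != nil {
		return err
	}
	rec := &kgo.Record{Topic: "example", Value: []byte("v")}
	if err := cl.ProduceSync(ctx, rec).FirstErr(); err != nil {
		cl.EndTransaction(ctx, kgo.TryAbort) // best effort; the caller handles the error
		return err
	}
	return cl.EndTransaction(ctx, kgo.TryCommit)
}

func main() {
	cl, err := newTxClient()
	if err != nil {
		log.Fatal(err)
	}
	for {
		err := produceOnce(context.Background(), cl)
		switch {
		case err == nil:
			// committed; run the next transaction
		case fatal(err):
			// Indecisive outcome: recreate the producer. The new init_producer_id
			// (or the transaction.timeout.ms expiry) lets the broker roll the last
			// tx forward or backward and recover a consistent state.
			log.Printf("fatal tx error, recreating producer: %v", err)
			cl.Close()
			if cl, err = newTxClient(); err != nil {
				log.Fatal(err)
			}
		default:
			log.Printf("retryable tx error: %v", err)
			time.Sleep(time.Second)
		}
	}
}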

@bharathv
Contributor

@rystsov It could be this too, no? A temporary blip that we seem to translate to USE, or am I missing something?

else if (in_state(group_state::completing_rebalance)) {
        cluster::begin_group_tx_reply reply;
        reply.ec = cluster::tx_errc::rebalance_in_progress; <==
        co_return reply;
    }

@rystsov
Contributor

rystsov commented Jan 18, 2023

@bharathv It's a problem (good catch, we should retry begin_group_tx on that error just like we retry begin_tx), but it most likely isn't this problem; we have this reported log line

Jan 14 05:06:13 kafka-prod-2 rpk[1727]: WARN  2023-01-14 05:06:13,174 [shard 13] tx - [N:nomad_topic S:Stable G:516] group.cc:1808 - can't begin a tx {producer_identity: id=2242, epoch=258} with tx_seq 21969: a producer id is already involved in a tx with tx_seq 21222

and it definitely causes USE

@rystsov
Contributor

rystsov commented Jan 23, 2023

@bharathv removed their assignment on Jan 24, 2023
@bharathv
Contributor

@rystsov this is technically not a ci-failure, right? I'm removing it, but feel free to add it back if I'm wrong.

@piyushredpanda
Contributor

@bharathv: More a tracking thing than anything. Treat it as a test failure -- whether it's chaos or CI is kinda moot.

@mmaslankaprv
Member

@rystsov should we mark this issue with sev/high?

piyushredpanda added a commit that referenced this issue Feb 2, 2023
Fixes #8236 (stuck transactions) by aborting expired transactions
@rystsov
Contributor

rystsov commented Feb 5, 2023

/backport v22.3.x

bharathv added a commit that referenced this issue Feb 6, 2023
[v22.3.x] Fixes #8236 (stuck transactions) by aborting expired transactions