Transaction stuck #8236

Closed
darkdarkdragon opened this issue Jan 14, 2023 · 10 comments · Fixed by #8566
@darkdarkdragon

Version & Environment

Redpanda version (rpk version): v22.3.10 (rev 1f78ad9)
OS: Ubuntu 22.04.1 LTS
Client: franz-go v1.9.0

What went wrong?

The producer can't start; it fails with this error:

unable to commit transaction offsets: UNKNOWN_SERVER_ERROR: The server experienced an unexpected error when processing the request.

What should have happened instead?

The producer should have started.

How to reproduce the issue?

Additional information

Server logs:

Jan 14 05:06:13 kafka-prod-2 rpk[1727]: WARN  2023-01-14 05:06:13,174 [shard 13] tx - [N:nomad_topic S:Stable G:516] group.cc:1808 - can't begin a tx {producer_identity: id=2242, epoch=258} with tx_seq 21969: a producer id is already involved in a tx with tx_seq 21222
Jan 14 05:06:13 kafka-prod-2 rpk[1727]: WARN  2023-01-14 05:06:13,174 [shard 15] tx - tx_gateway_frontend.cc:1451 - error on begining group tx: tx_errc::unknown_server_error
@darkdarkdragon added the kind/bug (Something isn't working) label on Jan 14, 2023
@bharathv
Contributor

Is this a transient issue, or does it persist for a long period? It helps if you have a simple repro.

A quick scan of the code suggests this may be possible if the consumer offsets topic is rebalancing / changing leadership, etc., in which case we could throw a retryable error, but we need to validate whether that is the case. It helps to have lower log levels or a simpler repro. I think we may already have an issue open for this... @rystsov may know?

@rystsov
Contributor

rystsov commented Jan 17, 2023

@bharathv it looks like the consumer group elected a new leader and it either started processing new requests before it had replayed the log, or there is something in the log replay that doesn't update some state. Looking...

@rystsov
Contributor

rystsov commented Jan 18, 2023

Ivan, we're going to get to the bottom of this and cover this edge case.

However, even when we fix it, there is always a slight chance of running into a fatal error with an indecisive outcome (unknown server error, invalid txn state, or timeout). An application using a Kafka client should anticipate this and be ready to recreate the producer when it happens (see the sketch below). On init_producer_id, or when the last tx expires (see transaction.timeout.ms), Redpanda either rolls the last tx forward or backward (depending on its state) and recovers the system to a consistent state.

It's easy to illustrate the need to handle fatal errors with timeouts: because the network isn't 100% reliable, a commit request may time out, and that can happen on the request path as well as on the response path, so the outcome of the operation is unknown. Unknown server error (USE) sounds scary, but RP returns it when it runs into a rare, non-critical, unexpected situation. And it isn't specific to Redpanda; Kafka too may return USE for any API request: it intercepts all exceptions, and when an exception isn't mapped to a Kafka error, it returns USE.
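
For illustration, a minimal sketch of that recreate-on-fatal-error pattern with franz-go's kgo package (the broker address, topic, transactional id, and the exact set of errors treated as fatal are assumptions for the example, not taken from the application reported here):

package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kerr"
	"github.com/twmb/franz-go/pkg/kgo"
)

// newTxClient builds a transactional producer; broker and id are placeholders.
func newTxClient() (*kgo.Client, error) {
	return kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.TransactionalID("example-tx"),
		kgo.TransactionTimeout(10*time.Second), // client-side transaction.timeout.ms
	)
}

// fatal reports whether the transaction outcome is indecisive, in which case
// the producer should be closed and recreated as described above.
func fatal(err error) bool {
	return errors.Is(err, kerr.UnknownServerError) ||
		errors.Is(err, kerr.InvalidTxnState) ||
		errors.Is(err, context.DeadlineExceeded)
}

// produceOnce runs a single transaction: begin, produce, commit (or abort).
func produceOnce(ctx context.Context, cl *kgo.Client) error {
	if err := cl.BeginTransaction(); err != nil {
		return err
	}
	rec := &kgo.Record{Topic: "example", Value: []byte("v")}
	if err := cl.ProduceSync(ctx, rec).FirstErr(); err != nil {
		cl.EndTransaction(ctx, kgo.TryAbort) // best effort; the caller handles the error
		return err
	}
	return cl.EndTransaction(ctx, kgo.TryCommit)
}

func main() {
	cl, err := newTxClient()
	if err != nil {
		log.Fatal(err)
	}
	for {
		err := produceOnce(context.Background(), cl)
		switch {
		case err == nil:
			// committed; run the next transaction
		case fatal(err):
			// Indecisive outcome: recreate the producer. The new init_producer_id
			// (or the transaction.timeout.ms expiry) lets the broker roll the last
			// tx forward or backward and recover a consistent state.
			log.Printf("fatal tx error, recreating producer: %v", err)
			cl.Close()
			if cl, err = newTxClient(); err != nil {
				log.Fatal(err)
			}
		default:
			log.Printf("retryable tx error: %v", err)
			time.Sleep(time.Second)
		}
	}
}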

@bharathv
Contributor

@rystsov It could be this too, no? A temporary blip that we seem to translate to USE, or am I missing something?

else if (in_state(group_state::completing_rebalance)) {
        cluster::begin_group_tx_reply reply;
        reply.ec = cluster::tx_errc::rebalance_in_progress; <==
        co_return reply;
    }

@rystsov
Contributor

rystsov commented Jan 18, 2023

@bharathv It's a problem (good catch, we should retry begin_group_tx on that error just like we retry begin_tx), but it most likely isn't this problem; we have this reported log line

Jan 14 05:06:13 kafka-prod-2 rpk[1727]: WARN  2023-01-14 05:06:13,174 [shard 13] tx - [N:nomad_topic S:Stable G:516] group.cc:1808 - can't begin a tx {producer_identity: id=2242, epoch=258} with tx_seq 21969: a producer id is already involved in a tx with tx_seq 21222

and it definitely causes USE

@rystsov
Contributor

rystsov commented Jan 23, 2023

@bharathv removed their assignment on Jan 24, 2023
@bharathv
Contributor

@rystsov this is technically not a ci-failure, right? I'm removing it, but feel free to add it back if I'm wrong.

@piyushredpanda
Contributor

@bharathv: More a tracking thing than anything. Treat it as a test failure -- whether it's chaos or CI is kinda moot.

@mmaslankaprv
Member

@rystsov should we mark this issue with sev/high?

piyushredpanda added a commit that referenced this issue Feb 2, 2023
Fixes #8236 (stuck transactions) by aborting expired transactions
@rystsov
Contributor

rystsov commented Feb 5, 2023

/backport v22.3.x

bharathv added a commit that referenced this issue Feb 6, 2023
[v22.3.x] Fixes #8236 (stuck transactions) by aborting expired transactions