Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure in TransactionsTest.graceful_leadership_transfer_test #7827

Closed
jcsp opened this issue Dec 19, 2022 · 4 comments · Fixed by #7909
Closed

Failure in TransactionsTest.graceful_leadership_transfer_test #7827

jcsp opened this issue Dec 19, 2022 · 4 comments · Fixed by #7909
Assignees
Labels

Comments

@jcsp
Copy link
Contributor

jcsp commented Dec 19, 2022

https://buildkite.com/redpanda/vtools/builds/4728#01852449-10e4-4085-afb0-4a4baee74d67

test_id:    rptest.tests.transactions_test.TransactionsTest.graceful_leadership_transfer_test^M
status:     FAIL^M 
run time:   34.205 seconds^M
^M
^M
    KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})^M
Traceback (most recent call last):^M
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run^M
    data = self.run_test()^M
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test^M 
    return self.test_context.function(self.test)^M
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped^M
    r = f(self, *args, **kwargs)^M
  File "/home/ubuntu/redpanda/tests/rptest/tests/transactions_test.py", line 320, in graceful_leadership_transfer_test^M
    producer.commit_transaction()^M
cimpl.KafkaException: KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"}^M
@rystsov rystsov self-assigned this Dec 19, 2022
@bharathv
Copy link
Contributor

piyushredpanda added a commit that referenced this issue Dec 20, 2022
Improve txn test observability to debug #7827
@bharathv
Copy link
Contributor

Another instance: https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19

^^ includes diagnostics from #7848.

@rystsov
Copy link
Contributor

rystsov commented Dec 21, 2022

Analyzed this failure - https://buildkite.com/redpanda/vtools/builds/4778#01852f08-7011-47b6-9bf0-e352213b7d57

The major root cause: the test sets tx timeout to 10s but two leadership transfers sometime take more than that it causes the test to fail:

172.31.2.23
TRACE 2022-12-20 17:22:01,029 [shard 0] tx - tx_gateway_frontend.cc:866 - processing name:init_tm_tx, tx_id:{0}, timeout:10000
TRACE 2022-12-20 17:22:03,701 [shard 3] tx - tx_gateway_frontend.cc:49 - got_lock name:add_partition_to_tx, tx_id:{0}
TRACE 2022-12-20 17:22:03,702 [shard 0] kafka - request_context.h:168 - [172.31.10.252:45190] sending 24:add_partitions_to_txn for {rdkafka}, response {throttle_time_ms=0 results={{name={topic-yafeofyezv} results={{partition_index=0 error_code={ error_code: none [0] }}}}}}
INFO  2022-12-20 17:22:07,889 [shard 1] raft - [group_id:1, {kafka/topic-yafeofyezv/0}] vote_stm.cc:280 - became the leader term: 2
DEBUG 2022-12-20 17:22:16,044 [shard 1] admin_api_server - admin_server.cc:355 - [_anonymous] POST http://ip-172-31-2-23:9644/v1/raft/1/transfer_leadership
INFO  2022-12-20 17:22:16,044 [shard 1] admin_api_server - admin_server.cc:1412 - Leadership transfer request for raft group 1 to node {nullopt}
INFO  2022-12-20 17:22:16,045 [shard 1] raft - [group_id:1, {kafka/topic-yafeofyezv/0}] consensus.cc:197 - [leadership_transfer] Stepping down as leader in term 2, dirty offset 22
TRACE 2022-12-20 17:22:18,828 [shard 1] tx - tx_gateway_frontend.cc:507 - processing name:try_abort, pid:{producer_identity: id=1, epoch=0}, tx_seq:1
TRACE 2022-12-20 17:22:18,830 [shard 1] tx - tx_gateway_frontend.cc:553 - sending name:try_abort, pid:{producer_identity: id=1, epoch=0}, tx_seq:1, ec:tx_errc::none, committed:false, aborted:true
TRACE 2022-12-20 17:22:20,052 [shard 0] kafka - handler.h:68 - [client_id: {rdkafka}] handling end_txn request {transactional_id={0} producer_id=1 producer_epoch=0 committed=true}
TRACE 2022-12-20 17:22:20,053 [shard 0] kafka - request_context.h:168 - [172.31.10.252:45190] sending 26:end_txn for {rdkafka}, response {throttle_time_ms=0 error_code={ error_code: unknown_server_error [-1] }}

172.31.12.59
INFO  2022-12-20 17:21:58,828 [shard 1] raft - [group_id:1, {kafka/topic-yafeofyezv/0}] vote_stm.cc:280 - became the leader term: 1
TRACE 2022-12-20 17:22:03,701 [shard 1] tx - [{kafka/topic-yafeofyezv/0}] - rm_stm.cc:343 - processing name:begin_tx pid:{producer_identity: id=1, epoch=0} tx_seq:1 timeout:10000 in term:1
TRACE 2022-12-20 17:22:03,704 [shard 1] tx - [{kafka/topic-yafeofyezv/0}] - rm_stm.cc:1286 - processing name:replicate_tx pid:{producer_identity: id=1, epoch=0} in term:1
DEBUG 2022-12-20 17:22:07,722 [shard 1] admin_api_server - admin_server.cc:355 - [_anonymous] POST http://ip-172-31-12-59:9644/v1/raft/1/transfer_leadership
INFO  2022-12-20 17:22:07,722 [shard 1] raft - [group_id:1, {kafka/topic-yafeofyezv/0}] consensus.cc:197 - [leadership_transfer] Stepping down as leader in term 1, dirty offset 11
INFO  2022-12-20 17:22:16,155 [shard 1] raft - [group_id:1, {kafka/topic-yafeofyezv/0}] vote_stm.cc:280 - became the leader term: 3
TRACE 2022-12-20 17:22:18,828 [shard 1] tx - [{kafka/topic-yafeofyezv/0}] - rm_stm.cc:2060 - trying to exprire pid:{producer_identity: id=1, epoch=0} tx_seq:{1}

Other things:

  • when a tx times out Kafka returns ProducerFencedException while we return unknown server error or invalid txn state depending on the RPC
  • when a tx times out Kafka fails abort too while we let it roll
  • better logging is needed

Working on the PR.

rystsov added a commit to rystsov/redpanda that referenced this issue Dec 22, 2022
sometimes two leadership transfers take more than 10s and when it
happens a tx times out causing graceful_leadership_transfer_test
to fail

fixes redpanda-data#7827
rystsov added a commit to rystsov/redpanda that referenced this issue Dec 22, 2022
sometimes two leadership transfers take more than 10s and when it
happens a tx times out causing graceful_leadership_transfer_test
to fail

fixes redpanda-data#7827
rystsov added a commit to rystsov/redpanda that referenced this issue Dec 30, 2022
sometimes two leadership transfers take more than 10s and when it
happens a tx times out causing graceful_leadership_transfer_test
to fail

fixes redpanda-data#7827
@BenPope
Copy link
Member

BenPope commented Dec 30, 2022

Quite a few of these:

FAIL test: TransactionsTest.graceful_leadership_transfer_test (13/153 runs)
failure at 2022-12-30T04:28:17.969Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
failure at 2022-12-29T15:33:23.538Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4926#01855ce9-9185-4a0f-a5a0-405a19882192
failure at 2022-12-29T04:31:38.661Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
failure at 2022-12-28T15:45:22.952Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4921#018557c4-2990-4c6b-9b28-9ec4be83e9cb
failure at 2022-12-27T15:23:38.216Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4916#0185529d-3893-43cc-a82e-6524ae0839d8
failure at 2022-12-27T04:39:59.859Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
failure at 2022-12-26T15:30:53.221Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4911#01854d77-3803-4f06-94df-09d2713c4046
failure at 2022-12-26T04:32:21.108Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
failure at 2022-12-25T15:25:50.631Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4903#0185484f-5cfb-4d8e-acbf-0f796ad567d8
failure at 2022-12-24T15:28:11.263Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4894#0185432b-ee15-4712-b049-dd6f66506ba4
failure at 2022-12-24T04:13:23.294Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
failure at 2022-12-25T04:20:18.252Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
failure at 2022-12-23T15:23:44.002Z: KafkaException(KafkaError{code=UNKNOWN,val=-1,str="EndTxn commit failed: Unknown broker error"})
on (amd64, VM) in job https://buildkite.com/redpanda/vtools/builds/4875#01853e04-0484-4b8e-8b95-458bcbc3db86

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants