rm_stm/idempotency: fix the producer lock scope
In case of a replication error on the current sequence, the code issues a
manual leader step down to prevent subsequent requests from making
progress, as that would violate idempotency guarantees.

This is done by holding a mutex while the request is in progress. The
mutex was incorrectly released before issuing the step down in such cases,
which may theoretically let other requests make progress before the step
down is actually issued. The race sequence looks like this:

seq=5 replication_error
seq=6 makes progress
seq=5 issues a stepdown

This bug was identified by eyeballing the code but couldn't be verified
due to the lack of trace logs in the many-partitions test. It seems like
something that should be tightened regardless.

Deployed the patch on a 3-node cluster with a 500MB/s OMB run; no
noticeable perf changes.
bharathv committed Feb 24, 2024
1 parent 7a3b3d5 commit def3776
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/v/cluster/rm_stm.cc
@@ -1166,7 +1166,6 @@ ss::future<result<kafka_result>> rm_stm::do_idempotent_replicate(
         req_ptr->set_value<ret_t>(errc::replication_error);
         co_return errc::replication_error;
     }
-    units.return_all();
     enqueued->set_value();
     auto replicated = co_await ss::coroutine::as_future(
       std::move(stages.replicate_finished));
@@ -1182,6 +1181,7 @@ ss::future<result<kafka_result>> rm_stm::do_idempotent_replicate(
         req_ptr->set_value<ret_t>(result.error());
         co_return result.error();
     }
+    units.return_all();
     // translate to kafka offset.
     auto kafka_offset = from_log_offset(result.value().last_offset);
     auto final_result = kafka_result{.last_offset = kafka_offset};
