rm_stm/idempotency: fix the producer lock scope
In case of a replication error on the current sequence, the code issues a
manual leader step down to prevent subsequent requests from making
progress, as that would violate idempotency guarantees.

This is done by holding a mutex while the request is in progress. The
mutex was incorrectly released before issuing the step down in such cases,
which may theoretically let other requests make progress before the step
down is actually issued. The race sequence looks like this:

seq=5 replication_error
seq=6 makes progress
seq=5 issues a stepdown

This bug was identified by eyeballing the code but couldn't be verified
due to the lack of trace logs in the many-partitions test. It seems like
something that should be tightened regardless.

Deployed the patch on a 3-node cluster with a 500MB/s OMB run; no
noticeable perf changes.
bharathv committed Feb 24, 2024
1 parent 7a3b3d5 commit def3776
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/v/cluster/rm_stm.cc
@@ -1166,7 +1166,6 @@ ss::future<result<kafka_result>> rm_stm::do_idempotent_replicate(
         req_ptr->set_value<ret_t>(errc::replication_error);
         co_return errc::replication_error;
     }
-    units.return_all();
     enqueued->set_value();
     auto replicated = co_await ss::coroutine::as_future(
       std::move(stages.replicate_finished));
@@ -1182,6 +1181,7 @@ ss::future<result<kafka_result>> rm_stm::do_idempotent_replicate(
         req_ptr->set_value<ret_t>(result.error());
         co_return result.error();
     }
+    units.return_all();
     // translate to kafka offset.
     auto kafka_offset = from_log_offset(result.value().last_offset);
     auto final_result = kafka_result{.last_offset = kafka_offset};
