Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shard producer state manager with v_cluster_id #16992

Merged
merged 9 commits into from
Mar 27, 2024

Conversation

mmaslankaprv
Copy link
Member

@mmaslankaprv mmaslankaprv commented Mar 11, 2024

Change implementation of producer_state_manager to keep separate producer state LRU cache for each v_cluster_id. This approach isolates each virtual cluster queue and guarantees fair distribution of producer state slots among them.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@mmaslankaprv mmaslankaprv force-pushed the sharded-producer-state-mgr branch 2 times, most recently from 4e4412d to ce29dea Compare March 12, 2024 08:50
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Mar 12, 2024
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Mar 12, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/46046#018e3236-bc93-42f2-8f65-14656265c541:

"rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.num_to_upgrade=0.with_tiered_storage=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/46614#018e6653-d8b7-462b-88a8-e1af5b154ae7:

"rptest.tests.raft_availability_test.RaftAvailabilityTest.test_two_nodes_down"

Copy link
Contributor

@bharathv bharathv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm overall, mostly minor comments.

src/v/cluster/namespaced_cache.h Show resolved Hide resolved
@@ -223,38 +219,24 @@ class producer_state {
void update_current_txn_start_offset(std::optional<kafka::offset> offset) {
_current_txn_start_offset = offset;
}
safe_intrusive_list_hook _hook;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this public? seems a bit risky, perhaps befriend the other class?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a common pattern across the source to have the hook public, i will try in a follow up to change this.

});
});
if (_evicted) {
throw ss::gate_closed_exception();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: gate_closed_exception without a gate instance seems a bit confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, agreed. I wanted to be consistent with previous behavior where a gate was part of the producer_state. Producer eviction was signalled by gate_closed_exception

, virtual_cluster_min_producer_ids(
*this,
"virtual_cluster_min_producer_ids",
"Minium number of active producers per virtual cluster",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: minimum

A little more color might be helpful. Why a minimum is needed.

Also on naming, should we prefix all mpx stuff with "mpx_", mpx_min_producers_per_cluder_id or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i didn't want the mpx_ prefix to be epidemic

src/v/cluster/rm_stm.cc Show resolved Hide resolved
@mmaslankaprv mmaslankaprv force-pushed the sharded-producer-state-mgr branch 3 times, most recently from 66f34b7 to 94db988 Compare March 25, 2024 18:25
bharathv
bharathv previously approved these changes Mar 27, 2024
Copy link
Contributor

@bharathv bharathv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

bharathv
bharathv previously approved these changes Mar 27, 2024
Producer state manager is a cluster module service Raft layer should not
depend on any cluster related services.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Added time based eviction driven by an external timer to the
`cluster::namespaced_cache` the method allows to evict entries older
than the requested time.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Using namespaced cache to isolate producer id queues for each virtual
cluster. Each virtual cluster has its own guaranteed number of producer
ids. If not all producer id slots are used virtual clusters will use
more than minimum number of producer ids until they are required to
provide min number of producer ids for other virtual clusters.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Passing partition virtual cluster id to rm_stm to be able to add a
virtual cluster context to producer state that is managed in producer
state manager.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Made producer state manager configurable with minimal number of
producers per virtual cluster.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
When the number of allowed virtual clusters maintained in producer
state manager cache is exceeded a cache throws an exception. The
exception is translated to the error code indicating that the cache size
limit was exceeded. The error code is finally translated to kafka policy
violation error. This error is not retryable and will indicate issue
with number of producers.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
@bharathv bharathv merged commit 8916706 into redpanda-data:dev Mar 27, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants