Refactored `cluster::partition_leaders_table` to use hierarchical structure of metadata #16512

mmaslankaprv · 2024-02-07T09:40:04Z

Use a hierarchical data structure to store leader metadata in `cluster::partition_leaders_table`: a top-level map keyed by topic only, whose values are maps holding the per-partition information. This way we do not need to keep a separate copy of the topic name for each partition, and we can leverage the matching hierarchical structure of `node_health_report` to reduce the number of hash table lookups.

Fixes: https://github.com/redpanda-data/core-internal/issues/1061

Backports Required

Release Notes

Improvements

Optimized updating leadership metadata with health reports.

mmaslankaprv · 2024-02-07T09:55:11Z

/dt

mmaslankaprv · 2024-02-07T13:36:30Z

/dt

mmaslankaprv · 2024-02-07T17:00:10Z

/dt

mmaslankaprv · 2024-02-08T11:44:08Z

/dt

mmaslankaprv · 2024-02-08T15:22:29Z

/dt

rockwotj

Looks nice, hope you don't mind the drive by review 😄

src/v/cluster/partition_leaders_table.cc

src/v/cluster/partition_leaders_table.h

src/v/cluster/metadata_dissemination_service.cc

rockwotj · 2024-02-09T14:56:16Z

src/v/cluster/partition_leaders_table.h

@@ -109,6 +109,10 @@ class partition_leaders_table {

    leaders_info_t get_leaders() const;

+    uint64_t leaderless_partition_count() const {


It would be nice to have unit tests verifying that we bookkeep this correctly, but I guess there are no existing tests for the partition leaders table, so maybe we should file a ticket?

I was thinking about this as well; let me figure something out.

src/v/cluster/metadata_dissemination_types.h

StephanDollberg

Thanks for this, once we are close to merge I can get a flamegraph from a high partition load again.

StephanDollberg · 2024-02-09T17:32:56Z

src/v/cluster/partition_leaders_table.cc

-          ntp);
-        return;
-    }
+    const model::ntp ntp(t_it->first.ns, t_it->first.tp, p_id);


This potentially voids the benefits we are getting, as it likely causes an alloc + a memcpy.

It seems to be used in two places:

- In the trace logs: can we just pass in a ref to the ns_tp and then use that plus the p_id in the log statements?
- Later down, in the call to `_watchers.notify`: can we just construct it there, or work around it some other way (passing ntp twice there seems weird)? I expect that if statement to be rare anyway.
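A rough sketch of the suggestion above, under stated assumptions: `topic_namespace_sketch`, `ntp_sketch`, `trace_update` and `update_leader` are hypothetical names, and the real `model::ntp` and logging APIs are not reproduced here. Logging takes the topic key by reference plus the partition id, and the full ntp is built only on the rare path where watchers have to be notified.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-ins for the topic key and the full ntp.
struct topic_namespace_sketch {
    std::string ns;
    std::string tp;
};

struct ntp_sketch {
    std::string ns;
    std::string tp;
    int32_t partition;
};

// Logging path: writes the pieces directly, without building an ntp first.
inline void trace_update(
  std::ostream& os, const topic_namespace_sketch& ns_tp, int32_t p_id) {
    os << ns_tp.ns << '/' << ns_tp.tp << '/' << p_id;
}

// Update path: the ntp (and its string copies) is only constructed when the
// leader actually changed and somebody has to be notified.
template<typename Notify>
bool update_leader(
  const topic_namespace_sketch& ns_tp,
  int32_t p_id,
  std::optional<int32_t>& current,
  std::optional<int32_t> next,
  Notify&& notify) {
    if (current == next) {
        return false; // common case: no allocation, no copy
    }
    current = next;
    notify(ntp_sketch{ns_tp.ns, ns_tp.tp, p_id}); // rare case: build it now
    return true;
}
```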

src/v/cluster/partition_leaders_table.cc

src/v/cluster/partition_leaders_table.h

Added tracking of the number of leaderless partitions in the leaders table. This prevents iterating over the whole list of leaders when generating cluster metrics. Signed-off-by: Michal Maslanka <michal@redpanda.com>
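The bookkeeping behind such a counter can be sketched as follows. This is an illustration only: the class name and the flat `std::unordered_map` are assumptions, and only the accessor name `leaderless_partition_count()` comes from the diff. The counter is adjusted on every transition between "has leader" and "no leader", so reading it is O(1).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical, simplified table: partition id -> optional leader id.
class leaders_counter_sketch {
public:
    // Insert or update a partition and keep the leaderless counter in sync.
    void update(int32_t p_id, std::optional<int32_t> leader) {
        auto [it, inserted] = _leaders.try_emplace(p_id, leader);
        if (inserted) {
            if (!leader) {
                ++_leaderless;
            }
            return;
        }
        const bool had_leader = it->second.has_value();
        if (had_leader && !leader) {
            ++_leaderless; // leader lost
        } else if (!had_leader && leader) {
            --_leaderless; // leader elected
        }
        it->second = leader;
    }

    // Remove a partition; a leaderless one no longer counts.
    void remove(int32_t p_id) {
        auto it = _leaders.find(p_id);
        if (it == _leaders.end()) {
            return;
        }
        if (!it->second) {
            --_leaderless;
        }
        _leaders.erase(it);
    }

    // O(1): no iteration over the table is needed for metrics.
    uint64_t leaderless_partition_count() const { return _leaderless; }

private:
    std::unordered_map<int32_t, std::optional<int32_t>> _leaders;
    uint64_t _leaderless = 0;
};
```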

Added version tracking to the partition leaders table so that concurrent modification can be identified. This allows yielding while iterating over the leaders table. If the table is modified during an operation, an exception is thrown and the operation can be retried. Signed-off-by: Michal Maslanka <michal@redpanda.com>
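One way to picture the version check is shown below. All names are hypothetical and the yield point is modelled as a plain callback instead of a Seastar coroutine suspension. The table bumps a version on every mutation, and an iteration that resumes after a yield compares the current version with the one it started from.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>

// Hypothetical exception signalling that the table changed mid-iteration.
class concurrent_modification_error : public std::runtime_error {
public:
    concurrent_modification_error()
      : std::runtime_error("leaders table modified during iteration") {}
};

class versioned_table_sketch {
public:
    // Every mutation bumps the version.
    void set(int32_t p_id, int32_t leader) {
        _data[p_id] = leader;
        ++_version;
    }

    uint64_t version() const { return _version; }

    // Visit every entry. `yield_point` stands in for a place where the
    // iteration may be suspended and other code may run.
    template<typename Visitor, typename YieldPoint>
    void for_each(Visitor visit, YieldPoint yield_point) const {
        const uint64_t start_version = _version;
        for (auto it = _data.begin(); it != _data.end(); ++it) {
            visit(it->first, it->second);
            yield_point();
            if (_version != start_version) {
                // The iterator is not touched again; the caller can retry.
                throw concurrent_modification_error();
            }
        }
    }

private:
    std::map<int32_t, int32_t> _data;
    uint64_t _version = 0;
};
```

A caller that catches the exception can simply restart the iteration, which matches the "operation can be retried" wording of the commit message.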

Leveraging the hierarchical structure of the node health report and the internals of the partition leaders table to minimize the number of lookups in the leaders map. Signed-off-by: Michal Maslanka <michal@redpanda.com>

Previously `get_leadership_reply` did not contain any information about the operation state, so it was impossible to propagate a service error to the client. Added a field indicating whether the response is successful. The field allows us to explicitly handle errors such as concurrent modification of the partition leaders table. Signed-off-by: Michal Maslanka <michal@redpanda.com>
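As an illustration of the idea, a reply carrying an explicit success flag could look roughly like this. The struct layout, the field name `success` and the handler are assumptions; the real `get_leadership_reply` definition is not shown in this conversation.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <stdexcept>
#include <vector>

// Hypothetical entry type: partition id and its leader.
struct leader_entry_sketch {
    int32_t partition;
    int32_t leader;
};

// Hypothetical reply shape with an explicit success indicator.
struct get_leadership_reply_sketch {
    std::vector<leader_entry_sketch> leaders;
    bool success = true;
};

// Hypothetical handler: `collect` gathers the leaders and may throw, for
// example when the table is modified while it is being iterated.
template<typename Collect>
get_leadership_reply_sketch handle_get_leadership(Collect collect) {
    get_leadership_reply_sketch reply;
    try {
        reply.leaders = collect();
    } catch (const std::exception&) {
        // Report the failure instead of returning a partial result.
        reply.leaders.clear();
        reply.success = false;
    }
    return reply;
}
```

With the flag in place, a client can tell an empty but valid answer apart from a failed request and decide to retry.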

Using `ntp_callbacks` to wait for leaders without an additional promises map. When a caller requests to wait for a leader, we register a notification that sets the promise value when called. This way we do not need a separate mechanism to keep track of leadership notifications. Signed-off-by: Michal Maslanka <michal@redpanda.com>
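The mechanism can be sketched as below, with loud caveats: this is not the real `ntp_callbacks` API, the ntp is reduced to a string key, and `pending_leader` is a plain shared slot standing in for a Seastar promise/future pair. Waiting is just registering a one-shot callback that fills the slot and then unregisters itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Stand-in for a promise/future pair: filled once a leader is known.
struct pending_leader {
    std::optional<int32_t> value;
};

// Hypothetical callback registry keyed by an ntp rendered as a string.
class leader_waiters_sketch {
public:
    using callback = std::function<void(int32_t)>;

    uint64_t register_notify(const std::string& ntp, callback cb) {
        const uint64_t id = _next_id++;
        _callbacks[ntp].emplace(id, std::move(cb));
        return id;
    }

    void unregister_notify(const std::string& ntp, uint64_t id) {
        auto it = _callbacks.find(ntp);
        if (it == _callbacks.end()) {
            return;
        }
        it->second.erase(id);
        if (it->second.empty()) {
            _callbacks.erase(it);
        }
    }

    void notify(const std::string& ntp, int32_t leader) {
        auto it = _callbacks.find(ntp);
        if (it == _callbacks.end()) {
            return;
        }
        // Iterate over a copy because callbacks may unregister themselves.
        const auto snapshot = it->second;
        for (const auto& entry : snapshot) {
            entry.second(leader);
        }
    }

    // Waiting needs no separate promises map: it is one more callback.
    std::shared_ptr<pending_leader> wait_for_leader(const std::string& ntp) {
        auto pending = std::make_shared<pending_leader>();
        auto id = std::make_shared<uint64_t>(0);
        *id = register_notify(ntp, [this, ntp, pending, id](int32_t leader) {
            pending->value = leader;
            unregister_notify(ntp, *id);
        });
        return pending;
    }

    std::size_t registered() const { return _callbacks.size(); }

private:
    std::map<std::string, std::map<uint64_t, callback>> _callbacks;
    uint64_t _next_id = 0;
};
```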

Replaced the previously used `chunked_fifo` with dynamically sized fragmented vectors. The fragmented vector provides a random access iterator and automatically controls the size of allocated chunks. Signed-off-by: Michal Maslanka <michal@redpanda.com>

Added methods allowing `fragmented_vector::iter` to satisfy the `std::random_access_iterator` concept. Signed-off-by: Michal Maslanka <michal@redpanda.com>

Using the async algorithm calls `ss::coroutine::maybe_yield()` every 100 operations while remaining lightweight when iterating synchronously over a chunk. Signed-off-by: Michal Maslanka <michal@redpanda.com>
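The batching pattern can be illustrated without Seastar. In the sketch below the yield is a plain callback (an assumption standing in for `ss::coroutine::maybe_yield()`), and `for_each_with_yield` is a hypothetical helper, not the actual async algorithm used in the PR. Elements are processed in a tight synchronous loop and the yield hook runs once per `interval` operations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Process [first, last) synchronously, invoking `maybe_yield` once every
// `interval` elements so a scheduler gets a chance to run other work.
template<typename It, typename Fn, typename Yield>
void for_each_with_yield(
  It first, It last, Fn fn, Yield maybe_yield, std::size_t interval = 100) {
    std::size_t since_last_yield = 0;
    for (; first != last; ++first) {
        fn(*first);
        if (++since_last_yield == interval) {
            since_last_yield = 0;
            maybe_yield(); // stand-in for a coroutine yield point
        }
    }
}
```

Keeping the inner loop free of suspension points is what keeps the per-element cost low; only every hundredth element pays for the yield check.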

vbotbuildovich · 2024-02-26T07:46:03Z

/backport v23.3.x

vbotbuildovich · 2024-02-26T07:46:04Z

/backport v23.2.x

vbotbuildovich · 2024-02-26T07:46:58Z

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-16512-v23.3.x-160 remotes/upstream/v23.3.x
git cherry-pick -x 2410d675c0228c00b8811f3aeb01d3472baa51ae 0d946e06cf9effedd2ec7097c4c0d4ba480bebde bd3cc30119409c54f768e713c65d5951e8ce373f b39795695f5b1a8fe2d0963ac47ac85454466965 abe6173b21cb6e86323c48b8b8c6dd5d27fe35b8 fdeb6ac37ab757a13a458c2351f15443525b86bc 9a255f011290dd52842ac7f1d1170521caba0dab aeab006f49f9c1daff9f7c3dfb9756c6277e34a7 2db09f95fd654eabc88f4c900fcbfa817ffeceeb cddc003d3daf75aa001ff7b0d402ffa5d1fa3a3b e37df0cb2f17915357b3622d1e96a9c635b77ec9 4eed31fd55411341997ceeb1af9e7a7f5eae5e5f 44e6cd7837e4e7060d1016f2f5f5f0f77a5d3db2 f4e6bce918c613707526223d9d21bb4a904b9a9d 9dbeb81913f63de12d5c4e0fd72f5c4b85e4d0d6 8948fe5c5214bca5e5011b0c45b4fa7df9ed3e60

Workflow run logs.

vbotbuildovich · 2024-02-26T07:47:03Z

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-16512-v23.2.x-948 remotes/upstream/v23.2.x
git cherry-pick -x 2410d675c0228c00b8811f3aeb01d3472baa51ae 0d946e06cf9effedd2ec7097c4c0d4ba480bebde bd3cc30119409c54f768e713c65d5951e8ce373f b39795695f5b1a8fe2d0963ac47ac85454466965 abe6173b21cb6e86323c48b8b8c6dd5d27fe35b8 fdeb6ac37ab757a13a458c2351f15443525b86bc 9a255f011290dd52842ac7f1d1170521caba0dab aeab006f49f9c1daff9f7c3dfb9756c6277e34a7 2db09f95fd654eabc88f4c900fcbfa817ffeceeb cddc003d3daf75aa001ff7b0d402ffa5d1fa3a3b e37df0cb2f17915357b3622d1e96a9c635b77ec9 4eed31fd55411341997ceeb1af9e7a7f5eae5e5f 44e6cd7837e4e7060d1016f2f5f5f0f77a5d3db2 f4e6bce918c613707526223d9d21bb4a904b9a9d 9dbeb81913f63de12d5c4e0fd72f5c4b85e4d0d6 8948fe5c5214bca5e5011b0c45b4fa7df9ed3e60

Workflow run logs.

github-actions bot added the area/redpanda label Feb 7, 2024

mmaslankaprv force-pushed the hierarchy-leaders-table branch from af1fc75 to 55841f5 Compare February 7, 2024 09:44

mmaslankaprv force-pushed the hierarchy-leaders-table branch from 55841f5 to 716f02a Compare February 7, 2024 11:13

mmaslankaprv force-pushed the hierarchy-leaders-table branch from 716f02a to 8d1ccf6 Compare February 7, 2024 16:58

mmaslankaprv force-pushed the hierarchy-leaders-table branch from 8d1ccf6 to 1ad7040 Compare February 8, 2024 11:06

rockwotj reviewed Feb 8, 2024

rockwotj mentioned this pull request Feb 8, 2024

admin: stream more json results #16551

Merged

7 tasks

mmaslankaprv force-pushed the hierarchy-leaders-table branch 2 times, most recently from d7d76c5 to 51b2159 Compare February 9, 2024 07:14

redpanda-data deleted a comment from vbotbuildovich Feb 9, 2024

mmaslankaprv force-pushed the hierarchy-leaders-table branch from 51b2159 to cbc2821 Compare February 9, 2024 10:34

mmaslankaprv marked this pull request as ready for review February 9, 2024 10:34

mmaslankaprv requested review from StephanDollberg, ztlpn, bharathv and rockwotj February 9, 2024 10:38

rockwotj reviewed Feb 9, 2024

mmaslankaprv force-pushed the hierarchy-leaders-table branch from cbc2821 to 5945d9c Compare February 9, 2024 17:29

StephanDollberg reviewed Feb 9, 2024

mmaslankaprv force-pushed the hierarchy-leaders-table branch 3 times, most recently from e8eca04 to 639aa60 Compare February 10, 2024 13:46

mmaslankaprv requested review from rockwotj and StephanDollberg February 10, 2024 13:46

mmaslankaprv added 13 commits February 23, 2024 07:42

c/leaders: track number of leader less partitions

0d946e0

Added tracking of the number of leaderless partitions in the leaders table. This prevents iterating over the whole list of leaders when generating cluster metrics. Signed-off-by: Michal Maslanka <michal@redpanda.com>

c/controller_probe: do not iterate over all leaders

bd3cc30

Signed-off-by: Michal Maslanka <michal@redpanda.com>

c/md_dissemiantion: replaced chunked fifo with new adaptive frag vec

abe6173

Signed-off-by: Michal Maslanka <michal@redpanda.com>

model: added controller namespace topic definition

fdeb6ac

Signed-off-by: Michal Maslanka <michal@redpanda.com>

c/md_dissemination: optimize leader metadata update with health report

9a255f0

Leveraging the hierarchical structure of the node health report and the internals of the partition leaders table to minimize the number of lookups in the leaders map. Signed-off-by: Michal Maslanka <michal@redpanda.com>

tests: added partitions leader table test

2db09f9

Signed-off-by: Michal Maslanka <michal@redpanda.com>

ntp_callbacks: allow notifying without the need of creating ntp

cddc003

Signed-off-by: Michal Maslanka <michal@redpanda.com>

test: added gtest assert_throw for coroutines

e37df0c

Signed-off-by: Michal Maslanka <michal@redpanda.com>

container: added missing iterator methods to fragmented_vector

f4e6bce

Added methods allowing `fragmented_vector::iter` to satisfy the `std::random_access_iterator` concept. Signed-off-by: Michal Maslanka <michal@redpanda.com>

mmaslankaprv force-pushed the hierarchy-leaders-table branch from 461ff36 to 707a58e Compare February 23, 2024 07:34

mmaslankaprv added 2 commits February 23, 2024 12:39

c/leaders_table: use async algorithm when updating leaders with report

9dbeb81

Using the async algorithm calls `ss::coroutine::maybe_yield()` every 100 operations while remaining lightweight when iterating synchronously over a chunk. Signed-off-by: Michal Maslanka <michal@redpanda.com>

c/leaders_table: use async algorithm to iterate over leaders

8948fe5

Signed-off-by: Michal Maslanka <michal@redpanda.com>

mmaslankaprv force-pushed the hierarchy-leaders-table branch from 707a58e to 8948fe5 Compare February 23, 2024 11:39

mmaslankaprv requested a review from StephanDollberg February 23, 2024 14:12

StephanDollberg approved these changes Feb 23, 2024

mmaslankaprv merged commit 60d475e into redpanda-data:dev Feb 26, 2024
16 checks passed

mmaslankaprv deleted the hierarchy-leaders-table branch February 26, 2024 07:45

vbotbuildovich mentioned this pull request Feb 26, 2024

[v23.3.x] Refactored cluster::partition_leaders_table to use hierarchical structure of metadata #16707

Closed

vbotbuildovich mentioned this pull request Feb 26, 2024

[v23.2.x] Refactored cluster::partition_leaders_table to use hierarchical structure of metadata #16708

Open

mmaslankaprv mentioned this pull request Feb 26, 2024

[v23.3.x] Refactored cluster::partition_leaders_table to use hierarchical structure of metadata #16709

Merged

This was referenced Mar 8, 2024

[v23.3.x] Refactored cluster::partition_leaders_table to use hierarchical structure of metadata #16966

Closed

[v23.3.x] Refactored cluster::partition_leaders_table to use hierarchical structure of metadata #16967

Closed
