
PESDLC-886 Tests to create 10k+ topics and do checks with low throughput #16463

Closed
wants to merge 5 commits into from

Conversation

savex
Contributor

@savex savex commented Feb 3, 2024

Single topic creation takes at least 1 sec. This PR utilizes several parallel threads and several nodes to create topics when using the single-topic creation method.

Also, it appears that kafka.client supports sending many topic specs in a single request, and it is a lot faster:
8192 topics in a single request were created in <9 sec.

Both methods are retained in the code and are switchable via the use_kafka_batching option.

The test has a produce/consume stage that produces X messages to all topics and selectively checks one topic for the proper consumed message count and proper data consumed.
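A minimal sketch of the batched creation path (assuming kafka-python's `KafkaAdminClient`/`NewTopic`; the helper names here are illustrative, not the PR's actual code):

```python
# Sketch: create topics in large batches rather than one request per topic.
# Assumes kafka-python's KafkaAdminClient; helper names are illustrative only.
from typing import Iterable, List

def chunk(items: List[str], batch_size: int) -> Iterable[List[str]]:
    """Split a list of topic names into batches of at most batch_size."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def create_topics_batched(admin, names: List[str], batch_size: int = 8192,
                          partitions: int = 1, replication: int = 3) -> int:
    """Send topic specs in large batches, one CreateTopics request per batch."""
    from kafka.admin import NewTopic  # kafka-python
    created = 0
    for batch in chunk(names, batch_size):
        specs = [NewTopic(name=n, num_partitions=partitions,
                          replication_factor=replication) for n in batch]
        admin.create_topics(new_topics=specs, timeout_ms=60_000)
        created += len(specs)
    return created
```

The single-request batching is what makes 8192 topics in <9 sec possible, since only one round trip carries all the specs.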

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

@savex
Contributor Author

savex commented Feb 3, 2024

Several parallel threads used and timings collected. Will check how effective using several nodes would be.
128 threads

min =   0.204s, max = 212.209s, p25 =140.573s, p50 =190.698s, p75 =201.526s, p90 =201.545s, p95 =201.552s, p99 =211.494s,

32 threads

min =   3.228s, max =  91.742s, p25 = 44.820s, p50 = 47.814s, p75 = 50.249s, p90 = 54.281s, p95 = 71.200s, p99 = 82.948s,
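The per-request latency summaries above can be produced with a small helper along these lines (a sketch; the PR's actual reporting code may differ):

```python
# Sketch: summarize per-request creation latencies like the log lines above
# (min/max plus selected percentiles, nearest-rank method).
def summarize(durations):
    """Return min/max and selected percentiles for a list of seconds."""
    s = sorted(durations)
    def pct(p):
        # nearest-rank percentile on the sorted sample
        idx = max(0, min(len(s) - 1, int(round(p / 100.0 * len(s))) - 1))
        return s[idx]
    stats = {"min": s[0], "max": s[-1]}
    for p in (25, 50, 75, 90, 95, 99):
        stats[f"p{p}"] = pct(p)
    return stats
```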

@savex savex force-pushed the dp-995-large-topics-number branch 3 times, most recently from 5a52dcd to 003b6f7 Compare February 6, 2024 22:43
@savex savex marked this pull request as ready for review February 6, 2024 22:44
@travisdowns
Member

Several parallel threads used and timings collected. Will check how effective would be using several nodes
128 threads

What do these timings mean? Like p50=190 is the time to do what?

Note that the CreateTopics API lets you create N topics at once in a single API call, so I think that's definitely the way to go for bulk topic creation: then we will just be limited by the speed of multi-topic creation on the Redpanda side.

Member

@travisdowns travisdowns left a comment


Can you clarify the purpose of this script? Is it for investigation purposes, or will we also use these methods to create topics during testing?

@savex
Contributor Author

savex commented Feb 7, 2024

Those timings are for sending single requests in parallel. Based on the RPK and kafka client tools, my understanding was that only one topic can be created at a time. That's why I created this ThreadPoolExecutor approach and tried to determine how many parallel requests would be most efficient. But the morning after, I tried sending all topics in a single request, since the python kafka.client supports it. And the creation timings are way faster.
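The parallel single-request approach can be sketched like this (illustrative names; `create_one_topic` is a stand-in for whatever issues the single-topic request):

```python
# Sketch: fire single-topic creation requests from a thread pool and
# record the per-topic wall time, as in the 32/128-thread runs above.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def create_topics_parallel(create_one_topic, names, workers=32):
    """create_one_topic(name) issues one request; returns {name: seconds}."""
    def timed(name):
        start = time.monotonic()
        create_one_topic(name)
        return name, time.monotonic() - start

    timings = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(timed, n) for n in names]
        for fut in as_completed(futures):
            name, elapsed = fut.result()
            timings[name] = elapsed
    return timings
```

With each request taking ~1 sec on its own, the timings above suggest the requests mostly serialize server-side, which is why batching wins.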

@savex
Contributor Author

savex commented Feb 7, 2024

4096 topics in single batch ~6.48 sec

[INFO  - 2024-02-05 22:36:47,397 - many_partitions_test - _create_many_topics - lineno:1096]: Creating 12000 topics: batch size = 4096, workers = 32
[DEBUG - 2024-02-05 22:36:52,729 - many_partitions_test - _create_many_topics - lineno:1165]: ...0-4096: min =   5.180s, max =   5.180s, p25 =  5.180s, p50 =  5.180s, p75 =  5.180s, p90 =  5.180s, p95 =  5.180s, p99 =  5.180s,
[DEBUG - 2024-02-05 22:36:59,351 - many_partitions_test - _create_many_topics - lineno:1165]: ...4096-8192: min =   5.180s, max =   6.473s, p25 =  5.180s, p50 =  5.826s, p75 =  6.473s, p90 =  6.473s, p95 =  6.473s, p99 =  6.473s,

8192 = <6.22 sec

[INFO  - 2024-02-05 22:41:50,669 - many_partitions_test - _create_many_topics - lineno:1097]: Creating 12000 topics: batch size = 8192, workers = 32
[DEBUG - 2024-02-05 22:41:57,184 - many_partitions_test - _create_many_topics - lineno:1166]: ...0-8192: min =   6.219s, max =   6.219s, p25 =  6.219s, p50 =  6.219s, p75 =  6.219s, p90 =  6.219s, p95 =  6.219s, p99 =  6.219s,

11950 = ~12 sec

[INFO  - 2024-02-05 22:45:28,997 - many_partitions_test - _create_many_topics - lineno:1097]: Creating 11950 topics: batch size = 11950, workers = 32
[DEBUG - 2024-02-05 22:45:41,428 - many_partitions_test - _create_many_topics - lineno:1166]: ...0-11950: min =  11.995s, max =  11.995s, p25 = 11.995s, p50 = 11.995s, p75 = 11.995s, p90 = 11.995s, p95 = 11.995s, p99 = 11.995s,

When creating topics near the cluster limit, BadLogLines are detected:

rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-4-91(4),ip-172-31-1-32(2),ip-172-31-14-83(2),ip-172-31-15-74(2),ip-172-31-8-17(2),ip-172-31-15-248(2),ip-172-31-0-114(2),ip-172-31-3-43(2),ip-172-31-2-107(2) example="WARN  2024-02-05 22:51:08,819 [shard 1:main] seastar_memory - oversized allocation: 786432 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x8bbc2a3 0x882655d 0x8833820 0x795e4dd 0x778c6b1 0x778bbb7 0x46ab30a 0x43cb1b6 0x443565e 0x4414f9f 0x2dbd9aa 0x88f149f 0x88f4c11 0x894f645 0x887e2af /opt/redpanda/lib/libc.so.6+0x91016 /opt/redpanda/lib/libc.so.6+0x1166cf">

@travisdowns
Member

Those timings are for sending single requests in parallel.

OK wow, so ~200 seconds to create a topic when we are creating many in parallel; I guess we do not get much speedup from that (compared to 2 seconds when sent alone: i.e., we have 128x more parallelism but it got ~100x slower, for little gain).

@savex
Contributor Author

savex commented Feb 7, 2024

Can you clarify the purpose of this script? Is it for investigation purposes or will we also use the methods to create topics during testing?

I am using this script to create topics in large numbers, with randomized naming and controllable batch sizes (how many topics are sent to RP in a single request).

The goal of the test being created is to check that RP can process topic creation in large numbers without failing/crashing or producing bad allocations when approaching topic/partition limits.

@travisdowns
Member

@savex wrote:

rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-4-91(4),ip-172-31-1-32(2),ip-172-31-14-83(2),ip-172-31-15-74(2),ip-172-31-8-17(2),ip-172-31-15-248(2),ip-172-31-0-114(2),ip-172-31-3-43(2),ip-172-31-2-107(2) example="WARN  2024-02-05 22:51:08,819 [shard 1:main] seastar_memory - oversized allocation: 786432 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x8bbc2a3 0x882655d 0x8833820 0x795e4dd 0x778c6b1 0x778bbb7 0x46ab30a 0x43cb1b6 0x443565e 0x4414f9f 0x2dbd9aa 0x88f149f 0x88f4c11 0x894f645 0x887e2af /opt/redpanda/lib/libc.so.6+0x91016 /opt/redpanda/lib/libc.so.6+0x1166cf">

Very nice: an oversized allocation. These are exactly the type of issues we want to suss out here. Do you know how to decode the backtrace?
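One common way to decode a backtrace like this is to pull the raw addresses out of the log line and feed them to `addr2line` against the unstripped binary (a sketch; the binary path is illustrative, and Seastar also ships a `seastar-addr2line` wrapper that automates this):

```python
# Sketch: extract backtrace frame addresses from an oversized-allocation
# log line and build the addr2line invocation that resolves them.
# The binary path below is illustrative.
import re
import shlex

def extract_addresses(log_line: str):
    """Pull the 0x... frame addresses out of a seastar backtrace log line."""
    return re.findall(r"0x[0-9a-f]+", log_line)

def addr2line_command(addresses, binary="/opt/redpanda/libexec/redpanda"):
    """Build the addr2line command (-C demangle, -f function names)."""
    return "addr2line -Cfe {} {}".format(shlex.quote(binary), " ".join(addresses))
```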

@savex
Contributor Author

savex commented Feb 7, 2024

No, had no chance yet.

@savex
Contributor Author

savex commented Feb 7, 2024

Here is the backtrace for recent logs on oversized allocation

/opt/redpanda/libexec/redpanda: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /opt/redpanda/lib/ld.so, for GNU/Linux 3.2.0, BuildID[sha1]=754f98daf43987fb40970d066dd369a9023d4dfa, stripped

INFO  2024-02-07 16:31:56,116 [shard 0:main] cluster - members_manager.cc:375 - applying finish_reallocations_cmd, offset: 65, node id: 5
WARN  2024-02-07 16:32:03,567 [shard 0:main] seastar_memory - oversized allocation: 393216 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
{/opt/redpanda/libexec/redpanda} 0x92f4ef3: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f5ca1c: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f69cf0: operator new(unsigned long) at ??:0
{/opt/redpanda/libexec/redpanda} 0x799e1fd: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x77cc0f1: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x77cb5f7: kafka::create_topics_request_data::decode_standard(kafka::protocol::decoder&, detail::base_named_type<short, kafka::kafka_api_version, std::__1::integral_constant<bool, true> >) at ??:0
{/opt/redpanda/libexec/redpanda} 0x46bbfca: kafka::handler_template<kafka::create_topics_api, (short)0, (short)6, seastar::future<seastar::foreign_ptr<std::__1::unique_ptr<kafka::response, std::__1::default_delete<kafka::response> > > >, &(kafka::default_estimate_adaptor(unsigned long, kafka::connection_context&))>::handle(kafka::request_context, seastar::smp_service_group) at ??:0
{/opt/redpanda/libexec/redpanda} 0x43dbe76: kafka::handler_base<false>::handle(kafka::request_context&&, seastar::smp_service_group) const at ??:0
{/opt/redpanda/libexec/redpanda} 0x444631e: kafka::process_request(kafka::request_context&&, seastar::smp_service_group, kafka::session_resources const&) at ??:0
{/opt/redpanda/libexec/redpanda} 0x4425c5f: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x2dd666a: seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ??:0
{/opt/redpanda/libexec/redpanda} 0x9029f1f: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x902d691: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x902a826: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f299a0: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f27d98: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x2c8c776: application::run(int, char**) at ??:0
{/opt/redpanda/libexec/redpanda} 0x934af99: main at ??:0
/opt/redpanda/lib/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=6e7b96dfb83f0bdcb6a410469b82f86415e5ada3, for GNU/Linux 3.2.0, stripped

{/opt/redpanda/lib/libc.so.6} 0x2d58f: ?? at ??:0
{/opt/redpanda/lib/libc.so.6} 0x2d648: __libc_start_main at ??:0
{/opt/redpanda/libexec/redpanda} 0x2c84fa4: _start at ??:0

WARN  2024-02-07 16:32:03,579 [shard 0:main] seastar_memory - oversized allocation: 1638400 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #1]
{/opt/redpanda/libexec/redpanda} 0x92f4ef3: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f5ca1c: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f69cf0: operator new(unsigned long) at ??:0
{/opt/redpanda/libexec/redpanda} 0x46d4772: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x434915a: seastar::internal::coroutine_traits_base<seastar::foreign_ptr<std::__1::unique_ptr<kafka::response, std::__1::default_delete<kafka::response> > > >::promise_type::run_and_dispose() at ??:0
{/opt/redpanda/libexec/redpanda} 0x9029f1f: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x902d691: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x902a826: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f299a0: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f27d98: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x2c8c776: application::run(int, char**) at ??:0
{/opt/redpanda/libexec/redpanda} 0x934af99: main at ??:0
{/opt/redpanda/lib/libc.so.6} 0x2d58f: ?? at ??:0
{/opt/redpanda/lib/libc.so.6} 0x2d648: __libc_start_main at ??:0
{/opt/redpanda/libexec/redpanda} 0x2c84fa4: _start at ??:0

INFO  2024-02-07 16:32:35,076 [shard 2:main] cluster - controller_backend.cc:395 - Stopping Controller Backend...
Reactor stalled for 65 ms on shard 0. Backtrace:
[Backtrace #2]
{/opt/redpanda/libexec/redpanda} 0x90056ef: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x9006e4d: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x42abf: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f9e09f: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f94536: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x6447f97: cluster::topic_table::~topic_table() at ??:0
{/opt/redpanda/libexec/redpanda} 0x6447dba: seastar::shared_ptr_count_for<cluster::topic_table>::~shared_ptr_count_for() at ??:0
{/opt/redpanda/libexec/redpanda} 0x644b157: std::__1::__function::__func<seastar::sharded<cluster::topic_table>::stop()::'lambda'(seastar::future<void>)::operator()(seastar::future<void>) const::'lambda'(unsigned int), std::__1::allocator<seastar::sharded<cluster::topic_table>::stop()::'lambda'(seastar::future<void>)::operator()(seastar::future<void>) const::'lambda'(unsigned int)>, seastar::future<void> (unsigned int)>::operator()(unsigned int&&) at ??:0
{/opt/redpanda/libexec/redpanda} 0x91055ef: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x644a8f4: seastar::sharded<cluster::topic_table>::stop()::'lambda'(seastar::future<void>)::operator()(seastar::future<void>) const at ??:0
{/opt/redpanda/libexec/redpanda} 0x644c6b6: seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::sharded<cluster::topic_table>::stop()::'lambda'(seastar::future<void>), seastar::futurize<seastar::future<void> >::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::sharded<cluster::topic_table>::stop()::'lambda'(seastar::future<void>)>(seastar::sharded<cluster::topic_table>::stop()::'lambda'(seastar::future<void>)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::sharded<cluster::topic_table>::stop()::'lambda'(seastar::future<void>)&, seastar::future_state<seastar::internal::monostate>&&), void>::run_and_dispose() at ??:0
{/opt/redpanda/libexec/redpanda} 0x9029f1f: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x902d691: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x902a826: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f299a0: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x8f27d98: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x2c8c776: application::run(int, char**) at ??:0
{/opt/redpanda/libexec/redpanda} 0x934af99: main at ??:0
{/opt/redpanda/libexec/redpanda} 0x2d58f: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x2d648: ?? at ??:0
{/opt/redpanda/libexec/redpanda} 0x2c84fa4: _start at ??:0

@savex
Contributor Author

savex commented Feb 7, 2024

Created issue: #16521

Comment on lines +1253 to +1342
# Free node that used to create topics
self.cluster.free_single(node)
Contributor


Why do we allocate an entire node for creating the topics? It seems like the ducktape driver node can create them just as well.

Contributor Author


So that we are not limited by the client node's lack of resources and can send requests as fast as possible. The client node has significantly fewer resources compared to the worker nodes. When checking the 40k case, this would be even more significant.

ioclass.flush()


class TopicSwarm():
Member


fun!

Contributor Author


The goal was to build an isolated tool, keeping complexity to a minimum.

@savex savex self-assigned this Feb 22, 2024
@savex savex requested a review from ballard26 February 23, 2024 00:35
@savex savex marked this pull request as draft February 23, 2024 20:20
@savex savex force-pushed the dp-995-large-topics-number branch 2 times, most recently from 130ad97 to 9c90b3c Compare February 26, 2024 23:44
@savex savex marked this pull request as ready for review February 27, 2024 00:11
@savex
Contributor Author

savex commented Feb 27, 2024

/ci-repeat

@savex savex changed the title Methods and scripts to create large number of topics Tests to create 10k+ topics and do checks with low throughput Feb 27, 2024
@savex savex changed the title Tests to create 10k+ topics and do checks with low throughput PESDLC-886 Tests to create 10k+ topics and do checks with low throughput Feb 27, 2024
    The script uses confluent_kafka to send requests to redpanda.
    It can skip topic name randomization, select the topic name
    prefix in the topic script, and skip randomization-related
    checks when not needed.
    The test accounts for the minimal node configuration of
    i3en.xlarge with 2 vcpus.
@savex savex closed this Feb 27, 2024
@savex savex deleted the dp-995-large-topics-number branch February 27, 2024 16:13

mergify bot commented Feb 27, 2024

⚠️ The sha of the head commit of this PR conflicts with #16750. Mergify cannot evaluate rules on this PR. ⚠️
