[performance] - Topic Operator Impact of Batch Size and Linger Settings on Kafka Topic Operations #10138

Open
wants to merge 22 commits into base: main

Conversation

see-quick
Member

@see-quick see-quick commented May 21, 2024

Type of change

  • Enhancement / new feature

Description

This PR focuses on exploring the impact of different configurations on the efficiency of creating, modifying, and deleting Kafka topics. I've experimented with a range of batch sizes and linger durations to see how they affect performance at different topic counts.

Based on this graph (KRaft):

image

One can see that I have tried multiple configurations, with batch sizes and linger settings ranging from 1 ms to 2000 ms. Moreover, I tested topic counts from 50 to 1000 to see whether a given configuration scales well or runs into problems (visible on each curve). This should help us understand the capabilities of the UTO with various settings and pick the configuration that scales best.

I have also changed the way we create the events. Currently we do it sequentially; I have modified it to use an ExecutorService that manages and processes batches concurrently. More on that in the Javadoc...
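
To illustrate the concurrent approach, here is a minimal, self-contained sketch (not the actual PR code; the processBatch helper, pool size, and batch size are assumptions):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ConcurrentTopicEventsSketch {

    // hypothetical stand-in for the real per-batch work (creating/modifying KafkaTopics)
    static void processBatch(int start, int end) {
        System.out.printf("processing topics %d..%d%n", start, end);
    }

    public static void main(String[] args) throws Exception {
        final int totalTopics = 1000;   // assumed scale, matching the tested range
        final int batchSize = 100;      // assumed batch size

        ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            // submit one task per batch instead of creating topics sequentially
            List<Future<?>> futures = IntStream.iterate(0, start -> start + batchSize)
                .limit((totalTopics + batchSize - 1) / batchSize)
                .mapToObj(start -> executor.submit(() -> processBatch(start, Math.min(start + batchSize, totalTopics))))
                .collect(Collectors.toList());

            // wait for all batches to complete before collecting measurements
            for (Future<?> f : futures) {
                f.get();
            }
        } finally {
            executor.shutdown();
        }
    }
}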

[1] - #10050 (review)

Checklist

  • Write tests
  • Make sure all tests pass

@see-quick see-quick self-assigned this May 21, 2024
@see-quick
Member Author

/packit test --labels performance-topic-operator-capacity

@see-quick see-quick added this to the 0.42.0 milestone May 21, 2024
@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testCapacityCreateAndUpdateTopics --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

@strimzi-ci

❌ Test Summary ❌

TEST_PROFILE: performance
GROUPS:
TEST_CASE: TopicOperatorPerformance#testCapacityCreateAndUpdateTopics
TOTAL: 6
PASS: 0
FAIL: 6
SKIP: 0
BUILD_NUMBER: 79
OCP_VERSION: 4.15
BUILD_IMAGES: false
FIPS_ENABLED: false
PARALLEL_COUNT: 5
EXCLUDED_GROUPS: loadbalancer,nodeport,olm
ENV_VARIABLES: STRIMZI_USE_KRAFT_IN_TESTS=true

❗ Test Failures ❗

  • testCapacityCreateAndUpdateTopics[6] 1000, 2000 in io.strimzi.systemtest.performance.TopicOperatorPerformance

Re-run command:
@strimzi-ci run tests --profile=performance --testcase=io.strimzi.systemtest.performance.TopicOperatorPerformance#testCapacityCreateAndUpdateTopics[6] 1000, 2000

@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testCapacityCreateAndUpdateTopics --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

@strimzi-ci

❌ Test Summary ❌

TEST_PROFILE: performance
GROUPS:
TEST_CASE: TopicOperatorPerformance#testCapacityCreateAndUpdateTopics
TOTAL: 6
PASS: 0
FAIL: 6
SKIP: 0
BUILD_NUMBER: 80
OCP_VERSION: 4.15
BUILD_IMAGES: false
FIPS_ENABLED: false
PARALLEL_COUNT: 5
EXCLUDED_GROUPS: loadbalancer,nodeport,olm
ENV_VARIABLES: STRIMZI_USE_KRAFT_IN_TESTS=true

❗ Test Failures ❗

  • testCapacityCreateAndUpdateTopics[6] 1000, 2000 in io.strimzi.systemtest.performance.TopicOperatorPerformance

Re-run command:
@strimzi-ci run tests --profile=performance --testcase=io.strimzi.systemtest.performance.TopicOperatorPerformance#testCapacityCreateAndUpdateTopics[6] 1000, 2000

Contributor

@fvaleri fvaleri left a comment

Hi @see-quick, thanks for working on this.

In order to simulate a busy shared cluster and possibly catch some edge cases, I think we should try to include all 3 kinds of topic events (creations, updates, and deletes) and run them in parallel.

In my custom test, I take the number of events I want to test as input, then divide it by 3 to get the number of tasks I have to run in parallel (you may have 1 or 2 spare events that you can simply consume as a no-op, that's fine). Each task executes topic creation, update (partition increase and config change), and deletion serially. WDYT?
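
A minimal sketch of the task splitting described above, assuming hypothetical topic helpers and pool size (not the code used in the test):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelTopicEventTasksSketch {

    // hypothetical placeholders for the three event kinds
    static void createTopic(String name) { }
    static void updateTopic(String name) { }   // e.g. partition increase + config change
    static void deleteTopic(String name) { }

    public static void main(String[] args) throws InterruptedException {
        final int numberOfEvents = 999;                         // assumed input
        final int eventsPerTask = 3;                            // create + update + delete
        final int numberOfTasks = numberOfEvents / eventsPerTask;
        final int spareEvents = numberOfEvents % eventsPerTask; // 0..2 leftovers, consumed as no-ops

        List<Callable<Void>> tasks = new ArrayList<>();
        for (int i = 0; i < numberOfTasks; i++) {
            final String topicName = "perf-topic-" + i;
            tasks.add(() -> {
                // each task runs its three events serially...
                createTopic(topicName);
                updateTopic(topicName);
                deleteTopic(topicName);
                return null;
            });
        }

        // ...while the tasks themselves run in parallel
        ExecutorService executor = Executors.newFixedThreadPool(8); // assumed pool size
        try {
            executor.invokeAll(tasks);
        } finally {
            executor.shutdown();
        }
        System.out.println("tasks: " + numberOfTasks + ", spare no-op events: " + spareEvents);
    }
}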

@see-quick
Member Author

see-quick commented May 27, 2024

Hi @see-quick, thanks for working on this.

In order to simulate a busy shared cluster and possibly catch some edge cases, I think we should try to include all 3 kinds of topic events (creations, updates, and deletes) and run them in parallel.

In my custom test, I take the number of events I want to test as input, then divide it by 3 to get the number of tasks I have to run in parallel (you may have 1 or 2 spare events that you can simply consume as a no-op, that's fine). Each task executes topic creation, update (partition increase and config change), and deletion serially. WDYT?

Okay, but that way we would not be able to see the upper bound (i.e., how many KafkaTopics the TO is able to handle for creation and modification). Maybe such information is not so important... And if we run all three operations, what is our termination condition? Do we want to create a specific number of topics (e.g., 1000) and see how the TO performs with different configurations? What are then the most important output metrics to check? Also, should we execute these tasks incrementally, divided into batches (e.g., every 100 KafkaTopics), as we do for capacity, or should we run all 1000 topics at once?

@fvaleri
Contributor

fvaleri commented May 27, 2024

we would not be able to see the upper bound (i.e., how many KafkaTopics the TO is able to handle for creation and modification)

I think the objective here is not to see the upper bound, but to assess performance with a fixed number of events. For example, I'm running the test with the following batches of events: 50, 100, 150, ..., 1000. That way you see how it scales, simply by putting the end-to-end reconciliation time (the only one we care about here) on a line graph, and you can compare it with a previous implementation on the very same graph.

By e2e reconciliation time in seconds I mean the time from creation/update to ready, or the deletion duration. This is how an example graph looks (note: we only need the numbers; then you can generate the graph with whatever tool you prefer):

Screenshot from 2024-05-27 11-01-57
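
For reference, a minimal sketch of how such an e2e reconciliation time could be measured; the helper methods are hypothetical placeholders for the real test-framework calls:

import java.time.Duration;
import java.time.Instant;

public class E2eReconciliationTimeSketch {

    // hypothetical placeholders for the real Strimzi test-framework calls
    static void createKafkaTopic(String name) { /* apply the KafkaTopic custom resource */ }
    static void waitForKafkaTopicReady(String name) { /* block until the Ready condition is reported */ }

    // e2e reconciliation time: from applying the KafkaTopic until the operator marks it Ready
    static Duration timeCreation(String topicName) {
        Instant start = Instant.now();
        createKafkaTopic(topicName);
        waitForKafkaTopicReady(topicName);
        return Duration.between(start, Instant.now());
    }

    public static void main(String[] args) {
        System.out.println("e2e creation time: " + timeCreation("perf-topic-0").toMillis() + " ms");
    }
}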

@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testPerformanceInFixedSizeOfEvents --env=STRIMZI_USE_KRAFT_IN_TESTS=true

1 similar comment
@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testPerformanceInFixedSizeOfEvents --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

@strimzi-ci

Systemtests Failed (no tests results are present)

@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testPerformanceInFixedSizeOfEvents --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

@strimzi-ci

✔️ Test Summary ✔️

TEST_PROFILE: null
GROUPS: null
TEST_CASE: null
TOTAL: 120
PASS: 120
FAIL: 0
SKIP: 0
BUILD_NUMBER: 82
OCP_VERSION: null
BUILD_IMAGES: false
FIPS_ENABLED: null
PARALLEL_COUNT: null
ENV_VARIABLES: STRIMZI_USE_KRAFT_IN_TESTS=true

@see-quick
Member Author

see-quick commented May 30, 2024

we would not be able to see the upper bound (i.e., how many KafkaTopics the TO is able to handle for creation and modification)

I think the objective here is not to see the upper bound, but to assess performance with a fixed number of events. For example, I'm running the test with the following batches of events: 50, 100, 150, ..., 1000. That way you see how it scales, simply by putting the end-to-end reconciliation time (the only one we care about here) on a line graph, and you can compare it with a previous implementation on the very same graph.

By e2e reconciliation time in seconds I mean the time from creation/update to ready, or the deletion duration. This is how an example graph looks (note: we only need the numbers; then you can generate the graph with whatever tool you prefer):

Screenshot from 2024-05-27 11-01-57

So I have tried 6 configurations here:

a) with internal metric - strimzi max reconciliation

image

b) with external metric - duration of all operations (i.e., create, modify and delete + readiness)

image

@see-quick see-quick changed the title from "[performance] - TopicOperator capacity create & modify" to "[performance] - Topic Operator Impact of Batch Size and Linger Settings on Kafka Topic Operations" May 30, 2024
@see-quick see-quick marked this pull request as ready for review May 30, 2024 09:10
@see-quick see-quick requested a review from a team May 30, 2024 21:26
Contributor

@fvaleri fvaleri left a comment

@see-quick nice work.

I left some improvement suggestions, but the base logic is there.

I would also try with BS 100 and LMS 10.

@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testPerformanceInFixedSizeOfEvents --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

Contributor

@fvaleri fvaleri left a comment

LGTM. Thanks.

@see-quick see-quick requested a review from a team June 19, 2024 08:20
Contributor

@henryZrncik henryZrncik left a comment

LGTM thanks!

Comment on lines +53 to +58
LOGGER.info("Resolved performance log directory: {} for use case '{}'. Max batch size: {}, Max linger time: {}, " +
"Clients enabled: {}, Number of topics: {}, Process type: {}",
topicOperatorUseCasePathDir, useCaseName, maxBatchSize, maxBatchLingerMs, clientsEnabled,
numberOfTopics.equals("0") ? "not included" : numberOfTopics, processType == null ? "not included" : processType);
Member

I don't know, should it be at INFO level?
Also, the name of the directory is self-explanatory, so maybe printing the full path would be sufficient? That way you would not need the processType there, nor the check for whether it's null.

Member Author

I don't know, should it be at INFO level?

I can change it to DEBUG... but I don't see any problem with having it at INFO because it's only printed once... :)

}

private static void performCreationWithWait(int start, int end, ExtensionContext currentContext, TestStorage testStorage) {
ResourceManager.setTestContext(currentContext);
Member

Why is the test context set here? Isn't that something the RM should handle by itself?

Member Author

Because this method is invoked in a new Thread. Therefore, if you do not set the context, you end up with an NPE, because a new Thread does not hold the state of the ExtensionContext, so you need to set it.

Member

Okay, I would add a comment there explaining why it's needed. (I know that it makes sense to you, but when someone else goes through the code, it will be as confusing for them as it was for me.)

Comment on lines +261 to +197
if (warmUpTasksToProcess != 0) {
    // boundary between tests => less likelihood that tests would influence each other
    LOGGER.info("Cooling down");
    try {
        Thread.sleep(30_000);
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
}
Member

It's fine that you are giving some space between the tests, but I'm not sure I understand how they can influence each other. You have deployed one Kafka cluster and you are just deploying the KafkaTopics with different configurations, so would the UTO have some kind of issue with that? The magic number there is not helping much either.
And "Cooling down" doesn't say much, TBH.

Member Author

So, as @fvaleri suggested, would a smaller interval make sense? I am also not sure how much it helps, because my setup is a little bit different from yours (i.e., you have a shared Kafka cluster where you do creation/modification and deletion), but in this case we deploy a brand new Kafka cluster. If it still makes sense, I can reduce it to 10 or 5 seconds.

Member

If it's needed, fine, let's use a shorter sleep. But I would definitely change the log message.

Member

You named it TopicOperatorPerformanceUtils, but it seems that you are working a lot with threading. Should it be some kind of object or a dedicated handler rather than "utils"?

Member Author

@see-quick see-quick Jun 24, 2024

I am not sure... there's not so much going on in that class, only starting/stopping the executor :) If you think it needs a separate handling class, I can create one.

Member

I think that "utils" makes it a bit weird, because of the stuff you are doing in there. But maybe it's just me. WDYT @strimzi/system-test-contributors?

Comment on lines +744 to +764
private static Stream<Arguments> provideConfigurationsForTestSystemScalability() {
    final int seed = 50;
    final int limit = 20; // up to 1000 topics to create

    // this means we would invoke create/update/delete operations x times in one iteration
    int[] batchEventSizes = IntStream.iterate(seed, n -> n + seed).limit(limit).toArray();
    return Stream.of(
            new Object[][]{
                {"100", "100"}, // Default configuration
                {"100", "10"}   // Default batch size, with lower linger time
            }
        )
        .flatMap(config -> Arrays.stream(batchEventSizes)
            .mapToObj(numberOfTopics -> {
                int eventPerTask = 3;
                int numberOfTasks = numberOfTopics / eventPerTask;
                int spareEvents = numberOfTopics % eventPerTask;

                return Arguments.of(config[0], config[1], numberOfTasks, spareEvents);
            }));
}
Member

As I mentioned offline, it's fine that you are doing these things, but what about having some smaller configuration that runs in every pipeline, and then giving the users who run the tests (IMHO it will be mainly @fvaleri) the possibility to configure what the tests should run with?
As it is, it will run for a long time, and in case the test fails you will again have to re-run all the tests. That doesn't apply just to this test, but to all the other perf tests.

Contributor

That's a good idea. IMO, the full test is something to run on demand, mainly to check some big change to the Topic Operator. This is something every maintainer or contributor can do while developing or reviewing, not just me :)

Member

Yeah well, I agree that every maintainer or contributor should do this. But I know how it usually ends. I think it would be more user-friendly if these kinds of tests were more configurable. It's also better for debugging issues, where you would just pass the config to the test and would not have to exclude test cases or wait a long time before you get to the point that interests you.

That way you would also be able to remove the cool-downs and other stuff that are present in the suite.

Contributor

I think the cool-down is not bad, but we can reduce it to 10 or 5 seconds.

Member

I agree with Lukas. Why do we have this pipeline that seems to be quite complex? AFAIU it is not easy to trigger just a specific configuration; you have to run all of the tests and wait... It doesn't sound great. We have trouble executing acceptance tests or regression on TF, so having one huge pipeline with this execution time is not very usable from my POV.

Why don't we have something a little bit shorter? Like a scenario where you create 500 topics and check that it finishes in some reasonable time? Or do we already have this kind of baseline that could be executed as part of regression? Kafka has its benchmark tests and AFAIR those don't take 10+ hours.

Contributor

The objective here is to test TO scalability, so you need to scale with an increasing number of topic events. This test doesn't take that long when running locally, but I see your concerns about running it on a CI pipeline with fewer resources. To make it faster, we could maybe run a shorter sequence of batches (e.g., 200, 400, 800, 1000). That takes around 2 minutes on my laptop. Would that work? I'm also not sure on which pipeline this would run and how it can be triggered.

Member

I guess even 1-2 hours is fine, but when I saw the pipeline running in Jenkins last week, it timed out after 22h, which is crazy. The timeout on TF is now set to 30h, which is crazy as well. If it takes only 2 minutes, why do we need a 30h timeout? Regarding resources... 30GB RAM and 8 CPUs is not so small, I would say.

Contributor

If it takes only 2 minutes, why do we need a 30h timeout?

It's around two minutes with the changes I proposed in my previous comment.

I actually didn't review the pipeline config as I don't know it very well, but I guess it was tuned for the initial implementation and needs to be revised now.

@see-quick knows more for sure.

Member Author

In the previous commits we had 7 different configurations, and each configuration went from 50 to 1000 events (increasing by 50). Now we have only 2 configurations. :) So execution time is no longer a problem.

As I mentioned offline, it's fine that you are doing these things, but what about having some smaller configuration that runs in every pipeline, and then giving the users who run the tests (IMHO it will be mainly @fvaleri) the possibility to configure what the tests should run with?
As it is, it will run for a long time, and in case the test fails you will again have to re-run all the tests. That doesn't apply just to this test, but to all the other perf tests.

Okay, so we can reduce everything to only two configurations 👍 now that we have this scalability test. WDYT?

systemtest/tmt/tests/strimzi/main.fmf (outdated; resolved)
@see-quick see-quick force-pushed the topic-oeprator-perf-capacity branch from 0fba37c to 003f5ff June 24, 2024 09:09
Signed-off-by: see-quick <maros.orsak159@gmail.com>
@see-quick see-quick force-pushed the topic-oeprator-perf-capacity branch from 3b5b10f to 0a40e7d June 26, 2024 07:11
@scholzj scholzj modified the milestones: 0.42.0, 0.43.0 Jul 1, 2024
7 participants