[performance] - Remove bulk and streams use cases from UO and TO. Scallability test added. #10138

Open: wants to merge 37 commits into base: main


see-quick
Member

@see-quick see-quick commented May 21, 2024

Type of change

  • Enhancement / new feature

Description

This PR focuses on exploring the impact of different configurations on the efficiency of creating, modifying, and deleting Kafka topics. I've played around with a range of batch sizes and linger durations to see how they affect performance across different scales of topic counts.

Based on this graph (KRaft):

image

As the graph shows, I tried multiple configurations of batch size and linger, with linger settings ranging from 1 ms to 2000 ms. The number of topics tested ranged from 50 to 1000, to see whether each configuration scales well or runs into problems (this is visible on each curve). This should help us understand the capabilities of the UTO under various settings and pick the configuration that scales best.

I have also changed how we create the events. Previously we created them sequentially; now I use an ExecutorService to manage and process batches concurrently. More on that in the Javadoc.
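
For readers who have not opened the diff, the idea of processing batches concurrently with an ExecutorService can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class `ConcurrentBatches`, its methods, and the no-op action standing in for a topic operation are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntConsumer;

// Hypothetical sketch: none of these names come from the PR itself.
public class ConcurrentBatches {

    /** Splits the index range [0, total) into consecutive batches of at most batchSize items. */
    static List<List<Integer>> partition(int total, int batchSize) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start < total; start += batchSize) {
            List<Integer> batch = new ArrayList<>();
            for (int i = start; i < Math.min(start + batchSize, total); i++) {
                batch.add(i);
            }
            batches.add(batch);
        }
        return batches;
    }

    /** Submits every batch to a fixed thread pool, waits for all of them, and returns the item count. */
    static int processConcurrently(int total, int batchSize, int threads, IntConsumer action) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<Integer> batch : partition(total, batchSize)) {
                Callable<Integer> job = () -> {
                    for (int index : batch) {
                        action.accept(index); // items inside one batch run serially
                    }
                    return batch.size();
                };
                futures.add(executor.submit(job)); // batches run concurrently
            }
            int processed = 0;
            for (Future<Integer> future : futures) {
                processed += future.get(); // blocks until the batch finishes
            }
            return processed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for batches", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A batch failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // The no-op lambda is a placeholder for a real topic operation.
        int done = processConcurrently(250, 100, 4, index -> { });
        System.out.println("processed " + done + " items");
    }
}
```

Waiting on every `Future` keeps failures visible: an exception in any batch surfaces in the caller instead of being lost on a pool thread.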

[1] - #10050 (review)

Update (19.9.2024):

After a few modifications, we have also decided to remove two use cases from TO and UO (i.e., Alice's bulk and Bob's streaming). We do not think they add much value, so for now we will stick to the capacity and scalability tests, which are now present in those test suites.

Checklist

  • Write tests
  • Make sure all tests pass

@see-quick see-quick self-assigned this May 21, 2024
@see-quick
Member Author

/packit test --labels performance-topic-operator-capacity

@see-quick see-quick added this to the 0.42.0 milestone May 21, 2024
@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testCapacityCreateAndUpdateTopics --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

@strimzi-ci

❌ Test Summary ❌

TEST_PROFILE: performance
GROUPS:
TEST_CASE: TopicOperatorPerformance#testCapacityCreateAndUpdateTopics
TOTAL: 6
PASS: 0
FAIL: 6
SKIP: 0
BUILD_NUMBER: 79
OCP_VERSION: 4.15
BUILD_IMAGES: false
FIPS_ENABLED: false
PARALLEL_COUNT: 5
EXCLUDED_GROUPS: loadbalancer,nodeport,olm
ENV_VARIABLES: STRIMZI_USE_KRAFT_IN_TESTS=true

❗ Test Failures ❗

  • testCapacityCreateAndUpdateTopics[6] 1000, 2000 in io.strimzi.systemtest.performance.TopicOperatorPerformance

Re-run command:
@strimzi-ci run tests --profile=performance --testcase=io.strimzi.systemtest.performance.TopicOperatorPerformance#testCapacityCreateAndUpdateTopics[6] 1000, 2000

@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testCapacityCreateAndUpdateTopics --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

@strimzi-ci

❌ Test Summary ❌

TEST_PROFILE: performance
GROUPS:
TEST_CASE: TopicOperatorPerformance#testCapacityCreateAndUpdateTopics
TOTAL: 6
PASS: 0
FAIL: 6
SKIP: 0
BUILD_NUMBER: 80
OCP_VERSION: 4.15
BUILD_IMAGES: false
FIPS_ENABLED: false
PARALLEL_COUNT: 5
EXCLUDED_GROUPS: loadbalancer,nodeport,olm
ENV_VARIABLES: STRIMZI_USE_KRAFT_IN_TESTS=true

❗ Test Failures ❗

  • testCapacityCreateAndUpdateTopics[6] 1000, 2000 in io.strimzi.systemtest.performance.TopicOperatorPerformance

Re-run command:
@strimzi-ci run tests --profile=performance --testcase=io.strimzi.systemtest.performance.TopicOperatorPerformance#testCapacityCreateAndUpdateTopics[6] 1000, 2000

Contributor

@fvaleri fvaleri left a comment

Hi @see-quick, thanks for working on this.

In order to simulate a busy shared cluster and possibly catch some edge cases, I think we should try to include all 3 kinds of topic events (creations, updates and deletes) and run them in parallel.

In my custom test, I take the number of events I want to test as input, then divide it by 3 to get the number of tasks I have to run in parallel (you may have 1 or 2 spare events that you can simply consume as no-ops, that's fine). Each task executes topic creation, update (partition increase and config change) and deletion serially. Wdyt?
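
The task layout described above could look roughly like the sketch below. It is only an illustration under assumptions: the class `TopicLifecycleTasks`, the thread count, and the empty `createTopic`/`updateTopic`/`deleteTopic` placeholders are hypothetical and do not come from the custom test mentioned in the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of "events / 3 parallel tasks, each running create, update, delete serially".
public class TopicLifecycleTasks {
    static final int EVENTS_PER_TASK = 3; // one create, one update, one delete

    static int taskCount(int events) {
        return events / EVENTS_PER_TASK;
    }

    static int spareEvents(int events) {
        return events % EVENTS_PER_TASK; // 0, 1 or 2 leftover events
    }

    // Placeholders: a real test would create, patch and delete KafkaTopic resources here.
    static void createTopic(String name) { }
    static void updateTopic(String name) { }
    static void deleteTopic(String name) { }

    /** Runs all lifecycle tasks in parallel and returns the number of events consumed. */
    static int runEvents(int events, int threads) {
        AtomicInteger consumed = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Void>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount(events); i++) {
                final String topic = "topic-" + i;
                tasks.add(CompletableFuture.runAsync(() -> {
                    // The three events of one task run serially on the same topic.
                    createTopic(topic);
                    updateTopic(topic);
                    deleteTopic(topic);
                    consumed.addAndGet(EVENTS_PER_TASK);
                }, executor));
            }
            CompletableFuture.allOf(tasks.toArray(new CompletableFuture[0])).join();
        } finally {
            executor.shutdown();
        }
        // Spare events are consumed as no-ops so the total matches the input.
        return consumed.addAndGet(spareEvents(events));
    }

    public static void main(String[] args) {
        System.out.println("tasks=" + taskCount(1000) + ", spare=" + spareEvents(1000));
        System.out.println("consumed=" + runEvents(1000, 8));
    }
}
```

With 1000 events this yields 333 parallel tasks and 1 spare event.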

@see-quick
Member Author

see-quick commented May 27, 2024

> Hi @see-quick, thanks for working on this.
>
> In order to simulate a busy shared cluster and possibly catch some edge cases, I think we should try to include all 3 kinds of topic events (creations, updates and deletes) and run them in parallel.
>
> In my custom test, I take the number of events I want to test as input, then divide it by 3 to get the number of tasks I have to run in parallel (you may have 1 or 2 spare events that you can simply consume as no-ops, that's fine). Each task executes topic creation, update (partition increase and config change) and deletion serially. Wdyt?

Okay, but that way we would not be able to see the upper bound (i.e., how many KafkaTopics the TO is able to handle during creation and modification). Maybe that information is not so important. And if we run all three operations, what is our termination condition? Do we want to create a specific number of topics (e.g., 1000) and see how the TO performs under different configurations? What are then the most important OUT metrics to check? Also, should we execute these tasks incrementally, divided into batches (e.g., every 100 KafkaTopics) as we do for capacity, or should we run all 1000 topics at once?

@fvaleri
Contributor

fvaleri commented May 27, 2024

> we would not be able to see the upper bound (i.e., how many KafkaTopics is TO able to handle in creation and modification)

I think the objective here is not to see the upper bound, but to assess performance for a fixed number of events. For example, I'm running the test with the following batches of events: 50, 100, 150, ..., 1000. That way, you see how it scales by simply putting the end-to-end reconciliation time (we only care about this one here) on a line graph, and you can compare with a previous implementation on the very same graph.

By e2e reconciliation time in seconds I mean the time from creation/update to ready, or the deletion duration. This is what an example graph looks like (note: we only need the numbers, then you can generate the graph with whatever tool you prefer):

Screenshot from 2024-05-27 11-01-57
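
Collecting the numbers for such a line graph could be sketched as below. This is an assumption-laden illustration: the class `ScalabilitySeries`, the CSV output, and `simulatedReconciliation` (a sleep standing in for "run N events and wait until all topics are ready or deleted") are hypothetical, and the timings it prints are not measurements of the Topic Operator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: times a workload for 50, 100, ..., 1000 events and prints CSV rows.
public class ScalabilitySeries {

    /** Returns step, 2*step, ... up to and including max (when max is a multiple of step). */
    static List<Integer> eventCounts(int step, int max) {
        List<Integer> counts = new ArrayList<>();
        for (int n = step; n <= max; n += step) {
            counts.add(n);
        }
        return counts;
    }

    /** Measures the wall-clock duration of the workload in seconds. */
    static double measureSeconds(Runnable workload) {
        long start = System.nanoTime();
        workload.run();
        return (System.nanoTime() - start) / 1_000_000_000.0;
    }

    // Placeholder for the real work; the sleep only makes the output non-trivial.
    static void simulatedReconciliation(int events) {
        try {
            Thread.sleep(events / 50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("events,e2e_seconds");
        for (int events : eventCounts(50, 1000)) {
            double seconds = measureSeconds(() -> simulatedReconciliation(events));
            System.out.println(events + "," + seconds);
        }
    }
}
```

The CSV output can then be plotted with any tool, with one series per configuration or implementation for comparison on the same graph.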

@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testPerformanceInFixedSizeOfEvents --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

@strimzi-ci

Systemtests Failed (no tests results are present)

@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testPerformanceInFixedSizeOfEvents --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

@strimzi-ci

✔️ Test Summary ✔️

TEST_PROFILE: null
GROUPS: null
TEST_CASE: null
TOTAL: 120
PASS: 120
FAIL: 0
SKIP: 0
BUILD_NUMBER: 82
OCP_VERSION: null
BUILD_IMAGES: false
FIPS_ENABLED: null
PARALLEL_COUNT: null
ENV_VARIABLES: STRIMZI_USE_KRAFT_IN_TESTS=true

@see-quick
Member Author

see-quick commented May 30, 2024

> > we would not be able to see the upper bound (i.e., how many KafkaTopics is TO able to handle in creation and modification)
>
> I think the objective here is not to see the upper bound, but to assess performance for a fixed number of events. For example, I'm running the test with the following batches of events: 50, 100, 150, ..., 1000. That way, you see how it scales by simply putting the end-to-end reconciliation time (we only care about this one here) on a line graph, and you can compare with a previous implementation on the very same graph.
>
> By e2e reconciliation time in seconds I mean the time from creation/update to ready, or the deletion duration. This is what an example graph looks like (note: we only need the numbers, then you can generate the graph with whatever tool you prefer):
>
> Screenshot from 2024-05-27 11-01-57

So I have tried 6 configurations here:

a) with internal metric - strimzi max reconciliation

image

b) with external metric - duration of all operations (i.e., create, modify and delete + readiness)

image

@see-quick see-quick changed the title from "[performance] - TopicOperator capacity create & modify" to "[performance] - Topic Operator Impact of Batch Size and Linger Settings on Kafka Topic Operations" May 30, 2024
@see-quick see-quick marked this pull request as ready for review May 30, 2024 09:10
@see-quick see-quick requested a review from a team May 30, 2024 21:26
Contributor

@fvaleri fvaleri left a comment

@see-quick nice work.

I left some improvement suggestions, but the base logic is there.

I would also try with batch size (BS) 100 and linger (LMS) 10 ms.

@see-quick
Member Author

@strimzi-ci run tests --cluster-type=ocp --cluster-version=4.15 --install-type=bundle --profile=performance --testcase=TopicOperatorPerformance#testPerformanceInFixedSizeOfEvents --env=STRIMZI_USE_KRAFT_IN_TESTS=true

@strimzi-ci

▶️ Build started - check Jenkins for more info. ▶️

Signed-off-by: see-quick <maros.orsak159@gmail.com>
@see-quick
Member Author

/packit test --labels performance

Signed-off-by: see-quick <maros.orsak159@gmail.com>
.packit.yaml (outdated review thread, resolved)
@Frawless Frawless self-requested a review September 18, 2024 17:12
Signed-off-by: see-quick <maros.orsak159@gmail.com>
@see-quick
Member Author

/packit test --labels performance

@im-konge im-konge marked this pull request as ready for review September 19, 2024 08:07
Signed-off-by: see-quick <maros.orsak159@gmail.com>
@see-quick see-quick changed the title from "[performance] - Topic Operator Impact of Batch Size and Linger Settings on Kafka Topic Operations" to "[performance] - Remove bulk and streams use cases from UO and TO. Scallability test added." Sep 19, 2024
@see-quick
Member Author

/packit test --labels performance

Signed-off-by: see-quick <maros.orsak159@gmail.com>
@see-quick
Member Author

/packit test --labels performance

Signed-off-by: see-quick <maros.orsak159@gmail.com>
@see-quick
Member Author

/packit test --labels performance

Member

@im-konge im-konge left a comment

LGTM, thanks

Signed-off-by: see-quick <maros.orsak159@gmail.com>
7 participants