rptest: Implement flink service memory autosizing #16119

savex · 2024-01-17T00:12:21Z

The Flink service should detect instance specs and scale up memory for Flink's Job Manager and Task Manager processes. It should also start an additional Task Manager on the single node for each CPU.

Memory sizing strategy:
10% for system
10% for Job Manager
80% / vcpus for each Task Manager process
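
As a rough illustration of the split above, here is a minimal sketch. The names `FlinkMemoryPlan` and `plan_flink_memory` are hypothetical and are not taken from the PR; integer megabytes and floor rounding are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlinkMemoryPlan:
    # Hypothetical container for the sizing result; not the PR's actual class.
    system_mb: int
    job_manager_mb: int
    task_manager_mb: int
    task_managers: int


def plan_flink_memory(total_mb: int, vcpus: int) -> FlinkMemoryPlan:
    """Split node memory as 10% system, 10% Job Manager, 80% / vcpus per Task Manager."""
    if total_mb <= 0 or vcpus <= 0:
        raise ValueError("total_mb and vcpus must be positive")
    system_mb = total_mb * 10 // 100
    job_manager_mb = total_mb * 10 // 100
    # The remaining 80% is shared evenly by one Task Manager per vCPU.
    task_pool_mb = total_mb * 80 // 100
    return FlinkMemoryPlan(
        system_mb=system_mb,
        job_manager_mb=job_manager_mb,
        task_manager_mb=task_pool_mb // vcpus,
        task_managers=vcpus,
    )


plan = plan_flink_memory(total_mb=10000, vcpus=4)
print(plan)
```

With 10000 MB and 4 vCPUs, this gives 1000 MB each for the system and the Job Manager, and four Task Managers at 2000 MB each.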

Fixes: redpanda-data/devprod#1011

Backports Required

Release Notes

none

savex · 2024-01-17T01:33:35Z

Including the HTT team as reviewers, since the changes touch the HTT code a bit and add a new routine that identifies instance specs from the instance type name. cc: @ivotron, @piyushredpanda
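
For readers unfamiliar with the idea, a lookup by instance type name could look roughly like the sketch below. It only parses a dictionary shaped like an EC2 `DescribeInstanceTypes` response (fields `InstanceType`, `VCpuInfo.DefaultVCpus`, `MemoryInfo.SizeInMiB`). The names `InstanceSpecs` and `specs_from_describe_response` are hypothetical, and the routine actually added in this PR may work differently.

```python
from typing import Any, Dict, NamedTuple


class InstanceSpecs(NamedTuple):
    # Hypothetical result type; not the PR's actual class.
    instance_type: str
    vcpus: int
    memory_mb: int


def specs_from_describe_response(response: Dict[str, Any],
                                 instance_type: str) -> InstanceSpecs:
    """Pick vCPU count and memory for one instance type out of a
    DescribeInstanceTypes-shaped dictionary."""
    for entry in response.get("InstanceTypes", []):
        if entry.get("InstanceType") == instance_type:
            return InstanceSpecs(
                instance_type=instance_type,
                vcpus=int(entry["VCpuInfo"]["DefaultVCpus"]),
                memory_mb=int(entry["MemoryInfo"]["SizeInMiB"]),
            )
    raise KeyError(f"no specs found for instance type {instance_type!r}")


# Sample data written by hand for illustration; not output captured from AWS.
sample_response = {
    "InstanceTypes": [{
        "InstanceType": "m5.xlarge",
        "VCpuInfo": {"DefaultVCpus": 4},
        "MemoryInfo": {"SizeInMiB": 16384},
    }]
}
print(specs_from_describe_response(sample_response, "m5.xlarge"))
```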

piyushredpanda · 2024-01-17T03:32:50Z

Is this ready for review, @savex ?

savex · 2024-01-17T13:47:06Z

Not yet. It needs one more function in EC2 to access instance specs.

savex · 2024-01-17T16:45:13Z

It is ready for review.

vbotbuildovich · 2024-01-17T18:44:31Z

new failures in https://buildkite.com/redpanda/redpanda/builds/43840#018d187b-5027-46cb-8e1f-52fed440f885:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/43840#018d187b-502b-427d-a534-ffff8550f084:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/43840#018d188a-a9cb-4f70-8429-9d67ed94f1d9:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/43840#018d188a-a9cf-4e8b-af5e-f324479d4562:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/44098#018d31f0-bff5-4928-847b-aedb5087a3ad:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/44098#018d31f0-bff7-4f45-ab4e-5843512377dc:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/44098#018d3202-8375-4726-a72c-5a53f82387b5:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/44098#018d3202-8378-49bf-a425-14ebb32989ef:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"


vbotbuildovich · 2024-01-22T17:13:42Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44098#018d31f0-bff5-4928-847b-aedb5087a3ad

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44254#018d3e3e-88eb-4690-b03c-3247efdd6a58


savex · 2024-01-24T23:41:10Z

Conducted a quick check on EC2:

ubuntu@ip-172-31-57-74:~/tests$  cd /home/ubuntu/tests ; /usr/bin/env /bin/python3 /home/ubuntu/.vscode-server/extensions/ms-python.debugpy-2023.3.13341006-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 33167 -- -m ducktape --cluster=ducktape.cluster.json.JsonCluster --cluster-file=cluster.json --globals=globals.json --max-parallel=1 --repeat=1 --test-runner-timeout=86400000 rptest/tests/flink_basic_test.py::FlinkBasicTests.test_basic_workload 
[INFO:2024-01-24 23:39:40,517]: starting test run with session id 2024-01-24--002...
[INFO:2024-01-24 23:39:40,518]: running 1 tests...
[INFO:2024-01-24 23:39:40,518]: Triggering test 1 of 1...
[INFO:2024-01-24 23:39:41,549]: RunnerClient: Loading test {'directory': '/home/ubuntu/redpanda/tests/rptest/tests', 'file_name': 'flink_basic_test.py', 'cls_name': 'FlinkBasicTests', 'method_name': 'test_basic_workload', 'injected_args': None}
[INFO:2024-01-24 23:39:41,554]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: on run 1/1
[INFO:2024-01-24 23:39:42,496]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: Setting up...
[INFO:2024-01-24 23:39:48,990]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: Running...
[INFO:2024-01-24 23:40:29,355]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: Tearing down...
[INFO:2024-01-24 23:40:36,255]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: PASS
[INFO:2024-01-24 23:40:36,256]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: Data: None
test_id:    rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload
status:     PASS
run time:   54.701 seconds
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
=============================================================================================================================================================================================================================================================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.18
session_id:       2024-01-24--002
run time:         56.482 seconds
tests run:        1
passed:           1
flaky:            0
failed:           0
ignored:          0
opassed:          0
ofailed:          0
=============================================================================================================================================================================================================================================================================================================================

Used to get metadata and instance specs for AWS/EC2. Also, the metadata getter is updated to work with cluster.node.

Flink is autosized to use the whole node's memory. It will not consume all of it; the sizing only sets the maximums. Normally usage would not exceed 10G for fewer than 5 jobs.
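
One way such maximums could be expressed is through Flink's `jobmanager.memory.process.size` and `taskmanager.memory.process.size` options, which cap the total memory of each process. The sketch below renders those two settings as config lines. The helper name `render_flink_memory_conf` is hypothetical, and whether the PR writes these exact keys is an assumption.

```python
def render_flink_memory_conf(job_manager_mb: int, task_manager_mb: int) -> str:
    """Render Flink memory limits as 'key: value' lines for flink-conf.yaml.

    Assumption: limits are applied via the *.memory.process.size options;
    the PR itself may configure Flink in a different way.
    """
    if job_manager_mb <= 0 or task_manager_mb <= 0:
        raise ValueError("memory sizes must be positive")
    lines = [
        f"jobmanager.memory.process.size: {job_manager_mb}m",
        f"taskmanager.memory.process.size: {task_manager_mb}m",
    ]
    return "\n".join(lines) + "\n"


print(render_flink_memory_conf(job_manager_mb=1000, task_manager_mb=2000), end="")
```

These values are upper bounds on each process, which matches the note that Flink will not necessarily consume the whole allocation.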

savex requested review from andrewhsu and travisdowns January 17, 2024 00:12

savex self-assigned this Jan 17, 2024

savex changed the title ~~rptest: Update meta getter to support cluster node~~ rptest: Implement flink service memory autosizing Jan 17, 2024

savex marked this pull request as ready for review January 17, 2024 16:41

savex requested a review from bharathv January 17, 2024 16:42

bharathv reviewed Jan 19, 2024

tests/rptest/services/flink.py (outdated, resolved)

tests/rptest/services/flink.py (outdated, resolved)

savex force-pushed the dp-1011-flink-auto-memory-sizing branch from 5de5a2c to 99dc902 Compare January 22, 2024 15:11

savex force-pushed the dp-1011-flink-auto-memory-sizing branch from 8f06229 to a001e8b Compare January 22, 2024 18:54

savex requested a review from bharathv January 22, 2024 18:54

bharathv reviewed Jan 24, 2024

tests/rptest/services/flink.py (outdated, resolved)

tests/rptest/services/flink.py (outdated, resolved)

savex force-pushed the dp-1011-flink-auto-memory-sizing branch from f7e0d45 to 6d08313 Compare January 24, 2024 23:27

savex requested a review from bharathv January 24, 2024 23:35

savex force-pushed the dp-1011-flink-auto-memory-sizing branch from 6d08313 to 1b8eee5 Compare January 24, 2024 23:50

savex added 2 commits January 24, 2024 18:20

rptest: New instance specs util class

49f5b8a

Used to get metadata and instance specs for AWS/EC2. Also, the metadata getter is updated to work with cluster.node.

rptest: Memory sizing and task managers scaling

5d2c5fa

Flink is autosized to use the whole node's memory. It will not consume all of it; the sizing only sets the maximums. Normally usage would not exceed 10G for fewer than 5 jobs.

savex force-pushed the dp-1011-flink-auto-memory-sizing branch from 1b8eee5 to 5d2c5fa Compare January 25, 2024 00:21

bharathv approved these changes Jan 25, 2024


savex merged commit b941d1f into dev Jan 25, 2024
17 checks passed

savex deleted the dp-1011-flink-auto-memory-sizing branch January 25, 2024 14:04
