rptest: Implement flink service memory autosizing #16119

Merged
merged 2 commits into from
Jan 25, 2024

Conversation

savex
Contributor

@savex savex commented Jan 17, 2024

The Flink service should detect the instance specs and scale up memory for Flink's Job Manager and Task Manager processes accordingly. It should also start an additional Task Manager on the same node for each vCPU.

Memory sizing strategy:
10% for the system
10% for the Job Manager
80% divided by the number of vCPUs for each Task Manager process
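
The split above can be illustrated with a short sketch. This is not the code from the PR; the names `FlinkMemoryPlan` and `plan_flink_memory` are hypothetical, and the use of whole megabytes with integer division is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlinkMemoryPlan:
    """Hypothetical container for the sizing result (all values in MB)."""
    system_mb: int
    jobmanager_mb: int
    taskmanager_mb: int      # memory for EACH Task Manager process
    taskmanager_count: int   # one Task Manager per vCPU


def plan_flink_memory(total_mb: int, vcpus: int) -> FlinkMemoryPlan:
    """Apply the 10% / 10% / 80% strategy from the PR description.

    Assumption: sizes are rounded down to whole megabytes, so the sum of
    all shares never exceeds the node's total memory.
    """
    if total_mb <= 0 or vcpus <= 0:
        raise ValueError("total_mb and vcpus must be positive")

    system_mb = total_mb * 10 // 100             # 10% left for the system
    jobmanager_mb = total_mb * 10 // 100         # 10% for the Job Manager
    taskmanagers_total_mb = total_mb * 80 // 100  # 80% shared by Task Managers
    taskmanager_mb = taskmanagers_total_mb // vcpus

    return FlinkMemoryPlan(
        system_mb=system_mb,
        jobmanager_mb=jobmanager_mb,
        taskmanager_mb=taskmanager_mb,
        taskmanager_count=vcpus,
    )
```

For a node with 16384 MB and 4 vCPUs, this sketch reserves 1638 MB for the system, 1638 MB for the Job Manager, and 3276 MB for each of the 4 Task Managers.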

Fixes: redpanda-data/devprod#1011

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

  • none

@savex savex self-assigned this Jan 17, 2024
@savex savex changed the title from "rptest: Update meta getter to support cluster node" to "rptest: Implement flink service memory autosizing" Jan 17, 2024
@savex
Contributor Author

savex commented Jan 17, 2024

Including the HTT team as reviewers, since the changes touch the HTT code a bit and add a new routine that identifies instance specs from the instance type name. cc: @ivotron, @piyushredpanda
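
As a rough illustration of looking up specs by instance type name (not the routine added in this PR), the sketch below extracts vCPU count and memory from a response shaped like the EC2 DescribeInstanceTypes API result, as returned by boto3's `describe_instance_types`. The function name `specs_from_describe_response` is hypothetical, and the caller is assumed to have already fetched the response.

```python
from typing import Any, Dict, Tuple


def specs_from_describe_response(response: Dict[str, Any],
                                 instance_type: str) -> Tuple[int, int]:
    """Return (vcpus, memory_mib) for instance_type.

    Assumption: `response` follows the EC2 DescribeInstanceTypes shape,
    i.e. a top-level "InstanceTypes" list whose entries carry
    "InstanceType", "VCpuInfo.DefaultVCpus" and "MemoryInfo.SizeInMiB".
    """
    for entry in response.get("InstanceTypes", []):
        if entry.get("InstanceType") == instance_type:
            vcpus = entry["VCpuInfo"]["DefaultVCpus"]
            memory_mib = entry["MemoryInfo"]["SizeInMiB"]
            return vcpus, memory_mib
    # Hypothetical error handling: the type name was not in the response.
    raise KeyError(f"instance type not found in response: {instance_type}")
```

With boto3 available, the response could be obtained with `boto3.client("ec2").describe_instance_types(InstanceTypes=["m5.xlarge"])` and passed to this function together with the same type name.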

@piyushredpanda
Contributor

Is this ready for review, @savex?

@savex
Contributor Author

savex commented Jan 17, 2024

Not yet. It needs one more function in EC2 to access instance specs.

@savex savex marked this pull request as ready for review January 17, 2024 16:41
@savex savex requested a review from bharathv January 17, 2024 16:42
@savex
Contributor Author

savex commented Jan 17, 2024

It is ready for review.

@vbotbuildovich
Collaborator

vbotbuildovich commented Jan 17, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/43840#018d187b-5027-46cb-8e1f-52fed440f885:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/43840#018d187b-502b-427d-a534-ffff8550f084:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/43840#018d188a-a9cb-4f70-8429-9d67ed94f1d9:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/43840#018d188a-a9cf-4e8b-af5e-f324479d4562:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/44098#018d31f0-bff5-4928-847b-aedb5087a3ad:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/44098#018d31f0-bff7-4f45-ab4e-5843512377dc:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/44098#018d3202-8375-4726-a72c-5a53f82387b5:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload"

new failures in https://buildkite.com/redpanda/redpanda/builds/44098#018d3202-8378-49bf-a425-14ebb32989ef:

"rptest.tests.flink_basic_test.FlinkBasicTests.test_transaction_workload"

tests/rptest/services/flink.py (outdated review thread, resolved)
tests/rptest/services/flink.py (outdated review thread, resolved)
@savex savex force-pushed the dp-1011-flink-auto-memory-sizing branch from 5de5a2c to 99dc902 Compare January 22, 2024 15:11
@savex savex force-pushed the dp-1011-flink-auto-memory-sizing branch from 8f06229 to a001e8b Compare January 22, 2024 18:54
@savex savex requested a review from bharathv January 22, 2024 18:54
tests/rptest/services/flink.py (outdated review thread, resolved)
tests/rptest/services/flink.py (outdated review thread, resolved)
@savex savex force-pushed the dp-1011-flink-auto-memory-sizing branch from f7e0d45 to 6d08313 Compare January 24, 2024 23:27
@savex savex requested a review from bharathv January 24, 2024 23:35
@savex
Contributor Author

savex commented Jan 24, 2024

Ran a quick check on EC2:

ubuntu@ip-172-31-57-74:~/tests$  cd /home/ubuntu/tests ; /usr/bin/env /bin/python3 /home/ubuntu/.vscode-server/extensions/ms-python.debugpy-2023.3.13341006-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 33167 -- -m ducktape --cluster=ducktape.cluster.json.JsonCluster --cluster-file=cluster.json --globals=globals.json --max-parallel=1 --repeat=1 --test-runner-timeout=86400000 rptest/tests/flink_basic_test.py::FlinkBasicTests.test_basic_workload 
[INFO:2024-01-24 23:39:40,517]: starting test run with session id 2024-01-24--002...
[INFO:2024-01-24 23:39:40,518]: running 1 tests...
[INFO:2024-01-24 23:39:40,518]: Triggering test 1 of 1...
[INFO:2024-01-24 23:39:41,549]: RunnerClient: Loading test {'directory': '/home/ubuntu/redpanda/tests/rptest/tests', 'file_name': 'flink_basic_test.py', 'cls_name': 'FlinkBasicTests', 'method_name': 'test_basic_workload', 'injected_args': None}
[INFO:2024-01-24 23:39:41,554]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: on run 1/1
[INFO:2024-01-24 23:39:42,496]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: Setting up...
[INFO:2024-01-24 23:39:48,990]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: Running...
[INFO:2024-01-24 23:40:29,355]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: Tearing down...
[INFO:2024-01-24 23:40:36,255]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: PASS
[INFO:2024-01-24 23:40:36,256]: RunnerClient: rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload: Data: None
test_id:    rptest.tests.flink_basic_test.FlinkBasicTests.test_basic_workload
status:     PASS
run time:   54.701 seconds
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
=============================================================================================================================================================================================================================================================================================================================
SESSION REPORT (ALL TESTS)
ducktape version: 0.8.18
session_id:       2024-01-24--002
run time:         56.482 seconds
tests run:        1
passed:           1
flaky:            0
failed:           0
ignored:          0
opassed:          0
ofailed:          0
=============================================================================================================================================================================================================================================================================================================================

@savex savex force-pushed the dp-1011-flink-auto-memory-sizing branch from 6d08313 to 1b8eee5 Compare January 24, 2024 23:50
   Used to get metadata and instance specs for AWS/EC2.
   Also, the metadata getter is updated to work with cluster.node.
  Flink is autosized to use the whole node's memory. It will not consume all
  of it; the sizes only set the maximums. Normally usage would not exceed 10G
  for fewer than 5 jobs.
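
One way such maximums could be expressed is through Flink's total process memory options, `jobmanager.memory.process.size` and `taskmanager.memory.process.size`. The sketch below renders those two settings from sizes in megabytes. Whether the PR writes these exact keys is an assumption, and `render_flink_memory_config` is a hypothetical helper, not part of rptest.

```python
def render_flink_memory_config(jobmanager_mb: int, taskmanager_mb: int) -> str:
    """Render Flink memory limits as configuration lines.

    Assumption: limits are applied via Flink's total process memory
    options; the "m" suffix denotes megabytes in Flink's size syntax.
    """
    if jobmanager_mb <= 0 or taskmanager_mb <= 0:
        raise ValueError("memory sizes must be positive")

    lines = [
        # Upper bound for the Job Manager JVM process.
        f"jobmanager.memory.process.size: {jobmanager_mb}m",
        # Upper bound for EACH Task Manager JVM process.
        f"taskmanager.memory.process.size: {taskmanager_mb}m",
    ]
    return "\n".join(lines) + "\n"
```

These values cap what each process may use; they do not preallocate the memory, which matches the note that Flink will not consume the whole node.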
@savex savex force-pushed the dp-1011-flink-auto-memory-sizing branch from 1b8eee5 to 5d2c5fa Compare January 25, 2024 00:21
@savex savex merged commit b941d1f into dev Jan 25, 2024
17 checks passed
@savex savex deleted the dp-1011-flink-auto-memory-sizing branch January 25, 2024 14:04
4 participants