test/topology_custom: test_read_repair.py: reduce run-time #17529
Conversation
Force-pushed from 90591b4 to 7a53267
🔴 CI State: FAILURE
✅ Build
Failed Tests (1/23435)
Please check flakiness (by running multiple times), and the debug-mode runtime reduction.
@@ -212,7 +212,9 @@ async def test_incremental_read_repair(data_class, workdir, manager):
     seed = int(time.time())
     logger.info(f"random-seed: {seed}")
     random.seed(seed)
-    cmdline = ["--hinted-handoff-enabled", "0", "--query-tombstone-page-limit", "1000"]
+    cmdline = ["--hinted-handoff-enabled", "0",
+               "--query-tombstone-page-limit", "10",
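Per the commit message, the full set of flags the test now passes would look roughly like the sketch below. The diff is truncated after the tombstone-limit flag, so this is reconstructed from the description, not the literal patch; in particular, spelling 1 KiB as `"1024"` for query-page-size-in-bytes is an assumption.

```python
# Sketch of the updated test setup, reconstructed from the commit message.
cmdline = [
    "--hinted-handoff-enabled", "0",
    # Dropped from 1000 to 10: since commit f068d1a this limit no longer
    # affects unpaged (internal) queries, so a tiny value is safe here.
    "--query-tombstone-page-limit", "10",
    # New option introduced earlier in this series: cut pages at 1 KiB
    # instead of the previously hard-coded 1 MiB.
    "--query-page-size-in-bytes", "1024",
]
```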
Nitpick: (feel free to ignore, because it's a pre-existing issue)
It's sad we rely more and more on each test starting a Scylla cluster from scratch. It's not really necessary: see test/cql-pytest/test_tombstone_limit.py for an example of a test that can use any Scylla cluster and temporarily reduce query_tombstone_page_limit to 10. That approach doesn't allow parallel tests on the same cluster, but it at least allows reusing the cluster, which could shave the remaining 13.5 seconds of this test down to much less.
Yes, I suspect almost half of the test's run-time is startup and teardown. In theory clusters are reused, although I don't know under what conditions.
Does adding custom command-line parameters make a cluster dirty? @kostja, @kbr-scylla ?
Here, I can use the config table and revert my config changes at the end of the test, if that will make this cluster reusable.
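The revert-at-the-end approach mentioned here can be sketched as a small context manager. This is a hypothetical helper (the real test/cql-pytest utilities differ); it assumes Scylla's live-updatable system.config virtual table and a cassandra-driver-style session object:

```python
from contextlib import contextmanager

@contextmanager
def temporary_config(cql, name, value):
    # Read the current value, override it, and restore it on exit, so the
    # cluster stays clean and reusable by other tests. `cql` is assumed to
    # be a driver session; system.config being UPDATE-able for
    # live-updatable options is a Scylla feature.
    old = cql.execute("SELECT value FROM system.config WHERE name = %s",
                      (name,)).one().value
    cql.execute("UPDATE system.config SET value = %s WHERE name = %s",
                (str(value), name))
    try:
        yield
    finally:
        cql.execute("UPDATE system.config SET value = %s WHERE name = %s",
                    (old, name))
```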
Adding a node makes the cluster dirty.
Since test/topology_custom starts with a 0-node cluster, every test has to add at least one node, which means there is no cluster reuse in this suite.
I know this was discussed many times... sorry for piling on more. I think that instead of different suites for different defaults (which are impossible to guess from the suite name), we should merge these suites as much as possible and have tests communicate what kind of cluster they need, via fixtures. We could have fixtures for the most common setups, as well as a fixture for a "custom" one; tests depending on that fixture would get a 0-node cluster.
We can also use decorators if fixtures prove problematic.
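A minimal, self-contained sketch of the idea: tests lease a cluster of a declared shape, clean clusters are reused, dirty ones are discarded. All names here (ClusterPool, Cluster, lease/release) are invented for illustration and are not the real test.py API:

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    nodes: int
    dirty: bool = False

class ClusterPool:
    def __init__(self):
        self._clean = []   # reusable clusters
        self.started = 0   # how many clusters were started from scratch

    def lease(self, nodes):
        # Reuse a clean cluster of the right size if one is available...
        for c in self._clean:
            if c.nodes == nodes and not c.dirty:
                self._clean.remove(c)
                return c
        # ...otherwise start a new one: the expensive path this
        # discussion wants to avoid.
        self.started += 1
        return Cluster(nodes=nodes)

    def release(self, cluster):
        # Only clean clusters go back into the pool; dirty ones (e.g. a
        # test added a node or used custom cmdline flags) are discarded.
        if not cluster.dirty:
            self._clean.append(cluster)
```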
Agreed.
But merging suites should also come with an improvement to test.py and the cluster pool, so we better utilize the available resources on the test machine. Currently it's very stupid: each suite has its constant pool_size, which determines how many concurrent clusters run within that suite, and different suites have no concurrency control w.r.t. each other (they all run at the same time, IIUC).
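The cross-suite concurrency control suggested here could be as simple as one machine-wide semaphore shared by all suites, replacing the per-suite pool_size constants. A hypothetical asyncio sketch (all names invented, not the actual test.py code):

```python
import asyncio

async def run_suite(name, tests, global_slots):
    # Each test takes a machine-wide cluster slot before running, so
    # total concurrency is bounded across all suites, not per suite.
    async def run_one(test):
        async with global_slots:
            await asyncio.sleep(0)  # placeholder for the real test body
            return f"{name}:{test}"
    return await asyncio.gather(*(run_one(t) for t in tests))

async def run_all(suites, max_clusters):
    # Suites still run concurrently with each other, but the number of
    # live clusters never exceeds max_clusters machine-wide.
    slots = asyncio.Semaphore(max_clusters)
    results = await asyncio.gather(
        *(run_suite(n, ts, slots) for n, ts in suites.items()))
    return [r for suite in results for r in suite]
```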
And after everyone agreed with each other, we all went home and nothing was done.
Regulates the page size in bytes via config, instead of the currently used hard-coded constant. Allows tests to configure lower limits so they can work with smaller data sets when testing paging-related functionality. Not wired yet.
Returns an instance with the page_limit reset to 0. This converts a max_results_size which is usable only with the "page_size_and_safety_limit" feature, to one which can be used before this feature. To be used in the next patch.
This patch changes get_unlimited_query_max_result_size():
* Also sets the page-size field, not just the soft/hard limits
* Renames it to get_query_max_result_size()
* Updates callers, specifically storage_proxy::get_max_result_size(), which now has a much simpler common return path and has to drop the page size on one rare return path.
This is a purely mechanical change, no behaviour is changed.
…_bytes As the page size for user queries, instead of the hard-coded constant used before. For system queries, we keep using the previous constant.
This test needed a lot of data to ensure multiple pages when doing the read repair. This changes two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time.
* Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a, reducing this to too small a value would start killing internal queries. After said commit this is no longer a concern, as this limit no longer affects unpaged queries.
* Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.
With these two changes, we can reduce the data size:
* total_rows: 20000 -> 100
* max_live_rows: 32 -> 8
The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine).
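The data-size reduction can be sanity-checked with a back-of-envelope calculation: to exercise size-based page cuts, the test needs enough data for several pages. A minimal sketch, assuming an illustrative ~50-byte serialized tombstone (not a measured value):

```python
def rows_needed(pages, page_size_bytes, row_size_bytes):
    # Minimum rows so that roughly `pages` size-based page cuts happen.
    return pages * page_size_bytes // row_size_bytes

# With the old hard-coded 1 MiB pages, tiny tombstones mean tens of
# thousands of rows are needed for just a few pages:
old = rows_needed(pages=3, page_size_bytes=1024 * 1024, row_size_bytes=50)

# With the new 1 KiB query-page-size-in-bytes, a few dozen rows suffice,
# which is why total_rows can drop from 20000 to 100:
new = rows_needed(pages=3, page_size_bytes=1024, row_size_bytes=50)
```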
Force-pushed from 7a53267 to 5dc145a
New in v2:
I ran the test 100 times in dev mode, and all passed.
!
I'll kick CI
Please run it 100 times in debug too
Done, passed.
🟢 CI State: SUCCESS
✅ Build
…from Botond Dénes

This test needed a lot of data to ensure multiple pages when doing the read repair. This changes two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time.
* Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a, reducing this to too small a value would start killing internal queries. After said commit this is no longer a concern, as this limit no longer affects unpaged queries.
* Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.

The latter configuration is a new one, added by the first patches of this series. It allows configuring the page size in bytes, after which pages are cut. Previously this was a hard-coded constant: 1MB. This forced any test that wanted to check paging, with pages cut on size, to work with large data sets. This was especially pronounced in the test fixed in this PR, because it works with tombstones, which are tiny, so a lot of them were needed to trigger paging based on size.

With these two changes, we can reduce the data size:
* total_rows: 20000 -> 100
* max_live_rows: 32 -> 8

The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine).

Fixes: #15425
Fixes: #16899
Closes #17529

* github.com:scylladb/scylladb:
  test/topology_custom: test_read_repair.py: reduce run-time
  replica/database: get_query_max_result_size(): use query_page_size_in_bytes
  replica/database: use include page-size in max-result-size
  query-request: max_result_size: add without_page_limit()
  db/config: introduce query_page_size_in_bytes

(cherry picked from commit 616eec2)