
test/topology_custom: test_read_repair.py: reduce run-time #17529

Merged
merged 5 commits into scylladb:master on Feb 27, 2024

Conversation

denesb
Contributor

@denesb denesb commented Feb 26, 2024

This test needed a lot of data to ensure multiple pages when doing the read repair. This changes two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time.

  • Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a, reducing this to too small a value would start killing internal queries. Now, after said commit, this is no longer a concern, as this limit no longer affects unpaged queries.
  • Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.

The latter configuration is new, added by the first patches of this series. It allows configuring the page size in bytes, after which pages are cut. Previously this was a hard-coded constant: 1MB. This forced any test which wanted to check size-based page cuts to work with large datasets. The problem was especially pronounced in the test fixed by this PR, because it works with tombstones, which are tiny, so a lot of them were needed to trigger paging based on size.
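The back-of-envelope arithmetic shows why: with a fixed 1MB page, tens of thousands of tiny tombstones are needed before a single page is cut on size. The per-tombstone size below is an assumed illustrative figure, not a measured one.

```python
# Why a hard-coded 1MB page forces huge datasets: a page is only cut
# once the accumulated result bytes reach the page size, and tombstones
# contribute very few bytes each. TOMBSTONE_BYTES is an assumption.
PAGE_SIZE_OLD = 1024 * 1024  # the old hard-coded constant (1MB)
PAGE_SIZE_NEW = 1024         # 1KB via the new query-page-size-in-bytes
TOMBSTONE_BYTES = 50         # assumed per-tombstone accounting cost

rows_to_cut_a_page_old = PAGE_SIZE_OLD // TOMBSTONE_BYTES  # ~21000 rows
rows_to_cut_a_page_new = PAGE_SIZE_NEW // TOMBSTONE_BYTES  # ~20 rows
```

Under that assumption the old limit needs on the order of 20000 rows per page, which lines up with the total_rows reduction from 20000 to 100.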

With these two changes, we can reduce the data size:

  • total_rows: 20000 -> 100
  • max_live_rows: 32 -> 8

The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine).

Fixes: #15425
Fixes: #16899

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
❌ - Unit Tests

Failed Tests (1/23435):

Build Details:

  • Duration: 2 hr 0 min
  • Builder: spider5.cloudius-systems.com

@avikivity
Member

Please check flakiness (by running multiple times), and debug mode runtime reduction.

Review thread on db/config.cc (outdated, resolved):
@@ -212,7 +212,9 @@ async def test_incremental_read_repair(data_class, workdir, manager):
     seed = int(time.time())
     logger.info(f"random-seed: {seed}")
     random.seed(seed)
-    cmdline = ["--hinted-handoff-enabled", "0", "--query-tombstone-page-limit", "1000"]
+    cmdline = ["--hinted-handoff-enabled", "0",
+               "--query-tombstone-page-limit", "10",
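The hunk above is truncated. Going by the PR description (tombstone page limit lowered to 10, page size lowered to 1KB), the full command line plausibly ends up as the following sketch; the exact spelling of the last entry is an assumption, not a quote from the diff.

```python
# Plausible full cmdline per the PR description; the real diff is
# truncated above, so the page-size entry's exact form is an assumption.
cmdline = [
    "--hinted-handoff-enabled", "0",
    "--query-tombstone-page-limit", "10",   # was 1000
    "--query-page-size-in-bytes", "1024",   # new option, default 1MB
]
```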
Contributor

Nitpick: (feel free to ignore, because it's a pre-existing issue)
It's sad we rely more and more on each test starting a Scylla cluster from scratch. It's not really necessary, you can see in test/cql-pytest/test_tombstone_limit.py an example test that can use any Scylla cluster and temporarily reduce query_tombstone_page_limit to 10. That approach doesn't allow for parallel tests on the same cluster, but it at least allows reusing the cluster - which could shave the remaining 13.5 seconds of this test to much less.

Contributor Author

Yes, I suspect almost half of the test's run-time is startup and teardown. In theory clusters are reused, although I don't know under what conditions.

Contributor Author

Does adding custom command-line parameters make a cluster dirty? @kostja, @kbr-scylla ?

Here, I can use the config table and revert my config changes at the end of the test, if that will make this cluster reusable.

Contributor

Adding a node makes the cluster dirty.

Since test/topology_custom starts with a 0-size cluster, every test has to add at least one node, which means there is no cluster reuse in this suite.

Contributor Author

I know this was discussed many times... sorry for piling on more. I think that instead of different suites for different defaults (which are impossible to guess from the suite name), we should merge these suites as much as possible and have tests communicate what kind of cluster they need, via fixtures. We could have fixtures for the most common setups, as well as a fixture for a "custom" one; tests depending on that fixture would get a 0-node cluster.
We can also use decorators if fixtures prove problematic.
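A minimal sketch of the idea, with all names hypothetical (this is not test.py's actual API): tests declare the cluster shape they need, and a shared pool reuses clusters with matching specs.

```python
# Clusters are pooled by spec, so tests asking for the same shape reuse
# one instead of starting from scratch. Names are illustrative only.
_pool = {}

def get_cluster(nodes=3, cmdline=()):
    key = (nodes, tuple(cmdline))
    if key not in _pool:
        # Stand-in for actually booting a Scylla cluster.
        _pool[key] = {"nodes": nodes, "cmdline": list(cmdline)}
    return _pool[key]
```

A test needing a custom setup would ask for `get_cluster(nodes=0)` and build its own topology, while common shapes get shared.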

Contributor

Agreed.

But merging suites should also come with an improvement to test.py and the cluster pool, so we better utilize available resources on the test machine. Currently it's very stupid: each suite has its constant pool_size, which determines how many clusters run concurrently within that suite, and different suites have no concurrency control w.r.t. each other (they all run at the same time, IIUC).
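An illustrative sketch of the cross-suite coordination being proposed (not test.py's actual code): a single machine-wide budget bounding concurrent clusters, instead of per-suite pool_size constants that don't coordinate with each other.

```python
import asyncio

# One semaphore shared by all suites bounds how many clusters run at
# once on the machine. The budget value is an assumed example.
async def run_all(tests, budget=8):
    sem = asyncio.Semaphore(budget)

    async def run_one(test):
        async with sem:  # wait for a free cluster slot
            await test()

    await asyncio.gather(*(run_one(t) for t in tests))
```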

And after everyone agreed with each other, we all went home and nothing was done.

Regulates the page size in bytes via config, instead of the currently
used hard-coded constant. Allows tests to configure lower limits so they
can work with smaller data-sets when testing paging related
functionality.
Not wired yet.
Returns an instance with the page_limit reset to 0. This converts a
max_results_size which is usable only with the
"page_size_and_safety_limit" feature, to one which can be used before
this feature.
To be used in the next patch.
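A Python rendering of the change this commit message describes; the real code is C++ (the type lives in the query-request layer), and the field names here are assumed for illustration.

```python
from dataclasses import dataclass, replace

# Sketch of max_result_size with the new without_page_limit() helper:
# it returns a copy with page_size reset to 0, usable on paths that
# predate the "page_size_and_safety_limit" feature. Field names are
# illustrative, not the real C++ members.
@dataclass(frozen=True)
class max_result_size:
    soft_limit: int
    hard_limit: int
    page_size: int = 0  # 0 means "no page-size limit"

    def without_page_limit(self):
        return replace(self, page_size=0)
```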
This patch changes get_unlimited_query_max_result_size():
* Also sets the page-size field, not just the soft/hard limits
* Renames it to get_query_max_result_size()
* Updates callers, specifically storage_proxy::get_max_result_size(),
  which now has a much simpler common return path and has to drop the
  page size on one rare return path.

This is a purely mechanical change, no behaviour is changed.
…_bytes

As the page size for user queries, instead of the hard-coded constant
used before. For system queries, we keep using the previous constant.
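The rule this commit describes can be sketched as follows; the constant value comes from the PR discussion, while the function and parameter names are illustrative, not the real C++ identifiers.

```python
# User queries get the configurable page size; system/internal queries
# keep the previous hard-coded 1MB constant.
HARD_CODED_PAGE_SIZE = 1024 * 1024

def effective_page_size(is_system_query: bool,
                        query_page_size_in_bytes: int) -> int:
    if is_system_query:
        return HARD_CODED_PAGE_SIZE
    return query_page_size_in_bytes
```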
This test needed a lot of data to ensure multiple pages when doing the
read repair. This changes two key configuration items, allowing for a
drastic reduction of the data size and consequently a large reduction
in run-time.
* Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a,
  reducing this to too small a value would start killing internal
  queries. Now, after said commit, this is no longer a concern, as this
  limit no longer affects unpaged queries.
* Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.

With these two changes, we can reduce the data size:
* total_rows: 20000 -> 100
* max_live_rows: 32 -> 8

The runtime of the test consequently drops from 62 seconds to 13.5
seconds (dev mode, on my build machine).
@denesb
Contributor Author

denesb commented Feb 27, 2024

New in v2:

  • fix typo: many data -> much data
  • fix database_test (refactor fallout)

I ran the test 100 times in dev mode, and all passed.
Debug mode reduction: 585s -> 52s.

@avikivity
Member

Debug mode reduction: 585s -> 52s.

!

@avikivity
Member

I'll kick CI

@kbr-scylla
Contributor

I ran the test 100 times in dev mode, and all passed.

Please run it 100 times in debug too

@denesb
Contributor Author

denesb commented Feb 27, 2024

I ran the test 100 times in dev mode, and all passed.

Please run it 100 times in debug too

Done, passed.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - Unit Tests
✅ - dtest with topology changes

Build Details:

  • Duration: 2 hr 6 min
  • Builder: spider5.cloudius-systems.com

@scylladb-promoter scylladb-promoter merged commit 616eec2 into scylladb:master Feb 27, 2024
6 checks passed
denesb pushed a commit that referenced this pull request Feb 28, 2024
…from Botond Dénes

This test needed a lot of data to ensure multiple pages when doing the read repair. This changes two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time.
* Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a, reducing this to too small a value would start killing internal queries. Now, after said commit, this is no longer a concern, as this limit no longer affects unpaged queries.
* Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.

The latter configuration is new, added by the first patches of this series. It allows configuring the page size in bytes, after which pages are cut. Previously this was a hard-coded constant: 1MB. This forced any test which wanted to check size-based page cuts to work with large datasets. The problem was especially pronounced in the test fixed by this PR, because it works with tombstones, which are tiny, so a lot of them were needed to trigger paging based on size.

With these two changes, we can reduce the data size:
* total_rows: 20000 -> 100
* max_live_rows: 32 -> 8

The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine).

Fixes: #15425
Fixes: #16899

Closes #17529

* github.com:scylladb/scylladb:
  test/topology_custom: test_read_repair.py: reduce run-time
  replica/database: get_query_max_result_size(): use query_page_size_in_bytes
  replica/database: use include page-size in max-result-size
  query-request: max_result_size: add without_page_limit()
  db/config: introduce query_page_size_in_bytes

(cherry picked from commit 616eec2)
@yaronkaikov added and then removed the backport/5.2 and backport/5.4 labels on Mar 4, 2024
Labels
symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework
7 participants