
test/topology_custom: test_read_repair.py: reduce run-time #17529

Merged
merged 5 commits into scylladb:master on Feb 27, 2024

Conversation

denesb
Contributor

@denesb denesb commented Feb 26, 2024

This test needed a lot of data to ensure multiple pages when doing the read repair. This changes two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time.

  • Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a, reducing this to too small a value would start killing internal queries. Now, after said commit, this is no longer a concern, as this limit no longer affects unpaged queries.
  • Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.

The latter configuration is new, added by the first patches of this series. It allows configuring the page size in bytes, after which pages are cut. Previously this was a hard-coded constant: 1MB. This forced any test which wanted to check size-based page cuts to work with large datasets. The problem was especially pronounced in the test fixed by this PR, because it works with tombstones, which are tiny, so a lot of them were needed to trigger paging based on size.
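The back-of-envelope arithmetic shows why: with a fixed 1MB page, tens of thousands of tiny tombstones are needed before a single page is cut on size. The per-tombstone size below is an assumed illustrative figure, not a measured one.

```python
# Why a hard-coded 1MB page forces huge datasets: a page is only cut
# once the accumulated result bytes reach the page size, and tombstones
# contribute very few bytes each. TOMBSTONE_BYTES is an assumption.
PAGE_SIZE_OLD = 1024 * 1024  # the old hard-coded constant (1MB)
PAGE_SIZE_NEW = 1024         # 1KB via the new query-page-size-in-bytes
TOMBSTONE_BYTES = 50         # assumed per-tombstone accounting cost

rows_to_cut_a_page_old = PAGE_SIZE_OLD // TOMBSTONE_BYTES  # ~21000 rows
rows_to_cut_a_page_new = PAGE_SIZE_NEW // TOMBSTONE_BYTES  # ~20 rows
```

Under that assumption the old limit needs on the order of 20000 rows per page, which lines up with the total_rows reduction from 20000 to 100.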

With these two changes, we can reduce the data size:

  • total_rows: 20000 -> 100
  • max_live_rows: 32 -> 8

The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine).

Fixes: #15425
Fixes: #16899

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
❌ - Unit Tests

Failed Tests (1/23435):

Build Details:

  • Duration: 2 hr 0 min
  • Builder: spider5.cloudius-systems.com

@avikivity
Member

Please check flakiness (by running multiple times), and debug mode runtime reduction.

Review thread on db/config.cc (outdated, resolved):
@@ -212,7 +212,9 @@ async def test_incremental_read_repair(data_class, workdir, manager):
     seed = int(time.time())
     logger.info(f"random-seed: {seed}")
     random.seed(seed)
-    cmdline = ["--hinted-handoff-enabled", "0", "--query-tombstone-page-limit", "1000"]
+    cmdline = ["--hinted-handoff-enabled", "0",
+               "--query-tombstone-page-limit", "10",
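The hunk above is truncated. Going by the PR description (tombstone page limit lowered to 10, page size lowered to 1KB), the full command line plausibly ends up as the following sketch; the exact spelling of the last entry is an assumption, not a quote from the diff.

```python
# Plausible full cmdline per the PR description; the real diff is
# truncated above, so the page-size entry's exact form is an assumption.
cmdline = [
    "--hinted-handoff-enabled", "0",
    "--query-tombstone-page-limit", "10",   # was 1000
    "--query-page-size-in-bytes", "1024",   # new option, default 1MB
]
```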
Contributor

Nitpick: (feel free to ignore, because it's a pre-existing issue)
It's sad we rely more and more on each test starting a Scylla cluster from scratch. It's not really necessary, you can see in test/cql-pytest/test_tombstone_limit.py an example test that can use any Scylla cluster and temporarily reduce query_tombstone_page_limit to 10. That approach doesn't allow for parallel tests on the same cluster, but it at least allows reusing the cluster - which could shave the remaining 13.5 seconds of this test to much less.

Contributor Author

Yes, I suspect almost half of the test's run-time is startup and teardown. In theory clusters are reused, although I don't know under what conditions.

Contributor Author

Does adding custom command-line parameters make a cluster dirty? @kostja, @kbr-scylla ?

Here, I can use the config table and revert my config changes at the end of the test, if that will make this cluster reusable.

Contributor

Adding a node makes the cluster dirty.

Since test/topology_custom starts with a 0-size cluster, every test has to add at least one node, which means there is no cluster reuse in this suite.

Contributor Author

I know this was discussed many times... sorry for piling on more. I think that instead of different suites for different defaults (which are impossible to guess from the suite name), we should merge these suites as much as possible and have tests communicate what kind of cluster they need, via fixtures. We could have fixtures for the most common setups, as well as a fixture for a "custom" one; tests depending on that fixture would get a 0-node cluster.
We can also use decorators if fixtures prove problematic.
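A minimal sketch of the idea, with all names hypothetical (this is not test.py's actual API): tests declare the cluster shape they need, and a shared pool reuses clusters with matching specs.

```python
# Clusters are pooled by spec, so tests asking for the same shape reuse
# one instead of starting from scratch. Names are illustrative only.
_pool = {}

def get_cluster(nodes=3, cmdline=()):
    key = (nodes, tuple(cmdline))
    if key not in _pool:
        # Stand-in for actually booting a Scylla cluster.
        _pool[key] = {"nodes": nodes, "cmdline": list(cmdline)}
    return _pool[key]
```

A test needing a custom setup would ask for `get_cluster(nodes=0)` and build its own topology, while common shapes get shared.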

Contributor

Agreed.

But merging suites should also come with an improvement to test.py and the cluster pool, so we better utilize available resources on the test machine. Currently it's very stupid: each suite has its constant pool_size, which determines how many clusters run concurrently within that suite, and different suites have no concurrency control w.r.t. each other (they all run at the same time, IIUC).
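An illustrative sketch of the cross-suite coordination being proposed (not test.py's actual code): a single machine-wide budget bounding concurrent clusters, instead of per-suite pool_size constants that don't coordinate with each other.

```python
import asyncio

# One semaphore shared by all suites bounds how many clusters run at
# once on the machine. The budget value is an assumed example.
async def run_all(tests, budget=8):
    sem = asyncio.Semaphore(budget)

    async def run_one(test):
        async with sem:  # wait for a free cluster slot
            await test()

    await asyncio.gather(*(run_one(t) for t in tests))
```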

And after everyone agreed with each other, we all went home and nothing was done.

Regulates the page size in bytes via config, instead of the currently
used hard-coded constant. Allows tests to configure lower limits so they
can work with smaller data-sets when testing paging related
functionality.
Not wired yet.
Returns an instance with the page_limit reset to 0. This converts a
max_results_size which is usable only with the
"page_size_and_safety_limit" feature, to one which can be used before
this feature.
To be used in the next patch.
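A Python rendering of the change this commit message describes; the real code is C++ (the type lives in the query-request layer), and the field names here are assumed for illustration.

```python
from dataclasses import dataclass, replace

# Sketch of max_result_size with the new without_page_limit() helper:
# it returns a copy with page_size reset to 0, usable on paths that
# predate the "page_size_and_safety_limit" feature. Field names are
# illustrative, not the real C++ members.
@dataclass(frozen=True)
class max_result_size:
    soft_limit: int
    hard_limit: int
    page_size: int = 0  # 0 means "no page-size limit"

    def without_page_limit(self):
        return replace(self, page_size=0)
```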
This patch changes get_unlimited_query_max_result_size():
* Also sets the page-size field, not just the soft/hard limits
* Renames it to get_query_max_result_size()
* Updates callers, specifically storage_proxy::get_max_result_size(),
  which now has a much simpler common return path and has to drop the
  page size on one rare return path.

This is a purely mechanical change, no behaviour is changed.
…_bytes

As the page size for user queries, instead of the hard-coded constant
used before. For system queries, we keep using the previous constant.
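The rule this commit describes can be sketched as follows; the constant value comes from the PR discussion, while the function and parameter names are illustrative, not the real C++ identifiers.

```python
# User queries get the configurable page size; system/internal queries
# keep the previous hard-coded 1MB constant.
HARD_CODED_PAGE_SIZE = 1024 * 1024

def effective_page_size(is_system_query: bool,
                        query_page_size_in_bytes: int) -> int:
    if is_system_query:
        return HARD_CODED_PAGE_SIZE
    return query_page_size_in_bytes
```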
This test needed a lot of data to ensure multiple pages when doing the
read repair. This changes two key configuration items, allowing for a
drastic reduction of the data size and consequently a large reduction
in run-time.
* Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a,
  reducing this to too small a value would start killing internal
  queries. Now, after said commit, this is no longer a concern, as this
  limit no longer affects unpaged queries.
* Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.

With these two changes, we can reduce the data size:
* total_rows: 20000 -> 100
* max_live_rows: 32 -> 8

The runtime of the test consequently drops from 62 seconds to 13.5
seconds (dev mode, on my build machine).
@denesb
Contributor Author

denesb commented Feb 27, 2024

New in v2:

  • fix typo: many data -> much data
  • fix database_test (refactor fallout)

I ran the test 100 times in dev mode, and all passed.
Debug mode reduction: 585s -> 52s.

@avikivity
Member

Debug mode reduction: 585s -> 52s.

!

@avikivity
Member

I'll kick CI

@kbr-scylla
Contributor

I ran the test 100 times in dev mode, and all passed.

Please run it 100 times in debug too

@denesb
Contributor Author

denesb commented Feb 27, 2024

I ran the test 100 times in dev mode, and all passed.

Please run it 100 times in debug too

Done, passed.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - Unit Tests
✅ - dtest with topology changes

Build Details:

  • Duration: 2 hr 6 min
  • Builder: spider5.cloudius-systems.com

@scylladb-promoter scylladb-promoter merged commit 616eec2 into scylladb:master Feb 27, 2024
6 checks passed
denesb pushed a commit that referenced this pull request Feb 28, 2024
…from Botond Dénes

This test needed a lot of data to ensure multiple pages when doing the read repair. This changes two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time.
* Changes query-tombstone-page-limit 1000 -> 10. Before f068d1a, reducing this to too small a value would start killing internal queries. Now, after said commit, this is no longer a concern, as this limit no longer affects unpaged queries.
* Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB.

The latter configuration is new, added by the first patches of this series. It allows configuring the page size in bytes, after which pages are cut. Previously this was a hard-coded constant: 1MB. This forced any test which wanted to check size-based page cuts to work with large datasets. The problem was especially pronounced in the test fixed by this PR, because it works with tombstones, which are tiny, so a lot of them were needed to trigger paging based on size.

With these two changes, we can reduce the data size:
* total_rows: 20000 -> 100
* max_live_rows: 32 -> 8

The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine).

Fixes: #15425
Fixes: #16899

Closes #17529

* github.com:scylladb/scylladb:
  test/topology_custom: test_read_repair.py: reduce run-time
  replica/database: get_query_max_result_size(): use query_page_size_in_bytes
  replica/database: use include page-size in max-result-size
  query-request: max_result_size: add without_page_limit()
  db/config: introduce query_page_size_in_bytes

(cherry picked from commit 616eec2)
@yaronkaikov added and then removed the backport/5.2 and backport/5.4 labels on Mar 4, 2024
Labels
symptom/ci stability Issues that failed in ScyllaDB CI - tests and framework
7 participants