
raft: large delays between io_fiber iterations in schema change test #15622

Closed
kbr-scylla opened this issue Oct 3, 2023 · 54 comments

@kbr-scylla
Contributor

When investigating #15312 I noticed that sometimes there would be large (~3 seconds) delays between io_fiber iterations, judging by poll_output() log messages.

Attaching a fragment of a log file from one run of the test:
example.txt

Full logs can be found here:

[2023-09-29T19:54:52.728Z] +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2023-09-29T19:54:52.728Z] |                                                      Collected logs by test-id: 13f60000-7013-4d3c-9973-e8ecf7fb94b8                                                      |
[2023-09-29T19:54:52.728Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
[2023-09-29T19:54:52.729Z] | Cluster set               | Link                                                                                                                                          |
[2023-09-29T19:54:52.729Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
[2023-09-29T19:54:52.729Z] | db-cluster                | https://cloudius-jenkins-test.s3.amazonaws.com/13f60000-7013-4d3c-9973-e8ecf7fb94b8/20230929_194957/db-cluster-13f60000.tar.gz                |
[2023-09-29T19:54:52.729Z] | sct-runner-events         | https://cloudius-jenkins-test.s3.amazonaws.com/13f60000-7013-4d3c-9973-e8ecf7fb94b8/20230929_194957/sct-runner-events-13f60000.tar.gz         |
[2023-09-29T19:54:52.729Z] | sct-runner-python-log     | https://cloudius-jenkins-test.s3.amazonaws.com/13f60000-7013-4d3c-9973-e8ecf7fb94b8/20230929_194957/sct-13f60000.log.tar.gz                   |
[2023-09-29T19:54:52.729Z] | loader-set                | https://cloudius-jenkins-test.s3.amazonaws.com/13f60000-7013-4d3c-9973-e8ecf7fb94b8/20230929_194957/loader-set-13f60000.tar.gz                |
[2023-09-29T19:54:52.729Z] | monitor-set               | https://cloudius-jenkins-test.s3.amazonaws.com/13f60000-7013-4d3c-9973-e8ecf7fb94b8/20230929_194957/monitor-set-13f60000.tar.gz               |
[2023-09-29T19:54:52.729Z] | parallel-timelines-report | https://cloudius-jenkins-test.s3.amazonaws.com/13f60000-7013-4d3c-9973-e8ecf7fb94b8/20230929_194957/parallel-timelines-report-13f60000.tar.gz |
[2023-09-29T19:54:52.729Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

In the fragment we can see the delay between one poll_output:

2023-09-29T19:31:10+00:00 longevity-ParallelAddDropColumnMonk-db-node-13f60000-1    !DEBUG | scylla[5307]:  [shard 0:stre] raft - fsm::poll_output() 9a1e4e90-5199-4e4e-8acf-19b787628dd7 stable index: 7110 last index: 7111

and the next one:

2023-09-29T19:31:13+00:00 longevity-ParallelAddDropColumnMonk-db-node-13f60000-1    !DEBUG | scylla[5307]:  [shard 0:stre] raft - fsm::poll_output() 9a1e4e90-5199-4e4e-8acf-19b787628dd7 stable index: 7111 last index: 7111

and that is even though there were messages to send in the meantime, e.g.

2023-09-29T19:31:10+00:00 longevity-ParallelAddDropColumnMonk-db-node-13f60000-1    !DEBUG | scylla[5307]:  [shard 0:stre] raft - tick[9a1e4e90-5199-4e4e-8acf-19b787628dd7]: replicate to 5cb9fe3b-f59c-4724-b987-57fd45c89017 because match=7110 < last_idx=7111 || follower commit_idx=7108 < commit_idx=7110
2023-09-29T19:31:10+00:00 longevity-ParallelAddDropColumnMonk-db-node-13f60000-1    !DEBUG | scylla[5307]:  [shard 0:stre] raft - replicate_to[9a1e4e90-5199-4e4e-8acf-19b787628dd7->5cb9fe3b-f59c-4724-b987-57fd45c89017]: called next=7112 match=7110
2023-09-29T19:31:10+00:00 longevity-ParallelAddDropColumnMonk-db-node-13f60000-1    !DEBUG | scylla[5307]:  [shard 0:stre] raft - replicate_to[9a1e4e90-5199-4e4e-8acf-19b787628dd7->5cb9fe3b-f59c-4724-b987-57fd45c89017]: next past last next=7112 stable=7111, empty=true
2023-09-29T19:31:10+00:00 longevity-ParallelAddDropColumnMonk-db-node-13f60000-1    !DEBUG | scylla[5307]:  [shard 0:stre] raft - replicate_to[9a1e4e90-5199-4e4e-8acf-19b787628dd7->5cb9fe3b-f59c-4724-b987-57fd45c89017]: send empty
2023-09-29T19:31:10+00:00 longevity-ParallelAddDropColumnMonk-db-node-13f60000-1    !DEBUG | scylla[5307]:  [shard 0:stre] raft - replicate_to[9a1e4e90-5199-4e4e-8acf-19b787628dd7->5cb9fe3b-f59c-4724-b987-57fd45c89017]: next past last next=7112 stable=7111, empty=false

Each message should trigger the next poll_output().

So it looks like io_fiber is stuck somewhere, perhaps when trying to persist something. Could there be large disk delays here? But the metrics relevant to the schema commitlog did not suggest large delays when persisting data to schema-commitlog tables (such as the raft table).

Needs further investigation.

@kbr-scylla
Contributor Author

Got logs from another run with more logging in io_fiber. (Ran on this commit: kbr-scylla@7c54a1f, job: https://jenkins.scylladb.com/job/scylla-staging/job/KamilBraun/job/longevity-schema-changes-3h-test-artsiom_mishuta-clone/7/consoleText)

2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber storing log entries 1

...

2023-10-03T16:57:07+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber stored log entries

so it took 6 seconds to do this...

                logger.trace("[{}] io_fiber storing log entries {}", _id, entries.size());
                co_await _persistence->store_log_entries(entries);
                logger.trace("[{}] io_fiber stored log entries", _id);

It's weird. The variability seems very high -- usually this storing step happens in the same second (unfortunately the logs don't provide millisecond resolution for some reason), but this one time it suddenly took a whole 6 seconds. Even if the disk is struggling, what could cause such a big spike? @xemul is this even a plausible thing to happen?

(or maybe we have a bug in raft_sys_table_storage?)

@kbr-scylla
Contributor Author

Attaching a log fragment from that node (which is also the leader, btw):
long-poll.txt

Full logs can be found here:

[2023-10-03T19:07:58.537Z] +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2023-10-03T19:07:58.537Z] |                                                      Collected logs by test-id: 5dbd49f9-7ca7-44ee-ac98-76d224e59b03                                                      |
[2023-10-03T19:07:58.537Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
[2023-10-03T19:07:58.537Z] | Cluster set               | Link                                                                                                                                          |
[2023-10-03T19:07:58.538Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
[2023-10-03T19:07:58.538Z] | db-cluster                | https://cloudius-jenkins-test.s3.amazonaws.com/5dbd49f9-7ca7-44ee-ac98-76d224e59b03/20231003_190157/db-cluster-5dbd49f9.tar.gz                |
[2023-10-03T19:07:58.538Z] | sct-runner-events         | https://cloudius-jenkins-test.s3.amazonaws.com/5dbd49f9-7ca7-44ee-ac98-76d224e59b03/20231003_190157/sct-runner-events-5dbd49f9.tar.gz         |
[2023-10-03T19:07:58.538Z] | sct-runner-python-log     | https://cloudius-jenkins-test.s3.amazonaws.com/5dbd49f9-7ca7-44ee-ac98-76d224e59b03/20231003_190157/sct-5dbd49f9.log.tar.gz                   |
[2023-10-03T19:07:58.538Z] | loader-set                | https://cloudius-jenkins-test.s3.amazonaws.com/5dbd49f9-7ca7-44ee-ac98-76d224e59b03/20231003_190157/loader-set-5dbd49f9.tar.gz                |
[2023-10-03T19:07:58.538Z] | monitor-set               | https://cloudius-jenkins-test.s3.amazonaws.com/5dbd49f9-7ca7-44ee-ac98-76d224e59b03/20231003_190157/monitor-set-5dbd49f9.tar.gz               |
[2023-10-03T19:07:58.538Z] | parallel-timelines-report | https://cloudius-jenkins-test.s3.amazonaws.com/5dbd49f9-7ca7-44ee-ac98-76d224e59b03/20231003_190157/parallel-timelines-report-5dbd49f9.tar.gz |
[2023-10-03T19:07:58.538Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

@kbr-scylla
Contributor Author

A concurrent group0_state_machine::apply() was running during this time, which is also writing to schema commitlog tables. Maybe these writes somehow blocked the io_fiber write?

2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] group0_raft_sm - apply() starting state ID: deedf90c-620d-11ee-784d-a9da6da00c99
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] group0_raft_sm - cmd: prev_state_id: deedf90c-620d-11ee-784d-a9da6da00c99, new_state_id: df208912-620d-11ee-b19c-f02d743a185f, creator_addr: 10.12.0.49, creator_id: 0ca3ec7d-ac8c-49e6-b745-8351df8b58f2
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b63c49125c8d11ee8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1696352221261237, deletion_time=1696352221}}{row {position: clustered, ckp{0010df208912620d11eeb19cf02d743a185f}, 0} tombstone {row_tombstone: none} marker {row_marker: 1696352221261237 0 0}, column description atomic_cell{CQL DDL statement: "ALTER TABLE keyspace1.standard1 DROP ( ZICXZ3H587 );",ts=1696352221261237,expiry=-1,ttl=0}}}
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] migration_manager - Applying schema mutations from 10.12.0.49:0
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1  !WARNING | scylla[5319]:  [shard 0:stre] querier - Read 17 live rows and 1887 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
...
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1  !WARNING | scylla[5319]:  [shard 0:stre] querier - Read 17 live rows and 1887 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
...
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1  !WARNING | scylla[5319]:  [shard 0:stre] querier - Read 16 live rows and 1888 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1  !WARNING | scylla[5319]:  [shard 0:stre] querier - Read 16 live rows and 1888 tombstones for system_schema.columns partition key "keyspace1" {{-707437814682527445, 00096b6579737061636531}} (see tombstone_warn_threshold)
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1     !INFO | scylla[5319]:  [shard 0:stre] schema_tables - Altering keyspace1.standard1 id=8a70e5f0-6204-11ee-9c8f-fb131c02fd84 version=df207fd0-620d-11ee-85be-d8f25765e2fd
...
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1     !INFO | scylla[5319]:  [shard 0:stre] query_processor - Column definitions for keyspace1.standard1 changed, invalidating related prepared statements
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1     !INFO | scylla[5319]: Reactor stalled for 32 ms on shard 2. Backtrace: 0x5a2b83a 0x5a2ac45 0x5a2bfef 0x3dbaf 0x2e7a842 0x2e7a67e 0x2963c59 0x2963bec 0x2963cc5 0x2963bec 0x3772707 0x368beb2 0x1f89aee 0x1f7dcfc 0x1f7c556 0x1f81033 0x3707735 0x37074ab 0x13dabf9 0x5a3d15f 0x5a3e437 0x5a61e83 0x5a0cd9a 0x8c946 0x11286f
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1     !INFO | scylla[5319]:  [shard 1:stre] query_processor - Column definitions for keyspace1.standard1 changed, invalidating related prepared statements
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1     !INFO | scylla[5319]:  [shard 3:stre] query_processor - Column definitions for keyspace1.standard1 changed, invalidating related prepared statements
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1     !INFO | scylla[5319]:  [shard 2:stre] query_processor - Column definitions for keyspace1.standard1 changed, invalidating related prepared statements
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] migration_manager - Gossiping my schema version df208912-620d-11ee-b19c-f02d743a185f
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1     !INFO | scylla[5319]:  [shard 0:stre] schema_tables - Schema version changed to df208912-620d-11ee-b19c-f02d743a185f
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] migration_manager - Not pulling schema because schema pulls were disabled due to Raft.
2023-10-03T16:57:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] group0_raft_sm - apply() finished, starting state ID: deedf90c-620d-11ee-784d-a9da6da00c99, resulting state ID: df208912-620d-11ee-b19c-f02d743a185f

But that apply() finished within 1 second. The next apply() started much later:

2023-10-03T16:57:08+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] applying entries, first idx 3682 last idx 3682
2023-10-03T16:57:08+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] group0_raft_sm - apply() is called with 1 commands

so it doesn't seem plausible that this apply() somehow blocked io_fiber...

And since it finished so quickly during this time, maybe there indeed is a problem with our persistence implementation?

@kbr-scylla
Contributor Author

Ugh, looks like my logs are being eaten:

2023-10-03T15:57:48+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber storing log entries 1
2023-10-03T15:57:48+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber stored log entries
2023-10-03T15:57:48+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber sending 2 messages
2023-10-03T15:57:48+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber sent 2 messages
2023-10-03T15:57:48+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber finished iteration 6291

...

2023-10-03T15:57:51+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber stored log entries
2023-10-03T15:57:51+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber sending 2 messages
2023-10-03T15:57:51+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber sent 2 messages
2023-10-03T15:57:51+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber finished iteration 7109
2023-10-03T15:57:51+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber starting iteration 7110

There are no logs for iterations 6292 - 7108 and part of 7109.

The timestamps suggest that the logs are continuous though. There is no time jump in timestamps.

Maybe it's a problem with SCT's log collector? (I don't know how this works -- whether it assigns timestamps itself or takes them from the process...) cc @temichus: could it be that if the log collector (or something else in SCT) hangs, we skip a bunch of logs from a node?

@kbr-scylla
Contributor Author

Winner of this round:

2023-10-03T18:46:14+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber storing log entries 1
...
2023-10-03T18:46:39+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber stored log entries

25 seconds.

(I verified that this is from the same io_fiber iteration, not from logs being eaten.)

@kbr-scylla
Contributor Author

Btw, a side effect of having a long io_fiber iteration:

2023-10-03T18:46:39+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber sending 486 messages

Those 486 messages are all identical: empty append_entries with exactly the same parameters. It's a huge waste. (cc @gleb-cloudius)
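
For illustration only, here is a minimal sketch of the kind of coalescing that could avoid this; the message type below is a hypothetical, simplified stand-in, not the actual raft RPC types:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for an append_entries RPC message.
struct append_entries_msg {
    uint64_t term, prev_log_idx, prev_log_term, commit_idx;
    size_t num_entries; // 0 means an empty (heartbeat-like) append_entries
    bool operator==(const append_entries_msg&) const = default;
};

// Collapse runs of identical empty messages accumulated during one long
// io_fiber iteration, keeping only one message per run.
void coalesce_empty_duplicates(std::vector<append_entries_msg>& out) {
    out.erase(std::unique(out.begin(), out.end(),
        [] (const append_entries_msg& a, const append_entries_msg& b) {
            return a.num_entries == 0 && b.num_entries == 0 && a == b;
        }), out.end());
}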

@kbr-scylla
Contributor Author

The entry that took 25 seconds to persist on node 1 had this schema change:

ALTER TABLE keyspace1.standard1 ADD ( DBYH6M80SQ set<frozen<set<frozen<map<frozen<map<frozen<list<int>>,frozen<map<varchar,text>>>>,frozen<list<frozen<list<time>>>>>>>>> )

and index 7518.

2023-10-03T18:46:14+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - advance_stable_idx[0ca3ec7d-ac8c-49e6-b745-8351df8b58f2]: prev_stable_idx=7517, idx=7518
2023-10-03T18:46:14+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber storing log entries 1

On other nodes, it looks like the entry was persisted instantaneously.

2023-10-03T18:46:39+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-2    !DEBUG | scylla[5332]:  [shard 0:stre] raft - advance_stable_idx[de0ffbfd-f074-4822-bae8-ec5ff1fd14a9]: prev_stable_idx=7517, idx=7518
2023-10-03T18:46:39+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-2    !DEBUG | scylla[5332]:  [shard 0:stre] raft - [de0ffbfd-f074-4822-bae8-ec5ff1fd14a9] io_fiber storing log entries 1
2023-10-03T18:46:39+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-2    !DEBUG | scylla[5332]:  [shard 0:stre] raft - [de0ffbfd-f074-4822-bae8-ec5ff1fd14a9] io_fiber stored log entries
2023-10-03T18:46:39+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-3    !DEBUG | scylla[5368]:  [shard 0:stre] raft - advance_stable_idx[85a9e60b-3c9a-44dc-8d18-ab3c73486b9f]: prev_stable_idx=7517, idx=7518
2023-10-03T18:46:39+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-3    !DEBUG | scylla[5368]:  [shard 0:stre] raft - [85a9e60b-3c9a-44dc-8d18-ab3c73486b9f] io_fiber storing log entries 1
2023-10-03T18:46:39+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-3    !DEBUG | scylla[5368]:  [shard 0:stre] raft - [85a9e60b-3c9a-44dc-8d18-ab3c73486b9f] io_fiber stored log entries

So it doesn't seem that the command itself was problematic somehow (with size or whatever).

@kbr-scylla
Contributor Author

We can see a clear spike in commitlog metrics at that moment:
Screenshot from 2023-10-04 12-30-01

however, those are metrics for the regular commitlog.

For schema commitlog the corresponding metrics are constantly 0.

@kbr-scylla
Contributor Author

And the IO queue metrics for the main scheduling group (which is, as we found, where the schema commitlog runs) also look fine:
Screenshot from 2023-10-04 12-32-19

I cannot find any evidence that the write is waiting for disk / commitlog.

Maybe it's a problem with batch statement execution? Maybe with parsing the CQL? I would expect reactor stalls in that case, but there were none during this record 25-second wait...

@xemul
Contributor

xemul commented Oct 4, 2023

@xemul is this even a plausible thing to happen?

It depends :) I saw minutes-long delays on a RAID5 setup -- the driver was rebalancing internally and blocked all queued requests for that period. But that was not some "regular situation", of course.

By and large, the in-disk delay metrics should show whether it's indeed kernel+disk or something else.
Also, IO latency is quite sensitive to reactor poll rates, so checking the polls metrics can also be valuable.
I once tried to make this dependency on poll rate more evident with scylladb/seastar#1492, but it's not merged yet.

@kbr-scylla
Contributor Author

Something that I hadn't paid attention to earlier is that performance seems to gradually degrade over time.
It could in theory be the loader, I don't know.

Screenshot from 2023-10-04 12-43-13

@avikivity
Member

It's normal for an empty cluster to degrade quickly as it moves from in-memory to mixed memory/sstable and compaction bandwidth increases. Still, it could be something else.

@kbr-scylla
Contributor Author

rate(scylla_reactor_polls{instance="10.12.0.49", shard="0", job="scylla"}[1m])

Screenshot from 2023-10-04 13-06-39

That jump at 18:48 seems to correlate with compactions starting.

@xemul
Contributor

xemul commented Oct 4, 2023

The rate is too good. It polls much more frequently than once every task-quota-ms (0.5ms).

@kbr-scylla
Contributor Author

That's how the rates look throughout the test on all nodes/shards:
Screenshot from 2023-10-04 13-19-37

@kbr-scylla
Contributor Author

For a while I suspected this could be a problem with the batch statement implementation, which is used when persisting Raft log entries.

However, the times for persisting the commit index are also reaching great heights, and that path does not use a batch statement:

2023-10-03T18:23:49+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber got 1 committed entries storing commit idx 6697

...

2023-10-03T18:23:59+00:00 longevity-ParallelAddDropColumnMonk-db-node-5dbd49f9-1    !DEBUG | scylla[5319]:  [shard 0:stre] raft - [0ca3ec7d-ac8c-49e6-b745-8351df8b58f2] io_fiber stored commit idx

10 seconds.

This function has a very simple implementation:

future<> raft_sys_table_storage::store_commit_idx(raft::index_t idx) {
    return execute_with_linearization_point([this, idx] {
        static const auto store_cql = format("INSERT INTO system.{} (group_id, commit_idx) VALUES (?, ?)",
            db::system_keyspace::RAFT);
        return _qp.execute_internal(
            store_cql,
            {_group_id.id, int64_t(idx)},
            cql3::query_processor::cache_internal::yes).discard_result();
    });
}

@avikivity
Member

Make sure the main group has enough shares.

@kbr-scylla
Contributor Author

BTW, I have no clue what the point of this is:

future<> raft_sys_table_storage::execute_with_linearization_point(std::function<future<>()> f) {
    promise<> task_promise;
    auto pending_fut = std::exchange(_pending_op_fut, task_promise.get_future());
    co_await std::move(pending_fut);
    try {
        co_await f();
        task_promise.set_value();
    } catch (...) {
        task_promise.set_exception(std::current_exception());
        throw;
    }
}

Looks like a roundabout way to implement... a mutex?
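
For comparison, a minimal sketch of what this effectively amounts to, expressed with a seastar semaphore holding a single unit (illustrative names, not the actual Scylla code):

#include <functional>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

class storage_sketch {
    seastar::semaphore _op_sem{1}; // one unit => operations run one at a time, in submission order
public:
    // Each call waits for the previous operation to finish before running f,
    // which is the same serialization the promise/future chaining above provides.
    seastar::future<> execute_serially(std::function<seastar::future<>()> f) {
        return seastar::with_semaphore(_op_sem, 1, std::move(f));
    }
};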

Also, we never call raft_sys_table_storage functions concurrently (calls are sequential from io_fiber), and if we did because of concurrent Raft servers in the future, we certainly wouldn't want different servers to block each other from persisting stuff...

(cc @kostja)

@avikivity
Member

@xemul it shows very high starvation time, why?

@avikivity
Member

Wouldn't each group have its own io fiber?

@kbr-scylla
Contributor Author

Make sure the main group has enough shares.

200

In comparison, the commitlog group (where the regular commitlog runs) has a constant 1000,

and compaction looks like this:
Screenshot from 2023-10-04 13-49-45

@xemul
Contributor

xemul commented Oct 4, 2023

@xemul it shows very high starvation time, why?

@avikivity, it can be due to scylladb/seastar#1790 (comment)

@kbr-scylla
Contributor Author

@avikivity could it be an interaction of raft_sys_table_storage statements with cassandra-stress queries through the execution stage?

static thread_local inheriting_concrete_execution_stage<
        future<::shared_ptr<cql_transport::messages::result_message>>,
        const modification_statement*,
        query_processor&,
        service::query_state&,
        const query_options&> modify_stage{"cql3_modification", modification_statement_executor::get()};

The inheriting_concrete_execution_stage class mentions scheduling groups in its comments.

I'm not sure which scheduling group our raft_sys_table_storage statements are running on. IIUC, when we reach the schema commitlog level, we switch to the main scheduling group for that part, but I don't know what happens at the CQL statement level.
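
One way to answer that empirically would be to log the current scheduling group from the statement-execution path; a rough sketch (the helper and where to call it from are hypothetical):

#include <seastar/core/scheduling.hh>
#include <seastar/util/log.hh>

static seastar::logger sched_debug_log("sched-debug");

// Hypothetical debugging helper: call it from the raft_sys_table_storage /
// execution-stage path to see which scheduling group the statement runs in.
inline void log_current_scheduling_group(const char* where) {
    sched_debug_log.debug("{}: running in scheduling group '{}'",
                          where, seastar::current_scheduling_group().name());
}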

@kbr-scylla
Contributor Author

Ah, of course.

It's not the commitlog writes that take so long...
It's this:

    auto f = co_await coroutine::as_future(this->apply_in_memory(m, s, std::move(h), timeout));
2023-10-04T22:18:02+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 10605ms, to commitlog 20ms

Why didn't I think of this before? The write to the memtable is where we're waiting!

I'm guessing that this is caused by memory pressure: we're not flushing memtables quickly enough because of the cassandra-stress workload, so writes to system/schema tables have to wait until user data gets flushed to disk.

Maybe these two need to be separated as well?

@kbr-scylla
Contributor Author

@avikivity @tgrabiec @denesb @xemul

do you think we should separate the system/schema write path from the user data write path even more, so that when memory pressure is high and user table memtables are being flushed often, this doesn't block system/schema writes?

Something like using a separate (and small) memory pool for system etc.?

(This is based on my guess that we're blocking on apply_in_memory because of memory pressure, but I could again be completely off the mark, like with the commitlog hypothesis earlier... But what else could it be?)

@denesb
Contributor

denesb commented Oct 5, 2023

AFAIK we already have separate system and user dirty memory managers, with separate memory pools.

@kbr-scylla
Contributor Author

Ah, right...

Actually, now I noticed something even more interesting, I think.
So I have this code in database::do_apply:

    auto end1 = db_clock::now();
    auto dur1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start);

    if (dur1.count() >= log_threshold) {
        auto s = cf.schema();
        dblog.warn("write to commitlog cf {}.{} took {}", s->ks_name(), s->cf_name(), dur1);
    }

    auto f = co_await coroutine::as_future(this->apply_in_memory(m, s, std::move(h), timeout));

    auto end2 = db_clock::now();
    auto dur2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start);

    if (dur2.count() >= log_threshold) {
        auto s = cf.schema();
        dblog.warn("write to memory cf {}.{} took {}, to commitlog {}", s->ks_name(), s->cf_name(), dur2, dur1);
    }

(start = db_clock::now() is defined at the beginning of do_apply)

You can find logs from my latest run here:

[2023-10-05T01:06:20.129Z] +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2023-10-05T01:06:20.129Z] |                                                      Collected logs by test-id: 9b3fe25e-2a9c-4190-8f90-ba18dce01631                                                      |
[2023-10-05T01:06:20.129Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
[2023-10-05T01:06:20.129Z] | Cluster set               | Link                                                                                                                                          |
[2023-10-05T01:06:20.129Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
[2023-10-05T01:06:20.129Z] | db-cluster                | https://cloudius-jenkins-test.s3.amazonaws.com/9b3fe25e-2a9c-4190-8f90-ba18dce01631/20231005_010027/db-cluster-9b3fe25e.tar.gz                |
[2023-10-05T01:06:20.129Z] | sct-runner-events         | https://cloudius-jenkins-test.s3.amazonaws.com/9b3fe25e-2a9c-4190-8f90-ba18dce01631/20231005_010027/sct-runner-events-9b3fe25e.tar.gz         |
[2023-10-05T01:06:20.129Z] | sct-runner-python-log     | https://cloudius-jenkins-test.s3.amazonaws.com/9b3fe25e-2a9c-4190-8f90-ba18dce01631/20231005_010027/sct-9b3fe25e.log.tar.gz                   |
[2023-10-05T01:06:20.129Z] | loader-set                | https://cloudius-jenkins-test.s3.amazonaws.com/9b3fe25e-2a9c-4190-8f90-ba18dce01631/20231005_010027/loader-set-9b3fe25e.tar.gz                |
[2023-10-05T01:06:20.129Z] | monitor-set               | https://cloudius-jenkins-test.s3.amazonaws.com/9b3fe25e-2a9c-4190-8f90-ba18dce01631/20231005_010027/monitor-set-9b3fe25e.tar.gz               |
[2023-10-05T01:06:20.129Z] | parallel-timelines-report | https://cloudius-jenkins-test.s3.amazonaws.com/9b3fe25e-2a9c-4190-8f90-ba18dce01631/20231005_010027/parallel-timelines-report-9b3fe25e.tar.gz |
[2023-10-05T01:06:20.129Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

What I expected is that this log message for exceeding log_threshold (which I hardcoded to 3000ms) would be appearing for different tables.

But it's appearing only for system.raft.

$ awk '/write to memory cf/{print}' longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1/messages.log
2023-10-04T21:38:05+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 3295ms, to commitlog 2ms
2023-10-04T21:54:23+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 6694ms, to commitlog 4ms
2023-10-04T21:57:26+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 8934ms, to commitlog 9ms
2023-10-04T22:15:15+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 6657ms, to commitlog 1ms
2023-10-04T22:18:02+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 10605ms, to commitlog 20ms
2023-10-04T22:31:17+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 8174ms, to commitlog 18ms
2023-10-04T22:31:29+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 11259ms, to commitlog 1ms
2023-10-04T22:34:27+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 10750ms, to commitlog 2ms
2023-10-04T22:35:19+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 3284ms, to commitlog 8ms
2023-10-04T22:47:01+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 14376ms, to commitlog 1ms
2023-10-04T22:49:54+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 16258ms, to commitlog 4ms
2023-10-04T22:53:11+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 14108ms, to commitlog 4ms
2023-10-04T22:55:37+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 3518ms, to commitlog 0ms
2023-10-04T23:05:52+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 7075ms, to commitlog 3ms
2023-10-04T23:06:40+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 12007ms, to commitlog 13ms
2023-10-04T23:08:53+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 3701ms, to commitlog 1ms
2023-10-04T23:10:59+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 3066ms, to commitlog 6ms
2023-10-04T23:11:50+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 4247ms, to commitlog 7ms
2023-10-04T23:11:59+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 9070ms, to commitlog 3ms
2023-10-04T23:14:13+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 17166ms, to commitlog 0ms
2023-10-04T23:24:07+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 5963ms, to commitlog 55ms
2023-10-04T23:25:20+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 7551ms, to commitlog 5ms
2023-10-04T23:26:41+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 15007ms, to commitlog 7ms
2023-10-04T23:26:50+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 7269ms, to commitlog 16ms
2023-10-04T23:27:35+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 6292ms, to commitlog 1ms
2023-10-04T23:29:42+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 7270ms, to commitlog 6ms
2023-10-04T23:32:23+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 8811ms, to commitlog 0ms
2023-10-04T23:40:58+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 8004ms, to commitlog 2ms
2023-10-04T23:43:46+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 8162ms, to commitlog 25ms
2023-10-04T23:45:16+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 14342ms, to commitlog 4ms
2023-10-04T23:49:23+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 11661ms, to commitlog 13ms
2023-10-04T23:49:30+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 5254ms, to commitlog 3ms
2023-10-04T23:58:27+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 9822ms, to commitlog 6ms
2023-10-05T00:02:36+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 19594ms, to commitlog 19ms
2023-10-05T00:07:03+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 15128ms, to commitlog 14ms
2023-10-05T00:17:12+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 14105ms, to commitlog 3ms
2023-10-05T00:21:04+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 10106ms, to commitlog 14ms
2023-10-05T00:23:51+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 8808ms, to commitlog 11ms
2023-10-05T00:25:43+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 20750ms, to commitlog 1ms
2023-10-05T00:34:06+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 11066ms, to commitlog 31ms
2023-10-05T00:34:45+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 3958ms, to commitlog 1ms
2023-10-05T00:37:37+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 17178ms, to commitlog 2ms
2023-10-05T00:38:34+00:00 longevity-ParallelAddDropColumnMonk-db-node-9b3fe25e-1  !WARNING | scylla[5307]:  [shard 0:stre] database - write to memory cf system.raft took 4500ms, to commitlog 8ms

Same for other nodes.

So... maybe the separate memory pool is actually causing the problem here?

@denesb
Contributor

denesb commented Oct 5, 2023

(But I think the system one is considerably smaller, which might be a problem with all these new system tables, like system.tablets.)

@kbr-scylla
Contributor Author

What I expected is that this log message for exceeding log_threshold (which I hardcoded to 3000ms) would be appearing for different tables.

But it's appearing only for system.raft.

Actually, this doesn't prove anything: the write timeout for user writes is set to 2000ms, so any user write that exceeded this threshold would time out and we wouldn't see this message.

@kbr-scylla
Contributor Author

@kbr-scylla do you have a local reproducer?

No. Every time I make a change, I'm waiting 1 hour to build an AMI and then 3 hours for the SCT test to finish. And from what I see, the high latencies start about ~1 hour into the test.

@tgrabiec
Contributor

tgrabiec commented Oct 6, 2023

Maybe cache update after memtable flush takes a long time due to some priority inversion which causes real dirty memory to linger and block writes.
Maybe there are too many background flushes (_background_work_flush_serializer). System tables have a smaller memory pool, and schema changes touch multiple tables, so they flush more frequently. SSTable sealing latency may dominate and cause this queue to grow faster than for user tables.

A flush every 10MB should be invisible. And we're running the flush concurrently with filling another memtable.

It's more frequent than that: each schema change touches ~10 tables. We moved to the schema commitlog partly to avoid the latency associated with flushing memtables, which we saw taking 30s in some setups. If you accumulate several such commands, we may hit _max_background_work.

Maybe run_when_memory_available() is confused (though I see no reason for that to happen).

@kbr-scylla do you have a local reproducer?

@tgrabiec
Contributor

tgrabiec commented Oct 6, 2023

@kbr-scylla Did you check the scylla_memtables_pending_flushes metric?

@kbr-scylla
Contributor Author

We moved to the schema commitlog partly to avoid latency associated with flushing memtables, which we saw taking 30s in some setups.

Well, I just checked what the default value is for flush_schema_tables_after_modification, and it's true:

    , flush_schema_tables_after_modification(this, "flush_schema_tables_after_modification", liveness::LiveUpdate, value_status::Used, true,
        "Flush tables in the system_schema keyspace after schema modification. This is required for crash recovery, but slows down tests and can be disabled for them")

and the test is using the default.

@kbr-scylla Did you check the scylla_memtables_pending_flushes metric?

No. To be honest, I'm a bit tired of this issue at this point, and I feel like it's more in the storage group's ballpark; you guys would be much better qualified and more efficient at investigating this than I am. I think it's clear that it's not a problem with the Raft implementation itself, but with the underlying storage that Raft is using.

If you want to look at the metrics, the last run is here:
https://jenkins.scylladb.com/job/scylla-staging/job/KamilBraun/job/longevity-schema-changes-3h-test-artsiom_mishuta-clone/13/consoleText
note: it used 100MB of system memory (the issue still occurred)

Links to all the logs are provided in the above job log:

[2023-10-06T12:17:45.042Z] +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2023-10-06T12:17:45.042Z] |                                                      Collected logs by test-id: ba5b3790-dd79-4f08-8903-318c649cf77c                                                      |
[2023-10-06T12:17:45.042Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
[2023-10-06T12:17:45.042Z] | Cluster set               | Link                                                                                                                                          |
[2023-10-06T12:17:45.043Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
[2023-10-06T12:17:45.043Z] | db-cluster                | https://cloudius-jenkins-test.s3.amazonaws.com/ba5b3790-dd79-4f08-8903-318c649cf77c/20231006_121248/db-cluster-ba5b3790.tar.gz                |
[2023-10-06T12:17:45.043Z] | sct-runner-events         | https://cloudius-jenkins-test.s3.amazonaws.com/ba5b3790-dd79-4f08-8903-318c649cf77c/20231006_121248/sct-runner-events-ba5b3790.tar.gz         |
[2023-10-06T12:17:45.043Z] | sct-runner-python-log     | https://cloudius-jenkins-test.s3.amazonaws.com/ba5b3790-dd79-4f08-8903-318c649cf77c/20231006_121248/sct-ba5b3790.log.tar.gz                   |
[2023-10-06T12:17:45.043Z] | loader-set                | https://cloudius-jenkins-test.s3.amazonaws.com/ba5b3790-dd79-4f08-8903-318c649cf77c/20231006_121248/loader-set-ba5b3790.tar.gz                |
[2023-10-06T12:17:45.043Z] | monitor-set               | https://cloudius-jenkins-test.s3.amazonaws.com/ba5b3790-dd79-4f08-8903-318c649cf77c/20231006_121248/monitor-set-ba5b3790.tar.gz               |
[2023-10-06T12:17:45.043Z] | parallel-timelines-report | https://cloudius-jenkins-test.s3.amazonaws.com/ba5b3790-dd79-4f08-8903-318c649cf77c/20231006_121248/parallel-timelines-report-ba5b3790.tar.gz |
[2023-10-06T12:17:45.043Z] +---------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

in particular the test-id is ba5b3790-dd79-4f08-8903-318c649cf77c
which you can use to restore monitor stack using this job:
https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=ba5b3790-dd79-4f08-8903-318c649cf77c

If I were to do the next step, I'd probably check what happens when I run with flush_schema_tables_after_modification set to false.
Otherwise, I'd probably try to create a local reproducer at this point, because waiting for 4 hours at each iteration is pretty inefficient.

@tgrabiec
Contributor

tgrabiec commented Oct 6, 2023 via email

@kbr-scylla
Contributor Author

As it turns out (thanks @kostja for pointing this out), system.raft and system.paxos should actually be using the user memory manager.

static bool maybe_write_in_user_memory(schema_ptr s) {
    return (s.get() == system_keyspace::batchlog().get()) || (s.get() == system_keyspace::paxos().get())
            || s == system_keyspace::v3::scylla_views_builds_in_progress()
            || s == system_keyspace::raft();
}

future<> system_keyspace::make(
        locator::effective_replication_map_factory& erm_factory,
        replica::database& db) {
    for (auto&& table : system_keyspace::all_tables(db.get_config())) {
        co_await db.create_local_system_table(table, maybe_write_in_user_memory(table), erm_factory);
    }
}
future<> database::create_local_system_table(
        schema_ptr table, bool write_in_user_memory, locator::effective_replication_map_factory& erm_factory) {
    auto ks_name = table->ks_name();
    if (!has_keyspace(ks_name)) {
        bool durable = _cfg.data_file_directories().size() > 0;
        auto ksm = make_lw_shared<keyspace_metadata>(ks_name,
                "org.apache.cassandra.locator.LocalStrategy",
                std::map<sstring, sstring>{},
                durable
                );
        co_await create_keyspace(ksm, erm_factory, replica::database::system_keyspace::yes);
    }
    auto& ks = find_keyspace(ks_name);
    auto cfg = ks.make_column_family_config(*table, *this);
    if (write_in_user_memory) {
        cfg.dirty_memory_manager = &_dirty_memory_manager;
    } else {
        cfg.memtable_scheduling_group = default_scheduling_group();
        cfg.memtable_to_cache_scheduling_group = default_scheduling_group();
    }
    co_await add_column_family(ks, table, std::move(cfg), true);
}

However, how does the above relate to this?

future<> database::create_in_memory_keyspace(const lw_shared_ptr<keyspace_metadata>& ksm, locator::effective_replication_map_factory& erm_factory, system_keyspace system) {
    auto kscfg = make_keyspace_config(*ksm);
    if (system == system_keyspace::yes) {
        kscfg.enable_disk_reads = kscfg.enable_disk_writes = kscfg.enable_commitlog = !_cfg.volatile_system_keyspace_for_testing();
        kscfg.enable_cache = _cfg.enable_cache();
        // don't make system keyspace writes wait for user writes (if under pressure)
        kscfg.dirty_memory_manager = &_system_dirty_memory_manager;
    }
    if (extensions().is_extension_internal_keyspace(ksm->name())) {
        // don't make internal keyspaces write wait for user writes (if under pressure), and also to avoid possible deadlocks.
        kscfg.dirty_memory_manager = &_system_dirty_memory_manager;
    }
    keyspace ks(ksm, std::move(kscfg), erm_factory);
    co_await ks.create_replication_strategy(get_shared_token_metadata(), ksm->strategy_options());
    _keyspaces.emplace(ksm->name(), std::move(ks));
}

The keyspace config has its own memory manager, and the table config has its own?
So which one are we actually using?

@kostja
Contributor

kostja commented Oct 6, 2023

I remember that we re-ran benchmarks after adding system.paxos to the user memory manager and saw an improvement. So at least back then the user memory manager was in use. Perhaps we're observing a regression?

@kbr-scylla
Contributor Author

So which one are we actually using?

The cf one seems to be overriding the ks one.

Ok, in that case, back to the idea of seeing what happens if we use the system memory manager for system.raft?
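
For reference, a sketch of what that experiment would look like, assuming the change is simply to stop special-casing system.raft in maybe_write_in_user_memory (so the table presumably falls back to the keyspace default, i.e. the system dirty memory manager):

static bool maybe_write_in_user_memory(schema_ptr s) {
    // system.raft removed from this list (sketch of the experiment): without the
    // special case, create_local_system_table() no longer points the table at
    // database::_dirty_memory_manager, so user-table memory pressure should no
    // longer block Raft log writes.
    return (s.get() == system_keyspace::batchlog().get()) || (s.get() == system_keyspace::paxos().get())
            || s == system_keyspace::v3::scylla_views_builds_in_progress();
}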

@kbr-scylla
Contributor Author

If I understand correctly, memtable flushes are triggered for two reasons (sketched below):

  • when memtables grow too large, causing large memory pressure
  • when we reach the commitlog length limit

And I would guess that the user memory manager looks at the regular commitlog length to trigger flushes, not the schema commitlog?

But system.raft is the one (?) special table in that it uses the schema commitlog, but the user memory pool.
Which means the second reason for memtable flushes would not apply to system.raft.

Still, if the write is waiting on the memtable write, it means that there's large memory pressure (?), so it should be unblocked by a memtable flush... right?

Not sure if I'm making any sense here.
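
To make the two triggers above concrete, a toy sketch (names and signature are made up; this is not Scylla's actual flush logic):

#include <cstddef>

// Toy model of the two flush triggers discussed above.
bool should_flush_memtables(size_t dirty_bytes, size_t dirty_soft_limit,
                            size_t commitlog_bytes, size_t commitlog_limit) {
    // 1) Memory pressure: too much unflushed ("dirty") memtable data in the pool.
    if (dirty_bytes >= dirty_soft_limit) {
        return true;
    }
    // 2) Commitlog length: segments can only be recycled once the memtables
    //    holding their data have been flushed, so hitting the size limit
    //    forces flushes too.
    return commitlog_bytes >= commitlog_limit;
}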

@avikivity
Member

system.raft should be using the system memory manager. Otherwise a user workload can stall a schema or topology change.

@kbr-scylla
Contributor Author

kbr-scylla commented Oct 7, 2023

New run with system.raft moved to system memory:
https://jenkins.scylladb.com/job/scylla-staging/job/KamilBraun/job/longevity-schema-changes-3h-test-artsiom_mishuta-clone/15/consoleText

It looks like it solved the problem described in the issue -- the maximum observed io_fiber duration in this run is 266ms.

However, cassandra-stress operations are still timing out, some read barriers are still timing out, so the original issue (#15312) is not yet solved.

@kbr-scylla
Contributor Author

In fact, the test seems to fail earlier for some reason with this change (after 1.5h). Perhaps the performance has increased due to the system.raft change, causing the cluster to become even more overloaded?

@avikivity
Member

cassandra-stress timeouts could be due to overload (misconfigured test) or a side effect of failing barriers.

Perhaps we should shift group0 operations to their own scheduling group, or to the maintenance scheduling group, so they don't compete with the user workload.

@gleb-cloudius
Contributor

BTW. system.paxos is also using the system memory manager, right?

I remember there was a lot of trouble in the past with Paxos performance, and AFAIK it wasn't all completely resolved. Maybe the reason is that system.paxos has to work with 10MB of memory?

cc @gleb-cloudius

Paxos is supposed to use user memory: b08679e

kbr-scylla added a commit to kbr-scylla/scylladb that referenced this issue Nov 6, 2023
`system.raft` was using the "user memory pool", i.e. the
`dirty_memory_manager` for this table was set to
`database::_dirty_memory_manager` (instead of
`database::_system_dirty_memory_manager`).

This meant that if a write workload caused memory pressure on the user
memory pool, internal `system.raft` writes would have to wait for
memtables of user tables to get flushed before the write would proceed.

This was observed in SCT longevity tests which ran a heavy workload on
the cluster and concurrently, schema changes (which underneath use the
`system.raft` table). Raft would often get stuck waiting many seconds
for user memtables to get flushed. More details in issue scylladb#15622.
Experiments showed that moving Raft to system memory fixed this
particular issue, bringing the waits to reasonable levels.

Currently `system.raft` stores only one group, group 0, which is
internally used for cluster metadata operations (schema and topology
changes) -- so it makes sense for it to use system memory.

In the future we'd like to have other groups, for strongly consistent
tables. These groups should use the user memory pool. It means we won't
be able to use `system.raft` for them -- we'll just have to use a
separate table.

Fixes: scylladb#15622
@avikivity
Member

@kbr-scylla should we backport this?

It's a non-trivial change (despite being a one-liner). Nevertheless, I think it should be backported to 5.4 soon and we can consider 5.2 later.

@kbr-scylla
Contributor Author

Yes, to 5.4 definitely, we should include it in our pre-release testing.

To 5.2 not necessarily. The problem manifests as timeouts during write/read operations under heavy load when performing a concurrent schema change. So it should not be that common, and it's transient.

@kbr-scylla kbr-scylla added the backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed label Nov 9, 2023
kbr-scylla added a commit that referenced this issue Nov 16, 2023
(Same commit message as above.)

Closes #15972

(cherry picked from commit f094e23)
@kbr-scylla
Contributor Author

Queued backport to 5.4.

@kbr-scylla kbr-scylla removed Backport candidate backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed labels Nov 16, 2023
kbr-scylla added a commit that referenced this issue Jan 25, 2024
(Same commit message as above.)

Closes #15972

(cherry picked from commit f094e23)