raft: large delays between io_fiber iterations in schema change test #15622
Got logs from another run with more logging in

so it took 6 seconds to do this...

```cpp
logger.trace("[{}] io_fiber storing log entries {}", _id, entries.size());
co_await _persistence->store_log_entries(entries);
logger.trace("[{}] io_fiber stored log entries", _id);
```

It's weird. The variability seems very high -- usually this storing step completes within the same second (unfortunately the logs don't provide millisecond resolution for some reason), but this one time it suddenly took a whole 6 seconds. Even if the disk is struggling, what could cause such a big spike? @xemul is this even a plausible thing to happen? (or maybe we have a bug in |
Attaching a log fragment from that node (which is also the leader, btw). Full logs can be found here:
|
A concurrent

But that

so it doesn't seem plausible that this

And since it finished so quickly during this time, maybe there indeed is a problem with our persistence implementation? |
Ugh, looks like my logs are being eaten:
There are no logs for iterations 6292 - 7108 and part of 7109, yet the timestamps suggest that the log stream is continuous -- there is no time jump. Maybe it's a problem with SCT's log collector? (I don't know how this works, whether it assigns timestamps or takes them from the process...) cc @temichus could it be that if the log collector or something else hangs in SCT, we skip a bunch of logs from a node? |
Winner of this round:
25 seconds. (I verified that this is from the same io_fiber iteration, not from logs being eaten.) |
Btw. a side-effect of having a long io_fiber iteration:
Those 486 messages are all the same: empty append_entries with exactly the same parameters. It's a huge waste. (cc @gleb-cloudius) |
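A minimal sketch (my illustration, not the Raft implementation's actual outgoing queue) of the kind of coalescing that could avoid accumulating hundreds of identical empty append_entries while the io fiber is blocked:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Toy outgoing-message queue: an empty append_entries that is identical to the
// one already at the tail is dropped instead of being queued again.
struct append_entries_msg {
    uint64_t term, prev_log_idx, prev_log_term, commit_idx;
    bool empty;  // no entries attached (heartbeat / probe)
    bool operator==(const append_entries_msg&) const = default;
};

class out_queue {
    std::deque<append_entries_msg> _q;
public:
    void push(append_entries_msg m) {
        if (m.empty && !_q.empty() && _q.back() == m) {
            return;  // coalesce: an identical empty message is already pending
        }
        _q.push_back(m);
    }
    std::size_t size() const { return _q.size(); }
};
```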
The entry that took 25 seconds to persist on node 1 had this schema change:
and index
On other nodes, it looks like the entry was persisted instantaneously.
So it doesn't seem that the command itself was problematic (e.g. due to its size). |
It depends :) I saw delays of minutes on a RAID5 setup -- the driver was rebalancing internally and blocked all queued requests for that period. But that was not a "regular situation", of course. By and large, the in-disk delay metrics should show whether it's indeed kernel+disk or something else. |
It's normal for an empty cluster to degrade quickly as it moves from in-memory to mixed memory/sstable and compaction bandwidth increases. Still, it could be something else. |
The rate is too good. It polls much more frequently than once every task-quota-ms (0.5ms). |
For a while I suspected this could be a problem with the batch statement implementation, which is used when persisting Raft log entries. However, the times for persisting the commit index are also reaching great heights, and that path does not use a batch statement:

10 seconds. This function has a very simple implementation:

```cpp
future<> raft_sys_table_storage::store_commit_idx(raft::index_t idx) {
    return execute_with_linearization_point([this, idx] {
        static const auto store_cql = format("INSERT INTO system.{} (group_id, commit_idx) VALUES (?, ?)",
                db::system_keyspace::RAFT);
        return _qp.execute_internal(
                store_cql,
                {_group_id.id, int64_t(idx)},
                cql3::query_processor::cache_internal::yes).discard_result();
    });
}
```
|
Make sure the main group has enough shares. |
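For context on the "enough shares" remark above, a minimal Seastar sketch of how scheduling-group shares are assigned (illustrative name and share value, not the groups Scylla actually configures):

```cpp
#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>

// Shares set the relative CPU weight of a group; if the "main" group's shares
// are too low relative to the other groups, its tasks can be starved.
seastar::future<seastar::scheduling_group> make_example_group() {
    return seastar::create_scheduling_group("example-group", 200); // name/shares are assumptions
}
```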
BTW. I have no clue what's the point of this:

```cpp
future<> raft_sys_table_storage::execute_with_linearization_point(std::function<future<>()> f) {
    promise<> task_promise;
    auto pending_fut = std::exchange(_pending_op_fut, task_promise.get_future());
    co_await std::move(pending_fut);
    try {
        co_await f();
        task_promise.set_value();
    } catch (...) {
        task_promise.set_exception(std::current_exception());
        throw;
    }
}
```

It looks like a round-about way to implement... a mutex? Also, we never call

(cc @kostja) |
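For comparison with the promise-chaining snippet above, a minimal sketch of the same "run operations strictly one at a time" behavior expressed with Seastar's semaphore utilities (a hypothetical class, not the actual implementation; exception propagation to later waiters differs slightly from the manual chaining):

```cpp
#include <functional>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

class serialized_executor {
    seastar::semaphore _sem{1}; // a single unit => operations are serialized
public:
    seastar::future<> execute(std::function<seastar::future<>()> f) {
        // with_semaphore queues callers and releases the unit when f's future
        // resolves -- what the promise/_pending_op_fut chaining does by hand.
        return seastar::with_semaphore(_sem, 1, std::move(f));
    }
};
```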
@xemul it shows very high starvation time, why? |
Wouldn't each group have its own io fiber? |
@avikivity, this can be due to scylladb/seastar#1790 (comment) |
@avikivity could it be an interaction of

```cpp
static thread_local inheriting_concrete_execution_stage<
        future<::shared_ptr<cql_transport::messages::result_message>>,
        const modification_statement*,
        query_processor&,
        service::query_state&,
        const query_options&> modify_stage{"cql3_modification", modification_statement_executor::get()};
```

I'm not sure which scheduling group our raft_sys_table_storage statements are running on. IIUC, when we reach the schema commitlog level, we switch to |
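For background on the execution stage quoted above, a generic Seastar sketch (my example, not the Scylla declaration) of how an inheriting execution stage is declared and invoked -- it keeps a separate concrete stage per scheduling group, so batched calls run under the caller's group:

```cpp
#include <seastar/core/execution_stage.hh>
#include <seastar/core/future.hh>

// Illustrative stage: batches calls to a cheap function; each call runs under
// the scheduling group of the fiber that invoked the stage.
static thread_local seastar::inheriting_concrete_execution_stage<int, int> square_stage{
    "square", [] (int x) { return x * x; }};

seastar::future<int> square(int x) {
    return square_stage(x); // enqueued, executed in a batch, result delivered via future
}
```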
Ah, of course. It's not the commitlog writes that take so long...

```cpp
auto f = co_await coroutine::as_future(this->apply_in_memory(m, s, std::move(h), timeout));
```

Why didn't I think of this before? The write to the memtable is where we're waiting! I'm guessing that this is caused by memory pressure: we're not flushing memtables quickly enough because of the cassandra-stress workload, so writes to system/schema tables have to wait until user data gets flushed to disk. Maybe these two need to be separated as well? |
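A conceptual toy model (my sketch, not Scylla's actual dirty-memory accounting) of the stall suspected above: admission of a write waits until flushes bring dirty memory back under a limit, so if system and user tables share one pool, an internal write can end up waiting for user memtable flushes.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Writers block while dirty bytes exceed a soft limit; the flusher signals
// once memory is reclaimed.
class dirty_memory_throttler {
    std::mutex _mu;
    std::condition_variable _cv;
    std::size_t _dirty = 0;
    const std::size_t _soft_limit;
public:
    explicit dirty_memory_throttler(std::size_t soft_limit) : _soft_limit(soft_limit) {}

    void admit_write(std::size_t bytes) {
        std::unique_lock<std::mutex> lk(_mu);
        _cv.wait(lk, [&] { return _dirty < _soft_limit; }); // the kind of wait suspected around apply_in_memory()
        _dirty += bytes;
    }

    void on_flushed(std::size_t bytes) {
        { std::lock_guard<std::mutex> lk(_mu); _dirty -= bytes; }
        _cv.notify_all();
    }
};
```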
@avikivity @tgrabiec @denesb @xemul do you think we should separate the system/schema write path from the user data write path even more, so that if memory pressure is high and user table memtables are flushed often, this doesn't block system/schema writes? Something like using a separate (and small) memory pool for system etc.? (This is based on my guess that we're blocking on |
AFAIK we already have separate system and user dirty memory managers, with separate memory pools. |
Ah, right... Actually now I noticed something even more interesting, I think:

```cpp
auto end1 = db_clock::now();
auto dur1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start);
if (dur1.count() >= log_threshold) {
    auto s = cf.schema();
    dblog.warn("write to commitlog cf {}.{} took {}", s->ks_name(), s->cf_name(), dur1);
}
auto f = co_await coroutine::as_future(this->apply_in_memory(m, s, std::move(h), timeout));
auto end2 = db_clock::now();
auto dur2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start);
if (dur2.count() >= log_threshold) {
    auto s = cf.schema();
    dblog.warn("write to memory cf {}.{} took {}, to commitlog {}", s->ks_name(), s->cf_name(), dur2, dur1);
}
```

(You can find logs from my latest run here:

What I expected is that this log message for exceeding

But it's appearing only for

Same for other nodes. So... maybe the separate memory pool is actually causing the problem here? |
(but I think the system one is considerably smaller, which might be a problem with all these new system tables, like |
Actually this doesn't prove anything: the write timeout for user writes is set to 2000ms, so any user write that exceeded this threshold would time out and we wouldn't see this message. |
No. Every time I make a change, I'm waiting 1 hour to build an AMI and then 3 hours for the SCT test to finish. And from what I see, the high latencies start about ~1 hour into the test. |
It's more frequent than that: each schema change touches ~10 tables. We moved to the schema commitlog partly to avoid the latency associated with flushing memtables, which we saw taking 30s in some setups. If you accumulate several such commands, we may hit _max_background_work.
|
@kbr-scylla Did you check the scylla_memtables_pending_flushes metric? |
Well, I just checked what the default value is for `flush_schema_tables_after_modification`, and it's true:

```cpp
, flush_schema_tables_after_modification(this, "flush_schema_tables_after_modification", liveness::LiveUpdate, value_status::Used, true,
        "Flush tables in the system_schema keyspace after schema modification. This is required for crash recovery, but slows down tests and can be disabled for them")
```

and the test is using the default.
No. To be honest I'm a bit tired of this issue at this point, and I feel like it is more in the storage group's ballpark; you would be better qualified and more efficient at investigating this than I am. I think it's clear that it's not a problem with the Raft implementation itself, but with the underlying storage that Raft is using. If you want to look at the metrics, the last run is here:

Links to all the logs are provided in the above job log:

in particular the test-id is

If I were to do the next step, I'd probably check what happens when I run with |
> We moved to the schema commitlog partly to avoid latency associated with flushing memtables, which we saw taking 30s in some setups.
>
> Well, I just checked what the default value is for flush_schema_tables_after_modification, and it's true [...] and the test is using the default.

This option is ignored when schema commitlog is used. |
As it turns out (thanks @kostja for pointing this out), system.raft and system.paxos should actually be using the user memory manager:

```cpp
static bool maybe_write_in_user_memory(schema_ptr s) {
    return (s.get() == system_keyspace::batchlog().get()) || (s.get() == system_keyspace::paxos().get())
            || s == system_keyspace::v3::scylla_views_builds_in_progress()
            || s == system_keyspace::raft();
}

future<> system_keyspace::make(
        locator::effective_replication_map_factory& erm_factory,
        replica::database& db) {
    for (auto&& table : system_keyspace::all_tables(db.get_config())) {
        co_await db.create_local_system_table(table, maybe_write_in_user_memory(table), erm_factory);
    }
}

future<> database::create_local_system_table(
        schema_ptr table, bool write_in_user_memory, locator::effective_replication_map_factory& erm_factory) {
    auto ks_name = table->ks_name();
    if (!has_keyspace(ks_name)) {
        bool durable = _cfg.data_file_directories().size() > 0;
        auto ksm = make_lw_shared<keyspace_metadata>(ks_name,
                "org.apache.cassandra.locator.LocalStrategy",
                std::map<sstring, sstring>{},
                durable
                );
        co_await create_keyspace(ksm, erm_factory, replica::database::system_keyspace::yes);
    }
    auto& ks = find_keyspace(ks_name);
    auto cfg = ks.make_column_family_config(*table, *this);
    if (write_in_user_memory) {
        cfg.dirty_memory_manager = &_dirty_memory_manager;
    } else {
        cfg.memtable_scheduling_group = default_scheduling_group();
        cfg.memtable_to_cache_scheduling_group = default_scheduling_group();
    }
    co_await add_column_family(ks, table, std::move(cfg), true);
}
```

However, how does the above relate to this?

```cpp
future<> database::create_in_memory_keyspace(const lw_shared_ptr<keyspace_metadata>& ksm, locator::effective_replication_map_factory& erm_factory, system_keyspace system) {
    auto kscfg = make_keyspace_config(*ksm);
    if (system == system_keyspace::yes) {
        kscfg.enable_disk_reads = kscfg.enable_disk_writes = kscfg.enable_commitlog = !_cfg.volatile_system_keyspace_for_testing();
        kscfg.enable_cache = _cfg.enable_cache();
        // don't make system keyspace writes wait for user writes (if under pressure)
        kscfg.dirty_memory_manager = &_system_dirty_memory_manager;
    }
    if (extensions().is_extension_internal_keyspace(ksm->name())) {
        // don't make internal keyspaces write wait for user writes (if under pressure), and also to avoid possible deadlocks.
        kscfg.dirty_memory_manager = &_system_dirty_memory_manager;
    }
    keyspace ks(ksm, std::move(kscfg), erm_factory);
    co_await ks.create_replication_strategy(get_shared_token_metadata(), ksm->strategy_options());
    _keyspaces.emplace(ksm->name(), std::move(ks));
}
```

The keyspace config has its own memory manager, and the table config has its own? |
I remember that we re-ran benchmarks after adding system.paxos to the user memory manager and saw an improvement. So at least back then the user memory manager was in use. Perhaps we're observing a regression? |
The

Ok, in that case, back to the idea of seeing what happens if we use the system memory manager for |
If I understand correctly, memtable flushes are triggered by two causes:

And I would guess that the user memory manager is looking at the regular commitlog length to trigger flushes, not the schema commitlog? But

Still, if the write is waiting on the memtable write, it means that there's large memory pressure (?), so it should be unblocked by a memtable flush... right? Not sure if I'm making any sense here. |
system.raft should be using the system memory manager. Otherwise, a user workload can stall a schema or topology change. |
New run with system.raft moved to system memory:

It looks like it solved the problem described in this issue -- the maximum observed io_fiber duration in this run is 266ms. However, cassandra-stress operations are still timing out and some read barriers are still timing out, so the original issue (#15312) is not yet solved. |
In fact, the test seems to fail earlier for some reason with this change (after 1.5h). Perhaps the performance has increased due to the |
cassandra-stress timeouts could be due to overload (misconfigured test) or a side effect of failing barriers. Perhaps we should shift group0 operations to their own scheduling group, or to the maintenance scheduling group, so they don't compete with the user workload. |
Paxos is supposed to use user memory: b08679e |
`system.raft` was using the "user memory pool", i.e. the `dirty_memory_manager` for this table was set to `database::_dirty_memory_manager` (instead of `database::_system_dirty_memory_manager`). This meant that if a write workload caused memory pressure on the user memory pool, internal `system.raft` writes would have to wait for memtables of user tables to get flushed before the write could proceed.

This was observed in SCT longevity tests which ran a heavy workload on the cluster and, concurrently, schema changes (which underneath use the `system.raft` table). Raft would often get stuck waiting many seconds for user memtables to get flushed. More details in issue scylladb#15622.

Experiments showed that moving Raft to system memory fixed this particular issue, bringing the waits down to reasonable levels.

Currently `system.raft` stores only one group, group 0, which is internally used for cluster metadata operations (schema and topology changes) -- so it makes sense to use system memory. In the future we'd like to have other groups, for strongly consistent tables. These groups should use the user memory pool. It means we won't be able to use `system.raft` for them -- we'll just have to use a separate table.

Fixes: scylladb#15622
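Given the `maybe_write_in_user_memory()` helper quoted earlier in this thread, the one-liner presumably amounts to dropping `system.raft` from that list. This is my reconstruction of the change's shape, not the actual diff:

```cpp
// Presumed shape of the fix: system.raft no longer opts into the user memory pool,
// so create_local_system_table() leaves it on the system dirty memory manager.
static bool maybe_write_in_user_memory(schema_ptr s) {
    return (s.get() == system_keyspace::batchlog().get()) || (s.get() == system_keyspace::paxos().get())
            || s == system_keyspace::v3::scylla_views_builds_in_progress();
}
```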
@kbr-scylla should we backport this? It's a non-trivial change (despite being a one-liner). Nevertheless, I think it should be backported to 5.4 soon and we can consider 5.2 later. |
Yes, to 5.4 definitely, we should include it in our pre-release testing. To 5.2 not necessarily. The problem manifests as timeouts during write/read operations under heavy load when performing a concurrent schema change. So it should not be that common, and it's transient. |
`system.raft` was using the "user memory pool", i.e. the `dirty_memory_manager` for this table was set to `database::_dirty_memory_manager` (instead of `database::_system_dirty_memory_manager`). This meant that if a write workload caused memory pressure on the user memory pool, internal `system.raft` writes would have to wait for memtables of user tables to get flushed before the write could proceed.

This was observed in SCT longevity tests which ran a heavy workload on the cluster and, concurrently, schema changes (which underneath use the `system.raft` table). Raft would often get stuck waiting many seconds for user memtables to get flushed. More details in issue #15622.

Experiments showed that moving Raft to system memory fixed this particular issue, bringing the waits down to reasonable levels.

Currently `system.raft` stores only one group, group 0, which is internally used for cluster metadata operations (schema and topology changes) -- so it makes sense to use system memory. In the future we'd like to have other groups, for strongly consistent tables. These groups should use the user memory pool. It means we won't be able to use `system.raft` for them -- we'll just have to use a separate table.

Fixes: #15622
Closes #15972

(cherry picked from commit f094e23)
Queued backport to 5.4. |
When investigating #15312 I noticed that sometimes there would be large (~3 seconds) delays between `io_fiber` iterations, judging by `poll_output()` log messages.

Attaching a fragment of one of the log files from one run of the test: example.txt

Full logs can be found here:

In the fragment we can see the delay, one poll_output:

and the next one:

and that is even though there were messages to send in the meantime, e.g.

Each message should trigger the next `poll_output()`, so it looks like `io_fiber` is stuck somewhere. Perhaps when trying to persist something. It could be that there are large disk delays here? But the metrics relevant to the schema commitlog did not suggest that there would be large delays persisting stuff to schema-commitlog tables (such as the raft table).

Needs further investigation.