Reactor stalled for 32 ms during repair in row_cache::update() #7181
@avikivity, could this be the same as the one in the description?
As a reference to myself, in case I need to report a new issue, this is the stall message in the log:
Yes, they're exactly the same.
BTW, not repair related.
I don't understand it. The code is preemptible. How many times did this happen? What is the schema and data shape (sizes of columns, number of rows in a partition)?
I can't confirm it right now, but it happened some 4 times in a small time window.
These are the c-s commands we run:
Please extract the info. How large was the time window? Was it always the same shard? Maybe it was a fluke.
Checking.
@xemul please take a look. Why is there a
I found it happening 3 times (in that sequence, maybe in other parts as well, but I'm not sure about it).
@avikivity, there's
Ah, didn't realize it was a value, not a type.
@avikivity, any direction on how to address this, so we can assign
I believe it's a fluke (the kernel took some time away from us, or the hypervisor). If it doesn't repeat we can close it. @tgrabiec please opine.
Could be a fluke. The fact that there is no stall detector report for smaller periods, just straight to 32ms, supports this hypothesis.
That depends on the detector threshold, though I think it is 5ms for these tests. @fgelcer when reporting latency issues, please report the stall detector threshold as well.
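For readers unfamiliar with the threshold being discussed: the stall detector reports when the reactor loop fails to yield within a configured cutoff. Below is a toy sketch of that idea, assuming a separate watchdog thread; it is a simplification for illustration only, not Seastar's actual implementation (which is timer/signal based and captures backtraces).

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

// Toy stall detector: a watchdog reports when the "reactor" loop has
// not checked in within the threshold.
std::atomic<uint64_t> last_checkin_ms{0};

uint64_t now_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(
        steady_clock::now().time_since_epoch()).count();
}

void watchdog(uint64_t threshold_ms) {
    for (;;) {
        std::this_thread::sleep_for(std::chrono::milliseconds(threshold_ms));
        uint64_t stalled = now_ms() - last_checkin_ms.load();
        if (stalled > threshold_ms) {
            std::cout << "Reactor stalled for " << stalled << " ms\n";
        }
    }
}

// The reactor loop would call this between tasks; a single task that
// blocks for 32ms yields one 32ms report, with no smaller reports first.
void checkin() { last_checkin_ms.store(now_ms()); }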
I saw single-digit stalls in the log as well.
@avikivity, you are right. For those stall checks I'm running, all of them have
I see with_reserve, but not refill_emergency_reserve. So this isn't #325.
Closing this for now.
This is being reproduced again. Logs are here.
I will report in case it happens more than once.
Another time, on the same node:
This is erase_and_dispose() in:

template <typename Updater>
future<> row_cache::do_update(external_updater eu, memtable& m, Updater updater) {
return do_update(std::move(eu), [this, &m, updater = std::move(updater)] {
real_dirty_memory_accounter real_dirty_acc(m, _tracker);
m.on_detach_from_region_group();
_tracker.region().merge(m); // Now all data in memtable belongs to cache
_tracker.memtable_cleaner().merge(m._cleaner);
STAP_PROBE(scylla, row_cache_update_start);
auto cleanup = defer([&m, this] {
invalidate_sync(m);
STAP_PROBE(scylla, row_cache_update_end);
});
return seastar::async([this, &m, updater = std::move(updater), real_dirty_acc = std::move(real_dirty_acc)] () mutable {
size_t size_entry;
// In case updater fails, we must bring the cache to consistency without deferring.
auto cleanup = defer([&m, this] {
invalidate_sync(m);
_prev_snapshot_pos = {};
_prev_snapshot = {};
});
coroutine update; // Destroy before cleanup to release snapshots before invalidating.
partition_presence_checker is_present = _prev_snapshot->make_partition_presence_checker();
while (!m.partitions.empty()) {
with_allocator(_tracker.allocator(), [&] () {
auto cmp = dht::ring_position_comparator(*_schema);
{
size_t partition_count = 0;
{
STAP_PROBE(scylla, row_cache_update_one_batch_start);
// FIXME: we should really be checking should_yield() here instead of
// need_preempt(). However, should_yield() is currently quite
// expensive and we need to amortize it somehow.
do {
STAP_PROBE(scylla, row_cache_update_partition_start);
{
if (!update) {
_update_section(_tracker.region(), [&] {
with_linearized_managed_bytes([&] {
memtable_entry& mem_e = *m.partitions.begin();
size_entry = mem_e.size_in_allocator_without_rows(_tracker.allocator());
partitions_type::bound_hint hint;
auto cache_i = _partitions.lower_bound(mem_e.key(), cmp, hint);
update = updater(_update_section, cache_i, mem_e, is_present, real_dirty_acc, hint);
});
});
}
// We use cooperative deferring instead of futures so that
// this layer has a chance to restore invariants before deferring,
// in particular set _prev_snapshot_pos to the correct value.
if (update.run() == stop_iteration::no) {
return;
}
update = {};
real_dirty_acc.unpin_memory(size_entry);
_update_section(_tracker.region(), [&] {
with_linearized_managed_bytes([&] {
auto i = m.partitions.begin();
i.erase_and_dispose(dht::raw_token_less_comparator{}, [&] (memtable_entry* e) noexcept {
m.evict_entry(*e, _tracker.memtable_cleaner());
});
});
});
++partition_count;
}
STAP_PROBE(scylla, row_cache_update_partition_end);
} while (!m.partitions.empty() && !need_preempt());
with_allocator(standard_allocator(), [&] {
if (m.partitions.empty()) {
_prev_snapshot_pos = {};
} else {
_update_section(_tracker.region(), [&] {
_prev_snapshot_pos = m.partitions.begin()->key();
});
}
});
STAP_PROBE1(scylla, row_cache_update_one_batch_end, partition_count);
}
}
});
real_dirty_acc.commit();
seastar::thread::yield();
}
}).finally([cleanup = std::move(cleanup)] {});
});
}

There is a preemption check in the do/while loop, but how can a single iteration take 32ms? /cc @xemul
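To make the question concrete: a cooperative check between iterations only bounds latency if each iteration is itself short. A minimal sketch of the pattern in question, with stand-in names (item, process_one, yield_to_reactor are illustrative, not Scylla or Seastar code):

#include <deque>

// Illustrative stand-ins, not real Scylla/Seastar definitions.
struct item { int size = 1; };

bool need_preempt() { return true; }   // in Seastar: true when the quantum is up
void yield_to_reactor() {}             // stand-in for yielding to the scheduler
void process_one(const item&) {}       // if this one call takes 32ms, the
                                       // preemption check below cannot help

// The shape of the loop above: the check caps latency only at the
// granularity of a single iteration.
void drain(std::deque<item>& work) {
    while (!work.empty()) {
        process_one(work.front());
        work.pop_front();
        if (need_preempt()) {
            yield_to_reactor();
        }
    }
}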
Well, if you are checking the latest trace above this comment, then it took 7 ms...
One more occurrence, this time on node-3:
This is the AMI id.
So this rules out the tuned problem.
I don't understand it.
A stall in partition_presence_checker could be caused by a large number of sstables with overlapping partition ranges, so there are a lot of bloom filters to check.
Maybe something along these lines could help:

diff --git a/table.cc b/table.cc
index a76b48fa4..72b032731 100644
--- a/table.cc
+++ b/table.cc
@@ -1051,10 +1051,11 @@ table::make_partition_presence_checker(lw_shared_ptr<sstables::sstable_set> ssta
         auto hk = sstables::sstable::make_hashed_key(*_schema, key.key());
         for (auto&& s : sst) {
             if (s->filter_has_key(hk)) {
                 return partition_presence_checker_result::maybe_exists;
             }
+            seastar::thread::maybe_yield();
         }
         return partition_presence_checker_result::definitely_doesnt_exist;
     };
 }
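One caveat about this sketch: seastar::thread::maybe_yield() must be called from within a seastar::thread context, and it only yields when preemption is actually requested. That should hold for this call site, assuming the presence checker is invoked from the seastar::async block in do_update() shown earlier in the thread.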
How many sstables do we need for the loop to take 32ms?
Level 0 sstables are unconditionally returned by incremental_selector::select(); they're thought to span roughly the whole token range. However, after repair we may have tons (thousands?) of L0 sstables which span only a subset of the token space. After off-strategy compaction we'll no longer have this issue: 1) at most 256 sstables per shard; 2) they're held in partitioned_sstable_set. While off-strategy is not in yet, we may consider changing incremental_selector to only return L0 sstables which overlap with the requested key, instead of returning them all unconditionally. This will not only fix the stall after repair, but also avoid unneeded read amplification, which results from reading the index unnecessarily.
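A rough sketch of that selector change, using stand-in types since the real incremental_selector API isn't shown here (token, sstable, sstable_ref below are illustrative, not Scylla's definitions): only L0 sstables whose token range covers the requested position would be returned.

#include <memory>
#include <vector>

// Stand-in types for illustration only.
struct token { long v; };
struct sstable {
    token first, last;  // token range covered by this sstable
    bool overlaps(token t) const { return first.v <= t.v && t.v <= last.v; }
};
using sstable_ref = std::shared_ptr<sstable>;

// The idea from the comment above: return only the L0 sstables that
// overlap the requested key, instead of all of them unconditionally.
std::vector<sstable_ref> select_l0(const std::vector<sstable_ref>& l0, token t) {
    std::vector<sstable_ref> out;
    for (const auto& s : l0) {
        if (s->overlaps(t)) {
            out.push_back(s);
        }
    }
    return out;
}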
Even a thousand sstables. 32ms / 1000 = 32us. Does a bloom filter check take 32us? No, it doesn't.
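For completeness, the back-of-the-envelope arithmetic as a snippet (the 1000-sstable count is taken from the exchange above; the point is the per-check budget, not a measurement):

#include <cstdio>

// 32ms spread over 1000 bloom-filter checks leaves a 32us budget per
// check; a real bloom-filter probe is orders of magnitude cheaper.
int main() {
    const double stall_ms = 32.0;
    const int sstable_count = 1000;
    const double budget_us = stall_ms * 1000.0 / sstable_count;
    std::printf("per-sstable budget: %.1f us\n", budget_us); // 32.0 us
}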
object_descriptor::encode, see #8828
Closing as duplicate of #8828.
Installation details
Scylla version (or git commit hash): 666.development-0.20200901.b3f00685e with build-id 0de9edcca81c447d53c39dd1adc333f769f4e240
Cluster size: 3
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0096fe0439d41b6d6
DB logs here.