Inserted data only becomes available after restart #16759
While it might take some time for data to be replicated to other nodes, it should be immediately visible on all replicas which acknowledged it. In particular, since this is a single-node cluster, all INSERTs which succeeded should be visible immediately.
It shouldn't be possible for that to happen. If the write eventually became visible, then it must have been there the whole time, since it was acknowledged -- either in the memtables (in-memory trees with recent writes), or the sstables (on-disk files with older writes). That's why my first suspicion is a bug in the read path. Since the write became visible after restart, it shouldn't be a bug in sstable readers – a restart doesn't change the readers and doesn't change the sstables. And since memtable readers do almost the same thing during SELECTs (which didn't work) and during flush to disk (which did work), it also shouldn't be a bug in memtable readers. Hence, it sounds like a bug in the row cache (a layer between the SELECTs and the sstables, which holds merged results of recent reads and writes). If you are again able to reproduce a state where an acknowledged write isn't visible, please run
Also, try doing the SELECT with BYPASS CACHE. If it's a cache bug, this SELECT should succeed.
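For example, using the affected table from this report (`BYPASS CACHE` is standard ScyllaDB CQL; it reads straight from memtables and sstables, skipping the row cache):

```cql
-- If the row cache holds a spurious empty entry, this query should
-- still return the inserted row, unlike the plain SELECT.
SELECT * FROM tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore
WHERE key = <affected_key>
BYPASS CACHE;
```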
Well, replay procedures do exist in Scylla. When something is written, it's first written to the commitlog (which is persisted to disk either before the write is acknowledged or soon after, depending on the configuration), and then it's written to memtables. If Scylla doesn't shut down cleanly, then the commitlog is replayed on startup to recover the lost memtables. But that only happens after an unclean shutdown, and it's logged. Your logs don't mention commitlog replay, so commitlog should be irrelevant to this case. There's also batchlog replay, but that would only be relevant if you were doing BATCH inserts and they failed.
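Since this deployment runs on Kubernetes (see the installation details below), one quick way to double-check is to search the node's logs for commitlog activity around the restart, for example:

```sh
# <scylla-pod> is a placeholder for the actual pod name. The exact
# wording of replay messages varies by version, but they mention the
# commitlog, so a case-insensitive grep should surface them.
kubectl logs <scylla-pod> | grep -i commitlog
```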
Thanks for the great analysis; it looks like your suspicion is spot-on as far as my investigations go. The bug occurred again today, and just as you predicted, the results suggest the culprit is somewhere in the cache:
At the same time,
That clearly shows the data being available even though the cache says otherwise. To confirm that, here is the output of
Does that give you any further insights? Having …
From your outputs, it seems that there is an empty entry in the cache, which is hiding the actual data. That's bad, because the cache is supposed to always be consistent with the sstables. So if you could provide more info about the circumstances of the bug, it could be a big help. Please share anything that could be helpful in reproducing the problem, in particular:
I tried to gather as much information as possible to answer your questions by logging queries in Scylla. To do so, I set
There were actually a few more queries like these, but since the only difference was the affected table, I don't think they matter here. I'm wondering why I don't see any …

Regarding your questions:
The "column1" and "value" columns resemble the Thrift-compatibility table layout. Are you using the Thrift API to create/access this table?
Actually, I was going to ask a related question, but I forgot about it:
@tgrabiec I'm not accessing the table directly, but via JanusGraph, which uses Scylla as a storage backend. @michoecho It is definitely not a …
Quick update regarding the …
Just in case this helps you narrow down the search space: repairing the cache works both when using …
For what it's worth, we have this code: lines 2009 to 2048 in 0564320.
This contains at least one path that can cause exactly that: if this exception is thrown sporadically, or if there was a truncate at some point, it can explain this.
How long does it take to reproduce? We could try to bisect it. If you can prepare a standalone reproducer, we could run it; otherwise, we could provide you with instructions.
@avikivity I mentioned in the opening post that the issue only occurs at scale. Even then, we have seen spans of multiple days without the issue appearing even once. Since the documentation says that downgrading is not possible, I guess that implies we would have to set up a fresh environment in order to bisect versions. Sadly, we are already struggling with compute resources, so replicating the entire environment is barely an option for us. Edit: If it turns out in-place downgrades between versions 5.4.1 and 5.2.11 are indeed safe, I'd be willing to bisect in our prod env.
Since it's a single-node cluster, I assumed it's not too large. What's the dataset size? Downgrading is not possible. Even if it were possible, it's risky, as you'd test intermediate versions with (even more) bugs. The most reasonable approach is to snapshot the data and try it out in bisects on nodes created for the purpose.
Ah, since it's single-node, it's possible to downgrade by editing the system.local table (removing features introduced in 5.4).
@michoecho we can try to run the randomized cache tests with a much higher iteration count; maybe it will catch something. Unlikely, but it's a low-effort check.
The dataset size is about 2TB. To replicate the setup, I'd have to clone the entire environment, which includes all the clients that generate the workload and make the requests. That's what's beyond our capabilities.
But, as I read your earlier comment, you would still not recommend it because of possibly buggy intermediate versions.
Do these tests use concurrent clients to issue queries? High concurrency and limited resources could be key to reproducing the issue. Most of the times I noticed the bug, Scylla was operating at full load, using 100% of its 8 CPUs. However, I'm not overly confident this is a necessary condition. One could still argue that the more queries there are in a given time frame, the more likely it is to find one that fails.
What about replicating at 1/8 scale? To just 1 vcpu (and with 1/8 of the storage, memory, and workload)?
Correct. We could do some work to look for fixes beyond a particular commit hash, but it's hard to be sure.
At least some do. I have low confidence that we'll see the same failures though.
This doesn't mean much. With a moderate write load it will use spare CPU to compact. You can see that in the advanced dashboard under CPU time, compaction.
Right. Yet another option: cook up a version that verifies every read with the equivalent of BYPASS CACHE, and then dumps core on a discrepancy. This requires:
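A rough sketch of what such a verifying read wrapper could look like (hypothetical names and types, not Scylla's actual read path):

```cpp
#include <cstdlib>
#include <functional>
#include <optional>
#include <string>

// Hypothetical stand-ins for the two read paths: one through the row
// cache, one bypassing it (the equivalent of BYPASS CACHE).
using row = std::optional<std::string>;
using read_fn = std::function<row(const std::string&)>;

// Serve every read twice and dump core on a discrepancy, so the
// inconsistent cache state can be inspected post-mortem.
row read_verified(const std::string& key,
                  const read_fn& read_cached,
                  const read_fn& read_bypassing_cache) {
    row cached = read_cached(key);
    row direct = read_bypassing_cache(key);
    if (cached != direct) {
        std::abort();  // discrepancy: cache disagrees with sstables
    }
    return cached;
}
```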
How would I even do that? The first thing I could find in the docs is that it's impossible to disable the cache. |
It is actually possible to disable the cache per node for all requests via a config param. In scylla.yaml that's …
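For reference, a minimal scylla.yaml sketch; the parameter name `enable_cache` is my assumption for the option meant here, since the original code span was lost:

```yaml
# Assumption: the per-node switch referred to above is enable_cache.
# Setting it to false disables the row cache for all requests on this node.
enable_cache: false
```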
@michoecho - could this be solved by 938b993?
I don't think so, although I have no proof.
There's …
… preempted

Commit e81fc1f accidentally broke the control flow of row_cache::do_update(). Before that commit, the body of the loop was wrapped in a lambda. Thus, to break out of the loop, `return` was used. The bad commit removed the lambda, but didn't update the `return` accordingly. Thus, since the commit, the statement doesn't just break out of the loop as intended, but also skips the code after the loop, which updates `_prev_snapshot_pos` to reflect the work done by the loop.

As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted, `do_update()` fails to update `_prev_snapshot_pos`. It remains in a stale state until `do_update()` runs again and either finishes or is preempted outside of `updater`. If we read a partition processed by `do_update()` but not covered by `_prev_snapshot_pos`, we will read stale data (from the previous snapshot), which will be remembered in the cache as the current data. This results in outdated data being returned by the replica. (And perhaps in something worse if range tombstones are involved; I didn't investigate this possibility in depth.)

Note: for queries with CL>1, occurrences of this bug are likely to be hidden by reconciliation, because the reconciled query will only see stale data if the queried partition is affected by the bug on *all* queried replicas at the time of the query.

Fixes scylladb#16759
Closes scylladb#17138
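To make the control-flow mistake concrete, here is a toy sketch of the `break`-versus-`return` difference the commit message describes (illustrative code only, not the actual `row_cache::do_update()`):

```cpp
#include <cstddef>
#include <vector>

// Intended behavior: when the work budget runs out (standing in for
// preemption), stop the loop but still run the post-loop bookkeeping
// (the analogue of updating _prev_snapshot_pos).
std::size_t update_intended(std::vector<int>& items, std::size_t budget,
                            std::size_t& progress) {
    std::size_t done = 0;
    for (int& item : items) {
        if (done == budget) break;  // stop iterating, nothing more
        item *= 2;
        ++done;
    }
    progress = done;  // always reached
    return done;
}

// Buggy behavior: the loop body used to be a lambda in which `return`
// meant "stop iterating"; after inlining the lambda, the same `return`
// exits the whole function and skips the bookkeeping, leaving
// `progress` stale: exactly the failure mode described above.
std::size_t update_buggy(std::vector<int>& items, std::size_t budget,
                         std::size_t& progress) {
    std::size_t done = 0;
    for (int& item : items) {
        if (done == budget) return done;  // BUG: skips the line below
        item *= 2;
        ++done;
    }
    progress = done;  // never reached when "preempted" mid-loop
    return done;
}
```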
Just reporting that we've been running on the latest ScyllaDB version for a few weeks now, and the issue hasn't occurred since we updated. Great work!
Thanks for the confirmation, @rngcntr! Very useful input.
A rather minimal reproducer for scylladb#16759. Not extensive.
Installation details
Scylla version: 5.4.1
Cluster size: 1 Node
Platform: Docker on Kubernetes
After upgrading Scylla from 5.2.11 via 5.4.0 to 5.4.1, we started observing missing data in Scylla. Every once in a while, an `INSERT` is committed successfully, but the inserted values are not visible until Scylla is restarted. As far as we can tell, version 5.2.11 was not affected while 5.4.1 is; 5.4.0 was not in operation long enough to make a reliable statement.

The suspected bug occurs extremely rarely, making it hard for us to reproduce. Out of ~800M inserts per day, only a dozen are affected. In our running environment, I was able to perform the following steps:

- Run an `INSERT` workload using multiple concurrent clients.
- Observe that a `SELECT` on a previously `INSERT`ed key returns no columns: `SELECT * FROM tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore WHERE key = <affected_key>`. The result is empty.
- Restart Scylla, after which the previously `INSERT`ed data suddenly became available.

(`X` obviously masks potentially private hexadecimal data, which is not relevant to this report.)

What I already tried instead of restarting Scylla:

- `nodetool refresh`
- `nodetool flush`
- `nodetool rebuild` and `nodetool repair`, even though I'm aware both shouldn't make any difference on a single-node cluster.

It looks to me like Scylla has somehow recognized the transaction as completed even though it has not been persisted as expected. There seems to be some procedure, triggered either by the shutdown or by the startup, that picks this transaction up and replays it so that its modifications actually become available.
Appendix: The Scylla logs captured during this operation. I removed all compaction logs to reduce the file to a reasonable size.