Inserted data only becomes available after restart #16759
While it might take some time for data to be replicated to other nodes, it should be immediately visible on all replicas which acknowledged it. In particular, since this is a single-node cluster, all INSERTs which succeeded should be visible immediately.
It shouldn't be possible for that to happen. If the write eventually became visible, then it must have been there the whole time, since it was acknowledged -- either in the memtables (in-memory trees with recent writes), or the sstables (on-disk files with older writes). That's why my first suspicion is a bug in the read path. Since the write became visible after restart, it shouldn't be a bug in sstable readers – a restart doesn't change the readers and doesn't change the sstables. And since memtable readers do almost the same thing during SELECTs (which didn't work) and during flush to disk (which did work), it also shouldn't be a bug in memtable readers. Hence, it sounds like a bug in the row cache (a layer between the SELECTs and the sstables, which holds merged results of recent reads and writes). If you are again able to reproduce a state where an acknowledged write isn't visible, please run
Also, try doing the SELECT with BYPASS CACHE. If it's a cache bug, this SELECT should succeed.
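For example, using the affected table from this report (`BYPASS CACHE` is standard ScyllaDB CQL; it reads straight from memtables and sstables, skipping the row cache):

```cql
-- If the row cache holds a spurious empty entry, this query should
-- still return the inserted row, unlike the plain SELECT.
SELECT * FROM tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore
WHERE key = <affected_key>
BYPASS CACHE;
```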
Well, replay procedures do exist in Scylla. When something is written, it's first written to the commitlog (which is persisted to disk either before the write is acknowledged or soon after, depending on the configuration), and then it's written to memtables. If Scylla doesn't shut down cleanly, then the commitlog is replayed on startup to recover the lost memtables. But that only happens after an unclean shutdown, and it's logged. Your logs don't mention commitlog replay, so commitlog should be irrelevant to this case. There's also batchlog replay, but that would only be relevant if you were doing BATCH inserts and they failed.
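Since this deployment runs on Kubernetes (see the installation details below), one quick way to double-check is to search the node's logs for commitlog activity around the restart, for example:

```sh
# <scylla-pod> is a placeholder for the actual pod name. The exact
# wording of replay messages varies by version, but they mention the
# commitlog, so a case-insensitive grep should surface them.
kubectl logs <scylla-pod> | grep -i commitlog
```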
Thanks for the great analysis; it looks like your suspicion is spot-on as far as my investigations go. The bug occurred again today, and just as you predicted, the results suggest the culprit is somewhere in the cache:
At the same time,
That clearly shows the data being available even though the cache says otherwise. To confirm that, here is the output of
Does that give you any further insights? Having …
From your outputs, it seems that there is an empty entry in the cache, which is hiding the actual data. That's bad, because the cache is supposed to always be consistent with the sstables. So if you could provide more info about the circumstances of the bug, it could be a big help. Please share anything that could be helpful in reproducing the problem, in particular:
I tried to gather as much information as possible to answer your questions by logging queries in Scylla. To do so, I set
There were actually a few more queries like these, but since the only difference was the affected table, I don't think they matter here. I'm wondering why I don't see any …

Regarding your questions:
The "column1" and "value" columns resemble the Thrift-compatibility table layout. Are you using the Thrift API to create/access this table?
Actually, I was going to ask a related question, but I forgot about it:
@tgrabiec I'm not accessing the table directly, but via JanusGraph, which uses Scylla as a storage backend. @michoecho It is definitely not a …
Quick update regarding the …
Just in case this helps you narrow down the search space: repairing the cache works both when using …
For what it's worth, we have this code: lines 2009 to 2048 in 0564320.
This contains at least one path that can cause exactly that: if this exception is thrown sporadically, or if there was a truncate at some point, it can explain this.
How long does it take to reproduce? We could try to bisect it. If you can prepare a standalone reproducer, we could run it; otherwise, we could provide you with instructions.
@avikivity I mentioned in the opening post that the issue only occurs at scale. Even then, we have seen spans of multiple days without the issue appearing even once. Since the documentation says that downgrading is not possible, I guess that implies we would have to set up a fresh environment in order to bisect versions. Sadly, we are already struggling with compute resources, so replicating the entire environment is barely an option for us. Edit: If it turns out in-place downgrades between versions 5.4.1 and 5.2.11 are indeed safe, I'd be willing to bisect in our prod env.
Since it's a single-node cluster, I assumed it's not too large. What's the dataset size? Downgrading is not possible. Even if it were possible, it's risky, as you'd test intermediate versions with (even more) bugs. The most reasonable approach is to snapshot the data and try it out in bisects on nodes created for the purpose.
Ah, since it's single-node, it's possible to downgrade by editing the system.local table (removing features introduced in 5.4).
@michoecho we can try to run the randomized cache tests with a much higher iteration count; maybe it will catch something. Unlikely, but it's a low-effort check.
The dataset size is about 2TB. To replicate the setup, I'd have to clone the entire environment, which includes all the clients that generate the workload and make the requests. That's what's beyond our capabilities.
But, as I read your earlier comment, you would still not recommend it because of possibly buggy intermediate versions.
Do these tests use concurrent clients to issue queries? High concurrency and limited resources could be key to reproducing the issue. Most of the times I noticed the bug, Scylla was operating at full load, using 100% of its 8 CPUs. However, I'm not overly confident this is a necessary condition. One could still argue that the more queries there are in a given time frame, the more likely it is to find one that fails.
What about replicating at 1/8 scale? To just 1 vcpu (and with 1/8 of the storage, memory, and workload)?
Correct. We could do some work to look for fixes beyond a particular commit hash, but it's hard to be sure.
At least some do. I have low confidence that we'll see the same failures though.
This doesn't mean much. With a moderate write load it will use spare CPU to compact. You can see that in the advanced dashboard under CPU time, compaction.
Right. Yet another option: cook up a version that verifies every read with the equivalent of BYPASS CACHE, and then dumps core on a discrepancy. This requires:
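A rough sketch of what such a verifying read wrapper could look like (hypothetical names and types, not Scylla's actual read path):

```cpp
#include <cstdlib>
#include <functional>
#include <optional>
#include <string>

// Hypothetical stand-ins for the two read paths: one through the row
// cache, one bypassing it (the equivalent of BYPASS CACHE).
using row = std::optional<std::string>;
using read_fn = std::function<row(const std::string&)>;

// Serve every read twice and dump core on a discrepancy, so the
// inconsistent cache state can be inspected post-mortem.
row read_verified(const std::string& key,
                  const read_fn& read_cached,
                  const read_fn& read_bypassing_cache) {
    row cached = read_cached(key);
    row direct = read_bypassing_cache(key);
    if (cached != direct) {
        std::abort();  // discrepancy: cache disagrees with sstables
    }
    return cached;
}
```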
How would I even do that? The first thing I could find in the docs is that it's impossible to disable the cache. |
It is actually possible to disable the cache per node for all requests via a config param. In scylla.yaml that's …
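For reference, a minimal scylla.yaml sketch; the parameter name `enable_cache` is my assumption for the option meant here, since the original code span was lost:

```yaml
# Assumption: the per-node switch referred to above is enable_cache.
# Setting it to false disables the row cache for all requests on this node.
enable_cache: false
```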
@michoecho - could this be solved by 938b993?
I don't think so, although I have no proof.
There's …
… preempted

Commit e81fc1f accidentally broke the control flow of row_cache::do_update(). Before that commit, the body of the loop was wrapped in a lambda. Thus, to break out of the loop, `return` was used. The bad commit removed the lambda, but didn't update the `return` accordingly. Thus, since the commit, the statement doesn't just break out of the loop as intended, but also skips the code after the loop, which updates `_prev_snapshot_pos` to reflect the work done by the loop.

As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted, `do_update()` fails to update `_prev_snapshot_pos`. It remains in a stale state until `do_update()` runs again and either finishes or is preempted outside of `updater`. If we read a partition processed by `do_update()` but not covered by `_prev_snapshot_pos`, we will read stale data (from the previous snapshot), which will be remembered in the cache as the current data. This results in outdated data being returned by the replica. (And perhaps in something worse if range tombstones are involved; I didn't investigate this possibility in depth.)

Note: for queries with CL>1, occurrences of this bug are likely to be hidden by reconciliation, because the reconciled query will only see stale data if the queried partition is affected by the bug on *all* queried replicas at the time of the query.

Fixes scylladb#16759
Closes scylladb#17138
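To make the control-flow mistake concrete, here is a toy sketch of the `break`-versus-`return` difference the commit message describes (illustrative code only, not the actual `row_cache::do_update()`):

```cpp
#include <cstddef>
#include <vector>

// Intended behavior: when the work budget runs out (standing in for
// preemption), stop the loop but still run the post-loop bookkeeping
// (the analogue of updating _prev_snapshot_pos).
std::size_t update_intended(std::vector<int>& items, std::size_t budget,
                            std::size_t& progress) {
    std::size_t done = 0;
    for (int& item : items) {
        if (done == budget) break;  // stop iterating, nothing more
        item *= 2;
        ++done;
    }
    progress = done;  // always reached
    return done;
}

// Buggy behavior: the loop body used to be a lambda in which `return`
// meant "stop iterating"; after inlining the lambda, the same `return`
// exits the whole function and skips the bookkeeping, leaving
// `progress` stale: exactly the failure mode described above.
std::size_t update_buggy(std::vector<int>& items, std::size_t budget,
                         std::size_t& progress) {
    std::size_t done = 0;
    for (int& item : items) {
        if (done == budget) return done;  // BUG: skips the line below
        item *= 2;
        ++done;
    }
    progress = done;  // never reached when "preempted" mid-loop
    return done;
}
```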
Just reporting that we've been running on the latest ScyllaDB version for a few weeks now, and the issue hasn't occurred since we updated. Great work!
Thanks for the confirmation, @rngcntr! Very useful input.
A rather minimal reproducer for scylladb#16759. Not extensive.
Installation details
Scylla version: 5.4.1
Cluster size: 1 Node
Platform: Docker on Kubernetes
After upgrading Scylla from 5.2.11 via 5.4.0 to 5.4.1, we started observing missing data in Scylla. Every once in a while, an `INSERT` is committed successfully, but the inserted values are not visible until Scylla is restarted. As far as we can tell, version 5.2.11 was not affected while 5.4.1 is; 5.4.0 was not in operation long enough to make a reliable statement.

The suspected bug occurs extremely rarely, making it hard for us to reproduce. Out of ~800M inserts per day, only a dozen are affected. In our running environment, I was able to perform the following steps:

- Run an `INSERT` workload using multiple concurrent clients.
- Observe that a `SELECT` on a previously `INSERT`ed key returns no columns: `SELECT * FROM tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore WHERE key = <affected_key>`. The result is empty.
- Restart Scylla, after which the previously `INSERT`ed data suddenly became available.

(`X` obviously masks potentially private hexadecimal data, which is not relevant to this report.)

What I already tried instead of restarting Scylla:

- `nodetool refresh`
- `nodetool flush`
- `nodetool rebuild` and `nodetool repair`, even though I'm aware both shouldn't make any difference on a single-node cluster.

It looks to me like Scylla has somehow recognized the transaction as completed even though it has not been persisted as expected. There seems to be some procedure, triggered either by the shutdown or by the startup, that picks this transaction up and replays it so that its modifications actually become available.
Appendix: The Scylla logs captured during this operation. I removed all compaction logs to reduce the file to a reasonable size.