
Inserted data only becomes available after restart #16759

Closed
rngcntr opened this issue Jan 12, 2024 · 35 comments
Assignees: tgrabiec
Labels: P1 Urgent, status/release blocker (Preventing from a release to be promoted), symptom/data consistency

Comments

@rngcntr commented Jan 12, 2024

Installation details
Scylla version: 5.4.1
Cluster size: 1 Node
Platform: Docker on Kubernetes

After upgrading Scylla from 5.2.11 via 5.4.0 to 5.4.1, we started observing missing data in Scylla. Every once in a while, an INSERT is committed successfully, but the inserted values are not visible until Scylla is restarted. As far as we can tell, version 5.2.11 was not affected while 5.4.1 is. 5.4.0 was not in operation long enough to make a reliable statement.

The suspected bug occurs extremely rarely, making it hard for us to reproduce. Out of ~800M inserts per day, only a dozen are affected. In our running environment, I was able to perform the following steps:

  1. Execute the normal INSERT workload using multiple concurrent clients.
  2. Wait until a client notices that a SELECT on a previously INSERTed key returns no columns.
  3. Shut down all clients to make sure there is no unnecessary load on our ScyllaDB.
  4. (Optional) Wait as long as you want for ScyllaDB to (process backlog/perform compactions/achieve consistency/whatever...)
  5. Execute query: SELECT * FROM tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore WHERE key = <affected_key>. The result is empty.
  6. Restart ScyllaDB and wait until it is operational.
  7. Execute the same query again and notice the INSERTed data suddenly became available:
key column1 value
<affected_key> 0xXX 0xXXXXXXXXXXXXXXXX
<affected_key> 0xXX 0xXXXXXXXXXXXXXXXXXXXX
<affected_key> 0xXXXX 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
<affected_key> 0xXXXX 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
<affected_key> 0xXXXX 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
<affected_key> 0xXXXX 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
<affected_key> 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

X obviously masks potentially private hexadecimal data which is not relevant to this report.

What I already tried instead of restarting Scylla:

  • nodetool refresh
  • nodetool flush
  • nodetool rebuild and nodetool repair, even though I'm aware both shouldn't make any difference on a single-node cluster.

It looks to me like Scylla has somehow recognized the transaction as completed even though it has not been persisted as expected. There seems to be some procedure, triggered either by the shutdown or by the startup, that picks this transaction up and replays it so that its modifications actually become available.

Appendix: The Scylla logs captured during this operation. I removed all compaction logs to reduce the file to a reasonable size.

@michoecho (Contributor)

(optional) wait as long as you want for ScyllaDB to (process backlog/perform compactions/achieve consistency/whatever...)

While it might take some time for data to be replicated to other nodes, it should be immediately visible on all replicas which acknowledged it. In particular, since this is a single-node cluster, all INSERTs which succeeded should be visible immediately.

It looks to me like Scylla has somehow recognized the transaction as completed even though it has not been persisted as expected.

It shouldn't be possible for that to happen. If the write eventually became visible, then it must have been there the whole time, since it was acknowledged -- either in the memtables (in-memory trees with recent writes), or the sstables (on-disk files with older writes).

That's why my first suspicion is a bug in the read path. Since the write became visible after restart, it shouldn't be a bug in sstable readers – a restart doesn't change the readers and doesn't change the sstables. And since memtable readers do almost the same thing during SELECTs (which didn't work) and during flush to disk (which did work), it also shouldn't be a bug in memtable readers.

Hence, it sounds like a bug in the row cache (a layer between the SELECTs and the sstables, which holds merged results of recent reads and writes).

If you are again able to reproduce a state where an acknowledged write isn't visible, please run SELECT * FROM MUTATION_FRAGMENTS(tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore) WHERE key = <affected_key>.

SELECT * FROM MUTATION_FRAGMENTS(...) is a debug statement which queries all sources of data (i.e. sstables, memtables, and the cache). That should tell us where the write is stored, and what's hiding it.

Also, try doing the SELECT with BYPASS CACHE. If it's a cache bug, this select should succeed.
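
For reference, the two diagnostic statements spelled out against the table from this report (keeping the placeholders used above):

SELECT * FROM MUTATION_FRAGMENTS(tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore)
    WHERE key = <affected_key>;                 -- shows what sstables, memtables, and the row cache each hold

SELECT * FROM tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore
    WHERE key = <affected_key> BYPASS CACHE;    -- reads from sstables/memtables only, skipping the row cache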

There seems to be some procedure, triggered either by the shutdown or by the startup, that picks this transaction up and replays it so that its modifications actually become available.

Well, replay procedures do exist in Scylla. When something is written, it's first written to the commitlog (which is persisted to disk either before the write is acknowledged or soon after, depending on the configuration), and then it's written to memtables. If Scylla doesn't shut down cleanly, then the commitlog is replayed on startup to recover the lost memtables.

But that only happens after an unclean shutdown, and it's logged. Your logs don't mention commitlog replay, so commitlog should be irrelevant to this case.

There's also batchlog replay, but that would only be relevant if you were doing BATCH inserts and they failed.

@rngcntr (Author) commented Jan 15, 2024

Thanks for the great analysis; it looks like your suspicion is spot on as far as my investigations go. The bug occurred again today, and just as you predicted, the results suggest the culprit is somewhere in the cache:

SELECT * FROM tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore WHERE key = <affected_key>; yields an empty response.

At the same time, SELECT * FROM tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore WHERE key = <affected_key> BYPASS CACHE; shows the expected result:

key column1 value
<affected_key> 0xXX 0xXXXXXXXXXXXXXXXX
<affected_key> 0xXX 0xXXXXXXXXXXXXXXXXXXXX
<affected_key> 0xXXXXXX 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
<affected_key> 0xXXXX 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
<affected_key> 0xXXXX 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
<affected_key> 0xXXXX 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
<affected_key> 0xXXXX 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
<affected_key> 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

That clearly shows the data being available even though the cache says otherwise. To confirm that, here is the output of SELECT * FROM MUTATION_FRAGMENTS(tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore) WHERE key = <affected_key>:

key mutation_source partition_region column1 position_weight metadata mutation_fragment_kind value
<affected_key> row-cache 0 null {"tombstone":{}} partition start null
<affected_key> row-cache 3 null null partition end null
<affected_key> sstable:/var/lib/scylla/data/tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX/edgestore-a8a600f010c411ee90302aa2f5097436/me-3gcs_0sim_408wg27cloiofdn8bl-big-Data.db 0 null {"tombstone":{}} partition start null
<affected_key> sstable:/var/lib/scylla/data/tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX/edgestore-a8a600f010c411ee90302aa2f5097436/me-3gcs_0sim_408wg27cloiofdn8bl-big-Data.db 2 0xXX 0 {"marker":{"timestamp":1705313742651345} "columns":{"value":{"is_live":true,"type":"regular","timestamp":1705313742651345}}} clustering row
<affected_key> sstable:/var/lib/scylla/data/tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX/edgestore-a8a600f010c411ee90302aa2f5097436/me-3gcs_0sim_408wg27cloiofdn8bl-big-Data.db 2 0xXX 0 {"marker":{"timestamp":1705313742651345} "columns":{"value":{"is_live":true,"type":"regular","timestamp":1705313742651345}}} clustering row
<affected_key> sstable:/var/lib/scylla/data/tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX/edgestore-a8a600f010c411ee90302aa2f5097436/me-3gcs_0sim_408wg27cloiofdn8bl-big-Data.db 2 0xXXXXXX 0 {"marker":{"timestamp":1705313742651345} "columns":{"value":{"is_live":true,"type":"regular","timestamp":1705313742651345}}} clustering row
<affected_key> sstable:/var/lib/scylla/data/tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX/edgestore-a8a600f010c411ee90302aa2f5097436/me-3gcs_0sim_408wg27cloiofdn8bl-big-Data.db 2 0xXXXX 0 {"marker":{"timestamp":1705313742651345} "columns":{"value":{"is_live":true,"type":"regular","timestamp":1705313742651345}}} clustering row
<affected_key> sstable:/var/lib/scylla/data/tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX/edgestore-a8a600f010c411ee90302aa2f5097436/me-3gcs_0sim_408wg27cloiofdn8bl-big-Data.db 2 0xXXXX 0 {"marker":{"timestamp":1705313742651345} "columns":{"value":{"is_live":true,"type":"regular","timestamp":1705313742651345}}} clustering row
<affected_key> sstable:/var/lib/scylla/data/tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX/edgestore-a8a600f010c411ee90302aa2f5097436/me-3gcs_0sim_408wg27cloiofdn8bl-big-Data.db 2 0xXXXX 0 {"marker":{"timestamp":1705313742651345} "columns":{"value":{"is_live":true,"type":"regular","timestamp":1705313742651345}}} clustering row
<affected_key> sstable:/var/lib/scylla/data/tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX/edgestore-a8a600f010c411ee90302aa2f5097436/me-3gcs_0sim_408wg27cloiofdn8bl-big-Data.db 2 0xXXXX 0 {"marker":{"timestamp":1705313742651345} "columns":{"value":{"is_live":true,"type":"regular","timestamp":1705313742651345}}} clustering row
<affected_key> sstable:/var/lib/scylla/data/tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX/edgestore-a8a600f010c411ee90302aa2f5097436/me-3gcs_0sim_408wg27cloiofdn8bl-big-Data.db 2 0xXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0 {"marker":{"timestamp":1705313742651345} "columns":{"value":{"is_live":true,"type":"regular","timestamp":1705313742651345}}} clustering row
<affected_key> sstable:/var/lib/scylla/data/tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX/edgestore-a8a600f010c411ee90302aa2f5097436/me-3gcs_0sim_408wg27cloiofdn8bl-big-Data.db 3 null null partition end null

Does that give you any further insights? Having null in the cache looks suspicious to me because if I query an unaffected key, there's either no row-cache mutation source at all or it contains the same data as the sstables.

@tgrabiec self-assigned this Jan 15, 2024
@mykaul added the P1 Urgent and status/release blocker labels Jan 16, 2024
@michoecho (Contributor)

Having null in the cache looks suspicious to me because if I query an unaffected key, there's either no row-cache mutation source at all or it contains the same data as the sstables.

From your outputs, it seems that there is an empty entry in the cache, which is hiding the actual data. That's bad, because the cache is supposed to always be consistent with the sstables.
Unfortunately, it's hard to know what went wrong, because we only see the end result. And the cache code changed a lot between 5.2 and 5.4, too much to guess which change was bad, especially since the row cache is the most complicated part of the program.

So if you could provide more info about the circumstances of the bug, it could be a big help.

Please share anything that could be helpful in reproducing the problem, in particular:

  1. What kind of deletes are run on the table?
  • Partition deletes? (WHERE key = ?)
  • Row deletes? (WHERE key = ? AND column1 = ?)
  • Row range deletes? For what kind of ranges? (E.g. WHERE key = ? AND column1 < ?, or WHERE key = ? AND column1 >= ? AND column1 <= ?)
  2. What kind of selects are run on the table?
  • Partition selects?
  • Row selects?
  • Row range selects? For what kind of ranges?
  3. Did you set a custom gc_grace_seconds or tombstone_gc when creating the table?
  4. Are you setting any custom TTL or manually setting timestamps in the affected table? (E.g. INSERT ... USING TTL x or INSERT ... USING TIMESTAMP x)
  5. Do you happen to know if there had been any prior deletions (or TTL expirations) in the affected partition?
  6. Do you happen to know how much time passed between the relevant insert and the bad select?
  7. Is something uncommon happening around the same time as the bad inserts/selects (e.g. ALTER TABLE)?

@rngcntr (Author) commented Jan 17, 2024

I tried to gather as much information as possible to answer your questions by logging queries in Scylla. To do so, I set nodetool setlogginglevel query_processor trace for a while. Here are all the queries I found in the logs:

INSERT INTO tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore (key,column1,value) VALUES (:key,:column1,:value) USING TIMESTAMP :timestamp
DELETE FROM tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore USING TIMESTAMP :timestamp WHERE key=:key AND column1=:column1

There were actually a few more queries like these but since the only difference was the affected table, I don't think they matter here.

I'm wondering why I don't see any SELECT queries. Are these not handled by the query_processor so I would have to configure a different log level for those?

Regarding your questions:

  1. As seen above
  2. As I said earlier, I wasn't able to log SELECT queries in Scylla yet. But the example SELECT I made in the opening post is affected by the problem. There could however be other SELECTs going on that influence the cache. At the moment, I can't tell.
  3. a) On most tables, gc_grace_seconds is not set and therefore uses the default value of 864000. In one table, however, we modified this setting a few months ago but reverted it to the default value soon after. Both tables are affected by the issue, so that shouldn't make a difference. b) tombstone_gc was never touched.
  4. USING TTL: no; USING TIMESTAMP: yes.
  5. Leveraging some domain knowledge, my educated guess would be that it's rather unlikely for any kind of deletion to have occurred on the affected partition. In all cases I have noticed so far, the affected partition should have been created just moments before the issue appeared.
  6. Almost answered by 5. There are probably only a few milliseconds between the INSERT and the failed SELECT. Could be an indicator that the cache already gets corrupted by the INSERT, since I have never observed a situation where a prior SELECT has succeeded before another one fails.
  7. Not to my knowledge. There's no table management going on and the schema stays the same all the time. The only thing Scylla is doing in the background are compactions.

@tgrabiec (Contributor)

The "column1" and "value" resemble thrift-compatibility table layout. Are you using thrift API to create/access this table?

@michoecho (Contributor)

The "column1" and "value" resemble thrift-compatibility table layout. Are you using thrift API to create/access this table?

Actually, I was going to ask a related question, but I forgot about it:
Is the $ value for column1 something special? A null (not just an empty string, but null) key, perhaps?

@rngcntr (Author) commented Jan 17, 2024

@tgrabiec I'm not accessing the table directly but via JanusGraph, which uses Scylla as a storage backend. janusgraph-cql communicates with Scylla via the DataStax Java Driver. Thrift was abandoned by JanusGraph a while ago, so I don't think the driver still uses it.

@michoecho It is definitely not a null, that looks different. null is represented in this table by an empty cell. I'd assume $ has some semantic meaning to JanusGraph but Scylla should be agnostic to that. I can try to investigate this further in case it is relevant to you.
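
For context, JanusGraph's CQL backend stores everything in generic key/column1/value tables, so the layout of edgestore is roughly the following (an illustrative sketch inferred from the columns shown above, not the exact DDL of our deployment):

CREATE TABLE tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore (
    key blob,        -- JanusGraph row key (partition key)
    column1 blob,    -- JanusGraph column name (clustering key)
    value blob,      -- cell value
    PRIMARY KEY (key, column1)
);

Depending on the JanusGraph version and configuration, such tables may also be created WITH COMPACT STORAGE, which would explain why the layout resembles the old Thrift-era schema.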

@rngcntr (Author) commented Jan 17, 2024

Quick update regarding the $: That's just a column with value 0x24 and my IDE interpreted that as an ASCII code. I will update the tables in my earlier comments accordingly

@rngcntr (Author) commented Jan 19, 2024

Just in case this helps you narrow down the search space:
Today I thought it could be worth trying to INSERT the same data again and check if that repairs the row cache. It does. So there's at least nothing inherently wrong with the inserted (key,column1,value) triplets and the cache does not permanently refuse to store that data.

Repairing the cache works both when using INSERT [...] or INSERT [...] USING TIMESTAMP [...].
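
In other words, re-inserting a value in either of these forms (sketched with placeholders) makes the partition readable through the cache again:

INSERT INTO tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore (key, column1, value)
    VALUES (<affected_key>, <column1>, <value>);

INSERT INTO tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore (key, column1, value)
    VALUES (<affected_key>, <column1>, <value>) USING TIMESTAMP <timestamp>;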

@eliransin (Contributor)

For what it's worth we have this code:

scylladb/replica/database.cc

Lines 2009 to 2048 in 0564320

    db::rp_handle h;
    auto cl = cf.commitlog();
    if (cl != nullptr && cf.durable_writes()) {
        std::exception_ptr ex;
        try {
            commitlog_entry_writer cew(s, m, sync);
            auto f_h = co_await coroutine::as_future(cf.commitlog()->add_entry(uuid, cew, timeout));
            if (!f_h.failed()) {
                h = f_h.get();
            } else {
                ex = f_h.get_exception();
            }
        } catch (...) {
            ex = std::current_exception();
        }
        if (ex) {
            if (is_timeout_exception(ex)) {
                ++_stats->total_writes_timedout;
                ex = wrap_commitlog_add_error<wrapped_timed_out_error>(cf.schema(), m, std::move(ex));
            } else {
                ex = wrap_commitlog_add_error<>(s, m, std::move(ex));
            }
            co_await coroutine::exception(std::move(ex));
        }
    }
    auto f = co_await coroutine::as_future(this->apply_in_memory(m, s, std::move(h), timeout));
    if (f.failed()) {
        auto ex = f.get_exception();
        if (try_catch<mutation_reordered_with_truncate_exception>(ex)) {
            // This mutation raced with a truncate, so we can just drop it.
            dblog.debug("replay_position reordering detected");
            co_return;
        } else if (is_timeout_exception(ex)) {
            ++_stats->total_writes_timedout;
        }
        co_await coroutine::return_exception_ptr(std::move(ex));
    }
    // Success, prevent incrementing failure counter
    update_writes_failed.cancel();
}

This contains at least one path that can cause exactly that:

  1. If we get a mutation_reordered_with_truncate_exception during apply_in_memory, the mutation is dropped without being applied.

If this exception is thrown sporadically, or if there was a truncate at some point, it could explain this.

@avikivity (Member)

How long does it take to reproduce? We could try to bisect it.

If you can prepare a standalone reproducer, we could run it, otherwise we could provide you with instructions.

@rngcntr (Author) commented Jan 24, 2024

@avikivity I mentioned in the opening post that the issue only occurs at scale. Even then, we have seen spans of multiple days without the issue appearing even once. Since the documentation says that downgrading is not possible, I guess that implies we would have to set up a fresh environment in order to bisect versions. Sadly, we are already struggling with compute resources, so replicating the entire environment is barely an option for us.

Edit: If it turns out in-place downgrades between versions 5.4.1 and 5.2.11 are indeed safe, I'd be willing to bisect in our prod env.

@avikivity (Member)

Since it's a single-node cluster I assumed it's not too large. What's the dataset size?

Downgrading is not possible. Even if it were possible, it's risky as you'd test intermediate versions with (even more) bugs. The most reasonable approach is to snapshot the data and try it out in bisects on nodes created for the purpose.

@avikivity (Member)

Ah, since it's single-node it's possible to downgrade by editing the system.local table (removing features introduced in 5.4).

@avikivity (Member)

@michoecho we can try to run the randomized cache tests with a much higher iteration count; maybe it will catch something. Unlikely, but it's a low-effort check.

@rngcntr (Author) commented Jan 25, 2024

Since it's a single-node cluster I assumed it's not too large. What's the dataset size?

The dataset size is about 2TB. To replicate the setup, I'd have to clone the entire environment which includes all the clients that generate the workload and make the requests. That's what's beyond our capabilities.

Ah, since it's single-node it's possible to downgrade by editing the system.local table (removing features introduced in 5.4).

But you would still not recommend it because of possibly buggy intermediate versions, if I read your earlier comment correctly.

we can try to run the randomized cache tests with a much higher iteration count

Do these tests use concurrent clients to issue queries? High concurrency and limitation of resources could be key to reproduce the issue. Most of the times I noticed the bug, Scylla was operating at full load, using 100% of its 8 CPUs. However I'm not overly confident this is a necessary condition. One could still argue that the more queries there are in a given time frame, the more likely it is to find one that fails.

@avikivity (Member)

Since it's a single-node cluster I assumed it's not too large. What's the dataset size?

The dataset size is about 2TB. To replicate the setup, I'd have to clone the entire environment which includes all the clients that generate the workload and make the requests. That's what's beyond our capabilities.

What about replicating at 1/8 scale? To just 1 vcpu (and with 1/8 of the storage, memory, and workload)?

Ah, since it's single-node it's possible to downgrade by editing the system.local table (removing features introduced in 5.4).

But you would still not recommend it because of possibly buggy intermediate versions, if I read your earlier comment correctly.

Correct. We could do some work to look for fixes beyond a particular commit hash, but it's hard to be sure.

we can try to run the randomized cache tests with a much higher iteration count

Do these tests use concurrent clients to issue queries?

At least some do. I have low confidence that we'll see the same failures though.

High concurrency and limitation of resources could be key to reproduce the issue. Most of the times I noticed the bug, Scylla was operating at full load, using 100% of its 8 CPUs.

This doesn't mean much. With a moderate write load it will use spare CPU to compact. You can see that in the advanced dashboard under CPU time, compaction.

However I'm not overly confident this is a necessary condition. One could still argue that the more queries there are in a given time frame, the more likely it is to find one that fails.

Right.

Yet another option: cook up a version that verifies every read with the equivalent of BYPASS CACHE, and then dumps core on a discrepancy.

This requires:

  • your system can sustain the query rate with BYPASS CACHE (can be tested by disabling the cache temporarily; and perhaps you can enlarge it temporarily)
  • you're okay with a crash and some downtime when the bug is detected; no risk to data
  • you're okay with giving us the core dump (we can sign an NDA) to analyze

@rngcntr (Author) commented Jan 25, 2024

What about replicating at 1/8 scale? To just 1 vcpu (and with 1/8 of the storage, memory, and workload)?

That sounds more reasonable, but since we currently expect the issue to come up just by chance, wouldn't that be a huge disadvantage for bisecting? At the current throughput, it would take at least two, preferably three, days without an error to state with reasonable confidence that a version is not affected by the bug. Scaled down to 1/8 of the throughput, that would give us something in the range of ~3 weeks for one iteration of git bisect. And there are around 2^12 commits to be bisected. Not to mention that reducing the load could also make the error less likely due to reduced concurrency.

This doesn't mean much. With a moderate write load it will use spare CPU to compact. You can see that in the advanced dashboard under CPU time, compaction.

The panel you mentioned shows between 5% and 25% of time spent on compaction. I also built a panel that shows the load distribution between groups:
(screenshot)

Yet another option: cook up a version that verifies every read with the equivalent of BYPASS CACHE, and then dumps core on a discrepancy.

I can try this tomorrow, but I expect a significant decrease in throughput, given that Scylla already utilizes the disk quite a bit:
(screenshot)

@rngcntr (Author) commented Jan 26, 2024

can be tested by disabling the cache temporarily

How would I even do that? The first thing I could find in the docs is that it's impossible to disable the cache.

@tgrabiec (Contributor)

can be tested by disabling the cache temporarily

How would I even do that? The first thing I could find in the docs is that it's impossible to disable the cache.

It is actually possible to disable the cache per node for all requests via a config parameter. In scylla.yaml, that's enable_cache: false. It requires a node restart to take effect.

@mykaul (Contributor) commented Jan 28, 2024

@michoecho - could this be solved by 938b993 ?

@michoecho (Contributor)

@michoecho - could this be solved by 938b993 ?

I don't think so, although I have no proof.

@avikivity (Member)

can be tested by disabling the cache temporarily

How would I even do that? The first thing I could find in the docs is that it's impossible to disable the cache.

There's the enable_cache config item, the WITH caching = {'enabled': 'false'} clause for ALTER TABLE, and BYPASS CACHE for individual queries.
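
Sketched out against the table from this thread, the latter two look like this (enable_cache itself lives in scylla.yaml and, as noted above, needs a node restart):

ALTER TABLE tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore
    WITH caching = {'enabled': 'false'};        -- per table: stop caching this table

SELECT * FROM tenant_6e10XXXX_XXXX_XXXX_XXXX_XXXXXXXXXXXX.edgestore
    WHERE key = <affected_key> BYPASS CACHE;    -- per query: skip the row cache for one read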

syuu1228 pushed a commit to syuu1228/scylla that referenced this issue Feb 19, 2024
… preempted

Commit e81fc1f accidentally broke the control
flow of row_cache::do_update().

Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.

The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.

As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.

If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.

This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).

Note: for queries with CL>1, occurrences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on *all* queried replicas
at the time of the query.

Fixes scylladb#16759

Closes scylladb#17138
@rngcntr (Author) commented Mar 14, 2024

(And of course there's always the dreadful possibility that there exist two different bugs with the same symptoms and the one I found is different than the one you found. I hope not.)

Just reporting that we've been running on the latest ScyllaDB version for a few weeks now, and the issue hasn't occurred since we updated. Great work!

@mykaul (Contributor) commented Mar 14, 2024

Thanks for the confirmation, @rngcntr! Very useful input.

juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Mar 25, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Mar 26, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Mar 26, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Mar 27, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Mar 31, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Mar 31, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Apr 2, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Apr 16, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Apr 25, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Apr 25, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Apr 25, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Apr 25, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Apr 25, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Apr 28, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Apr 28, 2024
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this issue Apr 30, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Apr 30, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue Apr 30, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue May 8, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue May 8, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue May 8, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue May 8, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue May 9, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue May 12, 2024
juliayakovlev added a commit to juliayakovlev/scylla-cluster-tests that referenced this issue May 12, 2024