
query_data_on_all_shards(): paged, stateful and multi-range reads can miss partitions #12916

Closed
denesb opened this issue Feb 17, 2023 · 18 comments · Fixed by #12932

denesb commented Feb 17, 2023

Some of the shard readers can be resumed on the next page after a fast_forward_to() to the next range in the range list has already happened, which these readers missed. Such readers will not produce partitions for that range, so those partitions will be missing from the output.
I'm not sure what is affected, if anything. Range scans only ever use a single range on the replica, so these methods were silently broken w.r.t. multiple ranges for a long time. But I do remember fixing up the known issues because there was going to be a user for it, though I don't remember what.
@nyh is alternator passing down multiple ranges in a single range read to replicas?

@denesb denesb self-assigned this Feb 17, 2023

denesb commented Feb 17, 2023

Found by multishard_mutation_query.test_read_all_multi_range, with #12574 applied.


nyh commented Feb 19, 2023

> Because some of the shard readers can be resumed on the next page, such that there was a fast_forward_to() to the next range in the range list, that they missed. These readers will not produce partitions for that range and these will be missing from the output. Not sure what is affected if anything. Range scans only ever use a single range on the replica, so these methods were silently broken w.r.t. multiple ranges for a long time. But I do remember fixing up the known issues because there was going to be a user for it, but don't remember what. @nyh is alternator passing down multiple ranges in a single range read to replicas?

In #9167 @havaker was going to read from multiple partition ranges in, I think, distributed counts (but @havaker please correct me here).

Soon afterwards, I wanted to do such reads in the new Alternator TTL feature, but then discovered it was broken and opened another issue, #9659. Not wanting to wait for its fix (TTL was considered urgent...), I modified the Alternator TTL code to read each partition range separately (see mentions of #9167 in alternator/ttl.cc). The regular Alternator code (outside TTL) never needs to read more than one partition range, so it was never a problem there.

So I would check with @havaker whether distributed counting, or whatever other code he was working on, needs this use case of reads with multiple partition ranges.

@tgrabiec

Why don't we call fast_forward_to() on all readers when it happens?


denesb commented Feb 20, 2023

> Why don't we call fast_forward_to() on all readers when it happens?

The multishard reader creates its shard readers on demand: on the first fill_buffer() call that needs data from the given shard. It can happen that fast_forward_to() is called while some shards have had no data requested from them at all. Creating a reader just to fast-forward it would be wasteful, so instead we simply pass in the new range when the reader is eventually created. This works well in general, but in the case of stateful reads, creating a reader can amount to returning a saved reader, and that case needs special handling.

@DoronArazii DoronArazii added this to the 5.3 milestone Feb 20, 2023
@avikivity

Is there a user impact?


denesb commented Feb 21, 2023

I don't think so, but I'm not sure; I was hoping @havaker could confirm.
Normal CQL queries, as well as Alternator reads, don't use the multi-range aspect of multishard_mutation_query().


denesb commented Feb 21, 2023

I just checked and the RPC interface has a single range as its parameter. All direct uses also pass a single range, except the test which found the issue. So I think there is no impact. I seem to remember we wanted to use the multi-range capability because at one point I fixed all the known problems with it and even wrote a test for it. Maybe this future user never materialized.

@avikivity

The multi-range user is Thrift: it supports wraparound ranges, which we translate into a vector of two ranges.

@avikivity

So, no real impact.


nyh commented Feb 27, 2023

> I just checked and the RPC interface has a single range as its parameter. All direct uses also pass a single range, except the test which found the issue. So I think there is no impact. I seem to remember we wanted to use the multi-range capability because at one point I fixed all the known problems with it and even wrote a test for it. Maybe this future user never materialized.

Murphy's Law suggests that any API you create (like the ability to pass multiple ranges to a read) will eventually be used by someone who doesn't realize it is broken. As I noted in a comment above, both @havaker and I tried to use it at roughly the same time for different projects, and we each wasted a lot of time figuring out that it was broken and working around the problem. If we don't fix these bugs, can we at least change the API to take a single range instead of a vector? Alternatively, keep the existing API but add a big warning not to pass more than one range (maybe even an internal_error() check?), pointing to a failing test that demonstrates the bug, to deter future developers from wasting their time on this API unless they plan to fix the bug first.

P.S. @avikivity If we are aware of a bug that affects Thrift, and don't plan to fix it, maybe we should remove Thrift? #3811


denesb commented Feb 28, 2023

At one point Shlomi proposed the same: either fix the multi-range path or forbid passing multiple ranges to this method, to prevent users from tripping over its bugs. I opted for the former: I wrote dedicated tests to stress the multi-range aspect of this method and fixed all the bugs I found (there were quite a few). This is just another bug, revealed after one of those dedicated multi-range tests was further improved recently. It is not a long-known bug that no one cared to fix because no one passes multiple ranges.
Passing multiple ranges to this method is supported, and we have tests to ensure it works correctly.


havaker commented Feb 28, 2023

> > Because some of the shard readers can be resumed on the next page, such that there was a fast_forward_to() to the next range in the range list, that they missed. These readers will not produce partitions for that range and these will be missing from the output. Not sure what is affected if anything. Range scans only ever use a single range on the replica, so these methods were silently broken w.r.t. multiple ranges for a long time. But I do remember fixing up the known issues because there was going to be a user for it, but don't remember what. @nyh is alternator passing down multiple ranges in a single range read to replicas?

> In #9167 @havaker was going to read from multiple partition ranges in, I think distributed counts (but @havaker please correct me here).

It is true that distributed aggregates do reads with multiple partition ranges. My first ever pull request to Scylla was an attempt to fix that code path #9175.


havaker commented Feb 28, 2023

> Not sure what is affected if anything. Range scans only ever use a single range on the replica, so these methods were silently broken w.r.t. multiple ranges for a long time. But I do remember fixing up the known issues because there was going to be a user for it, but don't remember what.

Knowing that distributed aggregates use reads with multiple ranges, I believe something may be affected.


denesb commented Mar 1, 2023

Maybe this bug explains the distributed-count wrong-result bug we've been seeing for a while.

@kolinfluence
this sounds kind of serious, when will it be fixed and updated in the master?


denesb commented Mar 6, 2023

> this sounds kind of serious, when will it be fixed and updated in the master?

This bug can only affect a recently introduced feature, distributed aggregates. Normal queries are not affected.
The fix has been available for some time and is waiting to be merged: #12932

avikivity added a commit that referenced this issue Mar 26, 2023
…to current range' from Botond Dénes

When creating the reader, the lifecycle policy might return one that was saved on the last page and survived in the cache. This reader might have skipped some fast-forwarding ranges while sitting in the cache. To avoid using a reader reading a stale range (from the read's POV), check its read range and fast forward it if necessary.

Fixes: #12916

Closes #12932

* github.com:scylladb/scylladb:
  readers/multishard: shard_reader: fast-forward created reader to current range
  readers/multishard: reader_lifecycle_policy: add get_read_range()
  test/boost/multishard_mutation_query_test: paging: handle range becoming wrapping
@avikivity

> > Not sure what is affected if anything. Range scans only ever use a single range on the replica, so these methods were silently broken w.r.t. multiple ranges for a long time. But I do remember fixing up the known issues because there was going to be a user for it, but don't remember what.

> Knowing that distributed aggregates use reads with multiple ranges, I believe something may be affected.

Can we confirm this?

@avikivity

@havaker ping
