query_data_on_all_shards(): paged, stateful and multi-range reads can miss partitions #12916
Comments
Found by multishard_mutation_query.test_read_all_multi_range, with #12574 applied.
In #9167, @havaker was going to read from multiple partition ranges in, I think, distributed counts (but @havaker, please correct me here). Soon afterwards I wanted to do such reads in the new Alternator TTL feature, but discovered it was broken and opened another issue, #9659. Not wanting to wait for its fix (TTL was considered urgent...), I modified the Alternator TTL code to read each partition range separately (see mentions of #9167 in alternator/ttl.cc). Regular Alternator (outside the TTL code) never needs to read more than one partition range, so it was never a problem there. So I would check with @havaker whether distributed counting, or whatever other code he was working on, needs this use case of reads with multiple partition ranges.
Why don't we call fast_forward_to()?
The multishard reader creates its shard readers in an on-demand fashion: on the first fill_buffer() call that needs data from a given shard. It can happen that fast_forward_to() is called when some shards have had no data requested from them at all. Creating a reader just to fast-forward it is wasteful; instead we just pass it the new range when we create it. This works well in general, but in the case of stateful reads, creating a reader can result in just returning a saved reader, and in this case special handling is needed.
Is there a user impact?
I don't think so, but I'm not sure. I was hoping @havaker could reveal that.
I just checked, and the RPC interface has a single range as its parameter. All direct uses also pass a single range, except the test which found the issue. So I think there is no impact. I seem to remember we wanted to use the multi-range capability, because at one point I fixed all the known problems with it and even wrote a test for it. Maybe this future user never materialized.
The multi-range user is Thrift, which supports wraparound ranges, which we translate to a vector with two ranges.
So, no real impact.
Murphy's Law suggests that any API you create (like the ability to pass multiple ranges to a read), someone will try to use without realizing it is broken. As I noted in a comment above, both @havaker and I tried to use it at roughly the same time for different projects, and wasted a lot of time figuring out that it was broken and working around the problem. If we don't fix these bugs, can we at least change the API to take just one range instead of a vector? Alternatively, keep the existing API but add a big warning in front of it not to pass more than one range (maybe even an internal_error() check?), also pointing to a failing test that demonstrates the bug, to deter future developers from wasting their time on this API unless they plan to fix the bug first.

P.S. @avikivity If we are aware of a bug that affects Thrift, and don't plan to fix it, maybe we should remove Thrift? #3811
At one point Shlomi proposed the same: either fix the multi-range path or forbid passing multiple ranges to this method, to prevent users from tripping over its bugs. I opted for the former: I wrote dedicated tests to stress the multi-range aspect of this method and fixed all the bugs I found (there were quite a few). This is just another bug that was revealed after one of those dedicated multi-range tests was further improved recently. It was not a long-known bug that no one cared to fix because no one passes multiple ranges.
It is true that distributed aggregates do reads with multiple partition ranges. My first ever pull request to Scylla was an attempt to fix that code path: #9175.
Knowing that distributed aggregates use reads with multiple ranges, I believe something may be affected.
Maybe this bug can explain the bug where distributed count returns wrong results, which we've been seeing for a while.
This sounds kind of serious. When will it be fixed and updated in master?
This bug can only affect a recently introduced feature, distributed aggregates. Normal queries are not affected.
…to current range' from Botond Dénes

When creating the reader, the lifecycle policy might return one that was saved on the last page and survived in the cache. This reader might have skipped some fast-forwarding ranges while sitting in the cache. To avoid using a reader reading a stale range (from the read's POV), check its read range and fast-forward it if necessary.

Fixes: #12916
Closes #12932

* github.com:scylladb/scylladb:
  readers/multishard: shard_reader: fast-forward created reader to current range
  readers/multishard: reader_lifecycle_policy: add get_read_range()
  test/boost/multishard_mutation_query_test: paging: handle range becoming wrapping
Can we confirm this?
@havaker ping
Because some of the shard readers can be resumed on the next page, such that there was a fast_forward_to() to the next range in the range list that they missed. These readers will not produce partitions for that range, and these will be missing from the output.

Not sure what is affected, if anything. Range scans only ever use a single range on the replica, so these methods were silently broken w.r.t. multiple ranges for a long time. But I do remember fixing up the known issues because there was going to be a user for it, but don't remember what.
@nyh is alternator passing down multiple ranges in a single range read to replicas?