
query_data_on_all_shards(): paged, stateful and multi-range reads can miss partitions #12916

Closed
denesb opened this issue Feb 17, 2023 · 18 comments · Fixed by #12932

denesb commented Feb 17, 2023

Some of the shard readers can be resumed on the next page after a fast_forward_to() to the next range in the range list has already happened, which these readers missed. Such readers will not produce partitions for that range, so those partitions will be missing from the output.
I'm not sure what is affected, if anything. Range scans only ever use a single range on the replica, so these methods were silently broken w.r.t. multiple ranges for a long time. But I do remember fixing up the known issues because there was going to be a user for it, though I don't remember what.
@nyh is alternator passing down multiple ranges in a single range read to replicas?

@denesb denesb self-assigned this Feb 17, 2023

denesb commented Feb 17, 2023

Found by multishard_mutation_query.test_read_all_multi_range, with #12574 applied.


nyh commented Feb 19, 2023

> Because some of the shard readers can be resumed on the next page, such that there was a fast_forward_to() to the next range in the range list, that they missed. These readers will not produce partitions for that range and these will be missing from the output. Not sure what is affected if anything. Range scans only ever use a single range on the replica, so these methods were silently broken w.r.t. multiple ranges for a long time. But I do remember fixing up the known issues because there was going to be a user for it, but don't remember what. @nyh is alternator passing down multiple ranges in a single range read to replicas?

In #9167 @havaker was going to read from multiple partition ranges in, I think, distributed counts (but @havaker please correct me here).

Soon afterwards, I wanted to do such reads in the new Alternator TTL feature, but then discovered it was broken and opened another issue, #9659. Not wanting to wait for its fix (TTL was considered urgent...), I modified the Alternator TTL code to read each partition range separately (see mentions of #9167 in alternator/ttl.cc). The regular Alternator code (outside TTL) never needs to read more than one partition range, so it was never a problem there.

So I would check with @havaker whether distributed counting, or whatever other code he was working on, needs this use case of reads with multiple partition ranges.

@tgrabiec

Why don't we call fast_forward_to() on all readers when it happens?


denesb commented Feb 20, 2023

> Why don't we call fast_forward_to() on all readers when it happens?

The multishard reader creates its shard readers on demand: on the first fill_buffer() call that needs data from the given shard. It can happen that fast_forward_to() is called while some shards have had no data requested from them at all. Creating a reader just to fast-forward it would be wasteful, so instead we simply pass in the new range when the reader is eventually created. This works well in general, but in the case of stateful reads, creating a reader can amount to returning a saved reader, and that case needs special handling.

@DoronArazii DoronArazii added this to the 5.3 milestone Feb 20, 2023
@avikivity

Is there a user impact?


denesb commented Feb 21, 2023

I don't think so, but I'm not sure; I was hoping @havaker could confirm.
Normal CQL queries, as well as Alternator reads, don't use the multi-range aspect of multishard_mutation_query().


denesb commented Feb 21, 2023

I just checked and the RPC interface has a single range as its parameter. All direct uses also pass a single range, except the test which found the issue. So I think there is no impact. I seem to remember we wanted to use the multi-range capability because at one point I fixed all the known problems with it and even wrote a test for it. Maybe this future user never materialized.

@avikivity

The multi-range user is Thrift: it supports wraparound ranges, which we translate into a vector of two ranges.

@avikivity

So, no real impact.


nyh commented Feb 27, 2023

> I just checked and the RPC interface has a single range as its parameter. All direct uses also pass a single range, except the test which found the issue. So I think there is no impact. I seem to remember we wanted to use the multi-range capability because at one point I fixed all the known problems with it and even wrote a test for it. Maybe this future user never materialized.

Murphy's Law suggests that any API you create (like the ability to pass multiple ranges to a read) will eventually be used by someone who doesn't realize it is broken. As I noted in a comment above, both @havaker and I tried to use it at roughly the same time for different projects, and we each wasted a lot of time figuring out that it was broken and working around the problem. If we don't fix these bugs, can we at least change the API to take a single range instead of a vector? Alternatively, keep the existing API but add a big warning not to pass more than one range (maybe even an internal_error() check?), pointing to a failing test that demonstrates the bug, to deter future developers from wasting their time on this API unless they plan to fix the bug first.

P.S. @avikivity If we are aware of a bug that affects Thrift, and don't plan to fix it, maybe we should remove Thrift? #3811


denesb commented Feb 28, 2023

At one point Shlomi proposed the same: either fix the multi-range path or forbid passing multiple ranges to this method, to prevent users from tripping over its bugs. I opted for the former: I wrote dedicated tests to stress the multi-range aspect of this method and fixed all the bugs I found (there were quite a few). This is just another bug, revealed after one of those dedicated multi-range tests was further improved recently. It is not a long-known bug that no one cared to fix because no one passes multiple ranges.
Passing multiple ranges to this method is supported, and we have tests to ensure it works correctly.


havaker commented Feb 28, 2023

> > Because some of the shard readers can be resumed on the next page, such that there was a fast_forward_to() to the next range in the range list, that they missed. These readers will not produce partitions for that range and these will be missing from the output. Not sure what is affected if anything. Range scans only ever use a single range on the replica, so these methods were silently broken w.r.t. multiple ranges for a long time. But I do remember fixing up the known issues because there was going to be a user for it, but don't remember what. @nyh is alternator passing down multiple ranges in a single range read to replicas?

> In #9167 @havaker was going to read from multiple partition ranges in, I think distributed counts (but @havaker please correct me here).

It is true that distributed aggregates do reads with multiple partition ranges. My first ever pull request to Scylla was an attempt to fix that code path #9175.


havaker commented Feb 28, 2023

> Not sure what is affected if anything. Range scans only ever use a single range on the replica, so these methods were silently broken w.r.t. multiple ranges for a long time. But I do remember fixing up the known issues because there was going to be a user for it, but don't remember what.

Knowing that distributed aggregates use reads with multiple ranges, I believe something may be affected.


denesb commented Mar 1, 2023

Maybe this bug explains the distributed-count wrong-result bug we've been seeing for a while.

@kolinfluence
this sounds kind of serious, when will it be fixed and updated in the master?


denesb commented Mar 6, 2023

> this sounds kind of serious, when will it be fixed and updated in the master?

This bug can only affect a recently introduced feature, distributed aggregates. Normal queries are not affected.
The fix has been available for some time and is waiting to be merged: #12932

avikivity added a commit that referenced this issue Mar 26, 2023
…to current range' from Botond Dénes

When creating the reader, the lifecycle policy might return one that was saved on the last page and survived in the cache. This reader might have skipped some fast-forwarding ranges while sitting in the cache. To avoid using a reader reading a stale range (from the read's POV), check its read range and fast forward it if necessary.

Fixes: #12916

Closes #12932

* github.com:scylladb/scylladb:
  readers/multishard: shard_reader: fast-forward created reader to current range
  readers/multishard: reader_lifecycle_policy: add get_read_range()
  test/boost/multishard_mutation_query_test: paging: handle range becoming wrapping
@avikivity

> > Not sure what is affected if anything. Range scans only ever use a single range on the replica, so these methods were silently broken w.r.t. multiple ranges for a long time. But I do remember fixing up the known issues because there was going to be a user for it, but don't remember what.

> Knowing that distributed aggregates use reads with multiple ranges, I believe something may be affected.

Can we confirm this?

@avikivity

@havaker ping
