New need_filtering() returns false for a query that needs filtering #7708
Comments
d5a6aa4 also introduced this "regression": Before:
After:
After discussion with @slivne (and @haaawk), we came to the conclusion that the previous behaviour was the correct one (requiring the user to include /cc @dekimir |
I don't understand why the index is used to serve this query. Why not just query the base table, given that the partitions are known and the clustering range is continuous? @psarna must know the answer. :)
Why is this incorrect? The base table can be queried with one read command and no filtering. |
@psarna help ... |
We need to understand if this is a regression and from which point. Do we need to backport this to 4.3 / 4.2 ... |
Hm, that's true, this statement can run without filtering... I think the main issue here is that we always had a hidden assumption that ALLOW FILTERING never actually changes the query plan: it's only a safety measure to warn users about possible performance consequences. And here, we have an option to either use an index and filter, or just query a (potentially large) token range. This type of query is tough, since in some cases it makes more sense to query the base table, and sometimes it would be better to use an index... I don't really have any strong opinions here; both requiring and not requiring filtering make sense. If I had to choose now, I'd go with @dekimir's semantics and just query the base table, since we can, and querying indexes always induces a latency cost. Also, it would be weird for users if this query only started requiring filtering after we create an index on it, and worked just fine without indexes. In the distant future, a query optimizer could decide, based on statistics we don't have yet, whether it's more profitable to use an index + filter or to just query the base table and be done with it.
Note that the bug here isn't about requiring |
I thought we understood that this started with d5a6aa4? (So no backporting needed yet.) |
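@nyh (IIRC) commented that although the query can be satisfied without filtering, it still has the property of "query unlimited data per page".

A query such as `SELECT * FROM tab WHERE pk = ? AND ck1 = ? AND ck2 = ?` is a point query, and even if it returns no results, it queries O(1) data (which has O(log n)).

A query such as `SELECT * FROM tab WHERE pk = ? AND ck1 >= ? AND ck1 <= ?` is a partition clustering range query, and even if it returns no results, it queries O(1) data (which has O(log n)).

A query such as `SELECT * FROM tab` is a full scan, and will return O(n) data for O(n) work, so amortized O(1) per row.

But a query such as `SELECT * FROM tab WHERE ck = ?` can return no data, or O(1) data, for O(n) work. Even though it doesn't filter using the filtering machinery, it suffers from the same problem as filtering queries, in that the amount of work done is potentially not proportional to the amount of data returned.

I agree with @avelanarius and @nyh that this is a bug.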
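To make the "work not proportional to data returned" point concrete, here is a minimal toy sketch. It models a table as an in-memory dictionary, which is purely an assumption for illustration and says nothing about how Scylla stores or scans data; `select_where_ck` is a hypothetical helper, not a real API. It counts the partitions a `SELECT * FROM tab WHERE ck = ?` style scan has to visit against the rows it returns.

```python
from typing import Dict, List, Tuple

# Toy model (an assumption, not Scylla's storage format):
# partition key -> {clustering key -> value}.
Table = Dict[int, Dict[int, str]]


def select_where_ck(table: Table, ck: int) -> Tuple[List[Tuple[int, int, str]], int]:
    """Hypothetical scan for `WHERE ck = ?` with no partition key restriction.

    Returns the matching rows and the number of partitions visited.
    """
    rows: List[Tuple[int, int, str]] = []
    partitions_visited = 0
    for pk, partition in table.items():
        partitions_visited += 1  # every partition must be consulted
        if ck in partition:
            rows.append((pk, ck, partition[ck]))
    return rows, partitions_visited


# 1000 partitions, each holding a single row with clustering key 0.
table: Table = {pk: {0: "v"} for pk in range(1000)}

no_rows, work_a = select_where_ck(table, ck=1)   # nothing matches
all_rows, work_b = select_where_ck(table, ck=0)  # every partition matches
```

In this toy model both calls visit all 1000 partitions. The first returns no rows at all, while the second returns one row per partition, which is the case where all partitions share the same clustering key.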
@avikivity this is not about requiring |
According to #7708 (comment), it's the same thing. need_filtering() -> require ALLOW FILTERING. |
No, you agree with them that #7608 is a bug. |
That comment mixes up bugs that I'm striving to keep separate. The bug here is whether |
This indeed has a separate issue: #7608
The results are wrong? Then let's begin by writing a test for this, that demonstrates wrong results. |
Um, we did that more than a week ago. Those tests were queued and dequeued when they failed. That's how this whole discussion started. Once again, this bug is about correctness, not performance expectations. "Needs filtering" is orthogonal to "needs ALLOW FILTERING". Please stick to that side of the discussion. To reiterate where we were: I and @psarna think that a query that restricts the token and a clustering prefix needs no separate filtering pass, index or no index. |
Where is this mentioned in this issue? Only now do I see that GitHub automatically connected that pull request to this issue, but there is no text mentioning these tests, and I didn't even notice it. I wish we used the notion of "xfailing" tests more. These are tests which we commit while still failing, with the intention of making them pass later. I don't think Avi would have been confused about what this issue is about if the issue mentioned a specific test which fails and needs to be made to pass.
I did! |
> I wish we used the notion of "xfailing" tests more. These are tests which
> we commit while still failing, with the intention of making them pass later.

This is us trying to make them pass. The test author had the rug pulled
from under him by an independent change in master that he wasn't aware of
during review.

> This has nothing to do with performance, I don't know why you are saying
> that I think it does.

Not you, others.
On Sun, Nov 29, 2020 at 8:48 AM Avi Kivity ***@***.***> wrote:
> @nyh <https://github.com/nyh> (IIRC) commented that although the query
> can be satisfied without filtering, it still has the property of "query
> unlimited data per page".
>
> A query such as SELECT * FROM tab WHERE pk = ? AND ck1 = ? AND ck2 = ? is
> a point query, and even if it returns no results, it queries O(1) data
> (which has O(log n)).
>
> A query such as SELECT * FROM tab WHERE pk = ? AND ck1 >= ? AND ck1 <= ?
> is a partition clustering range query, and even if it returns no results,
> it queries O(1) data (which has O(log n)).
>
> A query such as SELECT * FROM tab is a full scan, and will return O(n)
> data for O(n) work, so amortized O(1) per row.
>
> But a query such as SELECT * FROM tab WHERE ck = ? can return no data, or
> O(1) data,

O(n) data, no? If all partitions share the same clustering key.

> … for O(n) work. Even though it doesn't filter using the filtering
> machinery, it suffers from the same problem as filtering queries, in that
> the amount of work done is potentially not proportional to the amount of
> data returned.
>
> I agree with @avelanarius <https://github.com/avelanarius> and @nyh
> <https://github.com/nyh> that this is a bug.
:sigh: Once again, please don't discuss performance here. That's what #7608 is for. |
Example query that shows this issue - will refer to it later:
(as I see two options for fixing this issue: |
@avelanarius some arguments were discussed in #7699. I think they apply here, too. |
I don't think this discussion (in #7699) analyzes this problem thoroughly.
You can do this query with or without an index. If an index is used, there is a need for filtering. If no index is used, there is no need for filtering, as this is a ck prefix. But that doesn't mean that it's faster. If for example values in
This line of reasoning roughly holds true if
One flaw in my reasoning is that after reading the index, doing base queries is slow (in the non-whole_partitions and non-partition_slices case) because you have to do small queries for each row. And that could eliminate all the benefits of using an index (it would be interesting in the future to measure this empirically on some synthetic benchmark: how "restrictive" the index has to be).
Currently, if someone is not happy with the performance of a query because it uses an index, they can remove the index (of course potentially negatively affecting other queries). However, if we changed such queries to not use an index, the user would have no option for accelerating the query.
I don't vehemently object to changing such queries to not use the index. I'm just worried about the case I described above, and that some user will be negatively affected by this (with no clear workaround). |
Agree with @avelanarius. Keys in the index should have good selectivity. Of course we can't guarantee that, but the guidelines for using indexes are to only create them if there is good selectivity. So in general we should prefer the index. |
The problem with this argument is that it will complicate Moreover, I get that an index is created with the intention of being used. Maybe it should even be used unconditionally in all queries it can support, regardless of performance impact. But what's the answer if there are multiple indices (a common case, from what I've seen) or when an index slows down a query? |
That's a problem with need_filtering() and the code structure, not with the argument. The argument tries to provide the most predictable performance for the largest number of queries.
The gains can easily be demonstrated. We can also demonstrate cases where @avelanarius's logic leads to worse performance, but that's generally true, and the gain in performance when it works is much larger than the loss in performance when it doesn't.
Of course it's not simple, this is why I'm pushing to have a single expression for the WHERE clause so we have everything available. The index selection loop already contains a rudimentary scoring system to allow selecting the index based on its properties. We can later feed it information about index selectivity.
There is plenty of work in the optimizer left.
Right now we have too little information to make such decisions. Most databases collect more information so that they can make better decisions, and also provide a hint system to tell the database to prefer one choice. @avelanarius's argument is one step along that road; that the road is long does not mean we should not take the first step. |
I now agree that we should use an index for queries the index supports (when the full PK isn't known) and complicate |
need_filtering will probably die a needful death, yes. A replacement might be
where query_plan includes the candidate index, and maybe other stuff. Return nullopt if the plan cannot work, otherwise an estimate of the goodness of the plan. |
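The signature proposed in the comment above did not survive in this copy of the thread, so the following is only a hedged sketch of the idea it describes: score a candidate plan, returning nothing if the plan cannot work and otherwise an estimate of its goodness. Every name here (`QueryPlan`, `score_plan`, `pick_best_plan`, the fields, and the scoring formula) is hypothetical and chosen for illustration; none of it is Scylla's API, and Python's `Optional` stands in for `nullopt`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryPlan:
    """Hypothetical candidate plan; field names are illustrative assumptions."""
    index: Optional[str]        # candidate index, or None to read the base table
    feasible: bool              # can this plan serve the query at all?
    estimated_rows_read: int    # rough guess at the work done
    needs_filtering: bool       # would a separate filtering pass be required?


def score_plan(plan: QueryPlan) -> Optional[float]:
    """Return None if the plan cannot work, else a goodness estimate (higher is better)."""
    if not plan.feasible:
        return None
    score = 1.0 / (1 + plan.estimated_rows_read)  # less work scores higher
    if plan.needs_filtering:
        score *= 0.5  # arbitrary illustrative penalty for a filtering pass
    return score


def pick_best_plan(plans: List[QueryPlan]) -> Optional[QueryPlan]:
    """Pick the highest-scoring workable plan, or None if no plan works."""
    viable = [(s, p) for p in plans if (s := score_plan(p)) is not None]
    if not viable:
        return None
    return max(viable, key=lambda pair: pair[0])[1]


base_scan = QueryPlan(index=None, feasible=True, estimated_rows_read=1000, needs_filtering=False)
via_index = QueryPlan(index="idx", feasible=True, estimated_rows_read=10, needs_filtering=True)
unusable = QueryPlan(index="other_idx", feasible=False, estimated_rows_read=1, needs_filtering=False)

best = pick_best_plan([base_scan, via_index, unusable])
```

With these made-up estimates the selective index wins despite the filtering penalty. With a poorly selective index (a large `estimated_rows_read`) the base-table scan would win instead, which is the trade-off debated earlier in the thread.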
There were cases where a query on an indexed table needed filtering but need_filtering returned false. This is fixed by using new conditions in cases where we are using an index. Fixes scylladb#8991. Fixes scylladb#7708. For now this is an overly conservative implementation that returns true in some cases where filtering is not needed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
…ering key' from Jan Ciołek

Add examples from issue #8991 to tests. Both of these tests pass on `cassandra 4.0` but fail on `scylla 4.4.3`.

The first test checks that selecting values from an indexed table using only the clustering key returns correct values. The second test checks that performing this operation requires filtering.

The filtering test looks similar to [the one for #7608](https://github.com/scylladb/scylla/blob/1924e8d2b63e6611b78ac6252b5ddbc4884f9d22/test/cql-pytest/test_allow_filtering.py#L124) but there are some differences - here the table has two clustering columns and an index, so it could test different code paths.

Contains a quick fix for the `need_filtering()` function to make these tests pass. It returns `true` for this case and the one described in #7708. This implementation is a bit conservative - it might sometimes return `true` where filtering isn't actually needed, but at least it prevents scylla from returning incorrect results.

Fixes #8991.
Fixes #7708.

Closes #8994

* github.com:scylladb/scylla:
  cql3: Fix need_filtering on indexed table
  cql-pytest: Test selecting using only clustering key requires filtering
  cql-pytest: Test selecting from indexed table using clustering key
There were cases where a query on an indexed table needed filtering but need_filtering returned false. This is fixed by using new conditions in cases where we are using an index. Fixes #8991. Fixes #7708. For now this is an overly conservative implementation that returns true in some cases where filtering is not needed. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com> (cherry picked from commit 5414924)
Backported to 4.4, 4.5 |
The new `need_filtering()` implementation introduced in d5a6aa4 returns an invalid result (`false` instead of `true`) in such a case:

Index primary key: `((ck2), idx_token, pk1, pk2, ck1)`

Restrictions on index (for example query): `ck2 = 1, idx_token <= X, ck1 = 1`; but `pk1`, `pk2` - unrestricted: needs filtering!

`need_filtering()` incorrectly returns `false` exactly on this line ("Guaranteed continuous clustering range." case): https://github.com/scylladb/scylla/blob/5f81f97773361065b95b7063fca6410c83591bd7/cql3/restrictions/statement_restrictions.cc#L540
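The case above can be illustrated with a small, simplified model. This is a sketch under stated assumptions, not the code in `statement_restrictions.cc`: it assumes the partition key (`ck2`) is fully restricted, labels each clustering restriction as just `"eq"` or `"range"`, and ignores `IN`, multi-column and other restriction kinds. `clustering_needs_filtering` is a hypothetical name. The rule it encodes is that restrictions select one continuous clustering range only if they form a prefix of equalities optionally followed by a single range, with nothing restricted after that.

```python
from typing import Dict, List


def clustering_needs_filtering(clustering_columns: List[str],
                               restrictions: Dict[str, str]) -> bool:
    """Simplified, hypothetical check (not Scylla's implementation).

    `restrictions` maps a column name to "eq" or "range"; columns that are
    absent are unrestricted. Returns True when the restricted columns do not
    describe a single continuous clustering range.
    """
    prefix_closed = False  # True once no further column may be restricted
    for column in clustering_columns:
        kind = restrictions.get(column)
        if kind is None:
            prefix_closed = True   # a gap: later restrictions need filtering
        elif prefix_closed:
            return True            # restricted column after a gap or a range
        elif kind == "range":
            prefix_closed = True   # a range must be the last restricted column
    return False


# Clustering columns of the index table from the issue: ((ck2), idx_token, pk1, pk2, ck1)
index_clustering = ["idx_token", "pk1", "pk2", "ck1"]

# ck2 = 1 restricts the partition key; idx_token <= X and ck1 = 1 remain.
issue_case = clustering_needs_filtering(
    index_clustering, {"idx_token": "range", "ck1": "eq"})

# For contrast: an equality prefix followed by one trailing range.
contiguous_case = clustering_needs_filtering(
    index_clustering, {"idx_token": "eq", "pk1": "eq", "pk2": "range"})
```

In this model `issue_case` is `True`, because `ck1` is restricted after the unrestricted `pk1` and `pk2`, while `contiguous_case` is `False`.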