Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New need_filtering() returns false for a query that needs filtering #7708

Closed
avelanarius opened this issue Nov 26, 2020 · 34 comments · Fixed by #8994
Closed

New need_filtering() returns false for a query that needs filtering #7708

avelanarius opened this issue Nov 26, 2020 · 34 comments · Fixed by #8994
Assignees
Milestone

Comments

@avelanarius
Copy link

avelanarius commented Nov 26, 2020

New need_filtering() implementation introduced in d5a6aa4 returns invalid result (false instead of true) in such a case:

CREATE TABLE t (pk1 int, pk2 int, ck1 int, ck2 int, v int, v2 int, PRIMARY KEY((pk1, pk2), ck1, ck2));
CREATE INDEX t_global ON t(ck2);
SELECT * FROM t WHERE token(pk1, pk2) <= token(1, 3) AND ck2 = 1 AND ck1 = 1;

Index primary key: ((ck2), idx_token, pk1, pk2, ck1)
Restrictions on index (for example query): ck2 = 1, idx_token <= X, ck1 = 1; but pk1, pk2 - unrestricted: needs filtering!

need_filtering() returns incorrectly false exactly on this line ("Guaranteed continuous clustering range." case): https://github.com/scylladb/scylla/blob/5f81f97773361065b95b7063fca6410c83591bd7/cql3/restrictions/statement_restrictions.cc#L540

@avelanarius avelanarius self-assigned this Nov 26, 2020
@avelanarius avelanarius changed the title New need_filtering() returns false for query that needs filtering New need_filtering() returns false for a query that needs filtering Nov 26, 2020
@avelanarius
Copy link
Author

d5a6aa4 also introduced this "regression":

Before:

> CREATE KEYSPACE ks WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
> CREATE TABLE ks.t(pk int, ck int, PRIMARY KEY(pk, ck));
> SELECT * FROM ks.t WHERE ck = 5;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

After:

> CREATE KEYSPACE ks WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
> CREATE TABLE ks.t(pk int, ck int, PRIMARY KEY(pk, ck));
> SELECT * FROM ks.t WHERE ck = 5;

 pk | ck
----+----

After discussion with @slivne (and @haaawk), we came to the conclusion that the previous behaviour was the correct one (requiring the user to include ALLOW FILTERING).

/cc @dekimir

@dekimir
Copy link
Contributor

dekimir commented Nov 26, 2020

d5a6aa4 also introduced this "regression":

Duplicate of #7608. It was like that even before d5a6aa4, BTW.

@dekimir
Copy link
Contributor

dekimir commented Nov 27, 2020

SELECT * FROM t WHERE token(pk1, pk2) <= token(1, 3) AND ck2 = 1 AND ck1 = 1;

I don't understand why the index is used to serve this query. Why not just query the base table, given that the partitions are known and the clustering range is continuous? @psarna must know the answer. :)

need_filtering() returns incorrectly false

Why is this incorrect? The base table can be queried with one read command and no filtering.

@slivne
Copy link
Contributor

slivne commented Nov 29, 2020

@psarna help ...

@slivne slivne added this to the 4.4 milestone Nov 29, 2020
@slivne
Copy link
Contributor

slivne commented Nov 29, 2020

We need to understand if this is a regression and from which point. Do we need to backport this to 4.3 / 4.2 ...

@psarna
Copy link
Contributor

psarna commented Nov 29, 2020

Hm, that's true, this statement can run without filtering... I think that the main issue here is that we always had a hidden assumption that ALLOW FILTERING never actually changes the query plan - it's only a safety measure to warn users about possible performance consequences. And here, we have an option to either use an index and filter or just query a (potentially large) token range.

This type of query is tough, since in some cases it makes more sense to query the base table, and sometimes it would be better to use an index...

I don't really have any strong opinions here, both requiring and not requiring filtering makes sense here. If I had to choose now, I'd go with @dekimir's semantics and just query the base table, since we can, and querying indexes always induces a latency cost. Also, it would be weird for users if this query only starts requiring filtering after we create an index on it, and works just fine without indexes.

In distant future, a query optimizer could decide, based on statistics we don't have yet, if it's more profitable to use an index + filter or just query the base table and be done with it.

@dekimir
Copy link
Contributor

dekimir commented Nov 29, 2020

Note that the bug here isn't about requiring ALLOW FILTERING (which is worth discussing as a separate issue) but the fact that when need_filtering() returns false, the filtering step doesn't happen, so the results are wrong.

@dekimir
Copy link
Contributor

dekimir commented Nov 29, 2020

We need to understand if this is a regression and from which point. Do we need to backport this to 4.3 / 4.2 ...

I thought we understood that this started with d5a6aa4? (So no backporting needed yet.)

@avikivity
Copy link
Member

@nyh (IIRC) commented that although the query can be satisfied without filtering, it still has the property of "query unlimited data per page".

A query such as SELECT * FROM tab WHERE pk = ? AND ck1 = ? AND ck2 = ? is a point query, and even if it returns no results, it queries O(1) data (which has O(log n)).

A query such as SELECT * FROM tab WHERE pk = ? AND ck1 >= ? AND ck1 <= ? is a partition clustering range query, and even if it returns no results, it queries O(1) data (which has O(log n)).

A query such as SELECT * FROM tab is a full scan, and will return O(n) data for O(n) work, so amortized O(1) per row.

But a query such as SELECT * FROM tab WHERE ck = ? can return no data, or O(1) data, for O(n) work. Even though it doesn't filter using the filtering machinery, it suffers from the same problem as filtering queries, in that the amount of work done is potentially not proportional to the amount of data returned.

I agree with @avelanarius and @nyh that this is a bug.

@dekimir
Copy link
Contributor

dekimir commented Nov 29, 2020

@avikivity this is not about requiring ALLOW FILTERING or amount of data fetched. This is about whether the filtering step is needed after read. It isn't, if you're reading the base table.

@avikivity
Copy link
Member

According to #7708 (comment), it's the same thing. need_filtering() -> require ALLOW FILTERING.

@dekimir
Copy link
Contributor

dekimir commented Nov 29, 2020

I agree with @avelanarius and @nyh that this is a bug.

No, you agree with them that #7608 is a bug.

@dekimir
Copy link
Contributor

dekimir commented Nov 29, 2020

According to #7708 (comment), it's the same thing. need_filtering() -> require ALLOW FILTERING.

That comment mixes up bugs that I'm striving to keep separate. The bug here is whether do_filter should be invoked or skipped, not whether O(n) data is fetched.

@nyh
Copy link
Contributor

nyh commented Nov 29, 2020

Note that the bug here isn't about requiring ALLOW FILTERING (which is worth discussing as a separate issue)

This indeed has a separate issue: #7608

but the fact that when need_filtering() returns false, the filtering step doesn't happen, so the results are wrong.

The results are wrong? Then let's begin by writing a test for this, that demonstrates wrong results.

@dekimir
Copy link
Contributor

dekimir commented Nov 29, 2020

The results are wrong? Then let's begin by writing a test for this, that demonstrates wrong results.

Um, we did that more than a week ago. Those tests were queued and dequeued when they failed. That's how this whole discussion started.

Once again, this bug is about correctness, not performance expectations. "Needs filtering" is orthogonal to "needs ALLOW FILTERING". Please stick to that side of the discussion.

To reiterate where we were: I and @psarna think that a query that restricts the token and a clustering prefix needs no separate filtering pass, index or no index.

@nyh
Copy link
Contributor

nyh commented Nov 29, 2020

The results are wrong? Then let's begin by writing a test for this, that demonstrates wrong results.

Um, we did that more than a week ago. Those tests were queued and dequeued when they failed. That's how this whole discussion started.

Where is this mentioned in this issue? Only now see that github automatically connected that pull request to this issue by github, but there is no text mentioning these tests, and I didn't even notice it.

I wish we used the notion of "xfailing" tests more. These are tests which we commit while still failing, with the intention of making them pass later. I don't think Avi would have been confused about what this issue is about if the issue mentioned a specific test which fails and needs to be made to pass.

Once again, this bug is about correctness, not performance expectations. "Needs filtering" is orthogonal to "needs ALLOW FILTERING". Please stick to that side of the discussion.

I did!
A test would do a query and check the results, and it will fail if the results are wrong. This has nothing to do with performance, I don't know why you are saying that I think it does.

@dekimir
Copy link
Contributor

dekimir commented Nov 29, 2020 via email

@raphaelsc
Copy link
Member

raphaelsc commented Nov 29, 2020 via email

@dekimir
Copy link
Contributor

dekimir commented Nov 30, 2020

:sigh: Once again, please don't discuss performance here. That's what #7608 is for.

@avelanarius
Copy link
Author

avelanarius commented Nov 30, 2020

Example query that shows this issue - will refer to it later:

CREATE TABLE t (pk int, ck1 int, ck2 int, PRIMARY KEY(pk, ck1, ck2))
CREATE INDEX ON t(ck2)
SELECT * FROM t WHERE ck1 = ? AND ck2 = ? ALLOW FILTERING

(as SEASTAR_TEST_CASE here: https://gist.github.com/avelanarius/630dbe437e973b764cce78bb158e18bc)

I see two options for fixing this issue:
a) Make need_filtering() return false if index is not used and true if index is used (in this example). (talking about "do_filter"). This will require using _uses_secondary_indexing in need_filtering(), but in d5a6aa4 @dekimir got rid of using _uses_secondary_indexing here and probably will be against adding it back (of course in a limited way).
b) Decide on whether the example query (above) should use index or not. There are strong (performance) arguments in favour/against using it or not. If we decide against a), I will write those performance arguments. I have a patch removing using the index in queries like the example query already prepared.

@dekimir
Copy link
Contributor

dekimir commented Nov 30, 2020

b) Decide on whether the example query (above) should use index or not.

@avelanarius some arguments were discussed in #7699. I think they apply here, too.

@avelanarius
Copy link
Author

I don't think this discussion (in #7699) analyzes this problem thoroughly.

CREATE TABLE t (pk int, ck1 int, ck2 int, PRIMARY KEY(pk, ck1, ck2))
CREATE INDEX ON t(ck2)
SELECT * FROM t WHERE ck1 = ? AND ck2 = ? ALLOW FILTERING

You can do this query with or without index. If index is used, there is a need for filtering. If no index is used, there is no need for filtering, as this is a ck prefix. But that doesn't mean that it's faster.

If for example values in ck2 are distinct, doing this query with index will be faster. The ck2 index partitions are 1 row big. This one row will be read from index, filtered (ck1 restriction) and 1 row queried from the base table. On the other hand, while doing this query without index, you will have to iterate over all partitions and do ck search inside each of them (only one will yield result; using Statistics from SSTable min_clustering_key max_clustering_key might help to reduce the work).

This line of reasoning roughly holds true if ck2 (indexed column) is of high cardinality. And such high cardinality columns are great for indexes. And I assume that if someone created index on ck2, they knew that this column was of high cardinality and they want the index to be used. Surely, they wouldn't create this index if the cardinality was very low as:
a) Index partitions would get very large
b) Usage of such index would not accelerate queries compared to naive filtering

One flaw in my reasoning is that after reading the index, doing base queries is slow (in non-whole_partitions and non-partition_slices case) because you have to do small queries per each row. And that could eliminate all benefits of using index (would be interesting in the future to measure this empirically on some synthetic benchmark - how "restrictive" the index has to be
to run faster).

Currently, if someone is not happy with performance of query because it uses an index, they can remove the index (of course potentially negatively affecting other queries). However, if we changed such queries to not use an index, the user will have no option of accelerating the query.

I don't vehemently object to changing such queries to not use the index - I'm just worried about the case I described above and that some user will be negatively affected by this (with no clear workaround).

@avikivity
Copy link
Member

Agree with @avelanarius. Keys in the index should have good selectivity. Of course we can't guarantee that, but the guidelines for using indexes are to only create them if there is good selectivity. So in general we should prefer the index.

@dekimir
Copy link
Contributor

dekimir commented Dec 1, 2020

The problem with this argument is that it will complicate need_filtering() beyond what's reasonable, for uncertain gains. Even for limited cases like this, the logic on whether to use the filter is pretty involved; what will happen when other subexpressions and other indices come into play? I see, for example, that @avelanarius has already changed the query from what was in his initial comment, by dropping TOKEN. Will TOKEN be a separate case to be recognized? Will a regular column change the reasoning? What about when there is more than one applicable index?

Moreover, need_filtering() will need to communicate back to the caller which assumption it made about indexing; otherwise, there's a high risk of future bugs where the caller assumes something different and thus has the wrong answer. (Or if we make the caller decide in advance which index to use, that needs to be communicated to need_filtering().) All this will complicate the logic beyond easy comprehension; we've been down that road before.

I get that an index is created with the intention of being used. Maybe it should even be used unconditionally in all queries it can support, regardless of performance impact. But what's the answer if there are multiple indices (a common case, from what I've seen) or when an index slows down a query?

@avikivity
Copy link
Member

The problem with this argument is that it will complicate need_filtering() beyond what's reasonable,

That's a problem with need_filtering() and the code structure, not with the argument. The argument tries to provide the most predictable performance for the largest number of queries.

for uncertain gains.

The gains can easily be demonstrated. We can also demonstrate cases where @avelanarius logic leads to worse performance, but that's generally true, and the gain in performance when it works is much larger than the loss in performance when it doesn't.

Even for limited cases like this, the logic on whether to use the filter is pretty involved; what will happen when other subexpressions and other indices come into play?

Of course it's not simple, this is why I'm pushing to have a single expression for the WHERE clause so we have everything available. The index selection loop already contains a rudimentary scoring system to allow selecting the index based on its properties. We can later feed it information about index selectivity.

I see, for example, that @avelanarius has already changed the query from what was in his initial comment, by dropping TOKEN. Will TOKEN be a separate case to be recognized? Will a regular column change the reasoning? What about when there is more than one applicable index?

There is plenty of work in the optimizer left.

Moreover, need_filtering() will need to communicate back to the caller which assumption it made about indexing; otherwise, there's a high risk of future bugs where the caller assumes something different and thus has the wrong answer. (Or if we make the caller decide in advance which index to use, that needs to be communicated to need_filtering().) All this will complicate the logic beyond easy comprehension; we've been down that road before.

I get that an index is created with the intention of being used. Maybe it should even be used unconditionally in all queries it can support, regardless of performance impact. But what's the answer if there are multiple indices (a common case, from what I've seen) or when an index slows down a query?

Right now we have too little information to make such decisions. Most databases collect more information so that they can make better decision, and also provide a hint system to tell the database to prefer one choice.

@avelanarius argument is one step along that road, because the road is long does not mean we should not take the first step.

@dekimir
Copy link
Contributor

dekimir commented Dec 2, 2020

I now agree that we should use an index for queries the index supports (when the full PK isn't known) and complicate need_filtering accordingly, to fix this bug. As for the "first step on a long road", time will tell. In my vision, the road leaves need_filtering behind.

@avikivity
Copy link
Member

need_filtering will probably die a needful death, yes.

A replacement might be

std::optional<int> plan_score(expression where_clause, query_plan x) ;

where query_plan includes the candidate index, and maybe other stuff. Return nullopt if the plan cannot work, otherwise an estimate of the goodness of the plan.

@haaawk haaawk self-assigned this Mar 21, 2021
@slivne slivne modified the milestones: 4.4, 4.5 Mar 29, 2021
cvybhu added a commit to cvybhu/scylladb that referenced this issue Jul 9, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes scylladb#8991.
Fixes scylladb#7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Jul 9, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes scylladb#8991.
Fixes scylladb#7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Jul 9, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes scylladb#8991.
Fixes scylladb#7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Jul 9, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes scylladb#8991.
Fixes scylladb#7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Jul 9, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes scylladb#8991.
Fixes scylladb#7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Jul 9, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes scylladb#8991.
Fixes scylladb#7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Jul 9, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes scylladb#8991.
Fixes scylladb#7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Jul 12, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes scylladb#8991.
Fixes scylladb#7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Jul 13, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes scylladb#8991.
Fixes scylladb#7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Jul 19, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes scylladb#8991.
Fixes scylladb#7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
cvybhu added a commit to cvybhu/scylladb that referenced this issue Jul 19, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes scylladb#8991.
Fixes scylladb#7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
nyh added a commit that referenced this issue Jul 19, 2021
…ering key' from Jan Ciołek

Add examples from issue #8991 to tests
Both of these tests pass on `cassandra 4.0` but fail on `scylla 4.4.3`

First test tests that selecting values from indexed table using only clustering key returns correct values.
The second test tests that performing this operation requires filtering.

The filtering test looks similar to [the one for #7608](https://github.com/scylladb/scylla/blob/1924e8d2b63e6611b78ac6252b5ddbc4884f9d22/test/cql-pytest/test_allow_filtering.py#L124) but there are some differences - here the table has two clustering columns and an index, so it could test different code paths.

Contains a quick fix for the `needs_filtering()` function to make these tests pass.
It returns `true` for this case and the one described in #7708.

This implementation is a bit conservative - it might sometimes return `true` where filtering isn't actually needed, but at least it prevents scylla from returning incorrect results.

Fixes #8991.
Fixes #7708.

Closes #8994

* github.com:scylladb/scylla:
  cql3: Fix need_filtering on indexed table
  cql-pytest: Test selecting using only clustering key requires filtering
  cql-pytest: Test selecting from indexed table using clustering key
avikivity pushed a commit that referenced this issue Oct 18, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes #8991.
Fixes #7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 5414924)
avikivity pushed a commit that referenced this issue Oct 18, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes #8991.
Fixes #7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 5414924)
@avikivity
Copy link
Member

Backported to 4.4, 4.5

avikivity pushed a commit that referenced this issue Oct 27, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes #8991.
Fixes #7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 5414924)
avikivity pushed a commit that referenced this issue Oct 28, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes #8991.
Fixes #7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 5414924)
avikivity pushed a commit that referenced this issue Oct 28, 2021
There were cases where a query on an indexed table
needed filtering but need_filtering returned false.

This is fixed by using new conditions in cases where
we are using an index.

Fixes #8991.
Fixes #7708.

For now this is an overly conservative implementation
that returns true in some cases where filtering
is not needed.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
(cherry picked from commit 5414924)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants