Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow filtering doesn't page correctly #4156

Closed
avnerbarr opened this issue Jan 28, 2019 · 17 comments
Closed

Allow filtering doesn't page correctly #4156

avnerbarr opened this issue Jan 28, 2019 · 17 comments

Comments

@avnerbarr
Copy link

avnerbarr commented Jan 28, 2019

scylla --version
3.0.1-0.20190120.3c4f8cf6e

testing the filtering feature in the cqlsh - it returns one result but doesn't "page" (or continue , what ever the term is) to the subsequent results - it just shows the "--MORE---" foot note

cqlsh> SELECT * FROM data_feeds.data_feeds_log where section_id=8767466 allow filtering;

 section_id | feed_id | start_time                      | deletions | end_time                        | insertions | message | modifications | version
------------+---------+---------------------------------+-----------+---------------------------------+------------+---------+---------------+---------------
    8767466 |    4142 | 2019-01-27 15:53:48.928000+0000 |         0 | 2019-01-27 15:53:49.443000+0000 |        150 |    null |             0 | 1548604428498

---MORE---
---MORE---
---MORE---
---MORE---
cqlsh> DESCRIBE data_feeds.data_feeds_log;

CREATE TABLE data_feeds.data_feeds_log (
    section_id int,
    feed_id int,
    start_time timestamp,
    deletions int,
    end_time timestamp,
    insertions int,
    message text,
    modifications int,
    version text,
    PRIMARY KEY ((section_id, feed_id), start_time)
) WITH CLUSTERING ORDER BY (start_time ASC)```
@avnerbarr avnerbarr changed the title Allow filtering doesn't Allow filtering doesn't page correctly Jan 28, 2019
@psarna
Copy link
Contributor

psarna commented Jan 28, 2019

What ---MORE--- represents here is empty pages, because all the results were filtered out from them. It was a design decision whether coordinator should return empty pages to the client (which is allowed by the protocol), or try to skip empty pages in order to find one that has at least one row. The downside of solution 2 is that it could easily lead to client-side timeouts if the amount of rows to be filtered is big, so we went with option 1.

@slivne
Copy link
Contributor

slivne commented Jan 28, 2019

Please note:

  1. CQLSH default page size is very small 100
  2. You can increase the page size - that will force fetching more results to filter through.

@duarten
Copy link
Contributor

duarten commented Jan 28, 2019

Indeed, we could enhance our cqlsh to automatically page empty result sets. I opened scylladb/scylla-tools-java#81.

@slivne
Copy link
Contributor

slivne commented Jan 28, 2019

I think the bigger question is how to better handle the case of empty pages in ALLOW FILTERING and we did have some other alternatives. such as fetch additional results till we are closer to the timeout time of the query.

@duarten
Copy link
Contributor

duarten commented Jan 28, 2019

That's not a complete solution, since we can still send empty pages to the client. I still think there's value in modifying cqlsh.

Indeed we can try and fetch more pages if we have budget for it. But would we apply this solution if we got a single row back? I don't think we need to avoid these intermediate user requests for this case, since allow filtering is not supposed to be an efficient operation - there's even a warning about that.

@slivne
Copy link
Contributor

slivne commented Jan 28, 2019 via email

@avikivity
Copy link
Member

I don't think we should play games with trying to fetch more rows. There will always be a case where it doesn't work (consider a query that scans the entire table and returns no rows).

The inefficiency here is in the query, adding an extra hop to the client makes it less efficient, but not by much. Filtering should be done when the query returns >1% of the data, below that it is inefficient.

@avnerbarr
Copy link
Author

avnerbarr commented Jan 28, 2019

You have many cases where the "allow filtering" is utilized for experimentation and examination of the data directly in cqlsh- not for efficiency reasons- this is required since the key is "synthetic" or "bucketed" (i.e by hours) because of partitioning constraints - in this case the feature is meant as a convenience and should be easy to use, not there for peformance optimizations reasons

@duarten
Copy link
Contributor

duarten commented Jan 28, 2019

Agreed, but it's still a tooling issue, not a DB issue.

@nyh
Copy link
Contributor

nyh commented Jan 29, 2019

I think this is a cqlsh bug. cqlsh documentation states that "PAGING ON displays query results in 100-line chunks followed by the more prompt". But this is not really what it does. It asks Cassandra or Scylla for up to 100 rows, and outputs the page it got. But that page may be smaller than 100 rows, it may even be (as we noticed) empty. cqlsh needs to fetch the next page in this case, again and again, until it has 100 rows to output, if it really wants to obey its documentation. Alternaively, cqlsh's documentation can be modified.

This is similar to piping to "more" in Linux: "something | more" is supposed to show a full screen (e.g., 25 lines) of output at a time. If "something" outputs just 2 lines (or zero lines) and pauses, more waits longer until 25 lines are available and only then says "--More--". It doesn't just show a "--More--" after two lines of output because that's what it got quickly.

But I do agree with @slivne that with changes to Scylla (to try to produce more results for a longer time instead of quickly returning a page of zero results) would have made this issue less important. But I wonder if it doesn't make even more sense to just fix cqlsh, and optimize Scylla for this case (to return fewer silly empty pages) later.

@avikivity
Copy link
Member

It's not possible, in the general case, to fix this in the database. Suppose you filter all the data away; scylla will have to scan the entire data set before returning the final page (which would also be empty).

@tgrabiec
Copy link
Contributor

It's more efficient to scan the table on the scylla side without involving the client, so that network latency to the client doesn't limit throughput.

The client shouldn't set a timeout for queries whose completion is not bounded in time and he's willing to wait for as long as it takes. Filtering queries and count(*) are such operations.

For canceling queries on the server side for which no one is waiting we should have a different mechanism - connection drop should cancel them. The driver should also expose a way to cancel.

@duarten
Copy link
Contributor

duarten commented Jan 29, 2019

I don't understand why we should be interested in optimizing allow filtering queries, or queries that are performed without paging. I think aggregation is a different topic, and something we should support efficiently.

I also don't think it's wise for clients to perform operations without setting a timeout, or they can easily block the application for an arbitrary amount of time.

@tgrabiec
Copy link
Contributor

tgrabiec commented Jan 29, 2019 via email

@duarten
Copy link
Contributor

duarten commented Jan 29, 2019

On Tue, Jan 29, 2019 at 12:08 PM Duarte Nunes @.***> wrote: I don't understand why we should be interested in optimizing allow filtering queries, or queries that are performed without paging.
We can certainly wait until a user with a weird use case comes along and complains.

:D

But on the serious side, I don't think users expect allow filtering queries to perform optimally, and I also don't think the interstitial round trips to and from the client will make a huge difference for these types of queries.

I think aggregation is a different topic, and something we should support efficiently.
Both aggregate queries and filtered queries have the same issue with timeouts, so they could share the solution.

A filtering query is more or less sequential, and the result set is bounded by the page size. An aggregation is fully concurrent, so there are other vector for optimization (i.e., map-reduce).

A potential solution that we discussed would be to send keep-alives as empty pages, but then we'd see the same behavior.

I also don't think it's wise for clients to perform operations without setting a timeout, or they can easily block the application for an arbitrary amount of time.
If there is an upper bound which stems from the business logic, then it should be imposed by the application, but I think that's for the application to decide, not the driver. When I run a filtering query via cqlsh, I don't want it to timeout, unless I'm no longer interested in the results. ^C should cancel the query. Detecting failures or deadlocks within the system is a different matter.

Agreed, my point was simply that we can't recommend users to not use timeouts, because I don't think that's the general case. If they are able, and aware of the risks, then querying without a timeout is legitimate.

@slivne
Copy link
Contributor

slivne commented Feb 3, 2019

We tested this on Cassandra and it does not return an empty page - so although the protocol supports it - its clear we do not behave in the same manner.

One solution that tzach suggested is to make the drivers "smarter" and fetch the next page if the returned page was empty.

@duarten
Copy link
Contributor

duarten commented Feb 3, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants