Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only group by if using having clauses #1750

Merged
merged 5 commits into from Oct 1, 2021

Conversation

WithoutPants
Copy link
Collaborator

Only group by ids if having is used in the query. This results in significant performance improvement on the images query, since it potentially eliminates a table scan. On a contrived database with 4m images, it results in 200ms for the query, down from 4 seconds for a page size of 40.

@WithoutPants WithoutPants added the improvement Something needed tweaking. label Sep 21, 2021
@WithoutPants WithoutPants added this to the Version 0.10.0 milestone Sep 21, 2021
@gitgiggety
Copy link
Contributor

How sure are you that every query which is affected by this change is using distinct id (without additional columns being selected). Otherwise those queries would yield different results. (Although I don't know if SQLite already throws an error when SELECTing fields which aren't in the GROUP BY list. It at least should do, but for example MySQL doesn't or didn't).

@gitgiggety
Copy link
Contributor

Well, just tested it:

SELECT DISTINCT id, performer_id FROM scenes INNER JOIN performers_scenes ON performers_scenes.scene_id = scenes.id GROUP BY id;

This does return a single row per (scene) id containing a random performer_id (for that scene). While a compliant DB would result in an error stating that performer_id can't be used in the select clause as it isn't in the group by.

When dropping the GROUP BY this query will return multiple records per scene, one per unique scene id + performer combination. So "just" removing the GROUP BY isn't possible unless certain this is only used by queries which are generated as SELECT DISTINCT id FROM.

@WithoutPants
Copy link
Collaborator Author

As far as I could tell, buildQueryBody is only used by the queryBuilder which in turn is only used by Query, all of which query on distinct ids.

@gitgiggety
Copy link
Contributor

👍

When I'm at my development computer I'll try to verify that as well so that two people confirmed it's the case.

@kermieisinthehouse
Copy link
Collaborator

As far as I could tell, buildQueryBody is only used by the queryBuilder which in turn is only used by Query, all of which query on distinct ids.

Should we perhaps add a comment to one of these now indicating that DISTINCT is a hard requirement?

@gitgiggety
Copy link
Contributor

Just wondering. IIRC the "includes all" modifier for the multi / hierarchical criterion does add a HAVING COUNT(DISTINCT ...) = <number of values>. It might be worthwhile to drop this as well when there is only one value? (/using "includes" code path when there is only one value)
I don't think it's necessary to add the HAVING then, and my guess is that in your testcase for example filtering for 1 performer (or tag) does yield very different performance results when using "includes all" vs "includes"? Or at least it could be the case because of "includes all" adding the HAVING thus keeping the GROUP BY thus potentially still doing a full table scan. (But I might be wrong depending on the generated query plan which obviously will also be based on how many rows are already removed by the join).

@gitgiggety
Copy link
Contributor

gitgiggety commented Sep 21, 2021

Found an issue, which might show up in multiple areas.

Applying "Image Count" sorting on the tags page results in a SQL error:

error executing find query with SQL: SELECT DISTINCT tags.id FROM tags LEFT JOIN images_tags ON images_tags.tag_id = tags.id ORDER BY COUNT(distinct images_tags.image_id) ASC LIMIT 40 OFFSET 0 , args: [], error: misuse of aggregate: COUNT()

Obviously meaning the GROUP BY is missing.

Note I did use "Image Count" sorting on the performers page already which did work fine. So it seems to depend on "something else" which does add the group by then, or it uses ORDER BY (SELECT COUNT (*) FROM ... WHERE ...). But in that case it would be a dependent subquery which normally would be slow (/slower than joining in the relevant data).

So I guess there needs to be some other means to indicate to the query (builder) that the GROUP BY clause must be added.

Edit:
After going through all pages and all "* count" sort options it seems like /tags is the only one being broken (all of them, so "image count", "scene count", "performer count" etc). All other types do seem to use the dependent subquery approach.

@gitgiggety
Copy link
Contributor

+1

When I'm at my development computer I'll try to verify that as well so that two people confirmed it's the case.

So:
buildQueryBody is used in two places, by queryBuilder::executeCount which is both used by galleryQueryBuilder::QueryCount and imageQueryBuilder::QueryCount, both use a qb.makeQuery which uses query.body = selectDistinctIDs(<table>) so those only use SELECT DISTINCT. The other usage of buildQueryBody is repository::executeFindQuery which in turn is used by queryBuilder::executeFind which in turn is used by the Query method of all the type specific query builders. I went through all of them (gallery, image, movies, performer, scene, scene_marker, studio and tag) and all use a makeQuery which does query.body = selectDistinctIDs(<table>).
As double check I also checked assignments to <query>.body and all of them are also selectDistinctIDs (which would be all the makeQuery methods :)). So I can confirm that this part is fine.

So that leaves the issue of aggregates being used (in other words, at least the issue with the "* count" sorting on the /tags page).

@WithoutPants
Copy link
Collaborator Author

I've changed the tag sorting code to be consistent with the other sorting code. This eliminates the necessity of using group by. It also appears to improve performance when sorting by scene and image counts. On the contrived generated database, it improved query performance by around 80%.

I've also added integration tests exercising the sorting code..

@WithoutPants WithoutPants merged commit ca0a8b0 into stashapp:develop Oct 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Something needed tweaking.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants