Only group by if using having clauses #1750

WithoutPants · 2021-09-21T12:06:17Z

Only group by ids if having is used in the query. This results in significant performance improvement on the images query, since it potentially eliminates a table scan. On a contrived database with 4m images, it results in 200ms for the query, down from 4 seconds for a page size of 40.

gitgiggety · 2021-09-21T12:12:41Z

How sure are you that every query which is affected by this change is using distinct id (without additional columns being selected). Otherwise those queries would yield different results. (Although I don't know if SQLite already throws an error when SELECTing fields which aren't in the GROUP BY list. It at least should do, but for example MySQL doesn't or didn't).

gitgiggety · 2021-09-21T12:20:20Z

Well, just tested it:

SELECT DISTINCT id, performer_id FROM scenes INNER JOIN performers_scenes ON performers_scenes.scene_id = scenes.id GROUP BY id;

This does return a single row per (scene) id containing a random performer_id (for that scene). While a compliant DB would result in an error stating that performer_id can't be used in the select clause as it isn't in the group by.

When dropping the GROUP BY this query will return multiple records per scene, one per unique scene id + performer combination. So "just" removing the GROUP BY isn't possible unless certain this is only used by queries which are generated as SELECT DISTINCT id FROM.

WithoutPants · 2021-09-21T12:24:51Z

As far as I could tell, buildQueryBody is only used by the queryBuilder which in turn is only used by Query, all of which query on distinct ids.

gitgiggety · 2021-09-21T12:34:49Z

👍

When I'm at my development computer I'll try to verify that as well so that two people confirmed it's the case.

kermieisinthehouse · 2021-09-21T15:03:45Z

As far as I could tell, buildQueryBody is only used by the queryBuilder which in turn is only used by Query, all of which query on distinct ids.

Should we perhaps add a comment to one of these now indicating that DISTINCT is a hard requirement?

gitgiggety · 2021-09-21T16:25:48Z

Just wondering. IIRC the "includes all" modifier for the multi / hierarchical criterion does add a HAVING COUNT(DISTINCT ...) = <number of values>. It might be worthwhile to drop this as well when there is only one value? (/using "includes" code path when there is only one value)
I don't think it's necessary to add the HAVING then, and my guess is that in your testcase for example filtering for 1 performer (or tag) does yield very different performance results when using "includes all" vs "includes"? Or at least it could be the case because of "includes all" adding the HAVING thus keeping the GROUP BY thus potentially still doing a full table scan. (But I might be wrong depending on the generated query plan which obviously will also be based on how many rows are already removed by the join).

gitgiggety · 2021-09-21T17:03:20Z

Found an issue, which might show up in multiple areas.

Applying "Image Count" sorting on the tags page results in a SQL error:

error executing find query with SQL: SELECT DISTINCT tags.id FROM tags LEFT JOIN images_tags ON images_tags.tag_id = tags.id ORDER BY COUNT(distinct images_tags.image_id) ASC LIMIT 40 OFFSET 0 , args: [], error: misuse of aggregate: COUNT()

Obviously meaning the GROUP BY is missing.

Note I did use "Image Count" sorting on the performers page already which did work fine. So it seems to depend on "something else" which does add the group by then, or it uses ORDER BY (SELECT COUNT (*) FROM ... WHERE ...). But in that case it would be a dependent subquery which normally would be slow (/slower than joining in the relevant data).

So I guess there needs to be some other means to indicate to the query (builder) that the GROUP BY clause must be added.

Edit:
After going through all pages and all "* count" sort options it seems like /tags is the only one being broken (all of them, so "image count", "scene count", "performer count" etc). All other types do seem to use the dependent subquery approach.

gitgiggety · 2021-09-21T17:33:08Z

+1

When I'm at my development computer I'll try to verify that as well so that two people confirmed it's the case.

So:
buildQueryBody is used in two places, by queryBuilder::executeCount which is both used by galleryQueryBuilder::QueryCount and imageQueryBuilder::QueryCount, both use a qb.makeQuery which uses query.body = selectDistinctIDs(<table>) so those only use SELECT DISTINCT. The other usage of buildQueryBody is repository::executeFindQuery which in turn is used by queryBuilder::executeFind which in turn is used by the Query method of all the type specific query builders. I went through all of them (gallery, image, movies, performer, scene, scene_marker, studio and tag) and all use a makeQuery which does query.body = selectDistinctIDs(<table>).
As double check I also checked assignments to <query>.body and all of them are also selectDistinctIDs (which would be all the makeQuery methods :)). So I can confirm that this part is fine.

So that leaves the issue of aggregates being used (in other words, at least the issue with the "* count" sorting on the /tags page).

WithoutPants · 2021-10-01T04:20:23Z

I've changed the tag sorting code to be consistent with the other sorting code. This eliminates the necessity of using group by. It also appears to improve performance when sorting by scene and image counts. On the contrived generated database, it improved query performance by around 80%.

I've also added integration tests exercising the sorting code..

Only group by if using having clauses

1a18fcb

WithoutPants added the improvement Something needed tweaking. label Sep 21, 2021

WithoutPants added this to the Version 0.10.0 milestone Sep 21, 2021

WithoutPants added 3 commits October 1, 2021 12:16

Merge remote-tracking branch 'upstream/develop' into group-by-having

de8efb2

Change tag sorting SQL

1d7ed24

Add sorting unit tests

a98970f

kermieisinthehouse approved these changes Oct 1, 2021

View reviewed changes

Add changelog entry [skip ci]

e28d849

WithoutPants merged commit ca0a8b0 into stashapp:develop Oct 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only group by if using having clauses #1750

Only group by if using having clauses #1750

WithoutPants commented Sep 21, 2021

gitgiggety commented Sep 21, 2021

gitgiggety commented Sep 21, 2021

WithoutPants commented Sep 21, 2021

gitgiggety commented Sep 21, 2021

kermieisinthehouse commented Sep 21, 2021

gitgiggety commented Sep 21, 2021

gitgiggety commented Sep 21, 2021 •

edited

gitgiggety commented Sep 21, 2021

WithoutPants commented Oct 1, 2021

Only group by if using having clauses #1750

Only group by if using having clauses #1750

Conversation

WithoutPants commented Sep 21, 2021

gitgiggety commented Sep 21, 2021

gitgiggety commented Sep 21, 2021

WithoutPants commented Sep 21, 2021

gitgiggety commented Sep 21, 2021

kermieisinthehouse commented Sep 21, 2021

gitgiggety commented Sep 21, 2021

gitgiggety commented Sep 21, 2021 • edited

gitgiggety commented Sep 21, 2021

WithoutPants commented Oct 1, 2021

gitgiggety commented Sep 21, 2021 •

edited