"SELECT DISTINCT" on all queries is a performance sucker #4923

tractorcow · 2016-01-13T04:35:50Z

I'd rather identify every instance where distinct is absolutely necessary and resolve those individually. Using distinct on all queries to paper over poorly formed queries is lazy.

Possible implementation (from Sam)

My recommendation for this, as a minor release change:

Set DISTINCT to false by default.
Add a helper method enableGroupedDistinct() that adds a clause along the lines of GROUP BY 1,2,3,4,5... This will be preferred over the DISTINCT clause.
Whenever a non-linear relation is added via applyRelation(), ensure that enableGroupedDistinct() is called.

The specific queries called aren't part of our public API, only the data returned by them. So I don't think this will be a breaking change. We'd need to test that, of course.
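A minimal sketch of how the three steps above could fit together. This is not SilverStripe code: `QuerySketch`, `apply_relation()` and `enable_grouped_distinct()` are hypothetical Python stand-ins for the PHP methods named above, and the SQL is only assembled as a string.

```python
class QuerySketch:
    """Hypothetical stand-in for a query builder; not the SilverStripe API."""

    def __init__(self, table, columns):
        self.table = table
        self.columns = list(columns)
        self.joins = []
        self.distinct = False          # step 1: DISTINCT is off by default
        self.grouped_distinct = False

    def enable_grouped_distinct(self):
        # Step 2: de-duplicate with GROUP BY 1, 2, 3, ... instead of DISTINCT.
        self.grouped_distinct = True
        return self

    def apply_relation(self, join_sql, linear):
        # Step 3: a non-linear join can multiply rows, so it switches
        # grouped de-duplication on automatically.
        self.joins.append(join_sql)
        if not linear:
            self.enable_grouped_distinct()
        return self

    def sql(self):
        parts = ["SELECT"]
        if self.distinct and not self.grouped_distinct:
            parts.append("DISTINCT")
        parts.append(", ".join(self.columns))
        parts.append(f"FROM {self.table}")
        parts.extend(self.joins)
        if self.grouped_distinct:
            positions = ", ".join(str(i) for i in range(1, len(self.columns) + 1))
            parts.append(f"GROUP BY {positions}")
        return " ".join(parts)


plain = QuerySketch("Page", ["Page.ID", "Page.Title"])
joined = QuerySketch("Page", ["Page.ID", "Page.Title"]).apply_relation(
    "LEFT JOIN Comment ON Comment.PageID = Page.ID", linear=False
)
print(plain.sql())
print(joined.sql())
```

Only the query with the non-linear join pays for de-duplication; the plain query carries neither DISTINCT nor GROUP BY.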


chillu · 2017-05-01T20:38:06Z

@tractorcow Is this an API change? It's currently assigned against 4.0.0 stable

tractorcow · 2017-05-01T23:24:27Z

It will result in API breakages, as many parts of the ORM rely on distinct to enforce unique rows, e.g. a left join on one-to-many relations.
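To illustrate the point about one-to-many joins, here is a small sketch using Python's built-in `sqlite3` module rather than MySQL. The `Team` and `Player` tables are made up for the example; they are not SilverStripe tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Team (ID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Player (ID INTEGER PRIMARY KEY, TeamID INTEGER, Name TEXT);
    INSERT INTO Team VALUES (1, 'Red'), (2, 'Blue');
    INSERT INTO Player VALUES (1, 1, 'Ann'), (2, 1, 'Bob'), (3, 2, 'Cy');
""")

# Filtering through the has_many side joins every Player row to its Team.
join = (
    "FROM Team LEFT JOIN Player ON Player.TeamID = Team.ID "
    "WHERE Player.Name IS NOT NULL ORDER BY Team.ID"
)

without_distinct = conn.execute(f"SELECT Team.ID, Team.Title {join}").fetchall()
with_distinct = conn.execute(f"SELECT DISTINCT Team.ID, Team.Title {join}").fetchall()

print(without_distinct)  # team 1 appears once per matching player
print(with_distinct)     # one row per team
```

Code that expects one row per record would see team 1 twice if DISTINCT were simply dropped from this query.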

chillu · 2017-09-07T21:29:02Z

@tractorcow OK, moving this out of the 4.0 milestone; it doesn't seem realistic at this point.

tractorcow · 2018-01-23T22:36:43Z

In 4.0 we have put a lot of effort into reducing the need for distinct. For instance,

public function applyRelation($relation, $linearOnly = false)

$linearOnly = true will ensure that distinct isn't necessary.

In 5.x I suggest switching distinct to false by default and seeing if we can live with that as the new status quo.

DataQuery::initialiseQuery() calls setDistinct(true); it should pass false instead.

stojg · 2018-11-13T18:55:43Z

As an SRE/DevOps I am very much in favour of removing as many DISTINCTs as possible, since they often (always?) cause temporary tables, at least in MySQL:

Example from a DESCRIBE of a random query with distinct:

+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+
| id | select_type | table         | type   | possible_keys       | key         | key_len | ref                        | rows  | Extra                                               |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+
|  1 | SIMPLE      | BlogPost_Live | range  | PRIMARY,PublishDate | PublishDate | 6       | NULL                       | 28702 | Using index condition; Using where; Using temporary |
|  1 | SIMPLE      | SiteTree_Live | eq_ref | PRIMARY,ClassName   | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | Using where                                         |
|  1 | SIMPLE      | Page_Live     | eq_ref | PRIMARY             | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | NULL                                                |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+

Without distinct:

+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+
| id | select_type | table         | type   | possible_keys       | key         | key_len | ref                        | rows  | Extra                              |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+
|  1 | SIMPLE      | BlogPost_Live | range  | PRIMARY,PublishDate | PublishDate | 6       | NULL                       | 28702 | Using index condition; Using where |
|  1 | SIMPLE      | SiteTree_Live | eq_ref | PRIMARY,ClassName   | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | Using where                        |
|  1 | SIMPLE      | Page_Live     | eq_ref | PRIMARY             | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | NULL                               |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+

Notice how the top row loses the "Using temporary" entry in the Extra column.

From the MySQL docs:

Using temporary

To resolve the query, MySQL needs to create a temporary table to hold the result. This typically happens if the query contains GROUP BY and ORDER BY clauses that list columns differently.
Source: http://ftp.nchu.edu.tw/MySQL/doc/refman/5.0/en/using-explain.html
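The same kind of comparison can be reproduced locally. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module as a stand-in for MySQL's `EXPLAIN`, so the output format differs from the tables above, and the `Post` table is invented for the example. SQLite typically reports an extra temporary b-tree step for the DISTINCT variant, though the exact wording depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Post (ID INTEGER PRIMARY KEY, Title TEXT, PublishDate TEXT)"
)


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plain_plan = plan("SELECT Title FROM Post")
distinct_plan = plan("SELECT DISTINCT Title FROM Post")

print(plain_plan)
print(distinct_plan)
```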

(sorry about the close/reopen of the issue, it was a tab-enter mistake)

sminnee · 2018-11-13T21:15:03Z

DISTINCT could be removed from queries that don't add 1-to-many joins, notably those that don't use the dot-syntax in filters/sorts that reference has_many or many_many relations.

A lot of queries would be covered by this restriction, so it would be a useful optimisation.

Additionally, we could replace DISTINCT by GROUP BY [all, the fields] if we wanted, but would that actually help?

Finally, the filters/sorts on related data could potentially be refactored into subqueries, but again, I don't know if a subquery would have any benefit there.

If @stojg or others have any views on those last two questions, please post :-)
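For reference, this is roughly what the subquery variant looks like next to the join-plus-DISTINCT variant. It is a sketch on SQLite with invented `Team` and `Player` tables, and it only shows that both forms return the same rows; it says nothing about which one MySQL executes faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Team (ID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Player (ID INTEGER PRIMARY KEY, TeamID INTEGER, Active INTEGER);
    INSERT INTO Team VALUES (1, 'Red'), (2, 'Blue'), (3, 'Green');
    INSERT INTO Player VALUES (1, 1, 1), (2, 1, 1), (3, 2, 0);
""")

# Filter on related data via a join: needs DISTINCT to collapse duplicates.
join_rows = conn.execute("""
    SELECT DISTINCT Team.ID, Team.Title
    FROM Team JOIN Player ON Player.TeamID = Team.ID
    WHERE Player.Active = 1
    ORDER BY Team.ID
""").fetchall()

# Same filter as a correlated subquery: Team rows are never multiplied.
exists_rows = conn.execute("""
    SELECT Team.ID, Team.Title
    FROM Team
    WHERE EXISTS (
        SELECT 1 FROM Player
        WHERE Player.TeamID = Team.ID AND Player.Active = 1
    )
    ORDER BY Team.ID
""").fetchall()

print(join_rows)
print(exists_rows)
```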

stojg · 2018-11-14T21:13:17Z

I don't think a GROUP BY would help. According to the MySQL docs, the server creates internal temporary tables in cases such as:

Evaluation of statements that contain an ORDER BY clause and a different GROUP BY clause, or for which the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue.

Evaluation of DISTINCT combined with ORDER BY may require a temporary table.

https://dev.mysql.com/doc/refman/5.6/en/internal-temporary-tables.html

Subqueries could possibly work, but per the same documentation they can need temporary tables too, so it is a rather tricky thing:

Tables created for subquery or semi-join materialization (see Section 8.2.2, “Optimizing Subqueries, Derived Tables, and Views”).

tractorcow · 2018-11-14T21:31:33Z

@sminnee you can see in DataQuery::applyRelation() I have added a flag $linearOnly which causes an error in case a join is added that would require a distinct.

You could possibly refactor this logic to conditionally add DISTINCT to the query as needed, so it would silently self-optimise if you only ever added linear joins, but would add distinct when joining on many_many tables.

At the moment you can ONLY sort on linear relations. It's just a matter of applying the same logic to conditions as well. ;)

All we really need is the free performance when we know it's safe.
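A rough model of that self-optimising behaviour, written in Python rather than PHP. `DataQuerySketch`, its `apply_relation()` signature and the set of relation types treated as linear are all assumptions made for illustration; the real DataQuery::applyRelation() works differently in detail.

```python
class NonLinearRelationError(Exception):
    """Raised when a row-multiplying join is added in linear-only mode."""


class DataQuerySketch:
    """Hypothetical model of the idea; not the real DataQuery class."""

    # Assumption for this sketch: only has_one keeps one row per record.
    LINEAR_TYPES = {"has_one"}

    def __init__(self):
        self.distinct = False  # start optimistic: no DISTINCT
        self.relations = []

    def apply_relation(self, name, relation_type, linear_only=False):
        linear = relation_type in self.LINEAR_TYPES
        if not linear and linear_only:
            raise NonLinearRelationError(f"{name} is a {relation_type} relation")
        if not linear:
            self.distinct = True  # only now is de-duplication required
        self.relations.append(name)
        return self

    def select_keyword(self):
        return "SELECT DISTINCT" if self.distinct else "SELECT"


query = DataQuerySketch().apply_relation("Parent", "has_one")
linear_keyword = query.select_keyword()

query.apply_relation("Comments", "has_many")
non_linear_keyword = query.select_keyword()

try:
    DataQuerySketch().apply_relation("Tags", "many_many", linear_only=True)
    rejected = False
except NonLinearRelationError:
    rejected = True

print(linear_keyword, "|", non_linear_keyword, "|", rejected)
```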

sminnee · 2018-11-14T21:38:40Z

"At the moment you can ONLY sort on linear relations."

That may mean that this wouldn't work:

Group::get()->sort('Member.Count()')

Moving from DISTINCT to GROUP BY 1,2,3,4,5... might help with that.
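As a sketch of why grouping helps here: once the rows are de-duplicated with GROUP BY, an aggregate over the joined table becomes available for sorting. The example runs on SQLite via Python's `sqlite3`, and uses invented `Team` and `Player` tables in place of Group and Member (GROUP is a reserved word in SQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Team (ID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Player (ID INTEGER PRIMARY KEY, TeamID INTEGER);
    INSERT INTO Team VALUES (1, 'Red'), (2, 'Blue'), (3, 'Green');
    INSERT INTO Player VALUES (1, 2), (2, 2), (3, 2), (4, 1);
""")

# GROUP BY 1, 2 collapses the joined rows back to one row per team, and
# COUNT(Player.ID) is then usable as a sort key.
rows = conn.execute("""
    SELECT Team.ID, Team.Title
    FROM Team LEFT JOIN Player ON Player.TeamID = Team.ID
    GROUP BY 1, 2
    ORDER BY COUNT(Player.ID) DESC
""").fetchall()

print(rows)
```

With SELECT DISTINCT there is no group to aggregate over, so the same ORDER BY could not be expressed this way.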

sminnee · 2019-02-13T21:42:13Z

My recommendation for this, as a minor release change:

Set DISTINCT to false by default.
Add a helper method enableGroupedDistinct() that adds a clause along the lines of GROUP BY 1,2,3,4,5... This will be preferred over the DISTINCT clause.
Whenever a non-linear relation is added via applyRelation(), ensure that enableGroupedDistinct() is called.

The specific queries called aren't part of our public API, only the data returned by them. So I don't think this will be a breaking change. We'd need to test that, of course.

tractorcow · 2019-03-04T23:08:40Z

I feel like a method that adds a join and a group-by in a single atomic action would be a great idea. An aggregate API (such as Member.Count()) could piggy-back off this grouped join. Something like ->joinAggregation()? Join on table, add an aggregate column, group by other selected columns. :P
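One possible shape for such a helper, sketched in Python against SQLite. The function name `join_aggregation()` and all of its parameters are hypothetical, as are the `Team` and `Player` tables; nothing like this exists in the ORM today as far as this thread shows.

```python
import sqlite3


def join_aggregation(base, columns, join_table, on, aggregate, alias):
    """Build one grouped join: join, add an aggregate column, group by the rest.

    Identifiers are interpolated directly, so this sketch is only safe with
    trusted, hard-coded names.
    """
    select_list = ", ".join(columns + [f"{aggregate} AS {alias}"])
    group_by = ", ".join(columns)
    return (
        f"SELECT {select_list} FROM {base} "
        f"LEFT JOIN {join_table} ON {on} "
        f"GROUP BY {group_by}"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Team (ID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Player (ID INTEGER PRIMARY KEY, TeamID INTEGER);
    INSERT INTO Team VALUES (1, 'Red'), (2, 'Blue');
    INSERT INTO Player VALUES (1, 1), (2, 1);
""")

sql = join_aggregation(
    base="Team",
    columns=["Team.ID", "Team.Title"],
    join_table="Player",
    on="Player.TeamID = Team.ID",
    aggregate="COUNT(Player.ID)",
    alias="PlayerCount",
)
rows = conn.execute(sql + " ORDER BY Team.ID").fetchall()
print(rows)
```

Because the join and the GROUP BY are emitted together, the caller cannot add the row-multiplying join without also getting the de-duplication.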

nfauchelle · 2021-04-03T09:48:52Z

Ran into this issue while I was trying to work out why some queries were taking a while.

If I hack the core to turn off DISTINCT by default, it makes a big difference: the queries I tested (which can be as simple as Page::get()->filter(on single item)->limit(x)) are 7-8 times faster without the DISTINCT.

I can't seem to work out a nice way to turn this off per query - looks like an all or nothing approach via an extension.

Is this still being pursued?

tractorcow added this to the 4.0.0 milestone Jan 13, 2016

chillu added the affects/v4 label Jun 2, 2016

sminnee added the type/enhancement label Sep 21, 2016

tractorcow mentioned this issue Apr 30, 2017

Fluent extension causes Errors with MySQL version >= 5.7.5 tractorcow-farm/silverstripe-fluent#257

Open

chillu added affects/v3 type/api-break complexity/high impact/high labels May 19, 2017

chillu removed this from the Recipe 4.0.0 milestone Sep 7, 2017

tractorcow removed the affects/v3 label Jan 23, 2018

stojg closed this as completed Nov 13, 2018

stojg reopened this Nov 13, 2018

sminnee mentioned this issue Feb 13, 2019

FIX Filtering the version history by a specific date is now more performant in large datasets silverstripe/silverstripe-versioned#213

Closed

sminnee added change/minor and removed type/api-break labels Feb 13, 2019

tractorcow mentioned this issue Apr 11, 2019

FIX Implemented various options to improve archive query performance silverstripe/silverstripe-versioned#225

Merged

ScopeyNZ mentioned this issue Apr 15, 2019

FIX Calculate threshold condition with SQL rather than PHP #8919

Closed

emteknetnz removed the change/minor label Jun 20, 2022
