Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"SELECT DISTINCT" on all queries is a performance sucker #4923

Open
tractorcow opened this issue Jan 13, 2016 · 12 comments
Open

"SELECT DISTINCT" on all queries is a performance sucker #4923

tractorcow opened this issue Jan 13, 2016 · 12 comments

Comments

@tractorcow
Copy link
Contributor

tractorcow commented Jan 13, 2016

I'd rather address and resolve all instances where using distinct is absolutely necessary, and resolve those. Using distinct on all queries to solve poorly formed queries is lazy.


Possible implementation (from Sam)

My recommendation for this, as a minor release change:

  • Set DISTINCT to false by default
  • Add a helper method enableGroupedDistinct() that adds a clause along the lines of GROUP BY 1,2,3,4,5... . This will be preferred over the DISTINCT clause
  • Whenever a non-linear relation is added via applyRelation(), ensure that enableGroupDistinct() is called.

The specific queries called aren't part if our public API, only the data returned by them. So I don't think this will be a breaking change. We'd need to test that, of course.

@chillu
Copy link
Member

chillu commented May 1, 2017

@tractorcow Is this an API change? It's currently assigned against 4.0.0 stable

@tractorcow
Copy link
Contributor Author

It will result in API breakages as many parts of the ORM rely on distinct to enforce unique rows. e.g. left join on one_to_many relations.

@chillu
Copy link
Member

chillu commented Sep 7, 2017

@tractorcow OK moving this out of the 4.0 milestone, doesn't seem realistic at this point.

@chillu chillu removed this from the Recipe 4.0.0 milestone Sep 7, 2017
@tractorcow
Copy link
Contributor Author

tractorcow commented Jan 23, 2018

In 4.0 we have made a lot of effort in reducing the need for distinct. For instance,

public function applyRelation($relation, $linearOnly = false)

$linearOnly = true will ensure that distinct isn't necessary.

In 5.x I suggest switching distinct to false by default and seeing if we can live with that as the new status quo.

DataQuery::initialiseQuery() calls setDistinct(true), which should be off instead.

@stojg
Copy link

stojg commented Nov 13, 2018

As a SRE/DevOps I am in very much in favour of removing as many DISTINCTs as possible since they often (always?) causes temporary tables in at least MySQL:

Example from a DESCRIBE of a random query with distinct:

+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+
| id | select_type | table         | type   | possible_keys       | key         | key_len | ref                        | rows  | Extra                                               |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+
|  1 | SIMPLE      | BlogPost_Live | range  | PRIMARY,PublishDate | PublishDate | 6       | NULL                       | 28702 | Using index condition; Using where; Using temporary |
|  1 | SIMPLE      | SiteTree_Live | eq_ref | PRIMARY,ClassName   | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | Using where                                         |
|  1 | SIMPLE      | Page_Live     | eq_ref | PRIMARY             | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | NULL                                                |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+-----------------------------------------------------+

Without distinct:

+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+
| id | select_type | table         | type   | possible_keys       | key         | key_len | ref                        | rows  | Extra                              |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+
|  1 | SIMPLE      | BlogPost_Live | range  | PRIMARY,PublishDate | PublishDate | 6       | NULL                       | 28702 | Using index condition; Using where |
|  1 | SIMPLE      | SiteTree_Live | eq_ref | PRIMARY,ClassName   | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | Using where                        |
|  1 | SIMPLE      | Page_Live     | eq_ref | PRIMARY             | PRIMARY     | 4       | SS_mysite.BlogPost_Live.ID |     1 | NULL                               |
+----+-------------+---------------+--------+---------------------+-------------+---------+----------------------------+-------+------------------------------------+

Notice how the top row loses the Using temporary in the Extra column

From the MySQL docs:

Using temporary

To resolve the query, MySQL needs to create a temporary table to hold the result. This typically > happens if the query contains GROUP BY and ORDER BY clauses that list columns differently.
Source: http://ftp.nchu.edu.tw/MySQL/doc/refman/5.0/en/using-explain.html

(sorry about the close/repoen of the issue, it was an tab-enter mistake)

@stojg stojg closed this as completed Nov 13, 2018
@stojg stojg reopened this Nov 13, 2018
@sminnee
Copy link
Member

sminnee commented Nov 13, 2018

DISTINCT could be removed from queries that don't add 1-to-many joins, notably those that don't use the dot-syntax in filters/sorts that reference has_many or many_many relations.

A lot of queries would be covered by this restriction, so it would be a useful optimisation.

Additionally, we could replace DISTINCT by GROUP BY [all, the fields] if we wanted, but would that actually help?

Finally, the filters/sorts on related data could potentially be refactored into subqueries, but again, I don't know if a subquery would have any benefit there.

If @stojg or others have any views on those 2nd questions, please post :-)

@stojg
Copy link

stojg commented Nov 14, 2018

I don't think a group by would help.. the following is sort of how it works according to MySQL

Evaluation of statements that contain an ORDER BY clause and a different GROUP BY clause, or for which the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue.

Evaluation of DISTINCT combined with ORDER BY may require a temporary table.

https://dev.mysql.com/doc/refman/5.6/en/internal-temporary-tables.html

Subqueries could possibly work, but as per above documentation.. so it is a rather tricky thing

Tables created for subquery or semi-join materialization (see Section 8.2.2, “Optimizing Subqueries, Derived Tables, and Views”).

@tractorcow
Copy link
Contributor Author

tractorcow commented Nov 14, 2018

@sminnee you can see in DataQuery::applyRelation() I have added a flag $linearOnly which causes an error in case a join is added that would require a distinct.

You could possibly refactor this logic to conditionally add DISTINCT to the query as needed, so it would silently self-optimise if you only ever added linear joins, but would add distinct when joining on many_many tables.

At the moment you can ONLY sort on linear relations. It's just a matter of applying the same logic to conditions as well. ;)

All we really need is the free performance when we know it's safe.

@sminnee
Copy link
Member

sminnee commented Nov 14, 2018

At the moment you can ONLY sort on linear relations.

That may mean that this wouldn't work:

Group::get()->sort('Member.Count()')

Moving from DISTINCT to GROUP BY 1,2,3,4,5... might help with that.

@sminnee
Copy link
Member

sminnee commented Feb 13, 2019

My recommendation for this, as a minor release change:

  • Set DISTINCT to false by default
  • Add a helper method enableGroupedDistinct() that adds a clause along the lines of GROUP BY 1,2,3,4,5... . This will be preferred over the DISTINCT clause
  • Whenever a non-linear relation is added via applyRelation(), ensure that enableGroupDistinct() is called.

The specific queries called aren't part if our public API, only the data returned by them. So I don't think this will be a breaking change. We'd need to test that, of course.

@tractorcow
Copy link
Contributor Author

tractorcow commented Mar 4, 2019

I feel like a method that adds a join and a group-by in a single atomic action would be a great idea. An aggregate API (such as Member.Count()) could piggy-back off this grouped join. Something like ->joinAggregation()? Join on table, add an aggregate column, group by other selected columns. :P

@nfauchelle
Copy link
Contributor

Ran into this issue while I was trying to work out why some queries were taking a while.

If I hack the core to turn off DISTINCT by default it makes a big different, queries I tested (which can be simple Page::get()->filter(on single item)->limit(x)) types) are 7-8 times faster without the Distinct.

I can't seem to work out a nice way to turn this off per query - looks like an all or nothing approach via an extension.

Is this still being pursued?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants