Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Status.group instead of Status.distinct in HashQueryService #14662

Merged
merged 1 commit into from
Aug 25, 2020

Conversation

akihikodaki
Copy link
Contributor

Hi, it's been a while since I made a pull request the last time.

Pawoo is finally updated to 2.9.4 today and we are appreciating lots of improvements. I want to contribute a change made for Pawoo in the process of updating the instance.

The below is the description of this change which is also in the commit message:

DISTINCT clause removes duplicated records according to all the selected attributes. In reality, it can remove duplicated records only looking at statuses.id, but the clause confuses the query planner and yields insufficient performance.
The behavior is also problematic if the scope produced by HashQueryService is used to query columns without id (using pluck method, for example). The scope is expected to contain unique statuses, but the uniquness will be evaluated with some arbitrary columns other than id.

GROUP BY clause resolves those problem by explicitly specifying the column to take into account for the record distinction.

A workaround for the problem of DISTINCT clause in Api::V1::Timelines::TagController is no longer necessary and removed.

DISTINCT clause removes duplicated records according to all the selected
attributes. In reality, it can remove duplicated records only looking at
statuses.id, but the clause confuses the query planner and yields
insufficient performance.
The behavior is also problematic if the scope produced by HashQueryService
is used to query columns without id (using pluck method, for example). The
scope is expected to contain unique statuses, but the uniquness will be
evaluated with some arbitrary columns other than id.

GROUP BY clause resolves those problem by explicitly specifying the
column to take into account for the record distinction.

A workaround for the problem of DISTINCT clause in
Api::V1::Timelines::TagController is no longer necessary and removed.
@akihikodaki akihikodaki added the performance Runtime performance label Aug 25, 2020
@Gargron Gargron merged commit 41eeb9e into mastodon:master Aug 25, 2020
Copy link
Contributor

@ClearlyClaire ClearlyClaire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused by this and I have a hard time imagining the final query.

Something I notice, though, is that we use distinct here and there with a column name as parameter, as if it worked the same way as group. This may be because we got confused by a different distinct method in older versions of Rails:
https://apidock.com/rails/ActiveRecord/ConnectionAdapters/SchemaStatements/distinct

Anyway, this PR shouldn't hurt, and we should look into our other uses of distinct throughout the code.

@akihikodaki
Copy link
Contributor Author

I am a bit confused by this and I have a hard time imagining the final query.

Here is a query for Api::V1::Timelines::TagController when parameter only_media is falsy.

SELECT "statuses"."id", "statuses"."updated_at" FROM "statuses" INNER JOIN "statuses_tags" ON "statuses_tags"."status_id" = "statuses"."id" LEFT OUTER JOIN "accounts" ON "accounts"."id" = "statuses"."account_id" WHERE "statuses"."visibility" = $1 AND (statuses.reblog_of_id IS NULL) AND "statuses_tags"."tag_id" = $2 AND "accounts"."silenced_at" IS NULL GROUP BY "statuses"."id" ORDER BY "statuses"."id" DESC, "statuses"."id" DESC LIMIT $3

Something I notice, though, is that we use distinct here and there with a column name as parameter, as if it worked the same way as group. This may be because we got confused by a different distinct method in older versions of Rails:
https://apidock.com/rails/ActiveRecord/ConnectionAdapters/SchemaStatements/distinct

Anyway, this PR shouldn't hurt, and we should look into our other uses of distinct throughout the code.

This is an insightful comment. I didn't have a careful look at the removed distinct(:id) because I could simply remove it in this case.

thenameisnigel-old pushed a commit to ChatterlyOSE/Chatterly that referenced this pull request Sep 6, 2020
…odon#14662)

DISTINCT clause removes duplicated records according to all the selected
attributes. In reality, it can remove duplicated records only looking at
statuses.id, but the clause confuses the query planner and yields
insufficient performance.
The behavior is also problematic if the scope produced by HashQueryService
is used to query columns without id (using pluck method, for example). The
scope is expected to contain unique statuses, but the uniquness will be
evaluated with some arbitrary columns other than id.

GROUP BY clause resolves those problem by explicitly specifying the
column to take into account for the record distinction.

A workaround for the problem of DISTINCT clause in
Api::V1::Timelines::TagController is no longer necessary and removed.
thenameisnigel-old pushed a commit to ChatterlyOSE/Chatterly that referenced this pull request Sep 7, 2020
…odon#14662)

DISTINCT clause removes duplicated records according to all the selected
attributes. In reality, it can remove duplicated records only looking at
statuses.id, but the clause confuses the query planner and yields
insufficient performance.
The behavior is also problematic if the scope produced by HashQueryService
is used to query columns without id (using pluck method, for example). The
scope is expected to contain unique statuses, but the uniquness will be
evaluated with some arbitrary columns other than id.

GROUP BY clause resolves those problem by explicitly specifying the
column to take into account for the record distinction.

A workaround for the problem of DISTINCT clause in
Api::V1::Timelines::TagController is no longer necessary and removed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Runtime performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants