Use Status.group instead of Status.distinct in HashQueryService #14662

akihikodaki · 2020-08-25T11:01:14Z

Hi, it's been a while since I made a pull request the last time.

Pawoo is finally updated to 2.9.4 today and we are appreciating lots of improvements. I want to contribute a change made for Pawoo in the process of updating the instance.

The below is the description of this change which is also in the commit message:

DISTINCT clause removes duplicated records according to all the selected attributes. In reality, it can remove duplicated records only looking at statuses.id, but the clause confuses the query planner and yields insufficient performance.
The behavior is also problematic if the scope produced by HashQueryService is used to query columns without id (using pluck method, for example). The scope is expected to contain unique statuses, but the uniquness will be evaluated with some arbitrary columns other than id.

GROUP BY clause resolves those problem by explicitly specifying the column to take into account for the record distinction.

A workaround for the problem of DISTINCT clause in Api::V1::Timelines::TagController is no longer necessary and removed.

DISTINCT clause removes duplicated records according to all the selected attributes. In reality, it can remove duplicated records only looking at statuses.id, but the clause confuses the query planner and yields insufficient performance. The behavior is also problematic if the scope produced by HashQueryService is used to query columns without id (using pluck method, for example). The scope is expected to contain unique statuses, but the uniquness will be evaluated with some arbitrary columns other than id. GROUP BY clause resolves those problem by explicitly specifying the column to take into account for the record distinction. A workaround for the problem of DISTINCT clause in Api::V1::Timelines::TagController is no longer necessary and removed.

ClearlyClaire

I am a bit confused by this and I have a hard time imagining the final query.

Something I notice, though, is that we use distinct here and there with a column name as parameter, as if it worked the same way as group. This may be because we got confused by a different distinct method in older versions of Rails:
https://apidock.com/rails/ActiveRecord/ConnectionAdapters/SchemaStatements/distinct

Anyway, this PR shouldn't hurt, and we should look into our other uses of distinct throughout the code.

akihikodaki · 2020-08-25T12:54:01Z

I am a bit confused by this and I have a hard time imagining the final query.

Here is a query for Api::V1::Timelines::TagController when parameter only_media is falsy.

SELECT "statuses"."id", "statuses"."updated_at" FROM "statuses" INNER JOIN "statuses_tags" ON "statuses_tags"."status_id" = "statuses"."id" LEFT OUTER JOIN "accounts" ON "accounts"."id" = "statuses"."account_id" WHERE "statuses"."visibility" = $1 AND (statuses.reblog_of_id IS NULL) AND "statuses_tags"."tag_id" = $2 AND "accounts"."silenced_at" IS NULL GROUP BY "statuses"."id" ORDER BY "statuses"."id" DESC, "statuses"."id" DESC LIMIT $3

Something I notice, though, is that we use distinct here and there with a column name as parameter, as if it worked the same way as group. This may be because we got confused by a different distinct method in older versions of Rails:
https://apidock.com/rails/ActiveRecord/ConnectionAdapters/SchemaStatements/distinct

Anyway, this PR shouldn't hurt, and we should look into our other uses of distinct throughout the code.

This is an insightful comment. I didn't have a careful look at the removed distinct(:id) because I could simply remove it in this case.

…odon#14662) DISTINCT clause removes duplicated records according to all the selected attributes. In reality, it can remove duplicated records only looking at statuses.id, but the clause confuses the query planner and yields insufficient performance. The behavior is also problematic if the scope produced by HashQueryService is used to query columns without id (using pluck method, for example). The scope is expected to contain unique statuses, but the uniquness will be evaluated with some arbitrary columns other than id. GROUP BY clause resolves those problem by explicitly specifying the column to take into account for the record distinction. A workaround for the problem of DISTINCT clause in Api::V1::Timelines::TagController is no longer necessary and removed.

akihikodaki added the performance Runtime performance label Aug 25, 2020

Gargron approved these changes Aug 25, 2020

View reviewed changes

Gargron merged commit 41eeb9e into mastodon:master Aug 25, 2020

ClearlyClaire reviewed Aug 25, 2020

View reviewed changes

akihikodaki mentioned this pull request Aug 28, 2020

Replace incorrect use of distinct with group #14675

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Status.group instead of Status.distinct in HashQueryService #14662

Use Status.group instead of Status.distinct in HashQueryService #14662

akihikodaki commented Aug 25, 2020

ClearlyClaire left a comment

akihikodaki commented Aug 25, 2020

Use Status.group instead of Status.distinct in HashQueryService #14662

Use Status.group instead of Status.distinct in HashQueryService #14662

Conversation

akihikodaki commented Aug 25, 2020

ClearlyClaire left a comment

Choose a reason for hiding this comment

akihikodaki commented Aug 25, 2020