-
-
Notifications
You must be signed in to change notification settings - Fork 6.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize direct timeline #7614
optimize direct timeline #7614
Conversation
Thank you for your effort to address the issue. Query and index optimization are for different purposes, and if you see improvements in both, probably you would need index optimization too. In case of your query, I tested an index after some discussion with you on Mastodon. In case of Before applying an index:
Applying the index:
After applying the index and restarting PostgreSQL to invalidate caches on memory:
Be aware that the account (and the database, too, as it is the only account on the instance) does not have that much direct statuses. It should use merge join instead of hash join for accounts with many direct statuses, but I'm not sure whether PostgreSQL planner is wise enough. The index should also be tested on accounts with many direct statuses. If not, this change would help PostgreSQL adopting a more optimal plan. |
app/models/status.rb
Outdated
@@ -196,6 +196,21 @@ def as_direct_timeline(account) | |||
apply_timeline_filters(query, account, false) | |||
end | |||
|
|||
def as_direct_timeline_from_me(account) | |||
query = where("statuses.account_id = #{account.id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is better
query = where(account_id: account.id)`
app/models/status.rb
Outdated
|
||
def as_direct_timeline_to_me(account) | ||
query = joins("LEFT OUTER JOIN mentions ON statuses.id = mentions.status_id AND mentions.account_id = #{account.id}") | ||
.where("mentions.account_id = #{account.id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate conditions of mentions.account_id
app/models/status.rb
Outdated
end | ||
|
||
def as_direct_timeline_to_me(account) | ||
query = joins("LEFT OUTER JOIN mentions ON statuses.id = mentions.status_id AND mentions.account_id = #{account.id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is better
query = Status.joins(:mentions).merge(Mention.where(account_id: account.id))
(a1 + a2) | ||
.sort { |a, b| b.id <=> a.id } | ||
.uniq(&:id) | ||
.slice(0, limit_param(DEFAULT_STATUSES_LIMIT)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using take
is simple.
.take(limit_param(DEFAULT_STATUSES_LIMIT))
a1 = cache_collection direct_statuses_from_me, Status | ||
a2 = cache_collection direct_statuses_to_me, Status | ||
(a1 + a2) | ||
.sort { |a, b| b.id <=> a.id } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to use sort_by
.
.sort_by(&:id).reverse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as_direct_timeline
is unnecessary and should be deleted.
@abcang, all fixed. @akihikodaki , I do not agree with "There is no room for improvement in the query plan." Merge Join is used in queries before applying this PR, but it Sorts the rows that is much larger than limit. There is no sort of mentions in the query after applying this PR. The number of unused rows is somewhat reduced. It is not included in this PR, but if you devise more, you can add the pagenation condition not only for statuses.id but also for mentions.status_id . You may be able to reduce the unused rows in scanning mentions. pagenation to statuses.id
pagenation to statuses.id and mentions.status_id
Heap Fetches for mentions is decreased from 629 to 469. |
I guess you are talking about: https://gist.github.com/tateisu/cc6bff006b898094262245491b631f2f
Here you have to pay attention to two parts, which have different reasons why they query so many statuses:
The former one has tons of rows because it does not have
The problem is solved in my index. I don't know about the later one, so it is concerning also for me. Ideally it should not have extra |
currently my instance has both of
and still statuses_dm is used for query that before apply this PR. After applying this PR, _from_me part will be like as:
And _to_me part will be like as:
The query plan becomes very simple and the query plan will be more stable. (wasteful reading of mentions is still exists, but this will not improve unless we change the table structure.) |
How do you imagine the table structure needs to be changed? |
I wonder if there would be a benefit in storing mentioned account ids in an array column instead of separate table. (History: there was no "notifications" at the beginning, so instead there was a "mentions" column/API, so it made sense to have a separate table. Nowadays nothing queries the mentions table directly) |
to avoid waste reading of mentions table/index (1) add visibility column to mentions.
or add direct column to mentions.
(2) change pagenation to mentions.status_id instead of statuses.id (3) change _to_me query like as:
then waste reading will be eliminated.
|
@Gargron If the mentions table is structured to have an array data type of account for one status, it would be difficult to create an index equivalent to "index_mentions_on_account_id_and_status_id on mentions (account_id, status_id)". |
I think that all my ideas were implemented. Please let us know if you have any comments. |
after applied this RP, this query is used to get statuses that mentions to the user.
|
I've check query log by customizing PostgreSQL's log_statement. the constant parameter such as "where visibility=3" is changed to bind parameter for prepared statements by Rails's default behavior. then PostgreSQL can't use partial index because that decides which index is used when statement is prepared, this timing PostgreSQL can't know about bind parameter. to avoid this, set PREPARED_STATEMENTS=false in .env.production. |
Now using "arel_table" to disable "bind parameter for constant condition". query log
constant conditions (direct,silenced,visibility) is not parameterized. |
Fixed all things suggested at Mastodon from @abcang . |
commits rebased. |
index size example: statuses 2.71GB mentions 183MB (after adding "direct" column ) accounts 111MB |
I did not intend to increase this PR at the first stage. |
I feel that This PR is too big and Premature optimization. My PR #7633 is about x100 faster than before the commit. At first, we should merge without changing tables and release as rc. |
Even if I divide the PR, I think that I keep the following items. |
About this case, I will wait for the judgment by project collaborators. |
I thought about it now. Direct message TL is close to home + notification TL rather than public TL.
please comment about this. |
|
in commit be1d38a and e66b427, I've temporary revoke additional column and partial index. migration IDs are:
I believe there is no problems in additional column and partial index for performance and disk space, but some users hate it. |
cache_collection direct_statuses, Status | ||
a1 = cache_collection direct_statuses_from_me, Status | ||
a2 = cache_collection direct_statuses_to_me, Status | ||
(a1 + a2).uniq(&:id).sort_by(&:id).reverse.take(limit_param(DEFAULT_STATUSES_LIMIT)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is, will the two collections ever be used separately? If not, it doesn't make sense to put this logic into a controller, and should rather become either a model method or a lib finder class. I also see no need to cache_collection them separately, since cache_collection takes (id, updated_at)
objects and turns them into full statuses, you could use that on the end result array, e.g. keep cache_collection in the controller while the business logic is somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've fix it.
Sorry for late reply. It is natural, but I still don't think
I'm not expecting his kind of query will be stable. The best plan varies by the scale of the table. |
@akihikodaki this PR currently does not contains index, but if I prepare the next PR, statuses_dm_account and mentions_dm will be included. after applied this PR, statuses_dm is used in _to_me part ,that checks visibility=3 and get updated_at with index only scan, but index only scan is not mandatory. In case of DM TL, I recognize that the instance requiring query optimization is the majority. Your case is rather rare. |
app/models/status.rb
Outdated
|
||
apply_timeline_filters(query, account, false) | ||
result = (query_from_me.to_a + query_to_me.to_a).uniq(&:id).sort_by(&:id).reverse.take(limit) | ||
logger.debug result.inspect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.debug
is left
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
@Gargron Is this acceptable? |
Now Status.as_direct_timeline returns ActiveRecord::Relation if cache_ids method argument is false. |
StatusThreadingConcern also does not return relation, that is fine |
* optimize direct timeline * fix typo in class name * change filter condition for direct timeline * fix codestyle issue * revoke index_accounts_not_silenced because direct timeline does not use it. * revoke index_accounts_not_silenced because direct timeline does not use it. * fix rspec test condition. * fix rspec test condition. * fix rspec test condition. * revoke adding column and partial index * (direct timeline) move merging logic to model * fix pagination parameter * add method arguments that switches return array of status or cache_ids * fix order by * returns ActiveRecord.Relation in default behavor * fix codestyle issue
* optimize direct timeline * fix typo in class name * change filter condition for direct timeline * fix codestyle issue * revoke index_accounts_not_silenced because direct timeline does not use it. * revoke index_accounts_not_silenced because direct timeline does not use it. * fix rspec test condition. * fix rspec test condition. * fix rspec test condition. * revoke adding column and partial index * (direct timeline) move merging logic to model * fix pagination parameter * add method arguments that switches return array of status or cache_ids * fix order by * returns ActiveRecord.Relation in default behavor * fix codestyle issue
* optimize direct timeline * fix typo in class name * change filter condition for direct timeline * fix codestyle issue * revoke index_accounts_not_silenced because direct timeline does not use it. * revoke index_accounts_not_silenced because direct timeline does not use it. * fix rspec test condition. * fix rspec test condition. * fix rspec test condition. * revoke adding column and partial index * (direct timeline) move merging logic to model * fix pagination parameter * add method arguments that switches return array of status or cache_ids * fix order by * returns ActiveRecord.Relation in default behavor * fix codestyle issue
The direct timeline is mix of DM from user and DM to user.
This PR separates query to pair of _from_me and _to_me, then merge after querying for performance.
Also this PR does:
Temporary revoked:
some information that is NOT included in this PR.
using union for merge 2 query on DB side
https://gist.github.com/tateisu/a7de7a8ff816e83ea2894e599499849d
adding partial index for statuses
https://gist.github.com/tateisu/cc6bff006b898094262245491b631f2f