Improve account deletion performances further #15407
Conversation
In addition, we should be able to do smart things for …
```ruby
@account.statuses.reorder(nil).find_in_batches do |statuses|
  statuses.reject! { |status| reported_status_ids.include?(status.id) } if keep_account_record?
```

Suggested change:

```ruby
@account.statuses.reorder(nil).where.not(id: reported_status_ids).in_batches do |statuses|
```
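For readers following along, here is a minimal sketch contrasting the two idioms (variable names are taken from the diff above; the surrounding service code is assumed):

```ruby
# find_in_batches yields plain Ruby arrays of already-loaded records,
# so any filtering has to happen in Ruby:
@account.statuses.reorder(nil).find_in_batches do |statuses|
  statuses.reject! { |status| reported_status_ids.include?(status.id) } if keep_account_record?
  # ... per-batch processing on an Array ...
end

# in_batches yields an ActiveRecord::Relation per batch, so the filter
# can be pushed into SQL instead:
@account.statuses.reorder(nil).where.not(id: reported_status_ids).in_batches do |statuses|
  # `statuses` is a relation here; calling e.g. `statuses.delete_all`
  # issues one SQL statement per batch without instantiating records.
end
```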
The change is probably wrong here, because we don't use `statuses` as a relation. If you want to iterate over the relation, you have to call `each_record` according to the documentation, which we do not do.
`BatchEnumerator` is `Enumerable`, so I don't think that should be an issue.
It's not as obvious to me… The variable is accessed multiple times; what is the behaviour of this enumerable? What if, each time, something implicitly casts it with `to_a` and it re-queries the database? Why do the docs specify:

> NOTE: If you are going to iterate through each record, you should call `each_record` on the yielded `BatchEnumerator`:
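To illustrate the documented distinction (a sketch, not the PR's code): `in_batches` called without a block returns an `ActiveRecord::Batches::BatchEnumerator`, which exposes both per-relation and per-record iteration. `process` below is a hypothetical stand-in for per-record work.

```ruby
# Without a block, in_batches returns a BatchEnumerator:
batches = @account.statuses.reorder(nil).in_batches(of: 1_000)

# Enumerating it yields one ActiveRecord::Relation per batch, which
# suits bulk SQL operations; each enumeration runs the batch queries
# against the database again:
batches.each { |relation| relation.delete_all }

# Whereas iterating over individual records should go through
# each_record, per the documentation note:
batches.each_record { |status| process(status) }
```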
Hmm. I don't know. Well, I can change that; it doesn't appear to make a big difference (but a positive one nevertheless).
I think you mean …
force-pushed from 8e29cc9 to 52bdb9c
force-pushed from 2de4b7a to 8979c88
As in Mastodon proper, reblogs don't appear in public TLs
force-pushed from 8979c88 to d5da937
I think I'm about done with this PR. I don't really know where to trim more processing time regarding statuses (except maybe making …). There are lots of possible improvements in … All in all, here is a breakdown of the time spent deleting each of the accounts with the latest commit of this PR: …
```ruby
# references to.
redis.pipelined do
  reblogged_id_sets.each do |feed_id, future|
    future.value.each do |reblogged_id|
```
How does getting the future value interact with being inside another pipelined block? Is the benefit of pipelining not lost when we block on every iteration?
Hm, I haven't changed anything here (it's code moved from the cleanup scheduler).
There are two pipelines, one after the other; I think the benefit is that we can build and send the first one directly, and then wait for its responses while building the second one.
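A sketch of that two-pipeline shape (key names and variables here are illustrative, not Mastodon's actual ones; `redis` is a redis-rb client):

```ruby
reblogged_id_sets = {}

# First pipeline: only queues the reads. Each call returns a
# Redis::Future, whose value is not available until the pipeline is
# flushed at the end of the block.
redis.pipelined do
  feed_ids.each do |feed_id|
    reblogged_id_sets[feed_id] = redis.zrange("feed:#{feed_id}:reblogs", 0, -1)
  end
end

# Second pipeline: the first block has already been flushed, so
# future.value is resolved and does not block per-iteration; the writes
# queued here are in turn sent as a single batch.
redis.pipelined do
  reblogged_id_sets.each do |feed_id, future|
    future.value.each do |reblogged_id|
      redis.zrem("feed:#{feed_id}", reblogged_id)
    end
  end
end
```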
```diff
@@ -27,19 +27,14 @@ def call(statuses, **options)
   # transaction lock the database, but we use the delete method instead
   # of destroy to avoid all callbacks. We rely on foreign keys to
   # cascade the delete faster without loading the associations.
-  statuses_and_reblogs.each(&:delete)
+  statuses_and_reblogs.each_slice(50) { |slice| Status.where(id: slice.map(&:id)).delete_all }
```
I guess this is fine, since an account being deleted would likely not see other processes trying to access or update its statuses anymore. If there is no contention due to this, then the batches might as well be larger; I guess we'll find out in production how high it can be set.
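If the batch size does turn out to need tuning, pulling it into a constant is one obvious shape (a sketch, not part of the PR; the constant name is made up):

```ruby
# Hypothetical knob; 50 is the value the PR ships with.
DELETE_BATCH_SIZE = 50

statuses_and_reblogs.each_slice(DELETE_BATCH_SIZE) do |slice|
  Status.where(id: slice.map(&:id)).delete_all
end
```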
```diff
@@ -95,7 +85,7 @@ def unpush_from_public_timelines(status)
   redis.publish(status.local? ? 'timeline:public:local:media' : 'timeline:public:remote:media', payload)
 end

-  @tags[status.id].each do |hashtag|
+  status.tags.map { |tag| tag.name.mb_chars.downcase }.each do |hashtag|
```
Creating two arrays when you could just compute the name in the `each` block itself, I guess. A minor issue given the likely size of the arrays, though.
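What the reviewer seems to be suggesting, roughly (a sketch; the loop body is a placeholder for whatever the method does per hashtag):

```ruby
status.tags.each do |tag|
  # Compute the downcased name inline instead of building an
  # intermediate array with map:
  hashtag = tag.name.mb_chars.downcase
  # ... existing per-hashtag work, using `hashtag` directly ...
end
```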
Benchmark
Protocol
Before each test, the database is re-created with the same backup from my production instance (~single-user, a few local accounts for different purposes). Media files are not copied over.
The development environment does not have ElasticSearch enabled, and the database server runs locally.
The timing values are obtained by running the following code in a Rails console:
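The snippet itself was not preserved in this excerpt; code of roughly this shape would produce such timings (a sketch assuming Mastodon's `DeleteAccountService`; the output format is made up):

```ruby
require 'benchmark'

top_account_ids.each do |account_id|
  account = Account.find(account_id)
  # Time a full synchronous deletion of this account.
  elapsed = Benchmark.realtime { DeleteAccountService.new.call(account) }
  puts "#{account.acct}: #{elapsed.round(1)}s"
end
```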
`top_account_ids` contains the identifiers of the 5 accounts with the most toots known to my instance; they are broken down as follows: …

The accounts are deleted in that order and interact with one another, so an account being cleared will slightly lower the number of toots (reblogs) to be processed for the remaining accounts.
Results
The first few commits make a very sizable difference; after that, the benefits aren't as obvious in my example. Still, they make the logic a bit cleaner, and I expect them to make a slightly bigger difference on instances with more accounts interacting with the deleted ones.