
Improve account deletion performances further #15407

Merged

Conversation

@ClearlyClaire (Contributor) commented Dec 22, 2020

Benchmark

Protocol

Before each test, the database is re-created with the same backup from my production instance (~single-user, a few local accounts for different purposes). Media files are not copied over.

The development environment does not have Elasticsearch enabled, and the database server runs locally.

The timing values are obtained by running the following code in a Rails console:

Benchmark.measure do
  top_account_ids.each do |account_id|
    DeleteAccountService.new.call(Account.find(account_id), skip_activitypub: true)
  end
end

top_account_ids contains the identifiers of the 5 accounts with the most toots known to my instance; they break down as follows:

| # | total toots | public toots | DMs | local? | favourites | reported toots |
|---|------------:|-------------:|----:|--------|-----------:|---------------:|
| 1 | 105207 | 13044 | 48 | no | 1054 | 1 |
| 2 | 73196 | 66623 | 0 | no | 369 | 0 |
| 3 | 57971 | 19237 | 329 | no | 1921 | 0 |
| 4 | 56710 | 4067 | 1255 | yes | 37959 | 0 |
| 5 | 54403 | 31091 | 56 | no | 3148 | 0 |

The accounts are deleted in that order, and they interact with one another, so clearing one account slightly lowers the number of toots (reblogs) left to process for the following ones.

Results

| commit | user | system | total | real |
|--------|------|--------|-------|------|
| 2334add (without this PR) | 750.049139s (12min30s) | 55.861878s | 805.911017s (13min26s) | 2624.943940s (43min45s) |
| 8c70b01 (deletes statuses by batches of 50 instead of individually) | 606.902921s (10min07s) | 41.281127s | 648.184048s (10min48s) | 1549.474583s (25min50s) |
| e20b52e (do not precompute values that are only used once) | 596.145949s (9min56s) | 42.301746s | 638.447695s (10min38s) | 1527.635537s (25min27s) |
| fb8ad0f (do not generate redis events for toots older than two weeks) | 479.127842s (7min59s) | 40.115994s | 519.243836s (8min39s) | 1448.805232s (24min09s) |
| eff219a (filter reported toots a priori) | 476.933216s (7min57s) | 39.721144s | 516.654360s (8min37s) | 1424.850713s (23min45s) |
| fc6a592 (do not process reblogs when cleaning up public timelines) | 477.414900s (7min57s) | 39.666471s | 517.081371s (8min37s) | 1426.631086s (23min47s) |
| d5da937 (clean up deleted account's feed in one go) | 477.063304s (7min57s) | 37.829388s | 514.892692s (8min35s) | 1413.292442s (23min33s) |
| f57ed40 (delete rather than destroy a few more associations, tweak preloading) | 474.929086s (7min55s) | 38.475544s | 513.404630s (8min33s) | 1415.174932s (23min35s) |

The first few commits make a very sizable difference; after that, the benefits are less obvious in my example. Still, they make the logic a bit cleaner, and I expect them to make a slightly bigger difference on instances with more accounts interacting with the deleted ones.

@ClearlyClaire (Contributor, Author) commented:
In addition, we should be able to do smart things for favourites and bookmarks, but I'll have to read up on how Chewy works to update the indexes efficiently.

-@account.statuses.reorder(nil).find_in_batches do |statuses|
-  statuses.reject! { |status| reported_status_ids.include?(status.id) } if keep_account_record?
+@account.statuses.reorder(nil).where.not(id: reported_status_ids).in_batches do |statuses|
@Gargron (Member) commented:
The change is probably wrong here, because we don't use statuses as a relation. If you want to iterate over the relation, you have to call each_record according to the documentation, which we do not do.

@ClearlyClaire (Contributor, Author) replied:
BatchEnumerator is Enumerable, so I don't think that should be an issue.

@Gargron (Member) replied:

It's not as obvious to me... The variable is accessed multiple times; what is the behaviour of this enumerable? What if something implicitly calls to_a on it each time, re-querying the database? Why do the docs specify:

> NOTE: If you are going to iterate through each record, you should call each_record on the yielded BatchEnumerator:

@ClearlyClaire (Contributor, Author) replied:

Hmm, I don't know. Well, I can change that; it doesn't appear to make a big difference (but a positive one nevertheless).
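
(For context, a minimal sketch of the distinction under discussion; this is not code from the PR, and process is just a stand-in. find_in_batches yields loaded Arrays, in_batches yields relations that re-query on each enumeration, and each_record is the documented way to iterate single records from a block-less in_batches.)

# find_in_batches yields plain Arrays of already-loaded records;
# they can be enumerated repeatedly without extra queries.
account.statuses.reorder(nil).find_in_batches do |statuses|
  statuses.each { |status| process(status) }
end

# in_batches yields ActiveRecord::Relation objects; enumerating one
# runs a SELECT, so enumerating it twice queries the database twice.
# delete_all on the relation never loads the records at all.
account.statuses.reorder(nil).in_batches do |relation|
  relation.delete_all
end

# Without a block, in_batches returns a BatchEnumerator, and
# each_record iterates individual records across all batches.
account.statuses.reorder(nil).in_batches.each_record do |status|
  process(status)
end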

@Gargron (Member) commented Dec 22, 2020

> In addition, we should be able to do smart things for favourites and bookmarks

I think you mean favourites, active_relationships and passive_relationships - bookmarks have no associated counters. I already have some code for that...

@ClearlyClaire (Contributor, Author) replied:
Bookmarks have no associated counters, but they do have ES indexing.

@ClearlyClaire force-pushed the fixes/delete-account-service-perf branch 2 times, most recently from 8e29cc9 to 52bdb9c on December 22, 2020 19:12
@ClearlyClaire force-pushed the fixes/delete-account-service-perf branch 2 times, most recently from 2de4b7a to 8979c88 on December 22, 2020 19:35
@ClearlyClaire force-pushed the fixes/delete-account-service-perf branch from 8979c88 to d5da937 on December 22, 2020 19:51
@ClearlyClaire marked this pull request as ready for review on December 22, 2020 22:14
@ClearlyClaire (Contributor, Author) commented:
I think I'm about done with this PR; I don't really know where else to trim processing time on statuses (except maybe making unpush_from_home_timelines and unpush_from_list_timelines smarter, but that's tricky).

There are lots of possible improvements for favourites, active_relationships and passive_relationships (and to a lesser extent bookmarks), but I understood from your earlier post that you were experimenting with them, so I'm leaving that to you for the time being.

All in all, here is a breakdown of the time spent deleting each of the accounts with the latest commit of this PR:

| # | statuses | media attachments | notifications | active_relationships | passive_relationships | favourites |
|---|---------:|------------------:|--------------:|---------------------:|----------------------:|-----------:|
| 1 | 319118.9ms | 4947.4ms | 274.0ms | 54.8ms | 25.0ms | 20304.3ms |
| 2 | 174690.6ms | 12454.8ms | 31.2ms | 23.7ms | 11.8ms | 4233.3ms |
| 3 | 192512.5ms | 8221.9ms | 149.2ms | 20.8ms | 30.5ms | 15070.6ms |
| 4 | 196633.6ms | 17311.3ms | 17.0ms | 2128.6ms | 581.0ms | 292957.6ms |
| 5 | 142705.5ms | 6301.1ms | 25.8ms | 52.6ms | 13.0ms | 4947.7ms |

redis.pipelined do
reblogged_id_sets.each do |feed_id, future|
future.value.each do |reblogged_id|
@Gargron (Member) commented Dec 22, 2020:
How does getting the future value interact with being inside another pipelined block? Is the benefit of pipelining not lost when we block on every iteration?

@ClearlyClaire (Contributor, Author) replied:

Hm, I haven't changed anything here (it's code moved from the cleanup scheduler).

There are two pipelines, one after the other; I think the benefit is that we can build and send the first one in full, then wait on its replies while building the second one.
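
(For reference, the pattern in question, as a simplified sketch with hypothetical key names rather than the actual service code: commands issued inside redis.pipelined return Redis::Future objects whose value only becomes readable once the block exits, so the futures from the first pipeline can safely be consumed while the second pipeline is being built.)

require 'redis'

redis    = Redis.new
feed_ids = [1, 2, 3] # hypothetical list of home feeds to clean

# Pipeline 1: queue all reads in a single round-trip. Inside the
# block, each command returns a Redis::Future instead of a result.
reblogged_id_sets = {}
redis.pipelined do
  feed_ids.each do |feed_id|
    reblogged_id_sets[feed_id] = redis.zrange("feed:home:#{feed_id}", 0, -1)
  end
end

# The first block has exited, so every future now holds its value.
# Pipeline 2: queue all removals in a second single round-trip.
redis.pipelined do
  reblogged_id_sets.each do |feed_id, future|
    future.value.each do |reblogged_id|
      redis.zrem("feed:home:#{feed_id}", reblogged_id)
    end
  end
end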

@@ -27,19 +27,14 @@ def call(statuses, **options)
   # transaction lock the database, but we use the delete method instead
   # of destroy to avoid all callbacks. We rely on foreign keys to
   # cascade the delete faster without loading the associations.
-  statuses_and_reblogs.each(&:delete)
+  statuses_and_reblogs.each_slice(50) { |slice| Status.where(id: slice.map(&:id)).delete_all }
@Gargron (Member) commented:

I guess this is fine, since an account being deleted would likely not see other processes trying to access or update its statuses anymore. If there is no contention due to this, then the batches might as well be larger; I guess we'll find out in production how high it can be set.
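
(To make the trade-off concrete, a sketch rather than the PR's code, with the batch size pulled out as a hypothetical constant:)

# Before: one DELETE statement per status, i.e. one round-trip per row.
statuses_and_reblogs.each(&:delete)

# After: one DELETE statement per slice. Like delete, delete_all skips
# callbacks and validations; a larger slice means fewer round-trips but
# longer-held row locks per statement.
BATCH_SIZE = 50 # hypothetical constant; the PR hard-codes 50
statuses_and_reblogs.each_slice(BATCH_SIZE) do |slice|
  Status.where(id: slice.map(&:id)).delete_all
end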

@@ -95,7 +85,7 @@ def unpush_from_public_timelines(status)
     redis.publish(status.local? ? 'timeline:public:local:media' : 'timeline:public:remote:media', payload)
   end

-  @tags[status.id].each do |hashtag|
+  status.tags.map { |tag| tag.name.mb_chars.downcase }.each do |hashtag|
@Gargron (Member) commented Dec 22, 2020:
Creating two arrays when you could just downcase the name inside the each block itself, I guess. Minor issue given the likely size of the arrays, though.
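
(For illustration, the suggested single-pass version might look like this; a sketch that assumes the block publishes to hashtag timelines, as this service does elsewhere:)

# Downcase each tag name inside the loop instead of first building
# an intermediate array of downcased names.
status.tags.each do |tag|
  hashtag = tag.name.mb_chars.downcase
  redis.publish("timeline:hashtag:#{hashtag}", payload)
  redis.publish("timeline:hashtag:#{hashtag}:local", payload) if status.local?
end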
