
Improve account deletion performances further #15407

Merged

Conversation

@ClearlyClaire (Contributor) commented Dec 22, 2020

Benchmark

Protocol

Before each test, the database is re-created with the same backup from my production instance (~single-user, a few local accounts for different purposes). Media files are not copied over.

The development environment does not have Elasticsearch enabled, and the database server runs locally.

The timing values are obtained by running the following code in a Rails console:

Benchmark.measure do
  top_account_ids.each do |account_id|
    DeleteAccountService.new.call(Account.find(account_id), skip_activitypub: true)
  end
end

top_account_ids contains the identifiers of the 5 accounts with the most toots known to my instance; they break down as follows:

| # | total toots | public toots | DMs | local? | favourites | reported toots |
|---|------------:|-------------:|----:|--------|-----------:|---------------:|
| 1 | 105207 | 13044 | 48 | no | 1054 | 1 |
| 2 | 73196 | 66623 | 0 | no | 369 | 0 |
| 3 | 57971 | 19237 | 329 | no | 1921 | 0 |
| 4 | 56710 | 4067 | 1255 | yes | 37959 | 0 |
| 5 | 54403 | 31091 | 56 | no | 3148 | 0 |

The accounts are deleted in that order, and they interact with one another, so clearing one account slightly lowers the number of toots (reblogs) left to process for the following ones.

Results

| commit | user | system | total | real |
|--------|------|--------|-------|------|
| 2334add (without this PR) | 750.049139s (12min30s) | 55.861878s | 805.911017s (13min26s) | 2624.943940s (43min45s) |
| 8c70b01 (deletes statuses by batches of 50 instead of individually) | 606.902921s (10min07s) | 41.281127s | 648.184048s (10min48s) | 1549.474583s (25min50s) |
| e20b52e (do not precompute values that are only used once) | 596.145949s (9min56s) | 42.301746s | 638.447695s (10min38s) | 1527.635537s (25min27s) |
| fb8ad0f (do not generate redis events for toots older than two weeks) | 479.127842s (7min59s) | 40.115994s | 519.243836s (8min39s) | 1448.805232s (24min09s) |
| eff219a (filter reported toots a priori) | 476.933216s (7min57s) | 39.721144s | 516.654360s (8min37s) | 1424.850713s (23min45s) |
| fc6a592 (do not process reblogs when cleaning up public timelines) | 477.414900s (7min57s) | 39.666471s | 517.081371s (8min37s) | 1426.631086s (23min47s) |
| d5da937 (clean up deleted account's feed in one go) | 477.063304s (7min57s) | 37.829388s | 514.892692s (8min35s) | 1413.292442s (23min33s) |
| f57ed40 (delete rather than destroy a few more associations, tweak preloading) | 474.929086s (7min55s) | 38.475544s | 513.404630s (8min33s) | 1415.174932s (23min35s) |

The first few commits make a very sizable difference; after that, the benefits are less obvious in my example. Still, they make the logic a bit cleaner, and I expect them to make a slightly bigger difference on instances with more accounts interacting with the deleted ones.

@ClearlyClaire (Contributor, Author) commented:
In addition, we should be able to do smart things for favourites and bookmarks, but I'll have to read up on how Chewy works to update the indexes efficiently.

-@account.statuses.reorder(nil).find_in_batches do |statuses|
-  statuses.reject! { |status| reported_status_ids.include?(status.id) } if keep_account_record?
+@account.statuses.reorder(nil).where.not(id: reported_status_ids).in_batches do |statuses|
@Gargron (Member) commented:
The change is probably wrong here, because we don't use statuses as a relation. If you want to iterate over the relation, you have to call each_record according to the documentation, which we do not do.

@ClearlyClaire (Contributor, Author) replied:
BatchEnumerator is Enumerable, so I don't think that should be an issue.

@Gargron (Member) replied:

It's not as obvious to me... The variable is accessed multiple times; what is the behaviour of this enumerable? What if something implicitly calls to_a on it each time, re-querying the database? Why do the docs specify:

> NOTE: If you are going to iterate through each record, you should call each_record on the yielded BatchEnumerator:

@ClearlyClaire (Contributor, Author) replied:

Hmm, I don't know. Well, I can change that; it doesn't appear to make a big difference (but a positive one nevertheless).
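
(For context, a minimal sketch of the distinction under discussion; this is not code from the PR, and process is just a stand-in. find_in_batches yields loaded Arrays, in_batches yields relations that re-query on each enumeration, and each_record is the documented way to iterate single records from a block-less in_batches.)

# find_in_batches yields plain Arrays of already-loaded records;
# they can be enumerated repeatedly without extra queries.
account.statuses.reorder(nil).find_in_batches do |statuses|
  statuses.each { |status| process(status) }
end

# in_batches yields ActiveRecord::Relation objects; enumerating one
# runs a SELECT, so enumerating it twice queries the database twice.
# delete_all on the relation never loads the records at all.
account.statuses.reorder(nil).in_batches do |relation|
  relation.delete_all
end

# Without a block, in_batches returns a BatchEnumerator, and
# each_record iterates individual records across all batches.
account.statuses.reorder(nil).in_batches.each_record do |status|
  process(status)
end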

@Gargron (Member) commented Dec 22, 2020

> In addition, we should be able to do smart things for favourites and bookmarks

I think you mean favourites, active_relationships and passive_relationships - bookmarks have no associated counters. I already have some code for that...

@ClearlyClaire (Contributor, Author) replied:
Bookmarks have no associated counters, but they do have ES indexing.

@ClearlyClaire force-pushed the fixes/delete-account-service-perf branch 2 times, most recently from 8e29cc9 to 52bdb9c on December 22, 2020 19:12
@ClearlyClaire force-pushed the fixes/delete-account-service-perf branch 2 times, most recently from 2de4b7a to 8979c88 on December 22, 2020 19:35
@ClearlyClaire force-pushed the fixes/delete-account-service-perf branch from 8979c88 to d5da937 on December 22, 2020 19:51
@ClearlyClaire marked this pull request as ready for review on December 22, 2020 22:14
@ClearlyClaire (Contributor, Author) commented:
I think I'm about done with this PR; I don't really know where else to trim processing time on statuses (except maybe making unpush_from_home_timelines and unpush_from_list_timelines smarter, but that's tricky).

There are lots of possible improvements for favourites, active_relationships and passive_relationships (and to a lesser extent bookmarks), but I understood from your earlier post that you were experimenting with them, so I'm leaving that to you for the time being.

All in all, here is a breakdown of the time spent deleting each of the accounts with the latest commit of this PR:

| # | statuses | media attachments | notifications | active_relationships | passive_relationships | favourites |
|---|---------:|------------------:|--------------:|---------------------:|----------------------:|-----------:|
| 1 | 319118.9ms | 4947.4ms | 274.0ms | 54.8ms | 25.0ms | 20304.3ms |
| 2 | 174690.6ms | 12454.8ms | 31.2ms | 23.7ms | 11.8ms | 4233.3ms |
| 3 | 192512.5ms | 8221.9ms | 149.2ms | 20.8ms | 30.5ms | 15070.6ms |
| 4 | 196633.6ms | 17311.3ms | 17.0ms | 2128.6ms | 581.0ms | 292957.6ms |
| 5 | 142705.5ms | 6301.1ms | 25.8ms | 52.6ms | 13.0ms | 4947.7ms |

redis.pipelined do
reblogged_id_sets.each do |feed_id, future|
future.value.each do |reblogged_id|
@Gargron (Member) commented Dec 22, 2020:
How does getting the future value interact with being inside another pipelined block? Is the benefit of pipelining not lost when we block on every iteration?

@ClearlyClaire (Contributor, Author) replied:

Hm, I haven't changed anything here (it's code moved from the cleanup scheduler).

There are two pipelines, one after the other; I think the benefit is that we can build and send the first one in full, then wait on its replies while building the second one.
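
(For reference, the pattern in question, as a simplified sketch with hypothetical key names rather than the actual service code: commands issued inside redis.pipelined return Redis::Future objects whose value only becomes readable once the block exits, so the futures from the first pipeline can safely be consumed while the second pipeline is being built.)

require 'redis'

redis    = Redis.new
feed_ids = [1, 2, 3] # hypothetical list of home feeds to clean

# Pipeline 1: queue all reads in a single round-trip. Inside the
# block, each command returns a Redis::Future instead of a result.
reblogged_id_sets = {}
redis.pipelined do
  feed_ids.each do |feed_id|
    reblogged_id_sets[feed_id] = redis.zrange("feed:home:#{feed_id}", 0, -1)
  end
end

# The first block has exited, so every future now holds its value.
# Pipeline 2: queue all removals in a second single round-trip.
redis.pipelined do
  reblogged_id_sets.each do |feed_id, future|
    future.value.each do |reblogged_id|
      redis.zrem("feed:home:#{feed_id}", reblogged_id)
    end
  end
end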

@@ -27,19 +27,14 @@ def call(statuses, **options)
   # transaction lock the database, but we use the delete method instead
   # of destroy to avoid all callbacks. We rely on foreign keys to
   # cascade the delete faster without loading the associations.
-  statuses_and_reblogs.each(&:delete)
+  statuses_and_reblogs.each_slice(50) { |slice| Status.where(id: slice.map(&:id)).delete_all }
@Gargron (Member) commented:

I guess this is fine, since an account being deleted would likely not see other processes trying to access or update its statuses anymore. If there is no contention due to this, then the batches might as well be larger; I guess we'll find out in production how high it can be set.
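
(To make the trade-off concrete, a sketch rather than the PR's code, with the batch size pulled out as a hypothetical constant:)

# Before: one DELETE statement per status, i.e. one round-trip per row.
statuses_and_reblogs.each(&:delete)

# After: one DELETE statement per slice. Like delete, delete_all skips
# callbacks and validations; a larger slice means fewer round-trips but
# longer-held row locks per statement.
BATCH_SIZE = 50 # hypothetical constant; the PR hard-codes 50
statuses_and_reblogs.each_slice(BATCH_SIZE) do |slice|
  Status.where(id: slice.map(&:id)).delete_all
end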

@@ -95,7 +85,7 @@ def unpush_from_public_timelines(status)
     redis.publish(status.local? ? 'timeline:public:local:media' : 'timeline:public:remote:media', payload)
   end

-  @tags[status.id].each do |hashtag|
+  status.tags.map { |tag| tag.name.mb_chars.downcase }.each do |hashtag|
@Gargron (Member) commented Dec 22, 2020:
Creating two arrays when you could just downcase the name inside the each block itself, I guess. Minor issue given the likely size of the arrays, though.
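
(For illustration, the suggested single-pass version might look like this; a sketch that assumes the block publishes to hashtag timelines, as this service does elsewhere:)

# Downcase each tag name inside the loop instead of first building
# an intermediate array of downcased names.
status.tags.each do |tag|
  hashtag = tag.name.mb_chars.downcase
  redis.publish("timeline:hashtag:#{hashtag}", payload)
  redis.publish("timeline:hashtag:#{hashtag}:local", payload) if status.local?
end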
