Add option to exclude suspended domains/subdomains from tootctl domains crawl #11454

dariusk · 2019-07-31T18:57:10Z

I ran into an issue when crawling the fediverse via tootctl domains crawl: there are about 185,000 spam instances of the format

xyza1sietvv739ur5bujjc.gab.best
xyzhnpydyv0cyiglhaoexo2.gab.best
xyzpr0aazvaj2.gab.best
xyzwiib6iw9378p.gab.best

I'd like more accurate stats. This new option ignores any instances suspended server-wide as well as their associated subdomains. So as an admin, I simply add gab.best to my domain blocks, and then run

tootctl domains crawl --exclude-suspended

to get stats excluding all 185k of those domains. This also significantly improves execution time for the crawl because it doesn't have to make three GET requests for each of those 185k domains.

Implementation notes

This queries all domain suspensions up front, then runs a regexp on each domain to see if it matches a suspended domain or one of its subdomains. This improves performance over what may be the obvious implementation, which is to ask DomainBlocks.blocked?(domain) for each domain -- this method hits the DB once per domain checked, slowing things down considerably.
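To make the difference in lookup counts concrete, here is a minimal sketch. It assumes a hypothetical `FakeBlockStore` class standing in for the database (it is not Mastodon's model), and it uses `end_with?` instead of a regexp to keep the comparison short.

```ruby
# Hypothetical in-memory stand-in for the domain block table. It counts
# lookups so the two strategies can be compared. Not Mastodon's actual model.
class FakeBlockStore
  attr_reader :queries

  def initialize(suspended)
    @suspended = suspended
    @queries = 0
  end

  # One "query" per call, like asking the database about a single domain.
  def blocked?(domain)
    @queries += 1
    @suspended.any? { |s| domain == s || domain.end_with?(".#{s}") }
  end

  # One "query" that returns every suspended domain at once.
  def all_suspended
    @queries += 1
    @suspended.dup
  end
end

crawled = %w[xyzpr0aazvaj2.gab.best xyzwiib6iw9378p.gab.best mastodon.social]

# Strategy A: ask the store about every crawled domain.
per_domain = FakeBlockStore.new(['gab.best'])
kept_a = crawled.reject { |d| per_domain.blocked?(d) }

# Strategy B: fetch the list once, then filter in memory.
up_front = FakeBlockStore.new(['gab.best'])
suspended = up_front.all_suspended
kept_b = crawled.reject { |d| suspended.any? { |s| d == s || d.end_with?(".#{s}") } }

puts "per-domain lookups: #{per_domain.queries}, up-front lookups: #{up_front.queries}"
```

Both strategies keep the same domains; only the number of lookups differs, and that gap grows with the number of crawled domains.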


Gargron · 2019-07-31T19:36:07Z

lib/mastodon/domains_cli.rb


      pool = Concurrent::ThreadPoolExecutor.new(min_threads: 0, max_threads: options[:concurrency], idletime: 10, auto_terminate: true, max_queue: 0)

      work_unit = ->(domain) do
        next if stats.key?(domain)
+        next if blocked_domains.any? { |blocked| domain.match(Regexp.new('\\.?' + blocked + '$')) }


I suppose you could pre-compose a regex with all the domains. I don't know which way is faster off the top of my head; it would be a big regex, but you would save on an array iteration and on initializing a new regex for each item.

Compiling a giant regexp does seem significantly faster, I'll rewrite.

require 'benchmark'

domains_to_test = []
blocked_domains = []

200000.times do
  blocked_domains.push(rand(400).to_s)
end

15000.times do
  domains_to_test.push(rand(400).to_s)
end

puts "Regexp.new"
puts Benchmark.measure {
  domains_to_test.each do |domain|
    blocked_domains.any? { |blocked| domain.match(Regexp.new('\\.?' + blocked + '$')) }
  end
}

puts "precompiled giant Regexp"
puts Benchmark.measure {
  reg = Regexp.new('\\.?' + blocked_domains.join('|') + '$')
  domains_to_test.each do |domain|
    domain.match(reg)
  end
}

yields

Regexp.new
  8.581110   0.078617   8.659727 (  8.678938)
precompiled giant Regexp
  0.231612   0.009794   0.241406 (  0.241989)
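One detail worth checking when joining domains into a single pattern: `|` has the lowest precedence in a regexp, so with a plain string join the leading `\.?` binds only to the first alternative and the trailing `$` only to the last. The sketch below compares that form with a grouped, escaped variant built with `Regexp.union`. The domain list is made up for illustration, and the grouped pattern is one possible approach, not the code from this PR.

```ruby
# Hypothetical list of suspended domains, for illustration only.
blocked = ['gab.best', 'spam.example', 'bad.test']

# Plain string join: the prefix and the end anchor each attach to a single
# alternative, and the dots inside the domains match any character.
ungrouped = Regexp.new('\\.?' + blocked.join('|') + '$')

# Grouped and escaped: Regexp.union escapes each string, and interpolating
# the resulting Regexp wraps the alternatives in a group, so the
# start-or-dot prefix and the end anchor apply to every entry.
grouped = /(?:\A|\.)#{Regexp.union(blocked)}\z/

%w[x.gab.best gab.best spam.example.org notgab.best].each do |domain|
  puts "#{domain}: ungrouped=#{ungrouped.match?(domain)} grouped=#{grouped.match?(domain)}"
end
```

With the grouped form, `spam.example.org` and `notgab.best` are not treated as suspended, while `gab.best` and its subdomains still are.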

nightpool · 2019-08-01T00:08:58Z

lib/mastodon/domains_cli.rb

+      failed          = Concurrent::AtomicFixnum.new(0)
+      start_at        = Time.now.to_f
+      seed            = start ? [start] : Account.remote.domains
+      blocked_domains = options[:exclude_suspended] ? Regexp.new('\\.?' + DomainBlock.where(severity: 1).pluck(:domain).join('|') + '$') : ''


Are Ruby regexes thread-safe? I think you might need to make this a thread-local. It would also be easier to read, since you could put it into a memoized method and wouldn't have to include the weird options[:exclude_suspended] conditional.

I put the conditional there so the DB doesn't get hit if the --exclude-suspended option is false. But you're right, this code does work if I simply remove the ternary like so:

blocked_domains = Regexp.new('\\.?' + DomainBlock.where(severity: 1).pluck(:domain).join('|') + '$')

It just means that the DB gets hit on this line every time even if this feature isn't being used. But it's only one query per run so perhaps that's okay.

And once we get into issues of concurrent programming in Ruby I'm afraid I'm in over my head. In most languages, regular expressions are thread-safe because they are immutable, but I can't say 100% for Ruby's case. The code does work, and there is no Concurrent::Regexp, which I think means it's probably the case that it's thread safe.

That aside, if the modification above is to your liking I'm happy to include it.

I'm saying that by extracting the code into a lazy function:

def blocked_domains
  @blocked_domains ||= Regexp.new('\\.?' + DomainBlock.where(severity: 1).pluck(:domain).join('|') + '$')
end

we could both avoid compiling the regex unnecessarily and make the code cleaner. I was also saying that this sort of refactoring would be required if we needed to maintain a separate copy of the regex per thread (using thread-local variables) instead of one regex instance shared across all of the threads. However, now that I consider it, the lazy instance-variable approach would itself introduce thread-unsafety, so that's probably right out.
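For reference, the concern with a bare `@blocked_domains ||= ...` is that two threads reaching it at the same moment can both run the initializer. A minimal sketch of one way to guard lazy initialization with a `Mutex` follows. The `BlockedDomainPattern` class and its constructor argument are hypothetical, and the list is passed in rather than read from the database.

```ruby
# Hypothetical holder class, for illustration only. The domain list is passed
# in directly instead of being queried from a database.
class BlockedDomainPattern
  def initialize(domains)
    @domains = domains
    @lock = Mutex.new
    @pattern = nil
  end

  # Builds the pattern at most once, even if several threads call this
  # method at the same time; later calls return the cached object.
  def pattern
    @lock.synchronize do
      @pattern ||= /(?:\A|\.)#{Regexp.union(@domains)}\z/
    end
  end
end

holder = BlockedDomainPattern.new(['gab.best'])

# Eight threads request the pattern concurrently; all receive the same object.
patterns = Array.new(8) { Thread.new { holder.pattern } }.map(&:value)
puts patterns.all? { |pat| pat.equal?(patterns.first) }
```

Building the pattern eagerly before any worker threads start sidesteps the question entirely, at the cost of one query per run.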

On regex thread safety: many regex implementations use stateful caches, adaptive optimizations, etc., and need some amount of thread-local space to work in; see https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md#using-a-regex-from-multiple-threads for an example. JRuby apparently had a thread-safety bug in its regexes once: jruby/jruby#3670. However, I can't find any discussion of the thread safety of MRI Ruby Regexp instances, so I'm fine assuming they're thread-safe.

All that said, I think it probably makes the most sense to just compile the regex no matter what. Conditioning the compile on the option is probably premature optimization.
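As an illustration of the sharing pattern discussed above, here is a small sketch in which one frozen `Regexp` is built before any threads start and is then read by all of them. It uses plain `Thread` rather than the crawler's thread pool and a made-up domain list. It shows the shape of the approach and is not evidence of thread safety. `Regexp#match?` is used because it returns only a boolean and does not set `$~`.

```ruby
# Built once, before any threads exist, and frozen so it cannot be modified.
# The domain list is hypothetical.
pattern = /(?:\A|\.)#{Regexp.union(['gab.best'])}\z/.freeze

domains = %w[a.gab.best mastodon.social b.gab.best example.org]

# Each thread only reads the shared pattern and returns its own result.
threads = domains.map do |domain|
  Thread.new { [domain, pattern.match?(domain)] }
end

results = threads.map(&:value).to_h
results.each { |domain, suspended| puts "#{domain}: #{suspended}" }
```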

dariusk · 2019-08-01T17:34:00Z

Okay, per @nightpool's comments I removed the ternary operator.

nightpool · 2019-08-01T22:41:26Z

LGTM! Thanks for bearing with me as I thought through all this thread safety stuff out loud.

…ns crawl (mastodon#11454)

* Add "--exclude-suspended" to tootctl domains crawl

This new option ignores any instances suspended server-wide as well as their associated subdomains. This queries all domain blocks up front, then runs a regexp on each domain. This improves performance over what may be the obvious implementation, which is to ask `DomainBlocks.blocked?(domain)` for each domain -- this hits the DB many times, slowing things down considerably.

* cleaning up code style

* Compiling regex

* Removing ternary operator

dariusk added 2 commits July 31, 2019 11:48

cleaning up code style

13a6a12

Gargron reviewed Jul 31, 2019

Compiling regex

074d335

nightpool requested changes Aug 1, 2019

Removing ternary operator

106a6d2

nightpool approved these changes Aug 1, 2019

Gargron merged commit f96f45e into mastodon:master Aug 3, 2019
