Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to exclude suspended domains/subdomains from tootctl domains crawl #11454

Merged
merged 4 commits into from
Aug 3, 2019
Merged

Conversation

dariusk
Copy link
Contributor

@dariusk dariusk commented Jul 31, 2019

I encountered an issue on crawling the fediverse via tootctl domains crawl where there are about 185,000 spam instances of the format

xyza1sietvv739ur5bujjc.gab.best
xyzhnpydyv0cyiglhaoexo2.gab.best
xyzpr0aazvaj2.gab.best
xyzwiib6iw9378p.gab.best

I'd like more accurate stats. This new option ignores any instances suspended server-wide as well as their associated subdomains. So as an admin, I simply add gab.best to my domain blocks, and then run

tootctl domains crawl --exclude-suspended

to get stats excluding all 185k of those domains. This also significantly improves execution time for the crawl because it doesn't have to make three GET requests per each of those 185k domains.

Implementation notes

This queries all domain suspensions up front, then runs a regexp on each domain to see if it matches the subdomain. This improves performance over what may be the obvious implementation, which is to ask DomainBlocks.blocked?(domain) for each domain -- this method hits the DB once per domain checked, slowing things down considerably.

This new option ignores any instances suspended server-wide as
well as their associated subdomains. This queries all domain
blocks up front, then runs a regexp on each domain. This improves
performance over what may be the obvious implementation, which is
to ask `DomainBlocks.blocked?(domain)` for each domain -- this
hits the DB many times, slowing things down considerably.

pool = Concurrent::ThreadPoolExecutor.new(min_threads: 0, max_threads: options[:concurrency], idletime: 10, auto_terminate: true, max_queue: 0)

work_unit = ->(domain) do
next if stats.key?(domain)
next if blocked_domains.any? { |blocked| domain.match(Regexp.new('\\.?' + blocked + '$')) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose you could pre-compose a regex with all the domains, dunno which way is faster off the top of my head, it would be a big regex but you would save on an array iteration and initializing a new regex on each item.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiling a giant regexp does seem significantly faster, I'll rewrite.

require 'Benchmark'

domains_to_test = []
blocked_domains = []
200000.times do blocked_domains.push(rand(400).to_s) end
15000.times do domains_to_test.push(rand(400).to_s) end

puts "Regexp.new"
puts Benchmark.measure {
  domains_to_test.each do |domain|
    blocked_domains.any? { |blocked| domain.match(Regexp.new('\\.?' + blocked + '$')) }
  end
}

puts "precompiled giant Regexp"
puts Benchmark.measure {
  reg = Regexp.new('\\.?' + blocked_domains.join('|') + '$')
  domains_to_test.each do |domain|
    domain.match(reg)
  end
}

yields

Regexp.new
  8.581110   0.078617   8.659727 (  8.678938)
precompiled giant Regexp
  0.231612   0.009794   0.241406 (  0.241989)

failed = Concurrent::AtomicFixnum.new(0)
start_at = Time.now.to_f
seed = start ? [start] : Account.remote.domains
blocked_domains = options[:exclude_suspended] ? Regexp.new('\\.?' + DomainBlock.where(severity: 1).pluck(:domain).join('|') + '$') : ''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are Ruby regexes thread-safe? I think you might need to make this a thread-local (and it would be easier to read, since you could put it into a memoized method and wouldn't have to include the weird options[:exclude_domain] conditional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put the conditional there so the DB doesn't get hit if the --exclude-suspended option is false. But you're right, this code does work if I simply remove the ternary like so:

       blocked_domains = Regexp.new('\\.?' + DomainBlock.where(severity: 1).pluck(:domain).join('|') + '$')

It just means that the DB gets hit on this line every time even if this feature isn't being used. But it's only one query per run so perhaps that's okay.

And once we get into issues of concurrent programming in Ruby I'm afraid I'm in over my head. In most languages, regular expressions are thread-safe because they are immutable, but I can't say 100% for Ruby's case. The code does work, and there is no Concurrent::Regexp, which I think means it's probably the case that it's thread safe.

That aside, if the modification above is to your liking I'm happy to include it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm saying that by extracting the code into a lazy function:

def blocked_domains
  @blocked_domains ||= Regexp.new('\\.?' + DomainBlock.where(severity: 1).pluck(:domain).join('|') + '$')
end

we could avoid both compiling the regex unnecessarily and still making the code cleaner. I was also saying that this sort of refactoring would be required if we needed to maintain a new copy of the regex per thread, instead of one regex instance shared across all of the threads. (but using thread-local variables instead). However, now that I consider it, the lazy instance-variable approach would introduce thread-unsafety itself, so that's probably right out.

on regex thread safety—many regex implementations use stateful caches, adaptive optimizations, etc and need some amounts of thread-local space to work in, see https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md#using-a-regex-from-multiple-threads for an example. JRuby apparently had a thread safety bug in their Regexes once: jruby/jruby#3670. However, I can't find any discussion of the thread safety of mri ruby Regex instances, so i'm fine assuming they're thread-safe.

Copy link
Member

@nightpool nightpool Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all that said, I think it probably makes the most sense to just compile the regex no matter what. conditioning the compile on the option is probably premature optimization.

@dariusk
Copy link
Contributor Author

dariusk commented Aug 1, 2019

Okay, per @nightpool's comments I removed the ternary operator.

@nightpool
Copy link
Member

LGTM! Thanks for bearing with me as I thought through all this thread safety stuff out loud

@Gargron Gargron merged commit f96f45e into mastodon:master Aug 3, 2019
hiyuki2578 pushed a commit to ProjectMyosotis/mastodon that referenced this pull request Oct 2, 2019
…ns crawl (mastodon#11454)

* Add "--exclude-suspended" to tootctl domains crawl

This new option ignores any instances suspended server-wide as
well as their associated subdomains. This queries all domain
blocks up front, then runs a regexp on each domain. This improves
performance over what may be the obvious implementation, which is
to ask `DomainBlocks.blocked?(domain)` for each domain -- this
hits the DB many times, slowing things down considerably.

* cleaning up code style

* Compiling regex

* Removing ternary operator
rtucker pushed a commit to vulpineclub/mastodon that referenced this pull request Jan 7, 2021
…ns crawl (mastodon#11454)

* Add "--exclude-suspended" to tootctl domains crawl

This new option ignores any instances suspended server-wide as
well as their associated subdomains. This queries all domain
blocks up front, then runs a regexp on each domain. This improves
performance over what may be the obvious implementation, which is
to ask `DomainBlocks.blocked?(domain)` for each domain -- this
hits the DB many times, slowing things down considerably.

* cleaning up code style

* Compiling regex

* Removing ternary operator
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants