DNS resolver locking up entire MRI process #2175

Closed
mperham opened this issue Feb 4, 2015 · 13 comments

@mperham
Collaborator

mperham commented Feb 4, 2015

Several customers have reported Sidekiq "freezing" over the last 3-6 months. When I had them put require 'resolv-replace' in their initializer, the problem went away. What might cause this?

@evanphx

evanphx commented Feb 4, 2015

If you look at the methods that are patched by resolv-replace, you'll see all the places that might result in a hang. The hang happens because those methods end up calling into libc's resolver, which can block for a long time waiting for an answer if DNS is misconfigured.
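
One way to see that list without reading the library source is to ask Ruby which aliases resolv-replace leaves behind. This is a minimal sketch, assuming a Ruby installation that still includes resolv-replace. The helper name `resolv_replace_patches` is made up for illustration, and the `original_resolv_*` alias prefix is an implementation detail of the library that may change between versions.

```ruby
require 'socket'
require 'resolv-replace' # note: patches the socket classes process-wide

# Hypothetical helper: collects the "original_resolv_*" aliases that
# resolv-replace keeps for each method it overrides.
def resolv_replace_patches
  pattern = /\Aoriginal_resolv_/
  classes = [TCPSocket, UDPSocket]
  classes << SOCKSSocket if defined?(SOCKSSocket)

  patches = { 'IPSocket (class methods)' => IPSocket.singleton_methods.grep(pattern) }
  classes.each do |klass|
    # initialize is private, so look at private instance methods as well.
    own = klass.instance_methods(false) + klass.private_instance_methods(false)
    patches[klass.name] = own.grep(pattern)
  end
  patches
end

resolv_replace_patches.each do |owner, aliases|
  names = aliases.map { |a| a.to_s.sub('original_resolv_', '') }
  puts "#{owner}: #{names.join(', ')}"
end
```

Each name printed is a method whose hostname argument goes through the pure-Ruby resolver once resolv-replace is loaded.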

@mperham
Collaborator Author

mperham commented Feb 4, 2015

Ok, but why would that block all threads on a modern MRI version?

@mperham
Collaborator Author

mperham commented Feb 4, 2015

If I call TCPSocket.new("google.com", 80), will that hostname lookup block the entire process unless I use resolv-replace? I tested that theory this morning with a custom local DNS server that paused before returning a response, and it did not block the process.

@mperham
Collaborator Author

mperham commented Feb 4, 2015

I did this:

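require 'thread'
require 'resolv'
require 'rubydns'

# DNS server that stalls for 3 seconds on every query.
class MyServer < RubyDNS::Server
  def process(name, resource_class, transaction)
    sleep 3
  end
end

RubyDNS.run_server(asynchronous: true, server_class: MyServer)

# Background thread: keeps printing a dot every second as long as it is scheduled.
t = Thread.new do
  loop do
    print '.'
    sleep 1
  end
end

Resolv.getaddress "fake.host"

t.join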

And updated the local resolv.conf to use 127.0.0.1. I got back NXDOMAIN after 3 seconds, just as expected, with 3 dots printed to screen.
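
A variation of this experiment can be run without the rubydns gem and without touching resolv.conf. This is a sketch under stated assumptions: the "DNS server" is just a local UDP socket that accepts queries and never answers, the helper name `lookup_against_silent_server` is invented here, and the 0.3 second timeout is arbitrary. Note that it exercises the pure-Ruby `Resolv` library (the same code resolv-replace swaps in), so it only shows how `Resolv` behaves while waiting, not how libc's resolver behaves.

```ruby
require 'resolv'
require 'socket'

# Runs one lookup against a local UDP socket that never answers, while a
# background thread counts ticks. Returns the tick count, the elapsed time,
# and the error class raised by the lookup (nil if none was raised).
def lookup_against_silent_server(timeout: 0.3)
  silent = UDPSocket.new
  silent.bind('127.0.0.1', 0) # port 0: let the OS pick a free port
  port = silent.addr[1]

  ticks = 0
  ticker = Thread.new do
    loop do
      ticks += 1
      sleep 0.05
    end
  end

  dns = Resolv::DNS.new(nameserver_port: [['127.0.0.1', port]])
  dns.timeouts = timeout # one short attempt instead of the default retry schedule
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  error = nil
  begin
    dns.getaddress('fake.host.') # trailing dot: skip search-domain expansion
  rescue Resolv::ResolvError, Resolv::ResolvTimeout => e
    error = e.class
  end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  { ticks: ticks, elapsed: elapsed, error: error }
ensure
  ticker&.kill
  dns&.close
  silent&.close
end

p lookup_against_silent_server
```

A non-zero tick count means the background thread kept running while the lookup was waiting, which matches the dots printed in the experiment above.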

@evanphx

evanphx commented Feb 4, 2015

What about the TCPSocket.connect? Does it cause a block? There are many, many paths to code that might trigger DNS lookup in MRI.

@mperham
Collaborator Author

mperham commented Feb 4, 2015

There's no such method.

@mperham
Collaborator Author

mperham commented Feb 5, 2015

> require 'socket'
=> true
> TCPSocket.connect
NoMethodError: undefined method `connect' for TCPSocket:Class

@mperham
Collaborator Author

mperham commented Feb 5, 2015

I can totally buy that a native gem like mysql2 or pg might call into their native client libraries which might perform a native DNS lookup without releasing the GIL. Is that the scenario you are referring to?

@evanphx

evanphx commented Feb 5, 2015

Er, sorry, I meant TCPSocket.new, like you initially indicated.

As for native gems, if that were the cause, using resolv-replace wouldn't fix it.

@mperham
Collaborator Author

mperham commented Feb 5, 2015

I'm still struggling to put together sample code which actually shows a full process lockup. Bad requests made with Typhoeus and with Curb both work fine. TCPSocket.new works fine. Socket.gethostbyname works fine. I wonder if it is platform-dependent? I'm using Ruby 2.2 on Ubuntu 14.04 LTS.

@mperham mperham closed this as completed Mar 27, 2015
@pboling

pboling commented May 14, 2021

We just had a half-hour worker brown-out shortly after upgrading from Ruby 2.7.2 to 2.7.3. It was caused entirely by hundreds of errors from Sidekiq workers, all occurring in resolv.rb, and we do have resolv-replace loaded in our code.

During investigation I found this: https://bugs.ruby-lang.org/issues/17781 which includes a repro that works on 2.7.3, 3.0.0, and 3.0.1 (but not 2.7.2).

require 'resolv'
65536.times { Resolv::DNS.new.getresource('www.example.net', Resolv::DNS::Resource::IN::A) }
puts "Ran 65536 times"
Resolv::DNS.new.getresource('www.example.net', Resolv::DNS::Resource::IN::A)
puts "Ran 65537 times" # never printed

I found this issue while trying to track down a resolv-related freeze in my app, and although it is a totally different situation, the pertinent questions are the same:

Question 1: Should I use resolv-replace?
Answer 1: in 2021, no. The drawbacks of resolv-replace are too many, and the benefits too small, if there even are any. From what I found by googling, the drawbacks are:

  • no IPv6 support,
  • doesn't play well with proxies,
  • doesn't respect local /etc/hosts config, and
  • it is a global override that hijacks all DNS lookups coming from Ruby, as it can't be applied discretely.

SOLUTION: Remove require "resolv-replace" from code.

  • We did not need to add any kind of require "resolv" statement to replace it.

Question 2: How do I fix the freeze caused by the bug in resolv.rb which affects Ruby 2.7.3 / 3.0.0 / 3.0.1?
Answer 2: Switch to the repo / gem library of resolv that recently merged the fix:

SOLUTION: Switch to pulling resolv from GitHub.

in Gemfile:

gem "resolv", github: "ruby/resolv", branch: "master", ref: "c80893765dcd50e9d34b3e9dbd427cc651dc55cf"

and then:

src/api[hotfix/ENG-325-remove-resolv-replace]% bundle install
src/api[hotfix/ENG-325-remove-resolv-replace]% bundle info resolv
  * resolv (0.2.0 c808937)
	Summary: Thread-aware DNS resolver library in Ruby.
	Homepage: https://github.com/ruby/resolv
	Source Code: https://github.com/ruby/resolv
	Path: /Users/pboling/.asdf/installs/ruby/2.7.3/lib/ruby/gems/2.7.0/bundler/gems/resolv-c80893765dcd

Confirmed that does patch the bug in Ruby 2.7.3:

Rack::Shell v1.0.0 started in development environment.
[1] pry(main)> 65536.times { Resolv::DNS.new.getresource('www.example.net', Resolv::DNS::Resource::IN::A) }
=> 65536
[2] pry(main)> Resolv::DNS.new.getresource('www.example.net', Resolv::DNS::Resource::IN::A)
=> #<Resolv::DNS::Resource::IN::A:0x00007fb443263438 @address=#<Resolv::IPv4 93.184.216.34>, @ttl=86400>
[3] pry(main)> puts "Ran 65537 times"
Ran 65537 times
=> nil

@tonywok

tonywok commented Jun 15, 2021

@pboling Hi there, first off, thanks for the info!

For context, I'm coming to this issue after debugging a deadlocking issue, described best by New Relic's docs, for which I intend to use resolv/replace as a workaround. Your note about "not using it in 2021" is giving me pause, and I want to make sure I understand what you're suggesting.

I'm a little confused by your rationale. I'm reading your suggestion as "don't use resolv/replace, but do use resolv (specifically the gem with the patch instead of what ships with your ruby)".

Are you using resolv directly in your code? Based on my understanding of resolv/replace, if you don't require it, you'll still be using Socket DNS since you've not replaced it.

@pboling

pboling commented Jun 16, 2021

> I'm reading your suggestion as "don't use resolv/replace, but do use resolv (specifically the gem with the patch instead of what ships with your ruby)".

Correct.

The way Ruby ships "bundled" gems now is far different from what it used to be.

A new version of the resolv gem, with the fix, has now been shipped, but a new version of Ruby which packages it has not yet been shipped. This means you have to add the gem to your Gemfile to get the fixed version. The version that ships with Ruby 2.7.3/3.0.0/3.0.1 is broken.

When you specify the gem resolv in your Gemfile, it will be automatically "required" by Bundler, as are all gems which do not specify require: false.

So to get a fixed version of the standard resolv, just add the gem to your Gemfile! The broken resolv may be what is causing your deadlock issue, if you haven't switched to the fixed resolv and are on Ruby v2.7.3/3.0.0/3.0.1. I would check that before considering resolv/replace.

If that doesn't fix it, then it does sound like using the pure Ruby resolv/replace is the best solution for the Resque/NewRelic deadlocks.
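
To confirm which copy of resolv a process actually loaded, something like the following can be run from a console inside the app. It is a rough sketch: `resolv_origin` is a made-up helper name, and what RubyGems reports for `resolv` differs between Ruby versions (on Rubies where resolv is not packaged as a gem, the gem fields are simply nil).

```ruby
require 'resolv'

# Hypothetical helper: reports which file provided Resolv and, when RubyGems
# knows about it, the gem version and whether it is the copy shipped with Ruby.
def resolv_origin
  path = Resolv.instance_method(:getaddress).source_location&.first
  spec = defined?(Gem) ? Gem.loaded_specs['resolv'] : nil
  {
    path: path,
    gem_version: spec&.version&.to_s,
    default_gem: spec&.default_gem?
  }
end

p resolv_origin
```

A path under the Ruby installation's own lib directory suggests the copy that shipped with Ruby; a path under a gems or bundler/gems directory suggests the Gemfile entry took effect.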

Just be aware of the downsides... (which I gleaned from third party sources all over the internet, not from directly understanding the source code, so could be wrong):

• no IPv6 support,
• doesn't play well with proxies,
• doesn't respect local /etc/hosts config, and
• it is a global override that hijacks all DNS lookups coming from Ruby, as it can't be applied discretely.
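
Since these points were gathered second-hand, the /etc/hosts one is worth checking directly in your own environment. The pure-Ruby library ships a `Resolv::Hosts` resolver that reads a hosts-format file, and it can be exercised on its own. The sketch below is an illustration only: it uses a temporary file rather than the real /etc/hosts, the helper name `lookup_in_hosts_file` is invented, and the host name and address are made up.

```ruby
require 'resolv'
require 'tempfile'

# Looks up +name+ using only a hosts-format file, with no DNS resolver in the chain.
def lookup_in_hosts_file(hosts_text, name)
  Tempfile.create('hosts') do |file|
    file.write(hosts_text)
    file.flush
    resolver = Resolv.new([Resolv::Hosts.new(file.path)])
    resolver.getaddress(name)
  end
end

# Made-up entry for illustration only.
puts lookup_in_hosts_file("10.0.0.5 myapp.internal\n", 'myapp.internal')
```

Running the same lookup through plain `Resolv.getaddress` with a name from your real hosts file shows whether the default resolver chain picks it up on your system.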
