CLOSE_WAIT, too many open files #277

Closed
heaven opened this issue Jan 5, 2016 · 67 comments


heaven commented Jan 5, 2016

This code:

easy = Curl::Easy.new(site.rss_url)
easy.ssl_verify_peer = false
easy.follow_location = true
easy.max_redirects = 5
easy.connect_timeout = 15
easy.timeout = 20
easy.perform

leaks connections.

After some time I am getting errors from sidekiq: Error fetching job: Too many open files - getaddrinfo.

$ netstat -an | awk '/tcp/ {print $6}' | sort | uniq -c:

   4024 CLOSE_WAIT
    163 ESTABLISHED
     12 LISTEN
     85 TIME_WAIT

$ lsof -p 6245 | grep CLOSE_WAIT | wc -l

4020

A part of lsof output:

ruby    6245 deploy 4085u  IPv4 445657907         0t0       TCP ...:38806->theautismeducationsite.com:http (CLOSE_WAIT)
ruby    6245 deploy 4086u  IPv4 445657898         0t0       TCP ...:42963->qm-in-f132.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4087u  IPv4 445657912         0t0       TCP ...:45454->qm-in-f118.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4088u  IPv4 445657915         0t0       TCP ...:49490->192.0.78.13:https (CLOSE_WAIT)
ruby    6245 deploy 4089u  IPv4 445661545         0t0       TCP ...:52097->qm-in-f121.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4090u  IPv4 445657929         0t0       TCP ...:42984->qm-in-f132.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4091u  IPv4 445658814         0t0       TCP ...:42973->qm-in-f132.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4092u  IPv4 445648740         0t0       TCP ...:60335->ps478203.dreamhost.com:http (CLOSE_WAIT)
ruby    6245 deploy 4093u  IPv4 445659562         0t0       TCP ...:42985->qm-in-f132.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4094u  IPv4 445659564         0t0       TCP ...:43010->qm-in-f132.1e100.net:http (CLOSE_WAIT)
ruby    6245 deploy 4095u  IPv4 445648749         0t0       TCP ...:43009->qm-in-f132.1e100.net:http (CLOSE_WAIT)

taf2 commented Jan 5, 2016

After easy.perform, if you run easy.close do you still see the open file issue? The close should happen during GC, but if you're not having any GC runs then it would not close those connections... perhaps?
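
For reference, applied to the snippet above that suggestion would look roughly like this (just a sketch; the ensure block only makes sure close runs even if perform raises):

easy = Curl::Easy.new(site.rss_url)
easy.ssl_verify_peer = false
easy.follow_location = true
easy.max_redirects = 5
easy.connect_timeout = 15
easy.timeout = 20
begin
  easy.perform
ensure
  # release the libcurl handle now instead of waiting for the GC finalizer
  easy.close
end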


heaven commented Jan 5, 2016

I do have GC runs, and I even run it manually at some points. Will try the easy.close suggestion and let you know.


nubis commented Feb 2, 2016

I think I just ran into this issue as well (using Curl::Easy).
I'm not calling 'perform', I use 'http(:get)' instead.

Initially, just running my script would immediately start piling up sockets in the ESTAB and CLOSE-WAIT states (as reported by the 'ss' utility on my Ubuntu box).

After adding an explicit call to 'curl.close' after being done with my curl object I got my script to run in constant file descriptor space. But this just lasted a few minutes, then it started piling up sockets in the CLOSE-WAIT state again.

Finally, I added a 'GC.start' call after calling 'close' on my curl object and it seems to have fixed it for good.
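
A minimal sketch of that sequence (url is just a placeholder for whatever is being fetched):

curl = Curl::Easy.new(url)
curl.http(:get)
body = curl.body_str
curl.close # ask curb to free the libcurl handle
GC.start   # run finalizers so the underlying socket is actually released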

It was an ugly fix though; I want to keep using curl/curb in my library, but I'm afraid switching may be easier for me than fixing this all by myself (I have never read curl code, or native Ruby extensions for that matter). I can certainly help diagnose this - I know I can reliably reproduce the issue by just removing those fixes.

Cheers!


taf2 commented Feb 2, 2016

Well, it's good to hear that close + GC.start works around the issue. It makes sense, because the socket is only closed in a GC cycle. I think there is also a keepalive setting in curl we can set to resolve this, which would be less "ugly", to your point. Doing a little Google search, however, it looks like the issue is related to this bug report: http://sourceforge.net/p/curl/bugs/1010/


thomax commented Feb 26, 2016

I use Curl::Easy for large amounts of traffic, and am also getting a pile-up of sockets in the CLOSE_WAIT state. Solving this by closing and opening a new connection after every get would be costly. Is there any other solution on the horizon, @taf2?


taf2 commented Feb 26, 2016

It looks like this might be related to a bad version of libcurl? Did you check the bug report, and did you also see this: https://curl.haxx.se/mail/lib-2009-03/0009.html


thomax commented Feb 26, 2016

Thanks for the tip @taf2 ! I'll look into that!

@stayhero

I'm running into this as well since we basically upgraded to Ruby 2.3.0 and curb from 0.8.5 to 0.9.1.

We use Ubuntu LTS 14.04 with libcurl 7.35 (latest Ubuntu 14.04 packaged version). Is the suggested solution to compile libcurl manually on the latest Ubuntu LTS? Or is it possible a Ruby downgrade would help (because Ruby 2.3 has probably changed its GC behavior, etc.)? To be honest I'm a bit hesitant to install libcurl manually instead of using the default Ubuntu apt repos. ;)

@stayhero

BTW: the issue described here https://curl.haxx.se/mail/lib-2009-03/0009.html should have been fixed in libcurl 7.20 at least? Hence I guess this is not the problem here?

@stayhero

@thomax @heaven @nubis Did you run into it in Ruby versions < 2.3?


thomax commented Mar 14, 2016

@stayhero Yeah, I'm having this problem on Ruby v2.2.3p173 and I'm sure our libcurl is patched.


thomax commented Mar 16, 2016

Could this be a problem with ruby garbage collection, rather than curb/curl?


nubis commented Mar 16, 2016

Ruby 2.2.1p85 when this happened.



thomax commented Mar 17, 2016

We've solved this problem in our codebase. We used to just grab a new Curl::Easy instance for each request, resulting in the aforementioned issue. Our workaround was to always reuse and reset the Curl::Easy instance, something along these lines:

def self.get(url = nil, params = nil, &block)
  url, params = url_and_params_from_args(url, params, &block)
  return with_curl do |easy|
    easy.url = url_with_params(url, params)
    easy.http_get
  end
end

def self.with_curl(&block)
  easy = Thread.current[:pebblebed_curb_easy] ||= Curl::Easy.new
  easy.reset
  yield easy
  return handle_http_errors(Response.new(easy))
end

def self.handle_http_errors(response)
  if response.status == 404
    errmsg = "Resource not found: '#{response.url}'"
    errmsg << extract_error_summary(response.body)
    raise HttpNotFoundError.new(ActiveSupport::SafeBuffer.new(errmsg), response.status)
  elsif response.status >= 400
    errmsg = "Service request to '#{response.url}' failed (#{response.status}):"
    errmsg << extract_error_summary(response.body)
    raise HttpError.new(ActiveSupport::SafeBuffer.new(errmsg), response.status, response)
  end
  response
end


taf2 commented Mar 18, 2016

@thomax you should look at using the following methods as they internally use Thread.current to reuse a curl handle:

Curl.get
Curl.post
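
A minimal usage sketch (example.com is a placeholder):

# Both calls below go through the same Curl::Easy handle stored in
# Thread.current, so the second request can reuse the first one's connection.
res = Curl.get("http://example.com/feed")
puts res.response_code
res = Curl.get("http://example.com/feed")
puts res.body_str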


thomax commented Mar 18, 2016

@taf2 Thanks for the suggestion, I'll look into that!

Would it be unexpected behaviour if Curl::Easy just wrapped Curl.method and quietly spared me the trouble of worrying about the pileup of unusable sockets?


taf2 commented Mar 18, 2016

Possibly Curl::Easy allows you to get a new libcurl easy handle. I'm not exactly sure what is causing your issue... Do you have sample code to recreate the issue?


mikaelhm commented Apr 5, 2016

I also have this issue with code like this:

def post_request(url, request_body)
  request = Curl::Easy.new(url)
  request.headers['Content-type'] = 'application/json'
  request.certtype = 'PEM'
  request.cert = "<certificate-string>"
  request.cacert = "<root-cert-string>"
  request.cert_key = "<private-key-string>"
  request.http_post(request_body)

  return request.body_str
end

I tried @nubis's suggestion with GC.start and it works:

def post_request(url, request_body)
  request = Curl::Easy.new(url)
  request.headers['Content-type'] = 'application/json'
  request.certtype = 'PEM'
  request.cert = "<certificate-string>"
  request.cacert = "<root-cert-string>"
  request.cert_key = "<private-key-string>"
  request.http_post(request_body)
  response_body = request.body_str
  request.close
  GC.start

  return response_body
end

System:

  • ruby: 2.1.1p76
  • libcurl: 7.22.0
  • Ubuntu: 12.04 and 14.04

@levilansing

+1 I also have this problem using Curb inside a thin service. Even if I use Curl.post/get AND explicitly close the connections AND run GC.start occasionally, that only helps reduce the problem; it still seems to occur eventually.

Symptoms of failure are weird. Either a "Couldn't resolve host name (Curl::Err::HostResolutionError)" or sometimes it causes a Segmentation fault.


taf2 commented Apr 8, 2016

@mikaelhm the GC.start is probably not necessary, only request.close - that ensures curb tells libcurl to close the handle, which would eventually be closed by GC anyway. Calling GC.start right after close just immediately frees up your existing Ruby objects, which might be good or might not. In unicorn, for example, you could run that GC.start outside the request, which might be better...
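
For example, one way to move that GC.start out of the request path is a small piece of Rack middleware that only triggers GC every N requests; this is just a sketch of the idea, not something curb provides:

class PeriodicGC
  def initialize(app, every = 50)
    @app = app
    @every = every
    @count = 0
  end

  def call(env)
    response = @app.call(env)
    @count += 1
    if @count >= @every
      @count = 0
      # finalize closed-but-not-collected curb handles, releasing their sockets
      GC.start
    end
    response
  end
end

# hypothetical config.ru usage: use PeriodicGC, 50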

@levilansing mixing Curl.post/Curl.get with close may not be what you want? Curl.post/Curl.get will put a single Curl::Easy handle into Thread.current - meaning the handle is shared/re-used between requests. The main benefit of using it is to share the existing connection.

Thinking about this issue, I suspect we might need to do something internal to force the handles to be freed, or I wonder if the handles are piling up in the multi handle. It might be interesting to get the value of handle.multi.idle?

I've made a commit here: 2509064..e3126a8

That clears out the multi request handle early. I think this might be the solution, please let me know.

@mikaelhm

@taf2 I tried with just request.close before reading about the GC.start workaround. In my case the problem wasn't fixed until I introduced the GC.start workaround.

I will try to remove it again and upgrade curb to 0.9.2 and see if your attempt fixed it.


taf2 commented Apr 10, 2016

@mikaelhm I bet the GC.start is triggering the multi handle to close - that could be the leak. If so, 0.9.2 will fix it, unless of course there are other leaks :(

@mikaelhm

I will let you know. Thanks for attempting to fix the leak.

@mikaelhm

@taf2 I cannot test the fix in 0.9.2, see #296

@mikaelhm

After a day's use, I feel confident that @taf2 fixed the memory/socket leak.

0.9.3 fixed my issues.

@stayhero

0.9.3 solved our issues as well.


ta commented Sep 12, 2017

We ran into this issue today as well with curb 0.9.4. The GC.start workaround seems to have fixed the issue.

Environment:

$ uname -srv
Linux 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017

$ dpkg -l | grep libcurl4-openssl-dev
ii  libcurl4-openssl-dev:amd64           7.35.0-1ubuntu2.10

$ ruby -v
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux-gnu]

$ bundle list | grep curb
  * curb (0.9.4)

Code to reproduce:

100.times { curl = Curl::Easy.new("http://google.com"); curl.get }

After a few minutes: $ netstat -an | grep "CLOSE_WAIT" | wc -l (yielded 100+ in our tests before the workaround and 2 after the workaround)
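
For comparison, the workaround variant of that loop was roughly the following (a sketch; the exact placement of close/GC.start wasn't recorded in this comment):

100.times do
  curl = Curl::Easy.new("http://google.com")
  curl.get
  curl.close
  GC.start
end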


pts-owentran commented Sep 28, 2017

Hi. For the past two weeks, I've been chasing down randomly growing memory on my Rails Puma server, where file descriptors are used up by connections left in CLOSE_WAIT from curb requests after upgrading from 0.8.8 to 0.9.4. GC.stat shows a growing number of heap_final_slots, indicating objects waiting to be finalized but never getting GC'ed. Once this state is triggered, the heap grows unbounded until the worker crashes.

[28864] [2017-09-27 05:58:37 +0000] {:count=>108, :heap_allocated_pages=>6811, :heap_sorted_length=>7014, :heap_allocatable_pages=>0, :heap_available_slots=>2776137, :heap_live_slots=>2035764, :heap_free_slots=>740373, :heap_final_slots=>0, :heap_marked_slots=>788219, :heap_swept_slots=>141653, :heap_eden_pages=>4995, :heap_tomb_pages=>1816, :total_allocated_pages=>6811, :total_freed_pages=>0, :total_allocated_objects=>61074009, :total_freed_objects=>59038245, :malloc_increase_bytes=>24043216, :malloc_increase_bytes_limit=>67108864, :minor_gc_count=>76, :major_gc_count=>32, :remembered_wb_unprotected_objects=>35460, :remembered_wb_unprotected_objects_limit=>41569, :old_objects=>720516, :old_objects_limit=>850914, :oldmalloc_increase_bytes=>88012080, :oldmalloc_increase_bytes_limit=>131586007}
[28864] [2017-09-27 05:59:37 +0000] {:count=>108, :heap_allocated_pages=>6811, :heap_sorted_length=>7014, :heap_allocatable_pages=>0, :heap_available_slots=>2776137, :heap_live_slots=>2026272, :heap_free_slots=>749865, :heap_final_slots=>0, :heap_marked_slots=>788219, :heap_swept_slots=>311425, :heap_eden_pages=>4973, :heap_tomb_pages=>1838, :total_allocated_pages=>6811, :total_freed_pages=>0, :total_allocated_objects=>61234292, :total_freed_objects=>59208020, :malloc_increase_bytes=>25442760, :malloc_increase_bytes_limit=>67108864, :minor_gc_count=>76, :major_gc_count=>32, :remembered_wb_unprotected_objects=>35460, :remembered_wb_unprotected_objects_limit=>41569, :old_objects=>720516, :old_objects_limit=>850914, :oldmalloc_increase_bytes=>89411624, :oldmalloc_increase_bytes_limit=>131586007}
[28864] [2017-09-27 06:00:37 +0000] {:count=>108, :heap_allocated_pages=>6811, :heap_sorted_length=>7014, :heap_allocatable_pages=>0,   :heap_available_slots=>2776137, :heap_live_slots=>2026395, :heap_free_slots=>749742,  :heap_final_slots=>0,    :heap_marked_slots=>788219, :heap_swept_slots=>517073, :heap_eden_pages=>4973, :heap_tomb_pages=>1838, :total_allocated_pages=>6811, :total_freed_pages=>0, :total_allocated_objects=>61440066, :total_freed_objects=>59413671, :malloc_increase_bytes=>46605192, :malloc_increase_bytes_limit=>67108864, :minor_gc_count=>76, :major_gc_count=>32, :remembered_wb_unprotected_objects=>35460, :remembered_wb_unprotected_objects_limit=>41569, :old_objects=>720516, :old_objects_limit=>850914, :oldmalloc_increase_bytes=>110574040, :oldmalloc_increase_bytes_limit=>131586007}
[28864] [2017-09-27 06:01:37 +0000] {:count=>109, :heap_allocated_pages=>6811, :heap_sorted_length=>7014, :heap_allocatable_pages=>142, :heap_available_slots=>2776137, :heap_live_slots=>1714003, :heap_free_slots=>1060659, :heap_final_slots=>1475, :heap_marked_slots=>789481, :heap_swept_slots=>67690,  :heap_eden_pages=>4949, :heap_tomb_pages=>1862, :total_allocated_pages=>6811, :total_freed_pages=>0, :total_allocated_objects=>61894234, :total_freed_objects=>60178756, :malloc_increase_bytes=>48688,    :malloc_increase_bytes_limit=>67108864, :minor_gc_count=>77, :major_gc_count=>32, :remembered_wb_unprotected_objects=>35587, :remembered_wb_unprotected_objects_limit=>41569, :old_objects=>730902, :old_objects_limit=>850914, :oldmalloc_increase_bytes=>130802344, :oldmalloc_increase_bytes_limit=>131586007}
[28864] [2017-09-27 06:02:37 +0000] {:count=>109, :heap_allocated_pages=>6811, :heap_sorted_length=>7014, :heap_allocatable_pages=>0, :heap_available_slots=>2776137, :heap_live_slots=>1702113, :heap_free_slots=>1071494, :heap_final_slots=>2530, :heap_marked_slots=>789481, :heap_swept_slots=>389097, :heap_eden_pages=>4922, :heap_tomb_pages=>1889, :total_allocated_pages=>6811, :total_freed_pages=>0, :total_allocated_objects=>62204483, :total_freed_objects=>60499840, :malloc_increase_bytes=>25402264, :malloc_increase_bytes_limit=>67108864, :minor_gc_count=>77, :major_gc_count=>32, :remembered_wb_unprotected_objects=>35587, :remembered_wb_unprotected_objects_limit=>41569, :old_objects=>730902, :old_objects_limit=>850914, :oldmalloc_increase_bytes=>131649888, :oldmalloc_increase_bytes_limit=>131586007}
[28864] [2017-09-27 06:03:37 +0000] {:count=>109, :heap_allocated_pages=>6811, :heap_sorted_length=>7014, :heap_allocatable_pages=>0, :heap_available_slots=>2776137, :heap_live_slots=>1701008, :heap_free_slots=>1071013, :heap_final_slots=>4116, :heap_marked_slots=>789481, :heap_swept_slots=>824320, :heap_eden_pages=>4921, :heap_tomb_pages=>1890, :total_allocated_pages=>6811, :total_freed_pages=>0, :total_allocated_objects=>62639288, :total_freed_objects=>60934164, :malloc_increase_bytes=>10043088, :malloc_increase_bytes_limit=>67108864, :minor_gc_count=>77, :major_gc_count=>32, :remembered_wb_unprotected_objects=>35587, :remembered_wb_unprotected_objects_limit=>41569, :old_objects=>730902, :old_objects_limit=>850914, :oldmalloc_increase_bytes=>116290712, :oldmalloc_increase_bytes_limit=>131586007}


I've bought myself some time with the puma-killer gem but want to solve the underlying issue. Calling close in an ensure block seems to do nothing. Using GC.start seems to clear any connections but GC time per request can be anywhere from 100-200 ms according to my NewRelic monitoring.


Attempting to roll back to version 0.8.8 to see if the issue still manifests.


pts-owentran commented Sep 28, 2017

@ta I was unable to reproduce any CLOSE_WAIT with version 0.9.4 on my vagrant box with almost the same environment as you:

$ uname -srv
Linux 3.13.0-128-generic #177-Ubuntu SMP Tue Aug 8 11:40:23 UTC 2017

$ dpkg -l | grep libcurl4-openssl-dev
ii  libcurl4-openssl-dev:amd64           7.35.0-1ubuntu2.10                         amd64        

$ ruby -v
ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]

$ bundle list | grep curb
  * curb (0.9.4)

I did observe that running the code block 100 times creates about 100 ESTABLISHED connections that won't go away until GC is called, regardless of whether you call close. With GC.start added, connections are cleared (verified by doing lsof | grep ESTABLISHED | wc -l).

So GC.start will fix the issue at the cost of much higher response time. I will report back on how our production site reacts. It normally takes 4-6 hours under moderate traffic for the CLOSE_WAIT issue to start manifesting.


ta commented Sep 28, 2017

@pts-owentran The problem only surfaced in our environment after upgrading from ruby 2.2.x to 2.4.x (we skipped 2.3.x).


taf2 commented Jan 25, 2019

Thanks @robuye !


taf2 commented Jan 25, 2019

Okay, I need to test this more but maybe adding:

diff --git a/ext/curb_multi.c b/ext/curb_multi.c
index 705e522..aa588a9 100644
--- a/ext/curb_multi.c
+++ b/ext/curb_multi.c
@@ -276,6 +276,9 @@ static void rb_curl_mutli_handle_complete(VALUE self, CURL *easy_handle, int res

   curl_easy_getinfo(rbce->curl, CURLINFO_RESPONSE_CODE, &response_code);

+  // request is finished remove easy handle from multi handle to allow connections to be freed
+  rb_funcall(self, rb_intern("remove"), 1, easy);
+
   if (result != 0) {
     if (!rb_easy_nil("failure_proc")) {
       callargs = rb_ary_new3(3, rb_easy_get("failure_proc"), easy, rb_curl_easy_error(result));


taf2 commented Jan 25, 2019

Argh, we already do that... so never mind... also, running the example test provided and running netstat, I'm not seeing any connection leaks... maybe I'm not testing this correctly?

while running the test script:

taf2@ip-10-55-11-11:~/work/curb$ netstat -anp |grep ruby |head
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 10.55.11.11:54086           151.101.249.178:80          ESTABLISHED 18298/ruby
tcp        0      0 10.55.11.11:38648           172.217.8.3:80              ESTABLISHED 18298/ruby


taf2 commented Jan 25, 2019

What version of libcurl are you guys testing on? I ran my test on ruby-2.5.3 and libcurl 7.53.1


robuye commented Jan 25, 2019

I haven't had time to sit down with it since my last comment.

Here's the configuration I was using:

root@f764d8e598a6:/code# ruby -v
ruby 2.3.8p459 (2018-10-18 revision 65136) [x86_64-linux]
root@f764d8e598a6:/code# curl -V
curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 OpenSSL/1.0.2q zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz TLS-SRP HTTP2 UnixSockets HTTPS-proxy PSL 

This is from docker, and I also tested it locally in the following configuration:

$ ruby -v
ruby 2.3.3p222 (2016-11-21 revision 56859) [x86_64-linux]

$ curl -V
curl 7.61.1-DEV (x86_64-pc-linux-gnu) libcurl/7.61.1-DEV OpenSSL/1.0.2g zlib/1.2.8
Release-Date: [unreleased]
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP UnixSockets HTTPS-proxy 

Something to bear in mind: you need to recompile and reload irb when you check out a different commit for testing.

@taf2 I mentioned I used an example from the readme, but I also added a loop there to run it 100 times. Not sure if you noticed, but without it 2 FDs are expected.


taf2 commented Jan 25, 2019

Yup, I increased the loop to 1000, monitored open FDs, and it never exceeded 2 for me. I used ruby-2.5.3 and libcurl 7.53.1 for testing... and was not able to leak any FDs...


robuye commented Jan 25, 2019

alright, let me check it out on ruby 2.5, I will be back shortly.

@morhekil

For us, the server where we see the leaks runs ruby 2.5.1 and libcurl 7.22.0.


robuye commented Jan 25, 2019

It's the same on ruby 2.5.3p105 and curl 7.52.1.

@taf2 are you on osx perhaps?


taf2 commented Jan 25, 2019

@robuye EC2, AWS Amazon Linux - so a flavor of CentOS...


robuye commented Jan 25, 2019

I could only test it on Ubuntu 16.04 on EC2, and I see leaked FDs there. That's on 0.9.7 only, but it's probably irrelevant. Same on Docker centos-ruby (2.6).

I will see if I can debug this bug further over the weekend.


taf2 commented Jan 25, 2019

Yeah, I must not be testing this correctly then... Here's a gist I created to help me run the loop and also monitor netstat... maybe I'm not filtering netstat correctly? Or maybe it's something about how I'm running it...

https://gist.github.com/taf2/683ca2f9cec226de44c8f992b1ca5cc2


taf2 commented Jan 25, 2019

okay typical me! I was running 0.9.2 🤯

anyways the test script in the gist should be helpful then in fixing this.


robuye commented Jan 25, 2019

Hah, that has happened to me too often. I noticed calling GC.start closes all the FDs, if that's of any help.


taf2 commented Jan 25, 2019

We can use this to add a regression test... we should figure out how to get netstat working on BSD systems too, for OSX users.


taf2 commented Jan 25, 2019

When modifying the test script to call GC.start after m.perform, the connection counts stay in the 2-3 range... so when GC runs, connections are cleaned up. The issue here is we can either wait and let GC do the connection close, or we keep the connection around because maybe the end user will want to use the connection again...

I'm thinking we might need an explicit option to keep the connection open so it can be re-used, or to close it right away...


robuye commented Jan 25, 2019

Yea, I was thinking something along those lines, but I don't understand why this has changed as of 0.9.4, and I'm unsure if the connections are actually reused as they should be 🤔


robuye commented Jan 25, 2019

Restoring these lines from e3126a fixes the problem:

  curl_multi_cleanup(rbcm->handle);
  rbcm->handle = NULL;

The FDs go down to zero over ~30 seconds. This delay is interesting; it looks like a timeout is kicking in?


robuye commented Jan 25, 2019

Bear in mind it's not a real fix, because if it stops curb from reusing connections, that's a pretty serious regression.


robuye commented Jan 25, 2019

Actually, looking at the description of this commit, this seems to be exactly what you were looking for @taf2 - to close FDs without waiting for GC.

Calling it at the end of perform seems reasonable because at this point we have stopped all the transfers.

The commit you added included more changes and perhaps this would prevent curb from reusing connections.

The test code we were using creates 100 multi handles, each with 2 easy handles, so it doesn't show whether the connections are reused. Also, a high number of FDs is expected. I will try to come up with another test to see if we reuse connections or not.


taf2 commented Jan 25, 2019

@robuye yeah, I think the case of re-use is when doing something simple like:

res = Curl.get("https://www.google.com/")

...

res = Curl.get("https://www.google.com/") # should re-use existing connection ?


robuye commented Jan 25, 2019

I think this would create separate multis, each with its own pool, and they wouldn't be able to reuse connections. At least that's my understanding of libcurl.

I was thinking we'd need multiple requests in a single multi, all to the same host (ideally HTTPS, so we pay the massive overhead here). And those should be slow to respond so the pool fills up and reuse begins.


robuye commented Jan 27, 2019

I think I have figured it out, and I also have some food for thought.

So libcurl will try to hold a connection open for as long as possible, and it will reuse cached connections when possible too. See this thread from the libcurl mailing list; it goes into detail about this behaviour and Daniel explains it much better than I could.

In the case of Ruby, we keep the FDs open until the multi handle is GCed. GC calls curl_multi_cleanup(rbcm->handle); and then libcurl closes all FDs. Curb hooks this into the multi finalizer, and that ties FD lifetime to GC runs.

Note that once curl_multi_cleanup is called, the multi handle cannot be reused. Hooking it into the finalizer saves curb a lot of trouble because it will always work regardless of how it's used; it just doesn't behave well in long-running, rarely GCed processes.

The code I used to test this behaviour is significantly longer so I will just link a gist here. It demonstrates how connections are reused and shows interesting results so give it a shot.

I have enabled multi.max_connects = 1 there, which causes the multi handle to keep at most 1 connection in its cache, so this alone can mitigate leaked FDs. It can't be 0, so at least 1 FD per multi will still leak (until GC kicks in). This may or may not work for you @pts-owentran & @morhekil, depending on how you use the multi handle (note that curb always uses a multi handle, even when you use the easy interface, and there is no way around this).
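
A minimal sketch of that mitigation (example.com is a placeholder):

multi = Curl::Multi.new
multi.max_connects = 1 # keep at most one idle connection cached in this multi's pool
multi.add(Curl::Easy.new("http://example.com/"))
multi.perform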

I pointed curb to my local server to minimize any network latency and get better insight into connections. Here's the script output:

System FDs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
These FDs will be ignored.
Additional FDs open before start (0): []
Starting multi
Transfer 1 completed. Open FDs (2): [12, 13]
FD 12; socket: 192.168.16.2:35860 => 192.168.16.3:80
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 2 completed. Open FDs (2): [12, 13]
FD 12; socket: 192.168.16.2:35860 => 192.168.16.3:80
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 3 completed. Open FDs (2): [12, 13]
FD 12; socket: 192.168.16.2:35860 => 192.168.16.3:80
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 4 completed. Open FDs (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Transfer 5 completed. Open FDs (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
All transfers completed
Open FDs after multi completed (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Sleeping 10 seconds...
..........
Open FDs after sleep (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80

Add 5 new transfers

FDs open before start (1): [13]
FD 13; socket: 192.168.16.2:35862 => 192.168.16.3:80
Starting multi
Transfer 1 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 2 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 3 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 4 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
Transfer 5 completed. Open FDs (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80
All transfers completed
Open FDs after multi completed (1): [12]
FD 12; socket: 192.168.16.2:35870 => 192.168.16.3:80

And here's what happened according to server:

{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"524 (192.168.16.2:35854)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"525 (192.168.16.2:35856)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"526 (192.168.16.2:35858)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"527 (192.168.16.2:35860)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:37:22+00:00","connection":"528 (192.168.16.2:35862)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"528 (192.168.16.2:35862)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"529 (192.168.16.2:35864)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"530 (192.168.16.2:35866)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"531 (192.168.16.2:35868)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:37:33+00:00","connection":"532 (192.168.16.2:35870)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }

All 10 requests were made by curb, but only 1 reused a connection. When I removed multi.max_connects = 1, the multi would keep the initial 5 FDs open and reuse them for the second batch of easy handles:

{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"538 (192.168.16.2:35914)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.001","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"539 (192.168.16.2:35916)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"540 (192.168.16.2:35918)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"541 (192.168.16.2:35920)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:12+00:00","connection":"542 (192.168.16.2:35922)","connection_requests_count":"1","request":"GET /sleep/1 HTTP/1.1","request_time":"1.000","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"538 (192.168.16.2:35914)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"539 (192.168.16.2:35916)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"540 (192.168.16.2:35918)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"541 (192.168.16.2:35920)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }
{ "@timestamp":"2019-01-27T14:46:23+00:00","connection":"542 (192.168.16.2:35922)","connection_requests_count":"2","request":"GET /sleep/1 HTTP/1.1","request_time":"1.002","status":"200" }

This is somewhat irrelevant to the FD leak we're discussing here, but I wanted to make sure we're not introducing another regression.

Regarding the FD leak, I think the best fix is to expose curl_multi_cleanup to Ruby-land and allow the user to call it at any time to release FDs. Something along the lines of:

URL = 'http://google.com'
multi = Curl::Multi.new
multi.add(Curl::Easy.new(URL))
multi.perform
# connections & FDs remain open and will be reused by next handle
multi.add(Curl::Easy.new(URL))
multi.perform
multi.close
# all connections & FDs are closed
multi.add(Curl::Easy.new(URL))
# => Curl::Err::MultiBadHandle: CURLError: Invalid multi handle
multi.perform
# => Curl::Err::MultiBadHandle: CURLError: Invalid multi handle

Each multi manages its own pool of connections and it won't reuse FDs from another multi, so leaving sockets hanging until GC runs feels quite wasteful. We could also provide an option to autoclose a handle when perform returns (maybe Curl::Multi.autoclose = true); it could be convenient to set it once (i.e. in a Rails initializer) and never worry about it again.

I think it should be a fairly simple and safe fix, so let me know if it's feasible.
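
For illustration, usage under the proposed option might look like this (a sketch of the proposal, not an existing API at the time of this comment):

Curl::Multi.autoclose = true # set once, e.g. in a Rails initializer
# From now on every multi handle closes its connections as soon as perform
# returns, instead of holding them open until GC finalizes the handle.
res = Curl.get("http://example.com/")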


taf2 commented Jan 27, 2019

@robuye great analysis, I like the idea of a Curl::Multi.autoclose = true and exposing multi.close. Maybe something like the following: https://github.com/taf2/curb/tree/autoclose_close

I agree we need some regression tests for connection keep-alive and connection leaking.


robuye commented Jan 27, 2019

Thanks! Your implementation is better than what I came up with. Cool idea with re-initializing @requests.

I ran my gist on your branch and it seems to be working exactly as expected. Nice work 👍


pts-owentran commented Jan 27, 2019 via email


taf2 commented Jan 28, 2019

@pts-owentran thanks, we've pushed 0.9.8 to rubygems. If you have a chance to run this, let me know; I think with the updates to include a close method and autoclose we're actually close to a 1.x release 😄

taf2 closed this as completed Jan 28, 2019
taf2 reopened this Jan 28, 2019
taf2 closed this as completed Jan 28, 2019