Falcon for large edge includes (developing/applying IO strategy) #40

Open
julik opened this issue Dec 8, 2018 · 108 comments

@julik

julik commented Dec 8, 2018

Thank you for exploring async in Ruby! We are currently looking at raising the throughput of our download servers (if you are curious, there is a presentation about it here: https://speakerdeck.com/julik/streaming-large-files-with-ruby - see slide 16 specifically). Our current setup is:

[libCURL get to *FILE] -> [Ruby File object] -> [non-blocking sendfile + IO#wait when EAGAIN]

This is repeated multiple times to "splice" response bodies together and works really well, except that one response served this way consumes an entire thread. Async and fibers seem to be a good way to approach this problem, and nio4r also seems to be a great option because I can leverage the ByteBuffer implementation and the IO reactor. But this is where the question arises. In the scheme we have at the moment (see above) there are two crucial elements: we download upstream data into a file buffer (in reality it is on a RAM filesystem) which is not in the Ruby heap, and we then tell the kernel to write that file into the socket that services our client on the Puma side - again, no bytes enter the Ruby heap. In practice this means we have next to no allocation overhead during streaming, regardless of the size of the workload / the number of upstream requests we need to perform. nio4r supports this usage pattern if you use raw sockets and a ByteBuffer; from what I understood from the docs, something like this:

buf = ByteBuffer.new(capacity: 24*1024)
while ... do # check for EOF etc
  buf.read_from(file) # If I get a 0 as return value I could call task.yield here (unlikely)
  buf.flip
  buf.write_to(socket)  # If I get a 0 as return value I could call task.yield here as well (very likely)
  buf.flip
end

This would allow reuse of the ByteBuffer, and while not as efficient as sendfile() - it effectively becomes read(infd, buf, size); write(outfd, buf, size) - it would still let us accomplish what we need (not pulling these heaps of data into the Ruby string heap).

I have taken a peek at the beers.rb example and what I could envision is something like this if I reproduce it:

def stream_through(output_body, task, url, headers_for_upstream)
  # Obtuse URL management is not nice :-(
  uri = URI(url)
  base_uri = uri.dup
  base_uri.path = ""
  base_uri.query = nil

  endpoint = Async::HTTP::URLEndpoint.parse(base_uri.to_s)
  client = Async::HTTP::Client.new(endpoint)
  request = Async::HTTP::Request.new(client.scheme, uri.hostname, "GET", uri.request_uri, headers_for_upstream)
  response = client.call(request)
  if (200..206).cover?(response.status) && body = response.body
    while chunk = body.read
      output_body.write(chunk)
      task.yield # Inserted for completeness sake
    end
  end
  response.close
end

run do |env|
  known_content_length = ... # I do know it in advance
  remote_urls_to_splice = ... # as well as these

  current_task = Async::Task.current
  async_output_body = Async::HTTP::Body::Writable.new
  current_task.async do |task|
    remote_urls_to_splice.each do |url_string|
      stream_through(async_output_body, task, url_string, {})
    end
  ensure
    async_output_body.close
  end

  [200, {'Content-Length' => known_content_length}, async_output_body]
end

The problem that we have is that this will work well when the data transfer is relatively low-volume (chats, websockets etc.). But for us it will immediately blow up the Ruby heap with strings, since Async::HTTP::Body::Writable, from what I can see, is basically a "messagebox" (channel) for Strings. Memory use will probably be similar to what you could achieve with Rack's #each on a streaming body yielding Strings (we tried, it is immense and the application stops fitting in RAM very quickly). What I want to do instead is pass the lowest-level possible objects to the reactor and tell it "Dear reactor, please use this fixed buffer to copy N bytes from this fd to that fd, and if there is an EAGAIN yield and try again later". But if both my "upstream" response body and my writable server response body are messageboxes for Strings, this option doesn't actually exist, right?
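To illustrate, here is roughly the kind of copy loop I mean - a minimal sketch using plain non-blocking Ruby IO and one reusable buffer (source_io and dest_io are placeholders for the upstream file and the client socket; partial writes still slice a String, so it is not fully allocation-free):

require 'io/wait'

CHUNK_SIZE = 64 * 1024
buffer = String.new(capacity: CHUNK_SIZE)

loop do
  case chunk = source_io.read_nonblock(CHUNK_SIZE, buffer, exception: false)
  when nil then break             # EOF on the upstream
  when :wait_readable
    source_io.wait_readable       # in a reactor, this is where the task would yield
  else
    until chunk.empty?
      written = dest_io.write_nonblock(chunk, exception: false)
      if written == :wait_writable
        dest_io.wait_writable     # likewise, yield back to the reactor here
      else
        chunk = chunk.byteslice(written, chunk.bytesize - written)
      end
    end
  end
end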

Strictly speaking - yes, I am looking for an alternative to a non-blocking splice(). I can have plain (no-SSL) upstreams if that makes the job easier, I can also omit the "buffer to file first" if the rest of the setup works well. Everything in the setup is strictly HTTP1.1 at this point and the previous implementation even used HTTP1.0 for simplicity's sake.

So the question is, I guess - is this kind of workload a fit for falcon? Is it a good fit for nio4r? I do have the feeling that orchestrating these large-volume IO ops with Ruby should be perfectly feasible, but when I examine the examples and the involved collaborator modules all I see are Strings, Strings, Strings... (primarily async/http). Is there maybe some kind of wrapper around the nio4r ByteBuffer that I could use as the async response body instead?..

Maybe somehow get access to the actual output socket Falcon sets up (a-la Rack hijack) and perform non-blocking IO on that socket manually via nio4r?

I believe this is intimately related to #7 among others.

Edit: or if there is something I could use to no-string-copy from the async-http client body to the writable body of my HTTP response that could work too 🤔

@ioquatix
Member

ioquatix commented Dec 8, 2018

I think you should try it first and then see if memory usage/strings allocation is an issue. We've already done some optimisation in this area (minimising string allocations). Once you've figured out specific code paths that are causing junk to be allocated, it could be raised as an issue.

Regarding splicing from input to output, it's not protocol-agnostic, especially w.r.t. HTTP/2.

That being said, maybe there is room for a HTTP/1 specific code path which suits your requirements.

NIO4R byte buffer hasn't been too useful in practice, but maybe we could make that work better if we know specifically what parts aren't up to scratch.

@julik
Author

julik commented Dec 9, 2018

👍 Thanks, we will do some stress testing.

@ioquatix
Member

Did you make any progress on this?

@julik
Author

julik commented Dec 20, 2018

We did. We ran falcon with 6 worker processes, putting it behind nginx on one of our production instances (where puma used to run instead). We had to switch from the unix socket that Puma uses to a TCP socket for that.

I also had to implement a backstop for the async backpressure issue which would otherwise destroy us, something like this

  def wait_for_queue_throughput(output_body, max_queue_items_pending, task)
    # Ok, this is a Volkswagen, but bear with me. When we are running
    # inside the test suite, our reactor will finish _first_, and _then_ will our
    # Writable body be read in full. This means that we are going to be
    # throttling the writes but on the other end nobody is really reading much.
    # That, in turn, means that the test will fail as the response is not
    # going to be written in full. There, I said it. This is volkswagen.
    return if 'test' == ENV['RACK_ENV']
    
    # and then see whether we can do anything
    max_waited_s = 15
    backpressure_sleep_s = 0.1
    waited_for_s = 0.0
    while output_body.pending_count > max_queue_items_pending
      LOGGER.debug { "Slow client - putting task to sleep" }
      waited_for_s += backpressure_sleep_s
      if waited_for_s > max_waited_s
        LOGGER.info { "Slow client, closing" }
        raise "Slow client, disconnecting them"
      end
      # There should be a way to awake this task when this WritableBody has been read from on the other end
      task.sleep(backpressure_sleep_s)
    end

    # Let other tasks take things off the queue inside the Body::Writable
    task.yield
  end

To do this I had to expose the pending item count on the Body::Writable thing. The upstream we pull data from is S3 over HTTPS, but we are pulling many different objects at the same time. Since the default chunk size seems to hover around 4KB in our case, I opted for 8 items on the queue as the limit.
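For reference, "expose the pending item count" was done with roughly this kind of subclass (a sketch only - it tracks the count from the outside rather than reaching into the Writable internals, and assumes a single producer and single consumer, which holds inside one reactor):

class ObservableBody < Async::HTTP::Body::Writable
  def initialize(*arguments)
    super
    @pending_count = 0
  end

  # Chunks written by our proxying task but not yet read by the server side.
  attr_reader :pending_count

  def write(chunk)
    @pending_count += 1
    super
  end

  def read
    chunk = super
    @pending_count -= 1 if chunk
    chunk
  end
end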

We limited the server to the same number of clients allowed to connect as our current implementation (600) and here is what happened:

[screenshot: grafana-dlserver-async]

I think you were right that we needed to test this first, as the memory situation seems to be fine - we are not leaking much, at least not in the few hours we ran the test, so hats off to your buffer size choices and how you managed to reuse a string there ❤️

What we did observe:

  • Way less context switches, yay!
  • Doesn't seem to leak. Yay!
  • We are not achieving the same throughput we used to have with the same number of clients, while we actually want to exceed it (this is a c8 EC2 instance)
  • We are pegging the CPUs and we are pegging them quite a bit. Most likely this became our limiting factor instead of the network bandwidth we can consume, and I suspect it is because of this pumpin' of strings around. On an rbspy profile from our previous implementation the most CPU time came from running IO.select on a single writable socket, not from the actual sendfile() or IO.copy_stream which are just a write()/read() pair if you do not use SSL. I did not have time to profile this in production yet as we could only carve out a small window to do the experiment and...
  • It looks like either falcon or nginx is leaking TCP connections. When we looked at netstat we found a ton of connections from nginx to the CloudFront origins (its downstream) in TIME_WAIT state. So even though we do send Connection: close on our responses, it seems nginx still keeps its downstream connections keepalive, which is not desired. We need to tweak nginx's settings a bit since we do not feel ready to expose a naked Ruby webserver to this workload just yet.
  • We also dispatch webhooks from this thing (not too many but a few) and during this testing the webhooks were not async-enabled - it was sync Patron, so we were blocking the reactor during their dispatches
  • We ping Redis during this procedure and this is not async-enabled, but Redis is run locally and very fast so I doubt it contributes much
  • Our main Rack application action which creates the download manifests is also not Async-enabled at this stage since that would be a bit much rewriting at once

I am intending to force nginx to do a Connection: close, and the webhook dispatch has been replaced by async-http now, so we are going for another round of tests in January. I think we will also reduce the number of processes. But it does seem I need a lower-level IO strategy for this. I am almost contemplating injecting an Async reactor into Puma on a separate thread so that we can "ship off" hijacked sockets to it. Would welcome any advice ;-)

@ioquatix
Member

That is really useful feedback.

We have async-redis which is pretty decent, but undergoing active development right now.

Falcon does all parsing within Ruby land so it's going to be slower than a server which implements it in C. But for many applications, the overhead is not so big.

Leaking connections seems odd. If you can make a small repro with only Falcon I'd be interested to see it because we also check for leaking sockets. The Falcon test suite is pretty small though.

There are a handful of options.

One thing which might benefit you, is the planned work for a C backend for falcon to optimise the request/response cycle on the server side. This will be an optional paid upgrade. Additionally, if you are interested, I have an open source library which is well proven for handling large numbers of requests and large amounts of data. We can shape this into a custom web server for your exact requirements and I guarantee you will achieve within a few % of the theoretical throughput of the hardware/vm.

@ioquatix
Member

Do you mind explaining the path you are taking through Falcon for serving content? Are you using HTTP/1.1? What are you using for the upstream request?

@ioquatix
Member

ioquatix commented Dec 20, 2018

I will implement back pressure within the queue too - I'll try to make it in the next release. Your implementation might not be optimal.

@julik
Author

julik commented Dec 20, 2018

We are using HTTP/1.1 from falcon to nginx, and HTTP/1.1 from nginx to CloudFront which is our fronting CDN. HTTP/2 is not in the picture for us at the moment. To clarify: nginx is "downstream" for falcon, CloudFront is "downstream" for nginx. Our "upstreams" (servers our Ruby app is making requests to) are S3 for the data we proxy through and a couple of small requests to our internal systems for metadata, also over HTTP/1.0. These do not egress our VPC and are tiny compared to the amount of data "put through" from S3 to downstream.

One thing which might benefit you, is the planned work for a C backend for falcon to optimise the request/response cycle on the server side.

These are interesting propositions. I did look at the business support model for falcon but I don't think we are ready to commit to it at this stage. First, we have a pretty variable workload, and though we can predict how many proxy servers we are going to run by way of capacity planning, having what is effectively a support contract for that number of servers might not be very considerate at this stage. It might also happen that we replan to use a different runtime and can then drastically reduce the number of servers, since we would be able to saturate their NICs to the maximum. Second, we obviously need to see the requisite performance materialise.

So at the moment I think contributing to the ecosystem with explorations, tests and eventual patches might be a better option, but I might be mistaken ofc.

This will be an optional paid upgrade. Additionally, if you are interested, I have an open source library which is well proven for handling large numbers of requests and large amounts of data. We can shape this into a custom web server for your exact requirements and I guarantee you will achieve within a few % of the theoretical throughput of the hardware/vm.

I am interested. There is a bit of a concern for me that building an entirely custom proprietary webserver might be a bad idea from the point of view of my colleagues, since they will also have to support and debug it should things go south. Let's chat ;-)

Your implementation might not be optimal.

Yes, please ❤️ The best I could find is to opportunistically sleep the task for some time; I am certain it could be woken up sooner if the task is somehow coupled to the nio4r monitor.

P.S. I do believe that we could achieve this throughput if it were possible to get access to the nio4r socket objects from within falcon already tho.

@ioquatix
Member

socketry/async-http#6 is now fixed. It has documentation which might help you.

@julik
Author

julik commented Dec 20, 2018

Awesome!

@ioquatix
Member

First, we have a pretty variable workload, and though we can predict how many proxy servers we are going to run by way of capacity planning, having what is effectively a support contract for that number of servers might not be very considerate at this stage

If you can think of a better way to do this I am open to ideas.

@ioquatix
Member

ioquatix commented Dec 20, 2018

P.S. I do believe that we could achieve this throughput if it were possible to get access to the nio4r socket objects from within falcon already tho.

Realistically, the only way to do something like this would be a partial hijack. That's not supported in Falcon at the moment. But maybe it's possible. Only full hijack is supported, and it's Rack compatible so it returns the raw underlying IO, extracted from the reactor:

env[::Rack::RACK_HIJACK] = lambda do
  wrapper = request.hijack
  # We dup this as it might be taken out of the normal control flow, and the io will be closed shortly after returning from this method.
  io = wrapper.io.dup
  wrapper.close
  # This is implicitly returned:
  env[::Rack::RACK_HIJACK_IO] = io
end

Maybe there is a better way to do this, or even just expose the IO directly in env.

@ioquatix
Member

Can you explain your ideal slow client disconnecting policy? e.g. less than x bytes/s for y minutes? or something else?

@julik
Author

julik commented Dec 21, 2018

The ideal would be that if the bytes per second for a client stay below N bytes per second over N seconds, I would kick the client out. However, I can in a way "abstract this up" because my chunk size ends up pretty much always being the default non-blocking read chunk size async-http provides, so I can extrapolate from that and disconnect clients if there is no movement in the queue for that much time - which is the abstraction I have found so far. I do have an object that keeps tabs on how much data got sent over the last N seconds and I can use that object as well, but let's contemplate the queue-length indicator for a minute.
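For reference, the "object that keeps tabs" is conceptually something like this (a hypothetical sketch, not the actual class we run):

class ThroughputMeter
  def initialize(window_s: 5)
    @window_s = window_s
    @samples = [] # pairs of [monotonic timestamp, bytes sent]
  end

  def record(bytes_sent)
    @samples << [Process.clock_gettime(Process::CLOCK_MONOTONIC), bytes_sent]
  end

  # How many bytes were sent during the last window_s seconds.
  def bytes_in_window
    horizon = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @window_s
    @samples.reject! { |at, _| at < horizon }
    @samples.sum { |_, bytes| bytes }
  end
end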

With the implementation I had there was some measurement, because the task would resume in a polling fashion after some time. With the new LimitedQueue implementation it seems possible that the task can basically be stopped indefinitely if nothing starts reading from the queue, due to the use of a condition. Imagine this:

  • Writing scope tries to add an item to the queue, queue is full, condition is created and set to notify the task to resume when something gets read from the queue. The task is "frozen" and the next task is scheduled
  • Nothing performs a read and the condition never gets a signal.
  • What then? The task ends up sitting in the task pool and never gets restarted probably? 🤔

I did try a simplistic experiment like this:

it 'reactifies' do
  reactor = Async::Reactor.new
  20.times do |i|
    reactor.run do |task|
      set_timer = task.reactor.after(0.1) { $stderr.puts "task #{i} blows up" }
      set_timer.cancel if i != 3
      $stderr.puts "Hello from task #{i}"
      task.yield
    end
  end
end

That does work; task 3 does print data. But if I raise an exception from the after block, the exception brings down the reactor (which makes sense if I understand Timers correctly, in that the timer is attached not at the task level but at the reactor level). There is also nothing quite like task.raise, which is probably a good thing since Thread#raise was long considered malpractice. But what else should be used in this case? I could manually sleep the task and do a time comparison when it gets woken up - even if that wakes the task more often than desired, it would let me preempt the task to do the timer bookkeeping.

Basically I need "some" way to forcibly terminate a task if there is no movement on the queue for that much time. Or some way to poll an object for a decision on whether the client should be disconnected or not - IMO if we poll for it once per second or even less often the impact on the reactor won't be immense. I might be overthinking it tho...

@ioquatix
Member

This is something I’ve thought about for a while.

If you call socket.read should that operation block indefinitely?

I think the answer is no. Especially not by default.

There should be some logical timeout, or at least a way to specify it explicitly per socket per operation.

Does the OS provide a timeout? If I make a read operation with no data will it still be waiting 100 years later?

A minimum throughput is a similar issue. We have to be careful to design a system that is actually robust against slow clients, ideally not allocating resources in a way which makes it trivial to DoS a service/system.

Mitigating at the queue level doesn't prevent malicious users, because there are other non-queue related areas of the protocol which can cause resource exhaustion.

So what I’d like to see is a wrapper around the socket or the stream buffer which handles this for the entire system. Ideally we can specify a policy, e.g. minimum bit rates and timeouts, and have it work across the entire system.
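Very roughly, something like this (just a sketch, assuming the socket is driven from an Async task that supports a per-operation timeout):

class TimeoutWrapper
  def initialize(socket, task, timeout: 30)
    @socket = socket
    @task = task
    @timeout = timeout
  end

  # Every operation is bounded by the same per-operation timeout;
  # a policy object could also enforce minimum bit rates here.
  def read(*arguments)
    @task.timeout(@timeout) { @socket.read(*arguments) }
  end

  def write(*arguments)
    @task.timeout(@timeout) { @socket.write(*arguments) }
  end
end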

@julik
Author

julik commented Dec 21, 2018

Yep, being able to set a timeout for each read and each write would be ideal. What I have effectively attempted with my polling solution is the write half of that. Moreover, if there is a way to set a timer that wakes and raises the fiber before calling write, and then to cancel that timer, we will have a workable solution. BTW, there is some fascinating thinking about cancelable tasks in http://joeduffyblog.com/2015/11/19/asynchronous-everything/ (in case you haven't seen it).

@julik
Author

julik commented Dec 21, 2018

Having investigated a bit, would this work? Specifically, will it "override" a Condition?

TimeoutOnWrite = Struct.new(:async_task, :writable, :timeout_s) do
  def write(data)
    async_task.timeout(timeout_s) { writable.write(data) }
  end
end

body = Async::HTTP::Body::Writable.new(content_length, queue: Async::LimitedQueue.new(8))
Async::Reactor.run do |task|
  body_with_timeout = TimeoutOnWrite.new(task, body, 3) # time out each write after 3 seconds?..
  # ...many times over, repeatedly etc
  body_with_timeout.write(data_from_upstream)
end

@ioquatix
Member

Unfortunately it's not sufficient.

It needs to be in the buffering/socket layer.

@julik
Author

julik commented Dec 21, 2018

Yep, tried that implementation, and though the timeout does fire it brings down the reactor (and the entire falcon process as well!)

I will revert to my less-than-ideal polling implementation for now.

@ioquatix
Member

You need to handle the timeout.

Something like

begin
	task.timeout do
		socket.write(...)
	end
rescue
	body.close
end

@ioquatix
Member

If that's the route you want to go down, even just temporarily, you should probably make a body wrapper with this behaviour. But as I said it's not a fully general solution.
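For example (a sketch only - the exact exception raised by the timeout depends on the async version, hence the bare rescue, as above):

class WriteTimeoutBody
  def initialize(body, task, timeout: 3)
    @body = body
    @task = task
    @timeout = timeout
  end

  def write(chunk)
    @task.timeout(@timeout) { @body.write(chunk) }
  rescue
    # The other side was too slow to drain the body - give up on it.
    @body.close
    raise
  end

  def close(*arguments)
    @body.close(*arguments)
  end
end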

@julik
Author

julik commented Jan 4, 2019

Happy 2019 @ioquatix and other socketry contributors!

We have deployed our falcon-based service as a canary and are observing the results. Meanwhile I am trying to figure out where the limits are regarding the number of clients, and how easy it is for falcon not to saturate the CPU but to "stuff the pipe". To that end I've implemented 3 simple "stuffer" webservers that generate one chunk of random data and then repeatedly send it over the wire to achieve a given content-length.

To eliminate the network issues from the equation I tested over loopback for now. The results are interesting.

Go with stuffer.go all default options

julik@nanobuk stuffer (master) $ time curl -v http://localhost:9395/?bytes=5861125462 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9395 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9395
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Connection: close
< Content-Length: 5861125462
< Date: Fri, 04 Jan 2019 12:28:46 GMT
< Content-Type: application/octet-stream
< 
{ [3953 bytes data]
100 5589M  100 5589M    0     0   720M      0  0:00:07  0:00:07 --:--:--  713M
* Closing connection 0

real	0m7.780s
user	0m2.575s
sys	0m4.542s

Falcon with async-io

julik@nanobuk stuffer (master) $ time curl -v http://localhost:9395/?bytes=5861125462 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9395 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9395
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 200
< connection: close
< server: falcon/0.19.6
< date: Fri, 04 Jan 2019 12:41:25 GMT
< content-length: 5861125462
< 
{ [16261 bytes data]
100 5589M  100 5589M    0     0   257M      0  0:00:21  0:00:21 --:--:--  260M
* Closing connection 0

real	0m21.739s
user	0m3.840s
sys	0m7.375s
julik@nanobuk stuffer (master) $ 

Puma with partial hijack and blocking write()

julik@nanobuk stuffer (master) $ time curl -v http://localhost:9395/?bytes=5861125462 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connection failed
* connect to ::1 port 9395 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 9395 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9395
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Connection: close
< Content-Length: 5861125462
< 
{ [16384 bytes data]
100 5589M  100 5589M    0     0   831M      0  0:00:06  0:00:06 --:--:--  842M
* Closing connection 0

real	0m6.742s
user	0m2.361s
sys	0m4.110s

The code is in the repo here: https://github.com/julik/stuffer

Unless I have really missed something, there is a roughly 3x overhead to these async bodies. Which sort of brings back my original question - is there a way, with the existing async-io model, for me to use the sockets directly and yield them back to the reactor if they would block? Or have a minimum size wrapper for this which would work with something like IO.copy_stream or NIO::ByteBuffer which both expect a real fd to be returned from #to_io?

@ioquatix
Member

ioquatix commented Jan 5, 2019

Without digging into it too much (dude I'm on holiday at the beach), I did a quick test of your code vs using the LimitedQueue and got a 3x perf increase on my old MBP laptop.

Here is your current implementation:

> time curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9292 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9292
> User-Agent: curl/7.63.0
> Accept: */*
> 
< HTTP/1.1 200
< server: stuffer/falcon
< connection: close
< content-length: 5861125462
< 
{ [16384 bytes data]
100 5589M  100 5589M    0     0   127M      0  0:00:43  0:00:43 --:--:--  121M
* Closing connection 0
curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null  2.36s user 3.76s system 13% cpu 43.888 total

Here is using LimitedQueue:

> time curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9292 (#0)
> GET /?bytes=5861125462 HTTP/1.1
> Host: localhost:9292
> User-Agent: curl/7.63.0
> Accept: */*
> 
< HTTP/1.1 200
< server: stuffer/falcon
< connection: close
< content-length: 5861125462
< 
{ [40960 bytes data]
100 5589M  100 5589M    0     0   310M      0  0:00:18  0:00:18 --:--:--  283M
* Closing connection 0
curl -v "http://localhost:9292/?bytes=5861125462" > /dev/null  2.41s user 3.76s system 34% cpu 18.027 total

@julik
Author

julik commented Jan 5, 2019

Without digging into it too much (dude I'm on holiday at the beach)

Man, I envy you - we are freezing here in the northern hemisphere 🥶 Enjoy your holidays ;-) I will do some experiments with the limited queue, I just need to find a stopgap measure for it so that I won't have socket starvation on the reading end (connect to us, read 1 block, then not read anything for a loooong time, all the while keeping the writing task asleep).

@ioquatix
Member

ioquatix commented Jan 5, 2019

I think a timeout at the socket level makes sense for almost all protocols.

I don't even know if the timeout can be reset, but something like this would be nice:

Task.timeout(60) do |timeout|
  body.each do |chunk|
    @stream.write(chunk)
    timeout.reset
  end
end

I wonder if @stream.write should take a flush option too; it might minimise the overhead of writing many chunks.

Ultimately the implementation might be better at a lower level. I'll think about it.

@ioquatix
Member

ioquatix commented Jan 6, 2019

@julik I'm playing around with some timeout/throughput concerns.

  • Do you think it makes sense to have per-socket timeout for any/all async operations? e.g. connect, send, recv, read, write, etc.

  • Do you think it makes sense to track throughput and disconnect if less than some minimum?

  • Do you have any other ideas about how this should work?

I'm thinking that socket timeouts for async operations belong directly in the socket wrapper and apply to all operations.

I also think there are higher level concerns about what constitutes a bad client... but not sure how generally these can be applied or if they should be protocol specific.

I know that slow clients as part of a DoS might continue to consume 1 byte per second. But does that really matter? If you put in low-throughput logic to disconnect sockets, DoS clients can just consume slightly above that watermark. So, can such an approach really protect against malicious clients, or are we just trying to disconnect users who have broken the request upstream somehow (i.e. stopped reading the response)?

@ioquatix
Member

ioquatix commented Jan 6, 2019

The other question is, should we have a timeout by default? It seems a bit silly to me that sockets can block code indefinitely.

@julik
Author

julik commented Jan 6, 2019

TL;DR:

Do you think it makes sense to have per-socket timeout for any/all async operations? e.g. connect, send, recv, read, write, etc.

Yes.

Do you think it makes sense to track throughput and disconnect if less than some minimum?

Yes, or provide a way to do so (hook into NIO)

Do you have any other ideas about how this should work?

At the minimum - two flags on the falcon executable that would set minimum throughput barriers for reading and writing. They could be default-"off" but you do need them.

All good questions. It is ofc to be debated whether it is possible to protect against both slow loris and slow read attacks completely. You could say that it is impossible, just as it is very hard to protect from a high-volume attack. But just like bike locks I think making the attacks less convenient to carry out is a good step to take. Another reason why I think this is especially relevant for Falcon is that from what I understand Falcon is aiming to be the webserver, without a fronting downstream proxy like nginx - which in today's setups generally does a good job of dealing with these attack types. But falcon is also supposed to do SSL termination from what I understand (because HTTP/2 and all), and in general it seems it is aiming to become the server for a machine providing a Ruby web application.

So IMO setting at least _basic_ limits to protect people from slow HTTP attacks is in scope for falcon, yes. How it should be configurable I don't know, but I would say a certain number of bytes must flow through the pipe per second over that many seconds (a window average). If this transfer rate is not maintained then the client should be forcibly disconnected. This applies to both reading the HTTP request (slow loris attack) and writing the response (slow read attack). So if you ask me, IMO yes, you do need a timeout by default - at least when you are not explicitly in websocket mode, where a connection might be sleeping for minutes on end. I am not aware of attacks with "slow connect" but probably there are some 🤷‍♀️

I believe puma does not have slow loris protection but it reads the request using its own IO reactor, so it probably relies on the "we can handle many many clients" property of IO reactors for this. For writing Puma is susceptible to slow read as one response consumes a thread. It is probably less severe for falcon due to the intrinsic fact that falcon is one big IO reactor but the max fd limit on the server does become a concern.

That is the "transport" end of the problem, for which there probably should be configurable timeouts on the webserver level (maybe even config options for falcon itself).

In terms of IO - yes, I do believe you want to have configurable timeouts for all reads and writes, simply because if you do not have them, say, in your HTTP client, it means you can only make requests to trusted HTTP servers, as you need to assume that the endpoint will not "hang you up" indefinitely. It is less of a problem with HTTP endpoints being "adversarial" (though it can be if you do web scraping, for example - there it is a concern!); it can be a problem with endpoints being badly coded. For example, there is an SSE endpoint in LaunchDarkly which is currently used via async-http. It is designed to send a "ping" message every now and then to keep the connection alive - and it is all good as long as this works. But what if it just gives you an EAGAIN once and does not come up in the NIO reactor monitor list for 2 hours after? The calling code currently has to manage this and arrange reconnects, if I'm not mistaken. Maybe it is even a feature that belongs in NIO.

For our uses without async-io we opted for configuring libCURL with certain timeouts we know are sane for our uses, and we use both the connect timeouts and the throughput gate (the endpoint must furnish that many bytes within that much time otherwise we bail out).

Regarding the service I am testing falcon on - it is more of an issue protecting from homegrown hand-rolled download managers that open a connection for a bytes=..-.. range of content but do not read it, or do not start reading it in time, or opportunistically open many connections using Range headers in the hopes that they will obtain better download speeds that way (which they won't, but they do consume a connection per range).

I also think there are higher level concerns about what constitutes a bad client... but not sure how generally these can be applied or if they should be protocol specific.

I don't know. I do feel that if there is, intrinsically, a pair of objects servicing a particular client (task + writable) there should be a way to exercise at least "some" push control over the reading client socket - to use these objects to "kick" the client out. If these objects wait on a condition variable for the client to have done something in the first place (if it is always reactive) then this becomes pretty hard to do.

With the current setup what bothers me the most is that I don't know whether falcon will time out a socket in a slow read situation, and if it will - where do I configure the timeout. Cause I sure do need that feature (or I need to investigate the bizarrio nginx documentation to figure out all the options that will, at the same time, protect me from these attacks and not buffer too much response along the way).

@julik
Author

julik commented Jan 7, 2019

For later - here is how many of these "enforce throughput" exceptions we have noticed since we started running the test code (last Thursday):

[screenshot: slowloris_count]

Meanwhile, I am going to try to integrate this with the limited queue by putting a few conditionals around the write of the body.

@ioquatix
Member

I've released async v1.13.0 which changed the internal timeout API to raise the Async::TimeoutError exception and I'm preparing an update to async-io which includes a per-socket timeout_duration:

https://github.com/socketry/async-io/blob/e9e7c268324002dc9e4db0f18a93bc4a0a26b38b/spec/async/io/socket_spec.rb#L87-L100

I'm not sure where the best place to set timeout_duration is, but perhaps the endpoint or accept_each could do it as an option.

This should catch connections that simply stop responding.

It won't catch connections that are maliciously slow though.

For that, we need more advanced throughput calculations.

@julik thanks so much for all your detailed feedback, it's really really awesome.

@julik
Author

julik commented Jul 1, 2019

Nope, what is returned from the Rack middleware - the code under test - then gets captured by rack-test and fed into a Rack MockResponse. MockResponse will call to_str if it is available to convert the response body into a string one can do assertions on etc.

If I do this

  class ResponseBody < Async::HTTP::Body::Writable
    def to_str
      Async do
        join
      end.wait
    end
  end

the effect is then the same as just using a small LimitedQueue - my tests block and hang.

@julik
Author

julik commented Jul 1, 2019

But I think I can live with the VW-workaround using a larger queue size for the tests for now at least.

@ioquatix
Member

ioquatix commented Jul 1, 2019

Try adding include_context Async::RSpec::Reactor to your specs.

Rack::Test should be okay... when you call get "url" it traverses your middleware. Is your middleware invoking Async::Internet to generate a response? Or does it have some kind of loop to generate response data?

@ioquatix
Member

ioquatix commented Jul 1, 2019

Do you think 0x1FFFF makes sense for a default stream buffer size?

I'm just trying to think of something that is going to allow the network to be saturated without using lots of MB per stream.

@ioquatix
Member

ioquatix commented Jul 1, 2019

Actually, it turns out to be the window size for the entire connection, as well as per stream, with HTTP/2. Well, I'm still not sure, but clearly setting it to 2GB was wrong :p

@ioquatix
Member

ioquatix commented Jul 1, 2019

I've released async-http v0.46.3 which includes HTTP/2 back-pressure streaming. That should fix the issues you had with S3 using SSL (HTTP/2 probably negotiated).

The defaults I chose for both the HTTP/2 client and server are a 1MB chunk size and a 16MB maximum flow-control window. Provided you keep on consuming the data you should never hit this limit, but if you make a connection to S3 using HTTP/2 and don't read the response body, it won't update the flow-control window now, and that will stop S3 from sending more data.

Regarding everything else, for sending the data to the client, there are two options:

  • Feed the response body of S3 directly to the client response body if that's possible it will be optimal for buffering.

  • If you need to do processing, make a task around the response body that reads chunks, does the process, and writes to a Writable body with a limited queue, probably limited to 2-4 chunks. It depends on how much latency you want to deal with. The point of the limited queue is to tie your processing pipeline into the back-pressure streaming of the upstream response body.

Using an appropriate queue size probably solves the other problems you were having. Otherwise flow control won't help: given the HTTP/2 frame size limits, a queue of 1024 chunks could store a GB of data.
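The second option, as a minimal sketch (upstream_response and process are illustrative names):

output_body = Async::HTTP::Body::Writable.new(content_length, queue: Async::LimitedQueue.new(4))

Async do |task|
  while chunk = upstream_response.body.read
    # #write parks this task when the queue is full, which in turn stops reading
    # from the upstream - that is the back-pressure link.
    output_body.write(process(chunk))
  end
ensure
  output_body.close
end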

@ioquatix
Member

ioquatix commented Jul 1, 2019

There was a small bug in v0.9.0 so I've released v0.9.1 of protocol-http2.

Also, here is how you should use rack-test with asynchronous response bodies:

https://github.com/socketry/async-http/blob/master/spec/rack/test_spec.rb

@julik
Author

julik commented Jul 2, 2019

Thanks, this looks like a useful suggestion. I do encounter issues though with this testing approach. For example, it seems that the context inclusion adds assertions that verify socket presence which the test has no business touching:

dispatches the webhook for a complete download 0.08054s
expected: == [#<IO:<STDERR>>, #<IO:<STDOUT>>, #<IO:<STDIN>>]
     got:    [#<Redis::Connection::TCPSocket:fd 9>, #<IO:<STDERR>>, #<IO:<STDOUT>>, #<IO:<STDIN>>]

as well as

expected: == [#<Redis::Connection::TCPSocket:fd 9>, #<IO:<STDERR>>, #<IO:<STDOUT>>, #<IO:<STDIN>>, #<File:/var/folders/cm/42x9327j4gxb_l1p5bp9q9tw0000gn/T/zip-readback20190702-33720-1u51xfy (closed)>]
     got:    [#<Redis::Connection::TCPSocket:fd 9>, #<IO:<STDERR>>, #<IO:<STDOUT>>, #<IO:<STDIN>>, #<File:/var/folders/cm/42x9327j4gxb_l1p5bp9q9tw0000gn/T/zip-readback20190702-33720-1nwds5n>]

These are very cryptic (one array of file descriptors is compared to... another array of file descriptors? what does it mean? does the context assume only the Async-managed IOs in the system may remain opened or referenced anywhere?)

Also some tests fail, even though I know for a fact that they pass - so it seems the assertions fire earlier than they should (once the reactor has completed). For instance we have a test that ensures that if the upstream closes the connection before we could read the entire response, we perform a subsequent request and are then able to retrieve the remaining data in that request. That test fails if I include the context, and passes if I remove the context.

I've released async-http v0.46.3 which includes HTTP/2 back-pressure streaming. That should fix the issues you had with S3 using SSL (HTTP/2 probably negotiated).

Awesome - so if I update async-http will it fix the overfetching with HTTP plain, HTTP/2 or both?

The defaults I chose for both HTTP/2 client and server is 1MB chunk size and 16MB maximum flow control window. Provided you keep on consuming the data you should never hit this limit, but if you make a connection to S3 using HTTP/2, and don't read the response body, it won't update the flow control window now, and that will stop S3 from sending more data.

If we want to service, say, 1200 clients on one box (not unheard of when we have high load windows) this amounts to roughly 19GB of memory in total - if every client holds a connection to S3. This is... available on the box but seems really excessive.

Regarding everything else, for sending the data to the client, there are two options:
Feed the response body of S3 directly to the client response body if that's possible it will be optimal for buffering.

That is not an option since we do response splicing. The suggestion to read chunks is what we do now effectively.

Using an appropriate queue size probably solves the other problems you were having.

What is the simplest way to size this based on a given number of clients I want to service and the amount of RAM on the box, given the fact that every client only fetches from 1 source upstream at the same time, and that most clients are likely to be slow-ish?

@ioquatix
Member

ioquatix commented Jul 2, 2019

Just on the first point, it means your tests are leaking file descriptors.

@ioquatix
Member

ioquatix commented Jul 2, 2019

Awesome - so if I update async-http will it fix the overfetching with HTTP plain, HTTP/2 or both?

HTTP/1 was already okay, it was HTTP/2 that had some problems. But I don't know if it was affecting you or not. I just checked, and it doesn't seem like S3 supports HTTP/2. So this might not have been an issue for you, but it was a problem and it's fixed now. Back-pressure is very important to ensure buffering works effectively.

Also some tests fail, even though I know for a fact that they pass.

You'll need to share some repros.

If we want to service, say, 1200 clients on one box (not unheard of when we have high load windows) this amounts to 19GB of memory per client - if every client holds a connection to S3. This is... available on the box but seems really excessive.

1200 clients is pretty small to be honest. You should be able to easily serve that with the defaults. If you are doing significant processing, it will depend on the total data rate. You can probably support several fast clients vs many slow clients, etc.

Each CPU core should be able to serve up to ~100,000 clients, depending on the processing overhead. HTTP/1 means that each connection requires two file descriptors - i.e. the outgoing connection to S3 and the incoming connection from the client. So, just double check you have enough file descriptors per process.

In my testing, async-io can accept between 1000-2000 connections per second per process as a rough guide.

What is the simplest way to size this based on a given number of clients I want to service and the amount of RAM on the box, given the fact that every client only fetches from 1 source upstream at the same time, and that most clients are likely to be slow-ish?

We could probably implement a dynamically sized queue based on the flow. Pending that implementation, just ensure your queue is set to 2-4 chunks. Everything else will scale up relative to available CPU. Probably your biggest issue so far was using a limited queue with size 1024.

@ioquatix
Member

ioquatix commented Jul 2, 2019

The smaller the queue, the less memory used per client. It might not even be stupid to try a queue with size = 1.

@julik
Author

julik commented Jul 2, 2019

Okay, I'm going to update async-http, set a small queue length and prepare for a new round of testing.

@julik
Author

julik commented Jul 11, 2019

This is becoming the longest issue in the known history of GH. But I'm super happy we are making progress. So I updated async-http and made another round of tests.

This is the perf profile of the server during testing:

[Screenshot 2019-07-09 at 19:08:17]

This is still with plain HTTP URLs, no SSL at play. I have adjusted the LimitedQueue length to 4 in production. What is remarkable:

  • The memory use is no longer a sawtooth. It does however gobble up a ton of it, and I doubt the amount is... justified. We were servicing about 600 connections during the peak time before I started draining the server.
  • The amount of memory in use did not decrease during connection draining. This can be attributed to the fact that we don't have Hongli's malloc_trim patch yet and there are many dead objects in the heap. I do feel like there is string churn though.
  • We don't leak sockets anymore, great!
  • The network throughput is near-symmetrical, meaning we are not choking on data going in, at least in theory (there is some "temporal delta" between peaks on the throughput which I can chalk up to slow clients being connected)

The throughput I could get from it is disappointing however. On Puma, on these servers, we get up to 400MB/s both ways. Here is the same graph surrounded with our usual periods of running Puma:

[Screenshot 2019-07-11 at 03:07:17]

So Falcon starts at our usual throughput that I would expect, and then the throughput decays to substantially lower values. Here is that moment on the graphs:

[Screenshot 2019-07-11 at 03:09:53]

It does indeed peg the cores, and that is fine, but I do expect the throughput to go higher. Also the memory consumption is quite extreme. What I do suspect:

  • There is still some overbuffering somewhere. Maybe I should take the limit queue size to 1 or 2 instead of 4?
  • How many workers should I run? I can roughly size pumas for the amount of connections we handle, but how do I size Falcon if we assume they will all select every time?
  • This might sound preposterous, but if I had something like a very small HTTP client which basically holds a socket - can I pass that socket from an Async task to the reactor and have it resume me? I still suspect the Async::HTTP stuff might be too smart for what I am using it for, which might be coming at a cost
  • Probably I am still not sold on read #=> String without buffer reuse. Normally you could allocate one String object for servicing the entire response (see the sketch below). Maybe this is not a nice model with a LimitedQueue, but such a LimitedQueue could also be a ring buffer? It does go into "manual memory management" territory, but what if it can be massively more efficient for this use case?
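For illustration, the kind of buffer reuse I have in mind, with plain Ruby IO (upstream_io and client_io are placeholders): one String is allocated up front and refilled on every read.

buffer = String.new(capacity: 64 * 1024)

begin
  loop do
    upstream_io.readpartial(64 * 1024, buffer) # refills the same String
    client_io.write(buffer)
  end
rescue EOFError
  # upstream finished
end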

@ioquatix
Member

Thanks for the update it's super helpful, and I'm glad we are making progress.

You are right the memory usage does seem high.

I found there was an issue with how the HTTP/1 fixed response body and async-io buffering were interacting.

I added some specs for throughput testing, and while more work is required, I did get some promising results for a simple server/client running in the same process using TCP sockets:

Data size: 4096MB Duration: 6.48s Throughput: 632.29MB/s

This is on a fairly crappy MacBook Pro, so I suspect you can get more on a chunky server.

Can you also confirm one thing for me, what is the content type of the response body?

Because falcon will opportunistically compress it using deflate if it looks like a good candidate for compression.

@ioquatix
Member

One more question, to help me know what we are aiming for - what is the maximum theoretical throughput? I'm assuming you have at least 10 GbE, or do you have bonded GbE? Or something else?

@ioquatix
Member

Sweeping IO size, reading 4GB of data, given a specific chunk size:

MAXIMUM_READ_SIZE=1024 414.55MB/s
MAXIMUM_READ_SIZE=2048 695.8MB/s
MAXIMUM_READ_SIZE=3072 577.36MB/s
MAXIMUM_READ_SIZE=4096 984.5MB/s
MAXIMUM_READ_SIZE=5120 536.2MB/s
MAXIMUM_READ_SIZE=6144 539.59MB/s
MAXIMUM_READ_SIZE=7168 522.02MB/s
MAXIMUM_READ_SIZE=8192 1153.69MB/s
MAXIMUM_READ_SIZE=9216 566.14MB/s
MAXIMUM_READ_SIZE=10240 647.54MB/s
MAXIMUM_READ_SIZE=11264 647.57MB/s
MAXIMUM_READ_SIZE=12288 590.49MB/s
MAXIMUM_READ_SIZE=13312 1395.04MB/s
MAXIMUM_READ_SIZE=14336 544.59MB/s
MAXIMUM_READ_SIZE=15360 832.32MB/s
MAXIMUM_READ_SIZE=16384 1657.35MB/s
MAXIMUM_READ_SIZE=17408 1656.59MB/s
MAXIMUM_READ_SIZE=18432 1683.01MB/s
MAXIMUM_READ_SIZE=19456 1687.31MB/s
MAXIMUM_READ_SIZE=20480 680.4MB/s
MAXIMUM_READ_SIZE=21504 1423.78MB/s
MAXIMUM_READ_SIZE=22528 684.07MB/s
MAXIMUM_READ_SIZE=23552 611.82MB/s
MAXIMUM_READ_SIZE=24576 1762.6MB/s
MAXIMUM_READ_SIZE=25600 715.83MB/s
MAXIMUM_READ_SIZE=26624 1742.94MB/s
MAXIMUM_READ_SIZE=27648 1734.43MB/s
MAXIMUM_READ_SIZE=28672 1771.98MB/s
MAXIMUM_READ_SIZE=29696 619.41MB/s
MAXIMUM_READ_SIZE=30720 635.69MB/s
MAXIMUM_READ_SIZE=31744 780.36MB/s
MAXIMUM_READ_SIZE=32768 1855.79MB/s

Then, power of 2 going up further:

MAXIMUM_READ_SIZE=32768 1864.85MB/s
MAXIMUM_READ_SIZE=65536 1955.68MB/s
MAXIMUM_READ_SIZE=131072 2005.4MB/s
MAXIMUM_READ_SIZE=262144 2039.5MB/s
MAXIMUM_READ_SIZE=524288 2060.61MB/s
MAXIMUM_READ_SIZE=1048576 2078.26MB/s
MAXIMUM_READ_SIZE=2097152 2077.0MB/s
MAXIMUM_READ_SIZE=4194304 1982.92MB/s
MAXIMUM_READ_SIZE=8388608 2035.61MB/s

Obviously such a result is much more nuanced and depends on a whole ton of factors, but at least it's interesting to see that page-aligned buffers work much better and anything beyond 128KB seems pointless.

@ioquatix
Member

Using async-io 1.23.x:

(HTTP/1.0) Data size: 1024MB Duration: 21.61s Throughput: 47.39MB/s
(HTTP/1.1) Data size: 1024MB Duration: 11.27s Throughput: 90.88MB/s
(HTTP/2.0) Data size: 1024MB Duration: 9.76s Throughput: 104.9MB/s

Using async-io 1.24.x (constrained maximum read size to 8MB):

(HTTP/1.0) Data size: 1024MB Duration: 1.63s Throughput: 628.71MB/s
(HTTP/1.1) Data size: 1024MB Duration: 1.57s Throughput: 654.01MB/s
(HTTP/2.0) Data size: 1024MB Duration: 9.11s Throughput: 112.46MB/s

HTTP/2.0 performance is kind of to be expected since there is a ton of data framing required. Maybe it can be optimised in pure Ruby. More than likely, some kind of writev operation is required.

@ioquatix
Member

ioquatix commented Jul 13, 2019

If you want to reduce the size of the stream buffer, you can specify the ASYNC_IO_MAXIMUM_READ_SIZE environment variable (undocumented, experimental). On Linux, 65k seems pretty decent, but it will depend on the kind of I/O latency you are dealing with. For your use case, you could try 1MB - just make sure it's page aligned (e.g. 1024*1024).

https://github.com/socketry/async-io/blob/df3a6b015c938ec43defb57ccd4af7c547f7cb0e/lib/async/io/generic.rb#L26-L30

@ioquatix
Member

I've added throughput specs to async-io and async-http so hopefully any regressions would be detected.

I've released these changes in async-http v0.46.4.

@ioquatix
Member

So, I've been trying to get a handle on what the expected throughput should look like.

In my current benchmarks on Travis (not exactly representative, I know), transferring 120KB, falcon was outperforming puma by a decent amount, and the above patch increased the margin a bit.

Puma (120KB response body)

Running 2s test @ http://127.0.0.1:9292/big
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.33ms  185.70us   3.97ms   91.80%
    Req/Sec   753.55     28.82   787.00     90.00%
  3001 requests in 2.00s, 345.71MB read
Requests/sec:   1500.20
Transfer/sec:    172.82MB

Falcon (async-io 1.23, 120KB response body)

Running falcon with 2 concurrent connections...
Running 2s test @ http://127.0.0.1:9292/big
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   800.67us  622.12us   9.04ms   95.26%
    Req/Sec     1.36k   409.46     1.90k    51.22%
  5533 requests in 2.10s, 633.43MB read
Requests/sec:   2635.67
Transfer/sec:    301.74MB

Falcon (async-io 1.24, 120KB response body)

Running 2s test @ http://127.0.0.1:9292/big
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   729.23us  436.64us  12.45ms   96.29%
    Req/Sec     1.43k    32.54     1.47k    85.71%
  5984 requests in 2.10s, 685.17MB read
Requests/sec:   2849.30
Transfer/sec:    326.25MB

However, this isn't really representative of your situation. So, I changed the test a bit to see how it performs, increasing the body size to 1000 chunks of 12KB.

Puma (12MB response body)

Running puma with 2 concurrent connections...
Running 2s test @ http://127.0.0.1:9292/big
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    14.13ms    0.86ms  19.16ms   91.70%
    Req/Sec    70.49      5.45   100.00     90.24%
  289 requests in 2.10s, 3.24GB read
Requests/sec:    137.66
Transfer/sec:      1.54GB

Falcon (async-io 1.24, 12MB response body)

Running 2s test @ http://127.0.0.1:9292/big
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    23.28ms    1.08ms  30.57ms   87.78%
    Req/Sec    42.86      4.57    50.00     71.43%
  180 requests in 2.10s, 2.01GB read
Requests/sec:     85.76
Transfer/sec:      0.96GB

Because I didn't consider this use case before, I never investigated it. Falcon is 2/3 the throughput of Puma in this micro-benchmark. I believe it's possible to improve on this, but I'm not sure if it's going to make any practical difference in real-world scenarios. So, I pass this back to you - can you please update to the latest async-http and give me your report?

@ioquatix
Member

Regarding memory per client, you could expect a maximum of ASYNC_IO_MAXIMUM_READ_SIZE * queue depth. But, for network IO, it should be much less. You could probably set ASYNC_IO_MAXIMUM_READ_SIZE to 64KB as suggested.
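For example, with the suggested 64KB read size and a queue depth of 4 chunks, that upper bound is 64KB × 4 = 256KB of buffered data per streaming client, so roughly 300MB for 1200 concurrent clients.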

@julik
Author

julik commented Jul 13, 2019

64-65K seems best as on Linux this is the size of the fd buffer for non-blocking IO. I think your numbers hint at it being the optimum as well, because then your reads are aligned with the tick-tock nature of the socket buffer in the OS.

ASYNC_IO_MAXIMUM_READ_SIZE * queue depth

What is the queue depth in this case? Number of responses currently being served, number of connected clients which are past accept, number of connected clients both in response serving phase and in the accept phase etc.? Is there a metric output of it somewhere I could plot?

One more question, to help me know what we are aiming for - what is the maximum theoretical throughput? I'm assuming you have at least 10 GbE, or do you have bonded GbE? Or something else?

I believe 400MB to 500MB per second is the maximum. Neither of the two - it is an EC2 instance which is serving out to the internet (via CloudFront, so CloudFront pulls from the instance), so it is roughly give or take half of a 10GbE. This matches numbers in https://cloudonaut.io/ec2-network-performance-cheat-sheet/ which show 4.97Gb baseline for our instance type. Have you tested falcon on 2 machines connected via metal 10GbE by any chance?

What I am looking for is that I can let many more clients use the server with their speed being reduced, which is the issue we are facing with Puma at high load - there are lots of slow clients, and we would like to be able to accommodate them without the server melting.

Can you also confirm one thing for me, what is the content type of the response body?
Because falcon will opportunistically compress it using deflate if it looks like a good candidate for compression.

I believe we force the content type of the output to binary/content-stream - also to prevent any kind of "smart" treatment of the served content by the browsers. Maybe at some point I will be implementing multiple range responses as well - there is nothing preventing us from doing it, it is just involved as it is already.

The actual file types we serve are highly varied, and we also support Range responses on our end - so a client can request a Range of a virtual response from us, which we then transform in an intersection of Range requests to the upstream files + some generated chunks of data. We have a high number of file types, and we also serve them in stored mode inside ZIPs - so we offer Range support over on-the-fly created ZIP files. The less clever my webserver tries to be in a situation like this the better. Also most of the file corpus we deal with (and this is tens of millions per day) are files like MP4 and JPEG - there is zero gain from trying to opportunistically compress them in any phase of the process.

So, I pass this back to you, can you please update latest async-http and give me your report?

Certainly! Making changes now.

@julik
Author

julik commented Jul 13, 2019

I believe it's possible to improve on this, but I'm not sure if it's going to make any practical difference in real-world scenarios.

If the cap is on the total throughput achievable regardless of the number of clients then this would be an issue, since to have the performance we have now we then would have to scale up the fleet by 33% - and it is already quite large. If the cap is for one client then it should be fine as you would be hard pressed to find a downloader who could saturate 5Gb of the lane in one swoop.

@julik
Author

julik commented Jul 13, 2019

Yess! So far it seems to be reaching Puma levels, and adjusting the buffer size and updating async-http did do something magical. The RAM use is growing slowly but not nearly as explosively as before. I will let it run for a few hours I think and see where we end up.

[Screenshot 2019-07-13 at 16:51:06]

@julik
Author

julik commented Jul 13, 2019

At the rightmost point of the graph it is handling ~390 clients.

@julik
Author

julik commented Jul 13, 2019

Ok, the joy was short lived. It worked for a while and then... it suffocated itself it seems. The log was filled with ReadTimeout errors (see the code I've mailed you for where this gets raised). So while the memory use was fine, it looks like something is going on with multiplexing HTTP clients. At the time the values drop it was receiving normal traffic.

The throughput dropped near-instantly (from normal to 3-4MB per second), CPU load as well.

[Screenshot 2019-07-13 at 20:40:02]

Around 18:40 I started draining the server.

I think I am sold on Falcon but I need a dumber HTTP client that integrates with it. Very basic: 1 socket per client outbound, reopen socket on EOF/timeout/EPIPE. No HTTP2, no compression, no multiplexing. How do I make one or which one can I use? Should I just yield a socket from the task?

@ioquatix
Member

Async::HTTP is not complicated and is delivering on most of what you want. You could use individual connects to avoid the pool but then you don’t get any benefits of connection reuse which are quite decent in practice. Sounds like just a few more issues to nail. I’ll check the logged errors.

@ioquatix
Member

The log was filled with ReadTimeout errors (see the code I've mailed you for where this gets raised). So while the memory use was fine, it looks like something is going on with multiplexing HTTP clients. At the time the values drop it was receiving normal traffic.

Can you send me the log details and related code?

@julik
Author

julik commented Jul 24, 2019

@ioquatix Given the last flurry of updates to the libraries - should I do another round of tests?

@ioquatix
Member

@julik that would be awesome!
