Slower performance compared to curio #1595

rthalley opened this issue Jun 9, 2020 · 10 comments

rthalley commented Jun 9, 2020

I've used trio in some projects lately, and also recently added it to dnspython. As part of some random performance testing, I compared "minimal DNS server" performance between ordinary Python sockets, curio, and trio. I'll attach the trio program to the bottom of this report. Trio performed significantly worse; e.g., on a particular Linux VM:

Regular python I/O: 9498 QPS
Curio: 8979 QPS
Trio: 5007 QPS

I've seen similar behavior on a Mac:

Regular python I/O: 8359 QPS
Curio: 8061 QPS
Trio: 3425 QPS

I'm using dnsperf to generate load, and it does a good job of keeping UDP input queues full, so the ideal expected behavior if you strace the program is to see a ton of recvfrom() and sendto() system calls, and nothing else. In particular, you don't expect to see any epoll_wait(). Ordinary python I/O and curio behave as expected, but going through the loop in trio looks like:

recvfrom()
clock_gettime()
epoll_wait() (not waiting on any fds if I'm reading strace output correctly)
clock_gettime()
clock_gettime()
epoll_wait()
clock_gettime()
sendto()
clock_gettime()
epoll_wait()
clock_gettime()

I don't understand trio's internals enough to debug this further at the moment, but I thought I would make a report, as this seems like excessive epolling.

I haven't run dtruss on the Mac, but Python tracing indicated a lot of time related to kqueue.

This was with trio 0.15.1, and CPython 3.8.3 and 3.7.7.

import socket
import trio
import trio.socket

import dns.message
import dns.rcode

async def serve():
    with trio.socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        await s.bind(('127.0.0.1', 5354))
        while True:
            (wire, where) = await s.recvfrom(65535)
            q = dns.message.from_wire(wire)
            r = dns.message.make_response(q)
            r.set_rcode(dns.rcode.REFUSED)
            await s.sendto(r.to_wire(), where)

def main():
    trio.run(serve)

if __name__ == '__main__':
    main()

rthalley commented Jun 9, 2020

P.S. I enjoy trio, keep up the good work!


rthalley commented Jun 9, 2020

Also, if you want to generate the load, create a file called input with one line

www.example A

and then run

dnsperf -d input -p 5354 -l 10

rthalley commented Jun 9, 2020

Note that I wasn't reading the strace output correctly regarding whether epoll matched any fds or not; I just don't know. If there's anything you want me to test, let me know :)

rthalley commented Jun 9, 2020

One last comment today :)

So, I read some trio code, and while I don't fully understand the details, it seems to me that the difference between curio and trio here is that curio only schedules if the system call would block, whereas trio introduces a schedule point after every I/O call. The curio approach can starve other tasks, since a sufficiently busy task might never block, whereas the trio approach can result in excessive scheduling and significantly lower throughput. This is a tough trade-off, and it's hard for a library to decide what's best. In the curio-style world, the application can count how many loops it has done and sleep(0) every (say) 100 calls, avoiding a lot of unhelpful scheduler overhead while still not starving other tasks; but if you forget, it can be bad.
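
For illustration (this sketch is not from the thread), the curio-style pattern described above might look something like the following, assuming a curio UDP socket named sock; curio's socket methods only reschedule when the call would actually block, so the application yields explicitly every n iterations:

import curio

async def echo_serve(sock, n=100):
    count = 0
    while True:
        # In curio, these only reschedule if the call would block.
        wire, where = await sock.recvfrom(65535)
        await sock.sendto(wire, where)
        count += 1
        if count % n == 0:
            # Explicit schedule point every n packets so other tasks can run.
            await curio.sleep(0)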

I wonder if there could be a scoped I/O mode in trio where, within the scope, you would schedule if you had to (i.e. the system call would block), but otherwise would NOT schedule until you'd hit some specified count of schedule points.

e.g.

async with trio.defer_schedule_limit(100):
    await io()

This would allow applications that want to be more aggressive with I/O to do so, while still letting the default behavior be safe.


smurfix commented Jun 10, 2020

That'd be useful. Alternatively, we could use a time limit (or the perf counter), which would be more deterministic for the rest of the code.

njsmith added a commit to njsmith/trio that referenced this issue Jun 15, 2020
Profiling a user-written DNS microbenchmark in python-triogh-1595 showed that UDP
sendto operations were spending a substantial amount of time in
_resolve_address, which is responsible for resolving any hostname
given in the sendto call. This is weird, since the benchmark wasn't
using hostnames, just a raw IP address.

Surprisingly, it turns out that calling getaddrinfo at all is also
quite expensive, even if you give it an already-resolved plain IP address
so there's no work for it to do:

    In [1]: import socket

    In [2]: %timeit socket.getaddrinfo("127.0.0.1", 80)
    5.84 µs ± 53.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

    In [3]: %timeit socket.inet_aton("127.0.0.1")
    187 ns ± 1.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I thought getaddrinfo would be effectively free in this case, but
apparently I was wrong. Also, this doesn't really matter for TCP
connections, since they only pass through this path once per
connection, but UDP goes through here on every packet, so the overhead
adds up quickly.

Solution: add a fast path to _resolve_address for when the user's address is
already resolved, so we skip all the work. On the benchmark in python-triogh-1595
on my laptop, this PR takes us from ~7000 queries/second to
~9000 queries/second, so a ~30% speedup.

The patch is a bit more complicated than I expected. There are three
parts:

- The fast path itself

- Skipping unnecessary checkpoints in _resolve_address, including when
  we're on the fastpath: this is an internal function and it turns out
  that almost all our callers are already doing checkpoints, so there's
  no need to do another checkpoint inside _resolve_address. Even with
  the fast checkpoints from python-triogh-1613, the fast-path+checkpoint is still
  only ~8000 queries/second on the DNS microbenchmark, versus ~9000
  queries/second without the checkpoint.

- _resolve_address used to always normalize IPv6 addresses into
  4-tuples, as a side-effect of how getaddrinfo works. The fast path
  doesn't call getaddrinfo, so the tests needed adjusting so they
  don't expect this normalization, and to make sure that our tests for
  the getaddrinfo normalization code don't hit the fast path.

njsmith commented Jun 15, 2020

👋 Hello!

Trio is definitely not yet as fast as it could be, or as we want it to be. Partly this is following the old rule "first make it right, then make it fast", and partly it's because it's hard to optimize things well without good benchmarks to profile. So this benchmark is super helpful :-)

Trio does do more schedules and IO polls than curio, but I'm not sure how much that actually contributes to what you're seeing. Running your benchmark through py-spy, the first thing that jumped out at me is that we were spending a tremendous amount of time in getaddrinfo, which doesn't make any sense, since your program doesn't use any hostnames...

It turns out that the issue was that when you call trio_socket.sendto(..., address), Trio picks apart address and resolves any hostnames it finds before it calls the underlying Python socket.sendto. If it didn't do that, and just passed the address through directly, then socket.sendto would implicitly do a blocking DNS lookup for any hostnames and block the whole program. So the basic idea is good... but we were assuming that calling getaddrinfo on an already-resolved literal IP address was basically free, and it turns out that that's false (at least on Linux). So that's fixed in #1617 by adding a fast path that skips calling getaddrinfo entirely whenever we can.

(Asyncio has a similar optimization already. Curio doesn't try to resolve hostnames in addresses at all; it just passes through whatever address you use directly to the stdlib socket module. Which of course is the fastest thing, but it means that users just have to know to manually resolve their hostnames before calling into curio socket methods, and if you forget it will seem to work but you'll get random stalls.)
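
For illustration, the fast-path check is roughly shaped like this (a simplified sketch, not the actual code in #1617; the helper name is made up): if the host part already parses as a numeric address for the socket's family, there is no hostname to resolve and getaddrinfo can be skipped entirely.

import socket

def _host_is_numeric(host, family):
    # Hypothetical helper, not trio's real implementation: True if `host` is
    # already a literal IP address for `family`, so getaddrinfo is unnecessary.
    try:
        socket.inet_pton(family, host)
        return True
    except OSError:
        return False

# _host_is_numeric("127.0.0.1", socket.AF_INET)   -> True  (take the fast path)
# _host_is_numeric("example.com", socket.AF_INET) -> False (fall back to getaddrinfo)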

I don't know how much of the gap between curio/sync python that fixes, but it might be interesting to re-run your benchmarks on the latest Trio checkout and see :-).

After fixing that, the next big thing that shows up in the profile for me is the complex metaprogramming stuff we use in trio.socket to try to reduce boilerplate (if you're curious, check out _try_sync, _nonblocking_helper, _make_simple_sock_method_wrapper). It does reduce boilerplate, but it adds a bunch of runtime overhead, so we've been planning for a while to switch to using code-generation to write out all the boilerplate directly. That's a somewhat bigger job though, so it will wait until someone gets around to it. (Volunteers welcome :-).)

Once we've fixed that, we'll have a better sense of how much overhead those extra schedules are adding... in my profile it looks like it's only ~5% of the CPU time currently, but it's kind of hard to estimate, b/c schedule points have a diffuse cost. But we certainly could explore options like "skip bare checkpoints if we've already done one within the last 0.5 ms" or so. We just need some good data to compare the different options.
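
To make that last idea concrete, one purely illustrative shape it could take (not an existing trio API; the class name and the 0.5 ms default are invented) is a helper that only executes a bare checkpoint if enough time has passed since the previous one:

import time
import trio

class RateLimitedCheckpoint:
    # Illustrative sketch only, not part of trio: skip bare checkpoints that
    # occur within `interval` seconds of the previous one.
    def __init__(self, interval=0.0005):
        self.interval = interval
        self._last = -float("inf")

    async def checkpoint(self):
        now = time.perf_counter()
        if now - self._last >= self.interval:
            self._last = now
            await trio.lowlevel.checkpoint()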

rthalley commented

I checked out the master branch from GitHub this morning and installed it on the same system I used for testing before. There was some improvement, from 5007 QPS with 0.14 to 5750 QPS.

I wonder if py-spy was missing something (e.g. C calls), because I used Python's cProfile module, and what I see under the dnsperf load is that epoll is the biggest cost. I tend to believe that epoll() is the actual issue because of how frequently it is called, and because of prior experience with cooperative multitasking in C. I had the load running before I started the Python server, and I had the server bail out while the load was still running, so the times sampled in epoll should reflect actual costs.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    54353    2.365    0.000    2.365    0.000 {method 'poll' of 'select.epoll' objects}
   108695    1.267    0.000    2.258    0.000 /home/user/.cache/pypoetry/virtualenvs/dnspython-GS3fw1XI-py3.7/lib64/python3.7/site-packages/trio-0.16.0+dev-py3.7.egg/trio/_socket.py:570(_nonblocking_helper)
   244557    0.841    0.000    1.363    0.000 /home/user/src/dnspython/dns/wiredata.py:26(__getitem__)
   217384    0.803    0.000    1.176    0.000 /home/user/src/dnspython/dns/name.py:360(__hash__)
    54354    0.789    0.000   17.231    0.000 /home/user/.cache/pypoetry/virtualenvs/dnspython-GS3fw1XI-py3.7/lib64/python3.7/site-packages/trio-0.16.0+dev-py3.7.egg/trio/_core/_run.py:1992(unrolled_run)
1222809/1114117    0.789    0.000    0.925    0.000 {built-in method builtins.len}
    27173    0.775    0.000    4.221    0.000 /home/user/src/dnspython/dns/name.py:595(to_wire)
  1114142    0.712    0.000    0.712    0.000 {built-in method builtins.isinstance}
   489125    0.692    0.000    1.115    0.000 /usr/lib64/python3.7/enum.py:284(__call__)
   135865    0.503    0.000    0.799    0.000 /home/user/src/dnspython/dns/name.py:260(_validate_labels)
   380422    0.499    0.000    0.762    0.000 /home/user/src/dnspython/dns/name.py:290(_maybe_convert_to_binary)
   135865    0.491    0.000    1.224    0.000 /usr/lib64/python3.7/enum.py:836(__and__)
   489125    0.422    0.000    0.424    0.000 /usr/lib64/python3.7/enum.py:526(__new__)
    27173    0.405    0.000    2.248    0.000 /home/user/src/dnspython/dns/name.py:963(from_wire)
   135865    0.382    0.000    2.288    0.000 /home/user/src/dnspython/dns/name.py:314(__init__)
   624979    0.373    0.000    0.373    0.000 {method 'lower' of 'bytes' objects}
    54350    0.366    0.000   15.263    0.000 pts.py:10(serve)
   135865    0.346    0.000    1.107    0.000 /home/user/src/dnspython/dns/name.py:318(<listcomp>)
    81519    0.271    0.000    0.674    0.000 /usr/lib64/python3.7/enum.py:830(__or__)
    27173    0.251    0.000    3.876    0.000 /home/user/src/dnspython/dns/message.py:744(read)
   108694    0.218    0.000    0.457    0.000 /home/user/.cache/pypoetry/virtualenvs/dnspython-GS3fw1XI-py3.7/lib64/python3.7/site-packages/trio-0.16.0+dev-py3.7.egg/trio/_core/_traps.py:30(cancel_shielded_checkpoint)
    54354    0.213    0.000    0.367    0.000 /home/user/.cache/pypoetry/virtualenvs/dnspython-GS3fw1XI-py3.7/lib64/python3.7/site-packages/trio-0.16.0+dev-py3.7.egg/trio/_core/_run.py:1313(reschedule)
    27173    0.197    0.000    1.596    0.000 /home/user/src/dnspython/dns/message.py:1166(make_response)
   108691    0.192    0.000    0.649    0.000 /home/user/.cache/pypoetry/virtualenvs/dnspython-GS3fw1XI-py3.7/lib64/python3.7/site-packages/trio-0.16.0+dev-py3.7.egg/trio/_socket.py:34(__aexit__)
    27173    0.183    0.000    3.234    0.000 /home/user/src/dnspython/dns/message.py:622(_get_question)
   271730    0.182    0.000    0.182    0.000 {method 'write' of '_io.BytesIO' objects}
    27174    0.182    0.000    0.583    0.000 /home/user/.cache/pypoetry/virtualenvs/dnspython-GS3fw1XI-py3.7/lib64/python3.7/site-packages/trio-0.16.0+dev-py3.7.egg/trio/_socket.py:495(_resolve_address_nocp)
    27173    0.172    0.000    4.589    0.000 /home/user/src/dnspython/dns/message.py:770(from_wire)
    27173    0.170    0.000    4.483    0.000 /home/user/src/dnspython/dns/renderer.py:123(add_question)
    27173    0.162    0.000    0.573    0.000 /home/user/src/dnspython/dns/message.py:297(find_rrset)

rthalley commented

Also, I note that number 2 on the list was _nonblocking_helper(). Given we're not taking any of the BlockingIOError paths, I'm guessing the cost here is related to the invocation of the scheduler after every I/O in the __aexit__ of _try_sync. (This is consistent with the gettimeofday() and epoll_wait() system calls I see between recvfrom() and sendto().)

@njsmith
Copy link
Member

njsmith commented Jun 16, 2020

In my experience, trying to use intuition to speed things up will often send you off down blind alleys... it's important to focus on data :-).

Also, I generally avoid cProfile because (a) it's not very accurate (it adds a lot of per-function overhead, so it tends to tell you to replace all your small functions with big functions, regardless of whether that will actually speed up your program), and (b) it's not very useful (to get a clear picture of what's going on, you really need your profiler to show call stacks, not just individual functions). I like sampling call-stack profilers like py-spy or pyinstrument. (And FWIW, py-spy at least captures C call stacks as well, though it's not super relevant here b/c we don't use many extension modules.)

That said: your cProfile data is claiming that epoll.poll is responsible for 2.4% of the total runtime. So those epoll calls are a factor, but they don't explain why trio is 2x slower than curio/sync code :-). We have to keep digging.

The extra schedules are a part of that: the cost of those includes the epoll calls, but also other administrivia like checking for timeouts, pausing/resuming tasks, etc. Exactly how much I'm not certain.


rthalley commented Jun 16, 2020

  _     ._   __/__   _ _  _  _ _/_   Recorded: 05:23:36  Samples:  28116
 /_//_/// /_\ / //_// / //_'/ //     Duration: 30.153    CPU time: 26.882
/   _/                      v3.1.3

Program: pts.py

30.152 <module>  pts.py:2
└─ 30.001 main  pts.py:20
   └─ 30.001 run  trio/_core/_run.py:1802
      ├─ 27.208 unrolled_run  trio/_core/_run.py:1992
      │  ├─ 24.052 serve  pts.py:9
      │  │  ├─ 7.970 to_wire  dns/message.py:402
      │  │  │  ├─ 7.230 add_question  dns/renderer.py:123
      │  │  │  │  ├─ 6.830 to_wire  dns/name.py:595
      │  │  │  │  │  ├─ 2.410 [self]  
      │  │  │  │  │  ├─ 2.120 __init__  dns/name.py:314
      │  │  │  │  │  │  ├─ 0.946 <listcomp>  dns/name.py:318
      │  │  │  │  │  │  │  ├─ 0.626 _maybe_convert_to_binary  dns/name.py:290
      │  │  │  │  │  │  │  └─ 0.320 [self]  
      │  │  │  │  │  │  ├─ 0.772 _validate_labels  dns/name.py:260
      │  │  │  │  │  │  └─ 0.402 [self]  
      │  │  │  │  │  ├─ 1.864 __hash__  dns/name.py:360
      │  │  │  │  │  └─ 0.348 __len__  dns/name.py:675
      │  │  │  │  └─ 0.373 [self]  
      │  │  │  └─ 0.320 write_header  dns/renderer.py:273
      │  │  ├─ 7.112 from_wire  dns/message.py:770
      │  │  │  ├─ 5.968 read  dns/message.py:744
      │  │  │  │  ├─ 5.037 _get_question  dns/message.py:622
      │  │  │  │  │  ├─ 3.261 from_wire  dns/name.py:963
      │  │  │  │  │  │  ├─ 1.292 __init__  dns/name.py:314
      │  │  │  │  │  │  │  ├─ 0.553 <listcomp>  dns/name.py:318
      │  │  │  │  │  │  │  │  └─ 0.421 _maybe_convert_to_binary  dns/name.py:290
      │  │  │  │  │  │  │  └─ 0.550 _validate_labels  dns/name.py:260
      │  │  │  │  │  │  ├─ 1.196 __getitem__  dns/wiredata.py:26
      │  │  │  │  │  │  │  └─ 1.091 [self]  
      │  │  │  │  │  │  └─ 0.614 [self]  
      │  │  │  │  │  ├─ 0.993 find_rrset  dns/message.py:297
      │  │  │  │  │  │  ├─ 0.411 __hash__  dns/name.py:360
      │  │  │  │  │  │  └─ 0.330 [self]  
      │  │  │  │  │  ├─ 0.412 __getitem__  dns/wiredata.py:26
      │  │  │  │  │  └─ 0.371 [self]  
      │  │  │  │  └─ 0.452 [self]  
      │  │  │  └─ 0.516 __and__  enum.py:836
      │  │  ├─ 3.557 sendto  trio/_socket.py:743
      │  │  │  ├─ 2.427 _nonblocking_helper  trio/_socket.py:570
      │  │  │  │  ├─ 1.613 [self]  
      │  │  │  │  └─ 0.580 __aexit__  trio/_socket.py:34
      │  │  │  │     └─ 0.412 cancel_shielded_checkpoint  trio/_core/_traps.py:30
      │  │  │  └─ 0.901 _resolve_remote_address_nocp  trio/_socket.py:567
      │  │  │     └─ 0.834 _resolve_address_nocp  trio/_socket.py:495
      │  │  │        ├─ 0.465 family  socket.py:432
      │  │  │        │  └─ 0.315 _intenum_converter  socket.py:97
      │  │  │        └─ 0.369 [self]  
      │  │  ├─ 2.475 make_response  dns/message.py:1166
      │  │  │  ├─ 0.822 set_opcode  dns/message.py:586
      │  │  │  │  └─ 0.360 __and__  enum.py:836
      │  │  │  ├─ 0.816 __and__  enum.py:836
      │  │  │  │  ├─ 0.426 __call__  enum.py:284
      │  │  │  │  └─ 0.390 [self]  
      │  │  │  └─ 0.348 __or__  enum.py:830
      │  │  ├─ 1.436 wrapper  trio/_socket.py:338
      │  │  │  └─ 1.267 _nonblocking_helper  trio/_socket.py:570
      │  │  │     ├─ 0.494 __aexit__  trio/_socket.py:34
      │  │  │     │  └─ 0.346 cancel_shielded_checkpoint  trio/_core/_traps.py:30
      │  │  │     └─ 0.485 [self]  
      │  │  ├─ 0.959 set_rcode  dns/message.py:566
      │  │  │  ├─ 0.407 __and__  enum.py:836
      │  │  │  └─ 0.346 __or__  enum.py:830
      │  │  └─ 0.542 [self]  
      │  ├─ 1.779 [self]  
      │  ├─ 0.544 reschedule  trio/_core/_run.py:1313
      │  │  └─ 0.372 [self]  
      │  └─ 0.522 peekitem  sortedcontainers/sorteddict.py:472
      └─ 2.531 get_events  trio/_core/_io_epoll.py:218
