Slower performance compared to curio #1595
Comments
P.S. I enjoy trio, keep up the good work!
Also, if you want to generate the load, create a file called input with one line and then run dnsperf -d input -p 5354 -l 10
Note: I may not have been reading the strace output correctly, so I can't actually say whether epoll matched any fds or not; I just don't know. If there's anything you want me to test, let me know :)
One last comment today :) So, I read some trio code, and while I don't fully understand the details, it seems to me that the difference between curio and trio here is that curio only reschedules if the system call would block, whereas trio introduces a schedule point after every I/O call. The curio way can starve other tasks, since a sufficiently busy task might never block; the trio way can result in excessive scheduling and significantly lower throughput. This is a tough tradeoff, and it's hard for a library to decide what's best. In the curio-style world, the application can count how many loops it has done and sleep(0) every (say) 100 calls, avoiding a lot of unhelpful scheduler overhead while still not starving other tasks; but if you forget, it can be bad. I wonder if there could be a scoped I/O mode in trio where, within the scope, you would schedule if you had to (i.e. the system call would block), but if you didn't have to, you would NOT schedule until you'd hit some specified count of schedule points, e.g. async with trio.defer_schedule_limit(100): This would allow applications that want to be more aggressive with I/O to do so, while still keeping the default behavior safe.
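For illustration, the manual curio-style workaround described above might look like the following sketch. Here handle_one_request is a hypothetical stand-in for the application's per-request work, and 100 is just the example figure from the comment:

```python
import curio

async def serve_loop(handle_one_request):
    # curio only reschedules when an operation would actually block,
    # so a busy task inserts its own periodic schedule points to
    # avoid starving other tasks indefinitely
    count = 0
    while True:
        await handle_one_request()
        count += 1
        if count % 100 == 0:
            await curio.sleep(0)  # explicit, infrequent schedule point
```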
That'd be useful. Alternatively, we could use a time limit (or the perf counter), which would be more deterministic for the rest of the code.
Profiling a user-written DNS microbenchmark in python-triogh-1595 showed that UDP sendto operations were spending a substantial amount of time in _resolve_address, which is responsible for resolving any hostname given in the sendto call. This is weird, since the benchmark wasn't using hostnames, just a raw IP address. Surprisingly, it turns out that calling getaddrinfo at all is quite expensive, even if you give it an already-resolved plain IP address so there's no work for it to do:

In [1]: import socket
In [2]: %timeit socket.getaddrinfo("127.0.0.1", 80)
5.84 µs ± 53.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit socket.inet_aton("127.0.0.1")
187 ns ± 1.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I thought getaddrinfo would be effectively free in this case, but apparently I was wrong. Also, this doesn't really matter for TCP connections, since they only pass through this path once per connection, but UDP goes through here on every packet, so the overhead adds up quickly.

Solution: add a fast path to _resolve_address for when the user's address is already resolved, so we skip all the work. On the benchmark in python-triogh-1595 on my laptop, this PR takes us from ~7000 queries/second to ~9000 queries/second, a ~30% speedup.

The patch is a bit more complicated than I expected. There are three parts:
- The fast path itself.
- Skipping unnecessary checkpoints in _resolve_address, including when we're on the fast path: this is an internal function, and it turns out that almost all our callers are already doing checkpoints, so there's no need to do another checkpoint inside _resolve_address. Even with the fast checkpoints from python-triogh-1613, fast-path+checkpoint is still only ~8000 queries/second on the DNS microbenchmark, versus ~9000 queries/second without the checkpoint.
- _resolve_address used to always normalize IPv6 addresses into 4-tuples, as a side effect of how getaddrinfo works. The fast path doesn't call getaddrinfo, so the tests needed adjusting: they no longer expect this normalization, and the tests for the getaddrinfo normalization code now make sure they don't hit the fast path.
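The general shape of such a fast path (a hedged sketch, not the actual patch; the real _resolve_address has a different signature and more cases) is to detect a numeric IP literal with a cheap check and only fall back to getaddrinfo when that fails:

```python
import socket

def looks_numeric(family, host):
    # hypothetical helper: True if `host` is already a numeric IP
    # literal for this address family, so getaddrinfo has no work to do
    try:
        socket.inet_pton(family, host)
        return True
    except OSError:
        return False

def resolve_address_sketch(family, host, port):
    if looks_numeric(family, host):
        return (host, port)  # fast path: skip getaddrinfo entirely
    # slow path: full resolution via getaddrinfo (a real implementation
    # also has to handle flags, IPv6 scope ids, etc.)
    return socket.getaddrinfo(host, port, family)[0][4]
```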
👋 Hello! Trio is definitely not yet as fast as it could be, or as we want it to be. Partly this is following the old rule "first make it right, then make it fast", and partly it's because it's hard to optimize things well without good benchmarks to profile. So this benchmark is super helpful :-)

Trio does do more schedules and I/O polls than curio, but I'm not sure how much that actually contributes to what you're seeing. Running your benchmark through py-spy, the first thing that jumped out at me is that we were spending a tremendous amount of time in _resolve_address. It turns out that the issue was that when you call sendto, trio runs the address through getaddrinfo, which is surprisingly expensive even when there's no hostname to resolve. (Asyncio has a similar optimization already. Curio doesn't try to resolve hostnames in addresses at all; it just passes through whatever address you use directly to the stdlib socket call.) I don't know how much of the gap between curio/sync python that fixes, but it might be interesting to re-run your benchmarks on the latest Trio checkout and see :-).

After fixing that, the next big thing that shows up in the profile for me is the complex metaprogramming stuff we use internally. Once we've fixed that, we'll have a better sense of how much overhead those extra schedules are adding... in my profile it looks like it's only ~5% of the CPU time currently, but it's kind of hard to estimate, b/c schedule points have a diffuse cost. But we certainly could explore options like "skip bare checkpoints if we've already done one within the last 0.5 ms" or so. We just need some good data to compare the different options.
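That "skip bare checkpoints if one happened recently" idea could be prototyped in application code along these lines. This is a sketch under assumptions: trio has no such built-in, and the 0.5 ms interval is just the figure floated above:

```python
import time

import trio

_last_checkpoint = 0.0

async def maybe_checkpoint(interval=0.0005):
    # only pay for a full schedule point if more than `interval`
    # seconds have elapsed since the last one we executed
    global _last_checkpoint
    now = time.perf_counter()
    if now - _last_checkpoint >= interval:
        _last_checkpoint = now
        await trio.sleep(0)  # trio.sleep(0) executes a checkpoint
```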
I checked out the master branch from GitHub this morning and installed it on the same system I used for testing before. There was some improvement: from 5007 QPS with 0.14 to 5750 QPS. I wonder if py-spy was missing something (e.g. C calls), because I used Python's cProfile module, and what I see under the dnsperf load is that epoll is the biggest cost. I tend to believe that epoll() is the actual issue because of how frequently it is called, and because of prior experience with cooperative multitasking in C. I had the load running before I started the Python server, and had the server bail out while the load was still running, so the times sampled in epoll should reflect actual costs.
Also, I note that number 2 on the list was _nonblocking_helper(). Given that we're not taking any of the BlockingIOError paths, I'm guessing the cost here is related to the invocation of the scheduler after every I/O in the __aexit__ of _try_sync. (This is consistent with the gettimeofday() and epoll_wait() system calls I see between recvfrom() and the following sendto().)
In my experience, trying to use intuition to speed things up will often send you off down blind alleys... it's important to focus on data :-). Also, I generally avoid cProfile, because (a) it's not very accurate (it adds a lot of per-function overhead, so it tends to tell you to replace all your small functions with big functions, regardless of whether that will actually speed up your program or not), and (b) it's not very useful (to get a clear picture of what's going on, you really need your profiler to show call stacks, not just individual functions). I like sampling call-stack profilers like py-spy or pyinstrument. (And FWIW, py-spy at least captures C call stacks as well, though that's not super relevant here b/c we don't use many extension modules.)

That said: your cProfile data is claiming that epoll is where the time is going. The extra schedules are a part of that: their cost includes the epoll calls, but also other administrivia like checking for timeouts and pausing/resuming tasks. Exactly how much, I'm not certain.
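For example, pyinstrument can be driven from Python directly, and its report shows whole call stacks rather than flat per-function totals. In this sketch, run_server is a placeholder for the benchmark's actual entry point:

```python
from pyinstrument import Profiler

import trio

async def run_server():
    # placeholder for the benchmark's server loop
    await trio.sleep(1)

profiler = Profiler()
profiler.start()
trio.run(run_server)
profiler.stop()
print(profiler.output_text())  # call-stack report with sampled times
```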
I've used trio in some projects lately, and also recently added it to dnspython. As part of some random performance testing, I compared "minimal DNS server" performance between ordinary Python sockets, curio, and trio. I'll attach the trio program to the bottom of this report. Trio performed significantly worse, e.g. on a particular Linux VM:
Regular python I/O: 9498 QPS
Curio: 8979 QPS
Trio: 5007 QPS
I've seen similar behavior on a Mac:
Regular python I/O: 8359 QPS
Curio: 8061 QPS
Trio: 3425 QPS
I'm using dnsperf to generate load, and it does a good job of keeping UDP input queues full, so the ideal expected behavior, if you strace the program, is to see a ton of recvfrom() and sendto() system calls and nothing else. In particular, you don't expect to see any epoll_wait(). Ordinary Python I/O and curio behave as expected, but one pass through the loop in trio looks like:
recvfrom()
clock_gettime()
epoll_wait() (not waiting on any fds if I'm reading strace output correctly)
clock_gettime()
clock_gettime()
epoll_wait()
clock_gettime()
sendto()
clock_gettime()
epoll_wait()
clock_gettime()
I don't understand trio's internals enough to debug this further at the moment, but I thought I would make a report, as this seems like excessive epolling.
I haven't run dtruss on the Mac, but Python tracing indicated a lot of time related to kqueue.
This was with trio 0.15.1, and CPython 3.8.3 and 3.7.7.
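(The attached program itself isn't reproduced in this capture. Purely for orientation, a minimal trio UDP responder along the lines described might look like the sketch below; the echo behavior stands in for building a real DNS response with dnspython, and the port matches the dnsperf command above.)

```python
import trio

async def main():
    # hypothetical stand-in for the attached benchmark program:
    # bind a UDP socket and answer each packet as fast as possible
    sock = trio.socket.socket(trio.socket.AF_INET, trio.socket.SOCK_DGRAM)
    await sock.bind(("127.0.0.1", 5354))
    while True:
        data, addr = await sock.recvfrom(65535)
        # the real server built a DNS response with dnspython;
        # echoing the packet back keeps the sketch self-contained
        await sock.sendto(data, addr)

trio.run(main)
```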