
Make trio.lowlevel.checkpoint much faster #1613

Merged
merged 1 commit into from Jun 14, 2020

Conversation

njsmith
Member

@njsmith njsmith commented Jun 14, 2020

This is a bit of a kluge, and will hopefully be cleaned up in the
future when we overhaul KeyboardInterrupt (python-triogh-733) and/or the
cancellation checking APIs (python-triogh-961). But 'checkpoint()' is a common
operation, and this speeds it up by ~7x, so... not bad for a ~4 line
change.

Before this change:

- `await checkpoint()`: ~24k/second
- `await cancel_shielded_checkpoint()`: ~180k/second

After this change:

- `await checkpoint()`: ~170k/second
- `await cancel_shielded_checkpoint()`: ~180k/second

Benchmark script:

```python
import time
import trio

LOOP_SIZE = 1_000_000

async def main():
    start = time.monotonic()
    for _ in range(LOOP_SIZE):
        await trio.lowlevel.checkpoint()
        #await trio.lowlevel.cancel_shielded_checkpoint()
    end = time.monotonic()
    print(f"{LOOP_SIZE / (end - start):.2f} schedules/second")

trio.run(main)
```
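The "~7x" figure follows directly from the before/after throughput numbers above; a quick sanity check:

```python
# Throughput figures reported in the PR description (checkpoints/second).
before = 24_000   # await checkpoint(), before the change
after = 170_000   # await checkpoint(), after the change

speedup = after / before
print(f"~{speedup:.1f}x")  # prints "~7.1x", consistent with the "~7x" claim
```

Note that `cancel_shielded_checkpoint()` throughput is unchanged (~180k/second), so the patch essentially brings the full checkpoint's cost down to that of the shielded one.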
@codecov

codecov bot commented Jun 14, 2020

Codecov Report

Merging #1613 into master will not change coverage.
The diff coverage is 100.00%.

```
@@           Coverage Diff           @@
##           master    #1613   +/-   ##
=======================================
  Coverage   99.69%   99.69%
=======================================
  Files         110      110
  Lines       13840    13840
  Branches     1058     1059    +1
=======================================
  Hits        13798    13798
  Misses         27       27
  Partials       15       15
```

| Impacted Files | Coverage Δ |
| --- | --- |
| trio/_core/_run.py | 99.76% <100.00%> (+<0.01%) ⬆️ |
| trio/_core/_ki.py | 100.00% <0.00%> (ø) |
| trio/_core/_thread_cache.py | 100.00% <0.00%> (ø) |
| trio/_core/_wakeup_socketpair.py | 100.00% <0.00%> (ø) |

@oremanj oremanj merged commit 6c13b13 into python-trio:master Jun 14, 2020
njsmith added a commit to njsmith/trio that referenced this pull request Jun 15, 2020
Profiling a user-written DNS microbenchmark in python-triogh-1595 showed that UDP
sendto operations were spending a substantial amount of time in
_resolve_address, which is responsible for resolving any hostname
given in the sendto call. This is weird, since the benchmark wasn't
using hostnames, just a raw IP address.

Surprisingly, it turns out that calling getaddrinfo at all is also
quite expensive, even if you give it an already-resolved plain IP address
so there's no work for it to do:

    In [1]: import socket

    In [2]: %timeit socket.getaddrinfo("127.0.0.1", 80)
    5.84 µs ± 53.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

    In [3]: %timeit socket.inet_aton("127.0.0.1")
    187 ns ± 1.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I thought getaddrinfo would be effectively free in this case, but
apparently I was wrong. Also, this doesn't really matter for TCP
connections, since they only pass through this path once per
connection, but UDP goes through here on every packet, so the overhead
adds up quickly.

Solution: add a fast path to _resolve_address for when the user's
already resolved, so we skip all the work. On the benchmark in python-triogh-1595
on my laptop, this PR takes us from ~7000 queries/second to
~9000 queries/second, so a ~30% speedup.

The patch is a bit more complicated than I expected. There are three
parts:

- The fast path itself

- Skipping unnecessary checkpoints in _resolve_address, including when
  we're on the fast path: this is an internal function, and it turns out
  that almost all our callers are already doing checkpoints, so there's
  no need to do another checkpoint inside _resolve_address. Even with
  the fast checkpoints from python-triogh-1613, the fast-path+checkpoint is still
  only ~8000 queries/second on the DNS microbenchmark, versus ~9000
  queries/second without the checkpoint.

- _resolve_address used to always normalize IPv6 addresses into
  4-tuples, as a side-effect of how getaddrinfo works. The fast path
  doesn't call getaddrinfo, so the tests needed adjusting so they
  don't expect this normalization, and to make sure that our tests for
  the getaddrinfo normalization code don't hit the fast path.
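The fast path described above can be sketched as follows. This is a hypothetical illustration of the technique, not Trio's actual code: `looks_resolved` and the simplified `resolve_address` are invented names, and the real implementation runs getaddrinfo without blocking the event loop.

```python
import ipaddress
import socket

def looks_resolved(host):
    # Hypothetical helper: a string that parses as a numeric IPv4/IPv6
    # address needs no getaddrinfo lookup at all.
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def resolve_address(host, port):
    if looks_resolved(host):
        # Fast path: skip getaddrinfo entirely. Unlike getaddrinfo, this
        # does not normalize IPv6 sockaddrs into 4-tuples -- the asymmetry
        # the tests had to account for.
        return (host, port)
    # Slow path: the full getaddrinfo lookup (blocking here for brevity).
    info = socket.getaddrinfo(host, port, type=socket.SOCK_DGRAM)
    return info[0][4]
```

For a UDP socket that calls sendto with the same literal address on every packet, this check costs one `ipaddress.ip_address` parse (~hundreds of nanoseconds) instead of a full getaddrinfo round trip (several microseconds).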