New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve error handling in open_tcp_stream #809

Merged
merged 10 commits into from Dec 22, 2018

Conversation

Projects
None yet
2 participants
@smason
Copy link
Contributor

smason commented Dec 18, 2018

improve handling of open_tcp_stream in the case of errors/exceptions being thrown. previously it looked like there were a few places the code could be interrupted and it would end up leaking file descriptors

I've changed it to track everything in open_sockets, and only have a single winner.

I started writing this along with Nathaniel's PyCon 2018 talk but ended up structuring the retry behaviour differently as I found his confusing to follow. Not sure if my code will catch all the cases appropriately or is actually better, hence the WIP.

Comments appreciated!

improve handling of open_tcp_stream
 * mostly around closing sockets in more exceptional conditions
 * code flow changed to be slightly easier to follow for me
@codecov

This comment has been minimized.

Copy link

codecov bot commented Dec 18, 2018

Codecov Report

Merging #809 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #809      +/-   ##
==========================================
- Coverage   99.01%   99.01%   -0.01%     
==========================================
  Files          96       94       -2     
  Lines       11682    11631      -51     
  Branches      828      832       +4     
==========================================
- Hits        11567    11516      -51     
  Misses         94       94              
  Partials       21       21
Impacted Files Coverage Δ
trio/_highlevel_open_tcp_stream.py 96.96% <100%> (+0.04%) ⬆️
trio/tests/test_highlevel_open_unix_stream.py 100% <100%> (ø) ⬆️
trio/tests/test_highlevel_open_tcp_stream.py 100% <100%> (ø) ⬆️
trio/_highlevel_open_unix_stream.py 83.33% <100%> (+5.55%) ⬆️
trio/__init__.py 100% <0%> (ø) ⬆️
trio/_core/tests/test_result.py
trio/_core/_result.py

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16743c0...f0052d5. Read the comment docs.

@njsmith

This comment has been minimized.

Copy link
Member

njsmith commented Dec 19, 2018

Whoa, this is awesome.

Yeah, the old code could leak file descriptors in rare cases. It required an unlucky alignment of the stars, and only really matters on PyPy, but it's been bugging me for months. So thank you. It looks like I never actually opened an issue for it, but there's a brief mention here.

I never thought of moving the loop-and-wait logic into the parent task like that. That's really elegant! (I guess this might be a side-effect of my originally writing this back in prehistoric Trio, where you couldn't put blocking logic into the parent task? But it's kind of mind-blowing to realize I spent weeks tweaking the logic for the talk and still missed this.)

One quibble with the socket tracking: I think I'd factor it out into a helper, like:

@contextmanager
def close_all():
    sockets_to_close = set()
    try:
        yield sockets_to_close
    finally:
        for sock in sockets_to_close:
            sock.close()

async def open_tcp_stream(...):
    with close_all() as sockets_to_close:
        ...
        # Opening a socket
        sock = socket.socket(...)
        sockets_to_close.add(sock)
        ...
        # Finishing up
        if winning_socket is None:
            ...
        else:
            sockets_to_close.remove(winning_socket)
            return winning_socket

This makes the main function's flow clearer, and lets you drop the awkward except: winning_socket = None; raise logic.

It would be great if we could add tests to confirm that the new code cleans up sockets correctly, in cases where the old code didn't. If you look in trio/tests/test_highlevel_open_tcp_stream.py, there's a pretty robust framework for setting up fake happy eyeballs scenarios, and it already has code to check that sockets are closed properly, so in principle it should just be a question of setting up a scenario where the connection is cancelled at just the wrong time. That might be difficult to do though, I'm not sure :-). (I don't remember off the top of my head exactly what the conditions are for provoking the old code to leak file descriptors.)

We'll also want a news entry, noting that we fixed some edge cases where open_tcp_stream could leak file descriptors on PyPy.

@smason

This comment has been minimized.

Copy link
Contributor Author

smason commented Dec 19, 2018

thanks for the positive feedback! I'll do some cleaning up and push again. didn't want to go too far without some confirmation that this was a useful change

the "simplified nurseries" is an interesting feature, feels like an intuitive change

I'm not sure how I'd provoke any resource leakage, it was more of an academic concern… I made sure tests succeeded before committing, but that's as far as I went into looking at your test framework — I'm generally afraid of concurrency-related heisenbugs

also, thanks for publishing Trio! it's a very cool experiment, be interesting to see where it goes!

smason added some commits Dec 19, 2018

@smason

This comment has been minimized.

Copy link
Contributor Author

smason commented Dec 19, 2018

I've done the easy stuff… just need to figure out a test that would leak in the old version and make sure my version catches it

I put dd2be12 into its own commit as it feels somewhat gratuitous and might lead to unnecessary history/merge awkwardness down the line.

@njsmith

This comment has been minimized.

Copy link
Member

njsmith commented Dec 20, 2018

I put dd2be12 into its own commit as it feels somewhat gratuitous and might lead to unnecessary history/merge awkwardness down the line.

Looks fine to me.

I've done the easy stuff… just need to figure out a test that would leak in the old version and make sure my version catches it

I looked at it again, and I think one case where it could leak would be if open_tcp_stream gets cancelled at the same moment as a socket successfully connects. We already have a simple test for cancellation (test_cancel) – it might be enough to write a similar test with a cancel timeout set to match the same time as the scenario connect succeeds.

I'm not sure if this would work or not though... when two things are supposed to happen at "the same time" then in fact one of them ends up happening first, and it might not be the one we want :-).

Another approach, that's a little more complicated but that would definitely work, would be to set up a cancel scope around the whole call, and then arrange for the FakeSocket.connect method to cancel it just before it returns.

smason added some commits Dec 20, 2018

add tests for IPv6 only network setups
pretty sure this is very rare in the wild, but it seems to happen on
private networks
add test making sure sockets are cleaned up in more places
previous version of the code would leak sockets if an exception was
thrown after performing the connect

using a KeyboardInterrupt like this is a valid case, but a bit of a
stand-in for some unexplained failure happening at the "wrong" time
make sure we close all sockets in the presence of exceptions
some parts of the code base use a bare "except:" but more parts catch
BaseException which feels better so I went with that as it means I
don't have to call sys.exc_info to get the thrown exception back
@smason

This comment has been minimized.

Copy link
Contributor Author

smason commented Dec 20, 2018

@njsmith your existing code was pretty water-tight! thinking up valid/reliable cases that would actually fail was difficult. I thought getting your thread cancellation idea to reliably provoke any misbehaviour would be kind of awkward, it could also be fixed easily by moving the code around to do:

        with close_on_error(socket(*socket_args)) as sock:
            await sock.connect(target_sockaddr)
            winning_sockets.append(sock)

the test I left in (that also leaks when I pull open_tcp_stream from master) simulates a non-OSError being thrown later on. though it would be pretty difficult to hit in practice

I've also made the close_all context manager continue closing everything in the presence of errors

feel free to squash the last two fixup commits, I left them as is per nodejs suggestions

@njsmith

This comment has been minimized.

Copy link
Member

njsmith commented Dec 21, 2018

it could also be fixed easily by moving the code around to do:

I don't think that would fix it... suppose that the cancellation happens after the connect succeeds. At that point it's too late to deliver a cancellation through the await sock.connect call. And there aren't any other awaits inside the child task at that point, so the child task wouldn't see a Cancelled exception. The parent though is still blocked in await nursery.__aexit__(), so it will get the Cancelled and exit immediately without cleaning up winning_sockets.

But with the new code it's obvious that it handles all these cases correctly without needing any complicated analysis, so I'm not too worried.

feel free to squash the last two fixup commits, I left them as is per nodejs suggestions

In general we're pretty relaxed about squashing-or-not. Keeping a nice clean history is nice, but I don't want people to feel like they need to spend a bunch of time fiddling with git to contribute.

@njsmith
Copy link
Member

njsmith left a comment

I think this is looking good, except for one small comment below, plus we still need a newsfragment – lmk if you need help figuring out how to do that.

Show resolved Hide resolved trio/_highlevel_open_tcp_stream.py Outdated

@njsmith njsmith changed the title WIP: improve handling of open_tcp_stream in error conditions improve error handling in open_tcp_stream Dec 21, 2018

@njsmith njsmith merged commit 5a03bee into python-trio:master Dec 22, 2018

5 checks passed

codecov/patch 100% of diff hit (target 99.01%)
Details
codecov/project Absolute coverage decreased by -<.01% but relative coverage increased by +0.98% compared to 16743c0
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/jenkins/pr-merge This commit looks good
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@njsmith

This comment has been minimized.

Copy link
Member

njsmith commented Dec 22, 2018

🎉

And per our standard practice, I'm sending you a Github invite now. No pressure of course, but if you want to keep contributing then you are very welcome! And always happy to chat about what you're working on or what you might be interested in working on or... well, whatever, really :-)

@smason smason deleted the smason:open_tcp_stream branch Dec 22, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment