Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-109534: fix reference leak when SSL handshake fails #114074

Merged
merged 2 commits into from Feb 1, 2024

Conversation

ordinary-jamie
Copy link
Contributor

@ordinary-jamie ordinary-jamie commented Jan 15, 2024

Fixes a reference cycle issue between the Future object used in selector_events.BaseSelectorEventLoop._accept_connection2 and SSLProtocol which prevented the SSLProtocol from deallocating and causing spikes in memory usage.

The root cause of the issue is that, on SSL handshake failure, the SSLProtocol sets the Future (waiter) exception, and the traceback frames in that exception will hold references to the protocol object and its buffers.

Refactored SSLProtocol._on_handshake_complete to not use try/except as this created a frame that kept the waiter object alive. Edit: see comments below

self._on_handshake_complete(ConnectionResetError)
self._on_handshake_complete(ConnectionResetError())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mjpieters; This fix caused the ssl-over-ssl test from GH-113214 (PR #113334) to fail and want to make sure I did not break anything on your end.

The test will raise ConnectionResetError (via SSLProtocol.eof_received) and normally this won't
be caught by the exception handler in SSLProtocol._fatal_error because it is of instance OSError. But because we no longer raise the exception in SSLProtocol._on_handshake_complete(exc), the exception object is never initialised and the isinstance doesn't work as intended. Changed SSLProtocol.eof_received to pass an exception object and not class to fix this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert in this area, but passing in an instance instead of the class looks fine by me. I did leave a comment on the removal of the try...except block however as this did more than just handle raise handshake_exc.

@ordinary-jamie
Copy link
Contributor Author

Was not sure how to add tests for this, the current replication from the issue is as follows

import asyncio
import ssl
import tracemalloc


async def main(certfile, keyfile):
    ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ssl_context.load_cert_chain(certfile, keyfile)

    await asyncio.start_server(
        lambda _, w: w.write(b"\0"), "127.0.0.1", "4443", ssl=ssl_context
    )

    current_1, _ = tracemalloc.get_traced_memory()
    proc = await asyncio.create_subprocess_shell(
        f"curl https://127.0.0.1:4443 2> /dev/null"
    )
    await proc.communicate()
    current_2, _ = tracemalloc.get_traced_memory()

    print(f"{(current_2-current_1)/1000:.2f}KB")


if __name__ == "__main__":
    tracemalloc.start()
    asyncio.run(main(certfile="server.crt", keyfile="server.key"))

The delta in tracemalloc.get_traced_memory will show something in the order of 10KB (as opposed to 250+KB). But I'm not sure if using tracemalloc etc in tests is a good idea...

peercert = sslobj.getpeercert()
except Exception as exc:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception handler here is intended to (also) catch exceptions raised by sslobj.getpeercert(). That is probably going to be an issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words: you can't avoid using a try...except block here. Is removing this really necessary to break the cycle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! This was my misunderstanding of the problem. It turns out that setting the original handshake_exc to None also fixed our problem -- which my understanding is that there are two separate instances of the exception, each with a traceback and only one was being handled.

try:
    if handshake_exc is None:
        self._set_state(SSLProtocolState.WRAPPED)
    else:
        raise handshake_exc

    peercert = sslobj.getpeercert()
except Exception as exc:
    handshake_exc = None    # <--- fixes the problem

@ordinary-jamie
Copy link
Contributor Author

ordinary-jamie commented Jan 15, 2024

This code snippet should replicate our issue, which I'm currently trying to convert into a unittest for this PR

import asyncio
import ssl
import weakref
import tracemalloc


class SSLProtocol:
    def __init__(self, future) -> None:
        self.buffer = bytearray(256 * 1024)
        self.future = future

    def _do_handshake(self):
        try:
            raise ssl.SSLCertVerificationError()
        except ssl.SSLError as exc:
            self._on_handshake_complete(exc)

    def _on_handshake_complete(self, handshake_exc):
        try:
            if handshake_exc is None:
                ...
            else:
                raise handshake_exc
        except Exception as exc:
            # handshake_exc = None              # <-------- Uncomment to break assertion
            self.future.set_exception(exc)
            self.future = None


async def accept_connection():
    """Mimicking selector_events.BaseSelectorEventLoop._accept_connection2"""

    loop = asyncio.get_running_loop()
    waiter = loop.create_future()

    ssl_proto = SSLProtocol(waiter)

    # selector_events._SelectorSocketTransport would schedule the protocols
    # connection_made method
    loop.call_soon(ssl_proto._do_handshake)

    try:
        await waiter
    except BaseException:
        waiter = None  # This is needed with `handshake_exc = None`
        ref = weakref.ref(ssl_proto)  # For testing

        # The protocol **should** fall out of scope here...
        # In _accept_connection2 the app & socket transports will release their
        # references to the protocol via their close methods and there will be no
        # references to it in this task (it falls out of scope of _make_ssl_transport)
        ssl_proto = None

        return ref


if __name__ == "__main__":
    tracemalloc.start()
    m_start = tracemalloc.get_traced_memory()[0]

    ssl_proto_ref = asyncio.run(accept_connection())

    m_delta = tracemalloc.get_traced_memory()[0] - m_start
    # Uncomment `handshake_exc = None` to break assertions
    assert ssl_proto_ref() is not None
    assert m_delta > 256_000 # bytes

@mjpieters
Copy link
Contributor

Does uvloop show the same issue? The implementation here is a fairly direct port from their Cython implementation, it'd be interesting to see what is different if uvloop doesn't show the same leak.

@ordinary-jamie
Copy link
Contributor Author

ordinary-jamie commented Jan 15, 2024

Does uvloop show the same issue? The implementation here is a fairly direct port from their Cython implementation, it'd be interesting to see what is different if uvloop doesn't show the same leak.

Hm. The initial issue ticket says uvloop is affected, I only just tested it, and it appears my first snippet (with asyncio.start_server) don't seem to show the same scale of problem (only 18KB in delta tracemalloc).

My second snippet (with the mimicked classes) will still show the same issue with uvloop, however...

And briefly looking over the uvloop implementation of SSLProtocol, I can't see any obvious differences!

Let me look into this!

@ordinary-jamie
Copy link
Contributor Author

Sorry for the slow response @mjpieters, it's been a busy week at the office!

it'd be interesting to see what is different if uvloop doesn't show the same leak.

uvloop doesn't appear to have the same issue. Despite the SSLProtocol implementation in both libraries being very similar. However, the reason for the difference seems to be how the Server/socket transports are implemented, for which there is significant deviation.

(Note, I am quite out-of-depth here, but tried my best navigating through uvloop's code base...)

To me, the main difference seems to be that in uvloop the waiter (future) exception is set through a done callback allowing the waiter to fall out of scope and break the reference cycle between Exception (and traceback) and Future.

In asyncio, the reader callback (BaseSelectorEventLoop._accept_connection -> BaseSelectorEventLoop._accept_connection2) both creates the waiter/future and catches the exception it will raise on await. In this fix, we break the reference cycle by setting the waiter to None in the frame.

uvloop does not do this. The UVStreamServer.on_listen method instead creates the waiter/future and instead of awaiting it, adds a done callback to check the result/exception on the future. Indeed, a naive check of gc.get_referrers seems to support this idea.

Copy link
Contributor

@hauntsaninja hauntsaninja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work debugging this! This change looks good to me. It sounds like we should backport this change to 3.11 as well?

@hauntsaninja hauntsaninja added needs backport to 3.11 only security fixes needs backport to 3.12 bug and security fixes labels Jan 29, 2024
@ordinary-jamie
Copy link
Contributor Author

Thanks for all your work debugging this! This change looks good to me. It sounds like we should backport this change to 3.11 as well?

Would you like me to create the PRs for the backports or is this something best left for the core team?

@hauntsaninja
Copy link
Contributor

hauntsaninja commented Jan 30, 2024

I'll handle backports! Just wanted to mention and to leave open for a little bit in case other folks have opinions.

@ordinary-jamie
Copy link
Contributor Author

Thanks @hauntsaninja! Just want to mention that the issue has some open threads RE:python:3.13-rc-alpine not observing an issue. I will continue investigating this separate to this PR.

Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to say I haven't been able to follow all this (the issue is very noisy) but these changes look good regardless.

@hauntsaninja hauntsaninja merged commit 80aa7b3 into python:main Feb 1, 2024
33 checks passed
@miss-islington-app
Copy link

Thanks @ordinary-jamie for the PR, and @hauntsaninja for merging it 🌮🎉.. I'm working now to backport this PR to: 3.11, 3.12.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 1, 2024
…H-114074)

(cherry picked from commit 80aa7b3)

Co-authored-by: Jamie Phan <jamie@ordinarylab.dev>
@bedevere-app
Copy link

bedevere-app bot commented Feb 1, 2024

GH-114829 is a backport of this pull request to the 3.12 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 1, 2024
…H-114074)

(cherry picked from commit 80aa7b3)

Co-authored-by: Jamie Phan <jamie@ordinarylab.dev>
@bedevere-app bedevere-app bot removed the needs backport to 3.12 bug and security fixes label Feb 1, 2024
@bedevere-app
Copy link

bedevere-app bot commented Feb 1, 2024

GH-114830 is a backport of this pull request to the 3.11 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.11 only security fixes label Feb 1, 2024
hauntsaninja pushed a commit that referenced this pull request Feb 1, 2024
) (#114830)

gh-109534: fix reference leak when SSL handshake fails (GH-114074)
(cherry picked from commit 80aa7b3)

Co-authored-by: Jamie Phan <jamie@ordinarylab.dev>
hauntsaninja pushed a commit that referenced this pull request Feb 1, 2024
) (#114829)

gh-109534: fix reference leak when SSL handshake fails (GH-114074)
(cherry picked from commit 80aa7b3)

Co-authored-by: Jamie Phan <jamie@ordinarylab.dev>
@gvanrossum
Copy link
Member

Thanks @ordinary-jamie for researching this and coming up with the fix!

Thanks @hauntsaninja for taking over the review. Really appreciate it!

@hauntsaninja
Copy link
Contributor

Thank you mjpieters as well! :-)

@ordinary-jamie
Copy link
Contributor Author

Thanks everyone for the support and guidance - I've learnt a lot from you all!

I will continue to monitor the issue thread to see if other users are observing other related bugs.

aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants