Setting TCP_NODELAY on macOS sometimes fails #70

Closed
pquentin opened this issue Sep 27, 2022 · 9 comments
@pquentin

As you know, we're using Thespian in Rally. While we run it on Linux in production, multiple developers use macOS, and we recently realized that one of our integration tests was hanging on macOS because it was stuck in the actor system. It currently reproduces 100% of the time for one developer on both ARM and x86, but we don't have a clean reproducer that does not involve Rally: elastic/rally#1575.

What happens is that the code to disable Nagle's algorithm occasionally fails with an OSError. Apparently macOS forbids setting TCP_NODELAY during socket shutdown. Various projects have run into that issue, including the Envoy proxy: envoyproxy/envoy#1446.

This issue was difficult to spot because the error is only logged, not propagated; the actor system simply hangs.

@pquentin
Author

@kquick Would you accept a PR that tries to disable Nagle's algorithm but suppresses any OSError?
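Something along the lines of this minimal sketch is what I have in mind (a hypothetical helper; the actual call site in Thespian's TCP transport may look different):

    import contextlib
    import socket

    def disable_nagle(sock: socket.socket) -> None:
        # Best effort only: on macOS this setsockopt can raise OSError when
        # the peer has already shut the connection down, so ignore the failure.
        with contextlib.suppress(OSError):
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

The worst case of skipping the option is that Nagle's algorithm stays enabled for that one connection.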

@kquick
Member

kquick commented Sep 29, 2022

Hi, @pquentin. Thanks for the bug report, and sorry for not responding earlier. I'm happy to receive a PR on this, although I have already been looking into it (intermittently, since I've been busy with a couple of other things).

This is a bit of an unfortunate issue, since it's not clear to me when it would be safe to perform the TCP_NODELAY setting. I've got a branch in progress (https://github.com/thespianpy/Thespian/tree/issue70) that dismisses the OSError, but I'm concerned about the performance effects of not having this set. I have not found an OS trigger that would be reasonable to use, so I'd been looking into whether I could add some retries for this to the internal socket management state machine.

I also don't have access to a Mac for a couple of weeks, so I can only verify that these changes don't disrupt normal behavior under Linux. Since this is a bug that causes a Thespian failure, and you are encountering it in production, I'm willing to do a Thespian release with just the error dismissal and follow up later with any TCP_NODELAY re-attempts for performance. Would you be willing to test both of these for me on a Mac?
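Roughly, the re-attempt idea would be something like this hypothetical helper (the real change would live inside the transport's socket-management state machine rather than block with a sleep):

    import socket
    import time

    def try_disable_nagle(sock, attempts=3, delay=0.05):
        # Hypothetical bounded retry: give the socket a moment to leave the
        # state that makes macOS reject the option, then give up silently.
        for _ in range(attempts):
            try:
                sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
                return True
            except OSError:
                time.sleep(delay)
        return False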

@pquentin
Author

Quick answer from my phone: no worries! We're not using this in production, only on developer machines, and we have an easy workaround: comment out the offending line.

We'd be happy to test any fix you come up with within the next few weeks/months. Thanks.

@inqueue

inqueue commented Jan 12, 2023

Hi @kquick, I work with @pquentin and ran into the issue again today after forgetting to apply our documented workaround in a new environment. It reproduces reliably with Rally on macOS, and I would be happy to run any tests you like.

The differences I see between a successful execution (Linux) and a failing one (macOS) are the protocol number and the missing raddr on lsock at

lsock, rmtTxAddr = self.new_socket(self.socket.accept)

during a failure. The value of lsock at various stages of execution is in elastic/rally#1575 (comment).

Success (Linux)

76.9s _acceptNewIncoming: "lsock: "='lsock: ',
      lsock=<socket.socket fd=31, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.0.2.15', 40081), raddr=('10.0.2.15', 52944)>
76.9s _acceptNewIncoming: "rmtTxAddr: "='rmtTxAddr: ',
      rmtTxAddr=('10.0.2.15', 52944)

Failure (macOS)

21.2s _acceptNewIncoming: "lsock: "='lsock: ',
      lsock=<socket.socket fd=60, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.5.116', 59173)>
21.2s _acceptNewIncoming: "rmtTxAddr: "='rmtTxAddr: ', 
      rmtTxAddr=('192.168.5.116', 59224)
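
If it helps, the missing raddr can be probed directly; a hypothetical check like this should distinguish the two cases, since getpeername() raises OSError once the remote end is gone:

    import socket

    def has_peer(sock: socket.socket) -> bool:
        # A socket whose peer has already disconnected has no remote address,
        # so getpeername() fails (typically with ENOTCONN).
        try:
            sock.getpeername()
            return True
        except OSError:
            return False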

@kquick
Member

kquick commented Jan 13, 2023

Hi @inqueue , thanks for the additional information. You refer to a "documented workaround"... is that the use of Thespian on the https://github.com/thespianpy/Thespian/tree/issue70 branch or something else you are doing?

Your additional information leads me to suspect connection attempts from some source other than a Thespian client. I've pushed an additional change to the branch above to address this situation. If you could try it and confirm that it (a) doesn't hang as in the failure cases, and (b) still reliably processes all connections and data in your configuration, then I would be comfortable merging this and making a new Thespian release.

@inqueue

inqueue commented Jan 13, 2023

Hi Kevin,

You refer to a "documented workaround"... is that the use of Thespian on the https://github.com/thespianpy/Thespian/tree/issue70 branch or something else you are doing?

Commenting out the following line has been our workaround:

lsock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

On macOS, the failure can be reliably toggled by applying and reverting the changes in https://github.com/thespianpy/Thespian/tree/issue70. I also tested it on Linux for completeness and it looks good there as well. +1 to merge it. Thank you!

@kquick
Member

kquick commented Jan 16, 2023

Thespian 3.10.7 is released now with these changes. Please let me know if you have any issues with that release.

@inqueue

inqueue commented Jan 17, 2023

Hi @kquick, we did encounter one error on macOS when using 3.10.7:

DEBUG:esrally.actor:Checking capabilities [{'Thespian ActorSystem Name': 'multiprocQueueBase', 'Thespian ActorSystem Version': 1, 'Python Version': (3, 8, 13, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1673978163702'}] against requirements [{'coordinator': True}] failed.
2023-01-17 12:56:03.718155 p54867 I    ++++ Admin started @ ActorAddr-Q.ThespianQ / gen (3, 10)
2023-01-17 12:56:03.725607 p54867 dbg  Admin of ReceiveEnvelope(from: ActorAddr-Q.ThespianQ.a, <class 'thespian.system.messages.multiproc.LoggerConnected'> msg: <thespian.system.messages.multiproc.LoggerConnected object at 0x1037035e0>)
2023-01-17 12:56:03.726543 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999773-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x103703730>-quit_0:04:59.999752)
2023-01-17 12:56:03.728853 p54772 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999663-<class 'thespian.system.messages.admin.QueryExists'>-<thespian.system.messages.admin.QueryExists object at 0x1036f6970>-quit_0:04:59.999650)
2023-01-17 12:56:03.729617 p54867 dbg  Admin of ReceiveEnvelope(from: ActorAddr-Q.~, <class 'thespian.system.messages.admin.QueryExists'> msg: <thespian.system.messages.admin.QueryExists object at 0x103703820>)
2023-01-17 12:56:03.730150 p54867 dbg  Attempting intent TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999803-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x103703be0>-quit_0:04:59.999792)
2023-01-17 12:56:03.730473 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999474-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x103703be0>-quit_0:04:59.999466)
2023-01-17 12:56:03.733756 p54772 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999929-<class 'thespian.system.messages.admin.PendingActor'>-PendingActor#1_of_None-quit_0:04:59.999915)
2023-01-17 12:56:03.734321 p54867 dbg  Admin of ReceiveEnvelope(from: ActorAddr-Q.~1, <class 'thespian.system.messages.admin.PendingActor'> msg: PendingActor#1_of_None)
2023-01-17 12:56:03.734686 p54867 I    Pending Actor request received for esrally.racecontrol.BenchmarkActor reqs {'coordinator': True} from ActorAddr-Q.~1
2023-01-17 12:56:03.735352 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ.a-pending-ExpiresIn_0:04:59.999929-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 10, /Users/jbryan/dev/pr-review/rally/esrally/actor.py, 122, "Checking capabilities [{'Thespian ActorSystem Name': 'multipr...-quit_0:04:59.999918)
2023-01-17 12:56:03.735849 p54867 Warn no system has compatible capabilities for Actor esrally.racecontrol.BenchmarkActor
2023-01-17 12:56:03.736326 p54867 dbg  Attempting intent TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999822-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999811)
2023-01-17 12:56:03.736646 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999507-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999498)
2023-01-17 12:56:06.743766 p54772 I    ActorSystem shutdown requested.
2023-01-17 12:56:06.745121 p54772 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999463-<class 'thespian.system.messages.admin.SystemShutdown'>-<thespian.system.messages.admin.SystemShutdown object at 0x1036f67c0>-quit_0:04:59.999383)
2023-01-17 12:56:06.746577 p54867 dbg  Admin of ReceiveEnvelope(from: ActorAddr-Q.~, <class 'thespian.system.messages.admin.SystemShutdown'> msg: <thespian.system.messages.admin.SystemShutdown object at 0x1037152e0>)
2023-01-17 12:56:06.747781 p54867 dbg  ---- shutdown initiated by ActorAddr-Q.~
2023-01-17 12:56:06.749025 p54867 dbg  Attempting intent TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999569-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3585 actual None-quit_0:04:59.999532)
2023-01-17 12:56:06.749700 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.998863-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3585 actual None-quit_0:04:59.998844)
2023-01-17 12:56:06.750640 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ.a-pending-ExpiresIn_0:04:59.999896-<class 'thespian.system.logdirector.LoggerExitRequest'>-<thespian.system.logdirector.LoggerExitRequest object at 0x103715340>-quit_0:04:59.999879)

I also hit this error during testing with a patched 3.10.6 that included the changes from https://github.com/thespianpy/Thespian/tree/issue70. Interestingly, the error did not surface with 3.10.1 (the version we are currently on) plus the two issue70 commits, which is the version the changes were originally validated against. I should have stuck with 3.10.6 and reported the error then.

@kquick
Member

kquick commented Jan 17, 2023

In the above example you are using the multiprocQueueBase, whereas issue70 fixed the multiprocTCPBase, so these are unrelated. The error above indicates that Thespian tried to find an actor system with certain capabilities (probably {'coordinator': True}) that didn't match the running actor system. The running system's capabilities are unfortunately truncated in the logging (see "Checking capabilities ..."), so I can't confirm whether this is an error, but it's likely that the multiprocQueueBase actor system you started doesn't have these capabilities.

Perhaps you intended to start a multiprocTCPBase rather than a multiprocQueueBase, since the latter does not support multiple host systems or actor system conventions (see https://thespianpy.com/doc/using.html#outline-container-hH-2a5fa63d-e6eb-43b9-bea8-47223b27544e for more details). If the multiprocQueueBase was intentional, please open a new issue and we can work to resolve your problem.
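For example, a minimal sketch of starting a TCP-based system that advertises the required capability (the port number and capability names here are illustrative, not prescribed):

    from thespian.actors import ActorSystem

    # Hypothetical setup: a multiprocTCPBase system whose capabilities include
    # the 'coordinator' flag that esrally.racecontrol.BenchmarkActor requires.
    asys = ActorSystem('multiprocTCPBase',
                       {'Admin Port': 1900, 'coordinator': True})
    try:
        pass  # create actors and run the workload here
    finally:
        asys.shutdown()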

p12tic added a commit to p12tic/twisted that referenced this issue Nov 18, 2024
There are bug reports about failures when setting TCP_NODELAY under
certain conditions on macOS:
thespianpy/Thespian#70,
envoyproxy/envoy#1446.

It is reasonable to simply eat errors coming from setting TCP_NODELAY
because TCP_NODELAY is a relatively small performance optimization. In
almost all cases the caller will not be able to do anything to remedy
the situation and will simply continue.