Setting TCP_NODELAY on macOS sometimes fails #70
@kquick Would you accept a PR that tries to disable Nagle's algorithm but suppresses any OSError?
Hi, @pquentin. Thanks for the bug report, and sorry for not responding earlier. I'm happy to receive a PR on this, although I have been looking into it myself (intermittently, due to being busy with a couple of other things). This is a bit of an unfortunate issue, since it's not clear to me when it would be safe to set TCP_NODELAY. I've got a branch in progress (https://github.com/thespianpy/Thespian/tree/issue70) that dismisses the OSError, but I'm concerned about the performance effects of not having this option set. I have not found an OS trigger that would be reasonable to use, so I'd been looking into whether I could add some retries for this to the internal socket management state machine. I also don't have access to a Mac for a couple of weeks, so I'm only able to verify that these changes don't disrupt normal behavior under Linux.

Since this is a bug that causes a Thespian failure, and you are encountering it in production, I'm willing to do a Thespian release with just the error dismissal and follow up later with any TCP_NODELAY re-attempts for performance, but I'd like to ask if you would be willing to test both of these for me on a Mac?
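The retry idea mentioned above can be sketched as a small helper that attempts to set TCP_NODELAY and reports whether it succeeded, so the surrounding socket-management code could schedule another attempt later. This is an illustrative sketch, not Thespian's actual internals; the function name is hypothetical.

```python
import socket

def try_set_nodelay(sock: socket.socket) -> bool:
    """Attempt to disable Nagle's algorithm; report success.

    Returning False lets the caller schedule a retry later (for
    example from a socket-management state machine) instead of
    failing outright. Hypothetical helper, not Thespian's code.
    """
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        return True
    except OSError:
        return False
```

A caller would check the return value and re-queue the socket for another attempt rather than letting the OSError propagate.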
Quick answer from my phone: no worries! We're not using this in production, only on developer machines, and we have an easy workaround: comment out the offending line. We'd be happy to test any fix you come up with in the next few weeks/months. Thanks.
Hi @kquick, I work with @pquentin and ran into the issue again today after forgetting to use our documented workaround in a new environment. It seems to reproduce reliably with Rally on macOS, and I would be happy to run any tests you like. The differences I see between successful and unsuccessful executions (between macOS and Linux) are the protocol number and the missing
lsock at various stages of execution is in elastic/rally#1575 (comment).
Success (Linux)
Failure (macOS)
Hi @inqueue, thanks for the additional information. You refer to a "documented workaround"... is that the use of Thespian from the https://github.com/thespianpy/Thespian/tree/issue70 branch, or something else you are doing? Your additional information leads me to suspect connection attempts by some source other than a Thespian client. I've pushed an additional change to the branch above to address this situation. If you could try this and confirm that it (a) doesn't hang as in the failure cases, and (b) still seems to reliably process all connections and data in your configuration, then I would be comfortable merging this and making a new Thespian release.
Hi Kevin,
Commenting out the following line has been our workaround:
On macOS, the failure case can be reliably switched on and off by adding and removing the changes in https://github.com/thespianpy/Thespian/tree/issue70. I also tested it on Linux for completeness and it looks good there as well. +1 to merge it. Thank you!
Thespian 3.10.7 is released now with these changes. Please let me know if you have any issues with that release.
Hi @kquick, we did encounter one error on macOS when using 3.10.7:

DEBUG:esrally.actor:Checking capabilities [{'Thespian ActorSystem Name': 'multiprocQueueBase', 'Thespian ActorSystem Version': 1, 'Python Version': (3, 8, 13, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1673978163702'}] against requirements [{'coordinator': True}] failed.
2023-01-17 12:56:03.718155 p54867 I ++++ Admin started @ ActorAddr-Q.ThespianQ / gen (3, 10)
2023-01-17 12:56:03.725607 p54867 dbg Admin of ReceiveEnvelope(from: ActorAddr-Q.ThespianQ.a, <class 'thespian.system.messages.multiproc.LoggerConnected'> msg: <thespian.system.messages.multiproc.LoggerConnected object at 0x1037035e0>)
2023-01-17 12:56:03.726543 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999773-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x103703730>-quit_0:04:59.999752)
2023-01-17 12:56:03.728853 p54772 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999663-<class 'thespian.system.messages.admin.QueryExists'>-<thespian.system.messages.admin.QueryExists object at 0x1036f6970>-quit_0:04:59.999650)
2023-01-17 12:56:03.729617 p54867 dbg Admin of ReceiveEnvelope(from: ActorAddr-Q.~, <class 'thespian.system.messages.admin.QueryExists'> msg: <thespian.system.messages.admin.QueryExists object at 0x103703820>)
2023-01-17 12:56:03.730150 p54867 dbg Attempting intent TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999803-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x103703be0>-quit_0:04:59.999792)
2023-01-17 12:56:03.730473 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999474-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x103703be0>-quit_0:04:59.999466)
2023-01-17 12:56:03.733756 p54772 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999929-<class 'thespian.system.messages.admin.PendingActor'>-PendingActor#1_of_None-quit_0:04:59.999915)
2023-01-17 12:56:03.734321 p54867 dbg Admin of ReceiveEnvelope(from: ActorAddr-Q.~1, <class 'thespian.system.messages.admin.PendingActor'> msg: PendingActor#1_of_None)
2023-01-17 12:56:03.734686 p54867 I Pending Actor request received for esrally.racecontrol.BenchmarkActor reqs {'coordinator': True} from ActorAddr-Q.~1
2023-01-17 12:56:03.735352 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ.a-pending-ExpiresIn_0:04:59.999929-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 10, /Users/jbryan/dev/pr-review/rally/esrally/actor.py, 122, "Checking capabilities [{'Thespian ActorSystem Name': 'multipr...-quit_0:04:59.999918)
2023-01-17 12:56:03.735849 p54867 Warn no system has compatible capabilities for Actor esrally.racecontrol.BenchmarkActor
2023-01-17 12:56:03.736326 p54867 dbg Attempting intent TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999822-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999811)
2023-01-17 12:56:03.736646 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999507-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999498)
2023-01-17 12:56:06.743766 p54772 I ActorSystem shutdown requested.
2023-01-17 12:56:06.745121 p54772 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999463-<class 'thespian.system.messages.admin.SystemShutdown'>-<thespian.system.messages.admin.SystemShutdown object at 0x1036f67c0>-quit_0:04:59.999383)
2023-01-17 12:56:06.746577 p54867 dbg Admin of ReceiveEnvelope(from: ActorAddr-Q.~, <class 'thespian.system.messages.admin.SystemShutdown'> msg: <thespian.system.messages.admin.SystemShutdown object at 0x1037152e0>)
2023-01-17 12:56:06.747781 p54867 dbg ---- shutdown initiated by ActorAddr-Q.~
2023-01-17 12:56:06.749025 p54867 dbg Attempting intent TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999569-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3585 actual None-quit_0:04:59.999532)
2023-01-17 12:56:06.749700 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.998863-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3585 actual None-quit_0:04:59.998844)
2023-01-17 12:56:06.750640 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ.a-pending-ExpiresIn_0:04:59.999896-<class 'thespian.system.logdirector.LoggerExitRequest'>-<thespian.system.logdirector.LoggerExitRequest object at 0x103715340>-quit_0:04:59.999879)

I also hit this error during testing when using a patched 3.10.6 with the changes from https://github.com/thespianpy/Thespian/tree/issue70. Interestingly, this error did not surface with 3.10.1 (the version we are currently on) plus the two issue70 commits, which is the same version the changes were validated against. I should have stuck with 3.10.6 and reported the error.
In the above example, you are using the … Perhaps you intended to start a …
There are bug reports about failures when setting TCP_NODELAY under certain conditions on macOS: thespianpy/Thespian#70, envoyproxy/envoy#1446. It is reasonable to simply eat errors coming from setting TCP_NODELAY, because TCP_NODELAY is a relatively small performance optimization. In almost all cases the caller will not be able to do anything to remedy the situation and will simply continue.
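A minimal sketch of this "eat the error" approach, assuming a plain Python socket (the helper name is hypothetical, not any particular library's API):

```python
import contextlib
import socket

def set_nodelay_best_effort(sock: socket.socket) -> None:
    # TCP_NODELAY is a small optimization; if the OS refuses (as macOS
    # can during socket shutdown), there is nothing useful the caller
    # can do, so swallow the OSError and carry on.
    with contextlib.suppress(OSError):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```

Crucially, the suppression is limited to OSError from this one setsockopt call, so unrelated socket errors still propagate.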
As you know, we're using Thespian in Rally, and while we use it on Linux in production, multiple developers use macOS, and we recently realized that one of our integrations was hanging on macOS because it was stuck in the actor system. It currently reproduces 100% of the time for one developer on both ARM and x86, but we don't have a clean reproducer that does not involve Rally: elastic/rally#1575.
What happens is that the code that disables Nagle's algorithm occasionally fails with an OSError: apparently macOS forbids setting TCP_NODELAY during socket shutdown. Various projects have run into this issue, including the Envoy proxy: envoyproxy/envoy#1446.
This was difficult to notice because the error is only logged, not propagated, and it simply caused the actor system to hang.
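The shape of the failure is easy to demonstrate with a plain Python socket: setting TCP_NODELAY on a socket that is no longer usable raises OSError. This sketch triggers the error by closing the socket first, which is not the exact macOS shutdown-time condition (that one reportedly surfaces as EINVAL) but produces the same catchable exception that, if only logged, the caller never sees.

```python
import socket

# Sketch of the error mode: setsockopt(TCP_NODELAY) on a socket that
# is no longer usable raises OSError rather than failing silently.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.close()
failed_errno = None
try:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
except OSError as exc:
    failed_errno = exc.errno
print("TCP_NODELAY failed with errno", failed_errno)
```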