Fix multiple threading bugs including #699 and #697 #701
Conversation
Some follow-up notes: To my knowledge, there is STILL a minor bug/race in the code, where […]. Some of the synchronization logic that this pull request has modified is very fragile, so it is no surprise that there were some bugs before. But I think that is very hard to avoid given the way the code is currently written; a major refactor would be needed to get the code into a state where thread-safety is easy and natural, instead of tricky. In particular, a significant amount of the complexity in the thread-safety reasoning comes from the fact that […]. From a thread-safety and code-correctness perspective, it would be much cleaner and simpler if […].

The only other thing I can tell that making the mutex non-recursive would lose is the ability for the user to call something like Get with a ContentReceiver and, inside the content receiver function, start another query. In theory this could work on the very last call to the ContentReceiver when the message is complete... but this seems kind of crazy, and hopefully no user of this library is doing that. Feel free to let me know if I'm missing something; otherwise, a big TODO item for the future would be to get rid of the recursiveness of the mutex. It would be a much bigger refactor than just this pull request, but it would make the thread-safety of the code far easier to maintain. And it would also be one step towards making […].
Apparently the inline checks here passed, but oss-fuzz got a test failure for this pull request: https://github.com/lightvector/cpp-httplib/runs/1266643298

But then I reran it and it worked: https://github.com/lightvector/cpp-httplib/actions/runs/311486216 And it's not obvious to me how ServerTest.SlowPostFail is related to this pull request directly, since it doesn't interact with the socket shutdown/close refactors in this pull request much more than the other tests do. At the least, this test does seem to be nondeterministic somehow.
@lightvector, thank you very much for the pull request. Since it's a big change, it will definitely take a while for me to fully absorb all the details. If it handles more than one thing, would it be possible to split it into pieces? Also, if you could, would you provide corresponding unit tests as well? Thanks!
Force-pushed from 75017ff to 4bb0013
I haven't been able to get to this. I've still been using my version successfully in my own project, but if you need me to break this up further to make it easier to review, it might be a little while.
@lightvector, I got some free time this weekend and was able to look into the pull request. I now understand what it's doing, and it looks very good! I'll merge it soon. I think it will fix #771 as well.
I have merged it after manually fixing the merge conflict, at 02d3cd5. Thanks for your great contribution!
@yhirose - cool! Even though I didn't get around to working on this further myself, I'm glad to have been of help. :)
In the process of fixing #697 and #699, I found that the situation is actually much more complicated. There were multiple issues with the old code, and simply applying the logical fix for those issues alone resulted in test failures and other problems due to the other bugs.

Many of these issues were difficult to untangle individually and required a nontrivial rework of the shutdown functions, so this pull request reworks all of the shutdown-related functions for the client to fix those thread-safety issues.
Here is the list of things that were fixed:
- In 0743d78 it looks like a race condition that causes segfaults was swept under the rug by adding a timeout. This isn't a good fix, because the segfault can still happen even with the timeout; the timeout only makes it less common. The race condition is that `close_socket` is unsafe to call while other threads are still using the socket, because other code might end up accessing `INVALID_SOCKET` believing it is a valid socket. And even if that were patched, the OS could theoretically reassign the same socket id, making it a valid but completely different socket corresponding to anything else in the application. The proper fix is to add a `socket_requests_in_flight_` variable, also maintained under `socket_mutex_`, that tracks when a request might be actively using the socket despite holding the mutex. See the new implementation of `inline void ClientImpl::stop()`: when this variable is > 0, we now refrain from closing the socket and instead only shut down the socket (which should be thread-safe) to disable reads/writes on it. We then set another new variable, `socket_should_be_closed_when_request_is_done_`, so that the thread still involved in the active request knows to close the socket when it is done. So now, if there is any ongoing request, the request thread takes full responsibility for closing the socket later. That way there can be no race: the same thread making the requests is the one closing the socket.
- In #699 ("could you reopen #697?") and #697 ("error cant pass verifier on windows") it was observed that on some OSes, attempting to SSL_shutdown on a closed socket can cause an error. In general, one should never use a closed socket with any SSL functions. The fix is to refactor the shutdown logic into three separate functions, `shutdown_ssl`, `shutdown_socket`, and `close_socket`, because each of them needs to be called in different situations, and we now never call `shutdown_ssl` after calling `close_socket`. It might be a bit confusing to need three functions here, but I don't know of a good way to do it with fewer than three. The general order of calling them should be `shutdown_ssl`, then `shutdown_socket`, then `close_socket`, but it is also necessary to call the second one alone (as in the first bullet point above), and `shutdown_ssl` needs to be virtual.
- There was a memory and an SSL resource leak in the `~ClientImpl` destructor, because it used to simply call `ClientImpl::close_socket`, which sets `socket_.ssl` to nullptr without freeing it. The tricky, subtle cause of this bug is that by the time `~ClientImpl` runs, virtual functions no longer dispatch to SSLClient's overrides, and only SSLClient's versions of those functions clean up `socket_.ssl` properly. The proper fix, which is implemented now, is to explicitly make sure `~SSLClient` cleans up `socket_.ssl` first.
- `error_`
was accessed from multiple threads without synchronization. Strictly speaking, this is undefined behavior in C++ unless the value is made atomic, so it is now atomic.

All the tests (mostly) pass, for me at least, and the code seems to work properly in my own personal application, so I think the state of things is good, but of course the logic is still a bit tricky. Let me know what you think.