
Fix multiple threading bugs including #699 and #697 (#701)

Closed

Conversation

@lightvector commented Oct 16, 2020

In the process of fixing #697 and #699, I found that the situation is actually much more complicated. There were multiple issues in the old code, and simply applying the logical fix for those two issues alone resulted in test failures and other problems caused by the remaining bugs.

Many of these issues were difficult to untangle individually and required a nontrivial rework of the shutdown logic, so this pull request reworks all of the client's shutdown-related functions to fix those thread-safety issues.

Here is the list of things that were fixed:

  • In 0743d78 it looks like a race condition that causes segfaults was swept under the rug by adding a timeout. This isn't a good fix, because the segfault can still happen even with the timeout; the timeout only makes it less common. The race condition is that close_socket is unsafe to call while other threads are still using the socket, because other code might end up accessing INVALID_SOCKET believing it is a valid socket. And even if that were patched, the OS could theoretically reassign the same socket id, making it a valid but completely different socket corresponding to something else in the application.

    The proper fix is to add a socket_requests_in_flight_ counter, also maintained under socket_mutex_, that tracks whether a request may still be actively using the socket even while we hold the mutex. See the new implementation of ClientImpl::stop(): when this counter is > 0, we now refrain from closing the socket and instead only shut it down (which should be thread-safe) to disable reads and writes on it. We also set another new variable, socket_should_be_closed_when_request_is_done_, so that the thread still running the active request knows to close the socket when it is done.

    So now, if there is an ongoing request, the request thread takes full responsibility for closing the socket later. That way there can be no race: the same thread that makes the request is the one that closes the socket. A minimal sketch of this scheme appears after this list.

  • In #699 ("could you reopen #697?") and #697 ("error cant pass verifier on windows") it was observed that on some OSes, attempting to call SSL_shutdown on a closed socket can cause an error. In general, one should never use a closed socket with any SSL functions.

    The fix is to refactor the shutdown logic into three separate functions, shutdown_ssl, shutdown_socket, and close_socket, because each of them needs to be called in different situations, and we now never call shutdown_ssl after calling close_socket.

    It might be a bit confusing to need three functions here, but I don't know of a good way to do it with fewer. The general order of calling them should be shutdown_ssl, then shutdown_socket, then close_socket, but it is also necessary to call the second one alone (as in the first bullet point above), and shutdown_ssl needs to be virtual. The second sketch after this list illustrates this ordering.

  • There was a memory leak and an SSL resource leak in the ~ClientImpl destructor, because it used to simply call ClientImpl::close_socket, which set socket_.ssl to nullptr without freeing it. The subtle cause of this bug is that by the time ~ClientImpl runs, virtual functions no longer dispatch to SSLClient's overrides, and only SSLClient's versions of those functions clean up socket_.ssl properly. The proper fix, which is implemented now, is to make ~SSLClient explicitly clean up socket_.ssl first.

  • error_ was accessed from multiple threads without synchronization. Strictly speaking this is undefined behavior in C++ unless the value is atomic, so it is now atomic.
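
To make the first bullet point concrete, here is a minimal, self-contained sketch of the scheme (assuming POSIX sockets). The member names socket_, socket_mutex_, socket_requests_in_flight_, and socket_should_be_closed_when_request_is_done_ are the ones discussed above; the class layout, the do_request helper, and the shutdown_socket/close_socket bodies are simplified placeholders rather than the library's actual implementation.

```cpp
#include <mutex>
#include <sys/socket.h>
#include <unistd.h>

struct Socket {
  int sock = -1;  // in the real client this also holds SSL state, etc.
};

class ClientImpl {
public:
  void stop() {
    std::lock_guard<std::mutex> guard(socket_mutex_);
    if (socket_requests_in_flight_ > 0) {
      // A request thread is still using the socket: only disable further
      // reads/writes (thread-safe) and ask that thread to close it later.
      shutdown_socket(socket_);
      socket_should_be_closed_when_request_is_done_ = true;
      return;
    }
    // No request in flight, so shutting down and closing here is safe.
    shutdown_socket(socket_);
    close_socket(socket_);
  }

protected:
  void do_request() {
    {
      std::lock_guard<std::mutex> guard(socket_mutex_);
      ++socket_requests_in_flight_;
    }
    // ... use socket_.sock for the actual request, without the mutex held ...
    {
      std::lock_guard<std::mutex> guard(socket_mutex_);
      --socket_requests_in_flight_;
      if (socket_requests_in_flight_ == 0 &&
          socket_should_be_closed_when_request_is_done_) {
        // The request thread itself closes the socket, so there is no race
        // between stop() and an in-flight request.
        shutdown_socket(socket_);
        close_socket(socket_);
        socket_should_be_closed_when_request_is_done_ = false;
      }
    }
  }

  void shutdown_socket(Socket &s) {
    if (s.sock >= 0) ::shutdown(s.sock, SHUT_RDWR);
  }
  void close_socket(Socket &s) {
    if (s.sock >= 0) { ::close(s.sock); s.sock = -1; }
  }

  Socket socket_;
  std::mutex socket_mutex_;
  int socket_requests_in_flight_ = 0;
  bool socket_should_be_closed_when_request_is_done_ = false;
};
```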
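
And here is a second, independent sketch (assuming POSIX sockets and OpenSSL) of the shutdown_ssl, then shutdown_socket, then close_socket ordering and the destructor cleanup from the second and third bullet points. Again, the names follow the description above, but the class bodies are illustrative only, not the library's code.

```cpp
#include <mutex>
#include <openssl/ssl.h>
#include <sys/socket.h>
#include <unistd.h>

struct Socket {
  int sock = -1;
  SSL *ssl = nullptr;
};

class ClientImpl {
public:
  virtual ~ClientImpl() {
    // By the time ~ClientImpl runs, virtual calls no longer reach SSLClient's
    // overrides, so only the plain socket can be handled at this level.
    shutdown_socket(socket_);
    close_socket(socket_);
  }

protected:
  // Full teardown always happens in this order:
  //   shutdown_ssl() -> shutdown_socket() -> close_socket()
  // and shutdown_ssl() is never called after close_socket().
  virtual void shutdown_ssl(Socket & /*socket*/, bool /*shutdown_gracefully*/) {
    // No-op for the plain (non-SSL) client.
  }
  void shutdown_socket(Socket &socket) {
    if (socket.sock >= 0) ::shutdown(socket.sock, SHUT_RDWR);
  }
  void close_socket(Socket &socket) {
    if (socket.sock >= 0) { ::close(socket.sock); socket.sock = -1; }
  }

  Socket socket_;
};

class SSLClient : public ClientImpl {
public:
  ~SSLClient() override {
    // Clean up socket_.ssl here, while SSLClient's override is still
    // reachable; simply nulling the pointer (as the old close_socket did)
    // would leak the SSL object.
    shutdown_ssl(socket_, /*shutdown_gracefully=*/true);
  }

protected:
  void shutdown_ssl(Socket &socket, bool shutdown_gracefully) override {
    if (socket.ssl) {
      if (shutdown_gracefully) { SSL_shutdown(socket.ssl); }
      SSL_free(socket.ssl);
      socket.ssl = nullptr;
    }
  }
};
```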

All the tests (mostly) pass for me, at least, and the code seems to work properly in my own application, so I think the state of things is good, but of course the logic is still a bit tricky. Let me know what you think.

@lightvector commented Oct 16, 2020

Some followup notes:

To my knowledge, there is STILL a minor bug/race in the code: ClientImpl::stop sets error_ = Error::Canceled, but even with error_ being atomic there is no proper synchronization and no guarantee that this error value will be seen by the thread running the request. Indeed, if you revert 0cc108d and go back to 100 threads, the ClientStop test fails for me a small fraction of the time due to this bug. However, the way the code is currently written, fixing this bug is very hard.

Some of the synchronization logic that this pull request modified is very fragile, so it is no surprise that there were bugs before. But I think that is very hard to avoid given the way the code is currently written: a major refactor would be needed to get the code into a state where thread-safety is easy and natural instead of tricky.

In particular, a significant amount of the complexity in the thread-safety reasoning comes from the fact that request_mutex_ is recursive. Recursive mutexes are generally very hard to reason about. It took me 3-4 attempts and multiple hours to do this refactor in a way that I could argue is thread-safe, largely because of that recursive mutex. As far as I can tell, there isn't a good reason for this mutex to be recursive: the only reason right now is that redirect handling recursively calls send from inside send's own call. Am I missing another purpose of the recursion?

From a thread-safety and code-correctness perspective, it would be much cleaner and simpler if process_request returned the information that a redirect is necessary, so that send itself could loop and try the redirected URL instead of recursing. The recursion is unnecessarily complicated; a hypothetical sketch of the loop-based approach is below.
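
For illustration, here is a hypothetical sketch of that loop-based approach. This is not the library's current code: the Outcome and RequestState types and the exact signatures of send and process_request are invented here purely to show how the recursion could become a loop under a plain, non-recursive mutex.

```cpp
#include <mutex>
#include <string>

enum class Outcome { Done, Redirect, Error };

struct RequestState {
  std::string url;
  int redirects_left = 10;
};

static std::mutex request_mutex_;  // plain, non-recursive mutex

// Stub: in a real client this would perform the request and, on a 3xx
// response, set state.url to the Location header and return Redirect.
static Outcome process_request(RequestState &state) {
  (void)state;
  return Outcome::Done;
}

static bool send(RequestState state) {
  std::lock_guard<std::mutex> guard(request_mutex_);
  for (;;) {
    auto outcome = process_request(state);
    if (outcome != Outcome::Redirect) return outcome == Outcome::Done;
    if (state.redirects_left-- <= 0) return false;
    // Try the redirected URL by looping, instead of re-entering send()
    // recursively from inside process_request().
  }
}
```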

The only other thing I can see that making the mutex non-recursive would lose is the ability for the user to call something like Get with a ContentReceiver and start another query from inside the content receiver function. In theory this could work on the very last call to the ContentReceiver, when the message is complete... but this seems rather crazy, and hopefully no user of this library is doing that.

Feel free to let me know if I'm missing something; otherwise, a big TODO item for the future would be to get rid of the recursiveness of the mutex. It would be a much bigger refactor than this pull request, but it would make the thread-safety of the code far easier to maintain. It would also be one step towards making error_ = Error::Canceled properly synchronized, which right now is still a slight bug.

@lightvector commented

Apparently the inline checks here passed, but oss-fuzz got a test failure for this pull request: https://github.com/lightvector/cpp-httplib/runs/1266643298

[ RUN      ] ServerTest.SlowPostFail
test.cc:2088: Failure
Value of: !res
  Actual: false
Expected: true
[  FAILED  ] ServerTest.SlowPostFail (2186 ms)

But then I reran it and it worked: https://github.com/lightvector/cpp-httplib/actions/runs/311486216

And it's not obvious to me how ServerTest.SlowPostFail relates to this pull request, since it doesn't interact with the socket shutdown/close refactoring any more than the other tests do. At the least, this test does seem to be nondeterministic somehow.

@yhirose commented Oct 17, 2020

@lightvector, thank you very much for the pull request. Since it's a big change, it will definitely take me a while to fully absorb all the details. If it handles more than one thing, would it be possible to split it into pieces? Also, if you can, could you provide corresponding unit tests as well? Thanks!

@yhirose force-pushed the master branch 2 times, most recently from 75017ff to 4bb0013 (October 20, 2020)
@lightvector commented

I haven't been able to get to this. I'm still using my version successfully in my own project, but if you need me to break this up further so that it's easier for you to review, it might be a little while.

@yhirose commented Nov 29, 2020

@lightvector, I got some free time this weekend and was able to look into the pull request. I now understand what it's doing, and it looks very good! I'll merge it soon. I think it will also fix #771.

@yhirose commented Nov 29, 2020

I have merged it, after manually fixing the merge conflict in 02d3cd5. Thanks for your great contribution!

@yhirose closed this Nov 29, 2020
@lightvector commented

@yhirose - cool! Even though I didn't get around to working on this further myself, I'm glad to have been of help. :)
