Many threads waiting around ERR_clear_error #2084
Just to clarify here. This problem is actually present in the client package, not the server package. That is, you do not see contention in actual server code, right? If so, what is the client actually doing here that seems independent of whatever the server is doing?
Thanks for looking; I think the problem is spread out, not specific to the client. The above stacktrace was the one quoted on our internal EOS bug report. I was just looking at (another) one of the trace logs left on the machine. In this one, 1806 out of 2489 threads mention int_thread_get_item, e.g.
or
or
and in this snapshot it seems this was the thread that was just releasing the lock:
(I can get us complete traces too, if you'd like a more complete picture)
OK, so this is a general issue. Now, we never clear the per-thread error state. So, the problem may be that we never call ERR_remove_thread_state(0) when a thread terminates. The assumption was that OpenSSL would do that. But even if it does, no one is going to remove the side-table entry that CryptoLite has, correct? If so, that hash table would grow without bound and become progressively slower even with the fix you propose.
I think it's a single hash table; the complication I saw with CryptoLite was that it is another place we use openssl, and in the patch I wanted to make sure it did the initialisation that sets up the ssl TLS id callback for openssl (as XrdTls and XrdCryptosslFactory already do). Once openssl starts to do something that needs that function and a user callback hasn't been set, it installs its own default and you can't change it again.

Assuming the entries are well distributed in the hash table, I believe its usual working should keep the access time from growing even as the number of entries grows (it should maintain a load factor, enlarging the number of buckets and redistributing entries as it grows). That does leave it potentially growing in size; but since it does that now, I thought we could try to solve it incrementally.

A complication I am unsure about is how to reliably call ERR_remove_thread_state for all required threads (i.e. all threads that have used openssl and may have set up an error state). I thought about ways it could be added at the end of XrdSysThread_Xeq, but this would miss some threads in the XrdCl client, which doesn't use XrdSysPthread to create threads. Perhaps this could work (and it is quite localised):
but I still have some uncertainty about the performance of using thread_local, so I didn't put it in the initial PR.
Good point! I guess it's who you believe. This post and article on thread-local implementation say they are just as fast as any other variable.
Oh yes, I'm OK with an incremental approach. At least we can actually measure what, if any, improvement we get. So, let me know when you are happy with the pull request. Also, do you agree that we can safely remove use of gcc atomics in favor of std::atomics whenever we find them and it's easy to remove such support? That way we can actually use the best data type for what is needed.
Seems fine to gradually remove the gcc atomics and replace them with std::atomics (via our XrdSys::RAtomic, if memory_order_relaxed is OK). I didn't get a chance to update the PR today; will try tomorrow.
Hello @abh3, I've tidied away the old atomics (in the small area I was changing) and tried to be more careful defining the mixing function, in case somehow we're building with a 32-bit long (I don't think we have anything where this can be true, but still). I moved the PR out of draft.
Fixed by #2085. |
Hi,
On some busy EOS instances (xrootd servers ~v5.5.5, on CentOS 7) it has sometimes been seen that many threads, e.g. 1800, can be waiting inside ERR_clear_error at any moment. The process was not deadlocked; it is assumed there is just too much contention and/or the lock is held for too long. For example:
(trace mentions XrdCl, but this was in a server process)
I've done some investigation of this and will open a PR "in draft" now with an idea; but I realise we have previously had some issues and work around openssl error handling, e.g. the effort to clean the error queues after use. So I expect there might be some discussion to be had here.
I saw two approaches: (a) reducing or moving calls to ERR_clear_error or (b) making ERR_clear_error faster. I had some indication that ERR_clear_error() is taking longer than necessary due to some problems which are specific to openssl 1.0.x, so I've followed (b) in the initial PR I have.
The problems (concerning ERR_clear_error() performance) I had in mind are: (i) poor use by libcrypto of the internal hash table it uses to locate thread-specific error state, due to limited variation in the low bits of the keys; and (ii) accumulation of these thread-specific states. They can be removed with ERR_remove_thread_state(0) (removing is different from clearing), which XrdHttp does in some situations, but there are still cases, and other openssl uses, where threads that have gained this error state do not clean it up before terminating. If new threads are created with not-previously-used thread ids, the libcrypto error states accumulate.
The draft PR I'm opening is limited to addressing (i) for a start, leaving (ii) as an (apparently small) memory leak. I thought this might be enough for initial discussion, and (ii) could become a second issue or PR.