Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Many threads waiting around ERR_clear_error #2084

Closed
smithdh opened this issue Sep 12, 2023 · 9 comments · Fixed by #2085
Closed

Many threads waiting around ERR_clear_error #2084

smithdh opened this issue Sep 12, 2023 · 9 comments · Fixed by #2085
Assignees
Milestone

Comments

@smithdh
Copy link
Contributor

smithdh commented Sep 12, 2023

Hi,

On some busy EOS instances (xrootd severs ~v5.5.5, on centos 7) it has sometimes been seen that many, e.g. 1800 threads, can be waiting inside ERR_clear_error at any moment. The process was not deadlocked, it is assumed there is just too much contention and/or lock held for too long. For example:

TID 234963:
#0  0x00007f32ed9b554d __lll_lock_wait
#1  0x00007f32ed9b0e9b _L_lock_883
#2  0x00007f32ed9b0d68 pthread_mutex_lock
#3  0x00007f32ed02cf83 int_thread_get
#4  0x00007f32ed02d48d int_thread_get_item
#5  0x00007f32ed02de20 ERR_get_state
#6  0x00007f32ed02e065 ERR_clear_error
#7  0x00007f32e35af3f1 XrdCl::XRootDTransport::DoAuthentication(XrdCl::HandShakeData*, XrdCl::XRootDChannelInfo*)
#8  0x00007f32e35b2c0b XrdCl::XRootDTransport::HandShakeMain(XrdCl::HandShakeData*, XrdCl::AnyObject&)
#9  0x00007f32e35b2ec3 XrdCl::XRootDTransport::HandShake(XrdCl::HandShakeData*, XrdCl::AnyObject&)
#10 0x00007f32e36244fa XrdCl::AsyncSocketHandler::HandleHandShake(std::unique_ptr<XrdCl::Message, std::default_delete<XrdCl::Message> >)
#11 0x00007f32e3624a91 XrdCl::AsyncSocketHandler::OnReadWhileHandshaking()
#12 0x00007f32e3625df5 XrdCl::AsyncSocketHandler::Event(unsigned char, XrdCl::Socket*)
#13 0x00007f32e359aba8 (anonymous namespace)::SocketCallBack::Event(XrdSys::IOEvents::Channel*, void*, int)
#14 0x00007f32ee843bbb XrdSys::IOEvents::Poller::CbkXeq(XrdSys::IOEvents::Channel*, int, int, char const*)
#15 0x00007f32ee844ee3 XrdSys::IOEvents::PollE::Dispatch(XrdSys::IOEvents::Channel*, unsigned int)
#16 0x00007f32ee8450c0 XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**)
#17 0x00007f32ee84179d XrdSys::IOEvents::BootStrap::Start(void*)
#18 0x00007f32ee84a207 XrdSysThread_Xeq
#19 0x00007f32ed9aeea5 start_thread
#20 0x00007f32ed6d7b0d __clone 

(trace mentions XrdCl, but this was in a server process)

I've done some investigation of this and will open a PR "in draft" now with an idea; but I realise we have previously had some issues and work done around openssl error handling, e.g. effort to clean the error queues after use. So I expect there might be some discussion to be had here.

I saw two approaches: (a) reducing or moving calls to ERR_clear_error or (b) making ERR_clear_error faster. I had some indication that ERR_clear_error() is taking longer than necessary due to some problems which are specific to openssl 1.0.x, so I've followed (b) in the initial PR I have.

The problems (concerning ERR_clear_error() performance) I had in mind are: (i) Poor use by libcrypto of its internal hash-table that it uses to locate thread-specific error state, due to limited variation of the low bits of the keys. And (ii) accumulation of these thread-specific states: They can be removed with ERR_remove_thread_state(0) [removing is different from clearing], which XrdHttp does do in some situations, but there are still cases and other openssl uses where threads that have gained this error state do not clear it up before terminating. If new threads are created with not-previously-used thread-ids there is an accumulation of the libcrypto error states.

The PR draft I'm opening limits to addressing (i) for a start, leaving (ii) as a (apparently small) memory leak. I thought for now this might be enough for initial discussion, and leave that as a second issue or PR.

@abh3
Copy link
Member

abh3 commented Sep 12, 2023

Just to clarify here. This problem is actually present in the client package not the server package. That is, you do not see contention in actual server code, right? If so, what is the client actually doing here that seems independent of whatever the server is doing.

@smithdh
Copy link
Contributor Author

smithdh commented Sep 12, 2023

Thanks for looking; I think the problem is spread out, not specific to client: The above stacktrace was the one quoted on our internal EOS bug report. I was just looking at (another) one of the trace logs left on the machine. In this out of 2489 threads, 1806 mention int_thread_get_item. e.g.

TID 101714:
#0  0x00007f32ed9b554d __lll_lock_wait
#1  0x00007f32ed9b0e9b _L_lock_883
#2  0x00007f32ed9b0d68 pthread_mutex_lock
#3  0x00007f32ed02cf83 int_thread_get
#4  0x00007f32ed02d48d int_thread_get_item
#5  0x00007f32ed02de20 ERR_get_state
#6  0x00007f32ed02e065 ERR_clear_error
#7  0x00007f32ed3867e5 ssl3_accept
#8  0x00007f32ed396598 ssl23_accept
#9  0x00007f32e63dbcc1 XrdHttpProtocol::Process(XrdLink*)
#10 0x00007f32ee8b6ead XrdLinkXeq::DoIt()
#11 0x00007f32ee8b3832 XrdLink::setProtocol(XrdProtocol*, bool, bool)
#12 0x00007f32ee8ba217 XrdScheduler::Run()
#13 0x00007f32ee8ba339 XrdStartWorking(void*)
#14 0x00007f32ee84a207 XrdSysThread_Xeq
#15 0x00007f32ed9aeea5 start_thread
#16 0x00007f32ed6d7b0d __clone

or

TID 130953:
#0  0x00007f32ed9b554d __lll_lock_wait
#1  0x00007f32ed9b0e9b _L_lock_883
#2  0x00007f32ed9b0d68 pthread_mutex_lock
#3  0x00007f32ed02cf83 int_thread_get
#4  0x00007f32ed02d48d int_thread_get_item
#5  0x00007f32ed02de20 ERR_get_state
#6  0x00007f32ed02e065 ERR_clear_error
#7  0x00007f32ed3af4ac ssl_add_cert_chain
#8  0x00007f32ed394125 ssl3_output_cert_chain
#9  0x00007f32ed385fad ssl3_send_server_certificate
#10 0x00007f32ed386bc5 ssl3_accept
#11 0x00007f32ed396598 ssl23_accept
#12 0x00007f32e63dbcc1 XrdHttpProtocol::Process(XrdLink*)
#13 0x00007f32ee8b6ead XrdLinkXeq::DoIt()
#14 0x00007f32ee8b3832 XrdLink::setProtocol(XrdProtocol*, bool, bool)
#15 0x00007f32ee8ba217 XrdScheduler::Run()
#16 0x00007f32ee8ba339 XrdStartWorking(void*)
#17 0x00007f32ee84a207 XrdSysThread_Xeq
#18 0x00007f32ed9aeea5 start_thread
#19 0x00007f32ed6d7b0d __clone

or

TID 136917:
#0  0x00007f32ed9b554d __lll_lock_wait
#1  0x00007f32ed9b0e9b _L_lock_883
#2  0x00007f32ed9b0d68 pthread_mutex_lock
#3  0x00007f32ed02cf83 int_thread_get
#4  0x00007f32ed02d48d int_thread_get_item
#5  0x00007f32ed02de20 ERR_get_state
#6  0x00007f32ed02df5e ERR_put_error
#7  0x00007f32ed030fbd EVP_CIPHER_CTX_set_key_length
#8  0x00007f32e4f3c51f XrdCryptosslCipher::Finalize(bool, char*, int, char const*)
#9  0x00007f32e5162558 XrdSecProtocolgsi::ServerDoCert(XrdSutBuffer*, XrdSutBuffer**, XrdOucString&)
#10 0x00007f32e5171edd XrdSecProtocolgsi::ParseServerInput(XrdSutBuffer*, XrdSutBuffer**, XrdOucString&)
#11 0x00007f32e51722b2 XrdSecProtocolgsi::Authenticate(XrdSecBuffer*, XrdSecBuffer**, XrdOucErrInfo*)
#12 0x00007f32eeb85c04 XrdXrootdProtocol::do_Auth()
#13 0x00007f32eeb78473 XrdXrootdProtocol::Process2()
#14 0x00007f32ee8b6ead XrdLinkXeq::DoIt()
#15 0x00007f32ee8ba217 XrdScheduler::Run()
#16 0x00007f32ee8ba339 XrdStartWorking(void*)
#17 0x00007f32ee84a207 XrdSysThread_Xeq
#18 0x00007f32ed9aeea5 start_thread
#19 0x00007f32ed6d7b0d __clone

and in this snapshot it seems this was the thread that was just releasing the lock:

TID 210551:
#0  0x00007f32ed9b55ea __lll_unlock_wake
#1  0x00007f32ed9b1f9e _L_unlock_738
#2  0x00007f32ed9b1f10 __pthread_mutex_unlock
#3  0x00007f32ed02d4df int_thread_get_item
#4  0x00007f32ed02de20 ERR_get_state
#5  0x00007f32ed02e125 get_error_values
#6  0x00007f32ed02eb6b ERR_print_errors_cb
#7  0x00007f32e63d82cb XrdHttpProtocol::getDataOneShot(int, bool)
#8  0x00007f32e63dbe25 XrdHttpProtocol::Process(XrdLink*)
#9  0x00007f32ee8b6ead XrdLinkXeq::DoIt()
#10 0x00007f32ee8ba217 XrdScheduler::Run()
#11 0x00007f32ee8ba339 XrdStartWorking(void*)
#12 0x00007f32ee84a207 XrdSysThread_Xeq
#13 0x00007f32ed9aeea5 start_thread
#14 0x00007f32ed6d7b0d __clone

(I can get us complete traces too, if you'd like a more complete picture)

@abh3
Copy link
Member

abh3 commented Sep 12, 2023

OK, so this is a general issue. Now, we never clear So, the problem may be that we never call ERR_remove_thread_state(0) when a thread terminates. The assumption was that OpenSSL would do that. But now, even if it does do that no one is going to remove the side table entry that CryptoLite has, correct? If so, that hash table would grow without bound and become progressively slow even with the fix you propose.

@smithdh
Copy link
Contributor Author

smithdh commented Sep 12, 2023

I think it's a single hash table; the complication I saw with CryptoLite was that it was another place we used openssl, and in the patch I wanted to make sure it did the initialisation that setup sslTLS_id_callback for openssl. (as XrdTls and XrdCryptosslFactory already do). Once ssl starts to do something that needs that function, and a user callback hasn't been set, it puts its own default and you can't change it again.

Assuming the entries are well distributed in the hash-table, I believe it's usual working should keep the access time from growing even if the number of entries grows. (it should maintain a load factor, enlarging the number of buckets and redistributing entries as it grows). But that does leave it potentially growing in size; but I thought that since it does that now we could try to solve it incrementally.

A complication I am unsure about was to how to reliably call ERR_remove_thread_state for all required threads (e.g. all threads that have used openssl and may have setup an error state). I though about ways it could be added at the end of XrdSysThread_Xeq, but this would miss some threads in the XrdCl client, which doesn't use XrdSysPthread to create threads.

Perhaps this could work (and it is quite localised):

 unsigned long sslTLS_id_callback(void)
 {
  struct thid_tls_s {
    thid_tls_s() {
	// base thread-id on the id given by XrdSysThread;
	// but openssl uses thread-id*13 as a key for looking
	// up per thread crypto ERR structures in a hash table.
	// So mix bits to give better use of internal hash.
	// Mixing function based on the splitmix64 PRNG.
	uint64_t x = (uint64_t)XrdSysThread::ID();
	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	tid_ = (unsigned long)x;
     }
    ~thid_tls_s() {
	// this thread may be using the crypto ERR_ functions,
	// make sure thread state is removed when thread exits.
	ERR_remove_thread_state(nullptr);
     }
     unsigned long tid_;
   };
   // using a pointer for the common case of the per thread structure
   // already having been initialised was seen to slightly reduce
   // number of calls to __tls_get_addr
   thread_local thid_tls_s *thidp = 0;
   if (!thidp) {
     thread_local thid_tls_s store;
     thidp = &store;
   }

   return thidp->tid_;
 }

but I still have some uncertainly about the performance of using thread_local, so I didn't put it in the initial PR.

@abh3
Copy link
Member

abh3 commented Sep 12, 2023

Good point! I guess it's who you believe. This post and article on thread local implementation says they are just as fast as any other variable.
https://stackoverflow.com/questions/2459692/linuxs-thread-local-storage-implementation
The document itself
https://www.akkadia.org/drepper/tls.pdf
explains the rather complicated implementation and the end result does indeed support that thread local storage is not a CPU drainer nor does it needlessly waste memory.

@abh3
Copy link
Member

abh3 commented Sep 12, 2023

Oh yes, I'm OK with an incremental approach. At least we can actually measure what, if any, improvement we get. So, let me know when you are happy with the pull request.

Also, do you agree that we can safely remove use of gcc atomics in favor of std::atomics whenever we find them and it's easy to remove such support? That way we can actually use the best data type for what is needed.

@smithdh
Copy link
Contributor Author

smithdh commented Sep 13, 2023

Seems fine to gradually remove the gcc atomics and replace with std::atomics (via our XrdSys::RAtomic, if memory_order_relaxed is ok).

I didn't get chance to update the PR today; will try tomorrow.

@smithdh
Copy link
Contributor Author

smithdh commented Sep 14, 2023

hello @abh3 I've tidied away the old atomics (in the small area I was changing); and tried to be more careful defining the mixing function, in case somehow we're building with a 32bit long (I don't think we have anything where this can be true, but still). I moved the PR out of draft.

@amadio
Copy link
Member

amadio commented Sep 15, 2023

Fixed by #2085.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants