
Unclear behaviour of XRD_PARALLELEVTLOOP and XRD_WORKERTHREADS #1495

Closed
vepadulano opened this issue Aug 16, 2021 · 4 comments

@vepadulano

There has been some interest in investigating the behaviour of the XRD_PARALLELEVTLOOP variable for ROOT workflows. I would like to report some simple tests here and ask for clarification. All tests perform a simple xrdcp with variations of the two variables mentioned in the title.

XRD_PARALLELEVTLOOP=4

In theory this should use 4 threads, but there are 10 instead:

$ XRD_PARALLELEVTLOOP=4 xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
[784MB/2.09GB][ 36%][==================>                               ][11.04MB/s]
$ ps aux | grep xrdcp
vpadulan    2875 14.5  0.4 698364 77920 pts/0    Sl+  12:15   0:03 xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
$ ps hH p 2875 | wc -l
10

XRD_PARALLELEVTLOOP=1

This should use 1 thread, but I see 7:

$ XRD_PARALLELEVTLOOP=1 xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
[184MB/2.09GB][  8%][====>                                             ][10.82MB/s]
$ ps aux | grep xrdcp
vpadulan    3000 20.0  0.2 608092 46488 pts/0    Sl+  12:18   0:00 xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
$ ps hH p 3000 | wc -l
7

XRD_WORKERTHREADS=1 XRD_PARALLELEVTLOOP=1

I have found another environment variable, XRD_WORKERTHREADS, in the xrootd docs (https://xrootd.slac.stanford.edu/doc/xrdcl-docs/xrdcldocs.pdf), described as "Number of threads processing user callbacks", with a default value of 3. Setting both variables to 1 leads to 5 threads:

$ XRD_WORKERTHREADS=1 XRD_PARALLELEVTLOOP=1 xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
[192MB/2.09GB][  8%][====>                                             ][10.67MB/s]
$ ps aux | grep xrdcp
vpadulan    3036 17.3  0.2 460628 48240 pts/0    Sl+  12:21   0:00 xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
$ ps hH p 3036 | wc -l
5

So far:

  1. Setting XRD_PARALLELEVTLOOP=1 makes the xrdcp process use 7 threads, of which 3 are accounted for by the default value of XRD_WORKERTHREADS and 1 is the event loop, but I still can't explain the remaining 3 threads.
  2. The two variables seem to be independently adding threads to the xrdcp process when they are increased.

To conclude, I would like to understand where those extra 3 threads could be coming from, or get any further insight from you. Thanks a lot!

@simonmichal
Contributor

simonmichal commented Aug 23, 2021

@vepadulano : thanks for reporting the outcome of your tests! Could you just tell me which version of xrdcp you were using?

To shed some light on your results:

  • XRD_PARALLELEVTLOOP is by default set to 1, and is the number of event-loop threads handling the async I/O; in some cases, e.g. if the xrootd client is interacting with many servers (as it does in the case of XCache), a single event loop can become CPU bound, and in those scenarios it makes sense to use multiple event loops

  • XRD_WORKERTHREADS is by default set to 3, and is the number of threads in the thread-pool used to call user completion handlers

> The two variables seem to be independently adding threads to the xrdcp process when they are increased.

Your observation is correct, those two variables are responsible for totally independent pools of threads. In addition:

  • there is also the TaskManager thread, which runs various timers and is, among other things, responsible for request timeouts

  • in the case of xrdcp there is also the main execution thread

To summarise, in the case of xrdcp, if one sets XRD_PARALLELEVTLOOP=1 and XRD_WORKERTHREADS=1 there should be 4 threads, which is what I observe in my xrdcp process:

(gdb) info threads
  Id   Target Id         Frame 
  4    Thread 0x7ffff1224700 (LWP 17480) "xrdcp" 0x00007ffff644bb3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
  3    Thread 0x7ffff1a25700 (LWP 17479) "xrdcp" 0x00007ffff69368ed in nanosleep () from /lib64/libc.so.6
  2    Thread 0x7ffff2226700 (LWP 17478) "xrdcp" 0x00007ffff696ffd3 in epoll_wait () from /lib64/libc.so.6
* 1    Thread 0x7ffff7fb3900 (LWP 17465) "xrdcp" XrdCl::ClassicCopyJob::Run (this=0x66e960, progress=0x7fffffffca50) at /home/simonm/git/xrootd/src/XrdCl/XrdClClassicCopyJob.cc:2400
[simonm@idefix ~]$ ps -elf | grep xrdcp
0 S simonm   17453  1021  0  80   0 - 86484 poll_s 13:58 pts/0    00:00:01 gdb ./xrdcp
0 t simonm   17465 17453  0  80   0 - 62609 ptrace 13:59 pts/0    00:00:00 /home/simonm/git/xrootd/build/src/XrdCl/./xrdcp -f Makefile root://slc7-test//tmp
[simonm@idefix ~]$ ps hH p 17465 | wc -l
4

Now I'm not sure where the one additional thread you are seeing is coming from; that's why I asked about the version you are using.
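For reference, the same knobs can also be set programmatically through the XrdCl environment, and the worker-thread pool described above is what invokes user completion handlers for async calls. Below is a minimal sketch assuming the public XrdCl C++ API; the environment keys "WorkerThreads" and "ParallelEvtLoop" mirror the XRD_WORKERTHREADS and XRD_PARALLELEVTLOOP variables, and the URL is only a placeholder:

// Minimal sketch (assumptions: public XrdCl C++ API, placeholder URL).
#include <XrdCl/XrdClDefaultEnv.hh>
#include <XrdCl/XrdClFile.hh>
#include <iostream>
#include <unistd.h>

class OpenHandler : public XrdCl::ResponseHandler
{
  public:
    // Invoked on one of the worker-pool threads, not on the
    // event-loop thread.
    void HandleResponse( XrdCl::XRootDStatus *status,
                         XrdCl::AnyObject    *response ) override
    {
      std::cout << "open finished: " << status->ToString() << std::endl;
      delete status;
      delete response;
    }
};

int main()
{
  // Must happen before the client is first used, otherwise the
  // defaults (1 event loop, 3 worker threads) are in effect.
  XrdCl::Env *env = XrdCl::DefaultEnv::GetEnv();
  env->PutInt( "ParallelEvtLoop", 1 ); // same as XRD_PARALLELEVTLOOP=1
  env->PutInt( "WorkerThreads",   1 ); // same as XRD_WORKERTHREADS=1

  XrdCl::File file;
  OpenHandler handler;
  // Placeholder URL: any file readable by the client would do.
  file.Open( "root://eospublic.cern.ch//eos/opendata/some/file.root",
             XrdCl::OpenFlags::Read, XrdCl::Access::None, &handler );

  sleep( 5 ); // real code would synchronise with the handler instead
  return 0;
}

Counting threads with ps hH p <pid> | wc -l while such a process sleeps should reproduce the arithmetic above (1 event loop + 1 worker + TaskManager + main = 4).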

@simonmichal
Contributor

@vepadulano : do you have any further questions or can we close this one?

@vepadulano
Author

Hi @simonmichal ,
Sorry for the delay in my response and thank you for your thorough explanation!
The version I used to run those tests initially was 4.12.8, but I've just tried again with version 5 and I still see 5 threads when running the xrdcp process:

[~]: xrootd -v
v5.3.1
[~]: ps aux | grep xrdcp
vpadulan  176936 33.6  0.1 463728 79400 pts/1    Sl+  09:17   0:01 xrdcp root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root .
vpadulan  177010  0.0  0.0 221396   852 pts/2    S+   09:17   0:00 grep --color=auto xrdcp
[~]: ps hH p 176936 | wc -l
5

I will look further into XRD_PARALLELEVTLOOP; it is unclear, for example, how much it can change the runtime of an application that is already multithreaded/multiprocessed with a high thread/process count.

Up to you whether you want to close this issue now or understand better why I still get 5 threads. Let me know if I can help in that direction.

@simonmichal
Contributor

@vepadulano : I will double check the thread count and make sure it's right :-). I think what you are really after is XRD_PARALLELEVTLOOP, so let me elaborate a bit more on it.

Regarding XRD_PARALLELEVTLOOP: it is the number of parallel event loops the client uses. With a single event loop, all socket I/O events are processed by a single thread; in general this is good because we avoid context switching (as opposed to synchronous I/O). However, in some cases this can lead to a situation where the client becomes CPU bound. For example, imagine xrdcp copying data between two very fast servers (say 100GbE, with ramdisk or Optane storage). In such a setup the event loop will receive new I/O events faster than it can process them, and as a result will limit the transfer rate. If we use 2 event loops, on the other hand, the source and destination I/O events will be handled by separate threads/event loops, which can result in 2x faster I/O event processing (we measured 2.5GB/s vs 5GB/s). A similar effect can be observed in applications that use the XRootD client to fetch data from multiple locations.
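To make that scenario concrete, here is a minimal, hypothetical sketch of a client-side copy with two event loops, based on the XrdCl::CopyProcess API; the host names and paths are placeholders, and the 2x gain would of course only show up with endpoints fast enough to saturate a single loop:

// Hypothetical sketch: two event loops, so that source and destination
// socket I/O are polled by separate threads (cf. 2.5GB/s vs 5GB/s above).
#include <XrdCl/XrdClDefaultEnv.hh>
#include <XrdCl/XrdClCopyProcess.hh>

int main()
{
  // Equivalent to running with XRD_PARALLELEVTLOOP=2; must be set
  // before the client is first used.
  XrdCl::DefaultEnv::GetEnv()->PutInt( "ParallelEvtLoop", 2 );

  XrdCl::PropertyList props, results;
  props.Set( "source", "root://fast-src.example//data/big.root" ); // placeholder
  props.Set( "target", "root://fast-dst.example//data/big.root" ); // placeholder

  XrdCl::CopyProcess process;
  process.AddJob( props, &results );
  process.Prepare();
  return process.Run( 0 ).IsOK() ? 0 : 1; // 0 = no progress handler
}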

I will post a short summary in the original root issue (root-project/root#7709).
