Have a more sensible default for TFileMerger::fMaxOpenedFiles
#11276
Comments
As a compromise, I propose to reduce the max value to 256 (already a factor of 4 slower compared to the typical 1024). Would that be okay for your situation?
256 wouldn't have helped in the mentioned GGUS ticket. There was an input set of 133 files, of which more than 24 were on a single pool with a limit of 24 concurrent transfers. Another approach could be to detect this situation in hadd and then suggest the use of
I see. So this is actually a case where the limit is not known to the OS through getrlimit. How can we detect locally the limit of concurrent transfers for a given server?
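For context, the client-side limit mentioned here is the one a process can query locally; a server-side transfer cap (as in a dCache pool) is invisible to this call. A minimal C++ sketch of such a local query (GetOpenFileSoftLimit is an illustrative helper, not ROOT API):

```cpp
#include <sys/resource.h>

// Illustration only: query the client-side soft limit on open file
// descriptors via getrlimit. This is the kind of limit TFileMerger can
// see; a storage pool's concurrent-transfer cap is enforced server-side
// and cannot be detected this way.
long GetOpenFileSoftLimit() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1; // query failed
    return static_cast<long>(rl.rlim_cur);
}
```

A merger that trusts only this number can still exceed a remote server's limit, which is exactly the mismatch discussed in this thread.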
We cannot access the original ticket. What is the actual hardware that was being accessed? Was it mounted as a local-appearing file system, or was it being accessed via a remote protocol (i.e. the file names were prefixed with root://...)? As a first approximation, I don't see how we could detect your use case. If you were able to set
Hi @pcanal,

The input is a list of 133 files in the format root://x509up_u@xrootd.grid.surfsara.nl//pnfs/grid.sara.nl/data/lhcb/LHCb_USER/lhcb/user/v/username/2021_08/520789/520789382/x24mu__wmomsc_a.root. The limit is in the dCache storage system (xrootd.grid.surfsara.nl), not on the client side. This limit is there for a reason: it prevents the storage pools from being overloaded with transfers and crashing.

When hadd tries to open all files at once, it tries to read more files concurrently than the limit per dCache storage pool allows. The first files are served, but the rest of the transfers are queued. This means that they remain open but zero bytes are served until some of the other transfers finish. But hadd never finishes those, because it insists on reading from all files at the same time. So it gets stuck in a deadlock.

If hadd could detect this situation (I'm getting data for some files but zero bytes for other files), it would make sense to stop reading all files concurrently, and instead continue reading from the files that it can, close those files, and then receive data for the other files. If hadd could do that, such a deadlock would be prevented, while performance would still be the maximum available.

Cheers,
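The strategy proposed above amounts to bounding the number of concurrently open inputs so that no transfer sits queued server-side. A hypothetical C++ sketch of that batching logic (MakeBatches is illustrative and not part of TFileMerger; it assumes the per-pool transfer cap is known or guessed):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sketch: instead of opening all inputs at once, split them into batches
// no larger than the server's concurrent-transfer limit. Each batch would
// be fully read and closed before the next one is opened, so no transfer
// is ever left queued while the merger waits on it.
std::vector<std::vector<std::string>>
MakeBatches(const std::vector<std::string>& inputs, std::size_t maxConcurrent) {
    std::vector<std::vector<std::string>> batches;
    for (std::size_t i = 0; i < inputs.size(); i += maxConcurrent) {
        const std::size_t end = std::min(inputs.size(), i + maxConcurrent);
        batches.emplace_back(inputs.begin() + i, inputs.begin() + end);
    }
    return batches;
}
```

With the numbers from this thread (133 inputs, a pool limit of 24), this would yield 6 batches and avoid the described deadlock, at the cost of extra merge passes.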
Do you know if the xrootd routines just "hang" in that case, or return with a request to retry later? If they just hang, there is not much I can see us doing to detect the case, unless there is an xrootd routine that detects/supports this case that we could replace the current call with (and we would need some help to update the xrootd plugin in ROOT to support and test this).
I'm afraid I don't know the answer to your question. But I understand the difficulty. There might be a more pragmatic approach: if reading remote files fails,
Explain what you would like to see improved
By default, hadd/TFileMerger will open the maximum number of allowed open files (see root/io/io/src/TFileMerger.cxx, lines 61 to 93 at commit 3dd47b9).
This seems too large as a default, and even led to problems with sites (https://ggus.eu/?mode=ticket_info&ticket_id=153653)
Optional: share how it could be improved
Have a more sensible default? 32? 64?