Rising number of open files in xcache #975

Closed

nikoladze opened this issue May 7, 2019 · 25 comments

@nikoladze (Contributor)

Hi,

We are noticing a steady increase in the number of open files on our xcache server in Munich, in particular when the server is under heavy load:

plot_openfiles_xcache_20190506_1719.pdf

It seems the files are no longer being closed, even though the connections are closed and the files have been fully downloaded, as suggested by the network traffic monitoring for the same time periods:

plot_openfiles_xcache_20190506_1719_ganglia_overlay.pdf

We are running with prefetch mode enabled with xrootd 4.9.1 (0.rc3.el7).

When the number of open files hits the limit (in our case ~16k), clients start receiving empty or corrupt files (checksum errors). Sometimes the xrootd process then also crashes with SIGSEGV. In most cases the files can be retrieved correctly after a restart of the xrootd process.
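For reference, this is roughly how we watch the fd count and the limit of the running process (a sketch assuming a single xrootd process; adjust the paths if you run several):

    # count file descriptors currently held by the xrootd process
    ls /proc/$(pidof xrootd)/fd | wc -l
    # show the per-process open-files limit (the ~16k mentioned above)
    grep "open files" /proc/$(pidof xrootd)/limits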

Cheers,
Nikolai

@osschar (Contributor) commented May 7, 2019

Hi Nikolai,

Actually, we just figured this one out with Andy ... it is due to the relatively short default timeout (180 s) given to the cache to close a file after a client disconnect. When the system is busy, the cache cannot finish writing the file in time and the xrootd layer then decides to "leak" it. We are working on a proper fix, but in the meantime please use:
pss.ciosync 60 900
to increase the time given to the cache to close the file. The above line means: try every 60 s for a total of 900 s -- this works for us at UCSD, where we also had this problem.

See:
http://xrootd.org/doc/dev49/pss_config.htm#_Toc525070685
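For context, a minimal sketch of where this directive sits in the cache's config (the origin host and library names below are illustrative placeholders for a 4.9-era XCache, not taken from this thread):

    # caching-proxy fragment (illustrative only)
    ofs.osslib   libXrdPss.so
    pss.cachelib libXrdFileCache.so
    pss.origin   some-origin.example.org:1094
    # retry the deferred close every 60 s, for at most 900 s in total
    pss.ciosync  60 900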

Cheers,
Matevz

@abh3 (Member) commented May 7, 2019 via email

@nikoladze (Contributor, Author)

Hi @abh3, @osschar,

Thanks a lot for the information - we are now running with the suggested ciosync option and will see how it behaves.
Unfortunately I couldn't reproduce the crash with a debugger attached yet, but I'll try again next week.

Cheers,
Nikolai

@nikoladze (Contributor, Author)

Hi again,

Unfortunately I still don't have any further information on the crashes, since the production queue that processed files via xcache is currently taken offline.
However, I can provide two new pieces of feedback:

  • The ciosync workaround seems to partially solve the problem, but not completely - sometimes we still observe the number of open files starting to rise steadily again, as seen in this plot: plot_openfiles_05-11_05_18.pdf

  • This might be a different issue, but I suspect it also happens when the file limit is reached: we see some files that end up corrupted in the cache (wrong checksum). These files are marked as "complete" in the .cinfo files, and when downloading them via the xcache server the client receives the corrupted file. Out of 200k "complete" files in our cache we saw 91 such cases. From a quick check of 2 of these files, they seem to have the following in common:

    • the number of bytes is correct
    • the .cinfo file contains a certain number of bytesMissed, but it does not match the size of the empty blocks
    • the wrong checksum originates from missing parts in the file (blocks of 1 MiB filled with zeros; a way to spot them is sketched after this list)
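A minimal bash sketch of how such blocks can be spotted (the path is a placeholder; this is only an illustration, not the exact check we ran):

    # report 1 MiB blocks of a cached data file that are entirely zero
    f=/data/xcache/path/to/suspect/file        # placeholder path
    size=$(stat -c %s "$f")
    for i in $(seq 0 $(( size / 1048576 ))); do
        if dd if="$f" bs=1M skip=$i count=1 2>/dev/null | cmp -s - <(head -c 1048576 /dev/zero); then
            echo "all-zero 1 MiB block at offset $(( i * 1048576 ))"
        fi
    done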

Cheers,
Nikolai

@abh3 (Member) commented Jun 21, 2019 via email

@nikoladze (Contributor, Author)

This is still 4.9.1, specifically xrootd-4.9.1-0.rc3.el7.x86_64 since April 1 and xrootd-4.9.1-1.el7.x86_64 since May 9.

@abh3 (Member) commented Jun 24, 2019 via email

@nikoladze (Contributor, Author)

Hi Andy,

That is essentially what the plots show. What's called "sockets" is when I grep for "TCP", and what's called "files" is when I grep for the folder where the xcache data (and also the log and spool) is stored.
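Roughly along these lines (the data path is a placeholder and the exact commands may differ):

    # count sockets vs. cache-related file descriptors of the xrootd process
    lsof -p "$(pidof xrootd)" | grep -c TCP              # "sockets"
    lsof -p "$(pidof xrootd)" | grep -c /data/xcache     # "files" (placeholder path)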

Cheers,
Nikolai

@osschar (Contributor) commented Jun 25, 2019

Hi Nikolai,

I don't think the fd growth and the file-content errors are related.

About the file errors: people doing ATLAS tests in the US saw exactly the same symptom in about the same time-frame - certain blocks of a file being "all zeros". It was traced down to some network/storage/proxy SNAFU at ATLAS SW T2. Is it possible you pulled the files from that site, too? Nevertheless, we are working on providing a way to detect such errors in the caching proxy. Mind you, you would see the same problem if you transferred the file using xrdcp ... so we really need either protocol-level checks or a way for the caching proxy to retrieve checksums from some service providing them.

About the rising number of fds: this will only happen when the caching proxy is seriously overloaded, i.e., it is not able to write data to disk fast enough, and those writes are also competing with reads for data that is already on disk. Can you please describe your setup (xrootd config, disk configuration being used) and the expected number of jobs and their read rates? Also, can you please show machine load and network in/out plots for the same time interval, say, 17.5. to 19.5.? [ Of course, there is also the possibility that something else is going wrong - that's why Andy was asking about details of what kind of fds are leaked ... however, the 2 : 1 ratio of files to sockets is indicative of the fd leak related to ciosync (2 files (data + cinfo) and 1 socket to the remote). ]

The bytesMissed count simply means that XCache had its write queue full and so served that many bytes to local clients by forwarding the requests directly to the remote, without trying to write them to disk.

Cheers,
Matevz

@osschar (Contributor) commented Jun 25, 2019

One more thing ... even when Andy and I rework the file-close protocol so that it eventually closes out all the fds, the situation from the overload perspective won't change much, that is, the caching proxy will still struggle to write out the last remaining blocks and then close the file.

So, if you are consistently hitting this issue, you either need a beefier machine, correctly configured to allow O(1000) simultaneous read/write streams (RAID 5/6 and ZFS setups are known to choke on this rather badly), or a caching cluster.

There is another solution for this case - immediate redirect of local clients to the origin - and it is almost ready to go. But it won't work if local clients cannot connect to the WAN.

@abh3 (Member) commented Jun 25, 2019 via email

@nikoladze (Contributor, Author)

Unfortunately I don't have the log file anymore, and at the moment the load on the server is not high enough for the issue to occur. When we see it again I'll save the log. I just remember that I didn't find anything special happening at that time, and the load wasn't higher than in the period before.

@abh3 (Member) commented Jun 26, 2019 via email

@osschar (Contributor) commented Jun 27, 2019

Hi Nikolai,

Would you mind sharing config/hardware details of your cache setup? We can also take it offline, if you prefer.

Matevz

@nikoladze (Contributor, Author)

The setup is described on slide 3 here (xrootd version is 4.9.1 now):

https://indico.cern.ch/event/769502/contributions/3197773/attachments/1810707/2957051/nikolai_doma_12.03.2019_xcache_lmu.pdf

@osschar (Contributor) commented Jun 28, 2019

RAID-6 is probably killing you under heavy multi-file load. This is the case for both hardware and software raids.

XRootd/XCache can work with a set of independent disks and this scales much better, see page 12 of this:
https://indico.cern.ch/event/727208/contributions/3444604/attachments/1859894/3056280/XCache-FeaturesEtc-Lyon-2019.pdf
and of course the Holy docs:
http://xrootd.org/doc/dev49/ofs_config.htm#_Toc522916548
http://xrootd.org/doc/dev49/pss_config.htm#_Toc525070687
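A rough sketch of such an independent-disk layout (all paths and space names below are placeholders; see the docs above for the authoritative directive syntax):

    # illustrative multi-disk XCache layout
    oss.localroot /xcache-namespace           # namespace of sym-links into the data disks
    oss.space meta /ssd/xcache-meta           # keep .cinfo meta-data off the data disks
    oss.space data /data1/xcache
    oss.space data /data2/xcache
    oss.space data /data3/xcache
    pfc.spaces data meta                      # tell the cache which spaces to use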

@gdxd commented Jul 4, 2019

Hi,
just a comment on file corruption, you suspected it could be related to issues with network/storage/proxy SNAFU at ATLAS SW T2.
We had our Panda queue configured such that it was reading exclusively from MPPMU, which is our neighbour site in Munich (just 500 m away). So I cannot see how it could possibly be related to SW T2.
cheers

Guenter

@osschar (Contributor) commented Jul 12, 2019

Hi Nikolai, Guenter,

Did you try:
a) increasing the limits further; and, more importantly,
b) switching from RAID to individual disks?

Cheers,
Matevz

@nikoladze (Contributor, Author)

Hi Matevz,

We have not tried that yet. But we have continued running jobs via the xcache server, so we can investigate the problems further when they occur again.
I'm currently trying again to get a backtrace of the crashes that happen sometimes. Related to that: do I have to do anything special to get the debug info for xrootd? I tried to install it in the Singularity container that runs xcache, using the xrootd-debuginfo package from http://www.xrootd.org/binaries/xrootd-stable-slc7.repo, but gdb complains, e.g.:

warning: the debug information found in "/usr/lib/debug/usr/lib64/libXrdServer.so.2.0.0.debug" does not match "/lib64/libXrdServer.so.2" (CRC mismatch).

Cheers,
Nikolai

@osschar (Contributor) commented Jul 16, 2019

Hi Nikolai,

Maybe a version/build mismatch? Installing the xrootd-debuginfo rpm directly has always worked for me (but I do my own builds). You could also try installing it with debuginfo-install (in yum-utils) ... this should take care of dependencies, too, IIRC.
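For example, something like this (the package names are just the usual ones; adjust to what is actually installed in the container):

    yum install -y yum-utils
    debuginfo-install -y xrootd-server xrootd-libs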

Can you please try 4.10, which was just released? The fd leak under overload is still there (so don't reduce the timeouts just yet), but there were some other fixes in the xrootd client that address some of the crashes seen in the wild.

I didn't realize you were running in a container. Do you guys use host networking?

Also, you should try to use separate disks, not raid ... you are leaving 10x performance on the floor there.

Cheers,
Matevz

@gdxd commented Jul 16, 2019

Hi Matevz,
we are considering trying it with separate disks. Related to that: with separate disks we will of course lose data in case of a disk failure (which happens occasionally, and is why we run our T2 storage with RAID-6). For a cache, data loss is of course less critical, but xcache still needs to cope with it, i.e. clean up file lists, metadata info, etc. How does this work - is an xcache restart necessary/sufficient?

cheers,
Guenter

@osschar (Contributor) commented Jul 16, 2019

Hi Guenter,

You'll probably want to put metadata (where cinfo files are stored) and root-fs (basically sym-links into data disks) on a non-data disk.

After a data disk failure, you replace the disk (or comment it out in the xrootd.cfg) and restart. XRootd will refuse to start if a configured target directory for oss.space does not exist. If you lose the disk with the meta-data, you have to clear the cache.

You can remove the stale links after a data disk replacement, but you actually don't have to ... each lfn will get cleared when its time comes (when it would be purged or when an open is attempted). Thinking about this, I could add a full data-space scan during the startup purge -- normally the purge only scans the meta-data files to determine the "age" of a file.
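If you do want to clean them up by hand after replacing a data disk, something like this should do it (assuming the sym-link namespace sits under oss.localroot; the path is a placeholder):

    # delete sym-links in the cache namespace whose data file no longer exists
    find /xcache-namespace -xtype l -delete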

Cheers,
Matevz

@abh3 (Member) commented Jul 16, 2019 via email

@nikoladze (Contributor, Author)

Hi Andy,

The containers were built on the same server where we also run them.

Cheers,
Nikolai

@abh3 (Member) commented Feb 18, 2020

I am closing this as there has been no activity and no reports of similar issues from other sites. Please reopen if this is still a problem.

abh3 closed this as completed Feb 18, 2020