Rising number of open files in xcache #975

Closed

nikoladze opened this issue May 7, 2019 · 25 comments

@nikoladze (Contributor)

Hi,

We are noticing a steady increase in the number of open files on our xcache server in Munich, in particular when the server is under heavy load:

plot_openfiles_xcache_20190506_1719.pdf

It seems the files are no longer being closed, even though the connections are closed and the files have been fully downloaded, as suggested by the network traffic monitoring for the same time periods:

plot_openfiles_xcache_20190506_1719_ganglia_overlay.pdf

We are running with prefetch mode enabled with xrootd 4.9.1 (0.rc3.el7).

When the number of open files hits the limit (in our case ~16k), clients start receiving empty or corrupt files (checksum errors). Sometimes the xrootd process then also crashes with SIGSEGV. In most cases the files can be retrieved correctly after a restart of the xrootd process.
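For reference, this is roughly how we watch the fd count and the limit of the running process (a sketch assuming a single xrootd process; adjust the paths if you run several):

    # count file descriptors currently held by the xrootd process
    ls /proc/$(pidof xrootd)/fd | wc -l
    # show the per-process open-files limit (the ~16k mentioned above)
    grep "open files" /proc/$(pidof xrootd)/limits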

Cheers,
Nikolai

@osschar (Contributor) commented May 7, 2019

Hi Nikolai,

Actually, we just figured this one out with Andy ... it is due to the relatively short default timeout (180 s) given to the cache to close a file after a client disconnect. When the system is busy, the cache cannot finish writing the file in time and the xrootd layer then decides to "leak" it. We are working on a proper fix, but in the meantime please use:
pss.ciosync 60 900
to increase the time given to the cache to close the file. The above line means: try every 60 s for a total of 900 s -- this works for us at UCSD, where we also had this problem.

See:
http://xrootd.org/doc/dev49/pss_config.htm#_Toc525070685
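For context, a minimal sketch of where this directive sits in the cache's config (the origin host and library names below are illustrative placeholders for a 4.9-era XCache, not taken from this thread):

    # caching-proxy fragment (illustrative only)
    ofs.osslib   libXrdPss.so
    pss.cachelib libXrdFileCache.so
    pss.origin   some-origin.example.org:1094
    # retry the deferred close every 60 s, for at most 900 s in total
    pss.ciosync  60 900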

Cheers,
Matevz

@abh3 (Member) commented May 7, 2019 via email

@nikoladze (Contributor, Author)

Hi @abh3, @osschar,

Thanks a lot for the information - we are now running with the suggested ciosync option and will see how it behaves.
Unfortunately I couldn't reproduce the crash with a debugger attached yet, but I'll try again next week.

Cheers,
Nikolai

@nikoladze (Contributor, Author)

Hi again,

Unfortunately I still don't have any further information on the crashes, since the production queue that processed files via xcache is currently taken offline.
However, I can provide two new pieces of feedback:

  • The ciosync workaround seems to partially solve the problem, but not completely - sometimes we still observe the number of open files starting to rise steadily again, as seen in this plot: plot_openfiles_05-11_05_18.pdf

  • This might be a different issue, but I suspect it also happens when the file limit is reached: we see some files that end up corrupted in the cache (wrong checksum). These files are marked as "complete" in the .cinfo files, and when downloading them via the xcache server the client receives the corrupted file. Out of 200k "complete" files in our cache we saw 91 such cases. From a quick check of 2 of these files, they seem to have the following in common:

    • the number of bytes is correct
    • the .cinfo file contains a certain number of bytesMissed, but it does not match the size of the empty blocks
    • the wrong checksum originates from missing parts in the file (blocks of 1 MiB filled with zeros; a way to spot them is sketched after this list)
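A minimal bash sketch of how such blocks can be spotted (the path is a placeholder; this is only an illustration, not the exact check we ran):

    # report 1 MiB blocks of a cached data file that are entirely zero
    f=/data/xcache/path/to/suspect/file        # placeholder path
    size=$(stat -c %s "$f")
    for i in $(seq 0 $(( size / 1048576 ))); do
        if dd if="$f" bs=1M skip=$i count=1 2>/dev/null | cmp -s - <(head -c 1048576 /dev/zero); then
            echo "all-zero 1 MiB block at offset $(( i * 1048576 ))"
        fi
    done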

Cheers,
Nikolai

@abh3 (Member) commented Jun 21, 2019 via email

@nikoladze (Contributor, Author)

This is still 4.9.1, specifically xrootd-4.9.1-0.rc3.el7.x86_64 since April 1 and xrootd-4.9.1-1.el7.x86_64 since May 9.

@abh3 (Member) commented Jun 24, 2019 via email

@nikoladze (Contributor, Author)

Hi Andy,

That is essentially what the plots show. What's called "sockets" is when I grep for "TCP", and what's called "files" is when I grep for the folder where the xcache data (and also the log and spool) is stored.
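Roughly along these lines (the data path is a placeholder and the exact commands may differ):

    # count sockets vs. cache-related file descriptors of the xrootd process
    lsof -p "$(pidof xrootd)" | grep -c TCP              # "sockets"
    lsof -p "$(pidof xrootd)" | grep -c /data/xcache     # "files" (placeholder path)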

Cheers,
Nikolai

@osschar (Contributor) commented Jun 25, 2019

Hi Nikolai,

I don't think the fd growth and the file-content errors are related.

About the file errors: people doing ATLAS tests in the US saw exactly the same symptom in about the same time-frame - certain blocks of a file being "all zeros". It was traced down to some network/storage/proxy SNAFU at ATLAS SW T2. Is it possible you pulled the files from that site, too? Nevertheless, we are working on providing a way to detect such errors in the caching proxy. Mind you, you would see the same problem if you transferred the file using xrdcp ... so we really need either protocol-level checks or a way for the caching proxy to retrieve checksums from some service providing them.

About the rising number of fds: this will only happen when the caching proxy is seriously overloaded, i.e., it is not able to write data to disk fast enough, and those writes are also competing with reads for data that is already on disk. Can you please describe your setup (xrootd config, disk configuration being used) and the expected number of jobs and their read rates? Also, can you please show machine load and network in/out plots for the same time interval, say, 17.5. to 19.5.? [ Of course, there is also the possibility that something else is going wrong - that's why Andy was asking about details of what kind of fds are leaked ... however, the 2 : 1 ratio of files to sockets is indicative of the fd leak related to ciosync (2 files (data + cinfo) and 1 socket to the remote). ]

The bytesMissed count simply means that XCache had its write queue full and so served that many bytes to local clients by forwarding the requests directly to the remote, without trying to write them to disk.

Cheers,
Matevz

@osschar (Contributor) commented Jun 25, 2019

One more thing ... even when Andy and I rework the file-close protocol so that it eventually closes out all the fds, the situation from the overload perspective won't change much, that is, the caching proxy will still struggle to write out the last remaining blocks and then close the file.

So, if you are consistently hitting this issue, you either need a beefier machine, correctly configured to allow O(1000) simultaneous read/write streams (RAID 5/6 and ZFS setups are known to choke on this rather badly), or a caching cluster.

There is another solution for this case - immediate redirect of local clients to the origin - and it is almost ready to go. But it won't work if local clients cannot connect to the WAN.

@abh3 (Member) commented Jun 25, 2019 via email

@nikoladze (Contributor, Author)

Unfortunately I don't have the log file anymore, and at the moment the load on the server is not high enough for the issue to occur. When we see it again I'll save the log. I just remember that I didn't find anything special happening at that time, and the load wasn't higher than in the period before.

@abh3 (Member) commented Jun 26, 2019 via email

@osschar (Contributor) commented Jun 27, 2019

Hi Nikolai,

Would you mind sharing config/hardware details of your cache setup? We can also take it offline, if you prefer.

Matevz

@nikoladze (Contributor, Author)

The setup is described on slide 3 here (xrootd version is 4.9.1 now):

https://indico.cern.ch/event/769502/contributions/3197773/attachments/1810707/2957051/nikolai_doma_12.03.2019_xcache_lmu.pdf

@osschar (Contributor) commented Jun 28, 2019

RAID-6 is probably killing you under heavy multi-file load. This is the case for both hardware and software raids.

XRootd/XCache can work with a set of independent disks and this scales much better, see page 12 of this:
https://indico.cern.ch/event/727208/contributions/3444604/attachments/1859894/3056280/XCache-FeaturesEtc-Lyon-2019.pdf
and of course the Holy docs:
http://xrootd.org/doc/dev49/ofs_config.htm#_Toc522916548
http://xrootd.org/doc/dev49/pss_config.htm#_Toc525070687
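A rough sketch of such an independent-disk layout (all paths and space names below are placeholders; see the docs above for the authoritative directive syntax):

    # illustrative multi-disk XCache layout
    oss.localroot /xcache-namespace           # namespace of sym-links into the data disks
    oss.space meta /ssd/xcache-meta           # keep .cinfo meta-data off the data disks
    oss.space data /data1/xcache
    oss.space data /data2/xcache
    oss.space data /data3/xcache
    pfc.spaces data meta                      # tell the cache which spaces to use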

@gdxd commented Jul 4, 2019

Hi,
just a comment on file corruption, you suspected it could be related to issues with network/storage/proxy SNAFU at ATLAS SW T2.
We had our Panda queue configured such that it was reading exclusively from MPPMU, which is our neighbour site in Munich (just 500 m away). So I cannot see how it could possibly be related to SW T2.
cheers

Guenter

@osschar (Contributor) commented Jul 12, 2019

Hi Nikolai, Guenter,

Did you try:
a) increasing the limits further; and, more importantly,
b) switching from RAID to individual disks?

Cheers,
Matevz

@nikoladze (Contributor, Author)

Hi Matevz,

We have not tried that yet. But we have continued running jobs via the xcache server, so we can investigate the problems further when they occur again.
I'm currently trying again to get a backtrace of the crashes that happen sometimes. Related to that: do I have to do anything special to get the debug info for xrootd? I tried to install it in the Singularity container that runs xcache, using the xrootd-debuginfo package from http://www.xrootd.org/binaries/xrootd-stable-slc7.repo, but gdb complains, e.g.:

warning: the debug information found in "/usr/lib/debug/usr/lib64/libXrdServer.so.2.0.0.debug" does not match "/lib64/libXrdServer.so.2" (CRC mismatch).

Cheers,
Nikolai

@osschar (Contributor) commented Jul 16, 2019

Hi Nikolai,

Maybe a version/build mismatch? Installing the xrootd-debuginfo rpm directly has always worked for me (but I do my own builds). You could also try installing it with debuginfo-install (in yum-utils) ... this should take care of dependencies, too, IIRC.
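For example, something like this (the package names are just the usual ones; adjust to what is actually installed in the container):

    yum install -y yum-utils
    debuginfo-install -y xrootd-server xrootd-libs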

Can you please try 4.10, which was just released? The fd leak under overload is still there (so don't reduce the timeouts just yet), but there were some other fixes in the xrootd client that address some of the crashes seen in the wild.

I didn't realize you were running in a container. Do you guys use host networking?

Also, you should try to use separate disks, not raid ... you are leaving 10x performance on the floor there.

Cheers,
Matevz

@gdxd commented Jul 16, 2019

Hi Matevz,
we are considering trying it with separate disks. Related to that: with separate disks we will of course lose data in case of a disk failure (which happens occasionally, and is why we run our T2 storage with RAID-6). For a cache, data loss is of course less critical, but xcache still needs to cope with it, i.e. clean up file lists, metadata info, etc. How does this work - is an xcache restart necessary/sufficient?

cheers,
Guenter

@osschar (Contributor) commented Jul 16, 2019

Hi Guenter,

You'll probably want to put metadata (where cinfo files are stored) and root-fs (basically sym-links into data disks) on a non-data disk.

After a data disk failure, you replace the disk (or comment it out in the xrootd.cfg) and restart. XRootd will refuse to start if a configured target directory for oss.space does not exist. If you lose the disk with the meta-data, you have to clear the cache.

You can remove the stale links after a data disk replacement, but you actually don't have to ... each lfn will get cleared when its time comes (when it would be purged or when an open is attempted). Thinking about this, I could add a full data-space scan during the startup purge -- normally the purge only scans the meta-data files to determine the "age" of a file.
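If you do want to clean them up by hand after replacing a data disk, something like this should do it (assuming the sym-link namespace sits under oss.localroot; the path is a placeholder):

    # delete sym-links in the cache namespace whose data file no longer exists
    find /xcache-namespace -xtype l -delete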

Cheers,
Matevz

@abh3 (Member) commented Jul 16, 2019 via email

@nikoladze (Contributor, Author)

Hi Andy,

The containers were built on the same server where we also run them.

Cheers,
Nikolai

@abh3 (Member) commented Feb 18, 2020

I am closing this as there has been no activity and no reports of similar issues from other sites. Please reopen if this is still a problem.

abh3 closed this as completed Feb 18, 2020