large client memory allocations from specific client resource #142

Closed
rjefferson opened this issue Sep 26, 2014 · 11 comments

@rjefferson

Hi,

We have two distinct compute clusters (PDSF & Carver) that we use to access an xrootd system hosted on one of them (PDSF). The two clusters run the same OS and we link our software against the same client libraries. We have no problems accessing the service from the cluster where it is hosted. Access also works from the second cluster (Carver); however, the process allocates a huge amount of memory (seemingly related to file size and the number of files read) that is never relinquished, so when we try to read multiple files the jobs die with:

Xrd: PhyConnection: Can't run reader thread: out of system resources. Critical error.

(or sometimes crash with a 'bad alloc' error). I can recreate the symptom using xrdcp (our xrd version is v3.3.4). When I run with --debug #, I don't see any difference between the two systems. When I then run under valgrind, the outputs are identical except that valgrind issues several warnings on Carver of the type:

==22839== Warning: set address range perms: large range [0x39431000, 0x49432000) (defined)

with an address range nearly always 0x10001000 wide. The warning seems to be just a notice from valgrind that a large address range was allocated. No such warnings appear when run from PDSF.

At this point, the only difference I see between running from the two resources is the network topology, which I believe is largely IB connected. The admins are available to help debug, with some guidance on what to target.

thanks,

Jeff

@abh3
Member

abh3 commented Sep 26, 2014

This is a known problem with the old client when used in the context of a proxy server. In fact, starting in SL6 it's a known problem for any kind of data server (not only xrootd). If you are using this in proxy mode then add this to the config file:

pss.setopt ReadCacheSize 0

That effectively turns off the per-file cache, which can consume a huge amount of memory if many files are being opened. For any kind of server (proxy or not) you will want to add the following (bash syntax; adjust for your shell) to your sysinit script prior to starting the server:

export MALLOC_ARENA_MAX=4

This limits heap fragmentation, which can become a serious issue in a multi-threaded server. Alternatively, you can install tcmalloc (from Google) or jemalloc (from Facebook), either of which pretty much solves the problems of the brain-dead malloc included in SL6.
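
For reference, a minimal sketch of how these settings might go into a server startup script (the allocator library paths and the xrootd binary/config paths are assumptions and will vary by installation):

    #!/bin/sh
    # Cap the number of glibc malloc arenas to limit heap fragmentation
    # in the multi-threaded server.
    export MALLOC_ARENA_MAX=4

    # Alternatively, preload a different allocator (paths are placeholders):
    #export LD_PRELOAD=/usr/lib64/libjemalloc.so.1
    #export LD_PRELOAD=/usr/lib64/libtcmalloc.so.4

    # Start the server (binary and config file paths are placeholders).
    /usr/bin/xrootd -c /etc/xrootd/xrootd.cfg &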

@rjefferson
Author

While the problem may be the SL6 malloc, since that is the OS we have installed, the suggestions didn't help. We're currently not running a proxy, as the access should not transit any firewalls, and setting MALLOC_ARENA_MAX didn't change the behavior. I haven't suggested tcmalloc or jemalloc to the admins yet.

Any other ideas on how to debug? I still think it's odd that access from one cluster with the same OS and the same xrootd client binaries has this issue, but access from the other cluster doesn't.

thanks

@abh3
Member

abh3 commented Oct 1, 2014

OK, I totally misunderstood your problem. Based on the latest information, there is no problem with the xrootd server; it's a problem with your application, right? If so, we need to tackle this in another way.

Is it true that the identical client job encounters the problem on one cluster but not on the other? That seems to be what is being said. I assume you have logs from the client job; could you post them, indicating which one worked and which one failed?

Based on what I think is going on here, I would say that the two environments differ in some critical way. We just need to find out what that difference is.

@ljanyst
Contributor

ljanyst commented Oct 1, 2014

This happens when pthread_create fails. There may be two reasons for it:

  1. The system lacks resources to spawn another thread (typically not enough memory to allocate another execution stack)
  2. The system-imposed limit on the total number of threads in a process would be exceeded should the operation succeed.
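
A quick way to check both on the client node might be something like this (a rough sketch, assuming a standard Linux machine with bash; these are not xrootd commands):

    # Per-process cap on user processes/threads (the "max user processes" ulimit).
    ulimit -u

    # System-wide ceiling on the total number of threads.
    cat /proc/sys/kernel/threads-max

    # Per-thread stack size; a large stack limit multiplied by many threads
    # can exhaust memory or address space.
    ulimit -s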

@rjefferson
Author

Thanks - I think it is a combination of problems. The malloc was indeed part of it. I preloaded jemalloc when running the client, and the memory issue indicated by the "Warning: set address range perms: large range" messages, along with the growing utilization, is gone. I can now read a large number of files without a problem as long as they all come from the same data server. However, if I access files spread over a few servers, I then get either the bad_alloc error or, more often,

Xrd: PhyConnection: Can't run reader thread: out of system resources. Critical error.

This seems to point to a thread limit, as the memory usage appears stable. I see the maxproc limit differs between the two systems: 256 on the problem system (Carver) and 1024 on the one that works (PDSF). When I tried monitoring that late yesterday, I only witnessed 5 threads being spawned. I'll double-check that and check further with the admins to come up with more diagnostics.
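
For illustration, the kind of client run and monitoring described above might look roughly like this (the jemalloc path, file URL, and process id are placeholders, not the exact commands used):

    # Preload jemalloc for just the client process (library path is an assumption).
    LD_PRELOAD=/usr/lib64/libjemalloc.so.1 xrdcp root://host.example//path/to/file /dev/null

    # In another shell, watch the thread count of the running client
    # (<pid> is the xrdcp process id).
    watch -n1 'grep Threads /proc/<pid>/status'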

@rjefferson
Author

So it seems that the low resource limits on the interactive nodes were, in the end, causing the failures. I ran successfully via interactive batch, which has much looser limits. Overall, I found that using jemalloc cut the memory usage to roughly a third for a specific test job (from 400MB to 130MB). So that was indeed a useful outcome.

thanks to all

@abh3
Member

abh3 commented Oct 1, 2014

Great. It would be useful to see how tcmalloc compares. Do you think you can run the same job using that?

@rjefferson
Author

I tried building gperftools but it failed to find libunwind. I'll try adding that and see if I can build it. I would like to find out too. thx

@rjefferson
Author

We got gperftools installed and I re-ran the test with tcmalloc. The job reads 12 files stored on 12 different xrootd servers, with a total size of 5.5GB. I also put those 12 files on a shared file system (GPFS). I used the LD_PRELOAD environment variable to switch between tcmalloc, jemalloc, and native malloc.

  • Reading files from GPFS, the process was stable using 100MB, independent of malloc
  • Reading files from xrootd: tcmalloc -> 123MB, jemalloc -> 134MB, native malloc -> 410MB

Both tcmalloc and jemalloc got rid of the large address allocation warnings from valgrind.
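
A sketch of how such a comparison could be scripted (the library paths and the job command are placeholders; peak memory is taken from the "Maximum resident set size" line that /usr/bin/time -v prints):

    # Run the same job with each allocator swapped in via LD_PRELOAD and
    # record the peak resident set size.
    for alloc in "" /usr/lib64/libjemalloc.so.1 /usr/lib64/libtcmalloc.so.4; do
        echo "allocator: ${alloc:-native}"
        LD_PRELOAD="$alloc" /usr/bin/time -v ./read_12_files.sh 2>&1 | \
            grep 'Maximum resident set size'
    done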

@abh3
Member

abh3 commented Oct 3, 2014

Thanks for the test! Now if only standard Linux used one of the two better performing mallocs instead of the really brain-dead one they decided on.

@ljanyst
Contributor

ljanyst commented Oct 3, 2014

Thanks for the benchmarks @rjefferson!

@ljanyst ljanyst closed this as completed Oct 3, 2014