execution error -- libgomp: Thread creation failed: Resource temporarily unavailable #10

Closed
grendon opened this issue Jul 25, 2014 · 5 comments

Comments

@grendon

grendon commented Jul 25, 2014

The reference genome is TAIR10 for Arabidopsis thaliana, size 119,481,543 bp.
The Sakata read files come from this URL:
http://1001genomes.org/data/JGI/JGIHeazlewood2011/releases/current/TAIR10/strains/

PE reads: 60,173,258

read length: 100
avg depth: 36.87
std depth: 313.73

Several benchmarking jobs were submitted to an SGI UV1000 node with 384 Intel Xeon X7542 cores and 2 TB of memory.
I assigned 150 GB of memory to each job.
The number of cores assigned to each job varied: 36, 24, 16, and 8.
All but one job finished successfully; the job with 36 cores failed. Below are the last few lines of the error log.

2014-07-25 14:56:11 [2b53393de700] Opened fastq stream on /home/a-m/grendon/tair-isaac-pipeline_test/index/sakata_reads/lane1_read1.fastq
2014-07-25 14:56:11 [2b53393de700] Opened fastq stream on /home/a-m/grendon/tair-isaac-pipeline_test/index/sakata_reads/lane1_read2.fastq
2014-07-25 14:56:11 [2b53393de700] Resetting Fastq data for 5000000 clusters
2014-07-25 14:56:13 [2b53393de700] Resetting Fastq data done for 5000000 clusters
2014-07-25 14:56:54 [2b53393de700] Loading Fastq data done. Loaded 5000000 clusters for TileMetadata(1101, 1, 1, 5000000, 0)
2014-07-25 14:56:54 [2b53393de700] Sorting matches by barcode for TileMetadata(1101, 1, 1, 5000000, 0)
2014-07-25 14:56:54 [2b53391dd700] Loading matches for TileMetadata(1101, 2, 1, 5000000, 1)

libgomp: Thread creation failed: Resource temporarily unavailable

@rpetrovski
Contributor

This failure is typically the result of exceeding system-imposed limits. For example, if the thread stack size times the number of threads iSAAC attempts to create exceeds the ulimit -v cap, thread creation will fail.
Can you please post the output of ulimit -a?
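For reference, here is a minimal C++ sketch of that arithmetic (the 8 MiB per-thread stack is just the common glibc default, an assumption rather than a value reported in this thread):

```cpp
// Sketch: estimate how many thread stacks fit under RLIMIT_AS (ulimit -v),
// assuming each thread reserves a fixed-size stack.
#include <sys/resource.h>
#include <cstdio>

int main() {
    const unsigned long kStackPerThread = 8UL * 1024 * 1024; // assumed 8 MiB default

    rlimit as{};
    if (getrlimit(RLIMIT_AS, &as) != 0) { perror("getrlimit"); return 1; }

    if (as.rlim_cur == RLIM_INFINITY) {
        std::printf("ulimit -v is unlimited; address space is not the constraint\n");
    } else {
        // Each new thread's stack reservation counts against the address-space
        // cap, so the cap divided by the per-thread stack bounds the thread count.
        std::printf("ulimit -v = %lu KiB -> room for at most ~%lu thread stacks\n",
                    (unsigned long)as.rlim_cur / 1024,
                    (unsigned long)as.rlim_cur / kStackPerThread);
    }
    return 0;
}
```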

I'm not familiar with SGI UV 1000. From a random specification on the web I gather the maximum number of cores per compute node you can have is 16. Am I correct?

It looks like you've tried to override the default iSAAC threading with the -j option. Is that the case? There is usually no reason to do that unless you are debugging or working around a poorly built system. iSAAC picks the number of compute threads from the number of hardware threads the system supports. Going above that causes more threads to access memory concurrently than the system is designed for, which results in L1/L2/L3 cache thrashing and therefore suboptimal performance. Setting lower values can be justified when there isn't enough RAM to accommodate the per-thread memory allocations iSAAC does. However, 150 GB of memory is far more than enough with the rest of the iSAAC options at their defaults. Do you actually have 150 GB of RAM physically available on the node?
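For what it's worth, OpenMP-based code typically derives its default team size from the hardware along these lines (a sketch of the general pattern, not iSAAC's actual code; the override value is illustrative):

```cpp
// Sketch: how an OpenMP program's default thread count relates to the hardware,
// and what an explicit override (analogous to iSAAC's -j) looks like.
#include <omp.h>
#include <cstdio>

int main() {
    std::printf("processors visible to OpenMP: %d\n", omp_get_num_procs());
    std::printf("default max team size       : %d\n", omp_get_max_threads());

    omp_set_num_threads(8); // explicit override, e.g. when RAM is the constraint
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("team size after override    : %d\n", omp_get_num_threads());
    }
    return 0;
}
```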

@lsmainzer

@rpetrovski:

Sorry about the delay in replying to you. These nodes have 384 Intel Xeon X7542 @ 2.67 GHz cores per node and 2 TB of RAM per node. Thus, we are not exceeding the number of cores on the node by asking the iSAAC aligner to use 48 threads, and we are not exceeding the available RAM either.

However, I notice that iSAAC does not respect the number of threads specified on the command line with the -j option. For example, when I ran tests on a different computer, which has 48 dual-threaded cores, I specified -j 48, but the logs show that all 96 hardware threads were in fact used. Is this expected behavior, or am I seeing a bug?

The "libgomp: Thread creation failed" error shows up specifically when running on a shared cluster node. I understand iSAAC was designed to run alone on a node, which may be why we are seeing this error. For example, if a node has 384 cores but we want iSAAC to use only 48, we specify 48 via the -j option and also tell the PBS script to submit the iSAAC job with a limit of 48 threads, so that other users can utilize the remaining threads. However, iSAAC appears to ignore these limitations: it not only ignores the -j option but also does not comply with the scheduler's limits.

I think it is reasonable to expect software to use all the resources on a node it has to itself. However, I believe we would never have seen the "libgomp: Thread creation failed" problem if iSAAC did not try to use more than the 48 threads specified with "-j 48". This does not seem to be correct behavior, since it negates the entire reason for having the -j option.

Are you aware of this behavior? Have you ever encountered it? We would be happy to run more tests to clarify.

Thank you very much,
Luda

@rpetrovski
Contributor

True, iSAAC-01 does not limit the number of threads __gnu_parallel::sort uses. I'll try to come up with some sort of workaround in January.
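In the meantime, a possible user-side mitigation is to cap the OpenMP pool that __gnu_parallel::sort draws its threads from. A minimal sketch, assuming the sort is the only OpenMP consumer at that point (this is not iSAAC's code, and the cap value is illustrative):

```cpp
// Sketch: __gnu_parallel::sort sizes its team from OpenMP, so capping
// omp_set_num_threads() around the call also caps the sort's thread count.
// Compile with: g++ -fopenmp example.cpp
#include <parallel/algorithm>
#include <omp.h>
#include <vector>
#include <cstdlib>

void sort_with_cap(std::vector<int>& v, int max_threads) {
    const int saved = omp_get_max_threads(); // remember the current limit
    omp_set_num_threads(max_threads);        // e.g. the value passed via -j
    __gnu_parallel::sort(v.begin(), v.end());
    omp_set_num_threads(saved);              // restore for the rest of the program
}

int main() {
    std::vector<int> v(1 << 20);
    for (int& x : v) x = std::rand();
    sort_with_cap(v, 8); // illustrative cap
    return 0;
}
```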

Roman.

@rpetrovski rpetrovski reopened this Dec 19, 2014
@rpetrovski
Contributor

Replaced the GNU parallel sort with a home-made one. Please try iSAAC-01.15.01.28.

Roman.

@chunhualiao

I just saw a similar error with GCC 4.9.2. The input is a simple OpenMP nested-parallelism program.
The problem is that the workstation has 72 logical processors, so with nesting enabled the program tries to start 72*72 = 5184 threads by default, which triggers "libgomp: Thread creation failed: Resource temporarily unavailable".

The solution is to limit the number of threads at each level of parallelism using the num_threads() clause, as in the sketch below.
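A minimal sketch of that fix (the thread counts are illustrative, not from the original program):

```cpp
// Sketch: nested OpenMP parallelism with explicit per-level caps.
// Without num_threads(), each of N outer threads may spawn N inner threads
// (N*N total); the caps below bound the total at 8 * 4 = 32 threads.
#include <omp.h>
#include <cstdio>

int main() {
    omp_set_nested(1); // enable nested parallelism (pre-OpenMP-5.0 API)

    #pragma omp parallel num_threads(8)     // outer level: at most 8 threads
    {
        #pragma omp parallel num_threads(4) // inner level: at most 4 per outer thread
        {
            std::printf("outer %d / inner %d\n",
                        omp_get_ancestor_thread_num(1),
                        omp_get_thread_num());
        }
    }
    return 0;
}
```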
