Too many threads generated until "-su: fork: retry: Resource temporarily unavailable" #1511

Closed · jack-gits opened this issue Mar 15, 2022 · 4 comments · Fixed by #1552
@jack-gits (Contributor)


Context

  • torchserve version: pytorch/torchserve:latest-gpu
  • torch-model-archiver version: 0.5.2
  • torch version: 1.6.0
  • torchvision version [if any]: 0.7.0
  • torchtext version [if any]:
  • torchaudio version [if any]:
  • java version: openjdk version "1.8.0_292"
  • Operating System and version: Ubuntu 16.04

Your Environment

  • Installed using source? [yes/no]:
  • Are you planning to deploy it using docker container? [yes/no]: yes
  • Is it a CPU or GPU environment?: GPU
  • Using a default/custom handler? [If possible upload/share custom handler/model]: custom handler
  • What kind of model is it e.g. vision, text, audio?: vision
  • Are you planning to use local models from the model store or a public URL (e.g. an S3 bucket)? [If a public URL, provide the link.]: local model
  • Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs:
  • Link to your project [if any]:

Expected Behavior

I'm using the TorchServe workflow feature in Docker. During inference, the system keeps creating threads until it fails with "-su: fork: retry: Resource temporarily unavailable".

Current Behavior

Possible Solution

Steps to Reproduce

  1. torchserve --start
  2. Register the workflow via the management API.
  3. Run inference; there are about 6,000 cases to be inferred (a shell sketch of these steps follows below).
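
A minimal shell sketch of the reproduction, based on the dog-breed workflow commands shown in the comments below (workflow archive name, image file, and ports are taken from that repro; adjust them to your setup):

$ torchserve --start --ncs   # assumes model-store and workflow-store are already configured
$ curl -X POST "http://127.0.0.1:8081/workflows?url=dog_breed_wf.war"
$ for i in {1..100}; do curl -s http://127.0.0.1:8080/wfpredict/dog_breed_wf -T Dog1.jpg > /dev/null; done
$ ps -efT | grep wf_store | grep -v grep | wc -l   # thread count grows with each request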

Failure Logs [if any]

2022-03-14T15:03:32,999 [ERROR] pool-3-thread-2 org.pytorch.serve.metrics.MetricCollector -
java.io.IOException: Cannot run program "/usr/bin/python3" (in directory "/usr/local/lib/python3.6/dist-packages"): error=11, Resource temporarily unavailable
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1128) ~[?:?]
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1071) ~[?:?]
at java.lang.Runtime.exec(Runtime.java:592) ~[?:?]
at org.pytorch.serve.metrics.MetricCollector.run(MetricCollector.java:42) ~[model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.io.IOException: error=11, Resource temporarily unavailable
at java.lang.ProcessImpl.forkAndExec(Native Method) ~[?:?]
at java.lang.ProcessImpl.<init>(ProcessImpl.java:340) ~[?:?]
at java.lang.ProcessImpl.start(ProcessImpl.java:271) ~[?:?]
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1107) ~[?:?]
... 9 more
[15806.571s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 136k, guardsize: 0k, detached.
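
For reference, error=11 (EAGAIN) from fork/pthread_create usually means the per-user process/thread limit has been reached or no more native threads can be created. A quick way to check inside the container (PID 15 is assumed from the reproduction below; substitute your own TorchServe PID):

$ ulimit -u                 # per-user limit on processes/threads
$ ps -o nlwp= -p 15         # threads currently owned by the TorchServe JVM
$ ps -eLf | wc -l           # total threads visible in the container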

@msaroufim added the workflowx (Issues related to workflow / ensemble models) label on Mar 16, 2022
@jack-gits (Contributor, Author)

Any update?

@msaroufim added the bug (Something isn't working) and urgent labels on Mar 22, 2022
@msaroufim added this to the v0.6.0 milestone on Mar 22, 2022
@maaquib (Collaborator) commented on Mar 23, 2022

I can reproduce this using the dog-cat classification workflow example:

$ curl -X POST "http://127.0.0.1:8081/workflows?url=dog_breed_wf.war"
{
  "status": "Workflow dog_breed_wf has been registered and scaled successfully."
}
$ ps -efT | cat | grep wf_store | grep -v grep | wc -l
52
$ curl https://raw.githubusercontent.com/udacity/dog-project/master/images/Labrador_retriever_06457.jpg -o Dog1.jpg
$ curl -s http://127.0.0.1:8080/wfpredict/dog_breed_wf -T Dog1.jpg > /dev/null
$ ps -efT | cat | grep wf_store | grep -v grep | wc -l
60
$ curl -s http://127.0.0.1:8080/wfpredict/dog_breed_wf -T Dog1.jpg > /dev/null
$ ps -efT | cat | grep wf_store | grep -v grep | wc -l
66
$ for i in {1..100}; do curl -s http://127.0.0.1:8080/wfpredict/dog_breed_wf -T Dog1.jpg > /dev/null; done
$ ps -efT | cat | grep wf_store | grep -v grep | wc -l
407
$ ps -efT | cat | grep wf_store | head -1
model-s+    15    15     1  0 18:13 pts/0    00:00:00 java -Dmodel_server_home=/home/venv/lib/python3.8/site-packages -Djava.io.tmpdir=/home/model-server/tmp -cp .:/home/venv/lib/python3.8/site-packages/ts/frontend/* org.pytorch.serve.ModelServer --python /home/venv/bin/python -s model_store/ -w wf_store/ -ncs
  • The number of WAITING (parking) threads increases by 3 with every inference request:
$ jstack 15 | grep WAITING | wc -l
335
$ jstack 15 | grep "ThreadPoolExecutor.runWorker" | wc -l
334

From the heap dump:

"pool-100-thread-1" #378 prio=5 os_prio=0 cpu=0.35ms elapsed=726.22s tid=0x00007f3070213800 nid=0x281 waiting on condition  [0x00007f2f23027000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.13/Native Method)
	- parking to wait for  <0x0000000424864c40> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.13/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.13/AbstractQueuedSynchronizer.java:2081)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.13/LinkedBlockingQueue.java:433)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.13/ThreadPoolExecutor.java:1054)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.13/ThreadPoolExecutor.java:1114)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.13/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.13/Thread.java:829)
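
The growing pool-NNN prefixes in the thread names suggest (an inference from the dump, not confirmed in this thread) that each workflow inference creates fresh ExecutorService instances that are never shut down, so their idle workers stay parked forever. A quick way to check, again assuming the TorchServe JVM is PID 15:

$ jstack 15 | grep -o 'pool-[0-9]*-' | sort -u | wc -l   # distinct executor pools; grows per request if pools leak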

maaquib added a commit to maaquib/serve that referenced this issue Mar 23, 2022
@jack-gits (Contributor, Author)

When will this be released?

@msaroufim (Member)

As soon as the PR is merged, it takes about a day for the fix to show up in the nightly builds: https://pypi.org/project/torchserve-nightly/

For an official release, we will probably include this in 0.6; we're still discussing an exact date with the team.
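
For anyone waiting on the fix, once it lands in a nightly it should be installable with something like the following (package name taken from the PyPI link above):

$ pip install --upgrade torchserve-nightly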

maaquib added six commits to maaquib/serve that referenced this issue on Apr 6, 2022