can't start new thread error after running for a while #170

Closed
ssube opened this issue Feb 18, 2023 · 8 comments
Labels
scope/api · status/fixed (issues that have been fixed and released) · type/bug (broken features)
Milestone
v0.8

Comments


ssube commented Feb 18, 2023

Depending on how many requests are made, and possibly how many are cancelled, there is a thread leak that eventually exhausts the worker pool and prevents Flask from responding to new requests:

----------------------------------------
Exception occurred during processing of request from ('10.2.2.77', 63823)
Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 697, in process_request
    t.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
----------------------------------------
----------------------------------------
Exception occurred during processing of request from ('10.2.2.77', 63824)
Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 697, in process_request
    t.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
----------------------------------------
@ssube ssube added status/new issues that have not been confirmed yet type/bug broken features scope/api labels Feb 18, 2023
@ssube ssube added this to the v0.8 milestone Feb 18, 2023

ssube commented Feb 18, 2023

This looks like it may be a memory leak rather than a thread leak:

error: Traceback (most recent call last):
  File "/home/ssube/onnx-web/api/onnx_web/server/device_pool.py", line 243, in job_done
    f.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ssube/onnx-web/api/onnx_web/diffusion/run.py", line 32, in run_txt2img_pipeline
    pipe = load_pipeline(
  File "/home/ssube/onnx-web/api/onnx_web/diffusion/load.py", line 185, in load_pipeline
    pipe = pipeline.from_pretrained(
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 739, in from_pretrained
    loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/diffusers/pipelines/onnx_utils.py", line 206, in from_pretrained
    return cls._from_pretrained(
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/diffusers/pipelines/onnx_utils.py", line 173, in _from_pretrained
    model = OnnxRuntimeModel.load_model(
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/diffusers/pipelines/onnx_utils.py", line 78, in load_model
    return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 395, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 58982400
[2023-02-18 20:48:18,408] DEBUG: onnx_web.utils: running garbage collection with 3 active threads
[2023-02-18 20:48:18,455] DEBUG: onnx_web.utils: running Torch garbage collection for device: cuda - CUDAExecutionProvider ({'device_id': 0})
[2023-02-18 20:48:18,455] DEBUG: onnx_web.utils: remaining CUDA VRAM usage: 1413611520 of 25632964608
----------------------------------------
Exception occurred during processing of request from ('10.2.2.16', 36308)
Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 697, in process_request
    t.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
----------------------------------------

[screenshot: per-process memory usage, with virtual memory nearly maxed out]

There are only 3 active threads, per the logs, and VRAM usage is reasonable, but virtual memory has been pretty much maxed out across all of the threads.

This happened after 86 requests, using this repro script:

test_images=0
while true;
do
  curl 'http://10.2.2.79:5000/api/txt2img?cfg=16.00&steps=35&scheduler=deis-multi&seed=-1&prompt=an+astronaut+eating+a+hamburger&negativePrompt=&model=diffusion-snow-globe-v1&platform=any&upscaling=upscaling-real-esrgan-x2-plus&correction=correction-codeformer&lpw=false&width=512&height=512&upscaleOrder=correction-both' \
    -X 'POST' \
    --compressed \
    --insecure || break;
  ((test_images++));
  echo "waiting after $test_images";
  sleep 35;
done
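
To watch the leak from outside the API process while the loop above runs, something like the following works; this is a minimal sketch assuming psutil is installed and the server's PID has been looked up (neither is part of the repro script or the onnx-web API):

# Hypothetical monitoring script, not part of onnx-web: prints the server's
# virtual memory, resident memory, and thread count once per repro-loop cycle.
import time
import psutil

api = psutil.Process(12345)  # replace 12345 with the actual Flask/waitress PID
while True:
    mem = api.memory_info()
    print(f"vms={mem.vms / 1024 ** 3:.1f} GiB rss={mem.rss / 1024 ** 3:.1f} GiB threads={api.num_threads()}")
    time.sleep(35)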


ssube commented Feb 19, 2023

Tested again with a fresh server; the error happened after 94 images:

[2023-02-19 00:56:16,036] DEBUG: onnx_web.utils: running garbage collection with 3 active threads
[2023-02-19 00:56:16,084] DEBUG: onnx_web.utils: running Torch garbage collection for device: cuda - CUDAExecutionProvider ({'device_id': 0})
[2023-02-19 00:56:16,085] DEBUG: onnx_web.utils: remaining CUDA VRAM usage: 2246180864 of 25632964608
----------------------------------------
Exception occurred during processing of request from ('10.2.2.16', 37058)
Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 697, in process_request
    t.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
----------------------------------------

Virtual memory was very high, as before:

[screenshot: per-process memory usage, again showing very high virtual memory]

@ssube ssube added status/progress issues that are in progress and have a branch and removed status/new issues that have not been confirmed yet labels Feb 19, 2023

ssube commented Feb 19, 2023

https://stackoverflow.com/a/66130494 suggests this could be related to, and possibly resolved by, #159


ssube commented Feb 19, 2023

Running under waitress lasted for 93 images and gave a more useful error:

Traceback (most recent call last):
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/home/ssube/onnx-web/api/onnx_web/serve.py", line 558, in txt2img
    executor.submit(
  File "/home/ssube/onnx-web/api/onnx_web/server/device_pool.py", line 236, in submit
    context = JobContext(key, self.devices, device_index=device)
  File "/home/ssube/onnx-web/api/onnx_web/server/device_pool.py", line 33, in __init__
    self.cancel = Value("B", cancel)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 135, in Value
    return Value(typecode_or_type, *args, lock=lock,
  File "/usr/lib/python3.10/multiprocessing/sharedctypes.py", line 79, in Value
    lock = ctx.RLock()
  File "/usr/lib/python3.10/multiprocessing/context.py", line 73, in RLock
    return RLock(ctx=self.get_context())
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 187, in __init__
    SemLock.__init__(self, RECURSIVE_MUTEX, 1, 1, ctx=ctx)
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 57, in __init__
    sl = self._semlock = _multiprocessing.SemLock(
OSError: [Errno 12] Cannot allocate memory

with this output from the repro script:

waiting after 93                        
<!doctype html>               
<html lang=en>                                                                                                                                                                      
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>                   
waiting after 94    
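
The OSError: [Errno 12] Cannot allocate memory is raised from _multiprocessing.SemLock, which JobContext reaches through Value("B", cancel). A minimal sketch of that failure mode, purely illustrative and not the onnx-web code, assuming the per-job flags stay referenced somewhere (for example from pending futures):

# Each multiprocessing.Value builds an RLock backed by an OS-level SemLock and
# places the ctypes value in shared memory. Creating one per job and keeping the
# references alive means none of it is released, until SemLock creation eventually
# fails with OSError: [Errno 12] Cannot allocate memory, as in the traceback above.
from multiprocessing import Value

pending_flags = []
for _ in range(1000):
    cancel = Value("B", 0)        # one lock + shared-memory slot per job
    pending_flags.append(cancel)  # holding the reference keeps them alive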


ssube commented Feb 24, 2023

This is not related to the requests or the HTTP server itself: if many jobs are queued up, the memory leak keeps growing as they run, even after the requests stop.

microsoft/onnxruntime#14641


ssube commented Feb 27, 2023

This should be fixed by #205, which switches to a process worker pool. I ran the memory leak test script for 600 images with no problems.
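
For reference, the general shape of a per-device worker process; this is a sketch of the approach with illustrative names, not the actual implementation in #205:

# Jobs are pushed onto a Queue and executed inside a long-lived child process,
# so native memory leaked by ONNX Runtime sessions stays in the child and is
# reclaimed whenever that worker exits or is recycled.
from multiprocessing import Process, Queue

def device_worker(device, jobs):
    while True:
        job = jobs.get()
        if job is None:  # sentinel tells the worker to shut down
            break
        fn, args, kwargs = job
        fn(device, *args, **kwargs)  # run the pipeline inside the child

if __name__ == "__main__":
    jobs = Queue()
    worker = Process(target=device_worker, args=("cuda:0", jobs), daemon=True)
    worker.start()
    # ... submit jobs with jobs.put((fn, args, kwargs)) ...
    jobs.put(None)
    worker.join()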

@ssube ssube added status/fixed issues that have been fixed and released and removed status/progress issues that are in progress and have a branch labels Mar 1, 2023

ssube commented Mar 3, 2023

I've tested the solution further, up through 1051 images, made a few improvements, and found a few issues:

  • this is a custom ProcessPoolExecutor and a fairly complex bit of code, so it will need thorough tests
  • it is possible to leak a device worker process, which may keep VRAM allocated
  • calling .terminate() on the device workers breaks their pending job queues, but only on Windows
  • shutting the server down often requires pressing Ctrl+C twice, but I think that can be fixed

@ssube ssube closed this as completed Mar 3, 2023
@ssube ssube mentioned this issue Mar 5, 2023

ssube commented Mar 12, 2023

With the v0.8.1 fixes, I've tested this up through 4242 images in a single series.

There is still a bug when submitting multiple images in quick succession: if a worker is recycled in between, it may attempt to start a second job on the same device (given enough memory), and one of the jobs is likely to fail in some way.
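
One possible mitigation, sketched here with hypothetical names and not something the project necessarily does: track which devices have an active job in the parent process, and only dispatch to a device once its previous job has been marked done, even if the worker behind it was recycled in between:

# Hypothetical busy-device tracking in the scheduler; DeviceTracker and its
# methods are illustrative names, not the onnx-web API.
from threading import Lock

class DeviceTracker:
    def __init__(self, devices):
        self.lock = Lock()
        self.busy = {device: False for device in devices}

    def acquire(self, device):
        # returns True only if the device was idle; otherwise the caller should
        # requeue the job instead of starting a second one on the same device
        with self.lock:
            if self.busy[device]:
                return False
            self.busy[device] = True
            return True

    def release(self, device):
        with self.lock:
            self.busy[device] = False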
