can't start new thread error after running for a while #170

Closed
ssube opened this issue Feb 18, 2023 · 8 comments
Labels
scope/api · status/fixed (issues that have been fixed and released) · type/bug (broken features)
Milestone
v0.8

Comments


ssube commented Feb 18, 2023

Depending on how many requests are made, and possibly how many are cancelled, there is a thread leak that eventually exhausts the worker pool and prevents Flask from responding to new requests:

----------------------------------------
Exception occurred during processing of request from ('10.2.2.77', 63823)
Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 697, in process_request
    t.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
----------------------------------------
----------------------------------------
Exception occurred during processing of request from ('10.2.2.77', 63824)
Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 697, in process_request
    t.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
----------------------------------------
@ssube ssube added status/new issues that have not been confirmed yet type/bug broken features scope/api labels Feb 18, 2023
@ssube ssube added this to the v0.8 milestone Feb 18, 2023

ssube commented Feb 18, 2023

This looks like it may be a memory leak rather than a thread leak:

error: Traceback (most recent call last):
  File "/home/ssube/onnx-web/api/onnx_web/server/device_pool.py", line 243, in job_done
    f.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ssube/onnx-web/api/onnx_web/diffusion/run.py", line 32, in run_txt2img_pipeline
    pipe = load_pipeline(
  File "/home/ssube/onnx-web/api/onnx_web/diffusion/load.py", line 185, in load_pipeline
    pipe = pipeline.from_pretrained(
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 739, in from_pretrained
    loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/diffusers/pipelines/onnx_utils.py", line 206, in from_pretrained
    return cls._from_pretrained(
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/diffusers/pipelines/onnx_utils.py", line 173, in _from_pretrained
    model = OnnxRuntimeModel.load_model(
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/diffusers/pipelines/onnx_utils.py", line 78, in load_model
    return ort.InferenceSession(path, providers=[provider], sess_options=sess_options)
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 395, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 58982400
[2023-02-18 20:48:18,408] DEBUG: onnx_web.utils: running garbage collection with 3 active threads
[2023-02-18 20:48:18,455] DEBUG: onnx_web.utils: running Torch garbage collection for device: cuda - CUDAExecutionProvider ({'device_id': 0})
[2023-02-18 20:48:18,455] DEBUG: onnx_web.utils: remaining CUDA VRAM usage: 1413611520 of 25632964608
----------------------------------------
Exception occurred during processing of request from ('10.2.2.16', 36308)
Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 697, in process_request
    t.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
----------------------------------------

[screenshot: per-process memory usage, with virtual memory nearly maxed out]

There are only 3 active threads, per the logs, and VRAM usage is reasonable, but virtual memory has been pretty much maxed out across all of the threads.

This happened after 86 requests, using this repro script:

test_images=0
while true;
do
  curl 'http://10.2.2.79:5000/api/txt2img?cfg=16.00&steps=35&scheduler=deis-multi&seed=-1&prompt=an+astronaut+eating+a+hamburger&negativePrompt=&model=diffusion-snow-globe-v1&platform=any&upscaling=upscaling-real-esrgan-x2-plus&correction=correction-codeformer&lpw=false&width=512&height=512&upscaleOrder=correction-both' \
    -X 'POST' \
    --compressed \
    --insecure || break;
  ((test_images++));
  echo "waiting after $test_images";
  sleep 35;
done
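
To watch the leak from outside the API process while the loop above runs, something like the following works; this is a minimal sketch assuming psutil is installed and the server's PID has been looked up (neither is part of the repro script or the onnx-web API):

# Hypothetical monitoring script, not part of onnx-web: prints the server's
# virtual memory, resident memory, and thread count once per repro-loop cycle.
import time
import psutil

api = psutil.Process(12345)  # replace 12345 with the actual Flask/waitress PID
while True:
    mem = api.memory_info()
    print(f"vms={mem.vms / 1024 ** 3:.1f} GiB rss={mem.rss / 1024 ** 3:.1f} GiB threads={api.num_threads()}")
    time.sleep(35)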


ssube commented Feb 19, 2023

Tested again with a fresh server; the error happened after 94 images:

[2023-02-19 00:56:16,036] DEBUG: onnx_web.utils: running garbage collection with 3 active threads
[2023-02-19 00:56:16,084] DEBUG: onnx_web.utils: running Torch garbage collection for device: cuda - CUDAExecutionProvider ({'device_id': 0})
[2023-02-19 00:56:16,085] DEBUG: onnx_web.utils: remaining CUDA VRAM usage: 2246180864 of 25632964608
----------------------------------------
Exception occurred during processing of request from ('10.2.2.16', 37058)
Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 697, in process_request
    t.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
----------------------------------------

Virtual memory was very high, as before:

[screenshot: per-process memory usage, again showing very high virtual memory]

@ssube ssube added status/progress issues that are in progress and have a branch and removed status/new issues that have not been confirmed yet labels Feb 19, 2023

ssube commented Feb 19, 2023

https://stackoverflow.com/a/66130494 suggests this could be related to, and possibly resolved by, #159


ssube commented Feb 19, 2023

Running under waitress lasted for 93 images and gave a more useful error:

Traceback (most recent call last):
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/ssube/onnx-web/api/onnx_env/lib/python3.10/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/home/ssube/onnx-web/api/onnx_web/serve.py", line 558, in txt2img
    executor.submit(
  File "/home/ssube/onnx-web/api/onnx_web/server/device_pool.py", line 236, in submit
    context = JobContext(key, self.devices, device_index=device)
  File "/home/ssube/onnx-web/api/onnx_web/server/device_pool.py", line 33, in __init__
    self.cancel = Value("B", cancel)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 135, in Value
    return Value(typecode_or_type, *args, lock=lock,
  File "/usr/lib/python3.10/multiprocessing/sharedctypes.py", line 79, in Value
    lock = ctx.RLock()
  File "/usr/lib/python3.10/multiprocessing/context.py", line 73, in RLock
    return RLock(ctx=self.get_context())
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 187, in __init__
    SemLock.__init__(self, RECURSIVE_MUTEX, 1, 1, ctx=ctx)
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 57, in __init__
    sl = self._semlock = _multiprocessing.SemLock(
OSError: [Errno 12] Cannot allocate memory

with this output from the repro script:

waiting after 93                        
<!doctype html>               
<html lang=en>                                                                                                                                                                      
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>                   
waiting after 94    
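
The OSError: [Errno 12] Cannot allocate memory is raised from _multiprocessing.SemLock, which JobContext reaches through Value("B", cancel). A minimal sketch of that failure mode, purely illustrative and not the onnx-web code, assuming the per-job flags stay referenced somewhere (for example from pending futures):

# Each multiprocessing.Value builds an RLock backed by an OS-level SemLock and
# places the ctypes value in shared memory. Creating one per job and keeping the
# references alive means none of it is released, until SemLock creation eventually
# fails with OSError: [Errno 12] Cannot allocate memory, as in the traceback above.
from multiprocessing import Value

pending_flags = []
for _ in range(1000):
    cancel = Value("B", 0)        # one lock + shared-memory slot per job
    pending_flags.append(cancel)  # holding the reference keeps them alive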


ssube commented Feb 24, 2023

This is not related to the requests or the HTTP server itself: if many jobs are queued up, the memory leak keeps growing as they run, even after the requests stop.

microsoft/onnxruntime#14641


ssube commented Feb 27, 2023

This should be fixed by #205, which switches to a process worker pool. I ran the memory leak test script for 600 images with no problems.
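
For reference, the general shape of a per-device worker process; this is a sketch of the approach with illustrative names, not the actual implementation in #205:

# Jobs are pushed onto a Queue and executed inside a long-lived child process,
# so native memory leaked by ONNX Runtime sessions stays in the child and is
# reclaimed whenever that worker exits or is recycled.
from multiprocessing import Process, Queue

def device_worker(device, jobs):
    while True:
        job = jobs.get()
        if job is None:  # sentinel tells the worker to shut down
            break
        fn, args, kwargs = job
        fn(device, *args, **kwargs)  # run the pipeline inside the child

if __name__ == "__main__":
    jobs = Queue()
    worker = Process(target=device_worker, args=("cuda:0", jobs), daemon=True)
    worker.start()
    # ... submit jobs with jobs.put((fn, args, kwargs)) ...
    jobs.put(None)
    worker.join()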

@ssube ssube added status/fixed issues that have been fixed and released and removed status/progress issues that are in progress and have a branch labels Mar 1, 2023

ssube commented Mar 3, 2023

I've tested the solution further, up through 1051 images, made a few improvements, and found a few issues:

  • this is a custom ProcessPoolExecutor and a fairly complex bit of code, so it will need thorough tests
  • it is possible to leak a device worker process, which may keep VRAM allocated
  • calling .terminate() on the device workers breaks their pending job queues, but only on Windows
  • shutting the server down often requires pressing Ctrl+C twice, but I think that can be fixed

@ssube ssube closed this as completed Mar 3, 2023
@ssube ssube mentioned this issue Mar 5, 2023

ssube commented Mar 12, 2023

With the v0.8.1 fixes, I've tested this up through 4242 images in a single series.

There is still a bug when submitting multiple images in quick succession: if a worker is recycled in between, it may attempt to start a second job on the same device (given enough memory), and one of the jobs is likely to fail in some way.
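
One possible mitigation, sketched here with hypothetical names and not something the project necessarily does: track which devices have an active job in the parent process, and only dispatch to a device once its previous job has been marked done, even if the worker behind it was recycled in between:

# Hypothetical busy-device tracking in the scheduler; DeviceTracker and its
# methods are illustrative names, not the onnx-web API.
from threading import Lock

class DeviceTracker:
    def __init__(self, devices):
        self.lock = Lock()
        self.busy = {device: False for device in devices}

    def acquire(self, device):
        # returns True only if the device was idle; otherwise the caller should
        # requeue the job instead of starting a second one on the same device
        with self.lock:
            if self.busy[device]:
                return False
            self.busy[device] = True
            return True

    def release(self, device):
        with self.lock:
            self.busy[device] = False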
