
model cache not always keeping models loaded #227

Closed
ssube opened this issue Mar 9, 2023 · 4 comments
Labels: scope/api, status/fixed, type/bug


ssube commented Mar 9, 2023

Something is causing models to be unloaded from the cache between requests, which defeats the point of caching models in memory in the first place. I've stepped through the code in the debugger and added logging to the cache's .set and .drop methods, with no luck so far.

Caching models between jobs on the same device provides a substantial speedup, since loading and optimizing the models takes 5-10 seconds.
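For context, the cache involved has roughly the following shape: a minimal sketch assuming a list-backed cache with a size limit and logging on each operation. The method names match the log messages in this issue, but the bodies are illustrative, not the actual onnx_web code.

```python
import logging
from typing import Any, List, Tuple

logger = logging.getLogger(__name__)


class ModelCache:
    """List-backed model cache with a size limit; illustrative sketch only."""

    def __init__(self, limit: int):
        self.cache: List[Tuple[str, Any, Any]] = []
        self.limit = limit

    def get(self, tag: str, key: Any) -> Any:
        logger.debug("cache size: %s", len(self.cache))
        for t, k, model in self.cache:
            if (t, k) == (tag, key):
                return model
        logger.debug("model not found in cache: %s %s", tag, key)
        return None

    def set(self, tag: str, key: Any, model: Any) -> None:
        logger.debug("adding new model to cache: %s %s", tag, key)
        self.cache.append((tag, key, model))
        self.prune()

    def drop(self, tag: str, key: Any) -> None:
        logger.debug("dropping item from cache: %s %s", tag, key)
        self.cache[:] = [e for e in self.cache if (e[0], e[1]) != (tag, key)]

    def prune(self) -> None:
        total = len(self.cache)
        if total > self.limit:
            # evict the oldest entries first
            del self.cache[: total - self.limit]
        else:
            logger.debug("model cache below limit, %s of %s", total, self.limit)
```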

@ssube ssube added status/new issues that have not been confirmed yet type/bug broken features labels Mar 9, 2023
@ssube ssube added this to the v0.9 milestone Mar 9, 2023
@ssube ssube changed the title from "model cache not always keeping models" to "model cache not always keeping models loaded" Mar 10, 2023
@ssube ssube modified the milestones: v0.9, v0.8.1 Mar 11, 2023

ssube commented Mar 11, 2023

With extra logging, including PID and thread ID, there is no indication that the code is removing items from the cache:

[2023-03-11 14:32:42,958] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:32:42,958] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model not found in cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:32:42,958] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:32:42,958] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: dropping item from cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: adding new model to cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 1
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model cache below limit, 1 of 4
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 1
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: adding new model to cache: scheduler ('deis-multi', '../models/stable-diffusion-onnx-v1-5')
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 2
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model cache below limit, 2 of 4
[2023-03-11 14:33:33,475] DEBUG: 1995558 140595396067328 onnx_web.server.hacks: patching download path: https://github.com/sczhou/CodeFormer/releases/download/v0.1.0/codeformer.pth -> ../models/.cache/correction-codeformer.pth
[2023-03-11 14:33:33,836] DEBUG: 1995558 140595396067328 onnx_web.server.hacks: patching download path: https://s3.eu-central-1.wasabisys.com/nextml-model-data/codeformer/weights/facelib/detection_Resnet50_Final.pth -> ../models/.cache/detection-resnet50-final.pth
[2023-03-11 14:33:33,987] DEBUG: 1995558 140595396067328 onnx_web.server.hacks: patching download path: https://s3.eu-central-1.wasabisys.com/nextml-model-data/codeformer/weights/facelib/parsing_parsenet.pth -> ../models/.cache/parsing-parsenet.pth
[2023-03-11 14:33:34,712] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 2
[2023-03-11 14:33:34,712] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model not found in cache: resrgan ('../models/upscaling-real-esrgan-x4-plus.onnx', 'onnx')
[2023-03-11 14:33:35,063] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 2
[2023-03-11 14:33:35,063] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: adding new model to cache: resrgan ('../models/upscaling-real-esrgan-x4-plus.onnx', 'onnx')
[2023-03-11 14:33:35,063] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 3
[2023-03-11 14:33:35,063] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model cache below limit, 3 of 4
[2023-03-11 14:35:19,278] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:35:19,278] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model not found in cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:35:19,278] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:35:19,278] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: dropping item from cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:35:21,341] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:35:21,341] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: adding new model to cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:35:21,341] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 1

However, watching the startup sequence with PID/TID in the logs does indicate that the cache was created on the main thread, not the worker:

[2023-03-11 14:41:28,661] DEBUG: 1998132 140397405224960 onnx_web.server.context: cache size set True, 4                                                                                                                            
[2023-03-11 14:41:28,661] DEBUG: 1998132 140397405224960 onnx_web.server.model_cache: creating model cache with limit of 4 models
[2023-03-11 14:41:30,128] DEBUG: 1998235 140531136765952 onnx_web.server.model_cache: creating model cache with limit of 5 models

That does not fully explain the inner list reverting to 0 items, but it does suggest a multiprocessing bug: the cache is being created before the worker processes start, so each process ends up with its own copy.

@ssube ssube self-assigned this Mar 11, 2023
@ssube ssube added status/progress issues that are in progress and have a branch and removed status/new issues that have not been confirmed yet labels Mar 11, 2023

ssube commented Mar 11, 2023

Creating a cache within the worker process does not seem to make a difference, but I was able to confirm that the cache itself is being replaced by including a few id()s in the logs:

[2023-03-11 15:58:01,483] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: creating model cache with limit of 5 models, 140458031911296, 140451001093440
[2023-03-11 16:12:26,114] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: get cache size: 0, 140451000988816, 140451012225024
[2023-03-11 16:12:26,114] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: model not found in cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 16:12:26,114] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: drop cache size: 0, 140451000988816, 140451012225024
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: set cache size: 0, 140451000988816, 140451012225024
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: adding new model to cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: prune cache size: 1, 140451000988816, 140451012225024
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: set cache size: 1, 140451000988816, 140451012225024
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: adding new model to cache: scheduler ('deis-multi', '../models/stable-diffusion-onnx-v1-5')
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: prune cache size: 2, 140451000988816, 140451012225024
[2023-03-11 16:13:30,109] DEBUG: 2017857 140458038505472 onnx_web.server.hacks: patching download path: https://github.com/sczhou/CodeFormer/releases/download/v0.1.0/codeformer.pth -> ../models/.cache/correction-codeformer.pth
[2023-03-11 16:13:30,472] DEBUG: 2017857 140458038505472 onnx_web.server.hacks: patching download path: https://s3.eu-central-1.wasabisys.com/nextml-model-data/codeformer/weights/facelib/detection_Resnet50_Final.pth -> ../models/.cache/detection-resnet50-final.pth
[2023-03-11 16:13:30,677] DEBUG: 2017857 140458038505472 onnx_web.server.hacks: patching download path: https://s3.eu-central-1.wasabisys.com/nextml-model-data/codeformer/weights/facelib/parsing_parsenet.pth -> ../models/.cache/parsing-parsenet.pth
[2023-03-11 16:13:31,564] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: get cache size: 0, 140437651356400, 140438182871296
[2023-03-11 16:13:31,564] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: model not found in cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 16:13:31,564] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: drop cache size: 0, 140437651356400, 140438182871296
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: set cache size: 0, 140437651356400, 140438182871296
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: adding new model to cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: prune cache size: 1, 140437651356400, 140438182871296
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: set cache size: 1, 140437651356400, 140438182871296
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: adding new model to cache: scheduler ('deis-multi', '../models/stable-diffusion-onnx-v1-5')
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: prune cache size: 2, 140437651356400, 140438182871296
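The id() trick generalizes: logging id(self.cache) and id(self) in the constructor and in each accessor makes a silent object replacement visible, even when the PID and thread id stay constant. A simplified sketch of that instrumentation, with hypothetical method bodies:

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


class ModelCache:
    """Sketch of id()-based instrumentation; not the actual onnx_web code."""

    def __init__(self, limit: int):
        self.cache: list = []
        self.limit = limit
        logger.debug(
            "creating model cache with limit of %s models, %s, %s",
            limit, id(self.cache), id(self),
        )

    def get(self, tag: str, key: Any) -> Any:
        # If these ids differ between log lines while PID and thread id stay
        # the same, a different cache object is being consulted each time.
        logger.debug(
            "get cache size: %s, %s, %s",
            len(self.cache), id(self.cache), id(self),
        )
        for t, k, model in self.cache:
            if (t, k) == (tag, key):
                return model
        return None
```

In the logs above, both ids change between 16:12 and 16:13 while the PID stays 2017857, which is exactly the signature of the cache object being replaced rather than emptied.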


ssube commented Mar 11, 2023

Adding a __del__ method with some logging makes it look like the cache is being garbage collected. That is surprising, since the cache is attached to the server context, and the context is used in every job:

[2023-03-11 16:39:22,237] DEBUG: 2026024 140270594187264 onnx_web.server.model_cache: deleting cache with 2 models, 140263754283712, 140263995124608                             
Exception ignored in: <function ModelCache.__del__ at 0x7f91c179a680>                                                                                                                                                               
Traceback (most recent call last):                                                                                                                                                                                                  
  File "/opt/onnx-web/api/onnx_web/server/model_cache.py", line 19, in __del__                                                                                                                                                      
    raise Exception("deleting cache with models in it")                                                           
Exception: deleting cache with models in it  
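This tripwire pattern works because Python reports exceptions raised during finalization as "Exception ignored in: ..." rather than propagating them, which matches the traceback above. A simplified sketch of the __del__ hook (illustrative, not the exact code):

```python
import logging

logger = logging.getLogger(__name__)


class ModelCache:
    """Sketch of a __del__ tripwire to catch unexpected garbage collection."""

    def __init__(self, limit: int):
        self.cache: list = []
        self.limit = limit

    def __del__(self):
        # Fires when the cache object is collected; if models are still
        # inside, something discarded the cache out from under the server.
        logger.debug(
            "deleting cache with %s models, %s, %s",
            len(self.cache), id(self.cache), id(self),
        )
        if len(self.cache) > 0:
            raise Exception("deleting cache with models in it")
```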

Holding a reference to the cache in the pre-job logs does not help either:

[2023-03-11 16:46:08,393] INFO: 2028469 139889488232448 onnx_web.worker.worker: worker for cuda got job: txt2img_619685547_dbe11c486d6f570fa9d335dfca23d7616d420fee7afb63cf74cfd9def24718ca_1678553106_0.png (cache size: 0)

@ssube ssube mentioned this issue Mar 11, 2023
@ssube ssube added status/fixed issues that have been fixed and released and removed status/progress issues that are in progress and have a branch labels Mar 11, 2023
@ssube ssube closed this as completed Mar 11, 2023

ssube commented Mar 11, 2023

This is fixed for now by making the cache's internal storage global within each process. That's not elegant, but it prevents the cached models from being garbage collected, since I was not able to find the cause of the collection.
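The workaround can be sketched as module-level storage shared by every cache instance in the process, so the models stay referenced even if a particular ModelCache object is collected. This is a simplified illustration of the approach, not the exact fix:

```python
from typing import Any, List, Tuple

# Module-level storage: one list per process, shared by every ModelCache
# instance in that process. Even if a ModelCache object is garbage
# collected, the cached models themselves stay referenced here.
cache: List[Tuple[str, Any, Any]] = []


class ModelCache:
    def __init__(self, limit: int):
        self.limit = limit

    def get(self, tag: str, key: Any) -> Any:
        for t, k, model in cache:
            if (t, k) == (tag, key):
                return model
        return None

    def set(self, tag: str, key: Any, model: Any) -> None:
        cache.append((tag, key, model))
        # evict the oldest entries beyond the limit
        del cache[: max(0, len(cache) - self.limit)]
```

Because the list outlives any single instance, a replacement ModelCache created mid-run still sees the models cached by its predecessor.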
