
model cache not always keeping models loaded #227

Closed
ssube opened this issue Mar 9, 2023 · 4 comments
Labels: scope/api, status/fixed, type/bug


ssube commented Mar 9, 2023

Something is causing models to be unloaded from the cache between requests, which defeats the point of caching models in memory in the first place. I've stepped through the code in the debugger and added logging to the cache's .set and .drop methods, with no luck so far.

Caching models between jobs on the same device provides a substantial speedup, since loading and optimizing the models takes 5-10 seconds.
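For context, the cache involved has roughly the following shape: a minimal sketch assuming a list-backed cache with a size limit and logging on each operation. The method names match the log messages in this issue, but the bodies are illustrative, not the actual onnx_web code.

```python
import logging
from typing import Any, List, Tuple

logger = logging.getLogger(__name__)


class ModelCache:
    """List-backed model cache with a size limit; illustrative sketch only."""

    def __init__(self, limit: int):
        self.cache: List[Tuple[str, Any, Any]] = []
        self.limit = limit

    def get(self, tag: str, key: Any) -> Any:
        logger.debug("cache size: %s", len(self.cache))
        for t, k, model in self.cache:
            if (t, k) == (tag, key):
                return model
        logger.debug("model not found in cache: %s %s", tag, key)
        return None

    def set(self, tag: str, key: Any, model: Any) -> None:
        logger.debug("adding new model to cache: %s %s", tag, key)
        self.cache.append((tag, key, model))
        self.prune()

    def drop(self, tag: str, key: Any) -> None:
        logger.debug("dropping item from cache: %s %s", tag, key)
        self.cache[:] = [e for e in self.cache if (e[0], e[1]) != (tag, key)]

    def prune(self) -> None:
        total = len(self.cache)
        if total > self.limit:
            # evict the oldest entries first
            del self.cache[: total - self.limit]
        else:
            logger.debug("model cache below limit, %s of %s", total, self.limit)
```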

@ssube ssube added status/new issues that have not been confirmed yet type/bug broken features labels Mar 9, 2023
@ssube ssube added this to the v0.9 milestone Mar 9, 2023
@ssube ssube changed the title from "model cache not always keeping models" to "model cache not always keeping models loaded" Mar 10, 2023
@ssube ssube modified the milestones: v0.9, v0.8.1 Mar 11, 2023

ssube commented Mar 11, 2023

With extra logging, including PID and thread ID, there is no indication that the code is removing items from the cache:

[2023-03-11 14:32:42,958] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:32:42,958] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model not found in cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:32:42,958] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:32:42,958] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: dropping item from cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: adding new model to cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 1
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model cache below limit, 1 of 4
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 1
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: adding new model to cache: scheduler ('deis-multi', '../models/stable-diffusion-onnx-v1-5')
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 2
[2023-03-11 14:32:46,247] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model cache below limit, 2 of 4
[2023-03-11 14:33:33,475] DEBUG: 1995558 140595396067328 onnx_web.server.hacks: patching download path: https://github.com/sczhou/CodeFormer/releases/download/v0.1.0/codeformer.pth -> ../models/.cache/correction-codeformer.pth
[2023-03-11 14:33:33,836] DEBUG: 1995558 140595396067328 onnx_web.server.hacks: patching download path: https://s3.eu-central-1.wasabisys.com/nextml-model-data/codeformer/weights/facelib/detection_Resnet50_Final.pth -> ../models/.cache/detection-resnet50-final.pth
[2023-03-11 14:33:33,987] DEBUG: 1995558 140595396067328 onnx_web.server.hacks: patching download path: https://s3.eu-central-1.wasabisys.com/nextml-model-data/codeformer/weights/facelib/parsing_parsenet.pth -> ../models/.cache/parsing-parsenet.pth
[2023-03-11 14:33:34,712] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 2
[2023-03-11 14:33:34,712] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model not found in cache: resrgan ('../models/upscaling-real-esrgan-x4-plus.onnx', 'onnx')
[2023-03-11 14:33:35,063] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 2
[2023-03-11 14:33:35,063] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: adding new model to cache: resrgan ('../models/upscaling-real-esrgan-x4-plus.onnx', 'onnx')
[2023-03-11 14:33:35,063] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 3
[2023-03-11 14:33:35,063] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model cache below limit, 3 of 4
[2023-03-11 14:35:19,278] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:35:19,278] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: model not found in cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:35:19,278] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:35:19,278] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: dropping item from cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:35:21,341] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 0
[2023-03-11 14:35:21,341] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: adding new model to cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 14:35:21,341] DEBUG: 1995558 140595396067328 onnx_web.server.model_cache: cache size: 1

However, watching the startup sequence with PID/TID in the logs does indicate that the cache was created on the main thread, not the worker:

[2023-03-11 14:41:28,661] DEBUG: 1998132 140397405224960 onnx_web.server.context: cache size set True, 4                                                                                                                            
[2023-03-11 14:41:28,661] DEBUG: 1998132 140397405224960 onnx_web.server.model_cache: creating model cache with limit of 4 models
[2023-03-11 14:41:30,128] DEBUG: 1998235 140531136765952 onnx_web.server.model_cache: creating model cache with limit of 5 models

That does not fully explain the inner list reverting to 0 items, but it does suggest a multiprocessing bug: the cache is being created before the worker processes start, so each process ends up with its own copy.

@ssube ssube self-assigned this Mar 11, 2023
@ssube ssube added status/progress issues that are in progress and have a branch and removed status/new issues that have not been confirmed yet labels Mar 11, 2023

ssube commented Mar 11, 2023

Creating a cache within the worker process does not seem to make a difference, but I was able to confirm that the cache itself is being replaced by including a few id()s in the logs:

[2023-03-11 15:58:01,483] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: creating model cache with limit of 5 models, 140458031911296, 140451001093440
[2023-03-11 16:12:26,114] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: get cache size: 0, 140451000988816, 140451012225024
[2023-03-11 16:12:26,114] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: model not found in cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 16:12:26,114] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: drop cache size: 0, 140451000988816, 140451012225024
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: set cache size: 0, 140451000988816, 140451012225024
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: adding new model to cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: prune cache size: 1, 140451000988816, 140451012225024
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: set cache size: 1, 140451000988816, 140451012225024
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: adding new model to cache: scheduler ('deis-multi', '../models/stable-diffusion-onnx-v1-5')
[2023-03-11 16:12:29,339] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: prune cache size: 2, 140451000988816, 140451012225024
[2023-03-11 16:13:30,109] DEBUG: 2017857 140458038505472 onnx_web.server.hacks: patching download path: https://github.com/sczhou/CodeFormer/releases/download/v0.1.0/codeformer.pth -> ../models/.cache/correction-codeformer.pth
[2023-03-11 16:13:30,472] DEBUG: 2017857 140458038505472 onnx_web.server.hacks: patching download path: https://s3.eu-central-1.wasabisys.com/nextml-model-data/codeformer/weights/facelib/detection_Resnet50_Final.pth -> ../models/.cache/detection-resnet50-final.pth
[2023-03-11 16:13:30,677] DEBUG: 2017857 140458038505472 onnx_web.server.hacks: patching download path: https://s3.eu-central-1.wasabisys.com/nextml-model-data/codeformer/weights/facelib/parsing_parsenet.pth -> ../models/.cache/parsing-parsenet.pth
[2023-03-11 16:13:31,564] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: get cache size: 0, 140437651356400, 140438182871296
[2023-03-11 16:13:31,564] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: model not found in cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 16:13:31,564] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: drop cache size: 0, 140437651356400, 140438182871296
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: set cache size: 0, 140437651356400, 140438182871296
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: adding new model to cache: diffusion ('OnnxStableDiffusionPipeline', '../models/stable-diffusion-onnx-v1-5', 'cuda', 'CUDAExecutionProvider', False, None)
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: prune cache size: 1, 140437651356400, 140438182871296
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: set cache size: 1, 140437651356400, 140438182871296
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: adding new model to cache: scheduler ('deis-multi', '../models/stable-diffusion-onnx-v1-5')
[2023-03-11 16:13:33,572] DEBUG: 2017857 140458038505472 onnx_web.server.model_cache: prune cache size: 2, 140437651356400, 140438182871296
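The id() trick generalizes: logging id(self.cache) and id(self) in the constructor and in each accessor makes a silent object replacement visible, even when the PID and thread id stay constant. A simplified sketch of that instrumentation, with hypothetical method bodies:

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


class ModelCache:
    """Sketch of id()-based instrumentation; not the actual onnx_web code."""

    def __init__(self, limit: int):
        self.cache: list = []
        self.limit = limit
        logger.debug(
            "creating model cache with limit of %s models, %s, %s",
            limit, id(self.cache), id(self),
        )

    def get(self, tag: str, key: Any) -> Any:
        # If these ids differ between log lines while PID and thread id stay
        # the same, a different cache object is being consulted each time.
        logger.debug(
            "get cache size: %s, %s, %s",
            len(self.cache), id(self.cache), id(self),
        )
        for t, k, model in self.cache:
            if (t, k) == (tag, key):
                return model
        return None
```

In the logs above, both ids change between 16:12 and 16:13 while the PID stays 2017857, which is exactly the signature of the cache object being replaced rather than emptied.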


ssube commented Mar 11, 2023

Adding a __del__ method with some logging makes it look like the cache is being garbage collected. That is surprising, since the cache is attached to the server context, and the context is used in every job:

[2023-03-11 16:39:22,237] DEBUG: 2026024 140270594187264 onnx_web.server.model_cache: deleting cache with 2 models, 140263754283712, 140263995124608                             
Exception ignored in: <function ModelCache.__del__ at 0x7f91c179a680>                                                                                                                                                               
Traceback (most recent call last):                                                                                                                                                                                                  
  File "/opt/onnx-web/api/onnx_web/server/model_cache.py", line 19, in __del__                                                                                                                                                      
    raise Exception("deleting cache with models in it")                                                           
Exception: deleting cache with models in it  
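This tripwire pattern works because Python reports exceptions raised during finalization as "Exception ignored in: ..." rather than propagating them, which matches the traceback above. A simplified sketch of the __del__ hook (illustrative, not the exact code):

```python
import logging

logger = logging.getLogger(__name__)


class ModelCache:
    """Sketch of a __del__ tripwire to catch unexpected garbage collection."""

    def __init__(self, limit: int):
        self.cache: list = []
        self.limit = limit

    def __del__(self):
        # Fires when the cache object is collected; if models are still
        # inside, something discarded the cache out from under the server.
        logger.debug(
            "deleting cache with %s models, %s, %s",
            len(self.cache), id(self.cache), id(self),
        )
        if len(self.cache) > 0:
            raise Exception("deleting cache with models in it")
```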

Holding a reference to the cache in the pre-job logs does not help either:

[2023-03-11 16:46:08,393] INFO: 2028469 139889488232448 onnx_web.worker.worker: worker for cuda got job: txt2img_619685547_dbe11c486d6f570fa9d335dfca23d7616d420fee7afb63cf74cfd9def24718ca_1678553106_0.png (cache size: 0)

@ssube ssube mentioned this issue Mar 11, 2023
@ssube ssube added status/fixed issues that have been fixed and released and removed status/progress issues that are in progress and have a branch labels Mar 11, 2023
@ssube ssube closed this as completed Mar 11, 2023

ssube commented Mar 11, 2023

This is fixed for now by making the cache's internal storage global within each process. That's not elegant, but it prevents the cached models from being garbage collected, since I was not able to find the cause of the collection.
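The workaround can be sketched as module-level storage shared by every cache instance in the process, so the models stay referenced even if a particular ModelCache object is collected. This is a simplified illustration of the approach, not the exact fix:

```python
from typing import Any, List, Tuple

# Module-level storage: one list per process, shared by every ModelCache
# instance in that process. Even if a ModelCache object is garbage
# collected, the cached models themselves stay referenced here.
cache: List[Tuple[str, Any, Any]] = []


class ModelCache:
    def __init__(self, limit: int):
        self.limit = limit

    def get(self, tag: str, key: Any) -> Any:
        for t, k, model in cache:
            if (t, k) == (tag, key):
                return model
        return None

    def set(self, tag: str, key: Any, model: Any) -> None:
        cache.append((tag, key, model))
        # evict the oldest entries beyond the limit
        del cache[: max(0, len(cache) - self.limit)]
```

Because the list outlives any single instance, a replacement ModelCache created mid-run still sees the models cached by its predecessor.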
