Diffusion model not fully unloading from gpu when removed from cache #242

bzlibby · 2023-03-13T02:25:27Z

After updating to v0.8.1, I noticed that my gpu memory was hitting 100% after generating 1 image, significantly impacting generation speed; the first image would generate at ~3it/s, whereas the second image with no changes to the model, scheduler, or other settings would generate at ~4.5s/it. Following the logs, the diffusion model I was using was getting removed from the cache and then getting reloaded, but was not fully unloading from my gpu. When I upped my model cache from 2 the 3, the issue was resolved.

task manager right after finishing first image:

task manager right after starting second image:

ssube · 2023-03-13T13:36:50Z

It looks like there are a few factors involved here:

the default cache limit of 2 is too low for a full diffusion-scheduler-upscaling-correction pipeline
- even with a larger limit, the cache will only keep one model of each type, like diffusion
the old diffusion model is being removed from the CPU cache but not the GPU cache
garbage collection on the GPU currently only happens on CUDA platforms (Nvidia): https://github.com/ssube/onnx-web/blob/main/api/onnx_web/utils.py#L94

It's possible that there is a memory leak on the Python side holding that old diffusion model and forcing it to remain active, but it's also possible that I just need to run garbage collection on the GPU side, like torch.cuda.empty_cache() does for CUDA.

For now, increasing the model cache limit to 4-5 (one of each) should be a workaround:

set ONNX_WEB_CACHE_MODELS=5

ssube · 2023-03-18T18:56:58Z

When you have a chance, please try running the new api/scripts/check-env.py script using your API's virtual environment:

> cd api
> onnx_env\Scripts\Activate.bat
> python3 scripts\check-env.py

and post the output. I'm looking for the last few items in particular, which should say whether the CUDA garbage collection functions are available.

The output should look something like this:

> python3 scripts/check-env.py 
['required module onnx is present at version 1.13.0', 'required module diffusers is present at version 0.12.1', 'required module safetensors is present at version 0.2.8', 'required module torch is present at version 1.13.1', 'runtime module onnxruntime is present at version 1.14.0', "unable to import runtime module onnxruntime_gpu: No module named 'onnxruntime_gpu'", "unable to import runtime module onnxruntime_rocm: No module named 'onnxruntime_rocm'", "unable to import runtime module onnxruntime_training: No module named 'onnxruntime_training'", 'onnxruntime provider TensorrtExecutionProvider is missing', 'onnxruntime provider CUDAExecutionProvider is missing', 'onnxruntime provider MIGraphXExecutionProvider is missing', 'onnxruntime provider ROCMExecutionProvider is missing', 'onnxruntime provider OpenVINOExecutionProvider is missing', 'onnxruntime provider DnnlExecutionProvider is missing', 'onnxruntime provider TvmExecutionProvider is missing', 'onnxruntime provider VitisAIExecutionProvider is missing', 'onnxruntime provider NnapiExecutionProvider is missing', 'onnxruntime provider CoreMLExecutionProvider is missing', 'onnxruntime provider ArmNNExecutionProvider is missing', 'onnxruntime provider ACLExecutionProvider is missing', 'onnxruntime provider DmlExecutionProvider is missing', 'onnxruntime provider RknpuExecutionProvider is missing', 'onnxruntime provider XnnpackExecutionProvider is missing', 'onnxruntime provider CANNExecutionProvider is missing', 'onnxruntime provider AzureExecutionProvider is missing', 'onnxruntime provider CPUExecutionProvider is available', 'loaded Torch but CUDA was not available']

bzlibby · 2023-03-19T03:21:48Z

this is the response I get:

['required module onnx is present at version 1.13.0', 'required module diffusers is present at version 0.11.1', 'required module safetensors is present at version 0.2.8', 'required module torch is present at version 1.13.1', 'runtime module onnxruntime is present at version 1.13.1', "unable to import runtime module onnxruntime_gpu: No module named 'onnxruntime_gpu'", "unable to import runtime module onnxruntime_rocm: No module named 'onnxruntime_rocm'", "unable to import runtime module onnxruntime_training: No module named 'onnxruntime_training'", 'onnxruntime provider TensorrtExecutionProvider is missing', 'onnxruntime provider CUDAExecutionProvider is missing', 'onnxruntime provider MIGraphXExecutionProvider is missing', 'onnxruntime provider ROCMExecutionProvider is missing', 'onnxruntime provider OpenVINOExecutionProvider is missing', 'onnxruntime provider DnnlExecutionProvider is missing', 'onnxruntime provider TvmExecutionProvider is missing', 'onnxruntime provider VitisAIExecutionProvider is missing', 'onnxruntime provider NnapiExecutionProvider is missing', 'onnxruntime provider CoreMLExecutionProvider is missing', 'onnxruntime provider ArmNNExecutionProvider is missing', 'onnxruntime provider ACLExecutionProvider is missing', 'onnxruntime provider DmlExecutionProvider is available', 'onnxruntime provider RknpuExecutionProvider is missing', 'onnxruntime provider XnnpackExecutionProvider is missing', 'onnxruntime provider CANNExecutionProvider is missing', 'onnxruntime provider AzureExecutionProvider is missing', 'onnxruntime provider CPUExecutionProvider is available', 'loaded Torch but CUDA was not available']

ssube · 2023-03-19T05:53:39Z

Based on the last item, 'loaded Torch but CUDA was not available', it's going to be difficult to run garbage collection on the GPU. I'll check again for a function in the ONNX runtime, but I haven't been able to find one.

If that's not possible, there are a few new optimizations that I want to add that will reduce memory usage (#241) and there will be a button to restart the GPU worker soon (#207).

ssube added status/new issues that have not been confirmed yet type/bug broken features scope/api labels Mar 13, 2023

ssube added this to the v0.9 milestone Mar 13, 2023

ssube mentioned this issue Mar 17, 2023

fix any remaining memory leaks #162

Closed

ssube added a commit that referenced this issue Mar 18, 2023

fix(api): bump default model cache to 5 (#242)

a9456f4

ssube added a commit that referenced this issue Mar 18, 2023

fix(scripts): check Torch CUDA devices (#242)

391a707

ssube added status/progress issues that are in progress and have a branch and removed status/new issues that have not been confirmed yet labels Mar 18, 2023

ssube added status/blocked in-progress issues that are blocked by a dependency and removed status/progress issues that are in progress and have a branch labels Mar 19, 2023

ssube removed this from the v0.9 milestone Mar 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Diffusion model not fully unloading from gpu when removed from cache #242

Diffusion model not fully unloading from gpu when removed from cache #242

bzlibby commented Mar 13, 2023

ssube commented Mar 13, 2023 •

edited

ssube commented Mar 18, 2023

bzlibby commented Mar 19, 2023

ssube commented Mar 19, 2023

Diffusion model not fully unloading from gpu when removed from cache #242

Diffusion model not fully unloading from gpu when removed from cache #242

Comments

bzlibby commented Mar 13, 2023

ssube commented Mar 13, 2023 • edited

ssube commented Mar 18, 2023

bzlibby commented Mar 19, 2023

ssube commented Mar 19, 2023

ssube commented Mar 13, 2023 •

edited