
Diffusion model not fully unloading from GPU when removed from cache #242

Open
bzlibby opened this issue Mar 13, 2023 · 4 comments
Labels
scope/api · status/blocked (in-progress issues that are blocked by a dependency) · type/bug (broken features)

Comments

@bzlibby
Contributor

bzlibby commented Mar 13, 2023

After updating to v0.8.1, I noticed that my GPU memory was hitting 100% after generating one image, significantly slowing generation: the first image would generate at ~3 it/s, whereas the second image, with no changes to the model, scheduler, or other settings, would generate at ~4.5 s/it. According to the logs, the diffusion model I was using was being removed from the cache and then reloaded, but was not fully unloading from my GPU. When I upped my model cache limit from 2 to 3, the issue was resolved.

(screenshot: task manager right after finishing the first image)

(screenshot: task manager right after starting the second image)

@ssube
Owner

ssube commented Mar 13, 2023

It looks like there are a few factors involved here:

  • the default cache limit of 2 is too low for a full diffusion-scheduler-upscaling-correction pipeline
    • even with a larger limit, the cache will only keep one model of each type, like diffusion (see the sketch after this list)
  • the old diffusion model is being removed from the CPU cache but not the GPU cache
  • garbage collection on the GPU currently only happens on CUDA platforms (Nvidia): https://github.com/ssube/onnx-web/blob/main/api/onnx_web/utils.py#L94
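
To illustrate that per-type eviction, here is a minimal sketch of a size-limited cache that keeps only one model of each type. The class and method names are assumptions for illustration, not the actual onnx-web ModelCache:

```python
# Hypothetical sketch of a per-type, size-limited model cache; names and
# structure are assumptions, not the actual onnx-web implementation.
from collections import OrderedDict
from typing import Any, Tuple

class TypedModelCache:
    def __init__(self, limit: int = 2):
        self.limit = limit
        self.cache: "OrderedDict[Tuple[str, str], Any]" = OrderedDict()

    def set(self, model_type: str, key: str, model: Any) -> None:
        # only one model per type: evict any existing entry with the same type
        for existing in list(self.cache):
            if existing[0] == model_type:
                del self.cache[existing]
        self.cache[(model_type, key)] = model
        # enforce the overall limit by evicting the least recently used entry
        while len(self.cache) > self.limit:
            self.cache.popitem(last=False)

    def get(self, model_type: str, key: str) -> Any:
        model = self.cache.get((model_type, key))
        if model is not None:
            self.cache.move_to_end((model_type, key))
        return model
```

Note that evicting the Python reference this way does not guarantee the GPU copy is freed, which is the behavior reported in this issue.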

It's possible that there is a memory leak on the Python side holding that old diffusion model and forcing it to remain active, but it's also possible that I just need to run garbage collection on the GPU side, like torch.cuda.empty_cache() does for CUDA.
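
For reference, a minimal sketch of that CUDA-side garbage collection (assuming torch is installed; on non-CUDA platforms the conditional never fires, which is the gap described above):

```python
import gc

import torch

def run_gc() -> None:
    # collect Python-side garbage first, so unreferenced tensors become unreachable
    gc.collect()
    # CUDA-only: return cached allocator blocks to the driver and clean up
    # shared-memory handles left over from dead worker processes
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
```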

For now, increasing the model cache limit to 4-5 (one of each) should be a workaround:

set ONNX_WEB_CACHE_MODELS=5
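
Assuming the limit is read from the environment in the usual way (the variable name comes from the command above; the default of 2 matches the default mentioned earlier), the server-side lookup amounts to something like:

```python
from os import environ

# hypothetical lookup: read the cache limit, falling back to the default of 2
cache_limit = int(environ.get("ONNX_WEB_CACHE_MODELS", 2))
```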

@ssube ssube added status/new issues that have not been confirmed yet type/bug broken features scope/api labels Mar 13, 2023
@ssube ssube added this to the v0.9 milestone Mar 13, 2023
@ssube
Owner

ssube commented Mar 18, 2023

When you have a chance, please try running the new api/scripts/check-env.py script using your API's virtual environment:

> cd api
> onnx_env\Scripts\Activate.bat
> python3 scripts\check-env.py

and post the output. I'm looking for the last few items in particular, which should say whether the CUDA garbage collection functions are available.

The output should look something like this:

> python3 scripts/check-env.py 
['required module onnx is present at version 1.13.0', 'required module diffusers is present at version 0.12.1', 'required module safetensors is present at version 0.2.8', 'required module torch is present at version 1.13.1', 'runtime module onnxruntime is present at version 1.14.0', "unable to import runtime module onnxruntime_gpu: No module named 'onnxruntime_gpu'", "unable to import runtime module onnxruntime_rocm: No module named 'onnxruntime_rocm'", "unable to import runtime module onnxruntime_training: No module named 'onnxruntime_training'", 'onnxruntime provider TensorrtExecutionProvider is missing', 'onnxruntime provider CUDAExecutionProvider is missing', 'onnxruntime provider MIGraphXExecutionProvider is missing', 'onnxruntime provider ROCMExecutionProvider is missing', 'onnxruntime provider OpenVINOExecutionProvider is missing', 'onnxruntime provider DnnlExecutionProvider is missing', 'onnxruntime provider TvmExecutionProvider is missing', 'onnxruntime provider VitisAIExecutionProvider is missing', 'onnxruntime provider NnapiExecutionProvider is missing', 'onnxruntime provider CoreMLExecutionProvider is missing', 'onnxruntime provider ArmNNExecutionProvider is missing', 'onnxruntime provider ACLExecutionProvider is missing', 'onnxruntime provider DmlExecutionProvider is missing', 'onnxruntime provider RknpuExecutionProvider is missing', 'onnxruntime provider XnnpackExecutionProvider is missing', 'onnxruntime provider CANNExecutionProvider is missing', 'onnxruntime provider AzureExecutionProvider is missing', 'onnxruntime provider CPUExecutionProvider is available', 'loaded Torch but CUDA was not available']
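
For context, the kind of checks that produce output like this can be sketched as follows (an illustration, not the actual check-env.py; the exact messages are assumptions):

```python
# Probe installed module versions and onnxruntime execution providers.
from importlib.metadata import PackageNotFoundError, version

import onnxruntime
import torch

for module in ("onnx", "diffusers", "safetensors", "torch"):
    try:
        print(f"required module {module} is present at version {version(module)}")
    except PackageNotFoundError:
        print(f"required module {module} is missing")

# compare every known provider against the ones compiled into this build
available = onnxruntime.get_available_providers()
for provider in onnxruntime.get_all_providers():
    status = "available" if provider in available else "missing"
    print(f"onnxruntime provider {provider} is {status}")

if torch.cuda.is_available():
    print("loaded Torch and CUDA is available")
else:
    print("loaded Torch but CUDA was not available")
```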

@ssube ssube added status/progress issues that are in progress and have a branch and removed status/new issues that have not been confirmed yet labels Mar 18, 2023
@bzlibby
Contributor Author

bzlibby commented Mar 19, 2023

This is the response I get:

['required module onnx is present at version 1.13.0', 'required module diffusers is present at version 0.11.1', 'required module safetensors is present at version 0.2.8', 'required module torch is present at version 1.13.1', 'runtime module onnxruntime is present at version 1.13.1', "unable to import runtime module onnxruntime_gpu: No module named 'onnxruntime_gpu'", "unable to import runtime module onnxruntime_rocm: No module named 'onnxruntime_rocm'", "unable to import runtime module onnxruntime_training: No module named 'onnxruntime_training'", 'onnxruntime provider TensorrtExecutionProvider is missing', 'onnxruntime provider CUDAExecutionProvider is missing', 'onnxruntime provider MIGraphXExecutionProvider is missing', 'onnxruntime provider ROCMExecutionProvider is missing', 'onnxruntime provider OpenVINOExecutionProvider is missing', 'onnxruntime provider DnnlExecutionProvider is missing', 'onnxruntime provider TvmExecutionProvider is missing', 'onnxruntime provider VitisAIExecutionProvider is missing', 'onnxruntime provider NnapiExecutionProvider is missing', 'onnxruntime provider CoreMLExecutionProvider is missing', 'onnxruntime provider ArmNNExecutionProvider is missing', 'onnxruntime provider ACLExecutionProvider is missing', 'onnxruntime provider DmlExecutionProvider is available', 'onnxruntime provider RknpuExecutionProvider is missing', 'onnxruntime provider XnnpackExecutionProvider is missing', 'onnxruntime provider CANNExecutionProvider is missing', 'onnxruntime provider AzureExecutionProvider is missing', 'onnxruntime provider CPUExecutionProvider is available', 'loaded Torch but CUDA was not available']

@ssube
Owner

ssube commented Mar 19, 2023

Based on the last item, 'loaded Torch but CUDA was not available', it's going to be difficult to run garbage collection on the GPU. I'll check again for a function in the ONNX runtime, but I haven't been able to find one.

If that's not possible, there are a few new optimizations I want to add that will reduce memory usage (#241), and a button to restart the GPU worker is coming soon (#207).
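
In the meantime, about the only thing that can be done from Python with a DirectML session is to drop every reference to it and let the interpreter finalize it, since onnxruntime does not expose a public empty_cache()-style call. A minimal sketch (the cache shape and function name are assumptions):

```python
import gc

def unload_model(cache: dict, key: str) -> None:
    # drop the cache's reference to the InferenceSession...
    session = cache.pop(key, None)
    # ...and the local one, so the session becomes unreachable
    del session
    # encourage prompt finalization; the session destructor is what
    # actually releases the device memory
    gc.collect()
```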

@ssube ssube added status/blocked in-progress issues that are blocked by a dependency and removed status/progress issues that are in progress and have a branch labels Mar 19, 2023
@ssube ssube removed this from the v0.9 milestone Mar 19, 2023