
Support AMD Ryzen Unified Memory Architecture (UMA) #107605

Open
winstonma opened this issue Aug 21, 2023 · 17 comments
Labels
module: rocm AMD GPU support for Pytorch triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@winstonma

winstonma commented Aug 21, 2023

🚀 The feature, motivation and pitch

Background:
I am using an Asus Zenbook S13 OLED, which has an AMD Ryzen 6800U APU with Radeon 680M integrated graphics. The graphics card uses shared system memory, and its default allocation is 512 MB (please see the screenshot below).
[Screenshot: dedicated GPU memory reported as 512 MB]

On Windows the GPU memory size changes dynamically according to how much GPU memory is required. On Linux, however, it shows only 512 MB of memory (the result of the Auto setting in the BIOS), so when I run Stable Diffusion, PyTorch hits an out-of-memory (OOM) situation. Since the notebook's BIOS doesn't allow users to modify the amount of dedicated memory, would it be possible for PyTorch to support UMA?

Here is a quote from AMD's Ryzen UMA documentation:

The UMA Frame Buffer Size when set to Auto (default setting) allows the system to manage the amount of shared memory for graphics. In this configuration, the size of the UMA frame buffer should scale depending on the amount of available system memory, enabling the system to perform in an optimal state. Therefore, it is recommended to leave the setting on Auto, which is ideal for most types of video processing workloads.

Alternatives

No response

Additional context

Another developer created torch-apu-helper, which uses CUDAPluggableAllocator to take advantage of the shared memory in PyTorch. However, when I tried the code snippet with Stable Diffusion, I got the following error:

RuntimeError: CUDAPluggableAllocator does not yet support getDeviceStats. If you need it, please file an issue describing your use case.
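For reference, here is a rough sketch of how such a pluggable allocator gets hooked into PyTorch. This is not the torch-apu-helper code itself; the library path and the gtt_alloc/gtt_free symbol names are placeholders for a small shared library built around hipHostMalloc/hipHostFree:

import torch

# Load a custom allocator from a prebuilt shared library (hypothetical name)
# that exposes C symbols with the CUDAPluggableAllocator signatures:
#   void* gtt_alloc(size_t size, int device, cudaStream_t stream);
#   void  gtt_free(void* ptr, size_t size, int device, cudaStream_t stream);
allocator = torch.cuda.memory.CUDAPluggableAllocator(
    "./libgttalloc.so", "gtt_alloc", "gtt_free"
)
torch.cuda.memory.change_current_allocator(allocator)

# Plain allocations now go through the custom allocator...
x = torch.empty(1024, 1024, device="cuda")
# ...but anything that queries allocator statistics still fails, for example:
stats = torch.cuda.memory_stats()  # raises: getDeviceStats not supported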

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

@cpuhrsch added the module: rocm and triaged labels Aug 22, 2023
@winstonma changed the title from "Support AMD Smart Access Memory" to "Support AMD Ryzen Unified Memory Architecture (UMA)" Aug 27, 2023
@yiakwy-xpu-ml-framework-team

@winstonma PyTorch's CUDAPluggableAllocator does not natively support the high-level memory management data structures developed by PyTorch, such as segments, blocks, and call frames:

it allocates exactly what you ask it to allocate

That means you have to develop your own caching-strategy wrapper on top of your primitive allocator/deallocator (a sketch of what that means is below). But I think we could do better by reusing the caching strategy developed by PyTorch, couldn't we? cc @jaykchen @cpuhrsch
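To illustrate what such a wrapper means, here is a conceptual sketch, written in Python only for readability; a real wrapper would live inside the C/C++ library that CUDAPluggableAllocator loads, and raw_alloc/raw_free are hypothetical primitives (e.g. thin wrappers over hipHostMalloc/hipHostFree):

from collections import defaultdict

class CachingWrapper:
    # Reuse freed blocks of the same size instead of going back to the driver.
    def __init__(self, raw_alloc, raw_free):
        self.raw_alloc = raw_alloc
        self.raw_free = raw_free
        self.free_blocks = defaultdict(list)  # size -> cached pointers

    def malloc(self, size):
        if self.free_blocks[size]:
            return self.free_blocks[size].pop()  # cache hit: no driver call
        return self.raw_alloc(size)              # cache miss: real allocation

    def free(self, ptr, size):
        self.free_blocks[size].append(ptr)       # keep the block for reuse

    def empty_cache(self):
        # Return everything to the driver, like torch.cuda.empty_cache().
        for size, ptrs in self.free_blocks.items():
            for ptr in ptrs:
                self.raw_free(ptr, size)
        self.free_blocks.clear()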

@winstonma
Author

@yiakwy-xpu-ml-framework-team But it seems the sample in torch-apu-helper can get a piece of shared memory from the system for PyTorch to use. Maybe it is just a simple demo without full functionality, but it shows there is a way for PyTorch to grab a piece of memory from the system.

Compared to booting into the BIOS, defining a dedicated piece of memory for the GPU, and then rebooting the system, I think allocating a fixed piece of memory at runtime is already an improvement.

@dkuku

dkuku commented Jan 7, 2024

llama.cpp added an option to use shared memory on AMD APUs, and I can confirm that it works. This is the pull request:
ggerganov/llama.cpp#4449
With this I can offload a 25 GB model into GPU memory (it still copies the model) and run it on the GPU. Since then I have had the GPU memory in the BIOS set to 512 MB.

@winstonma
Author

winstonma commented Jan 11, 2024

@dkuku Thanks. From llama.cpp we know that it is doable (my guess is that on Windows we can access video memory via UMA through DirectX). torch-apu-helper uses the same approach to get PyTorch to use UMA.

However, the torch-apu-helper method does not work for Stable Diffusion. It lets PyTorch use the video memory, but when Stable Diffusion uses some other PyTorch API it fails (like in the following example):

Warning: caught exception 'CUDAPluggableAllocator does not yet support getDeviceStats. If you need it, please file an issue describing your use case.', memory monitor disabled

This shows that torch-apu-helper is not sufficient to get higher-level applications working. Official support from PyTorch is needed (just like in llama.cpp).

With this I can offload a 25 GB model into GPU memory (it still copies the model) and run it on the GPU. Since then I have had the GPU memory in the BIOS set to 512 MB.

Are you using an AMD APU or a dedicated AMD GPU? How much memory does your system have? I am a bit confused.

@winstonma
Author

winstonma commented Jan 11, 2024

@jithunnair-amd Could you please take a look? Thanks 🙏🏻

EDIT: Sorry, I didn't see that the first post already CCed you.

@yiakwy-xpu-ml-framework-team

@winstonma As for PyTorch's caching algorithm, I have successfully added an external cached-allocator wrapper (with exactly PyTorch's block algorithm -- I am familiar with PyTorch's block and extended-segments allocation strategy).

It works well in PyTorch sample tests with any custom allocator, as long as the memory address is deemed "valid" by the device.

The only issue is how to dump it into PyTorch's newly supported memory snapshot. That also needs a Python frame tracer.

Hence it is better to have native support in the PyTorch code base.

CUDAPluggableAllocator does not yet support getDeviceStats.

This is because exporting memory info (blocks, segments) uses many statistics from PyTorch's caching allocator (blocks and segments are exclusive to the caching allocator).
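Concretely, these are the kinds of calls that break, because they all route through the allocator's getDeviceStats and read counters that only the native caching allocator maintains (an illustrative snippet, not code from this issue):

import torch

torch.cuda.memory_stats()      # keys like "allocated_bytes.all.current" (blocks)
                               # and "reserved_bytes.all.current" (segments)
torch.cuda.memory_allocated()  # convenience wrappers over the same stats,
torch.cuda.memory_reserved()   # so they fail with a CUDAPluggableAllocator too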

Note that the external allocator should support exporting a memory snapshot and liveness info like this:

[Screenshot: example memory snapshot with liveness info]

The whole feature involves about 3k lines of C++ code.

So I guess your hipHostMalloc demo doesn't work in practice.

@yiakwy-xpu-ml-framework-team

@dkuku How does llama.cpp handle the overhead of memory allocation? Does it cache allocated memory?

@winstonma
Author

winstonma commented Jan 11, 2024

So I guess your hipHostMalloc demo doesn't work in practice.

I am not sure. But I think both llama.cpp and torch-apu-helper referenced the HIP Programming Manual. I think memory-allocation-wise torch-apu-helper works, but as you mentioned, modifications are needed to get full support from PyTorch.

But I didn't know it would require modifying 3k lines of code. I thought the modification was mainly about "asking the ROCm driver to assign the memory", and that after the memory is allocated everything should work.

@yiakwy-xpu-ml-framework-team

Hi @winstonma, I am afraid that this is not how PyTorch works:

I thought the modification was mainly about "asking the ROCm driver to assign the memory", and that after the memory is allocated everything should work.

In the allocation stage, PyTorch first requests an available block (which contains a valid GPU address plus an offset pointing to the remaining available space), then splits the block so that the memory description of the newly inserted block is exactly what you asked for. This creates small memory segments; hence each time you request an allocation, PyTorch increases the age of the memory and releases the "oldest" blocks if necessary.

The release stage simply reverses the above order, except that the root block containing the GPU memory buffer is not released (or at least not immediately released) by cudaFree or anything similar.
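As a rough illustration of the split/coalesce idea described above -- a toy sketch, not PyTorch's actual C++ implementation, and ignoring ages, bins and streams:

class Block:
    def __init__(self, offset, size, free=True):
        self.offset, self.size, self.free = offset, size, free

class Segment:
    # One root GPU buffer obtained once from the driver, carved into blocks.
    def __init__(self, total_size):
        self.blocks = [Block(0, total_size)]  # kept in offset order

    def alloc(self, size):
        for i, b in enumerate(self.blocks):
            if b.free and b.size >= size:
                if b.size > size:
                    # Split: the remainder stays behind as a new free block.
                    self.blocks.insert(i + 1, Block(b.offset + size, b.size - size))
                b.size, b.free = size, False
                return b
        # A real allocator would age out and release cached blocks, then retry.
        raise MemoryError("no free block large enough")

    def free(self, block):
        # The root buffer is never returned to the driver here; freeing just
        # merges adjacent free blocks so larger requests can be served again.
        block.free = True
        merged = [self.blocks[0]]
        for b in self.blocks[1:]:
            if merged[-1].free and b.free:
                merged[-1].size += b.size
            else:
                merged.append(b)
        self.blocks = merged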

This algorithm can be supported with around 2k lines of code, depending on whether you want to handle multiple streams. This part is tricky because a piece of memory can be used by one stream for compute and by another stream for a copy, and one has to manually notify PyTorch to increase the event/stream/block references -- otherwise PyTorch has no way of knowing whether the memory is still in use.
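For comparison, this is how PyTorch's own caching allocator gets notified about cross-stream use today: Tensor.record_stream() keeps the block alive until the side stream's pending work finishes, and a pluggable allocator receives no equivalent signal. A small illustrative snippet:

import torch

copy_stream = torch.cuda.Stream()
x = torch.empty(1 << 20, device="cuda")   # allocated on the default stream
with torch.cuda.stream(copy_stream):
    y = x.to("cpu", non_blocking=True)    # read by the copy stream
x.record_stream(copy_stream)              # tell the allocator not to reuse x's
                                          # memory until the copy has finished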

Native support from PyTorch must consider this for CUDAPluggableAllocator.

Another thing one must support is tracing (this can be done with an allocator-wrapper singleton exposing a pybind interface that overrides the native memory_snapshot function; you must override the cDLL loading process if you want a singleton as the interface), otherwise PyTorch will not generate a memory snapshot.

This needs another 1k lines of code: PyObject frame tracking, C++ call tracking, etc.

@winstonma
Author

winstonma commented Jan 12, 2024

@yiakwy-xpu-ml-framework-team Thanks for explaining. It seems I oversimplified the whole process, thinking AMD could take advantage of the existing PyTorch ROCm framework and include UMA support with a small modification. It seems there is a long way to go.

llama.cpp shows that the driver is able to use UMA on Linux, so I hope the AMD team can get PyTorch working on UMA 🙏🏻

This feature is crucial because most APUs are installed in laptops, and laptop manufacturers often don't allow you to modify the dedicated memory in the BIOS. As a result, laptop users can't use the ROCm version of PyTorch, which is really sad.

@qkiel

qkiel commented Mar 21, 2024

@dkuku Thanks. From llama.cpp we know that it is doable (my guess is that on Windows we can access video memory via UMA through DirectX). torch-apu-helper uses the same approach to get PyTorch to use UMA.

However, the torch-apu-helper method does not work for Stable Diffusion. It lets PyTorch use the video memory, but when Stable Diffusion uses some other PyTorch API it fails (like in the following example):

Warning: caught exception 'CUDAPluggableAllocator does not yet support getDeviceStats. If you need it, please file an issue describing your use case.', memory monitor disabled

This shows that torch-apu-helper is not sufficient to get higher-level applications working. Official support from PyTorch is needed (just like in llama.cpp).

With this I can offload a 25 GB model into GPU memory (it still copies the model) and run it on the GPU. Since then I have had the GPU memory in the BIOS set to 512 MB.

Are you using an AMD APU or a dedicated AMD GPU? How much memory does your system have? I am a bit confused.

If you use force-host-alloction-APU instead of torch-apu-helper, you can run Stable Diffusion on an APU.

I have a 5600G APU and I'm using Fooocus with force-host-alloction-APU and only 512 MiB allocated to VRAM. For ROCm 5.7.3 I need to set the environment variable HSA_OVERRIDE_GFX_VERSION=9.0.0, and for ROCm 6.0+ I need one more, HSA_ENABLE_SDMA=0. Then I start Fooocus like this:

LD_PRELOAD=~/force-host-alloction-APU/./libforcegttalloc.so python3 ~/Fooocus/entry_with_update.py --always-high-vram

Notice the ./ before libforcegttalloc.so. The model is fully loaded into memory and the GPU generates an image.
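A quick way to sanity-check the setup is to run a small script under the same LD_PRELOAD and confirm that PyTorch can allocate far more than the 512 MiB of dedicated VRAM (the tensor size below is arbitrary, roughly 2 GiB):

import torch

print(torch.cuda.is_available(), torch.cuda.get_device_name(0))
x = torch.empty(512 * 1024 * 1024, dtype=torch.float32, device="cuda")  # ~2 GiB
print("allocated", x.numel() * x.element_size() / 1024**3, "GiB")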

Links to ROCm and PyTorch versions I used:
https://www.amd.com/en/support/linux-drivers
https://pytorch.org/get-started/locally/

ROCm 5.7 with PyTorch for 5.7

https://repo.radeon.com/amdgpu-install/5.7.3/ubuntu/jammy/amdgpu-install_5.7.50703-1_all.deb
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7

ROCm 6.0 with PyTorch for 6.0

https://repo.radeon.com/amdgpu-install/23.40.2/ubuntu/jammy/amdgpu-install_6.0.60002-1_all.deb
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0

@gonwan

gonwan commented Mar 22, 2024

I was able to run sd-webui around January on my AMD APU (7840HS, RDNA3, 4 GB VRAM). I used UMAF to adjust the VRAM from 4 GB to 8 GB and set HSA_OVERRIDE_GFX_VERSION=11.0.0. ROCm 6.0 with PyTorch-ROCm 5.6 works for me, but ROCm 6.0 with PyTorch-ROCm 5.7/6.0 does not.

Just leaving a note here as an alternative approach.

@winstonma
Author

winstonma commented Mar 22, 2024

@yiakwy-xpu-ml-framework-team I tested @qkiel's method and I can run Stable Diffusion on my laptop, without modifying the VRAM in the BIOS.

I am running ROCm 6.0.2 with stable PyTorch 2.2.1. I don't see any performance difference between the new method and the VRAM-modification method.

I just wonder if PyTorch ROCm would consider including the force-host-alloction-APU method in a future ROCm PyTorch release. Thank you very much.

@qkiel

qkiel commented Mar 22, 2024

@winstonma great it worked :]

@gonwan Have you tried adding one more environment variable, HSA_ENABLE_SDMA=0, for the latest PyTorch-ROCm 6.0? That did the trick for me.

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0

@gonwan

gonwan commented Mar 23, 2024

@qkiel
Just tried setting HSA_ENABLE_SDMA=0 with both versions, no luck.
With rocm6.0 + pytorch-rocm5.7, it hangs when generating images, with no other logs.
With rocm6.0 + pytorch-rocm6.0, an invalid ISA error is reported.

Any idea?

@dkuku

dkuku commented Mar 23, 2024 via email

@qkiel

qkiel commented Mar 24, 2024

@gonwan
If HSA_OVERRIDE_GFX_VERSION=11.0.0 + HSA_ENABLE_SDMA=0 doesn't work, maybe try the lower version HSA_OVERRIDE_GFX_VERSION=10.3.0 + HSA_ENABLE_SDMA=0.

You can also try setting UMA_SPECIFIED with 8 or more GiB of VRAM in the BIOS, or use the force-host-alloction-APU method when launching Stable Diffusion, which requires UMA_AUTO to be set in the BIOS. I have a 5600G APU with UMA_AUTO, and I'm launching Fooocus with force-host-alloction-APU like this:

LD_PRELOAD=~/force-host-alloction-APU/./libforcegttalloc.so python3 ~/Fooocus/entry_with_update.py --always-high-vram

Notice the ./ before libforcegttalloc.so.

@dkuku I've seen something like that when I launched Fooocus on CPU instead of GPU.
