load_in_8bit hangs on ROCm #1236

Open
DavideRossi opened this issue Jun 1, 2024 · 3 comments

Comments

@DavideRossi

System Info

An AMD EPYC system with three MI210 GPUs.
The setup is fairly complex: the system uses Slurm to schedule batch jobs, which usually run as apptainer run containers. The image I'm using has ROCm 6.0.2 on Ubuntu 22.04.

Reproduction

I followed the installation instructions at https://github.com/TimDettmers/bitsandbytes/blob/multi-backend-refactor/docs/source/rocm_installation.mdx, except that I checked out the multi-backend-refactor branch (I hope that was the right thing to do).
Then I tried running the example at https://github.com/TimDettmers/bitsandbytes/blob/multi-backend-refactor/examples/int8_inference_huggingface.py; I only added a line to print the value of max_memory right after it is set (I also ran a modified version forcing the use of only one GPU). A sketch of the script is included at the end of this section.
The job just hangs after printing:

Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Max memory: {0: '61GB', 1: '61GB', 2: '61GB'}

It then times out after one hour.
I'm willing to help debug the issue; just tell me how I can help.
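
For reference, here is a rough sketch of what the script does, with my extra print added. MODEL_ID is a placeholder for the checkpoint used by the upstream example, and the exact max_memory computation may differ slightly from the real script:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "..."  # placeholder for the checkpoint used by the example
text = "Hamburg is in which country?\n"

# Budget each GPU a little below the free memory reported by the driver.
free_gb = int(torch.cuda.mem_get_info()[0] / 1024**3) - 2
max_memory = {i: f"{free_gb}GB" for i in range(torch.cuda.device_count())}
print("Max memory:", max_memory)  # the line I added; this is the last output before the hang

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
input_ids = tokenizer(text, return_tensors="pt").input_ids

# The hang appears to happen during this call on the ROCm system.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    load_in_8bit=True,
    max_memory=max_memory,
)
generated_ids = model.generate(input_ids.to(model.device), max_new_tokens=128)
print(tokenizer.decode(generated_ids[0]))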

Expected behavior

This is what I get when running the same example with an A100 (after a few seconds):

Max memory: {0: '76GB'}
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.35s/it]
Hamburg is in which country?
What is the currency of Germany?
Is Germany in Europe?
[...]
@matthewdouglas
Member

cc: @pnunna93

@pnunna93
Contributor

pnunna93 commented Jun 5, 2024

Hi @DavideRossi, could you please share the Python and HIP traces for the script?

For the Python trace, you can add the line below before the point where the script hangs. You can stop it once the stack trace no longer changes.
import faulthandler; faulthandler.dump_traceback_later(10, repeat=True)  # dumps every thread's stack every 10 seconds

Please also run the script with AMD_LOG_LEVEL=3 to capture the HIP trace.

Please also share the torch version and your machine details, i.e. the outputs of 'pip show torch' and 'rocminfo'.
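
As a sketch of how the tracing suggestion fits into the reproduction script above (the 10-second interval is just the value suggested here):

import faulthandler

# Periodically dump every thread's stack to stderr so the hang location becomes visible;
# stop collecting once the reported stack no longer changes between dumps.
faulthandler.dump_traceback_later(10, repeat=True)

# ... the rest of the reproduction script (tokenizer, from_pretrained, generate) follows here ...

The script would then be launched with AMD_LOG_LEVEL=3 set in the environment so that the HIP runtime also emits its trace.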

@DavideRossi
Author

Hi, I'm sorry, but I mistakenly deleted the container I was using for these tests. In any case, I was able to trace the problem to accelerate when using device_map="auto": the same code with device_map="cuda:0" was not hanging. I'm now trying to replicate the whole process with a new container.
I'm now having issues with 8-bit support, but I'm going to post about that in the #538 thread.
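
In other words, only the device_map value seems to matter here; a rough sketch of the two calls (MODEL_ID is again a placeholder for the checkpoint name):

from transformers import AutoModelForCausalLM

MODEL_ID = "..."  # placeholder checkpoint name

# Hangs on my ROCm setup: accelerate computes the device placement itself.
# model = AutoModelForCausalLM.from_pretrained(MODEL_ID, load_in_8bit=True, device_map="auto")

# Does not hang: the whole model is pinned to the first GPU.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, load_in_8bit=True, device_map="cuda:0")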
