[Bugfix][Rocm] fix qr error when different inp shape #25892
Conversation
@haoyangli-amd Please add a test case covering the issue.
Hi, @ilmarkov
```python
num = 1
s1 = 1024
while num < 50000:  # 50000 is sufficient to identify issues.
```
How long does the test take?
@mgoin Do we have CI for MI300? Should we add tests/distributed/test_quick_all_reduce.py to test-pipeline?
> How long does the test take?

About 23 s for TP4 and TP8.
Co-authored-by: ilmarkov <markovilya197@gmail.com> Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
Co-authored-by: ilmarkov <markovilya197@gmail.com> Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
When the tensor parallelism (TP) degree is set to 4 or 8, frequent changes in the input shape can cause QuickReduce to hang (this issue has been observed with the gpt_oss model).
We have identified that the root cause is overlapping flag memory addresses between consecutive AllReduce operations.
For most models, the hidden size remains relatively stable, so this issue does not occur.
Our current solution is to allocate separate memory regions for the flags and data of the two AllReduce phases in each operation.
(Note: The data region must also be separated, as overlapping would lead to correctness issues.)
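As a rough illustration of the fix (a minimal sketch only; the region names and sizes below are assumptions, not the actual QuickReduce buffer layout): each call reserves fixed, disjoint regions for the flags and data of phase 1 and phase 2, instead of offsets that depend on the current message size.

```python
# Minimal sketch of the idea, not the real QuickReduce layout; names and
# sizes here are illustrative assumptions.
FLAG_BYTES = 4096          # assumed space reserved per flag region
MAX_DATA_BYTES = 8 << 20   # assumed maximum payload per phase

def old_layout(msg_bytes: int):
    # Before: the phase 2 offset depends on the message size, so calls with
    # different input shapes place their regions at overlapping spots.
    phase1_off = 0
    phase2_off = msg_bytes
    return phase1_off, phase2_off

def new_layout():
    # After: fixed, disjoint regions for the flags and data of both phases,
    # so a later call can never touch memory still in use by an earlier one.
    phase1_flags = 0
    phase2_flags = FLAG_BYTES
    phase1_data = 2 * FLAG_BYTES
    phase2_data = 2 * FLAG_BYTES + MAX_DATA_BYTES
    return phase1_flags, phase2_flags, phase1_data, phase2_data
```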
To reproduce the error:
1. git clone https://github.com/vllm-project/vllm.git
2. python3 setup.py develop
3. python3 this_script.py

The same script can also be used to check whether the hang issue is resolved and whether the results are reasonable.
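The repro script itself is not inlined above; the following is a minimal stand-in sketch of the kind of driver that triggers the hang. It uses plain torch.distributed only to illustrate the access pattern (rapidly alternating message sizes); the actual reproduction goes through vLLM's QuickReduce all-reduce path with TP=4 or TP=8.

```python
# Stand-in sketch only: plain torch.distributed instead of vLLM's QuickReduce
# path, to show the shape-alternation pattern that exposes the hang.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # RCCL on ROCm
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    num, s1 = 1, 1024
    while num < 50000:  # matches the loop bound used in the test snippet
        s1 = 2048 if s1 == 1024 else 1024   # alternate [256, 1024] / [256, 2048]
        x = torch.ones(256, s1, dtype=torch.float16, device="cuda")
        dist.all_reduce(x)
        num += 1

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=4 this_script.py` (or 8 for TP8).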
A more detailed explanation
Why does a frequently changing input shape cause problems?
1. From the logs, the program does not actually hang at the [256, 2048] stage, but at the previous one, [256, 1024]. I suspect that the n-th allreduce and the (n+1)-th allreduce overlap in time: when the (n+1)-th allreduce executes its phase 1, it modifies the flags still in use by the n-th allreduce's phase 2. For [256, 2048], the phase 1 address range completely overlaps the combined phase 1 + phase 2 range of [256, 1024].
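To make the overlap concrete (illustrative arithmetic only, assuming phase 2 is laid out immediately after the phase 1 payload; the real offsets are internal to QuickReduce): a [256, 1024] fp16 message is 512 KiB, while a [256, 2048] message is 1 MiB, so the larger call's phase 1 span covers both phases of the smaller call.

```python
# Illustrative arithmetic only; assumes phase 2 starts right after phase 1.
BYTES_PER_ELEM = 2  # fp16

def spans(shape, base=0):
    payload = shape[0] * shape[1] * BYTES_PER_ELEM
    phase1 = (base, base + payload)
    phase2 = (base + payload, base + 2 * payload)
    return phase1, phase2

p1_small, p2_small = spans((256, 1024))  # (0, 512 KiB), (512 KiB, 1 MiB)
p1_big, _ = spans((256, 2048))           # (0, 1 MiB): covers both small spans
# So phase 1 of the (n+1)-th [256, 2048] all-reduce can clobber memory that
# phase 2 of the n-th [256, 1024] all-reduce is still waiting on.
```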
2. Previously we used `hipStreamSynchronize(stream)` and `hipDeviceSynchronize()`, which did not guarantee that all ranks block at the same point. Now we use `dist.barrier(group=cpu_group)`, and even after running for an hour the program does not hang.
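For reference, a minimal sketch of that barrier (assuming the CPU group is a gloo process group; in vLLM the group comes from its distributed state rather than being created ad hoc like this):

```python
# Sketch only: a CPU-side (gloo) barrier blocks every rank at the same point,
# unlike per-device stream/device synchronization.
import torch.distributed as dist

cpu_group = dist.new_group(backend="gloo")  # assumed CPU group for barriers
# ... launch the all-reduce on the GPU stream ...
dist.barrier(group=cpu_group)               # all ranks rendezvous here
```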
3. Following vLLM's communication reduction (CR) implementation, using isolated addresses to distinguish the phases of different allreduce batches is necessary to prevent interference between the n-th and (n+1)-th allreduce operations.
Why don't other models have this issue?

For typical models, the hidden size is fixed, so the input shape does not change frequently, and phase 2 of the n-th allreduce and phase 1 of the (n+1)-th allreduce do not share addresses. For models like GPT-OSS with variable-length inputs, however, conflicts can occur. The fix is to completely isolate the addresses so that no conflict is possible.