Non-deterministic outputs for llama2 #966

For some adversarially optimized prompts, llama2 running on vLLM seems to return slightly different generations from time to time. Does anyone know what could be causing this, and whether it is possible to fix? My suspicion is that the model shards are not being reduced in the same order every time, which leads to different floating-point values due to non-associativity.
@zhuohan123 here is a reproduction script (set MODEL_DIR to the correct path):
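(The original script is not preserved in this excerpt; the following is a hypothetical sketch of a script of this shape, assuming the vLLM offline inference API. The prompt, sampling settings, and prompt count are illustrative.)

```python
# Hypothetical reconstruction, not the original script. Assumes the vLLM
# offline inference API; the prompt and settings are illustrative.
import os

from vllm import LLM, SamplingParams

MODEL_DIR = os.environ["MODEL_DIR"]  # path to the llama2 checkpoint

llm = LLM(model=MODEL_DIR)
# Greedy decoding: any variation across identical prompts is nondeterminism.
params = SamplingParams(temperature=0.0, max_tokens=256)

prompt = "<adversarial prompt>"  # placeholder for the prompt from the report
outputs = llm.generate([prompt] * 100, params)

texts = [out.outputs[0].text for out in outputs]
print(f"{len(set(texts))} unique generations out of {len(texts)}")
```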
When run on 1x A100-80GB, the generations are not always identical. Increasing the number of prompts to 1000 results in 6 unique generations. I think it happens more frequently on really long generations, and seemingly only with weird/adversarial inputs.
I'm experiencing the same issue. Based on my investigation, there appear to be two main areas of concern:
Consequently, it seems that even when forcing deterministic output (e.g., using only the top-1 token), the outcome differs. In my case, I addressed the problem by converting the model's precision to float32 (.float()) or float64 (.double()), which gave consistent results under rigorous testing. I'm curious whether these issues could be related.
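As a sketch of that workaround, assuming a Hugging Face Transformers model (the model name and where the cast is applied are illustrative, not the commenter's exact code):

```python
# Illustrative sketch of the precision workaround; the model name and the
# place the cast is applied are assumptions.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)
model = model.float()  # upcast weights to float32; use .double() for float64
```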
This issue can be unavoidable, since it may be caused by batching. Batching can change the order in which each request is computed, which affects the floating-point arithmetic and leads to non-deterministic results. With the API server, the requests in a batch depend on when you submit the query, which can vary.
@zhuohan123 can you comment a bit more on why batching would cause this floating-point error? My understanding is that during a GEMV or GEMM, the rows should not affect each other?
Batching may change the order of summation in the attention/GEMM kernels for each request, which can lead to different summation results.
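A tiny illustration of that non-associativity with plain Python (float64) values; the same effect is much larger in FP16 tensor sums:

```python
# Floating-point addition is not associative, so summation order matters.
x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)
print(x)       # 0.6000000000000001
print(y)       # 0.6
print(x == y)  # False
```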
@zhuohan123 Are you talking about "tiling" when doing GEMM, where the batch size determines the tile size and hence might affect the result?
@flexwang One potential source of the problem is this: vllm/vllm/model_executor/layers/attention.py, lines 300 to 301 at commit ff578ca
Currently, vLLM has two attention kernels: V1 and V2. We dynamically select one of the kernels based on the batch size (and the number of heads). The V1 and V2 kernels differ in that V2 uses a FlashAttention-style algorithm to compute the output while V1 does not.
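Roughly, the selection heuristic looks like the sketch below. This is a paraphrase from memory of the logic around the cited lines; the exact constants and condition at commit ff578ca may differ:

```python
# Paraphrase of vLLM's V1/V2 paged-attention kernel selection; the constants
# and exact condition are assumptions and may differ at commit ff578ca.
_PARTITION_SIZE = 512

def use_v1_kernel(max_context_len: int, num_seqs: int, num_heads: int) -> bool:
    max_num_partitions = (max_context_len + _PARTITION_SIZE - 1) // _PARTITION_SIZE
    # V1 (single-pass) for short contexts or large batches; V2
    # (FlashAttention-style, partitioned) otherwise.
    return max_num_partitions == 1 or num_seqs * num_heads > 512

print(use_v1_kernel(max_context_len=256, num_seqs=8, num_heads=32))   # True
print(use_v1_kernel(max_context_len=4096, num_seqs=1, num_heads=32))  # False
```

This is why the batch size alone can flip a request between two differently implemented kernels, independent of the summation-order effects within a single kernel.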
After investigating, I identified a discrepancy between the results of matrix multiplications done with torch.matmul and those done with jax and numpy. I addressed this by implementing matmul in Triton, and confirmed that this resolved the problem. I wonder if others have encountered the same issue, so I would appreciate confirmation from others.
@dyanos Is this non-deterministic problem already resolved in version 0.3.3 (or is it planned for 0.3.4)?
If you have time, could you let me know which PR or issue covers the part you mentioned? As far as I understand, this issue was considered difficult to resolve, so I was under the impression that there was no related work. However, I'm curious whether it has been discussed as a different issue; so far I haven't been able to find anything in my search.
Yes. You can check this colab notebook. What I want to show is that different batch sizes cause small numerical differences in the computed results; these differences are more significant at lower precisions like FP16, and they accumulate across operators, eventually leading to a different result. This is fundamental to batching.
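In the same spirit as that notebook (which is not reproduced here), one way to observe the effect; this requires a CUDA GPU, and the exact difference depends on the hardware and cuBLAS version:

```python
# The same row multiplied by the same matrix can give slightly different FP16
# results depending on the batch it is computed in (a GPU is required; the
# size of the difference depends on hardware and library versions).
import torch

torch.manual_seed(0)
w = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
x = torch.randn(8, 1024, dtype=torch.float16, device="cuda")

batched = x @ w     # row 0 computed as part of a batch of 8
single = x[:1] @ w  # the same row computed with batch size 1

print((batched[:1] - single).abs().max().item())  # typically small but nonzero
```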
@dyanos Sorry, I was referring to this; I was only asking.
I think there is no PR for this issue. Will there be any work on, or a solution to, this problem?
Oh, I see; I misunderstood what you said. For a single GPU, I replaced torch.matmul with code using Triton, and there seemed to be no problem after that. https://github.com/openai/triton/blob/main/python/tutorials/03-matrix-multiplication.py
@dyanos Oh, thanks. In which vLLM .py file did you replace torch.matmul with Triton's matmul? In my case I'm using a single GPU with vLLM, and I see this issue when using a checkpoint of a fine-tuned Qwen model (safetensors).
Hi. I also changed F.linear; please keep that in mind when making the change.
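For concreteness, a hypothetical sketch of the kind of swap described in the last two comments; `triton_matmul` stands in for the kernel from the linked tutorial, and this is not the commenters' actual patch:

```python
# Hypothetical sketch: replacing F.linear / torch.matmul inside a Linear layer
# with a Triton matmul such as the one in the linked tutorial. `triton_matmul`
# is assumed to be that tutorial kernel copied into a local module.
import torch
import torch.nn as nn

from triton_kernels import matmul as triton_matmul  # assumed local copy

class TritonLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The Triton kernel accumulates along K in fp32 in a fixed loop order,
        # which the commenters found gave consistent results across runs.
        out = triton_matmul(x, self.weight.t().contiguous())
        if self.bias is not None:
            out = out + self.bias
        return out
```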