Inductor-generated Triton kernel takes double the time from Llama 2 to Llama 3 #125524
Labels
module: inductor
oncall: pt2
triaged
This issue has been looked at by a team member, triaged, and prioritized into an appropriate module
🐛 Describe the bug
I'm working on enabling Llama 3 in gpt-fast, but I found that Llama 3 + int8 performs worse than expected.
I checked the generated kernels and found that one of them takes roughly twice as long as its Llama 2 counterpart (0.066 ms vs. 0.11 ms). This kernel is invoked 34 times across the whole Llama model, so it slows down end-to-end performance considerably.
This only happens with the int8-quantized model (tokens/sec is 73% of Llama 2's); the base model (tokens/sec is 90% of Llama 2's) runs at a more reasonable speed.
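Since the generated Triton kernels themselves can't be reproduced here, below is a rough CPU-only sketch (using numpy as a stand-in, not the actual Inductor kernel) of how one might compare a row-wise reduction over the two rnumel sizes. The row count and iteration count are arbitrary choices for illustration; this does not model GPU occupancy or memory-bandwidth effects, which are likely what actually matter here.

```python
import time
import numpy as np

def bench_reduction(rnumel, rows=1024, iters=5):
    # Stand-in for the Inductor reduction kernel: reduce a
    # (rows, rnumel) float32 matrix along the rnumel axis and
    # keep the best wall-clock time over several iterations.
    x = np.random.rand(rows, rnumel).astype(np.float32)
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        x.sum(axis=1)
        best = min(best, time.perf_counter() - t0)
    return best

t_llama2 = bench_reduction(11008)  # Llama 2 7B FFN intermediate size
t_llama3 = bench_reduction(14336)  # Llama 3 8B FFN intermediate size
print(f"rnumel=11008: {t_llama2 * 1e3:.3f} ms, rnumel=14336: {t_llama3 * 1e3:.3f} ms")
```

On a CPU the two timings scale roughly with the element count; the interesting question in the issue is why the GPU kernel does not.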
Llama 2:
Llama 3:
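A quick back-of-the-envelope check on the figures reported above: the reduction size only grows by about 1.30x (11008 to 14336, the FFN intermediate dimensions of Llama 2 7B and Llama 3 8B), yet the kernel time grows by about 1.67x (0.066 ms to 0.11 ms), so the slowdown is super-linear in rnumel:

```python
# Figures taken from the issue text above.
rnumel_llama2, rnumel_llama3 = 11008, 14336   # FFN intermediate sizes
t_llama2_ms, t_llama3_ms = 0.066, 0.11        # measured kernel times

size_ratio = rnumel_llama3 / rnumel_llama2    # how much more work per row
time_ratio = t_llama3_ms / t_llama2_ms        # how much slower the kernel got

print(f"reduction size grew {size_ratio:.2f}x, kernel time grew {time_ratio:.2f}x")
```

If the kernel were bandwidth-bound and equally well tuned at both sizes, one would expect the two ratios to be close; the gap suggests a tuning or launch-configuration difference rather than just more work.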
The only difference is that rnumel goes from 11008 to 14336. I'm curious whether this is expected and what could cause the difference?

Versions
N/A
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @ColinPeppler @amjames @desertfire