Alternative access order for the same buffer can bring big perf win #126913
Labels
enhancement
module: inductor
oncall: pt2
triaged
Check this softmax kernel generated by inductor: https://gist.github.com/shunting314/16bf79d906bd2e929a62c0b2f3c02150 (call it k1)
If we reverse the access order for the second for loop from:
to
we get k2 (https://gist.github.com/shunting314/e749c3766757adaed729b51d38cd3169).
k2 is 1.54x faster than k1 (5.231 ms vs. 8.067 ms). The speedup is mainly due to more cache hits.
Credit to llm.c, which is where I learned the idea. This is probably something we can apply more generally in inductor.
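To illustrate why reversing the second loop could produce more cache hits, here is a toy simulation. It is not the inductor kernel from the gists and it does not model real GPU caches: it assumes a simple LRU cache that holds fewer blocks than the buffer, and the names `count_hits`, `num_blocks`, and `cache_blocks` are made up for this sketch. Under those assumptions, a second pass in the same order finds every block already evicted, while a second pass in reverse order starts with the blocks the first pass touched last.

```python
from collections import OrderedDict


def count_hits(access_order, cache_blocks):
    """Count cache hits for a sequence of block indices under a toy LRU cache.

    `cache_blocks` is the number of blocks the (hypothetical) cache can hold.
    """
    cache = OrderedDict()
    hits = 0
    for block in access_order:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits


# Assumed sizes for illustration only: the buffer has twice as many blocks
# as the cache can hold.
num_blocks = 8
cache_blocks = 4

first_pass = list(range(num_blocks))

# Second loop walks the buffer in the same order as the first loop (like k1).
same_order = first_pass + list(range(num_blocks))
# Second loop walks the buffer in reverse order (like k2).
reversed_order = first_pass + list(reversed(range(num_blocks)))

hits_same = count_hits(same_order, cache_blocks)
hits_reversed = count_hits(reversed_order, cache_blocks)

print("hits, same order:    ", hits_same)
print("hits, reversed order:", hits_reversed)
```

In this toy model the reversed second pass hits on exactly the blocks that still fit in the cache, and the same-order pass hits on none. Real hardware caches are more complicated, so the measured speedup depends on buffer size, cache size, and replacement policy.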
cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @jansel @Chillee @eellison