Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[inductor] make thread order consistent with loop order #106827

Closed
wants to merge 3 commits into from

Commits on Aug 8, 2023

  1. [inductor] make thread order consistent with loop order

    [ghstack-poisoned]
    shunting314 committed Aug 8, 2023
    Configuration menu
    Copy the full SHA
    2048bd2 View commit details
    Browse the repository at this point in the history

Commits on Aug 9, 2023

  1. Update on "[inductor] make thread order consistent with loop order"

    I found that for a tiled kernel for tensor with shape [a, b], we map 'a' with XBLOCK and 'b' with YBLOCK. However, 'a' actually should be the outer looper while 'b' corresponding to the inner loop. This order is picked by our loop ordering algorithm. Mapping 'a' with XBLOCK has the semantic like assigning 'a' to the inner loop instead.
    
    For a simple 'A + B.t()' kernel, making the loop order consistent can brings 1.027x speedup ( 1.938ms -> 1.887ms speedup) . Here are the dump of kernels:
    
    - before fix: https://gist.github.com/shunting314/4dacf73cf495cdd7e84dede7c3e0872d 
    - after fix (this one is done manually): https://gist.github.com/shunting314/441e8839d24e1878c313e539b1ebd551 
    
    I tried this on DistillGPT2 and found perf is neutral. But that because DistillGPT2 has a single tiled pointwise kernel in it's backward graph. Will check the dashboard.
    
    
    cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov
    
    [ghstack-poisoned]
    shunting314 committed Aug 9, 2023
    Configuration menu
    Copy the full SHA
    df8cd57 View commit details
    Browse the repository at this point in the history

Commits on Aug 11, 2023

  1. Update on "[inductor] make thread order consistent with loop order"

    I found that for a tiled kernel for tensor with shape [a, b], we map 'a' with XBLOCK and 'b' with YBLOCK. However, 'a' actually should be the outer looper while 'b' corresponding to the inner loop. This order is picked by our loop ordering algorithm. Mapping 'a' with XBLOCK has the semantic like assigning 'a' to the inner loop instead.
    
    For a simple 'A + B.t()' kernel, making the loop order consistent can brings 1.027x speedup ( 1.938ms -> 1.887ms speedup) . Here are the dump of kernels:
    
    - before fix: https://gist.github.com/shunting314/4dacf73cf495cdd7e84dede7c3e0872d 
    - after fix (this one is done manually): https://gist.github.com/shunting314/441e8839d24e1878c313e539b1ebd551 
    
    I tried this on DistillGPT2 and found perf is neutral. But that because DistillGPT2 has a single tiled pointwise kernel in it's backward graph. Will check the dashboard.
    
    
    cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov
    
    [ghstack-poisoned]
    shunting314 committed Aug 11, 2023
    Configuration menu
    Copy the full SHA
    88a4ea1 View commit details
    Browse the repository at this point in the history