[inductor][cpu] basic_gnn_gcn AMP static/dynamic shape default/cpp wrapper single thread performance regression #123502

Closed

zxd1997066 opened this issue Apr 6, 2024 · 3 comments

Labels: oncall: cpu inductor (CPU Inductor issues for Intel team to triage), oncall: pt2

zxd1997066 (Contributor) commented Apr 6, 2024

🐛 Describe the bug

AMP static shape default wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | basic_gnn_gcn | single | 1 | 2.840742 | 0.02862003 | 0.08130212126226001 | 8.595758 | 1 | 3.233373 | 0.025129724 | 0.08125377107905199 | 8.340146 | 0.88 | 1.0 | 0.88 | 0.97 |

AMP dynamic shape default wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | basic_gnn_gcn | single | 1 | 2.856395 | 0.028604007 | 0.081704342574765 | 8.626275 | 1 | 3.252178 | 0.025021263999999998 | 0.08137360431299198 | 8.333838 | 0.88 | 1.0 | 0.87 | 0.97 |

AMP dynamic shape CPP wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | basic_gnn_gcn | single | 1 | 2.854889 | 0.028386648 | 0.081040729122072 | 50.862871 | 1 | 3.263137 | 0.024769257 | 0.080825478979209 | 50.52748 | 0.87 | 1.0 | 0.87 | 0.99 |

SW info

| name | target_branch | target_commit | refer_branch | refer_commit |
|---|---|---|---|---|
| torchbench | main | d6015d42 | main | d6015d42 |
| torch | main | 35c493f | main | f0d461b |
| torchvision | main | 0.19.0a0+2c4665f | main | 0.19.0a0+a0c79b3 |
| torchtext | main | 0.16.0a0+b0ebddc | main | 0.16.0a0+b0ebddc |
| torchaudio | main | 2.2.0a0+ea437b3 | main | 2.2.0a0+17a7081 |
| torchdata | main | 0.7.1a0+0790338 | main | 0.7.1a0+0790338 |
| dynamo_benchmarks | main | nightly | main | nightly |

Repro (inductor_single_run.sh):

```
bash inductor_single_run.sh single inference performance torchbench basic_gnn_gcn amp first static/dynamic default/cpp
```

Guilty-commit bisect log: torchbench-basic_gnn_gcn-inference-amp-static-cpp-single-performance-drop_guilty_commit.log
Suspected guilty commit: 4912160
cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @WeizhuoZhang-intel @chuanqi129

chuanqi129 added the oncall: cpu inductor label Apr 6, 2024
zhuhaozhe (Contributor) commented

The only difference brought by 4912160 is that more kernels are vectorized.
Vec kernel: https://gist.github.com/zhuhaozhe/d99974f84d78f76f393fbeec7ee3b03e
Scalar kernel: https://gist.github.com/zhuhaozhe/408469f904e644fa2ea6c3c5ac013453
Performance analysis WIP.

zhuhaozhe (Contributor) commented Apr 17, 2024

There is an additional load/store in the vec kernel:

```
// scalar kernel
auto tmp3 = tmp2 ? tmp1 : tmp0;
TORCH_CHECK((0 <= tmp3) & (tmp3 < 10000L), "index out of bounds: 0 <= tmp3 < 10000L")
auto tmp4 = out_ptr0[static_cast<long>(tmp3)];
```

which becomes:

```
// vec kernel: stores tmp7 to tmp8, then loads tmp8 as tmp9
auto tmp7 = decltype(tmp3)::blendv(tmp0, tmp3, tmp6.template cast<int64_t,2>());
TORCH_CHECK((at::vec::VecMask<int64_t,2>((at::vec::VectorizedN<int64_t,2>(0) <= tmp7) & (tmp7 < at::vec::VectorizedN<int64_t,2>(10000L)))).all_masked(), "index out of bounds: 0 <= tmp7 < 10000L")
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
auto tmp9 =
[&]
{
    __at_align__ std::array<float, 16> tmpbuf;
    #pragma GCC unroll 16
    for (long x0_inner = 0; x0_inner < 16; x0_inner++)
    {
        tmpbuf[x0_inner] = out_ptr0[static_cast<long>(tmp8[x0_inner])];
    }
    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
}
()
;
```
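To make the overhead concrete, here is a standalone sketch of the pattern above, under simplifying assumptions (plain arrays instead of at::vec types; the names `gather16` and `table` are hypothetical, not from the generated kernel). The index vector is spilled to an aligned stack buffer and each lane is gathered with a scalar load; that store/load round-trip is the extra work relative to the scalar kernel's single indexed load.

```
// Illustrative sketch only, not the actual generated kernel.
#include <array>
#include <cstdint>
#include <cstdio>

alignas(64) float table[10000];

std::array<float, 16> gather16(const std::array<int64_t, 16>& indices) {
    // Spill the index vector to an aligned stack buffer (tmp7 -> tmp8).
    alignas(64) std::array<int64_t, 16> tmpbuf = indices;
    // Scalar gather per lane (tmp8 -> tmp9); this round-trip through the
    // stack is the overhead discussed in this comment.
    std::array<float, 16> out{};
    for (long x0_inner = 0; x0_inner < 16; x0_inner++) {
        out[x0_inner] = table[static_cast<long>(tmpbuf[x0_inner])];
    }
    return out;
}

int main() {
    for (int i = 0; i < 10000; ++i) table[i] = float(i);
    std::array<int64_t, 16> idx{};
    for (int i = 0; i < 16; ++i) idx[i] = i * 5;
    auto v = gather16(idx);
    std::printf("%f %f\n", v[0], v[15]);  // prints 0.0 75.0
}
```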

zhuhaozhe (Contributor) commented

> There is an additional load/store in the vec kernel (see the scalar vs. vec kernels in the previous comment).
I tried to avoid generating tmp8, but the performance was still not recovered, so the extra load/store is not the correct root cause.

The latest finding is that the vec kernel generates redundant code:
https://gist.github.com/zhuhaozhe/d99974f84d78f76f393fbeec7ee3b03e#file-vec-py-L68-L76
https://gist.github.com/zhuhaozhe/d99974f84d78f76f393fbeec7ee3b03e#file-vec-py-L96-L104

```
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
// tmp7 is stored again here (redundant)
auto tmp16 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
```

This is because the cse.cache is not cloned during swap_buffer:
https://github.com/pytorch/pytorch/pull/124597/files

I am not sure whether we should clone the cse.cache, so I created this PR to check the CI status.
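For intuition, here is a minimal sketch of the CSE-cache mechanism being discussed, under assumed names (`CSEBuffer`, `emit`; this is an illustration, not Inductor's actual implementation). When codegen swaps to a fresh buffer without carrying the cache over, an expression that was already emitted gets emitted again, which is exactly the redundant tmp16 above.

```
// Illustrative sketch only; not Inductor's actual cse.cache/swap_buffer code.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct CSEBuffer {
    std::unordered_map<std::string, std::string> cache;  // expr -> temp var
    std::vector<std::string> lines;                      // generated code
    int counter = 0;

    std::string emit(const std::string& expr) {
        auto it = cache.find(expr);
        if (it != cache.end()) return it->second;  // cache hit: no new code
        std::string var = "tmp" + std::to_string(counter++);
        lines.push_back("auto " + var + " = " + expr + ";");
        cache.emplace(expr, var);
        return var;
    }
};

// Swap to a fresh buffer; if the cache is not cloned, expressions already
// generated in the old buffer are regenerated in the new one.
CSEBuffer swap_buffer(const CSEBuffer& old, bool clone_cache) {
    CSEBuffer fresh;
    if (clone_cache) fresh.cache = old.cache;
    fresh.counter = old.counter;
    return fresh;
}

int main() {
    CSEBuffer a;
    a.emit("spill(tmp7)");                          // emits tmp0
    CSEBuffer b = swap_buffer(a, /*clone_cache=*/false);
    b.emit("spill(tmp7)");                          // re-emitted: redundant
    CSEBuffer c = swap_buffer(a, /*clone_cache=*/true);
    c.emit("spill(tmp7)");                          // cache hit: nothing emitted
    std::printf("b lines: %zu, c lines: %zu\n", b.lines.size(), c.lines.size());
    // prints "b lines: 1, c lines: 0"
}
```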

zhuhaozhe added a commit that referenced this issue Apr 24, 2024
…ctorized indirect load"

Fix #123502

`swap_buffer` does not clone the `cse.cache`, which brings redundant computation.
We may be able to clone the `cse.cache` if there is no cse value in the `expr`:
```
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
//
// other code
//
// tmp7 is also stored here (redundant tmp16)
auto tmp16 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
andoorve pushed a commit to andoorve/pytorch that referenced this issue May 1, 2024
…24597)

Fix pytorch#123502

`swap_buffer` is not needed in vectorized indirect load; remove it to share the cse buffer.
```
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
//
// other code
//
// tmp7 is also stored here (redundant tmp16)
auto tmp16 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
```

Pull Request resolved: pytorch#124597
Approved by: https://github.com/jgong5, https://github.com/jansel
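As an illustration of the fix's effect, here is a simplified standalone sketch (assumed names, plain arrays; not actual Inductor output): with the cse buffer shared, the index vector is spilled once and both indirect loads reuse it, instead of re-emitting a second tmpbuf spill like the redundant tmp16.

```
// Illustrative sketch of the post-fix codegen shape only.
#include <array>
#include <cstdint>
#include <cstdio>

alignas(64) float out_ptr0[10000];
alignas(64) float out_ptr1[10000];

int main() {
    for (int i = 0; i < 10000; ++i) { out_ptr0[i] = float(i); out_ptr1[i] = 2.f * i; }

    std::array<int64_t, 16> idx{};
    for (int i = 0; i < 16; ++i) idx[i] = i * 7;

    // Single spill of the index vector (tmp8 in the generated kernel) ...
    alignas(64) std::array<int64_t, 16> tmpbuf = idx;

    // ... reused by both indirect loads; no second spill is generated.
    float acc = 0.f;
    for (long i = 0; i < 16; ++i) acc += out_ptr0[tmpbuf[i]];
    for (long i = 0; i < 16; ++i) acc += out_ptr1[tmpbuf[i]];
    std::printf("%f\n", acc);  // prints 2520.0
}
```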
pytorch-bot bot pushed a commit that referenced this issue May 3, 2024 (same message as above).

Pull Request resolved: #124597
Approved by: https://github.com/jgong5, https://github.com/jansel