[inductor][cpu] basic_gnn_gcn AMP static/dynamic shape default/cpp wrapper single thread performance regression #123502

Closed

zxd1997066 opened this issue Apr 6, 2024 · 3 comments

Labels: oncall: cpu inductor (CPU Inductor issues for Intel team to triage), oncall: pt2

zxd1997066 (Contributor) commented Apr 6, 2024

🐛 Describe the bug

AMP static shape default wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | basic_gnn_gcn | single | 1 | 2.840742 | 0.02862003 | 0.08130212126226001 | 8.595758 | 1 | 3.233373 | 0.025129724 | 0.08125377107905199 | 8.340146 | 0.88 | 1.0 | 0.88 | 0.97 |

AMP dynamic shape default wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | basic_gnn_gcn | single | 1 | 2.856395 | 0.028604007 | 0.081704342574765 | 8.626275 | 1 | 3.252178 | 0.025021263999999998 | 0.08137360431299198 | 8.333838 | 0.88 | 1.0 | 0.87 | 0.97 |

AMP dynamic shape CPP wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| torchbench | basic_gnn_gcn | single | 1 | 2.854889 | 0.028386648 | 0.081040729122072 | 50.862871 | 1 | 3.263137 | 0.024769257 | 0.080825478979209 | 50.52748 | 0.87 | 1.0 | 0.87 | 0.99 |

SW info

| name | target_branch | target_commit | refer_branch | refer_commit |
|---|---|---|---|---|
| torchbench | main | d6015d42 | main | d6015d42 |
| torch | main | 35c493f | main | f0d461b |
| torchvision | main | 0.19.0a0+2c4665f | main | 0.19.0a0+a0c79b3 |
| torchtext | main | 0.16.0a0+b0ebddc | main | 0.16.0a0+b0ebddc |
| torchaudio | main | 2.2.0a0+ea437b3 | main | 2.2.0a0+17a7081 |
| torchdata | main | 0.7.1a0+0790338 | main | 0.7.1a0+0790338 |
| dynamo_benchmarks | main | nightly | main | nightly |

Repro (inductor_single_run.sh):

```
bash inductor_single_run.sh single inference performance torchbench basic_gnn_gcn amp first static/dynamic default/cpp
```

Guilty-commit bisect log: torchbench-basic_gnn_gcn-inference-amp-static-cpp-single-performance-drop_guilty_commit.log
Suspected guilty commit: 4912160
cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @WeizhuoZhang-intel @chuanqi129

chuanqi129 added the oncall: cpu inductor label Apr 6, 2024
zhuhaozhe (Contributor) commented

The only difference brought by 4912160 is that more kernels are vectorized.
Vec kernel: https://gist.github.com/zhuhaozhe/d99974f84d78f76f393fbeec7ee3b03e
Scalar kernel: https://gist.github.com/zhuhaozhe/408469f904e644fa2ea6c3c5ac013453
Performance analysis WIP.

zhuhaozhe (Contributor) commented Apr 17, 2024

There is an additional load/store in the vec kernel:

```
// scalar kernel
auto tmp3 = tmp2 ? tmp1 : tmp0;
TORCH_CHECK((0 <= tmp3) & (tmp3 < 10000L), "index out of bounds: 0 <= tmp3 < 10000L")
auto tmp4 = out_ptr0[static_cast<long>(tmp3)];
```

which becomes:

```
// vec kernel: stores tmp7 to tmp8, then loads tmp8 as tmp9
auto tmp7 = decltype(tmp3)::blendv(tmp0, tmp3, tmp6.template cast<int64_t,2>());
TORCH_CHECK((at::vec::VecMask<int64_t,2>((at::vec::VectorizedN<int64_t,2>(0) <= tmp7) & (tmp7 < at::vec::VectorizedN<int64_t,2>(10000L)))).all_masked(), "index out of bounds: 0 <= tmp7 < 10000L")
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
auto tmp9 =
[&]
{
    __at_align__ std::array<float, 16> tmpbuf;
    #pragma GCC unroll 16
    for (long x0_inner = 0; x0_inner < 16; x0_inner++)
    {
        tmpbuf[x0_inner] = out_ptr0[static_cast<long>(tmp8[x0_inner])];
    }
    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
}
()
;
```
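To make the overhead concrete, here is a standalone sketch of the pattern above, under simplifying assumptions (plain arrays instead of at::vec types; the names `gather16` and `table` are hypothetical, not from the generated kernel). The index vector is spilled to an aligned stack buffer and each lane is gathered with a scalar load; that store/load round-trip is the extra work relative to the scalar kernel's single indexed load.

```
// Illustrative sketch only, not the actual generated kernel.
#include <array>
#include <cstdint>
#include <cstdio>

alignas(64) float table[10000];

std::array<float, 16> gather16(const std::array<int64_t, 16>& indices) {
    // Spill the index vector to an aligned stack buffer (tmp7 -> tmp8).
    alignas(64) std::array<int64_t, 16> tmpbuf = indices;
    // Scalar gather per lane (tmp8 -> tmp9); this round-trip through the
    // stack is the overhead discussed in this comment.
    std::array<float, 16> out{};
    for (long x0_inner = 0; x0_inner < 16; x0_inner++) {
        out[x0_inner] = table[static_cast<long>(tmpbuf[x0_inner])];
    }
    return out;
}

int main() {
    for (int i = 0; i < 10000; ++i) table[i] = float(i);
    std::array<int64_t, 16> idx{};
    for (int i = 0; i < 16; ++i) idx[i] = i * 5;
    auto v = gather16(idx);
    std::printf("%f %f\n", v[0], v[15]);  // prints 0.0 75.0
}
```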

zhuhaozhe (Contributor) commented

> There is an additional load/store in the vec kernel (see the scalar vs. vec kernels in the previous comment).
I tried to avoid generating tmp8, but the performance was still not recovered, so the extra load/store is not the correct root cause.

The latest finding is that the vec kernel generates redundant code:
https://gist.github.com/zhuhaozhe/d99974f84d78f76f393fbeec7ee3b03e#file-vec-py-L68-L76
https://gist.github.com/zhuhaozhe/d99974f84d78f76f393fbeec7ee3b03e#file-vec-py-L96-L104

```
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
// tmp7 is stored again here (redundant)
auto tmp16 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
```

This is because the cse.cache is not cloned during swap_buffer:
https://github.com/pytorch/pytorch/pull/124597/files

I am not sure whether we should clone the cse.cache, so I created this PR to check the CI status.
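For intuition, here is a minimal sketch of the CSE-cache mechanism being discussed, under assumed names (`CSEBuffer`, `emit`; this is an illustration, not Inductor's actual implementation). When codegen swaps to a fresh buffer without carrying the cache over, an expression that was already emitted gets emitted again, which is exactly the redundant tmp16 above.

```
// Illustrative sketch only; not Inductor's actual cse.cache/swap_buffer code.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct CSEBuffer {
    std::unordered_map<std::string, std::string> cache;  // expr -> temp var
    std::vector<std::string> lines;                      // generated code
    int counter = 0;

    std::string emit(const std::string& expr) {
        auto it = cache.find(expr);
        if (it != cache.end()) return it->second;  // cache hit: no new code
        std::string var = "tmp" + std::to_string(counter++);
        lines.push_back("auto " + var + " = " + expr + ";");
        cache.emplace(expr, var);
        return var;
    }
};

// Swap to a fresh buffer; if the cache is not cloned, expressions already
// generated in the old buffer are regenerated in the new one.
CSEBuffer swap_buffer(const CSEBuffer& old, bool clone_cache) {
    CSEBuffer fresh;
    if (clone_cache) fresh.cache = old.cache;
    fresh.counter = old.counter;
    return fresh;
}

int main() {
    CSEBuffer a;
    a.emit("spill(tmp7)");                          // emits tmp0
    CSEBuffer b = swap_buffer(a, /*clone_cache=*/false);
    b.emit("spill(tmp7)");                          // re-emitted: redundant
    CSEBuffer c = swap_buffer(a, /*clone_cache=*/true);
    c.emit("spill(tmp7)");                          // cache hit: nothing emitted
    std::printf("b lines: %zu, c lines: %zu\n", b.lines.size(), c.lines.size());
    // prints "b lines: 1, c lines: 0"
}
```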

zhuhaozhe added a commit that referenced this issue Apr 24, 2024
…ctorized indirect load"

Fix #123502

`swap_buffer` does not clone the `cse.cache`, which brings redundant computation.
We may be able to clone the `cse.cache` if there is no cse value in the `expr`:
```
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
//
// other code
//
// tmp7 is also stored here (redundant tmp16)
auto tmp16 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
andoorve pushed a commit to andoorve/pytorch that referenced this issue May 1, 2024
…24597)

Fix pytorch#123502

`swap_buffer` is not needed in vectorized indirect load; remove it to share the cse buffer.
```
auto tmp8 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
//
// other code
//
// tmp7 is also stored here (redundant tmp16)
auto tmp16 =
[&]
{
    __at_align__ std::array<int64_t, 16> tmpbuf;
    tmp7.store(tmpbuf.data());
    return tmpbuf;
}
()
;
```

Pull Request resolved: pytorch#124597
Approved by: https://github.com/jgong5, https://github.com/jansel
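As an illustration of the fix's effect, here is a simplified standalone sketch (assumed names, plain arrays; not actual Inductor output): with the cse buffer shared, the index vector is spilled once and both indirect loads reuse it, instead of re-emitting a second tmpbuf spill like the redundant tmp16.

```
// Illustrative sketch of the post-fix codegen shape only.
#include <array>
#include <cstdint>
#include <cstdio>

alignas(64) float out_ptr0[10000];
alignas(64) float out_ptr1[10000];

int main() {
    for (int i = 0; i < 10000; ++i) { out_ptr0[i] = float(i); out_ptr1[i] = 2.f * i; }

    std::array<int64_t, 16> idx{};
    for (int i = 0; i < 16; ++i) idx[i] = i * 7;

    // Single spill of the index vector (tmp8 in the generated kernel) ...
    alignas(64) std::array<int64_t, 16> tmpbuf = idx;

    // ... reused by both indirect loads; no second spill is generated.
    float acc = 0.f;
    for (long i = 0; i < 16; ++i) acc += out_ptr0[tmpbuf[i]];
    for (long i = 0; i < 16; ++i) acc += out_ptr1[tmpbuf[i]];
    std::printf("%f\n", acc);  // prints 2520.0
}
```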
pytorch-bot bot pushed a commit that referenced this issue May 3, 2024 (same message as above).

Pull Request resolved: #124597
Approved by: https://github.com/jgong5, https://github.com/jansel