[inductor cpp] fix non-contiguous reduction store #113261

Closed
jgong5 wants to merge 3 commits

Conversation

jgong5
Collaborator

@jgong5 jgong5 commented Nov 8, 2023

Stack from ghstack (oldest at bottom):

Fix #113018

The reduction store in this case works on a non-contiguous buffer. Previously, we only did the scalar fallback for normal stores, not for reduction stores. This PR fixes that.

Before fix
```c++
            #pragma omp for 
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(39L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (17L*x2) + (306L*x0)));
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0_vec.store(out_ptr1 + static_cast<long>(x0 + (39L*x1))); // this is wrong since x0 is not vector dim
                    }
                }
                #pragma omp simd simdlen(8) 
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(17L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr1[static_cast<long>(x1 + (17L*x2) + (306L*x0))];
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
                        }
                        out_ptr1[static_cast<long>(x0 + (39L*x1))] = tmp_acc0;
                    }
                }
            }
```

After fix
```c++
            #pragma omp for 
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(39L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (17L*x2) + (306L*x0)));
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                        }
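                            // scalar fallback: spill the 16 vector lanes into an aligned tmpbuf,
                            // then scatter them with scalar stores along the x1 (vector-dim) stride of 39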
                        { __at_align__ float tmpbuf[16*sizeof(float)/sizeof(float)]; tmp_acc0_vec.store(tmpbuf); for (long x1_inner = 0; x1_inner < 16; x1_inner++) out_ptr1[static_cast<long>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner]; }
                    }
                }
                #pragma omp simd simdlen(8) 
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(17L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(18L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr1[static_cast<long>(x1 + (17L*x2) + (306L*x0))];
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
                        }
                        out_ptr1[static_cast<long>(x0 + (39L*x1))] = tmp_acc0;
                    }
                }
            }
```
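
As a point of reference, here is a minimal repro sketch of the kind of pattern that exercises this path. The shapes, the `fn` helper, and the choice of `amax`/`transpose` are illustrative assumptions on my part (the actual test case is in #113018), so Inductor may not generate exactly the kernel shown above for it:

```python
# Hypothetical repro sketch (assumed shapes/ops, not the test from #113018):
# reduce over the middle dim, then transpose so that the vectorized output
# dimension is stored with a non-unit stride in the output buffer.
import torch

def fn(x):
    return x.amax(dim=1).transpose(0, 1)

x = torch.randn(39, 18, 17)
compiled_fn = torch.compile(fn)
# With the bug, a vectorized reduction store could write all 16 lanes to
# contiguous addresses even though the vector dim's output stride is not 1,
# so the compiled result would not match eager.
torch.testing.assert_close(compiled_fn(x), fn(x))
```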

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler


pytorch-bot bot commented Nov 8, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113261

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ec155ea with merge base 4f2b288:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jgong5 added a commit that referenced this pull request Nov 8, 2023
ghstack-source-id: 5b4d662c675d16d882c5b81ed721a60eaa2b537f
Pull Request resolved: #113261
@jgong5 jgong5 added the topic: not user facing label Nov 8, 2023
@jgong5 jgong5 requested review from jansel and lezcano November 9, 2023 09:01
jgong5 added a commit that referenced this pull request Nov 10, 2023
ghstack-source-id: e9fd92c05473f44799e74127348f568c94691c11
Pull Request resolved: #113261
```python
def get_vec_store_line(self, value, var, index, dtype):
    """
    Get a store line str that stores `value` into `var` at `index` of `dtype`.
    :param value: Vectorized type templaterized on `dtype`.
```
Collaborator

Do you want to static assert that value is indeed of type Vectorized<dtype> or nah?

Collaborator Author

I will check it in python after introducing CppCSEVariable.
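
For reference, a rough sketch of how such a scalar-fallback store string can be assembled; this is my own illustration under assumed names (`scalar_fallback_store` and its parameters are hypothetical), not the actual `get_vec_store_line` implementation in torch/_inductor/codegen/cpp.py:

```python
# Illustrative sketch only; the name and signature are hypothetical, not the
# real torch/_inductor/codegen/cpp.py API.
def scalar_fallback_store(value, var, index, tiling_var, stride, tiling_factor=16, dtype="float"):
    # Spill the vector register to an aligned temporary buffer, then scatter
    # its lanes with scalar stores along the output stride of the vector dim.
    inner = f"{tiling_var}_inner"
    return (
        f"{{ __at_align__ {dtype} tmpbuf[{tiling_factor}]; "
        f"{value}.store(tmpbuf); "
        f"for (long {inner} = 0; {inner} < {tiling_factor}; {inner}++) "
        f"{var}[static_cast<long>({index} + ({stride}L*{inner}))] = tmpbuf[{inner}]; }}"
    )

# scalar_fallback_store("tmp_acc0_vec", "out_ptr1", "x0 + (39L*x1)", "x1", 39)
# yields a line with the same shape as the non-contiguous reduction store in
# the "After fix" codegen above.
```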

torch/_inductor/codegen/cpp.py (review comment marked outdated and resolved)
@jgong5 jgong5 requested a review from lezcano November 10, 2023 13:23
@jgong5
Collaborator Author

jgong5 commented Nov 14, 2023

@jansel @lezcano I revised the PR according to the comments. Could you please take a look? Thanks.

@jgong5
Collaborator Author

jgong5 commented Nov 15, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 15, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Nov 15, 2023
@facebook-github-bot facebook-github-bot deleted the gh/jgong5/27/head branch November 18, 2023 15:27