
Improve precision and performance for BFloat16 upsampling #91169

Closed
wants to merge 6 commits into from

Conversation

CaoE
Collaborator

@CaoE CaoE commented Dec 20, 2022

Description

Testing

data type: BFloat16

  • Single core

contiguous:

| mode | scale_factor | shape | before backward / ms | after backward / ms |
| --- | --- | --- | --- | --- |
| nearest | 2 | [10, 3, 200, 200] | 14.47 | 8.34 |
| linear | 2 | [3, 200, 200] | 3.69 | 2.74 |
| bilinear | 2 | [3, 5, 200, 200] | 87.99 | 49.05 |
| trilinear | 2 | [3, 3, 3, 100, 100] | 171.02 | 72.53 |
| bicubic | 2 | [3, 3, 200, 200] | 176.29 | 78 |

channels last:

| mode | scale_factor | shape | before backward / ms | after backward / ms |
| --- | --- | --- | --- | --- |
| nearest | 2 | [10, 3, 200, 200] | 17.70 | 10.30 |
| linear | 2 | [3, 200, 200] | \ | \ |
| bilinear | 2 | [3, 5, 200, 200] | 50.90 | 18.83 |
| trilinear | 2 | [3, 3, 3, 100, 100] | 121.56 | 42.60 |
| bicubic | 2 | [3, 3, 200, 200] | 179.40 | 80 |
  • 20 cores

contiguous:

| mode | scale_factor | shape | before backward / ms | after backward / ms |
| --- | --- | --- | --- | --- |
| nearest | 2 | [10, 3, 200, 200] | 1.17 | 1.01 |
| linear | 2 | [3, 200, 200] | 0.41 | 0.26 |
| bilinear | 2 | [3, 5, 200, 200] | 7.19 | 4.07 |
| trilinear | 2 | [3, 3, 3, 100, 100] | 21.32 | 9.33 |
| bicubic | 2 | [3, 3, 200, 200] | 178.67 | 10 |

channels last:

| mode | scale_factor | shape | before backward / ms | after backward / ms |
| --- | --- | --- | --- | --- |
| nearest | 2 | [10, 3, 200, 200] | 2.25 | 1.55 |
| linear | 2 | [3, 200, 200] | \ | \ |
| bilinear | 2 | [3, 5, 200, 200] | 20.17 | 7.20 |
| trilinear | 2 | [3, 3, 3, 100, 100] | 43.33 | 15.66 |
| bicubic | 2 | [3, 3, 200, 200] | 176.76 | 10 |

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot

pytorch-bot bot commented Dec 20, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91169

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 238de76:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: nn release notes category label Dec 20, 2022
@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 20, 2022
@CaoE CaoE force-pushed the ecao/bf16_precision branch 4 times, most recently from 3905d48 to 1532c67 on December 23, 2022 05:26
@CaoE CaoE force-pushed the ecao/bf16_precision branch 4 times, most recently from 07d7ad7 to 0ee9447 on January 13, 2023 05:29
@CaoE CaoE changed the title Improve precision for BFloat16 upsampling Improve precision and performance for BFloat16 upsampling Jan 16, 2023
@CaoE CaoE force-pushed the ecao/bf16_precision branch 2 times, most recently from e5bcb93 to c0c3a5e on January 17, 2023 02:53
@CaoE CaoE requested review from jgong5 and mingfeima January 17, 2023 06:09
Collaborator

@jgong5 jgong5 left a comment

Next time, I suggest splitting the PR (one for precision and one for performance) to ease the code review.

Comment on lines 466 to 482
// For compilation, and it will not be used by data types other than BFloat16.
template <typename scalar_in, typename scalar_out>
void inline apply_grad_input(scalar_out* buffer_ptr, scalar_in* gin, int64_t size) {
return;
}

template <>
void inline apply_grad_input(float* buffer_ptr, BFloat16* gin, int64_t size) {
Collaborator

If only one specialization is needed, we don't have to make it a template function, right?

Collaborator Author

The template function is used here to allow compilation to pass for data types other than BFloat16. I also removed the unused function.
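For reference, a minimal sketch of the pattern being discussed, based on the snippet above: a do-nothing generic template keeps the float/double instantiations compiling, while only the BFloat16 specialization does real work. The scalar body here is illustrative only; the actual kernel uses vectorized conversion.

```cpp
#include <cstdint>
#include <cstring>
#include <c10/util/BFloat16.h>

// Generic version: only exists so that the float/double instantiations of the
// kernel compile; it is never called for those types at runtime.
template <typename scalar_in, typename scalar_out>
inline void apply_grad_input(scalar_out* buffer_ptr, scalar_in* gin, int64_t size) {
  return;
}

// BFloat16 specialization: copy the float accumulation buffer into the
// BFloat16 grad_input, then clear the buffer so it can be reused.
template <>
inline void apply_grad_input(float* buffer_ptr, c10::BFloat16* gin, int64_t size) {
  for (int64_t d = 0; d < size; d++) {
    gin[d] = buffer_ptr[d];
  }
  std::memset(buffer_ptr, 0, size * sizeof(float));
}
```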

Comment on lines 983 to 990
// when `real_input_index` becomes larger than the range the floating point
// type can accurately represent, the type casting to `int64_t` might exceed
// `input_size - 1`. So we guard it with `std::min` below.
input_index = std::min(static_cast<int64_t>(floorf(real_input_index)), input_size - 1);
auto lambda = std::min(
std::max(real_input_index - input_index, static_cast<opmath_t>(0)),
static_cast<opmath_t>(1)
);
Collaborator

It seems there is quite a bit of duplicated code like this. Can we factor out a util function to dedup?

Collaborator Author

@CaoE CaoE Jan 31, 2023

Factored out a util function guard_index_and_lambda.
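For reference, one plausible shape for such a helper, based on the snippet above (the exact signature of guard_index_and_lambda in the PR may differ):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

template <typename opmath_t>
inline void guard_index_and_lambda(
    opmath_t real_input_index,
    int64_t input_size,
    int64_t& input_index,
    opmath_t& lambda) {
  // When `real_input_index` becomes larger than the range the floating point
  // type can accurately represent, the cast to `int64_t` might exceed
  // `input_size - 1`, so clamp the index first.
  input_index = std::min(static_cast<int64_t>(std::floor(real_input_index)), input_size - 1);
  // Keep the interpolation weight in [0, 1].
  lambda = std::min(
      std::max(real_input_index - input_index, static_cast<opmath_t>(0)),
      static_cast<opmath_t>(1));
}
```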

@@ -14,6 +14,70 @@ namespace {

using scale_t = std::vector<c10::optional<double>>;

template <typename scalar_in, typename scalar_out>
void inline nearest_channels_last_acc(scalar_in* gin, scalar_out* gout, int64_t size) {
using Vec = vec::Vectorized<scalar_in>;
Collaborator

Vec::size() of scalar_in and scalar_out might not match. Have you considered this case?

Collaborator Author

scalar_in and scalar_out are always the same when the data type is not BFloat16.

Collaborator

Do you mean scalar_in and scalar_out are always the same?

Collaborator Author

I mean there are two cases where nearest_channels_last_acc is used: 1. scalar_in and scalar_out are the same; 2. scalar_in = float and scalar_out = BFloat16.
When opmath_t is not scalar_t, i.e., scalar_t = BFloat16 and opmath_t = float, the specialization template <> void inline nearest_channels_last_acc(float* gin, BFloat16* gout, int64_t size) is called.
When opmath_t == scalar_t, i.e., scalar_t = float or double, the generic template <typename scalar_in, typename scalar_out> void inline nearest_channels_last_acc(scalar_in* gin, scalar_out* gout, int64_t size) is called.

Collaborator

Please check they are equal here.

Collaborator Author

Added a check.
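For reference, a sketch of what the two paths could look like with the requested check (shown here as a static_assert in the generic path), assuming ATen's Vectorized helpers; this is not the exact code in the PR:

```cpp
#include <type_traits>
#include <ATen/cpu/vec/vec.h>
#include <ATen/cpu/vec/functional.h>
#include <c10/util/BFloat16.h>

// Generic path: scalar_in and scalar_out must be the same type here
// (float/float or double/double).
template <typename scalar_in, typename scalar_out>
inline void nearest_channels_last_acc(scalar_in* gin, scalar_out* gout, int64_t size) {
  static_assert(std::is_same<scalar_in, scalar_out>::value,
      "acc with different data types only supports BFloat16");
  using Vec = at::vec::Vectorized<scalar_in>;
  int64_t d = 0;
  for (; d < size - (size % Vec::size()); d += Vec::size()) {
    Vec gin_vec = Vec::loadu(gin + d) + Vec::loadu(gout + d);
    gin_vec.store(gin + d);
  }
  for (; d < size; d++) {
    gin[d] += gout[d];
  }
}

// BFloat16 path: grad_output is BFloat16, the accumulation buffer is float.
template <>
inline void nearest_channels_last_acc(float* gin, c10::BFloat16* gout, int64_t size) {
  using bVec = at::vec::Vectorized<c10::BFloat16>;
  using fVec = at::vec::Vectorized<float>;
  int64_t d = 0;
  for (; d < size - (size % bVec::size()); d += bVec::size()) {
    bVec gout_bvec = bVec::loadu(gout + d);
    auto [gout_fvec0, gout_fvec1] = at::vec::convert_bfloat16_float(gout_bvec);
    fVec gin_fvec0 = fVec::loadu(gin + d) + gout_fvec0;
    fVec gin_fvec1 = fVec::loadu(gin + d + fVec::size()) + gout_fvec1;
    gin_fvec0.store(gin + d);
    gin_fvec1.store(gin + d + fVec::size());
  }
  for (; d < size; d++) {
    gin[d] += gout[d];
  }
}
```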


template <typename scalar_in, typename scalar_out>
void inline linear_channels_last_acc(scalar_in* gin, scalar_out* gout, scalar_in w, int64_t size) {
using Vec = vec::Vectorized<scalar_in>;
Collaborator

ditto

Collaborator Author

Same as for nearest_channels_last_acc.

auto loop1d = [&](int64_t begin, int64_t end) {
opmath_t* acc_data_ptr = nullptr;
std::unique_ptr<opmath_t[]> buffer_data;
if (std::is_same<scalar_t, BFloat16>::value) {
Collaborator

I guess it's safer to check that opmath_t and scalar_t don't match rather than explicitly checking for BFloat16 here.

Collaborator

I suggest fixing similar usage in other places too.

Collaborator Author

Fixed as suggested.
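For reference, a sketch of the buffer setup keyed on the type mismatch, as suggested. The function name and parameters here are hypothetical; at::opmath_type is used to map BFloat16 to float.

```cpp
#include <cstdint>
#include <cstring>
#include <memory>
#include <type_traits>
#include <ATen/OpMathType.h>
#include <c10/util/BFloat16.h>

// Hypothetical stand-in for the kernel's loop body: allocate a higher-precision
// accumulation buffer only when the tensor's scalar type differs from the math
// type (currently only BFloat16 vs. float).
template <typename scalar_t>
void upsample_backward_row(scalar_t* grad_input_row, int64_t row_size) {
  using opmath_t = at::opmath_type<scalar_t>;  // float for BFloat16, otherwise scalar_t
  opmath_t* acc_data_ptr = nullptr;
  std::unique_ptr<opmath_t[]> buffer_data;
  if (!std::is_same<scalar_t, opmath_t>::value) {
    buffer_data = std::make_unique<opmath_t[]>(row_size);
    acc_data_ptr = buffer_data.get();
    std::memset(acc_data_ptr, 0, sizeof(opmath_t) * row_size);
  }
  // ... accumulate into acc_data_ptr when it is non-null, otherwise directly
  // into grad_input_row, then write back via apply_grad_input() ...
}
```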

@CaoE
Collaborator Author

CaoE commented Jan 18, 2023

Next time, I suggest splitting the PR (one for precision and one for performance) to ease the code review.

Sorry for the inconvenience and thanks for your advice! The precision part and the performance part are sometimes hard to separate, since the code that improves precision also improves performance. I will structure PRs like this to ease code review in the future.

@CaoE CaoE force-pushed the ecao/bf16_precision branch 3 times, most recently from fad835a to 148f49b on January 31, 2023 02:44
@CaoE CaoE requested a review from jgong5 February 2, 2023 03:24
Comment on lines 471 to 477
// It will not be used by data types other than BFloat16.
template <typename scalar_in, typename scalar_out>
void inline apply_grad_input(scalar_in* buffer_ptr, scalar_out* gin, int64_t size) {
constexpr bool is_bf16 = std::is_same<scalar_out, BFloat16>::value;
TORCH_CHECK(is_bf16,
"Upsample backward only support BFloat16 in the lower percision data types on CPU.")
constexpr bool is_fp32 = std::is_same<scalar_in, float>::value;
TORCH_CHECK(is_fp32,
"Upsample backward should use float as acc buffer for BFloat16 grad input on CPU.")
Collaborator

What's the problem if you make the function void inline apply_grad_input(float* buffer_ptr, BFloat16* gin, int64_t size)?

Collaborator Author

Although only scalar_in = float and scalar_out = BFloat16 is used at runtime, scalar_in and scalar_out may be instantiated as float or double at compile time. Using void inline apply_grad_input(float* buffer_ptr, BFloat16* gin, int64_t size) would raise "cannot convert 'double*' to 'float*'" or "cannot convert 'float*' to 'c10::BFloat16*'" at compile time.

Collaborator

Does SFINAE work here so that you don't have to add runtime checks inside a general template function?

Collaborator Author

Added a default template instead; SFINAE doesn't work here because it still wouldn't generate functions for data types other than BFloat16.

@CaoE CaoE force-pushed the ecao/bf16_precision branch 2 times, most recently from a7b9eae to efd6487 on February 8, 2023 08:30
@CaoE CaoE marked this pull request as ready for review February 9, 2023 05:03
@CaoE CaoE added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 9, 2023
@CaoE CaoE added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label May 9, 2023
@CaoE
Collaborator Author

CaoE commented May 23, 2023

@ngimel Could you please review this PR? Thanks.

@CaoE CaoE requested review from ngimel and Skylion007 May 25, 2023 03:07
@CaoE
Collaborator Author

CaoE commented May 25, 2023

@Skylion007 Could you please review this PR? Thanks.

for (; d < size - (size % bVec::size()); d += bVec::size()) {
bVec gout_bvec = bVec::loadu(gout + d);
fVec gout_fvec0, gout_fvec1;
std::tie(gout_fvec0, gout_fvec1) = convert_bfloat16_float(gout_bvec);
Collaborator

Really minor nit, but you should be able to use structured bindings now that C++17 is fully enabled.

Collaborator Author

Used structured bindings.
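Using the names from the snippet above, the suggested form would be roughly:

```cpp
bVec gout_bvec = bVec::loadu(gout + d);
auto [gout_fvec0, gout_fvec1] = convert_bfloat16_float(gout_bvec);
```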

}
}
if (!std::is_same<scalar_t, opmath_t>::value) {
Collaborator

if constexpr here?

Collaborator Author

Used if constexpr instead.
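For reference, a minimal illustration of the suggestion with a hypothetical helper: with if constexpr, the buffer-setup branch is discarded at compile time for types where scalar_t and opmath_t match, instead of being instantiated and skipped at runtime.

```cpp
#include <cstdint>
#include <memory>
#include <type_traits>

template <typename scalar_t, typename opmath_t>
void setup_acc_buffer(std::unique_ptr<opmath_t[]>& buffer_data,
                      opmath_t*& acc_data_ptr,
                      int64_t size) {
  if constexpr (!std::is_same_v<scalar_t, opmath_t>) {
    // Only instantiated when a separate higher-precision buffer is needed.
    buffer_data = std::make_unique<opmath_t[]>(size);
    acc_data_ptr = buffer_data.get();
  }
}
```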

};

auto loop3d = [&](int64_t begin, int64_t end) {
opmath_t* acc_data_ptr = nullptr;
std::unique_ptr<opmath_t[]> buffer_data;
if (!std::is_same<scalar_t, opmath_t>::value) {
Collaborator

likewise if constexpr here

Collaborator Author

Used if constexpr instead.

@CaoE CaoE force-pushed the ecao/bf16_precision branch 2 times, most recently from 40655a1 to de59e9e on May 25, 2023 05:55
@CaoE
Collaborator Author

CaoE commented May 28, 2023

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased ecao/bf16_precision onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout ecao/bf16_precision && git pull --rebase)

@CaoE
Collaborator Author

CaoE commented May 29, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Labels: ciflow/periodic, ciflow/trunk, Merged, module: cpu, open source, release notes: nn
Projects: Done
Development: Successfully merging this pull request may close these issues. None yet
6 participants