
Improve precision and performance for BFloat16 upsampling #91169

Closed
wants to merge 6 commits into from

Conversation

CaoE
Collaborator

@CaoE CaoE commented Dec 20, 2022

Description

Testing

data type: BFloat16

  • Single core

contiguous:

| mode | scale_factor | shape | before backward / ms | after backward / ms |
| --- | --- | --- | --- | --- |
| nearest | 2 | [10, 3, 200, 200] | 14.47 | 8.34 |
| linear | 2 | [3, 200, 200] | 3.69 | 2.74 |
| bilinear | 2 | [3, 5, 200, 200] | 87.99 | 49.05 |
| trilinear | 2 | [3, 3, 3, 100, 100] | 171.02 | 72.53 |
| bicubic | 2 | [3, 3, 200, 200] | 176.29 | 78 |

channels last:

| mode | scale_factor | shape | before backward / ms | after backward / ms |
| --- | --- | --- | --- | --- |
| nearest | 2 | [10, 3, 200, 200] | 17.70 | 10.30 |
| linear | 2 | [3, 200, 200] | \ | \ |
| bilinear | 2 | [3, 5, 200, 200] | 50.90 | 18.83 |
| trilinear | 2 | [3, 3, 3, 100, 100] | 121.56 | 42.60 |
| bicubic | 2 | [3, 3, 200, 200] | 179.40 | 80 |
  • 20 cores

contiguous:

| mode | scale_factor | shape | before backward / ms | after backward / ms |
| --- | --- | --- | --- | --- |
| nearest | 2 | [10, 3, 200, 200] | 1.17 | 1.01 |
| linear | 2 | [3, 200, 200] | 0.41 | 0.26 |
| bilinear | 2 | [3, 5, 200, 200] | 7.19 | 4.07 |
| trilinear | 2 | [3, 3, 3, 100, 100] | 21.32 | 9.33 |
| bicubic | 2 | [3, 3, 200, 200] | 178.67 | 10 |

channels last:

| mode | scale_factor | shape | before backward / ms | after backward / ms |
| --- | --- | --- | --- | --- |
| nearest | 2 | [10, 3, 200, 200] | 2.25 | 1.55 |
| linear | 2 | [3, 200, 200] | \ | \ |
| bilinear | 2 | [3, 5, 200, 200] | 20.17 | 7.20 |
| trilinear | 2 | [3, 3, 3, 100, 100] | 43.33 | 15.66 |
| bicubic | 2 | [3, 3, 200, 200] | 176.76 | 10 |

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot

pytorch-bot bot commented Dec 20, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91169

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 238de76:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: nn release notes category label Dec 20, 2022
@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Dec 20, 2022
@CaoE CaoE force-pushed the ecao/bf16_precision branch 4 times, most recently from 3905d48 to 1532c67 on December 23, 2022 05:26
@CaoE CaoE force-pushed the ecao/bf16_precision branch 4 times, most recently from 07d7ad7 to 0ee9447 on January 13, 2023 05:29
@CaoE CaoE changed the title Improve precision for BFloat16 upsampling Improve precision and performance for BFloat16 upsampling Jan 16, 2023
@CaoE CaoE force-pushed the ecao/bf16_precision branch 2 times, most recently from e5bcb93 to c0c3a5e on January 17, 2023 02:53
@CaoE CaoE requested review from jgong5 and mingfeima January 17, 2023 06:09
Collaborator

@jgong5 jgong5 left a comment

Next time, I suggest splitting the PR (one for precision and one for performance) to ease the code review.

Comment on lines 466 to 482
// For compilation, and it will not be used by data types other than BFloat16.
template <typename scalar_in, typename scalar_out>
void inline apply_grad_input(scalar_out* buffer_ptr, scalar_in* gin, int64_t size) {
return;
}

template <>
void inline apply_grad_input(float* buffer_ptr, BFloat16* gin, int64_t size) {
Collaborator

If only one specialization is needed, we don't have to make it a template function, right?

Collaborator Author

The template function is used here to allow compilation to pass for data types other than BFloat16. I also removed the unused function.
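For reference, a minimal sketch of the pattern being discussed, based on the snippet above: a do-nothing generic template keeps the float/double instantiations compiling, while only the BFloat16 specialization does real work. The scalar body here is illustrative only; the actual kernel uses vectorized conversion.

```cpp
#include <cstdint>
#include <cstring>
#include <c10/util/BFloat16.h>

// Generic version: only exists so that the float/double instantiations of the
// kernel compile; it is never called for those types at runtime.
template <typename scalar_in, typename scalar_out>
inline void apply_grad_input(scalar_out* buffer_ptr, scalar_in* gin, int64_t size) {
  return;
}

// BFloat16 specialization: copy the float accumulation buffer into the
// BFloat16 grad_input, then clear the buffer so it can be reused.
template <>
inline void apply_grad_input(float* buffer_ptr, c10::BFloat16* gin, int64_t size) {
  for (int64_t d = 0; d < size; d++) {
    gin[d] = buffer_ptr[d];
  }
  std::memset(buffer_ptr, 0, size * sizeof(float));
}
```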

Comment on lines 983 to 990
// when `real_input_index` becomes larger than the range the floating point
// type can accurately represent, the type casting to `int64_t` might exceed
// `input_size - 1`. So we guard it with `std::min` below.
input_index = std::min(static_cast<int64_t>(floorf(real_input_index)), input_size - 1);
auto lambda = std::min(
std::max(real_input_index - input_index, static_cast<opmath_t>(0)),
static_cast<opmath_t>(1)
);
Collaborator

It seems there is quite a bit of duplicated code like this. Can we factor out a util function to dedup?

Collaborator Author

@CaoE CaoE Jan 31, 2023

Factored out a util function guard_index_and_lambda.
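For reference, one plausible shape for such a helper, based on the snippet above (the exact signature of guard_index_and_lambda in the PR may differ):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

template <typename opmath_t>
inline void guard_index_and_lambda(
    opmath_t real_input_index,
    int64_t input_size,
    int64_t& input_index,
    opmath_t& lambda) {
  // When `real_input_index` becomes larger than the range the floating point
  // type can accurately represent, the cast to `int64_t` might exceed
  // `input_size - 1`, so clamp the index first.
  input_index = std::min(static_cast<int64_t>(std::floor(real_input_index)), input_size - 1);
  // Keep the interpolation weight in [0, 1].
  lambda = std::min(
      std::max(real_input_index - input_index, static_cast<opmath_t>(0)),
      static_cast<opmath_t>(1));
}
```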

@@ -14,6 +14,70 @@ namespace {

using scale_t = std::vector<c10::optional<double>>;

template <typename scalar_in, typename scalar_out>
void inline nearest_channels_last_acc(scalar_in* gin, scalar_out* gout, int64_t size) {
using Vec = vec::Vectorized<scalar_in>;
Collaborator

Vec::size() of scalar_in and scalar_out might not match. Have you considered this case?

Collaborator Author

scalar_in and scalar_out are always the same when the data type is not BFloat16.

Collaborator

Do you mean scalar_in and scalar_out are always the same?

Collaborator Author

I mean there are two cases where nearest_channels_last_acc is used: 1. scalar_in and scalar_out are the same; 2. scalar_in = float and scalar_out = BFloat16.
When opmath_t is not scalar_t, i.e., scalar_t = BFloat16 and opmath_t = float, the specialization template <> void inline nearest_channels_last_acc(float* gin, BFloat16* gout, int64_t size) is called.
When opmath_t == scalar_t, i.e., scalar_t = float or double, the generic template <typename scalar_in, typename scalar_out> void inline nearest_channels_last_acc(scalar_in* gin, scalar_out* gout, int64_t size) is called.

Collaborator

Please check they are equal here.

Collaborator Author

Added a check.
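For reference, a sketch of what the two paths could look like with the requested check (shown here as a static_assert in the generic path), assuming ATen's Vectorized helpers; this is not the exact code in the PR:

```cpp
#include <type_traits>
#include <ATen/cpu/vec/vec.h>
#include <ATen/cpu/vec/functional.h>
#include <c10/util/BFloat16.h>

// Generic path: scalar_in and scalar_out must be the same type here
// (float/float or double/double).
template <typename scalar_in, typename scalar_out>
inline void nearest_channels_last_acc(scalar_in* gin, scalar_out* gout, int64_t size) {
  static_assert(std::is_same<scalar_in, scalar_out>::value,
      "acc with different data types only supports BFloat16");
  using Vec = at::vec::Vectorized<scalar_in>;
  int64_t d = 0;
  for (; d < size - (size % Vec::size()); d += Vec::size()) {
    Vec gin_vec = Vec::loadu(gin + d) + Vec::loadu(gout + d);
    gin_vec.store(gin + d);
  }
  for (; d < size; d++) {
    gin[d] += gout[d];
  }
}

// BFloat16 path: grad_output is BFloat16, the accumulation buffer is float.
template <>
inline void nearest_channels_last_acc(float* gin, c10::BFloat16* gout, int64_t size) {
  using bVec = at::vec::Vectorized<c10::BFloat16>;
  using fVec = at::vec::Vectorized<float>;
  int64_t d = 0;
  for (; d < size - (size % bVec::size()); d += bVec::size()) {
    bVec gout_bvec = bVec::loadu(gout + d);
    auto [gout_fvec0, gout_fvec1] = at::vec::convert_bfloat16_float(gout_bvec);
    fVec gin_fvec0 = fVec::loadu(gin + d) + gout_fvec0;
    fVec gin_fvec1 = fVec::loadu(gin + d + fVec::size()) + gout_fvec1;
    gin_fvec0.store(gin + d);
    gin_fvec1.store(gin + d + fVec::size());
  }
  for (; d < size; d++) {
    gin[d] += gout[d];
  }
}
```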


template <typename scalar_in, typename scalar_out>
void inline linear_channels_last_acc(scalar_in* gin, scalar_out* gout, scalar_in w, int64_t size) {
using Vec = vec::Vectorized<scalar_in>;
Collaborator

ditto

Collaborator Author

Same as for nearest_channels_last_acc.

auto loop1d = [&](int64_t begin, int64_t end) {
opmath_t* acc_data_ptr = nullptr;
std::unique_ptr<opmath_t[]> buffer_data;
if (std::is_same<scalar_t, BFloat16>::value) {
Collaborator

I guess it's safer to check that opmath_t and scalar_t don't match rather than explicitly checking for BFloat16 here.

Collaborator

I suggest fixing similar usage in other places too.

Collaborator Author

Fixed as suggested.
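For reference, a sketch of the buffer setup keyed on the type mismatch, as suggested. The function name and parameters here are hypothetical; at::opmath_type is used to map BFloat16 to float.

```cpp
#include <cstdint>
#include <cstring>
#include <memory>
#include <type_traits>
#include <ATen/OpMathType.h>
#include <c10/util/BFloat16.h>

// Hypothetical stand-in for the kernel's loop body: allocate a higher-precision
// accumulation buffer only when the tensor's scalar type differs from the math
// type (currently only BFloat16 vs. float).
template <typename scalar_t>
void upsample_backward_row(scalar_t* grad_input_row, int64_t row_size) {
  using opmath_t = at::opmath_type<scalar_t>;  // float for BFloat16, otherwise scalar_t
  opmath_t* acc_data_ptr = nullptr;
  std::unique_ptr<opmath_t[]> buffer_data;
  if (!std::is_same<scalar_t, opmath_t>::value) {
    buffer_data = std::make_unique<opmath_t[]>(row_size);
    acc_data_ptr = buffer_data.get();
    std::memset(acc_data_ptr, 0, sizeof(opmath_t) * row_size);
  }
  // ... accumulate into acc_data_ptr when it is non-null, otherwise directly
  // into grad_input_row, then write back via apply_grad_input() ...
}
```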

@CaoE
Collaborator Author

CaoE commented Jan 18, 2023

Next time, I suggest splitting the PR (one for precision and one for performance) to ease the code review.

Sorry for the inconvenience and thanks for your advice! The precision part and the performance part are sometimes hard to separate, since the code that improves precision also improves performance. I will structure PRs like this to ease code review in the future.

@CaoE CaoE force-pushed the ecao/bf16_precision branch 3 times, most recently from fad835a to 148f49b on January 31, 2023 02:44
@CaoE CaoE requested a review from jgong5 February 2, 2023 03:24
Comment on lines 471 to 477
// It will not be used by data types other than BFloat16.
template <typename scalar_in, typename scalar_out>
void inline apply_grad_input(scalar_in* buffer_ptr, scalar_out* gin, int64_t size) {
constexpr bool is_bf16 = std::is_same<scalar_out, BFloat16>::value;
TORCH_CHECK(is_bf16,
"Upsample backward only support BFloat16 in the lower percision data types on CPU.")
constexpr bool is_fp32 = std::is_same<scalar_in, float>::value;
TORCH_CHECK(is_fp32,
"Upsample backward should use float as acc buffer for BFloat16 grad input on CPU.")
Collaborator

What's the problem if you make the function void inline apply_grad_input(float* buffer_ptr, BFloat16* gin, int64_t size)?

Collaborator Author

Although only scalar_in = float and scalar_out = BFloat16 is used at runtime, scalar_in and scalar_out may be instantiated as float or double at compile time. Using void inline apply_grad_input(float* buffer_ptr, BFloat16* gin, int64_t size) would raise "cannot convert 'double*' to 'float*'" or "cannot convert 'float*' to 'c10::BFloat16*'" at compile time.

Collaborator

Does SFINAE work here so that you don't have to add runtime checks inside a general template function?

Collaborator Author

Added a default template instead; SFINAE doesn't work here because it still wouldn't generate functions for data types other than BFloat16.

@CaoE CaoE force-pushed the ecao/bf16_precision branch 2 times, most recently from a7b9eae to efd6487 on February 8, 2023 08:30
@CaoE CaoE marked this pull request as ready for review February 9, 2023 05:03
@CaoE CaoE added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 9, 2023
@CaoE CaoE added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label May 9, 2023
@CaoE
Collaborator Author

CaoE commented May 23, 2023

@ngimel Could you please review this PR? Thanks.

@CaoE CaoE requested review from ngimel and Skylion007 May 25, 2023 03:07
@CaoE
Collaborator Author

CaoE commented May 25, 2023

@Skylion007 Could you please review this PR? Thanks.

for (; d < size - (size % bVec::size()); d += bVec::size()) {
bVec gout_bvec = bVec::loadu(gout + d);
fVec gout_fvec0, gout_fvec1;
std::tie(gout_fvec0, gout_fvec1) = convert_bfloat16_float(gout_bvec);
Collaborator

Really minor nit, but you should be able to use structured bindings now that C++17 is fully enabled.

Collaborator Author

Used structured bindings.
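Using the names from the snippet above, the suggested form would be roughly:

```cpp
bVec gout_bvec = bVec::loadu(gout + d);
auto [gout_fvec0, gout_fvec1] = convert_bfloat16_float(gout_bvec);
```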

}
}
if (!std::is_same<scalar_t, opmath_t>::value) {
Collaborator

if constexpr here?

Collaborator Author

Used if constexpr instead.
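For reference, a minimal illustration of the suggestion with a hypothetical helper: with if constexpr, the buffer-setup branch is discarded at compile time for types where scalar_t and opmath_t match, instead of being instantiated and skipped at runtime.

```cpp
#include <cstdint>
#include <memory>
#include <type_traits>

template <typename scalar_t, typename opmath_t>
void setup_acc_buffer(std::unique_ptr<opmath_t[]>& buffer_data,
                      opmath_t*& acc_data_ptr,
                      int64_t size) {
  if constexpr (!std::is_same_v<scalar_t, opmath_t>) {
    // Only instantiated when a separate higher-precision buffer is needed.
    buffer_data = std::make_unique<opmath_t[]>(size);
    acc_data_ptr = buffer_data.get();
  }
}
```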

};

auto loop3d = [&](int64_t begin, int64_t end) {
opmath_t* acc_data_ptr = nullptr;
std::unique_ptr<opmath_t[]> buffer_data;
if (!std::is_same<scalar_t, opmath_t>::value) {
Collaborator

likewise if constexpr here

Collaborator Author

Used if constexpr instead.

@CaoE CaoE force-pushed the ecao/bf16_precision branch 2 times, most recently from 40655a1 to de59e9e on May 25, 2023 05:55
@CaoE
Collaborator Author

CaoE commented May 28, 2023

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased ecao/bf16_precision onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout ecao/bf16_precision && git pull --rebase)

@CaoE
Collaborator Author

CaoE commented May 29, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Labels: ciflow/periodic, ciflow/trunk, Merged, module: cpu, open source, release notes: nn
Projects: Done
Development: Successfully merging this pull request may close these issues. None yet
6 participants