
[INTEL MKL] Enabling TF + OneDNN in stock TF build #47745

Closed
wants to merge 7 commits into from

Conversation

gzmkl
Contributor

@gzmkl gzmkl commented Mar 12, 2021

Refactor code to do the following:

  1. Code included in the stock TF build is guarded by the macro INTEL_MKL.

  2. Code included only when --config=mkl is used is now guarded by the macro ENABLE_MKL.

  3. When building stock TF for x86 Linux/Windows, the macro INTEL_MKL will be defined; it will not be defined for other architectures.

  4. Added a runtime environment variable, ENABLE_ONEDNN_OPTS, that enables oneDNN optimizations in stock TF.

  5. For --config=mkl, ENABLE_MKL will be defined.

This PR assumes that:

a) oneDNN is upgraded to v2.1 (PR #47743), and

b) stock TF and TF+oneDNN use the same oneDNN build (#47679).
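The guarding scheme described above can be sketched as follows. The macro names and the ENABLE_ONEDNN_OPTS variable come from this PR; the helper function name and its exact parsing of the variable are illustrative assumptions, not the actual TensorFlow implementation:

```cpp
#include <cstdlib>
#include <cstring>

// Compile-time layering described in the PR:
#ifdef INTEL_MKL
// Defined for stock TF builds on x86 Linux/Windows: oneDNN-capable code
// lives inside this guard.

#ifdef ENABLE_MKL
// Defined only when building with --config=mkl: MKL-specific code goes here.
#endif  // ENABLE_MKL

#endif  // INTEL_MKL

// Hypothetical runtime check (name and value parsing are assumptions):
// oneDNN optimizations in stock TF are switched on at runtime by the
// ENABLE_ONEDNN_OPTS environment variable.
inline bool OneDnnOptsEnabled() {
  const char* val = std::getenv("ENABLE_ONEDNN_OPTS");
  return val != nullptr && std::strcmp(val, "1") == 0;
}
```

The key design point is the two-level guard: INTEL_MKL decides what is compiled into stock TF at all, while ENABLE_MKL (and the runtime variable) decide what is actually active.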

@google-ml-butler google-ml-butler bot added the size:L CL Change Size: Large label Mar 12, 2021
@google-cla google-cla bot added the cla: yes label Mar 12, 2021
@gbaned gbaned self-assigned this Mar 12, 2021
@gbaned gbaned added the comp:mkl MKL related issues label Mar 12, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Mar 12, 2021
@gbaned
Contributor

gbaned commented Mar 17, 2021

@gzmkl Can you please resolve conflicts? Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Mar 17, 2021
@gzmkl
Contributor Author

gzmkl commented Mar 18, 2021

@gbaned The merge conflicts were related to the oneDNN upgrade PR. They are gone now that the upgrade PR has been merged.

@penpornk penpornk added the kokoro:force-run Tests on submitted change label Mar 18, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 18, 2021
Member

@penpornk penpornk left a comment


Thank you very much for the PR!

tensorflow/core/common_runtime/threadpool_device.cc (Outdated)
#ifdef _OPENMP
#include <omp.h>
#endif
#endif
Member


Nit:

Suggested change
#endif
#endif // defined(ENABLE_ONEDNN_OPENMP) && defined(INTEL_MKL)

Contributor Author


Yes, we are making the change. We will also replace INTEL_MKL with ENABLE_MKL due to the line 35 change.

Member


The comment should be added to line 39, though. The #endif on line 38 is for #ifdef _OPENMP. I will change this internally.

Contributor Author


Fixed!

tensorflow/core/kernels/mkl/mkl_fused_ops_test.cc (Outdated)
tensorflow/core/util/util.cc
tensorflow/tensorflow.bzl (Outdated)
third_party/mkl/build_defs.bzl (Outdated)
third_party/mkl/build_defs.bzl
PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Mar 19, 2021
@penpornk
Member

@gzmkl By the way, if any change requires a long testing time and you would prefer to do it later, please feel free to just add a TODO instead.

@gzmkl
Contributor Author

gzmkl commented Mar 19, 2021

@penpornk Thank you for the code review. Most of the code change suggestions are very good, and I have made the changes.
Local testing will take a while, but I will try to push the changes to the PR branch today.

@penpornk
Member

@gzmkl Thank you very much for the quick responses! Please take your time. :)
The oneDNN v2.1 upgrade is still not in the clear. It made one test time out (due to long compilation time), so I submitted a fix this morning. I'll have to wait for tonight's nightly test results to make sure v2.1 will stick. The soonest this PR can be merged is tomorrow (if nothing else goes wrong).

@gzmkl
Contributor Author

gzmkl commented Mar 19, 2021

@penpornk I have committed changes per your suggestions. Please let me know if there is anything missing.
Thanks a lot.

PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Mar 19, 2021
Member

@penpornk penpornk left a comment


Thank you for the quick response! Several of my comments are still not addressed, but I can make the rest of the minor modifications (and add TODOs for refactors) myself. I'm going to pull the PR in to test now.

There are two comments that I still need answers to, though. Could you please reply to these questions? Q1, Q2

#ifdef _OPENMP
#include <omp.h>
#endif
#endif
Member


The comment should be added to line 39, though. The #endif on line 38 is for #ifdef _OPENMP. I will change this internally.

tensorflow/core/util/util.cc
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 19, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 19, 2021
@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Mar 21, 2021
@gbaned gbaned removed the ready to pull PR ready for merge process label Mar 22, 2021
@gbaned
Contributor

gbaned commented Mar 22, 2021

@gzmkl Can you please check @penpornk's comments and keep us posted? Thanks!

@gzmkl
Contributor Author

gzmkl commented Mar 22, 2021

@gzmkl Can you please check @penpornk's comments and keep us posted ? Thanks!

Yes, I just did that. Thanks!

copybara-service bot pushed a commit that referenced this pull request Mar 23, 2021
PiperOrigin-RevId: 364470371
Change-Id: I6ff0e6bec13cb9dd4bf7396c22495c505f801cb1
@penpornk
Member

penpornk commented Mar 23, 2021

@gzmkl The changes from this PR have been merged in 4a24610. Somehow GitHub doesn't mark it as merged. I'm closing this PR now. Thank you again for the PR! :)

@penpornk penpornk closed this Mar 23, 2021
PR Queue automation moved this from Approved by Reviewer to Closed/Rejected Mar 23, 2021
@ScottTodd
Contributor

Hey, I'm noticing some build failures in a downstream project (IREE) after this was merged. Full logs are here. Maybe we're using different flags/compilers to build.

Relevant logs snippet:

ERROR: /home/kbuilder/.cache/bazel/_bazel_kbuilder/1900d0fac5123d725c8d2e08b3f8c209/external/org_tensorflow/tensorflow/core/kernels/mkl/BUILD:299:22: C++ compilation of rule '@org_tensorflow//tensorflow/core/kernels/mkl:mkl_softmax_op' failed (Exit 1): clang failed: error executing command /usr/bin/clang -U_FORTIFY_SOURCE -fstack-protector -Wall -Wthread-safety -Wself-assign -fcolor-diagnostics -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 262 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox clang failed: error executing command /usr/bin/clang -U_FORTIFY_SOURCE -fstack-protector -Wall -Wthread-safety -Wself-assign -fcolor-diagnostics -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 262 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
In file included from external/org_tensorflow/tensorflow/core/kernels/mkl/mkl_softmax_op.cc:22:
In file included from external/org_tensorflow/tensorflow/core/framework/numeric_op.h:19:
In file included from external/org_tensorflow/tensorflow/core/framework/op_kernel.h:24:
In file included from external/org_tensorflow/tensorflow/core/framework/allocator.h:28:
In file included from external/org_tensorflow/tensorflow/core/platform/logging.h:27:
external/org_tensorflow/tensorflow/core/platform/default/logging.h:280:9: error: call to function 'operator<<' that is neither visible in the template definition nor found by argument-dependent lookup
  (*os) << v;
        ^
external/org_tensorflow/tensorflow/core/platform/default/logging.h:339:3: note: in instantiation of function template specialization 'tensorflow::internal::MakeCheckOpValueString<dnnl::memory::format_tag>' requested here
  MakeCheckOpValueString(comb.ForVar1(), v1);
  ^
external/org_tensorflow/tensorflow/core/platform/default/logging.h:384:1: note: in instantiation of function template specialization 'tensorflow::internal::MakeCheckOpString<dnnl::memory::format_tag, dnnl::memory::format_tag>' requested here
TF_DEFINE_CHECK_OP_IMPL(Check_NE, !=)  // Use CHECK(x == NULL) instead.
^
external/org_tensorflow/tensorflow/core/platform/default/logging.h:357:38: note: expanded from macro 'TF_DEFINE_CHECK_OP_IMPL'
      return ::tensorflow::internal::MakeCheckOpString(v1, v2, exprtext); \
                                     ^
external/org_tensorflow/tensorflow/core/util/mkl_util.h:472:7: note: in instantiation of function template specialization 'tensorflow::internal::Check_NEImpl<dnnl::memory::format_tag, dnnl::memory::format_tag>' requested here
      DCHECK_NE(format_tag, memory::format_tag::undef);
      ^
external/org_tensorflow/tensorflow/core/platform/default/logging.h:418:31: note: expanded from macro 'DCHECK_NE'
#define DCHECK_NE(val1, val2) CHECK_NE(val1, val2)
                              ^
external/org_tensorflow/tensorflow/core/platform/default/logging.h:405:30: note: expanded from macro 'CHECK_NE'
#define CHECK_NE(val1, val2) CHECK_OP(Check_NE, !=, val1, val2)
                             ^
external/org_tensorflow/tensorflow/core/platform/default/logging.h:401:40: note: expanded from macro 'CHECK_OP'
#define CHECK_OP(name, op, val1, val2) CHECK_OP_LOG(name, op, val1, val2)
                                       ^
external/org_tensorflow/tensorflow/core/platform/default/logging.h:395:31: note: expanded from macro 'CHECK_OP_LOG'
      ::tensorflow::internal::name##Impl(                      \
                              ^
<scratch space>:124:1: note: expanded from here
Check_NEImpl
^
external/org_tensorflow/tensorflow/core/util/mkl_util.h:184:22: note: 'operator<<' should be declared prior to the call site or in namespace 'dnnl'
inline std::ostream& operator<<(std::ostream& os,
                     ^
1 error generated.

This is the problematic line:

DCHECK_NE(format_tag, memory::format_tag::undef);

This can be fixed by either removing that line or moving this operator<< from namespace tensorflow to namespace dnnl:

inline std::ostream& operator<<(std::ostream& os,
                                const memory::format_tag& tag) {
  if (tag == memory::format_tag::undef) {
    os << "undef";
  } else if (tag == memory::format_tag::any) {
    os << "any";
  } else {
    os << "invalid";
  }
  return os;
}

How do you want to proceed?

@penpornk
Member

@ScottTodd Sorry about the break and thank you for proposing the fix!
I'll leave this to @gzmkl and @agramesh1 to decide. I think making changes in TensorFlow is probably easier than in oneDNN (namespace dnnl is in an external library.)

@agramesh1
Contributor

@penpornk thanks, we are looking at it now.

@ScottTodd
Contributor

@ScottTodd Sorry about the break and thank you for proposing the fix!
I'll leave this to @gzmkl and @agramesh1 to decide. I think making changes in TensorFlow is probably easier than in oneDNN (namespace dnnl is in an external library.)

Changes don't need to be made in the external library; simply putting

namespace dnnl {
inline std::ostream& operator<<(std::ostream& os,
                                const memory::format_tag& tag) {
  if (tag == memory::format_tag::undef) {
    os << "undef";
  } else if (tag == memory::format_tag::any) {
    os << "any";
  } else {
    os << "invalid";
  }
  return os;
}
}  // namespace dnnl

above namespace tensorflow:

#ifdef _WIN32
typedef unsigned int uint;
#endif
namespace tensorflow {

works when I build locally. (I think because memory::format_tag is used in the macros without being fully qualified as dnnl::memory::format_tag, the macros aren't compiling as expected.)

That's just one solution though. Not sure what works best for your projects.

@gzmkl
Contributor Author

gzmkl commented Mar 23, 2021

After discussing with the team and reviewing the related code in mkl_util.h, we decided that removing the line in question is the right choice.
We will submit a PR.

@gzmkl
Contributor Author

gzmkl commented Mar 23, 2021

@ScottTodd @penpornk A PR to fix the issue has been submitted here #48030.
Thanks!

Labels
cla: yes comp:mkl MKL related issues size:L CL Change Size: Large
Projects
PR Queue
  
Closed/Rejected
Development

Successfully merging this pull request may close these issues.

None yet

7 participants