Setting default openmp settings for MKL kernels #19136

Merged
5 changes: 4 additions & 1 deletion tensorflow/core/BUILD
@@ -3277,7 +3277,10 @@ tf_cc_tests_gpu(
tf_cc_test_mkl(
name = "mkl_runtime_tests",
size = "small",
srcs = ["common_runtime/mkl_cpu_allocator_test.cc"],
srcs = [
"common_runtime/mkl_cpu_allocator_test.cc",
"common_runtime/mkl_threadpool_device_test.cc",
],
linkstatic = 1,
deps = [
":core",
@@ -0,0 +1,53 @@
/* Copyright 2018 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#ifdef INTEL_MKL

#include "tensorflow/core/common_runtime/threadpool_device.h"

#include "tensorflow/core/lib/core/status_test_util.h"
#include "tensorflow/core/platform/cpu_info.h"
#include "tensorflow/core/platform/logging.h"
#include "tensorflow/core/platform/test.h"
#include "tensorflow/core/public/session_options.h"

namespace tensorflow {

#ifdef _OPENMP
TEST(MKLThreadPoolDeviceTest, TestOmpDefaults) {
SessionOptions options;
unsetenv("OMP_NUM_THREADS");
Member:
Please get rid of anything manipulating the environment.

Contributor Author:
There are other places in TensorFlow unit tests where unsetenv is used to ensure we are checking the default paths.

Do you have an alternative that I could look into?

Member:

I can't find any thread-safe versions of these APIs. Maybe we can put locks around the calls since these are just tests?

Member:

I think it is OK in unit tests.


ThreadPoolDevice* tp = new ThreadPoolDevice(
options, "/device:CPU:0", Bytes(256), DeviceLocality(), cpu_allocator());

const int ht = port::NumHyperthreadsPerCore();
EXPECT_EQ(omp_get_max_threads(), (port::NumSchedulableCPUs() + ht - 1) / ht);
}

TEST(MKLThreadPoolDeviceTest, TestOmpPreSets) {
SessionOptions options;
setenv("OMP_NUM_THREADS", "314", 1);

ThreadPoolDevice* tp = new ThreadPoolDevice(
options, "/device:CPU:0", Bytes(256), DeviceLocality(), cpu_allocator());

EXPECT_EQ(omp_get_max_threads(), 314);
}
#endif // _OPENMP

} // namespace tensorflow

#endif // INTEL_MKL
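The thread-safety concern raised above (getenv/setenv racing across test threads) could be addressed the way the reviewer suggests, by putting locks around the calls. A minimal sketch of test-only helpers — the names `SetEnvLocked`, `UnsetEnvLocked`, and `GetEnvLocked` are hypothetical, not part of this PR:

```cpp
#include <cstdlib>
#include <mutex>
#include <string>

// Hypothetical test-only helpers: serialize every env mutation and read
// behind a single mutex so concurrent tests cannot race on the environment.
static std::mutex& EnvMutex() {
  static std::mutex mu;
  return mu;
}

void SetEnvLocked(const char* name, const char* value) {
  std::lock_guard<std::mutex> lock(EnvMutex());
  setenv(name, value, /*overwrite=*/1);  // POSIX setenv
}

void UnsetEnvLocked(const char* name) {
  std::lock_guard<std::mutex> lock(EnvMutex());
  unsetenv(name);
}

std::string GetEnvLocked(const char* name) {
  std::lock_guard<std::mutex> lock(EnvMutex());
  const char* v = std::getenv(name);
  return v ? std::string(v) : std::string();
}
```

As the reviewers note, this is only needed if tests run multi-threaded; the damage from a race here is confined to the tests themselves.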
11 changes: 8 additions & 3 deletions tensorflow/core/common_runtime/process_util.cc
@@ -16,8 +16,10 @@ limitations under the License.
#include "tensorflow/core/common_runtime/process_util.h"

#ifdef INTEL_MKL
#ifdef _OPENMP
#include <omp.h>
#endif
#endif // _OPENMP
#endif // INTEL_MKL
#include <string.h>

#include "tensorflow/core/lib/core/threadpool.h"
@@ -57,7 +59,10 @@ int32 NumInterOpThreadsFromSessionOptions(const SessionOptions& options) {
// MKL library executes ops in parallel using OMP threads
// Set inter_op conservatively to avoid thread oversubscription that could
// lead to severe perf degradations and OMP resource exhaustion
const int mkl_intra_op = omp_get_max_threads();
int mkl_intra_op = 1;
#ifdef _OPENMP
mkl_intra_op = omp_get_max_threads();
#endif // _OPENMP
CHECK_GE(mkl_intra_op, 1);
const int32 mkl_inter_op = std::max(
(port::NumSchedulableCPUs() + mkl_intra_op - 1) / mkl_intra_op, 2);
@@ -68,7 +73,7 @@ int32 NumInterOpThreadsFromSessionOptions(const SessionOptions& options) {
#else
// Default to using the number of cores available in the process.
return port::NumSchedulableCPUs();
#endif
#endif // INTEL_MKL
}

thread::ThreadPool* NewThreadPoolFromSessionOptions(
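The inter-op sizing in this hunk rounds up the CPU count divided by the OMP thread count, with a floor of 2 so MKL ops can still overlap. A standalone restatement — the function name is illustrative, not TensorFlow's:

```cpp
#include <algorithm>

// Sketch of the inter-op pool sizing above: ceil(cpus / mkl_intra_op),
// floored at 2 to avoid serializing independent MKL ops while still
// preventing thread oversubscription.
int MklInterOpThreads(int schedulable_cpus, int mkl_intra_op) {
  return std::max((schedulable_cpus + mkl_intra_op - 1) / mkl_intra_op, 2);
}
```

For example, 12 schedulable CPUs with 4 OMP threads per op yields 3 inter-op threads, while 8 CPUs fully claimed by one op still gets the floor of 2.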
25 changes: 24 additions & 1 deletion tensorflow/core/common_runtime/threadpool_device.cc
@@ -31,7 +31,11 @@ limitations under the License.
#include "tensorflow/core/public/session_options.h"

#ifdef INTEL_MKL
#ifdef _OPENMP
#include <omp.h>
#endif
#include "tensorflow/core/common_runtime/mkl_cpu_allocator.h"
#include "tensorflow/core/platform/cpu_info.h"
#endif

namespace tensorflow {
@@ -43,7 +47,26 @@ ThreadPoolDevice::ThreadPoolDevice(const SessionOptions& options,
: LocalDevice(options, Device::BuildDeviceAttributes(
name, DEVICE_CPU, memory_limit, locality)),
allocator_(allocator),
scoped_allocator_mgr_(new ScopedAllocatorMgr(name)) {}
scoped_allocator_mgr_(new ScopedAllocatorMgr(name)) {
#ifdef INTEL_MKL
#ifdef _OPENMP
const char* user_omp_threads = getenv("OMP_NUM_THREADS");
Member:

std::getenv is only thread-safe with other getenv but not setenv and unsetenv. Some other parts of TensorFlow use setenv and unsetenv, like Jayaram said. Just a note that we might want to look into this in the future.

Comment:

As long as {set,unset,put,clear}env only happen in tests, I think this is fine: even if the tests are not single-threaded, the only damage that can be done is to the tests themselves.

Member:

Oops. Sorry. For some reason I thought core/platform/posix/subprocess.cc is using setenv, but it actually just mentions setenv. Just confirmed that there aren't any {set,unset,put,clear}env calls outside of tests. That's great! :)

if (user_omp_threads == nullptr) {
// OMP_NUM_THREADS controls MKL's intra-op parallelization
// Default to available physical cores
const int mkl_intra_op = port::NumSchedulableCPUs();
const int ht = port::NumHyperthreadsPerCore();
omp_set_num_threads((mkl_intra_op + ht - 1) / ht);
Member (@penpornk, May 18, 2018):

From @jktomer: can num_logical_cores not be divisible by num_hyper_threads? In other words, can each physical core have different number of hyper-threads?

Contributor Author:

In hardware, no but the user can choose to explicitly disable some hyperthreads and not others using OS controls.

Member:

Thank you!

} else {
uint64 user_val = 0;
if (strings::safe_strtou64(user_omp_threads, &user_val)) {
// Superfluous but triggers OpenMP loading
omp_set_num_threads(user_val);
}
}
#endif // _OPENMP
#endif // INTEL_MKL
}

ThreadPoolDevice::~ThreadPoolDevice() {}

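The constructor's default, when OMP_NUM_THREADS is unset, is one OpenMP thread per physical core: the schedulable CPU count divided by hyperthreads per core, rounded up. A minimal restatement of that arithmetic — the function name is ours, not TensorFlow's:

```cpp
// Illustrative sketch of the default chosen in the constructor above:
// ceil(schedulable_cpus / hyperthreads_per_core), i.e. one OMP thread
// per physical core even when SMT is enabled.
int DefaultOmpThreads(int schedulable_cpus, int ht_per_core) {
  return (schedulable_cpus + ht_per_core - 1) / ht_per_core;
}
```

The rounding up matters when some hyperthreads have been disabled via OS controls (as discussed in the thread above), leaving a CPU count that is not divisible by the SMT width.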
23 changes: 23 additions & 0 deletions tensorflow/core/platform/cpu_info.cc
@@ -344,5 +344,28 @@ int CPUModelNum() {
#endif
}

int CPUIDNumSMT() {
#ifdef PLATFORM_IS_X86
// https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration
// https://software.intel.com/en-us/articles/intel-sdm (Vol 3A)
Member:

Adding "section Detecting Hardware Multi-threading Support and Topology" after Vol 3A would be helpful.

// Section: Detecting Hardware Multi-Threading Support and Topology
// Uses CPUID Leaf 11 to enumerate system topology on Intel x86 architectures
// Other cases not supported
uint32 eax, ebx, ecx, edx;
// Check if system supports Leaf 11
GETCPUID(eax, ebx, ecx, edx, 0, 0);
if (eax >= 11) {
// 1) Leaf 11 available? CPUID.(EAX=11, ECX=0):EBX != 0
// 2) SMT_Mask_Width = CPUID.(EAX=11, ECX=0):EAX[4:0] if CPUID.(EAX=11,
// ECX=0):ECX[15:8] is 1
GETCPUID(eax, ebx, ecx, edx, 11, 0);
if (ebx != 0 && ((ecx & 0xff00) >> 8) == 1) {
return 1 << (eax & 0x1f); // 2 ^ SMT_Mask_Width
}
}
#endif // PLATFORM_IS_X86
return 0;
}

} // namespace port
} // namespace tensorflow
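The register decoding in `CPUIDNumSMT` can be separated from the `GETCPUID` intrinsic for illustration: given the raw CPUID.(EAX=11, ECX=0) outputs, the SMT width is two raised to the mask width in EAX[4:0], valid only when EBX is nonzero and the level type in ECX[15:8] is 1 (SMT). A sketch under those assumptions — the function name is hypothetical:

```cpp
#include <cstdint>

// Pure decode step of CPUID leaf 11, level 0: returns hyperthreads per
// core, or 0 when SMT topology enumeration is unavailable or the level
// type is not SMT.
int DecodeSMTWidth(uint32_t eax, uint32_t ebx, uint32_t ecx) {
  // EBX != 0 means the leaf is populated; ECX[15:8] == 1 means this
  // level describes SMT.
  if (ebx != 0 && ((ecx & 0xff00) >> 8) == 1) {
    return 1 << (eax & 0x1f);  // 2 ^ SMT_Mask_Width
  }
  return 0;
}
```

For a typical two-way hyperthreaded core, EAX[4:0] is 1 and the function returns 2.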
7 changes: 7 additions & 0 deletions tensorflow/core/platform/cpu_info.h
@@ -35,6 +35,10 @@ namespace port {
// software can change it dynamically.
int NumSchedulableCPUs();

// Returns an estimate of the number of hyperthreads per physical core
// on the CPU.
int NumHyperthreadsPerCore();

// Mostly ISA related features that we care about
enum CPUFeature {
// Do not change numeric assignments.
@@ -107,6 +111,9 @@ int CPUModelNum();
// Returns nominal core processor cycles per second of each processor.
double NominalCPUFrequency();

// Returns the number of hyperthreads per physical core.
int CPUIDNumSMT();

} // namespace port
} // namespace tensorflow

5 changes: 5 additions & 0 deletions tensorflow/core/platform/posix/port.cc
@@ -74,6 +74,11 @@ int NumSchedulableCPUs() {
return kDefaultCores;
}

int NumHyperthreadsPerCore() {
Member (@rmlarsen, May 9, 2018):

I think we should implement this properly in tensorflow/core/platform/cpu_info.{h,cc}, i.e. query the relevant bit fields in the cpu info). Here's a stack overflow article with example code: https://stackoverflow.com/questions/2901694/programmatically-detect-number-of-physical-processors-cores-or-if-hyper-threadin/2921632#2921632

Hmmm, wait a second, you're the Intel engineer, you probably know how to do this even better :-)

Contributor Author:

I was going to go down the CPUID route but wanted to check if hwloc (https://www.open-mpi.org/projects/hwloc/, BSD license) was acceptable instead. With that we don't have to maintain the code inside TensorFlow.

If we decide to enumerate CPUID instead, this is the best source so far for Intel CPUs: https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration. Need to look into what AMD CPUs use.

Contributor Author:

@rmlarsen added code to extract SMT contexts per core on modern Intel CPUs. Can you take a look now?

Comment:

This change appears to have broken my build on Windows. I get LNK2019 on tensorflow::Port::NumHyperthreadsPerCore. Verified that the commit preceding this builds fine. (2058a24).

Comment:

Me too. This change appears to have broken my build on Windows. I get LNK2019 on tensorflow::Port::NumHyperthreadsPerCore.

static const int ht_per_core = tensorflow::port::CPUIDNumSMT();
return (ht_per_core > 0) ? ht_per_core : 1;
}

void* AlignedMalloc(size_t size, int minimum_alignment) {
#if defined(__ANDROID__)
return memalign(minimum_alignment, size);