
Setting OMP_DYNAMIC=true ruins training on TF-MKL #51970

Open
eli-osherovich opened this issue Sep 12, 2021 · 11 comments
Labels
comp:mkl MKL related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug

Comments


eli-osherovich commented Sep 12, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NA
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): git
  • Python version: 3.9
  • Bazel version (if compiling from source): 4.2.1
  • GCC/Compiler version (if compiling from source): 11
  • CUDA/cuDNN version: NA
  • GPU model and memory: NA

Describe the current behavior
Some background: I noticed a large (several-fold) time difference between TF-Eigen and TF-MKL, with Eigen being faster. While trying to tune MKL for better performance, I experimented with OMP environment variables. Although I was able to get much faster times with TF-MKL, I discovered that convergence is ruined (the same code no longer converges). The offending flag appears to be OMP_DYNAMIC=true.

Just to give some numbers, running with and without the variable set:

OMP_DYNAMIC=true ./demo.py

Epoch 1/100
   1020/Unknown - 37s 36ms/step - loss: nan - accuracy: 0.0928   

And without it:

./demo.py
Epoch 1/100
   1003/Unknown - 112s 111ms/step - loss: 12.1739 - accuracy: 0.9124  

Note how much faster the first run is (36 ms/step vs 111 ms/step). However, its accuracy and loss are much worse, and at some point the loss becomes NaN...

Describe the expected behavior
Convergence should not be affected by OMP flags.

Contributing

  • Do you want to contribute a PR? (yes/no): no
  • Briefly describe your candidate solution (if contributing):

Standalone code to reproduce the issue
Unfortunately, it is not easy to provide short code that reproduces the problem. I hope the developers will be able to run some internal tests.
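
For illustration only, a run of roughly this shape is what is being compared (a hypothetical stand-in, not the actual demo.py; the model, dataset, and hyperparameters are placeholders):

#!/usr/bin/env python
# demo_sketch.py -- hypothetical stand-in for demo.py: a small Keras
# classifier whose per-step time and convergence can be compared with and
# without OMP_DYNAMIC=true set in the environment before launch.
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Run once as `./demo_sketch.py` and once as `OMP_DYNAMIC=true ./demo_sketch.py`,
# then compare step time, loss, and accuracy.
model.fit(x_train, y_train, epochs=100, batch_size=64)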


eli-osherovich commented Sep 12, 2021

@vpirogov can you please have a look at this since the issue is probably with MKL.

Thanks.


eli-osherovich commented Sep 12, 2021

Some additional info (MKL_VERBOSE=1) that looks suspicious and crashes (why?!):

OMP_DYNAMIC=true MKL_VERBOSE=1 ./demo.py

MKL_VERBOSE oneMKL 2021.0 Update 3 Product build 20210617 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 3.10GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x55c50c00b6e0,1,0x55c50c00b6e0,1) 1.24ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
OMP: Error #15: Initializing libiomp5.so, but found libomp.so already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Fatal Python error: Aborted
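
One way to see which OpenMP runtimes actually end up in the process (a Linux-only sketch; the script name is made up) is to scan /proc/self/maps after importing TensorFlow:

# check_omp_runtimes.py -- list every OpenMP runtime mapped into the process
# (Linux only). Launch it the same way as demo.py, e.g. with OMP_DYNAMIC=true,
# to see whether both libomp.so and libiomp5.so are present.
import tensorflow as tf  # importing TF loads whichever OpenMP runtime it links

omp_libs = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        fields = line.split()
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[-1]
        if any(name in path for name in ("libomp", "libiomp", "libgomp")):
            omp_libs.add(path)

print("OpenMP runtimes loaded:")
for lib in sorted(omp_libs):
    print("  " + lib)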

@eli-osherovich

Preloading Intel's libiomp5.so (via LD_PRELOAD) removes the error about two OpenMP libraries being loaded, but does not solve the main issue of poor convergence.
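
For reference, the preloaded run was along these lines (the libiomp5.so path is a placeholder for wherever Intel's runtime is installed):

LD_PRELOAD=/path/to/intel/libiomp5.so OMP_DYNAMIC=true ./demo.py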

eli-osherovich added a commit to eli-osherovich/papers_with_code that referenced this issue Sep 13, 2021
@tilakrayal tilakrayal added the comp:mkl MKL related issues label Sep 13, 2021
@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Sep 13, 2021
@vpirogov

@eli-osherovich, based on the messages it looks like your TF build has two OpenMP libraries linked; as the message indicates, this may (and usually does) lead to nasty consequences. How do you build TensorFlow?

+@agramesh1

@eli-osherovich

Thanks @vpirogov, @agramesh1

I use the standard build procedure:

bazel build --verbose_failures  --config=nogcp --config=nonccl  --config=noaws --config=nohdfs --config=mkl  -c opt --copt=-march=native --copt="-O3" -s //tensorflow/tools/pip_package:build_pip_package

Is it really two OpenMP libraries? Preloading Intel's libiomp5.so does not solve the problem. I also tried setting KMP_DUPLICATE_LIB_OK, but it does not help: the command below still produces very poor convergence.

KMP_DUPLICATE_LIB_OK=TRUE OMP_DYNAMIC=true MKL_VERBOSE=1 ./demo.py


eli-osherovich commented Sep 17, 2021

@vpirogov , @agramesh1

The issue is reproduced with Intel's TF (installed in a clean conda environment):

conda create -n test -c intel  python=3.9 tensorflow

No conflicting libraries. This is definitely a bug.

@preethivenkatesh

Hi @eli-osherovich, is it possible to share your demo.py with us so we can properly investigate the issue?

@eli-osherovich

@preethivenkatesh I can share it with specific people from Intel/Google.

@preethivenkatesh

@eli-osherovich I shared my email address a while ago. Please reach out to me via email for additional support.

@shailensobhee

@eli-osherovich To get the full picture, could you please share your demo.py script with @preethivenkatesh? We would like to create a similar environment and replicate the results. Thank you!

@eli-osherovich

Thanks, @preethivenkatesh. Unfortunately, after a long discussion, I cannot release our company's code.
