
Setting OMP_DYNAMIC=true ruins training on TF-MKL #51970

Open
eli-osherovich opened this issue Sep 12, 2021 · 11 comments
Labels
comp:mkl MKL related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug

Comments


eli-osherovich commented Sep 12, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NA
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): git
  • Python version: 3.9
  • Bazel version (if compiling from source): 4.2.1
  • GCC/Compiler version (if compiling from source): 11
  • CUDA/cuDNN version: NA
  • GPU model and memory: NA

Describe the current behavior
Some background: I noticed a large (several-fold) time difference between TF-Eigen and TF-MKL, with Eigen being faster. While trying to tune MKL for better performance, I experimented with OMP environment variables. Although I was able to get much faster times with TF-MKL, I discovered that convergence is ruined (the same code no longer converges). The offending flag appears to be OMP_DYNAMIC=true.

Just to give some numbers, running with and without the variable set:

OMP_DYNAMIC=true ./demo.py

Epoch 1/100
   1020/Unknown - 37s 36ms/step - loss: nan - accuracy: 0.0928   

And without it:

./demo.py
Epoch 1/100
   1003/Unknown - 112s 111ms/step - loss: 12.1739 - accuracy: 0.9124  

Note how much faster the first run is (36 ms/step vs 111 ms/step). However, its accuracy and loss are much worse, and at some point the loss becomes NaN...

Describe the expected behavior
Convergence should not be affected by OMP flags.

Contributing

  • Do you want to contribute a PR? (yes/no): no
  • Briefly describe your candidate solution (if contributing):

Standalone code to reproduce the issue
Unfortunately, it is not easy to provide short code that reproduces the problem. I hope the developers will be able to run some internal tests.
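
For illustration only, a run of roughly this shape is what is being compared (a hypothetical stand-in, not the actual demo.py; the model, dataset, and hyperparameters are placeholders):

#!/usr/bin/env python
# demo_sketch.py -- hypothetical stand-in for demo.py: a small Keras
# classifier whose per-step time and convergence can be compared with and
# without OMP_DYNAMIC=true set in the environment before launch.
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Run once as `./demo_sketch.py` and once as `OMP_DYNAMIC=true ./demo_sketch.py`,
# then compare step time, loss, and accuracy.
model.fit(x_train, y_train, epochs=100, batch_size=64)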


eli-osherovich commented Sep 12, 2021

@vpirogov can you please have a look at this since the issue is probably with MKL.

Thanks.


eli-osherovich commented Sep 12, 2021

Some additional info (MKL_VERBOSE=1) that looks suspicious and crashes (why?!):

OMP_DYNAMIC=true MKL_VERBOSE=1 ./demo.py

MKL_VERBOSE oneMKL 2021.0 Update 3 Product build 20210617 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 3.10GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x55c50c00b6e0,1,0x55c50c00b6e0,1) 1.24ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4
OMP: Error #15: Initializing libiomp5.so, but found libomp.so already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Fatal Python error: Aborted
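
One way to see which OpenMP runtimes actually end up in the process (a Linux-only sketch; the script name is made up) is to scan /proc/self/maps after importing TensorFlow:

# check_omp_runtimes.py -- list every OpenMP runtime mapped into the process
# (Linux only). Launch it the same way as demo.py, e.g. with OMP_DYNAMIC=true,
# to see whether both libomp.so and libiomp5.so are present.
import tensorflow as tf  # importing TF loads whichever OpenMP runtime it links

omp_libs = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        fields = line.split()
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[-1]
        if any(name in path for name in ("libomp", "libiomp", "libgomp")):
            omp_libs.add(path)

print("OpenMP runtimes loaded:")
for lib in sorted(omp_libs):
    print("  " + lib)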

@eli-osherovich

Preloading Intel's libiomp5.so (via LD_PRELOAD) removes the error about two OpenMP libraries being loaded, but does not solve the main issue of poor convergence.
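
For reference, the preloaded run was along these lines (the libiomp5.so path is a placeholder for wherever Intel's runtime is installed):

LD_PRELOAD=/path/to/intel/libiomp5.so OMP_DYNAMIC=true ./demo.py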

eli-osherovich added a commit to eli-osherovich/papers_with_code that referenced this issue Sep 13, 2021
@tilakrayal tilakrayal added the comp:mkl MKL related issues label Sep 13, 2021
@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Sep 13, 2021
@vpirogov

@eli-osherovich, based on the messages it looks like your TF build has two OpenMP libraries linked; as the message indicates, this may (and usually does) lead to nasty consequences. How do you build TensorFlow?

+@agramesh1

@eli-osherovich

Thanks @vpirogov, @agramesh1

I use the standard build procedure:

bazel build --verbose_failures  --config=nogcp --config=nonccl  --config=noaws --config=nohdfs --config=mkl  -c opt --copt=-march=native --copt="-O3" -s //tensorflow/tools/pip_package:build_pip_package

Is it really two OpenMP libraries? Preloading Intel's libiomp5.so does not solve the problem. I also tried setting KMP_DUPLICATE_LIB_OK, but it does not help: the command below still produces very poor convergence.

KMP_DUPLICATE_LIB_OK=TRUE OMP_DYNAMIC=true MKL_VERBOSE=1 ./demo.py


eli-osherovich commented Sep 17, 2021

@vpirogov , @agramesh1

The issue is reproduced with Intel's TF (installed in a clean conda environment):

conda create -n test -c intel  python=3.9 tensorflow

No conflicting libraries. This is definitely a bug.

@preethivenkatesh

Hi @eli-osherovich, is it possible to share your demo.py with us so we can properly investigate the issue?

@eli-osherovich

@preethivenkatesh I can share it with specific people from Intel/Google.

@preethivenkatesh

@eli-osherovich I shared my email address a while ago. Please reach out to me via email for additional support.

@shailensobhee

@eli-osherovich To get the full picture, could you please share your demo.py script with @preethivenkatesh? We would like to create a similar environment and replicate the results. Thank you!

@eli-osherovich

Thanks, @preethivenkatesh. Unfortunately, after a long discussion, I cannot release our company's code.
