Setting OMP_DYNAMIC=true ruins training on TF-MKL #51970
@vpirogov can you please have a look at this, since the issue is probably with MKL? Thanks.
Some additional info (with MKL_VERBOSE=1) that looks suspicious; the run also crashes (why?!).
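A minimal sketch of how such verbose output can be captured from Python (an illustration, not the original snippet; it assumes MKL_VERBOSE is picked up when the library loads, so it is set before the TensorFlow import):

```python
# Illustration: capture MKL's verbose log from Python (not the original snippet).
# MKL_VERBOSE must be in the environment before the library is loaded.
import os

os.environ["MKL_VERBOSE"] = "1"  # must precede the TensorFlow import

import tensorflow as tf  # noqa: E402

# An op that dispatches to MKL should now print verbose lines to stdout.
a = tf.random.uniform((1024, 1024))
b = tf.random.uniform((1024, 1024))
c = tf.linalg.matmul(a, b)
print(c.shape)
```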
Preloading Intel's OpenMP library […]
It seems to be a bug in MKL (see tensorflow/tensorflow#51970)
@eli-osherovich, based on the messages it looks like your TF build has two OpenMP libraries linked; as the message indicates, this may (and usually does) lead to nasty consequences. How do you build TensorFlow?
Thanks @vpirogov. @agramesh1, I use the standard build procedure from the TensorFlow documentation.
Is it really two OpenMP libraries? Since preloading Intel's OpenMP library […]
The issue is also reproduced with Intel's TF build installed in a clean conda environment. No conflicting libraries are loaded. This is definitely a bug.
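For reference, one way to check which OpenMP runtimes a process actually has mapped (a Linux-only sketch; not the exact check used here):

```python
# Sketch: list OpenMP runtimes mapped into the current process (Linux only).
# More than one distinct runtime (e.g. both libiomp5 and libgomp) is the
# "two OpenMP libraries" situation to look for.
import os

import tensorflow as tf  # load TF first so its shared libraries are mapped

omp_libs = set()
with open("/proc/self/maps") as maps:
    for line in maps:
        fields = line.split()
        # Lines with a pathname have 6 fields; match "omp" in the file name.
        if len(fields) >= 6 and "omp" in os.path.basename(fields[-1]):
            omp_libs.add(fields[-1])

print("\n".join(sorted(omp_libs)) or "no OpenMP runtime found")
```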
Hi @eli-osherovich, is it possible to share your […]?
@preethivenkatesh I can share it with specific people from Intel/Google.
@eli-osherovich I shared my email address a while ago. Please reach out to me via email for additional support.
@eli-osherovich To also get a full picture, could you please share your […] with @preethivenkatesh?
Thanks, @preethivenkatesh. Unfortunately, after a long discussion, I cannot release our company's code.
System information
Describe the current behavior
Some background: I noticed a big (several-fold) time difference between TF-Eigen and TF-MKL, with Eigen being faster. While trying to tune MKL for better performance, I experimented with OMP environment variables. I was able to get much faster times with TF-MKL, but I discovered that convergence is ruined (the same code no longer converges). The offending flag seems to be
OMP_DYNAMIC=true
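A minimal sketch of how the flag can be toggled from Python (an illustration, not the original harness; OMP_DYNAMIC is read when the OpenMP runtime initializes, so it must be in the environment before TensorFlow is imported):

```python
# Illustration: OMP_DYNAMIC is an OpenMP ICV read from the environment when
# the runtime starts, so set it before TensorFlow (and thus its OpenMP
# runtime) loads.
import os

os.environ["OMP_DYNAMIC"] = "true"   # flip to "false" for the baseline run

import tensorflow as tf  # noqa: E402  (import must come after the setenv)

print("OMP_DYNAMIC =", os.environ["OMP_DYNAMIC"])
# ... build and train the model as usual; only the env var differs between runs.
```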
Just to give some numbers from runs with and without the variable set: the run with OMP_DYNAMIC=true is much faster (37 ms vs. 111 ms), but its accuracy and loss are much worse. At some point, the loss becomes NaN...
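One way to catch the divergence automatically is Keras's stock TerminateOnNaN callback; a sketch (the timing helper is illustrative, and the model/data are whatever the training script already uses):

```python
# Sketch: stop training as soon as the loss goes NaN, using a stock Keras
# callback, and log per-epoch timings for the with/without comparison.
import time

import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Record wall-clock time per epoch (illustrative helper)."""
    def on_epoch_begin(self, epoch, logs=None):
        self._t0 = time.perf_counter()
    def on_epoch_end(self, epoch, logs=None):
        print(f"epoch {epoch}: {time.perf_counter() - self._t0:.2f}s")

callbacks = [
    tf.keras.callbacks.TerminateOnNaN(),  # abort on NaN/inf loss
    EpochTimer(),
]
# model.fit(x_train, y_train, epochs=10, callbacks=callbacks)
```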
Describe the expected behavior
Convergence should not be affected by OMP flags.
Contributing
Standalone code to reproduce the issue
Unfortunately, it is not easy to provide a short snippet that reproduces the problem. I hope the developers will be able to run some internal tests.
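For illustration only, a hypothetical shape such a repro could take; the tiny model, synthetic data, and hyperparameters below are stand-ins, not the private training code:

```python
# Hypothetical repro sketch: train the same tiny model twice, once with
# OMP_DYNAMIC=true and once without, and compare losses. The model and data
# are placeholders; the real workload is private.
import os
import sys

os.environ["OMP_DYNAMIC"] = sys.argv[1] if len(sys.argv) > 1 else "false"

import numpy as np
import tensorflow as tf  # noqa: E402

rng = np.random.default_rng(0)
x = rng.standard_normal((4096, 64)).astype("float32")
y = (x.sum(axis=1) > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, batch_size=64, epochs=5, verbose=2)
# Run twice:  python repro.py true   vs.   python repro.py false
```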