Building with MKL reduces CPU performance #14496
Comments
I also have this problem. I built TensorFlow step by step as the official tutorial says, but got a much slower version when MKL is enabled. It is quite strange.
@vivek-rane Vivek, could you respond to this?
Will get someone to take a look - thanks for tagging.
@georgh Could you tell us what the inter/intra op settings and the model are?
I didn't set inter or intra op settings for the example, but I just tested it with both set to 20 and it still behaves the same.
Setting the inter op to 20 is a really bad idea. Try starting with an inter op of 1, an intra op of #cores, and OMP_NUM_THREADS set to #cores, and tweak from there.
Thank you for your advice. If that's a bad idea, why does the performance guide state:
Is this only valid if you set it to the number of physical cores? But anyway, regarding the error: with MKL, the options do not show any effect.
@tfboyd do we recommend running with intra=num_cores for MKL? That would give pretty bad perf, since TF will run num_cores number of ops in parallel, each with OMP_NUM_THREADS threads (heavy oversubscription).
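The oversubscription described above can be sketched with back-of-the-envelope arithmetic. A minimal illustration (the 28-core machine is an assumption for illustration, not from the thread):

```python
# Sketch of the thread-oversubscription concern: if TensorFlow schedules
# many ops concurrently and each MKL op spins up its own OpenMP team,
# the total thread count can far exceed the core count.
physical_cores = 28  # hypothetical machine; adjust for your box

concurrent_ops = physical_cores      # ops TF may run in parallel
omp_threads_per_op = physical_cores  # OMP_NUM_THREADS per MKL op

worst_case_threads = concurrent_ops * omp_threads_per_op
oversubscription = worst_case_threads // physical_cores

print(worst_case_threads)  # 784 threads contending for 28 cores
print(oversubscription)    # ~28x oversubscription
```

This is why the recommendation above is to start from inter op = 1: only one MKL op's OpenMP team runs at a time, keeping the active thread count near the core count.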
@georgh can you share the timelines for your run with and without MKL? It is hard to figure out what the problem is without topology info.
I created two timelines for a very small run, but I am not sure if they contain useful information.
In my testing I had the best results with the following settings, cut and pasted from the document <https://www.tensorflow.org/performance/performance_guide#optimizing_for_cpu> I wrote. It can be different for different models, but for ResNet and Inception training this worked well, and even for inference. Intel mentioned that slightly higher levels of inter_op might work, but my guideline of # of "sockets" was reasonable. If someone gets data for other models I am happy to try and find a way to share the info widely. There are models and hardware platforms that benefit from different settings. Each variable that impacts performance is discussed below.
- *KMP_BLOCKTIME*: The MKL default is 200ms, which was not optimal in our testing. 0 (0ms) was a good default for the CNN-based models that were tested. The best performance for AlexNet was achieved at 30ms, and both GoogleNet and VGG11 performed best set at 1ms.
- *KMP_AFFINITY*: The recommended setting is granularity=fine,verbose,compact,1,0.
- *OMP_NUM_THREADS*: This defaults to the number of physical cores. Adjusting this parameter beyond matching the number of cores can have an impact when using Intel® Xeon Phi™ (Knights Landing) for some models. See TensorFlow* Optimizations on Modern Intel® Architecture <https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture> for optimal settings.
- *intra_op_parallelism_threads*: Setting this equal to the number of *physical cores* is recommended. Setting the value to 0, which is the default and results in the value being set to the number of logical cores, is an option to try for some architectures. This value and OMP_NUM_THREADS should be equal.
- *inter_op_parallelism_threads*: Setting this equal to the number of sockets is recommended. Setting the value to 0, which is the default, results in the value being set to the number of logical cores.
I agree with Toby that the inter-op should be set to somewhere around the number of sockets on the system, which is typically 1. I looked at the timeline and there are no ops in there that are currently supported with MKL on TensorFlow. If this timeline is representative of your workload, you will not gain any benefit from switching to MKL. The top 5 ops here are listed below, and matmul is probably the only thing that stands to gain from MKL. If this timeline is off, we need to find a way of getting the correct timeline.
Btw, setting the KMP_BLOCKTIME to 0 (as Toby suggested) also helps with oversubscription.
Nagging Awaiting TensorFlower: It has been 14 days with no activity and the stat:awaiting tensorflower label was applied.
A member of the TensorFlow organization has replied after the stat:awaiting tensorflower label was applied.
@tatianashp any idea?
@vivek-rane explained why using MKL does not help with performance for this network. I am closing the issue. @georgh If you have more questions related to MKL performance, please re-open.
System information
Describe the problem
Building TensorFlow with MKL (--config=mkl) prevents the system from using all its cores.
CPU load always remains below 20% in my test case. Using the same build flags but without MKL achieves 100% CPU load and nearly 10 times faster execution.
While playing with the MKL flags described here https://www.tensorflow.org/performance/performance_guide#optimizing_for_cpu
I noticed some strange behavior:
Running the MKL build with
OMP_NUM_THREADS=27 KMP_SETTINGS=1 KMP_AFFINITY=verbose
results in the following print:
If I use the same execution flags with a build without MKL, or with the pip version, I get the same output up to
... OMP: Info #247: KMP_AFFINITY: pid 36958 tid 37191 thread 27 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
Afterwards no OMP prints are created. It seems that if I build with MKL, TensorFlow continues to create more and more threads but can't utilize them.
Is this a configuration issue or a bug?
If it's a known issue, please expand the performance guide :)
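One low-friction way to experiment with these flags is a small launcher that sets the environment for a child process instead of editing the training script. This is a hypothetical sketch (the `train.py` target and the thread count are placeholders, not from this issue):

```python
import os
import subprocess
import sys

def run_with_mkl_env(cmd, omp_threads, kmp_blocktime="0"):
    """Run `cmd` as a subprocess with the MKL/OpenMP tuning variables set.

    The variables mirror the repro above plus the performance-guide
    recommendations; KMP_SETTINGS=1 makes OpenMP print the effective
    settings at startup, which is useful for comparing builds.
    """
    env = dict(os.environ)
    env.update({
        "OMP_NUM_THREADS": str(omp_threads),
        "KMP_BLOCKTIME": kmp_blocktime,
        "KMP_SETTINGS": "1",
        "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
    })
    return subprocess.run(cmd, env=env)

# Placeholder usage (train.py is a stand-in for your own script):
# run_with_mkl_env([sys.executable, "train.py"], omp_threads=27)
```

Sweeping `omp_threads` and `kmp_blocktime` this way makes it easy to compare the MKL and non-MKL builds under identical settings.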
pinging @skye because of their help with the performance issue with while_loop