Building with MKL reduces CPU performance #14496

Closed
georgh opened this issue Nov 12, 2017 · 19 comments

georgh commented Nov 12, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): both
  • TensorFlow version (use command below): 1.4
  • Python version: 3.6
  • Bazel version (if compiling from source): 0.5.4
  • GCC/Compiler version (if compiling from source): 5.4.0
  • CUDA/cuDNN version: 8
  • GPU model and memory: Tesla P100-PCIE-16GB
  • Exact command to reproduce:

Describe the problem

Building TensorFlow with MKL (--config=mkl) prevents the system from using all its cores.
CPU load stays below 20% in my test case. Using the same build flags but without MKL achieves 100% CPU load and nearly 10x faster execution.

While playing with the MKL flags described at https://www.tensorflow.org/performance/performance_guide#optimizing_for_cpu, I noticed some strange behavior:

Running the MKL build with
OMP_NUM_THREADS=27 KMP_SETTINGS=1 KMP_AFFINITY=verbose
results in the following output:

User settings:

   KMP_AFFINITY=verbose
   KMP_SETTINGS=1
   OMP_NUM_THREADS=27

Effective settings:

   KMP_ABORT_DELAY=0
   KMP_ADAPTIVE_LOCK_PROPS='1,1024'
   KMP_ALIGN_ALLOC=64
   KMP_ALL_THREADPRIVATE=224
   KMP_ATOMIC_MODE=2
   KMP_BLOCKTIME=200
   KMP_CPUINFO_FILE: value is not defined
   KMP_DETERMINISTIC_REDUCTION=false
   KMP_DEVICE_THREAD_LIMIT=2147483647
   KMP_DISP_NUM_BUFFERS=7
   KMP_DUPLICATE_LIB_OK=false
   KMP_FORCE_REDUCTION: value is not defined
   KMP_FOREIGN_THREADS_THREADPRIVATE=true
   KMP_FORKJOIN_BARRIER='2,2'
   KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
   KMP_FORKJOIN_FRAMES=true
   KMP_FORKJOIN_FRAMES_MODE=3
   KMP_GTID_MODE=3
   KMP_HANDLE_SIGNALS=false
   KMP_HOT_TEAMS_MAX_LEVEL=1
   KMP_HOT_TEAMS_MODE=0
   KMP_INIT_AT_FORK=true
   KMP_INIT_WAIT=2048
   KMP_ITT_PREPARE_DELAY=0
   KMP_LIBRARY=throughput
   KMP_LOCK_KIND=queuing
   KMP_MALLOC_POOL_INCR=1M
   KMP_NEXT_WAIT=1024
   KMP_NUM_LOCKS_IN_BLOCK=1
   KMP_PLAIN_BARRIER='2,2'
   KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
   KMP_REDUCTION_BARRIER='1,1'
   KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
   KMP_SCHEDULE='static,balanced;guided,iterative'
   KMP_SETTINGS=true
   KMP_SPIN_BACKOFF_PARAMS='4096,100'
   KMP_STACKOFFSET=64
   KMP_STACKPAD=0
   KMP_STACKSIZE=4M
   KMP_STORAGE_MAP=false
   KMP_TASKING=2
   KMP_TASKLOOP_MIN_TASKS=0
   KMP_TASK_STEALING_CONSTRAINT=1
   KMP_TEAMS_THREAD_LIMIT=56
   KMP_TOPOLOGY_METHOD=all
   KMP_USER_LEVEL_MWAIT=false
   KMP_VERSION=false
   KMP_WARNINGS=true
   OMP_CANCELLATION=false
   OMP_DEFAULT_DEVICE=0
   OMP_DISPLAY_ENV=false
   OMP_DYNAMIC=false
   OMP_MAX_ACTIVE_LEVELS=2147483647
   OMP_MAX_TASK_PRIORITY=0
   OMP_NESTED=false
   OMP_NUM_THREADS='27'
   OMP_PLACES: value is not defined
   OMP_PROC_BIND='false'
   OMP_SCHEDULE='static'
   OMP_STACKSIZE=4M
   OMP_THREAD_LIMIT=2147483647
   OMP_WAIT_POLICY=PASSIVE
   KMP_AFFINITY='verbose,warnings,respect,granularity=core,none'

OMP: Info #209: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #207: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #156: KMP_AFFINITY: 56 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 14 cores/pkg x 2 threads/core (28 total cores)
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35708 thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35706 thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35707 thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35704 thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35701 thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35711 thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35712 thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35713 thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35705 thread 8 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35710 thread 9 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35702 thread 10 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35717 thread 11 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
...
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35678 thread 78 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35672 thread 79 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35674 thread 80 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
....
(continues with a higher thread number each time)

If I use the same execution flags with a build without MKL, or with the pip version, I get the same output up to
... OMP: Info #247: KMP_AFFINITY: pid 36958 tid 37191 thread 27 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
Afterwards, no further OMP messages are printed. It seems that if I build with MKL, TensorFlow keeps creating more and more threads but can't utilize them.

Is this a configuration issue or a bug?
If it's a known issue, please expand the performance guide :)

Pinging @skye because of their help with the while_loop performance issue.


qmick commented Nov 13, 2017

I also have this problem. I built TensorFlow step by step as the official tutorial describes, but got a much slower version when MKL is enabled. It is quite strange.

@rohan100jain added the stat:awaiting tensorflower and type:bug labels and removed the type:support label on Nov 14, 2017
@rohan100jain (Member)

@vivek-rane Vivek, could you respond to this?

@vivek-rane

Will get someone to take a look - thanks for tagging.

@vivek-rane

@georgh Could you tell us what the inter/intra op settings and the model are?


georgh commented Nov 15, 2017

I didn't set inter or intra op settings for the example, but I just tested with both set to 20 and it still behaves the same.
The model I used is quite big; the main part uses two big tf.while_loops with a few operations.
I will try to provide a simple test case tomorrow; a rough sketch of the workload shape is below.
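
A hypothetical minimal sketch of that workload shape (the names and sizes are made up for illustration, not the actual model):

```python
import tensorflow as tf

# Hypothetical stand-in for the model described above: a big tf.while_loop
# whose body runs a few small ops (MatMul, Mul, Add) per iteration.
w = tf.random_normal([256, 256])

def cond(i, acc):
    return i < 1000

def body(i, acc):
    # One MatMul, one Mul and one Add per iteration.
    return i + 1, tf.matmul(acc, w) * 0.5 + 1.0

_, result = tf.while_loop(cond, body, [tf.constant(0), tf.random_normal([256, 256])])

with tf.Session() as sess:
    sess.run(result)
```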

@vivek-rane

Setting the inter op to 20 is a really bad idea. Try starting with an inter op of 1, an intra op of #cores, and OMP_NUM_THREADS set to #cores, and tweak from there.
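
A minimal sketch of that starting point in TF 1.x (assuming 28 physical cores, as on the machine above; adjust for your system):

```python
import os

# Must be set before TensorFlow/MKL initializes its OpenMP thread pool.
num_cores = 28  # physical cores (2 sockets x 14 on the machine above)
os.environ["OMP_NUM_THREADS"] = str(num_cores)

import tensorflow as tf

config = tf.ConfigProto(
    inter_op_parallelism_threads=1,          # run one op at a time
    intra_op_parallelism_threads=num_cores,  # let each op use all the cores
)
sess = tf.Session(config=config)
```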


georgh commented Nov 15, 2017

Thank you for your advice. If that's a bad idea, why does the performance guide state:

A common alternative optimization is to set the number of threads in both pools equal to the number of physical cores rather than logical cores.

Is this only valid if you set it to the number of physical cores?
I haven't done many tests with these values, because I can't always use the complete cluster for my computations. Setting both values to 20 restricts TensorFlow to 20 logical cores and therefore leaves enough room for other people to use the system.

But anyway, regarding the MKL problem, these options do not show any effect.
Built with MKL and inter/intra set to 20, I only get a CPU usage of around 150% (1.5 cores); with a build without MKL I achieve 2000% (20 cores).

@vivek-rane

@tfboyd Do we recommend running with intra=num_cores for MKL? That would give pretty bad perf, since TF will run num_cores ops in parallel, each with OMP_NUM_THREADS threads (heavy oversubscription; e.g. with 28 cores and OMP_NUM_THREADS=28, that is up to 784 OpenMP threads competing for 28 cores).


vivek-rane commented Nov 15, 2017

@georgh Can you share the timelines for your run with and without MKL? It is hard to figure out what the problem is without topology info.
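
For reference, a Chrome-trace timeline can be captured in TF 1.x roughly like this (a sketch with a tiny stand-in graph; swap in the real model):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Tiny stand-in graph; replace with the real model.
a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
c = tf.matmul(a, b)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(c, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace; open chrome://tracing and load timeline.json.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())
```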


georgh commented Nov 16, 2017

I created two timelines for a very small run, but I am not sure whether they contain useful information.
I tried to create a timeline for a bigger run, but the resulting file is 1.3 GB and Chrome does not show any information when I try to open it.

timelines.zip

tfboyd commented Nov 16, 2017 via email

@vivek-rane

I agree with Toby that the inter-op should be set to somewhere around the number of sockets on the system, which is typically 1.

I looked at the timeline and there are no ops in there that are currently supported with MKL on TensorFlow. If this timeline is representative of your workload, you will not gain any benefit from switching to MKL. The top 5 ops are listed below, and MatMul is probably the only one that stands to gain from MKL. If this timeline is off, we need to find a way of getting a correct one.

  • Pack
  • Mul
  • StridedSlice
  • MatMul
  • Add

@vivek-rane

Btw, setting KMP_BLOCKTIME to 0 (as Toby suggested) also helps with oversubscription.
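
In Python that could look like this (a sketch; the affinity string is the one suggested in the performance guide linked above):

```python
import os

# Set before importing TensorFlow so MKL's OpenMP runtime picks these up.
os.environ["KMP_BLOCKTIME"] = "0"  # threads sleep right after a parallel region
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

import tensorflow as tf  # noqa: E402
```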

@tensorflowbutler (Member)

It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

@tensorflowbutler (Member)

Nagging Awaiting TensorFlower: It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

@tensorflowbutler removed the stat:awaiting tensorflower label on Jan 23, 2018
@tensorflowbutler (Member)

A member of the TensorFlow organization has replied after the stat:awaiting tensorflower label was applied.


drpngx commented Feb 3, 2018

@tatianashp any idea?

@drpngx added the stat:awaiting tensorflower label on Feb 3, 2018
@tatianashp (Member)

@vivek-rane explained why using MKL does not help with performance for this network. I am closing the issue.

@georgh If you have more questions related to MKL performance please re-open.

@tatianashp removed the stat:awaiting tensorflower label on Mar 29, 2018