Building with MKL reduces CPU performance #14496

Closed
georgh opened this issue Nov 12, 2017 · 19 comments

georgh commented Nov 12, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): both
  • TensorFlow version (use command below): 1.4
  • Python version: 3.6
  • Bazel version (if compiling from source): 0.5.4
  • GCC/Compiler version (if compiling from source): 5.4.0
  • CUDA/cuDNN version: 8
  • GPU model and memory: Tesla P100-PCIE-16GB
  • Exact command to reproduce:

Describe the problem

Building TensorFlow with MKL (--config=mkl) prevents the system from using all its cores.
CPU load stays below 20% in my test case. Using the same build flags but without MKL achieves 100% CPU load and nearly 10x faster execution.

While playing with the MKL flags described at https://www.tensorflow.org/performance/performance_guide#optimizing_for_cpu, I noticed some strange behavior:

Running the MKL build with
OMP_NUM_THREADS=27 KMP_SETTINGS=1 KMP_AFFINITY=verbose
results in the following output:

User settings:

   KMP_AFFINITY=verbose
   KMP_SETTINGS=1
   OMP_NUM_THREADS=27

Effective settings:

   KMP_ABORT_DELAY=0
   KMP_ADAPTIVE_LOCK_PROPS='1,1024'
   KMP_ALIGN_ALLOC=64
   KMP_ALL_THREADPRIVATE=224
   KMP_ATOMIC_MODE=2
   KMP_BLOCKTIME=200
   KMP_CPUINFO_FILE: value is not defined
   KMP_DETERMINISTIC_REDUCTION=false
   KMP_DEVICE_THREAD_LIMIT=2147483647
   KMP_DISP_NUM_BUFFERS=7
   KMP_DUPLICATE_LIB_OK=false
   KMP_FORCE_REDUCTION: value is not defined
   KMP_FOREIGN_THREADS_THREADPRIVATE=true
   KMP_FORKJOIN_BARRIER='2,2'
   KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
   KMP_FORKJOIN_FRAMES=true
   KMP_FORKJOIN_FRAMES_MODE=3
   KMP_GTID_MODE=3
   KMP_HANDLE_SIGNALS=false
   KMP_HOT_TEAMS_MAX_LEVEL=1
   KMP_HOT_TEAMS_MODE=0
   KMP_INIT_AT_FORK=true
   KMP_INIT_WAIT=2048
   KMP_ITT_PREPARE_DELAY=0
   KMP_LIBRARY=throughput
   KMP_LOCK_KIND=queuing
   KMP_MALLOC_POOL_INCR=1M
   KMP_NEXT_WAIT=1024
   KMP_NUM_LOCKS_IN_BLOCK=1
   KMP_PLAIN_BARRIER='2,2'
   KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
   KMP_REDUCTION_BARRIER='1,1'
   KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
   KMP_SCHEDULE='static,balanced;guided,iterative'
   KMP_SETTINGS=true
   KMP_SPIN_BACKOFF_PARAMS='4096,100'
   KMP_STACKOFFSET=64
   KMP_STACKPAD=0
   KMP_STACKSIZE=4M
   KMP_STORAGE_MAP=false
   KMP_TASKING=2
   KMP_TASKLOOP_MIN_TASKS=0
   KMP_TASK_STEALING_CONSTRAINT=1
   KMP_TEAMS_THREAD_LIMIT=56
   KMP_TOPOLOGY_METHOD=all
   KMP_USER_LEVEL_MWAIT=false
   KMP_VERSION=false
   KMP_WARNINGS=true
   OMP_CANCELLATION=false
   OMP_DEFAULT_DEVICE=0
   OMP_DISPLAY_ENV=false
   OMP_DYNAMIC=false
   OMP_MAX_ACTIVE_LEVELS=2147483647
   OMP_MAX_TASK_PRIORITY=0
   OMP_NESTED=false
   OMP_NUM_THREADS='27'
   OMP_PLACES: value is not defined
   OMP_PROC_BIND='false'
   OMP_SCHEDULE='static'
   OMP_STACKSIZE=4M
   OMP_THREAD_LIMIT=2147483647
   OMP_WAIT_POLICY=PASSIVE
   KMP_AFFINITY='verbose,warnings,respect,granularity=core,none'

OMP: Info #209: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #207: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #156: KMP_AFFINITY: 56 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 14 cores/pkg x 2 threads/core (28 total cores)
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35708 thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35706 thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35707 thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35704 thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35701 thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35711 thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35712 thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35713 thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35705 thread 8 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35710 thread 9 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35702 thread 10 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35717 thread 11 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
...
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35678 thread 78 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35672 thread 79 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
OMP: Info #247: KMP_AFFINITY: pid 35537 tid 35674 thread 80 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
....
(continues with a higher thread number each time)

If I use the same execution flags with a build without MKL, or with the pip version, I get the same output up to
... OMP: Info #247: KMP_AFFINITY: pid 36958 tid 37191 thread 27 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55}
Afterwards, no further OMP messages are printed. It seems that if I build with MKL, TensorFlow keeps creating more and more threads but can't utilize them.

Is this a configuration issue or a bug?
If it's a known issue, please expand the performance guide :)

Pinging @skye because of their help with the while_loop performance issue.


qmick commented Nov 13, 2017

I also have this problem. I built TensorFlow step by step as the official tutorial describes, but got a much slower version when MKL is enabled. It is quite strange.

@rohan100jain added the stat:awaiting tensorflower and type:bug labels and removed the type:support label on Nov 14, 2017
@rohan100jain (Member)

@vivek-rane Vivek, could you respond to this?

@vivek-rane

Will get someone to take a look - thanks for tagging.

@vivek-rane

@georgh Could you tell us what the inter/intra op settings and the model are?


georgh commented Nov 15, 2017

I didn't set inter or intra op settings for the example, but I just tested with both set to 20 and it still behaves the same.
The model I used is quite big; the main part uses two big tf.while_loops with a few operations.
I will try to provide a simple test case tomorrow; a rough sketch of the workload shape is below.
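
A hypothetical minimal sketch of that workload shape (the names and sizes are made up for illustration, not the actual model):

```python
import tensorflow as tf

# Hypothetical stand-in for the model described above: a big tf.while_loop
# whose body runs a few small ops (MatMul, Mul, Add) per iteration.
w = tf.random_normal([256, 256])

def cond(i, acc):
    return i < 1000

def body(i, acc):
    # One MatMul, one Mul and one Add per iteration.
    return i + 1, tf.matmul(acc, w) * 0.5 + 1.0

_, result = tf.while_loop(cond, body, [tf.constant(0), tf.random_normal([256, 256])])

with tf.Session() as sess:
    sess.run(result)
```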

@vivek-rane

Setting the inter op to 20 is a really bad idea. Try starting with an inter op of 1, an intra op of #cores, and OMP_NUM_THREADS set to #cores, and tweak from there.
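
A minimal sketch of that starting point in TF 1.x (assuming 28 physical cores, as on the machine above; adjust for your system):

```python
import os

# Must be set before TensorFlow/MKL initializes its OpenMP thread pool.
num_cores = 28  # physical cores (2 sockets x 14 on the machine above)
os.environ["OMP_NUM_THREADS"] = str(num_cores)

import tensorflow as tf

config = tf.ConfigProto(
    inter_op_parallelism_threads=1,          # run one op at a time
    intra_op_parallelism_threads=num_cores,  # let each op use all the cores
)
sess = tf.Session(config=config)
```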


georgh commented Nov 15, 2017

Thank you for your advice. If that's a bad idea, why does the performance guide state:

A common alternative optimization is to set the number of threads in both pools equal to the number of physical cores rather than logical cores.

Is this only valid if you set it to the number of physical cores?
I haven't done many tests with these values, because I can't always use the complete cluster for my computations. Setting both values to 20 restricts TensorFlow to 20 logical cores and therefore leaves enough room for other people to use the system.

But anyway, regarding the MKL problem, these options do not show any effect.
Built with MKL and inter/intra set to 20, I only get a CPU usage of around 150% (1.5 cores); with a build without MKL I achieve 2000% (20 cores).

@vivek-rane

@tfboyd Do we recommend running with intra=num_cores for MKL? That would give pretty bad perf, since TF will run num_cores ops in parallel, each with OMP_NUM_THREADS threads (heavy oversubscription; e.g. with 28 cores and OMP_NUM_THREADS=28, that is up to 784 OpenMP threads competing for 28 cores).


vivek-rane commented Nov 15, 2017

@georgh Can you share the timelines for your run with and without MKL? It is hard to figure out what the problem is without topology info.
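
For reference, a Chrome-trace timeline can be captured in TF 1.x roughly like this (a sketch with a tiny stand-in graph; swap in the real model):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Tiny stand-in graph; replace with the real model.
a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
c = tf.matmul(a, b)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(c, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace; open chrome://tracing and load timeline.json.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())
```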


georgh commented Nov 16, 2017

I created two timelines for a very small run, but I am not sure whether they contain useful information.
I tried to create a timeline for a bigger run, but the resulting file is 1.3 GB and Chrome does not show any information when I try to open it.

timelines.zip

tfboyd commented Nov 16, 2017 via email

@vivek-rane

I agree with Toby that the inter-op should be set to somewhere around the number of sockets on the system, which is typically 1.

I looked at the timeline and there are no ops in there that are currently supported with MKL on TensorFlow. If this timeline is representative of your workload, you will not gain any benefit from switching to MKL. The top 5 ops are listed below, and MatMul is probably the only one that stands to gain from MKL. If this timeline is off, we need to find a way of getting a correct one.

  • Pack
  • Mul
  • StridedSlice
  • MatMul
  • Add

@vivek-rane

Btw, setting KMP_BLOCKTIME to 0 (as Toby suggested) also helps with oversubscription.
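
In Python that could look like this (a sketch; the affinity string is the one suggested in the performance guide linked above):

```python
import os

# Set before importing TensorFlow so MKL's OpenMP runtime picks these up.
os.environ["KMP_BLOCKTIME"] = "0"  # threads sleep right after a parallel region
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

import tensorflow as tf  # noqa: E402
```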

@tensorflowbutler (Member)

It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

@tensorflowbutler (Member)

Nagging Awaiting TensorFlower: It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

@tensorflowbutler removed the stat:awaiting tensorflower label on Jan 23, 2018
@tensorflowbutler (Member)

A member of the TensorFlow organization has replied after the stat:awaiting tensorflower label was applied.


drpngx commented Feb 3, 2018

@tatianashp any idea?

@drpngx added the stat:awaiting tensorflower label on Feb 3, 2018
@tatianashp (Member)

@vivek-rane explained why using MKL does not help with performance for this network. I am closing the issue.

@georgh If you have more questions related to MKL performance please re-open.

@tatianashp removed the stat:awaiting tensorflower label on Mar 29, 2018