
The performance is worse after turning on mkldnn #56697

Open
baoachun opened this issue Jul 7, 2022 · 5 comments
Labels: comp:mkl (MKL related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.4 (for issues related to TF 2.4), type:performance (Performance Issue)

Comments

baoachun commented Jul 7, 2022

Issue Type: Performance
Source: source (built from source)
Tensorflow Version: 2.4.1
Custom Code: No
OS Platform and Distribution: CentOS 7
Mobile device: No response
Python version: 3.8.6
Bazel version: 3.1.0
GCC/Compiler version: 10.2
CUDA/cuDNN version: N/A
GPU model and memory: N/A

Current Behaviour?

The performance is worse after turning on mkldnn.
compile command without mkldnn:
bazel build --config=nogcp   --config=nohdfs  --config=noaws --config=nonccl -c opt --copt=-march=native //tensorflow:libtensorflow_cc.so 

compile command with mkldnn:
bazel build --config=mkl --config=nogcp   --config=nohdfs  --config=noaws --config=nonccl -c opt --copt=-march=native  //tensorflow:libtensorflow_cc.so 

performance with mkldnn:
Number of nodes executed: 2628
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	         _MklFusedMatMul	       33	    56.488	    83.143%	    83.143%	    55.356	       33
	            StridedSlice	     1449	     4.799	     7.063%	    90.206%	    66.120	     1449
	            _MklConcatV2	       14	     2.766	     4.071%	    94.277%	   118.188	       14
	                   Const	      737	     1.379	     2.030%	    96.307%	     0.000	      737
	                    NoOp	        1	     0.934	     1.375%	    97.682%	     0.000	        1
	                    _Arg	      303	     0.422	     0.621%	    98.303%	     0.000	      303
	                    Pack	        1	     0.216	     0.318%	    98.621%	     1.080	        1
	             _MklSoftmax	        5	     0.196	     0.288%	    98.909%	     4.724	        5
	                _MklToTf	       31	     0.169	     0.249%	    99.158%	     0.000	       31
	             _MklReshape	        6	     0.136	     0.200%	    99.358%	     5.472	        6
	     _MklInputConversion	        5	     0.105	     0.155%	    99.513%	     9.120	        5
	                     Sum	        5	     0.103	     0.152%	    99.664%	     5.120	        5
	              ExpandDims	       24	     0.097	     0.143%	    99.807%	     0.000	       24
	                 _MklMul	        5	     0.096	     0.141%	    99.948%	    41.984	        5
	                    Sqrt	        2	     0.010	     0.015%	    99.963%	     0.008	        2
	                 Sigmoid	        2	     0.009	     0.013%	    99.976%	     0.000	        2
	               IdentityN	        1	     0.007	     0.010%	    99.987%	     0.000	        1
	                     Mul	        1	     0.005	     0.007%	    99.994%	     0.000	        1
	                 _Retval	        3	     0.004	     0.006%	   100.000%	     0.000	        3

Timings (microseconds): count=1 curr=67941
Memory (bytes): count=1 curr=307172
2628 nodes observed

performance without mkldnn:
Number of nodes executed: 1905
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	            StridedSlice	     1449	     8.448	    48.678%	    48.678%	    66.120	     1449
	            _FusedMatMul	       33	     5.554	    32.002%	    80.680%	    25.260	       33
	                ConcatV2	       14	     1.302	     7.502%	    88.182%	   105.420	       14
	                    _Arg	      303	     0.737	     4.247%	    92.429%	     0.000	      303
	                   Const	       50	     0.370	     2.132%	    94.561%	     0.000	       50
	                    Pack	        1	     0.238	     1.371%	    95.932%	     1.080	        1
	                 Reshape	        6	     0.176	     1.014%	    96.946%	     0.000	        6
	                    NoOp	        1	     0.136	     0.784%	    97.730%	     0.000	        1
	                 Softmax	        5	     0.125	     0.720%	    98.450%	     0.000	        5
	              ExpandDims	       24	     0.108	     0.622%	    99.072%	     0.000	       24
	                     Sum	        5	     0.089	     0.513%	    99.585%	     5.120	        5
	                     Mul	        6	     0.040	     0.230%	    99.816%	     0.000	        6
	                    Sqrt	        2	     0.010	     0.058%	    99.873%	     0.008	        2
	                 Sigmoid	        2	     0.009	     0.052%	    99.925%	     0.000	        2
	               IdentityN	        1	     0.009	     0.052%	    99.977%	     0.000	        1
	                 _Retval	        3	     0.004	     0.023%	   100.000%	     0.000	        3

Timings (microseconds): count=1 curr=17355
Memory (bytes): count=1 curr=203008
1905 nodes observed

According to the profiles above, more Const operators are executed when mkldnn is enabled, and many operators become slower.
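Per-node summaries like the ones above typically come from TensorFlow's benchmark_model tool. A rough sketch of such a run, in case it helps reproduce the numbers (the graph path and layer names below are placeholders, not values taken from this report):

bazel run //tensorflow/tools/benchmark:benchmark_model -- \
  --graph=/path/to/frozen_model.pb \
  --input_layer=input --input_layer_shape=1,128 --input_layer_type=float \
  --output_layer=output \
  --show_summary=true --show_memory=true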

Standalone code to reproduce the issue: N

Relevant log output: No response

google-ml-butler bot added the type:performance label on Jul 7, 2022
sushreebarsa added the comp:mkl and TF 2.4 labels on Jul 7, 2022

baoachun commented Jul 11, 2022

I just found that changing the bazel build option from --config=mkl to --config=mkl_threadpool and turning on -O3 optimization greatly improves performance.

No effect.
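For reference, the build variant described above would look roughly like this (a sketch assembled from the flags mentioned, not a command confirmed in this thread):

bazel build --config=mkl_threadpool --config=nogcp --config=nohdfs --config=noaws --config=nonccl -c opt --copt=-O3 --copt=-march=native //tensorflow:libtensorflow_cc.so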


baoachun commented Jul 11, 2022

In addition, after mkl is turned on, many new Const nodes are created by the GetDummyMklTensorNode method in mkl_layout_pass, and these nodes are particularly time-consuming during inference, which makes performance slower with mkl enabled. Is there any way to remove the PartitionedCall operator?
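One way to see what the layout pass is rewriting (and where the dummy Mkl tensors come from) is TensorFlow's per-module verbose logging; the module name here is assumed from the source file name mkl_layout_pass.cc, and the binary name is a placeholder:

TF_CPP_VMODULE=mkl_layout_pass=1 ./your_inference_binary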

sachinprasadhs added the stat:awaiting tensorflower label on Jul 13, 2022

preethivenkatesh commented Jul 14, 2022

Could you please provide us with a reproducer? MKL TensorFlow builds usually require you to set up the OpenMP variables, e.g.:
export OMP_NUM_THREADS=<number of physical cores>
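Two related knobs from Intel's general CPU tuning guidance are often set alongside it (the values below are illustrative defaults, not numbers from this thread):

export KMP_BLOCKTIME=1
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0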

louie-tsai (Contributor) commented

@baoachun
As Preethi mentioned, you might need to configure OpenMP to achieve better performance.
You can refer to the article below for more details:
https://www.intel.com/content/www/us/en/developer/articles/technical/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference.html

Moreover, starting with TF 2.9 (pip install tensorflow), oneDNN (mkldnn) is enabled by default, so you don't need to build TF with --config=mkl to use mkldnn. In official TF 2.9 you also don't need to configure OpenMP for oneDNN (mkldnn), because it uses the Eigen threadpool instead.

Could you try your workload on official TF 2.9?

In the meantime, _MklFusedMatMul might indeed show a slight performance drop, but it should be less than 10%. The oneDNN team will also work on further optimizing _MklFusedMatMul.
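A quick way to A/B-test this on the stock wheel (the script name is a placeholder; TF_ENABLE_ONEDNN_OPTS is the environment variable that toggles the oneDNN optimizations in official builds):

pip install "tensorflow==2.9.*"
TF_ENABLE_ONEDNN_OPTS=1 python your_benchmark.py   # oneDNN on (default on supported CPUs)
TF_ENABLE_ONEDNN_OPTS=0 python your_benchmark.py   # oneDNN off, for comparison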

aice-support commented

@baoachun
Have you tried your workload on official TF 2.9? Also, which platform are you using? Official TF 2.9 turns on the oneDNN optimizations by default on Intel Xeon CLX, CPX, and ICX.
