
The performance is worse after turning on mkldnn #56697

Open
baoachun opened this issue Jul 7, 2022 · 5 comments
Labels: comp:mkl (MKL related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.4 (for issues related to TF 2.4), type:performance (Performance Issue)

Comments

baoachun commented Jul 7, 2022

Issue Type: Performance
Source: source (built from source)
Tensorflow Version: 2.4.1
Custom Code: No
OS Platform and Distribution: CentOS 7
Mobile device: No response
Python version: 3.8.6
Bazel version: 3.1.0
GCC/Compiler version: 10.2
CUDA/cuDNN version: N/A
GPU model and memory: N/A

Current Behaviour?

The performance is worse after turning on mkldnn.
compile command without mkldnn:
bazel build --config=nogcp   --config=nohdfs  --config=noaws --config=nonccl -c opt --copt=-march=native //tensorflow:libtensorflow_cc.so 

compile command with mkldnn:
bazel build --config=mkl --config=nogcp   --config=nohdfs  --config=noaws --config=nonccl -c opt --copt=-march=native  //tensorflow:libtensorflow_cc.so 

performance with mkldnn:
Number of nodes executed: 2628
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	         _MklFusedMatMul	       33	    56.488	    83.143%	    83.143%	    55.356	       33
	            StridedSlice	     1449	     4.799	     7.063%	    90.206%	    66.120	     1449
	            _MklConcatV2	       14	     2.766	     4.071%	    94.277%	   118.188	       14
	                   Const	      737	     1.379	     2.030%	    96.307%	     0.000	      737
	                    NoOp	        1	     0.934	     1.375%	    97.682%	     0.000	        1
	                    _Arg	      303	     0.422	     0.621%	    98.303%	     0.000	      303
	                    Pack	        1	     0.216	     0.318%	    98.621%	     1.080	        1
	             _MklSoftmax	        5	     0.196	     0.288%	    98.909%	     4.724	        5
	                _MklToTf	       31	     0.169	     0.249%	    99.158%	     0.000	       31
	             _MklReshape	        6	     0.136	     0.200%	    99.358%	     5.472	        6
	     _MklInputConversion	        5	     0.105	     0.155%	    99.513%	     9.120	        5
	                     Sum	        5	     0.103	     0.152%	    99.664%	     5.120	        5
	              ExpandDims	       24	     0.097	     0.143%	    99.807%	     0.000	       24
	                 _MklMul	        5	     0.096	     0.141%	    99.948%	    41.984	        5
	                    Sqrt	        2	     0.010	     0.015%	    99.963%	     0.008	        2
	                 Sigmoid	        2	     0.009	     0.013%	    99.976%	     0.000	        2
	               IdentityN	        1	     0.007	     0.010%	    99.987%	     0.000	        1
	                     Mul	        1	     0.005	     0.007%	    99.994%	     0.000	        1
	                 _Retval	        3	     0.004	     0.006%	   100.000%	     0.000	        3

Timings (microseconds): count=1 curr=67941
Memory (bytes): count=1 curr=307172
2628 nodes observed

performance without mkldnn:
Number of nodes executed: 1905
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	            StridedSlice	     1449	     8.448	    48.678%	    48.678%	    66.120	     1449
	            _FusedMatMul	       33	     5.554	    32.002%	    80.680%	    25.260	       33
	                ConcatV2	       14	     1.302	     7.502%	    88.182%	   105.420	       14
	                    _Arg	      303	     0.737	     4.247%	    92.429%	     0.000	      303
	                   Const	       50	     0.370	     2.132%	    94.561%	     0.000	       50
	                    Pack	        1	     0.238	     1.371%	    95.932%	     1.080	        1
	                 Reshape	        6	     0.176	     1.014%	    96.946%	     0.000	        6
	                    NoOp	        1	     0.136	     0.784%	    97.730%	     0.000	        1
	                 Softmax	        5	     0.125	     0.720%	    98.450%	     0.000	        5
	              ExpandDims	       24	     0.108	     0.622%	    99.072%	     0.000	       24
	                     Sum	        5	     0.089	     0.513%	    99.585%	     5.120	        5
	                     Mul	        6	     0.040	     0.230%	    99.816%	     0.000	        6
	                    Sqrt	        2	     0.010	     0.058%	    99.873%	     0.008	        2
	                 Sigmoid	        2	     0.009	     0.052%	    99.925%	     0.000	        2
	               IdentityN	        1	     0.009	     0.052%	    99.977%	     0.000	        1
	                 _Retval	        3	     0.004	     0.023%	   100.000%	     0.000	        3

Timings (microseconds): count=1 curr=17355
Memory (bytes): count=1 curr=203008
1905 nodes observed

According to the profiles above, more Const operators are executed when mkldnn is enabled, and many operators become slower.
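Per-node summaries like the ones above typically come from TensorFlow's benchmark_model tool. A rough sketch of such a run, in case it helps reproduce the numbers (the graph path and layer names below are placeholders, not values taken from this report):

bazel run //tensorflow/tools/benchmark:benchmark_model -- \
  --graph=/path/to/frozen_model.pb \
  --input_layer=input --input_layer_shape=1,128 --input_layer_type=float \
  --output_layer=output \
  --show_summary=true --show_memory=true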

Standalone code to reproduce the issue: N

Relevant log output: No response

google-ml-butler bot added the type:performance label on Jul 7, 2022
sushreebarsa added the comp:mkl and TF 2.4 labels on Jul 7, 2022

baoachun commented Jul 11, 2022

I just found that changing the bazel build option from --config=mkl to --config=mkl_threadpool and turning on -O3 optimization greatly improves performance.

No effect.
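For reference, the build variant described above would look roughly like this (a sketch assembled from the flags mentioned, not a command confirmed in this thread):

bazel build --config=mkl_threadpool --config=nogcp --config=nohdfs --config=noaws --config=nonccl -c opt --copt=-O3 --copt=-march=native //tensorflow:libtensorflow_cc.so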


baoachun commented Jul 11, 2022

In addition, after mkl is turned on, many new Const nodes are created by the GetDummyMklTensorNode method in mkl_layout_pass, and these nodes are particularly time-consuming during inference, which makes performance slower with mkl enabled. Is there any way to remove the PartitionedCall operator?
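One way to see what the layout pass is rewriting (and where the dummy Mkl tensors come from) is TensorFlow's per-module verbose logging; the module name here is assumed from the source file name mkl_layout_pass.cc, and the binary name is a placeholder:

TF_CPP_VMODULE=mkl_layout_pass=1 ./your_inference_binary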

sachinprasadhs added the stat:awaiting tensorflower label on Jul 13, 2022

preethivenkatesh commented Jul 14, 2022

Could you please provide us with a reproducer? MKL TensorFlow builds usually require you to set up the OpenMP variables, e.g.:
export OMP_NUM_THREADS=<number of physical cores>
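Two related knobs from Intel's general CPU tuning guidance are often set alongside it (the values below are illustrative defaults, not numbers from this thread):

export KMP_BLOCKTIME=1
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0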

louie-tsai (Contributor) commented

@baoachun
As Preethi mentioned, you might need to configure OpenMP to achieve better performance.
You can refer to the article below for more details:
https://www.intel.com/content/www/us/en/developer/articles/technical/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference.html

Moreover, starting with TF 2.9 (pip install tensorflow), oneDNN (mkldnn) is enabled by default, so you don't need to build TF with --config=mkl to use mkldnn. In official TF 2.9 you also don't need to configure OpenMP for oneDNN (mkldnn), because it uses the Eigen threadpool instead.

Could you try your workload on official TF 2.9?

In the meantime, _MklFusedMatMul might indeed show a slight performance drop, but it should be less than 10%. The oneDNN team will also work on further optimizing _MklFusedMatMul.
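A quick way to A/B-test this on the stock wheel (the script name is a placeholder; TF_ENABLE_ONEDNN_OPTS is the environment variable that toggles the oneDNN optimizations in official builds):

pip install "tensorflow==2.9.*"
TF_ENABLE_ONEDNN_OPTS=1 python your_benchmark.py   # oneDNN on (default on supported CPUs)
TF_ENABLE_ONEDNN_OPTS=0 python your_benchmark.py   # oneDNN off, for comparison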

aice-support commented

@baoachun
Have you tried your workload on official TF 2.9? Also, which platform are you using? Official TF 2.9 turns on the oneDNN optimizations by default on Intel Xeon CLX, CPX, and ICX.
