
Why is the CPU performance of float64 tf.matmul in TensorFlow 2 significantly lower than NumPy's matmul, even in graph mode? #53798

Open
as641651 opened this issue Jan 17, 2022 · 1 comment
Labels: comp:ops (OPs related issues) · stat:awaiting tensorflower (Status - Awaiting response from tensorflower) · TF 2.7 (Issues related to TF 2.7.0) · type:performance (Performance Issue)

Comments

@as641651

I'm comparing the single-thread performance of matrix-matrix products in TensorFlow 2 and NumPy, separately for single precision (float32) and double precision (float64). I find that NumPy's performance is almost equivalent to the Intel MKL C++ implementation (used as the benchmark for matrix multiplication) in both precisions (SGEMM and DGEMM). In TensorFlow, however, only the single-precision (float32) performance matches MKL; the double-precision (float64) performance is significantly slower. Why is TensorFlow slower with double-precision data?

Sample Scripts:

To reproduce the observation, consider the matrix multiplication:

C = AB, where A and B are of size 3000x3000

The TensorFlow 2 and NumPy scripts are given below:

TensorFlow 2 code

import tensorflow as tf
import os
import time


# Check whether TF was built with MKL support
from tensorflow.python.framework import test_util
print("MKL Enabled : ", test_util.IsMklEnabled())


# Set threads
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)

# Problem size
N = 3000
REPS = 20
DTYPE = tf.float64
#DTYPE = tf.float32


@tf.function
def gemm_implicit_noup(A, B):
    #C = A @ B
    start = tf.timestamp()
    with tf.control_dependencies([start]):
        C = tf.matmul(A,B)
    with tf.control_dependencies([C]):
        end = tf.timestamp()
    tf.print(end-start)
    return C

tf.config.run_functions_eagerly(False)

A = tf.random.normal([N, N], dtype=DTYPE)
B = tf.random.normal([N, N], dtype=DTYPE)


# First call builds the tf.function trace; tracing cost is excluded from the timed repetitions
C = gemm_implicit_noup(A,B)

for i in range(REPS):
    C = gemm_implicit_noup(A,B)
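
As a cross-check on the in-graph tf.timestamp measurement, the same op can also be timed from the host side in eager mode. This is a minimal sketch of my own (assuming the same tf, A, B, and REPS as above), not part of the original comparison:

import time

# Warm-up: the first eager call can include one-time setup cost
_ = tf.matmul(A, B)

for _ in range(REPS):
    start = time.perf_counter()
    C = tf.matmul(A, B)
    _ = C.numpy()  # block until the result is materialized before stopping the clock
    end = time.perf_counter()
    print(end - start)

If the slowdown is real, it should show up with either timing method.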

NumPy code

import os
os.environ["OMP_NUM_THREADS"] = "1"  # must be set before NumPy is imported
import numpy as np
import time

N = 3000
REPS = 20
DTYPE = np.float64
#DTYPE = np.float32

def gemm_implicit_noup(A, B):
    #C = A @ B
    C = np.matmul(A,B)
    return C



A = np.random.randn(N,N).astype(DTYPE)
B = np.random.randn(N,N).astype(DTYPE)

for i in range(REPS):
    start = time.perf_counter()
    C = gemm_implicit_noup(A,B)
    end = time.perf_counter()
    print(end-start)
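
One caveat worth noting (my addition, not in the original script): with an MKL-backed NumPy, MKL_NUM_THREADS takes precedence over OMP_NUM_THREADS when it is set, so pinning both before the import is the safer way to force a single-thread run:

import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"  # MKL-specific control; overrides OMP_NUM_THREADS when set
import numpy as np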

System and installation settings:

The performance was compared on an Intel Xeon Skylake at 2.1 GHz running CentOS 7, and also on a 2018 MacBook Pro running macOS Big Sur. Both TensorFlow 2.7 and 2.8 were tested, each built with Intel MKL, under Python 3.9.7 and 3.7.4. I compare single-thread performance so that the results can be reliably reproduced. I observe similar performance numbers in all of these settings:
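
To confirm that both stacks actually dispatch to MKL, the linked BLAS can be inspected directly; np.show_config() is the standard NumPy call, and IsMklEnabled() is the same internal TF utility used in the script above:

import numpy as np
np.show_config()  # lists the BLAS/LAPACK libraries NumPy was built against

from tensorflow.python.framework import test_util
print("TF MKL build:", test_util.IsMklEnabled())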

Single-precision performance is as expected:

  • Intel MKL C++ SGEMM ~ 0.5s
  • NumPy float32 ~ 0.5s
  • TensorFlow float32 ~ 0.5s

But double-precision performance is not:

  • Intel MKL C++ DGEMM ~ 0.9s
  • NumPy float64 ~ 1s
  • TensorFlow float64 > 2.5s (Much Slower!!)
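
For context, a single 3000x3000 GEMM costs about 2*N^3 ≈ 5.4e10 floating-point operations, so the rough timings above translate into throughput as follows (a back-of-the-envelope sketch using the numbers reported here):

N = 3000
flops = 2 * N**3  # ~5.4e10 floating-point operations per 3000x3000 product

# Approximate timings from the measurements above
for label, seconds in [("MKL / NumPy float32", 0.5),
                       ("MKL / NumPy float64", 1.0),
                       ("TensorFlow float64", 2.5)]:
    print(f"{label}: {flops / seconds / 1e9:.0f} GFLOP/s")

A 2x drop from float32 to float64 (roughly 108 vs 54 GFLOP/s) is the expected SIMD-width effect for a well-tuned GEMM; the TensorFlow float64 number (about 22 GFLOP/s) is well below even that.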
@as641651 as641651 added the type:performance Performance Issue label Jan 17, 2022
@mohantym mohantym added TF 2.7 Issues related to TF 2.7.0 comp:ops OPs related issues labels Jan 18, 2022
@mohantym
Contributor

Hi @Saduf2019! Could you please look at this issue? Attaching a gist for reference.

@mohantym mohantym assigned Saduf2019 and unassigned mohantym Jan 19, 2022
@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 31, 2022