for loop in map_fn within tf.function is very slow compared to using while_loop counterpart #40517

Closed
henrysky opened this issue Jun 16, 2020 · 3 comments
Labels: comp:autograph (Autograph related issues) · TF 2.2 (Issues related to TF 2.2) · type:performance (Performance Issue)

henrysky commented Jun 16, 2020

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): N/A
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Win10 2004 x64
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 2.2.0
  • Python version: 3.7.7
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: 10.1 / 7.6
  • GPU model and memory: 1650Ti

Describe the current behavior

Code 1 runs much slower than Code 2 (71 s vs 11 s), but from my understanding of the TensorFlow documentation, Code 1 and Code 2 should be effectively the same (tf.function should turn the for loop into a while_loop when the loop iterates over a Tensor, but that apparently does not happen here). Moreover, Code 2 runs much faster (~1 s) on the CPU (9750H) than on the GPU (1650 Ti); I am not sure whether that is expected. Code 3, which replaces map_fn with a plain Python loop, completed in 164 s, the slowest of the three, which is expected.

Update: this behavior is not observed in TF 2.1.0, where Code 1/2/3 all complete in a similar amount of time (about 11 seconds).

Describe the expected behavior

Code 1 and Code 2 should run about equally fast.

Standalone code to reproduce the issue

Code 1:

import time
import tensorflow as tf

@tf.function
def odeint_external(tensor):
    finished_user_t_ii = 0
    x = tensor[0]
    t = tensor[1]
    
    # array to store the result
    result = tf.TensorArray(dtype=tf.float32, size=t.shape[0])
    result = result.write(0, x)

    i = tf.constant(0, dtype=tf.int32)
    for _t in t[1:]:
        i = i+1
        result = result.write(i, tf.stack([tf.cos(_t), tf.sin(_t)]))

    return result.stack(), t


@tf.function
def parallelized_map_fn(tensor):
    return tf.map_fn(odeint_external, tensor)


# warm up
result = parallelized_map_fn((tf.random.normal((50, 2), 0, 1), tf.random.normal((50, 10000), 0, 1)))

start_t = time.time()
result = parallelized_map_fn((tf.random.normal((50, 2), 0, 1), tf.random.normal((50, 10000), 0, 1)))
print(time.time()-start_t)
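
As a sanity check, the code AutoGraph generates for the for loop in Code 1 can be printed; a Tensor-driven for loop should show up as a while loop in the generated source. A minimal sketch (not part of the timing runs above):

import tensorflow as tf

# tf.function keeps the original Python callable on .python_function;
# tf.autograph.to_code returns the source AutoGraph produces for it as a string.
print(tf.autograph.to_code(odeint_external.python_function))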

Code 2:

import time
import tensorflow as tf

@tf.function
def odeint_external(tensor):
    finished_user_t_ii = 0
    x = tensor[0]
    t = tensor[1]
    
    # array to store the result
    result = tf.TensorArray(dtype=tf.float32, size=t.shape[0])
    result = result.write(0, x)

    i = tf.constant(1, dtype=tf.int32)
    cond = lambda i, _t, r: i < 9999
    body = lambda i, _t, r: (i+1, _t, r.write(i, tf.stack([tf.cos(_t[i]), tf.sin(_t[i])])))
    result = tf.while_loop(cond=cond, body=body, loop_vars=(i, t[1:], result))
    
    return result[2].stack(), t

@tf.function
def parallelized_map_fn(tensor):
    return tf.map_fn(odeint_external, tensor, parallel_iterations=1)

# warm up
result = parallelized_map_fn((tf.random.normal((50, 2), 0, 1), tf.random.normal((50, 10000), 0, 1)))

start_t = time.time()
result = parallelized_map_fn((tf.random.normal((50, 2), 0, 1), tf.random.normal((50, 10000), 0, 1)))
print(time.time()-start_t)

Code 3:

import time
import tensorflow as tf

@tf.function
def odeint_external(tensor):
    finished_user_t_ii = 0
    x = tensor[0]
    t = tensor[1]
    
    # array to store the result
    result = tf.TensorArray(dtype=tf.float32, size=t.shape[0])
    result = result.write(0, x)

    i = tf.constant(0, dtype=tf.int32)
    for _t in t[1:]:
        i = i+1
        result = result.write(i, tf.stack([tf.cos(_t), tf.sin(_t)]))
    
    return result.stack(), t

x = tf.random.normal((50, 2), 0, 1)
t = tf.random.normal((50, 10000), 0, 1)

# warm up
result = [odeint_external((_x, _t)) for _x, _t in zip(x, t)]

start_t = time.time()
result = [odeint_external((_x, _t)) for _x, _t in zip(x, t)]
print(time.time()-start_t)

Other info / logs

N/A

henrysky added the type:bug (Bug) label Jun 16, 2020
ravikyram added the comp:autograph, TF 2.2, and type:performance labels and removed the type:bug label Jun 17, 2020
ravikyram (Contributor) commented:

I have tried in Colab with TF 2.2 and the nightly version and was able to reproduce the issue. Please find the gist here. Thanks!

ravikyram assigned ymodak and unassigned ravikyram Jun 17, 2020

mdanatg commented Jun 19, 2020

This looks like a regression - certain ops that the for loop uses don't seem to be optimized properly. We'll look into that.

For a temporary workaround, writing the loop like so should avoid the slow ops:

    i = tf.constant(1, dtype=tf.int32)
    while i < 9999:
        result = result.write(i, tf.stack([tf.cos(t[i]), tf.sin(t[i])]))
        i = i+1
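
For clarity, here is a sketch of Code 1's odeint_external with that workaround dropped in (keeping the hard-coded bound from Code 2, and assuming the same (50, 2) / (50, 10000) inputs as the repro):

import tensorflow as tf

@tf.function
def odeint_external(tensor):
    x = tensor[0]
    t = tensor[1]

    # array to store the result
    result = tf.TensorArray(dtype=tf.float32, size=t.shape[0])
    result = result.write(0, x)

    # The loop condition depends on a Tensor, so AutoGraph still builds a
    # tf.while_loop, but via the plain while path that avoids the slow ops.
    # Like Code 2, this writes indices 1 through 9998 of the size-10000 array.
    i = tf.constant(1, dtype=tf.int32)
    while i < 9999:
        result = result.write(i, tf.stack([tf.cos(t[i]), tf.sin(t[i])]))
        i = i + 1

    return result.stack(), t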


mdanatg commented Jul 4, 2020

Looks like the workaround only fixes the difference partially: there is still a pair of logical_and/logical_or ops which are slow on GPU and hurt performance (see the referenced line below). This will be fixed in TF 2.4.

main_test = math_ops.logical_or(
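
If helpful, a rough micro-benchmark for checking whether extra logical ops in a while_loop condition are measurably slower on a given GPU (an illustrative sketch only, not taken from the TF internals referenced above):

import time
import tensorflow as tf

N = tf.constant(10000)

@tf.function
def plain_cond():
    # Trivial loop whose condition is a single comparison.
    i = tf.constant(0)
    return tf.while_loop(lambda i: i < N, lambda i: (i + 1,), (i,))

@tf.function
def extra_logical_cond():
    # Same loop, with an extra logical_and/logical_or folded into the
    # condition to mimic the bookkeeping ops mentioned above.
    cond = lambda i: tf.logical_and(i < N, tf.logical_or(i >= 0, i < N))
    i = tf.constant(0)
    return tf.while_loop(cond, lambda i: (i + 1,), (i,))

for name, fn in (("plain_cond", plain_cond), ("extra_logical_cond", extra_logical_cond)):
    fn()  # warm up / trace
    start = time.time()
    fn()
    print(name, time.time() - start)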

mdanatg reopened this Jul 4, 2020
TensorFlow 2.3.0 automation moved this from Done to In progress Jul 4, 2020
TensorFlow 2.3.0 automation moved this from In progress to Done Jul 4, 2020
tensorflow-copybara pushed a commit that referenced this issue Aug 11, 2020
…r when maximum_iterations is set. Fixes #40517.

keithm-xmos pushed a commit to xmos/tensorflow that referenced this issue Feb 1, 2021
* 'master' of github.com:tensorflow/tensorflow: (1480 commits)
  Revive the use of constant_value workaround, since it seems that logical ops are still considerably slower on GPU. Fixes tensorflow#40517. Partially addresses tensorflow#40708.
  ...