for loop in map_fn within tf.function is very slow compared to using while_loop counterpart #40517

Closed
henrysky opened this issue Jun 16, 2020 · 3 comments
Labels: comp:autograph (Autograph related issues) · TF 2.2 (Issues related to TF 2.2) · type:performance (Performance Issue)

henrysky commented Jun 16, 2020

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): N/A
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Win10 2004 x64
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 2.2.0
  • Python version: 3.7.7
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: 10.1 / 7.6
  • GPU model and memory: 1650Ti

Describe the current behavior

Code 1 runs much slower than Code 2 (71 s vs 11 s), but from my understanding of the TensorFlow documentation, Code 1 and Code 2 should be effectively the same (tf.function should turn the for loop into a while_loop when the loop iterates over a Tensor, but that apparently does not happen here). Moreover, Code 2 runs much faster (~1 s) on the CPU (9750H) than on the GPU (1650 Ti); I am not sure whether that is expected. Code 3, which replaces map_fn with a plain Python loop, completed in 164 s, the slowest of the three, which is expected.

Update: this behavior is not observed in TF 2.1.0, where Code 1/2/3 all complete in a similar amount of time (about 11 seconds).

Describe the expected behavior

Code 1 and Code 2 should run about equally fast.

Standalone code to reproduce the issue

Code 1:

import time
import tensorflow as tf

@tf.function
def odeint_external(tensor):
    finished_user_t_ii = 0
    x = tensor[0]
    t = tensor[1]
    
    # array to store the result
    result = tf.TensorArray(dtype=tf.float32, size=t.shape[0])
    result = result.write(0, x)

    i = tf.constant(0, dtype=tf.int32)
    for _t in t[1:]:
        i = i+1
        result = result.write(i, tf.stack([tf.cos(_t), tf.sin(_t)]))

    return result.stack(), t


@tf.function
def parallelized_map_fn(tensor):
    return tf.map_fn(odeint_external, tensor)


# warm up
result = parallelized_map_fn((tf.random.normal((50, 2), 0, 1), tf.random.normal((50, 10000), 0, 1)))

start_t = time.time()
result = parallelized_map_fn((tf.random.normal((50, 2), 0, 1), tf.random.normal((50, 10000), 0, 1)))
print(time.time()-start_t)
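
As a sanity check, the code AutoGraph generates for the for loop in Code 1 can be printed; a Tensor-driven for loop should show up as a while loop in the generated source. A minimal sketch (not part of the timing runs above):

import tensorflow as tf

# tf.function keeps the original Python callable on .python_function;
# tf.autograph.to_code returns the source AutoGraph produces for it as a string.
print(tf.autograph.to_code(odeint_external.python_function))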

Code 2:

import time
import tensorflow as tf

@tf.function
def odeint_external(tensor):
    finished_user_t_ii = 0
    x = tensor[0]
    t = tensor[1]
    
    # array to store the result
    result = tf.TensorArray(dtype=tf.float32, size=t.shape[0])
    result = result.write(0, x)

    i = tf.constant(1, dtype=tf.int32)
    cond = lambda i, _t, r: i < 9999
    body = lambda i, _t, r: (i+1, _t, r.write(i, tf.stack([tf.cos(_t[i]), tf.sin(_t[i])])))
    result = tf.while_loop(cond=cond, body=body, loop_vars=(i, t[1:], result))
    
    return result[2].stack(), t

@tf.function
def parallelized_map_fn(tensor):
    return tf.map_fn(odeint_external, tensor, parallel_iterations=1)

# warm up
result = parallelized_map_fn((tf.random.normal((50, 2), 0, 1), tf.random.normal((50, 10000), 0, 1)))

start_t = time.time()
result = parallelized_map_fn((tf.random.normal((50, 2), 0, 1), tf.random.normal((50, 10000), 0, 1)))
print(time.time()-start_t)

Code 3:

import time
import tensorflow as tf

@tf.function
def odeint_external(tensor):
    finished_user_t_ii = 0
    x = tensor[0]
    t = tensor[1]
    
    # array to store the result
    result = tf.TensorArray(dtype=tf.float32, size=t.shape[0])
    result = result.write(0, x)

    i = tf.constant(0, dtype=tf.int32)
    for _t in t[1:]:
        i = i+1
        result = result.write(i, tf.stack([tf.cos(_t), tf.sin(_t)]))
    
    return result.stack(), t

x = tf.random.normal((50, 2), 0, 1)
t = tf.random.normal((50, 10000), 0, 1)

# warm up
result = [odeint_external((_x, _t)) for _x, _t in zip(x, t)]

start_t = time.time()
result = [odeint_external((_x, _t)) for _x, _t in zip(x, t)]
print(time.time()-start_t)

Other info / logs

N/A

henrysky added the type:bug (Bug) label Jun 16, 2020
ravikyram added the comp:autograph, TF 2.2, and type:performance labels and removed the type:bug label Jun 17, 2020
ravikyram (Contributor) commented:

I have tried in Colab with TF 2.2 and the nightly version and was able to reproduce the issue. Please find the gist here. Thanks!

ravikyram assigned ymodak and unassigned ravikyram Jun 17, 2020

mdanatg commented Jun 19, 2020

This looks like a regression - certain ops that the for loop uses don't seem to be optimized properly. We'll look into that.

For a temporary workaround, writing the loop like so should avoid the slow ops:

    i = tf.constant(1, dtype=tf.int32)
    while i < 9999:
        result = result.write(i, tf.stack([tf.cos(t[i]), tf.sin(t[i])]))
        i = i+1
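
For clarity, here is a sketch of Code 1's odeint_external with that workaround dropped in (keeping the hard-coded bound from Code 2, and assuming the same (50, 2) / (50, 10000) inputs as the repro):

import tensorflow as tf

@tf.function
def odeint_external(tensor):
    x = tensor[0]
    t = tensor[1]

    # array to store the result
    result = tf.TensorArray(dtype=tf.float32, size=t.shape[0])
    result = result.write(0, x)

    # The loop condition depends on a Tensor, so AutoGraph still builds a
    # tf.while_loop, but via the plain while path that avoids the slow ops.
    # Like Code 2, this writes indices 1 through 9998 of the size-10000 array.
    i = tf.constant(1, dtype=tf.int32)
    while i < 9999:
        result = result.write(i, tf.stack([tf.cos(t[i]), tf.sin(t[i])]))
        i = i + 1

    return result.stack(), t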


mdanatg commented Jul 4, 2020

Looks like the workaround only fixes the difference partially: there is still a pair of logical_and/logical_or ops which are slow on GPU and hurt performance (see the referenced line below). This will be fixed in TF 2.4.

main_test = math_ops.logical_or(
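
If helpful, a rough micro-benchmark for checking whether extra logical ops in a while_loop condition are measurably slower on a given GPU (an illustrative sketch only, not taken from the TF internals referenced above):

import time
import tensorflow as tf

N = tf.constant(10000)

@tf.function
def plain_cond():
    # Trivial loop whose condition is a single comparison.
    i = tf.constant(0)
    return tf.while_loop(lambda i: i < N, lambda i: (i + 1,), (i,))

@tf.function
def extra_logical_cond():
    # Same loop, with an extra logical_and/logical_or folded into the
    # condition to mimic the bookkeeping ops mentioned above.
    cond = lambda i: tf.logical_and(i < N, tf.logical_or(i >= 0, i < N))
    i = tf.constant(0)
    return tf.while_loop(cond, lambda i: (i + 1,), (i,))

for name, fn in (("plain_cond", plain_cond), ("extra_logical_cond", extra_logical_cond)):
    fn()  # warm up / trace
    start = time.time()
    fn()
    print(name, time.time() - start)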

mdanatg reopened this Jul 4, 2020
TensorFlow 2.3.0 automation moved this from Done to In progress Jul 4, 2020
TensorFlow 2.3.0 automation moved this from In progress to Done Jul 4, 2020
tensorflow-copybara pushed a commit that referenced this issue Aug 11, 2020
…r when maximum_iterations is set. Fixes #40517.

keithm-xmos pushed a commit to xmos/tensorflow that referenced this issue Feb 1, 2021
* 'master' of github.com:tensorflow/tensorflow: (1480 commits)
  Revive the use of constant_value workaround, since it seems that logical ops are still considerably slower on GPU. Fixes tensorflow#40517. Partially addresses tensorflow#40708.
  ...