
RFC: API to synchronize devices #434

Open · wants to merge 4 commits into base: master

Conversation

@reedwm (Member) commented Oct 25, 2022

Status: Proposed
RFC #: 434
Author(s): Reed Wanderman-Milne (reedwm@google.com), Jonathan Dekhtiar (jdekhtiar@nvidia.com)
Sponsor: Rohan Jain (rohanj@google.com)
Updated: 2022-10-25

Objective

This document proposes a simple API to synchronize TensorFlow devices: `tf.test.sync_devices()`. This is important for accurately measuring execution time in TensorFlow GPU benchmarks, especially in microbenchmarks.

/CC @DEKHTIARJonathan @rohan100jain

@penpornk (Member) commented:
cc: @Jianhui-Li, @jzhoulon, @yiqianglee, @kulinseth, @wchao1115, and @PatriceVignola for PluggableDevices.


## Objective

This document proposes a simple API to synchronize TensorFlow devices: `tf.sync_devices()`. This is important in accurately measuring execution time in TensorFlow GPU benchmarks, especially in microbenchmarks.

Can we put this into the `tf.config.experimental` namespace, i.e. `tf.config.experimental.sync_devices()`?


@sun51 It is not a config call. It could be called at every iteration:

```python
for step in range(NUM_STEPS):
    data = ds_iter.next()
    start_t = time.time()
    rslt = model(data)
    tf.sync_devices()
    print(f"Time: {time.time() - start_t}")
```

I don't think config is the appropriate namespace for such an API.

@sun51 commented Nov 8, 2022

It is actually changing the configuration of the runtime. Meanwhile, I am still not quite clear about the semantics of this API. In the example above, model() includes many ops and is called before tf.sync_devices() is even invoked. How does this work? Once you sync devices, can you switch back to the normal (async) configuration later?


I think there is a misunderstanding about how the API works.

> It is actually changing the configuration of runtime.

No, the API introduces a blocking call that waits for the ops scheduled on every device to clear. No "configuration" is updated or modified; it is just an API call that waits for the GPUs to finish. Hence this has nothing to do with the config namespace, IMHO.

> How does this work?

As said above, it waits for the pipeline of scheduled ops to clear. In short: everyone finishes what they are doing, lets us know when they are done, and we move on once everybody is done.
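To make the semantics concrete, here is a toy model in pure Python (not TensorFlow internals; `FakeRuntime`, `schedule`, and the thread-pool simulation are invented for illustration): ops are launched asynchronously and return immediately, while the sync call simply blocks until everything scheduled so far has drained.

```python
# Toy model of an async runtime: "devices" are worker threads, and
# sync_devices() blocks until every scheduled op has finished.
from concurrent.futures import ThreadPoolExecutor

class FakeRuntime:
    """Simulated async runtime: ops are scheduled and run in the background."""
    def __init__(self, num_devices=2):
        self._pool = ThreadPoolExecutor(max_workers=num_devices)
        self._pending = []

    def schedule(self, fn):
        # Like a TF op launch: returns immediately, work runs asynchronously.
        self._pending.append(self._pool.submit(fn))

    def sync_devices(self):
        # Blocking call: wait for every scheduled op to clear.
        for f in self._pending:
            f.result()
        self._pending.clear()

rt = FakeRuntime()
results = []
rt.schedule(lambda: results.append(sum(range(1000))))
rt.schedule(lambda: results.append(sum(range(2000))))
rt.sync_devices()  # returns only after both scheduled ops are done
```

No runtime state is reconfigured here; the call is purely a barrier, which is the point of the comment above.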

Feel free to reach out to @reedwm inside Google; he designed the original solution.


Chatted with Reed; putting it into the tf.test.XXX namespace is something we both think is reasonable.

@reedwm (Member, Author)

Sorry for the delay; I edited the RFC to put it under `tf.test.sync_devices`.

```python
print(f'Time taken: {time.time() - start}')
```

This can be fixed by calling `y.numpy()` which forces the Python thread to wait until the matmul finishes, but this also adds a device-to-host transfer. The benchmark only wants to measure the matmul time, not the device transfer time.
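A quick way to see why a sync is needed for accurate timing is a pure-Python sketch (the thread and `sleep` stand in for asynchronous device execution; nothing here is TensorFlow API): timing only the launch measures almost nothing, while waiting for completion measures the actual work.

```python
import time
import threading

done = threading.Event()

def gpu_like_op():
    # Stand-in for an asynchronously executed device op (e.g. a matmul).
    time.sleep(0.2)
    done.set()

start = time.time()
threading.Thread(target=gpu_like_op).start()  # async launch, returns at once
launch_time = time.time() - start             # tiny: measures only the launch

done.wait()                                   # analogous to tf.test.sync_devices()
total_time = time.time() - start              # ~0.2 s: the real op duration
```

Unlike forcing a `.numpy()`, the wait here involves no copy of the result back to the host.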

What are you trying to measure exactly? When we call tf.matmul(), even with a sync, the measurement also includes all the Python overhead and C++ launch overhead, and then the GPU execution time.


The idea is to measure step time (including all potential overheads) from start to finish, getting the most accurate picture of a single "step" or "compute unit" without having to force a `.numpy()` call that would trigger a memcpyDtoH.

See: https://github.com/tensorflow/community/blob/1fbc2877e154973cbc37d0405e94cb18852e67cd/rfcs/20221025-sync-devices.md#motivation
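The per-step measurement pattern described above can be sketched as follows (a hypothetical simulation: `run_step` and the threading are stand-ins for an asynchronously launched device step, with `done.wait()` playing the role of `tf.test.sync_devices()`):

```python
import time
import threading

def run_step():
    """Launch a fake async 'step'; return an event signalling completion."""
    done = threading.Event()
    def work():
        time.sleep(0.05)  # pretend device compute for one step
        done.set()
    threading.Thread(target=work).start()
    return done

step_times = []
for _ in range(3):
    start = time.perf_counter()
    done = run_step()   # returns immediately (async launch)
    done.wait()         # sync point: block until the step's work has cleared
    step_times.append(time.perf_counter() - start)
```

Each recorded time covers the full step, launch overhead included, with no device-to-host transfer forced just for timing.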

@sun51 commented Jan 12, 2023

LGTM, thanks

@DEKHTIARJonathan

@reedwm can you please link the commit that adds the feature, or the PR/CL, here?

Thanks

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Jan 23, 2023
There was an RFC for this API: tensorflow/community#434

PiperOrigin-RevId: 504062646