
RFC: API to synchronize devices #434

Open · wants to merge 4 commits into base: master

Conversation

@reedwm (Member) commented Oct 25, 2022

Status: Proposed
RFC #: 434
Author(s): Reed Wanderman-Milne (reedwm@google.com), Jonathan Dekhtiar (jdekhtiar@nvidia.com)
Sponsor: Rohan Jain (rohanj@google.com)
Updated: 2022-10-25

Objective

This document proposes a simple API to synchronize TensorFlow devices: `tf.test.sync_devices()`. This is important for accurately measuring execution time in TensorFlow GPU benchmarks, especially in microbenchmarks.

/CC @DEKHTIARJonathan @rohan100jain

@penpornk (Member) commented:
cc: @Jianhui-Li, @jzhoulon, @yiqianglee, @kulinseth, @wchao1115, and @PatriceVignola for PluggableDevices.


## Objective

This document proposes a simple API to synchronize TensorFlow devices: `tf.sync_devices()`. This is important in accurately measuring execution time in TensorFlow GPU benchmarks, especially in microbenchmarks.

Can we put this into the `tf.config.experimental` namespace, i.e. `tf.config.experimental.sync_devices()`?


@sun51 It is not a config call. It could be called at every iteration:

```python
for step in range(NUM_STEPS):
    data = ds_iter.next()
    start_t = time.time()
    rslt = model(data)
    tf.sync_devices()
    print(f"Time: {time.time() - start_t}")
```

I don't think config is the appropriate namespace for such an API.

@sun51 commented Nov 8, 2022

It is actually changing the configuration of the runtime. Meanwhile, I am still not quite clear about the semantics of this API. In the example above, model() includes many ops and is called before tf.sync_devices() is even invoked. How does this work? Once you sync devices, can you switch back to the normal (async) configuration later?


I think there is a misunderstanding about how the API works.

> It is actually changing the configuration of runtime.

No, the API introduces a blocking call that waits for the ops scheduled on every device to clear. No "configuration" is updated or modified; it is just an API call that waits for the GPUs to finish. Hence this has nothing to do with the config namespace, IMHO.

> How does this work?

As said above, it waits for the pipeline of scheduled ops to clear. In short: everyone finishes what they are doing, lets us know when they are done, and we move on once everybody is done.
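To make the semantics concrete, here is a toy model in pure Python (not TensorFlow internals; `FakeRuntime`, `schedule`, and the thread-pool simulation are invented for illustration): ops are launched asynchronously and return immediately, while the sync call simply blocks until everything scheduled so far has drained.

```python
# Toy model of an async runtime: "devices" are worker threads, and
# sync_devices() blocks until every scheduled op has finished.
from concurrent.futures import ThreadPoolExecutor

class FakeRuntime:
    """Simulated async runtime: ops are scheduled and run in the background."""
    def __init__(self, num_devices=2):
        self._pool = ThreadPoolExecutor(max_workers=num_devices)
        self._pending = []

    def schedule(self, fn):
        # Like a TF op launch: returns immediately, work runs asynchronously.
        self._pending.append(self._pool.submit(fn))

    def sync_devices(self):
        # Blocking call: wait for every scheduled op to clear.
        for f in self._pending:
            f.result()
        self._pending.clear()

rt = FakeRuntime()
results = []
rt.schedule(lambda: results.append(sum(range(1000))))
rt.schedule(lambda: results.append(sum(range(2000))))
rt.sync_devices()  # returns only after both scheduled ops are done
```

No runtime state is reconfigured here; the call is purely a barrier, which is the point of the comment above.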

Feel free to reach out to @reedwm inside Google; he designed the original solution.


Chatted with Reed; putting it into the tf.test.XXX namespace is something we both think is reasonable.

@reedwm (Member, Author)

Sorry for the delay; I edited the RFC to put it under `tf.test.sync_devices`.

```python
print(f'Time taken: {time.time() - start}')
```

This can be fixed by calling `y.numpy()` which forces the Python thread to wait until the matmul finishes, but this also adds a device-to-host transfer. The benchmark only wants to measure the matmul time, not the device transfer time.
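A quick way to see why a sync is needed for accurate timing is a pure-Python sketch (the thread and `sleep` stand in for asynchronous device execution; nothing here is TensorFlow API): timing only the launch measures almost nothing, while waiting for completion measures the actual work.

```python
import time
import threading

done = threading.Event()

def gpu_like_op():
    # Stand-in for an asynchronously executed device op (e.g. a matmul).
    time.sleep(0.2)
    done.set()

start = time.time()
threading.Thread(target=gpu_like_op).start()  # async launch, returns at once
launch_time = time.time() - start             # tiny: measures only the launch

done.wait()                                   # analogous to tf.test.sync_devices()
total_time = time.time() - start              # ~0.2 s: the real op duration
```

Unlike forcing a `.numpy()`, the wait here involves no copy of the result back to the host.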

What are you trying to measure exactly? When we call tf.matmul(), even with a sync, the measurement also includes all the Python overhead and C++ launch overhead, and then the GPU execution time.


The idea is to measure step time (including all potential overheads) from start to finish, getting the most accurate picture of a single "step" or "compute unit" without having to force a `.numpy()` call that would trigger a memcpyDtoH.

See: https://github.com/tensorflow/community/blob/1fbc2877e154973cbc37d0405e94cb18852e67cd/rfcs/20221025-sync-devices.md#motivation
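The per-step measurement pattern described above can be sketched as follows (a hypothetical simulation: `run_step` and the threading are stand-ins for an asynchronously launched device step, with `done.wait()` playing the role of `tf.test.sync_devices()`):

```python
import time
import threading

def run_step():
    """Launch a fake async 'step'; return an event signalling completion."""
    done = threading.Event()
    def work():
        time.sleep(0.05)  # pretend device compute for one step
        done.set()
    threading.Thread(target=work).start()
    return done

step_times = []
for _ in range(3):
    start = time.perf_counter()
    done = run_step()   # returns immediately (async launch)
    done.wait()         # sync point: block until the step's work has cleared
    step_times.append(time.perf_counter() - start)
```

Each recorded time covers the full step, launch overhead included, with no device-to-host transfer forced just for timing.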

@sun51 commented Jan 12, 2023

LGTM, thanks

@DEKHTIARJonathan

@reedwm can you please link the commit that adds the feature, or the PR/CL, here?

Thanks

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Jan 23, 2023
There was an RFC for this API: tensorflow/community#434

PiperOrigin-RevId: 504062646