RFC: TensorFloat-32 support in TensorFlow #247
Conversation
rfcs/20200520-tensor-float-32.md (outdated)

> ## Motivation
>
> NVIDIA Ampere, an upcoming generation of NVidia GPUs announced at GTC 2020, introduces a new numeric format called TensorFloat-32, or TF32 for short.
Link to announcement?
Done
rfcs/20200520-tensor-float-32.md (outdated)

> NVIDIA Ampere, an upcoming generation of NVidia GPUs announced at GTC 2020, introduces a new numeric format called TensorFloat-32, or TF32 for short.
> TF32 has the range of float32/bfloat16 (i.e. 8 bits of exponent) and the precision of fp16 (i.e. 10 bits of mantissa).
> For the most part, it is not an in-memory format, but tensor cores natively support it as a computation format.
Why "For the most part"?
Removed this part. There is an intrinsic to convert from float32 to tf32, so technically it can be stored in memory, but I don't think "for the most part" clarified anything.
rfcs/20200520-tensor-float-32.md (outdated)

> TF32 has the range of float32/bfloat16 (i.e. 8 bits of exponent) and the precision of fp16 (i.e. 10 bits of mantissa).
> For the most part, it is not an in-memory format, but tensor cores natively support it as a computation format.
> TF32 should not be thought of as an in-memory dtype but instead a computation mode that increases performance and decreases numeric precision for certain float32 operations.
> Nvidia has not found any cases where TF32 reduces the convergence of deep learning models.
I believe this should be NVIDIA as per the NVIDIA Brand Guidelines.
Done.
rfcs/20200520-tensor-float-32.md (outdated)

> Since TF32 only affects Ampere GPUs, moving an op to a GPU can affect numerics. Grappler and other graph optimizations will not consider this, and will freely move ops between devices without regard to numeric stability. As a result, explicitly putting an op on the CPU does not ensure it will use the full float32 precision instead of TF32.
>
> Since TensorFlow 2.3 will not support CUDA 11, which is required for TF32, this API will first be exposed in TensorFlow 2.4. However, Google Cloud will likely cherrypick CUDA 11 and this API into their version of 2.3, so they can offer TF32 support to their customers who use TensorFlow 2.3.
The "However, Google Cloud will likely cherrypick CUDA 11 and this API into their version of 2.3..." sentence should not be part of the RFC I believe.
Why not? It provides information on when the API will be available, and also motivation for why we need to have the RFC so early despite TF 2.4 not coming out for months.
Replacing "google cloud will likely" with "downstream repackagers of tensorflow (such as google cloud) are encouraged to" will make this read better
Done.
> 3. Do not turn it on by default.
>
> The advantage of (1) is that all Ampere float32 users get the performance benefit unless they opt out. Additionally, Ampere numerics will not be loosened in a new release: TensorFlow 2.4 will be the first release with Ampere support, and it will immediately default to TF32 being enabled. The disadvantage is that we cannot collect as much feedback from users before defaulting to TF32, because no stable version of TensorFlow will support TF32 but not have it enabled by default.
> because no stable version of TensorFlow will support TF32 but not have it enabled by default.
I don't buy this: the models that TF32 targets use FP32 today, so I'd expect users to notice a regression even if 2.4 enables it by default, which they can corroborate further by comparing the accuracy with disabling it explicitly.
I don't fully understand your argument. We'd like to have a release where users can try TF32 and give us feedback before we decide whether to turn it on by default. If we immediately turn it on by default in 2.4, users can still give feedback, but it will be too late: we will have already made our decision.
> but it will be too late: we will have already made our decision.
Is the assumption that disabling tf32 by default (if users report problems after we enable it by default in 2.4) is more of a breaking change than enabling it by default (if users try it with 2.4 and don't report problems)?
No, enabling tf32 is probably more of a breaking change. However, we only want to make such a change at most once. After enabling tf32, I don't think we should subsequently disable it.
rfcs/20200520-tensor-float-32.md (outdated)

> ### Remote devices
>
> Enabling TF32 will affect remote Ampere GPUs in addition to local Ampere GPUs. In particular, it will affect devices on hosts connected to via [`tf.config.experimental_connect_to_host`](https://www.tensorflow.org/api_docs/python/tf/config/experimental_connect_to_host) or [`tf.config.experimental_connect_to_cluster`](https://www.tensorflow.org/api_docs/python/tf/config/experimental_connect_to_cluster). The initial, unexposed version of the function in TensorFlow 2.3 may only support local devices, not remote devices, if we do not have time to implement remote device support.
Any specific additional efforts needed here to support remote devices?
I haven't worked this out yet, which is why I state this might not be done for TensorFlow 2.3. This will likely be done by adding a field to the CreateContextRequest proto to indicate whether TF32 is used.
So it will not be part of the cluster_device_attributes, but a new field? (Also see my comment below on updating remote context)
Yes, we will add a new field. See my reply to your other comment.
> In TensorFlow, TF32 can be enabled for supported ops on Ampere GPUs with the following call:
>
> ```python
> tf.config.allow_tensor_float_32_execution(True)
> ```
Why not use the Keras mixed precision policy API?
This affects ops outside Keras, so it shouldn't be under tf.keras. In a sense, TF32 is a form of mixed precision, as some ops use TF32 and others use float32. We could put it under tf.mixed_precision, but I think tf.config is better since tf32 should be thought of as a mode, not a dtype.
Also, the mixed precision API mostly changes the dtype of tensors, while tf32 doesn't affect tensor dtype (afaict), just the dtype of accumulators inside ops.
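To illustrate the distinction drawn above, here is a minimal sketch. It assumes the TF 2.4-era Keras mixed precision API and uses the `allow_tensor_float_32_execution` name proposed in this RFC: mixed precision changes the dtypes that layers compute in, while TF32 is a global mode that leaves all tensor dtypes as float32.

```python
import tensorflow as tf

# Keras mixed precision changes the dtype that layers compute in:
tf.keras.mixed_precision.set_global_policy("mixed_float16")
dense = tf.keras.layers.Dense(8)
print(dense.compute_dtype)  # float16 -- activations are cast to float16
print(dense.dtype)          # float32 -- variables stay float32
tf.keras.mixed_precision.set_global_policy("float32")  # restore the default

# TF32, by contrast, is an execution mode. Tensors, variables, and op output
# dtypes all remain float32; only the internal math of supported ops
# (matmuls, convolutions) on Ampere GPUs changes:
tf.config.allow_tensor_float_32_execution(True)  # API proposed in this RFC
x = tf.random.normal([4, 4])
print(tf.matmul(x, x).dtype)  # float32, even when TF32 is used internally
```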
> Another advantage of turning on TF32 by default is that it makes TensorFlow’s behavior with GPUs more consistent with TPUs. TPUs internally use lower precision for float32 matmuls and convolutions, similar to how Ampere GPUs will use lower precision for float32 matmuls and convolutions if TF32 is enabled.
>
> **If you know of any models whose accuracy may be impacted by TF32, please comment on this RFC.** Note that TF32 is equivalent to float32 except it has 10 bits of mantissa instead of 23 bits. It will initially be used only for matmuls and convolutions, but may be used for other ops in the future if they are implemented in terms of a matmul. Once TensorFlow 2.4 is released, you will be able to test the impact of TF32 on your models if you have Ampere GPUs. You will be able to test earlier if you use Tensorflow nightly packages, and even earlier if you build from source with CUDA 11 support.
You might want to indicate a way to receive private feedback about this too.
Do you think it's likely someone would be willing to share with us but not publicly? I could recommend emailing me for private feedback, but I would rather people post feedback publicly since I want to be transparent about why we make whatever decision we make.
+1 for the transparency.
We don't have to recommend it, but not everyone may be at liberty to talk about what they're working on in a public forum. So mentioning a private channel seems like a good idea.
Ok, I'll mention this but state that we much prefer feedback be posted publicly, even if that requires being vague about the use case. We should list at least two emails in case one of us is sick. @sanjoy, should I list my email and yours?
rfcs/20200520-tensor-float-32.md (outdated)

> The word "allow" emphasizes only certain devices (Ampere GPUs) and ops (such as matmuls and convolutions) will be affected. Once enabled, all local and remote Ampere GPUs use TF32 for supported float32 ops.
>
> Passing `False` to `allow_tensor_float_32_execution` will disable TF32 if already enabled.
What are the use-cases for toggling between these? Are there potential issues with moving between the two in a single program?
Hmmm, there isn't a strong use case. I added the sentences:

> This is useful if multiple models are run sequentially in the same process, where only some should use TF32. It is also useful for tests, as it allows a test class to test both TF32 being enabled and disabled.

Admittedly, this is a fairly weak use case, but I think it's still probably worth having. If others disagree, I'd be happy to remove this.

Also of note, implementing this will require an RPC to enable/disable TF32 even after the eager context has been created, in order to support remote devices.
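A minimal sketch of the sequential-models / testing use case described above, using the API name proposed in this RFC (`run_model` is just a hypothetical stand-in for running a real model):

```python
import tensorflow as tf

def run_model(inputs, use_tf32):
    """Hypothetical helper: runs one 'model' with TF32 either allowed or not."""
    tf.config.allow_tensor_float_32_execution(use_tf32)
    try:
        return tf.matmul(inputs, inputs)  # stand-in for a real model
    finally:
        # Restore full float32 precision before the next model/test runs.
        tf.config.allow_tensor_float_32_execution(False)

x = tf.random.normal([1024, 1024])
fast = run_model(x, use_tf32=True)    # uses TF32 on supported (Ampere) GPUs
exact = run_model(x, use_tf32=False)  # full float32 precision everywhere
print(tf.reduce_max(tf.abs(fast - exact)).numpy())
```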
Does the UpdateContext RPC support this particular use case? I imagine one could use the cluster_device_attributes field to pass on the TF32 mode toggle, but it looks like it will update the remote session when cluster_device_attributes is not empty in the current codebase. Not sure if setting TF32 is worth carrying out such a heavy(?) operation.
Right now UpdateContext is only used to add/remove machines/devices, but we can add a field to allow it to also update whether TF32 is used. We could alternatively add an option to QueueItem in case there are ordering concerns between enabling/disabling TF32 and executing an op.

I asked internally, and cluster_device_attributes is only useful for propagating device information to other machines in the cluster. It is not intended for communicating information about how ops should run, only fundamental information about the devices themselves.

See this post for some more context.

> Not sure if setting TF32 is worth carrying out such a heavy(?) operation.

Yeah, this will be a heavy operation. But users should only set/unset TF32 very rarely: only at the beginning of the model and between running one model and the next (or one test and the next). So I think it's OK.

I updated the RFC based on this discussion. I added a paragraph to the "Remote Devices" section and a new paragraph to "Alternatives considered".
> Another advantage of turning on TF32 by default is that it makes TensorFlow’s behavior with GPUs more consistent with TPUs. TPUs internally use lower precision for float32 matmuls and convolutions, similar to how Ampere GPUs will use lower precision for float32 matmuls and convolutions if TF32 is enabled.
>
> **If you know of any models whose accuracy may be impacted by TF32, please comment on this RFC.** Note that TF32 is equivalent to float32 except it has 10 bits of mantissa instead of 23 bits. It will initially be used only for matmuls and convolutions, but may be used for other ops in the future if they are implemented in terms of a matmul. Once TensorFlow 2.4 is released, you will be able to test the impact of TF32 on your models if you have Ampere GPUs. You will be able to test earlier if you use Tensorflow nightly packages, and even earlier if you build from source with CUDA 11 support.
Are there users that NVIDIA can help us find directly?
Nvidia will probably collect feedback directly and tell us. They have already tested many models themselves.
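As an aside on the quoted claim that TF32 keeps 10 of float32's 23 mantissa bits: the precision loss can be roughly emulated on any machine by masking the low mantissa bits of a float32 value. This is only a sketch; it truncates rather than using the hardware's actual rounding behavior.

```python
import numpy as np

def emulate_tf32(x):
    """Keeps only the top 10 of float32's 23 mantissa bits (by truncation)."""
    x = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x.view(np.uint32)
    # Zero out the 13 low mantissa bits, leaving 10 (TF32's precision).
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

a = np.float32(1.0) + np.float32(2.0) ** -12  # representable in float32
print(a, emulate_tf32(a))  # 1.0002441 [1.] -- the 2**-12 term is lost
```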
> ```python
> tf.config.allow_tensor_float_32_execution(True)
> ```
>
> The word "allow" emphasizes only certain devices (Ampere GPUs) and ops (such as matmuls and convolutions) will be affected. Once enabled, all local and remote Ampere GPUs use TF32 for supported float32 ops.
Should an error be raised (or a warning) if allow=True and the device does not support TF32? One could imagine users being surprised that no complaint is raised when this mode is requested. I guess in that case the flag would be "use_tensor_float_32_execution" instead of allow... but maybe explicit is preferable here?
I considered this, and the original draft did warn. But I think we should encourage users to put the allow_tensor_float_32_execution(True) line at the top of their program unconditionally, and not warn in that case. Otherwise, to avoid the warning, model code would have to check whether the GPUs support TF32 and only allow TF32 if the GPU supports it.
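For reference, the conditional check that model code would otherwise need might look roughly like the sketch below. The `tf.config.experimental.get_device_details` call and the compute-capability 8.0 threshold for Ampere are assumptions on my part, not part of this RFC.

```python
import tensorflow as tf

def ampere_gpu_present():
    """Best-effort check for a GPU that supports TF32 (assumed: compute capability >= 8.0)."""
    for gpu in tf.config.list_physical_devices("GPU"):
        details = tf.config.experimental.get_device_details(gpu)
        cc = details.get("compute_capability")
        if cc is not None and cc >= (8, 0):
            return True
    return False

# The burden a warning would impose: only allow TF32 after checking the hardware.
if ampere_gpu_present():
    tf.config.allow_tensor_float_32_execution(True)

# The recommended pattern instead: call allow_tensor_float_32_execution(True)
# unconditionally at program start; it has no effect on devices without TF32 support.
```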
Design review notes: This has been accepted.
This RFC will be open for comment until Wednesday, June 3rd, 2020.
Objective
Allow TensorFloat-32 to be used in TensorFlow to improve performance.