FP8 Convolutions in XLA#60807
Conversation
reedwm left a comment
Thanks for adding FP8 support to convolutions! Sorry for taking so long to review this.
Also the comments are in a weird order since I went back and forth between the files a lot when reviewing this. I would view the comments in the "Files changed" tab.
tensorflow/compiler/xla/service/gpu/cudnn_fused_conv_rewriter.cc
  }
}

if (pattern_level == 1) {
I misunderstood what you were doing before, when I commented
Instead of having this pattern_level concept, can we just iterate over the users again and check for the convert+clamp?
I understand now that you're trying to match a convert of a clamp, which requires looking at the user of the user. I'm still not a fan of the pattern_level concept though, since it's confusing. The best way to address this would be to return a GraphString without the convert+clamp. Then create a FuseConvertToF8 function that is called from CudnnFusedConvRewriter::Run, where you find the existing instruction with a graph string and append the conversion to F8.
If you'd rather handle everything in this function, you can directly get the user of the user to see if that matches, instead of using recursion and pattern_level. E.g. you can do:
if (user->user_count() == 1 && Match(user->users()[0], m::Convert(...))) {...}
Granted, when we eventually match amax calculations, we might need to go deeper, e.g. checking the user of the user of the user. I think doing so directly is better than recursion.
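The suggested non-recursive check can be illustrated with a self-contained sketch. `Node` and `IsConvert` here are stand-ins for `HloInstruction` and `Match(instr, m::Convert(...))`, not the real XLA types:

```cpp
#include <cstring>
#include <vector>

// Minimal stand-in for HloInstruction's user list; illustrative only.
struct Node {
  const char* opcode;
  std::vector<Node*> users;
  int user_count() const { return static_cast<int>(users.size()); }
};

// Plays the role of Match(instr, m::Convert(...)).
bool IsConvert(const Node* n) { return std::strcmp(n->opcode, "convert") == 0; }

// Direct user-of-user check: instead of re-entering the matcher with a
// pattern_level counter, look one level down explicitly.
bool MatchesConvertOfClamp(const Node* clamp) {
  return clamp->user_count() == 1 && IsConvert(clamp->users[0]);
}
```

Going one level deeper (e.g. for a future amax match) would add one more explicit `user_count() == 1 && ...` step rather than another recursion level.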
I think in the context of having multiple users, the recursive approach is advantageous and more straightforward. Can we revisit this issue after you've seen the Amax case?
Sure, I'll bring this up again when you add amax support if I still think the non-recursive approach is better.
tensorflow/compiler/xla/service/gpu/cudnn_fused_conv_rewriter.cc
tensorflow/compiler/xla/service/gpu/cudnn_fused_conv_rewriter.cc
if (GetCudaComputeCapability().IsAtLeast(
        se::CudaComputeCapability::HOPPER)) {
Can we check both custom_call_string and serialized_graph_string on pre-Hopper, using only the two passes (GpuConvRewriter and CudnnFusedConvRewriter) instead of all the passes? In both the Hopper and non-Hopper case, we can call RunAndFilecheckHloRewrite, and on Hopper only, we can call RunAndCompare. In the Hopper case, you can also call RunFileCheck but only need to do a simple sanity check, such as that the graph string is correct, since other passes may modify things like layout which would make the custom_call_string not match.
This is similar to what we do in gemm_rewrite_test, where we only call RunAndCompare on Hopper
Isn't that the solution we arrived at? As I understand it, you don't want to verify the final layout even on Hopper systems, which in my opinion renders the test somewhat incomplete. I don't think this is directly comparable to the GEMM case, where layout conversions play less of a role.
I'm still unsure why layout is important here, compared to the gemm case. Running GpuConvRewriter and CudnnFusedConvRewriter on pre_hlo_string is still causing the FP8 rewrite to happen even if layout assignment doesn't run, right?
Even if layout assignment is important, maybe see what transformations it does to pre_hlo_string, and just put the resulting layouts directly in the HLO strings in the test, instead of relying on layout assignment to run.
My perspective is that XLA unit tests are usually based on running the full compiler pipeline. When that's not possible, we can do partial testing by running only the relevant pass in an artificial setting that we can't easily extend to the full pipeline. It's less clear to me why we should deviate from the normal approach and restrict the testing in cases where we don't have to.
The reason to deviate from the normal approach is that we want to test as much as possible on non-Hopper, and right now, the PR only tests the graph string and not things like the custom_call_target, the dim_labels, etc.
But I'll accept only testing the graph string for now, we can reconsider if the tests get broken later due to a lack of Hopper CI.
We can discard the first part of the custom_call FileCheck string on non-Hopper systems and still compare the configuration of the Custom Call.
What is the first part of the custom_call FileCheck string? Is this the f8e4m3fn[1,6,6,16]{3,2,1,0}, u8[{{.*}}]{0}) part and is that part different if you don't run the rest of the passes?
Yes, the order of the dimensions changes.
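Sketched in FileCheck syntax, discarding the layout-sensitive shape prefix while still checking the call's configuration could look like the following (the `custom_call_target` value and attribute name here are illustrative, not copied from the test):

```
// CHECK: custom-call
// CHECK-SAME: custom_call_target="__cudnn${{.*}}"
// CHECK-SAME: serialized_graph
```

The `{{.*}}` wildcards absorb the parts whose dimension order depends on whether the rest of the passes ran.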
tensorflow/compiler/xla/service/gpu/cudnn_fused_conv_rewriter_test.cc
tsl::StatusOr<cudnn_frontend::Tensor> CreateCudnnTensor(
    cudnn_frontend::Tensor original, int64_t uid, dnn::DataType dtype,
    bool is_virtual = false) {
  return tsl::errors::Internal("Not implemented.");
State that copying a cudnn tensor requires cudnn 8.8 in the error.
Also in cudnn_fused_conv_rewriter.cc, you should check CUDNN_VERSION in addition to CUDA_VERSION to avoid this error.
I think this might be supported but the functionality is only used when we require cuDNN version of at least 8.9 and I can't easily test it. One option would be to give the clone overload of CreateCudnnTensor its own 8.9 version guard instead of sharing it with the existing overload and remove this.
I'm fine with either keeping it as-is or adding an 8.9 version check inside the definition of CreateCudnnTensor.
Either way, check CUDNN_VERSION in cudnn_fused_conv_rewriter.cc though.
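A guard along these lines might look like the following sketch. `CUDA_VERSION` and `CUDNN_VERSION` normally come from `cuda.h`/`cudnn.h`; the stand-in defines, the function name, and the exact version thresholds here are illustrative, not the real rewriter code:

```cpp
// Stand-ins so the sketch is self-contained; in the real build these macros
// are provided by the CUDA/cuDNN headers.
#ifndef CUDA_VERSION
#define CUDA_VERSION 12010  // e.g. CUDA 12.1
#endif
#ifndef CUDNN_VERSION
#define CUDNN_VERSION 8700  // e.g. cuDNN 8.7
#endif

// Illustrative gate for the FP8 conv rewrite: check CUDNN_VERSION in
// addition to CUDA_VERSION before attempting the rewrite, so the runtime
// never hits the "Not implemented" path on older cuDNN.
bool IsFp8ConvRewriteSupported() {
#if CUDA_VERSION >= 12000 && CUDNN_VERSION >= 8900
  return true;
#else
  return false;
#endif
}
```

With the cuDNN 8.7 stand-in above, the gate evaluates to false at preprocessing time, so the rewriter would simply skip the FP8 pattern instead of failing later.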
Can you resolve conflicts?
Imported from GitHub PR tensorflow/tensorflow#60807

Enables scaled convolutions of the form (X, W, x_scale, w_scale, y_scale) -> Y, where the input X, the filter W and the output Y are based on the `F8E4M3FN` and `F8E5M2` data types and x_scale, w_scale and y_scale are their scaling factors.

Copybara import of the project:

-- 8a30aa731c21612fe098a6b620a54922578611c2 by Philipp Hack <phack@nvidia.com>:
Support for FP8 convolutions in XLA.

-- caade6453519ad2531ebcf8f206e40187a1687ca by Philipp Hack <phack@nvidia.com>:
Support for FP8 convolutions in XLA.

-- ecd080bd6c64682f6bee62f4455ea2c37c279f26 by Philipp Hack <phack@nvidia.com>:
Support for FP8 convolutions in XLA.

-- da22a881a3d24fd4f357207034ba6c596aa414d0 by Philipp Hack <phack@nvidia.com>:
Support for FP8 convolutions in XLA.

Merging this change closes #60807

FUTURE_COPYBARA_INTEGRATE_REVIEW=tensorflow/tensorflow#60807 from philipphack:u_fp8_conv_xla da22a881a3d24fd4f357207034ba6c596aa414d0

PiperOrigin-RevId: 550604841