Amax for FP8 Convolutions in XLA #61460
Conversation
CC @reedwm.
```c++
if (std::find(uids_.begin(), uids_.end(), operand_uid) != uids_.end()) {
  auto user = user_uids_.find(operand_uid);
  if (user == user_uids_.end()) {
    return {tsl::errors::Internal("Unknown ID.")};
```
Don't think you need braces {...} here? And same in GetUserUIDs.
```c++
StatusOr<bool> F8GraphConv(HloComputation* comp, se::CudaComputeCapability cc) {
  bool changed = false;
```
I think the FP8 logic should be moved to its own pass in its own file, since it's independent of the non-FP8 logic, the FP8 logic is fairly long and substantial, and it uses the currently-FP8-specific concept of a GraphString.
This should be done in a separate PR though. I'm fine if this is done before or after this PR.
tensorflow/compiler/xla/service/gpu/cudnn_fused_conv_rewriter.cc (resolved comments)
```c++
// Set the algorithm and update the shape of the convolution Custom Call to
// account for the appropriate amount of scratch memory.
```
The original code was still careful to package the convolution output in a tuple whose second element was an empty tensor (see line 1072 of the original code). I assume this is because an instruction may use the empty tensor. You should keep that code here, even though you now update the custom call in place instead of creating a new one.
The last element in the tuple is the workspace. The current and the original implementation preserve the shapes of the tuple elements, which should make them compatible.
Sorry, I forgot to comment before, but the current implementation doesn't seem to preserve the shape of the last element in the tuple. Before, the last tuple element was empty, but with this PR the last tuple element is the size of the scratch space.
Isn't that the objective though, based on the comment in line 1039? Note that in the previous implementation, the second element of the tuple returned by new_call is not accessed.
The old code would replace

```
output_and_empty_array = (f16[1,1,1,1], u8[0]) custom-call(...)
conv_result = f16[1,1,1,1] get-tuple-element(output_and_empty_array), index=0
```

with

```
new_call = (f16[1,1,1,1], u8[123]) custom-call(...)
new_conv_result = f16[1,1,1,1] get-tuple-element(new_call), index=0
empty_constant = u8[0] constant({})
output_and_empty_array = (f16[1,1,1,1], u8[0]) tuple(new_conv_result, empty_constant)
conv_result = f16[1,1,1,1] get-tuple-element(output_and_empty_array), index=0
```
Before this PR, `output_and_empty_array` has the same shape both before and after this pass, which is (f16[1,1,1,1], u8[0]). Even though the custom-call itself now has a non-empty second tuple element, the pass carefully packages the output in a new tuple such that the second element is empty.
With this PR, you should make sure this is still done. The PR currently does not guarantee that the second element of new_call is not accessed. Some previous pass may have introduced a usage of the second element, so you should ensure an empty second element still exists.
new_call is introduced as an operand of new_tuple. Since instr (the original Custom Call) is replaced by the tuple (i.e. the user of the first element of new_call), I don't think that new_call's final element can have an existing user. The autotuning passes are followed by a pass of the TupleSimplifier, which I think will eliminate the dead element describing the workspace size.
```c++
TF_RET_CHECK(hlo->shape().tuple_shapes_size() == 2);
TF_RET_CHECK(hlo->shape().tuple_shapes(1).rank() == 1)
    << "Second element in a convolution tuple is expected to be an "
TF_RET_CHECK(hlo->shape().tuple_shapes().back().rank() == 1)
```
I would check that all but the first and last elements have rank zero. Then you can get rid of some of the for-loops you added that change the layout of all the outputs.
Admittedly, I am having trouble understanding this file and it would take me a long time to understand it and verify the changes you made, especially on a hypothetical non-scalar second output. Still, I think only changing the layout of the first element makes things simpler.
When we further generalize the graph convolutions, the result tensor may no longer be the first element of the tuple. The workspace will remain the final element though.
```c++
// Avoid replacing the Custom Call with an identical copy.
if (!performed_normalization &&
    ShapeUtil::Equal(normalized_shape, hlo->shape()) &&
    ConvolutionDimensionNumbersToString(new_dim_numbers) ==
        ConvolutionDimensionNumbersToString(dim_numbers)) {
  return std::nullopt;
}
```
Why avoid replacing the custom call with an identical copy, when we didn't have this logic before?
When the Custom Call returns a tuple, replacing the instruction with an identical copy can cause layout normalization to run in an endless loop.
```c++
GraphString init_graph_string = graph_string;
std::vector<HloInstruction*> init_operands = operands,
                             init_aux_outputs = aux_outputs;
int linear_users = 0, nonlinear_users = 0;
for (HloInstruction* user : instr->users()) {
```
I wonder if this logic would be clearer if, instead of checking whether each user matches a pattern, we check the op itself and then make a recursive call for each user. This has several advantages. For one, we don't have to explicitly "roll back" changes to local variables like init_graph_string; instead we can implicitly pass a copy of graph_string to the recursive call by value. Second, dealing with linear_users and nonlinear_users is easier, as we can immediately bail when we see multiple linear or nonlinear users.
Another alternative is not using recursion at all. Instead, perhaps we can structure this function as something like

```c++
HloInstruction* current_instr = ...;
while (current_instr) {
  HloInstruction* next_instr = nullptr;
  for (HloInstruction* user : current_instr->users()) {
    ...
    if (Match(...)) {
      graph_string.AppendOp(...);
      next_instr = user;
    }
    ...
  }
  current_instr = next_instr;
}
```
As I understand it, we would still have two states, and we would conditionally assign the state returned from the recursion to the return value of the current instance. On the other hand, having a separate loop over the users of an op after each successful match may make the overall structure a bit more complex (again, IIUC). If you think it's preferable, the current implementation also allows us to split the conditions for not fusing and to end the loop early if there is more than one linear or nonlinear user. We could also return early if there are more than two users.
Imported from GitHub PR tensorflow/tensorflow#61460

Extends the functionality of scaled convolutions operating on `F8E4M3FN` and `F8E5M2` data types to optionally return the scalar maximum of the absolute value (amax) of the result before quantization: (X, W, x_scale, w_scale, y_scale) -> (Y, y_amax).

Copybara import of the project:

-- 540b431aa58823a8908030b495a7e59ebdb6767c by Philipp Hack <phack@nvidia.com>: Calculation of Amax for FP8 convolutions.
-- 1ca35995d887ca74f5e0b7f68288121aeb08c6fa by Philipp Hack <phack@nvidia.com>: Calculation of Amax for FP8 convolutions.
-- cbf180cd7ab8a33441362839215ca60e5f291c03 by Philipp Hack <phack@nvidia.com>: Calculation of Amax for FP8 convolutions.
-- 5ca436f3a5cd33aee203e21c26631f9e279a66a5 by Philipp Hack <phack@nvidia.com>: Calculation of Amax for FP8 convolutions.
-- 2ada77def154a41597058d757639dedaa5ae95ec by Philipp Hack <phack@nvidia.com>: Calculation of Amax for FP8 convolutions.

Merging this change closes #61460

PiperOrigin-RevId: 556163674
Merged commit e61e7e0 into tensorflow:master