
Amax for FP8 Convolutions in XLA #61460

Merged (5 commits, Aug 12, 2023)

Conversation

philipphack (Contributor, Author):

Extends the functionality of scaled convolutions operating on F8E4M3FN and F8E5M2 data types to optionally return the scalar maximum of the absolute value (Amax) of the result before quantization:

(X, W, x_scale, w_scale, y_scale) -> (Y, y_amax).

google-ml-butler bot added the size:XL (CL Change Size: Extra Large) label on Aug 2, 2023
philipphack (Author):

CC @reedwm.

reedwm self-requested a review on Aug 3, 2023, 00:42
google-ml-butler bot added the awaiting review (Pull request awaiting review) label on Aug 3, 2023
gbaned added the comp:xla (XLA) label on Aug 3, 2023
gbaned added this to Assigned Reviewer in PR Queue via automation on Aug 3, 2023
tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc
if (std::find(uids_.begin(), uids_.end(), operand_uid) != uids_.end()) {
  auto user = user_uids_.find(operand_uid);
  if (user == user_uids_.end()) {
    return {tsl::errors::Internal("Unknown ID.")};
Member:

Don't think you need braces {...} here? And same in GetUserUIDs.
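
A minimal sketch of the brace-free form suggested here, assuming the enclosing function returns a tsl::Status or tsl::StatusOr to which the error status converts implicitly:

  auto user = user_uids_.find(operand_uid);
  if (user == user_uids_.end()) {
    // No list-initialization braces needed: the Status produced by
    // tsl::errors::Internal converts implicitly to the return type.
    return tsl::errors::Internal("Unknown ID.");
  }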

Comment on lines 538 to 539
StatusOr<bool> F8GraphConv(HloComputation* comp, se::CudaComputeCapability cc) {
bool changed = false;
Member:

I think the FP8 logic should be moved to its own pass in its own file: it is independent of the non-FP8 logic, it is fairly long and substantial, and it uses the currently-FP8-specific concept of a GraphString.

This should be done in a separate PR though. I'm fine if this is done before or after this PR.
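
A rough sketch of what a dedicated pass might look like, assuming the existing HloModulePass interface; the class name CudnnFusedConvF8Rewriter, the include paths, and the reuse of the F8GraphConv helper shown above are placeholders, not code from this PR:

  #include "tensorflow/compiler/xla/hlo/ir/hlo_module.h"
  #include "tensorflow/compiler/xla/service/hlo_pass_interface.h"
  #include "tensorflow/compiler/xla/status_macros.h"
  #include "tensorflow/compiler/xla/stream_executor/device_description.h"

  namespace xla {
  namespace gpu {

  // Hypothetical pass holding only the FP8 graph-capture and rewrite logic.
  class CudnnFusedConvF8Rewriter : public HloModulePass {
   public:
    explicit CudnnFusedConvF8Rewriter(se::CudaComputeCapability cc) : cc_(cc) {}
    absl::string_view name() const override {
      return "cudnn-fused-conv-f8-rewriter";
    }

    using HloPassInterface::Run;
    StatusOr<bool> Run(HloModule* module,
                       const absl::flat_hash_set<absl::string_view>&
                           execution_threads) override {
      bool changed = false;
      for (HloComputation* comp :
           module->MakeNonfusionComputations(execution_threads)) {
        // F8GraphConv would contain the FP8 logic moved out of the combined
        // rewriter.
        TF_ASSIGN_OR_RETURN(bool result, F8GraphConv(comp, cc_));
        changed |= result;
      }
      return changed;
    }

   private:
    se::CudaComputeCapability cc_;
  };

  }  // namespace gpu
  }  // namespace xla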

Comment on lines 1080 to 1081
// Set the algorithm and update the shape of the convolution Custom Call to
// account for the appropriate amount of scratch memory.
Member:

The original code was still careful to package the convolution output in a tuple whose second element was an empty tensor (see line 1072 of the original code). I assume this is because an instruction might use the empty tensor. You should keep that code here, even though you now update the custom call in place instead of creating a new one.

philipphack (Author):

The last element in the tuple is the workspace. Both the current and the original implementations preserve the shapes of the tuple elements, which should make them compatible.

Member:

Sorry, I forgot to comment on this before, but the current implementation doesn't seem to preserve the shape of the last element in the tuple. Before, the last tuple element was empty, but with this PR it has the size of the scratch space.

philipphack (Author):

Isn't that the objective, though, based on the comment on line 1039? Note that in the previous implementation, the second element of the tuple returned by new_call is not accessed.

Member:

The old code would replace

output_and_empty_array = (f16[1,1,1,1], u8[0]) custom-call(...)
conv_result = f16[1,1,1,1] get-tuple-element(output_and_empty_array), index=0

with

new_call = (f16[1,1,1,1], u8[123]) custom-call(...)
new_conv_result = f16[1,1,1,1] get-tuple-element(new_call), index=0
empty_constant = u8[0] constant({})
output_and_empty_array = (f16[1,1,1,1], u8[0]) tuple(new_conv_result, empty_constant)
conv_result = f16[1,1,1,1] get-tuple-element(output_and_empty_array), index=0

Before this PR, output_and_empty_array has the same shape both before and after this pass, which is (f16[1,1,1,1], u8[0]). Even though the custom-call itself now has a non-empty second tuple element, the pass carefully packages the output in a new tuple such that the second element is empty.

With this PR, you should make sure this is still done. The PR currently does not guarantee that the second element of new_call is not accessed. Some previous pass may have introduced a usage of the second element, so you should ensure an empty second element still exists.
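
A minimal sketch of the repackaging described above, assuming the usual HloInstruction factory helpers; computation, instr (the original custom call being replaced), and new_call (the autotuned call with the real scratch size) follow the names used in this discussion:

  const Shape& new_call_shape = new_call->shape();
  std::vector<HloInstruction*> elements;
  // Forward the real outputs (the convolution result and, with this PR, the
  // amax scalar) through get-tuple-elements of the new call.
  for (int i = 0; i < new_call_shape.tuple_shapes_size() - 1; ++i) {
    elements.push_back(computation->AddInstruction(
        HloInstruction::CreateGetTupleElement(new_call_shape.tuple_shapes(i),
                                              new_call, i)));
  }
  // Re-create the empty u8[0] workspace element so that any existing user of
  // the last tuple element still sees the original shape.
  elements.push_back(computation->AddInstruction(
      HloInstruction::CreateConstant(LiteralUtil::CreateR1<uint8_t>({}))));
  HloInstruction* repacked =
      computation->AddInstruction(HloInstruction::CreateTuple(elements));
  TF_RETURN_IF_ERROR(computation->ReplaceInstruction(instr, repacked));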

philipphack (Author):

new_call is introduced as an operand of new_tuple. Since instr (the original Custom Call) is replaced by the tuple (i.e. the user of the first element of new_call), I don't think new_call's final element can have an existing user. The autotuning passes are followed by a TupleSimplifier pass, which I think will eliminate the dead element describing the workspace size.

TF_RET_CHECK(hlo->shape().tuple_shapes_size() == 2);
TF_RET_CHECK(hlo->shape().tuple_shapes(1).rank() == 1)
    << "Second element in a convolution tuple is expected to be an "
TF_RET_CHECK(hlo->shape().tuple_shapes().back().rank() == 1)
Member:

I would check that all but the first and last elements have rank zero. Then you can get rid of some of the for-loops you added that change the layout of all the outputs.

Admittedly, I am having trouble understanding this file, and it would take me a long time to understand it and verify the changes you made, especially for a hypothetical non-scalar second output. Still, I think changing only the layout of the first element makes things simpler.
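
A short sketch of the suggested check, using the names from the diff above: the first element is the convolution result, the last is the workspace, and everything in between is expected to be a scalar:

  TF_RET_CHECK(hlo->shape().tuple_shapes().back().rank() == 1)
      << "The last element of a convolution tuple is expected to be the "
         "workspace.";
  for (int i = 1; i < hlo->shape().tuple_shapes_size() - 1; ++i) {
    TF_RET_CHECK(hlo->shape().tuple_shapes(i).rank() == 0)
        << "Intermediate convolution tuple elements are expected to be "
           "scalars.";
  }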

philipphack (Author):

When we further generalize the graph convolutions, the result tensor may no longer be the first element of the tuple. The workspace will remain the final element though.

Comment on lines +145 to +151
// Avoid replacing the Custom Call with an identical copy.
if (!performed_normalization &&
    ShapeUtil::Equal(normalized_shape, hlo->shape()) &&
    ConvolutionDimensionNumbersToString(new_dim_numbers) ==
        ConvolutionDimensionNumbersToString(dim_numbers)) {
  return std::nullopt;
}
Member:

Why avoid replacing the custom call with an identical copy, when we didn't have this logic before?

philipphack (Author):

When the Custom Call returns a tuple, replacing the instruction with an identical copy can cause layout normalization to run in an endless loop.

PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Aug 4, 2023
philipphack requested a review from reedwm on Aug 9, 2023, 00:39
GraphString init_graph_string = graph_string;
std::vector<HloInstruction*> init_operands = operands,
                             init_aux_outputs = aux_outputs;
int linear_users = 0, nonlinear_users = 0;
for (HloInstruction* user : instr->users()) {
Member:

I wonder if this logic could be clearer if, instead of checking whether each user matches a pattern, we checked the op itself and then made a recursive call for each user. This has several advantages. For one, we don't have to explicitly "roll back" changes to local variables like init_graph_string; instead we can implicitly pass a copy of graph_string to the recursive call by passing it by value. Second, dealing with linear_users and nonlinear_users is easier, as we can immediately bail when we see multiple linear or nonlinear users.
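
A toy illustration of the pass-by-value idea (self-contained example, not code from this PR): each recursive call gets its own copy of the accumulated description, so a branch that does not extend the graph never has to roll back the caller's state, and we can bail out as soon as a node has more than one matching user.

  #include <string>
  #include <vector>

  struct Node {
    std::string op;
    std::vector<Node*> users;
  };

  // Appends `node` to this call's private copy of `graph` and follows at most
  // one matching user; returns the captured chain.
  std::string CaptureChain(const Node* node, std::string graph) {
    graph += node->op + ";";  // mutates only the local copy
    const Node* next = nullptr;
    for (const Node* user : node->users) {
      if (user->op == "add" || user->op == "relu") {
        if (next != nullptr) return graph;  // multiple matching users: stop
        next = user;
      }
    }
    return next != nullptr ? CaptureChain(next, graph) : graph;
  }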

Another alternative is not using recursion at all. Instead, perhaps we can structure this function as something like:

HloInstruction* current_instr = ...
while (current_instr) {
  HloInstruction* next_instr = nullptr;
  for (HloInstruction* user : current_instr->users()) {
    ...
    if (Match(...)) {
      graph_string.AppendOp(...)
      next_instr = user;
    }
    ...
  }
  current_instr = next_instr;
}

philipphack (Author):

As I understand it, we would still have two states, and we would conditionally assign the state returned from the recursion to the return value of the current instance. On the other hand, having a separate loop over the users of an op after each successful match may make the general structure a bit more complex (again, if I understand correctly). If you think it's preferable, the current implementation also allows us to split the conditions for not fusing and to end the loop early when there is more than one linear or nonlinear user. We could also return early if there are more than two users.

PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Aug 11, 2023
google-ml-butler bot added the kokoro:force-run (Tests on submitted change) and ready to pull (PR ready for merge process) labels on Aug 11, 2023
kokoro-team removed the kokoro:force-run (Tests on submitted change) label on Aug 11, 2023
copybara-service bot pushed a commit to tensorflow/mlir-hlo that referenced this pull request Aug 12, 2023
Imported from GitHub PR tensorflow/tensorflow#61460

Extends the functionality of scaled convolutions operating on `F8E4M3FN` and `F8E5M2` data types to optionally return the scalar maximum of the absolute value (Amax) of the result before quantization:

(X, W, x_scale, w_scale, y_scale) -> (Y, y_amax).

Copybara import of the project:

--
540b431aa58823a8908030b495a7e59ebdb6767c by Philipp Hack <phack@nvidia.com>:

Calculation of Amax for FP8 convolutions.

--
1ca35995d887ca74f5e0b7f68288121aeb08c6fa by Philipp Hack <phack@nvidia.com>:

Calculation of Amax for FP8 convolutions.

--
cbf180cd7ab8a33441362839215ca60e5f291c03 by Philipp Hack <phack@nvidia.com>:

Calculation of Amax for FP8 convolutions.

--
5ca436f3a5cd33aee203e21c26631f9e279a66a5 by Philipp Hack <phack@nvidia.com>:

Calculation of Amax for FP8 convolutions.

--
2ada77def154a41597058d757639dedaa5ae95ec by Philipp Hack <phack@nvidia.com>:

Calculation of Amax for FP8 convolutions.

Merging this change closes #61460

PiperOrigin-RevId: 556163674
copybara-service bot pushed a commit to openxla/xla that referenced this pull request Aug 12, 2023
Imported from GitHub PR tensorflow/tensorflow#61460, with the same description and commit list as the tensorflow/mlir-hlo import above.

PiperOrigin-RevId: 556163674
copybara-service bot merged commit e61e7e0 into tensorflow:master on Aug 12, 2023
13 of 14 checks passed
PR Queue automation moved this from Approved by Reviewer to Merged Aug 12, 2023