Amax for FP8 Convolutions in XLA #61460
Conversation
CC @reedwm.
```c++
if (std::find(uids_.begin(), uids_.end(), operand_uid) != uids_.end()) {
  auto user = user_uids_.find(operand_uid);
  if (user == user_uids_.end()) {
    return {tsl::errors::Internal("Unknown ID.")};
```
Don't think you need braces {...} here? And same in GetUserUIDs.
```c++
StatusOr<bool> F8GraphConv(HloComputation* comp, se::CudaComputeCapability cc) {
  bool changed = false;
```
I think the FP8 logic should be moved to its own pass in its own file, since it's independent of the non-FP8 logic, the FP8 logic is fairly long and substantial, and it uses the currently-FP8-specific concept of a GraphString.
This should be done in a separate PR though. I'm fine if this is done before or after this PR.
tensorflow/compiler/xla/service/gpu/cudnn_fused_conv_rewriter.cc (resolved comments)
```c++
// Set the algorithm and update the shape of the convolution Custom Call to
// account for the appropriate amount of scratch memory.
```
The original code was still careful to package the convolution output in a tuple whose second element was an empty tensor (see line 1072 of the original code). I assume this is because an instruction may use the empty tensor. You should keep that code here, even though you now update the custom call in place instead of creating a new one.
The last element in the tuple is the workspace. The current and the original implementation preserve the shapes of the tuple elements, which should make them compatible.
Sorry, I forgot to comment before, but the current implementation doesn't seem to preserve the shape of the last element in the tuple. Before, the last tuple element was empty, but with this PR the last tuple element is the size of the scratch space.
Isn't that the objective though, based on the comment in line 1039? Note that in the previous implementation, the second element of the tuple returned by new_call is not accessed.
The old code would replace

```
output_and_empty_array = (f16[1,1,1,1], u8[0]) custom-call(...)
conv_result = f16[1,1,1,1] get-tuple-element(output_and_empty_array), index=0
```

with

```
new_call = (f16[1,1,1,1], u8[123]) custom-call(...)
new_conv_result = f16[1,1,1,1] get-tuple-element(new_call), index=0
empty_constant = u8[0] constant({})
output_and_empty_array = (f16[1,1,1,1], u8[0]) tuple(new_conv_result, empty_constant)
conv_result = f16[1,1,1,1] get-tuple-element(output_and_empty_array), index=0
```
Before this PR, `output_and_empty_array` has the same shape both before and after this pass, which is (f16[1,1,1,1], u8[0]). Even though the custom-call itself now has a non-empty second tuple element, the pass carefully packages the output in a new tuple such that the second element is empty.
With this PR, you should make sure this is still done. The PR currently does not guarantee that the second element of new_call is not accessed. Some previous pass may have introduced a usage of the second element, so you should ensure an empty second element still exists.
new_call is introduced as an operand of new_tuple. Since instr (the original Custom Call) is replaced by the tuple (i.e. the user of the first element of new_call), I don't think that new_call's final element can have an existing user. The autotuning passes are followed by a pass of the TupleSimplifier, which I think will eliminate the dead element describing the workspace size.
```c++
TF_RET_CHECK(hlo->shape().tuple_shapes_size() == 2);
TF_RET_CHECK(hlo->shape().tuple_shapes(1).rank() == 1)
    << "Second element in a convolution tuple is expected to be an "
TF_RET_CHECK(hlo->shape().tuple_shapes().back().rank() == 1)
```
I would check that all but the first and last elements have rank zero. Then you can get rid of some of the for-loops you added that change the layout of all the outputs.
Admittedly, I am having trouble understanding this file and it would take me a long time to understand it and verify the changes you made, especially on a hypothetical non-scalar second output. Still, I think only changing the layout of the first element makes things simpler.
When we further generalize the graph convolutions, the result tensor may no longer be the first element of the tuple. The workspace will remain the final element though.
```c++
// Avoid replacing the Custom Call with an identical copy.
if (!performed_normalization &&
    ShapeUtil::Equal(normalized_shape, hlo->shape()) &&
    ConvolutionDimensionNumbersToString(new_dim_numbers) ==
        ConvolutionDimensionNumbersToString(dim_numbers)) {
  return std::nullopt;
}
```
Why avoid replacing the custom call with an identical copy, when we didn't have this logic before?
When the Custom Call returns a tuple, replacing the instruction with an identical copy can cause layout normalization to run in an endless loop.
```c++
GraphString init_graph_string = graph_string;
std::vector<HloInstruction*> init_operands = operands,
                             init_aux_outputs = aux_outputs;
int linear_users = 0, nonlinear_users = 0;
for (HloInstruction* user : instr->users()) {
```
I wonder if this logic would be clearer if, instead of checking whether each user matches a pattern, we check the op itself and then make a recursive call for each user. This has several advantages. For one, we don't have to explicitly "roll back" changes to local variables like init_graph_string; instead we can implicitly pass a copy of graph_string to the recursive call by value. Second, dealing with linear_users and nonlinear_users is easier, as we can immediately bail when we see multiple linear or nonlinear users.
Another alternative is not using recursion at all. Instead, perhaps we can structure this function as something like

```c++
HloInstruction* current_instr = ...;
while (current_instr) {
  HloInstruction* next_instr = nullptr;
  for (HloInstruction* user : current_instr->users()) {
    ...
    if (Match(...)) {
      graph_string.AppendOp(...);
      next_instr = user;
    }
    ...
  }
  current_instr = next_instr;
}
```
As I understand it, we would still have two states, and we would conditionally assign the state returned from the recursion to the return value of the current instance. On the other hand, having a separate loop over the users of an op after each successful match may make the overall structure a bit more complex (again, IIUC). If you think it's preferable, the current implementation also allows us to split the conditions for not fusing and to end the loop early if there is more than one linear or nonlinear user. We could also return early if there are more than two users.
Imported from GitHub PR tensorflow/tensorflow#61460

Extends the functionality of scaled convolutions operating on `F8E4M3FN` and `F8E5M2` data types to optionally return the scalar maximum of the absolute value (amax) of the result before quantization: (X, W, x_scale, w_scale, y_scale) -> (Y, y_amax).

Copybara import of the project:

-- 540b431aa58823a8908030b495a7e59ebdb6767c by Philipp Hack <phack@nvidia.com>: Calculation of Amax for FP8 convolutions.
-- 1ca35995d887ca74f5e0b7f68288121aeb08c6fa by Philipp Hack <phack@nvidia.com>: Calculation of Amax for FP8 convolutions.
-- cbf180cd7ab8a33441362839215ca60e5f291c03 by Philipp Hack <phack@nvidia.com>: Calculation of Amax for FP8 convolutions.
-- 5ca436f3a5cd33aee203e21c26631f9e279a66a5 by Philipp Hack <phack@nvidia.com>: Calculation of Amax for FP8 convolutions.
-- 2ada77def154a41597058d757639dedaa5ae95ec by Philipp Hack <phack@nvidia.com>: Calculation of Amax for FP8 convolutions.

Merging this change closes #61460

PiperOrigin-RevId: 556163674
Merged commit e61e7e0 into tensorflow:master