
Add codegen for embedding backward meta functions #2347

Closed · wants to merge 1 commit

Conversation

q10 (Contributor) commented Feb 22, 2024

Summary:
Adding embedding backward meta codegen functions.

Moved the memory-alignment handling that previously sat outside the CUDA kernel into the custom operator, since we could not write a symbolic version of the pointer-alignment checks.

Tests are changed to enable compilation only for adagrad; the other tests are run to ensure they continue to work properly.

Fixes to allow compilation for the unweighted kernels and for CPU are still missing, so those cases are excluded from the tests.

Differential Revision: D53674518

@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D53674518

netlify bot commented Feb 22, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

Latest commit: 4aa3883
Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/65deb93f4c07b90008dd0cea
Deploy Preview: https://deploy-preview-2347--pytorch-fbgemm-docs.netlify.app

@@ -270,8 +272,13 @@ def execute_backward_adagrad_( # noqa C901
output_dtype=output_dtype,
)

# TODO: make it compile for CPU and unweighted
if compile and not use_cpu and weighted:
    cc = torch.compile(cc)

consider adding fullgraph=True here
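
A minimal sketch of the suggested change, assuming cc is the callable being compiled in this test; fullgraph=True makes torch.compile raise on graph breaks instead of silently falling back to eager:

# Sketch only: surface graph breaks as hard errors rather than silent fallbacks.
if compile and not use_cpu and weighted:
    cc = torch.compile(cc, fullgraph=True)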

#x " must have the same number of elements as " #y " They had ", \
(x).sym_numel(), \
" and ", \
(y).sym_numel())

A more robust version of this, which works better with unbacked SymInts, is:

TORCH_SYM_CHECK(x.sym_numel().sym_eq(y.sym_numel()))

for more explanation on what this is doing, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.jqnrfurlygn5
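
A hedged sketch of how the surrounding check macro might read with the suggested symbolic comparison; the macro name and message text here are illustrative, not the actual FBGEMM definitions:

// Illustrative macro body: compare element counts symbolically so unbacked
// SymInts do not have to be resolved to concrete values to pass the check.
#define TENSORS_HAVE_SAME_SYM_NUMEL(x, y)                                 \
  TORCH_SYM_CHECK(                                                        \
      (x).sym_numel().sym_eq((y).sym_numel()),                            \
      #x " must have the same number of elements as " #y ". They had ",   \
      (x).sym_numel(), " and ", (y).sym_numel())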

q10 pushed a commit to q10/FBGEMM that referenced this pull request on Feb 23, 2024 (same summary as above; Reviewed By: q10, Microve; Differential Revision: D53674518).

q10 pushed a commit to q10/FBGEMM that referenced this pull request on Feb 23, 2024 (Pull Request resolved: pytorch#2347; same summary as above; Reviewed By: q10, Microve; Differential Revision: D53674518).

}
if (reinterpret_cast<uint64_t>(grad_output.data_ptr()) % 16 != 0) {
aligned_grad_output = at::empty_like(grad_output).copy_(grad_output);
}

I guess the code got moved here. I guess you're only running this inside of the CUDA kernel now?


One potential hazard to be aware of when doing a transform like this is that the operator you moved this logic into may be differentiable. The backward in that case may have been relying on the input being guaranteed to be aligned, including the saved copy for backwards.

@@ -253,7 +265,7 @@ Tensor {{ ddesc }}_embedding_codegen_grad_indice_weights{{ vdesc }}_cuda(
   const auto total_B = offsets.size(0) - 1;
   TORCH_CHECK_GE(total_B, 0);
   TORCH_CHECK_LE(max_D, {{ max_embedding_dim }});
-  auto grad_indice_weights = empty_like(indices, indices.options().dtype(at::toAccumulateType(grad_output.scalar_type(), true)));
+  auto grad_indice_weights = empty_like(indices, indices.options().dtype(at::toAccumulateType(aligned_grad_output.scalar_type(), true)));

I'm not going to carefully audit that you updated all the downstream use sites. To make it obvious you didn't do it wrong, change the input parameter's name, and once you align the grad output, assign it to grad_output; then there is no diff afterwards.


grad_output is const here, so I need to create a new variable.
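
A hypothetical sketch of the pattern being suggested, working around the const parameter by giving it a placeholder name and rebinding a local grad_output; the function name and signature are illustrative, not the generated code:

// Illustrative only: every downstream use keeps referring to `grad_output`,
// which now always points at a 16-byte-aligned tensor.
Tensor grad_indice_weights_cuda_sketch(const Tensor& grad_output_in) {
  Tensor grad_output = grad_output_in;
  if (reinterpret_cast<uint64_t>(grad_output.data_ptr()) % 16 != 0) {
    grad_output = at::empty_like(grad_output).copy_(grad_output);
  }
  // ... existing body continues to use `grad_output` with no further diff ...
  return grad_output;
}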

) {

const auto T = D_offsets.sym_size(0) - 1;
TORCH_CHECK_GT(T, 0);

We can also potentially make these more unbacked symint friendly, but happy to leave this for later too.
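
For example (a sketch, not part of this PR), the guard above could be phrased symbolically so an unbacked SymInt for T never needs a concrete value; the message text is illustrative:

// Sketch: symbolic form of the T > 0 guard.
const auto T = D_offsets.sym_size(0) - 1;
TORCH_SYM_CHECK(T.sym_gt(0), "expected at least one embedding table, got T = ", T);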


auto grad_indice_weights = empty_like(indices, indices.options().dtype(at::toAccumulateType(grad_output.scalar_type(), true)));

return grad_indice_weights;

I'm trusting you that this accurately reflects the original logic ;)

}
if (reinterpret_cast<uint64_t>(grad_output.data_ptr()) % 16 != 0) {
aligned_grad_output = at::empty_like(grad_output).copy_(grad_output);
}

duped! Maybe factor this out to a helper?
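
One way the duplicated block could be factored out, as a hypothetical helper (name and placement are illustrative):

// Hypothetical helper: return grad_output unchanged when its storage is
// already 16-byte aligned, otherwise return an aligned copy.
static inline Tensor align_grad_output(const Tensor& grad_output) {
  if (reinterpret_cast<uint64_t>(grad_output.data_ptr()) % 16 != 0) {
    return at::empty_like(grad_output).copy_(grad_output);
  }
  return grad_output;
}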

{%- endif %}

// short-circuit if there are zero indices.
if (indices.sym_numel() == 0) {
@ezyang ezyang (Contributor) left a comment

Thanks, nice work!

q10 pushed further commits to q10/FBGEMM that referenced this pull request on Feb 27 and Feb 28, 2024 (same summary as above; Reviewed By: ezyang, q10, Microve; Differential Revision: D53674518).
@facebook-github-bot (Contributor) commented:

This pull request has been merged in 5f48fbd.
