Avoid cloning gradient tensor in embedding backward pass (#2526)
Summary:
Pull Request resolved: #2526

I found a memory spike in the embedding backward kernel `split_embedding_backward_codegen_rowwise_adagrad_unweighted_exact_cuda`, which I traced to the code below making a clone of the gradient tensor. This clone does not appear in the original code: https://github.com/pytorch/FBGEMM/pull/2347/files#diff-944ab49dcbcf54826cc3e1eab5e3c0c787b5a195f602c2d3052adae14c506d78.
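
For intuition on why the clone shows up as a spike, here is a small standalone ATen snippet (hypothetical shape, not taken from the PR) contrasting a plain assignment, which only aliases the existing storage, with `clone()`, which allocates and copies the whole gradient tensor on every backward call:

#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Hypothetical gradient tensor; the shape is only for illustration.
  auto grad_output = at::randn({8192, 1024});  // ~32 MiB of FP32 data

  auto aliased = grad_output;          // shares storage, no new allocation
  auto cloned = grad_output.clone();   // allocates and copies another ~32 MiB

  std::cout << std::boolalpha
            << "aliased shares storage: " << aliased.is_alias_of(grad_output) << "\n"
            << "cloned shares storage:  " << cloned.is_alias_of(grad_output) << "\n";
  return 0;
}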

Reviewed By: ezyang

Differential Revision: D56420646

fbshipit-source-id: a4e3fd6952cdaa4f1a3339980151f5dc1ce6c436
jhadidjojo authored and facebook-github-bot committed Apr 24, 2024
1 parent 0fea06c commit a75037b
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion fbgemm_gpu/include/fbgemm_gpu/sparse_ops_utils.h
@@ -466,7 +466,7 @@ struct StackArray {

 inline at::Tensor aligned_grad_output_tensor_for_cuda_backwards(
     const at::Tensor& grad_output) {
-  auto aligned_grad_output = grad_output.clone();
+  auto aligned_grad_output = grad_output;
   // FIXME: to support aligned memory access in Vec4T load/store function
   // 16 for FP32 and 8 for FP16
   if (grad_output.dim() > 1 &&
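
The hunk above shows only the surrounding context, and the rest of the helper is not included in this diff. Below is a rough, non-authoritative sketch of the intended shape after the change, assuming the hidden lines check pointer alignment and strides and fall back to a copy only when the layout is unsuitable for vectorized Vec4T access (the exact checks and the fallback are illustrative, not the verbatim FBGEMM source):

#include <ATen/ATen.h>
#include <cstdint>

// Illustrative sketch, not the actual FBGEMM implementation: pass the
// gradient tensor through untouched when it is already usable, and copy it
// only when its layout would break aligned Vec4T loads/stores.
inline at::Tensor aligned_grad_output_sketch(const at::Tensor& grad_output) {
  auto aligned_grad_output = grad_output;  // no unconditional clone
  // Vec4T wants 16-byte aligned FP32 and 8-byte aligned FP16 accesses.
  if (grad_output.dim() > 1 &&
      (reinterpret_cast<uintptr_t>(grad_output.data_ptr()) % 16 != 0 ||
       grad_output.stride(1) != 1 || grad_output.stride(0) % 4 != 0)) {
    // Allocate a fresh, contiguous, well-aligned buffer and copy into it,
    // only when strictly necessary.
    aligned_grad_output = at::empty_like(grad_output).copy_(grad_output);
  }
  return aligned_grad_output;
}

With the unconditional clone removed, an already contiguous and suitably aligned gradient tensor is returned without any extra allocation, which is what removes the memory spike described in the summary.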