Avoid cloning gradient tensor in embedding backward pass (#2526)
Summary:
Pull Request resolved: #2526

I found a memory spike in the embedding backward kernel `split_embedding_backward_codegen_rowwise_adagrad_unweighted_exact_cuda`, which I traced to the code below making a clone of the gradient tensor. This clone does not appear in the original code: https://github.com/pytorch/FBGEMM/pull/2347/files#diff-944ab49dcbcf54826cc3e1eab5e3c0c787b5a195f602c2d3052adae14c506d78.
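
For intuition on why the clone shows up as a spike, here is a small standalone ATen snippet (hypothetical shape, not taken from the PR) contrasting a plain assignment, which only aliases the existing storage, with `clone()`, which allocates and copies the whole gradient tensor on every backward call:

#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Hypothetical gradient tensor; the shape is only for illustration.
  auto grad_output = at::randn({8192, 1024});  // ~32 MiB of FP32 data

  auto aliased = grad_output;          // shares storage, no new allocation
  auto cloned = grad_output.clone();   // allocates and copies another ~32 MiB

  std::cout << std::boolalpha
            << "aliased shares storage: " << aliased.is_alias_of(grad_output) << "\n"
            << "cloned shares storage:  " << cloned.is_alias_of(grad_output) << "\n";
  return 0;
}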

Reviewed By: ezyang

Differential Revision: D56420646

fbshipit-source-id: a4e3fd6952cdaa4f1a3339980151f5dc1ce6c436
jhadidjojo authored and facebook-github-bot committed Apr 24, 2024
1 parent 0fea06c commit a75037b
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion fbgemm_gpu/include/fbgemm_gpu/sparse_ops_utils.h
@@ -466,7 +466,7 @@ struct StackArray {

 inline at::Tensor aligned_grad_output_tensor_for_cuda_backwards(
     const at::Tensor& grad_output) {
-  auto aligned_grad_output = grad_output.clone();
+  auto aligned_grad_output = grad_output;
   // FIXME: to support aligned memory access in Vec4T load/store function
   // 16 for FP32 and 8 for FP16
   if (grad_output.dim() > 1 &&
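
The hunk above shows only the surrounding context, and the rest of the helper is not included in this diff. Below is a rough, non-authoritative sketch of the intended shape after the change, assuming the hidden lines check pointer alignment and strides and fall back to a copy only when the layout is unsuitable for vectorized Vec4T access (the exact checks and the fallback are illustrative, not the verbatim FBGEMM source):

#include <ATen/ATen.h>
#include <cstdint>

// Illustrative sketch, not the actual FBGEMM implementation: pass the
// gradient tensor through untouched when it is already usable, and copy it
// only when its layout would break aligned Vec4T loads/stores.
inline at::Tensor aligned_grad_output_sketch(const at::Tensor& grad_output) {
  auto aligned_grad_output = grad_output;  // no unconditional clone
  // Vec4T wants 16-byte aligned FP32 and 8-byte aligned FP16 accesses.
  if (grad_output.dim() > 1 &&
      (reinterpret_cast<uintptr_t>(grad_output.data_ptr()) % 16 != 0 ||
       grad_output.stride(1) != 1 || grad_output.stride(0) % 4 != 0)) {
    // Allocate a fresh, contiguous, well-aligned buffer and copy into it,
    // only when strictly necessary.
    aligned_grad_output = at::empty_like(grad_output).copy_(grad_output);
  }
  return aligned_grad_output;
}

With the unconditional clone removed, an already contiguous and suitably aligned gradient tensor is returned without any extra allocation, which is what removes the memory spike described in the summary.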