Conversation

xmfan (Member) commented Aug 9, 2024

Stack from ghstack (oldest at bottom):

FIXES #132939

Compiled autograd's trace of the AOT backward may introduce some additional ops, e.g. a clone to make a tensor contiguous or trace_wrapped HOPs, so the two graphs may be slightly offset from each other.
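
For instance, the `clone` at the top of `CompiledFunctionBackward1` in the graph below appears because `SumBackward0` broadcasts the incoming scalar grad with `expand`, which yields a stride-0 (non-contiguous) tensor, while the AOT backward expects a contiguous tangent. A standalone illustration of that effect (not code from this PR):

```python
import torch

g = torch.ones(())    # incoming grad of a scalar loss
t = g.expand(4)       # what SumBackward0 produces
print(t.stride(), t.is_contiguous())  # (0,) False
t = t.clone(memory_format=torch.contiguous_format)
print(t.stride(), t.is_contiguous())  # (1,) True
```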

hf_Whisper example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpNv89Pu/index.html
fsdp2 example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpPdKssS/rank_0/index.html
Unit test example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpvoQsnl/index.html
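
For reference, the `File:` comments in the graphs below point at a test of roughly this shape. This is a sketch reconstructed from those stack traces; the exact test body and compiler wiring may differ:

```python
import torch

def f(x):
    tmp1 = x.sin()   # test_compiled_autograd.py:2230
    tmp2 = x.cos()   # test_compiled_autograd.py:2231
    torch._dynamo.graph_break()  # splits the backward into aot1/aot0
    return tmp1.sin() + tmp2.cos()  # test_compiled_autograd.py:2233

x = torch.randn(4, requires_grad=True)
compiler_fn = lambda gm: torch.compile(gm, backend="inductor")
with torch._dynamo.compiled_autograd.enable(compiler_fn):
    torch.compile(f)(x).sum().backward()
```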

```python
 ===== Compiled autograd graph =====
 <eval_with_key>.14 class CompiledAutograd(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem: "f32[]cpu" = inputs[0]
        aot1_primals_1: "f32[4]cpu" = inputs[1]
        aot1_primals_2: "f32[4]cpu" = inputs[2]
        aot0_sin: "f32[4]cpu" = inputs[3]
        aot0_cos: "f32[4]cpu" = inputs[4]
        getitem_5: "f32[4]cpu" = inputs[5];  inputs = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: SumBackward0 (NodeCall 1)
        expand: "f32[4]cpu" = torch.ops.aten.expand.default(getitem, [4]);  getitem = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward1 (NodeCall 2)
        aot1_tangents_1: "f32[4]cpu" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        aot1_sin_1: "f32[4]cpu" = torch.ops.aten.sin.default(aot1_primals_2);  aot1_primals_2 = None
        aot1_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot1_sin_1);  aot1_sin_1 = None
        aot0_tangents_2: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_neg);  aot1_neg = None
        aot1_cos_1: "f32[4]cpu" = torch.ops.aten.cos.default(aot1_primals_1);  aot1_primals_1 = None
        aot0_tangents_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_cos_1);  aot1_tangents_1 = aot1_cos_1 = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 3)
        aot0_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot0_sin);  aot0_sin = None
        aot0_mul: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_2, aot0_neg);  aot0_tangents_2 = aot0_neg = None
        aot0_mul_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_1, aot0_cos);  aot0_tangents_1 = aot0_cos = None
        aot0_add: "f32[4]cpu" = torch.ops.aten.add.Tensor(aot0_mul, aot0_mul_1);  aot0_mul = aot0_mul_1 = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: torch::autograd::AccumulateGrad (NodeCall 4)
        accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_5, aot0_add);  getitem_5 = aot0_add = accumulate_grad_ = None
        _exec_final_callbacks_stub = torch__dynamo_external_utils__exec_final_callbacks_stub();  _exec_final_callbacks_stub = None
        return []
```

where aot1 is

```python
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4][1]cpu", primals_2: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2233 in torch_dynamo_resume_in_f_at_2232, code: return tmp1.sin() + tmp2.cos()
        sin_1: "f32[4][1]cpu" = torch.ops.aten.sin.default(primals_2);  primals_2 = None
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin_1);  sin_1 = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, neg);  neg = None
        cos_1: "f32[4][1]cpu" = torch.ops.aten.cos.default(primals_1);  primals_1 = None
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos_1);  tangents_1 = cos_1 = None
        return (mul_1, mul)
```

and aot0 is

```python
class GraphModule(torch.nn.Module):
    def forward(self, sin: "f32[4][1]cpu", cos: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu", tangents_2: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2231 in f, code: tmp2 = x.cos()
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin);  sin = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_2, neg);  tangents_2 = neg = None

         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos);  tangents_1 = cos = None

         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        add: "f32[4][1]cpu" = torch.ops.aten.add.Tensor(mul, mul_1);  mul = mul_1 = None
        return (add,)
```
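
Mechanically, the change amounts to reusing each AOT backward node's name inside the compiled autograd graph, namespaced by an `aot{id}_` prefix. A minimal sketch of that renaming idea on a standalone FX graph (illustrative only; `prefix_node_names` is a hypothetical helper, not the hook this PR adds inside compiled autograd's tracing):

```python
import torch
import torch.fx as fx

def prefix_node_names(gm: fx.GraphModule, aot_id: int) -> fx.GraphModule:
    # Rename every node (except the output) so it lines up with the
    # corresponding AOTDispatcher graph dump, e.g. mul_1 -> aot1_mul_1.
    for node in gm.graph.nodes:
        if node.op != "output":
            node.name = f"aot{aot_id}_{node.name}"
    gm.recompile()  # regenerate Python code with the new names
    return gm

gm = fx.symbolic_trace(lambda x: x.sin().cos())
print([n.name for n in prefix_node_names(gm, 1).graph.nodes])
# ['aot1_x', 'aot1_sin', 'aot1_cos', 'output']
```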

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Aug 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133148

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 09bf218 with merge base e7b870c:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

xmfan marked this pull request as ready for review August 9, 2024 23:49
xmfan requested review from bdhirsh and jansel August 9, 2024 23:49
xmfan added a commit that referenced this pull request Aug 9, 2024
xmfan marked this pull request as draft August 10, 2024 00:03
xmfan changed the title from "[compiled autograd] rename AOTAutograd primals graph nodes" to "[compiled autograd] rename AOTDispatcher graph nodes" Aug 13, 2024
xmfan changed the title from "[compiled autograd] rename AOTDispatcher graph nodes" to "[compiled autograd] use same graph node names as AOTDispatcher" Aug 13, 2024
xmfan marked this pull request as ready for review August 13, 2024 23:14
xmfan marked this pull request as draft August 13, 2024 23:56
xmfan added a commit that referenced this pull request Aug 14, 2024
ghstack-source-id: 07f5b41
Pull Request resolved: #133148
pytorch-bot bot added the oncall: distributed and release notes: distributed (fsdp) labels Aug 14, 2024
xmfan marked this pull request as ready for review August 15, 2024 00:40
…cher"


FIXES #132939


Compiled autograd's trace of the AOT backward may result in some additional ops e.g. clone to make contiguous, trace_wrapped HOPs, so the graphs may be slightly offset from each other


hf_Whisper example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpNv89Pu/index.html
fsdp2 example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpPdKssS/rank_0/index.html
Unit test example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpvoQsnl/index.html
```python
 ===== Compiled autograd graph =====                                                                                                                                                          
 <eval_with_key>.14 class CompiledAutograd(torch.nn.Module):                                                                                                                                  
    def forward(self, inputs, sizes, scalars, hooks):                                                                                                                                         
        # No stacktrace found for following nodes                                                                                                                                             
        getitem: "f32[]cpu" = inputs[0]                                                                                                                                                       
        aot1_primals_1: "f32[4]cpu" = inputs[1]                                                                                                                                               
        aot1_primals_2: "f32[4]cpu" = inputs[2]                                                                                                                                               
        aot0_sin: "f32[4]cpu" = inputs[3]                                                                                                                                                     
        aot0_cos: "f32[4]cpu" = inputs[4]                                                                                                                                                     
        getitem_5: "f32[4]cpu" = inputs[5];  inputs = None                                                                                                                                    
                                                                                                                                                                                              
         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: SumBackward0 (NodeCall 1)                                                       
        expand: "f32[4]cpu" = torch.ops.aten.expand.default(getitem, [4]);  getitem = None                                                                                                    
                                                                                                                                                                                              
         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward1 (NodeCall 2)                                          
        aot1_tangents_1: "f32[4]cpu" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None                                                          
        aot1_sin_1: "f32[4]cpu" = torch.ops.aten.sin.default(aot1_primals_2);  aot1_primals_2 = None                                                                                          
        aot1_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot1_sin_1);  aot1_sin_1 = None                                                                                                    
        aot0_tangents_2: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_neg);  aot1_neg = None                                                                                 
        aot1_cos_1: "f32[4]cpu" = torch.ops.aten.cos.default(aot1_primals_1);  aot1_primals_1 = None                                                                                          
        aot0_tangents_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_cos_1);  aot1_tangents_1 = aot1_cos_1 = None                                                           
                                                                                                                                                                                              
         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 3)                                          
        aot0_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot0_sin);  aot0_sin = None                                                                                                        
        aot0_mul: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_2, aot0_neg);  aot0_tangents_2 = aot0_neg = None                                                                      
        aot0_mul_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_1, aot0_cos);  aot0_tangents_1 = aot0_cos = None                                                                    
        aot0_add: "f32[4]cpu" = torch.ops.aten.add.Tensor(aot0_mul, aot0_mul_1);  aot0_mul = aot0_mul_1 = None                                                                                
                                                                                                                                                                                              
         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: torch::autograd::AccumulateGrad (NodeCall 4)                                    
        accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_5, aot0_add);  getitem_5 = aot0_add = accumulate_grad_ = None                                                  
        _exec_final_callbacks_stub = torch__dynamo_external_utils__exec_final_callbacks_stub();  _exec_final_callbacks_stub = None                                                            
        return []   
```

where aot1 is
```python
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4][1]cpu", primals_2: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2233 in torch_dynamo_resume_in_f_at_2232, code: return tmp1.sin() + tmp2.cos()
        sin_1: "f32[4][1]cpu" = torch.ops.aten.sin.default(primals_2);  primals_2 = None
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin_1);  sin_1 = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, neg);  neg = None
        cos_1: "f32[4][1]cpu" = torch.ops.aten.cos.default(primals_1);  primals_1 = None
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos_1);  tangents_1 = cos_1 = None
        return (mul_1, mul)
```

and aot0 is
```python
class GraphModule(torch.nn.Module):
    def forward(self, sin: "f32[4][1]cpu", cos: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu", tangents_2: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2231 in f, code: tmp2 = x.cos()
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin);  sin = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_2, neg);  tangents_2 = neg = None
        
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos);  tangents_1 = cos = None
        
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        add: "f32[4][1]cpu" = torch.ops.aten.add.Tensor(mul, mul_1);  mul = mul_1 = None
        return (add,)
```
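
For context, here is a minimal sketch of the test function that produces these two graphs, reconstructed from the source-line comments above (the explicit `graph_break()` and the `aot_eager` backend are assumptions, used to force `f` to split into the two compiled regions `aot0` and `aot1`):
```python
import torch

@torch.compile(backend="aot_eager")
def f(x):
    tmp1 = x.sin()  # aot0 backward: mul_1 = tangents_1 * cos
    tmp2 = x.cos()  # aot0 backward: mul = tangents_2 * -sin
    torch._dynamo.graph_break()  # assumed: splits f into two AOT graphs
    return tmp1.sin() + tmp2.cos()  # aot1 backward (resume function)

x = torch.randn(4, requires_grad=True)
# Running backward under compiled autograd (torch._dynamo.compiled_autograd.enable)
# produces the single stitched graph shown above, with aot0_/aot1_ prefixes
# mapping each node back to its originating AOT backward graph.
f(x).sum().backward()
```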

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Aug 15, 2024
ghstack-source-id: ea6af35
Pull Request resolved: #133148

```python
def is_similar(a: torch.fx.node.Node, b: torch.fx.node.Node):
    if callable(a.target) and callable(b.target):
        target_match = a.target.__qualname__ == b.target.__qualname__
```
Contributor:
Are all callables in Python guaranteed to have a `__qualname__`? (Including C extension functions?)

Member Author:
Any function, yes; but for instances of classes that define `__call__`, we have to set it manually. This actually breaks for `HigherOrderOperator`, which doesn't set `__qualname__` — I'll move the fix into this PR.

Contributor:
Maybe `getattr(a.target, "__qualname__", "<unset>")`?

Member Author:
Gonna use `__name__` instead, since we still want to rename HOPs.
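
A minimal sketch of what the resulting comparison could look like, assuming a `getattr` fallback for targets that define neither attribute (names here are illustrative, not the exact code landed in the PR):
```python
import torch.fx

def is_similar(a: torch.fx.node.Node, b: torch.fx.node.Node) -> bool:
    if callable(a.target) and callable(b.target):
        # __name__ is set on renamed HOPs even when __qualname__ is not;
        # fall back to a sentinel for callables that define neither.
        target_match = (
            getattr(a.target, "__name__", "<unset>")
            == getattr(b.target, "__name__", "<unset>")
        )
    else:
        target_match = a.target == b.target
    return target_match and a.op == b.op and len(a.args) == len(b.args)
```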

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 26, 2024
@github-actions github-actions bot deleted the gh/xmfan/76/head branch September 28, 2024 02:08
pytorchmergebot pushed a commit that referenced this pull request Jan 7, 2025
…44202)

This error started popping up in HUD CA benchmarks:
```python
 File "/data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py", line 371, in dce
    self.fx_tracer.graph.eliminate_dead_code(is_impure)
  File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1862, in eliminate_dead_code
    self.lint()
  File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1753, in lint
    raise RuntimeError(f"Node redefined name {node.name}!")
RuntimeError: Node redefined name aot0_expand!
```

We added CA initial capture's renaming (#133148) to help debug issues with the AOT backward, but it errors out when we have multiple instances of the same AOT backward. This likely only showed up now because of increased hierarchical graph reuse. I fix it by adding a postfix counter to the node name.
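
A minimal sketch of the postfix-counter idea (illustrative only; the class and counter bookkeeping here are assumptions, not the exact code in the fix):
```python
from collections import defaultdict

class NodeRenamer:
    """Prefix node names with their AOT graph id, deduplicating repeats."""

    def __init__(self) -> None:
        self._counts: defaultdict[str, int] = defaultdict(int)

    def rename(self, aot_id: int, node_name: str) -> str:
        base = f"aot{aot_id}_{node_name}"
        count = self._counts[base]
        self._counts[base] += 1
        # The first occurrence keeps the plain name; later occurrences get
        # a postfix counter so FX lint never sees the same name redefined.
        return base if count == 0 else f"{base}_{count}"

renamer = NodeRenamer()
assert renamer.rename(0, "expand") == "aot0_expand"
assert renamer.rename(0, "expand") == "aot0_expand_1"  # second instance of the same AOT backward
```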

Pull Request resolved: #144202
Approved by: https://github.com/bdhirsh, https://github.com/jansel