Add the counter check for dynamo tests #4603
|
Here is what I understand from the pending graph:
This looks very much like gradient accumulation done by the autograd engine. |
|
@shunting314 yeah, the part that's confusing to me is that I would expect the backward graph to contain the gradient accumulation, but it might be in the optimizer graph instead? |
|
Gradient accumulation should happen in the backward graph. |
|
I think gradient accumulation for one parameter happens when autograd determines that all the contributors to that parameter's grad are ready. In a simple case, there may be only one producer of a grad for a parameter, so you would expect autograd to fire its AccumulateGrad for that parameter soon after the backward op that produced the grad has finished. I believe that if autograd's AccumulateGrad is ready to fire while we are tracing the backward graph, we are likely to capture it in the backward graph. Note that I suspect a race condition here: if we 'exit' the dynamo-AOT backward trace phase right before autograd decides to fire its hook, then instead of being included in our backward graph it would end up in its own graph. A separate reason that could cause a delayed AccumulateGrad is other forward ops using the parameter in question, meaning the corresponding separate backward ops all have to finish before AccumulateGrad can fire. So it would be good to confirm which case it is. |
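A small eager-mode sketch of the second case described above, where one parameter is used by two forward ops. It is only an illustration on a toy tensor (the names `w`, `x1`, `x2` are hypothetical, and nothing here goes through dynamo or XLA): the hook on the leaf should run once, with the sum of both contributions, because autograd waits for every producer before accumulating.

```python
import torch

# Toy parameter used by two separate forward ops (hypothetical example).
w = torch.ones(3, requires_grad=True)
x1 = torch.tensor([1.0, 2.0, 3.0])
x2 = torch.tensor([10.0, 20.0, 30.0])

seen = []  # gradients observed by the hook on w

# A tensor hook on a leaf runs right before its gradient is accumulated
# into .grad, i.e. after all producers of that gradient have finished.
w.register_hook(lambda g: seen.append(g.clone()))

loss = (w * x1).sum() + (w * x2).sum()
loss.backward()

print(len(seen))  # number of times the hook fired
print(w.grad)     # accumulated gradient, x1 + x2
```

If the hook fires later than expected in a real model, extra uses of the parameter in the forward pass are one thing to look for.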
|
Thanks @wconstab for the input. I am wondering what the best way is to figure out which case we are in. AFAICT, dynamo only passes us two graphs, one for forward and one for backward, and that part of the logic is not controlled by the existing dynamo bridge. Should I dig into AOTAutograd's code, figure out how it decides to pass XLA the backward graph, and force it to sleep for a while to make sure everything is captured, or is there another way to achieve a similar goal? |
|
One tool you can play with is dumping the autograd graph. You'll be better off doing this on a toy model if you can repro there.
The graph dump idea is mainly to visualize which nodes are contributing grads to a given parameter. If you have more nodes contributing than you expected, maybe you are in the second case.
I don't know what a clean solution would look like. Indeed, the thing that fires AccumulateGrad hooks during a regular .backward() is a separate C++ thread in the autograd engine, so it can come at any time with respect to Python's tracing. There is probably some hook that autograd fires after finishing all grad accumulation; you could register a callback for it in Python and wait for that before you exit the dynamo backward trace. But make sure there is no valid case where grads are not all expected to be ready before you wait. |
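One possible way to do the graph dump suggested above on a toy model is to walk `grad_fn.next_functions` and count how many backward nodes feed each leaf's AccumulateGrad node. This is a rough sketch, not an established tool: `count_grad_contributors` is a hypothetical helper, and it assumes AccumulateGrad nodes can be recognized by their `variable` attribute.

```python
import torch
from collections import Counter

def count_grad_contributors(root):
    """Count, per leaf tensor, the backward nodes feeding its AccumulateGrad.

    Returns a Counter keyed by id(leaf_tensor).
    """
    counts = Counter()
    seen = set()
    stack = [root.grad_fn]
    while stack:
        node = stack.pop()
        if node is None or node in seen:
            continue
        seen.add(node)
        for nxt, _ in node.next_functions:
            if nxt is None:  # input that does not require grad
                continue
            if hasattr(nxt, "variable"):  # assumed to be an AccumulateGrad node
                counts[id(nxt.variable)] += 1
            stack.append(nxt)
    return counts

# Toy graph: w is used twice, b once.
w = torch.ones(3, requires_grad=True)
b = torch.zeros(3, requires_grad=True)
x1, x2 = torch.rand(3), torch.rand(3)
loss = (w * x1).sum() + (w * x2 + b).sum()

counts = count_grad_contributors(loss)
print(counts[id(w)], counts[id(b)])
```

A parameter with more contributors than expected would point toward the "multiple forward uses" explanation rather than the race.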
Commenting with what I see: aot-autograd calls .grad. Both .grad and .backward eventually call into Variable._execution_engine.run_backward, which should run the C++ autograd engine. |
Yes, and funnily enough they're called engine callbacks haha. They will trigger at the very end of the backward pass, once everything else has been executed.
There is no side thread after aot-autograd. |
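For reference, a minimal eager-mode sketch of an engine callback, assuming the private `Variable._execution_engine.queue_callback` entry point (a private API, so treat its availability as an assumption). The callback is queued from a hook at the start of backward and should run only after grad accumulation for `w` has finished; the tensors and function names are hypothetical.

```python
import torch
from torch.autograd import Variable

w = torch.ones(3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])
events = []

def on_backward_done():
    # Expected to run after the engine has executed every node,
    # so w.grad should already be populated here.
    events.append(("callback", w.grad.clone()))

def queue_final_callback(grad):
    # queue_callback may only be called while a backward pass is running,
    # so it is queued from a hook that fires at the start of backward.
    Variable._execution_engine.queue_callback(on_backward_done)

loss = (w * x).sum()
loss.register_hook(queue_final_callback)
loss.backward()

print(events)
```

Whether this ordering also holds inside the dynamo-AOT backward trace is exactly the open question in this thread; the sketch only covers plain eager autograd.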
Interesting, so does that mean that
|
I'm not sure what you mean by that? |
Cleaned up the test and added the counter check back so it does not regress. There are some open issues, but I want to get this one merged first:
1. mark_step at the end of the optim_mod to get 3 graphs per step. I think this somehow has to do with grad.
2. data.grad.cpu() after backward after step 1 will actually trigger additional graph execution; I think this is due to 1 above. It seems to me that input.grad is not fully materialized after the backward computation. It has pending IR like
I think we didn't hit this issue in the torch bench because we only run for 1 step and we have mark_step between steps to clear things up. I will continue looking into this issue. I think the right thing to do is to add an optimizer step to the test I have, which should simulate the real use case and can also capture input.grad in the same graph.
FYI @shunting314 @wconstab