Skip to content

Add the counter check for dynamo tests#4603

Merged
JackCaoG merged 1 commit into
masterfrom
JackCaoG/add_dynamo_test_counter_check
Feb 10, 2023
Merged

Add the counter check for dynamo tests#4603
JackCaoG merged 1 commit into
masterfrom
JackCaoG/add_dynamo_test_counter_check

Conversation

@JackCaoG
Copy link
Copy Markdown
Collaborator

@JackCaoG JackCaoG commented Feb 10, 2023

Cleaning up the test and add the counter check back so it does not regress. There are some open issues but I want to get this one merge first

  1. I have to remove the mark_step at the end of the optim_mod to get 3 graphs per step, I think this somehow has to do with grad
  2. accessing data.grad.cpu() after backward after step 1 actually will trigger additional graph execution, I think this is due 1 above. It seems to me that input.grad does not fully materialized after the backward computation. It has pending IR like
print(torch_xla._XLAC._get_xla_tensors_text([data.grad]))
IR {
  %0 = f32[] prim::Constant(), xla_shape=f32[], value=1
  %1 = f32[4,3,224,224]{3,2,1,0} aten::expand(%0), xla_shape=f32[4,3,224,224]{3,2,1,0}, size=(4, 3, 224, 224)
  %2 = f32[4,3,224,224]{3,2,1,0} xla::device_data(), xla_shape=f32[4,3,224,224]{3,2,1,0}, device=CPU:0
  %3 = f32[4,3,224,224]{3,2,1,0} aten::mul(%2, %1), xla_shape=f32[4,3,224,224]{3,2,1,0}
  %4 = f32[4,3,224,224]{3,2,1,0} xla::device_data(), xla_shape=f32[4,3,224,224]{3,2,1,0}, device=CPU:0
  %5 = f32[4,3,224,224]{3,2,1,0} aten::add(%4, %3), xla_shape=f32[4,3,224,224]{3,2,1,0}, ROOT=0
}

FYI @shunting314 @wconstab 

I think we didn't hit this issue in the torch bench because we only run for 1 step and we have mark_step between steps to clear things up. I will continue looking into this issue, I think the right thing to do is to add a optimizer step in the test I have, which should simulate the real use case and that can also capture the input.grad in the same graph.

FYI @shunting314 @wconstab

Copy link
Copy Markdown
Collaborator

@alanwaketan alanwaketan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@JackCaoG JackCaoG merged commit de2160a into master Feb 10, 2023
@shunting314
Copy link
Copy Markdown
Collaborator

Here is what I understand from the pending graph:

  1. the mul is a nop since we multiply an tensor by 1
  2. then the graph essentially only contains a add op

This looks very like gradient accumulation done by the autograd engine.

@JackCaoG
Copy link
Copy Markdown
Collaborator Author

@shunting314 yea, the part that's confusing to me is I would expect backward graph to contain the gradient accumulation but that might be in the optimizer graph?

@shunting314
Copy link
Copy Markdown
Collaborator

Gradient accumulation should happen in the backward graph.

@wconstab
Copy link
Copy Markdown
Collaborator

I think gradient accumulation for one parameter happens when autograd determines that all the contributors to that parameter's grad are ready. In a simple case, there may be only one producer of a grad for a parameter, so you would expect autograd to fire its accumulate grad for that parameter soon after the backward op that produced the grad finished.

I believe, if the autograd accumulategrad is ready to fire while we are tracing the backward graph, it should be likely that we are capturing that in the backwards graph. Note that I suspect a race condition here- if we 'exit' the dynamo-AOT backward trace phase right before autograd decides to fire its hook, then instead of including it in our backward graph it would be in its own graph.

A separate reason that could cause delayed accumulategrad is if there are other forward ops using the parameter in question, meaning those corresponding separate backwards ops all have to finish before accumulategrad can fire. So it would be good to confirm which case it is.

@JackCaoG
Copy link
Copy Markdown
Collaborator Author

Thanks @wconstab for the input, I am wondering what's the best way to figure out which one is the case? AFAICT, dynamo only passes us two graphs, one for forward and one for backward and that part of logic is not controlled by the existing dynamo bridge. Should I dig into AOTAutograd's code and figure out how does it decide to pass xla the backward graph and force it to sleep for a while to make sure everything is captured, or is there any other way to achieve the similar goal?

@wconstab
Copy link
Copy Markdown
Collaborator

One tool you can play with is dumping the autograd graph. You'll be better off doing this on a toy model if you can repro there.

The graph dump idea is mainly to visualize which nodes are contributing grads to a given parameter. If you have more nodes contributing than you expected, maybe you are in the second case.

Should I dig into AOTAutograd's code and figure out how does it decide to pass xla the backward graph and force it to sleep for a while to make sure everything is captured

I don't know what a clean solution would look like. Indeed the thing that fires AccumulateGrad hooks during regular .backward() is a separate c++ thread in autograd engine, so it can come at any time with respect to python's tracing. There is probably some hook autograd fires after finishing all grad accumulation, which you could register a callback in python and wait for that before you exit the dynamo backward trace. But make sure there is no valid case where grads are not expected to all be ready before you wait.
cc @albanD - is there such a hook? any other ideas? Also, is aot-autograd the same as .backward, with a separate thread firing these accumulategrad operations?

@shunting314
Copy link
Copy Markdown
Collaborator

shunting314 commented Feb 10, 2023

Also, is aot-autograd the same as .backward, with a separate thread firing these accumulategrad operations?

Comment with what I see. aot-autograd calls .grad. Both .grad and .backward eventually calls into: Variable._execution_engine.run_backward which should run into the C++ autograd engine.

@albanD
Copy link
Copy Markdown

albanD commented Feb 10, 2023

is there such a hook?

Yes, and funilly enough they're called engine callbacks haha. They will trigger at the very end of the backward pass once everything else has been executed.

Also, is aot-autograd the same as .backward, with a separate thread firing these accumulategrad operations?

There is no side thread after aot-autograd.
But we also enable single threaded autograd in there so there is never a side thread to begin with even when calling .backward. (https://pytorch.org/docs/master/generated/torch.autograd.set_multithreading_enabled.html?highlight=set_multithreading_enabled#torch.autograd.set_multithreading_enabled)

@wconstab
Copy link
Copy Markdown
Collaborator

There is no side thread after aot-autograd.

Interesting, so does that mean that

  1. if all grads for a given parameter are ready during aot-autograd, then we gaurantee to also capture accumulategrad during aot-autograd trace?
  2. or, even if grads are ready during aot-autograd, we do not expect accumulategrad to fire until someone actually calls .backwards() outside?

@albanD
Copy link
Copy Markdown

albanD commented Feb 13, 2023

if grads are ready during aot-autograd, we do not expect accumulategrad to fire until someone actually calls .backwards() outside?

I'm not sure what you mean by that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants