JIT: guard DifferentiableGraph node #49433

Closed
wants to merge 12 commits

Conversation

@t-vi (Collaborator) commented Dec 15, 2020

This adds guarding for DifferentiableGraph nodes in order to not depend on
Also bailing out on required gradients for the CUDA fuser.

Fixes #49299

I still need to look into a handful of failing tests, but maybe it can be a discussion basis.
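
For context, a minimal sketch (not from the PR itself; all shapes and names are made up) of the kind of situation the guard is for, in the spirit of #49299: the profiling runs see one requires_grad configuration and a later call uses another, so a DifferentiableGraph specialized for the first configuration must not be reused unguarded.

import torch

@torch.jit.script
def f(x, y):
    # small elementwise chain; with gradients required the profiling
    # executor will typically wrap this in a DifferentiableGraph
    return (x * y + y).tanh()

x = torch.randn(8, 8)
y = torch.randn(8, 8, requires_grad=True)

# warm-up / profiling runs with this requires_grad configuration
for _ in range(5):
    f(x, y).sum().backward()

# a later call with a different requires_grad configuration; without a
# guard on the DifferentiableGraph inputs, the executor could keep using
# the graph specialized for the old configuration
x2 = torch.randn(8, 8, requires_grad=True)
f(x2, y.detach()).sum().backward()
print(x2.grad is not None)  # gradients for x2 should still flow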

@facebook-github-bot added the cla signed and oncall: jit labels Dec 15, 2020
@facebook-github-bot (Contributor) commented Dec 15, 2020

💊 CI failures summary and remediations

As of commit 0b1c880 (more details on the Dr. CI page):


  • 2/4 failures possibly introduced in this PR
    • 2/2 non-CircleCI failure(s)
  • 2/4 broken upstream at merge base d78b638 on Jan 08 from 10:56am to 2:26pm

🚧 2 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD



ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@t-vi changed the title from "JIT: guard DifferentiableGraph node" to "[WIP] JIT: guard DifferentiableGraph node" Dec 15, 2020
@t-vi force-pushed the guard_diffgraph branch 4 times, most recently from 3c73921 to 3752916 on December 17, 2020 11:21
@eellison (Contributor) left a comment

Cool, looks great, thanks for the PR!

I think we need to figure out where we're specializing the graph: we might be doing it too broadly here, without guards to make sure that the types we set are the ones actually seen at execution.

cc @Krovatkin as well for review

aten/src/ATen/core/type.cpp
test/test_jit_fuser_te.py (outdated)
torch/csrc/jit/passes/create_autodiff_subgraphs.cpp (outdated)
torch/csrc/jit/runtime/interpreter.cpp
torch/csrc/jit/passes/tensorexpr_fuser.h
@@ -376,17 +377,23 @@ void ProfilingGraphExecutorImpl::runProfilingOptimizations(
runPreAutodiffPassPipeline(copy);

if (needsGradientInProfilingMode(copy->block())) {
RemoveProfileNodesAndSpecializeTypes(copy);
Contributor

Hmm, we're specializing the types of the graph here without guaranteeing that they will be seen. This could lead to invalid peepholes in the forward of the differentiable graph. Could we try to rewrite this without removing the profiling nodes here? The only property that we're guaranteeing stays consistent is the gradient of the inputs.

I think the flow we want is:

  • CreateAutodiffSubgraphs

  • Specialize the inputs of the created DifferentiableGraphs on their requires_grad property. This might require going into the DifferentiableGraph and aggregating the grad property based on uses.

  • Add a Type Guard to the diff graph. In the fallback function, we probably want to inline the differentiable graph, since we are running it only if gradients have changed, and we may not need to differentiate at all.

  • Next, when we differentiate the graph, it would be nice if we're not adding excess gradients, since we know which inputs require grad. (This doesn't have to be done in this PR.)

It is sort of annoying to have to keep the profile nodes in the graph, but I don't know if there's a better solution here.

Collaborator Author

I had the following in mind:

  • CreateAutodiffSubgraphs
  • Specialize the inputs of DifferentiableGraphs
  • Add a guard based on the input requiresGrad.

My thinking was that the fallback function doesn't contain a differentiable graph, only the nodes that had been moved inside it, so it would be good to use as-is.

@@ -396,7 +403,7 @@ void ProfilingGraphExecutorImpl::runProfilingOptimizations(
InlineAutodiffSubgraphs(
copy,
getAutodiffSubgraphInlining() ? autodiffSubgraphInlineThreshold : 1);
RemoveProfilingNodes(copy);
replaceFallbackGraphWithFallbackFunction(copy->block());
Contributor

Hmm, how does this work? What do we want the graph to look like?

if TypeCheck:
    DifferentiableGraph
else:
    FallBack(...)

Within the Fallback, we probably don't want a DifferentiableGraph, right, because the gradient properties are different? We might need to inline the DifferentiableGraph into the FallBack graph, if I'm reading this right.

Collaborator Author

I was thinking that the fallback graph is inserted by insertTypeGuard using the subgraph attached to the guarded node (i.e. the DifferentiableGraph), so it does not include the DifferentiableGraph node itself. Then, here, replaceFallbackGraphWithFallbackFunction replaces the fallback graph with the fallback function. (And that gets an executor, so it can then do its own profiling + optimization.)

@t-vi (Collaborator Author) commented Dec 18, 2020

Thank you for the review! I'll update the PR tomorrow morning.

For some reason, in the CI the requiresGrad of the type seems to be nullopt rather than false.
I could use a hint why, as I cannot reproduce it locally.

@t-vi (Collaborator Author) commented Dec 18, 2020

I updated the type guarding to pull the Type from the first profile node inside the differentiable graph to the outside and to leave the profile node alone; I hope this works.
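
A rough way to check the resulting structure from Python, assuming a profiling-executor build; the exact guard node kind (prim::TypeCheck here, or the more specialized requires_grad check discussed later in this thread) depends on the version, so this is only a sketch:

import torch

@torch.jit.script
def f(x, y):
    return (x + y) * y

x = torch.randn(4, 4, requires_grad=True)
y = torch.randn(4, 4)

for _ in range(5):  # let the profiling executor specialize and optimize
    f(x, y).sum().backward()

g = torch.jit.last_executed_optimized_graph()
print(g)  # look for a guard feeding a prim::DifferentiableGraph and a fallback path
print("DifferentiableGraph" in str(g), "TypeCheck" in str(g))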

@t-vi (Collaborator Author) commented Dec 19, 2020

When the output of a differentiable graph is the input to another one, the prim::profile node gets moved into the first differentiable graph, where it sits at the output; it is then missing at the input of the second. I initially planned to duplicate these nodes, but ended up not doing so because it quickly turned into a mess. Instead, I now special-case inputs that come from differentiable graphs.

Note that this is also a problem for the fusers. On the other hand, it depends on having unmergeable differentiable graphs, which might not be too common.

@eellison I must admit I'm a bit dense, but this has me thinking again that it might be worth exploring specializing types before creating the autodiff subgraphs -- then we wouldn't have this problem. Could you elaborate on your concerns? Alternatively, we might remove and re-instrument the things inside the differentiable graph. That would mean another set of profiling runs, but would likely work out nicely. I will probably revisit this once I have thought a bit more about incremental profiling.

@codecov bot commented Dec 19, 2020

Codecov Report

Merging #49433 (7387285) into master (e0f60c9) will decrease coverage by 39.53%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master   #49433       +/-   ##
===========================================
- Coverage   80.57%   41.03%   -39.54%     
===========================================
  Files        1883      500     -1383     
  Lines      204369    67395   -136974     
===========================================
- Hits       164662    27658   -137004     
- Misses      39707    39737       +30     

@t-vi mentioned this pull request Dec 19, 2020
// defined
return TensorType::get()->withRequiresGrad(t->requiresGrad());
});
if (all_requires_grad_defined) {
Contributor

It's somewhat common for LSTMs (in a loop) to have tensor values whose requires_grad attribute flips from false (in the first iteration) to true (in subsequent iterations). This behaviour would result in all_requires_grad_defined being set to false. The other probable scenario is when we switch from training to eval/test (i.e. true to false).

I think we could consider the following options:
a) enable re-profiling if we aren't sure, so we profile a steady state (this would cost us an extra call in a loop, to always go through a fallback)
b) gamble and assume that requires_grad will be true. This strategy works out very nicely for both LSTMs and eval/train scenarios. We would take the fallback in the first iterations, since requires_grad would be false, but then we would be running the optimized path (no fallbacks). Similarly, we spend more time in training than in eval, so incurring the fallback penalty in the test/eval runs is preferable.
c) peel the first iteration, so the profiler only sees requires_grad=true. We kinda tried it, but very reluctantly.
d) do nothing and lose out on perf, since we won't fuse in non-diff regions as long as even a single requires_grad=true appears anywhere in a graph.

Thoughts, @t-vi @eellison @jjsjann123?
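
To illustrate the flip described above with a plain eager loop (sizes made up): the hidden state fed into the cell does not require grad in the first iteration but does in all later ones, since it then depends on the cell's parameters, so the profiled requires_grad for that input is not a single constant.

import torch
import torch.nn as nn

cell = nn.LSTMCell(16, 32)
x = torch.randn(10, 4, 16)   # (seq_len, batch, input_size)
h = torch.zeros(4, 32)       # initial hidden state: requires_grad=False
c = torch.zeros(4, 32)

for t in range(x.size(0)):
    # what a profiler would observe for the hidden-state input at each step
    print(t, "input hidden requires_grad:", h.requires_grad)
    h, c = cell(x[t], (h, c))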

@t-vi (Collaborator Author), Dec 19, 2020

Oha, indeed! I would go for b). If that doesn't work out, I'd probably go for actually querying the ProfilingRecord for which configuration is most commonly seen.
This will apply to completely unseen things, too, though...

Contributor

I also kinda like b). It seems to be a reasonable heuristic; we could always revise it later.

This will apply to completely unseen things, too, though...

That's true. We could check for unseen things, as their types are just TensorType::get().

I'd probably go for actually querying the ProfilingRecord for which configuration is most commonly seen.

I'm actually working on this in a separate PR. However, the initial version would aggregate across different invocations of a function rather than different iterations of a loop.

Collaborator Author

That's true. We could check for unseen things, as their types are just TensorType::get().

I wonder whether we might just have a "seen" bit on the profiling node. I've been thinking about incremental profiling (where we leave profiling nodes for the bits we don't have records of and re-run the optimization once we hit them) and will probably play with it one of these days.

@eellison (Contributor), Dec 21, 2020

Yeah, I think we should do b) but filter out TensorType::get() (unseen tensors). Incremental profiling could have value, I'm not sure, but it is probably out of scope for this PR.

Collaborator Author

OK, I'll return to taking out graphs with TensorType::get() for now and look into incremental profiling. Thank you for the insightful comments!

@t-vi (Collaborator Author) commented Dec 20, 2020

Finally, it seems that the CI is happy, and I'm reasonably happy, too.

@t-vi changed the title from "[WIP] JIT: guard DifferentiableGraph node" to "JIT: guard DifferentiableGraph node" Dec 20, 2020
@ngimel added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Dec 21, 2020
@Krovatkin (Contributor) left a comment

Looks good. I had a few fairly minor comments that would be nice to fix.

test/test_jit_fuser_te.py (outdated)
@@ -26,6 +27,9 @@ class ProfileRegistry {

bool shouldProfileNode(const Node* node) {
std::lock_guard<std::mutex> guard(mutex_);
if (isDifferentiable(node)) {
Contributor

Could you please add a comment here? I presume we need this to minimize the number of nodes whose requires_grad isn't set because they aren't being profiled?

if (n->kind() == prim::profile) {
GRAPH_DEBUG(
"setting input ", i, " to type ", *n->ty(attr::profiled_type));
dni->setType(n->ty(attr::profiled_type));
Contributor

Is there a reason we can't just set the requiresGrad attribute rather than the complete type, i.e. TensorType::get()->withRequiresGrad(n->ty(attr::profiled_type)->requiresGrad())?

Collaborator Author

But then, if requiresGrad is undefined (i.e. the case you mentioned), we would end up with something indistinguishable from an unseen tensor. So I'd probably leave it as is for now.

GRAPH_DEBUG(
"setting input ", i, " to type ", *n->ty(attr::profiled_type));
dni->setType(n->ty(attr::profiled_type));
} else if (dni->node()->kind() == prim::DifferentiableGraph) {
Contributor

OMG

Collaborator Author

Yeah. Eventually, I think the best fix would be to 1) re-profile the graph inside the differentiable graph (or do we do that already when using the differentiable graph operator?) or 2) apply the type annotations before creating the autodiff graphs. I understood from @eellison that the latter isn't good to do at the moment, but I haven't understood well enough what is unsafe about it to see whether that could be fixed.

Contributor

  1. is easy enough; see "switching from Code to GraphExecutor for gradient forward" #47434.

  2. yeah, that is a little weird because it gets away from our "fusers decide what to guard on" logic, and I'm also a little worried about proving which properties in the differentiable graph are guaranteed. For instance, if an op takes a NumberType, the output type could change if in one iteration the input is an Integer and in the next a Float. I'm not saying that's particularly common, but it's one example where strictly guarding on Tensor inputs does not guarantee consistent tensor types.

@facebook-github-bot (Contributor) left a comment

@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@eellison (Contributor) left a comment

Cool, LGTM. The comments I made, you / @Krovatkin / I can do in a follow-up; it doesn't have to be in this PR. Thanks for doing this, I know it was a ton of work and you improved some infrastructure that was lacking along the way!

@jjsjann123 it would be good to try to speed up the layer norm autodiff with this in place.
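
A rough, hypothetical way to poke at the layer-norm case (whether the guarded differentiable graph leaves the pointwise tail to the fuser); the resulting graph and any fusion depend on which fuser is enabled and on the PyTorch version:

import torch
import torch.nn.functional as F

@torch.jit.script
def ln_act(x, weight, bias):
    # layer norm followed by pointwise ops, a typical fusion candidate
    y = F.layer_norm(x, [x.size(-1)], weight, bias)
    return torch.relu(y) * 0.5

x = torch.randn(32, 128, requires_grad=True)
w = torch.randn(128, requires_grad=True)
b = torch.randn(128, requires_grad=True)

for _ in range(5):  # warm-up / profiling runs
    ln_act(x, w, b).sum().backward()

print(torch.jit.last_executed_optimized_graph())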

@@ -0,0 +1,17 @@
import torch
Contributor

There's a test/jit/test_profiler; this test might be a better fit there.

@@ -70,9 +70,11 @@ struct ParsedLiteral {
int64_t i = 0;
std::string s = "";
double f = 0.0;
TypePtr ty;
Contributor

Nice!

GRAPH_DEBUG("Optimizing diff node ", idx, " in ", *copy);
if (!guardDifferentiableGraph(dnode)) {
// if we cannot guard (because of inputs without profiling information),
// we re-inline the subgraph and remove the differentiable node
@eellison (Contributor), Dec 22, 2020

It would be nice if we just didn't guard on inputs without profiling information, instead of disabling differentiable graphs entirely, maybe as a follow-up... I'm not sure this is actually relevant in practice, though.

for (size_t i = 0; i < gi.size(); i++) {
auto ty = gi[i]->type()->cast<TensorType>();
if (ty) {
auto n = gi[i]->uses().at(0).user;
Contributor

I suppose with DifferentiableGraphs we're always fusing just a block, so either all or none of the uses will have been executed.

}
}
if (all_inputs_seen) {
// we may have seen both true and false for requires_grad. In this case
Contributor

Hmm, when would this happen exactly? I'm a little confused here. I guess this is the LSTM case where the first iteration doesn't require grad?

Collaborator Author

For an LSTM, in the first timestep the hidden state usually doesn't require gradients, while the inputs do.


idx++;
continue;
}
GRAPH_DEBUG("After guardDifferentiableGraph:\n", *copy);
auto diff_graph = std::move(dnode->g(attr::Subgraph));
Gradient gradient = differentiate(diff_graph);
@eellison (Contributor), Dec 22, 2020

Once we re-profile within differentiable graphs, we can remove the profile nodes before differentiating, so that we do not compute unnecessary gradients. I'm not sure we're doing that now.

see:

if (!output->requires_grad())

(not for this PR, as a follow up)

// ideally we would set this up for re-profiling
UpdateDifferentiableGraphRequiresGrad(
dnode->g(attr::Subgraph), c10::nullopt);
SubgraphUtils::unmergeSubgraph(dnode);
Contributor

UpdateDifferentiableGraphRequiresGrad(dnode->g(attr::Subgraph), c10::nullopt);
replaceBlockWithFallbackGraph(SubgraphUtils::getSubgraph(dnode)->block());
SubgraphUtils::unmergeSubgraph(dnode);

@eellison (Contributor) commented Dec 28, 2020

Hey @t-vi, sorry things are moving a little slowly during the holidays! We're trying to merge this ~soon~; however, @Krovatkin noticed a 3-4% slowdown on fastrnns. We're looking at whether we can make this faster by replacing the generic TypeCheck node with a check that just looks at requires_grad, and by replacing the unoptimized block with a FallBack path. Any other ideas welcome!

As you pointed out, there is some overhead from the extra executor; I wonder if there's any way we could inline fallback graphs after we're done optimizing them 🤔

I also don't know why we're instantiating the interpreter state on each forward -

InterpreterState(f).run(*stack);

@t-vi (Collaborator Author) commented Dec 29, 2020

Thank you for catching this. But do we think the type check is the problem, or the fallback?
If the latter, then re-profiling parts might be a good solution, as it would allow us to just leave the fallback in the "else block" directly.
I'd be a bit sad not to get this in, given the correctness aspect, but hey, a 4% slowdown is terrible, too.

@Krovatkin (Contributor) commented Dec 29, 2020

@t-vi fyi: @eellison

I've run fastrnns from benchmarks; please see the results here: https://docs.google.com/spreadsheets/d/1otmopqasKTe1rYqreJrNm4WNwQpaT37z4CpMPszVq3o/edit?usp=sharing The slowdown seems to be somewhat small, so I'm optimistically suspecting the checks, given that the graphs are nearly identical modulo the extra checks.

I am slowly looking into this. I added a specialized check here, #49929, but I didn't get a chance to measure it yet (some setup issues with the benchmarking machine I usually use). I'll measure it and report back. If my patch doesn't help, I'll add counters to confirm that we are indeed hitting more bailouts.
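
For reference, a crude sketch of the kind of timing comparison being discussed here, not the fastrnns harness itself: a scripted LSTM-style loop timed after warm-up. All sizes and names are made up, and numbers from something this small are noisy, so it only shows the shape of the measurement.

import time
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.cell = nn.LSTMCell(64, 64)

    def forward(self, x, h, c):
        for t in range(x.size(0)):
            h, c = self.cell(x[t], (h, c))
        return h

model = torch.jit.script(TinyRNN())
x = torch.randn(50, 16, 64)
h = torch.zeros(16, 64)
c = torch.zeros(16, 64)

for _ in range(10):  # warm-up / profiling runs
    model(x, h, c).sum().backward()

start = time.perf_counter()
for _ in range(50):
    model(x, h, c).sum().backward()
print("avg iter (s):", (time.perf_counter() - start) / 50)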

@t-vi (Collaborator Author) commented Dec 29, 2020

By the way, do you know whether the std is the std of the measurements or adjusted for the number of runs?
I must admit my intuition is that the extra function call might be more problematic than the check itself (and we'd run into that at some point). The other thing I thought about was whether we could make the if into a switch that works similarly to the traditional executor, but with "user provided" (for the CUDA fuser) criteria.

You shouldn't have to fix my PR.

@Krovatkin (Contributor) commented Dec 29, 2020

Here's the raw csv data. If I remember correctly, the slowdown was greater than both stds.
tvi3.txt

@Krovatkin (Contributor) commented Dec 29, 2020

@t-vi

you shouldn't have to fix my PR

To give a little context: I started to look into it as a follow up to your PR when I thought we were ready to merge as-is and before I ran the numbers. This was just too similar to my other effort to speed up TypeChecks in general.

@Krovatkin (Contributor)

@t-vi specialized checks seem to help more often than not (in most runs we get most of the perf back), although this machine is pretty noisy: https://docs.google.com/spreadsheets/d/1UCzYlHFnGo_G097aq0sSbmYAwd1Go_ouyGpKamCmerI/edit?usp=sharing

@Krovatkin (Contributor)

@t-vi could you please rebase your PR and I'll try merging it ASAP! Sorry for the delay!

@t-vi (Collaborator Author) commented Jan 8, 2021

@Krovatkin will do! Thank you!

@facebook-github-bot (Contributor) left a comment

@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@Krovatkin merged this pull request in ea087e2.

jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
This adds guarding for DifferentiableGraph nodes in order to not depend on
Also bailing out on required gradients for the CUDA fuser.

Fixes pytorch/pytorch#49299

I still need to look into a handful of failing tests, but maybe it can be a discussion basis.

Pull Request resolved: pytorch/pytorch#49433

Reviewed By: ngimel

Differential Revision: D25681374

Pulled By: Krovatkin

fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Labels
cla signed · Merged · oncall: jit · open source · triaged

Successfully merging this pull request may close these issues:

JIT: DifferentiableGraph/Requires grad handled badly by ProfilingExecutor and fuser fallbacks

6 participants