🚀 Feature
As more modules utilize profiling data in the pipeline, it is increasingly difficult to locate the proper profile node for a given value in the graph.
A consistent API/protocol to query and propagate profile nodes would be beneficial: 1. It lets pass writers implement robust, short code (less code searching for the profile node, hence fewer bugs); 2. It orchestrates subsequent passes so that they manipulate profile nodes in compatible ways.
Motivation
The problem arises when we try to improve autodiff by utilizing requires_grad from the profiled tensor type. requires_grad on the IO tensors of a DifferentiableGraph is used to: 1. Prune computation of grad_inputs in the backward graph; 2. Mark requires_grad on output tensors in the forward graph.
This works when we have a DifferentiableGraph that preserves profile nodes for all input/output tensors within its subgraph. However, it is not validated by any explicit checks and can easily be broken by optimization passes, since each pass writer mutates the graph freely.
E.g., for autodiff, we can look at a made-up graph mimicking the graph output at this stage (see pytorch/torch/csrc/jit/runtime/profiling_graph_executor_impl.cpp, lines 509 to 511 at bbdb37b):
We notice a few complications above. For edges connecting two DifferentiableGraphs, the profile node is absorbed by one of them, making it harder for the other to retrieve the profiling information. In the example above, the last output tensor in prim::DifferentiableGraph_0 doesn't come directly from a profile node; the profile node is instead located inside its consumer(s). When we query the profiled tensor type of an output inside the subgraph of a DifferentiableGraph, there are three places we need to look:

i. Within the subgraph, we look for a profile node that feeds the tensor to the output node;
ii. In the block where the DifferentiableGraph lives, we check the users of its output, looking for profile nodes;
iii. Finally, for users in case ii that are DifferentiableGraphs themselves, we have to look inside their subgraphs for profile nodes on the corresponding inputs as well.
We could have multiple profile nodes in different branches with conflicting type information. Conflicting type information in general is tricky to handle, but in the restricted use case where only a single profile run is executed, the conflict is more of a concrete tensor type vs an empty tensor type, which can be resolved by simply iterating through all branches until a concrete type is found.
Pitch
In the current protocol for using profiling information, where users insert custom guards and realize the guarded information during optimization, the overhead of a profile node is relatively small: profiling information that is not used is simply discarded, with no runtime penalty.
This is drastically different from the earlier use of profile nodes in conjunction with BailOut nodes, where the existence of a profile node implied BailOut overhead and also blocked fusion. Therefore, our strategy of merging profile nodes across blocks should also be recalibrated.
I think that by simply cloning profile nodes instead of moving them in optimization passes, it would be much easier for subsequent passes to perform similar profiling-information-dependent optimizations. However, without a validation pass it is hard to uphold this protocol. Hence the request to formalize our APIs for manipulating and validating profile nodes in a graph. Given that a profile node is mostly a pass-through node with no side effects at runtime, we can safely propagate profile nodes in a generic way after a graph mutation to facilitate future optimization passes, as well as remove profile nodes from a graph before execution.
Simple APIs that I think would be useful:
```cpp
// Propagate profile nodes across boundaries between Blocks.
void propagateProfileNode(std::shared_ptr<Graph> graph);

// Validate that the given graph satisfies the assumption made in
// `retrieveProfileInformation`: that profiling information can be extracted
// for every use of a value in the graph.
bool validateProfileNode(std::shared_ptr<Graph> graph);

// Note that we take a `Use { Node* user; size_t offset; }` here instead of a
// `Value`; this should return the profile node for the specific use when
// applicable.
Node* retrieveProfileInformation(Use use);
```
@jjsjann123 I like the idea! Let me give it a bit more thought on whether we could run into any issues with this approach, and then we can get to implementing it.
cc @gmagogsfm