[StaticRuntime] Threading model #46219
Conversation
This pull request was exported from Phabricator. Differential Revision: D24237078
💊 CI failures summary and remediations. As of commit fb11ae1 (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
Codecov Report

```
@@            Coverage Diff             @@
##           master   #46219      +/-   ##
==========================================
- Coverage   68.40%   68.25%   -0.15%
==========================================
  Files         411      410       -1
  Lines       54100    53531     -569
==========================================
- Hits        37008    36539     -469
+ Misses      17092    16992     -100
```

Continue to review full report at Codecov.
Force-pushed d2a68bc to 351f9ea.
Summary: Pull Request resolved: pytorch#46219

- Refactor StaticRuntime and group the common data structures, the JIT graph, and the script module into a separate struct, `InferenceModule`:

```
struct InferenceModule {
  explicit InferenceModule(const torch::jit::Module& m);
  explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);

  torch::jit::Module module;
  std::shared_ptr<torch::jit::Graph> graph;
  std::unique_ptr<c10::FunctionSchema> schema;

  std::unordered_map<Value*, size_t> value_to_reg;
  std::vector<size_t> input_regs;  // inputs to the graph
  std::vector<size_t> output_regs; // outputs of the graph
  std::vector<size_t> internals;
};
```

`InferenceModule` is stored in the PyTorchPredictor as well as in the static runtime, and is shared across threads. This is what's left inside the StaticRuntime:

```
mutable std::vector<IValue> reg_;

// The nodes we need to run
std::vector<ProcessedNode> nodes_;
```

`reg_` holds all the weights and activations; its contents differ across threads while running. `nodes_` holds the op nodes and their input/output registers, and is the same across threads for now. Since we could potentially put other stateful data structures into `ProcessedNode`, I kept `nodes_` inside the static runtime; it could easily be moved into `InferenceModule` if we decide not to put anything else into `ProcessedNode`.

- Added `StaticRuntimeOptions` so we can toggle certain optimizations on and off for testing and benchmarking; `cleanup_activations` is one example.

- Integration with PyTorchPredictor: added a lock-free stack in the PyTorchPredictor to hold all the static runtime instances. Benchmarks show that the `push` and `pop` combo takes about 80 ns, which is quite acceptable.

This diff focuses on the threading model only; benchmarks will come separately.

Reviewed By: bwasti

Differential Revision: D24237078

fbshipit-source-id: b0f8572536454247b8738a0c11f6b8b596107cc4
Force-pushed 351f9ea to fb11ae1.
This pull request has been merged in 1a3ea46.