One of the unique features of Static Runtime is the MemoryPlanner, which aggregates all the memory allocations for intermediate tensors into a single malloc and caches their TensorImpls inside Static Runtime. This speeds up inference by reducing the number of mallocs and by avoiding the cost of creating/destroying Tensor objects and the associated refcount bumps on the fly. However, the MemoryPlanner only manages intermediate tensors, which excludes model inputs and outputs. If we can extend the MemoryPlanner to also cover output tensors, we can dramatically speed up models with multiple outputs.
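To make the single-allocation idea concrete, here is a minimal standalone sketch (illustrative only, not the actual impl.cpp code) of how a planner can sum up aligned tensor sizes and hand each tensor an offset into one shared buffer:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of the MemoryPlanner's core idea: instead of one malloc
// per tensor, assign every managed tensor an aligned offset into one buffer.
struct PlannedTensor {
  size_t nbytes; // bytes the tensor needs
  size_t offset; // assigned offset into the shared buffer
};

constexpr size_t kAlignment = 64;

// Returns the total buffer size; one malloc of this size backs all tensors.
size_t assignOffsets(std::vector<PlannedTensor>& tensors) {
  size_t total = 0;
  for (auto& t : tensors) {
    t.offset = total;
    // Round each size up so every tensor starts at an aligned address.
    total += (t.nbytes + kAlignment - 1) / kAlignment * kAlignment;
  }
  return total;
}
```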
First, we'll need some bookkeeping for the output tensors:
```cpp
// (allocation size in bytes, StorageImpls backed by that region), one entry
// per managed output slot.
std::vector<std::pair<size_t, std::vector<c10::StorageImpl*>>> managed_output_storage_;
// Total size of the single buffer that backs all managed output tensors.
size_t managed_output_bytes_{0};
at::DataPtr output_buffer_; // for outputs only
```
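A hedged sketch of how these fields might be consumed when the planner allocates output memory; the method name `allocateOutputs` and the exact storage wiring are assumptions for illustration, not the actual implementation:

```cpp
#include <c10/core/Allocator.h>
#include <c10/core/StorageImpl.h>

// Hypothetical member of MemoryPlanner (the name is illustrative).
void MemoryPlanner::allocateOutputs() {
  if (managed_output_bytes_ == 0) {
    return;
  }
  // One allocation backs every managed output tensor.
  output_buffer_ =
      c10::GetAllocator(c10::DeviceType::CPU)->allocate(managed_output_bytes_);
  char* start = static_cast<char*>(output_buffer_.get());
  size_t offset = 0;
  for (auto& [nbytes, storages] : managed_output_storage_) {
    for (c10::StorageImpl* storage : storages) {
      // Storages listed together share the same slice of the buffer.
      storage->set_data_ptr_noswap(
          at::DataPtr(start + offset, c10::Device(c10::DeviceType::CPU)));
      storage->set_nbytes(nbytes);
    }
    offset += nbytes;
  }
}
```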
For implementation, see https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/static/impl.cpp
As with the intermediates, the MemoryPlanner can only manage output tensors produced by ops with out variants. Ops without out variants allocate their output tensors dynamically inside the op, so there is nothing the MemoryPlanner can do about them.
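For concreteness, the distinction looks like this at the ATen level, using at::add as the example:

```cpp
#include <ATen/ATen.h>

void example() {
  at::Tensor a = at::randn({2, 3});
  at::Tensor b = at::randn({2, 3});

  // Functional variant: at::add allocates a fresh output tensor on every
  // call, so the MemoryPlanner cannot manage that allocation.
  at::Tensor c = at::add(a, b);

  // Out variant: writes into caller-provided storage. Static Runtime can
  // point `out` at planner-managed memory and reuse it across iterations.
  at::Tensor out = at::empty({2, 3});
  at::add_out(out, a, b);
}
```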
Pay close attention to aliases. Model inputs and their aliases must be excluded, and aliases of intermediate tensors and output tensors need to be handled carefully.
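One way to express the input-exclusion check, sketched with the JIT's AliasDb (the helper and its exact use here are assumptions for illustration):

```cpp
#include <torch/csrc/jit/ir/alias_analysis.h>

// Illustrative helper: a graph output is only safe for the planner to manage
// if it cannot alias any model input, since input memory is owned by the
// caller.
bool outputIsManageable(
    torch::jit::AliasDb& alias_db,
    torch::jit::Value* output,
    const std::vector<torch::jit::Value*>& graph_inputs) {
  for (torch::jit::Value* input : graph_inputs) {
    if (alias_db.mayContainAlias(input, output)) {
      return false; // may alias a model input; exclude from planning
    }
  }
  return true;
}
```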
For testing, there are a lot of unit tests in https://github.com/pytorch/pytorch/blob/master/benchmarks/static_runtime/test_static_runtime.cc and https://github.com/pytorch/pytorch/blob/master/test/test_static_runtime.py.
cc @gmagogsfm