PyTorch NNAPI integration prototype #46780
Conversation
💊 CI failures summary and remediations — as of commit c323076 (more details on the Dr. CI page):
Extra GitHub checks: 1 failed
codecov.io: 1 failed
int32_t zeroPoint;
} ANeuralNetworksOperandType;

#endif // MINIMAL_NEURAL_NETWORKS_H
Okay, so this file is mainly from NNAPI's source tree - i.e. enum and struct definitions.
torch/backends/_nnapi/prepare.py
if fmt == 0:
    fixed_args.append(args[idx].contiguous())
elif fmt == 1:
    fixed_args.append(args[idx].permute(0,2,3,1).contiguous())
Okay, this is where the NHWC stuff you mentioned earlier happens.
TENSOR_QUANT16_ASYMM = 12


class NNAPI_OperationCode(object):
Okay, these were picked up from https://developer.android.com/ndk/reference/group/neural-networks#operationcode
return struct.pack("i" * len(ints), *ints)


ADDER_MAP = {
Okay, so these are the aten operators supported for NNAPI execution. I'm guessing this list is dictated by the operation codes from https://developer.android.com/ndk/reference/group/neural-networks#operationcode
Yeah, this maps from the internal TorchScript operator name to the function that needs to handle that node. In some cases (like processing constants), there's no NNAPI operation generated.
I'm assuming this map could grow as we support more and more NNAPI operations?
Yes. Though I'd like to replace the explicit map with a decorator.
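For a rough illustration of that idea (hypothetical names, not code from this PR), decorator-based registration might look like:

ADDER_MAP = {}

def adder(op_name):
    # Register the decorated function as the handler for op_name.
    def wrap(fn):
        ADDER_MAP[op_name] = fn
        return fn
    return wrap

@adder("aten::relu")
def add_relu(self, node):
    ...  # emit the corresponding NNAPI RELU operation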
if config is None:
    config = {}

self.solid_weights = config.get("solid_weights", False)
Q. What is solid_weights?
Oh, this is a vestigial feature. I'll delete it. Currently, we get each weight from a separate tensor. "Solid weights" was to let all of the weights be bundled as a single blob to make it possible to deploy without using PyTorch or Caffe2 as a wrapper.
Thanks! Now I understand the intent, but not the details, so I shan't mull over them for now.
self.modules = {}
self.constants = {}
self.jitval_operand_map = {}
self.cached_immediates = {}
Okay - cached_immediates is a set of all literal constant values that show up in this model, de-duped.
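A minimal sketch of that de-duplication, assuming a hypothetical add_immediate_operand helper:

def get_immediate(self, ctype, value):
    key = (ctype, value)
    if key not in self.cached_immediates:
        # Create at most one operand per distinct literal value.
        self.cached_immediates[key] = self.add_immediate_operand(ctype, value)
    return self.cached_immediates[key]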
return abs(lhs - rhs) <= tolerance * min(lhs, rhs)


def tensor_size(op_type, dims):
Okay so this is the tensor size in bytes.
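For intuition, the computation is the element size for the operand type times the product of the dimensions; a hedged sketch (the size table is illustrative, not the PR's):

ELEMENT_SIZES = {"TENSOR_FLOAT32": 4, "TENSOR_QUANT8_ASYMM": 1}

def tensor_size_bytes(op_type, dims):
    size = ELEMENT_SIZES[op_type]
    for d in dims:
        size *= d  # multiply out each dimension
    return size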
torch/backends/_nnapi/serializer.py
if len(s1) > len(s2):
    #s2 = [1] * (len(s1) - len(s2)) + s2
    raise Exception("Non-equal-rank broadcast is too dangerous because XXX.")
if len(s2) > len(s1):
    #s3 = [1] * (len(s2) - len(s1)) + s1
    raise Exception("Non-equal-rank broadcast is too dangerous because XXX.")
@dreiss Maybe this is not relevant here, but https://pytorch.org/docs/stable/notes/broadcasting.html suggests that broadcasting can account for non-equal shapes of tensors. Is this something that would be a limitation of this specific API?
I need to clean this code up a bit. In theory, PyTorch and NNAPI have the same broadcasting semantics (back to front, extend the front with 1s). However, when using "nnapi_nhwc", it gets a bit wonky, because PyTorch uses an implicit NHWC representation based on strides, so the tensors still have NCHW semantics, and the broadcasting order is still W,H,C,N. NNAPI (and TF and C2) use explicit NHWC, so the broadcast order is C,W,H,N. The trickiest bits come into play when we try to broadcast a known NHWC tensor with a constant (where we don't necessarily know the user's intention). I haven't worked through all the cases yet, so I decided not to support non-equal-rank broadcast for now.
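For reference, PyTorch's documented rule (align shapes back to front, extend the front with 1s) in a tiny example:

import torch

a = torch.ones(2, 3, 4)
b = torch.ones(4)      # implicitly treated as shape (1, 1, 4)
print((a + b).shape)   # torch.Size([2, 3, 4])

With implicit NHWC the logical shape stays NCHW, so PyTorch still aligns W,H,C,N; an explicit-NHWC framework aligns C,W,H,N instead, which is where the mismatch described above arises.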
Thanks! I understand now that there is additional complexity when you're dealing with different dimension orders, and that broadcasting only works when the dimensions are ordered the same. At a high level I gather that you can't make that assertion here, so you've left this open. If you somehow knew the order of dimensions, then you could have supported the 1-padding behaviour. Got it!
torch/backends/_nnapi/serializer.py
header = struct.pack(
    "iiiiii",
    version,
    len(self.operands),
    len(self.values),
    len(self.operations),
    len(self.inputs),
    len(self.outputs),
)
model.append(header)

serialized_values, serialized_value_data = self.serialize_values()

model.extend(struct.pack("iifi", t, len(d), s, z) for (t,d,_m,s,z) in self.operands)
model.extend(serialized_values)
model.extend(struct.pack("iii", *x) for x in self.operations)
model.extend(self.serialize_ints(fix_shape(dims, mf)) for (_, dims, mf, _, _) in self.operands)
model.extend(serialized_value_data)
model.append(self.serialize_ints(self.operation_args))
model.append(self.serialize_ints(self.inputs))
model.append(self.serialize_ints(self.outputs))
Okay, so the serialized model is basically a file whose header specifies the version, num(operands), num(values), etc., where operands, values, etc. are fixed-size structures; the deserializer reads them off directly because it knows how many of each to expect.
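A hedged sketch of the matching read side (field order mirrors the "iiiiii" pack call above; the function name is mine):

import struct

def read_header(buf):
    # Six int32 fields: version and the five table counts.
    (version, n_operands, n_values,
     n_operations, n_inputs, n_outputs) = struct.unpack_from("iiiiii", buf, 0)
    return version, n_operands, n_values, n_operations, n_inputs, n_outputs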
self.outputs = []

self.modules = {}
self.constants = {}
Maybe I'm missing something, but I don't see any place that constants is being set (or updated). I only see queries.
nvm, I see it now as self.constants[key].
def add_constant_value(self, jitval, ctype, value):
    assert jitval not in self.constants
    self.constants[jitval] = (ctype, value)
Q. Where can I find out more about the permissible values in jitval and what they mean?
This is a "Value" in the JIT IR. https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/ir/ir.h#L148 It represents a local variable or the result of a TorchScript expression.
I don't see a conversion from struct Value to std::string or const char* in the file csrc/jit/ir/ir.h. How does one use jitval as a string in this context?
}
""")
.replace("__DEFINE_CHECK_FUNCTIONS__", "\n".join(define_checks))
.replace("__LOAD_FUNCTIONS__", "\n".join(load_functions))
When we find a symbol with dlsym, do we build it into the PyTorch binary? Would that add to the binary size?
The wrapper obviously has some size in the binary, but dlsym happens at runtime.
Got it. So we assume that libneuralnetworks.so is available if on-device.
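For intuition, a Python analogue of that runtime lookup (illustrative only; the PR's wrapper is C++):

import ctypes

# Only resolvable on a device that actually ships NNAPI.
lib = ctypes.CDLL("libneuralnetworks.so")
create_model = lib.ANeuralNetworksModel_create  # resolved at runtime, like dlsym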
TENSOR_QUANT16_ASYMM = 12


class NNAPI_OperationCode(object):
Does the NNAPI OperationCode have optional or default args like PT operators? For example, if pt::conv_2d() handles 5 args and the last one is optional, there could be calls with either 4 or 5 args, and both are valid. When converted to NNAPI's CONV_2D, would it also cover both the 4-arg and 5-arg situations?
Each NNAPI op defines its own semantics. For example, Conv2D has two forms, one with 10 args and one with 13: https://android.googlesource.com/platform/frameworks/ml/+/refs/tags/android-11.0.0_r8/nn/runtime/include/NeuralNetworks.h#401 . But if you look closer, some of those args are "Available since API level 29", so there are also a 7-arg version and a different 10-arg version. I believe the two different 10-arg versions are distinguished by the types of the arguments.
I see. We may need to find the best schema match for Conv2D manually when converting a conv2d node to nnapi::Conv2D.
value = getattr(obj, name)
output = node.outputsAt(0)
ctype = output.type()
self.add_constant_value(output, ctype, value)
Good idea to freeze the model! So in NNAPI models, there's no concept of "parameters"? The weights and biases are all "constants"? Does that mean it does not have training capability?
NNAPI is only for inference.
Would this model only be used in platforms with nnapi available (Android)?
Yes. (Though it is possible in theory to build the NNAPI CPU implementation for Linux and run the model directly, which is how I've been testing.)
Does that mean we may keep multiple versions of a model?
For Android, we deploy an nnapi model.
For iOS, I'm wondering if there is a corresponding API layer (CoreML?)
@iseeyuan for Metal, I believe it's integrated as a separate backend instead. I'm also curious to know why NNAPI isn't implemented as a separate backend. One thought I had is that PyTorch represents image Tensors as NCHW whereas NNAPI (based on @dreiss's diff summary) requires them to be in NHWC format, so implementing it as a separate backend may have required the conversion on every operator entry/exit? Not sure.
torch/backends/_nnapi/serializer.py
    self.add_getattr(node),
"prim::Constant": lambda self, node:
    self.add_constant_node(node),
"prim::ListConstruct": lambda self, node:
Does NNAPI have corresponding prim::DictConstruct, TupleConstruct, etc.? Do you have any plan to handle those nodes?
NNAPI has no concepts of lists and dicts. It only operates on tensors and scalars: https://android.googlesource.com/platform/frameworks/ml/+/refs/tags/android-11.0.0_r8/nn/runtime/include/NeuralNetworks.h#68
Like how you handled prim::ListConstruct, we could also write code to handle DictConstruct and TupleConstruct on the TorchScript side?
Oh, yes, I suppose so.
def serialize_model(module, inputs, config=None):
    return _NnapiSerializer(config).serialize_model(module, inputs)
Is inputs used inside _NnapiSerializer?
Yes. We use it to determine the input shape.
ANEURALNETWORKS_PREFER_LOW_POWER = 0,
ANEURALNETWORKS_PREFER_FAST_SINGLE_ANSWER = 1,
ANEURALNETWORKS_PREFER_SUSTAINED_SPEED = 2,
} PreferenceCode;
It might be useful to guide the build with PreferenceCode. Should it be specified per model? Where could I find the usage?
Yeah, I would like to expose this. I haven't figured out a good API for it yet.
Awesome work on model serialization and nnapi binding! In addition to the embedded comments, I have some general questions:
Yes. (Though it is possible in theory to build the NNAPI CPU implementation for Linux and run the model directly, which is how I've been testing.)

Unfortunately, it is currently only possible to run these models on Android. I will push harder on getting a host build of NNAPI so we can automate this.

Regarding unit tests: @dreiss Is the NNAPI simulator API written by you or available in open source? If so, could it be checked in and tests run against that library? One option could be to have a differently code-generated copy of

I have reviewed the code related to the general serialization and deserialization of the model in the NNAPI format and it seems fine. I checked if the case statements have some copy-paste issue, and it doesn't seem that they do. I would love to see some test coverage as @iseeyuan mentioned, but I guess there are practical limitations? Specifically, I haven't reviewed the code which actually translates the call from PyTorch into the NNAPI backend, since it requires some understanding of how the translation takes place, and I'm not certain if I have that understanding yet. This includes most code in

No. I have a heavily hacked-up version of the Android source tree that is able to build libneuralnetworks.so on Linux. I've asked the Android team about getting official support for this, but it is quite difficult. I will continue to ask them.
torch/backends/_nnapi/prepare.py
if fmt == 0:
    fixed_args.append(args[idx].contiguous())
elif fmt == 1:
    fixed_args.append(args[idx].permute(0,2,3,1).contiguous())
Would it be feasible to add a comment here saying that fmt == 0 is NCHW (i.e. channels first) and fmt == 1 is NHWC, or maybe use an enum for this?
Enum might be tricky since this needs to be TorchScript-compatible, but I can certainly add a comment.
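Something like the following commented version, based on the discussion above:

# fmt describes each input's memory format:
#   0 -> contiguous, NCHW ("channels first")
#   1 -> channels last; permute NCHW -> NHWC before handing off to NNAPI
if fmt == 0:
    fixed_args.append(args[idx].contiguous())
elif fmt == 1:
    fixed_args.append(args[idx].permute(0, 2, 3, 1).contiguous())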
def convert_model_to_nnapi(model, inputs):
    model = torch.jit.freeze(model)
It's interesting that converting a model to NNAPI requires providing inputs and running the model? Am I reading this correctly?
Seems like it, based on https://discuss.pytorch.org/t/any-different-between-model-input-and-model-forward-input/3690 - i.e. model(input) ends up calling model.forward(input).
Currently, we only run the model in order to get the number, shape, dtype, and qparams of the outputs. We could also do this without running the model.
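So a hedged usage sketch (the model class and whether inputs is a single tensor are my assumptions; the function itself is defined in torch/backends/_nnapi/prepare.py):

import torch
from torch.backends._nnapi.prepare import convert_model_to_nnapi

model = MyModel().eval()                # MyModel is a placeholder
example = torch.zeros(1, 3, 224, 224)   # fixes the input shape and dtype
nnapi_model = convert_model_to_nnapi(torch.jit.script(model), example)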
             out_mem_fmts: List[int],
             out_templates: List[torch.Tensor]):
    super().__init__()
    self.ser_model = ser_model
@dreiss Okay I think I now understand why you did this. Your serialized model (for NNAPI) is just a Tensor represented as a regular member of the original model so that it will get naturally serialized when the model is serialized to torchscript. Is that right?
Yes.
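Roughly like this on the Python side (a sketch, not the PR's exact code):

import torch

ser_bytes = b"\x01\x00\x00\x00"  # placeholder for the serialized model bytes
# A uint8 tensor member rides along with normal TorchScript serialization.
ser_model = torch.tensor(list(ser_bytes), dtype=torch.uint8)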
TORCH_CHECK(serialized_model_tensor.is_contiguous());
c10::ArrayRef<uint8_t> ser_model = {
    serialized_model_tensor.data_ptr<uint8_t>(),
    serialized_model_tensor.nbytes()
};
Okay this is how it's read in.
if (len % 4 == 0) {
    return len;
}
return len + 4 - (phys % 4);
Unlikely that this will result in an overflow, but it could if your model has size 4G - 3 bytes or more.
True, but I think we're a long way away from shipping a 4G model on NNAPI.
torch/backends/_nnapi/serializer.py
def expand_sizes(self, size):
    return [ s.item() for s in size ]
This method seems to be unused.
Probably left-over from the old Caffe2 implementation.
# Pad with 0 bytes out to a multiple of 4 for alignment.
physical_length = ((source_length - 1) | 0x3) + 1
padded_data = data + (b"\0" * (physical_length - source_length))
@dreiss Q. Does this assume a certain endianness?
I don't think so. I assume little-endian somewhere else, though.
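For reference, the round-up expression from the hunk above, worked through (appending zero bytes is itself endianness-neutral):

for n in (1, 4, 5, 8):
    print(n, "->", ((n - 1) | 0x3) + 1)   # 1->4, 4->4, 5->8, 8->8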
torch/backends/_nnapi/prepare.py
@torch.jit.export
def __getstate__(self):
    return self.nnapi_module

@torch.jit.export
def __setstate__(self, nnapi_module):
    self.training = False
    self.nnapi_module = nnapi_module
    self.nnapi_module.init()
Seems like get/set-state is just a way to set arbitrary metadata on a model. It's not clear to me if this method should get/set members that were previously set in the __init__ method, but seems like that is the intent.
The purpose here is to make sure that init gets called when the model is loaded, without the user having to intervene.
    weight_permutation = (0, 2, 3, 1)
elif args.group == in_c:
    # Depthwise convolution
    depthwise = True
    weight_permutation = (1, 2, 3, 0)
else:
    raise Exception("Group convolution not supported yet.")

# TODO: Transform at load time to share weights with CPU model.
nnapi_weight_tensor = weight_tensor.permute(*weight_permutation).contiguous()
weight_id = self.add_tensor_operand_for_weight(nnapi_weight_tensor)
weight_oper = self.operands[weight_id]
I'm wondering if you could add some helper methods like toNCHW and toNHWC, since this seems like a repeating pattern (i.e. permute the dimensions in the Tensor).
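For example (a sketch of the suggested helpers, not code from the PR):

import torch

def to_nhwc(t: torch.Tensor) -> torch.Tensor:
    return t.permute(0, 2, 3, 1).contiguous()  # NCHW -> NHWC

def to_nchw(t: torch.Tensor) -> torch.Tensor:
    return t.permute(0, 3, 1, 2).contiguous()  # NHWC -> NCHW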
for idx, node in enumerate(model.graph.nodes()):
    LOG.debug("Processing node #%d: %r", idx, node)
    self.add_node(node)
Okay, this is the main code graph which is iterated over, and the node is a JIT node.
torch/backends/_nnapi/serializer.py
self_jitval = next(model.graph.inputs())
self.add_constant_value(self_jitval, self_jitval.type(), model)

for input_value, input_tensor in zip(list(model.graph.inputs())[1:], inputs):
    op_id = self.add_tensor_operand_for_input(input_value, input_tensor)
    inp_dim_orders.append(self.operands[op_id].dim_order.value)

for idx, node in enumerate(model.graph.nodes()):
    LOG.debug("Processing node #%d: %r", idx, node)
    self.add_node(node)

retn = model.graph.return_node()
assert retn.inputsSize() == 1
assert retn.outputsSize() == 0
# TODO: Make outputs a local variable?
# TODO: Handle tuple-of-tensor return
for idx in range(1):
    op_id = self.jitval_operand_map[retn.inputsAt(0)]
    self.outputs.append(op_id)
    out_dim_orders.append(self.operands[op_id].dim_order.value)
Okay, so this is where the magic happens (it seems): once the model is run, the JIT (runtime/interpreter) keeps the graph and its intermediate values available for inspection, and this code walks that state and converts it into the custom serialized NNAPI model (stored as a Tensor).
This change itself looks good (based on the stuff I understand and was able to review - verified parts of it with OSS documentation, and everything seems to check out). The actual operator conversion to NNAPI operators isn't reviewed for correctness. A few things you could consider adding are:
LGTM in general. Is there guidance on how to convert and deploy an nnapi model, with an example? That may help me run the code and dive a little deeper. I think it would be good for this first version in prototype status.
Codecov Report
@@                 Coverage Diff                 @@
##           gh/dreiss/77/base    #46780     +/-   ##
=====================================================
- Coverage              60.85%    60.52%    -0.33%
=====================================================
  Files                   2751      2756        +5
  Lines                 254447    255838     +1391
=====================================================
+ Hits                  154849    154857        +8
- Misses                 99598    100981     +1383
Stack from ghstack:

Summary:
This is in prototype status, but pretty functional. There are two major parts.

- Model converter. This is a pure Python component that consumes a model in TorchScript format, converts the operations into NNAPI semantics, and serializes the model in a custom format. It then wraps the result in a new TorchScript model that can invoke NNAPI under the hood.
- Runtime. This is a TorchBind object that deserializes the model and sends the result to NNAPI. This is fairly simple since the serialized format is basically just a list of NNAPI calls to make, so most of the code is spent on bounds checking.

A few notes on the design.

- Currently, all tensor sizes need to be fixed, and those fixed sizes are burned directly into the serialized model. This will probably need to change. NNAPI supports variable-sized tensors, but the important hardware backends do not. However, we're seeing use cases crop up where the input size is not known until around the time that the model is loaded (for example, it might depend on the camera aspect ratio). I think the proper fix here is to remove the code in the converter that eagerly calculates the sizes of the intermediate tensors and replace it with a code generator that will generate some TorchScript code that will perform those calculations at model load time. This way, we will be able to support models that have variable-sized inputs while still only showing fixed-sized operands to NNAPI.
- The important hardware backends want operands to be in NHWC order, but PyTorch natively represents all tensors as NCHW. The strategy for this is to keep NCHW during most of the conversion process, but track an additional value per operand representing the "dimension order". The dimension order gets propagated through convolutions and pointwise ops. When we're ready to serialize the model, we reorder the dimensions for "channels last" operands to NHWC.

Test Plan:
Some local testing with FB prod models. I'll need to add some examples and automated tests.

Differential Revision: D24574040