
PyTorch NNAPI integration prototype #46780

Closed
wants to merge 3 commits

Conversation

Contributor

@dreiss commented Oct 23, 2020

Stack from ghstack:

Summary:
This is in prototype status, but pretty functional. There are two major
parts.

  • Model converter. This is a pure Python component that consumes a
    model in TorchScript format, converts the operations into NNAPI
    semantics, and serializes the model in a custom format. It then wraps
    the result in a new TorchScript model that can invoke NNAPI under the
    hood.
  • Runtime. This is a TorchBind object that deserializes the model and
    sends the result to NNAPI. This is fairly simple since the serialized
    format is basically just a list of NNAPI calls to make, so most of the
    code is spent on bounds checking.

A few notes on the design.

  • Currently, all tensor sizes need to be fixed, and those fixed sizes
    are burned directly into the serialized model. This will probably
    need to change. NNAPI supports variable-sized tensors, but the
    important hardware backends do not. However, we're seeing use cases
    crop up where the input size is not known until around the time that
    the model is loaded (for example, it might depend on the camera aspect
    ratio). I think the proper fix here is to remove the code in the
    converter that eagerly calculates the sizes of the intermediate
    tensors and replace it with a code generator that will generate some
    TorchScript code that will perform those calculations at model load
    time. This way, we will be able to support models that have
    variable-sized inputs while still only showing fixed-sized operands to
    NNAPI.
  • The important hardware backends want operands to be in NHWC order, but
    PyTorch natively represents all tensors as NCHW. The strategy for
    this is to keep NCHW during most of the conversion process, but track
    an additional value per operand representing the "dimension order".
    The dimension order gets propagated through convolutions and pointwise
    ops. When we're ready to serialize the model, we reorder the
    dimensions for "channels last" operands to NHWC. (A small sketch of
    this reorder follows this list.)
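
For illustration only (hypothetical helper name, not the converter's actual code), the NHWC reorder for a 4-D operand amounts to:

import torch

def nchw_to_nhwc(t: torch.Tensor) -> torch.Tensor:
    # Permute a "channels last" operand from PyTorch's native NCHW layout
    # to the NHWC layout that NNAPI hardware backends prefer.
    return t.permute(0, 2, 3, 1).contiguous()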

Test Plan:
Some local testing with FB prod models. I'll need to add some examples
and automated tests.

Differential Revision: D24574040

@dr-ci

dr-ci bot commented Oct 23, 2020

💊 CI failures summary and remediations

As of commit c323076 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed


codecov.io: 1 failed



int32_t zeroPoint;
} ANeuralNetworksOperandType;

#endif // MINIMAL_NEURAL_NETWORKS_H
Contributor

Okay, so this file is mainly from NNAPI's source tree, i.e. enum and struct definitions.

if fmt == 0:
    fixed_args.append(args[idx].contiguous())
elif fmt == 1:
    fixed_args.append(args[idx].permute(0, 2, 3, 1).contiguous())
Contributor

Okay, this is where the NHWC handling you mentioned earlier happens.

return struct.pack("i" * len(ints), *ints)


ADDER_MAP = {
Contributor

Okay, so these are the ATen operators supported for NNAPI execution. I'm guessing this list is dictated by the operation codes for the operators from https://developer.android.com/ndk/reference/group/neural-networks#operationcode

Contributor Author

Yeah, this maps from the internal TorchScript operator name to the function that needs to handle that node. In some cases (like processing constants), there's no NNAPI operation generated.

Contributor

I'm assuming this map could grow as we support more and more NNAPI operations?

Contributor Author

Yes. Though I'd like to replace the explicit map with a decorator.
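
Something like this, as a rough sketch (names are hypothetical, not the current code):

ADDER_MAP = {}

def register_adder(op_name):
    # Record the handler for a TorchScript operator name at definition time,
    # replacing a manually maintained entry in the explicit map.
    def wrap(fn):
        ADDER_MAP[op_name] = fn
        return fn
    return wrap

@register_adder("aten::relu")
def add_relu(self, node):
    ...  # emit the corresponding NNAPI operation here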

if config is None:
    config = {}

self.solid_weights = config.get("solid_weights", False)
Contributor

Q. What is solid_weights?

Contributor Author

Oh, this is a vestigial feature. I'll delete it. Currently, we get each weight from a separate tensor. "Solid weights" was to let all of the weights be bundled as a single blob, to make it possible to deploy without using PyTorch or Caffe2 as a wrapper.

Contributor

Thanks! Now I understand the intent, but not the details, so I shan't mull over them for now.

self.modules = {}
self.constants = {}
self.jitval_operand_map = {}
self.cached_immediates = {}
Contributor

Okay - cached_immediates is a set of all literal constant values that show up in this model, de-duped.

return abs(lhs - rhs) <= tolerance * min(lhs, rhs)


def tensor_size(op_type, dims):
Contributor

Okay so this is the tensor size in bytes.
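
Roughly (a sketch with assumed per-type element sizes, not the exact code from the diff):

import functools
import operator

ELEMENT_SIZES = {
    "TENSOR_FLOAT32": 4,
    "TENSOR_INT32": 4,
    "TENSOR_QUANT8_ASYMM": 1,
}

def tensor_size_bytes(op_type, dims):
    # Size in bytes = number of elements times the per-element size of the operand type.
    numel = functools.reduce(operator.mul, dims, 1)
    return numel * ELEMENT_SIZES[op_type]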

Comment on lines 235 to 241
if len(s1) > len(s2):
    # s2 = [1] * (len(s1) - len(s2)) + s2
    raise Exception("Non-equal-rank broadcast is too dangerous because XXX.")
if len(s2) > len(s1):
    # s3 = [1] * (len(s2) - len(s1)) + s1
    raise Exception("Non-equal-rank broadcast is too dangerous because XXX.")
Contributor

@dreiss Maybe this is not relevant here, but https://pytorch.org/docs/stable/notes/broadcasting.html suggests that broadcasting can account for non-equal shapes of tensors. Is this something that would be a limitation of this specific API?

Contributor Author

I need to clean this code up a bit. In theory, PyTorch and NNAPI have the same broadcasting semantics (back to front, extend the front with 1s). However, when using "nnapi_nhwc", it gets a bit wonky, because PyTorch uses an implicit NHWC representation based on strides, so the tensors still have NCHW semantics, and broadcasting order is still W,H,C,N. NNAPI (and TF and C2) use explicit NHWC, so the broadcast order is C,W,H,N. The trickiest bits come into play when we try to broadcast a known NHWC tensor with a constant (where we don't necessarily know the user's intention). I haven't worked through all the cases yet, so I decided not to support non-equal-rank broadcast for now.
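
For reference, standard PyTorch broadcasting aligns shapes back to front and pads the shorter shape with leading 1s:

import torch

a = torch.ones(2, 3, 4)   # shape (2, 3, 4)
b = torch.ones(3, 1)      # treated as (1, 3, 1) after front-padding
c = a + b                 # broadcasts to shape (2, 3, 4)
assert c.shape == (2, 3, 4)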

Contributor

Thanks! I understand now that there is additional complexity when you're dealing with different orders of dimensions, and that broadcasting only works when the dimensions are ordered the same. At a high level I gather that you can't make that assertion here, so you've left this open. If you somehow knew the order of dimensions, then you could have supported the padding-with-1s behaviour. Got it!

Comment on lines 539 to 580
header = struct.pack(
    "iiiiii",
    version,
    len(self.operands),
    len(self.values),
    len(self.operations),
    len(self.inputs),
    len(self.outputs),
)
model.append(header)

serialized_values, serialized_value_data = self.serialize_values()

model.extend(struct.pack("iifi", t, len(d), s, z) for (t, d, _m, s, z) in self.operands)
model.extend(serialized_values)
model.extend(struct.pack("iii", *x) for x in self.operations)
model.extend(self.serialize_ints(fix_shape(dims, mf)) for (_, dims, mf, _, _) in self.operands)
model.extend(serialized_value_data)
model.append(self.serialize_ints(self.operation_args))
model.append(self.serialize_ints(self.inputs))
model.append(self.serialize_ints(self.outputs))
Contributor

Okay, so the serialized model is basically a file with a header specifying the version, num(operands), num(values), etc., where the operands, values, etc. are fixed-size structures, and the deserializer reads them off as such because it knows how many of each to expect.
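
The read side then conceptually does the reverse (a sketch only; the real loader is the C++ in nnapi_model_loader):

import struct

HEADER_FORMAT = "iiiiii"  # version, #operands, #values, #operations, #inputs, #outputs

def read_header(buf):
    # Fixed-size header at the front of the serialized model.
    size = struct.calcsize(HEADER_FORMAT)
    return struct.unpack(HEADER_FORMAT, buf[:size])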

self.outputs = []

self.modules = {}
self.constants = {}
Contributor

Maybe I'm missing something, but I don't see any place that constants is being set (or updated). I only see queries.

Contributor

nvm I see it now as self.constants[key].

Comment on lines +443 to +461
def add_constant_value(self, jitval, ctype, value):
    assert jitval not in self.constants
    self.constants[jitval] = (ctype, value)
Contributor

Q. Where can I find out more about the permissible values in jitval and what they mean?

Contributor Author

This is a "Value" in the JIT IR. https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/ir/ir.h#L148 It represents a local variable or the result of a TorchScript expression.

Contributor

I don't see a conversion from struct Value to std::string or const char* in the file csrc/jit/ir/ir.h. How does one use jitval as a string in this context?

dreiss added a commit to dreiss/pytorch that referenced this pull request Oct 27, 2020
ghstack-source-id: e1fa978af170d4d00c5270c52b9d4cb63843e7d2
Pull Request resolved: pytorch#46780
}
""")
.replace("__DEFINE_CHECK_FUNCTIONS__", "\n".join(define_checks))
.replace("__LOAD_FUNCTIONS__", "\n".join(load_functions))
Contributor

When we find a symbol with dlsym, do we build it into the PyTorch binary? Would that introduce some binary size?

Contributor Author

The wrapper obviously has some size in the binary, but dlsym happens at runtime.

Contributor

Got it. So we assume that libneuralnetworks.so is available on-device.
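
Purely to illustrate the runtime-lookup idea (the actual wrapper is generated C++ using dlopen/dlsym, not Python):

import ctypes

# Resolve the NNAPI library and a symbol at runtime instead of linking against it.
nnapi = ctypes.CDLL("libneuralnetworks.so")         # analogous to dlopen
model_create = nnapi.ANeuralNetworksModel_create    # analogous to dlsym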

TENSOR_QUANT16_ASYMM = 12


class NNAPI_OperationCode(object):
Contributor

Does the NNAPI OperationCode have optional or default args like PT operators? For example, if pt::conv_2d() handles 5 args and the last one is optional, there could be calls with either 4 or 5 args, and both are valid. When converted to NNAPI's CONV_2D, would it also cover both the 4-arg and 5-arg situations?

Contributor Author

Each NNAPI op defines its own semantics. For example, Conv2D has two forms, one with 10 args and one with 13: https://android.googlesource.com/platform/frameworks/ml/+/refs/tags/android-11.0.0_r8/nn/runtime/include/NeuralNetworks.h#401 . But if you look closer, some of those args are "Available since API level 29", so there are also a 7-arg version and a different 10-arg version. I believe the two different 10-arg versions are distinguished by the types of the arguments.

Contributor

I see. We may need to find the best schema match for Conv2D manually when converting a conv2d node to nnapi::Conv2D.

value = getattr(obj, name)
output = node.outputsAt(0)
ctype = output.type()
self.add_constant_value(output, ctype, value)
Contributor

Good idea to freeze the model! So in NNAPI models, there's no concept of "parameters"? The weights and biases are all "constants"? Does that mean it does not have training capability?

Contributor Author

NNAPI is only for inference.

Contributor

Would this model only be used on platforms with NNAPI available (Android)?

Yes. (Though it is possible in theory to build the NNAPI CPU implementation for Linux and run the model directly, which is how I've been testing.)

Does that mean we may keep multiple versions of a model?
For Android, we deploy an NNAPI model.
For iOS, I'm wondering if there is a corresponding API layer (CoreML?)

Contributor

@iseeyuan for Metal, I believe it's integrated as a separate backend instead. I'm also curious to know why NNAPI isn't implemented as a separate backend. One thought I had is that PyTorch represents image Tensors as NCHW whereas NNAPI (based on @dreiss's diff summary) requires them to be in NHWC format, so implementing it as a separate backend may have required the conversion on every operator entry/exit? Not sure.

    self.add_getattr(node),
"prim::Constant": lambda self, node:
    self.add_constant_node(node),
"prim::ListConstruct": lambda self, node:
Contributor

Could NNAPI have corresponding prim::DictConstruct, prim::TupleConstruct, etc.? Do you have any plan to handle those nodes?

Contributor Author

NNAPI has no concepts of lists and dicts. It only operates on tensors and scalars: https://android.googlesource.com/platform/frameworks/ml/+/refs/tags/android-11.0.0_r8/nn/runtime/include/NeuralNetworks.h#68

Contributor

Like how you handled prim::ListConstruct, we could also write code to handle DictConstruct and TupleConstruct on the TorchScript side?

Contributor Author

Oh, yes, I suppose so.



def serialize_model(module, inputs, config=None):
    return _NnapiSerializer(config).serialize_model(module, inputs)
Contributor

Is inputs used inside _NnapiSerializer?

Contributor Author

Yes. We use it to determine the input shape.

ANEURALNETWORKS_PREFER_LOW_POWER = 0,
ANEURALNETWORKS_PREFER_FAST_SINGLE_ANSWER = 1,
ANEURALNETWORKS_PREFER_SUSTAINED_SPEED = 2,
} PreferenceCode;
Contributor

It might be useful to guide the build with PreferenceCode. Should it be specified per model? Where could I find the usage?

Contributor Author

Yeah, I would like to expose this. I haven't figured out a good API for it yet.

@iseeyuan
Contributor

iseeyuan commented Oct 28, 2020

Awesome work on the model serialization and NNAPI binding! In addition to the embedded comments, I have some general questions:

  1. Looks like the serializer converts the TorchScript graph to an NNAPI model (a torch.nn.Module) and scripts it. Would this model only be used on platforms with NNAPI available (Android)?
  2. If we deploy this NNAPI model to production, do you estimate any blockers to converting it to bytecode?
  3. Is there a test covering serialization -> loading -> running, verifying that the results match the original TorchScript model?

@dreiss
Contributor Author

dreiss commented Oct 28, 2020

Would this model only be used on platforms with NNAPI available (Android)?

Yes. (Though it is possible in theory to build the NNAPI CPU implementation for Linux and run the model directly, which is how I've been testing.)

If we deploy this NNAPI model to production, do you estimate any blockers to converting it to bytecode?

I have tested with the lite interpreter. There are some minor blockers, but I think I have appropriate fixes ready.

Is there a test covering serialization -> loading -> running, verifying that the results match the original TorchScript model?

Unfortunately, it is currently only possible to run these models on Android. I will push harder on getting a host build of NNAPI so we can automate this.

@dhruvbird
Contributor

Regarding unit tests: @dreiss Is the NNAPI simulator API written by you or available in open source? If so, could it be checked in and tests run against that library? One option could be to have a differently code-generated copy of nnapi_wrapper.cpp which just calls into the emulation layer methods instead of going through the function pointers.

I have reviewed the code related to the general serialization and deserialization of the model in the NNAPI format and it seems fine. I checked if the case statements have some copy-paste issue, and it doesn't seem that they do. I would love to see some test coverage as @iseeyuan mentioned, but I guess there are practical limitations? Specifically, nnapi_wrapper, nnapi_model_loader, codegen, and parts of serializer seem fine.

I haven't reviewed the code which actually translates the call from PyTorch into the NNAPI backend, since it requires some understanding of how the translation takes place, and I'm not certain if I have that understanding yet. This includes most code in serializer, which does the actual translation from pt->nnapi.

@dreiss
Contributor Author

dreiss commented Oct 29, 2020

Is the NNAPI simulator API written by you or available in open source?

No. I have a heavily hacked-up version of the Android source tree that is able to build libneuralnetworks.so on Linux. I've asked the Android team about getting official support for this, but it is quite difficult. I will continue to ask them.

Comment on lines 48 to 51
if fmt == 0:
    fixed_args.append(args[idx].contiguous())
elif fmt == 1:
    fixed_args.append(args[idx].permute(0, 2, 3, 1).contiguous())
Contributor

Would it be feasible to add a comment here saying that fmt == 0 is NCHW (i.e. channels first) and fmt == 1 is NHWC, or maybe use an enum for this?

Contributor Author

Enum might be tricky since this needs to be TorchScript-compatible, but I can certainly add a comment.
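
Something along these lines (a sketch of the comment, reusing the snippet above):

# fmt encodes the memory format of the corresponding argument:
#   0 -> NCHW, PyTorch's native contiguous layout
#   1 -> NHWC ("channels last"), the layout NNAPI backends expect
if fmt == 0:
    fixed_args.append(args[idx].contiguous())
elif fmt == 1:
    fixed_args.append(args[idx].permute(0, 2, 3, 1).contiguous())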

Comment on lines +132 to +149
def convert_model_to_nnapi(model, inputs):
    model = torch.jit.freeze(model)
@dhruvbird (Contributor), Oct 30, 2020

It's interesting that converting a model to NNAPI requires providing inputs and running the model? Am I reading this correctly?

Seems like it based on https://discuss.pytorch.org/t/any-different-between-model-input-and-model-forward-input/3690

i.e. model(input) ends up calling model.forward(input).

Contributor Author

Currently, we only run the model in order to get the number, shape, dtype, and qparams of the outputs. We could also do this without running the model.
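
Conceptually something like this (a sketch, not the actual helper; model and inputs as in the snippet above):

import torch

with torch.no_grad():
    outs = model(*inputs)
if isinstance(outs, torch.Tensor):
    outs = (outs,)
# Keep only "templates" that carry the count, shape, and dtype of each output.
out_templates = [torch.zeros(o.shape, dtype=o.dtype) for o in outs]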

out_mem_fmts: List[int],
out_templates: List[torch.Tensor]):
    super().__init__()
    self.ser_model = ser_model
Contributor

@dreiss Okay I think I now understand why you did this. Your serialized model (for NNAPI) is just a Tensor represented as a regular member of the original model so that it will get naturally serialized when the model is serialized to torchscript. Is that right?

Contributor Author

Yes.
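
Roughly (a sketch; the exact construction in the converter may differ):

import torch

ser_bytes = b"..."  # placeholder for the serializer's output
# Stored as a uint8 tensor attribute so it rides along with normal
# TorchScript serialization, with no side-channel file needed.
ser_model = torch.tensor(list(ser_bytes), dtype=torch.uint8)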

Comment on lines +69 to +75
TORCH_CHECK(serialized_model_tensor.is_contiguous());
c10::ArrayRef<uint8_t> ser_model = {
    serialized_model_tensor.data_ptr<uint8_t>(),
    serialized_model_tensor.nbytes()
};
Contributor

Okay this is how it's read in.

if (len % 4 == 0) {
  return len;
}
return len + 4 - (phys % 4);
Contributor

Unlikely that this will result in an overflow, but it could if your model has size 4G - 3 bytes or more.

Contributor Author

True, but I think we're a long way away from shipping a 4G model on NNAPI.

Comment on lines 463 to 480
def expand_sizes(self, size):
    return [s.item() for s in size]
Contributor

This method seems to be unused.

Contributor Author

Probably left-over from the old Caffe2 implementation.


# Pad with 0 bytes out to a multiple of 4 for alignment.
physical_length = ((source_length - 1) | 0x3) + 1
padded_data = data + (b"\0" * (physical_length - source_length))
Contributor

@dreiss Q. Does this assume a certain endianness?

Contributor Author

I don't think so. I assume little-endian somewhere else, though.

Comment on lines 76 to 84
@torch.jit.export
def __getstate__(self):
    return self.nnapi_module

@torch.jit.export
def __setstate__(self, nnapi_module):
    self.training = False
    self.nnapi_module = nnapi_module
    self.nnapi_module.init()
Contributor

Seems like get/set-state is just a way to set arbitrary metadata on a model. It's not clear to me if this method should get/set members that were previously set in the __init__ method, but seems like that is the intent.

#20242

Contributor Author

The purpose here is to make sure that init gets called when the model is loaded, without the user having to intervene.
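
So a plain save/load round trip is enough to re-initialize the NNAPI side (illustrative; wrapped_module and example_input are placeholders):

import torch

torch.jit.save(wrapped_module, "model_nnapi.pt")
loaded = torch.jit.load("model_nnapi.pt")   # __setstate__ runs here and calls init()
out = loaded(example_input)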

Comment on lines +1037 to +1337
    weight_permutation = (0, 2, 3, 1)
elif args.group == in_c:
    # Depthwise convolution
    depthwise = True
    weight_permutation = (1, 2, 3, 0)
else:
    raise Exception("Group convolution not supported yet.")

# TODO: Transform at load time to share weights with CPU model.
nnapi_weight_tensor = weight_tensor.permute(*weight_permutation).contiguous()
weight_id = self.add_tensor_operand_for_weight(nnapi_weight_tensor)
weight_oper = self.operands[weight_id]
Contributor

I'm wondering if you could add some helper methods like toNCHW and toNHWC, since this seems like a repeating pattern (i.e. permute the dimensions of the Tensor).

Comment on lines +522 to +540
for idx, node in enumerate(model.graph.nodes()):
    LOG.debug("Processing node #%d: %r", idx, node)
    self.add_node(node)
Contributor

Okay, this is the main graph that gets iterated over, and each node is a JIT node.

Comment on lines 515 to 556
self_jitval = next(model.graph.inputs())
self.add_constant_value(self_jitval, self_jitval.type(), model)

for input_value, input_tensor in zip(list(model.graph.inputs())[1:], inputs):
    op_id = self.add_tensor_operand_for_input(input_value, input_tensor)
    inp_dim_orders.append(self.operands[op_id].dim_order.value)

for idx, node in enumerate(model.graph.nodes()):
    LOG.debug("Processing node #%d: %r", idx, node)
    self.add_node(node)

retn = model.graph.return_node()
assert retn.inputsSize() == 1
assert retn.outputsSize() == 0
# TODO: Make outputs a local variable?
# TODO: Handle tuple-of-tensor return
for idx in range(1):
    op_id = self.jitval_operand_map[retn.inputsAt(0)]
    self.outputs.append(op_id)
    out_dim_orders.append(self.operands[op_id].dim_order.value)

Contributor

Okay so this is where the magic happens (it seems). i.e. once the model is run, somehow the JIT (runtime/interpreter) caches all intermediate state, and makes it available to anyone who wishes to inspect the JIT state -- which this code does and converts it into the custom NNAPI serialized Tensor (well serialized model).

@dhruvbird
Contributor

This change itself looks good (based on the stuff I understand and was able to review - verified parts of it with OSS documentation, and everything seems to check out). The actual operator conversion to NNAPI operators isn't reviewed for correctness.

A few things you could consider adding are:

  1. When serializing every individual type of entity, put some magic constants so that if some change is made, the deserializer can fail loudly if it expects something different. Currently, it seems like it can fail at a different point in time based on size checks or even worse, if 2 types of structs with the same size change positions, no one will know.
  2. Sprinkle the code with some documentation for key parts which are non-trivial or not super obvious.

@iseeyuan (Contributor) left a comment

LGTM in general. Is there guidance on how to convert and deploy an NNAPI model, with an example? That may help me run the code and dive a little deeper. I think that would be good for this first version in prototype status.

@codecov

codecov bot commented Nov 5, 2020

Codecov Report

Merging #46780 into gh/dreiss/77/base will decrease coverage by 0.32%.
The diff coverage is 0.35%.

@@                  Coverage Diff                  @@
##           gh/dreiss/77/base   #46780      +/-   ##
=====================================================
- Coverage              60.85%   60.52%   -0.33%     
=====================================================
  Files                   2751     2756       +5     
  Lines                 254447   255838    +1391     
=====================================================
+ Hits                  154849   154857       +8     
- Misses                 99598   100981    +1383     

@facebook-github-bot
Contributor

@dreiss merged this pull request in 9a9383e.

@facebook-github-bot deleted the gh/dreiss/77/head branch November 9, 2020 15:17