
Conversation

@lara-hdr (Contributor)

  • Create a library for the custom ops.
  • Register Roi_Align, Roi_Pool and NMS as PyTorch custom ops.
  • Implement the ONNX symbolics for Roi_Align, Roi_Pool and NMS (a sketch of the mechanism follows this list).
  • Add tests for exporting the ops to ONNX.
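For context, here is a minimal sketch of how the ONNX side of this can be wired up. It is an illustrative reconstruction, not necessarily the PR's exact code: the op name torchvision::nms, the shapes, and the opset choice are assumptions, and it assumes torch.onnx.register_custom_op_symbolic as available in PyTorch of that era.

import sys

import torch
from torch.onnx import register_custom_op_symbolic
from torch.onnx.symbolic_helper import parse_args

# 'f' tells the exporter to unwrap iou_threshold into a Python float.
@parse_args('v', 'v', 'f')
def symbolic_nms(g, boxes, scores, iou_threshold):
    # ONNX NonMaxSuppression (opset 10) wants batched, per-class inputs:
    # boxes [num_batches, num_boxes, 4], scores [num_batches, num_classes, num_boxes].
    boxes = g.op('Unsqueeze', boxes, axes_i=[0])
    scores = g.op('Unsqueeze', g.op('Unsqueeze', scores, axes_i=[0]), axes_i=[0])
    max_out = g.op('Constant', value_t=torch.tensor([sys.maxsize], dtype=torch.long))
    iou = g.op('Constant', value_t=torch.tensor([iou_threshold], dtype=torch.float))
    nms_out = g.op('NonMaxSuppression', boxes, scores, max_out, iou)
    # The output is [num_selected, 3]; keep only the box-index column so the
    # exported graph returns the flat index list the eager op returns.
    col = g.op('Constant', value_t=torch.tensor([2], dtype=torch.long))
    return g.op('Squeeze', g.op('Gather', nms_out, col, axis_i=1), axes_i=[1])

register_custom_op_symbolic('torchvision::nms', symbolic_nms, 10)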

@codecov-io commented Aug 28, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@04f70c1).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master    #1267   +/-   ##
=========================================
  Coverage          ?   65.38%           
=========================================
  Files             ?       75           
  Lines             ?     5818           
  Branches          ?      886           
=========================================
  Hits              ?     3804           
  Misses            ?     1738           
  Partials          ?      276
Impacted Files                    Coverage Δ
torchvision/ops/boxes.py          94.59% <100%> (ø)
torchvision/ops/roi_pool.py       70.21% <100%> (ø)
torchvision/ops/_custom_ops.py    100% <100%> (ø)
torchvision/ops/roi_align.py      68% <100%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04f70c1...6cfb171.

@lara-hdr (Contributor, Author)

@fmassa

@fmassa (Member) left a comment

This is looking very good, thanks a lot Lara!

I've made a few comments, let me know what you think.

Also, I'd like @t-vi and @suo to have a look at this PR. @t-vi has worked a bit in the past on having C++ ops with backprop, and @suo might know what is the current status of RegisterOperator supporting backward ops.

@suo self-assigned this Aug 28, 2019
@@ -135,7 +144,14 @@ def get_extensions():
             include_dirs=tests_include_dirs,
             define_macros=define_macros,
             extra_compile_args=extra_compile_args,
-        )
+        ),
+        extension(
@t-vi (Contributor) commented on this diff:

I think one key decision to make is whether we want to cover building a "no-python" library here, too.
If we do, building this in Python setup.py (rather than with cmake) is a bit dubious. If we want this, I think we might want to refine this to not build a Python module (cpp_extension.load has an is_python_module flag, but the *Extension classes do not).

If not, we might fold the custom ops into the extension module.
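For reference, a sketch of the cpp_extension.load route t-vi mentions; the name and source path are hypothetical:

from torch.utils.cpp_extension import load

# is_python_module=False builds the sources into a plain shared library
# and loads it into the current process instead of importing it as a
# Python extension module; its ops then surface under torch.ops.
load(
    name='torchvision_custom_ops',                # hypothetical
    sources=['torchvision/csrc/custom_ops.cpp'],  # hypothetical
    is_python_module=False,
    verbose=True,
)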

@fmassa (Member) replied:

This makes a lot of sense, and I imagine that one would want to run those models in C++ as well on torchscript.

One proposal: we keep this approach as is for this PR, but open a new issue to discuss this?

@t-vi (Contributor) commented Aug 29, 2019

So the two things for differentiable custom ops that I'm aware of (but I haven't really caught up after my vacation) are the "classic" approach of adding a backward, and the "modern" TorchScript source-to-source differentiation (which I have a hunch will take a while yet).

I would recommend enabling the first relatively soon and then maybe deprecating the old extension module. Adding the source-to-source differentiation will be nice, but should be transparent for users.
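A minimal sketch of the "classic" route, assuming hypothetical registered ops torchvision::roi_align_forward and torchvision::roi_align_backward: autograd support is stitched on in Python with an autograd.Function rather than inside the custom op itself.

import torch

class _RoIAlign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, rois, spatial_scale, out_h, out_w, sampling_ratio):
        ctx.save_for_backward(rois)
        ctx.input_shape = input.shape
        ctx.params = (spatial_scale, out_h, out_w, sampling_ratio)
        # The forward pass runs entirely inside the registered custom op.
        return torch.ops.torchvision.roi_align_forward(
            input, rois, spatial_scale, out_h, out_w, sampling_ratio)

    @staticmethod
    def backward(ctx, grad_output):
        rois, = ctx.saved_tensors
        spatial_scale, out_h, out_w, sampling_ratio = ctx.params
        # The backward pass is a second registered op; only the input
        # needs a gradient, the other arguments get None.
        grad_input = torch.ops.torchvision.roi_align_backward(
            grad_output, rois, spatial_scale, out_h, out_w, sampling_ratio,
            ctx.input_shape[0], ctx.input_shape[1],
            ctx.input_shape[2], ctx.input_shape[3])
        return grad_input, None, None, None, None, None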

@fmassa (Member) commented Sep 2, 2019

So the two things for differentiable custom ops that I'm aware of (but I haven't really caught up after my vacation) are the "classic" approach of adding a backward, and the "modern" TorchScript source-to-source differentiation (which I have a hunch will take a while yet).

I would recommend enabling the first relatively soon and then maybe deprecating the old extension module. Adding the source-to-source differentiation will be nice, but should be transparent for users.

This sounds reasonable to me. @suo what's your take on this?

@suo (Member) commented Sep 2, 2019

@t-vi's take seems right to me. The "classic" way is a bit onerous, but we should do that first to unblock the torchscript compat work, then follow up once symbolic-script style backwards is available.

@fmassa (Member) commented Sep 3, 2019

@suo @t-vi sounds good, thanks for your feedback!

To unblock @lara-hdr, we will take a simpler first step and not support autograd in the custom ops for now. I'll file a new task to address this once this PR gets merged, given that it is a separate task from the work Lara needs to do to get those models exportable to ONNX.

@lara-hdr, could you address the comments in this issue (apart from the auto-differentiation, which will be tackled separately)? Then this is good to merge.

Thanks!

@t-vi (Contributor) commented Sep 3, 2019

One small thing: I wonder if we'd want to name the ops roi_align and roi_pool without the "forward".
In the module, we have those because we expect an autograd.Function wrapper to abstract them, but in script they would show up more directly. (I guess we can have a @script wrapper function, too.)

In a similar vein, do we want to return the argmax from the op?
It would feel more natural to me to match the "end user function".
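A sketch of the @script wrapper idea; the op name and signature here are assumptions. The scripted function hides both the "_forward" suffix and the extra argmax output, so script code sees the same surface as the eager end-user function:

import torch

@torch.jit.script
def roi_pool(input: torch.Tensor, rois: torch.Tensor, spatial_scale: float,
             pooled_h: int, pooled_w: int) -> torch.Tensor:
    # Call the registered op and drop the argmax from the result.
    output, _argmax = torch.ops.torchvision.roi_pool_forward(
        input, rois, spatial_scale, pooled_h, pooled_w)
    return output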

@t-vi (Contributor) commented Sep 4, 2019

Based on this PR (at its current state), I wrote https://github.com/t-vi/vision/tree/diff_op, which replaces the autograd.Functions with differentiation in the ops. I haven't really found pytorch/pytorch#23572 to be that great a match, so I implemented it directly. I'll give some thought to streamlining by subclassing CppNode for the backward Node, but my feeling is that quite a few bits of it are not too great a match for us here (I could be wrong about that).

@soumith (Member) commented Sep 4, 2019

cc: @ezyang about Thomas' comment that #23572 wasn't a great match.

@t-vi (Contributor) commented Sep 4, 2019

Just to clarify: The C++ Function facility is great, I just don't know how to use those bits from within ops (but now that I'm writing this, I should try using Function from inside the op and see if it works).

@ezyang (Contributor) commented Sep 4, 2019

Yes, you literally just call the Function from within the traditional op binding, and put the actual forward implementation inside of the C++ class.

You'd be the first real user, so bug reports and feature requests are helpful.

@fmassa (Member) commented Sep 4, 2019

One small thing: I wonder if we'd want to name the ops roi_align and roi_pool without the "forward".
In the module, we have those because we expect an autograd.Function wrapper to abstract them, but in script they would show up more directly. (I guess we can have a @script wrapper function, too.)

Great point. I think we should have a name that kind of matches what we would have if the operator was instead added directly in PyTorch. In this case, it would probably not have the _forward, and the backward op would maybe have a _backward attached to it?

In a similar vein, do we want to return the argmax from the op?
It would feel more natural to me to match the "end user function".

Another great point. I think we should follow something similar to what max_pool does in PyTorch on the JIT (I'm not sure what it currently is, though).
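For reference, the max_pool precedent in eager PyTorch: the end-user function returns only values by default and exposes the indices behind an explicit flag, which is one plausible model for roi_pool's argmax.

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
values = F.max_pool2d(x, kernel_size=2)                         # values only
values, indices = F.max_pool2d(x, kernel_size=2, return_indices=True)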

@fmassa (Member) commented Sep 4, 2019

@t-vi

Based on this PR (at its current state), I wrote https://github.com/t-vi/vision/tree/diff_op, which replaces the autograd.Functions with differentiation in the ops. I haven't really found pytorch/pytorch#23572 to be that great a match, so I implemented it directly. I'll give some thought to streamlining by subclassing CppNode for the backward Node, but my feeling is that quite a few bits of it are not too great a match for us here (I could be wrong about that).

This patch is awesome! I think it's worth integrating it either here, or in a follow-up PR.

One question that I have is how often we will have breakages in torchvision due to us using "internal" functionality. @ezyang any ideas?

@lara-hdr (Contributor, Author) commented Sep 6, 2019

@fmassa, @t-vi, @suo, thanks for the review and all the comments.

  • @t-vi, good point about the no-python library; as @fmassa said, we could start by merging this for now and add it in a follow-up PR.
  • For argmax, we are discussing adding the argmax as an output of MaxRoiPool in ONNX, so for now we will not support it, but it should be added soon.
  • @t-vi, https://github.com/t-vi/vision/tree/diff_op looks great; we could merge this PR first, and I'll let you add your changes to preserve your authorship?

@t-vi (Contributor) commented Sep 6, 2019

I think the workflow should be optimized for handling the merges; I'm not concerned about authorship.
Using the new autograd::Function should make the patch simpler, so there is still an iteration to make.

@t-vi (Contributor) commented Sep 6, 2019

Something seems to want to rename custom_ops.*.so when it is already custom_ops.so for me in python3 setup.py bdist_wheel.
Edit: Ah, no, the second rename fails because the source doesn't exist. Maybe this happens between setup.py develop and bdist_wheel, and we could rename only when the file exists. I am not sure that copying is necessarily a superior workaround. (This seems to be the CPU CI failure, too, when running setup.py bdist_wheel from a fresh checkout. Keeping the renames and making the second rename into torchvision_dir conditional on the file existing fixes it for me.)
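A sketch of the conditional second rename t-vi describes; the paths are examples, not the PR's actual variables:

import os

src = os.path.join('build', 'lib.linux-x86_64-3.7', 'torchvision', 'custom_ops.so')
dst = os.path.join('torchvision', 'custom_ops.so')
# Only rename into the torchvision dir when the build actually produced
# the file; under setup.py develop it may not exist.
if os.path.exists(src):
    os.rename(src, dst)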

@lara-hdr (Contributor, Author) commented Sep 6, 2019

Something seems to want to rename custom_ops.*.so when it is already custom_ops.so for me in python3 setup.py bdist_wheel.
Edit: Ah, no, the second rename fails because the source doesn't exist. Maybe this happens between setup.py develop and bdist_wheel, and we could rename only when the file exists. I am not sure that copying is necessarily a superior workaround. (This seems to be the CPU CI failure, too, when running setup.py bdist_wheel from a fresh checkout. Keeping the renames and making the second rename into torchvision_dir conditional on the file existing fixes it for me.)

You are right about it trying to rename custom_ops.*.so to custom_ops.so in certain cases.
I modified the code to search for the pattern "custom_ops*" (like in the first commit) and copy the found file into the torchvision directory.
The reason I am copying the file is to avoid constructing the path to the built file in _custom_ops.py in a hacky way (and to avoid using glob at runtime, as discussed before).
But it is not ideal, since the library is a couple of MB...

Maybe going back to using glob at runtime would make sense (a sketch follows below)? Any better ideas?
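A sketch of the glob-at-runtime option; the pattern and location are assumptions:

import glob
import os

import torch

# Find the built library next to this file instead of copying it into
# the package at build time.
lib_dir = os.path.dirname(__file__)
matches = glob.glob(os.path.join(lib_dir, 'custom_ops.*'))
if matches:
    torch.ops.load_library(matches[0])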

@t-vi (Contributor) commented Sep 6, 2019

Looking at the build log, I'm having doubts about the loader code in _custom_ops.py:

lib_dir = os.path.join('torchvision')
extension = os.path.basename(torch._C.__file__).rsplit('.', 1)[1]
custom_op_lib = os.path.join(lib_dir, 'custom_ops.' + extension)
torch.ops.load_library(custom_op_lib)

shouldn't lib_dir be relative to __file__ to capture the torchvision installation dir?

It would seem to me that copying the .so to the torchvision dir except for setup.py develop is a red herring.

@lara-hdr (Contributor, Author) commented Sep 6, 2019

@t-vi, yes, as I said in my last comment, I want to avoid doing a copy. But then I would have to access the build directory from _custom_ops.py to find the library (the build directory being something like /vision/build/lib.linux-x86_64-3.7/torchvision).
And I am not sure how to access this directory without using glob at runtime.

@t-vi (Contributor) commented Sep 6, 2019

Maybe I'm missing something, but to me it looks like the problem is that currently the code expects the library to sit in ./torchvision/custom_ops.so relative to the current working dir. I would think that

-lib_dir = os.path.join('torchvision')
+lib_dir = os.path.join(os.path.dirname(__file__), '..')

makes it look in the right dir - i.e. where torchvision is installed.
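Putting the diff and the original loader together, the fixed _custom_ops.py loader would read:

import os

import torch

# Resolve the library relative to the installed package rather than
# the current working directory.
lib_dir = os.path.join(os.path.dirname(__file__), '..')
extension = os.path.basename(torch._C.__file__).rsplit('.', 1)[1]
custom_op_lib = os.path.join(lib_dir, 'custom_ops.' + extension)
torch.ops.load_library(custom_op_lib)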

@t-vi (Contributor) commented Sep 6, 2019

Hm. It seems to be looking in the right place now and locally I get it installed in the right place, too.

@t-vi (Contributor) commented Sep 6, 2019

So after looking at this some more: we do have a no_python_abi_suffix option for BuildExtension, but not for the Cpp/CudaExtension classes in PyTorch.
Apparently getting a custom_ops.so alongside the Python extension module without that is terribly hacky, so I have to retract the suggestion to try to get a custom_ops.so without the ABI suffix.
Please accept my apologies for putting you on the wrong track here!

@lara-hdr (Contributor, Author) commented Sep 7, 2019

Thanks @t-vi for all your help!

@fmassa (Member) commented Sep 8, 2019

FYI, the CI failures are unrelated, and we are looking into fixing them.

@t-vi (Contributor) left a comment

From what I can see, I think it's good to merge. Thank you for working on this!

@fmassa (Member) left a comment

LGTM, thanks a lot!

I'll merge this once I get CI fixed, which I hope will be tomorrow.

@fmassa fmassa closed this Sep 9, 2019
@fmassa fmassa reopened this Sep 9, 2019
@fmassa fmassa merged commit 78f169b into pytorch:master Sep 9, 2019
@fmassa (Member) commented Sep 9, 2019

Thanks a lot Lara!

@ezyang (Contributor) commented Sep 9, 2019

This broke pytorch-master https://circleci.com/gh/pytorch/pytorch/2711933?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link/console

Sep 09 12:02:26 ======================================================================
Sep 09 12:02:26 ERROR: test_list_entrypoints (__main__.TestHub)
Sep 09 12:02:26 ----------------------------------------------------------------------
Sep 09 12:02:26 Traceback (most recent call last):
Sep 09 12:02:26   File "test_utils.py", line 556, in test_list_entrypoints
Sep 09 12:02:26     entry_lists = hub.list('pytorch/vision', force_reload=True)
Sep 09 12:02:26   File "/opt/conda/lib/python3.6/site-packages/torch/hub.py", line 290, in list
Sep 09 12:02:26     hub_module = import_module(MODULE_HUBCONF, repo_dir + '/' + MODULE_HUBCONF)
Sep 09 12:02:26   File "/opt/conda/lib/python3.6/site-packages/torch/hub.py", line 72, in import_module
Sep 09 12:02:26     spec.loader.exec_module(module)
Sep 09 12:02:26   File "<frozen importlib._bootstrap_external>", line 678, in exec_module
Sep 09 12:02:26   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
Sep 09 12:02:26   File "/var/lib/jenkins/.cache/torch/hub/pytorch_vision_master/hubconf.py", line 4, in <module>
Sep 09 12:02:26     from torchvision.models.alexnet import alexnet
Sep 09 12:02:26   File "/var/lib/jenkins/.cache/torch/hub/pytorch_vision_master/torchvision/__init__.py", line 1, in <module>
Sep 09 12:02:26     from torchvision import models
Sep 09 12:02:26   File "/var/lib/jenkins/.cache/torch/hub/pytorch_vision_master/torchvision/models/__init__.py", line 12, in <module>
Sep 09 12:02:26     from . import detection
Sep 09 12:02:26   File "/var/lib/jenkins/.cache/torch/hub/pytorch_vision_master/torchvision/models/detection/__init__.py", line 1, in <module>
Sep 09 12:02:26     from .faster_rcnn import *
Sep 09 12:02:26   File "/var/lib/jenkins/.cache/torch/hub/pytorch_vision_master/torchvision/models/detection/faster_rcnn.py", line 7, in <module>
Sep 09 12:02:26     from torchvision.ops import misc as misc_nn_ops
Sep 09 12:02:26   File "/var/lib/jenkins/.cache/torch/hub/pytorch_vision_master/torchvision/ops/__init__.py", line 1, in <module>
Sep 09 12:02:26     from .boxes import nms, box_iou
Sep 09 12:02:26   File "/var/lib/jenkins/.cache/torch/hub/pytorch_vision_master/torchvision/ops/boxes.py", line 2, in <module>
Sep 09 12:02:26     import torchvision.ops._custom_ops
Sep 09 12:02:26   File "/var/lib/jenkins/.cache/torch/hub/pytorch_vision_master/torchvision/ops/_custom_ops.py", line 9, in <module>
Sep 09 12:02:26     file, path, description = imp.find_module("_custom_ops", [lib_dir])
Sep 09 12:02:26   File "/opt/conda/lib/python3.6/imp.py", line 297, in find_module
Sep 09 12:02:26     raise ImportError(_ERR_MSG.format(name), name=name)
Sep 09 12:02:26 ImportError: No module named '_custom_ops'

ezyang added a commit that referenced this pull request Sep 9, 2019
ezyang added a commit that referenced this pull request Sep 9, 2019
fmassa added a commit that referenced this pull request Sep 9, 2019
@fmassa (Member) commented Sep 9, 2019

I'm sending a follow-up patch making the imports lazy again in #1317
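A sketch of the lazy-import pattern #1317 moves to; the function body here is illustrative, not the patch itself. The extension is only loaded when an op is first called, so a bare "import torchvision" (e.g. through torch.hub) cannot fail on a missing library:

import torch

def nms(boxes, scores, iou_threshold):
    # Deferring the import loads the custom-ops library on first use.
    from torchvision.ops import _custom_ops  # noqa: F401
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)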

fmassa pushed a commit that referenced this pull request Sep 10, 2019
* Revert "Revert "Register Torchvision Ops as Cutom Ops (#1267)" (#1316)"

This reverts commit fe234fc.

* Make import of C++ extensions lazy

* define python initialization functions for extension

* Fix lint