
reimplement __torch_function__ overrides for torch.functional using inline logic #32194

Closed

Conversation

@ngoldbaum (Contributor) commented Jan 14, 2020

Fixes #30831.

This improves the performance of operators in the `torch.functional` namespace that are overridable by `__torch_function__` implementations when supplied with `Tensor` operands.
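
The "inline logic" in the title means that each Python wrapper in `torch.functional` now checks for overrides itself instead of going through a decorator. A minimal sketch of the pattern, written against the public `torch.overrides` helpers (the PR used private equivalents at the time, so the exact imports, and the omission of the TorchScript guard, are assumptions rather than the literal diff):

```python
# Sketch of inline __torch_function__ dispatch for torch.split.
import torch
from torch import Tensor
from torch.overrides import has_torch_function, handle_torch_function

def split(tensor, split_size_or_sections, dim=0):
    # Fast path: a plain Tensor skips override handling entirely,
    # which is where the saving in the timings below comes from.
    if type(tensor) is not Tensor and has_torch_function((tensor,)):
        # Slow path: defer to the operand's __torch_function__.
        return handle_torch_function(
            split, (tensor,), tensor, split_size_or_sections, dim=dim)
    return tensor.split(split_size_or_sections, dim)
```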

Running the split benchmark in various configurations produces the following timings:

Timings on `master`:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cpu
# Input: M: 8, N: 8, parts: 2, device: cpu
Forward Execution Time (us) : 3.340

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cuda
# Input: M: 8, N: 8, parts: 2, device: cuda
Forward Execution Time (us) : 3.333

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cpu
# Input: M: 256, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 3.366

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cuda
# Input: M: 256, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 3.385

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cpu
# Input: M: 512, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 3.468

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cuda
# Input: M: 512, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 3.416
```

Timings with this pull request applied:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cpu
# Input: M: 8, N: 8, parts: 2, device: cpu
Forward Execution Time (us) : 2.261

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cuda
# Input: M: 8, N: 8, parts: 2, device: cuda
Forward Execution Time (us) : 2.223

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cpu
# Input: M: 256, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.237

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cuda
# Input: M: 256, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.218

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cpu
# Input: M: 512, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.259

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cuda
# Input: M: 512, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.234
```

Timings on `master` with `__torch_function__` dispatch disabled:

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cpu
# Input: M: 8, N: 8, parts: 2, device: cpu
Forward Execution Time (us) : 2.180

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M8_N8_parts2_cuda
# Input: M: 8, N: 8, parts: 2, device: cuda
Forward Execution Time (us) : 2.172

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cpu
# Input: M: 256, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.171

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M256_N512_parts2_cuda
# Input: M: 256, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.146

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cpu
# Input: M: 512, N: 512, parts: 2, device: cpu
Forward Execution Time (us) : 2.175

# Benchmarking PyTorch: split
# Mode: Eager
# Name: split_M512_N512_parts2_cuda
# Input: M: 512, N: 512, parts: 2, device: cuda
Forward Execution Time (us) : 2.152
```
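
For reference, timings like the ones above come from the operator microbenchmark suite under `benchmarks/operator_benchmark` in the PyTorch tree. An invocation along these lines should reproduce them, though the module name and flag here are assumptions based on that suite's conventions rather than anything recorded in this PR:

```
cd benchmarks/operator_benchmark
python -m pt.split_test --tag_filter short
```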
So at least on the machine I'm testing on, this brings the dispatch overhead down to less than 100 ns: for the smallest input, 2.261 µs with this pull request applied versus 2.180 µs with `__torch_function__` dispatch disabled is roughly 80 ns of overhead. For comparison, the overhead for `__array_function__` in NumPy is about 850 ns on the same machine.

Timings for NumPy `__array_function__` dispatch:

```
In [1]: import numpy as np

In [2]: %timeit np.mean([1])
8.89 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: %timeit np.mean._implementation([1])
8.04 µs ± 28.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

See [the implementation in NumPy](https://github.com/numpy/numpy/blob/master/numpy/core/overrides.py#L195) for why the difference between these two calls measures `__array_function__` overhead: `np.mean._implementation` is the undecorated function, so it bypasses dispatch entirely (8.89 µs - 8.04 µs ≈ 850 ns).
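
A condensed sketch of the decorator behind this (simplified from the linked `overrides.py`; the internal import shown matches NumPy as of early 2020 and moved in later releases):

```python
import functools

# NumPy's C-level dispatch entry point; this import path is the one
# used around the time of this PR and changed in later NumPy releases.
from numpy.core._multiarray_umath import implement_array_function

def array_function_dispatch(dispatcher):
    """Simplified version of NumPy's public-API decorator."""
    def decorator(implementation):
        @functools.wraps(implementation)
        def public_api(*args, **kwargs):
            # Gather arguments that may define __array_function__ and
            # route the call through the override machinery.
            relevant_args = dispatcher(*args, **kwargs)
            return implement_array_function(
                implementation, public_api, relevant_args, args, kwargs)
        # The undecorated function stays reachable; timing it against
        # public_api isolates the pure dispatch overhead.
        public_api._implementation = implementation
        return public_api
    return decorator
```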

@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in bab87e4.

wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020

reimplement `__torch_function__` overrides for torch.functional using inline logic (pytorch#32194)

Summary: same as the pull request description above.

Pull Request resolved: pytorch#32194

Differential Revision: D19410396

Pulled By: ezyang

fbshipit-source-id: ada788a4399c81cd7eb2d548aa04a2459e96634a
facebook-github-bot pushed a commit that referenced this pull request Feb 21, 2020
…onal (#32799)

Summary:
This adds `__torch_function__` support for all functions in `torch.functional` and `torch.nn.functional`.

The changes to C++ code and codegen scripts are to facilitate adding `__torch_function__` support for the native functions in `torch._C._nn`. Note that I moved the `handle_torch_function` C++ function to a header that both `python_torch_functions.cpp` and `python_nn_functions.cpp` include. The changes to `python_nn_functions.cpp` mirror the changes I made to `python_torch_functions.cpp` when `__torch_function__` support was first added in #27064. Due to the somewhat different way the `torch._C` and `torch._C._nn` namespaces are initialized I needed to create a new static reference to the `torch._C._nn` namespace (`THPNNVariableFunctions`). I'm not sure if that is the best way to do this. In principle I could import these namespaces in each kernel and avoid the global variable but that would have a runtime cost.

I added `__torch_function__` support to the Python functions in `torch.nn.functional` following the approach in #32194.

I re-enabled the test that checks if all functions in the `torch` namespace are explicitly tested for `__torch_function__` support. I also generalized the check to work for `torch.functional` and `torch.nn.functional` as well. This test was explicitly disabled in #30730 and I'm happy to disable it again if you think that's appropriate. I figured now was as good a time as any to try to re-enable it.

Finally I adjusted the existing torch API tests to suppress deprecation warnings and add keyword arguments used by some of the code in `torch.nn.functional` that were missed when I originally added the tests in #27064.
Pull Request resolved: #32799

Differential Revision: D19956809

Pulled By: ezyang

fbshipit-source-id: 40d34e0109cc4b9f3ef62f409d2d35a1d84e3d22
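
As a usage-level illustration of what these changes enable, a minimal Tensor-like class can intercept calls into `torch.functional` and `torch.nn.functional`. The `LoggingTensor` name below is made up for this sketch, and the classmethod signature shown is the one the `__torch_function__` protocol later standardized on, so treat it as illustrative rather than the code from either PR:

```python
import torch

class LoggingTensor:
    """Hypothetical wrapper that logs every intercepted torch call."""
    def __init__(self, data):
        self.data = data

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"intercepted {func.__name__}")
        # Unwrap our arguments, run the real op, rewrap Tensor results.
        unwrapped = [a.data if isinstance(a, LoggingTensor) else a
                     for a in args]
        out = func(*unwrapped, **kwargs)
        return cls(out) if isinstance(out, torch.Tensor) else out

x = LoggingTensor(torch.arange(8))
torch.split(x, 2)  # prints: intercepted split
```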
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020

reimplement `__torch_function__` overrides for torch.functional using inline logic (pytorch#32194)

Summary: same as the pull request description above.

Pull Request resolved: pytorch#32194

Differential Revision: D19410396

Pulled By: ezyang

fbshipit-source-id: ada788a4399c81cd7eb2d548aa04a2459e96634a
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020

…onal (pytorch#32799)

Summary: same as the #32799 commit message above.

Pull Request resolved: pytorch#32799

Differential Revision: D19956809

Pulled By: ezyang

fbshipit-source-id: 40d34e0109cc4b9f3ef62f409d2d35a1d84e3d22
Successfully merging this pull request may close: Regression on split operator benchmark after `__torch_function__` merge (#30831).