std::copysign workaround on CUDA #2

Closed
wants to merge 872 commits into from
Conversation

@shmsong (Owner) commented Feb 23, 2021

Based on pytorch#52049.

Fixes the ICE for gcc <= 8.5 on the CUDA side.
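A minimal sketch of this kind of workaround (illustrative only, not necessarily the exact patch in pytorch#52049): route device code to the CUDA math-library overload instead of `std::copysign`, the call that trips the gcc ICE.

```c++
#include <cmath>

// Sketch only: on the device, call the CUDA math overload via ::copysign
// rather than std::copysign, which ICEs under gcc <= 8.5.
template <typename T>
__host__ __device__ T copysign_compat(T a, T b) {
#if defined(__CUDA_ARCH__)
  return ::copysign(a, b);   // CUDA device overload
#else
  return std::copysign(a, b);
#endif
}
```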

@shmsong shmsong force-pushed the jetson branch 2 times, most recently from 7cc7bde to 10f22fb on March 11, 2021 22:56
malfet and others added 28 commits March 13, 2021 11:20
Summary:
When compiled with OpenMP support, `ideep`'s computational_cache would cache the maximum number of OpenMP workers.
This number could be wrong after a `torch.set_num_threads` call, so clear it after the call.
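A hedged sketch of the shape of the fix; the cache-reset hook name below is a placeholder, not necessarily ideep's real API:

```c++
#include <omp.h>

// Sketch only: after changing the OpenMP thread count, invalidate any ideep
// computation cache that captured the previous max worker count.
// ideep_clear_computation_cache() is a placeholder name for the reset hook.
void set_num_threads(int nthreads) {
  omp_set_num_threads(nthreads);
#if AT_MKLDNN_ENABLED()
  ideep_clear_computation_cache();  // placeholder; drops cached worker count
#endif
}
```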

Fixes pytorch#53565

Pull Request resolved: pytorch#53871

Reviewed By: albanD

Differential Revision: D27003265

Pulled By: malfet

fbshipit-source-id: 1d84c23070eafb3d444e09590d64f97f99ae9d36
Summary:
Pull Request resolved: pytorch#51131

--------
- Refactored `can_use_fast_route` logic in ForeachUtils.h.
- Fixed related bugs in test_foreach.py

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D26103904

Pulled By: izdeby

fbshipit-source-id: b3859b39adaab55c87dab6f7709d227adc0f6342
Summary: As title. Otherwise we get flaky results when running on devices in dev mode.

Reviewed By: jfix71

Differential Revision: D27035924

fbshipit-source-id: 4946a90bd341be63d74b7052cace3fabdefdc0c4
Summary:
Minor PR following up on the previous PR about sparse benchmarking utils, pytorch#48397.

Fixes pytorch#44634: performance benchmarks for matrix-matrix and matrix-vector ops (dense-sparse, sparse-sparse, and compared to dense-dense).

I ran all benchmarks on a 2xRTX8000 machine with a 24-core AMD 2970WX, using the `DLMC/magnitude_pruning` dataset at different sparsity levels.

 ---
<details><summary> forward tests (expand for details).
</summary>

- `sparse@sparse`
```
[------------------------------- cpu:matmul-forward -------------------------------]
                           |   0.5   |   0.7   |   0.8   |   0.9   |  0.95  |   0.98
1 threads: -------------------------------------------------------------------------
      torch:dense@dense    |  108.1  |  100.5  |  101.3  |  108.4  |  98.4  |  187.4
      torch:sparse@sparse  |  659.1  |  368.8  |  156.5  |   53.3  |  26.8  |   14.9
      scipy:sparse@sparse  |  565.1  |  233.9  |  130.2  |   23.1  |  21.6  |   15.2

Times are in milliseconds (ms).

[----------------------------------- cuda:matmul-forward -----------------------------------]
                           |    0.5    |    0.7    |   0.8    |   0.9    |   0.95   |   0.98
1 threads: ----------------------------------------------------------------------------------
      torch:dense@dense    |   2243.5  |   4392.5  |  4419.8  |  2272.3  |  4433.9  |  8920.1
      torch:sparse@sparse  |  21369.2  |  11877.6  |  7339.2  |  1787.2  |  1335.1  |   845.7

Times are in microseconds (us).

```
- `sparse@dense`
```
[------------------------------- cpu:matmul-forward -------------------------------]
                          |   0.5   |   0.7   |   0.8   |   0.9   |   0.95  |   0.98
1 threads: -------------------------------------------------------------------------
      torch:dense@dense   |  105.8  |  103.8  |  103.0  |  104.4  |  104.4  |  197.0
      torch:sparse@dense  |  119.9  |  102.4  |   84.0  |   19.7  |   16.8  |   11.6
      scipy:sparse@dense  |  906.5  |  799.6  |  697.8  |  182.2  |  165.5  |  135.4

Times are in milliseconds (ms).

[------------------------- cuda:matmul-forward --------------------------]
                          |  0.5  |  0.7  |  0.8  |  0.9  |  0.95  |  0.98
1 threads: ---------------------------------------------------------------
      torch:dense@dense   |  2.2  |  4.4  |  4.4  |  2.3  |  4.5   |  2.3
      torch:sparse@dense  |  5.7  |  6.6  |  4.5  |  1.4  |  1.4   |  1.3

Times are in milliseconds (ms).

```
- `sparse@vector`
```
[----------------------------------- cpu:matmul-forward ----------------------------------]
                           |    0.5    |   0.7    |   0.8    |   0.9    |   0.95   |   0.98
1 threads: --------------------------------------------------------------------------------
      torch:dense@vector   |    510.6  |   505.8  |   759.6  |   782.1  |   682.4  |  764.6
      torch:sparse@vector  |  10122.8  |  6241.1  |  7935.6  |  2076.3  |  1049.5  |  826.3
      scipy:sparse@vector  |   1756.7  |  1033.9  |   678.2  |   343.5  |   168.5  |   65.4

Times are in microseconds (us).

[-------------------------------- cuda:matmul-forward --------------------------------]
                           |   0.5    |   0.7    |   0.8   |   0.9   |   0.95  |   0.98
1 threads: ----------------------------------------------------------------------------
      torch:dense@vector   |    36.1  |    21.5  |   21.6  |   21.5  |   21.6  |   21.5
      torch:sparse@vector  |  1099.2  |  1289.4  |  775.7  |  327.1  |  285.4  |  274.0

Times are in microseconds (us).

```
</details>

 ---
<details><summary> backward tests (expand for details).
</summary>

- `sparse@sparse`
```
[--------------------------------- cpu:matmul-backward ---------------------------------]
                           |   0.5    |   0.7    |   0.8    |   0.9    |   0.95  |   0.98
1 threads: ------------------------------------------------------------------------------
      torch:dense@dense    |   246.1  |   315.0  |   306.9  |   168.6  |  290.6  |  146.9
      torch:sparse@sparse  |  6417.5  |  4393.7  |  3012.7  |  1029.4  |  908.0  |  650.7

Times are in microseconds (us).

[----------------------------- cuda:matmul-backward -----------------------------]
                           |   0.5   |   0.7   |   0.8   |  0.9   |  0.95  |  0.98
1 threads: -----------------------------------------------------------------------
      torch:dense@dense    |    6.7  |   13.3  |   13.3  |   6.9  |  13.5  |   6.9
      torch:sparse@sparse  |  143.7  |  143.4  |  119.6  |  29.5  |  29.1  |  10.9

Times are in microseconds (us).

```
- `sparse@dense`
```
 [------------------------------ cpu:matmul-backward -------------------------------]
                          |   0.5   |   0.7   |   0.8   |   0.9   |   0.95  |   0.98
1 threads: -------------------------------------------------------------------------
      torch:dense@dense   |  185.9  |  304.8  |  305.8  |  169.9  |  308.7  |  168.4
      torch:sparse@dense  |  407.9  |  345.8  |  274.6  |  114.2  |  163.6  |  230.5

Times are in milliseconds (ms).

[--------------------------- cuda:matmul-backward --------------------------]
                          |  0.5   |  0.7   |  0.8   |  0.9  |  0.95  |  0.98
1 threads: ------------------------------------------------------------------
      torch:dense@dense   |   6.7  |  13.3  |  13.3  |  6.9  |  13.4  |   6.9
      torch:sparse@dense  |  16.7  |  19.0  |  15.1  |  6.3  |   8.2  |  12.7

Times are in milliseconds (ms).

```
</details>

Kindly review this PR. cc mruberry, ngimel

Pull Request resolved: pytorch#51647

Reviewed By: albanD

Differential Revision: D27007809

Pulled By: mruberry

fbshipit-source-id: 8c1922cb3280027ca5e3eef31bfa20500c548cfd
Summary:
pytorch#51348 added CUDA support for orgqr, but only a cuSOLVER path; the orgqr tests, however, were marked to run on builds with either MAGMA or cuSOLVER.

This PR addresses the issue by creating a skipCUDAIfNoCusolver decorator and applying it to the orgqr tests. It triggers ci-all because our CI build with MAGMA but no cuSOLVER is CUDA 9.2, which does not run in the typical PR CI.

cc IvanYashchuk

Pull Request resolved: pytorch#53975

Reviewed By: ngimel

Differential Revision: D27036683

Pulled By: mruberry

fbshipit-source-id: f6c0a3e526bde08c44b119ed2ae5d51fee27e283
Summary:
Added OpInfo-based testing of the following linear algebra functions:
* cholesky, linalg.cholesky
* linalg.eigh
* inverse, linalg.inv
* qr, linalg.qr
* solve

The output of `torch.linalg.pinv` for empty inputs was not differentiable; this is now fixed.

In some cases, batched grad checks are disabled because they don't work well with 0x0 matrices (see pytorch#50743 (comment)).

Ref. pytorch#50006

Pull Request resolved: pytorch#51107

Reviewed By: albanD

Differential Revision: D27006115

Pulled By: mruberry

fbshipit-source-id: 3c1d00e3d506948da25d612fb114e6d4a478c5b1
…torch#53909)

Summary:
According to the LAPACK documentation, the size of the workspace array should be max(1, lwork). We got away with this previously because we tested only against MKL, which conveniently returns lwork >= 1.
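The convention being fixed, as a short sketch; the driver call is hypothetical:

```c++
#include <algorithm>
#include <vector>

// Sketch of the LAPACK workspace-query convention: call once with lwork == -1
// so the driver reports its optimal workspace size, then allocate
// max(1, lwork) so the array is never empty, even when a backend (unlike MKL)
// reports 0 for degenerate inputs.
double work_query = 0;
int lwork = -1, info = 0;
// lapackXyz(..., &work_query, lwork, &info);  // hypothetical query call
lwork = std::max<int>(1, static_cast<int>(work_query));
std::vector<double> work(lwork);  // never zero-sized
// lapackXyz(..., work.data(), lwork, &info);  // hypothetical compute call
```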

Fixes pytorch#53454

Pull Request resolved: pytorch#53909

Reviewed By: heitorschueroff

Differential Revision: D27017025

Pulled By: mruberry

fbshipit-source-id: 040a8cfb4bfb98db47d0b117938856d9483b20fb
Summary:
This test is flaky (tracked in pytorch#53976). This PR skips it to let the rest of the ROCm CI run.

cc nikitaved

Pull Request resolved: pytorch#53977

Reviewed By: ngimel

Differential Revision: D27036705

Pulled By: mruberry

fbshipit-source-id: 5bae741fd2a68f23717cb3a7c8b73e97cfb23b5c
Summary:
See pytorch#42919

Pull Request resolved: pytorch#53349

Reviewed By: malfet

Differential Revision: D27039089

Pulled By: bugra

fbshipit-source-id: 8063dc184248604506a8dbb1bcb73da8ec85bb18
Summary:
Pull Request resolved: pytorch#53747

Fixes pytorch#53743

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26971343

Pulled By: ezyang

fbshipit-source-id: cee7aa10063ae674f741406a3af830e4b4f128df
…tedError (pytorch#53682)

Summary:
Pull Request resolved: pytorch#53682

With this, under the meta device, 101 tests passed and 16953 skipped.
It ain't much, but it's a start.

Some various bits and bobs:
- NotImplementedError suppression at test level is implemented
  in the same way as CUDA memory leak check, i.e., by wrapping
  test methods and monkeypatching them back in.
- I had to reimplement assertRaises/assertRaisesRegex from scratch to
  ignore NotImplementedError when _ignore_not_implemented_error is True.
  The implementation relies on a small amount of private API that hasn't
  changed since 2010.
- expectedAlertNondeterministic doesn't really work so I skipped them
  all; there's probably a way to do it better

I tested this using `pytest --disable-warnings --tb=native -k meta --sw
test/*.py` and a pile of extra patches to make collection actually work
(lol).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D26955539

Pulled By: ezyang

fbshipit-source-id: ac21c8734562497fdcca3b614a28010bc4c03d74
Summary:
Pull Request resolved: pytorch#53759

Fixes pytorch#53587, see issue for in-depth explanation of the bug.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26971342

Pulled By: ezyang

fbshipit-source-id: 805983fed2658e27fb033f36a71fd30950a29328
…torch#53492)

Summary:
Assert no duplicate in method_tests for an OpInfo entry

Pull Request resolved: pytorch#53492

Reviewed By: izdeby

Differential Revision: D26882441

Pulled By: mruberry

fbshipit-source-id: f0631ea2b46b74285c76365c679bd45abc917d63
…pytorch#53949)

Summary:
Pull Request resolved: pytorch#53949

As title says
ghstack-source-id: 123849745

Test Plan:
`buck test pp-mac`

```
2021-03-11 18:25:07.922375-0800 PyTorchPlayground[8324:5122672] [bool test_add()],[1 180 12 12 ],[SUCCEED]
2021-03-11 18:25:07.960812-0800 PyTorchPlayground[8324:5122672] [bool test_add_broadcast()],[2 17 58 67 ],[SUCCEED]
2021-03-11 18:25:07.978399-0800 PyTorchPlayground[8324:5122672] [bool test_add_broadcast2()],[2 17 1 67 ],[SUCCEED]
2021-03-11 18:25:08.021570-0800 PyTorchPlayground[8324:5122672] [bool test_sub()],[5 3 167 222 ],[SUCCEED]
2021-03-11 18:25:08.034218-0800 PyTorchPlayground[8324:5122672] [bool test_sub_broadcast()],[1 3 1 1 ],[SUCCEED]
2021-03-11 18:25:08.069419-0800 PyTorchPlayground[8324:5122672] [bool test_sub_broadcast2()],[3 3 192 192 ],[SUCCEED]
2021-03-11 18:25:08.112967-0800 PyTorchPlayground[8324:5122672] [bool test_mul()],[2 7 262 119 ],[SUCCEED]
2021-03-11 18:25:08.136691-0800 PyTorchPlayground[8324:5122672] [bool test_mul_broadcast()],[4 3 192 192 ],[SUCCEED]
2021-03-11 18:25:08.148920-0800 PyTorchPlayground[8324:5122672] [bool test_mul_broadcast2()],[1 3 192 192 ],[SUCCEED]
```

Reviewed By: SS-JIA

Differential Revision: D27000487

fbshipit-source-id: f86fca5ac1960ca0a56636da17ae05020c1a4138
Summary:
Pull Request resolved: pytorch#53950

Add four binary ops to Metal

- `aten::mul_`
- `aten::sub_`
- `aten::div`
- `aten::div_`
ghstack-source-id: 123850577

Test Plan:
- `buck test pp-mac`

```
2021-03-11 20:36:47.151139-0800 PyTorchPlayground[8469:5169786] [bool test_sub()],[5 3 167 222 ],[SUCCEED]
2021-03-11 20:36:47.157638-0800 PyTorchPlayground[8469:5169786] [bool test_sub_broadcast()],[1 3 1 1 ],[SUCCEED]
2021-03-11 20:36:47.170640-0800 PyTorchPlayground[8469:5169786] [bool test_sub_broadcast2()],[3 3 192 192 ],[SUCCEED]
2021-03-11 20:36:47.194009-0800 PyTorchPlayground[8469:5169786] [bool test_mul()],[2 7 262 119 ],[SUCCEED]
2021-03-11 20:36:47.210344-0800 PyTorchPlayground[8469:5169786] [bool test_mul_broadcast()],[4 3 192 192 ],[SUCCEED]
2021-03-11 20:36:47.216610-0800 PyTorchPlayground[8469:5169786] [bool test_mul_broadcast2()],[1 3 192 192 ],[SUCCEED]
2021-03-11 20:36:47.224471-0800 PyTorchPlayground[8469:5169786] [bool test_div()],[1 3 192 192 ],[SUCCEED]
2021-03-11 20:36:47.240817-0800 PyTorchPlayground[8469:5169786] [bool test_div_broadcast()],[4 3 192 192 ],[SUCCEED]
2021-03-11 20:36:47.246816-0800 PyTorchPlayground[8469:5169786] [bool test_div_broadcast2()],[1 3 192 192 ],[SUCCEED]
```

Reviewed By: SS-JIA

Differential Revision: D27003417

fbshipit-source-id: 290f7e524eef4c444f8884fc1315151752e5ac31
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: pytorch/tensorpipe@cd0eb12

Pull Request resolved: pytorch#53892

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D27009398

fbshipit-source-id: af46edd701cde94c6175d3058fd15487d8b0b8c7
Summary:
When building the libtorch static library, these three static libraries are generated but are not installed to CMAKE_INSTALL_LIBDIR:
- libCaffe2_perfkernels_avx2.a
- libCaffe2_perfkernels_avx512.a
- libCaffe2_perfkernels_avx.a

This PR will fix this issue.

Please note that after this fix there are still static libraries missing from CMAKE_INSTALL_LIBDIR, but they belong to third_party repos, and we need to fix them in the corresponding repos:
- libfoxi_loader.a
- libonnx.a
- libonnx_proto.a
- libfmt.a
- libnccl_static.a

Pull Request resolved: pytorch#53825

Reviewed By: ngimel

Differential Revision: D27013844

Pulled By: malfet

fbshipit-source-id: 8a84cc72b6ae87393ca26c4e474f5526a7b18ab2
Summary:
The TCPStore delete-key implementation inadvertently set "moreData" when sending the key, even though it was in fact the last message.
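The shape of the fix, sketched with hypothetical send helpers (not the exact c10d code): only non-final segments of a request may carry the more-data flag.

```c++
// Sketch only; sendCommand/sendString stand in for the store's send helpers.
// The bug was sending the key (the final segment of DELETE_KEY) with
// moreData=true, leaving the server waiting for a segment that never came.
void deleteKey(int socket, const std::string& key) {
  sendCommand(socket, QueryType::DELETE_KEY, /*moreData=*/true);  // more follows
  sendString(socket, key, /*moreData=*/false);                    // final segment
}
```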

Thank you, PetrochukM, for the reproducing example which was instrumental in developing the fix (and is the blueprint for the test case).

Fixes pytorch#53872

Pull Request resolved: pytorch#53886

Reviewed By: jbschlosser

Differential Revision: D27011846

Pulled By: H-Huang

fbshipit-source-id: 5c460d1e4d095a8bc267bf63613b556856ced3e8
Summary:
Code cleanup.

Pull Request resolved: pytorch#53874

Reviewed By: jbschlosser

Differential Revision: D27036819

Pulled By: ngimel

fbshipit-source-id: c267e20c8d91224cd3c01b302a75f43aa309b560
Summary:
To be in-sync with pytorch#53447

Pull Request resolved: pytorch#53931

Reviewed By: ngimel

Differential Revision: D27026616

Pulled By: malfet

fbshipit-source-id: 4c50b29fa296c90aeeeb1757bdaada92cbba33d4
Summary:
Close pytorch#51108
Related pytorch#38349

This PR implements `cpu_kernel_multiple_outputs` to support returning multiple values from a CPU kernel.
```c++
auto iter = at::TensorIteratorConfig()
  .add_output(out1)
  .add_output(out2)
  .add_input(in1)
  .add_input(in2)
  .build();

at::native::cpu_kernel_multiple_outputs(iter,
  [=](float a, float b) -> std::tuple<float, float> {
    float add = a + b;
    float mul = a * b;
    return std::tuple<float, float>(add, mul);
  }
);
```

`out1` will be equal to `torch.add(in1, in2)`, while `out2` will be `torch.mul(in1, in2)`.
It makes it more convenient for developers to implement new torch functions that return two tensors, such as the NumPy-like functions [divmod](https://numpy.org/doc/1.18/reference/generated/numpy.divmod.html?highlight=divmod#numpy.divmod) and [frexp](https://numpy.org/doc/stable/reference/generated/numpy.frexp.html#numpy.frexp).

This PR adds the `torch.frexp` function to exercise the new functionality provided by `cpu_kernel_multiple_outputs`.
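As an illustration, a frexp kernel can be expressed with the new helper roughly like this (a sketch, assuming `iter` has two outputs, a float mantissa and an int exponent, and one float input; not necessarily the exact kernel in the PR):

```c++
#include <cmath>
#include <tuple>

// Sketch: frexp naturally produces two outputs (mantissa, exponent),
// so it maps directly onto cpu_kernel_multiple_outputs.
at::native::cpu_kernel_multiple_outputs(iter,
  [](float a) -> std::tuple<float, int32_t> {
    int exp = 0;
    float mantissa = std::frexp(a, &exp);
    return std::make_tuple(mantissa, static_cast<int32_t>(exp));
  });
```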

Pull Request resolved: pytorch#51097

Reviewed By: albanD

Differential Revision: D26982619

Pulled By: heitorschueroff

fbshipit-source-id: cb61c7f2c79873ab72ab5a61cbdb9203531ad469
Summary:
accross -> across

Pull Request resolved: pytorch#53968

Reviewed By: jbschlosser

Differential Revision: D27035761

Pulled By: ngimel

fbshipit-source-id: 94fac6f2e27648e70652fd29f7800e60b211acd5
Summary:
untill -> until

Pull Request resolved: pytorch#53978

Reviewed By: jbschlosser

Differential Revision: D27039043

Pulled By: rohan-varma

fbshipit-source-id: c9178e79fe8b2a3dc61665148fe55dba5adb0abf
Summary:
Follow-up of pytorch#53447

Reference: pytorch#53447 (comment)

Pull Request resolved: pytorch#53809

Reviewed By: bdhirsh

Differential Revision: D27049643

Pulled By: jbschlosser

fbshipit-source-id: 623a2a254783b86391dc2b0777b688506adb4c0e
Summary: Pull Request resolved: pytorch#49249

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25502939

Pulled By: izdeby

fbshipit-source-id: b16e23063b37521be549e83cb17676e3afc4ddb3
Summary: Pull Request resolved: pytorch#49250

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D25502938

Pulled By: izdeby

fbshipit-source-id: bdd583464eb15d7cb30fd0c22d119cc4b31cbf8d
Summary:
This PR fixes a typo in the explanation of `dims` for `linalg.tensorsolve`.

Pull Request resolved: pytorch#53320

Reviewed By: jbschlosser

Differential Revision: D27048736

Pulled By: anjali411

fbshipit-source-id: db230b21191cc9cfb73b967cd15305fe74178c2b
Summary:
Pull Request resolved: pytorch#53928

HashStoreTest was taking forever to run. It turns out this was because a default timeout is set when creating Store(), and setTimeout on a PrefixStore is not actually able to change the timeout of the underlying store.

After removing the default timeout and updating setTimeout, this will save ~10 minutes for all of the gcc_test CI runs.
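A minimal sketch of the PrefixStore half of the fix, assuming the wrapper holds the underlying store as `store_` (names approximate):

```c++
// Sketch: forward the timeout to the wrapped store rather than only
// recording it on the wrapper, so waits on the underlying store change too.
void PrefixStore::setTimeout(const std::chrono::milliseconds& timeout) {
  store_->setTimeout(timeout);
}
```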

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27025275

Pulled By: H-Huang

fbshipit-source-id: 650c8c1eb8b166da1d412ed88e765747a2ca2069
@shmsong (Owner, Author) commented Apr 12, 2021

fixed upstream

@shmsong shmsong closed this Apr 12, 2021
shmsong pushed a commit that referenced this pull request Jun 9, 2021
Summary: added more statistics for the static runtime

Test Plan:
caffe2/benchmarks/static_runtime:static_runtime_cpptest

Expected output example:

Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)
Time per node type:
       0.195671 ms.    23.0483%. aten::add (1 nodes)
       0.169457 ms.    19.9605%. aten::mul (1 nodes, out variant)
       0.123695 ms.    14.5702%. aten::addmm (1 nodes, out variant)
       0.118218 ms.     13.925%. aten::clamp (1 nodes, out variant)
      0.0860747 ms.    10.1388%. aten::bmm (1 nodes, out variant)
      0.0707332 ms.    8.33175%. aten::cat (1 nodes, out variant)
       0.038814 ms.    4.57195%. aten::transpose (1 nodes)
      0.0309244 ms.    3.64263%. aten::sigmoid (1 nodes, out variant)
      0.0102666 ms.    1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
      0.0046297 ms.   0.545338%. prim::TupleConstruct (1 nodes, out variant)
    0.000476333 ms.  0.0561079%. prim::ListConstruct (1 nodes, out variant)
       0.848959 ms. in Total
StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)

Reviewed By: hlu1

Differential Revision: D28553029

fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
shmsong pushed a commit that referenced this pull request Jul 6, 2021
Summary:
Pull Request resolved: pytorch#60987

We were seeing deadlocks as follows during shutdown:

```
Thread 1 (LWP 2432101):
#0  0x00007efca470190b in __pause_nocancel () from /lib64/libc.so.6
#1  0x00007efca49de485 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#2  0x00007ef91d4c42c6 in __cuda_CallJitEntryPoint () from /lib64/libnvidia-ptxjitcompiler.so.1
#3  0x00007efc651ac8f1 in ?? () from /lib64/libcuda.so
#4  0x00007efc651aee03 in ?? () from /lib64/libcuda.so
#5  0x00007efc64f76b84 in ?? () from /lib64/libcuda.so
#6  0x00007efc64f77f5d in ?? () from /lib64/libcuda.so
#7  0x00007efc64eac858 in ?? () from /lib64/libcuda.so
#8  0x00007efc64eacfbc in ?? () from /lib64/libcuda.so
#9  0x00007efc7810a924 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#10 0x00007efc780fa2be in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#11 0x00007efc78111044 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#12 0x00007efc7811580a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#13 0x00007efc78115aa4 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#14 0x00007efc781079ec in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#15 0x00007efc780e6a7a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#16 0x00007efc7811cfa5 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#17 0x00007efc777ea98c in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#18 0x00007efc777ebd80 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#19 0x00007efc777ea2c9 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#20 0x00007efc778c2e2d in cublasDestroy_v2 () from /usr/local/cuda/lib64/libcublas.so.11
#21 0x00007efc51a3fb56 in std::_Sp_counted_ptr_inplace<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle>, std::allocator<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#22 0x00007efc51a3fc5f in std::shared_ptr<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >::~shared_ptr() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#23 0x00007efca4648b0c in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00007efca4648c40 in exit () from /lib64/libc.so.6
#25 0x0000558c8852e5f9 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1614362349910/work/Python/pylifecycle.c:2292
#26 0x0000558c8852e6a7 in handle_system_exit () at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:636
#27 0x0000558c8852e742 in PyErr_PrintEx (set_sys_last_vars=<optimized out>, set_sys_last_vars=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:646
#28 0x0000558c88540dd6 in PyRun_SimpleStringFlags (command=0x7efca4dc9050 "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=9, pipe_handle=13)\n", flags=0x7ffe3a986110) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:457
#29 0x0000558c88540ead in pymain_run_command (cf=0x7ffe3a986110, command=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:420
#30 pymain_run_python (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:2907
#31 pymain_main (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3460
#32 0x0000558c8854122c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3495
#33 0x00007efca4632493 in __libc_start_main () from /lib64/libc.so.6
#34 0x0000558c884e5e90 in _start () at ../sysdeps/x86_64/elf/start.S:103
```

This was likely caused by a static singleton that wasn't leaky. Following
the guidance in https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2, we now
use a leaky singleton instead.
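The construct-on-first-use idiom referenced above, as a short sketch (`PoolType` stands in for the cuBLAS handle pool type):

```c++
// Sketch of a leaky singleton: heap-allocate on first use and never delete,
// so no destructor (and hence no cublasDestroy) runs from
// __run_exit_handlers after the CUDA driver has begun shutting down.
PoolType& getPool() {
  static PoolType* pool = new PoolType();  // intentionally leaked
  return *pool;
}
```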
ghstack-source-id: 132847448

Test Plan: Verified locally.

Reviewed By: malfet

Differential Revision: D29468866

fbshipit-source-id: 89250594c5cd2643417b1da584c658b742dc5a5c
shmsong pushed a commit that referenced this pull request Aug 6, 2021
Summary:
Pull Request resolved: pytorch#61588

As part of debugging pytorch#60290,
we discovered the following deadlock:

```
Thread 79 (Thread 0x7f52ff7fe700 (LWP 205437)):
#0  pthread_cond_timedwait@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x0000564880199152 in PyCOND_TIMEDWAIT (cond=0x564880346080 <gil_cond>, mut=0x564880346100 <gil_mutex>, us=5000) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/condvar.h:103
#2  take_gil (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval_gil.h:224
#3  0x0000564880217b62 in PyEval_AcquireThread (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:278
#4  0x00007f557d54aabd in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5  0x00007f557da7792f in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6  0x00007f5560dadba6 in c10::TensorImpl::release_resources() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so
#7  0x00007f5574c885bc in std::_Sp_counted_ptr_inplace<torch::distributed::autograd::DistAutogradContext, std::allocator<torch::distributed::autograd::DistAutogradContext>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007f5574c815e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false>*) [clone .isra.325] () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007f5574c81bf1 in torch::distributed::autograd::DistAutogradContainer::eraseContextIdAndReset(torch::distributed::autograd::DistAutogradContainer::ContextsShard&, long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f5574c86e83 in torch::distributed::autograd::DistAutogradContainer::releaseContextIfPresent(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f5574cc6395 in torch::distributed::rpc::RequestCallbackNoPython::processCleanupAutogradContextReq(torch::distributed::rpc::RpcCommandBase&) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f5574cccf15 in torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

Thread 72 (Thread 0x7f53077fe700 (LWP 205412)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f55bc62adbd in __GI___pthread_mutex_lock (mutex=0x564884396440) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f5574c82a2f in torch::distributed::autograd::DistAutogradContainer::retrieveContext(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007f557de9bb2f in pybind11::cpp_function::initialize<torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}, pybind11::dict, long, pybind11::name, pybind11::scope, pybind11::sibling, char [931], pybind11::arg>(torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}&&, pybind11::dict (*)(long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [931], pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so

```

Basically, Thread 72 holds the GIL and tries to acquire the lock for
DistAutogradContainer to perform a lookup on a map. On the other hand,
Thread 79 holds the lock on DistAutogradContainer to remove a Tensor, and as
part of the TensorImpl destructor, concrete_decref_fn is called, which waits
for the GIL. As a result, we have a deadlock.

To fix this issue, I've ensured we release the GIL when we call `retrieveContext`
and acquire it later when needed.
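A hedged sketch of the resulting pattern with pybind11 (names approximate, not the exact binding code):

```c++
// Sketch: drop the GIL around the locked container lookup so the lock order
// is always container-lock before GIL, never GIL before container-lock.
pybind11::dict retrieve_context(int64_t context_id) {
  ContextPtr ctx;  // ContextPtr: shared_ptr to DistAutogradContext (assumed)
  {
    pybind11::gil_scoped_release no_gil;  // released only for the lookup
    ctx = DistAutogradContainer::getInstance().retrieveContext(context_id);
  }  // GIL is reacquired here when no_gil goes out of scope
  return to_python_dict(ctx);  // placeholder for the actual conversion
}
```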
ghstack-source-id: 133493659

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D29682624

fbshipit-source-id: f68a1fb39040ca0447a26e456a97bce64af6b79c
shmsong pushed a commit that referenced this pull request Oct 13, 2021
Summary:
Pull Request resolved: pytorch#61983

Trial #2. The previous PR (pytorch#61498) was reverted because this caused a failure in `pytorch_linux_backward_compatibility_check_test`. Fixed that now by adding to the exception list in `check_backward_compatibility.py`.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D29828830

Pulled By: navahgar

fbshipit-source-id: 947a7b1622ff6e3e575c051b8f34a789e105bcee
shmsong pushed a commit that referenced this pull request Oct 13, 2021
…ytorch#63339)

Summary:
Pull Request resolved: pytorch#63339

# Context
https://fb.workplace.com/groups/pytorch.dev/permalink/900474523864362/?comment_id=901125403799274&reply_comment_id=905023386742809

##### WHAT IS A STACK TRACE?
A stack trace (also called stack backtrace or stack traceback) is a report of the active stack frames at a certain point in time during the execution of a program.

Typically when an exception is thrown, one would expect to see the code (file:line) that threw the exception, and every intermediate frame up to and including the main function.

We are enabling Android stack traces to help with debugging on Android devices.

Test Plan:
## Steps to test
```
buck build fbsource//xplat/caffe2/mode/aibench_pytorch_android -c pt.enable_qpl=0 -c pt.has_backtraces=1 fbsource//xplat/caffe2/fb/lite_predictor:lite_predictorAndroid#android-x86_64

one_world android emulator android-28

adb push ~/fbsource/buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictorAndroid#android-x86_64 /data/local/tmp

cd /data/local/tmp
./lite_predictorAndroid#android-x86_64

./lite_predictorAndroid#android-x86_64 --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true
```

## See what the stack trace looks like when the model file is not found:

### before
```
./lite_predictorAndroid#android-x86_64 --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true

Run with 2 threads
Run with 2 threads
Loading model...
terminating with uncaught exception of type c10::Error: open file failed, file path: ./detect.bc
Exception raised from RAIIFile at xplat/caffe2/caffe2/serialize/file_adapter.cc:13 (most recent call first):
(no backtrace available)
Aborted
```

### after
```
134|generic_x86_64:/data/local/tmp $ ./lite_predictorAndroid#android-x86_64 --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true
Run with 2 threads
Run with 2 threads
Loading model...
terminating with uncaught exception of type c10::Error: open file failed, file path: ./detect.bc
Exception raised from RAIIFile at xplat/caffe2/caffe2/serialize/file_adapter.cc:13 (most recent call first):
 frame #0       c10::get_backtrace(unsigned long, unsigned long, bool)[0x59494274f10e]
 frame #1       [0x5949427b1eee]
 frame #2       [0x5949427b1eb2]
 frame #3       [0x5949427b1cdc]
 frame #4       std::__ndk1::function<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > ()>::operator()() const[0x5949427afc34]
 frame #5       c10::Error::Error(c10::SourceLocation, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >)[0x5949427b05b1]
 frame #6       c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x5949427aca5f]
 frame #7       caffe2::serialize::FileAdapter::RAIIFile::RAIIFile(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x5949426b37b2]
 frame #8       caffe2::serialize::FileAdapter::FileAdapter(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x5949426b3903]
 frame #9       torch::jit::_load_for_mobile(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, c10::optional<c10::Device>, std::__ndk1::unordered_map<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >, std::__ndk1::hash<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > >, std::__ndk1::equal_to<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > >, std::__ndk1::allocator<std::__ndk1::pair<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > > > >&)[0x5949422737bd]
 frame #10      torch::jit::_load_for_mobile(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, c10::optional<c10::Device>)[0x594942273769]
 frame #11      benchmark(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, int, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, bool, int, int, int, bool, int, bool, int, double, bool, bool, bool, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x59494189b21d]
 frame #12      main[0x594941882aff]
 frame #13      __libc_init[0x7b699d08578d]
```

### what we get for os:linux
```
(base) [pavithran@devvm1803.vll0 /data/users/pavithran/fbsource] ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true
Run with 24 threads
Run with 24 threads
Loading model...
terminate called after throwing an instance of 'c10::Error'
  what():  open file failed, file path: ./detect.bc
Exception raised from RAIIFile at xplat/caffe2/caffe2/serialize/file_adapter.cc:13 (most recent call first):
frame #0: ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor() [0x20cb7fe]
frame #1: ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor() [0x20cb6c6]
frame #2: std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const + 0x54 (0x20ca4e4 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #3: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x57 (0x20ca9a7 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #4: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x7a (0x20c823a in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #5: caffe2::serialize::FileAdapter::RAIIFile::RAIIFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x96 (0x206f3d6 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #6: caffe2::serialize::FileAdapter::FileAdapter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x42 (0x206f502 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #7: torch::jit::_load_for_mobile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x30 (0x1be826c in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #8: torch::jit::_load_for_mobile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>) + 0x35 (0x1be8214 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #9: benchmark(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, int, int, int, bool, int, bool, int, double, bool, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x16d (0x12093ad in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #10: main + 0x25c (0x11f933c in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #11: __libc_start_main + 0x105 (0x7fc7b9f2ed95 in /usr/local/fbcode/platform009/lib/libc.so.6)
frame #12: _start + 0x2a (0x11f902a in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)

Aborted (core dumped)
```

Reviewed By: dhruvbird

Differential Revision: D30135947

fbshipit-source-id: f50c634ef4545843305cad4b4a14a8776b1aec76
shmsong pushed a commit that referenced this pull request Oct 13, 2021
…4332)

Summary:
Pull Request resolved: pytorch#64332

With this diff, if a compiler bug occurs (unlikely, I know!), we'll be able to get a C++ stacktrace leading to the exception, rather than just a terse message.  E.g.,
```
RuntimeError: UNSUPPORTED DTYPE
Exception raised from compilation_error at ../torch/csrc/jit/tensorexpr/exceptions.h:32 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f966659b2eb in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x376f099 (0x7f966a195099 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x3763bf5 (0x7f966a189bf5 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: torch::jit::tensorexpr::CudaCodeGen::Initialize() + 0xdd8 (0x7f966a193368 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
```

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D30745610

Pulled By: bertmaher

fbshipit-source-id: a1cfaa7364ef4120de834e9cbe57ced1d082ab4e
shmsong pushed a commit that referenced this pull request Oct 13, 2021
Summary:
Pull Request resolved: pytorch#66009

Fixes
```
test_trace_c10_ops (jit.test_tracer.TestTracer) ... third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:374:24: runtime error: applying non-zero offset 4 to null pointer
    #0 0x7f5228f72227 in Eigen::internal::BlockImpl_dense<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false, true>::BlockImpl_dense(Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >&, long, long, long, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:374
    #1 0x7f5228f7212c in Eigen::BlockImpl<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false, Eigen::Dense>::BlockImpl(Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >&, long, long, long, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:166
    #2 0x7f5228f720dc in Eigen::Block<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false>::Block(Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >&, long, long, long, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:142
    #3 0x7f5229b0e059 in Eigen::DenseBase<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> > >::FixedBlockXpr<internal::get_fixed_value<int>::value, internal::get_fixed_value<long>::value>::Type Eigen::DenseBase<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> > >::block<int, long>(long, long, int, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/../plugins/BlockMethods.h:98
    #4 0x7f5229b0c5ca in caffe2::GenerateProposalsOp<caffe2::CPUContext>::RunOnDevice() caffe2/caffe2/operators/generate_proposals_op.cc:348
```
Also cleans up some data type and const issues around the area.
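For context, the UBSAN complaint above comes from forming an Eigen block over an empty map; a hedged sketch of the kind of guard that avoids it (types and offsets illustrative, not the exact operator code):

```c++
#include <Eigen/Dense>

using ERArrXXf =
    Eigen::Array<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

// Sketch: taking .block(...) on an empty Map computes nullptr + offset, which
// UBSAN flags as undefined behavior. Guard on emptiness before forming blocks.
void consume(const Eigen::Map<const ERArrXXf>& proposals) {
  if (proposals.rows() == 0) {
    return;  // nothing to do; avoids the null-pointer offset entirely
  }
  auto scores = proposals.rightCols(1);  // safe: data() is non-null here
  (void)scores;
}
```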

Test Plan: Sandcastle

Reviewed By: xush6528

Differential Revision: D31343046

fbshipit-source-id: fd9096c8e47a0aad529c72fd313f64ca98dcb80b
shmsong pushed a commit that referenced this pull request Nov 9, 2021
Summary:
Pull Request resolved: pytorch#66060

Fixes
```
testTumHistoryAdditionalLaser (caffe2.caffe2.fb.layers.tests.tum_history_test.TestTumHistory) ... caffe2/caffe2/operators/concat_split_op.h:363:74: runtime error: applying non-zero offset 8 to null pointer
    #0 0x7f8f39d29795 in caffe2::ConcatOp<caffe2::CPUContext>::RunOnDevice() caffe2/caffe2/operators/concat_split_op.h:363
    #1 0x7f8f39c4978d in caffe2::Operator<caffe2::CPUContext>::Run(int) caffe2/caffe2/core/operator.h:987
    #2 0x7f8f381fe9c9 in caffe2::SimpleNet::Run() caffe2/caffe2/core/net_simple.cc:67
    #3 0x7f8f38ee488e in caffe2::Workspace::RunNet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) caffe2/caffe2/core/workspace.cc:289
```

Test Plan: Sandcastle

Reviewed By: dzhulgakov, xush6528

Differential Revision: D31366205

fbshipit-source-id: 566aa519677c9d371189e4b1f81d595732861efc
shmsong pushed a commit that referenced this pull request Nov 9, 2021
Summary:
Pull Request resolved: pytorch/pytorch-canary#2

Pull Request resolved: pytorch#66881

Adds the `static_runtime::fused_equally_split` operator and removes the `is_fused` logic from the original operator. Modifies `FuseUnpackListV2` to map `fb::equally_split` to this new operator.

Test Plan:
```
adityapillai@5960 /data/sandcastle/boxes/fbsource/fbcode 1m 13s
❯ buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators
```
and sandcastle
strange_what_could_go_wrong

Reviewed By: mikeiovine

Differential Revision: D31742293

fbshipit-source-id: 60b35589c8817719b005d49811f575b6590d1c39
shmsong pushed a commit that referenced this pull request Jun 29, 2022
This makes the ROCm jobs run on master only. We've been battling queue
times for a few months now
(pytorch#73039). So far we have tried
or investigated:
1. Moving distributed builds to master
2. Moving distributed builds to periodic
3. Only running rocm on a specific set of paths
4. Running multiple jobs on a single rocm host.

Unfortunately, we haven't been able to reduce queuing times to good
levels. As a result, ROCm jobs are the "weightiest" job in PR CI, with
an average TTS of 3.3h (see https://hud.pytorch.org/metrics, panel name
"Job time-to-signal, all branches").

There are two things we haven't tried so far:
1. Running "smoke tests" only on PR
2. Switching rocm builds to master

Since #2 is the easiest, let's give it a try. For now, the policy would be
the same as what we do for other capacity-constrained configurations
(Win and Mac): run on master only, but revert if a breakage is
introduced.

[skip ci]

Pull Request resolved: pytorch#77989

Approved by: https://github.com/malfet, https://github.com/janeyx99
shmsong pushed a commit that referenced this pull request Jun 29, 2022
…78136)

This prevents `import torch` from accidentally crashing on machines with no Metal devices

Should prevent crashes reported in pytorch#77662 (comment) and https://github.com/pytorch/functorch/runs/6560056366?check_suite_focus=true

Backtrace to the crash:
```
(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7202be57 libobjc.A.dylib`objc_msgSend + 23
    frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
    frame #2: 0x000000010fda011d libtorch_cpu.dylib`_GLOBAL__sub_I_MPSAllocator.mm + 125
    frame #3: 0x000000010ada81e3 dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 535
    frame #4: 0x000000010ada85ee dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40(lldb) up
frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl:
->  0x10fd9f524 <+436>: movq   %rax, 0x1b0(%rbx)
    0x10fd9f52b <+443>: movw   $0x0, 0x1b8(%rbx)
    0x10fd9f534 <+452>: addq   $0x8, %rsp
    0x10fd9f538 <+456>: popq   %rbx
(lldb) disassemble
 ...
    0x10fd9f514 <+420>: movq   0xf19ad15(%rip), %rsi     ; "maxBufferLength"
    0x10fd9f51b <+427>: movq   %r14, %rdi
    0x10fd9f51e <+430>: callq  *0xeaa326c(%rip)          ; (void *)0x00007fff7202be40: objc_msgSend
```

which corresponds to the `[m_device maxBufferLength]` call, where `m_device` is not initialized in
https://github.com/pytorch/pytorch/blob/2ae3c59e4bcb8e6e75b4a942cacc2d338c88e609/aten/src/ATen/mps/MPSAllocator.h#L171

Pull Request resolved: pytorch#78136
Approved by: https://github.com/seemethere
shmsong pushed a commit that referenced this pull request Jun 29, 2022
… of libtorch_python (pytorch#78028)

Summary:
This moves torch::class_<WorkerInfo> into `rpc_agent.cpp` so it gets registered in libtorch instead of libtorch_python. This is intermediate work toward getting torch::deploy to load an unmodified copy of libtorch; the current RPC setup is incompatible due to duplicate registrations.

```
unknown file: Failure
C++ exception with description "Exception Caught inside torch::deploy embedded library:
Custom class with name __torch__.torch.classes.dist_rpc.WorkerInfo is already registered. Ensure that registration with torch::class_ is only called once.
Exception raised from registerCustomClass at ../aten/src/ATen/core/custom_class.cpp:61 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f3bd9adb92e in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x7f3bd9ab7068 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: torch::registerCustomClass(std::shared_ptr<c10::ClassType>) + 0x110 (0x7f3bc2258980 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::detail::class_base::class_base(std::string const&, std::string const&, std::string, std::type_info const&, std::type_info const&) + 0x3b9 (0x7f3bc225a419 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: [0x7f3ba45cfea1]
frame #5: <unknown function> + 0x1b5334 (0x5652bdab9334 in ./test_deploy)
frame #6: <unknown function> + 0x1b4f3e (0x5652bdab8f3e in ./test_deploy)
frame #7: <unknown function> + 0x1b519b (0x5652bdab919b in ./test_deploy)
frame #8: loadSearchFile(char const*) + 0x23e (0x7f3ba62f37f8 in /tmp/torch_deploy9ATEFg)
frame #9: deploy_set_self + 0x51 (0x7f3ba62f38f9 in /tmp/torch_deploy9ATEFg)
frame #10: torch::deploy::Interpreter::Interpreter(torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>) + 0x274 (0x5652bdaaa790 in ./test_deploy)
frame #11: void __gnu_cxx::new_allocator<torch::deploy::Interpreter>::construct<torch::deploy::Interpreter, torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(torch::deploy::Interpreter*, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0x81 (0x5652bdaaf58b in ./test_deploy)
frame #12: void std::allocator_traits<std::allocator<torch::deploy::Interpreter> >::construct<torch::deploy::Interpreter, torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(std::allocator<torch::deploy::Interpreter>&, torch::deploy::Interpreter*, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0x4a (0x5652bdaae320 in ./test_deploy)
frame #13: void std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> >::_M_realloc_insert<torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(__gnu_cxx::__normal_iterator<torch::deploy::Interpreter*, std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> > >, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0xee (0x5652bdaae4a0 in ./test_deploy)
frame #14: void std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> >::emplace_back<torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0xb6 (0x5652bdaad258 in ./test_deploy)
frame #15: torch::deploy::InterpreterManager::InterpreterManager(unsigned long, std::shared_ptr<torch::deploy::Environment>) + 0x123 (0x5652bdaa83b1 in ./test_deploy)
frame #16: TorchpyTest_InitTwice_Test::TestBody() + 0x65 (0x5652bda075a9 in ./test_deploy)
frame #17: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x65 (0x5652bda944b7 in ./test_deploy)
frame #18: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x5a (0x5652bda8cfe7 in ./test_deploy)
frame #19: testing::Test::Run() + 0x100 (0x5652bda68622 in ./test_deploy)
frame #20: testing::TestInfo::Run() + 0x10f (0x5652bda68fb3 in ./test_deploy)
frame #21: testing::TestSuite::Run() + 0x121 (0x5652bda6980d in ./test_deploy)
frame #22: testing::internal::UnitTestImpl::RunAllTests() + 0x38e (0x5652bda756e6 in ./test_deploy)
frame #23: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 0x65 (0x5652bda9586b in ./test_deploy)
frame #24: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 0x5a (0x5652bda8e0f7 in ./test_deploy)
frame #25: testing::UnitTest::Run() + 0xc9 (0x5652bda73fd1 in ./test_deploy)
frame #26: RUN_ALL_TESTS() + 0x11 (0x5652bda169fa in ./test_deploy)
frame #27: main + 0x27 (0x5652bda10ce2 in ./test_deploy)
frame #28: <unknown function> + 0x2d310 (0x7f3bc0431310 in /usr/lib/libc.so.6)
frame #29: __libc_start_main + 0x81 (0x7f3bc04313c1 in /usr/lib/libc.so.6)
frame #30: _start + 0x25 (0x5652bda063b5 in ./test_deploy)
```
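A minimal sketch of the registration after the move, assuming the standard `torch::class_` API (the exact `.def(...)` set may differ):

```c++
#include <torch/custom_class.h>
// Assumes the header defining torch::distributed::rpc::WorkerInfo is included.

// Sketch: register the custom class once, at static-initialization time,
// inside libtorch (rpc_agent.cpp) rather than libtorch_python, so loading
// Python bindings in several torch::deploy interpreters cannot re-register it.
static const auto workerInfo =
    torch::class_<torch::distributed::rpc::WorkerInfo>("dist_rpc", "WorkerInfo")
        .def(torch::init<std::string, int64_t>());
```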

Test Plan: CI

Differential Revision: D36564258

Pull Request resolved: pytorch#78028
Approved by: https://github.com/rohan-varma
shmsong pushed a commit that referenced this pull request Jun 29, 2022
… to conform with non-quantized counterpart filenames

Summary:
Names of analogous files in the quantized directory (previously snake case) were inconsistent with
their non-quantized filename counterparts (Pascal case). This is the first of a series of PRs that changes
all files in the quantized dir (and its sub-directories) to Pascal case.

`aten/src/ATen/native/quantized/qconv_unpack.cpp` has not been renamed yet
because (for reasons currently unknown) after making the name change, `import torch` produces the below error (`qlinear_unpack.cpp` renaming also seems to fail some phabricator CI tests for similar reasons). We suspect that these may be undefined errors and will revisit naming these files in a future PR.

```
terminate called after throwing an instance of 'c10::Error'
  what():  Type c10::intrusive_ptr<ConvPackedParamsBase<2> > could not be converted to any of the known types.
Exception raised from operator() at ../aten/src/ATen/core/jit_type.h:1735 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x7f26745c0c65 in /data/users/dzdang/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xb1 (0x7f26745bdcd1 in /data/users/dzdang/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1494e24 (0x7f2663b14e24 in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xfed0bc (0x7f266366d0bc in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #4: c10::detail::infer_schema::make_function_schema(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>) + 0x5a (0x7f266366d71a in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #5: c10::detail::infer_schema::make_function_schema(c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>) + 0x7b (0x7f266366e06b in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1493f32 (0x7f2663b13f32 in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe227dd (0x7f26634a27dd in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x14e0a (0x7f268c934e0a in /lib64/ld-linux-x86-64.so.2)
..........................truncated.............
```

Test Plan:
```
python test/test_quantization.py
```

Pull Request resolved: pytorch#77037

Approved by: https://github.com/jerryzh168
shmsong pushed a commit that referenced this pull request Jun 29, 2022
…ops to use method overloads""

This reverts commit f3665dd.

Reverted pytorch#79819 on behalf of https://github.com/malfet due to land raced with softshrink refs
shmsong pushed a commit that referenced this pull request Jul 24, 2022
…ytorch#81031)

Re-attempting after the original PR, pytorch#79596, was reverted for causing ROCm build failures.
Pull Request resolved: pytorch#81031
Approved by: https://github.com/jeffdaily, https://github.com/malfet
shmsong pushed a commit that referenced this pull request Aug 8, 2022
### Summary:
This PR implements PTQ for APoT FakeQuant. It runs models (a pre-trained ResNet-18 on the ImageNet dataset) to compare accuracy metrics across different qconfig settings: uniform vs. APoT quantized activations and weights.

According to the collected accuracy stats, model #2 (uniform activation, APoT weight) shows a slight accuracy improvement over model #1 (uniform activation, uniform weight) for 8-bit, and a significant improvement for 4-bit (see the "Accuracy Stats" section below).

### Test Plan:
Run models with: `python test/quantization/core/experimental/fx_graph_mode_apot.py`

### Accuracy Stats:
8-bit (Uniform int8, APoT b = 8 k = 2)

**Model #1:** Uniform activation, uniform weight (FX Graph Mode quantized)
Evaluation accuracy on test dataset: 64.43% (Top-1), 85.62% (Top-5)

**Model #2:** Uniform activation, APoT weight (FX Graph Mode quantized)
Evaluation accuracy on test dataset: 64.51% (Top-1), 85.78% (Top-5)

**Model #3:** APoT activation, APoT weight (FX Graph Mode quantized)
Evaluation accuracy on test dataset: 64.32% (Top-1), 85.78% (Top-5)

4-bit (Uniform int4, APoT b = 4 k = 2)

**Model #1:** Uniform activation, uniform weight (FX Graph Mode quantized)
Evaluation accuracy on test dataset: 45.63% (Top-1), 71.96% (Top-5)

**Model #2:** Uniform activation, APoT weight (FX Graph Mode quantized)
Evaluation accuracy on test dataset: 64.24% (Top-1), 85.56% (Top-5)

**Model #3:** APoT activation, APoT weight (FX Graph Mode quantized)
Evaluation accuracy on test dataset: 45.40% (Top-1), 76.21% (Top-5)

**Full Precision model (FX Graph Mode quantized)**
Evaluation accuracy on test dataset: 69.76% (Top-1), 89.08% (Top-5)

**Eager mode quantized model**
Evaluation accuracy on test dataset: 69.49% (Top-1), 88.90% (Top-5)
Pull Request resolved: pytorch#81040
Approved by: https://github.com/jerryzh168