
[ONNX] Add binary_cross_entropy_with_logits op to ONNX opset version 12 #49675

Merged · 232 commits · Jan 20, 2021
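
A minimal usage sketch of what this PR enables, assuming a build that includes the change: exporting `binary_cross_entropy_with_logits` at ONNX opset 12 (inputs are arbitrary example values).

```
import torch
import torch.nn.functional as F

class Model(torch.nn.Module):
    def forward(self, logits, target):
        return F.binary_cross_entropy_with_logits(logits, target)

# Export at opset 12, where this PR adds the symbolic for the op
torch.onnx.export(Model(), (torch.randn(4), torch.rand(4)),
                  "bce_logits.onnx", opset_version=12)
```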

Commits on Dec 17, 2020

  1. 170908b
  2. 203e181

Commits on Dec 22, 2020

  1. 29dd23f
  2. f7c63eb

Commits on Dec 23, 2020

  1. [te] Fix bugs with shift operators (pytorch#49396)

    Summary:
    Pull Request resolved: pytorch#49396
    
    Pull Request resolved: pytorch#49271
    
    Two things:
    
    1. These throw exceptions in their constructor, which causes a segfault (*), so
       move the exceptions to ::make.
    2. They technically support FP types but the rules are complicated so let's not
       bother.
    
    (*) The reason for the segfault: all Exprs including these inherit from
    KernelScopedObject, whose constructor adds the object to a list for destruction
    at the end of the containing KernelArena's lifetime.  But if the derived-class
    constructor throws, the object is deleted even though it's still in the
    KernelArena's list.  So when the KernelArena is itself deleted, it double-frees
    the pointer and dies.  I've also fixed And, Or, and Xor in this diff.
    ghstack-source-id: 118594998
    
    Test Plan: `buck test //caffe2/test:jit`
    
    Reviewed By: bwasti
    
    Differential Revision: D25512052
    
    fbshipit-source-id: 42670b3be0cc1600dc5cda6811f7f270a2c88bba
    bertmaher authored and hwangdeyu committed Dec 23, 2020
    Commit: 086fcf6
  2. [static runtime] refine fusion group (pytorch#49340)

    Summary:
    Pull Request resolved: pytorch#49340
    
    This refines the fusion group to include only certain types of operations. We cannot safely handle "canRunNatively" types, and the memonger pass causes regressions on some internal models, so it was disabled (to be revisited with proper memory optimization once Tensor pools are implemented)
    
    Test Plan:
    ```
    buck test mode/no-gpu caffe2/test:static_runtime
    buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
    ```
    
    Reviewed By: ZolotukhinM
    
    Differential Revision: D25520105
    
    fbshipit-source-id: add61d103e4f8b4615f5402e760893ef759a60a9
    bwasti authored and hwangdeyu committed Dec 23, 2020
    Commit: 43aa3be
  3. [JIT] Support multiple outputs in subgraph matcher. (pytorch#48992)

    Summary: Pull Request resolved: pytorch#48992
    
    Differential Revision: D25388100
    
    Test Plan: Imported from OSS
    
    Reviewed By: heitorschueroff
    
    Pulled By: ZolotukhinM
    
    fbshipit-source-id: d95713af2220cf4f99ac92f59f8e5b902f2f3822
    Mikhail Zolotukhin authored and hwangdeyu committed Dec 23, 2020
    Commit: 837ac43
  4. [numpy] torch.{all/any} : output dtype is always bool (pytorch#47878)

    Summary:
    BC-breaking note:
    
    This PR changes the behavior of the any and all functions to always return a bool tensor. Previously these functions were only defined on bool and uint8 tensors, and when called on uint8 tensors they would also return a uint8 tensor. (When called on a bool tensor they would return a bool tensor.)
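
    A minimal illustration of the new behavior, on a build that includes this PR:

    ```
    import torch

    u8 = torch.tensor([0, 1, 2], dtype=torch.uint8)
    print(torch.any(u8).dtype)  # torch.bool (previously torch.uint8)
    print(torch.all(u8).dtype)  # torch.bool
    ```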
    
    PR summary:
    
    pytorch#44790 (comment)
    
    Fixes items 2 and 3 from the linked comment
    
    Also Fixes pytorch#48352
    
    Changes
    * Output dtype is always `bool` (consistent with numpy). **BC-breaking** (previously matched the input dtype)
    * Uses vectorized version for all dtypes on CPU
    * Enables test for complex
    * Update doc for `torch.all` and `torch.any`
    
    TODO
    * [x] Update docs
    * [x] Benchmark
    * [x] Raise issue on XLA
    
    Pull Request resolved: pytorch#47878
    
    Reviewed By: H-Huang
    
    Differential Revision: D25421263
    
    Pulled By: mruberry
    
    fbshipit-source-id: c6c681ef94004d2bcc787be61a72aa059b333e69
    kshitij12345 authored and hwangdeyu committed Dec 23, 2020
    Commit: 4bdc202
  5. Replace THError() check in THCTensorMathReduce.cu with C10_CUDA_KERNEL_LAUNCH_CHECK() (pytorch#49424)
    
    Summary:
    Pull Request resolved: pytorch#49424
    
    As per conversation in this [comment](https://www.internalfb.com/intern/diff/D25541113/?dest_fbid=393026838623691&transaction_id=3818008671564312) on D25541113 (pytorch@e2510a0), although THError does more than just log errors associated with CUDA kernel launches, we're going to go ahead and replace it with C10_CUDA_KERNEL_LAUNCH_CHECK, so as to be consistent throughout the code base.
    Standardization FTW.
    
    This commit is purposefully sent in as a single file change so it can be easily reverted if it introduces a regression.
    
    Test Plan:
    Checked that the code still builds with
    ```
    buck build //caffe2/aten:ATen-cu
    ```
    Also ran basic aten tests
    ```
    buck test //caffe2/aten:atest
    ```
    
    Reviewed By: r-barnes
    
    Differential Revision: D25567863
    
    fbshipit-source-id: 1093bfe2b6ca6b9a3bfb79dcdc5d713f6025eb77
    Amogh Akshintala authored and hwangdeyu committed Dec 23, 2020
    Commit: 6428593
  6. Fix include files for out-of-tree compilation (pytorch#48827)

    Summary:
    Signed-off-by: caozhong <zhong.z.cao@intel.com>
    
    Pull Request resolved: pytorch#48827
    
    Reviewed By: agolynski
    
    Differential Revision: D25375988
    
    Pulled By: ailzhang
    
    fbshipit-source-id: a8d5ab4572d991d6d96dfe758011517651ff0a6b
    CaoZhongZ authored and hwangdeyu committed Dec 23, 2020
    Commit: 3e6bdd1
  7. Add flag torch_jit_disable_warning_prints to allow disabling all warnings.warn (pytorch#49313)
    
    Summary:
    Adding a flag torch_jit_disable_warning_prints to optimize interpreter performance by suppressing a (potentially large) number of warnings.warn prints.
    
    This works around TorchScript's warning-behavior mismatch with Python. Python by default triggers a warning once per location, but TorchScript doesn't support that, so the same warning triggers and prints once per inference run, hurting performance.
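
    For reference, the stock Python behavior the flag works around (pure Python, no PyTorch needed):

    ```
    import warnings

    def f():
        warnings.warn("deprecated")

    for _ in range(3):
        f()
    # Python's default filter prints the warning once per call site;
    # TorchScript lacked that dedup, so the warning printed on every run.
    ```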
    
    Pull Request resolved: pytorch#49313
    
    Reviewed By: SplitInfinity
    
    Differential Revision: D25534274
    
    Pulled By: gmagogsfm
    
    fbshipit-source-id: eaeb57a335c3e6c7eb259671645db05d781e80a2
    gmagogsfm authored and hwangdeyu committed Dec 23, 2020
    Commit: 9058e5f
  8. [DPER] Introduce barrier operation to force synchronization of threads in async execution (pytorch#49322)
    
    Summary:
    Pull Request resolved: pytorch#49322
    
    In some cases async execution might lose dependencies (alias-like ops) or produce suboptimal scheduling when there is a choice of which parts to schedule first. An example of the latter can happen in ModelParallel training, where a copy can get lower priority compared to the rest of the execution on the given GPU, which will cause other GPUs to starve.
    
    This operator allows us to address these issues by introducing extra explicit dependencies between ops.
    
    Test Plan:
    Unit-test/
    E2E testing in the future diffs.
    
    Reviewed By: xianjiec
    
    Differential Revision: D24933471
    
    fbshipit-source-id: 1668994c7856d73926cde022378a99e1e8db3567
    kennyhorror authored and hwangdeyu committed Dec 23, 2020
    Commit: f360b23
  9. [FX] Rename Node._uses and refactor Node.all_input_nodes (pytorch#49415)

    Summary: Pull Request resolved: pytorch#49415
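
    A quick sketch of the public accessor involved, using torch.fx's public tracing API:

    ```
    import torch.fx

    def add(x, y):
        return x + y

    gm = torch.fx.symbolic_trace(add)
    for node in gm.graph.nodes:
        # all_input_nodes lists the Nodes this node consumes
        print(node.op, node.name, node.all_input_nodes)
    ```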
    
    Test Plan: Imported from OSS
    
    Reviewed By: zdevito
    
    Differential Revision: D25565341
    
    Pulled By: jamesr66a
    
    fbshipit-source-id: 2290ab62572632788809ba16319578bf0c0260ee
    James Reed authored and hwangdeyu committed Dec 23, 2020
    Commit: 4c667a1
  10. [PyTorch] Use plain old function pointer for RecordFunctionCallback (reapply) (pytorch#49408)
    
    Summary:
    Pull Request resolved: pytorch#49408
    
    Nearly every non-test callsite doesn't need to capture any variables anyway, and this saves 48 bytes per callback.
    ghstack-source-id: 118665808
    
    Test Plan:
    Wait for GitHub CI since we had C++14-specific issues with
    this one in previous PR pytorch#48629
    
    Reviewed By: malfet
    
    Differential Revision: D25563207
    
    fbshipit-source-id: 6a2831205917d465f8248ca37429ba2428d5626d
    swolchok authored and hwangdeyu committed Dec 23, 2020
    Commit: 1c9a0bf
  11. [CMake] Use libtorch_cuda list defined in bzl file (pytorch#49429)

    Summary:
    Since NCCL is an optional CUDA dependency, remove nccl.cpp from the core filelist
    
    Pull Request resolved: pytorch#49429
    
    Reviewed By: nikithamalgifb
    
    Differential Revision: D25569883
    
    Pulled By: malfet
    
    fbshipit-source-id: 61371a4c6b0438e4e0a7f094975b9a9f9ffa4032
    malfet authored and hwangdeyu committed Dec 23, 2020
    Commit: 4558c13
  12. update breathe (pytorch#49407)

    Summary:
    Fixes pytorch#47462, but not completely.
    
    Update breathe to the latest version to get fixes for the "Unable to resolve..." issues. There are still some build errors, but far fewer than before.
    
    Pull Request resolved: pytorch#49407
    
    Reviewed By: izdeby
    
    Differential Revision: D25562163
    
    Pulled By: glaringlee
    
    fbshipit-source-id: 91bfd9e9ac70723816309f489022d72853f5fdc5
    mattip authored and hwangdeyu committed Dec 23, 2020
    Commit: 6275612
  13. [StaticRuntime] Permute_out (pytorch#49447)

    Summary:
    Pull Request resolved: pytorch#49447
    
    Adding an out variant for `permute`. It's better than fixing the copy inside contiguous because 1) we can leverage the c2 math library, and 2) contiguous creates a tensor inside the function which isn't managed by the MemoryPlanner in StaticRuntime
    
    Test Plan:
    Benchmark:
    ```
    After:
    I1214 12:35:32.218775 991920 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0902339. Iters per second: 11082.3
    
    Before:
    I1214 12:35:43.368770 992620 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0961521. Iters per second: 10400.2
    ```
    
    Reviewed By: yinghai
    
    Differential Revision: D25541666
    
    fbshipit-source-id: 013ed0d4080cd01de4d3e1b031ab51e5032e6651
    Hao Lu authored and hwangdeyu committed Dec 23, 2020
    Commit: 7439f10
  14. fix optimizer.pyi typo 'statue'->'state' (pytorch#49388)

    Summary: Pull Request resolved: pytorch#49388
    
    Test Plan: Imported from OSS
    
    Reviewed By: zou3519
    
    Differential Revision: D25553672
    
    Pulled By: glaringlee
    
    fbshipit-source-id: e9f2233bd678a90768844af2d8d5e2994d59e304
    lixinyu authored and hwangdeyu committed Dec 23, 2020
    Commit: edea937
  15. [StaticRuntime] Fusion pass for ClipRanges/GatherRanges/LengthsToOffsets (pytorch#49113)
    
    Summary: Pull Request resolved: pytorch#49113
    
    Reviewed By: ajyu
    
    Differential Revision: D25388512
    
    fbshipit-source-id: 3daa5b9387a3a10b6c220688df06540c4d844aea
    Hao Lu authored and hwangdeyu committed Dec 23, 2020
    Commit: 1f1c0f5
  16. quantized tensor: add preliminary support for advanced indexing, try 2 (pytorch#49346)
    
    Summary:
    Pull Request resolved: pytorch#49346
    
    This is a less ambitious redo of
    pytorch#49129.
    
    We make the
    
    ```
    xq_slice = xq[:, [0], :, :]
    ```
    
    indexing syntax work if `xq` is a quantized Tensor.  For now, we are
    making the code not crash, with an inefficient `dq -> index -> q`
    implementation.  A future PR can optimize performance by removing
    the unnecessary memory copies (which will require some non-trivial
    changes to TensorIterator).
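
    A small usage sketch, assuming a build with this change:

    ```
    import torch

    x = torch.randn(1, 3, 4, 4)
    xq = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
    xq_slice = xq[:, [0], :, :]  # advanced indexing on a quantized tensor
    print(xq_slice.shape, xq_slice.is_quantized)  # torch.Size([1, 1, 4, 4]) True
    ```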
    
    Test Plan:
    ```
    python test/test_quantization.py TestQuantizedOps.test_advanced_indexing
    ```
    
    Imported from OSS
    
    Reviewed By: jerryzh168
    
    Differential Revision: D25539365
    
    fbshipit-source-id: 98485875aaaf5743e1a940e170258057691be4fa
    vkuzo authored and hwangdeyu committed Dec 23, 2020
    Commit: b1547e4
  17. Unescape string in RPC error message (pytorch#49373)

    Summary:
    Pull Request resolved: pytorch#49373
    
    Unescaping the string in RPC error message to provide better error msg
    
    Test Plan: CI
    
    Reviewed By: xush6528
    
    Differential Revision: D25511730
    
    fbshipit-source-id: 054f46d5ffbcb1350012362a023fafb1fe57fca1
    rohan-varma authored and hwangdeyu committed Dec 23, 2020
    Commit: 28a5455
  18. [StaticRuntime][ATen] Add out variant for narrow_copy (pytorch#49449)

    Summary:
    Pull Request resolved: pytorch#49449
    
    Similar to permute_out, add the out variant of `aten::narrow` (slice in c2) which does an actual copy. `aten::narrow` creates a view; however, a copy is incurred when we call `input.contiguous` in the ops that follow `aten::narrow`, in `concat_add_mul_replacenan_clip`, `casted_batch_one_hot_lengths`, and `batch_box_cox`.
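
    A sketch of the view-vs-copy behavior described, using the public Python API:

    ```
    import torch

    x = torch.randn(4, 6)
    v = x.narrow(1, 0, 3)                # narrow returns a view, no copy
    print(v.data_ptr() == x.data_ptr())  # True: same storage
    c = v.contiguous()                   # the copy is incurred here
    print(c.data_ptr() == x.data_ptr())  # False
    ```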
    
    
    Test Plan:
    Unit test:
    
    ```
    buck test //caffe2/aten:native_test
    ```
    Benchmark with the adindexer model:
    ```
    bs = 1 is neutral
    
    Before:
    I1214 21:32:51.919239 3285258 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0886948. Iters per second: 11274.6
    After:
    I1214 21:32:52.492352 3285277 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0888019. Iters per second: 11261
    
    bs = 20 shows more gains probably because the tensors are bigger and therefore the cost of copying is higher
    
    Before:
    I1214 21:20:19.702445 3227229 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.527563. Iters per second: 1895.51
    After:
    I1214 21:20:20.370173 3227307 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.508734. Iters per second: 1965.67
    ```
    
    Reviewed By: bwasti
    
    Differential Revision: D25554109
    
    fbshipit-source-id: 6bae62e6ce3456ff71559b635cc012fdcd1fdd0e
    Hao Lu authored and hwangdeyu committed Dec 23, 2020
    Commit: bc97e02
  19. Revert D25554109: [StaticRuntime][ATen] Add out variant for narrow_copy

    Test Plan: revert-hammer
    
    Differential Revision:
    D25554109 (pytorch@ed04b71)
    
    Original commit changeset: 6bae62e6ce34
    
    fbshipit-source-id: bfa038e150166d0116bcae8f7a6415d98d4146de
    Hao Lu authored and hwangdeyu committed Dec 23, 2020
    Commit: 1479e05
  20. Making ops c10 full: optional out arguments (pytorch#49083)

    Summary:
    Pull Request resolved: pytorch#49083
    
    We have some (but very few) ops that take optional out arguments `Tensor(a!)? out`.
    This PR makes them non-optional mandatory arguments and enables c10-fullness for them.
    There is only a very small number of ops affected by this.
    
    Putting this up for discussion.
    
    Alternatives considered:
    If we keep them optional, we run into lots of issues in the dispatcher. We have to decide what the dispatcher calling convention for this argument type should be.
    1) If we keep passing them in as `Tensor&` arguments and return them as `tuple<Tensor&, Tensor&, Tensor&>`, so basically same as currently, then the schema inference check will say "Your kernel function got inferred to have a `Tensor` argument but your native_functions.yaml declaration says `Tensor?`. This is a mismatch, you made an error". We could potentially disable that check, but that would open the door for real mistakes to not be reported anymore in the future. This sounds bad.
    2) If we change them to a type that schema inference could differentiate from `Tensor`, say we pass them in as `const optional<Tensor>&` and return them as `tuple<const optional<Tensor>&, const optional<Tensor>&, const optional<Tensor>&>`, then our boxing logic fails because it can't recognize those as out overloads anymore and shortcut the return value as it is doing right now. We might be able to rewrite the boxing logic, but that could be difficult and could easily develop into a rabbit hole of having to clean up `Tensor&` references throughout the system where we use them.
    
    Furthermore, having optional out arguments in C++ doesn't really make sense: the C++ API puts them at the front of the argument list, so you can't omit them anyway when calling an op.
    You would be able to omit them when calling from Python with out kwargs, but it's not clear we want that discrepancy between the C++ and Python APIs.
    ghstack-source-id: 118660075
    
    Test Plan: waitforsandcastle
    
    Reviewed By: ezyang
    
    Differential Revision: D25422197
    
    fbshipit-source-id: 3cb25c5a3d93f9eb960d70ca014bae485be9f058
    smessmer authored and hwangdeyu committed Dec 23, 2020
    Commit: 00e3716
  21. Making ops c10-full: optional lists (pytorch#49088)

    Summary:
    Pull Request resolved: pytorch#49088
    
    We had special case logic to support `int[]?` and `double[]?` but nothing for `DimnameList[]?`.
    This PR generalizes the logic to support optional lists so it should now work with all types.
    It also enables c10-fullness for ops that were blocked by this.
    
    Note that using these arguments in a signature was always and still is expensive because the whole list needs to be copied.
    We should probably consider alternatives in the future, for example using `torch::List` instead of `ArrayRef`, which could work without copying the list.
    ghstack-source-id: 118660071
    
    Test Plan: waitforsandcastle
    
    Reviewed By: ezyang
    
    Differential Revision: D25423901
    
    fbshipit-source-id: dec58dc29f3bb4cbd89e2b95c42da204a9da2e0a
    smessmer authored and hwangdeyu committed Dec 23, 2020
    Commit: bdfa87e
  22. [PyTorch] Avoid move-constructing a List in listConstruct (pytorch#49355)
    
    Summary:
    Pull Request resolved: pytorch#49355
    
    List's move ctor is a little bit more expensive than you might expect, but we can easily avoid it.
    ghstack-source-id: 118624596
    
    Test Plan: Roughly 1% improvement on internal benchmark.
    
    Reviewed By: hlu1
    
    Differential Revision: D25542190
    
    fbshipit-source-id: 08532642c7d1f1604e16c8ebefd1ed3e56f7c919
    swolchok authored and hwangdeyu committed Dec 23, 2020
    Commit: 076d62f
  23. Enhanced generators with grad-mode decorators (pytorch#49017)

    Summary:
    This PR addresses the feature request outlined in pytorch#48713 for two-way communication with enhanced generators from [pep-342](https://www.python.org/dev/peps/pep-0342/).
    
    Briefly, the logic of the patch resembles `yield from` [pep-380](https://www.python.org/dev/peps/pep-0380/), which cannot be used, since the generator **must be interacted with from within the grad-mode context**, while yields from the decorator **must take place outside of the context**. Hence any interaction with the wrapped generator, be it via [.send](https://docs.python.org/3/reference/expressions.html?highlight=throw#generator.send), [.throw](https://docs.python.org/3/reference/expressions.html?highlight=throw#generator.throw), and even [.close](https://docs.python.org/3/reference/expressions.html?highlight=throw#generator.close) must be wrapped by a `with` clause. The patch is compatible with `for i in gen: pass` and `next(gen)` use cases and allows two-way communication with the generator via `.send <-> yield` points.
    
    ### Logic
    At lines [L37-L38](https://github.com/ivannz/pytorch/blob/2d40296c0c6617b3980c86762be466c995aa7f8e/torch/autograd/grad_mode.py#L37-L38) we (the decorator) **start the wrapped generator** (coroutine) by issuing `None` into it (equivalently, we can use `next(gen)` here). Then we **dispatch responses of the generator** to our ultimate caller and **relay the latter's requests** into the generator in the loop on lines [L39-L52](https://github.com/ivannz/pytorch/blob/2d40296c0c6617b3980c86762be466c995aa7f8e/torch/autograd/grad_mode.py#L39-L52).
    
    We yield the most recent response on [L40-L41](https://github.com/ivannz/pytorch/blob/2d40296c0c6617b3980c86762be466c995aa7f8e/torch/autograd/grad_mode.py#L40-L41), at which point we become **paused**, waiting for the next ultimate caller's interaction with us. If the caller **sends us a request**, then we become unpaused and move to [L51-L52](https://github.com/ivannz/pytorch/blob/2d40296c0c6617b3980c86762be466c995aa7f8e/torch/autograd/grad_mode.py#L51-L52) and **forward it into the generator**, at which point we pause, waiting for its response. The response might be a value, an exception or a `StopIteration`. In the case of an exception from the generator, we let it **bubble up** from the immediately surrounding [except clause](https://docs.python.org/3/reference/compound_stmts.html#the-try-statement) to the ultimate caller through the [outer try-except](https://github.com/ivannz/pytorch/blob/2dc287bba87fa6f05c49446c0239ffdcdb1e896e/torch/autograd/grad_mode.py#L36-L54). In the case of a `StopIteration`, we **take its payload and propagate it** to the caller via [return](https://github.com/ivannz/pytorch/blob/2d40296c0c6617b3980c86762be466c995aa7f8e/torch/autograd/grad_mode.py#L54). In the case of a value, the flow and the loop continue.
    
    The caller **throwing an exception at us** is handled much like a proper request, except for the exception playing the role of the request. In this case we **forward it into the generator** on lines [L47-L49](https://github.com/ivannz/pytorch/blob/2d40296c0c6617b3980c86762be466c995aa7f8e/torch/autograd/grad_mode.py#L47-L49) and await its response. We explicitly **advance** the traceback one frame up, in order to indicate the **source of the exception within the generator**.
    
    Finally the `GeneratorExit` is handled on lines [L42-L45](https://github.com/ivannz/pytorch/blob/2d40296c0c6617b3980c86762be466c995aa7f8e/torch/autograd/grad_mode.py#L42-L45) and closes the generator.
    
    Updates: clarified exception propagation
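
    A usage sketch, assuming a build with this patch (`torch.no_grad` is one of the grad-mode decorators covered):

    ```
    import torch

    @torch.no_grad()
    def gen():
        # everything between yields runs with grad disabled
        received = yield torch.is_grad_enabled()
        yield received

    g = gen()
    print(next(g))         # False: the body runs under no_grad
    print(g.send("ping"))  # 'ping': two-way .send <-> yield works
    ```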
    
    Pull Request resolved: pytorch#49017
    
    Reviewed By: izdeby
    
    Differential Revision: D25567796
    
    Pulled By: albanD
    
    fbshipit-source-id: 801577cccfcb2b5e13a08e77faf407881343b7b0
    ivannz authored and hwangdeyu committed Dec 23, 2020
    Commit: 197266d
  24. webdataset prototype - ListDirFilesIterableDataset (pytorch#48944)

    Summary:
    Pull Request resolved: pytorch#48944
    
    This is a stacked PR for the webdataset prototype. I am trying to make each entry in the stack a separate dataset.
    To keep the implementation simple, each dataset will only support basic functionality (a hypothetical sketch follows the checklist below).
    
    - [x] ListDirFilesDataset
    - [x] LoadFilesFromDiskIterableDataset
    - [x] ReadFilesFromTarIterableDataset
    - [x] ReadFilesFromZipIterableDataset
    - [x] RoutedDecoderIterableDataset
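
    A hypothetical minimal sketch of the first dataset in the checklist above; the class name and constructor are illustrative, not the actual prototype API:

    ```
    import os
    from torch.utils.data import IterableDataset

    class ListDirFiles(IterableDataset):
        def __init__(self, root):
            self.root = root

        def __iter__(self):
            # yield the path of every file under root
            for name in sorted(os.listdir(self.root)):
                yield os.path.join(self.root, name)
    ```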
    
    Test Plan: Imported from OSS
    
    Reviewed By: izdeby
    
    Differential Revision: D25541277
    
    Pulled By: glaringlee
    
    fbshipit-source-id: 9e738f6973493f6be1d5cc1feb7a91513fa5807c
    lixinyu authored and hwangdeyu committed Dec 23, 2020
    Commit: 5165093
  25. webdataset prototype - LoadFilesFromDiskIterableDataset (pytorch#48955)

    Summary: Pull Request resolved: pytorch#48955
    
    Test Plan: Imported from OSS
    
    Reviewed By: izdeby
    
    Differential Revision: D25541393
    
    Pulled By: glaringlee
    
    fbshipit-source-id: dea6ad64a7ba40abe45612d99f078b14d1da8bbf
    lixinyu authored and hwangdeyu committed Dec 23, 2020
    Commit: 9745362
  26. CUDA BFloat embedding (pytorch#44848)

    Summary: Pull Request resolved: pytorch#44848
    
    Reviewed By: izdeby
    
    Differential Revision: D25574204
    
    Pulled By: ngimel
    
    fbshipit-source-id: b35f7253a6ad2b83f7b6b06862a5ab77295373e0
    zasdfgbnm authored and hwangdeyu committed Dec 23, 2020
    Commit: bf3d1b4
  27. Instantiate PackedConvWeight to avoid linking error (pytorch#49442)

    Summary:
    Pull Request resolved: pytorch#49442
    
    When moving ATen/native to the app level, symbols from native/quantized may sit in a target away from some of their call sites. As a result, there are linking errors from missing symbols for instantiations of PackedConvWeight::prepack. The solution is to instantiate PackedConvWeight in the same compilation unit. It's similar to D24941989 (pytorch@fe6bb2d).
    ghstack-source-id: 118676374
    
    Test Plan: CI
    
    Reviewed By: dhruvbird
    
    Differential Revision: D25576703
    
    fbshipit-source-id: d6e3d11d51d8172ab8487ce44ec8c042889f0f11
    iseeyuan authored and hwangdeyu committed Dec 23, 2020
    Commit: a213e48
  28. .circleci: downgrade conda-package-handling to 1.6.0 (pytorch#49434)

    Summary:
    Pull Request resolved: pytorch#49434
    
    There was a bug introduced in conda-package-handling >= 1.6.1 that makes archives
    above a certain size fail when attempting to extract;
    see: conda/conda-package-handling#71
    
    coincides with pytorch/builder#611
    
    Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
    
    Test Plan: Imported from OSS
    
    Reviewed By: xuzhao9, janeyx99, samestep
    
    Differential Revision: D25573390
    
    Pulled By: seemethere
    
    fbshipit-source-id: 82173804f1b30da6e4b401c4949e2ee52065e149
    seemethere authored and hwangdeyu committed Dec 23, 2020
    Commit: d73c1f4
  29. [Docs] Updating init_process_group docs to indicate correct rank range (pytorch#49131)
    
    Summary:
    Pull Request resolved: pytorch#49131
    
    Users frequently assume the correct range of ranks is 1 ...
    `world_size`. This PR updates the docs to indicate that the correct rank range
    users should specify is 0 ... `world_size` - 1.
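
    A minimal sketch of the documented range (world_size=1 so it runs standalone; backend and address are example values):

    ```
    import torch.distributed as dist

    # rank must be in [0, world_size - 1]
    dist.init_process_group(backend="gloo",
                            init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=1)
    print(dist.get_rank(), dist.get_world_size())  # 0 1
    ```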
    
    Test Plan: Rendering and Building Docs
    
    Reviewed By: mrshenli
    
    Differential Revision: D25410532
    
    fbshipit-source-id: fe0f17a4369b533dc98543204a38b8558e68497a
    osalpekar authored and hwangdeyu committed Dec 23, 2020
    Commit: 98c4a4d
  30. [c10d Store] Store Python Docs Fixes (pytorch#49130)

    Summary:
    Pull Request resolved: pytorch#49130
    
    The Python Store API docs had some typos where boolean values were
    lowercase, which is incorrect Python syntax. This diff fixes those typos.
    
    Test Plan: Built and Rendered Docs
    
    Reviewed By: mrshenli
    
    Differential Revision: D25411492
    
    fbshipit-source-id: fdbf1e6b8f81e9589e638286946cad68eb7c9252
    osalpekar authored and hwangdeyu committed Dec 23, 2020
    Commit: e9c93eb
  31. Add sinc operator (pytorch#48740)

    Summary:
    Implements the sinc operator.
    See https://numpy.org/doc/stable/reference/generated/numpy.sinc.html
    
    ![image](https://user-images.githubusercontent.com/13428986/101653855-cdffa080-3a0d-11eb-8426-ecc81c152ebd.png)
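
    A usage sketch, assuming a build where the op is present (note it is reverted again later in this commit list):

    ```
    import torch

    x = torch.tensor([0.0, 0.5, 1.0])
    # sinc(x) = sin(pi*x) / (pi*x), with sinc(0) defined as 1
    print(torch.sinc(x))  # ~ tensor([1.0000, 0.6366, 0.0000])
    ```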
    
    Pull Request resolved: pytorch#48740
    
    Reviewed By: izdeby
    
    Differential Revision: D25564477
    
    Pulled By: soulitzer
    
    fbshipit-source-id: 13f36a2b84dadfb4fd1442a2a40a3a3246cbaecb
    soulitzer authored and hwangdeyu committed Dec 23, 2020
    Commit: 6cb4910
  32. Revert "Revert D24923679: Fixed einsum compatibility/performance issu…

    …es (pytorch#46398)" (pytorch#49189)
    
    Summary:
    Pull Request resolved: pytorch#49189
    
    This reverts commit d307601 and fixes the bug with diagonals and ellipsis combined.
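
    A sketch of the case being fixed, via the public einsum API:

    ```
    import torch

    x = torch.randn(3, 3, 4)
    # diagonal subscripts ('ii') combined with an ellipsis
    y = torch.einsum('ii...->i...', x)
    print(y.shape)  # torch.Size([3, 4])
    ```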
    
    Test Plan: Imported from OSS
    
    Reviewed By: glaringlee
    
    Differential Revision: D25540722
    
    Pulled By: heitorschueroff
    
    fbshipit-source-id: 86d0c9a7dcfda600b546457dad102af2ff33e353
    heitorschueroff authored and hwangdeyu committed Dec 23, 2020
    Commit: 7b4218c
  33. [caffe2][autograd] Avoid extensive -Wunused-variable warnings on _any_requires_grad (pytorch#49167)
    
    Summary:
    Pull Request resolved: pytorch#49167
    
    Building with clang and a fair warning level can result in hundreds of lines of compiler output of the form:
    ```
    caffe2\gen_aten_libtorch\autograd\generated\VariableType_1.cpp(2279,8): warning: unused variable '_any_requires_grad' [-Wunused-variable]
       auto _any_requires_grad = compute_requires_grad( self );
            ^
    caffe2\gen_aten_libtorch\autograd\generated\VariableType_1.cpp(2461,8): warning: unused variable '_any_requires_grad' [-Wunused-variable]
       auto _any_requires_grad = compute_requires_grad( grad_output, self );
            ^
    caffe2\gen_aten_libtorch\autograd\generated\VariableType_1.cpp(2677,8): warning: unused variable '_any_requires_grad' [-Wunused-variable]
       auto _any_requires_grad = compute_requires_grad( self );
            ^
    ...
    ```
    This happens when requires_derivative == False. Let's mark `_any_requires_grad` as potentially unused. If this were C++17 we would use `[[maybe_unused]]` but to retain compatibility with C++11 we just mark it with `(void)`.
    
    Test Plan: CI + locally built
    
    Reviewed By: ezyang
    
    Differential Revision: D25421548
    
    fbshipit-source-id: c56279a184b1c616e8717a19ee8fad60f36f37d1
    jdonald authored and hwangdeyu committed Dec 23, 2020
    Commit: 6a56da9
  34. Revert D25421263: [pytorch][PR] [numpy] torch.{all/any} : output dtype is always bool
    
    Test Plan: revert-hammer
    
    Differential Revision:
    D25421263 (pytorch@c508e5b)
    
    Original commit changeset: c6c681ef9400
    
    fbshipit-source-id: 4c0c9acf42b06a3ed0af8f757ea4512ca35b6c59
    ngimel authored and hwangdeyu committed Dec 23, 2020
    Commit: 5125131
  35. Reland "Add test for empty tensors for batch matmuls" (pytorch#48797)

    Summary:
    This reverts commit c7746ad.
    
    Fixes #{issue number}
    
    Pull Request resolved: pytorch#48797
    
    Reviewed By: mruberry
    
    Differential Revision: D25575264
    
    Pulled By: ngimel
    
    fbshipit-source-id: c7f3b384db833d727bb5bd8a51f1493a13016d09
    zasdfgbnm authored and hwangdeyu committed Dec 23, 2020
    Commit: c7ce84b
  36. Adding support for CuDNN-based LSTM with projections (pytorch#47725)

    Summary:
    Fixes pytorch#46213
    
    I didn't yet update the documentation, will add those change soon. A few other things that I didn't do, but want to clarify if I maybe should.
    
    1. I didn't expose projections in c++ API: torch/csrc/api/src/nn/modules/rnn.cpp. Let me know if this is desirable and I will add those changes.
    2. I didn't expose projections in "lstm_cell" function and "_thnn_differentiable_lstm_cell_backward" functions from aten/src/ATen/native/RNN.cpp. As far as I understand, they are not needed for nn.LSTM CPU execution. For lstm_cell, projections don't bring any real benefit, since if cell is used separately, it can be easily added in Python. For "_thnn_differentiable_lstm_cell_backward", I'm actually not sure where exactly that function is used, so I also disabled projections there for now. Please let me know if I should change that.
    3. I added check that projections are not supported for quantized LSTMs to quantized_lstm_<data/input> functions. But I didn't add any checks to LSTMCell code. It seems that since I disabled projections in "lstm_cell" function, they should also not be available for quantized models through any other API than quantized_lstm_<data/input>. Please let me know if I'm not correct and I will add checks to other places.
    4. Projections are not supported for CuDNN versions < 7.1.2. Should I add the check for CuDNN version and disable projections in that case? If so, what will be the best way to do that?
    5. Currently I added projection weight as the last weight, so the layout is "w_ih, w_hh, b_ih, b_hh, w_hr". This breaks the assumption that biases come after weights and thus I had to add additional if-s in various places. Alternative way would be to have "w_ih, w_hh, w_hr, b_ih, b_hh" layout, in which case the assumption will be true. But in that case I will need to split the loop in get_parameters function from aten/src/ATen/native/cudnn/RNN.cpp. And in some cases, I will still need to add an "undefined" tensor in the 3rd position, because we get all 5 weights from CuDNN most of the time. So I'm not sure which way is better. Let me know if you think I should change to the weights-then-biases layout.
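
    A usage sketch of the projection support described above, assuming a build with this change:

    ```
    import torch

    # proj_size > 0 enables output projections (CuDNN-backed on GPU)
    lstm = torch.nn.LSTM(input_size=10, hidden_size=20, proj_size=5,
                         batch_first=True)
    out, (h, c) = lstm(torch.randn(2, 3, 10))
    print(out.shape)         # torch.Size([2, 3, 5]): projected to proj_size
    print(h.shape, c.shape)  # torch.Size([1, 2, 5]) torch.Size([1, 2, 20])
    ```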
    
    Pull Request resolved: pytorch#47725
    
    Reviewed By: zou3519
    
    Differential Revision: D25449794
    
    Pulled By: ngimel
    
    fbshipit-source-id: fe6ce59e481d1f5fd861a8ff7fa13d1affcedb0c
    Igor Gitman authored and hwangdeyu committed Dec 23, 2020
    Commit: 1352101
  37. Move inplace_is_vmap_compatible to BatchedTensorImpl.h (pytorch#49118)

    Summary:
    Pull Request resolved: pytorch#49118
    
    I need this in the next stack up. It seems useful to have as a helper
    function.
    
    Test Plan: - run tests
    
    Reviewed By: izdeby
    
    Differential Revision: D25563546
    
    Pulled By: zou3519
    
    fbshipit-source-id: a4031fdc4b2373cc230ba3c66738d91dcade96e2
    zou3519 authored and hwangdeyu committed Dec 23, 2020
    Commit: 0991d63
  38. Update accumulate_grad to support vmap (pytorch#49119)

    Summary:
    Pull Request resolved: pytorch#49119
    
    I don't know how the accumulate_grad code gets hit via calling
    autograd.grad, so I went through all places in accumulate_grad
    that are definitely impossible to vmap through and changed them.
    
    To support this:
    - I added vmap support for Tensor::strides(). It returns the strides
    that correspond to the public dimensions of the tensor (not the ones
    being vmapped over).
    - Changed an instance of empty_strided to new_empty_strided.
    - Replaced an in-place operation in accumulate_grad.h
    
    Test Plan:
    - added a test for calling strides() inside of vmap
    - added tests that exercise all of the accumulate_grad code path.
    NB: I don't know why these tests exercise the code paths, but I've
    verified that they do via gdb.
    
    Suggestions for some saner test cases are very welcome.
    
    Reviewed By: izdeby
    
    Differential Revision: D25563543
    
    Pulled By: zou3519
    
    fbshipit-source-id: 05ac6c549ebd447416e6a07c263a16c90b2ef510
    zou3519 authored and hwangdeyu committed Dec 23, 2020
    Commit: b2acf95
  39. Update TensorPipe submodule (pytorch#49467)

    Summary:
    Pull Request resolved: pytorch#49467
    
    Credit to beauby for the Bazel fixes.
    
    Test Plan: Export and run on CI
    
    Reviewed By: beauby
    
    Differential Revision: D25588027
    
    fbshipit-source-id: efe1c543eb7438ca05254de67cf8b5cee625119a
    lw authored and hwangdeyu committed Dec 23, 2020
    Commit: da5c385
  40. Add docs/README.md to make existing doc build info more discoverable (pytorch#49286)
    
    Summary:
    Closes pytorchgh-42003
    
    Pull Request resolved: pytorch#49286
    
    Reviewed By: glaringlee
    
    Differential Revision: D25535250
    
    Pulled By: ezyang
    
    fbshipit-source-id: a7790bfe4528fa6a31698126cc687793fdf7ac3f
    rgommers authored and hwangdeyu committed Dec 23, 2020
    Commit: 94344a2
  41. Updated derivative rules for complex svd and pinverse (pytorch#47761)

    Summary:
    Updated `svd_backward` to work correctly for complex-valued inputs.
    Updated `common_methods_invocations.py` to take dtype, device arguments for input construction.
    Removed `test_pinverse` from `test_autograd.py`, it is replaced by entries to `common_methods_invocations.py`.
    Added `svd` and `pinverse` to list of complex tests.
    
    References for complex-valued SVD differentiation:
    
    - https://giggleliu.github.io/2019/04/02/einsumbp.html
    - https://arxiv.org/abs/1909.02659
    
    The derived rules assume gauge invariance of loss functions, so the result would not be correct for loss functions that are not gauge invariant.
    https://re-ra.xyz/Gauge-Problem-in-Automatic-Differentiation/
    
    The same rule is implemented in Tensorflow and [BackwardsLinalg.jl](https://github.com/GiggleLiu/BackwardsLinalg.jl).
    
    Ref. pytorch#33152
    
    Pull Request resolved: pytorch#47761
    
    Reviewed By: izdeby
    
    Differential Revision: D25574962
    
    Pulled By: mruberry
    
    fbshipit-source-id: 832b61303e883ad3a451b84850ccf0f36763a6f6
    IvanYashchuk authored and hwangdeyu committed Dec 23, 2020
    Commit: 6315a7e
  42. [quant][docs] Add fx graph mode quantization to quantization docs (pytorch#49211)
    
    Summary: Pull Request resolved: pytorch#49211
    
    Test Plan: Imported from OSS
    
    Reviewed By: raghuramank100
    
    Differential Revision: D25507480
    
    fbshipit-source-id: 9e9e4b5fef979f5621c1bbd1b49e9cc6830da617
    jerryzh168 authored and hwangdeyu committed Dec 23, 2020
    Commit: bbaa6bb
  43. stft: Change require_complex warning to an error (pytorch#49022)

    Summary: Pull Request resolved: pytorch#49022
    
    Test Plan: Imported from OSS
    
    Reviewed By: ngimel
    
    Differential Revision: D25569586
    
    Pulled By: mruberry
    
    fbshipit-source-id: 09608088f540c2c3fc70465f6a23f2aec5f24f85
    peterbell10 authored and hwangdeyu committed Dec 23, 2020
    Commit: 0d82603
  44. Revert D25564477: [pytorch][PR] Add sinc operator

    Test Plan: revert-hammer
    
    Differential Revision:
    D25564477 (pytorch@bbc7143)
    
    Original commit changeset: 13f36a2b84da
    
    fbshipit-source-id: 58cbe8109efaf499dd017531878b9fbbb27976bc
    soulitzer authored and hwangdeyu committed Dec 23, 2020
    Commit: 0176da6
  45. Making ops c10-full: Storage arguments (pytorch#49146)

    Summary:
    Pull Request resolved: pytorch#49146
    
    Add support for Storage arguments to IValue and the JIT typing system, and make ops that were blocked on that c10-full.
    ghstack-source-id: 118710665
    
    (Note: this ignores all push blocking failures!)
    
    Test Plan: waitforsandcastle
    
    Reviewed By: ezyang
    
    Differential Revision: D25456799
    
    fbshipit-source-id: da14f125af352de5fcf05a83a69ad5a69d5a3b45
    smessmer authored and hwangdeyu committed Dec 23, 2020
    Commit: 8dcd580
  46. Allow zero annealing epochs (pytorch#47579)

    Summary:
    Fixes pytorch#47578.
    
    Pull Request resolved: pytorch#47579
    
    Reviewed By: H-Huang
    
    Differential Revision: D25429403
    
    Pulled By: vincentqb
    
    fbshipit-source-id: c42fbcd71b46e07c672a1e9661468848ac16de38
    Daniil-Osokin authored and hwangdeyu committed Dec 23, 2020
    Commit: 6f50a18
  47. Revert D25507480: [quant][docs] Add fx graph mode quantization to quantization docs
    
    Test Plan: revert-hammer
    
    Differential Revision:
    D25507480 (pytorch@7729581)
    
    Original commit changeset: 9e9e4b5fef97
    
    fbshipit-source-id: fdb08d824209b97defaba2e207d1a914575a6ae7
    Mike Ruberry authored and hwangdeyu committed Dec 23, 2020
    Commit: 3bbc766
  48. Fix link in distributed contributing doc and add link (pytorch#49141)

    Summary:
    One of the links for ramp-up tasks wasn't showing any results and the other showed only RPC results. Instead, I changed it to one link that has `pt_distributed_rampup`, which seems reasonable as the developer will be able to see both RPC and distributed tasks.
    
    Also added test command for DDP tests.
    
    Pull Request resolved: pytorch#49141
    
    Reviewed By: ezyang
    
    Differential Revision: D25597560
    
    Pulled By: rohan-varma
    
    fbshipit-source-id: 85d7d2964a19ea69fe149c017cf88dff835b164a
    rohan-varma authored and hwangdeyu committed Dec 23, 2020
    Commit: e7b6a29
  49. Add note to torch docs for sinh/cosh (pytorch#49413)

    Summary:
    Address pytorch#48641
    
    Documents the behavior of sinh and cosh in the edge cases
    ```
    >>> b = torch.full((15,), 89, dtype=torch.float32)
    >>> torch.sinh(b)
    tensor([2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38,
            2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38,
            2.2448e+38, 2.2448e+38, 2.2448e+38])
    >>> b = torch.full((16,), 89, dtype=torch.float32)
    >>> torch.sinh(b)
    tensor([inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf])
    >>> b = torch.full((17,), 89, dtype=torch.float32)
    >>> torch.sinh(b)
    tensor([       inf,        inf,        inf,        inf,        inf,        inf,
                   inf,        inf,        inf,        inf,        inf,        inf,
                   inf,        inf,        inf,        inf, 2.2448e+38])
    >>> b = torch.full((32,), 89, dtype=torch.float32)[::2]
    >>> torch.sinh(b)
    tensor([2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38,
            2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38,
            2.2448e+38, 2.2448e+38, 2.2448e+38, 2.2448e+38])
    ```
    
    See https://sleef.org/purec.xhtml
    
    Pull Request resolved: pytorch#49413
    
    Reviewed By: ezyang
    
    Differential Revision: D25587932
    
    Pulled By: soulitzer
    
    fbshipit-source-id: 6db75c45786f4b95f82459d0ce5efa37ec0774f0
    soulitzer authored and hwangdeyu committed Dec 23, 2020
    Commit: 470a9cf
  50. Refine ConvParams::use_nnpack() (pytorch#49464)

    Summary:
    NNPACK convolution algorithms can only be used for kernels up to 16x16
    
    Fixes pytorch#49462
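
    A sketch of a convolution that exceeds the NNPACK limit and must take the default path (shapes are example values):

    ```
    import torch

    conv = torch.nn.Conv2d(1, 1, kernel_size=17)  # > 16x16, so no NNPACK
    y = conv(torch.randn(1, 1, 32, 32))
    print(y.shape)  # torch.Size([1, 1, 16, 16])
    ```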
    
    Pull Request resolved: pytorch#49464
    
    Reviewed By: xuzhao9
    
    Differential Revision: D25587879
    
    Pulled By: malfet
    
    fbshipit-source-id: 658197f23c08cab97f0849213ecee3f91f96c932
    malfet authored and hwangdeyu committed Dec 23, 2020
    Commit: ce124c2
  51. T66557700 Support default argument values of a method (pytorch#48863)

    Summary:
    Pull Request resolved: pytorch#48863
    
    Support default arguments when invoking a module via PyTorch Lite (`mobile::Module`).
    
    Test Plan:
    buck test mode/dbg //caffe2/test/cpp/jit:jit -- LiteInterpreterTest.MethodInvocation
    
    buck test mode/dbg caffe2/test:mobile -- test_method_calls_with_optional_arg
    
    Reviewed By: raziel, iseeyuan
    
    Differential Revision: D25152559
    
    fbshipit-source-id: bbf52f1fbdbfbc6f8fa8b65ab524b1cd4648f9c0
    frankseide authored and hwangdeyu committed Dec 23, 2020
    Commit: 0998854
  52. [PyTorch] Merge CoinflipTLS into RecordFunctionTLS (pytorch#49359)

    Summary:
    Pull Request resolved: pytorch#49359
    
    This should be both slightly more efficient (1 less TLS guard
    check in at::shouldRunRecordFunction) and definitely more correct
    (CoinflipTLS is now saved whenever RecordFunctionTLS is saved), fixing
    a bad merge that left RecordFunctionTLS::tries_left dead.
    ghstack-source-id: 118624402
    
    Test Plan: Review, CI
    
    Reviewed By: hlu1
    
    Differential Revision: D25542799
    
    fbshipit-source-id: 310f9fd157101f659cea13c331b2a0ee6db2db88
    swolchok authored and hwangdeyu committed Dec 23, 2020
    Commit: c971a62
  53. [PyTorch] Avoid extra Tensor refcounting in _cat_out_cpu (pytorch#49364)

    Summary:
    Pull Request resolved: pytorch#49364
    
    We had a local `Tensor` when we only needed a `const Tensor&`.
    ghstack-source-id: 118624595
    
    Test Plan: Internal benchmark.
    
    Reviewed By: hlu1
    
    Differential Revision: D25544731
    
    fbshipit-source-id: 7b9656d0371ab65a6313cb0ad4aa1df707884c1c
    swolchok authored and hwangdeyu committed Dec 23, 2020
    Commit: 4df68b3
  54. [PyTorch] Use .sizes() instead of .size() in _cat_out_cpu (pytorch#49368)
    
    Summary:
    Pull Request resolved: pytorch#49368
    
    The former is faster because it doesn't allow negative indexing (which we don't use).
    ghstack-source-id: 118624598
    
    Test Plan: internal benchmark
    
    Reviewed By: hlu1
    
    Differential Revision: D25545777
    
    fbshipit-source-id: b2714fac95c801fd735fac25b238b4a79b012993
    swolchok authored and hwangdeyu committed Dec 23, 2020
    Commit: bff610b
  55. [PyTorch] Use .sizes() instead of .size() in cat_serial_kernel_impl (pytorch#49371)
    
    Summary:
    Pull Request resolved: pytorch#49371
    
    As with the previous diff, .sizes() is strictly more efficient.
    ghstack-source-id: 118627223
    
    Test Plan: internal benchmark
    
    Differential Revision: D25546409
    
    fbshipit-source-id: 196034716b6e11efda1ec8cb1e0fce7732d73eb4
    swolchok authored and hwangdeyu committed Dec 23, 2020
    Commit: 51e4cc9
  56. [PyTorch] Make tls_local_dispatch_key_set inlineable (reapply) (pytorch#49412)
    
    Summary:
    Pull Request resolved: pytorch#49412
    
    FLAGS_disable_variable_dispatch had to go, but it looks like the only user was some benchmarks anyway.
    ghstack-source-id: 118669590
    
    Test Plan:
    Small (on the order of 0.1%) improvement on internal benchmarks. Wait for
    GitHub CI since this was reverted before due to a CI break
    
    Reviewed By: ezyang
    
    Differential Revision: D25547962
    
    fbshipit-source-id: 58424b1da230fdc5d27349af762126a5512fce43
    swolchok authored and hwangdeyu committed Dec 23, 2020
    Commit: e70d3f0
  57. BFloat16: add explicit dtype support for to_mkldnn and to_dense (pytorch#48881)
    
    Summary: Pull Request resolved: pytorch#48881
    
    Test Plan: Imported from OSS
    
    Reviewed By: ngimel
    
    Differential Revision: D25537190
    
    Pulled By: VitalyFedyunin
    
    fbshipit-source-id: a61a433c638e2e95576f88f081b64ff171b2316e
    XiaobingSuper authored and hwangdeyu committed Dec 23, 2020
    Commit: fb4da16
  58. Introduce tools.codegen.api.translate (pytorch#49122)

    Summary:
    Pull Request resolved: pytorch#49122
    
    cpparguments_exprs has induced a lot of head scratching in many recent PRs over how to structure the code well. This PR replaces the old algorithm with an entirely new algorithm inspired by logic programming. The net result is shorter, cleaner and should be more robust to future changes.
    
    This PR is a bit of a whopper.  Here is the order to review it.
    
    - tools/codegen/api/types.py
      - Deleted CppArgument, CppArgumentPackIface (and subclasses), CppExpr, DispatcherExpr, DispatcherArgument, NativeExpr, NativeArgument, MetaArgument. All things previously called XArgument are now Binding. All things previously called XExpr are now Expr. I deleted the `__str__` implementation on Binding and fixed all call sites not to use it. On Binding, I renamed `str_no_default` and `str_default` to `defn` and `decl` for better symmetry with the corresponding signature concepts, although I'm open to naming them back to their original versions.
      - Obviously, things are less type safe without the class distinctions. So I introduce a new ADT called CType. CType represents the *semantic C++ type* of a binding: it is both the C++ type (e.g., `const Tensor&`) as well as the argument name that specifies what the  binding denotes (e.g., `other`). Every binding now records its CType. The key observation here is that you don't actually care if a given expression is from the cpp or dispatcher or native API; what you care is having enough information to know what the expression means, so you can use it appropriately. CType has this information. For the most part, ArgNames are just the string names of the arguments as you see them in JIT schema, but there is one case (`possibly_redundant_memory_format`) where we encode a little extra information. Unlike the plain strings we previously used to represent C++ types, CType have a little bit of structure around optional and references, because the translation code needs to work around these concepts.
      - I took the opportunity to kill all of the private fields like `_arguments` and `_returns_type` (since the argument types don't make sense anymore). Everything is computed for you on the fly. If this is a perf problem in codegen we can start using `cached_property` decorator.
      - All of the heavy lifting in CppSignature.argument_packs has been moved to the cpp module. We'll head over there next. Similarly, all of the exprs methods are now calling translate, the new functionality which we haven't gotten to yet
    - tools/codegen/api/cpp.py
       - We refactor all of the type computation functions to return CType instead of str. Because CTypes need to know the denotation, there is a new `binds: ArgName` argument to most functions that provides the denotation, so we can slot it in. (An alternative would have been to construct CTypes without denotations and then fill them in post-facto, but I didn't do it this way. One downside is there are some places where I need a CType without denotation, so I fill these in with `__placeholder__` whenever this happens).
      - `argument` and `arguments` are now extremely simple. There is no more Pack business, just produce one or more Bindings. The one thing of note is that when both a `memory_format` and `options` are in scope, we label the memory format as `possibly_redundant_memory_format`. This will be used in translation
    - tools/codegen/api/dispatcher.py and tools/codegen/api/native.py - same deal as cpp.py. One thing is that `cpparguments_exprs` is deleted; that is in the translator
    - tools/codegen/api/translate.py - the translator! It uses a very simple backwards deduction engine to work out how to fill in the arguments of functions. There are comments in the file that explain how it works.
    - Everything else: just some small call site tweaks for places when I changed API.
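    To make the new scheme concrete, here is a minimal sketch of the Binding/CType/translate idea. The class and function names below are simplified stand-ins for illustration, not the actual tools/codegen code:
    
    ```python
    from dataclasses import dataclass
    from typing import List
    
    @dataclass(frozen=True)
    class CType:
        cpp_type: str  # the C++ type, e.g. "const Tensor&"
        name: str      # the denotation, e.g. "other"
    
    @dataclass(frozen=True)
    class Binding:
        ctype: CType
        def decl(self) -> str:
            return f"{self.ctype.cpp_type} {self.ctype.name}"
    
    @dataclass(frozen=True)
    class Expr:
        expr: str     # C++ expression text
        ctype: CType  # what that expression means
    
    def translate(bindings: List[Binding], goals: List[CType]) -> List[Expr]:
        # Backwards deduction: for each goal CType, find a binding in scope
        # whose semantic type matches, and emit the expression producing it.
        env = {b.ctype: Expr(b.ctype.name, b.ctype) for b in bindings}
        return [env[g] for g in goals]  # KeyError if a goal is underivable
    
    ctx = [Binding(CType("const Tensor&", "self")), Binding(CType("const Tensor&", "other"))]
    print([e.expr for e in translate(ctx, [CType("const Tensor&", "other")])])  # ['other']
    ```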
    
    Signed-off-by: Edward Z. Yang <ezyang@fb.com>
    
    Test Plan: Imported from OSS
    
    Reviewed By: ljk53
    
    Differential Revision: D25455887
    
    Pulled By: ezyang
    
    fbshipit-source-id: 90dc58d420d4cc49281aa8647987c69f3ed42fa6
    ezyang authored and hwangdeyu committed Dec 23, 2020
    15bc45f
  59. Revert D25569586: stft: Change require_complex warning to an error

    Test Plan: revert-hammer
    
    Differential Revision:
    D25569586 (pytorch@5874925)
    
    Original commit changeset: 09608088f540
    
    fbshipit-source-id: 6a5953b327a4a2465b046e29bb007a0c5f4cf14a
    Mike Ruberry authored and hwangdeyu committed Dec 23, 2020
    c482c5d
  60. [NNC] Dont inline outputs buffers on cpu (pytorch#49488)

    Summary:
    In pytorch#48967 we enabled output buffer inlining, which results in duplicate computation if one output depends on another. This was done to fix correctness for CUDA, but it is not needed for correctness on CPU and results in a perf slowdown there.
    
    The output buffer inlining solution for CUDA is intended to be an interim solution because it does not work with reductions.
    
    Pull Request resolved: pytorch#49488
    
    Reviewed By: ezyang
    
    Differential Revision: D25596071
    
    Pulled By: eellison
    
    fbshipit-source-id: bc3d987645da5ce3c603b4abac3586b169656cfd
    Elias Ellison authored and hwangdeyu committed Dec 23, 2020
    fb0a942
  61. Prevent accidentally writing old style ops (pytorch#49510)

    Summary:
    Pull Request resolved: pytorch#49510
    
    Adding old style operators with out arguments will break XLA. This prevents that. See for background: https://fb.workplace.com/groups/pytorch.dev/permalink/809934446251704/
    
    This is a temporary change that will prevent this breakage for the next couple of days until the problem is resolved for good. It will then be deleted in pytorch#49164.
    ghstack-source-id: 118756437
    
    (Note: this ignores all push blocking failures!)
    
    Test Plan: waitforsandcastle
    
    Reviewed By: bhosmer
    
    Differential Revision: D25599112
    
    fbshipit-source-id: 6b0ca4da4b55da8aab9d1b332cd9f68e7602301e
    smessmer authored and hwangdeyu committed Dec 23, 2020
    c694e7d
  62. .circleci: Only downgrade if we have conda (pytorch#49519)

    Summary:
    Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
    
    Fixes #{issue number}
    
    Pull Request resolved: pytorch#49519
    
    Reviewed By: robieta
    
    Differential Revision: D25603779
    
    Pulled By: seemethere
    
    fbshipit-source-id: ca8d811925762a5a413ca906d94c974a4ac5b132
    seemethere authored and hwangdeyu committed Dec 23, 2020
    b39b6cb
  63. Fix bad error message when int overflow (pytorch#48250)

    Summary:
    Fixes pytorch#48114
    
    Before:
    ```
    >>> torch.empty(2 * 10 ** 20)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: empty(): argument 'size' must be tuple of ints, but found element of type int at pos 1
    ```
    
    After fix:
    ```
    >>> torch.empty(2 * 10 ** 20)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    RuntimeError: Overflow when unpacking long
    ```
    
    Unclear whether we need a separate test for this case; I can add one if it's necessary...
    
    Pull Request resolved: pytorch#48250
    
    Reviewed By: linbinyu
    
    Differential Revision: D25105217
    
    Pulled By: ezyang
    
    fbshipit-source-id: a5aa7c0266945c8125210a2fd34ce4b6ba940c92
    Kiyosora authored and hwangdeyu committed Dec 23, 2020
    3be7381
  64. Relax the atol/rtol of layernorm math kernel test. (pytorch#49507)

    Summary: Pull Request resolved: pytorch#49507
    
    Test Plan: Imported from OSS
    
    Reviewed By: mruberry
    
    Differential Revision: D25598424
    
    Pulled By: ailzhang
    
    fbshipit-source-id: b3f43e84f177cf7c14831b0b83a399b155c813c4
    Ailing Zhang authored and hwangdeyu committed Dec 23, 2020
    12c9616
  65. Fix CUDA extension ninja build (pytorch#49344)

    Summary:
    I am submitting this PR on behalf of Janne Hellsten (nurpax) from NVIDIA, for the convenience of the CLA. Thanks Janne a lot for the contribution!
    
    Currently, the ninja build decides whether to rebuild a .cu file more or less at random. There are actually two issues:
    
    First, the arch list in the build command is ordered randomly; when the order changes, ninja rebuilds unconditionally, regardless of timestamps.
    
    Second, header files are not included in the dependency list, so if a header changes, ninja may not rebuild.
    
    This PR fixes both issues. The fix for the second issue requires nvcc >= 10.2. nvcc < 10.2 can still build CUDA extensions as before, but it will not see changes in header files.
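    The gist of the first fix is making the arch flag list deterministic before it is baked into the ninja command line. A minimal sketch of the idea, assuming a `TORCH_CUDA_ARCH_LIST`-style input (the helper name and default list are illustrative, not the actual cpp_extension code):
    
    ```python
    import os
    
    def deterministic_arch_flags():
        # Sort and de-duplicate the arch list so the nvcc command line is
        # stable across runs; otherwise ninja sees a "changed" command and
        # rebuilds unconditionally.
        archs = os.environ.get("TORCH_CUDA_ARCH_LIST", "6.1;7.5").split(";")
        flags = []
        for arch in sorted({a.strip() for a in archs if a.strip()}):
            num = arch.replace(".", "")
            flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        return flags
    
    print(deterministic_arch_flags())
    ```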
    
    Pull Request resolved: pytorch#49344
    
    Reviewed By: glaringlee
    
    Differential Revision: D25540157
    
    Pulled By: ezyang
    
    fbshipit-source-id: 197541690d7f25e3ac5ebe3188beb1f131a4c51f
    zasdfgbnm authored and hwangdeyu committed Dec 23, 2020
    2aa0817
  66. [extensions] fix is_ninja_available during cuda extension building (p…

    …ytorch#49443)
    
    Summary:
    tl;dr: the current version of `is_ninja_available` in `torch/utils/cpp_extension.py` fails to run under recent incarnations of pip with the new build isolation feature, which is now the default. This PR fixes that.
    
    The full story follows:
    
    --------------------------
    
    Currently, trying to build https://github.com/facebookresearch/fairscale/ (which builds CUDA extensions) fails with recent pip versions. The build fails inside `is_ninja_available`, which runs `ninja --version` in a subprocess with stdout redirected to /dev/null; that redirection seems to break under new pip versions. Currently I have `pip==20.3.3`. Recent pip performs build isolation, which first fetches all dependencies to somewhere under /tmp/pip-install-xyz and then builds the package.
    
    If I build:
    
    ```
    pip install fairscale --no-build-isolation
    ```
    everything works.
    
    When building normally (i.e. without `--no-build-isolation`), the failure is a very long trace:
    <details>
    <summary>Full log</summary>
    <pre>
    pip install fairscale
    Collecting fairscale
      Downloading fairscale-0.1.1.tar.gz (83 kB)
         |████████████████████████████████| 83 kB 562 kB/s
      Installing build dependencies ... done
      Getting requirements to build wheel ... error
      ERROR: Command errored out with exit status 1:
       command: /home/stas/anaconda3/envs/main-38/bin/python /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmpjvw00c7v
           cwd: /tmp/pip-install-1wq9f8fp/fairscale_347f218384a64f24b8d5ce846641213e
      Complete output (55 lines):
      running egg_info
      writing fairscale.egg-info/PKG-INFO
      writing dependency_links to fairscale.egg-info/dependency_links.txt
      writing requirements to fairscale.egg-info/requires.txt
      writing top-level names to fairscale.egg-info/top_level.txt
      Traceback (most recent call last):
        File "/home/stas/anaconda3/envs/main-38/bin/ninja", line 5, in <module>
          from ninja import ninja
      ModuleNotFoundError: No module named 'ninja'
      Traceback (most recent call last):
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py", line 280, in <module>
          main()
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py", line 263, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py", line 114, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 149, in get_requires_for_build_wheel
          return self._get_build_requires(
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 130, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 145, in run_setup
          exec(compile(code, __file__, 'exec'), locals())
        File "setup.py", line 56, in <module>
          setuptools.setup(
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 298, in run
          self.find_sources()
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 305, in find_sources
          mm.run()
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 536, in run
          self.add_defaults()
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 572, in add_defaults
          sdist.add_defaults(self)
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/sdist.py", line 228, in add_defaults
          self._add_defaults_ext()
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/sdist.py", line 311, in _add_defaults_ext
          build_ext = self.get_finalized_command('build_ext')
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/cmd.py", line 298, in get_finalized_command
          cmd_obj = self.distribution.get_command_obj(command, create)
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 858, in get_command_obj
          cmd_obj = self.command_obj[command] = klass(self)
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 351, in __init__
          if not is_ninja_available():
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1310, in is_ninja_available
          subprocess.check_call('ninja --version'.split(), stdout=devnull)
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/subprocess.py", line 364, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['ninja', '--version']' returned non-zero exit status 1.
      ----------------------------------------
    ERROR: Command errored out with exit status 1: /home/stas/anaconda3/envs/main-38/bin/python /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmpjvw00c7v Check the logs for full command output.
    </pre>
    
    </details>
    
    and the middle of it is what we want:
    
    ```
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 351, in __init__
          if not is_ninja_available():
        File "/tmp/pip-build-env-a5x2icen/overlay/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1310, in is_ninja_available
          subprocess.check_call('ninja --version'.split(), stdout=devnull)
        File "/home/stas/anaconda3/envs/main-38/lib/python3.8/subprocess.py", line 364, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['ninja', '--version']' returned non-zero exit status 1.
    ```
    
    For some reason pytorch fails to run this simple code:
    
    ```
    # torch/utils/cpp_extension.py
    def is_ninja_available():
        r'''
        Returns ``True`` if the `ninja <https://ninja-build.org/>`_ build system is
        available on the system, ``False`` otherwise.
        '''
        with open(os.devnull, 'wb') as devnull:
            try:
                subprocess.check_call('ninja --version'.split(), stdout=devnull)
            except OSError:
                return False
            else:
                return True
    ```
    
    I suspect that pip does something to `os.devnull` and that's why it fails.
    
    This PR proposes a simpler code which doesn't rely on anything but `subprocess.check_output`:
    
    ```
    def is_ninja_available():
        r'''
        Returns ``True`` if the `ninja <https://ninja-build.org/>`_ build system is
        available on the system, ``False`` otherwise.
        '''
        try:
            subprocess.check_output('ninja --version'.split())
        except Exception:
            return False
        else:
            return True
    ```
    
    which doesn't use `os.devnull` and performs the same function. There could be a whole bunch of different exceptions there, I think, so I went for the generic one - we don't care why it failed, since this function's only purpose is to suggest whether ninja can be used or not.
    
    Let's check
    
    ```
    python -c "import torch.utils.cpp_extension; print(torch.utils.cpp_extension.is_ninja_available())"
    True
    ```
    
    Look ma - no std noise to take care of. (i.e. no need for /dev/null).
    
    I was editing the installed environment-wide `cpp_extension.py` file directly, so I didn't need to tweak `PYTHONPATH` - I made sure to replace `'ninja --version'` with something that should fail, and I did get `False` for the above command line.
    
    I next did a somewhat elaborate cheat to re-package an already existing binary wheel with this corrected version of `cpp_extension.py`, rather than building from source:
    ```
    mkdir /tmp/pytorch-local-channel
    cd /tmp/pytorch-local-channel
    
    # get the latest nightly wheel
    wget https://download.pytorch.org/whl/nightly/cu110/torch-1.8.0.dev20201215%2Bcu110-cp38-cp38-linux_x86_64.whl
    
    # unpack it
    unzip torch-1.8.0.dev20201215+cu110-cp38-cp38-linux_x86_64.whl
    
    # edit torch/utils/cpp_extension.py to fix the python code with the new version as in this PR
    emacs torch/utils/cpp_extension.py &
    
    # pack the files back
    zip -r torch-1.8.0.dev20201215+cu110-cp38-cp38-linux_x86_64.whl caffe2 torch torch-1.8.0.dev20201215+cu110.dist-info
    ```
    
    Now I tell pip to use my local channel, plus `--pre` for it to pick up the pre-release as an acceptable wheel
    ```
    # install using this local channel
    git clone https://github.com/facebookresearch/fairscale/
    cd fairscale
    pip install -v --disable-pip-version-check -e . -f file:///tmp/pytorch-local-channel --pre
    ```
    and voilà, everything works.
    
    ```
    [...]
    Successfully installed fairscale
    ```
    
    I noticed a whole bunch of ninja-not-found errors in the log. I think this is the same problem in other parts of the build system packages, which use this same old check (copied across various projects and build tools) that recent pip breaks.
    
    ```
        writing manifest file '/tmp/pip-modern-metadata-_nsdesbq/fairscale.egg-info/SOURCES.txt'
        Traceback (most recent call last):
          File "/home/stas/anaconda3/envs/main-38/bin/ninja", line 5, in <module>
            from ninja import ninja
        ModuleNotFoundError: No module named 'ninja'
        [...]
        /tmp/pip-build-env-fqflyevr/overlay/lib/python3.8/site-packages/torch/utils/cpp_extension.py:364: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
          warnings.warn(msg.format('we could not find ninja.'))
    ```
    
    but these don't prevent the build from completing and installing.
    
    I suppose these need to be identified and reported to various other projects, but that's another story.
    
    I think the new pip does something to `os.devnull` which breaks any code relying on it - I haven't tried to figure out what happens to that stream object, but this PR, which removes its usage, solves the problem.
    
    Also do notice that:
    
    ```
    git clone https://github.com/facebookresearch/fairscale/
    cd fairscale
    python setup.py bdist_wheel
    pip install dist/fairscale-0.1.1-cp38-cp38-linux_x86_64.whl
    ```
    works too. So it is really a pip issue.
    
    Apologies if the notes are too many; I tried to give the complete picture, and other projects will probably need these details as well.
    
    Thank you for reading.
    
    Pull Request resolved: pytorch#49443
    
    Reviewed By: mruberry
    
    Differential Revision: D25592109
    
    Pulled By: ezyang
    
    fbshipit-source-id: bfce4420c28b614ead48e9686f4153c6e0fbe8b7
    stas00 authored and hwangdeyu committed Dec 23, 2020
    dc052aa
  67. [NNC] Add Support For is_nan (pytorch#48973)

    Summary: Pull Request resolved: pytorch#48973
    
    Test Plan: Imported from OSS
    
    Reviewed By: bertmaher
    
    Differential Revision: D25413166
    
    Pulled By: eellison
    
    fbshipit-source-id: 0c79258345df18c60a862373fa16931228fb92ef
    Elias Ellison authored and hwangdeyu committed Dec 23, 2020
    6362b78
  68. [NNC] add support for masked_fill (pytorch#48974)

    Summary: Pull Request resolved: pytorch#48974
    
    Test Plan: Imported from OSS
    
    Reviewed By: bertmaher
    
    Differential Revision: D25413165
    
    Pulled By: eellison
    
    fbshipit-source-id: 8cece1dc3692389be90c0d77bd71b103254d5ad3
    Elias Ellison authored and hwangdeyu committed Dec 23, 2020
    a0d6342
  69. Add fusion support of aten::to (pytorch#48976)

    Summary: Pull Request resolved: pytorch#48976
    
    Test Plan: Imported from OSS
    
    Reviewed By: ZolotukhinM
    
    Differential Revision: D25413164
    
    Pulled By: eellison
    
    fbshipit-source-id: 0c31787e8b5e1368b0cba6e23660799b652389cd
    Elias Ellison authored and hwangdeyu committed Dec 23, 2020
    08fd21f
  70. eager quant: remove fake_quant after add/mul nodes during QAT (pytorc…

    …h#49213)
    
    Summary:
    Pull Request resolved: pytorch#49213
    
    Changes the behavior of Eager mode quantization to remove observation after add_scalar/mul_scalar nodes. This observation is not used, and removing it eliminates one difference between Eager and FX modes.
    
    Test Plan:
    ```
    python test/test_quantization.py TestQuantizeFxOps.test_quantized_add_qat
    python test/test_quantization.py TestQuantizeFxOps.test_quantized_mul_qat
    python test/test_quantization.py TestQuantizationAwareTraining.test_add_scalar_uses_input_qparams
    python test/test_quantization.py TestQuantizationAwareTraining.test_mul_scalar_uses_input_qparams
    ```
    
    Imported from OSS
    
    Reviewed By: jerryzh168
    
    Differential Revision: D25486276
    
    fbshipit-source-id: 34a5d6ce0d08739319ec0f8b197cfc1309d71040
    vkuzo authored and hwangdeyu committed Dec 23, 2020
    5ac65cb
  71. fx quant: move {input|output}_quantized_idxs cfg from convert to prep…

    …are (pytorch#49238)
    
    Summary:
    Pull Request resolved: pytorch#49238
    
    Moves the `input_quantized_idxs` and `output_quantized_idxs` options from the convert config to the prepare config. This is done because these options affect observer placement, which changes numerics during QAT.
    
    The next PR will adjust the behavior of `input_quantized_idxs` in
    prepare in QAT to prevent placing a fake_quant at the input if the
    input is marked quantized.  Placing a fake_quant there can lead to
    numerical inaccuracies during calibration, as it would start with
    scale=1 and zp=0, which may be different from the quantization
    parameters of the incoming quantized input.
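    As a rough usage sketch (assuming the FX graph mode quantization API of this era; the exact config plumbing may differ between versions), the option now lives in the prepare-time custom config:
    
    ```python
    import torch
    from torch.quantization import get_default_qconfig
    from torch.quantization.quantize_fx import prepare_fx
    
    class M(torch.nn.Module):
        def forward(self, x):
            return x + 1
    
    m = M().eval()
    qconfig_dict = {"": get_default_qconfig("fbgemm")}
    # Marking input 0 as already quantized at prepare time means no observer
    # (or fake_quant, in QAT) is placed at that graph input.
    prepared = prepare_fx(
        m,
        qconfig_dict,
        prepare_custom_config_dict={"input_quantized_idxs": [0]},
    )
    ```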
    
    Test Plan:
    ```
    python test/test_quantization.py TestQuantizeFx
    ```
    
    Imported from OSS
    
    Reviewed By: jerryzh168
    
    Differential Revision: D25498762
    
    fbshipit-source-id: 17ace8f803542155652b310e5539e1882ebaadc6
    vkuzo authored and hwangdeyu committed Dec 23, 2020
    6c5a43d
  72. fx quant: do not insert observers at quantized inputs (pytorch#49239)

    Summary:
    Pull Request resolved: pytorch#49239
    
    Context: the existing implementation of `input_quantized_idxs` is convert-only. Therefore, observers are inserted between the input and the first quantized node. This is a problem during QAT, because the initial input is a fake_quant, and it starts with scale=1 and zp=0. This does not match the quantization parameters of the graph input, which can lead to incorrect numerics.
    
    Fix: do not insert observer for a quantized input.
    
    Test Plan:
    ```
    python test/test_quantization.py TestQuantizeFx
    ```
    
    Imported from OSS
    
    Reviewed By: jerryzh168
    
    Differential Revision: D25499486
    
    fbshipit-source-id: 303b49cc9d95a9fd06fef3b0859c08be34e19d8a
    vkuzo authored and hwangdeyu committed Dec 23, 2020
    f7a7355
  73. fx quant: fix fq when input is quantized and node does not need fq (p…

    …ytorch#49382)
    
    Summary:
    Pull Request resolved: pytorch#49382
    
    Fixes an edge case. If the input to the graph is quantized and the first node does not need activation observation, this makes sure that no observer is inserted.
    
    Test Plan:
    ```
    python test/test_quantization.py TestQuantizeFxOps.test_int8_input_no_unnecessary_fq
    ```
    
    Imported from OSS
    
    Reviewed By: jerryzh168
    
    Differential Revision: D25551041
    
    fbshipit-source-id: a6cba235c63ca7f6856e4128af7c1dc7fa0085ea
    vkuzo authored and hwangdeyu committed Dec 23, 2020
    f604f1b
  74. fx quant: make sure observer is inserted before a quantized output (p…

    …ytorch#49420)
    
    Summary:
    Pull Request resolved: pytorch#49420
    
    Before: if an output was marked as quantized, it could end up not actually being quantized if the previous node was not quantized.
    
    After: if an output was marked as quantized, it will be quantized
    regardless of the quantization status of the previous node.
    
    Test Plan:
    ```
    python test/test_quantization.py TestQuantizeFxOps.test_quant_output_always_observed
    ```
    
    Imported from OSS
    
    Reviewed By: jerryzh168
    
    Differential Revision: D25566834
    
    fbshipit-source-id: 84755a1605fd3847edd03a7887ab9f635498c05c
    vkuzo authored and hwangdeyu committed Dec 23, 2020
    b7a36d0
  75. add files to SLOW_TESTS for target determinator (pytorch#49500)

    Summary:
    - test_torch was split into 6 files in pytorch#47356.
    - also, test_linalg has 10 slowtest markings.
    
    Pull Request resolved: pytorch#49500
    
    Reviewed By: ezyang, malfet
    
    Differential Revision: D25598085
    
    Pulled By: walterddr
    
    fbshipit-source-id: 74b0b433897721db86c00e236d1dd925d7a6d3d0
    Rong Rong (AI Infra) authored and hwangdeyu committed Dec 23, 2020
    1aa640b
  76. [reland] Support torch.distributed.irecv(src=None, ...) (pytorch#49383)

    Summary:
    Pull Request resolved: pytorch#49383
    
    Reland of pytorch#47137
    ghstack-source-id: 118735407
    
    Test Plan: waitforbuildbot
    
    Reviewed By: osalpekar
    
    Differential Revision: D25551910
    
    fbshipit-source-id: 2e1f2f77e7c69204056dfe6ed178e8ad7650ab32
    pritamdamania authored and hwangdeyu committed Dec 23, 2020
    5aed6b3
  77. Set caffe2::pthreadpool() size in ParallelOpenMP (pytorch#45566)

    Summary:
    Addresses pytorch#45418.
    
    This is probably not the best solution, but it's a rebase of the solution we're considering until pytorch#45418 is solved. If you can outline a better one I'm willing to implement it (:
    
    Pull Request resolved: pytorch#45566
    
    Reviewed By: ezyang
    
    Differential Revision: D24621568
    
    Pulled By: glaringlee
    
    fbshipit-source-id: 89dad5c61d8b5c26984d401551a1fe29df1ead04
    dbalchev authored and hwangdeyu committed Dec 23, 2020
    46971a5
  78. Add torch._foreach_zero_ API (pytorch#47286)

    Summary:
    **In this PR**
    - Add `_foreach_zero_` API
    - Update all optimizers under /_multi_tensor/ to use `_foreach_zero_` in the `zero_grad` method
    
    Performance improvement (OP: zero_):
    - for-loop: 630.36 us
    - foreach: 90.84 us
    
    Benchmark script:
    
    ```
    import torch
    import torch.optim as optim
    import torch.nn as nn
    import torchvision
    import torch.utils.benchmark as benchmark_utils
    
    inputs = [torch.rand(3, 200, 200, device="cuda") for _ in range(100)]
    
    def main():
        for op in [
                "zero_"
            ]:
            print("\n\n----------------- OP: ", op, " -----------------")
            stmt = "[torch.{op}(t) for t in inputs]"
            timer = benchmark_utils.Timer(
                stmt=stmt.format(op = op),
                globals=globals(),
                label="str(optimizer)",
            )
            print(f"autorange:\n{timer.blocked_autorange()}\n\n")
    
            stmt = "torch._foreach_{op}(inputs)"
            timer_mta = benchmark_utils.Timer(
                stmt=stmt.format(op = op),
                globals=globals(),
                label="str(optimizer_mta)",
            )
            print(f"autorange:\n{timer_mta.blocked_autorange()}\n\n")
    
    if __name__ == "__main__":
        main()
    
    ```
    **TODO**
    - Refactor zero_grad once foreach APIs are stable.
    
    **Tested** via unit tests
    
    Pull Request resolved: pytorch#47286
    
    Reviewed By: ngimel
    
    Differential Revision: D24706240
    
    Pulled By: izdeby
    
    fbshipit-source-id: aac69d6d134d65126ae8e5916f3627b73d8a94bf
    Iurii Zdebskyi authored and hwangdeyu committed Dec 23, 2020
    6a59ef2
  79. Bring back math_silu_backward which works for all backends. (pytorch#…

    …49439)
    
    Summary: Pull Request resolved: pytorch#49439
    
    Test Plan: Imported from OSS
    
    Reviewed By: nikithamalgifb, ngimel
    
    Differential Revision: D25594129
    
    Pulled By: ailzhang
    
    fbshipit-source-id: 627bbea9ba478ee3a8edcc6695abab6431900192
    Ailing Zhang authored and hwangdeyu committed Dec 23, 2020
    3b1186d
  80. [quant][be] Add typing for quantization_mappings.py (pytorch#49179)

    Summary: Pull Request resolved: pytorch#49179
    
    Test Plan: Imported from OSS
    
    Reviewed By: vkuzo, wat3rBro
    
    Differential Revision: D25470520
    
    fbshipit-source-id: 16e35fec9a5f3339860bd2305ae8ffdd8e2dfaf7
    jerryzh168 authored and hwangdeyu committed Dec 23, 2020
    99ba415
  81. Add BFloat16 support for isinf and isfinite (pytorch#49356)

    Summary:
    Also fix some tests.
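    For example, after this change the following works on bfloat16 tensors:
    
    ```python
    import torch
    
    x = torch.tensor([1.0, float("inf"), float("nan")], dtype=torch.bfloat16)
    print(torch.isinf(x))     # tensor([False,  True, False])
    print(torch.isfinite(x))  # tensor([ True, False, False])
    ```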
    
    Pull Request resolved: pytorch#49356
    
    Reviewed By: mruberry
    
    Differential Revision: D25604364
    
    Pulled By: ngimel
    
    fbshipit-source-id: 9efdd83aaa96cacc66e9689db9f9d8c24175a693
    zasdfgbnm authored and hwangdeyu committed Dec 23, 2020
    5494a81
  82. Change aten::native_layer_norm signature to match torch.layer_norm de…

    …finition (pytorch#48971)
    
    Summary:
    This PR changes the `aten::native_layer_norm` and `aten::native_layer_norm_backward` signatures to match the `torch.layer_norm` definition. The current definition doesn't give the PyTorch JIT enough information to fuse layer_norm during training.
    
    `native_layer_norm(X, gamma, beta, M, N, eps)` =>
    `native_layer_norm(input, normalized_shape, weight, bias, eps)`
    
    `native_layer_norm_backward(dY, X, mean, rstd, gamma, M, N, grad_input_mask)` =>
    `native_layer_norm_backward(dY, input, normalized_shape, mean, rstd, weight, bias, grad_input_mask)`
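    For reference, the new native signature mirrors the Python-level call; a minimal sketch of the corresponding `torch.nn.functional.layer_norm` invocation:
    
    ```python
    import torch
    import torch.nn.functional as F
    
    x = torch.randn(20, 5, 10)
    normalized_shape = (5, 10)            # trailing dims to normalize over
    weight = torch.ones(normalized_shape)
    bias = torch.zeros(normalized_shape)
    y = F.layer_norm(x, normalized_shape, weight, bias, eps=1e-5)
    ```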
    
    Pull Request resolved: pytorch#48971
    
    Reviewed By: izdeby
    
    Differential Revision: D25574070
    
    Pulled By: ngimel
    
    fbshipit-source-id: 23e2804295a95bda3f1ca6b41a1e4c5a3d4d31b4
    rdspring1 authored and hwangdeyu committed Dec 23, 2020
    276e68e
  83. Adding fix for invalid annotation types for dictionary (pytorch#49425)

    Summary:
    Fixes pytorch#49362
    
    **Summary:**
    This PR fixes the issue where invalid annotation types are used for a dictionary.
    An assertion message flagging the unsupported type is generated for all invalid annotations.
    
    **Test Case**:
    python test/test_jit.py TestJit.test_dict_invalid_annotations
    
    Pull Request resolved: pytorch#49425
    
    Reviewed By: navahgar
    
    Differential Revision: D25601578
    
    Pulled By: nikithamalgifb
    
    fbshipit-source-id: 91633e3d0891bdcb5402f044a74d02fe352ecd6f
    nikithamalgifb authored and hwangdeyu committed Dec 23, 2020
    54636e1
  84. [pt] fuse ClipRangesGatherSigridHash (pytorch#49181)

    Summary:
    Pull Request resolved: pytorch#49181
    
    Fuse ClipRangesGatherSigridHash
    
    Test Plan:
    ```
    MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adindexer/merge/traced_merge_dper_fixes.pt --pt_inputs=/data/users/ansha/tmp/adindexer/merge/container_precomputation_bs1.pt --iters=30000 --warmup_iters=10000  --num_threads=1 --pred_net=/data/users/ansha/tmp/adindexer/precomputation_merge_net.pb --c2_inputs=/data/users/ansha/tmp/adindexer/merge/c2_inputs_precomputation_bs1.pb --c2_sigrid_transforms_opt=1 --c2_use_memonger=1 --c2_weights=/data/users/ansha/tmp/adindexer/merge/c2_weights_precomputation.pb --pt_enable_static_runtime --pt_cleanup_activations=true --pt_enable_out_variant=true --do_profile --compare_results
    ```
    
    Verify op fused:
    Node #3: 0.00104917 ms/iter, %173 : Tensor, %174 : Tensor = fb::clip_ranges_gather_sigrid_hash_offsets(%75, %76, %39, %40, %41, %38, %26)
    
    Before: 0.0919786
    After: 0.0911792
    
    Reviewed By: hlu1
    
    Differential Revision: D25468225
    
    fbshipit-source-id: 36bd91c140eaa57cb42cdaad46d878b94f162a9d
    ajyu authored and hwangdeyu committed Dec 23, 2020
    c18bc82
  85. Revert D25574962: [pytorch][PR] Updated derivative rules for complex …

    …svd and pinverse
    
    Test Plan: revert-hammer
    
    Differential Revision:
    D25574962 (pytorch@9955355)
    
    Original commit changeset: 832b61303e88
    
    fbshipit-source-id: d73f77f3e51b0f535dad6d21c5bebf8d41a6bfbd
    Mike Ruberry authored and hwangdeyu committed Dec 23, 2020
    2e3adbd
  86. Remove set_quantizer_ from native_functions.yaml (pytorch#49463)

    Summary:
    Pull Request resolved: pytorch#49463
    
    set_quantizer_ takes a ConstQuantizerPtr argument, which is neither supported by JIT nor by c10.
    Also, it doesn't get dispatched (CPU and CUDA have the same implementation) and it is excluded from python bindings generation.
    So there is no real reason for this to be in native_functions.yaml.
    
    Removing it unblocks the migration to c10-fullness since this is an op that would have been hard to migrate. See https://fb.quip.com/QRtJAin66lPN
    ghstack-source-id: 118710663
    
    Test Plan: waitforsandcastle
    
    Reviewed By: ezyang
    
    Differential Revision: D25587763
    
    fbshipit-source-id: 8fab921f4c256c128d48d82dac731f04ec9bad92
    smessmer authored and hwangdeyu committed Dec 23, 2020
    0a2ba5d
  87. [C2] Revive unsafe CoalesceOp (pytorch#49402)

    Summary:
    Pull Request resolved: pytorch#49402
    
    For NCCLAllReduce operations there can be non-trivial overhead for launching cooperative kernels (especially in the case of async execution of different parts of the model). This diff revives the operator to make it possible to fuse multiple operations into a single kernel.
    
    Test Plan:
    Unit-test.
    Used in a later diff.
    
    Reviewed By: xianjiec
    
    Differential Revision: D25531206
    
    fbshipit-source-id: 64b1c161233a726f9e2868f1059316e42a8ea1fc
    kennyhorror authored and hwangdeyu committed Dec 23, 2020
    87a4bc5
  88. [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily `arc lint --ta…

    …ke CLANGFORMAT`
    
    Reviewed By: zertosh
    
    Differential Revision: D25609974
    
    fbshipit-source-id: 4db8f8100336a2f0f2af8bc7b960d3711a5d1d7d
    generatedunixname89002005325676 authored and hwangdeyu committed Dec 23, 2020
    9df6183
  89. PyLong_{As/From}{Long/UnsignedLong} lint checks (pytorch#49280)

    Summary:
    Fixes pytorch#45581
    
    Pull Request resolved: pytorch#49280
    
    Reviewed By: mruberry
    
    Differential Revision: D25592330
    
    Pulled By: ezyang
    
    fbshipit-source-id: 5c16d6aed88ad1feaa7f129b4cd44c0561be2de2
    peterjc123 authored and hwangdeyu committed Dec 23, 2020
    728a912
  90. [reland][quant][docs] Add fx graph mode quantization to quantization …

    …docs (pytorch#49211) (pytorch#49515)
    
    Summary: Pull Request resolved: pytorch#49515
    
    Test Plan:
    Imported from OSS
    
    Imported from OSS
    
    Reviewed By: vkuzo
    
    Differential Revision: D25601061
    
    fbshipit-source-id: 74e917d57895e9b4131a01fdcea8df3e94322bec
    jerryzh168 authored and hwangdeyu committed Dec 23, 2020
    b8c8d33
  91. Refactor RPC matchBuiltInOp to get rid of exception swallowing (pytor…

    …ch#49009)
    
    Summary:
    Pull Request resolved: pytorch#49009
    
    As per the title, we should generally not have exception swallowing, and this commit makes it so that if there is a true error in JIT operator resolution, it is propagated back to the RPC callee and we don't silently swallow any other exceptions that may happen. Swallowing the exceptions previously resulted in hard-to-debug issues, such as unexpected ops showing up in the profiler, and flaky tests which were fixed by pytorch#41287.
    
    Added a unittest that validates the error that comes from `jit/pybind_utils.h`.
    ghstack-source-id: 118794661
    
    Test Plan: CI
    
    Reviewed By: mrshenli
    
    Differential Revision: D25392905
    
    fbshipit-source-id: 6f93251635740bcf902824548b2bc6f9249be5f0
    rohan-varma authored and hwangdeyu committed Dec 23, 2020
    2853fa3
  92. Revert D25105217: [pytorch][PR] Fix bad error message when int overflow

    Test Plan: revert-hammer
    
    Differential Revision:
    D25105217 (pytorch@c675727)
    
    Original commit changeset: a5aa7c026694
    
    fbshipit-source-id: ddb4c93f9317e1747def8842a8072c84776cd487
    ezyang authored and hwangdeyu committed Dec 23, 2020
    0567619
  93. Set is_non_overlapping_and_dense_ flag in OpaqueTensorImpl constructor (

    pytorch#49470)
    
    Summary:
    Pull Request resolved: pytorch#49470
    
    pytorch#48625 changed the default contiguity settings for `TensorImpl`, causing the Vulkan backend to crash. Therefore, add an argument to the `OpaqueTensorImpl` constructor that can set `is_non_overlapping_and_dense_` back to false.
    
    Test Plan: Imported from OSS
    
    Reviewed By: AshkanAliabadi
    
    Differential Revision: D25592826
    
    Pulled By: SS-JIA
    
    fbshipit-source-id: e5d9de9a733875cb00c0546a3bc3271e5c6e23a3
    SS-JIA authored and hwangdeyu committed Dec 23, 2020
    83f6ad5
  94. Test distributed collectives profiling with Gloo on GPU (pytorch#49072)

    Summary:
    Pull Request resolved: pytorch#49072
    
    As per the title, we should enable these tests for Gloo when run on GPU with the profiler enabled via `use_cuda=True`. Enabling the ProcessGroupNCCL profiling tests to work with `use_cuda=True` is tracked in pytorch#48987.
    ghstack-source-id: 118789003
    
    Test Plan: CI
    
    Reviewed By: mrshenli
    
    Differential Revision: D25388986
    
    fbshipit-source-id: 664d922ac2e10c77299daebdc6d3c92bb70eb56e
    rohan-varma authored and hwangdeyu committed Dec 23, 2020
    0deecfc
  95. Revert D25152559: T66557700 Support default argument values of a method

    Test Plan: revert-hammer
    
    Differential Revision:
    D25152559 (pytorch@6bde0ca)
    
    Original commit changeset: bbf52f1fbdbf
    
    fbshipit-source-id: 592fdb3078b1ac86cd394adc6c1bfd6b10d829e1
    iseeyuan authored and hwangdeyu committed Dec 23, 2020
    8d6bce8
  96. [te] Add fast log approximation based on sleef

    Summary:
    This is a fast log implementation.
    
    benchmark:
    ```
    buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
    ```
    
    Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat
    
    Reviewed By: bertmaher
    
    Differential Revision: D25445815
    
    fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
    bwasti authored and hwangdeyu committed Dec 23, 2020
    e8b6219
  97. [quant][eagermode][fix] Fix quantization for DeQuantStub (pytorch#49428)

    Summary:
    Pull Request resolved: pytorch#49428
    
    Previously, DeQuantStub would be swapped with nn.quantized.DeQuantize regardless of qconfig. The reason is that we skipped attaching a qconfig to DeQuantStub to avoid adding a fake quantize module to it, but the correct fix is to skip it in insert observers. This PR fixes the issue.
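    For context, a minimal eager-mode example of where DeQuantStub sits (a sketch; the module layout is illustrative):
    
    ```python
    import torch
    
    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = torch.quantization.QuantStub()
            self.conv = torch.nn.Conv2d(1, 1, 1)
            self.dequant = torch.quantization.DeQuantStub()
    
        def forward(self, x):
            return self.dequant(self.conv(self.quant(x)))
    
    m = M().eval()
    m.qconfig = torch.quantization.get_default_qconfig("fbgemm")
    torch.quantization.prepare(m, inplace=True)
    m(torch.randn(1, 1, 2, 2))                   # calibrate
    torch.quantization.convert(m, inplace=True)  # the DeQuantStub swap now respects qconfig
    ```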
    
    Test Plan: Imported from OSS
    
    Reviewed By: vkuzo
    
    Differential Revision: D25569991
    
    fbshipit-source-id: d44a08c6e64c7a49509687dc389b57de1cbb878c
    jerryzh168 authored and hwangdeyu committed Dec 23, 2020
    14bb5d0
  98. .github: Add action workflow to update S3 HTMLS (pytorch#49509)

    Summary:
    Successful run: https://github.com/pytorch/pytorch/runs/1572315901
    
    Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
    
    Pull Request resolved: pytorch#49509
    
    Reviewed By: walterddr
    
    Differential Revision: D25619133
    
    Pulled By: seemethere
    
    fbshipit-source-id: 092ab12535f3bf4fc85bbfc690d3f5b10a5f8791
    seemethere authored and hwangdeyu committed Dec 23, 2020
    cfd0951
  99. [FileStore] Implemented numKeys and Added Tests (pytorch#49556)

    Summary:
    Pull Request resolved: pytorch#49556
    
    Implemented the missing Store functionality (specifically numKeys) in the FileStore.
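    A minimal usage sketch of the new API (the file path here is illustrative):
    
    ```python
    import torch.distributed as dist
    
    store = dist.FileStore("/tmp/filestore_example", 1)  # (file path, world size)
    store.set("first_key", "first_value")
    print(store.num_keys())  # counts the keys written to the underlying file
    ```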
    
    Test Plan: Added both C++ and Python tests to verify functionality.
    
    Reviewed By: jiayisuse
    
    Differential Revision: D25619001
    
    fbshipit-source-id: 9146d0da9e0903622be3035880f619bbb2cc3891
    osalpekar authored and hwangdeyu committed Dec 23, 2020
    1c90741
  100. [FileStore] Updating Docs to Reflect FileStore changes (pytorch#49557)

    Summary:
    Pull Request resolved: pytorch#49557
    
    Updating the PyTorch docs to reflect that FileStore now supports the
    num_keys API. Also included a note describing the behavior of the API.
    
    Test Plan: built and rendered the docs.
    
    Reviewed By: jiayisuse
    
    Differential Revision: D25619000
    
    fbshipit-source-id: 6c660d7ceb32d1d61024df8394aff3fcd0b752c1
    osalpekar authored and hwangdeyu committed Dec 23, 2020
    9611cf3
  101. Revert D25445815: [te] Add fast log approximation based on sleef

    Test Plan: revert-hammer
    
    Differential Revision:
    D25445815 (pytorch@1329066)
    
    Original commit changeset: 20696eacd12a
    
    fbshipit-source-id: 38830a6abd16260d60e5dd9a5594e65736a9c782
    ezyang authored and hwangdeyu committed Dec 23, 2020
    ddddf93
  102. Add dict comprehension (pytorch#47774)

    Summary: Pull Request resolved: pytorch#47774
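    A small sketch of the newly supported construct in TorchScript:
    
    ```python
    from typing import Dict, List
    
    import torch
    
    @torch.jit.script
    def square_map(xs: List[int]) -> Dict[int, int]:
        # dict comprehensions now compile under TorchScript
        return {x: x * x for x in xs}
    
    print(square_map([1, 2, 3]))  # {1: 1, 2: 4, 3: 9}
    ```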
    
    Test Plan: Imported from OSS
    
    Reviewed By: pbelevich
    
    Differential Revision: D25615464
    
    Pulled By: ansley
    
    fbshipit-source-id: 10bba6f70e812fa580cbbbf097e93de7142484cc
    Ansley Ussery authored and hwangdeyu committed Dec 23, 2020
    0e10eb7
  103. Revert D25547962: [PyTorch] Make tls_local_dispatch_key_set inlineabl…

    …e (reapply)
    
    Test Plan: revert-hammer
    
    Differential Revision:
    D25547962 (pytorch@6f928a4)
    
    Original commit changeset: 58424b1da230
    
    fbshipit-source-id: 10ff9f45f6587f67e1c88886f977930b4f7e396a
    Mike Ruberry authored and hwangdeyu committed Dec 23, 2020
    4e1b7d2
  104. Revert D25546409: [PyTorch] Use .sizes() isntead of .size() in cat_se…

    …rial_kernel_impl
    
    Test Plan: revert-hammer
    
    Differential Revision:
    D25546409 (pytorch@953f992)
    
    Original commit changeset: 196034716b6e
    
    fbshipit-source-id: 0e80f06a98c2842d2f11db7057ffcdcaea85f3bf
    Mike Ruberry authored and hwangdeyu committed Dec 23, 2020
    7c49006
  105. Revert D25545777: [PyTorch] Use .sizes() instead of .size() in _cat_o…

    …ut_cpu
    
    Test Plan: revert-hammer
    
    Differential Revision:
    D25545777 (pytorch@c1879b5)
    
    Original commit changeset: b2714fac95c8
    
    fbshipit-source-id: f534f8fc312943f1e6ba3d4029d6cf69b006aca8
    Mike Ruberry authored and hwangdeyu committed Dec 23, 2020
    917cdeb
  106. Revert D25544731: [PyTorch] Avoid extra Tensor refcounting in _cat_ou…

    …t_cpu
    
    Test Plan: revert-hammer
    
    Differential Revision:
    D25544731 (pytorch@1a05104)
    
    Original commit changeset: 7b9656d0371a
    
    fbshipit-source-id: 0f7ea74eca282cadf269bbd284d59650a431ed65
    Mike Ruberry authored and hwangdeyu committed Dec 23, 2020
    c04718b
  107. Revert D25542799: [PyTorch] Merge CoinflipTLS into RecordFunctionTLS

    Test Plan: revert-hammer
    
    Differential Revision:
    D25542799 (pytorch@9ce1df0)
    
    Original commit changeset: 310f9fd15710
    
    fbshipit-source-id: 51777914422a560e94430a786c86f5de4007a00b
    Mike Ruberry authored and hwangdeyu committed Dec 23, 2020
    0ec5fb3
  108. [te][reapply] Add fast log approximation based on sleef (pytorch#49575)

    Summary:
    Pull Request resolved: pytorch#49575
    
    This is a fast log implementation.
    
    benchmark:
    
    ```
    buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
    ```
    
    Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat
    
    Reviewed By: bertmaher
    
    Differential Revision: D25627157
    
    fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9
    bwasti authored and hwangdeyu committed Dec 23, 2020
    309d517
  109. [ddp launch] solve zombie problem (pytorch#49305)

    Summary:
    I was exhausted by having to hunt down zombies when working with the ddp launcher, so this PR solves the various zombie issues.
    
    This PR addresses 2 distinct zombie scenarios caused by ddp launch.py:
    
    1. When the main process is killed, the child processes aren't killed and continue running
    2. When any of the child processes dies (e.g. OOM), the rest of the children and the parent keep running, but are really stuck
    
    To solve these problems this PR switches from `wait` to `poll` and uses signal handlers.
    
    The main problem with `wait()` was that it's not async: I had the 2nd process OOM, and the code was stuck waiting for the first process to finish, which will never happen since the first process is now blocked waiting for the 2nd process - a sort of deadlock. My 2nd card is smaller than the first one, so it occasionally OOMs.
    
    Using `asyncio` would probably be the cleanest solution, but as it's relatively new in python, perhaps polling is good enough.
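    The shape of the poll-based approach, heavily simplified (a sketch of the pattern, not the exact launch.py code):
    
    ```python
    import signal
    import subprocess
    import sys
    import time
    
    processes = []  # subprocess.Popen objects for the workers
    
    def sigkill_handler(signum, frame):
        for p in processes:
            print(f"Killing subprocess {p.pid}")
            p.kill()
        sys.exit(1)
    
    # Scenario 1: parent is killed -> handler kills all children before exiting.
    signal.signal(signal.SIGINT, sigkill_handler)
    signal.signal(signal.SIGTERM, sigkill_handler)
    
    # Scenario 2: a child dies (e.g. OOM) -> the non-blocking poll() notices,
    # and we take down the whole group instead of deadlocking in wait().
    alive = list(processes)
    while alive:
        for p in list(alive):
            if p.poll() is not None:  # returns None while still running
                if p.returncode != 0:
                    sigkill_handler(signal.SIGTERM, None)
                alive.remove(p)
        time.sleep(1)
    ```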
    
    I wrote this little script to reproduce the 2 problematic scenarios and a normal run; it does 3 different things according to the `--mode` arg:
    
    - `oom` - causes the 2nd process to exit prematurely, emulating OOM
    - `clean-finish` - just exit normally in both processes
    - `False` (no arg) - just keep on running, emulating multiple normally running processes
    
    ```
    # oom.py
    import argparse
    from time import sleep
    import sys
    
    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--local_rank", default=False, type=int)
        parser.add_argument("--mode", default=False, type=str)
        args, _ = parser.parse_known_args()
    
        print(f"{args.local_rank} is starting")
        sleep(3)
    
        if args.mode == "oom":
            # emulate OOM in 2nd card
            if args.local_rank == 1:
                raise RuntimeError("OOM")
    
        if args.mode == "clean-finish":
            sleep(1)
            print(f"{args.local_rank} is cleanly finishing")
            sys.exit(0)
    
        while (True):
            # emulate long running process
            print(f"{args.local_rank} is running")
            sleep(1)
    
    if __name__ == "__main__":
        main()
    ```
    
    Let's begin:
    
    ###  1. Normal execution
    
    ```
    python -m torch.distributed.launch --nproc_per_node=2 ./oom.py --mode=clean-finish
    ```
    
    All the processes exit upon completion - I won't bother pasting the log here - just testing that my code didn't break normal execution.
    
    ### 2. OOM
    
    ```
    python -m torch.distributed.launch --nproc_per_node=2 ./oom.py --mode=oom
    ```
    
    ```
    POLLING FOR 17547
    POLLING FOR 17548
    0
    0 is starting
    1
    1 is starting
    POLLING FOR 17547
    POLLING FOR 17548
    POLLING FOR 17548
    POLLING FOR 17547
    POLLING FOR 17547
    POLLING FOR 17548
    0 is running
    Traceback (most recent call last):
      File "./oom.py", line 33, in <module>
        main()
      File "./oom.py", line 20, in main
        raise RuntimeError("OOM")
    RuntimeError: OOM
    POLLING FOR 17548
    process 17548 is no more
    Killing subprocess 17547
    Killing subprocess 17548
    Traceback (most recent call last):
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 341, in <module>
        main()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 327, in main
        sigkill_handler(signal.SIGTERM, None) # not coming back
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    subprocess.CalledProcessError: Command '['/home/stas/anaconda3/envs/main-38/bin/python', '-u', './oom.py', '--local_rank=1', '--mode=oom']' returned non-zero exit status 1.
    ```
    
    All processes exited and the trace was printed
    
    ### 3. Exit on SIGINT/SIGTERM
    
    If I start a process and then realize I made a mistake, I want to be able to kill it cleanly, and if any sub-processes have already been spawned, I want them killed too. Here the signal handler takes care of trapping SIGTERM/SIGINT.
    
    ```
    python -m torch.distributed.launch --nproc_per_node=2 ./oom.py
    ```
    
    Here the processes emulate a long normal run.
    
    So let's Ctrl-C the process as soon as it started and see:
    
    ```
    POLLING FOR 18749
    POLLING FOR 18750
    0
    0 is starting
    1
    1 is starting
    POLLING FOR 18749
    POLLING FOR 18750
    POLLING FOR 18750
    POLLING FOR 18749
    POLLING FOR 18749
    POLLING FOR 18750
    0 is running
    1 is running
    POLLING FOR 18750
    POLLING FOR 18749
    0 is running
    1 is running
    ^CTraceback (most recent call last):
    Killing subprocess 18749
    Traceback (most recent call last):
      File "./oom.py", line 33, in <module>
      File "./oom.py", line 33, in <module>
    Killing subprocess 18750
    Parent got kill signal=SIGINT, exiting
    ```
    
    all processes got killed
    
    --------------------------------
    
    So this covered the 2 problematic cases and 1 normal case
    
    Notes:
    - we could probably switch to `sleep(3)` - `1` is probably too fast
    - all the debug prints will be removed once you are happy - I left them so that it's easier for you to test that my PR does the right thing.
    
    Thank you!
    
    Pull Request resolved: pytorch#49305
    
    Reviewed By: izdeby
    
    Differential Revision: D25565617
    
    Pulled By: rohan-varma
    
    fbshipit-source-id: 1ea864113f283d4daac5eef1131c8d745aae4c99
    stas00 authored and hwangdeyu committed Dec 23, 2020
    723010e
  110. Add more list peephole idioms (pytorch#48268)

    Summary: Pull Request resolved: pytorch#48268
    
    Test Plan: Imported from OSS
    
    Reviewed By: jamesr66a
    
    Differential Revision: D25104617
    
    Pulled By: eellison
    
    fbshipit-source-id: b41c03d5da6e9b88acf21a859f61c5c70608c150
    Elias Ellison authored and hwangdeyu committed Dec 23, 2020
    4c9c61e
  111. disable concat nested namespace check (pytorch#49571)

    Summary:
    Pull Request resolved: pytorch#49571
    
    Disable the nested namespace check, since the OSS standard is
    ```
    set(CMAKE_CXX_STANDARD 14)
    ```
    and it's currently causing confusion in clang-tidy internally, e.g. D25214452.
    
    Test Plan: clang-tidy
    
    Reviewed By: xuzhao9
    
    Differential Revision: D25626392
    
    fbshipit-source-id: 1fb472c89ebe9b83718ae27f2c1d77b8b2412b5e
    Rong Rong (AI Infra) authored and hwangdeyu committed Dec 23, 2020
    29e296d
  112. Add type inference for dequantization.tensors (pytorch#49517)

    Summary:
    Pull Request resolved: pytorch#49517
    
    We should add concrete type info for Tensor List case as well.
    
    Test Plan: ci
    
    Reviewed By: qizzzh
    
    Differential Revision: D25599223
    
    fbshipit-source-id: 3614e9ec25fc963a8d6a0bd641735fcca6c87032
    houseroad authored and hwangdeyu committed Dec 23, 2020
  113. FLOPS Roofline Analysis Feature for PyTorch Profiler. (pytorch#46506)

    Summary:
    FLOPs Roofline Analysis Feature for PyTorch Profiler.
    
    Currently, PyTorch Profiler lacks the ability to measure the FLOPs of operators, such as mm and conv.
    FLOPs are helpful to estimate the computation complexity of the operators.
    For now, we use input shapes to estimate the number of floating point operations.
    In the future, we may compute this information by tracking hardware counters.
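    
    As a sketch of the shape-based estimate (the counting convention here is inferred from the aten::mm row in the Test Plan table below):
    
    ```
    # Count one multiply-accumulate per output element per inner-dimension
    # step for a [m, k] x [k, n] matmul, then divide by CPU time for MFLOPS.
    m, k, n = 1320, 243, 243          # the aten::mm input shapes below
    flop = m * k * n                  # 77,944,680 MACs
    cpu_time_s = 79.204e-3            # "CPU total" for aten::mm below
    print(flop / cpu_time_s / 1e6)    # ~984 MFLOPS, matching the table
    ```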
    
    Pull Request resolved: pytorch#46506
    
    Test Plan:
    Run `python test/test_profiler_flops.py -k test_flops`. The test will print a profiler table with "FLOPS" column, like the following:
    ```
    ----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                   Input Shapes        MFLOPS
    ----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
                    aten::matmul         0.06%      57.653us        82.97%      79.310ms      79.310ms             1                 [[40, 33, 1, 243], [243, 243]]            --
                        aten::mm        82.84%      79.186ms        82.86%      79.204ms      79.204ms             1                      [[1320, 243], [243, 243]]       984.323
                    aten::conv2d         0.04%      36.345us        16.06%      15.347ms      15.347ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [  44065010.318
               aten::convolution         0.02%      16.016us        16.02%      15.310ms      15.310ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
              aten::_convolution         0.07%      63.855us        16.00%      15.294ms      15.294ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
        aten::mkldnn_convolution        15.89%      15.188ms        15.93%      15.225ms      15.225ms             1  [[40, 16, 18, 260], [33, 16, 18, 18], [33], [            --
                      aten::relu         0.10%      98.223us         0.64%     612.157us     306.079us             2                             [[40, 33, 1, 243]]            --
                 aten::threshold         0.49%     465.416us         0.54%     513.934us     256.967us             2                     [[40, 33, 1, 243], [], []]            --
                      aten::add_         0.29%     279.301us         0.29%     279.301us     279.301us             1                  [[40, 33, 1, 243], [243], []]            --
                     aten::empty         0.10%      99.113us         0.10%      99.113us      24.778us             4                       [[], [], [], [], [], []]            --
    ----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------  ------------
    Self CPU time total: 95.584ms
    
    .
    ----------------------------------------------------------------------
    Ran 1 test in 0.176s
    ```
    
    For now, we only provide FLOPs calculation for aten::conv2d and aten::mm operators.
    
    Reviewed By: ezyang
    
    Differential Revision: D25214452
    
    Pulled By: xuzhao9
    
    fbshipit-source-id: 0ae841bd8dbdeb032346dc3d9d38e19875aa1da3
    xuzhao9 authored and hwangdeyu committed Dec 23, 2020
  114. Disables method variant grad and grad grad checks (pytorch#49576)

    Summary:
    These are redundant with the functional variant checks and can be very costly, as some grad and gradgrad testing takes minutes to run per variant. Maybe in the future we'll add them back for operations with divergent method implementations.
    
    Pull Request resolved: pytorch#49576
    
    Reviewed By: albanD, ngimel
    
    Differential Revision: D25631691
    
    Pulled By: mruberry
    
    fbshipit-source-id: 247f750979d9dafab2454cdbfa992a2aa6da724a
    Mike Ruberry authored and hwangdeyu committed Dec 23, 2020
  115. Use store based barrier in init_process_group. (pytorch#49419)

    Summary:
    Pull Request resolved: pytorch#49419
    
    As described in pytorch#48110, the
    newly introduced `barrier()` in `init_process_group` messes up NCCL
    communicator state since it uses a bunch of default devices to perform an
    allreduce which simulates a barrier(). As a result, subsequent NCCL operations
    might not behave as expected.
    ghstack-source-id: 118861776
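    
    A minimal sketch of the store-based alternative (hypothetical key name; the real implementation adds timeouts and error handling). Each rank bumps a shared counter in the c10d store and spins until all ranks have checked in, so no collective, and hence no NCCL communicator, is involved:
    
    ```
    import time
    
    def store_based_barrier(store, world_size):
        # Each rank increments a shared counter...
        store.add("store_based_barrier_key", 1)
        # ...then waits until every rank has checked in.
        while int(store.get("store_based_barrier_key")) < world_size:
            time.sleep(0.01)
    ```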
    
    Test Plan:
    1) unit test added.
    2) waitforbuildbot
    
    Reviewed By: mrshenli
    
    Differential Revision: D25566550
    
    fbshipit-source-id: ab083b67b634d7c515f4945deb228f959b27c936
    pritamdamania authored and hwangdeyu committed Dec 23, 2020
  116. Fix CustomAutogradTest.ReentrantPriority rerun failures (pytorch#49581)

    Summary:
    Clear the static variable at the end of the test to ensure the test passes after re-runs
    
    Pull Request resolved: pytorch#49581
    
    Test Plan:
    `./bin/test_api "--gtest_filter=CustomAutogradTest.ReentrantPriority" --gtest_repeat=50`
    Before the change all subsequent runs of the test failed with
    ```
    ../test/cpp/api/autograd.cpp:681: Failure
    Expected equality of these values:
      order.size()
        Which is: 310
      10
    ```
    
    Reviewed By: mrshenli
    
    Differential Revision: D25632374
    
    Pulled By: malfet
    
    fbshipit-source-id: 4814d22b5dff15e1b38a0187e51070771fd58370
    malfet authored and hwangdeyu committed Dec 23, 2020
  117. Set USE_KINETO=1 (pytorch#49201)

    Summary:
    Pull Request resolved: pytorch#49201
    
    This unblocks kineto profiler for 1.8 release.
    This PR supercedes pytorch#48391
    Note: this will somewhat increase the size of Linux server binaries, because
    we add libkineto.a and libcupti_static.a:
    -rw-r--r-- 1 jenkins jenkins 1107502 Dec 10 21:16 build/lib/libkineto.a
    -rw-r--r-- 1 root root 13699658 Nov 13  2019 /usr/local/cuda/lib64/libcupti_static.a
    
    Test Plan:
    CI
    pytorch#48391
    
    Imported from OSS
    
    Reviewed By: ngimel
    
    Differential Revision: D25480770
    
    fbshipit-source-id: 037cd774f5547d9918d6055ef5cc952a54e48e4c
    Ilia Cherniavskii authored and hwangdeyu committed Dec 23, 2020
  118. Revert D25480770: Set USE_KINETO=1

    Test Plan: revert-hammer
    
    Differential Revision:
    D25480770 (pytorch@1a92802)
    
    Original commit changeset: 037cd774f554
    
    fbshipit-source-id: 6a6062195033ca91fcc0cfa1e890e47efc774ac1
    Ilia Cherniavskii authored and hwangdeyu committed Dec 23, 2020
  119. Support integral types for kAbs in SimpleIREvaluator (pytorch#49357)

    Summary:
    Pull Request resolved: pytorch#49357
    
    This is a follow-up fix for PR pytorch#48679, where the previous PR
    added support for integer inputs to aten::abs by promoting integers to
    float and then demoting the result back to integers. This PR supports
    integer inputs to aten::abs more efficiently in the SimpleIREvaluator
    by implementing integer inputs for kAbs (renamed from kFabs).
    - Rename kFabs to kAbs
    - Add support for integer input to kAbs in SimpleIREvaluator (note that
    llvm_codegen and cuda_codegen already support integer inputs to kAbs)
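    
    From the Python side, the visible behavior is simply that abs on an integral tensor stays integral; this change lets the interpreter compute it directly instead of via a float round-trip:
    
    ```
    import torch
    
    x = torch.tensor([-3, -1, 2])      # int64
    print(torch.abs(x))                # tensor([3, 1, 2]), still int64
    ```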
    
    Test Plan:
    - `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1 python test/test_jit_fuser_te.py
    TestTEFuser.test_unary_ops`
    - `python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops`
    
    Imported from OSS
    
    Reviewed By: eellison
    
    Differential Revision: D25545791
    
    fbshipit-source-id: e52f51a352d149f66ce8341fb3beb479be08a230
    Peng Wu authored and hwangdeyu committed Dec 23, 2020
  120. Add op bench for caffe2 quantile op (pytorch#49598)

    Summary:
    Pull Request resolved: pytorch#49598
    
    Add op bench for caffe2 quantile op
    
    Test Plan: `buck run mode/opt caffe2/benchmarks/operator_benchmark/c2:quantile_op_test -- --warmup_iterations=10000  --iterations=10000`
    
    Reviewed By: radkris-git
    
    Differential Revision: D25590085
    
    fbshipit-source-id: 0db58ac87c595b2bf2958f6299a1bf2ccea019db
    ShijunK authored and hwangdeyu committed Dec 23, 2020
  121. add checkout PR tip step for quick checks (pytorch#49590)

    Summary: Pull Request resolved: pytorch#49590
    
    Reviewed By: samestep
    
    Differential Revision: D25633341
    
    Pulled By: walterddr
    
    fbshipit-source-id: 6e8db1f628f562d7632390bdb7788437cb1bf63d
    Rong Rong (AI Infra) authored and hwangdeyu committed Dec 23, 2020
  122. Refactor VmapPhysicalView::newLogicalToPhysical (pytorch#49482)

    Summary:
    Pull Request resolved: pytorch#49482
    
    Motivation
    ==========
    Batching rules always invoke newLogicalToPhysical at the very end to turn
    a physical tensor into a logical BatchedTensor (an example is below):
    ```
    Tensor select_backward_batching_rule(const Tensor& grad, IntArrayRef input_sizes, int64_t dim, int64_t index) {
      auto grad_physical = MultiBatchVmapTransform::logicalToPhysical(grad);
      auto grad_input = at::zeros(grad_physical.getPhysicalShape(input_sizes), grad.options());
      auto physical_dim = getGradInputPhysicalDim(dim, input_sizes, grad_physical.numBatchDims());
      grad_input.select(physical_dim, index).copy_(grad_physical.tensor());
      return grad_physical.newLogicalFromPhysical(grad_input);
    }
    ```
    However, albanD noted that this function is confusing and ambiguous
    because it's unclear which physical tensor is being turned into a logical one
    (in this case, grad_physical is a VmapPhysicalView, but we're really transforming
    grad_input and returning it).
    pytorch#44505 (comment)
    
    I didn't want to make too many changes to the batching rule API because
    I think we'll change it even more in the future, but this PR attempts to
    remove the ambiguity by applying one of the suggestions in
    pytorch#44505 (comment)
    
    This PR
    =======
    
    The diagnosis of the problem is that we were conflating
    "VmapPhysicalView", which maps logical attributes on a Tensor (like
    dimension and shape) to physical attributes, with the reverse
    physical-to-logical map. This PR creates a new VmapPhysicalToLogicalMap
    object that handles the latter.
    
    Instead of calling `grad_physical.newLogicalFromPhysical(grad_input)`,
    an author of batching rules should now retrieve the VmapPhysicalToLogicalMap
    object and apply it to their physical input. So the above code becomes:
    ```
    grad_physical.getPhysicalToLogicalMap().apply(grad_input)
    ```
    
    I've also moved VmapPhysicalView::makeLogicalFromPhysicalListInplace
    to VmapPhysicalToLogicalMap::applyInplace.
    
    Test Plan
    =========
    wait for tests
    
    Test Plan: Imported from OSS
    
    Reviewed By: mrshenli
    
    Differential Revision: D25592645
    
    Pulled By: zou3519
    
    fbshipit-source-id: 9c6ede9901ec6b70e5763193064658a8f91e6d48
    zou3519 authored and hwangdeyu committed Dec 23, 2020
  123. fixed the first line of torch.rst to match the __init__.py file's first line (pytorch#49584)
    
    Summary:
    Changed the first line of the torch.rst file to match that of the __init__.py file
    
    Fixes pytorch#49228
    
    Pull Request resolved: pytorch#49584
    
    Reviewed By: VitalyFedyunin
    
    Differential Revision: D25639260
    
    Pulled By: mrshenli
    
    fbshipit-source-id: a0bafd945ff92115eed932662feedc46d29dfaab
    jonykarki authored and hwangdeyu committed Dec 23, 2020
  124. Fix Module backward hooks for all Tensor inputs/outputs (pytorch#46163)

    Summary:
    Fixes pytorch#598
    
    This is BC-breaking, as we now explicitly don't call the hook when there are no Tensors at the top level of the output.
    This feature was not working anyway, as the returned grad_input/grad_output were wrong (not respecting the output structure, and wrong inputs for multi-Node Modules).
    
    This is also BC-breaking as we now report the correct gradients for `nn.Module`s that contain multiple autograd `Node`s, while we used to return bad results before.
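    
    For illustration, a toy example of the hook signature involved (not code from this PR): grad_input/grad_output correspond to the Tensors at the top level of the module's inputs/outputs.
    
    ```
    import torch
    
    def hook(module, grad_input, grad_output):
        # grad_output: gradients w.r.t. the module's Tensor outputs
        # grad_input: gradients w.r.t. its Tensor inputs
        print([None if g is None else g.shape for g in grad_input])
    
    lin = torch.nn.Linear(4, 2)
    lin.register_backward_hook(hook)
    lin(torch.randn(3, 4, requires_grad=True)).sum().backward()
    ```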
    
    Pull Request resolved: pytorch#46163
    
    Reviewed By: ailzhang, mruberry
    
    Differential Revision: D24894180
    
    Pulled By: albanD
    
    fbshipit-source-id: e1b5d193d2818eb2f51e2a2722c7405c8bd13c2b
    albanD authored and hwangdeyu committed Dec 23, 2020
  125. Remove deadlines for Caffe2 hypothesis_test when running on GPU. (pytorch#49591)
    
    Summary:
    Pull Request resolved: pytorch#49591
    
    A bunch of these tests are marked flaky, and have been since time immemorial. (Read: as far back as Buck will build.) However, closer inspection reveals that they fail if and only if run on a GPU worker. What seems to be going on is that there are more jobs than GPUs, so the contention causes waits which register as timeouts on the test.
    
    This diff is kind of hacky, but it basically just drops deadlines if a GPU is present. Because Caffe2 is going away I'm not too terribly concerned about a beautiful solution, but we may as well keep some test coverage if it's easy.
    
    CC Sebastian, Ilia, Min, and Hongzheng who also have tasks for what seems to be the same flakiness.
    
    Test Plan: Turn the tests back on and see if they fall over. (The failure repros reliably on an OnDemand GPU and is fixed by this change, so it's not really just a hail Mary.)
    
    Reviewed By: ngimel
    
    Differential Revision: D25632981
    
    fbshipit-source-id: 43dcce416fea916ba91f891e9e5b59b2c11cca1a
    Taylor Robie authored and hwangdeyu committed Dec 23, 2020
  126. [FX] Enforce args is tuple and kwargs is dict (pytorch#49526)

    Summary: Pull Request resolved: pytorch#49526
    
    Test Plan: Imported from OSS
    
    Reviewed By: Chillee
    
    Differential Revision: D25606115
    
    Pulled By: jamesr66a
    
    fbshipit-source-id: f2a21d02a2cf8c08cbd618efc5a6a28d34806851
    James Reed authored and hwangdeyu committed Dec 23, 2020
  127. Renaming CAFFE2_API to TORCH_API (pytorch#49496)

    Summary:
    Since caffe2 and torch have been consolidated, CAFFE2_API should be merged with TORCH_API. Addresses a TODO.
    
    Manually edited some references of the removed `CAFFE2_API`:
    * `CONTRIBUTING.md`
    * `caffe2/proto/CMakeLists.txt`
    * `cmake/ProtoBuf.cmake`
    * `c10/macros/Export.h`
    * `torch/csrc/WindowsTorchApiMacro.h`
    
    Pull Request resolved: pytorch#49496
    
    Reviewed By: malfet, samestep
    
    Differential Revision: D25600726
    
    Pulled By: janeyx99
    
    fbshipit-source-id: 7e068d959e397ac183c097d7e9a9afeca5ddd782
    janeyx99 authored and hwangdeyu committed Dec 23, 2020
  128. [PyTorch Mobile] Export Operator List from Mobile CompilationUnit instead of from TorchScript Model (pytorch#49385)
    
    Summary:
    Pull Request resolved: pytorch#49385
    
    Currently, the API to export operator lists accepts a `torch::jit::Module` object, and spits out an operator list. The operator list is practically used only for mobile. This is not ideal because the set of root operators may change by the time the model is subsequently optimized and exported for mobile.
    
    What we need to do instead is glean the list of operators from the mobile model itself (`bytecode.pkl` specifically), and expose that instead.
    
    Also updated the logic in `converter`.
    
    ### Before this change:
    1. Get operator List from Torch Script Model
    2. Convert to bytecode mobile model
    
    ### After this change:
    1. Convert to bytecode mobile model
    2. Use this converted mobile model to get the list of operators for each method on the model
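    
    A rough sketch of the before/after flow in Python terms (the new bytecode.pkl-based extraction itself lives on the C++ side, so this only gestures at it):
    
    ```
    import torch
    
    class M(torch.nn.Module):
        def forward(self, x):
            return torch.relu(x) + 1
    
    m = torch.jit.script(M())
    # Before: op list derived from the TorchScript model, which may drift
    # from what the optimized mobile model actually calls.
    ops = torch.jit.export_opnames(m)
    # After: convert to the mobile format first, then read the root
    # operators out of the converted model itself.
    m._save_for_lite_interpreter("m.ptl")
    ```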
    
    ghstack-source-id: 118796752
    
    Test Plan:
    Added a unit test in `test_lite_interpreter.cpp` to ensure that all model referenced operators show up in the exported operator list. Also make `test_lite_interpreter.cpp` runnable from `xplat/caffe2/BUCK` since this is where the production code will be built from.
    
    Verified that the list of operators produced before and after this change for an example model (segmentation) are the same.
    
    {P147863234}
    
    Also verified that the operator list for the BI-Xray model is different (we have been having problems with missing operators for this one): {P154903132}
    
    Reviewed By: iseeyuan
    
    Differential Revision: D24690094
    
    fbshipit-source-id: 0426a6ef90456a811010cfe337c415882ae2deff
    dhruvbird authored and hwangdeyu committed Dec 23, 2020
  129. New profiler API (pytorch#48280)

    Summary:
    Pull Request resolved: pytorch#48280
    
    Adding new API for the kineto profiler that supports enable predicate
    function
    
    Test Plan: unit test
    
    Reviewed By: ngimel
    
    Differential Revision: D25142220
    
    Pulled By: ilia-cher
    
    fbshipit-source-id: c57fa42855895075328733d7379eaf3dc1743d14
    Ilia Cherniavskii authored and hwangdeyu committed Dec 23, 2020
  130. Adding support for bitwise augassignment operators (pytorch#44621)

    Summary:
    ========
    Fixes pytorch#42915
    
    This commit adds support for bitwise shorthands in TorchScript, i.e.: |=, &=, ^=, <<=, >>=, **=
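    
    For example, after this change a scripted function can use these shorthands directly:
    
    ```
    import torch
    
    @torch.jit.script
    def f(x: int, y: int) -> int:
        x |= y    # bitwise or
        x &= 7    # bitwise and
        x ^= 1    # bitwise xor
        x <<= 1   # left shift
        x >>= 1   # right shift
        return x
    
    print(f(4, 2))  # 7
    ```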
    
    Testing:
    ======
    This commit also adds test for the above fix in test_jit.py
    The test can be invoked by
    pytest -k augassign test/test_jit.py
    
    Here is a snapshot of the testing:
    <img width="1238" alt="image" src="https://user-images.githubusercontent.com/70345919/93105141-8f9f5300-f663-11ea-836b-3b52da6d2be5.png">
    
    Pull Request resolved: pytorch#44621
    
    Reviewed By: mrshenli
    
    Differential Revision: D23906344
    
    Pulled By: nikithamalgifb
    
    fbshipit-source-id: 4c93a7430a625f698b163609ccec15e51417d564
    nikithamalgifb authored and hwangdeyu committed Dec 23, 2020
  131. Test pipeline parallelism works with DDP. (pytorch#48470)

    Summary:
    Pull Request resolved: pytorch#48470
    
    Adding a unit test to verify this works as expected. Note that this
    doesn't work with the other checkpointing modes of the pipe; checkpoint=never
    needs to be set for it to work.
    ghstack-source-id: 118820806
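    
    A rough sketch of the combination being tested (assumes the process group and RPC framework are already initialized; not the test's exact code):
    
    ```
    import torch.nn as nn
    from torch.distributed.pipeline.sync import Pipe
    
    fc1 = nn.Linear(16, 8).cuda(0)
    fc2 = nn.Linear(8, 4).cuda(1)
    # checkpoint="never" is required for the DDP combination to work.
    model = Pipe(nn.Sequential(fc1, fc2), chunks=4, checkpoint="never")
    model = nn.parallel.DistributedDataParallel(model)
    ```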
    
    Test Plan: waitforbuildbot
    
    Reviewed By: mrshenli
    
    Differential Revision: D25182668
    
    fbshipit-source-id: 85e69e338bf388c132a303ad93e29ec2cc4a0ed8
    pritamdamania authored and hwangdeyu committed Dec 23, 2020
  132. [FX] Emit named tuple construction node when NamedTuple appears as an arg (pytorch#49553)
    
    Summary: Pull Request resolved: pytorch#49553
    
    Test Plan: Imported from OSS
    
    Reviewed By: zdevito
    
    Differential Revision: D25618577
    
    Pulled By: jamesr66a
    
    fbshipit-source-id: 042f742f9ca02e59bbceda97bfcf47f9bac07873
    James Reed authored and hwangdeyu committed Dec 23, 2020
  133. [package] implicitly extern stdlib before mocking (pytorch#49306)

    Summary:
    Pull Request resolved: pytorch#49306
    
    This allows you to mock out everything except for specific patterns while
    still correctly externing the python standard library. This makes it less
    likely that you will need to override require_module.
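    
    A sketch of the intended usage (API details may differ from the version in this PR; `my_model` is hypothetical): a broad mock pattern no longer swallows the standard library, because stdlib modules are implicitly externed first.
    
    ```
    from torch.package import PackageExporter
    
    with PackageExporter("model_package.zip") as exporter:
        # stdlib modules are implicitly externed before this pattern applies
        exporter.mock("**")
        exporter.save_pickle("model", "model.pkl", my_model)
    ```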
    
    Test Plan: Imported from OSS
    
    Reviewed By: suo
    
    Differential Revision: D25526212
    
    Pulled By: zdevito
    
    fbshipit-source-id: 7339f4c7f12af883496f79de95e57d452bb32dc2
    zdevito authored and hwangdeyu committed Dec 23, 2020
  134. Upload test times to S3 (pytorch#49190)

    Summary:
    This PR currently just modifies the `test/print_test_stats.py` script (run in the `pytorch_linux_test` job) so that now it uploads test times to the new `ossci-metrics` S3 bucket (rather than just to Scribe) if passed the `--upload-to-s3` parameter.
    
    The next step is to add an additional step to that `pytorch_linux_test` job which checks if it's being run on a PR, and if so, finds the `master` commit to compare against (similar to what's done in the now-unused `.jenkins/pytorch/short-perf-test-{c,g}pu.sh` scripts) and adds test time info to the Dr CI comment if the PR is significantly different from the base revision.
    
    Pull Request resolved: pytorch#49190
    
    Test Plan:
    An "integration test" would be to just look in [the `ossci-metrics` S3 bucket](https://s3.console.aws.amazon.com/s3/buckets/ossci-metrics) to confirm that the CI run(s) for this PR did indeed upload their test time data successfully.
    
    To test this locally, first make sure you have all the packages you need, such as these:
    ```
    $ conda install -c anaconda boto3
    $ conda install -c conda-forge unittest-xml-reporting
    ```
    Then run whatever tests you want; these are the ones I used for my local smoke test, for no particular reason:
    ```
    $ python test/test_spectral_ops.py --save-xml=/tmp/reports/spectral_ops
    ```
    Once the tests finish, run the script to upload their times to S3:
    ```
    $ CIRCLE_SHA1="$(git rev-parse HEAD)" CIRCLE_JOB=foo test/print_test_stats.py --upload-to-s3 /tmp/reports/spectral_ops
    ```
    Now check that they uploaded successfully:
    ```
    $ aws s3 cp "s3://ossci-metrics/test_time/$(git rev-parse HEAD)/foo/" /tmp/reports --recursive
    ```
    And that it's a valid `*.json.bz2` file:
    ```
    $ bzip2 -kdc /tmp/reports/*Z.json.bz2 | jq . | head -n21
    {
      "build_pr": null,
      "build_tag": null,
      "build_sha1": "e46f43621b910bc2f18dd33c08f5af18a542d5ed",
      "build_branch": null,
      "build_job": "foo",
      "build_workflow_id": null,
      "total_seconds": 0.9640000000000003,
      "suites": {
        "TestFFTCPU": {
          "total_seconds": 0.9640000000000003,
          "cases": [
            {
              "name": "test_fft_invalid_dtypes_cpu",
              "seconds": 0.022,
              "errored": false,
              "failed": false,
              "skipped": false
            },
            {
              "name": "test_istft_throws_cpu",
    ```
    
    Reviewed By: walterddr, malfet
    
    Differential Revision: D25618035
    
    Pulled By: samestep
    
    fbshipit-source-id: 4d8013859a38a49e5bba700c5134951ca1a9d8b7
    samestep authored and hwangdeyu committed Dec 23, 2020
  135. Cleanup APIs for pipeline parallelism. (pytorch#48630)

    Summary:
    Pull Request resolved: pytorch#48630
    
    1) Make torch.distributed.pipeline package public.
    2) Make several helper methods private.
    ghstack-source-id: 118820803
    
    Test Plan: waitforbuildbot
    
    Reviewed By: rohan-varma
    
    Differential Revision: D25235688
    
    fbshipit-source-id: c32833ebf090ddbd4eaf06fcb5e3f9d421623a60
    pritamdamania authored and hwangdeyu committed Dec 23, 2020
  136. [torchscript] Fix constant propagation schemas (pytorch#49605)

    Summary: Pull Request resolved: pytorch#49605
    
    Test Plan: Imported from OSS
    
    Reviewed By: eellison
    
    Differential Revision: D25643157
    
    Pulled By: IvanKobzarev
    
    fbshipit-source-id: c5440622f6cf559afadca853e1eb7a9fbb8edf7f
    IvanKobzarev authored and hwangdeyu committed Dec 23, 2020
  137. Add sinc operator (pytorch#48740)

    Summary:
    Implements the sinc operator.
    See https://numpy.org/doc/stable/reference/generated/numpy.sinc.html
    
    ![image](https://user-images.githubusercontent.com/13428986/101653855-cdffa080-3a0d-11eb-8426-ecc81c152ebd.png)
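    
    This computes the normalized sinc, sin(pi*x)/(pi*x), with the removable singularity at zero filled in as 1:
    
    ```
    import torch
    
    x = torch.tensor([0.0, 0.5, 1.0, 2.0])
    print(torch.sinc(x))  # ~ tensor([1.0000, 0.6366, 0.0000, 0.0000])
    ```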
    
    Pull Request resolved: pytorch#48740
    
    Reviewed By: ezyang
    
    Differential Revision: D25597565
    
    Pulled By: soulitzer
    
    fbshipit-source-id: 6dbcf282ee4eba34930bc9e5c85c0c5e79cf0322
    soulitzer authored and hwangdeyu committed Dec 23, 2020
  138. Output stacks (support for SVG visualization) (pytorch#48438)

    Summary:
    Pull Request resolved: pytorch#48438
    
    Outputting stacks in a format suitable for SVG visualization
    (e.g. with https://github.com/brendangregg/FlameGraph tool)
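    
    A sketch of the flow (model and file path are placeholders):
    
    ```
    import torch
    
    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
    inp = torch.randn(8, 64)
    with torch.autograd.profiler.profile(with_stack=True) as prof:
        model(inp)
    prof.export_stacks("/tmp/profiler_stacks.txt", "self_cpu_time_total")
    # then e.g.: flamegraph.pl /tmp/profiler_stacks.txt > perf.svg
    ```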
    
    Test Plan:
    python test/test_profiler.py -k test_export_stacks
    
    e.g. resnet18 (note: actual SVG is interactive):
    
    <img width="1193" alt="Screen Shot 2020-11-24 at 7 06 27 PM" src="https://user-images.githubusercontent.com/30845429/100178160-397f3500-2e88-11eb-81c4-34b19c5fcb87.png">
    
    Reviewed By: dzhulgakov
    
    Differential Revision: D25174270
    
    Pulled By: ilia-cher
    
    fbshipit-source-id: 6b60084071b209441805c468f5ff777318e42d1a
    Ilia Cherniavskii authored and hwangdeyu committed Dec 23, 2020
  139. torch.reciprocal: promote integer inputs to float (pytorch#49102)

    Summary:
    Fixes pytorch#49091
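    
    For example, integral inputs now produce a floating point result:
    
    ```
    import torch
    
    x = torch.tensor([1, 2, 4])        # int64
    print(torch.reciprocal(x))         # tensor([1.0000, 0.5000, 0.2500])
    ```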
    
    Pull Request resolved: pytorch#49102
    
    Reviewed By: VitalyFedyunin
    
    Differential Revision: D25639541
    
    Pulled By: soulitzer
    
    fbshipit-source-id: 1dd360bd7b77f106d606143d8d3961610bac8cb7
    soulitzer authored and hwangdeyu committed Dec 23, 2020
  140. [NNC] Disable masked fill (pytorch#49622)

    Summary:
    There's a bug internally; disable as a quick fix before investigating.
    
    Pull Request resolved: pytorch#49622
    
    Test Plan:
    Imported from GitHub, without a `Test Plan:` line.
    build
    
    Reviewed By: zheng-xq, PursueHappinessDirectly
    
    Differential Revision: D25651897
    
    Pulled By: eellison
    
    fbshipit-source-id: dd1454f2ef7506d7844016128aa6320d7e69aa6e
    Elias Ellison authored and hwangdeyu committed Dec 23, 2020
  141. [Issue pytorch#46210] added torch.fx.len() to provide support for len(); added a test case for torch.fx.len() (pytorch#49532)
    
    Summary: Pull Request resolved: pytorch#49532
    
    Test Plan: Imported from OSS
    
    Reviewed By: jamesr66a
    
    Differential Revision: D25608804
    
    Pulled By: huiguoo
    
    fbshipit-source-id: 93ac02ab57db5d200d92443062286c34782ec0ef
    huiguoo authored and hwangdeyu committed Dec 23, 2020
  142. Inline coverage report combining/reporting (pytorch#49615)

    Summary:
    Instead of calling the coverage frontend, import the coverage module and call combine() and html_report() directly
    
    Fixes pytorch#49596 by not using a strict mode when combining those reports
    
    Pull Request resolved: pytorch#49615
    
    Reviewed By: seemethere
    
    Differential Revision: D25645196
    
    Pulled By: malfet
    
    fbshipit-source-id: be55b5c23a3569a331cbdf3f86d8c89bc27d5fe1
    malfet authored and hwangdeyu committed Dec 23, 2020
  143. [Gradient Compression] Implement the original layerwise PowerSGD (pytorch#49417)
    
    Summary:
    Pull Request resolved: pytorch#49417
    
    The existing implementation applies PowerSGD to a batch of flattened tensors, which is a coarse-grained compression. This hook is now renamed as "batched_powerSGD_hook".
    
    Now implement the original algorithm from the paper, which applies PowerSGD to each per-parameter tensor. This is a layerwise, fine-grained compression. Although this original implementation is slower, it is expected to achieve a higher accuracy, especially when the shapes of per-param tensors cannot be aligned.
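    
    At its core, the per-tensor compression is a low-rank factorization refined by power iteration; a minimal sketch of one step (not the hook's actual code, which also handles error feedback and the allreduce of the factors):
    
    ```
    import torch
    
    def powersgd_step(M, rank=2):
        # M: [m, n] gradient matrix; approximate M ~= P @ Q^T with thin
        # factors P: [m, rank], Q: [n, rank], so only P and Q need to be
        # communicated instead of the full matrix.
        Q = torch.randn(M.shape[1], rank)
        P = M @ Q
        P, _ = torch.qr(P)        # orthonormalize P
        Q = M.t() @ P
        return P, Q
    
    M = torch.randn(64, 128)
    P, Q = powersgd_step(M)
    approx = P @ Q.t()            # decompressed gradient
    ```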
    
    Also add a test in distributed_test.py.
    
    Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202
    ghstack-source-id: 118921275
    
    Test Plan:
    buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
    
    buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
    
    Reviewed By: rohan-varma
    
    Differential Revision: D25511543
    
    fbshipit-source-id: 19ef188bc2d4c7406443c8fa233c1f2c2f27d93c
    Yi Wang authored and hwangdeyu committed Dec 23, 2020
  144. Improve documentation for pipeline parallelism. (pytorch#48638)

    Summary:
    Pull Request resolved: pytorch#48638
    
    Polishing up some of the docs for the main `Pipe` class and its
    `forward` method.
    ghstack-source-id: 118820804
    
    Test Plan: waitforbuildbot
    
    Reviewed By: rohan-varma
    
    Differential Revision: D25237705
    
    fbshipit-source-id: ba3d8737b90a80024c827c0887fc56f14bf678b7
    pritamdamania authored and hwangdeyu committed Dec 23, 2020
  145. Add benchmark for torch.distributed.pipeline.sync.Pipe (pytorch#49577)

    Summary:
    Pull Request resolved: pytorch#49577
    
    Repurposing the benchmarking from
    https://github.com/facebookresearch/fairscale/blob/master/benchmarks/pipe.py
    and pulling in a stripped down version of the benchmark into PyTorch.
    
    Sample output:
    ```
    Running benchmark with args: Namespace(batch_size=8, checkpoint='never', chunks=4, host='localhost', max_batch=10, num_decoder_layers=10, num_devices=4)
    Number of parameters for model: 292833040
    | batch     1 | wps 3593.07 | loss 25.98 | ppl 192556591553.37
    | batch     2 | wps 4405.16 | loss 19.36 | ppl 256201548.33
    | batch     3 | wps 4404.98 | loss 23.56 | ppl 17111244076.37
    | batch     4 | wps 4413.25 | loss 27.11 | ppl 594561327825.83
    | batch     5 | wps 4408.53 | loss 25.92 | ppl 181277705101.33
    | batch     6 | wps 4385.64 | loss 24.92 | ppl 66592883598.50
    | batch     7 | wps 4434.11 | loss 24.75 | ppl 56113635884.68
    | batch     8 | wps 4441.25 | loss 24.88 | ppl 63666024212.82
    | batch     9 | wps 4425.49 | loss 25.35 | ppl 101959669008.98
    | batch    10 | wps 4421.05 | loss 25.34 | ppl 101597621863.94
    Peak memory usage for GPUs: cuda:0: 2.38GiB, cuda:1: 3.04GiB, cuda:2: 3.04GiB, cuda:3: 3.67GiB,
    ```
    ghstack-source-id: 118939686
    
    Test Plan: sentinel
    
    Reviewed By: rohan-varma
    
    Differential Revision: D25628721
    
    fbshipit-source-id: 41c788eed4f852aef019aec18a84cb25ad254f3a
    pritamdamania authored and hwangdeyu committed Dec 23, 2020
  146. Bump tensorpipe version (pytorch#49599)

    Summary: Pull Request resolved: pytorch#49599
    
    Reviewed By: lw
    
    Differential Revision: D25639036
    
    Pulled By: mrshenli
    
    fbshipit-source-id: 595b396a01d7fa9049d88447ab9079e286637afe
    Lucas Hosseini authored and hwangdeyu committed Dec 23, 2020
  147. Fix lint (pytorch#49629)

    Summary:
    Fix lint on master
    
    Pull Request resolved: pytorch#49629
    
    Reviewed By: rohan-varma
    
    Differential Revision: D25654199
    
    Pulled By: mrshenli
    
    fbshipit-source-id: 2ab5669ad47996c0ca0f9b6611855767d5af0506
    mrshenli authored and hwangdeyu committed Dec 23, 2020
  148. [quant][graphmode][fx] Allow user to specify qconfig for call_method (pytorch#49621)
    
    Summary:
    Pull Request resolved: pytorch#49621
    
    This adds support to configure qconfig for a call_method, e.g. x.chunk; this will help work around
    a problem in our internal model.
    
    TODO: since call_method is also a string and we flatten the qconfig, might need to resolve namespace conflict between
    call_method and module_name
    TODO: Add scope support to set the qconfig for call_method correctly with original qconfig
    
    Test Plan: Imported from OSS
    
    Reviewed By: vkuzo
    
    Differential Revision: D25651828
    
    fbshipit-source-id: 82d66b121d37c8274fd481b6a2e9f9b54c5ca73d
    jerryzh168 authored and hwangdeyu committed Dec 23, 2020
  149. Revert D25511543: [Gradient Compression] Implement the original layerwise PowerSGD
    
    Test Plan: revert-hammer
    
    Differential Revision:
    D25511543 (pytorch@71f3399)
    
    Original commit changeset: 19ef188bc2d4
    
    fbshipit-source-id: a363641a059aeacc57684884998cf8fb7363d748
    mrshenli authored and hwangdeyu committed Dec 23, 2020
  150. [PyTorch Mobile] Preserve bundled input related methods when calling optimize_for_mobile (pytorch#49170)
    
    Summary:
    Pull Request resolved: pytorch#49170
    
    Added an extra step to **always** preserve the bundled inputs methods if they are present in the input module.
    
    Also added a check to see if all the methods in `preserved_methods` exist. If not, we will now throw an exception. This can hopefully stop hard-to-debug inputs from getting into downstream functions.
    
    ~~Add an optional argument `preserve_bundled_inputs_methods=False` to the `optimize_for_mobile` function. If set to be True, the function will now add three additional functions related with bundled inputs to be preserved: `get_all_bundled_inputs`, `get_num_bundled_inputs` and `run_on_bundled_input`.~~
    
    Test Plan:
    `buck test mode/dev //caffe2/test:mobile -- 'test_preserve_bundled_inputs_methods \(test_mobile_optimizer\.TestOptimizer\)'`
    
    or
    
    `buck test caffe2/test:mobile` to run some other related tests as well.
    
    Reviewed By: dhruvbird
    
    Differential Revision: D25463719
    
    fbshipit-source-id: 6670dfd59bcaf54b56019c1a43db04b288481b6a
    bearzx authored and hwangdeyu committed Dec 23, 2020
  151. Disable test on windows (pytorch#49636)

    Summary:
    Pull Request resolved: pytorch#49636
    
    test_export_stacks fails with permission errors
    
    Test Plan:
    CI
    
    Imported from OSS
    
    Reviewed By: robieta
    
    Differential Revision: D25654680
    
    fbshipit-source-id: 5689289e06eebc0686030f90ed56483a072b6850
    Ilia Cherniavskii authored and hwangdeyu committed Dec 23, 2020
  152. Remove DataPtr extractor from CUDAFuture (pytorch#48840)

    Summary:
    Pull Request resolved: pytorch#48840
    
    The CUDAFuture class needs to inspect the values it contains in order to extract its tensors (in fact, the DataPtrs backing those). These are needed first to determine what CUDA devices back those tensors, so that an event for each such device can be recorded; and later to record these DataPtrs with the CUDA caching allocator if they are used in other streams.
    
    This became complicated when Python was added to the mix, because to inspect a Python object we need to acquire the GIL, but we couldn't do so from code that was supposed to also work in C++-only mode. The solution was for users to provide a custom way to extract DataPtrs, so that the PythonFutureWrapper could install such a custom Python-aware one. This was the DataPtr extractor.
    
    In pytorch#48502 a different suggestion was proposed. At its root, it consists in adding support for IValues of type PyObject to the visit() and getSubValues() methods. In order to deal with the GIL, we do this through a virtual method: PyObjectHolder, which is the base class, is available also in C++-only mode, and thus defines this method but leaves it unimplemented; ConcretePyObjectHolder, which is the subclass, is only included in Python mode, and thus it can implement that method, acquire the GIL, and do what it's supposed to.
    
    In my opinion, this approach is just brilliant! Thanks wanchaol for proposing it! It hides the complexity of dealing with Python inside getSubValues(), where it can be done properly, thus enormously simplifying the CUDAFuture and PythonFutureWrapper classes.
    ghstack-source-id: 118704935
    
    Test Plan: Unit tests
    
    Reviewed By: wanchaol
    
    Differential Revision: D25334355
    
    fbshipit-source-id: 3f1d3bf6e6e8505a114c877fb9a6fcc3f68d91d3
    lw authored and hwangdeyu committed Dec 23, 2020
  153. disable kthvalue overlap (pytorch#48254)

    Summary:
    Fixes pytorch#47934
    
    Pull Request resolved: pytorch#48254
    
    Reviewed By: bdhirsh
    
    Differential Revision: D25276689
    
    Pulled By: VitalyFedyunin
    
    fbshipit-source-id: a70774e31c269b41786170e99ec1ede42596ba7b
    guol-fnst authored and hwangdeyu committed Dec 23, 2020
  154. Resubmit: [Gradient Compression] Implement the original layerwise PowerSGD (pytorch#49639)
    
    Summary:
    Pull Request resolved: pytorch#49639
    
    Resubmit pytorch#49417 with a fix for distributed_test.
    
    The previous submission broke a multi-GPU test that runs on 4 GPUs. Since this test only runs on master, I couldn't detect it before the submission.
    
    The real diff is:
    pytorch@4ca1014
    
    This time I have verified that the previous failed test `pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test` could pass after creating a PR (pytorch#49651) from a separate branch:
    https://app.circleci.com/pipelines/github/pytorch/pytorch/253644/workflows/c1c02b70-0877-40e6-8b4c-61f60f6b70ed/jobs/9768079
    
    ghstack-source-id: 118969912
    
    Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
    
    Reviewed By: mrshenli
    
    Differential Revision: D25654961
    
    fbshipit-source-id: 2a45c8ceb9bdb54ff7309a8b66ec87e913e0150e
    Yi Wang authored and hwangdeyu committed Dec 23, 2020
  155. Updated derivative rules for complex svd and pinverse (pytorch#47761)

    Summary:
    Updated `svd_backward` to work correctly for complex-valued inputs.
    Updated `common_methods_invocations.py` to take dtype, device arguments for input construction.
    Removed `test_pinverse` from `test_autograd.py`, it is replaced by entries to `common_methods_invocations.py`.
    Added `svd` and `pinverse` to list of complex tests.
    
    References for complex-valued SVD differentiation:
    
    - https://giggleliu.github.io/2019/04/02/einsumbp.html
    - https://arxiv.org/abs/1909.02659
    
    The derived rules assume gauge invariance of loss functions, so the result would not be correct for loss functions that are not gauge invariant.
    https://re-ra.xyz/Gauge-Problem-in-Automatic-Differentiation/
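    
    For instance, a loss built from the singular values alone is gauge invariant, so its gradient is well defined:
    
    ```
    import torch
    
    a = torch.randn(3, 3, dtype=torch.complex128, requires_grad=True)
    u, s, v = torch.svd(a)
    s.sum().backward()            # singular values are gauge invariant
    print(a.grad.dtype)           # torch.complex128
    ```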
    
    The same rule is implemented in Tensorflow and [BackwardsLinalg.jl](https://github.com/GiggleLiu/BackwardsLinalg.jl).
    
    Ref. pytorch#33152
    
    Pull Request resolved: pytorch#47761
    
    Reviewed By: ngimel
    
    Differential Revision: D25658897
    
    Pulled By: mruberry
    
    fbshipit-source-id: ba33ecbbea3f592238c01e62c7f193daf22a9d01
    IvanYashchuk authored and hwangdeyu committed Dec 23, 2020
  156. [Gradient Compression] Add error feedback to layerwise PowerSGD (pytorch#49418)
    
    Summary:
    Pull Request resolved: pytorch#49418
    
    Add error feedback to the original implementation of PowerSGD.
    
    Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202
    ghstack-source-id: 118670930
    
    Test Plan:
    buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
    
    buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
    
    Reviewed By: rohan-varma
    
    Differential Revision: D25555538
    
    fbshipit-source-id: c01145cc9acf574a4c6aa337dbbba0ba7d9350b2
    Yi Wang authored and hwangdeyu committed Dec 23, 2020
  157. [Gradient Compression] Replace the assertions in PowerSGD comm hook by stream synchronization (pytorch#49435)
    
    Summary:
    Pull Request resolved: pytorch#49435
    
    Previously, illegal memory access was prevented only as a side effect: the torch.any call returns a boolean value, which initiates a data transfer from the device to the host and thereby forces a synchronization.
    
    An explicit synchronization is more to the point.
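    
    That is, instead of relying on a data-dependent host-side read to synchronize, synchronize the stream directly; sketched:
    
    ```
    import torch
    
    # implicit, accidental sync: device-to-host read of a boolean
    # if torch.any(tensor != tensor): ...
    # explicit, intentional sync:
    torch.cuda.current_stream().synchronize()
    ```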
    
    Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202
    ghstack-source-id: 118664204
    
    Test Plan:
    buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
    
    buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
    
    Reviewed By: rohan-varma
    
    Differential Revision: D25573484
    
    fbshipit-source-id: 516d0d502da2863b516c15332702335ee662f072
    Yi Wang authored and hwangdeyu committed Dec 23, 2020
  158. Add support for torch.tensor_split to accept a tensor for indices argument (pytorch#49169)
    
    Summary:
    Pull Request resolved: pytorch#49169
    
    Trying to address feature request pytorch#47479.
    This diff tries to overload the method `torch.tensor_split` to also accept a tensor for the argument `split_size_or_sections`, which currently accepts a Python list or int. The motivation is to avoid converting a tensor to a list, so that when tracing a model/module the tensor operations can be recorded.
    
    Implementation is following the diff that originally added the `tensor_split` method D24166164 (pytorch@ef4817f).
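    
    For example:
    
    ```
    import torch
    
    x = torch.arange(8)
    # indices passed as a tensor instead of a list, so tracing can record it
    print(torch.tensor_split(x, torch.tensor([2, 5])))
    # (tensor([0, 1]), tensor([2, 3, 4]), tensor([5, 6, 7]))
    ```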
    
    Test Plan:
    ```
    buck test caffe2/test:torch -- tensor_split
    ```
    https://www.internalfb.com/intern/testinfra/testconsole/testrun/5910974550563805/
    
    ```
    buck test caffe2/test:others -- tensor_split
    ```
    https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688849905082678/
    
    Reviewed By: mruberry
    
    Differential Revision: D25440885
    
    fbshipit-source-id: 6705dc551279e3a5eb1e5ec1ede2728eab85ffb1
    Edson Romero authored and hwangdeyu committed Dec 23, 2020
  159. [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily `arc lint --take CLANGFORMAT`
    
    Reviewed By: zertosh
    
    Differential Revision: D25662961
    
    fbshipit-source-id: f5811a5797fd6dc8733fdf86f35c93d12a08d53a
    generatedunixname89002005325676 authored and hwangdeyu committed Dec 23, 2020
  160. [WIP][DataLoader] CollateIterableDataset prototype (pytorch#48933)

    Summary:
    Pull Request resolved: pytorch#48933
    
    Prototype for CollateIterableDataset.
    Move `collate_batch_fn` to BatchIterableDataset
    
    - CollateIterableDataset
      - [x] Prototype
      - [x] Tests
    - BatchIterableDataset
      - [x] Prototype
      - [x] Tests
    - SamplerIterableDataset
      - [x] Prototype
      - [x] Tests
    
    Test Plan: Imported from OSS
    
    Reviewed By: mrshenli
    
    Differential Revision: D25623635
    
    Pulled By: ejguan
    
    fbshipit-source-id: 99ba077619f672551ac15367baaba985db35a9c2
    ejguan authored and hwangdeyu committed Dec 23, 2020
  161. [WIP][DataLoader] Prototype of BatchIterableDataset (pytorch#49186)

    Summary: Pull Request resolved: pytorch#49186
    
    Test Plan: Imported from OSS
    
    Reviewed By: mrshenli
    
    Differential Revision: D25623636
    
    Pulled By: ejguan
    
    fbshipit-source-id: 01a08cccb69301481c55b46358203354b9b4f5fa
    ejguan authored and hwangdeyu committed Dec 23, 2020
    Configuration menu
    Copy the full SHA
    cf9ad1f View commit details
    Browse the repository at this point in the history
  162. [WIP][DataLoader] Prototype of SamplerIterableDataset (pytorch#49363)

    Summary: Pull Request resolved: pytorch#49363
    
    Test Plan: Imported from OSS
    
    Reviewed By: mrshenli
    
    Differential Revision: D25623637
    
    Pulled By: ejguan
    
    fbshipit-source-id: 9155d27d1fc91996b74110795cc73f1da0eedd44
    ejguan authored and hwangdeyu committed Dec 23, 2020
  163. [Mask R-CNN] Add Int8 AABB Generate proposals Op (pytorch#49574)

    Summary:
    Pull Request resolved: pytorch#49574
    
    Adds support for additional Eigen Utils for custom type defs.
    
    Reviewed By: linbinyu
    
    Differential Revision: D25624556
    
    fbshipit-source-id: 0ffa90aaf8cbf1d08825e95156fb40d966ca7042
    anshuljain1 authored and hwangdeyu committed Dec 23, 2020
  164. Fix sinc docs typo (pytorch#49667)

    Summary:
    Fix small typo in sinc docs
    
    Pull Request resolved: pytorch#49667
    
    Reviewed By: ngimel
    
    Differential Revision: D25665721
    
    Pulled By: soulitzer
    
    fbshipit-source-id: 5f78b9e34bb0084e51ae79d1afc450bcb0ae3d75
    soulitzer authored and hwangdeyu committed Dec 23, 2020
  165. Added linalg.solve (pytorch#48456)

    Summary:
    This PR adds `torch.linalg.solve`.
    
    `linalg_solve_out` uses in-place operations on the provided result tensor.
    
    I modified `apply_solve` to accept a tensor of Int instead of std::vector; that way we can write a function similar to `linalg_solve_out` but removing the error checks and device memory synchronization.
    
    In comparison to `torch.solve`, this routine accepts 1-dimensional tensors and batches of 1-dim tensors for the right-hand-side term; `torch.solve` requires it to be at least 2-dimensional.
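    
    For example, with a batch of matrices and 1-D right-hand sides:
    
    ```
    import torch
    
    A = torch.randn(2, 3, 3)               # batch of matrices
    b = torch.randn(2, 3)                  # batch of 1-D right-hand sides
    x = torch.linalg.solve(A, b)
    print(torch.allclose(A @ x.unsqueeze(-1), b.unsqueeze(-1), atol=1e-5))
    ```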
    
    Ref. pytorch#42666
    
    Pull Request resolved: pytorch#48456
    
    Reviewed By: izdeby
    
    Differential Revision: D25562222
    
    Pulled By: mruberry
    
    fbshipit-source-id: a9355c029e2442c2e448b6309511919631f9e43b
    IvanYashchuk authored and hwangdeyu committed Dec 23, 2020
  166. Fix return type Any for Ternary ops (pytorch#49165)

    Summary: Pull Request resolved: pytorch#49165
    
    Test Plan: Imported from OSS
    
    Reviewed By: eellison
    
    Differential Revision: D25463694
    
    Pulled By: ejguan
    
    fbshipit-source-id: 5cf907e8de6eeb0171d61175a60fac9812b76c6c
    ejguan authored and hwangdeyu committed Dec 23, 2020
  167. Fix typo in add_pr_curve docstrings. (pytorch#49648)

    Summary:
    Very small PR to fix a typo.
    
    ### Description
    Fixed 1 typo in the documentation of `torch/utils/tensorboard/writer.py` (replaced "_should in_" by "_should be in_")
    
    Pull Request resolved: pytorch#49648
    
    Reviewed By: ngimel
    
    Differential Revision: D25665831
    
    Pulled By: mrshenli
    
    fbshipit-source-id: a4e733515603bb9313c1267fdf2cfcc2bc2773c6
    theodumont authored and hwangdeyu committed Dec 23, 2020
  168. Fixed a typo in dataloader.py. (pytorch#49437)

    Summary:
    This small PR fixes a one character typo in the docstring for `DataLoader`.
    
    Pull Request resolved: pytorch#49437
    
    Reviewed By: ngimel
    
    Differential Revision: D25665971
    
    Pulled By: mrshenli
    
    fbshipit-source-id: b60f975f1e3bf0bb8f88e39f490f716c602f087e
    tmcclintock authored and hwangdeyu committed Dec 23, 2020
  169. [NNC] Intermediate allocs flattened and dependency support (pytorch#49554)
    
    Summary:
    Makes two changes in NNC for intermediate buffer allocations:
    1. Flattens dimensions of buffers allocated in LoopNest::prepareForCodegen() to match their flattened usages.
    2. Adds support for tracking memory dependencies of Alloc/Free to the MemDependencyChecker, which will allow us to check safety of accesses to intermediate buffers (coming in a future diff).
    
    I didn't add any new tests as the mem dependency checker tests already cover it pretty well, particularly the GEMM test.
    
    Pull Request resolved: pytorch#49554
    
    Reviewed By: VitalyFedyunin
    
    Differential Revision: D25643133
    
    Pulled By: nickgg
    
    fbshipit-source-id: 66be3054eb36f0a4279d0c36562e63aa2dae371c
    nickgg authored and hwangdeyu committed Dec 23, 2020
  170. Implementing NumPy-like function torch.broadcast_to (pytorch#48997)

    Summary:
    Related pytorch#38349
    
    Implement NumPy-like function `torch.broadcast_to` to broadcast the input tensor to a new shape.
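    
    Usage mirrors numpy.broadcast_to:
    
    ```
    import torch
    
    x = torch.tensor([1, 2, 3])
    print(torch.broadcast_to(x, (3, 3)))
    # tensor([[1, 2, 3],
    #         [1, 2, 3],
    #         [1, 2, 3]])
    ```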
    
    Pull Request resolved: pytorch#48997
    
    Reviewed By: anjali411, ngimel
    
    Differential Revision: D25663937
    
    Pulled By: mruberry
    
    fbshipit-source-id: 0415c03f92f02684983f412666d0a44515b99373
    RockingJavaBean authored and hwangdeyu committed Dec 23, 2020
  171. Sparse-sparse matrix multiplication (CPU/CUDA) (pytorch#39526)

    Summary:
    This PR implements matrix multiplication support for 2-d sparse tensors using the COO sparse format.
    
    The current implementation of `torch.sparse.mm` supports this configuration,
    `torch.sparse.mm(sparse_matrix1, sparse_matrix2.to_dense())`, but this can consume a lot of memory when sparse_matrix2's shape is large.
    
    This implementation extends `torch.sparse.mm` function to support  `torch.sparse.mm(sparse_matrix1, sparse_matrix2)`
    
    Resolves pytorch#20988 for CPU/CUDA.
    
    - [x] sparse matmul
      - [x] CPU/CUDA C++ implementation
      - [x] unittests
      - [x] update torch.sparse.mm documentation
      - [x] autograd support
    
    The CPU sparse-sparse matmul was implemented taking as a reference the work "Sparse Matrix Multiplication Package (SMMP)". The GPU sparse-sparse matmul is based on cuSPARSE; there is specific code for CUSPARSE_VERSION >= 11 as well as for older versions of cuSPARSE. Both the CPU and CUDA paths rely on a sparse-sparse matmul algorithm using the CSR indices format, as it is one of the fastest algorithms.
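    
    Usage after this change (small illustrative matrices):
    
    ```
    import torch
    
    a = torch.sparse_coo_tensor([[0, 1, 1], [2, 0, 2]], [3.0, 4.0, 5.0], (2, 3))
    b = torch.sparse_coo_tensor([[0, 2], [1, 0]], [1.0, 2.0], (3, 2))
    c = torch.sparse.mm(a, b)   # sparse @ sparse -> sparse, no to_dense() needed
    print(c.to_dense())         # tensor([[ 6.,  0.], [10.,  4.]])
    ```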
    
    Here are the latest benchmark results (script is here) for torch.sparse.mm (CUDA), torch.sparse.mm (CPU), and scipy; values are float32 scalars:
    
    size | density | sparse.mm(CUDA) | sparse.mm(CPU) | scipy_coo_matmul
    -- | -- | -- | -- | --
    (32, 10000) | 0.01 | 822.7 | 79.4 | 704.1
    (32, 10000) | 0.05 | 1741.1 | 402.6 | 1155.3
    (32, 10000) | 0.1 | 2956.8 | 840.8 | 1885.4
    (32, 10000) | 0.25 | 6417.7 | 2832.3 | 4665.2
    (512, 10000) | 0.01 | 1010.2 | 3941.3 | 26937.7
    (512, 10000) | 0.05 | 2216.2 | 26903.8 | 57343.7
    (512, 10000) | 0.1 | 4868.4 | 87773.7 | 117477.0
    (512, 10000) | 0.25 | 16639.3 | 608105.0 | 624290.4
    (1024, 10000) | 0.01 | 1224.8 | 13088.1 | 110379.2
    (1024, 10000) | 0.05 | 3897.5 | 94783.9 | 236541.8
    (1024, 10000) | 0.1 | 10559.1 | 405312.5 | 525483.4
    (1024, 10000) | 0.25 | 57456.3 | 2424337.5 | 2729318.7
    
    A new backward algorithm was implemented using only `sparse @ sparse` and `sparse_mask` operations. Here is some benchmarking:
    
    ```
    [------------------------- sparse.mm-backward -------------------------]
                                |   sparse.backward   |  dense.backward
     -----------------------------------------------------------------------
          (32, 10000) | 0.01    |            13.5          |         2.4
          (32, 10000) | 0.05    |            52.3          |         2.4
          (512, 10000) | 0.01   |          1016.8          |       491.5
          (512, 10000) | 0.05   |          1604.3          |       492.3
          (1024, 10000) | 0.01  |          2384.1          |      1963.7
          (1024, 10000) | 0.05  |          3965.8          |      1951.9
    ```
    
    I added new benchmark tests. Now I am using a real dataset used in recent studies [1, 2] with different sparsity levels.
    
    ```
    [---------------------------------- matmul ---------------------------------]
                            |   0.5   |  0.7   |  0.8   |  0.9   |  0.95  |  0.98
    1 threads: ------------------------------------------------------------------
      (cpu)   torch         |    5.4  |   5.4  |   5.2  |   5.3  |   5.3  |   5.4
              torch.sparse  |  122.2  |  51.9  |  27.5  |  11.4  |   4.9  |   1.8
              scipy         |  150.1  |  87.4  |  69.2  |  56.8  |  38.4  |  17.1
      (cuda)  torch         |    1.3  |   1.1  |   1.1  |   1.1  |   1.1  |   1.1
              torch.sparse  |   20.0  |   8.4  |   5.1  |   2.5  |   1.5  |   1.1
    
    [----------------------------------- backward -----------------------------------]
                            |   0.5   |   0.7   |   0.8   |   0.9   |   0.95  |   0.98
    1 threads: -----------------------------------------------------------------------
      (cpu)   torch         |   17.7  |   17.9  |   17.7  |   17.7  |   17.6  |   17.9
              torch.sparse  |  672.9  |  432.6  |  327.5  |  230.8  |  176.7  |  116.7
      (cuda)  torch         |    3.8  |    3.6  |    3.5  |    3.5  |    3.6  |    3.5
              torch.sparse  |   68.8  |   46.2  |   35.6  |   24.2  |   17.8  |   11.9
    
    Times are in milliseconds (ms).
    ```
    
    In summary, the new `sparse @ sparse` backward algorithm is the better choice: it is more about saving memory than raw performance, and it outperformed the other options tested before.
    
    ## **References**
    
    1. Trevor Gale, Matei Zaharia, Cliff Young, Erich Elsen. **Sparse GPU Kernels for Deep Learning.**  Proceedings of the International Conference for High Performance Computing, 2020. [https://github.com/google-research/google-research/tree/master/sgk](https://github.com/google-research/google-research/tree/master/sgk)
    2. Trevor Gale, Erich Elsen, Sara Hooker. **The State of Sparsity in Deep Neural Networks.** [https://github.com/google-research/google-research/tree/master/state_of_sparsity](https://github.com/google-research/google-research/tree/master/state_of_sparsity)
    
    Pull Request resolved: pytorch#39526
    
    Reviewed By: mruberry
    
    Differential Revision: D25661239
    
    Pulled By: ngimel
    
    fbshipit-source-id: b515ecd66d25f347d637e159d51aa45fb43b6938
    aocsa authored and hwangdeyu committed Dec 23, 2020
    Commit: 38ff78f
  172. [BE] Introduce set_cwd context manager (pytorch#49657)

    Summary:
    Introduces a context manager that temporarily changes the working directory and restores it even if an exception is raised.
    It is used in test_type_hints and during code coverage collection.
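    A minimal sketch of such a helper (assuming a contextlib-based implementation; the actual code may differ):
    
    ```python
    import os
    from contextlib import contextmanager
    
    @contextmanager
    def set_cwd(path):
        # remember the current directory, switch, and always switch back
        old_cwd = os.getcwd()
        os.chdir(path)
        try:
            yield
        finally:
            os.chdir(old_cwd)
    
    # the original directory is restored even if the body raises
    with set_cwd("/tmp"):
        pass
    ```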
    
    Pull Request resolved: pytorch#49657
    
    Reviewed By: walterddr
    
    Differential Revision: D25660543
    
    Pulled By: malfet
    
    fbshipit-source-id: 77f08d57e4b60b95daa4068d0dacf7c25f978526
    malfet authored and hwangdeyu committed Dec 23, 2020
    Commit: 209bddb
  173. add close() method to tqdm mock (pytorch#46040)

    Summary:
    In `torchvision` we use [`torch.hub.tqdm`](https://github.com/pytorch/vision/blob/2cc20d7485458a6368e8995e3f79799589b632bd/torchvision/datasets/utils.py#L11) to display the dataset download. One of our methods uses [`tqdm().close()`](https://github.com/pytorch/vision/blob/2cc20d7485458a6368e8995e3f79799589b632bd/torchvision/datasets/utils.py#L188), which is [not included in the mock](https://github.com/pmeier/pytorch/blob/283ae1998cd6920b588907adfb88909afb522ae2/torch/hub.py#L22-L49). This PR adds a `close()` method to the mock.
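    For context, the mock is a no-op progress bar; a simplified sketch of the shape of the fix (hypothetical, not the exact hub.py code):
    
    ```python
    class tqdm:  # simplified stand-in for the torch.hub tqdm mock
        def __init__(self, total=None, disable=False, unit=None, *args, **kwargs):
            self.total = total
            self.n = 0
    
        def update(self, n):
            self.n += n  # silently accumulate progress
    
        def close(self):
            pass  # the method this PR adds: a no-op, matching real tqdm's API
    
        def __enter__(self):
            return self
    
        def __exit__(self, exc_type, exc_val, exc_tb):
            pass
    ```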
    
    Cc fmassa
    
    Pull Request resolved: pytorch#46040
    
    Reviewed By: mrshenli
    
    Differential Revision: D25619429
    
    Pulled By: fmassa
    
    fbshipit-source-id: a137f2417d8a47923ccb1ec6b7d5298c1545245c
    pmeier authored and hwangdeyu committed Dec 23, 2020
    Commit: 2e52d1d
  174. Dynamic GRU quantization support (pytorch#49448)

    Summary:
    Pull Request resolved: pytorch#49448
    
    ghstack-source-id: 118982171
    
    Test Plan:
    buck test caffe2/test:quantization --  'test_qlstmGRU \(quantization\.test_quantized_op\.TestDynamicQuantizedRNNOp\)' --print-passing-details
    buck test caffe2/test:quantization --  'test_quantized_rnn \(quantization\.test_quantize\.TestPostTrainingDynamic\)' --print-passing-details
    buck test caffe2/test:quantization --  'test_qrnncell \(quantization\.test_quantized_op\.TestDynamicQuantizedRNNOp\)' --run-disabled --print-passing-details
    
    Reviewed By: vkuzo
    
    Differential Revision: D25579815
    
    fbshipit-source-id: 413cc8888eb8058230b94c9576d2fa54b0ed1416
    raghuramank100 authored and hwangdeyu committed Dec 23, 2020
    Commit: 3dafed5
  175. converted current debugging statements in LLVM codegen to jit-logging statements pytorch#48771 (pytorch#49040)
    
    Summary: Pull Request resolved: pytorch#49040
    
    Test Plan: Imported from OSS
    
    Reviewed By: ZolotukhinM
    
    Differential Revision: D25407356
    
    Pulled By: huiguoo
    
    fbshipit-source-id: 1c1f893ed8d0877bee27e9a673a5dce2203c2bad
    huiguoo authored and hwangdeyu committed Dec 23, 2020
    Commit: 6f66ee4
  176. added macros in jit logging to check whether logging is enabled; replaced similar checks in LLVM codegen with such macros (pytorch#49121)
    
    Summary: Pull Request resolved: pytorch#49121
    
    Test Plan: Imported from OSS
    
    Reviewed By: ZolotukhinM
    
    Differential Revision: D25445971
    
    Pulled By: huiguoo
    
    fbshipit-source-id: 980775a94159aa0b3b66fae938962761b38703d5
    huiguoo authored and hwangdeyu committed Dec 23, 2020
    Commit: a20a1f9
  177. change block codegen to handle new inlining in NNC (pytorch#47687)

    Summary:
    Minor changes to Block codegen to handle the new inlining in NNC.
    For Block code generation we need to collect dimension data about the tensors before they are flattened.
    This information is not available after the inlining pass, so for Block we run inlining only after we have collected this data using the `CreateBufferMap` analysis.
    
    Pull Request resolved: pytorch#47687
    
    Reviewed By: ZolotukhinM
    
    Differential Revision: D24864869
    
    Pulled By: protonu
    
    fbshipit-source-id: 9574c0599f7d959a1cf0eb49d4e3e541cbe9b1d3
    Protonu Basu authored and hwangdeyu committed Dec 23, 2020
    Commit: 8cb4a36
  178. Clean up backward compatibility skip list (pytorch#49691)

    Summary:
    Pull Request resolved: pytorch#49691
    
    Quite a few stale items, let's make the list short.
    
    Test Plan: oss ci
    
    Reviewed By: hl475
    
    Differential Revision: D25667464
    
    fbshipit-source-id: cff1be8b5e0068470b3f621acf6bf4fbd414233e
    houseroad authored and hwangdeyu committed Dec 23, 2020
    Commit: 2af5914
  179. Enable product for bool tensor (pytorch#48637)

    Summary:
    Fixes pytorch#48351
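    An illustration of the behavior this enables (editor's sketch):
    
    ```python
    import torch
    
    t = torch.tensor([True, True, False])
    print(t.prod())      # tensor(False): a single False zeroes the product
    print(t[:2].prod())  # tensor(True)
    ```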
    
    Pull Request resolved: pytorch#48637
    
    Reviewed By: mrshenli
    
    Differential Revision: D25658596
    
    Pulled By: mruberry
    
    fbshipit-source-id: ff3ada74b6d281c8e4753ed38339a1c036f722ee
    Kiyosora authored and hwangdeyu committed Dec 23, 2020
    Commit: 83c91f9
  180. Fix test_cuda_init_race skip rules (pytorch#49693)

    Summary:
    Fixes pytorch#49432
    
    Pull Request resolved: pytorch#49693
    
    Reviewed By: walterddr, janeyx99
    
    Differential Revision: D25668027
    
    Pulled By: malfet
    
    fbshipit-source-id: 802cbd39e4ebe585709179f332b680f5f7978814
    malfet authored and hwangdeyu committed Dec 23, 2020
    Commit: 56115b7
  181. Add base forward grad logic (pytorch#49097)

    Summary:
    Pull Request resolved: pytorch#49097
    
    RFC: pytorch/rfcs#11
    
    This PR add the basic logic to handle forward grad as dual Tensors.
    It contains the following:
    - Mechanism to save dual state on a Tensor and clear it up when the dual level ends
    - C++ and python user facing API
    - Updated view system that is able to track both forward and backward views
    
    The current PR has the following limitations:
    - Extensive tests are in the next PR in the stack as formulas are needed to write full tests.
    - Only the manual formulas have been audited and no other formula is actually implemented here (they are in the next PR in the stack)
    - Only level 0 is allowed for now. This was discussed and agreed that it is not needed for the first version of this PR.
    - We can save one ViewInfo creation when both the forward and backward views have the same base. This can be done by adding a boolean flag to the DifferentiableViewMeta and extra logic in the `as_view` method. This is left out to keep this PR concise.
    - We can skip tracking forward views if the base has a forward grad. This can be done by adding extra logic in the `as_view` method. This is left out to keep this PR concise.
    
    Reading guide:
    - Updated view handling in [gen_variable_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-f6553cec68caeaea36f6c8b14ff76a6d39dfd774e0ea9ef2f76e8d81fd9af5df), [VariableTypeUtils.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-ec71cfa45954dece1236c661d170e6341879c5be637f4abf52e826d61b40695a), [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285) (skip code below "[Forward Grad View]" for now), [variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-1604bcd0e4350ed99ec45e437cee7ac9ebe337392c9ea16a236247aeeb35b02bR266-R542) and [custom_function.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-dd85f452082b5bb6612bbc12adb496f8827defa228509f7b493de1d517522d5d). This introduces the new ViewInfo to hold view informations shared for forward and backward. It also updates the differentiable view meta to use this. And it updates the as_view function to handle both forward and backward view.
    - New forward grad class that handle storing gradients and tracking at each level [forward_grad.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c6c5b9ab2d7e5dde4102495faa1b6bbbfc23aa3e47deb7359c0bfe1eb004c0cb), [forward_grad.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-de2ab54ade7312701850d71a119a4f4ee4b9fc5a9c42a467cdd4e73c033531dd) and [build_variables.bzl](https://github.com/pytorch/pytorch/pull/49097/files#diff-dfdfa2efb17beddfd9094524f95351fd197db6c8857e96b436fb599870359325). EDIT: These files also contain the new flag to globally disable forward AD that allows us to reduce performance issues while this is in development.
    - Lowest level API and binding between Tensor and AutogradMeta in [TensorBody.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-7554853205392fa743357bf845ecc350a974ec049383248c12daaf2f4de04911), [TensorImpl.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-052bd9150ef8e09289ddf644b5a6830ede49207201cd41728f6d7cc6d9cead94), [TensorImpl.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-a15aae4cf23da44970db7cece62ff981265575c798c62f7b52d87c8809dfe2e1) and the rest of [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285R557-R677)
    - API to access the forward primal that needs to be a differentiable function (and so in native_functions.yaml) [native_functions.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991) [NamedRegistrations.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-69bd3bea510c9b64e1633fa18c3ea63d4b8348dbad3a78ad9de844ab3e43dc1d), [VariableMethodsStub.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-23f5fcb737a2b289811fe0f4b65aef775e7c824b2e629ecd343df51405cd434f), [derivatives.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_python_functions.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_trace_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-54e0b976027bf8debefb959ff360b89ae93466970c843365b1b3a03806d868ce), [TraceTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-f34636741ad4a23d018e0c289bc750c3bad887b45660e1d6eaf440d234a78fbf) and [part of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R198-R243)
    - c++ API [autograd.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-349028fbe8291a965a7a263c323b208fe071c35c66179ee997ef84fa81aa4b1e), [autograd.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-a3fe908d67dfec16a1fcde300de68b0701bf68b88db7451f29f2bee255cf30c9)
    - python binding [init.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-c58a67c85191c22c9b3bb439117d8053edfd9dea839fa010cf967d404c3c630d)
    - python API [forward_ad.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a4efad4ba18fffdfb264c21e5475997a24a743089a899f8ec1a5ff962c6738d9), [autograd/__init__.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-743abcafd32ad0e69f39ac5a91df4197b7e1921c135cacee7ef6dc829a8a7af8)
    - c++ and python printing [Formatting.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-881dba501e71662e2e4818b4b016f739b344c8aed2f5edc6b871eda47a2aced0), [_tensor_str.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a7911f8d5e73adbff914d99fd7818ace2a7030b6a3748abe06ec6fc6e3df9cc3)
    - Utility for formulas and updated manual functions to respect new view system as well as forward grad [FunctionsManual.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-6378bb6dc81a64dab676d61731341fa5d1088418f32a1473a33a0ccfc2357dc1), [FunctionsManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-4adbd88239afcd60e8198aab65d4f5e43b62314e34b80551e997a1ea503adea5) [rest of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R264-R433)
    - Ensure SavedVariable save forward grad properly [saved_variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c1b8039d776241abe177d5aa99b79dd9489a9b3e529da8ab24c2e386c1238ae2), [saved_variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-cc9fba479b5beae06b2eea2e390d17796e0341c5b037a20b5bcaccbb0c341030)
    
    Test Plan: Imported from OSS
    
    Reviewed By: mrshenli
    
    Differential Revision: D25607503
    
    Pulled By: albanD
    
    fbshipit-source-id: f1396290de1d75760f3d380c43cdd56e86fa6099
    albanD authored and hwangdeyu committed Dec 23, 2020
    Commit: 97d64bc
  182. Do not use negative values in GCD computation. (pytorch#49379)

    Summary:
    GCD should always return positive integers. When negative values are used, we hit a corner case that results in an infinite recursion during simplification.
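    The invariant being restored is the usual mathematical one; in Python terms (an illustrative sketch, not the NNC C++ code):
    
    ```python
    import math
    
    # math.gcd already normalizes signs; a hand-rolled version must do the same
    def gcd(a, b):
        a, b = abs(a), abs(b)
        while b:
            a, b = b, a % b
        return a
    
    assert gcd(-4, 6) == 2 == math.gcd(-4, 6)
    ```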
    
    Pull Request resolved: pytorch#49379
    
    Reviewed By: ezyang
    
    Differential Revision: D25597115
    
    Pulled By: navahgar
    
    fbshipit-source-id: b0e8ac07ee50a5eb775c032628d4840df7424927
    navahgar authored and hwangdeyu committed Dec 23, 2020
    Commit: b77390b
  183. [jit][tracer] allow traced modules to return dicts with tuple values when strict=False (pytorch#49568)
    
    Summary:
    Pull Request resolved: pytorch#49568
    
    We have some inference use cases where the expected output of a module is of the form `{"key": (t1, t1)}`, and we are currently jit tracing these modules until we can reach jit script compatibility.
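    A small sketch of the kind of module and tracing call involved (illustrative):
    
    ```python
    import torch
    
    class M(torch.nn.Module):
        def forward(self, x):
            # dict output whose value is a tuple of tensors
            return {"key": (x + 1, x * 2)}
    
    m = torch.jit.trace(M(), torch.randn(3), strict=False)
    out = m(torch.randn(3))
    print(type(out["key"]))  # a tuple of two tensors
    ```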
    
    Test Plan: buck test mode/dev caffe2/test:jit -- 'test_trace_returning_complex_dict'
    
    Reviewed By: houseroad
    
    Differential Revision: D25624152
    
    fbshipit-source-id: 5adef0e3c9d54cd31ad5fece4ac6530d541fd673
    bradleyhd authored and hwangdeyu committed Dec 23, 2020
    Commit: 4ab6172
  184. Move device guard from MultiTensorApply.cuh (pytorch#46664)

    Summary: Pull Request resolved: pytorch#46664
    
    Test Plan: Imported from OSS
    
    Reviewed By: anjali411
    
    Differential Revision: D24453343
    
    Pulled By: izdeby
    
    fbshipit-source-id: b82a658af50ededc985195ed02dbf60e792c7a13
    Iurii Zdebskyi authored and hwangdeyu committed Dec 23, 2020
    Commit: eb6a2ab
  185. Use store based barrier only for certain store types. (pytorch#49694)

    Summary:
    Pull Request resolved: pytorch#49694
    
    The store based barrier introduced in
    pytorch#49419 broke for certain store types.
    This is a quick fix to resolve the issues for other store types.
    ghstack-source-id: 119006874
    
    Test Plan: 1) waitforbuildbot
    
    Reviewed By: ppwwyyxx, rohan-varma
    
    Differential Revision: D25668404
    
    fbshipit-source-id: 751fb8b229ad6f50ee9c50f63a70de5a91c9eda5
    pritamdamania authored and hwangdeyu committed Dec 23, 2020
    Commit: 4164cb2
  186. Fix TCPStore type coercion (pytorch#49685)

    Summary:
    Fixes pytorch#49052
    
    The TCPStore example with 4 arguments was working because the datetime value was being implicitly converted to a bool. Modified the pybind definition and updated documentation.
    
    Pull Request resolved: pytorch#49685
    
    Test Plan:
    ```
    import torch.distributed as dist
    from datetime import timedelta
    
    dist.TCPStore("127.0.0.1", 0, True, timedelta(seconds=30))
    ```
    
    Now fails with
    ```
    TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
        1. torch._C._distributed_c10d.TCPStore(host_name: str, port: int, world_size: int, is_master: bool, timeout: datetime.timedelta = datetime.timedelta(seconds=300))
    
    Invoked with: '127.0.0.1', 0, True, datetime.timedelta(seconds=30)
    ```
    
    Reviewed By: mrshenli, ngimel
    
    Differential Revision: D25668021
    
    Pulled By: H-Huang
    
    fbshipit-source-id: ce40b8648d0a414f0255666fbc680f1a66fae090
    H-Huang authored and hwangdeyu committed Dec 23, 2020
    Commit: ccde23b
  187. replacing THC_CLASS and THC_API with TORCH_CUDA_API (pytorch#49690)

    Summary:
    THC_API and THC_CLASS were leftover macros from before the consolidation of caffe2, aten, and torch. Now that they're combined, these are misleading and should just be TORCH_CUDA_API. The only file I manually edited was `THCGeneral.h.in`.
    
    Pull Request resolved: pytorch#49690
    
    Reviewed By: malfet
    
    Differential Revision: D25667982
    
    Pulled By: janeyx99
    
    fbshipit-source-id: 2fdf7912b2a0537b7c25e1fed21cc301fa59d57f
    janeyx99 authored and hwangdeyu committed Dec 23, 2020
    Commit: 1e9a97f
  188. Revert D25607503: Add base forward grad logic

    Test Plan: revert-hammer
    
    Differential Revision:
    D25607503 (pytorch@fdf02ef)
    
    Original commit changeset: f1396290de1d
    
    fbshipit-source-id: 057206e28ff48ee288856adfe3ca577d4880789f
    Walter Shen authored and hwangdeyu committed Dec 23, 2020
    Commit: 220afd2
  189. [TensorExpr] Change LoopNest::vectorize to accept `For*` instead of `Stmt*`. (pytorch#49696)
    
    Summary:
    Pull Request resolved: pytorch#49696
    
    And make it static.
    
    Test Plan: Imported from OSS
    
    Reviewed By: navahgar, nickgg
    
    Differential Revision: D25668695
    
    Pulled By: ZolotukhinM
    
    fbshipit-source-id: 8d7fb507d6f3beca70e868d9e0f4c46247311a99
    Mikhail Zolotukhin authored and hwangdeyu committed Dec 23, 2020
    Commit: 1b63e24
  190. [TensorExpr] Move SimpleIREval implementation from .h to .cpp. (pytorch#49697)
    
    Summary:
    Pull Request resolved: pytorch#49697
    
    Mostly mechanical move. This refactoring helps to hide unnecessary
    details from the SimpleIREval interface and make it more similar to a
    pure 'codegen'.
    
    Test Plan: Imported from OSS
    
    Reviewed By: nickgg
    
    Differential Revision: D25668696
    
    Pulled By: ZolotukhinM
    
    fbshipit-source-id: 423247bfcdfa88403e8ec92152f00110bb9da19c
    Mikhail Zolotukhin authored and hwangdeyu committed Dec 23, 2020
    Commit: d1fac89
  191. unbreak mypy torch/quantization (pytorch#49549)

    Summary:
    Pull Request resolved: pytorch#49549
    
    Somehow `mypy torch/quantization` got broken in the past couple of days:
    https://gist.github.com/vkuzo/07af454246f0a68e6fa8929beeec7e0d
    .  I didn't see any relevant PRs other than
    pytorch#47725, which doesn't seem
    related. The error doesn't seem real, as the arguments to
    `_cudnn_rnn_flatten_weight` seem correct. For now,
    ignoring the failure so we have a clean `mypy` run on
    `torch/quantization`.
    
    Test Plan:
    ```
    mypy torch/quantization
    ```
    
    Imported from OSS
    
    Reviewed By: jerryzh168
    
    Differential Revision: D25616972
    
    fbshipit-source-id: 46c207fe1565ec949c0b1f57d6cd0c93f627e6bd
    vkuzo authored and hwangdeyu committed Dec 23, 2020
    Commit: ca537cd
  192. fx quant: types for fusion_patterns.py (pytorch#49606)

    Summary:
    Pull Request resolved: pytorch#49606
    
    Adds more types, for readability.
    
    Test Plan:
    ```
    mypy torch/quantization
    ```
    
    Imported from OSS
    
    Reviewed By: jerryzh168
    
    Differential Revision: D25643894
    
    fbshipit-source-id: 4aad52fe4e59ad74b6e0e3acd0f98fba91561a29
    vkuzo authored and hwangdeyu committed Dec 23, 2020
    Commit: 6d8e9d3
  193. fx quant: add types to observed_module.py (pytorch#49607)

    Summary:
    Pull Request resolved: pytorch#49607
    
    Readability
    
    Test Plan:
    ```
    mypy torch/quantization
    ```
    
    Imported from OSS
    
    Reviewed By: jerryzh168
    
    Differential Revision: D25643895
    
    fbshipit-source-id: b4b8741b07ac4827c3bacd2084df81fbfdd0c2d5
    vkuzo authored and hwangdeyu committed Dec 23, 2020
    Commit: 1de10d5
  194. fx quant: fix types on _find_quants (pytorch#49616)

    Summary:
    Pull Request resolved: pytorch#49616
    
    Add types to `_find_quants` I/O and fix resulting errors,
    needed for an upcoming bug fix.
    
    Test Plan:
    ```
    mypy torch/quantization
    python test/test_quantization.py TestQuantizeFx
    ```
    
    Imported from OSS
    
    Reviewed By: jerryzh168
    
    Differential Revision: D25645719
    
    fbshipit-source-id: 4bf788b55fd4fd086c83a4438b9c2df22b9cff49
    vkuzo authored and hwangdeyu committed Dec 23, 2020
    Commit: 1869bd7
  195. [FX] Fix python code having spurious newlines from placeholders (pytorch#49720)
    
    Summary: Pull Request resolved: pytorch#49720
    
    Test Plan: Imported from OSS
    
    Reviewed By: zdevito
    
    Differential Revision: D25675825
    
    Pulled By: jamesr66a
    
    fbshipit-source-id: a9028acad9c8feb877fff5cd09aedabed52a3f4b
    James Reed authored and hwangdeyu committed Dec 23, 2020
    Commit: c8aefec
  196. [pt][ATen] Optimize bmm (pytorch#49506)

    Summary:
    Pull Request resolved: pytorch#49506
    
    - Get rid of expensive stuff like `TensorArg`, `checkBackend`, `checkSize`, and `TensorAccessor`.
    - Add `checkDim` that does not require creating a `TensorArg` which incurs a refcount bump
    - Avoid unnecessary calls to `torch.select`, which goes through the dispatcher, in the cases we care about: mat1 and mat2 either not permuted or permuted with dims = [0, 2, 1]. The pt version of bmm supports exotic cases such as inputs permuted with dims = [1, 2, 0], which are uncommon in SparseNNs.
    
    Test Plan:
    Unit test:
    ```
    buck test //caffe2/test:linalg
    ```
    
    Benchmark with the adindexer model:
    ```
    Before:
    I1216 14:02:24.155516 2595800 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0847197. Iters per second: 11803.6
    After:
    I1216 14:02:26.583878 2595939 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.082051. Iters per second: 12187.5
    ```
    
    Reviewed By: bwasti
    
    Differential Revision: D25577574
    
    fbshipit-source-id: 8aba69b950e7b4d9d1b14ba837931695a908c068
    Hao Lu authored and hwangdeyu committed Dec 23, 2020
    Commit: f41aa50
  197. [PyTorch] Remove direct reference to native symbols in sparse-related non-native code (pytorch#49721)
    
    Summary:
    Pull Request resolved: pytorch#49721
    
    As a refactor effort of per-app selective build, we are decoupling ATen/native from the rest of aten (D25413998).
    All symbols of ATen/native could only be referenced through dispatcher (pytorch#48684).
    
    This diff is to decouple the native reference recently introduced for sparse tensors.
    ghstack-source-id: 119028080
    
    Test Plan: CI
    
    Reviewed By: dhruvbird, ngimel
    
    Differential Revision: D25675711
    
    fbshipit-source-id: 381cbb3b361ee41b002055399d4996a9ca21377c
    iseeyuan authored and hwangdeyu committed Dec 23, 2020
    Commit: db3f718
  198. [Gradient Compression] Warm-start of PowerSGD (pytorch#49451)

    Summary:
    Pull Request resolved: pytorch#49451
    
    Reuse the low-rank tensors P(s) and Q(s) from the previous iteration if possible.
    
    This can give a better compression performance in terms of both accuracy and speed.
    
    Also add a unit test for batched PowerSGD to test_c10d.py.
    
    Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202
    ghstack-source-id: 119014132
    
    Test Plan:
    buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
    buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook
    
    Reviewed By: rohan-varma
    
    Differential Revision: D25583086
    
    fbshipit-source-id: a757df3c4cfcc0ead4647f7de2f43198f1e063ee
    Yi Wang authored and hwangdeyu committed Dec 23, 2020
    Commit: 4cebcbd
  199. NewModuleTest: Don't call both check_jacobian and gradcheck (pytorch#49566)
    
    Summary:
    Pull Request resolved: pytorch#49566
    
    Fixes pytorch#49422.
    
    check_jacobian and gradcheck do roughly the same thing: they both
    compute an analytic jacobian and a numeric jacobian and check that
    they are equivalent. Furthermore, NewModuleTest will (by default) call
    both check_jacobian and gradcheck, leading to some redundant checks that
    waste CI resources.
    
    However, there is one subtle difference: `check_jacobian` can handle the
    special case where a Module takes in dense inputs and dense parameters
    but returns sparse gradients, but that is not something gradcheck can
    handle. This is only used in the tests for nn.Embedding and
    nn.EmbeddingBag.
    
    This PR does the following:
    - have NewModuleTest call gradcheck instead of check_jacobian by default
    - add a new "has_sparse_gradients" flag to NewModuleTest. These are True
    for the nn.Embedding and nn.EmbeddingBag sparse gradient tests. If
    `has_sparse_gradients` is True, then we call check_jacobian, otherwise,
    we call gradcheck.
    - Kills the "jacobian_input" flag. This flag was used to tell
    NewModuleTest not to attempt to compute the jacobian for the inputs to
    the module. This is only desirable if the input to the module isn't
    differentiable, and it was only set for nn.Embedding /
    nn.EmbeddingBag, which take a LongTensor input. `gradcheck` handles these
    cases automatically by not checking gradients for non-differentiable
    inputs (a minimal gradcheck sketch follows this list).
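    For reference, a minimal example of the kind of `gradcheck` call that now backs most module tests (editor's sketch):
    
    ```python
    import torch
    
    # gradcheck compares analytic and numeric jacobians; double-precision
    # inputs with requires_grad=True are needed for tight tolerances
    x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
    mod = torch.nn.Linear(4, 2).double()
    assert torch.autograd.gradcheck(mod, (x,))
    ```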
    
    Test Plan:
    - Code reading
    - run test_nn.py
    
    Reviewed By: albanD
    
    Differential Revision: D25622929
    
    Pulled By: zou3519
    
    fbshipit-source-id: 8d831ada98b6a95d63f087ea9bce1b574c996a22
    zou3519 authored and hwangdeyu committed Dec 23, 2020
    Commit: 10b5558
  200. [fix] inplace remainder/% (pytorch#49390)

    Summary:
    Fixes pytorch#49214
    
    **BC-Breaking**
    Before this PR, `%=` didn't actually perform the operation in place and returned a new tensor.
    After this PR, the `%=` operation is actually in place and the modified input tensor is returned.
    
    Before PR,
    ```python
    >>> import torch
    >>> a = torch.tensor([11,12,13])
    >>> id(a)
    139627966219328
    >>> a %= 10
    >>> id(a)
    139627966219264
    ```
    
    After PR,
    ```python
    >>> import torch
    >>> a = torch.tensor([11,12,13])
    >>> id(a)
    139804702425280
    >>> a %= 10
    >>> id(a)
    139804702425280
    ```
    
    Pull Request resolved: pytorch#49390
    
    Reviewed By: izdeby
    
    Differential Revision: D25560423
    
    Pulled By: zou3519
    
    fbshipit-source-id: 2b92bfda260582aa4ac22c4025376295e51f854e
    kshitij12345 authored and hwangdeyu committed Dec 23, 2020
    Commit: 25c852b
  201. Complex backward for torch.sqrt (pytorch#49461)

    Summary:
    Pull Request resolved: pytorch#49461
    
    resolves pytorch#48398
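    A small sketch of the check this enables (assuming complex gradcheck support in the same release line):
    
    ```python
    import torch
    
    z = torch.tensor([1.0 + 2.0j, 0.5 - 1.0j],
                     dtype=torch.complex128, requires_grad=True)
    # gradcheck numerically verifies the analytic (Wirtinger) gradient of sqrt
    assert torch.autograd.gradcheck(torch.sqrt, (z,))
    ```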
    
    Test Plan: Imported from OSS
    
    Reviewed By: navahgar
    
    Differential Revision: D25589454
    
    Pulled By: anjali411
    
    fbshipit-source-id: 46e9f913c8ab3e18c98d6f623b2394044b6fe079
    anjali411 authored and hwangdeyu committed Dec 23, 2020
    Commit: bc28081
  202. [ROCm] add 4.0 to nightly builds (pytorch#49632)

    Summary:
    Depends on pytorch/builder#614.
    
    Pull Request resolved: pytorch#49632
    
    Reviewed By: ngimel
    
    Differential Revision: D25665880
    
    Pulled By: walterddr
    
    fbshipit-source-id: b37a55b7e3028648453b422683fa4a72e0ee04a4
    jeffdaily authored and hwangdeyu committed Dec 23, 2020
    Commit: 03214d5
  203. Make PyTorch partially cross-compilable for Apple M1 (pytorch#49701)

    Summary:
    Update CPUINFO to include pytorch/cpuinfo#51
    Update sleef to include shibatch/sleef#376
    Modify aten/src/ATen/native/quantized/cpu/qnnpack/CMakeLists.txt to recognize CMAKE_OSX_ARCHITECTURES
    
    Pull Request resolved: pytorch#49701
    
    Test Plan: `cmake -DCMAKE_OSX_ARCHITECTURES=x86_64 -DPYTHON_EXECUTABLE=/usr/bin/python3  -DUSE_XNNPACK=NO -DBUILD_TEST=YES .. -G Ninja; ninja basic` finishes successfully on Apple M1
    
    Reviewed By: janeyx99
    
    Differential Revision: D25669219
    
    Pulled By: malfet
    
    fbshipit-source-id: 5ee36b64e3a7ac76448f2a300ac4993375a26de5
    malfet authored and hwangdeyu committed Dec 23, 2020
    Commit: a813673
  204. [onnxifi] Get rid of class member (pytorch#49380)

    Summary:
    Pull Request resolved: pytorch#49380
    
    Couldn't resist removing a class member that is only used in one function.
    
    Reviewed By: yinghai
    
    Differential Revision: D25547366
    
    fbshipit-source-id: 74e61c6a0068566fb7956380862999163e7e94bf
    khabinov authored and hwangdeyu committed Dec 23, 2020
    Commit: 2d2a1f6
  205. Reland: Add base forward grad logic (pytorch#49734)

    Summary:
    Pull Request resolved: pytorch#49734
    
    RFC: pytorch/rfcs#11
    
    This PR add the basic logic to handle forward grad as dual Tensors.
    It contains the following:
    - Mechanism to save dual state on a Tensor and clear it up when the dual level ends
    - C++ and python user facing API
    - Updated view system that is able to track both forward and backward views
    
    The current PR has the following limitations:
    - Extensive tests are in the next PR in the stack as formulas are needed to write full tests.
    - Only the manual formulas have been audited and no other formula is actually implemented here (they are in the next PR in the stack)
    - Only level 0 is allowed for now. This was discussed and agreed that it is not needed for the first version of this PR.
    - We can save one ViewInfo creation when both the forward and backward views have the same base. This can be done by adding a boolean flag to the DifferentiableViewMeta and extra logic in the `as_view` method. This is left out to keep this PR concise.
    - We can skip tracking forward views if the base has a forward grad. This can be done by adding extra logic in the `as_view` method. This is left out to keep this PR concise.
    
    Reading guide:
    - Updated view handling in [gen_variable_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-f6553cec68caeaea36f6c8b14ff76a6d39dfd774e0ea9ef2f76e8d81fd9af5df), [VariableTypeUtils.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-ec71cfa45954dece1236c661d170e6341879c5be637f4abf52e826d61b40695a), [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285) (skip code below "[Forward Grad View]" for now), [variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-1604bcd0e4350ed99ec45e437cee7ac9ebe337392c9ea16a236247aeeb35b02bR266-R542) and [custom_function.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-dd85f452082b5bb6612bbc12adb496f8827defa228509f7b493de1d517522d5d). This introduces the new ViewInfo to hold view informations shared for forward and backward. It also updates the differentiable view meta to use this. And it updates the as_view function to handle both forward and backward view.
    - New forward grad class that handle storing gradients and tracking at each level [forward_grad.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c6c5b9ab2d7e5dde4102495faa1b6bbbfc23aa3e47deb7359c0bfe1eb004c0cb), [forward_grad.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-de2ab54ade7312701850d71a119a4f4ee4b9fc5a9c42a467cdd4e73c033531dd) and [build_variables.bzl](https://github.com/pytorch/pytorch/pull/49097/files#diff-dfdfa2efb17beddfd9094524f95351fd197db6c8857e96b436fb599870359325). EDIT: These files also contain the new flag to globally disable forward AD that allows us to reduce performance issues while this is in development.
    - Lowest level API and binding between Tensor and AutogradMeta in [TensorBody.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-7554853205392fa743357bf845ecc350a974ec049383248c12daaf2f4de04911), [TensorImpl.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-052bd9150ef8e09289ddf644b5a6830ede49207201cd41728f6d7cc6d9cead94), [TensorImpl.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-a15aae4cf23da44970db7cece62ff981265575c798c62f7b52d87c8809dfe2e1) and the rest of [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285R557-R677)
    - API to access the forward primal that needs to be a differentiable function (and so in native_functions.yaml) [native_functions.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991) [NamedRegistrations.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-69bd3bea510c9b64e1633fa18c3ea63d4b8348dbad3a78ad9de844ab3e43dc1d), [VariableMethodsStub.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-23f5fcb737a2b289811fe0f4b65aef775e7c824b2e629ecd343df51405cd434f), [derivatives.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_python_functions.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_trace_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-54e0b976027bf8debefb959ff360b89ae93466970c843365b1b3a03806d868ce), [TraceTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-f34636741ad4a23d018e0c289bc750c3bad887b45660e1d6eaf440d234a78fbf) and [part of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R198-R243)
    - c++ API [autograd.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-349028fbe8291a965a7a263c323b208fe071c35c66179ee997ef84fa81aa4b1e), [autograd.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-a3fe908d67dfec16a1fcde300de68b0701bf68b88db7451f29f2bee255cf30c9)
    - python binding [init.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-c58a67c85191c22c9b3bb439117d8053edfd9dea839fa010cf967d404c3c630d)
    - python API [forward_ad.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a4efad4ba18fffdfb264c21e5475997a24a743089a899f8ec1a5ff962c6738d9), [autograd/__init__.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-743abcafd32ad0e69f39ac5a91df4197b7e1921c135cacee7ef6dc829a8a7af8)
    - c++ and python printing [Formatting.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-881dba501e71662e2e4818b4b016f739b344c8aed2f5edc6b871eda47a2aced0), [_tensor_str.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a7911f8d5e73adbff914d99fd7818ace2a7030b6a3748abe06ec6fc6e3df9cc3)
    - Utility for formulas and updated manual functions to respect new view system as well as forward grad [FunctionsManual.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-6378bb6dc81a64dab676d61731341fa5d1088418f32a1473a33a0ccfc2357dc1), [FunctionsManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-4adbd88239afcd60e8198aab65d4f5e43b62314e34b80551e997a1ea503adea5) [rest of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R264-R433)
    - Ensure SavedVariable save forward grad properly [saved_variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c1b8039d776241abe177d5aa99b79dd9489a9b3e529da8ab24c2e386c1238ae2), [saved_variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-cc9fba479b5beae06b2eea2e390d17796e0341c5b037a20b5bcaccbb0c341030)
    
    Test Plan: Imported from OSS
    
    Reviewed By: gchanan
    
    Differential Revision: D25678797
    
    Pulled By: albanD
    
    fbshipit-source-id: 3d58550c11b5f58b9b73fd30596d042b857fb9dd
    albanD authored and hwangdeyu committed Dec 23, 2020
    Commit: e241e1a
  206. Fix get_overlap_status for tensors without storage (pytorch#49638)

    Summary: Pull Request resolved: pytorch#49638
    
    Reviewed By: ngimel
    
    Differential Revision: D25681908
    
    Pulled By: asuhan
    
    fbshipit-source-id: 2ea8623614f2f0027f6437cf2819ba1657464f54
    asuhan authored and hwangdeyu committed Dec 23, 2020
    Commit: 1c39e42
  207. Minor doc fix: change truncating to rounding in TF32 docs (pytorch#49625)
    
    Summary:
    Minor doc fix clarifying that the input data is rounded, not truncated.
    
    CC zasdfgbnm ngimel
    
    Pull Request resolved: pytorch#49625
    
    Reviewed By: mruberry
    
    Differential Revision: D25668244
    
    Pulled By: ngimel
    
    fbshipit-source-id: ac97e41e0ca296276544f9e9f85b2cf1790d9985
    pbialecki authored and hwangdeyu committed Dec 23, 2020
    Commit: 4406379
  208. remove unused THCBlas (pytorch#49725)

    Summary:
    Removes the unused THCBlas and calls `at::cuda::blas::gemm` directly where needed.
    
    Pull Request resolved: pytorch#49725
    
    Reviewed By: mruberry
    
    Differential Revision: D25680831
    
    Pulled By: ngimel
    
    fbshipit-source-id: d826f3f558b156f45f2a4864daf3f6d086bda78c
    ngimel authored and hwangdeyu committed Dec 23, 2020
    Commit: 40e15e5
  209. only upload s3 stats on master, nightly, and release branch (pytorch#49645)
    
    Summary: Pull Request resolved: pytorch#49645
    
    Reviewed By: malfet
    
    Differential Revision: D25665851
    
    Pulled By: walterddr
    
    fbshipit-source-id: 1cf50f6e3657f70776aaf3c5d3823c8a586bf22d
    Rong Rong (AI Infra) authored and hwangdeyu committed Dec 23, 2020
    Commit: 5e176cb
  210. Merge pull request #1 from pytorch/onnx_ms_1

    merge code
    hwangdeyu committed Dec 23, 2020
    Commit: 73985d9
  211. Commit: 2bfe745

Commits on Jan 4, 2021

  1. Commit: 525ac26
  2. Commit: 9259b03
  3. [ONNX] Add checks in ONNXSetDynamicInputShape (pytorch#49783)

    * [ONNX] Add checks in ONNXSetDynamicInputShape
    
    * [ONNX] Add checks in ONNXSetDynamicInputShape
    jiafatom authored and BowenBao committed Jan 4, 2021
    Commit: 1baebbb

Commits on Jan 5, 2021

  1. [ONNX] Enable export of aten::__derive_index (pytorch#49514)

    * Add derive_index
    
    * Add derive_index test
    
    * Adding more tests
    
    * Update symbolic_opset9.py
    neginraoof committed Jan 5, 2021
    Commit: 4898616
  2. [ONNX] Update symbolic for unfold (pytorch#49378)

    * update symbolic for unfold
    
    * update symbolic_opse12 file
    
    * update symbolic_opse12 file
    
    * [ONNX] Support onnx if/loop sequence output in opset 13 - (pytorch#49270)
    
    * Symbolic function for torch.square (pytorch#49446)
    
    * instead of a pass use a helper function
    
    * update ort version
    
    * Revert "instead of a pass use a helper function"
    
    This reverts commit 723b446.
    
    * update symbolics
    
    * update symbolic
    
    * update symbolics
    
    * [ONNX] Support onnx if/loop sequence output in opset 13 - (pytorch#49270)
    
    * Symbolic function for torch.square (pytorch#49446)
    
    * empty commit
    
    * fix clang-tidy
    
    * fix clang-tidy
    
    Co-authored-by: Bowen Bao <bowbao@microsoft.com>
    Co-authored-by: David Fan <30608893+jiafatom@users.noreply.github.com>
    3 people committed Jan 5, 2021
    Commit: eef5191
  3. [ONNX] Update the sequence of initializers in the exported graph so that it is the same as the inputs. (pytorch#49798)
    
    * [ONNX] Support onnx if/loop sequence output in opset 13 - (pytorch#49270)
    
    * Symbolic function for torch.square (pytorch#49446)
    
    * [ONNX] Support onnx if/loop sequence output in opset 13 - (pytorch#49270)
    
    * Symbolic function for torch.square (pytorch#49446)
    
    * Update code so that the initializers' sequence is the same as the inputs.
    
    * Correct the format according to flake8.
    
    * Correct the format by clang-format.
    
    * Add a new test for script model.
    
    * Update expect files for Test_Operators tests.
    
    Co-authored-by: Bowen Bao <bowbao@microsoft.com>
    Co-authored-by: David Fan <30608893+jiafatom@users.noreply.github.com>
    3 people committed Jan 5, 2021
    Commit: 97a8af1

Commits on Jan 6, 2021

  1. [ONNX] Enable opset 13 ops (pytorch#49612)

    * Enable opset 13 ORT tests
    
    * Update test.sh
    
    * Set environ var
    
    * Update test.sh
    
    * Enabling more ops for opset 13
    
    * change master to main
    
    * Update symbolic_opset13.py
    
    * Flake 8 fix
    
    * [ONNX] Support onnx if/loop sequence output in opset 13 - (pytorch#49270)
    
    * Symbolic function for torch.square (pytorch#49446)
    
    * Clean up tests
    
    * Exclude more tests
    
    * Trigger build
    
    * [ONNX] Support onnx if/loop sequence output in opset 13 - (pytorch#49270)
    
    * Symbolic function for torch.square (pytorch#49446)
    
    * update ORT version
    
    * disable more tests
    
    * clean up
    
    * flake8
    
    * Disable TV tests
    
    * Update test_pytorch_onnx_onnxruntime.py
    
    Co-authored-by: Bowen Bao <bowbao@microsoft.com>
    Co-authored-by: David Fan <30608893+jiafatom@users.noreply.github.com>
    3 people committed Jan 6, 2021
    Commit: 616da7c
  2. Merge branch 'onnx_ms_1' of https://github.com/pytorch/pytorch into pytorch-onnx_ms_1
    hwangdeyu committed Jan 6, 2021
    Commit: b3ae16c
  3. Commit: 2c69cf3
  4. Reland: Add base forward grad logic (pytorch#49734)

    Summary:
    Pull Request resolved: pytorch#49734
    
    RFC: pytorch/rfcs#11
    
    This PR add the basic logic to handle forward grad as dual Tensors.
    It contains the following:
    - Mechanism to save dual state on a Tensor and clear it up when the dual level ends
    - C++ and python user facing API
    - Updated view system that is able to track both forward and backward views
    
    The current PR has the following limitations:
    - Extensive tests are in the next PR in the stack as formulas are needed to write full tests.
    - Only the manual formulas have been audited and no other formula is actually implemented here (they are in the next PR in the stack)
    - Only level 0 is allowed for now. This was discussed and agreed that it is not needed for the first version of this PR.
    - We can save one ViewInfo creation when both the forward and backward views have the same base. This can be done by adding a boolean flag to the DifferentiableViewMeta and extra logic in the `as_view` method. This is left out to keep this PR concise.
    - We can skip tracking forward views if the base has a forward grad. This can be done by adding extra logic in the `as_view` method. This is left out to keep this PR concise.
    
    Reading guide:
    - Updated view handling in [gen_variable_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-f6553cec68caeaea36f6c8b14ff76a6d39dfd774e0ea9ef2f76e8d81fd9af5df), [VariableTypeUtils.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-ec71cfa45954dece1236c661d170e6341879c5be637f4abf52e826d61b40695a), [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285) (skip code below "[Forward Grad View]" for now), [variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-1604bcd0e4350ed99ec45e437cee7ac9ebe337392c9ea16a236247aeeb35b02bR266-R542) and [custom_function.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-dd85f452082b5bb6612bbc12adb496f8827defa228509f7b493de1d517522d5d). This introduces the new ViewInfo to hold view informations shared for forward and backward. It also updates the differentiable view meta to use this. And it updates the as_view function to handle both forward and backward view.
    - New forward grad class that handle storing gradients and tracking at each level [forward_grad.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c6c5b9ab2d7e5dde4102495faa1b6bbbfc23aa3e47deb7359c0bfe1eb004c0cb), [forward_grad.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-de2ab54ade7312701850d71a119a4f4ee4b9fc5a9c42a467cdd4e73c033531dd) and [build_variables.bzl](https://github.com/pytorch/pytorch/pull/49097/files#diff-dfdfa2efb17beddfd9094524f95351fd197db6c8857e96b436fb599870359325). EDIT: These files also contain the new flag to globally disable forward AD that allows us to reduce performance issues while this is in development.
    - Lowest level API and binding between Tensor and AutogradMeta in [TensorBody.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-7554853205392fa743357bf845ecc350a974ec049383248c12daaf2f4de04911), [TensorImpl.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-052bd9150ef8e09289ddf644b5a6830ede49207201cd41728f6d7cc6d9cead94), [TensorImpl.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-a15aae4cf23da44970db7cece62ff981265575c798c62f7b52d87c8809dfe2e1) and the rest of [variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-60e3bfe444e89efc7149f25b38e472710525984789934ab83f1bd5671b8ff285R557-R677)
    - API to access the forward primal that needs to be a differentiable function (and so in native_functions.yaml) [native_functions.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991) [NamedRegistrations.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-69bd3bea510c9b64e1633fa18c3ea63d4b8348dbad3a78ad9de844ab3e43dc1d), [VariableMethodsStub.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-23f5fcb737a2b289811fe0f4b65aef775e7c824b2e629ecd343df51405cd434f), [derivatives.yaml](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_python_functions.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-e4c2f99a2404e98c3586e07425da73008f36b1bada790648a7297af141d37f8c), [gen_trace_type.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-54e0b976027bf8debefb959ff360b89ae93466970c843365b1b3a03806d868ce), [TraceTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-f34636741ad4a23d018e0c289bc750c3bad887b45660e1d6eaf440d234a78fbf) and [part of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R198-R243)
    - c++ API [autograd.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-349028fbe8291a965a7a263c323b208fe071c35c66179ee997ef84fa81aa4b1e), [autograd.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-a3fe908d67dfec16a1fcde300de68b0701bf68b88db7451f29f2bee255cf30c9)
    - python binding [init.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-c58a67c85191c22c9b3bb439117d8053edfd9dea839fa010cf967d404c3c630d)
    - python API [forward_ad.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a4efad4ba18fffdfb264c21e5475997a24a743089a899f8ec1a5ff962c6738d9), [autograd/__init__.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-743abcafd32ad0e69f39ac5a91df4197b7e1921c135cacee7ef6dc829a8a7af8)
    - c++ and python printing [Formatting.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-881dba501e71662e2e4818b4b016f739b344c8aed2f5edc6b871eda47a2aced0), [_tensor_str.py](https://github.com/pytorch/pytorch/pull/49097/files#diff-a7911f8d5e73adbff914d99fd7818ace2a7030b6a3748abe06ec6fc6e3df9cc3)
    - Utility for formulas and updated manual functions to respect new view system as well as forward grad [FunctionsManual.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-6378bb6dc81a64dab676d61731341fa5d1088418f32a1473a33a0ccfc2357dc1), [FunctionsManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-4adbd88239afcd60e8198aab65d4f5e43b62314e34b80551e997a1ea503adea5) [rest of VariableTypeManual.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-6e19a1bce8cbdba8714b6e2c794a76bc0864b64a49cfa757cb0b5afdc937d1a4R264-R433)
    - Ensure SavedVariable save forward grad properly [saved_variable.h](https://github.com/pytorch/pytorch/pull/49097/files#diff-c1b8039d776241abe177d5aa99b79dd9489a9b3e529da8ab24c2e386c1238ae2), [saved_variable.cpp](https://github.com/pytorch/pytorch/pull/49097/files#diff-cc9fba479b5beae06b2eea2e390d17796e0341c5b037a20b5bcaccbb0c341030)
    
    Test Plan: Imported from OSS
    
    Reviewed By: gchanan
    
    Differential Revision: D25678797
    
    Pulled By: albanD
    
    fbshipit-source-id: 3d58550c11b5f58b9b73fd30596d042b857fb9dd
    albanD authored and hwangdeyu committed Jan 6, 2021
    Commit: c92808f
  5. Commit: d144220
  6. fix format

    hwangdeyu committed Jan 6, 2021
    Commit: e6dd64a
  7. add comprehensive tests

    hwangdeyu committed Jan 6, 2021
    Commit: 0992510
  8. Commit: cdc08ce

Commits on Jan 14, 2021

  1. Merge remote-tracking branch 'origin1/onnx_ms_1' into deyu/bce_with_logits_sy12
    hwangdeyu committed Jan 14, 2021
    Commit: 0e09ee9

Commits on Jan 15, 2021

  1. Merge remote-tracking branch 'origin1/onnx_ms_1' into deyu/bce_with_logits_sy12
    hwangdeyu committed Jan 15, 2021
    Commit: d2ebe7e

Commits on Jan 18, 2021

  1. Commit: 5275cc5