Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize DLPack stride to 1 where shape < 2 #83158

Closed
wants to merge 6 commits into from

Conversation

mattip
Copy link
Collaborator

@mattip mattip commented Aug 10, 2022

Fixes #83069. Also move all the dlpack tests to a new file., test_dlpack.py.

The fix involves always allocating a "strides" int array when converting to dlPack and deleting the strides when the capsule descructor is called. Then the strides are copied from the tensor, and strides[i] is set to 1 where shape[i] < 2.

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Aug 10, 2022

🔗 Helpful links

❌ 2 New Failures

As of commit d117e02 (more details on the Dr. CI page):

Expand to see more
  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-docs / build-docs (python) (1/2)

Step: "Unknown" (full log | diagnosis details)

2022-08-23T16:49:06.0518486Z ##[error]The operation was canceled.
2022-08-23T10:54:23.6114179Z     tasks.add_task(write_process, arg, on_chunk_done)
2022-08-23T10:54:23.6114818Z   File "/opt/conda/lib/python3.7/site-packages/sphinx/util/parallel.py", line 88, in add_task
2022-08-23T10:54:23.6115255Z     self._join_one()
2022-08-23T10:54:23.6120556Z   File "/opt/conda/lib/python3.7/site-packages/sphinx/util/parallel.py", line 114, in _join_one
2022-08-23T10:54:23.6123136Z     raise SphinxParallelError(*result)
2022-08-23T10:54:23.6125883Z sphinx.errors.SphinxParallelError: ConnectionRefusedError: [Errno 111] Connection refused
2022-08-23T10:54:23.6128331Z 
2022-08-23T10:54:23.6130967Z Sphinx parallel build error:
2022-08-23T10:54:23.6131687Z ConnectionRefusedError: [Errno 111] Connection refused
2022-08-23T10:54:38.8989312Z make: *** [Makefile:42: html] Error 2
2022-08-23T16:49:06.0518486Z ##[error]The operation was canceled.
2022-08-23T16:49:06.0539122Z Prepare all required actions
2022-08-23T16:49:06.0555874Z ##[group]Run ./.github/actions/chown-workspace
2022-08-23T16:49:06.0556090Z ##[endgroup]
2022-08-23T16:49:06.0568871Z ##[group]Run docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
2022-08-23T16:49:06.0569216Z �[36;1mdocker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .�[0m
2022-08-23T16:49:06.0580482Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2022-08-23T16:49:06.0580706Z env:
2022-08-23T16:49:06.0580945Z   ALPINE_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine
2022-08-23T16:49:06.0581171Z ##[endgroup]
2022-08-23T16:49:07.6617487Z Post job cleanup.

See GitHub Actions build pull / linux-bionic-py3_7-clang8-xla / test (xla, 1, 1, linux.2xlarge) (2/2)

Step: "Test" (full log | diagnosis details)

2022-08-23T11:00:32.0451544Z ##[error]Process completed with exit code 1.
2022-08-23T11:00:32.0399277Z Non-cacheable calls                   0
2022-08-23T11:00:32.0399837Z Non-compilation calls                 0
2022-08-23T11:00:32.0400313Z Unsupported compiler calls            0
2022-08-23T11:00:32.0400802Z Average cache write               0.000 s
2022-08-23T11:00:32.0401277Z Average cache read miss           0.000 s
2022-08-23T11:00:32.0401731Z Average cache read hit            0.000 s
2022-08-23T11:00:32.0402224Z Failed distributed compilations       0
2022-08-23T11:00:32.0403497Z Cache location                  S3, bucket: Bucket(name=ossci-compiler-cache-circleci-v2, base_url=http://ossci-compiler-cache-circleci-v2.s3.amazonaws.com/)
2022-08-23T11:00:32.0404405Z + echo ::endgroup::
2022-08-23T11:00:32.0405368Z ##[endgroup]
2022-08-23T11:00:32.0451544Z ##[error]Process completed with exit code 1.
2022-08-23T11:00:32.0502694Z Prepare all required actions
2022-08-23T11:00:32.0503015Z Getting action download info
2022-08-23T11:00:32.2267057Z ##[group]Run ./.github/actions/get-workflow-job-id
2022-08-23T11:00:32.2267292Z with:
2022-08-23T11:00:32.2267653Z   github-token: ***
2022-08-23T11:00:32.2267821Z env:
2022-08-23T11:00:32.2268005Z   GIT_DEFAULT_BRANCH: master
2022-08-23T11:00:32.2268201Z ##[endgroup]
2022-08-23T11:00:32.2317133Z ##[group]Run nick-fields/retry@71062288b76e2b6214ebde0e673ce0de1755740a
2022-08-23T11:00:32.2317371Z with:

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@mattip mattip changed the title Dlpack stride Normalize DLPack stride to 1 where shape < 2 Aug 10, 2022
@vadimkantorov
Copy link
Contributor

stride can still be greater than 1 even if ndim == 1, right? (can be a valid sub-tensor, right?)

@mattip
Copy link
Collaborator Author

mattip commented Aug 10, 2022

The question is not about ndim, but about the shape. Wherever shape[i] < 2, it is desirable to normalize strides[i] to 1. This PR only normalizes for toDLPack, since a general cleanup is too invasive.

@vadimkantorov
Copy link
Contributor

Oh, I see!

Copy link
Collaborator

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mattip! Added a few comments

with self.assertRaises(RuntimeError):
to_dlpack(x)

# TODO: increase tests once NumPy supports the `__dlpack__` protocol
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy 1.23 already supports this - can this be done in this PR? Or otherwise, let's at least update the comment to save the next person some time

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated comment to # TODO: add interchange tests once NumPy 1.22 (dlpack support) is required

y.__dlpack__()

@skipMeta
def test_dlpack_normalize_strides(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for other reviewers: this is the new test (the rest of the tests are just moving code around)

self.assertEqual(y.stride(), (3,))
z = from_dlpack(y)
self.assertEqual(z.shape, (1,))
# gh-83069, toDLPack should normalize strides
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the report was about __dlpack__, and that is the main method (at least going forward). so perhaps tweak this comment?

Copy link
Collaborator Author

@mattip mattip Aug 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to # gh-83069, make sure __dlpack__ normalizes strides

self.assertEqual(z.shape, (1,))
# gh-83069, toDLPack should normalize strides
self.assertEqual(z.stride(), (1,))
# TODO: are there more complicated cases that are still not handled?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment has more: #82610 (comment). In general, I think any size 0 or size 1 dimension in an n-dim tensor

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumPy does not do this though, so I suggest adding this as a comment on the implementation and removing this TODO. Unless @dalcinl has a need for PyTorch to do more than NumPy does here regarding strides normalization.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the changes in this PR are enough to make the current mpi4py release work properly.
Upcoming releases of mpi4py will NOT require normalized strides to detect contiguity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this comment and mentioned gh-83069 in the implementation

@@ -227,12 +228,23 @@ DLManagedTensor* toDLPack(const Tensor& src) {
atDLMTensor->tensor.dl_tensor.device = getDLDevice(src, device_id);
atDLMTensor->tensor.dl_tensor.ndim = src.dim();
atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src);
// Normalize the strides to 1 wherever shape < 2
auto shape = src.sizes().data();
int64_t *strides = new int64_t[src.dim()];
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you guys have something like a MAX_DIM concept? perhaps you could add a slot strides[MAX_DIM] in the ATenDLMTensor structure to store the strides. That way you can implement this hole thing with a single new allocation.

int64_t *strides = ATenDLMTensor->strides;

PS: This is just a suggestion for an alternative implementation. I'm totally fine with things as they are now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a concept of MAX_DIMS but I am not sure the src.sizes.data() is forced to be a int64_t[MAX_DIMS]. If it is shorter, arbitrary memory would be copied into the ATenDLMTensor.strides.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure the src.sizes.data() is forced to be a int64_t[MAX_DIMS]

It does not have to be. It does not matter.

If it is shorter, arbitrary memory would be copied into the ATenDLMTensor.strides.

No, simply because the for loop below fills up to src.dim() entries in the stride buffer. Any entries beyond src.dim() will be unused.

The real issue is whether you are willing to pay the extra 1 KiB needed to store the MAX_DIMS strides vs saving one new allocation call.

Am I making myself clear enough?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh right. I misunderstood. I will let other reviewers weigh in whether they want to avoid the new call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really care about perf here, I just want to minimize manual memory management. Probably the slickest way to do this is to create a new Tensor that has the normalized strides and then use the fields on that tensor directly from dlpack. WDYT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, as_strided is the ticket here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to use as_strided I need to allocate an int vector for the new strides, right? I am not sure what advantage creating a new container gives over using the existing capsule deallocate mechanism.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, the benefit is you don't have to manually memory manage strides separately from Tensor object. Tensor is the owning object that is managed and everything else hangs off it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, the benefit is you don't have to manually memory manage strides separately from Tensor object. Tensor is the owning object that is managed and everything else hangs off it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (assuming CI passes)

@mattip
Copy link
Collaborator Author

mattip commented Aug 14, 2022

There is a test failure in TestONNXExport.test_source_range_propagation. I don't think it is connected

Traceback
[gw3] linux -- Python 3.7.13 /opt/conda/bin/python
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/onnx/test_pytorch_onnx_no_runtime.py", line 478, in test_source_range_propagation
    operator_export_type=torch.onnx.OperatorExportTypes.ONNX,
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 1071, in _model_to_graph
    module=module,
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 625, in _optimize_graph
    graph = _C._jit_pass_onnx(graph, operator_export_type)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/utils.py", line 1753, in _run_symbolic_function
    return symbolic_fn(g, *inputs, **attrs)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/symbolic_helper.py", line 254, in wrapper
    return fn(g, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/onnx/symbolic_opset9.py", line 2286, in layer_norm
    g, input, normalized_shape, weight, bias, eps, cudnn_enable, False
TypeError: cannot unpack non-iterable torch._C.Value object 
(Occurred when translating layer_norm).

@ezyang ezyang self-requested a review August 15, 2022 14:53
@ezyang ezyang added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Aug 15, 2022
@ezyang
Copy link
Contributor

ezyang commented Aug 23, 2022

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@pytorchmergebot
Copy link
Collaborator

Merge failed
Reason: Refusing to merge as mandatory check(s) pull failed for rule Core Maintainers
Raised by https://github.com/pytorch/pytorch/actions/runs/2912012177

@ezyang
Copy link
Contributor

ezyang commented Aug 23, 2022

@pytorchbot merge -f "known failure"

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the force (-f) flag. This means your change will be merged immediately, bypassing any CI checks (ETA: 1-5 minutes). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@github-actions
Copy link

Hey @mattip.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Aug 24, 2022
Summary:
Fixes #83069. Also move all the dlpack tests to a new file., `test_dlpack.py`.

The fix involves always allocating a "strides" int array when converting to dlPack and deleting the strides when the capsule descructor is called. Then the strides are copied from the tensor, and `strides[i]` is set to `1` where `shape[i] < 2`.

Pull Request resolved: #83158
Approved by: https://github.com/ezyang

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/4dfa6d28a139e8325fe9b255af86e8d1360ae7ee

Reviewed By: weiwangmeta

Differential Revision: D38947365

fbshipit-source-id: 3b621efe3fc8f1798f2d285cfa818548d3bfca7e
zasdfgbnm added a commit to csarofeen/pytorch that referenced this pull request Sep 30, 2022
* hash update - bug fix for branches (#83865)

hash updates for xla were failing because the current pinned hash is a branch, so the git command for getting the date couldn't find the branch due to not having a local version of the branch.  Fixed by checking out the branch to make sure it exists locally.

example of failure: https://github.com/pytorch/pytorch/runs/7913835742?check_suite_focus=true

Test plan:
made it pull request trigger and ran, to get this:
https://github.com/pytorch/pytorch/runs/7959221184?check_suite_focus=true
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83865
Approved by: https://github.com/zengk95

* [FSDP] Remove unneeded checks (#83150)

@awgu pointed out these checks aren't really doing anything, as they just make sure we're setting training state in certain ways throughout FSDP and is sort of arbitrary. So, removing them to avoid confusion.

We still keep the checking around `_post_backward_called` because this is needed in `finalize_params` for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83150
Approved by: https://github.com/awgu

* [BE] Revert distributed change in https://github.com/pytorch/pytorch/pull/68779 (#83181)

https://github.com/pytorch/pytorch/issues/82641 points out a regression in how inputs / outputs are processed by DDP, blocking their HF use case. It was narrowed down to https://github.com/pytorch/pytorch/pull/68779 and reverting the distributed change there fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83181
Approved by: https://github.com/kumpera

* Transpose scheduler small dim sizes better support (#1910)

* Optimize transpose copy on CPU using fbgemm transpose (#83327)

### Description
Optimize transpose copy on CPU using fbgemm transpose

### Testing
single socket (28cores):
```
before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 4.819e-05 ms; bf16: 4.846e-05 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000171 ms; bf16: 0.000129 ms

after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10])  fp32: 2.439e-05 ms; bf16: 2.152e-05 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000132 ms; bf16: 3.916e-05 ms
```
single core:
```
before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.00109 ms;  bf16: 0.00103 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00339 ms; bf16: 0.00295 ms

after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.000566  ms; bf16: 0.000382 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00282 ms; bf16: 0.000999 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83327
Approved by: https://github.com/frank-wei

* Grouped grid welford (#1921)

Enables grouping of grid welford ops across iterations. Same functionality as the iteration grouping for GridReduction. This ins intended to improve the outer-norm grid persistence in batchnorm-like fusions.

* [ONNX] Use `errors.SymbolicValueError` for more context (#83332)

Replace runtime errors in torch.onnx with `errors.SymbolicValueError` for more context around jit values.

- Extend `_unimplemented`, `_onnx_unsupported`, `_onnx_opset_unsupported`, `_onnx_opset_unsupported_detailed` errors to include JIT value information
- Replace plain RuntimeError with `errors.SymbolicValueError`
- Clean up: Use `_is_bool` to replace string comparison on jit types
- Clean up: Remove the todo `Remove type ignore after #81112`

#77316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83332
Approved by: https://github.com/AllenTiTaiWang, https://github.com/thiagocrepaldi, https://github.com/BowenBao

* [quant][fx] Add support for quantized matmul (#83885)

Summary:
att, probably missed the op during migration to the reference flow

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_qmatmul

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83885
Approved by: https://github.com/andrewor14

* Misc fixes/tuning for transpose scheduler (#1912)

* [nn] split rnn_utils test from test_nn.py (#83675)

Ref: https://github.com/pytorch/pytorch/issues/63085
Proposed folder structure
```
-> test
  -> nn
    -> test_conv.py
    -> test_pooling.py
    -> .....
```

This PR: Moves test related RNN utilities to a different file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83675
Approved by: https://github.com/albanD

* [optim] rprop: handle complex params as independent real params (#83858)

Ref #65711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83858
Approved by: https://github.com/albanD

* [xla hash update] update the pinned xla hash (#83899)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83899
Approved by: https://github.com/pytorchbot

* [ROCm] More Sparse UTs enablement and more hipification mappings. (#78939)

Enables:

 test_bmm_cuda_float64
 test_bmm_deterministic_cuda_float64
 test_csr_matvec_cuda_complex128
 test_csr_matvec_cuda_complex64
 test_csr_matvec_cuda_float32
 test_csr_matvec_cuda_float64

To enable the above tests had to add some more hip mappings for the hipification process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78939
Approved by: https://github.com/pruthvistony, https://github.com/malfet

* Normalize DLPack stride to 1 where shape < 2 (#83158)

Fixes #83069. Also move all the dlpack tests to a new file., `test_dlpack.py`.

The fix involves always allocating a "strides" int array when converting to dlPack and deleting the strides when the capsule descructor is called. Then the strides are copied from the tensor, and `strides[i]` is set to `1` where `shape[i] < 2`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83158
Approved by: https://github.com/ezyang

* Remove DBR quantization from the codebase (#83642)

Summary:

DBR quantization is a no-go for now because it does not align well with
PyTorch 2.0 plans and we do not want to build yet another tracing system.

Deleting it from the codebase for now since there are no plans to develop
this in the near future. We can bring it back at a later time if necessary.

Test plan:

CI

Differential Revision: [D38839556](https://our.internmc.facebook.com/intern/diff/D38839556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83642
Approved by: https://github.com/andrewor14, https://github.com/jerryzh168

* Refactored ops on size to be dispatcher ops (#83719)

An example of how the graph looks now.
```
def forward(self, x_1):
    size = torch.ops.math.size(x_1, 0)
    size_1 = torch.ops.math.size(x_1, 1);  x_1 = None
    ones = torch.ops.aten.ones.default([1], device = device(type='cpu'), pin_memory = False)
    expand_sym_int = torch.ops.aten.expand.SymInt(ones, [size, size_1]);  ones = size = size_1 = None
    cos_default = torch.ops.aten.cos.default(expand_sym_int);  expand_sym_int = None
    return (cos_default,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83719
Approved by: https://github.com/ezyang

* Fix stride issue with faketensors (#83822)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83822
Approved by: https://github.com/ezyang, https://github.com/ngimel

* Nullary RNGOp (#1892)

* [ROCm] restore MIOpen benchmark flag default to true (#82656)

### Description
PR https://github.com/pytorch/pytorch/pull/77438 allowed MIOpen to support the benchmark flag. Previously, the benchmark flag was ignored by MIOpen such that benchmarking was always turned on. This commit restores the behavior that MIOpen benchmarking is by default turned on.

### Testing
CI unit tests cover this capability.  Torchvision models demonstrate the performance delta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82656
Approved by: https://github.com/ngimel

* Update retry action to latest version (#83911)

We're running into EPERM issues when trying to install nvidia tools, see failure example https://github.com/pytorch/pytorch/runs/7975726013?check_suite_focus=true.
```
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1049
            throw err;
            ^

Error: kill EPERM
    at process.kill (internal/process/per_thread.js:199:13)
    at killPid (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1059:17)
    at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1036:21
    at Array.forEach (<anonymous>)
    at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1034:23
    at Array.forEach (<anonymous>)
    at killAll (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1033:27)
    at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1024:13
    at ChildProcess.onClose (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1080:17)
    at ChildProcess.emit (events.js:314:20) {
  errno: 'EPERM',
  code: 'EPERM',
  syscall: 'kill'
}

```

The root issue probably lies elsewhere but this action is not helping/the errors seem to say it's unable to kill child processes. A more recent commit in that repo uses spawn instead of exec which might make a difference.

Regardless, we should keep our actions up to date anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83911
Approved by: https://github.com/malfet

* [PyTorch] Remove unused sstream/string includes from c10/macros/Macros.h (#83353)

Nothing in the rest of the header seems to use these.

Differential Revision: [D38672680](https://our.internmc.facebook.com/intern/diff/D38672680/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83353
Approved by: https://github.com/malfet

* [functorch] add linalg cross batch rule (#83759)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83759
Approved by: https://github.com/zou3519

* Improve DistanceKernel.cu (#83811)

include device_sqrt
replace reduce_agg by BlockReduce
choose implementation by impl_fptr instead of error-prone copy-and-paste
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83811
Approved by: https://github.com/ngimel

* reinplace pass: bugfix for output node replacement (#83845)

Cleaned up some of the arg replacement logic to use tree_map, so it handles FX nodes that have nested containers.

See the added test: when you write a function that returns a list, the `output` node in the FX graph shows up as having `node.args = tuple(immutable_list(...))`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83845
Approved by: https://github.com/ezyang

* reinplace pass: special handling for view_scatter ops (#83846)

There is already special handling in the reinplacing pass for removing `{view}_scatter` ops, but there is another case that needs special handling. In this code:
```
         def f():
             a = torch.zeros(4, 4, 4)
             a[:, 2:] = torch.ones(4, 2, 4)
             return a
```

Tracing normally with `make_fx()` gives you:
```

def forward(self):
    zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False)
    ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False)
    slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807);  slice_tensor = None
    copy__default = torch.ops.aten.copy_.default(slice_tensor_1, ones);  slice_tensor_1 = ones = None
    return zeros
```
Functionalizing it gives you:

```
def forward(self):
    zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False)
    ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False)
    slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807);  slice_tensor = None
    slice_tensor_2 = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, 9223372036854775807);  slice_tensor_2 = ones = None
    slice_scatter_default_1 = torch.ops.aten.slice_scatter.default(zeros, slice_scatter_default, 0, 0, 9223372036854775807);  zeros = slice_scatter_default = None
    return slice_scatter_default_1
```

Notice that there are not any functional ops to directly re-inplace! What actually happened is that functionalization turned the `copy_()` into a `copy()`, but the out-of-place `copy()` operator gets optimized away because it's a no-op (when the input and output metadata are the same, `out = copy(a, b)` just returns `b`).

What we actually want is to replace this line:
```
slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, ...);
```
with this:
```
new_slice = torch.ops.aten.slice.Tensor(slice_tensor_2, 1, 2, ...);
_ = torch.ops.aten.copy_.default(new_slice, ones)
```

In the above, we're taking a fresh slice of the "base" tensor, and performing a `copy_()` on the slice, adding back what functionalization removed.

We actually need to create a fresh "slice" node, because we're not guaranteed that one already exists in the graph (technically there should be one, but it might have been DCE'd by the time we hit re-inplacing)

I also updated the docs for re-inplacing to more closely match the order of the logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83846
Approved by: https://github.com/ezyang

* Move ATenNVRTC.h include from `jit_utils.h` to `jit_utils.cpp` (#83886)

In general, `.h` files should only include headers that are used in the header

Fixes https://github.com/pytorch/pytorch/issues/83856
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83886
Approved by: https://github.com/ngimel

* Allow None arguments for elementwise type promotion wrapper and fix clamp with None arguments (#83586)

Fixes https://github.com/pytorch/torchdynamo/issues/759
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83586
Approved by: https://github.com/ezyang, https://github.com/ngimel

* Enable NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG=DETAIL (#83881)

Automatically enable `NCCL_DESYNC_DEBUG` when `TORCH_DISTRIBUTED_DEBUG` is set to `DETAIL`.
Saving user from setting two env variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83881
Approved by: https://github.com/malfet, https://github.com/rohan-varma, https://github.com/H-Huang

* Strenghten preconditions of linalg.cross (#83798)

This makes `linalg.cross` array API complaint (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs.

Fixes https://github.com/pytorch/pytorch/issues/77629
Fixes https://github.com/pytorch/pytorch/issues/83756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798
Approved by: https://github.com/mruberry

* Fix view_func replay in no-grad mode (#83872)

Fixes https://github.com/pytorch/pytorch/issues/83828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83872
Approved by: https://github.com/albanD

* [vulkan] Add VMA as a third_party subrepo (#83906)

the [VulkanMemoryAllocator](https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator) is a popular library for GPU memory allocation using Vulkan. The Vulkan backend has a dependency on it, but since it is only a single header file we currently include it by checking it into the repo under [aten/src/ATen/native/vulkan/api/vk_mem_alloc.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h). However, it is better to check it in as a third party submodule, since it allows better version tracking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83906
Approved by: https://github.com/kimishpatel

* [torchgen] Add documentation for `autogen` keyword (#83610)

This is a follow up for #81437. This PR explains what operator can use `autogen` and what will be generated. Also talked about generated kernels and where to find them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83610
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* remove assertEqualIgnoreTypes from test/distributions/test_distributions.py (#83709)

See https://github.com/pytorch/pytorch/issues/38095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83709
Approved by: https://github.com/kit1980

* [fix] edge case in `MaxPool1d` and add ErrorInputs (#83553)

Fixes #83224

cc @kshitij12345 @albanD!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83553
Approved by: https://github.com/albanD

* [complex] conv_transpose1d (#79694)

Reference: https://github.com/pytorch/pytorch/issues/71108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79694
Approved by: https://github.com/ngimel

* Revert "Strenghten preconditions of linalg.cross (#83798)"

This reverts commit 7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e.

Reverted https://github.com/pytorch/pytorch/pull/83798 on behalf of https://github.com/janeyx99 due to Sorry, land race caused functorch issues https://hud.pytorch.org/pytorch/pytorch/commit/7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e

* Fix load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855)

`_load_extra_only_for_mobile` API hasn't handled flatbuffers logic yet. Update the api accordingly.

Also find out mobile build in OSS doesn't build with flatbuffers. Filed task T129996445 to track

Differential Revision: [D38890847](https://our.internmc.facebook.com/intern/diff/D38890847/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38890847/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83855
Approved by: https://github.com/qihqi

* Prefer signal from land checks over PR signals (#83715)

# The problem

When a dev forks their branch from a red master build, their branch can fail CI checks for reasons unrelated to their changes, but the same checks would however pass in the land validation commit (which is rebased off of viable/strict)

Today, in the above scenario the `merge -l` command fails because mergebot sees the failing checks in the PR, which is not helpful when that same check passes in land validation.

# The solution
This PR changes the behavior so that:
1. If both the PR and land validation ran a workflow, only look at the results from land validation
2. If only the PR ran a specific workflow (e.g. for CLA Check or a nightly run) then continue to look the result from the PR (which matches existing behavior)

### Bonus fixes
It also includes a few extra BE fixes:
- Replaces the tuple we used to pass workflow check results around with a named tuple so that it's easier to tell what data is being used
- Reduces the number of API calls to github by ~50% during merges.  Before, we were pulling results from github every time and then filtering it down to the relevant category of checks (e.g. failed/pending/startup_failed). Now, our filters share the check results
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83715
Approved by: https://github.com/zengk95

* Don't introduce new overload for SymInt (#83628)

Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* Remove CoreMLMemoryObserver (#83703)

Summary: We added this observer to help us diagnose memory issues that have since resolved. It should be safe to clean this up.

Test Plan: Diff just removed logging, so just build IG and confirm no errors.

Differential Revision: D38843701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83703
Approved by: https://github.com/mcr229

* ci: Remove dead code related to android uploads (#83930)

These uploads actually never got triggeredhappened in nightlies so
removing it altogether. Someone can re-add in the future if they feel
these are important but I can't find an instance of this running since
we migrated so I have a hard time believing anyone will miss it.

https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=android

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83930
Approved by: https://github.com/atalman, https://github.com/malfet

* [fx][pass infra] Adding error catching (#83933)

Example:

```
======================================================================
ERROR: test_pass_manager_error (fx.test_pass_infra.TestPassManager)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 285, in __call__
    res = fn(module)
  File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 164, in pass_fail
    raise RuntimeError("bad")
RuntimeError: bad

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 170, in test_pass_manager_error
    pm(traced_m)
  File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 289, in __call__
    raise RuntimeError(msg) from e
RuntimeError: An error occured when running the 'pass_fail' pass after the following passes: ['replace_add_with_mul_pass', 'replace_mul_with_div_pass']
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83933
Approved by: https://github.com/SherlockNoMad

* Back out "Support regex-style matching for Any and Oneof (#82853)" (#83922)

Reviewed By: hl475

Differential Revision: D38945806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83922
Approved by: https://github.com/hl475

* Fix use-dict-literal lint (#83718)

Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR should do the change for every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718
Approved by: https://github.com/albanD

* Revert "Optimize transpose copy on CPU using fbgemm transpose (#83327)"

This reverts commit 04d8da88a6a1abf0da2b11096c85244bf38d3b2a.

Reverted https://github.com/pytorch/pytorch/pull/83327 on behalf of https://github.com/weiwangmeta due to breaking internal builds/causing out-of-bounds errors/training accuracy

* Add hypothesis to requirements.txt (#83740)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83740
Approved by: https://github.com/zhxchen17, https://github.com/janeyx99, https://github.com/zou3519

* [fbia] Keep Track of full qualified name before and after remote sharding (#83889)

Summary: track qualname changes in embedding sharding & FX split, and compose target qualname in the end of FBIA transform stage, so we can use the qualname mapping in XL materialize stage

Test Plan:
CI/CD

with DISABLE_XLEBB_MATERIALIZATION = True
https://fburl.com/fblearner/a8yljbux

with DISABLE_XLEBB_MATERIALIZATION = False
https://fburl.com/fblearner/2nvi0dam

Reviewed By: lliu315gt

Differential Revision: D38772525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83889
Approved by: https://github.com/houseroad

* add merge blocking to ci: sev template (#83940)

as in title, so that by default, ci: sev will block merges

the line can be removed to not block merges
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83940
Approved by: https://github.com/huydhn, https://github.com/janeyx99, https://github.com/malfet, https://github.com/seemethere

* Move nnapi code from ATen common code to specific library (#83748)

Summary: Currently we include nnapi code in all targets using ATen even if it's not used (actually there is no usage and being deprecated). Move it to `nnapi_backend_lib` for now.

Test Plan: Sandcastle.

Differential Revision: D38761095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83748
Approved by: https://github.com/salilsdesai, https://github.com/SS-JIA

* Task: T129772171 remove assertEqualIgnoreTypes from test/test_nn.py (#83870)

See https://github.com/pytorch/pytorch/issues/38095
Replaced assertEqualIgnoreType with assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83870
Approved by: https://github.com/kit1980

* [Nested Tensor] Make offset copy and move assignment more explicit. (#83488)

Currently the nested tensor construction for the offset_ parameter takes in references and in the chain of delegation uses value. This could lead to unnecessary copies.  Whenever a nested tensor impl is constructed it should take ownership of all its metadata. The only non-trivially copyable metadata associated with the class is `offsets_`.

The goal of this PR is to make sure that consumers of nested_tensor_impl constructors ensure that they are passing offsets as a temporary - either buy explicitly copying a reference, or by constructing the offsets vector in the scope of construction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83488
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* Remove conj kernels for real dtypes (#80374)

`conj_physical_stub` is currently implemented for all dtypes despite
it just being a plain copy for real dtypes. So, instead we should
defer to the existing copy kernel in these cases.

On my build for one CUDA architecture, I see a 2.2 MB decrease in
`libtorch_cuda.so` size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80374
Approved by: https://github.com/ngimel, https://github.com/atalman

* [BE][CUDA] Use packed_accessor64 (#83949)

Not sure why we are ignoring those, but SoftMax.cu alone
generates 100+ lines of warnings:
```
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In function ‘at::Tensor at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::get_offsets(const at::Tensor&, const IntArrayRef&, int64_t)’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:261:69: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = long int; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto indices_accessor = indices.packed_accessor<int64_t, 2>();
                                                                     ^
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:924:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:1677:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:927:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:1679:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:977:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:1775:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:980:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:1777:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:16:557:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:18:556:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:20:557:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:21:556:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83949
Approved by: https://github.com/ngimel

* Support returning symbolic strides from t.stride() in Python (#83842)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83842
Approved by: https://github.com/albanD, https://github.com/Chillee, https://github.com/bdhirsh

* Support the XPU backend untyped storage (#83952)

Simple add XPU backend in untyped torch storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83952
Approved by: https://github.com/ezyang

* Support NCCL Premul Sum (#81272)

This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum.

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul" whose current hash is https://github.com/pytorch/pytorch/pull/81272/commits/cb99ad67447b5899ecf8c4c3d78deaafa1cc09b8 was authored by @mcarilli and @ptrblck.

cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501

* Test type promotion assertignoretypes (#83867)

See #38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83867
Approved by: https://github.com/kit1980, https://github.com/mruberry

* [Profiler] record nn.Module's parameters (#83209)

Summary:
Record nn.Module's parameters for detaild memory profiling:
- extend 'module_' in value cache  & NNModuleInfo to save parameters
- python binding and unit test case

Test Plan: buck run mode/opt //caffe2/test:profiler -- -r test_nnmodule

Differential Revision: D38379717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83209
Approved by: https://github.com/robieta

* [xla hash update] update the pinned xla hash (#83967)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83967
Approved by: https://github.com/pytorchbot

* Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)

* Fix LTC build warnings (#83955)

Addresses `Wc++98-compat-extra-semi` warning from https://github.com/llvm/torch-mlir/issues/1264 by removing extraneous semicolon after autogen LTC native function definitions.

```
/home/runner/work/torch-mlir/torch-mlir/build/tools/torch-mlir/python/torch_mlir/csrc/base_lazy_backend/generated/LazyNativeFunctions.cpp:4241:6: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
    };
     ^
```

cc: @wconstab @desertfire @ke1337 @antoniojkim
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83955
Approved by: https://github.com/wconstab

* Strenghten preconditions of linalg.cross (#83798)

This makes `linalg.cross` array API complaint (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs.

Fixes https://github.com/pytorch/pytorch/issues/77629
Fixes https://github.com/pytorch/pytorch/issues/83756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798
Approved by: https://github.com/mruberry

* Make linalg.inv composite of linalg.solve (#80074)

The `getri` kernel calls inside `getrs` so we can do so explicitly
ourselves and save ourselves from having to maintain an extra kernel.
This way we just need to optimise `lu_factor` and `lu_solve` and `inv`
will be as efficient as it can be, as it'll be choosing the best backend
to perform the factorisation and the best backend (not necessarily the
same) to perform the solve.

Fixes https://github.com/pytorch/pytorch/issues/77498

The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074
Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet

* Support a stable double backward on linalg.det for real inputs (#80217)

The complex case still fails. I do not know why.

Fixes https://github.com/pytorch/pytorch/issues/62327
Fixes https://github.com/pytorch/pytorch/issues/53364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80217
Approved by: https://github.com/nikitaved, https://github.com/albanD, https://github.com/malfet

* [LTC] Add custom lazy tensor save function (#83294)

We need a custom `save` function for checkpointing a lazy model, similar to what exists in PyTorch/XLA:
https://github.com/pytorch/xla/blob/3eb8a9d9eb4ebb0b064461c3704650241625654e/torch_xla/core/xla_model.py#L994
The purpose of this function is to move any lazy tensors to CPU before saving the checkpoint.

The way I implemented it was to create a general structure visitor, adapted from a function that we use quite often in Cerebras internal repositories. If there is a better tool already available in PyTorch that does the same things, I'm open to suggestions.

CC: @wconstab @Krovatkin @JackCaoG
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83294
Approved by: https://github.com/wconstab

* move pooling test from test_nn to test/nn/test_pooling (#83915)

Ref #63085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83915
Approved by: https://github.com/albanD

* [ONNX] Remove static None graph output (#82623)

Fixes #82370
* Unify the export behavior regarding static None outputs. These are
dropped for both traced graph and TorchScript graph export.
* `Optional` outputs are not affected.
Fixes #82370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82623
Approved by: https://github.com/AllenTiTaiWang, https://github.com/abock

* [TorchTidy Fix] Don't try to collect strides for non-strided tensors (#83935)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83935
Approved by: https://github.com/robieta, https://github.com/slgong-fb

* [WIP] Validating input_col for certain datapipes (#80267)

Follow up from #79344.

Currently WIP due to multiple test failures.

Waiting for #80140 to land
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80267
Approved by: https://github.com/ejguan

* support more symintnode operations (#83877)

remove debug code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83877
Approved by: https://github.com/ezyang

* add arithmetic ops (#83878)

arithmetic ops tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83878
Approved by: https://github.com/ezyang

* logical ops (#83879)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83879
Approved by: https://github.com/ezyang

* strip SymIntNodes off in the mobile builds (#83938)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83938
Approved by: https://github.com/ezyang

* [pthreadpool] Cap max thread count to fix TSAN issues (#83950)

Summary: Cap the thread count to 64 unconditionally to solve this tsan issue which leads to harder to debug, flaky test failures.

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D38136212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83950
Approved by: https://github.com/kimishpatel

* Skip NCCL slimming for cxx11 libtorch builds (#83959)

Fixes https://github.com/pytorch/pytorch/issues/83887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83959
Approved by: https://github.com/atalman

* add hud link to merge failure message (#83946)

as in title, related to https://github.com/pytorch/test-infra/issues/568
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83946
Approved by: https://github.com/huydhn

* Check all CUDA API calls for errors in benchmarks/cpp/nvfuser (#74920) (#81817)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74920

Test Plan: Sandcastle

Differential Revision: D35194656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81817
Approved by: https://github.com/malfet

* [frontend] Fix tensor list alias annotation  (#84005)

For issue https://github.com/pytorch/pytorch/issues/77920 and a retry of https://github.com/pytorch/pytorch/pull/8…
zasdfgbnm added a commit to csarofeen/pytorch that referenced this pull request Mar 20, 2023
…ew tile sizes (#1900)

* use custom propagator in ampere TN

* add tile ordering utilities

* initial matmul scheduler implementation

* use matmul scheduler prototype on ampere and turing test cases

* extend to support Volta

* minor cleanup

* comment cleanup

* minor fix

* add fragment iteration and use it in matmul scheduler

* use scheduler params for tests

* fragment support in double buffer

* add register double buffering test cases

* clean up custom transform propagator

* rebase fix

* comment

* move bounded selector to common area

* Add logic to handle fake boundary tensors in selection.

* naming and comment

* remove unused parameters from mma node

* remove unnecessary parameters from mma ir node

* rename scheduling variables

* change accumulator tv interface

* Update torch/csrc/jit/codegen/cuda/scheduler/utils.h

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>

* PR feedback

* pipe through parallel type position

* Revert "fragment support in double buffer"

This reverts commit d12a90fcce5cd02aca7c98ea5f29ea01bc85df6f.

* use cache op to handle double buffer input

* add more comment in matmul scheduler

* more comments

* comment fix

* rebase fix

* add inline pred for cpasync

* minor cleanup

* add inlining test in unit

* add option to dump ptx

* add ampere xor swizzle gen

* minor scheduler fix; add bank conflict helper

* minor update and enable single word access checker

* minor fixes and symmetric 4 warp recipe tests

* rebase fix

* fix rebase

* add cyclic shift for non-power-of-2 swizzle period

* fix swizzle handling in replay

* add a few more tile support

* minor fix

* add 6 warp test cases

* add skip swizzle option for replay matching

* cleanup

* add small repro for the replay fix

* Fix missing thread predicates

Unlikely to matter, but should be necessary

* fix merge

* fix merge

* format

* Rebase #1900 (#2009)

* hash update - bug fix for branches (#83865)

hash updates for xla were failing because the current pinned hash is a branch, so the git command for getting the date couldn't find the branch due to not having a local version of the branch.  Fixed by checking out the branch to make sure it exists locally.

example of failure: https://github.com/pytorch/pytorch/runs/7913835742?check_suite_focus=true

Test plan:
made it pull request trigger and ran, to get this:
https://github.com/pytorch/pytorch/runs/7959221184?check_suite_focus=true
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83865
Approved by: https://github.com/zengk95

* [FSDP] Remove unneeded checks (#83150)

@awgu pointed out these checks aren't really doing anything, as they just make sure we're setting training state in certain ways throughout FSDP and is sort of arbitrary. So, removing them to avoid confusion.

We still keep the checking around `_post_backward_called` because this is needed in `finalize_params` for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83150
Approved by: https://github.com/awgu

* [BE] Revert distributed change in https://github.com/pytorch/pytorch/pull/68779 (#83181)

https://github.com/pytorch/pytorch/issues/82641 points out a regression in how inputs / outputs are processed by DDP, blocking their HF use case. It was narrowed down to https://github.com/pytorch/pytorch/pull/68779 and reverting the distributed change there fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83181
Approved by: https://github.com/kumpera

* Transpose scheduler small dim sizes better support (#1910)

* Optimize transpose copy on CPU using fbgemm transpose (#83327)

### Description
Optimize transpose copy on CPU using fbgemm transpose

### Testing
single socket (28cores):
```
before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 4.819e-05 ms; bf16: 4.846e-05 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000171 ms; bf16: 0.000129 ms

after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10])  fp32: 2.439e-05 ms; bf16: 2.152e-05 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000132 ms; bf16: 3.916e-05 ms
```
single core:
```
before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.00109 ms;  bf16: 0.00103 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00339 ms; bf16: 0.00295 ms

after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.000566  ms; bf16: 0.000382 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00282 ms; bf16: 0.000999 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83327
Approved by: https://github.com/frank-wei

* Grouped grid welford (#1921)

Enables grouping of grid welford ops across iterations. Same functionality as the iteration grouping for GridReduction. This ins intended to improve the outer-norm grid persistence in batchnorm-like fusions.

* [ONNX] Use `errors.SymbolicValueError` for more context (#83332)

Replace runtime errors in torch.onnx with `errors.SymbolicValueError` for more context around jit values.

- Extend `_unimplemented`, `_onnx_unsupported`, `_onnx_opset_unsupported`, `_onnx_opset_unsupported_detailed` errors to include JIT value information
- Replace plain RuntimeError with `errors.SymbolicValueError`
- Clean up: Use `_is_bool` to replace string comparison on jit types
- Clean up: Remove the todo `Remove type ignore after #81112`

#77316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83332
Approved by: https://github.com/AllenTiTaiWang, https://github.com/thiagocrepaldi, https://github.com/BowenBao

* [quant][fx] Add support for quantized matmul (#83885)

Summary:
att, probably missed the op during migration to the reference flow

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_qmatmul

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83885
Approved by: https://github.com/andrewor14

* Misc fixes/tuning for transpose scheduler (#1912)

* [nn] split rnn_utils test from test_nn.py (#83675)

Ref: https://github.com/pytorch/pytorch/issues/63085
Proposed folder structure
```
-> test
  -> nn
    -> test_conv.py
    -> test_pooling.py
    -> .....
```

This PR: Moves test related RNN utilities to a different file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83675
Approved by: https://github.com/albanD

* [optim] rprop: handle complex params as independent real params (#83858)

Ref #65711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83858
Approved by: https://github.com/albanD

* [xla hash update] update the pinned xla hash (#83899)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83899
Approved by: https://github.com/pytorchbot

* [ROCm] More Sparse UTs enablement and more hipification mappings. (#78939)

Enables:

 test_bmm_cuda_float64
 test_bmm_deterministic_cuda_float64
 test_csr_matvec_cuda_complex128
 test_csr_matvec_cuda_complex64
 test_csr_matvec_cuda_float32
 test_csr_matvec_cuda_float64

To enable the above tests had to add some more hip mappings for the hipification process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78939
Approved by: https://github.com/pruthvistony, https://github.com/malfet

* Normalize DLPack stride to 1 where shape < 2 (#83158)

Fixes #83069. Also move all the dlpack tests to a new file., `test_dlpack.py`.

The fix involves always allocating a "strides" int array when converting to dlPack and deleting the strides when the capsule descructor is called. Then the strides are copied from the tensor, and `strides[i]` is set to `1` where `shape[i] < 2`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83158
Approved by: https://github.com/ezyang

* Remove DBR quantization from the codebase (#83642)

Summary:

DBR quantization is a no-go for now because it does not align well with
PyTorch 2.0 plans and we do not want to build yet another tracing system.

Deleting it from the codebase for now since there are no plans to develop
this in the near future. We can bring it back at a later time if necessary.

Test plan:

CI

Differential Revision: [D38839556](https://our.internmc.facebook.com/intern/diff/D38839556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83642
Approved by: https://github.com/andrewor14, https://github.com/jerryzh168

* Refactored ops on size to be dispatcher ops (#83719)

An example of how the graph looks now.
```
def forward(self, x_1):
    size = torch.ops.math.size(x_1, 0)
    size_1 = torch.ops.math.size(x_1, 1);  x_1 = None
    ones = torch.ops.aten.ones.default([1], device = device(type='cpu'), pin_memory = False)
    expand_sym_int = torch.ops.aten.expand.SymInt(ones, [size, size_1]);  ones = size = size_1 = None
    cos_default = torch.ops.aten.cos.default(expand_sym_int);  expand_sym_int = None
    return (cos_default,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83719
Approved by: https://github.com/ezyang

* Fix stride issue with faketensors (#83822)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83822
Approved by: https://github.com/ezyang, https://github.com/ngimel

* Nullary RNGOp (#1892)

* [ROCm] restore MIOpen benchmark flag default to true (#82656)

### Description
PR https://github.com/pytorch/pytorch/pull/77438 allowed MIOpen to support the benchmark flag. Previously, the benchmark flag was ignored by MIOpen such that benchmarking was always turned on. This commit restores the behavior that MIOpen benchmarking is by default turned on.

### Testing
CI unit tests cover this capability.  Torchvision models demonstrate the performance delta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82656
Approved by: https://github.com/ngimel

* Update retry action to latest version (#83911)

We're running into EPERM issues when trying to install nvidia tools, see failure example https://github.com/pytorch/pytorch/runs/7975726013?check_suite_focus=true.
```
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1049
            throw err;
            ^

Error: kill EPERM
    at process.kill (internal/process/per_thread.js:199:13)
    at killPid (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1059:17)
    at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1036:21
    at Array.forEach (<anonymous>)
    at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1034:23
    at Array.forEach (<anonymous>)
    at killAll (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1033:27)
    at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1024:13
    at ChildProcess.onClose (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1080:17)
    at ChildProcess.emit (events.js:314:20) {
  errno: 'EPERM',
  code: 'EPERM',
  syscall: 'kill'
}

```

The root issue probably lies elsewhere but this action is not helping/the errors seem to say it's unable to kill child processes. A more recent commit in that repo uses spawn instead of exec which might make a difference.

Regardless, we should keep our actions up to date anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83911
Approved by: https://github.com/malfet

* [PyTorch] Remove unused sstream/string includes from c10/macros/Macros.h (#83353)

Nothing in the rest of the header seems to use these.

Differential Revision: [D38672680](https://our.internmc.facebook.com/intern/diff/D38672680/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83353
Approved by: https://github.com/malfet

* [functorch] add linalg cross batch rule (#83759)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83759
Approved by: https://github.com/zou3519

* Improve DistanceKernel.cu (#83811)

include device_sqrt
replace reduce_agg by BlockReduce
choose implementation by impl_fptr instead of error-prone copy-and-paste
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83811
Approved by: https://github.com/ngimel

* reinplace pass: bugfix for output node replacement (#83845)

Cleaned up some of the arg replacement logic to use tree_map, so it handles FX nodes that have nested containers.

See the added test: when you write a function that returns a list, the `output` node in the FX graph shows up as having `node.args = tuple(immutable_list(...))`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83845
Approved by: https://github.com/ezyang

* reinplace pass: special handling for view_scatter ops (#83846)

There is already special handling in the reinplacing pass for removing `{view}_scatter` ops, but there is another case that needs special handling. In this code:
```
         def f():
             a = torch.zeros(4, 4, 4)
             a[:, 2:] = torch.ones(4, 2, 4)
             return a
```

Tracing normally with `make_fx()` gives you:
```

def forward(self):
    zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False)
    ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False)
    slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807);  slice_tensor = None
    copy__default = torch.ops.aten.copy_.default(slice_tensor_1, ones);  slice_tensor_1 = ones = None
    return zeros
```
Functionalizing it gives you:

```
def forward(self):
    zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False)
    ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False)
    slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807);  slice_tensor = None
    slice_tensor_2 = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, 9223372036854775807);  slice_tensor_2 = ones = None
    slice_scatter_default_1 = torch.ops.aten.slice_scatter.default(zeros, slice_scatter_default, 0, 0, 9223372036854775807);  zeros = slice_scatter_default = None
    return slice_scatter_default_1
```

Notice that there are not any functional ops to directly re-inplace! What actually happened is that functionalization turned the `copy_()` into a `copy()`, but the out-of-place `copy()` operator gets optimized away because it's a no-op (when the input and output metadata are the same, `out = copy(a, b)` just returns `b`).

What we actually want is to replace this line:
```
slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, ...);
```
with this:
```
new_slice = torch.ops.aten.slice.Tensor(slice_tensor_2, 1, 2, ...);
_ = torch.ops.aten.copy_.default(new_slice, ones)
```

In the above, we're taking a fresh slice of the "base" tensor, and performing a `copy_()` on the slice, adding back what functionalization removed.

We actually need to create a fresh "slice" node, because we're not guaranteed that one already exists in the graph (technically there should be one, but it might have been DCE'd by the time we hit re-inplacing)

I also updated the docs for re-inplacing to more closely match the order of the logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83846
Approved by: https://github.com/ezyang

* Move ATenNVRTC.h include from `jit_utils.h` to `jit_utils.cpp` (#83886)

In general, `.h` files should only include headers that are used in the header

Fixes https://github.com/pytorch/pytorch/issues/83856
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83886
Approved by: https://github.com/ngimel

* Allow None arguments for elementwise type promotion wrapper and fix clamp with None arguments (#83586)

Fixes https://github.com/pytorch/torchdynamo/issues/759
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83586
Approved by: https://github.com/ezyang, https://github.com/ngimel

* Enable NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG=DETAIL (#83881)

Automatically enable `NCCL_DESYNC_DEBUG` when `TORCH_DISTRIBUTED_DEBUG` is set to `DETAIL`.
Saving user from setting two env variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83881
Approved by: https://github.com/malfet, https://github.com/rohan-varma, https://github.com/H-Huang

* Strenghten preconditions of linalg.cross (#83798)

This makes `linalg.cross` array API complaint (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs.

Fixes https://github.com/pytorch/pytorch/issues/77629
Fixes https://github.com/pytorch/pytorch/issues/83756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798
Approved by: https://github.com/mruberry

* Fix view_func replay in no-grad mode (#83872)

Fixes https://github.com/pytorch/pytorch/issues/83828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83872
Approved by: https://github.com/albanD

* [vulkan] Add VMA as a third_party subrepo (#83906)

the [VulkanMemoryAllocator](https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator) is a popular library for GPU memory allocation using Vulkan. The Vulkan backend has a dependency on it, but since it is only a single header file we currently include it by checking it into the repo under [aten/src/ATen/native/vulkan/api/vk_mem_alloc.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h). However, it is better to check it in as a third party submodule, since it allows better version tracking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83906
Approved by: https://github.com/kimishpatel

* [torchgen] Add documentation for `autogen` keyword (#83610)

This is a follow up for #81437. This PR explains what operator can use `autogen` and what will be generated. Also talked about generated kernels and where to find them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83610
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* remove assertEqualIgnoreTypes from test/distributions/test_distributions.py (#83709)

See https://github.com/pytorch/pytorch/issues/38095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83709
Approved by: https://github.com/kit1980

* [fix] edge case in `MaxPool1d` and add ErrorInputs (#83553)

Fixes #83224

cc @kshitij12345 @albanD!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83553
Approved by: https://github.com/albanD

* [complex] conv_transpose1d (#79694)

Reference: https://github.com/pytorch/pytorch/issues/71108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79694
Approved by: https://github.com/ngimel

* Revert "Strenghten preconditions of linalg.cross (#83798)"

This reverts commit 7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e.

Reverted https://github.com/pytorch/pytorch/pull/83798 on behalf of https://github.com/janeyx99 due to Sorry, land race caused functorch issues https://hud.pytorch.org/pytorch/pytorch/commit/7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e

* Fix load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855)

`_load_extra_only_for_mobile` API hasn't handled flatbuffers logic yet. Update the api accordingly.

Also find out mobile build in OSS doesn't build with flatbuffers. Filed task T129996445 to track

Differential Revision: [D38890847](https://our.internmc.facebook.com/intern/diff/D38890847/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38890847/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83855
Approved by: https://github.com/qihqi

* Prefer signal from land checks over PR signals (#83715)

# The problem

When a dev forks their branch from a red master build, their branch can fail CI checks for reasons unrelated to their changes, but the same checks would however pass in the land validation commit (which is rebased off of viable/strict)

Today, in the above scenario the `merge -l` command fails because mergebot sees the failing checks in the PR, which is not helpful when that same check passes in land validation.

# The solution
This PR changes the behavior so that:
1. If both the PR and land validation ran a workflow, only look at the results from land validation
2. If only the PR ran a specific workflow (e.g. for CLA Check or a nightly run) then continue to look the result from the PR (which matches existing behavior)

### Bonus fixes
It also includes a few extra BE fixes:
- Replaces the tuple we used to pass workflow check results around with a named tuple so that it's easier to tell what data is being used
- Reduces the number of API calls to github by ~50% during merges.  Before, we were pulling results from github every time and then filtering it down to the relevant category of checks (e.g. failed/pending/startup_failed). Now, our filters share the check results
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83715
Approved by: https://github.com/zengk95

* Don't introduce new overload for SymInt (#83628)

Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* Remove CoreMLMemoryObserver (#83703)

Summary: We added this observer to help us diagnose memory issues that have since resolved. It should be safe to clean this up.

Test Plan: Diff just removed logging, so just build IG and confirm no errors.

Differential Revision: D38843701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83703
Approved by: https://github.com/mcr229

* ci: Remove dead code related to android uploads (#83930)

These uploads actually never got triggeredhappened in nightlies so
removing it altogether. Someone can re-add in the future if they feel
these are important but I can't find an instance of this running since
we migrated so I have a hard time believing anyone will miss it.

https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=android

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83930
Approved by: https://github.com/atalman, https://github.com/malfet

* [fx][pass infra] Adding error catching (#83933)

Example:

```
======================================================================
ERROR: test_pass_manager_error (fx.test_pass_infra.TestPassManager)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 285, in __call__
    res = fn(module)
  File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 164, in pass_fail
    raise RuntimeError("bad")
RuntimeError: bad

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 170, in test_pass_manager_error
    pm(traced_m)
  File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 289, in __call__
    raise RuntimeError(msg) from e
RuntimeError: An error occured when running the 'pass_fail' pass after the following passes: ['replace_add_with_mul_pass', 'replace_mul_with_div_pass']
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83933
Approved by: https://github.com/SherlockNoMad

* Back out "Support regex-style matching for Any and Oneof (#82853)" (#83922)

Reviewed By: hl475

Differential Revision: D38945806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83922
Approved by: https://github.com/hl475

* Fix use-dict-literal lint (#83718)

Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR should do the change for every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718
Approved by: https://github.com/albanD

* Revert "Optimize transpose copy on CPU using fbgemm transpose (#83327)"

This reverts commit 04d8da88a6a1abf0da2b11096c85244bf38d3b2a.

Reverted https://github.com/pytorch/pytorch/pull/83327 on behalf of https://github.com/weiwangmeta due to breaking internal builds/causing out-of-bounds errors/training accuracy

* Add hypothesis to requirements.txt (#83740)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83740
Approved by: https://github.com/zhxchen17, https://github.com/janeyx99, https://github.com/zou3519

* [fbia] Keep Track of full qualified name before and after remote sharding (#83889)

Summary: track qualname changes in embedding sharding & FX split, and compose target qualname in the end of FBIA transform stage, so we can use the qualname mapping in XL materialize stage

Test Plan:
CI/CD

with DISABLE_XLEBB_MATERIALIZATION = True
https://fburl.com/fblearner/a8yljbux

with DISABLE_XLEBB_MATERIALIZATION = False
https://fburl.com/fblearner/2nvi0dam

Reviewed By: lliu315gt

Differential Revision: D38772525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83889
Approved by: https://github.com/houseroad

* add merge blocking to ci: sev template (#83940)

as in title, so that by default, ci: sev will block merges

the line can be removed to not block merges
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83940
Approved by: https://github.com/huydhn, https://github.com/janeyx99, https://github.com/malfet, https://github.com/seemethere

* Move nnapi code from ATen common code to specific library (#83748)

Summary: Currently we include nnapi code in all targets using ATen even if it's not used (actually there is no usage and being deprecated). Move it to `nnapi_backend_lib` for now.

Test Plan: Sandcastle.

Differential Revision: D38761095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83748
Approved by: https://github.com/salilsdesai, https://github.com/SS-JIA

* Task: T129772171 remove assertEqualIgnoreTypes from test/test_nn.py (#83870)

See https://github.com/pytorch/pytorch/issues/38095
Replaced assertEqualIgnoreType with assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83870
Approved by: https://github.com/kit1980

* [Nested Tensor] Make offset copy and move assignment more explicit. (#83488)

Currently the nested tensor construction for the offset_ parameter takes in references and in the chain of delegation uses value. This could lead to unnecessary copies.  Whenever a nested tensor impl is constructed it should take ownership of all its metadata. The only non-trivially copyable metadata associated with the class is `offsets_`.

The goal of this PR is to make sure that consumers of nested_tensor_impl constructors ensure that they are passing offsets as a temporary - either buy explicitly copying a reference, or by constructing the offsets vector in the scope of construction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83488
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* Remove conj kernels for real dtypes (#80374)

`conj_physical_stub` is currently implemented for all dtypes despite
it just being a plain copy for real dtypes. So, instead we should
defer to the existing copy kernel in these cases.

On my build for one CUDA architecture, I see a 2.2 MB decrease in
`libtorch_cuda.so` size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80374
Approved by: https://github.com/ngimel, https://github.com/atalman

* [BE][CUDA] Use packed_accessor64 (#83949)

Not sure why we are ignoring those, but SoftMax.cu alone
generates 100+ lines of warnings:
```
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In function ‘at::Tensor at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::get_offsets(const at::Tensor&, const IntArrayRef&, int64_t)’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:261:69: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = long int; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto indices_accessor = indices.packed_accessor<int64_t, 2>();
                                                                     ^
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:924:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:1677:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:927:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:1679:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:977:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:1775:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:980:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:1777:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:16:557:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:18:556:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:20:557:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:21:556:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83949
Approved by: https://github.com/ngimel

* Support returning symbolic strides from t.stride() in Python (#83842)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83842
Approved by: https://github.com/albanD, https://github.com/Chillee, https://github.com/bdhirsh

* Support the XPU backend untyped storage (#83952)

Simple add XPU backend in untyped torch storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83952
Approved by: https://github.com/ezyang

* Support NCCL Premul Sum (#81272)

This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum.

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul" whose current hash is https://github.com/pytorch/pytorch/pull/81272/commits/cb99ad67447b5899ecf8c4c3d78deaafa1cc09b8 was authored by @mcarilli and @ptrblck.

cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501

* Test type promotion assertignoretypes (#83867)

See #38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83867
Approved by: https://github.com/kit1980, https://github.com/mruberry

* [Profiler] record nn.Module's parameters (#83209)

Summary:
Record nn.Module's parameters for detaild memory profiling:
- extend 'module_' in value cache  & NNModuleInfo to save parameters
- python binding and unit test case

Test Plan: buck run mode/opt //caffe2/test:profiler -- -r test_nnmodule

Differential Revision: D38379717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83209
Approved by: https://github.com/robieta

* [xla hash update] update the pinned xla hash (#83967)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83967
Approved by: https://github.com/pytorchbot

* Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)

* Fix LTC build warnings (#83955)

Addresses `Wc++98-compat-extra-semi` warning from https://github.com/llvm/torch-mlir/issues/1264 by removing extraneous semicolon after autogen LTC native function definitions.

```
/home/runner/work/torch-mlir/torch-mlir/build/tools/torch-mlir/python/torch_mlir/csrc/base_lazy_backend/generated/LazyNativeFunctions.cpp:4241:6: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
    };
     ^
```

cc: @wconstab @desertfire @ke1337 @antoniojkim
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83955
Approved by: https://github.com/wconstab

* Strenghten preconditions of linalg.cross (#83798)

This makes `linalg.cross` array API complaint (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs.

Fixes https://github.com/pytorch/pytorch/issues/77629
Fixes https://github.com/pytorch/pytorch/issues/83756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798
Approved by: https://github.com/mruberry

* Make linalg.inv composite of linalg.solve (#80074)

The `getri` kernel calls inside `getrs` so we can do so explicitly
ourselves and save ourselves from having to maintain an extra kernel.
This way we just need to optimise `lu_factor` and `lu_solve` and `inv`
will be as efficient as it can be, as it'll be choosing the best backend
to perform the factorisation and the best backend (not necessarily the
same) to perform the solve.

Fixes https://github.com/pytorch/pytorch/issues/77498

The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074
Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet

* Support a stable double backward on linalg.det for real inputs (#80217)

The complex case still fails. I do not know why.

Fixes https://github.com/pytorch/pytorch/issues/62327
Fixes https://github.com/pytorch/pytorch/issues/53364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80217
Approved by: https://github.com/nikitaved, https://github.com/albanD, https://github.com/malfet

* [LTC] Add custom lazy tensor save function (#83294)

We need a custom `save` function for checkpointing a lazy model, similar to what exists in PyTorch/XLA:
https://github.com/pytorch/xla/blob/3eb8a9d9eb4ebb0b064461c3704650241625654e/torch_xla/core/xla_model.py#L994
The purpose of this function is to move any lazy tensors to CPU before saving the checkpoint.

The way I implemented it was to create a general structure visitor, adapted from a function that we use quite often in Cerebras internal repositories. If there is a better tool already available in PyTorch that does the same things, I'm open to suggestions.

CC: @wconstab @Krovatkin @JackCaoG
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83294
Approved by: https://github.com/wconstab

* move pooling test from test_nn to test/nn/test_pooling (#83915)

Ref #63085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83915
Approved by: https://github.com/albanD

* [ONNX] Remove static None graph output (#82623)

Fixes #82370
* Unify the export behavior regarding static None outputs. These are
dropped for both traced graph and TorchScript graph export.
* `Optional` outputs are not affected.
Fixes #82370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82623
Approved by: https://github.com/AllenTiTaiWang, https://github.com/abock

* [TorchTidy Fix] Don't try to collect strides for non-strided tensors (#83935)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83935
Approved by: https://github.com/robieta, https://github.com/slgong-fb

* [WIP] Validating input_col for certain datapipes (#80267)

Follow up from #79344.

Currently WIP due to multiple test failures.

Waiting for #80140 to land
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80267
Approved by:…
zasdfgbnm added a commit to NVIDIA/Fuser that referenced this pull request Mar 20, 2023
 for a few tile sizes (#1900)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* use custom propagator in ampere TN

* add tile ordering utilities

* initial matmul scheduler implementation

* use matmul scheduler prototype on ampere and turing test cases

* extend to support Volta

* minor cleanup

* comment cleanup

* minor fix

* add fragment iteration and use it in matmul scheduler

* use scheduler params for tests

* fragment support in double buffer

* add register double buffering test cases

* clean up custom transform propagator

* rebase fix

* comment

* move bounded selector to common area

* Add logic to handle fake boundary tensors in selection.

* naming and comment

* remove unused parameters from mma node

* remove unnecessary parameters from mma ir node

* rename scheduling variables

* change accumulator tv interface

* Update torch/csrc/jit/codegen/cuda/scheduler/utils.h

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>

* PR feedback

* pipe through parallel type position

* Revert "fragment support in double buffer"

This reverts commit d12a90fcce5cd02aca7c98ea5f29ea01bc85df6f.

* use cache op to handle double buffer input

* add more comment in matmul scheduler

* more comments

* comment fix

* rebase fix

* add inline pred for cpasync

* minor cleanup

* add inlining test in unit

* add option to dump ptx

* add ampere xor swizzle gen

* minor scheduler fix; add bank conflict helper

* minor update and enable single word access checker

* minor fixes and symmetric 4 warp recipe tests

* rebase fix

* fix rebase

* add cyclic shift for non-power-of-2 swizzle period

* fix swizzle handling in replay

* add a few more tile support

* minor fix

* add 6 warp test cases

* add skip swizzle option for replay matching

* cleanup

* add small repro for the replay fix

* Fix missing thread predicates

Unlikely to matter, but should be necessary

* fix merge

* fix merge

* format

* Rebase #1900 (#2009)

* hash update - bug fix for branches (#83865)

hash updates for xla were failing because the current pinned hash is a branch, so the git command for getting the date couldn't find the branch due to not having a local version of the branch.  Fixed by checking out the branch to make sure it exists locally.

example of failure: https://github.com/pytorch/pytorch/runs/7913835742?check_suite_focus=true

Test plan:
made it pull request trigger and ran, to get this:
https://github.com/pytorch/pytorch/runs/7959221184?check_suite_focus=true
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83865
Approved by: https://github.com/zengk95

* [FSDP] Remove unneeded checks (#83150)

@awgu pointed out these checks aren't really doing anything, as they just make sure we're setting training state in certain ways throughout FSDP and is sort of arbitrary. So, removing them to avoid confusion.

We still keep the checking around `_post_backward_called` because this is needed in `finalize_params` for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83150
Approved by: https://github.com/awgu

* [BE] Revert distributed change in https://github.com/pytorch/pytorch/pull/68779 (#83181)

https://github.com/pytorch/pytorch/issues/82641 points out a regression in how inputs / outputs are processed by DDP, blocking their HF use case. It was narrowed down to https://github.com/pytorch/pytorch/pull/68779 and reverting the distributed change there fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83181
Approved by: https://github.com/kumpera

* Transpose scheduler small dim sizes better support (#1910)

* Optimize transpose copy on CPU using fbgemm transpose (#83327)

Optimize transpose copy on CPU using fbgemm transpose

single socket (28cores):
```
before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 4.819e-05 ms; bf16: 4.846e-05 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000171 ms; bf16: 0.000129 ms

after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10])  fp32: 2.439e-05 ms; bf16: 2.152e-05 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.000132 ms; bf16: 3.916e-05 ms
```
single core:
```
before: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.00109 ms;  bf16: 0.00103 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00339 ms; bf16: 0.00295 ms

after: torch.Size([10, 128, 10, 124]) -> torch.Size([10, 128, 124, 10]) fp32: 0.000566  ms; bf16: 0.000382 ms
        torch.Size([10, 128, 30, 124]) -> torch.Size([10, 128, 124, 30]) fp32: 0.00282 ms; bf16: 0.000999 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83327
Approved by: https://github.com/frank-wei

* Grouped grid welford (#1921)

Enables grouping of grid welford ops across iterations. Same functionality as the iteration grouping for GridReduction. This ins intended to improve the outer-norm grid persistence in batchnorm-like fusions.

* [ONNX] Use `errors.SymbolicValueError` for more context (#83332)

Replace runtime errors in torch.onnx with `errors.SymbolicValueError` for more context around jit values.

- Extend `_unimplemented`, `_onnx_unsupported`, `_onnx_opset_unsupported`, `_onnx_opset_unsupported_detailed` errors to include JIT value information
- Replace plain RuntimeError with `errors.SymbolicValueError`
- Clean up: Use `_is_bool` to replace string comparison on jit types
- Clean up: Remove the todo `Remove type ignore after #81112`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83332
Approved by: https://github.com/AllenTiTaiWang, https://github.com/thiagocrepaldi, https://github.com/BowenBao

* [quant][fx] Add support for quantized matmul (#83885)

Summary:
att, probably missed the op during migration to the reference flow

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_qmatmul

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83885
Approved by: https://github.com/andrewor14

* Misc fixes/tuning for transpose scheduler (#1912)

* [nn] split rnn_utils test from test_nn.py (#83675)

Ref: https://github.com/pytorch/pytorch/issues/63085
Proposed folder structure
```
-> test
  -> nn
    -> test_conv.py
    -> test_pooling.py
    -> .....
```

This PR: Moves test related RNN utilities to a different file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83675
Approved by: https://github.com/albanD

* [optim] rprop: handle complex params as independent real params (#83858)

Ref #65711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83858
Approved by: https://github.com/albanD

* [xla hash update] update the pinned xla hash (#83899)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83899
Approved by: https://github.com/pytorchbot

* [ROCm] More Sparse UTs enablement and more hipification mappings. (#78939)

Enables:

 test_bmm_cuda_float64
 test_bmm_deterministic_cuda_float64
 test_csr_matvec_cuda_complex128
 test_csr_matvec_cuda_complex64
 test_csr_matvec_cuda_float32
 test_csr_matvec_cuda_float64

To enable the above tests had to add some more hip mappings for the hipification process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78939
Approved by: https://github.com/pruthvistony, https://github.com/malfet

* Normalize DLPack stride to 1 where shape < 2 (#83158)

Fixes #83069. Also move all the dlpack tests to a new file., `test_dlpack.py`.

The fix involves always allocating a "strides" int array when converting to dlPack and deleting the strides when the capsule descructor is called. Then the strides are copied from the tensor, and `strides[i]` is set to `1` where `shape[i] < 2`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83158
Approved by: https://github.com/ezyang

* Remove DBR quantization from the codebase (#83642)

Summary:

DBR quantization is a no-go for now because it does not align well with
PyTorch 2.0 plans and we do not want to build yet another tracing system.

Deleting it from the codebase for now since there are no plans to develop
this in the near future. We can bring it back at a later time if necessary.

Test plan:

CI

Differential Revision: [D38839556](https://our.internmc.facebook.com/intern/diff/D38839556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83642
Approved by: https://github.com/andrewor14, https://github.com/jerryzh168

* Refactored ops on size to be dispatcher ops (#83719)

An example of how the graph looks now.
```
def forward(self, x_1):
    size = torch.ops.math.size(x_1, 0)
    size_1 = torch.ops.math.size(x_1, 1);  x_1 = None
    ones = torch.ops.aten.ones.default([1], device = device(type='cpu'), pin_memory = False)
    expand_sym_int = torch.ops.aten.expand.SymInt(ones, [size, size_1]);  ones = size = size_1 = None
    cos_default = torch.ops.aten.cos.default(expand_sym_int);  expand_sym_int = None
    return (cos_default,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83719
Approved by: https://github.com/ezyang

* Fix stride issue with faketensors (#83822)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83822
Approved by: https://github.com/ezyang, https://github.com/ngimel

* Nullary RNGOp (#1892)

* [ROCm] restore MIOpen benchmark flag default to true (#82656)

PR https://github.com/pytorch/pytorch/pull/77438 allowed MIOpen to support the benchmark flag. Previously, the benchmark flag was ignored by MIOpen such that benchmarking was always turned on. This commit restores the behavior that MIOpen benchmarking is by default turned on.

CI unit tests cover this capability.  Torchvision models demonstrate the performance delta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82656
Approved by: https://github.com/ngimel

* Update retry action to latest version (#83911)

We're running into EPERM issues when trying to install nvidia tools, see failure example https://github.com/pytorch/pytorch/runs/7975726013?check_suite_focus=true.
```
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1049
            throw err;
            ^

Error: kill EPERM
    at process.kill (internal/process/per_thread.js:199:13)
    at killPid (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1059:17)
    at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1036:21
    at Array.forEach (<anonymous>)
    at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1034:23
    at Array.forEach (<anonymous>)
    at killAll (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1033:27)
    at /home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1024:13
    at ChildProcess.onClose (/home/ec2-user/actions-runner/_work/_actions/nick-fields/retry/71062288b76e2b6214ebde0e673ce0de1755740a/dist/index.js:1080:17)
    at ChildProcess.emit (events.js:314:20) {
  errno: 'EPERM',
  code: 'EPERM',
  syscall: 'kill'
}

```

The root issue probably lies elsewhere but this action is not helping/the errors seem to say it's unable to kill child processes. A more recent commit in that repo uses spawn instead of exec which might make a difference.

Regardless, we should keep our actions up to date anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83911
Approved by: https://github.com/malfet

* [PyTorch] Remove unused sstream/string includes from c10/macros/Macros.h (#83353)

Nothing in the rest of the header seems to use these.

Differential Revision: [D38672680](https://our.internmc.facebook.com/intern/diff/D38672680/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83353
Approved by: https://github.com/malfet

* [functorch] add linalg cross batch rule (#83759)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83759
Approved by: https://github.com/zou3519

* Improve DistanceKernel.cu (#83811)

include device_sqrt
replace reduce_agg by BlockReduce
choose implementation by impl_fptr instead of error-prone copy-and-paste
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83811
Approved by: https://github.com/ngimel

* reinplace pass: bugfix for output node replacement (#83845)

Cleaned up some of the arg replacement logic to use tree_map, so it handles FX nodes that have nested containers.

See the added test: when you write a function that returns a list, the `output` node in the FX graph shows up as having `node.args = tuple(immutable_list(...))`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83845
Approved by: https://github.com/ezyang

* reinplace pass: special handling for view_scatter ops (#83846)

There is already special handling in the reinplacing pass for removing `{view}_scatter` ops, but there is another case that needs special handling. In this code:
```
         def f():
             a = torch.zeros(4, 4, 4)
             a[:, 2:] = torch.ones(4, 2, 4)
             return a
```

Tracing normally with `make_fx()` gives you:
```

def forward(self):
    zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False)
    ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False)
    slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807);  slice_tensor = None
    copy__default = torch.ops.aten.copy_.default(slice_tensor_1, ones);  slice_tensor_1 = ones = None
    return zeros
```
Functionalizing it gives you:

```
def forward(self):
    zeros = torch.ops.aten.zeros.default([4, 4, 4], device = device(type='cpu'), pin_memory = False)
    ones = torch.ops.aten.ones.default([4, 2, 4], device = device(type='cpu'), pin_memory = False)
    slice_tensor = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_tensor_1 = torch.ops.aten.slice.Tensor(slice_tensor, 1, 2, 9223372036854775807);  slice_tensor = None
    slice_tensor_2 = torch.ops.aten.slice.Tensor(zeros, 0, 0, 9223372036854775807)
    slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, 9223372036854775807);  slice_tensor_2 = ones = None
    slice_scatter_default_1 = torch.ops.aten.slice_scatter.default(zeros, slice_scatter_default, 0, 0, 9223372036854775807);  zeros = slice_scatter_default = None
    return slice_scatter_default_1
```

Notice that there are not any functional ops to directly re-inplace! What actually happened is that functionalization turned the `copy_()` into a `copy()`, but the out-of-place `copy()` operator gets optimized away because it's a no-op (when the input and output metadata are the same, `out = copy(a, b)` just returns `b`).

What we actually want is to replace this line:
```
slice_scatter_default = torch.ops.aten.slice_scatter.default(slice_tensor_2, ones, 1, 2, ...);
```
with this:
```
new_slice = torch.ops.aten.slice.Tensor(slice_tensor_2, 1, 2, ...);
_ = torch.ops.aten.copy_.default(new_slice, ones)
```

In the above, we're taking a fresh slice of the "base" tensor, and performing a `copy_()` on the slice, adding back what functionalization removed.

We actually need to create a fresh "slice" node, because we're not guaranteed that one already exists in the graph (technically there should be one, but it might have been DCE'd by the time we hit re-inplacing)

I also updated the docs for re-inplacing to more closely match the order of the logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83846
Approved by: https://github.com/ezyang

* Move ATenNVRTC.h include from `jit_utils.h` to `jit_utils.cpp` (#83886)

In general, `.h` files should only include headers that are used in the header

Fixes https://github.com/pytorch/pytorch/issues/83856
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83886
Approved by: https://github.com/ngimel

* Allow None arguments for elementwise type promotion wrapper and fix clamp with None arguments (#83586)

Fixes https://github.com/pytorch/torchdynamo/issues/759
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83586
Approved by: https://github.com/ezyang, https://github.com/ngimel

* Enable NCCL_DESYNC_DEBUG when TORCH_DISTRIBUTED_DEBUG=DETAIL (#83881)

Automatically enable `NCCL_DESYNC_DEBUG` when `TORCH_DISTRIBUTED_DEBUG` is set to `DETAIL`.
Saving user from setting two env variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83881
Approved by: https://github.com/malfet, https://github.com/rohan-varma, https://github.com/H-Huang

* Strenghten preconditions of linalg.cross (#83798)

This makes `linalg.cross` array API complaint (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs.

Fixes https://github.com/pytorch/pytorch/issues/77629
Fixes https://github.com/pytorch/pytorch/issues/83756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798
Approved by: https://github.com/mruberry

* Fix view_func replay in no-grad mode (#83872)

Fixes https://github.com/pytorch/pytorch/issues/83828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83872
Approved by: https://github.com/albanD

* [vulkan] Add VMA as a third_party subrepo (#83906)

the [VulkanMemoryAllocator](https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator) is a popular library for GPU memory allocation using Vulkan. The Vulkan backend has a dependency on it, but since it is only a single header file we currently include it by checking it into the repo under [aten/src/ATen/native/vulkan/api/vk_mem_alloc.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h). However, it is better to check it in as a third party submodule, since it allows better version tracking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83906
Approved by: https://github.com/kimishpatel

* [torchgen] Add documentation for `autogen` keyword (#83610)

This is a follow up for #81437. This PR explains what operator can use `autogen` and what will be generated. Also talked about generated kernels and where to find them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83610
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* remove assertEqualIgnoreTypes from test/distributions/test_distributions.py (#83709)

See https://github.com/pytorch/pytorch/issues/38095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83709
Approved by: https://github.com/kit1980

* [fix] edge case in `MaxPool1d` and add ErrorInputs (#83553)

Fixes #83224

cc @kshitij12345 @albanD!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83553
Approved by: https://github.com/albanD

* [complex] conv_transpose1d (#79694)

Reference: https://github.com/pytorch/pytorch/issues/71108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79694
Approved by: https://github.com/ngimel

* Revert "Strenghten preconditions of linalg.cross (#83798)"

This reverts commit 7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e.

Reverted https://github.com/pytorch/pytorch/pull/83798 on behalf of https://github.com/janeyx99 due to Sorry, land race caused functorch issues https://hud.pytorch.org/pytorch/pytorch/commit/7f0198e7390eff2f2f5fcb33ce36c99ec3b7f55e

* Fix load_extra_only api for flatbuffers and enable flatbuffers in mobile for OSS properly (#83855)

`_load_extra_only_for_mobile` API hasn't handled flatbuffers logic yet. Update the api accordingly.

Also find out mobile build in OSS doesn't build with flatbuffers. Filed task T129996445 to track

Differential Revision: [D38890847](https://our.internmc.facebook.com/intern/diff/D38890847/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38890847/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83855
Approved by: https://github.com/qihqi

* Prefer signal from land checks over PR signals (#83715)

When a dev forks their branch from a red master build, their branch can fail CI checks for reasons unrelated to their changes, but the same checks would however pass in the land validation commit (which is rebased off of viable/strict)

Today, in the above scenario the `merge -l` command fails because mergebot sees the failing checks in the PR, which is not helpful when that same check passes in land validation.

This PR changes the behavior so that:
1. If both the PR and land validation ran a workflow, only look at the results from land validation
2. If only the PR ran a specific workflow (e.g. for CLA Check or a nightly run) then continue to look the result from the PR (which matches existing behavior)

It also includes a few extra BE fixes:
- Replaces the tuple we used to pass workflow check results around with a named tuple so that it's easier to tell what data is being used
- Reduces the number of API calls to github by ~50% during merges.  Before, we were pulling results from github every time and then filtering it down to the relevant category of checks (e.g. failed/pending/startup_failed). Now, our filters share the check results
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83715
Approved by: https://github.com/zengk95

* Don't introduce new overload for SymInt (#83628)

Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* Remove CoreMLMemoryObserver (#83703)

Summary: We added this observer to help us diagnose memory issues that have since resolved. It should be safe to clean this up.

Test Plan: Diff just removed logging, so just build IG and confirm no errors.

Differential Revision: D38843701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83703
Approved by: https://github.com/mcr229

* ci: Remove dead code related to android uploads (#83930)

These uploads actually never got triggeredhappened in nightlies so
removing it altogether. Someone can re-add in the future if they feel
these are important but I can't find an instance of this running since
we migrated so I have a hard time believing anyone will miss it.

https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=android

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83930
Approved by: https://github.com/atalman, https://github.com/malfet

* [fx][pass infra] Adding error catching (#83933)

Example:

```
======================================================================
ERROR: test_pass_manager_error (fx.test_pass_infra.TestPassManager)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 285, in __call__
    res = fn(module)
  File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 164, in pass_fail
    raise RuntimeError("bad")
RuntimeError: bad

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/angelayi/Projects/pytorch/test/fx/test_pass_infra.py", line 170, in test_pass_manager_error
    pm(traced_m)
  File "/Users/angelayi/Projects/pytorch/torch/fx/passes/infra/pass_manager.py", line 289, in __call__
    raise RuntimeError(msg) from e
RuntimeError: An error occured when running the 'pass_fail' pass after the following passes: ['replace_add_with_mul_pass', 'replace_mul_with_div_pass']
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83933
Approved by: https://github.com/SherlockNoMad

* Back out "Support regex-style matching for Any and Oneof (#82853)" (#83922)

Reviewed By: hl475

Differential Revision: D38945806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83922
Approved by: https://github.com/hl475

* Fix use-dict-literal lint (#83718)

Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR should do the change for every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718
Approved by: https://github.com/albanD

* Revert "Optimize transpose copy on CPU using fbgemm transpose (#83327)"

This reverts commit 04d8da88a6a1abf0da2b11096c85244bf38d3b2a.

Reverted https://github.com/pytorch/pytorch/pull/83327 on behalf of https://github.com/weiwangmeta due to breaking internal builds/causing out-of-bounds errors/training accuracy

* Add hypothesis to requirements.txt (#83740)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83740
Approved by: https://github.com/zhxchen17, https://github.com/janeyx99, https://github.com/zou3519

* [fbia] Keep Track of full qualified name before and after remote sharding (#83889)

Summary: track qualname changes in embedding sharding & FX split, and compose target qualname in the end of FBIA transform stage, so we can use the qualname mapping in XL materialize stage

Test Plan:
CI/CD

with DISABLE_XLEBB_MATERIALIZATION = True
https://fburl.com/fblearner/a8yljbux

with DISABLE_XLEBB_MATERIALIZATION = False
https://fburl.com/fblearner/2nvi0dam

Reviewed By: lliu315gt

Differential Revision: D38772525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83889
Approved by: https://github.com/houseroad

* add merge blocking to ci: sev template (#83940)

as in title, so that by default, ci: sev will block merges

the line can be removed to not block merges
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83940
Approved by: https://github.com/huydhn, https://github.com/janeyx99, https://github.com/malfet, https://github.com/seemethere

* Move nnapi code from ATen common code to specific library (#83748)

Summary: Currently we include nnapi code in all targets using ATen even if it's not used (actually there is no usage and being deprecated). Move it to `nnapi_backend_lib` for now.

Test Plan: Sandcastle.

Differential Revision: D38761095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83748
Approved by: https://github.com/salilsdesai, https://github.com/SS-JIA

* Task: T129772171 remove assertEqualIgnoreTypes from test/test_nn.py (#83870)

See https://github.com/pytorch/pytorch/issues/38095
Replaced assertEqualIgnoreType with assertEqual
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83870
Approved by: https://github.com/kit1980

* [Nested Tensor] Make offset copy and move assignment more explicit. (#83488)

Currently the nested tensor construction for the offset_ parameter takes in references and in the chain of delegation uses value. This could lead to unnecessary copies.  Whenever a nested tensor impl is constructed it should take ownership of all its metadata. The only non-trivially copyable metadata associated with the class is `offsets_`.

The goal of this PR is to make sure that consumers of nested_tensor_impl constructors ensure that they are passing offsets as a temporary - either buy explicitly copying a reference, or by constructing the offsets vector in the scope of construction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83488
Approved by: https://github.com/albanD, https://github.com/bdhirsh

* Remove conj kernels for real dtypes (#80374)

`conj_physical_stub` is currently implemented for all dtypes despite
it just being a plain copy for real dtypes. So, instead we should
defer to the existing copy kernel in these cases.

On my build for one CUDA architecture, I see a 2.2 MB decrease in
`libtorch_cuda.so` size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80374
Approved by: https://github.com/ngimel, https://github.com/atalman

* [BE][CUDA] Use packed_accessor64 (#83949)

Not sure why we are ignoring those, but SoftMax.cu alone
generates 100+ lines of warnings:
```
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In function ‘at::Tensor at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::get_offsets(const at::Tensor&, const IntArrayRef&, int64_t)’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:261:69: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = long int; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto indices_accessor = indices.packed_accessor<int64_t, 2>();
                                                                     ^
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:924:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:607:1677:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:927:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax(at::Tensor&, const at::Tensor&, int64_t) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:623:1679:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:423:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:426:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:977:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = false; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:641:1775:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = double; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:980:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘void at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::cuda_sparse_coo_softmax_backward(at::Tensor&, const at::Tensor&, const at::Tensor&, int64_t, c10::ScalarType) [with scalar_t = float; bool LogSoftMax = true; int64_t = long int]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:661:1777:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:542:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto values_accessor = values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:545:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto out_values_accessor = out_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:548:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
   auto grad_values_accessor = grad_values_2.packed_accessor<scalar_t, 2>();
      ^~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:16:557:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = true; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:18:556:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = double; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:20:557:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = double; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu: In instantiation of ‘std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> at::native::_GLOBAL__N__39f8a8aa_10_SoftMax_cu_75209b9c::compute_pool_max(const at::Tensor&, const at::Tensor&, const IntArrayRef&, int64_t, int64_t) [with scalar_t = float; bool requireMxRows = false; at::IntArrayRef = c10::ArrayRef<long int>; int64_t = long int]’:
/tmp/tmpxft_000040e0_00000000-6_SoftMax.cudafe1.stub.c:21:556:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/sparse/cuda/SoftMax.cu:347:6: warning: ‘at::GenericPackedTensorAccessor<T, N, PtrTraits, index_t> at::Tensor::packed_accessor() const & [with T = float; long unsigned int N = 2; PtrTraits = at::DefaultPtrTraits; index_t = long int]’ is deprecated: packed_accessor is deprecated, use packed_accessor32 or packed_accessor64 instead [-Wdeprecated-declarations]
     auto values_accessor =
      ^~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:245:1: note: declared here
   GenericPackedTensorAccessor<T,N,PtrTraits,index_t> packed_accessor() const & {
 ^ ~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83949
Approved by: https://github.com/ngimel

* Support returning symbolic strides from t.stride() in Python (#83842)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83842
Approved by: https://github.com/albanD, https://github.com/Chillee, https://github.com/bdhirsh

* Support the XPU backend untyped storage (#83952)

Simple add XPU backend in untyped torch storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83952
Approved by: https://github.com/ezyang

* Support NCCL Premul Sum (#81272)

This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum.

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul" whose current hash is https://github.com/pytorch/pytorch/pull/81272/commits/cb99ad67447b5899ecf8c4c3d78deaafa1cc09b8 was authored by @mcarilli and @ptrblck.

cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501

* Test type promotion assertignoretypes (#83867)

See #38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83867
Approved by: https://github.com/kit1980, https://github.com/mruberry

* [Profiler] record nn.Module's parameters (#83209)

Summary:
Record nn.Module's parameters for detaild memory profiling:
- extend 'module_' in value cache  & NNModuleInfo to save parameters
- python binding and unit test case

Test Plan: buck run mode/opt //caffe2/test:profiler -- -r test_nnmodule

Differential Revision: D38379717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83209
Approved by: https://github.com/robieta

* [xla hash update] update the pinned xla hash (#83967)

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83967
Approved by: https://github.com/pytorchbot

* Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)

* Fix LTC build warnings (#83955)

Addresses `Wc++98-compat-extra-semi` warning from https://github.com/llvm/torch-mlir/issues/1264 by removing extraneous semicolon after autogen LTC native function definitions.

```
/home/runner/work/torch-mlir/torch-mlir/build/tools/torch-mlir/python/torch_mlir/csrc/base_lazy_backend/generated/LazyNativeFunctions.cpp:4241:6: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
    };
     ^
```

cc: @wconstab @desertfire @ke1337 @antoniojkim
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83955
Approved by: https://github.com/wconstab

* Strenghten preconditions of linalg.cross (#83798)

This makes `linalg.cross` array API complaint (https://github.com/data-apis/array-api/issues/415) and fixes a few bugs.

Fixes https://github.com/pytorch/pytorch/issues/77629
Fixes https://github.com/pytorch/pytorch/issues/83756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83798
Approved by: https://github.com/mruberry

* Make linalg.inv composite of linalg.solve (#80074)

The `getri` kernel calls inside `getrs` so we can do so explicitly
ourselves and save ourselves from having to maintain an extra kernel.
This way we just need to optimise `lu_factor` and `lu_solve` and `inv`
will be as efficient as it can be, as it'll be choosing the best backend
to perform the factorisation and the best backend (not necessarily the
same) to perform the solve.

Fixes https://github.com/pytorch/pytorch/issues/77498

The benchmarks: https://github.com/pytorch/pytorch/pull/80074#issuecomment-1164309071
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80074
Approved by: https://github.com/IvanYashchuk, https://github.com/albanD, https://github.com/malfet

* Support a stable double backward on linalg.det for real inputs (#80217)

The complex case still fails. I do not know why.

Fixes https://github.com/pytorch/pytorch/issues/62327
Fixes https://github.com/pytorch/pytorch/issues/53364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80217
Approved by: https://github.com/nikitaved, https://github.com/albanD, https://github.com/malfet

* [LTC] Add custom lazy tensor save function (#83294)

We need a custom `save` function for checkpointing a lazy model, similar to what exists in PyTorch/XLA:
https://github.com/pytorch/xla/blob/3eb8a9d9eb4ebb0b064461c3704650241625654e/torch_xla/core/xla_model.py#L994
The purpose of this function is to move any lazy tensors to CPU before saving the checkpoint.

The way I implemented it was to create a general structure visitor, adapted from a function that we use quite often in Cerebras internal repositories. If there is a better tool already available in PyTorch that does the same things, I'm open to suggestions.

CC: @wconstab @Krovatkin @JackCaoG
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83294
Approved by: https://github.com/wconstab

* move pooling test from test_nn to test/nn/test_pooling (#83915)

Ref #63085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83915
Approved by: https://github.com/albanD

* [ONNX] Remove static None graph output (#82623)

Fixes #82370
* Unify the export behavior regarding static None outputs. These are
dropped for both traced graph and TorchScript graph export.
* `Optional` outputs are not affected.
Fixes #82370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82623
Approved by: https://github.com/AllenTiTaiWang, https://github.com/abock

* [TorchTidy Fix] Don't try to collect strides for non-strided tensors (#83935)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83935
Approved by: https://github.com/robieta, https://github.com/slgong-fb

* [WIP] Validating input_col for certain datapipes (#80267)

Follow up from #79344.

Currently WIP due to multiple test failures.

Waiting for #80140 to land
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80267
Approved by: https://github.com/ejguan

* support more symintnode operations (#83877)

remove debug code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83877
Approved by: https://github.com/ezyang

* add arithmetic ops (#83878)

arithmetic ops tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83878
Approved by: https://github.com/ezyang

* logical ops (#83879)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83879
Approved by: https://github.com/ezyang

* strip SymIntNodes off in the mobile builds (#83938)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83938
Approved by: https://github.com/ezyang

* [pthreadpool] Cap max thread count to fix TSAN issues (#83950)

Summary: Cap the thread count to 64 unconditionally to solve this tsan issue which leads to harder to debug, flaky test failures.

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D38136212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83950
Approved by: https://github.com/kimishpatel

* Skip NCCL slimming for …
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged module: dlpack open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

__dlpack__() should normalize strides
9 participants