
Conversation


@Flamefire Flamefire commented Apr 25, 2025

The argument needs to be appended when test reports should be generated. IS_CI is not necessarily set, so check TEST_SAVE_XML instead, as is done in other places where test reports are conditionally enabled.

See also #126523
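A minimal sketch of the intended gating, assuming the argument is appended from a Python helper; the flag name and script path below are illustrative, not the exact ones touched by this PR:

```
import os

# Gate the report argument on TEST_SAVE_XML (set whenever XML test reports are
# requested) rather than on IS_CI, which may be unset.
args = ["python", "test/run_test.py"]
if os.environ.get("TEST_SAVE_XML"):
    args.append("--save-xml=test-reports")  # illustrative flag name
```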

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @snadampal @mcarilli @ptrblck @leslie-fang-intel @voznesenskym @penguinwu @EikanWang @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

masnesral and others added 30 commits April 18, 2025 18:49
Summary: test/dynamo/test_utils.py is out of date because of some new dynamo_timed fields (I guess the test is disabled?). Bring it up to date.

Test Plan: `python test/dynamo/test_utils.py`

Fixes pytorch#148093

Pull Request resolved: pytorch#151599
Approved by: https://github.com/Skylion007
…151698)

The name was updated by pytorch#151155. Without this change, the benchmark results weren't updated on the dashboard.

For the PT2 compiler perf benchmark, we are still relying on this old workflow. To get rid of it, we need to update the PT2 benchmark dashboard to use the new benchmark database (cc @yangw-dev)

The results are in the new database:

```
SELECT
    *
FROM
    oss_ci_benchmark_v3
WHERE
    workflow_id = 14510035576
```

but not in the old database:

```
SELECT
    *
FROM
    inductor_torch_dynamo_perf_stats
WHERE
    workflow_id = 14510035576
```

Pull Request resolved: pytorch#151698
Approved by: https://github.com/seemethere, https://github.com/atalman
Summary: We're now on a later ROCm version, so it's OK to add the uuid back.

Test Plan: sandcastle

Differential Revision: D73240086

Pull Request resolved: pytorch#151652
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/houseroad
This is part of splitting up pytorch#150558 into smaller chunks; please see that PR for more context.

Similar to pytorch#151483 but for libtorch

Changed the job name

Testing:
Can't really test, since PRs don't have the credentials to push to docker.io, which is where the images used for everything, including PRs, currently come from.
Pull Request resolved: pytorch#151488
Approved by: https://github.com/atalman
This is part of splitting up pytorch#150558 into smaller chunks; please see that PR for more context.

Similar to pytorch#151483 but for manywheel

Changed the job name

s390x doesn't have access to AWS ECR, so it doesn't use the action. The manylinuxs390x-builder ECR repo doesn't exist on Docker Hub, so it's unclear why the image name is that.

Testing:
Can't really test, since PRs don't have the credentials to push to docker.io, which is where the images used for everything, including PRs, currently come from.
Pull Request resolved: pytorch#151489
Approved by: https://github.com/seemethere
…ch#151683)

Summary: Testing the script further, we found that we shouldn't always assume rank 0 is the first rank, so we need to check all entries to see whether there is a P2P op for this coalesced group.
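An illustrative-only sketch of the check (the script's real entry format and helpers differ): scan every recorded entry of the coalesced group instead of assuming rank 0 appears first.

```
# Illustrative only: `entries` stands in for the recorded ops of one coalesced
# group, gathered from any rank and in any order, as a list of dicts.
def coalesced_group_is_p2p(entries):
    return any(
        entry.get("profiling_name", "").startswith(("nccl:send", "nccl:recv"))
        for entry in entries
    )
```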

Test Plan: Directly tested with a corner case.

Differential Revision: D73266257

Pull Request resolved: pytorch#151683
Approved by: https://github.com/fegin
…pendencies.cmake (pytorch#151583)

Fixes [pytorch#147220]

Context: In the CUDA NVTX world there are NVTX v2 and NVTX v3. As announced in the CUDA release notes, e.g. [CUDA 12.8 Update 1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-or-dropped-operating-systems): "NVTX v2 is deprecated. To migrate to NVTX v3, change your code from `#include <nvtoolsext.h>` to `#include "nvtx3/nvtoolsext.h"`. This header is included in the toolkit."
On the PyTorch side, the TORCH_CUDA_USE_NVTX3 compile-time macro is set to true when (most of the time) nvtx3 is found. nvtx3 is found in two cases: 1) with USE_SYSTEM_NVTX=0 (the default), the torch build process automatically looks for nvtx3 in pytorch/third_party/nvtx; this is the most common and default case. 2) with USE_SYSTEM_NVTX=1, nvtx3 is taken from the installed CUDA toolkit (e.g. CUDA 12.8, and even some earlier CUDA versions).
As described in pytorch#147220, the reason it can find pytorch/third_party/nvtx is that it used
https://github.com/pytorch/pytorch/blob/6f035d8462e43b1c678e5f334d52d9df0e00e6bf/cmake/public/cuda.cmake#L176
(note the PROJECT_SOURCE_DIR usage in [pytorch/cmake/public/cuda.cmake](https://github.com/pytorch/pytorch/blob/6f035d8462e43b1c678e5f334d52d9df0e00e6bf/cmake/public/cuda.cmake#L176)).

Before this PR:
The PyTorch build succeeds in finding nvtx3 through the process described above, so everything is fine there. But downstream projects like torchvision *can* fail, and by default will fail, because the following happen:
1) USE_SYSTEM_NVTX=0 is used (most likely, since it is the default).
2) NVTX v2 can no longer be found (e.g. in future CUDA versions, because deprecation eventually becomes removal).
3) TorchVision cannot find NVTX3 either, because torchvision invokes [pytorch/cmake/public/cuda.cmake] but PROJECT_SOURCE_DIR is no longer the pytorch source tree, it is the torchvision source tree!
4) One workaround is USE_SYSTEM_NVTX=1, but users have to set this explicitly and do the plumbing work.

After this PR:
PyTorch can still find nvtx3, because the code that finds it has simply moved to a new place. The CI logs show it being found, e.g. [this job](https://productionresultssa14.blob.core.windows.net/actions-results/47f8efaa-0afe-4e1f-bc94-0a82629941cb/workflow-job-run-dc8201b1-845b-5da1-a6ea-d3360ce1b508/logs/job/job-logs.txt?rsct=text%2Fplain&se=2025-04-18T20%3A38%3A05Z&sig=yMd6egC%2Banl3lR%2BudXFX18bfUH189z0DTGLtscHQJwY%3D&ske=2025-04-19T06%3A21%3A45Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2025-04-18T18%3A21%3A45Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2025-01-05&sp=r&spr=https&sr=b&st=2025-04-18T20%3A28%3A00Z&sv=2025-01-05), which reads "`Found nvtx3: C:/actions-runner/_work/pytorch/pytorch/pytorch/third_party/NVTX/c/include`".
For torchvision, it still invokes [pytorch/cmake/public/cuda.cmake], but that file no longer tries to find nvtx3, and torchvision does not use nvtx3 (if it does in the future, it can set USE_SYSTEM_NVTX=1 by default). So the error reported in [pytorch#147220] is avoided.

Pull Request resolved: pytorch#151583
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/malfet
pytorch#151623)

- Update docstring list formatting
- Use a try/finally block to keep the model unmodified if save() fails (see the sketch below).
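A minimal sketch of the try/finally idea, with hypothetical `_patch_model_for_save` / `_unpatch_model` helpers standing in for whatever the exporter temporarily mutates:

```
def save(model, destination):
    # Hypothetical helpers: apply temporary in-place edits needed for saving
    # and undo them afterwards, so the model stays unmodified even if saving fails.
    saved_state = _patch_model_for_save(model)
    try:
        _serialize(model, destination)
    finally:
        _unpatch_model(model, saved_state)
```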

Pull Request resolved: pytorch#151623
Approved by: https://github.com/titaiwangms
Summary: test/dynamo/test_structured_trace.py is out of date because of some new fields (I guess the test is disabled?). Bring it up to date.

Test Plan: `python test/dynamo/test_structured_trace.py`

Fixes pytorch#149671

Pull Request resolved: pytorch#151606
Approved by: https://github.com/Skylion007
ghstack dependencies: pytorch#151599
…ls._infer_size for wildcard dims (pytorch#150127)"

This reverts commit 1dd2033.

Reverted pytorch#150127 on behalf of https://github.com/clee2000 due to maybe caused export test to fail? export/test_draft_export.py::TestDraftExport::test_masked_linear [GH job link](https://github.com/pytorch/pytorch/actions/runs/14538768138/job/40794985504) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/1dd2033c0a1de460ee2bad8d64c36a0344886071), bad TD ([comment](pytorch#150127 (comment)))
It ends up being templated over a bunch of reference-to-array-of-characters types with different lengths, such as `char const (&) [88]`, which is an annoyance when profiling and possibly a source of code bloat.

Differential Revision: [D73129450](https://our.internmc.facebook.com/intern/diff/D73129450/)

Pull Request resolved: pytorch#151626
Approved by: https://github.com/Skylion007, https://github.com/malfet
1) Reserving is much better than not reserving.
2) std::transform for a loop with a one-line body is generally not considered an improvement (and doesn't seem to get boiled away by clang under -Oz).

Differential Revision: [D73013363](https://our.internmc.facebook.com/intern/diff/D73013363/)
Pull Request resolved: pytorch#151627
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: pytorch#151626
Add a clearly missing reserve (we should expect that pieces are not empty).

Differential Revision: [D73129445](https://our.internmc.facebook.com/intern/diff/D73129445/)

Pull Request resolved: pytorch#151628
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: pytorch#151626, pytorch#151627
Summary: When running the on-demand profiler with stack collection, the decref causes a segfault. I tried checking the refcount and the object itself and both look fine, but it still segfaults every time. Let's remove it for now and revisit.

This will introduce a small memory leak, but it should be small enough that it has no significant impact on the jobs run.

Test Plan:
Removed decref and got clean traces
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1744933624/localhost/libkineto_activities_2936811.json.gz&bucket=gpu_traces

Differential Revision: D73225468

Pull Request resolved: pytorch#151625
Approved by: https://github.com/davidberard98
…ytorch#145523) (pytorch#146051) (pytorch#151481)

Summary:

This config is not supported (it throws an error when set), and doesn't really make sense imo.

Approved by: https://github.com/eellison

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/edf266e9bbbf6063f7c4a336ffb50234e11a0a82

Reviewed By: masnesral

Differential Revision: D68846308

Pull Request resolved: pytorch#151481
Approved by: https://github.com/masnesral
By building the wheel with USE_DISTRIBUTED=1.

Otherwise an attempt to run
```
python3 benchmarks/dynamo/torchbench.py --performance --only hf_T5 --backend inductor --inference --devices mps
```
will fail with
```
  File "/Users/nshulga/Library/Python/3.10/lib/python/site-packages/transformers/modeling_utils.py", line 40, in <module>
    import torch.distributed.tensor
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/__init__.py", line 4, in <module>
    import torch.distributed.tensor._ops  # force import all built-in dtensor ops
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/__init__.py", line 2, in <module>
    from ._conv_ops import *  # noqa: F403
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/_conv_ops.py", line 5, in <module>
    from torch.distributed.tensor._dtensor_spec import DTensorSpec, TensorMeta
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_dtensor_spec.py", line 6, in <module>
    from torch.distributed.tensor.placement_types import (
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/placement_types.py", line 8, in <module>
    import torch.distributed._functional_collectives as funcol
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/_functional_collectives.py", line 9, in <module>
    import torch.distributed.distributed_c10d as c10d
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 23, in <module>
    from torch._C._distributed_c10d import (
ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package
```
Pull Request resolved: pytorch#151721
Approved by: https://github.com/wdvr, https://github.com/dcci, https://github.com/huydhn
…ported (pytorch#145523) (pytorch#146051) (pytorch#151481)"

This reverts commit cfc4d74.

Reverted pytorch#151481 on behalf of https://github.com/malfet due to It indeed breaks lint; its followup PR contains its own issues ([comment](pytorch#151481 (comment)))
…1732)

By constructing the tensor on that device, because the test does not call `self.common` but rather executes directly.

Otherwise `test_add_complex3_mps` will test the CPU inductor rather than the MPS one.
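A hedged sketch of the pattern (not the test's exact code): build the complex inputs directly on the device under test so that `torch.compile` exercises that device's Inductor backend instead of silently falling back to CPU.

```
import torch

def run_add_complex(device: str):
    # Constructing the tensors on `device` ensures the compiled kernel targets
    # that backend (e.g. "mps"), not the CPU default.
    x = torch.tensor([1 + 1j, 2 + 2j], dtype=torch.complex64, device=device)
    y = torch.tensor([3 - 1j, 4 - 2j], dtype=torch.complex64, device=device)
    fn = torch.compile(lambda a, b: a + b)
    return fn(x, y)
```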

Pull Request resolved: pytorch#151732
Approved by: https://github.com/dcci
This PR adds support for submatrices in offline tuning for:
- GEMM
- GEMM and bias
- ScaledGEMM
- Batch Strided GEMM

New UTs cover submatrices. Submatrices for the strided batch API are not part of this PR and will be done separately.

There is also a bug fix in offline tuning for the full-matrix GEMM-and-bias path in the `NT` case. Offline and online UTs were updated to cover this corner case.

To improve code readability, the definitions of transA and transB were swapped.
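A hedged sketch of the kind of submatrix GEMM the offline tuner can now record, assuming TunableOp is enabled through its usual environment variables (e.g. PYTORCH_TUNABLEOP_ENABLED=1); the shapes and slices are arbitrary.

```
import torch

# Submatrices are views into larger buffers, so their leading dimensions differ
# from their logical sizes; this is the case the offline tuning path now handles.
A = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
B = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

A_sub = A[128:640, 256:768]   # 512x512 view, row stride (lda) = 1024
B_sub = B[0:512, 0:512]       # 512x512 view, row stride (ldb) = 1024
C = A_sub @ B_sub             # GEMM on submatrices
```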

Pull Request resolved: pytorch#151138
Approved by: https://github.com/jeffdaily
… 0 (pytorch#151226)

Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.

The memory of the base torch.Event is not actually used by torch.cuda.Event and torch.xpu.Event, but they still need to inherit from torch.Event because CPython checks the type.

Repro with CUDA:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: pytorch#151226
Approved by: https://github.com/albanD
As the title states, there is a difference between the declaration and the implementation.

Declaration:
https://github.com/pytorch/pytorch/blob/d5a19e4525f49049f822930ed85fe32bb004589c/torch/_C/__init__.pyi.in#L157-L162

Implementation:
https://github.com/pytorch/pytorch/blob/d5a19e4525f49049f822930ed85fe32bb004589c/torch/csrc/Event.cpp#L30-L32

**Question**: Which one should we choose?
- Change enable_timing to False to be consistent with torch.cuda.Event
- Change enable_timing to True to avoid a BC break
Pull Request resolved: pytorch#151221
Approved by: https://github.com/albanD
ghstack dependencies: pytorch#151226
…151411)

**Changes:**
- Add detailed function and class signatures
- Fix the incorrect display of torch.Event.wait and torch.Event.record
Pull Request resolved: pytorch#151411
Approved by: https://github.com/albanD
ghstack dependencies: pytorch#151226, pytorch#151221

pytorch-bot bot commented Apr 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152167

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Flamefire Flamefire deleted the branch pytorch:Flamefire-patch-1 April 25, 2025 07:00
@Flamefire Flamefire closed this Apr 25, 2025
@pytorch-bot pytorch-bot bot added ciflow/inductor ciflow/linux-aarch64 linux aarch64 CI workflow ciflow/mps Run MPS tests (subset of trunk) ciflow/trunk Trigger trunk jobs on your pull request module: amp (automated mixed precision) autocast module: cpu CPU specific problem (e.g., perf, algorithm) module: dynamo module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: quantization release notes category release notes: releng release notes category release notes: distributed (checkpoint) labels Apr 25, 2025
@Flamefire
Collaborator Author

Wrong target branch and can't change it now. New PR: #152170

@Flamefire Flamefire deleted the Flamefire-patch-1 branch May 6, 2025 07:47