forked from pytorch/pytorch
Test that TORCH_FEATURE_VERSION guards are used where needed #1
Open: tomerqodo wants to merge 48 commits into base_20251119_6409 from head_20251119_6409
Conversation
This PR outputs chars to the stream without building temporary strings. The changes were generated by (on fish)
```
sed -i -e 's/<< "\([^\\\']\)"/<< \'\1\'/g' (grep '<< "."' -r torch c10 aten -l)
```
and then reverting some invalid changes.

Pull Request resolved: pytorch#167899
Approved by: https://github.com/Skylion007
# Description
Fixes pytorch#114850: port the test utils and schema check to Intel GPU. We enable Intel GPU with the following methods, keeping the original code style as much as possible.

# Changes
1. Get the device type from the accelerator and the `get_devtype` helper method (see the sketch below).
2. Replace the requires-CUDA statements with `device_type`.
3. Add `HAS_XPU` and `HAS_GPU` checks to replace some of the existing device checks.

Pull Request resolved: pytorch#166684
Approved by: https://github.com/ezyang, https://github.com/guangyey
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
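As a rough illustration of the device-type plumbing described above (a hedged sketch, not the exact helper used in the tests; `get_devtype` here is a stand-in, and the `torch.accelerator` API is assumed to be present in recent PyTorch):

```python
import torch

def get_devtype():
    # Stand-in for the helper referenced above: prefer whichever accelerator
    # backend is present (CUDA, XPU, ...) and fall back to CPU.
    if hasattr(torch, "accelerator") and torch.accelerator.is_available():
        return torch.accelerator.current_accelerator().type
    return "cpu"

device_type = get_devtype()
x = torch.randn(4, device=device_type)
print(device_type, x.device)
```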
Summary: This diff is a follow-up to D85883723.

Test Plan: See D86719598. We are now able to publish the model. Unit test:
```
buck run fbcode//mode/opt -c remoteexecution.local=enabled fbcode//sigmoid/inference/test:test_passes -m ovr_config//triton:experimental -- -r test_triton_hop_cpu
```

Differential Revision: D87091238
Pull Request resolved: pytorch#167862
Approved by: https://github.com/XueningXu
Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: pytorch#167916 Approved by: https://github.com/Skylion007
**Summary:** Optimize scalar welford_reduce implementation, combining Welford algorithm with cascade sum to improve numerical stability. Specifically:
1. Use Welford algorithm to compute mean and variance.
2. Use cascade summation when computing sum over input for both mean and variance.

**Example:** Take pytorch#141541 as an example:
```
import torch
import torch.nn as nn

torch.manual_seed(0)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x):
        return self.gn(x)

model = Model().eval()
x = torch.randn(1, 32, 128, 128, 128)

with torch.no_grad():
    output = model(x)
    with torch._inductor.config.patch({"cpp.simdlen": 0}):
        c_model = torch.compile(model)
        c_output = c_model(x)
    print(torch.max(torch.abs(output - c_output)))
    print(torch.allclose(output, c_output, 1.3e-6, 1e-5))
```

**logs**
- before
```
tensor(0.0005)
False
```
- After
```
tensor(1.4305e-06)
True
```

**Generated code:**
- before
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['float*', 'float*', 'const float*', 'const float*', 'const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(float* in_out_ptr0, float* in_out_ptr1, const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr2)
{
  auto out_ptr1 = in_out_ptr0;
  auto out_ptr0 = in_out_ptr1;
  {
    #pragma GCC ivdep
    for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
    {
      {
        Welford<float> tmp_acc0 = Welford<float>();
        Welford<float> tmp_acc0_arr[4];
        for (int i = 0; i < 4; i++)
        {
          tmp_acc0_arr[i] = Welford<float>();
        }
        #pragma omp parallel num_threads(4)
        {
          int tid = omp_get_thread_num();
          Welford<float> tmp_acc0_local = Welford<float>();
          #pragma omp for
          for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
          {
            {
              {
                auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                tmp_acc0_local = welford_combine(tmp_acc0_local, tmp0);
              }
            }
          }
          tmp_acc0_arr[tid] = tmp_acc0_local;
        }
        for (int tid = 0; tid < 4; tid++)
        {
          tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
        }
        in_out_ptr1[static_cast<int64_t>(x0)] = tmp_acc0.mean;
        in_out_ptr0[static_cast<int64_t>(x0)] = tmp_acc0.m2;
      }
    }
  }
  {
    #pragma GCC ivdep
    for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
    {
      {
        {
          auto tmp0 = out_ptr1[static_cast<int64_t>(x0)];
          auto tmp6 = in_ptr1[static_cast<int64_t>(x0)];
          auto tmp8 = out_ptr0[static_cast<int64_t>(x0)];
          auto tmp11 = in_ptr2[static_cast<int64_t>(x0)];
          auto tmp1 = static_cast<float>(2097152.0);
          auto tmp2 = tmp0 / tmp1;
          auto tmp3 = static_cast<float>(1e-05);
          auto tmp4 = float(tmp2 + tmp3);
          auto tmp5 = 1 / std::sqrt(tmp4);
          auto tmp7 = float(tmp5 * tmp6);
          auto tmp9 = decltype(tmp8)(-tmp8);
          auto tmp10 = float(tmp9 * tmp7);
          auto tmp12 = float(tmp10 + tmp11);
          in_out_ptr0[static_cast<int64_t>(x0)] = tmp7;
          in_out_ptr1[static_cast<int64_t>(x0)] = tmp12;
        }
      }
    }
  }
  #pragma omp parallel num_threads(4)
  {
    int tid = omp_get_thread_num();
    {
      #pragma omp for
      for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
      {
        #pragma GCC ivdep
        for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
        {
          {
            {
              auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
              auto tmp1 = in_out_ptr0[static_cast<int64_t>(x0)];
              auto tmp3 = in_out_ptr1[static_cast<int64_t>(x0)];
              auto tmp2 = float(tmp0 * tmp1);
              auto tmp4 = float(tmp2 + tmp3);
              out_ptr2[static_cast<int64_t>(x1 + 2097152L*x0)] = tmp4;
            }
          }
        }
      }
    }
  }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, arg1_1, arg2_1 = args
        args.clear()
        assert_size_stride(arg0_1, (32, ), (1, ))
        assert_size_stride(arg1_1, (32, ), (1, ))
        assert_size_stride(arg2_1, (1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1))
        buf0 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf1 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf3 = reinterpret_tensor(buf1, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf1  # reuse
        buf4 = reinterpret_tensor(buf0, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf0  # reuse
        buf5 = empty_strided_cpu((1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1), torch.float32)
        # [Provenance debug handles] cpp_fused_native_group_norm_0:1
        cpp_fused_native_group_norm_0(buf3, buf4, arg2_1, arg0_1, arg1_1, buf5)
        del arg0_1
        del arg1_1
        del arg2_1
        return (buf5, )
```
- After
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['float*', 'float*', 'const float*', 'const float*', 'const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(float* in_out_ptr0, float* in_out_ptr1, const float* in_ptr0, const float* in_ptr1, const float* in_ptr2, float* out_ptr2)
{
  auto out_ptr1 = in_out_ptr0;
  auto out_ptr0 = in_out_ptr1;
  {
    #pragma GCC ivdep
    for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
    {
      {
        Welford<float> tmp_acc0 = Welford<float>();
        Welford<float> tmp_acc0_arr[4];
        for (int i = 0; i < 4; i++)
        {
          tmp_acc0_arr[i] = Welford<float>();
        }
        #pragma omp parallel num_threads(4)
        {
          int tid = omp_get_thread_num();
          WelfordHelper<float, float, 4096> scalar_welford_helper0(static_cast<int64_t>(524288L));
          Welford<float> tmp_acc0_local = Welford<float>();
          #pragma omp for
          for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
          {
            {
              {
                auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                tmp_acc0_local = welford_combine(tmp_acc0_local, tmp0, &scalar_welford_helper0);
              }
            }
          }
          tmp_acc0_local = welford_combine(tmp_acc0_local, &scalar_welford_helper0);
          tmp_acc0_arr[tid] = tmp_acc0_local;
        }
        for (int tid = 0; tid < 4; tid++)
        {
          tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
        }
        in_out_ptr1[static_cast<int64_t>(x0)] = tmp_acc0.mean;
        in_out_ptr0[static_cast<int64_t>(x0)] = tmp_acc0.m2;
      }
    }
  }
  {
    #pragma GCC ivdep
    for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
    {
      {
        {
          auto tmp0 = out_ptr1[static_cast<int64_t>(x0)];
          auto tmp6 = in_ptr1[static_cast<int64_t>(x0)];
          auto tmp8 = out_ptr0[static_cast<int64_t>(x0)];
          auto tmp11 = in_ptr2[static_cast<int64_t>(x0)];
          auto tmp1 = static_cast<float>(2097152.0);
          auto tmp2 = tmp0 / tmp1;
          auto tmp3 = static_cast<float>(1e-05);
          auto tmp4 = float(tmp2 + tmp3);
          auto tmp5 = 1 / std::sqrt(tmp4);
          auto tmp7 = float(tmp5 * tmp6);
          auto tmp9 = decltype(tmp8)(-tmp8);
          auto tmp10 = float(tmp9 * tmp7);
          auto tmp12 = float(tmp10 + tmp11);
          in_out_ptr0[static_cast<int64_t>(x0)] = tmp7;
          in_out_ptr1[static_cast<int64_t>(x0)] = tmp12;
        }
      }
    }
  }
  #pragma omp parallel num_threads(4)
  {
    int tid = omp_get_thread_num();
    {
      #pragma omp for
      for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
      {
        #pragma GCC ivdep
        for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
        {
          {
            {
              auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
              auto tmp1 = in_out_ptr0[static_cast<int64_t>(x0)];
              auto tmp3 = in_out_ptr1[static_cast<int64_t>(x0)];
              auto tmp2 = float(tmp0 * tmp1);
              auto tmp4 = float(tmp2 + tmp3);
              out_ptr2[static_cast<int64_t>(x1 + 2097152L*x0)] = tmp4;
            }
          }
        }
      }
    }
  }
}
''')
async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, arg1_1, arg2_1 = args
        args.clear()
        assert_size_stride(arg0_1, (32, ), (1, ))
        assert_size_stride(arg1_1, (32, ), (1, ))
        assert_size_stride(arg2_1, (1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1))
        buf0 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf1 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf3 = reinterpret_tensor(buf1, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf1  # reuse
        buf4 = reinterpret_tensor(buf0, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf0  # reuse
        buf5 = empty_strided_cpu((1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1), torch.float32)
        # [Provenance debug handles] cpp_fused_native_group_norm_0:1
        cpp_fused_native_group_norm_0(buf3, buf4, arg2_1, arg0_1, arg1_1, buf5)
        del arg0_1
        del arg1_1
        del arg2_1
        return (buf5, )
```

Pull Request resolved: pytorch#162709
Approved by: https://github.com/CaoE, https://github.com/jansel
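For intuition, here is a small numerical sketch (not the inductor codegen itself; the chunk size is arbitrary) of what cascading Welford over fixed-size chunks computes: per-chunk (mean, m2, n) statistics merged with the standard Welford combine formula.

```python
import torch

def welford_chunk(x):
    # Per-chunk statistics: mean, sum of squared deviations (m2), count.
    n = x.numel()
    mean = x.mean()
    m2 = ((x - mean) ** 2).sum()
    return mean, m2, n

def welford_combine(a, b):
    # Standard parallel Welford merge (Chan et al.).
    mean_a, m2_a, n_a = a
    mean_b, m2_b, n_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return mean, m2, n

x = torch.randn(2_097_152)
stats = None
for chunk in x.split(4096):  # cascade over fixed-size chunks
    s = welford_chunk(chunk)
    stats = s if stats is None else welford_combine(stats, s)

mean, m2, n = stats
print(torch.allclose(mean, x.mean(), atol=1e-5))
print(torch.allclose(m2 / n, x.var(unbiased=False), atol=1e-5))
```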
) Test Plan: CI Differential Revision: D86211542 Pull Request resolved: pytorch#167799 Approved by: https://github.com/njriasan, https://github.com/eellison
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: pytorch#167914 Approved by: https://github.com/pytorchbot
Pull Request resolved: pytorch#167198 Approved by: https://github.com/bobrenjc93
This PR fixes a bug where `torch.clamp` on MPS fails when the min/max tensors have more dimensions than the input tensor. CPU already supports this broadcasting, but MPS raised a RuntimeError. Example of a failing case before the fix:
```python
x = torch.randn(2, 3, device="mps")
min_t = torch.randn(1, 2, 3, device="mps")
max_t = torch.randn(1, 2, 3, device="mps")
torch.clamp(x, min=min_t, max=max_t)  # RuntimeError
```
After this fix, MPS matches CPU behavior.

Fixes pytorch#160734
Pull Request resolved: pytorch#165058
Approved by: https://github.com/malfet
…7734)" This reverts commit 226850c. Reverted pytorch#167734 on behalf of https://github.com/Aidyn-A due to fails on CUDA 12.8 ([comment](pytorch#167734 (comment)))
The PR pytorch#167401 reminded me that the removal of old NVTX interface is long overdue, as the header-only NVTX3 has been around for more than 5 years and is shipped with all CUDA Toolkit versions of 12+. In addition to that, `libnvToolsExt.so` was removed in CUDA Toolkit 13 and onward. Pull Request resolved: pytorch#167637 Approved by: https://github.com/eqy
…device allocator (pytorch#166831) The implementation plan of MemPool for XPU, which is a dependency of [XPUGraph](pytorch#166285), following the [RFC](pytorch#162143).
- [ ] ->pytorch#166831
- [ ] pytorch#166833
- [ ] pytorch#166843

Pull Request resolved: pytorch#166831
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
Co-authored-by: Eikan Wang <eikan.wang@intel.com>
…lasLtWorkspace" (pytorch#167928) Summary: getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write, which leads to crashes. This diff adds mutexes to synchronize access to the static maps.

Re-land context: this is a re-land of pytorch#167248. A few issues were addressed:
- fix for a bug in the fast path (premature return in getCurrentCUDABlasHandle)
- fix for test flakiness (pytorch#167884)

Test Plan:
1. Regression tests: `buck2 test \mode/opt //caffe2/test\:test_transformers_cuda`
   https://www.internalfb.com/intern/testinfra/testrun/6192449759713581
2. On a GPU OD, run multi-threaded tests with TSAN: `buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test -- --stress-runs 100`
   https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

Differential Revision: D87111985
Pull Request resolved: pytorch#167928
Approved by: https://github.com/Skylion007
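The fix itself is in C++, but the synchronization pattern is the usual one; a minimal Python sketch of the same idea (hypothetical names, a plain lock guarding a process-wide cache so lookups and insertions cannot race):

```python
import threading

_handle_cache = {}                 # stand-in for the static per-device map
_handle_lock = threading.Lock()

def get_handle_for(device_index):
    # Both the lookup and the insertion happen while the lock is held,
    # so concurrent callers never observe the map mid-update.
    with _handle_lock:
        handle = _handle_cache.get(device_index)
        if handle is None:
            handle = object()      # stand-in for creating a cuBLAS handle
            _handle_cache[device_index] = handle
        return handle
```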
…rnels (pytorch#158250) Co-authored-by: Nikhil Gupta [nikhil.gupta2@arm.com](mailto:nikhil.gupta2@arm.com)

This PR enables the use of KleidiAI INT4 kernels that directly produce BF16 outputs within PyTorch, to boost LLM prefill and decode performance.

**This change improves decode throughput by ~15% and reduces the memory required to run inference on the model by 50%.**

### Benchmark Setup
```
Model: meta-llama/Llama-3.1-8B
Test Platform: Neoverse V2
```

### Detailed Results
| Metric | With `--compile` | Without `--compile` |
|---|---|---|
| Quantization Scheme | INT4 symmetric channelwise | INT4 symmetric channelwise |
| Input Precision | BF16 | BF16 |
| Number of Layers Quantized | 32 | 32 |
| Average Compression Ratio | 87.49% | 87.49% |
| Total Quantization Time (s) | 9.62 | 10.32 |
| Compile Time (First) (s) | 134.48 | 1.69 |
| Compile Time (Second) (s) | 80.44 | 1.60 |
| Compile Time (Subsequent) (s) | 0.19 | 0.22 |
| Prefill Tokens | 54 | 54 |
| Decoded Tokens | 33 | 33 |
| Prefill Time (s) | 0.19 | 0.22 |
| Decode Time (s) | 0.76 | 1.38 |
| E2E Generation Time (s) | 0.95 | 1.60 |
| Prefill Throughput (tokens/s) | 288.13 | 249.91 |
| Decode Throughput (tokens/s) | 43.42 | 23.83 |

Pull Request resolved: pytorch#158250
Approved by: https://github.com/malfet, https://github.com/aditew01, https://github.com/fadara01
Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: pytorch#167968 Approved by: https://github.com/pytorchbot
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1e69f4](intel/torch-xpu-ops@1e69f40), includes:
- Add PTL in the default AOT target list for both Win and Lin
- Use PyTorch p2p API in Copy kernel
- Add event cache and event timing to XCCL
- Add Float8_e8m0fnu support for copy
- Add CMAKE_SYCL_COMPILER_LAUNCHER for sccache

Pull Request resolved: pytorch#167698
Approved by: https://github.com/EikanWang
Exposes `_inductor.config.bucket_all_reduces_fx`, similar to the existing all_gather and reduce_scatter options, with "all" as the only supported value (a usage sketch follows). Pull Request resolved: pytorch#167634 Approved by: https://github.com/eellison
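A hedged usage sketch (assuming the knob is set before compilation, mirroring the existing bucketing options; the workload below is illustrative):

```python
import torch

# Hypothetical usage: enable all-reduce bucketing in inductor's FX passes.
# "all" is the only accepted value per the description above.
torch._inductor.config.bucket_all_reduces_fx = "all"

@torch.compile
def train_step(x):
    return (x * 2).sum()
```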
Make the PyObject preservation scheme thread-safe with free-threaded (nogil) Python. The general idea is:
* Python Tensor and Storage objects always hold a strong reference to their underlying c10 object.
* c10 objects hold a strong reference to their Python objects if there's at least one other reference to the c10 object.

This is implemented in `intrusive_ptr`:
* The topmost bit (`kHasPyObject`) of the weakref count is now used to indicate whether the `intrusive_ptr_target` has an associated PyObject. So `kHasPyObject` is one bit, the weakref count is now 31 bits, and the strong refcount remains 32 bits.
* When the reference count increases from one to two and `kHasPyObject` is set, we incref the associated Python object to ensure that it's kept alive.
* When the reference count decreases from two to one (i.e., there are no C++ references to the `intrusive_ptr_target` other than from the Python object), we decref the associated Python object to break the cycle.

Other benefits:
* We can delete a lot of the copypasta from Python's internal `subtype_dealloc`.
* This fixes the weakref and GC bugs we had in the previous scheme. Python weakrefs on Tensors and Storages should just work as expected now.

Risks:
* Extra branch for reference count operations on `intrusive_ptr<TensorImpl>`, `intrusive_ptr<StorageImpl>`, and the generic `intrusive_ptr<intrusive_ptr_target>`, even when we're not using Python.
* It's a big change (second attempt at pytorch#166342).

Pull Request resolved: pytorch#167564
Approved by: https://github.com/albanD, https://github.com/Skylion007
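A rough Python sketch of the described bit budget (illustrative only; the actual field placement and atomics live in c10's `intrusive_ptr`, and the packing below is an assumption made for the example):

```python
# Assumed layout of a combined 64-bit refcount word:
#   bit 63      -> kHasPyObject flag (taken from the weak count's top bit)
#   bits 32..62 -> weak refcount (31 bits)
#   bits 0..31  -> strong refcount (32 bits)
K_HAS_PYOBJECT = 1 << 63
WEAK_SHIFT = 32
WEAK_MASK = (1 << 31) - 1
STRONG_MASK = (1 << 32) - 1

def pack(has_pyobj, weak, strong):
    return (K_HAS_PYOBJECT if has_pyobj else 0) | (weak << WEAK_SHIFT) | strong

def unpack(word):
    return (bool(word & K_HAS_PYOBJECT),
            (word >> WEAK_SHIFT) & WEAK_MASK,
            word & STRONG_MASK)

print(unpack(pack(True, weak=1, strong=2)))  # (True, 1, 2)
```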
Previously we hard-failed if the process group was "gloo"; now we fall back on hardcoded formulas. Pull Request resolved: pytorch#167827 Approved by: https://github.com/eellison
pytorch#166044 removes openblas from whl dependency list for AArch64+CPU build so this PR adds it back. Only affects CPU build since AArch64+CUDA uses NVPL. Pull Request resolved: pytorch#167841 Approved by: https://github.com/tinglvv, https://github.com/malfet
Use standard HIP headers for unsafeAtomicAdd. Removes copy/paste of unsafeAtomicAdd as "preview" implementation for gfx942. Pull Request resolved: pytorch#167661 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…rch#165067)" This reverts commit 96a4c4b. Reverted pytorch#165067 on behalf of https://github.com/jeanschmidt due to breaks internal tests see D87036515, @albanD please help the author get this PR merged ([comment](pytorch#165067 (comment)))
This reverts commit e20ca3b. Reverted pytorch#167049 on behalf of https://github.com/jeanschmidt due to breaks internal tests see D87120562, @Skylion007 please help the author get this PR merged ([comment](pytorch#167049 (comment)))
This reverts commit 2245d7d. Reverted pytorch#167899 on behalf of https://github.com/jeanschmidt due to need to revert in order to revert pytorch#167899 ([comment](pytorch#167899 (comment)))
This reverts commit deabb3e. Reverted pytorch#167821 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D87148810. @Skylion007 may you help the author to get this PR merged? ([comment](pytorch#167821 (comment)))
Alas, one cannot use `repeat_interleave_common` for MPS tensors, as `data_offset` is not a valid pointer to `id<MTLTensor>`. On the other hand, one does not need to use `AT_DISPATCH_INDEX_TYPES`, as dispatching happens on the shader side. Fixes pytorch#167924 Pull Request resolved: pytorch#167961 Approved by: https://github.com/manuelcandales
Summary: MXFP4 unit tests pass on B200 but fail on RTX 5090, so disable the non-B200 cases. Also fail with a "not implemented" error on non-B200 hardware to avoid unhelpful failure messages.

Test Plan:
```
pytest -sv -k "mxfp4" test/test_scaled_matmul_cuda.py
```

Reviewers: @nWEIdia

Fixes pytorch#167850

Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: pytorch#167857
Approved by: https://github.com/nWEIdia, https://github.com/malfet
Upgrade all the ROCm docker images to ROCm 7.1 release version. Pull Request resolved: pytorch#166743 Approved by: https://github.com/atalman, https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Prachi Gupta <prachi.gupta@amd.com>
…7860) getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion because it's unclear to what extent the GIL should help here), this is a straightforward way to cause callers to create race conditions. This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity. Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)! Pull Request resolved: pytorch#167860 Approved by: https://github.com/zou3519
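The change is in the C++ registry, but the hazard and the fix follow a generic pattern; a minimal Python sketch with hypothetical names:

```python
import threading

_registry_lock = threading.Lock()
_operators = {"aten::add": object()}   # stand-in for internal registry state

def get_all_operators_unsafe():
    with _registry_lock:
        return _operators              # the reference escapes the lock; callers
                                       # can then race with concurrent registration

def get_all_operators():
    with _registry_lock:
        return dict(_operators)        # snapshot copied while the lock is held
```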
…ytorch#161728) Resolves pytorch#161290

## Summary
Expands `dynamo/check_perf_csv.py` output with latency, compile-time and memory information:
- Displays measured speedup and the % difference from the target
- Adds clear messaging when all model tests pass and no regression is found
- Adds error handling for a missing CSV file

### Example (Failing Check)
```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.40
```
**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.40x)
hf_Bert speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
Error 1 model(s) performance regressed
 hf_Bert
 - hf_Bert: 1.005x (< 1.40x; -28.2% from target)
```

### Example (Passing Check)
```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.40
```
**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.00x)
hf_Bert speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
All 1 model(s) passed threshold check (>= 1.00x)
```

Pull Request resolved: pytorch#161728
Approved by: https://github.com/isuruf
This reverts commit 99fdca8. Reverted pytorch#166492 on behalf of https://github.com/jeanschmidt due to Internally we still depends on the old logic, so we need to find a way to maintain backwards compatibility, for now ([comment](pytorch#166492 (comment)))
Pull Request resolved: pytorch#167772 Approved by: https://github.com/janeyx99
…orch::stable::Tensor. (pytorch#161891) This ghstack is a prerequisite for porting torchaudio C++ extensions to use torch stable ABI, see pytorch/audio#4074, pytorch/audio#4075, pytorch/audio#4076, pytorch/audio#4077, pytorch/audio#4078 Pull Request resolved: pytorch#161891 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: pytorch#167772
The following tests fail with Python 3.14 on Linux:
* TestSetAffinity::test_set_affinity_in_worker_init
  * Why? Python 3.14 makes `forkserver` the default start method for multiprocessing. With it, local functions are not picklable and the unit test fails (see the sketch below).
* TestIndividualWorkerQueue::test_ind_worker_queue
  * Why? The test was hitting the timeout. This is also related to the start-method change. The timeout is increased and the batch-size iterations reduced to cut total unit-test time.
* Fixes pytorch#68643
Pull Request resolved: pytorch#167429
Approved by: https://github.com/aelavender, https://github.com/ramanishsingh
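A small sketch of the start-method issue (illustrative only; function names are hypothetical): under `forkserver`, as under `spawn`, worker callables are pickled before being sent to the child, so locally defined functions cannot be used as `worker_init_fn`.

```python
import pickle
import multiprocessing as mp

def module_level_init(worker_id):
    # Picklable: can be looked up by its qualified name.
    pass

def main():
    # On Python 3.14 (Linux), this prints "forkserver" by default.
    print(mp.get_start_method())

    def local_init(worker_id):   # defined inside a function: not picklable
        pass

    pickle.dumps(module_level_init)              # fine
    try:
        pickle.dumps(local_init)                 # fails under forkserver/spawn
    except (AttributeError, pickle.PicklingError) as e:
        print("not picklable:", e)

if __name__ == "__main__":
    main()
```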
This reverts commit 77acc66. Reverted pytorch#166743 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#166743 (comment)))
…7633) Pull Request resolved: pytorch#167633 Approved by: https://github.com/eellison ghstack dependencies: pytorch#167827
Not sure if the paths are already set up properly so that 'benchmarks/dynamo/huggingface.py' can be called directly in a unit test. Let's find out from CI. Pull Request resolved: pytorch#167482 Approved by: https://github.com/v0i0, https://github.com/mlazos
Inductor may treat an outer reduction as an inner reduction when the reduction ranges contain a 1. This causes a weird issue where we skip fusing with a mixed-order reduction. While I'm still debugging why that happens, I think we should fix the decision here anyway. Pull Request resolved: pytorch#167697 Approved by: https://github.com/jansel, https://github.com/v0i0
Fixes pytorch#158429 Updated LogAddExpKernel.cu to allow for complex numbers. Also, updated unittest to run test_logaddexp on CUDA with complex data types and added a unit test in test_linalg.py to compare results between CUDA and cpu. @drisspg Pull Request resolved: pytorch#163509 Approved by: https://github.com/isuruf
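A hedged check mirroring the comparison described above (assumes complex `logaddexp` on CPU as the reference and a CUDA device for the other side; tolerances are illustrative):

```python
import torch

a = torch.randn(4, dtype=torch.complex64)
b = torch.randn(4, dtype=torch.complex64)
cpu_out = torch.logaddexp(a, b)          # CPU reference for complex inputs

if torch.cuda.is_available():
    cuda_out = torch.logaddexp(a.cuda(), b.cuda())
    print(torch.allclose(cpu_out, cuda_out.cpu(), atol=1e-5))
```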
Enables mm out for sparse tensors Pull Request resolved: pytorch#167908 Approved by: https://github.com/malfet
…#167931) Per title:
1) allows the `self` argument to have the same precision as the output
2) fixes broadcasting of the `self` argument - it used to allocate an incorrectly sized output and resize it later, causing a warning in addmm and an error in baddbmm
3) fixes `out` handling for the `out` baddbmm overload, where the implementation used uninitialized memory in `out` instead of copying `self` to `out`
4) removes a couple of unneeded IIFE patterns

A short sketch of items 2 and 3 follows.

Pull Request resolved: pytorch#167931
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/malfet
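A sketch of the behaviors touched by items 2 and 3 (expected results after the change; shapes and tolerances are illustrative):

```python
import torch

# Item 2: a broadcastable `self` (bias) in addmm should yield an output of the
# matmul shape without a resize warning.
bias = torch.randn(1, 5)            # broadcasts against the (3, 5) result
m1 = torch.randn(3, 4)
m2 = torch.randn(4, 5)
out = torch.addmm(bias, m1, m2)
print(out.shape)                    # torch.Size([3, 5])

# Item 3: the baddbmm `out=` overload should start from `self`, not from
# whatever memory `out` happened to contain.
self_b = torch.randn(2, 3, 5)
b1 = torch.randn(2, 3, 4)
b2 = torch.randn(2, 4, 5)
out_b = torch.empty_like(self_b)
torch.baddbmm(self_b, b1, b2, out=out_b)
print(torch.allclose(out_b, self_b + b1 @ b2, atol=1e-5))
```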
…idiAI kernels (pytorch#158250)" This reverts commit 53809f9. Reverted pytorch#158250 on behalf of https://github.com/zou3519 due to reverting to see if it fixes inductor halide test failure ([comment](pytorch#158250 (comment)))
Summary: Add support for symint placeholders. Added two test cases with dynamic reshape:
- dynamic info coming from tmd on placeholders
- dynamic info coming from placeholders (symints)

Test Plan: test_reshape_dynamic_ph, test_reshape_dynamic_tmd

Differential Revision: D86984100
Pull Request resolved: pytorch#167757
Approved by: https://github.com/blaine-rister
…locate test into `TestSaveLoad` (pytorch#158247) This is a follow-up to [pytorch#154333](pytorch#154333), where I initially introduced a fallback mechanism in deserialize_torch_artifact. In this revised PR:
- Cleaned up the commit history for clarity and reproducibility.
- Relocated the test into the TestSaveLoad class in test_serialize.py.

There were some issues with the last PR, so I opened this one; the previous PR had inconsistencies due to local branch issues and was closed in favor of this cleaner submission. Feedback is very welcome.

Pull Request resolved: pytorch#158247
Approved by: https://github.com/angelayi
This reverts commit 99117c1. Reverted pytorch#167637 on behalf of https://github.com/yangw-dev due to breaks internal build with torch/csrc/profiler/stubs/cuda.cpp:4:10: fatal error: 'nvtx3/nvtx3.hpp' file not found 4 | #include <nvtx3/nvtx3.hpp>, please find a meta fella to resolve this issue and try again, diff:[D87229660] ([comment](pytorch#167637 (comment)))
This reverts commit 7ede33b. Reverted pytorch#167771 on behalf of https://github.com/eellison due to needs one fix ([comment](pytorch#167771 (comment)))
… used where needed" Splits each torch library registration in the 2.10 folder into its own file. I had a script that parsed kernel.cpp to do this, but I felt that forcing this responsibility on the user might be less error-prone. Compiles each file targeting 2.9 and asserts that compilation fails. (There are two 2.9 kernels we use as negative tests, where compilation is expected to succeed.) [ghstack-poisoned]
/agentic_review
Splits each torch library registration in the 2.10 folder into its own file. I had a script that parsed kernel.cpp to do this, but I felt that forcing this responsibility on the user might be less error-prone.
Compiles each file targeting 2.9 and asserts that compilation fails. (There are two 2.9 kernels we use as negative tests, where compilation is expected to succeed.) A sketch of the harness idea follows.
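A minimal sketch of the compile-should-fail idea (hypothetical file names, compiler flags, and macro value; the real test drives this through the build system rather than invoking the compiler directly):

```python
import subprocess

# Hypothetical names for the two kernels that are expected to still compile on 2.9.
NEGATIVE_TESTS = {"kernel_ok_a.cpp", "kernel_ok_b.cpp"}

def compiles(path, include_dir):
    # Hypothetical invocation: compile one registration file against headers
    # configured to target the 2.9 feature version.
    cmd = ["c++", "-std=c++17", "-c", path, "-I", include_dir,
           "-DTORCH_TARGET_VERSION=0x02090000"]   # assumed macro/value for targeting 2.9
    return subprocess.run(cmd, capture_output=True).returncode == 0

def check(files, include_dir):
    for f in files:
        ok = compiles(f, include_dir)
        expected = f.split("/")[-1] in NEGATIVE_TESTS
        assert ok == expected, f"unexpected compile result for {f}"
```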
Stack from ghstack (oldest at bottom):