Releases: pytorch/pytorch
PyTorch 2.12.0 Release
PyTorch 2.12.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
- Security
Highlights
| Batched linalg.eigh on CUDA is up to 100x faster due to updated cuSolver backend selection. |
| New torch.accelerator.Graph API unifies graph capture and replay across CUDA, XPU, and out-of-tree backends. |
| torch.export.save now supports Microscaling (MX) quantization formats, enabling full export of aggressively compressed models. |
| Adagrad now supports fused=True, joining Adam, AdamW, and SGD with a single-kernel optimizer implementation. |
| torch.cond control flow can now be captured and replayed inside CUDA Graphs. |
| ROCm users gain expandable memory segments, rocSHMEM symmetric memory collectives, and FlexAttention pipelining. |
For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.
Backwards Incompatible Changes
Build Frontend
- Strengthened SVE compile checks in FindARM.cmake, which may reject previously accepted but incorrect SVE configurations (#176646)
  Source builds that enable SVE now validate the compiler configuration more strictly. If a build previously passed with an incomplete or mismatched SVE setup, it may now fail during CMake configuration instead of later in compilation. Update the compiler/toolchain flags so they accurately describe the target SVE support, or disable SVE for that build.
- Updated the minimum CUDA version required to build PyTorch from source to CUDA 12.6 (#178925)
  Building PyTorch from source with CUDA versions older than 12.6 is no longer supported. Users building custom binaries should install CUDA 12.6 or newer and make sure CUDA_HOME points to that installation.
  Version 2.11:
  CUDA_HOME=/usr/local/cuda-12.4 python setup.py develop
  Version 2.12:
  CUDA_HOME=/usr/local/cuda-12.6 python setup.py develop
- Enforced a C++20 minimum in CMake build files (#178662)
  Source builds now require a compiler and build configuration that support C++20. If you maintain custom build scripts or downstream extensions that build PyTorch from source, update the compiler and remove assumptions that PyTorch can be built as C++17.
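The CUDA 12.6 minimum above can be checked before starting a long source build. The following is a minimal sketch, assuming the toolkit version is read from the `release X.Y` token that `nvcc --version` prints. The helper names `parse_cuda_release` and `meets_minimum` are hypothetical and are not part of PyTorch's build system.

```python
import re

MIN_CUDA = (12, 6)  # minimum stated in the 2.12 release notes

def parse_cuda_release(nvcc_output: str) -> tuple[int, int]:
    """Extract (major, minor) from text such as 'release 12.6, V12.6.77'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' token found in nvcc output")
    return int(match.group(1)), int(match.group(2))

def meets_minimum(nvcc_output: str, minimum: tuple[int, int] = MIN_CUDA) -> bool:
    """Return True when the reported toolkit is at least `minimum`."""
    # Tuple comparison handles major and minor together, e.g. (13, 0) >= (12, 6).
    return parse_cuda_release(nvcc_output) >= minimum

# Sample text in the shape nvcc prints; a real script would capture it via subprocess.
sample = "Cuda compilation tools, release 12.4, V12.4.131"
print(meets_minimum(sample))
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "12.10" would sort before "12.6".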
Distributed
- torch.distributed.nn.functional ops now raise RuntimeError under torch.compile (#177342)
  All ops in torch.distributed.nn.functional (e.g., broadcast, all_reduce, all_gather, reduce_scatter, all_to_all_single) now raise RuntimeError when called inside torch.compile. Users should migrate to the functional collectives API in torch.distributed._functional_collectives.
  Version 2.11:
  @torch.compile
  def my_func(x):
      return torch.distributed.nn.functional.all_reduce(x, op=ReduceOp.SUM)
  Version 2.12:
  @torch.compile
  def my_func(x):
      return torch.distributed._functional_collectives.all_reduce(x, reduceOp="sum", group=group)
TorchElastic
- torchrun now defaults to an OS-assigned free port for single-node training instead of port 29500 (#175699)
  When running torchrun --nproc-per-node=N script.py without specifying --master-port or --standalone, torchrun now uses an OS-assigned free port via the c10d rendezvous backend. This eliminates "Address already in use" errors when running multiple training jobs concurrently. Multi-node training, explicit --master-port, the PET_MASTER_PORT env var, and --standalone are unchanged.
  Version 2.11:
  # Used static rendezvous on port 29500 by default
  torchrun --nproc-per-node=4 train.py
  Version 2.12:
  # Uses OS-assigned free port by default
  torchrun --nproc-per-node=4 train.py
  # To explicitly use a fixed port:
  torchrun --nproc-per-node=4 --master-port=29500 train.py
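For context, an "OS-assigned free port" is what a socket receives when it binds to port 0. The sketch below shows that general mechanism with the standard library. It illustrates the idea only and is not torchrun's or the c10d rendezvous backend's actual implementation; the helper name `find_free_port` is hypothetical.

```python
import socket

def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))          # port 0 means "pick any free port"
        return sock.getsockname()[1]  # the port the OS actually assigned

port = find_free_port()
print(f"OS assigned port {port}")
```

Note that the port is released as soon as the socket closes, so another process could claim it before it is reused. Scripts that need a stable, known port should keep passing --master-port explicitly.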
MPS
- All MPS tensors are now allocated in unified memory (#175818)
  Previously, MPS tensors could be allocated in either device-only or unified memory. Now all MPS tensors use unified memory unconditionally. This simplifies memory management and enables CPU access to MPS tensor data without explicit copies. Code that relied on device-only memory placement may observe different performance characteristics.
Inductor
- The max_autotune layout-constraint deferral introduced in 2.11 is now opt-in (#175330)
  In 2.11, Inductor deferred layout freezing for max_autotune templates to expose more fusion opportunities. This caused a regional-inductor failure mode, so the default in 2.12 reverts to immediate layout freezing. Users who relied on the deferred behavior for fusion opportunities should opt in explicitly via torch._inductor.config.max_autotune_defer_layout_freezing or TORCHINDUCTOR_MAX_AUTOTUNE_DEFER_LAYOUT_FREEZING=1.
  Version 2.11:
  # Deferred layout freezing was the default
  torch.compile(model, mode="max-autotune")
  Version 2.12:
  import torch._inductor.config as cfg
  cfg.max_autotune_defer_layout_freezing = True
  # or set TORCHINDUCTOR_MAX_AUTOTUNE_DEFER_LAYOUT_FREEZING=1
  torch.compile(model, mode="max-autotune")
Deprecations
Release Engineering
- Deprecate CUDA 12.8 builds in favor of CUDA 13.0 (#179072)
  CUDA 12.8 binaries have been removed from the PyTorch binary build matrix. CUDA 13.0 is now the stable default and CUDA 12.6 remains available for users on older drivers. Users explicitly pinning the cu128 index URL will need to switch to cu130 (recommended) or cu126.
  Version 2.11:
  pip install torch --index-url https://download.pytorch.org/whl/cu128
  Version 2.12:
  # Use CUDA 13.0 (default on PyPI):
  pip install torch
  # Or explicitly:
  pip install torch --index-url https://download.pytorch.org/whl/cu130
  # Older driver fallback:
  pip install torch --index-url https://download.pytorch.org/whl/cu126
- Compatibility with CMake < 3.10 will be removed in a future release (#166259)
  Source builds against CMake versions older than 3.10 now emit a deprecation warning. A future release will require CMake 3.10 or newer; please upgrade CMake before then.
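Projects that pin the cu128 index URL in requirements files or CI scripts can rewrite the pin mechanically. The following is a minimal sketch, assuming the pin appears as a literal `/whl/cu128` path segment. `migrate_index_url` is a hypothetical helper, not a PyTorch or pip utility.

```python
import re

def migrate_index_url(line: str, target: str = "cu130") -> str:
    """Replace a pinned /whl/cu128 index path with /whl/<target>."""
    if target not in ("cu130", "cu126"):
        raise ValueError("the 2.12 notes list cu130 and cu126 as the replacements")
    # \b keeps longer tags that merely start with cu128 untouched.
    return re.sub(r"/whl/cu128\b", f"/whl/{target}", line)

old = "pip install torch --index-url https://download.pytorch.org/whl/cu128"
print(migrate_index_url(old))
print(migrate_index_url(old, target="cu126"))
```

Lines that do not mention cu128 pass through unchanged, so the helper can be applied to every line of a requirements file.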
Linear Algebra
- Several CUDA linear algebra operators no longer use the MAGMA backend and now dispatch to cuSolver or cuBLAS unconditionally:
  - torch.linalg.eigh now dispatches to cuSolver (#174619)
  - torch.linalg.lu_solve now dispatches to cuSolver/cuBLAS (#174248)
  - torch.linalg.cholesky_inverse now dispatches to cuSolver (#174681)
  - torch.linalg.cholesky_solve now dispatches to cuSolver (#174769)
  User code calling these APIs does not need to change. The practical impact is for users who depended on MAGMA-specific numerical behavior, performance characteristics, or debugging. Those calls now use the cuSolver/cuBLAS implementations on CUDA.
FullyShardedDataParallel2 (FSDP2)
- Compiling through FSDP2 hooks without graph breaks is no longer supported (#174863, #174906). If you use compiled autograd with FSDP2, update your code to allow graph breaks around FSDP2 hooks or disable compiled autograd for the FSDP2 training step.
  Version 2.11:
  with torch._dynamo.config.patch(compiled_autograd=True):
      compiled_model = torch.compile(fsdp_model, fullgraph=True)
      loss = compiled_model(input).sum()
      loss.backward()
  Version 2.12:
  # Either run FSDP2 backward without fullgraph.
  compiled_model = torch.compile(fsdp_model, fullgraph=False)
  loss = compiled_model(input).sum()
  loss.backward()
  # Or apply compile before applying FSDP.
  compiled_model_pre_fsdp = torch.compile(model, fullgraph=True)
  compiled_model = fully_shard(compiled_model_pre_fsdp, ...)
  loss = compiled_model(input).sum()
  loss.backward()
Profiler
- Profiler's metadata_json field is now deprecated; use event_metadata instead (#179417)
  Version 2.11:
  metadata = event.metadata_json
  Version 2.12:
  metadata = event.event_metadata
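Code that has to run on both 2.11 and 2.12 can read whichever attribute is present. The following is a minimal sketch using duck typing, under the assumption that older releases expose only metadata_json. `get_event_metadata` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for profiler events rather than real profiler output.

```python
from types import SimpleNamespace

def get_event_metadata(event):
    """Prefer the new `event_metadata` field, fall back to `metadata_json`."""
    if hasattr(event, "event_metadata"):
        return event.event_metadata
    return event.metadata_json  # deprecated name, kept for older releases

# Stand-in objects; real code would pass events obtained from the profiler.
new_style = SimpleNamespace(event_metadata='{"k": 1}')
old_style = SimpleNamespace(metadata_json='{"k": 1}')
print(get_event_metadata(new_style) == get_event_metadata(old_style))
```

Checking for the new name first means the deprecated attribute is never touched on releases that provide the replacement.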
Dynamo
- torch.compile(fullgraph=True) now warns when a call runs no compiled code; will error in 2.13 (#181940)
  Previously fullgraph=True was only validated once Dynamo actually compiled and ran the function. If Dynamo was bypassed at call time (e.g. under a user-defined TorchDispatchMode), the annotation silently had no effect. 2.12 emits a warning; 2.13 will raise. For graph-break erro...
PyTorch 2.11.0 Release
PyTorch 2.11.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
- Security
Highlights
| Added Support for Differentiable Collectives for Distributed Training |
| FlexAttention now has a FlashAttention-4 backend on Hopper and Blackwell GPUs |
| MPS (Apple Silicon) Comprehensive Operator Expansion |
| Added RNN/LSTM GPU Export Support |
| Added XPU Graph Support |
For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.
Backwards Incompatible Changes
Release Engineering
Volta (SM 7.0) GPU support removed from CUDA 12.8 and 12.9 binary builds (#172598)
Starting with PyTorch 2.11, the CUDA 12.8 and 12.9 pre-built binaries no longer include support for Volta GPUs (compute capability 7.0, e.g. V100). This change was necessary to enable updating to CuDNN 9.15.1, which is incompatible with Volta.
Users with Volta GPUs who need CUDA 12.8+ should use the CUDA 12.6 builds, which continue to include Volta support. Alternatively, build PyTorch from source with Volta included in TORCH_CUDA_ARCH_LIST.
Version 2.10:
# CUDA 12.8 builds supported Volta (SM 7.0)
pip install torch --index-url https://download.pytorch.org/whl/cu128
# Works on V100
Version 2.11:
# CUDA 12.8 builds no longer support Volta
# For V100 users, use CUDA 12.6 builds instead:
pip install torch --index-url https://download.pytorch.org/whl/cu126
PyPI wheels now ship with CUDA 13.0 instead of CUDA 12.x (#172663, announcement)
Starting with PyTorch 2.11, pip install torch on PyPI installs CUDA 13.0 wheels by default for both Linux x86_64 and Linux aarch64. Previously, PyPI wheels shipped with CUDA 12.x and only Linux x86_64 CUDA wheels were available on PyPI. Users whose systems have only CUDA 12.x drivers installed may encounter errors when running pip install torch without specifying an index URL.
Additionally, CUDA 13.0 only supports Turing (SM 7.5) and newer GPU architectures on Linux x86_64. Maxwell and Pascal GPUs are no longer supported under CUDA 13.0. Users with these older GPUs should use the CUDA 12.6 builds instead.
CUDA 12.6 and 12.8 binaries remain available via download.pytorch.org.
Version 2.10:
# PyPI wheel used CUDA 12.x
pip install torch
Version 2.11:
# PyPI wheel now uses CUDA 13.0
pip install torch
# To get CUDA 12.8 wheels instead:
pip install torch --index-url https://download.pytorch.org/whl/cu128
# To get CUDA 12.6 wheels (includes Maxwell/Pascal/Volta support):
pip install torch --index-url https://download.pytorch.org/whl/cu126
Python Frontend
torch.hub.list(), torch.hub.load(), and torch.hub.help() now default the trust_repo parameter to "check" instead of None. The trust_repo=None option has been removed. (#174101)
Previously, passing trust_repo=None (or relying on the default) would silently download and run code from untrusted repositories with only a warning. Now, the default "check" behavior will prompt the user for explicit confirmation before running code from repositories not on the trusted list.
Users who were explicitly passing trust_repo=None must update their code. Users who were already passing trust_repo=True, trust_repo=False, or trust_repo="check" are not affected.
Version 2.10:
# Default trust_repo=None — downloads with a warning
torch.hub.load("user/repo", "model")
# Explicit None — same behavior
torch.hub.load("user/repo", "model", trust_repo=None)
Version 2.11:
# Default trust_repo="check" — prompts for confirmation if repo is not trusted
torch.hub.load("user/repo", "model")
# To skip the prompt, explicitly trust the repo
torch.hub.load("user/repo", "model", trust_repo=True)
torch.nn
Add sliding window support to varlen_attn via window_size, making optional arguments keyword-only (#172238)
The signature of torch.nn.attention.varlen_attn has changed: a * (keyword-only separator) has been inserted before the optional arguments. Previously, optional arguments like is_causal, return_aux, and scale could be passed positionally; they must now be passed as keyword arguments. A new window_size keyword argument has also been added.
# Before (2.10)
output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, True, None, 1.0)
# After (2.11) — pass as keyword argument
output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, window_size=(-1, 0), return_aux=None, scale=1.0)
Remove is_causal flag from varlen_attn (#172245)
The is_causal parameter has been removed from torch.nn.attention.varlen_attn. Causal attention is now expressed through the window_size parameter: use window_size=(-1, 0) for causal masking, or window_size=(W, 0) for causal attention with a sliding window of size W. The default window_size=(-1, -1) corresponds to full (non-causal) attention.
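To make the window_size convention concrete, the sketch below builds the boolean attention mask implied by a `(left, right)` pair in plain Python. It assumes the FlashAttention-style reading in which query position `i` may attend to key position `j` when `i - left <= j <= i + right`, with `-1` meaning unbounded on that side. These notes do not spell out the exact boundary handling in varlen_attn, so treat `window_mask` (a hypothetical helper) as illustrative only.

```python
def window_mask(seq_len: int, window_size: tuple[int, int]) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j."""
    left, right = window_size
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            left_ok = left < 0 or j >= i - left     # -1: no limit on the past
            right_ok = right < 0 or j <= i + right  # -1: no limit on the future
            row.append(left_ok and right_ok)
        mask.append(row)
    return mask

causal = window_mask(4, (-1, 0))   # lower-triangular, like the old is_causal=True
full = window_mask(4, (-1, -1))    # every key visible to every query
sliding = window_mask(4, (1, 0))   # under this reading: itself plus one earlier key

for row in sliding:
    print("".join("x" if allowed else "." for allowed in row))
```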
# Before (2.10)
output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, is_causal=True)
# After (2.11) — use window_size instead
output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, window_size=(-1, 0))
Distributed
DebugInfoWriter now honors $XDG_CACHE_HOME for its cache directory in C++ code, consistent with the Python side. Previously it always used ~/.cache/torch. (#168232)
This avoids issues where $HOME is not set or not writable. Users who relied on ~/.cache/torch being used regardless of $XDG_CACHE_HOME may see debug info written to a different location.
Version 2.10:
# C++ DebugInfoWriter always wrote to ~/.cache/torch
Version 2.11:
# C++ DebugInfoWriter now respects $XDG_CACHE_HOME/torch (same as Python code)
# Falls back to ~/.cache/torch if $XDG_CACHE_HOME is not set
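The lookup order described above can be sketched in Python. This mirrors the documented behavior only ($XDG_CACHE_HOME/torch, else ~/.cache/torch). It is not the C++ DebugInfoWriter code, and `torch_cache_dir` is a hypothetical helper. Treating an empty variable as unset is an assumption borrowed from the XDG Base Directory convention.

```python
import os
from pathlib import Path

def torch_cache_dir(env=None) -> Path:
    """Resolve the cache directory: $XDG_CACHE_HOME/torch, else ~/.cache/torch."""
    env = os.environ if env is None else env
    xdg = env.get("XDG_CACHE_HOME")
    # Assumption: an empty value counts as "not set".
    base = Path(xdg) if xdg else Path.home() / ".cache"
    return base / "torch"

print(torch_cache_dir({"XDG_CACHE_HOME": "/tmp/xdg"}))
print(torch_cache_dir({}))
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.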
DeviceMesh now stores a process group registry (_pg_registry) directly, enabling torch.compile to trace through get_group(). (#172272)
This may break code that skips init_process_group, loads a saved DTensor (constructing a DeviceMesh with no PGs), and later creates PGs separately — during torch.compile runtime the PG lookup will fail. Users should ensure process groups are initialized before constructing the DeviceMesh.
Version 2.10:
# PGs resolved via global _resolve_process_group at runtime
mesh = DeviceMesh(...) # PGs could be created later
Version 2.11:
# PGs now stored on DeviceMesh._pg_registry; must exist at mesh creation
dist.init_process_group(...) # Must be called before creating mesh
mesh = DeviceMesh(...)
Distributed (DTensor)
DTensor.to_local() backward now converts Partial placements to Replicate by default when grad_placements is not provided. (#173454)
Previously, calling to_local() on a Partial DTensor would preserve the Partial placement in the backward gradient, which could produce incorrect gradients when combined with from_local(). Now, the backward pass automatically maps Partial forward placements to Replicate gradient placements, matching the behavior of from_local().
Users who relied on the previous behavior (where to_local() backward preserved Partial gradients) may see different gradient values. To ensure correctness, explicitly pass grad_placements to to_local().
Version 2.10:
# Partial placement preserved in backward — could produce incorrect gradients
local_tensor = partial_dtensor.to_local()
Version 2.11:
# Partial → Replicate in backward by default (correct behavior)
local_tensor = partial_dtensor.to_local()
# Or explicitly specify grad_placements for full control:
local_tensor = partial_dtensor.to_local(grad_placements=[Replicate()])
_PhiloxState.seed and _PhiloxState.offset now return torch.Tensor instead of int (#173876)
The DTensor RNG internal _PhiloxState class changed its seed and offset properties to return tensors instead of Python ints, and the setters now expect tensors. This makes the RNG state compatible with PT2 tracing (the previous .item() calls were not fake-tensor friendly).
Code that directly reads _PhiloxState.seed or _PhiloxState.offset and treats them as ints will break. Call .item() to get the int value. When setting, wrap the value in a tensor.
Version 2.10:
from torch.distributed.tensor._random import _PhiloxState
philox = _PhiloxState(state)
seed: int = philox.seed # returned int
philox.offset = 42 # accepted int
Version 2.11:
from torch.distributed.tensor._random import _PhiloxState
philox = _PhiloxState(state)
seed: int = philox.seed.item() # now returns Tens...
PyTorch 2.10.0 Release
PyTorch 2.10.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
- Security
Highlights
| Python 3.14 support for torch.compile(). Python 3.14t (freethreaded build) is experimentally supported as well. |
| Reduced kernel launch overhead with combo-kernels horizontal fusion in torchinductor |
| A new varlen_attn() op providing support for ragged and packed sequences |
| Efficient eigenvalue decompositions with DnXgeev |
| torch.compile() now respects use_deterministic_mode |
| DebugMode for tracking dispatched calls and debugging numerical divergence - This makes it simpler to track down subtle numerical bugs. |
| Intel GPUs support: Expand PyTorch support to the latest Panther Lake on Windows and Linux by enabling FP8 (core ops and scaled matmul) and complex MatMul support, and extending SYCL support in the C++ Extension API for Windows custom ops. |
For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.
Backwards Incompatible Changes
Dataloader Frontend
- Removed unused data_source argument from Sampler (#163134). This is a no-op, unless you have a custom sampler that uses this argument. Please update your custom sampler accordingly.
- Removed deprecated imports for torch.utils.data.datapipes.iter.grouping (#163438). from torch.utils.data.datapipes.iter.grouping import SHARDING_PRIORITIES, ShardingFilterIterDataPipe is no longer supported. Please import from torch.utils.data.datapipes.iter.sharding instead.
torch.nn
- Remove Nested Jagged Tensor support from nn.attention.flex_attention (#161734)
ONNX
- fallback=False is now the default in torch.onnx.export (#162726)
  - The exporter now uses the dynamo=True option without fallback. This is the recommended way to use the ONNX exporter. To preserve 2.9 behavior, manually set fallback=True in the torch.onnx.export call.
Release Engineering
- Rename pytorch-triton package to triton (#169888)
Deprecations
Distributed
- DeviceMesh
- Added a warning for slicing flattened dim from root mesh and types for _get_slice_mesh_layout (#164993)
We decided to deprecate an existing behavior which goes against the PyTorch design principle (explicit over implicit) for device mesh slicing of flattened dim.
Version <2.9
import torch
from torch.distributed.device_mesh import init_device_mesh
device_type = (
    acc.type
    if (acc := torch.accelerator.current_accelerator(check_available=True))
    else "cpu"
)
mesh_shape = (2, 2, 2)
mesh_3d = init_device_mesh(
    device_type, mesh_shape, mesh_dim_names=("dp", "cp", "tp")
)
mesh_3d["dp", "cp"]._flatten()
mesh_3d["dp_cp"]  # This comes with no warning
Version >=2.10
import torch
from torch.distributed.device_mesh import init_device_mesh
device_type = (
    acc.type
    if (acc := torch.accelerator.current_accelerator(check_available=True))
    else "cpu"
)
mesh_shape = (2, 2, 2)
mesh_3d = init_device_mesh(
    device_type, mesh_shape, mesh_dim_names=("dp", "cp", "tp")
)
mesh_3d["dp", "cp"]._flatten()
# This comes with a warning because it implicitly changes the state of the original mesh.
# The behavior will eventually be removed in a future release; users should do the
# bookkeeping of flattened meshes explicitly.
mesh_3d["dp_cp"]
Ahead-Of-Time Inductor (AOTI)
- Move from/to to torch::stable::detail (#164956)
JIT
- torch.jit is not guaranteed to work in Python 3.14. Deprecation warnings have been added to user-facing torch.jit API (#167669).
torch.jit should be replaced with torch.compile or torch.export.
ONNX
- The dynamic_axes option in torch.onnx.export is deprecated (#165769)
Users should supply the dynamic_shapes argument instead. See https://docs.pytorch.org/docs/stable/export.html#expressing-dynamism for more documentation.
Profiler
- Deprecate export_memory_timeline method (#168036)
The export_memory_timeline method in torch.profiler is being deprecated in favor of the newer memory snapshot API (torch.cuda.memory._record_memory_history and torch.cuda.memory._export_memory_snapshot). This change adds the deprecated decorator from typing_extensions and updates the docstring to guide users to the recommended alternative.
New Features
Autograd
- Allow setting grad_dtype on leaf tensors (#164751)
- Add Default Autograd Fallback for PrivateUse1 in PyTorch (#165315)
- Add API to annotate disjoint backward for use with torch.utils.checkpoint.checkpoint (#166536)
Complex Frontend
- Add ComplexTensor subclass (#167621)
Composability
- Support autograd in torch.cond (#165908)
cuDNN
- BFloat16 support added to cuDNN RNN (#164411)
- [cuDNN][submodule] Upgrade to cuDNN frontend 1.16.1 (#170591)
Distributed
- LocalTensor:
  - LocalTensor is a debugging and simulation tool in PyTorch's distributed tensor ecosystem. It allows you to simulate distributed tensor computations across multiple SPMD (Single Program, Multiple Data) ranks on a single process. This is useful for: 1) debugging distributed code without spinning up multiple processes; 2) understanding DTensor behavior by inspecting per-rank tensor states; 3) testing DTensor operations with uneven sharding across ranks; 4) rapid prototyping of distributed algorithms. Note that LocalTensor is designed for debugging purposes only. It has significant overhead and is not suitable for production distributed training.
  - LocalTensor is a torch.Tensor subclass that internally holds a mapping from rank IDs to local tensor shards. When you perform a PyTorch operation on a LocalTensor, the operation is applied independently to each local shard, mimicking distributed computation (LocalTensor simulates collective operations locally without actual network communication). LocalTensorMode is the context manager that enables LocalTensor dispatch. It intercepts PyTorch operations and routes them appropriately. The @maybe_run_for_local_tensor decorator is essential for handling rank-specific logic when implementing distributed code.
  - To get started with LocalTensor, users import from torch.distributed._local_tensor, initialize a fake process group, and wrap their distributed code in a LocalTensorMode context. Within this context, DTensor operations automatically produce LocalTensors.
  - PRs: (#164537, #166595, #168110, #168314, #169088, #169734)
- c10d:
  - New shrink_group implementation to expose ncclCommShrink API (#164518)
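The idea behind LocalTensor (a mapping from rank IDs to per-rank shards, with each operation applied shard by shard and collectives simulated in-process) can be illustrated with a toy model in plain Python. This is a conceptual sketch only. `ToyLocalTensor` and `toy_all_reduce_sum` are hypothetical names, and nothing here reflects the real torch.distributed._local_tensor API.

```python
from typing import Callable

class ToyLocalTensor:
    """Holds one list of floats per simulated rank."""

    def __init__(self, shards: dict[int, list[float]]):
        self.shards = shards

    def map(self, op: Callable[[list[float]], list[float]]) -> "ToyLocalTensor":
        # Apply the same op independently to every rank's shard (SPMD style).
        return ToyLocalTensor({rank: op(s) for rank, s in self.shards.items()})

def toy_all_reduce_sum(t: ToyLocalTensor) -> ToyLocalTensor:
    # Simulate a sum all-reduce locally: every rank ends up with the
    # elementwise sum of all shards, with no network communication.
    total = [sum(vals) for vals in zip(*t.shards.values())]
    return ToyLocalTensor({rank: list(total) for rank in t.shards})

x = ToyLocalTensor({0: [1.0, 2.0], 1: [10.0, 20.0]})
doubled = x.map(lambda shard: [2 * v for v in shard])
reduced = toy_all_reduce_sum(doubled)
print(doubled.shards)
print(reduced.shards)
```

Because all ranks live in one process, the per-rank state can be inspected directly after each step, which is the debugging benefit the notes describe.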
Dynamo
- torch.compile now fully works in Python 3.14 (#167384)
- Add option to error or disable applying side effects (#167239)
- Config flag (skip_fwd_side_effects_in_bwd_under_checkpoint) to allow eager and compile activation-checkpointing divergence for side-effects (#165775)
- torch._higher_order_ops.print for enabling printing without graph breaks or reordering (#167571)
FX
- Added node metadata annotation API
- Disable preservation of node metadata when enable=False (#164772)
- Annotation should be mapped across submod (#165202)
- Annotate bw nodes before eliminate dead code (#165782)
- Add logging for debugging annotation (#165797)
- Override metadata on regenerated node in functional mode (#166200)
- Skip copying custom meta for gradient accumulation nodes; tag with is_gradient_acc=True (#167572)
- Add metadata hook for all nodes created in runtime_assert pass (#169497)
- Update `gm.print...
PyTorch 2.9.1 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
Tracked Regressions
Significant Memory Regression in F.conv3d with bfloat16 Inputs in PyTorch 2.9.0 (#166643)
This release provides a workaround for this issue. If you are impacted, please install the nvidia-cudnn package, version 9.15+, from PyPI. (#166480) (#167111)
Torch.compile
Fix Inductor bug when compiling Gemma (#165601)
Fix InternalTorchDynamoError in bytecode_transformation (#166036)
Fix silent correctness error_on_graph_break bug where non-empty checkpoint results in unwanted graph break resumption (#166586)
Improve performance by avoiding recompilation with mark_static_address with cudagraphs (#162208)
Improve performance by caching get_free_symbol_uses in torch inductor (#166338)
Fix registration design for inductor graph partition for vLLM (#166458) (#165815) (#165514)
Fix warning spamming in torch.compile (#166993)
Fix exception related to uninitialized tracer_output variable (#163169)
Fix crash in torch.bmm and torch.compile with PyTorch release 2.9.0 (#166457)
Other
Fix warning spamming on new APIs to control TF32 behavior (#166956)
Fix distributed crash with non-contiguous gather inputs (#166181)
Fix indexing on large tensors causing an "invalid configuration argument" error (#166974)
Fix numeric issue in CUDNN_ATTENTION (#166912) (#166570)
Fix symmetric memory issue with fused_scaled_matmul_reduce_scatter (#165086)
Improve libtorch stable ABI documentation (#163899)
Fix image display on pypi project description section (#166404)
2.9 Release Notes
PyTorch 2.9.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
- Security
Highlights
| Unstable (API-Unstable) |
| Updates to the stable libtorch ABI for third-party C++/CUDA extensions |
| Symmetric memory that enables easy programming of multi-GPU kernels |
| The ability to arbitrarily toggle error or resume on graph breaks in torch.compile |
| Expanded wheel variant support to include ROCm, XPU and CUDA 13 |
| FlexAttention enablement on Intel GPUs |
| Flash decoding optimization based on FlexAttention on X86 CPU |
| ARM Platform improvements and optimizations |
| Enablement of Linux aarch64 binary wheel builds across all supported CUDA versions |
For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.
Backwards Incompatible Changes
Min supported Python version is now 3.10 (#162310)
The minimum version of Python required for PyTorch 2.9.0 is 3.10. We also have 3.14 and 3.14t available as preview with this release.
Undefined behavior when an output of a custom operator shares storage with an input
This is a reminder that outputs of PyTorch custom operators (that are registered using the torch.library or TORCH_LIBRARY APIs) are not allowed to return Tensors that share storage with input tensors. The violation of this condition leads to undefined behavior: sometimes the result will be correct, sometimes it will be garbage.
After #163227, custom operators that violated this condition that previously returned correct results under torch.compile may now return silently incorrect results under torch.compile. Because this is changing the behavior of undefined behavior, we do not consider this to be a bug, but we are still documenting it in this section as a "potentially unexpected behavior change".
This is one of the conditions checked for by torch.library.opcheck and is mentioned in The Custom Operators Manual
More details
Outputs of PyTorch custom operators are not allowed to return Tensors that share storage with input tensors
For example, the following two custom operators are not valid custom operators:
@torch.library.custom_op("mylib::foo", mutates_args=())
def foo(x: torch.Tensor) -> torch.Tensor:
    # the result of `foo` must not directly be an input to foo.
    return x
@torch.library.custom_op("mylib::bar", mutates_args=())
def bar(x: torch.Tensor) -> torch.Tensor:
    # the result of bar must not be a view of an input of bar
    return x.view(-1)
The easiest workaround is to add an extra .clone() to the outputs:
@torch.library.custom_op("mylib::foo", mutates_args=())
def foo(x: torch.Tensor) -> torch.Tensor:
    return x.clone()
@torch.library.custom_op("mylib::bar", mutates_args=())
def bar(x: torch.Tensor) -> torch.Tensor:
    return x.view(-1).clone()
A common way to get into this situation is for a user to want to create a custom operator that sometimes mutates the input in-place and sometimes returns a new Tensor, like in the following example.
@torch.library.custom_op("mylib::baz", mutates_args=["x"])
def baz(x: torch.Tensor) -> torch.Tensor:
    if inplace:
        x.sin_()
        return x
    else:
        return x.sin()
This dynamism is not supported and leads to undefined behavior. The workaround is to split the custom operator into two custom operators, one that always mutates the input in-place, and another that always returns a new Tensor.
@torch.library.custom_op("mylib::baz_outplace", mutates_args=())
def baz_outplace(x: torch.Tensor) -> torch.Tensor:
    return x.sin()
@torch.library.custom_op("mylib::baz_inplace", mutates_args=["x"])
def baz_inplace(x: torch.Tensor) -> None:
    # mutates x in place and returns nothing
    x.sin_()
def baz(x):
    if inplace:
        baz_inplace(x)
        return x
    else:
        return baz_outplace(x)
Build metal kernels of MacOS-14+ and remove all pre-MacOS-14 specific logic, requires MacOS-14+ going forward (#159733, #159912)
PyTorch MPS is only supported on macOS 14 or later. If you need to use MPS on macOS Ventura, please avoid updating to PyTorch 2.9 or above.
Upgrade to DLPack 1.0 (#145000)
This upgrade is doing the same BC-breaking changes as the DLPack release. Objects in torch.utils.dlpack have been updated to reflect these changes, such as DLDeviceType.
See the PR for details on the exact changes and how to update your code.
Raise appropriate errors in torch.cat (#158249)
torch.cat now raises ValueError, IndexError or TypeError where appropriate instead of the generic RuntimeError. If your code was catching these errors, update it to catch the new error types.
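Code that has to behave the same on 2.8 and 2.9 can catch the old and new exception types together during the transition. The following is a minimal sketch. `CAT_ERRORS` and `try_call` are hypothetical names, and `fake_cat` is a stand-in written only so the example runs without torch; it does not reproduce torch.cat's real validation logic.

```python
# RuntimeError covers 2.8 and earlier; the other three are the types
# the 2.9 notes say torch.cat may now raise instead.
CAT_ERRORS = (RuntimeError, ValueError, IndexError, TypeError)

def try_call(fn, *args, **kwargs):
    """Return (result, None) on success or (None, exc) on a cat-style error."""
    try:
        return fn(*args, **kwargs), None
    except CAT_ERRORS as exc:
        return None, exc

def fake_cat(tensors, dim=0):
    # Hypothetical stand-in for torch.cat operating on plain lists.
    if not isinstance(tensors, (list, tuple)):
        raise TypeError("expected a sequence of tensors")
    if len(tensors) == 0:
        raise ValueError("expected a non-empty sequence")
    return [v for t in tensors for v in t]

result, err = try_call(fake_cat, [[1, 2], [3]])
_, type_err = try_call(fake_cat, 42)
print(result, type(type_err).__name__)
```

Once support for older releases is dropped, the handler can be narrowed to the specific exception each call site expects.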
Default to dynamo=True for ONNX exporter (#159646, #162726)
Previously torch.onnx.export(...) used the legacy TorchScript exporter if no arguments were provided. The ONNX exporter now uses the newer torch.export.export pipeline by default (dynamo=True). This change improves graph fidelity and future-proofs exports, but may surface graph capture errors that were previously masked or handled differently.
Previously in torch 2.8.0:
# API calls the legacy exporter with dynamo=False
torch.onnx.export(...)
Now in torch 2.9.0:
# To preserve the original behavior
torch.onnx.export(..., dynamo=False)
# Export onnx model through torch.export.export
torch.onnx.export(...)
Recommendation: first try the new default; only fall back if you hit blocking issues and report them upstream.
Long term solution: fix the root cause instead of relying on fallback or TorchScript exporter.
Switch off runtime asserts by default in Export in favor of a shape guards function (#160111, #161178, #161794)
To enable runtime asserts, use export(..., prefer_deferred_runtime_asserts_over_guards=True). Also kills the allow_complex_guards_as_runtime_asserts flag, merging it into the former option.
Additionally, exported_program.module() will generate a call to a _guards_fn submodule that will run additional checks on inputs. Users who do not want this behavior can either remove this call in the graph, or do exported_program.module(check_guards=False) to avoid the generation.
Set default opset to 20 in ONNX (#158802)
Opset 20 enables newer operator definitions. If your tooling or downstream runtime only supports opset 18, pin it explicitly. For the latest ONNX operators, you can experiment with opset 23.
Previously in torch 2.8.0:
# opset_version=18
torch.onnx.export(...)
Now in torch 2.9.0:
# To preserve the original behavior
torch.onnx.export(..., opset_version=18)
# New: opset_version=20
torch.onnx.export(...)
# Use the latest supported opset: opset_version=23
torch.onnx.export(..., opset_version=23)
Drop draft_export in exporter API (#161454, #162225)
Remove implicit draft tracing from the default exporter path, achieving clearer behaviour and faster failures.
The expensive torch.export.draft_export diagnostic path is no longer auto-invoked (which could take hours on large models). You can still opt in for deep diagnostics:
Previously in torch 2.8.0:
# If both torch.export.export(..., strict=False) and
# torch.export.export(..., strict=True) fail to capture
# the model graph, torch.export.draft_export(...) will be triggered,
# and uses real tensor to trace/export the model.
#
# Inside export_to_onnx.py:
# ... torch.onnx.export(..., dynamo=True)
python export_to_onnx.py
Now in torch 2.9.0:
# To trigger torch.export.draft_export once
# torch.export.export strict=False/True both
# fail:
TORCH_ONNX_ENABLE_DRAFT_EXPORT=True python export_to_onnx.py
Remove torch.onnx.dynamo_export and the onnxrt torch compile backend (#158130, #158258)
torch.onnx.dynamo_export is removed. Please use torch.onnx.export instead.
The experimental ONNX Runtime compile backend (torch.compile(backend="onnxrt")) is no longer supported.
Remove torch.onnx.enable_fake_mode (#161222)
The dynamo=True mode uses FakeTensors by default which is memory efficient.
Some public facing ONNX utility APIs for the TorchScript based exporter are now private (#161323)
Deprecated members in torch.onnx.verification are removed. Previously private torch.onnx.symbolic_opsets* functions will no longer be accessible. Consider making a copy of the source code if you need to access any private functions for compatibility with the TorchScript based exporter.
Remove torch.onnx.symbolic_caffe2 (#157102)
Support for caffe2 in the ONNX exporter has ended and is removed.
Remove /d2implyavx512upperregs flag that slows build (#159431)
Re-introduced AVX512 optimizations for Windows VS2022 builds. This may cause issues with specific versions of VS2022; see #145702.
Add ScalarType to shim conversion and stable::Tensor.scalar_type (#160557)
Before, user extensions could only in abstract...
PyTorch 2.8.0 Release
PyTorch 2.8.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
Highlights
| Unstable |
| torch::stable::Tensor |
| High-performance quantized LLM inference on Intel CPUs with native PyTorch |
| Experimental Wheel Variant Support |
| Inductor CUTLASS backend support |
| Inductor Graph Partition for CUDAGraph |
| Control Flow Operator Library |
| HuggingFace SafeTensors support in PyTorch Distributed Checkpointing |
| SYCL support in PyTorch CPP Extension API |
| A16W4 on XPU Device |
| Hierarchical compilation with torch.compile |
| Intel GPU distributed backend (XCCL) support |
For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.
Tracked Regressions
Windows wheel builds with CUDA 12.9.1 stack overflow during build (#156181)
Due to a bug introduced in CUDA 12.9.1, we are unable to complete full Windows wheel builds with this
version, as compilation of torch.segment_reduce() crashes the build. Thus, we provide a wheel
without torch.segment_reduce() included in order to sidestep the issue. If you need support
for torch.segment_reduce(), please utilize a different version.
Backwards Incompatible Changes
CUDA Support
Removed support for Maxwell and Pascal architectures with CUDA 12.8 and 12.9 builds (#157517, #158478, #158744)
Due to binary size limitations, support for sm50 - sm60 architectures with CUDA 12.8 and 12.9 has
been dropped for the 2.8.0 release. If you need support for these architectures, please utilize
CUDA 12.6 instead.
Python Frontend
Calling an op with an input dtype that is unsupported now raises NotImplementedError instead of RuntimeError (#155470)
Please update exception handling logic to reflect this.
In 2.7.0
try:
    torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except RuntimeError:
    ...
In 2.8.0
try:
    torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except NotImplementedError:
    ...
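Code that must run on both 2.7 and 2.8 can lean on a plain Python fact: NotImplementedError is a subclass of RuntimeError, so an except RuntimeError handler catches both, while except NotImplementedError misses the 2.7 behavior. The sketch below demonstrates this with stand-in functions (op_27 and op_28 are hypothetical placeholders for the op call under each version), so it does not assume a PyTorch install.

```python
def op_27():
    # Stand-in for the 2.7 behavior: unsupported dtype raises RuntimeError.
    raise RuntimeError("unsupported dtype")


def op_28():
    # Stand-in for the 2.8 behavior: unsupported dtype raises NotImplementedError.
    raise NotImplementedError("unsupported dtype")


def is_unsupported(op):
    """Version-tolerant check: True if `op` raises either exception type."""
    try:
        op()
    except RuntimeError:
        # Also catches NotImplementedError, which derives from RuntimeError.
        return True
    return False


def is_unsupported_new_style(op):
    """2.8-only check: recognizes NotImplementedError, not a bare RuntimeError."""
    try:
        op()
    except NotImplementedError:
        return True
    except RuntimeError:
        return False
    return False
```

The first helper suits libraries that support several PyTorch versions; the second is the stricter form once 2.8 is the minimum.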
Added missing in-place on view check to custom autograd.Function (#153094)
In 2.8.0, if a custom autograd.Function mutates a view of a leaf requiring grad,
it now properly raises an error. Previously, it would silently leak memory.
class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        inp.add_(1)
        ctx.mark_dirty(inp)
        return inp

    @staticmethod
    def backward(ctx, gO):
        pass

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a.view_as(a)
Func.apply(b)
Output:
Version 2.7.0
Runs without error, but leaks memory
Version 2.8.0
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation
An error is now properly thrown for the out variant of tensordot when called with a requires_grad=True tensor (#150270)
Please avoid passing an out tensor with requires_grad=True as gradients cannot be
computed for this tensor.
In 2.7.0
a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
# does not error, but gradients for c cannot be computed
torch.tensordot(a, b, dims=([1], [0]), out=c)
In 2.8.0
a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
torch.tensordot(a, b, dims=([1], [0]), out=c)
# RuntimeError: tensordot(): the 'out' tensor was specified and requires gradients, and
# its shape does not match the expected result. Either remove the 'out' argument, ensure
# it does not require gradients, or make sure its shape matches the expected output.
torch.compile
Specialization of a tensor shape with mark_dynamic applied now correctly errors (#152661)
Prior to 2.8, it was possible for a guard on a symbolic shape to be incorrectly
omitted if the symbolic shape evaluation was previously tested with guards
suppressed (this often happens within the compiler itself). This has been fixed
in 2.8 and usually will just silently "do the right thing" and add the correct
guard. However, if the new guard causes a tensor marked with mark_dynamic to become
specialized, this can result in an error. One workaround is to use
maybe_mark_dynamic instead of mark_dynamic.
See the discussion in issue #157921 for more
context.
Version 2.7.0
import torch
embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.mark_dynamic(x, 0)
@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()
f(embed, x)
Version 2.8.0
import torch
embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.maybe_mark_dynamic(x, 0)
@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()
f(embed, x)
Several config variables related to torch.compile have been renamed or removed
- Dynamo config variable enable_cpp_framelocals_guard_eval has changed to no longer have any effect (#151008).
- Inductor config variable rocm.n_max_profiling_configs is deprecated (#152341). Instead, use ck-tile based configs rocm.ck_max_profiling_configs and rocm.ck_tile_max_profiling_configs.
- Inductor config variable autotune_fallback_to_aten is deprecated (#154331). Inductor will no longer silently fall back to ATen. Please add "ATEN" to max_autotune_gemm_backends for the old behavior.
- Inductor config variables use_mixed_mm and mixed_mm_choice are deprecated (#152071). Inductor now supports prologue fusion, so there is no need for special cases now.
- Inductor config setting descriptive_names = False is deprecated (#151481). Please use one of the other available options: "torch", "original_aten", or "inductor_node".
- custom_op_default_layout_constraint has moved from inductor config to functorch config (#148104). Please reference it via torch._functorch.config.custom_op_default_layout_constraint instead of torch._inductor.config.custom_op_default_layout_constraint.
- AOTI config variable emit_current_arch_binary is deprecated (#155768).
- AOTI config variable aot_inductor.embed_cubin has been renamed to aot_inductor.embed_kernel_binary (#154412).
- AOTI config variable aot_inductor.compile_wrapper_with_O0 has been renamed to compile_wrapper_opt_level (#148714).
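When auditing a codebase for these changes, it can help to collect the names into a lookup table. The following is a sketch under stated assumptions: audit_overrides and both tables are hypothetical helpers, not a PyTorch API. The tables are copied from the list above and deliberately cover only part of it (the moved option and the rename whose value type may differ are left out). The check works on plain option-name strings and never touches the real config modules.

```python
# Old name -> new name, for options whose value can be carried over unchanged.
RENAMED = {
    "aot_inductor.embed_cubin": "aot_inductor.embed_kernel_binary",
}

# Options listed above as deprecated or as no longer having any effect.
DEPRECATED = {
    "enable_cpp_framelocals_guard_eval",
    "rocm.n_max_profiling_configs",
    "autotune_fallback_to_aten",
    "use_mixed_mm",
    "mixed_mm_choice",
    "emit_current_arch_binary",
}


def audit_overrides(overrides):
    """Split a {name: value} dict into migrated overrides and warning messages."""
    migrated, warnings = {}, []
    for name, value in overrides.items():
        if name in RENAMED:
            new_name = RENAMED[name]
            warnings.append(f"{name} was renamed to {new_name}")
            migrated[new_name] = value
        elif name in DEPRECATED:
            # Deprecated options are reported and left out of the result.
            warnings.append(f"{name} is deprecated in 2.8; override removed")
        else:
            migrated[name] = value
    return migrated, warnings
```

Running it over the overrides a project sets gives a quick list of what needs attention before upgrading.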
Added a stricter aliasing/mutation check for HigherOrderOperators (e.g. cond), which will explicitly error out if alias/mutation among inputs and outputs is unsupported (#148953, #146658).
For affected HigherOrderOperators, add .clone() to aliased outputs to address this.
Version 2.7.0
import torch
@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x, lambda x: x + 1, [x])
fn(torch.ones(3))
Version 2.8.0
import torch
@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x.clone(), lambda x: x + 1, [x])
fn(torch.ones(3))
guard_or_x and definitely_x have been consolidated (#152463)
We removed definitely_true / definitely_false and associated APIs, replacing them with
guard_or_true / guard_or_false, which offer similar functionality and can be used to
achieve the same effect. Please migrate to the latter.
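As a rough mental model of why this migration preserves behavior, the sketch below treats a symbolic condition as an Optional[bool], where None means the value cannot be decided at trace time. This is a toy model only: the four functions are simplified stand-ins, not the real implementations in torch.fx.experimental.symbolic_shapes, and they ignore guard installation entirely.

```python
from typing import Optional


def guard_or_true(cond: Optional[bool]) -> bool:
    # Toy stand-in: use the value when it is known, otherwise assume True.
    return True if cond is None else cond


def guard_or_false(cond: Optional[bool]) -> bool:
    # Toy stand-in: use the value when it is known, otherwise assume False.
    return False if cond is None else cond


def definitely_true(cond: Optional[bool]) -> bool:
    # Toy stand-in for the removed API: True only when known to be True.
    return cond is True


def definitely_false(cond: Optional[bool]) -> bool:
    # Toy stand-in for the removed API: True only when known to be False.
    return cond is False


# The replacements agree with the removed APIs on every case of the toy model.
for cond in (True, False, None):
    assert definitely_true(cond) == guard_or_false(cond)
    assert definitely_false(cond) == (not guard_or_true(cond))
```

In this model, definitely_true(x) maps to guard_or_false(x) and definitely_false(y) maps to not guard_or_true(y).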
Version 2.7.0
from torch.fx.experimental.symbolic_shapes import definitely_false, definitely_true
...
if definitely_true(x):
    ...
if definitely_false(y):
    ...
Version 2.8.0
from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true
...
if guard_or_false(x):
    ...
# alternatively: if guard_or_false(torch.sym_not(y))
if not guard_or_true(y):
    ...
torch.export
torch.export.export_for_inference has been removed in favor of torch.export.export_for_training().run_decompositions() (#149078)
Version 2.7.0
import torch
...
exported_program = torch.export.export_for_inference(mod, args, kwargs)
Version 2.8.0
import torch
...
exported_program = torch.export.export_for_training(
    mod, args, kwargs
).run_decompositions(decomp_table=decomp_table)
Switched default to strict=False in torch.export.export and export_for_training (#148790, #150941)
This differs from the previous release default of strict=True. To revert to the old default
behavior, please explicitly pass strict=True.
Version 2.7.0
import torch
# default behavior is strict=True
torch.export.export(...)
torch.export.export_for_training(...)
Version 2.8.0
import torch
# strict=True must be explicitly passed to get the old behavior
torch.export.export(..., strict=True)
torch.export.export_for_training(..., strict=True)
ONNX
Default opset in torch.onnx.export is now 18 (#156023)
When dynamo=False, th...
PyTorch 2.7.1 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
Torch.compile
Fix Excessive cudagraph re-recording for HF LLM models (#152287)
Fix torch.compile on some HuggingFace models (#151154)
Fix crash due to Exception raised inside torch.autocast (#152503)
Improve Error logging in torch.compile (#149831)
Mark mutable custom operators as cacheable in torch.compile (#151194)
Implement workaround for a graph break with older version einops (#153925)
Fix an issue with tensor.view(dtype).copy_(...) (#151598)
Flex Attention
Fix assertion error due to inductor permuting inputs to flex attention (#151959)
Fix performance regression on nanogpt speedrun (#152641)
Distributed
Fix extra CUDA context created by barrier (#149144)
Fix an issue related to Distributed Fused Adam in Rocm/APEX when using nccl_ub feature (#150010)
Add a workaround for a random hang in non-blocking API mode in NCCL 2.26 (#154055)
MacOS
Fix MacOS compilation error with Clang 17 (#151316)
Fix binary kernels producing incorrect results when one of the tensor arguments is from a wrapped scalar on MPS devices (#152997)
Other
Improve PyTorch wheel size after the introduction of 128-bit vectorization (#148320) (#152396)
Fix fmsub function definition (#152075)
Fix Floating point exception in torch.mkldnn_max_pool2d (#151848)
Fix abnormal inference output with XPU:1 device (#153067)
Fix Illegal Instruction Caused by grid_sample on Windows (#152613)
Fix ONNX decomposition not preserving custom CompositeImplicitAutograd ops (#151826)
Fix error with dynamic linking of libgomp library (#150084)
Fix segfault in profiler with Python 3.13 (#153848)
PyTorch 2.7.0 Release
PyTorch 2.7.0 Release Notes
- Highlights
- Tracked Regressions
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
Highlights
| Beta | Prototype |
| Torch.Compile support for Torch Function Modes | NVIDIA Blackwell Architecture Support |
| Mega Cache | PyTorch Native Context Parallel |
| Enhancing Intel GPU Acceleration | |
| FlexAttention LLM first token processing on X86 CPUs | |
| FlexAttention LLM throughput mode optimization on X86 CPUs | |
| Foreach Map | |
| Flex Attention for Inference | |
| Prologue Fusion Support in Inductor |
For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.
Tracked Regressions
NCCL init hits CUDA failure 'invalid argument' on 12.2 driver
Some users with 12.2 CUDA driver (535 version) report seeing "CUDA driver error: invalid argument" during NCCL or Symmetric Memory initialization. This issue is currently under investigation, see #150852. If you use PyTorch from source, a known workaround is to rebuild PyTorch with CUDA 12.2 toolkit. Otherwise, you can try upgrading the CUDA driver on your system.
Backwards Incompatible Changes
Dropped support for Triton < 2.2.0. Removed Support for CUDA 12.4, Anaconda in CI/CD.
- Removed CUDA 12.4 support in CI/CD in favor of 12.8 (#148895, #142856, #144118, #145566, #145844, #148602, #143076, #148717)
- Removed Anaconda support in CI/CD (#144870, #145015, #147792)
- Dropped support for Triton < 2.2.0 (versions without ASTSource) (#143817)
C++ Extensions py_limited_api=True is now built with -DPy_LIMITED_API (#145764)
We formally began respecting the py_limited_api=True kwarg in 2.6 and stopped linking libtorch_python.so when the flag was specified, as libtorch_python.so does not guarantee using APIs from the stable Python limited API. In 2.7, we go further by specifying the -DPy_LIMITED_API flag, which enforces that the extension is buildable with the limited API. As a result of this enforcement, custom extensions that set py_limited_api=True but do not abide by the limited API may fail to build. For an example, see #152243.
This is strictly better behavior as it is sketchy to claim CPython agnosticism without enforcing with the flag. If you run into this issue, please ensure that the extension you are building does not use any APIs which are outside of the Python limited API, e.g., pybind.
Change torch.Tensor.new_tensor() to be on the given Tensor's device by default (#144958)
This function was always creating the new Tensor on the "cpu" device and will now use the same device as the current Tensor object. This behavior is now consistent with other .new_* methods.
Use Manylinux 2.28 and CXX11_ABI=1 for future released Linux wheel builds.
With Migration to manylinux_2_28 (AlmaLinux 8 based), we can no longer support OS distros with glibc2_26. These include popular Amazon Linux 2 and CentOS 7. (#143423, #146200, #148028, #148135, #148195, #148129)
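Whether a given Linux system can install the new wheels comes down to its glibc version, since the manylinux_2_28 tag requires glibc 2.28 or newer. A minimal check using only the standard library might look like the sketch below. supports_manylinux_2_28 is a hypothetical helper name, and the function only inspects the libc name and version strings that platform.libc_ver() reports.

```python
import platform


def supports_manylinux_2_28(libc_name: str, libc_version: str) -> bool:
    """Return True if the reported C library is glibc 2.28 or newer."""
    if libc_name != "glibc" or not libc_version:
        # Not glibc (for example musl, macOS, Windows) or version unknown.
        return False
    parts = libc_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= (2, 28)


if __name__ == "__main__":
    name, version = platform.libc_ver()
    compatible = supports_manylinux_2_28(name, version)
    print(f"{name or 'unknown libc'} {version}: manylinux_2_28 compatible = {compatible}")
```

On a distribution with glibc 2.26 the check returns False, which matches the note above that such systems can no longer be supported.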
torch.onnx.dynamo_export now uses the ExportedProgram logic path (#137296)
Users using the torch.onnx.dynamo_export API may see some ExportOptions become
unsupported due to an internal switch to use torch.onnx.export(..., dynamo=True): diagnostic_options, fake_context and onnx_registry are removed/ignored by ExportOptions. Only dynamic_shapes is retained.
Users should move to use the dynamo=True option on torch.onnx.export as
torch.onnx.dynamo_export is now deprecated. Leverage the dynamic_shapes argument in torch.onnx.export for specifying dynamic shapes on the model.
Version 2.6.0
torch.onnx.dynamo_export(model, *args, **kwargs)
Version 2.7.0
torch.onnx.export(model, args, kwargs=kwargs, dynamo=True)
Finish deprecation of LRScheduler.print_lr() along with the verbose kwarg to the LRScheduler constructor. (#147301)
Both APIs have been deprecated since 2.2. Please use LRScheduler.get_last_lr() to access the learning rate instead.
print_lr and verbose were confusing, not properly documented, and little used, as described in #99270, so we deprecated them in 2.2. Now, we complete the deprecation by removing them completely. To access and print the learning rate of a LRScheduler:
Version 2.6.0
optim = ...
lrsched = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, verbose=True)
# lrsched will internally call print_lr() and print the learning rate
Version 2.7.0
optim = ...
lrsched = torch.optim.lr_scheduler.ReduceLROnPlateau(optim)
print(lrsched.get_last_lr())
libtorch_python.so symbols are now invisible by default on all platforms except Apple (#142214)
Previously, the symbols in libtorch_python.so were exposed with default visibility. We have transitioned to being more intentional about what we expose as public symbols for our python API in C++. After #142214, public symbols will be marked explicitly while everything else will be hidden. Some extensions using private symbols will see linker failures with this change.
Please use torch.export.export instead of capture_pre_autograd_graph to export the model for pytorch 2 export quantization (#139505)
capture_pre_autograd_graph was a temporary API in torch.export. Now that the longer-term torch.export.export API is available, we can deprecate it.
Version 2.6.0
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
XNNPACKQuantizer,
get_symmetric_quantization_config,
)
quantizer = XNNPACKQuantizer().set_global(
get_symmetric_quantization_config()
)
# example_inputs is a tuple of positional inputs
m = capture_pre_autograd_graph(m, example_inputs)
m = prepare_pt2e(m, quantizer)
Version 2.7.0
from torch.export import export
from torch.ao.quantization.quantize_pt2e import prepare_pt2e
# please get xnnpack quantizer from executorch (https://github.com/pytorch/executorch/)
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
XNNPACKQuantizer,
get_symmetric_quantization_config,
)
quantizer = XNNPACKQuantizer().set_global(
get_symmetric_quantization_config()
)
# example_inputs is a tuple of positional inputs; prepare_pt2e expects a GraphModule
m = export(m, example_inputs).module()
m = prepare_pt2e(m, quantizer)
New interface for torch.fx.passes.graph_transform_observer.GraphTransformObserver to enable Node Level provenance tracking (#144277)
We now track a mapping between the nodes in the pre-grad and post-grad graph. See the issue for an example frontend to visualize the transformations. To update your GraphTransformObserver subclasses, instead of overriding on_node_creation and on_node_erase, there are new functions get_node_creation_hook, get_node_erase_hook, get_node_replace_hook and get_deepcopy_hook. These are registered on the GraphModule member of the GraphTransformObserver upon entry and exit of a with block.
Version 2.6.0
class MyPrintObserver(GraphTransformObserver):
    def on_node_creation(self, node: torch.fx.Node):
        print(node)
Version 2.7.0
class MyPrintObserver(GraphTransformObserver):
    def get_node_creation_hook(self):
        def hook(node: torch.fx.Node):
            print(node)
        return hook
torch.ao.quantization.pt2e.graph_utils.get_control_flow_submodules is no longer public (#141612)
We are planning to make all functions under torch.ao.quantization.pt2e.graph_utils private. This update marks get_control_flow_submodules as a private API. If you have to or want to continue using get_control_flow_submodules, please make a private call by using _get_control_flow_submodules.
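Code that has to import this helper under both releases can try the private name first and fall back to the public one. The sketch below shows the pattern with a generic, hypothetical helper named import_first, demonstrated on the standard library so that it runs without PyTorch. The torch module path in the trailing comment is taken from the text above. Keep in mind that private APIs carry no stability guarantee.

```python
import importlib


def import_first(module_name, candidate_names):
    """Return the first attribute from `candidate_names` found in the module."""
    module = importlib.import_module(module_name)
    for name in candidate_names:
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"none of {list(candidate_names)} found in {module_name!r}")


# Demonstration on the standard library: the first name does not exist,
# so the second one is returned.
sqrt = import_first("math", ["_no_such_function", "sqrt"])

# With PyTorch installed, the same pattern would read:
# get_submodules = import_first(
#     "torch.ao.quantization.pt2e.graph_utils",
#     ["_get_control_flow_submodules", "get_control_flow_submodules"],
# )
```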
Example:
Version 2.6:
>>> from torch.ao.quantization.pt2e.graph_utils import get_control_flow_submodules
Version 2.7:
>>> from torch.ao.quantization.pt2e.graph_utils import get_control_flow_submodules
ImportError: cannot import name 'get_control_flow_submodules' from 'torch.ao.quantization.pt2e.graph_utils'
>>> from torch.ao.quantization.pt2e.graph_utils import _get_control_flow_submodules # Note: Use _get_control_flow_submodules for private access
Deprecations
torch.onnx.dynamo_export is deprecated (#146425, #146639, #146923)
Users should use the dynamo=True option on torch.onnx.export.
Version 2.6.0
torch.onnx.dynamo_export(model, *args, **kwargs)
Version 2.7.0
torch.onnx.export(model, args, kwargs=kwargs, dynamo=True)
XNNPACKQuantizer is deprecated in PyTorch and moved to ExecuTorch, please use it from executorch.backends.xnnpack.quantizer.xnnpack_quantizer instead of torch.ao.quantization.quantizer.xnnpack_quantizer. (#144940)
XNNPACKQuantizer is a quantizer for xnnpack that was added into pytorch/pytorch for initial development. Ho...
PyTorch 2.6.0 Release
- Highlights
- Tracked Regressions
- Backwards Incompatible Change
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
Highlights
We are excited to announce the release of PyTorch® 2.6 (release notes)! This release features multiple improvements for PT2: torch.compile can now be used with Python 3.13; new performance-related knob torch.compiler.set_stance; several AOTInductor enhancements. Besides the PT2 improvements, another highlight is FP16 support on X86 CPUs.
NOTE: Starting with this release we are not going to publish on Conda, please see [Announcement] Deprecating PyTorch’s official Anaconda channel for the details.
For this release the experimental Linux binaries shipped with CUDA 12.6.3 (as well as Linux Aarch64, Linux ROCm 6.2.4, and Linux XPU binaries) are built with CXX11_ABI=1 and are using the Manylinux 2.28 build platform. If you build PyTorch extensions with custom C++ or CUDA extensions, please update these builds to use CXX11_ABI=1 as well and report any issues you are seeing. For the next PyTorch 2.7 release we plan to switch all Linux builds to Manylinux 2.28 and CXX11_ABI=1, please see [RFC] PyTorch next wheel build platform: manylinux-2.28 for the details and discussion.
Also in this release as an important security improvement measure we have changed the default value for weights_only parameter of torch.load. This is a backward compatibility-breaking change, please see this forum post for more details.
This release is composed of 3892 commits from 520 contributors since PyTorch 2.5. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve PyTorch. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
| Beta | Prototype |
| torch.compiler.set_stance | Improved PyTorch user experience on Intel GPUs |
| torch.library.triton_op | FlexAttention support on X86 CPU for LLMs |
| torch.compile support for Python 3.13 | Dim.AUTO |
| New packaging APIs for AOTInductor | CUTLASS and CK GEMM/CONV Backends for AOTInductor |
| AOTInductor: minifier | |
| AOTInductor: ABI-compatible mode code generation | |
| FP16 support for X86 CPUs |
*To see a full list of public feature submissions click here.
BETA FEATURES
[Beta] torch.compiler.set_stance
This feature enables the user to specify different behaviors (“stances”) that torch.compile can take between different invocations of compiled functions. One of the stances, for example, is “eager_on_recompile”, which instructs PyTorch to run code eagerly when a recompile is necessary, reusing cached compiled code when possible.
For more information please refer to the set_stance documentation and the Dynamic Compilation Control with torch.compiler.set_stance tutorial.
[Beta] torch.library.triton_op
torch.library.triton_op offers a standard way of creating custom operators that are backed by user-defined triton kernels.
When users turn user-defined triton kernels into custom operators, torch.library.triton_op allows torch.compile to peek into the implementation, enabling torch.compile to optimize the triton kernel inside it.
For more information please refer to the triton_op documentation and the Using User-Defined Triton Kernels with torch.compile tutorial.
[Beta] torch.compile support for Python 3.13
torch.compile previously only supported Python up to version 3.12. Users can now optimize models with torch.compile in Python 3.13.
[Beta] New packaging APIs for AOTInductor
A new package format, “PT2 archive”, has been introduced. This essentially contains a zipfile of all the files that need to be used by AOTInductor, and allows users to send everything needed to other environments. There is also functionality to package multiple models into one artifact, and to store additional metadata inside of the package.
For more details please see the updated torch.export AOTInductor Tutorial for Python runtime.
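Since the PT2 archive is described above as essentially a zip file, standard zip tooling should be able to list what a package contains. The sketch below uses only the standard library zipfile module. The archive it builds is a stand-in with made-up entry names (data/weights.bin and extra/metadata.json are placeholders, not the real package layout), so treat it as an illustration of the inspection step only.

```python
import io
import zipfile


def list_archive(file_obj):
    """Return (name, size_in_bytes) for every entry in a zip archive."""
    with zipfile.ZipFile(file_obj) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Build a small in-memory stand-in archive; the entry names are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/weights.bin", b"\x00" * 16)
    archive.writestr("extra/metadata.json", '{"note": "placeholder"}')

buffer.seek(0)
entries = list_archive(buffer)
for name, size in entries:
    print(f"{size:>6}  {name}")
```

For a real package, list_archive also accepts a file path in place of the in-memory buffer.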
[Beta] AOTInductor: minifier
If a user encounters an error while using AOTInductor APIs, AOTInductor Minifier allows creation of a minimal nn.Module that reproduces the error.
For more information please see the AOTInductor Minifier documentation.
[Beta] AOTInductor: ABI-compatible mode code generation
AOTInductor-generated model code depends on PyTorch C++ libraries. As PyTorch evolves quickly, it’s important to make sure previously AOTInductor compiled models can continue to run on newer PyTorch versions, i.e. AOTInductor is backward compatible.
In order to guarantee application binary interface (ABI) backward compatibility, we have carefully defined a set of stable C interfaces in libtorch and make sure AOTInductor generates code that only refers to the specific set of APIs and nothing else in libtorch. We will keep the set of C APIs stable across PyTorch versions and thus provide backward compatibility guarantees for AOTInductor-compiled models.
[Beta] FP16 support for X86 CPUs (both eager and Inductor modes)
Float16 datatype is commonly used for reduced memory usage and faster computation in AI inference and training. CPUs like the recently launched Intel® Xeon® 6 with P-Cores support Float16 datatype with native accelerator AMX. Float16 support on X86 CPUs was introduced in PyTorch 2.5 as a prototype feature, and now it has been further improved for both eager mode and Torch.compile + Inductor mode, making it Beta level feature with both functionality and performance verified with a broad scope of workloads.
PROTOTYPE FEATURES
[Prototype] Improved PyTorch user experience on Intel GPUs
PyTorch user experience on Intel GPUs is further improved with simplified installation steps, Windows release binary distribution and expanded coverage of supported GPU models including the latest Intel® Arc™ B-Series discrete graphics. Application developers and researchers seeking to fine-tune, inference and develop with PyTorch models on Intel® Core™ Ultra AI PCs and Intel® Arc™ discrete graphics will now be able to directly install PyTorch with binary releases for Windows, Linux and Windows Subsystem for Linux 2.
- Simplified Intel GPU software stack setup to enable one-click installation of the torch-xpu PIP wheels to run deep learning workloads in an out of the box fashion, eliminating the complexity of installing and activating Intel GPU development software bundles.
- Windows binary releases for torch core, torchvision and torchaudio have been made available for Intel GPUs, and the supported GPU models have been expanded from Intel® Core™ Ultra Processors with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics and Intel® Arc™ A-Series Graphics to the latest GPU hardware Intel® Arc™ B-Series graphics.
- Further enhanced coverage of Aten operators on Intel GPUs with SYCL* kernels for smooth eager mode execution, as well as bug fixes and performance optimizations for torch.compile on Intel GPUs.
For more information regarding Intel GPU support, please refer to Getting Started Guide.
[Prototype] FlexAttention support on X86 CPU for LLMs
FlexAttention was initially introduced in PyTorch 2.5 to provide optimized implementations for Attention variants with a flexible API. In PyTorch 2.6, X86 CPU support for FlexAttention was added through TorchInductor CPP backend. This new feature leverages and extends current CPP template abilities to support...
PyTorch 2.5.1: bug fix release
This release is meant to fix the following regressions:
- Wheels from PyPI are unusable out of the box on RPM-based Linux distributions: #138324
- PyPI arm64 distribution logs cpuinfo error on import: #138333
- Crash When Using torch.compile with Math scaled_dot_product_attention in AMP Mode: #133974
- [MPS] Internal crash due to the invalid buffer size computation if sliced API is used: #137800
- Several issues related to CuDNN Attention: #138522
Besides the regression fixes, the release includes several documentation updates.
See release tracker #132400 for additional information.