Releases: pytorch/pytorch
PyTorch 2.2.2 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
- Properly raise an error when trying to use inductor backend on non-supported platforms such as Windows (#115969)
- Fix mkldnn performance issue on Windows platform (#121618)
- Fix RuntimeError: cannot create std::vector larger than max_size() in torch.nn.functional.conv1d on non-contiguous cpu inputs by patching OneDNN (pytorch/builder#1742) (pytorch/builder#1744)
- Add support for torch.distributed.fsdp.StateDictType.FULL_STATE_DICT for when using torch.distributed.fsdp.FullyShardedDataParallel with the device_mesh argument (#120837)
- Fix make triton command on release branch for users building the release branch from source (#121169)
- Ensure gcc>=9.0 for build from source and cpp_extensions (#120126)
- Fix cxx11-abi build in release branch (pytorch/builder#1709)
- Fix building from source on Windows with MSVC 14.38 - VS 2022 (#122120)
Release tracker #120999 contains all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.2.1 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
- Fix missing OpenMP support on Apple Silicon binaries (pytorch/builder#1697)
- Fix crash when mixing lazy and non-lazy tensors in one operation (#117653)
- Fix PyTorch performance regression on Linux aarch64 (pytorch/builder#1696)
- Fix silent correctness in DTensor _to_copy operation (#116426)
- Fix properly assigning param.grad_fn for next forward (#116792)
- Ensure gradient clear out pending AsyncCollectiveTensor in FSDP Extension (#116122)
- Fix processing unflatten tensor on compute stream in FSDP Extension (#116559)
- Fix FSDP AssertionError on tensor subclass when setting sync_module_states=True (#117336)
- Fix DCP state_dict cannot correctly find FQN when the leaf module is wrapped by FSDP (#115592)
- Fix OOM when returning an AsyncCollectiveTensor by forcing _gather_state_dict() to be synchronous with respect to the main stream (#118197) (#119716)
- Fix Windows runtime torch.distributed.DistNetworkError: [WinError 32] The process cannot access the file because it is being used by another process (#118860)
- Update supported python versions in package description (#119743)
- Fix SIGILL crash during import torch on CPUs that do not support SSE4.1 (#116623)
- Fix DCP RuntimeError in get_state_dict and set_state_dict (#119573)
- Fixes for HSDP + TP integration with device_mesh (#112435) (#118620) (#119064) (#118638) (#119481)
- Fix numerical error with mixedmm on NVIDIA V100 (#118591)
- Fix RuntimeError when using SymInt input invariant when splitting graphs (#117406)
- Fix compile DTensor.from_local in trace_rule_look up (#119659)
- Improve torch.compile integration with CUDA-11.8 binaries (#119750)
Release tracker #119295 contains all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.2: FlashAttention-v2, AOTInductor
PyTorch 2.2 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
Highlights
We are excited to announce the release of PyTorch® 2.2! PyTorch 2.2 offers ~2x performance improvements to scaled_dot_product_attention via FlashAttention-v2 integration, as well as AOTInductor, a new ahead-of-time compilation and deployment tool built for non-python server-side deployments.
This release also includes improved torch.compile support for Optimizers, a number of new inductor optimizations, and a new logging mechanism called TORCH_LOGS.
Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.
Along with 2.2, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.
This release is composed of 3,628 commits and 521 contributors since PyTorch 2.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.2. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
Summary:
- scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions.
- PyTorch 2.2 introduces a new ahead-of-time extension of TorchInductor called AOTInductor, designed to compile and deploy PyTorch programs for non-python server-side deployments.
- torch.distributed supports a new abstraction for initializing and representing ProcessGroups called device_mesh.
- PyTorch 2.2 ships a standardized, configurable logging mechanism called TORCH_LOGS.
- A number of torch.compile improvements are included in PyTorch 2.2, including improved support for compiling Optimizers and improved TorchInductor fusion and layout optimizations.
- Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.
- torch.ao.quantization now offers a prototype torch.export based flow
| Stable | Beta | Prototype | Performance Improvements |
|---|---|---|---|
| | FlashAttentionV2 backend for scaled dot product attention | PT 2 Quantization | Inductor optimizations |
| | AOTInductor | Scaled dot product attention support for jagged layout NestedTensors | aarch64-linux optimizations (AWS Graviton) |
| | TORCH_LOGS | | |
| | torch.distributed.device_mesh | | |
| | torch.compile + Optimizers | | |
*To see a full list of public 2.2 - 1.12 feature submissions click here.
Tracked Regressions
Performance reduction when using NVLSTree algorithm in NCCL 2.19.3 (#117748)
We have noticed a performance regression introduced to all-reduce in NCCL 2.19.3. Please use version 2.19.1 instead.
Poor numeric stability of loss when training with FSDP + DTensor (#117471)
We observe the loss will flatline randomly while training with FSDP + DTensor in some instances.
Backwards Incompatible Changes
Building PyTorch from source now requires GCC 9.4 or newer (#112858)
GCC 9.4 is the oldest version fully compatible with C++17, which the PyTorch codebase has migrated to from C++14.
Updated flash attention kernel in scaled_dot_product_attention to use Flash Attention v2 (#105602)
Previously, the v1 Flash Attention kernel had a Windows implementation. So if a user on Windows had explicitly forced the flash attention kernel to be run by using sdp_kernel context manager with only flash attention enabled, it would work. In 2.2, if the sdp_kernel context manager must be used, use the memory efficient or math kernel if on Windows.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    torch.nn.functional.scaled_dot_product_attention(q,k,v)
# Don't force flash attention to be used if using sdp_kernel on Windows
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True):
    torch.nn.functional.scaled_dot_product_attention(q,k,v)
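One way to avoid hard-coding these flags is to derive them from the platform. The helper below is a hypothetical sketch (sdp_kernel_flags is not a PyTorch API); it only builds keyword arguments for the sdp_kernel context manager and leaves flash attention disabled on Windows.

```python
import sys


def sdp_kernel_flags(platform: str = sys.platform) -> dict:
    """Build keyword arguments for torch.backends.cuda.sdp_kernel.

    Hypothetical helper, not part of PyTorch: flash attention is only
    requested when the platform is not Windows.
    """
    on_windows = platform.startswith("win")
    return {
        "enable_flash": not on_windows,
        "enable_math": True,
        "enable_mem_efficient": True,
    }


# Intended usage (needs a CUDA build of PyTorch, so it is left commented out):
# with torch.backends.cuda.sdp_kernel(**sdp_kernel_flags()):
#     torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(sdp_kernel_flags("win32"))
print(sdp_kernel_flags("linux"))
```

Keeping the math and memory efficient kernels enabled in both cases lets PyTorch pick a working backend when flash attention is unavailable.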
Rewrote DTensor (Tensor Parallel) APIs to improve UX (#114732)
In PyTorch 2.1 and earlier, users could use ParallelStyles like PairwiseParallel and specify the input/output layout with functions like make_input_replicate_1d or make_output_replicate_1d, and _prepare_input and _prepare_output had default values. The UX of Tensor Parallel looked like this:
from torch.distributed.tensor.parallel.style import (
    ColwiseParallel,
    make_input_replicate_1d,
    make_input_reshard_replicate,
    make_input_shard_1d,
    make_input_shard_1d_last_dim,
    make_sharded_output_tensor,
    make_output_replicate_1d,
    make_output_reshard_tensor,
    make_output_shard_1d,
    make_output_tensor,
    PairwiseParallel,
    parallelize_module,
)
from torch.distributed.tensor import DeviceMesh
module = DummyModule()
device_mesh = DeviceMesh("cuda", list(range(self.world_size)))
parallelize_module(module, device_mesh, PairwiseParallel(_prepare_input=make_input_replicate_1d))
...
Starting from PyTorch 2.2, we simplified parallel styles to only contain ColwiseParallel and RowwiseParallel because other ParallelStyles can be composed from these two. We also deleted the input/output functions and started using input_layouts and output_layouts as kwargs instead to specify the sharding layout of both input and output tensors. Finally, we added the PrepareModuleInput and PrepareModuleOutput styles; these two styles have no default arguments for layouts, so users need to specify them explicitly and think about the sharding layouts.
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    PrepareModuleInput,
    RowwiseParallel,
    parallelize_module,
)
from torch.distributed._tensor import init_device_mesh, Replicate, Shard
module = SimpleMLPModule()
device_mesh = init_device_mesh("cuda", (self.world_size,))
parallelize_module(
    module,
    device_mesh,
    {
        "fqn": PrepareModuleInput(
            input_layouts=Shard(0),
            desired_input_layouts=Replicate()
        ),
        "fqn.net1": ColwiseParallel(),
        "fqn.net2": RowwiseParallel(output_layouts=Shard(0)),
    }
)
...
UntypedStorage.resize_ now uses the original device instead of the current device context (#113386)
Before this PR, UntypedStorage.resize_ would move data to the current CUDA device index (given by torch.cuda.current_device()).
Now, UntypedStorage.resize_() keeps the data on the same device index that it was on before, regardless of the current device index.
2.1
>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:0
2.2
>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:1
Wrapping a function with set_grad_enabled will consume its global mutation (#113359)
This bc-breaking change fixes some unexpected behavior when set_grad_enabled is used as a decorator.
2.1
>>> import torch
>>> @torch.set_grad_enabled(False)  # unexpectedly, this mutates the grad mode!
... def inner_func(x):
...     return x.sin()
>>> torch.is_grad_enabled()
False
2.2
>>> import torch
>>> @torch.set_grad_enabled(False)  # the global grad mode is restored after decoration
... def inner_func(x):
...     return x.sin()
>>> torch.is_grad_enabled()
True
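What the decorator does inside the function body is the same in both versions and can be checked directly. A minimal sketch, assuming a CPU build of PyTorch; the value printed by the last line depends on the installed version.

```python
import torch


@torch.set_grad_enabled(False)
def inner_func(x):
    # Gradient tracking is disabled while the function body runs.
    return x.sin()


x = torch.ones(3, requires_grad=True)
y = inner_func(x)

# The result is not attached to the autograd graph in either version.
print(y.requires_grad)

# Global grad mode after decoration; this is the value that changed in 2.2.
print(torch.is_grad_enabled())
```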
Deprecated verbose parameter in LRScheduler constructors (#111302)
As part of our decision to move towards a consolidated logging system, we are deprecating the verbose flag in LRScheduler.
If you would like to print the learning rate during execution, please use get_last_lr().
2.1
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min', verbose=True)
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
2.2
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
    print(f"Epoch {epoch} has concluded with lr of {scheduler.get_last_lr()}")
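The same idea works with other schedulers and with the standard logging module instead of print. The sketch below uses StepLR and a single dummy parameter, both chosen here only for illustration, so that it runs without a model or dataset.

```python
import logging

import torch
from torch.optim.lr_scheduler import StepLR

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("lr")

# A single dummy parameter stands in for model.parameters().
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1, momentum=0.9)
scheduler = StepLR(optimizer, step_size=1, gamma=0.5)

lrs = []
for epoch in range(3):
    param.grad = torch.ones_like(param)  # placeholder for a real backward pass
    optimizer.step()
    scheduler.step()
    # get_last_lr() returns one learning rate per parameter group.
    last_lr = scheduler.get_last_lr()[0]
    lrs.append(last_lr)
    logger.info("Epoch %d has concluded with lr of %s", epoch, last_lr)
```

Routing the message through a logger keeps learning-rate output under the same verbosity controls as the rest of the training script.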
PyTorch 2.1.2 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
- Fix crashes for float16 empty tensors (#115183)
- Fix MPS memory corruption when working with tensor slices (#114838)
- Fix crashes during Conv backward pass on MPS devices (#113398)
- Partially fix nn.Linear behavior on AArch64 platform (#110150)
- Fix cosine_similarity for tensors of different sizes (#109363)
- Package missing headers needed for extension development (#113055)
- Improve error handling of torch.set_num_threads (#113684)
- Fix profiling traces generation (#113763)
The Cherry pick tracker #113962 contains all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.1.1 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
- Remove spurious warning in comparison ops (#112170)
- Fix segfault in foreach_* operations when input list length does not match (#112349)
- Fix cuda driver API to load the appropriate .so file (#112996)
- Fix missing CUDA initialization when calling FFT operations (#110326)
- Ignore beartype==0.16.0 within the onnx package as it is incompatible (#111861)
- Fix the behavior of torch.new_zeros in onnx due to TorchScript behavior change (#111694)
- Remove unnecessary slow code in torch.distributed.checkpoint.optimizer.load_sharded_optimizer_state_dict (#111687)
- Add planner argument to torch.distributed.checkpoint.optimizer.load_sharded_optimizer_state_dict (#111393)
- Continue if param not exist in sharded load in torch.distributed.FSDP (#109116)
- Fix handling of non-contiguous bias_mask in torch.nn.functional.scaled_dot_product_attention (#112673)
- Fix the meta device implementation for nn.functional.scaled_dot_product_attention (#110893)
- Fix copy from mps to cpu device when storage_offset is non-zero (#109557)
- Fix segfault in torch.sparse.mm for non-contiguous inputs (#111742)
- Fix circular import between Dynamo and einops (#110575)
- Verify flatbuffer module fields are initialized for mobile deserialization (#109794)
The release tracker #110961 contains all relevant pull requests related to this release as well as links to related issues.
PyTorch 2.1: automatic dynamic shape compilation, distributed checkpointing
PyTorch 2.1 Release Notes
- Highlights
- Backwards Incompatible Change
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
- Security
Highlights
We are excited to announce the release of PyTorch® 2.1! PyTorch 2.1 offers automatic dynamic shape support in torch.compile, torch.distributed.checkpoint for saving/loading distributed training jobs on multiple ranks in parallel, and torch.compile support for the NumPy API.
In addition, this release offers numerous performance improvements (e.g. CPU inductor improvements, AVX512 support, scaled-dot-product-attention support) as well as a prototype release of torch.export, a sound full-graph capture mechanism, and torch.export-based quantization.
Along with 2.1, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.
This release is composed of 6,682 commits and 784 contributors since 2.0. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.1. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
Summary:
- torch.compile now includes automatic support for detecting and minimizing recompilations due to tensor shape changes using automatic dynamic shapes.
- torch.distributed.checkpoint enables saving and loading models from multiple ranks in parallel, as well as resharding due to changes in cluster topology.
- torch.compile can now compile NumPy operations via translating them into PyTorch-equivalent operations.
- torch.compile now includes improved support for Python 3.11.
- New CPU performance features include inductor improvements (e.g. bfloat16 support and dynamic shapes), AVX512 kernel support, and scaled-dot-product-attention kernels.
- torch.export, a sound full-graph capture mechanism, is introduced as a prototype feature, as well as torch.export-based quantization.
- torch.sparse now includes prototype support for semi-structured (2:4) sparsity on NVIDIA® GPUs.
| Stable | Beta | Prototype | Performance Improvements |
|---|---|---|---|
| | Automatic Dynamic Shapes | torch.export() | AVX512 kernel support |
| | torch.distributed.checkpoint | torch.export-based Quantization | CPU optimizations for scaled-dot-product-attention (SDPA) |
| | torch.compile + NumPy | semi-structured (2:4) sparsity | CPU optimizations for bfloat16 |
| | torch.compile + Python 3.11 | cpp_wrapper for torchinductor | |
| | torch.compile + autograd.Function | | |
| | third-party device integration: PrivateUse1 | | |
*To see a full list of public 2.1, 2.0, and 1.13 feature submissions click here.
For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.
Backwards Incompatible Changes
Building PyTorch from source now requires C++ 17 (#100557)
The PyTorch codebase has migrated from the C++14 to the C++17 standard, so a C++17 compatible compiler is now required to compile PyTorch, to integrate with libtorch, or to implement a C++ PyTorch extension.
Disable torch.autograd.{backward, grad} for complex scalar output (#92753)
Gradients are not defined for functions that don't return real outputs; we now raise an error if you try to call backward on complex outputs. Previously, the complex component of the output was implicitly ignored. If you wish to preserve this behavior, you must now explicitly call .real on your complex outputs before calling .grad() or .backward().
Example
def fn(x):
    return (x * 0.5j).sum()
x = torch.ones(1, dtype=torch.double, requires_grad=True)
o = fn(x)
2.0.1
o.backward()
2.1
o.real.backward()
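Code that has to run on both sides of this change can fall back to the real part when backward rejects a complex output. The following is a sketch only; the coefficient 2.0 + 0.5j is an illustrative choice that makes the gradient of the real part non-zero.

```python
import torch


def fn(x):
    # Complex-valued scalar output; the coefficient is chosen for illustration.
    return (x * (2.0 + 0.5j)).sum()


x = torch.ones(1, dtype=torch.double, requires_grad=True)
o = fn(x)

try:
    o.backward()  # raises in 2.1 and later because `o` is complex
except RuntimeError as err:
    print(f"backward on complex output failed: {err}")
    # Differentiate the real part explicitly instead.
    fn(x).real.backward()

# Gradient of Re((2.0 + 0.5j) * x) with respect to x.
print(x.grad)
```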
Update non-reentrant checkpoint to allow nesting and support autograd.grad (#90105)
As a part of a larger refactor to torch.utils.checkpoint, we changed how activation checkpointing interacts with retain_graph=True. Previously, in 2.0.1, recomputed activations were kept alive if retain_graph=True. In PyTorch 2.1, the non-reentrant implementation clears recomputed tensors on backward immediately upon unpack, even if retain_graph=True. This has the following additional implications: (1) Accessing ctx.saved_tensors twice in the same backward will now raise an error. (2) Accessing _saved_tensors multiple times will silently recompute forward multiple times.
2.1
import torch
from torch.utils.checkpoint import checkpoint

class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out = x.exp()
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        out, = ctx.saved_tensors
        # Calling ctx.saved_tensors again will raise in 2.1
        out, = ctx.saved_tensors
        return grad_out * out

a = torch.tensor(1., requires_grad=True)

def fn(x):
    return Func.apply(x)

out = checkpoint(fn, a, use_reentrant=False)

def fn2(x):
    return x.exp()

out = checkpoint(fn2, a, use_reentrant=False)
out.grad_fn._saved_result
# Calling _saved_result will trigger another unpack, and lead to forward being
# recomputed again
out.grad_fn._saved_result
Only sync buffers when broadcast_buffers is True (#100729)
- In PyTorch 2.0.1 and previous releases, when users use DistributedDataParallel (DDP), all buffers were synced automatically even if users set flag broadcast_buffers to be False:
from torch.nn.parallel import DistributedDataParallel as DDP
module = torch.nn.Linear(4, 8)
module = DDP(module) # Buffer is synchronized across all devices.
module = DDP(module, broadcast_buffers=False) # Buffer is synchronized across all devices.
...
- Starting with PyTorch 2.1, if users specify the flag broadcast_buffers to be False, we don’t sync the buffer across devices:
from torch.nn.parallel import DistributedDataParallel as DDP
module = torch.nn.Linear(4, 8)
module = DDP(module) # Buffer is synchronized across all devices.
module = DDP(module, broadcast_buffers=False) # Buffer is NOT synchronized across all devices
...
Remove store barrier after PG init (#99937)
- In PyTorch 2.0.1 and previous releases, after we initialize PG, we always call store based barrier:
from torch.distributed.distributed_c10d import init_process_group
init_process_group(...) # Will call _store_based_barrier in the end.
...
- Starting with PyTorch 2.1, after we initialize PG, the environment variable TORCH_DIST_INIT_BARRIER controls whether we call store based barrier or not:
from torch.distributed.distributed_c10d import init_process_group
import os
os.environ["TORCH_DIST_INIT_BARRIER"] = "1" # This is the default behavior
init_process_group(...) # Will call _store_based_barrier in the end.
os.environ["TORCH_DIST_INIT_BARRIER"] = "0"
init_process_group(...) # Will not call _store_based_barrier in the end.
...
Disallow non-bool masks in torch.masked_{select, scatter, fill} (#96112, #97999, #96594)
Finish the deprecation cycle for non-bool masks. Functions now require the dtype of the mask to be torch.bool.
>>> # 2.0.1
>>> inp = torch.rand(3)
>>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8)
>>> torch.masked_select(inp, mask)
UserWarning: masked_select received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/TensorAdvancedIndexing.cpp:1855.)
torch.masked_select(inp, mask)
>>> torch.masked_select(inp, mask.to(dtype=torch.bool))
# Works fine
>>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool)
>>> torch.masked_select(inp, correct_mask)
# Works fine
>>> # 2.1
>>> inp = torch.rand(3)
>>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8)
>>> torch.masked_select(inp, mask)
RuntimeError: masked_select: expected BoolTensor for mask
>>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool)
>>> torch.masked_select(inp, correct_mask)
# Works fine
>>> torch.masked_select(inp, mask.to(dtype=torch.bool))
# Works fine
Fix the result of torch.unique to make it consistent with NumPy when dim is specified (#101693)
The dim argument was clarified and its behavior aligned with NumPy's: it signifies which sub-tensors to consider when determining uniqueness. See the documentation for more details: https://pytorch.org/docs/stable/generated/torch.unique.html
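As a small illustration of what dim selects, the sketch below deduplicates whole rows of a 2-D tensor. The input values are made up for the example and do not come from the PR.

```python
import torch

t = torch.tensor([[1, 2],
                  [1, 2],
                  [3, 4]])

# dim=0 compares whole rows, so the duplicated row [1, 2] is kept once.
rows, inverse = torch.unique(t, dim=0, return_inverse=True)
print(rows)
print(inverse)  # index of each original row in `rows`

# Without dim, the tensor is flattened and individual elements are compared.
flat = torch.unique(t)
print(flat)
```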
Make the Index Rounding Mode Consistent Between the 2D and 3D GridSample Nearest Neighbor Interpolations (#97000)
Prior to this change, for torch.nn.functional.grid_sample(mode='nearest') the forward 2D kernel used std::nearbyint whereas the forward 3D kernel used std::round in order to determine the nearest pixel locations after un-normalization of the grid. Additionally, the backward kernels for both ...
PyTorch 2.0.1 Release, bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
- Fix _canonical_mask throws warning when bool masks passed as input to TransformerEncoder/TransformerDecoder (#96009, #96286)
- Fix Embedding bag max_norm=-1 causes leaf Variable that requires grad is being used in an in-place operation #95980
- Fix type hint for torch.Tensor.grad_fn, which can be a torch.autograd.graph.Node or None. #96804
- Can’t convert float to int when the input is a scalar np.ndarray. #97696
- Revisit torch._six.string_classes removal #97863
- Fix module backward pre-hooks to actually update gradient #97983
- Fix load_sharded_optimizer_state_dict error on multi node #98063
- Warn once for TypedStorage deprecation #98777
- cuDNN V8 API, Fix incorrect use of emplace in the benchmark cache #97838
Torch.compile:
- Add support for Modules with custom getitem method to torch.compile #97932
- Fix improper guards with on list variables. #97862
- Fix Sequential nn module with duplicated submodule #98880
Distributed:
- Fix distributed_c10d's handling of custom backends #95072
- Fix MPI backend not properly initialized #98545
NN_frontend:
- Update Multi-Head Attention's doc string #97046
- Fix incorrect behavior of is_causal parameter for torch.nn.TransformerEncoderLayer.forward #97214
- Fix error for SDPA on sm86 and sm89 hardware #99105
- Fix nn.MultiheadAttention mask handling #98375
DataLoader:
- Fix regression for pin_memory recursion when operating on bytes #97737
- Fix collation logic #97789
- Fix potentially backwards incompatible change with DataLoader and is_shardable Datapipes #97287
MPS:
- Fix LayerNorm crash when input is in float16 #96208
- Add support for cumsum on int64 input #96733
- Fix issue with setting BatchNorm to non-trainable #98794
Functorch:
- Fix Segmentation Fault for vmaped function accessing BatchedTensor.data #97237
- Fix index_select support when dim is negative #97916
- Improve docs for autograd.Function support #98020
- Fix Exception thrown when running Migration guide example for jacrev #97746
Releng:
- Fix Convolutions for CUDA-11.8 wheel builds #99451
- Fix Import torchaudio + torch.compile crashes on exit #96231
- Linux aarch64 wheels are missing the mkldnn+acl backend support - pytorch/builder@54931c2
- Linux aarch64 torchtext 0.15.1 wheels are missing for aarch64_linux platform - pytorch/builder#1375
- Enable ROCm 5.4.2 manywheel and python 3.11 builds #99552
- PyTorch cannot be installed at the same time as numpy in a conda env on osx-64 / Python 3.11 #97031
- Illegal instruction (core dumped) on Raspberry Pi 4.0 8gb - pytorch/builder#1370
Torch.optim:
- Fix fused AdamW causes NaN loss #95847
- Fix Fused AdamW has worse loss than Apex and unfused AdamW for fp16/AMP #98620
The release tracker should contain all relevant pull requests related to this release as well as links to related issues
PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever
PyTorch 2.0 Release notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
Highlights
We are excited to announce the release of PyTorch® 2.0 (release note) which we highlighted during the PyTorch Conference on 12/2/22! PyTorch 2.0 offers the same eager-mode development and user experience, while fundamentally changing and supercharging how PyTorch operates at compiler level under the hood with faster performance and support for Dynamic Shapes and Distributed.
This next-generation release includes a Stable version of Accelerated Transformers (formerly called Better Transformers); Beta includes torch.compile as the main API for PyTorch 2.0, the scaled_dot_product_attention function as part of torch.nn.functional, the MPS backend, functorch APIs in the torch.func module; and other Beta/Prototype improvements across various inferences, performance and training optimization features on GPUs and CPUs. For a comprehensive introduction and technical overview of torch.compile, please visit the 2.0 Get Started page.
Along with 2.0, we are also releasing a series of beta updates to the PyTorch domain libraries, including those that are in-tree, and separate libraries including TorchAudio, TorchVision, and TorchText. An update for TorchX is also being released as it moves to community supported mode. More details can be found in this library blog.
This release is composed of over 4,541 commits and 428 contributors since 1.13.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.0 and the overall 2-series this year.
Summary:
- torch.compile is the main API for PyTorch 2.0, which wraps your model and returns a compiled model. It is a fully additive (and optional) feature and hence 2.0 is 100% backward compatible by definition.
- As an underpinning technology of torch.compile, TorchInductor with Nvidia and AMD GPUs will rely on OpenAI Triton deep learning compiler to generate performant code and hide low level hardware details. OpenAI Triton-generated kernels achieve performance that's on par with hand-written kernels and specialized cuda libraries such as cublas.
- Accelerated Transformers introduce high-performance support for training and inference using a custom kernel architecture for scaled dot product attention (SDPA). The API is integrated with torch.compile() and model developers may also use the scaled dot product attention kernels directly by calling the new scaled_dot_product_attention() operator.
- Metal Performance Shaders (MPS) backend provides GPU accelerated PyTorch training on Mac platforms with added support for Top 60 most used ops, bringing coverage to over 300 operators.
- Amazon AWS optimized PyTorch CPU inference on AWS Graviton3 based C7g instances. PyTorch 2.0 improves inference performance on Graviton compared to the previous releases, including improvements for Resnet50 and Bert.
- New prototype features and technologies across TensorParallel, DTensor, 2D parallel, TorchDynamo, AOTAutograd, PrimTorch and TorchInductor.
| Stable | Beta | Prototype | Platform Changes |
|---|---|---|---|
| Accelerated PT 2 Transformers | torch.compile | DTensor | CUDA support for 11.7 & 11.8 (deprecating CUDA 11.6) |
| | PyTorch MPS Backend | TensorParallel | Python 3.8 (deprecating Python 3.7) |
| | Scaled dot product attention | 2D Parallel | AWS Graviton3 |
| | Functorch | Torch.compile (dynamic=True) | |
| | Dispatchable Collectives | | |
| | torch.set_default_device and torch.device as context manager | | |
| | X86 quantization backend | | |
| | GNN inference and training performance | | |
*To see a full list of public 2.0, 1.13 and 1.12 feature submissions click here
Backwards Incompatible Changes
Drop support for Python versions <= 3.7 (#93155)
Previously the minimum supported version of Python for PyTorch was 3.7. This PR updates the minimum version to require 3.8 in order to install PyTorch. See Hardware / Software Support for more information.
Drop support for CUDA 10 (#89582)
This PR updates the minimum CUDA version to 11.0. See the getting-started for installation or building from source for more information.
Gradients are now set to None instead of zeros by default in torch.optim.*.zero_grad() and torch.nn.Module.zero_grad() (#92731)
This changes the default behavior of zero_grad() to zero out the grads by setting them to None instead of zero tensors. In other words, the set_to_none kwarg is now True by default instead of False. Setting grads to None reduces peak memory usage and increases performance. This will break code that directly accesses data or does computation on the grads after calling zero_grad() as they will now be None. To revert to the old behavior, pass in zero_grad(set_to_none=False).
1.13
>>> import torch
>>> from torch import nn
>>> module = nn.Linear(2, 2)
>>> i = torch.randn(2, 2, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
False
>>> module.weight.grad.data
tensor([[0., 0.],
        [0., 0.]])
>>> module.weight.grad + 1.0
tensor([[1., 1.],
        [1., 1.]])
2.0
>>> import torch
>>> from torch import nn
>>> module = nn.Linear(5, 5)
>>> i = torch.randn(2, 5, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
True
>>> module.weight.grad.data
AttributeError: 'NoneType' object has no attribute 'data'
>>> module.weight.grad + 1.0
TypeError: unsupported operand type(s) for +:
'NoneType' and 'float'
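Code that inspects gradients after zero_grad() can guard against None, or opt back into zero-filled tensors. A minimal sketch, assuming PyTorch 2.0 or newer; the module shape is arbitrary.

```python
import torch
from torch import nn

module = nn.Linear(5, 5)
module(torch.randn(2, 5)).sum().backward()

module.zero_grad()  # 2.0 default: set_to_none=True, so grads become None

# Code that reads gradients after zero_grad() should skip missing ones.
grads = [p.grad for p in module.parameters() if p.grad is not None]
print(len(grads))

# Opting back into the 1.13 behavior keeps zero-filled tensors instead.
module(torch.randn(2, 5)).sum().backward()
module.zero_grad(set_to_none=False)
print(module.weight.grad.abs().sum())
```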
Update torch.Tensor and nn.Parameter to serialize all their attributes (#88913)
Any attribute stored on torch.Tensor and torch.nn.Parameter will now be serialized. This aligns the serialization behavior of torch.nn.Parameter, torch.Tensor and other tensor subclasses.
1.13 | 2.0 |
---|---|
# torch.Tensor behavior
>>> a = torch.Tensor()
>>> a.foo = 'hey'
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Tensor' object has no attribute 'foo'
# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Parameter' object has no attribute 'foo'
# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
... pass
>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>>print(b.foo)
hey |
# torch.Tensor behavior
a = torch.Tensor()
a.foo = 'hey'
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey
# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey
# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
... pass
>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey
>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(b.foo)
hey |
To keep an attribute out of the serialized data, do not store it on the tensor or Parameter. Instead, use torch.utils.weak.WeakTensorKeyDictionary:
>>> from torch.utils import weak
>>> foo_dict = weak.WeakTensorKeyDictionary()
>>> foo_dict[a] = 'hey'
>>> print(foo_dict[a])
hey
Algorithms {Adadelta, Adagrad, Adam, Adamax, AdamW, ASGD, NAdam, RAdam, RMSprop, Rprop, SGD} default to the faster foreach implementation when on CUDA + differentiable=False
When applicable, this changes the default behavior of step()
and anything that ca...
PyTorch 1.13.1 Release, small bug fix release
This release is meant to fix the following issues (regressions / silent correctness):
- RuntimeError by torch.nn.modules.activation.MultiheadAttention with bias=False and batch_first=True #88669
- Installation via pip on Amazon Linux 2, regression #88869
- Installation using poetry on Mac M1, failure #88049
- Missing masked tensor documentation #89734
- torch.jit.annotations.parse_type_line is not safe (command injection) #88868
- Use the Python frame safely in _pythonCallstack #88993
- Double-backward with full_backward_hook causes RuntimeError #88312
- Fix logical error in get_default_qat_qconfig #88876
- Fix cuda/cpu check on NoneType and unit test #88854 and #88970
- ONNX ATen Fallback for BUILD_CAFFE2=0 for ONNX-only ops #88504
- ONNX operator_export_type on the new registry #87735
- torchrun AttributeError caused by file_based_local_timer on Windows #85427
The release tracker should contain all relevant pull requests related to this release as well as links to related issues.
PyTorch 1.13: beta versions of functorch and improved support for Apple’s new M1 chips are now available
PyTorch 1.13 Release Notes
- Highlights
- Backwards Incompatible Changes
- New Features
- Improvements
- Performance
- Documentation
- Developers
Highlights
We are excited to announce the release of PyTorch 1.13! This release includes a stable version of BetterTransformer. We deprecated CUDA 10.2 and 11.3 and completed the migration to CUDA 11.6 and 11.7. Beta features include improved support for Apple M1 chips and functorch, a library that offers composable vmap (vectorization) and autodiff transforms and is now included in-tree with the PyTorch release. This release is composed of over 3,749 commits from 467 contributors since 1.12.1. We want to sincerely thank our dedicated community for your contributions.
Summary:
- The BetterTransformer feature set supports fastpath execution for common Transformer models during inference out of the box, without the need to modify the model. Additional improvements include accelerated add+matmul linear algebra kernels for sizes commonly used in Transformer models, and Nested Tensors are now enabled by default.
- Deprecating older CUDA versions in a timely manner allows us to introduce the latest CUDA versions as NVIDIA® releases them, and hence allows support for C++17 in PyTorch and the new NVIDIA Open GPU Kernel Modules.
- Previously, functorch was released out-of-tree in a separate package. After installing PyTorch, a user can now import functorch and use it without installing another package.
- PyTorch is offering native builds for Apple® silicon machines that use Apple's new M1 chip as a beta feature, providing improved support across PyTorch's APIs.
The release blog post gives an overview of the new features.
Backwards Incompatible Changes
Python API
uint8 and all integer dtype masks are no longer allowed in Transformer (#87106)
Prior to 1.13, key_padding_mask could be set to uint8 or other integer dtypes in TransformerEncoder and MultiheadAttention, which might generate unexpected results. In this release, these dtypes are no longer allowed for the mask. Please convert them to torch.bool before use.
1.12.1
# d_model must be divisible by nhead; batch_first=True so the (1, 4) mask matches the (1, 4, 2) input
>>> layer = nn.TransformerEncoderLayer(d_model=2, nhead=2, dim_feedforward=4, batch_first=True)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.uint8)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
# works before 1.13
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)
1.13
>>> layer = nn.TransformerEncoderLayer(d_model=2, nhead=2, dim_feedforward=4, batch_first=True)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.bool)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)
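Existing pipelines that still produce integer masks can convert them at the call site. The following is a minimal sketch, assuming that a nonzero entry in the legacy mask means "ignore this position", which matches the meaning of True in a boolean key_padding_mask; the helper name `as_bool_padding_mask` is hypothetical.

```python
import torch


def as_bool_padding_mask(mask: torch.Tensor) -> torch.Tensor:
    """Convert an integer padding mask to torch.bool (nonzero -> True).

    Hypothetical helper; bool and floating-point masks are returned unchanged.
    """
    if mask.dtype == torch.bool or mask.dtype.is_floating_point:
        return mask
    return mask != 0


legacy_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.uint8)
pad_mask = as_bool_padding_mask(legacy_mask)
```

The resulting `pad_mask` can be passed as `src_key_padding_mask` in place of the integer mask.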
Updated torch.floor_divide to perform floor division (#78411)
Prior to 1.13, torch.floor_divide erroneously performed truncation division (i.e. truncated the quotients). In this release, it has been fixed to perform floor division. To replicate the old behavior, use torch.div with rounding_mode='trunc'.
1.12.1
>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -1.])
1.13
>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -2.])
# Old behavior can be replicated using torch.div with rounding_mode='trunc'
>>> torch.div(a, b, rounding_mode='trunc')
tensor([ 2., -1.])
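Code that must give the same result on both sides of this change can state the rounding mode explicitly instead of relying on torch.floor_divide. A short sketch, using the same inputs as the example above:

```python
import torch

a = torch.tensor([4.0, -3.0])
b = torch.tensor([2.0, 2.0])

# Rounds toward negative infinity: -1.5 becomes -2.
floored = torch.div(a, b, rounding_mode="floor")

# Rounds toward zero: -1.5 becomes -1.
truncated = torch.div(a, b, rounding_mode="trunc")
```

The two modes only differ when the quotient is negative and not an integer.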
Fixed torch.index_select on CPU to raise an out-of-bounds index error when the source tensor is empty (#77881)
Prior to 1.13, torch.index_select would return an appropriately sized tensor filled with random values on CPU if the source tensor was empty. In this release, we have fixed this bug so that it errors out. As a consequence, torch.nn.Embedding, which uses index_select, will error out rather than return an empty tensor when embedding_dim=0 and input contains indices that are out of bounds. The old behavior cannot be reproduced with torch.nn.Embedding; however, since an Embedding layer with embedding_dim=0 is a corner case, this behavior is unlikely to be relied upon.
1.12.1
>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
tensor([], size=(1, 0), grad_fn=<EmbeddingBackward0>)
1.13
>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
RuntimeError: INDICES element is out of DATA bounds, id=4 axis_dim=3
Disallow overflows when tensors are constructed from scalars (#82329)
Prior to this PR, overflows during tensor construction from scalars would not throw an error. In 1.13, such cases will error.
1.12.1
>>> torch.tensor(1000, dtype=torch.int8)
tensor(-24, dtype=torch.int8)
1.13
>>> torch.tensor(1000, dtype=torch.int8)
RuntimeError: value cannot be converted to type int8 without overflow
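One way to avoid the error is to check the target dtype's range before constructing the tensor. The following is a minimal sketch using torch.iinfo; the helper name `fits_in_int_dtype` and the fallback to torch.int64 are assumptions made for illustration.

```python
import torch


def fits_in_int_dtype(value: int, dtype: torch.dtype) -> bool:
    """Return True if value is representable in the given integer dtype.

    Hypothetical helper for illustration only.
    """
    info = torch.iinfo(dtype)
    return info.min <= value <= info.max


value = 1000
# Fall back to a wider dtype when the value does not fit in int8.
dtype = torch.int8 if fits_in_int_dtype(value, torch.int8) else torch.int64
t = torch.tensor(value, dtype=dtype)
```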
Error on indexing a cpu tensor with non-cpu indices (#69607)
Prior to 1.13, cpu_tensor[cuda_indices] was a valid program that would return a cpu tensor. The original use case for mixed device indexing was non_cpu_tensor[cpu_indices], and allowing the opposite (cpu_tensor[non_cpu_indices]) was unintentional. This behavior appears to be rarely used, and a refactor of our indexing kernels made it difficult to represent an op that takes in (cpu_tensor, non_cpu_tensor) and returns another cpu_tensor, so it is now an error.
To replicate the old behavior for base[indices], ensure that either indices lives on the CPU device, or base and indices both live on the same device.
1.12.1
>>> a = torch.tensor([1.0, 2.0, 3.0])
>>> b = torch.tensor([0, 2], device='cuda')
>>> a[b]
tensor([1., 3.])
1.13
>>> a = torch.tensor([1.0, 2.0, 3.0])
>>> b = torch.tensor([0, 2], device='cuda')
>>> a[b]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
# Old behavior can be replicated by moving b to CPU, or a to CUDA
>>> a[b.cpu()]
tensor([1., 3.])
>>> a.cuda()[b]
tensor([1., 3.], device='cuda:0')
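When the devices of the two tensors are not known in advance, the indices can be moved to the device of the indexed tensor. The following is a minimal, device-agnostic sketch; the helper name `index_on_base_device` is hypothetical, and the example runs on CPU only so that it does not require a GPU.

```python
import torch


def index_on_base_device(base: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
    """Index base after moving indices to base's device.

    Hypothetical helper; .to() is a no-op when the devices already match.
    """
    return base[indices.to(base.device)]


a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([0, 2])
picked = index_on_base_device(a, b)
```

Moving the indices involves a device transfer when the devices differ, so in hot loops it may be preferable to create the indices on the right device in the first place.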
Remove deprecated torch.eig, torch.matrix_rank, torch.lstsq (#70982, #70981, #70980)
The deprecation cycle for the above functions has been completed and they have been removed in the 1.13 release.
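The notes above do not list replacements; the sketch below assumes the usual migration to the torch.linalg counterparts. Two differences are worth noting: torch.linalg.eig returns complex tensors, and torch.linalg.lstsq takes its arguments as (A, B), the reverse of the removed torch.lstsq(B, A).

```python
import torch

A = torch.tensor([[2.0, 0.0], [0.0, 3.0]])
B = torch.tensor([[2.0], [3.0]])

# Replacement for torch.eig: eigenvalues and eigenvectors are complex tensors.
eigenvalues, eigenvectors = torch.linalg.eig(A)

# Replacement for torch.matrix_rank.
rank = torch.linalg.matrix_rank(A)

# Replacement for torch.lstsq: solves A @ X = B in the least-squares sense.
solution = torch.linalg.lstsq(A, B).solution
```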
torch.nn
Enforce that the bias has the same dtype as input and weight for convolutions on CPU (#83686)
To align with the implementation on other devices, the CPU implementation for convolutions was updated to enforce that the dtype of the bias matches the dtype of the input and weight.
1.12.1
# input and weight are dtype torch.int64
# bias is torch.float32
>>> out = torch.nn.functional.conv2d(input, weight, bias, ...)
1.13
# input and weight are dtype torch.int64
# bias is torch.float32
>>> with assertRaisesError():
...     out = torch.nn.functional.conv2d(input, weight, bias, ...)
# Updated code to avoid the error
>>> out = torch.nn.functional.conv2d(input, weight, bias.to(input.dtype), ...)
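The snippets above elide the tensor definitions. A self-contained sketch of the same fix follows; the shapes and the float64/float32 dtypes are chosen here for illustration and are not taken from the notes.

```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 1, 4, 4, dtype=torch.float64)
weight = torch.randn(1, 1, 3, 3, dtype=torch.float64)
bias = torch.zeros(1, dtype=torch.float32)  # dtype differs from input and weight

# Cast the bias to the input dtype before calling the convolution.
out = F.conv2d(inp, weight, bias.to(inp.dtype))
```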
Autograd
Disallow setting the .data of a tensor that requires_grad=True with an integer tensor (#78436)
Setting the .data of a tensor that requires_grad with an integer tensor now raises an error.
1.12.1
>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
>>> x
tensor([0, 0], requires_grad=True)
1.13
>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: data set to a tensor that requires gradients must be floating point or complex dtype
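If the new values really do come from an integer tensor, casting them to the dtype of the target tensor avoids the error. A minimal sketch, with `new_values` as a hypothetical name:

```python
import torch

x = torch.randn(2, requires_grad=True)
new_values = torch.randint(0, 10, (2,))  # integer dtype (torch.int64)

# Cast to the floating-point dtype of x before assigning to .data.
x.data = new_values.to(x.dtype)
```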
Added variable_list support to ExtractVariables struct (#84583)
Prior to this change, C++ custom autograd Functions did not treat tensors passed in a TensorList as tensors for the purposes of recording the backward graph. After this change, custom Functions that receive a TensorList must modify their backward functions to also compute gradients for these additional tensor inputs. Note that this behavior now differs from that of custom autograd Functions in Python.
1.12.1
struct MyFunction : public Function<MyFunction> {
  static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
    return 2 * tensors[0] + 3 * t;
  }
  static variable_list backward(
      AutogradContext* ctx,
      variable_list grad_output) {
    return {3 * grad_output[0]};
  }
};
1.13
struct MyFunction : public Function<MyFunction> {
  static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
    return 2 * tensors[0] + 3 * t;
  }
  static variable_list backward(
      AutogradContext* ctx,
      variable_list grad_output) {
    return {3 * grad_output[0], 2 * grad_output[0]};
  }
};
Don't detach when making views; force kernel to detach (#84893)
View operations registered as CompositeExplicitAutograd kernels are no longer allowed to return input tensors as-is. You must explic...