24 Apr 16:12

97ff6cf

PyTorch 2.3: User-Defined Triton Kernels in torch.compile, Tensor Parallelism in Distributed

PyTorch 2.3 Release notes

Highlights
Backwards Incompatible Changes
Deprecations
New Features
Improvements
Bug fixes
Performance
Documentation

Highlights

We are excited to announce the release of PyTorch® 2.3! PyTorch 2.3 offers support for user-defined Triton kernels in torch.compile, allowing users to migrate their own Triton kernels from eager mode without performance regressions or graph breaks. In addition, Tensor Parallelism improves the experience of training Large Language Models using native PyTorch functions, and it has been validated on training runs for 100B-parameter models.

This release is composed of 3393 commits and 426 contributors since PyTorch 2.2. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.3. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Stable	Beta	Prototype	Performance Improvements
	User-defined Triton kernels in torch.compile	torch.export adds new API to specify dynamic_shapes	Weight-Only-Quantization introduced into Inductor CPU backend
	Tensor parallelism within PyTorch Distributed	Asynchronous checkpoint generation
	Support for semi-structured sparsity

*To see a full list of public feature submissions click here.

Tracked Regressions

torch.compile on MacOS is considered unstable for 2.3 as there are known cases where it will hang (#124497)

torch.compile imports many unrelated packages when it is invoked (#123954)

This can cause significant first-time slowdown and instability when these packages are not fully compatible with PyTorch within a single process.

torch.compile is not supported on Python 3.12 (#120233)

PyTorch support for Python 3.12 in general is considered experimental. Please use a Python version between 3.8 and 3.11 instead. This issue has existed since PyTorch 2.2.

Backwards Incompatible Changes

Change default __torch_function__ behavior to be disabled when __torch_dispatch__ is defined (#120632)

Defining a subclass with a __torch_dispatch__ entry will now automatically set __torch_function__ to be disabled. This aligns better with all the use cases we’ve observed for subclasses. The main change of behavior is that the result of the __torch_dispatch__ handler will no longer go through the default __torch_function__ handler, which wrapped it into the current subclass. In particular, this allows your subclass to return a plain Tensor or another subclass from any op.

The original behavior can be recovered by adding the following to your Tensor subclass:

@classmethod
def __torch_function__(cls, func, types, args=(), kwargs=None):
    return super().__torch_function__(func, types, args, kwargs)

ProcessGroupNCCL removes multi-device-per-thread support from C++ level (#119099, #118674)

Python level support was removed in 2.2.
To simplify ProcessGroupNCCL’s code, we removed support for multiple CUDA devices per thread. To our knowledge, this is not an active use case, but it adds a large burden to our codebase. If you rely on it, there is no workaround other than rewriting your PyTorch program to use one device per process or one device per thread (multiple threads per process are still supported).

Removes `no_dist` and `coordinator_rank` from public DCP APIs (#121317)

As part of an overall effort to simplify our public-facing APIs for Distributed Checkpointing, we've decided to deprecate usage of the coordinator_rank and no_dist parameters under torch.distributed.checkpoint. In our opinion, these parameters can lead to confusion about the intended effect during API usage, and have limited value to begin with. One concrete example is #118337, where it is ambiguous which process group the coordinator rank refers to. In the case of the no_dist parameter, we consider this an implementation detail which should be hidden from the user. Starting in this release, no_dist is inferred from the initialized state of the process group: if a process group is initialized, we assume the intention is to use collectives, and if it is not, we assume the opposite.

2.2

# Version 2.2.2
import torch.distributed.checkpoint as dcp

dcp.save(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
    no_dist=True,
    coordinator_rank=0,
)
# ...
dcp.load(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
    no_dist=True,
    coordinator_rank=0,
)

2.3

# Version 2.3
# no_dist is inferred from the process group state, and rank 0 is always the coordinator.
import torch.distributed.checkpoint as dcp

dcp.save(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
)
# ...
dcp.load(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
)

Remove deprecated tp_mesh_dim arg (#121432)

Starting from PyTorch 2.3, the parallelize_module API only accepts a DeviceMesh (the tp_mesh_dim argument has been removed). If you have an N-D DeviceMesh for multi-dimensional parallelism, you can use mesh_nd["tp"] to obtain a 1-D DeviceMesh for tensor parallelism.

torch.export

Users must pass in an nn.Module to torch.export.export. The reason is that several invariants of the ExportedProgram are ambiguous if the top-level object being traced is a function, such as how we guarantee that every call_function node has an nn_module_stack populated, and how we offer ways to access the state_dict/parameters/buffers of the exported program. We'd like torch.export to offer strong invariants: the value proposition of export is that you can trade flexibility for stronger guarantees about your model. (#117528)
Removed constraints in favor of dynamic_shapes (#117573, #117917, #117916, #120981, #120979)
ExportedProgram is no longer a callable. Instead users will need to use .module() to call the ExportedProgram. This is to prevent users from treating ExportedPrograms as torch.nn.Modules as we do not plan to support all features that torch.nn.Modules have, like hooks. Instead users can create a proper torch.nn.Module through exported_program.module() and use that as a callable. (#120019, #118425, #119105)
Remove equality_constraints from ExportedProgram as it is not used or useful anymore. Dimensions with equal constraints will now have the same symbol. (#116979)
Remove torch._export.export in favor of torch.export.export (#119095)
Remove CallSpec (#117671)

Enable fold_quantize by default in PT2 Export Quantization (#118701, #118605, #119425, #117797)

Previously, the PT2 Export Quantization flow did not generate quantized weights by default; the quantized model instead kept fp32 weights in the pattern fp32 weight -> q -> dq -> linear. With fold_quantize=True now the default, convert_pt2e produces a graph with quantized weights in the pattern int8 weight -> dq -> linear, and users will see a reduction in model size.

2.2

folded_model = convert_pt2e(model, fold_quantize=True)
non_folded_model = convert_pt2e(model)

2.3

folded_model = convert_pt2e(model)
non_folded_model = convert_pt2e(model, fold_quantize=False)

Remove deprecated torch.jit.quantized APIs (#118406)

All functions and classes under torch.jit.quantized will now raise an error if called/instantiated. This API has long been deprecated in favor of torch.ao.nn.quantized.

2.2

# torch.jit.quantized APIs

torch.jit.quantized.quantize_rnn_cell_modules

torch.jit.quantized.quantize_rnn_modules
torch.jit.quantized.quantize_linear_modules

torch.jit.quantized.QuantizedLinear
torch.jit.quantized.QuantizedLinearFP16

torch.jit.quantized.QuantizedGRU
torch.jit.quantized.QuantizedGRUCell
torch.jit.quantized.QuantizedLSTM
torch.jit.quantized.QuantizedLSTMCell

2.3

# Corresponding torch.ao.quantization APIs

torch.ao.nn.quantized.dynamic.RNNCell

torch.ao.quantization.quantize_dynamic APIs

torch.ao.nn.quantized.dynamic.Linear

torch.ao.nn.quantized.dynamic.GRU
torch.ao.nn.quantized.dynamic.GRUCell
torch.ao.nn.quantized.dynamic.LSTM

...
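
For code that previously relied on the torch.jit.quantized helpers for linear layers, a minimal migration sketch using torch.ao.quantization.quantize_dynamic might look like the following. The two-layer model is a hypothetical example, and the snippet assumes a build with a quantized CPU engine available.

```python
import torch
from torch.ao.quantization import quantize_dynamic

# Hypothetical float model used only for illustration.
model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU())

# Replace nn.Linear layers with dynamically quantized int8 equivalents.
# The original model is left untouched because inplace defaults to False.
qmodel = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

out = qmodel(torch.randn(2, 8))
print(type(qmodel[0]).__name__, out.shape)
```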


27 Mar 22:27

atalman

v2.2.2

39901f2

PyTorch 2.2.2 Release, bug fix release

This release is meant to fix the following issues (regressions / silent correctness):

Properly raise an error when trying to use inductor backend on non-supported platforms such as Windows (#115969)
Fix mkldnn performance issue on Windows platform (#121618)
Fix RuntimeError: cannot create std::vector larger than max_size() in torch.nn.functional.conv1d on non-contiguous cpu inputs by patching OneDNN (pytorch/builder#1742) (pytorch/builder#1744)
Add support for torch.distributed.fsdp.StateDictType.FULL_STATE_DICT for when using torch.distributed.fsdp.FullyShardedDataParallel with the device_mesh argument (#120837)
Fix make triton command on release branch for users building the release branch from source (#121169)
Ensure gcc>=9.0 for build from source and cpp_extensions (#120126)
Fix cxx11-abi build in release branch (pytorch/builder#1709)
Fix building from source on Windows with MSVC 14.38 - VS 2022 (#122120)

Release tracker #120999 contains all relevant pull requests related to this release as well as links to related issues.


22 Feb 21:15

atalman

v2.2.1

6c8c5ad

PyTorch 2.2.1 Release, bug fix release

This release is meant to fix the following issues (regressions / silent correctness):

Fix missing OpenMP support on Apple Silicon binaries (pytorch/builder#1697)
Fix crash when mixing lazy and non-lazy tensors in one operation (#117653)
Fix PyTorch performance regression on Linux aarch64 (pytorch/builder#1696)
Fix silent correctness in DTensor _to_copy operation (#116426)
Fix properly assigning param.grad_fn for next forward (#116792)
Ensure gradient clear out pending AsyncCollectiveTensor in FSDP Extension (#116122)
Fix processing unflatten tensor on compute stream in FSDP Extension (#116559)
Fix FSDP AssertionError on tensor subclass when setting sync_module_states=True (#117336)
Fix DCP state_dict cannot correctly find FQN when the leaf module is wrapped by FSDP (#115592)
Fix OOM when returning an AsyncCollectiveTensor by forcing _gather_state_dict() to be synchronous with respect to the main stream (#118197) (#119716)
Fix Windows runtime torch.distributed.DistNetworkError: [WinError 32] The process cannot access the file because it is being used by another process (#118860)
Update supported python versions in package description (#119743)
Fix SIGILL crash during import torch on CPUs that do not support SSE4.1 (#116623)
Fix DCP RuntimeError in get_state_dict and set_state_dict (#119573)
Fixes for HSDP + TP integration with device_mesh (#112435) (#118620) (#119064) (#118638) (#119481)
Fix numerical error with mixedmm on NVIDIA V100 (#118591)
Fix RuntimeError when using SymInt input invariant when splitting graphs (#117406)
Fix compile DTensor.from_local in trace_rule_look up (#119659)
Improve torch.compile integration with CUDA-11.8 binaries (#119750)

Release tracker #119295 contains all relevant pull requests related to this release as well as links to related issues.


30 Jan 17:58

jcaip

v2.2.0

8ac9b20

PyTorch 2.2: FlashAttention-v2, AOTInductor

PyTorch 2.2 Release Notes

Highlights
Backwards Incompatible Changes
Deprecations
New Features
Improvements
Bug fixes
Performance
Documentation

Highlights

We are excited to announce the release of PyTorch® 2.2! PyTorch 2.2 offers ~2x performance improvements to scaled_dot_product_attention via FlashAttention-v2 integration, as well as AOTInductor, a new ahead-of-time compilation and deployment tool built for non-python server-side deployments.

This release also includes improved torch.compile support for Optimizers, a number of new inductor optimizations, and a new logging mechanism called TORCH_LOGS.

Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.

Along with 2.2, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 3,628 commits and 521 contributors since PyTorch 2.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.2. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Summary:

scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions.
PyTorch 2.2 introduces a new ahead-of-time extension of TorchInductor called AOTInductor, designed to compile and deploy PyTorch programs for non-Python server-side deployments.
torch.distributed supports a new abstraction for initializing and representing ProcessGroups called device_mesh.
PyTorch 2.2 ships a standardized, configurable logging mechanism called TORCH_LOGS.
A number of torch.compile improvements are included in PyTorch 2.2, including improved support for compiling Optimizers and improved TorchInductor fusion and layout optimizations.
Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.
torch.ao.quantization now offers a prototype torch.export based flow

Stable	Beta	Prototype	Performance Improvements
	FlashAttentionV2 backend for scaled dot product attention	PT 2 Quantization	Inductor optimizations
	AOTInductor	Scaled dot product attention support for jagged layout NestedTensors	aarch64-linux optimizations (AWS Graviton)
	TORCH_LOGS
	torch.distributed.device_mesh
	torch.compile + Optimizers

*To see a full list of public 2.2 - 1.12 feature submissions click here.

Tracked Regressions

Performance reduction when using NVLSTree algorithm in NCCL 2.19.3 (#117748)

We have noticed a performance regression introduced to all-reduce in NCCL 2.19.3. Please use version 2.19.1 instead.

Poor numeric stability of loss when training with FSDP + DTensor (#117471)

We observe the loss will flatline randomly while training with FSDP + DTensor in some instances.

Backwards Incompatible Changes

Building PyTorch from source now requires GCC 9.4 or newer (#112858)

GCC 9.4 is the oldest version fully compatible with C++17, which the PyTorch codebase has migrated to from C++14.

Updated flash attention kernel in `scaled_dot_product_attention` to use Flash Attention v2 (#105602)

Previously, the v1 Flash Attention kernel had a Windows implementation, so a user on Windows who explicitly forced the flash attention kernel by using the sdp_kernel context manager with only flash attention enabled would get a working call. In 2.2, if you must use the sdp_kernel context manager on Windows, enable the memory-efficient or math kernel instead.

# Forces the flash attention kernel; avoid this on Windows in 2.2
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    torch.nn.functional.scaled_dot_product_attention(q, k, v)

# Don't force flash attention to be used if using sdp_kernel on Windows
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True):
    torch.nn.functional.scaled_dot_product_attention(q, k, v)
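
Whichever backend ends up being selected, the output is expected to agree with the plain math formulation up to floating-point tolerance. The sketch below checks this on CPU with the default backend selection; the tensor shapes and the tolerance are assumptions chosen for the example.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Assumed layout: (batch, heads, seq_len, head_dim)
q = torch.randn(2, 4, 8, 16)
k = torch.randn(2, 4, 8, 16)
v = torch.randn(2, 4, 8, 16)

out = F.scaled_dot_product_attention(q, k, v)

# Reference math formulation: softmax(Q K^T / sqrt(d)) V
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
reference = torch.softmax(scores, dim=-1) @ v

max_err = (out - reference).abs().max().item()
print(f"max abs difference vs. reference: {max_err:.2e}")
```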

Rewrote DTensor (Tensor Parallel) APIs to improve UX (#114732)

In PyTorch 2.1 and earlier, users could use ParallelStyles like PairwiseParallel and specify the input/output layout with functions like make_input_replicate_1d or make_output_replicate_1d, and _prepare_input and _prepare_output had default values. The UX of Tensor Parallel was like:

from torch.distributed.tensor.parallel.style import (
    ColwiseParallel,
    make_input_replicate_1d,
    make_input_reshard_replicate,
    make_input_shard_1d,
    make_input_shard_1d_last_dim,
    make_sharded_output_tensor,
    make_output_replicate_1d,
    make_output_reshard_tensor,
    make_output_shard_1d,
    make_output_tensor,
    PairwiseParallel,
    parallelize_module,
)
from torch.distributed.tensor import DeviceMesh

module = DummyModule()
device_mesh = DeviceMesh("cuda", list(range(self.world_size)))
parallelize_module(module, device_mesh, PairwiseParallel(_prepare_input=make_input_replicate_1d))
...

Starting from PyTorch 2.2, we simplified the parallel styles to only contain ColwiseParallel and RowwiseParallel, because the other ParallelStyles can be composed from these two. We also deleted the input/output functions and now use input_layouts and output_layouts as kwargs to specify the sharding layout of the input/output tensors. Finally, we added the PrepareModuleInput/PrepareModuleOutput styles. These two styles have no default layout arguments, so users need to specify them explicitly and think about the sharding layouts.

from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    PrepareModuleInput,
    RowwiseParallel,
    parallelize_module,
)
from torch.distributed._tensor import Replicate, Shard, init_device_mesh

module = SimpleMLPModule()
device_mesh = init_device_mesh("cuda", (self.world_size,))
parallelize_module(
   module,
   device_mesh,
   {
      "fqn": PrepareModuleInput(
                input_layouts=Shard(0),
                desired_input_layouts=Replicate()
             ),
      "fqn.net1": ColwiseParallel(),
      "fqn.net2": RowwiseParallel(output_layouts=Shard(0)),
   }
)
...

`UntypedStorage.resize_` now uses the original device instead of the current device context (#113386)

Before this PR, UntypedStorage.resize_ would move data to the current CUDA device index (given by torch.cuda.current_device()).
Now, UntypedStorage.resize_() keeps the data on the same device index that it was on before, regardless of the current device index.

2.1

>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:0

2.2

>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:1

Wrapping a function with set_grad_enabled will consume its global mutation (#113359)

This bc-breaking change fixes some unexpected behavior when set_grad_enabled is used as a decorator.

2.1

>>> import torch
>>> @torch.set_grad_enabled(False)  # unexpectedly, this mutates the grad mode!
    def inner_func(x):
        return x.sin()

>>> torch.is_grad_enabled()
False

2.2

>>> import torch
>>> @torch.set_grad_enabled(False)  # the decorator no longer mutates the global grad mode
    def inner_func(x):
        return x.sin()

>>> torch.is_grad_enabled()
True

Deprecated `verbose` parameter in `LRScheduler` constructors (#111302)

As part of our decision to move towards a consolidated logging system, we are deprecating the verbose flag in LRScheduler.

If you would like to print the learning rate during execution, please use get_last_lr().

2.1

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min', verbose=True)
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)

2.2

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
    print(f"Epoch {epoch} has concluded with lr of {scheduler.get_last_lr()}")


15 Dec 01:59

atalman

v2.1.2

a8e7c98

PyTorch 2.1.2 Release, bug fix release

This release is meant to fix the following issues (regressions / silent correctness):

Fix crashes for float16 empty tensors (#115183)
Fix MPS memory corruption when working with tensor slices (#114838)
Fix crashes during Conv backward pass on MPS devices (#113398)
Partially fix nn.Linear behavior on AArch64 platform (#110150)
Fix cosine_similarity for tensors of different sizes (#109363)
Package missing headers needed for extension development (#113055)
Improve error handling of torch.set_num_threads (#113684)
Fix profiling traces generation (#113763)

The Cherry pick tracker #113962 contains all relevant pull requests related to this release as well as links to related issues.


15 Nov 22:59

jerryzh168

v2.1.1

4c55dc5

PyTorch 2.1.1 Release, bug fix release

This release is meant to fix the following issues (regressions / silent correctness):

Remove spurious warning in comparison ops (#112170)
Fix segfault in foreach_* operations when input list length does not match (#112349)
Fix cuda driver API to load the appropriate .so file (#112996)
Fix missing CUDA initialization when calling FFT operations (#110326)
Ignore beartype==0.16.0 within the onnx package as it is incompatible (#111861)
Fix the behavior of torch.new_zeros in onnx due to TorchScript behavior change (#111694)
Remove unnecessary slow code in torch.distributed.checkpoint.optimizer.load_sharded_optimizer_state_dict (#111687)
Add planner argument to torch.distributed.checkpoint.optimizer.load_sharded_optimizer_state_dict (#111393)
Continue if param not exist in sharded load in torch.distributed.FSDP (#109116)
Fix handling of non-contiguous bias_mask in torch.nn.functional.scaled_dot_product_attention (#112673)
Fix the meta device implementation for nn.functional.scaled_dot_product_attention (#110893)
Fix copy from mps to cpu device when storage_offset is non-zero (#109557)
Fix segfault in torch.sparse.mm for non-contiguous inputs (#111742)
Fix circular import between Dynamo and einops (#110575)
Verify flatbuffer module fields are initialized for mobile deserialization (#109794)

The release tracker #110961 contains all relevant pull requests related to this release as well as links to related issues.


04 Oct 17:32

jerryzh168

v2.1.0

7bcf7da

PyTorch 2.1: automatic dynamic shape compilation, distributed checkpointing

PyTorch 2.1 Release Notes

Highlights
Backwards Incompatible Change
Deprecations
New Features
Improvements
Bug fixes
Performance
Documentation
Developers
Security

Highlights

We are excited to announce the release of PyTorch® 2.1! PyTorch 2.1 offers automatic dynamic shape support in torch.compile, torch.distributed.checkpoint for saving/loading distributed training jobs on multiple ranks in parallel, and torch.compile support for the NumPy API.

In addition, this release offers numerous performance improvements (e.g. CPU inductor improvements, AVX512 support, scaled-dot-product-attention support) as well as a prototype release of torch.export, a sound full-graph capture mechanism, and torch.export-based quantization.

Along with 2.1, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 6,682 commits and 784 contributors since 2.0. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.1. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Summary:

torch.compile now includes automatic support for detecting and minimizing recompilations due to tensor shape changes using automatic dynamic shapes.
torch.distributed.checkpoint enables saving and loading models from multiple ranks in parallel, as well as resharding due to changes in cluster topology.
torch.compile can now compile NumPy operations via translating them into PyTorch-equivalent operations.
torch.compile now includes improved support for Python 3.11.
New CPU performance features include inductor improvements (e.g. bfloat16 support and dynamic shapes), AVX512 kernel support, and scaled-dot-product-attention kernels.
torch.export, a sound full-graph capture mechanism, is introduced as a prototype feature, as well as torch.export-based quantization.
torch.sparse now includes prototype support for semi-structured (2:4) sparsity on NVIDIA® GPUs.

Stable	Beta	Prototype	Performance Improvements
	Automatic Dynamic Shapes	torch.export()	AVX512 kernel support
	torch.distributed.checkpoint	torch.export-based Quantization	CPU optimizations for scaled-dot-product-attention (SDPA)
	torch.compile + NumPy	semi-structured (2:4) sparsity	CPU optimizations for bfloat16
	torch.compile + Python 3.11	cpp_wrapper for torchinductor
	torch.compile + autograd.Function
	third-party device integration: PrivateUse1

*To see a full list of public 2.1, 2.0, and 1.13 feature submissions click here.

For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.

Backwards Incompatible Changes

Building PyTorch from source now requires C++ 17 (#100557)

The PyTorch codebase has migrated from the C++14 to the C++17 standard, so a C++17 compatible compiler is now required to compile PyTorch, to integrate with libtorch, or to implement a C++ PyTorch extension.

Disable `torch.autograd.{backward, grad}` for complex scalar output (#92753)

Gradients are not defined for functions that don't return real outputs; we now raise an error if you try to call backward on complex outputs. Previously, the complex component of the output was implicitly ignored. If you wish to preserve this behavior, you must now explicitly call .real on your complex outputs before calling .grad() or .backward().

Example

def fn(x):
    return (x * 0.5j).sum()

x = torch.ones(1, dtype=torch.double, requires_grad=True)
o = fn(x)

2.0.1

o.backward()

2.1

o.real.backward()

Update non-reentrant checkpoint to allow nesting and support `autograd.grad` (#90105)

As part of a larger refactor of torch.utils.checkpoint, we changed the interaction between activation checkpointing and retain_graph=True. Previously, in 2.0.1, recomputed activations were kept alive if retain_graph=True. In PyTorch 2.1, the non-reentrant implementation clears recomputed tensors on backward immediately upon unpack, even if retain_graph=True. This has the following additional implications: (1) Accessing ctx.saved_tensors twice in the same backward will now raise an error. (2) Accessing _saved_tensors multiple times will silently recompute forward multiple times.

2.1

import torch
from torch.utils.checkpoint import checkpoint

class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out = x.exp()
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        out, = ctx.saved_tensors
        # Calling ctx.saved_tensors again will raise in 2.1
        out, = ctx.saved_tensors
        return grad_out * out

a = torch.tensor(1., requires_grad=True)

def fn(x):
    return Func.apply(x)

out = checkpoint(fn, a, use_reentrant=False)

def fn2(x):
    return x.exp()

out = checkpoint(fn2, a, use_reentrant=False)

out.grad_fn._saved_result
# Accessing _saved_result again triggers another unpack, which causes forward
# to be recomputed again
out.grad_fn._saved_result

Only sync buffers when `broadcast_buffers` is True (#100729)

In PyTorch 2.0.1 and previous releases, when users used DistributedDataParallel (DDP), all buffers were synced automatically even if the broadcast_buffers flag was set to False:

from torch.nn.parallel import DistributedDataParallel as DDP
module = torch.nn.Linear(4, 8)
module = DDP(module) # Buffer is synchronized across all devices.
module = DDP(module, broadcast_buffers=False) # Buffer is synchronized across all devices.
...

Starting with PyTorch 2.1, if users specify the flag broadcast_buffers to be False, we don’t sync the buffer across devices:

from torch.nn.parallel import DistributedDataParallel as DDP
module = torch.nn.Linear(4, 8)
module = DDP(module) # Buffer is synchronized across all devices.
module = DDP(module, broadcast_buffers=False) # Buffer is NOT synchronized across all devices
...

Remove store barrier after PG init (#99937)

In PyTorch 2.0.1 and previous releases, after we initialize PG, we always call store based barrier:

from torch.distributed.distributed_c10d import init_process_group
init_process_group(...) # Will call _store_based_barrier in the end.
...

Starting with PyTorch 2.1, after we initialize PG, the environment variable TORCH_DIST_INIT_BARRIER controls whether we call store based barrier or not:

from torch.distributed.distributed_c10d import init_process_group
import os
os.environ["TORCH_DIST_INIT_BARRIER"] = "1" # This is the default behavior
init_process_group(...) # Will call _store_based_barrier in the end.
os.environ["TORCH_DIST_INIT_BARRIER"] = "0"
init_process_group(...) # Will not call _store_based_barrier in the end.
...

Disallow non-bool masks in `torch.masked_{select, scatter, fill}` (#96112, #97999, #96594)

Finish the deprecation cycle for non-bool masks. Functions now require the dtype of the mask to be torch.bool.

>>> # 2.0.1
>>> inp = torch.rand(3)
>>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8)
>>> torch.masked_select(inp, mask)
UserWarning: masked_select received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/TensorAdvancedIndexing.cpp:1855.)
  torch.masked_select(inp, mask)

>>> torch.masked_select(inp, mask.to(dtype=torch.bool))
# Works fine

>>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool)
>>> torch.masked_select(inp, correct_mask)
# Works fine

>>> # 2.1
>>> inp = torch.rand(3)
>>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8)
>>> torch.masked_select(inp, mask)
RuntimeError: masked_select: expected BoolTensor for mask

>>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool)
>>> torch.masked_select(inp, correct_mask)
# Works fine

>>> torch.masked_select(inp, mask.to(dtype=torch.bool))
# Works fine
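
For code bases that still produce integer masks, one possible migration pattern is a small conversion helper applied before any of the three masked functions. The helper name as_bool_mask is hypothetical and not part of PyTorch.

```python
import torch


def as_bool_mask(mask):
    """Hypothetical helper: convert a legacy integer mask to torch.bool."""
    return mask if mask.dtype == torch.bool else mask.to(torch.bool)


inp = torch.arange(4.0)  # tensor([0., 1., 2., 3.])
legacy_mask = torch.tensor([1, 0, 1, 0], dtype=torch.uint8)
mask = as_bool_mask(legacy_mask)

selected = torch.masked_select(inp, mask)
filled = torch.masked_fill(inp, mask, -1.0)
scattered = torch.masked_scatter(inp, mask, torch.tensor([10.0, 20.0]))

print(selected, filled, scattered)
```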

Fix the result of `torch.unique` to make it consistent with NumPy when `dim` is specified (#101693)

The dim argument was clarified and its behavior aligned to match the one from NumPy to signify which sub-tensor to consider when considering uniqueness. See the documentation for more details, https://pytorch.org/docs/stable/generated/torch.unique.html
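
As a brief illustration of what dim selects, the sketch below compares flattened uniqueness with uniqueness over whole rows (dim=0) and whole columns (dim=1). The input tensors are hypothetical examples.

```python
import torch

rows = torch.tensor([[1, 2], [1, 2], [3, 4]])
cols = torch.tensor([[1, 1, 2], [3, 3, 4]])

# Without dim, the input is flattened and unique scalar values are returned.
unique_values = torch.unique(rows)

# With dim=0, each row is treated as one element; duplicate rows are dropped.
unique_rows = torch.unique(rows, dim=0)

# With dim=1, each column is treated as one element; duplicate columns are dropped.
unique_cols = torch.unique(cols, dim=1)

print(unique_values, unique_rows, unique_cols, sep="\n")
```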

Make the Index Rounding Mode Consistent Between the 2D and 3D GridSample Nearest Neighbor Interpolations (#97000)

Prior to this change, for torch.nn.functional.grid_sample(mode='nearest') the forward 2D kernel used std::nearbyint whereas the forward 3D kernel used std::round in order to determine the nearest pixel locations after un-normalization of the grid. Additionally, the backward kernels for both ...


08 May 19:55

drisspg

v2.0.1

e9ebda2

PyTorch 2.0.1 Release, bug fix release

This release is meant to fix the following issues (regressions / silent correctness):

Fix _canonical_mask throws warning when bool masks passed as input to TransformerEncoder/TransformerDecoder (#96009, #96286)
Fix Embedding bag max_norm=-1 causes leaf Variable that requires grad is being used in an in-place operation #95980
Fix type hint for torch.Tensor.grad_fn, which can be a torch.autograd.graph.Node or None. #96804
Can’t convert float to int when the input is a scalar np.ndarray. #97696
Revisit torch._six.string_classes removal #97863
Fix module backward pre-hooks to actually update gradient #97983
Fix load_sharded_optimizer_state_dict error on multi node #98063
Warn once for TypedStorage deprecation #98777
cuDNN V8 API, Fix incorrect use of emplace in the benchmark cache #97838

Torch.compile:

Add support for Modules with custom getitem method to torch.compile #97932
Fix improper guards on list variables. #97862
Fix Sequential nn module with duplicated submodule #98880

Distributed:

Fix distributed_c10d's handling of custom backends #95072
Fix MPI backend not properly initialized #98545

NN_frontend:

Update Multi-Head Attention's doc string #97046
Fix incorrect behavior of is_causal parameter for torch.nn.TransformerEncoderLayer.forward #97214
Fix error for SDPA on sm86 and sm89 hardware #99105
Fix nn.MultiheadAttention mask handling #98375

DataLoader:

Fix regression for pin_memory recursion when operating on bytes #97737
Fix collation logic #97789
Fix potentially backwards incompatible change with DataLoader and is_shardable Datapipes #97287

MPS:

Fix LayerNorm crash when input is in float16 #96208
Add support for cumsum on int64 input #96733
Fix issue with setting BatchNorm to non-trainable #98794

Functorch:

Fix Segmentation Fault for vmapped function accessing BatchedTensor.data #97237
Fix index_select support when dim is negative #97916
Improve docs for autograd.Function support #98020
Fix Exception thrown when running Migration guide example for jacrev #97746

Releng:

Fix Convolutions for CUDA-11.8 wheel builds #99451
Fix Import torchaudio + torch.compile crashes on exit #96231
Linux aarch64 wheels are missing the mkldnn+acl backend support - pytorch/builder@54931c2
Linux aarch64 torchtext 0.15.1 wheels are missing for aarch64_linux platform - pytorch/builder#1375
Enable ROCm 5.4.2 manywheel and python 3.11 builds #99552
PyTorch cannot be installed at the same time as numpy in a conda env on osx-64 / Python 3.11 #97031
Illegal instruction (core dumped) on Raspberry Pi 4.0 8gb - pytorch/builder#1370

Torch.optim:

Fix fused AdamW causes NaN loss #95847
Fix Fused AdamW has worse loss than Apex and unfused AdamW for fp16/AMP #98620

The release tracker should contain all relevant pull requests related to this release as well as links to related issues


15 Mar 19:38

drisspg

v2.0.0

c263bd4

PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever

PyTorch 2.0 Release notes

Highlights
Backwards Incompatible Changes
Deprecations
New Features
Improvements
Bug fixes
Performance
Documentation

Highlights

We are excited to announce the release of PyTorch® 2.0 (release note) which we highlighted during the PyTorch Conference on 12/2/22! PyTorch 2.0 offers the same eager-mode development and user experience, while fundamentally changing and supercharging how PyTorch operates at compiler level under the hood with faster performance and support for Dynamic Shapes and Distributed.

This next-generation release includes a Stable version of Accelerated Transformers (formerly called Better Transformers); Beta includes torch.compile as the main API for PyTorch 2.0, the scaled_dot_product_attention function as part of torch.nn.functional, the MPS backend, functorch APIs in the torch.func module; and other Beta/Prototype improvements across various inferences, performance and training optimization features on GPUs and CPUs. For a comprehensive introduction and technical overview of torch.compile, please visit the 2.0 Get Started page.

Along with 2.0, we are also releasing a series of beta updates to the PyTorch domain libraries, including those that are in-tree, and separate libraries including TorchAudio, TorchVision, and TorchText. An update for TorchX is also being released as it moves to community supported mode. More details can be found in this library blog.

This release is composed of over 4,541 commits and 428 contributors since 1.13.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.0 and the overall 2-series this year.

Summary:

torch.compile is the main API for PyTorch 2.0, which wraps your model and returns a compiled model. It is a fully additive (and optional) feature and hence 2.0 is 100% backward compatible by definition.
As an underpinning technology of torch.compile, TorchInductor with Nvidia and AMD GPUs relies on the OpenAI Triton deep learning compiler to generate performant code and hide low-level hardware details. OpenAI Triton-generated kernels achieve performance on par with hand-written kernels and specialized CUDA libraries such as cuBLAS.
Accelerated Transformers introduce high-performance support for training and inference using a custom kernel architecture for scaled dot product attention (SDPA). The API is integrated with torch.compile() and model developers may also use the scaled dot product attention kernels directly by calling the new scaled_dot_product_attention() operator.
Metal Performance Shaders (MPS) backend provides GPU-accelerated PyTorch training on Mac platforms, with added support for the top 60 most-used ops, bringing coverage to over 300 operators.
Amazon AWS optimized PyTorch CPU inference on AWS Graviton3-based C7g instances. PyTorch 2.0 improves inference performance on Graviton compared to previous releases, including improvements for ResNet50 and BERT.
New prototype features and technologies across TensorParallel, DTensor, 2D parallel, TorchDynamo, AOTAutograd, PrimTorch and TorchInductor.
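The scaled_dot_product_attention() operator mentioned in the summary can be checked against the textbook attention formula. The following is a minimal sketch on CPU, assuming PyTorch 2.0 or later; the tensor shapes, the seed, and the variable names are arbitrary choices for illustration.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Shapes are (batch, heads, sequence length, head dimension).
q = torch.randn(2, 4, 8, 16)
k = torch.randn(2, 4, 8, 16)
v = torch.randn(2, 4, 8, 16)

# Operator exposed in torch.nn.functional as of 2.0.
fused = F.scaled_dot_product_attention(q, k, v)

# Reference computation: softmax(Q K^T / sqrt(d)) V.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
reference = torch.softmax(scores, dim=-1) @ v

# Largest element-wise difference between the two results.
max_abs_diff = (fused - reference).abs().max().item()
```

The two results should agree up to floating-point error; which kernel actually runs depends on the device and inputs.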

Stable	Beta	Prototype	Platform Changes
Accelerated PT 2 Transformers	torch.compile	DTensor	CUDA support for 11.7 & 11.8 (deprecating CUDA 11.6)
	PyTorch MPS Backend	TensorParallel	Python 3.8 (deprecating Python 3.7)
	Scaled dot product attention	2D Parallel	AWS Graviton3
	Functorch	Torch.compile (dynamic=True)
	Dispatchable Collectives	Torch.compile (dynamic=True)
	torch.set_default_device and torch.device as context manager
	X86 quantization backend
	GNN inference and training performance

*To see a full list of public 2.0, 1.13 and 1.12 feature submissions click here

Backwards Incompatible Changes

Drop support for Python versions <= 3.7 (#93155)

Previously the minimum supported version of Python for PyTorch was 3.7. This PR updates the minimum version to require 3.8 in order to install PyTorch. See Hardware / Software Support for more information.

Drop support for CUDA 10 (#89582)

This PR updates the minimum CUDA version to 11.0. See the getting-started for installation or building from source for more information.

**Gradients are now set to `None` instead of zeros by default in `torch.optim.*.zero_grad()` and `torch.nn.Module.zero_grad()` (#92731)**

This changes the default behavior of zero_grad() to zero out the grads by setting them to None instead of zero tensors. In other words, the set_to_none kwarg is now True by default instead of False. Setting grads to None reduces peak memory usage and increases performance. This will break code that directly accesses data or does computation on the grads after calling zero_grad() as they will now be None. To revert to the old behavior, pass in zero_grad(set_to_none=False).

1.13

>>> import torch
>>> from torch import nn
>>> module = nn.Linear(2, 2)
>>> i = torch.randn(2, 2, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
False
>>> module.weight.grad.data
tensor([[0., 0.],
        [0., 0.]])
>>> module.weight.grad + 1.0
tensor([[1., 1.],
        [1., 1.]])

2.0

>>> import torch
>>> from torch import nn
>>> module = nn.Linear(5, 5)
>>> i = torch.randn(2, 5, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
True
>>> module.weight.grad.data
AttributeError: 'NoneType' object has no attribute 'data'
>>> module.weight.grad + 1.0
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'
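Code that reads gradients after zero_grad() needs to tolerate None, or opt back into zero tensors. The following is a minimal sketch, assuming PyTorch 2.0 or later; the variable names are illustrative.

```python
import torch
from torch import nn

module = nn.Linear(2, 2)
inputs = torch.randn(4, 2)

module(inputs).sum().backward()
module.zero_grad()  # 2.0 default: set_to_none=True
grad_is_none = module.weight.grad is None

# Guard against None before touching gradients.
grad_norms = [p.grad.norm().item() for p in module.parameters() if p.grad is not None]

# Opt back into the 1.13 behavior: gradients stay allocated and are zero-filled.
module(inputs).sum().backward()
module.zero_grad(set_to_none=False)
zeroed = module.weight.grad
```

Filtering on `p.grad is not None` works under both defaults, so it may be the less invasive fix for utilities such as gradient logging.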

Update `torch.tensor` and `nn.Parameter` to serialize all their attributes (#88913)

Any attribute stored on a torch.Tensor or torch.nn.Parameter will now be serialized. This aligns the serialization behavior of torch.nn.Parameter, torch.Tensor and other tensor subclasses.

1.13

>>> import io
>>> import torch
>>> from torch import nn

# torch.Tensor behavior
>>> a = torch.Tensor()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)

>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Tensor' object has no attribute 'foo'

# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Parameter' object has no attribute 'foo'

# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
...   pass

>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(b.foo)
hey

2.0

>>> import io
>>> import torch
>>> from torch import nn

# torch.Tensor behavior
>>> a = torch.Tensor()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey

# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey

# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
...   pass

>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(b.foo)
hey

If you have an attribute that you don't want serialized, do not store it as an attribute on the tensor or Parameter. Use torch.utils.weak.WeakTensorKeyDictionary instead:

>>> from torch.utils import weak
>>> foo_dict = weak.WeakTensorKeyDictionary()
>>> foo_dict[a] = 'hey'
>>> print(foo_dict[a])
hey

Algorithms `{Adadelta, Adagrad, Adam, Adamax, AdamW, ASGD, NAdam, RAdam, RMSProp, RProp, SGD}` default to faster `foreach` implementation when on CUDA + differentiable=`False`

When applicable, this changes the default behavior of step() and anything that ca...
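For reference, `foreach` is a keyword argument on these optimizers, so the implementation can also be chosen explicitly rather than left to the default. The following is a minimal sketch on CPU, assuming PyTorch 2.0 or later; the model, learning rate, and input are made up for illustration.

```python
import torch
from torch import nn

model = nn.Linear(3, 1)

# foreach=None (the default) lets the optimizer pick an implementation;
# foreach=False explicitly requests the single-tensor (for-loop) implementation.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, foreach=False)

before = model.weight.detach().clone()
loss = model(torch.ones(2, 3)).sum()
loss.backward()
optimizer.step()
after = model.weight.detach().clone()
```

With two all-ones inputs, each weight gradient is 2, so plain SGD with lr=0.1 moves every weight by 0.2 regardless of which implementation runs.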


16 Dec 00:17

atalman

v1.13.1

49444c3

PyTorch 1.13.1 Release, small bug fix release

This release is meant to fix the following issues (regressions / silent correctness):

RuntimeError by torch.nn.modules.activation.MultiheadAttention with bias=False and batch_first=True #88669
Installation via pip on Amazon Linux 2, regression #88869
Installation using poetry on Mac M1, failure #88049
Missing masked tensor documentation #89734
torch.jit.annotations.parse_type_line is not safe (command injection) #88868
Use the Python frame safely in _pythonCallstack #88993
Double-backward with full_backward_hook causes RuntimeError #88312
Fix logical error in get_default_qat_qconfig #88876
Fix cuda/cpu check on NoneType and unit test #88854 and #88970
Onnx ATen Fallback for BUILD_CAFFE2=0 for ONNX-only ops #88504
Onnx operator_export_type on the new registry #87735
torchrun AttributeError caused by file_based_local_timer on Windows #85427

The release tracker should contain all relevant pull requests related to this release as well as links to related issues

