
Releases: pytorch/pytorch

PyTorch 2.2.2 Release, bug fix release

27 Mar 22:27
39901f2

This release is meant to fix the following issues (regressions / silent correctness):

  • Properly raise an error when trying to use inductor backend on non-supported platforms such as Windows (#115969)
  • Fix mkldnn performance issue on Windows platform (#121618)
  • Fix RuntimeError: cannot create std::vector larger than max_size() in torch.nn.functional.conv1d on non-contiguous cpu inputs by patching OneDNN (pytorch/builder#1742) (pytorch/builder#1744)
  • Add support for torch.distributed.fsdp.StateDictType.FULL_STATE_DICT when using torch.distributed.fsdp.FullyShardedDataParallel with the device_mesh argument (#120837); see the sketch after this list
  • Fix make triton command on release branch for users building the release branch from source (#121169)
  • Ensure gcc>=9.0 for build from source and cpp_extensions (#120126)
  • Fix cxx11-abi build in release branch (pytorch/builder#1709)
  • Fix building from source on Windows with MSVC 14.38 (VS 2022) (#122120)
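
The following is a minimal sketch of combining StateDictType.FULL_STATE_DICT with an FSDP module constructed with a device_mesh; the process-group setup, mesh shape, and model are illustrative assumptions, not code taken from the release tracker.

import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

# Assumes a typical torchrun + NCCL launch (hypothetical setup, not from the release notes)
dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (dist.get_world_size(),))
model = FSDP(nn.Linear(16, 16).cuda(), device_mesh=mesh)

# Gather the full, unsharded state dict (offloaded to CPU, materialized on rank 0 only)
cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
    full_state_dict = model.state_dict()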

Release tracker #120999 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.2.1 Release, bug fix release

22 Feb 21:15
6c8c5ad

This release is meant to fix the following issues (regressions / silent correctness):

  • Fix missing OpenMP support on Apple Silicon binaries (pytorch/builder#1697)
  • Fix crash when mixing lazy and non-lazy tensors in one operation (#117653)
  • Fix PyTorch performance regression on Linux aarch64 (pytorch/builder#1696)
  • Fix silent correctness in DTensor _to_copy operation (#116426)
  • Fix properly assigning param.grad_fn for next forward (#116792)
  • Ensure gradient clear out pending AsyncCollectiveTensor in FSDP Extension (#116122)
  • Fix processing unflatten tensor on compute stream in FSDP Extension (#116559)
  • Fix FSDP AssertionError on tensor subclass when setting sync_module_states=True (#117336)
  • Fix DCP state_dict cannot correctly find FQN when the leaf module is wrapped by FSDP (#115592)
  • Fix OOM when returning an AsyncCollectiveTensor by forcing _gather_state_dict() to be synchronous with respect to the main stream (#118197) (#119716)
  • Fix Windows runtime torch.distributed.DistNetworkError: [WinError 32] The process cannot access the file because it is being used by another process (#118860)
  • Update supported python versions in package description (#119743)
  • Fix SIGILL crash during import torch on CPUs that do not support SSE4.1 (#116623)
  • Fix DCP RuntimeError in get_state_dict and set_state_dict (#119573)
  • Fixes for HSDP + TP integration with device_mesh (#112435) (#118620) (#119064) (#118638) (#119481)
  • Fix numerical error with mixedmm on NVIDIA V100 (#118591)
  • Fix RuntimeError when using SymInt input invariant when splitting graphs (#117406)
  • Fix compile DTensor.from_local in trace_rule_look up (#119659)
  • Improve torch.compile integration with CUDA-11.8 binaries (#119750)

Release tracker #119295 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.2: FlashAttention-v2, AOTInductor

30 Jan 17:58
8ac9b20

PyTorch 2.2 Release Notes

  • Highlights
  • Backwards Incompatible Changes
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes
  • Performance
  • Documentation

Highlights

We are excited to announce the release of PyTorch® 2.2! PyTorch 2.2 offers ~2x performance improvements to scaled_dot_product_attention via FlashAttention-v2 integration, as well as AOTInductor, a new ahead-of-time compilation and deployment tool built for non-python server-side deployments.

This release also includes improved torch.compile support for Optimizers, a number of new inductor optimizations, and a new logging mechanism called TORCH_LOGS.

Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.

Along with 2.2, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 3,628 commits and 521 contributors since PyTorch 2.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.2. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Summary:

  • scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions.
  • PyTorch 2.2 introduces a new ahead-of-time extension of TorchInductor called AOTInductor, designed to compile and deploy PyTorch programs for non-Python server-side deployments.
  • torch.distributed supports a new abstraction for initializing and representing ProcessGroups called device_mesh.
  • PyTorch 2.2 ships a standardized, configurable logging mechanism called TORCH_LOGS.
  • A number of torch.compile improvements are included in PyTorch 2.2, including improved support for compiling Optimizers and improved TorchInductor fusion and layout optimizations.
  • Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.
  • torch.ao.quantization now offers a prototype torch.export based flow
  • Beta: FlashAttentionV2 backend for scaled dot product attention; AOTInductor; TORCH_LOGS; torch.distributed.device_mesh; torch.compile + Optimizers
  • Prototype: PT 2 Quantization; Scaled dot product attention support for jagged layout NestedTensors
  • Performance Improvements: Inductor optimizations; aarch64-linux optimizations (AWS Graviton)

*To see a full list of public 2.2 - 1.12 feature submissions click here.
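
As a quick illustration of the Summary above, here is a minimal sketch of calling scaled_dot_product_attention so that it can dispatch to the FlashAttention-v2 kernel on supported GPUs; the shapes and dtype are illustrative assumptions rather than values from the release notes.

import torch
import torch.nn.functional as F

# Illustrative shapes (assumed): batch=2, heads=8, seq_len=128, head_dim=64
q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# On supported GPUs, this call can dispatch to the FlashAttention-v2 kernel in 2.2
out = F.scaled_dot_product_attention(q, k, v)

# The new TORCH_LOGS mechanism is driven by an environment variable, e.g.:
#   TORCH_LOGS="dynamo,inductor" python train.py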

Tracked Regressions

Performance reduction when using NVLSTree algorithm in NCCL 2.19.3 (#117748)

We have noticed a performance regression introduced to all-reduce in NCCL 2.19.3. Please use version 2.19.1 instead.

Poor numeric stability of loss when training with FSDP + DTensor (#117471)

We have observed that the loss may randomly flatline when training with FSDP + DTensor in some instances.

Backwards Incompatible Changes

Building PyTorch from source now requires GCC 9.4 or newer (#112858)

GCC 9.4 is the oldest version fully compatible with C++17, which the PyTorch codebase has migrated to from C++14.

Updated flash attention kernel in scaled_dot_product_attention to use Flash Attention v2 (#105602)

Previously, the v1 Flash Attention kernel had a Windows implementation, so a user on Windows who explicitly forced the flash attention kernel to run (via the sdp_kernel context manager with only flash attention enabled) could do so. In 2.2, Flash Attention v2 has no Windows implementation; if the sdp_kernel context manager must be used on Windows, enable the memory-efficient or math kernel instead.

2.1

# On Windows, explicitly forcing the flash attention kernel worked:
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
  torch.nn.functional.scaled_dot_product_attention(q,k,v)

2.2

# Don't force flash attention to be used if using sdp_kernel on Windows
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True):
  torch.nn.functional.scaled_dot_product_attention(q,k,v)

Rewrote DTensor (Tensor Parallel) APIs to improve UX (#114732)

In PyTorch 2.1 and before, users could pick ParallelStyles such as PairwiseParallel and specify input/output layouts with functions like make_input_replicate_1d or make_input_reshard_replicate, with default values provided for _prepare_input and _prepare_output. The Tensor Parallel UX looked like this:

from torch.distributed.tensor.parallel import parallelize_module
from torch.distributed.tensor.parallel.style import (
    ColwiseParallel,
    make_input_replicate_1d,
    make_input_reshard_replicate,
    make_input_shard_1d,
    make_input_shard_1d_last_dim,
    make_sharded_output_tensor,
    make_output_replicate_1d,
    make_output_reshard_tensor,
    make_output_shard_1d,
    make_output_tensor,
    PairwiseParallel,
)
from torch.distributed._tensor import DeviceMesh

module = DummyModule()
device_mesh = DeviceMesh("cuda", list(range(self.world_size)))
parallelize_module(module, device_mesh, PairwiseParallel(_prepare_input=make_input_replicate_1d))
...

Starting from PyTorch 2.2, we simplified the parallel styles to only ColwiseParallel and RowwiseParallel, since other ParallelStyles can be composed from these two. We also deleted the input/output functions and instead use input_layouts and output_layouts kwargs to specify the sharding layouts of the input/output tensors. Finally, we added the PrepareModuleInput/PrepareModuleOutput styles; these have no default layout arguments, so users must explicitly think about and specify the sharding layouts.

from torch.distributed.tensor.parallel import parallelize_module
from torch.distributed.tensor.parallel.style import (
    ColwiseParallel,
    PrepareModuleInput,
    RowwiseParallel,
)
from torch.distributed._tensor import init_device_mesh, Replicate, Shard

module = SimpleMLPModule()
device_mesh = init_device_mesh("cuda", (self.world_size,))
parallelize_module(
   module,
   device_mesh,
   {
      "fqn": PrepareModuleInput(
                input_layouts=Shard(0),
                desired_input_layouts=Replicate()
             ),
      "fqn.net1": ColwiseParallel(),
      "fqn.net2": RowwiseParallel(output_layouts=Shard(0)),
   }
)
...

UntypedStorage.resize_ now uses the original device instead of the current device context (#113386)

Before this PR, UntypedStorage.resize_ would move data to the current CUDA device index (given by torch.cuda.current_device()).
Now, UntypedStorage.resize_() keeps the data on the same device index that it was on before, regardless of the current device index.

2.1

>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:0
2.2

>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:1

Wrapping a function with set_grad_enabled will consume its global mutation (#113359)

This bc-breaking change fixes some unexpected behavior when set_grad_enabled is used as a decorator.

2.1

>>> import torch
>>> @torch.set_grad_enabled(False)  # unexpectedly, this mutates the grad mode!
    def inner_func(x):
        return x.sin()

>>> torch.is_grad_enabled()
False

2.2

>>> import torch
>>> @torch.set_grad_enabled(False)  # the global grad mode is no longer mutated
    def inner_func(x):
        return x.sin()

>>> torch.is_grad_enabled()
True

Deprecated verbose parameter in LRscheduler constructors (#111302)

As part of our decision to move towards a consolidated logging system, we are deprecating the verbose flag in LRScheduler.

If you would like to print the learning rate during execution, please use get_last_lr()

2.1

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min', verbose=True)
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
2.2

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
    print(f"Epoch {epoch} has concluded with lr of {scheduler.get_last_lr()}")


Read more

PyTorch 2.1.2 Release, bug fix release

15 Dec 01:59

This release is meant to fix the following issues (regressions / silent correctness):

  • Fix crashes for float16 empty tensors (#115183)
  • Fix MPS memory corruption when working with tensor slices (#114838)
  • Fix crashes during Conv backward pass on MPS devices (#113398)
  • Partially fix nn.Linear behavior on AArch64 platform (#110150)
  • Fix cosine_similarity for tensors of different sizes (#109363)
  • Package missing headers needed for extension development (#113055)
  • Improve error handling of torch.set_num_threads (#113684)
  • Fix profiling traces generation (#113763)

The Cherry pick tracker #113962 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.1.1 Release, bug fix release

15 Nov 22:59
4c55dc5

This release is meant to fix the following issues (regressions / silent correctness):

  • Remove spurious warning in comparison ops (#112170)
  • Fix segfault in foreach_* operations when input list length does not match (#112349)
  • Fix cuda driver API to load the appropriate .so file (#112996)
  • Fix missing CUDA initialization when calling FFT operations (#110326)
  • Ignore beartype==0.16.0 within the onnx package as it is incompatible (#111861)
  • Fix the behavior of torch.new_zeros in onnx due to TorchScript behavior change (#111694)
  • Remove unnecessary slow code in torch.distributed.checkpoint.optimizer.load_sharded_optimizer_state_dict (#111687)
  • Add planner argument to torch.distributed.checkpoint.optimizer.load_sharded_optimizer_state_dict (#111393)
  • Continue if param not exist in sharded load in torch.distributed.FSDP (#109116)
  • Fix handling of non-contiguous bias_mask in torch.nn.functional.scaled_dot_product_attention (#112673)
  • Fix the meta device implementation for nn.functional.scaled_dot_product_attention (#110893)
  • Fix copy from mps to cpu device when storage_offset is non-zero (#109557)
  • Fix segfault in torch.sparse.mm for non-contiguous inputs (#111742)
  • Fix circular import between Dynamo and einops (#110575)
  • Verify flatbuffer module fields are initialized for mobile deserialization (#109794)

Release tracker #110961 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.1: automatic dynamic shape compilation, distributed checkpointing

04 Oct 17:32
7bcf7da

PyTorch 2.1 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes
  • Performance
  • Documentation
  • Developers
  • Security

Highlights

We are excited to announce the release of PyTorch® 2.1! PyTorch 2.1 offers automatic dynamic shape support in torch.compile, torch.distributed.checkpoint for saving/loading distributed training jobs on multiple ranks in parallel, and torch.compile support for the NumPy API.

In addition, this release offers numerous performance improvements (e.g. CPU inductor improvements, AVX512 support, scaled-dot-product-attention support) as well as a prototype release of torch.export, a sound full-graph capture mechanism, and torch.export-based quantization.

Along with 2.1, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 6,682 commits and 784 contributors since 2.0. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.1. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Summary:

  • torch.compile now includes automatic support for detecting and minimizing recompilations due to tensor shape changes using automatic dynamic shapes.
  • torch.distributed.checkpoint enables saving and loading models from multiple ranks in parallel, as well as resharding due to changes in cluster topology.
  • torch.compile can now compile NumPy operations via translating them into PyTorch-equivalent operations.
  • torch.compile now includes improved support for Python 3.11.
  • New CPU performance features include inductor improvements (e.g. bfloat16 support and dynamic shapes), AVX512 kernel support, and scaled-dot-product-attention kernels.
  • torch.export, a sound full-graph capture mechanism is introduced as a prototype feature, as well as torch.export-based quantization.
  • torch.sparse now includes prototype support for semi-structured (2:4) sparsity on NVIDIA® GPUs.
  • Beta: Automatic Dynamic Shapes; torch.distributed.checkpoint; torch.compile + NumPy; torch.compile + Python 3.11; torch.compile + autograd.Function; third-party device integration: PrivateUse1
  • Prototype: torch.export(); torch.export-based Quantization; semi-structured (2:4) sparsity; cpp_wrapper for torchinductor
  • Performance Improvements: AVX512 kernel support; CPU optimizations for scaled-dot-product-attention (SDPA); CPU optimizations for bfloat16

*To see a full list of public 2.1, 2.0, and 1.13 feature submissions click here.
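
To illustrate the torch.compile + NumPy item above, here is a minimal sketch; the function and array shapes are illustrative assumptions.

import numpy as np
import torch

# torch.compile traces the NumPy operations and lowers them to PyTorch-equivalent ops
@torch.compile
def numpy_fn(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return np.sum(x * y, axis=0)

x = np.random.randn(128, 64).astype(np.float32)
y = np.random.randn(128, 64).astype(np.float32)
out = numpy_fn(x, y)  # returns a NumPy array computed through the compiled graph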

For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.

Backwards Incompatible Changes

Building PyTorch from source now requires C++17 (#100557)

The PyTorch codebase has migrated from the C++14 to the C++17 standard, so a C++17 compatible compiler is now required to compile PyTorch, to integrate with libtorch, or to implement a C++ PyTorch extension.

Disable torch.autograd.{backward, grad} for complex scalar output (#92753)

Gradients are not defined for functions that don't return real outputs; we now raise an error if you try to call backward on complex outputs. Previously, the complex component of the output was implicitly ignored. If you wish to preserve this behavior, you must now explicitly call .real on your complex outputs before calling .grad() or .backward().

Example

def fn(x):
    return (x * 0.5j).sum()

x = torch.ones(1, dtype=torch.double, requires_grad=True)
o = fn(x)

2.0.1

o.backward()

2.1

o.real.backward()

Update non-reentrant checkpoint to allow nesting and support autograd.grad (#90105)

As part of a larger refactor of torch.utils.checkpoint, we changed the interaction between activation checkpointing and retain_graph=True. Previously, in 2.0.1, recomputed activations were kept alive if retain_graph=True; in PyTorch 2.1, the non-reentrant implementation clears recomputed tensors on backward immediately upon unpack, even if retain_graph=True. This has the following additional implications: (1) Accessing ctx.saved_tensors twice in the same backward will now raise an error. (2) Accessing _saved_tensors multiple times will silently recompute the forward multiple times.

2.1

class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out = x.exp()
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, x):
        out, = ctx.saved_tensors
        # Calling ctx.saved_tensors again will raise in 2.1
        out, = ctx.saved_tensors
        return out

a = torch.tensor(1., requires_grad=True)

def fn(x):
    return Func.apply(x)


out = torch.utils.checkpoint.checkpoint(fn, a, use_reentrant=False)

def fn2(x):
    return x.exp()

out = torch.utils.checkpoint.checkpoint(fn2, a, use_reentrant=False)

out.grad_fn._saved_result
# Calling _saved_result will trigger another unpack, and lead to forward being
# recomputed again
out.grad_fn._saved_result

Only sync buffers when broadcast_buffers is True (#100729)

  • In PyTorch 2.0.1 and previous releases, when users use DistributedDataParallel (DDP), all buffers were synced automatically even if users set flag broadcast_buffers to be False:
from torch.nn.parallel import DistributedDataParallel as DDP
module = torch.nn.Linear(4, 8)
module = DDP(module) # Buffer is synchronized across all devices.
module = DDP(module, broadcast_buffers=False) # Buffer is synchronized across all devices.
...
  • Starting with PyTorch 2.1, if users specify the flag broadcast_buffers to be False, we don’t sync the buffer across devices:
from torch.nn.parallel import DistributedDataParallel as DDP
module = torch.nn.Linear(4, 8)
module = DDP(module) # Buffer is synchronized across all devices.
module = DDP(module, broadcast_buffers=False) # Buffer is NOT synchronized across all devices
...

Remove store barrier after PG init (#99937)

  • In PyTorch 2.0.1 and previous releases, after we initialize PG, we always call store based barrier:
from torch.distributed.distributed_c10d import init_process_group
init_process_group(...) # Will call _store_based_barrier in the end.
...
  • Starting with PyTorch 2.1, after we initialize PG, the environment variable TORCH_DIST_INIT_BARRIER controls whether we call store based barrier or not:
from torch.distributed.distributed_c10d import init_process_group
import os
os.environ["TORCH_DIST_INIT_BARRIER"] = "1" # This is the default behavior
init_process_group(...) # Will call _store_based_barrier in the end.
os.environ["TORCH_DIST_INIT_BARRIER"] = "0"
init_process_group(...) # Will not call _store_based_barrier in the end.
...

Disallow non-bool masks in torch.masked_{select, scatter, fill} (#96112, #97999, #96594)

Finish the deprecation cycle for non-bool masks. Functions now require the dtype of the mask to be torch.bool.

>>> # 2.0.1
>>> inp = torch.rand(3)
>>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8)
>>> torch.masked_select(inp, mask)
UserWarning: masked_select received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/TensorAdvancedIndexing.cpp:1855.)
  torch.masked_select(inp, mask)

>>> torch.masked_select(inp, mask.to(dtype=torch.bool))
# Works fine

>>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool)
>>> torch.masked_select(inp, correct_mask)
# Works fine

>>> # 2.1
>>> inp = torch.rand(3)
>>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8)
>>> torch.masked_select(inp, mask)
RuntimeError: masked_select: expected BoolTensor for mask

>>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool)
>>> torch.masked_select(inp, correct_mask)
# Works fine

>>> torch.masked_select(inp, mask.to(dtype=torch.bool))
# Works fine

Fix the result of torch.unique to make it consistent with NumPy when dim is specified (#101693)

The dim argument was clarified and its behavior aligned with NumPy's, to signify which sub-tensors to treat as the unit of uniqueness. See the documentation for more details: https://pytorch.org/docs/stable/generated/torch.unique.html
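
For example (an illustrative sketch, not taken from the release notes), with dim=0 uniqueness is computed over whole rows, matching NumPy's np.unique(..., axis=0):

import torch

x = torch.tensor([[1, 2], [1, 2], [3, 4]])
print(torch.unique(x, dim=0))
# tensor([[1, 2],
#         [3, 4]])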

Make the Index Rounding Mode Consistent Between the 2D and 3D GridSample Nearest Neighbor Interpolations (#97000)

Prior to this change, for torch.nn.functional.grid_sample(mode='nearest') the forward 2D kernel used std::nearbyint whereas the forward 3D kernel used std::round in order to determine the nearest pixel locations after un-normalization of the grid. Additionally, the backward kernels for both ...

Read more

PyTorch 2.0.1 Release, bug fix release

08 May 19:55
e9ebda2

This release is meant to fix the following issues (regressions / silent correctness):

  • Fix _canonical_mask throws warning when bool masks passed as input to TransformerEncoder/TransformerDecoder (#96009, #96286)
  • Fix Embedding bag max_norm=-1 causes leaf Variable that requires grad is being used in an in-place operation #95980
  • Fix type hint for torch.Tensor.grad_fn, which can be a torch.autograd.graph.Node or None. #96804
  • Can’t convert float to int when the input is a scalar np.ndarray. #97696
  • Revisit torch._six.string_classes removal #97863
  • Fix module backward pre-hooks to actually update gradient #97983
  • Fix load_sharded_optimizer_state_dict error on multi node #98063
  • Warn once for TypedStorage deprecation #98777
  • cuDNN V8 API, Fix incorrect use of emplace in the benchmark cache #97838

Torch.compile:

  • Add support for Modules with custom getitem method to torch.compile #97932
  • Fix improper guards on list variables. #97862
  • Fix Sequential nn module with duplicated submodule #98880

Distributed:

  • Fix distributed_c10d's handling of custom backends #95072
  • Fix MPI backend not properly initialized #98545

NN_frontend:

  • Update Multi-Head Attention's doc string #97046
  • Fix incorrect behavior of is_causal parameter for torch.nn.TransformerEncoderLayer.forward #97214
  • Fix error for SDPA on sm86 and sm89 hardware #99105
  • Fix nn.MultiheadAttention mask handling #98375

DataLoader:

  • Fix regression for pin_memory recursion when operating on bytes #97737
  • Fix collation logic #97789
  • Fix potentially backwards incompatible change with DataLoader and is_shardable Datapipes #97287

MPS:

  • Fix LayerNorm crash when input is in float16 #96208
  • Add support for cumsum on int64 input #96733
  • Fix issue with setting BatchNorm to non-trainable #98794

Functorch:

  • Fix Segmentation Fault for vmaped function accessing BatchedTensor.data #97237
  • Fix index_select support when dim is negative #97916
  • Improve docs for autograd.Function support #98020
  • Fix Exception thrown when running Migration guide example for jacrev #97746

Releng:

  • Fix Convolutions for CUDA-11.8 wheel builds #99451
  • Fix Import torchaudio + torch.compile crashes on exit #96231
  • Linux aarch64 wheels are missing the mkldnn+acl backend support - pytorch/builder@54931c2
  • Linux aarch64 torchtext 0.15.1 wheels are missing for aarch64_linux platform - pytorch/builder#1375
  • Enable ROCm 5.4.2 manywheel and python 3.11 builds #99552
  • PyTorch cannot be installed at the same time as numpy in a conda env on osx-64 / Python 3.11 #97031
  • Illegal instruction (core dumped) on Raspberry Pi 4.0 8gb - pytorch/builder#1370

Torch.optim:

  • Fix fused AdamW causes NaN loss #95847
  • Fix Fused AdamW has worse loss than Apex and unfused AdamW for fp16/AMP #98620

The release tracker should contain all relevant pull requests related to this release as well as links to related issues

PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever

15 Mar 19:38
c263bd4

PyTorch 2.0 Release notes

  • Highlights
  • Backwards Incompatible Changes
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes
  • Performance
  • Documentation

Highlights

We are excited to announce the release of PyTorch® 2.0 (release note) which we highlighted during the PyTorch Conference on 12/2/22! PyTorch 2.0 offers the same eager-mode development and user experience, while fundamentally changing and supercharging how PyTorch operates at compiler level under the hood with faster performance and support for Dynamic Shapes and Distributed.

This next-generation release includes a Stable version of Accelerated Transformers (formerly called Better Transformers); Beta includes torch.compile as the main API for PyTorch 2.0, the scaled_dot_product_attention function as part of torch.nn.functional, the MPS backend, functorch APIs in the torch.func module; and other Beta/Prototype improvements across various inferences, performance and training optimization features on GPUs and CPUs. For a comprehensive introduction and technical overview of torch.compile, please visit the 2.0 Get Started page.

Along with 2.0, we are also releasing a series of beta updates to the PyTorch domain libraries, including those that are in-tree, and separate libraries including TorchAudio, TorchVision, and TorchText. An update for TorchX is also being released as it moves to community supported mode. More details can be found in this library blog.

This release is composed of over 4,541 commits and 428 contributors since 1.13.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.0 and the overall 2-series this year.

Summary:

  • torch.compile is the main API for PyTorch 2.0, which wraps your model and returns a compiled model. It is a fully additive (and optional) feature and hence 2.0 is 100% backward compatible by definition.
  • As an underpinning technology of torch.compile, TorchInductor with Nvidia and AMD GPUs will rely on OpenAI Triton deep learning compiler to generate performant code and hide low level hardware details. OpenAI Triton-generated kernels achieve performance that's on par with hand-written kernels and specialized cuda libraries such as cublas.
  • Accelerated Transformers introduce high-performance support for training and inference using a custom kernel architecture for scaled dot product attention (SPDA). The API is integrated with torch.compile() and model developers may also use the scaled dot product attention kernels directly by calling the new scaled_dot_product_attention() operator.
  • Metal Performance Shaders (MPS) backend provides GPU accelerated PyTorch training on Mac platforms with added support for Top 60 most used ops, bringing coverage to over 300 operators.
  • Amazon AWS optimized PyTorch CPU inference on AWS Graviton3-based C7g instances. PyTorch 2.0 improves inference performance on Graviton compared to previous releases, including improvements for Resnet50 and Bert.
  • New prototype features and technologies across TensorParallel, DTensor, 2D parallel, TorchDynamo, AOTAutograd, PrimTorch and TorchInductor.
  • Stable: Accelerated PT 2 Transformers
  • Beta: torch.compile; PyTorch MPS Backend; Scaled dot product attention; functorch; Dispatchable Collectives; torch.set_default_device and torch.device as context manager; X86 quantization backend; GNN inference and training performance
  • Prototype: DTensor; TensorParallel; 2D Parallel; torch.compile (dynamic=True)
  • Platform Changes: CUDA support for 11.7 & 11.8 (deprecating CUDA 11.6); Python 3.8 (deprecating Python 3.7); AWS Graviton3

*To see a full list of public 2.0, 1.13 and 1.12 feature submissions click here
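
As a minimal sketch of the torch.compile workflow described in the Summary above (the model and input shapes are illustrative assumptions):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
compiled_model = torch.compile(model)  # fully additive and optional

x = torch.randn(32, 64)
out = compiled_model(x)  # the first call triggers compilation; later calls reuse the compiled graph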

Backwards Incompatible Changes

Drop support for Python versions <= 3.7 (#93155)

Previously the minimum supported version of Python for PyTorch was 3.7. This PR updates the minimum version to require 3.8 in order to install PyTorch. See Hardware / Software Support for more information.

Drop support for CUDA 10 (#89582)

This PR updates the minimum CUDA version to 11.0. See the getting-started for installation or building from source for more information.

Gradients are now set to None instead of zeros by default in torch.optim.*.zero_grad() and torch.nn.Module.zero_grad() (#92731)

This changes the default behavior of zero_grad() to zero out the grads by setting them to None instead of zero tensors. In other words, the set_to_none kwarg is now True by default instead of False. Setting grads to None reduces peak memory usage and increases performance. This will break code that directly accesses data or does computation on the grads after calling zero_grad() as they will now be None. To revert to the old behavior, pass in zero_grad(set_to_none=False).

1.13

>>> import torch
>>> from torch import nn
>>> module = nn.Linear(2, 2)
>>> i = torch.randn(2, 2, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
False
>>> module.weight.grad.data
tensor([[0., 0.],
        [0., 0.]])
>>> module.weight.grad + 1.0
tensor([[1., 1.],
        [1., 1.]])
2.0

>>> import torch
>>> from torch import nn
>>> module = nn.Linear(5, 5)
>>> i = torch.randn(2, 5, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
True
>>> module.weight.grad.data
AttributeError: 'NoneType' object has no attribute 'data'
>>> module.weight.grad + 1.0
TypeError: unsupported operand type(s) for +:
'NoneType' and 'float'

Update torch.tensor and nn.Parameter to serialize all their attributes (#88913)

Any attribute stored on torch.Tensor and torch.nn.Parameter will now be serialized. This aligns the serialization behavior of torch.nn.Parameter, torch.Tensor, and other tensor subclasses.

1.13

# torch.Tensor behavior
>>> a = torch.Tensor()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)

>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Tensor' object has no attribute 'foo'

# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Parameter' object has no attribute 'foo'

# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
...   pass

>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(b.foo)
hey
2.0

# torch.Tensor behavior
>>> a = torch.Tensor()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey

# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey

# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
...   pass

>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(b.foo)
hey

If you have an attribute that you don't want to be serialized, do not store it as an attribute on the Tensor or Parameter; instead, it is recommended to use torch.utils.weak.WeakTensorKeyDictionary:

>>> from torch.utils import weak
>>> foo_dict = weak.WeakTensorKeyDictionary()
>>> foo_dict[a] = 'hey'
>>> print(foo_dict[a])
hey

Algorithms {Adadelta, Adagrad, Adam, Adamax, AdamW, ASGD, NAdam, RAdam, RMSProp, RProp, SGD} default to faster foreach implementation when on CUDA + differentiable=False

When applicable, this changes the default behavior of step() and anything that ca...

Read more

PyTorch 1.13.1 Release, small bug fix release

16 Dec 00:17
49444c3

This release is meant to fix the following issues (regressions / silent correctness):

  • RuntimeError by torch.nn.modules.activation.MultiheadAttention with bias=False and batch_first=True #88669
  • Installation via pip on Amazon Linux 2, regression #88869
  • Installation using poetry on Mac M1, failure #88049
  • Missing masked tensor documentation #89734
  • torch.jit.annotations.parse_type_line is not safe (command injection) #88868
  • Use the Python frame safely in _pythonCallstack #88993
  • Double-backward with full_backward_hook causes RuntimeError #88312
  • Fix logical error in get_default_qat_qconfig #88876
  • Fix cuda/cpu check on NoneType and unit test #88854 and #88970
  • Onnx ATen Fallback for BUILD_CAFFE2=0 for ONNX-only ops #88504
  • Onnx operator_export_type on the new registry #87735
  • torchrun AttributeError caused by file_based_local_timer on Windows #85427

The release tracker should contain all relevant pull requests related to this release as well as links to related issues

PyTorch 1.13: beta versions of functorch and improved support for Apple’s new M1 chips are now available

28 Oct 16:54
7c98e70

Pytorch 1.13 Release Notes

  • Highlights
  • Backwards Incompatible Changes
  • New Features
  • Improvements
  • Performance
  • Documentation
  • Developers

Highlights

We are excited to announce the release of PyTorch 1.13! This includes stable versions of BetterTransformer. We deprecated CUDA 10.2 and 11.3 and completed migration of CUDA 11.6 and 11.7. Beta includes improved support for Apple M1 chips and functorch, a library that offers composable vmap (vectorization) and autodiff transforms, being included in-tree with the PyTorch release. This release is composed of over 3,749 commits and 467 contributors since 1.12.1. We want to sincerely thank our dedicated community for your contributions.

Summary:

  • The BetterTransformer feature set supports fastpath execution for common Transformer models during Inference out-of-the-box, without the need to modify the model. Additional improvements include accelerated add+matmul linear algebra kernels for sizes commonly used in Transformer models and Nested Tensors is now enabled by default.

  • Timely deprecating older CUDA versions allows us to proceed with introducing the latest CUDA version as they are introduced by Nvidia®, and hence allows support for C++17 in PyTorch and new NVIDIA Open GPU Kernel Modules.

  • Previously, functorch was released out-of-tree in a separate package. After installing PyTorch, a user will be able to import functorch and use functorch without needing to install another package.

  • PyTorch is offering native builds for Apple® silicon machines that use Apple's new M1 chip as a beta feature, providing improved support across PyTorch's APIs.

  • Stable: Better Transformer; CUDA 10.2 and 11.3 CI/CD Deprecation
  • Beta: Enable Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs; Extend NNC to support channels last and bf16; Functorch now in PyTorch Core Library; Beta Support for M1 devices
  • Prototype: Arm® Compute Library backend support for AWS Graviton; CUDA Sanitizer

You can check the blogpost that shows the new features here.
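
As a small sketch of the in-tree functorch workflow mentioned above (the function and shapes are illustrative assumptions):

import torch
from functorch import grad, vmap  # importable directly after installing PyTorch 1.13

def f(x):
    return torch.sin(x).sum()

x = torch.randn(3, 5)
per_sample_grads = vmap(grad(f))(x)  # gradient of f computed independently for each row of x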

Backwards Incompatible changes

Python API

uint8 and all integer dtype masks are no longer allowed in Transformer (#87106)

Prior to 1.13, key_padding_mask could be set to uint8 or other integer dtypes in TransformerEncoder and MultiheadAttention, which might generate unexpected results. In this release, these dtypes are not allowed for the mask anymore. Please convert them to torch.bool before using.

1.12.1

>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.uint8)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
# works before 1.13
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)

1.13

>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.bool)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)

Updated torch.floor_divide to perform floor division (#78411)

Prior to 1.13, torch.floor_divide erroneously performed truncation division (i.e. truncated the quotients). In this release, it has been fixed to perform floor division. To replicate the old behavior, use torch.div with rounding_mode='trunc'.

1.12.1

>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -1.])

1.13

>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -2.])
# Old behavior can be replicated using torch.div with rounding_mode='trunc'
>>> torch.div(a, b, rounding_mode='trunc')
tensor([ 2., -1.])

Fixed torch.index_select on CPU to error that index is out of bounds when the source tensor is empty (#77881)

Prior to 1.13, torch.index_select would return an appropriately sized tensor filled with random values on CPU if the source tensor was empty. In this release, we have fixed this bug so that it errors out. A consequence of this is that torch.nn.Embedding which utilizes index_select will error out rather than returning an empty tensor when embedding_dim=0 and input contains indices which are out of bounds. The old behavior cannot be reproduced with torch.nn.Embedding, however since an Embedding layer with embedding_dim=0 is a corner case this behavior is unlikely to be relied upon.

1.12.1

>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
tensor([], size=(1, 0), grad_fn=<EmbeddingBackward0>)

1.13

>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
RuntimeError: INDICES element is out of DATA bounds, id=4 axis_dim=3

Disallow overflows when tensors are constructed from scalars (#82329)

Prior to this PR, overflows during tensor construction from scalars would not throw an error. In 1.13, such cases will error.

1.12.1

>>> torch.tensor(1000, dtype=torch.int8)
tensor(-24, dtype=torch.int8)

1.13

>>> torch.tensor(1000, dtype=torch.int8)
RuntimeError: value cannot be converted to type int8 without overflow

Error on indexing a cpu tensor with non-cpu indices (#69607)

Prior to 1.13, cpu_tensor[cuda_indices] was a valid program that would return a cpu tensor. The original use case for mixed device indexing was for non_cpu_tensor[cpu_indices], and allowing the opposite was unintentional (cpu_tensor[non_cpu_indices]). This behavior appears to be rarely used, and a refactor of our indexing kernels made it difficult to represent an op that takes in (cpu_tensor, non_cpu_tensor) and returns another cpu_tensor, so it is now an error.

To replicate the old behavior for base[indices], you can ensure that either indices lives on the CPU device, or base and indices both live on the same device.

1.12.1

>>> a = torch.tensor([1.0, 2.0, 3.0])
>>> b = torch.tensor([0, 2], device='cuda')
>>> a[b]
tensor([1., 3.])

1.13

>>> a = torch.tensor([1.0, 2.0, 3.0])
>>> b = torch.tensor([0, 2], device='cuda')
>>> a[b]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
# Old behavior can be replicated by moving b to CPU, or a to CUDA
>>> a[b.cpu()]
tensor([1., 3.])
>>> a.cuda()[b]
tensor([1., 3.], device='cuda:0')

Remove deprecated torch.eig, torch.matrix_rank, torch.lstsq (#70982, #70981, #70980)

The deprecation cycle for the above functions has been completed and they have been removed in the 1.13 release.

torch.nn

Enforce that the bias has the same dtype as input and weight for convolutions on CPU (#83686)

To align with the implementation on other devices, the CPU implementation for convolutions was updated to enforce that the dtype of the bias matches the dtype of the input and weight.

1.12.1

# input and weight are dtype torch.int64
# bias is torch.float32
>>> out = torch.nn.functional.conv2d(input, weight, bias, ...)

1.13

# input and weight are dtype torch.int64
# bias is torch.float32
>>> with assertRaisesError():
>>>    out = torch.nn.functional.conv2d(input, weight, bias, ...)

# Updated code to avoid the error
>>> out = torch.nn.functional.conv2d(input, weight, bias.to(input.dtype), ...)

Autograd

Disallow setting the .data of a tensor that requires_grad=True with an integer tensor (#78436)

Setting the .data of a tensor that requires_grad with an integer tensor now raises an error.

1.12.1

>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
>>> x
tensor([0, 0], requires_grad=True)

1.13

>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: data set to a tensor that requires gradients must be floating point or complex dtype

Added variable_list support to ExtractVariables struct (#84583)

Prior to this change, a C++ custom autograd Function considered tensors passed in a TensorList not to be tensors for the purposes of recording the backward graph. After this change, custom Functions that receive a TensorList must modify their backward functions to also compute gradients for these additional tensor inputs. Note that this behavior now differs from that of custom autograd Functions in Python.

1.12.1

struct MyFunction : public Function<MyFunction> {
    static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
      return 2 * tensors[0] + 3 * t;
    }

    static variable_list backward(
        AutogradContext* ctx,
        variable_list grad_output) {
      return {3 * grad_output[0]};
    }
};

1.13

struct MyFunction : public Function<MyFunction> {
    static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
      return 2 * tensors[0] + 3 * t;
    }

    static variable_list backward(
        AutogradContext* ctx,
        variable_list grad_output) {
      return {3 * grad_output[0], 2 * grad_output[0]};
    }
};

Don't detach when making views; force kernel to detach (#84893)

View operations registered as CompositeExplicitAutograd kernels are no longer allowed to return input tensors as-is. You must explic...

Read more