
Releases: pytorch/pytorch

PyTorch 2.0.1 Release, bug fix release

08 May 19:55
e9ebda2

This release is meant to fix the following issues (regressions / silent correctness):

  • Fix _canonical_mask throwing a warning when bool masks are passed as input to TransformerEncoder/TransformerDecoder (#96009, #96286)
  • Fix EmbeddingBag max_norm=-1 causing "a leaf Variable that requires grad is being used in an in-place operation" #95980
  • Fix type hint for torch.Tensor.grad_fn, which can be a torch.autograd.graph.Node or None. #96804
  • Can’t convert float to int when the input is a scalar np.ndarray. #97696
  • Revisit torch._six.string_classes removal #97863
  • Fix module backward pre-hooks to actually update gradient #97983
  • Fix load_sharded_optimizer_state_dict error on multi node #98063
  • Warn once for TypedStorage deprecation #98777
  • cuDNN V8 API, Fix incorrect use of emplace in the benchmark cache #97838

Torch.compile:

  • Add support for Modules with custom getitem method to torch.compile #97932
  • Fix improper guards on list variables. #97862
  • Fix Sequential nn module with duplicated submodule #98880

Distributed:

  • Fix distributed_c10d's handling of custom backends #95072
  • Fix MPI backend not properly initialized #98545

NN frontend:

  • Update Multi-Head Attention's doc string #97046
  • Fix incorrect behavior of is_causal parameter for torch.nn.TransformerEncoderLayer.forward #97214
  • Fix error for SDPA on sm86 and sm89 hardware #99105
  • Fix nn.MultiheadAttention mask handling #98375

DataLoader:

  • Fix regression for pin_memory recursion when operating on bytes #97737
  • Fix collation logic #97789
  • Fix potentially backwards incompatible change with DataLoader and is_shardable Datapipes #97287

MPS:

  • Fix LayerNorm crash when input is in float16 #96208
  • Add support for cumsum on int64 input #96733
  • Fix issue with setting BatchNorm to non-trainable #98794

Functorch:

  • Fix Segmentation Fault for vmapped function accessing BatchedTensor.data #97237
  • Fix index_select support when dim is negative #97916
  • Improve docs for autograd.Function support #98020
  • Fix Exception thrown when running Migration guide example for jacrev #97746

Releng:

  • Fix Convolutions for CUDA-11.8 wheel builds #99451
  • Fix Import torchaudio + torch.compile crashes on exit #96231
  • Linux aarch64 wheels are missing the mkldnn+acl backend support - pytorch/builder@54931c2
  • Linux aarch64 torchtext 0.15.1 wheels are missing for aarch64_linux platform - pytorch/builder#1375
  • Enable ROCm 5.4.2 manywheel and python 3.11 builds #99552
  • PyTorch cannot be installed at the same time as numpy in a conda env on osx-64 / Python 3.11 #97031
  • Illegal instruction (core dumped) on Raspberry Pi 4.0 8gb - pytorch/builder#1370

Torch.optim:

  • Fix fused AdamW causes NaN loss #95847
  • Fix Fused AdamW has worse loss than Apex and unfused AdamW for fp16/AMP #98620

The release tracker should contain all relevant pull requests related to this release as well as links to related issues

PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever

15 Mar 19:38
c263bd4

PyTorch 2.0 Release notes

  • Highlights
  • Backwards Incompatible Changes
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes
  • Performance
  • Documentation

Highlights

We are excited to announce the release of PyTorch® 2.0 (release note) which we highlighted during the PyTorch Conference on 12/2/22! PyTorch 2.0 offers the same eager-mode development and user experience, while fundamentally changing and supercharging how PyTorch operates at the compiler level under the hood, with faster performance and support for Dynamic Shapes and Distributed.

This next-generation release includes a Stable version of Accelerated Transformers (formerly called Better Transformers); Beta includes torch.compile as the main API for PyTorch 2.0, the scaled_dot_product_attention function as part of torch.nn.functional, the MPS backend, and functorch APIs in the torch.func module; and other Beta/Prototype improvements across various inference, performance and training optimization features on GPUs and CPUs. For a comprehensive introduction and technical overview of torch.compile, please visit the 2.0 Get Started page.

Along with 2.0, we are also releasing a series of beta updates to the PyTorch domain libraries, including those that are in-tree, and separate libraries including TorchAudio, TorchVision, and TorchText. An update for TorchX is also being released as it moves to community supported mode. More details can be found in this library blog.

This release is composed of over 4,541 commits and 428 contributors since 1.13.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.0 and the overall 2-series this year.

Summary:

  • torch.compile is the main API for PyTorch 2.0, which wraps your model and returns a compiled model (a short usage sketch follows this summary). It is a fully additive (and optional) feature and hence 2.0 is 100% backward compatible by definition.
  • As an underpinning technology of torch.compile, TorchInductor relies on the OpenAI Triton deep learning compiler for Nvidia and AMD GPUs to generate performant code and hide low-level hardware details. OpenAI Triton-generated kernels achieve performance that's on par with hand-written kernels and specialized CUDA libraries such as cuBLAS.
  • Accelerated Transformers introduce high-performance support for training and inference using a custom kernel architecture for scaled dot product attention (SDPA). The API is integrated with torch.compile(), and model developers may also use the scaled dot product attention kernels directly by calling the new scaled_dot_product_attention() operator.
  • The Metal Performance Shaders (MPS) backend provides GPU-accelerated PyTorch training on Mac platforms with added support for the top 60 most-used ops, bringing coverage to over 300 operators.
  • Amazon AWS optimized PyTorch CPU inference on AWS Graviton3-based C7g instances. PyTorch 2.0 improves inference performance on Graviton compared to previous releases, including improvements for ResNet-50 and BERT.
  • New prototype features and technologies across TensorParallel, DTensor, 2D parallel, TorchDynamo, AOTAutograd, PrimTorch and TorchInductor.
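As a minimal sketch of the two headline APIs (the model, tensor shapes and values below are illustrative placeholders, not taken from the release notes):

import torch
import torch.nn.functional as F

# torch.compile wraps an existing eager-mode model and returns a compiled one;
# the original module keeps working unchanged.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
compiled_model = torch.compile(model)
out = compiled_model(torch.randn(8, 64))

# scaled_dot_product_attention dispatches to a fused kernel when one is available.
q = torch.randn(2, 4, 16, 8)  # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 4, 16, 8)
v = torch.randn(2, 4, 16, 8)
attn = F.scaled_dot_product_attention(q, k, v)
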
Stable:

  • Accelerated PT 2 Transformers

Beta:

  • torch.compile
  • PyTorch MPS Backend
  • Scaled dot product attention
  • Functorch
  • Dispatchable Collectives
  • torch.set_default_device and torch.device as context manager
  • X86 quantization backend
  • GNN inference and training performance

Prototype:

  • DTensor
  • TensorParallel
  • 2D Parallel
  • Torch.compile (dynamic=True)

Platform Changes:

  • CUDA support for 11.7 & 11.8 (deprecating CUDA 11.6)
  • Python 3.8 (deprecating Python 3.7)
  • AWS Graviton3
*To see a full list of public 2.0, 1.13 and 1.12 feature submissions click here

Backwards Incompatible Changes

Drop support for Python versions <= 3.7 (#93155)

Previously the minimum supported version of Python for PyTorch was 3.7. This PR updates the minimum version to require 3.8 in order to install PyTorch. See Hardware / Software Support for more information.

Drop support for CUDA 10 (#89582)

This PR updates the minimum CUDA version to 11.0. See the getting-started for installation or building from source for more information.

Gradients are now set to None instead of zeros by default in torch.optim.*.zero_grad() and torch.nn.Module.zero_grad() (#92731)

This changes the default behavior of zero_grad() to zero out the grads by setting them to None instead of zero tensors. In other words, the set_to_none kwarg is now True by default instead of False. Setting grads to None reduces peak memory usage and increases performance. This will break code that directly accesses data or does computation on the grads after calling zero_grad() as they will now be None. To revert to the old behavior, pass in zero_grad(set_to_none=False).

1.13

>>> import torch
>>> from torch import nn
>>> module = nn.Linear(2, 2)
>>> i = torch.randn(2, 2, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
False
>>> module.weight.grad.data
tensor([[0., 0.],
        [0., 0.]])
>>> module.weight.grad + 1.0
tensor([[1., 1.],
        [1., 1.]])

2.0

>>> import torch
>>> from torch import nn
>>> module = nn.Linear(5, 5)
>>> i = torch.randn(2, 5, requires_grad=True)
>>> module(i).sum().backward()
>>> module.zero_grad()
>>> module.weight.grad == None
True
>>> module.weight.grad.data
AttributeError: 'NoneType' object has no attribute 'data'
>>> module.weight.grad + 1.0
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'

Update torch.Tensor and nn.Parameter to serialize all their attributes (#88913)

Any attribute stored on torch.Tensor and torch.nn.Parameter will now be serialized. This aligns the serialization behavior of torch.nn.Parameter, torch.Tensor and other tensor subclasses.

1.13
# torch.Tensor behavior
>>> a = torch.Tensor()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)

>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Tensor' object has no attribute 'foo'

# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
AttributeError: 'Parameter' object has no attribute 'foo'

# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
...   pass

>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(b.foo)
hey

2.0

# torch.Tensor behavior
>>> a = torch.Tensor()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey

# torch.nn.Parameter behavior
>>> a = nn.Parameter()
>>> a.foo = 'hey'

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(a.foo)
hey
>>> print(b.foo)
hey

# torch.Tensor subclass behavior
>>> class MyTensor(torch.Tensor):
...   pass

>>> a = MyTensor()
>>> a.foo = 'hey'
>>> print(a.foo)
hey

>>> buffer = io.BytesIO()
>>> torch.save(a, buffer)
>>> buffer.seek(0)
>>> b = torch.load(buffer)
>>> print(b.foo)
hey

If you have an attribute that you don't want to be serialized, you should not store it as an attribute on the Tensor or Parameter; instead, it is recommended to use torch.utils.weak.WeakTensorKeyDictionary:

>>> from torch.utils import weak
>>> a = torch.ones(2)
>>> foo_dict = weak.WeakTensorKeyDictionary()
>>> foo_dict[a] = 'hey'
>>> print(foo_dict[a])
hey

Algorithms {Adadelta, Adagrad, Adam, Adamax, AdamW, ASGD, NAdam, RAdam, RMSProp, RProp, SGD} default to faster foreach implementation when on CUDA + differentiable=False

When applicable, this changes the default behavior of step() and anything that ca...
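As a minimal sketch (with an illustrative model), the single-tensor implementation can still be forced explicitly by passing foreach=False to the optimizer constructor:

import torch

model = torch.nn.Linear(2, 2)
# foreach=None (the default) selects the faster multi-tensor path on CUDA;
# foreach=False falls back to the previous single-tensor implementation.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, foreach=False)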

PyTorch 1.13.1 Release, small bug fix release

16 Dec 00:17
49444c3

This release is meant to fix the following issues (regressions / silent correctness):

  • RuntimeError by torch.nn.modules.activation.MultiheadAttention with bias=False and batch_first=True #88669
  • Installation via pip on Amazon Linux 2, regression #88869
  • Installation using poetry on Mac M1, failure #88049
  • Missing masked tensor documentation #89734
  • torch.jit.annotations.parse_type_line is not safe (command injection) #88868
  • Use the Python frame safely in _pythonCallstack #88993
  • Double-backward with full_backward_hook causes RuntimeError #88312
  • Fix logical error in get_default_qat_qconfig #88876
  • Fix cuda/cpu check on NoneType and unit test #88854 and #88970
  • Onnx ATen Fallback for BUILD_CAFFE2=0 for ONNX-only ops #88504
  • Onnx operator_export_type on the new registry #87735
  • torchrun AttributeError caused by file_based_local_timer on Windows #85427

The release tracker should contain all relevant pull requests related to this release as well as links to related issues

PyTorch 1.13: beta versions of functorch and improved support for Apple’s new M1 chips are now available

28 Oct 16:54
7c98e70

PyTorch 1.13 Release Notes

  • Highlights
  • Backwards Incompatible Changes
  • New Features
  • Improvements
  • Performance
  • Documentation
  • Developers

Highlights

We are excited to announce the release of PyTorch 1.13! This includes stable versions of BetterTransformer. We deprecated CUDA 10.2 and 11.3 and completed migration of CUDA 11.6 and 11.7. Beta includes improved support for Apple M1 chips and functorch, a library that offers composable vmap (vectorization) and autodiff transforms, which is now included in-tree with the PyTorch release. This release is composed of over 3,749 commits made by 467 contributors since 1.12.1. We want to sincerely thank our dedicated community for your contributions.

Summary:

  • The BetterTransformer feature set supports fastpath execution for common Transformer models during inference out-of-the-box, without the need to modify the model. Additional improvements include accelerated add+matmul linear algebra kernels for sizes commonly used in Transformer models, and Nested Tensors are now enabled by default.

  • Timely deprecation of older CUDA versions allows us to proceed with introducing the latest CUDA versions as they are released by Nvidia®, and hence allows support for C++17 in PyTorch and the new NVIDIA Open GPU Kernel Modules.

  • Previously, functorch was released out-of-tree in a separate package. After installing PyTorch, users can import and use functorch without needing to install another package (see the sketch after this list).

  • PyTorch is offering native builds for Apple® silicon machines that use Apple's new M1 chip as a beta feature, providing improved support across PyTorch's APIs.
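A small sketch of the in-tree functorch usage mentioned above (the function f is just an illustrative placeholder):

import torch
import functorch

def f(x):
    return (x ** 2).sum()

x = torch.randn(3)
# functorch.grad returns a new function that computes the gradient of f
print(functorch.grad(f)(x))  # equals 2 * x
# functorch.vmap vectorizes a function over a leading batch dimension
print(functorch.vmap(torch.sin)(torch.randn(4, 3)).shape)  # torch.Size([4, 3])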

Stable:

  • Better Transformer
  • CUDA 10.2 and 11.3 CI/CD Deprecation

Beta:

  • Enable Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs
  • Extend NNC to support channels last and bf16
  • Functorch now in PyTorch Core Library
  • Beta Support for M1 devices

Prototype:

  • Arm® Compute Library backend support for AWS Graviton
  • CUDA Sanitizer

You can check the blogpost that shows the new features here.

Backwards Incompatible changes

Python API

uint8 and all integer dtype masks are no longer allowed in Transformer (#87106)

Prior to 1.13, key_padding_mask could be set to uint8 or other integer dtypes in TransformerEncoder and MultiheadAttention, which might generate unexpected results. In this release, these dtypes are not allowed for the mask anymore. Please convert them to torch.bool before using.

1.12.1

>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.uint8)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
# works before 1.13
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)

1.13

>>> layer = nn.TransformerEncoderLayer(2, 4, 2)
>>> encoder = nn.TransformerEncoder(layer, 2)
>>> pad_mask = torch.tensor([[1, 1, 0, 0]], dtype=torch.bool)
>>> inputs = torch.cat([torch.randn(1, 2, 2), torch.zeros(1, 2, 2)], dim=1)
>>> outputs = encoder(inputs, src_key_padding_mask=pad_mask)

Updated torch.floor_divide to perform floor division (#78411)

Prior to 1.13, torch.floor_divide erroneously performed truncation division (i.e. truncated the quotients). In this release, it has been fixed to perform floor division. To replicate the old behavior, use torch.div with rounding_mode='trunc'.

1.12.1

>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -1.])

1.13

>>> a = torch.tensor([4.0, -3.0])
>>> b = torch.tensor([2.0, 2.0])
>>> torch.floor_divide(a, b)
tensor([ 2., -2.])
# Old behavior can be replicated using torch.div with rounding_mode='trunc'
>>> torch.div(a, b, rounding_mode='trunc')
tensor([ 2., -1.])

Fixed torch.index_select on CPU to error that index is out of bounds when the source tensor is empty (#77881)

Prior to 1.13, torch.index_select would return an appropriately sized tensor filled with random values on CPU if the source tensor was empty. In this release, we have fixed this bug so that it errors out. A consequence of this is that torch.nn.Embedding, which utilizes index_select, will error out rather than returning an empty tensor when embedding_dim=0 and the input contains out-of-bounds indices. The old behavior cannot be reproduced with torch.nn.Embedding; however, since an Embedding layer with embedding_dim=0 is a corner case, this behavior is unlikely to be relied upon.

1.12.1

>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
tensor([], size=(1, 0), grad_fn=<EmbeddingBackward0>)

1.13

>>> t = torch.tensor([4], dtype=torch.long)
>>> embedding = torch.nn.Embedding(3, 0)
>>> embedding(t)
RuntimeError: INDICES element is out of DATA bounds, id=4 axis_dim=3

Disallow overflows when tensors are constructed from scalars (#82329)

Prior to this PR, overflows during tensor construction from scalars would not throw an error. In 1.13, such cases will error.

1.12.1

>>> torch.tensor(1000, dtype=torch.int8)
tensor(-24, dtype=torch.int8)

1.13

>>> torch.tensor(1000, dtype=torch.int8)
RuntimeError: value cannot be converted to type int8 without overflow

Error on indexing a cpu tensor with non-cpu indices (#69607)

Prior to 1.13, cpu_tensor[cuda_indices] was a valid program that would return a cpu tensor. The original use case for mixed device indexing was for non_cpu_tensor[cpu_indices], and allowing the opposite was unintentional (cpu_tensor[non_cpu_indices]). This behavior appears to be rarely used, and a refactor of our indexing kernels made it difficult to represent an op that takes in (cpu_tensor, non_cpu_tensor) and returns another cpu_tensor, so it is now an error.

To replicate the old behavior for base[indices], you can ensure that either indices lives on the CPU device, or base and indices both live on the same device.

1.12.1

>>> a = torch.tensor([1.0, 2.0, 3.0])
>>> b = torch.tensor([0, 2], device='cuda')
>>> a[b]
tensor([1., 3.])

1.13

>>> a = torch.tensor([1.0, 2.0, 3.0])
>>> b = torch.tensor([0, 2], device='cuda')
>>> a[b]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
# Old behavior can be replicated by moving b to CPU, or a to CUDA
>>> a[b.cpu()]
tensor([1., 3.])
>>> a.cuda()[b]
tensor([1., 3.], device='cuda:0')

Remove deprecated torch.eig, torch.matrix_rank, torch.lstsq (#70982, #70981, #70980)

The deprecation cycle for the above functions has been completed and they have been removed in the 1.13 release.

torch.nn

Enforce that the bias has the same dtype as input and weight for convolutions on CPU (#83686)

To align with the implementation on other devices, the CPU implementation for convolutions was updated to enforce that the dtype of the bias matches the dtype of the input and weight.

1.12.1

# input and weight are dtype torch.int64
# bias is torch.float32
>>> out = torch.nn.functional.conv2d(input, weight, bias, ...)

1.13

# input and weight are dtype torch.int64
# bias is torch.float32
>>> with assertRaisesError():
>>>    out = torch.nn.functional.conv2d(input, weight, bias, ...)

# Updated code to avoid the error
>>> out = torch.nn.functional.conv2d(input, weight, bias.to(input.dtype), ...)

Autograd

Disallow setting the .data of a tensor that requires_grad=True with an integer tensor (#78436)

Setting the .data of a tensor that requires_grad with an integer tensor now raises an error.

1.12.1

>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
>>> x
tensor([0, 0], requires_grad=True)

1.13

>>> x = torch.randn(2, requires_grad=True)
>>> x.data = torch.randint(1, (2,))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: data set to a tensor that requires gradients must be floating point or complex dtype

Added variable_list support to ExtractVariables struct (#84583)

Prior to this change, custom C++ autograd Functions considered tensors passed in a TensorList not to be tensors for the purposes of recording the backward graph. After this change, custom Functions that receive a TensorList must modify their backward functions to also compute gradients for these additional tensor inputs. Note that this behavior now differs from that of custom autograd Functions in Python.

1.12.1

struct MyFunction : public Function<MyFunction> {
    static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
      return 2 * tensors[0] + 3 * t;
    }

    static variable_list backward(
        AutogradContext* ctx,
        variable_list grad_output) {
      return {3 * grad_output[0]};
    }
};

1.13

struct MyFunction : public Function<MyFunction> {
    static Variable forward(AutogradContext* ctx, at::Tensor t, at::TensorList tensors) {
      return 2 * tensors[0] + 3 * t;
    }

    static variable_list backward(
        AutogradContext* ctx,
        variable_list grad_output) {
      return {3 * grad_output[0], 2 * grad_output[0]};
    }
};

Don't detach when making views; force kernel to detach (#84893)

View operations registered as CompositeExplicitAutograd kernels are no longer allowed to return input tensors as-is. You must explic...

PyTorch 1.12.1 Release, small bug fix release

05 Aug 19:35
664058f

This release is meant to fix the following issues (regressions / silent correctness):

Optim

  • Remove overly restrictive assert in adam #80222

Autograd

  • Convolution forward over reverse internal asserts in specific case #81111
  • 25% Performance regression from v0.1.1 to 0.2.0 when calculating hessian #82504

Distributed

  • Fix distributed store to use add for the counter of DL shared seed #80348
  • Raise proper timeout when sharing the distributed shared seed #81666

NN

  • Allow register float16 weight_norm on cpu and speed up test #80600
  • Fix weight norm backward bug on CPU when OMP_NUM_THREADS <= 2 #80930
  • Weight_norm is not working with float16 #80599
  • New release breaks torch.nn.weight_norm backwards pass and breaks all Wav2Vec2 implementations #80569
  • Disable src mask for transformer and multiheadattention fastpath #81277
  • Make nn.stateless correctly reset parameters if the forward pass fails #81262
  • torchvision.transforms.functional.rgb_to_grayscale() + torch.nn.Conv2d() don't work on 1080 GPU #81106
  • Transformer and CPU path with src_mask raises error with torch 1.12 #81129

Data Loader

  • Locking lower ranks seed recipients #81071

CUDA

  • os.environ["CUDA_VISIBLE_DEVICES"] has no effect #80876
  • share_memory() on CUDA tensors no longer no-ops and instead crashes #80733
  • [Prims] Unbreak CUDA lazy init #80899
  • PyTorch 1.12 cu113 wheels cudnn discoverability issue #80637
  • Remove overly restrictive checks for cudagraph #80881

ONNX

MPS

Other

  • Don't error if _warned_capturable_if_run_uncaptured not set #80345
  • Initializing libiomp5.dylib, but found libomp.dylib already initialized. #78490
  • Assertion error - _dl_shared_seed_recv_cnt - pt 1.12 - multi node #80845
  • Add 3.10 stdlib to torch.package #81261
  • CPU-only c++ extension libraries (functorch, torchtext) built against PyTorch wheels are not fully compatible with PyTorch wheels #80489

PyTorch 1.12: TorchArrow, Functional API for Modules and nvFuser, are now available

28 Jun 16:48
67ece03

PyTorch 1.12 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • New Features
  • Improvements
  • Performance
  • Documentation

Highlights

We are excited to announce the release of PyTorch 1.12! This release is composed of over 3,124 commits made by 433 contributors. Along with 1.12, we are releasing beta versions of AWS S3 Integration, PyTorch Vision Models on Channels Last on CPU, Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16, and the FSDP API. We want to sincerely thank our dedicated community for your contributions.

Summary:

  • Functional Module API to functionally apply module computation with a given set of parameters (see the sketch after this summary)
  • Complex32 and Complex Convolutions in PyTorch
  • DataPipes from TorchData fully backward compatible with DataLoader
  • Functorch with improved coverage for APIs
  • nvFuser, a deep learning compiler for PyTorch
  • Changes to float32 matrix multiplication precision on Ampere and later CUDA hardware
  • TorchArrow, a new beta library for machine learning preprocessing over batch data
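A minimal sketch of the functional module API mentioned in the first bullet, assuming the beta torch.nn.utils.stateless.functional_call entry point (the module and parameter values are illustrative):

import torch
from torch.nn.utils import stateless

module = torch.nn.Linear(3, 3)
x = torch.randn(2, 3)

# Run the module's forward pass with externally supplied parameters,
# without mutating the parameters stored on the module itself.
params = {"weight": torch.zeros(3, 3), "bias": torch.ones(3)}
out = stateless.functional_call(module, params, (x,))
print(out)  # all ones: zero weight plus a bias of one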

Backwards Incompatible changes

Python API

Updated type promotion for torch.clamp (#77035)

In 1.11, the ‘min’ and ‘max’ arguments in torch.clamp did not participate in type promotion, which made it inconsistent with minimum and maximum operations. In 1.12, the ‘min’ and ‘max’ arguments participate in type promotion.

1.11

>>> import torch
>>> a = torch.tensor([1., 2., 3., 4.], dtype=torch.float32)
>>> b = torch.tensor([2., 2., 2., 2.], dtype=torch.float64)
>>> c = torch.tensor([3., 3., 3., 3.], dtype=torch.float64)
>>> torch.clamp(a, b, c).dtype
torch.float32

1.12

>>> import torch
>>> a = torch.tensor([1., 2., 3., 4.], dtype=torch.float32)
>>> b = torch.tensor([2., 2., 2., 2.], dtype=torch.float64)
>>> c = torch.tensor([3., 3., 3., 3.], dtype=torch.float64)
>>> torch.clamp(a, b, c).dtype
torch.float64

Complex Numbers

Fix complex type promotion (#77524)

Updates the type promotion rule such that, given a complex scalar and a real tensor, the value type of the real tensor is preserved.

1.11

>>> a = torch.randn((2, 2), dtype=torch.float)
>>> b = torch.tensor(1, dtype=torch.cdouble)
>>> (a + b).dtype
torch.complex128

1.12

>>> a = torch.randn((2, 2), dtype=torch.float)
>>> b = torch.tensor(1, dtype=torch.cdouble)
>>> (a + b).dtype
torch.complex64

LinAlg

Disable TF32 for matmul by default and add high-level control of fp32 matmul precision (#76509)

PyTorch 1.12 makes the default math mode for fp32 matrix multiplications more precise and consistent across hardware. This may affect users on Ampere or later CUDA devices and TPUs. See the PyTorch blog for more details.
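A short sketch of the new high-level control and the existing backend flag, assuming you want to opt back into the faster TF32 path on Ampere or later GPUs:

import torch

# New in 1.12: "highest" is the default and keeps fp32 matmuls in full precision
print(torch.get_float32_matmul_precision())  # "highest"

# Re-enable the faster, lower-precision TF32 path for fp32 matmuls
torch.set_float32_matmul_precision("high")

# The lower-level backend flag can also be toggled directly
torch.backends.cuda.matmul.allow_tf32 = True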

Sparse

Use ScatterGatherKernel for scatter_reduce (CPU-only) (#74226, #74608)

In 1.11.0, unlike scatter which takes a reduce kwarg or scatter_add, scatter_reduce was not an in-place function. That is, it did not allow the user to pass an output tensor which contains data that is reduced together with the scattered data. Instead, the scatter reduction took place on an output tensor initialized under the hood. Indices of the output that were not scattered to were filled with reduction inits (or 0 for options ‘amin’ and ‘amax’).

In 1.12.0, scatter_reduce (which is in beta) is in-place to align with the API of the related existing functions scatter/scatter_add. For this reason, the argument input in 1.11.0 has been renamed src in 1.12.0 and the new self argument now takes a destination tensor to be scattered onto. Since the destination tensor is no longer initialized under the hood, the output_size kwarg in 1.11.0 that allowed users to specify the size of the output at dimension dim has been removed. Further, in 1.12.0 we introduce an include_self kwarg which determines whether values in the self (destination) tensor are included in the reduction. Setting include_self=True could, for example, allow users to provide special reduction inits for the scatter_reduction operation. Otherwise, if include_self=False, indices scattered to are treated as if they were filled with reduction inits.

In the snippet below, we illustrate how the behavior of scatter_reduce in 1.11.0 can be achieved with the function released in 1.12.0.

Example:

>>> src = torch.arange(6, dtype=torch.float).reshape(3, 2)
>>> index = torch.tensor([[0, 2], [1, 1], [0, 0]])
>>> dim = 1
>>> output_size = 4
>>> reduce = "prod"

1.11

>>> torch.scatter_reduce(src, dim, index, reduce, output_size=output_size)
tensor([[ 0., 1., 1., 1.],
        [ 1., 6., 1., 1.],
        [20., 1., 1., 1.]])

1.12

>>> output_shape = list(src.shape)
>>> output_shape[dim] = output_size
# reduction init for prod is 1
# filling the output with 1 is only necessary if the user wants to preserve the behavior in 1.11
# where indices not scattered to are filled with reduction inits
>>> output = src.new_empty(output_shape).fill_(1)
>>> output.scatter_reduce_(dim, index, src, reduce)
tensor([[ 0., 1., 1., 1.],
        [ 1., 6., 1., 1.],
        [20., 1., 1., 1.]])

torch.nn

nn.GroupNorm: Report an error if num_channels is not divisible by num_groups (#74293)

Previously, nn.GroupNorm would error out during the forward pass if num_channels is not divisible by num_groups. Now, the error is thrown for this case during module construction instead.

1.11

m = torch.nn.GroupNorm(3, 7)
m(...)  # errors during forward pass

1.12

m = torch.nn.GroupNorm(3, 7)  # errors during construction

nn.Dropout2d: Return to 1.10 behavior: perform 1D channel-wise dropout for 3D inputs

In PyTorch 1.10 and older, passing a 3D input to nn.Dropout2D resulted in 1D channel-wise dropout behavior; i.e. such inputs were interpreted as having shape (N, C, L) with N = batch size and C = # channels and channel-wise dropout was performed along the second dimension.

1.10

x = torch.randn(2, 3, 4)
m = nn.Dropout2d(p=0.5)
out = m(x)  # input is assumed to be shape (N, C, L); dropout along the second dim.

With the introduction of no-batch-dim input support in 1.11, 3D inputs were reinterpreted as having shape (C, H, W); i.e. an input without a batch dimension, and dropout behavior was changed to drop along the first dimension. This was a silent breaking change.

1.11

x = torch.randn(2, 3, 4)
m = nn.Dropout2d(p=0.5)
out = m(x)  # input is assumed to be shape (C, H, W); dropout along the first dim.

The breaking change in 1.11 resulted in a lack of support for 1D channel-wise dropout behavior, so Dropout2d in PyTorch 1.12 returns to 1.10 behavior with a warning to give some time to adapt before the no-batch-dim interpretation goes back into effect.

1.12

x = torch.randn(2, 3, 4)
m = nn.Dropout2d(p=0.5)
out = m(x)  # input is assumed to be shape (N, C, L); dropout along the second dim.
            # throws a warning suggesting nn.Dropout1d for 1D channel-wise dropout.

If you want 1D channel-wise dropout behavior, please switch to use of the newly-added nn.Dropout1d module instead of nn.Dropout2d. If you want no-batch-dim input behavior, please note that while this is not supported in 1.12, a future release will reinstate the interpretation of 3D inputs to nn.Dropout2d as those without a batch dimension.

F.cosine_similarity: Improve numerical stability (#31378)

Previously, we first compute the inner product, then normalize. After this change, we first normalize, then compute inner product. This should be more numerically stable because it avoids losing precision in inner product for inputs with large norms. Because of this change, outputs may be different in some cases.
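A small sketch of why the new order is more numerically stable for inputs with large norms (the values are chosen only to force float32 overflow):

import torch
import torch.nn.functional as F

a = torch.tensor([[1e20, 1e20]])
b = torch.tensor([[1e20, 1e20]])

# Old order: the inner product overflows to inf in float32, and dividing by the
# (also overflowed) product of norms yields nan.
naive = (a * b).sum(dim=1) / (a.norm(dim=1) * b.norm(dim=1))
print(naive)  # tensor([nan])

# New order: normalize first, then take the inner product.
print(F.cosine_similarity(a, b))  # approximately tensor([1.])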

Composability

Functions in torch.ops.aten.{foo} no longer accept self as a kwarg

torch.ops.aten.{foo} objects are now instances of OpOverloadPacket (instead of functions), whose __call__ method is implemented in Python, which means that you cannot pass self as a kwarg. You can pass it normally as a positional argument instead.

1.11

>>> torch.ops.aten.sin(self=torch.ones(2))
    tensor([0.8415, 0.8415])

1.12

# this now fails
>>> torch.ops.aten.sin(self=torch.ones(2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __call__() got multiple values for argument 'self'
# this works
>>> torch.ops.aten.sin(torch.ones(2))
tensor([0.8415, 0.8415])

torch_dispatch now traces individual op overloads instead of op overload packets (#72673)

torch.ops.aten.add actually corresponds to a bundle of functions from C++, corresponding to all of the overloads of the add operator (specifically, add.Tensor, add.Scalar and add.out). Now, __torch_dispatch__ will directly take in an overload corresponding to a single aten function.

1.11

class MyTensor(torch.Tensor):
    ....
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Before, func refers to a "packet" of all overloads
        # for a given operator, e.g. "add"
        assert func == torch.ops.aten.add

1.12

class MyTensor(torch.Tensor):
    ....
    def __torch_dispatch__(cls, func, types, args=(), kwargs=No...

PyTorch 1.11, TorchData, and functorch are now available

10 Mar 16:59
bc2c6ed

PyTorch 1.11 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • New Features
  • Improvements
  • Performance
  • Documentation

Highlights

We are excited to announce the release of PyTorch 1.11. This release is composed of over 3,300 commits since 1.10, made by 434 contributors. Along with 1.11, we are releasing beta versions of TorchData and functorch. We want to sincerely thank our community for continuously improving PyTorch.

  • TorchData is a new library for common modular data loading primitives for easily constructing flexible and performant data pipelines. View it on GitHub.
  • functorch, a library that adds composable function transforms to PyTorch, is now available in beta. View it on GitHub.
  • Distributed Data Parallel (DDP) static graph optimizations available in stable.

You can check the blogpost that shows the new features here.

Backwards Incompatible changes

Python API

Fixed python deepcopy to correctly copy all attributes on Tensor objects (#65584)

This change ensures that the deepcopy operation on Tensor properly copies all the attributes (and not just the plain Tensor properties).

1.10.2

a = torch.rand(2)
a.foo = 3
torch.save(a, "bar")
b = torch.load("bar")
print(b.foo)
# Raise AttributeError: "Tensor" object has no attribute "foo"

1.11.0

a = torch.rand(2)
a.foo = 3
torch.save(a, "bar")
b = torch.load("bar")
print(b.foo)
# 3

steps argument is no longer optional in torch.linspace and torch.logspace

This argument used to default to 100 in PyTorch 1.10.2, but was deprecated (previously, you would see a deprecation warning if you didn't explicitly pass in steps). In PyTorch 1.11, it is no longer optional.

1.10.2

# Works, but raises a deprecation warning
# Steps defaults to 100
a = torch.linspace(1, 10)
# UserWarning: Not providing a value for linspace's steps is deprecated
# and will throw a runtime error in a future release.
# This warning will appear only once per process.
# (Triggered internally at ../aten/src/ATen/native/RangeFactories.cpp:19)

1.11.0

# In 1.11, you must specify steps
a = torch.linspace(1, 10, steps=100)

Remove torch.hub.import_module function that was mistakenly public (#67990)

This function is not intended for public use.
If you have existing code that relies on it, you can find an equivalent function at torch.hub._import_module.

C++ API

We’ve cleaned up many of the headers in the C++ frontend to only include the subset of aten operators that they actually used (#68247, #68687, #68688, #68714, #68689, #68690, #68697, #68691, #68692, #68693, #69840)

When you #include a header from the C++ frontend, you can no longer assume that every aten operator is transitively included. You can work around this by directly adding #include <ATen/ATen.h> in your file, which will maintain the old behavior of including every aten operator.

Custom implementation for c10::List and c10::Dict move constructors have been removed (#69370)

The semantics have changed from "make the moved-from List/Dict empty" to "keep the moved-from List/Dict unchanged"

1.10.2

c10::List list1({"3", "4"});
c10::List list2(std::move(list1));
std::cout << list1.size() // 0

1.11.0

c10::List list1({"3", "4"});
c10::List list2(std::move(list1)); // calls copy ctor
std::cout << list1.size() // 2

CUDA

Removed THCeilDiv function and corresponding THC/THCDeviceUtils.cuh header (#65472)

As part of cleaning up TH from the codebase, the THCeilDiv function has been removed. Instead, please use at::ceil_div, and include the corresponding ATen/ceil_div.h header

Removed THCudaCheck (#66391)

You can replace it with C10_CUDA_CHECK, which has been available since at least PyTorch 1.4, so just replacing is enough even if you support older versions

Removed THCudaMalloc(), THCudaFree(), THCThrustAllocator.cuh (#65492)

If your extension is using THCThrustAllocator.cuh, please replace it with ATen/cuda/ThrustAllocator.h and corresponding APIs (see examples in this PR).

This PR also removes the THCudaMalloc/THCudaFree calls. Please use c10::cuda::CUDACachingAllocator::raw_alloc(size)/raw_delete(ptr), or, preferably, switch to c10::cuda::CUDACachingAllocator::allocate, which manages deallocation. Caching allocator APIs are available since PyTorch 1.2, so just replacing them is enough even if you support older versions of PyTorch.

Build

Stopped building shared library for AOT Compiler, libaot_compiler.so (#66227)

Building aot_compiler.cpp as a separate library is not necessary, as it’s already included in libtorch.so.
You can update your build system to only dynamically link libtorch.so.

Mobile

Make typing.Union type unsupported for mobile builds (#65556)

typing.Union support was added for TorchScript in 1.10. It was removed specifically for mobile due to its lack of use and increase in binary size of PyTorch for Mobile builds.

Distributed

torch.distributed.rpc: Final Removal of ProcessGroup RPC backend (#67363)

ProcessGroup RPC backend is deprecated. In 1.10, it threw an error to help users update their code, and, in 1.11, it is removed completely.

The backend type “PROCESS_GROUP” is now deprecated, e.g.
torch.distributed.rpc.init_rpc("worker0", backend="PROCESS_GROUP", rank=0, world_size=1)
and should be replaced with:
torch.distributed.rpc.init_rpc("worker0", backend="TENSORPIPE", rank=0, world_size=1)

Quantization

Disabled the support for getitem in FX Graph Mode Quantization (#66647)

getitem used to be quantized in FX Graph Mode Quantization, and it is no longer quantized. This won’t break any models but could result in a slight difference in numerics.

1.10.2

from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
    def forward(self, x):
        x = self.linear(x)
        y = torch.stack([x], 0)
        return y[0]
m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
#   (linear): QuantizedLinear(in_features=5, out_features=5,
#      scale=1.0, zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
#     linear_input_scale_0 = self.linear_input_scale_0
#     linear_input_zero_point_0 = self.linear_input_zero_point_0
#     quantize_per_tensor = torch.quantize_per_tensor(x,
#         linear_input_scale_0, linear_input_zero_point_0, torch.quint8)
#     x = linear_input_scale_0 = linear_input_zero_point_0 = None
#     linear = self.linear(quantize_per_tensor)
#     quantize_per_tensor = None
#     stack = torch.stack([linear], 0);  linear = None
#     getitem = stack[0]; stack = None
#     dequantize_2 = getitem.dequantize();  getitem = None
#     return getitem

1.11.0

from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
    def forward(self, x):
        x = self.linear(x)
        y = torch.stack([x], 0)
        return y[0]
m = M().eval()
m = prepare_fx(m, {"": torch.ao.quantization.default_qconfig})
m = convert_fx(m)
print(m)
# prints
# GraphModule(
#   (linear): QuantizedLinear(in_features=5, out_features=5, scale=1.0,
#                     zero_point=0, qscheme=torch.per_tensor_affine)
# )
# def forward(self, x):
#     linear_input_scale_0 = self.linear_input_scale_0
#     linear_input_zero_point_0 = self.linear_input_zero_point_0
#     quantize_per_tensor = tor...

PyTorch 1.10.2 Release, small bug fix release

27 Jan 21:51
71f889c

This release is meant to deploy additional fixes not included in 1.10.1 release:

  • fix pybind issue for get_autocast_cpu_dtype and get_autocast_gpu_dtype #66396
  • Remove fgrad_input from slow_conv2d #64280
  • fix formatting CIRCLE_TAG when building docs #67026

PyTorch 1.10.1 Release, small bug fix release

15 Dec 22:27
302ee7b

This release is meant to fix the following issues (regressions / silent correctness):

  • torch.nn.cross_entropy silently incorrect in PyTorch 1.10 on CUDA on non-contiguous inputs #67167
  • channels_last significantly degrades accuracy #67239
  • Potential strict aliasing rule violation in bitwise_binary_op (on ARM/NEON) #66119
  • torch.get_autocast_cpu_dtype() returns a new dtype #65786
  • Conv2d grad bias gets wrong value for bfloat16 case #68048

The release tracker should contain all relevant pull requests related to this release as well as links to related issues

PyTorch 1.10 Release, including CUDA Graphs APIs, Frontend and compiler improvements

21 Oct 15:49
36449ea

1.10.0 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • New Features
  • Improvements
  • Performance
  • Documentation

Highlights

We are excited to announce the release of PyTorch 1.10. This release is composed of over 3,400 commits since 1.9, made by 426 contributors. We want to sincerely thank our community for continuously improving PyTorch.

PyTorch 1.10 updates are focused on improving training and performance of PyTorch, and developer usability. Highlights include:

  • CUDA Graphs APIs are integrated to reduce CPU overheads for CUDA workloads (see the sketch after this list).
  • Several frontend APIs such as FX, torch.special, and nn.Module Parametrization, have moved from beta to stable.
  • Support for automatic fusion in JIT Compiler expands to CPUs in addition to GPUs.
  • Android NNAPI support is now available in beta.
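A minimal sketch of the capture/replay flow (the model and shapes are illustrative, and a CUDA device is required):

import torch

model = torch.nn.Linear(64, 64).cuda()
static_in = torch.randn(8, 64, device="cuda")

# Warm up on a side stream before capture, as recommended in the docs
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        _ = model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph, then replay it with low CPU overhead
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

static_in.copy_(torch.randn(8, 64, device="cuda"))
g.replay()               # re-runs the captured kernels on the new input data
print(static_out.shape)  # torch.Size([8, 64])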

You can check the blogpost that shows the new features here.

Backwards Incompatible changes

Python API

torch.any/torch.all behavior changed slightly to be more consistent for zero-dimension, uint8 tensors. (#64642)

These two functions match the behavior of NumPy, returning an output dtype of bool for all supported dtypes, except for uint8 (in which case they return a 1 or a 0, but with uint8 dtype). In some cases with 0-dim tensor inputs, the returned uint8 value could mistakenly take on a value > 1. This has now been fixed.

1.9.1

>>> torch.all(torch.tensor(42, dtype=torch.uint8))
tensor(1, dtype=torch.uint8)
>>> torch.all(torch.tensor(42, dtype=torch.uint8), dim=0)
tensor(42, dtype=torch.uint8) # wrong, old behavior

1.10.0

>>> torch.all(torch.tensor(42, dtype=torch.uint8))
tensor(1, dtype=torch.uint8)
>>> torch.all(torch.tensor(42, dtype=torch.uint8), dim=0)
tensor(1, dtype=torch.uint8) # new, corrected and consistent behavior

Remove deprecated torch.{is,set}_deterministic (#62158)

This is the end of the deprecation cycle for both of these functions. You should be using torch.use_deterministic_algorithms and torch.are_deterministic_algorithms_enabled instead.
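A minimal migration sketch from the removed functions to their replacements:

import torch

# Replaces the removed torch.set_deterministic(True) / torch.is_deterministic()
torch.use_deterministic_algorithms(True)
print(torch.are_deterministic_algorithms_enabled())  # True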

Complex Numbers

Conjugate View: tensor.conj() now returns a view tensor that aliases the same memory and has conjugate bit set (#54987, #60522, #66082, #63602).

This means that .conj() is now an O(1) operation and returns a tensor that views the same memory as tensor and has conjugate bit set. This notion of conjugate bit enables fusion of operations with conjugation which gives a lot of performance benefit for operations like matrix multiplication. All out-of-place operations will have the same behavior as before, but an in-place operation on a conjugated tensor will additionally modify the input tensor.

1.9.1

>>> import torch
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> y.add_(2)
>>> print(x)
tensor([1.+2.j])

1.10.0

>>> import torch
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> y.add_(2)
>>> print(x)
tensor([3.+2.j])

Note: You can verify if the conj bit is set by calling tensor.is_conj(). The conjugation can be resolved, i.e., you can obtain a new tensor that doesn’t share storage with the input tensor at any time by calling conjugated_tensor.clone() or conjugated_tensor.resolve_conj() .

Note that these conjugated tensors behave differently from the corresponding numpy arrays obtained from np.conj() when an in-place operation is performed on them (similar to the example shown above).
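A quick sketch of the inspection and resolution calls mentioned above:

>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> y.is_conj()
True
>>> z = y.resolve_conj()  # materializes the conjugation into fresh memory
>>> z.is_conj()
False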

Negative View: tensor.conj().neg() returns a view tensor that aliases the same memory as both tensor and tensor.conj() and has a negative bit set (#56058).

conjugated_tensor.neg() continues to be an O(1) operation, but the returned tensor shares memory with both tensor and conjugated_tensor.

1.9.1

>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> z = y.imag
>>> z.add_(2)
>>> print(x)
tensor([1.+2.j])

1.10.0

>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> z = y.imag
>>> print(z.is_neg())
True
>>> z.add_(2)
>>> print(x)
tensor([1.-0.j])

tensor.numpy() now throws RuntimeError when called on a tensor with conjugate or negative bit set (#61925).

Because the notion of conjugate bit and negative bit doesn't exist outside of PyTorch, calling operations that return a Python object viewing the same memory as the input (like .numpy()) no longer works for tensors with the conjugate or negative bit set.

1.9.1

>>> x = torch.tensor([1+2j])
>>> y = x.conj().imag
>>> print(y.numpy())
[2.]

1.10.0

>>> x = torch.tensor([1+2j])
>>> y = x.conj().imag
>>> print(y.numpy())
RuntimeError: Can't call numpy() on Tensor that has negative
bit set. Use tensor.resolve_neg().numpy() instead.

Autograd

Raise TypeError instead of RuntimeError when assigning to a Tensor’s grad field with wrong type (#64876)

Setting the .grad field with a non-None and non-Tensor object used to raise a RuntimeError, but it now properly raises a TypeError. If your code was catching this error, you should simply update it to catch a TypeError instead of a RuntimeError.

1.9.1

try:
    # Assigning an int to a Tensor's grad field
    a.grad = 0
except RuntimeError as e:
    pass

1.10.0

try:
    a.grad = 0
except TypeError as e:
    pass

Raise error when inputs to autograd.grad are empty (#52016)

Calling autograd.grad with an empty list of inputs used to do the same as backward. To reduce confusion, it now raises the expected error. If you were relying on this, you can simply update your code as follows:

1.9.1

grad = autograd.grad(out, tuple())
assert grad == tuple()

1.10.0

out.backward()

Optional arguments to autograd.gradcheck and autograd.gradgradcheck are now kwarg-only (#65290)

These two functions now have a significant number of optional arguments controlling what they do (i.e., eps, atol, rtol, raise_exception, etc.). To improve readability, we made these arguments kwarg-only. If you are passing these arguments to autograd.gradcheck or autograd.gradgradcheck as positional arguments, you can update your code as follows:

1.9.1

torch.autograd.gradcheck(fn, x, 1e-6)

1.10.0

torch.autograd.gradcheck(fn, x, eps=1e-6)

In-place detach (detach_) now errors for views that return multiple outputs (#58285)

This change finishes the deprecation cycle for the inplace-over-view logic. In particular, a few things that previously only warned are updated:

* `detach_` will now raise an error when invoked on any view created by `split`, `split_with_sizes`, or `chunk`. You should use the non-inplace `detach` instead.
* The error message for when an in-place operation (that is not detach) is performed on a view created by `split`, `split_with_sizes`, and `chunk` has been changed from "This view is an output of a function..." to "This view is the output of a function...".

1.9.1

b = a.split(1)[0]
b.detach_()

1.10.0

b = a.split(1)[0]
c = b.detach()

Fix saved variable unpacking version counter (#60195)

In-place operations on unpacked SavedVariables used to be ignored. They are now properly detected, which can lead to errors saying that a variable needed for backward was modified in-place.
This is a valid error and the ...