Releases: pytorch/pytorch

Small bug fix release

25 Mar 16:07
56b43f4

PyTorch 1.8.1 Release Notes

  • New Features
  • Improvements
  • Bug Fixes
  • Documentation

New Features

Revamp of profiling tools in torch.profiler

The torch.profiler submodule is now available. It leverages the newly released Kineto library for profiling.
You can find more details in this blogpost: https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/

Enable use of autocast for PyTorch XLA (#48570)

The torch.cuda.amp.autocast context manager can now be used in conjunction with PyTorch XLA to provide easy mixed-precision training.

Improvements

  • Make torch. submodule import more autocomplete-friendly (#52339)
  • Add support in ONNX for torch.{isinf,any,all} (#53529)
  • Replace thrust with cub in GPU implementation of torch.randperm for performance (#54537)

Bug Fixes

Misc

  • Fixes for torch.distributions validation checks (#53763)
  • Allow changing the padding vector for nn.Embedding (#53447)
  • Fix TensorPipe for large copies and interoperability with CUDA (#53804)
  • Properly de-sugar Ellipsis in TorchScript (#53766)
  • Stop using OneDNN for group convolutions when groups size is a multiple of 24 (#54015)
  • Use int8_t instead of char in {load,store}_scalar (#52616)
  • Make ideep honor torch.set_num_threads (#53871)
  • Fix dimension out of range in pixel_{un}shuffle (#54178)
  • Update kineto to fix libtorch builds (#54205)
  • Fix distributed autograd CUDA stream synchronization for send/recv operations (#54358)

ONNX

  • Update error handling in ONNX to avoid ValueError (#53548)
  • Update assign output shape for nested structure and dict output (#53311)
  • Update embedding export wrt padding_idx (#53931)

Documentation

  • Doc update for torch.fx (#53674)
  • Fix distributed.rpc.options.TensorPipeRpcBackendOptions.set_device_map (#53508)
  • Update example for nn.LSTMCell (#51983)
  • Update doc for the padding_idx argument for nn.Embedding (#53809)
  • Update general doc template (#54141)

PyTorch 1.8 Release, including Compiler and Distributed Training updates, New Mobile Tutorials and more

04 Mar 20:44
37c1f4a
Compare
Choose a tag to compare

PyTorch 1.8.0 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • New Features
  • Improvements
  • Performance
  • Documentation

Highlights

We are excited to announce the availability of PyTorch 1.8. This release is composed of more than 3,000 commits since 1.7. It includes major updates and new features for compilation, code optimization, frontend APIs for scientific computing, and AMD ROCm support through binaries that are available via pytorch.org. It also provides improved features for large-scale training for pipeline and model parallelism, and gradient compression. A few of the highlights include:

  1. Support for doing python to python functional transformations via torch.fx;
  2. Added or stabilized APIs to support FFTs (torch.fft), Linear Algebra functions (torch.linalg), added support for autograd for complex tensors and updates to improve performance for calculating hessians and jacobians; and
  3. Significant updates and improvements to distributed training including: Improved NCCL reliability; Pipeline parallelism support; RPC profiling; and support for communication hooks adding gradient compression. See the full release notes here.

Along with 1.8, we are also releasing major updates to PyTorch libraries including TorchCSPRNG, TorchVision, TorchText and TorchAudio. For more on the library releases, see the post here. As previously noted, features in PyTorch releases are classified as Stable, Beta and Prototype. You can learn more about the definitions in the post here.

You can find more details on all the highlighted features in the PyTorch 1.8 Release blogpost.

Backwards Incompatible changes

Fix Tensor inplace modulo in python (#49390)

In-place modulo in Python, %=, was wrongly done out of place for Tensors. This change fixes the behavior.
Code that relied on this operation being done out of place should be updated to use the out-of-place version t = t % other instead of t %= other.

1.7.1:

>>> a = torch.arange(0, 10)
>>> b = a
>>> b %= 3
>>> print(a)
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> print(b)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])

1.8.0:

>>> a = torch.arange(0, 10)
>>> b = a
>>> b %= 3
>>> print(a)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
>>> print(b)
tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])

Standardize torch.clamp edge cases (#43288)

For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.

This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp already computed this in its vectorized CPU implementation but used different approaches for other backends.
These implementations agree when a_min < a_max, but diverge when a_min > a_max. This divergence is easily triggered:

>>> t = torch.arange(200).to(torch.float)
>>> torch.clamp(t, 4, 2)[0]
tensor(2.)

>>> torch.clamp(t.cuda(), 4, 2)[0]
tensor(4., device='cuda:0')

>>> torch.clamp(torch.tensor(0), 4, 2)
tensor(4)

This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp's behavior is undefined when a_min > a_max. Python has no standard clamp implementation.

Tensor deepcopy now properly copies the .grad field (#50663)

The deepcopy protocol will now properly copy the .grad field of Tensors when it exists.
The old behavior can be recovered by setting the .grad field to None after doing the deepcopy.

1.7.1:

>>> t.grad
tensor([0.8883, 0.5765])
>>> deepcopy(t).grad
None

1.8.0:

>>> t.grad
tensor([0.8883, 0.5765])
>>> deepcopy(t).grad
tensor([0.8883, 0.5765])

Fix torch.fmod type promotion (#47323, #48278)

1.7.1
Raises RuntimeError for integral tensor and floating-point tensor.
The dtype of output is determined by the first input.

>>> x = torch.arange(start=1, end=6, dtype=torch.int32) # tensor([1, 2, 3, 4, 5])
>>> y = torch.arange(start=1.1, end=2.1, step=0.2, dtype=torch.float32) # tensor([1.1, 1.3, 1.5, 1.7, 1.9])
>>> torch.fmod(x, y)
RuntimeError: result type Float can't be cast to the desired output type Int
>>> z = torch.arange(start=0.2, end=1.1, step=0.2, dtype=torch.float64) # tensor([0.2, 0.4, 0.6, 0.8, 1.], dtype=torch.float64)
>>> torch.fmod(y, z).dtype
torch.float32
>>> torch.fmod(z, y).dtype
torch.float64
>>> torch.fmod(x, 1.2)
tensor([0, 0, 0, 0, 0], dtype=torch.int32)

1.8.0:
Support integral tensor and floating-point tensor as inputs.
The dtype of output is determined by both inputs.

>>> x = torch.arange(start=1, end=6, dtype=torch.int32) # tensor([1, 2, 3, 4, 5])
>>> y = torch.arange(start=1.1, end=2.1, step=0.2, dtype=torch.float32) # tensor([1.1, 1.3, 1.5, 1.7, 1.9])
>>> torch.fmod(x, y)
tensor([1.0000, 0.7000, 0.0000, 0.6000, 1.2000])
>>> z = torch.arange(start=0.2, end=1.1, step=0.2, dtype=torch.float64) # tensor([0.2, 0.4, 0.6, 0.8, 1.], dtype=torch.float64)
>>> torch.fmod(y, z).dtype
torch.float64
>>> torch.fmod(z, y).dtype
torch.float64
>>> torch.fmod(x, 1.2)
tensor([1.0000, 0.8000, 0.6000, 0.4000, 0.2000])

Preserve non-dense or overlapping tensor's layout in *_like functions (#46046)

All the *_like factory functions will now generate the same striding as out of place operations would.
This means in particular that non-contiguous tensors will produce non-contiguous outputs.
If you require a contiguous output, you can pass the memory_format=torch.contiguous_format keyword argument to the factory function. Such factory functions include clone, to, float, cuda, *_like, zeros, rand{n}, etc.
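
For example, a minimal sketch of the opt-out described above (the exact printed values are illustrative):

>>> t = torch.randn(4, 3).t()    # non-contiguous (transposed) tensor
>>> torch.zeros_like(t).is_contiguous()    # the new output layout follows the input's
False
>>> torch.zeros_like(t, memory_format=torch.contiguous_format).is_contiguous()
True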

Make output of torch.norm and torch.linalg.norm consistent for complex inputs (#48284)

Previously, when given a complex input, torch.linalg.norm and torch.norm would return a complex output. torch.linalg.cond would sometimes return a complex output and sometimes return a real output when given a complex input, depending on its p argument. This PR changes this behavior to match numpy.linalg.norm and numpy.linalg.cond, so that a complex input will result in a real number type, consistent with NumPy.

Make torch.svd return V, not V.conj() for complex inputs (#51012)

torch.svd added support for complex inputs in PyTorch 1.7, but was not documented as doing so. The complex V tensor returned was actually the complex conjugate of what's expected. This PR fixes the discrepancy.
Users that were already using the previous version of torch.svd with complex inputs can recover the previous behavior by taking the complex conjugate of the returned V.
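
A minimal sketch of that recovery (complex_input stands for any complex matrix; the name is illustrative):

>>> U, S, V = torch.svd(complex_input)
>>> V_old = V.conj()    # matches the V that PyTorch 1.7 returned for complex inputs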

torch.angle: properly handle pure real numbers (#49163)

This PR updates PyTorch's torch.angle operator to be consistent with NumPy's. Previously torch.angle would return zero for all real inputs (including NaN). Now angle returns pi for negative real inputs, zero for non-negative real inputs, and propagates NaNs.
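
A quick illustration of the new convention:

>>> x = torch.tensor([-2.0, 0.0, 3.0, float('nan')])
>>> torch.angle(x)
# approximately: pi, 0, 0, nan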

Enable distribution validation by default for torch.distributions (#48743)

This may slightly slow down some models. Concerned users may disable validation by using torch.distributions.Distribution.set_default_validate_args(False) or by disabling individual distribution validation via MyDistribution(..., validate_args=False).

This may cause new ValueErrors in models that rely on unsupported behavior, e.g. Categorical.log_prob() applied to continuous-valued tensors (only {0,1}-valued tensors are supported).
Such models should be fixed but the previous behavior can be recovered by disabling argument validation using the methods mentioned above.
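
A minimal sketch of the two opt-outs named above:

import torch
from torch.distributions import Categorical, Distribution

Distribution.set_default_validate_args(False)      # disable validation globally

d = Categorical(probs=torch.tensor([0.3, 0.7]),    # ...or per distribution instance
                validate_args=False)
print(d.log_prob(torch.tensor([0, 1])))            # log(0.3), log(0.7)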

Prohibit assignment to a sparse tensor (#50040)

Assigning to a sparse Tensor did not work properly and resulted in a no-op. The following code now properly raises an error:

>>> t = torch.rand(10).to_sparse()
>>> t[0] = 42
TypeError: Cannot assign to a sparse tensor

C++ API: operators that take a list of optional Tensors cannot be called with ArrayRef<Tensor> anymore (#49138)

This PR changes the C++ API representation of lists of optional Tensors (e.g. in the Tensor::index method) from ArrayRef<Tensor> to List<optional<Tensor>>. This change breaks backwards compatibility, since there is no implicit conversion from ArrayRef<Tensor> to List<optional<Tensor>>.

A common call pattern is tensor.index({indices_tensor}), where indices_tensor is a Tensor. This will continue to wo...


Bug fix release with updated binaries for Python 3.9 and cuDNN 8.0.5

10 Dec 17:19
57bffc3

PyTorch 1.7.1 Release Notes

  • New Features
  • Critical Fixes
  • Other Fixes

New Features

Add Python 3.9 binaries for linux and macOS (#48133) and Windows (#48218)

NOTE: Conda installs for Python 3.9 will require the conda-forge channel, example:
conda install -y -c pytorch -c conda-forge pytorch.

Upgrade CUDA binaries to use cuDNN 8.0.5 (builder repo #571)

This upgrade fixes regressions on Ampere cards introduced in cuDNN 8.0.4.
It improves performance for RTX 3090 cards and may improve performance on other RTX 30-series cards.

Critical Fixes

Python 3.9

  • Use custom version of pybind11 to work around Python 3.9 issues (#48312)
  • Fix jit Python 3.9 parsing (#48744)
  • Fix cpp_extension to work with Python 3.9 (#48768)

Build

  • Fix cpp_extension to properly handle env variable on Windows (#48937)
  • Properly package libomp.dylib for macOS binaries (#48337)
  • Fix build for statically linked OpenBLAS on aarch64 (#48819)

Misc

  • torch.sqrt: fix wrong output values for very large complex input (#48216)
  • max_pool1d: fix for discontiguous inputs (#48219)
  • collect_env: fix detection of DEBUG flag (#48319)
  • collect_env: Fix to work when PyTorch is not installed (#48311)
  • Fix amp memory usage when running in no_grad() mode (#48936)
  • nn.ParameterList and nn.ParameterDict: Remove spurious warnings (#48215)
  • Tensor Expression fuser bugfixes (#48137)

Other Fixes

  • Tensor Expression fix for CUDA 11.0 (#48309)
  • torch.overrides: doc fix (#47843)
  • torch.max: Fix output type for Tensor subclasses (#47735)
  • torch.mul: Add support for boolean Tensors (#48310)
  • Add user friendly error when trying to compile from source with Python 2 (#48317)

PyTorch 1.7 released w/ CUDA 11, New APIs for FFTs, Windows support for Distributed training and more

27 Oct 16:35
e85d494

PyTorch 1.7.0 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • New Features
  • Improvements
  • Performance
  • Documentation

Highlights

The PyTorch 1.7 release includes a number of new APIs including support for NumPy-Compatible FFT operations, profiling tools and major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training. In addition, several features moved to stable including custom C++ Classes, the memory profiler, the creation of custom tensor-like objects, user async functions in RPC and a number of other features in torch.distributed such as Per-RPC timeout, DDP dynamic bucketing and RRef helper.

A few of the highlights include:

  • CUDA 11 is now officially supported with binaries available at PyTorch.org
  • Updates and additions to profiling and performance for RPC, TorchScript and Stack traces in the autograd profiler
  • (Beta) Support for NumPy compatible Fast Fourier transforms (FFT) via torch.fft
  • (Prototype) Support for Nvidia A100 generation GPUs and native TF32 format
  • (Prototype) Distributed training on Windows now supported

To reiterate, starting PyTorch 1.6, features are now classified as stable, beta and prototype. You can see the detailed announcement here. Note that the prototype features listed in this blog are available as part of this release.

Front End APIs

[Beta] NumPy Compatible torch.fft module

FFT-related functionality is commonly used in a variety of scientific fields like signal processing. While PyTorch has historically supported a few FFT-related functions, the 1.7 release adds a new torch.fft module that implements FFT-related functions with the same API as NumPy.

This new module must be imported to be used in the 1.7 release, since its name conflicts with the historic (and now deprecated) torch.fft function.

Example usage:

>>> import torch.fft
>>> t = torch.arange(4)
>>> t
tensor([0, 1, 2, 3])

>>> torch.fft.fft(t)
tensor([ 6.+0.j, -2.+2.j, -2.+0.j, -2.-2.j])

>>> t = torch.tensor([0.+1.j, 2.+3.j, 4.+5.j, 6.+7.j])
>>> torch.fft.fft(t)
tensor([12.+16.j, -8.+0.j, -4.-4.j,  0.-8.j])
  • Documentation | Link

[Beta] C++ Support for Transformer NN Modules

Since PyTorch 1.5, we’ve continued to maintain parity between the Python and C++ frontend APIs. This update allows developers to use the nn.transformer module abstraction from the C++ frontend. Moreover, developers no longer need to save a module from Python/JIT and load it into C++, as it can now be used in C++ directly.

  • Documentation | Link

[Beta] torch.set_deterministic

Reproducibility (bit-for-bit determinism) may help identify errors when debugging or testing a program. To facilitate reproducibility, PyTorch 1.7 adds the torch.set_deterministic(bool) function that can direct PyTorch operators to select deterministic algorithms when available, and to throw a runtime error if an operation may result in nondeterministic behavior. By default, the flag this function controls is false and there is no change in behavior, meaning PyTorch may implement its operations nondeterministically by default.

More precisely, when this flag is true:

  • Operations known to not have a deterministic implementation throw a runtime error;
  • Operations with deterministic variants use those variants (usually with a performance penalty versus the non-deterministic version); and
  • torch.backends.cudnn.deterministic = True is set.

Note that this is necessary, but not sufficient, for determinism within a single run of a PyTorch program. Other sources of randomness like random number generators, unknown operations, or asynchronous or distributed computation may still cause nondeterministic behavior.

See the documentation for torch.set_deterministic(bool) for the list of affected operations.
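
A minimal sketch using the 1.7 API name (later releases rename this to torch.use_deterministic_algorithms):

import torch

torch.set_deterministic(True)
print(torch.is_deterministic())   # True

# Operations with deterministic variants now select them; operations known to be
# nondeterministic raise a RuntimeError instead of silently varying between runs
# (see the documentation for the exact list of affected operations).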

Performance & Profiling

[Beta] Stack traces added to profiler

Users can now see not only operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability. The user uses the autograd profiler as before but with optional new parameters: with_stack and group_by_stack_n. Caution: regular profiling runs should not use this feature as it adds significant overhead.
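
A hedged sketch of the opt-in (the model and input are placeholders):

import torch
import torch.autograd.profiler as profiler

model = torch.nn.Linear(16, 16)
x = torch.randn(8, 16)

with profiler.profile(with_stack=True) as prof:
    model(x)

# Group the summary by the top 5 stack frames of each operator call.
print(prof.key_averages(group_by_stack_n=5).table(sort_by="self_cpu_time_total"))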

Distributed Training & RPC

[Stable] TorchElastic now bundled into PyTorch docker image

TorchElastic offers a strict superset of the current torch.distributed.launch CLI with added features for fault tolerance and elasticity. If the user is not interested in fault tolerance, they can get exact functionality/behavior parity by setting max_restarts=0, with the added convenience of auto-assigned RANK and MASTER_ADDR|PORT (versus manually specified in torch.distributed.launch).

By bundling torchelastic in the same docker image as PyTorch, users can start experimenting with torchelastic right away without having to separately install it. Beyond convenience, this work also helps when adding support for elastic parameters in Kubeflow's existing distributed PyTorch operators.

  • Usage examples and how to get started | Link

[Beta] Support for uneven dataset inputs in DDP

PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using torch.nn.parallel.DistributedDataParallel to enable training with uneven dataset sizes across different processes. This feature brings greater flexibility when using DDP and prevents the user from having to manually ensure dataset sizes are the same across processes. With this context manager, DDP handles uneven dataset sizes automatically, which can prevent errors or hangs at the end of training.
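
A hedged sketch of the context manager (assumes this runs under a distributed launcher that sets the usual environment variables, with a loader whose length may differ per rank):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo", init_method="env://")
model = DDP(torch.nn.Linear(4, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [torch.randn(2, 4) for _ in range(3 + dist.get_rank())]  # uneven per rank

with model.join():                       # handles ranks that finish early
    for batch in loader:
        optimizer.zero_grad()
        model(batch).sum().backward()
        optimizer.step()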

[Beta] NCCL Reliability - Async Error/Timeout Handling

In the past, NCCL training runs would hang indefinitely due to stuck collectives, leading to a very unpleasant experience for users. This feature will abort stuck collectives and throw an exception/crash the process if a potential hang is detected. When used with something like torchelastic (which can recover the training process from the last checkpoint), users can have much greater reliability for distributed training. This feature is completely opt-in and sits behind an environment variable that needs to be explicitly set in order to enable this functionality (otherwise users will see the same behavior as before).
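
A hedged sketch of the opt-in; the environment variable name below is an assumption based on the torch.distributed documentation for this release, so verify it against your version. It must be set before the process group is created:

import os
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"   # assumed opt-in variable name

import torch.distributed as dist
dist.init_process_group("nccl", init_method="env://")
# A collective that stays stuck past the timeout now aborts and raises an
# exception instead of hanging indefinitely.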

[Beta] TorchScript remote and rpc_sync

torch.distributed.rpc.rpc_async has been available in TorchScript in prior releases. For PyTorch 1.7, this functionality is extended to the remaining two core RPC APIs, torch.distributed.rpc.rpc_sync and torch.distributed.rpc.remote. This completes the major RPC APIs targeted for support in TorchScript; it allows users to use the existing Python RPC APIs within TorchScript (in a script function or script method, which releases the Python Global Interpreter Lock) and could possibly improve application performance in multithreaded environments.

  • Documentation | Link
  • Usage examples | Link

[Beta] Distributed optimizer with TorchScript support

PyTorch provides a broad set of optimizers for training algorithms, and these have been used repeatedly as part of the Python API. However, users often want to use multithreaded training instead of multiprocess training, as it provides better resource utilization and efficiency in the context of large-scale distributed training (e.g. Distributed Model Parallel) or any RPC-based training application. Users couldn’t do this with the distributed optimizer before, because the Python Global Interpreter Lock (GIL) limitation had to be removed to achieve it.

In PyTorch 1.7, we are enabling TorchScript support in the distributed optimizer to remove the GIL and make it possible to run the optimizer in multithreaded applications. The new distributed optimizer has the exact same interface as before, but it automatically converts the optimizers within each worker into TorchScript to make them GIL-free. This is done by leveraging a functional optimizer concept and allowing the distributed optimizer to convert the computational portion of the optimizer into TorchScript. This will help use cases like distributed model parallel training and improve performance using mul...


Stable release of automatic mixed precision (AMP). New Beta features include a TensorPipe backend for RPC, memory profiler, and several improvements to distributed training for both RPC and DDP.

28 Jul 17:13
b31f58d

PyTorch 1.6.0 Release Notes

  • Highlights
  • Backwards Incompatible Changes
  • Deprecations
  • New Features
  • Improvements
  • Bug Fixes
  • Performance
  • Documentation

Highlights

The PyTorch 1.6 release includes a number of new APIs, tools for performance improvement and profiling, as well as major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training.

A few of the highlights include:

  1. Automatic mixed precision (AMP) training is now natively supported and a stable feature - thanks to NVIDIA’s contributions;
  2. Native TensorPipe support now added for tensor-aware, point-to-point communication primitives built specifically for machine learning;
  3. New profiling tools providing tensor-level memory consumption information; and
  4. Numerous improvements and new features for both distributed data parallel (DDP) training and the remote procedural call (RPC) packages.

Additionally, from this release onward, features will be classified as Stable, Beta and Prototype. Prototype features are not included as part of the binary distribution and are instead available through either building from source, using nightlies or via compiler flag. You can learn more about what this change means in the post here.

[Stable] Automatic Mixed Precision (AMP) Training

AMP allows users to easily enable automatic mixed precision training enabling higher performance and memory savings of up to 50% on Tensor Core GPUs. Using the natively supported torch.cuda.amp API, AMP provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). Some ops, like linear layers and convolutions, are much faster in float16. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype.

  • Design doc | Link
  • Documentation | Link
  • Usage examples | Link
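
A hedged sketch of the torch.cuda.amp training-loop pattern described above (the model and data are placeholders):

import torch

model = torch.nn.Linear(8, 8).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for data in [torch.randn(4, 8, device="cuda") for _ in range(3)]:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # selected ops run in float16
        loss = model(data).sum()
    scaler.scale(loss).backward()          # scale to avoid float16 gradient underflow
    scaler.step(optimizer)                 # unscales gradients, then steps
    scaler.update()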

[Beta] TensorPipe backend for RPC

PyTorch 1.6 introduces a new backend for the RPC module which leverages the TensorPipe library, a tensor-aware point-to-point communication primitive targeted at machine learning, intended to complement the current primitives for distributed training in PyTorch (Gloo, MPI, ...) which are collective and blocking. The pairwise and asynchronous nature of TensorPipe lends itself to new networking paradigms that go beyond data parallel: client-server approaches (e.g., parameter server for embeddings, actor-learner separation in Impala-style RL, ...) and model and pipeline parallel training (think GPipe), gossip SGD, etc.

# One-line change needed to opt in
torch.distributed.rpc.init_rpc(
    ...
    backend=torch.distributed.rpc.BackendType.TENSORPIPE,
)

# No changes to the rest of the RPC API
torch.distributed.rpc.rpc_sync(...)
  • Design doc | Link
  • Documentation | Link

[Beta] Memory Profiler

The torch.autograd.profiler API now includes a memory profiler that lets you inspect the tensor memory cost of different operators inside your CPU and GPU models.

Here is an example usage of the API:

import torch
import torchvision.models as models
import torch.autograd.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)
with profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inputs)

# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
# ---------------------------  ---------------  ---------------  ---------------
# Name                         CPU Mem          Self CPU Mem     Number of Calls
# ---------------------------  ---------------  ---------------  ---------------
# empty                        94.79 Mb         94.79 Mb         123
# resize_                      11.48 Mb         11.48 Mb         2
# addmm                        19.53 Kb         19.53 Kb         1
# empty_strided                4 b              4 b              1
# conv2d                       47.37 Mb         0 b              20
# ---------------------------  ---------------  ---------------  ---------------

Distributed and RPC Features and Improvements

[Beta] DDP+RPC

PyTorch Distributed supports two powerful paradigms: DDP for fully synchronous data parallel training of models and the RPC framework, which allows for distributed model parallelism. Previously, these two features worked independently and users couldn’t mix and match them to try out hybrid parallelism paradigms.

Starting PyTorch 1.6, we’ve enabled DDP and RPC to work together seamlessly so that users can combine these two techniques to achieve both data parallelism and model parallelism. An example is where users would like to place large embedding tables on parameter servers and use the RPC framework for embedding lookups, but store smaller dense parameters on trainers and use DDP to synchronize the dense parameters. Below is a simple code snippet.

# On each trainer

remote_emb = create_emb(on="ps", ...)
ddp_model = DDP(dense_model)

for data in batch:
   with torch.distributed.autograd.context():
      res = remote_emb(data)
      loss = ddp_model(res)
      torch.distributed.autograd.backward([loss])
  • DDP+RPC Tutorial | Link
  • Documentation | Link
  • Usage Examples | Link

[Beta] RPC - Asynchronous User Functions

RPC Asynchronous User Functions support the ability to yield and resume on the server side when executing a user-defined function. Prior to this feature, when a callee processes a request, one RPC thread waits until the user function returns. If the user function contains IO (e.g., nested RPC) or signaling (e.g., waiting for another request to unblock), the corresponding RPC thread would sit idle waiting for these events. As a result, some applications have to use a very large number of threads and send additional RPC requests, which can potentially lead to performance degradation. To make a user function yield on such events, applications need to: 1) Decorate the function with the @rpc.functions.async_execution decorator; and 2) Let the function return a torch.futures.Future and install the resume logic as callbacks on the Future object. See below for an example:

@rpc.functions.async_execution
def async_add_chained(to, x, y, z):
    return rpc.rpc_async(to, torch.add, args=(x, y)).then(
        lambda fut: fut.wait() + z
    )

ret = rpc.rpc_sync(
    "worker1", 
    async_add_chained, 
    args=("worker2", torch.ones(2), 1, 1)
)
        
print(ret)  # prints tensor([3., 3.])
  • Tutorial for performant batch RPC using Asynchronous User Functions| Link
  • Documentation | Link
  • Usage examples | Link

[Beta] Fork/Join Parallelism

This release adds support for a language-level construct as well as runtime support for coarse-grained parallelism in TorchScript code. This support is useful for situations such as running models in an ensemble in parallel, or running bidirectional components of recurrent nets in parallel, and allows the ability to unlock the computational power of parallel architectures (e.g. many-core CPUs) for task level parallelism.

Parallel execution of TorchScript programs is enabled through two primitives: torch.jit.fork and torch.jit.wait. In the below example, we parallelize execution of foo:

import torch
from typing import List

def foo(x):
    return torch.neg(x)

@torch.jit.script
def example(x):
    futures = [torch.jit.fork(foo, x) for _ in range(100)]
    results = [torch.jit.wait(future) for future in futures]
    return torch.sum(torch.stack(results))

print(example(torch.ones([])))
  • Documentation | Link

Backwards Incompatible Changes

Dropped support for Python <= 3.5 (#39879)

The minimum version of Python we support now is 3.6. Please upgrade your Python to match. If you use conda, instructions for setting up a new environment with Python >= 3.6 can be found here.

Throw a RuntimeError for deprecated torch.div and torch.addcdiv integer floor division behavior (#38762, #38620)

In 1.5.1 and older PyTorch releases, torch.div, torch.addcdiv, and the / operator perform integer floor division. In 1.6, attempting to perform integer division throws a RuntimeError, and in 1.7 the behavior will change so that these operations always perform true division (consistent with Python and NumPy division).

To floor divide integer tensors, please use torch.floor_divide instead.
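
A quick illustration:

>>> a = torch.tensor([7, 8, 9])
>>> torch.floor_divide(a, 2)
tensor([3, 4, 4])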

1.5.11...


Bug Fix release

18 Jun 16:43
3c31d73

PyTorch 1.5.1 Release Notes

  • Backwards Incompatible Changes
  • Known Issues and Workarounds
  • Critical Fixes
  • Crashes and Error Fixes
  • Other Fixes

Backwards Incompatible Changes

Autograd: Operations that return integer-type tensors now always return tensors that don’t require grad (#37789).

This most notably affects torch.argmin, torch.argmax, and torch.argsort. This change is BC-Breaking because previously one could obtain an integer-type tensor that requires grad in 1.5.0. However, said tensors were not usable by autograd; calling .backward() on them resulted in an error, so most users are likely to not have been relying on this behavior.

Version 1.5.0:

>>> tensor = torch.randn(3, requires_grad=True)
>>> torch.argmax(tensor).requires_grad
True

Version 1.5.1:

>>> tensor = torch.randn(3, requires_grad=True)
>>> torch.argmax(tensor).requires_grad
False

Known Issues and Workarounds

When using multiprocessing, PyTorch 1.5.1 and 1.5.0 may error out with complaints about incompatibility between MKL and libgomp (#37377)

You may see error messages like the following when using the torch.multiprocessing package. This bug has primarily affected users with AMD CPUs.

Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.

You can get rid of the error and the error message by setting the environment variable MKL_THREADING_LAYER=GNU. This can be done either by including the following in your Python code:

import os
os.environ['MKL_THREADING_LAYER'] = 'GNU'

or by specifying the environment variable when running your script:

MKL_THREADING_LAYER=GNU python my_script.py

To learn more about what triggers this bug and other workarounds if the above isn’t working, please read this comment on the issue.

Critical Fixes

torch.multinomial: Fixed a bug where CUDA multinomial generated the same sequence over and over again with a shift of 4. (#38046)

nn.Conv2d: Fixed a bug where circular padding applied padding across the wrong dimension (#37881)

Version 1.5.0:

>>> circular = nn.Conv2d(6, 1, (3, 3), padding=(0, 1), padding_mode='circular')
>>> circular(torch.zeros(1, 6, 10, 10)).shape
# Notice the padding is incorrectly on the H dimension, not the W dimension.
torch.Size([1, 1, 10, 8])

Version 1.5.1:

>>> circular = nn.Conv2d(6, 1, (3, 3), padding=(0, 1), padding_mode='circular')
>>> circular(torch.zeros(1, 6, 10, 10)).shape
torch.Size([1, 1, 8, 10])

Fixed bug where asserts in CUDA kernels were mistakenly disabled, leading to many silent kernel errors. (#38943, #39047, #39218)

torch.gather, torch.scatter: added checks for illegal input dtypes that caused silently incorrect behaviors (#38025, #38646)

torch.argmin, torch.argmax: Fixed silently incorrect result for inputs with more than 2^32 elements (#39212)

C++ Custom Operators: fixed a bug where custom operators stopped working with autograd and ignored the requires_grad=True flag. (#37355)

Crashes and Error Fixes

Fixed CUDA reduction operations on inputs with more than 2^32 elements (#37788)

Version 1.5.0:

>>> torch.zeros(5, 14400, 14400, device='cuda').sum(0)
RuntimeError: sub_iter.strides(0)[0] == 0 INTERNAL ASSERT FAILED at /pytorch/aten/src/ATen/native/cuda/Reduce.cuh:706, please report a bug to PyTorch.

Version 1.5.1:

>>> torch.zeros(5, 14400, 14400, device='cuda').sum(0)
# No problem

Fixed pickling of PyTorch operators (#38033)

Version 1.5.0:

>>> pickle.dumps(torch.tanh)
PicklingError: Can't pickle : it's not the same object as torch._C._VariableFunctions

Version 1.5.1:

>>> pickle.dumps(torch.tanh)
# No problem

nn.LeakyReLU: Fixed a bug where using autograd with in-place nn.LeakyReLU with a slope of 0 incorrectly errored out. (#37453, #37559)

Version 1.5.0:

>>> tensor = torch.randn(3, requires_grad=True)
>>> other = tensor + 1
>>> output = nn.LeakyReLU(0, inplace=True)(other)
>>> output.sum().backward()
RuntimeError: In-place leakyReLu backward calculation is triggered with a non-positive slope which is not supported. This is caused by calling in-place forward function with a non-positive slope, please call out-of-place version instead.

Version 1.5.1:

>>> tensor = torch.randn(3, requires_grad=True)
>>> other = tensor + 1
>>> output = nn.LeakyReLU(0, inplace=True)(other)
>>> output.sum().backward()
# No error

torch.as_strided: Fixed crash when passed sizes and strides of different lengths. (#39301)

nn.SyncBatchNorm.convert_sync_batchnorm: Fixed bug where it did not respect the devices of the original BatchNorm module, resulting in device mismatch errors (#39344)

nn.utils.clip_grad_norm_: Fixed ability to operate on tensors on different devices (#38615)

torch.min, torch.max: added check for illegal output dtypes (#38850)

MacOS: Fixed import torch error (#36941).

C++ Extensions: fixed compilation error when building with older versions of nvcc (#37221)

This bug mainly affected users of ubuntu 16.04. We’re certain it affected the following configurations:

  • ubuntu 16.04 + cuda 9.2 + gcc 5
  • ubuntu 16.04 + cuda 9.2 + gcc 7
  • ubuntu 16.04 + cuda 10.0 + gcc 5

C++ Extensions: fixed ability to compile with paths that include spaces (#38860, #38670)

C++ Extensions: fixed ability to compile with relative include_dirs for ahead-of-time compilation (#38264)

Other Fixes

nn.Conv1d, nn.Conv2d, nn.Conv3d: Fixed a bug where convolutions were using more memory than previous versions of PyTorch. (#38674)

Fixed in-place floor division magic method (#38695)

In 1.5.0, the in-place floor division magic method mistakenly performed the floor division out of place. We’ve fixed this in 1.5.1.

Version 1.5.0:

>>> tensor = torch.ones(1)
>>> expected_data_ptr = tensor.data_ptr()
>>> tensor //= 1
>>> tensor.data_ptr() == expected_data_ptr
False

Version 1.5.1:

>>> tensor = torch.ones(1)
>>> expected_data_ptr = tensor.data_ptr()
>>> tensor //= 1
>>> tensor.data_ptr() == expected_data_ptr
True

Documentation: fixed link to java docs. (#39039)

Quantization: Fixed weight quantization inaccuracies for LSTM (#35961)

Weight quantization was done incorrectly for LSTMs: the statistics for all weights (across layers) were combined in the observer. This meant that weights for later layers in an LSTM would use sub-optimal scales, impacting accuracy. The problem gets worse as the number of layers increases.

DistributedDataParallel: Fixed single-process multi-GPU use case (#36503)

RPC: Fixed future callbacks not capturing and restoring autograd context id (#38512)

TorchScript: Fixed support with torch.unique (#38156)

ONNX: Fix pow operator export (#39791)

Stable C++ Frontend, Distributed RPC framework, and more. New experimental higher-level autograd API, Channels Last memory format, and more.

21 Apr 16:26
4ff3872

PyTorch 1.5.0 Release Notes

  • Highlights
  • Known Issues
  • Backwards Incompatible Changes
    • Python
    • C++ API
    • JIT
    • Quantization
    • RPC
  • New Features
  • Improvements
  • Bug Fixes
  • Performance
  • Documentation
  • Deprecations
    • Python
    • C++ API
  • Miscellaneous

Highlights

This release includes several major new API additions and improvements. These include new APIs for autograd allowing for easy computation of hessians and jacobians, a significant update to the C++ frontend, ‘channels last’ memory format for more performant computer vision models, a stable release of the distributed RPC framework used for model parallel training, and a new API that allows for the creation of Custom C++ Classes that was inspired by PyBind. Additionally torch_xla 1.5 is now available and tested with the PyTorch 1.5 release providing a mature Cloud TPU experience.

C++ Frontend API [Now Stable]

The C++ frontend API is now at parity with Python, and the features overall have been moved to ‘stable’ (previously tagged as experimental). Some of the major highlights include:

  • C++ torch::nn module/functional are now at ~100% parity with Python API, with appropriate documentation. Now users can easily translate their model from Python API to C++ API, making the model authoring experience much smoother.
  • C++ optimizers now behave identically to the Python API. In the past, optimizers in C++ had deviated from the Python equivalent: C++ optimizers couldn’t take parameter groups as input while the Python ones could. Also step function implementations were not exactly the same. With the 1.5 release, C++ optimizers will always behave the same as the Python equivalent.
  • New C++ tensor multi-dim indexing API which looks and behaves similarly to the Python API. The previous workaround was to use a combination of narrow / select / index_select / masked_select, which is clunky and error-prone compared to the Python API’s elegant tensor[:, 0, ..., mask] syntax. With the 1.5 release users can use tensor.index({Slice(), 0, "...", mask}) to achieve the same result.

Channels last memory format for Computer Vision models [Experimental]

Channels Last memory format is an alternative way of ordering NCHW tensors in memory while preserving the NCHW semantic dimensions ordering. Channels Last tensors are ordered in memory in such a way that channels become the densest dimension (aka storing images pixel-per-pixel).

Channels Last memory format unlocks the ability to use performance-efficient convolution algorithms and hardware (NVIDIA’s Tensor Cores, FBGEMM, QNNPACK). Additionally, it was designed to automatically propagate through the operators, which allows easy switching between memory layouts.

Learn more here on how to write memory format aware operators.
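
A minimal sketch of opting a model and its input into the channels-last layout (layer sizes are illustrative):

import torch
from torch import nn

conv = nn.Conv2d(3, 8, kernel_size=3).to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)

out = conv(x)
# Check whether the channels-last format propagated through the operator.
print(out.is_contiguous(memory_format=torch.channels_last))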

Custom C++ Classes [Experimental]

This release adds a new API for binding custom C++ classes into TorchScript and Python simultaneously. This API is almost identical in syntax to pybind11. It allows users to expose their C++ class and its methods to the TorchScript type system and runtime system such that they can instantiate and manipulate arbitrary C++ objects from TorchScript and Python. An example C++ binding:

template <class T>
struct MyStackClass : torch::CustomClassHolder {
  std::vector<T> stack_;
  MyStackClass(std::vector<T> init) : stack_(std::move(init)) {}

  void push(T x) {
    stack_.push_back(x);
  }
  T pop() {
    auto val = stack_.back();
    stack_.pop_back();
    return val;
  }
};

static auto testStack =
  torch::class_<MyStackClass<std::string>>("myclasses", "MyStackClass")
      .def(torch::init<std::vector<std::string>>())
      .def("push", &MyStackClass<std::string>::push)
      .def("pop", &MyStackClass<std::string>::pop)
      .def("size", [](const c10::intrusive_ptr<MyStackClass<std::string>>& self) {
        return self->stack_.size();
      });

Which exposes a class you can use in Python and TorchScript like so:

@torch.jit.script
def do_stacks(s : torch.classes.myclasses.MyStackClass):
    s2 = torch.classes.myclasses.MyStackClass(["hi", "mom"])
    print(s2.pop()) # "mom"
    s2.push("foobar")
    return s2 # ["hi", "foobar"]

You can try it out in the tutorial here.

Distributed RPC framework APIs [Now Stable]

The torch.distributed.rpc package aims at supporting a wide range of distributed training paradigms that do not fit into DistributedDataParallel. Examples include parameter server training, distributed model parallelism, and distributed pipeline parallelism. Features in the torch.distributed.rpc package can be categorized into four main sets of APIs.

  • The RPC API allows running a function on a specified destination worker with given arguments and fetches the return value or creates a distributed reference to the return value.
  • The RRef (Remote REFerence) serves as a reference to an object on another worker. A worker holding an RRef can explicitly request copies of the object, and it can also share the light-weight RRef with other workers without worrying about reference counting. This is especially useful when multiple workers need to repeatedly access different versions of the same remote object.
  • With Distributed Autograd, applications can automatically compute gradients even if a model is split on multiple workers using RPC. This is achieved by stitching together local autograd graphs at RPC boundaries in the forward pass and reaching out to participants to transparently launch local autograd in the backward pass.
  • The Distributed Optimizer uses gradients computed by Distributed Autograd to update model parameters. Its constructor takes a local optimizer (e.g., SGD, Adagrad, etc.) and a list of parameter RRefs, and its step() function automatically uses the local optimizer to update parameters on all distinct RRef owner workers.

Learn more here.
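
A hedged sketch tying the four API groups above together (assumes rpc.init_rpc has already been called on workers named "worker0" and "worker1"):

import torch
import torch.distributed.rpc as rpc
import torch.distributed.autograd as dist_autograd

# RPC: run a function on a remote worker and fetch its return value.
ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), torch.ones(2)))

# RRef: keep the result on the remote worker; only a reference is held locally.
rref = rpc.remote("worker1", torch.add, args=(torch.ones(2), torch.ones(2)))
local_copy = rref.to_here()              # explicitly request a copy when needed

# Distributed autograd: local autograd graphs are stitched together across RPC.
with dist_autograd.context() as context_id:
    t = torch.ones(2, requires_grad=True)
    loss = rpc.rpc_sync("worker1", torch.add, args=(t, t)).sum()
    dist_autograd.backward(context_id, [loss])
    grads = dist_autograd.get_gradients(context_id)   # gradients keyed by tensor
# The Distributed Optimizer would then consume parameter RRefs and run the
# update on each owner via its step() function.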

torch_xla 1.5 now available

torch_xla is a Python package that uses the XLA linear algebra compiler to accelerate the PyTorch deep learning framework on Cloud TPUs and Cloud TPU Pods. torch_xla aims to give PyTorch users the ability to do everything they can do on GPUs on Cloud TPUs as well while minimizing changes to the user experience. This release of torch_xla is aligned and tested with PyTorch 1.5 to reduce friction for developers and to provide a stable and mature PyTorch/XLA stack for training models using Cloud TPU hardware. You can try it for free in your browser on an 8-core Cloud TPU device with Google Colab, and you can use it at a much larger scale on Google Cloud.

See the full torch_xla release notes here and the full docs here.

New High level autograd API [Experimental]

PyTorch 1.5 brings new functions including jacobian, hessian, jvp, vjp, hvp and vhp to the torch.autograd.functional submodule. This feature builds on the current API and allows the user to easily perform these computations.

See the full docs here.

Python 2 no longer supported

For PyTorch 1.5.0 we will no longer support Python 2, specifically version 2.7. Going forward support for Python will be limited to Python 3, specifically Python 3.5, 3.6, 3.7 and 3.8 (first enabled in PyTorch 1.4.0).

Known Issues

torch.nn.parallel.DistributedDataParallel does not work in Single-Process Multi-GPU mode.

DistributedDataParallel (DDP) used to support two modes

  1. Single-Process Multi-GPU (SPMG): In this mode, each DDP process replicates the input module to all specified devices and trains on all module replicas. This mode is enabled when the application passes in a device_ids argument that contains multiple devices, or, if device_ids is not provided, DDP will try to use all available devices.
  2. Multi-Process Single-GPU (MPSG): This is the recommended mode, as it is faster than SPMG. In this mode, each DDP process directly works on the provided module without creating additional replicas. This mode is enabled when device_ids only contains a single device or if there is only one visible device (e.g., by setting CUDA_VISIBLE_DEVICES).

A recent change (#33907) in torch.nn.parallel.replicate breaks DDP’s assumption on replicated modules and leads to failures in the SPMG mode. However, since SPMG is known to be slower due to GIL contention and additional overhead caused by scattering input and gathering output, we are planning to retire this mode in future releases and make MPSG the only supported mode in DDP. The code below shows an example of the recommended way to construct DDP.

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# use "cuda:1" as the target device
target_device = 1 
local_model = torch.nn.Linear(2, 2).to(target_device)
ddp_model = DDP(local_model, device_ids=[target_device])

See #36268 for more discussion.

Tensor.exponential_(0) used to return Inf, now it incorrectly returns 0

Previously in 1.4, x.exponential_(0) gives a tensor full of inf. On 1.5.0, it wrongly give...

Read more

Mobile build customization, Distributed model parallel training, Java bindings, and more

16 Jan 00:03

PyTorch 1.4.0 Release Notes

  • Highlights
  • Backwards Incompatible Changes
    • Python
    • JIT
    • C++
  • New Features
    • torch.optim
    • Distributed
    • RPC [Experimental]
    • JIT
    • Mobile
  • Improvements
    • Distributed
    • JIT
    • Mobile
    • Named Tensors
    • C++ API
    • AMD Support
    • ONNX
    • Quantization
    • Visualization
    • Other Improvements
  • Bug Fixes
    • Distributed
    • RPC
    • C++ API
    • JIT
    • Quantization
    • Mobile
    • Other Bug fixes
  • Deprecations
  • Performance

The PyTorch v1.4.0 release is now available.

The release contains over 1,500 commits, with significant effort spanning existing areas like JIT, ONNX, Distributed, Performance and the Eager Frontend, as well as improvements to experimental areas like mobile and quantization. It also contains new experimental features including RPC-based model parallel distributed training and language bindings for the Java language (inference only).

PyTorch 1.4 is the last release that supports Python 2. For the C++ API, it is the last release that supports C++11: you should start migrating to Python 3 and building with C++14 to make the future transition from 1.4 to 1.5 easier.

Highlights

PyTorch Mobile - Build level customization

Following the experimental release of PyTorch Mobile in the 1.3 release, PyTorch 1.4 adds additional mobile support including the ability to customize build scripts at a fine-grain level. This allows mobile developers to optimize library size by only including the operators used by their models and, in the process, reduce their on device footprint significantly. Initial results show that, for example, a customized MobileNetV2 is 40% to 50% smaller than the prebuilt PyTorch mobile library. Learn more about how to create your own custom builds, and please engage with the community on the PyTorch forums to provide any feedback you have.

Distributed Model Parallel Training [Experimental]

With the scale of models, such as RoBERTa, continuing to increase into the billions of parameters, model parallel training has become ever more important to help researchers push the limits. This release provides a distributed RPC framework to support distributed model parallel training. It allows for running functions remotely and referencing remote objects without copying the real data around, and provides autograd and optimizer APIs to transparently run backwards and update parameters across RPC boundaries.

To learn more about the APIs and the design of this feature, see the links below:

For the full tutorials, see the links below:

As always, you can connect with community members and discuss more on the forums.

Java bindings [Experimental]

In addition to supporting Python and C++, this release adds experimental support for Java bindings. Based on the interface developed for Android in PyTorch Mobile, the new bindings allow you to invoke TorchScript models from any Java program. Note that the Java bindings are only available for Linux for this release, and for inference only. We expect support to expand in subsequent releases. See the code snippet below for how to use PyTorch within Java:

Learn more about how to use PyTorch from Java here, and see the full Javadocs API documentation here.

Pruning

Pruning functionalities have been added to PyTorch in the nn.utils.prune module. This provides out-of-the-box support for common magnitude-based and random pruning techniques, both structured and unstructured, both layer-wise and global, and it also enables custom pruning from user-provided masks.

To prune a tensor, first select a pruning technique among those available in nn.utils.prune (or implement your own by subclassing BasePruningMethod).

from torch.nn.utils import prune
t = torch.rand(2, 5)
p = prune.L1Unstructured(amount=0.7)
pruned_tensor = p.prune(t)

To prune a module, select one of the pruning functions available in nn.utils.prune (or implement your own) and specify which module and which parameter within that module pruning should act on.

m = nn.Conv2d(3, 1, 2)
prune.ln_structured(module=m, name='weight', amount=5, n=2, dim=1)

Pruning reparametrizes the module by turning weight (in the example above) from a parameter to an attribute, and replacing it with a new parameter called weight_orig (i.e. appending "_orig" to the initial parameter name) that stores the unpruned version of the tensor. The pruning mask is stored as a buffer named weight_mask (i.e. appending "_mask" to the initial parameter name). Pruning is applied prior to each forward pass by recomputing weight through a multiplication with the updated mask using PyTorch's forward_pre_hooks.

Iterative pruning is seamlessly enabled by repeatedly calling pruning functions on the same parameter (this automatically handles the combination of successive masks by making use of a PruningContainer under the hood).

nn.utils.prune is easily extensible to support new pruning functions by subclassing the BasePruningMethod base class and implementing the compute_mask method with the instructions to compute the mask according to the logic of the new pruning technique.
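
A hedged sketch of such an extension: a custom method that zeroes every other entry (an illustrative, not useful, criterion):

import torch
from torch import nn
from torch.nn.utils import prune

class PruneEveryOther(prune.BasePruningMethod):
    PRUNING_TYPE = "unstructured"

    def compute_mask(self, t, default_mask):
        mask = default_mask.clone()
        mask.view(-1)[::2] = 0          # zero out every other element
        return mask

m = nn.Linear(4, 3)
PruneEveryOther.apply(m, name="weight")   # registers weight_orig and weight_mask
print([name for name, _ in m.named_buffers()])   # ['weight_mask']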

Backwards Incompatible Changes

Python

torch.optim: It is no longer supported to use Scheduler.get_lr() to obtain the last computed learning rate. To get the last computed learning rate, call Scheduler.get_last_lr() instead. (26423)

Learning rate schedulers are now “chainable,” as mentioned in the New Features section below. Scheduler.get_lr was sometimes used for monitoring purposes to obtain the current learning rate. But since Scheduler.get_lr is also used internally for computing new learning rates, this actually returns a value that is “one step ahead.” To get the last computed learning rate, use Scheduler.get_last_lr instead.

Note that optimizer.param_groups[0]['lr'] was in version 1.3.1 and remains in 1.4.0 a way of getting the current learning rate used in the optimizer.
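
A minimal sketch of the new monitoring pattern (the parameter and schedule are placeholders):

import torch
from torch import optim

params = [torch.zeros(3, requires_grad=True)]
optimizer = optim.SGD(params, lr=0.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

optimizer.step()
scheduler.step()
print(scheduler.get_last_lr())            # [0.05]: the last computed learning rate
print(optimizer.param_groups[0]['lr'])    # 0.05: equivalent query on the optimizer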

Tensor.unfold on a 0-dimensional Tensor now properly returns a 1-dimensional Tensor.

Version 1.3.1:

>>> torch.tensor(5).unfold(dimension=0, size=1, step=1)
tensor(5)

Version 1.4.0:

>>> torch.tensor(5).unfold(dimension=0, size=1, step=1)
tensor([5])

torch.symeig now returns a 0-element eigenvectors tensor when eigenvectors=False (the default).

Version 1.3.1:

>>> torch.symeig(torch.randn(3,3)).eigenvectors.shape
torch.Size([3, 3])

Version 1.4.0:

>>> torch.symeig(torch.randn(3,3)).eigenvectors.shape
torch.Size([0])

JIT

  • Make torch.jit.get_trace_graph private (it is now torch.jit._get_trace_graph) (29149)
    • This function was intended only for ONNX integration; use traced_module.graph instead, like:
    • traced_module = torch.jit.trace(my_module, example_inputs)
      traced_graph = traced_module.graph
  • @property on ScriptModules has been disabled (28395)
    • Scripted @property accesses were silently broken before: we would evaluate the getter function once and store the result as the attribute permanently. They properly error now; a workaround is to make your @property a regular method.
  • Custom ops: torch::jit::RegisterOperators has been removed, use torch::RegisterOperators instead (28229). The usage and behavior should remain the same.
  • Remove torch.jit._register_* bindings from Python (e.g. torch.jit._register_attribute). These were private functions that were not intended to be used. (29499)

C++

[C++] The distinction between Tensor and Variable has been eliminated at the C++ level. (28287)

This change simplifies our C++ API and matches previous changes we did at the python level that merged Tensors and Variables into a single type.

This change is unlikely to affect user code; the most likely exceptions are:

  1. Argument-dependent lookup for torch::autograd may no longer work. This can break because Variable is now defined as an alias for Tensor (using Variable = Tensor;). In this case, you must explicitly qualify the calls to torch::autograd functions.

  2. Because Variable and Tensor are now the same type, code which assumes that they are differen...


Bug Fix Release

07 Nov 17:19

Significant Fixes

Type Promotion: fixed a bug where type promotion combined with non-contiguous tensors could compute incorrect results. (28253)

Version 1.3.0:

>>> a = torch.tensor([[True,  True],
                      [False, True]])
# get a non-contiguous tensor
>>> a_transpose = a.t()
# type promote by comparing across dtypes (bool -> long)
>>> a_transpose == 0
# POTENTIALLY INCORRECT VALUES

Version 1.3.1:

>>> a = torch.tensor([[True,  True],
                      [False, True]])
# get a non-contiguous tensor
>>> a_transpose = a.t()
# type promote by comparing across dtypes (bool -> long)
>>> a_transpose == 0
tensor([[False,  True],
        [False, False]])

Type Promotion / Indexing: Fixed a bug where mixed-dtype indexing and assignment could lead to incorrect results. Mixed dtype operations of this form are currently disabled, as they were in 1.2. (28231)

Version 1.3.0:

>>> a = torch.ones(5, 2, dtype=torch.float)
>>> b = torch.zeros(5, dtype=torch.long)
>>> a[:, [1]] = b.unsqueeze(-1)
>>> a
# POTENTIALLY INCORRECT VALUES

Version 1.3.1:

>>> a = torch.ones(5, 2, dtype=torch.float)
>>> b = torch.zeros(5, dtype=torch.long)
>>> a[:, [1]] = b.unsqueeze(-1)
RuntimeError: expected dtype Float but got dtype Long

torch.where(condition, x, y): fixed a bug on CPU where incorrect results could be returned if x and y were of different dtypes. Mixed dtype operations of this form are currently disabled, as they were in version 1.2. (29078)

Version 1.3.0:

>>> x = torch.randn(2, 3)
>>> y = torch.randint(0, 10, (2, 3))
>>> torch.where(x < 0, x, y)
tensor(...)
# POTENTIALLY INCORRECT VALUES

Version 1.3.1:

>>> x = torch.randn(2, 3)
>>> y = torch.randint(0, 10, (2, 3))
>>> torch.where(x < 0, x, y)
RuntimeError: expected scalar type Float but found Long

Other Fixes

  • torch.argmax: fix regression on CUDA that disabled support for torch.float16 inputs. (28915)
  • NamedTensor: fix Python refcounting bug with Tensor.names. (28922)
  • Quantization: support deepcopy for quantized tensors. (28612)
  • Quantization: support nn.quantized.ReLU with inplace=True. (28710)
  • Documentation: torch.lgamma and torch.polygamma are now documented. (28964)

Mobile Support, Named Tensors, Quantization, Type Promotion and many more

10 Oct 17:26

Table of Contents

  • Breaking Changes
  • Highlights
    • [Experimental]: Mobile Support
    • [Experimental]: Named Tensor Support
    • [Experimental]: Quantization support
    • Type Promotion
    • Deprecations
  • New Features
    • TensorBoard: 3D Mesh and Hyperparameter Support
    • Distributed
    • Libtorch Binaries with C++11 ABI
    • New TorchScript features
  • Improvements
    • C++ Frontend Improvements
      • Autograd
      • New torch::nn modules
      • New torch::nn::functional functions
      • tensor Construction API
      • Other C++ Improvements
    • Distributed Improvements
    • Performance Improvements
    • JIT Improvements
    • ONNX Exporter Improvements
      • Adding Support for ONNX IR v4
      • Adding Support for ONNX Opset 11
      • Exporting More Torch Operators/Models to ONNX
      • Enhancing ONNX Export Infra
    • Other Improvements
  • Bug Fixes
    • TensorBoard Bug Fixes
    • C++ API Bug fixes
    • JIT
    • Other Bug Fixes
  • Documentation Updates
    • Distributed
    • JIT
    • Other documentation improvements

Breaking Changes

Type Promotion: Mixed dtype operations may return a different dtype and value than in previous versions. (22273, 26981)

Previous versions of PyTorch supported a limited number of mixed dtype operations. These operations could result in loss of precision by, for example, truncating floating-point zero-dimensional tensors or Python numbers.

In Version 1.3, PyTorch supports NumPy-style type promotion (with slightly modified rules, see full documentation). These rules generally will retain precision and be less surprising to users.

Version 1.2:
>>> torch.tensor(1) + 2.5
tensor(3)
>>> torch.tensor([1]) + torch.tensor(2.5)
tensor([3])
>>> torch.tensor(True) + 5
tensor(True)

Version 1.3:
>>> torch.tensor(1) + 2.5
tensor(3.5000)
>>> torch.tensor([1]) + torch.tensor(2.5)
tensor([3.5000])
>>> torch.tensor(True) + 5
tensor(6)

Type Promotion: in-place operations whose result_type is a lower dtype category (bool < integer < floating-point) than the in-place operand now throw an error. (22273, 26981)

Version 1.2:
>>> int_tensor = torch.tensor(1)
>>> int_tensor.add_(1.5)
tensor(2)
>>> bool_tensor = torch.tensor(True)
>>> bool_tensor.add_(5)
tensor(True)

Version 1.3:
>>> int_tensor = torch.tensor(1)
>>> int_tensor.add_(1.5)
RuntimeError: result type Float cannot be cast to the desired output type Long
>>> bool_tensor = torch.tensor(True)
>>> bool_tensor.add_(5)
RuntimeError: result type Long cannot be cast to the desired output type Bool

These rules can be checked at runtime via torch.can_cast.
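As a quick illustration (a small snippet added here, not part of the original notes), torch.can_cast reports whether the promotion rules allow an implicit cast from one dtype to another:

>>> torch.can_cast(torch.double, torch.float)
True
>>> torch.can_cast(torch.float, torch.int)
False

Casting within the floating-point category is allowed even when it loses precision, while casting from the floating-point category down to the integer category is not.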

torch.flatten: 0-dimensional inputs now return a 1-dim tensor. (25406).

Version 1.2:
>>> torch.flatten(torch.tensor(0))
tensor(0)

Version 1.3:
>>> torch.flatten(torch.tensor(0))
tensor([0])

nn.functional.affine_grid: when align_corners = True, changed the behavior of 2D affine transforms on 1D data and 3D affine transforms on 2D data (i.e., when one of the spatial dimensions has unit size).

Previously, all grid points along a unit dimension were arbitrarily considered to be at -1; now they are considered to be at 0 (the center of the input image).

torch.gels: removed deprecated operator, use torch.lstsq instead. (26480).

utils.data.DataLoader: made a number of Iterator attributes private (e.g. num_workers, pin_memory). (22273)

[C++] Variable::backward will no longer implicitly create a gradient for non-1-element Variables. Previously, a gradient tensor of all 1s would be implicitly created. This behavior matches the Python API. (26150)

auto x = torch::randn({5, 5}, torch::requires_grad());
auto y = x * x;
y.backward();
// ERROR: "grad can be implicitly created only for scalar outputs"

[C++] All option specifiers (e.g. GRUOptions::bidirectional_) are now private, use the function variants (GRUOptions::bidirectional(...)) instead. (26419).

Highlights

[Experimental]: Mobile Support

In PyTorch 1.3, we are launching experimental support for mobile. Now you can run any TorchScript model directly without any conversion. Here is the full list of features in this release:

  • Support for full TorchScript inference on mobile;
  • Prebuilt LibTorch libraries for Android/iOS on JCenter/CocoaPods;
  • Java wrapper for Android with functionality to cover common inference cases (loading and invoking the model);
  • Support for all forward ops on mobile CPU (backward ops are not supported yet);
  • Some optimized fp32 operator implementations for ARM CPUs (based on Caffe2Go);
  • Some optimized int8 operator implementations for ARM CPUs (based on QNNPACK);

We decided not to create a new framework for mobile so that you can use the same APIs you are already familiar with to run the same TorchScript models on Android/iOS devices without any format conversion. This way you can have the shortest path from research ideas to production-ready mobile apps.

The tutorials, demo apps and download links for prebuilt libraries can be found at: https://pytorch.org/mobile/

This is an experimental release. We are working on other features like customized builds to make PyTorch smaller, faster and better for your specific use cases. Stay tuned and give us your feedback!

[Experimental]: Named Tensor Support

Named Tensors aim to make tensors easier to use by allowing users to associate explicit names with tensor dimensions. In most cases, operations that take dimension parameters will accept dimension names, avoiding the need to track dimensions by position. In addition, named tensors use names to automatically check that APIs are being used correctly at runtime, providing extra safety. Names can also be used to rearrange dimensions, for example, to support "broadcasting by name" rather than "broadcasting by position".

Create a named tensor by passing a names argument into most tensor factory functions.

>>> tensor = torch.zeros(2, 3, names=('C', 'N'))
    tensor([[0., 0., 0.],
            [0., 0., 0.]], names=('C', 'N'))

Named tensors propagate names across operations.

>>> tensor.abs()
    tensor([[0., 0., 0.],
            [0., 0., 0.]], names=('C', 'N'))

Rearrange to a desired ordering by using align_to.

>>> tensor = tensor.align_to('N', 'C', 'H', 'W')
>>> tensor.names, tensor.shape
    (('N', 'C', 'H', 'W'), torch.Size([3, 2, 1, 1]))
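
Operations that take a dim argument also accept dimension names. A small illustrative snippet (not from the original notes):

>>> t = torch.ones(2, 3, names=('N', 'C'))
>>> t.sum('C')
tensor([3., 3.], names=('N',))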

And more! Please see our documentation on named tensors.

[Experimental]: Quantization support

PyTorch now supports quantization from the ground up, starting with support for quantized tensors. Convert a float tensor to a quantized tensor and back by:

x = torch.rand(10, 1, dtype=torch.float32)
xq = torch.quantize_per_tensor(x, scale=0.5, zero_point=8, dtype=torch.quint8)
# xq is a quantized tensor with data represented as quint8
xdq = xq.dequantize()
# xdq converts the quantized tensor back to floating point

We also support 8-bit quantized implementations of the most common operators in CNNs, including:

  • Tensor operations:
    • view, clone, resize, slice
    • add, multiply, cat, mean, max, sort, topk
  • Modules/Functionals (in torch.nn.quantized)
    • Conv2d
    • Linear
    • Avgpool2d, AdaptiveAvgpool2d, MaxPool2d, AdaptiveMaxPool2d
    • Interpolate
    • Upsample
  • Fused operations that help preserve accuracy (in torch.nn.intrinsic)
    • ConvReLU2d, ConvBnReLU2d, ConvBn2d
    • LinearReLU
    • add_relu

We also support dynamic quantized operators, which take in floating-point activations but use quantized weights (in torch.nn.quantized.dynamic); a short usage sketch follows the list below.

  • LSTM
  • Linear
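
As a hedged sketch (the toy model and layer sizes below are made up for illustration), these dynamic modules are most easily reached through the torch.quantization.quantize_dynamic helper, which swaps eligible float modules for their torch.nn.quantized.dynamic counterparts:

>>> import torch
>>> import torch.nn as nn
>>> model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
>>> qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
>>> qmodel(torch.randn(1, 4)).shape
torch.Size([1, 2])

The resulting model keeps floating-point activations but stores and applies the Linear weights in int8.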

Quantization also requires methods to collect statistics from tensors and calculate quantization parameters (implementing the torch.quantization.Observer interface). We support several such observers (a brief usage sketch follows the list):

  • MinMaxObserver
  • MovingAverageMinMaxObserver
  • PerChannelMinMaxObserver
  • MovingAveragePerChannelMinMaxObserver
  • HistogramObserver
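
For illustration (a minimal sketch, not from the original notes), an observer records statistics of the tensors passed through it and derives quantization parameters from them:

>>> obs = torch.quantization.MinMaxObserver(dtype=torch.quint8)
# running data through the observer records its min/max and returns the input unchanged
>>> _ = obs(torch.tensor([-1.0, 0.0, 2.0]))
>>> scale, zero_point = obs.calculate_qparams()

The resulting scale and zero_point can then be used with torch.quantize_per_tensor.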

For quantization-aware training, we support fake-quantization operators and modules to mimic quantization during training (an illustrative call follows the list):

  • torch.fake_quantize_per_tensor_affine, torch.fake_quantize_per_channel_affine
  • torch.quantization.FakeQuantize
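
A small illustrative call (the scale, zero_point and clamping range below are arbitrary choices): fake quantization keeps the tensor in floating point but snaps its values to the quantization grid, so training can proceed as usual:

>>> x = torch.tensor([-1.0, 0.04, 1.0])
# arguments after the input are scale, zero_point, quant_min, quant_max
>>> y = torch.fake_quantize_per_tensor_affine(x, 0.1, 0, 0, 255)
# y is still a float tensor: -1.0 is clamped to 0.0, 0.04 rounds to 0.0, and 1.0 is representable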

In addition, we support workflows in torch.quantization for the following (a simplified sketch of the static path follows the list):

  • post-training dynamic quantization
  • post-training static quantization
  • quantization-aware training
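
For the post-training static path, a heavily simplified sketch (the model and sizes are illustrative only; a complete workflow would also wrap the model with torch.quantization.QuantStub/DeQuantStub so the converted model can accept float inputs):

>>> import torch
>>> import torch.nn as nn
>>> import torch.quantization as tq
>>> model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
>>> model.qconfig = tq.get_default_qconfig('fbgemm')
>>> prepared = tq.prepare(model, inplace=False)      # insert observers
>>> _ = prepared(torch.randn(1, 3, 8, 8))            # calibrate on representative data
>>> quantized = tq.convert(prepared, inplace=False)  # swap in quantized modules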

All quantized operators are compatible with TorchScript.

For more details, see the docum...

Read more