@soumith soumith released this Oct 2, 2018 · 378 commits to master since this release


This is a pre-release preview. Do not rely on the tag to have a fixed set of commits, or rely on the tag for anything practical or important.




The JIT is a set of compiler tools for bridging the gap between research in PyTorch
and production. It includes a language called Torch Script (don't worry, it is a subset of Python,
so you'll still be writing Python) and two ways to make your existing code compatible with the JIT.
Torch Script code can be aggressively optimized and it can be serialized for later use in our new C++ API, which doesn't depend on Python at all.

# Write in Python, run anywhere!
import torch

@torch.jit.script
def RNN(x, h, W_h, U_h, b_h):
  y = []
  for t in range(x.size(0)):
    h = torch.tanh(x[t] @ W_h + h @ U_h + b_h)
    y += [h]
  return torch.stack(y), h

As an example, see a tutorial on deploying a seq2seq model,
loading an exported model from C++, or browse the docs.
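
As a rough sketch of the serialization workflow described above, the snippet below scripts a small module, saves it to an in-memory buffer, and loads it back. The Scale module is hypothetical and exists only for illustration; the sketch assumes a recent PyTorch in which torch.jit.script accepts an nn.Module directly.

```python
import io

import torch


class Scale(torch.nn.Module):
    """Hypothetical toy module: multiplies its input by a learned factor."""

    def __init__(self):
        super().__init__()
        self.factor = torch.nn.Parameter(torch.tensor(2.0))

    def forward(self, x):
        return x * self.factor


# Compile the module to Torch Script.
scripted = torch.jit.script(Scale())

# Serialize to a buffer (a file path works too), then load it back.
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)
restored = torch.jit.load(buffer)

x = torch.ones(3)
jit_roundtrip_ok = torch.equal(scripted(x), restored(x))
```

The same saved archive is what a C++ program would load; only the Python side is shown here.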

torch.distributed new "C10D" library

The torch.distributed package and torch.nn.parallel.DistributedDataParallel module are backed by the new "C10D" library. The main highlights of the new library are:

  • C10D is performance driven and operates entirely asynchronously for all backends: Gloo, NCCL, and MPI.
  • Significant Distributed Data Parallel performance improvements, especially on hosts with slower networks such as Ethernet
  • Adds async support for all distributed collective operations in the torch.distributed package.
  • Adds send and recv support in the Gloo backend

C++ Frontend [API Unstable].

The C++ frontend is a pure C++ interface to the PyTorch backend that follows the API and architecture of the established Python frontend. It is intended to enable research in high performance, low latency and bare metal C++ applications. It provides equivalents to torch.nn, torch.optim, torch.data and other components of the Python frontend. Here is a minimal side-by-side comparison of the two language frontends:

Python:

import torch

model = torch.nn.Linear(5, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
prediction = model.forward(torch.randn(3, 5))
loss = torch.nn.functional.mse_loss(prediction, torch.ones(3, 1))

C++:

#include <torch/torch.h>

torch::nn::Linear model(5, 1);
torch::optim::SGD optimizer(model->parameters(), /*lr=*/0.1);
torch::Tensor prediction = model->forward(torch::randn({3, 5}));
auto loss = torch::mse_loss(prediction, torch::ones({3, 1}));

We are releasing the C++ frontend marked as "API Unstable" as part of PyTorch 1.0. This means it is ready to be used for your research application, but still has some open construction sites that will stabilize over the next month or two. Some parts of the API may undergo breaking changes during this time.

See https://pytorch.org/cppdocs for detailed documentation on the greater PyTorch C++ API as well as the C++ frontend.

Breaking Changes

  • Indexing a 0-dimensional tensor now raises an error instead of a warning. Use tensor.item() instead. (#11679).
  • torch.legacy is removed. (#11823).
  • torch.masked_copy_ is removed, use torch.masked_scatter_ instead. (#9817).
  • Operations that result in 0 element tensors may return changed shapes.
    • Before: all 0-element tensors would collapse to shape (0,). For example, torch.nonzero is documented to return a tensor of shape (n,z), where n = number of nonzero elements and z = dimensions of the input, but would always return a Tensor of shape (0,) when no nonzero elements existed.
    • Now: Operations return their documented shape.
      # Previously: all 0-element tensors are collapsed to shape (0,)
      >>> torch.nonzero(torch.zeros(2, 3))
      tensor([], dtype=torch.int64)
      # Now, proper shape is returned
      >>> torch.nonzero(torch.zeros(2, 3))
      tensor([], size=(0, 2), dtype=torch.int64)
  • Sparse tensor indices and values shape invariants are changed to be more consistent in the case of 0-element tensors. See link for more details. (#9279).
  • torch.distributed: the TCP backend is removed. We recommend using the Gloo and MPI backends for CPU collectives and the NCCL backend for GPU collectives.
  • Some inter-type operations (e.g. *) between torch.Tensors and NumPy arrays will now favor dispatching to the torch variant. This may result in different return types. (#9651).
  • Implicit numpy conversion no longer implicitly moves a tensor to CPU. Therefore, you may have to explicitly move a CUDA tensor to CPU (tensor.to('cpu')) before an implicit conversion. (#10553).
  • torch.randint now defaults to using dtype torch.int64 rather than the default floating-point dtype. (#11040).
  • torch.tensor function with a Tensor argument now returns a detached Tensor (i.e. a Tensor where grad_fn is None). This more closely aligns with the intent of the function, which is to return a Tensor with copied data and no history. (#11061,
  • torch.nn.functional.multilabel_soft_margin_loss now returns Tensors of shape (N,) instead of (N, C) to match the behavior of torch.nn.MultiMarginLoss. In addition, it is more numerically stable.
  • The result type of a torch.float16 0-dimensional tensor and an integer is now torch.float16 (was torch.float32 or torch.float64 depending on the dtype of the integer). (#11941).
  • Dirichlet and Categorical distributions no longer accept scalar parameters. (#11589).
  • CPP Extensions: Deprecated factory functions that accept a type as the first argument and a size as the second argument have been removed. Instead, use the new-style factory functions that accept the size as the first argument and TensorOptions as the last argument. For example, replace your call to at::ones(torch::CPU(at::kFloat), {2, 3}) with torch::ones({2, 3}, at::kCPU). This applies to the following functions:
    • arange, empty, eye, full, linspace, logspace, ones, rand, randint, randn, randperm, range, zeros.
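
A few of the Python-facing changes in this list can be checked with a short sketch like the one below. It assumes a recent PyTorch build; the variable names are illustrative only.

```python
import warnings

import torch

# Indexing a 0-dimensional tensor raises; use .item() to get the Python number.
scalar = torch.tensor(3.5)
try:
    scalar[0]
    indexing_failed = False
except (IndexError, RuntimeError):
    indexing_failed = True
value = scalar.item()

# Operations on inputs with no matching elements keep their documented shape.
nonzero_shape = tuple(torch.nonzero(torch.zeros(2, 3)).shape)

# torch.randint defaults to int64.
randint_dtype = torch.randint(0, 10, (4,)).dtype

# torch.tensor(existing_tensor) copies the data and drops the history.
source = torch.ones(2, requires_grad=True) * 2
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # recent versions suggest clone().detach()
    copy = torch.tensor(source)
```

Here copy has no grad_fn and does not require grad, even though source is part of an autograd graph.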

Additional New Features

N-dimensional empty tensors

  • Tensors with 0 elements can now have an arbitrary number of dimensions and support indexing and other torch operations; previously, 0 element tensors were limited to shape (0,). (#9947). Example:
    >>> torch.empty((0, 2, 4, 0), dtype=torch.float64)
    tensor([], size=(0, 2, 4, 0), dtype=torch.float64)

New Operators

New Distributions

Additions to existing Operators and Distributions

Bug Fixes


Backwards Compatibility

  • torch.nn.Module load_from_state_dict now correctly handles 1-dimensional vs 0-dimensional tensors saved from 0.3 versions. (#9781).
  • Fix RuntimeError: storages don't support slicing when loading models saved with PyTorch 0.3. (#11314).


Error checking

  • torch.gesv now properly checks LAPACK errors. (#11634).
  • Fixed an issue where extra positional arguments were accepted (and ignored) in Python functions calling into C++. (#10499).
  • Legacy Tensor constructors (e.g. torch.FloatTensor(...)) now correctly check their device argument.
  • Properly check that out parameter is a CPU Tensor for CPU unary ops. (#10358).
  • torch.nn.InstanceNorm1d now correctly accepts 2 dimensional inputs. (#9776).
  • torch.nn.Module.load_state_dict had an incorrect error message. (#11200).
  • torch.nn.RNN now properly checks that inputs and hidden_states are on the same devices. (#10185).


Other Improvements


CPP Extensions

  • The torch/torch.h header is deprecated in favor of torch/extension.h, which should be used in all C++ extensions going forward. Including torch/torch.h from a C++ extension will produce a warning. It is safe to batch replace torch/torch.h with torch/extension.h.
  • Usage of the following functions in C++ extensions is also deprecated:
    • torch::set_requires_grad. Replacement: at::Tensor now has a set_requires_grad method.
    • torch::requires_grad. Replacement: at::Tensor now has a requires_grad method.
    • torch::getVariableType. Replacement: None.



Documentation Improvements


Table of Contents

  • Breaking Changes
  • New Features
    • Neural Networks
      • Adaptive Softmax, Spectral Norm, etc.
    • Operators
      • torch.bincount, torch.as_tensor, ...
    • torch.distributions
      • Half Cauchy, Gamma Sampling, ...
    • Other
      • Automatic anomaly detection (detecting NaNs, etc.)
  • Performance
    • Faster CPU ops in a wide variety of cases
  • Other improvements
  • Bug Fixes
  • Documentation Improvements

Breaking Changes

  • torch.stft has changed its signature to be consistent with librosa #9497
    • Before: stft(signal, frame_length, hop, fft_size=None, normalized=False, onesided=True, window=None, pad_end=0)
    • After: stft(input, n_fft, hop_length=None, win_length=None, window=None, center=True, pad_mode='reflect', normalized=False, onesided=True)
    • torch.stft is also now using FFT internally and is much faster.
  • torch.slice is removed in favor of the tensor slicing notation #7924
  • torch.arange now does dtype inference: any floating-point argument is inferred to be the default dtype; all integer arguments are inferred to be int64. #7016
  • torch.nn.functional.embedding_bag's old signature embedding_bag(weight, input, ...) is deprecated, embedding_bag(input, weight, ...) (consistent with torch.nn.functional.embedding) should be used instead
  • torch.nn.functional.sigmoid and torch.nn.functional.tanh are deprecated in favor of torch.sigmoid and torch.tanh #8748
  • Broadcast behavior changed in a (very rare) edge case: [1] x [0] now broadcasts to [0] (used to be [1]) #9209
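
The dtype inference and broadcasting changes above can be illustrated with a small sketch, assuming a recent PyTorch build:

```python
import torch

# Integer arguments give int64; any floating-point argument gives the default dtype.
int_range = torch.arange(5)
float_range = torch.arange(0, 5, 0.5)

# Broadcasting a size-1 dimension against a size-0 dimension yields size 0.
broadcast_shape = tuple((torch.ones(1) + torch.ones(0)).shape)

# Slicing notation replaces the removed torch.slice.
t = torch.arange(10)
middle = t[2:8:2]

# torch.sigmoid and torch.tanh replace the deprecated functional variants.
activated = torch.sigmoid(torch.zeros(2))
```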

New Features

Neural Networks

  • Adaptive Softmax nn.AdaptiveLogSoftmaxWithLoss #5287

    >>> in_features = 1000
    >>> n_classes = 200
    >>> adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(in_features, n_classes, cutoffs=[20, 100, 150])
    >>> adaptive_softmax
    AdaptiveLogSoftmaxWithLoss(
      (head): Linear(in_features=1000, out_features=23, bias=False)
      (tail): ModuleList(
        (0): Sequential(
          (0): Linear(in_features=1000, out_features=250, bias=False)
          (1): Linear(in_features=250, out_features=80, bias=False)
        )
        (1): Sequential(
          (0): Linear(in_features=1000, out_features=62, bias=False)
          (1): Linear(in_features=62, out_features=50, bias=False)
        )
        (2): Sequential(
          (0): Linear(in_features=1000, out_features=15, bias=False)
          (1): Linear(in_features=15, out_features=50, bias=False)
        )
      )
    )
    >>> batch = 15
    >>> input = torch.randn(batch, in_features)
    >>> target = torch.randint(n_classes, (batch,), dtype=torch.long)
    >>> # get the log probabilities of target given input, and mean negative log probability loss
    >>> adaptive_softmax(input, target) 
    ASMoutput(output=tensor([-6.8270, -7.9465, -7.3479, -6.8511, -7.5613, -7.1154, -2.9478, -6.9885,
            -7.7484, -7.9102, -7.1660, -8.2843, -7.7903, -8.4459, -7.2371],
           grad_fn=<ThAddBackward>), loss=tensor(7.2112, grad_fn=<MeanBackward1>))
    >>> # get the log probabilities of all targets given input as a (batch x n_classes) tensor
    >>> adaptive_softmax.log_prob(input)  
    tensor([[-2.6533, -3.3957, -2.7069,  ..., -6.4749, -5.8867, -6.0611],
            [-3.4209, -3.2695, -2.9728,  ..., -7.6664, -7.5946, -7.9606],
            [-3.6789, -3.6317, -3.2098,  ..., -7.3722, -6.9006, -7.4314],
            [-3.3150, -4.0957, -3.4335,  ..., -7.9572, -8.4603, -8.2080],
            [-3.8726, -3.7905, -4.3262,  ..., -8.0031, -7.8754, -8.7971],
            [-3.6082, -3.1969, -3.2719,  ..., -6.9769, -6.3158, -7.0805]],
    >>> # predict: get the class that maximizes log probability for each input
    >>> adaptive_softmax.predict(input)  
    tensor([ 8,  6,  6, 16, 14, 16, 16,  9,  4,  7,  5,  7,  8, 14,  3])
  • Add spectral normalization nn.utils.spectral_norm #6929

    >>> # Usage is similar to weight_norm
    >>> convT = nn.ConvTranspose2d(3, 64, kernel_size=3, padding=1)
    >>> # Can specify number of power iterations applied each time, or use default (1)
    >>> convT = nn.utils.spectral_norm(convT, n_power_iterations=2)
    >>> # apply to every conv and conv transpose module in a model
    >>> def add_sn(m):
    ...     for name, c in m.named_children():
    ...         m.add_module(name, add_sn(c))
    ...     if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
    ...         return nn.utils.spectral_norm(m)
    ...     else:
    ...         return m
    >>> my_model = add_sn(my_model)
  • nn.ModuleDict and nn.ParameterDict containers #8463

  • Add nn.init.zeros_ and nn.init.ones_ #7488

  • Add sparse gradient option to pretrained embedding #7492

  • Add max pooling support to nn.EmbeddingBag #5725

  • Depthwise convolution support for MKLDNN #8782

  • Add nn.FeatureAlphaDropout (featurewise Alpha Dropout layer) #9073
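
As an illustration of the new containers and initializers listed above, here is a minimal sketch. The Switchable module and its layer sizes are hypothetical, chosen only to keep the example small.

```python
import torch
from torch import nn


class Switchable(nn.Module):
    """Hypothetical module that picks an activation by name."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        # Submodules stored in a ModuleDict are registered like attributes.
        self.activations = nn.ModuleDict({
            "relu": nn.ReLU(),
            "tanh": nn.Tanh(),
        })
        # In-place initializers added in this release.
        nn.init.ones_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

    def forward(self, x, activation):
        return self.activations[activation](self.linear(x))


model = Switchable()
# Every pre-activation is -4 (weights of one, zero bias), so ReLU outputs zeros.
out = model(torch.full((2, 4), -1.0), "relu")
```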



torch.distributions

  • Half Cauchy and Half Normal #8411
  • Gamma sampling for CUDA tensors #6855
  • Allow vectorized counts in Binomial Distribution #6720



Performance

  • Accelerate bernoulli number generation on CPU #7171
  • Enable cuFFT plan caching (80% speed-up in certain cases) #8344
  • Fix unnecessary copying in bernoulli_ #8682
  • Fix unnecessary copying in broadcast #8222
  • Speed-up multidim sum (2x~6x speed-up in certain cases) #8992
  • Vectorize CPU sigmoid (>3x speed-up in most cases) #8612
  • Optimize CPU nn.LeakyReLU and nn.PReLU (2x speed-up) #9206
  • Vectorize softmax and logsoftmax (4.5x speed-up on single core and 1.8x on 10 threads) #7375
  • Speed up nn.init.sparse (10-20x speed-up) #6899


Tensor printing

  • Tensor printing now includes requires_grad and grad_fn information #8211
  • Improve number formatting in tensor print #7632
  • Fix scale when printing some tensors #7189
  • Speed up printing of large tensors #6876

Neural Networks

  • NaN is now propagated through many activation functions #8033
  • Add non_blocking option to nn.Module.to #7312
  • Loss modules now allow target to require gradient #8460
  • Add pos_weight argument to nn.BCEWithLogitsLoss #6856
  • Support grad_clip for parameters on different devices #9302
  • Removes the requirement that input sequences to pad_sequence have to be sorted #7928
  • stride argument for max_unpool1d, max_unpool2d, max_unpool3d now defaults to kernel_size #7388
  • Allow calling grad mode context managers (e.g., torch.no_grad, torch.enable_grad) as decorators #7737
  • torch.optim.lr_scheduler._LRScheduler's __getstate__ now includes optimizer info #7757
  • Add support for accepting Tensor as input in clip_grad_* functions #7769
  • Return NaN in max_pool/adaptive_max_pool for NaN inputs #7670
  • nn.EmbeddingBag can now handle empty bags in all modes #7389
  • torch.optim.lr_scheduler.ReduceLROnPlateau is now serializable #7201
  • Allow only tensors of floating point dtype to require gradients #7034 and #7185
  • Allow resetting of BatchNorm running stats and cumulative moving average #5766
  • Set the gradient of LP-Pooling to zero if the sum of all input elements to the power of p is zero #6766
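
Three of the improvements above (grad-mode decorators, unsorted input to pad_sequence, and pos_weight) can be sketched as follows. The predict function and the tensor sizes are illustrative assumptions.

```python
import math

import torch
from torch.nn.utils.rnn import pad_sequence


@torch.no_grad()  # grad-mode context managers also work as decorators
def predict(weight, x):
    return x @ weight


weight = torch.randn(3, 2, requires_grad=True)
prediction = predict(weight, torch.randn(5, 3))

# pad_sequence accepts sequences in any length order.
sequences = [torch.ones(2), torch.ones(4), torch.ones(1)]
padded = pad_sequence(sequences, batch_first=True)

# pos_weight rescales the loss contribution of positive targets.
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))
loss = criterion(torch.zeros(4, 1), torch.ones(4, 1))
expected_loss = 3.0 * math.log(2.0)  # -log(sigmoid(0)) scaled by pos_weight
```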



  • Always enable grad when calculating lazy_property #7708

Sparse Tensor

  • Add log1p for sparse tensor #8969
  • Better support for adding zero-filled sparse tensors #7479

Data Parallel

  • Allow modules that return scalars in nn.DataParallel #7973
  • Allow nn.parallel.parallel_apply to take in a list/tuple of tensors #8047


  • torch.Size can now accept PyTorch scalars #5676
  • Move torch.utils.data.dataset.random_split to torch.utils.data.random_split, and torch.utils.data.dataset.Subset to torch.utils.data.Subset #7816
  • Add serialization for torch.device #7713
  • Allow copy.deepcopy of torch.(int/float/...)* dtype objects #7699
  • torch.load can now take a torch.device as map location #7339
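
A short sketch of the relocated dataset helpers and of passing a torch.device as map_location, assuming a recent PyTorch build; the in-memory buffer stands in for a checkpoint file.

```python
import io

import torch
from torch.utils.data import Subset, TensorDataset, random_split

# random_split and Subset are importable directly from torch.utils.data.
dataset = TensorDataset(torch.arange(10))
train, val = random_split(dataset, [8, 2])

# torch.load accepts a torch.device as map_location.
buffer = io.BytesIO()
torch.save({"weights": torch.ones(2, 2)}, buffer)
buffer.seek(0)
state = torch.load(buffer, map_location=torch.device("cpu"))
```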

Bug Fixes

  • Fix nn.BCELoss sometimes returning negative results #8147
  • Fix tensor._indices on scalar sparse tensor giving wrong result #8197
  • Fix backward of tensor.as_strided not working properly when input has overlapping memory #8721
  • Fix x.pow(0) gradient when x contains 0 #8945
  • Fix CUDA torch.svd and torch.eig returning wrong results in certain cases #9082
  • Fix nn.MSELoss having low precision #9287
  • Fix segmentation fault when calling torch.Tensor.grad_fn #9292
  • Fix torch.topk returning wrong results when input isn't contiguous #9441
  • Fix segfault in convolution on CPU with large inputs / dilation #9274
  • Fix avg_pool2/3d count_include_pad having default value False (should be True) #8645
  • Fix nn.EmbeddingBag's max_norm option #7959
  • Fix returning scalar input in Python autograd function #7934
  • Fix THCUNN SpatialDepthwiseConvolution assuming contiguity #7952
  • Fix bug in seeding random module in DataLoader #7886
  • Don't modify variables in-place for torch.einsum #7765
  • Make return uniform in lbfgs step #7586
  • The return value of uniform.cdf() is now clamped to [0..1] #7538
  • Fix advanced indexing with negative indices #7345
  • CUDAGenerator will not initialize on the current device anymore, which will avoid unnecessary memory allocation on GPU:0 #7392
  • Fix tensor.type(dtype) not preserving device #7474
  • Batch sampler should return the same results when used alone or in dataloader with num_workers > 0 #7265
  • Fix broadcasting error in LogNormal, TransformedDistribution #7269
  • Fix torch.max and torch.min on CUDA in presence of NaN #7052
  • Fix torch.tensor device-type calculation when used with CUDA #6995
  • Fixed a missing '=' in nn.LPPoolNd repr function #9629



PyTorch 0.4.0 release notes

Table of Contents

  • Major Core Changes
    • Tensor / Variable merged
    • Zero-dimensional Tensors
    • dtypes
    • migration guide
  • New Features
    • Tensors
      • Full support for advanced indexing
      • Fast Fourier Transforms
    • Neural Networks
      • Trade-off memory for compute
      • bottleneck - a tool to identify hotspots in your code
    • torch.distributions
      • 24 basic probability distributions
      • Added cdf, variance, entropy, perplexity etc.
    • Distributed Training
      • Launcher utility for ease of use
      • NCCL2 backend
    • C++ Extensions
    • Windows Support
    • ONNX Improvements
      • RNN support
  • Performance improvements
  • Bug fixes

Major Core changes

Here is a summary of the updates to the most important core features users will use daily.

Major Changes and Potentially Breaking Changes:

  • Tensors and Variables have merged
  • Some operations now return 0-dimensional (scalar) Tensors
  • Deprecation of the volatile flag


  • dtypes, devices, and Numpy-style Tensor creation functions added
  • Support for writing device-agnostic code

We wrote a migration guide that should help you transition your code to new APIs and style. Please read it if you have code in a previous version of PyTorch that you would like to migrate.


The contents of this section (Major Core changes) are included in the migration guide.

Merging Tensor and Variable classes

torch.autograd.Variable and torch.Tensor are now the same class. More precisely, torch.Tensor is capable of tracking history and behaves like the old Variable; Variable wrapping continues to work as before but returns an object of type torch.Tensor. This means that you don't need the Variable wrapper everywhere in your code anymore.

The type() of a Tensor has changed

Note also that the type() of a Tensor no longer reflects the data type. Use isinstance() or x.type() instead:

>>> x = torch.DoubleTensor([1, 1, 1])
>>> print(type(x)) # was torch.DoubleTensor
<class 'torch.Tensor'>
>>> print(x.type())  # OK: 'torch.DoubleTensor'
>>> print(isinstance(x, torch.DoubleTensor))  # OK: True

When does autograd start tracking history now?

requires_grad, the central flag for autograd, is now an attribute on Tensors. Let's see how this change manifests in code.

autograd uses the same rules previously used for Variables. It starts tracking history when any input Tensor of an operation has requires_grad=True. For example,

>>> x = torch.ones(1)  # create a tensor with requires_grad=False (default)
>>> x.requires_grad
False
>>> y = torch.ones(1)  # another tensor with requires_grad=False
>>> z = x + y
>>> # both inputs have requires_grad=False. so does the output
>>> z.requires_grad
False
>>> # then autograd won't track this computation. let's verify!
>>> z.backward()
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
>>> # now create a tensor with requires_grad=True
>>> w = torch.ones(1, requires_grad=True)
>>> w.requires_grad
True
>>> # add to the previous result that has requires_grad=False
>>> total = w + z
>>> # the total sum now requires grad!
>>> total.requires_grad
True
>>> # autograd can compute the gradients as well
>>> total.backward()
>>> w.grad
tensor([ 1.])
>>> # and no computation is wasted to compute gradients for x, y and z, which don't require grad
>>> z.grad == x.grad == y.grad == None
True

Manipulating requires_grad flag

Other than directly setting the attribute, you can change this flag in-place using my_tensor.requires_grad_(requires_grad=True), or, as in the above example, at creation time by passing it in as an argument (default is False), e.g.,

>>> existing_tensor.requires_grad_()
>>> existing_tensor.requires_grad
True
>>> my_tensor = torch.zeros(3, 4, requires_grad=True)
>>> my_tensor.requires_grad
True

What about .data?

.data was the primary way to get the underlying Tensor from a Variable. After this merge, calling y = x.data still has similar semantics. So y will be a Tensor that shares the same data with x, is unrelated to the computation history of x, and has requires_grad=False.

However, .data can be unsafe in some cases. Any changes on x.data wouldn't be tracked by autograd, and the computed gradients would be incorrect if x is needed in a backward pass. A safer alternative is to use x.detach(), which also returns a Tensor that shares data with requires_grad=False, but will have its in-place changes reported by autograd if x is needed in backward.
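
The difference can be seen in a small sketch. It relies on sigmoid, whose backward pass reuses its own output, so zeroing that output corrupts the gradient; the tensor values are arbitrary.

```python
import torch

# Case 1: modify the result through .data. Autograd does not notice.
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a.sigmoid()       # backward of sigmoid reuses `out`
out.data.zero_()
out.sum().backward()
silent_grad = a.grad.clone()  # all zeros: wrong, but no error was raised

# Case 2: the same modification through .detach() is detected.
b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = b.sigmoid()
out.detach().zero_()
try:
    out.sum().backward()
    detected = False
except RuntimeError:
    detected = True
```

In the first case the gradient is silently zero; in the second, backward raises a RuntimeError about an in-place modification.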

Some operations now return 0-dimensional (scalar) Tensors

Previously, indexing into a Tensor vector (1-dimensional tensor) gave a Python number but indexing into a Variable vector gave (inconsistently!) a vector of size (1,)! Similar behavior existed with reduction functions, i.e. tensor.sum() would return a Python number, but variable.sum() would return a vector of size (1,).

Fortunately, this release introduces proper scalar (0-dimensional tensor) support in PyTorch! Scalars can be created using the new torch.tensor function (which will be explained in more detail later; for now just think of it as the PyTorch equivalent of numpy.array). Now you can do things like:

>>> torch.tensor(3.1416)         # create a scalar directly
tensor(3.1416)
>>> torch.tensor(3.1416).size()  # scalar is 0-dimensional
torch.Size([])
>>> torch.tensor([3]).size()     # compare to a vector of size 1
torch.Size([1])
>>> vector = torch.arange(2, 6)  # this is a vector
>>> vector
tensor([ 2.,  3.,  4.,  5.])
>>> vector.size()
torch.Size([4])
>>> vector[3]                    # indexing into a vector gives a scalar
tensor(5.)
>>> vector[3].item()             # .item() gives the value as a Python number
5.0
>>> sum = torch.tensor([2, 3]).sum()
>>> sum
tensor(5)
>>> sum.size()
torch.Size([])

Accumulating losses

Consider the widely used pattern total_loss += loss.data[0] before 0.4.0. loss was a Variable wrapping a tensor of size (1,), but in 0.4.0 loss is now a scalar and has 0 dimensions. Indexing into a scalar doesn't make sense (it gives a warning now, but will be a hard error in 0.5.0): use loss.item() to get the Python number from a scalar.

Note that if you don't convert to a Python number when accumulating losses, you may find increased memory usage in your program. This is because the right-hand-side of the above expression used to be a Python float, while it is now a zero-dim Tensor. The total loss is thus accumulating Tensors and their gradient history, which may keep around large autograd graphs for much longer than necessary.
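
A minimal sketch of the recommended accumulation pattern follows. The linear model, the random data, and the loop length are placeholders, not part of any real training setup.

```python
import torch

model = torch.nn.Linear(3, 1)
criterion = torch.nn.MSELoss()

total_loss = 0.0
for _ in range(4):
    inputs, targets = torch.randn(8, 3), torch.randn(8, 1)
    loss = criterion(model(inputs), targets)  # 0-dimensional tensor
    loss.backward()
    # .item() yields a Python float, so no autograd graph is kept alive.
    total_loss += loss.item()
```

Writing total_loss += loss instead would make total_loss a Tensor that carries the history of every iteration.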

Deprecation of volatile flag

The volatile flag is now deprecated and has no effect. Previously, any computation that involved a Variable with volatile=True wasn't tracked by autograd. This has now been replaced by a set of more flexible context managers including torch.no_grad(), torch.set_grad_enabled(grad_mode), and others.

>>> x = torch.zeros(1, requires_grad=True)
>>> with torch.no_grad():
...     y = x * 2
>>> y.requires_grad
False
>>> is_train = False
>>> with torch.set_grad_enabled(is_train):
...     y = x * 2
>>> y.requires_grad
False
>>> torch.set_grad_enabled(True)  # this can also be used as a function
>>> y = x * 2
>>> y.requires_grad
True
>>> torch.set_grad_enabled(False)
>>> y = x * 2
>>> y.requires_grad
False

dtypes, devices and NumPy-style creation functions

In previous versions of PyTorch, we used to specify data type (e.g. float vs double), device type (cpu vs cuda) and layout (dense vs sparse) together as a "tensor type". For example, torch.cuda.sparse.DoubleTensor was the Tensor type representing the double data type, living on CUDA devices, and with COO sparse tensor layout.

In this release, we introduce torch.dtype, torch.device and torch.layout classes to allow better management of these properties via NumPy-style creation functions.


Below is a complete list of available torch.dtypes (data types) and their corresponding tensor types.

Data type                   torch.dtype                      Tensor types
32-bit floating point       torch.float32 or torch.float     torch.*.FloatTensor
64-bit floating point       torch.float64 or torch.double    torch.*.DoubleTensor
16-bit floating point       torch.float16 or torch.half      torch.*.HalfTensor
8-bit integer (unsigned)    torch.uint8                      torch.*.ByteTensor
8-bit integer (signed)      torch.int8                       torch.*.CharTensor
16-bit integer (signed)     torch.int16 or torch.short       torch.*.ShortTensor
32-bit integer (signed)     torch.int32 or torch.int         torch.*.IntTensor
64-bit integer (signed)     torch.int64 or torch.long        torch.*.LongTensor

Use torch.set_default_dtype and torch.get_default_dtype to manipulate default dtype for floating point tensors.
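
The sketch below shows the dtype aliases and the effect of changing the default floating-point dtype. It restores the previous default afterwards so the change does not leak into other code.

```python
import torch

# Short names are aliases for the explicit-width dtypes.
aliases_match = (torch.float == torch.float32) and (torch.long == torch.int64)

previous = torch.get_default_dtype()
torch.set_default_dtype(torch.float64)
try:
    # Floating-point data now defaults to float64; integer data is unaffected.
    float_dtype = torch.tensor([1.5]).dtype
    int_dtype = torch.tensor([1]).dtype
finally:
    torch.set_default_dtype(previous)  # restore the original default
```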


A torch.device contains a device type ('cpu' or 'cuda') and optional device ordinal (id) for the device type. It can be initialized with torch.device('{device_type}') or torch.device('{device_type}:{device_ordinal}').

If the device ordinal is not present, this represents the current device for the device type; e.g., torch.device('cuda') is equivalent to torch.device('cuda:X') where X is the result of torch.cuda.current_device().
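
For illustration, the following sketch builds a few device objects and inspects their parts. Constructing a torch.device does not touch the hardware, so this should run on a CPU-only machine as well.

```python
import torch

cpu = torch.device("cpu")
second_gpu = torch.device("cuda:1")  # constructing a device does not require CUDA
any_gpu = torch.device("cuda")       # no ordinal: refers to the current CUDA device

device_parts = (second_gpu.type, second_gpu.index)

# The type and ordinal can also be passed separately.
same_device = torch.device("cuda", 1) == second_gpu
```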


torch.layout represents the data layout of a Tensor. Currently torch.strided (dense tensors) and torch.sparse_coo (sparse tensors with COO format) are supported.

Creating Tensors

Methods that create a Tensor now also take in dtype, device, layout, and requires_grad options to specify the desired attributes on the returned Tensor. For example,

>>> device = torch.device("cuda:1")
>>> x = torch.randn(3, 3, dtype=torch.float64, device=device)
>>> x
tensor([[-0.6344,  0.8562, -1.2758],
        [ 0.8414,  1.7962,  1.0589],
        [-0.1369, -1.0462, -0.4373]], dtype=torch.float64, device='cuda:1')
>>> x.requires_grad  # default is False
False
>>> x = torch.zeros(3, requires_grad=True)
>>> x.requires_grad
True


torch.tensor is one of the newly added tensor creation methods. It takes in array-like data of all kinds and copies the contained values into a new Tensor. As mentioned earlier, torch.tensor is the PyTorch equivalent of NumPy's numpy.array constructor. Unlike the torch.*Tensor methods, you can also create zero-dimensional Tensors (aka scalars) this way (a single Python number is treated as a Size in the torch.*Tensor methods). Moreover, if a dtype argument isn't given, it will infer the suitable dtype given the data. It is the recommended way to create a tensor from existing data like a Python list. For example,

>>> cuda = torch.device("cuda")
>>> torch.tensor([[1], [2], [3]], dtype=torch.half, device=cuda)
tensor([[ 1],
        [ 2],
        [ 3]], device='cuda:0')
>>> torch.tensor(1)               # scalar
tensor(1)
>>> torch.tensor([1, 2.3]).dtype  # type inference
torch.float32
>>> torch.tensor([1, 2]).dtype    # type inference
torch.int64

We've also added more tensor creation methods. Some of them have torch.*_like and/or tensor.new_* variants.

  1. torch.*_like takes in an input Tensor instead of a shape. It returns a Tensor with same attributes as the input Tensor by default unless otherwise specified:

    >>> x = torch.randn(3, dtype=torch.float64)
    >>> torch.zeros_like(x)
    tensor([ 0.,  0.,  0.], dtype=torch.float64)
    >>> torch.zeros_like(x, dtype=torch.int)
    tensor([ 0,  0,  0], dtype=torch.int32)
  2. tensor.new_* can also create Tensors with same attributes as tensor, but it always takes in a shape argument:

    >>> x = torch.randn(3, dtype=torch.float64)
    >>> x.new_ones(2)
    tensor([ 1.,  1.], dtype=torch.float64)
    >>> x.new_ones(4, dtype=torch.int)
    tensor([ 1,  1,  1,  1], dtype=torch.int32)

To specify the desired shape, you can either use a tuple (e.g., torch.zeros((2, 3))) or variable arguments (e.g., torch.zeros(2, 3)) in most cases.

Name                 Returned Tensor                                          torch.*_like variant    tensor.new_* variant
torch.empty          uninitialized memory
torch.zeros          all zeros
torch.ones           all ones
torch.full           filled with a given value
torch.rand           i.i.d. continuous Uniform[0, 1)
torch.randn          i.i.d. Normal(0, 1)
torch.randint        i.i.d. discrete Uniform in given range
torch.randperm       random permutation of {0, 1, ..., n - 1}
torch.tensor         copied from existing data (list, NumPy ndarray, etc.)
torch.from_numpy*    from NumPy ndarray (sharing storage without copying)
torch.range, and     uniformly spaced values in a given range
torch.logspace       logarithmically spaced values in a given range
torch.eye            identity matrix

*: torch.from_numpy only takes in a NumPy ndarray as its input argument.

Writing device-agnostic code

Previous versions of PyTorch made it difficult to write code that was device agnostic (i.e. that could run on both CUDA-enabled and CPU-only machines without modification).

PyTorch 0.4.0 makes this easier in two ways:

  • The device attribute of a Tensor gives the torch.device for all Tensors (get_device only works for CUDA tensors)
  • The to method of Tensors and Modules can be used to easily move objects to different devices (instead of having to call cpu() or cuda() based on the context)

We recommend the following pattern:

# at beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)
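
A self-contained version of this pattern might look like the sketch below. The linear model and random batch are stand-ins for a real module and data loader.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)
batch = torch.randn(8, 4).to(device)

# Create new tensors on the same device as existing ones instead of
# hard-coding .cuda() or .cpu().
targets = torch.zeros(8, 2, device=batch.device)

output = model(batch)
loss = torch.nn.functional.mse_loss(output, targets)
```

The same script runs unchanged on CPU-only and CUDA-enabled machines because every tensor takes its device from the single device variable.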


Full support for Advanced indexing

PyTorch now has full support for advanced indexing, following numpy's advanced indexing rules. The following examples are now possible:

a = torch.rand(10, 10, 10, 10)

# the indexing elements can have other shapes than 1
b = a[[[3, 2]], :, [[1, 3]]]

# broadcasting also supported in the indices, as well as lists,
# negative indices, slices, ellipses, numbers
c = a[[1, -2], 2:4, :, [1]]

# can also support tensors as indices
index = torch.tensor([2, 4])
d = a[index]

# and the indices can be on the GPU
# or CPU
e = a[index.cuda()]
f = a.cuda()[index]

mask = torch.rand(10) > 0.5
# we can now index with a mask that has fewer
# dimensions than the indexed tensor
c = a[mask, :5]

Fast Fourier Transform

  • Add new FFT methods #5856
  • Add torch.stft (short time Fourier transform) and hann/hamming/bartlett window functions. #4095
  • Support arbitrary number of batch dimensions in *FFT #6528

New and updated Torch operators

  • Added torch.log2 and torch.log10 #6272
  • Added torch.isnan #5273
  • Add torch.reshape, which is similar to numpy.reshape. It is roughly equivalent to tensor.contiguous().view(), but avoids copying in certain cases #5575
  • Add CPU implementation of torch.unique, which outputs the unique elements of a Tensor #5503
  • Add torch.det, torch.logdet and torch.slogdet, for computing the (log-)determinant of square 2D tensors. For negative determinants, torch.logdet returns nan, while torch.slogdet returns the sign of the determinant and the log of the absolute value of the determinant. #3816 and #5393
  • Add nn.functional.gumbel_softmax, which lets you use the reparametrization trick for discrete variables #3341
  • Add torch.take and Tensor.put_. Those functions are equivalent to numpy.take and numpy.put, and are the base for full support of advanced indexing in PyTorch #3263
  • Add torch.randint, similar to numpy.random.randint #6136
  • Add torch.diagonal and torch.diagflat, similar to numpy.diagonal and numpy.diagflat. They are meant as a replacement for torch.diag, which handled both the cases of constructing a diagonal tensor as well as extracting the diagonal of a matrix #5622
  • Add torch.einsum, equivalent to numpy.einsum. einsum allows you to perform operations using Einstein's notation. #5503
a = torch.arange(0, 9).reshape(3, 3)
# the following transposes a
b = torch.einsum('ij->ji', (a,))
  • Add torch.expm1, a numerically stable exp(x)-1 for small x. #4350
  • Allow users to specify individual split sizes with torch.split #3837
  • Add torch.where(condition, tensor1, tensor2) that returns a tensor of elements selected from tensor1 or tensor2 based on condition. #4259
  • Add Tensor.norm(dim) for sparse tensors. #4882
  • Implement torch.neg for all types. #4075
  • Implement gradient calculation for torch.trtrs. #3972
  • Deprecate out-of-place Tensor.resize and Tensor.resize_as. These have weird semantics and are hard to use correctly. Please use their in-place variants Tensor.resize_ and Tensor.resize_as_. #4886
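
As a rough illustration of a few of the operators listed above, the following sketch exercises torch.where, torch.split with individual sizes, torch.reshape, torch.slogdet / torch.logdet, and torch.expm1. The input values are made up for the example.

```python
import torch

x = torch.tensor([-1.0, 0.5, 2.0])

# torch.where picks from the first tensor where the condition holds,
# and from the second tensor elsewhere.
clipped = torch.where(x > 0, x, torch.zeros_like(x))

# torch.split accepts a list of individual section sizes.
head, tail = torch.split(torch.arange(10), [3, 7])

# torch.reshape behaves like numpy.reshape.
grid = torch.reshape(torch.arange(6), (2, 3))

# A matrix with determinant -1: logdet gives nan, slogdet still works.
swap = torch.tensor([[0.0, 1.0], [1.0, 0.0]])
sign, logabsdet = torch.slogdet(swap)
is_nan = torch.isnan(torch.logdet(swap))

# expm1 keeps precision where exp(x) - 1 rounds to zero in float32.
tiny = torch.tensor(1e-10)
stable = torch.expm1(tiny)
naive = torch.exp(tiny) - 1
```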

Rename async argument in .cuda() to non_blocking

The async keyword argument in conversion calls is now deprecated and has been replaced by non_blocking. This was necessary because async becomes a reserved keyword in Python 3.7.
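
A minimal sketch of the renamed argument, assuming the device-agnostic pattern recommended earlier. The tensor contents are arbitrary, and the pinned-memory step is only taken when a GPU is present.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

batch = torch.ones(4, 3)
if device.type == "cuda":
    # Asynchronous host-to-device copies need pinned host memory.
    batch = batch.pin_memory()

# non_blocking replaces the old async keyword argument.
on_device = batch.to(device, non_blocking=True)
```

On a CPU-only machine the call is a no-op, so the same line can be used unchanged on both kinds of hosts.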

Neural Networks

A new autograd container that lets you trade compute for memory

The new checkpoint container allows you to only store a subset of the outputs necessary for backpropagation. If an output is missing (to save memory), the checkpoint container will recompute the intermediate outputs from the closest checkpoint, so that memory usage can be reduced (with an increase in computation time).
Here is an example:

# input
input = torch.rand(1, 10)
# suppose we have a very deep model
layers = [nn.Linear(10, 10) for _ in range(1000)]
model = nn.Sequential(*layers)
output = model(input)

The above model uses a lot of memory, because it needs to keep the intermediate values of every operation for backpropagation. checkpoint lets you reduce the memory requirements:

# create the input tensors and set the requires_grad=True
# NOTE: the requires_grad=True for the input is a current
# limitation of checkpointing. At least one of the 
# model inputs should have requires_grad=True. 
# If you don't do it, you might have empty gradients.
input = torch.rand(1, 10, requires_grad=True)
layers = [nn.Linear(10, 10) for _ in range(1000)]

# define the functions that determine where
# we will checkpoint and store
# intermediate outputs. In this case,
# we will only store one intermediate
# output, in the middle of the
# model

def run_first_half(*args):
    x = args[0]
    for layer in layers[:500]:
        x = layer(x)
    return x

def run_second_half(*args):
    x = args[0]
    for layer in layers[500:-1]:
        x = layer(x)
    return x

# now uses the new checkpoint functionality
from torch.utils.checkpoint import checkpoint

x = checkpoint(run_first_half, input)
x = checkpoint(run_second_half, x)
# the last layer needs to be run without checkpoint
x = layers[-1](x)
x.sum().backward()  # works!

For sequential modules (which can have arbitrary blocks inside), a helper function checkpoint_sequential is provided, which takes care of the most common use-cases:

input = torch.rand(1, 10, requires_grad=True)
layers = [nn.Linear(10, 10) for _ in range(1000)]
model = nn.Sequential(*layers)

from torch.utils.checkpoint import checkpoint_sequential

# split in two blocks
num_segments = 2
x = checkpoint_sequential(model, num_segments, input)
x.sum().backward()  # works!

bottleneck - a tool to identify hotspots in your code

torch.utils.bottleneck (#5216, #6425) is a tool that can be used as an initial step for
debugging bottlenecks in your program. It summarizes runs of your script with
the Python profiler and PyTorch’s autograd profiler. See the bottleneck docs for more details.

reduce=False Losses

As of this release, all of our loss functions support the reduce keyword. Specifying reduce=False gives a Tensor per unit of loss instead of a single reduced loss. #4924, #5346, #5646, #4231, #4705, #5680
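
As a small sketch of what reduce=False returns, the example below compares the unreduced and reduced forms of nn.MSELoss on made-up inputs. Later PyTorch releases deprecate this keyword in favor of reduction='none', so expect a warning on newer versions.

```python
import torch
import torch.nn as nn

pred = torch.tensor([[0.0, 2.0], [1.0, 1.0]])
target = torch.tensor([[1.0, 0.0], [1.0, 3.0]])

# One loss value per element instead of a single scalar.
per_element = nn.MSELoss(reduce=False)(pred, target)

# The default behaviour averages those values into one scalar.
reduced = nn.MSELoss()(pred, target)
```

Keeping the individual values makes it possible to weight or mask them before reducing manually.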

New modules and module improvements

  • Add DistributedDataParallelCPU. This is similar to DistributedDataParallel, but with specific support for models running on the CPU (contrary to DistributedDataParallel, which targets GPU), and supports mpi, gloo and tcp backends #5919.
  • Add Group Normalization (nn.GroupNorm), an alternative to batch normalization that doesn't suffer from the same issues as BatchNorm for small batch sizes
  • Add Layer Normalization (nn.LayerNorm), an alternative for batch normalization often used in NLP tasks. #4922
  • Add Local Response Normalization (nn.LocalResponseNorm). #4922
  • MaxPool3d now supports double backwards. MaxPool3d and MaxUnpool3d now use indices consistent with the rest of the pooling layers. #5328
  • All loss functions now support a reduce argument to return a batch of losses. #264
  • Add util to clip gradient value in torch.nn.utils.clip_grad and add param to He initialization scheme in torch.nn.init. #6173
  • Renamed torch.nn.init.* methods to have an underscore in the end, as they operate in-place, and deprecate the old versions #6093
  • Added support for returning dictionaries in DataParallel #6113
  • Added support for N-D tensors in torch.nn.Bilinear #5764
  • Add Embedding.from_pretrained factory. This lets you initialize an Embedding layer with an existing tensor, bypassing the initial random initialization of its weights.
  • You can now slice nn.Sequential, nn.ModuleList, and nn.ParameterList #4491
  • Registered nn.Module integer parameters and buffers are now immune to module.float(), module.double(), and module.half() calls. #3820
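
A short sketch of three of the additions above: Embedding.from_pretrained, slicing an nn.Sequential, and nn.GroupNorm. The layer sizes and weight values are placeholders chosen for the example.

```python
import torch
import torch.nn as nn

# Build an embedding layer directly from an existing weight tensor.
weights = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
emb = nn.Embedding.from_pretrained(weights)
vectors = emb(torch.tensor([2, 0]))

# Slicing a Sequential returns another Sequential with the chosen layers.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
feature_extractor = model[:2]

# GroupNorm normalizes over groups of channels, independent of batch size.
group_norm = nn.GroupNorm(num_groups=2, num_channels=4)
normalized = group_norm(torch.randn(3, 4, 5))
```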


torch.distributions has expanded to include 24 basic probability distributions: Bernoulli, Beta, Binomial, Categorical, Cauchy, Chi2, Dirichlet, Exponential, FisherSnedecor, Gamma, Geometric, Gumbel, Laplace, LogNormal, Multinomial, MultivariateNormal, Normal, OneHotCategorical, Pareto, Poisson, RelaxedBernoulli, RelaxedOneHotCategorical, StudentT, and Uniform.

The Distribution interface has expanded to include many methods including .cdf(), .icdf(), .mean(), .variance(), .entropy(), and .perplexity(). Distributions now split tensor dimensions into sample_shape+batch_shape+event_shape. Most continuous distributions now also implement a differentiable .rsample() method to compute pathwise derivatives aka the reparameterization trick (check .has_rsample for availability):

>>> loc = torch.tensor(0., requires_grad=True)
>>> scale = torch.tensor(1., requires_grad=True)
>>> samples = Normal(loc, scale).rsample(sample_shape=(1000,))
>>> loss = (samples - 0.5).pow(4).mean()  # average over 1000 monte carlo samples
>>> grad(loss, [loc, scale])
(tensor(-7.5092), tensor(15.2704))

Most discrete distributions implement an .enumerate_support() method to make it easy to sum over all possible sample values (check .has_enumerate_support for availability).
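
For instance, an exact expectation under a small Categorical distribution can be computed by enumerating its support. This is a sketch with made-up probabilities, not a prescribed recipe.

```python
import torch
from torch.distributions import Categorical

dist = Categorical(probs=torch.tensor([0.2, 0.3, 0.5]))

# All values the distribution can take: the category indices 0, 1, 2.
support = dist.enumerate_support()

# Probability of each value, recovered from the log-probabilities.
probs = dist.log_prob(support).exp()

# Exact expectation of the sampled index, with no Monte Carlo sampling.
expected_index = (probs * support.float()).sum()
```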

kl_divergence is defined for many pairs of distributions, e.g.

>>> x = torch.tensor(1.0, requires_grad=True)
>>> kl = kl_divergence(Uniform(-x, x), Normal(0., 1.))
>>> grad(kl, [x])[0]

Distribution Transforms

New distributions can be created by combining TransformedDistribution with any number of Transform objects from the torch.distributions.transforms library, including: ExpTransform, PowerTransform, SigmoidTransform, AbsTransform, AffineTransform, SoftmaxTransform, StickBreakingTransform, LowerCholeskyTransform, and their inverses via the .inv property.
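
As a sketch of how this composition works, pushing a standard Normal through ExpTransform should reproduce the built-in LogNormal. The evaluation points are arbitrary.

```python
import torch
from torch.distributions import LogNormal, Normal, TransformedDistribution
from torch.distributions.transforms import ExpTransform

base = Normal(torch.tensor(0.0), torch.tensor(1.0))

# exp(Normal(0, 1)) is a log-normal distribution.
log_normal = TransformedDistribution(base, [ExpTransform()])

x = torch.tensor([0.5, 1.0, 2.0])
log_prob = log_normal.log_prob(x)
reference = LogNormal(torch.tensor(0.0), torch.tensor(1.0)).log_prob(x)

# Every transform exposes its inverse through the .inv property.
exp = ExpTransform()
round_trip = exp.inv(exp(x))
```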

Distribution Constraints

Distributions provide metadata about the constraints of their .support and about their arguments (.arg_constraints). These Constraint objects are registered with transforms using transform_to() and biject_to(). Together, constraints and transforms make it easy to specify new distributions in a generic way:

>>> scale = torch.tensor(1., requires_grad=True)
>>> p = Normal(0., scale)
>>> assert p.arg_constraints['scale'] == constraints.positive
>>> prior = TransformedDistribution(Normal(0., 1.),
...                                 transform_to(constraints.positive))

Constraints in the torch.distributions.constraints library include: boolean, greater_than(lower_bound), integer_interval(lower_bound, upper_bound), interval(lower_bound, upper_bound), lower_cholesky, lower_triangular, nonnegative_integer, positive, positive_definite, positive_integer, real, real_vector, simplex, and unit_interval.
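
A minimal sketch of looking up transforms by constraint, assuming the registries are imported from torch.distributions.constraint_registry. The input values are arbitrary unconstrained reals.

```python
import torch
from torch.distributions import constraints
from torch.distributions.constraint_registry import biject_to, transform_to

unconstrained = torch.linspace(-3.0, 3.0, 7)

# Map arbitrary reals onto the positive half-line.
to_positive = transform_to(constraints.positive)
positive = to_positive(unconstrained)

# Map arbitrary reals into an interval with an invertible transform.
to_interval = biject_to(constraints.interval(-1.0, 1.0))
bounded = to_interval(unconstrained)
recovered = to_interval.inv(bounded)
```

This is the usual way to optimize a constrained parameter: keep an unconstrained tensor and map it through the registered transform.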


Helper utility for launching Distributed Training jobs

We have added a utility function to help launch jobs on a distributed setup.
To launch a script that leverages DistributedDataParallel on either a single node or multiple nodes, we can make use of torch.distributed.launch as follows:

python -m torch.distributed.launch my_script.py --arg1 --arg2 --arg3

The script simplifies day to day usability of the distributed package.

You can read about its usage here: http://pytorch.org/docs/stable/distributed.html#launch-utility
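
On the script side, the launch utility's documentation describes passing a --local_rank argument to each process it starts. The sketch below shows one way my_script.py might accept it alongside its own flags; parse_args and the --arg1 flag are hypothetical names used only for illustration.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Supplied by the launcher: index of this process on the local node.
    parser.add_argument("--local_rank", type=int, default=0)
    # A script-specific flag, forwarded unchanged by the launcher.
    parser.add_argument("--arg1", action="store_true")
    return parser.parse_args(argv)


# Simulate the arguments one launched process would receive.
args = parse_args(["--local_rank=1", "--arg1"])
print(args.local_rank, args.arg1)
```

The local rank is typically used to pick the GPU for the process before wrapping the model in DistributedDataParallel.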

A new distributed backend based on NCCL 2.0

PyTorch now has a new distributed backend, which leverages NCCL 2.0 for maximum speed.
It also provides new APIs for collective operations on multiple GPUs.
You can enable the new backend via


Other distributed improvements

  • Coalesce many small broadcasts to improve performance #4978
  • Add mixed-precision support for distributed training #4891
  • Release NCCL distributed backend. Previously it was marked as experimental. #4921
  • Enable Infiniband support for Gloo data channel with automatic IB device detection #4795

C++ extensions

Previously, the official way of writing extensions using C or CUDA for custom modules was through the cffi extension. The drawback of this method was that it required a separate step for compiling the CUDA kernels, which could be a bit messy.

PyTorch now provides a better system for writing your own C++ / CUDA extensions. Example implementations using this new extension support can be found in the pytorch/cpp_extensions repo.

We provide two compilation modes:

  • ahead of time compilation: you write a setup.py script using the new CppExtension or CUDAExtension, which are extensions of the setuptools.Extension class;
  • just-in-time compilation: you pass the list of C++ / CUDA files that you want to compile to torch.utils.cpp_extension.load, and it will compile on the fly and cache the libraries for you.

Here is an example illustrating how easy it is to implement an extension:

In C++

// my_implementation.cpp
#include <torch/torch.h>
#include <unordered_set>

// can use templates as well. But let's keep it
// simple
using scalar_t = float;

at::Tensor unique_float(at::Tensor input_) {
  // only works for floats
  AT_ASSERT(input_.type().scalarType() == at::ScalarType::Float, "input must be a float tensor");
  // and CPU tensors
  AT_ASSERT(!input_.type().is_cuda(), "input must be a CPU tensor");
  // make the input contiguous, to simplify the implementation
  at::Tensor input = input_.contiguous();
  // get the pointer that holds the data
  scalar_t* input_data = input.data<scalar_t>();
  // let's use a function from the std library to implement
  // the unique function
  std::unordered_set<scalar_t> set(input_data, input_data + input.numel());
  // create the output tensor, with size set.size()
  at::Tensor output = input.type().tensor({static_cast<int64_t>(set.size())});
  scalar_t* output_data = output.data<scalar_t>();
  // copy the content of the set to the output tensor
  std::copy(set.begin(), set.end(), output_data);
  return output;
}

// this defines the functions exposed to Python
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("unique_float", &unique_float, "Unique for float tensors");
}

And then in Python

import torch
from torch.utils.cpp_extension import load as load_ext
# pass the source files, they will be compiled on the fly 
# and will return a python module
_C = load_ext('my_unique_lib', sources=['my_implementation.cpp'])

# now can use the functions implemented in C++
unique = _C.unique_float

a = torch.tensor([1.0, 2.0, 1.0])
print(unique(a))
# tensor([ 2.,  1.])

Windows support

PyTorch now officially supports Windows. We provide pre-compiled Conda binaries and pip wheels for Python 3.5 and 3.6.
PyTorch on Windows doesn't support distributed training and might be slightly slower than Linux / OSX because Visual Studio supports an older version of OpenMP.

As always, you can use the commands at http://pytorch.org to install PyTorch on Windows.
We have an FAQ that answers most questions you might have around Windows here: http://pytorch.org/docs/stable/notes/windows.html

ONNX Improvements

New ONNX operators

  • Support export torch.max(input, dim) and torch.min(input, dim) #6220
  • Add symbolic for ReLU to support exporting to ONNX #5759
  • Add sum, prod, sqrt and improve log_softmax #4579
  • Add ONNX support for InstanceNorm #4626
  • Add ONNX symbolic for Elu #3453
  • Add ONNX symbolic for UpsamplingNearest2d #3450


  • Print source location when ONNX export fails for a node #5652
  • Export onnx protobuf bindings to python #6651
  • Support output_padding in ConvTranspose #4583

Better RNN support

PyTorch can now export a subset of RNNs to ONNX #4409

  • Add Elman RNN export to ONNX #4613
  • Support batch-first in ONNX export of padded sequences #5360
  • Bidirectional Elman RNN export to ONNX #5120
  • Handle sequence lengths correctly when exporting RNNs to ONNX #4695
  • Support GRU export to ONNX #4390


  • Fix a bug in ONNX symbolic of 3d average pooling #6101
  • Fix onnx export of replication/reflection pad #4263

Miscellaneous improvements

  • Implement __dir__ for Tensors, so that editors can auto-complete and query the possible fields of Tensors

  • Add numpy() and from_numpy() to HalfTensor

  • Enable TensorDataset to have any number of input tensors.

  • Add padding_value to torch.nn.utils.rnn.pad_sequence

  • Add total_length option to pad_packed_sequence, which is useful when using DataParallel, as we can ensure that we have sequences of the same length.

  • Improve numerical precision of torch.arange, making it consistent with numpy.arange

  • torch.load() and torch.save() support arbitrary file-like object

  • torch.nn.functional.grid_sample now supports 2D (spatial) and 3D (volumetric) inputs

  • Set the Python random seed in DataLoader workers, in order to improve experiment reproducibility

  • Add __delitem__ to nn.Sequential. Now one can delete arbitrary elements of a nn.Sequential.

For example:

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 2))
del model[1]  # deletes nn.ReLU
  • ReduceLROnPlateau is now serializable #5300

  • Add option to flush denormal numbers on CPU. #5294

  • PyTorch now exposes the gradients of conv1d, conv2d and conv3d with respect to the input and the weights #5408

  • Add support for calling pack_padded_sequence with either list or with a Tensor #5133

  • Support negative indexing for padding_idx in nn.Embedding #4496

  • Implement backward pass for pack_padded_sequence #4512

  • Add nn.utils.rnn.pad_sequence and nn.utils.rnn.pack_sequence to pad lists of variable length Tensors with 0 and to pack a list of variable length Tensors.

  • Add torch.cuda.memory_cached, torch.cuda.max_memory_cached, torch.cuda.memory_allocated, and torch.cuda.max_memory_allocated methods
    for checking CUDA memory usage #4511

  • Allow viewing on noncontiguous tensors if the new view size is compatible with the tensor's original size and stride. #4062

  • NLLLoss and CrossEntropyLoss now support more than 2 dimensions. #4654

  • Add an option to not show model_zoo download progress bar #4135

  • You can now assign modules to indices of nn.Sequential. #4931

  • You can create tensors with a numpy np.longlong array #4367

  • Change the autograd execution order to use good heuristics. This greatly improves memory usage for large models. #4746

  • Add AMSgrad mode to Adam and SparseAdam optimizers. #4034

  • Better torch.autograd.profiler support for CUDA profiling using the cudaEvent API. #3734

  • torch.set_num_threads also sets the respective MKL option so you won't need to use an environment variable to control it. #4949
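
The sequence utilities mentioned in this list fit together roughly as follows. This is a sketch on toy 1-D sequences; total_length is passed to pad_packed_sequence, which is where the torch.nn.utils.rnn API accepts it.

```python
import torch
from torch.nn.utils.rnn import (
    pack_padded_sequence,
    pad_packed_sequence,
    pad_sequence,
)

# Three variable-length sequences, sorted from longest to shortest.
seqs = [torch.ones(3), torch.ones(2), torch.ones(1)]
lengths = [3, 2, 1]

# Pad to a common length, filling with -1 instead of the default 0.
padded = pad_sequence(seqs, batch_first=True, padding_value=-1.0)

# Pack for an RNN, then unpack to a fixed total length of 5.
packed = pack_padded_sequence(padded, lengths, batch_first=True)
unpacked, out_lengths = pad_packed_sequence(
    packed, batch_first=True, total_length=5
)
```

Fixing total_length keeps the output shape identical across DataParallel replicas even when each replica sees a different longest sequence.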

Performance improvements

  • Speed up CPU nn.EmbeddingBag, making training overall 30% faster #5433
  • Move nn.MarginRankingLoss, nn.CosineEmbeddingLoss, nn.HingeEmbeddingLoss, and nn.TripletMarginLoss from Python to our ATen backend, resulting in up to 3x performance gains in some cases.
    #5346, #5646, #5080, #5680
  • Implement pin_memory() as a NativeFunction #4094
  • Save self.numel() for backward computation instead of self to save memory #5747
  • Rearrange dimensions for pointwise operations for up to 10x better performance in one case. #4174
  • Vectorize normal_ for a 5-6x speed up in a small case #4312
  • Allowing usage of GPU Direct within PyTorch for the Broadcast operation #4183
  • Speed-up nn.Linear for the 3D input case #5279
  • Speed up Conv3D on the CPU by parallelizing vol2col and col2vol #4824
  • Add AVX2 implementation for sigmoid function, showing around 10x speedup #5010
  • Use fast integer division algorithm to avoid division ops inside kernels. #5054
  • Improve occupancy for CUDA random number generation #5710
  • Add optimization to norm for common norms #5722
  • Add a fast fused GLU backward #5782
  • Optimize unique sorting by using std::vector+sort instead of std::set, giving up to 5x speedup. #5913
  • Speed up sum over a dimension #6026
  • Enable MKLDNN convolution forward and backward. #6062
  • Parallelize non-contiguous point-wise operations with OpenMP #2764
  • Add cudnn Tensor Core ops to RNNs for Volta #3409
  • Vectorize exp, log, sin, cos #6078
  • Reuse intermediate results over multiple backwards grad_inputs #3526


  • DistributedDataParallel: 10% performance improvement for the NCCL backend, with mixed-precision support #5064
  • Slightly improve DistributedDataParallel (single-GPU binding) multi-process distributed training performance #4870

Bug fixes

torch operators

  • Improve torch.digamma precision near poles #6517
  • Fix incorrect behavior of Tensor.random_ on negative inputs #6463
  • Fix undefined behavior in backward pass for tensor.permute(dims) with negative dims #5945
  • Fix integer overflow in torch.remainder operator (it would break with a divisor above 2**48) #5906
  • Fix memory leak in torch.bmm #5744
  • Make dimension checker of scatter_add_ consistent with scatter_'s #5659
  • Fix CPU torch.multinomial with noncontiguous probability tensor input (previously, it would overwrite input data) #5093
  • Fix CUDA torch.multinomial using incorrect strides and being able to select zero-probability events. #5774, #5238
  • Support empty index tensor for index_select #3429
  • Support empty indices tensor in CUDA Tensor.put_ #4486
  • Improve stability of torch.cat with empty tensors #3602, #5971, #5819
  • Fix torch.fft in the case where any of the input dimensions is not aligned #6118
  • Improve the CUDA btrifact error message #5644
  • Return zeros for eigenvector tensor when not requested in torch.symeig #3411
  • Fix torch.btrifact on tensors. #4318
  • Fix torch.pstrf on tensors. #4883
  • Fix memory leak in torch.median #6889
  • Fix SVD backward on non-square matrices when some=False #6870


  • Detect re-initialization of _C shared library that would often result in segfaults on exit #6232
  • Fix indexing with all zero ByteTensors #3926
  • Only allow dense floating-point types as the default tensor type. #5674
  • Initialize CUDA before setting CUDA tensor types as default to prevent crash #4788
  • Fix a bug where from_dlpack fails if CUDA is not initialized. #4182
  • Fix crash in creating a CUDA tensor with a numpy array #5850
  • Fix broken sharing of empty tensor in multiprocessing on some OSes #6229


  • Restore allow_unused functionality: throw error when differentiated input is unused or unreachable. #6553
  • Fix output_nr not being incremented correctly. This caused crashes in the backward pass of operations that don't require grad on some inputs. #4812
  • Fix nvprof parsing in the torch.autograd.profiler #5840

nn layers

  • Support only specifying size in certain dimension for adaptive pooling #3127
  • Fix reflection padding boundary checks to not cause invalid memory access #6438
  • Improve error messages for NLLLoss. #5299, #6072
  • Fix kl_div backward on CUDA. Previously it would not respect gradOutput when computing gradInput. #5814
  • Fix incorrect bias size assert for Linear #5992
  • Fix incorrect nn.functional.convNd and nn.functional.conv_transposeNd error message #5701
  • Check that shape for input and target matches instead of number of elements for some loss functions #5085
  • Fix torch.diag backward returning square grad with non-square input #4538
  • Fix convolution type mismatch error message #5815
  • Add align_corners option to linearly interpolating upsampling and make the default upsampling behavior more consistent with other frameworks #5927
  • Prevent numerical issues with poisson_nll_loss when log_input=False #3336


  • Ensure convolution weights are contiguous to fix CUDA ConvTranspose double backward #4543
  • Fix CUDA double backwards #4460


  • Fix embedding with sparse=True #4686
  • Fix sparse embedding backward when input contains only padding_idx #6211
  • Handle copying empty sparse tensors to/from CPU, GPU. #5361


  • Add argument checks to the torch.utils.data.Sampler classes, fixing a bug where DataLoader tries to load the entire dataset on non-integer batch_size. #6249
  • Set dataloader.batch_size = None when batch_sampler is given, fixing a bug where DataLoader would report batch_size as 1. #6108
  • Improve signal handling in DataLoader #4643
  • Ignore FileNotFoundError when shutting down #5380
  • Make preprocessing deterministic #4640


  • Cast tensors when loading optimizer state dicts to improve usability #3658
  • List model parameters in deterministic order to improve stability of load_state_dict() #6031
  • Add parameter range checks for all optimizers #6000
  • Fix AMSGrad mode for SparseAdam #4314

distributed and multi-gpu

  • Fix a number of distributed training errors caused by a detach in place error #5829
  • Don't modify requires_grad when running DataParallel in no_grad mode #5880
  • Add GPU guard for broadcast_coalesce for Distributed Data Parallel stability #5655


  • Removed support for CUDA capability 3.0 and 5.0 (they still work for source builds for now, but the commitment to support this forward is removed)
  • Stop binary releases for CUDA 7.5
  • Add CPU-only binary releases that are 10x smaller in size than the full binary with CUDA capabilities.

As always, links to our binaries are on http://pytorch.org

New features

Bug Fixes

Data Loader / Datasets / Multiprocessing

  • Made DataLoader workers more verbose on bus error and segfault. Additionally, add a timeout option to the DataLoader, which will error if sample loading time exceeds the given value. #3474
  • DataLoader workers used to all have the same random number generator (RNG) seed because of the semantics of the fork syscall. Now, each worker will have its RNG seed set to base_seed + worker_id, where base_seed is a random int64 value generated by the parent process. You may use torch.initial_seed() to access this value in worker_init_fn, which can be used to set other seeds (e.g. NumPy) before data loading. worker_init_fn is an optional argument that will be called on each worker subprocess with the worker id as input, after seeding and before data loading. #4018
  • Add additional signal handling in DataLoader worker processes when workers abruptly die.
  • Negative value for num_workers now gives a ValueError #4019
  • fixed a typo in ConcatDataset.cumulative_sizes attribute name #3534
  • Accept longs in default_collate for dataloader in python 2 #4001
  • Re-initialize autograd engine in child processes #4158
  • Fix distributed dataloader so it pins memory to current GPU not GPU 0. #4196
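
A sketch of the seeding and timeout options described above. The helper name seed_worker, the toy dataset, and the 30-second timeout are illustrative choices, not part of the release.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id):
    # Inside a worker, torch.initial_seed() returns base_seed + worker_id.
    # NumPy only accepts 32-bit seeds, so reduce the value first.
    seed = torch.initial_seed() % 2**32
    np.random.seed(seed)
    random.seed(seed)


dataset = TensorDataset(torch.arange(8).float())
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2,
    worker_init_fn=seed_worker,  # runs once in every worker process
    timeout=30,                  # seconds to wait for a batch before erroring
)
```

Iterating over loader starts the workers, each of which calls seed_worker with its own id before loading any data.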


  • allow cudnn for fp16 batch norm #4021
  • Use enabled argument in torch.autograd.profiler.emit_nvtx (was being ignored) #4032
  • Fix cuBLAS arguments for fp16 torch.dot #3660
  • Fix CUDA index_fill_ boundary check with small tensor size #3953
  • Fix CUDA Multinomial checks #4009
  • Fix CUDA version typo in warning #4175
  • Initialize cuda before setting cuda tensor types as default #4788
  • Add missing lazy_init in cuda python module #4907
  • Lazy init order in set device, should not be called in getDevCount #4918
  • Make torch.cuda.empty_cache() a no-op when cuda is not initialized #4936


  • Assert MKL ld* conditions for ger, gemm, and gemv #4056

torch operators

  • Fix tensor.repeat when the underlying storage is not owned by torch (for example, coming from numpy) #4084
  • Add proper shape checking to torch.cat #4087
  • Add check for slice shape match in index_copy_ and index_add_. #4342
  • Fix use after free when advanced indexing tensors with tensors #4559
  • Fix triu and tril for zero-strided inputs on gpu #4962
  • Fix blas addmm (gemm) condition check #5048
  • Fix topk work size computation #5053
  • Fix reduction functions to respect the stride of the output #4995
  • Improve float precision stability of linspace op, fixes #4419. #4470


  • Fix python gc race condition with THPVariable_traverse #4437

nn layers

  • Fix padding_idx getting ignored in backward for Embedding(sparse=True) #3842
  • Fix cosine_similarity's output shape #3811
  • Add rnn args check #3925
  • NLLLoss works for arbitrary dimensions #4654
  • More strict shape check on Conv operators #4637
  • Fix maxpool3d / avgpool3d crashes #5052
  • Fix setting using running stats in InstanceNorm*d #4444


  • Fix DataParallel scattering for empty lists / dicts / tuples #3769
  • Fix refcycles in DataParallel scatter and gather (fix elevated memory usage) #4988
  • Broadcast output requires_grad only if corresponding input requires_grad #5061


  • Remove hard file offset reset in load() #3695
  • Have sizeof account for size of stored elements #3821
  • Fix undefined FileNotFoundError #4384
  • make torch.set_num_threads also set MKL threads (take 2) #5002


  • Fix wrong learning rate evaluation in CosineAnnealingLR in Python 2 #4656

Performance improvements

  • slightly simplified math in IndexToOffset #4040
  • improve performance of maxpooling backwards #4106
  • Add cublas batched gemm support. #4151
  • Rearrange dimensions for pointwise operations for better performance. #4174
  • Improve memory access patterns for index operations. #4493
  • Improve CUDA softmax performance #4973
  • Fixed double memory accesses of several pointwise operations. #5068

Documentation and UX Improvements

  • Better error messages for blas ops with cuda.LongTensor #4160
  • Add missing trtrs, orgqr, ormqr docs #3720
  • change doc for Adaptive Pooling #3746
  • Fix MultiLabelMarginLoss docs #3836
  • More docs for Conv1d Conv2d #3870
  • Improve Tensor.scatter_ doc #3937
  • [docs] rnn.py: Note zero defaults for hidden state/cell #3951
  • Improve Tensor.new doc #3954
  • Improve docs for torch and torch.Tensor #3969
  • Added explicit tuple dimensions to doc for Conv1d. #4136
  • Improve svd doc #4155
  • Correct instancenorm input size #4171
  • Fix StepLR example docs #4478

Table of contents

  • Breaking changes: removed reinforce()
  • New features
    • Unreduced losses
    • A profiler for the autograd engine
    • More functions support Higher order gradients
    • New features in Optimizers
    • New layers and nn functionality
    • New Tensor functions and Features
    • Other additions
  • API changes
  • Performance improvements
    • Big reduction in framework overhead (helps small models)
    • 4x to 256x faster Softmax/LogSoftmax
    • More...
  • Framework Interoperability
    • DLPack Interoperability
    • Model Exporter to ONNX (ship PyTorch to Caffe2, CoreML, CNTK, MXNet, Tensorflow)
  • Bug Fixes (a lot of them)

Breaking changes

Stochastic functions, i.e. Variable.reinforce(), were removed because of their limited functionality and broad performance implications. The motivation for stochastic functions was to avoid book-keeping of sampled values. In practice, users were still book-keeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change.

We introduce the torch.distributions package to replace Stochastic functions.

Your previous code typically looked like this:

probs = policy_network(state)
action = probs.multinomial()
next_state, reward = env.step(action)
action.reinforce(reward)
action.backward()

This is the new equivalent code:

probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
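
For a version that runs end to end, here is a minimal sketch with a stand-in policy network and a fixed reward. The network shape, the random state, and the reward value are assumptions made for illustration, since no environment is defined here.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)

# Stand-in policy: maps a 4-dimensional state to 2 action probabilities.
policy_network = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=-1))

state = torch.randn(4)
probs = policy_network(state)

m = Categorical(probs)
action = m.sample()

# Hypothetical reward; a real environment would return this from a step.
reward = 1.0

# Score-function (REINFORCE) loss, differentiated with ordinary autograd.
loss = -m.log_prob(action) * reward
loss.backward()
```

The gradient now flows through log_prob, so no special stochastic node is needed in the graph.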

New features

Unreduced losses

Some loss functions can now compute per-sample losses in a mini-batch:

  • By default PyTorch sums losses over the mini-batch and returns a single scalar loss. This was limiting to users.
  • Now, a subset of loss functions allow specifying reduce=False to return individual losses for each sample in the mini-batch
  • Example: loss = nn.CrossEntropyLoss(..., reduce=False)
  • Currently supported losses: MSELoss, NLLLoss, NLLLoss2d, KLDivLoss, CrossEntropyLoss, SmoothL1Loss, L1Loss
  • More loss functions will be covered in the next release

An in-built Profiler in the autograd engine

We built a low-level profiler to help you identify bottlenecks in your models

Let us start with an example:

>>> x = Variable(torch.randn(1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     y.backward()
>>> # NOTE: some columns were removed for brevity
... print(prof)
--------------------------------  ----------  ---------
Name                               CPU time   CUDA time
-------------------------------   ----------  ---------
PowConstant                        142.036us    0.000us
N5torch8autograd9GraphRootE         63.524us    0.000us
PowConstantBackward                184.228us    0.000us
MulConstant                         50.288us    0.000us
PowConstant                         28.439us    0.000us
Mul                                 20.154us    0.000us
N5torch8autograd14AccumulateGradE   13.790us    0.000us
N5torch8autograd5CloneE              4.088us    0.000us

The profiler works for both CPU and CUDA models.
For CUDA models, you have to run your python program with a special nvprof prefix. For example:

nvprof --profile-from-start off -o trace_name.prof -- python <your arguments>

# in python
>>> with torch.cuda.profiler.profile():
...     model(x) # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)

Then, you can load trace_name.prof in PyTorch and print a summary profile report.

>>> prof = torch.autograd.profiler.load_nvprof('trace_name.prof')
>>> print(prof)

Read additional documentation here

Higher order gradients

Added higher-order gradients support for the following layers

  • ConvTranspose, AvgPool1d, AvgPool2d, LPPool2d, AvgPool3d, MaxPool1d, MaxPool2d, AdaptiveMaxPool, AdaptiveAvgPool, FractionalMaxPool2d, MaxUnpool1d, MaxUnpool2d, nn.Upsample, ReplicationPad2d, ReplicationPad3d, ReflectionPad2d
  • PReLU, HardTanh, L1Loss, SoftSign, ELU, RReLU, Hardshrink, Softplus, SoftShrink, LogSigmoid, Softmin, GLU
  • MSELoss, SmoothL1Loss, KLDivLoss, HingeEmbeddingLoss, SoftMarginLoss, MarginRankingLoss, CrossEntropyLoss
  • DataParallel


Optimizers

  • optim.SparseAdam: Implements a lazy version of the Adam algorithm suitable for sparse tensors.
    • In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.
  • Optimizers now have an add_param_group function that lets you add new parameter groups to an already constructed optimizer.
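
The "lazy" update can be sketched in plain Python. This is only an illustration under simplifying assumptions: parameters are a flat list of floats, the sparse gradient is a dict mapping index to value, and the names `sparse_adam_step` and `state` are hypothetical, not part of the optim API.

```python
import math


def sparse_adam_step(params, sparse_grad, state, lr=1e-3,
                     betas=(0.9, 0.999), eps=1e-8):
    """Apply one Adam step to the entries present in `sparse_grad` only."""
    beta1, beta2 = betas
    state["step"] += 1
    bias1 = 1 - beta1 ** state["step"]
    bias2 = 1 - beta2 ** state["step"]
    step_size = lr * math.sqrt(bias2) / bias1
    for i, g in sparse_grad.items():
        # Moments of entries absent from the gradient are left untouched;
        # this is the "lazy" part.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        params[i] -= step_size * state["m"][i] / (math.sqrt(state["v"][i]) + eps)


params = [1.0, 1.0, 1.0]
state = {"step": 0, "m": [0.0] * 3, "v": [0.0] * 3}
sparse_adam_step(params, {1: 0.5}, state)   # only index 1 has a gradient
print(params)
```

A dense Adam step would also decay the moments of entries 0 and 2; here they are skipped entirely, which is what makes the variant cheap for large embedding tables.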

New layers and nn functionality

  • Added AdaptiveMaxPool3d and AdaptiveAvgPool3d
  • Added LPPool1d
  • F.pad now has support for:
    • 'reflection' and 'replication' padding on 1d, 2d, 3d signals (so 3D, 4D and 5D Tensors)
    • constant padding on n-d signals
  • nn.Upsample now works for 1D signals (i.e. B x C x L Tensors) in nearest and linear modes.
  • grid_sample now allows padding with the border value via padding_mode="border". grid_sample expects a grid in the range of [-1, 1], and if the values are out of these bounds, padding with the value 0.0 is applied by default. However, in a lot of cases, using the border value (i.e. the nearest valid value) helps improve accuracy of the overall model.
  • Introducing nn.utils.parameters_to_vector and nn.utils.vector_to_parameters
    • parameters_to_vector takes net.parameters() and returns a 1D vector that contains all the parameters
    • vector_to_parameters takes a vector of flattened parameters and copies the values over to a network's parameters
    • Convenient for some reinforcement learning algorithms, such as cross-entropy method, TRPO etc., which need to pull all network parameters as one big vector, modify them, and put the modified vector back.
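  • AdaptivePool*d layers now accept None for an output dimension, in which case that dimension is inferred from the input at runtime.
    • For example:
    # output width is 7; the output height is taken from the input
    # (e.g. an input with 10 rows gives a 10x7 output)
    m = nn.AdaptiveMaxPool2d((None, 7))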
  • DataParallel container on CPU is now a no-op (instead of erroring out)

New Tensor functions and features

  • Introduced torch.erf and torch.erfinv that compute the error function and the inverse error function of each element in the Tensor.
  • adds broadcasting support to bitwise operators
  • Added Tensor.put_ and torch.take similar to numpy.take and numpy.put.
    • The take function allows you to linearly index into a tensor without viewing it as a 1D tensor first. The output has the same shape as the indices.
    • The put function copies value into a tensor also using linear indices.
    • Differences from numpy equivalents:
      • numpy.take has an optional axis argument, which behaves like index_select. This axis argument is not yet present.
      • numpy.put repeats the values if necessary to make them as long as indices. This behavior is not yet replicated.
  • add zeros and zeros_like for sparse Tensors.
  • 1-element Tensors can now be cast to Python scalars. For example: int(torch.Tensor([5])) works now.
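
The linear-indexing semantics of take and put can be sketched with nested Python lists. The helpers below (`flatten`, `take`, `put_`) are illustrative stand-ins written for this note, not the torch functions; `put_` is restricted to 2D inputs to keep the sketch short.

```python
def flatten(nested):
    """Row-major flattening of a nested list."""
    if not isinstance(nested, list):
        return [nested]
    return [x for item in nested for x in flatten(item)]


def take(tensor, indices):
    """Look up linear indices; the result has the same shape as `indices`."""
    flat = flatten(tensor)

    def pick(idx):
        return [pick(i) for i in idx] if isinstance(idx, list) else flat[idx]

    return pick(indices)


def put_(matrix, indices, values):
    """Write `values` into a 2D list in place, using linear indices."""
    ncols = len(matrix[0])
    for idx, val in zip(indices, values):
        matrix[idx // ncols][idx % ncols] = val
    return matrix


src = [[4, 3, 5],
       [6, 7, 8]]
print(take(src, [0, 2, 5]))          # [4, 5, 8]
print(take(src, [[0, 2], [5, 1]]))   # [[4, 5], [8, 3]]

dst = [row[:] for row in src]        # copy, so that src stays unchanged
print(put_(dst, [1, 3], [9, 10]))    # [[4, 9, 5], [10, 7, 8]]
```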

Other additions

  • Added torch.cuda.get_device_name and torch.cuda.get_device_capability that do what the names say. Example:
    >>> torch.cuda.get_device_name(0)
    'Quadro GP100'
    >>> torch.cuda.get_device_capability(0)
    (6, 0)
  • If one sets torch.backends.cudnn.deterministic = True, then the CuDNN convolutions use deterministic algorithms
  • torch.cuda.get_rng_state_all and torch.cuda.set_rng_state_all are introduced to let you save / load the state of the random number generator over all GPUs at once
  • torch.cuda.empty_cache() frees the cached memory blocks in PyTorch's caching allocator. This is useful when running long-lived IPython notebooks while sharing the GPU with other processes.

API changes

  • softmax and log_softmax now take a dim argument that specifies the dimension in which slices are taken for the softmax operation. dim allows negative dimensions as well (dim = -1 will be the last dimension)
  • torch.potrf (Cholesky decomposition) is now differentiable and defined on Variable
  • Remove all instances of device_id and replace it with device, to make things consistent
  • torch.autograd.grad now allows you to specify inputs that are unused in the autograd graph if you use allow_unused=True
    This is useful when using torch.autograd.grad in large graphs with lists of inputs / outputs.
    For example:
    x, y = Variable(...), Variable(...)
    torch.autograd.grad(x * 2, [x, y]) # errors
    torch.autograd.grad(x * 2, [x, y], allow_unused=True) # works
  • pad_packed_sequence now allows a padding_value argument that can be used instead of zero-padding
  • Dataset now has a + operator (which uses ConcatDataset). You can do something like MNIST(...) + FashionMNIST(...) for example, and you will get a concatenated dataset containing samples from both.
  • torch.distributed.recv allows Tensors to be received from any sender (hence, src is optional). recv returns the rank of the sender.
  • adds zero_() to Variable
  • Variable.shape returns the size of the Tensor (now made consistent with Tensor)
  • torch.version.cuda specifies the CUDA version that PyTorch was compiled with
  • Add a missing function random_ for CUDA.
  • torch.load and torch.save can now take a pathlib.Path object, which is a standard Python3 typed filepath object
  • If you want to load a model's state_dict into another model (for example to fine-tune a pre-trained network), load_state_dict was strict on matching the key names of the parameters. Now we provide a strict=False option to load_state_dict where it only loads in parameters where the keys match, and ignores the other parameter keys.
  • added nn.functional.embedding_bag that is equivalent to nn.EmbeddingBag
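
The key-matching behavior of strict=False in load_state_dict can be pictured with plain dictionaries. This is a simplified sketch: `load_state_dict_sketch` is a hypothetical helper, the values are plain numbers instead of tensors, and the real method does more than this (it copies tensor data, for instance).

```python
def load_state_dict_sketch(own_state, state_dict, strict=True):
    """Copy entries of `state_dict` into `own_state` where the keys match."""
    if strict:
        missing = set(own_state) - set(state_dict)
        unexpected = set(state_dict) - set(own_state)
        if missing or unexpected:
            raise KeyError("missing keys: {}, unexpected keys: {}".format(
                sorted(missing), sorted(unexpected)))
    for name, value in state_dict.items():
        if name in own_state:       # non-matching keys are ignored
            own_state[name] = value
    return own_state


pretrained = {"conv.weight": 1, "fc.weight": 2}
model_state = {"conv.weight": 0, "classifier.weight": 0}

# Only "conv.weight" is shared, so only that entry is loaded.
print(load_state_dict_sketch(dict(model_state), pretrained, strict=False))
```

With strict=True the same call raises, because "fc.weight" is unexpected and "classifier.weight" is missing.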

Performance Improvements

  • The overhead of torch functions on Variables was around 10 microseconds. This has been brought down to ~1.5 microseconds by moving most of the core autograd formulas into C++ using our ATen library. This speeds up models that are very small, such as small LSTMs and other common models seen in NLP.
  • softmax and log_softmax are now 4x to 256x faster on the GPU after rewriting the gpu kernels
  • 2.5x to 3x performance improvement of the distributed AllReduce (gloo backend) by enabling GPUDirect
  • nn.Embedding's renorm option is much faster on the GPU. For embedding dimensions of 100k x 128 and a batch size of 1024, it is 33x faster.
  • All pointwise ops now use OpenMP and get multi-core CPU benefits
  • Added dedicated CUDA kernels for group convolutions where groups == nInputPlane (depthwise convolution). Speedups range from 5x to 1000x for tested layer sizes. See the benchmark table for more details as well as this table.
  • Fixed optim.SGD's memory usage for sparse gradients (for ex. nn.Embedding(..., sparse=True)), reducing the usage on a user-provided test script by 10x.
  • Optional NNPack integration for faster CPU convolutions (not part of binaries)
  • Reduce overhead of broadcasting if Tensors aren't broadcastable
  • torch.nn.utils.weight_norm over the right-most dimensions is faster
  • Backward of torch.norm is sped up by ~1.5x
  • Improve the performance of pack_padded_sequence
  • Add a single-argument version of torch.arange. For example torch.arange(10)

Framework Interoperability

DLPack Interoperability

DLPack Tensors are cross-framework Tensor formats. We now have torch.utils.dlpack.to_dlpack(x) and torch.utils.dlpack.from_dlpack(x) to convert between DLPack and torch Tensor formats. The conversion involves no memory copy and hence is very efficient.

Model exporter to ONNX

ONNX is a common model interchange format that, at the moment, can be executed in Caffe2, CoreML, CNTK, MXNet and TensorFlow. PyTorch models that are ConvNet-like and RNN-like (static graphs) can now be shipped to the ONNX format.

  • There is a new module torch.onnx (http://pytorch.org/docs/0.3.0/onnx.html) which provides the API for exporting ONNX models.

  • The operations supported in this release are:

    • add, sub (nonzero alpha not supported), mul, div, cat, mm, addmm, neg, tanh, sigmoid, mean, t, transpose, view, split, squeeze
    • expand (only when used before a broadcasting ONNX operator; e.g., add)
    • prelu (single weight shared among input channels not supported)
    • threshold (non-zero threshold/non-zero value not supported)
    • Conv, ConvTranspose, BatchNorm, MaxPool, RNN, Dropout, ConstantPadNd, Negate
    • elu, leaky_relu, glu, softmax, log_softmax, avg_pool2d
    • unfold (experimental support with ATen-Caffe2 integration)
    • Embedding (no optional arguments supported)
    • RNN
    • FeatureDropout (training mode not supported)
    • Index (constant integer and tuple indices supported)

Usability Improvements

  • More cogent error messages during indexing of Tensors / Variables
  • Add proper error message for specifying dimension on a tensor with no dimensions
  • better error messages for Conv*d input shape checking
  • More user-friendly error messages for LongTensor indexing
  • Better error messages and argument checking for Conv*d routines
  • Trying to construct a Tensor from a Variable fails more appropriately
  • If you are using a PyTorch binary with an insufficient CUDA version, a warning is now printed.
  • Fixed incoherent error messages in load_state_dict
  • Fix error message for type mismatches with sparse tensors

Bug fixes


  • Fix CUDA lazy initialization to not trigger on calls to torch.manual_seed (instead, the calls are queued and run when CUDA is initialized)


  • if x is 2D, x[[0, 3],] was needed to trigger advanced indexing. The trailing comma is no longer needed, and you can do x[[0, 3]]
  • x.sort(descending=True) used to incorrectly fail for Tensors. Fixed a bug in the argument checking logic to allow this.
  • Tensor constructors with numpy input: torch.DoubleTensor(np.array([0,1,2], dtype=np.float32))
    • torch will now copy the contents of the array in a storage of appropriate type.
    • If types match, it will share the underlying array (no-copy), with equivalent semantics to initializing a tensor with another tensor.
    • On CUDA, torch.cuda.FloatTensor(np.random.rand(10,2).astype(np.float32)) will now work by making a copy.
  • ones_like and zeros_like now create Tensors on the same device as the original Tensor
  • torch.multinomial on the CPU would reshape the input prob_dist in-place. Fixed this to make sure the prob_dist input's shape is unchanged after the call to multinomial
  • expand and expand_as allow expanding an empty Tensor to another empty Tensor
  • when [..., None, ...] was given (i.e. newaxis placement in indexing was specified), PyTorch had different behavior from NumPy. This is made consistent with NumPy in all cases.
  • Fix exponential distribution implementation to never sample infinity - cuRAND returns numbers in (0, 1]
  • torch.HalfTensor supports numpy() and torch.from_numpy
  • Add additional size checking for torch.scatter
  • fix torch.tril and torch.triu on the GPU for storage-offset Tensors (would return incorrect result).
  • Fix a memory leak in CUDA qr decomposition
  • Fix stream-awareness issues in THCUNN kernels
  • Fix kwargs parsing in torch.topk
  • Fixed random_ on CPU (which previously had a max value of 2^32) for DoubleTensor and LongTensor
  • Fix ZeroDivisionError: float division by zero when printing certain Tensors
  • torch.gels when m > n had a truncation bug on the CPU and returned incorrect results. Fixed.
  • Add a check in tensor.numpy() that checks if no positional arguments are passed
  • Before a Tensor is moved to CUDA pinned memory, added a check to ensure that it is contiguous
  • any and all work on empty Tensors on the cpu (previously errored out)
  • Fix symeig on CUDA for large matrices. The bug is that not enough space was being allocated for the workspace, causing some undefined behavior.
  • Improved the numerical stability of torch.var and torch.std by using Welford's algorithm
  • The Random Number Generator returned uniform samples with inconsistent bounds (inconsistency in cpu implementation and running into a cublas bug).
    • Now, all uniform sampled numbers will return within the bounds [0, 1), across all types and devices
  • Fix torch.svd to not segfault on large CUDA Tensors (fixed an overflow error in the magma bindings)
  • Allows empty index Tensor for index_select (instead of erroring out)
  • Previously, when eigenvectors=False, symeig returned some unknown value for the eigenvectors. Now we zero them out.
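
For readers unfamiliar with Welford's algorithm, mentioned above for torch.var and torch.std, here is a minimal single-pass sketch in plain Python. It illustrates the technique only; `welford_variance` is a hypothetical name and this is not the kernel PyTorch uses.

```python
def welford_variance(values, unbiased=True):
    """Single-pass variance; assumes at least two values when unbiased=True."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        # m2 accumulates the sum of squared deviations from the running mean
        m2 += delta * (x - mean)
    return m2 / (count - 1 if unbiased else count)


# Large offset, small spread: the case where the naive
# E[x^2] - E[x]^2 formula loses precision.
values = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(welford_variance(values))
```

Because the running mean is subtracted before squaring, the intermediate quantities stay on the scale of the spread rather than on the scale of the values themselves.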


  • Fix bug with 'coalesced' calculation in sparse 'cadd'
  • Fixes .type() not converting indices tensor.
  • Fixes sparse tensor coalesce on the GPU in corner cases


  • Fixed crashes when calling backwards on leaf variable with requires_grad=False
  • fix bug on Variable type() around non-default GPU input.
  • when torch.norm returned 0.0, the gradient was NaN. We now use the subgradient at 0.0, so the gradient is 0.0.
  • Fix a correctness issue with advanced indexing and higher-order gradients
  • torch.prod's backward was failing on the GPU due to a type error, fixed.
  • Advanced Indexing on Variables now allows the index to be a LongTensor backed Variable
  • Variable.cuda() and Tensor.cuda() are consistent in kwargs options


  • torch.optim.lr_scheduler is now imported by default.


  • Returning a dictionary from a nn.Module's forward function is now supported (used to throw an error)
  • When register_buffer("foo", ...) is called, and self.foo already exists, then instead of silently failing, now raises a KeyError
  • Fixed loading of older checkpoints of RNN/LSTM which were missing _data_ptrs attributes.
  • nn.Embedding had a hard error when using the max_norm option. This is fixed now.
  • when using the max_norm option, the passed-in indices are written upon (by the underlying implementation). To fix this, pass a clone of the indices to the renorm kernel.
  • F.affine_grid now can take non-contiguous inputs
  • EmbeddingBag can accept both 1D and 2D inputs now.
  • Workaround a CuDNN bug where batch sizes greater than 131070 fail in CuDNN BatchNorm
  • fix nn.init.orthogonal to correctly return orthonormal vectors when rows < cols
  • if BatchNorm has only 1 value per channel in total, raise an error in training mode.
  • Make cuDNN bindings respect the current cuda stream (previously raised incoherent error)
  • fix grid_sample backward when gradOutput is a zero-strided Tensor
  • Fix a segmentation fault when reflection padding is out of Tensor bounds.
  • If LogSoftmax has only 1 element, -inf was returned. Now this correctly returns 0.0
  • Fix pack_padded_sequence to accept inputs of arbitrary sizes (not just 3D inputs)
  • Detect pointer aliasing in cuDNN RNN flatten_parameters and avoid that path.
  • Fixed ELU higher order gradients when applied in-place
  • Workaround a CuDNN RNN bug for half-precision
  • Prevent numerical issues with poisson_nll_loss when log_input=False by adding a small epsilon

distributed and multi-gpu

  • Allow kwargs-only inputs to DataParallel. This used to fail: n = nn.DataParallel(Net()); out = n(input=i)
  • DistributedDataParallel calculates num_samples correctly in python2
  • Fix the case of DistributedDataParallel when 1-GPU per process is used.
  • Fixed DataParallel to specify GPUs that don't include GPU-0
  • DistributedDataParallel's exit doesn't error out anymore, the daemon flag is set.
  • Fix a bug in DistributedDataParallel in the case when model has no buffers (previously raised incoherent error)
  • Fix __getstate__ to be functional in DistributedDataParallel (was returning nothing)
  • Fix a deadlock in the NCCL bindings when GIL and CudaFreeMutex were starving each other


  • model_zoo.load_url now first attempts to use the requests library if available, and then falls back to urllib
  • Fix error when default_collate is passed a collection of numpy.str_

Here comes the next major release of PyTorch, just in time for ICML. Install it today from our website http://pytorch.org
Package documentation for this release is available at http://pytorch.org/docs/0.2.0/

We're introducing long-awaited features such as Broadcasting, Advanced Indexing, Higher-order gradients and finally: Distributed PyTorch.

Due to introducing Broadcasting, the code behavior for certain broadcastable situations is different from behavior in 0.1.12. This might lead to silent bugs in your existing code. We've provided easy ways of identifying this ambiguous code in the Important Breakages and Workarounds section.

Table of contents:

  • Tensor Broadcasting (numpy-style)
  • Advanced Indexing for Tensors and Variables
  • Higher-order gradients
  • Distributed PyTorch (multi-node training, etc.)
  • Neural Network layers and features: SpatialTransformers, WeightNorm, EmbeddingBag, etc.
  • New in torch and autograd: matmul, inverse, etc.
  • Easier debugging, better error messages
  • Bug Fixes
  • Important Breakages and Workarounds

Tensor Broadcasting (numpy-style)

In short, if a PyTorch operation supports broadcasting, then its Tensor arguments can be automatically expanded to be of equal sizes (without making copies of the data).

PyTorch Broadcasting semantics closely follow numpy-style broadcasting; if you are familiar with numpy broadcasting, things should just work as expected.

General Semantics

Two tensors are “broadcastable” if the following rules hold:

  • Each tensor has at least one dimension.
  • When iterating over the dimension sizes, starting at the trailing dimension, the two sizes must either be equal, or one of them must be 1, or one of them must not exist.

For Example:

>>> x=torch.FloatTensor(5,7,3)
>>> y=torch.FloatTensor(5,7,3)
# same shapes are always broadcastable (i.e. the above rules always hold)

# can line up trailing dimensions
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor(  3,1,1)

# x and y are broadcastable.
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist

# but:
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor(  3,1,1)
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3

If two tensors x, y are "broadcastable", the resulting tensor size is calculated as follows:

  • If the number of dimensions of x and y are not equal, prepend 1 to the dimensions of the tensor with fewer dimensions to make them equal length.
  • Then, for each dimension size, the resulting dimension size is the max of the sizes of x and y along that dimension.

For Example:

# can line up trailing dimensions to make reading easier
>>> x=torch.FloatTensor(5,1,4,1)
>>> y=torch.FloatTensor(  3,1,1)
>>> (x+y).size()
torch.Size([5, 3, 4, 1])

# error case
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor(  3,1,1)
>>> (x+y).size()
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1
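
The two rules for the result size can also be written out as a small helper. This is a plain-Python sketch over shape tuples; `broadcast_shape` is a hypothetical name, not a torch function, and it raises ValueError where PyTorch raises RuntimeError.

```python
def broadcast_shape(shape_x, shape_y):
    """Return the broadcast result shape, or raise if not broadcastable."""
    ndim = max(len(shape_x), len(shape_y))
    # Rule 1: prepend 1s to the shape with fewer dimensions.
    x = (1,) * (ndim - len(shape_x)) + tuple(shape_x)
    y = (1,) * (ndim - len(shape_y)) + tuple(shape_y)
    result = []
    for dim, (a, b) in enumerate(zip(x, y)):
        if a != b and a != 1 and b != 1:
            raise ValueError(
                "sizes {} and {} do not match at dimension {}".format(a, b, dim))
        # Rule 2: the result size is the max of the two sizes.
        result.append(max(a, b))
    return tuple(result)


print(broadcast_shape((5, 1, 4, 1), (3, 1, 1)))   # (5, 3, 4, 1)
```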

More details can be found on the PyTorch documentation site. Also, each torch function lists its broadcasting semantics in the documentation.

Advanced Indexing for Tensors and Variables

PyTorch now supports a subset of NumPy style advanced indexing. This allows users to select arbitrary indices at each dimension of the Tensor, including non-adjacent indices and duplicate indices, using the same []-style operation. This allows for a more flexible indexing strategy without needing calls to PyTorch's Index[Select, Add, ...] functions.

Let's look at some examples:

x = torch.Tensor(5, 5, 5)

Pure Integer Array Indexing - specify arbitrary indices at each dimension

x[[1, 2], [3, 2], [1, 0]]
--> yields a 2-element Tensor (x[1][3][1], x[2][2][0])

also supports broadcasting, duplicates

x[[2, 3, 2], [0], [1]]
--> yields a 3-element Tensor (x[2][0][1], x[3][0][1], x[2][0][1])

arbitrary indexer shapes allowed

x[[[1, 0], [0, 1]], [0], [1]]
--> yields a 2x2 Tensor [[x[1][0][1], x[0][0][1]],
                         [x[0][0][1], x[1][0][1]]]

can use colon, ellipsis

x[[0, 3], :, :]
x[[0, 3], ...]
--> both yield a 2x5x5 Tensor [x[0], x[3]]

also use Tensors to index!

y = torch.LongTensor([0, 2, 4])
x[y, :, :]
--> yields a 3x5x5 Tensor [x[0], x[2], x[4]]

selection with fewer than ndim indices, note the use of comma

x[[1, 3], ]
--> yields a 2x5x5 Tensor [x[1], x[3]]
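
The pure integer array indexing rule (one index list per dimension, combined position by position, with length-1 lists broadcast) can be emulated on nested Python lists. The helper `index_with_lists` is a hypothetical illustration that handles flat index lists only, not the PyTorch implementation.

```python
def index_with_lists(x, *index_lists):
    """Emulate x[idx0, idx1, ...] where every idx is a flat list of ints."""
    length = max(len(idx) for idx in index_lists)
    for idx in index_lists:
        if len(idx) not in (1, length):
            raise ValueError("index lists must have equal length or length 1")
    # Broadcast: a length-1 index list is repeated to the common length.
    expanded = [list(idx) * length if len(idx) == 1 else list(idx)
                for idx in index_lists]
    result = []
    for position in zip(*expanded):
        element = x
        for i in position:
            element = element[i]
        result.append(element)
    return result


# x[i][j][k] encodes its own position as 100*i + 10*j + k
x = [[[100 * i + 10 * j + k for k in range(5)] for j in range(5)]
     for i in range(5)]
print(index_with_lists(x, [1, 2], [3, 2], [1, 0]))   # [131, 220]
print(index_with_lists(x, [2, 3, 2], [0], [1]))      # [201, 301, 201]
```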

Higher order gradients

Now you can evaluate higher order differentials in PyTorch. For example, you can compute Hessian-Vector products, penalize the norm of the gradients of your model, implement Unrolled GANs and Improved WGANs, etc.

In the 0.2 release, we've enabled the ability to compute higher order gradients for all of the torch.XXX functions and the most popular nn layers. The rest will be covered in the next release.

Here's a short example that penalizes the norm of the weight gradients of a Resnet-18 model, so that the volume of weights is slow-changing.

import torch
from torchvision.models import resnet18
from torch.autograd import Variable

model = resnet18().cuda()

# dummy inputs for the example
input = Variable(torch.randn(2,3,224,224).cuda(), requires_grad=True)
target = Variable(torch.zeros(2).long().cuda())

# as usual
output = model(input)
loss = torch.nn.functional.nll_loss(output, target)

grad_params = torch.autograd.grad(loss, model.parameters(), create_graph=True)
# torch.autograd.grad does not accumulate the gradients into the .grad attributes
# It instead returns the gradients as Variable tuples.

# now compute the 2-norm of the grad_params
grad_norm = 0
for grad in grad_params:
    grad_norm += grad.pow(2).sum()
grad_norm = grad_norm.sqrt()

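# take the gradients wrt grad_norm. backward() will accumulate
# the gradients into the .grad attributes
grad_norm.backward()

# do an optimization step (assumes an optimizer over model.parameters()
# was constructed beforehand)
optimizer.step()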

We see two new concepts here:

  1. torch.autograd.grad is a function that takes in [outputs, list of inputs (for which you want gradients)], and returns the gradients wrt. these inputs as a tuple, rather than accumulating the gradients into the .grad attributes. This is useful if you want to further operate on the gradients.
  2. You can operate on the gradients, and call backward() on them.

The nn layers that support higher order gradients are:

  • AvgPool*d, BatchNorm*d, Conv*d, MaxPool1d, MaxPool2d, Linear, Bilinear
  • pad, ConstantPad2d, ZeroPad2d, LPPool2d, PixelShuffle
  • ReLU6, LeakyReLU, PReLU, Tanh, Tanhshrink, Threshold, Sigmoid, HardTanh, ELU, Softsign, SeLU
  • L1Loss, NLLLoss, PoissonNLLLoss, LogSoftmax, Softmax2d
    The rest will be enabled in the next release.

To enable higher order gradients, we've introduced a new style of writing autograd.Function (the current/old style of writing functions is fully backward compatible). You can read more about the new style of functions here.

Most of you don't write your own autograd.Functions; they are low-level primitives that introduce
new operations to the autograd engine, where you specify the forward and backward calls.

Distributed PyTorch

We introduce the torch.distributed package that allows you to exchange Tensors among multiple machines. Using this package, you can scale your network training over multiple machines and larger mini-batches. For example, you are given the primitives to implement Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.

The distributed package follows an MPI-style programming model. This means that there are functions provided to you such as send, recv, all_reduce that will exchange Tensors among nodes (machines).

So that the machines can first identify each other and be assigned unique numbers (ranks), we provide simple initialization methods:

  • shared file system (requires that all processes can access a single file system)
  • IP multicast (requires that all processes are in the same network)
  • environment variable (requires you to manually assign ranks and know an address of a node reachable from all processes)

Our package documentation contains more details on initialization and available backends, but here's an example of initializing using a multicast address:

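import torch.distributed as dist

# Placeholders: substitute your own backend, multicast address and port.
dist.init_process_group(backend='tcp',
                        init_method='tcp://<multicast-address>:<port>',
                        world_size=4)

print('Hello from process {} (out of {})!'.format(
        dist.get_rank(), dist.get_world_size()))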

This would print Hello from process 2 (out of 4) on the 3rd machine.

World size is the number of processes that will participate in the job. Each will be assigned a rank, which is a number between 0 and world_size - 1, unique within this job. It will serve as a process identifier and will be used instead of an address to, for example, specify which process a tensor should be sent to.

Here's a snippet that shows how simple point-to-point communication can be performed:

# All processes (receiving ones too!) need to have tensors of appropriate
# size preallocated.
x = torch.Tensor(10)
if dist.get_rank() == 0:
    # Send x to process with rank 1
    dist.send(x, dst=1)
else:  # rank == 1
    # Receive data from process with rank 0 and save result in x
    dist.recv(x, src=0)

Asynchronous p2p functions (isend, irecv) are available too.

However, some communication patterns appear so often that more efficient collective calls have been developed. They typically engage the whole process group and are much faster than naive algorithms using send/recv. One example is all_reduce:

x = torch.Tensor([dist.get_rank()])
# Add tensors from all processes such that they all receive the result.
# x is an input and output to this operation.
dist.all_reduce(x)  # the default reduction op is a sum
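
The semantics of all_reduce can be pictured without any networking. In this single-process sketch, a list stands in for the value of x held by each rank; `all_reduce_sum` is a hypothetical helper, and a real implementation exchanges data between processes rather than summing a local list.

```python
def all_reduce_sum(values_by_rank):
    """Every rank contributes one value and every rank receives the total."""
    total = sum(values_by_rank)
    return [total for _ in values_by_rank]


world_size = 4
# Each process starts with x equal to its own rank.
x_by_rank = [float(rank) for rank in range(world_size)]
print(all_reduce_sum(x_by_rank))   # [6.0, 6.0, 6.0, 6.0]
```

With four processes, every rank ends up holding 0 + 1 + 2 + 3 = 6.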

The distributed package is fairly low-level, which lets you implement more advanced algorithms and tailor the code to very specific purposes. Data-parallel training, however, is so common that we have created high-level helpers for it.

Hence, we've introduced DistributedDataParallel, which is meant to be a nearly drop-in replacement for nn.DataParallel.
Here's a code snippet demonstrating changes necessary to add it to existing training code:

# Wrap model in DistributedDataParallel (CUDA only for the moment)
model = torch.nn.parallel.DistributedDataParallel(model.cuda())

# Use a DistributedSampler to restrict each process to a distinct subset
# of the dataset.
train_dataset = ...
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, num_workers=args.workers,
    pin_memory=True, sampler=train_sampler)

for epoch in range(args.num_epochs):
    # Use .set_epoch() method to reshuffle the dataset partition at every epoch
    train_sampler.set_epoch(epoch)
    # training loop
    ...
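
To see why the sampler needs to know the epoch, here is a plain-Python sketch of the partitioning idea: every process shuffles the same index list with an epoch-derived seed and then keeps its own slice. `partition_indices` is a hypothetical helper that assumes the dataset has at least as many samples as there are processes; the real DistributedSampler may shuffle and slice differently.

```python
import random


def partition_indices(dataset_len, num_replicas, rank, epoch):
    """Indices that process `rank` visits during `epoch`."""
    indices = list(range(dataset_len))
    # Seeding with the epoch gives all processes the same permutation,
    # and a different one for each epoch.
    random.Random(epoch).shuffle(indices)
    # Pad with repeated samples so the total divides evenly.
    per_replica = -(-dataset_len // num_replicas)   # ceiling division
    total = per_replica * num_replicas
    indices += indices[:total - dataset_len]
    # Each process takes every num_replicas-th index, starting at its rank.
    return indices[rank:total:num_replicas]


parts = [partition_indices(10, num_replicas=4, rank=r, epoch=0) for r in range(4)]
print(parts)
```

If the seed did not change between epochs, each process would see the same subset of the data in the same order in every epoch.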

You can see a fuller Imagenet training example here

New nn layers: SpatialTransformers, WeightNorm, EmbeddingBag, etc.

New features

  • forward_pre_hook is introduced to execute user-specified closures right before a forward function is called.
  • Convenient access to non-leaf gradients:
    Currently, to access and inspect gradients of intermediate values, we have to use hooks. This is not convenient for doing simple inspections. Hence, we introduce retain_grad. It is best explained via an example:
input = Variable(torch.rand(1, 3), requires_grad=True)
h1 = input * 3
out = (h1 * h1).sum()

h1.retain_grad()
out.backward()

print(h1.grad)
# without calling retain_grad(), h1.grad is None
  • DataParallel now supports dicts as inputs

New Layers

  • Spatial Transformer Networks via F.grid_sample and F.affine_grid
  • nn.SeLU and nn.AlphaDropout are introduced, from the paper: Self-Normalizing Neural Networks
  • nn.GLU (Gated Linear Unit) is introduced from the paper Convolutional Sequence to Sequence Learning
  • Weight Normalization is now implemented via torch.nn.utils.weight_norm.
  • You can now ignore specific target indices while computing cross_entropy and nll_loss using the ignore_index argument. This is a cheap and useful way of implementing masking, where you can have a mask index that is ignored in computing the loss.
  • F.normalize implements dimension-wise renormalization
  • F.upsample and nn.Upsample consolidate multiple Upsampling layers into one function. It implements 2d and 3d bilinear/trilinear/nearest upsampling.
  • nn.EmbeddingBag: When building bag-of-words models, doing an Embedding followed by Sum or Mean is common. For variable length sequences, computing bags of embeddings involves masking. We provide a single nn.EmbeddingBag which is much more efficient and faster at computing bags of embeddings, especially for variable length sequences.
  • Numerically stable Binary Cross-Entropy loss via nn.BCEWithLogitsLoss (F.binary_cross_entropy_with_logits)
  • A negative log-likelihood loss with Poisson distribution of the target via PoissonNLLLoss
  • cosine_similarity: Returns cosine similarity between x1 and x2, computed along dim.
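
The masking effect of the ignore_index argument listed above can be sketched in plain Python. `nll_loss_sketch` is a hypothetical stand-in that averages over the kept targets and assumes at least one target is not ignored; it is not the library function.

```python
def nll_loss_sketch(log_probs, targets, ignore_index=-100):
    """Mean negative log-likelihood over the targets that are not ignored."""
    kept = [-row[t] for row, t in zip(log_probs, targets) if t != ignore_index]
    return sum(kept) / len(kept)


# Three samples, two classes; the last sample is padding and is masked out.
log_probs = [[-0.1, -2.4],
             [-1.2, -0.4],
             [-0.7, -0.7]]
targets = [0, 1, -100]
print(nll_loss_sketch(log_probs, targets))   # averages 0.1 and 0.4 only
```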

training utilities

Learning Rate Schedulers: torch.optim.lr_scheduler provides several dumb and smart methods to adjust the current learning rate. They are quite convenient while experimenting, giving a proxy for what you as the user would likely want to do.

There are various strategies provided, which can be used depending on the appropriate situation, more can be read in the package docs:

  • ReduceLROnPlateau, LambdaLR, StepLR, MultiStepLR, ExponentialLR
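
As a rough illustration of what two of these schedules compute, the closed-form rules for step decay and exponential decay can be written as plain functions. The names `step_lr` and `exponential_lr` are hypothetical helpers for this note; the scheduler classes themselves update the optimizer's learning rate in place.

```python
def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Multiply the learning rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def exponential_lr(base_lr, epoch, gamma):
    """Multiply the learning rate by gamma at every epoch."""
    return base_lr * gamma ** epoch


for epoch in (0, 29, 30, 60):
    print(epoch, step_lr(0.1, epoch, step_size=30))
```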

ConcatDataset is a convenient dataset meta-class that can merge and concatenate individual datasets.

New in torch and autograd

  • All reduce functions such as sum and mean now default to squeezing the reduced dimension. For example torch.sum(torch.randn(10, 20), 0) returns a 1D Tensor.
  • x.shape, similar to numpy. A convenience property that is equivalent to x.size()
  • torch.matmul, similar to np.matmul
  • bitwise and, or, xor, lshift, rshift
  • autograd support for inverse, gesv, cumprod, atan2
  • unbiased var and std now available via keyword argument option
  • torch.scatter_add: like torch.scatter, except that when duplicate indices are encountered, the values are summed.
  • torch.median behaves similarly to torch.sum when no arguments are given, i.e. it reduces all the dimensions and returns a single median value of the flattened Tensor.
  • masked_copy_ has been renamed to masked_scatter_ (with deprecation on masked_copy_)
  • torch.manual_seed now seeds all CUDA devices as well
  • You can now specify the random number generator object via keyword arguments torch.rand(1000, generator=gen)
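
The accumulation behavior of scatter_add on duplicate indices can be shown with a one-dimensional sketch in plain Python. `scatter_add_1d` is a hypothetical helper, not the torch function, which works along a chosen dimension of an n-d Tensor.

```python
def scatter_add_1d(target, index, src):
    """Return a copy of target with out[index[i]] += src[i] applied."""
    out = list(target)
    for i, value in zip(index, src):
        out[i] += value      # duplicate indices accumulate
    return out


# Index 0 appears twice, so both 1 and 3 are added to position 0.
print(scatter_add_1d([0, 0, 0], index=[0, 2, 0], src=[1, 2, 3]))   # [4, 0, 2]
```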

Bug-fixes and small improvements

  • Now we emit an error when a Variable is converted to a bool. For example:
b = Variable(torch.zeros(1))
if b[0]: # errors now
  • Fix correctness bugs in qr decomposition on CUDA.
  • Support for IBM PowerPC64 platform
  • Check that the CuDNN version at compile-time is the same version at run-time.
  • Improve error message in CUDA forked subprocess
  • Faster transposed-copy on CPU
  • Improve error messages in InstanceNorm
  • Add more argument checking for various routines, especially BatchNorm and Convolution routines.
  • Better error messages around shape reporting across the CPU backend.
  • Support more than 8 GPUs per machine (work-around a CUDA p2p restriction)
  • Improve error message when accessing attributes that don't exist
  • t() of Variable consistent with Tensor
  • prevent divide-by-zero when dropout p=1
  • fix sharing of CUDA tensors on non-current devices
  • when BN epsilon < allowed CuDNN value, fallback to THNN
  • Fix thread-thrashing when using different number of threads for MKL and OMP
  • improve memory usage when using CuDNN RNN
  • Fix ZeroPad2d backwards with negative padding
  • add dummy tensor.data property, to provide interpretable error message to users
  • Fix in-place division for Python3
  • Raise an error when calling from_numpy on a 0-dim array
  • Empty Tensors don't error out when shared across multiprocessing
  • fix baddbmm for expanded tensors
  • Let parallel_apply accept arbitrary inputs
  • keyword arguments in Tensor and Variable are now consistent
  • fix torch.inverse when Magma is not available
  • Add logical not operator for ByteTensor
  • add device asserts in scatter/gather kernels

Important Breakages and Workarounds

As you've read, we've introduced two important changes that are not
backward compatible:

  • Numpy-style Broadcasting
  • Reduction functions such as sum(1) now default to keepdim=False

We provide different levels of Python warnings that you can enable to alert you if you are using deprecated behavior or if the behavior of your code has changed.


Here is a code snippet that you can add to the top of your scripts.
Adding this code will generate warnings highlighting incompatible code.

Fix your code to no longer generate warnings.

# insert this to the top of your scripts (usually main.py)
import sys, warnings, traceback, torch
def warn_with_traceback(message, category, filename, lineno, file=None, line=None):
    sys.stderr.write(warnings.formatwarning(message, category, filename, lineno, line))
    traceback.print_stack(sys._getframe(2))  # show where the warning was triggered
warnings.showwarning = warn_with_traceback; warnings.simplefilter('always', UserWarning);
torch.utils.backcompat.broadcast_warning.enabled = True
torch.utils.backcompat.keepdim_warning.enabled = True

Once all warnings disappear, you can remove the code snippet.

More elaborately

Now, let us see the three incompatible changes with examples.

Using the (now deprecated) 1-dimensional view pointwise function

Prior versions of PyTorch allowed certain pointwise functions to execute on tensors with different shapes, as long as the number of elements in each tensor was equal. The pointwise operation would then be carried out by viewing each tensor as 1-dimensional. PyTorch now supports broadcasting. The “1-dimensional” pointwise behavior is considered deprecated and will generate a Python warning in cases where tensors are not broadcastable, but have the same number of elements.

For example:

>>> torch.add(torch.ones(4), torch.ones(2,2))
__main__:1: UserWarning: self and other not broadcastable, but have the same
number of elements.  Falling back to deprecated pointwise behavior.
[torch.FloatTensor of size 4]

Broadcasting in code where it didn't happen before

The introduction of broadcasting can cause backwards incompatible changes in the case where two tensors do not have the same shape,
but are broadcastable and have the same number of elements.

For example:

>>> torch.add(torch.ones(4,1), torch.randn(4))

would previously produce a Tensor with size: torch.Size([4,1]),
but now produces a Tensor with size: torch.Size([4,4]).
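
To see why the result changes from size [4,1] to [4,4], here is a minimal sketch of the NumPy-style shape rule in plain Python. The function name broadcast_shape is hypothetical and this is not PyTorch's code; it only models how shapes are aligned from the right and how size-1 dimensions are stretched.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    ndim = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x == y or y == 1:
            out.append(x)
        elif x == 1:
            out.append(y)
        else:
            raise ValueError("shapes %r and %r are not broadcastable" % (a, b))
    return tuple(out)


# (4, 1) against (4,): the second shape is treated as (1, 4).
new_shape = broadcast_shape((4, 1), (4,))

# (4,) against (2, 2): same number of elements, but not broadcastable,
# which is the case that falls back to the deprecated 1-dimensional view.
try:
    broadcast_shape((4,), (2, 2))
    fallback_needed = False
except ValueError:
    fallback_needed = True
```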

In order to help identify cases in your code where backwards incompatibilities introduced by broadcasting may exist, you may set torch.utils.backcompat.broadcast_warning.enabled to True, which will generate a python warning in such cases.

For example:

>>> torch.utils.backcompat.broadcast_warning.enabled=True
>>> torch.add(torch.ones(4,1), torch.ones(4))
__main__:1: UserWarning: self and other do not have the same shape, but are broadcastable, and have the same number of elements.

Note that this setting can trigger warnings for valid uses of broadcasting (including in library code), so you probably want to turn this warning off after migrating your code.

KeepDim=False for Reduction Functions

To get a warning when using a dimensional reduction function with the default keepdim argument, set torch.utils.backcompat.keepdim_warning.enabled to True. For example:

>>> torch.sum(torch.ones(2,3), 1)
__main__:1: UserWarning: backwards compatibility: call to "sum" uses default value for keepdim which has changed default to False.  Consider passing as kwarg.
[torch.FloatTensor of size 2]

As with torch.utils.backcompat.broadcast_warning.enabled, this warning can trigger from valid code, so you most likely want to disable this warning after migrating your code.
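
The effect of the keepdim default on output shapes can be modeled with a small shape-only helper. This is a sketch in plain Python; reduced_shape is a hypothetical name and no actual reduction is performed.

```python
def reduced_shape(shape, dim, keepdim=False):
    """Shape of the result after reducing `shape` along `dim`."""
    shape = list(shape)
    if keepdim:
        shape[dim] = 1      # old default: the reduced dimension stays with size 1
    else:
        del shape[dim]      # new default: the reduced dimension is squeezed away
    return tuple(shape)


new_default = reduced_shape((2, 3), 1)                 # keepdim=False
old_default = reduced_shape((2, 3), 1, keepdim=True)
```

Passing keepdim explicitly at each call site removes the dependence on the default, which is what the warning message suggests.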

Note also that using keepdim=False can cause your existing code to "just work" with broadcasting. For example:

# behavior with (old) keepdim=True, causes accidental broadcast
>>> torch.add(torch.ones(4), torch.ones(4,4).sum(dim=1, keepdim=True))
5  5  5  5
5  5  5  5
5  5  5  5
5  5  5  5
[torch.FloatTensor of size 4x4]

# new behavior with keepdim=False is equivalent to non-broadcasted result
>>> torch.add(torch.ones(4), torch.ones(4,4).sum(dim=1, keepdim=False))
[torch.FloatTensor of size 4]

API Changes

  • torch.range is deprecated in favor of torch.arange which is consistent with numpy and python range.
  • On sparse Tensors, contiguous is renamed to coalesce and coalesce is now made out-of-place.
    (a reminder that the sparse API is still experimental and evolving, so we don't provide backward compatibility).

New Features

New layers and functions

  • torch.topk is now supported for all CUDA types, not just torch.cuda.FloatTensor.
  • Added a three-way ranking loss: nn.TripletMarginLoss
  • Added per-instance normalization layers: nn.InstanceNorm1d, nn.InstanceNorm2d, nn.InstanceNorm3d
    Each channel is treated as an instance to normalize, and mean-subtraction and std-division is done. This is useful when dealing with larger images and smaller mini-batches where BatchNorm like effects are desired.
  • nn.ZeroPad2d and nn.ConstantPad2d are added.
  • nn.Bilinear is added, which computes Y = X1 * W * X2 + b

Negative dimension support for all functions

Every single function that took a dimension argument will also allow taking negative dimensions.

A negative dimension will index the tensor from the last dimension.

For example:

x = torch.randn(10, 20, 30)
y = torch.mean(x, dim = -1)

Here, since x has 3 dimensions and dim = -1, the last dimension, i.e. dim=2 (dimensions are zero-indexed), is picked for taking a mean.
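
The mapping from a negative dimension to a zero-indexed one can be sketched as follows. The helper name normalize_dim is hypothetical; it only illustrates the rule "count from the end", in the same way Python list indexing does.

```python
def normalize_dim(dim, ndim):
    """Map a possibly negative dimension index to its non-negative equivalent."""
    if not -ndim <= dim < ndim:
        raise IndexError("dimension %d out of range for %d-D tensor" % (dim, ndim))
    return dim + ndim if dim < 0 else dim


last = normalize_dim(-1, 3)    # last dimension of a 3-D tensor
first = normalize_dim(-3, 3)   # first dimension of a 3-D tensor
```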

The functions with dimension arguments are:

narrow, transpose, size, cat, chunk, gather, index_select, split, squeeze,
stack, unbind, unsqueeze, cumprod, cumsum, mean, median, mode, norm, prod, std,
sum, var, kthvalue, max, min, sort, topk, renorm,
index_add, index_copy, index_fill, scatter, select, unfold

CUDA support for Sparse Tensors, faster CPU sparse

Now a part of the torch.sparse API is also supported for torch.cuda.sparse.*Tensor.

Functions that are supported on CUDA are:

sparse_mask, to_dense, coalesce, transpose, spaddmm
spcadd, mul, div, cadd, csub, cmul

nn.Embedding now supports sparse even on CUDA (with the sparse=True flag) leveraging these sparse functions.

A new hybrid matrix-multiply hspmm operation that multiplies a sparse matrix with a dense matrix and returns a matrix in the form of a hybrid tensor (i.e. 1 sparse dimension, 1 dense dimension).

Several of the CPU sparse functions have more efficient implementations.

In a quickly hacked up Embedding classifier training script by @martinraison we see CUDA sparse performing as well as CUDA dense:

Time in seconds per batch:

         CPU    CUDA
Dense    10     0.86
Sparse   0.15   0.13

named_parameters to filter out specific parameter types

Let's say that you want to add weight decay to all parameters of your model except for the biases. How do you get only the biases of your model?
We introduce nn.Module.named_parameters for this.
It joins named_children and named_modules in helping you filter specific attributes of models.

Example of filtering out the biases of a model and giving them a weight_decay of 0:

import torch
import torch.nn as nn
import torch.optim as optim
m = nn.Sequential(
      nn.Linear(10, 20),
      nn.Linear(20, 20),
    )
weights, biases = [], []
for name, p in m.named_parameters():
   if 'bias' in name:
       biases += [p]
   else:
       weights += [p]

# biases get weight_decay=0; weights use the optimizer-wide default below
optim.SGD([
  {'params': weights},
  {'params': biases, 'weight_decay': 0}
], lr=1e-2, momentum=0.9, weight_decay=1e-5)

Performance Improvements

  • cumsum and cumprod are now significantly faster on the GPU, using Thrust primitives where appropriate.
  • LSTMCell and GRUCell are now significantly faster on the GPU via a fused kernel
  • The default Algorithm for CuDNN has been changed to PRECOMP_GEMM which is a
    much faster algorithm that takes a tiny bit of workspace. Previously, it used to
    be IMPLICIT_GEMM which took zero workspace, but was significantly slower.
  • 5% to 10% improvement in data loader by collating batches directly into shared memory.
  • SVD is now computed on the GPU via divide-and-conquer (sgesdd) which gives a 2x to 5x speedup.
  • The commonly used function expand has been moved to C, to have better performance in smaller models.

Bug Fixes

  • Added contiguous checks on weight and bias for a large range of THNN functions
  • make the range of random_ correct when both lower and upper bound are specified
  • parallel_apply now can take arguments that are unhashable
  • Reshape grad correctly in the Dot function (inputs don't have to be 1D vectors...)
  • Added Variable.type_as
  • Unify argument names of norm and renorm to have p=norm_type, dim=dim
  • btrisolve works on CPU doubles
  • ipython autocomplete for torch.nn.Module fixed via implementing __dir__
  • device_ids can now be None again in F.data_parallel and will use all available GPUs
  • workaround cudnn bugs in BatchNorm (<5.1.10) and Dilation (6.0.20)
  • Padding bugfix in Conv1d CPU
  • remainder and cremainder are fixed for integer types
  • fix memory leak in btrisolve and getri
  • If nn.Module's source can't be retrieved because of any exception,
    handle serialization to be non-fatal
  • collate_fn now retains the type of the numpy array
  • is_tensor and is_storage are now fixed for old-style Python classes
  • torch.cat now supports keyword arguments
  • CUDA collectives supported coalescing, but the inputs were all assumed
    to be of the same Tensor type. This is fixed.
  • Fix a deadlock bug in autograd because of an underlying glibc bug in specific
    linux distros (ArchLinux in particular)
  • abs is now fixed for char and short cuda types
  • fix torch.diag autograd when giving a dimension argument
  • fix grouped convolution on CPU when bias=False
  • expose dilated convolutions for ConvTranspose*d
  • Fix a bug in HingeEmbeddingLoss so that margin can now be specified via kwargs

Improved error messages

  • Fix errors and messages when no CUDA devices are available.

Minor API Changes

  • in optim.Adamax, the default learning rate and epsilon have been made
    consistent with Lasagne, Keras and TF.
    • Previous: (lr=1e-2, eps=1e-38)
    • Current : (lr=2e-3, eps=1e-8)
  • Make random_ range exclusive (it used to be exclusive when only the upper bound was specified, and inclusive when both were given).
  • torch.cat now disallows catting along nonexistent dimensions
    (to make it consistent with numpy and Variable cat)
  • torch.utils.clip_grad_norm now returns the total norm (say, for logging purposes).
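
As a rough illustration of what "returns the total norm" means, here is a plain-Python sketch of norm-based gradient clipping over lists of floats. It is not the library's code: the function below is a hypothetical stand-in that uses the L2 norm and scales the gradients in place.

```python
import math


def clip_grad_norm(grads, max_norm):
    """Scale grads in place so their combined L2 norm is at most max_norm.

    Returns the total norm measured before clipping, e.g. for logging.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= scale
    return total_norm


grads = [[3.0], [4.0]]                      # gradients of two parameters
total = clip_grad_norm(grads, max_norm=1.0)
```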

Performance Improvements

  • Reduce DataParallel overhead on >4 GPUs
    • Improve broadcast/reduce performance by coalescing tensors
  • nn.Embedding's backward performance increased for batch sizes > 1024

New Features


  • Batch triangular factorization and solves have been interfaced (CPU and GPU) and
    are available under torch.btrifact and torch.btrisolve. See documentation
    for usage
  • All RNG functions now have generator specifiable via a keyword argument
  • torch.mode is now supported on the GPU via a high-performance kernel.

autograd, nn and optim

  • CuDNN v6 integrated:
    • Faster Dilated Convolutions (and less memory hungry)
    • 1D FFT-based Convolutions
    • Significant performance improvement for Softmax layers
    • Speedups across many functions
    • Improved CuDNN error messages
    • We will integrate persistent RNNs in the next release
  • torch.trace, torch.cumsum, torch.cross are now implemented in autograd
  • nll_loss now supports Spatial inputs (i.e. 4d inputs BCHW) and computes
    channel-wise cross-entropy.
  • nn.PReLU now supports all dimensional Tensors, not just 1d and 2d.
  • add nn.PairwiseDistance and F.pairwise_distance that compute batchwise
    pairwise distance between two vectors.
  • Adaptive Max and Average Pooling added for 1d, 2d inputs via
    nn.AdaptiveMaxPool1d, nn.AdaptiveAvgPool2d, etc.
  • RMSProp now has momentum and a centered option. If centered is True,
    the gradient is normalized by an estimation of its variance. (Graves 2013)
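
A scalar sketch of what the centered option changes in the update, following the description above: the running mean of the gradient is tracked as well, and the denominator becomes an estimate of the standard deviation instead of the root mean square. The function name, the state layout and the default hyperparameters are assumptions for illustration, and momentum is left out.

```python
import math


def rmsprop_step(param, grad, state, lr=0.01, alpha=0.99, eps=1e-8, centered=False):
    """One RMSProp update on a scalar parameter; `state` holds the running averages."""
    state["square_avg"] = alpha * state["square_avg"] + (1 - alpha) * grad * grad
    if centered:
        state["grad_avg"] = alpha * state["grad_avg"] + (1 - alpha) * grad
        # E[g^2] - E[g]^2 is a running estimate of the gradient variance.
        variance = max(0.0, state["square_avg"] - state["grad_avg"] ** 2)
        denom = math.sqrt(variance) + eps
    else:
        denom = math.sqrt(state["square_avg"]) + eps
    return param - lr * grad / denom


plain = rmsprop_step(1.0, 2.0, {"square_avg": 0.0, "grad_avg": 0.0})
centered = rmsprop_step(1.0, 2.0, {"square_avg": 0.0, "grad_avg": 0.0}, centered=True)
```

Because the variance estimate is never larger than the mean square, the centered step is at least as large as the plain one for the same gradient.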


  • WeightedRandomSampler has been added as a custom sampler for the DataLoader.
    It samples elements from [0,..,len(weights)-1] with the given probabilities
    and is useful to sample from unbalanced datasets where some classes have
    many more samples than others. See the docs
    for more details
  • DataLoader now allows returning of numpy arrays
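
The sampling behavior described for WeightedRandomSampler can be approximated with the standard library. This is a sketch only: weighted_sample_indices is a hypothetical helper, and sampling with replacement is an assumption.

```python
import random


def weighted_sample_indices(weights, num_samples, rng=random):
    """Draw indices from [0, len(weights) - 1] with probability proportional to weights."""
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


rng = random.Random(0)        # seeded so the run is repeatable
weights = [0.0, 1.0, 9.0]     # index 0 is never drawn, index 2 is drawn most often
samples = weighted_sample_indices(weights, 1000, rng)
```

For an unbalanced dataset, one common choice is to give every sample a weight inversely proportional to the size of its class.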

Bug Fixes


  • When loading GPU checkpoints from disk with storage location remapping,
    torch.cuda was still attempted to be imported. This is now fixed, and
    you can load GPU checkpoints on machines with no GPUs or CUDA.
  • Work around an OSX fread bug where loading checkpoints of each Tensor > 1GB
    would give an error.
  • torch.cat no longer accepts a reversed iterator
    (it's not a PySequence).
    For example:
    l = [Variable(torch.ones(1,3)*i) for i in range(3)]
    torch.cat(reversed(l), 0) # errors now
  • Fix a memory leak in torch.from_numpy
  • GPU svd returned a larger matrix than expected in the some mode.
    This is now fixed to match CPU behavior.
  • Fix a bug in CPU max that was introduced in the previous release.

autograd, nn and optim

  • Reassigning attributes in modules correctly works now.
    This example used to not work correctly, l.a always remained None.
    Now it works as one would expect:
    l = nn.Linear(10, 20)
    l.a = None
    l.a = nn.Parameter(torch.randn(2))
    # l.a is correctly updated
  • Fix bug where adding a hook could replace an existing hook
  • Fix nn.Embedding and nn.CosineEmbeddingLoss to work without
    error on non-float CUDA (half, double)
  • Fix a bug in nn.Embedding when the max_norm option was used. Some of the
    indices were not respecting max_norm and this is fixed.
  • Fix corner-case in Variable's SetItem where gradient was of incorrect shape.
    x.grad used to be of shape 20, because y[1] was of shape 20.
    x = Variable(torch.randn(1, 20), requires_grad=True)
    y = Variable(torch.zeros(10, 20))
    y[1] = x
  • Fix a segfault in Conv1d when input doesn't require grad.
  • Assertions in pack_padded_sequence to check that sequence is of length > 0
  • torch.prod's autograd formulae were incorrect if the Tensor contained a 0.
    The formulae have been fixed.
  • Variable expand and expand_as had incorrect dimension inference when using
    broadcasting semantics. The formula has been fixed in these cases.
  • Fix a size mismatch in CosineEmbeddingLoss. See this issue for more details.
  • Fixed a bug in LBFGS that caused it to use uninitialized locals. See issue
  • Add assertions for negative padding in nn.Conv* functions.
  • Fix the stddev gradient formula for the stochastic function normal.


  • Fix issue when returning strings from the DataLoader when pin_memory=True
  • Binaries no longer depend on libcudart.so at runtime.

New Features

Indexing and Broadcasting Improvements

  • Add broadcasting semantics to expand / expand_as.
    • Previously, expand had no ability to add new dimensions, and unsqueeze
      had to be used to first create singleton dimensions before expansion.
    • Now, singleton dimensions are automatically prepended to the shape of
      the tensor if a matching dimension is found.
      Here's an example:
      x = torch.rand(5)
      y = torch.rand(4, 8, 5)
      z = x.expand_as(y) # z is of shape (4, 8, 5)
      x = torch.rand(1, 8, 1)
      z = x.expand_as(y) # z is of shape (4, 8, 5)
  • Unsqueeze dimensions using None indexing
    a = torch.randn(10)
    b = a.unsqueeze(0)
    b = a[None, :]     # Equivalent operations
  • Indexing with steps is supported (only positive steps)
    In [1]: a = torch.randn(10)
    In [2]: a
      [torch.FloatTensor of size 10]
    In [3]: a[0:10:3]
      [torch.FloatTensor of size 4]
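
The size of a stepped slice follows the same rule as Python's own slicing, which the sketch below uses to reproduce the "size 4" result. stepped_length is a hypothetical helper, not part of any library.

```python
def stepped_length(size, start, stop, step):
    """Number of elements selected by [start:stop:step] on a dimension of `size`."""
    if step <= 0:
        raise ValueError("only positive steps are supported")
    return len(range(*slice(start, stop, step).indices(size)))


n = stepped_length(10, 0, 10, 3)          # selects indices 0, 3, 6, 9
picked = list(range(10))[0:10:3]
```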

Variable-length mini-batches in Recurrent Networks

nn.RNN, nn.LSTM, nn.GRU now support mini-batches where sequences are of variable
lengths. You can pass an input of type PackedSequence
into these layers.
A PackedSequence holds data and a list of sequence sizes of a packed sequence batch.
For example, a PackedSequence will hold an input mini-batch of such sequences:

a b c d e
a b c d e f g h
a b
a b c d

Here, each input row is of variable length.

You can construct a PackedSequence using the provided function pack_padded_sequence.

pack_padded_sequence takes a Variable containing padded sequences, i.e. a Tensor
of T x B x *, where B is the size of the mini-batch, and each input is either of
length T or is padded to length T. It also takes a list of lengths of each input.
From these, it constructs a PackedSequence

For example, it will take the lengths [8, 5, 4, 2] and an input of size 8 x 4 x 128
that corresponds to:

a b c d e f g h
a b c d e 0 0 0
a b c d 0 0 0 0
a b 0 0 0 0 0 0

The output of the RNN layers will also be a PackedSequence, which can then be inverted
back to a padded Tensor using the inverse function pad_packed_sequence.
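
To make the packing idea concrete, here is a plain-Python model of the round trip on the letter example above. It is a sketch under assumptions, not the library's implementation: pack_padded and pad_packed are hypothetical helpers, the padded input is time-major (padded[t][b]), the lengths are assumed to be sorted in decreasing order, and the per-step list is called batch_sizes.

```python
def pack_padded(padded, lengths):
    """Drop padding from a time-major batch; returns (data, batch_sizes)."""
    if list(lengths) != sorted(lengths, reverse=True):
        raise ValueError("lengths must be sorted in decreasing order")
    data, batch_sizes = [], []
    for t in range(lengths[0]):
        # Sequences are sorted, so the first n of them are still active at step t.
        n = sum(1 for length in lengths if length > t)
        data.extend(padded[t][:n])
        batch_sizes.append(n)
    return data, batch_sizes


def pad_packed(data, batch_sizes, pad=0):
    """Inverse of pack_padded: rebuild the time-major padded batch."""
    batch = batch_sizes[0]
    padded, offset = [], 0
    for n in batch_sizes:
        padded.append(data[offset:offset + n] + [pad] * (batch - n))
        offset += n
    return padded


rows = ["abcdefgh", "abcde", "abcd", "ab"]
lengths = [len(r) for r in rows]                      # [8, 5, 4, 2]
padded = [[r[t] if t < len(r) else 0 for r in rows]   # T x B, padded with 0
          for t in range(lengths[0])]

data, batch_sizes = pack_padded(padded, lengths)
restored = pad_packed(data, batch_sizes)
```

Only the 19 real elements are stored in data, so a recurrent layer working on the packed form does no work on padding.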

Sparse Tensors (CPU)

Original goals:

  • ability to propagate sparse updates in a network (e.g. for updating an embedding matrix)
  • ability to efficiently compute "bag-of-words" sentence embeddings (e.g. weighted average of word embeddings)

Implemented features:

  • enable backpropagation of sparse gradients without conversion to dense tensors. In most cases a runtime exception is thrown when mixing different gradient types for the same variable
  • add some methods for THSTensor: zero, elementwise add and mul, scalar mul and div
  • make addcmul method of THTensor compatible with sparse operands
  • make spmm method accessible from Python as dsmm
  • sparse_mask method on THTensor. This produces a sparse tensor from a dense tensor,
    by using a sparse tensor as a mask. A value is only present in the output sparse
    tensor if it also exists in the mask.
  • update optim.Adagrad to use sparse updates when possible.
  • leave Variable's gradient to None by default.
    This is because there is no canonical zero gradient anymore (it could be dense or
    sparse, and if it is sparse we don't know how many dimensions are sparse)
  • N-dimensional values for sparse tensors:
    • Basically for things like applying sparse updates to embedding matrices, only the
      first dimension (the one that corresponds to the word index) is sparse. The other
      dimension is always dense (only whole embedding vectors are updated). An elegant
      solution is to make the values tensor N-dimensional instead of 1-dimensional.
      For an embedding matrix, the sparse gradient will have a values tensor of
      size nnz * embedding_size instead of just nnz.
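
The sparse_mask idea and the N-dimensional values layout can be modeled with a dictionary from row index to a dense row. This is only a conceptual sketch: the dictionary representation and the helper name are assumptions and say nothing about how THSTensor stores its data.

```python
def sparse_mask(dense, mask):
    """Keep only the rows of `dense` whose index is present in the sparse `mask`."""
    return {i: list(dense[i]) for i in sorted(mask)}


# Dense gradient of a 4-word embedding matrix with embedding_size 2.
dense_grad = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Sparse mask: only words 0 and 2 were used in this mini-batch.
mask = {0: [0.0, 0.0], 2: [0.0, 0.0]}

masked = sparse_mask(dense_grad, mask)
nnz = len(masked)                       # number of stored rows
embedding_size = len(dense_grad[0])     # each stored value is a whole row
```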

Common weight initialization methods for neural networks

By default, all Linear and Conv layers in PyTorch are initialized according to
a scheme proposed by LeCun'98.

However, there are several other commonly used initialization methods.
We now support many other methods via torch.nn.init.
Supported methods include:
uniform, normal, constant, xavier_uniform, xavier_normal, kaiming_uniform,
kaiming_normal, orthogonal, sparse

Here's an example of using these initialization methods:

import math
from torch import nn

class Net(nn.Module):
  def __init__(self):
     super(Net, self).__init__()
     self.conv1 = nn.Conv2d(5, 10, (3, 3))
     nn.init.xavier_uniform(self.conv1.weight, gain=math.sqrt(2.0))
     nn.init.constant(self.conv1.bias, 0.1)

network = Net()
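
For intuition about what xavier_uniform does with the gain, the sketch below computes the sampling bound for the Conv2d(5, 10, (3, 3)) layer above using the Glorot formula, bound = gain * sqrt(6 / (fan_in + fan_out)). The helper names are hypothetical and the weights themselves are not sampled here.

```python
import math


def conv_fans(in_channels, out_channels, kernel_size):
    """fan_in and fan_out of a convolution weight."""
    receptive_field = 1
    for k in kernel_size:
        receptive_field *= k
    return in_channels * receptive_field, out_channels * receptive_field


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Weights are drawn uniformly from [-bound, bound]."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


fan_in, fan_out = conv_fans(5, 10, (3, 3))
bound = xavier_uniform_bound(fan_in, fan_out, gain=math.sqrt(2.0))
```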

Other features

  • Added a gradient checker utility torch.autograd.gradcheck that can
    be used to check your implementations. Here's a small example:
    import torch
    from torch.autograd import Variable, gradcheck
    inputs = Variable(torch.randn(4, 4), requires_grad=True)
    gradcheck(lambda x: 2*x.diag(), (inputs,), eps=1e-3)
  • Add a clip_grad_norm utility to easily clip gradients via constraints on their norms.
  • Document nn.ModuleList and nn.ParameterList that are immensely useful when
    storing a list of modules in a Container
  • Optimizers have backward compatibility for old checkpoints.
    __setstate__ and __getstate__ introduced into optimizers.
  • Add Nesterov momentum to optim.SGD via nesterov=True kwarg
  • DataParallel supports multiple inputs and keyword args (which are also scattered)
    m = nn.DataParallel(model)
    # Now valid
    m(x, y, option=z)
    See the documentation for exact behavior.
  • DataLoader's default_collate now also supports numpy arrays
  • Added F.pad that supports Constant, Reflection and Replication padding in a single
    interface: http://pytorch.org/docs/nn.html#pad
  • train() now optionally supports a boolean argument. For example model.train(False)
    will set it to eval mode and model.train(True) sets it to train mode.
  • Added a DataLoader sampler: SubsetRandomSampler that takes a list of indices
    in its constructor and randomly samples from these indices. Useful when you
    want to sample only a particular subset of your dataset.
  • Transpose supports negative dimensions. For example:
    a = torch.randn(2, 3)
    b = a.transpose(0, 1)   # both are equivalent
    b = a.transpose(-2, -1) # both are equivalent
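
The idea behind a gradient checker such as gradcheck can be sketched in plain Python: compare a hand-written gradient against central finite differences. The helper names and the tolerance are assumptions for illustration, and this is not how torch.autograd.gradcheck is implemented.

```python
def numerical_grad(f, x, eps=1e-3):
    """Central finite-difference gradient of a scalar function f at the list x."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def check_grad(f, analytic_grad, x, eps=1e-3, tol=1e-4):
    """True if the analytic gradient matches the numerical one within tol."""
    numeric = numerical_grad(f, x, eps)
    analytic = analytic_grad(x)
    return all(abs(n - a) <= tol for n, a in zip(numeric, analytic))


f = lambda x: sum(2.0 * v * v for v in x)      # f(x) = sum(2 * x_i^2)
point = [0.5, -1.0, 2.0]

good = check_grad(f, lambda x: [4.0 * v for v in x], point)   # correct gradient
bad = check_grad(f, lambda x: [2.0 * v for v in x], point)    # deliberately wrong
```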

Performance Improvements

  • CPU Tensor backend gets faster
    • Explicit AVX, AVX2 and improved SSE intrinsics to speedup copy, fill, add, mul, div
    • Much improved speed for all apply and reduce operations to have better cache hits
    • Added OpenMP in TH_TENSOR_APPLY* operations
    • Overall, 2x to 10x+ faster on a lot of operations, closer to Numpy speeds
    • Runtime dispatch of intrinsics based on CPU features (easy to ship binaries)
  • Serialization Improvements
    • Fixed bugs on serialization for Tensors > 2GB
    • 5x to 10x faster serialization (no longer Tarring Tensors)

Bug Fixes

  • Multi-GPU CuDNN RNN now has separate dropout descriptors per GPU
  • NLLLoss2d has proper shape checks on GPU and stable sizeAverage formulation
  • LogSoftmax2d has a more stable formula
  • Fix prodall (prod without dim arguments) to not average
  • Return correct number of gradients from cuDNN RNN
  • NLLLoss2d has support for weights
  • Fix Unpooling bug for MaxPool1d
  • Fix Indexing when using only an ellipsis
x = torch.randn(2,2,2,2)
x[...] # used to fail, fixed now.
  • expose stateless methods (torch.* methods) for torch.cuda.HalfTensor
  • Prevent creation of reference cycles (and hence improve memory usage) when
    leaf variables were using in-place operations.
  • Fix gradient computation for the indexing operation in the case of sending in
  • Fix a reshaping bug in the grad_input of basic operations such as +, -, *, / etc.
    This used to fail, but is fixed now:
    x = Variable(torch.randn(4, 6), requires_grad=True)
    b = Variable(torch.rand(12, 1) + 1e-2, requires_grad=True)
    (x + b.mm(Variable(torch.rand(1, 2) + 1e-2))).sum().backward()
  • Revert partial indexing with LongTensor to return to numpy-compatibility
  • References to some Tensors in BatchNorm and Conv are now freed to improve
    memory usage in certain situations. Before this fix, ResNet-152 finetuning with
    batch_size 16 consumed the same amount of memory as batch_size 256 does after it.
  • Fix a bug where requires_grad was being propagated forward differently in
    CPU mode and CUDA mode.
  • Fix bugs in torch.multinomial on CUDA, where in rare cases, the sampling
    led to nonsensical values
  • Allow backprop through CuDNN RNN in eval() mode.
  • Support np.int16 in conversion to ShortTensor
  • Enable multithreading in MKL (was disabled previously due to a cmake bug).

Improved error messages

  • Print a readable error message when arguments are on different GPUs
  • Add better error message for conversion of CUDA tensors to numpy
  • Add checks for reward type and size in StochasticFunction