Releases: pytorch/pytorch

CuDNN v6, new layers, lots of bugfixes

31 Mar 16:27

Minor API Changes

  • in optim.Adamax, the default learning rate and epsilon have been made
    consistent with Lasagne, Keras and TF.
    • Previous: (lr=1e-2, eps=1e-38)
    • Current : (lr=2e-3, eps=1e-8)
  • Make random_ range exclusive (it used to be exclusive when only the upper bound was specified, and inclusive when both were given).
  • torch.cat now disallows concatenating along nonexistent dimensions
    (to make it consistent with numpy and Variable cat)
  • torch.utils.clip_grad_norm now returns the total norm (say, for logging purposes).

Performance Improvements

  • Reduce DataParallel overhead on >4 GPUs
    • Improve broadcast/reduce performance by coalescing tensors
  • nn.Embedding's backward performance increased for batch sizes > 1024

New Features

torch

  • Batch triangular factorization and solves have been interfaced (CPU and GPU) and
    are available under torch.btrifact and torch.btrisolve. See the documentation
    for usage; a usage sketch also follows after this list.
  • All RNG functions now accept a generator via a keyword argument
  • torch.mode is now supported on the GPU via a high-performance kernel.
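
A minimal usage sketch of the batched factorization/solve routines (shapes are illustrative; see the documentation for the authoritative signatures):

import torch

A = torch.randn(4, 3, 3)   # a batch of 4 square systems
b = torch.randn(4, 3)      # one right-hand side per system

A_LU, pivots = A.btrifact()     # batched LU factorization
x = b.btrisolve(A_LU, pivots)   # solves A[i] x[i] = b[i] for every batch entry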

autograd, nn and optim

  • CuDNN v6 integrated:
    • Faster Dilated Convolutions (and less memory hungry)
    • 1D FFT-based Convolutions
    • Significant performance improvement for Softmax layers
    • Speedups across many functions
    • Improved CuDNN error messages
    • We will integrate persistent RNNs in the next release
  • torch.trace, torch.cumsum, torch.cross are now implemented in autograd
  • nll_loss now supports Spatial inputs (i.e. 4d inputs BCHW) and computes
    channel-wise cross-entropy (see the sketch after this list).
  • nn.PReLU now supports all dimensional Tensors, not just 1d and 2d.
  • Add nn.PairwiseDistance and F.pairwise_distance, which compute the batchwise
    pairwise distance between two sets of vectors.
  • Adaptive Max and Average Pooling added for 1d and 2d inputs via
    nn.AdaptiveMaxPool1d, nn.AdaptiveAvgPool2d, etc.
  • RMSProp now has momentum and a centered option. If centered is True,
    the gradient is normalized by an estimate of its variance. (Graves 2013)
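
A minimal sketch of the spatial nll_loss and the adaptive pooling layers (the shapes, values and output_size form are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

# Pretend these are log-probabilities of shape B x C x H x W; the target holds
# one class index per pixel and has shape B x H x W.
log_probs = Variable(torch.randn(2, 5, 8, 8))
target = Variable((torch.rand(2, 8, 8) * 5).long())
loss = F.nll_loss(log_probs, target)

# Adaptive pooling picks its kernel size and stride so the output has a fixed size.
pool = nn.AdaptiveAvgPool2d(4)
out = pool(Variable(torch.randn(2, 5, 13, 9)))   # -> 2 x 5 x 4 x 4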

utils

  • WeightedRandomSampler has been added as a custom sampler for the DataLoader.
    It samples elements from [0, ..., len(weights)-1] with the given probabilities
    and is useful for sampling from unbalanced datasets where some classes have
    many more samples than others. See the docs for more details; a usage sketch
    follows after this list.
  • DataLoader now allows returning of numpy arrays
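
A sketch of sampling an unbalanced dataset with the new sampler (the toy dataset, weights and exact constructor arguments are illustrative; see the docs for the authoritative signature):

import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.sampler import WeightedRandomSampler

# Toy unbalanced dataset: class 0 is 9x more common than class 1.
data = torch.randn(100, 3)
labels = torch.cat([torch.zeros(90), torch.ones(10)]).long()
dataset = TensorDataset(data, labels)

# Give samples of the rare class a proportionally higher weight.
weights = [1.0 if l == 0 else 9.0 for l in labels]
sampler = WeightedRandomSampler(weights, len(weights))
loader = DataLoader(dataset, batch_size=10, sampler=sampler)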

Bug Fixes

torch

  • When loading GPU checkpoints from disk with storage location remapping,
    torch.cuda was still being imported. This is now fixed, and you can load
    GPU checkpoints on machines with no GPUs or CUDA.
  • Work around an OSX fread bug where loading checkpoints containing a Tensor > 1GB
    would give an error.
  • Fixed torch.cat so that it no longer accepts a reversed iterator as input
    (it's not a PySequence).
    For example:
    l = [Variable(torch.ones(1,3)*i) for i in range(3)]
    torch.cat(reversed(l), 0) # errors now
    
  • Fix a memory leak in torch.from_numpy
  • GPU svd returned a larger matrix than expected when using the some mode.
    This is now fixed to match CPU behavior.
  • Fix a bug in CPU max that was introduced in the previous release.

autograd, nn and optim

  • Reassigning attributes in modules now works correctly.
    In this example, l.a used to always remain None; now it works as one
    would expect:
    l = nn.Linear(10, 20)
    l.a = None
    l.a = nn.Parameter(torch.randn(2))
    # l.a is correctly updated
  • Fix bug where adding a hook could replace an existing hook
  • Fix nn.Embedding and nn.CosineEmbeddingLoss to work without
    error on non-float CUDA (half, double)
  • Fix a bug in nn.Embedding when the max_norm option was used. Some of the
    indices were not respecting max_norm and this is fixed.
  • Fix a corner case in Variable's setitem where the gradient was of incorrect shape:
    x.grad used to be of shape (20,), because y[1] was of shape (20,).
    x = Variable(torch.randn(1, 20), requires_grad=True)
    y = Variable(torch.zeros(10, 20))
    y[1] = x
    
  • Fix a segfault in Conv1d when input doesn't require grad.
  • Add assertions in pack_padded_sequence to check that sequences have length > 0
  • torch.prod's autograd formula was incorrect if the Tensor contained a 0. The
    formula has been fixed.
  • Variable expand and expand_as had incorrect dimension inference when using
    broadcasting semantics. The formula has been fixed in these cases.
  • Fix a size mismatch in CosineEmbeddingLoss. See this issue for more details.
  • Fixed a bug in LBFGS that caused it to use uninitialized locals. See issue
  • Add assertions for negative padding in nn.Conv* functions.
  • Fix the stddev gradient formula for the stochastic function normal.

other

  • Fix an issue when returning strings from the DataLoader when pin_memory=True
  • Binaries no longer require libcudart.so to be present at runtime.

Variable-length RNNs, Better Indexing, Sparse Tensors, Faster CPU, Many Bug Fixes

15 Mar 05:58

New Features

Indexing and Broadcasting Improvements

  • Add broadcasting semantics to expand / expand_as.
    • Previously, expand had no ability to add new dimensions, and unsqueeze
      had to be used to first create singleton dimensions before expansion.
    • Now, singleton dimensions are automatically prepended to the shape of
      the tensor if a matching dimension is found.
      Here's an example:
      x = torch.rand(5)
      y = torch.rand(4, 8, 5)
      z = x.expand_as(y) # z is of shape (4, 8, 5)
      
       x = torch.rand(1, 8, 1)
       z = x.expand_as(y) # z is of shape (4, 8, 5)
  • Unsqueeze dimensions using None indexing
    a = torch.randn(10)
    b = a.unsqueeze(0)
    b = a[None, :]     # Equivalent operations
  • Indexing with steps is supported (only positive steps)
    In [1]: a = torch.randn(10)
    In [2]: a
    Out[2]:
    
       0.1338
       1.0789
       1.2302
      -1.3343
      -0.4676
       1.3511
      -0.4374
      -1.0611
      -0.1528
      -1.3994
      [torch.FloatTensor of size 10]
    
    In [3]: a[0:10:3]
    Out[3]:
    
       0.1338
      -1.3343
      -0.4374
      -1.3994
      [torch.FloatTensor of size 4]

Variable-length mini-batches in Recurrent Networks

nn.RNN, nn.LSTM, nn.GRU now support mini-batches where sequences are of variable
lengths.
You can pass an input of type PackedSequence
into these layers.
A PackedSequence holds data and a list of sequence sizes of a packed sequence batch.
For example, a PackedSequence will hold an input mini-batch of such sequences:

a b c d e
a b c d e f g h
a b
a b c d

Here, each input row is of variable length.

You can construct a PackedSequence using the provided function
pack_padded_sequence

pack_padded_sequence takes a Variable containing padded sequences, i.e. a Tensor
of T x B x *, where B is the size of the mini-batch, and each input is either of
length T or is padded to length T. It also takes a list of lengths of each input.
From these, it constructs a PackedSequence

For example, it will take lengths [8, 5, 4, 2] and an input of size 8 x 4 x 128
that corresponds to:

a b c d e f g h
a b c d e 0 0 0
a b c d 0 0 0 0
a b 0 0 0 0 0 0

The output of the RNN layers will also be a PackedSequence, which can then be inverted
back to a padded Tensor using the inverse function:
pad_packed_sequence
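
A minimal end-to-end sketch (the RNN sizes are illustrative):

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# A padded batch of T=8 steps, B=4 sequences and 128 features per step,
# with the true lengths sorted in decreasing order.
padded = Variable(torch.randn(8, 4, 128))
lengths = [8, 5, 4, 2]

rnn = nn.LSTM(input_size=128, hidden_size=64)
packed = pack_padded_sequence(padded, lengths)
packed_out, (h_n, c_n) = rnn(packed)

# Invert back to a padded Tensor (plus the list of lengths).
out, out_lengths = pad_packed_sequence(packed_out)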

Sparse Tensors (CPU)

Original goals:

  • ability to propagate sparse updates in a network (e.g. for updating an embedding matrix)
  • ability to efficiently compute "bag-of-words" sentence embeddings (e.g. weighted average of word embeddings)

Implemented features:

  • enable backpropagation of sparse gradients without conversion to dense tensors. In most cases a runtime exception is thrown when mixing different gradient types for the same variable
  • add some methods for THSTensor: zero, elementwise add and mul, scalar mul and div
  • make addcmul method of THTensor compatible with sparse operands
  • make spmm method accessible from Python as dsmm
  • sparse_mask method on THTensor. This produces a sparse tensor from a dense tensor,
    by using a sparse tensor as a mask. A value is only present in the output sparse
    tensor if it also exists in the mask.
  • update optim.Adagrad to use sparse updates when possible.
  • leave Variable's gradient to None by default.
    This is because there is no canonical zero gradient anymore (it could be dense or
    sparse, and if it is sparse we don't know how many dimensions are sparse)
  • N-dimensional values for sparse tensors:
    • Basically for things like applying sparse updates to embedding matrices, only the
      first dimension (the one that corresponds to the word index) is sparse. The other
      dimension is always dense (only whole embedding vectors are updated). An elegant
      solution is to make the values tensor N-dimensional instead of 1-dimensional.
      For an embedding matrix, the sparse gradient will have a values tensor of
      size nnz * embedding_size instead of just nnz.

Common weight initialization methods for neural networks

By default, all Linear and Conv layers in PyTorch are initialized according to
a scheme proposed by LeCun'98.

However, there are several other commonly used initialization methods.
We now support many other methods via torch.nn.init.
Supported methods include:
uniform, normal, constant, xavier_uniform, xavier_normal, kaiming_uniform,
kaiming_normal, orthogonal, sparse

Here's an example of using these initialization methods:

import math
from torch import nn

class Net(nn.Module):
  def __init__(self):
     super(Net, self).__init__()
     self.conv1 = nn.Conv2d(5, 10, (3, 3))
     nn.init.xavier_uniform(self.conv1.weight, gain=math.sqrt(2.0))
     nn.init.constant(self.conv1.bias, 0.1)

network = Net()

Other features

  • Added a gradient checker utility torch.autograd.gradcheck that can
    be used to check your implementations. Here's a small example:
    from torch.autograd import Variable, gradcheck
    inputs = Variable(torch.randn(4, 4), requires_grad=True)
    gradcheck(lambda x: 2*x.diag(), (inputs,), eps=1e-3)
  • Add a clip_grad_norm utility to easily clip gradients via constraints on their
    norms (see the sketch after this list).
  • Document nn.ModuleList and nn.ParameterList that are immensely useful when
    storing a list of modules in a Container
  • Optimizers are backward-compatible with old checkpoints.
    __getstate__ and __setstate__ introduced into optimizers.
  • Add Nesterov momentum to optim.SGD via nesterov=True kwarg
  • DataParallel supports multiple inputs and keyword args (which are also scattered)
    m = nn.DataParallel(model)
    # Now valid
    m(x, y, option=z)
    
    See the documentation for exact behavior.
  • DataLoader's default_collate now also supports numpy arrays
  • Added F.pad that supports Constant, Reflection and Replication padding in a single
    interface: http://pytorch.org/docs/nn.html#pad
  • train() now optionally takes a boolean argument. For example, model.train(False)
    sets the module to eval mode and model.train(True) sets it to train mode.
  • Added a DataLoader sampler: SubsetRandomSampler takes a list of indices
    in its constructor and randomly samples from these indices. Useful when you
    want to sample only a particular subset of your dataset.
  • Transpose supports negative dimensions. For example:
    a = torch.randn(2, 3)
    b = a.transpose(0, 1)   # both are equivalent
    b = a.transpose(-2, -1) # both are equivalent
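
A sketch of gradient clipping in a training step (assuming the torch.nn.utils.clip_grad_norm entry point; the model and max_norm value are illustrative):

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.nn.utils import clip_grad_norm

model = nn.Linear(10, 2)
loss = model(Variable(torch.randn(4, 10))).sum()
loss.backward()

# Rescales all gradients in-place so their total norm is at most max_norm,
# and returns the total norm computed before clipping (handy for logging).
total_norm = clip_grad_norm(model.parameters(), max_norm=1.0)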

Performance Improvements

  • CPU Tensor backend gets faster
    • Explicit AVX, AVX2 and improved SSE intrinsics to speedup copy, fill, add, mul, div
    • Much improved speed for all apply and reduce operations to have better cache hits
    • Added OpenMP in TH_TENSOR_APPLY* operations
    • Overall, 2x to 10x+ faster on a lot of operations, closer to Numpy speeds
    • Runtime dispatch of intrinsics based on CPU features (easy to ship binaries)
  • Serialization Improvements
    • Fixed bugs on serialization for Tensors > 2GB
    • 5x to 10x faster serialization (no longer Tarring Tensors)

Bug Fixes

  • Multi-GPU CuDNN RNN now has separate dropout descriptors per GPU
  • NLLLoss2d has proper shape checks on GPU and stable sizeAverage formulation
  • LogSoftmax2d has a more stable formula
  • Fix prodall (prod without dim arguments) to not average
  • Return correct number of gradients from cuDNN RNN
  • NLLLoss2d has support for weights
  • Fix Unpooling bug for MaxPool1d
  • Fix indexing when using only an ellipsis:
    x = torch.randn(2,2,2,2)
    x[...] # used to fail, fixed now.
  • Expose stateless methods (torch.* methods) for torch.cuda.HalfTensor
  • Prevent creation of reference cycles (and hence improve memory usage) when
    leaf variables are used in in-place operations.
  • Fix gradient computation for the indexing operation when a LongTensor is
    passed in.
  • Fix a reshaping bug in the grad_input of basic operations such as +, -, *, / etc.
    This used to fail, but is fixed now:
    x = Variable(torch.randn(4, 6), requires_grad=True)
    b = Variable(torch.rand(12, 1) + 1e-2, requires_grad=True)
    (x + b.mm(Variable(torch.rand(1, 2) + 1e-2))).sum().backward()
  • Revert partial indexing with LongTensor to return to numpy-compatibility
  • References to some Tensors in BatchNorm and Conv are now freed to improve
    memory usage in certain situations. For example, ResNet-152 finetuning with a
    batch size of 16 used to consume as much memory as a batch size of 256 does
    after this fix.
  • Fix a bug where requires_grad was being propagated forward differently in
    CPU mode and CUDA mode.
  • Fix bugs in torch.multinomial on CUDA, where in rare cases the sampling
    led to nonsensical values
  • Allow backprop through CuDNN RNN in eval() mode.
  • Support np.int16 in conversion to ShortTensor
  • Enable multithreading in MKL (was disabled previously due to a cmake bug).

Improved error messages

  • Print a readable error message when arguments are on different GPUs
  • Add better error message for conversion of CUDA tensors to numpy
  • Add checks for ...

Bug fix release

24 Feb 13:02

Bug fixes:

  • Major bugfix in CuDNN bindings for cases of non-contiguous grad-outputs
    • also added better error checking and asserts to cudnn RNN and Conv
  • Fixed serialization bugs when serializing Tensors > 2GB
  • Enable half and double THNN backends
  • RNNBase and Embedding fixed to be compatible with DataParallel
  • Fix bug in torch.cat for multi-GPU settings
  • Support bias=False in Conv3d
  • Change behavior of detach() to actually remove the creator (previously was just detaching compute)

Features and performance

  • Refactored autograd internals into python-agnostic C++ (#662)
  • view, unsqueeze and squeeze moved to C for superior performance
  • Allow DataParallel to have tuple inputs
  • Add a torch.__version__ string.

Bug Fixes, initial Distributed support

05 Feb 02:01

A bugfix release with some small features:

New Features

  • THPP now has CUDA Tensors
  • autograd functions: repeat, var, std, renorm, comparison ops added.
  • Merged an initial version of THD (distributed pytorch)
  • Indexing support with LongTensor indices
  • Add torch.unbind
  • Add ModuleList and ParameterList to store lists of modules / params in an nn.Module

Bug and usability fixes

  • Fix a bug in FFI utils
  • Fix lua-reader for SpatialConvolution
  • Fix backward contiguous check in BatchNorm
  • Fix travis builds
  • Pep8 enforced for the entire codebase
  • CuDNN RNN non-contiguous fixes
  • Remove circular references in some Autograd functions
  • Add CUDA asserts to various kernels for out-of-bounds checks
  • Fix non-contiguous bug in torch.cat
  • Fix memory leak in Unpooling

API Changes

  • nn.Billinear* -> nn.Bilinear*
  • Return indices as well in autograd for torch.sort and torch.topk
  • .set_index -> ._set_index (made private)
  • normal and log_normal kwarg changed from var to std
  • Optimizer.state_dict now has semantics matching Module state_dict

bug fixes and small features

02 Feb 12:40

A bugfix release with some small features:

New Features

  • LBFGS Optimizer added
  • Add state_dict for optimizers for easy checkpointing
  • Add differentiable upsampling modules for 2d (bilinear, nearest)

Bug and usability fixes

  • Fix multi-GPU bugs in indexing
  • Improve error messages for optimizer
  • Fix bug in Conv1d
  • Fix bug in Conv*d groups
  • Add improved error messages for unsupported CuDNN codepaths
  • fix bugs in CuDNN bindings
  • Workaround bugs in CuDNN itself (batchnorm-backward, non-contiguous weights)
  • Fix lua-reader's BatchNorm and Linear layers
  • Fix some memory leaks
  • Give fatal errors on Variable comparison
  • Fix bug in ELU backward
  • Fix index_select backward
  • Fix BatchNorm backward in evaluate mode (workaround CuDNN bug)

API Changes

  • Adadelta's step_rate is renamed to lr
  • Adam's default learning rate is now the same as in LuaTorch

Beta is here.

02 Feb 12:29

Our last release (v0.1.5) was on November 14th, 2016

We finished, froze and released (v0.1.6) on Jan 21st, 2017.

A lot has happened since 0.1.5.

Summary

  • PyTorch public release on 18th Jan, 2017.
  • An initial Model Zoo, several common Vision models can be initialized with pretrained weights downloaded from the zoo.
  • All of the 100+ torch.* functions bar 3 (topk, mode and kthvalue) are GPU-ready, with performance improvements across the board for several existing ones.
  • All relevant neural network modules are now CuDNN bound.
  • Stochastic functions added to Autograd, for use in reinforcement learning
  • A functional interface of the nn library is added
  • GPU device initialization has been made lazy (improvement in CUDA initialization time on multi-GPU machines)
  • Pinned memory support, and leveraging it in DataLoader
  • Made error messages across the board more informative, especially around shape checks
  • A rich set of examples and tutorials added to pytorch/examples and pytorch/tutorials
  • API Reference at pytorch.org/docs
  • Multiprocessing support for CUDA (Python3 only)
  • An initial version of CPU Sparse Tensors is added and used in nn.Embedding(sparse=True). More to come on this side.
  • Added a lua reader to load existing .t7 files with Torch models
  • Various bug-fixes.
  • Allow returning of changed gradients in hooks

API Changes

  • Conv*d and *Pool*d layers now take a tuple of kernel sizes/strides/padding instead of kh/kw.
  • Unpooling* layers have a changed API
  • Variable.grad is now a Variable (was a Tensor)
  • nn.Container is deprecated and merged into nn.Module. Replace all instances of nn.Container in your code with nn.Module
  • torch.cat changed API to take an iterable of tensors, along with a dimension (previously varargs of Tensors). Also, torch.cat's default dimension has changed. It's been made an inverse transform for torch.split and torch.chunk (see the example after this list).
  • Variable.no_grad has been renamed to Variable.detach
  • RMSProp's initialization of gradients changed from ones to zeros (#485)
  • Removed cmin, cmax and cinv (functionality of cmin, cmax split between max/min and clamp; cinv renamed to reciprocal)
  • register_hook API changed, names are removed. See: #446
  • torch.*(..., out=Tensor) is adopted for output arguments
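
For example, with the new torch.cat API (an illustrative snippet):

import torch

a, b = torch.randn(2, 3), torch.randn(2, 3)
c = torch.cat([a, b], 0)         # an iterable of tensors plus a dimension
chunks = torch.chunk(c, 2, 0)    # cat is the inverse of chunk / split
d = torch.cat(chunks, 0)         # recovers c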

Model Zoo

A model zoo has been started with several pre-trained vision models available such as AlexNet, ResNet50, etc. The download and usage of the models is seamless with a keyword argument.

import torchvision.models as models
models.alexnet(pretrained=True)

The models are hosted on Amazon S3, and we look forward to more models from the community.
Basic documentation is found here:

http://pytorch.org/docs/model_zoo.html

You can find specific models listed in the README of torchvision and torchtext

Stochastic Functions in Autograd

We introduced Stochastic functions that need to be provided with a reward for their backward pass.
This feature was inspired by Gradient Estimation Using Stochastic Computation Graphs by Schulman et al. and is helpful for implementing reinforcement learning techniques.
Documentation is here: http://pytorch.org/docs/autograd.html#torch.autograd.Variable.reinforce
A showcase of using these nodes is in the REINFORCE example: https://github.com/pytorch/examples/blob/master/reinforcement_learning/reinforce.py#L70
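
A rough sketch of the pattern, loosely following the linked example (the probabilities and reward are made up, the environment interaction is omitted, and the exact calls should be treated as illustrative):

import torch
from torch import autograd
from torch.autograd import Variable

# Pretend these action probabilities came out of a policy network.
probs = Variable(torch.Tensor([0.1, 0.2, 0.7]), requires_grad=True)
action = probs.multinomial()            # stochastic node: samples an action index

# ... run the environment with `action`, observe a reward ...
reward = torch.Tensor([1.0])            # made-up reward
action.reinforce(reward)                # feed the reward to the stochastic node
autograd.backward([action], [None])     # gradients flow back into probs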

Functional interface to nn

PyTorch neural networks have so far been modeled around nn.Module. However, for most simple functions such as ReLU, using this is a bit cumbersome.
To simplify this, we've introduced a functional interface to nn, and modified the tutorials to use this API where appropriate.

For example:

import torch.nn as nn
import torch.nn.functional as F

# module style
relu = nn.ReLU()
y = relu(x)

# functional style
y = F.relu(x)

The functional style is convenient when using non-parametric and non-learnable functions.

Documentation for these functions is here: http://pytorch.org/docs/nn.html#torch-nn-functional

Faster GPU code

The initialization of the GPU backend has been made lazy. This means that it will automatically be
imported and initialized when needed (and not before-hand). Doing this has improved startup times (especially for multi-GPU systems) and reduced boilerplate code.

We've also integrated support for pinned memory, which accelerates CPU to GPU transfers for specially marked buffers. Using this, we accelerated the multiprocessing data loaders.
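
A small sketch (the dataset is a stand-in; pin_memory=True is the relevant flag):

import torch
from torch.utils.data import TensorDataset, DataLoader

# With pin_memory=True the DataLoader returns batches in page-locked memory,
# which makes the subsequent copy to the GPU faster.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.zeros(1000).long())
loader = DataLoader(dataset, batch_size=64, pin_memory=torch.cuda.is_available())

for images, labels in loader:
    if torch.cuda.is_available():
        images, labels = images.cuda(), labels.cuda()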

A rich set of examples

With the help of some of you, we've added a rich set of examples from Image Super-resolution to Neural Machine Translation.
You can explore more here: https://github.com/pytorch/examples

API Reference and Notes

We've fleshed out a full API reference that is mostly complete at docs.pytorch.org
Contributions are welcome :)

We've also added notes such as CUDA Semantics, Extending PyTorch, etc.

Multiprocessing support for CUDA

Until now, Tensor sharing using multiprocessing only worked for CPU Tensors.
We've now enabled Tensor sharing for CUDA tensors when using Python 3.
You can read more notes here: http://pytorch.org/docs/notes/multiprocessing.html

Lua Reader

A "lua reader" has been integrated, that can load most LuaTorch .t7 files, including nn models.
nngraph models are not supported.

Example usage can be found here: https://discuss.pytorch.org/t/convert-import-torch-model-to-pytorch/37/2
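
For instance (the checkpoint path is illustrative):

from torch.utils.serialization import load_lua

# Load a LuaTorch .t7 checkpoint into PyTorch.
model = load_lua('legacy_model.t7')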

Alpha-5

18 Nov 10:15
Pre-release

What's new in Alpha-5?

Usability

  • keyword arguments, improved indexing for all torch and autograd functions!
  • Deterministic data loader even under multiple workers
  • LAPACK bindings with full CUDA support via MAGMA
  • Easier numpy2torch conversion with torch.from_numpy(x)
  • Lot more documentation
    • fully covered neural networks
    • fully covered optim package
    • partly covered torch documentation
  • Tutorials:
    • Increased depth, length and clarity of the tutorials

New Features and modules

  • PyTorch Vision: a package to hold common dataloaders, transforms and utilities for images and videos
    • Data loaders for: COCO (captioning and detection), Imagenet, CIFAR10/100, LSUN etc.
    • Image Transforms: commonly used data augmentation transforms such as random-cropping, normalization
      • Unit-tested
    • Utilities: saving Tensors as images, creating grids of images from a mini-batch of tensors.
  • Recurrent Neural Networks
    • A complete and robust implementation of efficient Stacked LSTMs, RNNs, GRUs (bidirectional and otherwise)
    • Seamlessly integrated CuDNN is used whenever possible for maximum performance
    • A complete word-level language modeling example on the PennTreeBank dataset
      • verification that the perplexity matches the reference Torch implementation
  • an example of Generative Adversarial Networks:
    • DCGAN example in < 250 lines (includes everything)
    • Verified the results to match reference implementations
    • Multi-GPU ready!
  • A redesigned Optim package with the following optimization methods:
    • SGD, AdaDelta, Adagrad, Adam, AdaMax, Averaged SGD, RProp, RMSProp
    • Fully unit tested against their reference implementations
    • Fully documented
  • Improved Multi-GPU performance (and more is coming)
    • Integrated NVIDIA NCCL for maximizing multi-GPU communication performance

Plans for Alpha-6

  • docstrings support and finishing torch and autograd documentation
  • Fully verifying the convergence of ResNet / Imagenet training
  • More examples around:
    • Reinforcement Learning / OpenAI Gym
    • Object Detection
    • Sequence to Sequence methods
    • WaveNet / ByteNet
    • More adversarial networks (text2image, etc.)
  • More gains in performance, and fully flesh out CuDNN integration
  • Half-precision training for GPUs
  • A Lua-Torch model loader, and improved legacy.nn support
  • Lua bridge, to call your existing lua code

Usability

Keyword arguments

All torch and autograd functions used to only support arguments in the correct order.
For example:

torch.clamp(x, -0.1, 0.1)

This is often unreadable, especially for LAPACK usage where one declares booleans such as upper=True

Now, one can simply do:

torch.clamp(x, min=-0.1, max=0.1)

We've also implemented ellipsis indexing similar to NumPy

Deterministic Data Loader

The data loader now generates indices on the main process and regardless of how many workers you use,
the order of data loading will remain consistent if you use the same random seed.

Fully tested LAPACK bindings

Unit tests on both the CPU and CUDA side.
On the CPU, we ship with MKL integration; on the GPU, LAPACK is powered by MAGMA.
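
A small sketch of one such binding (torch.gesv solves a linear system; the CUDA call shown is illustrative of the MAGMA-backed path):

import torch

A = torch.randn(5, 5)
B = torch.randn(5, 2)
X, LU = torch.gesv(B, A)        # solves A X = B on the CPU (MKL)

if torch.cuda.is_available():
    X_gpu, LU_gpu = torch.gesv(B.cuda(), A.cuda())   # same call, MAGMA-backed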

Documentation

We are at a stage where we have converged to stable APIs.
Hence, documentation is going at a rapid pace, and we have covered:

  • nn
  • optim
  • part of torch / Tensors

As always, you can check out the documentation here: pytorch.org/api/latest/en/

Tutorials

We added one new tutorial: Creating extensions using numpy and scipy

  • This covers the case where you would want to quickly write some modules of your neural network using familiar scipy tools like scipy.sparse for example.

We also improved the existing tutorials to cover more of the basics.

New Features and modules

PyTorch Vision

A one-stop repository for all of your image (and soon video) needs, whether that be data loaders, common neural network definitions (such as alexnet, inception, resnet etc.) or data augmentation routines.
Our plan is to put some serious engineering firepower into this module, with GPU loaders and augmentation routines, especially for video processing. Contributions welcome :)

So far, we have:

Data loaders

All the data loaders are fully documented, and share a basic interface.
They are fully compatible with torch.utils.DataLoader to be parallelized in fetching.

Common Image Transforms

  • Converters from PIL Image to Torch Tensors
  • Random Cropping, Scaling, Normalization transforms
    • Unit tested

The Imagenet example has been updated to use this package

Recurrent Neural Networks

One of the biggest strengths of PyTorch's new design is to be able to seamlessly share weights and do recurrent nets.
We've emphasized this, and also deeply integrated CuDNN in a way that as a user you do not notice a thing, while having the full power and speed.

nn.RNN, nn.LSTM and nn.GRU are the stacked RecurrentNet modules that you would want to use, and for generally crazy research, we've also given implementations of individual cells: nn.LSTMCell and nn.GRUCell

A fully tested and verified example is provided in https://github.com/pytorch/examples/tree/master/word_language_model
This example does word-level language modeling on the PennTreeBank dataset.
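
A small sketch of a stacked bidirectional LSTM (the sizes are illustrative):

import torch
import torch.nn as nn
from torch.autograd import Variable

# 2-layer bidirectional LSTM over a seq_len x batch x features input.
rnn = nn.LSTM(input_size=100, hidden_size=50, num_layers=2, bidirectional=True)
x = Variable(torch.randn(35, 20, 100))
h0 = Variable(torch.zeros(2 * 2, 20, 50))   # (num_layers * num_directions) x batch x hidden
c0 = Variable(torch.zeros(2 * 2, 20, 50))
output, (h_n, c_n) = rnn(x, (h0, c0))       # output: 35 x 20 x 100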

Adversarial Networks

A concise example of Generative Adversarial Networks for Image Generation is provided, integrating multiple datasets (showcasing the power of the vision package).
The example is < 250 lines of code, and gives a lot more clarity towards the usage of PyTorch.
Multiple data loader threads, checkpointing, saving generated images to disk and much more is showcased.

A stable and fleshed out Optim package

It took us some time to design a good and stable Optim API, but now we have converged to a clean design.
The Optim package is fully Multi-GPU and Multi-device ready out of the box.
Now we've implemented and unit tested the following algorithms:

  • SGD, AdaDelta, Adagrad, Adam, AdaMax, Averaged SGD, RProp, RMSProp

Setting per-layer learning rates, or optimizing only part of your neural network is now very trivial.

It is fully documented here: http://pytorch.org/api/latest/en/#torch-optim
Its usage can be seen in both the DCGAN and Imagenet examples.

Improved Multi-GPU performance (and more is coming)

We've improved the Multi-GPU performance since alpha-4, and we are close to squeezing out full performance.
We are working closely with NVIDIA to squeeze out the last drops of performance and make PyTorch future-proof for the P100 and new cards.

Alpha-4 Release

03 Oct 22:21
Pre-release

Some interesting stats

On Resnets

Because of our aggressive freeing and allocating of resources, ResNets in PyTorch take less memory than torch-nn:

  • 4.4GB in PyTorch
  • 6.5GB in Torch-nn
  • 4.6GB in Torch-nn with a hacky sharing of gradinput buffers
  • On 1 GPU, PyTorch is tens of milliseconds faster than Torch-nn
  • On 2 GPUs, PyTorch is the same speed as Torch-nn
  • On 4 GPUs, PyTorch is about 10 to 20% slower, but only because we have just finished implementing Multi-GPU support; we will close this performance gap in the next week.

FFI-based C extension

On a small benchmark of adding a constant to a 5x5 tensor at 1000 calls:

  • LuaJIT FFI: 0.001 seconds
  • Lua 5.2 FFI: 0.003 seconds
  • PyTorch CFFI: 0.003 seconds
  • Raw Python CFFI / CTypes: 0.001 seconds

What's new in Alpha-4?

Usability

New Features and modules

  • Multi-GPU primitives
  • A custom CUDA allocator to maximize autograd performance (backported to Torch too)
  • More autograd functions. Now it's almost API complete for all differentiable torch.* functions.
  • CuDNN Integration
  • Multiprocess DataLoader in torch.utils (used in the imagenet example)
  • Extensions API to interface to your C code simply via FFI

Plans for Alpha-5

  • Revamping and rethinking the Checkpointing API
  • Revamping the Optim API to support things like per-layer learning rates and optimizing non-weights (like in NeuralStyle)
  • RNN Examples, initially for PennTreeBank language modeling
  • Better RNN support in general, improved error messages, multi-GPU etc.
  • NCCL Integration for improved multi-GPU performance (already implemented at #78 )
  • Documentation / Reference manual for torch.* and autograd

Usability

Tutorials

We've added two tutorials to get you all started.

  • Tutorial 1: Introduction to PyTorch for former Torchies
    • In this tutorial we cover the torch, autograd and nn packages from a perspective of former Torch users.
    • Going through this tutorial should get you started. Let us know how we can improve it.
  • Tutorial 2: Write your own C code that interfaces into PyTorch via FFI
    • In this tutorial, we showcase how you can call your own C code that takes torch tensors as inputs / outputs in a seamless way via FFI
    • The tutorial showcases how you can write your own neural network Module that calls in C implementations

Examples

We've added a full imagenet example with ResNets that should be really suited towards “learning by example”.
It is located here: https://github.com/pytorch/examples/tree/master/imagenet
The data for the example has to be preprocessed for now in the same way as is specified in fb.resnet.torch

The example has Multi-GPU support in a DataParallel fashion.

More improved error messages

We've gone through the TH and THNN C libraries and added much more intuitive error messages that report the mismatched shapes. We will continue to make improvements on this front.
If you have any unintuitive error messages that you encounter, please open an issue at https://github.com/pytorch/pytorch/issues

For example:

Old error message:

bad argument #2 to 'v' (3D or 4D (batch mode) tensor expected for input

New error message:

bad argument #2 to 'v' (3D or 4D (batch mode) tensor expected for input, but got: [100 x 100]

No more CamelCase for functions

All torch functions have been renamed from CamelCase to underscore_case.
indexAdd → index_add_
getRNGState → get_rng_state
etc.

New Features and modules

Multi-GPU primitives

  • We've added efficient multi-GPU support in general for neural networks. Instead of building magic blocks that do opaque parallelization for you, we've broken them down into easy to use collectives.
  • A pattern like DataParallel is implemented in terms of these collectives; a sketch follows below.
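
A sketch of that pattern, assuming the nn.parallel collectives (replicate, scatter, parallel_apply, gather); treat it as illustrative rather than the exact library code:

import torch.nn as nn

def data_parallel(module, input, device_ids, output_device=None):
    if output_device is None:
        output_device = device_ids[0]
    replicas = nn.parallel.replicate(module, device_ids)      # copy the module to each GPU
    inputs = nn.parallel.scatter(input, device_ids)           # split the batch across GPUs
    replicas = replicas[:len(inputs)]
    outputs = nn.parallel.parallel_apply(replicas, inputs)    # forward passes in parallel
    return nn.parallel.gather(outputs, output_device)         # collect results onto one device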

Performance

With Multi-GPU, we naturally overlap data transfers with compute across the whole graph. This makes multi-GPU much more efficient, and is done in a way that does not interfere with the imperativeness / error reporting.

Another important note is that we now dispatch parallel modules via Python threads, which makes the CUDA kernel launches happen in a breadth-first fashion, getting rid of obvious kernel-launch latency bottlenecks.

Custom CUDA allocator to maximize autograd performance

In Torch, we had to write nn modules in a careful way to avoid CUDA synchronization points, which were a multi-GPU bottleneck and a general performance bottleneck. This sometimes cost neural networks and autograd up to 2x in performance.

In PyTorch (and Torch), Sam Gross has written a new Caching CUDA allocator that avoids cuda synchronization points while being really suited towards Tensor use-cases where we typically do short-term and long-term allocations of memory of the same tensor sizes.

This unblocks us from a lot of performance issues.

More autograd functions

Now the torch.* API should be pretty much ready for full autograd support (short of 3 functions).
Autograd has been enabled for all functions, with the exception of non-differentiable ones like torch.eq.

CuDNN Integration

We now fully integrate and support CuDNN version 5.1.3, and it is shipped in the binaries (just like CUDA), so you never have to worry about manually downloading and installing it from the NVIDIA website.

Generic Multiprocess DataLoader

We've added a flexible Data Loader that supports multiple data loading workers. This enables a lot of use-cases, and is first used in our Imagenet example.

C Extensions API

We added an easy to use extensions API and an example extension here:
https://github.com/pytorch/extension-ffi

You can call your C functions (that have TH*Tensor inputs / outputs and other fundamental types in the function signature) without writing any manual Python bindings.

One question you might have is: what kind of call overhead do these auto-generated FFI bindings have? The answer is "None", as seen in the numbers at the beginning of this note.

The example extension also covers how you can define your autograd-ready nn module that calls your C function.

Alpha-3 Release

16 Sep 11:01
Pre-release

What's new?

Usability

  • conda binaries for all Linux (as old as RHEL 6 and Ubuntu 12.04) (we are working on OSX and pip binaries).
    • Now installing pytorch is as simple as:
      • conda install pytorch -c https://conda.anaconda.org/t/6N-MsQ4WZ7jo/soumith
      • it links against MKL, ships CUDA and MAGMA runtime with it, and #justworks
  • Human-ready error messages
  • Started working on documentation and an API Reference
  • Continuous integration with GPU support. Never have a broken master again

New Features and modules

  • The (new) neural network module now has 75% of the modules implemented (71 out of 93), and we are powering through the rest
    • most of the modules in old-nn have been removed because we do not need Containers and many modules such as CAddTable are covered by Autograd
  • autograd now supports all torch functions present in twitter-autograd and a lot more....
  • Added Trainer and Dataset abstractions (like in TorchNet)

Plans for Alpha-4

  • cudnn integration (and CUDA allocator).
    • We have this implemented but are iterating over design #36
  • Multi-GPU support in nn
  • examples, examples, examples
    • we will work on having examples across all domains (vision, NLP, RL, etc.)

Usability

Conda binaries for Linux

PyTorch will be shipped on Linux and OSX (and likely Windows) from day 1, and we want the install process to be as simple and intuitive as possible.
We have versioned binaries that do not require the user to install anything (except an NVIDIA driver if you intend to use the GPU; not even CUDA is a dependency).

For now, to get started on Linux:

conda install pytorch -c https://conda.anaconda.org/t/6N-MsQ4WZ7jo/soumith

We have built OSX binaries, but there are some small bugs on OSX; we'll fix the issues there over the week.
We are working on "pip install" for non-Anaconda Python installs.

Human-ready error messages

We've gone through how we report type errors and dispatch errors, to make it easy for the user to understand what they did wrong. See this small example:

In [1]: import torch
In [2]: x = torch.FloatTensor(10)
In [3]: x.addmm(torch.ones(1), 1, 'str')
ValueError                                Traceback (most recent call last)
<ipython-input-3-90eb50ea2e35> in <module>()
----> 1 x.addmm(torch.ones(1), 1, 'str')

ValueError: addmm recieved an invalid combination of argument types - got (torch.DoubleTensor, int, str), but expected one of:
 * (torch.FloatTensor mat1, torch.FloatTensor mat2)
 * (float beta, torch.FloatTensor mat1, torch.FloatTensor mat2)
 * (float beta, float alpha, torch.FloatTensor mat1, torch.FloatTensor mat2)

Continuous Builds with GPU support

New Features and modules

Neural Network Modules

  • Added fully functional and fully unit-tested nn modules and criterions for pretty much everything one would need for their current workflows.
  • We have about 25% of the modules missing (mostly exotic and lightly used ones) but will get to those in the coming few days.
  • nn modules have been renamed to simpler names. For example, the old Spatial/Volumetric prefixes become 2d/3d suffixes, as in SpatialConvolution → Conv2d.
  • Full unit-test coverage for all implemented functions

Autograd

  • We've added autograd support for almost all the torch functions (and operators like +, - etc.)
    • We have all the functions implemented that are presented in twitter-autograd, and we have many more.
    • At this point we have about 75 to 80% of them covered (ball park).
    • Full unit-test coverage for all implemented functions

Trainer & Dataset classes

Trainer

We've added a TorchNet style Trainer class that provides a convenient abstraction

trainer = Trainer(model, criterion, optimizer, dataset)
trainer.register_plugin(ProgressMonitor())
trainer.register_plugin(LossMonitor())
trainer.register_plugin(AccuracyMonitor())
trainer.register_plugin(Logger(['progress', 'accuracy', 'loss'], interval=(5, 'iterations')))
trainer.run(epochs=5)

################################################################################
# progress: 180/60000 (0.30%)     accuracy: 0.00% (3.24%)         loss: 2.3051 (2.2116)
# progress: 280/60000 (0.47%)     accuracy: 5.00% (4.84%)         loss: 2.3045 (2.2891)
# progress: 380/60000 (0.63%)     accuracy: 25.00% (13.04%)       loss: 2.2974 (2.2992)

Dataset

The data loading is implemented using three abstractions:

  • DataSource - a simple object that defines indexing and checking length. Indexing returns a tuple of (sample, label)
  • Sampler - an object that defines the data ordering. It has to be iterable, and its iterator should return a stream of indices in the [0, len(data_source)-1] interval. The end of the iterator indicates completing the epoch.
  • Dataset - an object which wraps a DataSource and a Sampler. Defines all the data loading logic (e.g. all the multiprocessing code).

The Dataset will accept a list of transforms (like image augmentation), which are run on the data before it is returned.

alpha-2 Release

01 Sep 05:03
Pre-release

What's new?

We've

  • built seamless support for multiprocessing with Tensor sharing
  • changed the API of the optim engine
  • added a complete Hook system for nn and autograd
  • added in-place ops to autograd and more neural network modules to nn

Multiprocessing with Tensor sharing

In Torch, or in general, one uses "threads" to build parallel data loaders, as well as to do Hogwild training.
Threads are powerful, as one can share Tensors between threads.
This allows you to:

  • transfer data between threads efficiently, with zero memory copy and serialization overhead.
  • share tensors among threads for parameter sharing models

Sharing Tensors among threads is very useful when you do Hogwild training, i.e. if you want to train several models in parallel but share their underlying parameters.
This is often used for non-ConvNet workloads, like training word embeddings, RL for games, etc.

With Python, one cannot use threads for this because of a few technical issues.
Python has the Global Interpreter Lock, which does not allow threads to concurrently execute Python code.

Hence, the most pythonic way to use multiple CPU cores is multiprocessing

We made PyTorch seamlessly integrate with Python multiprocessing.
This involved solving some complex technical problems to make this an air-tight solution; more can be read in this in-depth technical discussion.

What this means for you as the end-user is that you can simply use multiprocessing in this way:

# loaders.py
# Functions from this file run in the workers

def fill(queue):
  while True:
    tensor = queue.get()
    tensor.fill_(10)
    queue.put(tensor)

def fill_pool(tensor):
  tensor.fill_(10)

# Example 1: Using multiple persistent processes and a Queue
# process.py

import torch
import torch.multiprocessing as multiprocessing
from loaders import fill

# torch.multiprocessing.Queue automatically moves Tensor data to shared memory
# So the main process and worker share the data
queue = multiprocessing.Queue()
buffers = [torch.Tensor(2, 2) for i in range(4)]
for b in buffers:
  queue.put(b)
processes = [multiprocessing.Process(target=fill, args=(queue,)) for i in range(10)]
for p in processes:
  p.start()

# Example 2: Using a process pool
# pool.py

import torch
from torch.multiprocessing import Pool
from loaders import fill_pool

tensors = [torch.Tensor(2, 2) for i in range(100)]
pool = Pool(10)
pool.map(fill_pool, tensors)

Optim's API changes

Optimizer's step function now accepts a closure that should return a loss variable (similar to legacy.optim).

We've realized that to keep Optim flexible for multiple methods, like SGD with nesterov, Conjugate Gradient, LBFGS etc., we need to have the input to optim be a function that evaluates the model.
This is necessary because several optimization methods re-evaluate the function multiple times at different parameters.
To come to this necessary API change, we took into account complicated scenarios like Dynamic RNNs and complex ConvNet models with dynamic branching.

So the API now looks like this:

optimizer = optim.SGD(model, lr=1e-3, momentum=0.9)
input, target = ...
optimizer.step(lambda: criterion(model(input), target)) #sufficient for simple models

To simplify things at the user end for simple or specific common models, we will introduce a Trainer class, that will take a (dataset, model, optim) triple and train the model. This trainer class is planned for alpha-3.

A complete Hook system for nn and autograd

Accessing intermediate values during the forward pass is straightforward, but during backward the buffers can rapidly change their content (for example: when doing in-place optimizations).

If you want to get access to the gradients at a particular Op or Layer inside your model, you use the hook system.
Hooks can be attached to variables or to modules and are called as soon as the gradient is available:

# Example in autograd
a, b, c = [Variable(torch.Tensor(5, 5)) for i in range(3)]

def print_norm(grad):
    print(grad.norm(2))

y = b * c + a
y.register_hook(print_norm)

z = y * y - b
z.backward(torch.ones(5, 5))

# Example in nn
model = ...

def inspect_forward(module, input, output):
    ...

model.conv2.register_forward_hook(inspect_forward)

def inspect_backward(module, grad_input, grad_output):
    ...

model.conv2.register_backward_hook(inspect_backward)

We would definitely look forward to comments about the Hook system. Let us know what you think.

Added in-place ops to autograd and more neural network modules to nn

  • As part of porting fb.resnet.torch, we've added AveragePool2d and fixed BatchNorm2d
  • Now, autograd fully supports in-place operations, with in-place variables immediately marked as dirty.
    To illustrate this, let's look at a small example
x = Variable(torch.ones(5, 5))
y = Variable(torch.ones(5, 5) * 4)

z = x * y
q = z * y
r = z + y
z.add_(y)
# z is the last expression, so this should succeed
z.backward(torch.ones(5, 5))

# r doesn't use z in its backward, so it should succeed
r.backward(torch.ones(5, 5))

# however, q needs z in its backward, but z has now been
# marked as dirty (because it was used in an in-place operation)
# this line will hence raise an error
q.backward(torch.ones(5, 5))

Plans for alpha-3

  • Unit tests for multiprocessing
  • Add more nn modules and autograd functions ( we're porting fb.resnet.torch )
  • New CUDA memory allocator (non-synchronizing CUDA tensors allocations)
    • We've made progress on this, but it is not complete yet
  • Trainer and Dataset classes
  • Continuous builds for CUDA (using Nimbix)
  • Binary packages (nightly and versioned)