@soumith soumith released this Oct 3, 2016


Some interesting stats

On Resnets

Because we aggressively free and reallocate resources, ResNets in PyTorch take less memory than in torch-nn:

  • 4.4GB in PyTorch
  • 6.5GB in Torch-nn
  • 4.6GB in Torch-nn with a hacky sharing of gradinput buffers
  • On 1-GPU, PyTorch speed is 10s of milliseconds faster than Torch-nn
  • On 2-GPUs, PyTorch is the same speed as Torch-nn
  • On 4-GPUs, PyTorch is about 10 to 20% slower, but only because we have just finished implementing multi-GPU support; we will be closing this perf gap in the next week.

FFI-based C extension

On a small benchmark of adding a constant to a 5x5 tensor over 1000 calls:

  • LuaJIT FFI: 0.001 seconds
  • Lua 5.2 FFI: 0.003 seconds
  • PyTorch CFFI: 0.003 seconds
  • Raw Python CFFI / CTypes: 0.001 seconds
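
For reference, the shape of this micro-benchmark can be sketched in plain Python with `timeit`. The nested list below merely stands in for the 5x5 tensor and for the FFI-bound C function, neither of which is reproduced here:

```python
import timeit

# Stand-in for the FFI-bound C function: add a constant to a 5x5
# "tensor" (a nested Python list, for illustration only).
def add_constant(t, c):
    return [[x + c for x in row] for row in t]

tensor = [[0.0] * 5 for _ in range(5)]

# Time 1000 calls, mirroring the benchmark described above.
elapsed = timeit.timeit(lambda: add_constant(tensor, 1.0), number=1000)
print("1000 calls: %.4f s" % elapsed)
```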

What's new in Alpha-4?


New Features and modules

  • Multi-GPU primitives
  • A custom CUDA allocator to maximize autograd performance (backported to Torch too)
  • More autograd functions. It is now almost API-complete for all differentiable torch.* functions.
  • CuDNN Integration
  • Multiprocess DataLoader in torch.utils (used in the imagenet example)
  • Extensions API to interface to your C code simply via FFI

Plans for Alpha-5

  • Revamping and rethinking the Checkpointing API
  • Revamping the Optim API to support things like per-layer learning rates and optimizing non-weights (like in NeuralStyle)
  • RNN Examples, initially for PennTreeBank language modeling
  • Better RNN support in general, improved error messages, multi-GPU etc.
  • NCCL Integration for improved multi-GPU performance (already implemented in #78)
  • Documentation / Reference manual for torch.* and autograd



We've added two tutorials to get you all started.

  • Tutorial 1: Introduction to PyTorch for former Torchies
    • In this tutorial we cover the torch, autograd and nn packages from the perspective of former Torch users.
    • Going through this tutorial should get you started. Let us know how we can improve it.
  • Tutorial 2: Write your own C code that interfaces into PyTorch via FFI
    • In this tutorial, we showcase how you can call your own C code that takes torch tensors as inputs / outputs, seamlessly via FFI.
    • The tutorial also shows how you can write your own neural network Module that calls into your C implementations.


We've added a full imagenet example with ResNets that should be well suited to “learning by example”.
It is located here:
For now, the data for the example has to be preprocessed in the same way as specified in fb.resnet.torch

The example has Multi-GPU support in a DataParallel fashion.

More intuitive error messages

We've gone through the TH and THNN C libraries and added much more intuitive error messages that report the mismatched shapes. We will continue to make improvements on this front.
If you have any unintuitive error messages that you encounter, please open an issue at

For example:

Old error message:

bad argument #2 to 'v' (3D or 4D (batch mode) tensor expected for input

New error message:

bad argument #2 to 'v' (3D or 4D (batch mode) tensor expected for input, but got: [100 x 100])

No more CamelCase for functions

All torch functions have been renamed from CamelCase to underscore_case, for example:

  • indexAdd → index_add_
  • getRNGState → get_rng_state

New Features and modules

Multi-GPU primitives

  • We've added efficient multi-GPU support for neural networks in general. Instead of building magic blocks that do opaque parallelization for you, we've broken them down into easy-to-use collectives.
  • A pattern like DataParallel is implemented in terms of:
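
As an illustrative sketch, the pattern breaks down into a scatter, a parallel apply, and a gather. This is plain Python with threads standing in for GPUs; the helper names below are hypothetical, not the actual torch API:

```python
import threading

# Hypothetical sketch of the DataParallel pattern. The helpers below
# (scatter, parallel_apply, gather) are illustrative stand-ins.

def scatter(batch, num_devices):
    # Split the batch into roughly equal chunks, one per "device".
    chunk = (len(batch) + num_devices - 1) // num_devices
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]

def parallel_apply(fn, chunks):
    # One Python thread per chunk, so work is dispatched breadth-first.
    results = [None] * len(chunks)

    def worker(i, data):
        results[i] = [fn(x) for x in data]

    threads = [threading.Thread(target=worker, args=(i, c))
               for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def gather(outputs):
    # Concatenate the per-device outputs back into a single batch.
    return [y for out in outputs for y in out]

def data_parallel(fn, batch, num_devices=2):
    return gather(parallel_apply(fn, scatter(batch, num_devices)))

print(data_parallel(lambda x: x * 2, [1, 2, 3, 4, 5, 6]))
# → [2, 4, 6, 8, 10, 12]
```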


With Multi-GPU, we naturally overlap data transfers with compute across the whole graph. This makes multi-GPU much more efficient, and is done in a way that does not interfere with the imperativeness / error reporting.

Another important note is that we now dispatch parallel modules via Python threads, which issues CUDA kernel launches in a breadth-first fashion and gets rid of obvious kernel-launch latency bottlenecks.

Custom CUDA allocator to maximize autograd performance

In Torch, we had to write nn modules carefully to avoid CUDA synchronization points, which were a bottleneck both for multi-GPU and for performance in general. This sometimes cost neural networks and autograd up to 2x in performance.

In PyTorch (and Torch), Sam Gross has written a new caching CUDA allocator that avoids CUDA synchronization points while being well suited to tensor use-cases, where we typically make short-term and long-term allocations of the same tensor sizes.

This unblocks us from a lot of performance issues.
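
The core idea can be illustrated with a toy size-bucketed cache. This is a hypothetical sketch, not the actual allocator: freed blocks go onto a per-size free list and are reused, so repeated same-size allocations skip the slow, synchronizing underlying allocation path:

```python
# Toy sketch of a size-bucketed caching allocator, illustrating the
# idea only (this is NOT the actual PyTorch CUDA allocator).
class CachingAllocator:
    def __init__(self):
        self.free_blocks = {}   # size -> list of cached blocks
        self.raw_allocs = 0     # calls to the "slow" underlying allocator

    def alloc(self, size):
        cache = self.free_blocks.get(size)
        if cache:
            return cache.pop()          # cache hit: no sync point
        self.raw_allocs += 1            # stands in for cudaMalloc
        return bytearray(size)

    def free(self, block):
        # Don't return memory to the system; keep it for reuse.
        self.free_blocks.setdefault(len(block), []).append(block)

allocator = CachingAllocator()
for _ in range(100):                    # 100 same-size alloc/free cycles
    buf = allocator.alloc(4096)
    allocator.free(buf)
print(allocator.raw_allocs)             # → 1 (only the first alloc is slow)
```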

More autograd functions

Now the torch.* API should be pretty much ready for full autograd support (short of 3 functions).
Autograd has been enabled for all functions except non-differentiable ones such as torch.eq.

CuDNN Integration

We now fully integrate and support CuDNN version 5.1.3, and it is shipped in the binaries (just like CUDA), so you never have to worry about downloading and installing it manually from the NVIDIA website.

Generic Multiprocess DataLoader

We've added a flexible data loader that supports multiple data-loading workers. This enables a lot of use-cases, and it is first used in our ImageNet example.
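
As a rough illustration of the producer/consumer shape such a loader takes (a hypothetical sketch, not the actual torch.utils implementation; threads keep it simple where the real loader uses worker processes):

```python
import queue
import threading

# Illustrative multi-worker loader: workers pull sample indices from a
# shared queue, "load" the samples, and push results to an output queue.
class SimpleDataLoader:
    def __init__(self, dataset, num_workers=2):
        self.dataset = dataset
        self.num_workers = num_workers

    def __iter__(self):
        index_q = queue.Queue()
        out_q = queue.Queue()
        for i in range(len(self.dataset)):
            index_q.put(i)

        def worker():
            while True:
                try:
                    i = index_q.get_nowait()
                except queue.Empty:
                    return
                out_q.put(self.dataset[i])   # where real loading would go

        threads = [threading.Thread(target=worker)
                   for _ in range(self.num_workers)]
        for t in threads:
            t.start()
        for _ in range(len(self.dataset)):
            yield out_q.get()
        for t in threads:
            t.join()

loader = SimpleDataLoader(list(range(10)), num_workers=4)
print(sorted(loader))   # samples arrive in nondeterministic order
```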

C Extensions API

We added an easy to use extensions API and an example extension here:

You can call your C functions (that have TH*Tensor inputs / outputs and other fundamental types in the function signature) without writing any manual Python bindings.

One question you might have is what call overhead these auto-generated FFI bindings carry. The answer is: none, as the numbers at the beginning of this note show.

The example extension also covers how you can define your autograd-ready nn module that calls your C function.