cudnn.torch

Torch7 FFI bindings for NVIDIA cuDNN (R5) kernels!

Modules are API compatible with their nn equivalents and fully unit-tested against the nn implementations. Conversion between nn and cudnn is available through the cudnn.convert function.

Installation

  • Install cuDNN (version R5 EA)
  • Have at least CUDA 7.0
  • Have libcudnn.so in your library path ($LD_LIBRARY_PATH) (install it from https://developer.nvidia.com/cuDNN)
  • Alternatively, copy the library files into /usr/local/cuda/lib64/ or the corresponding folders in your CUDA directory
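
Once installed, a quick way to check that the bindings load and pick up the right library is to print the reported cuDNN version (a minimal sketch; it assumes the bindings expose cudnn.version after require):

require 'cunn'
require 'cudnn'
print('cuDNN version: ' .. cudnn.version)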

Modules

-- All inputs have to be 3D or 4D (batch mode), except for ReLU, Tanh, Sigmoid, and BatchNormalization
cudnn.SpatialConvolution(nInputPlane, nOutputPlane, kW, kH, [dW = 1], [dH = 1], [padW = 0], [padH = 0], [groups = 1])
cudnn.SpatialMaxPooling(kW, kH, dW, dH, padW, padH)
cudnn.SpatialAveragePooling(kW, kH, dW, dH, padW, padH)

-- The pointwise functions take an additional optional argument. If inplace=true, they operate in-place without using any extra memory.
cudnn.ReLU(inplace[=false])
cudnn.ClippedReLU(ceiling, inplace[=false])
cudnn.Tanh(inplace[=false])
cudnn.Sigmoid(inplace[=false])

-- SoftMax can be run in fast mode or accurate mode. Default is accurate mode.
cudnn.SoftMax(fastMode [= false])          -- SoftMax across each image (just like nn.SoftMax)
cudnn.LogSoftMax()                         -- LogSoftMax across each image (just like nn.LogSoftMax)
cudnn.SpatialSoftMax(fastMode [= false])   -- SoftMax across feature-maps (per spatial location)
cudnn.SpatialLogSoftMax()                  -- LogSoftMax across feature-maps (per spatial location)
cudnn.VolumetricSoftMax(fastMode [= false])   -- SoftMax across feature-maps (per volumetric location)
cudnn.VolumetricLogSoftMax()                  -- LogSoftMax across feature-maps (per volumetric location)

cudnn.SpatialCrossEntropyCriterion()       -- A spatial version of LogSoftMax + ClassNLLCriterion in one shot
cudnn.VolumetricCrossEntropyCriterion()       -- A volumetric version of LogSoftMax + ClassNLLCriterion in one shot

-- Batch Normalization
cudnn.BatchNormalization(nFeature, eps, momentum, affine) -- same arguments as https://github.com/torch/nn/blob/master/doc/simple.md#nn.BatchNormalization
cudnn.SpatialBatchNormalization(nFeature, eps, momentum, affine)
cudnn.VolumetricBatchNormalization(nFeature, eps, momentum, affine)


-- Volumetric inputs (4D or 5D batched mode)
cudnn.VolumetricConvolution(nInputPlane, nOutputPlane, kT, kW, kH, dT, dW, dH, padT, padW, padH)
cudnn.VolumetricMaxPooling(kT, kW, kH, dT, dW, dH, padT, padW, padH)
cudnn.VolumetricAveragePooling(kT, kW, kH, dT, dW, dH, padT, padW, padH)

-- Recurrent Modules

-- All inputs have to be 3D. Accepts input of seqLength x batch x inputDim, or batch x seqLength x inputDim if batchFirst is set to true.
cudnn.RNNReLU(inputDim, outputDim, numberOfLayers, [batchFirst = false])
cudnn.RNNTanh(inputDim, outputDim, numberOfLayers, [batchFirst = false])
cudnn.LSTM(inputDim, outputDim, numberOfLayers, [batchFirst = false])
cudnn.GRU(inputDim, outputDim, numberOfLayers, [batchFirst = false])
cudnn.BLSTM(inputDim, outputDim, numberOfLayers, [batchFirst = false])
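
As a quick illustration (a minimal sketch that is not part of the original documentation; the layer sizes below are arbitrary), the modules are used just like their nn counterparts, on CUDA tensors:

require 'cunn'
require 'cudnn'

-- a small convolutional block
local model = nn.Sequential()
model:add(cudnn.SpatialConvolution(3, 16, 5, 5, 1, 1, 2, 2)) -- 3 -> 16 planes, 5x5 kernel, pad 2
model:add(cudnn.ReLU(true))                                   -- in-place ReLU
model:add(cudnn.SpatialMaxPooling(2, 2, 2, 2))
model:cuda()

local out = model:forward(torch.CudaTensor(8, 3, 32, 32):uniform()) -- batch of 8 RGB 32x32 images
print(#out)                                                         -- 8 x 16 x 16 x 16

-- a 2-layer LSTM: 64-dim input, 128-dim hidden state
local rnn = cudnn.LSTM(64, 128, 2):cuda()
local h = rnn:forward(torch.CudaTensor(10, 4, 64):uniform())        -- seqLength x batch x inputDim
print(#h)                                                           -- 10 x 4 x 128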

Modes

There are two globally available modes useful for tuning performance:

require 'cudnn'
cudnn.benchmark = true -- uses the built-in cuDNN auto-tuner to find the fastest convolution algorithms.
                       -- If this is set to false, built-in heuristics are used instead, which might not always be fastest.

By default, cudnn.benchmark is set to false. Setting it to true will typically improve performance, at the expense of using more memory. The input shape should be the same for every batch; otherwise the auto-tuner re-runs for each new shape, causing a huge slowdown.

cudnn.fastest = true -- this is like the :fastest() mode for the Convolution modules,
                     -- simply picks the fastest convolution algorithm, rather than tuning for workspace size

By default, cudnn.fastest is set to false. Set it to true if memory is not an issue and you want the fastest performance.

cudnn.verbose = true -- this prints out some more verbose information useful for debugging

By default, cudnn.verbose is set to false.
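
Putting the flags together, a typical setup (a minimal sketch; the particular values are just one reasonable choice) looks like:

require 'cudnn'
cudnn.benchmark = true   -- auto-tune convolution algorithms (keep input shapes constant across batches)
cudnn.fastest   = false  -- keep workspace-size limits in effect when picking algorithms
cudnn.verbose   = false  -- keep algorithm-selection logging quiet

Set these before running the model, since the flags are consulted when each convolution selects its algorithm.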

Conversion between cudnn and nn

Conversion is done by the cudnn.convert function, which takes a network and a backend argument and recursively walks over the network's modules, substituting their equivalents. No memory copy is done; only the metatables are swapped. If you don't want to convert all modules, you can pass a function as the third argument to cudnn.convert. It is called at each step with the module currently being converted, and it is meant to exclude modules from conversion: if it returns true, the module is left untouched, otherwise it is converted.

Note that you cannot do a backward pass when using cuDNN if your model has batch normalization layers and is in evaluate mode.

net = nn.Sequential()
net:add(nn.SpatialConvolution(3,96,11,11,3,3))
net:add(nn.ReLU())
cudnn.convert(net, cudnn)
print(net)

net = nn.Sequential()
net:add(nn.SpatialConvolution(3,96,11,11,3,3))
net:add(nn.ReLU())
cudnn.convert(net, cudnn, function(module)
   return torch.type(module):find('ReLU')
end)
print(net)

will result in:

nn.Sequential {
  [input -> (1) -> (2) -> output]
  (1): cudnn.SpatialConvolution(3 -> 96, 11x11, 3,3)
  (2): cudnn.ReLU
}
nn.Sequential {
  [input -> (1) -> (2) -> output]
  (1): cudnn.SpatialConvolution(3 -> 96, 11x11, 3,3)
  (2): nn.ReLU
}
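
Conversion works in the other direction as well: passing nn as the backend converts cudnn modules back to their nn equivalents (a minimal sketch, continuing the examples above):

cudnn.convert(net, nn)   -- swap cudnn modules in net back to their nn equivalents
print(net)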

Older versions

  • For cuDNN R1, check out the branch R1
  • For cuDNN R2, check out the branch R2
  • For cuDNN R3, check out the branch R3
  • For cuDNN R4, check out the branch R4