Integrating CuDNN API for LSTMs with projections #46213
Any feedback on this proposal?
Actually, I ended up implementing this myself. You can see the draft of the code here: https://github.com/Kipok/pytorch/tree/cudnn_projections (based on the 1.6 tag). There is almost a 2x speed-up for bidirectional LSTM models over the JIT-based implementation when the sequence length is large. If it is to be merged, I will need to clean it up (currently some cases are broken, e.g. when there are no biases, because I didn't need that support for testing). But the question now is: do you want to take these changes upstream and officially support them?
I'm in favor of upstreaming into our nn.LSTM API: the paper has been around for a while, has a good number of citations, and I'm not sure when we will get better TorchScript support for custom LSTMs (cc @suo). I'm not very familiar with AMP -- does nn.LSTM support AMP / will the cudnn projection lstm kernel support AMP?
Thank you, we'll accept a PR enabling this. To land it, we'll need to
cc @mcarilli for an expert opinion on the amp question. nn.LSTM supports amp in the sense that inputs to the lstm will be converted to fp16 in autocast regions (because it is expected to be stable and more performant in fp16). cudnn with projections should work with fp16 as well.
Thanks for your feedback, it's great that you're willing to accept this! I am starting to work on cleaning up my code and I have a few design questions.

First, is it ok for me to keep the "proj_size" argument in the RNNBase class from torch/nn/modules/rnn.py? It feels a bit misplaced there, given that only LSTM networks can actually accept it. On the other hand, all the weight-creation logic happens there, so it's much easier to modify it in the base class than in nn.LSTM. But if you think it's not good to have "proj_size" in RNNBase, I can write some code to create/append the weights inside the LSTM class itself; it will just be a bit more involved.

Second, I am considering implementing the CPU/CUDA version of LSTM with projections. That would make it easier to test whether my implementation is correct, but I'm trying to understand how involved it will be. There is a similar question here: do you think it's ok to expose projections in the base classes, even when they cannot be used, or should I create new classes just for the LSTM case? E.g. the CellParams class from ATen/native/RNN.cpp assumes there are 2 or 4 weights passed in. For projections it will be 3 or 5. Should I extend that class with another weight and mark it as undefined for all models but LSTM? Or should I create a derived class which adds the additional weight and keep the base implementation clean wherever possible?

Also, it is straightforward for me to implement the regular CPU/CUDA versions (basically just a one-line change to LSTMCell, doing an additional matmul on the resulting hidden state), but I don't completely understand the quantized cell implementations, so it might be tricky to add proper support there. It would be great to have some advice on the quantized implementation: e.g. should it also be just a one-line addition somewhere, or is it more involved and should I maybe leave it out and mark it as not supported for now?
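To make the "proj_size in RNNBase" question concrete, here is a small sketch of how the per-layer weight shapes would change. This is a hypothetical helper, not actual PyTorch code: the names `lstm_weight_shapes`, `w_hr`, etc. are illustrative, but the shape logic follows the design discussed above (the recurrent matrix shrinks to the projected size, and a new projection matrix maps `hidden_size` back down to `proj_size`).

```python
def lstm_weight_shapes(input_size, hidden_size, proj_size=0,
                       layer=0, bidirectional=False):
    """Per-direction weight shapes for one LSTM layer (illustrative sketch).

    With proj_size > 0 the recurrence runs on the projected hidden state,
    so w_hh's second dimension becomes proj_size and an extra w_hr of
    shape (proj_size, hidden_size) is added.
    """
    real_hidden = proj_size if proj_size > 0 else hidden_size
    num_directions = 2 if bidirectional else 1
    # Layers after the first consume the (possibly projected) output
    # of the previous layer.
    layer_input = input_size if layer == 0 else real_hidden * num_directions
    shapes = {
        "w_ih": (4 * hidden_size, layer_input),
        "w_hh": (4 * hidden_size, real_hidden),
        "b_ih": (4 * hidden_size,),
        "b_hh": (4 * hidden_size,),
    }
    if proj_size > 0:
        shapes["w_hr"] = (proj_size, hidden_size)
    return shapes
```

Keeping this logic in the base class means only the `real_hidden` substitution and the optional `w_hr` entry differ from the existing code path, which is the argument for putting `proj_size` in RNNBase despite it being LSTM-only.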
I think it's ok to expose projections in the base classes, as long as you error out properly in RNN/GRU; same for CellParams. cc @vkuzo for the quantization question, I think marking it as not supported should be acceptable.
Yes, sounds reasonable to me. cc @raghuramank100 and @z-a-f as a heads up.
Summary: Fixes pytorch#46213

I didn't yet update the documentation, will add those changes soon. A few other things that I didn't do, but want to clarify if I maybe should:

1. I didn't expose projections in the C++ API: torch/csrc/api/src/nn/modules/rnn.cpp. Let me know if this is desirable and I will add those changes.
2. I didn't expose projections in the "lstm_cell" and "_thnn_differentiable_lstm_cell_backward" functions from aten/src/ATen/native/RNN.cpp. As far as I understand, they are not needed for nn.LSTM CPU execution. For lstm_cell, projections don't bring any real benefit, since if the cell is used separately, the projection can easily be added in Python. For "_thnn_differentiable_lstm_cell_backward", I'm actually not sure where exactly that function is used, so I also disabled projections there for now. Please let me know if I should change that.
3. I added a check that projections are not supported for quantized LSTMs to the quantized_lstm_<data/input> functions. But I didn't add any checks to the LSTMCell code. It seems that since I disabled projections in the "lstm_cell" function, they should also not be available for quantized models through any API other than quantized_lstm_<data/input>. Please let me know if I'm not correct and I will add checks in other places.
4. Projections are not supported for CuDNN versions < 7.1.2. Should I add a check for the CuDNN version and disable projections in that case? If so, what would be the best way to do that?
5. Currently I added the projection weight as the last weight, so the layout is "w_ih, w_hh, b_ih, b_hh, w_hr". This breaks the assumption that biases come after weights, and thus I had to add additional if-s in various places. An alternative would be the "w_ih, w_hh, w_hr, b_ih, b_hh" layout, in which case the assumption holds. But then I would need to split the loop in the get_parameters function from aten/src/ATen/native/cudnn/RNN.cpp. And in some cases I would still need to add an "undefined" tensor in the 3rd position, because we get all 5 weights from CuDNN most of the time. So I'm not sure which way is better. Let me know if you think I should change to the weights-then-biases layout.

Pull Request resolved: pytorch#47725
Reviewed By: zou3519
Differential Revision: D25449794
Pulled By: ngimel
fbshipit-source-id: fe6ce59e481d1f5fd861a8ff7fa13d1affcedb0c
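For reference, the two per-layer layouts weighed in point 5 above can be written out as follows. This is an illustration only; the names mirror the comment, not actual PyTorch internals, and the helper is hypothetical.

```python
# Layout from the PR as landed: projection weight appended last, which
# breaks the "all biases come after all weights" assumption.
LAYOUT_PROJ_LAST = ("w_ih", "w_hh", "b_ih", "b_hh", "w_hr")

# Alternative: all weights before biases, at the cost of splitting the
# parameter-retrieval loop in cudnn/RNN.cpp's get_parameters.
LAYOUT_WEIGHTS_FIRST = ("w_ih", "w_hh", "w_hr", "b_ih", "b_hh")

def split_weights_and_biases(layout):
    """Partition a per-layer parameter layout into weight and bias names."""
    weights = tuple(n for n in layout if n.startswith("w_"))
    biases = tuple(n for n in layout if n.startswith("b_"))
    return weights, biases
```

Under the first layout, code that assumes "the first N entries are weights, the rest are biases" needs extra conditionals; under the second, both layouts yield the same weight/bias partition but the CuDNN parameter loop must interleave the retrieval order.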
🚀 Feature
CuDNN supports LSTMs with projections (https://arxiv.org/abs/1402.1128) from release 7.1.2, but it's not exposed in PyTorch. It would be great if PyTorch natively integrated that CuDNN functionality.
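For context, the projected LSTM from the cited paper differs from a vanilla LSTM in only one extra matmul: the gated hidden state is multiplied by a recurrent projection matrix before being fed back, so the recurrence runs on a smaller state. A minimal NumPy sketch of one cell step (gate order follows PyTorch's i, f, g, o convention; all names are illustrative, not any real API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x, h, c, w_ih, w_hh, b_ih, b_hh, w_hr):
    """One step of an LSTM with a recurrent projection (LSTMP).

    Note h has size proj_size, not hidden_size: the recurrence uses the
    projected state, which shrinks w_hh and speeds up the layer.
    """
    gates = w_ih @ x + b_ih + w_hh @ h + b_hh          # (4 * hidden_size,)
    i, f, g, o = np.split(gates, 4)                    # i, f, g, o gate order
    c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell-state update
    h_full = sigmoid(o) * np.tanh(c_next)              # (hidden_size,)
    h_next = w_hr @ h_full                             # project to proj_size
    return h_next, c_next
```

The only line a plain LSTM lacks is the final `w_hr @ h_full`, which is why exposing the CuDNN kernel rather than hand-rolling this in TorchScript is mostly a plumbing exercise.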
Motivation
The motivation for this is twofold. First, the performance of TorchScript-accelerated code still lags significantly behind the CuDNN implementation. When the sequence length is relatively short (e.g. < 100), the performance of JIT code is good enough. But when the sequence length is much larger (e.g. ~1000), which is typical for speech recognition models, it can be up to 2x slower than the CuDNN implementation. Second, it seems hard to make JIT code work with AMP. I do not see any examples of that on the PyTorch website, and when I try to use the typical code that works for regular models, I get errors like this:
I don't quite understand whether TorchScript is officially supported by the AMP package. If it is, can you please provide some examples? From what I understand, it can only work in "tracing" mode, but that would significantly limit code flexibility.
Pitch
Officially support CuDNN LSTMs with projections, as they are quite important for speech recognition models. It shouldn't be too big of a change, given that the regular CuDNN implementation is already integrated in PyTorch and just needs to be extended to support the projection case.
Alternatives
Since there are 2 reasons for integrating projection models, there are also different alternatives that I considered. For the speed issue, basically the only choice right now is to use JIT. It's not too slow to be unusable, but having the code run 2x faster is definitely a big advantage.
For the AMP integration, as far as I understand, I have 2 options. First, I can use tracing, but that limits code flexibility. Second, I can implement the AMP functionality by hand, by manually casting tensors in JIT modules to half precision and manually handling gradient scaling and a second copy of the weights in the optimizer. This seems like a lot of coding, and I will need to do proper tests to make sure I don't cast unsafe operations to fp16. Please let me know if there are any other, simpler options to use AMP with TorchScript.
Additional context
I'm aware of the official PyTorch position on supporting custom LSTM modifications via JIT and TorchScript: #9572. But given that projection models are already supported on the CuDNN side, and given their importance for speech recognition, I think it might make sense to make an exception for this special case, especially if it does not involve a lot of code modifications.
cc @ngimel @csarofeen @ptrblck @zou3519