
Integrating CuDNN API for LSTMs with projections #46213

Closed · Kipok opened this issue Oct 12, 2020 · 10 comments
Labels
feature - A request for a proper, new feature.
module: cuda - Related to torch.cuda, and CUDA support in general
module: cudnn - Related to torch.backends.cudnn, and CuDNN support
module: rnn - Issues related to RNN support (LSTM, GRU, etc.)
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

Kipok commented Oct 12, 2020

🚀 Feature

CuDNN has supported LSTMs with projections (https://arxiv.org/abs/1402.1128) since release 7.1.2, but this functionality is not exposed in PyTorch. It would be great if PyTorch integrated it natively.
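For context, in a projected ("LSTMP") cell the output hidden state is multiplied by an extra recurrent projection matrix, reducing it from hidden_size to a smaller proj_size before it is fed back into the recurrence. A minimal Python sketch of one step (illustrative only, not the cuDNN kernel):

```python
import torch

def lstmp_step(x, hx, cx, w_ih, w_hh, w_hr, b):
    # hx carries proj_size features, so w_hh has shape (4*hidden_size, proj_size)
    gates = x @ w_ih.t() + hx @ w_hh.t() + b
    i, f, g, o = gates.chunk(4, dim=1)
    cy = torch.sigmoid(f) * cx + torch.sigmoid(i) * torch.tanh(g)
    hy = torch.sigmoid(o) * torch.tanh(cy)   # (batch, hidden_size)
    return hy @ w_hr.t(), cy                 # projected to (batch, proj_size)
```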

Motivation

The motivation is twofold. First, the performance of TorchScript-accelerated code still lags significantly behind the CuDNN implementation. When the sequence length is relatively short (e.g. < 100), the performance of the JIT code is good enough, but when it is much larger (e.g. ~1000), which is typical for speech recognition models, the JIT code can be up to 2x slower than the CuDNN implementation. Second, it seems hard to make JIT code work with AMP. I don't see any examples of this on the PyTorch website, and when I try typical code that works for regular models, I get errors like this:

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<string>", line 204, in <backward op>
        def mm(self, mat2):
            def backward(grad_output):
                grad_self = AD_mm_backward_self(grad_output, mat2)
                            ~~~~~~~~~~~~~~~~~~~ <--- HERE
                grad_mat2 = AD_mm_backward_mat2(grad_output, self)
                return grad_self, grad_mat2
  File "<string>", line 197, in AD_mm_backward_self

        def AD_mm_backward_self(grad, mat2):
            return grad.mm(mat2.t())
                   ~~~~~~~ <--- HERE

        def AD_mm_backward_mat2(grad, self):
RuntimeError: expected scalar type Half but found Float

I don't quite understand whether TorchScript is officially supported by the AMP package. If it is, could you please provide some examples? From what I understand, it can only work in "tracing" mode, but that significantly limits code flexibility.
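For reference, this is the kind of "typical code" meant above: a standard torch.cuda.amp training loop (the model, shapes, and loss are placeholders), which works for regular modules but triggers the error above when the model is a scripted custom LSTM:

```python
import torch

model = torch.nn.LSTM(input_size=80, hidden_size=512).cuda()  # placeholder model
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(1000, 8, 80, device="cuda")  # long sequences, as in speech
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        out, _ = model(x)
        loss = out.float().pow(2).mean()         # dummy loss
    scaler.scale(loss).backward()                # scale to avoid fp16 underflow
    scaler.step(optimizer)                       # unscales; skips step on inf/nan
    scaler.update()
```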

Pitch

Officially support CuDNN LSTMs with projections, as they are quite important for speech recognition models. It shouldn't be too big a change, given that the regular CuDNN implementation is already integrated into PyTorch and only needs to be extended to support projection models.

Alternatives

Since there are two reasons for integrating projection models, there are also different alternatives that I considered. For the speed issue, basically the only choice right now is to use JIT. It is not so slow as to be unusable, but having the code run 2x faster is definitely a big advantage.

For the AMP integration, as far as I understand, I have two options. First, I can use tracing, but that limits code flexibility. Second, I can implement the AMP functionality by hand: manually casting tensors in JIT modules to half precision and manually handling gradient scaling and the second (fp32) weight copy in the optimizer, as in the sketch below. That is a lot of coding, and I would need proper tests to make sure I don't cast any unsafe operations to fp16. Please let me know if there are any simpler options for using AMP with TorchScript.
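To make the second option concrete, here is a rough sketch of hand-rolled mixed precision (fp16 working copy of the model, fp32 master weights, manual loss scaling). Everything here is illustrative; it is what autocast/GradScaler would otherwise automate:

```python
import torch

# fp16 working copy of the model, fp32 master weights for the optimizer.
model = torch.nn.LSTM(80, 512).cuda().half()
master = [p.detach().clone().float().requires_grad_() for p in model.parameters()]
optimizer = torch.optim.SGD(master, lr=1e-3)
loss_scale = 2.0 ** 16

x = torch.randn(1000, 8, 80, device="cuda", dtype=torch.half)
out, _ = model(x)
loss = out.float().pow(2).mean()              # reduce in fp32 for stability
(loss * loss_scale).backward()                # fp16 grads land on model params

for p32, p16 in zip(master, model.parameters()):
    p32.grad = p16.grad.float() / loss_scale  # unscale into fp32 master grads
optimizer.step()                              # fp32 update
with torch.no_grad():
    for p32, p16 in zip(master, model.parameters()):
        p16.copy_(p32.half())                 # refresh the fp16 working copy
```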

Additional context

I'm aware of the official PyTorch position on supporting custom LSTM modifications via JIT and TorchScript: #9572. But given that projection models are already supported on the CuDNN side, and given their importance for speech recognition, I think it might make sense to make an exception for this special case, especially since it should not involve many code modifications.

cc @ngimel @csarofeen @ptrblck @zou3519

glaringlee added the feature, module: cudnn, and triaged labels on Oct 13, 2020
Kipok (Author) commented Oct 20, 2020

Any feedback on this proposal?

Kipok (Author) commented Oct 29, 2020

Actually, I ended up implementing this myself. You can see a draft of the code here: https://github.com/Kipok/pytorch/tree/cudnn_projections (based on the 1.6 tag). For bidirectional LSTM models with large sequence lengths, it is almost 2x faster than the JIT-based implementation. If it is to be merged, I will need to clean it up (some cases are currently broken, e.g. when there are no biases, because I didn't need that support for testing). But the question now is: do you want to take these changes upstream and officially support them?

seemethere added the module: cuda label on Oct 29, 2020
gchanan (Contributor) commented Nov 2, 2020

@ngimel

zou3519 (Contributor) commented Nov 2, 2020

I'm in favor of upstreaming this into our nn.LSTM API: the paper has been around for a while and has a good number of citations, and I'm not sure when we will get better TorchScript support for custom LSTMs (cc @suo).

I'm not very familiar with AMP -- does nn.LSTM support AMP, and will the cuDNN projection LSTM kernel support AMP?

zou3519 added the module: rnn label on Nov 2, 2020
ngimel (Collaborator) commented Nov 2, 2020

Thank you, we'll accept a PR enabling this. To land it, we'll need to:

  1. error out on, or support, projections on CPU and on GPU without cuDNN
  2. support both the with-bias and without-bias cases
  3. have comprehensive tests
  4. update the documentation to describe the projection option

ngimel (Collaborator) commented Nov 2, 2020

cc @mcarilli for an expert opinion on the AMP question. nn.LSTM supports AMP in the sense that autocast regions convert LSTM inputs to fp16 (because LSTM is expected to be stable and more performant in fp16). cuDNN with projections should work with fp16 as well.

Kipok (Author) commented Nov 4, 2020

Thanks for your feedback, it's great that you're willing to accept this! I am starting to clean up my code and have a few design questions.

First, is it OK to keep the "proj_size" argument in the RNNBase class from torch/nn/modules/rnn.py? It feels a bit misplaced there, given that only LSTM networks can actually accept it. On the other hand, all the weight-creation logic happens there, so it is much easier to modify the base class than nn.LSTM. But if you think "proj_size" doesn't belong in RNNBase, I can write code to create/append the weights inside the LSTM class itself; it will just be a bit more involved.

Second, I am considering implementing the CPU/CUDA version of LSTM with projections. That would make it easier to test whether my implementation is correct, but I'm trying to understand how involved it would be. There is a similar question here: do you think it's OK to expose projections in the base classes even when they cannot be used, or should I create new classes just for the LSTM case? E.g., the CellParams class from ATen/native/RNN.cpp assumes 2 or 4 weights are passed in; with projections it will be 3 or 5 (see the shape sketch below). Should I extend that class with another weight and mark it as undefined for all models but LSTM, or should I create a derived class that adds the additional weight and keeps the base implementation clean wherever possible?

Also, implementing the regular CPU/CUDA versions is straightforward (basically a one-line change to LSTMCell: an additional matmul on the resulting hidden state), but I don't completely understand the quantized cell implementations, so adding proper support there might be tricky. It would be great to have some advice on the quantized implementation, e.g. should it also be a one-line addition somewhere, or is it more involved, in which case I should maybe leave it out and mark it as not supported for now?
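For concreteness, a sketch of how proj_size changes the per-layer weight set (shapes follow the cuDNN projection scheme; the variable names and sizes here are illustrative):

```python
import torch

input_size, hidden_size, proj_size = 80, 512, 256

# Without projections: 2 weight matrices per layer/direction (+ 2 optional biases).
no_proj = {
    "w_ih": torch.empty(4 * hidden_size, input_size),
    "w_hh": torch.empty(4 * hidden_size, hidden_size),
}

# With projections: 3 weight matrices (+ 2 optional biases). The recurrent
# matmul consumes the projected state, so w_hh shrinks, and the new w_hr
# maps the hidden state from hidden_size down to proj_size.
with_proj = {
    "w_ih": torch.empty(4 * hidden_size, input_size),
    "w_hh": torch.empty(4 * hidden_size, proj_size),
    "w_hr": torch.empty(proj_size, hidden_size),
}
```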

Kipok (Author) commented Nov 5, 2020

@zou3519, @ngimel, any feedback on the above?

ngimel (Collaborator) commented Nov 5, 2020

I think it's OK to expose projections in the base classes, as long as you error out properly in RNN/GRU; the same goes for CellParams. A minimal sketch of such a check is below. cc @vkuzo for the quantization question; I think marking it as not supported should be acceptable.
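A minimal sketch of the kind of guard meant here, assuming a proj_size argument on the base class (illustrative, not the actual PyTorch source):

```python
# Hypothetical check in RNNBase.__init__; `mode` is "RNN", "GRU", or "LSTM".
if proj_size > 0 and mode != "LSTM":
    raise ValueError("proj_size is only supported for LSTM, not RNN or GRU")
if proj_size >= hidden_size:
    raise ValueError("proj_size has to be smaller than hidden_size")
```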

vkuzo (Contributor) commented Nov 7, 2020

> should it also be a one-line addition somewhere, or is it more involved, in which case I should maybe leave it out and mark it as not supported for now?

Yes, sounds reasonable to me. cc @raghuramank100 and @z-a-f as a heads up.

hwangdeyu pushed a commit to hwangdeyu/pytorch that referenced this issue Jan 6, 2021
Summary:
Fixes pytorch#46213

I haven't updated the documentation yet; I will add those changes soon. A few other things that I didn't do, but where I want to clarify whether I should:

1. I didn't expose projections in the C++ API: torch/csrc/api/src/nn/modules/rnn.cpp. Let me know if this is desirable and I will add those changes.
2. I didn't expose projections in the "lstm_cell" and "_thnn_differentiable_lstm_cell_backward" functions from aten/src/ATen/native/RNN.cpp. As far as I understand, they are not needed for nn.LSTM CPU execution. For lstm_cell, projections don't bring any real benefit: if the cell is used separately, the projection can easily be added in Python. For "_thnn_differentiable_lstm_cell_backward", I'm actually not sure where exactly that function is used, so I also disabled projections there for now. Please let me know if I should change that.
3. I added a check that projections are not supported for quantized LSTMs to the quantized_lstm_<data/input> functions, but I didn't add any checks to the LSTMCell code. Since I disabled projections in the "lstm_cell" function, they should not be reachable for quantized models through any API other than quantized_lstm_<data/input>. Please let me know if I'm wrong and I will add checks in other places.
4. Projections are not supported for CuDNN versions < 7.1.2. Should I add a check for the CuDNN version and disable projections in that case? If so, what is the best way to do that?
5. Currently I added the projection weight as the last weight, so the layout is "w_ih, w_hh, b_ih, b_hh, w_hr". This breaks the assumption that biases come after weights, so I had to add extra if-s in various places. The alternative would be the "w_ih, w_hh, w_hr, b_ih, b_hh" layout, which preserves that assumption, but then I would need to split the loop in the get_parameters function from aten/src/ATen/native/cudnn/RNN.cpp, and in some cases I would still need to insert an "undefined" tensor in the 3rd position, because we get all 5 weights from CuDNN most of the time. So I'm not sure which way is better. Let me know if you think I should switch to the weights-then-biases layout.

Pull Request resolved: pytorch#47725

Reviewed By: zou3519

Differential Revision: D25449794

Pulled By: ngimel

fbshipit-source-id: fe6ce59e481d1f5fd861a8ff7fa13d1affcedb0c