Integrating CuDNN API for LSTMs with projections #46213
Any feedback on this proposal?
Actually, I ended up implementing this myself. You can see the draft of the code here: https://github.com/Kipok/pytorch/tree/cudnn_projections (based on the 1.6 tag). There is almost a 2x speed-up for bidirectional LSTM models over the JIT-based implementation when the sequence length is large. If it is to be merged, I will need to clean it up (currently some cases are broken, e.g. when there are no biases, because I didn't need that support for testing). But the question now is: do you want to take these changes upstream and officially support them?
I'm in favor of upstreaming into our nn.LSTM API: the paper has been around for a while, has a good number of citations, and I'm not sure when we will get better TorchScript support for custom LSTMs (cc @suo). I'm not very familiar with AMP -- does nn.LSTM support AMP / will the cudnn projection lstm kernel support AMP?
Thank you, we'll accept a PR enabling this. To land it, we'll need to
cc @mcarilli for an expert opinion on the amp question. nn.LSTM supports amp in the sense that inputs to the lstm will be converted to fp16 in autocast regions (because it is expected to be stable and more performant in fp16). cudnn with projections should work with fp16 as well.
Thanks for your feedback, it's great that you're willing to accept this! I am starting to work on cleaning up my code and I have a few design questions.

First, is it ok for me to keep the "proj_size" argument in the RNNBase class from torch/nn/modules/rnn.py? It feels a bit misplaced there, given that only LSTM networks can actually accept it. On the other hand, all the weight-creation logic happens there, so it's much easier to modify it in the base class than in nn.LSTM. But if you think it's not good to have "proj_size" in RNNBase, I can write some code to create/append the weights inside the LSTM class itself; it will just be a bit more involved.

Second, I am considering implementing the CPU/CUDA version of LSTM with projections. That would make it easier to test whether my implementation is correct, but I'm trying to understand how involved it will be. There is a similar question here: do you think it's ok to expose projections in the base classes, even when they cannot be used, or should I create new classes just for the LSTM case? E.g. the CellParams class from ATen/native/RNN.cpp assumes there are 2 or 4 weights passed in. For projections it will be 3 or 5. Should I extend that class with another weight and mark it as undefined for all models but LSTM? Or should I create a derived class which adds the additional weight and keep the base implementation clean wherever possible?

Also, it is straightforward for me to implement the regular CPU/CUDA versions (basically just a one-line change to LSTMCell, doing an additional matmul on the resulting hidden state), but I don't completely understand the quantized cell implementations, so it might be tricky to add proper support there. It would be great to have some advice on the quantized implementation: e.g. should it also be just a one-line addition somewhere, or is it more involved and should I maybe leave it out and mark it as not supported for now?
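To make the "proj_size in RNNBase" question concrete, here is a small sketch of how the per-layer weight shapes would change. This is a hypothetical helper, not actual PyTorch code: the names `lstm_weight_shapes`, `w_hr`, etc. are illustrative, but the shape logic follows the design discussed above (the recurrent matrix shrinks to the projected size, and a new projection matrix maps `hidden_size` back down to `proj_size`).

```python
def lstm_weight_shapes(input_size, hidden_size, proj_size=0,
                       layer=0, bidirectional=False):
    """Per-direction weight shapes for one LSTM layer (illustrative sketch).

    With proj_size > 0 the recurrence runs on the projected hidden state,
    so w_hh's second dimension becomes proj_size and an extra w_hr of
    shape (proj_size, hidden_size) is added.
    """
    real_hidden = proj_size if proj_size > 0 else hidden_size
    num_directions = 2 if bidirectional else 1
    # Layers after the first consume the (possibly projected) output
    # of the previous layer.
    layer_input = input_size if layer == 0 else real_hidden * num_directions
    shapes = {
        "w_ih": (4 * hidden_size, layer_input),
        "w_hh": (4 * hidden_size, real_hidden),
        "b_ih": (4 * hidden_size,),
        "b_hh": (4 * hidden_size,),
    }
    if proj_size > 0:
        shapes["w_hr"] = (proj_size, hidden_size)
    return shapes
```

Keeping this logic in the base class means only the `real_hidden` substitution and the optional `w_hr` entry differ from the existing code path, which is the argument for putting `proj_size` in RNNBase despite it being LSTM-only.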
I think it's ok to expose projections in the base classes, as long as you error out properly in RNN/GRU; same for CellParams. cc @vkuzo for the quantization question, I think marking it as not supported should be acceptable.
Yes, sounds reasonable to me. cc @raghuramank100 and @z-a-f as a heads up.
Summary: Fixes pytorch#46213

I didn't yet update the documentation, will add those changes soon. A few other things that I didn't do, but want to clarify if I maybe should:

1. I didn't expose projections in the C++ API: torch/csrc/api/src/nn/modules/rnn.cpp. Let me know if this is desirable and I will add those changes.
2. I didn't expose projections in the "lstm_cell" and "_thnn_differentiable_lstm_cell_backward" functions from aten/src/ATen/native/RNN.cpp. As far as I understand, they are not needed for nn.LSTM CPU execution. For lstm_cell, projections don't bring any real benefit, since if the cell is used separately, the projection can easily be added in Python. For "_thnn_differentiable_lstm_cell_backward", I'm actually not sure where exactly that function is used, so I also disabled projections there for now. Please let me know if I should change that.
3. I added a check that projections are not supported for quantized LSTMs to the quantized_lstm_<data/input> functions. But I didn't add any checks to the LSTMCell code. It seems that since I disabled projections in the "lstm_cell" function, they should also not be available for quantized models through any API other than quantized_lstm_<data/input>. Please let me know if I'm not correct and I will add checks in other places.
4. Projections are not supported for CuDNN versions < 7.1.2. Should I add a check for the CuDNN version and disable projections in that case? If so, what would be the best way to do that?
5. Currently I added the projection weight as the last weight, so the layout is "w_ih, w_hh, b_ih, b_hh, w_hr". This breaks the assumption that biases come after weights, and thus I had to add additional if-s in various places. An alternative would be the "w_ih, w_hh, w_hr, b_ih, b_hh" layout, in which case the assumption holds. But then I would need to split the loop in the get_parameters function from aten/src/ATen/native/cudnn/RNN.cpp. And in some cases I would still need to add an "undefined" tensor in the 3rd position, because we get all 5 weights from CuDNN most of the time. So I'm not sure which way is better. Let me know if you think I should change to the weights-then-biases layout.

Pull Request resolved: pytorch#47725
Reviewed By: zou3519
Differential Revision: D25449794
Pulled By: ngimel
fbshipit-source-id: fe6ce59e481d1f5fd861a8ff7fa13d1affcedb0c
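For reference, the two per-layer layouts weighed in point 5 above can be written out as follows. This is an illustration only; the names mirror the comment, not actual PyTorch internals, and the helper is hypothetical.

```python
# Layout from the PR as landed: projection weight appended last, which
# breaks the "all biases come after all weights" assumption.
LAYOUT_PROJ_LAST = ("w_ih", "w_hh", "b_ih", "b_hh", "w_hr")

# Alternative: all weights before biases, at the cost of splitting the
# parameter-retrieval loop in cudnn/RNN.cpp's get_parameters.
LAYOUT_WEIGHTS_FIRST = ("w_ih", "w_hh", "w_hr", "b_ih", "b_hh")

def split_weights_and_biases(layout):
    """Partition a per-layer parameter layout into weight and bias names."""
    weights = tuple(n for n in layout if n.startswith("w_"))
    biases = tuple(n for n in layout if n.startswith("b_"))
    return weights, biases
```

Under the first layout, code that assumes "the first N entries are weights, the rest are biases" needs extra conditionals; under the second, both layouts yield the same weight/bias partition but the CuDNN parameter loop must interleave the retrieval order.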
🚀 Feature
CuDNN supports LSTMs with projections (https://arxiv.org/abs/1402.1128) from release 7.1.2, but it's not exposed in PyTorch. It would be great if PyTorch natively integrated that CuDNN functionality.
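For context, the projected LSTM from the cited paper differs from a vanilla LSTM in only one extra matmul: the gated hidden state is multiplied by a recurrent projection matrix before being fed back, so the recurrence runs on a smaller state. A minimal NumPy sketch of one cell step (gate order follows PyTorch's i, f, g, o convention; all names are illustrative, not any real API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x, h, c, w_ih, w_hh, b_ih, b_hh, w_hr):
    """One step of an LSTM with a recurrent projection (LSTMP).

    Note h has size proj_size, not hidden_size: the recurrence uses the
    projected state, which shrinks w_hh and speeds up the layer.
    """
    gates = w_ih @ x + b_ih + w_hh @ h + b_hh          # (4 * hidden_size,)
    i, f, g, o = np.split(gates, 4)                    # i, f, g, o gate order
    c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell-state update
    h_full = sigmoid(o) * np.tanh(c_next)              # (hidden_size,)
    h_next = w_hr @ h_full                             # project to proj_size
    return h_next, c_next
```

The only line a plain LSTM lacks is the final `w_hr @ h_full`, which is why exposing the CuDNN kernel rather than hand-rolling this in TorchScript is mostly a plumbing exercise.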
Motivation
The motivation for this is twofold. First, the performance of TorchScript-accelerated code still lags significantly behind the CuDNN implementation. When the sequence length is relatively short (e.g. < 100), the performance of JIT code is good enough. But when the sequence length is much larger (e.g. ~1000), which is typical for speech recognition models, it can be up to 2x slower than the CuDNN implementation. Second, it seems hard to make JIT code work with AMP. I do not see any examples of that on the PyTorch website, and when I try to use the typical code that works for regular models, I get errors like this:
I don't quite understand whether TorchScript is officially supported by the AMP package. If it is, can you please provide some examples? From what I understand, it can only work in "tracing" mode, but that would significantly limit code flexibility.
Pitch
Officially support CuDNN LSTMs with projections, as they are quite important for speech recognition models. It shouldn't be too big of a change, given that the regular CuDNN implementation is already integrated in PyTorch and just needs to be extended to support the projection case.
Alternatives
Since there are 2 reasons for integrating projection models, there are also different alternatives that I considered. For the speed issue, basically the only choice right now is to use JIT. It's not too slow to be unusable, but having the code run 2x faster is definitely a big advantage.
For the AMP integration, as far as I understand, I have 2 options. First, I can use tracing, but that limits code flexibility. Second, I can implement the AMP functionality by hand, by manually casting tensors in JIT modules to half precision and manually handling gradient scaling and a second copy of the weights in the optimizer. This seems like a lot of coding, and I will need to do proper tests to make sure I don't cast unsafe operations to fp16. Please let me know if there are any other, simpler options to use AMP with TorchScript.
Additional context
I'm aware of the official PyTorch position on supporting custom LSTM modifications via JIT and TorchScript: #9572. But given that projection models are already supported on the CuDNN side, and given their importance for speech recognition, I think it might make sense to make an exception for this special case, especially if it does not involve a lot of code modifications.
cc @ngimel @csarofeen @ptrblck @zou3519