nn.LSTM gives nondeterministic results with dropout and multiple layers #18110
Comments
cuDNN's seed comes from us, but it's possible that the backward pass is non-deterministic for RNNs. CC @ngimel. You wouldn't happen to have a self-contained script we could run to see the nondeterminism, would you?
I just pushed a toy example that reproduces the issue. Please check out the following code and run it: https://github.com/freewym/pytorch_test_code/blob/master/train.py
It prints the loss for each batch, e.g. loss at batch 0: 4.609390735626221. The loss on each batch is not consistent across different runs (although the difference is very small in this toy example, it is significant in my real experiments). However, with cudnn disabled the loss is reproducible across runs. To sum up, this inconsistency only occurs when 1) cudnn is enabled; 2) nn.LSTM's argument bidirectional=True; 3) nn.LSTM's argument dropout > 0; and 4) nn.LSTM's argument num_layers > 1.
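For reference, here is a minimal self-contained sketch in the spirit of the linked train.py (the data is random and the model sizes are illustrative assumptions, not taken from the actual script). On an affected cuDNN build, two runs can print slightly different per-batch losses; with cudnn disabled, the printed losses match across runs.

import torch
import torch.nn as nn

torch.manual_seed(1)
torch.cuda.manual_seed_all(1)
torch.backends.cudnn.deterministic = True
# torch.backends.cudnn.enabled = False  # uncomment this and the losses match across runs

device = torch.device("cuda")
lstm = nn.LSTM(input_size=64, hidden_size=64, num_layers=2,
               dropout=0.1, bidirectional=True).to(device)
proj = nn.Linear(2 * 64, 10).to(device)  # 2 * hidden_size for bidirectional outputs
opt = torch.optim.SGD(list(lstm.parameters()) + list(proj.parameters()), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Fixed random data: every run sees exactly the same inputs and targets.
inputs = torch.randn(20, 8, 32, 64)          # (num_batches, seq_len, batch, input_size)
targets = torch.randint(0, 10, (20, 8, 32))  # one class label per time step

for i in range(20):
    opt.zero_grad()
    out, _ = lstm(inputs[i].to(device))
    loss = criterion(proj(out).view(-1, 10), targets[i].view(-1).to(device))
    loss.backward()
    opt.step()
    print("loss at batch {}: {}".format(i, loss.item()))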
We are investigating this issue.
If it's any use, I also have this issue. Here is my environment:
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 18.04.1 LTS
Python version: 3.7
Versions of relevant libraries:
The same problem on PyTorch 0.4 🙁
Yes. Unfortunately, I think we expect this issue with all versions of PyTorch. The issue is in cuDNN, not PyTorch.
Is there a way to report this issue to Nvidia?
Yes, I work at NVIDIA :P
I'm having the same problem here. I've been trying for a while to figure out the possible cause. Thanks for this report!
@nairbv to check if turning off a nondeterministic algorithm fixed this.
FYI, I tried that but the problem persists.
The semi-related issue was a case where the non-deterministic results were incorrect, so setting that operation to always be deterministic fixed it. In this issue we see non-deterministic results even when deterministic=True. I'm not sure if there's a way to fix this path in PyTorch without fixing it in cuDNN, though maybe we could add a warning.
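A rough sketch of what such a warning could look like (a hypothetical helper, not an actual PyTorch API; the 7.6.2 version cutoff is an assumption based on this thread):

import warnings
import torch
import torch.nn as nn

def warn_if_nondeterministic_lstm(lstm: nn.LSTM) -> None:
    # Hypothetical check for the known-bad configuration from this issue:
    # cuDNN enabled, bidirectional LSTM with dropout > 0 and num_layers > 1.
    version = torch.backends.cudnn.version()  # e.g. 7402 for cuDNN 7.4.2
    if (torch.backends.cudnn.enabled
            and version is not None and version < 7602
            and lstm.bidirectional and lstm.dropout > 0 and lstm.num_layers > 1):
        warnings.warn(
            "This LSTM configuration is nondeterministic on cuDNN < 7.6.2 even "
            "with torch.backends.cudnn.deterministic = True; upgrade cuDNN or "
            "set torch.backends.cudnn.enabled = False."
        )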
It should have been fixed in cuDNN 7.6.2; @jjsjann123, can you please check the nvbug?
Closed and fixed in cuDNN 7.6.1 @ngimel
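You can check whether a given build has the fix by querying the cuDNN version PyTorch sees (returned as an integer, e.g. 7602 for cuDNN 7.6.2):

import torch

print(torch.backends.cudnn.version())  # >= 7602 should include the fix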
🐛 Bug
I got non-deterministic results when I ran my model with nn.LSTM (dropout > 0) on a GPU, even though I seeded everything and set torch.backends.cudnn.deterministic = True. Also, if I set torch.backends.cudnn.enabled = False, the results are deterministic.
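A sketch of that workaround, either global or scoped to the affected module (the torch.backends.cudnn.flags context manager exists in recent PyTorch releases; whether it is available in 1.0.1 is an assumption):

import torch
import torch.nn as nn

# Global workaround, as described above:
# torch.backends.cudnn.enabled = False

# Scoped alternative: only this block runs with cuDNN disabled, so the
# rest of the model keeps the faster cuDNN kernels.
lstm = nn.LSTM(input_size=256, hidden_size=256, num_layers=3,
               dropout=0.1, bidirectional=True).cuda()
x = torch.randn(10, 4, 256, device="cuda")  # (seq_len, batch, input_size)
with torch.backends.cudnn.flags(enabled=False):
    out, (h_n, c_n) = lstm(x)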
To Reproduce
Steps to reproduce the behavior:
import random
import numpy as np
import torch
import torch.nn as nn

# Seed everything and request deterministic cuDNN kernels.
torch.backends.cudnn.deterministic = True
random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed_all(1)
np.random.seed(1)
define a module as:
lstm = nn.LSTM(input_size=256,
               hidden_size=256,
               num_layers=3,
               dropout=0.1,
               bidirectional=True,
               )
train with the defined module multiple times; a concrete determinism check is sketched below
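A quick in-process determinism check (a sketch; the input shape is made up, not from the original report): re-seed, run one forward/backward pass twice with the module above, and compare gradients. On an affected cuDNN build this can print False.

import torch
import torch.nn as nn

def one_pass():
    # Re-seed so both passes start from identical weights and inputs.
    torch.manual_seed(1)
    torch.cuda.manual_seed_all(1)
    lstm = nn.LSTM(input_size=256, hidden_size=256, num_layers=3,
                   dropout=0.1, bidirectional=True).cuda()
    x = torch.randn(16, 4, 256, device="cuda")
    out, _ = lstm(x)
    out.sum().backward()
    return [p.grad.clone() for p in lstm.parameters()]

grads_a, grads_b = one_pass(), one_pass()
print(all(torch.equal(a, b) for a, b in zip(grads_a, grads_b)))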
Expected behavior
The training should be deterministic across different runs.
Environment
PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Debian GNU/Linux 9.4 (stretch)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
CMake version: version 3.7.2
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla K80
GPU 1: Tesla K80
GPU 2: Tesla K80
GPU 3: Tesla K80
Nvidia driver version: 387.26
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.12.1
[conda] blas 1.0 mkl
[conda] mkl 2019.1 144
[conda] mkl-service 1.1.2 py37h90e4bf4_5
[conda] mkl_fft 1.0.10 py37ha843d7b_0
[conda] mkl_random 1.0.2 py37hd81dba3_0
[conda] pytorch 1.0.1 py3.7_cuda9.0.176_cudnn7.4.2_2 pytorch
[conda] torchvision 0.2.2 py_3 pytorch