
Embedding max_norm wrong results on cuda when repeating indices are present #44792

Closed
ivkireev86 opened this issue Sep 16, 2020 · 4 comments
Labels
high priority module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments


ivkireev86 commented Sep 16, 2020

🐛 Bug

I need reproducible output from my model, but the Embedding layer produces different results in some cases.

To Reproduce

Steps to reproduce the behavior:

  1. Set the torch random seed.
  2. Use all of the following options together; the result becomes reproducible if you omit or change any of them:
  • Use a CUDA device.
  • Use an Embedding layer with a large embedding_dim and max_norm enabled.
  • Get embeddings for a large number of repeated indices.

The embeddings differ across application runs.

import torch

# Request deterministic cuDNN behavior and seed all RNGs.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

device = torch.device('cuda:0')
model = torch.nn.Embedding(
            num_embeddings=2,
            embedding_dim=64,
            max_norm=1.0,
        ).to(device)
ix = torch.arange(2).long().to(device)
# Look up the same two indices 2000 times each; max_norm renorms the
# weight rows in place during the forward pass.
out = model(ix.repeat(2000))

# Print the L2 norm of each embedding row, then a checksum of the output.
for p in model.parameters():
    print((p ** 2).sum(dim=1, keepdim=True) ** 0.5)
print(out.sum())
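A possible workaround until the kernel is fixed — a sketch, not from this thread; it assumes the nondeterminism comes from repeated indices being renormalized concurrently, so it deduplicates the indices with torch.unique before the lookup and scatters the results back:

```python
import torch

torch.manual_seed(42)
# Fall back to CPU so the sketch also runs without a GPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

emb = torch.nn.Embedding(num_embeddings=2, embedding_dim=64,
                         max_norm=1.0).to(device)
ix = torch.arange(2).long().to(device).repeat(2000)

# Look up each distinct index exactly once, so the in-place renorm touches
# every weight row a single time, then expand back to the original order.
uniq, inverse = torch.unique(ix, return_inverse=True)
out = emb(uniq)[inverse]
```

With this, `out[i]` is the embedding of `ix[i]`, and each weight row is renormalized once per forward pass regardless of how often its index repeats.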

Expected behavior

I expect the same output for different application runs.

Environment

Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
GPU 2: Tesla P100-PCIE-16GB
GPU 3: Tesla P100-PCIE-16GB

Nvidia driver version: 435.21
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] pytorch-ignite==0.4.0.post1
[pip3] torch==1.6.0
[pip3] torchvision==0.7.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.1.243             h6bb024c_0  
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py37he904b0f_0  
[conda] mkl_fft                   1.1.0            py37h23d657b_0  
[conda] mkl_random                1.1.1            py37h0573a6f_0  
[conda] numpy                     1.19.1           py37hbc911f0_0  
[conda] numpy-base                1.19.1           py37hfa32c7d_0  
[conda] pytorch                   1.6.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] pytorch-ignite            0.4.0.post1              pypi_0    pypi
[conda] torchvision               0.7.0                    pypi_0    pypi

Additional context

cc @ezyang @gchanan @zou3519 @ngimel

@mruberry mruberry added module: determinism triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module module: cuda Related to torch.cuda, and CUDA support in general labels Sep 17, 2020
@mruberry
Collaborator

Thanks for reporting this issue, @ivkireev86. We just updated our determinism documentation (see https://pytorch.org/docs/master/generated/torch.set_deterministic.html#torch.set_deterministic). It mentions EmbeddingBag but not the Embedding module.

cc @kurtamohler
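For reference, a minimal sketch of opting into that deterministic mode. Note the naming is an assumption across releases: torch.set_deterministic was experimental at the time and was later renamed torch.use_deterministic_algorithms, so the sketch probes for whichever name is available:

```python
import torch

# Prefer the current name; fall back to the experimental 1.7-era one.
set_det = (getattr(torch, 'use_deterministic_algorithms', None)
           or getattr(torch, 'set_deterministic', None))
if set_det is not None:
    # Ops without a deterministic implementation will now raise an error
    # instead of silently running nondeterministically.
    set_det(True)
```

As the comment above notes, the documentation at the time mentioned EmbeddingBag but not the Embedding module.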

Collaborator

ngimel commented Sep 17, 2020

High priority for silent wrong results.

@ngimel ngimel changed the title Embedding max_norm reproducibility error Embedding max_norm wrong results on cuda when repeating indices are present Sep 17, 2020
Collaborator

ngimel commented Sep 17, 2020

This bug has existed forever; it was inherited from the old code, see #4322.

Collaborator

kurtamohler commented Sep 24, 2020

@ngimel, what is the cause of the nondeterminism? I'd like to include a short description of the cause next to the nondeterministic alert, so we have it documented.

Never mind, I noticed that the reason is mentioned in the description of issue #4322.
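My reading of #4322, stated as an assumption rather than a quote: with max_norm set, the forward pass rescales the selected weight rows in place, so when an index repeats, one CUDA thread can read a row while another is mid-write and then rescale a partially updated copy. A CPU sketch of that hazard:

```python
import torch

torch.manual_seed(0)
row = torch.randn(64)  # a weight row with norm well above max_norm
max_norm = 1.0

# Correct result: rescale the row once down to norm max_norm.
once = row * (max_norm / row.norm())

# Race: a second "thread" reads the row while the first is mid-write,
# seeing half renormalized values and half originals, then rescales
# that mixture.
mixed = row.clone()
mixed[:32] = once[:32]
racy = mixed * (max_norm / mixed.norm())

# Both results end up with norm max_norm, but they point in different
# directions, so the output depends on thread timing.
```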
