
Embedding max_norm wrong results on cuda when repeating indices are present #44792

Closed
ivkireev86 opened this issue Sep 16, 2020 · 4 comments
Labels
high priority module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments


ivkireev86 commented Sep 16, 2020

🐛 Bug

I need reproducible output from my model, but the Embedding layer produces different results in some cases.

To Reproduce

Steps to reproduce the behavior:

  1. Set the torch random seed.
  2. Use all of the following options together; the result becomes reproducible if you omit or change any of them:
  • Use a CUDA device.
  • Use an Embedding layer with a large embedding_dim and max_norm enabled.
  • Get embeddings for a large number of repeated indices.

The embeddings differ across application runs.

import torch

# Request deterministic cuDNN behavior and seed all RNGs.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

device = torch.device('cuda:0')
model = torch.nn.Embedding(
            num_embeddings=2,
            embedding_dim=64,
            max_norm=1.0,
        ).to(device)
ix = torch.arange(2).long().to(device)
# Look up the same two indices 2000 times each; max_norm renorms the
# weight rows in place during the forward pass.
out = model(ix.repeat(2000))

# Print the L2 norm of each embedding row, then a checksum of the output.
for p in model.parameters():
    print((p ** 2).sum(dim=1, keepdim=True) ** 0.5)
print(out.sum())
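A possible workaround until the kernel is fixed — a sketch, not from this thread; it assumes the nondeterminism comes from repeated indices being renormalized concurrently, so it deduplicates the indices with torch.unique before the lookup and scatters the results back:

```python
import torch

torch.manual_seed(42)
# Fall back to CPU so the sketch also runs without a GPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

emb = torch.nn.Embedding(num_embeddings=2, embedding_dim=64,
                         max_norm=1.0).to(device)
ix = torch.arange(2).long().to(device).repeat(2000)

# Look up each distinct index exactly once, so the in-place renorm touches
# every weight row a single time, then expand back to the original order.
uniq, inverse = torch.unique(ix, return_inverse=True)
out = emb(uniq)[inverse]
```

With this, `out[i]` is the embedding of `ix[i]`, and each weight row is renormalized once per forward pass regardless of how often its index repeats.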

Expected behavior

I expect the same output for different application runs.

Environment

Collecting environment information...
PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
GPU 2: Tesla P100-PCIE-16GB
GPU 3: Tesla P100-PCIE-16GB

Nvidia driver version: 435.21
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] pytorch-ignite==0.4.0.post1
[pip3] torch==1.6.0
[pip3] torchvision==0.7.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.1.243             h6bb024c_0  
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py37he904b0f_0  
[conda] mkl_fft                   1.1.0            py37h23d657b_0  
[conda] mkl_random                1.1.1            py37h0573a6f_0  
[conda] numpy                     1.19.1           py37hbc911f0_0  
[conda] numpy-base                1.19.1           py37hfa32c7d_0  
[conda] pytorch                   1.6.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] pytorch-ignite            0.4.0.post1              pypi_0    pypi
[conda] torchvision               0.7.0                    pypi_0    pypi

Additional context

cc @ezyang @gchanan @zou3519 @ngimel

@mruberry mruberry added module: determinism triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module module: cuda Related to torch.cuda, and CUDA support in general labels Sep 17, 2020
@mruberry
Collaborator

Thanks for reporting this issue, @ivkireev86. We just updated our determinism documentation (see https://pytorch.org/docs/master/generated/torch.set_deterministic.html#torch.set_deterministic). It mentions EmbeddingBag but not the Embedding module.

cc @kurtamohler
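For reference, a minimal sketch of opting into that deterministic mode. Note the naming is an assumption across releases: torch.set_deterministic was experimental at the time and was later renamed torch.use_deterministic_algorithms, so the sketch probes for whichever name is available:

```python
import torch

# Prefer the current name; fall back to the experimental 1.7-era one.
set_det = (getattr(torch, 'use_deterministic_algorithms', None)
           or getattr(torch, 'set_deterministic', None))
if set_det is not None:
    # Ops without a deterministic implementation will now raise an error
    # instead of silently running nondeterministically.
    set_det(True)
```

As the comment above notes, the documentation at the time mentioned EmbeddingBag but not the Embedding module.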

Collaborator

ngimel commented Sep 17, 2020

High priority for silent wrong results.

@ngimel ngimel changed the title Embedding max_norm reproducibility error Embedding max_norm wrong results on cuda when repeating indices are present Sep 17, 2020
Collaborator

ngimel commented Sep 17, 2020

This bug has existed forever; it was inherited from the old code, see #4322.

Collaborator

kurtamohler commented Sep 24, 2020

@ngimel, what is the cause of the nondeterminism? I'd like to include a short description of the cause next to the nondeterministic alert, so we have it documented.

Never mind, I noticed that the reason is mentioned in the description of issue #4322.
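My reading of #4322, stated as an assumption rather than a quote: with max_norm set, the forward pass rescales the selected weight rows in place, so when an index repeats, one CUDA thread can read a row while another is mid-write and then rescale a partially updated copy. A CPU sketch of that hazard:

```python
import torch

torch.manual_seed(0)
row = torch.randn(64)  # a weight row with norm well above max_norm
max_norm = 1.0

# Correct result: rescale the row once down to norm max_norm.
once = row * (max_norm / row.norm())

# Race: a second "thread" reads the row while the first is mid-write,
# seeing half renormalized values and half originals, then rescales
# that mixture.
mixed = row.clone()
mixed[:32] = once[:32]
racy = mixed * (max_norm / mixed.norm())

# Both results end up with norm max_norm, but they point in different
# directions, so the output depends on thread timing.
```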
