nn.Embedding with max_norm shows unstable behavior and sometimes causes a runtime error #26596
Comments
Per documentation (of functional.embedding): max_norm (float, optional): If given, each embedding vector with norm larger than max_norm is renormalized to have norm max_norm. Note: this will modify weight in-place. So we need to update https://pytorch.org/docs/stable/nn.html#embedding accordingly.
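For illustration, the in-place renormalization is easy to observe directly (a minimal sketch; the sizes and the 1.0 threshold are arbitrary choices, not from the thread):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(4, 8, max_norm=1.0)
idx = torch.tensor([0, 1])
before = emb.weight.detach().clone()
emb(idx)  # the forward pass renormalizes the looked-up rows in-place
changed = ~torch.isclose(before, emb.weight.detach()).all(dim=1)
print(changed)  # rows 0 and 1 come out True whenever their norm exceeded 1.0
```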
I see. Could you explain why swapping Line a and Line b then does not lead to the same error? By the way, setting …
Because when you call the embedding forward with max_norm set, the weight is renormalized in-place, which bumps the tensor's version counter. If Line a has already saved the weight for backward, autograd detects the version mismatch during loss.backward() and raises the RuntimeError. With the lines swapped, the in-place modification happens before the weight is saved, so the check passes.
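A minimal sketch of the mechanism, with nothing embedding-specific in it (any op that saves its input for backward triggers the same check):

```python
import torch

w = torch.randn(3, requires_grad=True)
y = w ** 2           # pow saves w for its backward pass
with torch.no_grad():
    w.mul_(0.5)      # untracked in-place edit bumps w's version counter
y.sum().backward()   # RuntimeError: one of the variables needed for gradient
                     # computation has been modified by an inplace operation
```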
I am wondering whether one could introduce an embedding normalization function that is a subclass of torch.autograd.Function, so that the renormalization is tracked by autograd instead of happening as an untracked in-place update.
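For what it's worth, a rough sketch of an out-of-place, autograd-visible renormalization could look like the following. The helper name and shape are hypothetical, not an existing PyTorch API, and the semantics differ from today's behavior: the stored weight is never actually shrunk, so it is not a drop-in replacement. No custom autograd.Function turns out to be necessary, since the scaling can be written with ordinary differentiable ops:

```python
import torch

def embedding_max_norm(weight, idx, max_norm, norm_type=2.0):
    # Hypothetical helper (not an existing PyTorch API): renormalize the
    # looked-up rows out-of-place so autograd sees the whole computation
    # and nothing bumps the weight's version counter.
    rows = weight[idx]                                    # differentiable lookup
    norms = rows.norm(p=norm_type, dim=-1, keepdim=True)
    scale = (max_norm / norms.clamp_min(1e-12)).clamp(max=1.0)
    return rows * scale                                   # rows with norm <= max_norm
```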
This is a super critical bug: when used with DDP it causes the embedding weights to fall out of sync across replicas, which is detrimental for training. Furthermore, the warning doesn't exist in nn.Embedding (only in F.embedding), so someone would have to read the source to realize nn.Embedding calls F.embedding underneath. This must be hurting hundreds of projects that rely on embeddings combined with DDP.
I will bump the priority on this, as it seems to be more than a docs issue and many people are hitting it.
To me, it seems like we have at least two options to solve this issue:
1. Keep the current behavior and document it clearly: state in the nn.Embedding docs that max_norm renormalizes the weight in-place during forward, and show the clone() workaround.
2. Change the implementation so the renormalization is no longer an untracked in-place update (e.g. make it out-of-place or autograd-aware).
Even though option 2 might seem more ideal, maybe we should go with option 1, to avoid a BC break and to keep the current behavior.
Both options make sense, but the first variant is much easier, especially taking into account that there is a C++ version of nn, and we would need to keep both in sync as well as account for the perf impact.
I am facing the same issue in pytorch 1.13.1 |
@tonygracious, as the documentation mentions, if you're doing operations on Embedding.weight while max_norm is not None, you have to clone the weight first, because the forward pass modifies it in-place.
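Concretely, the workaround looks roughly like this (sizes are arbitrary; the point is the clone() on the line that reads the weight):

```python
import torch
import torch.nn as nn

n, d, m = 3, 5, 7
embedding = nn.Embedding(n, d, max_norm=1.0)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])

a = embedding.weight.clone() @ W.t()  # clone: autograd saves the copy, not the live weight
b = embedding(idx) @ W.t()            # the in-place renorm no longer invalidates anything saved
loss = (a.unsqueeze(0) + b.unsqueeze(1)).sigmoid().prod()
loss.backward()                       # runs without a RuntimeError
```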
🐛 Bug
An nn.Embedding object with max_norm set to True causes a RuntimeError that is hard to track.
To Reproduce
The following code causes a RuntimeError. The error can be avoided by removing the max_norm feature or by swapping Line a and Line b in the code.
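A minimal reproduction consistent with the discussion (Line a and Line b as labeled below; note that max_norm=True is a bool that ends up coerced to the float 1.0):

```python
import torch
import torch.nn as nn

n, d, m = 3, 5, 7
embedding = nn.Embedding(n, d, max_norm=True)  # True is coerced to the float 1.0
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])

a = embedding.weight @ W.t()  # Line a: matmul saves embedding.weight for backward
b = embedding(idx) @ W.t()    # Line b: the forward renormalizes the weight in-place
loss = (a.unsqueeze(0) + b.unsqueeze(1)).sigmoid().prod()
loss.backward()  # RuntimeError: one of the variables needed for gradient
                 # computation has been modified by an inplace operation
```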
Expected behavior
There shouldn't be any error when running the code above.
Strangely, there is no RuntimeError when Line a and Line b are swapped. This is something that has to be investigated.
Environment
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 18.04.3 LTS
GCC version: (Homebrew gcc 5.5.0_4) 5.5.0
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: GeForce GTX 1080 Ti
Nvidia driver version: 430.26
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.3
Versions of relevant libraries:
[pip] botorch==0.1.3
[pip] gpytorch==0.3.5
[pip] numpy==1.17.2
[pip] torch==1.2.0
[pip] torchvision==0.4.0a0+6b959ee
[conda] blas 1.0 mkl
[conda] botorch 0.1.3 pypi_0 pypi
[conda] gpytorch 0.3.5 pypi_0 pypi
[conda] libblas 3.8.0 12_mkl conda-forge
[conda] libcblas 3.8.0 12_mkl conda-forge
[conda] liblapack 3.8.0 12_mkl conda-forge
[conda] mkl 2019.4 243
[conda] pytorch 1.2.0 py3.7_cuda10.0.130_cudnn7.6.2_0 pytorch
[conda] torchvision 0.4.0 py37_cu100 pytorch
Additional context
cc @ezyang @gchanan @zou3519 @jlin27 @albanD @mruberry