Seg fault with test_rnn_retain_variables on ppc64le #16953

deepali-c · 2019-02-11T10:25:40Z

🐛 Bug

A segfault is observed when running the test case test_rnn_retain_variables.

Other tests that crash are:

test_cuda_rnn_fused
test_rnn_initial_hidden_state
test_einsum

To Reproduce

The minimal code to reproduce is:

import torch
import torch.nn as nn
device="cpu"
dtype=torch.double
rnn = nn.GRU(10, 20, num_layers=2).to(device,dtype)
input = torch.randn(5, 6, 10, device=device, dtype=dtype, requires_grad=True)
output = rnn(input)

Other observations are:
a. The above code works for device="cuda"
b. The above code works for dtype=torch.float
c. The above code works if the following is used for input:
input = torch.randn(3, 3, 10, device=device, dtype=dtype, requires_grad=True)
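Because a segfault kills the interpreter, comparing the working and failing variants above is easier if each one runs in its own process. The following is only a rough sketch of that idea, not part of the original report: the helper names (build_snippet, run_isolated, sweep) are hypothetical, and it assumes a POSIX system, where a negative exit status -N from subprocess means the child was terminated by signal N (SIGSEGV is 11 on Linux).

```python
import subprocess
import sys

# Template of the reproducer from this issue; dtype and input shape are the
# two knobs that the observations above vary. (Hypothetical helper layout.)
SNIPPET = """
import torch
import torch.nn as nn
rnn = nn.GRU(10, 20, num_layers=2).to("cpu", torch.{dtype})
inp = torch.randn({shape}, dtype=torch.{dtype}, requires_grad=True)
rnn(inp)
"""


def build_snippet(dtype, shape):
    """Fill the template, e.g. dtype="double", shape=(5, 6, 10)."""
    return SNIPPET.format(dtype=dtype, shape=", ".join(map(str, shape)))


def run_isolated(code):
    """Run code in a fresh interpreter and return its exit status.

    -X faulthandler asks Python to dump a traceback to stderr on a fatal
    signal, so a crash in the child still leaves a trace.
    """
    cmd = [sys.executable, "-X", "faulthandler", "-c", code]
    return subprocess.run(cmd).returncode


def sweep():
    """Try every dtype/shape combination mentioned in the issue."""
    results = {}
    for dtype in ("double", "float"):
        for shape in ((5, 6, 10), (3, 3, 10)):
            results[(dtype, shape)] = run_isolated(build_snippet(dtype, shape))
    return results
```

Calling sweep() returns a dict keyed by (dtype, shape). An entry of 0 means that variant ran to completion, while a negative entry means the child process was killed by a signal.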

Expected behavior

The tests should pass.

Environment

PyTorch version: 1.0.0a0+7998997 (with some local changes)
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Red Hat Enterprise Linux Server 7.6 (Maipo)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
CMake version: Could not collect

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 410.72
cuDNN version: Could not collect

Additional context

ezyang · 2019-06-16T03:36:48Z

Possibly related: the test occasionally times out on non-PPC: https://circleci.com/gh/pytorch/pytorch/2000995?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

fmassa added the "module: crash" label (Problem manifests as a hard crash, as opposed to a RuntimeError) · Feb 19, 2019

umanwizard added the "module: POWER" (Issues specific to the POWER/ppc architecture) and "triaged" (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels · Jun 17, 2019
