
RuntimeError: isDifferentiableType(variable.scalar_type()) INTERNAL ASSERT FAILED when using torch.repeat #7197

Open
ajayvohra2005 opened this issue Jun 5, 2024 · 2 comments

Comments

@ajayvohra2005

🐛 Bug

Using torch.repeat leads to a runtime error:

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1717594747.366986     228 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1717594747.369320      16 service.cc:145] XLA service 0x55fb8bed0e60 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1717594747.369368      16 service.cc:153]   StreamExecutor device (0): NVIDIA A10G, Compute Capability 8.6
I0000 00:00:1717594747.369825      16 se_gpu_pjrt_client.cc:853] Using BFC allocator.
I0000 00:00:1717594747.369881      16 gpu_helpers.cc:107] XLA backend allocating 17696931840 bytes on device 0 for BFCAllocator.
I0000 00:00:1717594747.369915      16 gpu_helpers.cc:147] XLA backend will use up to 5898977280 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1717594747.370073      16 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
Traceback (most recent call last):
  File "/app/torch_xla_issue.py", line 35, in <module>
    loss = custom_loss_module.forward(pred=custom_pred[:, [-1], :], target=custom_target, is_dummy=True)
  File "/app/torch_xla_issue.py", line 22, in forward
    cos_sim = self.custom_cos_similarity_module.forward(x1=pred, x2=target)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/distance.py", line 89, in forward
    return F.cosine_similarity(x1, x2, self.dim, self.eps)
RuntimeError: isDifferentiableType(variable.scalar_type()) INTERNAL ASSERT FAILED at "/src/pytorch/torch/csrc/autograd/functions/utils.h":75, please report a bug to PyTorch. 

To Reproduce

Steps to reproduce the behavior:

Docker Image:

us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.3.0_3.10_cuda_12.1

Python script to reproduce error:

import torch
import torch.nn as nn

import torch_xla.core.xla_model as xm

class CustomLoss(nn.Module):

    def __init__(self, cos_similarity_dim=2):
        super().__init__()

        self.custom_cos_similarity_module = nn.CosineSimilarity(dim=cos_similarity_dim)
        self.custom_mse_loss_module = nn.MSELoss(reduction="none")

    def forward(self, pred: torch.Tensor, target: torch.Tensor, is_dummy: bool):

        if is_dummy:
            # repeating pred along dim 1 triggers the INTERNAL ASSERT below on the r2.3.0 XLA build
            pred = pred.repeat((1, target.shape[1], 1))
            # the commented-out expand is the workaround noted under "Additional context"
            #pred = pred.expand((-1, target.shape[1], -1))
        else:
            assert pred.shape[1] == target.shape[1]

        cos_sim = self.custom_cos_similarity_module.forward(x1=pred, x2=target)
        custom_cos_loss_tensor = 1 - cos_sim
        custom_mse_loss_tensor = self.custom_mse_loss_module.forward(input=pred, target=target)

        return custom_cos_loss_tensor, custom_mse_loss_tensor
    

custom_loss_module = CustomLoss()

device = xm.xla_device()
custom_pred = torch.rand(size=(10, 150, 256), dtype=torch.float, requires_grad=True).to(device)
custom_target = torch.zeros(size=(10, 150, 256), dtype=torch.float, requires_grad=True).to(device)

loss = custom_loss_module.forward(pred=custom_pred[:, [-1], :], target=custom_target, is_dummy=True)
print(loss)

Expected behavior

The script should run without error.

Environment

  • Reproducible on XLA backend [CUDA]:
  • torch_xla version: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.3.0_3.10_cuda_12.1
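
If it helps triage, here is a quick way to record the exact library versions shipped inside that container (a minimal sketch; it only relies on the standard __version__ attributes exposed by both packages):

import torch
import torch_xla

# Print the exact torch / torch_xla versions inside the docker image,
# so the report can be matched against a specific release.
print("torch:", torch.__version__)
print("torch_xla:", torch_xla.__version__)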

Additional context

Using torch.expand instead of torch.repeat is a workaround.
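
A minimal sketch of that workaround, using the shapes from the repro script above (treating the broadcast view returned by expand as interchangeable with repeat for this read-only use is an assumption of the sketch):

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()
pred = torch.rand(size=(10, 1, 256), dtype=torch.float, requires_grad=True).to(device)
target = torch.zeros(size=(10, 150, 256), dtype=torch.float).to(device)

# expand broadcasts pred along dim 1 as a view instead of materializing copies
# like repeat does, and does not hit the INTERNAL ASSERT on the r2.3.0 build
pred_expanded = pred.expand((-1, target.shape[1], -1))

cos_sim = nn.CosineSimilarity(dim=2)(pred_expanded, target)
print(cos_sim.shape)  # expected: torch.Size([10, 150])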

@JackCaoG
Collaborator

JackCaoG commented Jun 5, 2024

@zpcore can you take a look at this one? I suspect you can repro with the CPU as well.

@zpcore
Collaborator

zpcore commented Jun 5, 2024

@JackCaoG, yes, the issue can also be reproduced with the XLA CPU backend. Meanwhile, I tried the same code on the master branch and the issue doesn't exist, so it only affects the 2.3 release.
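
For reference, a minimal sketch of the CPU repro (assuming the PJRT runtime used by the 2.3 wheels; setting PJRT_DEVICE=CPU before importing torch_xla selects the XLA CPU device):

# PJRT_DEVICE must be set before torch_xla is imported.
import os
os.environ["PJRT_DEVICE"] = "CPU"

import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # resolves to the XLA CPU device here

pred = torch.rand(size=(10, 1, 256), dtype=torch.float, requires_grad=True).to(device)
target = torch.zeros(size=(10, 150, 256), dtype=torch.float).to(device)

# repeat followed by cosine_similarity is the combination that trips the assert
pred = pred.repeat((1, target.shape[1], 1))
F.cosine_similarity(pred, target, dim=2)  # raises the INTERNAL ASSERT on the 2.3 release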

The simplest solution is to use the latest docker build us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_cuda_12.1_20240605. @ajayvohra2005, can you try this docker image instead?
