Error when training on GPU, tensor gets moved to CPU #3611
davidireland3 asked this question in Q&A (Unanswered)
I am training a network that consists of SAGEConv and TopKPooling layers. I put the data and model on a GPU, but somewhere during training a tensor ends up on the CPU. I have checked via print statements that my data and model are indeed on the GPU, so I am not sure where the error comes in.

I have trained the same network before (using the same data) with an InfoNCE loss, where I created class labels from the score attached to each graph. Now I am training with MSE to predict the score directly, so I am not sure whether that is causing the issue. The other main difference is that the GPU I am using is not the default one (I am on a machine with multiple GPUs, so I specify device = cuda:1).

Below is the attached error message. Please let me know if I need to add anything else to the question, e.g. snippets of the training script (I checked the data device using the node feature matrix; the batch score is the target matrix, i.e. y, inside the training loop).
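A minimal sketch of the likely failure mode and the fix, assuming the usual culprit: a tensor (often the MSE target y) that never left the CPU. Here `torch.nn.Linear` stands in for the actual GNN, and the shapes are placeholders, not taken from the original script:

```python
import torch
import torch.nn.functional as F

# Fall back to CPU when cuda:1 is not available, so the sketch runs anywhere.
device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

model = torch.nn.Linear(16, 1).to(device)  # stand-in for the SAGEConv/TopKPooling model
x = torch.randn(32, 16, device=device)     # node features, created on the target device
y = torch.randn(32, 1)                     # target accidentally created on the CPU

pred = model(x)
y = y.to(pred.device)                      # move the target explicitly before the loss
loss = F.mse_loss(pred, y)

# Quick device audit before calling loss.backward(): every tensor that
# touches the loss must report the same device.
for name, t in [("x", x), ("pred", pred), ("y", y)]:
    print(name, t.device)
```

Printing the device of each tensor right before the loss computation, as in the audit loop above, usually pinpoints which one stayed on the CPU.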