
Get wrong results when running random walk on gpu #43

Closed
EricGz opened this issue Jan 4, 2020 · 11 comments

EricGz commented Jan 4, 2020

Hi, I get wrong results when running random walk on GPU. Please help!
The demo below reproduces the issue:

import torch
from torch_cluster import random_walk

device = 'cpu'
# device = 'cuda:0'

num_nodes = 3
walk_length = 3
p = 1
q = 1
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]]).to(device)
subset = torch.arange(num_nodes, device=edge_index.device)

rw = random_walk(edge_index[0], edge_index[1], subset,
                 walk_length, p, q, num_nodes)
print(rw)

There are three nodes and two edges in the graph. When I ran this code on cpu, I got the following results:

tensor([[0, 1, 0, 1],
        [1, 0, 1, 0],
        [2, 1, 2, 1]])

However, when I ran this code on gpu, the results became:

tensor([[-1, -1, -1, -1],
        [-1, -1, -1, -1],
        [-1, -1, -1, -1]], device='cuda:0')

Do you have any idea about it?
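For reference, the expected behavior can be reproduced without the compiled extension at all. Below is a minimal pure-PyTorch sketch of a uniform random walk (the p = q = 1 case), using the same (row, col) edge layout; `uniform_random_walk` is a hypothetical helper for cross-checking, not part of torch_cluster, and individual steps are random, so only the structure of the output matches, not the exact values:

```python
import torch

def uniform_random_walk(row, col, start, walk_length):
    # Uniform random walk (p = q = 1): at each step, move to a neighbor
    # chosen uniformly at random; stay put if a node has no neighbors.
    num_nodes = int(torch.cat([row, col]).max()) + 1
    neighbors = [col[row == n] for n in range(num_nodes)]
    cur = start.clone()
    walk = [cur.clone()]
    for _ in range(walk_length):
        for i in range(cur.numel()):
            nbrs = neighbors[int(cur[i])]
            if nbrs.numel() > 0:
                idx = torch.randint(nbrs.numel(), (1,)).item()
                cur[i] = nbrs[idx]
        walk.append(cur.clone())
    # Shape: (num_start_nodes, walk_length + 1), like the extension's output.
    return torch.stack(walk, dim=1)

row = torch.tensor([0, 1, 1, 2])
col = torch.tensor([1, 0, 2, 1])
rw = uniform_random_walk(row, col, torch.arange(3), walk_length=3)
print(rw)
```

Every entry should be a valid node id and every consecutive pair a valid edge, never -1.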


rusty1s commented Jan 4, 2020

Thanks for reporting. I will look into it. I guess python setup.py test also fails for you?


EricGz commented Jan 4, 2020

Thank you for looking into it! You are right. The test failed.


rusty1s commented Jan 4, 2020

Do all GPU tests fail?


EricGz commented Jan 4, 2020

Yes, I think so. 55 failed and 56 passed. All the failed ones are GPU tests.


rusty1s commented Jan 4, 2020

Ok, so this is not a problem with the random walk function but with the installation of torch-cluster. Can you post the log of

rm -rf build && python setup.py install


EricGz commented Jan 4, 2020

Here's the log. log.txt


EricGz commented Jan 6, 2020

It seems the whole installation went fine. However, I still get wrong results when running random walk on the GPU. Do you have any idea what went wrong?


rusty1s commented Jan 6, 2020

Unfortunately no :( The logs look okay to me. Maybe you have multiple versions installed, where one installation failed? You can try removing torch-cluster repeatedly and installing it again.


EricGz commented Jan 14, 2020

Hi @rusty1s, thanks for your timely reply. I tried your suggestion, but the problem is still unsolved.

I ran some other tests and got more information about this error. When I used the GPU versions of scatter_max and scatter_min from the torch_scatter package, I hit the same error, and the interesting thing is that the GPU versions of scatter_add and scatter_mean worked fine.

Maybe scatter_max and random_walk have something in common that causes the error?

P.S. Here are the test results for scatter_max and scatter_add:

import torch
from torch_scatter import *

# device = 'cpu'
device = 'cuda:1'

src = torch.tensor([[1., 1.], [1., 1.], [4., 2.], [2., 4.]]).to(device)
index = torch.tensor([0, 0, 1, 1]).to(device)
index = index.view(-1, 1).repeat(1, src.size(1))

res1, _ = scatter_max(src, index, dim=0, fill_value=1.)
res2 = scatter_add(src, index, dim=0, dim_size=2, fill_value=0.)

print(res1)
print(res2)

The results are

tensor([[1., 1.],
        [1., 1.]], device='cuda:1')
tensor([[2., 2.],
        [6., 6.]], device='cuda:1')

I tried to debug it, and I found that line 13, func(src, index, out, arg, dim), of max.py did not change the variable out at all. Do you have any clue about what caused the problem?
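For what it's worth, the scatter results above can be cross-checked without any torch_scatter kernels at all, using only built-in PyTorch ops. A minimal sketch of that idea (index_add_ covers the add case, a plain loop covers the max case):

```python
import torch

# Reference results for the scatter test above, computed with built-in
# PyTorch ops only, so no torch_scatter CUDA kernels are involved.
src = torch.tensor([[1., 1.], [1., 1.], [4., 2.], [2., 4.]])
index = torch.tensor([0, 0, 1, 1])

# scatter_add reference: index_add_ accumulates rows of src into out[index].
ref_add = torch.zeros(2, 2).index_add_(0, index, src)

# scatter_max reference: a plain loop taking the elementwise max per group.
ref_max = torch.full((2, 2), float('-inf'))
for i in range(src.size(0)):
    ref_max[index[i]] = torch.max(ref_max[index[i]], src[i])

print(ref_add)  # tensor([[2., 2.], [6., 6.]])
print(ref_max)  # tensor([[1., 1.], [4., 4.]])
```

The scatter_add output above matches this reference, while the scatter_max output does not, which fits the observation that only the max/min kernels misbehave.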


rusty1s commented Jan 14, 2020

Yeah, those are the functions that call our own kernel implementations. It seems that there is something wrong with your GPU setup in conjunction with the provided CUDA code.
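One plausible culprit when extension kernels launch but leave their outputs untouched is a mismatch between the CUDA toolkit used to build the extension and the one PyTorch was built with, or a GPU compute capability the kernels were not compiled for. A small diagnostic sketch that just prints the relevant versions (it makes no fix, only surfaces the environment):

```python
import torch

# Print the CUDA version PyTorch was built with and each GPU's compute
# capability; a kernel compiled without a matching architecture can appear
# to run yet write nothing, matching the unchanged `out` tensor seen above.
print('PyTorch version:', torch.__version__)
print('Built with CUDA:', torch.version.cuda)
print('CUDA available: ', torch.cuda.is_available())
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        print(f'cuda:{i} {name} (compute capability {major}.{minor})')
```

Comparing `torch.version.cuda` against the `nvcc --version` used to build torch-cluster would be the next step.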

@github-actions

This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?
