Heterogeneous graph: NeighborLoader with num_workers>0 gets stuck after many epochs #5348

Open
PolarisRisingWar opened this issue Sep 3, 2022 · 2 comments
@PolarisRisingWar

🐛 Describe the bug

My code looks like this:

***The code for creating graph, GNN model***

from torch_geometric.loader import NeighborLoader

train_loader = NeighborLoader(
    train_data,
    num_neighbors=[2] * 2,   # sample 2 neighbors per hop, for 2 hops
    batch_size=train_batch_size,
    input_nodes='case',      # seed nodes are taken from the 'case' node type
    shuffle=True,
    num_workers=4,
)

test_loader = NeighborLoader(
    test_data,
    num_neighbors=[2] * 2,
    batch_size=train_batch_size,
    input_nodes='case',
    shuffle=True,
    num_workers=4,
)

***The code to train and test***

(I need the subgraphs sampled by test_loader to be random, so I set shuffle=True and use the n_id attribute to rearrange the predicted logits back into the original node order afterwards.)
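
For reference, a minimal sketch of what that n_id-based rearrangement can look like (the model call, num_classes, and the assumption that the first batch_size nodes of the 'case' type are the seed nodes are illustrative, not taken from the actual code):

import torch

# Minimal sketch (names like `model` and `num_classes`, and the output dict
# layout, are assumptions, not the exact code from the experiment).
all_logits = torch.zeros(test_data['case'].num_nodes, num_classes)

model.eval()
with torch.no_grad():
    for batch in test_loader:
        out = model(batch.x_dict, batch.edge_index_dict)  # hetero GNN forward
        bs = batch['case'].batch_size                     # number of seed nodes in this batch
        seed_ids = batch['case'].n_id[:bs]                # their global node indices
        all_logits[seed_ids] = out['case'][:bs]           # write logits back in original order
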
I used W&B to log the training losses and other metrics during training, and I found that training hangs: in two experiments the curves stopped updating after about 80 minutes and 6 hours respectively, and stayed frozen for roughly 2 hours. I can only conclude it is caused by num_workers, because after I removed the num_workers parameter the run finished the full 22-hour process successfully.
Honestly, it is hard for me to trace the bug back and reproduce it, so I can only report the problem as observed.

Environment

  • PyG version: 2.1.0.dev20220815
  • PyTorch version: 1.11.0
  • OS: Linux
  • Python version: 3.8.13
  • CUDA/cuDNN version: cuda10.2 cudnn7.6.5
  • How you installed PyTorch and PyG (conda, pip, source):
    PyTorch: conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=10.2 -c pytorch
    PyG:
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.11.0+cu102.html
pip install torch-sparse -f https://data.pyg.org/whl/torch-1.11.0+cu102.html
pip install pyg-nightly
  • Any other relevant information (e.g., version of torch-scatter):
    torch-scatter 2.0.9
    torch-sparse 0.6.14
@rusty1s
Member

rusty1s commented Sep 5, 2022

Thanks for reporting. Do you have any intuition about what might cause this? Is there a memory leak, i.e., are memory requirements increasing over epochs? Any guidance is appreciated!
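
If it helps with debugging, here is a minimal sketch (assuming psutil is installed; num_epochs and the training loop body are placeholders) that logs host memory after each epoch to check for such a leak:

import os
import psutil

proc = psutil.Process(os.getpid())

for epoch in range(num_epochs):
    for batch in train_loader:
        ...  # forward/backward pass

    # Log resident memory of the main process and of the DataLoader worker
    # processes after each epoch; steady growth would point to a leak.
    main_mb = proc.memory_info().rss / 1024**2
    worker_mb = sum(c.memory_info().rss for c in proc.children(recursive=True)) / 1024**2
    print(f'epoch {epoch}: main {main_mb:.0f} MiB, workers {worker_mb:.0f} MiB')
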

@LukeLIN-web
Contributor

My guess is that many workers accumulating variables may lead to running out of memory.
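
One way to test this, sketched below, is to make a stuck or dying worker fail loudly instead of blocking forever by passing a timeout through to the underlying torch.utils.data.DataLoader (I believe NeighborLoader forwards extra keyword arguments to it, but treat that and the 600-second value as assumptions):

from torch_geometric.loader import NeighborLoader

# Sketch: with a positive timeout, the DataLoader raises an error if no batch
# arrives from the workers within the given number of seconds, instead of
# hanging silently, so the failure at least shows up in the W&B logs.
train_loader = NeighborLoader(
    train_data,
    num_neighbors=[2] * 2,
    batch_size=train_batch_size,
    input_nodes='case',
    shuffle=True,
    num_workers=4,
    timeout=600,  # seconds to wait for a batch before raising
)
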
