GNN training on MAG240M hangs---slow loading of np.memmap #131

Closed
chenxuhao opened this issue Mar 25, 2021 · 5 comments

Comments

@chenxuhao

Hello,
I got this error when trying to run the example in /ogb/examples/lsc/mag240m:

$ python gnn.py --device=0 --model=graphsage
Namespace(batch_size=1024, device='0', dropout=0.5, epochs=100, evaluate=False, hidden_channels=1024, model='graphsage', sizes=[25, 15])
Global seed set to 42
#Params 4884633
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Reading dataset... Done! [23.71s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type       | Params
-------------------------------------
0 | convs | ModuleList | 3.7 M
1 | norms | ModuleList | 4.1 K
2 | skips | ModuleList | 0
3 | mlp   | Sequential | 1.2 M
4 | acc   | Accuracy   | 0
-------------------------------------
4.9 M     Trainable params
0         Non-trainable params
4.9 M     Total params
19.539    Total estimated model params size (MB)
Traceback (most recent call last):
  File "gnn.py", line 231, in <module>
    trainer.fit(model, datamodule=datamodule)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 846, in run_sanity_check
    self.reset_val_dataloader(ref_model)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 364, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(model, 'val')
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 278, in _reset_eval_dataloader
    dataloaders = self.request_dataloader(getattr(model, loader_name))
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 398, in request_dataloader
    dataloader = dataloader_fx()
  File "gnn.py", line 100, in val_dataloader
    return NeighborSampler(self.adj_t, node_idx=self.val_idx,
  File "/h2/xchen/.local/lib/python3.8/site-packages/torch_geometric/data/sampler.py", line 139, in __init__
    super(NeighborSampler, self).__init__(
TypeError: __init__() got an unexpected keyword argument 'transform'

I have installed PyTorch 1.8.0 and pytorch_lightning 1.2.5, and installed PyG with: pip install git+https://github.com/rusty1s/pytorch_geometric.git

What am I missing here?

Thank you!

Xuhao Chen
http://people.csail.mit.edu/xchen/

@rusty1s
Collaborator

rusty1s commented Mar 25, 2021

My guess is that you have multiple PyG versions installed. Try to run:

pip uninstall torch-geometric
pip uninstall torch-geometric  # Until no further versions are found
pip install git+https://github.com/rusty1s/pytorch_geometric.git
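
One way to check the multiple-install guess before uninstalling is to look for more than one copy of the package on `sys.path`. The following is a minimal sketch, assuming the package directory is named `torch_geometric`; `find_package_copies` is a hypothetical helper written for illustration, not part of PyG or pip, and it will not detect editable or egg-link installs.

```python
import os
import sys


def find_package_copies(package_dir_name, search_paths=None):
    """Return every directory on the search path that holds the named package.

    Hypothetical helper: a directory counts as a copy when it contains an
    __init__.py. Defaults to scanning sys.path in import order.
    """
    paths = sys.path if search_paths is None else search_paths
    copies = []
    seen = set()
    for entry in paths:
        # An empty sys.path entry means the current working directory.
        base = entry if entry else os.getcwd()
        candidate = os.path.realpath(os.path.join(base, package_dir_name))
        if candidate in seen:
            continue
        seen.add(candidate)
        if os.path.isfile(os.path.join(candidate, "__init__.py")):
            copies.append(candidate)
    return copies


# The first entry is the copy Python would import; any others are shadowed.
for path in find_package_copies("torch_geometric"):
    print(path)
```

If more than one path is printed, an older copy may be shadowing the freshly installed one, which would be consistent with the unexpected `transform` keyword error above.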

@chenxuhao
Author

Thank you! This works for me.

Now it hangs like this:

$ python gnn.py --epochs 1 --hidden_channels 16
Namespace(batch_size=1024, device='0', dropout=0.5, epochs=1, evaluate=False, hidden_channels=16, model='gat', sizes=[25, 15])
Global seed set to 42
#Params 28185
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Reading dataset... Done! [185.68s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Set SLURM handle signals.

  | Name  | Type       | Params
-------------------------------------
0 | convs | ModuleList | 12.6 K
1 | norms | ModuleList | 64
2 | skips | ModuleList | 12.6 K
3 | mlp   | Sequential | 2.9 K
4 | acc   | Accuracy   | 0
-------------------------------------
28.2 K    Trainable params
0         Non-trainable params
28.2 K    Total params
0.113     Total estimated model params size (MB)
/jet/home/xhchen/anaconda3/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 40 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check:   0%|                                                                                                                                                                                                | 0/2 [00:00<?, ?it/s]

I know it is supposed to be slow, but how long is it supposed to take at this stage?

Thanks,

Xuhao

@rusty1s
Collaborator

rusty1s commented Mar 27, 2021

Can you try to replace this line with self.x = self.all_paper_feat to see if that fixes this issue?

@weihua916
Contributor

weihua916 commented Mar 27, 2021

You can also try running the following to see if numpy's memmap mode is fast enough in your environment. We have found that it is sometimes slow.

import time
import torch
from ogb.lsc import MAG240MDataset

# ROOT_DIR must be set to the directory that holds the MAG240M dataset.
dataset = MAG240MDataset(ROOT_DIR)
x = dataset.paper_feat  # paper features, opened as a np.memmap

# Two independent sets of 200 random paper indices.
idx1 = torch.randint(0, x.shape[0], (200, )).long().numpy()
idx2 = torch.randint(0, x.shape[0], (200, )).long().numpy()

# Time how long each random-row read from the memmap takes.
t = time.perf_counter()
x[idx1]
print(time.perf_counter() - t)

t = time.perf_counter()
x[idx2]
print(time.perf_counter() - t)
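
The same kind of check can be tried without the dataset by using a small synthetic file. This is a minimal sketch that assumes only NumPy; the array sizes, the file name, and the helper `time_random_rows` are made up for illustration. A file this small will usually sit in the OS page cache, so it shows the measurement pattern rather than reproducing the slowness of the full feature file on slow storage.

```python
import os
import tempfile
import time

import numpy as np


def time_random_rows(array, num_rows=200, seed=0):
    """Time one read of `num_rows` random rows; return (seconds, rows)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, array.shape[0], size=num_rows)
    start = time.perf_counter()
    # Fancy indexing copies the selected rows into a new in-memory array.
    rows = np.asarray(array[idx])
    return time.perf_counter() - start, rows


# Illustrative sizes only; the real feature matrix is far larger.
num_nodes, num_feats = 10_000, 128
path = os.path.join(tempfile.mkdtemp(), "feat.bin")

# Write a synthetic feature matrix to disk through a writable memmap.
writer = np.memmap(path, dtype=np.float16, mode="w+", shape=(num_nodes, num_feats))
writer[:] = 1.0
writer.flush()
del writer

# Reopen read-only as a memmap, and also load a full in-memory copy.
x_disk = np.memmap(path, dtype=np.float16, mode="r", shape=(num_nodes, num_feats))
x_ram = np.array(x_disk)

t_disk, rows_disk = time_random_rows(x_disk)
t_ram, rows_ram = time_random_rows(x_ram)
print(f"memmap: {t_disk:.6f}s  in-memory: {t_ram:.6f}s")
```

If the memmap read is much slower than the in-memory read on the real data, random access to the storage backing the file is a likely bottleneck.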

@chenxuhao
Author

Got it! It runs now! Thanks!

@weihua916 changed the title from "runtime error when running ogb/examples/lsc/mag240m/gnn.py" to "GNN training on MAG240M hangs---slow loading of np.memmap" on Apr 13, 2021
3 participants