GNN training on MAG240M hangs---slow loading of np.memmap #131

chenxuhao · 2021-03-25T15:12:07Z

Hello,
I got this error when trying to run the example in /ogb/examples/lsc/mag240m:

$ python gnn.py --device=0 --model=graphsage
Namespace(batch_size=1024, device='0', dropout=0.5, epochs=100, evaluate=False, hidden_channels=1024, model='graphsage', sizes=[25, 15])
Global seed set to 42
#Params 4884633
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Reading dataset... Done! [23.71s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type       | Params
-------------------------------------
0 | convs | ModuleList | 3.7 M
1 | norms | ModuleList | 4.1 K
2 | skips | ModuleList | 0
3 | mlp   | Sequential | 1.2 M
4 | acc   | Accuracy   | 0
-------------------------------------
4.9 M     Trainable params
0         Non-trainable params
4.9 M     Total params
19.539    Total estimated model params size (MB)
Traceback (most recent call last):
  File "gnn.py", line 231, in <module>
    trainer.fit(model, datamodule=datamodule)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 846, in run_sanity_check
    self.reset_val_dataloader(ref_model)
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 364, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(model, 'val')
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 278, in _reset_eval_dataloader
    dataloaders = self.request_dataloader(getattr(model, loader_name))
  File "/h2/xchen/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 398, in request_dataloader
    dataloader = dataloader_fx()
  File "gnn.py", line 100, in val_dataloader
    return NeighborSampler(self.adj_t, node_idx=self.val_idx,
  File "/h2/xchen/.local/lib/python3.8/site-packages/torch_geometric/data/sampler.py", line 139, in __init__
    super(NeighborSampler, self).__init__(
TypeError: __init__() got an unexpected keyword argument 'transform'

I have installed PyTorch 1.8.0 and pytorch_lightning 1.2.5, and installed PyG with: pip install git+https://github.com/rusty1s/pytorch_geometric.git

What am I missing here?

Thank you!

Xuhao Chen
http://people.csail.mit.edu/xchen/


rusty1s · 2021-03-25T16:42:17Z

My guess is that you have multiple PyG versions installed. Try to run:

pip uninstall torch-geometric
pip uninstall torch-geometric  # Until no further versions are found
pip install git+https://github.com/rusty1s/pytorch_geometric.git
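
To check which PyG installation Python actually imports, and whether its NeighborSampler declares the `transform` argument that gnn.py passes, something like the following sketch may help. The helper names `has_explicit_param` and `report_neighbor_sampler` are made up for illustration; the import path `torch_geometric.data.NeighborSampler` is taken from the traceback above and may differ in other PyG versions.

```python
import inspect


def has_explicit_param(cls, name):
    """Return True if cls.__init__ declares `name` as a named parameter.

    A catch-all **kwargs does not count: a sampler that only forwards
    unknown keywords to its parent class can still fail there.
    """
    return name in inspect.signature(cls.__init__).parameters


def report_neighbor_sampler():
    """Print where NeighborSampler is loaded from and whether it takes `transform`.

    Hypothetical helper; requires torch_geometric to be importable.
    """
    import torch_geometric
    from torch_geometric.data import NeighborSampler

    print("version:", torch_geometric.__version__)
    print("loaded from:", inspect.getfile(NeighborSampler))
    print("declares transform:", has_explicit_param(NeighborSampler, "transform"))
```

Calling `report_neighbor_sampler()` in the same environment that runs gnn.py should show a path under the expected site-packages directory. If it prints `declares transform: False`, an older copy of PyG is probably still being picked up.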

chenxuhao · 2021-03-27T00:15:52Z

Thank you! This works for me.

Now it hangs like this:

$ python gnn.py --epochs 1 --hidden_channels 16
Namespace(batch_size=1024, device='0', dropout=0.5, epochs=1, evaluate=False, hidden_channels=16, model='gat', sizes=[25, 15])
Global seed set to 42
#Params 28185
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Reading dataset... Done! [185.68s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Set SLURM handle signals.

  | Name  | Type       | Params
-------------------------------------
0 | convs | ModuleList | 12.6 K
1 | norms | ModuleList | 64
2 | skips | ModuleList | 12.6 K
3 | mlp   | Sequential | 2.9 K
4 | acc   | Accuracy   | 0
-------------------------------------
28.2 K    Trainable params
0         Non-trainable params
28.2 K    Total params
0.113     Total estimated model params size (MB)
/jet/home/xhchen/anaconda3/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 40 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check:   0%|                                                                                                                                                                                                | 0/2 [00:00<?, ?it/s]

I know it is supposed to be slow, but how long is this stage supposed to take?

Thanks,

Xuhao

rusty1s · 2021-03-27T01:04:05Z

Can you try to replace this line with self.x = self.all_paper_feat to see if that fixes this issue?

weihua916 · 2021-03-27T03:30:30Z

You can also try running the following to see whether numpy's memmap mode is fast enough in your environment. We have found that it is sometimes slow.

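import time
import torch
from ogb.lsc import MAG240MDataset

ROOT_DIR = 'dataset'  # placeholder path; point this at your MAG240M root
dataset = MAG240MDataset(ROOT_DIR)
x = dataset.paper_feat  # paper features, read lazily through np.memmap
num_papers = x.shape[0]

# Two independent batches of 200 random row indices.
idx1 = torch.randint(0, num_papers, (200, )).numpy()
idx2 = torch.randint(0, num_papers, (200, )).numpy()

# Time each random-row gather; slow reads here point to slow memmap access.
t = time.perf_counter()
x[idx1]
print(time.perf_counter() - t)

t = time.perf_counter()
x[idx2]
print(time.perf_counter() - t)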
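
The benchmark above needs the full dataset on disk. As a rough stand-in, the same measurement can be sketched on a small synthetic file-backed array. Everything here is an assumption for illustration: the helper names `make_memmap` and `time_random_rows`, the file name, and the array shape (10,000 rows of 768 float16 values). A file this small will likely sit in the OS page cache, so it only demonstrates the method and will not reproduce the slowness seen on MAG240M.

```python
import os
import tempfile
import time

import numpy as np


def make_memmap(path, shape, dtype=np.float16):
    """Create a zero-filled file at `path` and reopen it read-only as np.memmap."""
    x = np.memmap(path, dtype=dtype, mode="w+", shape=shape)
    x.flush()
    return np.memmap(path, dtype=dtype, mode="r", shape=shape)


def time_random_rows(x, num_rows=200, seed=0):
    """Gather `num_rows` random rows from `x`; return (seconds, rows)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, x.shape[0], size=num_rows)
    start = time.perf_counter()
    rows = np.asarray(x[idx])  # fancy indexing copies the rows out of the file
    return time.perf_counter() - start, rows


with tempfile.TemporaryDirectory() as tmp:
    x = make_memmap(os.path.join(tmp, "feat.bin"), shape=(10_000, 768))
    first, rows = time_random_rows(x, seed=0)
    second, _ = time_random_rows(x, seed=1)
    print(f"first gather:  {first:.4f}s")
    print(f"second gather: {second:.4f}s")
    del x  # release the mapping before the temporary directory is removed
```

If gathers on the real feature file take seconds rather than milliseconds, storage speed is a likely bottleneck, and holding the features in RAM (where memory allows) avoids the per-batch disk reads.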

chenxuhao · 2021-04-09T21:25:03Z

Got it! It runs now! Thanks!

weihua916 mentioned this issue Apr 7, 2021

Running time difference in "DGL Baseline Code for MAG240M" dmlc/dgl#2823

Closed

chenxuhao closed this as completed Apr 9, 2021

weihua916 mentioned this issue Apr 13, 2021

GPU utilization rate #153

Closed

weihua916 changed the title ~~runtime error when running ogb/examples/lsc/mag240m/gnn.py~~ GNN training on MAG240M hangs---slow loading of np.memmap Apr 13, 2021

weihua916 mentioned this issue Apr 14, 2021

gnn training stuck #155

Closed

jwkirchenbauer mentioned this issue Apr 28, 2021

Hyperparameters and runtimes of performance table for MAG240M #177

Closed
