Creating a custom dataset #6

I want to train a model on a custom dataset. I'm trying to follow the example at https://github.com/torchmd/torchmd-cg/blob/master/tutorial/Chignolin_Coarse-Grained_Tutorial.ipynb, but my data is different enough that it isn't quite clear how I should format it.

My datasets consist of many molecules of different sizes. For each molecule I have the atom types and positions, along with energies and, in some cases, forces.

This differs from the tutorial in a few critical ways. My molecules are all different sizes, so I can't just put everything into rectangular arrays. And the training data is different: sometimes I will have only energies, and sometimes I will have both forces and energies, which should be trained on together. The example trains only on forces, with no energies.

Any guidance would be appreciated!
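Concretely, the per-molecule data described above might be held in arrays like the following; the names, shapes, and placeholder values are illustrative only, not a format the project requires:

    import numpy as np

    # One molecule with N = 3 atoms (a water-like placeholder):
    types = np.array([8, 1, 1])      # atomic numbers, shape (N,)
    pos = np.random.randn(3, 3)      # coordinates, shape (N, 3)
    energy = -76.4                   # scalar energy (placeholder value)
    forces = np.random.randn(3, 3)   # shape (N, 3); only available sometimes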
Hi, I just finished creating a custom dataset class (see here), which is on a separate branch. Then you can use the corresponding command line arguments to load your data files with it. I hope this helps already!
Excellent! That looks like exactly what I need. To clarify, I should sort all the molecules based on number of atoms, then create a separate set of files for each possible number of atoms?
Actually, it looks like it wants a separate set of files for every individual molecule? That's going to be inconvenient, and probably really slow. For a dataset like QM9 it will require hundreds of thousands of files. Could we make it accept a 2D embeddings array of shape (samples, atoms), so many different molecules can all go in the same file as long as they have the same size?
Yes, for QM9 we don't use that format. For ANI we use an HDF5 file; check the ANI dataset implementation in the datasets module.

g
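As an illustration of that layout, one top-level HDF5 group per molecule size, here is a minimal h5py sketch; write_hdf5 and its molecules argument are hypothetical, and the dataset names types, pos, and energy simply anticipate the reader class shown later in this thread:

    import h5py
    import numpy as np

    # molecules: hypothetical list of (types, pos, energy) tuples, one per sample,
    # where types has shape (atoms,), pos has shape (atoms, 3), energy is a scalar.
    def write_hdf5(filename, molecules):
        by_size = {}
        for types, pos, energy in molecules:
            by_size.setdefault(len(types), []).append((types, pos, energy))
        with h5py.File(filename, 'w') as f:
            for n_atoms, samples in by_size.items():
                group = f.create_group(str(n_atoms))  # one group per molecule size
                group['types'] = np.array([s[0] for s in samples])   # (samples, atoms)
                group['pos'] = np.array([s[1] for s in samples])     # (samples, atoms, 3)
                group['energy'] = np.array([s[2] for s in samples])  # (samples,)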
So it sounds like I should write my own class that subclasses Dataset, storing the data in a way that's efficient? My class should implement get() and len()?
That is the PyTorch way. Otherwise, create a list of Data objects and use the PyTorch Geometric collate function to prepare the data, etc., as in https://github.com/compsciencelab/torchmd-net/blob/main/torchmdnet/datasets/ani1.py
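To illustrate that second route, a minimal sketch; the molecules and energy values are placeholders:

    import torch
    from torch_geometric.data import Data, Batch

    # One Data object per molecule; the molecules may have different sizes.
    data_list = [
        Data(z=torch.tensor([6, 1, 1, 1, 1]),   # atom types (methane-like placeholder)
             pos=torch.randn(5, 3),             # coordinates, shape (atoms, 3)
             y=torch.tensor([[-17.8]])),        # energy label, shape (1, 1)
        Data(z=torch.tensor([8, 1, 1]),         # water-like placeholder
             pos=torch.randn(3, 3),
             y=torch.tensor([[-76.4]])),
    ]

    # The collate step concatenates all atoms and records which molecule
    # each atom came from in batch.batch.
    batch = Batch.from_data_list(data_list)
    print(batch.z.shape)  # torch.Size([8])
    print(batch.batch)    # tensor([0, 0, 0, 0, 0, 1, 1, 1])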
The changes are now also on the main branch.
I put my dataset into an HDF5 file. It contains a top-level group for each number of atoms:

    import torch
    from torch_geometric.data import Dataset, Data
    import h5py

    class HDF5(Dataset):
        def __init__(self, filename, label):
            # note: label is currently unused
            super(HDF5, self).__init__()
            self.file = h5py.File(filename, 'r')
            # Build a flat index with one (types, pos, energy, i) entry per
            # sample, spanning every fixed-size group in the file.
            self.index = []
            for group_name in self.file:
                group = self.file[group_name]
                types = group['types']
                pos = group['pos']
                energy = group['energy']
                for i in range(len(energy)):
                    self.index.append((types, pos, energy, i))

        def get(self, idx):
            # Look up the sample and wrap it in a PyTorch Geometric Data object.
            types, pos, energy, i = self.index[idx]
            return Data(pos=torch.from_numpy(pos[i]),
                        z=torch.from_numpy(types[i]).to(torch.long),
                        y=torch.tensor(energy[i]))

        def len(self):
            return len(self.index)

When I train using this dataset class, it runs but shows no sign of learning: the loss stays very high and doesn't decrease. However, I found that if I set the batch size to 1 then it does learn, though of course it runs very slowly. I suspect the problem is therefore related to grouping multiple samples (which may have different numbers of atoms) into a single batch? Something that might or might not be relevant: while running, I get a few warning messages, which I assume are coming from inside the loss function.
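For what it's worth, PyTorch Geometric's loader is designed to batch graphs of different sizes: it concatenates the atoms of all molecules and keeps a per-atom molecule index, so mixed sizes by themselves should be fine. A quick way to inspect a batch built from the class above (the file name is a placeholder):

    from torch_geometric.data import DataLoader

    dataset = HDF5('dataset.h5', None)   # placeholder path
    loader = DataLoader(dataset, batch_size=4)
    batch = next(iter(loader))
    print(batch.num_graphs)  # 4 molecules, possibly of different sizes
    print(batch.batch)       # per-atom index of the molecule it belongs to
    print(batch.y.shape)     # this is what must line up with the model output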
I think I may have it working? The above warning made me think the problem might be related to the shape of y, so I changed it to y=torch.tensor([[energy[i]]]). That made the warnings go away, and with that change I could use a larger batch size and it even seemed to be learning... for a little while. About 77% of the way through the first epoch it would give a warning and then immediately crash with an exception.
Apparently, even though the batch size was 128, the model was producing an output of size 127? I suspected something to do with how it merges molecules of different sizes into a single batch. After a little experimenting, I figured that it must be using atom type 0 to indicate a missing atom somewhere. I haven't found exactly where it does that, but if I renumber my atom types to start from 1 rather than 0, the error goes away. One clue to this was the presence of the atom_filter option.
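If the renumbering is needed, it can be applied once where the samples are built; a one-line sketch against the get() method above, assuming the stored types are 0-based indices:

    # Shift 0-based atom type indices up by one so that type 0 is never used.
    z = torch.from_numpy(types[i]).to(torch.long) + 1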
Can you share a small subset of your dataset in the same format as you are using it?
The model outputs predictions with an extra dimension of size 1, which is why changing the shape of y helped. It might make sense to handle that reshape on our side so that dataset classes don't have to. However, the other warnings you are getting are hinting at a discrepancy in the batch size, which also occurred while you were using a batch size of 1.
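To make the shapes concrete, a small sketch of the two label layouts under discussion; the energy value is a placeholder:

    import torch

    energy_i = -17.8                  # placeholder energy value
    y0 = torch.tensor(energy_i)       # shape (): a 0-dimensional scalar
    y1 = torch.tensor([[energy_i]])   # shape (1, 1)
    # Collating 128 y1-style labels stacks them to shape (128, 1), which
    # lines up with a model output of shape (128, 1).
    print(y0.shape, y1.shape)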
Could you explain how atom_filter works, and what its intended use is? After making the changes I described above, it was able to run for 10 epochs, then crashed with the following exception:
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 644, in run_train
self.train_loop.run_training_epoch()
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 564, in run_training_epoch
self.trainer.run_evaluation(on_epoch=True)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in run_evaluation
self.evaluation_loop.on_evaluation_end()
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 100, in on_evaluation_end
self.trainer.call_hook('on_validation_end', *args, **kwargs)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1101, in call_hook
trainer_hook(*args, **kwargs)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 183, in on_validation_end
callback.on_validation_end(self, self.lightning_module)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
self.save_checkpoint(trainer, pl_module)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 259, in save_checkpoint
self._save_top_k_checkpoints(trainer, pl_module, monitor_candidates)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 563, in _save_top_k_checkpoints
current = trainer.training_type_plugin.reduce(current, reduce_op="mean")
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 283, in reduce
output = sync_ddp_if_available(output, group, reduce_op)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 128, in sync_ddp_if_available
return sync_ddp(result, group=group, reduce_op=reduce_op)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 161, in sync_ddp
torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1171, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/peastman/workspace/torchmd-net/scripts/train.py", line 148, in <module>
main()
File "/home/peastman/workspace/torchmd-net/scripts/train.py", line 141, in main
trainer.fit(model)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 513, in fit
self.dispatch()
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in dispatch
self.accelerator.start_training(self)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 111, in start_training
self._results = trainer.run_train()
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 676, in run_train
self.train_loop.on_train_end()
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 134, in on_train_end
self.check_checkpoint_callback(should_update=True, is_last=True)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 164, in check_checkpoint_callback
cb.on_validation_end(self.trainer, model)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
self.save_checkpoint(trainer, pl_module)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 259, in save_checkpoint
self._save_top_k_checkpoints(trainer, pl_module, monitor_candidates)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 563, in _save_top_k_checkpoints
current = trainer.training_type_plugin.reduce(current, reduce_op="mean")
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 283, in reduce
output = sync_ddp_if_available(output, group, reduce_op)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 128, in sync_ddp_if_available
return sync_ddp(result, group=group, reduce_op=reduce_op)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 161, in sync_ddp
torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1171, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
I assume this is related to the option save_interval: 10. I'm attempting to debug now. Any insight would be appreciated!

You should not care at all about atom_filter. It is needed for something else and it should have no effect on what you are doing.
I found a lot of people encountering this same error: Lightning-AI/pytorch-lightning#2529. The fixes suggested in those threads didn't help. However, I was able to work around the problem by removing two lines.
Philipp, I think this must be related to the issues you have with needing to move tensors to the device manually. We might be doing something wrong.
I have to move some tensors to the CPU manually because the workaround for evaluating the test set during training fails otherwise; if we don't test during fit, the manual move isn't needed. I'm not able to reproduce the problem, though. @peastman, can you provide some more information on how you are training? I.e. how many GPUs, single- or multi-node, and the pytorch-lightning and pytorch versions. I also changed the way we are evaluating the test set for PyTorch Lightning version 1.3.x, so maybe the problem is already fixed by that. However, as you already said, it looks like it has something to do with the checkpoints, so I guess it's unlikely that it works now.
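A quick way to collect those details, for reference:

    import torch
    import pytorch_lightning

    print('torch:', torch.__version__)
    print('pytorch-lightning:', pytorch_lightning.__version__)
    print('visible GPUs:', torch.cuda.device_count())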
I tried removing that call to move tensors to the CPU manually.
@peastman the code should now work with PyTorch Lightning 1.3.0 as well. The test set evaluation is also much cleaner now, and the manual device moves are no longer needed.
Thanks!