
Creating a custom dataset #6

Closed
peastman opened this issue Apr 21, 2021 · 20 comments

@peastman
Collaborator

I want to train a model on a custom dataset. I'm trying to follow the example at https://github.com/torchmd/torchmd-cg/blob/master/tutorial/Chignolin_Coarse-Grained_Tutorial.ipynb, but my data is different enough that it isn't quite clear how I should format it.

My datasets consist of many molecules of different sizes. For each molecule I have

  • an array of atom type indices
  • an array of atom coordinates
  • a potential energy
  • (optional) an array of forces on atoms

This differs from the tutorial in a few critical ways. My molecules are all different sizes, so I can't just put everything into rectangular arrays. And the training data is different: sometimes I will have only energies, and sometimes I will have both forces and energies, which should be trained on together. The example trains only on forces, with no energies.

Any guidance would be appreciated!

@PhilippThoelke
Collaborator

Hi, I just finished creating a custom dataset class (see here), which lives on a separate branch called cg_dataset for now. This code is not well tested yet and therefore not part of the main branch. To try it out, switch your local clone of this repository to the cg_dataset branch via

git checkout cg_dataset

Then you can use the command line argument --dataset custom to choose the new custom dataset, which also requires you to specify --coord-files and --embed-files as the input data, and --energy-files, --force-files, or both as the targets. To train only on energies, provide only --energy-files; to train only on forces, provide only --force-files. If you train on forces (or on both energies and forces), you also have to specify --derivative true so that the derivative of the model's predictions is computed.

The arguments --energy-weight and --force-weight set the weighting factors for energy predictions and force predictions in the loss function, respectively. Both default to 1.0.

The --xxx-files arguments all accept glob paths (see this for more information), with which you can input multiple files, allowing you to train on multiple molecules. Unfortunately there is no way yet to have some parts of the data train only on energies, some only on forces, and some on both, but I'll look into that in the future.
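For concreteness, a hypothetical invocation putting these flags together might look like the following (the .npy file names and globs are placeholders for however your own data files are named, not files shipped with the repository):

python scripts/train.py --dataset custom \
    --coord-files "data/*_coords.npy" \
    --embed-files "data/*_embeddings.npy" \
    --energy-files "data/*_energies.npy" \
    --force-files "data/*_forces.npy" \
    --derivative true \
    --energy-weight 1.0 --force-weight 1.0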

I hope this helps!

@peastman
Collaborator Author

Excellent! That looks like exactly what I need.

To clarify, I should sort all the molecules based on number of atoms, then create a separate set of files for each possible number of atoms?

@peastman
Collaborator Author

Actually, it looks like it wants a separate set of files for every individual molecule? That's going to be inconvenient, and probably really slow. For a dataset like QM9 it will require hundreds of thousands of files. Could we make it accept a 2D embeddings array of shape (samples, atoms), so many different molecules can all go in the same file as long as they have the same size?

@giadefa
Contributor

giadefa commented Apr 22, 2021 via email

@peastman
Collaborator Author

So it sounds like I should write my own class that subclasses Dataset, storing the data in a way that's efficient? My class should implement len() to return the total number of samples, and get() to return a single sample in the form of a Data object. When constructing the Data object for a molecule with N atoms, I should specify the following constructor arguments:

  • pos: an (N, 3) array of coordinates
  • z: a length-N array of atom type indices
  • y: the energy
  • dy: an (N, 3) array of forces (or should it be the gradient of the energy, that is, the negative force?)
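For reference, a minimal sketch of that interface, assuming the data is already held in memory as NumPy arrays (the coords/types/energies/forces lists are hypothetical placeholders; an HDF5-backed version appears later in this thread):

import torch
from torch_geometric.data import Dataset, Data

class InMemory(Dataset):
    def __init__(self, coords, types, energies, forces):
        # coords[i]: (N_i, 3), types[i]: (N_i,), energies[i]: scalar, forces[i]: (N_i, 3)
        super(InMemory, self).__init__()
        self.coords, self.types, self.energies, self.forces = coords, types, energies, forces

    def len(self):
        return len(self.energies)

    def get(self, idx):
        return Data(pos=torch.tensor(self.coords[idx], dtype=torch.float),
                    z=torch.tensor(self.types[idx], dtype=torch.long),
                    # y stored with shape (1, 1) so it matches the model's (batch_size, 1)
                    # output (see the shape discussion later in the thread)
                    y=torch.tensor([[self.energies[idx]]], dtype=torch.float),
                    dy=torch.tensor(self.forces[idx], dtype=torch.float))  # forces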

@giadefa
Contributor

giadefa commented Apr 22, 2021

That is the PyTorch way.

Otherwise, create a list of Data objects and use the PyTorch Geometric collate function to prepare the data, as in https://github.com/compsciencelab/torchmd-net/blob/main/torchmdnet/datasets/ani1.py
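A rough sketch of that second approach, loosely following the pattern of ani1.py (the molecules.npz file name and the load_molecules helper are hypothetical placeholders, not part of the repository):

import torch
from torch_geometric.data import InMemoryDataset, Data

class MyMolecules(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyMolecules, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['molecules.npz']  # hypothetical raw file in root/raw/

    @property
    def processed_file_names(self):
        return ['data.pt']

    def process(self):
        # Build one Data object per molecule, then collate them into a single
        # concatenated store plus slice indices, which is what InMemoryDataset expects.
        data_list = []
        for z, pos, energy in load_molecules(self.raw_paths[0]):  # hypothetical loader
            data_list.append(Data(z=torch.tensor(z, dtype=torch.long),
                                  pos=torch.tensor(pos, dtype=torch.float),
                                  y=torch.tensor([[energy]], dtype=torch.float)))
        torch.save(self.collate(data_list), self.processed_paths[0])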

@PhilippThoelke
Collaborator

or should it be the gradient of the energy, that is, the negative force?

dy should be the force. The model outputs the negative derivative of the energy.
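As a toy illustration of that sign convention (plain autograd, not the torchmd-net code):

import torch

pos = torch.randn(5, 3, requires_grad=True)    # positions of 5 atoms
energy = (pos ** 2).sum()                      # stand-in for a predicted energy
forces = -torch.autograd.grad(energy, pos)[0]  # force = negative gradient of the energy
# dy in the Data object should hold these forces, not the raw gradient.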

@PhilippThoelke
Collaborator

The changes are now also on the main branch.

@peastman
Collaborator Author

I put my dataset into an HDF5 file. It contains a top-level group for each number of atoms: samples2 is all samples with 2 atoms, samples3 is all samples with 3 atoms, etc. Inside each group are three datasets called pos (atom positions), types (atom type indices), and energy (potential energies). Here is my dataset class:

import torch
from torch_geometric.data import Dataset, Data
import h5py

class HDF5(Dataset):

    def __init__(self, filename, label):
        super(HDF5, self).__init__()
        self.file = h5py.File(filename, 'r')
        # Build a flat index of (types, pos, energy, row) tuples spanning all of the
        # samplesN groups, so every sample can be addressed by a single integer.
        self.index = []
        for group_name in self.file:
            group = self.file[group_name]
            types = group['types']
            pos = group['pos']
            energy = group['energy']
            for i in range(len(energy)):
                self.index.append((types, pos, energy, i))

    def get(self, idx):
        # Read one sample out of the HDF5 datasets and wrap it in a Data object.
        types, pos, energy, i = self.index[idx]
        return Data(pos=torch.from_numpy(pos[i]), z=torch.from_numpy(types[i]).to(torch.long), y=torch.tensor(energy[i]))

    def len(self):
        return len(self.index)

When I train using this dataset class, it runs but shows no sign of learning: the loss stays very high and doesn't decrease. However, I found that if I set the batch size to 1 then it does learn, but of course it runs very slowly. I therefore suspect the problem is related to grouping multiple samples (which may have different numbers of atoms) into a single batch?

Something that may or may not be relevant: while running, I get a few warning messages:

/home/peastman/workspace/torchmd-net/torchmdnet/module.py:114: UserWarning: Using a target size (torch.Size([1])) that is different to the input size (torch.Size([1, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  loss = loss + loss_fn(pred, batch.y) * self.hparams.energy_weight
/home/peastman/workspace/torchmd-net/torchmdnet/module.py:114: UserWarning: Using a target size (torch.Size([1])) that is different to the input size (torch.Size([0, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  loss = loss + loss_fn(pred, batch.y) * self.hparams.energy_weight

I assume this is coming from inside the loss function. The target size (torch.Size([1])) presumably refers to the energies in the dataset?

@peastman
Collaborator Author

I think I maybe have it working? The above warning made me think the problem might be related to the shape of y, so I forced it to be a 2D tensor:

y=torch.tensor([[energy[i]]])

That made the warnings go away. And with that change, I could use a larger batch size and it even seemed to be learning... for a little while. About 77% of the way through the first epoch it would give the warning

/home/peastman/workspace/torchmd-net/torchmdnet/module.py:114: UserWarning: Using a target size (torch.Size([128, 1])) that is different to the input size (torch.Size([127, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  loss = loss + loss_fn(pred, batch.y) * self.hparams.energy_weight

then immediately crash with the exception

RuntimeError: The size of tensor a (127) must match the size of tensor b (128) at non-singleton dimension 0

Apparently, even though the batch size was 128, the model was producing an output of size 127? I suspected it had something to do with how molecules of different sizes are merged into a single batch. After a little experimenting, I figured it must be using atom type 0 to indicate a missing atom somewhere. I haven't found exactly where it does that, but if I renumber my atom types to start from 1 rather than 0, the error goes away. One clue was the presence of the atom-filter option with the description, "Only sum over atoms with Z > atom_filter". Setting it to -1 doesn't fix the problem, though.

@giadefa
Contributor

giadefa commented Apr 26, 2021

Can you share a small subset of your dataset in the same format you are using?

@PhilippThoelke
Collaborator

The model outputs predictions with shape (batch_size, 1), so the labels (i.e. the energy in your dataset) should match that shape. For you it only works with a batch size of 1 because of PyTorch's broadcasting. If for example your batch size is 32, then the prediction has shape (32, 1) and your energy label has shape (32,). PyTorch will broadcast both tensors to shape (32, 32), which messes up the loss.
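A small standalone demonstration of that broadcasting (using a plain MSE loss for illustration):

import torch
import torch.nn.functional as F

pred = torch.randn(32, 1)   # model output: (batch_size, 1)
label = torch.randn(32)     # energies stored as a flat vector: (batch_size,)

print((pred - label).shape)                  # torch.Size([32, 32]): every prediction paired with every label
bad = F.mse_loss(pred, label)                # warns and averages over 32*32 pairs
good = F.mse_loss(pred, label.unsqueeze(1))  # shapes match: (32, 1) vs. (32, 1)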

It might make sense to call squeeze() on the prediction before passing it to the loss function instead of requiring the label to be 2D, what do you think @giadefa?

However, the other warnings you are getting hint at a discrepancy in the batch size, which also occurred while you were using a batch size of 1 (Using a target size (torch.Size([1])) that is different to the input size (torch.Size([0, 1]))). It seems like one sample got lost inside the model. I can reproduce this by setting the atom filter high enough that all atoms are filtered out for some samples. Could you check if you have some molecule where z < atom_filter for all atoms? I also changed the default atom filter value to -1 now, so atoms with atom type 0 are not accidentally filtered out. I have also added additional error handling, so you should get a better error message if this is actually the problem.

@peastman
Collaborator Author

Could you explain how atom_filter works, and what its intended use is?

After making the changes I described above, it was able to run for 10 epochs, then crashed with the following exception:

  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 644, in run_train
    self.train_loop.run_training_epoch()
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 564, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 100, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1101, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 183, in on_validation_end
    callback.on_validation_end(self, self.lightning_module)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 259, in save_checkpoint
    self._save_top_k_checkpoints(trainer, pl_module, monitor_candidates)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 563, in _save_top_k_checkpoints
    current = trainer.training_type_plugin.reduce(current, reduce_op="mean")
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 283, in reduce
    output = sync_ddp_if_available(output, group, reduce_op)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 128, in sync_ddp_if_available
    return sync_ddp(result, group=group, reduce_op=reduce_op)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 161, in sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1171, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/peastman/workspace/torchmd-net/scripts/train.py", line 148, in <module>
    main()
  File "/home/peastman/workspace/torchmd-net/scripts/train.py", line 141, in main
    trainer.fit(model)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 513, in fit
    self.dispatch()
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in dispatch
    self.accelerator.start_training(self)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 111, in start_training
    self._results = trainer.run_train()
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 676, in run_train
    self.train_loop.on_train_end()
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 134, in on_train_end
    self.check_checkpoint_callback(should_update=True, is_last=True)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 164, in check_checkpoint_callback
    cb.on_validation_end(self.trainer, model)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 212, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 259, in save_checkpoint
    self._save_top_k_checkpoints(trainer, pl_module, monitor_candidates)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 563, in _save_top_k_checkpoints
    current = trainer.training_type_plugin.reduce(current, reduce_op="mean")
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 283, in reduce
    output = sync_ddp_if_available(output, group, reduce_op)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 128, in sync_ddp_if_available
    return sync_ddp(result, group=group, reduce_op=reduce_op)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 161, in sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/home/peastman/miniconda3/envs/torchmd/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1171, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

I assume this is related to the option save_interval: 10. I'm attempting to debug now. Any insight would be appreciated!

@giadefa
Contributor

giadefa commented Apr 26, 2021 via email

@peastman
Collaborator Author

I found a lot of people encountering this same error:

Lightning-AI/pytorch-lightning#2529
Lightning-AI/pytorch-lightning#5641

The fixes suggested in those threads didn't help. However, I was able to work around the problem by removing these two lines:

https://github.com/compsciencelab/torchmd-net/blob/06c105a1fa871baa81f894e092248a22a0db71f1/scripts/train.py#L127

and

https://github.com/compsciencelab/torchmd-net/blob/06c105a1fa871baa81f894e092248a22a0db71f1/scripts/train.py#L136

@giadefa
Contributor

giadefa commented Apr 26, 2021

Philipp, I think this must be related to the issue where you need to move tensors to the device manually. We might be doing something wrong.

@PhilippThoelke
Collaborator

PhilippThoelke commented Apr 29, 2021

I have to move some tensors to the CPU manually because the workaround for evaluating the test set during training fails otherwise. If we don't test during fit, then the .cpu() call is unnecessary here:
https://github.com/compsciencelab/torchmd-net/blob/9bef6b8bb9bd11b5d6ea97b9138bbb030165847d/torchmdnet/module.py#L119

I'm not able to reproduce the problem, though. @peastman, can you provide some more information on how you are training? E.g. how many GPUs, single- or multi-node, and your PyTorch Lightning and PyTorch versions. I also changed the way we evaluate the test set for PyTorch Lightning 1.3.x, so maybe the problem is already fixed by that. However, as you said, it looks like it has something to do with the checkpoints, so I guess it's unlikely that it works now.

@peastman
Collaborator Author

I tried removing that call to .cpu(), but it didn't help. Here are the details of my configuration:

  • PyTorch 1.8.1+cu111
  • PyTorch Lightning 1.2.0 (I downgraded, since this code seemed to have multiple incompatibilities with 1.3.)
  • A single GPU running on a single node

@PhilippThoelke
Collaborator

@peastman the code should now work with PyTorch Lightning 1.3.0 as well. The test set evaluation is also much cleaner now, and the manual .cpu() calls and so on are no longer required. Please let me know if you still experience the problems you were having with the most recent version of the torchmd-net main branch.

@peastman
Collaborator Author

Thanks!
