Unable to fit model #82

peastman · 2022-05-05T19:53:43Z

I've been trying to train a model on an early subset of the SPICE dataset. All my efforts so far have been unsuccessful. I must be doing something wrong, but I really don't know what. I'm hoping someone else can spot the problem. My configuration file is given below. Here's the HDF5 file for the dataset.

I've tried training with or without derivatives. I've tried a range of initial learning rates, with or without warmup. I've tried varying model parameters. I've tried restricting it to only molecules with no formal charges. Nothing makes any difference. In all cases, the loss starts out at about 2e7 and never decreases.

The dataset consists of all SPICE calculations that had been completed when I started working on this a couple of weeks ago. I converted the units so positions are in Angstroms and energies in kJ/mol. I also subtracted off per-atom energies. Atom types are the union of element and formal charge. Here's the mapping:

typeDict = {('Br', -1): 0, ('Br', 0): 1, ('C', -1): 2, ('C', 0): 3, ('C', 1): 4, ('Ca', 2): 5, ('Cl', -1): 6,
            ('Cl', 0): 7, ('F', -1): 8, ('F', 0): 9, ('H', 0): 10, ('I', -1): 11, ('I', 0): 12, ('K', 1): 13,
            ('Li', 1): 14, ('Mg', 2): 15, ('N', -1): 16, ('N', 0): 17, ('N', 1): 18, ('Na', 1): 19, ('O', -1): 20,
            ('O', 0): 21, ('O', 1): 22, ('P', 0): 23, ('P', 1): 24, ('S', -1): 25, ('S', 0): 26, ('S', 1): 27}

If anyone can provide insight, I'll be very grateful!

activation: silu
atom_filter: -1
batch_size: 128
cutoff_lower: 0.0
cutoff_upper: 8.0
dataset: HDF5
dataset_root: SPICE-corrected.hdf5
derivative: false
distributed_backend: ddp
early_stopping_patience: 40
embedding_dimension: 64
energy_weight: 1.0
force_weight: 0.001
inference_batch_size: 128
lr: 1.e-4
lr_factor: 0.8
lr_min: 1.e-7
lr_patience: 10
lr_warmup_steps: 5000
max_num_neighbors: 80
max_z: 28
model: equivariant-transformer
neighbor_embedding: true
ngpus: -1
num_epochs: 1000
num_heads: 8
num_layers: 5
num_nodes: 1
num_rbf: 64
num_workers: 4
rbf_type: expnorm
save_interval: 5
seed: 1
test_interval: 10
test_size: 0.01
trainable_rbf: true
val_size: 0.05
weight_decay: 0.0

The text was updated successfully, but these errors were encountered:

giadefa · 2022-05-05T20:18:54Z

Have you computed atomrefs from QM at the level of theory that you are using for the data?

…

On Thu, May 5, 2022, 21:53 Peter Eastman ***@***.***> wrote: I've been trying to train a model on an early subset of the SPICE dataset. All my efforts so far have been unsuccessful. I must be doing something wrong, but I really don't know what. I'm hoping someone else can spot the problem. My configuration file is given below. Here's the HDF5 file for the dataset <https://www.dropbox.com/s/brxao096sl5poxz/SPICE-corrected.hdf5.gz?dl=0>. I've tried training with or without derivatives. I've tried a range of initial learning rates, with or without warmup. I've tried varying model parameters. I've tried restricting it to only molecules with no formal charges. Nothing makes any difference. In all cases, the loss starts out at about 2e7 and never decreases. The dataset consists of all SPICE calculations that had been completed when I started working on this a couple of weeks ago. I converted the units so positions are in Angstroms and energies in kJ/mol. I also subtracted off per-atom energies. Atom types are the union of element and formal charge. Here's the mapping: typeDict = {('Br', -1): 0, ('Br', 0): 1, ('C', -1): 2, ('C', 0): 3, ('C', 1): 4, ('Ca', 2): 5, ('Cl', -1): 6, ('Cl', 0): 7, ('F', -1): 8, ('F', 0): 9, ('H', 0): 10, ('I', -1): 11, ('I', 0): 12, ('K', 1): 13, ('Li', 1): 14, ('Mg', 2): 15, ('N', -1): 16, ('N', 0): 17, ('N', 1): 18, ('Na', 1): 19, ('O', -1): 20, ('O', 0): 21, ('O', 1): 22, ('P', 0): 23, ('P', 1): 24, ('S', -1): 25, ('S', 0): 26, ('S', 1): 27} If anyone can provide insight, I'll be very grateful! activation: silu atom_filter: -1 batch_size: 128 cutoff_lower: 0.0 cutoff_upper: 8.0 dataset: HDF5 dataset_root: SPICE-corrected.hdf5 derivative: false distributed_backend: ddp early_stopping_patience: 40 embedding_dimension: 64 energy_weight: 1.0 force_weight: 0.001 inference_batch_size: 128 lr: 1.e-4 lr_factor: 0.8 lr_min: 1.e-7 lr_patience: 10 lr_warmup_steps: 5000 max_num_neighbors: 80 max_z: 28 model: equivariant-transformer neighbor_embedding: true ngpus: -1 num_epochs: 1000 num_heads: 8 num_layers: 5 num_nodes: 1 num_rbf: 64 num_workers: 4 rbf_type: expnorm save_interval: 5 seed: 1 test_interval: 10 test_size: 0.01 trainable_rbf: true val_size: 0.05 weight_decay: 0.0 — Reply to this email directly, view it on GitHub <#82>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB3KUOWZTJQVV5OPWE5TB4DVIQRNDANCNFSM5VGDF2CA> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

peastman · 2022-05-05T21:49:53Z

I obtained them by doing a least squares fit to the full dataset to find per-atom-type energies that best matched the total energy of every sample.

PhilippThoelke · 2022-05-05T23:03:12Z

I'm not able to download the dataset. When I click on download nothing happens. I tried different browsers and downloading from a terminal but I just get a connection timeout. Is it possible to get it somewhere else?

peastman · 2022-05-05T23:06:26Z

I just tested and it worked fine for me. It's just on Dropbox. I would attach it here, but it's about 360 MB. Any suggestions for other places to post it?

PhilippThoelke · 2022-05-06T00:23:17Z

Now it works for me too. Didn't change anything... I'll try to see if I can figure something out.

PhilippThoelke · 2022-05-06T02:36:13Z

I'm having a couple of problems with getting it to train without errors. Have you made modifications to the code to handle the formal charge? Most of the atom types are fine but I get a few molecules where the atom types just seem broken.
Some example atom types:
[ 48 13 -128 63 105 -102 -24 -65 45 23 -53 -65 114 -110 -127 63 -109 -122 -115 -65 -55 97 -113 -65 69 46 -79 -66 -31 -68 46 63 7 65 118 62]
[ 42 -91 -11 66]
[ 60 -94 -121 -63 41 54 -22 65 24 -19 2 -62 -47 91 104 -64 20 60 23 -62 -14 -45 35 66 112 -24]
If this is intended, how do I handle them correctly? Currently it just breaks the embedding layer.

Then, if I just skip the atoms with atom types outside [0, 27] (the max_z from your config file), I get the newly added error of surpassing the maximum number of neighbors in the radius_graph call (#81). If you were not using the latest code, this might have messed with some of the predictions but I don't think it would just cause the loss to be stuck at 2e7.

I increased max_num_neighbors to handle larger molecules and I get a lot of the skipping gradients warnings due to atoms not being in the cutoff of any other atom. I'm not sure how much interactions with these atoms contributed to the energy in the reference calculation but this probably doesn't help the model to fit the data well. For me the loss starts out around 1e6 and fluctuates a bit, sometimes going up to 2e7 or higher.

I computed the mean and standard deviation of the energy on a subset (first 8000 molecules) of the dataset and got a mean around -70 and standard deviation of ~740. This doesn't really seem reasonable to me, I don't think the model will work well, if at all, if the distribution of energy is so wide. These are some random energies from the dataset:

tensor([ 1.1613e+03,  1.1532e+03,  5.5358e+02,  1.2090e+03,  1.2154e+03,
         1.7931e+02,  2.7697e+02, -2.4045e+02,  4.5677e+00,  1.5983e+03,
         9.7936e+02,  1.1500e+03, -2.3005e+00, -2.0028e+02,  7.5593e+02,
         1.9390e+03, -3.2701e+02,  1.0661e+03, -2.7516e+02,  9.5363e+02,
         2.9707e+02,  1.1408e+03,  1.2444e+03, -1.9616e+02,  3.9944e+02,
         1.7736e+03, -6.0677e+01,  2.4780e+02,  1.0419e+03, -2.7178e+02,
         1.5953e+03,  1.8742e+01,  2.1269e+03, -9.7358e+01, -1.9511e+02,
         5.3904e+01, -6.2817e+01, -1.5848e+02,  3.1398e+02, -1.3329e+02,
        -1.9192e+02,  1.0333e+03,  2.6484e+02, -1.5122e+02,  1.2274e+03,
         9.9747e+02, -1.9101e+02,  6.2536e+02, -2.9667e+02,  1.1766e+03,
        -2.1408e+02,  3.1954e+02,  5.1644e+02, -2.5849e+02,  5.4932e+00,
        -2.1692e+02,  8.0138e+02,  4.8013e+01])

The model won't learn anything efficiently with this. I also computed mean and standard deviation for the last ~60000 molecules in the dataset and got mean -5e13 and standard deviation 1e16. It seems to me like some of the data is broken.

Edit:
This is a plot showing 1. the full distribution of energy and 2. the distribution in the interval (-5000, 5000). Although the outliers are rare (log scale on the y-axis I think this mostly contributes to the model not learning anything.

raimis · 2022-05-06T12:30:11Z

A while ago I tried to compute SPICE with xTB, but wasn't able to train. One issue was the number of neighbors (#81), but probably there are more.

I tried to analyze the dataset:

The type distribution looks reasonable, though many types are underrepresented. I didn't find any negative types. @PhilippThoelke how did you got them?
The force distribution is a bit wild. It spans 18 orders of magnitude:
If I skip large forces (>500 kJ/mol/A), the distribution looks reasonable:

raimis · 2022-05-06T12:36:03Z

My recommendation is to filter out the samples with large forces and underrepresented types. If it doesn't work, there still might be some issues with TorchMD-NET.

PhilippThoelke · 2022-05-06T13:47:59Z

I didn't find any negative types. @PhilippThoelke how did you got them?

I just loaded the dataset with the HDF5 loader class as given in the config file, I didn't change anything.

peastman · 2022-05-06T15:41:39Z

I just ran this test to check for negative types:

f = h5py.File('SPICE-corrected.hdf5')
for g in f:
  if np.any(np.array(f[g]['types']) < 0):
    print('error')

It didn't find anything. Where exactly do you see them?

peastman · 2022-05-06T15:56:56Z

I'll try excluding the 1% of samples with the highest and lowest energies and see if that helps. Note that some of the variation in energy comes from differences in size. The samples range from 2 atoms up to nearly 100. Since the model predicts per-atom energies, that shouldn't be a problem for it. Here's a plot of the central 98% of energies.

And here's the same plot, but showing energy per atom instead of total energy.

PhilippThoelke · 2022-05-06T16:56:50Z

The negative atom types are very peculiar. I can catch negative atom types with a debugger and I consistenly get broken atom types for the current index. When I retrieve another index, followed by the seemingly broken index again, I get normal looking atom types. I guess this breaks some caching that h5py might be doing. I also only get broken atom types with num_workers > 1, which makes me think that the problem has to do with multiprocessing. The h5py documentation states

Read-only parallel access to HDF5 files works with no special preparation: each process should open the file independently and read data normally (avoid opening the file and then forking).

I believe pytorch-lightning opens the dataset once and then copies it to the different worker processes, which is not recommended by h5py. My guess is that data gets corrupted there, which I guess could also mess with your data, although I'm confused by me being the only one where it fully breaks.

It might be worth storing the dataset in something other than HDF5 or loading the full dataset into memory before training, using only a single process. When I set num_workers to 1 for testing my loss also seems to decrease with only some batches spiking to 10^7 which I would assume comes from outliers.

giadefa · 2022-05-06T17:19:35Z

It must be opened in the get function the first time unless it's loaded into memory in _init and then never access again.

…

On Fri, May 6, 2022, 18:57 Philipp Thölke ***@***.***> wrote: The negative atom types are very peculiar. I can catch negative atom types with a debugger and when I consistenly get broken atom types for the current index. When I retrieve another index, followed by the seemingly broken index again, I get normal looking atom types. I guess this breaks some caching that h5py might be doing. I also only get broken atom types with num_workers > 1, which makes me think that the problem has to do with multiprocessing. The h5py documentation states Read-only parallel access to HDF5 files works with no special preparation: each process should open the file independently and read data normally (avoid opening the file and then forking). I believe pytorch-lightning opens the dataset once and then copies it to the different worker processes, which is not recommended by h5py. My guess is that data gets corrupted there, which I guess could also mess with your data, although I'm confused by me being the only one where it fully breaks. It might be worth storing the dataset in something other than HDF5 or loading the full dataset into memory before training, using only a single process. When I set num_workers to 1 for testing my loss also seems to decrease with only some batches spiking to 10^7 which I would assume comes from outliers. — Reply to this email directly, view it on GitHub <#82 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB3KUOVGQWEP76OB273S2ETVIVFNZANCNFSM5VGDF2CA> . You are receiving this because you commented.Message ID: ***@***.***>

peastman · 2022-05-06T17:32:05Z

I'll try converting everything to numpy arrays in memory and see if that makes a difference. It's not an option for really large datasets, of course, but this one is small enough it shouldn't be a problem.

For the moment, though, excluding the energy outliers seems to be helping. The loss is lower, and I'm seeing a slow but steady improvement with training. My tentative conclusion is that was the main problem. I'll continue to experiment.

PhilippThoelke · 2022-05-06T17:49:19Z

A while ago @raimis created some benchmarks for different ways of loading large datasets. I believe memory mapped numpy arrays were the best option he found but I'm not 100% sure.

giadefa · 2022-05-06T17:50:28Z

yes, copy the new ANI1 dataloader in torchmd-net

…

On Fri, May 6, 2022 at 7:49 PM Philipp Thölke ***@***.***> wrote: A while ago @raimis <https://github.com/raimis> created some benchmarks for different ways of loading large datasets. I believe memory mapped numpy arrays were the best option he found but I'm not 100% sure. — Reply to this email directly, view it on GitHub <#82 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB3KUOQGNVLL26P4SXAIBXLVIVLSXANCNFSM5VGDF2CA> . You are receiving this because you commented.Message ID: ***@***.***>

peastman · 2022-05-11T18:27:59Z

If I omit the samples with extreme energies and forces, training works as expected. I'm happy to close this issue, although maybe you want to keep it open to track the problem with the atom types?

giadefa · 2022-10-11T07:58:20Z

Can you make a PR with the spice data loader?

…

On Wed, May 11, 2022, 20:28 Peter Eastman ***@***.***> wrote: If I omit the samples with extreme energies and forces, training works as expected. I'm happy to close this issue, although maybe you want to keep it open to track the problem with the atom types? — Reply to this email directly, view it on GitHub <#82 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB3KUOVKO2TSXOYIP4LUCJLVJP33XANCNFSM5VGDF2CA> . You are receiving this because you commented.Message ID: ***@***.***>

PhilippThoelke mentioned this issue May 12, 2022

Avoid multiple processes reading from the same HDF5 file #84

Merged

PhilippThoelke closed this as completed May 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to fit model #82

Unable to fit model #82

peastman commented May 5, 2022

giadefa commented May 5, 2022 via email

peastman commented May 5, 2022

PhilippThoelke commented May 5, 2022

peastman commented May 5, 2022

PhilippThoelke commented May 6, 2022

PhilippThoelke commented May 6, 2022 •

edited

raimis commented May 6, 2022

raimis commented May 6, 2022

PhilippThoelke commented May 6, 2022

peastman commented May 6, 2022

peastman commented May 6, 2022

PhilippThoelke commented May 6, 2022 •

edited

giadefa commented May 6, 2022 via email

peastman commented May 6, 2022

PhilippThoelke commented May 6, 2022

giadefa commented May 6, 2022 via email

peastman commented May 11, 2022

giadefa commented Oct 11, 2022 via email

Unable to fit model #82

Unable to fit model #82

Comments

peastman commented May 5, 2022

giadefa commented May 5, 2022 via email

peastman commented May 5, 2022

PhilippThoelke commented May 5, 2022

peastman commented May 5, 2022

PhilippThoelke commented May 6, 2022

PhilippThoelke commented May 6, 2022 • edited

raimis commented May 6, 2022

raimis commented May 6, 2022

PhilippThoelke commented May 6, 2022

peastman commented May 6, 2022

peastman commented May 6, 2022

PhilippThoelke commented May 6, 2022 • edited

giadefa commented May 6, 2022 via email

peastman commented May 6, 2022

PhilippThoelke commented May 6, 2022

giadefa commented May 6, 2022 via email

peastman commented May 11, 2022

giadefa commented Oct 11, 2022 via email

PhilippThoelke commented May 6, 2022 •

edited

PhilippThoelke commented May 6, 2022 •

edited