written train and validation file looses info from the original file #242

Open
sandipde opened this issue Dec 10, 2023 · 3 comments · Fixed by #243
Labels
enhancement New feature or request

Comments

@sandipde
sandipde commented Dec 10, 2023

Hello again,
I really like the code you built and am trying to see if it fits our needs. Thanks a lot in advance for your help.

I would like to keep all the info from the user-supplied extxyz file (the info field as well as the extra arrays) in the train and validation files written during MACE training. From reading the code, this looks like it should already be the case, but it is not happening. I suspect the cause is the internal data representation of the atoms.h5 format written in the data-add step, which cannot store this info.

import ipsuite as ips

# The project context is assumed from the indentation of the original snippet.
with ips.Project() as project:
    data = ips.AddData(file="my_data.extxyz")
    test_data = ips.configuration_selection.UniformEnergeticSelection(
        data=data, n_configurations=156, name="test_data"
    )
    # we have now given the Nodes a "name" attribute to uniquely identify them
    train_data = ips.configuration_selection.RandomSelection(
        data=data,
        n_configurations=100,
        exclude_configurations=test_data.exclude_configurations,
        name="train_data",
    )
    validate_data = ips.configuration_selection.RandomSelection(
        data=train_data.excluded_atoms,
        n_configurations=10,
        name="validate_data",
    )
    model = ips.models.MACE(
        data=train_data,
        test_data=validate_data,
    )
    prediction = ips.analysis.Prediction(model=model, data=test_data)
    analysis = ips.analysis.PredictionMetrics(data=prediction)
project.run()

The train-data.extxyz and test-data.extxyz files in the nodes/MLModel directory only have the minimum info:

Lattice="7.725411453296575 -0.2019109857113853 -0.23392998344550006 0.15426298908328637 7.725411453296575 -0.26775798105159754 -0.20378498557876815 -0.23229898356092088 7.725411453296575" Properties=species:S:1:pos:R:3:forces:R:3 energy=-30889.876898123603 free_energy=-30889.876898123603 pbc="T T T"

The original file has more entries in the info field and more arrays. You can easily reproduce the issue by making a trajectory from the atomic configuration below, as an example:

32
Lattice="7.677598456680151 -0.25407298202004247 -0.1725099877920028 0.2883699795929503 7.733745452706803 0.11198199207538147 -0.3065139783089557 -0.05001199646080601 7.7641824505528705" Properties=species:S:1:pos:R:3:qe_forces:R:3:initial_magmoms:R:1:forces:R:3:magmoms:R:1 uid=63a593e36858459f43c7f119 qe_fenergy=-30898.412704745006 qe_virial="_JSON [[[-32.465447264707926, 3.0325194075781785, 1.3928161334345242], [3.0325194075781785, -107.34424200406917, -11.84423059775926], [1.3928161334345242, -11.84423059775926, -92.13548596388446]]]" type=Bulk rattle="_JSON {\"type\": \"mc_rattle\", \"stdev\": 0.1, \"min_distance\": 1.2951394504343348}" shear="_JSON {\"percent\": 0.04, \"direction\": [1, 1, 1]}" origin=mp-19342 energy=-30898.412704745006 stress="-0.07038963541165076 0.006574926682443274 0.0030198203963909963 0.006574926682443274 -0.2327373467736026 -0.025679950339513462 0.0030198203963909963 -0.025679950339513462 -0.1997626341813295" free_energy=-30898.412704745006 pbc="T T T"
W        6.03934477       3.79464123       5.60867370     -10.04848256      -2.04930677      -0.27058007      12.00000000     -10.04848256      -2.04930677      -0.27058007       0.02710000
W        1.80766217       4.43603199       5.22372253       6.38167011       2.59475577      15.14046007      12.00000000       6.38167011       2.59475577      15.14046007       0.23670000
W        5.69784350       7.27427649       5.82665889       1.03921213       2.31397356       1.45725499      12.00000000       1.03921213       2.31397356       1.45725499      -0.00280000
W        2.11740395       7.58203576       6.28060216       1.25890338      10.86720871     -22.13931295      12.00000000       1.25890338      10.86720871     -22.13931295      -0.00000000
W        2.17672265       4.15034061       1.98079066       0.01043328       1.94979120       3.21655624      12.00000000       0.01043328       1.94979120       3.21655624       0.04410000
W        6.16863206       3.93346572       1.59462689       2.51737199       1.54370602      -7.43414476      12.00000000       2.51737199       1.54370602      -7.43414476       0.01540000
W        1.85609387      -0.04833240       1.64488268      -5.97029309      -0.09082730       6.64501226      12.00000000      -5.97029309      -0.09082730       6.64501226      -0.04100000
W        5.56727111       7.34393308       1.92714096       7.24282126      11.48187232       1.74269681      12.00000000       7.24282126      11.48187232       1.74269681      -0.00370000
O        5.84538409       3.54860695       3.35702526       1.46795335       3.99300189       8.01405785       4.00000000       1.46795335       3.99300189       8.01405785      -0.08610000
O        2.46051443       3.84697053       0.38777517      -1.31569082       3.33678128     -10.62278011       4.00000000      -1.31569082       3.33678128     -10.62278011       0.14190000
O        5.43527942       0.18020669       0.16119319      -1.34808312      -0.39718916      -0.61130574       4.00000000      -1.34808312      -0.39718916      -0.61130574       0.00290000
O        2.31585674       7.60189146       4.12510861      -2.05420293      -0.56911600      -3.44812188       4.00000000      -2.05420293      -0.56911600      -3.44812188       0.27880000
O        1.43721610       3.70838334       3.89910382      -2.74418108      -7.97299836     -15.20005619       4.00000000      -2.74418108      -7.97299836     -15.20005619      -0.00420000
O        5.08656044       3.88153503       7.20369649       4.15908178      -2.44308485       6.88916371       4.00000000       4.15908178      -2.44308485       6.88916371      -0.03030000
O        1.28546371      -0.01565930       7.61066906      -6.84280929       2.68036399      19.74376343       4.00000000      -6.84280929       2.68036399      19.74376343      -0.00080000
O        6.11225417       6.74003292       4.25453310      -1.31142074      -0.82867503      -7.17219743       4.00000000      -1.31142074      -0.82867503      -7.17219743       0.01060000
O        3.81889303       7.24753669       2.29857714      -0.80219601       0.25269110       0.28421849       4.00000000      -0.80219601       0.25269110       0.28421849       0.00380000
O        7.29439148       0.05182610       1.77636177       1.66827931      -1.00294415       0.44802141       4.00000000       1.66827931      -1.00294415       0.44802141       0.00800000
O        0.31058558       4.48538928       1.79132927      -4.23184246      -1.94882549       0.29269978       4.00000000      -4.23184246      -1.94882549       0.29269978      -0.00150000
O        3.84653813       3.08787438       1.65756108      -1.20410263      -1.94545837       2.37249007       4.00000000      -1.20410263      -1.94545837       2.37249007      -0.00960000
O        3.67332184       0.21763008       5.57906081       1.79259961      -2.40604117       0.75878396       4.00000000       1.79259961      -2.40604117       0.75878396       0.04830000
O        7.71790185       7.12096630       5.71645010       3.05158758      -0.72300192       2.02365668       4.00000000       3.05158758      -0.72300192       2.02365668       0.06690000
O        0.14380859       3.96682982       5.92556168       2.29946637       0.53123696       1.11016764       4.00000000       2.29946637       0.53123696       1.11016764      -0.00110000
O        4.20757430       3.65342624       5.32855902      -2.94630116      -0.87967184      -4.03148770       4.00000000      -2.94630116      -0.87967184      -4.03148770      -0.01630000
O        5.85827919       1.23510701       5.53994211       1.45308646       6.03777940       0.07526416       4.00000000       1.45308646       6.03777940       0.07526416       0.03550000
O        2.53196352       2.31829414       1.32533971      -0.61989734      -7.85142632       1.85903614       4.00000000      -0.61989734      -7.85142632       1.85903614       0.11530000
O        5.49028221       2.48434052       2.24351524       1.30946901      -8.26879169      -0.82819192       4.00000000       1.30946901      -8.26879169      -0.82819192      -0.09250000
O        1.90350477       5.76264969       1.18911882       1.57921836       6.52884545       1.28008850       4.00000000       1.57921836       6.52884545       1.28008850       0.01610000
O        2.04933825       1.52834729       5.96912108      -0.99168246       7.23341462      -0.66034234       4.00000000      -0.99168246       7.23341462      -0.66034234       0.06410000
O        5.10964054       5.55609271       6.10698317       1.39839061       0.67365140       0.31619787       4.00000000       1.39839061       0.67365140       0.31619787      -0.00450000
O        2.24457194       5.94254138       6.20632316       3.13484350     -13.37921738      -0.20775722       4.00000000       3.13484350     -13.37921738      -0.20775722       0.00180000
O        5.71680690       5.74124769       1.75821778       0.66679838      -9.26249735      -1.04331175       4.00000000       0.66679838      -9.26249735      -1.04331175      -0.00150000

Any suggestions on how to solve this?
Thanks!
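
A quick way to see exactly what gets lost is to compare the comment lines of the original and the written frame. The snippet below is only a minimal sketch using the standard library, not part of IPSuite or ASE: `header_keys` is a hypothetical helper, and the two header strings are abridged from the ones quoted above (the lattice and energy values are shortened placeholders, and the JSON-valued entries are left out).

```python
import shlex


def header_keys(comment_line: str) -> tuple[set[str], set[str]]:
    """Return (info keys, per-atom array names) found in an extxyz comment line."""
    info, arrays = set(), set()
    for token in shlex.split(comment_line):
        key, sep, value = token.partition("=")
        if not sep:
            continue  # bare flags without "=" are ignored in this sketch
        if key == "Properties":
            # Properties is a flat list of name:type:ncols triples
            arrays.update(value.split(":")[0::3])
        else:
            info.add(key)
    return info, arrays


# Abridged headers (placeholder lattice and energies, JSON entries omitted)
original = (
    'Lattice="7.7 0 0 0 7.7 0 0 0 7.7" '
    "Properties=species:S:1:pos:R:3:qe_forces:R:3:initial_magmoms:R:1:forces:R:3:magmoms:R:1 "
    "uid=63a593e36858459f43c7f119 type=Bulk origin=mp-19342 "
    'energy=-30898.41 free_energy=-30898.41 pbc="T T T"'
)
written = (
    'Lattice="7.7 0 0 0 7.7 0 0 0 7.7" '
    "Properties=species:S:1:pos:R:3:forces:R:3 "
    'energy=-30889.87 free_energy=-30889.87 pbc="T T T"'
)

orig_info, orig_arrays = header_keys(original)
new_info, new_arrays = header_keys(written)
lost_info = sorted(orig_info - new_info)
lost_arrays = sorted(orig_arrays - new_arrays)
print("lost info keys:", lost_info)
print("lost arrays:", lost_arrays)
```

Running the same comparison on the first line after the atom count in `my_data.extxyz` and in the written train file should list every dropped entry, which makes it easy to confirm whether a fix preserves them.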

@PythonFZ PythonFZ added the enhancement New feature or request label Dec 11, 2023
@PythonFZ PythonFZ linked a pull request Dec 11, 2023 that will close this issue
@PythonFZ
Member

Thanks for using IPSuite and letting us know about this issue.
As you have described, we store the information from the ase.Atoms objects in H5MD.
To read the file we use ase.io, so any data that ASE does not support can't be read directly.
In addition, the conversion currently silently ignores some other properties, such as those returned by get_magnetic_moments.

Personally, I think having two ways of storing this information in ASE, either through e.g. the .arrays or in calc.results, can be confusing.

I'll add support for magmoms, initial_magmoms, and qe_forces through https://github.com/zincware/ZnH5MD, which we use to save / load H5MD. Beyond that, I'll try to support all properties in .arrays.

As a quick solution, I added ips.data_loading.ReadData in #243, which keeps the result from ase.io.iread:

@functools.cached_property
def atoms(self) -> typing.List[ase.Atoms]:
    return load_data(self.file, self.lines_to_read)

You should be able to test it via pip install git+https://github.com/zincware/IPSuite@242-written-train-and-validation-file-looses-info-from-the-original-file
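
To illustrate the difference between the two approaches described in this comment, here is a rough, self-contained sketch. It does not use IPSuite, ZnH5MD, or ASE: `KNOWN_FIELDS`, `fake_reader`, `SchemaStore`, and `PassThrough` are all hypothetical names, frames are plain dicts, and the schema is an assumption chosen only to mimic a storage format that knows a fixed set of fields.

```python
from functools import cached_property

# Hypothetical schema: the only fields this toy storage format knows about.
KNOWN_FIELDS = frozenset({"species", "pos", "forces", "energy"})


def fake_reader(source):
    """Hypothetical stand-in for a file reader; yields one dict per frame."""
    yield from source


class SchemaStore:
    """Converts frames to the fixed schema; unknown fields are dropped silently."""

    def __init__(self, source):
        self.source = source

    @cached_property
    def frames(self):
        return [
            {key: value for key, value in frame.items() if key in KNOWN_FIELDS}
            for frame in fake_reader(self.source)
        ]


class PassThrough:
    """Caches whatever the reader returned, without any conversion step."""

    def __init__(self, source):
        self.source = source

    @cached_property
    def frames(self):
        return list(fake_reader(self.source))


# Toy frame with abridged values; "uid" and "magmoms" are not in the schema.
frame = {
    "species": ["W", "O"],
    "energy": -30898.41,
    "uid": "63a593e36858459f43c7f119",
    "magmoms": [0.0271, -0.0861],
}

stored = SchemaStore([frame]).frames[0]
passthrough = PassThrough([frame])
kept = passthrough.frames[0]
dropped = sorted(set(frame) - set(stored))
print("dropped by the schema store:", dropped)
print("pass-through keeps everything:", kept == frame)
```

The pass-through variant trades the benefits of a compact, uniform storage format for fidelity to the input file, which is the trade-off discussed in the rest of this thread.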

@sandipde
Author

Thanks a lot @PythonFZ ! problem solved :-)

I think having both options is really desirable. I understand the motivation for using the H5MD format for large MD data files, but for ML training codes those benefits don't really apply, and the ASE extxyz format does provide scope to include different properties that are not restricted to calc.results. So the way you have implemented it now provides a good basis to serve both needs.

On a different note:

  • The dvc version dependency did not work for me. This could be because I had the latest dvc installed and downgrading breaks some config, but I had to use dvc 3.33.1.

  • The MACE calculator syntax has changed in the meantime. You should fix this by replacing model_path with model_paths; it can take either a list of models or a single model file. https://github.com/zincware/IPSuite/blob/83d0ed272a9b198fa2b1eb30373d93a0662de8e3/ipsuite/models/mace_model.py#L131C9-L131C9

  • It would also be desirable to just use the cmd = "mace_run_train " CLI command instead of the current line:

    cmd = """curl -sSL https://raw.githubusercontent.com/ACEsuit/mace/main/scripts/run_train.py | python - """ # noqa E501
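
As a sketch of the last point, the training command could be assembled as an argument list for the installed console script rather than piping a downloaded script into Python. This is not how IPSuite builds its command: `build_mace_train_command` is a hypothetical helper, the model name is made up, and only three of MACE's training options (`--name`, `--train_file`, `--valid_file`) are shown.

```python
import shlex
import shutil


def build_mace_train_command(name, train_file, valid_file):
    """Assemble an argv list for the mace_run_train console script."""
    return [
        "mace_run_train",
        f"--name={name}",
        f"--train_file={train_file}",
        f"--valid_file={valid_file}",
    ]


# "mace_model" is a placeholder name; the file names follow the ones mentioned above.
argv = build_mace_train_command("mace_model", "train-data.extxyz", "test-data.extxyz")
command_line = shlex.join(argv)

# Check whether the console script is available before trying to run it.
is_installed = shutil.which(argv[0]) is not None
print(command_line)
print("mace_run_train on PATH:", is_installed)
```

Passing `argv` to `subprocess.run` as a list avoids shell quoting issues and removes the network dependency of the curl approach.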

@PythonFZ
Member

> I think having both options is really desirable. I understand the motivation for using the H5MD format for large MD data files, but for ML training codes those benefits don't really apply, and the ASE extxyz format does provide scope to include different properties that are not restricted to calc.results. So the way you have implemented it now provides a good basis to serve both needs.

Our focus will remain on storing everything in H5MD, but I do agree that we should avoid unnecessary data duplication and overuse of the H5 files where they are not required.

Thanks for pointing out the changes made to MACE (new issue #246).
These changes do not allow the use of an fsspec object but require an actual path. Therefore, I need to implement zincware/ZnTrack#746 first, so that the model can be loaded from an arbitrary commit / experiment.
Once I get back to the MACE model, I'll fix the curl call as well. It was a dirty fix for a design choice made by the MACE developers at the time not to include the training script in the install. AFAIK, this should no longer be necessary.

@PythonFZ PythonFZ reopened this Dec 19, 2023