Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add func #29

Merged
merged 20 commits into from
Dec 1, 2023
Merged

Add func #29

merged 20 commits into from
Dec 1, 2023

Conversation

FNTwin
Copy link
Collaborator

@FNTwin FNTwin commented Nov 28, 2023

Add statistics of datasets, some improvements and fixes to downloading issues.

@FNTwin FNTwin requested a review from prtos as a code owner November 28, 2023 15:30
@@ -50,6 +60,25 @@ def read_raw_entries(self):
samples = read_qc_archive_h5(raw_path, self.__name__, self.energy_target_names, self.force_target_names)
return samples

@property
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why these stats are not computed on the fly? using numpy /torch or at loading time? It would take less than 15s for most datasets no? Doing this way will create problem if the data change.

Copy link
Collaborator Author

@FNTwin FNTwin Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the on the fly computation with the original script was too slow for some datasets.
Also we talked about precomputing the statistics.
Why would a dataset data change ?

Comment on lines 137 to 141
# calculation per molecule formation energy statistics
e = []
for i in range(len(self.__energy_methods__)):
e.append(converted_energy_data[:, i] - np.array(list(map(lambda x: x.sum(), matrixs[i]))))
E = np.array(e).T
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know a better way to do this?

@FNTwin
Copy link
Collaborator Author

FNTwin commented Nov 28, 2023

Nikhil checked the timing and for GEOM it takes around 3 min. Also switching to the pkl files seems to cause issue on loading GEOM for my notebook kernel. Probably the memory usage increased.


@property
def numbers(self):
if hasattr(self, "_numbers"):
return self._numbers
self._numbers = np.array(list(set(self.data["atomic_inputs"][..., 0])), dtype=np.int32)
self._numbers = np.unique(self.data["atomic_inputs"][..., 0]).astype(np.int32)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use pandas.unique because it is way faster than numpy

"""
if self.__average_nb_atoms__ is None:
logger.info(PROPERTY_NOT_AVAILABLE_ERROR)
return 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 1? should be either an error or None

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a WIP patch to test different datasets on the multihead branch. Now we can just raise an error

@FNTwin
Copy link
Collaborator Author

FNTwin commented Nov 29, 2023

The last commits address numerous issues:

  • Data was converted only in the get item call
  • Statistics calculation was/is slow and implemented a caching system for the statistics
  • Caching and statistics are done on the original data (aka original unit) and only after that the data will be converted
  • Statistics are converted on the fly
  • Fixed Mean and Component mean (+ std) on the Forces statistics
  • Statistics shape for scalar values (mean,std) are not (1,) but (1,1) now [Nikhil request for MultiHead]

@FNTwin FNTwin merged commit 420bdb1 into main Dec 1, 2023
3 of 5 checks passed
@FNTwin FNTwin deleted the add_func branch December 1, 2023 16:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants