Add func #29

FNTwin · 2023-11-28T15:30:51Z

Adds dataset statistics, some improvements, and fixes for downloading issues.

prtos · 2023-11-28T16:56:36Z

src/openqdc/datasets/ani.py

@@ -50,6 +60,25 @@ def read_raw_entries(self):
        samples = read_qc_archive_h5(raw_path, self.__name__, self.energy_target_names, self.force_target_names)
        return samples

+    @property


Why are these stats not computed on the fly, using numpy/torch, or at loading time? It would take less than 15 s for most datasets, no? Doing it this way will create problems if the data changes.

Because the on-the-fly computation with the original script was too slow for some datasets.
Also, we talked about precomputing the statistics.
Why would a dataset's data change?

FNTwin · 2023-11-28T19:40:46Z

src/openqdc/datasets/base.py

+        # calculation per molecule formation energy statistics
+        e = []
+        for i in range(len(self.__energy_methods__)):
+            e.append(converted_energy_data[:, i] - np.array(list(map(lambda x: x.sum(), matrixs[i]))))
+        E = np.array(e).T


Do you know a better way to do this?
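
One possible answer, as a minimal sketch rather than a drop-in patch: if the isolated-atom energies are available as one flat array with the molecules concatenated in order, `np.add.reduceat` can compute all per-molecule sums for every method at once, which removes the Python-level `map`/`lambda` loop. The function name `formation_energies`, the flat `atom_e0` layout, and the `n_atoms` argument are assumptions made for illustration; the structure of `matrixs` in the diff may differ.

```python
import numpy as np


def formation_energies(energies, atom_e0, n_atoms):
    """Subtract per-molecule sums of isolated-atom energies from total energies.

    energies: (n_molecules, n_methods) total energies
    atom_e0:  (n_atoms_total, n_methods) isolated-atom energy of every atom,
              with molecules concatenated in order (assumed layout)
    n_atoms:  (n_molecules,) atoms per molecule, every entry >= 1
    """
    n_atoms = np.asarray(n_atoms, dtype=np.int64)
    # Start offset of each molecule inside the flat atom array.
    starts = np.concatenate(([0], np.cumsum(n_atoms)[:-1]))
    # Segment sums over atoms, computed for all methods in one call.
    e0_per_molecule = np.add.reduceat(atom_e0, starts, axis=0)
    return energies - e0_per_molecule


# Two molecules (2 atoms and 1 atom), two energy methods.
energies = np.array([[10.0, 20.0], [30.0, 40.0]])
atom_e0 = np.array([[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]])
E = formation_energies(energies, atom_e0, n_atoms=[2, 1])
```

Note that `np.add.reduceat` does not return zero for empty segments, so this sketch assumes every molecule has at least one atom.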

FNTwin · 2023-11-28T21:24:17Z

Nikhil checked the timing, and for GEOM it takes around 3 min. Also, switching to the pkl files seems to cause issues when loading GEOM in my notebook kernel, probably because memory usage increased.

prtos · 2023-11-28T21:21:56Z

src/openqdc/datasets/base.py


    @property
    def numbers(self):
        if hasattr(self, "_numbers"):
            return self._numbers
-        self._numbers = np.array(list(set(self.data["atomic_inputs"][..., 0])), dtype=np.int32)
+        self._numbers = np.unique(self.data["atomic_inputs"][..., 0]).astype(np.int32)


Use `pandas.unique` because it is much faster than `np.unique`.
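
A small sketch of what that swap could look like, assuming the atomic-number column is a 1-D array. One behavioural difference matters here: `np.unique` returns sorted values, while `pandas.unique` is hash-based and returns values in order of first appearance, so an explicit sort is needed if downstream code relies on ascending order. The sample array below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for self.data["atomic_inputs"][..., 0] (1-D assumed).
atomic_numbers = np.array([6, 1, 1, 8, 6, 1, 7], dtype=np.float32)

# np.unique sorts its result.
numpy_unique = np.unique(atomic_numbers).astype(np.int32)

# pd.unique keeps order of first appearance and does not sort.
appearance_order = pd.unique(atomic_numbers).astype(np.int32)

# Sort explicitly to reproduce the np.unique ordering.
pandas_unique = np.sort(appearance_order)
```

The speed gain is only noticeable on long arrays; for small inputs the two are comparable, so it is worth timing on the actual dataset.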

prtos · 2023-11-28T21:25:53Z

src/openqdc/datasets/base.py

+        """
+        if self.__average_nb_atoms__ is None:
+            logger.info(PROPERTY_NOT_AVAILABLE_ERROR)
+            return 1


Why 1? It should either raise an error or return None.

It was a WIP patch to test different datasets on the multihead branch. Now we can just raise an error.
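
For reference, a minimal sketch of the "raise an error" variant. The class `DatasetSketch`, the property name `average_n_atoms`, and the exception `StatisticsNotAvailableError` are hypothetical names used only for illustration; only `__average_nb_atoms__` comes from the diff above.

```python
class StatisticsNotAvailableError(Exception):
    """Hypothetical exception type, not taken from the openQDC code base."""


class DatasetSketch:
    # Stand-in for the precomputed class attribute shown in the diff.
    __average_nb_atoms__ = None

    @property
    def average_n_atoms(self):
        # Fail loudly instead of silently returning a placeholder such as 1.
        if self.__average_nb_atoms__ is None:
            raise StatisticsNotAvailableError(
                f"average number of atoms is not available for {type(self).__name__}"
            )
        return self.__average_nb_atoms__


class WithStats(DatasetSketch):
    __average_nb_atoms__ = 17.5


try:
    DatasetSketch().average_n_atoms
    raised = False
except StatisticsNotAvailableError:
    raised = True
```

Raising keeps a missing statistic from leaking into normalization code as a plausible-looking number.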

FNTwin · 2023-11-29T13:48:26Z

The last commits address several issues:

Data was converted only in the get item call (`__getitem__`).
Statistics calculation was (and still is) slow, so a caching system for the statistics was implemented.
Caching and statistics are computed on the original data (i.e. in the original units); the data is converted only afterwards.
Statistics are converted on the fly.
Fixed the mean and component mean (+ std) of the force statistics.
Statistics for scalar values (mean, std) now have shape (1, 1) instead of (1,) [Nikhil's request for MultiHead].
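
To illustrate the caching and conversion order described above, here is a rough sketch under stated assumptions, not the code from these commits. The class `StatsCache`, its methods, and the `factor` argument are hypothetical; the sketch only shows statistics being computed once in the original units, stored with shape (1, 1), and scaled on request without mutating the cached values.

```python
import numpy as np


class StatsCache:
    """Hypothetical helper: caches statistics in the original units."""

    def __init__(self, energies):
        self._energies = np.asarray(energies, dtype=np.float64)
        self._cache = {}
        self.n_computations = 0  # only here to show the cache is hit

    def _raw_stats(self):
        # Compute once, in the original units, with (1, 1)-shaped scalars.
        if "energy" not in self._cache:
            self.n_computations += 1
            self._cache["energy"] = {
                "mean": np.reshape(self._energies.mean(), (1, 1)),
                "std": np.reshape(self._energies.std(), (1, 1)),
            }
        return self._cache["energy"]

    def stats(self, factor=1.0):
        # Convert on the fly. Multiplying builds new arrays, so the cached
        # values are never modified through a shared reference.
        return {name: value * factor for name, value in self._raw_stats().items()}


cache = StatsCache([1.0, 2.0, 3.0])
converted = cache.stats(factor=2.0)  # e.g. a multiplicative unit conversion
original = cache.stats()             # still in the original units
```

This only covers multiplicative unit conversions, where both mean and std scale by the same factor; conversions with an offset would need the mean handled separately.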

FNTwin and others added 12 commits November 13, 2023 19:32

For nikhil

1b06437

Component values

22c8609

wip

28bbf00

Merge branch 'main' of github.com:OpenDrugDiscovery/openQDC into main

70d0071

Precomputed stats

4c8bf76

Fixes + black

c349f05

update init to not call openqdc.datasets everytime

6a67b79

RMS Spice fix

0dca850

combine smiles and subset into one artifact

32caf77

Fix xyz save + Updated e0 matrix to fix PCQM

f74a6df

GDML Stats, Improvements, Exceptions, Forces as None if not present i…

97f093c

…n the _stats

Fix the downloading issue and incompatibilities with new file types

51a5191

FNTwin requested a review from prtos as a code owner November 28, 2023 15:30

prtos and others added 2 commits November 28, 2023 15:43

change format for many reasons

16dcb4e

Merged format_change + Fix NOT_DEFINED

74f1c19

prtos reviewed Nov 28, 2023

On the fly calculation

2bf9050

FNTwin commented Nov 28, 2023

Update base.py

c90192a

prtos approved these changes Nov 28, 2023

FNTwin added 3 commits November 28, 2023 22:47

raise correct Error + cleaning

67bf201

Local caching statistics

52f69ce

Fix on incorrect unit changing, stats calculated on original units, c…

f38bda1

…onversion on the fly

Deepcopy dict to avoid reference issue

0197dd0

FNTwin merged commit 420bdb1 into main Dec 1, 2023
3 of 5 checks passed

FNTwin deleted the add_func branch December 1, 2023 16:29
