MACE-OFF dataset #332

RaulPPelaez · 2024-06-27T14:41:46Z

I added a Dataset class for the dataset used in the work "MACE-OFF23: Transferable Machine Learning
Force Fields for Organic Molecules" https://arxiv.org/pdf/2312.15211

RaulPPelaez · 2024-06-28T12:18:45Z

This is ready for review. Preprocessing is very slow, about 40 minutes. The dataset comes as an XYZ file, which I am parsing with ase, and I do not know how to speed it up.

stefdoerr · 2024-07-01T08:07:47Z

import re
import tarfile
from moleculekit.periodictable import periodictable


def parse_xyz(xyz_file):
    # Matches the energy value on each frame's comment line
    energy_re = re.compile(r"energy=(\S+)")

    with tarfile.open(xyz_file, "r:gz") as tar:
        for member in tar.getmembers():
            f = tar.extractfile(member)
            if f is None:
                continue

            n_atoms = None
            counter = 0
            positions = []
            numbers = []
            forces = []
            energy = None

            for line in f:
                line = line.decode("utf-8").strip()
                if n_atoms is None:
                    n_atoms = int(line)
                    positions = []
                    numbers = []
                    forces = []
                    energy = None
                    counter = 1
                    continue
                if counter == 1:
                    props = line
                    energy = float(energy_re.search(props).group(1))
                    counter = 2
                    continue

                # Columns: element, position (x, y, z), force (fx, fy, fz), then three ignored fields
                el, x, y, z, fx, fy, fz, _, _, _ = line.split()
                numbers.append(periodictable[el].number)
                positions.append([float(x), float(y), float(z)])
                forces.append([float(fx), float(fy), float(fz)])
                counter += 1
                if counter == n_atoms + 2:
                    n_atoms = None
                    yield energy, numbers, positions, forces

I wrote an XYZ parser for the MACE-OFF dataset. You can use it like this:

gen = parse_xyz("./train_large_neut_no_bad_clean.tar.gz")
x = next(gen)
x = next(gen)

The first call takes a while because the archive has to be extracted; after that, each call is very fast (around 60 μs for me).
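
To make the frame layout the parser expects easier to see, here is a minimal, self-contained sketch that runs the same logic on a tiny in-memory archive. It is an illustration rather than the code in this PR: the toy data is made up, `ATOMIC_NUMBERS` is a hypothetical stand-in for the `moleculekit` periodic table lookup, and `parse_frames` and `make_archive` are names introduced only for this example. The layout (atom count, a comment line containing `energy=...`, then ten whitespace-separated columns per atom) is inferred from the parser above.

```python
import io
import re
import tarfile

# Hypothetical stand-in for moleculekit's periodictable lookup (illustration only).
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8}

ENERGY_RE = re.compile(r"energy=(\S+)")


def parse_frames(lines):
    """Yield (energy, numbers, positions, forces) for each frame in an iterable of text lines."""
    lines = iter(lines)
    for header in lines:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between frames
        n_atoms = int(header)  # first line of a frame: number of atoms
        # Second line of a frame: comment line carrying the energy
        energy = float(ENERGY_RE.search(next(lines)).group(1))
        numbers, positions, forces = [], [], []
        for _ in range(n_atoms):
            # Keep element, position and force; drop the trailing columns
            el, x, y, z, fx, fy, fz = next(lines).split()[:7]
            numbers.append(ATOMIC_NUMBERS[el])
            positions.append([float(x), float(y), float(z)])
            forces.append([float(fx), float(fy), float(fz)])
        yield energy, numbers, positions, forces


def make_archive(text):
    """Build an in-memory .tar.gz holding a single XYZ member."""
    data = text.encode("utf-8")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="toy.xyz")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


# Made-up two-frame file, not taken from the real dataset.
TOY_XYZ = """2
energy=-1.5 pbc="F F F"
H 0.0 0.0 0.0 0.1 0.0 0.0 0 0 0
H 0.0 0.0 0.74 -0.1 0.0 0.0 0 0 0
1
energy=-0.5 pbc="F F F"
O 1.0 2.0 3.0 0.0 0.0 0.0 0 0 0
"""

with tarfile.open(fileobj=make_archive(TOY_XYZ), mode="r:gz") as tar:
    member = tar.getmembers()[0]
    f = tar.extractfile(member)
    # Consume the generator while the archive is still open
    frames = list(parse_frames(line.decode("utf-8") for line in f))

for energy, numbers, positions, forces in frames:
    print(energy, numbers)
```

Because the frames are produced lazily, a consumer that only needs a few samples can stop early without parsing the rest of the file.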

stefdoerr · 2024-07-01T08:13:49Z

Parsing the whole file takes about 1 minute in total, excluding the initial extraction cost of roughly 10-20 s.

RaulPPelaez · 2024-07-09T08:04:20Z

Works great @stefdoerr, thanks. Please review again!

RaulPPelaez added 8 commits June 27, 2024 13:15

Add MACEOFF Dataset

f4308de

Add MACEOFF dataset

64e7ee6

Update docstring

2ebe79f

Add check for max_gradient

525b16d

Add MACEOFF example

3dc2b90

Add error checking

603e795

Merge remote-tracking branch 'origin/main' into maceds

23b9bea

Add pre_transform and pre_filter

33cd988

RaulPPelaez marked this pull request as ready for review June 28, 2024 12:18

RaulPPelaez requested a review from stefdoerr June 28, 2024 12:18

Use @stefdoerr xyz parser

191f454

stefdoerr approved these changes Jul 9, 2024

stefdoerr merged commit 6c42c8b into torchmd:main Jul 9, 2024
2 checks passed

RaulPPelaez deleted the maceds branch July 9, 2024 10:36
