@theosaulus theosaulus commented Nov 22, 2025

Checklist

  • My pull request has a clear and explanatory title.
  • My pull request passes the Linting test.
  • I added appropriate unit tests and I made sure the code passes all unit tests.
  • My PR follows PEP8 guidelines.
  • My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • I linked to issues and PRs that are relevant to this PR.

Description

This PR adds three Open Catalyst Project datasets (from OC20 and OC22):

  1. OC20 IS2RE - Initial Structure to Relaxed Energy prediction task using OC20 data
  2. OC22 IS2RE - Initial Structure to Relaxed Energy prediction task using OC22 data
  3. OC20 S2EF - Structure to Energy and Forces prediction task with multiple training splits (200K, 2M, 20M, all)

Implementation Details:

Dataset Loaders (3 files):

  • topobench/data/loaders/graph/oc20_is2re_dataset_loader.py - IS2REDatasetLoader for OC20
  • topobench/data/loaders/graph/oc22_is2re_dataset_loader.py - OC22IS2REDatasetLoader for OC22
  • topobench/data/loaders/graph/oc20_dataset_loader.py - OC20DatasetLoader for S2EF task

Dataset Classes (3 files):

  • topobench/data/datasets/oc20_is2re_dataset.py - IS2REDataset handling LMDB data
  • topobench/data/datasets/oc22_is2re_dataset.py - OC22IS2REDataset with multiple LMDB files
  • topobench/data/datasets/oc20_dataset.py - OC20Dataset for S2EF with configurable splits
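
A dataset backed by several LMDB files needs some way to turn a global sample index into a file and a position inside that file. The sketch below shows one common approach, cumulative offsets plus a binary search. It is an illustration only: `ShardIndex`, `shard_sizes`, and `locate` are hypothetical names, and the PR's `OC22IS2REDataset` may resolve indices differently.

```python
from bisect import bisect_right
from itertools import accumulate


class ShardIndex:
    """Map a global sample index to (shard number, index within shard).

    Hypothetical helper, not taken from the PR: each "shard" stands in
    for one LMDB file, and shard_sizes holds the entry count of each.
    """

    def __init__(self, shard_sizes):
        # Cumulative end offsets, e.g. sizes [3, 2, 4] -> [3, 5, 9].
        self.offsets = list(accumulate(shard_sizes))

    def __len__(self):
        # Total number of samples across all shards.
        return self.offsets[-1] if self.offsets else 0

    def locate(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(f"index {idx} out of range for {len(self)} samples")
        # First shard whose end offset is strictly greater than idx.
        shard = bisect_right(self.offsets, idx)
        start = self.offsets[shard - 1] if shard > 0 else 0
        return shard, idx - start
```

With sizes `[3, 2, 4]`, global index 3 is the first entry of the second file, so `ShardIndex([3, 2, 4]).locate(3)` returns `(1, 0)`.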

Configuration Files (7 files):

  • configs/dataset/graph/OC20_IS2RE.yaml
  • configs/dataset/graph/OC22_IS2RE.yaml
  • configs/dataset/graph/OC20_S2EF_200K.yaml
  • configs/dataset/graph/OC20_S2EF_2M.yaml
  • configs/dataset/graph/OC20_S2EF_20M.yaml
  • configs/dataset/graph/OC20_S2EF_all.yaml

Test Coverage (21 tests total):

  • test/data/load/test_oc20_datasets.py - 17 unit tests covering:

    • Loader initialization and dataset loading
    • Data item access and validation
    • Split indices validity and non-overlap
    • Integration with PreProcessor pipeline
    • Multiple train split configurations for S2EF
  • test/pipeline/test_pipeline.py
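
The "split indices validity and non-overlap" item can be pictured as a check like the one below. This is a minimal sketch under assumptions: the function name `check_splits` and the dictionary layout (`train`/`valid`/`test` mapped to index lists) are chosen for illustration and are not copied from `test_oc20_datasets.py`.

```python
def check_splits(split_idx, num_samples):
    """Return True if every split is in range, duplicate-free, and disjoint.

    split_idx: dict mapping a split name to a list of integer indices
               (hypothetical layout, assumed for this sketch).
    num_samples: total number of items in the dataset.
    Raises ValueError describing the first problem found.
    """
    seen = set()
    for name, indices in split_idx.items():
        current = {int(i) for i in indices}
        # A repeated index inside one split shrinks the set.
        if len(current) != len(indices):
            raise ValueError(f"split '{name}' contains duplicate indices")
        # Every index must address an existing sample.
        bad = sorted(i for i in current if not 0 <= i < num_samples)
        if bad:
            raise ValueError(f"split '{name}' has out-of-range indices: {bad}")
        # No index may appear in an earlier split.
        overlap = sorted(seen & current)
        if overlap:
            raise ValueError(f"split '{name}' overlaps earlier splits: {overlap}")
        seen |= current
    return True
```

A unit test can then call it directly, for example `assert check_splits({"train": [0, 1, 2], "valid": [3], "test": [4]}, 5)`.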

Issue

This PR addresses TDL Challenge 2025 - Category A1: Broadening Benchmarks with Graph Datasets.

Additional context

Dataset source: Open Catalyst Project

References:

  • Chanussot, L., et al. (2021). "Open Catalyst 2020 (OC20) Dataset and Community Challenges." ACS Catalysis.
  • Tran, R., et al. (2023). "The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts." ACS Catalysis.

Testing Notes:

  • All tests set max_samples=100 to keep runtime short.
  • For training on the full dataset, set max_samples: null in the configs.
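
The effect of `max_samples` can be sketched as a simple cap on the dataset length, where `None` (YAML `null`) means no cap. The helper below is an assumption about the intended semantics, not the PR's actual code; `effective_length` is a hypothetical name.

```python
def effective_length(total, max_samples=None):
    """Number of samples exposed by a dataset holding `total` entries.

    Assumed semantics (not confirmed by the PR): None disables the cap,
    otherwise at most `max_samples` entries are used.
    """
    if max_samples is None:
        return total
    if max_samples < 0:
        raise ValueError("max_samples must be non-negative or None")
    # Never report more samples than the dataset actually holds.
    return min(total, max_samples)
```

Under these assumptions, a dataset with 1,000 entries exposes 100 samples when `max_samples=100` and all 1,000 when `max_samples` is `None`.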

@theosaulus theosaulus merged commit de9f3e7 into main Nov 22, 2025