Category: A1; Team name: MatTheo; Dataset: OC20/OC22 #1

theosaulus · 2025-11-22T01:30:36Z

Checklist

My pull request has a clear and explanatory title.
My pull request passes the Linting test.
I added appropriate unit tests and I made sure the code passes all unit tests.
My PR follows PEP8 guidelines.
My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
I linked to issues and PRs that are relevant to this PR.

Description

This PR implements three Open Catalyst (OC20/OC22) datasets:

OC20 IS2RE - Initial Structure to Relaxed Energy prediction task using OC20 data
OC22 IS2RE - Initial Structure to Relaxed Energy prediction task using OC22 data
OC20 S2EF - Structure to Energy and Forces prediction task with multiple training splits (200K, 2M, 20M, all)

Implementation Details:

Dataset Loaders (3 files):

topobench/data/loaders/graph/oc20_is2re_dataset_loader.py - IS2REDatasetLoader for OC20
topobench/data/loaders/graph/oc22_is2re_dataset_loader.py - OC22IS2REDatasetLoader for OC22
topobench/data/loaders/graph/oc20_dataset_loader.py - OC20DatasetLoader for S2EF task

Dataset Classes (3 files):

topobench/data/datasets/oc20_is2re_dataset.py - IS2REDataset handling LMDB data
topobench/data/datasets/oc22_is2re_dataset.py - OC22IS2REDataset with multiple LMDB files
topobench/data/datasets/oc20_dataset.py - OC20Dataset for S2EF with configurable splits

Configuration Files (6 files):

configs/dataset/graph/OC20_IS2RE.yaml
configs/dataset/graph/OC22_IS2RE.yaml
configs/dataset/graph/OC20_S2EF_200K.yaml
configs/dataset/graph/OC20_S2EF_2M.yaml
configs/dataset/graph/OC20_S2EF_20M.yaml
configs/dataset/graph/OC20_S2EF_all.yaml
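
The S2EF config file names above follow one pattern per training split. A minimal sketch of that naming, assuming a hypothetical helper `s2ef_config_path` that is not part of this PR and only mirrors the file names listed here:

```python
# Training splits named in the PR description for the OC20 S2EF task.
S2EF_TRAIN_SPLITS = ("200K", "2M", "20M", "all")


def s2ef_config_path(train_split):
    """Return the config path listed in this PR for a given S2EF train split.

    Hypothetical helper for illustration; the PR itself may select configs
    differently.
    """
    if train_split not in S2EF_TRAIN_SPLITS:
        raise ValueError(
            f"unknown train split {train_split!r}; expected one of {S2EF_TRAIN_SPLITS}"
        )
    return f"configs/dataset/graph/OC20_S2EF_{train_split}.yaml"


for split in S2EF_TRAIN_SPLITS:
    print(s2ef_config_path(split))
```

Rejecting unknown split names early keeps a typo such as `2m` from silently resolving to a file that does not exist.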

Test Coverage (21 tests total):

test/data/load/test_oc20_datasets.py - 17 unit tests covering:
- Loader initialization and dataset loading
- Data item access and validation
- Split indices validity and non-overlap
- Integration with PreProcessor pipeline
- Multiple train split configurations for S2EF
test/pipeline/test_pipeline.py
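
The "split indices validity and non-overlap" item can be illustrated with a small standalone check. This is a sketch under assumptions, not the test code from this PR: the `check_splits` name and the dict-of-lists layout for split indices are hypothetical.

```python
from itertools import combinations


def check_splits(split_idx, num_samples):
    """Assert that every split is in range, duplicate-free, and disjoint from the others.

    split_idx: dict mapping a split name to a list of integer sample indices
    (an assumed layout, chosen for illustration).
    """
    for name, idx in split_idx.items():
        assert all(0 <= i < num_samples for i in idx), f"{name}: index out of range"
        assert len(set(idx)) == len(idx), f"{name}: duplicate indices"
    # Compare every pair of splits exactly once.
    for (name_a, idx_a), (name_b, idx_b) in combinations(split_idx.items(), 2):
        assert not set(idx_a) & set(idx_b), f"{name_a} and {name_b} overlap"
    return True


splits = {"train": [0, 1, 2, 3], "valid": [4, 5], "test": [6, 7]}
check_splits(splits, num_samples=8)
```

A split that shares an index with another one, for example `valid` containing an index already in `train`, fails the pairwise assertion.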

Issue

This PR addresses TDL Challenge 2025 - Category A1: Broadening Benchmarks with Graph Datasets.

Additional context

Datasets Source: Open Catalyst Project

References:

Chanussot, L., et al. (2021). "Open Catalyst 2020 (OC20) Dataset and Community Challenges." ACS Catalysis.
Tran, R., et al. (2023). "The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts." ACS Catalysis.

Testing Notes:

All tests use max_samples=100 for efficiency
Users should set max_samples: null in configs for full dataset training
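
As a rough illustration of the `max_samples` behaviour described above, the sketch below treats `None` (YAML `null`) as "use everything" and any integer as a cap. The `select_indices` function is hypothetical, and taking the first N samples is an assumption; the PR does not state how the subset is chosen.

```python
def select_indices(num_total, max_samples=None):
    """Return the sample indices to use, capped at max_samples when it is set.

    max_samples=None corresponds to `max_samples: null` in a config and keeps
    the full dataset. Taking the first N indices is an assumption made for
    this sketch.
    """
    if max_samples is None:
        return list(range(num_total))
    if max_samples < 0:
        raise ValueError("max_samples must be non-negative or None")
    return list(range(min(num_total, max_samples)))


print(len(select_indices(1000, max_samples=100)))
print(len(select_indices(1000, max_samples=None)))
```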


theosaulus added 8 commits November 17, 2025 15:17

initial commit for OC20/22

6f16d4d

preprocessing still not fully working

42e7ad8

preprocessing seems to work now although slow

2d40f9d

code cleaning and separating different functions in different files to respect the code tree

869894d

IS2RE works

b290b42

format

bf3d279

renaming and tests

42adabf

keep some files untouched

01f082e

theosaulus merged commit de9f3e7 into main Nov 22, 2025
