v2.0.0
Performance
- Parser 2-3x faster: Significant optimizations to structure parsing, especially for symmetric assemblies
- Cache loading 3-5x faster: Improved pickle/gzip cache handling with 2-level directory sharding for better filesystem performance
- Vectorized annotations: add_pn_unit_iid_annotation() now uses boolean masks instead of expensive subarray operations (10-100x speedup on symmetric assemblies)
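As a rough illustration of what 2-level directory sharding for a pickle/gzip cache can look like, here is a minimal standard-library sketch. The hashing scheme, file extension, and the helper names sharded_cache_path, save_to_cache, and load_from_cache are assumptions made for this example; the notes do not describe the library's actual cache layout.

```python
import gzip
import hashlib
import pickle
import tempfile
from pathlib import Path


def sharded_cache_path(cache_root: Path, key: str) -> Path:
    """Map a cache key to <root>/<aa>/<bb>/<digest>.pkl.gz (hypothetical layout)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Two directory levels taken from the digest prefix keep any single
    # directory from accumulating a very large number of entries.
    return cache_root / digest[:2] / digest[2:4] / f"{digest}.pkl.gz"


def save_to_cache(cache_root: Path, key: str, obj: object) -> Path:
    """Pickle `obj`, gzip it, and write it to its sharded location."""
    path = sharded_cache_path(cache_root, key)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def load_from_cache(cache_root: Path, key: str) -> object:
    """Read back an object previously written by save_to_cache."""
    with gzip.open(sharded_cache_path(cache_root, key), "rb") as handle:
        return pickle.load(handle)


# Usage: round-trip a small object through a temporary cache directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    written = save_to_cache(root, "example-key", {"n_atoms": 3})
    print(written.relative_to(root).parts)  # two shard directories, then the file
    print(load_from_cache(root, "example-key"))
```

With two hex characters per level, this layout spreads entries over up to 256 x 256 directories, which is the usual motivation for sharding on filesystems that slow down with very large directories.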
Breaking Changes
Dataset Module Restructuring
The dataset module has been restructured to align with TorchVision/TorchAudio and HuggingFace conventions, using a dataset/loader pattern:
- Removed dataset.dataset nesting: Datasets are now flat; access data directly from the dataset object
- MetadataRowParser deprecated: The StructuralDatasetWrapper + dataset_parser pattern is replaced with a loader parameter directly on datasets (backwards-compatible but deprecated)
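The dataset/loader pattern itself can be sketched with toy classes. ToyRowDataset and make_toy_loader below are hypothetical stand-ins written for this illustration, not atomworks code; they only show the idea that the dataset holds flat rows and a loader callable decides how a row becomes an example.

```python
from typing import Any, Callable, Mapping, Optional, Sequence

Row = Mapping[str, Any]


class ToyRowDataset:
    """Hypothetical flat, row-backed dataset (not the library's implementation)."""

    def __init__(self, rows: Sequence[Row], loader: Optional[Callable[[Row], Any]] = None):
        self.rows = rows
        self.loader = loader

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, index: int) -> Any:
        row = self.rows[index]
        # Without a loader the raw row is returned; with one, the loader
        # turns the metadata row into an example.
        return row if self.loader is None else self.loader(row)


def make_toy_loader(example_id_colname: str, path_colname: str) -> Callable[[Row], dict]:
    """Build a loader closure; a real loader would parse the file at the path."""

    def load(row: Row) -> dict:
        return {"example_id": row[example_id_colname], "path": row[path_colname]}

    return load


rows = [{"example_id": "ex1", "path": "ex1.cif"}]
dataset = ToyRowDataset(rows, loader=make_toy_loader("example_id", "path"))
print(dataset[0])
```

Keeping the loader as a plain callable means there is no wrapper object around the dataset, which is the flattening the notes describe.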
Migration example:
# Old (deprecated)
from atomworks.ml.datasets import StructuralDatasetWrapper, PandasDataset
from atomworks.ml.datasets.parsers import PNUnitsDFParser

dataset = StructuralDatasetWrapper(
    dataset=PandasDataset(data="df.parquet"),
    dataset_parser=PNUnitsDFParser(...)
)

# New
from atomworks.ml.datasets import PandasDataset
from atomworks.ml.datasets.loaders import create_base_loader

dataset = PandasDataset(
    data="df.parquet",
    loader=create_base_loader(
        example_id_colname="example_id",
        path_colname="path",
    )
)
Parser Changes
- CCD mirror path validation: ccd_mirror_path now raises FileNotFoundError if the path doesn't exist. Pass None explicitly to use Biotite's bundled CCD
- build_assembly="_spoof" removed: Use "all" instead (raises deprecation warning)
- convert_mse_to_met default changed: Now True by default (was False)
- STANDARD_PARSER_ARGS renamed: Was DEFAULT_PARSE_KWARGS; now uses tuples instead of lists for hashability
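The move from lists to tuples matters because only hashable values can serve as dictionary keys or as arguments to cached functions. The sketch below uses made-up argument sets (the real contents of STANDARD_PARSER_ARGS are not shown in these notes) and a toy cached function, describe, to show the difference.

```python
from functools import lru_cache

# Hypothetical argument sets for illustration only.
args_as_tuples = (("build_assembly", ("1", "2")), ("convert_mse_to_met", True))
args_as_lists = (("build_assembly", ["1", "2"]), ("convert_mse_to_met", True))


def is_hashable(value: object) -> bool:
    """Return True if `value` (including everything nested in it) can be hashed."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


@lru_cache(maxsize=None)
def describe(args: tuple) -> str:
    """Toy cached function: lru_cache requires hashable arguments."""
    return ", ".join(f"{name}={value!r}" for name, value in args)


print(is_hashable(args_as_tuples))  # True
print(is_hashable(args_as_lists))   # False: the nested list is unhashable
print(describe(args_as_tuples))
```

Calling describe(args_as_lists) would raise TypeError, which is the kind of failure that tuple-valued arguments avoid.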
Environment Changes
- Removed automatic .env loading: dotenv is no longer auto-loaded on import. Call load_dotenv() explicitly if needed:
    from dotenv import load_dotenv
    load_dotenv()
Removed Exports
- monkey_patch_atomarray removed from top-level exports. Use from atomworks.biotite_patch import monkey_patch_biotite instead
Added
New Modules
- atomworks.ml.conditions - Unified conditioning management for model training
- atomworks.ml.preprocessing.msa - MSA preprocessing (organize, filter, generate)
- atomworks.ml.executables - External executable management (hbplus, hhfilter, mmseqs2, x3dna)
- atomworks.ml.transforms.design_task - Design task transforms
- atomworks.ml.transforms.mask_generator - Mask generation for training
- atomworks.ml.utils.condition - Condition utilities
- atomworks.io.utils.compression - Compression utilities (zstd support)
New Dataset Classes
- FileDataset - Each file is one example (extracted from old monolithic datasets.py)
- PandasDataset - DataFrame-backed dataset with loader support
New Loader Functions
- create_base_loader() - Standard CIF loading
- create_loader_with_query_pn_units() - Loading with PN unit queries
- create_loader_with_interfaces_and_pn_units_to_score() - Interface scoring loader
New Constants
- PROTEIN_BACKBONE_ATOM_NAMES - Backbone atoms including OXT
- RNA_BACKBONE_ATOM_NAMES - Sugar-phosphate + 2' hydroxyl atoms
- DNA_BACKBONE_ATOM_NAMES - Sugar-phosphate atoms
- NUCLEIC_ACID_BACKBONE_ATOM_NAMES - Union of RNA+DNA backbones
- MASKED - Token code for masked positions
- MSAFileExtension enum - Supported MSA file formats
- Expanded METAL_ELEMENTS - Now includes lanthanides and actinides
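To make the relationship between the nucleic acid backbone constants concrete, the sketch below builds analogous sets from standard PDB atom names. The values, the frozenset container, and the backbone_mask helper are assumptions for illustration; the library's actual constants may differ in content or type.

```python
# Illustrative values using standard PDB atom names (assumed, not copied
# from the library).
SUGAR_PHOSPHATE = frozenset(
    {"P", "OP1", "OP2", "O5'", "C5'", "C4'", "O4'", "C3'", "O3'", "C2'", "C1'"}
)
DNA_BACKBONE = SUGAR_PHOSPHATE
RNA_BACKBONE = SUGAR_PHOSPHATE | {"O2'"}  # adds the 2' hydroxyl oxygen
NUCLEIC_ACID_BACKBONE = RNA_BACKBONE | DNA_BACKBONE


def backbone_mask(atom_names, backbone_names):
    """Return one boolean per atom: True if its name is a backbone atom name."""
    return [name in backbone_names for name in atom_names]


atom_names = ["P", "C1'", "N9", "O2'"]
print(backbone_mask(atom_names, DNA_BACKBONE))           # [True, True, False, False]
print(backbone_mask(atom_names, NUCLEIC_ACID_BACKBONE))  # [True, True, False, True]
```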
New Features
- AtomArrayPlus support in parser - Extended atom array with additional metadata
- Spawn multiprocessing support for data loading
- zstd compression support for MSA files
- Atom37 encoding with atomization support
- JSON-level atom selection for bonds argument
Fixed
- Residue starts bug with dependent functions
- SASA calculation for empty amino acid arrays
- Null handling in A3M files
- Design tasks with zero frequency now handled gracefully instead of erroring
- Non-uniform shard sizes handling
- Pickling during data loading with spawn multiprocessing
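For context on the spawn-related fix: with the spawn start method, objects handed to worker processes are transferred by pickling, so anything unpicklable in a dataset or loader fails at worker startup. The check below is a generic standard-library sketch, not the library's fix; can_pickle is a hypothetical helper name.

```python
import functools
import operator
import pickle


def can_pickle(obj: object) -> bool:
    """Return True if `obj` survives pickle.dumps, as spawn workers require."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Picklable: built entirely from importable pieces.
double = functools.partial(operator.mul, 2)
# Not picklable: a lambda has no importable name for pickle to reference.
double_lambda = lambda x: 2 * x

print(can_pickle(double))         # True
print(can_pickle(double_lambda))  # False
```

Running a check like this on custom loaders or transforms before starting workers can turn an opaque worker crash into an early, readable failure.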
Changed
- Loaders module restructured from loaders.py to loaders/ subpackage (imports still work via __init__.py)
- Parser cache structure now uses 2-level sharding (old caches automatically regenerated)
Deprecated
- atomworks.ml.datasets.parsers module - Use loaders instead
- StructuralDatasetWrapper - Use loader parameter on datasets directly
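When migrating, it can help to detect deprecated call sites in your own code. The sketch below assumes the deprecated entry points emit DeprecationWarning, which these notes state explicitly only for build_assembly="_spoof"; legacy_wrapper is a hypothetical stand-in rather than the library's code.

```python
import warnings


def legacy_wrapper(*args, **kwargs):
    """Hypothetical stand-in for a deprecated entry point."""
    warnings.warn(
        "legacy_wrapper is deprecated; pass a loader to the dataset instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return args, kwargs


def uses_deprecated_api(func, *args, **kwargs) -> bool:
    """Run `func` and report whether it emitted a DeprecationWarning."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones the default filters would hide.
        warnings.simplefilter("always")
        func(*args, **kwargs)
    return any(issubclass(w.category, DeprecationWarning) for w in caught)


print(uses_deprecated_api(legacy_wrapper))  # True
print(uses_deprecated_api(len, []))         # False
```

In a test suite, a stricter alternative is warnings.simplefilter("error", DeprecationWarning), which turns each deprecated call into an exception.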
See CHANGELOG.md for full history.