Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rent path
- Add update_row() to WritableBackend with an optimized LMDBBackend override
- Split LMDBBackend into LMDBReadOnlyBackend + LMDBBackend hierarchy
- Add get_with_txn() to BytesIO for batched single-transaction reads
- Add backend registry (_registry.py) with glob pattern matching
- Fix index bounds checking and negative index normalization
- Fix empty list handling in RowView.__getitem__
- Fix BytesIO.__delitem__ to accept negative indices
- Add ViewParent Protocol and __bool__ to views
- Expose env property on LMDB backends
- Update all tests from .db to .lmdb extension
- Add iter_rows() to ReadableBackend for streaming row access
- LMDB override streams within a single read transaction
- RowView.__iter__ now streams (safe for TB-scale datasets)
- Add RowView.chunked() for batched throughput iteration
- Add ASEIO._validate_keys() enforcing calc.*/info.*/arrays.* namespaces
- Update README with new API (views, column access, chunked, readonly)
- Regenerate benchmark_comparison.png
- ASEReadOnlyBackend wraps ase.io.read for .traj, .xyz, .extxyz files
- Lazy per-frame loading with configurable LRU cache (default 1000)
- __len__ raises RuntimeError until count_frames() is called
- iter_rows streams via ase.io.iread for sequential access
- Registry auto-detects read-only backends (readonly: bool | None)
- ASEIO.__getitem__ handles unknown-length backends gracefully
- Remove ASEIO.get() and get_available_keys() (DRY: use db[i] and db.columns instead)
- LMDB backend raises IndexError via bounds check, not string matching
- ASE backend uses TypeError for unknown length (list() compatible)
- ASEIO.__iter__ uses IndexError sentinel (works without len())
- Iterating discovers length as a side effect of hitting end-of-file
- Remove incorrect _length inference from _read_frame (it set the length to an out-of-bounds index on IndexError)
- Fix ASEIO.columns to work on unknown-length backends by catching TypeError
- Update __len__ error references from RuntimeError to TypeError in docs
- Fix _cache_put to update the value on re-insert (not just move_to_end)
- Expand the iter_rows iread optimization to any sorted sequence starting from 0
- Normalize negative indices to positive cache keys when the length is known
- Add a public set_length() method; ASEIO.__iter__ calls it after iteration
- Add tests for cache normalization, set_length, and columns on unknown length
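The _cache_put fix described above is the standard OrderedDict LRU pattern; a minimal sketch (class and method names hypothetical, not the actual asebytes implementation):

```python
from collections import OrderedDict

class FrameCache:
    """Minimal LRU cache sketch: re-inserting a key must update the
    stored value, not merely refresh its recency."""

    def __init__(self, maxsize=1000):
        self._data = OrderedDict()
        self._maxsize = maxsize

    def put(self, key, value):
        if key in self._data:
            # The bug being fixed: move_to_end alone keeps the stale value.
            self._data.move_to_end(key)
        self._data[key] = value  # always assign, so re-inserts overwrite
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict least recently used

    def get(self, key):
        value = self._data[key]
        self._data.move_to_end(key)  # mark as recently used
        return value
```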
Add set_length() as a no-op default on the protocol instead of using hasattr duck-typing in ASEIO.__iter__. Backends with lazy length discovery (ASEReadOnlyBackend) override it.
Remove set_length from the protocol and ASEIO.__iter__. Instead, the ASE backend tracks _max_read and auto-discovers _length when a sequential read fails at exactly _max_read + 1. Each layer manages its own concerns.
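The mechanism above — iteration needs no len(), and the length falls out as a side effect of walking off the end — can be sketched like this (class and attribute names are illustrative, not the actual asebytes code):

```python
class SequentialSource:
    """Toy backend with unknown length: read_row raises IndexError at EOF."""

    def __init__(self, frames):
        self._frames = frames
        self._length = None     # unknown until discovered
        self._max_read = -1     # highest index read successfully so far

    def read_row(self, i):
        try:
            row = self._frames[i]
        except IndexError:
            # A failure at exactly _max_read + 1 means we just walked off
            # the end of a sequential scan, so the length is now known.
            if i == self._max_read + 1:
                self._length = i
            raise
        self._max_read = max(self._max_read, i)
        return row

def iter_all(src):
    """IndexError-sentinel iteration: works without len()."""
    i = 0
    while True:
        try:
            yield src.read_row(i)
        except IndexError:
            return
        i += 1
```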
- Replace assert with explicit RuntimeError for type safety (#2)
- Fall back to per-frame reads when iter_rows gets duplicate indices (#1)
- Use an O(n) sorted check instead of an O(n log n) sorted() comparison
- Document the IndexError contract in ReadableBackend.read_row (#3)
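The O(n) sorted check mentioned above is a single pairwise pass, which also exits early on the first out-of-order pair instead of building a sorted copy for comparison; a minimal sketch:

```python
def is_sorted(seq):
    """Single O(n) pass with early exit; cheaper than the
    O(n log n) idiom list(seq) == sorted(seq)."""
    for a, b in zip(seq, seq[1:]):
        if a > b:
            return False
    return True
```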
- Track _max_read in the iter_rows streaming path so length auto-discovery works after streaming followed by a per-frame read
- Remove _backend.count_frames() from the README example (private attribute)
Add _URI_REGISTRY and parse_uri() to support URI-style paths (e.g. hf://user/dataset) alongside existing glob patterns. URIs are checked first in get_backend_cls(), preserving full backward compatibility. URI backends are read-only; requesting readonly=False raises TypeError. Tests for HuggingFaceBackend import will pass once the backend class is implemented in a subsequent task.
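The two-tier lookup described above — URI schemes checked first, glob patterns as the fallback — might look roughly like this (registry contents and return values are placeholders, not the real asebytes registry):

```python
import fnmatch

# Placeholder registries; the real ones map to backend classes.
_URI_REGISTRY = {"hf": "HuggingFaceBackend"}
_GLOB_REGISTRY = {"*.lmdb": "LMDBBackend", "*.h5md": "H5MDBackend"}

def parse_uri(path):
    """Split 'scheme://rest' into (scheme, rest); return None for plain paths
    and for malformed URIs with an empty scheme or empty remainder."""
    scheme, sep, rest = path.partition("://")
    if not sep or not scheme or not rest:
        return None
    return scheme, rest

def get_backend_cls(path, readonly=None):
    parsed = parse_uri(path)
    if parsed is not None and parsed[0] in _URI_REGISTRY:
        if readonly is False:
            # URI backends are read-only by design.
            raise TypeError(f"{parsed[0]}:// backends do not support readonly=False")
        return _URI_REGISTRY[parsed[0]]
    for pattern, cls in _GLOB_REGISTRY.items():  # glob fallback, unchanged behavior
        if fnmatch.fnmatch(path, pattern):
            return cls
    raise ValueError(f"no backend registered for {path!r}")
```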
Implements ReadableBackend for HuggingFace datasets with two modes:
- Downloaded: random access with known length via Dataset.__getitem__
- Streaming: sequential access via IterableDataset with auto-discovered length

Includes a from_uri class method supporting hf://, colabfit://, and optimade:// URI schemes with automatic mapping selection and org prepending.
When a URI string (e.g. hf://, colabfit://, optimade://) is passed to ASEIO(), the constructor now detects the scheme and delegates to cls.from_uri(uri, **kwargs) instead of calling cls(path, **kwargs) directly. File paths continue to use the direct constructor path.
- Auto-select the first split when load_dataset returns a DatasetDict
- Validate malformed URIs and empty paths in from_uri
- Improve the streaming read_column error message
- Fix cached target row double-read in streaming mode
- Update README with valid dataset examples
- Default streaming=True for URI backends (avoids downloading full datasets)
- Validate malformed URIs and empty paths in from_uri
- Better error for read_column on a streaming backend without indices
- Fix cached target row double-read in streaming mode
- README uses real dataset paths (colabfit://mlearn_Cu_train, optimade://LeMaterial/LeMat-Bulk)
- close() now checks hasattr(iter, 'close') before calling it
- Add tests for close(), context manager, and read-after-close
- Extract _make_dataset into tests/conftest_hf.py (was duplicated in test_hf_backend.py and test_hf_aseio.py)
- Narrow the _probe_length catch to AttributeError/KeyError/TypeError
Covers the read-write H5MD backend built on h5py with:
- Standard H5MD + ZnH5MD extension support (variable particle count, per-frame PBC)
- Discovery-based reading for foreign H5MD files
- Append-only write semantics
- Connectivity via an H5MD group + bond_order extension
- fsspec-compatible file_handle parameter
- h5py chunk cache tuning options
Core dependencies are reduced to just ase. The lmdb, msgpack, and msgpack-numpy packages move to the new [lmdb] extra. BytesIO is relocated into the lmdb/ subpackage so the entire backend is self-contained behind the optional install. The registry now raises helpful ImportError messages when optional backend deps are missing.
Switch all intra-package imports from absolute (from asebytes.X) to relative (from .X / from ..X) per PEP 328. Add module-level __getattr__ so `from asebytes import BytesIO` raises a helpful ImportError with install instructions when lmdb is not installed. Add 12 mock-based tests for both registry and module-level error hint paths.
Replace repetitive test methods with @pytest.mark.parametrize for both registry and __getattr__ hint paths. Now covers all 7 _OPTIONAL_ATTRS entries (was missing COLABFIT/OPTIMADE) and all 5 registered path/URI patterns.
Add an H5MD read-write backend using h5py with:
- Full round-trip for positions, numbers, cell, pbc, calc results, info, arrays
- Variable particle count via NaN padding (znh5md compatible)
- H5MD-standard connectivity in /connectivity/bonds (int32, -1 fill) and /connectivity/bond_orders (float64, NaN fill) with a particles_group ref
- Backward-compatible reading of znh5md's non-standard connectivity
- Registry entries for *.h5 and *.h5md extensions
- Optional dependency via asebytes[h5md]
- Comprehensive test suite (44 tests) and benchmarks
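The variable-particle-count scheme above pads every frame out to the widest frame with NaN; a sketch of the pad/strip pair (function names are illustrative — a shared strip_nan_padding helper appears in a later refactor — and this sketch assumes no real row is ever all-NaN):

```python
import numpy as np

def pad_with_nan(frames, width):
    """Stack per-frame (n_i, 3) arrays into one (n_frames, width, 3) block,
    filling absent rows with NaN (znh5md-compatible layout)."""
    out = np.full((len(frames), width, 3), np.nan)
    for i, frame in enumerate(frames):
        out[i, : len(frame)] = frame
    return out

def strip_nan_padding(row):
    """Recover the original (n_i, 3) array by dropping all-NaN rows."""
    keep = ~np.isnan(row).all(axis=1)
    return row[keep]
```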
…nnectivity
- Add author_name/author_email constructor kwargs (only written when non-None)
- Use dynamic asebytes version from importlib.metadata for creator version
- Add list_groups() static method for discovering particles groups
- Namespace connectivity under connectivity/{grp_name}/ for multi-trajectory support
- Remove duplicate _write_connectivity method that used flat paths
- Fix read_column to preserve the caller's index order instead of returning sorted order (sort-then-reorder strategy matching read_rows)
- Add H5MDBackend to the __init__.py conditional import and __getattr__ hints so missing h5py gives a helpful install message
- Add asebytes.h5md._backend to _EXTRAS_HINT for parity with the HF backend
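The sort-then-reorder fix works because h5py fancy indexing requires indices in increasing order; reading sorted and then scattering back via an inverse argsort restores the caller's order. A sketch with a plain numpy array standing in for the h5py dataset (function name hypothetical):

```python
import numpy as np

def read_in_caller_order(dataset, indices):
    """Read rows from a store that requires sorted indices (as h5py does),
    then restore the caller's requested order."""
    idx = np.asarray(indices)
    order = np.argsort(idx, kind="stable")
    sorted_vals = np.asarray(dataset)[idx[order]]  # the sorted read
    out = np.empty_like(sorted_vals)
    out[order] = sorted_vals  # scatter back to original positions
    return out
```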
Persistent read-through cache backed by any WritableBackend. On read, the cache is checked first; on a miss the source is read and the result written to the cache. Accepts a WritableBackend instance or a string path (auto-resolved via the registry).
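The read-through flow described above — cache hit, or read source and write back — reduces to a small pattern; a sketch with plain mappings standing in for backends (class name hypothetical, and the best-effort write matches the behavior adopted in the review fixes below):

```python
class ReadThroughCache:
    """Sketch of the cache_to pattern: check the cache first; on a miss,
    read from the source and write the result back, best effort."""

    def __init__(self, source, cache):
        self._source = source  # readable mapping: index -> row
        self._cache = cache    # writable mapping used as the cache

    def read_row(self, i):
        try:
            return self._cache[i]  # hit: serve from cache
        except KeyError:
            row = self._source[i]  # miss: fall through to the source
            try:
                self._cache[i] = row  # a failed write must not break the read
            except Exception:
                pass
            return row
```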
Address code review findings:
- Add type validation for the cache_to argument (TypeError on a bad type)
- Make cache writes best-effort (catch exceptions to avoid breaking reads)
- Warn when cache_to is used with a writable source (stale data risk)
- Update the ASEIO docstring with cache_to parameter docs
- Update README with H5MD backend and cache_to examples
- Fix the design doc to match the implementation (no internal ASEIO)
Restructure from a flat API list to feature-oriented sections: quick start, lazy views, cache_to, HuggingFace, H5MD, key convention, and custom backends.
Redesign benchmarks with realistic data (ethanol+calc, LeMat-Traj), six backends (asebytes LMDB/H5MD, aselmdb, znh5md, extxyz, sqlite), five operations (write, read, random access, column access, file size), and one figure per operation.
…igures Rewrite benchmarks to use two datasets (ethanol small molecules + LeMat-Traj periodic structures with variable atom counts) across 6 backends. Add column access and file size benchmarks, a download script for HF data, and one figure per operation. Update README with all five benchmark figures.
…abel
- Fix the file size benchmark writing to the same path across rounds (inflated sizes ~5x); use a unique path per iteration like the write benchmarks
- Add hatching to distinguish ethanol (solid) vs lemat (hatched) bars
- Rename column access znh5md → h5py since it uses h5py directly
- Simplify the download script to use list(src[:1000])
- Add an HDF5 compression note and a LeMat-Traj snippet to README
Machine-specific benchmark output should not be tracked.
Adds a new Zarr-based storage backend using a custom flat layout where each asebytes column maps directly to a Zarr array. Uses Blosc/LZ4 compression for fast I/O. Includes a full test suite (29 tests), inconsistent calc tests, and benchmark integration.
Add asebytes Zarr to the visualization script, regenerate all benchmark figures, and update README with Zarr install instructions, a backend table entry, and a usage section.
- Add 3 test files (dataset roundtrip, inconsistent data, IO operations) with 69 tests parametrized over lmdb/h5/zarr backends
- Add s22-based fixtures to conftest.py; move the db_path fixture from test_inconsistent_calc.py to the shared conftest
- Fix H5MD/Zarr JSON serialization: add a recursive _jsonable() helper so nested numpy arrays inside dicts/lists serialize correctly
- Fix H5MD/Zarr string round-trip: JSON-encode all string/dict/list values uniformly so plain strings aren't confused with JSON-encoded dicts on read-back
- Fix the ASEIO negative index bounds check: io[-N] with N > len now raises IndexError instead of wrapping around in columnar backends
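A recursive helper of the kind described above walks containers and converts numpy objects before handing the result to json.dumps; a minimal sketch (the real _jsonable may differ in detail):

```python
import json
import numpy as np

def jsonable(value):
    """Recursively convert numpy arrays and scalars nested inside
    dicts/lists/tuples into plain Python types so json.dumps succeeds."""
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, np.generic):   # np.int64, np.float64, ...
        return value.item()
    if isinstance(value, dict):
        return {k: jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [jsonable(v) for v in value]
    return value  # already JSON-serializable (str, int, float, bool, None)
```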
- Move jsonable, get_version, strip_nan_padding, and concat_varying to asebytes._columnar so the H5MD and Zarr backends share a single copy
- Seed all RandomState instances in test fixtures for deterministic runs
…r H5MD
Replace per-row __idx__/__keys__ with a packed block index and global schema in BytesIO, reducing LMDB lookups from 3 to 1 per read. Add _PostProc type tags for H5MD to bypass isinstance chains and json.loads attempts on numeric data. Increase the h5py chunk cache to 64 MB for better random-access performance. LMDB column access is ~50% faster, H5MD column access ~35% faster, and H5MD sequential reads up to 30% faster.

BREAKING: the on-disk LMDB format changed; old .lmdb files must be rewritten.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>