Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rent path
- Add update_row() to WritableBackend with an optimized LMDBBackend override
- Split LMDBBackend into LMDBReadOnlyBackend + LMDBBackend hierarchy
- Add get_with_txn() to BytesIO for batched single-transaction reads
- Add backend registry (_registry.py) with glob pattern matching
- Fix index bounds checking and negative index normalization
- Fix empty list handling in RowView.__getitem__
- Fix BytesIO.__delitem__ to accept negative indices
- Add ViewParent Protocol and __bool__ to views
- Expose env property on LMDB backends
- Update all tests from .db to .lmdb extension
- Add iter_rows() to ReadableBackend for streaming row access
- LMDB override streams within a single read transaction
- RowView.__iter__ now streams (safe for TB-scale datasets)
- Add RowView.chunked() for batched throughput iteration
- Add ASEIO._validate_keys() enforcing calc.*/info.*/arrays.* namespaces
- Update README with new API (views, column access, chunked, readonly)
- Regenerate benchmark_comparison.png
- ASEReadOnlyBackend wraps ase.io.read for .traj, .xyz, .extxyz files
- Lazy per-frame loading with configurable LRU cache (default 1000)
- __len__ raises RuntimeError until count_frames() is called
- iter_rows streams via ase.io.iread for sequential access
- Registry auto-detects read-only backends (readonly: bool | None)
- ASEIO.__getitem__ handles unknown-length backends gracefully
- Remove ASEIO.get() and get_available_keys() (DRY: use db[i] and db.columns instead)
- LMDB backend raises IndexError via bounds check, not string matching
- ASE backend uses TypeError for unknown length (list() compatible)
- ASEIO.__iter__ uses IndexError sentinel (works without len())
- Iterating discovers length as a side effect of hitting end-of-file
- Remove incorrect _length inference from _read_frame (it set the length to an out-of-bounds index on IndexError)
- Fix ASEIO.columns to work on unknown-length backends by catching TypeError
- Update __len__ error references from RuntimeError to TypeError in docs
- Fix _cache_put to update the value on re-insert (not just move_to_end)
- Expand the iter_rows iread optimization to any sorted sequence starting from 0
- Normalize negative indices to positive cache keys when the length is known
- Add a public set_length() method; ASEIO.__iter__ calls it after iteration
- Add tests for cache normalization, set_length, and columns on unknown length
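The _cache_put fix described above is the standard OrderedDict LRU pattern; a minimal sketch (class and method names hypothetical, not the actual asebytes implementation):

```python
from collections import OrderedDict

class FrameCache:
    """Minimal LRU cache sketch: re-inserting a key must update the
    stored value, not merely refresh its recency."""

    def __init__(self, maxsize=1000):
        self._data = OrderedDict()
        self._maxsize = maxsize

    def put(self, key, value):
        if key in self._data:
            # The bug being fixed: move_to_end alone keeps the stale value.
            self._data.move_to_end(key)
        self._data[key] = value  # always assign, so re-inserts overwrite
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict least recently used

    def get(self, key):
        value = self._data[key]
        self._data.move_to_end(key)  # mark as recently used
        return value
```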
Add set_length() as a no-op default on the protocol instead of using hasattr duck-typing in ASEIO.__iter__. Backends with lazy length discovery (ASEReadOnlyBackend) override it.
Remove set_length from the protocol and ASEIO.__iter__. Instead, the ASE backend tracks _max_read and auto-discovers _length when a sequential read fails at exactly _max_read + 1. Each layer manages its own concerns.
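The mechanism above — iteration needs no len(), and the length falls out as a side effect of walking off the end — can be sketched like this (class and attribute names are illustrative, not the actual asebytes code):

```python
class SequentialSource:
    """Toy backend with unknown length: read_row raises IndexError at EOF."""

    def __init__(self, frames):
        self._frames = frames
        self._length = None     # unknown until discovered
        self._max_read = -1     # highest index read successfully so far

    def read_row(self, i):
        try:
            row = self._frames[i]
        except IndexError:
            # A failure at exactly _max_read + 1 means we just walked off
            # the end of a sequential scan, so the length is now known.
            if i == self._max_read + 1:
                self._length = i
            raise
        self._max_read = max(self._max_read, i)
        return row

def iter_all(src):
    """IndexError-sentinel iteration: works without len()."""
    i = 0
    while True:
        try:
            yield src.read_row(i)
        except IndexError:
            return
        i += 1
```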
- Replace assert with explicit RuntimeError for type safety (#2)
- Fall back to per-frame reads when iter_rows gets duplicate indices (#1)
- Use an O(n) sorted check instead of an O(n log n) sorted() comparison
- Document the IndexError contract in ReadableBackend.read_row (#3)
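The O(n) sorted check mentioned above is a single pairwise pass, which also exits early on the first out-of-order pair instead of building a sorted copy for comparison; a minimal sketch:

```python
def is_sorted(seq):
    """Single O(n) pass with early exit; cheaper than the
    O(n log n) idiom list(seq) == sorted(seq)."""
    for a, b in zip(seq, seq[1:]):
        if a > b:
            return False
    return True
```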
- Track _max_read in the iter_rows streaming path so length auto-discovery works after streaming followed by a per-frame read
- Remove _backend.count_frames() from the README example (private attribute)
Add _URI_REGISTRY and parse_uri() to support URI-style paths (e.g. hf://user/dataset) alongside existing glob patterns. URIs are checked first in get_backend_cls(), preserving full backward compatibility. URI backends are read-only; requesting readonly=False raises TypeError. Tests for HuggingFaceBackend import will pass once the backend class is implemented in a subsequent task.
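The two-tier lookup described above — URI schemes checked first, glob patterns as the fallback — might look roughly like this (registry contents and return values are placeholders, not the real asebytes registry):

```python
import fnmatch

# Placeholder registries; the real ones map to backend classes.
_URI_REGISTRY = {"hf": "HuggingFaceBackend"}
_GLOB_REGISTRY = {"*.lmdb": "LMDBBackend", "*.h5md": "H5MDBackend"}

def parse_uri(path):
    """Split 'scheme://rest' into (scheme, rest); return None for plain paths
    and for malformed URIs with an empty scheme or empty remainder."""
    scheme, sep, rest = path.partition("://")
    if not sep or not scheme or not rest:
        return None
    return scheme, rest

def get_backend_cls(path, readonly=None):
    parsed = parse_uri(path)
    if parsed is not None and parsed[0] in _URI_REGISTRY:
        if readonly is False:
            # URI backends are read-only by design.
            raise TypeError(f"{parsed[0]}:// backends do not support readonly=False")
        return _URI_REGISTRY[parsed[0]]
    for pattern, cls in _GLOB_REGISTRY.items():  # glob fallback, unchanged behavior
        if fnmatch.fnmatch(path, pattern):
            return cls
    raise ValueError(f"no backend registered for {path!r}")
```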
Implements ReadableBackend for HuggingFace datasets with two modes:
- Downloaded: random access with known length via Dataset.__getitem__
- Streaming: sequential access via IterableDataset with auto-discovered length

Includes a from_uri class method supporting hf://, colabfit://, and optimade:// URI schemes with automatic mapping selection and org prepending.
When a URI string (e.g. hf://, colabfit://, optimade://) is passed to ASEIO(), the constructor now detects the scheme and delegates to cls.from_uri(uri, **kwargs) instead of calling cls(path, **kwargs) directly. File paths continue to use the direct constructor path.
- Auto-select the first split when load_dataset returns a DatasetDict
- Validate malformed URIs and empty paths in from_uri
- Improve the streaming read_column error message
- Fix cached target row double-read in streaming mode
- Update README with valid dataset examples
- Default streaming=True for URI backends (avoids downloading full datasets)
- Validate malformed URIs and empty paths in from_uri
- Better error for read_column on a streaming backend without indices
- Fix cached target row double-read in streaming mode
- README uses real dataset paths (colabfit://mlearn_Cu_train, optimade://LeMaterial/LeMat-Bulk)
- close() now checks hasattr(iter, 'close') before calling it
- Add tests for close(), context manager, and read-after-close
- Extract _make_dataset into tests/conftest_hf.py (was duplicated in test_hf_backend.py and test_hf_aseio.py)
- Narrow the _probe_length catch to AttributeError/KeyError/TypeError
Covers the read-write H5MD backend built on h5py with:
- Standard H5MD + ZnH5MD extension support (variable particle count, per-frame PBC)
- Discovery-based reading for foreign H5MD files
- Append-only write semantics
- Connectivity via an H5MD group + bond_order extension
- fsspec-compatible file_handle parameter
- h5py chunk cache tuning options
Core dependencies are reduced to just ase. The lmdb, msgpack, and msgpack-numpy packages move to the new [lmdb] extra. BytesIO is relocated into the lmdb/ subpackage so the entire backend is self-contained behind the optional install. The registry now raises helpful ImportError messages when optional backend deps are missing.
Switch all intra-package imports from absolute (from asebytes.X) to relative (from .X / from ..X) per PEP 328. Add module-level __getattr__ so `from asebytes import BytesIO` raises a helpful ImportError with install instructions when lmdb is not installed. Add 12 mock-based tests for both registry and module-level error hint paths.
Replace repetitive test methods with @pytest.mark.parametrize for both registry and __getattr__ hint paths. Now covers all 7 _OPTIONAL_ATTRS entries (was missing COLABFIT/OPTIMADE) and all 5 registered path/URI patterns.
Add an H5MD read-write backend using h5py with:
- Full round-trip for positions, numbers, cell, pbc, calc results, info, arrays
- Variable particle count via NaN padding (znh5md compatible)
- H5MD-standard connectivity in /connectivity/bonds (int32, -1 fill) and /connectivity/bond_orders (float64, NaN fill) with a particles_group ref
- Backward-compatible reading of znh5md's non-standard connectivity
- Registry entries for *.h5 and *.h5md extensions
- Optional dependency via asebytes[h5md]
- Comprehensive test suite (44 tests) and benchmarks
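The variable-particle-count scheme above pads every frame out to the widest frame with NaN; a sketch of the pad/strip pair (function names are illustrative — a shared strip_nan_padding helper appears in a later refactor — and this sketch assumes no real row is ever all-NaN):

```python
import numpy as np

def pad_with_nan(frames, width):
    """Stack per-frame (n_i, 3) arrays into one (n_frames, width, 3) block,
    filling absent rows with NaN (znh5md-compatible layout)."""
    out = np.full((len(frames), width, 3), np.nan)
    for i, frame in enumerate(frames):
        out[i, : len(frame)] = frame
    return out

def strip_nan_padding(row):
    """Recover the original (n_i, 3) array by dropping all-NaN rows."""
    keep = ~np.isnan(row).all(axis=1)
    return row[keep]
```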
…nnectivity
- Add author_name/author_email constructor kwargs (only written when non-None)
- Use dynamic asebytes version from importlib.metadata for creator version
- Add list_groups() static method for discovering particles groups
- Namespace connectivity under connectivity/{grp_name}/ for multi-trajectory support
- Remove duplicate _write_connectivity method that used flat paths
- Fix read_column to preserve the caller's index order instead of returning sorted order (sort-then-reorder strategy matching read_rows)
- Add H5MDBackend to the __init__.py conditional import and __getattr__ hints so missing h5py gives a helpful install message
- Add asebytes.h5md._backend to _EXTRAS_HINT for parity with the HF backend
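The sort-then-reorder fix works because h5py fancy indexing requires indices in increasing order; reading sorted and then scattering back via an inverse argsort restores the caller's order. A sketch with a plain numpy array standing in for the h5py dataset (function name hypothetical):

```python
import numpy as np

def read_in_caller_order(dataset, indices):
    """Read rows from a store that requires sorted indices (as h5py does),
    then restore the caller's requested order."""
    idx = np.asarray(indices)
    order = np.argsort(idx, kind="stable")
    sorted_vals = np.asarray(dataset)[idx[order]]  # the sorted read
    out = np.empty_like(sorted_vals)
    out[order] = sorted_vals  # scatter back to original positions
    return out
```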
Persistent read-through cache backed by any WritableBackend. On read, the cache is checked first; on a miss the source is read and the result written to the cache. Accepts a WritableBackend instance or a string path (auto-resolved via the registry).
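The read-through flow described above — cache hit, or read source and write back — reduces to a small pattern; a sketch with plain mappings standing in for backends (class name hypothetical, and the best-effort write matches the behavior adopted in the review fixes below):

```python
class ReadThroughCache:
    """Sketch of the cache_to pattern: check the cache first; on a miss,
    read from the source and write the result back, best effort."""

    def __init__(self, source, cache):
        self._source = source  # readable mapping: index -> row
        self._cache = cache    # writable mapping used as the cache

    def read_row(self, i):
        try:
            return self._cache[i]  # hit: serve from cache
        except KeyError:
            row = self._source[i]  # miss: fall through to the source
            try:
                self._cache[i] = row  # a failed write must not break the read
            except Exception:
                pass
            return row
```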
Address code review findings:
- Add type validation for the cache_to argument (TypeError on a bad type)
- Make cache writes best-effort (catch exceptions to avoid breaking reads)
- Warn when cache_to is used with a writable source (stale data risk)
- Update the ASEIO docstring with cache_to parameter docs
- Update README with H5MD backend and cache_to examples
- Fix the design doc to match the implementation (no internal ASEIO)
Restructure from a flat API list to feature-oriented sections: quick start, lazy views, cache_to, HuggingFace, H5MD, key convention, and custom backends.
Redesign benchmarks with realistic data (ethanol+calc, LeMat-Traj), six backends (asebytes LMDB/H5MD, aselmdb, znh5md, extxyz, sqlite), five operations (write, read, random access, column access, file size), and one figure per operation.
…igures Rewrite benchmarks to use two datasets (ethanol small molecules + LeMat-Traj periodic structures with variable atom counts) across 6 backends. Add column access and file size benchmarks, a download script for HF data, and one figure per operation. Update README with all five benchmark figures.
…abel
- Fix the file size benchmark writing to the same path across rounds (inflated sizes ~5x); use a unique path per iteration like the write benchmarks
- Add hatching to distinguish ethanol (solid) vs lemat (hatched) bars
- Rename column access znh5md → h5py since it uses h5py directly
- Simplify the download script to use list(src[:1000])
- Add an HDF5 compression note and a LeMat-Traj snippet to README
Machine-specific benchmark output should not be tracked.
Adds a new Zarr-based storage backend using a custom flat layout where each asebytes column maps directly to a Zarr array. Uses Blosc/LZ4 compression for fast I/O. Includes a full test suite (29 tests), inconsistent calc tests, and benchmark integration.
Add asebytes Zarr to the visualization script, regenerate all benchmark figures, and update README with Zarr install instructions, a backend table entry, and a usage section.
- Add 3 test files (dataset roundtrip, inconsistent data, IO operations) with 69 tests parametrized over lmdb/h5/zarr backends
- Add s22-based fixtures to conftest.py; move the db_path fixture from test_inconsistent_calc.py to the shared conftest
- Fix H5MD/Zarr JSON serialization: add a recursive _jsonable() helper so nested numpy arrays inside dicts/lists serialize correctly
- Fix H5MD/Zarr string round-trip: JSON-encode all string/dict/list values uniformly so plain strings aren't confused with JSON-encoded dicts on read-back
- Fix the ASEIO negative index bounds check: io[-N] with N > len now raises IndexError instead of wrapping around in columnar backends
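A recursive helper of the kind described above walks containers and converts numpy objects before handing the result to json.dumps; a minimal sketch (the real _jsonable may differ in detail):

```python
import json
import numpy as np

def jsonable(value):
    """Recursively convert numpy arrays and scalars nested inside
    dicts/lists/tuples into plain Python types so json.dumps succeeds."""
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, np.generic):   # np.int64, np.float64, ...
        return value.item()
    if isinstance(value, dict):
        return {k: jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [jsonable(v) for v in value]
    return value  # already JSON-serializable (str, int, float, bool, None)
```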
- Move jsonable, get_version, strip_nan_padding, and concat_varying to asebytes._columnar so the H5MD and Zarr backends share a single copy
- Seed all RandomState instances in test fixtures for deterministic runs
…r H5MD
Replace per-row __idx__/__keys__ with a packed block index and global schema in BytesIO, reducing LMDB lookups from 3 to 1 per read. Add _PostProc type tags for H5MD to bypass isinstance chains and json.loads attempts on numeric data. Increase the h5py chunk cache to 64 MB for better random-access performance. LMDB column access is ~50% faster, H5MD column access ~35% faster, and H5MD sequential reads up to 30% faster.

BREAKING: the on-disk LMDB format changed; old .lmdb files must be rewritten.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>