Simplify sample handling#292
Merged
jeromekelleher merged 11 commits intosgkit-dev:mainfrom Apr 17, 2026
Merged
Conversation
VczReader's samples kwarg now accepts only None or list[str]; a string input raises TypeError, and a ^-prefixed list element raises ValueError pointing at the new samples_complement flag. The CLI strips leading ^ on -s and -S (setting samples_complement=True) and splits -s strings on commas before constructing VczReader. parse_samples_file is replaced by read_samples_file(path) -> list[str], dropping the previous file -> "^a,b" -> list[str] text roundtrip. parse_samples itself is narrowed to list[str] | None with a complement bool, mirroring dataframe_to_ranges for regions/targets.
Restructure tests/test_samples.py into focused classes covering None, list, duplicates, complement, unknown samples, and missing-header entries. Coverage of vcztools/samples.py reaches 100% from this file alone. Two bugs were surfaced and fixed along the way: - parse_samples([]) previously built a float64 numpy array (np.array([]) defaults to float) which then blew up in _as_fixed_length_unicode. Construct sample_ids with StringDType explicitly so empty lists round through cleanly. - The missing-header path ran np.setdiff1d(int_selection, string_masked_ids); the dtype mismatch silently left masked samples in the output. Use np.flatnonzero to work with the integer indices of the masked entries.
parse_samples now always returns a concrete int index array — np.arange(len(all_samples)) when samples=None — instead of the overloaded (all_samples, None) shape. Downstream code in retrieval.py and vcf_writer.py stops checking for None; the orthogonal "was a subset requested?" signal moves to an explicit VczReader.subsetting_samples bool, which gates AC/AN recompute and header forcing in write_vcf. get_vchunk_array is split into get_vchunk_array (no samples axis) and get_vchunk_call_array (samples axis), with VariantChunkReader dispatching on the "call_" key prefix. The degenerate "subset resolved to zero samples" case (e.g. --force-samples dropped every name) still reads the full call arrays so AC/AN can be recomputed over the complete genotype set, matching bcftools. The 2D filter mask path in variant_chunk_iter now subsets the sample axis only when subsetting_samples is true, preserving per-allele masks like the ones produced by INFO/AC filters.
42bde37 to
bc359ba
Compare
Fold the three module-level iterator functions (variant_chunk_index_iter, variant_chunk_index_iter_with_filtering, variant_chunk_iter) into VczReader.variant_chunks, and split the per-chunk computation into four named stages: - _chunk_indexes(): which chunks to visit, with region-index pruning when regions are set. - _chunk_region_mask() / _chunk_mask(): per-chunk boolean mask combining regions/targets with the filter expression. - _reduce_chunk_mask(): split the mask into (variants_selection, call_mask), applying the sample subset when one was requested. - VariantChunkReader.get_chunk_data(): load the query fields with those selections. The region-only path now shares the filter path's np.any(v_mask) skip on empty chunks — previously only the filter path pruned them. Test call sites that used the module-level variant_chunk_iter move to VczReader(root).variant_chunks(...).
Match bcftools 1.19: duplicate sample names raise an error in regular (non-complement) mode regardless of --force-samples, and "" is treated as an unknown sample rather than a "no samples" shortcut. Adds bcftools-parity tests in test_bcftools_validation.py covering both the new error paths and the corner cases that already matched (complement collapsing duplicates, all-unknown with --force-samples, empty samples file).
Document what bcftools 1.19 accepts on the #CHROM line (letters, digits, dots, dashes, underscores, slashes, embedded spaces, numerics, Unicode, long names; case-sensitive) and pin parse_samples behaviour for each category. Add a cyvcf2 round-trip through VczReader + write_vcf to confirm IDs survive the zarr to VCF writer path byte-for-byte, plus a read_samples_file test for Unicode and embedded-space entries.
bcftools view round-trips "a,b" byte-for-byte on the #CHROM line, but -s 'a,b' fails because the CLI splits on comma first. parse_samples treats the comma as opaque; vcztools' writer preserves it too. Same CLI-splitting constraint applies to vcztools as bcftools.
read_samples_file is line-oriented, so "a,b" on a line is kept as a single ID. Contrasts with the CLI -s argument, which splits on commas.
Windows CI's default locale is cp1252, so the implicit encoding in open() was mojibake-ing Unicode sample IDs. VCF 4.3 specifies UTF-8 for sample IDs; be explicit so the samples-file reader is portable.
Member
Author
|
Made some good progress here. Nailed down the semantics of sample handling against bcftools and simplified the code for Zarr reading quite a bit. Still some work to do here, as we try to remove duplicate IO from the read path and try and simplify the logic. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.