Simplify sample handling by jeromekelleher · Pull Request #292 · sgkit-dev/vcztools

jeromekelleher · 2026-04-17T13:45:09Z

No description provided.

VczReader's samples kwarg now accepts only None or list[str]; a string input raises TypeError, and a ^-prefixed list element raises ValueError pointing at the new samples_complement flag. The CLI strips leading ^ on -s and -S (setting samples_complement=True) and splits -s strings on commas before constructing VczReader. parse_samples_file is replaced by read_samples_file(path) -> list[str], dropping the previous file -> "^a,b" -> list[str] text roundtrip. parse_samples itself is narrowed to list[str] | None with a complement bool, mirroring dataframe_to_ranges for regions/targets.

Restructure tests/test_samples.py into focused classes covering None, list, duplicates, complement, unknown samples, and missing-header entries. Coverage of vcztools/samples.py reaches 100% from this file alone. Two bugs were surfaced and fixed along the way: - parse_samples([]) previously built a float64 numpy array (np.array([]) defaults to float) which then blew up in _as_fixed_length_unicode. Construct sample_ids with StringDType explicitly so empty lists round through cleanly. - The missing-header path ran np.setdiff1d(int_selection, string_masked_ids); the dtype mismatch silently left masked samples in the output. Use np.flatnonzero to work with the integer indices of the masked entries.

parse_samples now always returns a concrete int index array — np.arange(len(all_samples)) when samples=None — instead of the overloaded (all_samples, None) shape. Downstream code in retrieval.py and vcf_writer.py stops checking for None; the orthogonal "was a subset requested?" signal moves to an explicit VczReader.subsetting_samples bool, which gates AC/AN recompute and header forcing in write_vcf. get_vchunk_array is split into get_vchunk_array (no samples axis) and get_vchunk_call_array (samples axis), with VariantChunkReader dispatching on the "call_" key prefix. The degenerate "subset resolved to zero samples" case (e.g. --force-samples dropped every name) still reads the full call arrays so AC/AN can be recomputed over the complete genotype set, matching bcftools. The 2D filter mask path in variant_chunk_iter now subsets the sample axis only when subsetting_samples is true, preserving per-allele masks like the ones produced by INFO/AC filters.

Fold the three module-level iterator functions (variant_chunk_index_iter, variant_chunk_index_iter_with_filtering, variant_chunk_iter) into VczReader.variant_chunks, and split the per-chunk computation into four named stages: - _chunk_indexes(): which chunks to visit, with region-index pruning when regions are set. - _chunk_region_mask() / _chunk_mask(): per-chunk boolean mask combining regions/targets with the filter expression. - _reduce_chunk_mask(): split the mask into (variants_selection, call_mask), applying the sample subset when one was requested. - VariantChunkReader.get_chunk_data(): load the query fields with those selections. The region-only path now shares the filter path's np.any(v_mask) skip on empty chunks — previously only the filter path pruned them. Test call sites that used the module-level variant_chunk_iter move to VczReader(root).variant_chunks(...).

Match bcftools 1.19: duplicate sample names raise an error in regular (non-complement) mode regardless of --force-samples, and "" is treated as an unknown sample rather than a "no samples" shortcut. Adds bcftools-parity tests in test_bcftools_validation.py covering both the new error paths and the corner cases that already matched (complement collapsing duplicates, all-unknown with --force-samples, empty samples file).

Document what bcftools 1.19 accepts on the #CHROM line (letters, digits, dots, dashes, underscores, slashes, embedded spaces, numerics, Unicode, long names; case-sensitive) and pin parse_samples behaviour for each category. Add a cyvcf2 round-trip through VczReader + write_vcf to confirm IDs survive the zarr to VCF writer path byte-for-byte, plus a read_samples_file test for Unicode and embedded-space entries.

bcftools view round-trips "a,b" byte-for-byte on the #CHROM line, but -s 'a,b' fails because the CLI splits on comma first. parse_samples treats the comma as opaque; vcztools' writer preserves it too. Same CLI-splitting constraint applies to vcztools as bcftools.

read_samples_file is line-oriented, so "a,b" on a line is kept as a single ID. Contrasts with the CLI -s argument, which splits on commas.

Windows CI's default locale is cp1252, so the implicit encoding in open() was mojibake-ing Unicode sample IDs. VCF 4.3 specifies UTF-8 for sample IDs; be explicit so the samples-file reader is portable.

jeromekelleher · 2026-04-17T16:15:22Z

Made some good progress here. Nailed down the semantics of sample handling against bcftools and simplified the code for Zarr reading quite a bit. Still some work to do here, as we try to remove duplicate IO from the read path and try and simplify the logic.

jeromekelleher added 3 commits April 17, 2026 14:44

jeromekelleher force-pushed the simplify-sample-handling branch from 42bde37 to bc359ba Compare April 17, 2026 13:45

jeromekelleher added 8 commits April 17, 2026 15:16

Simplify sample handling

db1bd24

Rename force_samples to "ignore_missing_samples"

ab3a889

Test comma sample IDs survive samples-file reading

969794a

read_samples_file is line-oriented, so "a,b" on a line is kept as a single ID. Contrasts with the CLI -s argument, which splits on commas.

Open samples file as UTF-8 regardless of platform locale

b0825a1

Windows CI's default locale is cp1252, so the implicit encoding in open() was mojibake-ing Unicode sample IDs. VCF 4.3 specifies UTF-8 for sample IDs; be explicit so the samples-file reader is portable.

jeromekelleher merged commit d6de39b into sgkit-dev:main Apr 17, 2026
14 checks passed

jeromekelleher deleted the simplify-sample-handling branch April 17, 2026 16:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify sample handling#292

Simplify sample handling#292
jeromekelleher merged 11 commits intosgkit-dev:mainfrom
jeromekelleher:simplify-sample-handling

jeromekelleher commented Apr 17, 2026

Uh oh!

jeromekelleher commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeromekelleher commented Apr 17, 2026

Uh oh!

jeromekelleher commented Apr 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant