Skip to content

Simplify sample handling#292

Merged
jeromekelleher merged 11 commits intosgkit-dev:mainfrom
jeromekelleher:simplify-sample-handling
Apr 17, 2026
Merged

Simplify sample handling#292
jeromekelleher merged 11 commits intosgkit-dev:mainfrom
jeromekelleher:simplify-sample-handling

Conversation

@jeromekelleher
Copy link
Copy Markdown
Member

No description provided.

VczReader's samples kwarg now accepts only None or list[str]; a string
input raises TypeError, and a ^-prefixed list element raises ValueError
pointing at the new samples_complement flag. The CLI strips leading ^
on -s and -S (setting samples_complement=True) and splits -s strings
on commas before constructing VczReader.

parse_samples_file is replaced by read_samples_file(path) -> list[str],
dropping the previous file -> "^a,b" -> list[str] text roundtrip.
parse_samples itself is narrowed to list[str] | None with a complement
bool, mirroring dataframe_to_ranges for regions/targets.
Restructure tests/test_samples.py into focused classes covering None,
list, duplicates, complement, unknown samples, and missing-header
entries. Coverage of vcztools/samples.py reaches 100% from this file
alone.

Two bugs were surfaced and fixed along the way:

- parse_samples([]) previously built a float64 numpy array (np.array([])
  defaults to float) which then blew up in _as_fixed_length_unicode.
  Construct sample_ids with StringDType explicitly so empty lists round
  through cleanly.

- The missing-header path ran
  np.setdiff1d(int_selection, string_masked_ids); the dtype mismatch
  silently left masked samples in the output. Use np.flatnonzero to
  work with the integer indices of the masked entries.
parse_samples now always returns a concrete int index array —
np.arange(len(all_samples)) when samples=None — instead of the
overloaded (all_samples, None) shape. Downstream code in retrieval.py
and vcf_writer.py stops checking for None; the orthogonal "was a
subset requested?" signal moves to an explicit
VczReader.subsetting_samples bool, which gates AC/AN recompute and
header forcing in write_vcf.

get_vchunk_array is split into get_vchunk_array (no samples axis)
and get_vchunk_call_array (samples axis), with VariantChunkReader
dispatching on the "call_" key prefix. The degenerate
"subset resolved to zero samples" case (e.g. --force-samples dropped
every name) still reads the full call arrays so AC/AN can be
recomputed over the complete genotype set, matching bcftools.

The 2D filter mask path in variant_chunk_iter now subsets the sample
axis only when subsetting_samples is true, preserving per-allele
masks like the ones produced by INFO/AC filters.
@jeromekelleher jeromekelleher force-pushed the simplify-sample-handling branch from 42bde37 to bc359ba Compare April 17, 2026 13:45
Fold the three module-level iterator functions
(variant_chunk_index_iter, variant_chunk_index_iter_with_filtering,
variant_chunk_iter) into VczReader.variant_chunks, and split the
per-chunk computation into four named stages:

- _chunk_indexes(): which chunks to visit, with region-index pruning
  when regions are set.
- _chunk_region_mask() / _chunk_mask(): per-chunk boolean mask
  combining regions/targets with the filter expression.
- _reduce_chunk_mask(): split the mask into (variants_selection,
  call_mask), applying the sample subset when one was requested.
- VariantChunkReader.get_chunk_data(): load the query fields with
  those selections.

The region-only path now shares the filter path's np.any(v_mask)
skip on empty chunks — previously only the filter path pruned them.
Test call sites that used the module-level variant_chunk_iter move
to VczReader(root).variant_chunks(...).
Match bcftools 1.19: duplicate sample names raise an error in regular
(non-complement) mode regardless of --force-samples, and "" is treated
as an unknown sample rather than a "no samples" shortcut.

Adds bcftools-parity tests in test_bcftools_validation.py covering both
the new error paths and the corner cases that already matched (complement
collapsing duplicates, all-unknown with --force-samples, empty samples
file).
Document what bcftools 1.19 accepts on the #CHROM line (letters,
digits, dots, dashes, underscores, slashes, embedded spaces,
numerics, Unicode, long names; case-sensitive) and pin parse_samples
behaviour for each category. Add a cyvcf2 round-trip through
VczReader + write_vcf to confirm IDs survive the zarr to VCF writer
path byte-for-byte, plus a read_samples_file test for Unicode and
embedded-space entries.
bcftools view round-trips "a,b" byte-for-byte on the #CHROM line, but
-s 'a,b' fails because the CLI splits on comma first. parse_samples
treats the comma as opaque; vcztools' writer preserves it too. Same
CLI-splitting constraint applies to vcztools as bcftools.
read_samples_file is line-oriented, so "a,b" on a line is kept as a
single ID. Contrasts with the CLI -s argument, which splits on commas.
Windows CI's default locale is cp1252, so the implicit encoding in
open() was mojibake-ing Unicode sample IDs. VCF 4.3 specifies UTF-8
for sample IDs; be explicit so the samples-file reader is portable.
@jeromekelleher
Copy link
Copy Markdown
Member Author

Made some good progress here. Nailed down the semantics of sample handling against bcftools and simplified the code for Zarr reading quite a bit. Still some work to do here, as we try to remove duplicate IO from the read path and try and simplify the logic.

@jeromekelleher jeromekelleher merged commit d6de39b into sgkit-dev:main Apr 17, 2026
14 checks passed
@jeromekelleher jeromekelleher deleted the simplify-sample-handling branch April 17, 2026 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant