Implement joint multi-allelic left-alignment (Tan 2015) for v0.2 #1

@robertlangdonn

Context

vcfkit currently skips left-alignment for multi-allelic indel records (records where alts.len() > 1). See docs/known_differences.md.

This matches bcftools norm without -m, but diverges from bcftools norm with joint alignment flags that trigger per-record left-shifting.

Adversarial test: tests/corpus/synthetic/multiallelic_polyA.vcf — a multi-allelic indel at position 9 of a poly-A tract, where both ALTs could theoretically be shifted left. Current behaviour (passthrough) is confirmed against bcftools in normalize_test::diff_multiallelic_polya_matches_bcftools_no_split.

What needs to be done

Implement joint multi-allelic left-alignment per:

Tan, Abecasis, Kang (2015). "Unified representation of genetic variants."
Bioinformatics 31(13):2202–2204. doi:10.1093/bioinformatics/btv112

The algorithm: left-align all ALTs jointly against the reference. Find the leftmost position P such that trim_and_extend(REF, ALT_k, P) is valid for every k simultaneously, then rewrite the record at P.
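As a rough illustration of the joint constraint, here is a minimal sketch of the iterative trim/extend procedure from Tan et al. applied to all alleles at once. It is not vcfkit code: `Record`, `left_align_joint`, and the two helper functions are hypothetical names, the reference is assumed to be a single in-memory contig, and how this maps onto the existing `trim_and_extend` helper is left open.

```rust
/// Hypothetical, simplified record; not vcfkit's real type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Record {
    /// 1-based position of the first REF base.
    pub pos: usize,
    /// alleles[0] is REF, alleles[1..] are the ALTs.
    pub alleles: Vec<Vec<u8>>,
}

/// True when every allele is non-empty and all end with the same base.
fn share_last_base(alleles: &[Vec<u8>]) -> bool {
    match alleles[0].last() {
        Some(b) => alleles.iter().all(|a| a.last() == Some(b)),
        None => false,
    }
}

/// True when every allele is non-empty and all start with the same base.
fn share_first_base(alleles: &[Vec<u8>]) -> bool {
    match alleles[0].first() {
        Some(b) => alleles.iter().all(|a| a.first() == Some(b)),
        None => false,
    }
}

/// Jointly left-align REF and all ALTs against `reference` (one contig,
/// assumed to cover `rec.pos`). A shift happens only if every allele
/// permits it, so a single "blocking" ALT pins the whole record.
pub fn left_align_joint(reference: &[u8], mut rec: Record) -> Record {
    // Nothing to do without at least one ALT; empty alleles are invalid input.
    if rec.alleles.len() < 2 || rec.alleles.iter().any(|a| a.is_empty()) {
        return rec;
    }

    // Step 1: while all alleles end in the same base, drop it. If that would
    // empty an allele, prepend the preceding reference base to every allele.
    while share_last_base(&rec.alleles) {
        let needs_extend = rec.alleles.iter().any(|a| a.len() == 1);
        if needs_extend && rec.pos <= 1 {
            break; // cannot extend past the start of the contig
        }
        for a in rec.alleles.iter_mut() {
            a.pop();
        }
        if needs_extend {
            rec.pos -= 1;
            let base = reference[rec.pos - 1];
            for a in rec.alleles.iter_mut() {
                a.insert(0, base);
            }
        }
    }

    // Step 2: drop a shared leading base while every allele keeps >= 1 base.
    while rec.alleles.iter().all(|a| a.len() >= 2) && share_first_base(&rec.alleles) {
        for a in rec.alleles.iter_mut() {
            a.remove(0);
        }
        rec.pos += 1;
    }
    rec
}
```

On a toy reference `GCAAAAAAAAT`, a record at position 9 with REF `AA` and ALTs `A`, `AAA` moves to position 2 as `CA` / `C`, `CAA`. With ALTs `A`, `AT` the record stays at position 9, because the `AT` allele does not share the trailing base and so blocks the joint shift.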

Acceptance criteria

  • Joint alignment implementation in crates/vcfkit-core/src/normalize.rs
  • Remove the alts.len() > 1 shortcut in left_align_record
  • diff_multiallelic_polya_matches_bcftools_no_split passes (currently shows known divergence)
  • Real-world differential test on 1000G chr22 passes with updated logic
  • Update docs/known_differences.md to remove this entry
  • Changelog entry for v0.2

Risk

This changes normalize output for multi-allelic indels that are not yet fully left-aligned. Flag prominently in v0.2 release notes as a behaviour change.
