Roadmap

Adam Novak edited this page Feb 27, 2023 · 71 revisions

This document sets out the high-level tasks which the vg development team hopes to accomplish in the next few versions of vg and beyond.

By Time

These are the things we hope to achieve on several planning horizons:

End of Spring (6/23)

  • Actual tutorials and examples for using libvgio/libbdsg for DV use cases (see pysam's progressively increasing examples) (~2 weeks good, ~1/day bad) (Adam)
    • Better developer documentation (model: pysam docs) for vg library
    • Links from libvgio Doxygen section to Protobuf-derived doc pages
  • Support DV use cases in vg libraries (libhandlegraph/libvgio) (Adam)
    • Query the graph (for nodes/edges)
    • GRCh38 location -> retrieve aligned reads near there
    • Get read attributes and know what they mean
  • Drop pinchesAndCacti and sonlib
    • Drop Cactus-library-based snarl finder (Adam)
  • Figure out if we need a libsnarls actually (Adam, Xian, Jordan)
  • Giraffe optimizations/presets for Ultima single-end short reads (A student)

End of Summer (9/23)

  • Giraffe pretty good on long reads
  • Delete at least one index each from vg index #3144 (Adam, Jordan)
    • GCSA to its own command
  • Eliminate intermediate Alignment as surject output, go graph Alignment -> BAM record(s) to make supplementary alignments work (Jordan, Aleksis)
    • Generalize spliced alignment code to let vg surject handle long deletions vs. target path (and generate 30000D CIGARs)
    • Let surject generate supplementary alignments for e.g. mappings over inversions
  • Haplotype sampling to modified GBZ based on k-mers should be good (Jouni)
  • User-facing, under-test Giraffe docs for HPRC graphs (Jordan)
  • Delete old snarl manager (except as a backward-compatible load?) and use snarl-only DI2 everywhere
  • No cruft in vg index #3144 (Adam, Jordan)

End of Fall (12/23)

  • Releases for vg libraries (libhandlegraph, libbdsg, libvgio)
    • Enhanced release acceptance testing so a release means production quality, like we would do for a paper. Proven to be able to run some number of genomes. (Tag paper-validated hashes as release?)
    • Stability guarantees for vg libraries (like htslib, 1-2 releases per year requiring a code change)
  • Libraries free of cryptic error messages when the user or inputs are wrong
  • Just front-end the autoindex system with a sensible CLI for single indexes in vg index. How do we deprecate things? 3 levels of command? Clear input vs. output distinction.
  • Diplotyping related to haplotype sampling, compare/collaborate with Erik's version
  • Use memory-mapped graphs (Adam)
    • For tube map, to enable interactive whole-genome use (Future data vis enthusiast)
    • For Giraffe
  • Giraffe actually competitive on long reads

Later

  • Example tutorial under test for idiomatic Giraffe on long-node rGFA on long reads with auto-chopping (autoindex already does the work probably) and close #3126
    • Queries (vg find) in rGFA space
  • Completely new paradigm for pangenome variant calling so we don't need population-specific reference hacks
    • Variant calling against rGFA references, allowing new variants all over
    • Think about PanGenie fitting in here
    • Will need to involve non-independent read mapping
    • Solve CNVs with EM (Jordan)

Wishlist

These are things we would like to do eventually.

  • Eliminate vg::VG (Jordan)

    • Steal all the things only it can do away from it
  • Default everything to GAF instead of GAM

    • mpGAF (Jordan, Jonas)
    • Also pgvf (Graph to graph)
    • Calls and snarls in one of these?
  • Python bindings for libhandlegraph algorithms

    • Are they the right algorithms?
  • Use of MCMC techniques in the genotyper with multipath alignments

  • vg deconstruct and Beagle to impute genotypes into partially-mapped and called data, as a PanGenie alternative (Erik, Andrea)

  • Alignment

    • Adoption of the multipath alignment paradigm as the default
    • Graph-to-graph mapping (Xian)
  • Variant Calling

    • Implementation of an HHGA-like machine learning based variant caller
    • Integration of variant calling and assembly polishing processes
    • Prune the zoo of TraversalFinders, and expose the useful ones to Python
  • Visualization

    • Browser-free tube map
    • Better tube map handling of edge cases
      • No haplotypes on a node
      • Starting on a rare haplotype
  • Infrastructure

    • Destructively modernize and unify IO
      • Eliminate VPKG framing if possible in favor of magic numbers everywhere
        • Resolve ensuing questions about GAM format
          • Just use GAF?
        • Handle things like GFA that need to manually sniff
      • Just save from the object; no more save_handle_graph
      • Magic format registration for libvgio magic numbers for loading
      • Depend on libvgio in libbdsg to do the IO there and pick the right handle graph implementation
    • Replace Protobuf internal formats with faster ones
    • Revision of ID assignment logic to allow deterministic node breaking
    • Accept gzipped GFA if practical (can't mmap)
    • Improved HandleGraph API
      • Abstract away node boundaries
      • View all sequence as C++17 string_views instead of sequence-owning strings
      • O(1) reverse complement DNAStringView
    • CMake-ify the main vg build
    • Eliminate old systems and their associated submodules, or factor them out into their own projects
      • vg vectorize could be its own project
        • Update vg vectorize to modern, system Vowpal Wabbit
        • Or pull it out into its own submodule and remove Vowpal Wabbit dependency from vg
      • Eliminate RocksDB from vg; everybody using vg map uses GCSA indexes now.
      • vg genotype
      • vg srpe
    • More cross-language support
      • Interoperate with Rust handle graph users/providers
      • Interoperate with Java handle graph users/providers
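
As an illustration of the "O(1) reverse complement DNAStringView" item under Improved HandleGraph API, here is a minimal sketch of what such a view might look like. The class name comes from the wishlist item, but the interface (`reverse_complement()`, `str()`, `operator[]`) is hypothetical and is not an existing libhandlegraph or vg API. The idea is that the view holds a non-owning C++17 `std::string_view` plus an orientation flag, so flipping strands copies no sequence.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: a non-owning view of DNA sequence that can be
// reverse-complemented in O(1). Not part of libhandlegraph or vg.
class DNAStringView {
public:
    explicit DNAStringView(std::string_view bases, bool reversed = false)
        : bases_(bases), reversed_(reversed) {}

    std::size_t size() const { return bases_.size(); }

    // Base at position i, read in the orientation this view represents.
    char operator[](std::size_t i) const {
        return reversed_ ? complement(bases_[bases_.size() - 1 - i])
                         : bases_[i];
    }

    // O(1): only the orientation flag changes; no sequence is copied.
    DNAStringView reverse_complement() const {
        return DNAStringView(bases_, !reversed_);
    }

    // O(n): materialize an owning string only when a caller needs one.
    std::string str() const {
        std::string out;
        out.reserve(size());
        for (std::size_t i = 0; i < size(); ++i) {
            out.push_back((*this)[i]);
        }
        return out;
    }

private:
    // Assumption for this sketch: anything other than uppercase
    // A/C/G/T complements to N.
    static char complement(char c) {
        switch (c) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            default:  return 'N';
        }
    }

    std::string_view bases_;
    bool reversed_;
};
```

With this sketch, `DNAStringView("GATTACA").reverse_complement().str()` yields `"TGTAATC"`. As with any `string_view`, the underlying sequence must outlive the view, which is one of the questions a real API change would have to settle.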