More complete rGFA support (experimental) #4113

glennhickey · 2023-10-04T15:59:41Z

Changelog Entry

To be copied to the draft changelog by merger:

Added the vg paths -R option, which computes an rGFA cover (based on a reference sample) from a (mutable) graph and adds it to the graph as path fragments.

Description

This is a fixup and refactor of #3891. It uses the same greedy (on steps) method to select snarl traversals to add to the cover. The cover is now stored in a (really simple) in-memory index that maps nodes to fragments that cover them, and these fragments can be walked back to the rank-0 reference.
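
To make the index description concrete, here is a minimal Python sketch of a node-to-fragment map whose fragments can be walked back to the rank-0 reference. It is an illustration only, not the code in this PR: the `Fragment` and `CoverIndex` names, the `parent` field, and the rule that each node is covered by exactly one fragment are all assumptions, and the greedy traversal selection is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Fragment:
    """One stretch of a path chosen for the cover (hypothetical layout)."""
    name: str
    nodes: List[int]
    # Name of the fragment this one branches off; None for the rank-0 reference.
    parent: Optional[str]


class CoverIndex:
    """In-memory map from node id to the fragment that covers it."""

    def __init__(self) -> None:
        self.fragments: Dict[str, Fragment] = {}
        self.node_to_fragment: Dict[int, str] = {}

    def add(self, fragment: Fragment) -> None:
        # Assumption: a node belongs to at most one fragment of the cover.
        for node in fragment.nodes:
            if node in self.node_to_fragment:
                raise ValueError(f"node {node} is already covered")
        self.fragments[fragment.name] = fragment
        for node in fragment.nodes:
            self.node_to_fragment[node] = fragment.name

    def rank(self, node: int) -> int:
        """Count the hops from the node's fragment back to the reference."""
        name = self.node_to_fragment[node]
        hops = 0
        while self.fragments[name].parent is not None:
            name = self.fragments[name].parent
            hops += 1
        return hops


index = CoverIndex()
index.add(Fragment("ref", [1, 2, 4], None))     # rank-0 reference path
index.add(Fragment("ins_a", [3], "ref"))        # hangs off the reference
index.add(Fragment("ins_b", [5], "ins_a"))      # nested inside ins_a
ranks = {node: index.rank(node) for node in (1, 3, 5)}
```

In this toy graph the reference nodes get rank 0, the insertion hanging off the reference gets rank 1, and the nested insertion gets rank 2.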

rGFA covers can be computed from full paths, loaded from path fragments, and saved to path fragments. They are always defined relative to a rank-0 / reference sample which stays as a normal reference path in the graph.

Off-reference rGFA path fragments are REFERENCE-sense paths with a special sample name (_rGFA_), which allows them to be easily selected. I would like to find something less hacky, but this should work for experiments for now.
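
As a rough illustration of selecting by sample name, the Python sketch below splits a list of path names into rGFA fragments and everything else. It assumes the sample is the first `#`-separated field of the path name (PanSN-style); the example names are made up, and the exact name layout used by this PR is not described here.

```python
from typing import Iterable, List, Tuple

RGFA_SAMPLE = "_rGFA_"  # special sample name mentioned in the description


def sample_of(path_name: str) -> str:
    # Assumption: names look like sample#haplotype#contig, sample first.
    return path_name.split("#", 1)[0]


def split_by_rgfa(path_names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (rGFA fragment names, all other names), preserving order."""
    rgfa: List[str] = []
    other: List[str] = []
    for name in path_names:
        if sample_of(name) == RGFA_SAMPLE:
            rgfa.append(name)
        else:
            other.append(name)
    return rgfa, other


# Hypothetical path names, for illustration only.
names = ["GRCh38#0#chr1", "_rGFA_#0#fragment_a", "HG002#1#chr1", "_rGFA_#0#fragment_b"]
rgfa_names, other_names = split_by_rgfa(names)
```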

Corresponding Cactus update is: ComparativeGenomicsToolkit/cactus#1186

JosephLalli · 2023-12-05T07:04:38Z

Just wanted to say I've been experimenting with an end-user hack to do something similar*, and I've seen the largest impact on accuracy for variants called around insertions. Reads which straddle a reference and insertion contig will align to one or the other.

As a result, depth around insertions flattens out, or even dips a little. Here's an example in HG002 around chr10:133111600, where vg haplotypes identified two overlapping 900bp insertions (one maternal, one paternal).

Surjecting reads to both insertions and GRCh38 led to reduced depth in the region, causing deepvariant to miss some GIAB annotated variants. However, the reads which surjected to the reference sequence now had their non-reference sequence softclipped. This led to fewer false positive calls around insertions due to misaligned reads.

PS: This is a really good test region. Both the insertions and the surrounding reference sequence are complex enough to allow for uniquely mapping reads, and the accuracy of variant calling is greatly improved when reads from either insertion are not mismapped to the linear reference sequence. It is also in an intron of a CNS expressed gene that has been linked to appetite and overeating, suggesting potential clinical relevance.

I'm very excited to eventually see this code on the main branch!

KMC
Create personalized GFA with haplotypes
Deconstruct GFA against reference of interest
Isolate insertions > 100bp, extract paths from deconstruction vcf
Format paths as GFA haplotypes, append to end of GFA file
- contig names were just chr|start|stop|unique_id
Convert to GBZ, mark all insertion paths as references
Extract per-sample fastas
Bring all fastas in cohort together, harmonize the unique id of each insertion. Create translation file of old contig name -> harmonized contig name.

Then:
Map reads to GBZ
Use samtools reheader, using sed to remove haplotype information and awk to apply the old contig name -> harmonized contig name translation
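
The renaming in the last step can be sketched in Python instead of sed and awk. This is only an illustration of the idea, not the commands actually used: it assumes haplotype information is a `sample#haplotype#` prefix on the contig name, and the header lines and translation table are toy values.

```python
from typing import Dict, Iterable, List


def rename_contigs(header_lines: Iterable[str], translation: Dict[str, str]) -> List[str]:
    """Rewrite SN: names on @SQ lines of a SAM header."""
    renamed: List[str] = []
    for line in header_lines:
        line = line.rstrip("\n")
        if not line.startswith("@SQ"):
            renamed.append(line)  # leave @HD, @RG, @PG, ... untouched
            continue
        fields = line.split("\t")
        for i, field in enumerate(fields):
            if field.startswith("SN:"):
                # Assumption: drop everything up to the last '#'.
                contig = field[3:].rsplit("#", 1)[-1]
                # Apply old name -> harmonized name; keep unknown names as-is.
                fields[i] = "SN:" + translation.get(contig, contig)
        renamed.append("\t".join(fields))
    return renamed


# Toy header and translation table (all values hypothetical).
header = [
    "@HD\tVN:1.6",
    "@SQ\tSN:GRCh38#0#chr10\tLN:1000",
    "@SQ\tSN:HG002#1#chr10|100|1000|7\tLN:900",
]
translation = {"chr10|100|1000|7": "chr10|100|1000|ins42"}
new_header = rename_contigs(header, translation)
```

The rewritten lines could then be written to a file and supplied to samtools reheader as the replacement header.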

JosephLalli · 2024-02-23T20:16:54Z

Hi @glennhickey,

I wonder if you are still planning to implement this feature?

-Joe

glennhickey · 2024-02-26T14:37:00Z

Yes! But I've been completely bogged down with other things. The only good news is that I think I'm on the hook to present about it in a few weeks so I won't have much choice but to get moving. I still have your examples to reproduce problems that I will check out. Sorry for the long wait.

JosephLalli · 2024-02-27T19:42:23Z

Not a problem @glennhickey, I'm so grateful for the work you've put in. I just wanted to double-check, as I'm a few months out from putting together a manuscript comparing the effect of vg-deepvariant with bwamem-haplotypecaller and bwamem-deepvariant, and I think this would be a good addition to the paper.

I'm in the middle of a too-many-projects-too-little-time period myself, so I am very, very familiar with the feeling.

glennhickey mentioned this pull request Oct 4, 2023

[WIP] Improve off-reference coordinate and rGFA support #3891

Closed

glennhickey force-pushed the rgfa2 branch from be890f3 to 9df2a05 Compare October 6, 2023 19:11

glennhickey force-pushed the rgfa2 branch from ea5d6e5 to 98e3b7c Compare November 21, 2023 16:51

glennhickey added 23 commits March 25, 2024 10:28

start wiring in new rgfa interface

5c90455

start wiring in cli interface

a8dcd2b

add rgfa options to save_graph interface

a2261b4

open intervals

60f8351

clean a few cases

6dd3856

rgfa tests

860660b

Add tag test

37945cb

add rgfa support to deconstruct

f3642ee

fix up some edge cases

f816072

allow multiple -S in call

7c86b03

rgfa call test

e0e3cc2

speed up rgfa path application

b6b955f

sample option for surject path selection, some rgfa support

e02e8a2

fix bugs in load() and get_rank()

17f3f57

ignore rgfa path fragments in depth and clip coverage calcs

4d65d74

use (by default) fragment not original paths in rgfa index

b46eaaa

break ties with name when choosing cover

1942dc6

work around step iterator issue in gbz

e539734

get rgfa tests passing; add back end range where possible

2a5ffad

fix option collision from last merge

c4bca66

rGFA selection option

c8419ab

leave subrange in surjected-to rgfa fragments

22974f7

keep rgfa subpaths in surject

98a6751

glennhickey added 8 commits March 25, 2024 10:28

revert rgfa name processing in paths listing -- too confusing for now

2ba1583

clean rgfa names in fasta output

07aa374

fair enough, clang

65a5d7f

fix rgfa tests post merge

182cf39

start rgfa VCF annotator

1d07f4c

Fix assertion bug

b8cc937

better rgfa support in call

220d33d

fix merge re -S option

f23e65a

glennhickey force-pushed the rgfa2 branch from 98e3b7c to f23e65a Compare March 25, 2024 15:01

glennhickey added 17 commits March 27, 2024 12:51

merge contiguous rgfa fragments

6ebd95f

disable rank check

9680149

auto-add rgfa prefix during rgfa conversion

89af9b6

merge adjacent intervals from different threads

7ce91eb

Merge remote-tracking branch 'origin/master' into rgfa2

ff71cd8

fix merge

0fa7487

fix bug where cyclic paths mishandled during interval merging

1f5e928

Revert "auto-add rgfa prefix during rgfa conversion"

56dc5c2

This reverts commit 89af9b6.

fix debug message

54712e1

fix debug msg

69fc397

fix bug where double-ended merge would lead to invalid overlap

3093eff

fix bug in last commit

286ec9f

treat phase block as subrange to hand converted gbz

16e3d61

opt to print rgfa stats table

f5f211b

can use previous step from path end on gbz

5fa4fbc

dont crash with multiple references

dd55ec7

dont crash with multiple references

42bfaa1

jeizenga mentioned this pull request May 17, 2024

Construct a generation-level pan-genome #4295

Open
