Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vg call returns an empty VCF when genotyping SVs #3950

Open
fabio-cunial opened this issue May 6, 2023 · 5 comments
Open

vg call returns an empty VCF when genotyping SVs #3950

fabio-cunial opened this issue May 6, 2023 · 5 comments

Comments

@fabio-cunial
Copy link

fabio-cunial commented May 6, 2023

Hi, I'm sure I'm doing something wrong, but after a week of trial and error I still cannot find the root of the problem, so I'm asking for help here. Sorry in advance if I'm doing something stupid.

1. What were you trying to do?

I want to build a graph from GRCh37 and a VCF file that contains only SVs (attached), and then to genotype exactly the same SVs using short-read alignments to the graph.

2. What did you want to happen?

I want a genotyped VCF, so following the SV genotyping with vg document, I did:

vg construct --threads 64 --progress --handle-sv --alt-paths --reference ref.fa --vcf calls.vcf > graph.vg

vg index --threads 64 --temp-dir ./vgtmp --progress --dist-name graph.dist graph.vg

vg index --threads 64 --temp-dir ./vgtmp --progress --xg-alts --xg-name graph.xg graph.vg

vg gbwt --num-jobs 64 --temp-dir ./vgtmp --progress --path-cover --xg-name graph.xg --output graph.gbwt

vg gbwt --num-jobs 64 --temp-dir ./vgtmp --progress --xg-name graph.xg --graph-name graph.gbz --gbz-format graph.gbwt

vg minimizer --threads 64 --progress --distance-index graph.dist --output-name graph.min graph.gbz

vg giraffe --threads 64 --progress --output-format gam --gbz-name graph.gbz --minimizer-name graph.min --dist-name graph.dist --fastq-in reads1.fastq.gz --fastq-in reads2.fastq.gz > alignments.gam

vg pack --threads 64 --xg graph.gbz --gam alignments.gam --min-mapq 5 --trim-ends 5 --packs-out alignments.pack

vg call --threads 64 --ploidy 2 --vcf calls.vcf --pack alignments.pack graph.gbz > calls-genotyped.vcf

3. What actually happened?

During vg call, all calls in the VCF file are skipped with messages like:

[VCFTraversalFinder] Warning: No alt path (prefix=_alt_dd31adde3cbbb60ef0c8788fad51d901a00b2855_) found in graph for variant. It will be ignored:
Y 58997447 Sniffles2.INS.83S17 T TGACATATCTCTGCACTGATCACCCCAGGGAGAGAATTCTTGTTTAGGCTCTGCCTACAGGGGGCTTTG 60 PASS AF=1;COVERAGE=7,7,6,5,5;END=58997447;STDEV_LEN=0;STDEV_POS=0;STRAND=+-;SUPPORT=6;SUPPORT_LONG=0;SVLEN=69;SVTYPE=INS;PRECISE

and the output of vg call is a VCF with a header but no records. Note that I also tried vg autoindex --workflow giraffe but I got the same result (I actually started from autoindex -- I then tried building every index manually just to identify the problem).

5. What data and command can the vg dev team use to make the problem happen?

You can use a GRCh37 reference and the calls.vcf.gz file attached to this issue. You can use any 30x short-read pairs FASTQs from HG002.

6. What does running vg version say?

vg version v1.47.0 "Ostuni"
Compiled with g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 on Linux
Linked against libstd++ 20210601
Built by anovak@octagon
@wjwei-handsome
Copy link

Hi!

I also encountered the same problem as you :)

Did you have try to replace the vg call --threads 64 --ploidy 2 --vcf calls.vcf --pack alignments.pack graph.gbz > calls-genotyped.vcf to vg call --threads 64 --ploidy 2 --vcf calls.vcf --pack alignments.pack graph.vg > calls-genotyped.vcf

As the Warning messages say, maybe both graph.gbz and graph.vg represent graphs, but the size of the namespace or node is different between them (It's just my dumb guess).

@glennhickey
Copy link
Contributor

Hi, thanks for raising this issue. The problem is that vg call --vcf only works on the the xg file (as created from vg construct --alt-paths then vg index --xg-alts --xg-name). It will not (unlike vg call without --vcf) work on the gbz.

If your graph is coming from construct, I think the node id space should be compatible between the xg and gbwt. You can verify this with vg gbwt -Z graph.gbz --translation graph.trans and making sure the toil columns of that file are the same.

If that's the case, you can run vg pack and vg call on the xg instead of the GBZ and it should work. This isn't ideal, but I think is the only way to proceed barring a substantial rewrite of vg call --vcf. Please let me know whether or not it works.

@fabio-cunial
Copy link
Author

Thanks a lot, it works now!

Since we are here... I'm slightly confused about the difference between vg call and vg call --vcf when the graph contains only SVs. From what I understood, vg call --vcf is identical to vg call, except that it also returns calls with genotype 0/0, correct? I don't think that vg call is also trying to call new SVs that are supported by discordant read pairs but that are not in the graph, correct?

Finally, an unrelated, minor issue :) As I mentioned, vg call --vcf works now, but it skips 77 calls because there are too many traversals: see this example output.

[VCFTraversalFinder] Warning: Site {"directed_acyclic_net_graph": true, "end": {"node_id": "1326945"}, "start": {"node_id": "1326155"}, "start_end_reachable": true, "type": 1} with 77 variants contains too many traversals (>50000) to enumerate so it will be skipped:
10 42358412 Sniffles2.DEL.2FDS9 CATTCGTGTTTATTCCATTCCATTCCATTCCATTCCATTCCACTCGGGTTCATTGCATTCAGTTCCGTTCCATTCCATTC...TTCCATTCCTTTCCATTCCATTCCATGCCAGTCATGTTGATTCCATTCCATTCCTA C 60 PASS AF=0.077;COVERAGE=19,16,9,14,15;END=42371507;STDEV_LEN=0;STDEV_POS=0;STRAND=-;SUPPORT=1;SVLEN=-13095;SVTYPE=DEL;PRECISE
10 42371080 Sniffles2.INS.DCS9 A ATTGCATTCTATTCCATTCTAATCGGGTTGATTTCATTCCATTCCATTCCATTCTAGTCCATTCCATTCCATTCCGTTCCATTAAATTCCATTCCGTTCCATTCCCCTGGTGATTATTCCAGTCCGTTCCATTCCATTGCATTCCCTTCCACTGGTGTTTTTGGAATCGTGTTGATTCCAATCCATTCAATTACAGTCCAGTCTTTTCCATTCCATTACATTCCACTCGGTTTGTTTCCATTACATTGAATTCCATTGTATTCCATTCCATACCATTGCATTCCATTGCATTCCCATCTTTCCAGTTGATTCCATTTCATTCCATTGCATTCTATTCCATTCAAATCGGGTTTTGGTGCTTATTCCAGTCCGCTCCATTCCATTGCATTCCACATTCCACTCCGTTTGTTTCCATTACATTGAATTACATTGCATTCCATTAAAATA 42 PASS AF=0.143;COVERAGE=16,15,14,15,14;END=42371080;STDEV_LEN=107.48;STDEV_POS=156.271;STRAND=+;SUPPORT=2;SUPPORT_LONG=0;SVLEN=599;SVTYPE=INS;IMPRECISE
10 42370809 Sniffles2.INS.DBS9 C CATTCCATTCTATTCCAATGCATTCCATTAGGGTTGAATTCATTGTCCATTCCTCTCCATTCCATTCCATTCCATTCCATTCCATTCCATTCTTTCCATTCTTCCATTCCATTCCATTCCATTCCACTCGTGTTCATTT 60 PASS AF=0.067;COVERAGE=16,15,15,15,15;END=42370809;STDEV_LEN=0;STDEV_POS=0;STRAND=+;SUPPORT=1;SUPPORT_LONG=0;SVLEN=139;SVTYPE=INS;PRECISE
...

I thought that vg giraffe has a heuristic for dealing with complex regions in the alignment stage (choosing a greedy path cover and prioritizing those paths). So why is this region being skipped in the genotyping stage? Isn't genotyping just looking for alignment coverage on nodes and edges?

Thanks a lot for your help and time!

@glennhickey
Copy link
Contributor

vg call --vcf outputs variants strictly in terms of the VCF used as input to vg construct. In doing so, it does some exhaustive enumeration of possible allele combinations in large sites and will give up if it gets overwhelmed (hence the warning). It has nothing to do with giraffe.

The newer way to go about this would be to use vg call <graph.gbz> -z. This will limit the possible alleles, even for large sites, to haplotypes from the phasing information of your graph. The drawback is that the resulting VCF may look a little different than your input VCF as it may have changed slightly while roundtripping into and out of the graph.

It would be nice to have a combination of the two: use the haplotypes from the gbz but cast output exactly in terms of the input VCF, but that's not implemented and probably won't be anytime soon...

@fabio-cunial
Copy link
Author

Thanks a lot Glenn. I think you can close the issue now if you want to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants