Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Construction crash introduced with Safe SVs #1737

Closed
adamnovak opened this issue Jun 13, 2018 · 7 comments · Fixed by #1738
Closed

Construction crash introduced with Safe SVs #1737

adamnovak opened this issue Jun 13, 2018 · 7 comments · Fixed by #1738
Assignees
Labels
bug HAMP Haplotype-Aware Mapping Paper
Milestone

Comments

@adamnovak
Copy link
Member

adamnovak commented Jun 13, 2018

Since the safe SVs merge commit, if you use:

s3://cgl-pipeline-inputs/vg_cgl/pop-map/graphs/wg38/v2/ALL.chrX_GRCh38.genotypes.20170504_minaf_0.1.vcf.gz
s3://cgl-pipeline-inputs/vg_cgl/pop-map/graphs/wg38/v2/ALL.chrX_GRCh38.genotypes.20170504_minaf_0.1.vcf.gz.tbi
s3://cgl-pipeline-inputs/vg_cgl/pop-map/input/GRCh38.fa.gz .

And you do:

gunzip GRCh38.fa.gz
vg construct -p --reference GRCh38.fa --vcf ALL.chrX_GRCh38.genotypes.20170504_minaf_0.1.vcf.gz --region X --region-is-chrom --node-max 32 --alt-paths --threads 1 >/dev/null

There is a crash because we try to go beyond the end of the chunk when creating reference nodes, at about 55.2% done.

(gdb) bt
#0  __cxxabiv1::__cxa_throw (obj=obj@entry=0x2489d40, tinfo=0x1970270 <typeinfo for std::out_of_range@@GLIBCXX_3.4>, 
    dest=0x6655b0 <std::out_of_range::~out_of_range()@plt>)
    at /public/home/anovak/build/gcc-8.1.0/libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x00007fe216510891 in std::__throw_out_of_range_fmt (__fmt=<optimized out>) from /public/home/anovak/.local/lib64/libstdc++.so.6
#2  0x0000000000db28b9 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_check (
    __s=0x143fd77 "basic_string::substr", __pos=<optimized out>, this=<optimized out>)
    at /public/home/anovak/.local/include/c++/8.1.0/bits/char_traits.h:350
#3  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::substr (__n=<optimized out>, 
    __pos=<optimized out>, this=<optimized out>) at /public/home/anovak/.local/include/c++/8.1.0/bits/basic_string.h:2807
#4  vg::Constructor::construct_chunk(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<vcflib::Variant, std::allocator<vcflib::Variant> >, unsigned long) const::{lambda(unsigned long)#3}::operator()(unsigned long) const () at src/constructor.cpp:347
#5  0x0000000000db563b in vg::Constructor::construct_chunk(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<vcflib::Variant, std::allocator<vcflib::Variant> >, unsigned long) const () at src/constructor.cpp:946
#6  0x0000000000db852a in vg::Constructor::construct_graph(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, FastaReference&, vg::VcfBuffer&, std::vector<FastaReference*, std::allocator<FastaReference*> > const&, std::function<void (vg::Graph&)>) () at /public/home/anovak/.local/include/c++/8.1.0/new:169
#7  0x0000000000db901d in vg::Constructor::construct_graph(std::vector<FastaReference*, std::allocator<FastaReference*> > const&, std::vector<vcflib::VariantCallFile*, std::allocator<vcflib::VariantCallFile*> > const&, std::vector<FastaReference*, std::allocator<FastaReference*> > const&, std::function<void (vg::Graph&)>) () at src/constructor.cpp:1644
#8  0x0000000000b2fa83 in main_construct(int, char**) () at /public/home/anovak/.local/include/c++/8.1.0/new:169
#9  0x0000000000b00ae8 in std::function<int (int, char**)>::operator()(int, char**) const (__args#1=<optimized out>, 
    __args#0=<optimized out>, this=<optimized out>) at /public/home/anovak/.local/include/c++/8.1.0/bits/std_function.h:260
#10 vg::subcommand::Subcommand::operator() (this=<optimized out>, argc=argc@entry=15, argv=argv@entry=0x7fffa20cdde8)
    at src/subcommand/subcommand.cpp:72
#11 0x000000000073eeb7 in main () at src/main.cpp:61
#12 0x00007fe215c93445 in __libc_start_main () from /usr/lib64/libc.so.6
#13 0x00000000007f4eee in _start () at src/main.cpp:70

I can find the chunk position and the reference size for the chunk, and I see the reference cursor getting beyond the end of the chunk somehow.

(gdb) p reference_sequence.size()
$2 = 521981
(gdb) p reference_cursor
$3 = 1107218
(gdb) p chunk_offset
$4 = 86107742

The crash happens in b5b3b2f but not in 62b2729.

@adamnovak adamnovak added bug HAMP Haplotype-Aware Mapping Paper labels Jun 13, 2018
@adamnovak adamnovak added this to the v1.9.0 milestone Jun 13, 2018
@glennhickey
Copy link
Contributor

glennhickey commented Jun 13, 2018 via email

@edawson
Copy link
Contributor

edawson commented Jun 13, 2018

Dang. Sorry y'all - I'll rewire this tomorrow. I'll make sure to get it right this time.

@adamnovak
Copy link
Member Author

Here's the only SV in the failing chunk. It's a multiallelic SV, which apparently we don't support yet?

X       86470037        rs59780433      TTTCATTGATATTCTGCTGTGGCACTGTTAATCTTATTTTGATTGAAACAGAGTTCTGGACAGTAAAATTGATG      TGGTTCATTGATATTCTGCTGTGGCACTGTTAATCTTATTTTGATTGAAACAGAGTTCTGGACAGTAAAATTGATG,T  100     PASS    AC=7,389;AF=0.0018543,0.103046;AN=3775;NS=2504;DP=19876;AMR_AF=0,0.0458;AFR_AF=0.001,0.3609;EUR_AF=0.0013,0.0026;SAS_AF=0.0014,0;EAS_AF=0.0052,0.0013;END=85725113;SVTYPE=DEL;CS=DEL_pindel;VT=SV;GRCH37_POS=85725040;GRCH37_REF=TTTCATTGATATTCTGCTGTGGCACTGTTAATCTTATTTTGATTGAAACAGAGTTCTGGACAGTAAAATTGATG;GRCH37_38_REF_STRING_MATCH

I am adding code to skip these. We will see what happens.

@ghost ghost assigned adamnovak Jun 13, 2018
@ghost ghost added the in progress label Jun 13, 2018
@edawson
Copy link
Contributor

edawson commented Jun 14, 2018

This SV has its END defined on GRCh37, which is probably what threw the cursor off.

Sorry to steamroll your fix, Glenn. If I remember correctly I was trying to recover the original output, maybe to pass the tests. I clearly didn't get it quite right. EDIT: It seems it's on my local branch? Maybe it didn't get pushed up.

vg (before any safe-svs PRs) sees this and says it's not an SV (since it passes the allATGC check), and variant_acceptable = true, so it would get included in the graph (whether or not -S is passed).

SV-aware (but broken) vg sees this is an SV (since it has an SVTYPE), so if we have variant_acceptable = !is_sv(), this variant doesn't get included without -S.

Since the current batch of VCFs seems to prefer putting things in ALT alleles, my plan is to make those first priority and default to testing if something is an SV only after the allATGCN check fails. Does this sound alright to folks?

I'll work on reducing the complexity/fragility of the SV handling code.

@adamnovak
Copy link
Member Author

adamnovak commented Jun 14, 2018

OK, I think the behavior we want is:

  • Bail out on (mark unacceptable and skip/warn) SVs with multiple alleles
  • Bail out on SVs with obviously incorect end annotations. If the SV end annotations are incorrect in these VCFs, though, we probably should report that to 1000 Genomes and make them fix it, because these VCFs are supposed to be in GRCh38 coordinates.
  • Bail out on all SVs if -S is not passed, since we have the option and that's what it is documented to do. Right now it looks like we are canonicalizing SVs only if -S is passed, but we are always doing them.

We might want some logic that will demote some SVs (like maybe this problematic one) to ordinary variants, if they have fully specified ref and alt strings? I'm not sure about that.

I will implement this and make a PR shortly.

@adamnovak
Copy link
Member Author

@edawson I think we maybe do want to handle fully-specified (and maybe short?) SVs with the normal codepath. But the vcflib is_sv() check is used in several places, and changing them all to use the allACGT check might be fiddly.

Maybe we just want to delete the SV marker from the variant record?

@edawson
Copy link
Contributor

edawson commented Jun 14, 2018

I'm working on this - no need for another PR, but some review might be good. I'll have it by tonight.

I think we should warn if there are SV tags and ref/alts that are mismatched. Otherwise, prefer ref/alt, then SV tags, then anything external (e.g. insertion sequences in FASTA). If we don't get -S, anything that's not ref/alt friendly will get ignored.

@ghost ghost removed the in progress label Jun 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug HAMP Haplotype-Aware Mapping Paper
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants