
Question on resulting VG variants #2594

Open
RenzoTale88 opened this issue Jan 8, 2020 · 37 comments
@RenzoTale88

Hello,
I'm calling variants using a cactus-generated graph genome, built and indexed as previously reported in another issue (#2362). Alignment and post-processing are described in issue #2546, with the recommended changes. The only additional modification I've made is to increase the chunk size to 10 Mb, with a 100 kb overlap, to reduce the number of tasks to perform.
I've also managed to call the variants on each chunk using the command:

vg call ./FILTER/$sample/${bname}.aug.vg -r ./FILTER/$sample/${bname}.aug.snarls -s $sample -k ./FILTER/$sample/${bname}.aug.pack -t 1 -p genome1.1 -l 120000000 -o $offset | vcf-sort | bgzip -c > ./VCALL/$sample/${bname}.vcf.gz

This generated a VCF file relative to the reference genome of interest (called genome1). I've noticed that some of the variants are represented with the alternate allele N, and that no deletions appear. Is the N indicative of deletions?

Thank you for your support,
Andrea

@ekg
Member

ekg commented Jan 8, 2020 via email

@RenzoTale88
Author

Yes, there are gaps in the input genomes which are coded as Ns, and yes, I see them when I run vg view. Are there changes to be made when creating the input graph from cactus outputs?

@ekg
Member

ekg commented Jan 8, 2020 via email

@RenzoTale88
Author

I'm not sure I'm getting this right, sorry. I'm getting variants like the one below:

genome1.1      10333   10528667_10528803       ATC     NNN     237.434 PASS    DP=239  GT:DP:AD:GL     1/1:239:1,237:-663.514901,-219.974202,-161.973251

So, the reference allele is ATC, but the alternate is NNN. Is the gap supposed to be in the reads?
Sorry, maybe I'm just misinterpreting what you're telling me, but I'd like to fully understand what I'm getting as output.

@ekg
Member

ekg commented Jan 8, 2020 via email

@RenzoTale88
Author

So, the problem appears when I use the augmented vg graph, but is not present when using the xg index. I don't know if this might help track down the problem. I'm using vg version 1.20.0-78-gc71c000.

@glennhickey
Contributor

There must be N's in the reads that are getting augmented into the graph. @adamnovak and I were just chatting about this on Slack the other day. We used to have a filter for this, but it got lost in refactoring. I think I will have to add it back. In the meantime, you can just ignore these calls.

It may be worth double-checking that this is not a bug (and that this NNN really is in the vast majority of your reads as aligned to this position of the graph). I can do this for you if you share the augmented graph / GAM in question.
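In the meantime, records whose ALT is all Ns can be dropped downstream. A minimal sketch with plain awk on an illustrative VCF (the real chunk VCFs in this thread are bgzip-compressed, so they would first be decompressed with gzip -dc): field 5 of a VCF body line is ALT, and header lines start with '#'.

```shell
# Build a small illustrative VCF (a stand-in for the real chunk VCFs).
{
  echo '##fileformat=VCFv4.2'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'genome1.1\t10333\t.\tATC\tNNN\t237\tPASS\tDP=239\n'
  printf 'genome1.1\t20000\t.\tA\tG\t50\tPASS\tDP=30\n'
} > example.vcf

# Keep header lines and any record whose ALT is not purely Ns.
awk -F'\t' '/^#/ || $5 !~ /^N+$/' example.vcf > filtered.vcf
```

For multi-allelic records this test would need refining (e.g. splitting them first), but for biallelic NNN calls like the one shown above it is enough.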

@RenzoTale88
Author

Sorry, at the moment I no longer have the augmented chunks (I deleted them after finishing calling the variants). I can recreate them, but it might take a while.
In the meantime, I'm encountering another kind of problem. I've created chunks of ~10 Mb with an overlap of 100 kb to process in parallel. However, some chunks (the initial parts of some chromosomes) do not give any errors but simply generate an empty VCF file. I'm not sure what the reason is.
Is there a way to chunk the GAM and the vg into chromosomes? That way I can try to process them and see whether it creates other issues. Right now I'm doing it as follows:

vg chunk -x all.xg -a reads.gam -P PATHS.txt -c 50 -g -s 250000000 -o 100000 -b ${sample}_call_chunk -t 4 -E ${sample}.chunklist -f

Using a chunk size larger than the longest chromosome should effectively split into single chromosomes, but I'd like to know whether there is a better, more specific way to do it.

Thanks again for the help

Andrea

@glennhickey
Contributor

Yes, this is now how toil-vg works -- by splitting into chromosomes. This can be done with the -C option. The -O pg option is also pretty important for performance.

vg chunk -x all.xg -a reads.gam -P PATHS.txt -C -g -b ${sample}_call_chunk -t 4 -E ${sample}.chunklist -O pg

The other steps (augment / pack / call) should work fine on chromosome-sized graphs. You can save some time by not splitting the GAM (skip the -a/-g options to vg chunk) and just passing the entire GAM to augment for each chromosome chunk.

Finally: if you do split the GAM, you'll need the very latest master branch for vg chunk to work properly with -C -a.
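The per-chromosome processing can be sketched as a loop over the path list. This is only an illustration: the chunk naming (<basename>_<path>.pg) is inferred from file names used later in this thread, and the PATHS.txt contents here are stand-ins, so the -E chunklist should be checked for the names vg actually wrote.

```shell
sample=Sample1
printf 'genome1.1\ngenome1.2\n' > PATHS.txt   # illustrative path list

# One augment/pack/call round per chromosome chunk.
while read -r path; do
  chunk="${sample}_call_chunk_${path}.pg"
  # The real pipeline would run the vg steps here, e.g.:
  #   vg augment "$chunk" reads.gam -A "$chunk".aug.gam > "$chunk".aug.vg
  echo "$chunk"
done < PATHS.txt > chunks.txt
```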

@RenzoTale88
Author

OK, perfect, I'm proceeding as you suggested. Does vg augment also accept pg files as input? Just running the command shows that it accepts a vg file (just double-checking, of course); I'm using vg 1.20.0.

@glennhickey
Contributor

glennhickey commented Jan 9, 2020 via email

@RenzoTale88
Author

OK, it is running now. I have one more question, related to computational time/space. vg augment produces an augmented graph and reads (GAM), which are then the inputs for vg pack. My question is: will the augmented GAM contain only the reads that map to the specific chromosome I'm processing, or all of the reads mapped to the graph? I'm asking because the latter could lead to much higher space requirements, so I'd have to plan the process more carefully.

Thanks again
Andrea

@glennhickey
Contributor

glennhickey commented Jan 10, 2020 via email

@RenzoTale88
Author

OK, thank you very much, I'll give it a go with this additional option. Thanks a lot again.

@RenzoTale88
Author

RenzoTale88 commented Jan 14, 2020

Hi again,
I've tried to do as you suggested. Unfortunately, I'm having problems augmenting the graph, getting an error with both the full GAM and the chunked GAM.
I first chunked the graph into single-chromosome graphs:

# Chunking
vg chunk -x all.xg -P ./LISTS/PATHS.txt -C -O pg -b ${sample}_call_chunk -t ${NSLOTS} -E ${sample}.chunklist

Then, for every chunk, I augmented the graph with the full dataset (bname is the basename of each chunk):

# Augmenting
vg augment ${bname}.pg ${sample}.filtered.gam -m 4 -s -t $NSLOTS -A ${bname}.aug.gam > ${bname}.aug.vg

However, this step finishes with errors every time I run it. The error I get is the following:

ERROR: Signal 11 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_ogOCBO/stacktrace.txt
Please include the stack trace file in your bug report!
error[VPKG::load_one]: Correct input type not found while loading handlegraph::PathHandleGraph

Below is the stack trace of the issue encountered. Am I doing something wrong?
Thank you for your help

Andrea

stacktrace_13012020.txt

@glennhickey
Contributor

This type of error usually means it's completely failing to read an input file. The strange thing is that the error message implies it's the graph (which is not loaded as the type shown), but the stack trace implies it's the GAM.

Does vg validate ${bname}.pg run through? How about vg view -a ${sample}.filtered.gam | head -10 ?

@RenzoTale88
Author

RenzoTale88 commented Jan 15, 2020

OK, sorry, I might have used the wrong stack trace (I'm running on a cluster and have to retrieve the stack traces from each node).

I tried again between last night and this morning, this time chunking the graph into chromosomes as sub-vg files instead of pg. I've always used version 1.20.0 for all the operations, although using version 1.21.0 didn't change the results.
The commands I used are the same as in the post above, just creating a vg instead of a pg.
Below is the new stack trace; the error printed is the following:

ERROR: Signal 11 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_TUvUqS/stacktrace.txt
Please include the stack trace file in your bug report!
error[VPKG::load_one]: Correct input type not found while loading handlegraph::PathHandleGraph

Oddly enough, all the chromosomes failed with the exception of the last one. I'm not really sure why, since it is an arrayed job that repeats every step for every chromosome the same way.

Running vg validate on the different subgraphs didn't return any errors. The software proceeds to generate the augmented GAM files, but then fails when augmenting the vg.

Let me know what I can try next

stacktrace_15012020.txt
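As a side note on the arrayed-job setup: a minimal, hypothetical sketch of mapping a 1-based task index to a chromosome, useful for ruling out off-by-one errors when all but one chromosome behaves differently. SGE_TASK_ID is the SGE convention (Slurm uses SLURM_ARRAY_TASK_ID); CHROMS.txt here is an illustrative stand-in.

```shell
printf 'chr1\nchr2\nchr3\n' > CHROMS.txt      # illustrative chromosome list
SGE_TASK_ID=${SGE_TASK_ID:-2}                 # normally set by the scheduler; default for the demo
chrom=$(sed -n "${SGE_TASK_ID}p" CHROMS.txt)  # 1-based: task 2 -> chr2
echo "task ${SGE_TASK_ID}: ${chrom}"
```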

@glennhickey
Contributor

I see. This stack trace is pretty clear about it being vg augment crashing. I fixed some bugs in the function it seems to be crashing in, in a recent PR (#2599), so it may work with the latest master branch of vg. If you can share the input and command line for this vg augment crash with me, I'd be happy to debug it.

These issues have been helpful, so thanks for your patience. We've mostly been using vg call for genotyping -- i.e., without augmenting. I'm in the midst of doing some whole-genome benchmarks for de novo calling with augmentation, but there's clearly a little work left to be done. Any more feedback is appreciated.

@RenzoTale88
Author

RenzoTale88 commented Jan 15, 2020

Thank you very much for your help!
I might have some problems sharing the data since they are quite large (~40X mammalian-sized genomes), sorry about that. I'll try downloading and compiling the latest vg branch, then retry running it. In the meantime, I'm trying to run it on the whole graph instead of on single chromosomes.
To download the latest branch, do I simply need to git clone, or should I specify a specific branch name?

@glennhickey
Contributor

Yeah, just a fresh clone with no branch specified. But keep in mind that in this version you can no longer combine graphs with cat (use vg combine instead -- it will be like this in the next release). And there's a bug (to be fixed shortly) where vg ids -s doesn't work.

@RenzoTale88
Author

OK, that should be fine. Is the resulting graph from vg combine compatible with previous versions of vg? I mean, if I combine graphs with vg combine and feed the result to vg 1.20.0, will it work fine?

@glennhickey
Contributor

As far as I know, yes. Good question, though. @adamnovak can confirm for sure.
One thing that is not compatible is the output of vg pack. So in this case you need to make sure you're using the same vg version for vg call and vg pack.

@adamnovak
Member

Yes, the vg combine output is in vg's Protobuf format and should be readable by v1.20.0.

@glennhickey
Contributor

@RenzoTale88 I'm getting a crash with a similar stack trace on my own test. So I suspect that trying with the latest master will not fix it. I do have all the data I need to debug, and will let you know when it's fixed.

@RenzoTale88
Author

OK, great, then I'll wait for your go-ahead before upgrading my vg version. In the meantime, I've tried again to run vg chunk, vg augment and vg call. With the exception of some vg snarls core-dump errors, which disappeared on the next run (I'm not really sure why; I haven't made any changes to the script between runs), everything worked fine. I'll wait for your fixes anyway; maybe they will make things more stable and easier for my next run.

Below are the commands I'm using to produce the different analyses:

sample=Sample1
reads=Sample1.filtered.gam

# Chunking
vg chunk -x GRAPH_CACTUS/all.xg -a $reads -g -P ./LISTS/PATHS.txt -C -O vg -b FILTER/${sample}/${sample}_call_chunk -t ${NSLOTS} -E FILTER/${sample}/${sample}.chunklist 
# For every chunk
tgtpath=genome1.1
bname=FILTER/${sample}/${sample}_call_chunk_genome1.1
NSLOTS=4

vg augment ./FILTER/$sample/${bname}.vg ./FILTER/$sample/${bname}.gam -m 4 -s -t $NSLOTS -A ./FILTER/$sample/${bname}.aug.gam > ./FILTER/$sample/${bname}.aug.vg
vg index ./FILTER/$sample/${bname}.aug.vg -x ./FILTER/$sample/${bname}.aug.xg
vg pack -x ./FILTER/$sample/${bname}.aug.xg -g ./FILTER/$sample/${bname}.aug.gam -Q 5 -o ./FILTER/$sample/${bname}.aug.pack #&& rm ./FILTER/$sample/${bname}.aug.gam
vg mod -r $tgtpath ./FILTER/$sample/${bname}.aug.vg | vg snarls - > ./FILTER/$sample/${bname}.aug.snarls

# Call the variants for the chunk
vg call ./FILTER/$sample/${bname}.aug.xg -r ./FILTER/$sample/${bname}.aug.snarls -s $sample -k ./FILTER/$sample/${bname}.aug.pack -t $NSLOTS -p $tgtpath | vcf-sort | bgzip -c > ./VCALL/$sample/${bname}.vcf.gz

One question: when I use vg snarls alone, I get several core-dump errors. This gets better when I use vg mod -r $tgtpath ./FILTER/$sample/${bname}.aug.vg | vg snarls -. However, I'm not sure whether dropping the other paths while computing the snarls will affect the variant calling. You probably already answered this in a previous issue of mine, but I'd prefer to confirm before proceeding. Thanks again!

@glennhickey
Contributor

I think your crash was due to a bug in the -s option of vg augment. It should be fixed in the master branch now.

vg chunk -C is also fixed to name broken paths better in the master branch. This should stop the vg snarls crash. Using mod -r (which is now paths -r) is fine as a fix, and won't cause any trouble.

vg snarls uses a fair amount of memory. If it's still crashing with the fixed paths, that could be an issue. There was also an instability in vg snarls when running in parallel (which shouldn't be an issue here) that's fixed in the latest vg.

I'm continuing to test on my end.

@RenzoTale88
Author

OK, I can confirm that for one sample it keeps crashing on chromosome 1, which is the largest in the genome, even after fixing the paths with vg paths -r.
I proceeded as follows:

vg paths -r genome1.1 $myfile.aug.vg > tmp.aug.vg
vg snarls tmp.aug.vg > $myfile.aug.snarls

Below is the stack trace:

Crash report for vg v1.21.0-114-g9f3eb752d "Fanano"
Stack trace (most recent call last):
#6    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x4daf89, in _start
#5    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x1b44f68, in __libc_start_main
#4    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x40ada2, in main
      Source "src/main.cpp", line 75, in main [0x40ada2]
#3    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x9b6007, in vg::subcommand::Subcommand::operator()(int, char**) const
    | Source "src/subcommand/subcommand.cpp", line 72, in operator()
      Source "/usr/include/c++/7/bits/std_function.h", line 706, in operator() [0x9b6007]
#2    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x93db96, in main_snarl(int, char**)
      Source "src/subcommand/snarls_main.cpp", line 224, in main_snarl [0x93db96]
#1    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0xb32fb3, in vg::CactusSnarlFinder::find_snarls_parallel()
    | Source "src/snarls.cpp", line 114, in finish
      Source "src/snarls.cpp", line 947, in find_snarls_parallel [0xb32fb3]
#0    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0xb321a5, in vg::SnarlManager::build_indexes()
    | Source "src/snarls.cpp", line 993, in size
    | Source "/usr/include/c++/7/bits/stl_deque.h", line 1272, in operator-<vg::SnarlManager::SnarlRecord, vg::SnarlManager::SnarlRecord&, vg::SnarlManager::SnarlRecord*>
      Source "/usr/include/c++/7/bits/stl_deque.h", line 356, in build_indexes [0xb321a5]

@glennhickey
Contributor

glennhickey commented Jan 21, 2020 via email

@RenzoTale88
Author

No, the script never reaches its memory limit. I can try a more stringent coverage filter to see if it makes a difference (some samples are 30X, others almost 40X, so using the same threshold might have caused some problems).

@adamnovak
Member

adamnovak commented Jan 21, 2020 via email

@glennhickey
Contributor

I can take a look if you can share the graph.

@glennhickey
Contributor

For what it's worth, I finally have this pipeline running through reliably, with and without GAM chunking, via toil-vg call. Until recently, I was getting bogged down in a lot of the same crashes you've described.

That said, I'm running on a graph made with vg construct, which is no doubt much less complex than yours from cactus, so perhaps there are still some issues. I will try to run a cactus graph soon.

For the vg snarls crash, one thing to try, if you haven't already, is running with the latest vg master.

@RenzoTale88
Author

Hi, sorry for my late reply; I missed the email notifications. I'm testing the latest version right now, from the chunking step onward. I'll keep you updated on how it develops. Thanks again for the support.

@RenzoTale88
Author

Hi, I've tried vg v1.21.0-160 and it seems to be working. The only odd thing is that when I submit an array of jobs to process each chromosome separately, some of them generate an empty VCF without any error message or failure signal. I can simply rerun them and they'll work fine, but I find it a puzzling situation.

Thanks again for the support. I'll now try to rerun the whole pipeline with the new version of the software and see whether every step runs correctly.

Andrea
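Since the empty VCFs appear without any error signal, those silent failures have to be caught by inspecting the outputs. A minimal sketch (plain gzip stands in for bgzip, whose output is gzip-compatible; directory and file names are illustrative): flag any chunk VCF with a header but no records, so the corresponding array tasks can be resubmitted.

```shell
# Create two illustrative chunk VCFs: one header-only, one with a record.
mkdir -p VCALL/example
printf '##fileformat=VCFv4.2\n' | gzip > VCALL/example/empty.vcf.gz
printf '##fileformat=VCFv4.2\nchr1\t1\t.\tA\tG\t50\tPASS\t.\n' | gzip > VCALL/example/ok.vcf.gz

# List VCFs that contain no non-header lines.
for vcf in VCALL/example/*.vcf.gz; do
  if [ "$(gzip -dc "$vcf" | grep -cv '^#')" -eq 0 ]; then
    echo "EMPTY: $vcf"
  fi
done > empty_list.txt
```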

@RenzoTale88
Author

Hi,
sorry to come back to this, but I got some update.
I'm running the analyses using vg-1.21.0-160, as written above. When running the pipeline using vg chunk -C, for some reason, the software crashes during the snarl computation on the augmented graphs of some chromosomes. However, when I chunk the dataset using:

vg chunk -x all.xg -a FILTER/${sample}/${sample}.filtered.gam -P ./LISTS/PATHS.txt -c 50 -g -s 999999999 -o 100000 -b CHUNK/${sample}/${sample}_call_chunk -t 4 -E CHUNK/${sample}/${sample}.chunklist -f

This produces a series of chromosome-sized chunks (given the large chunk size), and they work just fine in all subsequent stages, including variant calling.
My question is: is this a good workaround? I mean, does it produce results similar to the -C chunking approach?

Thanks for your help again
Andrea

@glennhickey
Contributor

glennhickey commented Feb 27, 2020 via email

@RenzoTale88
Author

So, after calling the variants I actually have variants several thousand bases long. What do you mean by a large SV? Is there a size limit applied by the above command?
