
Question on resulting VG variants #2594

Open
RenzoTale88 opened this issue Jan 8, 2020 · 37 comments
@RenzoTale88

Hello,
I'm calling variants using a cactus-generated graph genome, built and indexed as previously reported in another issue (#2362). Alignment and post-processing are described in issue #2546, with the recommended changes. The only additional modification I've made is to increase the chunk size to 10 Mb, with a 100 kb overlap, to reduce the number of tasks to perform.
I've also managed to call the variants on each chunk using the command:

vg call ./FILTER/$sample/${bname}.aug.vg -r ./FILTER/$sample/${bname}.aug.snarls -s $sample -k ./FILTER/$sample/${bname}.aug.pack -t 1 -p genome1.1 -l 120000000 -o $offset | vcf-sort | bgzip -c > ./VCALL/$sample/${bname}.vcf.gz

This generated a VCF file relative to the reference genome of interest (called genome1). I've noticed that some of the variants are represented with the alternate allele N, and that no deletions appear. Is the N indicative of deletions?

Thank you for your support,
Andrea

@ekg
Member

ekg commented Jan 8, 2020 via email

@RenzoTale88
Author

Yes, there are gaps in the input genomes which are coded as Ns, and yes, I see them when I run vg view. Are there changes to be made when creating the input graph from cactus outputs?

@ekg
Member

ekg commented Jan 8, 2020 via email

@RenzoTale88
Author

I'm not sure I'm getting this right, sorry. I'm getting variants like the one below:

genome1.1      10333   10528667_10528803       ATC     NNN     237.434 PASS    DP=239  GT:DP:AD:GL     1/1:239:1,237:-663.514901,-219.974202,-161.973251

So, the reference allele is ATC, but the alternate is NNN. Is the gap supposed to be in the reads?
Sorry, maybe I'm just misinterpreting what you're telling me, but I'd like to fully understand what I'm getting as output.

@ekg
Member

ekg commented Jan 8, 2020 via email

@RenzoTale88
Author

So, the problem appears when I use the augmented vg graph, but is not present when using the xg index. I don't know if this might help track down the problem. I'm using vg version 1.20.0-78-gc71c000.

@glennhickey
Contributor

There must be N's in the reads that are getting augmented into the graph. @adamnovak and I were just chatting about this on Slack the other day. We used to have a filter for this, but it got lost in refactoring. I think I will have to add it back. In the meantime, you can just ignore these calls.

It may be worth double-checking that this is not a bug (and that this NNN really is in the vast majority of your reads as aligned to this position of the graph). I can do this for you if you share the augmented graph / GAM in question.
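In the meantime, records whose ALT is all Ns can be dropped downstream. A minimal sketch with plain awk on an illustrative VCF (the real chunk VCFs in this thread are bgzip-compressed, so they would first be decompressed with gzip -dc): field 5 of a VCF body line is ALT, and header lines start with '#'.

```shell
# Build a small illustrative VCF (a stand-in for the real chunk VCFs).
{
  echo '##fileformat=VCFv4.2'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'genome1.1\t10333\t.\tATC\tNNN\t237\tPASS\tDP=239\n'
  printf 'genome1.1\t20000\t.\tA\tG\t50\tPASS\tDP=30\n'
} > example.vcf

# Keep header lines and any record whose ALT is not purely Ns.
awk -F'\t' '/^#/ || $5 !~ /^N+$/' example.vcf > filtered.vcf
```

For multi-allelic records this test would need refining (e.g. splitting them first), but for biallelic NNN calls like the one shown above it is enough.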

@RenzoTale88
Author

Sorry, at the moment I no longer have the augmented chunks (I deleted them after finishing calling the variants). I can recreate them, but it might take a while.
In the meantime, I'm encountering another kind of problem. I've created chunks of ~10 Mb with an overlap of 100 kb to process in parallel. However, some chunks (the initial parts of some chromosomes) do not give any errors but simply generate an empty VCF file. I'm not sure what the reason is.
Is there a way to chunk the GAM and the vg into chromosomes? That way I can try to process them and see whether it creates other issues. Right now I'm doing it as follows:

vg chunk -x all.xg -a reads.gam -P PATHS.txt -c 50 -g -s 250000000 -o 100000 -b ${sample}_call_chunk -t 4 -E ${sample}.chunklist -f

Using a chunk size larger than the longest chromosome should effectively split into single chromosomes, but I'd like to know whether there is a better, more specific way to do it.

Thanks again for the help

Andrea

@glennhickey
Contributor

Yes, this is now how toil-vg works -- by splitting into chromosomes. This can be done with the -C option. The -O pg option is also pretty important for performance.

vg chunk -x all.xg -a reads.gam -P PATHS.txt -C -g -b ${sample}_call_chunk -t 4 -E ${sample}.chunklist -O pg

The other steps (augment / pack / call) should work fine on chromosome-sized graphs. You can save some time by not splitting the GAM (skip the -a/-g options to vg chunk) and just passing the entire GAM to augment for each chromosome chunk.

Finally: if you do split the GAM, you'll need the very latest master branch for vg chunk to work properly with -C -a.
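The per-chromosome processing can be sketched as a loop over the path list. This is only an illustration: the chunk naming (<basename>_<path>.pg) is inferred from file names used later in this thread, and the PATHS.txt contents here are stand-ins, so the -E chunklist should be checked for the names vg actually wrote.

```shell
sample=Sample1
printf 'genome1.1\ngenome1.2\n' > PATHS.txt   # illustrative path list

# One augment/pack/call round per chromosome chunk.
while read -r path; do
  chunk="${sample}_call_chunk_${path}.pg"
  # The real pipeline would run the vg steps here, e.g.:
  #   vg augment "$chunk" reads.gam -A "$chunk".aug.gam > "$chunk".aug.vg
  echo "$chunk"
done < PATHS.txt > chunks.txt
```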

@RenzoTale88
Author

OK, perfect, I'm proceeding as you suggested. Does vg augment also accept pg files as input? Just running the command shows that it accepts a vg file (just double-checking, of course); I'm using vg 1.20.0.

@glennhickey
Contributor

glennhickey commented Jan 9, 2020 via email

@RenzoTale88
Author

OK, it is running now. I have one more question, related to computational time/space. vg augment produces an augmented graph and reads (GAM), which are then the inputs for vg pack. My question is: will the augmented GAM contain only the reads that map to the specific chromosome I'm processing, or all of the reads mapped to the graph? I'm asking because the latter could lead to much higher space requirements, so I'd have to plan the process more carefully.

Thanks again
Andrea

@glennhickey
Contributor

glennhickey commented Jan 10, 2020 via email

@RenzoTale88
Author

OK, thank you very much, I'll give it a go with this additional option. Thanks a lot again.

@RenzoTale88
Author

RenzoTale88 commented Jan 14, 2020

Hi again,
I've tried to do as you suggested. Unfortunately, I'm having problems augmenting the graph, getting an error with both the full GAM and the chunked GAM.
I first chunked the graph into single-chromosome graphs:

# Chunking
vg chunk -x all.xg -P ./LISTS/PATHS.txt -C -O pg -b ${sample}_call_chunk -t ${NSLOTS} -E ${sample}.chunklist

Then, for every chunk, I augmented the graph with the full dataset (bname is the basename of each chunk):

# Augmenting
vg augment ${bname}.pg ${sample}.filtered.gam -m 4 -s -t $NSLOTS -A ${bname}.aug.gam > ${bname}.aug.vg

However, this step finishes with errors every time I run it. The error I get is the following:

ERROR: Signal 11 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_ogOCBO/stacktrace.txt
Please include the stack trace file in your bug report!
error[VPKG::load_one]: Correct input type not found while loading handlegraph::PathHandleGraph

Below is the stack trace of the issue encountered. Am I doing something wrong?
Thank you for your help

Andrea

stacktrace_13012020.txt

@glennhickey
Contributor

This type of error usually means it's completely failing to read an input file. The strange thing is that the error message implies it's the graph (which is not loaded as the type shown), but the stack trace implies it's the GAM.

Does vg validate ${bname}.pg run through? How about vg view -a ${sample}.filtered.gam | head -10 ?

@RenzoTale88
Author

RenzoTale88 commented Jan 15, 2020

OK, sorry, I might have used the wrong stack trace (I'm running on a cluster and have to retrieve the stack traces from each node).

I tried again between last night and this morning, this time chunking the graph into chromosomes as sub-vg files instead of pg. I've always used version 1.20.0 for all the operations, although using version 1.21.0 didn't change the results.
The commands I used are the same as in the post above, just creating a vg instead of a pg.
Below is the new stack trace; the error printed is the following:

ERROR: Signal 11 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_TUvUqS/stacktrace.txt
Please include the stack trace file in your bug report!
error[VPKG::load_one]: Correct input type not found while loading handlegraph::PathHandleGraph

Oddly enough, all the chromosomes failed with the exception of the last one. I'm not really sure why, since it is an arrayed job that repeats every step for every chromosome the same way.

Running vg validate on the different subgraphs didn't return any errors. The software proceeds to generate the augmented GAM files, but then fails when augmenting the vg.

Let me know what I can try next

stacktrace_15012020.txt
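As a side note on the arrayed-job setup: a minimal, hypothetical sketch of mapping a 1-based task index to a chromosome, useful for ruling out off-by-one errors when all but one chromosome behaves differently. SGE_TASK_ID is the SGE convention (Slurm uses SLURM_ARRAY_TASK_ID); CHROMS.txt here is an illustrative stand-in.

```shell
printf 'chr1\nchr2\nchr3\n' > CHROMS.txt      # illustrative chromosome list
SGE_TASK_ID=${SGE_TASK_ID:-2}                 # normally set by the scheduler; default for the demo
chrom=$(sed -n "${SGE_TASK_ID}p" CHROMS.txt)  # 1-based: task 2 -> chr2
echo "task ${SGE_TASK_ID}: ${chrom}"
```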

@glennhickey
Contributor

I see. This stack trace is pretty clear about it being vg augment crashing. I fixed some bugs in the function it seems to be crashing in, in a recent PR (#2599), so it may work with the latest master branch of vg. If you can share the input and command line for this vg augment crash with me, I'd be happy to debug it.

These issues have been helpful, so thanks for your patience. We've mostly been using vg call for genotyping -- i.e., without augmenting. I'm in the midst of doing some whole-genome benchmarks for de novo calling with augmentation, but there's clearly a little work left to be done. Any more feedback is appreciated.

@RenzoTale88
Author

RenzoTale88 commented Jan 15, 2020

Thank you very much for your help!
I might have some problems sharing the data since they are quite large (~40X mammalian-sized genomes), sorry about that. I'll try downloading and compiling the latest vg branch, then retry running it. In the meantime, I'm trying to run it on the whole graph instead of on single chromosomes.
To download the latest branch, do I simply need to git clone, or should I specify a specific branch name?

@glennhickey
Contributor

Yeah, just a fresh clone with no branch specified. But keep in mind that in this version you can no longer combine graphs with cat (use vg combine instead -- it will be like this in the next release). And there's a bug (to be fixed shortly) where vg ids -s doesn't work.

@RenzoTale88
Author

OK, that should be fine. Is the resulting graph from vg combine compatible with previous versions of vg? I mean, if I combine graphs with vg combine and feed the result to vg 1.20.0, will it work fine?

@glennhickey
Contributor

As far as I know, yes. Good question, though. @adamnovak can confirm for sure.
One thing that is not compatible is the output of vg pack. So in this case you need to make sure you're using the same vg version for vg call and vg pack.

@adamnovak
Member

Yes, the vg combine output is in vg's Protobuf format and should be readable by v1.20.0.

@glennhickey
Contributor

@RenzoTale88 I'm getting a crash with a similar stack trace on my own test. So I suspect that trying with the latest master will not fix it. I do have all the data I need to debug, and will let you know when it's fixed.

@RenzoTale88
Author

OK, great, then I'll wait for your go-ahead before upgrading my vg version. In the meantime, I've tried again to run vg chunk, vg augment and vg call. With the exception of some vg snarls core-dump errors, which disappeared on the next run (I'm not really sure why; I haven't made any changes to the script between runs), everything worked fine. I'll wait for your fixes anyway; maybe they will make things more stable and easier for my next run.

Below are the commands I'm using to produce the different analyses:

sample=Sample1
reads=Sample1.filtered.gam

# Chunking
vg chunk -x GRAPH_CACTUS/all.xg -a $reads -g -P ./LISTS/PATHS.txt -C -O vg -b FILTER/${sample}/${sample}_call_chunk -t ${NSLOTS} -E FILTER/${sample}/${sample}.chunklist 
# For every chunk
tgtpath=genome1.1
bname=FILTER/${sample}/${sample}_call_chunk_genome1.1
NSLOTS=4

vg augment ./FILTER/$sample/${bname}.vg ./FILTER/$sample/${bname}.gam -m 4 -s -t $NSLOTS -A ./FILTER/$sample/${bname}.aug.gam > ./FILTER/$sample/${bname}.aug.vg
vg index ./FILTER/$sample/${bname}.aug.vg -x ./FILTER/$sample/${bname}.aug.xg
vg pack -x ./FILTER/$sample/${bname}.aug.xg -g ./FILTER/$sample/${bname}.aug.gam -Q 5 -o ./FILTER/$sample/${bname}.aug.pack #&& rm ./FILTER/$sample/${bname}.aug.gam
vg mod -r $tgtpath ./FILTER/$sample/${bname}.aug.vg | vg snarls - > ./FILTER/$sample/${bname}.aug.snarls

# Call the variants for the chunk
vg call ./FILTER/$sample/${bname}.aug.xg -r ./FILTER/$sample/${bname}.aug.snarls -s $sample -k ./FILTER/$sample/${bname}.aug.pack -t $NSLOTS -p $tgtpath | vcf-sort | bgzip -c > ./VCALL/$sample/${bname}.vcf.gz

One question: when I use vg snarls alone, I get several core-dump errors. This gets better when I use vg mod -r $tgtpath ./FILTER/$sample/${bname}.aug.vg | vg snarls -. However, I'm not sure whether dropping the other paths while computing the snarls will affect the variant calling. You probably already answered this in a previous issue of mine, but I'd prefer to confirm before proceeding. Thanks again!

@glennhickey
Contributor

I think your crash was due to a bug in the -s option of vg augment. It should be fixed in the master branch now.

vg chunk -C is also fixed to name broken paths better in the master branch. This should stop the vg snarls crash. Using mod -r (which is now paths -r) is fine as a fix, and won't cause any trouble.

vg snarls uses a fair amount of memory. If it's still crashing with the fixed paths, that could be an issue. There was also an instability in vg snarls when running in parallel (which shouldn't be an issue here) that's fixed in the latest vg.

I'm continuing to test on my end.

@RenzoTale88
Author

OK, I can confirm that for one sample it keeps crashing on chromosome 1, which is the largest in the genome, even after fixing the paths with vg paths -r.
I proceeded as follows:

vg paths -r genome1.1 $myfile.aug.vg > tmp.aug.vg
vg snarls tmp.aug.vg > $myfile.aug.snarls

Below is the stack trace:

Crash report for vg v1.21.0-114-g9f3eb752d "Fanano"
Stack trace (most recent call last):
#6    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x4daf89, in _start
#5    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x1b44f68, in __libc_start_main
#4    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x40ada2, in main
      Source "src/main.cpp", line 75, in main [0x40ada2]
#3    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x9b6007, in vg::subcommand::Subcommand::operator()(int, char**) const
    | Source "src/subcommand/subcommand.cpp", line 72, in operator()
      Source "/usr/include/c++/7/bits/std_function.h", line 706, in operator() [0x9b6007]
#2    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x93db96, in main_snarl(int, char**)
      Source "src/subcommand/snarls_main.cpp", line 224, in main_snarl [0x93db96]
#1    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0xb32fb3, in vg::CactusSnarlFinder::find_snarls_parallel()
    | Source "src/snarls.cpp", line 114, in finish
      Source "src/snarls.cpp", line 947, in find_snarls_parallel [0xb32fb3]
#0    Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0xb321a5, in vg::SnarlManager::build_indexes()
    | Source "src/snarls.cpp", line 993, in size
    | Source "/usr/include/c++/7/bits/stl_deque.h", line 1272, in operator-<vg::SnarlManager::SnarlRecord, vg::SnarlManager::SnarlRecord&, vg::SnarlManager::SnarlRecord*>
      Source "/usr/include/c++/7/bits/stl_deque.h", line 356, in build_indexes [0xb321a5]

@glennhickey
Contributor

glennhickey commented Jan 21, 2020 via email

@RenzoTale88
Author

No, the script never reaches its memory limit. I can try a more stringent coverage filter to see if it makes a difference (some samples are 30X, others almost 40X, so using the same threshold might have caused some problems).

@adamnovak
Member

adamnovak commented Jan 21, 2020 via email

@glennhickey
Contributor

I can take a look if you can share the graph.

@glennhickey
Contributor

For what it's worth, I finally have this pipeline running through reliably, with and without GAM chunking, via toil-vg call. Until recently, I was getting bogged down in a lot of the same crashes you've described.

That said, I'm running on a graph made with vg construct, which is no doubt much less complex than yours from cactus, so perhaps there are still some issues. I will try to run a cactus graph soon.

For the vg snarls crash, one thing to try, if you haven't already, is running with the latest vg master.

@RenzoTale88
Author

Hi, sorry for my late reply; I missed the email notifications. I'm testing the latest version right now, from the chunking step onward. I'll keep you updated on how it develops. Thanks again for the support.

@RenzoTale88
Author

Hi, I've tried vg v1.21.0-160 and it seems to be working. The only odd thing is that when I submit an array of jobs to process each chromosome separately, some of them generate an empty VCF without any error message or failure signal. I can simply rerun them and they'll work fine, but I find it a puzzling situation.

Thanks again for the support. I'll now try to rerun the whole pipeline with the new version of the software and see whether every step runs correctly.

Andrea
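Since the empty VCFs appear without any error signal, those silent failures have to be caught by inspecting the outputs. A minimal sketch (plain gzip stands in for bgzip, whose output is gzip-compatible; directory and file names are illustrative): flag any chunk VCF with a header but no records, so the corresponding array tasks can be resubmitted.

```shell
# Create two illustrative chunk VCFs: one header-only, one with a record.
mkdir -p VCALL/example
printf '##fileformat=VCFv4.2\n' | gzip > VCALL/example/empty.vcf.gz
printf '##fileformat=VCFv4.2\nchr1\t1\t.\tA\tG\t50\tPASS\t.\n' | gzip > VCALL/example/ok.vcf.gz

# List VCFs that contain no non-header lines.
for vcf in VCALL/example/*.vcf.gz; do
  if [ "$(gzip -dc "$vcf" | grep -cv '^#')" -eq 0 ]; then
    echo "EMPTY: $vcf"
  fi
done > empty_list.txt
```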

@RenzoTale88
Author

Hi,
sorry to come back to this, but I got some update.
I'm running the analyses using vg-1.21.0-160, as written above. When running the pipeline using vg chunk -C, for some reason, the software crashes during the snarl computation on the augmented graphs of some chromosomes. However, when I chunk the dataset using:

vg chunk -x all.xg -a FILTER/${sample}/${sample}.filtered.gam -P ./LISTS/PATHS.txt -c 50 -g -s 999999999 -o 100000 -b CHUNK/${sample}/${sample}_call_chunk -t 4 -E CHUNK/${sample}/${sample}.chunklist -f

This produces a series of chromosome-sized chunks (given the large chunk size), and they work just fine in all subsequent stages, including variant calling.
My question is: is this a good workaround? I mean, does it produce results similar to the -C chunking approach?

Thanks for your help again
Andrea

@glennhickey
Contributor

glennhickey commented Feb 27, 2020 via email

@RenzoTale88
Author

So, after calling the variants I actually have variants several thousand bases long. What do you mean by a large SV? Is there a size limit applied by the above command?
