Question on resulting VG variants #2594
The N is probably found in the input genomes.
Do you see Ns in the input FASTA sequences that you gave to cactus? Do you find them in the GFA output of the graph (from vg view)?
Yes, there are gaps in the input genomes which are coded as Ns, and yes, I see them when I run vg view. Are there changes to be made when creating the input graph from cactus outputs?
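As a quick sanity check on the inputs, the gap runs can be located directly in the FASTA sequence; a minimal sketch in Python (a generic helper, not part of vg or cactus):

```python
import re

def gap_runs(seq):
    """Return (start, end) half-open intervals of N runs (assembly gaps)."""
    return [(m.start(), m.end()) for m in re.finditer(r"N+", seq.upper())]

# Toy sequence with two gap runs (case-insensitive).
runs = gap_runs("ACGTNNNNACGTnnA")
print(runs)  # -> [(4, 8), (12, 14)]
```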
There is nothing you need to do; this is all expected. You're getting these calls because in some places reads align across the gaps (Ns).
I'm not sure I'm getting this right, sorry. So, I'm getting variants like the one below:
genome1.1 10333 10528667_10528803 ATC NNN 237.434 PASS DP=239 GT:DP:AD:GL 1/1:239:1,237:-663.514901,-219.974202,-161.973251
So, the reference allele is ATC, but the alternate is NNN. So, the gap is supposed to be in the reads? Sorry, maybe I'm just misinterpreting what you're telling me right now, but I'd like to completely understand what I'm getting as an output.
Sorry, I misunderstood. This is indeed strange. Maybe @glennhickey can give some insight.
It does look like the gap is in the reads, or in the alternate allele that the caller has determined. It looks like a bug.
So, the problem appears when I use the augmented vg graph, but it is not present when using the xg index. I don't know if this might help track down the problem. I'm using vg version 1.20.0-78-gc71c000.
There must be N's in the reads that are getting augmented into the graph. @adamnovak and I were just chatting about this on Slack the other day. We used to have a filter for this, but it got lost in refactoring. I think I will have to add it back. In the meantime, you can just ignore these calls. It may be worth double-checking that this is not a bug (and that this NNN really is in the vast majority of your reads as aligned to this position of the graph). I can do this for you if you share the augmented graph / gam in question.
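Until such a filter is back in augment, records like these can also be screened out of the VCF downstream; a minimal sketch over plain-text VCF lines (a hypothetical helper, not vg tooling — real chunk VCFs would first be decompressed from bgzip):

```python
def drop_n_alt_records(vcf_lines):
    """Yield VCF lines, skipping records whose ALT field contains an N base.

    Header lines (starting with '#') are kept unchanged.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        alt = line.rstrip("\n").split("\t")[4]  # ALT is VCF column 5
        if any("N" in allele.upper() for allele in alt.split(",")):
            continue  # drop calls driven by gap (N) bases
        yield line

records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "genome1.1\t10333\t10528667_10528803\tATC\tNNN\t237.434\tPASS\tDP=239",
    "genome1.1\t52000\t.\tA\tG\t50.0\tPASS\tDP=30",
]
kept = list(drop_n_alt_records(records))  # NNN record is removed
```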
Sorry, at the moment I no longer have the augmented chunks (I deleted them after finishing calling the variants). I can recreate them, but it might take a bit.
Using a chunk size larger than the longest chromosome should allow the split into single chromosomes, but I'd like to know whether there is a better, more specific way to do it. Thanks again for the help, Andrea
Yes, this is now how it's done.
The other steps (augment / pack / call) should work fine on chromosome-sized graphs. You can save some time by not splitting the GAM. Finally: if you do split the GAM, you'll need the very latest master branch.
Ok perfect, I'm proceeding as you suggested. Does vg augment also accept pg files as input? Just running the command shows me that it accepts a vg file (just double checking, of course), using vg 1.20.0?
Yes, augment definitely accepts pg files in 1.21.0. If it isn't working in
1.20.0, then switching to the newer release should fix it.
Ok, it is running now. I have one more question, more related to computational time/space. Using vg augment I will produce an augmented graph and reads (gam), which are then the inputs for vg pack. My question is: will the augmented gam contain only the reads that map to the specific chromosome I'm processing, or all of the reads mapped to the graph? I'm asking because this might lead to much higher space requirements, and I therefore need to plan the process more carefully. Thanks again, Andrea
The augmented GAM will only contain reads that map to the graph, so every read in your original GAM should appear in at most one augmented GAM.
Note that this requires the -s option for vg augment (otherwise you'll get an error telling you to use it when a read outside the graph is encountered).
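The "at most one augmented GAM" property can be sanity-checked once read names have been extracted from each chunk (e.g. via vg view -a, outside this sketch); a small hypothetical helper:

```python
from itertools import combinations

def chunks_share_reads(chunk_reads):
    """Return the set of read names appearing in more than one chunk.

    `chunk_reads` maps a chunk name to the set of read names found in
    that chunk's augmented GAM.
    """
    shared = set()
    for (_, a), (_, b) in combinations(chunk_reads.items(), 2):
        shared |= a & b
    return shared

# Toy example: read r3 incorrectly appears in two chunks.
overlap = chunks_share_reads({
    "chr1": {"r1", "r2", "r3"},
    "chr2": {"r3", "r4"},
    "chr3": {"r5"},
})
```

An empty result means the GAM was split cleanly across chunks.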
Ok, thank you very much, I'll give it a go with this additional option. Thanks a lot again.
Hi again,
Then, for every chunk, I augmented the graph with the full dataset (bname is the basename of each chunk):
This step, however, finishes with errors every time I run it. The error I get is the following:
Below is the stack trace of the issue encountered. Am I doing something wrong? Andrea
This type of error usually means it's completely failing to read an input file. The strange thing is that the error message implies it's the graph (which is not loaded as the type shown), but the stack trace implies it's the GAM. Does
Ok, sorry, I might have used the wrong stack trace (I'm running on a cluster and have to retrieve the stack traces from each node). I tried again between last night and this morning, this time chunking the graph into chromosomes and creating sub-vg files instead of pg. I've always used version 1.20.0 for all the operations, although using version 1.21.0 didn't change the results.
Oddly enough, all the chromosomes failed with the exception of the last one. I'm not really sure why, since it is an arrayed job and repeats every step for every chromosome the same way. Running Let me know what I can try next.
I see. This stack trace is pretty clear about it being These issues have been helpful, so thanks for your patience. We've mostly been using
Thank you very much for your help!
Yeah, just a fresh clone with no branch specified. But keep in mind that in this version, you can no longer combine graphs with
Ok, that should be fine. Is the resulting graph from
As far as I know, yes. Good question, though. @adamnovak can confirm for sure.
Yes, the
@RenzoTale88 I'm getting a crash with a similar stack trace on my own test, so I suspect that trying the latest master will not fix it. I do have all the data I need to debug, and will let you know when it's fixed.
Ok, great, then I'll wait for your go-ahead before trying to upgrade my vg version. In the meanwhile, I've tried again to run Below is the command I'm using to produce the different analyses:
One question: when I use
I think your crash was due to a bug in the
I'm continuing to test on my end. |
Ok, I can confirm that for one sample it keeps crashing on chromosome 1, which is the largest in the genome, even after fixing the paths with vg paths -r.
I proceeded as follows:
vg paths -r ganome1.1 $myfile.aug.vg > tmp.aug.vg
vg snarls tmp.aug.vg > $myfile.aug.snarls
Below is the stack trace:
Crash report for vg v1.21.0-114-g9f3eb752d "Fanano"
Stack trace (most recent call last):
#6 Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x4daf89, in _start
#5 Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x1b44f68, in __libc_start_main
#4 Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x40ada2, in main
Source "src/main.cpp", line 75, in main [0x40ada2]
#3 Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x9b6007, in vg::subcommand::Subcommand::operator()(int, char**) const
| Source "src/subcommand/subcommand.cpp", line 72, in operator()
Source "/usr/include/c++/7/bits/std_function.h", line 706, in operator() [0x9b6007]
#2 Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0x93db96, in main_snarl(int, char**)
Source "src/subcommand/snarls_main.cpp", line 224, in main_snarl [0x93db96]
#1 Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0xb32fb3, in vg::CactusSnarlFinder::find_snarls_parallel()
| Source "src/snarls.cpp", line 114, in finish
Source "src/snarls.cpp", line 947, in find_snarls_parallel [0xb32fb3]
#0 Object "/PATH/TO/Andrea/vg/vg-1.21.0-114-g9f3eb752d", at 0xb321a5, in vg::SnarlManager::build_indexes()
| Source "src/snarls.cpp", line 993, in size
| Source "/usr/include/c++/7/bits/stl_deque.h", line 1272, in operator-<vg::SnarlManager::SnarlRecord, vg::SnarlManager::SnarlRecord&, vg::SnarlManager::SnarlRecord*>
Source "/usr/include/c++/7/bits/stl_deque.h", line 356, in build_indexes [0xb321a5]
That looks like it's running out of memory. Is that what you're seeing?
Any idea how much memory it's using? The only way I can think of to reduce
the memory is using more stringent coverage (-m) and / or quality filtering
(-Q) in vg augment.
No, the script never reaches the maximum memory usage. I can try a more stringent coverage filter to see if it makes a difference (some samples are 30X, others almost 40X, so using the same threshold might have caused some problems).
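If the -m cutoff is tightened per sample, one option is to scale it with each sample's mean depth rather than using a single fixed value for both the 30X and 40X samples; a hypothetical heuristic (not a vg default):

```python
def min_support(mean_depth, fraction=0.1, floor=2):
    """Per-sample minimum read support for augmentation (vg augment -m).

    Uses a fixed fraction of the sample's mean depth, never below `floor`,
    so deeper samples get a proportionally stricter cutoff.
    """
    return max(floor, round(mean_depth * fraction))

print(min_support(30))  # 30X sample -> 3
print(min_support(40))  # 40X sample -> 4
```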
It looks like it isn't the reserve() call at src/snarls.cpp line 993 that is failing, but the size() call, as if something has upset the std::deque internals.
I'd run the offending snarls command under gdb and get a stack trace that way, to see whether it really is stopping us inside the deque size calculation and, if so, what is wrong with the deque.
You could also try it with -t 1 to use a single thread. I think snarl finding is parallel, and it could be that a signal from one thread is ending up getting received by another. I thought our stack trace code was supposed to work around that, but maybe it doesn't always succeed?
I can take a look if you can share the graph.
For what it's worth, I finally have this pipeline running through reliably, with and without gam chunking via
That said, I'm running on a graph made with
For the
Hi, sorry for my late reply; I missed the email notifications. I'm testing the latest version right now, starting from the chunking onward. I'll keep you updated on how it develops. Thanks again for the support.
Hi, I've tried with vg v1.21.0-160 and it seems to be working. The only odd thing is that when I submit an array of jobs to process each chromosome separately, some of them generate an empty vcf without any error message or failure signal. I can simply rerun them and they'll work fine, but I find it a puzzling situation. Thanks again for the support; I'll now try to rerun the whole pipeline with the new version of the software and see whether I can run every step correctly. Andrea
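One way to catch those silently empty outputs in an array job is to test each VCF for variant records before accepting it; a small hypothetical check (in practice the chunk VCFs are bgzipped, so the text would first be read through gzip):

```python
def vcf_is_empty(vcf_text):
    """True if the VCF contains no variant records (headers only, or blank)."""
    return not any(
        line.strip() and not line.startswith("#")
        for line in vcf_text.splitlines()
    )

header_only = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n"
with_record = header_only + "genome1.1\t10333\t.\tA\tT\t50\tPASS\t.\n"
```

Chunks for which this returns True could be flagged for an automatic rerun.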
Hi,
sorry to come back to this, but I have some updates.
I'm running the analyses using vg-1.21.0-160 as written above. When running the pipeline using vg chunk -C, for some reason, the software crashes during the snarls augmentation of some chromosomes. However, when I chunk the dataset using:
vg chunk -x all.xg -a FILTER/${sample}/${sample}.filtered.gam -P ./LISTS/PATHS.txt -c 50 -g -s 999999999 -o 100000 -b CHUNK/${sample}/${sample}_call_chunk -t 4 -E CHUNK/${sample}/${sample}.chunklist -f
this produces a series of chunks of chromosomal size (given the large chunk size), and they work just fine in all the subsequent stages, including calling the variants. My question is: is it a good workaround? I mean, does it produce similar results to the -C chunking approach?
Thanks for your help again,
Andrea
The snarls do seem to be a real issue for these cactus alignments. It's ironic, as the snarls code actually uses the cactus graph internally.
Anyway, not using -C in this case limits the size of the variation in your chunks. If you're not interested in large SVs, it probably won't make much difference.
I will take a shot at fixing `snarls` to be more robust, but won't be able to start for another 2 weeks, unfortunately.
So, after calling the variants I actually have variants several thousand bases long. What do you mean by large SVs? Is there a size limit applied by the above command?
Hello,
I'm calling variants using a cactus-generated graph genome, generated and indexed as previously reported in another issue (#2362). Alignments and post-processing are described in issue #2546, with the recommended changes. The only additional modification I've made is to increase the chunk size to 10 Mb, with a 100 kb overlap, to reduce the number of tasks to perform.
I've also managed to call the variants on each chunk using the command:
vg call ./FILTER/$sample/${bname}.aug.vg -r ./FILTER/$sample/${bname}.aug.snarls -s $sample -k ./FILTER/$sample/${bname}.aug.pack -t 1 -p genome1.1 -l 120000000 -o $offset | vcf-sort | bgzip -c > ./VCALL/$sample/${bname}.vcf.gz
This generated a vcf file, relative to the reference genome of interest (called genome1). I've noticed that some of the variants are represented with alternate allele N, and that no deletions appear. Is the N indicative of deletions?
Thank you for your support,
Andrea
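The 10 Mb / 100 kb chunking scheme described above amounts to stepping the start offsets by (chunk size − overlap); a minimal sketch of the offset arithmetic (generic, not vg chunk's own code):

```python
def chunk_offsets(chrom_length, size=10_000_000, overlap=100_000):
    """Start offsets of overlapping fixed-size chunks covering a chromosome."""
    step = size - overlap  # each new chunk re-covers `overlap` bases
    return list(range(0, max(chrom_length - overlap, 1), step))

# A 25 Mb chromosome needs three 10 Mb chunks with 100 kb overlaps.
print(chunk_offsets(25_000_000))  # -> [0, 9900000, 19800000]
```

Each offset would then be fed to the caller as $offset so chunk coordinates can be lifted back to the chromosome.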