from GAM to VCF notes #794
Setting the parallelism to 8 seems to work:
vg chunk -x ~/graphs/human/pan/index.xg -a HG002-NA24385-30x.gam.nodedb -P path.list -b HG002-NA24385-30x.gams/ -R targets.bed -t 8
But now I see something strange: either the compression is working fantastically better due to the binning by chromosome, or something else is going on:
# input
-> % du -hs HG002-NA24385-30x.gam
135G HG002-NA24385-30x.gam
# input sorted, with unaligned reads removed
-> % du -hs HG002-NA24385-30x.gam.sorted
85G HG002-NA24385-30x.gam.sorted
# per-chromosome binned, unaligned reads removed and no decoy sequences
-> % du -hs HG002-NA24385-30x.gams
78G HG002-NA24385-30x.gams
Could this be caused by the removal of the decoy sequences? |
I'm now attempting to use toil-vg call to generate calls for the GAMs. I notice that the first thing it does is re-index the GAMs, then chunk based off of these indexes. This seems like a pretty expensive step if we've already had to make the genome-wide index in order to subdivide the alignments. @glennhickey is there any way to pass the original index in directly, rather than making the chromosome-level GAMs and then re-indexing these to get smaller GAMs? |
Yeah, by default vg chunk uses the subgraph extraction from xg, which takes a lot of memory to do, as well as to store the vg output. It tries to do subgraphs in parallel when you give it more threads, which can bloat the memory quickly.
There is an option (probably not documented well enough) to use id ranges for extraction in vg chunk. It's what gets used in toil-vg
https://github.com/glennhickey/toil-vg/blob/master/src/toil_vg/vg_map.py#L343-L352
and it is much faster and takes far less memory. It requires a TSV file with name / lowest-node-id / highest-node-id for each chromosome, something that toil-vg creates while indexing, using vg stats on the input vg files (which must have a contiguous id space per chromosome):
https://github.com/glennhickey/toil-vg/blob/master/src/toil_vg/vg_index.py#L306-L334
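The id-range TSV described above can be built along these lines. This is a rough sketch, not toil-vg's actual code: it assumes `vg stats -r` reports the node id range of a graph as a `min:max` field, and that each chromosome has its own `.vg` file named after its path; the exact output format may differ across vg versions.

```shell
# Build a "name <TAB> lowest-node-id <TAB> highest-node-id" TSV, one row
# per chromosome graph. Assumes vg stats -r emits a line like
# "node-id-range<TAB>1:123456" (check your vg version's output).
for chrom in $(cat path.list); do
    range=$(vg stats -r "${chrom}.vg" | cut -f 2)   # e.g. "1:123456"
    printf '%s\t%s\t%s\n' "$chrom" "${range%%:*}" "${range##*:}"
done > id_ranges.tsv
```

The shell parameter expansions `${range%%:*}` and `${range##*:}` split the `min:max` pair without spawning extra processes.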
|
Just a note to clarify: toil-vg uses the id-range heuristic only for splitting into chromosomes. It does a second level of splitting along small path intervals for the calling, and that uses the subgraph extraction.
|
I see what you mean. This was written to save the disk space and IO of copying the whole-genome gam index to each worker, but at the cost of re-indexing. I can imagine this not being worth it in many (most?) computing environments. Is this step taking a lot of time for you? I don't think it'll be too much trouble to add an option to pass through the whole index.
|
Finally: I have noticed the radical differences in GAM sizes you show, and as far as I can tell it's entirely due to ordering and compression.
|
The removal of the unaligned reads can't hurt, but it's good to know you're seeing the boost in compression too.
My "dream" variant calling process for a single node would work directly from the whole-graph node2alignment GAM index, generating the chunks and calling them in parallel. It would even be fine to make all the regions directly from the GAM index first, then run jobs across these (maybe remotely). In theory, the GAM index can support very high parallelism in reads, so this could be done very quickly for the whole genome. It also didn't take long to build the indexes, even on a non-ideal system (networked storage rather than SSD). I'm not sure what the best way to implement this is. Is toil-vg the closest thing, or should we consider dropping something into vg itself to support the single-server variant calling pipeline application?
Also, I've not been able to get toil-vg call to run without failure on my system. It seems to be running out of memory in various ways: it reports job failures related to exceeding memory, retries, and then seems to fail for good. And when I try to set large memory limits it fails completely as well. |
That second one (make all the regions directly from the GAM index, then call) should be doable with vg chunk and vg call as they stand, plus a small option to toil-vg call (which may already exist) to skip the chromosome-level stuff.
That's too bad about toil-vg. Do you mind sending me any error logs? The resource-requirements specification seems to be the major bottleneck for people using this. It really needs to be simplified and more automated.
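The chunk-then-call fan-out described above can be driven with plain xargs. A sketch under assumptions: `call_one_chrom.sh` is a hypothetical wrapper standing in for the actual per-chromosome `vg chunk` + `vg call` invocation, and `xargs -a` is the GNU form (on BSD/macOS, redirect the file in with `< path.list xargs ...`).

```shell
# Run one job per chromosome name in path.list, up to 8 at a time.
# ./call_one_chrom.sh is a placeholder for your per-chromosome
# "vg chunk ... && vg call ..." wrapper script.
xargs -a path.list -n 1 -P 8 ./call_one_chrom.sh
```

Capping `-P` here bounds how many subgraph extractions run at once, which is the same knob that controls peak memory inside vg chunk itself.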
|
I'm stuck trying to run toil-vg call. I need to set up the per-chromosome GAMs, and I don't see a clear way scripted out in the toil-vg scripts. I'm trying with vg filter and vg chunk.
vg filter seems like it should do the right thing in one pass over the data, but it's taking ages to produce output. Is this the right way to run it? The lack of output is concerning.
vg chunk has run out of memory on my system. I have 256G RAM on the system. Should I set the thread count lower? I'm a bit confused how it would be running out of memory, so I'm looking for some advice about what's worked in the past.
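One generic way to tell whether a slow filter is actually making progress is to poll its output file's size. A sketch: `filtered.gam` is a placeholder for whatever output path the filter writes, and `stat -c %s` is the GNU coreutils form (`stat -f %z` on BSD/macOS).

```shell
# Poll the output file's size to check whether a long-running job is
# writing anything at all. filtered.gam is a placeholder output path.
out=filtered.gam
prev=0
for i in 1 2 3; do
    cur=$(stat -c %s "$out" 2>/dev/null || echo 0)
    echo "size: $cur bytes (+$((cur - prev)) since last poll)"
    prev=$cur
    sleep 1   # widen the interval for real jobs
done
```

If the size never moves, the process is likely buffering its entire input (or stuck) rather than streaming, which distinguishes "slow but working" from "not working".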