from GAM to VCF notes #794
Setting the parallelism to 8 seems to work:
vg chunk -x ~/graphs/human/pan/index.xg -a HG002-NA24385-30x.gam.nodedb -P path.list -b HG002-NA24385-30x.gams/ -R targets.bed -t 8
But now I see something strange: either the compression is working fantastically better due to the binning by chromosome, or something else is going on:
# input
-> % du -hs HG002-NA24385-30x.gam
135G HG002-NA24385-30x.gam
# input sorted, with unaligned reads removed
-> % du -hs HG002-NA24385-30x.gam.sorted
85G HG002-NA24385-30x.gam.sorted
# per-chromosome binned, unaligned reads removed and no decoy sequences
-> % du -hs HG002-NA24385-30x.gams
78G HG002-NA24385-30x.gams
Could this be caused by the removal of the decoy sequences? |
I'm now attempting to use toil-vg call to generate calls for the GAMs. I notice that the first thing it does is re-index the GAMs, then chunk based off of these indexes. This seems like a pretty expensive step if we've already had to make the genome-wide index in order to subdivide the alignments. @glennhickey is there any way to pass the original index in directly, rather than making the chromosome-level GAMs and then re-indexing these to get smaller GAMs? |
Yeah, by default vg chunk uses the subgraph extraction from xg, which takes a lot of memory to do, as well as to store the vg output. It tries to do subgraphs in parallel when you give it more threads, which can bloat the memory quickly.
There is an option (probably not documented well enough) to use id ranges for extraction in vg chunk. It's what gets used in toil-vg
https://github.com/glennhickey/toil-vg/blob/master/src/toil_vg/vg_map.py#L343-L352
and it is much faster and takes far less memory. It requires a TSV file with name / lowest-node-id / highest-node-id for each chromosome, something that toil-vg creates while indexing, using vg stats on the input vg files (which must have a contiguous id space per chromosome):
https://github.com/glennhickey/toil-vg/blob/master/src/toil_vg/vg_index.py#L306-L334
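The id-range TSV described above can be built along these lines. This is a rough sketch, not toil-vg's actual code: it assumes `vg stats -r` reports the node id range of a graph as a `min:max` field, and that each chromosome has its own `.vg` file named after its path; the exact output format may differ across vg versions.

```shell
# Build a "name <TAB> lowest-node-id <TAB> highest-node-id" TSV, one row
# per chromosome graph. Assumes vg stats -r emits a line like
# "node-id-range<TAB>1:123456" (check your vg version's output).
for chrom in $(cat path.list); do
    range=$(vg stats -r "${chrom}.vg" | cut -f 2)   # e.g. "1:123456"
    printf '%s\t%s\t%s\n' "$chrom" "${range%%:*}" "${range##*:}"
done > id_ranges.tsv
```

The shell parameter expansions `${range%%:*}` and `${range##*:}` split the `min:max` pair without spawning extra processes.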
|
Just a note to clarify: toil-vg uses the id-range heuristic only for splitting into chromosomes. It does a second level of splitting along small path intervals for the calling, and that uses the subgraph extraction.
|
I see what you mean. This was written to save the disk space and IO of copying the whole-genome gam index to each worker, but at the cost of re-indexing. I can imagine this not being worth it in many (most?) computing environments. Is this step taking a lot of time for you? I don't think it'll be too much trouble to add an option to pass through the whole index.
|
Finally: I have noticed the radical differences in GAM sizes you show, and as far as I can tell it's entirely due to ordering and compression.
|
The removal of the unaligned reads can't hurt, but it's good to know you're seeing the boost in compression too.
My "dream" variant calling process for a single node would work directly from the whole-graph node2alignment GAM index, generating the chunks and calling them in parallel. It would even be fine to make all the regions directly from the GAM index first, then run jobs across these (maybe remotely). In theory, the GAM index can support very high parallelism in reads, so this could be done very quickly for the whole genome. It also didn't take long to build the indexes, even on a non-ideal system (networked storage rather than SSD). I'm not sure what the best way to implement this is. Is toil-vg the closest thing, or should we consider dropping something into vg itself to support the single-server variant calling pipeline application?
Also, I've not been able to get toil-vg call to run without failure on my system. It seems to be running out of memory in various ways: it reports job failures related to exceeding memory, retries, and then seems to fail for good. And when I try to set large memory limits it fails completely as well. |
That second one (make all the regions directly from the GAM index, then call) should be doable with vg chunk and vg call as they stand, plus a small option to toil-vg call (which may already exist) to skip the chromosome-level stuff.
That's too bad about toil-vg. Do you mind sending me any error logs? The resource-requirements specification seems to be the major bottleneck for people using this. It really needs to be simplified and more automated.
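The chunk-then-call fan-out described above can be driven with plain xargs. A sketch under assumptions: `call_one_chrom.sh` is a hypothetical wrapper standing in for the actual per-chromosome `vg chunk` + `vg call` invocation, and `xargs -a` is the GNU form (on BSD/macOS, redirect the file in with `< path.list xargs ...`).

```shell
# Run one job per chromosome name in path.list, up to 8 at a time.
# ./call_one_chrom.sh is a placeholder for your per-chromosome
# "vg chunk ... && vg call ..." wrapper script.
xargs -a path.list -n 1 -P 8 ./call_one_chrom.sh
```

Capping `-P` here bounds how many subgraph extractions run at once, which is the same knob that controls peak memory inside vg chunk itself.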
|
I'm stuck trying to run toil-vg call. I need to set up the per-chromosome GAMs, and I don't see a clear way scripted out in the toil-vg scripts. I'm trying with vg filter and vg chunk.
vg filter seems like it should do the right thing in one pass over the data, but it's taking ages to produce output. Is this the right way to run it? The lack of output is concerning.
vg chunk has run out of memory on my system. I have 256G RAM on the system. Should I set the thread count lower? I'm a bit confused how it would be running out of memory, so I'm looking for some advice about what's worked in the past.
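One generic way to tell whether a slow filter is actually making progress is to poll its output file's size. A sketch: `filtered.gam` is a placeholder for whatever output path the filter writes, and `stat -c %s` is the GNU coreutils form (`stat -f %z` on BSD/macOS).

```shell
# Poll the output file's size to check whether a long-running job is
# writing anything at all. filtered.gam is a placeholder output path.
out=filtered.gam
prev=0
for i in 1 2 3; do
    cur=$(stat -c %s "$out" 2>/dev/null || echo 0)
    echo "size: $cur bytes (+$((cur - prev)) since last poll)"
    prev=$cur
    sleep 1   # widen the interval for real jobs
done
```

If the size never moves, the process is likely buffering its entire input (or stuck) rather than streaming, which distinguishes "slow but working" from "not working".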