Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

from GAM to VCF notes #794

Open
ekg opened this issue May 7, 2017 · 8 comments
Open

from GAM to VCF notes #794

ekg opened this issue May 7, 2017 · 8 comments

Comments

@ekg
Copy link
Member

ekg commented May 7, 2017

I'm stuck trying to run toil-vg call. I need to set up the per-chromosome GAMs, and I don't see a clear way scripted out in the toil-vg scripts. I'm trying with vg filter and vg chunk.

vg filter seems like it should do the right thing in one pass over the data, but it's taking ages to produce output:

vg filter -x ~/graphs/human/pan/index.xg -R path.bed -B HG002-NA24385-30x.gams/ -t 32 HG002-NA24385-30x.gam

Is this the right way to run it? The lack of output is concerning.

vg chunk has run out of memory on my system.

vg chunk -x ~/graphs/human/pan/index.xg -a HG002-NA24385-30x.gam.nodedb -P path.list -b HG002-NA24385-30x.gams/ -R targets.bed -t 32

I have 256G RAM on the system. Should I set the thread count lower? I'm a bit confused how it would be running out of memory, so I'm looking for some advice about what's worked in the past.

@ekg
Copy link
Member Author

ekg commented May 8, 2017

Setting the parallelism to 8 seems to work:

vg chunk -x ~/graphs/human/pan/index.xg -a HG002-NA24385-30x.gam.nodedb -P path.list -b HG002-NA24385-30x.gams/ -R targets.bed -t 8

But now I see something strange--- either the compression is working fantastically better due to the binning by chromosome, or something else is going on:

# input
-> % du -hs HG002-NA24385-30x.gam   
135G    HG002-NA24385-30x.gam
       
# input sorted, with unaligned reads removed
-> % du -hs HG002-NA24385-30x.gam.sorted
85G     HG002-NA24385-30x.gam.sorted    

# per-chromosome binned, unaligned reads removed and no decoy sequences
-> % du -hs HG002-NA24385-30x.gams  
78G     HG002-NA24385-30x.gams      

Could this be caused by the removal of the decoy sequences?

@ekg ekg changed the title best way to chunk up a GAM from GAM to VCF notes May 8, 2017
@ekg
Copy link
Member Author

ekg commented May 8, 2017

I'm now attempting to use toil-vg call to generate calls for the GAMs.

I notice that the first thing it does is re-index the GAMs, then chunk based off of these indexes. This seems like a pretty expensive step if we've already had to make the genome wide index in order to subdivide the alignments.

@glennhickey is there any way to pass the original index directly in rather than make the chromosome-level GAMs and then re-index these to get smaller GAMs?

@glennhickey
Copy link
Contributor

glennhickey commented May 8, 2017 via email

@glennhickey
Copy link
Contributor

glennhickey commented May 8, 2017 via email

@glennhickey
Copy link
Contributor

glennhickey commented May 8, 2017 via email

@glennhickey
Copy link
Contributor

glennhickey commented May 8, 2017 via email

@ekg
Copy link
Member Author

ekg commented May 8, 2017

Finally: I have noticed the radical differences in gam sizes you show, and as far as I can tell it's entirely due to ordering and compression.

The removal of the unaligned reads can't hurt, but it's good to know you're seeing the boost in compression too.

This was written to save so disk space and IO of copying the whole-genome gam index to each worker, but at the cost of re-indexing.

My "dream" variant calling process for a single node would work directly from the whole graph node2alignment GAM index, generating the chunks and calling them in parallel. It would even be fine to make all the regions directly from the GAM index first, then run jobs across these (maybe remotely). In theory, the GAM index can support very high parallelism in reads, so this could be done very quickly for the whole genome. It also didn't take a long time to build the indexes, even in a non-ideal system (networked storage < SSD).

I'm not sure what the best way to implement this is. Is toil-vg the closest thing or should we consider dropping something into vg itself to support the single-server variant calling pipeline application?

Also, I've not been able to get toil-vg call to run without failure on my system. It seems to be running out of memory in various ways. It's reporting job failures related to exceeding memory, then it's re-trying and it seems to be failing for good then. And when I try to set large memory limits it is completely failing as well.

@glennhickey
Copy link
Contributor

glennhickey commented May 8, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants