Support for long reference sequences #900

Open
jayoung opened this issue Jun 15, 2017 · 17 comments

Comments

@jayoung

jayoung commented Jun 15, 2017

Hi there,

I'd like to add another voice for providing support for longer reference sequences: it already seems useful for some genomes, and as better assembled and larger genomes come out it seems like the need will get more frequent.

I've been working a little with the opossum monDom5 assembly, where two chromosomes are too long to be handled by IGV (chr1 is 748 Mb and chr2 is 541 Mb). I've been told that an underlying limitation of htsjdk creates this issue for IGV with longer reference sequences, so I'm posting the request here too.

Thanks for considering it,

Janet Young


Dr. Janet Young

Malik lab
http://research.fhcrc.org/malik/en.html

Division of Basic Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., A2-025,
P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512
email: jayoung ...at... fredhutch.org


@lindenb
Contributor

lindenb commented Jun 15, 2017

As far as I know, the maximum length of a chromosome is max(int32) = 2,147,483,647, so there shouldn't be any problem on this side.

There may be a problem when you're using a binning index (like bam.bai or tabix.tbi), where the maximum size is only 512 Mb, i.e. 2^29 bp (http://genomewiki.ucsc.edu/index.php/Bin_indexing_system). But you can always use a CSI index: https://www.biostars.org/p/111984/
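For readers who want to see where the 512 Mb figure comes from, here is a minimal sketch of the arithmetic. It assumes the parameters given in the SAM and CSI specifications: a binning index with smallest bin size 2^min_shift and `depth` levels can address 2^(min_shift + 3*depth) bases, and BAI/tabix fix min_shift = 14 and depth = 5. The class and method names are made up for illustration and are not part of htsjdk; the chromosome lengths are the rounded values quoted in this thread.

```java
final class BinIndexLimits {
    /** Number of bases addressable by a binning index with these parameters. */
    static long maxIndexableLength(int minShift, int depth) {
        return 1L << (minShift + 3 * depth);
    }

    /** Smallest depth whose addressable span covers a reference of the given length. */
    static int minDepthFor(long referenceLength, int minShift) {
        int depth = 0;
        while (maxIndexableLength(minShift, depth) < referenceLength) {
            depth++;
        }
        return depth;
    }

    public static void main(String[] args) {
        // BAI and tabix use fixed parameters: min_shift = 14, depth = 5.
        long baiLimit = maxIndexableLength(14, 5);
        System.out.println("BAI/tabix limit: " + baiLimit + " bp");

        // Rounded lengths from the thread (assumption: 1 Mb = 1,000,000 bp).
        long chr1 = 748_000_000L;
        long chr2 = 541_000_000L;
        System.out.println("chr1 fits in BAI: " + (chr1 <= baiLimit));
        System.out.println("chr2 fits in BAI: " + (chr2 <= baiLimit));

        // A CSI index can raise depth; depth 6 already covers any int32 length.
        System.out.println("depth needed for chr1: " + minDepthFor(chr1, 14));
        System.out.println("depth needed for max(int32): " + minDepthFor(Integer.MAX_VALUE, 14));
    }
}
```

Under these assumptions, both opossum chromosomes exceed the BAI limit, while a CSI index with the same min_shift and one extra level would cover them.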

@jrobinso
Contributor

Yes, the problem is the index. Does htsjdk support csi indexes? My understanding is it does not. I'm looking at the source code but don't see anything.

@lbergelson
Member

It doesn't support CSI as far as I know.

@yfarjoun
Contributor

yfarjoun commented Jun 16, 2017 via email

@lbergelson
Member

I'm not really sure, but we haven't always done a great job of handling index-related things in a clean way, and I suspect we have client code that relies on the details of the bai index, since it's been the only one we support for bam. I'd guess that it would be a fair amount of work to implement.

@yfarjoun
Contributor

From what I saw online, bai is a specialization of csi, so it might not be so bad. It would be good for someone knowledgeable about the format to step in, though.

@jayoung
Author

jayoung commented Jul 8, 2017

Thanks all for looking into this: hoping someone capable (I am not capable) is interested in taking this on at some point.

Janet

@nathanhaigh

There is another issue open regarding CSI index support in htsjdk: #447

@FredericBGA

Hi,

Picard has a PR that allows working with a CSI index (from #1040):
broadinstitute/picard#998

But I need htsjdk to be able to write CSI as well (example of Wheat genome):

Exception in thread "main" htsjdk.samtools.SAMException: Exception when processing alignment for BAM index M01322:139:000000000-BVL2R:1:2108:5384:21249 2/2 301b aligned to chr1A:537177243-537177543.
Caused by: htsjdk.samtools.SAMException: Exception creating BAM index for record M01322:139:000000000-BVL2R:1:2108:5384:21249 2/2 301b aligned to chr1A:537177243-537177543.
Caused by: java.lang.IllegalStateException: Read position too high for BAI bin indexing.

How can I help?
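To make the failure above concrete, here is a sketch of the bin calculation, ported to Java from the `reg2bin` routine given in the CSI specification. It is not htsjdk's code, and the class name and the `binCount` helper are hypothetical. With the BAI parameters (min_shift = 14, depth = 5) the read at chr1A:537177243-537177543 maps to a bin number beyond the last valid BAI bin, which is consistent with the "Read position too high" error; with depth = 6 the same read gets a valid bin.

```java
final class Reg2BinDemo {
    /** Bin for a zero-based, half-open interval [beg, end), per the CSI spec's reg2bin. */
    static int reg2bin(long beg, long end, int minShift, int depth) {
        int s = minShift;                      // shift for the current level
        int t = ((1 << (depth * 3)) - 1) / 7;  // first bin number of the deepest level
        end--;                                 // make the end inclusive
        for (int l = depth; l > 0; ) {
            if ((beg >> s) == (end >> s)) {
                return t + (int) (beg >> s);
            }
            l--;
            s += 3;
            t -= 1 << (l * 3);
        }
        return 0; // interval spans the whole reference: root bin
    }

    /** Number of regular bins (0 .. binCount-1) for a given depth. */
    static int binCount(int depth) {
        return ((1 << (3 * (depth + 1))) - 1) / 7;
    }

    public static void main(String[] args) {
        long beg = 537_177_243L - 1; // 1-based start from the error message, made 0-based
        long end = 537_177_543L;     // 1-based inclusive end equals the 0-based exclusive end

        int baiBin = reg2bin(beg, end, 14, 5);
        System.out.println("BAI bin " + baiBin + ", valid bins are below " + binCount(5));

        int csiBin = reg2bin(beg, end, 14, 6);
        System.out.println("CSI (depth 6) bin " + csiBin + ", valid bins are below " + binCount(6));
    }
}
```

The read starts 306,331 bases past the 2^29 boundary, so no choice of BAI bin can represent it; only an index with more levels (or a larger min_shift) can.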

@lbergelson
Member

I think GATK SHOULD work with bams with a CSI index already as well. There's no VCF support at the moment which will probably be problematic though.

I don't think we have any plan to implement it on our own in the near future. We just don't have the bandwidth to look into it, especially with the current pandemic craziness which is significantly cutting into people's coding time.

However, if you're interested in contributing and have a fairly significant chunk of time on your hands we'd be able to work with you on getting a CSI writer into htsjdk. In theory it shouldn't be too hard, it's basically just a tiny modification to the bai writer. I suspect there will be a lot of pain points though with how the code is structured and (poorly) tested.
The index code is, unfortunately, really gross. It wasn't really designed to be extensible. There's also a weird divide between the bam index code and the vcf index code: CSI could in theory be used for VCF to support long references, but currently we only support tabix. As far as I understand, vcf CSI and bam CSI have to be slightly different due to how extra metadata is encoded in the bai / bam csi.

@cmnbroad IS possibly going to be trying to modernize some of the index code in the near future. He might be able to offer some guidance as to where to start if you're interested.

We'd love the help, but it will probably take a while.
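As a rough illustration of how close the two formats are at the header level, here is a sketch of encoding the fixed part of a CSI header. It follows my reading of the CSIv1 specification (magic `CSI\1`, then little-endian `min_shift`, `depth`, `l_aux`, the `aux` bytes, and `n_ref`), whereas a BAI header has only the magic and `n_ref`. The class and method are hypothetical and not part of htsjdk. A real CSI file is BGZF-compressed and continues with per-reference bin and chunk data, none of which is shown here, so the field order should be checked against the specification before relying on it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class CsiHeaderSketch {
    /**
     * Encodes the uncompressed fixed header of a CSI index.
     * aux carries format-specific metadata (for example tabix-style settings for VCF);
     * it may be empty.
     */
    static byte[] encodeHeader(int minShift, int depth, byte[] aux, int nRef) {
        ByteBuffer buf = ByteBuffer
                .allocate(4 + 4 + 4 + 4 + aux.length + 4)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.put(new byte[] {'C', 'S', 'I', 1}); // magic
        buf.putInt(minShift);                   // bits for the smallest bin size
        buf.putInt(depth);                      // number of binning levels
        buf.putInt(aux.length);                 // l_aux
        buf.put(aux);                           // auxiliary data
        buf.putInt(nRef);                       // number of reference sequences
        return buf.array();
    }

    public static void main(String[] args) {
        // Same min_shift as BAI, one extra level, no auxiliary data, two references.
        byte[] header = encodeHeader(14, 6, new byte[0], 2);
        System.out.println("header length: " + header.length + " bytes");
    }
}
```

The optional `aux` block is one place where BAM and VCF indexes can differ, which may be part of the metadata difference mentioned above.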

@jrobinso
Contributor

jrobinso commented Apr 3, 2020

Hey @lbergelson @cmnbroad hang in there, must be crazy where you are.

RE index modernization: ping me if you do that. I'm interested in modernizing the igv.js code as well, so there might be some synergy. Currently both bam and tabix indexes are handled by the same code, with a few switches; this in turn was adapted from even older code. At least it's relatively small: https://github.com/igvteam/igv.js/blob/master/js/bam/bamIndex.js

@nathanhaigh

Perhaps the code developed for use by JBrowse could help?

BAM Index

Tabix Index

@FredericBGA

Thank you for all these comments.
I'm not sure that my skills (Java and computing) will be enough, but I'm still ready to help.
Of course, I'm stuck at home right now like half of the people on Earth, so this is not the best timing, but we could get back in touch once you have managed to find time to make progress on this issue.

@aschaetz

aschaetz commented Jun 1, 2022

Hi Guys,
First of all, thank you for the read support for .csi indices.

We are starting to see more and more genomes where the contigs exceed the limits of the .bai index.
While it's good that htsjdk can read these indices, we need write support as well.

Not being able to write a .csi index is a significant limitation when working with large, well-assembled plant genomes.
And we believe that this will become more important as well-assembled genomes become more frequent, due to the rise of long-read technologies.

Adding this functionality would be very much appreciated.

@lindenb
Contributor

lindenb commented Jun 1, 2022

There was a project, samtools/htsjdk-next-beta, whose aim was to implement long REFs (e.g. samtools/htsjdk-next-beta#6). But the project looks inactive (dead?) now.

@aschaetz

aschaetz commented Jun 9, 2022

Yes, sadly no activity there.

@lynnjo

lynnjo commented Nov 8, 2022

My group (Buckler Lab at Cornell University) also has a need for VCF CSI support for software we write and maintain. Our open source code (Buckler Lab PHG) uses gvcf files to store variant information for plant genomes and has run into problems working with the larger genomes, e.g. wheat.

Has anything changed in terms of scheduling CSI support for VCF? We would be interested in working on this in the htsjdk code base if there were someone with whom we could consult. @lbergelson @cmnbroad (or anyone else), could you comment or give an update on where this stands? Thanks


9 participants