Support for long reference sequences #900

jayoung · 2017-06-15T17:57:14Z

Hi there,

I'd like to add another voice for providing support for longer reference sequences: it already seems useful for some genomes, and as better assembled and larger genomes come out it seems like the need will get more frequent.

I've been working a little with the opossum monDom5 assembly where two chromosomes too long to be handled by IGV (chr1 is 748 Mb and chr2 is 541 Mb). I've been told it's an underlying limitation of htsjdk that creates this issue for IGV with longer reference sequences, so I'm posting the request here too.

Thanks for considering it,

Janet Young

Dr. Janet Young

Malik lab
http://research.fhcrc.org/malik/en.html

Division of Basic Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., A2-025,
P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512
email: jayoung ...at... fredhutch.org

lindenb · 2017-06-15T18:15:08Z

as far as I know the max length of a chromosome is max(int32)= 2,147,483,647 so there shouldn't have any problem on this side.

There may be a problem when you're using a binning index (like bam.bai or tabix.tbi) : where the max size is only 512Mb (http://genomewiki.ucsc.edu/index.php/Bin_indexing_system ). But you can always use a csi index: https://www.biostars.org/p/111984/

jrobinso · 2017-06-16T03:20:14Z

Yes, the problem is the index. Does htsjdk support csi indexes? My understanding is it does not. I'm looking at the source code but don't see anything.

lbergelson · 2017-06-16T14:02:30Z

It doesn't support CSI as far as I know.

yfarjoun · 2017-06-16T14:07:51Z

How much of a burden would it be to have support for csi indices?

…

On Fri, Jun 16, 2017 at 10:02 AM, Louis Bergelson ***@***.***> wrote: It doesn't support CSI as far as I know. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#900 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACnk0pfnDdfi3komGZOeL1zaw8ATPB2lks5sEor2gaJpZM4N7i9G> .

lbergelson · 2017-06-16T17:07:37Z

I'm not really sure, but we haven't always done a great job of handling index related things in a clean way, and I suspect we have client code that relies on the details of the bai index since it's been the only one we support for bam. I'd guess that it would be a fair amount of work to implement.

yfarjoun · 2017-06-16T17:11:23Z

from what I saw online, bai is a specialization of csi....so it might not be so bad...it would be good for someone knowledgable about the format to step in though....

jayoung · 2017-07-08T01:16:38Z

Thanks all for looking into this: hoping someone capable (I am not capable) is interested in taking this on at some point.

Janet

nathanhaigh · 2017-12-14T00:28:52Z

There is another issue open regarding CSI index support in htsjdk: #447

FredericBGA · 2020-04-02T09:55:46Z

Hi,

Picard has a PR that allows to work with CSI index (from #1040)
broadinstitute/picard#998

But I need htsjdk to be able to write CSI as well (example of Wheat genome):

Exception in thread "main" htsjdk.samtools.SAMException: Exception when processing alignment for BAM index M01322:139:000000000-BVL2R:1:2108:5384:21249 2/2 301b aligned to chr1A:537177243-537177543.
Caused by: htsjdk.samtools.SAMException: Exception creating BAM index for record M01322:139:000000000-BVL2R:1:2108:5384:21249 2/2 301b aligned to chr1A:537177243-537177543.
Caused by: java.lang.IllegalStateException: Read position too high for BAI bin indexing.

How can I help?

lbergelson · 2020-04-03T17:33:13Z

I think GATK SHOULD work with bams with a CSI index already as well. There's no VCF support at the moment which will probably be problematic though.

I don't think we have any plan to implement it on our own in the near future. We just don't have the bandwidth to look into it, especially with the current pandemic craziness which is significantly cutting into people's coding time.

However, if you're interested in contributing and have a fairly significant chunk of time on your hands we'd be able to work with you on getting a CSI writer into htsjdk. In theory it shouldn't be too hard, it's basically just a tiny modification to the bai writer. I suspect there will be a lot of pain points though with how the code is structured and (poorly) tested.
The index code is unfortunately, really gross. It wasn't really designed to be extensible. There's also a weird divide between the bam index code and the vcf index code, the CSI could in theory be used for VCF to support long references, but currently we only support tabix. As far as I understand vcf CSI and bam CSI have to be slightly different due to how extra metadata is encoded in the bai/ bam csi.

@cmnbroad IS possibly going to be trying to modernize some of the index code it in the near future. He might be able to offer some guidance as to where to start if you're interested.

We'd love the help, but it will probably take a while.

jrobinso · 2020-04-03T19:37:35Z

Hey @lbergelson @cmnbroad hang in there, must be crazy where you are.

RE index modernization, ping me if you do that, I'm interested in modernizing the igv.js code as well, might be some synergy. Currently both bam and tabix indexes are handled by the same code, with a few switches, this in turn was adapted from even older code. At least its relatively small https://github.com/igvteam/igv.js/blob/master/js/bam/bamIndex.js

nathanhaigh · 2020-04-03T20:56:10Z

Perhaps the code developed for use by JBrowse could help?

BAM Index

Tabix Index

FredericBGA · 2020-04-08T06:40:54Z

Thank you for all these comments.
I'm not sure that my skills (java and computing) will be enough, but I'm still ready to help.
Of course I'm stuck at home right now like half of the people on Earth, so this is not the best timing but we could get back in touch when you managed to find time to make progress on this issue.

aschaetz · 2022-06-01T10:15:41Z

Hi Guys,
First of all, thank you for the read support for .csi indices.

We start seeing more and more genomes where the contigs exceed the limits of the .bai index.
While it's good that htsjdk can read these indices, we need write support, as well.

Not being able to write a .csi index is a significant limitation when working with large, well assembled, plant genomes.
And we believe that this will become more important as well assembled genomes become more frequent, due to the rise of long read technologies.

Adding this functionality would be very much appreciated.

lindenb · 2022-06-01T10:46:16Z

there was a project samtools/htsjdk-next-beta which aim was to implement long REFs (eg.: samtools/htsjdk-next-beta#6 ) . But the project looks inactive (dead ?) now.

aschaetz · 2022-06-09T17:37:43Z

Yes, sadly no activity there.

lynnjo · 2022-11-08T13:18:14Z

My group (Buckler Lab at Cornell University) also has a need for VCF CSI support for software we write/maintain. Our open source code (Buckler Lab PHG) uses gvcf files to store variant information for plant genomes and has run into problems working with the larger genomes .e.g wheat.

Has anything changed in terms of scheduling CSI support for VCF? We would be interested in working on this in the htsjdk code base if there were someone with whom we could consult. @lbergelson @cmnbroad (or anyone else) will you comment/ update on where this stands? Thanks

priyanka-surana mentioned this issue Dec 16, 2021

Auto-detect and raise error if CSI is required for BAM indexing nf-core/rnaseq#744

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for long reference sequences #900

Support for long reference sequences #900

jayoung commented Jun 15, 2017

lindenb commented Jun 15, 2017

jrobinso commented Jun 16, 2017

lbergelson commented Jun 16, 2017

yfarjoun commented Jun 16, 2017 via email

lbergelson commented Jun 16, 2017

yfarjoun commented Jun 16, 2017

jayoung commented Jul 8, 2017

nathanhaigh commented Dec 14, 2017

FredericBGA commented Apr 2, 2020

lbergelson commented Apr 3, 2020

jrobinso commented Apr 3, 2020

nathanhaigh commented Apr 3, 2020

FredericBGA commented Apr 8, 2020

aschaetz commented Jun 1, 2022

lindenb commented Jun 1, 2022

aschaetz commented Jun 9, 2022

lynnjo commented Nov 8, 2022

Support for long reference sequences #900

Support for long reference sequences #900

Comments

jayoung commented Jun 15, 2017

lindenb commented Jun 15, 2017

jrobinso commented Jun 16, 2017

lbergelson commented Jun 16, 2017

yfarjoun commented Jun 16, 2017 via email

lbergelson commented Jun 16, 2017

yfarjoun commented Jun 16, 2017

jayoung commented Jul 8, 2017

nathanhaigh commented Dec 14, 2017

FredericBGA commented Apr 2, 2020

lbergelson commented Apr 3, 2020

jrobinso commented Apr 3, 2020

nathanhaigh commented Apr 3, 2020

FredericBGA commented Apr 8, 2020

aschaetz commented Jun 1, 2022

lindenb commented Jun 1, 2022

aschaetz commented Jun 9, 2022

lynnjo commented Nov 8, 2022