Add .fai index creation support #842

Merged
merged 6 commits into from Apr 25, 2017

Conversation

Projects
None yet
3 participants
Contributor

magicDGS commented Apr 5, 2017 edited

Description

As described in #830, it would be nice to have a way of generate a .fai index from a provided reference file. This PR adds the FastaSequenceIndexCreator utility class, which contains:

  • Method for create a .fai file from a reference one. This will be useful for a bundle-tool in Picard/GATK for obtaining all required files within a folder, as the sequence dictionary.
  • Method for build an in-memory index for a non-indexed file. This may be useful for allow using IndexedFastaSequenceFile without requiring a file-based index.

Moreover, other useful changes are included:

  • Method in FastaSequenceIndex to write to a file in the .fai format.
  • Move IndexedFastaSequenceFile.getFastaIndexFileName private method to ReferenceSequenceFileFactory

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)
Contributor

magicDGS commented Apr 5, 2017

Can you have a look, @tfenne? I'm asking you because you were involved in the conversation .

Maybe you will be also interested on this @dariober...

@tfenne

Looks good @magicDGS, thanks for implementing this! I've made a few suggestions which you should feel free to incorporate or not as you see fit.

There are a couple of fairly minor issues I think, which I've commented inline:

  1. The code for extracting contig names doesn't handle leading spaces or comments after a space; I've suggested a drop in fix
  2. I think this will do the wrong thing (but not throw an exception) if it encounters a fasta file with CR+LF (i.e. two character) line endings. This seems a little trickier to fix.
@@ -159,6 +162,27 @@ private void parseIndexFile(Path indexFile) {
}
/**
+ * Writes this index to the specified path.
@tfenne

tfenne Apr 5, 2017

Owner

Might also want to update the class level javadoc now that writing is supported!

@magicDGS

magicDGS Apr 5, 2017

Contributor

Done.

+import java.nio.file.Path;
+
+/**
+ * Static methods for create an {@link FastaSequenceIndex}.
@tfenne

tfenne Apr 5, 2017

Owner

Either "Static methods to create an ..." or "Static methods for creating {@link FastaSequenceIndex}s"

@magicDGS

magicDGS Apr 5, 2017

Contributor

Done

+ public static void create(final Path fastaFile) throws IOException {
+ // get the index to write the file in
+ final Path indexFile = ReferenceSequenceFileFactory.getFastaIndexFileName(fastaFile);
+ if (Files.exists(indexFile)) {
@tfenne

tfenne Apr 5, 2017

Owner

I'm curious about this decision. I run into this kind of thing a lot when trying to do something, attempt 1 fails but creates an empty/truncated file at the output location, and then attempt 2 fails because the tool is unwilling to overwrite the broken file. I would suggest either logging a warning and then over-writing, or having a boolean parameter to the method that allows the user to force creation even if the file already exists.

@magicDGS

magicDGS Apr 5, 2017

Contributor

I always prefer to do not overwrite files unless I know that I'm doing it in purpose. But I agree that a parameter to the method is a good addition, so added.

+ /**
+ * Builds a FastaSequenceIndex on the fly from a FASTA file.
+ *
+ * Note: this also alows to create an index for a compressed file, but does not generate the
@tfenne

tfenne Apr 5, 2017

Owner

I tend to think that if we're not going to create the .gzi file too, that this method should probably fail with an exception on a compressed fasta file. I can easily imagine scenarios where code uses this to build a .fai, then turns around to try and query the fasta file, only to find that it cannot because the fasta file is compressed.

@magicDGS

magicDGS Apr 5, 2017

Contributor

I added this note because it is a difference with respect to the samtools faidx command.

I can also imagine scenarios where where a compressed fasta file is already indexed with bgzip (.gzi index), and the only requirement is to create the .fai index. Anyway, if there is a way to create with htsjdk a .gzi index I can integrate it here in the case of block-compressed input.

+ // in this case, the previous line contains a header and the current line the first sequence
+ if (previous.charAt(0) == '>') {
+ // first entry should be skipped
+ if (entry != null) {
@tfenne

tfenne Apr 5, 2017

Owner

Could be simplified to a one-liner:
if (entry != null) index.add(entry.build())

@magicDGS

magicDGS Apr 5, 2017

Contributor

This is my personal taste, but I always prefer brackets. Anyway I changed it, because the style for htsjdk allows it.

+ index.add(entry.build());
+ }
+ // update the sequence index
+ sequenceIndex++;
@tfenne

tfenne Apr 5, 2017

Owner

Could replace this line with adding ++ after sequenceIndex on the line below.

@magicDGS

magicDGS Apr 5, 2017

Contributor

True!

+ private final int index;
+ private final String contig;
+ private final long location;
+ // the bytes per line is the bases per line plus the end of the line
@tfenne

tfenne Apr 5, 2017

Owner

I don't think this will always work - using basesPerLine + 1. It's possible this code will encounter files that use CR+LF formatted files, in which case the bytes per line would be basesPerLine + 2. Admittedly noone should do that, but that doesn't mean it doesn't happen.

@magicDGS

magicDGS Apr 5, 2017

Contributor

That's true, it could happen. But it is quite difficult to fix with the current implementation and even throwing it's difficult because there is no way of checking how many end-of-line are unless we iterate over the bytes. That requires a complete re-implementation.

I've just added a TODO with date to keep track, and I can open a ticket if this gets accepted as it is now.

@tfenne

tfenne Apr 5, 2017

Owner

I think you could probably catch this quite easily at the end for any file with multiple sequences. You could check to ensure that the start position you have for each sequence (which is the correct physical start location) matches what you get if you calculate the start of the last sequence plus the number of bytes expected given bytesPerLine.

@magicDGS

magicDGS Apr 5, 2017

Contributor

That's a good suggestion, but what will happen with this:

>chr1\r\n
ACTG\r\n
@tfenne

tfenne Apr 5, 2017

Owner

I just created this PR that I think solves the problem: #843 . Want to review for me?

@magicDGS

magicDGS Apr 5, 2017

Contributor

Thanks, I did it already!

+ }
+ this.index = index;
+ // parse the contig name (without the starting '>')
+ this.contig = header.substring(1);
@tfenne

tfenne Apr 5, 2017

Owner

This doesn't appropriately handle headers that have spaces in the name. With samtools faidx if you have a sequence with a header like:
>chr1 Human chromosome 1
The name that is written to the .fai file is just chr1.

I think you want to do something like:

SAMSequenceRecord.truncateSequenceName(header.substring(1).trim());

in order to handle both spaces between the > and the contig name, and then comments after the name.

@magicDGS

magicDGS Apr 5, 2017

Contributor

True, I think that I removed it when I was working on it. I fixed it and added a test.

+
+ private void updateWithSequence(final String sequence) {
+ if (sequence.length() > basesPerLine) {
+ throw new SAMException(String.format("Sequence bases for '{}' exceed the maximum length ({}): {}",
@tfenne

tfenne Apr 5, 2017

Owner

The message here is technically correct, but I think I would find it confusing, since other than the last line all lines should be the same length, not variable lengths with a maximum. How about "Sequence line for {} was longer than the expected length ({}): {}"

@magicDGS

magicDGS Apr 5, 2017

Contributor

Good suggestion. Done!

Contributor

magicDGS commented Apr 5, 2017

I pushed the changes for first round of comments @tfenne. I will update when #843 is accepted.

tfenne was assigned by droazen Apr 18, 2017

Owner

tfenne commented Apr 18, 2017

Looks good to me @magicDGS - I think you should rebase on top of master now #843 is merged, use that change to throw an exception if you encounter varying line-endings, and then merge!

magicDGS added some commits Mar 23, 2017

@magicDGS @magicDGS magicDGS Move useful method to find the Path in the reference sequence 5bbbb90
@magicDGS @magicDGS magicDGS Add FastaSequenceIndex.write for writing the index into a Path 957b0bd
@magicDGS @magicDGS magicDGS Add FastaSequenceIndexCreator to build .fai indexes dacb930
@magicDGS @magicDGS magicDGS Address first round of comments d093be8
@magicDGS magicDGS Support files ending in CR+LF 546c868
@magicDGS magicDGS Fix tests
1806a66

codecov-io commented Apr 24, 2017 edited

Codecov Report

Merging #842 into master will increase coverage by 0.009%.
The diff coverage is 66.176%.

@@               Coverage Diff               @@
##              master      #842       +/-   ##
===============================================
+ Coverage     64.939%   64.948%   +0.009%     
- Complexity      7210      7219        +9     
===============================================
  Files            527       528        +1     
  Lines          31802     31867       +65     
  Branches        5426      5442       +16     
===============================================
+ Hits           20652     20697       +45     
- Misses          9017      9021        +4     
- Partials        2133      2149       +16
Impacted Files Coverage Δ Complexity Δ
...mtools/reference/ReferenceSequenceFileFactory.java 90% <100%> (+0.345%) 14 <1> (+1) ⬆️
...k/samtools/reference/IndexedFastaSequenceFile.java 72% <50%> (-0.277%) 29 <2> (-1)
.../samtools/reference/FastaSequenceIndexCreator.java 61.818% <61.818%> (ø) 8 <8> (?)
.../htsjdk/samtools/reference/FastaSequenceIndex.java 63.542% <90%> (+11.216%) 12 <2> (+3) ⬆️
...samtools/util/AsyncBlockCompressedInputStream.java 72% <0%> (-4%) 12% <0%> (-1%)
...va/htsjdk/samtools/util/AsyncBufferedIterator.java 68.478% <0%> (-1.087%) 17% <0%> (ø)
...main/java/htsjdk/samtools/SAMRecordSetBuilder.java 88.306% <0%> (-0.806%) 62% <0%> (-2%)
...a/htsjdk/samtools/cram/encoding/rans/Decoding.java 85.714% <0%> (+3.571%) 12% <0%> (+1%) ⬆️
Contributor

magicDGS commented Apr 24, 2017

I rebased and included the check for the end of line concordance for each sequence. Could you please have a look, @tfenne?

Owner

tfenne commented Apr 24, 2017

👍 from me @magicDGS. Are you able to merge or do you need me to do it?

Contributor

magicDGS commented Apr 25, 2017

I cannot merge, @tfenne. Let me know if I should squash beforehand. Thanks for the review!

@tfenne

tfenne approved these changes Apr 25, 2017

@tfenne tfenne merged commit 854d736 into samtools:master Apr 25, 2017

3 of 4 checks passed

codecov/changes 3 files have unexpected coverage changes not visible in diff.
Details
codecov/patch 66.176% of diff hit (target 64.939%)
Details
codecov/project 64.948% (+0.009%) compared to ee21d81
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

magicDGS deleted the magicDGS:dgs_fasta_faidx branch Apr 25, 2017

Contributor

magicDGS commented Apr 25, 2017

Thanks @tfenne!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment