Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Add .fai index creation support #842
Conversation
tfenne
requested changes
Apr 5, 2017
Looks good @magicDGS, thanks for implementing this! I've made a few suggestions which you should feel free to incorporate or not as you see fit.
There are a couple of fairly minor issues I think, which I've commented inline:
- The code for extracting contig names doesn't handle leading spaces or comments after a space; I've suggested a drop in fix
- I think this will do the wrong thing (but not throw an exception) if it encounters a fasta file with CR+LF (i.e. two character) line endings. This seems a little trickier to fix.
| @@ -159,6 +162,27 @@ private void parseIndexFile(Path indexFile) { | ||
| } | ||
| /** | ||
| + * Writes this index to the specified path. |
tfenne
Apr 5, 2017
Owner
Might also want to update the class level javadoc now that writing is supported!
| +import java.nio.file.Path; | ||
| + | ||
| +/** | ||
| + * Static methods for create an {@link FastaSequenceIndex}. |
tfenne
Apr 5, 2017
Owner
Either "Static methods to create an ..." or "Static methods for creating {@link FastaSequenceIndex}s"
| + public static void create(final Path fastaFile) throws IOException { | ||
| + // get the index to write the file in | ||
| + final Path indexFile = ReferenceSequenceFileFactory.getFastaIndexFileName(fastaFile); | ||
| + if (Files.exists(indexFile)) { |
tfenne
Apr 5, 2017
Owner
I'm curious about this decision. I run into this kind of thing a lot when trying to do something, attempt 1 fails but creates an empty/truncated file at the output location, and then attempt 2 fails because the tool is unwilling to overwrite the broken file. I would suggest either logging a warning and then over-writing, or having a boolean parameter to the method that allows the user to force creation even if the file already exists.
magicDGS
Apr 5, 2017
Contributor
I always prefer to do not overwrite files unless I know that I'm doing it in purpose. But I agree that a parameter to the method is a good addition, so added.
| + /** | ||
| + * Builds a FastaSequenceIndex on the fly from a FASTA file. | ||
| + * | ||
| + * Note: this also alows to create an index for a compressed file, but does not generate the |
tfenne
Apr 5, 2017
Owner
I tend to think that if we're not going to create the .gzi file too, that this method should probably fail with an exception on a compressed fasta file. I can easily imagine scenarios where code uses this to build a .fai, then turns around to try and query the fasta file, only to find that it cannot because the fasta file is compressed.
magicDGS
Apr 5, 2017
Contributor
I added this note because it is a difference with respect to the samtools faidx command.
I can also imagine scenarios where where a compressed fasta file is already indexed with bgzip (.gzi index), and the only requirement is to create the .fai index. Anyway, if there is a way to create with htsjdk a .gzi index I can integrate it here in the case of block-compressed input.
| + // in this case, the previous line contains a header and the current line the first sequence | ||
| + if (previous.charAt(0) == '>') { | ||
| + // first entry should be skipped | ||
| + if (entry != null) { |
tfenne
Apr 5, 2017
Owner
Could be simplified to a one-liner:
if (entry != null) index.add(entry.build())
magicDGS
Apr 5, 2017
Contributor
This is my personal taste, but I always prefer brackets. Anyway I changed it, because the style for htsjdk allows it.
| + index.add(entry.build()); | ||
| + } | ||
| + // update the sequence index | ||
| + sequenceIndex++; |
tfenne
Apr 5, 2017
Owner
Could replace this line with adding ++ after sequenceIndex on the line below.
| + private final int index; | ||
| + private final String contig; | ||
| + private final long location; | ||
| + // the bytes per line is the bases per line plus the end of the line |
tfenne
Apr 5, 2017
Owner
I don't think this will always work - using basesPerLine + 1. It's possible this code will encounter files that use CR+LF formatted files, in which case the bytes per line would be basesPerLine + 2. Admittedly noone should do that, but that doesn't mean it doesn't happen.
magicDGS
Apr 5, 2017
Contributor
That's true, it could happen. But it is quite difficult to fix with the current implementation and even throwing it's difficult because there is no way of checking how many end-of-line are unless we iterate over the bytes. That requires a complete re-implementation.
I've just added a TODO with date to keep track, and I can open a ticket if this gets accepted as it is now.
tfenne
Apr 5, 2017
Owner
I think you could probably catch this quite easily at the end for any file with multiple sequences. You could check to ensure that the start position you have for each sequence (which is the correct physical start location) matches what you get if you calculate the start of the last sequence plus the number of bytes expected given bytesPerLine.
magicDGS
Apr 5, 2017
Contributor
That's a good suggestion, but what will happen with this:
>chr1\r\n
ACTG\r\n
tfenne
Apr 5, 2017
Owner
I just created this PR that I think solves the problem: #843 . Want to review for me?
| + } | ||
| + this.index = index; | ||
| + // parse the contig name (without the starting '>') | ||
| + this.contig = header.substring(1); |
tfenne
Apr 5, 2017
Owner
This doesn't appropriately handle headers that have spaces in the name. With samtools faidx if you have a sequence with a header like:
>chr1 Human chromosome 1
The name that is written to the .fai file is just chr1.
I think you want to do something like:
SAMSequenceRecord.truncateSequenceName(header.substring(1).trim());in order to handle both spaces between the > and the contig name, and then comments after the name.
magicDGS
Apr 5, 2017
Contributor
True, I think that I removed it when I was working on it. I fixed it and added a test.
| + | ||
| + private void updateWithSequence(final String sequence) { | ||
| + if (sequence.length() > basesPerLine) { | ||
| + throw new SAMException(String.format("Sequence bases for '{}' exceed the maximum length ({}): {}", |
tfenne
Apr 5, 2017
Owner
The message here is technically correct, but I think I would find it confusing, since other than the last line all lines should be the same length, not variable lengths with a maximum. How about "Sequence line for {} was longer than the expected length ({}): {}"
tfenne
was assigned
by droazen
Apr 18, 2017
magicDGS
added some commits
Mar 23, 2017
codecov-io
commented
Apr 24, 2017
•
Codecov Report
@@ Coverage Diff @@
## master #842 +/- ##
===============================================
+ Coverage 64.939% 64.948% +0.009%
- Complexity 7210 7219 +9
===============================================
Files 527 528 +1
Lines 31802 31867 +65
Branches 5426 5442 +16
===============================================
+ Hits 20652 20697 +45
- Misses 9017 9021 +4
- Partials 2133 2149 +16
|
|
I rebased and included the check for the end of line concordance for each sequence. Could you please have a look, @tfenne? |
|
|
|
I cannot merge, @tfenne. Let me know if I should squash beforehand. Thanks for the review! |
tfenne
merged commit 854d736
into
samtools:master
Apr 25, 2017
magicDGS
deleted the
magicDGS:dgs_fasta_faidx branch
Apr 25, 2017
|
Thanks @tfenne! |
magicDGS commentedApr 5, 2017
•
edited
Description
As described in #830, it would be nice to have a way of generate a .fai index from a provided reference file. This PR adds the
FastaSequenceIndexCreatorutility class, which contains:IndexedFastaSequenceFilewithout requiring a file-based index.Moreover, other useful changes are included:
FastaSequenceIndexto write to a file in the .fai format.IndexedFastaSequenceFile.getFastaIndexFileNameprivate method toReferenceSequenceFileFactoryChecklist