Tribble/Tabix index path support #810

Merged
merged 9 commits into from Mar 10, 2017

3 participants
Contributor

magicDGS commented Mar 1, 2017 edited by lbergelson

Description

Motivation

When indexing a Path (not convertible to a File) through the Index interface, the client currently has to:

  • Create the LittleEndianOutputStream on its own. This may produce wrong indexes because they aren't block-compressed.
  • Derive the index file name on its own if it is based on a feature file. This may lead to incorrect index names.

In addition, this is part of a bigger change to support writing to (and reading from) any filesystem.

Changes

  • Clean up (first commit): removing mentions of VCF in the javadoc for Tribble, and unused imports.
  • Path support for Index interface: methods with default values use Path.toFile()
  • Implementation of the new methods in AbstractIndex and TabixIndex

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)
Contributor

lbergelson commented Mar 1, 2017

@magicDGS Something went wrong here, tests are failing all over.

Contributor

magicDGS commented Mar 1, 2017

I figured it out already, I'm trying to find the problem.

Contributor

magicDGS commented Mar 1, 2017

Very awful error from my side, sorry. The writeBasedOnFeatureFile was overwriting the input file instead of using the name for making the index.

Now I guess that it's solved. Could you have a quick review, @lbergelson?

codecov-io commented Mar 1, 2017 edited

Codecov Report

Merging #810 into master will decrease coverage by 0.021%.
The diff coverage is 62.5%.

@@               Coverage Diff               @@
##              master      #810       +/-   ##
===============================================
- Coverage     64.867%   64.846%   -0.021%     
- Complexity      7175      7187       +12     
===============================================
  Files            526       527        +1     
  Lines          31731     31769       +38     
  Branches        5424      5424               
===============================================
+ Hits           20583     20601       +18     
- Misses          8997      9019       +22     
+ Partials        2151      2149        -2
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...va/htsjdk/tribble/TribbleIndexedFeatureReader.java | 68.333% <ø> (ø) | 22 <0> (ø) |
| ...c/main/java/htsjdk/tribble/TabixFeatureReader.java | 68.657% <ø> (ø) | 8 <0> (ø) |
| src/main/java/htsjdk/tribble/Tribble.java | 77.778% <100%> (+6.349%) | 7 <2> (+2) |
| .../java/htsjdk/tribble/index/linear/LinearIndex.java | 78.363% <100%> (+0.518%) | 18 <4> (+2) |
| ...sjdk/tribble/index/interval/IntervalTreeIndex.java | 75.61% <100%> (-1.89%) | 7 <1> (ø) |
| src/main/java/htsjdk/tribble/index/Index.java | 100% <100%> (ø) | 2 <2> (?) |
| .../main/java/htsjdk/tribble/index/AbstractIndex.java | 51.579% <42.857%> (-3.223%) | 30 <6> (ø) |
| ...tsjdk/tribble/index/linear/LinearIndexCreator.java | 83.721% <57.143%> (-3.459%) | 15 <1> (+1) |
| ...k/tribble/index/interval/IntervalIndexCreator.java | 88.889% <62.5%> (-3.111%) | 15 <2> (+1) |
| ...in/java/htsjdk/tribble/index/tabix/TabixIndex.java | 77.108% <62.5%> (-0.669%) | 38 <3> (+1) |

... and 5 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6d22658...a6ff57e.

lbergelson self-assigned this Mar 1, 2017

magicDGS changed the title from Tribble/Tabix index path support to WIP: Tribble/Tabix index path support Mar 2, 2017

Contributor

magicDGS commented Mar 2, 2017

This is still WIP, because I found some other places in the indexing code that should use Path for full support. So this should wait for review, @lbergelson.

magicDGS referenced this pull request Mar 2, 2017

Merged

Md5CalculatingOutputStream support for Path #814

2 of 5 tasks complete

magicDGS changed the title from WIP: Tribble/Tabix index path support to Tribble/Tabix index path support Mar 3, 2017

Contributor

magicDGS commented Mar 3, 2017

This should be OK now, @lbergelson. The previously failing tests are passing on my computer.

Contributor

magicDGS commented Mar 7, 2017

Can you have a look at this one, @lbergelson? This is going to be important for my upcoming PR adding tribble writing support with indexing on the fly.

Thanks a lot!

@lbergelson

@magicDGS This is useful work. Thanks for doing this. I have a few comments. I think it's probably worth breaking implementations of Index to avoid having weird default methods there, but let's see if @droazen thinks otherwise. He's usually more conservative about these things than I am.

Would it be possible for you to add a test that creates an index on a non-file path? There's an example of how to do this using an in memory filesystem called jimfs in AbstractFeatureReaderTest.

The basic idea is that you create a new jimfs filesystem, then you can get paths from that filesystem (ex: "jimfs://filesystemname/path/to/my/in/memory/file"). You may have to copy a file from the local system into jimfs if you want to create an index from it, but that's easy to do using the standard Files operations.

Just be sure that the jimfs filesystem gets closed after your test is done, because it will stay in memory until the jvm shuts down otherwise.
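
Editor's sketch of the workflow described above. jimfs is a third-party library, so to stay self-contained this example swaps in the JDK's built-in zip filesystem provider as the non-default filesystem; the idea (create the filesystem, copy a local file in with the standard Files operations, close the filesystem when done) is the same. The class and method names are hypothetical and are not part of htsjdk or its tests.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; not part of htsjdk.
final class NonFilePathDemo {

    /**
     * Writes the lines to a local temp file, copies that file into a
     * throwaway zip filesystem, and reads it back through the non-file Path.
     */
    static List<String> copyIntoZipFsAndRead(final List<String> lines) {
        try {
            final Path local = Files.createTempFile("features", ".txt");
            final Path zip = Files.createTempFile("non-file-fs", ".zip");
            Files.delete(zip); // the zip provider has to create the archive itself
            final URI zipUri = URI.create("jar:" + zip.toUri());
            // try-with-resources closes the filesystem once the work is done
            try (FileSystem fs = FileSystems.newFileSystem(
                    zipUri, Collections.singletonMap("create", "true"))) {
                Files.write(local, lines, StandardCharsets.UTF_8);
                final Path inside = fs.getPath("/features.txt");
                Files.copy(local, inside); // copies across filesystem providers
                return Files.readAllLines(inside, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(zip);
                Files.deleteIfExists(local);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        System.out.println(copyIntoZipFsAndRead(Arrays.asList("chr1\t100\t200")));
    }
}
```

With jimfs the filesystem would be created through that library's own factory instead of `FileSystems.newFileSystem`; the copy and cleanup steps would look the same.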

* @return a non-null File representing the index
*/
public static File indexFile(final File file) {
return indexFile(file.getAbsoluteFile(), STANDARD_INDEX_EXTENSION);
}
/**
- * Return the name of the tabix index file for the provided vcf {@code filename}
+ * Return the name of the index file for the provided {@code filename}
@lbergelson

lbergelson Mar 8, 2017

Contributor

forgot to change filename -> path

@magicDGS

magicDGS Mar 9, 2017

Contributor

Done.

* Does not actually create an index
- * @param filename name of the vcf file
+ * @param path name of the path
@lbergelson

lbergelson Mar 8, 2017

Contributor

"name of the path" sounds like it's something distinct from the path itself, which is strange because "name of the file" is fine. Maybe just "the path"

@magicDGS

magicDGS Mar 9, 2017

Contributor

Done.

+ * @param path name of the path
+ * @return non-null String representing the index filename
+ */
+ public static String indexPath(final Path path) {
@lbergelson

lbergelson Mar 8, 2017

Contributor

It seems like we would be better off making this actually return the path instead of a String. It's different than the old API, but I'm not sure I see the use of getting a string back.

@magicDGS

magicDGS Mar 9, 2017

Contributor

I changed it.

@@ -67,7 +70,7 @@
private final static long NO_TS = -1L;
protected int version; // Our version value
- protected File indexedFile = null; // The file we've created this index for
+ protected Path indexedPath = null; // The file we've created this index for
@lbergelson

lbergelson Mar 8, 2017

Contributor

This is technically a breaking change. I'm going to say it's fine though. I don't believe there are any subclasses of AbstractIndex in the wild that we're going to be breaking.

@magicDGS

magicDGS Mar 9, 2017

Contributor

Oh, I hadn't realized that it breaks compatibility. If we do not want to break compatibility, I can keep the field with the deprecated annotation and set it when the constructor used takes a File. Let me know if I should do that.

@lbergelson

lbergelson Mar 9, 2017

Contributor

Yeah, it's tricky. Changing any field that's visible to a subclass of a non-final class is technically a breaking change, since someone could be relying on it if they implement a subclass. It's why I'm so strongly in favor of making classes final and of making variables private with accessors if they need to be accessed.

@@ -201,7 +213,11 @@ public boolean isCurrentVersion() {
}
public File getIndexedFile() {
- return indexedFile;
+ return getIndexedPath().toFile();
@lbergelson

lbergelson Mar 8, 2017

Contributor

I think we should deprecate this method because it's no longer safe to call in all instances. It also needs a comment saying that it can fail with an exception if the path can't be represented as a file.

@magicDGS

magicDGS Mar 9, 2017

Contributor

Done.
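
Editor's sketch of why the File accessor is no longer safe in all cases: `Path.toFile()` throws `UnsupportedOperationException` for paths that do not belong to the default filesystem provider. The holder class below is a hypothetical stand-in, not the htsjdk AbstractIndex, and the zip filesystem is used only as a convenient non-default provider that ships with the JDK.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

// Hypothetical stand-in for an index class; not the htsjdk AbstractIndex.
final class IndexedPathHolder {
    private final Path indexedPath;

    IndexedPathHolder(final Path indexedPath) {
        this.indexedPath = indexedPath;
    }

    Path getIndexedPath() {
        return indexedPath;
    }

    /**
     * @deprecated use {@link #getIndexedPath()} instead.
     * @throws UnsupportedOperationException if the path cannot be represented as a File
     */
    @Deprecated
    File getIndexedFile() {
        return indexedPath.toFile();
    }

    /** Returns true if getIndexedFile() fails for a path inside a zip filesystem. */
    static boolean getIndexedFileFailsForZipPath() {
        try {
            final Path zip = Files.createTempFile("holder-demo", ".zip");
            Files.delete(zip); // the zip provider has to create the archive itself
            try (FileSystem fs = FileSystems.newFileSystem(
                    URI.create("jar:" + zip.toUri()), Collections.singletonMap("create", "true"))) {
                final IndexedPathHolder holder = new IndexedPathHolder(fs.getPath("/features.vcf"));
                try {
                    holder.getIndexedFile();
                    return false;
                } catch (UnsupportedOperationException expected) {
                    return true;
                }
            } finally {
                Files.deleteIfExists(zip);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        // A default-filesystem path converts without problems.
        System.out.println(new IndexedPathHolder(Paths.get("features.vcf")).getIndexedFile());
        System.out.println(getIndexedFileFailsForZipPath());
    }
}
```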

@@ -251,7 +271,7 @@ private void writeHeader(final LittleEndianOutputStream dos) throws IOException
dos.writeInt(MAGIC_NUMBER);
dos.writeInt(getType());
dos.writeInt(version);
- dos.writeString(indexedFile.getAbsolutePath());
+ dos.writeString(indexedPath.toString());
@lbergelson

lbergelson Mar 8, 2017

Contributor

I think it's better to use indexedPath.toUri().toString() since it's more likely to roundtrip correctly for non-file paths.

@magicDGS

magicDGS Mar 9, 2017

Contributor

Done.
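
Editor's sketch of the round-trip argument. A path's plain string form drops the filesystem it belongs to, while its URI keeps the scheme, so `Paths.get(URI)` can find the right provider again (as long as that filesystem is still open). The class and method names are hypothetical; the zip filesystem stands in for any non-file provider.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

// Hypothetical demo class; not part of htsjdk.
final class PathUriRoundTrip {

    /** Serializes the path as a URI string and parses it back. */
    static Path viaUri(final Path path) {
        return Paths.get(URI.create(path.toUri().toString()));
    }

    /** Serializes the path with toString() and parses it back on the default filesystem. */
    static Path viaToString(final Path path) {
        return Paths.get(path.toString());
    }

    /** Returns true if a zip-filesystem path survives the URI round trip but not the toString() one. */
    static boolean zipPathNeedsUri() {
        try {
            final Path zip = Files.createTempFile("roundtrip-demo", ".zip");
            Files.delete(zip); // the zip provider has to create the archive itself
            try (FileSystem fs = FileSystems.newFileSystem(
                    URI.create("jar:" + zip.toUri()), Collections.singletonMap("create", "true"))) {
                final Path inside = fs.getPath("/features.vcf");
                return viaUri(inside).equals(inside) && !viaToString(inside).equals(inside);
            } finally {
                Files.deleteIfExists(zip);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        final Path local = Paths.get("features.vcf");
        // For a default-filesystem path the URI form comes back as the absolute path.
        System.out.println(viaUri(local).equals(local.toAbsolutePath()));
        System.out.println(zipPathNeedsUri());
    }
}
```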

@@ -56,13 +57,18 @@
MathUtils.RunningStat stats = new MathUtils.RunningStat();
long basesSeen = 0;
Feature lastFeature = null;
- File inputFile;
+ // TODO: actually this field is not needed
+ Path inputPath;
@lbergelson

lbergelson Mar 8, 2017

Contributor

Also a breaking change, but I don't see the harm in it since I really doubt anyone is subclassing this thing. 👍 to delete the field

@magicDGS

magicDGS Mar 9, 2017

Contributor

In this case, because it is package-protected, I do not think that it is going to do any harm at all (unless someone is subclassing in the same package outside HTSJDK).

If we are breaking compatibility anyway, and since it is unused, why don't we just remove the field (not the constructor arg)? It is only used to get the index creators...

@magicDGS

magicDGS Mar 9, 2017

Contributor

Removed the unused field in the commit for breaking compatibility.

+ * @param indexPath Where to write the index.
+ * @throws IOException if the index is unable to write to the specified path.
+ */
+ public default void write(final Path indexPath) throws IOException {
@lbergelson

lbergelson Mar 8, 2017

Contributor

I think it's a better idea to have the default throw UnsupportedOperationException than to have it call toFile(), which will work most of the time but sometimes fail.

@magicDGS

magicDGS Mar 9, 2017

Contributor

Done.
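
Editor's sketch of the backwards-compatible option discussed here: the Path overload is added as a default method that throws, so existing File-only implementations keep compiling but fail loudly instead of going through `toFile()`. The interface below is a hypothetical, cut-down stand-in for the htsjdk Index interface, not its actual definition.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; not part of htsjdk.
final class LegacyIndexDemo {

    /** Returns true if a File-only implementation rejects the Path overload. */
    static boolean pathWriteIsRejected() {
        // An "old" implementation that only knows about File.
        final LegacyFriendlyIndex legacy = new LegacyFriendlyIndex() {
            @Override
            public void write(final File indexFile) {
                // no-op for the purposes of the sketch
            }
        };
        try {
            legacy.write(Paths.get("features.vcf.idx"));
            return false;
        } catch (UnsupportedOperationException expected) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        System.out.println(pathWriteIsRejected());
    }
}

// Hypothetical, cut-down stand-in for the Index interface.
interface LegacyFriendlyIndex {
    void write(File indexFile) throws IOException;

    /** Default for implementations that predate Path support: fail loudly rather than guess. */
    default void write(final Path indexPath) throws IOException {
        throw new UnsupportedOperationException("This index does not support writing to a Path");
    }
}
```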

+ *
+ * @param featurePath
+ */
+ public default void writeBasedOnFeaturePath(Path featurePath) throws IOException {
@lbergelson

lbergelson Mar 8, 2017

Contributor

same as above comment

@lbergelson

lbergelson Mar 8, 2017

Contributor

I sort of want to break compatibility and force everyone to implement the Path functions. It would be so much cleaner to have the file versions be default implementations that call the path version. @droazen What do you think?

@lbergelson

lbergelson Mar 8, 2017

Contributor

I suspect no one in the wild implements Index. At least it's not done in gatk/picard/igv.

@magicDGS

magicDGS Mar 9, 2017

Contributor

Using UnsupportedOperationException for now.

@magicDGS

magicDGS Mar 9, 2017

Contributor

Changed the interface in the commit for breaking compatibility.
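
Editor's sketch of the compatibility-breaking design proposed in this thread: implementors must provide the Path method, and the File overload becomes a default that delegates to it. The interface is a hypothetical, cut-down stand-in for the htsjdk Index interface.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;

// Hypothetical demo class; not part of htsjdk.
final class PathFirstIndexDemo {

    /** Calls the File overload and reports which Path the implementation actually received. */
    static Path pathSeenWhenWritingFile(final String fileName) {
        final Path[] seen = new Path[1];
        final PathFirstIndex index = new PathFirstIndex() {
            @Override
            public void write(final Path indexPath) {
                seen[0] = indexPath; // a real index would serialize itself here
            }
        };
        try {
            index.write(new File(fileName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen[0];
    }

    public static void main(final String[] args) {
        System.out.println(pathSeenWhenWritingFile("features.vcf.idx"));
    }
}

// Hypothetical, cut-down stand-in for the Index interface.
interface PathFirstIndex {
    /** The one method every implementation has to provide. */
    void write(Path indexPath) throws IOException;

    /** File callers are routed to the Path version; File.toPath() always succeeds. */
    default void write(final File indexFile) throws IOException {
        write(indexFile.toPath());
    }
}
```

The conversion in this direction is always safe, which is the reason for inverting the delegation: every File has a Path, but not every Path has a File.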

+ *
+ * @param tabixPath Where to write the index.
+ */
+ @Override
@lbergelson

lbergelson Mar 8, 2017

Contributor

make the old file version delegate to this implementation

@magicDGS

magicDGS Mar 9, 2017

Contributor

Done.

+ * @param featurePath Path being indexed.
+ */
+ @Override
+ public void writeBasedOnFeaturePath(final Path featurePath) throws IOException {
@lbergelson

lbergelson Mar 8, 2017

Contributor

same here

@magicDGS

magicDGS Mar 9, 2017

Contributor

Done.

Contributor

magicDGS commented Mar 9, 2017

I added two commits, @lbergelson:

  1. Addressing most of your comments that do not imply backwards incompatibility.
  2. Breaking compatibility to make the interfaces simpler. This could be removed if @droazen thinks that it is better to keep backwards compatibility for subclassing indexes.

Regarding the test, I don't know if I will have time today. I will try, but if not I will add them by tomorrow. You can review the current changes anyway, in case that something else should be changed.

Contributor

magicDGS commented Mar 9, 2017

I've just realized that something is broken with the changes, @lbergelson. Do you have any idea what is happening? It looks like it is breaking indexing on the fly for gzip VCF results (not the indexing itself). The most likely reason is the change to delegate the file constructor to the path one for files, because BlockCompressedOutputStream does not have Path support (see #811) and the filename is not set correctly in the binary codec.

I can add a simple constructor to BlockCompressedOutputStream with a String filename argument, but the change of that class is out of the scope of this commit.

Contributor

magicDGS commented Mar 9, 2017

Oops, I realized that it was an error unrelated to the BGZ output stream. That's what happens when you program right after waking up... Fixed in the next commit (and removed the commit with the BGZ changes).

@magicDGS magicDGS Fix tabix index
b12c9fa
@@ -219,8 +219,8 @@ public void write(final Path tabixPath) throws IOException {
*/
@Override
public void writeBasedOnFeaturePath(final Path featurePath) throws IOException {
- if (Files.isRegularFile(featurePath)) return;
- write(IOUtil.getPath(Tribble.tabixIndexPath(featurePath)));
+ if (!Files.isRegularFile(featurePath)) return;
@magicDGS

magicDGS Mar 9, 2017

Contributor

This was the refactoring problem; I overlooked it. @lbergelson, I think that this behavior should be documented, and it should at least log a warning so that nobody goes crazy wondering what happened to the index. I would rather it failed with an IO/Tribble exception, but that would break compatibility with public methods...

@lbergelson

lbergelson Mar 9, 2017

Contributor

Ack, good catch. That's terrible behavior. I wonder why it was done that way? Definitely needs to at least log a warning. I would also prefer your suggestion to fail with a TribbleException, but I think we'd need to investigate if that would break existing downstream code somehow. For now, could you have it log a warning, and open a new ticket to change the behavior?

@lbergelson

lbergelson Mar 9, 2017

Contributor

@magicDGS I've opened the issue here: #821

@magicDGS

magicDGS Mar 9, 2017

Contributor

Added warning log here and in AbstractIndex.
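
Editor's sketch of the guard discussed in this thread: skip writing when the feature path is not a regular file, but log a warning instead of returning silently, and write to a derived index path rather than to the feature file itself. This is not the htsjdk implementation. The class, the helper names, the boolean return value, the use of java.util.logging and the `.idx` extension passed in the demo are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.logging.Logger;

// Hypothetical demo class; not part of htsjdk.
final class FeaturePathIndexWriter {
    private static final Logger LOG = Logger.getLogger(FeaturePathIndexWriter.class.getName());

    /** Derives the index path as a sibling of the feature path: same name plus the extension. */
    static Path indexPathFor(final Path featurePath, final String extension) {
        return featurePath.resolveSibling(featurePath.getFileName().toString() + extension);
    }

    /**
     * Writes the index bytes next to the feature file.
     * Returns false, after logging a warning, if the feature path is not a regular file.
     */
    static boolean writeBasedOnFeaturePath(final Path featurePath, final String extension,
                                           final byte[] indexBytes) {
        if (!Files.isRegularFile(featurePath)) {
            LOG.warning("Index not written because " + featurePath + " is not a regular file");
            return false;
        }
        try {
            // Write to the derived index path, never to the feature file itself.
            Files.write(indexPathFor(featurePath, extension), indexBytes);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if the index lands next to a temp feature file and the feature file is untouched. */
    static boolean demo() {
        try {
            final Path feature = Files.createTempFile("features", ".vcf");
            final Path index = indexPathFor(feature, ".idx");
            try {
                final boolean written = writeBasedOnFeaturePath(feature, ".idx", new byte[] {1, 2, 3});
                return written && Files.size(index) == 3 && Files.size(feature) == 0;
            } finally {
                Files.deleteIfExists(index);
                Files.deleteIfExists(feature);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        System.out.println(demo());
        System.out.println(writeBasedOnFeaturePath(Paths.get("no-such-feature-file.vcf"), ".idx", new byte[0]));
    }
}
```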

@@ -73,41 +73,42 @@
/**
* Writes the index into a file.
*
+ * Default implementation delegates in {@link #write(Path)}
@lbergelson

lbergelson Mar 9, 2017

Contributor

Sorry to nitpick the language, but delegates to instead of delegates in is more idiomatic.

@magicDGS

magicDGS Mar 9, 2017

Contributor

Thanks for this, I really appreciate comments that help me improve my English skills. Done!

/**
* Write an appropriately named and located Index file based on the name and location of the featureFile.
* If featureFile is not a normal file, the index will silently not be written.
+ *
+ * Default implementation delegates in {@link #writeBasedOnFeaturePath(Path)}
@lbergelson

lbergelson Mar 9, 2017

Contributor

same as above

@magicDGS

magicDGS Mar 9, 2017

Contributor

Done.

Contributor

lbergelson commented Mar 9, 2017

@magicDGS Looks good. 👍 Once we have the tests for the two index types.

magicDGS added some commits Mar 9, 2017

@magicDGS magicDGS Address minor comments
f2eda11
@magicDGS magicDGS Add test for write index into a Path 572958a
@magicDGS magicDGS Add a couple of missing ctor for Path 4d7dda3
@magicDGS magicDGS Small test to check if writing/ctor from a Path
a6ff57e
Contributor

magicDGS commented Mar 9, 2017

Three more commits:

  1. Minor changes from review.
  2. A couple of missing constructors for Path (to test)
  3. Small test for using path in writing and loading.

Thanks for reviewing, @lbergelson - Back to you!

magicDGS referenced this pull request Mar 10, 2017

Open

Tribble writing support #822

1 of 5 tasks complete
Contributor

lbergelson commented Mar 10, 2017

👍

@lbergelson lbergelson merged commit 0c282b8 into samtools:master Mar 10, 2017

1 of 4 checks passed

codecov/changes: 1 file has unexpected coverage changes not visible in diff.
codecov/patch: 62.5% of diff hit (target 64.867%)
codecov/project: 64.846% (-0.021%) compared to 6d22658
continuous-integration/travis-ci/pr: The Travis CI build passed
Contributor

magicDGS commented Mar 10, 2017

Thanks!

magicDGS deleted the magicDGS:dgs_tribble_path_support branch Mar 10, 2017

@magicDGS magicDGS added a commit to magicDGS/htsjdk that referenced this pull request Mar 10, 2017

@magicDGS magicDGS Complete path support with #810 c2721e2

@magicDGS magicDGS added a commit to magicDGS/htsjdk that referenced this pull request Jun 26, 2017

@magicDGS magicDGS Complete path support with #810 9cd4459

@magicDGS magicDGS added a commit to magicDGS/htsjdk that referenced this pull request Jun 27, 2017

@magicDGS magicDGS Complete path support with #810 9234f2c