Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/BUG: FastqManifestFormat sniffer validating #134

Closed

Conversation

maxvonhippel
Copy link
Contributor

@maxvonhippel maxvonhippel commented Jul 14, 2017

The FastqManifestFormat sniffer now validates all of the following:

  1. Column headers are correct (sample-id, filename, direction)
  2. No duplicate (sample-id,direction) tuples exist
  3. No rows have too few or too many entries

Moreover, comments are now allowed in the format and ignored. I added tests for all of the above and added a comment to a passing, successful test.

Fixes #126
Fixes #132

@jairideout jairideout self-requested a review July 14, 2017 23:20
@jairideout jairideout changed the title ENH: FastqManifestFormat sniffer validating ENH/BUG: FastqManifestFormat sniffer validating Jul 14, 2017
@jairideout jairideout added type:bug Something is wrong. enhancement stat:DO-NOT-MERGE Please do not merge this until this label has been removed. labels Jul 14, 2017
@maxvonhippel
Copy link
Contributor Author

This is ready for review.

@jairideout jairideout removed the stat:DO-NOT-MERGE Please do not merge this until this label has been removed. label Jul 17, 2017
@ebolyen
Copy link
Member

ebolyen commented Jul 17, 2017

Not reviewing, but you should add a comment to the sniff method to say that we are doing exhaustive validation, that way when we come back to this we know that we're kind of "breaking the rules" and we can just port this behavior to whatever comes next for validation/sniffing.

@maxvonhippel
Copy link
Contributor Author

Good point @ebolyen ! And done.

Copy link
Member

@jairideout jairideout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @maxvonhippel, this additional validation will be really helpful. I'm submitting a partial review now so that I don't lose my comments. Please don't address any comments yet; I realized that putting validation in the sniffer may not be a good idea, as there's no way to tell the user what's wrong with their file (that'll be addressed in the future though). I think we should discuss as a group tomorrow to come up with a plan -- centralizing the validation and parsing into a helper that's invoked in each transformer may be the way to go until we have robust sniffers.


from ..plugin_setup import plugin


class FastqManifestFormat(model.TextFileFormat):
"""
Mapping of sample identifiers to filepaths and read direction.
Note that we are currently doing exhaustive validation here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... exhaustive validation here; this validation may be moved in the future when proper validation hooks exist. The overhead should be negligible because manifest files are small.


"""
def sniff(self):
with self.open() as fh:
header = fh.readline()
return header.strip() == 'sample-id,filename,direction'
if header.strip() != 'sample-id,filename,direction':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the sniffer to support comments and blank lines before the header? I think we're adding pre-header comments in manifest files generated by q2-types so it'd be good to support them here too.

read_csv can be used exclusively to support comments/blank lines and for header parsing. Something like:

def sniff(self):
    with self.open() as fh:
        try:
            manifest = pd.read_csv(fh, header=0, comment='#', skip_blank_lines=True, dtype=object)
        except Exception:
            return False
       
       if manifest.columns.tolist() != ['sample-id', 'filename', 'direction']:
           return False

       # ... and the rest of the validation (null checks, etc)

Note that I'm passing header=0 and that I tightened up the try block to only include the first statement (read_csv). I usually keep try blocks as small as possible to make it clear which statement(s) are intended to error, to avoid masking errors where they weren't expected, and to make it easier to find the except block(s) associated with the try.

manifest = pd.read_csv(fh, comment='#', header=None,
skip_blank_lines=True, dtype=object)
manifest.columns = ['sample-id', 'filename', 'direction']
if manifest.isnull().values.any():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be simplified to manifest.isnull().any() (pandas objects share a lot of methods present on numpy arrays)

if manifest.isnull().values.any():
return False
duplicated = manifest.drop(manifest.columns[1], 1)
if True in duplicated.duplicated().values:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can become if duplicated.duplicated.any()

@@ -43,10 +43,11 @@ def _1(dirfmt: SingleLanePerSampleSingleEndFastqDirFmt) \
fh = iter(dirfmt.manifest.view(FastqManifestFormat).open())
next(fh)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related comment on sniffer: can you support blank lines and comments before and after the header?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also make these (collective) updates to these transformers below? The manifest parsing is happening in enough places that it makes sense to go in a helper function.

  • SingleLanePerSamplePairedEndFastqDirFmt -> PerSamplePairedDNAIterators
  • SingleLanePerSamplePairedEndFastqDirFmt -> SingleLanePerSampleSingleEndFastqDirFmt

@@ -1,2 +1,3 @@
sample-id,filename,direction
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward
# important comment
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests for:

  • an invalid header, e.g. sample-id,foo,direction
  • a null value in the interior of the Nx3 table
  • blank lines and comments located pre- and post-header
  • a valid MANIFEST with more than one sample

@@ -9,19 +9,34 @@
import skbio.io
import yaml
import qiime2.plugin.model as model
import pandas as pd

from ..plugin_setup import plugin


class FastqManifestFormat(model.TextFileFormat):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also update the FastqAbsolutePathManifestFormat sniffer to perform these validations? Those formats will mostly validate the same way, with the exception of relative vs absolute paths.

@jairideout
Copy link
Member

Closing in favor of #136.

@jairideout jairideout closed this Jul 22, 2017
@maxvonhippel maxvonhippel deleted the Manifest-duplicate-sniffer branch July 31, 2017 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement type:bug Something is wrong.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants