ENH/BUG: FastqManifestFormat sniffer validating #134
…nhippel/q2-types into Manifest-duplicate-sniffer It is.
This is ready for review.
Not reviewing, but you should add a comment to the …
Good point @ebolyen! And done.
Thanks @maxvonhippel, this additional validation will be really helpful. I'm submitting a partial review now so that I don't lose my comments. Please don't address any comments yet; I realized that putting validation in the sniffer may not be a good idea, as there's no way to tell the user what's wrong with their file (that'll be addressed in the future though). I think we should discuss as a group tomorrow to come up with a plan -- centralizing the validation and parsing into a helper that's invoked in each transformer may be the way to go until we have robust sniffers.
from ..plugin_setup import plugin


class FastqManifestFormat(model.TextFileFormat):
    """
    Mapping of sample identifiers to filepaths and read direction.
    Note that we are currently doing exhaustive validation here.
... exhaustive validation here; this validation may be moved in the future when proper validation hooks exist. The overhead should be negligible because manifest files are small.
    """
    def sniff(self):
        with self.open() as fh:
            header = fh.readline()
            return header.strip() == 'sample-id,filename,direction'
            if header.strip() != 'sample-id,filename,direction':
Can you update the sniffer to support comments and blank lines before the header? I think we're adding pre-header comments in manifest files generated by q2-types so it'd be good to support them here too.
read_csv can be used exclusively to support comments/blank lines and for header parsing. Something like:

    def sniff(self):
        with self.open() as fh:
            try:
                manifest = pd.read_csv(fh, header=0, comment='#', skip_blank_lines=True, dtype=object)
            except Exception:
                return False
            if manifest.columns.tolist() != ['sample-id', 'filename', 'direction']:
                return False
            # ... and the rest of the validation (null checks, etc)

Note that I'm passing header=0 and that I tightened up the try block to only include the first statement (read_csv). I usually keep try blocks as small as possible to make it clear which statement(s) are intended to error, to avoid masking errors where they weren't expected, and to make it easier to find the except block(s) associated with the try.
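To illustrate the suggestion above, here is a minimal standalone sketch of how `read_csv` treats comment lines and blank lines around the header. The manifest text and the `read_manifest` name are made up for the example; only the `read_csv` arguments come from the review comment.

```python
import io

import pandas as pd

# Hypothetical manifest text: a comment and a blank line precede the header,
# and another comment follows it.
text = (
    "# generated by some tool\n"
    "\n"
    "sample-id,filename,direction\n"
    "# a comment after the header\n"
    "Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward\n"
)


def read_manifest(fh):
    """Return the parsed table, or None if pandas cannot parse the input."""
    try:
        # Only the call that is expected to fail lives inside the try block.
        manifest = pd.read_csv(fh, header=0, comment='#',
                               skip_blank_lines=True, dtype=object)
    except Exception:
        return None
    return manifest


manifest = read_manifest(io.StringIO(text))
# With header=0, the first line that is neither blank nor fully commented
# is used as the header row.
header_ok = manifest.columns.tolist() == ['sample-id', 'filename', 'direction']
```

Because fully commented lines and blank lines are skipped before the header is located, the same call handles comments both before and after the header.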
            manifest = pd.read_csv(fh, comment='#', header=None,
                                   skip_blank_lines=True, dtype=object)
            manifest.columns = ['sample-id', 'filename', 'direction']
            if manifest.isnull().values.any():
Can be simplified to manifest.isnull().any() (pandas objects share a lot of methods present on numpy arrays)
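One thing worth checking before applying this simplification: on a DataFrame, `.any()` reduces column by column and returns a Series, so it cannot be used directly as an `if` condition. A small sketch with a made-up table shows the difference between the forms:

```python
import pandas as pd

# Made-up manifest table with one missing filename.
manifest = pd.DataFrame({
    'sample-id': ['s1', 's2'],
    'filename': ['s1.fastq.gz', None],
    'direction': ['forward', 'forward'],
})

# One boolean per column; using this Series in an `if` raises ValueError.
per_column = manifest.isnull().any()

# Either of these collapses the whole table to a single truth value.
any_null = manifest.isnull().values.any()
also_any_null = manifest.isnull().any().any()
```

So `manifest.isnull().values.any()` (as in the diff) or `manifest.isnull().any().any()` both give the single boolean the sniffer needs.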
            if manifest.isnull().values.any():
                return False
            duplicated = manifest.drop(manifest.columns[1], 1)
            if True in duplicated.duplicated().values:
Can become if duplicated.duplicated().any()
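As a sketch of the duplicate check (with a made-up table), `DataFrame.duplicated` also accepts a `subset` argument, which gives the same result as dropping the filename column first:

```python
import pandas as pd

# Made-up manifest: the third row repeats the (sample-id, direction) pair
# of the first row with a different filename.
manifest = pd.DataFrame({
    'sample-id': ['s1', 's1', 's1'],
    'filename': ['a.fastq.gz', 'b.fastq.gz', 'c.fastq.gz'],
    'direction': ['forward', 'reverse', 'forward'],
})

# Flags every row whose (sample-id, direction) pair appeared earlier.
dupes = manifest.duplicated(subset=['sample-id', 'direction'])
has_duplicates = bool(dupes.any())

# Same check, written as "drop the filename column, then look for duplicates".
same_result = bool(manifest.drop(columns='filename').duplicated().any())
```

Here `dupes` is a Series, so `.any()` on it does return a single truth value.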
@@ -43,10 +43,11 @@ def _1(dirfmt: SingleLanePerSampleSingleEndFastqDirFmt) \
    fh = iter(dirfmt.manifest.view(FastqManifestFormat).open())
    next(fh)
Related comment on sniffer: can you support blank lines and comments before and after the header?
Can you also make these (collective) updates to these transformers below? The manifest parsing is happening in enough places that it makes sense to go in a helper function.
SingleLanePerSamplePairedEndFastqDirFmt -> PerSamplePairedDNAIterators
SingleLanePerSamplePairedEndFastqDirFmt -> SingleLanePerSampleSingleEndFastqDirFmt
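A rough sketch of what such a shared helper might look like follows. The name `parse_manifest`, the error messages, and the choice to raise `ValueError` are assumptions for illustration, not the q2-types implementation. Raising with a message is one way a helper could tell the user what is wrong with the file, which a boolean sniffer cannot do.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ['sample-id', 'filename', 'direction']


def parse_manifest(fh):
    """Hypothetical helper: parse and validate a manifest, or raise ValueError."""
    try:
        manifest = pd.read_csv(fh, header=0, comment='#',
                               skip_blank_lines=True, dtype=object)
    except Exception as e:
        raise ValueError('Could not parse manifest: %s' % e)

    if manifest.columns.tolist() != EXPECTED_COLUMNS:
        raise ValueError('Expected header %r, found %r.'
                         % (EXPECTED_COLUMNS, manifest.columns.tolist()))
    if manifest.isnull().values.any():
        raise ValueError('Manifest contains empty cells.')
    if manifest.duplicated(subset=['sample-id', 'direction']).any():
        raise ValueError('Manifest contains duplicate '
                         '(sample-id, direction) pairs.')
    return manifest


# Example use on an in-memory manifest.
example = io.StringIO(
    "sample-id,filename,direction\n"
    "s1,s1_R1.fastq.gz,forward\n"
    "s1,s1_R2.fastq.gz,reverse\n"
)
table = parse_manifest(example)
```

Each transformer could then pass its open manifest file handle to the helper and iterate over the returned rows, instead of skipping the header by hand.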
@@ -1,2 +1,3 @@
sample-id,filename,direction
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward
# important comment
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward
Can you add tests for:
- an invalid header, e.g. sample-id,foo,direction
- a null value in the interior of the Nx3 table
- blank lines and comments located pre- and post-header
- a valid MANIFEST with more than one sample
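The four cases listed above could be laid out roughly as follows. `is_valid_manifest` is a stand-in written only for this sketch (it is not the q2-types sniffer), and the manifest contents are invented.

```python
import io

import pandas as pd


def is_valid_manifest(text):
    """Stand-in for the sniffer under test (hypothetical)."""
    try:
        df = pd.read_csv(io.StringIO(text), header=0, comment='#',
                         skip_blank_lines=True, dtype=object)
    except Exception:
        return False
    header_ok = df.columns.tolist() == ['sample-id', 'filename', 'direction']
    return bool(header_ok and not df.isnull().values.any())


# Case name -> (manifest text, expected validity).
CASES = {
    'invalid header': (
        "sample-id,foo,direction\n"
        "s1,s1.fastq.gz,forward\n", False),
    'null in table interior': (
        "sample-id,filename,direction\n"
        "s1,s1.fastq.gz,forward\n"
        "s2,,forward\n"
        "s3,s3.fastq.gz,forward\n", False),
    'comments and blank lines pre- and post-header': (
        "# pre-header comment\n"
        "\n"
        "sample-id,filename,direction\n"
        "\n"
        "# post-header comment\n"
        "s1,s1.fastq.gz,forward\n", True),
    'more than one sample': (
        "sample-id,filename,direction\n"
        "s1,s1.fastq.gz,forward\n"
        "s2,s2.fastq.gz,forward\n", True),
}

results = {name: is_valid_manifest(text) for name, (text, _) in CASES.items()}
expected = {name: want for name, (_, want) in CASES.items()}
```

In the real test suite each case would more likely be a data file checked through the format's own validation, but the inputs would have the same shape.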
@@ -9,19 +9,34 @@
import skbio.io
import yaml
import qiime2.plugin.model as model
import pandas as pd

from ..plugin_setup import plugin


class FastqManifestFormat(model.TextFileFormat):
Can you also update the FastqAbsolutePathManifestFormat sniffer to perform these validations? Those formats will mostly validate the same way, with the exception of relative vs absolute paths.
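The relative-versus-absolute distinction could be checked with something like the sketch below. The function name is hypothetical, the example paths are POSIX-style, and nothing here is taken from the actual FastqAbsolutePathManifestFormat code.

```python
import os.path


def paths_are(paths, absolute):
    """Hypothetical check: True if every path is absolute (absolute=True)
    or every path is relative (absolute=False)."""
    return all(os.path.isabs(p) == absolute for p in paths)


# Relative paths, as a FastqManifestFormat-style file would list them.
relative_ok = paths_are(['s1_R1.fastq.gz', 'reads/s2_R1.fastq.gz'],
                        absolute=False)
# Absolute paths, as an absolute-path manifest would list them.
absolute_ok = paths_are(['/data/s1_R1.fastq.gz'], absolute=True)
# A mix of both fails the absolute check.
mixed_ok = paths_are(['/data/s1_R1.fastq.gz', 's2_R1.fastq.gz'],
                     absolute=True)
```

The header, null, and duplicate checks could then be shared between the two formats, with only this path check differing.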
Closing in favor of #136.
The FastqManifestFormat sniffer now validates all of the following:
- … (sample-id, filename, direction)
- … (sample-id, direction) tuples exist

Moreover, comments are now allowed in the format and are ignored. I added tests for all of the above and added a comment to a passing, successful test.
Fixes #126
Fixes #132