-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH/BUG: FastqManifestFormat sniffer validating #134
Changes from all commits
1bbe7d8
5974b17
e2101ba
53dd53e
27280b9
838223c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,19 +9,34 @@ | |
import skbio.io | ||
import yaml | ||
import qiime2.plugin.model as model | ||
import pandas as pd | ||
|
||
from ..plugin_setup import plugin | ||
|
||
|
||
class FastqManifestFormat(model.TextFileFormat): | ||
""" | ||
Mapping of sample identifiers to filepaths and read direction. | ||
Note that we are currently doing exhaustive validation here. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ... exhaustive validation here; this validation may be moved in the future when proper validation hooks exist. The overhead should be negligible because manifest files are small. |
||
|
||
""" | ||
def sniff(self): | ||
with self.open() as fh: | ||
header = fh.readline() | ||
return header.strip() == 'sample-id,filename,direction' | ||
if header.strip() != 'sample-id,filename,direction': | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you update the sniffer to support comments and blank lines before the header? I think we're adding pre-header comments in manifest files generated by q2-types so it'd be good to support them here too.
def sniff(self):
with self.open() as fh:
try:
manifest = pd.read_csv(fh, header=0, comment='#', skip_blank_lines=True, dtype=object)
except Exception:
return False
if manifest.columns.tolist() != ['sample-id', 'filename', 'direction']:
return False
# ... and the rest of the validation (null checks, etc) Note that I'm passing |
||
return False | ||
try: | ||
manifest = pd.read_csv(fh, comment='#', header=None, | ||
skip_blank_lines=True, dtype=object) | ||
manifest.columns = ['sample-id', 'filename', 'direction'] | ||
if manifest.isnull().values.any(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can be simplified to |
||
return False | ||
duplicated = manifest.drop(manifest.columns[1], 1) | ||
if True in duplicated.duplicated().values: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can become |
||
return False | ||
except Exception as e: | ||
return False | ||
return True | ||
|
||
|
||
class YamlFormat(model.TextFileFormat): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -43,10 +43,11 @@ def _1(dirfmt: SingleLanePerSampleSingleEndFastqDirFmt) \ | |
fh = iter(dirfmt.manifest.view(FastqManifestFormat).open()) | ||
next(fh) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Related comment on sniffer: can you support blank lines and comments before and after the header? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you also make these (collective) updates to these transformers below? The manifest parsing is happening in enough places that it makes sense to go in a helper function.
|
||
for line in fh: | ||
sample_id, filename, _ = line.split(',') | ||
filepath = str(dirfmt.path / filename) | ||
result[sample_id] = skbio.io.read(filepath, format='fastq', | ||
constructor=skbio.DNA) | ||
if not line.startswith('#'): | ||
sample_id, filename, _ = line.split(',') | ||
filepath = str(dirfmt.path / filename) | ||
result[sample_id] = skbio.io.read(filepath, format='fastq', | ||
constructor=skbio.DNA) | ||
return result | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
sample-id,filename,direction | ||
# important comment | ||
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward | ||
Human-Kneecap,Human-Kneecap_S4_L001_R1_001.fastq.gz,forward |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
sample-id,filename,direction | ||
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward,banana | ||
Human-Kneecap,Human-Kneecap_S1_L001_R2_001.fastq.gz,reverse |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
sample-id,filename,direction | ||
Human-Kneecap,Human-Kneecap_S1_L001_R2_001.fastq.gz,reverse | ||
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward,banana |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
sample-id,filename,direction | ||
Human-Kneecap,Human-Kneecap_S1_L001_R2_001.fastq.gz,reverse | ||
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
sample-id,filename,direction | ||
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz | ||
Human-Kneecap,Human-Kneecap_S1_L001_R2_001.fastq.gz,reverse |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,3 @@ | ||
sample-id,filename,direction | ||
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward | ||
# important comment | ||
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add tests for:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also update the
FastqAbsolutePathManifestFormat
sniffer to perform these validations? Those formats will mostly validate the same way, with the exception of relative vs absolute paths.