Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/BUG: FastqManifestFormat sniffer validating #134

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 16 additions & 1 deletion q2_types/per_sample_sequences/_format.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,19 +9,34 @@
import skbio.io
import yaml
import qiime2.plugin.model as model
import pandas as pd

from ..plugin_setup import plugin


class FastqManifestFormat(model.TextFileFormat):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also update the FastqAbsolutePathManifestFormat sniffer to perform these validations? Those formats will mostly validate the same way, with the exception of relative vs absolute paths.

"""
Mapping of sample identifiers to filepaths and read direction.
Note that we are currently doing exhaustive validation here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... exhaustive validation here; this validation may be moved in the future when proper validation hooks exist. The overhead should be negligible because manifest files are small.


"""
def sniff(self):
with self.open() as fh:
header = fh.readline()
return header.strip() == 'sample-id,filename,direction'
if header.strip() != 'sample-id,filename,direction':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the sniffer to support comments and blank lines before the header? I think we're adding pre-header comments in manifest files generated by q2-types so it'd be good to support them here too.

read_csv can be used exclusively to support comments/blank lines and for header parsing. Something like:

def sniff(self):
    with self.open() as fh:
        try:
            manifest = pd.read_csv(fh, header=0, comment='#', skip_blank_lines=True, dtype=object)
        except Exception:
            return False
       
       if manifest.columns.tolist() != ['sample-id', 'filename', 'direction']:
           return False

       # ... and the rest of the validation (null checks, etc)

Note that I'm passing header=0 and that I tightened up the try block to only include the first statement (read_csv). I usually keep try blocks as small as possible to make it clear which statement(s) are intended to error, to avoid masking errors where they weren't expected, and to make it easier to find the except block(s) associated with the try.

return False
try:
manifest = pd.read_csv(fh, comment='#', header=None,
skip_blank_lines=True, dtype=object)
manifest.columns = ['sample-id', 'filename', 'direction']
if manifest.isnull().values.any():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be simplified to manifest.isnull().any() (pandas objects share a lot of methods present on numpy arrays)

return False
duplicated = manifest.drop(manifest.columns[1], 1)
if True in duplicated.duplicated().values:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can become if duplicated.duplicated.any()

return False
except Exception as e:
return False
return True


class YamlFormat(model.TextFileFormat):
Expand Down
9 changes: 5 additions & 4 deletions q2_types/per_sample_sequences/_transformer.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,10 +43,11 @@ def _1(dirfmt: SingleLanePerSampleSingleEndFastqDirFmt) \
fh = iter(dirfmt.manifest.view(FastqManifestFormat).open())
next(fh)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related comment on sniffer: can you support blank lines and comments before and after the header?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also make these (collective) updates to these transformers below? The manifest parsing is happening in enough places that it makes sense to go in a helper function.

  • SingleLanePerSamplePairedEndFastqDirFmt -> PerSamplePairedDNAIterators
  • SingleLanePerSamplePairedEndFastqDirFmt -> SingleLanePerSampleSingleEndFastqDirFmt

for line in fh:
sample_id, filename, _ = line.split(',')
filepath = str(dirfmt.path / filename)
result[sample_id] = skbio.io.read(filepath, format='fastq',
constructor=skbio.DNA)
if not line.startswith('#'):
sample_id, filename, _ = line.split(',')
filepath = str(dirfmt.path / filename)
result[sample_id] = skbio.io.read(filepath, format='fastq',
constructor=skbio.DNA)
return result


Expand Down
4 changes: 4 additions & 0 deletions q2_types/per_sample_sequences/tests/data/duplicate-MANIFEST
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
sample-id,filename,direction
# important comment
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward
Human-Kneecap,Human-Kneecap_S4_L001_R1_001.fastq.gz,forward
3 changes: 3 additions & 0 deletions q2_types/per_sample_sequences/tests/data/extra-MANIFEST
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
sample-id,filename,direction
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward,banana
Human-Kneecap,Human-Kneecap_S1_L001_R2_001.fastq.gz,reverse
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
sample-id,filename,direction
Human-Kneecap,Human-Kneecap_S1_L001_R2_001.fastq.gz,reverse
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward,banana
3 changes: 3 additions & 0 deletions q2_types/per_sample_sequences/tests/data/lesser-MANIFEST
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
sample-id,filename,direction
Human-Kneecap,Human-Kneecap_S1_L001_R2_001.fastq.gz,reverse
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
sample-id,filename,direction
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz
Human-Kneecap,Human-Kneecap_S1_L001_R2_001.fastq.gz,reverse
Original file line number Diff line number Diff line change
@@ -1,2 +1,3 @@
sample-id,filename,direction
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward
# important comment
Human-Kneecap,Human-Kneecap_S1_L001_R1_001.fastq.gz,forward
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests for:

  • an invalid header, e.g. sample-id,foo,direction
  • a null value in the interior of the Nx3 table
  • blank lines and comments located pre- and post-header
  • a valid MANIFEST with more than one sample

35 changes: 35 additions & 0 deletions q2_types/per_sample_sequences/tests/test_format.py
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,41 @@ def test_fastq_manifest_format_validate_negative(self):
with self.assertRaisesRegex(ValueError, 'FastqManifestFormat'):
format.validate()

def test_fastq_manifest_format_validate_negative_extra_col(self):
filepath = self.get_data_path('extra-MANIFEST')
format = FastqManifestFormat(filepath, mode='r')

with self.assertRaisesRegex(ValueError, 'FastqManifestFormat'):
format.validate()

def test_fastq_manifest_format_validate_negative_extra_col_order(self):
filepath = self.get_data_path('extra-opposite-MANIFEST')
format = FastqManifestFormat(filepath, mode='r')

with self.assertRaisesRegex(ValueError, 'FastqManifestFormat'):
format.validate()

def test_fastq_manifest_format_validate_negative_missing_col(self):
filepath = self.get_data_path('lesser-MANIFEST')
format = FastqManifestFormat(filepath, mode='r')

with self.assertRaisesRegex(ValueError, 'FastqManifestFormat'):
format.validate()

def test_fastq_manifest_format_validate_negative_missing_col_order(self):
filepath = self.get_data_path('lesser-opposite-MANIFEST')
format = FastqManifestFormat(filepath, mode='r')

with self.assertRaisesRegex(ValueError, 'FastqManifestFormat'):
format.validate()

def test_fastq_manifest_format_validate_negative_duplicate_id(self):
filepath = self.get_data_path('duplicate-MANIFEST')
format = FastqManifestFormat(filepath, mode='r')

with self.assertRaisesRegex(ValueError, 'FastqManifestFormat'):
format.validate()

def test_casava_one_eight_slanepsample_dir_fmt_validate_positive(self):
filepath = self.get_data_path('Human-Kneecap_S1_L001_R1_001.fastq.gz')
shutil.copy(filepath, self.temp_dir.name)
Expand Down