Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Golay error decoding for the EMP protocol #97

Merged
merged 20 commits into from
Apr 22, 2019
Merged

Conversation

wasade
Copy link
Member

@wasade wasade commented Apr 3, 2019

This pull request adds in Golay error decoding for the Earth Microbiome Project protocol (solving #72). The code included is essentially a direct copy from QIIME1, which was originally authored by @justin212k, with a few exceptions outlined below:

  • some normalization of data types and improved utilization of numpy
  • methods wrapped in an object
  • the use of functools.lru_cache to avoid decoding unless necessary
  • standardized the docstrings

The contained code does more than only decoding, so it is possible to use the included library methods for the purpose of creating new Golay codes. However, since demux only presently supports the EMP protocol, it didn't seem necessary at this time to expose any of that additional functionality.

As this code was brought from an existing project, some additional detail such as the module docstring was retained.

Finally, this PR does introduce an API change to both emp-single and emp-paired. Specifically, we now record detail about the error correction process. This information is dumped into a DataFrame and provided as an output. The original intention was for this output to be compatible with qiime metadata tabluate, however the sample-id column is now unique (as multiple reads within a sample may have errors) so it does not currently work with that visualizer.

@wasade
Copy link
Member Author

wasade commented Apr 3, 2019

One more note, without the lru_cache, decoding roughly doubled runtime when demuxing the MVP tutorial dataset. With the lru_cache, there is no appreciable added runtime (anecdotal).

@thermokarst thermokarst self-requested a review April 3, 2019 20:47
@thermokarst thermokarst self-assigned this Apr 3, 2019
Copy link
Contributor

@thermokarst thermokarst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick pass today, food for thought. I will do a more in-depth review soon.

q2_demux/_format.py Outdated Show resolved Hide resolved
q2_demux/_demux.py Show resolved Hide resolved
Copy link
Contributor

@thermokarst thermokarst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lookin' good @wasade, made one more pass. Next time through I want to take a closer look at tests. Otherwise looking good! Thanks! 💯

q2_demux/_ecc.py Show resolved Hide resolved
q2_demux/_ecc.py Show resolved Hide resolved
q2_demux/_format.py Outdated Show resolved Hide resolved
@@ -57,3 +57,19 @@ class EMPPairedEndCasavaDirFmt(model.DirectoryFormat):

barcodes = model.File(
r'Undetermined_S0_L001_I1_001.fastq.gz', format=FastqGzFormat)


class ErrorCorrectionDetailsFmt(model.TextFileFormat):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

cc @ebolyen - this format would be another good candidate for a schema-enabled format. @wasade, defining these tabular formats is cumbersome, and is on our laundry-list of things to do around here. No need to change anything here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about you add the header/type mapping as a class property to this format? Then, you can use it here and here (and in the rest of the transformers).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third comment on this format - can you make this a TSV? If you did, then you could actually just use the qiime2.Metadata object for validation:

for column in self.METADATA_COLUMNS.keys():
    try:
        md.get_column(column)
    except ValueError as md_exc:
        raise ValidationError(md_exc) from md_exc

As well, your transformers could use qiime2.Metadata for loading and saving.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call!

I got in the header / type mapping stuff. I was less sure about how to do the changes to support the Metadata object though, is there a recommended example to base off of by chance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a little bit here in this open PR: qiime2/q2-types#210

@thermokarst
Copy link
Contributor

thermokarst commented Apr 16, 2019

Okay, I have just a few minor things to add to this, then it is set:

Thanks!

@thermokarst thermokarst merged commit a0e0761 into qiime2:master Apr 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants