Skip to content
This repository has been archived by the owner on Nov 9, 2023. It is now read-only.

Remove R parsing code in favor of qiime.io? #1135

Open
jairideout opened this issue Oct 1, 2013 · 21 comments
Open

Remove R parsing code in favor of qiime.io? #1135

jairideout opened this issue Oct 1, 2013 · 21 comments

Comments

@jairideout
Copy link
Member

The qiime.io R package contains the parsing functionality that is in qiime/support_files/R/loaddata.r. I think we should remove loaddata.r and make qiime.io an optional QIIME dependency (like we are doing with randomForest, vegan, etc.). This is important because if there is a bug fix in one place, the other will have to be updated. For example, see #1132. loaddata.r isn't unit-tested, either, which makes it scary to maintain.

@joey711 it looks like qiime.io is still under development and doesn't have an official release yet. What are the remaining items that need to be in place before this is ready for its first release?

@gregcaporaso
Copy link
Contributor

+1

@joey711
Copy link

joey711 commented Oct 1, 2013

Well, depends. I basically just threw the package meta stuff (doc, depend, etc) around the loaddata.r source. I believe it was from Dan Knights. I have some extra bells and whistles for roughly the same functionality in phyloseq, where it is unit-tested. There are more-flexible options for parsing taxonomy, and import status messages, etc. I've procrastinated doing anything with qiime.io because I wasn't sure who was going to use it, or if I'd have permission to change/adapt the loaddata.r script with the functionality from the I/O code in phyloseq. Ideally this would be an official CRAN release that phyloseq depends-on, useful for other packages, and not re-coded in every new microbiome R tool.

So, is there a real "need" for this? If so, I can think about budgeting time to make this happen, and migrate over relevant code from phyloseq. Sorry, I'm a little out of the loop.

@jairideout
Copy link
Member Author

Thanks for the details, @joey711. I wasn't aware that this code was also ported to phyloseq.

I think there is a real need for this code to be removed from QIIME and pushed into its own package. QIIME would then depend on qiime.io. I think it'd be a huge plus because it would make it easy for R users to load QIIME files (we get fairly frequent requests for this on the QIIME forum), instead of copying this functionality from QIIME or elsewhere. qiime.io will likely be very lightweight, which also encourages adoption as a dependency in other microbiome projects.

The biggest plus that I see is having unit tests, as there are occasionally issues with this parsing code (e.g., #1132 #996 #236). Of course, we could add unit tests within QIIME, but I think there are many benefits to extracting the code into its own package. It'd also be great to have this hosted on CRAN.

@danknights @kylebittinger what do you guys think? Other devs?

I'd be happy to review code, offer feedback, suggest specific unit test cases, etc. but I'm not the best person to code this since my R experience is extremely limited.

@danknights
Copy link
Contributor

I think it's a great idea.

@joey711
Copy link

joey711 commented Oct 2, 2013

@jrrideout @danknights @kylebittinger
Quick clarification, phyloseq supported legacy QIIME files for a long time in its import_qiime function, which takes the connection/path to multiple QIIME legacy files (including a tree and sample map), and returns an S4-OOP representation of that data in R. The individual components can be retrieved in their old-school/native R format using accessors like otu_table, tax_table, etc.

I have not incorporated any of the loaddata.r or qiime.io code into phyloseq or I would have notified you guys and given an attribution. Just wanted to clarify that I'm on the up-and-up with that sort of thing. Dan and I (and/or @kylebittinger ?) wrote these things independently.

Sorry for the confusion. To clarify, I was just suggesting I might be willing to migrate some of the extra bells and whistles (and unit tests) from the phyloseq version of legacy-qiime importing code into the qiime.io package. I'd rather it see more use there, and make qiime.io a dependency for phyloseq. For example, in order to import the legacy-qiime version of the HMPv35 data table, I rewrote the phyloseq parser to stream the data in by chunks with user-definable size, without which my laptop was hitting a peak RAM issue and doing major swapping.

Here is a tutorial on that:
http://joey711.github.io/phyloseq-demo/HMP_import_example.html

Happy to discuss. I agree qiime.io could be lightweight and useful. I'd much prefer that we all agree on a single R-side way of importing legacy-qiime files (and catching/solving bugs together).

@jairideout
Copy link
Member Author

@joey711 thanks for the clarification. I think it'd be fine (and a really good idea) to take either the code from phyloseq or loaddata.r and make it into an R package that will generally be useful to others, not just within QIIME or phyloseq.

I don't think we need to stick to the existing interfaces that are in loaddata.r, as long as it's relatively easy for QIIME and phyloseq code to be updated to use this new package.

Does anyone else have interest/bandwidth to work on this?

@kylebittinger
Copy link
Contributor

Paul, Jai, Greg, Dan,

I am all for establishing qiime.io as an optional dependency and doing some
work to make it better. That being said, I don't have a great feel for how
to make decisions about improving/refactoring the R code. I want to be
very careful not to introduce new bugs into QIIME.

I think it would be a great idea to make qiime.io a "back end" package for
other R projects like phyloseq. I maintain my own package, qiimer, and I
would love to eliminate the parsing functions in favor of qiime.io if the
implementation is good. (https://github.com/kylebittinger/qiimer)

I added some unit tests to qiime.io a while back, so we should be "almost
there" with respect to pushing out an initial release on CRAN. I am
willing to devote some work time to making that happen when we decide to
move forward.

--Kyle

On Wed, Oct 2, 2013 at 6:41 PM, Jai Ram Rideout notifications@github.comwrote:

@joey711 https://github.com/joey711 thanks for the clarification. I
think it'd be fine (and a really good idea) to take either the code from
phyloseq or loaddata.r and make it into an R package that will generally be
useful to others, not just within QIIME or phyloseq.

I don't think we need to stick to the existing interfaces that are in
loaddata.r, as long as it's relatively easy for QIIME and phyloseq code to
be updated to use this new package.

Does anyone else have interest/bandwidth to work on this?


Reply to this email directly or view it on GitHubhttps://github.com//issues/1135#issuecomment-25583354
.

@joey711
Copy link

joey711 commented Oct 3, 2013

I can try to take a look at this over the weekend. The count of redundant qiime-parsing code now appears to be 3. All the more reason to make this happen, IMHO.

@kylebittinger (or others), what sort of bugs are we worried about introducing? I figured I would follow the output data types and (meager?) tests that we currently have, so that the difference is opaque to QIIME.

Similarly, is there anything we should enhance? I already suggested adding the chunk-streaming to accommodate large files without memory-swapping, since I've already written and tested that. It also optionally spits progress dots to standard out. Anything else?

I don't currently have any support in phyloseq or my own private R scripts for writing a QIIME-legacy file from an OTU abundance matrix (plus optional meta data). This isn't needed for the QIIME interface, since QIIME does that on its own, but it would augment the R interoperability with QIIME. And add validity to the "o" in qiime.io. Thoughts?

@danknights
Copy link
Contributor

I agree with legacy output and chunk-streaming.

Dan Knights, PhD
Assistant Professor
Dept of Computer Science and Engineering
BioTechnology Institute
University of Minnesota
http://metagenome.cs.umn.edu
cell: 510-900-9015 (dials backwards)

On Thu, Oct 3, 2013 at 12:32 PM, Paul J. McMurdie
notifications@github.comwrote:

I can try to take a look at this over the weekend. The count of redundant
qiime-parsing code now appears to be 3. All the more reason to make this
happen, IMHO.

@kylebittinger https://github.com/kylebittinger (or others), what sort
of bugs are we worried about introducing? I figured I would follow the
output data types and (meager?) tests that we currently have, so that the
difference is opaque to QIIME.

Similarly, is there anything we should enhance? I already suggested adding
the chunk-streaming to accommodate large files without memory-swapping,
since I've already written and tested that. It also optionally spits
progress dots to standard out. Anything else?

I don't currently have any support in phyloseq or my own private R scripts
for writing a QIIME-legacy file from an OTU abundance matrix (plus
optional meta data). This isn't needed for the QIIME interface, since QIIME
does that on its own, but it would augment the R interoperability with
QIIME. And add validity to the "o" in qiime.io. Thoughts?


Reply to this email directly or view it on GitHubhttps://github.com//issues/1135#issuecomment-25640977
.

@kylebittinger
Copy link
Contributor

Legacy output would be great and probably easy enough to write. I also
have functions for importing OTU mapping files and collated alpha diversity
tables if there is any interest down the road.

Looked over the chunk streaming section of the phyloseq code (
https://github.com/joey711/phyloseq/blob/master/R/IO-methods.R#L540).
Seems like it calls scan() on sections of the input file and inserts data
into a pre-allocated matrix. Looks good to go!

Thinking more about my fear of bugs, adding a few more tests for the output
data types would go a long way. I'll send you a pull request.

--Kyle

On Thu, Oct 3, 2013 at 1:47 PM, danknights notifications@github.com wrote:

I agree with legacy output and chunk-streaming.

Dan Knights, PhD
Assistant Professor
Dept of Computer Science and Engineering
BioTechnology Institute
University of Minnesota
http://metagenome.cs.umn.edu
cell: 510-900-9015 (dials backwards)

On Thu, Oct 3, 2013 at 12:32 PM, Paul J. McMurdie
notifications@github.comwrote:

I can try to take a look at this over the weekend. The count of
redundant
qiime-parsing code now appears to be 3. All the more reason to make this
happen, IMHO.

@kylebittinger https://github.com/kylebittinger (or others), what
sort
of bugs are we worried about introducing? I figured I would follow the
output data types and (meager?) tests that we currently have, so that
the
difference is opaque to QIIME.

Similarly, is there anything we should enhance? I already suggested
adding
the chunk-streaming to accommodate large files without memory-swapping,
since I've already written and tested that. It also optionally spits
progress dots to standard out. Anything else?

I don't currently have any support in phyloseq or my own private R
scripts
for writing a QIIME-legacy file from an OTU abundance matrix (plus
optional meta data). This isn't needed for the QIIME interface, since
QIIME
does that on its own, but it would augment the R interoperability with
QIIME. And add validity to the "o" in qiime.io. Thoughts?


Reply to this email directly or view it on GitHub<
https://github.com/qiime/qiime/issues/1135#issuecomment-25640977>
.


Reply to this email directly or view it on GitHubhttps://github.com//issues/1135#issuecomment-25642230
.

@joey711
Copy link

joey711 commented Oct 3, 2013

Whoops, just noticed/remembered that it includes optional parallelization of the chunk parsing:
https://github.com/joey711/phyloseq/blob/master/R/IO-methods.R#L557
I like it, but we probably want to avoid that dependency on foreach-plus-backend packages. Given the concern for RAM constraints, parallelization is probably moot, anyway. Easy enough to modify.

@kylebittinger
Here are the import_qiime tests in phyloseq:
https://github.com/joey711/phyloseq/blob/master/inst/tests/test-IO.R#L67

@kylebittinger
Copy link
Contributor

Thanks!

On Thu, Oct 3, 2013 at 2:25 PM, Paul J. McMurdie
notifications@github.comwrote:

Whoops, just noticed/remembered that it includes optional parallelization
of the chunk parsing:
https://github.com/joey711/phyloseq/blob/master/R/IO-methods.R#L557
I like it, but we probably want to avoid that dependency on
foreach-plus-backend packages. Given the concern for RAM constraints,
parallelization is probably moot, anyway. Easy enough to modify.

@kylebittinger https://github.com/kylebittinger
Here are the import_qiime tests in phyloseq:
https://github.com/joey711/phyloseq/blob/master/inst/tests/test-IO.R#L67


Reply to this email directly or view it on GitHubhttps://github.com//issues/1135#issuecomment-25645441
.

@jairideout
Copy link
Member Author

This all sounds good, and I agree that having writers (in addition to parsers) is a good idea.

Regarding bugs, I can help come up with unit test cases for issues that we've run into in the past. A good place to start, though, is to add unit tests that mimic the ones in QIIME, highlighted here:

https://github.com/qiime/qiime/blob/master/tests/test_parse.py#L145-L254

@kylebittinger
Copy link
Contributor

Still working on tests for this.

What is the expected behavior if there are spaces at the beginning/end of a
fieldname in the header? I am guessing we should take care that the column
name includes the appropriate spaces. It's not covered in the QIIME tests
(test_parse.py), wondering if I should throw an error.

--Kyle

On Thu, Oct 3, 2013 at 2:36 PM, Jai Ram Rideout notifications@github.comwrote:

This all sounds good, and I agree that having writers (in addition to
parsers) is a good idea.

Regarding bugs, I can help come up with unit test cases for issues that
we've run into in the past. A good place to start, though, is to add unit
tests that mimic the ones in QIIME, highlighted here:

https://github.com/qiime/qiime/blob/master/tests/test_parse.py#L145-L254


Reply to this email directly or view it on GitHubhttps://github.com//issues/1135#issuecomment-25646357
.

@joey711
Copy link

joey711 commented Oct 14, 2013

I prefer to remove spaces at the beginning/end of headers. Is there a specific scenario where they arise, or it this just a general check?

Sorry I've been slow on this. I'll try to budget some time this week. Would be nice to get this submitted to CRAN soon.

@jairideout
Copy link
Member Author

check_id_map.py flags leading/trailing whitespace in the header fields as a warning. qiime.parse.parse_mapping_file parses the file and leaves the leading/trailing whitespace in the string.

This behavior is a bit odd, since the data fields (i.e. non-header fields) have leading/trailing whitespace stripped, but the headers don't.

Are there any reasons why this behavior is in place? If not, I'd vote to have both QIIME and R strip leading/trailing whitespace from all fields (for consistency).

@gregcaporaso
Copy link
Contributor

+1

@kylebittinger
Copy link
Contributor

That sounds good to me. I skirted the issue for the current pull request.
Starting to think about this, we have a lot of corner cases to decide upon
once we open this Pandora's box. (And there are similar issues for
"classic" OTU tables.)

I included most of the tests in the QIIME python code. I did not
re-implement the tests in phyloSeq per se -- each item there is already
tested implicitly or does not apply.

Pull request is in!
--Kyle

On Mon, Oct 14, 2013 at 4:30 PM, Jai Ram Rideout
notifications@github.comwrote:

check_id_map.py flags leading/trailing whitespace in the header fields as
a warning. qiime.parse.parse_mapping_file parses the file and leaves the
leading/trailing whitespace in the string.

This behavior is a bit odd, since the data fields (i.e. non-header fields)
have leading/trailing whitespace stripped, but the headers don't.

Are there any reasons why this behavior is in place? If not, I'd vote to
have both QIIME and R strip leading/trailing whitespace from all fields
(for consistency).


Reply to this email directly or view it on GitHubhttps://github.com//issues/1135#issuecomment-26286244
.

@kylebittinger
Copy link
Contributor

Second pull request is in (to qiime.io) for stripping whitespace from
column names in a mapping file. Time to make the change in QIIME?

What is on our to-do list before we get qiime.io onto CRAN? (I am excited
to use this in my qiimer package, upon release.)

Best,
Kyle

On Wed, Oct 16, 2013 at 10:18 AM, Kyle Bittinger kylebittinger@gmail.comwrote:

That sounds good to me. I skirted the issue for the current pull request.
Starting to think about this, we have a lot of corner cases to decide upon
once we open this Pandora's box. (And there are similar issues for
"classic" OTU tables.)

I included most of the tests in the QIIME python code. I did not
re-implement the tests in phyloSeq per se -- each item there is already
tested implicitly or does not apply.

Pull request is in!
--Kyle

On Mon, Oct 14, 2013 at 4:30 PM, Jai Ram Rideout <notifications@github.com

wrote:

check_id_map.py flags leading/trailing whitespace in the header fields as
a warning. qiime.parse.parse_mapping_file parses the file and leaves the
leading/trailing whitespace in the string.

This behavior is a bit odd, since the data fields (i.e. non-header
fields) have leading/trailing whitespace stripped, but the headers don't.

Are there any reasons why this behavior is in place? If not, I'd vote to
have both QIIME and R strip leading/trailing whitespace from all fields
(for consistency).


Reply to this email directly or view it on GitHubhttps://github.com//issues/1135#issuecomment-26286244
.

@jairideout
Copy link
Member Author

Thanks @kylebittinger!

Yep, I think QIIME should be updated so that both parsers match. The QIIME mapping file docs should also be updated.

@joey711
Copy link

joey711 commented Oct 17, 2013

I'll take a look. I'm guessing qiime.io is still short on documentation. Like vignettes. Possibly function-level doc, too. I'll run some package build checks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants