review request for EML #80

Closed
10 of 13 tasks
cboettig opened this issue Oct 14, 2016 · 13 comments

Comments

@cboettig
Member

Summary

  • What does this package do? (explain in 50 words or less):

Parse and serialize Ecological Metadata Language ('EML') files into S4 objects.

  • Paste the full DESCRIPTION file inside a code block below:
Package: EML
Type: Package
Title: Read and Write Ecological Metadata Language Files
Version: 1.0.0.0
Authors@R: c(
    person("Carl", "Boettiger", role = c("aut", "cre"),
           email = "cboettig@gmail.com"),
    person("Matt", "Jones", role="aut"),
    person("Bryce", "Mecum", role = "ctb", email = "brycemecum@gmail.com"),
    person("Maëlle", "Salmon", role = "ctb", email = "maelle.salmon@yahoo.se")
    )
URL: https://github.com/ropensci/EML
BugReports: https://github.com/ropensci/EML/issues
Maintainer: Carl Boettiger <cboettig@gmail.com>
Description: Parse and serialize Ecological Metadata Language ('EML') files into
    S4 objects.
License: FreeBSD
LazyData: TRUE
RoxygenNote: 5.0.1
Imports:
    XML,
    methods,
    tools
Suggests:
    testthat,
    knitr,
    rmarkdown
VignetteBuilder: knitr
Encoding: UTF-8
Collate:
    'classes.R'
    'classes-stmml.R'
    'coercions.R'
    'eml_find.R'
    'get_unitList.R'
    'get_attributes.R'
    'eml_get.R'
    'eml_validate.R'
    'get_TextType.R'
    'get_coverage.R'
    'is_standardUnit.R'
    'literature_coercions.R'
    'methods.R'
    'read_eml.R'
    'set_TextType.R'
    'set_attributes.R'
    'set_coverage.R'
    'set_methods.R'
    'set_physical.R'
    'set_unitList.R'
    'write_eml.R'
    'xml-s4.R'
    'zzz.R'
  • URL for the package (the development repository, not a stylized html page):

https://github.com/ropensci/EML

  • Who is the target audience?

Researchers publishing data who want to use a formal, machine-readable metadata standard to describe their data files, and/or researchers consuming data published with EML metadata (e.g. NEON or LTER data.)

  • Are there other R packages that accomplish the same thing? If so, what is different about yours?

Not that I'm aware of.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration with Travis CI and/or another service.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package contains a paper.md with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:
  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
  • If this is a resubmission following rejection, please explain the change in circumstances:
  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
  • Leah Wasser
  • Peter Slaughter
  • Bryce Mecum (also a minor contributor to the package)
  • Corinna Gries
  • Chris Jones
@noamross
Contributor

noamross commented Oct 15, 2016

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Thanks, @cboettig. Looks good, and I am currently seeking reviewers. The output from goodpractice::gp() is below. My notes from this are:

  • The modest test coverage is largely due to the very large (auto-generated) R/methods.R file - not every possible EML field has tests.
  • Most code style notes are from the meta-programming in inst/create-package/, which may not have been covered by any linting or other checking done by the authors.

Reviewers should take note of the package build process (described at the bottom of the README.md) and review the build scripts rather than line-by-line review of the class definitions (classes.R) and methods (methods.R).

── GP EML ──────────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 65% of
    code lines are covered by test cases.

    R/classes-stmml.R:60:NA
    R/classes-stmml.R:61:NA
    R/classes-stmml.R:62:NA
    R/classes-stmml.R:63:NA
    R/classes-stmml.R:64:NA
    ... and 1732 more lines

  ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users
    and developers are used it and it is easier to read your code for them if
    you use '<-'.

    inst/create-package/create_classes.R:235:17
    inst/create-package/create_classes.R:318:18
    inst/create-package/create_methods.R:24:7
    inst/create-package/create_methods.R:36:9
    inst/create-package/create_methods.R:38:11
    ... and 17 more lines

  ✖ avoid long code lines, it is bad for readability. Also, many people prefer
    editor windows that are about 80 characters wide. Try make your lines
    shorter than 80 characters

    inst/create-package/create_classes.R:68:1
    inst/create-package/create_classes.R:103:1
    inst/create-package/create_classes.R:106:1
    inst/create-package/create_classes.R:116:1
    inst/create-package/create_classes.R:155:1
    ... and 746 more lines

  ✖ avoid calling setwd(), it changes the global environment. If you need it,
    consider using on.exit() to restore the working directory.

    R/get_TextType.R:41:5
    R/get_TextType.R:54:5
    R/set_TextType.R:87:5
    R/set_TextType.R:96:5

  ✖ avoid sapply(), it is not type safe. It might return a vector, or a list,
    depending on the input data. Consider using vapply() instead.

    inst/create-package/create_methods.R:102:13
    inst/create-package/create_package.R:300:1
    inst/create-package/explore.R:4:5
    inst/examples/eml_summary.R:4:16
    R/classes.R:14:14
    ... and 136 more lines

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...)
    expressions. They are error prone and result 1:0 if the expression on the
    right hand side is zero. Use seq_len() or seq_along() instead.

    inst/create-package/create_methods.R:66:17

  ✖ not import packages as a whole, as this can cause name clashes between the
    imported packages. Instead, import only the specific functions you need.
──────────────────────────────────────────────

Reviewers: @gmbecker @cgries
Due date: 2016-11-14

@noamross
Contributor

First reviewer: @gmbecker

@noamross
Contributor

Second reviewer: @cgries

@sckott
Contributor

sckott commented Nov 1, 2016

@gmbecker @cgries - hey there, it's been 14 days, please get your review in by Nov 14, thanks 😺 (ropensci-bot)

@gmbecker

gmbecker commented Nov 10, 2016

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

6 for initial review and comment writing

Review Comments

I do not have experience within the field of application for this package. As such, I will focus in this review on the design and technical aspects of the package while speaking less to its direct applicability to problems computational ecologists actually face.

The package seems generally good, with code (both generated and hand-written) which is well conceived and behaves as intended. I feel that with a few revisions to tighten up some aspects of it, it will be a strong and welcome contribution to rOpenSci's core mission to facilitate interaction with open-data sources and formats from within R.

S4 design and usage

There are some S4-specific design and stylistic choices which I recommend the authors revisit.

Constructors

Generally, all classes which any end users are expected to create themselves should have exported constructors, rather than requiring users to call new directly. This preserves the abstraction if the underlying class structure changes, and usually increases the readability of the users' code.

This is doable even when classes (and thus constructors) are automatically generated. I wrote some code that seems to successfully do the individual constructor generation for this package. I'm happy to contribute it, though achieving a positive final review and acceptance is, of course, in no way contingent on that particular code being used.

Another note about constructors is that I believe it is stylistically cleaner to map the arguments directly to the slots and to use constructors for the classes of those slots as necessary, recursively, rather than utilizing nested new calls as the code does now, e.g., in set_physical.

Finally, I do not feel that set_ is an appropriate prefix for constructors, as in the S4 OOP style that suggests that an existing object is being modified by having a component (e.g., slot) of it set. The standard paradigm is to have constructors be functions with names identical to the classes they construct, although that is not mandatory.
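
As a rough illustration of the constructor style described above, the sketch below defines two small classes and gives each a constructor named after the class, with arguments mapped directly to slots. The classes Size and Physical, their slots, and the default unit are hypothetical and are not the EML package's actual definitions.

```r
library(methods)

# Hypothetical classes for illustration; not the EML package's definitions.
setClass("Size", slots = c(value = "numeric", unit = "character"))
setClass("Physical", slots = c(objectName = "character", size = "Size"))

# Constructors share their class names; arguments map directly to slots.
Size <- function(value, unit = "bytes") {
  new("Size", value = value, unit = unit)
}

Physical <- function(objectName, size) {
  # Accept either a ready-made Size object or a bare number, and build the
  # nested object through its own constructor rather than a nested new().
  if (!is(size, "Size")) size <- Size(size)
  new("Physical", objectName = objectName, size = size)
}

p <- Physical("data.csv", Size(1024))
p@size@unit
```

User code written against Physical() and Size() would keep working if the underlying slot layout changed, which is the abstraction argument made above.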

show method

The current show method for eml objects converts the object to XML before printing. This seems cumbersome, and, I suspect, is likely to cause problems for real-world-sized EML documents (in the same way the XML package does when printing a large XML blob). It seems that EML is structured enough that a meaningful summary could be displayed by show, while the emlToXML function would still allow users to easily display the entire XML blob if desired.
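
A minimal sketch of what a summarizing show method could look like, assuming a simplified stand-in class. EmlDoc and its three slots are hypothetical; which fields a real eml summary should display is for the authors to decide.

```r
library(methods)

# Hypothetical stand-in for a parsed EML document.
setClass("EmlDoc", slots = c(title = "character",
                             creators = "character",
                             dataTables = "list"))

# Print a short summary instead of serializing the whole object to XML.
setMethod("show", "EmlDoc", function(object) {
  cat("<EmlDoc>\n")
  cat("  title:      ", object@title, "\n", sep = "")
  cat("  creators:   ", length(object@creators), "\n", sep = "")
  cat("  dataTables: ", length(object@dataTables), "\n", sep = "")
  invisible(NULL)
})

doc <- new("EmlDoc", title = "Example dataset",
           creators = c("A. Author", "B. Author"),
           dataTables = list("table1"))
show(doc)
```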

duplicated classes

I understand that the classes are automatically generated, but there seem to be some minor issues with how this is done. My copy of the package has two value class definitions, one right after the other, and they are different to boot. The first one, of course, will do nothing.

Also, there are pairs of classes that differ only in capitalization, e.g., author and Author. These are troublesome as well, from a design perspective, because the classes are the same, other than a different ordering of the class slots. If the actual order of the class slots is relied upon in the code, I strongly encourage refactoring away from that. Slots should, in my opinion, be checked for and used exclusively by name or accessor method.

Other

eml_get

The eml_get function's return value seems unintuitive to me. When run on the example hf205 EML content for "physical" elements, it returns a list, of length 3, with 3 length-1 ListOfPhysical objects. I don't follow why this is not a single length 3 ListOfPhysical object.

class(blah) == "thing"

This is not a safe way to test the class of an object, as it will miss/fail for both S3 (vector of classes) and S4 (formal) inheritance. E.g., a tibble will fail a class(obj) == "data.frame" check, as would an S4 class that contains data.frame. All such checks should be refactored to use is.
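
The difference can be seen with base R alone. In this sketch, my_tbl is a made-up S3 class standing in for a tibble's class vector, and Base/Derived are hypothetical S4 classes.

```r
library(methods)

# S3: an object whose class vector has several entries, as a tibble's does.
df <- structure(data.frame(x = 1:3), class = c("my_tbl", "data.frame"))
class(df) == "data.frame"    # one logical per class entry, not a single answer
inherits(df, "data.frame")   # a single TRUE/FALSE that respects inheritance
is(df, "data.frame")

# S4: an instance of a subclass is still an instance of its parent class.
setClass("Base", slots = c(id = "character"))
setClass("Derived", contains = "Base")
obj <- new("Derived", id = "a")
class(obj) == "Base"         # compares against "Derived" only
is(obj, "Base")
```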

set_coverage data.frame column order requirement

This seems like something that can be checked and enforced programmatically rather than relying on good user behavior. The possible columns are fixed and the order is known.

Another approach here is to create an S4 object with constructor that formally models the taxonomic hierarchy, rather than relying on a powerful but semantically poor structure like a data.frame. Particularly since from what I can tell it will only ever have one row, making the structure not a great fit anyway.
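
One way the column check could be done is sketched below. The helper order_rank_columns and the list of rank names are assumptions for illustration; the columns that set_coverage actually accepts may differ.

```r
# Assumed rank names and order, for illustration only.
known_ranks <- c("kingdom", "phylum", "class", "order",
                 "family", "genus", "species")

# Hypothetical helper: validate the column names, then reorder them by rank
# so the caller does not have to supply them in any particular order.
order_rank_columns <- function(df, ranks = known_ranks) {
  stopifnot(is.data.frame(df))
  names(df) <- tolower(names(df))
  unknown <- setdiff(names(df), ranks)
  if (length(unknown) > 0) {
    stop("Unrecognized rank column(s): ", paste(unknown, collapse = ", "))
  }
  df[, intersect(ranks, names(df)), drop = FALSE]
}

shuffled <- data.frame(species = "purpurea", genus = "Sarracenia",
                       Kingdom = "Plantae", stringsAsFactors = FALSE)
names(order_rank_columns(shuffled))
```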

eml construction

It's not clear that data.frames are the correct model for constructing the components of eml. It seems that an eml_attribute constructor, which can be called repeatedly and constructs a single attribute (a row in the data.frame as it is currently factored), would be more user friendly. With respect, it seems like it would be at least as easy to write a csv file directly as to construct and rbind together these data.frames. I feel more/deeper api-design work is called for on the EML construction side of things.
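
A sketch of a per-attribute constructor along these lines, assuming it still produces a data.frame row so that it stays compatible with a tabular workflow. The helper attribute_row and its argument and column names are hypothetical, not the columns that set_attributes actually expects.

```r
# Hypothetical helper: build and validate one attribute at a time.
attribute_row <- function(name, definition, unit = NA_character_) {
  stopifnot(is.character(name), length(name) == 1,
            is.character(definition), length(definition) == 1)
  data.frame(attributeName = name,
             attributeDefinition = definition,
             unit = unit,
             stringsAsFactors = FALSE)
}

# Each call describes a single attribute; the rows are combined at the end.
attributes_table <- rbind(
  attribute_row("site", "Sampling site code"),
  attribute_row("temp", "Air temperature", unit = "celsius")
)
attributes_table
```

Validation happens per attribute at construction time, while the combined result can still be read from or written to a csv file.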

auto-generated code

Files with autogenerated code should contain a header comment to that effect and telling possible future contributors not to edit the file manually. This should be emitted automatically by the code-generation apparatus.
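
For instance, the generator could route all file writing through a small helper that prepends the notice. The helper write_generated and the header wording are hypothetical; the generator name defaults to the create_package.R script listed in the goodpractice output.

```r
# Hypothetical helper for a code generator: every emitted file starts with
# a notice telling contributors not to edit it by hand.
write_generated <- function(code, path, generator = "create_package.R") {
  header <- c(
    "# AUTO-GENERATED FILE: do not edit by hand.",
    paste0("# Regenerate with ", generator,
           "; manual changes will be overwritten."),
    ""
  )
  writeLines(c(header, code), path)
  invisible(path)
}

out <- write_generated('setClass("example", slots = c(x = "character"))',
                       tempfile(fileext = ".R"))
readLines(out, n = 2)
```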

importing packages

The entire XML package is imported. I actually feel this is correct in this case (and not terrible generally imho) but it is technically against the rOpenSci package guidelines. I leave it up to @noamross as the editor to make the final call on this. I recommend it not be an issue for final acceptance.

Automated submission to EML repository?

This is somewhat conjecture on my part, given the caveat I started these review comments with, but is there a single major repository (or a small set of them) for EML? If so, does it have an API for submission? If so, bindings to that API, so that users can submit their newly created EML from within R, would magnify the utility of the creation aspect of this package substantially, and should be added either to this package or via the creation of a sister convenience package.

Musing and recommendations for consideration

These are things I feel warrant consideration by the package authors, but should not be taken as firm requirements, and may not even be desirable upon a closer look. I do ask that the authors' notes in response to this review address these points and why they did or did not ultimately agree.

have the S4 version of the object carry around its source XML

XML objects are pointers, so duplication/memory will not be an issue. This could get hairy/infeasible if users are modifying the S4 representation of the eml manually and then re-exporting, but I don't get the sense that this is an intended use-pattern. This would make a couple of parts of the code much cleaner/more efficient when they switch between representations.

Does everything need to inherit from eml-2.1.1?

I can see some benefits to this, but it also complicates things. A lot of the low-level/internal classes would inherit directly from character if not for also implementing eml-2.1.1. Are those extra slots ever used at the low level, or only, e.g., once per larger document?

@noamross
Contributor

Thank you for the review @gmbecker!

Regarding the full package import of XML, I agree that it's fine. There are a few points in our review criteria where we need to clarify requirement v. recommendation, and how we draw the line. I will make a note of this one for our next update.

@gmbecker

gmbecker commented Nov 11, 2016

Further reviewer comment

@cboettig has started a discussion regarding the time-scale of implementing improvements vs putting out a stable release here ropensci/EML#183. In my capacity as a reviewer I feel it is most appropriate that I place my thoughts on this subject here as a secondary part of my review. Package authors, please find my recommendations on this question below.

The most important aspect of a stable (read: CRAN) release is that the API is stable upon release for at least the near and preferably the medium term. As such, the most important issues to fix before release are ones that would actually break API compatibility. These are

  • changing the output class of eml_get, and
  • changing the names of existing constructors to remove the set_ prefix which I argue is misleading/a misnomer

This is because if you release now without those changes, then ship the next release with the changes in half a year (or whenever), people's existing code will break. This is less acceptable for a CRAN release than it is for a package under heavy development within the rOpenSci incubator.

Beyond that, I feel the class checking is the next most important. People in the CRAN sphere are almost sure (in my opinion) to want to pass tbl_df objects to things that take data.frames. Tests to ensure that this works would not be out-of-place, and they should pass before CRAN release.

The other changes, while more important from a package design improvement perspective, can probably wait if they must, because they will likely be either backwards compatible or large enough that they warrant a new major version of the software.

EDIT: @noamross if as the editor you feel this post oversteps my role as a reviewer, I can remove this post and make my comments in the linked issue (or not at all if it is inappropriate for a reviewer to weigh in on the response to his comments in this manner).

@cgries

cgries commented Nov 15, 2016

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:


Review Comments

As an LTER information manager, I have experience in using EML to document ecological datasets and will concentrate on reviewing those aspects rather than the technical design and implementation of the package, which I am not an expert in.

This package is greatly needed and will be used widely by ecological information managers and some scientists. I have now used it to develop EML for several datasets and found it to be at a point where a stable version should be available on CRAN.

For the average user (me included) it would be nice to have somewhat more extensive documentation. For example, in the description field of functions it would help to see more than just the function name again. The examples are all very helpful and work great. The vignettes are great as well and really help to get started writing code to generate EML. The explanation for the ‘new’ method is, however, very short. I realize that it mainly takes understanding XML, and EML specifically, to get the hang of that, and EML is complex, with many ‘new’ constructs needed for the more obscure elements.

One improvement that could still be considered for the first release is dealing with people information. I found myself very quickly writing some helper functions, which I then also discovered in the arctic repository that was mentioned by @amoeba in https://github.com/ropensci/EML/issues/183. Given that we both seemed to need such helpers and that the basic as.person is not sufficient, some thought might go into improving that before everyone starts writing those helper functions. I would like to see the option to provide a csv file/data frame with the people information, following the pattern of attribute and taxonomic coverage information. I have had datasets with almost 60 creators, and their sequence changed several times in the process of publishing the paper. That would be very tedious to work out in code.

This leads me to respond to @gmbecker that I find it very convenient to provide attribute and taxonomic information as a data frame. It allows for easy retrieval of the information from either a database or, more frequently, from a spreadsheet that data providers supply with this information. Most data providers don’t know about the EML subtleties of measurementScale or how to correctly spell the units, but it takes very little editing on my part to prepare a spreadsheet from the information that I do receive. Taxonomic coverage can be extensive and again frequently comes out of a database, i.e., it can easily be provided as a csv file and data frame.

@noamross
Contributor

Thanks for the review @cgries!

@cboettig
Member Author

Update:

An initial version of the package is now on CRAN, after discussions with the EML package development team. The package is already in significant use by the DataONE folks, so it made sense to have a stable, archived version on CRAN. We hope to address the remaining issues raised here in a v1.1 release (https://github.com/ropensci/EML/milestone/9), in particular those that involve the higher-level interface, as well as a few performance issues reported by users.

@sckott
Contributor

sckott commented Dec 22, 2016

@cgries do you know an estimate for your time for reviewing in hours?

@noamross
Contributor

It having been some time since reviews here, I just want to check in with @gmbecker and @cgries that they would still be able and interested in doing follow-up for this review should the EML team submit it for their milestone at the end of the year. We'd be glad to have you, but if you can't we'll assign new reviewers from scratch (who can draw on your reviews as needed).

@cgries

cgries commented Aug 23, 2017 via email
