New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genbankr #47

Closed
gmbecker opened this Issue May 20, 2016 · 14 comments

Comments

Projects
None yet
4 participants
@gmbecker

gmbecker commented May 20, 2016

    1. What does this package do? (explain in 50 words or less)
      Parses GenBank and GenPept files into a useful datastructure which integrates with the Bioconductor ecosystem.
    1. Paste the full DESCRIPTION file inside a code block below.
Package: genbankr
Version: 1.1.3
Title: Parsing GenBank files into semantically useful objects
Description: Reads Genbank files.
Authors@R: as.person(c(
      "Gabriel Becker <becker.gabriel@gene.com> [aut, cre]",
      "Michael Lawrence <lawrence.michael@gene.com> [aut]"))
Copyright: Genentech, Inc.
Depends:
    methods
Imports:
    BiocGenerics,
    IRanges,
    GenomicRanges(>= 1.23.24),
    GenomicFeatures,
    Biostrings,
    VariantAnnotation,
    rtracklayer,
    S4Vectors,
    GenomeInfoDb,
    Biobase
Suggests:
    RUnit,
    rentrez,
    knitr
Maintainer: Gabriel Becker <becker.gabriel@gene.com>
VignetteBuilder: knitr
License: Artistic-2.0
RoxygenNote: 5.0.1.9000
biocViews: Infrastructure, DataImport
NeedsCompilation: no
    1. URL for the package (the development repository, not a stylized html page)
      https://github.com/gmbecker/genbankr
    1. What data source(s) does it work with (if applicable)?
      NCBI Nucleotide and Protein databases (GenBank and GenPeptfiles)
    1. Who is the target audience?
      Scientists working on microbial genomics/bioinformatics, on organisms for which NCBI and GenBank are the source of record for annotations.
    1. Are there other R packages that accomplish the same thing? If so, what is different about yours?
      The rentrez package offers low level querying facilities for all NCBI databases, but does not parse the results into a semantically useful datastructure in the GenBank/GenPept case. I view genbankr as working with rentrez rather than replacing it. In fact, when users give an accession rather than an already downloaded file, rentrez is used to retrieve the raw data, which is then parsed by genbankr machinery.
    1. Check the box next to each policy below, confirming that you agree. These are mandatory.
    • This package does not violate the Terms of Service of any service it interacts with.
    • The repository has continuous integration with Travis CI and/or another service
    • The package contains a vignette
    • The package contains a reasonably complete README with devtools install instructions
    • The package contains unit tests
    • The package only exports functions to the NAMESPACE that are intended for end users
    1. Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.
    • Are there any package dependencies not on CRAN?
    • Do you intend for this package to go on CRAN?
    • Does the package have a CRAN accepted license?
    • Did devtools::check() produce any errors or warnings? If so paste them below.
    1. Please add explanations below for any exceptions to the above:
      This package is part of the Bioconductor package platform and depends on Bioconductor packages as well as CRAN packages. All dependencies are published in either CRAN or Bioconductor.

This package will not go on CRAN, as it is and will continue to be published as a Bioconductor package (with ROpenSci co-branding, pending success of this submission).

    1. If this is a resubmission following rejection, please explain the change in circumstances.
@sckott

This comment has been minimized.

Show comment
Hide comment
@sckott

sckott May 20, 2016

Member

Thanks for the submission @gmbecker

Member

sckott commented May 20, 2016

Thanks for the submission @gmbecker

@sckott

This comment has been minimized.

Show comment
Hide comment
@sckott

sckott May 23, 2016

Member

Seeking reviewers

Member

sckott commented May 23, 2016

Seeking reviewers

@sckott sckott self-assigned this May 23, 2016

@sckott

This comment has been minimized.

Show comment
Hide comment
@sckott

sckott May 25, 2016

Member

Reviewers: @dwinter
Due date: 2016-06-15

Member

sckott commented May 25, 2016

Reviewers: @dwinter
Due date: 2016-06-15

@sckott

This comment has been minimized.

Show comment
Hide comment
@sckott

sckott Jun 15, 2016

Member

@dwinter
Due date: 2016-06-15 - hey there, it's been 21 days, please get your review in soon, thanks 😺 (ropensci-bot)

Member

sckott commented Jun 15, 2016

@dwinter
Due date: 2016-06-15 - hey there, it's been 21 days, please get your review in soon, thanks 😺 (ropensci-bot)

@dwinter

This comment has been minimized.

Show comment
Hide comment
@dwinter

dwinter Jun 15, 2016

Member

Hi @sckott / bot / @gmbecker -- sorry about the delay, I've been working on this but it will take a little longer. Will aim for this week though.

Member

dwinter commented Jun 15, 2016

Hi @sckott / bot / @gmbecker -- sorry about the delay, I've been working on this but it will take a little longer. Will aim for this week though.

@sckott

This comment has been minimized.

Show comment
Hide comment
@sckott

sckott Jun 15, 2016

Member

thanks!

Member

sckott commented Jun 15, 2016

thanks!

@dwinter

This comment has been minimized.

Show comment
Hide comment
@dwinter

dwinter Jun 18, 2016

Member

This is a really nice package, which fills an important gap in the tools available to work on sequence data in R. Genbank is a very large sequence database, and genbank-formatted records provide rich information about the sequences it contains. This package provides tools to fetch and parse genbank records, and a set of functions to extract information from the parsed records. Importantly, the paclage integrates well with Bioconductor allowing the information within sequence files to be represented as flexible and useful objects like those provided by GenomicRanges.

Clashes between ropensci and Bioconductor styles

One of the difficulties in reviewing the package is that genbankr is part of the Bioconductor project, and therefore meets different coding standards and practices than those suggested for ropensci. Obviously it's up to the ropensci leaders and @gmbecker to decide how this can be negotiated, I will just list those I find here.

  • ropensci suggests using testthat for unit tests, Bioconductor has its own testing framework that uses RUnit.
  • ropensci suggest snake_case for most variables, Bioconductor uses camelCase.
  • As I understand it, the github repo for this package is a fork of a mirror of the Bioconductor source? I don't know what role the github repo will play for the package and how it can be kept in synch with the "official" source. (I'm sure it can be done, I just wanted to flag it as something that needs to be thought about)

Readme

This is perhaps another example of the different between each project. In ropensci the github repo is a "landing page" for the package, whereas as Bioconductor presents much of the same information here. Even so, the current README for genbankr source code could be improved by adding

  • badges for release, continuous integration and test coverage
  • Some small usage example
  • A clear statement about where to go for help or to file bugs (the Bioconductor QA site?)

I have to say I also find the the installation instructions a little daunting. Perhaps start by emphasizing that almost all users should use the current Bioconductor release, which can be installed easily with biocLite. Can the right version of biocLite be accessed via library(BiocInstaller)? If so, these seems preferable to telling users to source a URL to install packages.

Documentation

I found package-level function to be very good, both clear and complete. Though perhaps not technically documentation, the print function for genbank records is also well thought-out, providing the user wih a clear indication of what each record contains.

The package vignette is also clear and does a good job of demonstrating basic usages and integrations for the package. I think it would also be very helpful to show how the features in a genbank records can be used in an analysis.

For instance, an example demonstrating the best way to use features in a given file and getSeq to retriece only the CDS from an mRNA sequence would document a possible use-case and demonstrate how integration with GenomicRanges can be helpful.

For the example of reading data into and R sessions, I think it's better to store the data in inst/exdata and use system.file to specify the path in vignette. That way a user can copy and paste the example and see it work. (Hadley's book has a section on this)

I appreciate the vignette section of the "limitations" fo the package. The lack of standards as to exactly what goes into a genbank file means such limitations are inevitable, but this clear statement about how those problems are dealt with is helpful.

The code

I have very little to comment on the code itself -- it's well written, clear and appropriately modular. I have not stress tested the package on very large files, but performance on typically-sized files has been good. I don't think there are an missing functions -- the package does a good job of "doing job well", so I like the focus of the current packages is good.

@aahowel3 and I have been using the package in research this week, and we found a couple of records that break the parsing function. These are filed under issue gmbecker/genbankr#1

Everything else

It would be good to have a code of conduct -- even if you envisage mostly working on the project yourself it's useful to have an indication that that would-be collaborators can expect to be treated well.

As mentioned above, it would be useful to have a clear indication of where bugs and/or questions about the package should be sent.

I think the the VignetteBuilder entry in the DESCRIPTION should be changed to rmarkdown, and knitr should be replaced by rmarkdown in the Suggests section too. With these changes I can compile the vignette, run R CMD check and have all tests pass.

Finally, the reviewer recommendations ask for an estimate of the time taken to review the package.
I guess I spend about 2hrs using the package in research (i.e. doing stuff I would otherwise be working on, probably less efficiently 😄), another 2 closely reviewing the code, design and documentation of the package and one hour to prepare this comment.

Member

dwinter commented Jun 18, 2016

This is a really nice package, which fills an important gap in the tools available to work on sequence data in R. Genbank is a very large sequence database, and genbank-formatted records provide rich information about the sequences it contains. This package provides tools to fetch and parse genbank records, and a set of functions to extract information from the parsed records. Importantly, the paclage integrates well with Bioconductor allowing the information within sequence files to be represented as flexible and useful objects like those provided by GenomicRanges.

Clashes between ropensci and Bioconductor styles

One of the difficulties in reviewing the package is that genbankr is part of the Bioconductor project, and therefore meets different coding standards and practices than those suggested for ropensci. Obviously it's up to the ropensci leaders and @gmbecker to decide how this can be negotiated, I will just list those I find here.

  • ropensci suggests using testthat for unit tests, Bioconductor has its own testing framework that uses RUnit.
  • ropensci suggest snake_case for most variables, Bioconductor uses camelCase.
  • As I understand it, the github repo for this package is a fork of a mirror of the Bioconductor source? I don't know what role the github repo will play for the package and how it can be kept in synch with the "official" source. (I'm sure it can be done, I just wanted to flag it as something that needs to be thought about)

Readme

This is perhaps another example of the different between each project. In ropensci the github repo is a "landing page" for the package, whereas as Bioconductor presents much of the same information here. Even so, the current README for genbankr source code could be improved by adding

  • badges for release, continuous integration and test coverage
  • Some small usage example
  • A clear statement about where to go for help or to file bugs (the Bioconductor QA site?)

I have to say I also find the the installation instructions a little daunting. Perhaps start by emphasizing that almost all users should use the current Bioconductor release, which can be installed easily with biocLite. Can the right version of biocLite be accessed via library(BiocInstaller)? If so, these seems preferable to telling users to source a URL to install packages.

Documentation

I found package-level function to be very good, both clear and complete. Though perhaps not technically documentation, the print function for genbank records is also well thought-out, providing the user wih a clear indication of what each record contains.

The package vignette is also clear and does a good job of demonstrating basic usages and integrations for the package. I think it would also be very helpful to show how the features in a genbank records can be used in an analysis.

For instance, an example demonstrating the best way to use features in a given file and getSeq to retriece only the CDS from an mRNA sequence would document a possible use-case and demonstrate how integration with GenomicRanges can be helpful.

For the example of reading data into and R sessions, I think it's better to store the data in inst/exdata and use system.file to specify the path in vignette. That way a user can copy and paste the example and see it work. (Hadley's book has a section on this)

I appreciate the vignette section of the "limitations" fo the package. The lack of standards as to exactly what goes into a genbank file means such limitations are inevitable, but this clear statement about how those problems are dealt with is helpful.

The code

I have very little to comment on the code itself -- it's well written, clear and appropriately modular. I have not stress tested the package on very large files, but performance on typically-sized files has been good. I don't think there are an missing functions -- the package does a good job of "doing job well", so I like the focus of the current packages is good.

@aahowel3 and I have been using the package in research this week, and we found a couple of records that break the parsing function. These are filed under issue gmbecker/genbankr#1

Everything else

It would be good to have a code of conduct -- even if you envisage mostly working on the project yourself it's useful to have an indication that that would-be collaborators can expect to be treated well.

As mentioned above, it would be useful to have a clear indication of where bugs and/or questions about the package should be sent.

I think the the VignetteBuilder entry in the DESCRIPTION should be changed to rmarkdown, and knitr should be replaced by rmarkdown in the Suggests section too. With these changes I can compile the vignette, run R CMD check and have all tests pass.

Finally, the reviewer recommendations ask for an estimate of the time taken to review the package.
I guess I spend about 2hrs using the package in research (i.e. doing stuff I would otherwise be working on, probably less efficiently 😄), another 2 closely reviewing the code, design and documentation of the package and one hour to prepare this comment.

@gmbecker

This comment has been minimized.

Show comment
Hide comment
@gmbecker

gmbecker Jul 13, 2016

@dwinter

Thank you for the the thorough review. I have addressed and closed the issues you kindly filed in my genbankr repository. Please find my initial responses to the other points of your review below:

Style

Unfortunately, in many cases my use of camelCase are not cases of me adopting the Bioconductor style, but rather my package providing methods to existing Bioconductor generics (e.g., getSeq, cdsBy, etc). As such I'm unable to change the function naming style of my package. In the interest of full disclosure, I would have chosen to use camelCase rather than snake_case anyway, as it is my preference, but that is moot as consistency is paramount and my hands are tied regarding the names of numerous methods my package provides.

Readme

The BiocInstaller package is not itself available on CRAN. The mechanism for initially installing it is the sourcing of the file you mention. Once that is installed, you are correct that library(BiocInstaller) is sufficient, and the sourcing the file is no longer needed. I attempted to communicate that via the readme but I can revisit that if it was unclear.

With respect to badges, these are avaiable at genbankr's Bioconductor splashpage here: https://bioconductor.org/packages/release/bioc/html/genbankr.html. I figured that was sufficient. It looks like I might be able to jury rig the Bioc badges to show up on the github page. I am somewhat loath to do that, but I can look into it if it is a sticking point.

I will add some mention of the Bioconductor help site to the Readme, I agree that that isn't super clear now.

Documentation

I have changed the vignette so that it uses system file to retrieve the example genbank file (this is in devel only, not backported to the current release).

I don't really have the applied subject material expertise to determine what a compelling, more realistic example would be, so I have not added one now. I will talk to people I know are using it and try to work one in for the next release. I will add the mini-applications the reviewer mentions as well for that release.

Code

Thank you for your kind words. As mentioned, I have fixed the two bugs you identified in your issue. Please bring any future problems you find to my attention as well, and the turn-around time should be shorter when I'm not in the middle of a week of conferences.

Everything else

I would have thought that once genbankr is accepted into ROpenSci, the ROpenSci code of conduct would apply. Is that not the case? A package having a code of conduct strikes me as a bit odd (it's not really a community...) but I am of course for all the things that codes of conduct protect and enforce, and would do so within my package's development environment regardless.

@sckott (or @noamross ?) Please let me know if these responses are acceptable, and if not where more work on my part is required.

gmbecker commented Jul 13, 2016

@dwinter

Thank you for the the thorough review. I have addressed and closed the issues you kindly filed in my genbankr repository. Please find my initial responses to the other points of your review below:

Style

Unfortunately, in many cases my use of camelCase are not cases of me adopting the Bioconductor style, but rather my package providing methods to existing Bioconductor generics (e.g., getSeq, cdsBy, etc). As such I'm unable to change the function naming style of my package. In the interest of full disclosure, I would have chosen to use camelCase rather than snake_case anyway, as it is my preference, but that is moot as consistency is paramount and my hands are tied regarding the names of numerous methods my package provides.

Readme

The BiocInstaller package is not itself available on CRAN. The mechanism for initially installing it is the sourcing of the file you mention. Once that is installed, you are correct that library(BiocInstaller) is sufficient, and the sourcing the file is no longer needed. I attempted to communicate that via the readme but I can revisit that if it was unclear.

With respect to badges, these are avaiable at genbankr's Bioconductor splashpage here: https://bioconductor.org/packages/release/bioc/html/genbankr.html. I figured that was sufficient. It looks like I might be able to jury rig the Bioc badges to show up on the github page. I am somewhat loath to do that, but I can look into it if it is a sticking point.

I will add some mention of the Bioconductor help site to the Readme, I agree that that isn't super clear now.

Documentation

I have changed the vignette so that it uses system file to retrieve the example genbank file (this is in devel only, not backported to the current release).

I don't really have the applied subject material expertise to determine what a compelling, more realistic example would be, so I have not added one now. I will talk to people I know are using it and try to work one in for the next release. I will add the mini-applications the reviewer mentions as well for that release.

Code

Thank you for your kind words. As mentioned, I have fixed the two bugs you identified in your issue. Please bring any future problems you find to my attention as well, and the turn-around time should be shorter when I'm not in the middle of a week of conferences.

Everything else

I would have thought that once genbankr is accepted into ROpenSci, the ROpenSci code of conduct would apply. Is that not the case? A package having a code of conduct strikes me as a bit odd (it's not really a community...) but I am of course for all the things that codes of conduct protect and enforce, and would do so within my package's development environment regardless.

@sckott (or @noamross ?) Please let me know if these responses are acceptable, and if not where more work on my part is required.

@sckott

This comment has been minimized.

Show comment
Hide comment
@sckott

sckott Jul 17, 2016

Member
  • snake_case is just a suggestion - consistency is more important, camelCase when interfacing with lots of other camelCase is good
  • If the main location for badges, etc, is on Bioconductor, this should be noted and clearly linked. - Since this is a read-only mirror, and presumably contributions should be provided in a manner different than PRs to this repository, this should be noted, as well, in the README.md or a CONTRIBUTING.md; I imagine issues shouldn't be opened here either, or do you want that?
  • We prefer a CoC in the package repository - the idea is to make it very clear that we expect contributors to follow the CoC. The people contributing and using the package are a community though
Member

sckott commented Jul 17, 2016

  • snake_case is just a suggestion - consistency is more important, camelCase when interfacing with lots of other camelCase is good
  • If the main location for badges, etc, is on Bioconductor, this should be noted and clearly linked. - Since this is a read-only mirror, and presumably contributions should be provided in a manner different than PRs to this repository, this should be noted, as well, in the README.md or a CONTRIBUTING.md; I imagine issues shouldn't be opened here either, or do you want that?
  • We prefer a CoC in the package repository - the idea is to make it very clear that we expect contributors to follow the CoC. The people contributing and using the package are a community though
@gmbecker

This comment has been minimized.

Show comment
Hide comment
@gmbecker

gmbecker Jul 21, 2016

I have modified the README.md file to contain a link at the top to the official Bioconductor splashpage for genbankr.

I've also added CONDUCT.md.

PRs to this github repository should actually be fine, so I haven't made any changes there.

Please let me know if there's anything else outstanding that I need to address.

gmbecker commented Jul 21, 2016

I have modified the README.md file to contain a link at the top to the official Bioconductor splashpage for genbankr.

I've also added CONDUCT.md.

PRs to this github repository should actually be fine, so I haven't made any changes there.

Please let me know if there's anything else outstanding that I need to address.

@sckott

This comment has been minimized.

Show comment
Hide comment
@sckott

sckott Jul 22, 2016

Member

as mentioned above:

  • Some small usage example
  • A clear statement about where to go for help or to file bugs (the Bioconductor QA site?)

Can those be added to readme, or some reason why not?

p.s. @dwinter - thanks for the review hrs estimate

Member

sckott commented Jul 22, 2016

as mentioned above:

  • Some small usage example
  • A clear statement about where to go for help or to file bugs (the Bioconductor QA site?)

Can those be added to readme, or some reason why not?

p.s. @dwinter - thanks for the review hrs estimate

@gmbecker

This comment has been minimized.

Show comment
Hide comment
@gmbecker

gmbecker Aug 1, 2016

Hmm, I missed this message somehow. My bad.

I'd prefer any example usage code that has any meat to it live somewhere where it's compiled (and thus automatically tested), e.g. the vignette or examples in the documentation. I suppose I can add a non-run example call to readGenBank to the readme. Going beyond that seems like it is approaching a vignette that never gets built.

I'll add the link to bioc support, I had meant to do that.

gmbecker commented Aug 1, 2016

Hmm, I missed this message somehow. My bad.

I'd prefer any example usage code that has any meat to it live somewhere where it's compiled (and thus automatically tested), e.g. the vignette or examples in the documentation. I suppose I can add a non-run example call to readGenBank to the readme. Going beyond that seems like it is approaching a vignette that never gets built.

I'll add the link to bioc support, I had meant to do that.

@gmbecker

This comment has been minimized.

Show comment
Hide comment
@gmbecker

gmbecker Aug 1, 2016

Please see the latest commit of README.md, which now includes very basic usage, a link to the vignette for more complete usage examples, and a section on getting help and filing bugs and feature requests.

gmbecker commented Aug 1, 2016

Please see the latest commit of README.md, which now includes very basic usage, a link to the vignette for more complete usage examples, and a section on getting help and filing bugs and feature requests.

@sckott

This comment has been minimized.

Show comment
Hide comment
@sckott

sckott Aug 1, 2016

Member

@gmbecker Looks good to me

approved

Member

sckott commented Aug 1, 2016

@gmbecker Looks good to me

approved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment