Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

biomartr package #93

Closed
10 of 13 tasks
HajkD opened this issue Jan 25, 2017 · 46 comments
Closed
10 of 13 tasks

biomartr package #93

HajkD opened this issue Jan 25, 2017 · 46 comments

Comments

@HajkD
Copy link
Member

HajkD commented Jan 25, 2017

Summary

  • What does this package do? (explain in 50 words or less):

The biomartr package provides researchers with a useful tool for the efficient, straightforward and reproducible handling of large-scale meta-genomic data from NCBI and ENSEMBL databases and intuitive organism centered interface functions for retrieving functional annotation information from the BioMart database. In general, the package promotes computational reproducibility in genomics studies.

  • Paste the full DESCRIPTION file inside a code block below:
Package: biomartr
Title: Genomic Data Retrieval with R
Version: 0.3.0
Author: Hajk-Georg Drost
Maintainer: Hajk-Georg Drost <hgd23@cam.ac.uk>
Description: Perform meta-genomic data retrieval and functional annotation retrieval with
    R.
VignetteBuilder: knitr
NeedsCompilation: yes
Depends:
    R (>= 3.1.1)
Imports:
    biomaRt,
    Biostrings,
    stringi,
    tibble,
    jsonlite,
    data.table (>= 1.9.4),
    dplyr (>= 0.3.0),
    readr (>= 0.2.2),
    downloader (>= 0.3),
    RCurl (>= 1.95-4.5),
    XML (>= 3.98-1.1),
    httr (>= 0.6.1),
    stringr (>= 0.6.2)
Suggests:
    knitr (>= 1.6),
    rmarkdown (>= 0.3.3),
    devtools (>= 1.6.1),
    testthat
License: GPL-3
LazyData: true
URL: https://github.com/HajkD/biomartr
BugReports: https://github.com/HajkD/biomartr/issues
RoxygenNote: 5.0.1
Encoding: UTF-8
  • URL for the package (the development repository, not a stylized html page):

https://github.com/HajkD/biomartr

  • Who is the target audience?

Life scientists working with genomic data.

  • Are there other R packages that accomplish the same thing? If so, what is different about yours?

The R packages seqinr and biomaRt were previous attempts to provide parts of the biomartr functionality. The seqinr package aims to automate sequence retrieval in R but is not designed for meta-genomic approaches and does not include functional annotation. The biomaRt package aims to provide functional annotation methods but these are also not designed for meta-genomic approaches and are not easy to use for non-programming experts. The major advantage of biomartr is that it does not require profound programming expertise and vastly extends the functionality of both packages: seqinr and biomaRt. It is optimized to handle large-scale genomic and meta-genomic data using simple and straightforward commands that are useful for life scientists. The functionality provided by biomartr aims to promote computational reproducibility among life scientists without profound programming expertise. This is not achieved by the packages seqinr and biomaRt which are designed for bioinformaticians and other computationally trained researchers.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration with Travis CI and/or another service.

Publication options

  • Do you intend for this package to go on CRAN? (is already on CRAN)
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package contains a paper.md with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

Is the first submission.

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

None declared.

@sckott
Copy link
Contributor

sckott commented Jan 27, 2017

Thanks for your submission @HajkD - discussing and will get back soon

@sckott
Copy link
Contributor

sckott commented Jan 28, 2017

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): NA
  • Version (JOSS only, may be post-review): NA

Editor comments

Thanks for your submission!

Currently seeking reviewers.

Reviewers: @naupaka @grimbough
Due date: 2017-03-13

@sckott
Copy link
Contributor

sckott commented Feb 16, 2017

moving ahead with @naupaka as reviewer - haven't been able to get anyone else

@HajkD
Copy link
Member Author

HajkD commented Feb 16, 2017

Many thanks! I am very much looking forward to receiving feedback.

@maelle
Copy link
Member

maelle commented Feb 20, 2017

Added @grimbough as second reviewer, thanks @grimbough !

As a reminder here are links to the recently updated reviewing and packaging guides and to the review template

@sckott
Copy link
Contributor

sckott commented Mar 8, 2017

@naupaka @grimbough Please get your reviews in soon - due on 13 March - and thanks for reviewing ! 😸

@grimbough
Copy link

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 8 hours


Review Comments

Before I start my review, I want to state that although there is no conflict of interest and I have never worked on this package, I am a contributer to the biomaRt package, and so my review may be skewed towards functionality that overlaps with that.

The biomartr package is essentially a 'meta-package' and tries to present a consistent interface to a variety of online resources that provide to genomic data. It does this either by accessing their APIs directly, or by providing a wrapper around existing R packages like biomaRt.

I actually find the package name a little misleading, since this does way more than just provide an R interface to the BioMart API. Perhaps that was the original intention for the package, but the present functionality exceeds this. It's probably a bit late to change this though given there are existing publications referring to it.

One other small semantic point, which you see a lot, is that this package provides access to 'Ensembl Biomart', rather than accessing Biomart services in general. There are in fact many other databases that use Biomart as a way to query their data e.g. SalmonDB or Pancreas Expression Database, which this package currently doesn't provide access to. This is particularly true since the centralised access via Biomart.org no longer exists, and it might be worth clarifying this in the README and Introductory vignette.

Build/Install

I found the installation to work fine. There are clear instructions in the README file, which detail how to install both the stable version in CRAN, and the developmental version from Github. This review is conducted on the version labeled 0.4.0 from Github.

The author might be interested to know that you can use biocLite() to install from CRAN and github as well as Bioconductor, and it will sort of the dependencies from the various repos. So the installation of the CRAN version can be simplified to:

source("http://bioconductor.org/biocLite.R")
biocLite('biomartr')

and similarly for github

source("http://bioconductor.org/biocLite.R")
biocLite("HajkD/biomartr")

One other thing to note is that the importing of functions from biomaRt and Biostrings is never defined explicitly. They are mentioned in the installation instructions and their functions are accessed using the double colon operator e.g. Biostrings::readBStringSet(). However they are not declared in the NAMESPACE. Including the relevant functions as imports would be desireable. Otherwise, checking for their existence with useNamespace() and printing a helpful message if they aren't present may also work.

Examples and tests

Most functions have examples, and those that I tested manually worked. However, they are all set to not run with DONTRUN flags, so it is difficult to run them in an automated fashion. This also means they aren't run when the package is built and check, so there is a chance that some no longer execute successfully. I would certainly encourage the package maintainers to make it that more of the examples are run, although I appreciate that they probably take a long time to execute given almost all are retrieving online data sets. Running the set of tests that accompany the package to over 30 minutes for me.

When running devtools::test() I get three failures, however in each case when I manually ran the code I got the expected output rather than an error. I haven't had a chance to explore this further, but I have include the failure messages below for completeness.

  1. Failure: The download.database() throws error when wrong input database is specified.. (@test-download.database.R#9) ----
    error$message does not match "The specified database 'nl' could not be found on NCBI. Please use the listDatabases() command to retrieve available databases or check if the name was written correctly.".
    Actual value: "The specified database 'nl' could not be found on NCBI. Please use the listDatabases() command to retrieve available databases or check if the name was written correctly."

  2. Error: The getGFF() interface works properly.. (@test-getGFF.R#31) ------------------------------------------------------
    object 'ensembl.available.organisms' not found
    1: getGFF(db = "ensembl", organism = "Saccharomyces cerevisiae", path = file.path("_ncbi_downloads", "annotation")) at /home/msmith/projects/rOpenSci/biomartr/tests/testthat/test-getGFF.R:31
    2: getENSEMBL.Annotation(organism, type = "dna", id.type = "toplevel", path) at /home/msmith/projects/rOpenSci/biomartr/R/getGFF.R:173
    3: is.element(stringr::str_to_lower(new.organism), ensembl.available.organisms$name) at /home/msmith/projects/rOpenSci/biomartr/R/getENSEMBL.Annotation.R:58
    4: match(el, set, 0L)

  3. Failure: The getGFF() error messages work properly.. (@test-getGFF.R#41) ------------------------------------------------
    error$message does not match "Unfortunately organism 'Saccharomyces cerevisi' is not available at ENSEMBL. Please check whether or not the organism name is typed correctly.".
    Actual value: "object 'ensembl.available.organisms' not found"

Documentation and vignettes

The vignettes are generally well written and cover a variety of use cases. The vignette naming is also based around these use cases, which I guess makes sense if you approach the package with a particular research domain in mind, but is not so helpful if you a looking for how to access a specific service. For example, if I know I want to query Ensembl BioMart (which doesn't seem unreasonable given the package name) it doesn't feel intuitive to me that I have to look in 'Evolutionary Transcriptomics' to find an example of accessing that particular resource. This would be somewhat mitigated by the suggestion to include services in the function names, so it is at least easier to find the appropriate manual page.

I would highly recommend making more of the vignette examples evaluated when the Rmarkdown documents are rendered. Over time the contents of the databases can change, and having static output in the vignettes can lead to confusion among users, when what they see when running a code block no longer matches what they see in the documentation. For example running the command listDatabases(db = "human") produces a list of files that is different from what is contained in the Database_Retrieval vignette. You can also end up in the situation where examples in a vignette no longer work at all, although having a decent set of tests in the package should reduce the chances of this happening. This is something where a repository like Bioconductor, which builds all of its packages regularly on multiple platforms, can be very helpful, since you are notified if the vignette no longer executes.

Does the package comply with the ROpenSci packaging guide?

Function/variable naming & general syntax

In general the code is pretty easy to read, with sensible variable names and neatly structured code. However, the function naming within the package seems to be inconsistent. There are examples of camelCaps, snake_case and period.separated. Personally I would avoid using period.separated since there is the potential to conflict with the S3 method dispatch system, but you can find at least one style guide to support each. However, it would be nice to be consistent within the package.

I also think from a user view point that some function names could perhaps be more explicit about the resource they are accessing. e.g. listDatabases() only accesses the NCBI, but all of the services this package can connect to are ultimately databases. Similarly getDatasets() is BioMart specific. This is explained in the manual pages, but it might make it more intuitive to users to have the service in the function name so you can more easily pick out the functions to use together. I feel this is particularly true when you consider functions like getGenome() actually do have the functionality to access multiple resources. This is addressed a bit by grouping related functions together in the README.

There are also a few examples of accessing slots in S4 objects using @, which would probably be better served using the appropriate accessor functions.

Documentation style

ROxygen is used throughout to document the functions in the same file as the code itself. When viewing this as as a programmer (rather than a user) this makes understanding the intention of the function and its arguments much easier than using the traditional way of writing .Rd files manually. I am a definite fan!

Console messages

There are quite a few instances of cat() and print() scattered throughout the code to write messages to the user e.g. in getGenome.R, getGFF.R etc. I would recommend changing these to message() so they can be suppressed if desired.

The cat() usage is often in conjunction with sink() and setwd(). I am more in favour of writeLines() over sink()&cat() but I that's personal preference rather than anything concrete. However, I would avoid using setwd() if possible, since this changes things in the global environment and if your function fails for any reason the working directory has now changed for the user. I would either pass the path to the file writing function, or if you really want to change directory, use on.exit(setwd("my/original/dir")) to ensure it is changed back however the current function exits.

Is there code duplication in the package that should be reduced?

There are a few instances where variables such as file names are created multiple times throughout the same function. For example file.path(tempdir(), "_ncbi_downloads", "listDatabases.txt") appears six times in listDatabases.R. It's only a very minor detail, but having it defined only once may save hassle in the future if you ever decide to change that particular file name.

getGenome() has a degree of code replication between the paths followed when db = "ensemblgenomes" and db = "ensembl". I haven't thought about it too much, but I wonder if this function could be modularised with two sub-functions, one of which handles refseq and genbank queries, and one which handles the two ensembl queries?

Similarly there is a lot of overlap between getENSEMBL.Seq.R and getENSEMBLGENOMES.Seq.R. I wonder if these could be combined, and the requested host used to select the few lines that are different?

Recommended scaffolding

HTTP requests and processing are a core component of this package. These are done using the httr and jsonlite packages as recommended in the rOpenSci guidelines. XML is used rather than xml2, but I personally think the choice of packages used internally is down to the developer, so I don't have any issue with this.

@sckott
Copy link
Contributor

sckott commented Mar 9, 2017

thanks very much for your review @grimbough !

@naupaka
Copy link
Member

naupaka commented Mar 14, 2017

Hi @sckott @HajkD I'm still working my way through all the code — it's going to take me a few more days to get through the review. Sorry for the delay.

@sckott
Copy link
Contributor

sckott commented Mar 14, 2017

Okay, thanks for the update @naupaka

@naupaka
Copy link
Member

naupaka commented Mar 19, 2017

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION
Paper (for packages co-submitting to JOSS)

The package contains a paper.md with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 12 hours


Review Comments

I should say right up here at the front that this is my first review for ROpenSci, so it probably took me a bit longer than others might have to get up to speed. It's also a relatively large package in terms of the number of functions it exposes to the user, and it took me a while to figure out how the parts fit together. I'm sure I missed things, but hopefully having me muddle through the codebase will provide some opportunities to smooth a few rough edges.

The goal of the package, as I understand it, is to provide a higher-level interface for programmatically downloading sequence or other datasets from major repositories. Overall, I think the package fills a useful niche and I plan to use it for my own research needs in the future. I hope that the commentary below is helpful.

I downloaded and read the Bioinformatics paper that describes the package. I found it helpful for understanding the goals of the package, and how it differs from biomaRt, which was something I was initially confused about.

I think it would be nice to clarify in the README and elsewhere that the queries work, for the most part (as far as I understand it), against NCBI or EMSEMBL, and not from any of the many other sources that have a BioMart interface (as @grimbough pointed out). For example, I am interested in getting data from the BioMart interface to JGI's Phytozome, since they have a 3.0 version of the Populus trichocarpa genome (the NCBI version is still at 2.0), but that is not possible, I believe, with this package (feature request 😃).

Another thing that I didn't see, but I think would be useful, would be to modify listGenomes() to return a data frame where the first column is is the species' Latin binomial and the second column is which database each species' genome is associated with. Otherwise it's tricky to figure out which to set for the db = parameter when using getGenome().

Minor comments and other issues

README

No Travis badge in README.

Documentation

While all exported functions have roxygen-generated documentation, some internal functions have no comments or documentation, e.g. connected.to.internet.R, exists.ftp.file.R, or get.emsembl.info.R. You could add some short roxygen comments and add #' @noRd as recommended in the ROpenSci package guidelines, or just have some non-roxygen style comments in there in case others would like to better understand what's going on.

I also agree with @grimbough that it would be ideal if at least some of the example code in the documentation could actually be run. The output from devtools::run_examples() suggests that there isn't any code in any of the examples for any of the functions that is actually run. Obviously, it wouldn't make sense to run download.database.all(name = "nr", path = "nr") because that could take days, but surely there are some queries that would be fast enough to run in a reasonable timeframe?

Code

Function/variable naming & general syntax
  • I was a bit confused by which functions did what, in terms of which database(s) they query, and where exactly datasets were coming from. For each download, this information is all very helpfully documented in the doc_*.txt files in the _ncbi_downloads directory, but I felt like it was a bit of a hunt to figure out what had happened. Maybe this is just because I am still not 100% clear on which databases this package interfaces with. This is related to my earlier comment about being confused about how to figure out which db to get a genome from if you don't know ahead of time.
Suggestions
  • ROpenSci's packaging guide suggests the use of message() and/or warning() instead of print() (e.g. lines 144 and 291 of getProteome.R).

  • Suggestions from the goodpractice package:

── GP biomartr ──────────────────────────────────────────────────────────────────────────

It is good practice towrite unit tests for all functions, and all package code in general. 0% of
    code lines are covered by test cases.

    R/biomart.R:49:NA
    R/biomart.R:50:NA
    R/biomart.R:51:NA
    R/biomart.R:53:NA
    R/biomart.R:54:NA
    ... and 4297 more lineswrite short and simple functions. These functions have high cyclomatic
    complexity:listGenomes (78), meta.retrieval (63).avoid long code lines, it is bad for readability. Also, many people prefer
    editor windows that are about 80 characters wide. Try make your lines
    shorter than 80 characters

    R/biomart.R:71:1
    R/biomart.R:72:1
    R/connected.to.internet.R:5:1
    R/download_database.R:38:1
    R/download_database.R:52:1
    ... and 434 more linesavoid sapply(), it is not type safe. It might return a vector, or a list,
    depending on the input data. Consider using vapply() instead.

    R/download_database_all.R:21:5
    R/download_database.R:31:35
    R/getMetaGenomeAnnotations.R:96:31
    R/getMetaGenomes.R:96:31
    R/listDatabases.R:78:29
    ... and 11 more linesavoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...)
    expressions. They are error prone and result 1:0 if the expression on the
    right hand side is zero. Use seq_len() or seq_along() instead.

    R/getAttributes.R:99:36
    R/getDatasets.R:55:39
    R/getFilters.R:99:40
    R/organismAttributes.R:79:39
    R/organismFilters.R:75:43
    ... and 2 more linesnot import packages as a whole, as this can cause name clashes between the
    imported packages. Instead, import only the specific functions you need.fix this R CMD check NOTE: Namespace in Imports field not imported from:stringiAll declared Imports should be used.

Tests

A number of the tests fail for me when run with either R CMD CHECK or devtools::check(). Not sure if it is appropriate to put all that output here, but I figured maybe it would be helpful for diagnosing the problem. I think these errors are related to the occasional 'weird server error' messages I was getting.

==> devtools::check(document = FALSE)

Setting env vars ---------------------------------------------------------------
CFLAGS  : -Wall -pedantic
CXXFLAGS: -Wall -pedantic
Building biomartr --------------------------------------------------------------
'/usr/local/Cellar/r/3.3.3/R.framework/Resources/bin/R' --no-site-file  \
  --no-environ --no-save --no-restore --quiet CMD build  \
  '/Users/naupaka/git/ROpenSci/biomartr' --no-resave-data --no-manual 

* checking for file/Users/naupaka/git/ROpenSci/biomartr/DESCRIPTION... OK
* preparingbiomartr:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* buildingbiomartr_0.4.0.tar.gzSetting env vars ---------------------------------------------------------------
_R_CHECK_CRAN_INCOMING_USE_ASPELL_: TRUE
_R_CHECK_CRAN_INCOMING_           : FALSE
_R_CHECK_FORCE_SUGGESTS_          : FALSE
Checking biomartr --------------------------------------------------------------
'/usr/local/Cellar/r/3.3.3/R.framework/Resources/bin/R' --no-site-file  \
  --no-environ --no-save --no-restore --quiet CMD check  \
  '/var/folders/1y/8s_b2fx520sgj5f3j9fjwg880000gn/T//RtmpyLUWwj/biomartr_0.4.0.tar.gz'  \
  --as-cran --timings --no-manual 

* using log directory/Users/naupaka/git/ROpenSci/biomartr.Rcheck* using R version 3.3.3 (2017-03-06)
* using platform: x86_64-apple-darwin16.4.0 (64-bit)
* using session charset: UTF-8
* using options--no-manual --as-cran* checking for filebiomartr/DESCRIPTION... OK
* this is packagebiomartrversion0.4.0* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether packagebiomartrcan be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checkingbuilddirectory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
 WARNINGqpdfis needed for checks on size reduction of PDFs
* checking installed files frominst/doc... OK
* checking files invignettes... OK
* checking examples ... OK
* checking for unstated dependencies intests... OK
* checking tests ...
  Runningtestthat.R’ [147s/511s]
 ERROR
Running the tests intests/testthat.Rfailed.
Last 13 lines of output:
  4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
  5: value[[3L]](cond)
  6: stop(paste0("File ", file, " could not be read properly. \n", "Please make sure that ", 
         file, " contains only amino acid sequences and is in ", format, " format."), 
         call. = FALSE)
  
  testthat results ================================================================
  OK: 35 SKIPPED: 0 FAILED: 4
  1. Failure: The download.database() throws error when wrong input database is specified.. (@test-download.database.R#9) 
  2. Error: The getGFF() interface works properly.. (@test-getGFF.R#31) 
  3. Failure: The getGFF() error messages work properly.. (@test-getGFF.R#41) 
  4. Error: The getProteome() interface works properly.. (@test-getProteome.R#24) 
  
  Error: testthat unit tests failed
  Execution halted
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes ininst/doc... OK
* checking re-building of vignette outputs ... OK
* DONE
Status: 1 ERROR, 1 WARNING

See/Users/naupaka/git/ROpenSci/biomartr.Rcheck/00check.logfor details.

R CMD check results
1 error  | 0 warnings | 0 notes
checking tests ... ERROR
  Runningtestthat.R’ [147s/511s]
Running the tests intests/testthat.Rfailed.
Last 13 lines of output:
  4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
  5: value[[3L]](cond)
  6: stop(paste0("File ", file, " could not be read properly. \n", "Please make sure that ", 
         file, " contains only amino acid sequences and is in ", format, " format."), 
         call. = FALSE)
  
  testthat results ================================================================
  OK: 35 SKIPPED: 0 FAILED: 4
  1. Failure: The download.database() throws error when wrong input database is specified.. (@test-download.database.R#9) 
  2. Error: The getGFF() interface works properly.. (@test-getGFF.R#31) 
  3. Failure: The getGFF() error messages work properly.. (@test-getGFF.R#41) 
  4. Error: The getProteome() interface works properly.. (@test-getProteome.R#24) 
  
  Error: testthat unit tests failed
  Execution halted

R CMD check succeeded

Minor documentation issues

Looks like there is a non-ASCII character in the copied tibble output in one of the verbatim blocks in the Functional Annotation vignette (non-executing).

==> devtools::document(roclets=c('rd', 'collate', 'namespace', 'vignette'))

Updating biomartr documentation
Loading biomartr
Updating vignettes
Rebuilding BioMart_Examples.Rmd
Rebuilding Database_Retrieval.Rmd
Rebuilding Functional_Annotation.Rmd
Error in tools::buildVignette(vign, dirname(vign), tangle = FALSE, clean = FALSE) : 
  Vignette 'Functional_Annotation' is non-ASCII but has no declared encoding
Calls: suppressPackageStartupMessages ... roclet_output.roclet_vignette -> vign_update_all -> vapply -> FUN -> <Anonymous>
Execution halted

Exited with status 1.
> tools::showNonASCII(readLines("vignettes/Functional_Annotation.Rmd")) 
# 566: A tibble: 6 <c3><97> 4

Repeated "No encoding supplied" warnings

> organismBM(organism = "Homo sapiens")
No encoding supplied: defaulting to UTF-8.
No encoding supplied: defaulting to UTF-8.
No encoding supplied: defaulting to UTF-8.
No encoding supplied: defaulting to UTF-8.
# A tibble: 12 × 5
   organism_name
           <chr>
1       hsapiens
2       hsapiens
3       hsapiens
4       hsapiens
5       hsapiens
6       hsapiens
7       hsapiens
8       hsapiens
9       hsapiens
10      hsapiens
11      hsapiens
12      hsapiens
# ... with 4 more variables: description <chr>, mart <chr>, dataset <chr>,
#   version <chr>

Need line break before line 33 in Introduction.Rmd, or markdown header doesn't render properly.

Error when downloading assembly stats

Sometimes when I would try to run a command, I'd get errors like this. I was connected to the Internet at the time, and similar commands run before and afterwards worked fine. Sometimes the same command, run a second time, worked fine even though it failed at first. Maybe something about the timeout before waiting for a server response could be tweaked to reduce the likelihood of premature failure?

> meta.retrieval(kingdom = "plant", 
               db      = "refseq", 
               type    = "assemblystats", 
               combine = TRUE)
Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/' cannot be reached. Are you connected to the internet? Is the the FTP site 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/190/045/GCF_001190045.1_Vigan1.1/GCF_001190045.1_Vigan1.1_assembly_stats.txt' currently available?
In addition: Warning message:
In download.file(url, ...) :
  URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/190/045/GCF_001190045.1_Vigan1.1/GCF_001190045.1_Vigan1.1_assembly_stats.txt': status was 'Weird server reply'

Session Info

Tested with commit 2570b0d3296 on master branch from https://github.com/HajkD/biomartr

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin16.4.0 (64-bit)
Running under: macOS Sierra 10.12.3

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomartr_0.4.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9          IRanges_2.8.1        XML_3.98-1.5         Biostrings_2.42.1    digest_0.6.12       
 [6] bitops_1.0-6         DBI_0.6              stats4_3.3.3         RSQLite_1.1-2        zlibbioc_1.20.0     
[11] curl_2.3             XVector_0.14.0       S4Vectors_0.12.1     tools_3.3.3          Biobase_2.34.0      
[16] biomaRt_2.30.0       RCurl_1.95-4.8       parallel_3.3.3       BiocGenerics_0.20.0  AnnotationDbi_1.36.2
[21] memoise_1.0.0      

@HajkD
Copy link
Member Author

HajkD commented Mar 21, 2017

Dear @sckott,

Thank you very much for retrieving two valuable and in-depth reviews from @grimbough and @naupaka for the biomartr package.

I am very happy to see that both reviewers appreciate the functionality of biomartr. The weaknesses and potential improvements of the package that were pointed out by both reviewers are very constructive and helped me a lot to spot existing misunderstandings and problems.

I already started to incorporate the suggestions made by @grimbough and will now start to address all issues raised by @naupaka.

I highly appreciate the time the reviewers took to generate an in-depth code review and I will now work on a detailed response letter.

Kind regards,
Hajk

@HajkD
Copy link
Member Author

HajkD commented Apr 2, 2017

Response to Package Reviews

Dear @sckott,

I am very grateful for receiving two valuable reviews from @grimbough and @naupaka.
In the following response to the reviewers, I tried to address all comments and
issues and incorporated all changes into the new version of the biomartr package (see NEWS).

I hope that I could address all points sufficiently so that
you can consider biomartr to become a part of the rOpenSci package collection.

I would like to thank you and both reviewers for your detailed and constructive
comments. It was a pleasure to work with you and I am looking forward
to collaborating with the rOpenSci community in the future.

Kind regards,
Hajk

Response to reviewer 1 (@grimbough)

General comments

I actually find the package name a little misleading, since this does way more than just provide an R interface to the BioMart API. Perhaps that was the original intention for the package, but the present functionality exceeds this. It's probably a bit late to change this though given there are existing publications referring to it.

Response: I agree. Initially, I focused on the BioMart database, but then it closed its service and the NCBI and ENSEMBL databases took over parts of its role and diversified their services as well and I started to extend the functionality of biomartr. Another point was the review process of the biomartr package in the journal Bioinformatics. There, it was requested during the review process that the functionality of the biomartr package should be extended to retrieve genomes, proteomes, etc. from NCBI and ENSEMBL databases. I apologize that all this caused a misleading naming of the package, but I hope that the detailed documentation will make users aware of this shortcoming.

One other small semantic point, which you see a lot, is that this package provides access to 'Ensembl Biomart', rather than accessing Biomart services in general. There are in fact many other databases that use Biomart as a way to query their data e.g. SalmonDB or Pancreas Expression Database, which this package currently doesn't provide access to. This is particularly true since the centralised access via Biomart.org no longer exists, and it might be worth clarifying this in the README and Introductory vignette.

Response: Many thanks for pointing this out to me. I now made this point clear in the introductory vignette.

Build/Install

The author might be interested to know that you can use biocLite() to install from CRAN and github as well as Bioconductor, and it will sort of the dependencies from the various repos. So the installation of the CRAN version can be simplified.

Response: This is a great suggestion. I now use the biocLite() notation in the README.

One other thing to note is that the importing of functions from biomaRt and Biostrings is never defined explicitly. They are mentioned in the installation instructions and their functions are accessed using the double colon operator e.g. Biostrings::readBStringSet(). However they are not declared in the NAMESPACE. Including the relevant functions as imports would be desireable. Otherwise, checking for their existence with useNamespace() and printing a helpful message if they aren't present may also work.

Response: I now explicitly import biomaRt and Biostrings in the NAMESPACE.

Most functions have examples, and those that I tested manually worked. However, they are all set to not run with DONTRUN flags, so it is difficult to run them in an automated fashion. This also means they aren't run when the package is built and check, so there is a chance that some no longer execute successfully. I would certainly encourage the package maintainers to make it that more of the examples are run, although I appreciate that they probably take a long time to execute given almost all are retrieving online data sets. Running the set of tests that accompany the package to over 30 minutes for me.

Response: Thank you so much for pointing this out to me, but to be honest, I am not sure how to run these download commands while following the CRAN policy that states that examples should terminate within 5 sec. In my initial CRAN submission, I didn't have DONTRUNS wrapped around examples and was asked by the CRAN maintainers to do so to comply with the 5 sec example rule.
Due to this fact, I started using Travis to automatically check the functionality of all functions whenever I update the package. However, I am very happy for any suggestion on
how to run download examples and still comply with the 5 sec CRAN policy.

When running devtools::test() I get three failures, however in each case when I manually ran the code I got the expected output rather than an error. I haven't had a chance to explore this further, but I have include the failure messages below for completeness.

Response: I also explored this peculiarity and fixed the unit tests.

Documentation and vignettes

The vignettes are generally well written and cover a variety of use cases. The vignette naming is also based around these use cases, which I guess makes sense if you approach the package with a particular research domain in mind, but is not so helpful if you a looking for how to access a specific service. For example, if I know I want to query Ensembl BioMart (which doesn't seem unreasonable given the package name) it doesn't feel intuitive to me that I have to look in 'Evolutionary Transcriptomics' to find an example of accessing that particular resource. This would be somewhat mitigated by the suggestion to include services in the function names, so it is at least easier to find the appropriate manual page.

Response: I agree and now renamed the vignette to BioMart Examples and adapted the content of this vignette.

I would highly recommend making more of the vignette examples evaluated when the Rmarkdown documents are rendered. Over time the contents of the databases can change, and having static output in the vignettes can lead to confusion among users, when what they see when running a code block no longer matches what they see in the documentation. For example running the command listDatabases(db = "human") produces a list of files that is different from what is contained in the Database_Retrieval vignette. You can also end up in the situation where examples in a vignette no longer work at all, although having a decent set of tests in the package should reduce the chances of this happening. This is something where a repository like Bioconductor, which builds all of its packages regularly on multiple platforms, can be very helpful, since you are notified if the vignette no longer executes.

Response: I agree, and now vignette examples are evaluated when the Rmarkdown documents are rendered.

Function/variable naming & general syntax

In general the code is pretty easy to read, with sensible variable names and neatly structured code. However, the function naming within the package seems to be inconsistent. There are examples of camelCaps, snake_case and period.separated. Personally I would avoid using period.separated since there is the potential to conflict with the S3 method dispatch system, but you can find at least one style guide to support each. However, it would be nice to be consistent within the package.

Response: Thanks for pointing this out to me. In the next versions of biomartr, I will consistently use the camelCaps style for new functions.

I also think from a user view point that some function names could perhaps be more explicit about the resource they are accessing. e.g. listDatabases() only accesses the NCBI, but all of the services this package can connect to are ultimately databases. Similarly getDatasets() is BioMart specific. This is explained in the manual pages, but it might make it more intuitive to users to have the service in the function name so you can more easily pick out the functions to use together. I feel this is particularly true when you consider functions like getGenome() actually do have the functionality to access multiple resources. This is addressed a bit by grouping related functions together in the README.

Response: I renamed the listDatabases() function to listNCBIDatabases() and will depreciate listDatabases() in the next version of biomartr. I hope now the query notation biomartr::getDatasets() is more clearly defined.

Console messages

There are quite a few instances of cat() and print() scattered throughout the code to write messages to the user e.g. in getGenome.R, getGFF.R etc. I would recommend changing these to message() so they can be suppressed if desired. The cat() usage is often in conjunction with sink() and setwd(). I am more in favour of writeLines() over sink()&cat() but I that's personal preference rather than anything concrete. However, I would avoid using setwd() if possible, since this changes things in the global environment and if your function fails for any reason the working directory has now changed for the user. I would either pass the path to the file writing function, or if you really want to change directory, use on.exit(setwd("my/original/dir")) to ensure it is changed back however the current function exits.

Response: I agree and now consistently use message(). I also use the file.path() instead of changing the working directory.

Is there code duplication in the package that should be reduced?

There are a few instances where variables such as file names are created multiple times throughout the same function. For example file.path(tempdir(), "_ncbi_downloads", "listDatabases.txt") appears six times in listDatabases.R. It's only a very minor detail, but having it defined only once may save hassle in the future if you ever decide to change that particular file name.

Response: I now store file.path(tempdir(), "_ncbi_downloads", "listDatabases.txt") in the variable db.file.name.

getGenome() has a degree of code replication between the paths followed when db = "ensemblgenomes" and db = "ensembl". I haven't thought about it too much, but I wonder if this function could be modularised with two sub-functions, one of which handles refseq and genbank queries, and one which handles the two ensembl queries? Similarly there is a lot of overlap between getENSEMBL.Seq.R and getENSEMBLGENOMES.Seq.R. I wonder if these could be combined, and the requested host used to select the few lines that are different?

Response: Due to the differences in the internal folder structure of the ENSEMBL and ENSEMBLGENOMES servers I had to implement special cases and thus this code redundancy was on purpose to have a better overview for code maintainence.

Response to reviewer 2 (@naupaka)

Review Comments

I think it would be nice to clarify in the README and elsewhere that the queries work, for the most part (as far as I understand it), against NCBI or EMSEMBL, and not from any of the many other sources that have a BioMart interface (as @grimbough pointed out). For example, I am interested in getting data from the BioMart interface to JGI's Phytozome, since they have a 3.0 version of the Populus trichocarpa genome (the NCBI version is still at 2.0), but that is not possible, I believe, with this package (feature request ).

Response: I agree and I changed it according to @grimbough's point. The JGI's Phytozome request is taken (ropensci/biomartr#9) and since I am working with plant genomes myself I will try my best to provide some interface functions :D .

Another thing that I didn't see, but I think would be useful, would be to modify listGenomes() to return a data frame where the first column is is the species' Latin binomial and the second column is which database each species' genome is associated with. Otherwise, it's tricky to figure out which to set for the db = parameter when using getGenome().

Response: This is a functionality extension request. I am happy to implement it for the next biomartr version and opened an extension request.

README

No Travis badge in README.

Response: Travis badge is now included in README.

While all exported functions have roxygen-generated documentation, some internal functions have no comments or documentation, e.g. connected.to.internet.R, exists.ftp.file.R, or get.emsembl.info.R. You could add some short roxygen comments and add #' @nord as recommended in the ROpenSci package guidelines, or just have some non-roxygen style comments in there in case others would like to better understand what's going on.

Response: Many thanks for pointing this out to me. I now commented all internal functions and used #' @noRd to mark that it is an internal function.

I also agree with @grimbough that it would be ideal if at least some of the example code in the documentation could actually be run. The output from devtools::run_examples() suggests that there isn't any code in any of the examples for any of the functions that is actually run. Obviously, it wouldn't make sense to run download.database.all(name = "nr", path = "nr") because that could take days, but surely there are some queries that would be fast enough to run in a reasonable timeframe?

Response: I agree, but please see my response to @grimbough concerning the 5s CRAN policy.

Code

I was a bit confused by which functions did what, in terms of which database(s) they query, and where exactly datasets were coming from. For each download, this information is all very helpfully documented in the doc_*.txt files in the _ncbi_downloads directory, but I felt like it was a bit of a hunt to figure out what had happened. Maybe this is just because I am still not 100% clear on which databases this package interfaces with. This is related to my earlier comment about being confused about how to figure out which db to get a genome from if you don't know ahead of time.

Response: As stated above, I now clearly write in the vignettes which databases biomartr is accessing and also changed the function name from listDatabases() to listNCBIDatabases().
I hope that it is less confusing now.

ROpenSci's packaging guide suggests the use of message() and/or warning() instead of print() (e.g. lines 144 and 291 of getProteome.R)

Response: I now consistently use message() and/or warning() instead of print() or cat().

Tests

A number of the tests fail for me when running with either R CMD CHECK or devtools::check(). Not sure if it is appropriate to put all that output here, but I figured maybe it would be helpful for diagnosing the problem. I think these errors are related to the occasional 'weird server error' messages I was getting.

Response: I fixed the corrupt unit tests and now devtools::check(document = FALSE) does no throw errors anymore.

Looks like there is a non-ASCII character in the copied tibble output in one of the verbatim blocks in the Functional Annotation vignette (non-executing).

Response: Thank you so much for detecting this non-ASCII character. It is fixed now.

Need line break before line 33 in Introduction.Rmd, or markdown header doesn't render properly.

Response: I included the line break.

Repeated "No encoding supplied" warnings

Response: The encoding is fixed now.

Sometimes when I would try to run a command, I'd get errors like this. I was connected to the Internet at the time, and similar commands run before and afterwards worked fine. Sometimes the same command, run a second time, worked fine even though it failed at first. Maybe something about the timeout before waiting for a server response could be tweaked to reduce the likelihood of premature failure?

Response: Thank you so much for pointing this server timeout issue out to me. I now implemented a customized download function which uses the best download tool for the underlying operating system. I hope that this issue is now resolved.

@sckott
Copy link
Contributor

sckott commented Apr 3, 2017

@HajkD thanks for your changes and detailed comments

  • For running egs, I suggest MAYBE running egs wrapped in dontrun on Travis by adding --run-dontrun to your check args (since taking a longish time on travis I assume is not a big deal) - I'd have one or two examples for each function not wrapped in dontrun when you know they will run quickly. Note: I said MAYBE since check already takes a very long time, but still would be nice to make sure egs always work
  • output of goodpractice::gp() was pointed out and there's some things in there which I think you should address:
    • sapply usage: replace with vapply/lapply
    • it's a bit of work, but I highly suggest tidying code and documentation to 80 line width so you don't get text running off the page on the CRAN manual (e.g., https://cran.rstudio.com/web/packages/biomartr/biomartr.pdf), and it's easier to see in a narrow width editor setup
    • replace usage of 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) with eg., seq_along/seq_len
  • Add a package level manual file - so that users can do ?biomartr to get to the man pages easily
  • looks like test coverage is about 47% - would be good to get that % up - not necessary now, but in the future.
  • R cmd check takes a long time, just for ease of quick turn around time, it'd be wise to make that time much less - I assume most of the time is in tests, but haven't looked into how to speed that up
  • also related to tests, but not about speed: looks like many of your test files aren't doing much testing using testthat - that is, you run examples, but then don't test their output (e.g., https://github.com/HajkD/biomartr/blob/master/tests/testthat/test-is.genome.available.R#L12-L36) I imagine the majority of time in tests right now is doing data requests - adding test expectations shouldn't add a lot more time

@HajkD
Copy link
Member Author

HajkD commented Apr 4, 2017

Dear @sckott,

Thank you so much for your suggestions. I will address all comments and will come back to you as soon as I finished.

Best wishes,
Hajk

@sckott
Copy link
Contributor

sckott commented Apr 4, 2017

thanks @HajkD

@HajkD
Copy link
Member Author

HajkD commented Apr 11, 2017

Dear @sckott,

I addressed all comments and issues and incorporated all changes into the new version of the biomartr package.

I hope that I could address all points sufficiently so that
you can consider biomartr to become a part of the rOpenSci package collection.

Please find my detailed response below.

Thank you so much for all your support.

Kind regards,

Hajk

P.S. I just want to make you aware that the Ensembl Services are currently
not available
. Thus, until the Ensembl servers is up again some of the biomartr
functions will not run properly.

For running egs, I suggest MAYBE running egs wrapped in dontrun on Travis by adding --run-dontrun to your check args (since taking a longish time on travis I assume is not a big deal) - I'd have one or two examples for each function not wrapped in dontrun when you know they will run quickly. Note: I said MAYBE since check already takes a very long time, but still would be nice to make sure egs always work

Response: I tried to solve the running time issue by choosing example test cases that
use very small files. It should be better now. Please also see my response to the test coverage and R CMD check.

output of goodpractice::gp() was pointed out and there's some things in there which I think you should address:
sapply usage: replace with vapply/lapply
it's a bit of work, but I highly suggest tidying code and documentation to 80 line width so you don't get text running off the page on the CRAN manual (e.g., https://cran.rstudio.com/web/packages/biomartr/biomartr.pdf), and it's easier to see in a narrow width editor setup
replace usage of 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) with eg., seq_along/seq_len

Response: I agree and now replaced all sapply() commands by lapply(), reformatted all code and documentation to 80 line width, and replace usage of 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) with seq_len().

Add a package level manual file - so that users can do ?biomartr to get to the man pages easily

Response: A package level manual file is now included and as suggested can be accessed via ?biomartr.

looks like test coverage is about 47% - would be good to get that % up - not necessary now, but in the future.

Response: I now spent a significant amount of time on increasing the test coverage and will continue to improve it in the future.

R cmd check takes a long time, just for ease of quick turn around time, it'd be wise to make that time much less - I assume most of the time is in tests, but haven't looked into how to speed that up

Response: Most tests cover the functionality of data retrieval functions, so to reduce R cmd check time I now choose small file downloads for unit testing.

also related to tests, but not about speed: looks like many of your test files aren't doing much testing using testthat - that is, you run examples, but then don't test their output (e.g., https://github.com/HajkD/biomartr/blob/master/tests/testthat/test-is.genome.available.R#L12-L36) I imagine the majority of time in tests right now is doing data requests - adding test expectations shouldn't add a lot more time

Response: I absolutely agree and see where your confusion comes from. I now commented and specified the unit tests more clearly, so that no confusion exists on what part of the function is actually tested.

@sckott
Copy link
Contributor

sckott commented Apr 17, 2017

@grimbough @naupaka Are you happy with changes made? Anything else?

Looking over your changes now @HajkD

@naupaka
Copy link
Member

naupaka commented Apr 17, 2017

I probably won't have a chance to take a look till later this week. I'll make another pass through then to make sure all works as expected on my machine - seems like all the major points were addressed.

@sckott
Copy link
Contributor

sckott commented Apr 19, 2017

yep, i'll rerun tests and report back

@sckott
Copy link
Contributor

sckott commented Apr 19, 2017

same error again. DO you get that test error on travis. Have you tried Appveyor? Or r-hub ?

@HajkD
Copy link
Member Author

HajkD commented Apr 19, 2017

Hm, that's strange. Yes, Travis passes. I just committed today -> https://travis-ci.org/HajkD/biomartr/builds/223596728 . I haven't tried rhub or Appveyor yet, but I can run it there as well if you like.

@sckott
Copy link
Contributor

sckott commented Apr 19, 2017

What platform are you on? windows? mac? linux?

@HajkD
Copy link
Member Author

HajkD commented Apr 19, 2017

mac

@HajkD
Copy link
Member Author

HajkD commented Apr 22, 2017

Dear @sckott

I now also ran an r-hub check and everything seems to pass: https://builder.r-hub.io/status/biomartr_0.5.0.tar.gz-e5116cc3ef554c2db801cbbaa3be1e0f .

@naupaka
Copy link
Member

naupaka commented Apr 25, 2017

@HajkD thanks for making many of the recommended changes. Most all of the things I mentioned the first time around are now fixed. I'm still having issues with getting things to work smoothly with devtools::check() though. I think perhaps it is related to the timeout problem you mentioned - but it seems to happen in the same place pretty consistently. Might there be a way to fail gracefully if it's not your fault the server is timing out and then still have the tests pass?

Unless maybe this is something insidious that snuck in because of my recent update to R 3.4.0?

My system info is below, and I was using commit ca36885e2 on master for these tests.

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin16.5.0 (64-bit)
Running under: macOS Sierra 10.12.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomartr_0.5.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10         IRanges_2.9.19       XML_3.98-1.6         Biostrings_2.43.8    digest_0.6.12       
 [6] bitops_1.0-6         DBI_0.6-1            stats4_3.4.0         RSQLite_1.1-2        zlibbioc_1.21.0     
[11] curl_2.5             XVector_0.15.2       S4Vectors_0.13.17    tools_3.4.0          Biobase_2.35.1      
[16] biomaRt_2.31.10      RCurl_1.95-4.8       parallel_3.4.0       compiler_3.4.0       BiocGenerics_0.21.3 
[21] AnnotationDbi_1.37.4 memoise_1.1.0   

Output from devtools::check():

devtools::check(document = FALSE)

Setting env vars ---------------------------------------------------------------
CFLAGS  : -Wall -pedantic
CXXFLAGS: -Wall -pedantic
Building biomartr --------------------------------------------------------------
'/usr/local/Cellar/r/3.4.0/R.framework/Resources/bin/R' --no-site-file  \
  --no-environ --no-save --no-restore --quiet CMD build  \
  '/Users/naupaka/git/ROpenSci/biomartr' --no-resave-data --no-manual 

* checking for file/Users/naupaka/git/ROpenSci/biomartr/DESCRIPTION... OK
* preparingbiomartr:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* buildingbiomartr_0.5.0.tar.gzSetting env vars ---------------------------------------------------------------
_R_CHECK_CRAN_INCOMING_USE_ASPELL_: TRUE
_R_CHECK_CRAN_INCOMING_           : FALSE
_R_CHECK_FORCE_SUGGESTS_          : FALSE
Checking biomartr --------------------------------------------------------------
'/usr/local/Cellar/r/3.4.0/R.framework/Resources/bin/R' --no-site-file  \
  --no-environ --no-save --no-restore --quiet CMD check  \
  '/var/folders/1y/8s_b2fx520sgj5f3j9fjwg880000gn/T//Rtmpvxn83G/biomartr_0.5.0.tar.gz'  \
  --as-cran --timings --no-manual 

* using log directory/Users/naupaka/git/ROpenSci/biomartr.Rcheck* using R version 3.4.0 (2017-04-21)
* using platform: x86_64-apple-darwin16.5.0 (64-bit)
* using session charset: UTF-8
* using options--no-manual --as-cran* checking for filebiomartr/DESCRIPTION... OK
* this is packagebiomartrversion0.5.0* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether packagebiomartrcan be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checkingbuilddirectory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... NOTE
Non-standard file/directory found at top level:biomartr_0.5.0.tar.gz* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking installed files frominst/doc... OK
* checking files invignettes... OK
* checking examples ... ERROR
Running examples inbiomartr-Ex.Rfailed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: organismAttributes
> ### Title: Retrieve Ensembl Biomart attributes for a query organism
> ### Aliases: organismAttributes
> 
> ### ** Examples
> 
> # search for attribute topic id
> head(organismAttributes("Homo sapiens", topic = "id"), 20)
Warning: No attributes were available for mart = ENSEMBL_MART_SEQUENCE and dataset = hsapiens_genomic_sequence.
Warning: No attributes were available for mart = ENSEMBL_MART_SEQUENCE and dataset = hsapiens_genomic_sequence.
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached
Calls: head ... request_fetch -> request_fetch.write_memory -> <Anonymous> -> .Call
Execution halted
* checking for unstated dependencies intests... OK
* checking tests ...
  Runningtestthat.R’ [2m/19m]
 ERROR
Running the tests intests/testthat.Rfailed.
Last 13 lines of output:
  trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/metagenomes/assembly_summary.txt'
  Content type 'unknown' length 333931 bytes (326 KB)
  ==================================================
  testthat results ================================================================
  OK: 48 SKIPPED: 0 FAILED: 7
  1. Error: The biomart() interface works properly.. (@test-biomart.R#10) 
  2. Error: The getAttributes() interface works properly.. (@test-getAttributes.R#13) 
  3. Error: The getCDS() interface to Ensembl works properly.. (@test-getCDS.R#26) 
  4. Error: The getCDS() interface to EnsemblGenomes works properly.. (@test-getCDS.R#37) 
  5. Error: The getCDS() error messages work properly.. (@test-getCDS.R#47) 
  6. Error: The getDatasets() interface works properly.. (@test-getDatasets.R#13) 
  7. Error: The getGO() interface works properly.. (@test-getGO.R#12) 
  
  Error: testthat unit tests failed
  Execution halted
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes ininst/doc... OK
* checking re-building of vignette outputs ... WARNING
Error in re-building vignettes:
  ...
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt'
Content type 'unknown' length 79196 bytes (77 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/assembly_summary.txt'
Content type 'unknown' length 43167 bytes (42 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/assembly_summary.txt'
Content type 'unknown' length 26815 bytes (26 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/protozoa/assembly_summary.txt'
Content type 'unknown' length 24012 bytes (23 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt'
Content type 'unknown' length 32978 bytes (32 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/assembly_summary.txt'
Content type 'unknown' length 37450 bytes (36 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt'
Content type 'unknown' length 1748600 bytes (1.7 MB)
==================================================
Quitting from lines 371-379 (Functional_Annotation.Rmd) 
Error: processing vignette 'Functional_Annotation.Rmd' failed with diagnostics:
Request to BioMart web service failed. Verify if you are still connected to the internet.  Alternatively the BioMart web service is temporarily down.
Execution halted

* DONE
Status: 2 ERRORs, 1 WARNING, 1 NOTE

See/Users/naupaka/git/ROpenSci/biomartr.Rcheck/00check.logfor details.

checking examples ... ERROR
Running examples inbiomartr-Ex.Rfailed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: organismAttributes
> ### Title: Retrieve Ensembl Biomart attributes for a query organism
> ### Aliases: organismAttributes
> 
> ### ** Examples
> 
> # search for attribute topic id
> head(organismAttributes("Homo sapiens", topic = "id"), 20)
Warning: No attributes were available for mart = ENSEMBL_MART_SEQUENCE and dataset = hsapiens_genomic_sequence.
Warning: No attributes were available for mart = ENSEMBL_MART_SEQUENCE and dataset = hsapiens_genomic_sequence.
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached
Calls: head ... request_fetch -> request_fetch.write_memory -> <Anonymous> -> .Call
Execution halted

checking tests ... ERROR
  Runningtestthat.R’ [2m/19m]
Running the tests intests/testthat.Rfailed.
Last 13 lines of output:
  trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/metagenomes/assembly_summary.txt'
  Content type 'unknown' length 333931 bytes (326 KB)
  ==================================================
  testthat results ================================================================
  OK: 48 SKIPPED: 0 FAILED: 7
  1. Error: The biomart() interface works properly.. (@test-biomart.R#10) 
  2. Error: The getAttributes() interface works properly.. (@test-getAttributes.R#13) 
  3. Error: The getCDS() interface to Ensembl works properly.. (@test-getCDS.R#26) 
  4. Error: The getCDS() interface to EnsemblGenomes works properly.. (@test-getCDS.R#37) 
  5. Error: The getCDS() error messages work properly.. (@test-getCDS.R#47) 
  6. Error: The getDatasets() interface works properly.. (@test-getDatasets.R#13) 
  7. Error: The getGO() interface works properly.. (@test-getGO.R#12) 
  
  Error: testthat unit tests failed
  Execution halted

checking re-building of vignette outputs ... WARNING
Error in re-building vignettes:
  ...
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt'
Content type 'unknown' length 79196 bytes (77 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/assembly_summary.txt'
Content type 'unknown' length 43167 bytes (42 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/assembly_summary.txt'
... 8 lines ...
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/assembly_summary.txt'
Content type 'unknown' length 37450 bytes (36 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt'
Content type 'unknown' length 1748600 bytes (1.7 MB)
==================================================
Quitting from lines 371-379 (Functional_Annotation.Rmd) 
Error: processing vignette 'Functional_Annotation.Rmd' failed with diagnostics:
Request to BioMart web service failed. Verify if you are still connected to the internet.  Alternatively the BioMart web service is temporarily down.
Execution halted

checking top-level files ... NOTE
Non-standard file/directory found at top level:biomartr_0.5.0.tar.gzR CMD check results
2 errors | 1 warning  | 1 note 

R CMD check succeeded

@sckott
Copy link
Contributor

sckott commented Apr 25, 2017

thanks for the input @naupaka

@HajkD
Copy link
Member Author

HajkD commented Apr 25, 2017

Dear @naupaka

Thank you so much for pointing this shortcoming out to me.
I will have a detailed look at it and will come back to you soon.

Many thanks!
Hajk

@sckott
Copy link
Contributor

sckott commented Jun 7, 2017

@HajkD Any progress on this? Any questions for us?

@HajkD
Copy link
Member Author

HajkD commented Jun 7, 2017

Hi @sckott,

Please accept my apologies for my late response. I was caught up in another project that required my immediate attention.

Nevertheless, I worked on the remaining biomartr issues and released the new version 0.5.1 on CRAN.

While digging deeper into the issue I noticed that the NCBI and ENSEMBL databases don't provide very stable APIs. Servers are frequently maintained and sometimes down for several days. Hence, it is impossible to resolve this issue from my side. However, I now implemented warnings that will occur when connections are not stable or whenever the server is down.

I am very happy for any suggestions on how I can improve the ftp connection via R to the NCBI and ENSEMBL servers.

In future versions, I also plan to implement https and rsync access to the NCBI and ENSEMBL servers, but these options will exclude Windows users. So the initial ftp issues will always exist for Windows users.

I now run rhub::check(platform = rhub::platforms()$name, valgrind = TRUE) whenever I check biomartr and don't receive errors.

When running devtools::check() on my machine, I receive the following output:

* using log directory/private/var/folders/3x/6bbw6ds1039gpwny1m0hn8r80000gp/T/RtmpEQJf3b/biomartr.Rcheck* using R version 3.4.0 (2017-04-21)
* using platform: x86_64-apple-darwin15.6.0 (64-bit)
* using session charset: UTF-8
* using options--no-manual --as-cran* checking for filebiomartr/DESCRIPTION... OK
* this is packagebiomartrversion0.5.9000* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether packagebiomartrcan be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checkingbuilddirectory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking installed files frominst/doc... OK
* checking files invignettes... OK
* checking examples ... OK
* checking for unstated dependencies intests... OK
* checking tests ...
  Runningtestthat.R’ [95s/460s]
 OK
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes ininst/doc... OK
* checking re-building of vignette outputs ... OK
* DONE

Status: OK

R CMD check results
0 errors | 0 warnings | 0 notes

I would be truly grateful if you could test again on your system if the check runs smoothly now.

I hope that you can now consider biomartr to be integrated into the rOpenSci package collection and I am happy to continue working on it and improving it once the onboarding process was successful.

I would like to thank all of you for your time a detailed code review. It helped me a lot to significantly improve the biomartr functionality.

Kind regards,
Hajk

@sckott
Copy link
Contributor

sckott commented Jun 7, 2017

@HajkD Looks good to me, checked locally and all seems fine.

approved!

@sckott
Copy link
Contributor

sckott commented Jun 7, 2017

Thanks very much for your submission. Remaining steps:

  • I've added you to an biomartr team for the rOpenSci organization, and invited you to our org, you should get a notification. Accept and transfer the repo (under "Settings") to rOpenSci, and then update any links for CI, coverage, and documentation.
  • Let me know if you are interested in doing a blog post about your package on the rOpenSci blog, either long-form / for more general audience (https://ropensci.org/blog/) or short form / more technical (https://ropensci.org/tech-notes/)

@HajkD
Copy link
Member Author

HajkD commented Jun 7, 2017

Dear @sckott

I am very grateful for your approval and I very much look forward to collaborating with you and the rOpenSci community.

I already transferred the package and would be very interested in a blog post for the more general audience (https://ropensci.org/blog/).

Thank you so much for coordinating everything. I truly appreciate it.

Kind regards,
Hajk

@sckott
Copy link
Contributor

sckott commented Jun 7, 2017

Great!

@sckott sckott closed this as completed Jun 7, 2017
@sckott
Copy link
Contributor

sckott commented Jun 19, 2017

@HajkD let me know if you have any questions about the blog post

@HajkD
Copy link
Member Author

HajkD commented Jun 20, 2017

Many thanks, @sckott. I am quite busy in the past weeks, but I will try to write a post by today. So far, your directions are very clear, but I will happily take your offer to help me in case I get stuck :)

@stefaniebutland
Copy link
Member

Hello @HajkD. Just wanted to say how happy I am that you will contribute a post about biomartr. In my (past) research life I used BioMart, especially as an interface to Ensembl annotations, and preached about its utility to my biologist colleagues. Great to have this addition to rOpenSci!

@HajkD
Copy link
Member Author

HajkD commented Jun 20, 2017

Hello @stefaniebutland ,

Thank you so much for your great feedback :) I truly hope that this package will become a small puzzle stone to help to tackle the reproducibility crisis in biomedical research and also to speed up large-scale genomics studies. If it's useful then I am happy about any collaboration to extend the biomartr functionality :)

@HajkD
Copy link
Member Author

HajkD commented Jun 22, 2017

Hi @sckott ,

I am very sorry that I still didn't manage to find some quiet time to write a proper blog post. I am quite caught up in a paper revision and collaborations right now.
Would it be fine if I postpone my submission by another week? If not, please feel free to onboard biomartr without a blog entry and I am happy to contribute to a blog post later on (if that's also fine with you).

Many thanks for your patience and support :)

Kind regards,
Hajk

@sckott
Copy link
Contributor

sckott commented Jun 22, 2017

@HajkD another week is fine, no rush at all!

actually, @stefaniebutland when could we fit his post in the blog schedule?

@stefaniebutland
Copy link
Member

@HajkD biomartr can def be onboarded without a post and your paper revision is important!

Could you give me an idea of a month and week that would work for you? I'll work with that to assign you a date to provide a draft and post date. Cheers.

@HajkD
Copy link
Member Author

HajkD commented Jun 25, 2017

Dear @sckott and @stefaniebutland ,

Thank you so much for your prompt reply and for all your support!

@stefaniebutland: Would the week between 03.07. - 09.07. work for you? I could deliver a blog post on any day during that week.

I truly appreciate all your efforts and for helping me to schedule the post :)

Kind regards,
Hajk

@stefaniebutland
Copy link
Member

@HajkD A draft in the week of 03.07. - 09.07 would work well. Tentative post date would be Tues July 18. This would give us time to give you feedback on your draft post.
Cheers

@HajkD
Copy link
Member Author

HajkD commented Jun 27, 2017

Dear @stefaniebutland

Many thanks. This sounds perfect. I will post my draft between 03.07. - 09.07 and very much look forward to receiving your feedback.

Kind regards,
Hajk

@sckott sckott mentioned this issue Sep 7, 2017
14 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants