restez -- submission #232

DomBennett · 2018-06-27T15:14:32Z

Summary

What does this package do? (explain in 50 words or less):

The restez package downloads all or sections of GenBank and creates a local SQLite copy of the database for querying. The package comes with a series of useful functions for querying the database and is designed to work with rentrez.

Paste the full DESCRIPTION file inside a code block below:

Package: restez
Type: Package
Title: Create and Query a Local Copy of GenBank in R
Version: 0.0.0
Authors@R: person("Dom", "Bennett", role = c("aut", "cre"), email = "dominic.john.bennett@gmail.com")
Maintainer: D.J. Bennett <dominic.john.bennett@gmail.com>
Description: Download large sections of GenBank and generate a local SQL-based
    database. A user can then query this database using restez functions or
    through rentrez wrappers.
URL: https://github.com/AntonelliLab/restez#readme
BugReports: https://github.com/AntonelliLab/restez/issues
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Depends:
    R (>= 3.3.0)
Imports:
    utils,
    rentrez,
    RSQLite,
    DBI,
    R.utils,
    downloader,
    RCurl,
    cli,
    crayon
Suggests:
    testthat,
    knitr,
    rmarkdown
RoxygenNote: 6.0.1
VignetteBuilder: knitr

URL for the package (the development repository, not a stylized html page):

https://github.com/AntonelliLab/restez

Please indicate which category or categories from our package fit policies this package falls under and why? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):

data retrieval, for users that wish to retrieve lots of sequence information and find NCBI's Entrez too slow.

Who is the target audience and what are scientific applications of this package?

Researchers wishing to perform any form of analysis with DNA sequence data. For my own purposes, I will use the package to retrieve large amounts of sequence data for phylogenetic analysis.

Are there other R packages that accomplish the same thing? If so, how does
yours differ or meet our criteria for best-in-category?

Hajk-Georg Drost's biomartr is similar and is in fact the inspiration for restez. It only allows users to download genome-specific data, however, not GenBank sequences. From the ecological sciences perspective, genome data is just not nearly taxonomically representative enough yet for any questions concerning biodiversity.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

NA

Requirements

Confirm each of the following by checking the box. This package:

does not violate the Terms of Service of any service it interacts with.
has a CRAN and OSI accepted license.
contains a README with instructions for installing the development version.
includes documentation with examples for all functions.
contains a vignette with examples of its essential functions and uses.
has a test suite.
has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

Do you intend for this package to go on CRAN?
Do you wish to automatically submit to the Journal of Open Source Software? If so:
- The package has an obvious research application according to JOSS's definition.
- The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
- The package is deposited in a long-term repository with the DOI:
- (Do not submit your package separately to JOSS)
Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
- The package is novel and will be of interest to the broad readership of the journal.
- The manuscript describing the package is no longer than 3000 words.
- You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
- (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
- (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
- (Please do not submit your package separately to Methods in Ecology and Evolution)

Detail

Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:
Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

I imagine the rentrez and biomartr developers to be good reviewers: dwinter and HajkD


sckott · 2018-06-30T18:33:37Z

Editor checks:

Fit: The package meets criteria for fit and overlap
Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
License: The package has a CRAN or OSI accepted license
Repository: The repository link resolves correctly
Archive (JOSS only, may be post-review): The repository DOI resolves correctly
Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Thanks for your submission @DomBennett !

── GP restez ───────
It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 87% of code lines are
    covered by test cases.

    R/biomartr-tools.R:41:NA
    R/biomartr-tools.R:42:NA
    R/biomartr-tools.R:43:NA
    R/biomartr-tools.R:44:NA
    R/biomartr-tools.R:45:NA
    ... and 65 more lines

  ✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows
    that are about 80 characters wide. Try make your lines shorter than 80 characters

    R/rentrez-wrappers.R:26:1
    tests/testthat/test-setup.R:43:1

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions.
    They are error prone and result 1:0 if the expression on the right hand side is zero. Use seq_len()
    or seq_along() instead.

    R/gb-get-tools.R:55:19
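
To illustrate the last point, here is a minimal base-R sketch (not taken from the restez code base; the variable names are made up) showing why `1:length(x)` misbehaves on an empty vector while `seq_along(x)` does not:

```r
# With an empty vector, 1:length(x) is 1:0, which counts DOWN from 1 to 0
x <- character(0)

bad_idx <- 1:length(x)     # c(1, 0): a loop over this runs twice
good_idx <- seq_along(x)   # integer(0): a loop over this never runs

# Count how many iterations each index vector produces
n_bad <- 0
for (i in bad_idx) n_bad <- n_bad + 1

n_good <- 0
for (i in good_idx) n_good <- n_good + 1

print(c(bad = n_bad, good = n_good))
```

The same reasoning applies to `seq_len(nrow(df))` in place of `1:nrow(df)`.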

Seeking reviewers now 🕐

Reviewers:

@naupaka deadline: 2018-07-26
@eveskew deadline: 2018-07-30

naupaka · 2018-07-03T18:03:37Z

@sckott I'd be happy to review this if you like. Right in my wheelhouse and looks like something I would make use of for my own research/teaching.

sckott · 2018-07-03T18:40:13Z

@naupaka thanks for offering, that would be great, thanks, is 3 weeks okay?

naupaka · 2018-07-03T18:45:11Z

Yes that should work. 👍🏼

sckott · 2018-07-03T18:48:34Z

great, thanks

sckott · 2018-07-09T17:49:48Z

Reviewers:

@naupaka deadline: 2018-07-26
@eveskew deadline: 2018-07-30

eveskew · 2018-08-01T16:23:04Z

Hi all!

Find attached my review. Sorry for the delay. Great job overall @DomBennett! Note that it's my first time reviewing for rOpenSci, and partially for that reason, I think I tended to focus on higher level stuff.

Please don't hesitate to ask for any clarification.

-- Evan

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

A statement of need clearly stating problems the software is designed to solve and its target audience in README
Installation instructions: for the development version of package and any non-standard dependencies in README
Vignette(s) demonstrating major functionality that runs successfully locally
- But see some minor recommendations for improvements in my comments, and also note that the rOpenSci packaging guide recommends distinct links in the README for every vignette
Function documentation: for all exported functions in R help
Examples for all exported functions in R help that run successfully locally
- Note that I did not check this exhaustively (since there were quite a few exported functions), but these seem to be present, and I expect them to be of high quality as the rest of the package is
Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R)

For packages co-submitting to JOSS

The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

A short summary describing the high-level functionality of the software

Authors: A list of authors with their affiliations

A statement of need clearly stating problems the software is designed to solve and its target audience

References: with DOIs for all those that have one (e.g. papers, datasets, software)

Functionality

Installation: Installation succeeds as documented.
Functionality: Any functional claims of the software have been confirmed.
Performance: Any performance claims of the software have been confirmed.
Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- Note however, the unit testing coverage as documented earlier by @sckott. I'm not sure if this is an issue or not, as it did not seem to affect my use of the package
Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Final approval (post-review)

The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

I estimate that I spent 13 hours on this review.

Review Comments

General Comments

Overall, this is a well thought out and documented package. I don't work that often with GenBank sequences myself, but based on discussion with others, it seems this package will usefully solve the problem of easily downloading and querying sequences from NCBI. In general, all described functionality in the package seemed to operate as intended. I particularly appreciated the pkgdown site and multiple vignettes. These will be great for new users and offer an easy way to communicate changes and new features as the package develops. I tried to conduct my review in the spirit of a "naive user". I first went through some of the code base checks recommended in the rOpenSci reviewing guide. From there, I worked through the package functionality in a way I expect most new users would: I went from the README to the "Get started" tutorial on through the other vignettes. I referenced the function descriptions and help files throughout. I think my comments represent mostly minor issues, but hopefully they will improve overall organization and usability of the package. I've arranged them below in rough order of importance.

Specific Comments

Explain Maintenance of Multiple SQL Databases: Given the performance issues associated with querying large databases, users might reasonably want to have multiple local SQL databases containing data from different taxa, different sequence lengths, etc. I would guess that re-coding functions to allow for the assignment of different SQL database names within the same restez path would be a bit much at this point (but potentially a feature for future versions?). In lieu of that, I would simply recommend pointing out to users that setting different restez paths would be an easy way to maintain different databases containing different sets of information that could be referenced from within one analysis script. This is a very simple solution, but it didn't occur to me until I spent a few hours with the package, so a new user would probably benefit from knowing that it's an option for maintaining multiple local databases.

More Informative Error Messages When restez Path Not Set: You've done your due diligence in reminding users to run restez_path_set() when they call library(restez), but I wonder if this is enough to prevent errors from new users. If users have a GenBank database on their machine, start an R session, run library(restez), then attempt any of the query functions, they will be met with some opaque errors. Especially when people are loading a host of packages to initiate a script, package-specific warnings can get overlooked. How much trouble would it be to check that the restez path is actually set as a first step in the query functions?
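
One way such a check could look is sketched below in base R. This is only an illustration of the idea: the option name `demo.db_path` and the functions `check_path_set()` and `query_record()` are hypothetical and are not part of restez.

```r
# Hypothetical guard: stop early with a clear message if no path is set.
# The option name "demo.db_path" is invented for this sketch.
check_path_set <- function() {
  path <- getOption("demo.db_path")
  if (is.null(path)) {
    stop("Database path not set. Call the path-setting function first.",
         call. = FALSE)
  }
  invisible(path)
}

# Hypothetical query function that runs the guard before doing any work
query_record <- function(id) {
  path <- check_path_set()
  paste0("would look up '", id, "' in ", path)
}

# Without a path, the user gets the informative message
options(demo.db_path = NULL)
msg <- tryCatch(query_record("demo_1"), error = function(e) conditionMessage(e))
print(msg)

# With a path, the query proceeds
options(demo.db_path = tempdir())
print(query_record("demo_1"))
```

Placing the guard at the top of every query function costs one option lookup per call and replaces an opaque downstream error with an actionable one.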

Change list_db_ids() Defaults: I understand the intuition behind setting n = 100 as the default for this function (prevent users from gathering all IDs in a massive database), but I would consider either changing this or printing a warning to the screen telling users by default they will retrieve a maximum of 100 sequence IDs. I think this is pretty critical because this will likely be a key step in many users' workflow: download a GenBank dataset, query to get a vector of all sequence IDs, then go from there. I think it will cause confusion if users download a large database, gather IDs, but then don't realize they're only working with the first 100 IDs by default.
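
The warning option could be sketched roughly as follows. `list_ids()` is a hypothetical stand-in written for this example, not the actual `list_db_ids()` implementation, and the IDs are generated rather than read from a database.

```r
# Hypothetical stand-in for an ID-listing function with a default cap of 100
list_ids <- function(all_ids, n = 100) {
  if (length(all_ids) > n) {
    warning("Returning the first ", n, " of ", length(all_ids),
            " IDs. Increase `n` to retrieve more.", call. = FALSE)
  }
  utils::head(all_ids, n)
}

# 250 made-up IDs; the default call is truncated and says so
all_ids <- paste0("demo_", seq_len(250))
ids <- withCallingHandlers(
  list_ids(all_ids),
  warning = function(w) {
    message("warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(length(ids))
```

The warning fires only when the result is actually truncated, so users with small databases would never see it.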

Don't Export restez_ready()?: Is there a specific use case for restez_ready()? It seems to me it replicates the functionality of restez_status(). Both return TRUE/FALSE, but restez_status() has more useful information. I would tend towards not exporting restez_ready() just to simplify things for users.

Update restez_status() With Number of Records: Along these lines, one critical piece of information that might be usefully added to restez_status() is the number of records contained in the current SQL database. Users can currently see the size of the database, but naturally they might want to know the actual number of sequences. The only way I could think to easily find this information at present was to run list_db_ids() on the database, then find out the length() of this vector.

Clarify SQL Database Size Requirements: I think there is some issue with the communication of expected database sizes when running db_download(). Or else, the files I happened to pick have extremely long sequences relative to examples in the help files? As a test run, I attempted to download all of the viral sequences on GenBank. db_download() informed me nicely about what was expected:

You've selected a total of 54 file types. These represent:    
● 'Viral'  
Each file contains about 250 MB of decompressed data.  
54 files amounts to about 13.5 GB  
Additionally, the resulting SQL database takes about 50 MB per file  
In total the uncompressed files and SQL database should amount to 16.2 GB Is that OK?

However, after download and database creation (which went fine), restez_status() told a different story:

> restez_status()  
Checking setup status at '~/Desktop/restez' ...  
... found 55 files in 'downloads/'  
... totalling 12.18 GB  
... found 'sql_db' of 18.35 GB  
[1] TRUE

So all told, my local, raw sequence files were smaller than expected (only 12 GB), but the SQL database was much larger than anticipated, resulting in a local database folder that was roughly twice what was expected (30 GB). Is this a peculiarity of this dataset? Something that went wrong with the database generation?
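
For reference, the arithmetic behind the two outputs above can be reproduced in a few lines of R. The only assumption is that 1 GB is counted as 1000 MB, which is consistent with the 13.5 GB figure printed by db_download().

```r
# Estimate printed by db_download(): 54 files, 250 MB raw + 50 MB SQL each
n_files <- 54
est_downloads_gb <- n_files * 250 / 1000
est_sql_gb <- n_files * 50 / 1000
est_total_gb <- est_downloads_gb + est_sql_gb

# Sizes reported by restez_status()
obs_downloads_gb <- 12.18
obs_sql_gb <- 18.35
obs_total_gb <- obs_downloads_gb + obs_sql_gb

# Ratio of observed to estimated size, overall and for the SQL database alone
total_ratio <- round(obs_total_gb / est_total_gb, 2)
sql_ratio <- round(obs_sql_gb / est_sql_gb, 1)
print(c(total = total_ratio, sql = sql_ratio))
```

The discrepancy comes almost entirely from the SQL database, which is several times larger than the 50 MB per file estimate implies.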

Revise References To Defunct Functions: There were some confusing references to gb_download() in various parts of the documentation, including the README file. I gather this function was replaced by db_download() in the current version of the package? In any case, appropriately modifying these references is important since anyone trying to follow along with the example code in the README and elsewhere will pretty quickly run into some confusion at present.

Differing gb_record_get() and entrez_fetch() Behavior: It's great that you've integrated restez functionality with some things offered by rentrez. However, I noticed that the behavior of "regular" restez functions and that of the rentrez wrappers is slightly different. For example (after running demo_db_create()), running cat(gb_record_get("demo_1")) and cat(entrez_fetch(db = "nucleotide", id = "demo_1", rettype = "fasta")) do not return the same information, yet they should be querying the same (local) sequence record. Is this the expected behavior?

Typo Fixes: spelling::spell_check_package() and spelling::spell_check_files("README.Rmd") revealed some minor typos in various portions of the package documentation, particularly in README.Rmd and other vignette .Rmd files. These didn't significantly affect my understanding of the package but should be corrected. In addition, when db_download() is run, there seems to be a typo in the printed text? If a user wanted all mammalian sequences, I think they would want to download file types "12 13 15", not "12 14 15" (which would give viral sequences in addition to mammalian sequences). After reading through the package vignettes more, it occurs to me this may be the result of GenBank changes (i.e., particular taxa getting assigned different file numbers)? In that case, it might be better to revise your text such that it doesn't reference particular file numbers at all in order to future-proof it.

Make Vignette Availability More Obvious: It's great that you put in the time to generate useful vignettes for this package. I'd recommend some subtle changes to the README to make this more apparent for users. Why not move the sentence "For more detailed tutorials, visit the restez website" under a new heading below "Quick Examples" called "Detailed Tutorials"? I think that would be preferable since then new users could first execute the "Quick Examples" code to see how setup, querying, and Entrez wrappers work. Then they could move on to more in-depth tutorials with the vignettes. I think the link in this case should also point directly to the articles page of the package website (rather than the website index, which just replicates the README page).

Revise README Graphic: I appreciated the README file graphic that illustrated the overall package organization. I wonder if a couple of small changes could make this more comprehensive, however. First, would it make more sense for the white box for "restez/" to be moved above the blue box containing "downloads/" and "sql_db"? To me, this would be more accurate since "downloads/" and "sql_db" will be found within the "restez/" directory on a user's machine. Second, could you update the "Query" box to contain all the exported gb_*_get() functions in the package?

Better Organize Function Types (Families): This primarily applies to the ?restez description page and the function "References" page on the website. This occurred to me only because the ?restez help page describes having three sets of functions but then goes on to list four. In addition, the function organization on the "References" page is a little weird since functions from the same grouping can actually come from different source files (and those source files can overlap with other groups). For example, under the "Database" functions, some are from R/setup.R while others are from R/general-get-tools.R. This might make it a little hard for users to wrap their minds around which functions are to be used at which stage of the restez workflow. I would probably suggest four function categories (something like restez Path Setup, GenBank Database Functions, GenBank Query Functions, and Entrez Wrappers) all organized such that source files don't overlap among categories.

Harmonize Vignette Workflows: This is minor, but I might consider reorganizing the multiple vignettes slightly such that they all follow the same conventions and workflow (i.e., all set restez paths to the user's desktop, all delete local GenBank databases following analysis to illustrate best practices, etc.).

Performance Issues With Querying Large Databases: There may not be a good way around this, but I noticed that the gb_*_get() family of functions were a bit sluggish when used to query a large database (i.e., all GenBank viral sequences). For example, a gb_definition_get() call on the viral database required ~24 seconds to return the definition for a single ID, and most of this time seemed to be spent on DBI::dbFetch() according to Rprof() (note I'm certainly no expert in profiling functions). Perhaps for future updates you might consider whether different functions would allow for speedier ID matching of the user's query in the local database? Another solution could be to facilitate users loading specific records from their local SQL database into an R environment object? This might not be the most efficient method, but I'd imagine these objects could be parsed faster?

In the course of discussing with @noamross, he suggested migrating to a MonetDB structure from SQL might improve these performance issues. I definitely don't think this is critical to package functionality at present, but just a consideration for future package development.

roxygen2 Update: devtools::check() and devtools::test() returned no major problems with the package, but I did note that there is a newer version of roxygen2 (6.1.0) available. Not sure if updating will improve the package in any discernible way.

noamross · 2018-08-01T16:37:58Z

👋 Hi all! Per Evan's comment above, I had suggested using MonetDBLite as a mostly drop-in replacement for RSQLite. It's another DBI-compliant back-end stored on-disk. An example of a similar package using it is https://github.com/cboettig/taxald

sckott · 2018-08-02T00:31:08Z

thanks for your review @eveskew !

@naupaka can you get your review in soon?

DomBennett · 2018-08-02T09:06:17Z

Thanks for such a detailed review @eveskew! I really appreciate the hours you have put into trying and testing the package. This whole ROpenSci malarky is pretty nifty.

@noamross, I will investigate using MonetDBLite. It looks like the taxald package performs a similar role to restez but for taxonomic information. I have thought about adapting restez to also work with NCBI taxonomic information. Perhaps this could be in conjunction with taxald in the future? There is also the package taxizedb. Is this significantly different from taxald? On the face of it, they both seem to do the same thing.

sckott · 2018-08-02T16:49:47Z

taxizedb downloads and loads SQL databases into local SQL database engines (mysql/mariadb, postgres, sqlite) and then provides a way to set the connection to those databases so that you can then use with e.g., dplyr.

taxald is, I think, Carl's attempt at making it easier to work with the bulk data by downloading flat files from GitHub releases so that users don't have to set up databases. Is that right @cboettig ?

cboettig · 2018-08-02T17:00:58Z

Correct. @DomBennett taxizedb downloads the various MySQL / Postgres / SQLite / and tsv dumps from ITIS, NCBI, WikiData, GBIF, COL, and EOL and sticks them into a local version of said database for you.

Each of those databases uses an entirely different format / schema for organizing the taxonomic information, so a query that works with, say, COL to return classification won't work with ITIS, etc.

taxald is something of a wrapper around taxizedb: it uses taxizedb to download each of these sources. It then applies a bunch of dplyr wrangling to convert these into a few standard schemas. It then exports the data out of these databases into compressed flat files (using arkdb), and uploads them to GitHub (as assets, using piggyback... eventually this will be Zenodo once the schemas are a bit more finalized). That all happens in data-raw scripts, since the user doesn't need to repeat that manually each time.

On the user-facing side, taxald downloads these compressed flat files from GitHub and loads them into a local MonetDBLite database. taxald user-facing functions are then just relatively simple dplyr queries to the local database. This database will persist between sessions, and should also make the queries possible, because some of the taxonomic source data are 1 - 6 GB uncompressed, which is probably asking for more RAM than would be polite.

cboettig · 2018-08-02T17:16:10Z

@DomBennett Very cool package by the way, don't want to interfere with the reviews but looks very well implemented! In general I'm excited to see more packages that interact with local databases instead of needing millions of API calls.

Completely minor note here, but the way you read in the compressed flat files into the database is pretty slick, though I'm pretty sure you could skip the R.utils::gunzip() since readLines can parse a compressed connection directly (gzip, bz2, or xz). Saves a bit of logic and a bit of disk space, though maybe not worth changing at this point. Been thinking about a similar problem: arkdb uses this logic to import large txt dumps into a database by chunks (since the whole file may exceed memory).
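
A small base-R sketch of reading a gzip-compressed text file without a separate decompression step, both in one go and in chunks. The file and its contents are invented for the example and written to a temporary location; this is not the restez or arkdb code.

```r
# Write a small gzip-compressed text file to a temporary location
path <- tempfile(fileext = ".gz")
con <- gzfile(path, open = "wt")
writeLines(c("LOCUS  demo_1", "LOCUS  demo_2", "LOCUS  demo_3"), con)
close(con)

# Read the whole file straight from the compressed connection
con <- gzfile(path, open = "rt")
lines_all <- readLines(con)
close(con)

# Read the same file in chunks of two lines, as one might for a file
# too large to hold in memory at once
con <- gzfile(path, open = "rt")
n_read <- 0
repeat {
  chunk <- readLines(con, n = 2)
  if (length(chunk) == 0) break
  n_read <- n_read + length(chunk)
}
close(con)

print(lines_all)
print(n_read)
```

Base R also provides `bzfile()` and `xzfile()` connections for the other two compression formats mentioned.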

sckott · 2018-08-02T17:26:57Z

thanks for the help carl

DomBennett · 2018-08-03T08:15:10Z

Thanks @cboettig! taxald sounds and looks pretty nifty. I'd really like to try it out one day. It also looks as if you've been lining up a lot of ducks to get it working thus far.

It'd be cool if restez could make use of taxonomic information so that users could query it using species/genus/etc. names. My instinct would be to just stick with the NCBI taxdump and create linked databases based on taxids. But it sounds as if, given the availability of taxizedb and the like, additional taxonomic resources could be implemented. (Although matching the species in NCBI to the species in other databases would not be pretty.)

I wouldn't want to try anything at the moment though. I think simple and basal is best and I always fear the monolith programs.

cboettig · 2018-08-03T16:21:31Z

Right, as you probably know, the NCBI taxa dump already includes a pretty nice mapping of both NCBI's valid scientific names and recognized synonyms to taxon ids (and even common names), so the NCBI maintainers clearly had queries by species/genus or even common name in mind (including common names that refer to higher taxonomic ranks, e.g. "fishes").

So the NCBI synonyms table probably covers much of the matching species in NCBI to species in other databases -- I suspect (but haven't tried yet, could be very wrong) that most unmatched names in the other databases would be from species that don't have any match in NCBI (either because a species group was later split or renamed in one database but not another, or because certain clades just aren't well covered in NCBI).

In any event, there are already existing lists of mappings between recognized synonyms, not just from NCBI, but also at the actual identifier level, e.g. as provided by WikiData or EOL. So in principle this should make it pretty easy to crosswalk. A goal of taxald is to also make this easy in practice. All the data is there; it is mostly a matter of getting it into a consistent schema in a local db, and then we can do standard queries to match ids, synonyms, common names, or ranks and crosswalk between dbs.

naupaka · 2018-08-07T05:20:12Z

@sckott (et al) sorry for the delay -- deadline blew right past. I finish up at ESA Friday and can get to it this weekend.

sckott · 2018-08-07T18:36:03Z

Okay, thanks @naupaka

sckott · 2018-08-14T18:13:00Z

@naupaka ping 😸

sckott · 2018-10-12T22:55:56Z

devtools::install_github and remotes::install_github should install the required version of DBI for you. not sure what happened there.

I think that's right that subprocess should be >= 0.8.3 - @DomBennett is subprocess only used in tests? if so, should it be moved to Suggests?

I'm getting a failure on that vignette too.

@DomBennett could you address these few things soon so the reviewers can take another look

DomBennett · 2018-10-14T11:17:22Z

Hi @sckott and @eveskew,

Hmmm.... it looks like local installs of a package do not lead to automatic updates for dependencies: https://stackoverflow.com/questions/51032658/update-dependencies-when-installing-a-local-package

subprocess isn't for testing, it is required for creating the spinning icon and allowing users to more easily kill a download. I've updated the DESCRIPTION to subprocess (>= 0.8.3) -- thanks for the spot.

As for the vignette building, I've explicitly tried to avoid building them in the R package as the vignettes depend on a database of all rodents which I build in a separate script (other/rodent_db.R). They should only build for me, for use in the pkgdown site, when there is a rodents database in rodents/ in the package directory. I thought simply adding vignettes/ to the .Rbuildignore would be sufficient to prevent them being built. (For me devtools::check() works and vignettes are not built.)

To hopefully rectify the issue, I have more closely followed the advice from the pkgdown documentation (https://pkgdown.r-lib.org/articles/pkgdown.html#articles): dropping vignette builder in the DESCRIPTION, removing vignette details in the article yaml's and, as already done, adding vignettes/ to .Rbuildignore.

Does the devtools::check() still try and build the vignettes after these changes?

Thanks!

sckott · 2018-10-16T18:52:25Z

installs and checks fine for me now

DomBennett · 2018-11-09T13:23:03Z

Any updates on this?

Let me know if you (@eveskew and @naupaka) are still having problems.

sckott · 2018-11-09T17:12:37Z

thanks for the reminder @DomBennett

@eveskew @naupaka are you happy with the changes? if no response by mid next week I'll assume you're good

sckott · 2018-11-14T17:23:59Z

Approved! Thanks again for your submission @DomBennett !

To-dos:

Please transfer the package to ropensci- you're already on an ropensci team, so you should be able to transfer already. Let me know if you can't.
add rOpenSci footer to README
[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
Change any needed links, such as those for CI badges
Travis CI should update to the new location automatically - you may have to update other CI systems manually
We're starting to roll out software metadata files to all ropensci pkgs via the Codemeta initiative, see https://github.com/ropensci/codemetar/#codemetar for how to include it in your pkg, after installing the pkg - should be easy as running codemetar::write_codemeta() in the root of your pkg

We've started putting together a bookdown with our best practices and tips; this chapter starts the 3rd section that's about guidance for after onboarding. Please tell us what could be improved. The repo is at https://github.com/ropensci/dev_guide

Are you interested in doing a blog post for our blog https://ropensci.org/blog/ ? either a short-form intro to it (https://ropensci.org/technotes/) or long-form post with more narrative about its development (https://ropensci.org/blog/). If so, we'll have our community manager @stefaniebutland get in touch with you on that

eveskew · 2018-11-14T19:07:02Z

Hi all!

Sorry for the late response on this, but I've looked a bit more into the revised package. @DomBennett did a very thorough job in responding to my suggestions, and I really like some of the new package functionality (the improved restez_status() is great!). Using the latest version of the package files, devtools::check() runs fine for me now.

However, I ran into a snag that I wanted to bring to Dom's attention before he migrates the package. Here's an issue I had with db_download() when I attempted to get some real data (the function hangs, so I ran the traceback):

> db_download()
───────────────────────────────────────────────────────────────
Looking up latest GenBank release ...
-
... release number 228
-
Error in FUN(X[[i]], ...) : subscript out of bounds
4. lapply(seq_files_descripts, "[[", 2)
3. unlist(lapply(seq_files_descripts, "[[", 2))
2. identify_downloadable_files()
1. db_download()

I believe I tracked the issue down to a couple of lines having to do with the seq_files_descripts object within the identify_downloadable_files() function. Perhaps the structure of the list has changed, because I think instead of

seq_files <- unlist(lapply(seq_files_descripts, '[[', 1))
descripts <- unlist(lapply(seq_files_descripts, '[[', 2))

you want

seq_files <- unlist(lapply(seq_files_descripts, '[', 1))
descripts <- unlist(lapply(seq_files_descripts, '[', 2))

This change at least resulted in identify_downloadable_files() working for me, and hopefully that's the only fix needed for db_download() to run. Note that I had no problems with generating and working with the demo data built into the package.
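To illustrate the difference between the two extraction operators, here is a minimal base-R sketch. The `entries` list is a hypothetical stand-in for seq_files_descripts (its real structure is not shown in this thread), with one entry deliberately missing its description.

```r
# Hypothetical stand-in for seq_files_descripts: one well-formed entry and
# one entry whose description is missing (an assumption for illustration).
entries <- list(
  c("gbpln1.seq", "Plant sequence entries, part 1"),
  c("gbpln2.seq")
)

# `[[` requires position 2 to exist, so the short entry raises an error.
# The error message is captured here instead of stopping the script.
strict <- tryCatch(
  unlist(lapply(entries, "[[", 2)),
  error = function(e) conditionMessage(e)
)

# `[` returns NA for a missing position instead of failing.
lenient <- unlist(lapply(entries, "[", 2))

print(strict)
print(lenient)
```

Note that `[` trades the error for a silent NA, so a malformed entry still needs to be detected somewhere downstream.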

Sorry again to bring up an issue at this late hour. I know we should move along to get the package out to users, so if there are any package changes, I can be sure to review quickly and verify that db_download() is functional for me.

Best,
Evan

DomBennett · 2018-11-14T20:22:46Z

Thanks @eveskew! I also found this error as I was transferring to ROpenSci.

Your fix does get rid of the error -- thanks! But it does indicate that my parsing of the GenBank release notes wasn't quite up to scratch!

I've gone through and made some changes to my REGEX patterns to solve the issue. Because this could potentially be an issue every month when the latest GenBank version is released, I've also added a warning that will be raised if there are indications that the downloaded file information is not complete.
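As a rough sketch of what such a completeness check could look like (this is not the code in restez; `check_file_info` and its arguments are hypothetical names used only for illustration), the parsed file names and descriptions can be scanned for missing or empty values and a warning raised if any are found:

```r
# Hypothetical helper, not part of restez: flags parsed entries that are
# missing a file name or a description and warns if any are found.
check_file_info <- function(seq_files, descripts) {
  incomplete <- is.na(seq_files) | is.na(descripts) | !nzchar(descripts)
  if (any(incomplete)) {
    warning(
      sum(incomplete), " file(s) have incomplete information; ",
      "the release notes format may have changed.",
      call. = FALSE
    )
  }
  # Logical vector: TRUE for entries that look complete
  !incomplete
}

# Example with one description missing (NA), as `[` would produce
ok <- suppressWarnings(
  check_file_info(c("gbpln1.seq", "gbpln2.seq"), c("Plant entries", NA))
)
print(ok)
```

Returning the logical vector alongside the warning lets the caller decide whether to drop the incomplete entries or stop altogether.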

I will push these changes once I have push access on the now transferred repo (@sckott access please!)

The other change I'll want to make before I push to CRAN is to switch from subprocess to callr. This should reduce the amount of coding and shouldn't impact functionality.

sckott · 2018-11-14T20:24:16Z

@DomBennett you have admin access

DomBennett · 2018-11-14T20:38:36Z

@sckott I added the restez repo to the restez team. I think I need to be a member of that to push, right? At least, I don't seem to be able to add myself to that team.

sckott · 2018-11-14T20:41:22Z

you're approved now

eveskew · 2018-11-14T21:59:28Z

@DomBennett, just wanted to report that I cloned the rOpenSci version of the package, and all seems to be well now! I was able to download a sample of real data (phage sequences), create the database, and query it successfully using the suite of gb_*_get() functions. Great work!

DomBennett · 2018-11-14T22:05:55Z

Thanks very much @eveskew!

naupaka · 2018-11-15T17:54:55Z

@DomBennett I just re-tried to set up the plant/fungal DB, and it hung on the db_create() step. After ^C, it returned the following error:

... 'gbpln231.seq.gz'(148 / 232)
... 'gbpln232.seq.gz'(149 / 232)
... 'gbpln24.seq.gz'(150 / 232)
... 'gbpln25.seq.gz'(151 / 232)
^CError in .local(conn, statement, ...) :
  Unable to execute statement 'select schemas.name as sn, tables.name as tn from sys.tables join sys.schemas on tables.schema_id=sc...'.
Server says 'MALException:dataflow:DFLOWscheduler(): q_dequeue(flow->done) returned NULL'.

Is this the same error that you were talking about earlier?

DomBennett · 2018-11-15T19:18:23Z

@naupaka Hmmm... looks like a new error. I will investigate it!

I have in the past had issues with the database but haven't always been able to recreate the error. You might find that re-running the db_create function will have no problems a second time. Unfortunately, it would have to start from the beginning again.

Also note, I have just updated the db_create function. It no longer requires restez_connect before running. This is so I can run the bulk of the function with callr, which makes killing the running process easier.

(The database behaviour is not 100% predictable, at least for me. For example, I found I couldn't build a database on a USB stick. I should add that to the documentation.)

naupaka · 2018-11-15T19:24:12Z

Good to know. I will pull the new version and give it another go.

DomBennett · 2018-11-15T19:27:26Z

Thanks!

For clarity, I'm now testing the following commands:

# devtools::install_github('ropensci/restez')
library(restez)
restez_path_set('.')
db_download(preselection = '7')
db_create()

DomBennett · 2018-11-16T09:11:43Z

Hi @naupaka,

The plant/fungal database seems to work for me -- see attached.

Let me know if you're still having problems.

plants_fungi_resetz_rsession.txt
plants_fungi_restez_log.txt

naupaka · 2018-11-19T06:28:32Z

It's running now. I forgot to start in tmux the first time through and a broken ssh connection killed it. The database building step takes many hours. So far so good (downloading worked without a hitch); I should know if it worked by the morning.

naupaka · 2018-11-19T08:33:03Z

Awesome! All works great for me now. Thanks for all your hard work on this, @DomBennett! Going to be a really helpful tool.

DomBennett · 2018-11-19T17:26:26Z

@sckott Dumb question: Do I now submit the package and paper to JOSS? Or is it automated? Thanks!

sckott · 2018-11-19T17:48:04Z

not a dumb question. You do have to submit it manually (maybe someday it will be automated).

For JOSS:
- Looks like there's already a Zenodo web hook, so that's good.
- If you haven't already, generate a new release associated with a git tag in your repo. This doesn't mean you have to submit to CRAN with that git tag/release; it's just to get a Zenodo DOI. You can put a Zenodo badge in your readme, and after each git tag it will generate a new Zenodo release.
- Then submit your paper.md to JOSS at http://joss.theoj.org/papers/new

DomBennett · 2018-11-26T20:45:00Z

Thanks for the helpful answer! I've now submitted the paper -- I was waiting for CRAN to host.

sckott · 2018-11-26T20:46:22Z

cool, closing now, seems like JOSS and blog post are started

noamross assigned sckott Jun 27, 2018

noamross added package 1/editor-checks labels Jun 27, 2018

sckott added 2/seeking-reviewer(s) and removed 1/editor-checks labels Jun 30, 2018

sckott added 3/reviewer(s)-assigned and removed 2/seeking-reviewer(s) labels Jul 9, 2018

sckott added topic:data-access topic:data-munging topic:molecular-biology labels Aug 6, 2018

sckott added the pub:joss label Aug 16, 2018

sckott added 5/awaiting-reviewer(s)-response and removed 4/review(s)-in-awaiting-changes labels Nov 9, 2018

sckott added 6/approved and removed 5/awaiting-reviewer(s)-response labels Nov 14, 2018

sckott closed this as completed Nov 26, 2018

DomBennett mentioned this issue Nov 27, 2018

[PRE REVIEW]: restez: Create and Query a Local Copy of GenBank in R openjournals/joss-reviews#1101

Closed
