Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

submission: taxadb #344

Closed
16 of 25 tasks
karinorman opened this issue Sep 30, 2019 · 30 comments
Closed
16 of 25 tasks

submission: taxadb #344

karinorman opened this issue Sep 30, 2019 · 30 comments

Comments

@karinorman
Copy link

karinorman commented Sep 30, 2019

Submitting Author: Kari Norman (@karinorman)
Repository: https://github.com/cboettig/taxadb
Version submitted: v1.0.0
Editor: @ldecicco-USGS (Guest Editor)
Reviewer 1: @mcsiple
Reviewer 2: @lindsayplatt
Archive: TBD
Version accepted: TBD


  • Paste the full DESCRIPTION file inside a code block below:
Package: taxadb
Version: 0.0.1
Title: A High-Performance Local Taxonomic Database Interface
Description: Creates a local database of many commonly used taxonomic authorities
             and provides functions that can quickly query this data.  
Authors@R: c(
            person("Carl", "Boettiger", , "cboettig@gmail.com", c("aut", "cre"),
                   comment = c(ORCID = "0000-0002-1642-628X")),
            person("Kari", "Norman", role= "aut",
                   comment = c(ORCID = "0000-0002-2029-2325")),       
            person("Jorrit", "Poelen", role ="aut", 
                   comment = c(ORCID="0000-0003-3138-4118")),
            person("Scott", "Chamberlain", role ="aut", 
                   comment = c(ORCID="0000-0003-1444-9135")),
            person("Noam", "Ross", , "ross@ecohealthalliance.org", role ="ctb",
                   comment = c(ORCID = "0000-0002-2136-0000"))
            )
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
ByteCompile: true
Depends: R (>= 2.10)
Imports:
    dplyr,
    DBI,
    arkdb,
    tibble,
    memoise,
    readr,
    rappdirs,
    rlang,
    magrittr,
    stringi,
    progress,
    utils,
    dbplyr
Roxygen: list(markdown = TRUE)
RoxygenNote: 6.1.1
Suggests:
    spelling,
    testthat,
    covr,
    knitr,
    rmarkdown,
    purrr,
    RSQLite,
    MonetDBLite
Language: en-US
VignetteBuilder: knitr
URL: https://github.com/cboettig/taxadb
BugReports: https://github.com/cboettig/taxadb/issues
Remotes: MonetDB/MonetDBLite-R, cwida/duckdb/tools/rpkg


Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • database access
    • data munging
    • data deposition
    • reproducibility
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package accesses existing data sets, adapts them to a standard format, and provides a set of functions for accessing the resulting database.

  • Who is the target audience and what are scientific applications of this package?

Anyone using data including species names, especially for datasets that may have taxonomic inconsistencies.

taxaize provides accesses to many of the same datasets via an API interface. Our package is specifically designed to work around the API to offer the faster access necessary for large datasets, as well as provide a standardized format for all datasets.

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Technical checks

Confirm each of the following by checking the box. This package:

Publication options

JOSS Options
  • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

@melvidoni
Copy link
Contributor

Hello @karinorman, your editor will be @ldecicco-USGS, acting as a Guest Editor for rOpenSci.

@ldecicco-USGS
Copy link

ldecicco-USGS commented Oct 7, 2019

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Hello @karinorman ...this is my first "guest edit"...so bear with me! The following is the output of goodpractice::gp(). I'm not terribly concerned about the messages, but they would be easy messages to get rid of, and then future pull request could use the gp() function from a clean slate.

── GP taxadb ──────────────────────────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all
    package code in general. 83% of code lines are covered
    by test cases.

    R/clean_names.R:42:NA
    R/clean_names.R:43:NA
    R/clean_names.R:44:NA
    R/clean_names.R:45:NA
    R/clean_names.R:46:NA
    ... and 60 more lines

  ✖ use '<-' for assignment instead of '='. '<-'
    is the standard, and R users and developers are used it
    and it is easier to read your code for them if you use
    '<-'.

    R\handling-duplicates.R:39:11
    R\mutate_db.R:56:24
    R\mutate_db.R:69:25

  ✖ avoid long code lines, it is bad for
    readability. Also, many people prefer editor windows
    that are about 80 characters wide. Try make your lines
    shorter than 80 characters

    inst\examples\docker.R:23:1
    R\by_common.R:4:1
    R\clean_names.R:3:1
    R\clean_names.R:20:1
    R\filter_by.R:12:1
    ... and 24 more lines

  ✖ avoid 1:length(...), 1:nrow(...),
    1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions.
    They are error prone and result 1:0 if the expression
    on the right hand side is zero. Use seq_len() or
    seq_along() instead.

    R\filter_by.R:60:50
    R\filter_by.R:65:49
    tests\testthat\test-mutate-db.R:16:19

──────────────────────────────────────────────────────────────

I'm sending email requests now to find 2 reviewers.

Reviewers: @mcsiple @lindsayplatt
Due date: October 28, 2019

@ldecicco-USGS
Copy link

Alright @karinorman ! We have 2 reviewers @mcsiple and @lindsayplatt. Please try to submit the reviews through the GitHub issue by Oct. 28. Let me know if you have any questions on the review process.

@lindsayplatt
Copy link

lindsayplatt commented Oct 28, 2019

Hello! It was fun to review this package. I have specific comments within the ROpenSci checklist (they are noted as a quotation starting with my initials {LP}). I also added a section of comments near the end that didn't fit the checklist perfectly.

Best,
Lindsay

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally.

{LP} I see a vignettes folder on GitHub, but I don't see the vignettes when I go to the Packages tab and click on the package name or when I run vignette(package="taxadb"), I get no vignettes found. There seems to be an issue with finding the vignettes that are in the folder. It might be that you have them tucked under an additional folder (articles) rather than the .Rmd files just existing in the vignettes folder.

  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally.

{LP} mutate_db() & taxa_tbl() do not have examples. td_connect() and td_create() have examples but using functions called db_connect() and db_create(); perhaps outdated function names? All others have examples and they run locally.

  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

{LP} I do not see anything that mentions how to contribute to the package in either the README, in a file called CONTRIBUTING, or mentioned anywhere in the package files. Please see this part of the ROpenSci package documentation (https://devguide.ropensci.org/collaboration.html#friendlyfiles) for info on including a contribution section. The DESCRIPTION file has the required fields.

Have not heard if this is being submitted to JOSS, so assuming no and skipping the JOSS section of review.

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.

{LP} Yes, connects to local DB of taxonomic data for querying consistently. Unknown information about the source of this data, which is useful in case you need to update frequently by running td_create("col"). You should site the "Catalogue of Life" somewhere, include links to documentation for that, and then tell the user how often (if at all) they should update that database by running td_create.

  • Performance: Any performance claims of the software been confirmed.

{LP} Since the database is available locally, it seems like this package could be faster than it currently is. Not bad right now though.

  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.

{LP} I used devtools::check() - it completed with zero issues (I have Windows 10) and took 7.5 min. I also ran devtools::test() and that took 10 min with 0 failures, 5 warnings, and 2 skips. With covr::package_coverage(), I'm seeing low coverage for code within mutate_db.R and I also suggest that you add tests for quick_db() and has_tbl on their own. I ran goodpractice::gp() and only got one note - it was about test coverage (87% lines covered) with similar information about some missing lines as I noted previously. Finally, I ran spelling::spell_check_package() and spelling::spell_check_files("README.Rmd") with no real spelling errors.

  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

{LP} Two things, both of which have been noted above. All else check out! One issue - vignettes are not available to the user since they aren't in the correct folder. One suggestion - watch the naming convention. Your by_* functions are a bit confusing and could be made clearer with more descriptive names.

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 3 hrs as of 10/27/2019

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

README Documentation

  • In the installation instructions, it's unclear which of these steps is required as part of the setup and which is an example. It seems like if the package needs dplyr and readr that they would be loaded when taxadb is loaded, rather than requiring additional library commands. You might want to explain why those are needed. Additionally, the local copy of the Catalogue of Life seems like a necessary step, but the following code about reading in a species list seems like an example and not required for all usecases. I would just clarify when/why the user might take those steps.
  • Using "col" as a reference to "Catalogue of Life" is potentially confusing. I like the brevity of it, but it has a lot of other uses in R such as "column" in colnames or specifying "color" in base graphics. Consider using something more descriptive, such as "cat_of_life".
  • For people that may be new to the Catalogue of Life, it would be useful to have some documentation describing the common column names or pointing people out to existing documentation for this database.
  • When I installed this, I received some warnings. Better warnings/errors to help me understand if I need to take action or try again would be useful.
td_create("col")

Importing dwc_col.tsv.bz2 in 100000 line chunks:
[-] chunk 6	...Done! (in 47.58484 secs)
Importing common_col.tsv.bz2 in 100000 line chunks:
Warning: 1 parsing failure.
  row col   expected    actual         file
23215  -- 20 columns 2 columns literal data

	...Done! (in 2.237027 secs)
Warning messages:
1: In readLines(con, n = n, encoding = encoding, warn = FALSE) :
  invalid input found on input connection 'C:\Users\lcarr\AppData\Local\taxadb\taxadb/dwc_col.tsv.bz2'
2: In readLines(con, n = n, encoding = encoding, warn = FALSE) :
  invalid input found on input connection 'C:\Users\lcarr\AppData\Local\taxadb\taxadb/common_col.tsv.bz2'

Functions

  • When I ran the by_* functions, most of the examples took time to download information. If you submit to CRAN, these would not pass checks because they take too long for the system to run. I recommend that you use \dontrun{} in your ROXYGEN documentation for the examples that take a minute to download or hit some kind of website.
  • I would suggest having the by_* family of functions be more descriptive. Typically, functions are named with a verb indicating what they do. For instance, filter_catalogue_by*. It is longer, but brevity can cause more confusion because it is not as clear what action the function is actually taking. It could be "grouping by" or "summarizing by", etc.
  • It is unclear to me how filter_by is different from by_*. I can run the following two lines and seem to get the same result:
by_name(c("Trochalopteron henrici gucenense", "Trochalopteron elliotii"))

filter_by(c("Trochalopteron henrici gucenense", "Trochalopteron elliotii"), "scientificName")
  • It could be a little confusing that get_ids has an argument that can completely change the output format (the format = "uri" argument). Typically, no matter the arguments, you can expect the same type of information back. Returning a URL that you need to visit to get the IDs might be surprising to users. You could make it a separate function. Additionally, if I try that same argument for get_names, I don't get a url back - get_names(c("ITIS:180092", "ITIS:179913"), format = "uri") returns "Homo sapiens" "Mammalia".
  • I see code for null_tibble() but I don't see it being used at all or exported. Might be old and you don't need it anymore?

Miscellaneous

  • If I run code in the console and then try to knit an RMD with taxadb commands, I always get the error about the database being locked since there is already a session using it. It isn't super clear to a user how to remedy this without completely closing R, opening it up, and trying to knit the RMD without running any code in the console. It may make sense to include a function that the user can run to "close" the connection manually and also explain this behavior in documentation somewhere.
  • I don't know how to recreate or describe (and it has not happened to me before), but every once in a while as I am executing commands from this package, RStudio seems to freeze. I can't click on anything or type any code. I have to close the window and reopen to keep going. I seem to be able to close the window just fine; no need to force quit, so it isn't totally frozen. Haven't seen this happen to me before. I am using RStudio version 1.2.5001

@ldecicco-USGS
Copy link

ldecicco-USGS commented Oct 28, 2019

Thanks @lindsayplatt !

It looks like the vignettes are in a sub-directory "articles", which is why they are not being properly indexed when created via the remotes package.

If the developers want the vignettes included in the package itself (and not just a pkgdown page), they would need to both move the files out of the "articles" folder, and change the installation instructions to:

remotes::install_github("cboettig/taxadb", 
                         build_vignettes = TRUE, 
                         build_opts = c("--no-resave-data", "--no-manual"))

I'm constantly having to remind myself of the build_opts argument.

@mcsiple
Copy link

mcsiple commented Oct 29, 2019

Hi everyone,

This package is very cool. I tried to follow similar conventions in my review to @lindsayplatt, for consistency. I've also added some comments on the end. Thank you for including me as a reviewer for this nice new tool.

Package Review

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README

One quick question about the README -- I wasn't sure after reading it whether there is a way for users to know where the local database is located. If this option is available, I think it would be good to add.

  • Vignette(s) demonstrating major functionality that runs successfully locally

browseVignettes(package="taxadb") does not return any vignettes but maybe this is normal for packages not yet on CRAN? I see that @lindsayplatt ran into a similar issue and looks like it will be resolved with @ldecicco-USGS 's suggestion.

  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

I don't see any CONTRIBUTING guidelines in the documentation.

Functionality

  • Installation: Installation succeeds as documented.
Comment on operating systems

I tested installation from GitHub and locally from two different operating systems. Where possible, I tested other functionality was also tested on both a Mac and a Windows OS.

I ran into some issues with functionality when using different machines. For example, I had namespace errors for rlang and other dependencies when testing on a Windows OS. This may be unrelated to the package itself (maybe has to do with package versions?) but just something to be aware of.

Local install

Local install in Windows got stuck in a loop of error messages every time I tried it, and I wasn't able to figure out what was causing the issue, but all the error messages were related to MonetDBLite. Installation of MonetDBLite from GitHub was smooth (though lengthy) on a Mac machine.
Local installation on a Mac ran smoothly. I'm not sure if that's an issue of different versions of the imported packages on those two machines, or something else.
Prior to running test(), I installed the dev version of MonetDBLite from GitHub. The most current version did not install from CRAN (i.e., install.packages("MonetDBLite") returned an error). It is possible that if folks try to download and run taxadb, they may have to download the dev version of MonetDBLite in order to install taxadb. That was my experience.

Install from GitHub

devtools::install_github("cboettig/taxadb") was fine and fast on both Windows and Mac OS.

  • Functionality: Any functional claims of the software been confirmed.

Functionality notes

td_create() returned some errors when I used one machine, including some parsing failures. After the parsing failures I switched to just testing on Mac OS, and this step works fine. I wrote more below about specific functions.

  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.

Initially test() returned errors because of the issues with MonetDBLite. After installing the dev version of MonetDBLite it works (and quickly!) in both operating systems.
I used goodpractice::gp() too and encountered two goodpractice errors:

  1. Writing unit tests for all functions: "87% of code lines are covered by test cases"

  2. Installation fail for R CMD check error. gp output says "Installation failed. See ~/taxadb.Rcheck/00install.out for details"

I never figured out what the latter issue was, but it did not appear to affect functionality for me.
There must be some coverage issues, and covr::package_coverage() returns a more detailed description:

taxadb Coverage: 87.09%
R/mutate_db.R: 0.00%
R/taxa_tbl.R: 79.17%
R/td_connect.R: 86.96%
R/get_ids.R: 87.50%
R/get_names.R: 90.00%
R/fuzzy_filter.R: 94.00%
R/td_create.R: 94.74%
R/clean_names.R: 95.45%
R/synonyms.R: 97.06%
R/by_common.R: 100.00%
R/by_id.R: 100.00%
R/by_name.R: 100.00%
R/by_rank.R: 100.00%
R/filter_by.R: 100.00%
R/handling-duplicates.R: 100.00%
R/utils.R: 100.00%
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Estimated hours spent reviewing: 10 hours as of 10/28/2019


General comments

As someone who works mostly with one group of species and needs access to multiple databases, this package could be incredibly helpful. I think some changes to the documentation could make it more accessible for someone like myself, who is not very well-versed in the diversity of taxonomic databases, how to access them, and which ones have what I need.

Review Comments

Documentation

The documentation is generally good and I just had a couple of small comments.

  • The breeding bird survey data is used for all the intro examples, but I would love a couple of other examples that are more general. For example, looking up another set of taxa, or a set of taxa downstream from a specific taxonomic node.
  • An example where taxadb interacts with another package would be helpful
  • I think some documentation could be particularly effective if it is built in as more specific error/warning messages. With all the taxonomy packages I've used, I often find it hard to match exactly what the package expects in terms of the spelling and syntax of species names, especially common names. It would be great to have a little more built-in functionality that provides additional guideance (e.g., "this species is in ITIS but not in IUCN - did you mean "itis"?")

Help files

Running help(package = "taxadb") did not return a package index or any help files when I was using Windows; on Mac OS it was fine.

Functions

I think I may have missed something about how taxadb handles common names. I am always looking for taxonomic packages that can deal with slight differences in how common names are spelled/punctuated. It would be good to have some more flexibility w/r/t names, so that punctuation can be not an "exact match." For example:

by_common("Stellers jay")
by_common("Steller jay")
by_common("Steller's jay") # only this one returns results
# Based on what I understand from clean_names, maybe this works? 
by_common(clean_names(names = "Steller jay"))  # but this does not return results either

Maybe there is a way to set tolerance on punctuation-- if so, it would be useful for me as a future user.

In the by_rank() function, is there a way to return suggestions about what went wrong, if by_rank() doesn't return a tibble? For example, if I, a fish newb, am looking for all Chondrichtyans,

by_rank("Osteichthyes", "class", "itis") 

returns an empty tibble. It would be great to have something that tells people, "[insert name of taxon you entered] is defined as an [order/kingdom/genus]-- is that the classification you meant?" It doesn't have to be that literal, but would be good as a checker if people have errors in classifying the taxa they're interested in.

Session Info

In Windows:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_1.3.1  dplyr_0.8.3  taxadb_0.0.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2        dbplyr_1.4.2      compiler_3.6.1    pillar_1.4.2      prettyunits_1.0.2 remotes_2.1.0    
 [7] tools_3.6.1       progress_1.2.2    bit_1.1-14        testthat_2.2.1    zeallot_0.1.0     digest_0.6.22    
[13] pkgbuild_1.0.6    pkgload_1.0.2     RSQLite_2.1.2     memoise_1.1.0     tibble_2.1.3      pkgconfig_2.0.3  
[19] rlang_0.4.1       rex_1.1.2         cli_1.1.0         DBI_1.0.0         yaml_2.2.0        pkgreviewr_0.1.2 
[25] arkdb_0.0.5       withr_2.1.2       rappdirs_0.3.1    desc_1.2.0        fs_1.3.1          vctrs_0.2.0      
[31] devtools_2.2.1    hms_0.5.1         bit64_0.9-7       tidyselect_0.2.5  rprojroot_1.3-2   glue_1.3.1       
[37] R6_2.4.0          processx_3.4.1    fansi_0.4.0       sessioninfo_1.1.1 blob_1.2.0        purrr_0.3.3      
[43] callr_3.3.2       covr_3.3.2        magrittr_1.5      backports_1.1.5   ps_1.3.0          ellipsis_0.3.0   
[49] usethis_1.5.1     assertthat_0.2.1  utf8_1.1.4        stringi_1.4.3     lazyeval_0.2.2    crayon_1.3.4    

On Mac:

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] devtools_2.2.1 usethis_1.5.1  magrittr_1.5   taxadb_0.0.1   readr_1.3.1    dplyr_0.8.3   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2         whoami_1.3.0       prettyunits_1.0.2  ps_1.3.0           clisymbols_1.2.0   utf8_1.1.4        
 [7] assertthat_0.2.1   zeallot_0.1.0      rprojroot_1.3-2    digest_0.6.22      R6_2.4.0           backports_1.1.5   
[13] RSQLite_2.1.2      evaluate_0.14      httr_1.4.1         pillar_1.4.2       rlang_0.4.1        progress_1.2.2    
[19] curl_4.2           lazyeval_0.2.2     rstudioapi_0.10    praise_1.0.0       MonetDBLite_0.6.1  pkgreviewr_0.1.3  
[25] callr_3.3.2        blob_1.2.0         rmarkdown_1.16     desc_1.2.0         stringr_1.4.0      bit_1.1-14        
[31] compiler_3.6.1     lintr_2.0.0        xfun_0.10          pkgconfig_2.0.3    base64enc_0.1-3    pkgbuild_1.0.6    
[37] htmltools_0.4.0    tidyselect_0.2.5   tibble_2.1.3       codetools_0.2-16   fansi_0.4.0        crayon_1.3.4      
[43] dbplyr_1.4.2       withr_2.1.2        rappdirs_0.3.1     jsonlite_1.6       DBI_1.0.0          covr_3.3.2        
[49] cli_1.1.0          stringi_1.4.3      fs_1.3.1           remotes_2.1.0      rex_1.1.2          testthat_2.2.1    
[55] xml2_1.2.2         ellipsis_0.3.0     vctrs_0.2.0        cyclocomp_1.1.0    arkdb_0.0.5        tools_3.6.1       
[61] rcmdcheck_1.3.3    bit64_0.9-7        glue_1.3.1         purrr_0.3.3        hms_0.5.1          processx_3.4.1    
[67] pkgload_1.0.2      yaml_2.2.0         xmlparsedata_1.0.3 xopen_1.0.0        sessioninfo_1.1.1  memoise_1.1.0     
[73] goodpractice_1.0.2 knitr_1.25     

rOpenSci guidelines:

  • Does the code comply with general principles in the Mozilla reviewing guide?
  • Does the package comply with the rOpenSci packaging guide?

@ldecicco-USGS
Copy link

Awesome, thanks so much @mcsiple and @lindsayplatt . Now... @karinorman , it's back to you!

@karinorman
Copy link
Author

Hi @lindsayplatt & @mcsiple, thanks for taking the time to review and for your helpful comments! A couple general responses below about issues you both ran into with more detailed and individualized responses to come!

Vignettes

Thanks @ldecicco-USGS for pointing out the vignette location. They are currently in the articles subdirectory to prevent timing out on CRAN (pkgdown still detects and builds vignettes in subdirectories). Prebuilt vignettes are linked at the bottom of the README, but we could also put a lighter weight version in vingettes/ so they're accessible by browseVignettes(package="taxadb").

Backend Documentation

It's clear from both of your experiences with the package that an upfront description of the package backends (i.e. database hosts) is necessary. We will expand on that in the schema.rmd vignette, but also wanted to point to the article draft in paper/manuscript.rmd, which we would also love feedback on! It lays out the backends in a lot more detail and should lend some clarity.

@lindsayplatt, backend problems are likely the source of your performance issues. td_create("col") only installs the database for Catalogue of Life, whereas many of the examples reference other providers (such as ITIS, which is the default provider). If you try to use a provider that has not been installed the default is to install that database on the fly, which takes a while! Running td_create("all") will install all databases ahead of time. This is obviously an issue of clarity on our part in the README, but just wanted to explain what was likely going on there.

Also, the freezing behavior Lindsay observed is almost surely due to the computer running low on memory. We suspect this is because RSQLite or MonetDBLite were not installed? We made RSQLite an optional dependency and allowed taxadb to work in-memory if it was not found, but this example demonstrates that was a poor choice, since users can easily run out of memory with these giant data tables! We've making RSQLite a required dependency now, and adding a vignette to explain how to choose between these. (Basically RSQLite is easier to install but not as fast as MonetDBLite!) You can check sessionInfo() or just run taxadb::td_connect() to see what database connection is being used.

@cboettig
Copy link
Member

Wow, thanks @lindsayplatt & @mcsiple for the excellent reviews, this has been really helpful in improving the docs and implementations! Very sorry about the vignettes being hard to find! While we work through this, I just wanted to drop a few links in here to the updated vignettes, since hopefully they go some ways to clarifying some of the issues you both raise and we'd really like your feedback on those parts as well.

  • Overview of the name providers & column names (darwin core) -- this was previously called the 'schema vignette', but it also discusses a bit more about each of the 10 data providers (Catalogue of Life, ITIS, etc). It could probably include more specifics about each one but we'd love to hear what you think is missing!

  • new vignette on Database backends. We hope this clarifies a bit more about these options and how to control them!

  • There's also this intro vignette which is meant as a slightly longer / richer set of examples to complement the README, though I'm not sure if it serves that purpose. Perhaps it is unnecessary? Or perhaps some of the current stuff in the README should just be moved into this to keep the README really short and sweet?

  • Lastly, we have a draft of a paper for MEE, discusses the context and motivation in a bit more detail. (ropensci partners with MEE for some reviews, so this still needs polish and will get separate review if/when it goes to MEE, but it could be helpful in better understanding the package so we wanted to mention it now and get all the feedback we can!)

Really appreciate all the help!

@ldecicco-USGS
Copy link

Great!
@lindsayplatt and @mcsiple ....do these response sound acceptable to you? Any further questions/comments?

@karinorman
Copy link
Author

@ldecicco-USGS We're also still working on a few of the comments made by both reviewers, including examples and test for a couple functions, better common names examples, etc. Hoping to have that finished up in the next couple days but would love responses to the above in the meantime. Thanks everyone!

@mcsiple
Copy link

mcsiple commented Nov 6, 2019

Hi all -- apologies for the delayed response. Thank you for adding the extended documentation.

  • The added database backends vignette is great! Thank you for that.

  • I quite like the extended vignette you added @cboettig . The only potential issue I ran into was that the current chunk inside system.time({}) took a while for me to run. If other people besides me report really long running times, it might be worth finding an example with a shorter run time. Mine completed in the following:

   user  system elapsed 
 448.79  227.97  686.84 

Which is much longer than it says the example took in the vignette. A second attempt took about the same amount of time. Besides that it is great and I liked the added length/detail.

  • Just to clarify after looking over the manuscript draft - is taxadb intended to replace taxize ? If not, maybe it's worth discussing in the ms how they can be used together. I am imagining a user who is brand new to taxonomic databases and wants to use the most rigorous workflow to match a list of species to these databases and extract some info about IUCN or CITES listing (for example).

@ldecicco-USGS
Copy link

@karinorman checking it to make sure I understand correctly, we're still waiting on updates from you?

@karinorman
Copy link
Author

@ldecicco-USGS Yes! Just wanted to address some of the general changes we've now made based on comments from both reviewers.

  • contribution guidelines added
  • changed naming convention from by_ to filter_ to reflect the intuition that these functions filter underlying tables based on specified input
  • updated examples and function tests, we now have 94% coverage based on codev
  • added function td_disconnect() to manually close the database connection
  • added more detailed examples for working with messy common names, for example see mutate_db documentation

We are currently still working on a way to deal with database versioning so that multiple versions can be installed on a single machine, with the ability to move between versions. We plan to have that finished up in the next couple weeks.

@mcsiple and @lindsayplatt, let us know if there is anything else you would like to discuss further and thanks again for your helpful feedback!

@ldecicco-USGS
Copy link

Approved! Thanks @karinorman for submitting and @lindsayplatt and @mcsiple for your reviews!

To-dos:

  • Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I'm getting the rOpenSci folks the info get a "taxadb" repo set up here. You'll be made admin that's set up and you accept and invitation to join a taxadb team.
  • Add the rOpenSci footer to the bottom of your README
    " [![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)"
  • Fix all links to the GitHub repo to point to the repo under the ropensci organization.
  • If you already had a pkgdown website, fix its URL to point to https://docs.ropensci.org/package_name and deactivate the automatic deployment you might have set up, since it will not be built centrally like for all rOpenSci packages, see http://devdevguide.netlify.com/#docsropensci. In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar
  • Add a mention of the review in DESCRIPTION via rodev::add_ro_desc().
  • Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer Appveyor projects to ropensci Appveyor account so after transfer of your repo to rOpenSci's "ropensci" GitHub organization the badge should be [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname).
  • We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative, see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

Welcome aboard! We'd love to host a blog post about your package - either a short introduction to it with one example or a longer post with some narrative about its development or something you learned, and an example of its use. If you are interested, review the instructions, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions.

We've put together an online book with our best practice and tips, this chapter starts the 3d section that's about guidance for after onboarding. Please tell us what could be improved, the corresponding repo is here.

@maelle
Copy link
Member

maelle commented Jan 9, 2020

@karinorman I've invited you to the rOpensci GitHub organization. Once you've accepted it, you can [transfer the repo](https://help.github.com/en/github/administering-a-repository/transferring-a-repository9 as @ldecicco-USGS said, from the Settings tab of your repo. Once you've transferred the repo, please ping me here, so I can give you admin rights to the repo again. Thank you!

@stefaniebutland
Copy link
Member

Congratulations @karinorman! Given that the subject area of taxadb is an important one for rOpenSci and our user/developer community, would you consider writing a blog post?

@ldecicco-USGS
Copy link

To echo @stefaniebutland 's request...I think the development of this package would make a good blog story. Others may have different ideas about why taxadb is neat, but for me it was how to make a relatively complicated database accessible to users who aren't necessarily subject matter experts. The logistics of creating the local database for the fast connections are really pertinent to other rOpenSci users/developers.

@karinorman
Copy link
Author

@ldecicco-USGS Wonderful to hear - thank you so much for making this a smooth process, and thanks again to reviewers @lindsayplatt and @mcsiple for your feedback that greatly improved the package! I'm out of the office until next week but will go through the onboarding steps then. @stefaniebutland I would also be happy to write a blog post, thanks for the invitation!

@stefaniebutland
Copy link
Member

Awesome :-) and thank you.
When you're back in the office Kari, please suggest a date by which you'd like to submit a draft post. Typically you submit via pull request, I review the post, and we publish one week after submission.

Instructions are here https://github.com/ropensci/roweb2#contributing-a-blog-post
and I'm happy to answer any questions.

@karinorman
Copy link
Author

@stefaniebutland Would Jan 31st be good?

@stefaniebutland
Copy link
Member

Yes!!

@karinorman
Copy link
Author

@lindsayplatt and @mcsiple, do you mind being acknowledged as our reviewers?

@lindsayplatt
Copy link

👍 don't mind one bit!

@stefaniebutland
Copy link
Member

Hi @karinorman. Will you be able to submit a draft blog post this week?
Happy to answer any questions.

@mcsiple
Copy link

mcsiple commented Feb 3, 2020

Eek so sorry for the delay. I don't mind being acknowledged either!

@karinorman
Copy link
Author

@stefaniebutland Sorry for the lack of update! I got hit with a nasty virus last week and wasn't quite able to finish it up. Look for it in the next couple days!

@stefaniebutland
Copy link
Member

Ack! No pressure from over here @karinorman. Get well soon and submit when you're ready

@cboettig
Copy link
Member

@maelle Should this thread now be closed? Also, not sure why the 'peer reviewed' badge is showing up as 'status unknown?' https://badges.ropensci.org/344_status.svg

@maelle
Copy link
Member

maelle commented Feb 18, 2020

The badges server seems to have problems at the moment but yep this should be closed for the badge to become green I think.

@maelle maelle closed this as completed Feb 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants