
antanym #198

Closed
11 of 19 tasks
raymondben opened this issue Feb 26, 2018 · 31 comments

Comments

@raymondben
Member

Summary

  • What does this package do? (explain in 50 words or less):

Provides access to Antarctic geographic place name information, and tools for working with those names.

  • Paste the full DESCRIPTION file inside a code block below:
Package: antanym
Type: Package
Title: Antarctic Geographic Place Names
Version: 0.3.0
Authors@R: c(person("Ben", "Raymond", email = "ben.raymond@aad.gov.au", role = c("aut", "cre")),
	     person("Michael", "Sumner", email = "michael.sumner@aad.gov.au", role="aut"),
	     person("Ursula", "Harris", role = "ctb"),
	     person("Andrew", "Cowie", role = "ctb"),
	     person("Fraser", "Morgan", role = "ctb")
	    )
Description: Antarctic geographic names from the SCAR Composite Gazetteer of Antarctica, and 
  functions for working with those place names.
URL: https://github.com/SCAR/antanym
BugReports: https://github.com/SCAR/antanym/issues
Depends:
    R (>= 3.3.0)
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Imports:
  assertthat,
  C50,
  geosphere,
  httr,
  magrittr,
  raster,
  readr,
  sp,
  stringi
Suggests:
  covr,
  dplyr,
  testthat,
  knitr,
  leaflet,
  rgdal,
  rmarkdown,
  rworldmap
RoxygenNote: 6.0.1
VignetteBuilder: knitr
X-schema.org-applicationCategory: Antarctic/Southern Ocean
X-schema.org-keywords: Antarctic, Southern Ocean, place names, gazetteer
X-schema.org-isPartOf: https://scar.org

  • URL for the package (the development repository, not a stylized html page):

https://github.com/SCAR/antanym

  • Please indicate which category or categories from our package fit policies this package falls under and why (e.g., data retrieval, reproducibility). If you are unsure, we suggest you make a pre-submission inquiry:

Data retrieval (it retrieves the SCAR Composite Gazetteer data from its host server), and geospatial data (geographic place names are obviously geospatial in nature).

  •   Who is the target audience and what are scientific applications of this package?  

Antarctic researchers, particularly those wanting to produce maps or figures of Antarctic regions. Such figures often need spatial features to be labelled (ice shelves, glaciers, stations, etc). The gazetteer data can also be useful in a quantitative sense, for example for calculating distance to the nearest station (which could perhaps be used as a proxy for human disturbance in an ecological modelling scenario).

The geonames package provides access to the global geonames.org database of place names, via their API. The geonames.org database includes place name information from the SCAR Composite Gazetteer of Antarctica (CGA). However, there seems to be a good case for a separate package that deals specifically with Antarctic place names:

  1. the SCAR CGA is the authoritative source for Antarctic place name information, and we (SCAR) are the custodians of it. geonames.org has ingested CGA information into their global database and is thus a secondary source of this information. At the time of writing the geonames.org copy of the CGA is out of date (e.g. it does not contain Lassesen Island, Ginger Reef, or Pavlova Island, all added in 2017).

  2. antanym provides more information about features than does geonames, including the narrative (description of how a feature came to be given its name and who or what it was named after), the date on which it was named, information about the source of the name, and more.

  3. geonames requires a login because it consumes the geonames.org API; antanym does not.

  4. much of the geonames package functionality is irrelevant in an Antarctic context (e.g. GNcities(), GNcountry*(), and GNfindNearbyPostalCodes() - there are no cities, countries, or post codes in Antarctica).

  5. conversely, antanym has functionality specific to the CGA that geonames does not. In particular, the CGA is a composite gazetteer, which means that a single feature can have multiple names (given by different naming authorities from different countries). Antanym has functions to help resolve these, and to discover all of the names associated with a feature. Using the geonames package, it does not appear to be directly possible to find all of the names associated with a given feature. (As an aside, the geonames.org database must hold this multiple-names information in some form, because using the geonames.org get API method (http://api.geonames.org/get?geonameId=6627943&username=demo) shows the alternate names for Booth Island. However, unless I've missed it, the geonames package doesn't wrap the get API method.)
    Antanym also has functions for suggesting names to add to a map and thinning a list of names to get a more visually pleasing spatial coverage for plotting. Because of the differences in functionality, we have not made any particular attempt to align antanym's interface with that of the geonames package. Antanym's interface is instead loosely modelled on a dplyr-style approach to filtering and subsetting.

  6. antanym deals only with the CGA, so it's small enough to cache locally and use offline, whereas geonames is a set of wrappers around the geonames.org API and can't be used offline (though you could memoise particular calls and cache those, but then you couldn't issue new queries while offline). This might sound like a trivial issue but is a genuine consideration for Antarctic workers with limited internet access in the field, on station, or at sea.

  7. as time permits, we will add other Antarctic-related gazetteers to antanym, which are not available through geonames. These include subantarctic gazetteers and informal gazetteers managed by SCAR and the Australian Antarctic Data Centre.

  •   If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
  • I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
    • The package is novel and will be of interest to the broad readership of the journal.
    • The manuscript describing the package is no longer than 3000 words.
    • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
    • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
    • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
    • (Please do not submit your package separately to Methods in Ecology and Evolution)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

Clean bill of health!

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

Barry Rowlingson (github barryrowlingson) - author of the geonames package and all around spatial wizard.

@sckott sckott added the package label Feb 26, 2018
@karthik
Member

karthik commented Mar 15, 2018

👋 @raymondben Sorry for the slow response on our end. We will respond with next steps shortly.

@sckott
Contributor

sckott commented Mar 20, 2018

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Thanks for your submission @raymondben !!

  • You can put a ropensci review badge in your README

[![](https://badges.ropensci.org/198_status.svg)](https://github.com/ropensci/onboarding/issues/198)

  • Goodpractice output: really only the long lines are something to change at some point
It is good practice to write unit tests for all functions, and all package code
    in general. 95% of code lines are covered by test cases.

    R/load.R:71:NA
    R/load.R:72:NA
    R/load.R:73:NA
    R/load.R:74:NA
    R/load.R:97:NA
    ... and 6 more lines

It is good practice to avoid long code lines, it is bad for readability. Also,
    many people prefer editor windows that are about 80 characters
    wide. Try make your lines shorter than 80 characters

    R/load.R:59:1
    R/load.R:64:1
    R/load.R:67:1
    R/load.R:71:1
    R/load.R:79:1
    ... and 71 more lines

Seeking reviewers now 🕐



@sckott
Contributor

sckott commented Mar 31, 2018

Reviewers: @johnbaums @lbusett
Due date: 2018-04-23

@lbusett
Contributor

lbusett commented Apr 24, 2018

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.

  • Functionality: Any functional claims of the software have been confirmed.

  • Performance: Any performance claims of the software have been confirmed.

  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.

    Tests are present and appear to cover the main functionality. However, when running
    devtools::test() from the RStudio "build" pane, many tests fail with an "object 'g' not found"
    error. Running from the console, most tests pass, but only if the g object
    was created beforehand with an_read()

  • [x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 5


Review Comments

The antanym package provides easy access to Antarctic geographic place name information, and
provides helper functions for filtering, "disambiguating", and plotting
Antarctic named locations. Although "simple in scope", those helper functions
are well thought out and make the dataset easier to use. In particular,
the an_filter function makes it easy to query the dataset, based both on location and
on "semantics". The an_suggest and an_thin functions provide a clever way (IMO)
to facilitate plotting of Antarctic named locations over a map.

The package is generally already in good shape. Documentation is good (though the
package vignette could be somewhat expanded and improved), and inline comments
make it easy to navigate the code base. Below, I report some suggestions
for possible improvements I compiled while reviewing the package. Most concern minor
issues and/or suggestions for improvements or corrections to the documentation. The only
"major" suggestion concerns the implementation of the "data caching" mechanism, which could
perhaps be improved by using the rappdirs package.

Comments on Functions

an_read()

  • Though I like the solution used by the authors to allow "caching" of the data in a folder
    selected by the user (also because the download of the file is quite slow), I think that
    it could be improved. At the moment, the user has to manually specify the folder where the
    data will be cached. However, if they want to reuse the cached data in another session,
    they will have to remember where exactly they saved it in the first place.
    A better solution might be to use the rappdirs package, so that the data is placed
    automatically in a "standard" folder (e.g. file.path(user_data_dir(), "antanym")).
    That way, the user could just use a "flag" argument (e.g., use_cache_dir = TRUE).
    The first time, this would download a CSV to the "standard" folder; later
    on, the CSV would be found automatically and the data loaded from it.

  • Download times are a bit "erratic". Sometimes the data is downloaded quickly,
    other times it can take some minutes. In a couple of cases, I got the
    following error:

Error in curl::curl_fetch_memory(url, handle = handle) :
Error while processing content unencoding: incorrect data check

Would it be possible to have the user download a "zipped" version of the dataset?
That would reduce the download size from 22 MB to about 3 MB.

  • You could consider modifying the "do_fetch_data" function to allow automatic retry on failure,
    a given number of times. Something along the lines of:

    temp <- httr::RETRY("GET", download_url, httr::config(ssl_verifypeer = 0L),
                        times = 10,
                        pause_base = 0.1,
                        pause_cap = 3,
                        quiet = FALSE)
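The rappdirs-based caching suggested above could be sketched roughly as follows. Note that the use_cache_dir flag and the resolve_cache_dir helper are purely illustrative, not part of antanym's current API:

```r
## Illustrative sketch only: neither use_cache_dir nor resolve_cache_dir
## exist in antanym; this just shows the rappdirs idea
resolve_cache_dir <- function(cache_dir = NULL, use_cache_dir = FALSE) {
  if (use_cache_dir) {
    ## per-user, platform-appropriate data directory via rappdirs
    cache_dir <- file.path(rappdirs::user_data_dir(), "antanym")
  }
  if (!is.null(cache_dir) && !dir.exists(cache_dir)) {
    dir.create(cache_dir, recursive = TRUE)
  }
  cache_dir
}
```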

an_filter()

  • I'd suggest allowing users to pass the search extent also as an
    sp::bbox object (this would allow, for example, something like:
    suggested <- an_filter(g, extent = sp::bbox(my_sp_object)))

  • The documentation/examples for the "query" argument could perhaps be improved. The
    documentation mentions that it should be a "regular expression", but I think that
    some simpler examples would help, since many people (including me) often struggle
    with regexes.
    For example, something explaining that query = "West Arm" performs an "AND" search, while to
    do an "OR" search one should use query = "West|Arm", or explaining how
    to search for a whole word/string (e.g., query = "\\bUfs\\b" - is there an easier way?).

  • It could be worth mentioning in the help that the search (seems to) treat "separated words" as separate search terms, so that for example an_filter(g, "William Archipelago") matches "William Scoresby Archipelago", which would not happen with "standard" regex matching. I think this is
    a nice and useful behaviour, but it should perhaps be documented.

  • The search uses an ignore_case rule by default when matching the regex.
    For example, query = "West" matches the place name "Sylwester". I think it could be useful
    to provide an option to turn this behaviour on/off.
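The query examples suggested above could be condensed into something along these lines (a sketch only, assuming g has already been loaded with an_read()):

```r
## Assumes g <- an_read(cache_dir = tempdir()) has already been run
an_filter(g, query = "West Arm")   ## "AND": names matching both West and Arm
an_filter(g, query = "West|Arm")   ## "OR": names matching either term
an_filter(g, query = "\\bUfs\\b")  ## whole-word match via regex word boundaries
```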

an_preferred()

  • I do not really understand from the documentation how duplicated features are resolved.
    I mean: suppose that one feature has three duplicates, from Poland, Argentina, and the UK, and I
    use origin_country = Canada: which of the three duplicates would I get? Apparently,
    I'd get the first one in alphabetical order (for example, for Booth Island I get Argentina). Is there a rationale for this behaviour? I feel this should be better documented, and that
    the identification of the "single" entry for features where the "country" selected
    by the user does not exist should follow some kind of "standard".

  • Would it make sense to allow the user to search ONLY for features specified by
    a given country? Something like: an_preferred(g, "POLAND", strict = TRUE)

  • I think it could make sense to allow users to select the "preferred" entry also using the
    "cga_source_gazetteer" field, as happens in an_filter(). Is there anything preventing this?
    Also, is there any difference between the "country name" and the "cga source"
    fields, or is there a 1-1 match?

minor Also, I'd suggest lengthening the Description in the documentation a bit, to detail how
the selection is done (i.e., by specifying a preferred country).

an_countries(), an_feature_types(), an_cga_sources(), an_gazetters()

minor The documentation for these functions "points" to the documentation of an_filter(),
which however does a different job. I'd propose creating a separate Rd file for
these functions. Its description could be something along the lines of "find out the possible
values available for fine-tuning queries in an_filter".

an_url()

minor I'd suggest changing the function name to an_geturl(), but leave the decision
to the authors.

an_suggest()

I really like this function in association with an_thin to help identify
suitable names for plotting. This kind of functionality would also be beneficial for packages such as geonames. Some minor comments:

  • As for an_filter, I'd suggest allowing users to pass the extent also as an
    sp::bbox object (this would allow, for example, something like:
    suggested <- an_suggest(g, map_extent = sp::bbox(my_sp_object), map_dimensions = c(100, 100)))

  • The description could be improved. For example, I do not really get how the score
    is assigned. The sentence saying that it is "based on how often (and at what map scales)
    they have been named on maps prepared by expert cartographers" is a bit obscure to me.

  • According to documentation, map_extent is ignored if map_scale is provided. However,
    running suggested <- an_suggest(g, map_scale = 10e6) crashes. I guess that comment
    is only valid for map_dimensions.

  • The documentation states that "gaz" should be a "data.frame or SpatialPointsDataFrame,
    as returned by an_read". For completeness, I think however that it should be
    "as returned by an_read, an_preferred or an_filter".

an_thin()

  • The documentation reports that "gaz" should be a "data.frame or SpatialPointsDataFrame,
    as returned by an_read".
    I think however that it should be "as returned by an_suggest".
    Consider that mistakenly trying to run the function on the whole antanym dataset
    led to a huge memory leak, forcing me to restart my laptop. A "sanity check" on
    the input could be beneficial to avoid this (maybe adding an attribute to the data frame
    returned by an_suggest?)

  • Although the user is expected to use the score_col argument to
    specify the "scoring" column, no check appears to be done to see if the column
    exists. I'd suggest adding an assertion on column names in the code, sending
    a meaningful error message in case the column is not present (this could also
    prevent trying to apply the function on an "un_suggested" antanym dataset,
    since the score column is not present).

  • Concerning the score weighting, I'd suggest giving indications concerning
    what a "high" or a "low" value could be.
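The score_col assertion suggested above might look something like this (check_score_col is a hypothetical helper name; assertthat is already in the package's Imports):

```r
## Hypothetical helper: fail early with a meaningful message if the
## scoring column is missing from gaz
check_score_col <- function(gaz, score_col) {
  assertthat::assert_that(assertthat::is.string(score_col))
  if (!score_col %in% names(gaz)) {
    stop("column '", score_col, "' not found in gaz; ",
         "did you pass the output of an_suggest()?", call. = FALSE)
  }
  invisible(TRUE)
}
```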

Comments on Vignettes

  • The vignette duplicates the content of the README. While I think that for the
    readme the content is sufficient, I believe that some additional examples of usage should be
    included in the "full" vignette. I give some suggestions for the authors below.

    • I'd suggest introducing an additional section named something like "Filtering antanym locations",
      where you illustrate the "filtering" functionality in some more detail. After the "feature_id"-based query,
      I'd start off with some simple queries based on place names and regexes, then introduce the
      an_preferred functionality, the feature_type functionality, and finally the an_near
      functionality.

    • The name of the last section could be changed to something like
      "Plotting antanym locations". Then start with a "vanilla" case (i.e., not
      using the suggestion feature), and then introduce an_suggest()

    • Since you mention that you wish to submit to CRAN, and rworldmap is a suggested
      package, I think you should probably wrap the R chunks using it in an
      if (requireNamespace("rworldmap", quietly = TRUE)) construct (the same goes for all examples using it).

    • minor The vignette only illustrates the case of using the package while caching
      to tempdir(). For completeness, you could maybe consider first showing
      the case without cache_dir.
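For the rworldmap point above, the guarded vignette chunk could look like this (a sketch; getMap() is rworldmap's standard map accessor, and the plotting code is a placeholder):

```r
## Only run the rworldmap-dependent example when the package is available
if (requireNamespace("rworldmap", quietly = TRUE)) {
  map <- rworldmap::getMap(resolution = "low")
  plot(map)
  ## ... add antanym labels here ...
}
```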

Comments on Examples

  • In the examples, sometimes a "local" folder is specified (e.g., "c:/temp/gaz").
    It would be better to always use cache_dir = tempdir() to make the examples easier
    to run (this would however be solved automatically by implementing the suggestion
    above regarding rappdirs usage).

  • In the an_filter examples, I'd suggest adding some code comments explaining what
    the various examples do (e.g., "Find locations nearby... etc.")

Comments on Inline documentation

Inline documentation is good. I'd just suggest introducing carriage returns
to shorten long lines in the roxygen sections (I do not think they will be caught
by lintr/goodpractice). This would improve readability while working on the code in
case the "soft wrap" RStudio functionality is off (or the developer does not use RStudio).

Other Comments

Personally, I think that having at least all "major" functions of a package in separate .R
files named after the function allows easier navigation of the code base. However, since
this is not contemplated in the rOpenSci packaging guide, I leave it just as a suggestion.


That's all. Thanks for sharing this interesting package (and thanks to rOpensci for
asking me to review it - Since this was my first review, I hope it fulfills requirements/expectations).

@sckott
Contributor

sckott commented Apr 24, 2018

thanks for your review @lbusett !

@johnbaums - can you get your review in soon?

@raymondben
Member Author

Thanks @lbusett - that's very helpful!

@sckott
Contributor

sckott commented Apr 28, 2018

@johnbaums - can you get your review in soon?

@johnbaums
Contributor

@sckott @raymondben apologies for the slow turnaround and missed notifications. Aiming to submit tonight, tomorrow at the latest.

@johnbaums
Contributor

The download url for the gaz data is currently dead. Any backup url I can use to grab the data, @raymondben? Just had a couple more things I wanted to look at. Not a big deal if not.

An internal server error has occurred. The AADC development team has been notified of this issue and will resolve the problem as soon as possible.

@raymondben
Member Author

Hi @johnbaums - sorry about that, AAD had a complete IT outage last night and looks like our geoserver is a lingering casualty. Have asked for it to be checked, will let you know when it's back on its feet.

@raymondben
Member Author

@johnbaums - IT'S ALIIIIIVE!

@johnbaums
Contributor

johnbaums commented May 2, 2018

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 6


Review Comments

Thanks for the opportunity to review antanym. I see clear value in the package; I have experience working with datasets compiled from multiple sources, and functions that permit sensible and straightforward filtering and subsetting are a lifesaver. I was very impressed by the creative approach to thinning place names (an_thin), using experts' perceptions of importance, for printing on maps - bravo! I believe antanym will be a great contribution to the R community.

I appreciate the effort the authors have made in ensuring the code is clear, concise, readable, and well documented. The package is well structured, with individual functions that perform clear, separate tasks, and internal helpers that prevent clutter and repetition. Data caching is a great convenience - the authors raise an important point about the relevance of this for Antarctic field workers.

The package is very sound from a technical perspective, and I could not find any glaring omissions regarding functionality. I provide a few minor comments below, primarily relating to the docs.


README.Rmd

  • expand SCAR (Scientific Committee on Antarctic Research) on first reference (or use tag).
  • Fix typo: "gazetter" -> "gazetteer"

Antanym.Rmd

  • Fix typo: "gazetter" -> "gazetteer"

DESCRIPTION

  • expand "SCAR" to its full name in the Description field of DESCRIPTION

utils.R

  • "Antarctic" -> "Antarctica" (line 3, utils.R)

an_preferred()

  • The docs for the argument origin_country state "a place name given by another country will be chosen" - chosen how?
  • A couple of object names are a little confusing here: out <- rbind(in_coi, out_coi), out refers to output data, while out_coi refers to rows (IDs) that were not included (i.e., were "out")

an_thin()

  • this.dist <- as.matrix(dist(as.data.frame(gaz)[, c("longitude", "latitude")])) can be replaced with as.matrix(dist(coordinates(gaz)))
  • Add to the docs that NULL can be passed to score_col if you just want uniform thinning without scores (or alternatively, I guess, that score_weighting can be set to 0).
  • this could use sp::spDists (or the Vincenty function used elsewhere in antanym?) to calculate great circle distances if eventual maps are going to be plotted in projected systems. Otherwise clumping might be expected toward the pole.
  • I believe you can replace:
    if (inherits(gaz, "SpatialPointsDataFrame")) {
        sc <- as.data.frame(gaz)[, score_col]
    } else {
        sc <- gaz[, score_col]
    }
    if (inherits(sc, "data.frame")) sc <- unlist(sc)
    with the more efficient
    sc <- gaz[[score_col]]
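The great-circle suggestion above could be sketched as follows (sp::spDists with longlat = TRUE returns pairwise distances in kilometres; this assumes gaz is a SpatialPointsDataFrame in long/lat coordinates):

```r
## Sketch: replace the planar dist() call with great-circle distances
this.dist <- sp::spDists(sp::coordinates(gaz), longlat = TRUE)
```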

an_suggest()

  • Remove or clarify the long comment (below). It currently seems to contradict the comment within the subsequent if statement if (map_scale >= 10e6) {
## scale >= 10e6: we have full coverage (nearly so for 12mill) of all
## scar_common_ids, so use per-feature predictions for scale < 10e6: use
## predictions by feature properties (except maybe if area of interest lies
## within a catalogued map) stations as special case?

an_filter()

  • the query argument is a regular expression, and is split by the line sterms <- strsplit(query, "[ ,]+")[[1]], so I guess place names are expected to be comma- or space-separated. The documentation for that argument should clarify this expectation; rather than stating that it's a regular expression, I would explicitly state that query should be a character string of comma- or space-separated place names, or a regular expression. (Of course regex also works, e.g. query='Place A|Place B|Some other place'.)
  • In relation to the above point, would it make sense to also allow a character vector (length > 1) of place names?
  • add ignore.case=TRUE to grepl calls for feature_type_name, country_name, gazetteer and cga_source_gazetteer for consistency with the function's description ("All text-related matches are treated as regular expressions and are case-insensitive.").
  • To clarify the relevance of an_countries(gaz), an_feature_types(gaz) and an_cga_sources(gaz) to an_filter, it would be good to include use of these functions in the examples.
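The vector-of-place-names idea above could be handled internally with something like this (illustrative pre-processing only, not antanym's current behaviour):

```r
## Hypothetical pre-processing inside an_filter: collapse a character
## vector of place names into an "OR" regular expression
query <- c("Place A", "Place B")
if (length(query) > 1) {
  query <- paste(query, collapse = "|")
}
## query is now "Place A|Place B"
```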

an_read()

  • To account for paths whose parent does not exist, I'd suggest replacing "The cache directory will be created if it does not exist." with "An attempt will be made to create the cache directory if it does not exist." (or use recursive = TRUE, though I don't like functions using that)
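A sketch of attempting to create the cache directory, including missing parents, per the bullet above (the path is illustrative):

```r
## Attempt to create the cache directory, including any missing parents
cache_dir <- file.path(tempdir(), "antanym", "gaz")
if (!dir.exists(cache_dir)) {
  if (!dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)) {
    stop("could not create cache directory: ", cache_dir)
  }
}
```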

General issues

  • The reference link http://www.scar.org/data-products/cga (provided in several of the functions) is dead (website throws 404 Category not found)
  • In the examples for several functions, it's probably better to use tempdir() rather than g <- an_read(cache_directory="c:/temp/gaz")
  • consider including argument names in examples, e.g. include loc= and max_distance= in nms <- an_near(an_filter(g, feature_type = "Island"), c(100, -66), 20). This is probably more important for the vignette, but should also be considered for the examples provided in the function documentation.

Vignette
I'd like to see a more detailed vignette that includes a bit more explanatory text about the gaz data structure, and includes examples of all of the exported functions (including those documented as secondary functions, such as an_countries etc.) provided by the package (there aren't all that many, so it shouldn't be too onerous).


Thanks again for the opportunity to review what I think is a valuable and neatly coded contribution. I'm looking forward to seeing it on CRAN!

@sckott
Contributor

sckott commented May 2, 2018

thanks for your review @johnbaums !


@raymondben continue discussion here ...

@raymondben
Member Author

Thanks @johnbaums, thanks again @lbusett. Will get on to revisions in due course.

@sckott
Contributor

sckott commented Jun 11, 2018

any progress on changes @raymondben ?

@raymondben
Member Author

Ah, sorry, this had slipped off our attention radar. Will get back to it soon.

@raymondben
Member Author

raymondben commented Jun 18, 2018

@lbusett @johnbaums

Thanks for your patience, and thanks again for your reviews. While we think it's reasonable to say that the package was in pretty good shape on submission, it is without question better now - your input has definitely improved it. Short notes against each of your review points are below. Please let us know if further details are needed on any of these. The revised code is currently sitting in the ropenscirev branch of SCAR/antanym.

Lorenzo

when running devtools::test() from RStudio "build panel" many tests fail with a "object 'g' not found"

Fixed.

Though I like the solution used by the authors to allow "caching" of the data in a folder selected by the user (also because the download of the file is quite slow), I think that it could be improved. At the moment, the user has to manually specify the folder where the data will be cached. However, it appears that if they want to use the cached data again in another session, they will have to remember where exactly they saved it in the first place. I think that a better solution could be to use the rappdirs package, so that the data is placed automatically in a "standard" folder (e.g. file.path(user_data_dir(), "antanym")). That way, the user could just use a "flag" argument (e.g., use_cache_dir = TRUE).

The caching interface has been simplified: calling an_read(..., cache = "session") will use a temporary (per-session) directory; cache = "persistent" will use the persistent directory returned by rappdirs, or the user can specify a particular directory via cache = "c:/my/directory".
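
For illustration, the three cache modes described here could be exercised as follows (a sketch based on this description; the an_read documentation is the authoritative reference):

```r
library(antanym)

## per-session cache: uses a temporary directory, discarded when the session ends
g <- an_read(cache = "session")

## persistent cache: uses the rappdirs user cache directory
g <- an_read(cache = "persistent")

## or a user-specified cache directory
g <- an_read(cache = "c:/my/directory")
```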

Download times are a bit "erratic". Sometimes the data is downloaded quickly, other times it can take some minutes. Would it be possible to have the user download a "zipped" version of the dataset? That would allow reducing the download size from 22 MB to about 3 MB

The geoserver installation from which the data are delivered doesn't support compressed CSVs, but all traffic goes through an additional reverse proxy that supports compression of the http traffic. As far as we can tell (looking at the returned server headers and the request size using wireshark) this is indeed being applied to our requests. We suspect that the erratic download times that you are seeing are due to the server being located at the bottom end of the world in Tasmania, which is not ideal for northern hemisphere users. We'll ask the group that hosts the database to look at options - it might be possible for the data to be mirrored elsewhere (e.g. at a SCAR institution in Europe) - but that's beyond the remit of the R package itself.

You could consider modifying the "do_fetch_data" function to allow automatic retry on failure, for a given number of times.

Added.

an_filter()
I'd suggest to allow users to pass the search extent also using an sp::bbox object (This would allow to use, for example, something like: suggested <- an_filter(g, extent = sp::bbox(my_sp_object)))

Done. In fact the extent can now be passed as an sp object, its bbox, a raster object, an Extent, or a numeric vector.
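
To illustrate, the accepted extent forms might be used like so (a sketch only; my_sp_object and my_raster are hypothetical objects, and the xmin, xmax, ymin, ymax ordering of the numeric form is assumed):

```r
## numeric vector (assumed ordering: xmin, xmax, ymin, ymax)
an_filter(g, extent = c(60, 90, -70, -65))

## an sp object, or its bounding box
an_filter(g, extent = my_sp_object)
an_filter(g, extent = sp::bbox(my_sp_object))

## a raster object, or its Extent
an_filter(g, extent = my_raster)
an_filter(g, extent = raster::extent(my_raster))
```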

The documentation/examples for the "query" argument could maybe be improved. It is mentioned in the documentation that it should be a "regular expression", but I think that some simpler examples could help, since many people (me included) often struggle with regexes. For example, something explaining that query = "West Arm" performs an "AND" search, while to do an "OR" search one should use query = "West|Arm", or explaining how to search for a whole word/string (e.g., query = "\bUfs\b" - is there an easier way?).

We've changed the query behaviour: query can now be provided as multiple terms (i.e. a character vector) rather than a single string. Case-sensitivity is controlled by the ignore_case parameter, and the strings can be treated as regexps or fixed strings according to the as_regex parameter. This should make the behaviour a little more obvious.

  • query = "west arm" searches for the phrase "west arm"
  • query = c("west", "arm") searches for terms matching both "West" and "Arm" (e.g. "southwestern arm")
  • query = "west|arm" as before, OR search

And while we don't want to attempt to provide a regexp tutorial, some regexp examples have also been added.

It could be worth mentioning in the help that the search (seems to) consider "separated words" as separate search terms, so that for example an_filter(g, "William Archipelago") matches "William Scoresby Archipelago", which would not happen if using "standard" regex matching. I think that this is a nice and useful behaviour, but it should maybe be documented.

We agree that it's useful, but also on reflection it's probably also good to be able to turn this off and search for the full phrase. Rather than force the user to delve into regular expressions to achieve this, query can now (as mentioned above) be a vector of multiple terms (which all must be matched). So "William Scoresby Archipelago" will now be matched by query = c("William", "Archipelago") or query = "William .* Archipelago" but not query = "William Archipelago".

The search uses by default an ignore_case rule in matching the regex. For example, query = "West" matches with placename "Sylwester". I think it could be useful to provide an option to turn on/off the behavior.

Done, see above.

an_preferred()
I do not really get from the documentation how duplicated features are resolved. I mean: suppose that one feature has three duplicates, from Poland, Argentina and the UK, and I use origin_country = "Canada": which of the three duplicates would I get? Apparently, I'd get the first one in alphabetical order (for example, for Booth Island I get Argentina). Is there a rationale for this behaviour? I feel this should be better documented, and that the identification of the "single" entry for features for which the "country" selected by the user does not exist should follow some kind of "standard".

The original behaviour (that you reviewed) was to choose unmatched names by row order (i.e. pick the first row with the feature in question), and yes that was a bit opaque! There are now two options, controlled by the new unmatched parameter: "random" returns unmatched names choosing from non-preferred sources in random order, and "count" chooses from sources in decreasing order of size (number of entries). The "random" option is the default - in one sense this isn't great, because the results will vary from one call to the next. However, SCAR is an apolitical organisation, and so the antanym package should not implicitly favour any one country's placenames over another's (only when the user specifies it). Since there doesn't seem to be any "correct" ordering to use here, random seems like a reasonable compromise.
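
In code, the two options might look like this (a sketch using the parameter names given in this response):

```r
## default: unmatched names resolved by choosing among
## non-preferred sources in random order
an_preferred(g, origin = "Poland")

## alternative: resolve unmatched names by choosing from sources
## in decreasing order of size (number of entries)
an_preferred(g, origin = "Poland", unmatched = "count")
```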

Would it make sense to allow the user to search ONLY for features specified by a given country? Something like: an_preferred(g, "POLAND", strict = TRUE)

Use a filter after the preferencing: an_filter(an_preferred(g, origin = "Poland"), origin = "Poland")

an_preferred is not intended to remove (filter) any features, it just collapses duplicate names.

Is there any difference between the "country name" and the "cga source" fields, or there is a 1-1 match?

They are basically equivalent, except that entries from the GEBCO undersea feature gazetteer do not have an associated country_name. Also, the cga_source_gazetteer entries are ISO 3-letter country codes rather than full names as in country_name.

To simplify things for the user, we have added a new "origin" column to the data, which is either the full country name, or the organisation name for non-national (i.e. GEBCO) names. This will also help us in the future when additional gazetteers (e.g. subantarctic ones) are added, because those won't have cga_source_gazetteer values (they are not part of the CGA). We can add an appropriate "origin" value for those rows.

The "country_name" and "cga_source_gazetteer" columns are also now NOT included in the data by default (but the user can get them with an_read(..., simplified = FALSE) if they really want them).

I think it could make sense to allow users to select the "preferred" entry also using the "cga_source_gazetter", as happens in an_filter(). Is there anything preventing this?

an_preferred now works from the newly-added "origin" column.

minor Also, I'd suggest lengthening a bit the Description in the documentation to detail how the selection is done (i.e., by specifying a preferred Country).

Added.

an_countries(), an_feature_types(), an_cga_sources(), an_gazetteers()
minor The documentation for these functions "points" to the documentation of an_filter(), which however does a different job. I'd propose creating a separate Rd for these functions. Its "description" could be something along the lines of "find out the possible values available for fine-tuning queries in an_filter".

an_countries and an_cga_sources have been removed, since these columns are superseded by the "origin" column and aren't returned by an_read by default. The an_origins function has been added, and has its own man page (as do an_feature_types and an_gazetteers). Example use of an_origins and an_feature_types has been added to the an_filter help.

an_url()
minor I'd suggest changing the function name to an_geturl(), but leave the decision to the authors.

Done.

an_suggest()
I really like this function in association with an_thin to help identify suitable names for plotting. This kind of functionality would also be beneficial for packages such as geonames. Some minor comments:
As for an_filter, I'd suggest to allow users to pass the extent object also using an sp::bbox object (This would allow to use, for example, something like: suggested <- an_suggest(g, map_extent = sp::bbox(my_sp_object), map_dimensions = c(100, 100)))

an_suggest and an_mapscale now allow the same types of extent inputs as an_filter (bbox, Spatial, Raster, Extent, or numeric).

The description could be improved. For example, I do not really get how the score is assigned. The sentence saying that it is "based on how often (and at what map scales) they have been named on maps prepared by expert cartographers" is a bit obscure to me.

Description expanded: "Features are given a suitability score based on maps prepared by expert cartographers. Data were tabulated from a collection of such maps, indicating for each feature whether it was named on a given map, along with details (such as scale) of the map. These data are used as the basis of a recommendation algorithm, which suggests the best features to name on a map given its properties (extent and scale). This is an experimental function and currently only implemented for map_scale values of 10 million or larger."

According to documentation, map_extent is ignored if map_scale is provided. However, running suggested <- an_suggest(g, map_scale = 10e6) crashes. I guess that comment is only valid for map_dimensions.

Ah yes, the function doc wasn't correct there. Fixed.

The documentation states that "gaz" should be a "data.frame or SpatialPointsDataFrame: as returned by an_read". For completeness, I think however that is should be "as returned by an_read, an_preferred or an_filter"

Updated as you suggest.

an_thin()
The documentation reports that "gaz" should be a "data.frame or SpatialPointsDataFrame: as returned by an_read". I think however that is should be "as returned by an_suggest".

Done.

Consider that mistakenly trying to run the function on the whole antanym dataset led to a huge memory leak, forcing me to restart my laptop. A "sanity check" on the input could be beneficial to avoid this (maybe adding an attribute to the d.f. returned by an_suggests? )

Huh, interesting. We can't reproduce a memory leak/crash (win32/win64/linux64). We just get an error with a "cannot allocate memory" message. Could you please send the output of sessionInfo() so that we can try to reproduce it?

Nevertheless, to avoid the memory allocation issue (or crash) we've added a row_limit parameter to catch large input data frames and a note to the function documentation.

Although the user is expected to use the score_col argument to specify the "scoring" column, no check appears to be done to see if the column exists. I'd suggest adding an assertion on column names in the code, sending a meaningful error message in case the column is not present (this could also prevent trying to apply the function on an "un_suggested" antanym dataset, since the score column is not present).

Actually the thinning will work even if the score_col column is not present - in this case a constant score is assigned to each name and the thinning is based entirely on the spatial distribution of the names. The function documentation has been revised to note this.

Concerning the score weighting, I'd suggest giving indications concerning what a "high" or a "low" value could be.

Scores given by an_suggest range from 0 to 1 (text has been added to an_suggest documentation). But actually the scale of the scores doesn't matter, at least in the current implementation - it is the score rankings that are used to select the best-scored feature, not the actual score values. So it should all still work even if e.g. the user were to assign their own scores on some other scale.

Comments on Vignettes
The vignette duplicates the content of the README. While I think that for the readme the content is sufficient, I believe that some additional examples of usage should be included in the "full" vignette. I give some suggestions for the authors below.

The README has been cut down to give the intro/context, a few examples, and highlight the particular functionality of antanym (suggesting names, resolving multiple names). The vignette has been expanded.

I'd suggest introducing an additional "section separator" named something like "Filtering antanym locations" where you illustrate the "filtering" functionality in some more detail. After the "feature_id" based query, I'd start off with some simple queries based on place names and regexes, then introduce the an_preferred functionality, the feature_type functionality and finally the an_near functionality.

The vignette has now been broken into a number of sections, generally giving increasingly-complex examples in each.

The name of the last section could be changed to something like "Plotting antanym locations". Then start with a "vanilla" case (i.e., not using the suggestion feature), and then introduce an_suggest().

Done.

Since you mention that you wish to submit to CRAN, and rworldmap is a suggested package, I think you should probably wrap the R chunks using it in an if (requireNamespace("rworldmap")) construct (the same goes for all examples using it).

Done.

minor The vignette only illustrates the case of using the package while caching to tempdir(). For completeness, you could maybe consider first showing the case without cache_dir.

The vignette now mentions the cache = "session" and cache = "persistent" options. We'll leave it to the user to consult the function doc to see all examples of the caching behaviour.

Comments on Examples
In the examples, sometimes a "local" folder is specified (e.g., "c:/temp/gaz"). It would be better to always use cache_dir = tempdir() to facilitate running the examples (this would however be solved automatically if implementing the suggestion above regarding rappdirs usage).

All examples now use cache = "session", which uses the session's tempdir().

In the an_filter example, I'd suggest adding some code comments explaining what the various examples do (e.g., "Find locations nearby... etc.")

Added.

Comments on Inline documentation
Inline documentation is good. I'd just suggest however to introduce carriage returns to shorten long lines in the roxygen sections (I do not think that they will be caught by lintr/goodpractice). This would improve readability while working on code in case the "soft wrap" RStudio functionality is off (or the developer does not use RStudio).

Style is to some extent personal preference. Personally (BR) I'm not a particular fan of enforced maximum line lengths in general. I find that typically it's wasteful of screen real estate, and a decent text editor can be configured to soft wrap if that's what the user wants. The long lines have been largely left as is.

Other Comments
Personally, I think that having at least all "major" functions of a package in separate .R files named after the function allows easier navigation of the code base. However, since this is not contemplated in the rOpenSci packaging guide, I leave it just as a suggestion.

Mostly done.

John

README.Rmd
expand SCAR (Scientific Committee on Antarctic Research) on first reference (or use tag).
Fix typo: "gazetter" -> "gazetteer"

Done.

Antanym.Rmd
Fix typo: "gazetter" -> "gazetteer"

Done. And did a recursive grep to make sure there weren't others.

DESCRIPTION
expand "SCAR" to its full name in the Description field of DESCRIPTION

Actually we've just removed "SCAR" here, it's probably not helpful to have that level of detail in the Description field. In fact "SCAR" has mostly been removed throughout: it's enough to refer to the "Composite Gazetteer of Antarctica" without prepending it by "SCAR" each time.

utils.R
"Antarctic" -> "Antarctica" (line 3, utils.R)

Got it. And another one elsewhere too!

an_preferred()
The docs for the argument origin_country state "a place name given by another country will be chosen" - chosen how?

See response to the same question above. Basically, now random by default.

A couple of object names are a little confusing here: out <- rbind(in_coi, out_coi), out refers to output data, while out_coi refers to rows (IDs) that were not included (i.e., were "out")

Changed "in_coi" to "preferred_gaz_rows" and "out_coi" to "not_preferred_gaz_rows"

an_thin()
this.dist <- as.matrix(dist(as.data.frame(gaz)[, c("longitude", "latitude")])) can be replaced with as.matrix(dist(coordinates(gaz)))

Yep, done, and similarly in an_near.

Add to the docs that NULL can be passed to score_col if you just want uniform thinning without scores (or alternatively, I guess, that score_weighting can be set to 0).

Done. And yes, a score_weighting of 0 achieves the same as score_col=NULL (or all scores equal). Tests added to confirm this.
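
A sketch of the equivalence described (the number-of-names parameter is written here as n for illustration; it is not confirmed by this thread):

```r
## uniform (spatial-only) thinning, two equivalent ways:
thinned1 <- an_thin(suggested, n = 10, score_col = NULL)
thinned2 <- an_thin(suggested, n = 10, score_weighting = 0)
```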

this could use sp::spDists (or the Vincenty function used elsewhere in antanym?) to calculate great circle distances if eventual maps are going to be plotted in projected systems. Otherwise clumping might be expected toward the pole.

We tried great-circle distances here (using geosphere's implementation) but we are calculating N^2 distances and great-circle becomes prohibitively slow, even for modestly-sized data frames. You are right about the downside, it will tend to over-select names at high latitudes. However, the bias doesn't seem too bad (perhaps because there are many more names at lower latitudes (coastal Antarctica) than high latitudes?). But we'll look at improvements as time permits. The new hypertidy/geodist package might offer a way forward.

(Minor update: a bit of experimentation suggests that hypertidy/geodist is fast enough to use here. But we'll wait until it's had some time to stabilize before incorporating it - it's still WIP).

I believe you can replace:
[snipped]
with the more efficient
sc <- gaz[[score_col]]

Done.

an_suggest()
Remove or clarify the long comment

Comment broken up into the two if-else parts, and wording clarified.

an_filter()
the query argument is a regular expression, and is split by the line sterms <- strsplit(query, "[ ,]+")[[1]], so I guess place names are expected to be comma- or space-separated. The documentation for that argument should clarify this expectation; rather than stating that it's a regular expression, I would explicitly state that query should be a character string of comma- or space-separated place names, or a regular expression. (Of course regex also works, e.g. query='Place A|Place B|Some other place'.)

The query behaviour has now changed, see comment above.

In relation to the above point, would it make sense to also allow a character vector (length > 1) of place names?

The query param can now be a vector, in which case a name has to match all entries in it in order to be matched (e.g. query=c("bert", "lake") will match "bert lake", "lake bob bert", "albert lakelet", etc). For OR-matching, a regex will still be needed (e.g. your query="Place A|Place B|Some other place").
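
For example, following the matching rules described in this response:

```r
## AND-matching: every term in the vector must match
an_filter(g, query = c("bert", "lake"))
## matches e.g. "bert lake", "lake bob bert", "albert lakelet"

## OR-matching still uses a regular expression
an_filter(g, query = "Place A|Place B|Some other place")
```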

add ignore.case=TRUE to grepl calls for feature_type_name, country_name, gazetteer and cga_source_gazetteer for consistency with the function's description ("All text-related matches are treated as regular expressions and are case-insensitive.").

Done. Well actually done slightly differently, because grepl grumbles if both ignore.case and fixed are TRUE. But done.
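
For context, base R's grepl warns that ignore.case is ignored when fixed = TRUE. One common workaround (not necessarily the one used in antanym) is to lower-case both pattern and text before doing a fixed match:

```r
## case-insensitive fixed-string matching without the grepl warning:
## normalise case on both sides, then match with fixed = TRUE
grepl(tolower("West Arm"),
      tolower(c("west arm", "WEST ARM", "Sylwester")),
      fixed = TRUE)
## -> TRUE TRUE FALSE
```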

To clarify the relevance of an_countries(gaz), an_feature_types(gaz) and an_cga_sources(gaz) to an_filter, it would be good to include use of these functions in the examples.

Done - these are mentioned in the an_filter help, and now have their own man pages too.

an_read()
To account for paths whose parent does not exist, I'd suggest replacing "The cache directory will be created if it does not exist." with "An attempt will be made to create the cache directory if it does not exist." (or use recursive = TRUE, though I don't like functions using that)

Wording changed (note also that the caching parameter has changed, see above, but your suggestion still applies when the user provides a specific caching directory - e.g. cache = "c:/my/directory").

General issues
The reference link http://www.scar.org/data-products/cga (provided in several of the functions) is dead (website throws 404 Category not found)

Whoops, SCAR's site got updated. Fixed.

In the examples for several functions, it's probably better to use tempdir() rather than g <- an_read(cache_directory="c:/temp/gaz")

Now superseded by an_read(cache = "session")

consider including argument names in examples, e.g. include loc= and max_distance= in nms <- an_near(an_filter(g, feature_type = "Island"), c(100, -66), 20). This is probably more important for the vignette, but should also be considered for the examples provided in the function documentation.

Done - though the first argument (usually the gazetteer object) hasn't been named if it's also the first argument in the function parameter list, since this seems to be idiomatic R usage.
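
So the an_near example from the review becomes, under that convention:

```r
## gazetteer object left positional; remaining arguments named
nms <- an_near(an_filter(g, feature_type = "Island"),
               loc = c(100, -66), max_distance = 20)
```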

Vignette
I'd like to see a more detailed vignette that includes a bit more explanatory text about the gaz data structure, and includes examples of all of the exported functions (including those documented as secondary functions, such as an_countries etc.) provided by the package (there aren't all that many, so it shouldn't be too onerous).

All functions now have their own man page. Vignette expanded, as noted above, including a description of the columns in the gaz data (these are also documented in the an_read function documentation).

@sckott
Contributor

sckott commented Jun 20, 2018

@lbusett @johnbaums are you happy with revisions/comments from the maintainer? feel free to make any suggestions/comments here if you have anything further. if nothing, that's okay too, either way let me know

@raymondben
Member Author

(Comments particularly welcome on the revised an_filter(query=...) behaviour - is it more intuitive now?)

@lbusett
Contributor

lbusett commented Jun 20, 2018

Hi all,

I am a bit busy at the moment, but should be able to have a look at the implemented changes by the beginning of next week.

@lbusett
Contributor

lbusett commented Jul 9, 2018

@raymondben @sckott

sorry for being late on this, but I had very limited time in the last two weeks.

I had a look at the changes: great work! All points raised in my review were nicely addressed. I particularly like the changes introduced in the an_filter() function and the improvements in the vignette (one small thing: maybe you could specify in the vignette that when using cache = "persistent" the dataset is saved in rappdirs::user_cache_dir(), as is done in the documentation). The package is now ready for approval, in my opinion.

Lorenzo

@sckott
Contributor

sckott commented Jul 9, 2018

thanks @lbusett !

@johnbaums are you happy with changes/responses?

@johnbaums
Contributor

@sckott @raymondben

Thanks for looping me in on the changes, and sorry for my abysmally slow handling of GitHub notifications. I'm very satisfied - great package!

@sckott
Copy link
Contributor

sckott commented Jul 13, 2018

Approved! Thanks again for your submission @raymondben

  • Please transfer the package to ropensci - I've invited you to a team within our ropensci organization. After you accept you should be able to move it. You'll be made admin once you do.
  • Make sure to
    • add rOpenSci footer to README
      [![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
    • Change any needed links, such as those for CI badges
    • Travis CI should update to the new location automatically - you may have to update other CI systems manually
    • add ropensci review badge to your readme [![](https://badges.ropensci.org/198_status.svg)](https://github.com/ropensci/onboarding/issues/198)
  • In a new .github folder in the root of the pkg, add a CONTRIBUTING.md file, an issue template file, and a PR template file, see https://github.com/ropensci/dotgithubfiles/ for examples, you can copy them directly and edit as you wish

@raymondben
Member Author

Super, thanks @lbusett and @johnbaums once again for your time and attention to detail.

@raymondben
Member Author

@sckott repo transferred, will finish changing links and so on once I'm admin

@sckott
Contributor

sckott commented Jul 13, 2018

okay, try again

@raymondben
Member Author

All done, I think?

@sckott
Contributor

sckott commented Jul 15, 2018

See comment above about a blog post. thoughts on that?

@raymondben
Member Author

raymondben commented Jul 16, 2018

Yup, is a good idea. We're thinking of a more comprehensive post (this package plus others relevant to Antarctic stuff), but will be aiming to do something, anyway. We'll chat with @stefaniebutland in due course.

@sckott
Contributor

sckott commented Jul 16, 2018

Okay, thanks!
