rdefra package #68

Closed · 13 tasks done
cvitolo opened this issue Aug 8, 2016 · 53 comments
Comments

@cvitolo commented Aug 8, 2016

Summary

  • What does this package do? (explain in 50 words or less):
    The rdefra package retrieves air pollution data and metadata from the Air Information Resource (UK-AIR) of the Department for Environment, Food and Rural Affairs in the United Kingdom. UK-AIR does not provide a public API for programmatic access to data, therefore this package scrapes the HTML pages to get relevant information.
  • Paste the full DESCRIPTION file inside a code block below:
Package: rdefra
Type: Package
Title: Interact with the UK AIR Pollution Database from DEFRA
Version: 0.2.0
Date: 2016-08-03
Author: Claudia Vitolo [aut, cre], Andrew Russell [aut], Allan Tucker [aut]
Maintainer: Claudia Vitolo <cvitolodev@gmail.com>
URL: https://github.com/kehraProject/r_rdefra
BugReports: https://github.com/kehraProject/r_rdefra/issues
Description: Get data from DEFRA's UK-AIR website. It basically scrapes the HTML content.
Depends: R (>= 2.10)
Imports: RCurl, XML, plyr
Suggests: testthat
LazyData: true
Encoding: UTF-8
License: GPL-3
Repository: CRAN
RoxygenNote: 5.0.1
  • URL for the package (the development repository, not a stylized html page):
    https://github.com/kehraProject/r_rdefra
  • Who is the target audience?
    Scientists and researchers interested in air pollution data and epidemiologists.
  • Are there other R packages that accomplish the same thing? If so, what is different about yours?
The openair package (https://github.com/davidcarslaw/openair) accomplishes similar things but relies on a local and compressed copy of the data on servers at King's College (UK), periodically updated. I have used the openair package myself in the past and it is an excellent package (for data retrieval and visualisation), but I had trouble with King's College server downtime.
    The rdefra package, instead, retrieves the information directly from the original source with the advantage that users always get the most complete information at any time. This package also integrates a function to retrieve missing coordinates from the standard metadata.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration with Travis CI and/or another service.

Publication options

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

R CMD check succeeds. There were no ERRORs, WARNINGs or NOTEs.

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
  • If this is a resubmission following rejection, please explain the change in circumstances:
  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
@sckott commented Aug 8, 2016

Editor checks:

  • Fit: The package meets our criteria for fit and overlap
  • Automated tests: Package has some tests, but none test actual functions in the package
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

@cvitolo Thanks for your submission! Two things before I get reviewers:

  • you do have a tests directory, but AFAICT they don't test the functions in the package. Can you add at least a few tests that test the functionality of the exported functions?
  • You did check that you have a vignette, but the vignette is in the parent directory, so it is not included with the package. Please move the vignette directory into the package itself.

Reviewers: @masalmon @haozhu233
Due date: 2016-08-30

@cvitolo commented Aug 9, 2016

Hi Scott,

Many thanks for looking at my package.
I have made the changes you mentioned.

Kind regards,
Claudia

@sckott commented Aug 9, 2016

thanks @cvitolo - reviewer assigned - seeking one more

@maelle commented Aug 10, 2016

@cvitolo I have a first question, why do you use a sub-directory for the package itself in the repository? I'm not saying it's wrong, I'm just wondering. :-)

@cvitolo commented Aug 10, 2016

@masalmon I use a sub-directory for the package so that, in the parent folder, I can have the package itself and also add all the work (e.g. experiments, presentations, demos, tutorials, posters, papers, etc.) related to the package. I could move everything to the package folder and add the additional files to .Rbuildignore, but I personally prefer to leave the package folder as clean as possible. It is just a personal preference, but I'm happy to adapt to best practices if needed.

@maelle commented Aug 10, 2016

@cvitolo Ok thank you for your quick answer. :-)

I see some commented code such as

  # library(RCurl)
  # library(XML)
  # uka_id <- "UKA00399"
  # uka_id <- "UKA15910"

or

 # library(RCurl)
  # library(XML)
  # library(plyr)
  # site_id <- "ABD"
  # years <- 1972:2014

  # Site with list of flat files
  # rootURL <- "http://uk-air.defra.gov.uk/data/flat_files?"
  # myURL <- paste(rootURL, "site_id=", site_id, sep = "")
  # download html
  # html <- getURL(myURL, followlocation = TRUE)
  # parse html
  # doc = htmlParse(html, asText=TRUE)
  # hrefs <- xpathSApply(doc, '//*[@id="center-2col"]', xmlGetAttr, 'href')
  # Otherwise
  # html <- paste(readLines(myURL), collapse="\n")
  # library(stringr)
  # matched <- str_match_all(html, "<a href=\"(.*?)\"")

Why is it commented? If it's not useful any longer, maybe you could delete it?

@cvitolo commented Aug 10, 2016

@masalmon Thanks for spotting that. I have just removed all the old commented code and committed the changes.

@maelle commented Aug 10, 2016

@cvitolo how long does it take for the vignette to build on your PC? It's taking ages on mine (but it seems to work, just slowly!).

@cvitolo commented Aug 10, 2016

@masalmon It can take a few hours because it is parsing lots of HTML pages.

@maelle commented Aug 10, 2016

@cvitolo oh, maybe the examples are a bit too long for a vignette then? 😄 Maybe you could do the same for one of the countries only (e.g. Northern Ireland)?

Now I have the built vignette. ✨

@maelle commented Aug 10, 2016

General

  • This is a very cool package that allows users to access data from a website without an API, without having to parse the webpages themselves. I’m very happy to see all the data it provides access to! Well done! 😸
  • I could build, test and check the package (on Windows). It behaves as expected.
  • Why is the package called rdefra and not ukair or rukair? Do you intend to add other DEFRA data sources? If not, ukair might be easier to understand but I know your package is on CRAN already so you might not want to change its name. ;-)
  • devtools::spell_check() finds two mistakes:
    metres stations.Rd:24
    ourly get1Hdata.Rd:18
  • The repository is currently lacking a code of conduct (see https://github.com/ropensci/onboarding/blob/master/packaging_guide.md#conduct ) which would help potential contributors see your repo is a friendly place. ;-)
  • In your DESCRIPTION file, you should use the Authors@R syntax https://github.com/ropensci/onboarding/blob/master/packaging_guide.md#authorship (Authors@R is then parsed automatically when building the package; see the sketch after this list).
  • The DESCRIPTION (and the vignette + README) should include a link to UK air’s website.
  • Have you told people at UK air about your package? And can their website “resist” many people querying it (I have no experience about such issues)?
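
For the Authors@R point above, a minimal sketch based on the authors listed in the DESCRIPTION pasted earlier in this thread (roles copied from its Author field):

    Authors@R: c(
        person("Claudia", "Vitolo", email = "cvitolodev@gmail.com",
               role = c("aut", "cre")),
        person("Andrew", "Russell", role = "aut"),
        person("Allan", "Tucker", role = "aut"))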

Code

  • The function names use upper and lower case. It’d be easier to remember their names if they were all lower case. For ropenaq I was advised to prefix every function with a package-specific string, which I really like now. For instance in your case the functions could be ukair_catalogue, ukair_eastingnorthing (or something easier such as ukair_impute_geo?), ukair_site, ukair_data. The reviewer that had told me this for ropenaq had written that “This has the added benefit of playing very nicely with RStudio, which will show all the package functions as completions when users type ukair_ and hit tab”.
  • Something that you could add are more thorough checks of the parameters supplied by the user. E.g. if the user gives “London” as a site ID in get1Hdata, you could check “London” against the list of site IDs and then output an error message saying that “London” is not a site ID. It’ll be a bit of (boring) work for you, but it would make the package more user-friendly if the functions fail informatively (see the sketch after this list). Currently, if one asks for data about a site that doesn’t exist, the URL doesn’t exist and you output a NULL object; maybe you could add a warning and describe the case in the function documentation?
  • When parsing data in get1Hdata, you could add a dependency to lubridate and output a datetime column instead of a Date and a time column (all measures are in the same timezone so you can use it)
  • Similarly in the stations data and in the catalogue you could output Start.Date and End.Date as a date.
  • Your functions output very big data.frames. Why not output tbl_df/tibble instead so that they’re easier to print? For instance the end of get1Hdata would become return(tibble::as_tibble(df)). If you switch from read.csv to readr::read_csv (well, that means another dependency) you’d get a tbl_df directly; I’m not sure what’s best.
  • For the measures, is it possible to get the corresponding units? It would be very useful information ;-)
  • The EastingNorthing function returns “A named vector containing Easting and Northing coordinates”. Why not a data.frame? It might be even easier if the function took the whole catalogue as an input and gave the “newly populated” catalogue as an output.
  • You write
    message("Please insert a valid year (or sequence of years).")
    stop

You should write

    stop("Please insert a valid year (or sequence of years).")
  • If you add dplyr as a dependency you could use standard evaluation, e.g.
    df <- df[, -col2rm]
    df$site_id <- site_id
Would become

    df <- dplyr::select_(df, quote(-col2rm))
    df <- dplyr::mutate_(df, lon = lazyeval::interp(~site))

Or possibly something easier. I’m not a standard evaluation expert, maybe someone has better arguments? Or is it even best practice for rOpenSci? There’s a vignette in dplyr about NSE https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html and a chapter in Hadley’s advanced R book http://adv-r.had.co.nz/Computing-on-the-language.html

  • Does the cached version of stations include “guessed” coordinates?
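
To illustrate the informative-failure point above, a minimal sketch (the function name and site list are illustrative, not the package’s actual API):

    # Fail with an informative error when the site ID is unknown
    known_sites <- c("ABD", "MY1", "KC1")
    check_site_id <- function(site_id) {
      if (!site_id %in% known_sites) {
        stop("'", site_id, "' is not a valid site ID; see the catalogue of stations.",
             call. = FALSE)
      }
      invisible(site_id)
    }

    check_site_id("London")
    # Error: 'London' is not a valid site ID; see the catalogue of stations.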

Dependencies

  • The packages you use in the vignette (raster, sp, leaflet) should be in Suggests: in DESCRIPTION.
  • Regarding XML parsing, rOpenSci guidelines mention “For parsing XML, if you only need to parse XML data, xml2 is a good option as it makes parsing easier. However, if you need to create XML (in addition to parsing it), use the XML package. The XML package is more low level than xml2, and allows finer manipulation, and may be favored by those more familiar with XML data.” I guess you can keep using XML?
  • The guidelines also mention “For http requests we strongly recommend using httr over RCurl.” Since you only use two RCurl functions, could you switch to httr? I think that http_error and GET are the functions you would use.
  • Regarding getURL, I don’t know if it behaves as GET but – as a nice ropenaq reviewer told me – with httr::GET you can use a list of arguments to build the query. This can make constructing the call a bit cleaner, so that you don't have to paste0 them all together with &s. You can check the arguments near the top of the function, then construct a list of all arguments to pass to the query argument of GET. For example:
url <- "http://uk-air.defra.gov.uk/networks/find-sites?"
closed <- FALSE
pollutant <- 9999
arg_list <- list(closed = closed, pollutant = pollutant)
httr::GET(url, query = arg_list) # Any NULL elements of the list supplied to the query parameter are automatically dropped
  • I think you could replace plyr::rbind.fill with dplyr::rbind_all. I don’t know whether it’s better, but dplyr is generally easier than plyr and I’ve read it’s faster.

Readme and vignette

  • I know I have a big conflict of interest here, but since you mention the openair package in the README, why not mention ropenaq as well? :-) OpenAQ has been providing access to the values of https://uk-air.defra.gov.uk/latest/currentlevels for a few months now, and ropenaq allows you to get that data, and data from other countries as well.
  • In the README/vignette, you could mention the temporal coverage of the data, and the parameters that are present in the data (PM2.5? O3?)
  • The vignette has no title and in the first line rdefra is written Rdefra.
  • If you want to keep the installation instructions in the vignette, then rather set eval = FALSE for those chunks.
  • The vignette takes quite a while to build. Why not use smaller examples, e.g. only building the map for Northern Ireland? It might be useful to add a chunk for not re-building the vignette on CRAN. I had to do this in the first chunk of ropenaq general vignette https://raw.githubusercontent.com/ropenscilabs/ropenaq/master/vignettes/Ropenaq-vignette.Rmd
  • The number of stations could change, I guess, therefore you could replace the numbers in “There are 6563 stations with valid coordinates within the UK-AIR (Air Information Resource, blue circles) database, for 225 of them hourly data is available and their location is shown in the map below (red circle).” by inline code such as `r nrow(IDstationHdata)` stations, so that the text updates if the number of stations changes (see the sketch after this list).
  • I think you can filter the spatial data frame without over (and therefore with one dependency fewer) with stationsSP[England, ].
  • At the end of your vignette you mention the README, but users that installed the package from CRAN might not know where to find it. Why not add the code and explanations from the README in an eval=FALSE chunk?
  • In the vignette, you download data but you don’t show it. You could print the data.frame and maybe even add a plot?
  • After downloading the boundaries for the UK, maybe you should add file.remove(“GADM_2.8_GBR_adm1.rds”)?
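
For the dynamic station counts suggested above, a minimal R Markdown sketch (the object names IDstationHdata and stationsSP are taken from the review comments above, not necessarily the vignette’s actual names):

    There are `r nrow(stationsSP)` stations with valid coordinates within the
    UK-AIR database; for `r nrow(IDstationHdata)` of them hourly data is
    available.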

Documentation

  • You have generated the NAMESPACE by hand, but you could actually rely on roxygen2 to do it for you (see the sketch after this list). You can add #' @noRd to internal functions and #' @export to exported functions. For dependencies you can add #' @importFrom httr GET at the top of the file or write httr::GET each time you use a function from another package (I always forget which way is the best one).
  • It's best practice to also write the data documentation in a .R file that’ll be parsed by roxygen2. See for instance http://r-pkgs.had.co.nz/data.html
  • The package lacks a top-level documentation entry for ?package_name (see e.g. https://github.com/ropenscilabs/rnaturalearth/blob/master/R/rnaturalearth-package.r ).
  • In the documentation of get1Hdata, you could explain how to input several years instead of one (the example is in the vignette but not in the function documentation).
  • The documentation of getSiteID should state where to find the IDs (from the catalogue I guess?).
  • The documentation of stations (and the vignette) could say how much one can expect the catalogue to change from one month to another for instance.
  • In the catalogue function documentation, could you explain how to find the country numbers / region_id / pollutants / etc.? I see you had started doing this in the details. You could even replace the numbers in input by something easier, e.g. the user would write pollutant = "o3" and you would translate it internally to the pollutant code.
  • In the catalogue function, you could describe more precisely what the three last arguments do. You could also skip them entirely if you think the non default mode doesn’t make much sense for users of your package.
  • Examples that shouldn’t be run should be packed in \dontrun{}, e.g.
    #' @examples
    #' \dontrun{
    #' get1Hdata("ABD", "2014")
    #'}
    #'
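
A minimal roxygen2 sketch of the points above (the names follow the ukair_ scheme suggested earlier; the body is elided):

    #' Get hourly data for a given station
    #'
    #' @param site_id Site identifier, e.g. "ABD".
    #' @param years Year or sequence of years, e.g. 2012:2014.
    #' @importFrom httr GET
    #' @export
    #' @examples
    #' \dontrun{
    #' ukair_get_hourly_data("ABD", 2014)
    #' }
    ukair_get_hourly_data <- function(site_id, years) {
      # ...
    }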

Tests

  • Good job on now testing the package functions! Now, the tests are in 2 files. Best practice is to have one file for each function and to use context() at the top, e.g. in test-Catalogue.R the first line would be context("catalogue function") (see the sketch after this list).
  • Why not add a codecov.io badge to your repository? See https://github.com/jimhester/covr
  • Moreover, regarding tests, you should add testthat::skip_on_cran() before tests that might take a long time to run (the tests will still run on Travis and locally)
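
A minimal testthat sketch combining the two points above (the expectation is illustrative):

    # tests/testthat/test-catalogue.R
    context("catalogue function")

    test_that("the catalogue is a non-empty data frame", {
      testthat::skip_on_cran()
      result <- catalogue()
      expect_gt(nrow(result), 0)
    })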

I don’t know how much time doing this took me; I had parallel tasks and I won’t count the vignette building, ahah. I’d say a few hours? I’m happy to help if you have any questions!

@sckott commented Aug 10, 2016

thanks @masalmon! Can you give an estimate of the time it took to do the review?

@maelle commented Aug 10, 2016

@sckott not sure, see above, 2-3 hours?

@sckott commented Aug 10, 2016

thanks!

@cvitolo commented Aug 12, 2016

@masalmon Thanks a lot for your prompt and thorough review. I have already simplified the vignette. I will work on the other changes once I receive the second review. Many thanks again!!!

@haozhu233 commented Aug 12, 2016

First, I really like the idea of this package as it provides R users a relatively easy way to access the UK-AIR data, especially when the data source does not have a public API. Great job, @cvitolo! :)

Also, thank you, @masalmon, for providing such a thorough review for this package. Your review covers many points I wanted to make. I'll try not to overlap our points too much. ;)

Comments

  • As @masalmon suggested, instead of using getURL and the XML package, I would also recommend you use httr and xml2. XML is powerful, but in this case we can just take advantage of the convenience brought by xml2. Here is a short example of how to use that in the case of your catalogue() function.
library(httr)
library(xml2)

catalogue_fetch <- GET(
  url = "http://uk-air.defra.gov.uk", 
  path = "networks/find-sites", 
  query = list(
    site_name = "", pollutant = 9999, 
    group_id = 9999, closed = "true", 
    country_id = 9999, region_id = 9999, 
    location_type = 9999, search = "Search+Network", 
    view = "advanced", action = "results")
  )

catalogue_content <- content(catalogue_fetch)

# Please see the second comment
catalogue_csv_link <- xml_find_first(catalogue_content, "//*[contains(@class,'bCSV')]")
catalogue_csv_link <- xml_attr(catalogue_csv_link, "href")
  • I noticed that in many places you used the location of a <p> tag to locate the value you are looking for. For example, again in your catalogue() function, you used '//*[@id="center-2col"]/p[3]/a' as the xpath. I would recommend you try not to use things like p[3] in this package, as the contents of a webpage may change at any time. They may add a <p> tag if they change the descriptions. Instead, you can try to use a class as a locator to target your csv link, as I showed in the example above.
  • Some functions to fix the missing metadata are really slow. When I was going through the sample code in the README, it seemed to take forever for EastingNorthing() to accomplish its query. So I ran a check and I got
system.time(EN <- EastingNorthing(stationList[1:10]))
#   user  system elapsed 
#  0.170   0.032   8.174 

In the sample code, stationList is a vector with a length of 2763, which basically means it will take around 40 minutes to finish that line of code. My question is: since the information you are trying to fetch here is relatively static and small (it is just the location data for different sites), would you mind hosting a pre-cached version of the data somewhere (maybe GitHub)? You can then write a function that fetches data from one link (GitHub) instead of asking each end user to send thousands of queries to the DEFRA server.

  • You could consider adding an HTTP status check in case your GET or getURL call returns something other than what you expected. If users of your package enter something that is not acceptable to the query and the server returns something other than a status 200, the function should stop and return an error (see the sketch after this list).
  • In EastingNorthing() and getSiteID(), instead of using do.call, rbind and lapply, you may consider using sapply. You can try t(sapply(stationList[1:10], EastingNorthing_internal)) to see the results.
  • You may like to use paste0() to replace paste(sep="").
  • In get1Hdata, length(as.list(years)) == 0 might be too weak to validate the input.
  • In get1Hdata, you created a variable id to track whether each attempt succeeded, but you didn't use id outside the for loop. Maybe you want to return id as an attribute or a message when something fails?
  • In get1Hdata_internal, instead of using if (url.exists(myURL)){df <- read.csv(myURL, skip = 4)[-c(1),]...}, you may want to use something like the following (see also the fuller sketch after this list), because sometimes even if a url exists, it doesn't mean you can download a csv from that url.
status <- try(read.csv(myURL, skip = 4)[-c(1),])
if (class(status) == "try-error") return(NULL)
  • You may want to limit your code line width to the 80-character standard code width.
  • I also recommend you add dplyr, tibble and readr as package dependencies.
  • I also recommend you use roxygen to manage your package NAMESPACE. To export a function, you can add @export. To import some functions from a package, you can add @importFrom RCurl getURL url.exists.
  • You can put the tests folder under the root directory.
  • When you write your test cases, it would be better if you wrote the test_that descriptions in statement form. You should also try to keep them as brief as possible.
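
A fuller sketch of the status-check and try() points above (the URL argument and skip value are illustrative; inherits() replaces the direct class() comparison):

    get_csv_safely <- function(myURL) {
      response <- httr::GET(myURL)
      # Stop early if the server did not return the page we expected
      if (httr::status_code(response) != 200) {
        stop("Request failed with HTTP status ", httr::status_code(response))
      }
      # Even an existing URL may not serve a parseable csv, so guard the read
      status <- try(read.csv(myURL, skip = 4)[-1, ], silent = TRUE)
      if (inherits(status, "try-error")) return(NULL)
      status
    }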

The review process took me around 3 hours

@sckott commented Aug 12, 2016

@haozhu233 thanks for your review!

@cvitolo both reviews are in, continue conversation here and let us know if you have any questions.

@cvitolo commented Aug 13, 2016

@sckott thanks for finding reviewers. @haozhu233 @masalmon Thanks for reviewing my package so quickly. I'll start working on this right now.

@cvitolo commented Aug 16, 2016

rdefra - response to reviewers

Many thanks again for reviewing my package. Below are my responses to reviewers' comments.

Reviewer 1 (@masalmon)

General

  • Regarding the name of the package: in 2010 DEFRA launched their Open Data Strategy and mentioned the intention of opening up many environment-related datasets. I started working on the UK-AIR data archive, but this is only one of the data sources available. In future releases of rdefra, I could use the same logic to get data from other DEFRA archives. However, if, based on best practice, ukair would be a more suitable name for the package, I am happy to change it.
  • Spelling mistakes are now fixed. Please note that 'metres' is correct (as in metres above sea level)
  • A code of conduct is now added to the package
  • I modified the DESCRIPTION file and used the Authors@R syntax
  • The DESCRIPTION, vignette and README files now include a link to UK air’s website.
  • I have contacted DEFRA and informed them about the rdefra package.

Code

  • Very good suggestion! All the names of the functions in rdefra are now in lower case and have a prefix (ukair):
    • catalogue() is now called ukair_catalogue()
    • EastingNorthing() is now called ukair_get_coordinates()
    • get1Hdata() is now called ukair_get_hourly_data()
    • getSiteID() is now called ukair_get_site_id()
  • I added some if statements to the functions to check the validity of the user's parameters.
  • The date and time columns in the object returned by ukair_get_hourly_data() are now merged in the column datetime.
  • Start.Date and End.Date in the stations object are now dates.
  • All the data frames returned by the rdefra functions are now tbl_df/tibble for better checking and printing capabilities
  • Measurements are usually in µg/m³ (micrograms per cubic metre); this is specified in the documentation of ukair_get_hourly_data(). However, I have added a new parameter called keepUnits that allows users to retrieve the units for each observed variable (see the sketch after this list).
  • The ukair_get_coordinates() function now returns a data.frame; however, it takes a long time to run it for the full catalogue, therefore it is not convenient to generate a newly populated catalogue every time we run this function. However, I have cached a version of the catalogue and saved it as stations.rda (in the data folder). This contains all the coordinates and site identification numbers that can be retrieved using ukair_get_coordinates() and ukair_get_site_id() respectively.
  • I fixed the problem with message/stop.
  • Regarding the use of standard evaluation, I see your point: the code would be easier to read and probably run faster. For the time being I would prefer not to include too many dependencies and to stick with base R as much as possible. But for a future release I could run a benchmark using the two options and see which one works best for this package.
  • Yes, the cached version of stations includes all the coordinates that can be retrieved using ukair_get_coordinates().
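
An illustrative use of the new parameter described above (the site ID is taken from examples elsewhere in this thread; the units table is stored as an attribute of the returned tibble):

    output <- ukair_get_hourly_data("ABD", 2014, keepUnits = TRUE)
    attr(output, "units")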

Dependencies

  • I have added testthat, raster, sp, leaflet, parallel to Suggests in the DESCRIPTION file
  • I have replaced functions from the XML package with the corresponding functions from the xml2 package, as both you and @haozhu233 suggested.
  • The package now uses httr in place of RCurl
  • I replaced plyr::rbind.fill with dplyr::rbind_all

Readme and vignette

  • ropenaq is now mentioned in the README file, along with openair. That was absolutely fair, sorry if I missed that before.
  • In the README/vignette, I mentioned that data has been collected since 1972 and now added the most common species monitored and their typical units.
  • The vignette now has the same title as the JOSS paper; I also fixed the typo (Rdefra vs rdefra).
  • I set eval = FALSE for the installation instructions in the vignette
  • Now the vignette builds quickly, as the map only contains stations in the Reading/Wokingham Urban Area.
  • I added in-line code to the vignette so that the number of stations is automatically updated.
  • I removed the spatial filter from the vignette.
  • At the end of the vignette now is the code to retrieve data using multiple cores using the parallel package (with eval=FALSE).
  • The filtered stations are now printed concisely with as_tibble().

Documentation

  • The NAMESPACE is now generated by roxygen2.
  • The data documentation is now generated by roxygen2.
  • I have added a top-level documentation for the rdefra-package.
  • I have added an example in the ukair_get_hourly_data() documentation to show how to download data for multiple years.
  • The documentation of ukair_get_site_id() now states that the ID can be found in the cached catalogue (stations.rda in the data folder).
  • I have added to the documentation of stations (and the vignette) that the network is expected to expand over time.
  • In the ukair_catalogue() function documentation, I added all the information I could find (scraping DEFRA web pages) about country numbers / region_id/pollutants/etc. Unfortunately there is no API and therefore no related documentation on the website.
  • In the ukair_catalogue() function I removed the last three arguments because they are not important for the typical user of the package.
  • Examples that shouldn’t be run are now packed in \dontrun{}

Tests

  • Hadley Wickham (https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf) suggests that a 'context' should group tests for related functionality. In my interpretation, I thought I could group tests related to the type of retrieved output: data and metadata. For instance, in the first group I test the functions that generate metadata as well as the connections to the related webpage. In the second group I test the functions that generate data as well as the connections to the related files. Would this approach be considered unacceptable?
  • I added a codecov.io badge to the rdefra repository.
  • I have used testthat::skip_on_cran before the last metadata test.

Reviewer 2 (@haozhu233)

  • I have replaced functions from the XML package with the corresponding functions from the xml2 package, as both you and @masalmon suggested.
  • Now I use a class as a locator to target the csv link in ukair_catalogue(), thanks for providing an example! However, I can't work out a similar approach for ukair_get_coordinates_internal() and ukair_get_site_id_internal(). What would you suggest in these cases?
  • You are right, the fetching is very slow. For this reason there is a cached version of the catalogue in the package's data folder (stations.rda). I'm going to update it regularly.
  • The functions now print a message when the csv/HTML page requested does not exist.
  • sapply works much better than do.call+rbind+lapply, thanks!
  • I replaced paste(sep="") with paste0().
  • In ukair_get_hourly_data() I have created more checks to validate the inputs.
  • In ukair_get_hourly_data(), id is only used to fill in the dat object.
  • In ukair_get_hourly_data(), I followed your suggestion to use try() and a check of whether the csv can be downloaded.
  • Now all the code lines are limited to 80 characters.
  • The rdefra package now depends on lubridate, tibble, httr, xml2, dplyr. I did not use readr because read.csv was working better in my case.
  • I now use roxygen2 to automatically generate the NAMESPACE
  • I have put the tests in the root folder
  • Test cases are now in a statement format.

@maelle commented Aug 29, 2016

I couldn't find a solution for eqn either.

@cvitolo commented Sep 4, 2016

rdefra - response to reviewers (step 3)

Thanks @masalmon for giving another look at the package, I greatly appreciate your comments! Below is a summary of the latest changes I made based on your suggestions.

  • There's no unit test for ukair_get_coordinates because travis-ci fails if I add one. This is strange because the test is successful if I run it on my local machine.
  • The cached version of the catalogue will be updated often but there might always be new stations not yet included. Therefore I cannot generate a warning in case the site ID is not in the cached version of the catalogue...that could generate confusion.
  • I have removed the attr(output, "units") example in the function documentation. What I meant re 'the new attribute is not preserved when calling the function' is that the output of the function ukair_get_hourly_data did not store the new attribute called "units"; I'm not sure why.
  • Thanks for your suggestion re the README problem, I have added variant: markdown_github to the yml header of the Rmd and now it works like a charm!
  • The leaflet map is now loaded as a screenshot. In the vignette this chunk still produces a dynamic map. I have a GitHub page for my other packages and I was thinking of creating one once the review is finished. It actually does not take much effort as it is automatically generated from the README.
  • I have now added a table of contents at the beginning of the README
  • Thanks for the suggestions to improve the first time series plot, now m^3 renders nicely and the plot shows daily means instead of all the points

One last question from me: I was thinking of moving the package to the root directory (now it's in a subfolder) as I noticed that that is the easiest way to get appveyor to work. The side effect is that the repository name will change and the link to the old releases will be lost. An option would be to create a separate repository. What is your view on that? Thanks again!

@maelle commented Sep 5, 2016

  • It is weird that the test fails on Travis. You can still leave the test in and add skip_on_travis() before it. There's also a skip_on_appveyor() function in testthat. By the way, do you skip the tests on CRAN with skip_on_cran()? Since most of your tests probably take long, you could skip all of them on CRAN, see e.g. https://github.com/ropenscilabs/opencage/blob/master/tests/testthat.R By the way, do you think the time now required to build the vignette is OK for CRAN?
  • Ok I understand.
  • Was adding the attr(output, "units") the last thing you did before returning the object? I have observed that one loses the attribute with e.g. mutate calls, if I remember correctly. You'll notice I comment a lot on that but ropenaq functions all output lists of data frames so... 😸
  • Cool!
  • I don't see the table of contents?
  • Nice! By the way, you write "Highest concentrations seem to happen in late spring and at the beginning of summer. In order to check whether this happens every year, we can download multiple years of data and then compare them", which is interesting, but then you don't comment on the multiple years, so it sounds a bit mysterious ;-) I have tried to re-do the second figure with
library("rdefra")
library("lubridate")
library("ggplot2")
library("dplyr")
years <- 2013:2015
df <- ukair_get_hourly_data('MY1', years)
df <- mutate(df, year = as.factor(year(datetime)),
             year_day = yday(datetime))

df %>%
  group_by(year, year_day) %>%
  summarize(ozone = mean(Ozone, na.rm = TRUE)) %>%
  ggplot(aes(year_day, ozone)) +
  geom_line() +
  geom_smooth() +
  facet_grid(year ~ .) +
  ylab(expression(paste("Ozone concentration (", mu, "g/", m^3, ")")))

but it's not really prettier. 😄 I think it's quite hard to see seasonality. Maybe try it with another pollutant?

For your last question, I have no idea ("side effect is that the repository name will change and the link to the old releases will be lost")... @sckott ?

@maelle commented Sep 5, 2016

I have tried downloading more years and I see the seasonality now (sorry it made me curious):

library("rdefra")
library("lubridate")
library("ggplot2")
library("dplyr")
years <- 2000:2015
df <- ukair_get_hourly_data('MY1', years)
df <- mutate(df, year = as.factor(year(datetime)),
             year_day = yday(datetime))

df %>%
  group_by(year, year_day) %>%
  summarize(ozone = mean(Ozone, na.rm = TRUE)) %>%
  ggplot(aes(year_day, ozone)) +
  geom_line() +
  geom_smooth() +
  facet_grid(year ~ .) +
  ylab(expression(paste("Ozone concentration (", mu, "g/", m^3, ")")))

[ozone plot: daily mean ozone at MY1 by year, 2000-2015]

Maybe you could comment on the later years which do not have the "nice" spike? Although there isn't really a downward trend.

@cvitolo commented Sep 5, 2016

rdefra - response to reviewers (step 4)

@masalmon

  • I added the tests and then used skip_on_travis before them. I'm not using AppVeyor at the moment (since I can only get it to work if the package is in the root directory); I'll use the skip_on_appveyor function in the future, if needed. I use skip_on_cran only for the tests that retrieve hourly data. The vignette seems to build relatively quickly now.
  • Yes, I added the attr(output, "units") just before returning the object, but that does not preserve the newly added attribute.
  • Oops, I forgot to knit the README.Rmd before committing. You should now see the toc.
  • I have modified the second ozone plot based on your suggestions and added a few comments.

@maelle commented Sep 6, 2016

  • It seems so impractical that AppVeyor can only make the build if the package is in the root directory, and I know you've tried cd commands in the configuration file... Maybe before doing the change you could ask a few more people whether they know more about AppVeyor and subdirectories? Unless you don't mind moving the package?
  • I still don't see the toc ;-) Why not try
output:
  md_document:
    variant: markdown_github
    toc: true

as header?

  • A few comments about the README:
    • Maybe add a link to the parallel package the first time you mention it, and write it as `parallel` so that its name gets clearer?
    • "the user access" -> "the users access"
    • Nice new plot with the boxplots! It just lacks an x-axis label, "Month of the year".
    • Your meta should include "Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms."

  • Regarding the unit tests, when I run R CMD check I get errors with the easting/northing functions, but when I run devtools::test() I don't get errors... By the way, I have just noticed http_error should be httr::http_error() since httr isn't loaded. The actual easting/northing error seems to stem from a dependency issue: stop("package rgdal is required for spTransform methods"). 1) This probably means rgdal should be in Imports in the DESCRIPTION (for Travis and users). 2) Somehow rgdal could be loaded when doing devtools::test but not R CMD check. I quickly tested my own suggestion and it did the trick for R CMD check, so if you add rgdal as a dependency you can skip the skip_on_travis.
  • Regarding the units, I think it got lost because you added it with ukair_get_hourly_data_internal and then made further operations on the output of ukair_get_hourly_data_internal. This made me wonder: if you ask for data for several years, what does the units data frame look like? I tested
output <- ukair_get_hourly_data("ABD", 2000:2014, keep_units = TRUE)

and then saw that the units data frame has 117 lines, some of them without unit.

  1. Why not keep only unique lines with units (see the sketch after this list)?
  2. do you think it's possible that in some cases O3 will be measured with one unit for one year and then another unit for another year? If this can happen, shouldn't the units data frame have a year column?
  • By the way getting units from the website to the user meant a lot of work for you and is clearly an advantage of using the package, so maybe you should mention units are returned in the README and vignette?
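
A minimal sketch of the de-duplication in point 1 above, assuming the units table is the attribute discussed earlier and that missing units are stored as empty strings:

    units_df <- attr(output, "units")
    units_df <- unique(units_df[units_df$unit != "", ])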

@sckott commented Sep 6, 2016

hi @cvitolo for your question

I was thinking of moving the package to the root directory (now it's in a subfolder) as I noticed that that is the easiest way to get appveyor to work. The side effect is that the repository name will change and the link to the old releases will be lost ...

You can do this. If you change the repository name, any attempts at going to the old URL will be redirected to the new URL. And I +1 this move of the pkg to the root directory 👏 it just makes things easier.

@cvitolo commented Sep 7, 2016

It seems so impractical that AppVeyor can only make the build if the package is in the root directory, and I know you've tried cd commands in the configuration file... Maybe before doing the change you could ask a few more people whether they know more about AppVeyor and subdirectories? Unless you don't mind moving the package?

I totally agree with you, it is not practical! I am going to ask for help on Stack Overflow or contact AppVeyor support.

I still don't see the toc ;-) Why not try
output:
  md_document:
    variant: markdown_github
    toc: true
as header?

Sorry about this. I found out that the output can be set to github_document and that removed the annoying html tags, but I forgot to indent the line toc: true and the toc disappeared. I have just committed the change.

A few comments about the README:
** Maybe add a link to the parallel package the first time you mention it, and write it as `parallel` so that its name gets clearer?

Ok, that's done

** "the user access" -> "the users access"

Done

** Nice new plot with the boxplots! It just lacks an x-axis label, "Month of the year".

Done

** Your meta should include "Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms."

Done

Regarding the unit tests, when I run R CMD check I get errors with the easting/northing functions, but when I run devtools::test() I don't get errors... By the way, I have just noticed http_error should be httr::http_error() since httr isn't loaded. The actual easting/northing error seems to stem from a dependency issue: stop("package rgdal is required for spTransform methods"). 1) This probably means rgdal should be in Imports in the DESCRIPTION (for Travis and users). 2) Somehow rgdal could be loaded when doing devtools::test but not R CMD check. I quickly tested my own suggestion and it did the trick for R CMD check, so if you add rgdal as a dependency you can skip the skip_on_travis.

I did try that but did not work for me :(
If I add rgdal to Imports I get the following error:

    checking for gdal-config... no
    no
    configure: error: gdal-config not found or not executable.
    ERROR: configuration failed for package ‘rgdal’
    * removing ‘/home/travis/R/Library/rgdal’
    Error: Command failed (1)
    Execution halted

You can look at the full log here.

Regarding the units, I think it got lost because you added it with ukair_get_hourly_data_internal and then made further operations on the output of ukair_get_hourly_data_internal.

You are right! Setting the attribute in ukair_get_hourly_data rather than in ukair_get_hourly_data_internal fixed the problem. Thank you so much! Units are back in the attribute list :)

This made me wonder: if you ask for data for several years, what does the units data frame look like? I tested
output <- ukair_get_hourly_data("ABD", 2000:2014, keep_units = TRUE)
and then saw that the units data frame has 117 lines, some of them without unit.

  1. Why not keep only unique lines with units?
  2. do you think it's possible that in some cases O3 will be measured with one unit for one year and then another unit for another year? If this can happen, shouldn't the units data frame have a year column?

The table now has a year column. It is possible that, over time, sensors will be substituted with new ones (with higher resolution, therefore different units?). My guess is that if this happens, the new sensor gets a new SiteID. I can check with DEFRA. In the meantime I think it is safer to keep the full table. Also, I wouldn't remove rows with empty units, otherwise users won't know whether the units are missing or the function failed to retrieve all the metadata.

By the way getting units from the website to the user meant a lot of work for you and is clearly an advantage of using the package, so maybe you should mention units are returned in the README and vignette?

Done.

Thanks for all your suggestions! Again, your help was precious!

@maelle commented Sep 7, 2016

So Travis can't install gdal, is that the issue?

@maelle commented Sep 7, 2016

@cvitolo I see that geojsonio depends on rgdal too, see https://github.com/ropensci/geojsonio/blob/master/.travis.yml maybe helpful? @sckott maintains the package so he should know more than I 😸

Then regarding rgdal in your NAMESPACE, maybe only part of rgdal is needed (you imported all of it, I think?) -- maybe have a look at what spTransform needs exactly?

@maelle commented Sep 7, 2016

I also wonder whether you should make rgdal a Suggests, given it's a huge dependency and is used in one function only... I'm not sure what's best practice.

@maelle commented Sep 7, 2016

An example of a package that has rgdal in Suggests only is mregions. See https://github.com/ropenscilabs/mregions/blob/fb95e15cd197ce39f08364ac02f21ec136387e50/R/region_shp.R -> maybe you could do something similar? The code does check4pkg("rgdal") and the doc says "if \code{TRUE}, you need the \code{rgdal} package installed.". You'd still have to get rgdal to work on Travis to be able to test your function, but then not every user would need rgdal to work (which can be quite difficult, I believe?).

@mdsumner commented Sep 7, 2016

Travis can install a binary R package like rgdal, but you need this in your travis.yml even if it's a secondary dependency:

    sudo: required
    r_binary_packages:
      - rgdal

Working example is here:
https://github.com/SWotherspoon/SGAT/blob/master/.travis.yml

Travis can also build GDAL from source, see Edzer's .yml here:

https://github.com/edzer/gstat/blob/master/.travis.yml

So, I guess that's three ways to do it if you include the apt-get system installs used by Scott's geojsonio. :)

spTransform() is defined in sp, but it's only usable if rgdal is also available. It's kind of related to needing the projection metadata in sp - so you can read and write data completely - even if you don't do any transformations. You can use proj4 as an alternative to rgdal, but it still needs PROJ.4 on the system, and it's a bit more restricted than rgdal with spTransform.
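
A minimal sketch of the sp/rgdal interplay described above (the coordinates are illustrative; EPSG:27700 is the British National Grid used for easting/northing):

    library(sp)
    pt <- SpatialPoints(cbind(394251, 806376),
                        proj4string = CRS("+init=epsg:27700"))
    # spTransform() is exported by sp but fails unless rgdal is installed
    spTransform(pt, CRS("+init=epsg:4326"))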

Cheers, Mike.

HTH!

@mdsumner commented Sep 7, 2016

I'll add that I'm also happy to help get travis / rgdal working if you have problems, whatever you decide to use.

Generally, rgdal is easily available on Windows and Mac, but it's a bit trickier for some on Linux just because it can be done so many ways. For that reason using proj4 instead wouldn't help much; it's just a "lighter" dependency overall.

@sckott commented Sep 7, 2016

thanks for the help @mdsumner - @cvitolo i'd take his advice over mine when it comes to spatial pkgs :)

@cvitolo commented Sep 7, 2016

@sckott Thanks, I have renamed the repo.

@mdsumner Many thanks for the advice and the offer to help! I have taken the 'build gdal from source' approach and now rgdal installs no problem.

I think this rgdal issue was the last thing to address.

@maelle commented Sep 7, 2016

Yes I don't have other remarks (hopefully all potential users will be able to install gdal or ask for help). Nice job, and welcome back from Travis hell 😉

@haozhu233

Yes, I also believe that this package is ready. Great job, @cvitolo! :D

@sckott commented Sep 7, 2016

Thanks everyone

@cvitolo just taking a quick look over before approving

@sckott commented Sep 7, 2016

Great work! A few items before approving:

  • CRAN checks want you to use their "canonical form" of the URL for a package, so e.g., change http://cran.rstudio.com/web/packages/rdefra/index.html to https://cran.r-project.org/package=rdefra throughout the package (roxygen docs, README, vignette)
  • I noticed something that could be architected a bit better to fail more gracefully for users. ukair_get_coordinates() allows a few different types of inputs to its first parameter. Instead of allowing anything to be passed in, which can, e.g., result in
ukair_get_coordinates(5)
#> Error in xml2::xml_find_all(page_content, "//*[contains(@id,'tab_info')]")[[2]] : 
#>   subscript out of bounds

use e.g.,

foo <- function(ids) {
  UseMethod("foo")
}

foo.default <- function(ids) {
  stop("no 'foo' method for ", class(ids), call. = FALSE)
}

foo.character <- function(ids) {
  # do stuff
  ids
}

foo.data.frame <- function(ids) {
  # do stuff
  ids
}

which gives you nice failure behavior, for example:

foo(5)
#> Error: no 'foo' method for numeric

You could, alternatively, do the checking for the correct class type internally without the above approach, but the above does make it easy and clear to say what happens for each input class.

I think the other exported functions in the package are fine.

@cvitolo commented Sep 11, 2016

@sckott Thanks for having another look at the package!

I have made the changes you suggested:

  • use canonical form for package url
  • implement methods for ukair_get_coordinates, this is very neat, thanks for the suggestion!

@sckott commented Sep 12, 2016

Looks good to me. Approved!

Your vignettes still need a few pieces:

  • You need rmarkdown and knitr in Suggests
  • You need an entry in your DESCRIPTION file of: VignetteBuilder: knitr
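
That is, in DESCRIPTION (a sketch; other Suggests entries omitted):

    Suggests: knitr, rmarkdown
    VignetteBuilder: knitr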

To move your repo:

  • Transfer the repo to the ropenscilabs GitHub account
  • Make sure to change all links to the new account, ropenscilabs instead of kehraProject

Please do follow these guidelines moving forward with your package.

Are you interested in doing a guest blog post on our site? http://ropensci.org/tech-notes/ If so, I'll get in touch with more details

@cvitolo commented Sep 13, 2016

Thanks @sckott, I'll follow your suggestions!

I have made the changes above and tried to transfer the repo to ropenscilabs but I get this message:

You don’t have admin rights to ropenscilabs

Also, should I change all the links in the badges as well (e.g. travis, codecov)?

Happy to do a guest blog post, please send me more details.

@noamross

Adding a note for reference: rdefra went through JOSS and rOpenSci review in parallel, having been submitted before our joint-review workflow was in place but approved after. The JOSS review can be found here: openjournals/joss-reviews#51

@sckott commented Sep 13, 2016

@cvitolo You should get an email invitation to a team on the ropenscilabs account; accept that, and try again.

Also, should I change all the links in the badges as well (e.g. travis, codecov)?

Yes. Once the repo is transferred, we can then turn on Travis and codecov for your repo.

Happy to do a guest blog post, please send me more details.

I'll send an email to you

@sckott sckott closed this as completed Oct 6, 2016
karthik added a commit that referenced this issue Nov 26, 2018
Based on issue at onboarding-meta/#68