
Alternatives for handling NA inputs #225

Merged
merged 10 commits on Mar 26, 2020
30 changes: 15 additions & 15 deletions CONTRIBUTING.md
@@ -4,11 +4,11 @@ Thank you for your interest in contributing to the project! The goal of this gui

### Fill out the survey

The ```webchem``` survey allows us to learn who you are, which databases you use and how you interact with chemical data. This is extremely valuable information for us and guides our development efforts. The survey takes about 5 minutes to fill out. You can fill out the survey [here](<link here>) (under development).
The `webchem` survey allows us to learn which databases you use and how you interact with chemical data. This is extremely valuable information for us and guides our development efforts. The survey takes about 5 minutes to fill out. You can fill out the survey [here](<link here>) (under development).

### Share a use case

Write us an e-mail and show us a full example of how you use or how you would like to use ```webchem``` in your data analysis! This would give us ideas about new features and also help us create better vignettes that help others get started. Please send your e-mails to <webchem@ropensci.org>.
Write us an e-mail and show us a full example of how you use or how you would like to use `webchem` in your data analysis! This would give us ideas about new features and also help us create better vignettes that help others get started. Please send your e-mails to <webchem@ropensci.org>.

### Raise a new issue or join discussion on an existing issue

@@ -34,34 +34,34 @@ We do not have strong guidelines for code contributions and are happy to help at

1. We follow the [tidyverse](https://tidyverse.org) style. You can find the style guide [here](https://style.tidyverse.org/). Before committing your code, we encourage you to use ```lintr::lint_file()``` to check for nonconformances.

2. We use ```roxygen2``` for documentation. Please make sure you update the package to the latest version before you update the documentation with ```devtools::document()```. In fact, it is good practice to update all the packages that are used by webchem before making changes.
2. We use [`roxygen2`](https://cran.r-project.org/web/packages/roxygen2/index.html) for documentation. Please make sure you update the package to the latest version before you update the documentation with `devtools::document()`. Use `@noRd` for non-exported functions.

2. For web scraping, we recommend the use of the [polite](https://dmi3kno.github.io/polite/) package.
3. Please use the [`xml2`](https://cran.r-project.org/web/packages/xml2/index.html) package instead of the `XML` package. xml2 is much better maintained.

We want to keep dependencies to a minimum:
4. Please use the lightweight [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/index.html) package for handling JSON.

3. Please use the [`xml2`](https://github.com/hadley/xml2) package instead of the `XML` package. xml2 is much better maintained.
5. Use utilities in `webchem::utils.R` when possible to keep function style consistent across the package.

4. Please use the lightweight [`jsonlite`](https://github.com/jeroenooms/jsonlite) package for handling JSON.

5. Use utilities in utils.R when possible to keep function style consistent across the package.

6. Be nice to the resources! Use appropriate timeouts.
6. Be nice to the resources! Minimise interaction with the servers. Use appropriate timeouts. For web scraping, we recommend the use of the [`polite`](https://cran.r-project.org/web/packages/polite/index.html) package.

7. Tests go into a separate tests branch and not in the master branch.

Some consistency guidelines:

8. Functions that query a database for one or more database specific identifiers should follow the naming convention ```get_*```, e.g. the function that queries ChEBI IDs is called ```get_chebiid()```.
8. Functions that query a database for one or more database specific identifiers should follow the naming convention `get_*`, e.g. the function that queries ChEBI IDs is called `get_chebiid()`.

9. The naming of functions that query a database for chemical information should start with the name of the database, followed by the functionality, e.g. `pc_synonyms()` searches for synonyms in PubChem.

10. Functions should always validate their input when appropriate. Use `match.arg()` for input validation.

9. The naming of functions that query a database for chemical information should start with the name of the database, followed by the functionality, e.g. ```pc_synonyms()``` searches for synonyms in PubChem.
11. Functions should always return the same output structure, even for invalid inputs, e.g. unknown chemicals, or `NA`. Make sure `NA` is not confused with sodium. Include tests.
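Taken together, guidelines 8, 10 and 11 imply a common skeleton for `get_*()` functions. A minimal sketch follows — `get_exampleid()`, its fake ID format, and the `exampleid` column are invented for illustration and are not part of webchem:

```r
# Hypothetical get_*() skeleton following guidelines 8, 10 and 11.
get_exampleid <- function(query, match = c("all", "best", "na")) {
  match <- match.arg(match)  # guideline 10: validate arguments early

  foo <- function(q) {
    # guideline 11: invalid inputs, including NA, return the same structure
    if (is.na(q)) return(data.frame(exampleid = NA_character_))
    # ... a real function would query the database here ...
    data.frame(exampleid = paste0("ID:", toupper(q)))
  }

  # one row per query, so valid and invalid inputs line up
  do.call(rbind, lapply(query, foo))
}

get_exampleid(c("aspirin", NA))
# returns a 2-row data.frame; row 2 has exampleid NA
```

The point of the `foo()` helper is vectorisation: each element of `query`, valid or not, contributes one row with identical columns, so downstream code never has to special-case `NA`.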

### Data Sources

You might think all web scraping is perfectly legal, but unfortunately it is not that simple.

Some services allow you to browse their website but do not allow programmatic access, for various reasons. Therefore, we always have to check the Terms & Conditions and any other legal documents that might restrict programmatic access. ```webchem``` only provides access to databases where programmatic access is clearly approved by the database provider. A provider might create a publicly accessible API, and if they do not have a restrictive T&C, this indicates their implicit approval for programmatically accessing their data. In all other cases explicit approval is required, i.e. either the T&C has to state that scraping is allowed, or we have to acquire written consent from the database provider before developing functions that scrape their website.
Some services allow you to browse their website but do not allow programmatic access, for various reasons. Therefore, we always have to check the Terms & Conditions and any other legal documents that might restrict programmatic access. `webchem` only provides access to databases where programmatic access is clearly approved by the database provider. A provider might create a publicly accessible API, and if they do not have a restrictive T&C, this indicates their implicit approval for programmatically accessing their data. In all other cases explicit approval is required, i.e. either the T&C has to state that scraping is allowed, or we have to acquire written consent from the database provider before developing functions that scrape their website.

And there is a big difference between scraping and crawling. ```webchem``` does provide some scraping functionality but it does not provide crawling functionality. We aim to query databases not to download them.
And there is a big difference between scraping and crawling. `webchem` does provide some scraping functionality but it does not provide crawling functionality.

### Thanks for contributing!
12 changes: 6 additions & 6 deletions R/alanwood.R
@@ -53,6 +53,7 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
}

takelink <- links[tolower(names) == tolower(query)]
if (is.na(query)) takelink <- vector()
if (length(takelink) == 0) {
message("Not found! Returning NA.\n")
return(list(cname = NA,
@@ -77,7 +78,7 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
if (verbose)
message("Querying ", takelink)

Sys.sleep(rgamma(1, shape = 15, scale = 1/10))
Sys.sleep(rgamma(1, shape = 15, scale = 1 / 10))
ttt <- read_html(paste0("http://www.alanwood.net/pesticides/", takelink))

status <- xml_text(
@@ -133,7 +134,7 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
out <- list(cname = cname, status = status,
pref_iupac_name = pref_iupac_name, iupac_name = iupac_name,
cas = cas, formula = formula, activity = activity,
subactivity = subactivity, inchikey = inchikey, inch = inchi,
subactivity = subactivity, inchikey = inchikey, inchi = inchi,
source_url = source_url)
return(out)
}
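The one-line NA guard added above fixes a subtle subsetting pitfall; a standalone sketch with made-up name/link vectors shows why it is needed. Logical subscripts containing `NA` keep `NA` elements, so without the guard an `NA` query appears to "match" entries:

```r
# Illustration only: the vectors are invented, not real alanwood.net data.
names <- c("fluazinam", "metolachlor")
links <- c("fluazinam.html", "metolachlor.html")
query <- NA

takelink <- links[tolower(names) == tolower(query)]
length(takelink)  # 2 -- both comparisons yield NA, so two NA "matches" survive

if (is.na(query)) takelink <- vector()  # the guard from the diff
length(takelink)  # 0 -- the existing "Not found! Returning NA." branch now fires
```

Without the guard, `length(takelink) == 0` is `FALSE` and the function would try to read an `NA` URL; with it, an `NA` query falls through to the same not-found branch as an unknown chemical, which is exactly the consistent-output behaviour guideline 11 asks for.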
@@ -153,16 +154,13 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
#'@author Eduard Szoecs, \email{eduardszoecs@@gmail.com}
build_aw_idx <- function() {
idx1 <- read_html("http://www.alanwood.net/pesticides/index_rn.html")
# idx2 <- read_html("http://www.alanwood.net/pesticides/index_rn1.html")
# idx3 <- read_html("http://www.alanwood.net/pesticides/index_rn2.html")
prep_idx <- function(y) {
names <- xml_text(xml_find_all(y, "//dl/dt"))
links <- xml_attr(
xml_find_all(y, "//dt/following-sibling::dd[1]/a[1]"), "href")
linknames <- xml_text(xml_find_all(y, "//dt/following-sibling::dd[1]/a[1]"))
return(data.frame(names, links, linknames, stringsAsFactors = FALSE))
}
# aw_idx <- rbind(prep_idx(idx1), prep_idx(idx2) ,prep_idx(idx3))
aw_idx <- rbind(prep_idx(idx1))
aw_idx[["source"]] <- "rn"
idx4 <- read_html("http://www.alanwood.net/pesticides/index_cn.html")
@@ -183,6 +181,8 @@ build_aw_idx <- function() {
aw_idx$linknames <- ln
attr(aw_idx, "date") <- Sys.Date()

# do not delete this line
# occasionally the data file should be updated in the package
# save(aw_idx, file = "data/aw_idx.rda")
return(aw_idx)
}
}
129 changes: 112 additions & 17 deletions R/chebi.R
@@ -30,8 +30,8 @@
#'
#' @references Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V,
#' Turner S, Swainston N, Mendes P, Steinbeck C. (2016). ChEBI in 2016:
#' Improved services and an expanding collection of metabolites. Nucleic Acids
#' Res.
#' Improved services and an expanding collection of metabolites. Nucleic
#' Acids Res.
#'
#' Hastings, J., de Matos, P., Dekker, A., Ennis, M., Harsha, B., Kale, N.,
#' Muthukrishnan, V., Owen, G., Turner, S., Williams, M., and Steinbeck, C.
@@ -65,22 +65,19 @@
#' }
get_chebiid <- function(query,
from = 'ALL',
match = 'all',
match = c("all", "best", "ask", "na"),
max_res = 200,
stars = 'ALL',
verbose = TRUE,
...) {

match <- match.arg(match)
foo <- function(query, match, from, max_res, stars, verbose, ...) {
# query = 'Isoproturon'; from = 'ALL'; match = 'ask'; max_res = 200; stars = 'ALL'; verbose = T # debugging
# arguments
if (is.na(query)) return(data.frame(chebiid = NA))
from_all <- c('ALL', 'CHEBI ID', 'CHEBI NAME', 'DEFINITION', 'ALL NAMES',
'IUPAC NAME', 'CITATIONS', 'REGISTRY NUMBERS', 'MANUAL XREFS',
'AUTOMATIC XREFS', 'FORMULA', 'MASS', 'MONOISOTOPIC MASS',
'CHARGE', 'INCHI/INCHI KEY', 'SMILES', 'SPECIES')
from <- match.arg(from, from_all)
match_all <- c('all', 'best', 'ask', 'na')
match <- match.arg(match, match_all)
stars_all <- c('ALL', 'TWO ONLY', 'THREE ONLY')
stars <- match.arg(stars, stars_all)
# query
@@ -116,20 +113,20 @@ get_chebiid <- function(query,
out <- setNames(out, tolower(names(out)))
if (nrow(out) == 0) {
message('No result found. \n')
return(out)
return(data.frame(chebiid = NA))
}
if (match == 'all') {
return(out)
}
if (match == 'best') {
if (verbose)
message('Returning best match. \n')
out <- out[ with(out, order(searchscore, decreasing = TRUE)), ]
return(out[ which.max(out$searchscore), ])
out <- out[with(out, order(searchscore, decreasing = TRUE)), ]
return(out[which.max(out$searchscore), ])
}
if (match == "ask") {
matched <- chooser(out$chebiid, 'all')
return(out[ out$chebiid == matched, ])
return(out[out$chebiid == matched, ])
}
if (match == 'na') {
return(data.frame(chebiid = NA))
@@ -212,6 +209,7 @@ chebi_comp_entity <- function(chebiid, verbose = TRUE, ...) {

foo <- function(chebiid, verbose, ...) {
# chebiid = c('CHEBI:27744', 'CHEBI:17790'); verbose = TRUE # debugging
if (is.na(chebiid)) return(empty_chebi_comp_entity())
url <- 'http://www.ebi.ac.uk:80/webservices/chebi/2.0/webservice'
headers <- c(Accept = 'text/xml',
Accept = 'multipart/*',
@@ -235,10 +233,8 @@
add_headers(headers),
body = body)
if (res$status_code != 200) {
out <- data.frame(chebiid = NA)
warning(http_status(res)$message)

return(out)
return(empty_chebi_comp_entity())
} else {
cont <- content(res, type = 'text/xml', encoding = 'utf-8')
# restricted to one entry
@@ -313,7 +309,7 @@
#' @noRd
#'
l2df <- function(x) {
out <- data.frame(rbind.named.fill(lapply(x, unlist)),
out <- data.frame(rbind_named_fill(lapply(x, unlist)),
row.names = NULL,
stringsAsFactors = FALSE)

@@ -331,7 +327,7 @@ l2df <- function(x) {
#' @author Andreas Scharmueller, \email{andschar@@protonmail.com}
#' @noRd
#'
rbind.named.fill <- function(x) {
rbind_named_fill <- function(x) {
nam <- lapply(x, names)
unam <- unique(unlist(nam))
len <- lapply(x, length)
@@ -340,4 +336,103 @@ rbind.named.fill <- function(x) {
out[[i]] <- unname(x[[i]])[match(unam, nam[[i]])]
}
setNames(as.data.frame(do.call(rbind, out), stringsAsFactors = FALSE), unam)
}
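The diff fold hides part of the helper's body, so the following is a reconstructed, self-contained version of `rbind_named_fill()` based on the visible hunks (the packaged helper may differ slightly). Rows are aligned by name and padded with `NA` where an element lacks a given name:

```r
# Reconstructed sketch of rbind_named_fill(); not verbatim package code.
rbind_named_fill <- function(x) {
  nam <- lapply(x, names)
  unam <- unique(unlist(nam))            # union of all names, in first-seen order
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    # match() yields NA positions for missing names, which pad the row with NA
    out[[i]] <- unname(x[[i]])[match(unam, nam[[i]])]
  }
  setNames(as.data.frame(do.call(rbind, out), stringsAsFactors = FALSE), unam)
}

rbind_named_fill(list(c(a = 1, b = 2), c(b = 3, c = 4)))
# columns a, b, c; row 1 has c = NA, row 2 has a = NA
```

This is what lets `l2df()` above combine ChEBI list entries whose fields only partially overlap into one rectangular data frame.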

#' Create empty list for chebi_comp_entity
#'
#' The function creates a list with identical internal structure as a
#' chebi_comp_entity() output element. This is used in chebi_comp_entity() to
#' provide a valid output for invalid inputs and allow vectorisation.
#' @noRd
#' @author Tamas Stirling, \email{stirling.tamas@@gmail.com}
empty_chebi_comp_entity <- function() {
properties <- data.frame(
chebiid = as.character(NA),
chebiasciiname = as.character(NA),
definition = as.character(NA),
status = as.character(NA),
smiles = as.character(NA),
inchi = as.character(NA),
inchikey = as.character(NA),
charge = as.character(NA),
mass = as.character(NA),
monoisotopicmass = as.character(NA),
entitystar = as.character(NA)
)
chebiid_snd <- data.frame(
chebiids = as.character(NA)
)
chem_structure <- list(
list("structure" = list(as.character(NA)),
"type" = list(as.character(NA)),
"dimension" = list(as.character(NA)),
"defaultStructure" = list(as.character(NA)))
)
synonyms <- data.frame(
data = as.character(NA),
type = as.character(NA),
source = as.character(NA)
)
iupacnames <- data.frame(
data = as.character(NA),
type = as.character(NA),
source = as.character(NA)
)
formulae <- data.frame(
data = as.character(NA),
source = as.character(NA)
)
regnumbers <- data.frame(
data = as.character(NA),
type = as.character(NA),
source = as.character(NA)
)
citations <- data.frame(
data = as.character(NA),
type = as.character(NA),
source = as.character(NA)
)
dblinks <- data.frame(
data = as.character(NA),
type = as.character(NA)
)
parents <- data.frame(
chebiName = as.character(NA),
chebiId = as.character(NA),
type = as.character(NA),
status = as.character(NA),
cyclicRelationship = as.character(NA)
)
children <- data.frame(
chebiName = as.character(NA),
chebiId = as.character(NA),
type = as.character(NA),
status = as.character(NA),
cyclicRelationship = as.character(NA)
)
comments <- data.frame(
text = as.character(NA),
date = as.character(NA)
)
origins <- data.frame(
speciesText = as.character(NA),
speciesAccession = as.character(NA),
SourceType = as.character(NA),
SourceAccession = as.character(NA),
componentText = as.character(NA),
componentAccession = as.character(NA)
)
return(list("properties" = properties,
"chebiid_snd" = chebiid_snd,
"chem_structure" = chem_structure,
"synonyms" = synonyms,
"iupacnames" = iupacnames,
"formulae" = formulae,
"regnumbers" = regnumbers,
"citations" = citations,
"dblinks" = dblinks,
"parents" = parents,
"children" = children,
"comments" = comments,
"origins" = origins))
}
Binary file modified data/aw_idx.rda
Binary file not shown.
23 changes: 18 additions & 5 deletions tests/testthat/test-alanwood.R
@@ -4,33 +4,46 @@ context("alanwood")
test_that("alanwood, commonname", {
skip_on_cran()

comps <- c("Fluazinam", "S-Metolachlor", "xxxxx")
comps <- c("Fluazinam", "S-Metolachlor", "balloon", NA)
o1 <- aw_query(comps, type = "commonname")

expect_is(o1, "list")
expect_equal(length(o1), 3)
expect_equal(length(o1), 4)
expect_is(o1[[1]], "list")
expect_is(o1[[2]], "list")
expect_is(o1[[3]], "list")
expect_is(o1[[4]], "list")
expect_equal(o1[["Fluazinam"]]$cas, "79622-59-6")
expect_equal(length(o1[["S-Metolachlor"]]$inchikey), 2)
expect_equal(length(o1[["S-Metolachlor"]]$inchi), 2)
expect_equal(length(o1[["Fluazinam"]]), 11)
expect_true(is.na(o1[["balloon"]]$cname))
expect_true(is.na(o1[[4]]$cname))
})


test_that("alanwood, cas", {
skip_on_cran()

comps <- c("79622-59-6", "87392-12-9", "xxxxx")
comps <- c("79622-59-6", "87392-12-9", "balloon", NA)
o1 <- aw_query(comps, type = "cas")

expect_is(o1, "list")
expect_equal(length(o1), 3)
expect_equal(length(o1), 4)
expect_is(o1[[1]], "list")
expect_is(o1[[2]], "list")
expect_is(o1[[3]], "list")
expect_is(o1[[4]], "list")
expect_equal(mean(names(o1[[1]]) == names(o1[[2]])), 1) # check structure
expect_equal(mean(names(o1[[2]]) == names(o1[[3]])), 1) # check structure
expect_equal(mean(names(o1[[3]]) == names(o1[[4]])), 1) # check structure
expect_equal(o1[[1]]$cas, "79622-59-6")
expect_equal(length(o1[[2]]$inchikey), 2)
expect_equal(length(o1[[2]]$inchi), 2)
expect_equal(length(o1[[1]]), 11)
expect_true(is.na(aw_query("12071-83-9", type = "cas")[[1]]$inchi))
expect_true(is.na(o1[["balloon"]]$cname))
expect_true(is.na(o1[[4]]$cname))
})

test_that("alanwood, build_index", {
@@ -48,4 +61,4 @@ test_that("alanwood index is up to date", {
skip_on_cran()

expect_true(Sys.Date() - attr(aw_idx, "date") < 90)
})
})