diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index c8cb3f66..bf059b5f 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -4,11 +4,11 @@ Thank you for your interest in contributing to the project! The goal of this gui ### Fill out the survey -The ```webchem``` survey allows us to learn who you are, which databases you use and how you interact with chemical data. This is extremely valuable information for us and guides our development efforts. The survey take about 5 minutes to fill out. You can fill out the survey [here]() (under development). +The `webchem` survey allows us to learn which databases you use and how you interact with chemical data. This is extremely valuable information for us and guides our development efforts. The survey takes about 5 minutes to fill out. You can fill out the survey [here]() (under development). ### Share a use case -Write us an e-mail and show us a full example of how you use or how you would like to use ```webchem``` in your data analysis! This would give us ideas about new features and also help us create better vignettes that help others get started. Please send your e-mails to . +Write us an e-mail and show us a full example of how you use or how you would like to use `webchem` in your data analysis! This would give us ideas about new features and also help us create better vignettes that help others get started. Please send your e-mails to . ### Raise a new issue or join discussion on an existing issue @@ -34,34 +34,34 @@ We do not have strong guidelines for code contributions and are happy to help at 1. We follow the [tidyverse](https://tidyverse.org) style. You can find the style guide [here](https://style.tidyverse.org/). Before committing your code, we encourage you to use ```lintr::lint_file()``` to check for nonconformances. -2. We use ```roxygen2``` for documentation. Please make sure you update the package to the latest version before you update the documentation with ```devtools::document()```. In fact, it is good practice to update all the packages that are used by webchem before making changes. +2. We use [`roxygen2`](https://cran.r-project.org/web/packages/roxygen2/index.html) for documentation. Please make sure you update the package to the latest version before you update the documentation with `devtools::document()`. Use `@noRd` for non-exported functions. -2. For web scraping, we recommend the use of the [polite](https://dmi3kno.github.io/polite/) package. +3. Please use the [`xml2`](https://cran.r-project.org/web/packages/xml2/index.html) package instead of the `XML` package. The maintenance of xml2 is much better. -We want to keep dependencies to a minimum: +4. Please use the lightweight [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/index.html) package for handling JSON. -3. Please use the [`xml2`](https://github.com/hadley/xml2) package instead of the `XML` package. The maintainance of xml2 is much better. +4. Please use the lightweight [`jsonlite`](https://github.com/jeroenooms/jsonlite) package for handling JSON. - -5. Use utilities in utils.R when possible to keep function style consistent across the package. - -6. Be nice to the resources! Use appropriate timeouts. +5. Use utilities in `webchem::utils.R` when possible to keep function style consistent across the package. +6. Be nice to the resources! Minimise interaction with the servers. Use appropriate timeouts. For web scraping, we recommend the use of the [`polite`](https://cran.r-project.org/web/packages/polite/index.html) package. 7. 
Tests go into a separate tests branch and not in the master branch. Some consistency guidelines: -8. Functions that query a database for one or more database specific identifiers should follow the naming convention ```get_*```, e.g. the function that queries ChEBI IDs is called ```get_chebiid()```. +8. Functions that query a database for one or more database-specific identifiers should follow the naming convention `get_*`, e.g. the function that queries ChEBI IDs is called `get_chebiid()`. These functions should take a vector of queries and return a single [tibble](https://cran.r-project.org/web/packages/tibble/index.html). Whenever possible, these functions should have arguments `query`, `from`, `match`, `verbose` and `...`. The first column of the tibble should contain the IDs and the last should contain the queries. Invalid queries should return a row of NAs (apart from the last element of the row, which should be the query itself). + +9. The naming of functions that query a database for chemical information should start with the name of the database, followed by the functionality, e.g. `pc_synonyms()` searches for synonyms in PubChem. These functions should take a vector of queries and return a list of responses. Invalid queries should return `NA`. + +10. Functions should always validate their input when appropriate. Use `match.arg()` for input validation. -9. The naming of functions that query a database for chemical information should start with the name of the database, followed by the functionality, e.g. ```pc_synonyms()``` searches for synonyms in PubChem. +11. Make sure `NA` is not confused with sodium. ### Data Sources You might think all webscraping is perfectly legal but it is unfortunately not that simple. -Some services allow you to browse their website but do not allow you programmable access, for various reasons. Therefore, we always have to check the Terms & Conditons and any other legal documents that might restrict programmable access. ```webchem``` only provides access to databases where programmable access is clearly approved by the database provider. A provider might create a publicly accessible API, and if they do not have a restrictive T&C, this indicates their implicit approval for programmatically accessing their data. In all other cases explicit approval is required, i.e. either the T&C has to state that scraping is allowed, or we have to acquire written consent from the database provider before developing functions that scrape their website. +Some services allow you to browse their website but do not allow you programmable access, for various reasons. Therefore, we always have to check the Terms & Conditions and any other legal documents that might restrict programmable access. `webchem` only provides access to databases where programmable access is clearly approved by the database provider. A provider might create a publicly accessible API, and if they do not have a restrictive T&C, this indicates their implicit approval for programmatically accessing their data. In all other cases explicit approval is required, i.e. either the T&C has to state that scraping is allowed, or we have to acquire written consent from the database provider before developing functions that scrape their website. -And there is a big difference between scraping and crawling. ```webchem``` does provide some scraping functionality but it does not provide crawling functionality. We aim to query databases not to download them. +And there is a big difference between scraping and crawling. 
`webchem` does provide some scraping functionality but it does not provide crawling functionality. ### Thanks for contributing! \ No newline at end of file diff --git a/NAMESPACE b/NAMESPACE index 5353146f..5f1b70b6 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -1,8 +1,11 @@ # Generated by roxygen2: do not edit by hand +S3method(cas,chebi_comp_entity) S3method(cas,default) S3method(cas,pan_query) S3method(cas,wd_ident) +S3method(inchikey,aw_query) +S3method(inchikey,chebi_comp_entity) S3method(inchikey,cs_compinfo) S3method(inchikey,cs_extcompinfo) S3method(inchikey,default) @@ -12,6 +15,7 @@ S3method(inchikey,pan_query) S3method(inchikey,pc_prop) S3method(inchikey,wd_ident) S3method(smiles,aw_query) +S3method(smiles,chebi_comp_entity) S3method(smiles,cs_compinfo) S3method(smiles,cs_extcompinfo) S3method(smiles,cts_compinfo) diff --git a/NEWS b/NEWS index a8a61cd5..0319f4b0 100644 --- a/NEWS +++ b/NEWS @@ -8,6 +8,7 @@ MINOR IMPROVEMENTS BUG FIXES * get_csid() returned an error when query was NA [PR #226, fixed by stitam] +* get_chebiid() and chebi_comp_entity() fixed for invalid queries [PR #225, fixed by stitam] * get_cid() returned the PubChem ID of sodium when the query was NA [PR #223, fixed by stitam] * aw_query() returned a list for successful queries, NA for unsuccessful queries [PR #222, fixed by stitam] diff --git a/R/alanwood.R b/R/alanwood.R index cb9aedf3..ae6e7333 100644 --- a/R/alanwood.R +++ b/R/alanwood.R @@ -53,19 +53,10 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE, } takelink <- links[tolower(names) == tolower(query)] + if (is.na(query)) takelink <- vector() if (length(takelink) == 0) { message("Not found! Returning NA.\n") - return(list(cname = NA, - status = NA, - pref_iupac_name = NA, - iupac_name = NA, - cas = NA, - formula = NA, - activity = NA, - subactivity = NA, - inchikey = NA, - inchi = NA, - source_url = NA)) + return(NA) } if (length(takelink) > 1) { takelink <- unique(takelink) @@ -77,7 +68,7 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE, if (verbose) message("Querying ", takelink) - Sys.sleep(rgamma(1, shape = 15, scale = 1/10)) + Sys.sleep(rgamma(1, shape = 15, scale = 1 / 10)) ttt <- read_html(paste0("http://www.alanwood.net/pesticides/", takelink)) status <- xml_text( @@ -133,7 +124,7 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE, out <- list(cname = cname, status = status, pref_iupac_name = pref_iupac_name, iupac_name = iupac_name, cas = cas, formula = formula, activity = activity, - subactivity = subactivity, inchikey = inchikey, inch = inchi, + subactivity = subactivity, inchikey = inchikey, inchi = inchi, source_url = source_url) return(out) } @@ -153,8 +144,6 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE, #'@author Eduard Szoecs, \email{eduardszoecs@@gmail.com} build_aw_idx <- function() { idx1 <- read_html("http://www.alanwood.net/pesticides/index_rn.html") - # idx2 <- read_html("http://www.alanwood.net/pesticides/index_rn1.html") - # idx3 <- read_html("http://www.alanwood.net/pesticides/index_rn2.html") prep_idx <- function(y) { names <- xml_text(xml_find_all(y, "//dl/dt")) links <- xml_attr( @@ -162,7 +151,6 @@ build_aw_idx <- function() { linknames <- xml_text(xml_find_all(y, "//dt/following-sibling::dd[1]/a[1]")) return(data.frame(names, links, linknames, stringsAsFactors = FALSE)) } - # aw_idx <- rbind(prep_idx(idx1), prep_idx(idx2) ,prep_idx(idx3)) aw_idx <- rbind(prep_idx(idx1)) aw_idx[["source"]] <- "rn" idx4 <- 
read_html("http://www.alanwood.net/pesticides/index_cn.html") @@ -183,6 +171,8 @@ build_aw_idx <- function() { aw_idx$linknames <- ln attr(aw_idx, "date") <- Sys.Date() + # do not delete this line + # occasionally the data file should be updated in the package # save(aw_idx, file = "data/aw_idx.rda") return(aw_idx) -} \ No newline at end of file +} diff --git a/R/chebi.R b/R/chebi.R index 1cd1cf60..df608b80 100644 --- a/R/chebi.R +++ b/R/chebi.R @@ -30,8 +30,8 @@ #' #' @references Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, #' Turner S, Swainston N, Mendes P, Steinbeck C. (2016). ChEBI in 2016: -#' Improved services and an expanding collection of metabfolites. Nucleic Acids -#' Res. +#' Improved services and an expanding collection of metabfolites. Nucleic +#' Acids Res. #' #' Hastings, J., de Matos, P., Dekker, A., Ennis, M., Harsha, B., Kale, N., #' Muthukrishnan, V., Owen, G., Turner, S., Williams, M., and Steinbeck, C. @@ -65,22 +65,21 @@ #' } get_chebiid <- function(query, from = 'ALL', - match = 'all', + match = c("all", "best", "ask", "na"), max_res = 200, stars = 'ALL', verbose = TRUE, ...) { - + match <- match.arg(match) foo <- function(query, match, from, max_res, stars, verbose, ...) { - # query = 'Isoproturon'; from = 'ALL'; match = 'ask'; max_res = 200; stars = 'ALL'; verbose = T # debuging - # arguments + if (is.na(query)) return(data.frame(chebiid = NA_character_, + query = NA_character_, + stringsAsFactors = FALSE)) from_all <- c('ALL', 'CHEBI ID', 'CHEBI NAME', 'DEFINITION', 'ALL NAMES', 'IUPAC NAME', 'CITATIONS', 'REGISTRY NUMBERS', 'MANUAL XREFS', 'AUTOMATIC XREFS', 'FORMULA', 'MASS', 'MONOISOTOPIC MASS', 'CHARGE', 'INCHI/INCHI KEY', 'SMILES', 'SPECIES') from <- match.arg(from, from_all) - match_all <- c('all', 'best', 'ask', 'na') - match <- match.arg(match, match_all) stars_all <- c('ALL', 'TWO ONLY', 'THREE ONLY') stars <- match.arg(stars, stars_all) # query @@ -116,26 +115,34 @@ get_chebiid <- function(query, out <- setNames(out, tolower(names(out))) if (nrow(out) == 0) { message('No result found. \n') - return(out) + return(data.frame(chebiid = NA_character_, + query = query, + stringsAsFactors = FALSE)) } + if (nrow(out) > 0) out$query <- query + if (nrow(out) == 1) return(out) if (match == 'all') { return(out) } if (match == 'best') { if (verbose) message('Returning best match. \n') - out <- out[ with(out, order(searchscore, decreasing = TRUE)), ] - return(out[ which.max(out$searchscore), ]) + out <- out[with(out, order(searchscore, decreasing = TRUE)), ] + return(out[which.max(out$searchscore), ]) } if (match == "ask") { matched <- chooser(out$chebiid, 'all') - return(out[ out$chebiid == matched, ]) + return(out[out$chebiid == matched, ]) } if (match == 'na') { - return(data.frame(chebiid = NA)) + return(data.frame(chebiid = NA_character_, + query = query, + stringsAsFactors = FALSE)) } } else { - out <- data.frame(chebiid = NA) + out <- data.frame(chebiid = NA_character_, + query = query, + stringsAsFactors = FALSE) message('Returning NA (', http_status(res)$message, '). \n') return(out) @@ -149,7 +156,7 @@ get_chebiid <- function(query, stars = stars, verbose = verbose) out <- setNames(out, query) - + out <- bind_rows(out) return(out) } @@ -212,6 +219,7 @@ chebi_comp_entity <- function(chebiid, verbose = TRUE, ...) { foo <- function(chebiid, verbose, ...) 
{ # chebiid = c('CHEBI:27744', 'CHEBI:17790'); verbose = TRUE # debuging + if (is.na(chebiid)) return(NA) url <- 'http://www.ebi.ac.uk:80/webservices/chebi/2.0/webservice' headers <- c(Accept = 'text/xml', Accept = 'multipart/*', @@ -235,10 +243,8 @@ chebi_comp_entity <- function(chebiid, verbose = TRUE, ...) { add_headers(headers), body = body) if (res$status_code != 200) { - out <- data.frame(chebiid = NA) warning(http_status(res)$message) - - return(out) + return(NA) } else { cont <- content(res, type = 'text/xml', encoding = 'utf-8') # restricted to one entry @@ -298,7 +304,7 @@ chebi_comp_entity <- function(chebiid, verbose = TRUE, ...) { } out <- lapply(chebiid, foo, verbose = verbose) out <- setNames(out, chebiid) - + class(out) <- c("chebi_comp_entity","list") return(out) } @@ -313,7 +319,7 @@ chebi_comp_entity <- function(chebiid, verbose = TRUE, ...) { #' @noRd #' l2df <- function(x) { - out <- data.frame(rbind.named.fill(lapply(x, unlist)), + out <- data.frame(rbind_named_fill(lapply(x, unlist)), row.names = NULL, stringsAsFactors = FALSE) @@ -331,7 +337,7 @@ l2df <- function(x) { #' @author Andreas Scharmueller, \email{andschar@@protonmail.com} #' @noRd #' -rbind.named.fill <- function(x) { +rbind_named_fill <- function(x) { nam <- lapply(x, names) unam <- unique(unlist(nam)) len <- lapply(x, length) @@ -340,4 +346,4 @@ rbind.named.fill <- function(x) { out[[i]] <- unname(x[[i]])[match(unam, nam[[i]])] } setNames(as.data.frame(do.call(rbind, out), stringsAsFactors = FALSE), unam) -} \ No newline at end of file +} diff --git a/R/extractors.R b/R/extractors.R index 7c24a412..8c4605b0 100644 --- a/R/extractors.R +++ b/R/extractors.R @@ -9,7 +9,6 @@ cas <- function(x, ...){ UseMethod("cas") } - #' @export cas.default <- function(x, ...) { sapply(x, function(y) { @@ -19,6 +18,14 @@ cas.default <- function(x, ...) { }) } +#' @export +cas.chebi_comp_entity <- function(x, ...) { + sapply(x, function(y) { + if (length(y) == 1 && is.na(y)) return(NA) + unique(y$regnumbers$data[y$regnumbers$type == "CAS Registry Number"]) + }) +} + #' @export cas.pan_query <- function(x, ...) { sapply(x, function(y) y$`CAS Number`) @@ -29,18 +36,34 @@ cas.wd_ident <- function(x, ...) { x$cas } - - # InChIKey ---------------------------------------------------------------- #' @rdname extractors #' @export inchikey <- function(x, ...){ UseMethod("inchikey") } + #' @export inchikey.default <- function(x, ...) { sapply(x, function(y) y$inchikey) } + +#' @export +inchikey.aw_query <- function(x, ...) { + sapply(x, function(y) { + if (length(y) == 1 && is.na(y)) return(NA) + y$inchikey + }) +} + +#' @export +inchikey.chebi_comp_entity <- function(x, ...) { + sapply(x, function (y) { + if (length(y) == 1 && is.na(y)) return(NA) + y$properties$inchikey + }) +} + #' @export inchikey.cs_compinfo <- function(x, ...) { x$inchikey @@ -60,7 +83,6 @@ inchikey.pan_query <- function(x, ...) { stop("InChIkey is not returned by this datasource!") } - #' @export inchikey.opsin_query <- function(x, ...) { x$stdinchikey @@ -79,18 +101,26 @@ inchikey.wd_ident <- function(x, ...) { x$inchikey } - # SMILES ------------------------------------------------------------------ #' @rdname extractors #' @export smiles <- function(x, ...){ UseMethod("smiles") } + #' @export smiles.default <- function(x, ...) { sapply(x, function(y) y$smiles) } +#' @export +smiles.chebi_comp_entity <- function(x, ...) 
{ + sapply(x, function(y) { + if (length(y) == 1 && is.na(y)) return(NA) + y$properties$smiles + }) +} + #' @export smiles.cs_compinfo <- function(x, ...) { x$smiles @@ -137,4 +167,3 @@ smiles.pc_prop <- function(x, ...) { smiles.wd_ident <- function(x, ...) { x$smiles } - diff --git a/data/aw_idx.rda b/data/aw_idx.rda index bbc41b67..3c4bf198 100644 Binary files a/data/aw_idx.rda and b/data/aw_idx.rda differ diff --git a/man/get_chebiid.Rd b/man/get_chebiid.Rd index 99edceac..65d0926d 100644 --- a/man/get_chebiid.Rd +++ b/man/get_chebiid.Rd @@ -7,7 +7,7 @@ get_chebiid( query, from = "ALL", - match = "all", + match = c("all", "best", "ask", "na"), max_res = 200, stars = "ALL", verbose = TRUE, @@ -64,8 +64,8 @@ get_chebiid(comp) \references{ Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C. (2016). ChEBI in 2016: - Improved services and an expanding collection of metabfolites. Nucleic Acids - Res. + Improved services and an expanding collection of metabfolites. Nucleic + Acids Res. Hastings, J., de Matos, P., Dekker, A., Ennis, M., Harsha, B., Kale, N., Muthukrishnan, V., Owen, G., Turner, S., Williams, M., and Steinbeck, C. diff --git a/tests/testthat/test-alanwood.R b/tests/testthat/test-alanwood.R index 29b17c81..d64a9d29 100644 --- a/tests/testthat/test-alanwood.R +++ b/tests/testthat/test-alanwood.R @@ -4,12 +4,15 @@ context("alanwood") test_that("alanwood, commonname", { skip_on_cran() - comps <- c("Fluazinam", "S-Metolachlor", "xxxxx") + comps <- c("Fluazinam", "S-Metolachlor", "balloon", NA) o1 <- aw_query(comps, type = "commonname") expect_is(o1, "list") - expect_equal(length(o1), 3) + expect_equal(length(o1), 4) + expect_is(o1[[1]], "list") + expect_is(o1[[2]], "list") expect_is(o1[[3]], "list") + expect_is(o1[[4]], "list") expect_equal(o1[["Fluazinam"]]$cas, "79622-59-6") expect_equal(length(o1[["S-Metolachlor"]]$inchikey), 2) expect_equal(length(o1[["S-Metolachlor"]]$inchi), 2) @@ -20,12 +23,15 @@ test_that("alanwood, commonname", { test_that("alanwood, cas", { skip_on_cran() - comps <- c("79622-59-6", "87392-12-9", "xxxxx") + comps <- c("79622-59-6", "87392-12-9", "balloon", NA) o1 <- aw_query(comps, type = "cas") expect_is(o1, "list") - expect_equal(length(o1), 3) + expect_equal(length(o1), 4) + expect_is(o1[[1]], "list") + expect_is(o1[[2]], "list") expect_is(o1[[3]], "list") + expect_is(o1[[4]], "list") expect_equal(o1[[1]]$cas, "79622-59-6") expect_equal(length(o1[[2]]$inchikey), 2) expect_equal(length(o1[[2]]$inchi), 2) @@ -48,4 +54,4 @@ test_that("alanwood index is up to date", { skip_on_cran() expect_true(Sys.Date() - attr(aw_idx, "date") < 90) -}) \ No newline at end of file +}) diff --git a/tests/testthat/test-chebi.R b/tests/testthat/test-chebi.R index 0df916ee..5e51426d 100644 --- a/tests/testthat/test-chebi.R +++ b/tests/testthat/test-chebi.R @@ -1,18 +1,22 @@ -context('chebi') +context("chebi") -a <- get_chebiid('Glyphosate', category = 'ALL') -A <- chebi_comp_entity('CHEBI:27744') -B <- chebi_comp_entity('27732') +a <- get_chebiid("Glyphosate", from = "ALL") +b <- get_chebiid(c("triclosan", "glyphosate", "balloon", NA)) +A <- chebi_comp_entity("CHEBI:27744") +B <- chebi_comp_entity("27732") -test_that('chebi returns correct results', { +test_that("chebi returns correct results", { skip_on_cran() - expect_is(a, 'list') - expect_is(a[[1]], 'data.frame') - expect_is(A, 'list') + expect_is(a, "data.frame") + expect_is(b, "data.frame") + expect_is(A, "list") + expect_is(B, "list") - 
expect_equal(names(a[[1]])[1], 'chebiid') - expect_equal(A$`CHEBI:27744`$regnumbers$data[1], '1071-83-6') - expect_equal(B$`27732`$properties$chebiasciiname, 'caffeine') - expect_equal(B$`27732`$properties$entitystar, '3') + expect_equal(names(a)[1], "chebiid") + expect_length(names(a), 5) + expect_length(names(b), 5) + expect_equal(A$`CHEBI:27744`$regnumbers$data[1], "1071-83-6") + expect_equal(B$`27732`$properties$chebiasciiname, "caffeine") + expect_equal(B$`27732`$properties$entitystar, "3") })
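
Not part of the patch: a minimal usage sketch of how the pieces changed in this PR fit together, illustrating the `get_*` convention from guideline 8 and the new `chebi_comp_entity` extractor methods. It assumes the patched `webchem` is installed and loaded and that the ChEBI web service is reachable; the example queries are arbitrary.

```r
# Sketch only; assumes the patched webchem is installed and ChEBI is reachable.
library(webchem)

# get_chebiid() takes a vector of queries and returns a single data frame;
# the NA query yields a row of NAs instead of a spurious match.
ids <- get_chebiid(c("triclosan", "glyphosate", NA), match = "best")

# chebi_comp_entity() returns a list of class "chebi_comp_entity";
# per this patch, NA or invalid IDs come back as NA elements.
comp <- chebi_comp_entity(ids$chebiid)

# The S3 extractor methods newly registered in NAMESPACE:
cas(comp)       # via cas.chebi_comp_entity()
inchikey(comp)  # via inchikey.chebi_comp_entity()
smiles(comp)    # via smiles.chebi_comp_entity()
```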