
Alternatives for handling NA inputs #225

Merged
merged 10 commits on Mar 26, 2020
30 changes: 15 additions & 15 deletions CONTRIBUTING.md
@@ -4,11 +4,11 @@ Thank you for your interest in contributing to the project! The goal of this gui

### Fill out the survey

The ```webchem``` survey allows us to learn who you are, which databases you use and how you interact with chemical data. This is extremely valuable information for us and guides our development efforts. The survey takes about 5 minutes to fill out. You can fill out the survey [here](<link here>) (under development).
The `webchem` survey allows us to learn which databases you use and how you interact with chemical data. This is extremely valuable information for us and guides our development efforts. The survey takes about 5 minutes to fill out. You can fill out the survey [here](<link here>) (under development).

### Share a use case

Write us an e-mail and show us a full example of how you use or how you would like to use ```webchem``` in your data analysis! This would give us ideas about new features and also help us create better vignettes that help others get started. Please send your e-mails to <webchem@ropensci.org>.
Write us an e-mail and show us a full example of how you use or how you would like to use `webchem` in your data analysis! This would give us ideas about new features and also help us create better vignettes that help others get started. Please send your e-mails to <webchem@ropensci.org>.

### Raise a new issue or join discussion on an existing issue

@@ -34,34 +34,34 @@ We do not have strong guidelines for code contributions and are happy to help at

1. We follow the [tidyverse](https://tidyverse.org) style. You can find the style guide [here](https://style.tidyverse.org/). Before committing your code, we encourage you to use ```lintr::lint_file()``` to check for nonconformances.

2. We use ```roxygen2``` for documentation. Please make sure you update the package to the latest version before you update the documentation with ```devtools::document()```. In fact, it is good practice to update all the packages that are used by webchem before making changes.
2. We use [`roxygen2`](https://cran.r-project.org/web/packages/roxygen2/index.html) for documentation. Please make sure you update the package to the latest version before you update the documentation with `devtools::document()`. Use `@noRd` for non-exported functions.

2. For web scraping, we recommend the use of the [polite](https://dmi3kno.github.io/polite/) package.
3. Please use the [`xml2`](https://cran.r-project.org/web/packages/xml2/index.html) package instead of the `XML` package. xml2 is much better maintained.

We want to keep dependencies to a minimum:
4. Please use the lightweight [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/index.html) package for handling JSON.

3. Please use the [`xml2`](https://github.com/hadley/xml2) package instead of the `XML` package. xml2 is much better maintained.
5. Use utilities in `webchem::utils.R` when possible to keep function style consistent across the package.

4. Please use the lightweight [`jsonlite`](https://github.com/jeroenooms/jsonlite) package for handling JSON.

5. Use utilities in utils.R when possible to keep function style consistent across the package.

6. Be nice to the resources! Use appropriate timeouts.
6. Be nice to the resources! Minimise interaction with the servers. Use appropriate timeouts. For web scraping, we recommend the use of the [`polite`](https://cran.r-project.org/web/packages/polite/index.html) package.

7. Tests go into a separate tests branch and not in the master branch.

Some consistency guidelines:

8. Functions that query a database for one or more database specific identifiers should follow the naming convention ```get_*```, e.g. the function that queries ChEBI IDs is called ```get_chebiid()```.
8. Functions that query a database for one or more database specific identifiers should follow the naming convention `get_*`, e.g. the function that queries ChEBI IDs is called `get_chebiid()`.

9. The naming of functions that query a database for chemical information should start with the name of the database, followed by the functionality, e.g. `pc_synonyms()` searches for synonyms in PubChem.

10. Functions should always validate their input when appropriate. Use `match.arg()` for input validation.

9. The naming of functions that query a database for chemical information should start with the name of the database, followed by the functionality, e.g. ```pc_synonyms()``` searches for synonyms in PubChem.
11. Functions should always return the same output structure, even for invalid inputs, e.g. unknown chemicals, or `NA`. Make sure `NA` is not confused with sodium. Include tests.
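Taken together, guidelines 8, 10 and 11 imply a common skeleton for `get_*()` functions. A minimal sketch follows — `get_exampleid()`, its fake ID format, and the `exampleid` column are invented for illustration and are not part of webchem:

```r
# Hypothetical get_*() skeleton following guidelines 8, 10 and 11.
get_exampleid <- function(query, match = c("all", "best", "na")) {
  match <- match.arg(match)  # guideline 10: validate arguments early

  foo <- function(q) {
    # guideline 11: invalid inputs, including NA, return the same structure
    if (is.na(q)) return(data.frame(exampleid = NA_character_))
    # ... a real function would query the database here ...
    data.frame(exampleid = paste0("ID:", toupper(q)))
  }

  # one row per query, so valid and invalid inputs line up
  do.call(rbind, lapply(query, foo))
}

get_exampleid(c("aspirin", NA))
# returns a 2-row data.frame; row 2 has exampleid NA
```

The point of the `foo()` helper is vectorisation: each element of `query`, valid or not, contributes one row with identical columns, so downstream code never has to special-case `NA`.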

### Data Sources

You might think all web scraping is perfectly legal, but unfortunately it is not that simple.

Some services allow you to browse their website but do not allow programmatic access, for various reasons. Therefore, we always have to check the Terms & Conditions and any other legal documents that might restrict programmatic access. ```webchem``` only provides access to databases where programmatic access is clearly approved by the database provider. A provider might create a publicly accessible API, and if they do not have a restrictive T&C, this indicates their implicit approval for programmatically accessing their data. In all other cases explicit approval is required, i.e. either the T&C has to state that scraping is allowed, or we have to acquire written consent from the database provider before developing functions that scrape their website.
Some services allow you to browse their website but do not allow programmatic access, for various reasons. Therefore, we always have to check the Terms & Conditions and any other legal documents that might restrict programmatic access. `webchem` only provides access to databases where programmatic access is clearly approved by the database provider. A provider might create a publicly accessible API, and if they do not have a restrictive T&C, this indicates their implicit approval for programmatically accessing their data. In all other cases explicit approval is required, i.e. either the T&C has to state that scraping is allowed, or we have to acquire written consent from the database provider before developing functions that scrape their website.

And there is a big difference between scraping and crawling. ```webchem``` does provide some scraping functionality but it does not provide crawling functionality. We aim to query databases not to download them.
And there is a big difference between scraping and crawling. `webchem` does provide some scraping functionality but it does not provide crawling functionality.

### Thanks for contributing!
12 changes: 6 additions & 6 deletions R/alanwood.R
@@ -53,6 +53,7 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
}

takelink <- links[tolower(names) == tolower(query)]
if (is.na(query)) takelink <- vector()
if (length(takelink) == 0) {
message("Not found! Returning NA.\n")
return(list(cname = NA,
@@ -77,7 +78,7 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
if (verbose)
message("Querying ", takelink)

Sys.sleep(rgamma(1, shape = 15, scale = 1/10))
Sys.sleep(rgamma(1, shape = 15, scale = 1 / 10))
ttt <- read_html(paste0("http://www.alanwood.net/pesticides/", takelink))

status <- xml_text(
@@ -133,7 +134,7 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
out <- list(cname = cname, status = status,
pref_iupac_name = pref_iupac_name, iupac_name = iupac_name,
cas = cas, formula = formula, activity = activity,
subactivity = subactivity, inchikey = inchikey, inch = inchi,
subactivity = subactivity, inchikey = inchikey, inchi = inchi,
source_url = source_url)
return(out)
}
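The one-line NA guard added above fixes a subtle subsetting pitfall; a standalone sketch with made-up name/link vectors shows why it is needed. Logical subscripts containing `NA` keep `NA` elements, so without the guard an `NA` query appears to "match" entries:

```r
# Illustration only: the vectors are invented, not real alanwood.net data.
names <- c("fluazinam", "metolachlor")
links <- c("fluazinam.html", "metolachlor.html")
query <- NA

takelink <- links[tolower(names) == tolower(query)]
length(takelink)  # 2 -- both comparisons yield NA, so two NA "matches" survive

if (is.na(query)) takelink <- vector()  # the guard from the diff
length(takelink)  # 0 -- the existing "Not found! Returning NA." branch now fires
```

Without the guard, `length(takelink) == 0` is `FALSE` and the function would try to read an `NA` URL; with it, an `NA` query falls through to the same not-found branch as an unknown chemical, which is exactly the consistent-output behaviour guideline 11 asks for.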
@@ -153,16 +154,13 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
#'@author Eduard Szoecs, \email{eduardszoecs@@gmail.com}
build_aw_idx <- function() {
idx1 <- read_html("http://www.alanwood.net/pesticides/index_rn.html")
# idx2 <- read_html("http://www.alanwood.net/pesticides/index_rn1.html")
# idx3 <- read_html("http://www.alanwood.net/pesticides/index_rn2.html")
prep_idx <- function(y) {
names <- xml_text(xml_find_all(y, "//dl/dt"))
links <- xml_attr(
xml_find_all(y, "//dt/following-sibling::dd[1]/a[1]"), "href")
linknames <- xml_text(xml_find_all(y, "//dt/following-sibling::dd[1]/a[1]"))
return(data.frame(names, links, linknames, stringsAsFactors = FALSE))
}
# aw_idx <- rbind(prep_idx(idx1), prep_idx(idx2) ,prep_idx(idx3))
aw_idx <- rbind(prep_idx(idx1))
aw_idx[["source"]] <- "rn"
idx4 <- read_html("http://www.alanwood.net/pesticides/index_cn.html")
@@ -183,6 +181,8 @@ build_aw_idx <- function() {
aw_idx$linknames <- ln
attr(aw_idx, "date") <- Sys.Date()

# do not delete this line
# occasionally the data file should be updated in the package
# save(aw_idx, file = "data/aw_idx.rda")
return(aw_idx)
}
}
129 changes: 112 additions & 17 deletions R/chebi.R
@@ -30,8 +30,8 @@
#'
#' @references Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V,
#' Turner S, Swainston N, Mendes P, Steinbeck C. (2016). ChEBI in 2016:
#' Improved services and an expanding collection of metabolites. Nucleic Acids
#' Res.
#' Improved services and an expanding collection of metabolites. Nucleic
#' Acids Res.
#'
#' Hastings, J., de Matos, P., Dekker, A., Ennis, M., Harsha, B., Kale, N.,
#' Muthukrishnan, V., Owen, G., Turner, S., Williams, M., and Steinbeck, C.
@@ -65,22 +65,19 @@
#' }
get_chebiid <- function(query,
from = 'ALL',
match = 'all',
match = c("all", "best", "ask", "na"),
max_res = 200,
stars = 'ALL',
verbose = TRUE,
...) {

match <- match.arg(match)
foo <- function(query, match, from, max_res, stars, verbose, ...) {
# query = 'Isoproturon'; from = 'ALL'; match = 'ask'; max_res = 200; stars = 'ALL'; verbose = T # debugging
# arguments
if (is.na(query)) return(data.frame(chebiid = NA))
from_all <- c('ALL', 'CHEBI ID', 'CHEBI NAME', 'DEFINITION', 'ALL NAMES',
'IUPAC NAME', 'CITATIONS', 'REGISTRY NUMBERS', 'MANUAL XREFS',
'AUTOMATIC XREFS', 'FORMULA', 'MASS', 'MONOISOTOPIC MASS',
'CHARGE', 'INCHI/INCHI KEY', 'SMILES', 'SPECIES')
from <- match.arg(from, from_all)
match_all <- c('all', 'best', 'ask', 'na')
match <- match.arg(match, match_all)
stars_all <- c('ALL', 'TWO ONLY', 'THREE ONLY')
stars <- match.arg(stars, stars_all)
# query
@@ -116,20 +113,20 @@ get_chebiid <- function(query,
out <- setNames(out, tolower(names(out)))
if (nrow(out) == 0) {
message('No result found. \n')
return(out)
return(data.frame(chebiid = NA))
}
if (match == 'all') {
return(out)
}
if (match == 'best') {
if (verbose)
message('Returning best match. \n')
out <- out[ with(out, order(searchscore, decreasing = TRUE)), ]
return(out[ which.max(out$searchscore), ])
out <- out[with(out, order(searchscore, decreasing = TRUE)), ]
return(out[which.max(out$searchscore), ])
}
if (match == "ask") {
matched <- chooser(out$chebiid, 'all')
return(out[ out$chebiid == matched, ])
return(out[out$chebiid == matched, ])
}
if (match == 'na') {
return(data.frame(chebiid = NA))
@@ -212,6 +209,7 @@ chebi_comp_entity <- function(chebiid, verbose = TRUE, ...) {

foo <- function(chebiid, verbose, ...) {
# chebiid = c('CHEBI:27744', 'CHEBI:17790'); verbose = TRUE # debugging
if (is.na(chebiid)) return(empty_chebi_comp_entity())
url <- 'http://www.ebi.ac.uk:80/webservices/chebi/2.0/webservice'
headers <- c(Accept = 'text/xml',
Accept = 'multipart/*',
@@ -235,10 +233,8 @@
add_headers(headers),
body = body)
if (res$status_code != 200) {
out <- data.frame(chebiid = NA)
warning(http_status(res)$message)

return(out)
return(empty_chebi_comp_entity())
} else {
cont <- content(res, type = 'text/xml', encoding = 'utf-8')
# restricted to one entry
@@ -313,7 +309,7 @@
#' @noRd
#'
l2df <- function(x) {
out <- data.frame(rbind.named.fill(lapply(x, unlist)),
out <- data.frame(rbind_named_fill(lapply(x, unlist)),
row.names = NULL,
stringsAsFactors = FALSE)

@@ -331,7 +327,7 @@ l2df <- function(x) {
#' @author Andreas Scharmueller, \email{andschar@@protonmail.com}
#' @noRd
#'
rbind.named.fill <- function(x) {
rbind_named_fill <- function(x) {
nam <- lapply(x, names)
unam <- unique(unlist(nam))
len <- lapply(x, length)
@@ -340,4 +336,103 @@ rbind.named.fill <- function(x) {
out[[i]] <- unname(x[[i]])[match(unam, nam[[i]])]
}
setNames(as.data.frame(do.call(rbind, out), stringsAsFactors = FALSE), unam)
}
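The diff fold hides part of the helper's body, so the following is a reconstructed, self-contained version of `rbind_named_fill()` based on the visible hunks (the packaged helper may differ slightly). Rows are aligned by name and padded with `NA` where an element lacks a given name:

```r
# Reconstructed sketch of rbind_named_fill(); not verbatim package code.
rbind_named_fill <- function(x) {
  nam <- lapply(x, names)
  unam <- unique(unlist(nam))            # union of all names, in first-seen order
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    # match() yields NA positions for missing names, which pad the row with NA
    out[[i]] <- unname(x[[i]])[match(unam, nam[[i]])]
  }
  setNames(as.data.frame(do.call(rbind, out), stringsAsFactors = FALSE), unam)
}

rbind_named_fill(list(c(a = 1, b = 2), c(b = 3, c = 4)))
# columns a, b, c; row 1 has c = NA, row 2 has a = NA
```

This is what lets `l2df()` above combine ChEBI list entries whose fields only partially overlap into one rectangular data frame.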

#' Create empty list for chebi_comp_entity
#'
#' The function creates a list with identical internal structure as a
#' chebi_comp_entity() output element. This is used in chebi_comp_entity() to
#' provide a valid output for invalid inputs and allow vectorisation.
#' @noRd
#' @author Tamas Stirling, \email{stirling.tamas@@gmail.com}
empty_chebi_comp_entity <- function() {
properties <- data.frame(
chebiid = as.character(NA),
chebiasciiname = as.character(NA),
definition = as.character(NA),
status = as.character(NA),
smiles = as.character(NA),
inchi = as.character(NA),
inchikey = as.character(NA),
charge = as.character(NA),
mass = as.character(NA),
monoisotopicmass = as.character(NA),
entitystar = as.character(NA)
)
chebiid_snd <- data.frame(
chebiids = as.character(NA)
)
chem_structure <- list(
list("structure" = list(as.character(NA)),
"type" = list(as.character(NA)),
"dimension" = list(as.character(NA)),
"defaultStructure" = list(as.character(NA)))
)
synonyms <- data.frame(
data = as.character(NA),
type = as.character(NA),
source = as.character(NA)
)
iupacnames <- data.frame(
data = as.character(NA),
type = as.character(NA),
source = as.character(NA)
)
formulae <- data.frame(
data = as.character(NA),
source = as.character(NA)
)
regnumbers <- data.frame(
data = as.character(NA),
type = as.character(NA),
source = as.character(NA)
)
citations <- data.frame(
data = as.character(NA),
type = as.character(NA),
source = as.character(NA)
)
dblinks <- data.frame(
data = as.character(NA),
type = as.character(NA)
)
parents <- data.frame(
chebiName = as.character(NA),
chebiId = as.character(NA),
type = as.character(NA),
status = as.character(NA),
cyclicRelationship = as.character(NA)
)
children <- data.frame(
chebiName = as.character(NA),
chebiId = as.character(NA),
type = as.character(NA),
status = as.character(NA),
cyclicRelationship = as.character(NA)
)
comments <- data.frame(
text = as.character(NA),
date = as.character(NA)
)
origins <- data.frame(
speciesText = as.character(NA),
speciesAccession = as.character(NA),
SourceType = as.character(NA),
SourceAccession = as.character(NA),
componentText = as.character(NA),
componentAccession = as.character(NA)
)
return(list("properties" = properties,
"chebiid_snd" = chebiid_snd,
"chem_structure" = chem_structure,
"synonyms" = synonyms,
"iupacnames" = iupacnames,
"formulae" = formulae,
"regnumbers" = regnumbers,
"citations" = citations,
"dblinks" = dblinks,
"parents" = parents,
"children" = children,
"comments" = comments,
"origins" = origins))
}
Binary file modified data/aw_idx.rda
Binary file not shown.
23 changes: 18 additions & 5 deletions tests/testthat/test-alanwood.R
@@ -4,33 +4,46 @@ context("alanwood")
test_that("alanwood, commonname", {
skip_on_cran()

comps <- c("Fluazinam", "S-Metolachlor", "xxxxx")
comps <- c("Fluazinam", "S-Metolachlor", "balloon", NA)
o1 <- aw_query(comps, type = "commonname")

expect_is(o1, "list")
expect_equal(length(o1), 3)
expect_equal(length(o1), 4)
expect_is(o1[[1]], "list")
expect_is(o1[[2]], "list")
expect_is(o1[[3]], "list")
expect_is(o1[[4]], "list")
expect_equal(o1[["Fluazinam"]]$cas, "79622-59-6")
expect_equal(length(o1[["S-Metolachlor"]]$inchikey), 2)
expect_equal(length(o1[["S-Metolachlor"]]$inchi), 2)
expect_equal(length(o1[["Fluazinam"]]), 11)
expect_true(is.na(o1[["balloon"]]$cname))
expect_true(is.na(o1[[4]]$cname))
})


test_that("alanwood, cas", {
skip_on_cran()

comps <- c("79622-59-6", "87392-12-9", "xxxxx")
comps <- c("79622-59-6", "87392-12-9", "balloon", NA)
o1 <- aw_query(comps, type = "cas")

expect_is(o1, "list")
expect_equal(length(o1), 3)
expect_equal(length(o1), 4)
expect_is(o1[[1]], "list")
expect_is(o1[[2]], "list")
expect_is(o1[[3]], "list")
expect_is(o1[[4]], "list")
expect_equal(mean(names(o1[[1]]) == names(o1[[2]])), 1) # check structure
expect_equal(mean(names(o1[[2]]) == names(o1[[3]])), 1) # check structure
expect_equal(mean(names(o1[[3]]) == names(o1[[4]])), 1) # check structure
expect_equal(o1[[1]]$cas, "79622-59-6")
expect_equal(length(o1[[2]]$inchikey), 2)
expect_equal(length(o1[[2]]$inchi), 2)
expect_equal(length(o1[[1]]), 11)
expect_true(is.na(aw_query("12071-83-9", type = "cas")[[1]]$inchi))
expect_true(is.na(o1[["balloon"]]$cname))
expect_true(is.na(o1[[4]]$cname))
})

test_that("alanwood, build_index", {
@@ -48,4 +61,4 @@ test_that("alanwood index is up to date", {
skip_on_cran()

expect_true(Sys.Date() - attr(aw_idx, "date") < 90)
})
})