Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternatives for handling NA inputs #225

Merged
merged 10 commits into from Mar 26, 2020
30 changes: 15 additions & 15 deletions CONTRIBUTING.md
Expand Up @@ -4,11 +4,11 @@ Thank you for your interest in contributing to the project! The goal of this gui

### Fill out the survey

The ```webchem``` survey allows us to learn who you are, which databases you use and how you interact with chemical data. This is extremely valuable information for us and guides our development efforts. The survey take about 5 minutes to fill out. You can fill out the survey [here](<link here>) (under development).
The `webchem` survey allows us to learn which databases you use and how you interact with chemical data. This is extremely valuable information for us and guides our development efforts. The survey take about 5 minutes to fill out. You can fill out the survey [here](<link here>) (under development).

### Share a use case

Write us an e-mail and show us a full example of how you use or how you would like to use ```webchem``` in your data analysis! This would give us ideas about new features and also help us create better vignettes that help others get started. Please send your e-mails to <webchem@ropensci.org>.
Write us an e-mail and show us a full example of how you use or how you would like to use `webchem` in your data analysis! This would give us ideas about new features and also help us create better vignettes that help others get started. Please send your e-mails to <webchem@ropensci.org>.

### Raise a new issue or join discussion on an existing issue

Expand All @@ -34,34 +34,34 @@ We do not have strong guidelines for code contributions and are happy to help at

1. We follow the [tidyverse](https://tidyverse.org) style. You can find the style guide [here](https://style.tidyverse.org/). Before committing your code, we encourage you to use ```lintr::lint_file()``` to check for nonconformances.

2. We use ```roxygen2``` for documentation. Please make sure you update the package to the latest version before you update the documentation with ```devtools::document()```. In fact, it is good practice to update all the packages that are used by webchem before making changes.
2. We use [`roxygen2`](https://cran.r-project.org/web/packages/roxygen2/index.html) for documentation. Please make sure you update the package to the latest version before you update the documentation with `devtools::document()`. Use `@noRd` for non exported functions.

2. For web scraping, we recommend the use of the [polite](https://dmi3kno.github.io/polite/) package.
3. Please use the [`xml2`](https://cran.r-project.org/web/packages/xml2/index.html) package instead of the `XML` package. The maintainance of xml2 is much better.

We want to keep dependencies to a minimum:
4. Please use the lightweight [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/index.html) package for handling JSON.

3. Please use the [`xml2`](https://github.com/hadley/xml2) package instead of the `XML` package. The maintainance of xml2 is much better.
5. Use utilities in `webchem::utils.R` when possible to keep function style consistent across the package.

4. Please use the lightweight [`jsonlite`](https://github.com/jeroenooms/jsonlite) package for handling JSON.

5. Use utilities in utils.R when possible to keep function style consistent across the package.

6. Be nice to the resources! Use appropriate timeouts.
6. Be nice to the resources! Minimise interaction with the servers. Use appropriate timeouts. For web scraping, we recommend the use of the [`polite`](https://cran.r-project.org/web/packages/polite/index.html) package.

7. Tests go into a separate tests branch and not in the master branch.

Some consistency guidelines:

8. Functions that query a database for one or more database specific identifiers should follow the naming convention ```get_*```, e.g. the function that queries ChEBI IDs is called ```get_chebiid()```.
8. Functions that query a database for one or more database specific identifiers should follow the naming convention `get_*`, e.g. the function that queries ChEBI IDs is called `get_chebiid()`. These functions should take a vector of queries and return a single [tibble](https://cran.r-project.org/web/packages/tibble/index.html). Whenever possible these functions should have arguments `query`, `from`, `match`, `verbose` and `...`. The first column of the tibble should contain the ID-s and the last should contain the queries. Invalid queries should return a row of NA-s (apart from the last element of the row which should be the query itself).

9. The naming of functions that query a database for chemical information should start with the name of the database, followed by the functionality, e.g. `pc_synonyms()` searches for synonyms in PubChem. These functions should take a vector of queries and return a list of responses. Invalid queries should return `NA`.

10. Functions should always validate their input when appropriate. Use `match.arg()` for input validation.

9. The naming of functions that query a database for chemical information should start with the name of the database, followed by the functionality, e.g. ```pc_synonyms()``` searches for synonyms in PubChem.
11. Make sure `NA` is not confused with sodium.

### Data Sources

You might think all webscraping is perfectly legal but it is unfortunately not that simple.

Some services allow you to browse their website but do not allow you programmable access, for various reasons. Therefore, we always have to check the Terms & Conditons and any other legal documents that might restrict programmable access. ```webchem``` only provides access to databases where programmable access is clearly approved by the database provider. A provider might create a publicly accessible API, and if they do not have a restrictive T&C, this indicates their implicit approval for programmatically accessing their data. In all other cases explicit approval is required, i.e. either the T&C has to state that scraping is allowed, or we have to acquire written consent from the database provider before developing functions that scrape their website.
Some services allow you to browse their website but do not allow you programmable access, for various reasons. Therefore, we always have to check the Terms & Conditons and any other legal documents that might restrict programmable access. `webchem` only provides access to databases where programmable access is clearly approved by the database provider. A provider might create a publicly accessible API, and if they do not have a restrictive T&C, this indicates their implicit approval for programmatically accessing their data. In all other cases explicit approval is required, i.e. either the T&C has to state that scraping is allowed, or we have to acquire written consent from the database provider before developing functions that scrape their website.

And there is a big difference between scraping and crawling. ```webchem``` does provide some scraping functionality but it does not provide crawling functionality. We aim to query databases not to download them.
And there is a big difference between scraping and crawling. `webchem` does provide some scraping functionality but it does not provide crawling functionality.

### Thanks for contributing!
4 changes: 4 additions & 0 deletions NAMESPACE
@@ -1,8 +1,11 @@
# Generated by roxygen2: do not edit by hand

S3method(cas,chebi_comp_entity)
S3method(cas,default)
S3method(cas,pan_query)
S3method(cas,wd_ident)
S3method(inchikey,aw_query)
S3method(inchikey,chebi_comp_entity)
S3method(inchikey,cs_compinfo)
S3method(inchikey,cs_extcompinfo)
S3method(inchikey,default)
Expand All @@ -12,6 +15,7 @@ S3method(inchikey,pan_query)
S3method(inchikey,pc_prop)
S3method(inchikey,wd_ident)
S3method(smiles,aw_query)
S3method(smiles,chebi_comp_entity)
S3method(smiles,cs_compinfo)
S3method(smiles,cs_extcompinfo)
S3method(smiles,cts_compinfo)
Expand Down
1 change: 1 addition & 0 deletions NEWS
Expand Up @@ -8,6 +8,7 @@ MINOR IMPROVEMENTS
BUG FIXES

* get_csid() returned an error when query was NA [PR #226, fixed by stitam]
* get_chebiid() and chebi_comp_entity() fixed for invalid queries [PR #225, fixed by stitam]
* get_cid() returned the PubChem ID of sodium when the query was NA [PR #223, fixed by stitam]
* aw_query() returned a list for successful queries, NA for unsuccessful queries [PR #222, fixed by stitam]

Expand Down
24 changes: 7 additions & 17 deletions R/alanwood.R
Expand Up @@ -53,19 +53,10 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
}

takelink <- links[tolower(names) == tolower(query)]
if (is.na(query)) takelink <- vector()
if (length(takelink) == 0) {
message("Not found! Returning NA.\n")
return(list(cname = NA,
status = NA,
pref_iupac_name = NA,
iupac_name = NA,
cas = NA,
formula = NA,
activity = NA,
subactivity = NA,
inchikey = NA,
inchi = NA,
source_url = NA))
return(NA)
}
if (length(takelink) > 1) {
takelink <- unique(takelink)
Expand All @@ -77,7 +68,7 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
if (verbose)
message("Querying ", takelink)

Sys.sleep(rgamma(1, shape = 15, scale = 1/10))
Sys.sleep(rgamma(1, shape = 15, scale = 1 / 10))
ttt <- read_html(paste0("http://www.alanwood.net/pesticides/", takelink))

status <- xml_text(
Expand Down Expand Up @@ -133,7 +124,7 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
out <- list(cname = cname, status = status,
pref_iupac_name = pref_iupac_name, iupac_name = iupac_name,
cas = cas, formula = formula, activity = activity,
subactivity = subactivity, inchikey = inchikey, inch = inchi,
subactivity = subactivity, inchikey = inchikey, inchi = inchi,
source_url = source_url)
return(out)
}
Expand All @@ -153,16 +144,13 @@ aw_query <- function(query, type = c("commonname", "cas"), verbose = TRUE,
#'@author Eduard Szoecs, \email{eduardszoecs@@gmail.com}
build_aw_idx <- function() {
idx1 <- read_html("http://www.alanwood.net/pesticides/index_rn.html")
# idx2 <- read_html("http://www.alanwood.net/pesticides/index_rn1.html")
# idx3 <- read_html("http://www.alanwood.net/pesticides/index_rn2.html")
prep_idx <- function(y) {
names <- xml_text(xml_find_all(y, "//dl/dt"))
links <- xml_attr(
xml_find_all(y, "//dt/following-sibling::dd[1]/a[1]"), "href")
linknames <- xml_text(xml_find_all(y, "//dt/following-sibling::dd[1]/a[1]"))
return(data.frame(names, links, linknames, stringsAsFactors = FALSE))
}
# aw_idx <- rbind(prep_idx(idx1), prep_idx(idx2) ,prep_idx(idx3))
aw_idx <- rbind(prep_idx(idx1))
aw_idx[["source"]] <- "rn"
idx4 <- read_html("http://www.alanwood.net/pesticides/index_cn.html")
Expand All @@ -183,6 +171,8 @@ build_aw_idx <- function() {
aw_idx$linknames <- ln
attr(aw_idx, "date") <- Sys.Date()

# do not delete this line
# occasionally the data file should be updated in the package
stitam marked this conversation as resolved.
Show resolved Hide resolved
# save(aw_idx, file = "data/aw_idx.rda")
return(aw_idx)
}
}
50 changes: 28 additions & 22 deletions R/chebi.R
Expand Up @@ -30,8 +30,8 @@
#'
#' @references Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V,
#' Turner S, Swainston N, Mendes P, Steinbeck C. (2016). ChEBI in 2016:
#' Improved services and an expanding collection of metabfolites. Nucleic Acids
#' Res.
#' Improved services and an expanding collection of metabfolites. Nucleic
#' Acids Res.
#'
#' Hastings, J., de Matos, P., Dekker, A., Ennis, M., Harsha, B., Kale, N.,
#' Muthukrishnan, V., Owen, G., Turner, S., Williams, M., and Steinbeck, C.
Expand Down Expand Up @@ -65,22 +65,21 @@
#' }
get_chebiid <- function(query,
from = 'ALL',
match = 'all',
match = c("all", "best", "ask", "na"),
max_res = 200,
stars = 'ALL',
verbose = TRUE,
...) {
Aariq marked this conversation as resolved.
Show resolved Hide resolved

match <- match.arg(match)
foo <- function(query, match, from, max_res, stars, verbose, ...) {
# query = 'Isoproturon'; from = 'ALL'; match = 'ask'; max_res = 200; stars = 'ALL'; verbose = T # debuging
# arguments
if (is.na(query)) return(data.frame(chebiid = NA_character_,
query = NA_character_,
stringsAsFactors = FALSE))
from_all <- c('ALL', 'CHEBI ID', 'CHEBI NAME', 'DEFINITION', 'ALL NAMES',
'IUPAC NAME', 'CITATIONS', 'REGISTRY NUMBERS', 'MANUAL XREFS',
'AUTOMATIC XREFS', 'FORMULA', 'MASS', 'MONOISOTOPIC MASS',
'CHARGE', 'INCHI/INCHI KEY', 'SMILES', 'SPECIES')
from <- match.arg(from, from_all)
match_all <- c('all', 'best', 'ask', 'na')
match <- match.arg(match, match_all)
stars_all <- c('ALL', 'TWO ONLY', 'THREE ONLY')
stars <- match.arg(stars, stars_all)
# query
Expand Down Expand Up @@ -116,26 +115,34 @@ get_chebiid <- function(query,
out <- setNames(out, tolower(names(out)))
if (nrow(out) == 0) {
message('No result found. \n')
return(out)
return(data.frame(chebiid = NA_character_,
query = query,
stringsAsFactors = FALSE))
}
if (nrow(out) > 0) out$query <- query
if (nrow(out) == 1) return(out)
if (match == 'all') {
return(out)
}
if (match == 'best') {
if (verbose)
message('Returning best match. \n')
out <- out[ with(out, order(searchscore, decreasing = TRUE)), ]
return(out[ which.max(out$searchscore), ])
out <- out[with(out, order(searchscore, decreasing = TRUE)), ]
return(out[which.max(out$searchscore), ])
}
if (match == "ask") {
matched <- chooser(out$chebiid, 'all')
return(out[ out$chebiid == matched, ])
return(out[out$chebiid == matched, ])
}
if (match == 'na') {
Aariq marked this conversation as resolved.
Show resolved Hide resolved
return(data.frame(chebiid = NA))
return(data.frame(chebiid = NA_character_,
query = query,
stringsAsFactors = FALSE))
}
} else {
out <- data.frame(chebiid = NA)
out <- data.frame(chebiid = NA_character_,
query = query,
stringsAsFactors = FALSE)
message('Returning NA (', http_status(res)$message, '). \n')

return(out)
Expand All @@ -149,7 +156,7 @@ get_chebiid <- function(query,
stars = stars,
verbose = verbose)
out <- setNames(out, query)

out <- bind_rows(out)
return(out)
}

Expand Down Expand Up @@ -212,6 +219,7 @@ chebi_comp_entity <- function(chebiid, verbose = TRUE, ...) {

foo <- function(chebiid, verbose, ...) {
# chebiid = c('CHEBI:27744', 'CHEBI:17790'); verbose = TRUE # debuging
if (is.na(chebiid)) return(NA)
url <- 'http://www.ebi.ac.uk:80/webservices/chebi/2.0/webservice'
headers <- c(Accept = 'text/xml',
Accept = 'multipart/*',
Expand All @@ -235,10 +243,8 @@ chebi_comp_entity <- function(chebiid, verbose = TRUE, ...) {
add_headers(headers),
body = body)
if (res$status_code != 200) {
out <- data.frame(chebiid = NA)
warning(http_status(res)$message)

return(out)
return(NA)
} else {
cont <- content(res, type = 'text/xml', encoding = 'utf-8')
# restricted to one entry
Expand Down Expand Up @@ -298,7 +304,7 @@ chebi_comp_entity <- function(chebiid, verbose = TRUE, ...) {
}
out <- lapply(chebiid, foo, verbose = verbose)
out <- setNames(out, chebiid)

class(out) <- c("chebi_comp_entity","list")
return(out)
}

Expand All @@ -313,7 +319,7 @@ chebi_comp_entity <- function(chebiid, verbose = TRUE, ...) {
#' @noRd
#'
l2df <- function(x) {
out <- data.frame(rbind.named.fill(lapply(x, unlist)),
out <- data.frame(rbind_named_fill(lapply(x, unlist)),
row.names = NULL,
stringsAsFactors = FALSE)

Expand All @@ -331,7 +337,7 @@ l2df <- function(x) {
#' @author Andreas Scharmueller, \email{andschar@@protonmail.com}
#' @noRd
#'
rbind.named.fill <- function(x) {
rbind_named_fill <- function(x) {
nam <- lapply(x, names)
unam <- unique(unlist(nam))
len <- lapply(x, length)
Expand All @@ -340,4 +346,4 @@ rbind.named.fill <- function(x) {
out[[i]] <- unname(x[[i]])[match(unam, nam[[i]])]
}
setNames(as.data.frame(do.call(rbind, out), stringsAsFactors = FALSE), unam)
}
}