problem specifying the encoding for itis retrievals #334

Closed
scelmendorf opened this issue Sep 6, 2014 · 34 comments
@scelmendorf

Trying to get the authorship info for a taxon, but I'm running into problems with the special characters. Example:

myProblem<-itis_terms(query='Amara fulva', "scientific")
myProblem$author

I poked a bit and think I should be able to pass arguments to curl, but clearly I'm not doing this correctly.
Ideas? Or would it make sense to set the encoding for calls to itis to whatever itis.gov's encoding is (I actually couldn't figure this out from their website, but some trial and error might do the trick)
doesNotFixMyProblem<-itis_terms(query='Amara fulva', "scientific", curlopts=(list(.encoding='UTF-8')))

ideas?

@sckott sckott self-assigned this Sep 6, 2014
@sckott
Contributor

sckott commented Sep 6, 2014

Hi @scelmendorf - Thanks for the question.

Unfortunately, we have a mix of httr and RCurl in taxize - httr is a wrapper pkg around RCurl. With this function, we use RCurl internally. To pass curl options you can do

get verbose output

itis_terms(query='Amara fulva', "scientific", curlopts=list(verbose=TRUE))

http://www.itis.gov/ITISWebService/services/ITISService/getITISTermsFromScientificName?srchKey=Amara fulva
* Adding handle: conn: 0x7fec04061600
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 2 (0x7fec04061600) send_pipe: 1, recv_pipe: 0
* About to connect() to www.itis.gov port 80 (#2)
*   Trying 137.227.231.25...
* Connected to www.itis.gov (137.227.231.25) port 80 (#2)
> GET /ITISWebService/services/ITISService/getITISTermsFromScientificName?srchKey=Amara%20fulva HTTP/1.1
Host: www.itis.gov
Accept: */*

< HTTP/1.1 200 OK
< Date: Sat, 06 Sep 2014 05:18:47 GMT
< Content-Type: application/xml;charset=UTF-8
< Transfer-Encoding: chunked
< 
* Connection #2 to host www.itis.gov left intact
     tsn            author commonnames nameusage scientificname           .attrs
1 110866 (O. Müller, 1776)        <NA>     valid    Amara fulva ax21:SvcItisTerm

Set encoding

itis_terms(query='Amara fulva', "scientific", curlopts=list(encoding='UTF-8'))

Set a timeout

itis_terms(query='Amara fulva', "scientific", curlopts=list(timeout.ms=500))

http://www.itis.gov/ITISWebService/services/ITISService/getITISTermsFromScientificName?srchKey=Amara fulva
Error in function (type, msg, asError = TRUE)  : 
  Operation timed out after 806 milliseconds with 0 out of -1 bytes received

You can search for available curl options by doing listCurlOptions()
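For anyone landing here later, a minimal sketch of narrowing that list down to the options related to one keyword. It only uses RCurl::listCurlOptions() and base grep(); the requireNamespace() guard and the variable names are my own additions so the snippet does nothing harmful if RCurl is not installed.

```r
# Sketch: filter RCurl's option names for a keyword (here "timeout").
# Guarded so that it falls back to an empty result without RCurl.
if (requireNamespace("RCurl", quietly = TRUE)) {
  all_opts <- RCurl::listCurlOptions()
  timeout_opts <- grep("timeout", all_opts, value = TRUE)
} else {
  timeout_opts <- character(0)
}
print(timeout_opts)
```

Any name printed there can then be passed in the curlopts list, as in the timeout example above.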

@scelmendorf
Author

Yup Rstudio Version 0.98.1028

I still get the funny characters even when I set the encoding through curlopts, though. Any other ideas? Is this an rstudio problem?

itis_terms(query='Amara fulva', "scientific", curlopts=list(encoding='UTF-8'))
http://www.itis.gov/ITISWebService/services/ITISService/getITISTermsFromScientificName?srchKey=Amara fulva
     tsn             author commonnames nameusage scientificname           .attrs
1 110866 (O. MÃ¼ller, 1776)        <NA>     valid    Amara fulva ax21:SvcItisTerm

@sckott
Contributor

sckott commented Sep 7, 2014

@scelmendorf What does R print out when you run sessionInfo() ?

@scelmendorf
Author

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] taxize_0.3.0

loaded via a namespace (and not attached):
[1] ape_3.1-4 assertthat_0.1 codetools_0.2-8 data.table_1.9.2 foreach_1.4.2 grid_3.1.1
[7] httr_0.4 iterators_1.0.7 lattice_0.20-29 nlme_3.1-117 permute_0.8-3 plyr_1.8.1
[13] Rcpp_0.11.2 RCurl_1.95-4.3 reshape2_1.4 RJSONIO_1.3-0 stringr_0.6.2 Taxonstand_1.3
[19] tools_3.1.1 vegan_2.0-10 XML_3.98-1.1

@sckott
Contributor

sckott commented Sep 7, 2014

Thanks @scelmendorf - what happens when you try the code below? We currently use RCurl internally in the function you're using; I wonder if httr would fix this somehow.

Install httr if you don't have it, then load pkgs

# install.packages(c("httr","XML")) # install these if you don't already have them
library('taxize')
library('RCurl')
library('httr')
library('XML')
library('plyr')

Define this function

foo <- function(srchkey = NA, curlopts=list(), curl = getCurlHandle(), verbose=TRUE, which='httr') 
{
  url = "http://www.itis.gov/ITISWebService/services/ITISService/getITISTermsFromScientificName"
  args <- list()
  if (!is.na(srchkey))  args$srchKey <- srchkey

  if(which=='httr'){
    tt <- GET(url, query=args, config=c(followlocation = 0L, curlopts))
    out <- xmlParse(content(tt, as = "text"))
  } else{
    tt <- getForm(url, .params = args, .opts = c(curlopts, followlocation = 0L), curl = curl)
    out <- xmlParse(tt)
  }

  namespaces <- c(ax21 = "http://data.itis_service.itis.usgs.gov/xsd")
  gg <- getNodeSet(out, "//ax21:itisTerms", namespaces = namespaces, xmlToList)
  tmp <- do.call(rbind.fill, lapply(gg, function(x) data.frame(x, stringsAsFactors = FALSE)))
  names(tmp) <- tolower(names(tmp))
  row.names(tmp) <- NULL
  tmp
}

Try using httr

foo(srchkey='Amara fulva', which='httr')

http://www.itis.gov/ITISWebService/services/ITISService/getITISTermsFromScientificName?srchKey=Amara fulva

tsn            author commonnames nameusage scientificname           .attrs
1 110866 (O. Müller, 1776)        true     valid    Amara fulva ax21:SvcItisTerm

Try using RCurl

foo(srchkey='Amara fulva', which='rcurl')
http://www.itis.gov/ITISWebService/services/ITISService/getITISTermsFromScientificName?srchKey=Amara fulva

tsn            author commonnames nameusage scientificname           .attrs
1 110866 (O. Müller, 1776)        true     valid    Amara fulva ax21:SvcItisTerm

@scelmendorf
Author

Hi Scott,
Thanks for your persistence. This doesn’t seem to work either. First prob is the mssg statement:

foo <- function(srchkey = NA, curlopts=list(), curl = getCurlHandle(), verbose=TRUE, which='httr')
  if(which=='httr'){
    tt <- GET(url, query=args, config=c(followlocation = 0L, curlopts))
    out <- xmlParse(content(tt, as = "text"))
  } else{
    tt <- getForm(url, .params = args, .opts = c(curlopts, followlocation = 0L), curl = curl)
    out <- xmlParse(tt)
  }
  namespaces <- c(namespaces <- c(ax21 = "http://data.itis_service.itis.usgs.gov/xsd"))
  gg <- getNodeSet(out, "//ax21:itisTerms", namespaces = namespaces, xmlToList)
  tmp <- do.call(rbind.fill, lapply(gg, function(x) data.frame(x, stringsAsFactors = FALSE)))
  names(tmp) <- tolower(names(tmp))
  row.names(tmp) <- NULL
  tmp
}
foo(srchkey='Amara fulva', which='httr')
Error in foo(srchkey = "Amara fulva", which = "httr") :
  could not find function "mssg"

I commented out the mssg line and moved on, to see if I could skip it and just use the rest.
The next problem was that xmlParse was undefined. I installed and loaded ‘XML’, which seemed to solve that problem (or at least created a new one, which is more rewarding).

Now stuck on rbind.fill:

Error in do.call(rbind.fill, lapply(gg, function(x) data.frame(x, stringsAsFactors = FALSE))) :
object 'rbind.fill' not found

Suggestions? And also – thanks SO MUCH for your help

Sarah

@sckott
Contributor

sckott commented Sep 7, 2014

@scelmendorf sorry about that. I removed the mssg thing. Make sure to load those packages above the fxn definition before trying the foo() function.

@scelmendorf
Author

Yup, I already did that; I just did not copy all the way up the screen. If this helps:

library('taxize')
library('httr')
library('RCurl')
library ('XML')
#Define this function
foo <- function(srchkey = NA, curlopts=list(), curl = getCurlHandle(), verbose=TRUE, which='httr')
  if(which=='httr'){
    tt <- GET(url, query=args, config=c(followlocation = 0L, curlopts))
    out <- xmlParse(content(tt, as = "text"))
  } else{
    tt <- getForm(url, .params = args, .opts = c(curlopts, followlocation = 0L), curl = curl)
    out <- xmlParse(tt)
  }
  namespaces <- c(namespaces <- c(ax21 = "http://data.itis_service.itis.usgs.gov/xsd"))
  gg <- getNodeSet(out, "//ax21:itisTerms", namespaces = namespaces, xmlToList)
  tmp <- do.call(rbind.fill, lapply(gg, function(x) data.frame(x, stringsAsFactors = FALSE)))
  names(tmp) <- tolower(names(tmp))
  row.names(tmp) <- NULL
  tmp
}
#Try using httr
foo(srchkey='Amara fulva', which='httr')
Error in do.call(rbind.fill, lapply(gg, function(x) data.frame(x, stringsAsFactors = FALSE))) :
  object 'rbind.fill' not found

@sckott
Contributor

sckott commented Sep 7, 2014

Did plyr load? rbind.fill() is a function in the plyr package.
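A small sketch of one way to sidestep the "object 'rbind.fill' not found" error: call the function with its namespace prefix so it does not depend on library('plyr') having been run. The two data frames are made-up illustrative rows (the values are borrowed from the output earlier in this thread), and the requireNamespace() guard is only there so the snippet degrades gracefully without plyr.

```r
# Illustrative rows with different columns, to show what rbind.fill() is for.
rows <- list(
  data.frame(tsn = "110866", author = "(O. M\u00fcller, 1776)", stringsAsFactors = FALSE),
  data.frame(tsn = "110866", nameusage = "valid", stringsAsFactors = FALSE)
)

if (requireNamespace("plyr", quietly = TRUE)) {
  # plyr::rbind.fill works even when plyr is installed but not attached;
  # columns missing from a row are filled with NA.
  combined <- do.call(plyr::rbind.fill, rows)
} else {
  combined <- NULL  # plyr is not installed
}
print(combined)
```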

@sckott
Contributor

sckott commented Sep 7, 2014

I updated the script, I think you don't get updates to comments in your email...

@scelmendorf
Author

Aha, yes I did not see the update, but now see it on github. So now the function runs ☺. But it unfortunately doesn’t solve the funny character issues.

Maybe I should try a different R version or not use RStudio? Or I could try it on Linux; maybe this is a Windows problem?

library('taxize')
library('RCurl')
library('httr')
library('XML')
library('plyr')
#Define this function

foo <- function(srchkey = NA, curlopts=list(), curl = getCurlHandle(), verbose=TRUE, which='httr')
  if(which=='httr'){
    tt <- GET(url, query=args, config=c(followlocation = 0L, curlopts))
    out <- xmlParse(content(tt, as = "text"))
  } else{
    tt <- getForm(url, .params = args, .opts = c(curlopts, followlocation = 0L), curl = curl)
    out <- xmlParse(tt)
  }
  namespaces <- c(namespaces <- c(ax21 = "http://data.itis_service.itis.usgs.gov/xsd"))
  gg <- getNodeSet(out, "//ax21:itisTerms", namespaces = namespaces, xmlToList)
  tmp <- do.call(rbind.fill, lapply(gg, function(x) data.frame(x, stringsAsFactors = FALSE)))
  names(tmp) <- tolower(names(tmp))
  row.names(tmp) <- NULL
  tmp
}

foo(srchkey='Amara fulva', which='httr')
     tsn             author commonnames nameusage scientificname           .attrs
1 110866 (O. MÃ¼ller, 1776)        true     valid    Amara fulva ax21:SvcItisTerm

@scelmendorf
Author

FYI - your original function works just fine for me on a linux server. This may be the fastest fix.

@sckott
Contributor

sckott commented Sep 8, 2014

hey @gavinsimpson - I'm lost on this encoding thing. Do you know what a global solution is for windows users for special characters? e.g. ü - I wonder if there is something going on with the XML pkg, which does a lot of our parsing of XML data from APIs.

See e.g. #334 (comment)

@sckott
Contributor

sckott commented Sep 8, 2014

@cboettig any thoughts on how to fix character encoding problems on windows?

@cboettig
Member

cboettig commented Sep 8, 2014

@sckott Looks like this is probably due to the user's locale settings supporting only ascii characters. On a linux (& probably Mac) machine one would do:

Sys.setlocale("LC_ALL", 'en_US.UTF-8')

On a Windows machine it looks like the locale might be set by:

Sys.setlocale(category = "LC_ALL", locale = "English_United States.1252")

but not totally sure -- the sessionInfo() suggests that collate is already using that. iconv can do some locale conversion on the strings.

Scott, you could try Sys.setlocale("LC_ALL", 'C') and see if that reproduces the formatting issues on your machine?
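Before changing anything, it may help to look at what the session is actually using. A minimal base-R sketch follows; the variable names are just for illustration, and what it prints depends entirely on the machine it runs on.

```r
# Inspect the current character-type locale and whether R considers it UTF-8.
locale_ctype <- Sys.getlocale("LC_CTYPE")
info <- l10n_info()                  # list with MBCS, UTF-8 and Latin-1 flags
is_utf8 <- isTRUE(info[["UTF-8"]])

cat("LC_CTYPE:", locale_ctype, "\n")
cat("UTF-8 locale:", is_utf8, "\n")
```

With the sessionInfo() posted above (English_United States.1252), I would expect the UTF-8 flag to be FALSE, but that is an inference from the locale name rather than something tested here.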

@sckott
Contributor

sckott commented Sep 8, 2014

@cboettig thx, i'll give that a try

@gavinsimpson

You can use iconv() to convert strings to current locale of user if I understand the problem?
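A rough sketch of that idea in base R, assuming the string arrives as UTF-8 and the target is a Latin-1 style encoding (close to, but not identical to, the 1252 code page in the sessionInfo() above). The variable names are illustrative.

```r
# Start from a UTF-8 string and convert it to a single-byte Latin-1 encoding.
x <- "M\u00fcller"
native <- iconv(x, from = "UTF-8", to = "latin1")

# The u-umlaut takes two bytes (c3 bc) in UTF-8 and one byte (fc) in Latin-1.
print(charToRaw(x))
print(charToRaw(native))

# Both strings still represent the same text once encodings are accounted for.
print(identical(enc2utf8(native), x))
```

This only helps if the string really is valid UTF-8 at the point of conversion; if it was already mis-decoded earlier in the pipeline, converting again will not repair it.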

@sckott
Contributor

sckott commented Sep 8, 2014

Thanks @cboettig and @gavinsimpson. I tried both setting the locale on Windows and using iconv(); neither seems to help. I wonder if there's something I should be doing in the package itself. Here's what the user gets on their Windows machine:

taxize::itis_terms(query='Amara fulva', "scientific", curlopts=list(encoding='UTF-8')) 
tsn            author commonnames nameusage scientificname           .attrs
1 110866 (O. MÃ¼ller, 1776)        true     valid    Amara fulva ax21:SvcItisTerm

where the Ã¼ should be ü

Stepping through the code, it's all fine until I get to this line https://github.com/ropensci/taxize/blob/master/R/itis.R#L667-L668 in the function, where XML::getNodeSet() changes the character from ü to Ã¼. I've tried changing the encoding in the XML::getNodeSet() call, but that doesn't seem to do anything.
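For reference, this kind of garbling is what you get when the UTF-8 bytes of a character are interpreted as a single-byte encoding. The sketch below reproduces it in base R and then reverses it. It illustrates the general mechanism only; it is not a claim about what XML::getNodeSet() does internally.

```r
correct <- "M\u00fcller"

# Take the raw UTF-8 bytes and (wrongly) declare them to be Latin-1.
utf8_bytes <- charToRaw(correct)
misread <- rawToChar(utf8_bytes)
Encoding(misread) <- "latin1"
garbled <- enc2utf8(misread)        # each byte of the umlaut becomes its own character
print(garbled)

# Reverse the mistake: map the characters back to bytes, then declare them UTF-8.
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"
print(identical(repaired, correct))
```

The reversal only works while the garbled string still carries all of the original bytes, so it is a diagnostic aid more than a fix for the package.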

@sckott sckott added the bug label Sep 8, 2014
@sckott
Contributor

sckott commented Nov 5, 2014

@scelmendorf Is this still a problem for you?

@scelmendorf
Author

I gave up on trying it on windows and ran them all on linux because none of the fixes seemed to work. So yes, I think it's still broken unless you've done a patch, I haven't updated my taxize recently.

@sckott
Contributor

sckott commented Nov 5, 2014

Okay, thanks for getting back so quick. Sorry I haven't fixed this yet. I just haven't been able to figure this out. I'll keep at it.

@scelmendorf
Author

btw - in trying to borrow your code but tweak it to get synonyms from col, I may(??) have figured out a part of the encoding problem. I think it's not getting set right in getURL, but that's possible to skip? see example:

idnum<-1412627
url<-paste('http://www.catalogueoflife.org/col/webservice?id=', idnum, '&response=full', sep='')

#option 1 - use rcurl
out <- RCurl::getURL(url, encoding='UTF-8')
tt1 <- XML::xmlParse(out, encoding='UTF-8')
nodes1 <-XML::getNodeSet(tt1, "//result", fun = XML::xmlToList)

#option 2 - skip the rcurl step
tt2 <- XML::xmlParse(url, encoding='UTF-8')
nodes2 <-XML::getNodeSet(tt2, "//result", fun = XML::xmlToList)

parsecoldata_syn <- function(x){
  vals <- x[c('name', 'genus', 'species', 'infraspecies', 'rank', 'author', 'url', 'common_names')]
  vals[sapply(vals, is.null)] <- NA
  data.frame(vals, stringsAsFactors = FALSE)
}

syn1<-plyr::ldply(nodes1[[1]]$synonyms, parsecoldata_syn)
#encoding probs
syn1$species[5]
syn2<-plyr::ldply(nodes2[[1]]$synonyms, parsecoldata_syn)
#no encoding probs
syn2$species[5]

@sckott
Contributor

sckott commented May 26, 2015

hi @scelmendorf !

in trying to borrow your code but tweak it to get synonyms from col

If doable, we can try to add in COL as a source to the synonyms() function. Would you be interested in that?

@sckott
Contributor

sckott commented May 27, 2015

@scelmendorf have you reinstalled since the newer CRAN version from 19 Dec '14 http://cran.rstudio.com/web/packages/taxize/ or a newer Github version?

Try reinstalling from Github: devtools::install_github("ropensci/taxize") - I just pushed a change that at least for me on a Windows computer fixes the original problem you opened this issue with - does it work for you?

@scelmendorf
Author

I have taxize_0.5.2, still having this result:
taxize::itis_terms(query='Amara fulva', "scientific")
     tsn             author commonnames nameusage scientificname           .attrs
1 110866 (O. MÃ¼ller, 1776)        <NA>     valid    Amara fulva ax21:SvcItisTerm

But I will try the github version.

@scelmendorf
Author

And yes – I would definitely use synonyms from COL, itis isn’t super comprehensive for some taxa. But I might be in the minority.

@scelmendorf
Author

Github version works:

taxize::itis_terms(query='Amara fulva', "scientific")
tsn            author commonnames nameusage scientificname           .attrs
1 110866 (O. Müller, 1776)        <NA>     valid    Amara fulva ax21:SvcItisTerm

Thanks!

@sckott
Contributor

sckott commented May 27, 2015

@scelmendorf great! glad it works now.

Should have synonyms for COL up for you to try soon.

@scelmendorf
Author

If you really want to be a magical unicorn about my synonyms in COL problem, do you want to figure out how to grab the authorship for the synonyms while you are at it?? I am having some problems, in particular when the authorship contains an ampersand and is therefore wrapped in a CDATA section, for example (see Ornithodoros lagophilus): <![CDATA[Philip, Bell & Larson, 1956]]>. xmlTreeParse seems to read this as NULL.
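A sketch of reading such a value with the XML package's internal-node parser. The <synonym>/<author> wrapper is a made-up fragment for illustration, not the actual COL response layout, and the requireNamespace() guard is only there so the snippet does nothing if XML is not installed.

```r
# Hypothetical fragment; only the CDATA content is taken from the example above.
xml_text <- "<synonym><author><![CDATA[Philip, Bell & Larson, 1956]]></author></synonym>"

if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::xmlParse(xml_text, asText = TRUE)      # internal (C-level) nodes
  author_node <- XML::getNodeSet(doc, "//author")[[1]]
  author <- XML::xmlValue(author_node)               # text content, CDATA included
} else {
  author <- NA_character_                            # XML is not installed
}
print(author)
```

This uses xmlParse() rather than xmlTreeParse(), so it sidesteps the NULL result described above without explaining it.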

@sckott
Contributor

sckott commented May 27, 2015

@scelmendorf that should work with synonyms. e.g.

synonyms("Ornithodoros lagophilus", db = "col")
$`Ornithodoros lagophilus`
       id                    name    rank name_status        genus    species infraspecies                      author
1 1412150 Ornithodoros lagophilus Species     synonym Ornithodoros lagophilus              Philip, Bell & Larson, 1956
                                                                            url
1 http://www.catalogueoflife.org/col/details/species/id/1412297/synonym/1412150

@scelmendorf
Author

Perfect, thanks! I hadn’t tried it in the github version.

@sckott
Contributor

sckott commented May 27, 2015

well, it's only just there now, in the last commit, so reinstall

@scelmendorf
Author

related: col_synonyms I think needs one more encoding statement. Line 136 of synonyms.R, if you change:

out <- xmlParse(content(res, "text"), encoding = "UTF-8")
to
out <- xmlParse(content(res, "text", encoding="UTF-8"), encoding = "UTF-8")

I think it fixes it.
before:

tst<-col_synonyms(1412448)
tst$author[1]
[1] "DÃ¶nitz, 1907"

after

tst<-col_synonyms(1412448)
tst$author[1]
[1] "Dönitz, 1907"

@scelmendorf scelmendorf reopened this May 28, 2015
@sckott sckott closed this as completed in 8f46176 May 28, 2015
@sckott
Contributor

sckott commented May 28, 2015

@scelmendorf thanks for that, try again after reinstalling
