name | layout | title | date | author | sourceslug | tags | ||||
---|---|---|---|---|---|---|---|---|---|---|
tnrs-use-case |
post |
Resolving species names when you have a lot of them |
2013-01-25 |
Scott Chamberlain |
_drafts/2013-01-25-tnrs-use-case.Rmd |
|
Species names can be a pain in the ass, especially if you are an ecologist. We ecologists aren't trained in taxonomy, yet we often end up with huge species lists. Of course we want to correct any spelling errors in the names, and get the newest names for our species, resolve any synonyms, etc.
We are building tools into our R package taxize
, that will let you check your species names to make sure they are correct.
An important use case is when you have a lot of species. Someone wrote to us recently, saying that they had thousands of species, and they wanted to know how to check their species names efficiently in R.
Below is an example of how to do this.
{% highlight r %}
library(taxize) {% endhighlight %}
{% highlight r %} tpl_get(dir_ = "~/foo2", family = "Scrophulariaceae") {% endhighlight %}
{% highlight text %}
{% endhighlight %}
{% highlight r %} dat <- read.csv("~/foo2/Scrophulariaceae.csv") {% endhighlight %}
{% highlight r %} species <- as.character(ddply(dat[, c("Genus", "Species")], .(), transform, gen_sp = as.factor(paste(Genus, Species, sep = " ")))[, 4]) {% endhighlight %}
It's better to do many smaller calls to a web API instead of few big ones to be nice to the database maintainers.
{% highlight r %}
slice <- function(input, by = 2) { starts <- seq(1, length(input), by) tt <- lapply(starts, function(y) input[y:(y + (by - 1))]) llply(tt, function(x) x[!is.na(x)]) } species_split <- slice(species, by = 100) {% endhighlight %}
Query for your large species list with pauses between calls, with 3 seconds in between calls to not hit the web service too hard. Using POST method here instead of GET - required when you have a lot of species.
{% highlight r %} tnrs_safe <- failwith(NULL, tnrs) # in case some calls fail, will continue out <- llply(species_split, function(x) tnrs_safe(x, getpost = "POST", sleep = 3)) {% endhighlight %}
{% highlight text %} Calling http://taxosaurus.org/retrieve/90fcd9ae425ad7c6103b06dd9fd78ae2 Calling http://taxosaurus.org/retrieve/223f73b83fcddcb8b6187966963660a8 Calling http://taxosaurus.org/retrieve/72bacdbb8938316e321d4c709c8cdd09 Calling http://taxosaurus.org/retrieve/979ce9cc4dec376710f61de162e1294e Calling http://taxosaurus.org/retrieve/03a39a124561fec2fdfc0f483d9fb607 Calling http://taxosaurus.org/retrieve/d4bf4e5a1403f45a1be1ca3dd87785d7 Calling http://taxosaurus.org/retrieve/a9a9bdde6fda7e325d80120e27ccb480 Calling http://taxosaurus.org/retrieve/215ccdcf2b00362278bf19d1942e1395 Calling http://taxosaurus.org/retrieve/9d43c0b99b4dfb5ea1b435adab17b980 Calling http://taxosaurus.org/retrieve/42e166f8e43f1fb349e36459cd5938b3 Calling http://taxosaurus.org/retrieve/2c42e4b5227c5464f9bfeeafcdf0651d {% endhighlight %}
{% highlight r %}
lapply(out, head)[1:2] # just look at the first two {% endhighlight %}
{% highlight text %}
[[1]]
submittedName acceptedName sourceId
1 Aptosimum welwitschii iPlant_TNRS
2 Anticharis ebracteata Anticharis ebracteata iPlant_TNRS
3 Aptosimum lineare Aptosimum lineare iPlant_TNRS
4 Antherothamnus pearsonii Antherothamnus pearsonii iPlant_TNRS
5 Barthlottia madagascariensis Barthlottia madagascariensis iPlant_TNRS
6 Agathelpis mucronata iPlant_TNRS
score matchedName annotations
1 1 Aptosimum welwitschii
2 1 Anticharis ebracteata Schinz
3 1 Aptosimum lineare Marloth & Engl.
4 1 Antherothamnus pearsonii N.E. Br.
5 1 Barthlottia madagascariensis Eb. Fisch.
6 1 Agathelpis mucronata
uri
1
2 http://www.tropicos.org/Name/29202501
3 http://www.tropicos.org/Name/29202525
4 http://www.tropicos.org/Name/29202728
5 http://www.tropicos.org/Name/50089700
6
[[2]] submittedName acceptedName sourceId 1 Buddleja pichinchensis x bullata Buddleja pichinchensis iPlant_TNRS 2 Buddleja soratae Buddleja soratae iPlant_TNRS 3 Buddleja euryphylla Buddleja euryphylla iPlant_TNRS 4 Buddleja incana Buddleja incana iPlant_TNRS 5 Buddleja incana Incana NCBI 6 Buddleja nana Buddleja brachystachya iPlant_TNRS score matchedName annotations 1 0.9 Buddleja pichinchensis Kunth 2 1.0 Buddleja soratae Kraenzl. 3 1.0 Buddleja euryphylla Standl. & Steyerm. 4 1.0 Buddleja incana Ruiz & Pav. 5 1.0 Buddleja incana none 6 1.0 Buddleja nana Diels uri 1 http://www.tropicos.org/Name/19000333 2 http://www.tropicos.org/Name/19001018 3 http://www.tropicos.org/Name/19000790 4 http://www.tropicos.org/Name/19000596 5 http://www.ncbi.nlm.nih.gov/taxonomy/405077 6 http://www.tropicos.org/Name/19001133 {% endhighlight %}
{% highlight r %}
outdf <- ldply(out) head(outdf) {% endhighlight %}
{% highlight text %}
submittedName acceptedName sourceId
1 Aptosimum welwitschii iPlant_TNRS
2 Anticharis ebracteata Anticharis ebracteata iPlant_TNRS
3 Aptosimum lineare Aptosimum lineare iPlant_TNRS
4 Antherothamnus pearsonii Antherothamnus pearsonii iPlant_TNRS
5 Barthlottia madagascariensis Barthlottia madagascariensis iPlant_TNRS
6 Agathelpis mucronata iPlant_TNRS
score matchedName annotations
1 1 Aptosimum welwitschii
2 1 Anticharis ebracteata Schinz
3 1 Aptosimum lineare Marloth & Engl.
4 1 Antherothamnus pearsonii N.E. Br.
5 1 Barthlottia madagascariensis Eb. Fisch.
6 1 Agathelpis mucronata
uri
1
2 http://www.tropicos.org/Name/29202501
3 http://www.tropicos.org/Name/29202525
4 http://www.tropicos.org/Name/29202728
5 http://www.tropicos.org/Name/50089700
6
{% endhighlight %}
Note that there are multiple names for some species because data sources have different names for the same species (resulting in more than one row in the data.frame 'outdf' for a species). We are leaving this up to the user to decide which to use. For example, for the species Buddleja montana there are two names for in the output
{% highlight r %} data <- ddply(outdf, .(submittedName), summarize, length(submittedName)) outdf[outdf$submittedName %in% as.character(data[data$..1 > 1, ][6, "submittedName"]), ] {% endhighlight %}
{% highlight text %} submittedName acceptedName sourceId score matchedName 123 Buddleja montana Buddleja montana iPlant_TNRS 1 Buddleja montana 124 Buddleja montana Montana NCBI 1 Buddleja montana annotations uri 123 Britton ex Rusby http://www.tropicos.org/Name/19000601 124 none http://www.ncbi.nlm.nih.gov/taxonomy/441235 {% endhighlight %}
The source iPlant matched the name, but NCBI actually gave back a genus of cricket (follow the link under the column uri for Montana). If you look at the page for Buddleja on NCBI here there is no Buddleja montana at all.
Another thing we could do is look at the score that is returned. Let's look at those that are less than 1 (i.e., )
{% highlight r %} outdf[outdf$score < 1, ] {% endhighlight %}
{% highlight text %}
submittedName acceptedName sourceId
94 Buddleja pichinchensis x bullata Buddleja pichinchensis iPlant_TNRS
340 Diascia ellaphieae iPlant_TNRS
495 Eremophila decipiens iPlant_TNRS
500 Eremophila grandiflora Eremophila iPlant_TNRS
808 Jamesbrittneia hilliard iPlant_TNRS
1051 Verbascum patris Verbascum iPlant_TNRS
1081 Verbascum barnadesii Verbascum iPlant_TNRS
1097 Verbascum calycosum Verbascum iPlant_TNRS
score matchedName annotations
94 0.90 Buddleja pichinchensis Kunth
340 0.98 Diascia ellaphiae
495 0.98 Eremophila decipiense
500 0.50 Eremophila R. Br.
808 0.50 Jamesbrittenia
1051 0.50 Verbascum L.
1081 0.50 Verbascum L.
1097 0.50 Verbascum L.
uri
94 http://www.tropicos.org/Name/19000333
340
495
500 http://www.tropicos.org/Name/40004761
808
1051 http://www.tropicos.org/Name/40023766
1081 http://www.tropicos.org/Name/40023766
1097 http://www.tropicos.org/Name/40023766
{% endhighlight %}
As we got this speies list from theplantlist.org, there aren't that many mistakes, but if it was my species list you know there would be many :)