---
title: Introduction to taxize
author: Scott Chamberlain
date: "2020-09-17"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to taxize}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
`taxize` is a taxonomic toolbelt for R. `taxize` wraps APIs for a large suite of taxonomic databases available on the web.
## Installation
First, install and load `taxize` into the R session.
```r
install.packages("taxize")
```
```r
library("taxize")
```
Advanced users can also download and install the latest development copy from GitHub (https://github.com/ropensci/taxize).
## Resolve taxonomic name
This is a common task in biology. We often have a list of species names and we want to know a) whether we have the most up-to-date names, b) whether our names are spelled correctly, and c) the scientific name for a common name. One way to resolve names is via the Global Names Resolver (GNR) service provided by the Encyclopedia of Life. Here, we are searching for two misspelled names:
```r
temp <- gnr_resolve(c("Helianthos annus", "Homo saapiens"))
head(temp)
#> # A tibble: 6 x 5
#> user_supplied_name submitted_name matched_name data_source_title score
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 Helianthos annus Helianthos ann… Helianthus annus uBio NameBank 0.75
#> 2 Helianthos annus Helianthos ann… Helianthus annu… Catalogue of Life 0.75
#> 3 Helianthos annus Helianthos ann… Helianthus annu… ITIS 0.75
#> 4 Helianthos annus Helianthos ann… Helianthus annu… NCBI 0.75
#> 5 Helianthos annus Helianthos ann… Helianthus annu… GRIN Taxonomy for P… 0.75
#> 6 Helianthos annus Helianthos ann… Helianthus annu… Union 4 0.75
```
The correct spellings are *Helianthus annuus* and *Homo sapiens*.
taxize takes the approach that the user should be able to decide which resource to trust, rather than having the package make that decision for them. The GNR service provides data from a variety of data sources. The user may trust a specific data source, and thus may want to use the names from that data source. In the future, we may provide the ability for taxize to suggest the best match from a variety of sources.
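If you already know which source you trust, one way to act on that is to subset the table returned by `gnr_resolve()`. The sketch below makes no web request: `temp_mock` is a hand-built data frame that reuses three of the column names and the first four source titles from the output above, and `trusted` is a hypothetical variable name.

```r
# Mock of part of a gnr_resolve() result (no web request is made).
# Column names and source titles are copied from the output above.
temp_mock <- data.frame(
  user_supplied_name = rep("Helianthos annus", 4),
  data_source_title  = c("uBio NameBank", "Catalogue of Life", "ITIS", "NCBI"),
  score              = rep(0.75, 4),
  stringsAsFactors   = FALSE
)

# Keep only the rows that come from the source we have decided to trust.
trusted <- "ITIS"
trusted_rows <- temp_mock[temp_mock$data_source_title == trusted, ]
trusted_rows
```

The same subsetting should work on a real result as long as the `data_source_title` column is present; the exact titles depend on what the service returns.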
Another common use case is when there are many synonyms for a species. In this example, we have three synonyms of the currently accepted name for a species.
```r
mynames <- c("Helianthus annuus ssp. jaegeri", "Helianthus annuus ssp. lenticularis", "Helianthus annuus ssp. texanus")
(tsn <- get_tsn(mynames, accepted = FALSE))
#> ══ 3 queries ═══════════════
#> ✔ Found: Helianthus annuus ssp. jaegeri
#> ✔ Found: Helianthus annuus ssp. lenticularis
#> ✔ Found: Helianthus annuus ssp. texanus
#> ══ Results ═════════════════
#> ● Total: 3
#> ● Found: 3
#> ● Not Found: 0
#> [1] "525928" "525929" "525930"
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found" "found" "found"
#> attr(,"multiple_matches")
#> [1] FALSE FALSE FALSE
#> attr(,"pattern_match")
#> [1] FALSE FALSE FALSE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=525928"
#> [2] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=525929"
#> [3] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=525930"
lapply(tsn, itis_acceptname)
#> [[1]]
#> submittedtsn acceptedname acceptedtsn author
#> 1 525928 Helianthus annuus 36616 L.
#> [[2]]
#> submittedtsn acceptedname acceptedtsn author
#> 1 525929 Helianthus annuus 36616 L.
#> [[3]]
#> submittedtsn acceptedname acceptedtsn author
#> 1 525930 Helianthus annuus 36616 L.
```
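The list returned by `lapply(tsn, itis_acceptname)` is often easier to work with as one table. A minimal sketch follows, assuming each element is a one-row data frame with the columns printed above; `accepted_mock` is a hand-built stand-in so that no web request is needed.

```r
# Mock of lapply(tsn, itis_acceptname): three one-row data frames whose
# values are copied from the output above (no web request is made).
accepted_mock <- lapply(c("525928", "525929", "525930"), function(id) {
  data.frame(
    submittedtsn = id,
    acceptedname = "Helianthus annuus",
    acceptedtsn  = "36616",
    author       = "L.",
    stringsAsFactors = FALSE
  )
})

# Stack the per-name results into one table, then list the distinct
# accepted names that the three synonyms resolve to.
accepted_tbl <- do.call(rbind, accepted_mock)
unique(accepted_tbl$acceptedname)
```

All three synonyms collapse to a single accepted name, which is what the printed output shows row by row.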
## Retrieve higher taxonomic names
Another task biologists often face is getting higher taxonomic names for a taxa list. Having the higher taxonomy allows you to put into context the relationships of your species list. For example, you may find out that species A and species B are in Family C, which may lead to some interesting insight, as opposed to not knowing that Species A and B are closely related. This also makes it easy to aggregate/standardize data to a specific taxonomic level (e.g., family level) or to match data to other databases with different taxonomic resolution (e.g., trait databases).
A number of data sources in taxize provide the capability to retrieve higher taxonomic names, but we will highlight two of the more useful ones: Integrated Taxonomic Information System (ITIS) and National Center for Biotechnology Information (NCBI). First, we'll search for two species, *Abies procera* and *Pinus contorta*, within ITIS.
```r
specieslist <- c("Abies procera","Pinus contorta")
classification(specieslist, db = 'itis')
#> ══ 2 queries ═══════════════
#> ✔ Found: Abies procera
#> ✔ Found: Pinus contorta
#> ══ Results ═════════════════
#>
#> ● Total: 2
#> ● Found: 2
#> ● Not Found: 0
#> $`Abies procera`
#> name rank id
#> 1 Plantae kingdom 202422
#> 2 Viridiplantae subkingdom 954898
#> 3 Streptophyta infrakingdom 846494
#> 4 Embryophyta superdivision 954900
#> 5 Tracheophyta division 846496
#> 6 Spermatophytina subdivision 846504
#> 7 Pinopsida class 500009
#> 8 Pinidae subclass 954916
#> 9 Pinales order 500028
#> 10 Pinaceae family 18030
#> 11 Abies genus 18031
#> 12 Abies procera species 181835
#>
#> $`Pinus contorta`
#> name rank id
#> 1 Plantae kingdom 202422
#> 2 Viridiplantae subkingdom 954898
#> 3 Streptophyta infrakingdom 846494
#> 4 Embryophyta superdivision 954900
#> 5 Tracheophyta division 846496
#> 6 Spermatophytina subdivision 846504
#> 7 Pinopsida class 500009
#> 8 Pinidae subclass 954916
#> 9 Pinales order 500028
#> 10 Pinaceae family 18030
#> 11 Pinus genus 18035
#> 12 Pinus contorta species 183327
#>
#> attr(,"class")
#> [1] "classification"
#> attr(,"db")
#> [1] "itis"
```
It turns out both species are in the family Pinaceae. You can also get this type of information from the NCBI by doing `classification(specieslist, db = 'ncbi')`.
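To pull one rank out of every classification at once, you can walk the list with base R. This is a sketch under stated assumptions: `class_mock` is an abbreviated hand-built copy of the result above (three ranks only), and `name_at_rank()` is a hypothetical helper, not a taxize function.

```r
# Abbreviated mock of classification(specieslist, db = "itis"): a named list
# of data frames with the columns shown above (only three ranks are kept).
class_mock <- list(
  "Abies procera" = data.frame(
    name = c("Pinaceae", "Abies", "Abies procera"),
    rank = c("family", "genus", "species"),
    id   = c("18030", "18031", "181835"),
    stringsAsFactors = FALSE
  ),
  "Pinus contorta" = data.frame(
    name = c("Pinaceae", "Pinus", "Pinus contorta"),
    rank = c("family", "genus", "species"),
    id   = c("18030", "18035", "183327"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: return the name at one rank for each taxon.
# Elements that are not data frames (e.g. taxa that were not found)
# and hierarchies without the requested rank give NA.
name_at_rank <- function(cl, rank) {
  vapply(cl, function(x) {
    if (!is.data.frame(x)) return(NA_character_)
    hit <- x$name[x$rank == rank]
    if (length(hit) == 0) NA_character_ else hit[1]
  }, character(1))
}

name_at_rank(class_mock, "family")
```

The result is a character vector named by the queried species, which makes it straightforward to add as a column to your own data.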
Instead of a full classification, you may only want a single name, say a family name for your species of interest. The function `tax_name` is built just for this purpose. As with the `classification` function you can specify the data source with the `db` argument, either ITIS or NCBI.
```r
tax_name("Helianthus annuus", get = "family", db = "ncbi")
#> ══ 1 queries ═══════════════
#> ✔ Found: Helianthus+annuus
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 1
#> ● Not Found: 0
#> db query family
#> 1 ncbi Helianthus annuus Asteraceae
```
It may happen that a data source does not provide information on the queried species. In that case, you can query another source and combine the results from the different sources.
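One way to combine sources is to fill the gaps in one result with values from another. The sketch below uses two hand-built data frames shaped like `tax_name()` output (columns `db`, `query`, `family`); the missing value and the second query are invented purely to illustrate a gap, and no web request is made.

```r
# Mock results shaped like tax_name() output. The NA is made up to
# represent a species that the first source did not return a family for.
from_ncbi <- data.frame(
  db     = "ncbi",
  query  = c("Helianthus annuus", "Abies procera"),
  family = c("Asteraceae", NA),
  stringsAsFactors = FALSE
)
from_itis <- data.frame(
  db     = "itis",
  query  = c("Helianthus annuus", "Abies procera"),
  family = c("Asteraceae", "Pinaceae"),
  stringsAsFactors = FALSE
)

# Start from the first source and fill only the missing families from
# the second, recording which source each value came from.
combined <- from_ncbi
gap  <- is.na(combined$family)
fill <- match(combined$query[gap], from_itis$query)
combined$family[gap] <- from_itis$family[fill]
combined$db[gap]     <- from_itis$db[fill]
combined
```

Keeping the `db` column up to date preserves the provenance of each name, which matters when sources disagree.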
## Interactive name selection
Most databases use a numeric code to reference a species. A general workflow in taxize is to retrieve the code for the queried species and then use this code to query more data.
Below are a few examples. When you run these examples in R, you are presented with a command prompt asking for the row that contains the name you would like back; that output is not printed below for brevity. In this example, the search term has many matches. The function returns a data frame of the matches, and asks for the user to input what row number to accept.
```r
get_uid("Pinus")
#> ══ 1 queries ═══════════════
#> ✔ Found: Pinus
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 1
#> ● Not Found: 0
#> [1] "3337"
#> attr(,"class")
#> [1] "uid"
#> attr(,"match")
#> [1] "found"
#> attr(,"multiple_matches")
#> [1] FALSE
#> attr(,"pattern_match")
#> [1] FALSE
#> attr(,"uri")
#> [1] "https://www.ncbi.nlm.nih.gov/taxonomy/3337"
```
In another example, you can pass in a long character vector of taxonomic names (although this one is rather short for demo purposes):
```r
splist <- c("annona cherimola", 'annona muricata', "quercus robur")
get_tsn(splist, searchtype = "scientific")
#> ══ 3 queries ═══════════════
#> ✔ Found: annona cherimola
#> ✔ Found: annona muricata
#> ✔ Found: quercus robur
#> ══ Results ═════════════════
#>
#> ● Total: 3
#> ● Found: 3
#> ● Not Found: 0
#> [1] "506198" "18098" "19405"
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found" "found" "found"
#> attr(,"multiple_matches")
#> [1] FALSE FALSE TRUE
#> attr(,"pattern_match")
#> [1] FALSE FALSE TRUE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=506198"
#> [2] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=18098"
#> [3] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19405"
```
There are functions for many other sources:
* `get_boldid()`
* `get_eolid()`
* `get_gbifid()`
* `get_nbnid()`
* `get_tpsid()`
Sometimes with these functions you get a lot of data back. In these cases you may want to limit your choices. Soon we will incorporate the ability to filter using `regex` to limit matches, but for now, we have a new parameter, `rows`, which lets you select certain rows. For example, you can select the first row of each given name, which means there is no interactive component:
```r
get_nbnid(c("Zootoca vivipara","Pinus contorta"), rows = 1)
#> ══ 2 queries ═══════════════
#> ✔ Found: Zootoca vivipara
#> ✔ Found: Pinus contorta
#> ══ Results ═════════════════
#>
#> ● Total: 2
#> ● Found: 2
#> ● Not Found: 0
#> [1] "NHMSYS0001706186" "NBNSYS0000004786"
#> attr(,"class")
#> [1] "nbnid"
#> attr(,"match")
#> [1] "found" "found"
#> attr(,"multiple_matches")
#> [1] TRUE TRUE
#> attr(,"pattern_match")
#> [1] FALSE FALSE
#> attr(,"uri")
#> [1] "https://species.nbnatlas.org/species/NHMSYS0001706186"
#> [2] "https://species.nbnatlas.org/species/NBNSYS0000004786"
```
Or you can select a range of rows:
```r
get_nbnid(c("Zootoca vivipara","Pinus contorta"), rows = 1:3)
#> ══ 2 queries ═══════════════
#> ✔ Found: Zootoca vivipara
#> ✔ Found: Pinus contorta
#> ══ Results ═════════════════
#>
#> ● Total: 2
#> ● Found: 2
#> ● Not Found: 0
#> [1] "NHMSYS0001706186" "NBNSYS0000004786"
#> attr(,"class")
#> [1] "nbnid"
#> attr(,"match")
#> [1] "found" "found"
#> attr(,"multiple_matches")
#> [1] TRUE TRUE
#> attr(,"pattern_match")
#> [1] TRUE TRUE
#> attr(,"uri")
#> [1] "https://species.nbnatlas.org/species/NHMSYS0001706186"
#> [2] "https://species.nbnatlas.org/species/NBNSYS0000004786"
```
In addition, if you have many names and do not want interactive name selection, you can get all data back with functions whose names end in an underscore, e.g., `get_tsn_()`, and likewise for other data sources. For example:
```r
out <- get_nbnid_("Poa annua")
NROW(out$`Poa annua`)
#> [1] 25
```
That's a lot of data, so we can get only certain rows back:
```r
get_nbnid_("Poa annua", rows = 1:10)
#> $`Poa annua`
#> guid scientificName rank taxonomicStatus
#> 1 NBNSYS0000002544 Poa annua species accepted
#> 2 NBNSYS0200001901 Bellis annua species accepted
#> 3 NBNSYS0200003392 Triumfetta annua species accepted
#> 4 NBNSYS0200002555 Lonas annua species accepted
#> 5 NHMSYS0000456951 Carrichtera annua species accepted
#> 6 NHMSYS0000461807 Poa labillardierei species accepted
#> 7 NHMSYS0000461808 Poa ligularis species accepted
#> 8 NHMSYS0000461817 Poa sieberiana species accepted
#> 9 NHMSYS0000461805 Poa gunnii species accepted
#> 10 NHMSYS0000461801 Poa costiniana species accepted
```
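As the table shows, the matches include names other than the one queried. A minimal sketch of narrowing such a result to exact name matches follows; `out_mock` is a hand-built three-row subset of the output above, so no web request is made.

```r
# Mock of get_nbnid_("Poa annua"): three rows copied from the output above.
out_mock <- list(
  "Poa annua" = data.frame(
    guid            = c("NBNSYS0000002544", "NBNSYS0200001901", "NHMSYS0000461807"),
    scientificName  = c("Poa annua", "Bellis annua", "Poa labillardierei"),
    rank            = "species",
    taxonomicStatus = "accepted",
    stringsAsFactors = FALSE
  )
)

# For each queried name, keep only the rows whose scientificName
# equals that name exactly.
exact <- lapply(names(out_mock), function(nm) {
  df <- out_mock[[nm]]
  df[df$scientificName == nm, , drop = FALSE]
})
names(exact) <- names(out_mock)

exact$`Poa annua`$guid
```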
## Coerce numerics/alphanumerics to taxon IDs
We've also introduced in `v0.5` the ability to coerce numerics and alphanumerics to taxonomic ID classes that are usually only retrieved via `get_*()` functions.
For example, with GBIF IDs:
```r
as.gbifid(get_gbifid("Poa annua")) # already a gbifid, returns the same
#> ══ 1 queries ═══════════════
#> gbifid scientificname rank status matchtype
#> 1 2704179 Poa annua L. species ACCEPTED EXACT
#> 2 8422205 Poa annua Cham. & Schltdl. species SYNONYM EXACT
#> 3 7730008 Poa annua Steud. species DOUBTFUL EXACT
#> ✖ Not Found: Poa annua
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 0
#> ● Not Found: 1
#> [1] NA
#> attr(,"class")
#> [1] "gbifid"
#> attr(,"match")
#> [1] "not found"
#> attr(,"multiple_matches")
#> [1] TRUE
#> attr(,"pattern_match")
#> [1] FALSE
as.gbifid(2704179) # numeric
#> [1] "2704179"
#> attr(,"class")
#> [1] "gbifid"
#> attr(,"match")
#> [1] "found"
#> attr(,"multiple_matches")
#> [1] FALSE
#> attr(,"pattern_match")
#> [1] FALSE
#> attr(,"uri")
#> [1] "https://www.gbif.org/species/2704179"
as.gbifid("2704179") # character
#> [1] "2704179"
#> attr(,"class")
#> [1] "gbifid"
#> attr(,"match")
#> [1] "found"
#> attr(,"multiple_matches")
#> [1] FALSE
#> attr(,"pattern_match")
#> [1] FALSE
#> attr(,"uri")
#> [1] "https://www.gbif.org/species/2704179"
as.gbifid(list("2704179","2435099","3171445")) # list, either numeric or character
#> [1] "2704179" "2435099" "3171445"
#> attr(,"class")
#> [1] "gbifid"
#> attr(,"match")
#> [1] "found" "found" "found"
#> attr(,"multiple_matches")
#> [1] FALSE FALSE FALSE
#> attr(,"pattern_match")
#> [1] FALSE FALSE FALSE
#> attr(,"uri")
#> [1] "https://www.gbif.org/species/2704179"
#> [2] "https://www.gbif.org/species/2435099"
#> [3] "https://www.gbif.org/species/3171445"
```
These `as.*()` functions do a quick check of the web resource to make sure it's a real ID. However, you can turn this check off, making this coercion much faster:
```r
system.time( replicate(3, as.gbifid(c("2704179","2435099","3171445"), check=TRUE)) )
#> user system elapsed
#> 0.092 0.003 4.850
system.time( replicate(3, as.gbifid(c("2704179","2435099","3171445"), check=FALSE)) )
#> user system elapsed
#> 0.002 0.000 0.002
```
## What taxa are downstream of my taxon of interest?
Someone who is not a taxonomic specialist on a particular taxon likely does not know which child taxa are within a family or a genus. This task becomes especially unwieldy when there are a large number of taxa downstream. You can of course go to a website like Wikispecies or Encyclopedia of Life to get downstream names. However, taxize provides an easy way to programmatically search for downstream taxa, shown here for the Integrated Taxonomic Information System.
```r
apis_itis_id <- 154395 # id for Apis, fetched beforehand to save time here
downstream(apis_itis_id, downto = "species", db = "itis")
#> $`154395`
#> tsn parentname parenttsn rankname taxonname rankid
#> 1 1128092 Apis 154395 species Apis laboriosa 220
#> 2 154396 Apis 154395 species Apis mellifera 220
#> 3 763550 Apis 154395 species Apis andreniformis 220
#> 4 763551 Apis 154395 species Apis cerana 220
#> 5 763552 Apis 154395 species Apis dorsata 220
#> 6 763553 Apis 154395 species Apis florea 220
#> 7 763554 Apis 154395 species Apis koschevnikovi 220
#> 8 763555 Apis 154395 species Apis nigrocincta 220
#>
#> attr(,"class")
#> [1] "downstream"
#> attr(,"db")
#> [1] "itis"
```
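If all you need is the list of species names, you can extract the `taxonname` column from each element. The sketch below assumes the list-of-data-frames shape printed above and uses `ds_mock`, a hand-built two-row subset, instead of a live query.

```r
# Mock of the downstream() result above, reduced to two of its rows.
ds_mock <- list(
  "154395" = data.frame(
    tsn       = c("154396", "763551"),
    taxonname = c("Apis mellifera", "Apis cerana"),
    rankname  = "species",
    stringsAsFactors = FALSE
  )
)

# Collect the taxon names from every element into one sorted vector.
species <- sort(unlist(lapply(ds_mock, function(x) x$taxonname), use.names = FALSE))
species
```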
## Direct children
You may sometimes only want the direct children. We've got you covered on that front, with methods for ITIS and NCBI.
The direct children (genera in this case) of _Pinaceae_ using NCBI data:
```r
children("Pinaceae", db = "ncbi")
#> $Pinaceae
#> childtaxa_id childtaxa_name childtaxa_rank
#> 1 123600 Nothotsuga genus
#> 2 64685 Cathaya genus
#> 3 3358 Tsuga genus
#> 4 3356 Pseudotsuga genus
#> 5 3354 Pseudolarix genus
#> 6 3337 Pinus genus
#> 7 3328 Picea genus
#> 8 3325 Larix genus
#> 9 3323 Keteleeria genus
#> 10 3321 Cedrus genus
#> 11 3319 Abies genus
#>
#> attr(,"class")
#> [1] "children"
#> attr(,"db")
#> [1] "ncbi"
```
## Get NCBI ID from GenBank IDs
With accession numbers:
```r
genbank2uid(id = 'AJ748748')
#> [[1]]
#> [1] "282199"
#> attr(,"class")
#> [1] "uid"
#> attr(,"match")
#> [1] "found"
#> attr(,"multiple_matches")
#> [1] FALSE
#> attr(,"pattern_match")
#> [1] FALSE
#> attr(,"uri")
#> [1] "https://www.ncbi.nlm.nih.gov/taxonomy/282199"
#> attr(,"name")
#> [1] "Nereida ignava 16S rRNA gene, type strain 2SM4T"
```
With GI numbers:
```r
genbank2uid(id = 62689767)
#> [[1]]
#> [1] "282199"
#> attr(,"class")
#> [1] "uid"
#> attr(,"match")
#> [1] "found"
#> attr(,"multiple_matches")
#> [1] FALSE
#> attr(,"pattern_match")
#> [1] FALSE
#> attr(,"uri")
#> [1] "https://www.ncbi.nlm.nih.gov/taxonomy/282199"
#> attr(,"name")
#> [1] "Nereida ignava 16S rRNA gene, type strain 2SM4T"
```
## Matching species tables with different taxonomic resolution
Biologists often need to match different sets of data tied to species. For example, trait-based approaches are a promising tool in ecology, but they require abundance data to be matched with trait databases. The two tables may record species information at different taxonomic levels, so the data may have to be aggregated to a common taxonomic level before they can be merged. taxize can help in this data-cleaning step, providing a reproducible workflow.
We can use the `classification` function described above to retrieve the taxonomic hierarchy and then search the hierarchies upwards and downwards for matches. Here is an example that matches a species against names on three different taxonomic levels.
```r
A <- "gammarus roeseli"
B1 <- "gammarus roeseli"
B2 <- "gammarus"
B3 <- "gammaridae"
A_clas <- classification(A, db = 'ncbi')
#> ══ 1 queries ═══════════════
#> ✔ Found: gammarus+roeseli
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 1
#> ● Not Found: 0
B1_clas <- classification(B1, db = 'ncbi')
#> ══ 1 queries ═══════════════
#> ✔ Found: gammarus+roeseli
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 1
#> ● Not Found: 0
B2_clas <- classification(B2, db = 'ncbi')
#> ══ 1 queries ═══════════════
#> ✔ Found: gammarus
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 1
#> ● Not Found: 0
B3_clas <- classification(B3, db = 'ncbi')
#> ══ 1 queries ═══════════════
#> ✔ Found: gammaridae
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 1
#> ● Not Found: 0
B1[match(A, B1)]
#> [1] "gammarus roeseli"
A_clas[[1]]$rank[tolower(A_clas[[1]]$name) %in% B2]
#> [1] "genus"
A_clas[[1]]$rank[tolower(A_clas[[1]]$name) %in% B3]
#> [1] "family"
```
If we find a direct match (here *Gammarus roeseli*), we are lucky. Otherwise, we can still match Gammaridae with *Gammarus roeseli*, but only at a coarser taxonomic resolution (family rather than species). A more comprehensive and realistic example (matching a trait table with an abundance table) is given in the vignette on matching.
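The three lookups above can be folded into one step. The following is a sketch, not part of taxize: `match_rank()` is a hypothetical helper, and `A_mock` is an abbreviated hand-built stand-in for `A_clas[[1]]` that keeps only the three ranks used in this example.

```r
# Abbreviated mock of classification("gammarus roeseli", db = "ncbi")[[1]];
# only the family, genus and species rows are kept.
A_mock <- data.frame(
  name = c("Gammaridae", "Gammarus", "Gammarus roeseli"),
  rank = c("family", "genus", "species"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each candidate name, report the rank at which
# it appears in the hierarchy (NA if it does not appear at all).
# Matching ignores case, as in the example above.
match_rank <- function(hierarchy, candidates) {
  idx <- match(tolower(candidates), tolower(hierarchy$name))
  setNames(hierarchy$rank[idx], candidates)
}

match_rank(A_mock, c("gammarus roeseli", "gammarus", "gammaridae", "asellus"))
```

The last candidate, *Asellus*, is included only to show the `NA` returned for a name that is absent from the hierarchy.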