taxonlookup: a versioned taxonomic lookup table for land plants
How to use this package
Install the required packages
#istall.packages("devtools") # if necessary devtools::install_github("ropenscilabs/datastorr") devtools::install_github("wcornwell/taxonlookup") library(taxonlookup)
Find the higher taxonomy for your species list
lookup_table(c("Pinus ponderosa","Quercus agrifolia"), by_species=TRUE)
## genus family order group ## Pinus ponderosa Pinus Pinaceae Pinales Gymnosperms ## Quercus agrifolia Quercus Fagaceae Fagales Angiosperms
There are a few other functions to get species diversity numbers and other (non-Linnean) high clades if you want that information. If you use this package in a published paper, please note the version number and the appropriate doi via Zenodo. This will allow others to reproduce your work later.
That's it, really. Below is information about the data sources and the versioned data distribution system (which we think is really cool), feel free to check it out, but you don't need to read the rest of this to use the package.
Reproducing a specific version
If you are publishing a paper with this library, or you want the results of your analysis to be reproducable for any other reason, include the version number in your call to lookup table. This will always pull the specific version of the taxonomy lookup that you used. If you leave this out, on a new machine the library will download the most recent version of the database rather than the specific one that you used.
lookup_table(c("Pinus ponderosa", "Quercus agrifolia"), by_species=TRUE, version="1.1.0")
## Downloading: 610 B Downloading: 610 B Downloading: 610 B | | | 0% | |= | 1% | |= | 2% | |== | 2% | |== | 3% | |=== | 5% | |==== | 6% | |===== | 7% | |===== | 8% | |====== | 9% | |======= | 10% | |======= | 11% | |======== | 12% | |======== | 13% | |========= | 14% | |========== | 15% | |=========== | 16% | |=========== | 17% | |=========== | 18% | |============ | 19% | |============= | 20% | |============== | 21% | |============== | 22% | |=============== | 23% | |================ | 24% | |================ | 25% | |================= | 26% | |================= | 27% | |================== | 28% | |=================== | 29% | |==================== | 30% | |==================== | 32% | |===================== | 32% | |===================== | 33% | |====================== | 34% | |======================= | 35% | |======================== | 36% | |======================== | 37% | |========================= | 39% | |========================== | 40% | |=========================== | 41% | |=========================== | 42% | |============================ | 43% | |============================= | 44% | |============================== | 46% | |============================== | 47% | |=============================== | 48% | |================================ | 49% | |================================= | 50% | |================================= | 51% | |================================= | 52% | |================================== | 53% | |=================================== | 54% | |==================================== | 55% | |==================================== | 56% | |===================================== | 56% | |===================================== | 57% | |====================================== | 58% | |====================================== | 59% | |======================================= | 60% | |======================================== | 61% | |======================================== | 62% | |========================================= | 63% | |========================================== | 64% | |=========================================== | 65% | |=========================================== | 66% | |=========================================== | 67% | |============================================ | 68% | |============================================= | 69% | |============================================== | 70% | |============================================== | 71% | |=============================================== | 73% | |================================================ | 74% | |================================================= | 75% | |================================================= | 76% | |================================================== | 77% | |=================================================== | 78% | |==================================================== | 80% | |==================================================== | 81% | |===================================================== | 81% | |===================================================== | 82% | |====================================================== | 83% | |======================================================= | 84% | |======================================================== | 85% | |======================================================== | 87% | |========================================================= | 88% | |========================================================== | 89% | |=========================================================== | 90% | |=========================================================== | 91% | |============================================================ | 92% | |============================================================ | 93% | |============================================================= | 94% | |============================================================== | 95% | |============================================================== | 96% | |=============================================================== | 97% | |================================================================ | 98% | |=================================================================| 99% | |=================================================================| 100% ## genus family order group ## Pinus ponderosa Pinus Pinaceae Pinales Gymnosperms ## Quercus agrifolia Quercus Fagaceae Fagales Angiosperms
Higher taxonomy lookup
If you want different taxonomic levels, use
add_higher_order. Because of the incomplete nesting of many taxonomic levels, this resource has a non-intuitive format. If a genus is within a specific group, that column includes the name of the clade. If it is not, then that cell is left blank. This shows the format of the resource:
This could, for example, return orders within the Rosidae:
# load the data.frame into memory ho<-add_higher_order() unique(ho$order[ho$Rosidae=="Rosidae"])
##  "Brassicales" "Celastrales" "Crossosomatales" ##  "Cucurbitales" "Fabales" "Fagales" ##  "Geraniales" "Huerteales" "Malpighiales" ##  "Malvales" "Myrtales" "Oxalidales" ##  "Picramniales" "Rosales" "Sapindales" ##  "Vitales" "Zygophyllales"
Combined with the species counts from the plant list this can be used to get the current estimates of diversity. For example, find the number of accepted Conifer species in the world:
##  760
Compare with the number of accepted plus the number of unresolved species:
##  877
The Plant List for accepted genera to families
We have a complete genus-family-order mapping for vascular plants. For bryophytes, there is only genus-family mapping at present; if anyone has a family-order map for bryophytes, please let me know. We also correct some spelling errors, special character issues, and other errors from The Plant List. We will try to keep this curation up-to-date, but there may new errors introduced as the cannonical data sources shift to future versions.
Notes on genus-to-family mapping
Taxonlookup is strictly constrained to a one-to-one genus-to-family (Linnean) mapping of taxa. The Plant List v1.1, which itself assembles data from multiple sources, is not strict about following implementing this rule and so this creates a few conflicts, which hopefully will be resolved in v1.2. In the meantime,
taxonlookup resolves these conflicts (one genus name mapping to multiple families) using the following rules:
For conflicts where two "accepted species" with the same genus name are mapped to different families, the genus is mapped to the family with more accepted species. This seemed to solve a somewhat common error in which one species is mistakenly given the wrong genus name. For example, there could be one moss species with the genus mistakenly listed as Pinus. In this case we want a behavior that maintains the clearly legitimate mapping of Pinus to Pinaceae.
For conflicts in which some species are accepted and others unresolved, the genus is mapped to the family of the accepted species.
For conflicts where all the species are unresolved it drops the genus entirely from both (or in some cases all three) families
The full list of TPL genus--family pairs that get dropped based on these criteria is here.
Details about the data distribution system
This is designed to be a living database--it will update as taxonomy changes (which it always will). These updates will correspond with changes to the version number of this resource, and each version of the database will be available via travis-ci and Github Releases. If you use this resource for published analysis, please note the version number in your publication. This will allow anyone in the future to go back and find exactly the same version of the data that you used.
Because Releases can be altered after the fact, we use zenodo-github integration to mint a DOI for each release. This will both give a citable DOI and help with the logevity of each version of the database. (Read more about this here.)
Details about the data distribution system
The core of this repository is a set of scripts that dynamically build a genus-family-order-higher taxa look-up table for land plants with the data lookup primarily coming from three sources (with most of the web-scraping done by taxize):
The scripts are in the repository but not in the package. Only the data and ways to access the data are in the package; the reason for this design will become clear further down the readme.
You can download and load the data into
R using the
## genus family order group ## 1 Acorus Acoraceae Acorales Angiosperms ## 2 Albidella Alismataceae Alismatales Angiosperms ## 3 Alisma Alismataceae Alismatales Angiosperms ## 4 Astonia Alismataceae Alismatales Angiosperms ## 5 Baldellia Alismataceae Alismatales Angiosperms ## 6 Burnatia Alismataceae Alismatales Angiosperms
The first call to
plant_lookup will download the data, but subsequent calls will be essentially instantaneous. If you are interested in diversity data, the data object also stores the number of accepted species within each genus as per the plant list:
head(plant_lookup(include_counts = TRUE)[,c(1,3,4)])
## number.of.accepted.species genus family ## 1 2 Acorus Acoraceae ## 2 1 Albidella Alismataceae ## 3 8 Alisma Alismataceae ## 4 1 Astonia Alismataceae ## 5 3 Baldellia Alismataceae ## 6 1 Burnatia Alismataceae
For taxonomic groups higher than order, use the
add_higher_order() function. Because currently the higher taxonomy of plants does not have a nested structure, the format of that lookup table is a little more complicated. Check the help file for more details. To get the version number of the dataset run:
##  "1.1.5"
For the most current version on Github run:
##  "1.1.5"
For most uses, the latest release should be sufficient, and this is all that is necessary to use the data. However, if there have been some recent changes to taxonomy that are both important for your project and incorporated in the cannical sources (the plant list or APWeb) but are more recent than the last release of this package, you might want to rebuild the lookup table from the sources. Because this requires downloading the data from the web sources, this will run slowly, depending on your internet connection.
Rebuilding the lookup table
To build the lookup table, first clone this repository. Then download and install
remake from github:
Then run the following commands from within R. Make sure the home directory is within the repository:
This requires a working internet connection and that the plant list, mobot, and datadryad servers are up and working properly. This will dynamically re-create the lookup tables within your local version of the package, as well as save the main file as a
.csv in your home directory.
We will periodically release development versions of the database using github releases. We'll check these automatically using travis-ci. As taxonomy is a moving target, we expect that the lookup to change with time.
Notes for making a release using this living dataset design
- Update the
DESCRIPTIONfile to increase the version number. Now that we are past version 1.0.0, we will use semantic versioning so be aware of when to change what number.
- Commit code changes and
DESCRIPTIONand push to GitHub
- With R in the package directory, run
"<description>" is a brief description of new features of the release.
- Check that it works by running
taxonlookup::plant_lookup(taxonlookup::plant_lookup_version_current(FALSE))which should pull the data.
- Update the Zenodo badge on the readme