Once we have the best available assertion of the ITIS identifier that FWS considers appropriate for as many species as possible (via ECOS scraping and TESS search), we can consult the ITIS API to retrieve information and cache it for later processing. This is done with the ITIS module in the bispy package, which uses various search mechanisms against ITIS and lightly repackages the returned data for use.

In [2]:
import requests
import json
import bispy
from IPython.display import display
from joblib import Parallel, delayed
import jsonschema

itis = bispy.itis.Itis()
bis_utils = bispy.bis.Utils()

import helperfunctions

In [2]:
# Open the cached workplan species (the with block closes the file for us)
with open("../cache/workplan_species.json", "r") as f:
    workplan_species = json.load(f)

In [3]:
# The ITIS search function can determine whether a search term is a TSN or a name.
# Gather the ITIS TSNs where we have them and the lookup names where we do not,
# then combine both into a single search list.
tsn_list = [r["ITIS TSN"] for r in workplan_species if r["ITIS TSN"] is not None]
names_without_tsns = [r["Lookup Name"] for r in workplan_species if r["ITIS TSN"] is None]
itis_search_list = tsn_list + names_without_tsns
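The assembly step can be tried on a few made-up records. This is a minimal sketch: the `build_search_list` helper and the sample values are illustrative assumptions, not part of bispy. It also drops repeated search terms, which may or may not be wanted for the real list.

```python
def build_search_list(records):
    """Return TSNs where present, then lookup names, without duplicates."""
    tsns = [r["ITIS TSN"] for r in records if r["ITIS TSN"] is not None]
    names = [r["Lookup Name"] for r in records if r["ITIS TSN"] is None]
    # dict.fromkeys keeps first-seen order while removing repeated terms
    return list(dict.fromkeys(tsns + names))


# Hypothetical sample records shaped like the workplan species cache
sample_records = [
    {"ITIS TSN": "80145", "Lookup Name": "Simpsonaias ambigua"},
    {"ITIS TSN": None, "Lookup Name": "Oreohelix"},
    {"ITIS TSN": "80145", "Lookup Name": "Simpsonaias ambigua"},
    {"ITIS TSN": None, "Lookup Name": "Cottus"},
]

search_list = build_search_list(sample_records)
print(search_list)
```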

In [4]:
%%time
# Use joblib to run multiple requests for ITIS documents in parallel, searching by TSN or lookup name
itis_cache = Parallel(n_jobs=8)(delayed(itis.search)(name_or_tsn) for name_or_tsn in itis_search_list)

CPU times: user 965 ms, sys: 111 ms, total: 1.08 s
Wall time: 40.1 s
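Because each lookup is an HTTP request and presumably spends most of its time waiting on the network, a thread pool from the standard library is one possible alternative to joblib here. The sketch below swaps in `concurrent.futures` and uses a stand-in `fake_search` function (an assumption for illustration, not the bispy `itis.search` method) so that it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_search(name_or_tsn):
    """Stand-in for a remote lookup; returns a small result document."""
    kind = "TSN" if str(name_or_tsn).isdigit() else "Scientific Name"
    return {"parameters": {kind: name_or_tsn}}


def search_all(search_fn, terms, max_workers=8):
    """Run search_fn over all terms using a pool of worker threads."""
    # executor.map returns results in the same order as the input terms
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(search_fn, terms))


results = search_all(fake_search, ["80145", "Oreohelix"])
print(results)
```

Keeping the results in input order makes it straightforward to line each returned document up with the search term that produced it.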


In [5]:
# Cache the array of retrieved documents and return/display a random sample for verification
display(bis_utils.doc_cache("../cache/itis.json", itis_cache))

{'Doc Cache File': '../cache/itis.json',
 'Number of Documents in Cache': 363,
 'Document Number 62': {'processing_metadata': {'status': 'success',
   'date_processed': '2019-09-16T17:30:14.458156',
   'status_message': 'Exact Match',
   'details': [{'Exact Match': 'https://services.itis.gov/?wt=json&rows=10&q=tsn:80145'}]},
  'parameters': {'TSN': '80145'},
  'data': [{'tsn': '80145',
    'nameWInd': 'Simpsonaias ambigua',
    'nameWOInd': 'Simpsonaias ambigua',
    'unit1': 'Simpsonaias',
    'unit2': 'ambigua',
    'usage': 'valid',
    'credibilityRating': 'TWG standards met',
    'taxonAuthor': '(Say, 1825)',
    'kingdom': 'Animalia',
    'parentTSN': '80143',
    'rankID': '220',
    'rank': 'Species',
    'synonyms': ['80145:$Simpsoniconcha ambigua$Alasmodonta ambigua$Unio hildrethianus$'],
    'synonymTSNs': ['80145:$80146$985772$985773$'],
    '_version_': 1643585535292211200,
    'date_created': '1996-06-13 14:51:08',
    'date_modified': '2015-06-25 00:00:00',
    'geograph
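The summary displayed above suggests a simple pattern: write the documents to a JSON file, then return the file name, the document count, and one randomly chosen document. The sketch below illustrates that pattern only; `cache_documents` is a hypothetical helper and not the bispy `doc_cache` implementation.

```python
import json
import random
import tempfile
from pathlib import Path


def cache_documents(cache_path, documents):
    """Write documents to a JSON file and return a summary with one random sample."""
    with open(cache_path, "w") as f:
        json.dump(documents, f)
    index = random.randrange(len(documents))
    return {
        "Doc Cache File": str(cache_path),
        "Number of Documents in Cache": len(documents),
        f"Document Number {index}": documents[index],
    }


# Hypothetical documents written to a temporary directory
docs = [{"parameters": {"TSN": "80145"}}, {"parameters": {"Scientific Name": "Cottus"}}]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "itis.json"
    summary = cache_documents(path, docs)
    reloaded = json.loads(path.read_text())

print(summary["Number of Documents in Cache"], reloaded == docs)
```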

# Schema Validation
In developing the schema documentation for the ITIS cache that this system builds, I referenced the [ITIS Data Model](https://www.itis.gov/pdf/ITIS_ConceptualModelEntityDefinition.pdf), which is the most comprehensive documentation of what ITIS provides that I could find. There are some discrepancies between the underlying data model and what ITIS expresses through its Solr API. The Solr API is the best available online source for working with the data in the streaming, continuously updating mode that we are exploring with this dataset for a given species list.

The development of the ITIS schema introduced a new element to the set of common properties that I am developing in this work: "details" within processing metadata. The ITIS lookup process for a name or identifier is complex enough (different ways of searching, following the taxonomic record to valid documents, etc.) that the process was set up to record what happens as a provenance trace. I modified the underlying ITIS lookup function to place these data into a processing_metadata/details array and documented it as a common property that we will use elsewhere.
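To illustrate the idea of a provenance trace, here is a minimal sketch of a lookup that appends one entry to `details` for every search step it attempts. The `lookup_with_trace` helper, the step labels other than "Exact Match", the "failure" status, and the sample names are all assumptions made for illustration; the actual bispy function may record different information.

```python
from datetime import datetime, timezone


def lookup_with_trace(term, search_steps):
    """Try each (label, function) step in order and record what happened."""
    metadata = {
        "status": "failure",
        "date_processed": datetime.now(timezone.utc).isoformat(),
        "status_message": "Not Matched",
        "details": [],
    }
    result = {"processing_metadata": metadata, "parameters": {"term": term}}
    for label, step in search_steps:
        docs = step(term)
        # Each attempt is appended to details, whether or not it matched
        metadata["details"].append({label: f"{len(docs)} document(s) returned"})
        if docs:
            metadata["status"] = "success"
            metadata["status_message"] = label
            result["data"] = docs
            break
    return result


# Hypothetical search steps: an exact match that fails, then a fuzzy match
steps = [
    ("Exact Match", lambda term: []),
    ("Fuzzy Match", lambda term: [{"nameWOInd": "Examplus", "usage": "valid"}]),
]
traced = lookup_with_trace("Exampluss", steps)
print(traced["processing_metadata"]["details"])
```

Recording failed attempts alongside the successful one is what makes the array useful as a trace: a reader can see how a document was reached, not just that it was.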

In [4]:
itis_schema = helperfunctions.load_schema('itis')
display(itis_schema)

jsonschema.validate(itis_cache, itis_schema)

{'definitions': {'items': {'$id': '#/items',
   'type': ['object', 'array'],
   'title': 'Generic container for items in a dataset',
   'description': 'A JSON array or object property containing one or more items in a dataset or data structure within a dataset.'}},
 '$schema': 'http://json-schema.org/draft-07/schema#',
 '$id': 'http://data.usgs.gov/property_registry/',
 'type': 'array',
 'title': 'ITIS Data Cache',
 'description': 'A cache of data assembled for workplan species using either name or ITIS TSN identifiers from the assembled list. Data were assembled using a search function build into the experimental bispy software package (https://github.com/usgs-bcb/bispy).',
 'items': {'$ref': '#/definitions/items'},
 'properties': {'processing_metadata': {'$ref': 'common_properties.json#/definitions/processing_metadata'},
  'data': {'$ref': 'common_properties.json#/definitions/data',
   'items': {'$ref': '#/definitions/items',
    'type': 'object',
    'title': 'ITIS Documents Contain

# Brief Data Examination
There are a couple of things we need to check in the returned ITIS data to make sure that we do not need to take any other action based on the taxonomic data and that we have all the right values to work with. At this point, having pulled together the workplan species list and gathered data from the FWS systems, we should have valid ITIS taxonomic records for everything that either had a TSN identified or for which we were able to look up a scientific name. That means each successful record in the data cache should contain an ITIS document whose usage is either valid or accepted. We can verify that with the following list comprehension, which should return an empty list if our assumption is correct.

In [9]:
[
    r for r in itis_cache
    if r["processing_metadata"]["status"] == "success"
    and not any(i["usage"] in ("valid", "accepted") for i in r["data"])
]

[]
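The same check can be exercised on a small made-up cache to confirm that it flags the right records. This is a sketch under assumptions: the `records_without_valid_doc` helper, the sample documents, and the "failure" status value are illustrative and not taken from the real cache.

```python
ACCEPTED_USAGES = {"valid", "accepted"}


def records_without_valid_doc(cache):
    """Return successful records that contain no valid or accepted ITIS document."""
    return [
        r for r in cache
        if r["processing_metadata"]["status"] == "success"
        and not any(doc["usage"] in ACCEPTED_USAGES for doc in r["data"])
    ]


# Hypothetical cache: one good record, one problem record, one failed lookup
sample_cache = [
    {"processing_metadata": {"status": "success"},
     "data": [{"tsn": "1", "usage": "invalid"}, {"tsn": "2", "usage": "valid"}]},
    {"processing_metadata": {"status": "success"},
     "data": [{"tsn": "3", "usage": "not accepted"}]},
    {"processing_metadata": {"status": "failure"}},
]

problems = records_without_valid_doc(sample_cache)
print([doc["tsn"] for r in problems for doc in r["data"]])
```

Only the second record is flagged: the first contains a valid document, and the third is skipped because its lookup did not succeed.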

Next, we take a quick look at cases where the taxon was identified only at a taxonomic level higher than species, subspecies, or variety. A number of records in the FWS list appear not to have species identified yet and were "cleaned up" ahead of time in the "Lookup Name" strings to facilitate at least a higher-level taxonomic match. The records resulting from these searches in the various information sources should be treated with care, since the resulting information will likely not be specific enough to use without further refinement. In particular, searches on genus names will likely return many more records in our search for scientific articles.

In [10]:
for record in [r for r in itis_cache if r["processing_metadata"]["status"] == "success"]:
    # Pick the first valid or accepted ITIS document for this record
    valid_itis_doc = next(i for i in record["data"] if i["usage"] in ["valid","accepted"])
    if valid_itis_doc["rank"] not in ["Species", "Subspecies", "Variety"]:
        # Records searched by TSN carry a "TSN" parameter instead of "Scientific Name"
        params = record["parameters"]
        print("Lookup Name:", params.get("Scientific Name", params.get("TSN")))
        print("ITIS Rank:", valid_itis_doc["rank"])
        print("ITIS Name:", valid_itis_doc["nameWOInd"])
        print("======")


Lookup Name: Oreohelix
ITIS Rank: Genus
ITIS Name: Oreohelix
Lookup Name: Vertigo
ITIS Rank: Genus
ITIS Name: Vertigo
Lookup Name: Eurycea
ITIS Rank: Genus
ITIS Name: Eurycea
Lookup Name: Ictalurus
ITIS Rank: Genus
ITIS Name: Ictalurus
Lookup Name: Astylis
ITIS Rank: Genus
ITIS Name: Astyris
Lookup Name: Cottus
ITIS Rank: Genus
ITIS Name: Cottus
Lookup Name: Hemileuca
ITIS Rank: Genus
ITIS Name: Hemileuca
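For later processing it may be handy to collect these higher-rank matches rather than only print them. The sketch below does this on a made-up cache and tallies the ranks found; the `higher_rank_matches` helper and the sample records are assumptions for illustration.

```python
from collections import Counter

SPECIES_LEVEL_RANKS = {"Species", "Subspecies", "Variety"}


def higher_rank_matches(cache):
    """Yield (search parameters, valid document) pairs ranked above species."""
    for record in cache:
        if record["processing_metadata"]["status"] != "success":
            continue
        # A default of None avoids an exception when no valid document exists
        doc = next(
            (d for d in record["data"] if d["usage"] in ("valid", "accepted")),
            None,
        )
        if doc is not None and doc["rank"] not in SPECIES_LEVEL_RANKS:
            yield record["parameters"], doc


# Hypothetical cache records shaped like the ITIS cache documents
sample_cache = [
    {"processing_metadata": {"status": "success"},
     "parameters": {"Scientific Name": "Oreohelix"},
     "data": [{"usage": "valid", "rank": "Genus", "nameWOInd": "Oreohelix"}]},
    {"processing_metadata": {"status": "success"},
     "parameters": {"TSN": "80145"},
     "data": [{"usage": "valid", "rank": "Species", "nameWOInd": "Simpsonaias ambigua"}]},
    {"processing_metadata": {"status": "success"},
     "parameters": {"Scientific Name": "Cottus"},
     "data": [{"usage": "valid", "rank": "Genus", "nameWOInd": "Cottus"}]},
]

matches = list(higher_rank_matches(sample_cache))
rank_counts = Counter(doc["rank"] for _, doc in matches)
print(dict(rank_counts))
```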
