Once we have the best available assertion of the ITIS identifier that FWS considers appropriate for as many species as possible (via ECOS scraping and TESS search), we can consult the ITIS API to retrieve information and cache it for later processing. This is done with the ITIS module in the bispy package, which uses several search mechanisms against ITIS and lightly repackages the returned data for use.

In [1]:
import requests
import json
import bispy
from IPython.display import display
from joblib import Parallel, delayed
import jsonschema

itis = bispy.itis.Itis()
bis_utils = bispy.bis.Utils()

import helperfunctions

In [2]:
# Open the cached workplan species (the with block closes the file on exit)
with open("../cache/workplan_species.json", "r") as f:
    workplan_species = json.load(f)

In [3]:
# The ITIS search function determines on its own whether a search term is a TSN or a name.
# Build one search list: ITIS TSNs where we have them, lookup names where we do not.
tsn_list = [r["ITIS TSN"] for r in workplan_species if r["ITIS TSN"] is not None]
names_without_tsns = [r["Lookup Name"] for r in workplan_species if r["ITIS TSN"] is None]
# Concatenate into a new list so that tsn_list is not modified in place
itis_search_list = tsn_list + names_without_tsns
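
To illustrate the idea that a single search entry point can accept either a TSN or a name, here is a minimal sketch. It is not the bispy implementation: `classify_search_term` and `build_search_list` are hypothetical helpers, and the rule that an all-digit string is a TSN is an assumption. The second sample record is made up for illustration.

```python
def classify_search_term(term):
    """Hypothetical helper: treat all-digit input as a TSN, anything else as a name."""
    text = str(term).strip()
    if text.isascii() and text.isdigit():
        return ("tsn", int(text))
    return ("name", text)


def build_search_list(records):
    """Combine known TSNs with lookup names for records that lack a TSN."""
    tsns = [r["ITIS TSN"] for r in records if r["ITIS TSN"] is not None]
    names = [r["Lookup Name"] for r in records if r["ITIS TSN"] is None]
    return tsns + names  # new list; neither input list is modified


# Illustrative records only; the second one is not real data
sample_records = [
    {"ITIS TSN": "80142", "Lookup Name": "Lasmigona subviridis"},
    {"ITIS TSN": None, "Lookup Name": "Example species"},
]

classified = [classify_search_term(t) for t in build_search_list(sample_records)]
print(classified)
```

Keeping the classification separate from list building makes it easy to swap in a stricter rule if TSNs are ever stored with extra formatting.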

In [4]:
%%time
# Use joblib to run multiple requests for ITIS documents in parallel, one per TSN or lookup name
itis_cache = Parallel(n_jobs=8)(delayed(itis.search)(name_or_tsn) for name_or_tsn in itis_search_list)

CPU times: user 944 ms, sys: 124 ms, total: 1.07 s
Wall time: 25.3 s


In [5]:
# Cache the array of retrieved documents and return/display a random sample for verification
display(bis_utils.doc_cache("../cache/itis.json", itis_cache))

{'Doc Cache File': 'cache/itis.json',
 'Number of Documents in Cache': 373,
 'Document Number 168': {'processing_metadata': {'status': 'success',
   'date_processed': '2019-07-24T15:48:35.628743',
   'status_message': 'Exact Match',
   'details': [{'Exact Match': 'http://services.itis.gov/?wt=json&rows=10&q=tsn:80142'}]},
  'itis_data': [{'tsn': '80142',
    'nameWInd': 'Lasmigona subviridis',
    'nameWOInd': 'Lasmigona subviridis',
    'unit1': 'Lasmigona',
    'unit2': 'subviridis',
    'usage': 'valid',
    'kingdom': 'Animalia',
    'parentTSN': '80134',
    'rank': 'Species',
    'synonyms': ['80142:$Unio subviridis$Unio tappanianus$Unio neglectus$Unio hyalinus$Unio pygmaeus$Margaritana quadrata$Unio pertenuis$'],
    'synonymTSNs': ['80142:$985695$985696$985697$985698$985699$985700$985701$'],
    'date_created': '1996-06-13 14:51:08',
    'date_modified': '2015-06-25 00:00:00',
    'biological_taxonomy': [{'rank': 'Kingdom', 'name': 'Animalia'},
     {'rank': 'Subkingdom', 'name
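
The `synonyms` and `synonymTSNs` values in the sample document above are packed into a single string: the TSN, a colon, then `$`-delimited values. A minimal sketch of unpacking them follows, assuming that format holds for other records and that the two lists line up by position (both are assumptions drawn from this one sample). `unpack_itis_multivalue` is a hypothetical helper, not part of bispy.

```python
def unpack_itis_multivalue(packed):
    """Split 'TSN:$a$b$c$' into ('TSN', ['a', 'b', 'c'])."""
    prefix, _, rest = packed.partition(":")
    # Leading and trailing '$' produce empty strings, which are dropped
    values = [v for v in rest.split("$") if v]
    return prefix, values


# Values copied from the sample document shown above
synonyms_field = ['80142:$Unio subviridis$Unio tappanianus$Unio neglectus$Unio hyalinus$Unio pygmaeus$Margaritana quadrata$Unio pertenuis$']
synonym_tsns_field = ['80142:$985695$985696$985697$985698$985699$985700$985701$']

tsn, synonym_names = unpack_itis_multivalue(synonyms_field[0])
_, synonym_tsns = unpack_itis_multivalue(synonym_tsns_field[0])

# Assumption: names and TSNs are paired by position
synonym_pairs = list(zip(synonym_tsns, synonym_names))
for syn_tsn, syn_name in synonym_pairs:
    print(syn_tsn, syn_name)
```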

# Schema Validation
In developing the schema documentation for the ITIS cache that this system builds, I referenced the [ITIS Data Model](https://www.itis.gov/pdf/ITIS_ConceptualModelEntityDefinition.pdf), which is the most comprehensive documentation of what ITIS provides that I could find. There are some discrepancies between the underlying data model and what ITIS exposes through its Solr API. The Solr API is the best available online source for working with the data in the streaming, continuously updating mode that we are exploring with this dataset for a given species list.

The development of the ITIS schema introduced a new element to the set of common properties that I am developing in this work: "details" within processing metadata. The ITIS lookup process for a name or identifier is complex enough (different ways of searching, following the taxonomic record to valid documents, etc.) that it was set up to record what happens as a provenance trace. I modified the underlying ITIS lookup function to place these data into a processing_metadata/details array and documented it as a common property that we will use elsewhere.
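
As a rough sketch of how a lookup might build a processing_metadata/details trace like the one in the sample output, the following records each attempted query in order. This is not the bispy code: `lookup_with_provenance`, the two stand-in search functions, the example.org URLs, and the "Fuzzy Match" and "Not Matched" labels are all hypothetical. Only the key names (`status`, `date_processed`, `status_message`, `details`, `itis_data`) come from the output shown earlier.

```python
from datetime import datetime


def lookup_with_provenance(term, strategies):
    """Try each (label, search) pair in order, logging every attempt in 'details'."""
    metadata = {
        "status": "failure",
        "date_processed": datetime.now().isoformat(),
        "status_message": "Not Matched",  # hypothetical label
        "details": [],
    }
    result = {"processing_metadata": metadata}
    for label, search in strategies:
        query_url, docs = search(term)
        metadata["details"].append({label: query_url})  # provenance entry
        if docs:
            metadata["status"] = "success"
            metadata["status_message"] = label
            result["itis_data"] = docs
            break
    return result


# Stand-in search functions; a real one would call the ITIS service
def fake_exact(term):
    return (f"http://example.org/?q=nameWOInd:{term}", [])


def fake_fuzzy(term):
    return (f"http://example.org/?q=nameWOInd:{term}~", [{"tsn": "80142"}])


record = lookup_with_provenance(
    "Lasmigona subviridis",
    [("Exact Match", fake_exact), ("Fuzzy Match", fake_fuzzy)],
)
print(record["processing_metadata"]["details"])
```

Appending to `details` before checking the result means failed attempts stay in the trace, which is what makes it useful when reviewing why a record resolved the way it did.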

In [6]:
itis_schema = helperfunctions.load_schema('itis')
display(itis_schema)

jsonschema.validate(itis_cache, itis_schema)

{'definitions': {'items': {'$id': '#/items',
   'type': ['object', 'array'],
   'title': 'Generic container for items in a dataset',
   'description': 'A JSON array or object property containing one or more items in a dataset'},
  'doi': {'$id': '#doi',
   'type': ['string', 'null'],
   'title': 'Digital Object Identifier',
   'description': 'A digital object identifier for or associated with a record. May be in the form of an HTTP url or a standalone identifier.',
   'examples': ['http://dx.doi.org/10.2305/IUCN.UK.2004.RLTS.T59435A11941314.en',
    '10.2305/IUCN.UK.2004.RLTS.T59435A11941314.en']},
  'resolvable_identifier': {'$id': '#resolvable_identifier',
   'type': 'string',
   'title': 'Resolvable Identifier',
   'description': 'Some form of resolvable identifier for a record that returns a response when accessed over an included protocol such as HTTP. May or may not provide for content negotiation.',
   'examples': ['https://www.iucnredlist.org/species/59435/11941314']},
  'citat