The World Register of Marine Species (WoRMS) is another taxonomic authority we use in our work. For the FWS work plan species, we rely mostly on ITIS, because it is the taxonomic authority FWS uses most and ITIS TSNs have been determined or declared for most species. This notebook runs after the ITIS caching process: it retrieves the names that ITIS could not match and tries them against WoRMS using the worms module of the bispy package.

In [1]:
import requests
import json
import bispy
from IPython.display import display
from joblib import Parallel, delayed
import jsonschema

worms = bispy.worms.Worms()
bis_utils = bispy.bis.Utils()

import helperfunctions

In [2]:
# Load the cached ITIS results; the with block closes the file on exit
with open("../cache/itis.json", "r") as f:
    itis_cache = json.load(f)

In [3]:
# Unwind the ITIS URLs into names for cases where we were not able to obtain an exact match
unmatched_itis_names = [
    i["processing_metadata"]["details"][0]["Exact Match Fail"].split(":")[-1].replace("\\%20", " ")
    for i in itis_cache
    if i["processing_metadata"]["status_message"] == "Not Matched"
]


In [4]:
unmatched_itis_names

['Monadenia fidelis minor',
 'Lycaena ferrisi',
 'Percina kusha',
 'Etheostoma maydeni',
 'Boltonia montana',
 'Pyrgulopsis hamlinensis',
 'Pyrgulopsis anguina',
 'Pyrgulopsis saxatilis',
 'Euphilotes ancilla cryptica',
 'Pyrgulopsis peculiaris',
 'Pyrgulopsis coloradensis',
 'Pyrgulopsis lata',
 'Pyrgulopsis fausta',
 'Pyrgulopsis breviloba',
 'Pyrgulopsis planulata',
 'Pyrgulopsis marcida',
 'Pyrgulopsis hubbsi',
 'Pyrgulopsis sublata',
 'Pyrgulopsis landyei',
 'Pyrgulopsis neritella',
 'Pyrgulopsis serrata',
 'Pyrgulopsis turbatrix',
 'Pyrgulopsis sulcata',
 'Pyrgulopsis deaconi',
 'Pyrgulopsis sterilis',
 'Pyrgulopsis orbiculata',
 'Pyrgulopsis sathos']
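
The comprehension in cell In [3] assumes every unmatched record has an "Exact Match Fail" entry in its first details item. A more defensive version might look like the sketch below. The field names come from that cell; the sample records and the example.org URL are hypothetical stand-ins for the real ITIS cache, and the function name `unmatched_names` is made up for illustration. Unlike the original, it uses the first details item that carries the key and decodes any percent-escape, not only `%20`.

```python
from urllib.parse import unquote


def unmatched_names(cache):
    """Return the searched names for cache records that were not matched."""
    names = []
    for record in cache:
        meta = record.get("processing_metadata", {})
        if meta.get("status_message") != "Not Matched":
            continue
        for detail in meta.get("details", []):
            url = detail.get("Exact Match Fail")
            if url is None:
                continue
            # The name is whatever follows the last colon in the query URL
            raw_name = url.split(":")[-1]
            # Drop the escaping backslashes, then decode %20 into spaces
            names.append(unquote(raw_name.replace("\\", "")))
            break  # one name per record
    return names


# Hypothetical records shaped like the fields that cell In [3] reads
sample_cache = [
    {
        "processing_metadata": {
            "status_message": "Not Matched",
            "details": [
                {"Exact Match Fail": "https://example.org/solr?q=nameWOInd:Percina\\%20kusha"}
            ],
        }
    },
    {"processing_metadata": {"status_message": "Exact Match", "details": []}},
]

print(unmatched_names(sample_cache))
```

Records without the expected key are skipped instead of raising a KeyError, which may or may not be what you want when auditing the cache.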

In [5]:
%%time
# Use joblib to run multiple searches for WoRMS species in parallel via species names
worms_result = Parallel(n_jobs=8)(delayed(worms.search)(name) for name in unmatched_itis_names)

CPU times: user 109 ms, sys: 63.7 ms, total: 173 ms
Wall time: 6.79 s


In [6]:
# Cache the array of retrieved documents and return/display a random sample for verification
display(bis_utils.doc_cache("../cache/worms.json", worms_result))

{'Doc Cache File': '../cache/worms.json',
 'Number of Documents in Cache': 27,
 'Document Number 7': {'processing_metadata': {'status': 'success',
   'date_processed': '2019-09-16T17:41:03.698376',
   'status_message': 'Exact Match',
   'api': 'http://www.marinespecies.org/rest/AphiaRecordsByName/Pyrgulopsis saxatilis?like=false&marine_only=false&offset=1'},
  'parameters': {'Scientific Name': 'Pyrgulopsis saxatilis',
   'Name Source': None},
  'data': [{'AphiaID': 1349463,
    'scientificname': 'Pyrgulopsis saxatilis',
    'authority': 'Hershler, 1998',
    'status': 'accepted',
    'unacceptreason': None,
    'taxonRankID': 220,
    'rank': 'Species',
    'valid_AphiaID': 1349463,
    'valid_name': 'Pyrgulopsis saxatilis',
    'valid_authority': 'Hershler, 1998',
    'parentNameUsageID': 717226,
    'kingdom': 'Animalia',
    'phylum': 'Mollusca',
    'class': 'Gastropoda',
    'order': 'Littorinimorpha',
    'family': 'Hydrobiidae',
    'genus': 'Pyrgulopsis',
    'lsid': 'urn:lsid:
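
The 'api' value in the cached document shows the request that was sent for each name. As a rough sketch of how such a URL could be assembled (this is not taken from the bispy source; the function name `worms_name_url` is hypothetical, and the endpoint and query parameters are copied from the 'api' value above):

```python
from urllib.parse import quote, urlencode

# Base path copied from the 'api' value in the cached document
WORMS_NAME_ENDPOINT = "http://www.marinespecies.org/rest/AphiaRecordsByName/"


def worms_name_url(scientific_name, like=False, marine_only=False, offset=1):
    """Build a name-search URL with the same query parameters as the cached 'api' value."""
    params = urlencode(
        {
            "like": str(like).lower(),  # the cached URL uses lowercase true/false
            "marine_only": str(marine_only).lower(),
            "offset": offset,
        }
    )
    # Percent-encode the name so the space does not appear raw in the path
    return f"{WORMS_NAME_ENDPOINT}{quote(scientific_name)}?{params}"


print(worms_name_url("Pyrgulopsis saxatilis"))
```

The only difference from the cached URL is that the space in the name is percent-encoded. The `marine_only=false` setting appears to matter here, since the unmatched names are largely non-marine taxa.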

# Schema Validation
Working through the WoRMS and ITIS processes and their schema documentation made it possible to establish three new common properties:

* date_created (a date/time from source data indicating when a record was created)
* date_modified (a date/time from source data indicating when a record was last updated/modified)
* biological_taxonomy (an array data structure containing the full taxonomic hierarchy upward from a given taxon record)

Because I'm building these data structures with a processing function, it seems reasonable to introduce these common properties at the build point, so I added functionality to the relevant functions in the bispy package that writes source attributes to these property names. Alternatively, it might be better to layer on separate logic that adds or transforms the common property names after the fact, retaining the full original source data and building something like a secondary index of common properties.
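
To make the second option concrete, here is a minimal sketch of deriving the common properties after the fact while leaving the source record untouched. It is not how bispy does it. The rank field names are taken from the WoRMS record shown earlier; the function name, the `created_field`/`modified_field` parameters, the `modified` key, and the rank/name shape of each `biological_taxonomy` entry are assumptions made for illustration.

```python
# Rank fields as they appear in the cached WoRMS record, ordered upward from genus
RANK_FIELDS_UPWARD = ["genus", "family", "order", "class", "phylum", "kingdom"]


def common_properties(source_record, created_field=None, modified_field=None):
    """Return a separate dict of common properties; source_record is not modified."""
    common = {
        "biological_taxonomy": [
            {"rank": rank, "name": source_record[rank]}
            for rank in RANK_FIELDS_UPWARD
            if source_record.get(rank)
        ]
    }
    # Copy date fields only when the caller names them and the source has them
    if created_field and source_record.get(created_field):
        common["date_created"] = source_record[created_field]
    if modified_field and source_record.get(modified_field):
        common["date_modified"] = source_record[modified_field]
    return common


# Taxonomy values copied from the cached record; the 'modified' entry is a
# placeholder added for this example, not a value returned by WoRMS
record = {
    "AphiaID": 1349463,
    "kingdom": "Animalia",
    "phylum": "Mollusca",
    "class": "Gastropoda",
    "order": "Littorinimorpha",
    "family": "Hydrobiidae",
    "genus": "Pyrgulopsis",
    "modified": "2000-01-01T00:00:00Z",
}

secondary = common_properties(record, modified_field="modified")
print(secondary["biological_taxonomy"][0])
print(secondary.get("date_modified"))
```

Keeping the derived document separate means the mapping can change later without re-fetching or rewriting the cached source data, at the cost of maintaining a second structure.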

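The validation cell that follows calls `jsonschema.validate`, which raises on the first problem it finds. When checking a whole cache it can be more useful to list every failure. The sketch below does that with `Draft7Validator.iter_errors`; the toy schema and documents are invented for illustration and are far simpler than the real WoRMS schema.

```python
from jsonschema import Draft7Validator

# Toy schema: an array of objects that must carry both top-level keys
toy_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "array",
    "items": {
        "type": "object",
        "required": ["processing_metadata", "data"],
    },
}

# Invented documents: the second lacks 'data', the third is not an object
docs = [
    {"processing_metadata": {"status": "success"}, "data": []},
    {"processing_metadata": {"status": "success"}},
    "not an object",
]


def validation_errors(instance, schema):
    """Return (path, message) pairs for every schema violation in instance."""
    validator = Draft7Validator(schema)
    errors = sorted(
        validator.iter_errors(instance),
        key=lambda e: [str(p) for p in e.absolute_path],
    )
    return [("/".join(str(p) for p in e.absolute_path), e.message) for e in errors]


for path, message in validation_errors(docs, toy_schema):
    print(path, message)
```

Each path is the position of the offending document in the array, which makes it straightforward to trace a failure back to the name that produced it.
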
In [7]:
worms_schema = helperfunctions.load_schema('worms')
display(worms_schema)

jsonschema.validate(worms_result, worms_schema)

{'definitions': {'items': {'$id': '#/items',
   'type': ['object', 'array'],
   'title': 'Generic container for items in a dataset',
   'description': 'A JSON array or object property containing one or more items in a dataset or data structure within a dataset.'}},
 '$schema': 'http://json-schema.org/draft-07/schema#',
 '$id': 'http://data.usgs.gov/property_registry/',
 'type': 'array',
 'title': 'WoRMS Data Cache',
 'description': 'A cache of records based on searching for species names from the World Register of Marine Species (http://marinespecies.org/). Data were assembled using a search function build into the experimental bispy software package (https://github.com/usgs-bcb/bispy).',
 'items': {'$ref': '#/definitions/items'},
 'properties': {'processing_metadata': {'$ref': 'common_properties.json#/definitions/processing_metadata'},
  'data': {'$ref': 'common_properties.json#/definitions/data',
   'items': {'$ref': '#/definitions/items',
    'required': ['AphiaID',
     'url',
    