It is often useful to get a basic picture of which species observation records exist before using them in modeling and analysis. The Global Biodiversity Information Facility (GBIF) is one source of observation data, and its API lends itself to relatively rapid summarization when building out a species information catalog. This notebook exercises an experimental function that ties together a couple of aspects of the GBIF API: it makes a reasonable guess at a species identifier (from the GBIF taxonomic hub) and pulls back a basic characterization of US-based occurrences. The three facets of the occurrence records that we can reasonably assemble quickly at this time are:

* basisOfRecord - The Darwin Core term describing the basis for the species occurrence. It mostly helps distinguish museum specimens (which may or may not have accurate spatial information for where the specimen was collected in the field) from human observations or other methods of observing a species.
* year - Provides a basic time series by year for the number of occurrences.
* institutionCode - A somewhat obscure set of codes/terms for the institution providing the record. Further details exist behind each code, but the codes alone give a basic idea of where the records come from.
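
As a rough illustration of how these three facets can be requested in a single call, the sketch below builds the same kind of occurrence search URL that appears in the cached record further down (`limit=0` asks for facet counts without the occurrence records themselves). The helper names `occurrence_facet_url` and `facet_counts` are hypothetical and are not part of `bispy`; the response shape assumed by `facet_counts` (a `facets` list whose entries carry a `field` and a list of `name`/`count` pairs) is my reading of the GBIF API and should be checked against a live response.

```python
from urllib.parse import urlencode

GBIF_API = "https://api.gbif.org/v1"
FACETS = ("institutionCode", "year", "basisOfRecord")


def occurrence_facet_url(scientific_name, country="US"):
    """Build an occurrence search URL that requests facet counts only (limit=0)."""
    params = [("country", country), ("limit", 0)]
    params += [("facet", facet) for facet in FACETS]
    params.append(("scientificName", scientific_name))
    return f"{GBIF_API}/occurrence/search?{urlencode(params)}"


def facet_counts(response_json):
    """Flatten an assumed 'facets' list into {field: {name: count}}."""
    return {
        facet["field"]: {c["name"]: c["count"] for c in facet.get("counts", [])}
        for facet in response_json.get("facets", [])
    }


print(occurrence_facet_url("Deroceras hesperium"))
```

The URL can then be fetched with `requests.get(url).json()` and the result passed to `facet_counts`. Keeping URL construction and response parsing as separate functions makes each piece easy to check without a network call.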

In [1]:
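# Third-party packages
import requests
import bispy
import jsonschema
from IPython.display import display
from joblib import Parallel, delayed

# Local helper module that ships alongside this notebook
import helperfunctions

# bispy wrappers for GBIF access and document caching
gbif = bispy.gbif.Gbif()
bis_utils = bispy.bis.Utils()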

In [2]:
name_list = helperfunctions.workplan_species()

In [3]:
%%time
# Use joblib to run multiple requests for records in parallel via scientific names
gbif_results = Parallel(n_jobs=8)(
    delayed(gbif.summarize_us_species)(name, name_source)
    for name, name_source in name_list
)


CPU times: user 1.04 s, sys: 111 ms, total: 1.15 s
Wall time: 2min 37s


In [4]:
# Cache the array of retrieved documents and return/display a random sample for verification
display(bis_utils.doc_cache("../cache/gbif.json", gbif_results))

{'Doc Cache File': 'cache/gbif.json',
 'Number of Documents in Cache': 387,
 'Document Number 17': {'processing_metadata': {'status': 'success',
   'date_processed': '2019-07-26T17:14:32.429038',
   'status_message': 'Matched',
   'api': ['https://api.gbif.org/v1/species/suggest?q=Deroceras hesperium',
    'https://api.gbif.org/v1/occurrence/search?country=US&limit=0&facet=institutionCode&facet=year&facet=basisOfRecord&scientificName=Deroceras hesperium']},
  'parameters': {'Scientific Name': 'Deroceras hesperium',
   'Name Source': 'Lookup Name'},
  'gbif_species': {'key': 5190779,
   'resolvable_identifier': 'http://api.gbif.org/v1/species/5190779',
   'biological_taxonomy': [{'rank': 'kingdom', 'name': 'Animalia'},
    {'rank': 'phylum', 'name': 'Mollusca'},
    {'rank': 'class', 'name': 'Gastropoda'},
    {'rank': 'order', 'name': 'Stylommatophora'},
    {'rank': 'family', 'name': 'Agriolimacidae'},
    {'rank': 'genus', 'name': 'Deroceras'},
    {'rank': 'species', 'name': 'Deroce

# Schema Validation
Working through the schema documentation for the GBIF cases showed that the processing metadata record needs to be able to record more than one API endpoint used in a given record assembly process. For now, I expanded the schema for processing_metadata/api to allow for an array, but we may need to revisit this at some point to include a more sophisticated structure. Ultimately, the answer is probably to encode an actual W3C Provenance (PROV) trace for every record, but that will require us to grapple fully with that standard and figure out how to build it into our workflows.
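
To make the effect of that change concrete, here is a minimal sketch of what a property allowing either a single endpoint or an array of endpoints might look like, along with a small helper that lets downstream code treat both shapes the same way. The fragment `API_PROPERTY_SCHEMA` and the function `api_endpoints` are hypothetical illustrations; the actual definition lives in `common_properties.json`, which is not reproduced here.

```python
# Hypothetical draft-07 style fragment: "api" may be one URL string or an
# array of URL strings. The "items" keyword only applies when the value
# is an array, so it does not constrain the single-string case.
API_PROPERTY_SCHEMA = {
    "type": ["string", "array"],
    "items": {"type": "string"},
}


def api_endpoints(processing_metadata):
    """Return the recorded API endpoints as a list, whichever shape was used."""
    api = processing_metadata.get("api", [])
    if isinstance(api, str):
        return [api]
    return list(api)


# Both shapes yield a list that can be iterated the same way.
print(api_endpoints({"api": "https://api.gbif.org/v1/species/5190779"}))
print(api_endpoints({"api": ["https://api.gbif.org/v1/species/suggest",
                             "https://api.gbif.org/v1/occurrence/search"]}))
```

Normalizing on read keeps older single-endpoint records usable while the schema question stays open.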

In [6]:
gbif_schema = helperfunctions.load_schema('gbif')
display(gbif_schema)

jsonschema.validate(gbif_results, gbif_schema)

{'definitions': {'items': {'$id': '#items',
   'type': ['object', 'array'],
   'title': 'Generic container for items in a dataset',
   'description': 'A JSON array or object property containing one or more items in a dataset or data structure within a dataset.'}},
 '$schema': 'http://json-schema.org/draft-07/schema#',
 '$id': 'http://data.usgs.gov/property_registry/',
 'type': 'array',
 'title': 'GBIF Species Summary Collection',
 'description': 'A dataset containing records that summarize species found by name in the Global Biodiversity Information Facility (GBIF). Records contain sufficient details to further explore the species in GBIF and a summary of available occurrence data at a point in time.',
 'items': {'$ref': '#/definitions/items',
  'properties': {'processing_metadata': {'$ref': 'common_properties.json#/definitions/processing_metadata'},
   'parameters': {'$ref': 'common_properties.json#/definitions/parameters',
    'required': ['Scientific Name', 'Name Source'],
    'prop