This notebook explores work in progress on a workflow for processing the sources of spatial features that matter to our work on the National Biogeographic Map (NBM). The Spatial Feature Registry (SFR; or spatiotemporal, once we get back to working through the implications of time-bounded features) is part of the information system we are building to support the NBM. We need a steady process for integrating disparate datasets of spatial features and running a set of processors on the data to add value for our purposes.

The code blocks and notes here share some ideas we are kicking around. They go along with an [ESIP Lab ideascale campaign](https://esipfed.ideascale.com/a/campaign-home/23576) and a [screencast presentation](https://youtu.be/J7T59-H_W4o).

In [1]:
import requests
from IPython.display import display

# Places API (an end result)
We are trying to take an API-first approach to everything we build: we first set up routes in an API that abstracts from our back-end data infrastructure and establishes a set of methods and tools for getting at the data and functionality we are exposing. The BIS API (BIS stands for Biogeographic Information System) is built as a Python Flask app using the Flask-RESTPlus package ([beta API](https://sciencebase.usgs.gov/staging/bis/)).

The information you see in the result below is just the output from an Elasticsearch query made against a set of indexes, one created for each registered data source. We still need to do a little work in the API response to eliminate the cruft and show a result that is more abstracted from the underlying infrastructure and will stand the test of time. The part we're concentrating on now is how to take each individual data source we want to register (the registrants in the SFR) and document it so that the dataset carries everything we need to know to run our processing pipeline.

In [2]:
# Text search against the beta Places API; the JSON body is a raw Elasticsearch response
placesTextSearchResults = requests.get('https://sciencebase.usgs.gov/staging/bis/api/v1/places/search/text?q=Yosemite').json()

for result in placesTextSearchResults['hits']['hits']:
    display(result)

{'_id': '5296',
 '_index': 'bis__sfr__published_doi_lands_from_padus_1_4',
 '_score': 8.229219,
 '_source': {'properties': {'feature_class': 'Wilderness Area',
   'feature_description': '',
   'feature_id': 'PADUS1_4:WA:Yosemite Wilderness',
   'feature_name': 'Yosemite Wilderness',
   'gid': 5296,
   'ogc_fid': 5296,
   'source_data': '[ { "access": "RA", "d_access": "Restricted Access", "access_src": "GAP - Default", "agg_src": "GAP_PADUS1_4Designation_NPS_Wilderness_20150605.gdb\\/LegislatedWilderness", "category": "Designation", "d_category": "Designation", "comments": "", "date_est": "1984", "gapcddt": "2014", "gapcdsrc": "Interagency Wilderness Steering Committee", "gap_sts": "1", "d_gap_sts": "1 - managed for biodiversity - disturbance events proceed or are mimicked", "gis_acres": 704358, "gis_src": "NPS_Wilderness_20150605.gdb\\/LegislatedWilderness", "iucnctdt": "2015", "iucnctsrc": "GAP - Default", "iucn_cat": "Ib", "d_iucn_cat": "Ib: Wilderness areas", "loc_ds": "Designated"

{'_id': '2054',
 '_index': 'bis__sfr__published_doi_lands_from_padus_1_4',
 '_score': 7.1782146,
 '_source': {'properties': {'feature_class': 'National Park',
   'feature_description': '',
   'feature_id': 'PADUS1_4:NP:Yosemite National Park',
   'feature_name': 'Yosemite National Park',
   'gid': 2054,
   'ogc_fid': 2054,
   'source_data': '[ { "access": "OA", "d_access": "Open Access", "access_src": "GAP - Default", "agg_src": "GAP_PADUS1_4Fee_NPS_Tracts", "category": "Fee", "d_category": "Fee", "comments": "", "date_est": "1890", "gapcddt": "2010", "gapcdsrc": "TNC", "gap_sts": "1", "d_gap_sts": "1 - managed for biodiversity - disturbance events proceed or are mimicked", "gis_acres": 740655, "gis_src": "NPS_Tracts.shp", "iucnctdt": "2015", "iucnctsrc": "GAP - Default", "iucn_cat": "II", "d_iucn_cat": "II: National park", "loc_ds": "NP", "loc_mang": "FEDFEE", "loc_nm": "YOSE", "loc_own": "FEDFEE", "mang_name": "NPS", "d_mang_name": "National Park Service", "mang_type": "FED", "d_mang
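
The cleanup mentioned above (hiding Elasticsearch details such as `_index` and `_id` behind a stable result shape) could look something like the following. This is only a minimal sketch: the function name `summarize_hit` and the choice of which fields to keep are assumptions for illustration, not what the BIS API currently does.

```python
def summarize_hit(hit):
    """Reduce one raw Elasticsearch hit to a small, infrastructure-neutral record."""
    props = hit.get('_source', {}).get('properties', {})
    return {
        # Integrating properties shared by every registered source
        'feature_id': props.get('feature_id'),
        'feature_name': props.get('feature_name'),
        'feature_class': props.get('feature_class'),
        'feature_description': props.get('feature_description'),
        # Keep the relevance score, drop index names and internal ids
        'score': hit.get('_score'),
    }


# Example hit trimmed from the Yosemite search output above
example_hit = {
    '_id': '5296',
    '_index': 'bis__sfr__published_doi_lands_from_padus_1_4',
    '_score': 8.229219,
    '_source': {'properties': {
        'feature_class': 'Wilderness Area',
        'feature_description': '',
        'feature_id': 'PADUS1_4:WA:Yosemite Wilderness',
        'feature_name': 'Yosemite Wilderness',
        'gid': 5296,
        'ogc_fid': 5296,
    }},
}

summary = summarize_hit(example_hit)
print(summary)
```

Keeping only the integrating properties means the response would not have to change if the back end moved away from Elasticsearch.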

# Data Repository (the Registry)
The Registry part of the SFR is really just a place to reposit everything we want to work with - sources of spatial features and their documentation. For our purposes, we are leveraging ScienceBase and a particular [collection](https://www.sciencebase.gov/catalog/item/55fafaf5e4b05d6c4e501b81) where we have the necessary permissions set up for registration items. ScienceBase gives us the following:

* Flexible information model to store the basic elements of metadata we need to understand the source
* API to program against
* File repository to stash source files when needed, pre-processed files when appropriate, and additional documentation

What ScienceBase doesn't give us at this time is an elegant versioning mechanism for either the items or the files. We can tell when an item changes or files are added/changed, but we don't have a versioning construct to work against with the current platform. We will have to cross that bridge soon with data sources that change through time, and we'll look at layering in some other mechanism to account for this dynamic.

We have some further cleanup to do on some previous dumb ideas that did not stand the test of time, but the following listing shows the basic information we are concerned about in the registry.

In [3]:
# List up to 20 registrant items in the SFR collection, limited to the fields we care about
sfrListing = requests.get('https://www.sciencebase.gov/catalog/items?parentId=55fafaf5e4b05d6c4e501b81&max=20&format=json&fields=title,summary,files,webLinks,dates,contacts').json()

display(sfrListing)

{'items': [{'contacts': [{'active': True,
     'contactType': 'person',
     'email': 'dwieferich@usgs.gov',
     'firstName': 'Daniel',
     'jobTitle': 'Physical Scientist',
     'lastName': 'Wieferich',
     'middleName': 'J',
     'name': 'Daniel J Wieferich',
     'oldPartyId': 66431,
     'orcId': '0000-0003-1554-7992',
     'organization': {'displayText': 'Biogeographic Characterization'},
     'primaryLocation': {'building': 'DFC Bldg 810',
      'buildingCode': 'KBT',
      'faxPhone': '3032024710',
      'mailAddress': {},
      'name': 'CN=Daniel J Wieferich,OU=CSS,OU=Users,OU=OITS,OU=DI,DC=gs,DC=doi,DC=net - Primary Location',
      'officePhone': '3032024603',
      'streetAddress': {'city': 'Lakewood',
       'line1': 'W 6th Ave Kipling St',
       'state': 'CO',
       'zip': '80225'}},
     'type': 'Custodian'}],
   'dates': [{'dateString': '2018-04-02',
     'label': 'Creation',
     'type': 'creation'},
    {'dateString': '2016', 'label': 'Publication Date', 'type': '

# Data Source Documentation
As you can see in the dump of information on registrants, we have quite a bit of cleanup and standardization still to do. We've worked through a number of ideas that have not quite scaled, from generating individual ScienceBase Items to trying many different ways of packaging pre-processed data and linking to relevant code. As we work toward a production capability, we are seeking ways to simplify and clarify what we are doing in this mess.

I'm currently working on setting a pattern for what we want to see in source material using the Large Marine Ecosystems registrant. That's listed here with some notes below on what we're doing in this case.

In [4]:
# Pick the Large Marine Ecosystems registrant out of the listing (None if it is not there)
lme_item = next((i for i in sfrListing['items'] if i['title'] == 'Large Marine Ecosystems'), None)

display(lme_item)

{'contacts': [{'contactType': 'person',
   'firstName': 'Kenneth',
   'jobTitle': 'Director, U.S. LME Program',
   'lastName': 'Sherman',
   'name': 'Kenneth Sherman',
   'organization': {'displayText': 'US Department of Commerce, National Oceanic and Atmospheric Administration, Northeast Fisheries Science Center, Narragansett Laboratory'},
   'personalTitle': 'Dr.',
   'primaryLocation': {'mailAddress': {}, 'streetAddress': {}},
   'type': 'Principal Investigator'},
  {'contactType': 'person',
   'firstName': 'Rebecca',
   'jobTitle': 'Fishery Biologist / LME Program Manager',
   'lastName': 'Shuford',
   'middleName': 'L',
   'name': 'Rebecca L. Shuford',
   'organization': {'displayText': 'U.S. Department of Commerce, National Oceanographic and Atmospheric Administration, Fisheries Office of Science and Technology, Marine Ecosystems Division'},
   'personalTitle': 'Dr.',
   'primaryLocation': {'mailAddress': {}, 'streetAddress': {}},
   'type': 'Co-Investigator'},
  {'contactType': 

The main things I'm working through are centered on files:

* Pre-processing - For this item, I posted a rather clunky Python script that I used to do the pre-processing in this particular case. I did not spend the time working through how the functions should be abstracted to something higher level in our master Python package for all of this related work on the Biogeographic Information System. It's ugly, but it does at least show exactly what I did to the source data to set it up for our use and provides for the basic principles of transparency and reproducibility. It's also an example of what we will run into regularly when a given data source is really only going to be processed once with a purpose-built workflow.
* Source Data Package - This package contains a final GeoJSON file for processing (results of the pre-processing script) and a first attempt at using [JSON Schema](https://json-schema.org/) to document the dataset with the most usable information for processing.

For the schema documentation, I used a nifty [online tool](https://www.jsonschema.net/) with a snippet of the pre-processed GeoJSON file (generated with a function in the pre-processing script) to build a draft-07 compliant JSON schema file, and then wove in an idea for specifying a context for integration using really dumbed-down JSON-LD notation. To support the mapping of source properties to a context for integration, I am fiddling with an ontology containing higher-level properties and definitions [registered](http://cor.esipfed.org/ont/~sky/usgs_sfr) in the ESIP Community Ontology Repository. That gave me resolvable identities to map to in the following way:

```
{
  "@context": {
    "@vocab": "http://cor.esipfed.org/ont/~sky/usgs_sfr",
    "#/properties/features/items/properties/properties/properties/LME_NUMBER": "http://cor.esipfed.org/ont/usgs_sfr/integratingproperty/feature_id_sourceIdentifier",
    "lme": "http://cor.esipfed.org/ont/usgs_sfr/integratingproperty/feature_id_nameSpaceId",
    "Large Marine Ecosystems": "http://cor.esipfed.org/ont/usgs_sfr/integratingproperty/feature_class",
    "#/properties/features/items/properties/properties/properties/LME_NAME": "http://cor.esipfed.org/ont/usgs_sfr/integratingproperty/feature_name",
    "#/properties/features/items/properties/properties/properties/LME_DESCRIPTION": "http://cor.esipfed.org/ont/usgs_sfr/integratingproperty/feature_description"
  }
}
```

The major concepts from the API snippet above (`feature_id`, `feature_class`, `feature_name`, and `feature_description`) show up here as the properties being mapped to. This mapping shows part of how to get from source data into the particular context of our one overarching indexing scheme (with the potential for multiple contexts in the future). However, this probably isn't the best way to do things, and we're looking for better ideas.
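
To make the intent of the mapping a little more concrete, here is a rough sketch of how a processor might apply it to one GeoJSON feature. It rests on two assumptions that the context itself does not state: a key starting with `#/` is a pointer into the schema whose last segment names the source property, and any other key (`lme`, `Large Marine Ecosystems`) is a constant value that applies to every feature in the dataset. The function name `integrate` and the sample feature values are placeholders, not real LME data.

```python
BASE = "http://cor.esipfed.org/ont/usgs_sfr/integratingproperty/"
POINTER = "#/properties/features/items/properties/properties/properties/"

# The context from the schema file above, minus "@vocab"
context = {
    POINTER + "LME_NUMBER": BASE + "feature_id_sourceIdentifier",
    "lme": BASE + "feature_id_nameSpaceId",
    "Large Marine Ecosystems": BASE + "feature_class",
    POINTER + "LME_NAME": BASE + "feature_name",
    POINTER + "LME_DESCRIPTION": BASE + "feature_description",
}


def integrate(feature, context):
    """Map one GeoJSON feature's source properties onto the integrating properties."""
    source_props = feature.get("properties", {})
    integrated = {}
    for key, target in context.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @vocab
        name = target.rsplit("/", 1)[-1]
        if key.startswith("#/"):
            # Assumption: the last pointer segment is the source property name
            integrated[name] = source_props.get(key.rsplit("/", 1)[-1])
        else:
            # Assumption: a plain key is a constant for the whole dataset
            integrated[name] = key
    return integrated


# Placeholder feature; the values are illustrative only
feature = {
    "type": "Feature",
    "geometry": None,
    "properties": {"LME_NUMBER": 1, "LME_NAME": "Example name", "LME_DESCRIPTION": ""},
}

record = integrate(feature, context)
print(record)
```

How `feature_id_nameSpaceId` and `feature_id_sourceIdentifier` get combined into a final `feature_id` is left out here, since that rule is not spelled out in the mapping.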

# Message Queue and Microservices
Our current workflow for getting the SFR sources online involves another set of purpose-built scripts and configuration details that pull the data from files into a PostGIS instance, build out materialized views that handle some geospatial cleanup and schema mapping, and then push the data into a set of similarly named Elasticsearch indexes for use in wildcard-based searches. That's an okay process, but it's not very extensible, and it's a little hard for us to take it to full production right now. So we are working on a new approach that will build from source items with documentation in ScienceBase and run the full pipeline using as much native Amazon tech as we can currently use.

I'm thinking of leveraging Amazon SQS, with an initial message that triggers a check of ScienceBase for processable stuff. The "find processable stuff" microservice can probably run as a Lambda. It will consult a config file with rules that look for specific characteristics of ScienceBase Items, like having a "Source Data Package" and originating from a set of trusted users. It will look at a log of some kind to know whether a given item and its latest source data have already been processed. If the conditions are met, it will put messages on the queue with items to process.
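
The decision logic for that "find processable stuff" step might be sketched roughly as below. Everything specific here is an assumption for illustration: the rule names, the `createdBy` field used to identify the originating user, the use of a file `title` and `dateUploaded` to find and date the package, the shape of the processed log (item id mapped to the upload date last processed), and the message format. The sketch only builds message bodies and leaves the actual queueing out.

```python
import json

# Hypothetical rules; in the design above these would live in a config file
RULES = {
    "required_file_title": "Source Data Package",
    "trusted_users": {"trusted.user@example.gov"},
}


def find_processable(items, rules, processed_log):
    """Return one message body per item that meets the rules and has unprocessed source data."""
    messages = []
    for item in items:
        if item.get("createdBy") not in rules["trusted_users"]:
            continue  # not from a trusted user
        packages = [f for f in item.get("files", [])
                    if f.get("title") == rules["required_file_title"]]
        if not packages:
            continue  # nothing to process yet
        # ISO 8601 date strings sort chronologically, so max() finds the latest upload
        latest = max(f["dateUploaded"] for f in packages)
        if processed_log.get(item["id"], "") >= latest:
            continue  # this version of the source data was already processed
        messages.append(json.dumps(
            {"action": "process_item", "item_id": item["id"], "source_date": latest}))
    return messages


# Placeholder items standing in for a ScienceBase listing
package = {"title": "Source Data Package", "dateUploaded": "2018-04-02T00:00:00Z"}
items = [
    {"id": "item-a", "createdBy": "trusted.user@example.gov", "files": [package]},
    {"id": "item-b", "createdBy": "trusted.user@example.gov", "files": [package]},
    {"id": "item-c", "createdBy": "someone.else@example.org", "files": [package]},
    {"id": "item-d", "createdBy": "trusted.user@example.gov", "files": []},
]
processed_log = {"item-b": "2018-04-02T00:00:00Z"}

messages = find_processable(items, RULES, processed_log)
print(messages)
```

Inside a Lambda, each body could then be handed to SQS, for example with the boto3 SQS client's `send_message(QueueUrl=..., MessageBody=body)`.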

"Process item" messages will trigger one or more microservices to load data and run it through our pipeline. We should look to see if we can chunk things up enough to run these as Lamdas as well, perhaps building in a safeguard as another message that would fire up the code on an actual machine to run if the Lambda time limit is exceeded in roundtripping a particular process.