This notebook works out the basic process for accessing the OBIS Areas data source from servers in Oostende, caching the data as a GeoJSON document, and setting it up for processing in the latest iteration of our Spatial Feature Registry (SFR) idea. We currently stash source data as files in ScienceBase Items and then kick off a process that loads them into a cloud-based PostGIS instance, where we run a few synthesis operations with other spatial data and then build one or more indexes for various purposes in Elasticsearch. The Biogeographic Information System API provides the interface to these indexes.

In [1]:
import requests
import geojson
from IPython.display import display
from geojson import Feature, FeatureCollection

Ideally, we will eventually kick off all of these registration and processing steps through some kind of high-level catalog or inventory based on the placement and/or characteristics of the source items in ScienceBase. We should have something that watches the inventory, checks each item's dates against its update periodicity, and sends items that are due onto a message queue for processing by microservices. For now, we name the specific item that this code should process and grab it as JSON for processing here locally.
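
The watcher idea could start out as a simple due-date check. The following is a minimal sketch and not existing code: `SourceRecord`, `update_interval_days`, and `items_due` are hypothetical names, the interval and dates are made up, and a plain list stands in for the message queue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class SourceRecord:
    """Hypothetical inventory entry for one registered source item."""
    item_url: str
    update_interval_days: int
    last_processed: Optional[datetime] = None


def items_due(inventory: List[SourceRecord], now: datetime) -> List[SourceRecord]:
    """Return records that were never processed or are past their update interval."""
    due = []
    for record in inventory:
        if record.last_processed is None:
            due.append(record)
        elif now - record.last_processed >= timedelta(days=record.update_interval_days):
            due.append(record)
    return due


# Example inventory; the 30-day interval and the dates are illustrative only.
inventory = [
    SourceRecord(
        "https://www.sciencebase.gov/catalog/item/5aabd9a5e4b081f61aaf009b",
        update_interval_days=30,
        last_processed=datetime(2018, 3, 16),
    ),
]

# A list stands in for the message queue in this sketch.
queue = [record.item_url for record in items_due(inventory, now=datetime(2018, 5, 1))]
```

In a real deployment the last line would publish to whatever queue we settle on instead of building a list.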

In [2]:
sourceItem = "https://www.sciencebase.gov/catalog/item/5aabd9a5e4b081f61aaf009b"
sourceItemMeta = requests.get(sourceItem, headers={"accept":"application/json"}).json()

We'll need to keep working on the information model that will help us drive this process. For now, I'm iterating through everything that is necessary to drive this particular process and figuring out what should be stored with the item vs. what can be referenced from elsewhere. I may have to make some less-than-optimal choices in the near term that we'll have to clean up as we move along. I'll note those as I go.

In [3]:
display(sourceItemMeta)

{'body': 'The UNESCO-IODE-OBIS Office assembles and maintains a set of marine areas for summarizing and reporting on biodiversity and biogeographic characterization metrics. We register these areas in the Spatial Feature Registry for use in our own work and ties into the OBIS API. The spatial features in this dataset originate from MarineRegions.org and other sources and have been minimally processed spatially for use.',
 'contacts': [{'contactType': 'organization',
   'name': 'UNESCO-IODE-OBIS',
   'organization': {},
   'primaryLocation': {'mailAddress': {}, 'streetAddress': {}},
   'type': 'Data Owner'},
  {'contactType': 'person',
   'name': 'Pieter Provoost',
   'organization': {},
   'primaryLocation': {'mailAddress': {}, 'streetAddress': {}},
   'type': 'Data Owner'}],
 'dates': [{'dateString': '2018-03-16',
   'label': 'Validity Begins',
   'type': 'validityBegins'}],
 'distributionLinks': [{'files': [{'contentType': 'application/json',
     'name': 'OBIS Areas Code List.json',

There are several critical pieces of information on board the item metadata that I pull out here as variables for processing. Right now, this is kind of stupid in that we need very specific a priori knowledge of what these things (links and files) are called in order to tease them out. That doesn't scale to all the other cases we want to execute, and we need to think through a different way to do this. ScienceBase sort of gets toward this with its set of built-in types for files and webLinks, but the actual meaning and utility of those types is a little fuzzy and not well controlled. So I don't think we can count on them at this time, and I opted to use specific titles instead.

One way to approach this is to handle the source-specific translation in whatever component watches ScienceBase for our registered sources and packages messages for Kafka or whatever we end up using. That component could hold fairly specific code for a given source and put together more standardized messages for processing. We'll have to keep thinking this through to come up with something more sustainable and scalable.

In [4]:
codeListURL = [f["url"] for f in sourceItemMeta["files"] if f["title"] == "codelist"][0]
inventoryAPI = [l["uri"] for l in sourceItemMeta["webLinks"] if l["title"] == "OBIS Areas Names and Identifiers"][0]
wfsGetCapabilities = [l["uri"] for l in sourceItemMeta["webLinks"] if l["title"] == "OBIS Geoserver WFS"][0]
wfsGetFeatureURLStub = wfsGetCapabilities.replace("&request=GetCapabilities","&request=GetFeature&typeName=OBIS:areas&maxFeatures=1&outputFormat=json&filter=<PropertyIsEqualTo><PropertyName>gid</PropertyName><Literal></Literal></PropertyIsEqualTo>")
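
Because the list comprehensions above end in `[0]`, a missing or renamed title surfaces as a bare `IndexError`. A small helper could report which title was missing instead. This is only a sketch: `find_by_title` is a hypothetical name, and the sample metadata is made up to mimic the keys used in the cell above, with a placeholder URI.

```python
def find_by_title(entries, title, value_key):
    """Return entry[value_key] for the first entry whose 'title' matches."""
    for entry in entries:
        if entry.get("title") == title:
            return entry[value_key]
    available = [entry.get("title") for entry in entries]
    raise KeyError(f"No entry titled {title!r}; available titles: {available}")


# Made-up stand-in for sourceItemMeta; the URI is a placeholder, not the real link.
sample_meta = {
    "webLinks": [
        {
            "title": "OBIS Geoserver WFS",
            "uri": "https://example.org/ows?service=WFS&request=GetCapabilities",
        },
    ]
}

wfs_url = find_by_title(sample_meta["webLinks"], "OBIS Geoserver WFS", "uri")
```

This still depends on knowing the titles ahead of time, so it only improves the failure message, not the scalability problem described above.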

One of the somewhat stupid things I had to do in the short term was load a code list onto the ScienceBase Item to hold information that this particular source needs but for which I could not come up with a reasonable external platform. In this case, the source data supplies several types of areas in short form (e.g., abnj = Area Beyond National Jurisdiction). There are definitions for these areas from a number of different sources, but they are not all together in one place, are not complete, and may not have the mapping we need from code to value. They are also so specific to this particular case that it doesn't necessarily make sense to store them someplace like YAMZ, where we've talked about building out definitions for things that don't already have a home. I don't think this is the best way to deal with the problem, but putting together a specific code list for a given SFR registered source and storing it with that item seemed to make the best immediate sense.

In [5]:
codeList = requests.get(codeListURL).json()
display(codeList)

{'Area Types': {'abnj': 'Area Beyond National Jurisdiction',
  'ebsa': 'Ecologically and Biologically Significant Area',
  'eez': 'Exclusive Economic Zone',
  'obis': 'Ocean Biogeographic Information System',
  'prot': 'Marine Protected Area'}}
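
A direct lookup such as `codeList["Area Types"][code]` raises `KeyError` if the source ever introduces an area type that is not in the list. One option, sketched here, is to fall back to the raw code so the feature still gets a class. `area_type_label` is a hypothetical helper, and the dictionary is copied from the output above.

```python
# Copied from the code list output above.
code_list = {
    "Area Types": {
        "abnj": "Area Beyond National Jurisdiction",
        "ebsa": "Ecologically and Biologically Significant Area",
        "eez": "Exclusive Economic Zone",
        "obis": "Ocean Biogeographic Information System",
        "prot": "Marine Protected Area",
    }
}


def area_type_label(code, code_list):
    """Return the long-form label for an area type code, or the code itself if unmapped."""
    return code_list["Area Types"].get(code, code)


known = area_type_label("eez", code_list)
unknown = area_type_label("lme", code_list)  # "lme" is a made-up code, not from the source
```

Whether a silent fallback or a hard failure is preferable depends on how strictly we want the code list to gate what gets registered.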

# Get Inventory
The first official data retrieval step in all of this is to grab the inventory of OBIS Areas from the iOBIS API endpoint provided for this. It's possible that we could get everything we need by just looking at the OBIS:areas layer on the iOBIS Geoserver, but this simple REST service is what the OBIS guys pointed us to, and we presume that it will have the most up-to-date and official inventory to work from.

In [6]:
obisAreasInventory = requests.get(inventoryAPI).json()

In [7]:
display(obisAreasInventory)

{'count': 535,
 'lastpage': True,
 'limit': 1000,
 'offset': 0,
 'results': [{'id': 309, 'name': 'The Sundarbans', 'type': 'prot'},
  {'id': 24, 'name': 'Bosnia and Herzegovina', 'type': 'eez'},
  {'id': 322, 'name': 'Shiretoko', 'type': 'prot'},
  {'id': 478, 'name': 'EBSA No 2:  Ua puakaoa seamounts', 'type': 'ebsa'},
  {'id': 371,
   'name': 'Rivercess and Sinoe Sea Turtle Breeding Ground (Liberia)',
   'type': 'ebsa'},
  {'id': 127, 'name': 'Malaysia', 'type': 'eez'},
  {'id': 339,
   'name': 'Zone de Production Equatoriale de Thons',
   'type': 'ebsa'},
  {'id': 94, 'name': 'Guinea Bissau', 'type': 'eez'},
  {'id': 313, 'name': 'High Coast / Kvarken Archipelago', 'type': 'prot'},
  {'id': 479,
   'name': 'Quelimane to Zuni River (Zambezi River Delta)',
   'type': 'ebsa'},
  {'id': 497, 'name': 'Moneron island shelf, Russia', 'type': 'ebsa'},
  {'id': 496,
   'name': "Canyon et trou sans fond d'Abidjan (Côte d'Ivoire)",
   'type': 'ebsa'},
  {'id': 486, 'name': 'Jabuka / Pomo Pit',
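
The response carries `count`, `lastpage`, `limit`, and `offset`, which suggests that the service pages its results. With 535 areas and a limit of 1000, a single request is enough today, but a loop like the following sketch would keep working if the inventory outgrows one page. It assumes the service accepts `limit` and `offset` query parameters, which is not confirmed here. `fetch_all_results` and `fetch_page` are hypothetical names, and the page fetcher is passed in so the loop can be tried without network access.

```python
def fetch_all_results(fetch_page, limit=1000):
    """Collect 'results' across pages until the service reports 'lastpage'."""
    results = []
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        results.extend(page["results"])
        # Stop on the last page, or on an empty page to avoid looping forever.
        if page.get("lastpage", True) or not page["results"]:
            break
        offset += limit
    return results


# Against the live service, fetch_page might look like this (assumed parameter names):
# def fetch_page(limit, offset):
#     return requests.get(inventoryAPI, params={"limit": limit, "offset": offset}).json()

# Offline stand-in that serves five fake records two at a time.
fake_records = [{"id": i, "name": f"Area {i}", "type": "eez"} for i in range(5)]


def fake_fetch_page(limit, offset):
    chunk = fake_records[offset:offset + limit]
    return {"results": chunk, "lastpage": offset + limit >= len(fake_records)}


all_records = fetch_all_results(fake_fetch_page, limit=2)
```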

# Get geo and build data structure
Here, we loop through the OBIS Areas inventory, build out properties aligned with the target specification, retrieve each geometry from the OBIS Geoserver, and package everything up as a GeoJSON feature collection. Any area for which the WFS request does not return exactly one feature is displayed rather than added to the collection. I hard-coded the CRS based on the source for convenience and should probably come back to that at some point.

In the new process we are developing, we apply source processing logic at the PostgreSQL end to build a common data model across all sources, with at least one purpose being a large place-name lookup index. There will be other purposes for these data requiring different synthesis logic. In this case, there are very few properties to mess with, and they can all be put immediately into the convention we are evolving:

* feature_id - using URN-like syntax, we specify the source context ("OBIS_Areas" in this case) along with an identifier that is unique, reasonably persistent within the source context, and significant in being linkable to associated information
* feature_name - meaningful name of the feature that will be used in lookup and labeling
* feature_class - simple string for categorizing the features within the SFR context (needs to be richer and backed by ontology in future)
* feature_geometry - whatever the geometry is for the feature including its CRS specification
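
Expressed in code, the convention above amounts to a small mapping from a source record to the three non-geometry properties, with the geometry attached separately. This sketch is one reading of that convention: `build_sfr_properties` and `REQUIRED_PROPERTIES` are hypothetical names and not part of any existing SFR library.

```python
REQUIRED_PROPERTIES = ("feature_id", "feature_name", "feature_class")


def build_sfr_properties(source_context, record, class_labels):
    """Map one source record onto the SFR property names."""
    properties = {
        # URN-like identifier: source context plus the source's own id
        "feature_id": f"{source_context}:{record['id']}",
        "feature_name": record["name"],
        "feature_class": class_labels[record["type"]],
    }
    # Refuse records that would produce an empty required property
    empty = [key for key in REQUIRED_PROPERTIES if properties[key] in (None, "")]
    if empty:
        raise ValueError(f"Empty required properties: {empty}")
    return properties


example = build_sfr_properties(
    "OBIS_Areas",
    {"id": 127, "name": "Malaysia", "type": "eez"},
    {"eez": "Exclusive Economic Zone"},
)
```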

In [11]:
newOBISAreas = []
for obisArea in obisAreasInventory["results"]:
    # Build properties following the SFR naming convention described above
    properties = {}
    properties["feature_id"] = "OBIS_Areas:" + str(obisArea["id"])
    properties["feature_name"] = obisArea["name"]
    properties["feature_class"] = codeList["Area Types"][obisArea["type"]]
    # Fill the empty <Literal> in the URL stub with this area's gid
    properties["getFeatureURL"] = wfsGetFeatureURLStub.replace("<Literal></Literal>", "<Literal>" + str(obisArea["id"]) + "</Literal>")

    # Retrieve the geometry for this area from the OBIS Geoserver WFS
    areaFeature = requests.get(properties["getFeatureURL"]).json()

    if len(areaFeature["features"]) != 1:
        # Show areas that did not come back with exactly one feature
        display(properties)
    else:
        newOBISAreas.append(Feature(geometry=areaFeature["features"][0]["geometry"], properties=properties))

# CRS is hard-coded from the source for now
crs = {"type": "EPSG", "properties": {"code": "4326"}}
obisAreasCollection = FeatureCollection(newOBISAreas, crs=crs)

{'feature_class': 'Ocean Biogeographic Information System',
 'feature_id': 'OBIS_Areas:286',
 'feature_name': 'OBIS',
 'getFeatureURL': 'http://iobis.org/geoserver/OBIS/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=OBIS:areas&maxFeatures=1&outputFormat=json&filter=<PropertyIsEqualTo><PropertyName>gid</PropertyName><Literal>286</Literal></PropertyIsEqualTo>'}
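
The GetFeature URL above embeds the raw XML filter in the query string and relies on the HTTP client to encode it. One alternative is to let the standard library build and encode the query string. This is a sketch only: `build_get_feature_url` is a hypothetical helper, the parameter values are taken from the URL shown above, and whether encoding has anything to do with the area that was displayed instead of added (id 286) is not known.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_get_feature_url(base_url, gid):
    """Build a WFS GetFeature URL that selects a single area by gid."""
    ogc_filter = (
        "<PropertyIsEqualTo><PropertyName>gid</PropertyName>"
        f"<Literal>{gid}</Literal></PropertyIsEqualTo>"
    )
    params = {
        "service": "WFS",
        "version": "1.0.0",
        "request": "GetFeature",
        "typeName": "OBIS:areas",
        "maxFeatures": 1,
        "outputFormat": "json",
        "filter": ogc_filter,
    }
    # urlencode percent-encodes characters such as < and > in the filter
    return f"{base_url}?{urlencode(params)}"


url = build_get_feature_url("http://iobis.org/geoserver/OBIS/ows", 286)

# Round trip: decoding the query string recovers the original filter text.
decoded_filter = parse_qs(urlsplit(url).query)["filter"][0]
```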

# Output file for upload
Our pattern so far has been to build a file for a processed data source and then load it via the user interface into GC2's PostgreSQL. There is no other bulk upload method, and record-by-record loads via the SQL API have either failed or been ridiculously slow. However, we do need to stash these files back on the ScienceBase source item. For another project, we are working on getting programmatic access to official USGS CHS S3 bucket space. Once we have that nailed down, we should come back to these processes and build it into anything that needs file storage. For now, this outputs a local file that gets uploaded to ScienceBase by hand (because I've gotten frustrated with the vagaries of doing that over the sbAPI as well).

In [12]:
with open("OBISAreas.geojson", "w") as outfile:
    geojson.dump(obisAreasCollection, outfile)
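
Before uploading, it may be worth reading the file back and checking that it parses and that every feature carries a geometry. The following sketch uses only the standard library. `summarize_geojson` is a hypothetical helper, and it is demonstrated on a tiny made-up collection written to a temporary file rather than on the real output.

```python
import json
import os
import tempfile


def summarize_geojson(path):
    """Read a GeoJSON file and report feature count and features lacking geometry."""
    with open(path) as infile:
        collection = json.load(infile)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("Expected a GeoJSON FeatureCollection")
    features = collection.get("features", [])
    without_geometry = [
        feature["properties"].get("feature_id")
        for feature in features
        if not feature.get("geometry")
    ]
    return {"feature_count": len(features), "without_geometry": without_geometry}


# Made-up sample: one feature with a point geometry and one without geometry.
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [2.9, 51.2]},
         "properties": {"feature_id": "Sample:1"}},
        {"type": "Feature",
         "geometry": None,
         "properties": {"feature_id": "Sample:2"}},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.geojson")
    with open(sample_path, "w") as outfile:
        json.dump(sample, outfile)
    summary = summarize_geojson(sample_path)
```

Pointing `summarize_geojson` at `OBISAreas.geojson` would give a quick count to compare against the inventory's `count` value.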