This notebook explores new ways of parsing and processing collections submitted to the National Digital Catalog of Geological and Geophysical Data (NDC). The end goal is to put all records into one or more ElasticSearch indexes that drive a new API and various interfaces for searching across collections to find useful samples and other artifacts for research. We essentially have this now via the ScienceBase API, but the underlying data workflow is stagnant, difficult to manage, and very difficult to change to allow for more heterogeneity in the underlying metadata and workflows.

I'm pursuing a concept of operations that will run as a set of microservices to process various kinds of collections into a common format with variable properties. To split up the work, I am focusing on getting each collection type to a common but varying GeoJSON data structure. I will then cache those data files back on the ScienceBase Items at the collection level and then slurp them up and process into ElasticSearch. In some of our other project work, we are running these types of files into their own ES indexes with a common prefix to support wildcard searches across collections. This will probably be a reasonable approach here as well, and we can take advantage of an established load mechanism based on a message queue and set of microservices.
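
As a rough sketch of the "common prefix" idea, the helper below derives one index name per collection so that a wildcard pattern can span all of them. The `ndc_` prefix and the function name are placeholders I made up for illustration; nothing here is settled, and no ElasticSearch client is involved.

```python
import re

# Hypothetical prefix; the actual naming convention has not been decided.
INDEX_PREFIX = "ndc_"

def index_name_for(collection_id):
    """Build a lowercase, prefixed index name for one collection."""
    # ElasticSearch index names must be lowercase, so lowercase the ID and
    # replace anything outside a conservative safe set with a hyphen.
    safe = re.sub(r"[^a-z0-9_-]", "-", collection_id.lower())
    return INDEX_PREFIX + safe

# Wildcard pattern that a cross-collection search could target.
ALL_COLLECTIONS = INDEX_PREFIX + "*"

print(index_name_for("57bb5f55e4b03fd6b7dd0532"))
```

Using the ScienceBase collection ID in the name keeps the mapping from index back to collection trivial, which should help when a single collection needs to be reloaded.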

In [1]:
import requests
import xmltodict
from IPython.display import display
from geojson import Feature, Point, FeatureCollection
import folium
from bs4 import BeautifulSoup
from gis_metadata.iso_metadata_parser import IsoParser

In [2]:
def build_point_geometry(coordinates):
    # Expects a "longitude,latitude" string, as in the NDC XML coordinates field
    lon, lat = coordinates.split(',')[:2]
    return Point((float(lon), float(lat)))

def build_ndc_feature(geom, props):
    ndcFeature = Feature(geometry=geom, properties=props)
    return ndcFeature

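def list_waf(url, ext='xml'):
    # Imported here so the helper carries its own dependency
    from urllib.parse import urljoin
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    # href=True skips anchors without an href, which would otherwise fail on .endswith
    hrefs = [node['href'] for node in soup.find_all('a', href=True)]
    # urljoin resolves relative links against the WAF URL and leaves absolute ones intact
    return [urljoin(url, href) for href in hrefs if href.endswith(ext)]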


# NGGDPP XML Format
One mechanism for supplying data to the NDC is a simple, rather dated XML document with "sample" records that follow the original NDC schema. The code in this section runs through one of those examples. Looking over the collection records in ScienceBase today, we will need a mechanism for flagging the appropriate XML file for processing so that one piece of code can run across the entire NDC, find these cases, and process all of the files.
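
One possible way to find these files automatically is to sniff the content rather than rely on a flag. The sketch below checks whether a document parses as XML with a `samples` root that holds `sample` children, which is the structure the processing code in this section depends on. The function names are hypothetical, and a real check would probably need to look at more of the schema.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]

def looks_like_ndc_xml(xml_text):
    """Return True for XML with a <samples> root containing <sample> children."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not well-formed XML at all
        return False
    if local_name(root.tag) != "samples":
        return False
    return any(local_name(child.tag) == "sample" for child in root)

print(looks_like_ndc_xml("<samples><sample><title>x</title></sample></samples>"))
```

This only confirms the outer shape of the document, so it would be a first filter ahead of the real processing, not a validation step.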

In [3]:
xmlData = requests.get('https://www.sciencebase.gov/catalog/file/get/57bb5f55e4b03fd6b7dd0532?f=__disk__9b%2Ff0%2F0c%2F9bf00cf674581675f45be9f7240940d44310aabb').text

In [4]:
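# force_list keeps 'sample' a list even when a file contains only one record
dictData = xmltodict.parse(xmlData, dict_constructor=dict, force_list=('sample',))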

Things to do in processing version 1 NDC records from XML:
* Convert coordinates to valid GeoJSON point geometry
* Determine if CRS is other than WGS84 and record those details somewhere in the collection
* Verify collection ID is valid ScienceBase ID
* Determine uses of alternate title and handle appropriately
* Determine uses of browse graphic and online resource; handle appropriately with at least verification of availability
* Verify datasetReferenceDate and set to valid ISO8601 date
* Verify date range and set to valid ISO8601 date range
* Package GeoJSON feature collection
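
The two date items in the list above could start out something like the sketch below. It assumes `datasetReferenceDate` arrives as `YYYY-MM-DD` and that date ranges arrive as `YYYY-YYYY`, which is what the example record in this notebook shows; other collections may well use other patterns. The function names are made up for illustration.

```python
import re
from datetime import date

def normalize_reference_date(value):
    """Return the date in YYYY-MM-DD form if it parses as an ISO date, else None."""
    try:
        return date.fromisoformat(value).isoformat()
    except (TypeError, ValueError):
        return None

# Matches a year range such as "1963-1974", allowing stray whitespace.
YEAR_RANGE = re.compile(r"^\s*(\d{4})\s*-\s*(\d{4})\s*$")

def normalize_year_range(value):
    """Turn 'YYYY-YYYY' into the ISO 8601 interval 'YYYY/YYYY', else None."""
    match = YEAR_RANGE.match(value or "")
    if not match:
        return None
    start, end = match.groups()
    if int(start) > int(end):
        # A reversed range is more likely a data error than an interval
        return None
    return f"{start}/{end}"

print(normalize_reference_date("2016-01-22"), normalize_year_range("1963-1974"))
```

Returning `None` instead of raising leaves it to the caller to decide whether an unparseable date should drop the record or just flag it.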

In [5]:
feature_list = []

for sample in dictData['samples']['sample']:
    pointGeometry = build_point_geometry(sample['coordinates'])
    feature_list.append(build_ndc_feature(pointGeometry,sample))

ndcxml_feature_collection = FeatureCollection(feature_list)

In [6]:
display(ndcxml_feature_collection)

{"features": [{"geometry": {"coordinates": [-157.377120885838, 68.1106331790389], "type": "Point"}, "properties": {"abstract": "Sample Locations from geologist(s): Charles G. Mull in the following quadrangles or areas: Chandalar, Chandler Lake, De Long Mountains, Howard Pass, Killik River, Misheguk Mountain, Noatak, Point Hope, Survey Pass, Wiseman Brooks Range; North Slope", "alternateGeometry": "The coordinates are represented in World Geodetic System 1984 (WGS84) as an approximate centroid point of the geospatial footprint(s) covered by this item. Item may be composed of several disparate geospatial footprints.", "alternateTitle": {"title": null}, "browseGraphic": {"resourceURL": null}, "collectionID": "57bb5f55e4b03fd6b7dd0532", "coordinates": "-157.377120885838,68.1106331790389", "dataType": "Map", "datasetReferenceDate": "2016-01-22", "dates": {"date": "1963-1974"}, "onlineResource": {"resourceURL": "http://maps.dggs.alaska.gov/agdi/detail/35701"}, "supplementalInformation": "Unp

# XML WAF Example
Another mechanism used in the NDC is to provide a web accessible folder with ISO19139 XML files for harvesting. This was used with another ScienceBase tool to harvest ISO records into items. I'm working through an alternate way of processing these files to generate a simplified GeoJSON feature collection to incorporate into the index. In these cases, we do seem to have a type classification for the WAF web links that could be useful. This example works through a case from the AZGS.

In [7]:
az_collection_example = requests.get('https://www.sciencebase.gov/catalog/item/57520032e4b053f0edd03e54?format=json&fields=webLinks').json()
waf_url = next(l['uri'] for l in az_collection_example['webLinks'] if l['type'] == 'WAF')

This is the core of the process so far. It loops through all links from the WAF, uses a metadata parsing utility to parse the contents of the ISO XML, builds a simple set of properties, creates a point geometry from the bounding box, and builds out a GeoJSON feature collection. We will need some more work on fully accommodating all useful properties out of the XML as there are some other things we should probably incorporate. This also relies on the convention of using bounding box in the ISO standard to represent a simple point, and that should probably get some validation in the code to make sure that's actually the case and generate a polygon feature in cases where it's actually a bounding box.
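
The bounding box validation mentioned above might look something like this sketch. It assumes the parsed bounding box is a dictionary of strings with `west`, `south`, `east`, and `north` keys (the loop in this notebook only uses `east` and `south`, so the other two are an assumption on my part), and it builds plain GeoJSON-style dictionaries so it runs without the `geojson` package. The function name and tolerance are placeholders.

```python
def bbox_to_geometry(bbox, tolerance=1e-9):
    """Return a Point for a degenerate bounding box, otherwise a Polygon."""
    west, south = float(bbox["west"]), float(bbox["south"])
    east, north = float(bbox["east"]), float(bbox["north"])

    # When both extents collapse, the box is really being used as a point.
    if abs(east - west) <= tolerance and abs(north - south) <= tolerance:
        return {"type": "Point", "coordinates": [west, south]}

    # Otherwise build a closed exterior ring in counterclockwise order.
    ring = [
        [west, south],
        [east, south],
        [east, north],
        [west, north],
        [west, south],
    ]
    return {"type": "Polygon", "coordinates": [ring]}

example = {"west": "-112.1122222", "east": "-112.1122222",
           "south": "34.75388889", "north": "34.75388889"}
print(bbox_to_geometry(example)["type"])
```

The sketch ignores boxes that cross the antimeridian, which would need separate handling if any collection has them.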

In [8]:
feature_list = []

for link in list_waf(waf_url):
    iso_xml = requests.get(link).text
    parsed_iso = IsoParser(iso_xml)
    
    coordinates = parsed_iso.bounding_box['east']+','+parsed_iso.bounding_box['south']
    pointGeometry = build_point_geometry(coordinates)
    
    item = {}
    item['title'] = parsed_iso.title
    item['abstract'] = parsed_iso.abstract
    item['place_keywords'] = parsed_iso.place_keywords
    item['thematic_keywords'] = parsed_iso.thematic_keywords
    item['temporal_keywords'] = parsed_iso.temporal_keywords

    feature_list.append(build_ndc_feature(pointGeometry,item))
    
waf_feature_collection = FeatureCollection(feature_list)

In [9]:
display(waf_feature_collection)

{"features": [{"geometry": {"coordinates": [-112.1122222, 34.75388889], "type": "Point"}, "properties": {"abstract": "The 'UVX: Contracts to Supply Flux to Hidalgo and Chino Smelters' file is part of the A. F. Budge Mining Ltd. Mining collection. A. F. Budge Mining Ltd., a British company owned by Tony Budge, controlled properties across several western U. S. states and northern Mexico. The company was active in Arizona during the 1980s and into the early 1990s. The collection consists of economic geologic information including maps, logs, reports and records. A few properties make up most of the collection: Vulture, United Verde Extension and Korn Kob.", "place_keywords": ["United States", "Arizona", "Yavapai County", "Clarkdale - 7.5 Min", "U.V.X. Property", "Edith And Audrey Shafts", "Little Daisy", "Verde Exploration Ltd Prop.", "Daisy Shaft", "Audrey Shaft", "T16N R2E Sec 23 NW", "Black Hills (Ya) physiographic area", "Verde metallic mineral dist.", "Yavapai552B"], "temporal_keywo

# NGGDPP CSV Format
Another format to work through is the CSV files that some states provided. Processing should look much the same as the XML example above. These will probably be a little messier because of all the silly things that can be done in CSV files, but the process should be pretty straightforward.
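
A first pass at the CSV case might look like the sketch below. It assumes the CSV columns mirror the XML fields, including a `coordinates` column holding a "longitude,latitude" string; I have not checked that against real submissions. It uses plain dictionaries so it runs without the `geojson` package, and the function name is hypothetical.

```python
import csv
import io

def csv_to_feature_collection(csv_text):
    """Convert CSV text into a GeoJSON-style feature collection.

    Returns the collection plus the record numbers that could not be converted.
    """
    features = []
    skipped = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Count the header as record 1 so skipped numbers roughly match the file.
    for record_number, row in enumerate(reader, start=2):
        try:
            lon_text, lat_text = row["coordinates"].split(",")
            lon, lat = float(lon_text), float(lat_text)
        except (KeyError, ValueError, AttributeError):
            # Missing column, empty cell, or unparseable coordinates
            skipped.append(record_number)
            continue
        # Drop the None key DictReader adds when a row has extra cells.
        props = {key: value for key, value in row.items() if key is not None}
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}, skipped

sample_csv = 'title,coordinates\nA,"-157.37,68.11"\nB,bad\n'
collection, skipped = csv_to_feature_collection(sample_csv)
print(len(collection["features"]), skipped)
```

Collecting the skipped records instead of failing on the first bad row should make it easier to report problems back to whoever supplied the file.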

# Visualization
The main thing I'm pursuing is getting to a workable API, based on ElasticSearch and the same Flask-based REST API we are building for the Biogeographic Information System, that allows for exploration and discovery across all collections in the NDC. However, we do also need some ways of exploring all of that visually. The code below uses a simple Folium map to display the collections added above. I'll do some more work on this to include properties in the markers.

My plan at this point is to fork the Burwell app from the Macrostrat folks and add in a capability to search for and display NDC artifacts. Their system uses a multi-resolution interface to global geologic maps as a base with a find-by-click approach that uses the surface geology and geographic location of a dropped pin to search a number of different services and display potentially useful items. I'll add a capacity to the discovery panel that uses either a buffer on the dropped pin or geologic formation geometry to set a spatial constraint for the NDC search.

This simple visual already shows that the processing code will need some additional logic to find outliers and flag them as suspect. I can then generate an additional API route that highlights suspect records for further action.
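
One simple form the outlier flagging could take is sketched below: mark point features whose coordinates are invalid or fall outside an extent the collection is expected to cover. The `suspect` and `suspect_reasons` property names, the function name, and the idea of supplying an expected extent per collection are all assumptions for illustration. It only handles point features with a properties dictionary.

```python
def flag_suspect_features(feature_collection, expected_bbox):
    """Add 'suspect' and 'suspect_reasons' properties to each point feature.

    expected_bbox is (west, south, east, north) in decimal degrees.
    """
    west, south, east, north = expected_bbox
    for feature in feature_collection["features"]:
        lon, lat = feature["geometry"]["coordinates"][:2]
        reasons = []
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            reasons.append("coordinates outside valid longitude/latitude range")
        elif not (west <= lon <= east and south <= lat <= north):
            reasons.append("point outside expected collection extent")
        feature["properties"]["suspect"] = bool(reasons)
        feature["properties"]["suspect_reasons"] = reasons
    return feature_collection

demo = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "properties": {},
     "geometry": {"type": "Point", "coordinates": [-157.38, 68.11]}},
    {"type": "Feature", "properties": {},
     "geometry": {"type": "Point", "coordinates": [68.11, -157.38]}},
]}
# Hypothetical extent roughly covering Alaska
flag_suspect_features(demo, (-170, 50, -130, 72))
print([f["properties"]["suspect"] for f in demo["features"]])
```

Recording the reason alongside the flag would let a suspect-records route explain why each record was singled out.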

In [10]:
m = folium.Map([45, -110], zoom_start=2)
folium.GeoJson(ndcxml_feature_collection).add_to(m)
folium.GeoJson(waf_feature_collection).add_to(m)

<folium.features.GeoJson at 0x115ec3dd8>

In [11]:
m