However we go about building a customized map catalog from records in a metadata catalog, we need some way of querying the source catalog for items of interest. In keeping with our stated design principle of promoting open standards, the Open Geospatial Consortium (OGC) Catalogue Service for the Web (CSW) standard looks like our best route.

## Why the SDC won't work right now
In many ways, the USGS Science Data Catalog (SDC) would be the conceptual starting point for our purpose. That catalog at data.usgs.gov is nominally the curated master inventory of USGS data. It occupies the same architectural space as similar systems at data.noaa.gov and data.nasa.gov from our sister earth science agencies. It is what we use to serve data to data.doi.gov at our parent agency level and to data.gov for the US government data presence as a whole. Notionally, we should be able to query the SDC and find well-curated metadata for our use case of the high-level data assets for a given organizational unit.

However, there are some distinct blockers in the current situation with the SDC:

1. The only machine interface to the SDC currently is a custom REST API based on the underlying Solr query parameters. That interface returns a custom schema from the search index. This means we don't have something standard to build against like a CSW interface.
2. There is no notion of a landing page for individual records from the SDC, meaning that there is no place in the architecture where schema.org markup could be implemented. Metadata records from search results are rendered inline in the web application and provided to downstream aggregators (data.doi.gov and data.gov) via Web Accessible Folders with XML metadata files.
3. Many of the metadata records we want to use for our exercise use case are only partial records at the SDC discovery point because of how the SDC interfaces with ScienceBase. The SDC contains the "original metadata" for ScienceBase Repository items by design. This is the original FGDC CSDGM XML document that was loaded to the ScienceBase Item, and in most cases, these will not contain the distribution links created for services provided by the ScienceBase infrastructure - the exact links that we need to use for determining what's mappable and building our custom catalog.

These blockers could be resolved in future, and we can hopefully inform those developments with requirements from this work.

In [1]:
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsEqualTo, PropertyIsLike

## Powering with ScienceBase
It's not that ScienceBase is a bad choice as our starting point. The only reason I looked first to SDC and Data.gov is that the higher level catalogs make a little more sense for building a generalized toolset. Most of the data assets we want to use are cataloged in ScienceBase, and the ones missing can be added fairly readily. The ScienceBase Catalog offers a CSW interface that is fairly customized and a bit older in terms of its architecture, but it is still functional.

The following codeblocks run through a rudimentary start to a query pattern using OWSLib. We'll need to fiddle with this further to see if we can and should attempt to use sb:servicetype as a pre-filter on the query or incorporate that into later processing logic. I'm not yet sure how to combine query constraints like that into logical OR statements using OWSLib.

Obviously, if CSW gives us all the properties we need to build our customized map catalog, that potentially obviates the part of this work that relies on schema.org information embedded in landing pages. Why query yet another source in a manner that amounts to web scraping when an API like CSW gives us complete records? That is a fair point that I think we'll need to work through. I will say that the trajectory of the schema.org Dataset type seems to be clarifying, at a simpler level, the kinds of metadata elements we really need to drive a use case like this one. After struggling through the "robustness" of CSW and the ISO19139 metadata schema, I'm inclined toward something more straightforward and less prone to the differing interpretations we already see in the couple of interfaces explored here.

There is a lot of work going on these days on mappings between metadata schemas, and one approach would be to use schema.org Dataset as our common denominator. We can leverage a mapping from ISO19139 to schema.org to generate a JSON-LD representation of the information we need in our catalog, and then build the catalog from that form. In future, we might be able to get that JSON-LD directly from another source, and a utility that takes an ISO XML record as input and emits schema.org JSON-LD might be useful to others anyway (if one doesn't already exist somewhere).
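To make the common-denominator idea concrete, here is a minimal sketch of the target shape. It assumes the fields of interest have already been pulled out of an ISO record into a flat dict. The dict keys, the function name `to_schema_org_dataset`, and the choice to represent the WMS link as a `DataDownload` are all assumptions made for illustration, not an established crosswalk. The sample values are taken from the ScienceBase record printed later in this notebook.

```python
import json


def to_schema_org_dataset(fields):
    """Map a flat dict of catalog fields to a schema.org Dataset as JSON-LD."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": fields.get("title"),
        "description": fields.get("abstract"),
        "url": fields.get("landing_page"),
        "datePublished": fields.get("publication_date"),
        "keywords": fields.get("keywords"),
    }

    # schema.org expresses a time range as an ISO 8601 interval "start/end".
    start, end = fields.get("start_date"), fields.get("end_date")
    if start and end:
        doc["temporalCoverage"] = f"{start}/{end}"

    # Assumption: the map service is described as one distribution entry.
    if fields.get("wms_url"):
        doc["distribution"] = [{
            "@type": "DataDownload",
            "name": "OGC:WMS",
            "contentUrl": fields["wms_url"],
        }]

    # Drop properties that have no value instead of emitting nulls.
    return {k: v for k, v in doc.items() if v not in (None, "", [])}


sample = {
    "title": "Yellow-bellied Flycatcher (Empidonax flaviventris) "
             "bYBFLx_CONUS_2001v1 Range Map",
    "landing_page": "https://www.sciencebase.gov/catalog/item/59f5e22de4b063d5d307dd19",
    "publication_date": "2018-07",
    "start_date": "2008",
    "end_date": "2013",
    "keywords": [],
    "wms_url": "https://www.sciencebase.gov/geoserver/CONUS_Range_2001/wms"
               "?service=WMS&version=1.1.0&request=GetCapabilities",
}

doc = to_schema_org_dataset(sample)
print(json.dumps(doc, indent=2))
```

Keeping the mapping as a plain function over a dict means the same code could later accept fields parsed from a different source without touching the catalog-building step.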

In [2]:
%%time
sb_csw = CatalogueServiceWeb('https://www.sciencebase.gov/catalog/csw')

CPU times: user 19.3 ms, sys: 6.43 ms, total: 25.8 ms
Wall time: 46.2 ms


In [3]:
drl_query = PropertyIsEqualTo('sb:collection', '5644f3c1e4b0aafbcd0188f1')
wms_query = PropertyIsEqualTo('sb:servicetype', 'OGC-WMS')

In [4]:
%%time
sb_csw.getrecords2(
    constraints=[drl_query, wms_query], 
    maxrecords=20, 
    outputschema='http://www.isotc211.org/2005/gmd'
)

CPU times: user 121 ms, sys: 44.8 ms, total: 166 ms
Wall time: 512 ms




In [5]:
sb_csw.results

{'matches': 1848, 'returned': 20, 'nextrecord': 21}
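With 1848 matches and only 20 records returned per call, building a full catalog means paging through the result set. A minimal sketch of one way to do that follows, using the `startposition` argument of `getrecords2` and the `nextrecord` value reported in `results`. `iter_csw_records` is a hypothetical helper name, and starting at 0 simply mirrors the OWSLib default used by the call above (an assumption about how the server treats it).

```python
def iter_csw_records(csw, constraints, page_size=20, max_pages=None, **kwargs):
    """Yield (identifier, record) pairs across all pages of a GetRecords query.

    `csw` is expected to behave like owslib.csw.CatalogueServiceWeb: each
    getrecords2() call refreshes `csw.records` and `csw.results`.
    """
    start = 0  # OWSLib's default startposition
    pages = 0
    while max_pages is None or pages < max_pages:
        csw.getrecords2(
            constraints=constraints,
            startposition=start,
            maxrecords=page_size,
            **kwargs,
        )
        pages += 1
        yield from csw.records.items()

        results = csw.results
        nxt = results.get("nextrecord", 0)
        # Stop when the server signals the end (0), points past the match
        # count, returns nothing, or fails to advance.
        if (
            not nxt
            or nxt > results.get("matches", 0)
            or results.get("returned", 0) == 0
            or nxt <= start
        ):
            break
        start = nxt
```

It could be called as `iter_csw_records(sb_csw, [drl_query, wms_query], outputschema='http://www.isotc211.org/2005/gmd')`, with `max_pages` set to a small number while experimenting so that a broad query does not pull the whole catalog.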

In [6]:
for rec in sb_csw.records.values():
    ident = rec.identification

    # Core descriptive fields
    print(rec.dataseturi)
    print(ident.title)
    print(ident.abstract)
    print(rec.datetimestamp)

    # Metadata contacts
    for c in rec.contact:
        print(c.name, c.role, c.organization, c.email)

    # Dates attached to the identification section
    for d in ident.date:
        print(d.date, d.type)

    # Keyword groups with their thesaurus
    for t in ident.keywords:
        print(t.type, t.thesaurus, t.keywords)

    # First distribution link advertised as a WMS endpoint, if any
    wms_url = next((d.url for d in rec.distribution.online if d.protocol == "OGC:WMS"), None)
    print(wms_url)

    print("=======")


https://www.sciencebase.gov/catalog/item/59f5e22de4b063d5d307dd19
Yellow-bellied Flycatcher (Empidonax flaviventris) bYBFLx_CONUS_2001v1 Range Map
This dataset represents a species known range extent for Yellow-bellied Flycatcher (Empidonax flaviventris) within the conterminous United States (CONUS) based on 2001 ground conditions. This range map was created by attributing sub-watershed polygons with information of a species' presence, origin, seasonal and reproductive use. See <a href="https://www.sciencebase.gov/catalog/item/5951527de4b062508e3b1e79">Gap Analysis Project Species Range Maps</a> for more information regarding data creation and user constraints. For species specific range information, see the attached Range data.
2019-02-07T17:55:18Z
Matt Rubino pointOfContact None None
Jeff Lonneker pointOfContact None None
2018-07 Publication Date
2008 Start Date
2013 End Date
https://www.sciencebase.gov/geoserver/CONUS_Range_2001/wms?service=WMS&version=1.1.0&request=GetCapabilities


## Why Data.gov and data.doi.gov won't work right now
We might also reasonably expect to build our custom map catalog from one of the next two government metadata aggregators "above" the USGS in the hierarchy. However, the pipelines from data.usgs.gov to these higher-level aggregation points appear to have been languishing. The DOI data catalog does not seem to be online or active at all at the moment. Data.gov is a bit of a mess when it comes to zeroing in on an organizational unit. USGS is not listed as a Bureau under the Department of the Interior at the Agency level. USGS is instead listed at the Agency level itself, but it shows only 54 datasets as of October 21, 2019. I had to do a little digging to find the Data.gov CSW endpoint (http://catalog.data.gov/csw-all). It works relatively well, has more queryables and response formats, and returns results faster than the ScienceBase CSW. But without some indication that the records we want are in the catalog, there is not much point in pursuing it further until those issues are resolved.

The code blocks below take a brief look at the Data.gov CSW interface. I ran aground when trying to tease out exactly how Data.gov's ISO19139 rendering exposes its distribution links and a few other attributes of interest.
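One way to keep exploring where the distribution links live, without the loop failing on records that lack an element, is to read each attribute defensively. The sketch below assumes record objects shaped like the ones OWSLib returned for ScienceBase above (`distribution.online` entries carrying `protocol` and `url`). `online_links` and `first_wms_url` are hypothetical helper names, and whether Data.gov's records populate these attributes at all is exactly the open question.

```python
def online_links(record):
    """Return (protocol, url) pairs from record.distribution.online, or []."""
    distribution = getattr(record, "distribution", None)
    online = getattr(distribution, "online", None) or []
    return [
        (getattr(res, "protocol", None), getattr(res, "url", None))
        for res in online
    ]


def first_wms_url(record):
    """Return the first link whose protocol string mentions WMS, else None."""
    for protocol, url in online_links(record):
        if url and protocol and "WMS" in protocol.upper():
            return url
    return None
```

Printing `online_links(rec)` for a handful of Data.gov records would show which protocol strings, if any, that catalog uses, which is a useful check before settling on a match rule like the substring test in `first_wms_url`.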

In [7]:
%%time
dg_csw = CatalogueServiceWeb('http://catalog.data.gov/csw-all', timeout=60)

CPU times: user 37.5 ms, sys: 6.67 ms, total: 44.2 ms
Wall time: 1.8 s


In [8]:
dg_csw.get_operation_by_name('GetRecords').constraints

[Constraint: AdditionalQueryables - ['apiso:AccessConstraints', 'apiso:Classification', 'apiso:ConditionApplyingToAccessAndUse', 'apiso:Contributor', 'apiso:Creator', 'apiso:Degree', 'apiso:Lineage', 'apiso:OtherConstraints', 'apiso:Publisher', 'apiso:Relation', 'apiso:ResponsiblePartyRole', 'apiso:SpecificationDate', 'apiso:SpecificationDateType', 'apiso:SpecificationTitle'],
 Constraint: SupportedDublinCoreQueryables - ['csw:AnyText', 'dc:contributor', 'dc:creator', 'dc:date', 'dc:format', 'dc:identifier', 'dc:language', 'dc:publisher', 'dc:relation', 'dc:rights', 'dc:source', 'dc:subject', 'dc:title', 'dc:type', 'dct:abstract', 'dct:alternative', 'dct:modified', 'dct:spatial', 'ows:BoundingBox'],
 Constraint: SupportedISOQueryables - ['apiso:Abstract', 'apiso:AlternateTitle', 'apiso:AnyText', 'apiso:BoundingBox', 'apiso:CRS', 'apiso:CouplingType', 'apiso:CreationDate', 'apiso:Denominator', 'apiso:DistanceUOM', 'apiso:DistanceValue', 'apiso:Format', 'apiso:GeographicDescriptionCode',

In [9]:
dg_query = PropertyIsLike('csw:AnyText', '%birds%')

In [10]:
%%time
dg_csw.getrecords2(
    constraints=[dg_query], 
    maxrecords=20,
    outputschema='http://www.isotc211.org/2005/gmd'
)

CPU times: user 83.5 ms, sys: 7.76 ms, total: 91.3 ms
Wall time: 754 ms


In [11]:
dg_csw.results

{'matches': 254, 'returned': 20, 'nextrecord': 21}

In [12]:
for r in dg_csw.records:
    print(dg_csw.records[r].dataseturi)
    print(dg_csw.records[r].identification.title)
    print(dg_csw.records[r].identification.abstract)
    print(dg_csw.records[r].datetimestamp)

#    for c in dg_csw.records[r].contact:
#        print(c.name, c.role, c.organization, c.email)
    
#    for d in dg_csw.records[r].identification.date:
#        print(d.date, d.type)
        
#    for t in dg_csw.records[r].identification.keywords:
#        print(t.type, t.thesaurus, t.keywords)

#    wms_url = next((d.url for d in dg_csw.records[r].distribution.online if d.protocol == "OGC:WMS"), None)
#    print(wms_url)
    
    print("=======")


None
USGS 7.5 Minute Topographic Maps (2003)
This data set consists of images derived from scans of U.S.G.S. 7.5 minute quadrangle maps (Digital Raster Graphics). The images form a set of seamless tiles, 5,000 m on a side, from which the map collar information was removed. Images are GeoTIFF files in NAD83, UTM zone 18.
None
None
Walk-In Hunting Access (WIHA) Fall 2008
This shapefile represents the private lands leased by the Kansas Department of Wildlife and Parks for fall 2008 public hunting access through the Walk-In Hunting Access (WIHA) program.
None
None
USGS 7.5 Minute Topographic Maps (2000)
This data set consists of images derived from scans of U.S.G.S. 7.5 minute quadrangle maps (Digital Raster Graphics). The images form a set of seamless tiles, 5,000 m on a side, from which the map collar information was removed. Images are GeoTIFF files in NAD83, UTM zone 18.
None
None
Walk-In Hunting Access (WIHA) Fall 2009
This shapefile represents the private lands leased by the Kansas D