Part of the crux of the problem here is reading the schema.org metadata. I'm familiar with the specification and have worked on some projects with it, but I wasn't quite sure how to go about reading embedded schema.org metadata from landing pages, given the various ways the spec has morphed. I first put schema.org markup in place for ScienceBase a long time ago, and their process for loading the landing pages has really not changed since. It is woefully out of date, and that's likely to be one of our challenges.

After a little poking around, I found the extruct package, which seems to be one of the most mature in the Python world for extracting schema.org properties from web pages in the different syntaxes in which they can be embedded. I immediately ran into an issue with exploring dynamic systems like ScienceBase and NOAA One Stop, where the pages are partially rendered by a client-side JavaScript process that writes the contents. A simple use of requests to pull contents didn't work, so I had to re-learn selenium. That introduces dependencies on a web browser being installed on whatever machine is running this, plus access to drivers, which is going to be a pain if we want to eventually set this up with something like AWS Lambda. But that's a problem for another day.

In this notebook, I use selenium and extruct to check out a couple examples that we might explore in this work. I pull back their different implementations of schema.org and document a number of issues I found.
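
As a rough sketch of how the extruct results can be consumed downstream, the helper below walks a result dict keyed by syntax (as produced with `uniform=True`) and keeps only the `Dataset` items. The function name `collect_datasets` and the sample records are made up for illustration; only the overall dict shape comes from the outputs shown later in this notebook.

```python
def collect_datasets(extracted):
    """Flatten an extruct-style result into (syntax, item) pairs for Dataset items."""
    found = []
    for syntax, items in extracted.items():
        for item in items:
            if not isinstance(item, dict):
                continue
            # '@type' may be a single string or a list of strings.
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Dataset" in types:
                found.append((syntax, item))
    return found


# Hypothetical sample mimicking the shape of extruct output; not real catalog data.
sample = {
    "microdata": [{"@type": "Dataset", "@context": "http://schema.org", "name": "Example A"}],
    "json-ld": [{"@type": ["Dataset"], "name": "Example B"}, {"@type": "WebSite"}],
    "opengraph": [],
}

for syntax, item in collect_datasets(sample):
    print(syntax, item.get("name"))
```

Keeping the syntax alongside each item makes it easy to see which catalogs are still on microdata and which have moved to JSON-LD.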

In [1]:
import extruct
from selenium import webdriver

from IPython.display import display

ScienceBase, NOAA One Stop, and Data.gov are all catalogs that I happen to know have done some type of work with schema.org metadata on their landing pages. They also have interesting data that fits our overall use case. The URLs here are the main things we would be pulling back with some type of higher level search filter to get stuff of interest. I chose examples that all do have some type of mappable interface and should provide those links to us in their metadata somehow. I also did test each of these with Google's Structured Data Testing Tool to see how it renders out the information.

In [9]:
example_site_sb = 'https://www.sciencebase.gov/catalog/item/5b030c7ae4b0da30c1c1d6de'
#example_site_datagov = 'https://catalog.data.gov/dataset/epa-facility-registry-service-frs-facility-interests-dataset'
example_site_datagov = 'https://catalog.data.gov/dataset/235df488-53e1-4ebb-8a49-37134635e6ab'
example_site_noaa = 'https://data.noaa.gov/onestop/collections/details/AWbwQso3mm6hJphmMBul'

This is one of the stickier issues in this process, based on how 2 of the 3 examples dynamically render their landing pages. I also tested this with the chromium driver loaded into the Python virtual environment I'm working with, but the calling machine still needs to have an actual browser installed and able to run. If we want to run this in a serverless context, we will have to explore other means. For now, the Firefox approach I use here works fine for testing.

In [3]:
wd = webdriver.Firefox()
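
One possible variation, sketched here rather than tested against these catalogs, is to run Firefox headless and wait for embedded metadata to appear before reading the page source. This still requires Firefox and geckodriver on the machine, so it does not solve the serverless problem. The helper name `fetch_rendered_html`, the CSS selector, and the timeout are my assumptions.

```python
def fetch_rendered_html(url, timeout=20):
    """Return the page source after client-side rendering, using headless Firefox."""
    # Imported lazily so this helper can be defined on machines without selenium.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # no visible browser window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Assumed readiness signal: a JSON-LD script tag or a microdata itemscope.
        selector = "script[type='application/ld+json'], [itemscope]"
        try:
            WebDriverWait(driver, timeout).until(
                lambda d: len(d.find_elements(By.CSS_SELECTOR, selector)) > 0
            )
        except TimeoutException:
            pass  # fall through and return whatever was rendered
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

The returned HTML could then be handed to `extruct.extract` in the same way as `wd.page_source` in the cells that follow.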

## ScienceBase
In this first example, I grab up our example ScienceBase item and pull out its schema.org metadata. Things are a bit rough and not very workable for our use case here just yet. Here are a few observations:

* ScienceBase is using the very outmoded microdata approach to embedding the metadata. At the time we did the work, that really was kind of the only way to go, so it makes sense. At this point, however, it would be far more extensible to go with a JSON-LD structure.
* The content here is okay for basic discovery purposes, reflecting the long ago time when we put this into the ScienceBase infrastructure. At that time (ca. 2012ish, I think), we were really just experimenting with how well this approach would aid in our search engine optimization pursuit. It did well enough at the time.
* There is no distribution information at all in the current ScienceBase implementation, so we're basically dead in the water for our core requirement.
* There is contact information, which is one of the elements we might use for organizing the logical catalog browse list, so it will be a matter of determining whether or not the content is there to work from.

If we're going to make this successful with ScienceBase, we are going to need to either lean on the ScienceBase team to update their implementation or write some kind of temporary "proxy" code that would transform ScienceBase's custom Item schema to JSON-LD syntax. We would then use that as a stand-in for the input we would prefer to pull directly from their landing pages.
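
To make the "proxy" idea a little more concrete, here is a minimal sketch of such a transform. The input field names (`title`, `body`, `summary`, `link`, `distributionLinks`, `contacts`, `contactType`) are assumptions about the ScienceBase item JSON that would need checking against a real item, and the sample item is invented.

```python
import json


def sb_item_to_jsonld(item):
    """Map a ScienceBase-style item dict to a schema.org Dataset document."""
    doc = {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "name": item.get("title"),
        "description": item.get("body") or item.get("summary"),
        "url": item.get("link", {}).get("url"),
        # Each distribution link becomes a DataDownload entry.
        "distribution": [
            {"@type": "DataDownload", "contentUrl": link["uri"], "name": link.get("title")}
            for link in item.get("distributionLinks", [])
            if link.get("uri")
        ],
        # Contacts become creators, typed by the (assumed) contactType field.
        "creator": [
            {
                "@type": "Person" if c.get("contactType") == "person" else "Organization",
                "name": c["name"],
            }
            for c in item.get("contacts", [])
            if c.get("name")
        ],
    }
    # Drop empty values so the output only states what the item provides.
    return {k: v for k, v in doc.items() if v not in (None, [], "")}


# Hypothetical item for illustration only.
sample_item = {
    "title": "Example item",
    "body": "An example description.",
    "link": {"url": "https://example.org/item/1"},
    "distributionLinks": [{"uri": "https://example.org/services/wms", "title": "WMS"}],
    "contacts": [{"name": "Example Agency", "contactType": "organization"}],
}

print(json.dumps(sb_item_to_jsonld(sample_item), indent=2))
```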

In [4]:
wd.get(example_site_sb)
data = extruct.extract(wd.page_source, base_url=example_site_sb, uniform=True)
display(data)

{'microdata': [{'@type': 'Dataset',
   '@context': 'http://schema.org',
   'datePublished': '2018-09-30',
   'articleSection': 'Summary',
   'articleBody': 'The USGS Protected Areas Database of the United States (PAD-US) is the nation\'s inventory of protected areas, including public land and voluntarily provided private protected areas, identified as an A-16 National Geospatial Data Asset in the Cadastre Theme (https://communities.geoplatform.gov/ngda-cadastre/). The PAD-US is an ongoing project with several published versions of a spatial database including areas dedicated to the preservation of biological diversity, and other natural (including extraction), recreational, or cultural uses, managed for these purposes through legal or other effective means. The database was originally designed to support biodiversity assessments; however, its scope expanded in recent years to include all public and nonprofit lands and waters. Most are public lands owned in fee; however, long-term easem

## NOAA One Stop
I had heard the NOAA guys at ESIP talking about their work with schema.org on landing pages for the NOAA One Stop search platform, so I thought this might be a good time to see what they were up to. There is quite a bit more structured content in the NOAA system that we might be able to exploit. Here are a few initial observations:

* The identifiers and sameAs information here will let us understand the relationships between NOAA DOIs and these landing pages, which we might need depending on how we run our high level searches. (Note: I've not been able to track down a CSW interface for NOAA One Stop yet.)
* The distribution information is pretty extensive and seems to include everything in the other forms of metadata. As expected, it's going to be pretty difficult to figure out what we might be able to include as a route to some type of OGC service or something otherwise mappable. Everything is of type DataDownload (still a moving target in the spec), and disambiguatingDescription is not helpful as implemented. We might look for keywords in the descriptions, but this is an interesting example: in one case we get "Dynamic GIS" on the one hand but "mapping application" on the other, and that link goes to a web mapper. The one service here that we could tease out with pattern recognition on the URL is unfortunately down and out at the moment, which points to a need for making a decision on how to dig deeper on the links. The encodingFormat list here might also be useful as a filter, depending on how NOAA's convention works out on the values.
* NOAA's keywords implementation is kind of interesting here with an embedded hierarchy in string format that we might experiment with for constructing our browse list.
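
The URL pattern and description keyword ideas from the distribution bullet could be prototyped along these lines. The patterns, hint words, labels, and sample entries are all placeholders I made up; a real list would have to be built from what the catalogs actually publish.

```python
import re

# Hypothetical URL patterns for services we could map; not an exhaustive list.
SERVICE_PATTERNS = {
    "OGC WMS": re.compile(r"service=wms|/wms\b", re.IGNORECASE),
    "OGC WFS": re.compile(r"service=wfs|/wfs\b", re.IGNORECASE),
    "ArcGIS REST": re.compile(r"/rest/services/|/MapServer\b|/FeatureServer\b", re.IGNORECASE),
}
# Hypothetical description hints, matched on word boundaries to avoid
# false hits such as the "gis" inside "registry".
DESCRIPTION_HINTS = re.compile(r"\b(gis|mapping|map service)\b", re.IGNORECASE)


def classify_distribution(entry):
    """Guess whether a schema.org distribution entry points at something mappable."""
    url = entry.get("contentUrl") or entry.get("url") or ""
    for label, pattern in SERVICE_PATTERNS.items():
        if pattern.search(url):
            return label
    text = " ".join(
        str(entry.get(key, ""))
        for key in ("name", "description", "disambiguatingDescription")
    )
    if DESCRIPTION_HINTS.search(text):
        return "possibly mappable (check manually)"
    return "unclassified"


# Invented entries for illustration only.
examples = [
    {"contentUrl": "https://example.org/geoserver/wms?request=GetCapabilities"},
    {"contentUrl": "https://example.org/viewer", "description": "Dynamic GIS mapping application"},
    {"contentUrl": "https://example.org/data.csv", "description": "Facility registry download"},
]
for entry in examples:
    print(classify_distribution(entry), "<-", entry["contentUrl"])
```

The URL check runs first because a recognizable service endpoint is stronger evidence than a description, which, as the web mapper case shows, can be misleading.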

In [7]:
wd.get(example_site_noaa)
data = extruct.extract(wd.page_source, base_url=example_site_noaa, uniform=True)
display(data)

{'microdata': [],
 'json-ld': [{'@context': 'http://schema.org',
   '@type': 'Dataset',
   'name': 'U.S. Hourly Climate Normals (1981-2010)',
   'alternateName': 'gov.noaa.ncdc:C00824',
   'description': 'The U.S. Hourly Climate Normals for 1981 to 2010 are 30-year averages of meteorological parameters for thousands of U.S. stations located across the 50 states, as well as U.S. territories, commonwealths, the Compact of Free Association nations, and one station inCanada. NOAA Climate Normals are a large suite of data products that provide users with many tools to understand typical climate conditions for thousands of locations across the United States. As many NWS stations as possible are used, including those from the NWS Cooperative Observer Program (COOP) Network as well as some additional stations that have a Weather Bureau Army-Navy (WBAN) station identification number, including stations from the Climate Reference Network (CRN). The comprehensive U.S. Climate Normals dataset incl

## Data.gov
The Data.gov implementation of schema.org is interesting. I suspect that the team there has built some type of generalized schema.org metadata generation into their underlying CKAN implementation that may or may not be useful for our purposes.

* The majority of the content here is made up of tags and keywords in a couple different forms. Those are such a hodgepodge of different concepts, with no qualification or semantics, that they will be very difficult to use in meaningful ways.
* The Data.gov folks have taken an interesting approach to distribution links. They now seem to point every link at an intermediary page rendered in the Data.gov system itself, where a link to something real may or may not be active. As it stands, the URLs in the metadata itself are not going to be usable for our purposes.
* One of the real challenges with Data.gov for broader use cases than ours is the issue of provenance back to source material from this much higher level aggregator. They could really use the sameAs concept here to help keep track of what these records are related to back in source systems from providing agencies.
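
A simple way to flag those intermediary links, sketched under the assumption that "intermediary" means "hosted on the same domain as the catalog page", is to compare hosts. The first `contentUrl` is copied from this example's extruct output; the second is a made-up external link for contrast.

```python
from urllib.parse import urlparse


def points_back_to_catalog(content_url, catalog_url):
    """True when a distribution URL lives on the same host as the catalog page."""
    return urlparse(content_url).netloc.lower() == urlparse(catalog_url).netloc.lower()


landing_page = "https://catalog.data.gov/dataset/235df488-53e1-4ebb-8a49-37134635e6ab"
distribution = [
    {
        "@type": "DataDownload",
        "contentUrl": "https://catalog.data.gov/dataset/usgs-us-topo-map-collection/resource/64afef66-f652-4112-b37c-09e772647a94",
    },
    # Hypothetical direct link, for contrast.
    {"@type": "DataDownload", "contentUrl": "https://example.org/data/topo.zip"},
]

# Keep only links that leave the catalog and might resolve to real data.
usable = [d for d in distribution if not points_back_to_catalog(d["contentUrl"], landing_page)]
print(len(usable), "of", len(distribution), "links point outside the catalog")
```

Links that fail this check would need a second fetch of the intermediary page to find the real target, which is extra work we may or may not want to take on.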

In [10]:
wd.get(example_site_datagov)
data = extruct.extract(wd.page_source, base_url=example_site_datagov, uniform=True)
display(data)

{'microdata': [{'@type': 'Dataset',
   '@context': 'http://schema.org',
   'name': 'USGS US Topo Map Collection',
   'dateModified': 'June 28, 2019',
   'description': 'Layered GeoPDF 7.5 Minute Quadrangle Map. Layers of geospatial data include orthoimagery, roads, grids, geographic names, elevation contours, hydrography, and other selected map features. This map depicts geographic features on the surface of the earth. One intended purpose is to support emergency response at all levels of government. The geospatial data in this map are from selected National Map data holdings and other government sources.',
   'distribution': [{'@type': 'DataDownload',
     'contentUrl': 'https://catalog.data.gov/dataset/usgs-us-topo-map-collection/resource/64afef66-f652-4112-b37c-09e772647a94',
     'description': 'Web Accessible Folder'},
    {'@type': 'DataDownload',
     'contentUrl': 'https://catalog.data.gov/dataset/usgs-us-topo-map-collection/resource/38271f4d-7d88-4ccb-b77d-9b83bc9e95c4',
     