One thing we have not done much work with in ScienceBase is web links. ScienceBase Catalog (the main store of metadata items) provides two main routes to the data that items document: files and links. Some links are housed in specific extensions (facets) on the items and are generated in a particular structured way (e.g., when a process is triggered to create geospatial services from spatial data files). Other links are housed in the webLinks container. The webLinks structure has been left open to all kinds of things being added, and the current vocabulary defining web link types is a broad mix of concepts that are not all well defined or semantically aligned.

In an attempt to start treating web links more seriously in the ScienceBase model, I've introduced a new module that handles functionality I have been developing for a particular use case where I needed to track down further structured information behind the models. I put this in its own module outside "SbSession" because the code and functionality are quite different from what has been in that core module and class for some time. However, it seems more valuable to build this additional functionality into the sciencebasepy package itself than to sequester it somewhere else.

The first thing I've worked up here is a set of functions to examine link responses to extract as much structured information from those links as possible to help better characterize them for use. I've employed several specific strategies that essentially generate an additional "annotation" object for the ScienceBase Item web link structure. To build the annotation, we try a number of strategies to pull together useful information about what's behind the links to help with a variety of applications.

* At a minimum, we check that a link returns some kind of response and record the date and time of the check. In this way, the process can act as a simple link checker.
* In seeking a response, we send in a couple of options in an Accept header to try for content negotiation. In some cases, this can net a structured JSON or XML response from the URL that we can potentially do something interesting with.
* If we get an HTML type of response, I run two processes to try for structured metadata - an old style named meta tag extraction routine and a "new style" approach using the extruct package from the ScrapingHub project to check for structured metadata in several formats (JSON-LD, Microdata, etc.).
* If we get an XML response with content negotiation, I use the gis-metadata-parser package to see if it is in a discernible GIS metadata format (FGDC CSDGM, ISO19139, Esri-ArcGIS) and pull out high level properties into a summary structure.
* If we get a JSON response with content negotiation, right now, I'm just shoving it into its own object. In future, we would want to compare that to known patterns of interest with something like a JSON Schema approach to pull together useful summary information about what's behind the link.
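
The first few strategies in the list above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in the Weblinks module: the Accept header value and the names classify_content_type, annotate_link, and extract_named_meta are hypothetical, and the annotation keys simply mirror the ones that appear in the output later in this notebook.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Hypothetical Accept header: ask for JSON first, then XML, then HTML.
ACCEPT = "application/json, application/xml;q=0.9, text/html;q=0.8, */*;q=0.1"


def classify_content_type(header_value):
    """Map a Content-Type header value to a coarse label."""
    media_type = (header_value or "").split(";")[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    if media_type == "application/json" or media_type.endswith("+json"):
        return "json"
    if media_type in ("application/xml", "text/xml") or media_type.endswith("+xml"):
        return "xml"
    return "other"


def annotate_link(url, timeout=10):
    """Request a URL with content negotiation and summarize the response."""
    annotation = {"link_check_date": datetime.now(timezone.utc).isoformat()}
    request = Request(url, headers={"Accept": ACCEPT})
    try:
        with urlopen(request, timeout=timeout) as response:
            annotation["status_code"] = response.status
            annotation["headers"] = dict(response.headers.items())
            annotation["content_type"] = classify_content_type(
                response.headers.get("Content-Type")
            )
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        annotation["status_code"] = err.code
    except OSError as err:
        # No usable response at all (DNS failure, refused connection, timeout).
        annotation["error"] = str(err)
    return annotation


class NamedMetaParser(HTMLParser):
    """Collect old-style <meta name="..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


def extract_named_meta(html_text):
    """Return a dict of named meta tags found in an HTML document."""
    parser = NamedMetaParser()
    parser.feed(html_text)
    return parser.meta
```

Calling annotate_link on a URL makes a live request, so the pure helpers (classify_content_type and extract_named_meta) are the parts that are easy to exercise offline. Structured formats such as JSON-LD and Microdata are not covered by this sketch.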

In [1]:
from sciencebasepy import Weblinks
from IPython.display import display

There is a lot more work to do here eventually. Right now, I'm only instantiating a non-authenticated SbSession() and working with public items. The code is not well documented yet, and I haven't written any tests for the new module.

In [2]:
sb_wl = Weblinks()

A process like this could be launched in many ways, depending on where and how links are gathered from ScienceBase Items. I created a simple fetch_web_links function that takes an item ID with optional parameters for link type and link title. We have ended up using link title quite a bit in our work to classify particular types of links or declare their function in the context of the item. We've done this mostly because the vocabulary for web link types is not yet well controlled.

In [3]:
web_links = sb_wl.fetch_web_links('5e8de97682cee42d134687ce')

The fetch_web_links function spits out a simplified ScienceBase item model with just the webLinks list because that's all we're concerned with at this point.

In [4]:
web_links

{'id': '5e8de97682cee42d134687ce',
 'webLinks': [{'type': 'webLink',
   'typeLabel': 'Web Link',
   'uri': 'https://water.usgs.gov/ogw/bgas/1dtemppro/',
   'rel': 'related',
   'title': 'Model Reference Link',
   'hidden': False},
  {'type': 'webLink',
   'typeLabel': 'Web Link',
   'uri': 'https://doi.org/10.5066/P9Q8JGAO',
   'rel': 'related',
   'title': 'Model Output Data',
   'hidden': False}]}
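
Given a structure like the one above, selecting links by type or title is a small filtering step. The sketch below is only an illustration of that idea: filter_web_links and its parameter names are hypothetical and are not the signature of fetch_web_links.

```python
def filter_web_links(item, link_type=None, link_title=None):
    """Return the webLinks entries that match the given type and/or title."""
    matches = []
    for link in item.get("webLinks", []):
        if link_type is not None and link.get("type") != link_type:
            continue
        if link_title is not None and link.get("title") != link_title:
            continue
        matches.append(link)
    return matches


# A trimmed copy of the item structure shown above.
example_item = {
    "id": "5e8de97682cee42d134687ce",
    "webLinks": [
        {
            "type": "webLink",
            "uri": "https://water.usgs.gov/ogw/bgas/1dtemppro/",
            "title": "Model Reference Link",
        },
        {
            "type": "webLink",
            "uri": "https://doi.org/10.5066/P9Q8JGAO",
            "title": "Model Output Data",
        },
    ],
}

output_links = filter_web_links(example_item, link_title="Model Output Data")
```

Because both links here share the same type, the title is the only property that tells them apart, which matches the reliance on link titles described earlier.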

We can either run one webLink object through at a time or process a whole list. Conceivably, we could put together a giant list of links from multiple items and run all of those through. Before this goes much further, though, we need to think through the dynamics of what amounts to web crawling/scraping in terms of scalability and being a polite member of the web community. We definitely want to keep our system from running amok and making a ton of requests to "foreign" web addresses, and we'll probably want to tweak settings for things like user agent advertising and honoring robot blockers.
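
Two of the politeness concerns mentioned here, honoring robots.txt and limiting the request rate per host, can be sketched with the standard library. This is a rough illustration under assumptions: the user agent string, allowed_by_robots, and HostThrottle are hypothetical names, and nothing like this is claimed to exist in the module yet.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical user agent string used to identify the link checker.
USER_AGENT = "sciencebase-link-checker-example"


def allowed_by_robots(url, robots_lines, user_agent=USER_AGENT):
    """Check a URL against the lines of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, sleep=time.sleep):
        self.min_interval = min_interval
        self.sleep = sleep  # injectable so the delay can be observed in tests
        self.last_request = {}

    def wait(self, url):
        host = urlparse(url).netloc
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.min_interval - (time.monotonic() - last)
            if remaining > 0:
                self.sleep(remaining)
        self.last_request[host] = time.monotonic()
```

In use, a crawler would call wait(url) right before each request and skip any URL for which allowed_by_robots returns False. Fetching and caching each host's robots.txt is left out of the sketch.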

In [5]:
sb_wl.process_web_links(web_links["webLinks"])

[{'type': 'webLink',
  'typeLabel': 'Web Link',
  'uri': 'https://water.usgs.gov/ogw/bgas/1dtemppro/',
  'rel': 'related',
  'title': 'Model Reference Link',
  'hidden': False,
  'annotation': {'link_check_date': '2020-04-29T16:36:51.892785',
   'content_type': 'html',
   'status_code': 200,
   'headers': {'Date': 'Wed, 29 Apr 2020 16:36:52 GMT', 'Strict-Transport-Security': 'max-age=31536000; preload;', 'Accept-Ranges': 'bytes', 'P3P': 'CP="NON DSP LAW CUR ADMa DEVa OUR STA COM NAV PHY ONL", policyref="/w3c/p3p.xml"', 'pics-label': '(pics-1.1 "http://www.icra.org/pics/vocabularyv03/" l gen true for "https://water.usgs.gov" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0) gen true for "http://www.water.usgs.gov" r (n 0 s 0 v 0 l 0 oa 0 ob 0 oc 0 od 0 oe 0 of 0 og 0 oh 0 c 0))', 'Link': '</labels.rdf>; /="/"; rel="meta" type="application/rdf+xml"; title="ICRA labels";', 'X-Frame-Options': 'ALLOW-FROM https://usgs.maps.arcgis.com/', 'Keep-Alive': 'timeout=3, max=476', 'Con

From here, the question becomes what to do with this information. From the general ScienceBase perspective, it might be interesting to set something up to run this routinely and at least check status of every link in the system. A simple broken link report might be useful for content owners. The information pulled together about the things Items link to is ultimately meaningful in some context, and that's where more structured link type classification would be useful.
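
As one illustration of the broken link report idea, annotated links could be reduced to the ones that failed their check. This sketch assumes each link carries an annotation with the status_code and link_check_date keys shown in the output above; broken_link_report is a hypothetical helper, and treating any status of 400 or higher (or a missing status) as broken is an assumption.

```python
def broken_link_report(annotated_links):
    """List links whose check returned no status or an HTTP error status."""
    report = []
    for link in annotated_links:
        annotation = link.get("annotation", {})
        status = annotation.get("status_code")
        if status is None or status >= 400:
            report.append(
                {
                    "uri": link.get("uri"),
                    "title": link.get("title"),
                    "status_code": status,
                    "link_check_date": annotation.get("link_check_date"),
                }
            )
    return report
```

Grouping the resulting rows by item or by content owner would be a natural next step, but that depends on information that is not part of the webLinks structure shown here.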

The motivating use case I'm exploring is for links associated with items describing models in ScienceBase where I am looking to round out the information that is being associated with a model. In that case, we're going to end up with things like links to model input and output data, web sites describing models and model projects, publications about the models, source code for the models, and others. Structured metadata coming from each of those sources is useful at different levels. It would, however, be useful to have the ability to pull all of this information into the ScienceBase search index in some way to aid in discovering models via some part of their related information. This same basic "expanding the search index pool" idea is likely useful for other use cases as well.