Now that we have a [container](https://www.sciencebase.gov/catalog/item/5e8de96182cee42d134687cc) for model items in ScienceBase that we can operate against, it opens up all kinds of interesting possibilities for "robots" to do some work for us. This notebook explores what we might be able to get from the web links added into the mix. Theoretically, those represent a wealth of meaty material to help flesh out the model catalog as a useful resource. We can write some code that can go do some gathering, and then decide if there is anything that we can consistently bring back in to flesh out the items. Because we're doing this in code and this whole thing is experimental, we can be reasonably safe in writing information back into ScienceBase for use and evaluation. We just need to keep track of the parts of ScienceBase Items where we want to "cede control to the robots" and the parts we want to manage in some other way.

There are several different strategies for leveraging those links to get more information together. One interesting approach would be to simply index everything we find on subsequent landing pages and even spider into the contents like any search engine. I've proposed in the past that ScienceBase could do this writ large, providing directed search engine functionality to go after linked content in meaningful ways.

For this exercise, I'm focusing on a couple ways of fishing for structured metadata. This little robot is essentially the machine we talk about when we say things about "machine-readable" metadata or other content. A couple of potential strategies occur to me:

* Content negotiation - Some web pages and applications accessible over HTTP enable content negotiation, which is a method for the requestor to negotiate the structure or substance of the content that is returned in a response. It can also do things like specify languages that content should be returned in. This is all based on what the server/application providing the content will actually support. You can't force the system to give you something it isn't prepared to deliver. Systems like ScienceBase and some other USGS platforms do support content negotiation with a couple different machine readable options, so it's worth trying to see what we might be able to use.
* Structured metadata - There are a whole variety of ways that web systems have worked out to embed structured, machine-readable metadata within the HTML content delivered through web pages and web apps. Many of these coalesce around the schema.org set of content specifications to give us some degree of consistency to work from. One cool thing about these techniques is that they can be implemented right within the primary vehicle for web content delivery - the web page viewed by humans. Perhaps the coolest thing from a data science perspective is that the specifications and encoding methods really encourage explicit semantics (meaning we don't have to guess about what words mean), use of persistent identifiers and linking to associated registries (so we don't have to guess about disambiguating things), and linking between concepts that relate (so we can go after a wealth of information and tie it all together into a network/graph).

Of these two, structured metadata probably offers us the most promise in terms of richness of content, consistency in resulting data, and overall ease of use. Unfortunately, the uptake of these methods in USGS is pretty abysmal, including in ScienceBase where we've not updated our use of schema.org metadata in 8 years or so.
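
To make the embedded structured metadata idea a little more concrete, here is a minimal sketch of what a robot sees when a page carries a schema.org JSON-LD block. It uses only the Python standard library rather than the extruct package used later in this notebook. The sample page, the `JsonLdParser` class, the `extract_json_ld` function, and the identifier in the sample are all made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_json_ld(html_text):
    """Return a list of parsed JSON-LD documents found in an HTML string."""
    parser = JsonLdParser()
    parser.feed(html_text)
    return parser.documents


# Hypothetical landing page: humans read the body, robots read the script block
sample_html = """
<html><head>
<title>Example Model</title>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "SoftwareApplication",
 "name": "Example Model", "identifier": "https://doi.org/10.0000/example"}
</script>
</head><body><p>Human-readable page</p></body></html>
"""

docs = extract_json_ld(sample_html)
print(docs[0]["@type"], docs[0]["name"])
```

The point is that the type, name, and identifier arrive with explicit meaning attached, so nothing has to be guessed from page layout.
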

As a last resort, we could fall back on painful web scraping methods where we essentially parse HTML content and try to extract useful information. These are painful because every one would be a mostly custom affair that is probably more trouble than it's worth. I do show a small bit of that here just for demonstration purposes.

In [1]:
from sciencebasepy import SbSession
import requests
import extruct
from w3lib.html import get_base_url
from bs4 import BeautifulSoup
from IPython.display import display
import random

I go ahead and use sciencebasepy here to get the items from the USGS Model Catalog, but we could just as easily work directly with the REST API at this point since the collection is public. If we do take this another step and write information back into ScienceBase, we'll need to authenticate our session.

In [2]:
sb = SbSession()

In this codeblock, I retrieve the id, title, and webLinks for items in the model catalog and assemble a scaled down data structure that has just the stuff I care about working with. We could just run through and check links in real time, but a) I don't want to retrieve the same URL more than once and b) ScienceBase gets cranky with too many requests. Since I'm going to be hitting all of the URLs with an HTTP request, I put those together into their own list of just the unique URL strings. This takes care of any cases where more than one model item might be referencing the same link. Later, if I decide to use the information I've collected, I can come back to my model list and figure out which ScienceBase Items the information applies to.

At this point, I'm not concerned about what type of link we're dealing with. I can again come back and work on using the title of the web links (e.g., "Model Reference Link") to make decisions on what to do with what I find. In looking through the links, however, and the information that does come back on some of them, we might need to do a little more work to better classify just what these links mean when it comes to using content from their pages in a meaningful way (more on that later).

In [3]:
models = list()
links = list()
items = sb.find_items({'parentId': '5e8de96182cee42d134687cc', 'fields': 'title,webLinks', 'max': 100})
while items and 'items' in items:
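    for item in items['items']:
        # Drop keys we don't need; pop() tolerates items where a key is absent
        item.pop("link", None)
        item.pop("relatedItems", None)
        models.append(item)
        # Guard against items that come back without a "webLinks" key
        links.extend([l["uri"] for l in item.get("webLinks", [])])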
    items = sb.next(items)

unique_links = list(set(links))

With our list of unique URLs, we can now go out and figure out if we have anything interesting to work with. The following codeblock does all of that in one go so that each URL only gets hit once to see what we can figure out. In this stage, I am essentially gathering some content and adding what I find to a data structure for each URL. I can then go back and process that in different ways to pull out what I want to use, but for now, I'm only taking it this far to examine for possibilities.

I put the following logical pieces into a very rough and notional function to process each given URL. Depending on how useful this information proves to be, we would want to refine this to something more real and manageable.

* Content negotiation is a whole topic in its own right that could use some careful consideration and more advanced methods than I use here. As I said above, structured metadata holds a lot more interest and potential. I'm only trying to see if there is some kind of JSON response as the simplest thing to deal with. With the requests package, I can add in an Accept header that will still give me text/html content to work with if my Accept header is not accepted. For now, I just use a try block to see if I can shove the JSON content into my data structure. It's crude, but effective for our immediate purpose.
* For the structured metadata part, I've found the [extruct](https://github.com/scrapinghub/extruct) package from the ScrapingHub folks to be one of the most reliable, but it doesn't deal with quite all of the derivations folks have employed on embedding structured metadata in web pages. In this instance, I just try for anything we can get to evaluate for use.
* The final thing I threw in here, as something of a Hail Mary, is a basic web page meta scraper. It parses HTML content using the Python BeautifulSoup package and returns the page title and any named meta tags with content. This can sometimes yield a reasonable description depending on a lot of factors. This is a really terrible way to try and go about things for any kind of consistency given the vagaries of content management systems, legacy content, and all kinds of factors. But it's some other stuff to look at. I put this into a function that could be built upon further if it turns out this is actually a reasonable source to think about. There are also some more robust alternatives to this like the ScrapingHub AutoExtract API that we could think about.
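
As a possible refinement of the content negotiation step described above, we could look at the Content-Type header that comes back instead of relying only on a try block around JSON parsing. The following is a rough sketch of that idea, not what the notebook code actually does. The `classify_content_type` function and the `ACCEPT_HEADER` value are my own illustrative assumptions; with the requests package, the header value would be available as `r.headers.get("Content-Type")`.

```python
# Assumed Accept header: prefer JSON but tell the server HTML is acceptable too
ACCEPT_HEADER = "application/json, text/html;q=0.9"


def classify_content_type(content_type):
    """Return 'json', 'html', or 'other' for a Content-Type header value."""
    if not content_type:
        return "other"
    # Strip parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json" or media_type.endswith("+json"):
        return "json"
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    return "other"


print(classify_content_type("application/json;charset=UTF-8"))
print(classify_content_type("text/html; charset=utf-8"))
```

Knowing up front whether we got JSON or HTML would let us skip the structured metadata and meta tag steps for responses that are not web pages at all.
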

In [4]:
def meta_scraper(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    meta_content = dict()
    
    if soup.title is not None:
        meta_content["title"] = soup.title.string
    
    for meta in soup.find_all("meta"):
        # Tags without a name attribute end up under the empty-string key
        metaname = meta.get('name', '')
        # Tags without a content attribute are skipped
        metacontent = meta.get("content")
        if isinstance(metaname, str) and isinstance(metacontent, str) and len(metacontent.strip()) > 0:
            meta_content[metaname] = metacontent.strip()

    return meta_content


def grab_stuff(link):
    eval_result = {
        "url": link
    }
    
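    # Ask for JSON; servers that ignore the Accept header will still return HTML.
    # The timeout keeps one unresponsive server from hanging the whole run.
    try:
        r = requests.get(link, headers={"Accept": "application/json"}, timeout=60)
    except Exception as e:
        eval_result["error_condition"] = e
        return eval_result

    # r.json() raises a ValueError subclass when the body is not JSON
    try:
        eval_result["json_response"] = r.json()
    except ValueError:
        eval_result["json_response"] = None

    # extruct can fail in several ways on malformed markup, so catch broadly
    try:
        eval_result["structured_data"] = extruct.extract(r.text, base_url=get_base_url(r.text, r.url))
    except Exception:
        eval_result["structured_data"] = None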
    
    eval_result["meta_content"] = meta_scraper(r.text)
    
    return eval_result


Running this kind of process in a big loop isn't a great method for some eventual production application. There are all kinds of considerations in doing this kind of thing: having our robots be polite web crawlers, not freaking system administrators out with too many "weird" requests, and optimizing the process to check back in routinely over time for updated information. We could certainly parallelize this in a number of ways, from launching Lambdas on the cloud to multithreading, and that would pull together data nice and quick. For now, working the relatively small number of links through in a loop is fine for demonstration and evaluation purposes.
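
For a sense of what a politer, multithreaded version might look like, here is a rough sketch that works different hosts in parallel but only ever sends one request at a time to any single host, with a pause in between. The `group_by_host` and `crawl_politely` functions, the one-second default delay, and the worker count are all assumptions of mine, and the stand-in fetch function just lets the sketch run without touching the network.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def group_by_host(urls):
    """Group URLs by host name so each host can be worked one URL at a time."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc.lower()].append(url)
    return dict(groups)


def crawl_politely(urls, fetch, delay_seconds=1.0, max_workers=4):
    """Call fetch(url) for every URL.

    Hosts are processed in parallel threads, but URLs on the same host are
    fetched sequentially with a pause between requests.
    """
    def work_one_host(host_urls):
        results = []
        for position, url in enumerate(host_urls):
            if position > 0:
                time.sleep(delay_seconds)
            results.append(fetch(url))
        return results

    groups = group_by_host(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_host = list(pool.map(work_one_host, groups.values()))
    # Flatten the per-host lists back into one list of results
    return [result for host_results in per_host for result in host_results]


# Stand-in for the real fetch function so this runs offline
def fake_fetch(url):
    return {"url": url}


demo_urls = [
    "https://a.example.org/one",
    "https://b.example.org/one",
    "https://a.example.org/two",
]
demo_results = crawl_politely(demo_urls, fake_fetch, delay_seconds=0)
print(len(demo_results))
```

In this notebook, the call would look something like `crawl_politely(unique_links, grab_stuff)`. Note that results come back grouped by host rather than in the original order of the list.
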

Note: If you are fiddling with this and want to see what things look like before trying every URL, add something like ```[:5]``` to the end of the "unique_links" list in the for loop to only run a handful of URLs through the process.

In [5]:
%%time
link_eval_results = list()
for link in unique_links:
    link_eval_results.append(grab_stuff(link))

CPU times: user 34.4 s, sys: 918 ms, total: 35.3 s
Wall time: 6min 6s


Now's the fun part: analyzing the data and seeing if there's anything we want to use. Well, it would be fun, except our results aren't actually all that consistent or robust.

We have very few cases where we were able to pull any useful structured metadata in at all, so our most promising route for a consistent and powerful method is stymied by the fact that most of the systems behind these links haven't implemented that method. As a little bit of a side tangent, someone really ought to be encouraging this type of thing across the USGS and perhaps leading by example. Beyond supporting what we're trying to do here, these methods would go a long way to improving how USGS content presents itself on the web, how we might influence search rankings, and how we might influence the various knowledge graph efforts toward recognizing USGS as an authority on some subjects.

Content negotiation doesn't give us a whole lot of results either with a couple of notable exceptions. Somewhat by design, anything that points at a DOI link should respond to some type of accept header that will give us DOI metadata as a response.
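
As a sketch of what asking a DOI for its own metadata could look like, the following builds a request with a citation-oriented Accept header instead of the generic `application/json` used above. The CSL JSON media type is one that the Crossref and DataCite content negotiation services document, but whether a given DOI honors it depends on its registration agency, so treat that as an assumption. The `build_doi_request` and `fetch_doi_metadata` function names are mine, and the network call is defined but not executed here.

```python
import json
import urllib.request

# Media type for Citation Style Language JSON (assumed to be supported by the
# DOI's registration agency)
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_doi_request(doi_url, media_type=CSL_JSON):
    """Build (but do not send) a request that asks a DOI resolver for metadata."""
    return urllib.request.Request(doi_url, headers={"Accept": media_type})


def fetch_doi_metadata(doi_url, timeout=30):
    """Send the request and parse the response body as JSON."""
    with urllib.request.urlopen(build_doi_request(doi_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_doi_request("https://doi.org/10.5066/F7ZC8239")
print(request.full_url, request.get_header("Accept"))
```

With a plain `application/json` header, the DOI in the sample output above handed back the ScienceBase item JSON from the landing page rather than DOI metadata, so the choice of media type matters for what we get.
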

As clunky as it is, meta tag scraping may still be a viable option to try to work from (if we can avoid encouraging bad behavior). Meta tags are really not designed for modern robots as they do not really have capacity for explicit semantics to understand what the intent of the meta tags should be. It's all a matter of convention and usage within a particular context, and that has to be unraveled and dealt with in some fashion.
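
To illustrate the kind of convention unraveling this implies, here is a small sketch that picks a candidate description out of a scraped meta tag dictionary. The list of tag names and the "longest text wins" rule are purely assumptions for illustration, not an established practice, and `pick_description` is a hypothetical helper. The sample values are taken from the TRIGRS record shown in the output further down.

```python
import re

# Assumed set of meta tag names that conventionally carry a description
DESCRIPTION_KEYS = ("description", "abstract", "dc.description", "twitter:description")


def normalize_text(value):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", value.replace("\xa0", " ")).strip()


def pick_description(meta_content, keys=DESCRIPTION_KEYS):
    """Return the longest non-empty description-like value, or None."""
    lowered = {str(k).lower(): v for k, v in meta_content.items() if isinstance(v, str)}
    candidates = [normalize_text(lowered[k]) for k in keys if k in lowered]
    candidates = [c for c in candidates if c]
    return max(candidates, key=len) if candidates else None


sample_meta = {
    "title": "TRIGRS",
    "description": "For updates, see\xa0TRIGRS source code repositoryReferences",
    "abstract": "A\xa0Fortran Program for Transient Rainfall Infiltration and Grid-Based Regional Slope-Stability Analysis, Version 2.0.",
    "robots": "follow, index",
}
print(pick_description(sample_meta))
```

Even in this one record, the tag named "description" is less useful than the one named "abstract", which is exactly the kind of context-specific convention that makes meta tags hard to rely on.
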

Beyond the information structure, some of the research questions I would want to pursue include the following:

* Can we legitimately use anything we bring back in these processes in a meaningful way to add more depth to our model catalog?
* Can we use titles from any of these sources to provide more than model acronyms or short names in our ScienceBase Items?
* Do any of the descriptions make sense to serve as an abstract for the concept of the model as cataloged, or are they for some specific aspect of the modeling system?
* Does our initial notional way of type classifying web links (e.g., Model Reference Link) help us in determining what information we can use?
* How might the extended information from model related assets like model output data or software code be encoded into the model items to add value without adding confusion?

These tests give a quick rundown of what we found and can look into further. The list of dictionaries could be put into a number of other data structures for further analysis, but I find it easy enough just to work with a little bit of list comprehension at this point. I created a little helper function that incorporates the basic logic on how to tease out the interesting bits to look at.

In [6]:
def get_stuff(result_list, bucket="all"):
    # Names of the syntaxes that extruct uses as keys in its output
    structured_buckets = ("microdata", "json-ld", "opengraph", "microformat", "rdfa")

    if bucket == "all":
        return result_list

    # Top-level keys: keep results where the key exists and holds something
    if bucket in ("json_response", "meta_content"):
        return [i for i in result_list if i.get(bucket) is not None]

    # Structured data: keep results where the given syntax found at least one thing
    if bucket in structured_buckets:
        return [
            i for i in result_list
            if i.get("structured_data") is not None
            and len(i["structured_data"].get(bucket, [])) > 0
        ]

    # Fail clearly instead of hitting an undefined variable on a bad bucket name
    raise ValueError("Unknown bucket: " + str(bucket))



So, to get a look at the data and start thinking through if and how we might use stuff we can pull back from these different sources and methods, here is a quick dump of the numbers of things we have in each category of potentially useful data and a random output of one record in that "bucket."

In [7]:
for bucket in ["all","json_response","microdata","json-ld","opengraph","microformat","rdfa","meta_content"]:
    this_bucket = get_stuff(link_eval_results, bucket)
    print(bucket, len(this_bucket))
    # Only show a random example when the bucket has something in it
    if this_bucket:
        display(random.choice(this_bucket))

all 158


{'url': 'https://www.usgs.gov/software/exploration-and-graphics-river-trends-egret',
 'json_response': None,
 'structured_data': {'microdata': [],
  'json-ld': [],
  'opengraph': [],
  'microformat': [],
  'rdfa': [{'@id': 'https://www.usgs.gov/software/exploration-and-graphics-river-trends-egret#navbar',
    'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]}]},
 'meta_content': {'title': 'Exploration and Graphics for RivEr Trends (EGRET)',
  'viewport': 'width=device-width, initial-scale=1.0',
  '': 'text/html; charset=utf-8',
  'description': 'Exploration and Graphics for RivEr Trends (EGRET) is an R-package for the analysis of long-term changes in water quality and streamflow, including the water-quality method Weighted Regressions on Time, Discharge, and Season (WRTDS).Visit the EGRET GitHub page\xa0for more information and for instructions on how to obtain the software.\xa0 You can also download the EGRET user',
  'abstract': 'Explor

json_response 23


{'url': 'https://doi.org/10.5066/F7ZC8239',
 'json_response': {'link': {'rel': 'self',
   'url': 'https://www.sciencebase.gov/catalog/item/5df121a8e4b02caea0f62f2d'},
  'relatedItems': {'link': {'url': 'https://www.sciencebase.gov/catalog/itemLinks?itemId=5df121a8e4b02caea0f62f2d',
    'rel': 'related'}},
  'id': '5df121a8e4b02caea0f62f2d',
  'title': 'Composite Raster and Divergence Tool',
  'summary': 'Files -FoxCompositeRasterAndDivergenceTool_10_5.esriAddInn this is the compiled ArcGIS AddIn file that can be added ArcGIS 10.5 Overview The Composite Raster and Divergence Tool is an ArcGIS ArcMap add-in developed at UMESC. When using this tool, a user can:  create a date prioritized composite raster from a collection of raster layers create a lookup raster for the composite raster that identifies which input layers were used to create the composite raster create a divergence raster where pixel values represent the divergence from a user specified value  Installation: To install this 

microdata 19


{'url': 'https://pubs.er.usgs.gov/publication/70204764',
 'json_response': None,
 'structured_data': {'microdata': [{'type': 'http://schema.org/ScholarlyArticle',
    'properties': {'name': 'Integrating anthropogenic factors into regional-scale species distribution models — A novel application in the imperiled sagebrush biome',
     'author': ['Juan M. Requena-Mullor',
      'Kaitlin C. Maguire',
      'Douglas Shinneman',
      'T. Trevor Caughlin'],
     'sameAs': 'https://doi.org/10.1111/gcb.14728',
     'description': 'Abstract\n\nSpecies distribution models (SDM) that rely on regional-scale environmental variables will play a key role in forecasting species occurrence in the face of climate change. However, in the Anthropocene, a number of local-scale anthropogenic variables, including wildfire history, land-use change, invasive species, and ecological restoration practices can override regional-scale variables to drive patterns of species distribution. Incorporating these human-i

json-ld 2


{'url': 'https://coastalscience.noaa.gov/research/coastal-change/wemo/',
 'json_response': None,
 'structured_data': {'microdata': [],
  'json-ld': [{'@context': 'https://schema.org',
    '@graph': [{'@type': 'WebSite',
      '@id': 'https://coastalscience.noaa.gov/#website',
      'url': 'https://coastalscience.noaa.gov/',
      'name': 'NCCOS Coastal Science Website',
      'inLanguage': 'en-US',
      'description': 'NCCOS Coastal Science Public Facing Website',
      'potentialAction': [{'@type': 'SearchAction',
        'target': 'https://coastalscience.noaa.gov/?s={search_term_string}',
        'query-input': 'required name=search_term_string'}]},
     {'@type': 'WebPage',
      '@id': 'https://coastalscience.noaa.gov/research/coastal-change/wemo/#webpage',
      'url': 'https://coastalscience.noaa.gov/research/coastal-change/wemo/',
      'name': 'Wave Exposure Model (WEMo) - NCCOS Coastal Science Website',
      'isPartOf': {'@id': 'https://coastalscience.noaa.gov/#website'},
  

opengraph 3


{'url': 'https://github.com/geoflows/D-Claw',
 'json_response': None,
 'structured_data': {'microdata': [{'type': 'http://schema.org/SoftwareSourceCode',
    'properties': {'author': 'geoflows',
     'name': 'D-Claw',
     'about': 'software for granular-fluid flows',
     'keywords': ['Fortran',
      'Python',
      'MATLAB',
      'Makefile',
      'Shell',
      'Assembly',
      'Other'],
     'license': 'https://github.com/geoflows/D-Claw/blob/master/LICENSE',
   {'type': 'http://schema.org/BreadcrumbList',
    'properties': {'itemListElement': [{'type': 'http://schema.org/ListItem',
       'properties': {'url': 'https://github.com/geoflows/D-Claw',
        'name': 'Code',
        'position': '1'}},
      {'type': 'http://schema.org/ListItem',
       'properties': {'url': 'https://github.com/geoflows/D-Claw/issues',
        'name': 'Issues',
        'position': '2'}},
      {'type': 'http://schema.org/ListItem',
       'properties': {'url': 'https://github.com/geoflows/D-Claw/pul

microformat 3


{'url': 'http://regclim.coas.oregonstate.edu/dynamical-downscaling/',
 'json_response': None,
 'structured_data': {'microdata': [],
  'json-ld': [],
  'opengraph': [],
  'microformat': [{'type': ['h-feed'],
    'properties': {'name': ['Regional and Global Climate']},
    'children': [{'type': ['h-entry'], 'properties': {}}]}],
  'rdfa': [{'@id': 'http://regclim.coas.oregonstate.edu/dynamical-downscaling/#masthead',
    'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},
   {'@id': 'http://regclim.coas.oregonstate.edu/dynamical-downscaling/',
    'https://api.w.org/': [{'@id': 'http://regclim.coas.oregonstate.edu/wp-json/'}]},
   {'@id': 'http://regclim.coas.oregonstate.edu/dynamical-downscaling/#main',
    'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#main'}]}]},
 'meta_content': {'title': 'Dynamical Downscaling – Regional and Global Climate',
  'viewport': 'width=device-width, initial-scale=1',
 

rdfa 73


{'url': 'https://www.usgs.gov/software/trigrs',
 'json_response': None,
 'structured_data': {'microdata': [],
  'json-ld': [],
  'opengraph': [],
  'microformat': [],
  'rdfa': [{'@id': 'https://www.usgs.gov/software/trigrs#navbar',
    'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]}]},
 'meta_content': {'title': 'TRIGRS',
  'viewport': 'width=device-width, initial-scale=1.0',
  '': 'text/html; charset=utf-8',
  'description': 'For updates, see\xa0TRIGRS source code repositoryReferences',
  'abstract': 'A\xa0Fortran Program for Transient Rainfall Infiltration and Grid-Based Regional Slope-Stability Analysis, Version 2.0.',
  'robots': 'follow, index',
  'generator': 'Drupal 7 (http://drupal.org)',
  'msapplication-TileColor': '#008457',
  'msapplication-TileImage': '/sites/all/themes/usgs_palladium/favicons/mstile-144x144.png',
  'msapplication-config': '/sites/all/themes/usgs_palladium/favicons/browserconfig.xml',
  'theme-color': '#0

meta_content 158


{'url': 'https://pubs.er.usgs.gov/publication/tm6A52',
 'json_response': None,
 'structured_data': {'microdata': [{'type': 'http://schema.org/ScholarlyArticle',
    'properties': {'name': 'SUTRA, a Model for Saturated-Unsaturated, Variable-Density Groundwater Flow with Solute or Energy Transport—Documentation of Generalized Boundary Conditions, a Modified Implementation of Specified Pressures and Concentrations or Temperatures, and the Lake Capability',
     'author': ['Alden M. Provost', 'Clifford I. Voss'],
     'sameAs': 'https://doi.org/10.3133/tm6A52',
     'description': 'Abstract\n\nFirst posted August 21, 2019\n\nFor additional information, contact:\n\nDirector, Earth System Processes Division\nU.S. Geological Survey\nMail Stop 411\n12201 Sunrise Valley Drive\nReston, VA 20192\n\nContact Pubs Warehouse\n\nVersion 3.0 of the SUTRA groundwater modeling program offers three new capabilities: generalized boundary conditions, a modified implementation of specified pressures and conc

To unwind all this and get back to the original model item that contained a URL that we evaluated, we would go look for that URL in our original models data structure that we created at the beginning. Once we get to this point and have some data to feed back, we will look at more elegant ways of dealing with this. The following code simply picks a random evaluated link item in a particular category and finds the first corresponding model item where that link came from. There are better ways of stitching this all together. 

In [8]:
# Pick a random evaluated link that yielded microdata
microdata_items = get_stuff(link_eval_results, "microdata")
example_link_item = random.choice(microdata_items)
display(example_link_item)
# Find the first model item whose web links include that URL
model_item = next((m for m in models if example_link_item["url"] in [l["uri"] for l in m.get("webLinks", [])]), None)
display(model_item)

{'url': 'https://catalog.data.gov/dataset/gflow-model-used-to-characterize-the-groundwater-resources-of-the-great-divide-unit-of-the-cheq',
 'json_response': None,
 'structured_data': {'microdata': [{'type': 'http://schema.org/Dataset',
    'properties': {'name': 'GFLOW model used to characterize the groundwater resources of the Great Divide Unit of the Chequamegon-Nicolet National Forest, Wisconsin',
     'dateModified': 'March 7, 2020',
     'description': 'The model simulates two-dimensional groundwater flow and base flow in streams in the Great Divide Unit of the Chequamegon-Nicolet National Forest using the analytic element program GFLOW (Haitjema, 1995). Significant streams and lakes in the model domain are represented at varying levels of detail as linesink elements fully connected to the groundwater system. The highest level of detail is given to surface water features within the Forest Unit, while greatly simplified features around the perimeter of the model provide a boundary

{'id': '5e8de97782cee42d134687e5',
 'title': 'GFLOW',
 'webLinks': [{'type': 'webLink',
   'typeLabel': 'Web Link',
   'uri': 'https://catalog.data.gov/dataset/gflow-model-used-to-characterize-the-groundwater-resources-of-the-great-divide-unit-of-the-cheq',
   'rel': 'related',
   'title': 'Model Reference Link',
   'hidden': False}]}
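
One of the better ways of stitching this together might be to build a lookup from URL to every model item that references it, done once up front rather than searching the models list for each result. The sketch below assumes the same `models` structure shown above. The `index_models_by_url` function is hypothetical, and the second sample item and its id are invented to show a shared link.

```python
from collections import defaultdict


def index_models_by_url(models):
    """Map each web link URL to the model items (and link titles) that use it."""
    index = defaultdict(list)
    for model in models:
        for web_link in model.get("webLinks", []):
            index[web_link["uri"]].append({
                "id": model["id"],
                "model_title": model.get("title"),
                "link_title": web_link.get("title"),
            })
    return dict(index)


gflow_url = "https://catalog.data.gov/dataset/gflow-model-used-to-characterize-the-groundwater-resources-of-the-great-divide-unit-of-the-cheq"
sample_models = [
    {"id": "5e8de97782cee42d134687e5", "title": "GFLOW",
     "webLinks": [{"uri": gflow_url, "title": "Model Reference Link"}]},
    # Invented second item that points at the same URL
    {"id": "hypothetical-item-id", "title": "Another Model",
     "webLinks": [{"uri": gflow_url, "title": "Model Reference Link"}]},
    # Item with no web links at all
    {"id": "hypothetical-no-links", "title": "Unlinked Model"},
]

url_index = index_models_by_url(sample_models)
print([entry["model_title"] for entry in url_index[gflow_url]])
```

Carrying the web link title along in the index keeps the door open for using link types (e.g., "Model Reference Link") to decide what harvested content is appropriate to write back to which item.
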

### A note about formal, standardized metadata
I can imagine a question coming up about why we would not tie in formal, structured metadata where it exists, hitting things like FGDC CSDGM records and using the more robust metadata found there. It's a valid question and would be worth pursuing. The reasons I did not address that yet in this exploration include the following:

* The web links that are often found, provided, and included so far in our exploratory work usually go to some type of information landing page meant for human consumption. It often takes some additional sleuthing or web scraping or other means to discover a link to a formal metadata record. Interpretive conventions could be applied in our case here to make assumptions that a "catalog.data.gov" link is going to have a formal metadata record, and with a little rewriting in the URL or further API search, we might find it to exploit. However, I'd argue that the best thing to do is for those catalog systems to put all of that metadata content right into one form or the other of landing page embedded structured metadata that can be read by processes like this as well as by humans.
* For right now, we really don't need all that much metadata content to make our model items quite a bit better. Limiting to what we might get through this kind of process keeps things simple and focuses on the highest level metadata elements that we probably care most about at the moment.
* The harder issues to work through have to do with the substance of the content and whether or not it adds value in this context. The context in which metadata are produced matters immensely in many cases. Metadata produced in an FGDC CSDGM record for releasing a product is done for one purpose, while metadata put into an advertisement page for that product online may be quite different. At this stage, the metadata advertising the product is probably the most useful context for our immediate purposes.