Connecting the terms staff use to describe their expertise to the USGS Thesaurus gives us a number of concepts that are better developed than simple keywords: they have definitions, links to related concepts, and some work toward settling on accepted words and phrasing. However, that vocabulary source is also developed entirely within the intramural context of the USGS. On the one hand, that can be a benefit: we are coming at this from the standpoint of our own corporate system of knowledge, and the concepts there should align reasonably well with the other information sources we may be working from. On the other hand, as we've explored so far, many of the terms our staff use to describe their expertise are not found in the USGS Thesaurus today. And, on the whole, there are other formalized knowledge systems beyond our borders, in various states of maturity, that likely hold a great deal of information we can leverage.

One possibility for continuous mining is the ESIP Community Ontology Repository (COR), a resource established and maintained by the quite active ESIP semantic technologies community. It is the logical home for the SWEET ontology, specifically, and for other ontology sources that the community as a whole stewards or that are otherwise interesting for use. As an experiment, I'm attempting a similar semantic evaluation of the subset of USGS staff expertise assertions for CDI members pulled from their profile pages, this time against ontology sources in the ESIP COR. Notionally, I should be able to connect USGS staff to a knowledge network represented in one or more of those ontologies and then exploit that linkage both to flesh out the characterization of our staff and to go after additional sources of information with a broader base of knowledge.

This notebook starts down that path, but I'm quickly running up against a couple of challenges:

* The ESIP COR as it stands does not provide anything like a performant platform for wide open discovery and exploration. SPARQL queries at a certain level of specificity are inherently cumbersome, and trying to look for every triple anywhere with a mention of a certain term is quite slow. I'm going to have to work through what's actually available from the ESIP COR and come up with a more directed search pattern if I'm going to establish anything meaningful.
* Many different types of terms and concepts are represented in even the relatively small number of unique terms (~260) drawn from the quite small number of CDI members with active staff profiles (<60). Some are complete shorthand, lacking enough specificity to even know what type of concept they might represent (e.g., "agile" and "backcountry"). Someone smarter than me could possibly run the combined set of terms for a given person through some kind of text analysis routine and add useful context to direct concept linkage searches more efficiently. I don't know how to do that yet and will need to simply ignore results based on some pattern for now.
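
One form a more directed search pattern might take is to match terms only against label-like predicates instead of every triple. The sketch below only builds the query string; the predicate list, the function name `build_label_query`, and the exact-match rule are my assumptions, and I have not confirmed that this is actually faster against the ESIP COR endpoint.

```python
# Label-like predicates to search. This short list is an assumption for
# illustration, not something taken from the ESIP COR documentation.
LABEL_PREDICATES = [
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2004/02/skos/core#altLabel",
]


def build_label_query(term, predicates=LABEL_PREDICATES, limit=50):
    """Build a SPARQL query matching `term` exactly against label predicates."""
    values = " ".join("<%s>" % p for p in predicates)
    # Escape backslashes and double quotes so the term is a valid SPARQL string.
    safe_term = term.lower().replace("\\", "\\\\").replace('"', '\\"')
    return '''
    select distinct ?subject ?predicate ?object
    where {
      values ?predicate { %s }
      ?subject ?predicate ?object .
      filter (lcase(str(?object)) = "%s")
    }
    limit %d
    ''' % (values, safe_term, limit)


print(build_label_query("biogeochemistry"))
```

The resulting string could be handed to `SPARQLWrapper.setQuery` in place of a wide-open regex query. An exact, case-insensitive label match will miss partial matches, so it trades recall for fewer spurious hits.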

In [1]:
import requests
from sqlite_utils import Database
from SPARQLWrapper import SPARQLWrapper, JSON
from datetime import datetime

db = Database("usgs_profiles.db")

In [2]:
profile_terms = [
    {
        "profile": i["source_identifier"],
        "term": i["term"].lower()
    } for i in db["expertise_terms"].rows_where("term_source = 'USGS Staff Profiles'")
]

expertise_terms = list(set([i["term"] for i in profile_terms]))
expertise_terms.sort()

print("Total number of profile terms:", len(profile_terms))
print("Unique terms for lookup:", len(expertise_terms))

Total number of profile terms: 341
Unique terms for lookup: 260


In [3]:
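def esip_cor_search(term):
    """Search the ESIP COR SPARQL endpoint for triples that mention a term."""
    esip_cor = SPARQLWrapper("http://cor.esipfed.org/sparql")
    esip_cor.setReturnFormat(JSON)
    # Match the term in the final path segment or fragment of a subject URI,
    # or anywhere in an object value, ignoring case. The term is interpolated
    # as-is, so any regex metacharacters in it are read as regex syntax.
    query = '''
    select distinct ?subject ?predicate ?object
    where {
     ?subject ?predicate ?object.
     filter (regex(str(?subject), "%s[^/#]*$", "i")
       || regex(str(?object), "%s", "i"))
    }
    order by ?subject
    ''' % (term, term)

    esip_cor.setQuery(query)
    results = esip_cor.query().convert()

    # Record when the lookup ran alongside the raw result bindings.
    return {
        "term": term,
        "date_tested": datetime.utcnow().isoformat(),
        "esip_cor_results": results["results"]["bindings"]
    }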
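
Some of the expertise terms, such as 'amazon web services (aws)' and 'capturing (animals)', contain parentheses, which a regex filter reads as a grouping rather than as literal characters. A minimal sketch of a helper that escapes a term before it is interpolated into the query follows; the name `sparql_regex_literal` is hypothetical, and the metacharacter set is my reading of the regex syntax SPARQL uses.

```python
import re

# Regex metacharacters to escape so the term is matched literally.
_META = re.compile(r"([\\|.?*+(){}\[\]^$])")


def sparql_regex_literal(term):
    """Return `term` as a literal regex pattern, escaped for a SPARQL string."""
    # First escape regex metacharacters with a single backslash.
    pattern = _META.sub(r"\\\1", term)
    # Then double each backslash and escape double quotes, because the
    # pattern sits inside a double-quoted SPARQL string literal.
    return pattern.replace("\\", "\\\\").replace('"', '\\"')


print(sparql_regex_literal("capturing (animals)"))
# prints: capturing \\(animals\\)
```

Calling `esip_cor_search(sparql_regex_literal(term))` would search for the literal text, though the returned `term` value would then be the escaped form rather than the original.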

In [4]:
expertise_terms

['adcp',
 'aerial imagery',
 'aerial photography',
 'agile',
 'amazon web services',
 'amazon web services (aws)',
 'animals',
 'aquatic ecosystems',
 'aquifer characterization',
 'aquifer testing',
 'arcgis enterprise',
 'arcgis online',
 'arcgis pro',
 'avian influenza',
 'backcountry',
 'benthic ecosystems',
 'biodiversity',
 'biodiversity and ecosystem services',
 'biogeochemistry',
 'biogeography',
 'botany',
 'broad crested weir',
 'cableway',
 'cap',
 'capture-recapture models',
 'capturing (animals)',
 'carbon dynamics',
 'carbonate isotope geochemistry',
 'cartography',
 'climate change',
 'climate model downscaling',
 'cloud computing',
 'co-production',
 'coastal ocean modeling',
 'collaboration',
 'collaborative learning',
 'collections management',
 'computational methods',
 'conflict resolution',
 'contaminant hydrogeology',
 'contamination and pollution',
 'core analysis',
 'crest stage gage program',
 'cryosphere',
 'culvert computation',
 'cyberinfrastructure',
 'dams'

This is a quick run through a small subset of terms for the moment, just to see what we might be able to work with. Even with all the noted limitations and issues, this basic regex-based search on terms does yield some potentially interesting results. Connections to GCMD, afforded by its registration in the ESIP COR, could hint at particular scientific instruments or platforms involved in a method or technique that someone claims as expertise. Having this linkage might provide another bit of evidence in connecting dots to what a given person has produced, information sources they are involved in, organizations with which they are affiliated, or other useful connections. I may also be able to pull additional definitions and synonymous or complementary terms that can be used to flesh out a directional search from the expertise concepts. But now I need to wade through what's available from registered ESIP COR assets to determine what will be most immediately useful.
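
As a first step in seeing which registered sources produce hits, the result bindings can be tallied by the host name of each subject URI. This is a rough sketch that assumes the binding structure returned by the search function in this notebook; the function name `count_subject_hosts` is hypothetical, and a host name is only a crude proxy for a vocabulary source.

```python
from collections import Counter
from urllib.parse import urlparse


def count_subject_hosts(bindings):
    """Count result rows by the host name of each subject URI."""
    hosts = Counter()
    for row in bindings:
        subject = row["subject"]
        if subject["type"] == "uri":
            hosts[urlparse(subject["value"]).netloc] += 1
        else:
            # Blank nodes carry no resolvable source, so bucket them together.
            hosts["(blank node)"] += 1
    return hosts


# Two rows abbreviated from results of the kind shown in this notebook.
sample = [
    {"subject": {"type": "uri",
                 "value": "http://cor.esipfed.org/ont/testorg/st/ADCP"}},
    {"subject": {"type": "bnode", "value": "bC6D748B1x9448"}},
]
print(count_subject_hosts(sample))
```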

As a first look, I'm observing some of the same issues I've dealt with in looking at Wikidata as a source. I want to be able to write reasoners that use a preponderance of evidence, drawn from the characteristics my requirements demand, to determine what things I can use in what circumstances. I want things that are well referenced, potentially from a suite of trusted sources but more importantly with a working, configurable definition of what constitutes trustworthiness, and well documented as to provenance. However, I keep running up against the need to configure some set of hard-coded lists of codes/identifiers that I trust in a particular way. That's totally not scalable or sustainable! I'm also struggling with just how many "subqueries" I need to run to track down all the details about any given concept, often with the need to parse identifiers and add resolvers (sometimes requiring a priori knowledge) and usually involving something more complicated than simple content resolution. There's got to be a better way to wade through this stuff that I'm just not smart enough to know yet.
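
To make the idea of judging by characteristics a little more concrete, here is a minimal sketch that decides whether a subject is usable from the kinds of evidence attached to it rather than from a list of trusted identifiers. Everything in it is an assumption for illustration: the suffix-to-evidence mapping, the function names, and the default requirement of a label plus a definition. It does not address provenance at all.

```python
from collections import defaultdict

# Map predicate URI suffixes to the kind of evidence they provide.
# These pairings are illustrative assumptions, not a vetted rule set.
EVIDENCE_BY_SUFFIX = {
    "#prefLabel": "label",
    "#label": "label",
    "#definition": "definition",
    "#type": "typed",
}


def evidence_by_subject(bindings):
    """Collect the kinds of evidence seen for each named (URI) subject."""
    evidence = defaultdict(set)
    for row in bindings:
        if row["subject"]["type"] != "uri":
            continue  # skip blank nodes, which have no URI to resolve
        predicate = row["predicate"]["value"]
        for suffix, kind in EVIDENCE_BY_SUFFIX.items():
            if predicate.endswith(suffix):
                evidence[row["subject"]["value"]].add(kind)
    return evidence


def usable_subjects(bindings, required=frozenset({"label", "definition"})):
    """Return subjects whose evidence covers every kind in `required`."""
    return sorted(
        subject
        for subject, kinds in evidence_by_subject(bindings).items()
        if required <= kinds
    )
```

The `required` set is the configurable part: a stricter or looser definition of what is usable becomes a change to that argument rather than a new hard-coded list.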

In [5]:
%%time
for term in expertise_terms[:10]:
    display(esip_cor_search(term))
    print("+++++++++++++++++++")

{'term': 'adcp',
 'date_tested': '2020-09-08T21:08:26.385761',
 'esip_cor_results': [{'subject': {'type': 'uri',
    'value': 'http://cor.esipfed.org/ont/testorg/st/ADCP'},
   'predicate': {'type': 'uri',
    'value': 'http://cor.esipfed.org/ont/testorg/st/LongName'},
   'object': {'type': 'literal', 'value': 'acousticDopplerCurrentProfiler'}},
  {'subject': {'type': 'uri',
    'value': 'http://cor.esipfed.org/ont/testorg/st/ADCP'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'},
   'object': {'type': 'uri',
    'value': 'http://cor.esipfed.org/ont/testorg/st/SensorType'}},
  {'subject': {'type': 'uri',
    'value': 'http://cor.esipfed.org/ont/testorg/st/ADCP'},
   'predicate': {'type': 'uri',
    'value': 'http://cor.esipfed.org/ont/testorg/st/Definition'},
   'object': {'type': 'literal',
    'value': 'An ADCP uses Doppler shift of a acoustic signal to calculate water particle velocity at different distances along a beam path.'}},
  {'

+++++++++++++++++++


{'term': 'aerial imagery',
 'date_tested': '2020-09-08T21:08:39.701017',
 'esip_cor_results': [{'subject': {'type': 'uri',
    'value': 'http://resource.geosciml.org/classifier/cgi/mappedfeatureobservationmethod/observed_aerial_imagery'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/2004/02/skos/core#prefLabel'},
   'object': {'type': 'literal',
    'xml:lang': 'en',
    'value': 'observed in aerial imagery'}},
  {'subject': {'type': 'uri',
    'value': 'http://resource.geosciml.org/classifier/cgi/mappedfeatureobservationmethod/observed_aerial_imagery'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/2004/02/skos/core#definition'},
   'object': {'type': 'literal',
    'xml:lang': 'en',
    'value': 'Phenomenon observed and located using aerial imagery (standard visible light)'}}]}

+++++++++++++++++++


{'term': 'aerial photography',
 'date_tested': '2020-09-08T21:08:53.083192',
 'esip_cor_results': [{'subject': {'type': 'uri',
    'value': 'http://gcmdservices.gsfc.nasa.gov/kms/concept/20d7a6a7-1c69-469b-ac53-92078dcb2a67'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/2004/02/skos/core#definition'},
   'object': {'type': 'literal',
    'xml:lang': 'en',
    'value': "The Rockwell Aero Commander (AC-500S) is a versatile and stable high-winged twin piston-engine aircraft that is suitable for a variety of missions. Standard configuration allows for mission equipment and two pilots. However, with the scientific packages removed, seating for five additional passengers may be installed. NOAA's two aero commanders are utilized primarily as aerial survey platforms for visual verification of aeronautical charts, high-resolution aerial photography, and snow water equivalent and soil moisture content measurements. Additionally, the aircraft has been used in biological invest

+++++++++++++++++++


{'term': 'agile',
 'date_tested': '2020-09-08T21:09:06.972776',
 'esip_cor_results': [{'subject': {'type': 'bnode', 'value': 'bC6D748B1x9448'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/2002/07/owl#annotatedTarget'},
   'object': {'type': 'literal',
    'value': 'A mantle lamp is a central-draught lamp which includes a pear-shaped, fabric mesh containing thorium or other rare-earth salts placed over the burner. On first use the fabric burns away and the rare-earth salts are converted to oxides, leaving a very fragile structure which incandesces (glows brightly) upon exposure to the heat of the burner flame.'}},
  {'subject': {'type': 'uri',
    'value': 'http://data.crossref.org/fundingdata/vocabulary/Label-36025'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/2008/05/skos-xl#literalForm'},
   'object': {'type': 'literal', 'value': 'Agilent Foundation'}},
  {'subject': {'type': 'uri',
    'value': 'http://data.crossref.org/fundingdata/vocabulary

+++++++++++++++++++


{'term': 'amazon web services',
 'date_tested': '2020-09-08T21:09:20.359701',
 'esip_cor_results': [{'subject': {'type': 'uri',
    'value': 'http://data.crossref.org/fundingdata/vocabulary/Label-1395504'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/2008/05/skos-xl#literalForm'},
   'object': {'type': 'literal', 'value': 'Amazon Web Services'}}]}

+++++++++++++++++++


{'term': 'amazon web services (aws)',
 'date_tested': '2020-09-08T21:09:34.186967',
 'esip_cor_results': []}

+++++++++++++++++++


{'term': 'animals',
 'date_tested': '2020-09-08T21:09:47.874691',
 'esip_cor_results': [{'subject': {'type': 'bnode',
    'value': 'bC6D748B1x11902'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/2002/07/owl#annotatedTarget'},
   'object': {'type': 'literal',
    'value': 'A planned process that is the movement of people, animals and goods from one location to another.'}},
  {'subject': {'type': 'bnode', 'value': 'bC6D748B1x11780'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/2002/07/owl#annotatedTarget'},
   'object': {'type': 'literal',
    'datatype': 'http://www.w3.org/2001/XMLSchema#string',
    'value': 'A skin obtained from animals for human use typically from deer or cattle sources used to produce leather, shoes, fashion accessories, musical instruments.'}},
  {'subject': {'type': 'bnode', 'value': 'bC6D748B1x11770'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/2002/07/owl#annotatedTarget'},
   'object': {'type': 'liter

+++++++++++++++++++


{'term': 'aquatic ecosystems',
 'date_tested': '2020-09-08T21:10:01.309900',
 'esip_cor_results': [{'subject': {'type': 'uri',
    'value': 'http://gcmdservices.gsfc.nasa.gov/kms/concept/363d283c-b6ba-4c90-852a-365b22b86f7e'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/2004/02/skos/core#definition'},
   'object': {'type': 'literal',
    'xml:lang': 'en',
    'value': '\nSediment Traps are specialized sampling devices, usually\ncylindrical or conical in shape and of variable dimensions,\nwhich is deployed in the ocean to collect sinking particulate\ninorganic and organic matter.  When positioned near the base of\nthe euphotic zone, the downward flux of nitrogen should\napproximate new production in aquatic ecosystems.\n\n[Source: University of California]\n\n\n'}},
  {'subject': {'type': 'uri',
    'value': 'http://gcmdservices.gsfc.nasa.gov/kms/concept/7da95c01-4b39-437e-a8d4-fd572e43f693'},
   'predicate': {'type': 'uri',
    'value': 'http://www.w3.org/2004/02/sk

+++++++++++++++++++


{'term': 'aquifer characterization',
 'date_tested': '2020-09-08T21:10:14.516754',
 'esip_cor_results': []}

+++++++++++++++++++


{'term': 'aquifer testing',
 'date_tested': '2020-09-08T21:10:27.830383',
 'esip_cor_results': []}

+++++++++++++++++++
CPU times: user 93 ms, sys: 19 ms, total: 112 ms
Wall time: 2min 15s
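
The results for "agile" above show the cost of substring matching: the hits come from "fragile" and "Agilent", neither of which has anything to do with the expertise term. One pattern for ignoring such results is a whole-word post-filter on the returned bindings. This is a sketch under my own assumptions (the function name `whole_word_matches` is hypothetical, and it checks only the subject and object values); the third sample row is invented for illustration.

```python
import re


def whole_word_matches(term, bindings):
    """Keep rows whose subject or object contains `term` as a whole word."""
    # The lookarounds reject matches that sit inside a longer word, such as
    # "agile" inside "fragile" or at the start of "Agilent".
    pattern = re.compile(r"(?<!\w)%s(?!\w)" % re.escape(term), re.IGNORECASE)
    return [
        row for row in bindings
        if pattern.search(row["object"]["value"])
        or pattern.search(row["subject"]["value"])
    ]


sample = [
    {"subject": {"value": "bC6D748B1x9448"},
     "object": {"value": "leaving a very fragile structure"}},
    {"subject": {"value": "http://data.crossref.org/fundingdata/vocabulary/Label-36025"},
     "object": {"value": "Agilent Foundation"}},
    # Hypothetical row, not from the ESIP COR, to show a match that is kept.
    {"subject": {"value": "http://example.org/concept/1"},
     "object": {"value": "Agile software development"}},
]
print(len(whole_word_matches("agile", sample)))
```

Filtering after the fact does nothing for query time, so it addresses the noise in the results but not the roughly 13 seconds per term seen in the run above.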
