In order to make better use of our iSAID graph for planning, assessment, and other analyses, we need to reduce the very messy data we have from all our heterogeneous sources down to something digestible and meaningful. We also need to connect the various entities that we have as assets (people, datasets, publications, etc.) to concepts that can be linked to broader sources of meaning and other related concepts so that we can broaden the understanding of what any one entity represents.

A lot of the questions we want to ask of these data come down to a finite set of concepts that cover the subject matters we deal with as an institution. We have many different listings of terms from our various catalogs that describe these subject matters, but they are extremely noisy, not at all uniform, and almost never reference a controlled vocabulary source. Even when they do reference such a source, we are finding that the terms supplied in metadata often cannot be traced back to the asserted vocabulary. We also have many unstructured texts, ranging from simple abstracts to much longer and more complex documents. These can be tapped, but they require some type of natural language processing to extract useful structure and concepts that can be traced to a source of definition and linked to other concepts.

To approach this problem, we are using named entity recognition (NER). We start with a rule-based approach to establish a training dataset and follow it with a machine learning process, applied both to unstructured descriptive text and to structured terms whose source and definition cannot readily be identified. We can then tie the labels we develop through NER processing back to their original source, examine newly discovered values for those labels for applicability to our needs, and feed confirmed subject matters with solid references back into our graph for use.
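
As a rough illustration of the rule-based first pass, the sketch below matches a small list of reference labels against free text using word-boundary regular expressions, preferring longer labels over their substrings. The function name `match_reference_terms` and the sample records are hypothetical, and the actual pipeline may well rely on a dedicated NER library rather than plain regular expressions.

```python
import re

def match_reference_terms(text, terms):
    """Return (label, concept_label, start, end) for each reference term found in text."""
    # Try longer labels first so multi-word labels win over their substrings
    ordered = sorted(terms, key=lambda t: len(t["label"]), reverse=True)
    matches = []
    claimed = []  # character spans already assigned to a longer label

    for term in ordered:
        pattern = re.compile(r"\b" + re.escape(term["label"]) + r"\b", re.IGNORECASE)
        for m in pattern.finditer(text):
            start, end = m.start(), m.end()
            # Skip matches that overlap a span already claimed
            if any(start < e and s < end for s, e in claimed):
                continue
            claimed.append((start, end))
            matches.append((term["label"], term["concept_label"], start, end))

    # Report matches in reading order
    return sorted(matches, key=lambda x: x[2])

# Hypothetical reference records, shaped like the ones built in this notebook
sample_terms = [
    {"label": "sea level", "concept_label": "EXAMPLE_TOPIC"},
    {"label": "sea level rise", "concept_label": "CLIMATE_CHANGE_TERM"},
    {"label": "quartz", "concept_label": "MINERAL_SPECIES"},
]
sample_text = "Quartz sand records past sea level rise along the coast."
found = match_reference_terms(sample_text, sample_terms)
```

Matches produced this way keep the `concept_label` of the reference record, which is what lets a later machine learning step be trained on labels that remain traceable to a source.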

This notebook deals with the process of building our usable reference library. There is no single point of truth or overarching ontology that is going to meet all our needs. There are instead several different sources that we need to curate into our process, sticking with our overall goal of using software to do all the work so we can set up a continuously operating and improving process. The concepts developed through this notebook become entities within the graph when they are firmly linked to some other entity through rule-based or confirmed machine learning named entity recognition. An overarching goal for these concepts is that every one carries at least some type of URL pointing to where further information about the source can be found. Not all of the concept references that we need to describe our work exist yet in robust forms, such as ontologies that yield full semantic functionality, so not all of the linkages assembled are actionable in the same way.

In [1]:
import isaid_helpers
import click
import os
import pickle
import pandas as pd
import datetime
import copy
from collections import Counter
from pylinkedcmd import wikidata
import sqlite3
import zipfile
import requests

# USGS Thesaurus Sources
The USGS Thesaurus is one of our primary sources for a controlled vocabulary reference point. It has been designed specifically to address USGS subject matters and is in broad use across many of our information systems as the go-to point of consultation when picking keywords to describe things like datasets and projects. The USGS Thesaurus is built to the SKOS specification, meaning that it has an inherent hierarchy and the concepts of preferred and non-preferred terms. Much of the Thesaurus contains scope notes that provide usable descriptions to help understand the meaning and intent of the terms. There is a reasonable API-style interface to much of the USGS Thesaurus, a way to construct resolvable identifiers, and a structured download via a SQLite database that we take advantage of here. The USGS Thesaurus is actually a family of different controlled vocabularies, including a large set of place names, that we are pulling together to use in various ways throughout our knowledge graphing exercise.

The major functionality in the following processing codeblocks will be formalized at some point into the pyLinkedCMD package once we settle on the form and structure of the vocabulary reference material. The first codeblock establishes a connection to the thesaurus database with an optional download from original source.

In [2]:
%%time
if click.confirm('Do you really really need to download the USGS Thesaurus DB from source?', default=True):
    r = requests.get("https://apps.usgs.gov/thesaurus/download/thesauri.zip", stream=True)
    if r.status_code == 200:
        with open(isaid_helpers.f_usgs_thesaurus_source, 'wb') as f:
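            # Stream in larger chunks; iterating the response directly yields 128-byte pieces
            for chunk in r.iter_content(chunk_size=8192):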
                f.write(chunk)
    if os.path.exists(isaid_helpers.f_usgs_thesaurus_source):
        with zipfile.ZipFile(isaid_helpers.f_usgs_thesaurus_source, 'r') as f:
            f.extractall(isaid_helpers.local_cache_path_rel)

con_thesaurus = sqlite3.connect(f"{isaid_helpers.local_cache_path_rel}thesauri.db")
df_thesaurus = pd.read_sql_query("SELECT * from thesaurus", con_thesaurus)    

Do you really really need to download the USGS Thesaurus DB from source? [Y/n]: n
CPU times: user 120 ms, sys: 37.9 ms, total: 158 ms
Wall time: 3.04 s


In the following codeblock, we pull each individual thesaurus reference from an index table and then read each one into memory for further processing. The USGS Thesaurus itself needs special handling: its hierarchy is rooted in a handful of top-level categories, and we set each top-level category up as its own distinct set of terms.

In [3]:
df_usgs_thesaurus_terms = pd.read_sql_query("SELECT * FROM term", con_thesaurus)
df_usgs_thesaurus_terms = df_usgs_thesaurus_terms.where(pd.notnull(df_usgs_thesaurus_terms), None)
d_terms = df_usgs_thesaurus_terms.to_dict(orient="records")

def category_parent(code):
    code_item = next((i for i in d_terms if i["code"] == code), None)
    if code_item is None:
        return

    if code_item["parent"] is None:
        return
    
    if code_item["parent"] == 1:
        return
    
    next_parent_code = code_item["parent"]
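    while True:
        parent_item = next((i for i in d_terms if i["code"] == next_parent_code), None)
        if parent_item is None or parent_item["parent"] is None:
            # Broken chain: an ancestor is missing or has no parent, so no category applies
            return
        if parent_item["parent"] == 1:
            break
        next_parent_code = parent_item["parent"]

    # parent_item is now the top-level category (its parent is the root, code 1)
    return parent_item["name"]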

all_terms = list()

for index, row in df_thesaurus.iterrows():
    if row["name"] != "USGS Thesaurus":
        try:
            df = pd.read_sql_query(f"SELECT * FROM {row.tblname}", con_thesaurus)
            df["thesaurus_name"] = row["name"]
            df["thesaurus_id"] = row["tag"]
            df["category"] = None
            d = df.to_dict(orient="records")
            all_terms.extend(d)
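        except Exception as e:
            # Report thesauri whose table cannot be read instead of silently skipping them
            print(f"Skipping thesaurus '{row['name']}': {e}")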
    else:
        for term in d_terms:
            if term["parent"] is not None:
                term["thesaurus_name"] = "USGS Thesaurus"
                term["thesaurus_id"] = 2
                term["parent"] = int(term["parent"])
                term["category"] = category_parent(term["code"])
            if "category" in term and term["category"] is not None:
                all_terms.append(term)
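
The walk up the hierarchy performed by `category_parent` rescans the full term list for every ancestor. The sketch below shows the same idea with a dictionary index keyed on `code`, run on made-up sample rows. It assumes, as the code above does, that top-level categories are the terms whose `parent` is 1; the name `build_category_lookup` is hypothetical and is not part of pyLinkedCMD.

```python
def build_category_lookup(terms, root_code=1):
    """Map each term code to the name of its top-level category (or None)."""
    by_code = {t["code"]: t for t in terms}

    def top_level_ancestor(code):
        item = by_code.get(code)
        # The root and the top-level categories themselves get no category
        if item is None or item["parent"] is None or item["parent"] == root_code:
            return None
        seen = {code}
        parent = by_code.get(item["parent"])
        while parent is not None and parent["code"] not in seen:
            if parent["parent"] == root_code:
                return parent["name"]
            seen.add(parent["code"])
            parent = by_code.get(parent["parent"])
        # Broken chain (missing ancestor) or a cycle in the parent links
        return None

    return {t["code"]: top_level_ancestor(t["code"]) for t in terms}

# Made-up rows shaped like records from the `term` table
sample_terms = [
    {"code": 1, "name": "USGS Thesaurus", "parent": None},
    {"code": 2, "name": "topics", "parent": 1},
    {"code": 3, "name": "water resources", "parent": 2},
    {"code": 4, "name": "groundwater", "parent": 3},
    {"code": 5, "name": "orphan", "parent": 99},
]
categories = build_category_lookup(sample_terms)
```

With an index like this, each lookup costs one dictionary access per ancestor rather than a scan of every term, which matters once the term table grows into the thousands.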
        

In this code block, we prepare the final structure of terms across all thesauri of the USGS Thesaurus "family," adding in a little bit of specific classification for use in NER processes.

In [4]:
def add_url(thesaurus_id, code):
    return f"https://apps.usgs.gov/thesaurus/term-simple.php?code={code}&thcode={thesaurus_id}"

def concept_mapping(term):
    category_concepts = {'topics': "USGS_SCIENCE_TOPICS",
         'methods': "USGS_SCIENTIFIC_METHODS",
         'sciences': "USGS_SCIENTIFIC_DISCIPLINES",
         'product types': "USGS_PRODUCT_TYPES",
         'time periods': "USGS_GEOLOGIC_TIME_PERIODS",
         'institutional structures and activities': "USGS_INSTITUTIONAL_STRUCTURES_ACTIVITIES",
         'USGS business categories': "USGS_BUSINESS_CATEGORIES"}
    thesaurus_concepts = {'Common geographic areas': "USGS_COMMON_GEOGRAPHIC_AREAS",
         'Alexandria Digital Library Feature Type Thesaurus': "ALEXANDRIA_DIGITAL_LIBRARY_FEATURE_TYPES",
         'Lithologic classification of geologic map units': "USGS_LITHOLOGY",
         'ISO 19115 Topic Category': "ISO_19115_TOPICS",
         'Marine Realms Information Bank (MRIB) keywords': "MARINE_REALMS_INFORMATION_BANK",
         'Coastal and Marine Ecological Classification Standard': "CMECS"}
    
    if "category" in term and term["category"] is not None:
        return category_concepts[term["category"]] if term["category"] in category_concepts else None
    
    return thesaurus_concepts[term["thesaurus_name"]] if term["thesaurus_name"] in thesaurus_concepts else None
    
usgs_thesaurus_terms = list()
for term in all_terms:
    usgs_thesaurus_terms.append({
        "_date_cached": str(datetime.datetime.utcnow().isoformat()),
        "source": term["thesaurus_name"],
        "source_reference": f"https://apps.usgs.gov/thesaurus/thesaurus.php?thcode={term['thesaurus_id']}",
        "url": add_url(term["thesaurus_id"], term["code"]),
        "label": term["name"],
        "concept_label": concept_mapping(term),
        "description": term["scope"] if term["scope"] is not None and len(term["scope"]) > 20 else None
    })
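
Since the stated goal is for every concept to carry some type of URL, a quick coverage check per source can flag where that goal is not yet met. The following is a minimal sketch on hypothetical sample records; `url_coverage` is an illustrative helper name, and it only tests for the presence of a `url` value, not whether the address actually resolves.

```python
from collections import Counter

def url_coverage(terms):
    """Count, per source, how many terms have a `url` value and how many do not."""
    coverage = {}
    for t in terms:
        counts = coverage.setdefault(t["source"], Counter())
        counts["with_url" if t.get("url") else "without_url"] += 1
    return coverage

# Hypothetical records in the simplified structure used for the reference vocabulary
sample_records = [
    {"source": "USGS Thesaurus", "label": "groundwater", "url": "https://example.org/term/1"},
    {"source": "USGS Thesaurus", "label": "aquifers", "url": "https://example.org/term/2"},
    {"source": "EPA Climate Change Glossary", "label": "Albedo"},
]
coverage = url_coverage(sample_records)
```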

# Wikidata
Wikidata is arguably the most far-reaching open knowledge base available at this point. It is constantly developed and maintained by librarians and other interested people who key in on one set of concepts or another, encoding and organizing the key facts within that area in the Wikidata model of entities and claims. We are examining where select pockets of knowledge within the very large (94M+ entities) Wikidata corpus can be used effectively as a reference point.

Given the wide-open, volunteer-contributor nature of Wikidata, there are many things in the system that we cannot fully trust for our purposes. However, Wikidata does often contain the results of highly professional work by librarians and other informaticists who have used the platform to capture public domain data in a way that makes it the most robust source for access to those data. In one particularly notable case, Wikidata provides the most robust and "machine-actionable" source for the International Mineralogical Association's periodic publication on mineral species, a source we have a great deal of use for.

We feel we can count on some sources based on the provenance of the entities and claims, the availability and completeness of references and qualifiers on claims, and our ability to examine the history of changes to entities to determine any point where we might lose trust in the veracity of the information. We still have more work to do to improve how we select particular information from Wikidata for operational use, and we will continue to refine the processes encoded into the pyLinkedCMD package itself. In the near term, we are experimenting with the mechanics of gathering information from Wikidata in the most usable form and adding references to our overall reference vocabulary.
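
To make the mechanics concrete, the sketch below builds a SPARQL query for instances of a Wikidata class and converts a standard SPARQL JSON response into records in our simplified term structure. This is an assumption about one workable approach, not a description of how `wikidata.build_wd_reference` is implemented. The helper names are hypothetical, the canned response is hand-written for illustration, and the query would be submitted to the Wikidata Query Service at `https://query.wikidata.org/sparql` with `format=json`.

```python
def build_instance_query(class_qid, limit=100):
    """Build a SPARQL query for English-labelled instances (P31) of a Wikidata class."""
    return f"""
    SELECT ?item ?itemLabel ?itemDescription WHERE {{
      ?item wdt:P31 wd:{class_qid} .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {int(limit)}
    """

def bindings_to_terms(response_json, source, concept_label):
    """Convert SPARQL JSON bindings into simplified reference-term records."""
    terms = []
    for b in response_json["results"]["bindings"]:
        terms.append({
            "source": source,
            "url": b["item"]["value"],
            "label": b["itemLabel"]["value"],
            "concept_label": concept_label,
            # Descriptions are optional in Wikidata, so the binding may be absent
            "description": b.get("itemDescription", {}).get("value"),
        })
    return terms

# Q11344 is the Wikidata class "chemical element"
query = build_instance_query("Q11344", limit=2)

# Hand-written stand-in for a query service response (illustrative values only)
canned_response = {
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q556"},
         "itemLabel": {"type": "literal", "value": "hydrogen"},
         "itemDescription": {"type": "literal", "value": "chemical element"}},
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q560"},
         "itemLabel": {"type": "literal", "value": "helium"}},
    ]}
}
wd_terms = bindings_to_terms(canned_response, "Wikidata Chemical Elements", "CHEMICAL_ELEMENT")
```

Because the entity URI doubles as the `url`, every record produced this way satisfies the goal of pointing back to a place where more information about the concept can be found.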

In [5]:
%%time
if click.confirm('Are you sure you want to run the process to build the Wikidata reference vocabulary?', default=False):
    existing_wd_ref = None
    if os.path.exists(isaid_helpers.f_wd_reference):
        with open(isaid_helpers.f_wd_reference, "rb") as f:
            existing_wd_ref = pickle.load(f)
    
    #wd_reference = wikidata.build_wd_reference(existing_data=existing_wd_ref)
    wd_reference = wikidata.build_wd_reference()
    # Cache the freshly built reference so later runs can skip the rebuild
    with open(isaid_helpers.f_wd_reference, "wb") as f:
        pickle.dump(wd_reference, f)
else:
    # Reuse the cached reference from a previous run
    with open(isaid_helpers.f_wd_reference, "rb") as f:
        wd_reference = pickle.load(f)

Are you sure you want to run the process to build the Wikidata reference vocabulary? [y/N]: 
CPU times: user 37.6 ms, sys: 19.5 ms, total: 57 ms
Wall time: 1.59 s


# EPA Climate Change Glossary
One of the immediate use cases we are exploring for the iSAID capability is examining capacity in certain areas of climate change science. To do this, we need to work up a set of terms and concepts specific to climate change science that can tease out our topical areas of interest. These can then be combined with other characteristics of scientific assets to examine where scientific capacity may exist to address new questions and challenges. This will continue to evolve, but the following codeblock explores one possibility, using the EPA Climate Change Glossary to add a specific reference point for exploration through named entity recognition processes.

It presents an interesting functional case where we want to essentially scrape a textual source whose key concepts have not been formally recognized in another ontology or knowledge system. There are much more comprehensive sources that we are also exploring, namely the Global Change Master Directory Keywords from NASA and aspects of the SWEET ontology pertinent to this domain. However, both of those sources are much more complex and robust and need additional thought, experimentation, and development for reasonable application in training NER processes.

In the following codeblock, we read in the original source data from a spreadsheet, filter out some terms that are too broad or could introduce ambiguous results in NER processes, and then add terms in the simplified structure we are using for our reference vocabulary.

In [6]:
ignore_terms = ['Adaptation',
 'Adaptive Capacity',
 'Aerosols',
 'Annex I Countries/Parties',
 'Anthropogenic',
 'Atmosphere',
 'Biomass',
 'Biosphere',
 'Borehole',
 'Co-Benefit',
 'Concentration',
 'Earth System',
 'Eccentricity',
 'Ecosystem',
 'Evaporation',
 'Feedback Mechanisms',
 'Forcing Mechanism',
 'Geosphere',
 'Hydrologic Cycle',
 'Hydrosphere',
 'Inundation',
 'Landfill',
 'Latitude',
 'Metric Ton',
 'Mitigation',
 'Municipal Solid Waste',
 'Oxidize',
 'Particulate matter',
 'Parts Per Billion',
 'Parts Per Million by Volume',
 'Parts Per Trillion',
 'Phenology',
 'Photosynthesis',
 'Radiation',
 'Recycling',
 'Reflectivity',
 'Residence Time',
 'Resilience',
 'Respiration',
 'Scenarios',
 'Sensitivity',
 'Short Ton',
 'Sink',
 'Streamflow',
 'Subsiding/Subsidence',
 'Teragram',
 'Vulnerability',
 'Wastewater',
 'Water Vapor',
 'Weather']

df_epa_climate_change_glossary = pd.read_excel("https://sor.epa.gov/sor_internet/registry/termreg/searchandretrieve/glossariesandkeywordlists/search.do;jsessionid=87TsZz9rOm0k-CQjrtbfA43w1rPWKUuBf9N25Wg0XwzV9z4-yqOh!1399359231?details=&d-1342820-e=11&6578706f7274=1&glossaryName=Glossary+Climate+Change+Terms&includeInReport=")
epa_climate_change_glossary = list()
for index, record in df_epa_climate_change_glossary.iterrows():
    if record.Term not in ignore_terms:
        term = {
            "_date_cached": datetime.datetime.utcnow().isoformat(),
            "source": "EPA Climate Change Glossary",
            "source_reference": "https://sor.epa.gov/sor_internet/registry/termreg/searchandretrieve/glossariesandkeywordlists/search.do?details=&vocabName=Glossary%20Climate%20Change%20Terms",
            "label": record.Term,
            "concept_label": "CLIMATE_CHANGE_TERM",
            "identifier": f"epa_climate_change_glossary_{index}",
            "description": record.Definitions,
            "label_source": "preferred"
        }
        epa_climate_change_glossary.append(term)

        if isinstance(record.Acronym, str):
            alt_term = copy.copy(term)
            alt_term["label"] = record.Acronym
            alt_term["label_source"] = "acronym"
            alt_term["identifier"] = f'{term["identifier"]}_alt'
            epa_climate_change_glossary.append(alt_term)


# Summary
The end result of the processes outlined and executed here is a simple file containing an array of terms that are suitable for entry into our graph as DefinedSubjectMatter entities, linked to other entities via some type of relationship (usually an "ADDRESSES_SUBJECT" relationship) depending on the origin of the claim. While some of these terms may have started as a property attached directly to an entity such as a dataset, bringing them into our graph as their own type of entity lets us relate many entities to the same concept and establish a relationship to the reference concept that is more sophisticated than a simple keyword in a list. We are essentially treating these as claims/statements in the Wikidata model: on every relationship to one of these terms we try to establish at least a date qualifier and a reference, along with some other type of qualifier, particularly for relationships derived through uncertain NER processes.
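
A relationship modeled this way might be assembled as in the sketch below. The field names (`qualifiers`, `method`, `confidence`), the helper `build_subject_claim`, and the sample identifiers are all hypothetical; this only illustrates the idea of attaching a date qualifier, a reference, and an extra qualifier to each link, and is not the graph schema actually in use.

```python
from datetime import datetime, timezone

def build_subject_claim(entity_id, term, method, confidence=None):
    """Assemble a relationship record linking an entity to a reference term."""
    claim = {
        "subject": entity_id,
        "relationship": "ADDRESSES_SUBJECT",
        # Prefer a resolvable URL; fall back to a local identifier when none exists
        "object": term.get("url") or term.get("identifier"),
        "qualifiers": {
            "date": datetime.now(timezone.utc).isoformat(),
            "method": method,
        },
        "reference": term["source_reference"],
    }
    # Uncertain NER-derived links carry an additional confidence qualifier
    if confidence is not None:
        claim["qualifiers"]["confidence"] = confidence
    return claim

# Hypothetical reference term and entity identifier
sample_term = {
    "label": "groundwater",
    "url": "https://example.org/term/groundwater",
    "source_reference": "https://example.org/thesaurus",
}
claim = build_subject_claim("dataset:123", sample_term, method="ml_ner", confidence=0.87)
```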

The following code block brings our separate source materials together into one reference_terms array and shows a summary of the sources and how the source material was classified for NER processing. We then dump it to a file for later use.

In [7]:
# Copy the list so that extending reference_terms does not also modify wd_reference
reference_terms = list(wd_reference)
reference_terms.extend(epa_climate_change_glossary)
reference_terms.extend(usgs_thesaurus_terms)
display(Counter(i['source'] for i in reference_terms))

display(Counter(i['concept_label'] for i in reference_terms if "concept_label" in i))

Counter({'Wikidata Mineral Species': 10314,
         'Wikidata Chemical Elements': 667,
         'Wikidata Sedimentary Rocks': 91,
         'Wikidata Clastic Sediments': 7,
         'Wikidata Sovereign States': 1409,
         'Wikidata US States': 50,
         'Wikidata Global Seas and Oceans': 258,
         'Wikidata Global Faults': 3102,
         'Wikidata Global Volcanos': 1548,
         'Wikidata Global Earthquakes': 1500,
         'Wikidata US National Parks': 106,
         'Wikidata US National Monuments': 184,
         'Wikidata US National Forests': 221,
         'Wikidata US Wild and Scenic Rivers': 50,
         'Wikidata Geologic Formations': 9299,
         'Wikidata Aquifers': 27,
         'Wikidata Fields of Science': 457,
         'Wikidata Additional Commodities': 10,
         'Wikidata US Territories': 38,
         'Wikidata US Counties': 3108,
         'EPA Climate Change Glossary': 123,
         'Common geographic areas': 66096,
         'USGS Thesaurus': 1151,
       

Counter({'MINERAL_SPECIES': 10314,
         'CHEMICAL_ELEMENT': 667,
         'SEDIMENTARY_ROCK': 91,
         'CLASTIC_SEDIMENT': 7,
         'SOVEREIGN_STATE': 1409,
         'US_STATE': 50,
         'SEA_OR_OCEAN': 258,
         'GEOLOGIC_FAULT': 3102,
         'NAMED_VOLCANO': 1548,
         'NAMED_EARTHQUAKE': 1500,
         'NATIONAL_PARK': 106,
         'NATIONAL_MONUMENT': 184,
         'NATIONAL_FOREST': 221,
         'WILD_AND_SCENIC_RIVER': 50,
         'GEOLOGIC_FORMATION': 9299,
         'NAMED_GROUNDWATER_AQUIFER': 27,
         'FIELD_OF_SCIENCE': 457,
         'GEOLOGIC_COMMODITY_OR_MATERIAL': 10,
         'US_TERRITORY': 38,
         'US_COUNTY': 3108,
         'CLIMATE_CHANGE_TERM': 123,
         'USGS_COMMON_GEOGRAPHIC_AREAS': 66096,
         'USGS_SCIENCE_TOPICS': 570,
         'USGS_SCIENTIFIC_METHODS': 220,
         'USGS_SCIENTIFIC_DISCIPLINES': 87,
         'USGS_PRODUCT_TYPES': 38,
         'USGS_GEOLOGIC_TIME_PERIODS': 39,
         'USGS_INSTITUTIONAL_STRUCTURE

In [8]:
# Write the combined reference vocabulary to disk, closing the file when done
with open(isaid_helpers.f_ner_reference, "wb") as f:
    pickle.dump(reference_terms, f)