This notebook works piecemeal through three major aspects of the USGS Thesaurus that we found useful to bring into the GeoKB's knowledge representation. These concepts show up in metadata and other parts of data and information systems throughout the USGS, so linking them in the knowledge graph gives us something to build queries and analytical pathways around. We address the following in sections:

* [scientific methods](https://geokb.wikibase.cloud/wiki/Item_talk:Q152412)
* [academic disciplines](https://geokb.wikibase.cloud/wiki/Item_talk:Q158710) (sciences)
* general topics (introduced as "[observable entities](https://geokb.wikibase.cloud/wiki/Item_talk:Q159046)")

In [1]:
import pandas as pd
from wbmaker import WikibaseConnection
import sqlite3

geokb = WikibaseConnection("GEOKB_CLOUD")

# Thesaurus Source
Here, we pull the thesaurus source from its "raw" SQLite database form (cached locally for convenience), build an ordered lineage of 'code' identifier values for each term, and pull in non-preferred names to use as aliases.

In [2]:
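def get_lineage(code, root):
    """Return the codes from the top-level branch down to `code`, skipping any code in `root`."""
    # A missing code means we have walked past the top of the hierarchy.
    if pd.isna(code):
        return []
    # Look up this term's parent and build the parent's lineage first.
    parent_code = terms.loc[terms['code'] == code, 'parent'].iloc[0]
    parent_lineage = get_lineage(parent_code, root)
    # Root codes are left out so the first element is the top-level branch.
    if code in root:
        return parent_lineage
    return parent_lineage + [code]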

thesauridb = sqlite3.connect("data/thesauri.db")

terms = pd.read_sql_query("SELECT * FROM term", con=thesauridb)
terms['parent'] = terms['parent'].astype('Int64')

root_code = [1]
terms['lineage'] = terms['code'].apply(lambda x: get_lineage(x, root_code))
terms['lineage_root'] = terms['lineage'].apply(lambda x: x[0] if len(x) > 0 else None)
terms['lineage_root'] = terms['lineage_root'].astype('Int64')

terms = terms[terms['lineage_root'].isin([734,1019,1174])].reset_index(drop=True)

nonpref = pd.read_sql_query("SELECT * FROM nonpref", con=thesauridb)

terms = pd.merge(
    left=terms,
    right=nonpref[['code', 'name']].rename(columns={'name': 'aliases'}).groupby('code', as_index=False).agg(list),
    how='left',
    on='code'
)

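# Use the scope note as the description; for notes over 250 characters keep only the first sentence.
terms['description'] = terms['scope'].apply(lambda x: x.split('.')[0] if isinstance(x, str) and len(x) > 250 else x)
# If the result is still over 250 characters, cut it to 247 and append an ellipsis.
terms['description'] = terms['description'].apply(lambda x: f"{x[:247]}..." if isinstance(x, str) and len(x) > 250 else x)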

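To make the lineage logic easier to check, here is a minimal sketch of the same idea on a toy hierarchy. It swaps the recursive DataFrame lookup for an iterative walk over a plain dict, so it is a variant rather than the code used above. The `lineage_for` helper and the `toy_parents` mapping are hypothetical; codes 1, 1174, 24, and 587 and their parent links appear in this notebook's output, while code 9001 is made up to show a deeper branch.

```python
def lineage_for(code, parents, root=(1,)):
    """Walk parent links upward; return codes ordered from the top-level branch down to `code`."""
    lineage = []
    seen = set()  # guards against a cycle in the parent links
    while code is not None and code not in seen:
        seen.add(code)
        if code not in root:
            lineage.append(code)
        code = parents.get(code)
    lineage.reverse()
    return lineage

# Toy hierarchy: child code -> parent code (9001 is an invented code).
toy_parents = {1: None, 1174: 1, 24: 1174, 587: 1174, 9001: 24}

toy_lineages = {code: lineage_for(code, toy_parents) for code in toy_parents}
# The first element of each lineage is the top-level branch, as with 'lineage_root' above.
lineage_root = {code: (path[0] if path else None) for code, path in toy_lineages.items()}
```

Because the root code is dropped from every lineage, filtering on the first element is enough to select whole branches of the thesaurus.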

# GeoKB items mapped to USGS Thesaurus
We already have some items, created individually during testing, with "same as" relationships pointing to USGS Thesaurus 'code' identifiers. We pull those up here so we don't duplicate them. We need to run this again once the new items are established so we can then build out the "subclass of" claims using the parent code values.

In [45]:
same_as_query = """
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel ?same_as ?subclass_of ?subclass_ofLabel
WHERE {
  ?item wdt:P84 ?same_as .
  OPTIONAL {
    ?item wdt:P2 ?subclass_of .
  }
  FILTER (STRSTARTS(STR(?same_as), "https://apps.usgs.gov/thesaurus/"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

same_as_items = geokb.sparql_query(
    query=same_as_query,
    endpoint=geokb.sparql_endpoint,
    output='dataframe'
)

same_as_items['code'] = same_as_items['same_as'].apply(lambda x: int(x.split('=')[-1]))
same_as_items['qid'] = same_as_items['item'].apply(lambda x: x.split('/')[-1])

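The two string splits above depend on the shape of the URLs. As a hedged alternative, the sketch below reads the `code` query parameter by name, so it would not depend on parameter order. The helper names are hypothetical, and the entity URI form used for `example_item` is an assumption about what the SPARQL endpoint returns.

```python
from urllib.parse import urlparse, parse_qs

def thesaurus_code(same_as_url):
    """Return the integer 'code' query parameter, or None if the URL has none."""
    values = parse_qs(urlparse(same_as_url).query).get('code')
    return int(values[0]) if values else None

def qid_from_uri(item_uri):
    """Return the last path segment of an entity URI (the QID)."""
    return item_uri.rstrip('/').rsplit('/', 1)[-1]

# URL pattern taken from the "same as" claims written later in this notebook.
example_url = "https://apps.usgs.gov/thesaurus/term-simple.php?thcode=2&code=1174"
# Assumed entity URI form for a GeoKB item.
example_item = "https://geokb.wikibase.cloud/entity/Q159046"

example_code = thesaurus_code(example_url)
example_qid = qid_from_uri(example_item)
```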
# Stub Items
We initially build basic items with no classification; we then use the parent relationships to add the "subclass of" claims.

In [41]:
missing_items = terms[~terms['code'].isin(same_as_items['code'])].reset_index(drop=True)
print(len(missing_items))
display(missing_items.head())

1


| | code | name | parent | scope | lineage | lineage_root | aliases | description |
|---|---|---|---|---|---|---|---|---|
| 0 | 1174 | topics | 1 | Themes, subjects, and concerns for which USGS ... | [1174] | 1174 | | Themes, subjects, and concerns for which USGS ... |


In [7]:
for index, row in missing_items.iterrows():
    item = geokb.wbi.item.new()
    item.labels.set('en', row['name'])

    if isinstance(row['description'], str):
        item.descriptions.set('en', row['description'])

    if isinstance(row['aliases'], list):
        item.aliases.set('en', row['aliases'])

    item.claims.add(
        geokb.datatypes.URL(
            prop_nr=geokb.prop_lookup['same as'],
            value=f"https://apps.usgs.gov/thesaurus/term-simple.php?thcode=2&code={row['code']}"
        )
    )

    response = item.write(
        summary="Added new research method concept from USGS Thesaurus"
    )
    print(row['name'], response.id)

water column reflectivity Q159545
biological soil crusts Q159546
microplastic contamination Q159547
produced water Q159548
solid industrial waste material Q159549
slag Q159550
coal ash Q159551
PFAS Q159552
acid neutralizing potential Q159553
carbon flux Q159554
thermal maturation Q159555
protected areas Q159556
ecosystem resilience Q159557
fluid migration Q159558
induced seismicity Q159559
ground failure Q159560
bioenergetics Q159561
electromagnetic reflectance and emissivity Q159562
gravity gradient Q159563
energy storage Q159564
geologic energy storage Q159565
hydrocarbon reservoir processes Q159566
carbon mineralization Q159567

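Since each pass through the write loop creates a live item, it can help to inspect what would be written first. The sketch below is a dry run under stated assumptions: `build_stub` is a hypothetical helper that makes no Wikibase calls and only gathers the same label, description, aliases, and "same as" URL that the loop sets. The example row reuses the values displayed for code 1174, including the truncated scope text exactly as shown.

```python
def build_stub(row):
    """Collect the fields the write loop would set, without touching the Wikibase."""
    stub = {
        'label': row['name'],
        'same_as': f"https://apps.usgs.gov/thesaurus/term-simple.php?thcode=2&code={row['code']}",
    }
    # Mirror the isinstance checks in the loop: skip missing (NaN) values.
    if isinstance(row.get('description'), str):
        stub['description'] = row['description']
    if isinstance(row.get('aliases'), list):
        stub['aliases'] = row['aliases']
    return stub

example_row = {
    'code': 1174,
    'name': 'topics',
    'description': 'Themes, subjects, and concerns for which USGS ...',
    'aliases': float('nan'),  # no non-preferred names for this term
}
example_stub = build_stub(example_row)
```

The helper only uses `row['...']` and `row.get(...)`, so it should accept either a plain dict or a pandas Series row.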

# Classification
We are translating the broader/narrower SKOS relationships of USGS Thesaurus concepts into class relationships in the GeoKB. Here, we map the code/parent values from the USGS Thesaurus source to their corresponding QID values in the GeoKB and then build out the "subclass of" claims.

In [52]:
th_parents = pd.merge(
    left=same_as_items[same_as_items['subclass_of'].isnull()][['qid','code']],
    right=terms[['code','parent']],
    how='left',
    on='code'
).drop_duplicates()

parent_mapping = pd.concat([
    same_as_items[['qid','code']].rename(columns={'qid': 'parent_qid', 'code': 'parent'}),
    pd.DataFrame([{
        'parent_qid': 'Q159046',
        'parent': 1174
    }])
])
    
missing_mappings = pd.merge(
    left=th_parents,
    right=parent_mapping,
    how='left',
    on='parent'
).drop_duplicates()

print(len(missing_mappings))
display(missing_mappings)

24


| | qid | code | parent | parent_qid |
|---|---|---|---|---|
| 0 | Q158711 | 24 | 1174 | Q159046 |
| 1 | Q159185 | 587 | 1174 | Q159046 |
| 2 | Q159189 | 594 | 1174 | Q159046 |
| 3 | Q159197 | 628 | 1174 | Q159046 |
| 4 | Q159209 | 692 | 1174 | Q159046 |
| 5 | Q159229 | 769 | 1174 | Q159046 |
| 6 | Q159231 | 774 | 1174 | Q159046 |
| 7 | Q159232 | 775 | 1174 | Q159046 |
| 8 | Q159233 | 777 | 1174 | Q159046 |
| 9 | Q159255 | 841 | 1174 | Q159046 |

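A left merge keeps every child row even when its parent has no item yet, which leaves that row's `parent_qid` empty. The sketch below shows the same code-to-QID mapping with plain dicts and separates resolvable rows from unresolved ones, so that only the former would be written. The `resolve_parents` helper is hypothetical, as are the `Q0` item and the codes 9000 and 9001; the other values come from the table above.

```python
def resolve_parents(children, parent_code_of, qid_of_code):
    """Split (qid, code) pairs into those with a known parent QID and those without."""
    resolved, unresolved = [], []
    for qid, code in children:
        parent_code = parent_code_of.get(code)
        parent_qid = qid_of_code.get(parent_code)
        if parent_qid is None:
            unresolved.append((qid, code, parent_code))
        else:
            resolved.append((qid, parent_qid))
    return resolved, unresolved

# Items still lacking a "subclass of" claim; the last entry is invented.
children = [('Q158711', 24), ('Q159185', 587), ('Q0', 9001)]
# Thesaurus code -> parent code (9001 -> 9000 is invented).
parent_code_of = {24: 1174, 587: 1174, 9001: 9000}
# Thesaurus code -> GeoKB QID for terms that already have items.
qid_of_code = {1174: 'Q159046'}

resolved, unresolved = resolve_parents(children, parent_code_of, qid_of_code)
```

Rows that land in `unresolved` are candidates for a later pass, once their parent items exist.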

In [53]:
for index, row in missing_mappings.iterrows():
    item = geokb.wbi.item.get(row['qid'])
    item.claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup['subclass of'],
            value=row['parent_qid']
        )
    )
    response = item.write(
        summary="Added subclass of relationship"
    )
    print(response.id)

Q158711
Q159185
Q159189
Q159197
Q159209
Q159229
Q159231
Q159232
Q159233
Q159255
Q159265
Q159270
Q159286
Q159102
Q159115
Q159161
Q159172
Q159173
Q159178
Q159334
Q159335
Q159353
Q159358
Q159376
