This notebook is a first cut at what annotation might look like within the GeoKB. It addresses the following basic use case:

* Mineral assessment geologists want to query across all publications to pull up a "portfolio" of USGS Mineral Resource Assessments (MRAs) and add value-added annotation that helps them leverage that suite of tools.

The annotation in this case came as a spreadsheet containing references to individual publications, along with the start of a couple of kinds of value-added annotation. The spreadsheet is not yet complete, but it was enough to work through what could be a reasonable methodology.

In [1]:
import pandas as pd
from wbmaker import WikibaseConnection

In [2]:
geokb = WikibaseConnection("GEOKB_CLOUD")

In [None]:
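# GeoKB item whose talk page houses the source table of MRA publications
source_qid = "Q147702"
# GeoKB class item used to classify a publication as a mineral resource assessment
mra_class_qid = "Q152682"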

# Annotation Source
We have a number of issues in how we capture annotation about our information assets. Our catalogs of things (Pubs Warehouse, Science Data Catalog, etc.) are essentially closed, one-way systems. They start and end with the original metadata and never really change from that point. When we have something to add in practical use, like classification terms or notes, we end up doing that elsewhere. The "elsewhere" almost never sees the light of day, meaning no one outside whoever did the work (or their small group) gets to take advantage of it. We have sometimes addressed this with another layer in the logical architecture, though that layer is often only loosely tied to the source catalogs (e.g., MRData has records about many USGS publications with enhanced metadata that connects them logically to our work).

There are any number of ways we could handle this idea of science teams capturing annotation as part of their work and then feeding that back into some usable form. A potentially better approach here would be to build a bibliography of MRA publications in Zotero, using collection organization, tags, and other forms of annotation within that system. We can then read from Zotero and pull into the knowledge organization system.
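
As a rough sketch of that alternative, the snippet below flattens Zotero items into simple table rows that could later be matched against the knowledgebase. It assumes the item structure returned by the Zotero Web API (a `data` object carrying `title`, `DOI`, `url`, and `tags`). The helper name `zotero_items_to_rows` and the sample record are hypothetical, and actually fetching the items from a Zotero library is left out.

```python
def zotero_items_to_rows(items):
    """Flatten Zotero Web API item dicts into simple rows (hypothetical helper)."""
    rows = []
    for item in items:
        data = item.get("data", {})
        rows.append({
            "zotero_key": item.get("key"),
            "title": data.get("title"),
            # Zotero uses empty strings for unset fields; treat those as missing
            "DOI": data.get("DOI") or None,
            "URL": data.get("url") or None,
            # Tags arrive as a list of {"tag": "..."} objects
            "tags": sorted(t["tag"] for t in data.get("tags", []) if "tag" in t),
        })
    return rows


# Made-up record for illustration only; not a real publication
sample_items = [{
    "key": "ABCD1234",
    "data": {
        "title": "Example assessment report",
        "DOI": "10.3133/example",
        "url": "",
        "tags": [{"tag": "porphyry copper"}, {"tag": "MRA"}],
    },
}]

rows = zotero_items_to_rows(sample_items)
print(rows[0]["tags"])
```

Rows in this shape could be loaded into a dataframe and matched to GeoKB items by DOI in the same way as a table-based source.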

In this case, I decided to experiment with a different approach. Since someone had already done the work of putting things together in a table, I thought we'd just start there. One problem was that we needed a place to surface the table so that it's something we can actually reference and tap into. One fairly simple way to do that is to establish an item in the knowledgebase, house the table on that item's wiki talk page, and then read it as an HTML table. This also fits a design principle we've adopted, where every claim points to an item in the GeoKB as its reference source (using the predicates "data source" and "knowledge source" for now).

In [None]:
source_tables = pd.read_html(f"https://geokb.wikibase.cloud/wiki/Item_talk:{source_qid}")
hammarstrom_mra = source_tables[0]
hammarstrom_mra.head()

In [None]:
hammarstrom_mra_doi = hammarstrom_mra[hammarstrom_mra['DOI'].notnull()]['DOI'].to_list()
hammarstrom_mra_indexId = hammarstrom_mra[hammarstrom_mra['PW indexId'].notnull()]['PW indexId'].to_list()

In [None]:
from urllib.parse import quote

SPARQL_ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"

def items_by_identifier(prop_id, var_name, values):
    # Find GeoKB items whose identifier property (prop_id) matches any of the given values
    values_clause = " ".join(f'"{v}"' for v in values)
    query = (
        "PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>\n\n"
        f"SELECT ?item ?itemLabel ?{var_name}\n"
        "WHERE {\n"
        f"  ?item wdt:{prop_id} ?{var_name} .\n"
        f"  VALUES ?{var_name} {{ {values_clause} }}\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )
    df = geokb.url_sparql_query(
        sparql_url=f"{SPARQL_ENDPOINT}?query={quote(query, safe='')}",
        output_format="dataframe"
    )
    # The QID is the last path segment of the item URI
    df['qid'] = df['item'].apply(lambda x: x.split('/')[-1])
    return df

# P74 holds the DOI and P114 holds the Pubs Warehouse indexId
doi_items = items_by_identifier("P74", "doi", hammarstrom_mra_doi)
indexId_items = items_by_identifier("P114", "indexId", hammarstrom_mra_indexId)


In [None]:
hammarstrom_doi_matches = pd.merge(
    left=hammarstrom_mra,
    right=doi_items[['qid','doi']].rename(columns={'doi': 'DOI'}),
    how="inner",
    on="DOI"
)

hammarstrom_indexId_matches = pd.merge(
    left=hammarstrom_mra,
    right=indexId_items[['qid','indexId']].rename(columns={'indexId': 'PW indexId'}),
    how="inner",
    on="PW indexId"
)

# A publication can match on both DOI and indexId, so drop the repeated rows
hammarstrom_mra_to_geokb = pd.concat([
    hammarstrom_doi_matches,
    hammarstrom_indexId_matches
], ignore_index=True).drop_duplicates()

In [None]:
refs = geokb.models.References()
refs.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup['knowledge source'],
        value=source_qid
    )
)

refs_pw = geokb.models.References()
refs_pw.add(
    geokb.datatypes.Item(
        prop_nr=geokb.prop_lookup['data source'],
        value="Q54915"
    )
)

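for mra_qid in hammarstrom_mra_to_geokb['qid'].unique():
    item = geokb.wbi.item.get(mra_qid)

    # First existing P1 class value, if any; REPLACE_ALL below would otherwise drop it
    existing_p1 = item.get_json()['claims'].get('P1', [])
    original_class_id = next(
        (c["mainsnak"]["datavalue"]["value"]["id"] for c in existing_p1 if "datavalue" in c["mainsnak"]),
        None
    )

    class_claims = []
    # Re-add the original class with its data source reference,
    # unless it is missing or is already the MRA class
    if original_class_id not in (None, mra_class_qid):
        class_claims.append(
            geokb.datatypes.Item(
                prop_nr="P1",
                value=original_class_id,
                references=refs_pw
            )
        )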

    class_claims.append(
        geokb.datatypes.Item(
            prop_nr="P1",
            value=mra_class_qid,
            references=refs
        )
    )

    item.claims.add(
        class_claims, 
        action_if_exists=geokb.action_if_exists.REPLACE_ALL
    )

    response = item.write(
        summary="Added instance of classification as a mineral resource assessment from Hammarstrom list"
    )
    print(response.id)

# Missing Pubs

In [None]:
# Rows from the source table that matched no GeoKB item by either DOI or indexId
hammarstrom_mra[~hammarstrom_mra['URL'].isin(hammarstrom_mra_to_geokb['URL'])]
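
The rows listed by this cell are table entries with no matching GeoKB item. Some of those misses may come from formatting differences between the spreadsheet and the stored identifiers rather than from publications that are truly absent. Below is a minimal sketch of normalizing identifiers before matching, assuming stray whitespace and trailing periods are the main culprits. The helper name `normalize_identifier` is hypothetical and the sample values are illustrative.

```python
def normalize_identifier(value):
    """Trim whitespace and trailing periods from an identifier (hypothetical helper)."""
    if value is None:
        return None
    cleaned = str(value).strip().rstrip(".").strip()
    # Treat identifiers that end up empty as missing
    return cleaned or None


# Illustrative values showing the kinds of debris a hand-built table can carry
samples = [" ofr89177. ", "b1497.", "sir20165089", "", None]
normalized = [normalize_identifier(s) for s in samples]
print(normalized)
```

Applying something like this to the non-null `DOI` and `PW indexId` values before running the lookups could show whether any of the unmatched rows are only formatting mismatches.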