This notebook is a work-in-progress process for initializing the Geoscience Knowledgebase (GeoKB) with a set of properties and foundational semantics that establish a base to build from in curating geoscientific knowledge. Our initial use cases involve integrating mineral occurrence information, along with the document references that contribute to those and other facts and concepts associated with conducting mineral resource assessments. To build a useful knowledge graph on these concepts, though, we also need to bring in the supporting classes and properties that let claims about these items link legitimately to other items.

For instance, we use "NI 43-101 Technical Reports" and the newer "SK-1300 Technical Report" as sources for claims associated with mining projects/properties, such as geographic location, mineral commodities identified and/or extracted, figures indicating estimates of ore grade and tonnage, and other details. These are technical geoscientific reports required by the Canadian and U.S. governments, respectively. We need an "instance of" (rdf:type) claim on everything like this in the system. While we could simply create items in the graph to represent these two classes with no further classification, it is useful to "work back up the semantic hierarchy" for as many concepts as we can, as far as we need to, so that the information we record in the GeoKB is understandable in the broader global knowledge commons (Wikidata and/or other efforts).
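To make "working back up the semantic hierarchy" concrete, here is a minimal sketch that walks "subclass of" links from a class to its root. The parent links are copied from the Classification table shown later in this notebook; the `SUBCLASS_OF` dictionary and the `ancestors` function are illustrative names, not part of the GeoKB code.

```python
# Parent links taken from the Classification table displayed later in this notebook.
SUBCLASS_OF = {
    "entity": None,
    "document": "entity",
    "report": "document",
    "government report": "report",
    "NI 43-101 Technical Report": "report",
    "USGS Report Series": "government report",
}


def ancestors(label, parents=SUBCLASS_OF):
    """Return the chain of parent classes from `label` up to the root."""
    chain = []
    seen = {label}
    current = parents.get(label)
    while current is not None:
        # Guard against a mistyped sheet that makes a class its own ancestor.
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        chain.append(current)
        seen.add(current)
        current = parents.get(current)
    return chain


print(ancestors("NI 43-101 Technical Report"))
# ['report', 'document', 'entity']
```

A chain that ends at `entity` indicates that the class is connected to the rest of the baseline; an empty chain for anything other than the root suggests a missing "subclass of" value.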

The initialization process here is designed to give us a semantic baseline to work from as we pull in the information and connections we really care about within this context. We're taking a pragmatic approach that is slightly more rigorous (and certainly more streamlined) than the wild west of Wikidata but somewhere short of endlessly academic. We have to get a whole bunch of information into the GeoKB to support analytical use, so we make a best effort to align what we have with mature ontologies and namespaces, knowing we'll have to evolve it over time. The notebook approach on this gives us a good basis to record our reasoning and the places we have to make pragmatic tradeoffs.

In [16]:
import os
import pandas as pd
from datetime import datetime
import swifter

from wikibaseintegrator import WikibaseIntegrator, wbi_login
from wikibaseintegrator.wbi_config import config as wbi_config
from wikibaseintegrator.datatypes import Item

from utils import (
    sparql_query,
    query_by_item_label,
    property_query
)

## Parameters

If we want to run this and other notebooks as Lambdas eventually, we'll need to set up some parameters. I moved the values we want to keep somewhat secret to environment variables as a safety measure.

In [2]:
geokb_init_sheet_id = '1dbuKc4cZJz0YY81B2xWXM5fId6gWgzmQar3hg3CI0Rw'

wbi_config['MEDIAWIKI_API_URL'] = os.environ['MEDIAWIKI_API_URL']
wbi_config['SPARQL_ENDPOINT_URL'] = os.environ['SPARQL_ENDPOINT']
wbi_config['WIKIBASE_URL'] = os.environ['WIKIBASE_URL']

In [29]:
geokb_auth = wbi_login.Login(
    user=os.environ['WB_BOT_NAME'], 
    password=os.environ['WB_BOT_PASS']
)
wbi = WikibaseIntegrator(login=geokb_auth)

## Functions

All or most of these functions should be movable to the abstract Wikibase management python package we are designing. That needs to be applicable to the GeoKB but generic enough to apply in other types of domains and circumstances. There are other communities doing similar work such as the wikidataintegrator project in the health sciences. We just found a need to start from the basics of pywikibot and how it operates.

I realize I'm introducing what could be higher overhead than necessary here with use of Pandas and other specialized packages. We can reevaluate that at the foundational level of the KB management package.

### Label uniqueness

Eventually, we will end up in the same situation that Wikidata is in where we will have multiple items with the same label that are disambiguated by their other attributes (classification being the likely chief distinguishing factor). Key examples on the close horizon will be how we deal with minerals such as gold that can be both a chemical element, as defined in Wikidata now, and a mineral commodity in some contexts. Do we have one item called "gold" that can be an instance of a number of things or multiple items with the same label that need to be disambiguated?

For the immediate future, I'm going to take the approach of constraining the GeoKB to unique labels (including alt labels) and see where we run into problems. The check_item_label() function is something I'm setting up to be run any time we go to create an item so that we ensure uniqueness.

The same dynamic applies with properties as well, though those should be more straightforward as we're striving for clear, unambiguous semantics anyway.
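As a rough illustration of the uniqueness check described above, the sketch below compares a candidate label and its aliases against names already in the knowledgebase. This is not the actual `check_item_label()` function; `find_label_conflicts` and `normalize` are hypothetical names, the `existing` mapping stands in for a query result, and treating labels as case-insensitive is an assumption we may not want in practice.

```python
def normalize(text):
    """Collapse whitespace and ignore case so 'Gold ' and 'gold' collide."""
    return " ".join(str(text).split()).casefold()


def find_label_conflicts(candidate_label, candidate_aliases, existing):
    """Return IDs of existing entities sharing a label or alias with the candidate.

    `existing` maps an entity ID to a list holding its label and alt labels.
    """
    wanted = {normalize(candidate_label)} | {normalize(a) for a in candidate_aliases}
    conflicts = []
    for entity_id, names in existing.items():
        if wanted & {normalize(n) for n in names}:
            conflicts.append(entity_id)
    return conflicts


# Sample entries based on the classification items used in this notebook.
existing = {
    "Q3": ["person", "human"],
    "Q11": ["USGS Report Series", "USGS Series Report"],
}

print(find_label_conflicts("Human", [], existing))
# ['Q3']
print(find_label_conflicts("dataset", ["data set"], existing))
# []
```

Returning the conflicting IDs, rather than a simple true/false, leaves room to decide later whether a conflict should block creation or trigger disambiguation.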

In [None]:
# Note: this assignment rebinds the property_query name imported from utils above.
property_query = """
SELECT ?property ?propertyLabel ?propertyDescription ?propertyAltLabel WHERE {
    ?property a wikibase:Property .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

## Foundational Properties and Classes

Looking toward the initial use cases for the GeoKB, we need to lay down a basic structure for properties and classification so that every item we introduce has an appropriate "instance of" claim pointing to a reasonable concept for basic organization. We could go on forever trying to get the semantics just right and aligned with as many sources of definition as possible, but we'll ease into that level of sophistication over time. In the near term, I've started a Google Sheet to contain the basic structure to get our knowledgebase initialized. We will likely need to iterate on this several times to get it right, and we will doubtless miss some things.

There are lots of ways to spin up these details, but a Google Sheet seems simple enough, it can be edited by multiple people, and we can read it as a CSV and process it here in the notebook. I've built this with a Properties and a Classification sheet that we'll iterate on as we improve the foundation.

In [4]:
def ref_table(reference, sheet_id=geokb_init_sheet_id):
    return f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={reference}'

geokb_init_properties = pd.read_csv(ref_table('Properties'))
geokb_init_classes = pd.read_csv(ref_table('Classification'))

In [5]:
display(geokb_init_properties)
display(geokb_init_classes)

Unnamed: 0,label,description,datatype,wd_id
0,instance of,this item is a class of that item;type of the ...,wikibase-item,P31
1,subclass of,this item is a subclass (subset) of that item,wikibase-item,P279
2,coordinate location,a geographic point indicating the location of ...,globe-coordinate,P625
3,reference item,a knowledgebase item that serves as the source...,wikibase-item,
4,reference url,a web link provided as the source reference fo...,url,
5,reference statement,a short statement about the source for a claim...,string,
6,publication date,date or point in time when a work represented ...,time,P577
7,subject matter,topic or subject addressed by the item,wikibase-item,P921
8,ranking,used as statement qualifier indicating relativ...,quantity,P1352
9,ISO 3166-1 alpha-2 code,identifier for a country in two-letter format ...,external-id,P297


Unnamed: 0,label,description,aliases,subclass of,wd_id
0,entity,"anything that can be considered, discussed, or...",thing,,Q35120
1,person,"common name of Homo sapiens, unique extant spe...",human,entity,Q5
2,organization,social entity established to meet needs or pur...,,entity,Q43229
3,document,form for preservation of structured and identi...,,entity,Q49848
4,publication,content made available to the general public,,entity,Q732577
5,scholarly article,"article in an academic publication, usually pe...","research article,scientific article,journal ar...",publication,Q13442814
6,report,"informational, formal, and detailed text",,document,Q10870555
7,government report,document written by a government to convey inf...,,report,Q15629444
8,NI 43-101 Technical Report,"National Instrument 43-101 (the ""NI 43-101"" or...",,report,
9,USGS Report Series,"official, USGS-authored publications of the U....",USGS Series Report,government report,


## Baseline Knowledgebase

We may or may not be starting from a blank-slate Wikibase installation. We know we need everything laid out in our Properties and Classification tables, but some of it may already exist. We also need to build claims on these items that use property and item identifiers, so we need to determine exactly what's in the current system. My initial attempt to use the search functionality based on pywikibot and the currently non-functional API failed, so I'm experimenting with a SPARQL approach. This is somewhat handy in that we can use the SPARQL query builder to work out our queries, testing first against the full Wikidata platform and then against our own instance. Those queries can then be pulled in here, with the query function set up to return a result we can work from.
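For orientation, here is a sketch of what a label lookup query might look like. The `utils` helpers imported at the top of this notebook are not shown, so `build_label_query` is a hypothetical stand-in rather than the actual `query_by_item_label`. It assumes the standard Wikibase RDF mapping, in which labels are exposed as `rdfs:label` and aliases as `skos:altLabel`.

```python
def build_label_query(label, language="en"):
    """Build a SPARQL query for entities whose label or alt label equals `label`."""
    # Escape backslashes and double quotes so the label stays a valid string literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?item WHERE {{
    ?item rdfs:label|skos:altLabel "{escaped}"@{language} .
}}
"""


print(build_label_query("mineral deposit"))
```

As written, the pattern would also match properties that carry the same label, so a real helper may need an extra filter on the entity type.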

## Building properties and items

I left off here after proving that I can at least create items. This last bit followed the [tutorial](https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial/Labels) on building and editing items to include richer provenance (history). I'll come back to build this into functional logic.

### Deletion

After some reading on this, it looks like deleting pages in a MediaWiki instance can only be done by users with administrative privileges. This is probably a good safeguard for us to leverage, but we need to suss out a usable workflow. There appears to be some type of functionality to mark a page for deletion, presumably with an administrator then concurring and executing the deletion.

### Property creation and maintenance

I can get a list of all properties in the instance, but I can't yet figure out a way to get a list of all items. With a full set of properties, we can figure out which of them are in the baseline set, get rid of anything extraneous, and add anything new. We could do the same with items up to a certain point.
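The comparison described above boils down to set arithmetic on labels. The sketch below is one way to express it; `diff_baseline` is an illustrative name, and the sample lists are made up for the example rather than taken from the live instance.

```python
def diff_baseline(baseline_labels, existing_labels):
    """Split labels into what to add, what to review for removal, and what is in place."""
    baseline = set(baseline_labels)
    existing = set(existing_labels)
    return {
        "missing": sorted(baseline - existing),     # in the sheet, not yet in the instance
        "extraneous": sorted(existing - baseline),  # in the instance, not in the sheet
        "present": sorted(baseline & existing),     # already aligned
    }


# Illustrative values only.
baseline = ["instance of", "subclass of", "coordinate location", "publication date"]
existing = ["instance of", "subclass of", "ranking"]

result = diff_baseline(baseline, existing)
print(result["missing"])
# ['coordinate location', 'publication date']
print(result["extraneous"])
# ['ranking']
```

Matching on labels works only while labels stay unique; once identifiers are mapped, the same comparison could run on IDs instead.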

Note: I'm currently working against a Wikibase instance from the wikibase.cloud folks as we're still getting our own deal fully operational. I'm able to create properties using the barebones new_property() function and the pwb.simple_request method, but I'm seeing weird behavior with the SPARQL query not turning up the new property right away. There seems to be a delay, perhaps in an indexing process. This creates a problem if the process would be to fully verify each new property. I could handle it like I have here with an initial check for missing properties and then simply running through everything, trusting the process. But that's brittle.
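One less brittle option is to poll until a newly created property becomes visible to the query service. The sketch below assumes a lookup callable that returns an ID or nothing; `wait_until_visible`, its parameters, and the `fake_lookup` stand-in are all hypothetical, and the right number of attempts and delay would need to be tuned against the actual indexing lag.

```python
import time


def wait_until_visible(lookup, label, attempts=5, delay=2.0, sleep=time.sleep):
    """Poll `lookup(label)` until it returns a truthy ID or attempts run out."""
    for attempt in range(attempts):
        entity_id = lookup(label)
        if entity_id:
            return entity_id
        # Back off exponentially between tries, but do not sleep after the last one.
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))
    return None


# Stand-in for a SPARQL lookup that only "sees" the property on the third call.
calls = {"n": 0}


def fake_lookup(label):
    calls["n"] += 1
    return "P11" if calls["n"] >= 3 else None


print(wait_until_visible(fake_lookup, "some new property", sleep=lambda seconds: None))
# P11
```

Passing the sleep function in as a parameter keeps the helper easy to test without actually waiting.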

In [44]:
geokb_properties = sparql_query(
    endpoint=os.environ["SPARQL_ENDPOINT"], 
    query=property_query,
    output='dataframe'
)
geokb_properties['id'] = geokb_properties.property.apply(lambda x: x.split('/')[-1])

property_map = geokb_properties.set_index('propertyLabel')['id'].to_dict()
geokb_properties

Unnamed: 0,property,propertyLabel,propertyDescription,propertyAltLabel,id
0,https://geokb.wikibase.cloud/entity/P1,instance of,that class of which this subject is a particul...,rdf:type,P1
1,https://geokb.wikibase.cloud/entity/P2,subclass of,this item is a subclass (subset) of that item,,P2
2,https://geokb.wikibase.cloud/entity/P3,reference item,a knowledgebase item that serves as the source...,,P3
3,https://geokb.wikibase.cloud/entity/P4,reference url,a web link provided as the source reference fo...,,P4
4,https://geokb.wikibase.cloud/entity/P5,reference statement,a short statement about the source for a claim...,,P5
5,https://geokb.wikibase.cloud/entity/P6,coordinate location,a geographic point indicating the location of ...,,P6
6,https://geokb.wikibase.cloud/entity/P7,publication date,date or point in time when a work represented ...,,P7
7,https://geokb.wikibase.cloud/entity/P8,subject matter,topic or subject addressed by the item,,P8
8,https://geokb.wikibase.cloud/entity/P9,ranking,used as statement qualifier indicating relativ...,,P9
9,https://geokb.wikibase.cloud/entity/P10,ISO 3166-1 alpha-2 code,identifier for a country in two-letter format ...,,P10


### Come back to...

I need to revisit property creation with WikibaseIntegrator.

### Item creation and development

I've reworked a process_item() function using other foundational functions that apply to both items and properties (entities). Our use case for creating items of all types, whether part of the foundation or otherwise, will often involve simply sending in a label along with an optional (but strongly encouraged) description and aliases. We will also re-run this over time when we have new descriptions and aliases for a given set of things to update these basic elements. For now, we assume all values are in English, but we will need to deal with multiple languages relatively soon (e.g., mine names in other languages are common).
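As an illustration of the label, description, and aliases input described above, the sketch below turns one row of the Classification sheet into a plain dictionary of fields. This is not the process_item() function itself; `row_to_item_fields` and `clean_text` are hypothetical names, and splitting aliases on commas is an assumption based on how the sheet's `aliases` column appears in this notebook.

```python
import math


def clean_text(value):
    """Return a stripped string, or None for empty cells (pandas reads them as NaN)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    text = str(value).strip()
    return text or None


def row_to_item_fields(row, language="en"):
    """Build the basic fields for an item from a dict-like row (dict or pandas Series)."""
    label = clean_text(row.get("label"))
    if label is None:
        raise ValueError("a label is required to create or update an item")
    aliases_cell = clean_text(row.get("aliases"))
    aliases = [a.strip() for a in aliases_cell.split(",") if a.strip()] if aliases_cell else []
    return {
        "language": language,
        "label": label,
        "description": clean_text(row.get("description")),
        "aliases": aliases,
    }


# Sample row; the empty description mimics a blank cell read by pandas.
row = {
    "label": "scholarly article",
    "description": float("nan"),
    "aliases": "research article, scientific article",
}
print(row_to_item_fields(row))
# {'language': 'en', 'label': 'scholarly article', 'description': None, 'aliases': ['research article', 'scientific article']}
```

Carrying the language alongside the values is a small step toward the multilingual labels mentioned above.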

The following test shows the basic concept.

In [25]:
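def item_id_by_label(label):
    """Return the GeoKB ID (e.g., 'Q2') of the first item matching a label, or None."""
    # Non-string values (such as NaN from empty cells) fall through and return None.
    if isinstance(label, str):
        results = sparql_query(
            endpoint=os.environ["SPARQL_ENDPOINT"],
            query=query_by_item_label(label=label),
            output='dict'
        )
        if results:
            # The item comes back as a full entity URI; keep only the trailing ID.
            return results[0]["item"].split("/")[-1]


def subclass_id(df, subclass_label):
    """Return the geokb_id of the row whose label matches subclass_label, or None."""
    if isinstance(subclass_label, str):
        subclass_record = df[df.label == subclass_label]
        if not subclass_record.empty:
            return subclass_record.iloc[0].geokb_id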

In [20]:
geokb_init_classes['geokb_id'] = geokb_init_classes.label.swifter.apply(item_id_by_label)


In [27]:
geokb_init_classes['geokb_subclass_id'] = geokb_init_classes['subclass of'].swifter.apply(lambda x: subclass_id(geokb_init_classes, x))


In [28]:
geokb_init_classes

Unnamed: 0,label,description,aliases,subclass of,wd_id,geokb_id,geokb_subclass_id
0,entity,"anything that can be considered, discussed, or...",thing,,Q35120,Q2,
1,person,"common name of Homo sapiens, unique extant spe...",human,entity,Q5,Q3,Q2
2,organization,social entity established to meet needs or pur...,,entity,Q43229,Q4,Q2
3,document,form for preservation of structured and identi...,,entity,Q49848,Q5,Q2
4,publication,content made available to the general public,,entity,Q732577,Q6,Q2
5,scholarly article,"article in an academic publication, usually pe...","research article,scientific article,journal ar...",publication,Q13442814,Q7,Q6
6,report,"informational, formal, and detailed text",,document,Q10870555,Q8,Q5
7,government report,document written by a government to convey inf...,,report,Q15629444,Q9,Q8
8,NI 43-101 Technical Report,"National Instrument 43-101 (the ""NI 43-101"" or...",,report,,Q10,Q8
9,USGS Report Series,"official, USGS-authored publications of the U....",USGS Series Report,government report,,Q11,Q9


In [45]:
subclass_prop_id = property_map['subclass of']

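for index, row in geokb_init_classes.iterrows():
    print("PROCESSING:", row.label, row.geokb_id)
    # Fetch the existing item so the new claim is added to it.
    item = wbi.item.get(entity_id=row.geokb_id)
    
    # Only rows with a resolved parent class get a "subclass of" claim;
    # the root class has no parent and is left unchanged.
    if isinstance(row.geokb_subclass_id, str):
        item.claims.add(Item(
            prop_nr=subclass_prop_id,
            value=row.geokb_subclass_id
        ))
        item.write()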

PROCESSING: entity Q2
PROCESSING: person Q3
PROCESSING: organization Q4
PROCESSING: document Q5
PROCESSING: publication Q6
PROCESSING: scholarly article Q7
PROCESSING: report Q8
PROCESSING: government report Q9
PROCESSING: NI 43-101 Technical Report Q10
PROCESSING: USGS Report Series Q11
PROCESSING: dataset Q12
PROCESSING: scientific model Q13
PROCESSING: project Q14
PROCESSING: research project Q15
PROCESSING: phenomenon Q16
PROCESSING: natural phenomenon Q17
PROCESSING: material Q18
PROCESSING: natural material Q19
PROCESSING: mineral occurrence Q20
PROCESSING: mineral deposit Q21
PROCESSING: ore Q22
PROCESSING: ore deposit Q23
PROCESSING: mineral Q24
PROCESSING: mineral deposit model Q25
PROCESSING: geographic region Q26


I need to come back to this and explore a full method to synchronize the property/classification baseline. I'm moving on to getting items added from one of the sources.