This notebook is a work-in-progress process for initializing the Geoscience Knowledgebase (GeoKB) with a set of properties and foundational semantics that establish a base to build from in curating geoscientific knowledge. Our initial use cases involve integrating mineral occurrence information with the document references that support it, along with other facts and concepts associated with conducting mineral resource assessments. To build a useful knowledge graph on these concepts, though, we also need to bring in the supporting items and classes that claims point to, so that every claim links to something that legitimately exists in the graph.

For instance, we use "NI 43-101 Technical Reports" and the newer "SK-1300 Technical Report" as sources for claims associated with mining projects/properties such as geographic location, mineral commodities identified and/or extracted, figures indicating estimates of ore grade and tonnage, and other details. These are technical geoscientific reports required by the Canadian and U.S. governments, respectively. We need an "instance of" (rdf:type) claim on everything like this in the system. While we could simply create items in the graph to represent these two classes with no further classification, it is useful to "work back up the semantic hierarchy" for as many concepts as we can, as far as we need to, so that the information we are recording in the GeoKB is understandable in the broader global knowledge commons (Wikidata and/or other efforts).

The initialization process here is designed to give us a semantic baseline to work from as we pull in the information and connections we really care about within this context. We're taking a pragmatic approach that is slightly more rigorous (and certainly more streamlined) than the wild west of Wikidata but somewhere short of endlessly academic. We have to get a whole bunch of information into the GeoKB to support analytical use, so we make a best effort to align what we have with mature ontologies and namespaces, knowing we'll have to evolve it over time. The notebook approach on this gives us a good basis to record our reasoning and the places we have to make pragmatic tradeoffs.

In [115]:
import os
import pywikibot as pwb
import json
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON
from nested_lookup import nested_lookup
from datetime import datetime

## Parameters

If we want to run this and other notebooks as Lambdas eventually, we'll need to set up some parameters. I moved the values we want to keep somewhat secret to environment variables as a safety measure.

In [2]:
sparql_endpoint = os.environ['SPARQL_ENDPOINT']
wb_domain = os.environ['WB_DOMAIN']
geokb_init_sheet_id = '1dbuKc4cZJz0YY81B2xWXM5fId6gWgzmQar3hg3CI0Rw'

accepted_languages = ['en']
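
A minimal sketch of how the environment lookups could fail with a clearer message when a variable is missing or empty. The `require_env` helper is hypothetical (it is not part of this notebook or any package), and the example value is a placeholder rather than a real endpoint.

```python
import os


def require_env(name: str, environ=None) -> str:
    """Return a required environment variable or fail with a clear message."""
    # Accept an explicit mapping so the helper can be exercised without real settings
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Placeholder mapping used only for illustration
example_env = {"SPARQL_ENDPOINT": "https://example.org/sparql"}
print(require_env("SPARQL_ENDPOINT", example_env))
```

In the notebook itself this would be called without the second argument, so that values come from `os.environ` as they do now.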

## Functions

All or most of these functions should be movable to the abstract Wikibase management python package we are designing. That needs to be applicable to the GeoKB but generic enough to apply in other types of domains and circumstances. There are other communities doing similar work such as the wikidataintegrator project in the health sciences. We just found a need to start from the basics of pywikibot and how it operates.

I realize I'm introducing what could be higher overhead than necessary here with use of Pandas and other specialized packages. We can reevaluate that at the foundational level of the KB management package.

### Label uniqueness

Eventually, we will end up in the same situation that Wikidata is in where we will have multiple items with the same label that are disambiguated by their other attributes (classification being the likely chief distinguishing factor). Key examples on the close horizon will be how we deal with minerals such as gold that can be both a chemical element, as defined in Wikidata now, and a mineral commodity in some contexts. Do we have one item called "gold" that can be an instance of a number of things or multiple items with the same label that need to be disambiguated?

For the immediate future, I'm going to take the approach of constraining the GeoKB to unique labels (including alt labels) and see where we run into problems. The check_item_label() function is something I'm setting up to be run any time we go to create an item so that we ensure uniqueness.

The same dynamic applies with properties as well, though those should be more straightforward as we're striving for clear, unambiguous semantics anyway.
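
To make the uniqueness constraint concrete, here is a rough sketch of an offline check that could be run over a batch of candidate records before anything is written. The `find_label_collisions` function and the record keys are hypothetical, and the case-insensitive comparison is an assumption that is stricter than the exact-match SPARQL lookup used elsewhere in this notebook.

```python
from collections import defaultdict


def find_label_collisions(records):
    """Map each label or alias to the record keys that claim it, keeping only clashes."""
    claimed = defaultdict(set)
    for key, record in records.items():
        names = [record["label"], *record.get("aliases", [])]
        for name in names:
            # Assumption: treat labels that differ only by case as the same label
            claimed[name.strip().lower()].add(key)
    return {name: sorted(keys) for name, keys in claimed.items() if len(keys) > 1}


# Illustrative records only; the keys are made up for this example
records = {
    "gold (element)": {"label": "gold", "aliases": ["Au"]},
    "gold (commodity)": {"label": "gold", "aliases": []},
    "copper": {"label": "copper", "aliases": ["Cu", "Copper"]},
}
print(find_label_collisions(records))
# {'gold': ['gold (commodity)', 'gold (element)']}
```

The two "gold" records collide, while "copper" does not, because its label and its "Copper" alias belong to the same record.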

In [116]:
### SPARQL Queries
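def query_by_item_label(label: str, include_aliases: bool = True) -> str:
    # Match on the label alone, or on the label and any alias (SPARQL property path)
    label_criteria = 'rdfs:label|skos:altLabel'
    if not include_aliases:
        label_criteria = 'rdfs:label'
    # Escape backslashes and double quotes so the label stays a valid SPARQL string literal
    safe_label = label.replace('\\', '\\\\').replace('"', '\\"')
    query_string = """
        SELECT ?item ?itemLabel ?itemDescription ?itemAltLabel WHERE{
        ?item %s "%s"@en.
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }
    """ % (label_criteria, safe_label)

    return query_string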

def query_item_subclasses(item_id: str, subclass_property_id: str = 'P13') -> str:
    query_string = """
        SELECT ?item ?itemLabel (GROUP_CONCAT(DISTINCT ?subclassOf; SEPARATOR=",") as ?subclasses)
        {
        VALUES (?item) {(wd:%s)}
        OPTIONAL {
            ?item wdt:%s ?subclassOf
        }
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        } GROUP BY ?item ?itemLabel
    """ % (item_id, subclass_property_id)

    return query_string

property_query = """
SELECT ?property ?propertyLabel ?propertyDescription ?propertyAltLabel WHERE {
    ?property a wikibase:Property .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" .}
 }
 """


def sparql_query(endpoint: str, output: str, query: str):
    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)

    sparql.setQuery(query)
    results = sparql.queryAndConvert()

    if output == 'raw':
        return results
    elif output == 'dataframe':
        names = results["head"]["vars"]
        # Build one row per binding; variables missing from a binding
        # (e.g., unmatched OPTIONAL values) become None
        rows = [
            {name: binding.get(name, {}).get('value') for name in names}
            for binding in results["results"]["bindings"]
        ]
        return pd.DataFrame(rows, columns=names)

    raise ValueError(f"Unsupported output type: {output}")

def get_wb(site_name: str, language='en'):
    site = pwb.Site(language, site_name)
    site.login()
    return site

def check_item_label(labels: dict, response: str = 'id'):
    label = labels['en']

    query_string = query_by_item_label(label=label)

    query_results = sparql_query(
        endpoint=sparql_endpoint,
        output='raw',
        query=query_string
    )

    if not query_results["results"]["bindings"]:
        return
    
    if len(query_results["results"]["bindings"]) > 1:
        raise ValueError(f"More than one item with the label: {labels}")
    
    if response == 'id':
        return query_results["results"]["bindings"][0]["item"]["value"].split('/')[-1]

    return query_results["results"]["bindings"][0]

def get_entity(
        site: pwb.APISite, 
        entity_id: str = None, 
        entity_type: str = 'item', 
        data_type: str = 'wikibase-item'):

    # An explicit ID determines the entity type; otherwise the entity_type argument is used
    if entity_id and entity_id.startswith('Q'):
        entity_type = 'item'
    elif entity_id and entity_id.startswith('P'):
        entity_type = 'property'

    if entity_type == 'item':
        return pwb.ItemPage(
            site=site.data_repository(),
            title=entity_id
        )
    elif entity_type == 'property':
        return pwb.PropertyPage(
            source=site.data_repository(),
            title=entity_id,
            datatype=data_type
        )

def edit_labels(
        site: pwb.APISite, 
        labels: dict, 
        prov_statement: str, 
        entity_type: str = 'item', 
        data_type: str = 'wikibase-item'
    ) -> str:
    try:
        entity_id = check_item_label(labels=labels)
    except Exception as e:
        raise ValueError("Problem in running query for item on labels") from e

    entity = get_entity(
        site=site,
        entity_id=entity_id,
        entity_type=entity_type,
        data_type=data_type
    )

    entity.editLabels(
        labels=labels, 
        summary=prov_statement
    )

    return entity.getID()

def edit_descriptions(site: pwb.APISite, entity_id: str, descriptions: dict, prov_statement: str):
    entity = get_entity(
        site=site,
        entity_id=entity_id
    )

    if entity.getID() == '-1':
        raise ValueError('Entity does not yet exist, create it first')

    entity.editDescriptions(
        descriptions=descriptions,
        summary=prov_statement,
    )
    return entity.get()

def edit_aliases(site: pwb.APISite, entity_id: str, aliases: dict, prov_statement: str):
    entity = get_entity(
        site=site,
        entity_id=entity_id
    )

    if entity.getID() == '-1':
        raise ValueError('Entity does not yet exist, create it first')

    entity.editAliases(
        aliases=aliases,
        summary=prov_statement,
    )
    return entity.get()

def process_item(
        site: pwb.APISite,
        label: str,
        prov_statement: str,
        description: str = None,
        aliases: list = None
    ):

    # Avoid a shared mutable default for the alias list
    aliases = aliases or []

    # Assume English language for now
    label_dict = {'en': label}

    check_item = check_item_label(
        labels=label_dict,
        response='raw'
    )

    if not check_item:
        # No existing item carries this label or alias, so create one
        entity_id = edit_labels(
            site=site,
            labels=label_dict,
            prov_statement=prov_statement,
            entity_type='item'
        )
        missing_description = description
        missing_aliases = aliases
    else:
        entity_id = check_item['item']['value'].split('/')[-1]
        # Description and alias bindings are absent when the item has none
        existing_description = check_item.get('itemDescription', {}).get('value')
        missing_description = description if description != existing_description else None
        alt_labels = check_item.get('itemAltLabel', {}).get('value', '')
        existing_aliases = [i.strip() for i in alt_labels.split(',') if i.strip()]
        missing_aliases = list(set(aliases) - set(existing_aliases))

    if missing_description:
        edit_descriptions(
            site=site,
            entity_id=entity_id,
            descriptions={'en': missing_description},
            prov_statement=f'Adding description for {label}'
        )

    if missing_aliases:
        edit_aliases(
            site=site,
            entity_id=entity_id,
            aliases={'en': missing_aliases},
            prov_statement=f'Adding aliases for {label}'
        )

# Still problematic here with ItemPage.get() after adding claims
def add_claim(site: pwb.APISite, subject_item_id: str, property_id: str, claim_value: str, prov_statement: str):
    repo = site.data_repository()

    # Establish connection to item
    subject_item = pwb.ItemPage(repo, subject_item_id)
    # Verify item exists
    # try:
    #     subject_item.get()
    # except Exception as e:
    #     raise ValueError(e)
    
    # Establish connection to property
    property_item = pwb.PropertyPage(repo, property_id)
    # Verify property exists and get datatype
    try:
        property_datatype = property_item.get()['datatype']
        subject_claim = pwb.Claim(repo, property_id)
    except Exception as e:
        raise ValueError(e)
    
    if property_datatype == 'wikibase-item':
        # Get item target and verify exists
        claim_object = pwb.ItemPage(repo, claim_value)
        try:
            claim_object.get()
        except Exception as e:
            raise ValueError(e)
    else:
        # Need to handle other cases where we properly test/format an appropriate claim_target response
        # Nothing is written for other datatypes yet
        return

    # Set the target for the claim
    subject_claim.setTarget(claim_object)
    # Need to handle additional work of adding references and qualifiers

    # Commit the claim to wikibase
    try:
        subject_item.addClaim(subject_claim, summary=prov_statement)
    except Exception as e:
        # The failure is not necessarily a duplicate claim, so keep the original error
        raise ValueError(f"Could not add claim to {subject_item_id}: {e}") from e

## Foundational Properties and Classes

Looking toward the initial use cases for the GeoKB, we need to lay down a basic structure for properties and classification such that the items we need to introduce will all have an appropriate "instance of" claim pointing to a reasonable concept for basic organization. We could go on forever trying to get the semantics just right and aligned with as many sources of definition as possible, but we'll ease into that level of sophistication over time. In the near term, I've started a Google Sheet to contain the basic structure to get our knowledgebase initialized. We will likely need to iterate on this several times to get it right, and we will doubtless miss some things.

There are lots of ways to spin up these details, but a Google Sheet seems simple enough: it can be edited by multiple people, and we can read it as a CSV and process it here in the notebook. I've built this with a Properties sheet and a Classification sheet that we'll iterate on as we improve the foundation.

In [122]:
def ref_table(reference, sheet_id=geokb_init_sheet_id):
    return f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={reference}'

geokb_properties = pd.read_csv(ref_table('Properties'))
geokb_classes = pd.read_csv(ref_table('Classification'))
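
The two sheet names used here contain no spaces, so the URL works as written. If tabs with spaces or other special characters are added later, the sheet name would need to be URL-encoded. A small sketch of that, where `ref_table_url` is a hypothetical variant of `ref_table()` and `SHEET_ID` is a placeholder:

```python
from urllib.parse import quote


def ref_table_url(sheet_name: str, sheet_id: str) -> str:
    """Build the CSV export URL for one tab, URL-encoding the tab name."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
    )


# "Mineral Commodities" is a made-up tab name used only to show the encoding
print(ref_table_url("Mineral Commodities", "SHEET_ID"))
```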

In [123]:
display(geokb_properties)
display(geokb_classes)

Unnamed: 0,label,description,datatype,wd_id
0,instance of,this item is a class of that item;type of the ...,wikibase-item,P31
1,subclass of,this item is a subclass (subset) of that item,wikibase-item,P279
2,coordinate location,a geographic point indicating the location of ...,globe-coordinate,P625
3,reference item,a knowledgebase item that serves as the source...,wikibase-item,
4,reference url,a web link provided as the source reference fo...,url,
5,reference statement,a short statement about the source for a claim...,string,
6,publication date,date or point in time when a work represented ...,time,P577
7,subject matter,topic or subject addressed by the item,wikibase-item,P921
8,ranking,used as statement qualifier indicating relativ...,quantity,P1352


Unnamed: 0,label,description,aliases,subclass of,wd_id
0,entity,"anything that can be considered, discussed, or...",thing,,Q35120
1,person,"common name of Homo sapiens, unique extant spe...",human,entity,Q5
2,organization,social entity established to meet needs or pur...,,entity,Q43229
3,document,form for preservation of structured and identi...,,entity,Q49848
4,publication,content made available to the general public,,entity,Q732577
5,scholarly article,"article in an academic publication, usually pe...","research article,scientific article,journal ar...",publication,Q13442814
6,report,"informational, formal, and detailed text",,document,Q10870555
7,government report,document written by a government to convey inf...,,report,Q15629444
8,NI 43-101 Technical Report,"National Instrument 43-101 (the ""NI 43-101"" or...",,report,
9,USGS Report Series,"official, USGS-authored publications of the U....",USGS Series Report,government report,


## Connecting to Wikibase

Many read operations can be accommodated by the public SPARQL interface with no need for specialized wikibase connections. At the moment, the Elasticsearch part of the tech stack is not connecting properly, and some of the dependent API functionality in the pywikibot package is not working. The write operations for properties and items are working fine, and we need to establish a connection to the APISite via pywikibot and the user/password bot config we set up previously.

The bot config creates user-config.py and user-password.py files in the directory from which pywikibot is run, which is not the most secure way to handle things. We may experiment with the more secure OAuth methods down the road once we get the mechanics worked out. The following sets up the connection for later use and tests it for functionality. I'm not sure yet about the significance of the errors I'm seeing.

In [118]:
# Required connection points in the pwb API
geokb_site = get_wb('geokb')

print(type(geokb_site))
print(type(geokb_site.data_repository()))

<class 'pywikibot.site._apisite.APISite'>
<class 'pywikibot.site._datasite.DataSite'>


## Baseline Knowledgebase

We may or may not be starting from a blank slate Wikibase installation. We know we need to have everything laid out in our Properties and Classification tables, but some of it may already exist. We also need to build claims on these items that use property and item identifiers, so we need to determine exactly what's in the current system. My initial attempt to use the search functionality based on pywikibot and the currently non-functional API failed, so I'm experimenting with a SPARQL approach. This is somewhat handy in that we can leverage the SPARQL query builder to figure out our queries, first with a test against the full Wikidata platform and then our own instance. Those queries can be pulled in here based on how I set up the function to get us a result to work from.

## Building properties and items

I left off here after proving that I can at least create items. This last bit followed the [tutorial](https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial/Labels) on building and editing items to include richer provenance (history). I'll come back to build this into functional logic.

### Deletion

After some reading on this, it looks like deleting pages in a MediaWiki instance can only be done by users with administrative privileges. This is probably a good safeguard for us to leverage, but we need to suss out a usable workflow. There appears to be some type of functionality to mark a page for deletion, presumably with an administrator then concurring and executing the deletion.

### Property creation and maintenance

I can get a list of all properties in the instance, but I can't yet figure out a way to get a list of all items. With a full set of properties, we can figure out what we already have from the baseline set, get rid of anything extraneous, and add anything new. We could do the same with items to a certain point.

In [124]:
df_properties = sparql_query(sparql_endpoint, 'dataframe', property_query)
df_properties['id'] = df_properties.property.apply(lambda x: x.split('/')[-1])
df_properties

Unnamed: 0,property,propertyLabel,propertyDescription,propertyAltLabel,id
0,http://wikibase.svc/entity/P1,imported from Wikimedia project,This is a test description for imported from W...,,P1
1,http://wikibase.svc/entity/P2,instance of,This is a test description for instance of,,P2
2,http://wikibase.svc/entity/P3,mrds id,This is a test description for mrds id,,P3
3,http://wikibase.svc/entity/P4,reference url,This is a test description for reference url,,P4
4,http://wikibase.svc/entity/P5,country,This is a test description for country,,P5
5,http://wikibase.svc/entity/P6,dep id,This is a test description for dep id,,P6
6,http://wikibase.svc/entity/P7,coordinate location,This is a test description for coordinate loca...,,P7
7,http://wikibase.svc/entity/P8,state,This is a test description for state,,P8
8,http://wikibase.svc/entity/P9,stated in,This is a test description for stated in,,P9
9,http://wikibase.svc/entity/P10,described by source,This is a test description for described by so...,,P10


In [125]:
# Missing properties
missing_properties = geokb_properties[~geokb_properties.label.isin(df_properties.propertyLabel)]
missing_properties

Unnamed: 0,label,description,datatype,wd_id
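
The comparison above only reports what is missing. A sketch of the fuller three-way split described earlier (missing, extraneous, and shared) using plain sets is shown here. The `compare_labels` function is hypothetical, the sample labels are taken from the tables above, and matching is exact, as it is with `isin()`.

```python
def compare_labels(baseline, existing):
    """Split property labels into missing, extraneous, and shared groups."""
    baseline_set, existing_set = set(baseline), set(existing)
    return {
        "missing": sorted(baseline_set - existing_set),      # in the sheet, not in the instance
        "extraneous": sorted(existing_set - baseline_set),   # in the instance, not in the sheet
        "shared": sorted(baseline_set & existing_set),       # present in both
    }


baseline = ["instance of", "subclass of", "reference url"]
existing = ["instance of", "mrds id", "reference url"]
print(compare_labels(baseline, existing))
```

With the dataframes in this notebook, the inputs would be `geokb_properties.label` and `df_properties.propertyLabel`.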


### Item creation and development

I've reworked a process_item() function using other foundational functions that apply to both items and properties (entities). Our use case for creating items of all types, whether part of the foundation or otherwise, will often involve simply sending in a label along with an optional (but strongly encouraged) description and aliases. We will also re-run this over time when we have new descriptions and aliases for a given set of things to update these basic elements. For now, we assume all English language values, but we will need to deal with multiple languages relatively soon (e.g., mine names in other languages are common).

The following test shows the basic concept.

In [119]:
process_item(
    site=geokb_site,
    label='copper',
    prov_statement='Adding item for copper as mineral element and commodity',
    description='chemical element with symbol Cu and atomic number 29',
    aliases=['Cu','Copper']
)

Sleeping for 9.9 seconds, 2023-02-23 11:01:53
Sleeping for 9.7 seconds, 2023-02-23 11:02:03
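
The update step in process_item() comes down to comparing what was passed in against what the SPARQL lookup returned. Here is a standalone sketch of that comparison, assuming the result binding has the shape returned by query_by_item_label() and that description and alias keys may be absent. The `plan_updates` function and the `Q0` identifier are hypothetical, and splitting on commas would mis-handle any alias that itself contains a comma.

```python
def plan_updates(binding, description=None, aliases=()):
    """Work out which description and aliases still need to be written."""
    existing_description = binding.get("itemDescription", {}).get("value")
    alt_labels = binding.get("itemAltLabel", {}).get("value", "")
    existing_aliases = {a.strip() for a in alt_labels.split(",") if a.strip()}

    # Only write a description when one was supplied and it differs from what exists
    new_description = None
    if description and description != existing_description:
        new_description = description
    # Keep the caller's alias order while dropping aliases already present
    new_aliases = [a for a in aliases if a not in existing_aliases]
    return new_description, new_aliases


# Made-up binding: the item has one alias and no description yet
binding = {
    "item": {"value": "http://wikibase.svc/entity/Q0"},
    "itemAltLabel": {"value": "Cu"},
}
print(plan_updates(
    binding,
    description="chemical element with symbol Cu and atomic number 29",
    aliases=["Cu", "Copper"],
))
```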


### Claims

I'm stuck here (see below) trying to work out the process from https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial/Setting_statements and looking at what J. Olivero did with his initial bot code. It should work, but I'm making the items nonfunctional somehow.

I'm leaving off here with a problem in this function. Once I've set a claim on an item, trying to get that item raises an AttributeError on the first step of running get(). addClaim() appears to work, as the claim shows up on the item in the GUI as expected, but I can then no longer run a get() on the item in code without this error. I get the same error if I add the claim using the GUI. As soon as I remove the claim, I can get() the item just fine. And even though item.get() fails after adding a claim, I can still run a SPARQL query to retrieve the item and a subclass of claim. But since I can't run item.get() once a claim is on board, I can't add any more claims using this method.

In [None]:
add_claim(
    site=geokb_site,
    subject_item_id='Q33',
    property_id='P2',
    claim_value='Q3',
    prov_statement='Classifying item as subclass of entity'
)

In [None]:
# I have no idea what this error means
test_item = pwb.ItemPage(geokb_site.data_repository(), 'Q33')

try:
    display(test_item.get())
except Exception as e:
    print(e)
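
Printing only the exception message hides where the error is raised. A sketch of capturing the full traceback instead, which may help narrow down the AttributeError, follows. `describe_failure` is a hypothetical helper, and `FakeItem` is a stand-in with a simulated error message; it does not reproduce the actual pywikibot failure.

```python
import traceback


def describe_failure(func, *args, **kwargs):
    """Call func and return (result, None), or (None, full traceback text) on failure."""
    try:
        return func(*args, **kwargs), None
    except Exception:
        return None, traceback.format_exc()


class FakeItem:
    """Stand-in for an item whose get() fails; used only for illustration."""

    def get(self):
        raise AttributeError("simulated failure")


result, details = describe_failure(FakeItem().get)
print(details)
```

Against the real item this would be `describe_failure(test_item.get)`, and the traceback text should show which line inside pywikibot raises the error.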

In [None]:
sparql_query(
    endpoint=sparql_endpoint,
    output='dataframe',
    query=query_item_subclasses('Q33')
)
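
The `subclasses` column from query_item_subclasses() is a comma-separated string of entity URIs, and it may come back empty when the item has no matching claim. A small sketch for turning that string into bare identifiers is shown here. `parse_subclass_ids` is a hypothetical helper, and the URIs follow the `http://wikibase.svc/entity/` pattern seen in the property listing above.

```python
def parse_subclass_ids(subclasses: str) -> list:
    """Split a GROUP_CONCAT string of entity URIs into bare entity IDs."""
    return [
        uri.strip().rsplit("/", 1)[-1]
        for uri in subclasses.split(",")
        if uri.strip()  # skip empty pieces, including the all-empty case
    ]


print(parse_subclass_ids("http://wikibase.svc/entity/Q3,http://wikibase.svc/entity/Q5"))
# ['Q3', 'Q5']
```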

In [None]:
repo = geokb_site.data_repository()

subject_item = pwb.ItemPage(repo, 'Q33')
# property_item = pwb.PropertyPage(repo, 'P4')
# subject_claim = pwb.Claim(repo, 'P4')
# subject_claim.setTarget('https://www.google.com')
# subject_item.addClaim(subject_claim, summary='testing URL property')