This notebook is a work-in-progress process for initializing the Geoscience Knowledgebase (GeoKB) with a set of properties and foundational semantics that establish a base for curating geoscientific knowledge. Our initial use cases involve integrating mineral occurrence information with the document references that contribute to those and other facts and concepts associated with conducting mineral resource assessments. To build a useful knowledge graph on these concepts, though, we also need to bring in the supporting classes and properties that the claims on these items must link to.

For instance, we use "NI 43-101 Technical Reports" and the newer "SK-1300 Technical Reports" as sources for claims associated with mining projects/properties, such as geographic location, mineral commodities identified and/or extracted, figures indicating estimates of ore grade and tonnage, and other details. These are technical geoscientific reports required by the Canadian and U.S. governments, respectively. We need an "instance of" (rdf:type) claim on everything like this in the system. While we could simply create items in the graph to represent these two classes with no further classification, it is useful to "work back up the semantic hierarchy" for as many concepts as we can, as far as we need to, so that the information we record in the GeoKB is understandable in the broader global knowledge commons (Wikidata and/or other efforts).

The initialization process here is designed to give us a semantic baseline to work from as we pull in the information and connections we really care about within this context. We're taking a pragmatic approach that is slightly more rigorous (and certainly more streamlined) than the wild west of Wikidata but somewhere short of endlessly academic. We have to get a whole bunch of information into the GeoKB to support analytical use, so we make a best effort to align what we have with mature ontologies and namespaces, knowing we'll have to evolve it over time. The notebook approach on this gives us a good basis to record our reasoning and the places we have to make pragmatic tradeoffs.

In [1]:
import os
import pywikibot as pwb
import json
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON
from nested_lookup import nested_lookup

## Parameters

If we want to run this and other notebooks as Lambdas eventually, we'll need to set up some parameters. I moved the values we want to keep somewhat secret to environment variables as a safety measure.

In [2]:
sparql_endpoint = os.environ['SPARQL_ENDPOINT']
wb_domain = os.environ['WB_DOMAIN']
geokb_init_sheet_id = '1dbuKc4cZJz0YY81B2xWXM5fId6gWgzmQar3hg3CI0Rw'
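
A minimal sketch of how the environment lookups could fail with a clearer message when a variable is missing, which matters more once this runs unattended. The helper name `require_env` is hypothetical; `SPARQL_ENDPOINT` and `WB_DOMAIN` are the variable names used above.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        # A bare os.environ[name] would raise KeyError with only the name
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, the assignments above could read `sparql_endpoint = require_env('SPARQL_ENDPOINT')` and `wb_domain = require_env('WB_DOMAIN')`.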

## Functions

All or most of these functions should be movable to the abstract Wikibase management python package we are designing. That needs to be applicable to the GeoKB but generic enough to apply in other types of domains and circumstances. There are other communities doing similar work such as the wikidataintegrator project in the health sciences. We just found a need to start from the basics of pywikibot and how it operates.

I realize I'm introducing what could be higher overhead than necessary here with use of Pandas and other specialized packages. We can reevaluate that at the foundational level of the KB management package.

In [56]:
def get_wb(site_name: str, language='en'):
    site = pwb.Site(language, site_name)
    site.login()
    return site

def build_entity(
        label: str, 
        description: str, 
        language: str = 'en',
        datatype = None
    ) -> dict:
    item = {
        'labels': {
            language: {
                'language': language,
                'value': label
            }
        },
        'descriptions': {
            language: {
                'language': language,
                'value': description
            }
        }
    }

    if datatype:
        # Need to work out our own logic on what datatypes we will accept
        item["datatype"] = datatype
 
    return item

# I probably need to break this all up so that each element of an item or property gets built individually
# Some bot examples establish a connection to a "Page" and then add labels, descriptions, and aliases separately
# This would allow us to use those same functions in many circumstances like incrementally verifying the baseline
def add_entity(entitytype: str, summary: str, site: pwb.APISite, item: dict) -> dict:
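    if entitytype not in ['item','property']:
        raise ValueError(f"entitytype must be 'item' or 'property', got '{entitytype}'")
    
    if entitytype == 'property' and 'datatype' not in item:
        raise ValueError("A property requires a 'datatype' key in item")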

    params = {
        'action': 'wbeditentity',
        'new': entitytype,
        'data': json.dumps(item),
        'summary': summary,
        'token': site.tokens['csrf'],
    }

    try:
        req = site.simple_request(**params)
        results = req.submit()
        return results
    except Exception as e:
        item["error"] = e
        return e

def update_aliases(site: pwb.APISite, entity_id: str, aliases):
    item = pwb.ItemPage(site.data_repository(), entity_id)
    # Accept either a comma-separated string or a list of aliases
    alias = {"en": [a.strip() for a in aliases.split(',')] if isinstance(aliases, str) else aliases}

    messages = []
    for key in alias:
        message = "Setting aliases: {} = '{}'".format(key, alias[key])
        try:
            item.editAliases(
                {key: alias[key]},
                summary=message
            )
            messages.append({"message": message})
        except Exception as e:
            messages.append({
                "message": message,
                "error": e
            })
    return messages
    
def sparql_query(endpoint: str, output: str, query: str):
    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)

    sparql.setQuery(query)
    results = sparql.queryAndConvert()

    if output == 'raw':
        return results
    elif output == 'dataframe':
        names = results["head"]["vars"]
        data = []
        for name in names:
            property_value = nested_lookup('value', nested_lookup(name, results['results']['bindings']))
            if not property_value:
                data.append(None)
            else:
                data.append(property_value)

        return pd.DataFrame.from_dict(dict(zip(names, data)))
    else:
        raise ValueError(f"output must be 'raw' or 'dataframe', got '{output}'")
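
The column-wise conversion in `sparql_query` can misalign rows when a variable is unbound in only some results (for example, one inside an OPTIONAL block), because each column list is collected independently. A minimal sketch of a row-wise alternative follows; the function name `bindings_to_rows` is hypothetical, and it relies only on the standard SPARQL JSON results layout (`head.vars` and `results.bindings`).

```python
def bindings_to_rows(results: dict) -> list:
    """Flatten SPARQL JSON results into one dict per result row."""
    names = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        # Unbound variables are simply absent from the binding, so fill with None
        rows.append({name: binding.get(name, {}).get("value") for name in names})
    return rows
```

The resulting list could then be passed to `pd.DataFrame(rows, columns=names)`, which keeps every row aligned and leaves missing values as nulls.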

## Foundational Properties and Classes

Looking toward the initial use cases for the GeoKB, we need to lay down a basic structure for properties and classification such that the items we need to introduce will all have an appropriate "instance of" claim pointing to a reasonable concept for basic organization. We could go on forever trying to get the semantics just right and aligned with as many sources of definition as possible, but we'll ease into that level of sophistication over time. In the near term, I've started a Google Sheet to contain the basic structure to get our knowledgebase initialized. We will likely need to iterate on this several times to get it right, and we will doubtless miss some things.

There are lots of ways to spin up these details, but a Google Sheet seems simple enough, it can be edited by multiple people, and we can read it as a CSV and process it here in the notebook. I've built this with a Properties and a Classification sheet that we'll iterate on as we improve the foundation.

In [None]:
def ref_table(reference, sheet_id=geokb_init_sheet_id):
    return f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={reference}'

geokb_properties = pd.read_csv(ref_table('Properties'))
geokb_classes = pd.read_csv(ref_table('Classification'))
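
The sheet name is placed directly into the URL, which works for single-word names like `Properties` but would break on a name containing spaces or other reserved characters. A minimal sketch that URL-encodes the name is below; `ref_table_url` is a hypothetical variant of `ref_table`, and the sheet name in the usage note is made up for illustration.

```python
from urllib.parse import quote

def ref_table_url(sheet_name: str, sheet_id: str) -> str:
    """Build the CSV export URL for one tab of a Google Sheet."""
    # quote() percent-encodes spaces and other reserved characters
    return (
        f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq'
        f'?tqx=out:csv&sheet={quote(sheet_name)}'
    )
```

For example, `ref_table_url('Mineral Commodities', geokb_init_sheet_id)` would request the tab as `sheet=Mineral%20Commodities`.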

In [None]:
display(geokb_properties)
display(geokb_classes)

## Connecting to Wikibase

Many read operations can be accommodated by the public SPARQL interface with no need for specialized wikibase connections. At the moment, the Elasticsearch part of the tech stack is not connecting properly, and some of the dependent API functionality in the pywikibot package is not working. The write operations for properties and items are working fine, and we need to establish a connection to the APISite via pywikibot and the user/password bot config we set up previously.

That bot configuration creates user-config.py and user-password.py files in the directory from which pywikibot is run, which is not the most secure way to handle things. We may experiment with the more secure OAuth methods down the road once we get the mechanics worked out. The following sets up the connection for later use and tests it for functionality. I'm not yet sure of the significance of the errors I'm seeing.

In [33]:
# Required connection points in the pwb API
geokb_site = get_wb('geokb')

print(type(geokb_site))
print(type(geokb_site.data_repository()))

<class 'pywikibot.site._apisite.APISite'>
<class 'pywikibot.site._datasite.DataSite'>


## Baseline Knowledgebase

We may or may not be starting from a blank slate Wikibase installation. We know we need to have everything laid out in our Properties and Classification tables, but some of it may already exist. We also need to build claims on these items that use property and item identifiers, so we need to determine exactly what's in the current system. My initial attempt to use the search functionality based on pywikibot and the currently non-functional API failed, so I'm experimenting with a SPARQL approach. This is somewhat handy in that we can leverage the SPARQL query builder to figure out our queries, first with a test against the full Wikidata platform and then our own instance. Those queries can be pulled in here based on how I set up the function to get us a result to work from.

In [55]:
def query_by_item_label(label: str) -> str:
    query_string = """
        SELECT ?item ?itemLabel ?itemDescription ?itemAltLabel WHERE{  
        ?item ?label "%s"@en.  
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }    
        }
    """ % (label)

    return query_string

# Build on this idea for assembling multiple claims about the same property in tabular form
def query_item_subclasses(item_id: str, subclass_property_id: str = 'P13') -> str:
    query_string = """
        SELECT ?item ?itemLabel (GROUP_CONCAT(DISTINCT ?subclassOf; SEPARATOR=",") as ?subclasses)
        {
        VALUES (?item) {(wd:%s)}
        OPTIONAL {
            ?item wdt:%s ?subclassOf
        }
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        } GROUP BY ?item ?itemLabel
    """ % (item_id, subclass_property_id)

    return query_string


property_query = """
SELECT ?property ?propertyLabel ?propertyDescription ?propertyAltLabel WHERE {
    ?property a wikibase:Property .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" .}
 }
 """

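# Named separately so it does not overwrite property_query above,
# which the property listing later in the notebook depends on
item_claims_query = """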
 SELECT ?item ?itemLabel ?prop ?propertyLabel ?propertyValue ?propertyValueLabel
{ 
  VALUES (?item) {(wd:Q3)}
  OPTIONAL {
    ?item ?prop ?statement . 
    ?statement ?ps ?propertyValue . 
    ?property wikibase:claim ?prop . 
    ?property wikibase:statementProperty ?ps . 
  } 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } 
}
"""

In [None]:
sparql_query(
    endpoint=sparql_endpoint,
    output='dataframe',
    query=query_item_subclasses('Q3', 'P13')
)

I can get a list of all properties in the instance, but I can't yet figure out a way to get a list of all items. With a full set of properties, we can figure out which ones are in the baseline set, get rid of anything extraneous, and add anything new. We could do the same with items up to a certain point.

In [None]:
df_properties = sparql_query(sparql_endpoint, 'dataframe', property_query)
df_properties['id'] = df_properties.property.apply(lambda x: x.split('/')[-1])
df_properties

So far, the best I've come up with is a query on a given item by label or other criteria. I need to work out the best way to check through our baseline, and iterating through the whole list seems kind of dumb. For whatever reason, I can't apply the same kind of criterion to items that I used for properties.

e.g., `?item a wikibase:Item`
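
One possible explanation, which should be verified against the Wikibase RDF documentation, is that items are not given an explicit type triple the way properties are typed as `wikibase:Property`. A workaround sketch that avoids the type triple entirely is to filter on the entity IRI prefix. The function name `query_all_items` and the `entity_prefix` argument are hypothetical, and the prefix value for the GeoKB instance is an assumption that would need to be checked against the IRIs the endpoint actually returns.

```python
def query_all_items(entity_prefix: str, language: str = 'en') -> str:
    """Build a query listing every labeled entity whose IRI starts with <prefix>Q."""
    query_string = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?item ?itemLabel WHERE {
            ?item rdfs:label ?itemLabel .
            FILTER(LANG(?itemLabel) = "%s")
            FILTER(STRSTARTS(STR(?item), "%sQ"))
        }
    """ % (language, entity_prefix)

    return query_string
```

Items with no label in the requested language would be missed by this query, so it is a baseline check rather than a complete inventory.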

In [None]:
df = sparql_query(sparql_endpoint, 'dataframe', query_by_item_label('human'))
df['item'] = df.item.apply(lambda x: x.split('/')[-1])
df

## Building properties and items

I left off here after proving that I can at least create items. This last bit followed the [tutorial](https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial/Labels) on building and editing items to include richer provenance (history). I'll come back to build this into functional logic.

### Deletion

After some reading on this, it looks like deleting pages in a MediaWiki instance can only be done by users with administrative privileges. This is probably a good safeguard for us to leverage, but we need to suss out a usable workflow. There appears to be some type of functionality to mark a page for deletion, presumably with an administrator then approving and executing it.

### Looping vs. something more efficient

The pywikibot engine appears to impose some throttling on looped operations like adding entities. This doesn't matter when running something like a basic initialization process with a handful of properties and classification items. It might or might not matter when we get to loading thousands of reference items and the many other things we need to create. The building of structured and semantically explicit knowledge is somewhat inherently slow and deliberate anyway, so we may not really care if a process has to run all night to load new information that won't change much after that point except incrementally. But we do need to better understand what's imposed by the infrastructure and how we best handle our use cases.
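
To put rough numbers on the overnight-load question, a back-of-the-envelope estimate is enough. This is a sketch only: the function name is hypothetical, and the per-edit delay is an input to be measured, not a known property of our setup (as I understand it, pywikibot's `put_throttle` setting governs the delay between writes).

```python
def estimate_load_hours(n_edits: int, seconds_per_edit: float) -> float:
    """Estimate wall-clock hours for a throttled, sequential load."""
    return n_edits * seconds_per_edit / 3600
```

For example, 5,000 edits at 10 seconds each works out to about 13.9 hours, which is roughly the "run all night" scale described above. Each item may need several edits (creation, aliases, claims), so the edit count can be a multiple of the item count.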

In [None]:
# Missing properties
missing_properties = geokb_properties[~geokb_properties.label.isin(df_properties.propertyLabel)]
missing_properties

In [None]:
for index, row in missing_properties.iterrows():
    wb_entity = build_entity(
        label=row.label,
        description=row.description,
        datatype=row.datatype
    )

    results = add_entity(
        entitytype='property',
        summary='Adding new property from GeoKB initialization spreadsheet',
        site=geokb_site,
        item=wb_entity
    )

    display(results)

In [None]:
# Need to come back to better understand how to deal with property changes and deprecation over time.

bad_prop = pwb.PropertyPage(geokb_site.data_repository(), 'P1')
bad_prop.get()
# bad_prop.delete()

### Item creation and development

Since the looping process to build items is somewhat slow and deliberative with pywikibot anyway, it doesn't matter if we need to check each classification item from our initialization spreadsheet, decide if it's already in the GeoKB, and then build it if necessary.

Right now, this seems like a slow and clunky way to introduce things to the GeoKB. I ended up breaking out the piece on adding aliases because I kept failing to get the item encoding right for the `site.simple_request()` method. But recording aliases individually for items as they come up is something we will do pretty often, and having these inserted with a summary statement for provenance will be useful.

### Come Back Later

Everything from here through the claims section below is not the way I want to run this. Rather, I think I need to simplify by running through each item in the baseline set of classification entities to do the following:
1. Check whether the item exists.
2. If it does not exist, create it with a new function that instantiates the page and then incrementally adds label, description, and aliases.
3. If it exists, verify that the description and aliases align with the init spreadsheet and reset them if needed.
4. If it exists, query for its subclasses by ID and set/reset them if needed.
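
The first three steps above can be sketched as a pure decision function that is separate from the code that talks to Wikibase, which makes the logic easy to test. All names here (`split_aliases`, `plan_item_actions`, and the action strings) are hypothetical, and the dictionary keys assume the spreadsheet columns `label`, `description`, and `aliases` used elsewhere in this notebook.

```python
def split_aliases(value) -> set:
    """Normalize a comma-separated alias string into a set of lowercase aliases."""
    if not isinstance(value, str):
        return set()
    return {a.strip().lower() for a in value.split(',') if a.strip()}

def plan_item_actions(expected: dict, existing) -> list:
    """List the actions needed to bring one item in line with the spreadsheet.

    expected: spreadsheet row with 'label', 'description', and 'aliases'
    existing: matching GeoKB record with 'description' and 'aliases',
              or None when no item with that label was found
    """
    if existing is None:
        return ['create']

    actions = []
    if existing.get('description') != expected.get('description'):
        actions.append('reset_description')
    # Only aliases in the spreadsheet but missing from the item need adding
    if split_aliases(expected.get('aliases')) - split_aliases(existing.get('aliases')):
        actions.append('add_aliases')
    return actions
```

The subclass check in step 4 would follow the same set-difference pattern once the expected parent classes are available per row.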

In [None]:
# This is wasted processing to first check everything and then go back and do stuff; 
# need to just run through everything and do it
geokb_items = []
geokb_subclass_properties = []
for index, row in geokb_classes.iterrows():
    df_labeled_item = sparql_query(
        endpoint=sparql_endpoint,
        output='dataframe',
        query=query_by_item_label(label=row.label)
    )
    if not df_labeled_item.empty:
        geokb_items.append(df_labeled_item)
        item_id = df_labeled_item.iloc[0]['item'].split('/')[-1]
        geokb_subclass_properties.append(
            sparql_query(
                endpoint=sparql_endpoint,
                output='dataframe',
                query=query_item_subclasses(
                    item_id=item_id,
                    # Come back to this
                    subclass_property_id='P13'
                )
            )
        )

df_geokb_items = pd.concat(geokb_items)
df_geokb_items = df_geokb_items.drop_duplicates()
df_geokb_items['item'] = df_geokb_items.item.apply(lambda x: x.split('/')[-1])

df_check_baseline_classes = pd.merge(
    left=geokb_classes,
    right=df_geokb_items,
    how='left',
    left_on='label',
    right_on='itemLabel',
    indicator=True
)

df_check_baseline_classes['aliases'] = df_check_baseline_classes.aliases.apply(lambda x: [i.strip().lower() for i in x.split(',')] if isinstance(x, str) else [])
df_check_baseline_classes['itemAltLabel'] = df_check_baseline_classes.itemAltLabel.apply(lambda x: [i.strip().lower() for i in x.split(',')] if isinstance(x, str) else [])

df_check_baseline_classes['missing_aliases'] = df_check_baseline_classes.apply(
    lambda x: list(set(x.aliases) - set(x.itemAltLabel)),
    axis=1
)

missing_classes = df_check_baseline_classes[df_check_baseline_classes._merge != 'both']
missing_aliases = df_check_baseline_classes[df_check_baseline_classes.missing_aliases.str.len() > 0]
modified_descriptions = df_check_baseline_classes[
    df_check_baseline_classes.itemDescription != df_check_baseline_classes.description
]

In [None]:
df_geokb_item_subclasses = pd.concat(geokb_subclass_properties)


In [None]:
df_geokb_item_subclasses

In [None]:
# This was getting way too complicated, redo

geokb_ids = []

for index, row in geokb_classes.iterrows():
    query_results = sparql_query(
        endpoint=sparql_endpoint,
        output='raw',
        query=query_by_item_label(label=row.label)
    )

    item_id = None
    item_aliases = None

    if len(query_results['results']['bindings']) > 1:
        print("MORE THAN ONE RESULT:", row.label)
    elif len(query_results['results']['bindings']) == 1:
        item = query_results['results']['bindings'][0]
        item_id = item['item']['value'].split('/')[-1]
        if 'itemAltLabel' in item:
            item_aliases = item['itemAltLabel']['value']
        print("ITEM EXISTS", row.label, item_id)
    elif len(query_results['results']['bindings']) == 0:
        response = add_entity(
            entitytype='item',
            summary='Adding classification item from GeoKB initialization spreadsheet',
            site=geokb_site,
            item=build_entity(
                label=row.label,
                description=row.description
            )
        )
        item_id = response['entity']['id']
        print("ADDED ITEM", row.label, item_id)

    geokb_ids.append({
        'label': row.label,
        'geokb_id': item_id
    })

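    # Only update aliases when exactly one item ID was resolved above
    if item_id and isinstance(row.aliases, str):
        aliases_to_update = [a.strip() for a in row.aliases.split(',')]
        if item_aliases:
            # itemAltLabel comes back as a single comma-separated string
            existing_aliases = {a.strip() for a in item_aliases.split(',')}
            # Keep only spreadsheet aliases that the item does not already have
            aliases_to_update = [a for a in aliases_to_update if a not in existing_aliases]
        if aliases_to_update:
            display(update_aliases(
                site=geokb_site,
                entity_id=item_id,
                aliases=aliases_to_update
            ))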

### Claims

I'm stuck here (see below) trying to work out the process from https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial/Setting_statements and looking at what J. Olivero did with his initial bot code. It should work, but I'm making the items nonfunctional somehow.

In [43]:
def add_claim(site: pwb.APISite, subject_item_id: str, property_id: str, claim_value: str, prov_statement: str):
    repo = site.data_repository()

    # Establish connection to item
    subject_item = pwb.ItemPage(repo, subject_item_id)
    # Debugging short-circuit; uncomment to inspect the item without adding a claim
    # return subject_item
    # Verify item exists
    try:
        subject_item.get()
    except Exception as e:
        raise ValueError(e)
    
    # Establish connection to property
    property_item = pwb.PropertyPage(repo, property_id)
    # Verify property exists and get datatype
    try:
        property_datatype = property_item.get()['datatype']
        subject_claim = pwb.Claim(repo, property_id)
    except Exception as e:
        raise ValueError(e)
    
    if property_datatype == 'wikibase-item':
        # Get item target and verify exists
        claim_object = pwb.ItemPage(repo, claim_value)
        try:
            claim_object.get()
        except Exception as e:
            raise ValueError(e)
    else:
        # Need to handle other cases where we properly test/format an appropriate claim target
        raise NotImplementedError(f"No handling yet for datatype '{property_datatype}'")

    # Set the target for the claim
    subject_claim.setTarget(claim_object)
    # Need to handle additional work of adding references and qualifiers

    # Commit the claim to wikibase
    try:
        subject_item.addClaim(subject_claim, summary=prov_statement)
    except Exception as e:
        # Surface the underlying error; a duplicate claim is only one possible cause
        raise ValueError(f"Could not add claim: {e}") from e
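
The function above only handles the `wikibase-item` datatype. Before calling out to Wikibase at all, the claim value could be checked against the expected form for the property's datatype. This is a minimal sketch under stated assumptions: the names `ENTITY_ID_PATTERNS` and `check_claim_value` are hypothetical, and only the two entity-reference datatypes are covered; everything else is left unimplemented on purpose.

```python
import re

# Expected identifier forms for entity-reference datatypes
ENTITY_ID_PATTERNS = {
    'wikibase-item': re.compile(r'^Q[1-9]\d*$'),
    'wikibase-property': re.compile(r'^P[1-9]\d*$'),
}

def check_claim_value(datatype: str, claim_value: str) -> bool:
    """Return True when claim_value has the identifier form the datatype expects."""
    pattern = ENTITY_ID_PATTERNS.get(datatype)
    if pattern is None:
        # Strings, quantities, times, coordinates, etc. still need their own handling
        raise NotImplementedError(f"No handling yet for datatype '{datatype}'")
    return bool(pattern.match(claim_value))
```

A check like this catches a swapped argument (for example, passing `P13` as the claim value) before any write is attempted.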

I'm leaving off here with a problem in this function. An AttributeError is raised when I try to get an item on which I previously set a claim with the function; it fails on the first step, running get() on the item. addClaim() appears to work, as the claim shows up on the item in the GUI as expected, but I can then no longer run get() on the item in code without this error.

In [45]:
add_claim(
    site=geokb_site,
    subject_item_id='Q7', # organization
    property_id='P13', # subclass of
    claim_value='Q5', # entity
    prov_statement='Classifying item as subclass of entity'
)

In [51]:
test_item = pwb.ItemPage(geokb_site.data_repository(), 'Q33')

In [54]:
# I have no idea what this error means
try:
    test_item.get()
except Exception as e:
    print(e)

DataSite instance has no attribute 'entity_sources'
