This notebook is a work-in-progress process for initializing the Geoscience Knowledgebase (GeoKB) with a set of properties and foundational semantics that give us a base for curating geoscientific knowledge. Our initial use cases involve integrating mineral occurrence information with the document references that support those facts, along with other concepts associated with conducting mineral resource assessments. To build a useful knowledge graph on these concepts, though, we also need to bring in the supporting concepts that claims about them link to, so that every link points to something that legitimately exists in the graph.

For instance, we use "NI 43-101 Technical Reports" and the newer "SK-1300 Technical Report" as sources for claims about mining projects/properties, such as geographic location, mineral commodities identified and/or extracted, estimates of ore grade and tonnage, and other details. These are technical geoscientific reports required by the Canadian and U.S. governments, respectively. We need an "instance of" (rdf:type) claim on everything like this in the system. While we could simply create items in the graph to represent these two classes with no further classification, it is useful to "work back up the semantic hierarchy" for as many concepts as we can, as far as we need to, so that the information we record in the GeoKB is understandable in the broader global knowledge commons (Wikidata and/or other efforts).

The initialization process here is designed to give us a semantic baseline to work from as we pull in the information and connections we really care about within this context. We're taking a pragmatic approach that is slightly more rigorous (and certainly more streamlined) than the wild west of Wikidata but somewhere short of endlessly academic. We have to get a whole bunch of information into the GeoKB to support analytical use, so we make a best effort to align what we have with mature ontologies and namespaces, knowing we'll have to evolve it over time. The notebook approach on this gives us a good basis to record our reasoning and the places we have to make pragmatic tradeoffs.

In [1]:
import os
import pywikibot as pwb
import json
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON
from nested_lookup import nested_lookup

## Parameters

If we want to run this and other notebooks as Lambdas eventually, we'll need to set up some parameters. I moved the values we want to keep somewhat secret to environment variables as a safety measure.

In [2]:
sparql_endpoint = os.environ['SPARQL_ENDPOINT']
wb_domain = os.environ['WB_DOMAIN']
geokb_init_sheet_id = '1dbuKc4cZJz0YY81B2xWXM5fId6gWgzmQar3hg3CI0Rw'
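
A minimal sketch of a stricter way to read these parameters, so that a missing variable fails with a readable message instead of a bare `KeyError`. The helper name `require_env` is hypothetical and not part of the notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        # Covers both an unset variable and one set to an empty string
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, the cell above could read `sparql_endpoint = require_env('SPARQL_ENDPOINT')`.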

## Functions

All or most of these functions should be movable to the abstract Wikibase management python package we are designing. That needs to be applicable to the GeoKB but generic enough to apply in other types of domains and circumstances. There are other communities doing similar work such as the wikidataintegrator project in the health sciences. We just found a need to start from the basics of pywikibot and how it operates.

I realize I'm introducing what could be higher overhead than necessary here with use of Pandas and other specialized packages. We can reevaluate that at the foundational level of the KB management package.

In [72]:
def get_wb(site_name: str, language='en'):
    site = pwb.Site(language, site_name)
    site.login()
    return site

def build_entity(
        label: str, 
        description: str, 
        language: str = 'en',
        datatype = None
    ) -> dict:
    item = {
        'labels': {
            language: {
                'language': language,
                'value': label
            }
        },
        'descriptions': {
            language: {
                'language': language,
                'value': description
            }
        }
    }

    if datatype:
        # Need to work out our own logic on what datatypes we will accept
        item["datatype"] = datatype
 
    return item

def add_entity(entitytype: str, summary: str, site: pwb.APISite, item: dict) -> dict:
    if entitytype not in ['item','property']:
        raise ValueError(f"entitytype must be 'item' or 'property', got '{entitytype}'")
    
    # Properties cannot be created without a datatype
    if entitytype == 'property' and 'datatype' not in item:
        raise ValueError("a property requires a 'datatype' key in item")

    params = {
        'action': 'wbeditentity',
        'new': entitytype,
        'data': json.dumps(item),
        'summary': summary,
        'token': site.tokens['csrf'],
    }

    try:
        req = site.simple_request(**params)
        results = req.submit()
        return results
    except Exception as e:
        # Return the error in a dict so callers always get the declared return type
        return {"error": str(e)}

def update_aliases(site: pwb.APISite, entity_id: str, aliases):
    item = pwb.ItemPage(site.data_repository(), entity_id)
    alias = {"en": aliases.split(',') if isinstance(aliases, str) else aliases}

    messages = []
    for key in alias:
        message = "Settings aliases: {} = '{}'".format(key, alias[key])
        try:
            item.editAliases(
                {key: alias[key]},
                summary=message
            )
            messages.append({"message": message})
        except Exception as e:
            messages.append({
                "message": message,
                "error": e
            })
    return messages
    
def sparql_query(endpoint: str, output: str, query: str):
    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)

    sparql.setQuery(query)
    results = sparql.queryAndConvert()

    if output == 'raw':
        return results
    elif output == 'dataframe':
        names = results["head"]["vars"]
        data = []
        for name in names:
            data.append(
                nested_lookup('value', nested_lookup(name, results['results']['bindings']))
            )
        return pd.DataFrame.from_dict(dict(zip(names, data)))
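
The `dataframe` branch above builds one list per variable, which assumes every result row binds every variable. A hedged alternative sketch, assuming the standard SPARQL JSON results layout, flattens the bindings row by row so that an unbound variable becomes `None` rather than shifting a column. The function name `bindings_to_rows` is hypothetical.

```python
def bindings_to_rows(results: dict) -> list:
    """Flatten SPARQL JSON results into one dict per result row."""
    names = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        # A variable that is unbound in this row becomes None
        rows.append({name: binding.get(name, {}).get("value") for name in names})
    return rows
```

The rows can then be passed to `pd.DataFrame(rows)` to get the same kind of table.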

## Foundational Properties and Classes

Looking toward the initial use cases for the GeoKB, we need to lay down a basic structure for properties and classification such that the items we need to introduce will all have an appropriate instance of claim pointing to a reasonable concept for basic organization. We could go on forever trying to get the semantics just right and aligned with as many sources of definition as possible, but we'll ease into that level of sophistication over time. In the near term, I've started a Google Sheet to contain the basic structure to get our knowledgebase initialized. We will likely need to iterate on this several times to get it right, and we will doubtless miss some things.

There are lots of ways to spin up these details, but a Google Sheet seems simple enough, it can be edited by multiple people, and we can read it as a CSV and process it here in the notebook. I've built this with a Properties and a Classification sheet that we'll iterate on as we improve the foundation.

In [4]:
def ref_table(reference, sheet_id=geokb_init_sheet_id):
    return f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={reference}'

geokb_properties = pd.read_csv(ref_table('Properties'))
geokb_classes = pd.read_csv(ref_table('Classification'))
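
Since the sheet is edited by hand, it may be worth checking rows before anything is written to the Wikibase. The sketch below is an assumption about what such a check could look like: the set of accepted datatypes is simply the set that appears in the Properties sheet, not a complete list of Wikibase datatypes, and `find_invalid_properties` is a hypothetical name.

```python
# Datatypes that appear in the Properties sheet; an assumption, not a full list
ACCEPTED_DATATYPES = {
    "wikibase-item", "globe-coordinate", "url", "string", "time", "quantity"
}

def find_invalid_properties(rows: list) -> list:
    """Return (label, problem) pairs for rows that should not be loaded."""
    problems = []
    for row in rows:
        label = row.get("label")
        # An empty cell read by pandas shows up as NaN (a float), not a string
        if not isinstance(label, str) or not label.strip():
            problems.append((label, "missing label"))
        elif row.get("datatype") not in ACCEPTED_DATATYPES:
            problems.append((label, f"unsupported datatype: {row.get('datatype')}"))
    return problems
```

The rows could come from `geokb_properties.to_dict('records')`.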

In [5]:
display(geokb_properties)
display(geokb_classes)

| | label | description | datatype | wd_id |
|---|---|---|---|---|
| 0 | instance of | this item is a class of that item;type of the ... | wikibase-item | P31 |
| 1 | subclass of | this item is a subclass (subset) of that item | wikibase-item | P279 |
| 2 | coordinate location | a geographic point indicating the location of ... | globe-coordinate | P625 |
| 3 | reference item | a knowledgebase item that serves as the source... | wikibase-item | |
| 4 | reference url | a web link provided as the source reference fo... | url | |
| 5 | reference statement | a short statement about the source for a claim... | string | |
| 6 | publication date | date or point in time when a work represented ... | time | P577 |
| 7 | subject matter | topic or subject addressed by the item | wikibase-item | P921 |
| 8 | ranking | used as statement qualifier indicating relativ... | quantity | P1352 |


| | label | description | aliases | subclass of | wd_id |
|---|---|---|---|---|---|
| 0 | entity | anything that can be considered, discussed, or... | thing | | Q35120 |
| 1 | person | common name of Homo sapiens, unique extant spe... | human | entity | Q5 |
| 2 | organization | social entity established to meet needs or pur... | | entity | Q43229 |
| 3 | document | form for preservation of structured and identi... | | entity | Q49848 |
| 4 | publication | content made available to the general public | | entity | Q732577 |
| 5 | scholarly article | article in an academic publication, usually pe... | research article,scientific article,journal ar... | publication | Q13442814 |
| 6 | report | informational, formal, and detailed text | | document | Q10870555 |
| 7 | government report | document written by a government to convey inf... | | report | Q15629444 |
| 8 | NI 43-101 Technical Report | National Instrument 43-101 (the "NI 43-101" or... | | report | |
| 9 | USGS Report Series | official, USGS-authored publications of the U.... | USGS Series Report | government report | |


## Connecting to Wikibase

Many read operations can be accommodated by the public SPARQL interface with no need for specialized wikibase connections. At the moment, the Elasticsearch part of the tech stack is not connecting properly, and some of the dependent API functionality in the pywikibot package is not working. The write operations for properties and items are working fine, and we need to establish a connection to the APISite via pywikibot and the user/password bot config we set up previously.

That bot configuration lives in user-config.py and user-password.py files in the directory from which pywikibot is run, which is not the most secure way to handle things. We may experiment with the more secure OAuth methods down the road once we get the mechanics worked out. The following sets up the connection for later use and tests it for functionality. I'm not sure yet about the significance of the errors I'm seeing.

In [47]:
# Required connection points in the pwb API
geokb_site = get_wb('geokb')

print(type(geokb_site))
print(type(geokb_site.data_repository()))

<class 'pywikibot.site._apisite.APISite'>
<class 'pywikibot.site._datasite.DataSite'>


## Baseline Knowledgebase

We may or may not be starting from a blank slate Wikibase installation. We know we need to have everything laid out in our Properties and Classification tables, but some of it may already exist. We also need to build claims on these items that use property and item identifiers, so we need to determine exactly what's in the current system. My initial attempt to use the search functionality based on pywikibot and the currently non-functional API failed, so I'm experimenting with a SPARQL approach. This is somewhat handy in that we can leverage the SPARQL query builder to figure out our queries, first with a test against the full Wikidata platform and then our own instance. Those queries can be pulled in here based on how I set up the function to get us a result to work from.

In [7]:
def query_by_item_label(label: str) -> str:
    query_string = """
        SELECT ?item ?itemLabel ?itemDescription WHERE{  
        ?item ?label "%s"@en.  
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }    
        }
    """ % (label)

    return query_string

property_query = """
SELECT ?property ?propertyLabel ?propertyDescription WHERE {
    ?property a wikibase:Property .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" .}
 }
 """


I can get a list of all properties in the instance, but I can't yet figure out a way to get a list of all items. With a full set of properties, we can figure out what we have that's in the baseline set, get rid of anything extraneous, and add anything new. We could do the same with items to a certain point.
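
The three-way comparison described here (in the baseline, extraneous, and new) can be sketched with plain sets of labels. This is only an illustration: it assumes labels are a good enough key for matching, and `compare_labels` is a hypothetical helper.

```python
def compare_labels(baseline, existing) -> dict:
    """Partition labels into missing, extraneous, and shared groups."""
    baseline, existing = set(baseline), set(existing)
    return {
        "missing": sorted(baseline - existing),      # in the sheet, not in the instance
        "extraneous": sorted(existing - baseline),   # in the instance, not in the sheet
        "shared": sorted(baseline & existing),       # present in both
    }
```

It could be called as `compare_labels(geokb_properties.label, df_properties.propertyLabel)` once both tables are loaded.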

In [24]:
df_properties = sparql_query(sparql_endpoint, 'dataframe', property_query)
df_properties['id'] = df_properties.property.apply(lambda x: x.split('/')[-1])
df_properties

| | property | propertyLabel | propertyDescription | id |
|---|---|---|---|---|
| 0 | http://wikibase.svc/entity/P1 | imported from Wikimedia project | This is a test description for imported from W... | P1 |
| 1 | http://wikibase.svc/entity/P2 | instance of | This is a test description for instance of | P2 |
| 2 | http://wikibase.svc/entity/P3 | mrds id | This is a test description for mrds id | P3 |
| 3 | http://wikibase.svc/entity/P4 | reference url | This is a test description for reference url | P4 |
| 4 | http://wikibase.svc/entity/P5 | country | This is a test description for country | P5 |
| 5 | http://wikibase.svc/entity/P6 | dep id | This is a test description for dep id | P6 |
| 6 | http://wikibase.svc/entity/P7 | coordinate location | This is a test description for coordinate loca... | P7 |
| 7 | http://wikibase.svc/entity/P8 | state | This is a test description for state | P8 |
| 8 | http://wikibase.svc/entity/P9 | stated in | This is a test description for stated in | P9 |
| 9 | http://wikibase.svc/entity/P10 | described by source | This is a test description for described by so... | P10 |


So far, the best I've come up with is a query on a given item by label or other criteria. I need to work out the best way to check through our baseline, and iterating through the whole list seems kind of dumb. For whatever reason, I can't apply a criterion for items similar to the one I used for properties, like...

e.g., `?item a wikibase:Item`
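
One possible workaround, offered as an untested assumption for this instance: select anything with an English label whose entity URI starts with the item prefix. The prefix `http://wikibase.svc/entity/` is taken from the property results above. The function name `items_query` is hypothetical.

```python
def items_query(entity_prefix: str, limit: int = 100) -> str:
    """Build a SPARQL query listing entities whose URI starts with <prefix>Q."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?item ?itemLabel WHERE {\n"
        "  ?item rdfs:label ?itemLabel .\n"
        '  FILTER(LANG(?itemLabel) = "en")\n'
        # Item identifiers begin with Q, so this excludes properties (P...)
        f'  FILTER(STRSTARTS(STR(?item), "{entity_prefix}Q"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )
```

The result can be passed to `sparql_query(sparql_endpoint, 'dataframe', items_query('http://wikibase.svc/entity/'))`. Items without an English label would be missed by this approach.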

In [None]:
df = sparql_query(sparql_endpoint, 'dataframe', query_by_item_label('human'))
df['item'] = df.item.apply(lambda x: x.split('/')[-1])
df
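
Note that `?label` in `query_by_item_label` is a SPARQL variable, so the pattern matches any predicate whose value is that string, which can include aliases and descriptions as well as labels. If an exact label match is wanted, a sketch along these lines may help. It assumes labels are exposed through `rdfs:label`, and it escapes quotes so a label containing `"` does not break the query. The name `query_by_exact_label` is hypothetical.

```python
def query_by_exact_label(label: str, language: str = "en") -> str:
    """Build a SPARQL query matching items by rdfs:label only."""
    # Escape backslashes first, then double quotes, so the label stays one literal
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?item WHERE {\n"
        f'  ?item rdfs:label "{escaped}"@{language} .\n'
        "}"
    )
```

Whether matching on aliases is desirable is a design choice; the looser query above has the side effect that a search for an alias such as "human" can find the "person" item.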

## Building properties and items

I left off here after proving that I can at least create items. This last bit follows the [tutorial](https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial/Labels) on building and editing items, with the aim of including richer provenance (history). I'll come back to build this into functional logic.

### Deletion

After some reading on this, it looks like deleting pages in a MediaWiki instance can only be done by users with administrative privileges. This is probably a good safeguard for us to leverage, but we need to suss out a usable workflow. There appears to be some type of functionality to mark a page for deletion, presumably with an administrator then concurring and executing the deletion.

### Looping vs. something more efficient

The pywikibot engine appears to impose some throttling on looped operations like adding entities. This doesn't matter when running something like a basic initialization process with a handful of properties and classification items. It might or might not matter when we get to loading thousands of reference items and the many other things we need to create. The building of structured and semantically explicit knowledge is somewhat inherently slow and deliberate anyway, so we may not really care if a process has to run all night to load new information that won't change much after that point except incrementally. But we do need to better understand what's imposed by the infrastructure and how we best handle our use cases.
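
To get a feel for the scale, here is a back-of-the-envelope sketch. The pauses logged in this notebook run from about 7.5 to 9.9 seconds per write, so the default of 10 seconds per edit is a rough assumption, not a measured constant. These pauses appear to correspond to pywikibot's `put_throttle` configuration setting, though that has not been confirmed for this setup.

```python
def estimate_load_hours(n_edits: int, seconds_per_edit: float = 10.0) -> float:
    """Estimate wall-clock hours for a throttled, one-edit-at-a-time load."""
    return n_edits * seconds_per_edit / 3600
```

Under that assumption, a few thousand edits land in the range of an overnight run, which matches the expectation above.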

In [25]:
# Missing properties
missing_properties = geokb_properties[~geokb_properties.label.isin(df_properties.propertyLabel)]
missing_properties

Unnamed: 0,label,description,datatype,wd_id


In [23]:
for index, row in missing_properties.iterrows():
    wb_entity = build_entity(
        label=row.label,
        description=row.description,
        datatype=row.datatype
    )

    results = add_entity(
        entitytype='property',
        summary='Adding new property from GeoKB initialization spreadsheet',
        site=geokb_site,
        item=wb_entity
    )

    display(results)

{'entity': {'type': 'property',
  'datatype': 'string',
  'id': 'P21',
  'labels': {'en': {'language': 'en', 'value': 'reference statement'}},
  'descriptions': {'en': {'language': 'en',
    'value': 'a short statement about the source for a claiml;used when no viable item or URL is available'}},
  'aliases': {},
  'claims': {},
  'lastrevid': 19},
 'success': 1}

Sleeping for 9.4 seconds, 2023-02-19 08:04:23


{'entity': {'type': 'property',
  'datatype': 'time',
  'id': 'P22',
  'labels': {'en': {'language': 'en', 'value': 'publication date'}},
  'descriptions': {'en': {'language': 'en',
    'value': 'date or point in time when a work represented by a knowledgebase item was first published or released'}},
  'aliases': {},
  'claims': {},
  'lastrevid': 20},
 'success': 1}

Sleeping for 9.9 seconds, 2023-02-19 08:04:32


{'entity': {'type': 'property',
  'datatype': 'wikibase-item',
  'id': 'P23',
  'labels': {'en': {'language': 'en', 'value': 'subject matter'}},
  'descriptions': {'en': {'language': 'en',
    'value': 'topic or subject addressed by the item'}},
  'aliases': {},
  'claims': {},
  'lastrevid': 21},
 'success': 1}

Sleeping for 9.8 seconds, 2023-02-19 08:04:42


{'entity': {'type': 'property',
  'datatype': 'quantity',
  'id': 'P24',
  'labels': {'en': {'language': 'en', 'value': 'ranking'}},
  'descriptions': {'en': {'language': 'en',
    'value': 'used as statement qualifier indicating relative ranking'}},
  'aliases': {},
  'claims': {},
  'lastrevid': 22},
 'success': 1}

In [26]:
# Need to come back to better understand how to deal with property changes and deprecation over time.

bad_prop = pwb.PropertyPage(geokb_site.data_repository(), 'P1')
bad_prop.get()
# bad_prop.delete()

{'labels': <class 'pywikibot.page._collections.LanguageDict'>({'en': 'imported from Wikimedia project'}),
 'descriptions': <class 'pywikibot.page._collections.LanguageDict'>({'en': 'This is a test description for imported from Wikimedia project'}),
 'aliases': <class 'pywikibot.page._collections.AliasesDict'>({}),
 'claims': <class 'pywikibot.page._collections.ClaimCollection'>({}),
 'datatype': 'url'}

### Item creation and development

Since the looping process to build items is somewhat slow and deliberative with pywikibot anyway, it doesn't matter if we need to check each classification item from our initialization spreadsheet, decide if it's already in the GeoKB, and then build it if necessary.

Right now, this seems like a slow and clunky way to introduce things to the GeoKB. I ended up breaking out the piece on adding aliases because I kept failing to get the item encoding right for the `simple_request()` method on the site object. But recording aliases individually for items as they come up is also something we are going to do pretty often, and having these inserted with a summary statement for provenance will be useful.
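
The alias column arrives either as a comma-separated string or as an empty cell. A small sketch of normalizing it before calling `update_aliases` follows; `parse_aliases` is a hypothetical helper, and it assumes that commas never appear inside a single alias.

```python
def parse_aliases(raw) -> list:
    """Split a comma-separated alias cell into a clean, de-duplicated list."""
    if not isinstance(raw, str):
        return []  # pandas represents an empty cell as NaN (a float)
    seen, aliases = set(), []
    for part in raw.split(","):
        alias = part.strip()
        # Skip blanks from stray commas and keep the first copy of duplicates
        if alias and alias not in seen:
            seen.add(alias)
            aliases.append(alias)
    return aliases
```

Because `update_aliases` accepts a list as well as a string, the cleaned list could be passed straight through.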

In [73]:
for index, row in geokb_classes.iterrows():
    query_results = sparql_query(
        endpoint=sparql_endpoint,
        output='raw',
        query=query_by_item_label(label=row.label)
    )

    if query_results['results']['bindings']:
        item_id = query_results['results']['bindings'][0]['item']['value'].split('/')[-1]
        print("ITEM EXISTS", row.label, item_id)
    else:
        response = add_entity(
            entitytype='item',
            summary='Adding classification item from GeoKB initialization spreadsheet',
            site=geokb_site,
            item=build_entity(
                label=row.label,
                description=row.description
            )
        )
        item_id = response['entity']['id']
        print("ADDED ITEM", row.label, item_id)

    if isinstance(row.aliases, str):
        display(update_aliases(
            site=geokb_site,
            entity_id=item_id,
            aliases=row.aliases
        ))


ITEM EXISTS entity Q5


[{'message': "Settings aliases: en = '['thing']'"}]

ITEM EXISTS person Q3


Sleeping for 9.0 seconds, 2023-02-19 10:22:22


[{'message': "Settings aliases: en = '['human']'"}]

ITEM EXISTS organization Q6
ITEM EXISTS document Q4
ITEM EXISTS publication Q7
ITEM EXISTS scholarly article Q30


Sleeping for 8.0 seconds, 2023-02-19 10:22:33


[{'message': "Settings aliases: en = '['research article', 'scientific article', 'journal article']'"}]

ITEM EXISTS report Q9
ITEM EXISTS government report Q10
ITEM EXISTS NI 43-101 Technical Report Q11
ITEM EXISTS USGS Report Series Q31


Sleeping for 7.5 seconds, 2023-02-19 10:22:43


[{'message': "Settings aliases: en = '['USGS Series Report']'"}]

ITEM EXISTS dataset Q13
ITEM EXISTS scientific model Q14
ITEM EXISTS project Q15
ITEM EXISTS research project Q32


Sleeping for 8.0 seconds, 2023-02-19 10:22:53


[{'message': "Settings aliases: en = '['science project', 'scientific study']'"}]

ITEM EXISTS phenomenon Q17
ITEM EXISTS natural phenomenon Q18
ITEM EXISTS material Q19
ITEM EXISTS natural material Q20
ITEM EXISTS mineral occurrence Q21
ITEM EXISTS mineral deposit Q22
ITEM EXISTS ore Q23
ITEM EXISTS ore deposit Q24
ITEM EXISTS mineral Q25
ITEM EXISTS mineral deposit model Q26


### Claims

I'm stymied again for just a bit now on the process of building claims on items. This should be similar, in some respects, to updating an item with aliases where we have an existing item and need to add claims/statements to it. But I have to figure out the anatomy of a claim as I'm getting an API error now after adding a statement by hand and trying to get() the item. Jonathan came up with an approach in his code, so I'll work through that and see if I can figure out what's really going on with this piece.
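
As a reference point for the anatomy of a claim, here is a sketch of the JSON structure that the Wikibase data model uses for a statement whose value is another item. This is my reading of the Wikibase JSON format rather than something verified against this instance, and `build_item_claim` is a hypothetical helper. The identifiers in the usage note are illustrative only.

```python
def build_item_claim(property_id: str, target_id: str) -> dict:
    """Build the JSON for a statement pointing at another item."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {
                "value": {
                    "entity-type": "item",
                    # The numeric id is the Q identifier without its letter prefix
                    "numeric-id": int(target_id[1:]),
                    "id": target_id,
                },
                "type": "wikibase-entityid",
            },
        },
        "type": "statement",
        "rank": "normal",
    }
```

For example, `build_item_claim('P2', 'Q5')` would describe an "instance of" claim pointing at the "entity" item, using the identifiers listed earlier in this notebook. A dict like `{'claims': [claim]}` is the shape the `wbeditentity` action expects in its `data` parameter. pywikibot also provides `Claim`, `setTarget`, and `ItemPage.addClaim` as a higher-level route.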

In [89]:
entity = pwb.ItemPage(geokb_site.data_repository(), 'Q3')
entity.get()['claims']

<class 'pywikibot.page._collections.ClaimCollection'>({})