This notebook is another experiment in the relatively crude process of building better metadata for items representing documents/publications in the GeoKB from the processing pipeline they are run through in xDD. I still need to run back through and clean up some previous examples as we refine how we want to go about this process in production.

We now have representations for all of the NI 43-101 Technical Reports and all USGS Series Reports (plus the majority of USGS-authored journal articles) in the GeoKB. All of those are piped through xDD (GeoDeepDive), where several pipelines are run, including some basic work to identify key scientific concepts that are addressed (or at least mentioned) in the documents. For fairly distinctive terms such as lithologies/rock types and minerals, this can give us a good indication of where documents address important subject matter. These terms are also not currently found in metadata from the Pubs Warehouse or our Zotero-managed technical reports. Pulling them in dynamically and adding them to the records for these pubs may prove a useful research aid.

The basic philosophy for this approach is to have the knowledge graph essentially drive itself. We start from things the knowledge graph knows about (documents and geoscientific concepts), consult a third-party processing infrastructure that also knows about those things, and record the suggested linkages back in the graph.

In [1]:
import requests
from wbmaker import WikibaseConnection
import pandas as pd

In [2]:
geokb = WikibaseConnection('GEOKB_CLOUD')

# Indexed Documents
In this exercise, we are extracting key subjects that have been identified in documents indexed through the xDD pipelines and linking those to the representations for those documents in the GeoKB. We may eventually get to the point where we kick off this process each time a new document and/or a new target scientific concept is organized into the GeoKB. It will be something automatic and dynamic where the knowledgebase is constantly (or maybe daily in practical terms) going out to work with AI and other assistive processes to build itself. In the near term, we're experimenting with the overall model and putting together building blocks.

In the following codeblocks, I have a function that gives us a mapping between the identifiers we will find in our third party infrastructure (xDD) and the local QIDs in the GeoKB that we'll need to operate on. This takes a few seconds to complete because there are about 100K records, but then we have a full mapping to work with.

In [3]:
def geokb_id_map(pid, limit=10000):
    id_lookup = {}
    offset = 0

    while True:
        id_query = f"""
        PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

        SELECT ?item ?id
        WHERE {{
            ?item wdt:{pid} ?id .
        }}
        LIMIT {limit}
        OFFSET {offset}
        """

        id_items = geokb.sparql_query(
            query=id_query,
            endpoint=geokb.sparql_endpoint,
            output="raw"
        )
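        # Stop when the endpoint returns nothing or an empty page;
        # an empty page would otherwise keep the loop running forever.
        if id_items is None or not id_items['results']['bindings']:
            break

        # Map the external identifier to the bare QID at the end of the entity URI
        for x in id_items['results']['bindings']:
            id_lookup[x['id']['value']] = x['item']['value'].split('/')[-1]
        offset += limit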
    
    return id_lookup

In [4]:
gddid_lookup = geokb_id_map('P93')
doi_lookup = geokb_id_map('P74')
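
The paging logic in `geokb_id_map` can be tried without touching the SPARQL endpoint. The following is a minimal sketch, not the production function: `fake_fetch_page` and `FAKE_ROWS` are hypothetical stand-ins for the query call, and the identifier values are made up. It assumes the rows come back in the standard SPARQL JSON bindings shape that the function above already reads.

```python
def build_lookup(fetch_page, limit=2):
    """Collect external-id -> QID pairs page by page until a page comes back empty."""
    lookup = {}
    offset = 0
    while True:
        bindings = fetch_page(limit, offset)
        if not bindings:
            break
        for b in bindings:
            # Keep only the bare QID at the end of the entity URI
            lookup[b['id']['value']] = b['item']['value'].rsplit('/', 1)[-1]
        offset += limit
    return lookup


# Hypothetical rows standing in for SPARQL results (identifiers are invented)
FAKE_ROWS = [
    {'id': {'value': 'gddid-0001'}, 'item': {'value': 'https://geokb.wikibase.cloud/entity/Q40197'}},
    {'id': {'value': 'gddid-0002'}, 'item': {'value': 'https://geokb.wikibase.cloud/entity/Q29971'}},
    {'id': {'value': 'gddid-0003'}, 'item': {'value': 'https://geokb.wikibase.cloud/entity/Q34809'}},
]


def fake_fetch_page(limit, offset):
    """Return one page of the fake rows, mimicking LIMIT/OFFSET."""
    return FAKE_ROWS[offset:offset + limit]


example_lookup = build_lookup(fake_fetch_page, limit=2)
print(example_lookup)
```

Passing the page fetcher in as an argument keeps the loop testable. One caveat with this pattern in general: SPARQL does not guarantee a stable row order across `LIMIT`/`OFFSET` pages unless the query includes an `ORDER BY`.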

# Geoscientific Concepts

I'm still experimenting with the best way to encode all of the different reference scientific concepts we want to work with in the GeoKB/Wikibase model. I'm trying to find a balance between rigorous semantics and practical use. In the current iteration, I've built out a subset of lithology terms and minerals (including varieties and groups) that we are working with as a test case. Some of these include "same as" linkages to two different Macrostrat references that serve as the basis for the xDD indexing (meaning we will turn them up in the terms or snippets API for some of our documents).

In the following codeblock, I pull those items that have the macrostrat "defs" linkages and tee up the process that will hit the xDD API for results on those concepts. That gives me a data structure I can operate with to go after results.

In [5]:
macrostrat_def_same_as_query = """
  PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

  SELECT ?item ?itemLabel ?same_as
  WHERE {
    ?item wdt:P84 ?same_as .
    FILTER CONTAINS(STR(?same_as), "macrostrat.org/api/defs")
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
"""

macrostrat_def_links = geokb.sparql_query(
    query=macrostrat_def_same_as_query,
    endpoint=geokb.sparql_endpoint,
    output="dataframe"
)

macrostrat_def_links['concept_qid'] = macrostrat_def_links['item'].apply(lambda x: x.split('/')[-1])
macrostrat_def_links['dict_filter'] = macrostrat_def_links['same_as'].apply(lambda x: x.split('/')[5].split('?')[0])
macrostrat_def_links['term'] = macrostrat_def_links['same_as'].apply(lambda x: x.split('/')[5].split('?')[-1].split('=')[-1])

macrostrat_def_links['ni43101_snippets'] = macrostrat_def_links.apply(lambda x: f"https://geodeepdive.org/api/snippets?dict_filter={x['dict_filter']}&publisher=Geoarchive&term={x['term']}", axis=1)
macrostrat_def_links['usgs_snippets'] = macrostrat_def_links.apply(lambda x: f"https://geodeepdive.org/api/snippets?dict_filter={x['dict_filter']}&publisher=USGS&term={x['term']}", axis=1)
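
The string splitting above depends on the exact position of each path segment in the "same as" URL. As a sketch of a slightly more forgiving alternative, the same pieces can be pulled out with `urllib.parse`. The sample Macrostrat URL below, including its `lith` parameter name, is an assumed example of the URL shape rather than a value taken from the GeoKB, and `snippets_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode


def snippets_url(same_as, publisher):
    """Build an xDD snippets URL from a Macrostrat defs 'same as' URL."""
    parts = urlsplit(same_as)
    # Last path segment is the dictionary name, e.g. 'lithologies'
    dict_filter = parts.path.rstrip('/').split('/')[-1]
    # Value of the last query parameter is the term, e.g. 'peridotite'
    params = parse_qsl(parts.query)
    term = params[-1][1] if params else ''
    query = urlencode({
        'dict_filter': dict_filter,
        'publisher': publisher,
        'term': term,
    })
    return f"https://geodeepdive.org/api/snippets?{query}"


# Assumed example of a Macrostrat defs URL
sample_same_as = "https://macrostrat.org/api/defs/lithologies?lith=peridotite"
print(snippets_url(sample_same_as, "Geoarchive"))
```

Using `urlencode` also escapes terms that contain spaces or other reserved characters, which the f-string version does not do.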


# Retrieve Results from xDD

In this codeblock, I work through the API calls that will tell us which documents mention which concepts. I'm using the snippets route with xDD here, though the terms route would probably suffice. I'm still working out whether it is useful to incorporate the snippets of text where terms are found. The most promising approach I've experimented with is using the wiki page associated with each document item in the Wikibase instance to write out snippets and any other details from any source we might want to incorporate. With those in place, both humans and machines will have additional semi-structured, free-form material to work with in various ways.

In [6]:
geokb_doc_subjects = []

for index, row in macrostrat_def_links.iterrows():
    r_ni43101 = requests.get(row['ni43101_snippets']).json()
    if 'success' in r_ni43101 and r_ni43101['success']['data']:
        for item in r_ni43101['success']['data']:
            if item['_gddid'] in gddid_lookup:
                geokb_doc_subjects.append((gddid_lookup[item['_gddid']],row['concept_qid'],row['ni43101_snippets']))

    r_usgs_pubs = requests.get(row['usgs_snippets']).json()
    if 'success' in r_usgs_pubs and r_usgs_pubs['success']['data']:
        for item in r_usgs_pubs['success']['data']:
            if 'doi' in item and item['doi'] in doi_lookup:
                geokb_doc_subjects.append((doi_lookup[item['doi']],row['concept_qid'],row['usgs_snippets']))
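
The two branches of the loop above do the same thing with a different identifier key. A minimal sketch of that matching step as a single function follows, run against a mock payload so it needs no network access. The `success`/`data` nesting and the `_gddid` key mirror what the loop reads; `matched_claims`, the sample identifier values, and the shape of the error payload are hypothetical.

```python
def matched_claims(payload, id_key, lookup, concept_qid, source_url):
    """Return (document QID, concept QID, source URL) tuples for documents we know about."""
    data = payload.get('success', {}).get('data') or []
    return [
        (lookup[item[id_key]], concept_qid, source_url)
        for item in data
        if item.get(id_key) in lookup
    ]


# Mock response and lookup with invented identifiers
mock_payload = {
    'success': {
        'data': [
            {'_gddid': 'abc123', 'doi': '10.0000/example.1'},
            {'_gddid': 'zzz999'},  # not in the lookup, so it is skipped
        ]
    }
}
mock_lookup = {'abc123': 'Q40197'}
mock_url = ("https://geodeepdive.org/api/snippets"
            "?dict_filter=lithologies&publisher=Geoarchive&term=peridotite")

claims = matched_claims(mock_payload, '_gddid', mock_lookup, 'Q41757', mock_url)
print(claims)
```

Called once with `'_gddid'` and `gddid_lookup` and once with `'doi'` and `doi_lookup`, a function like this would cover both publishers. A payload without a `success` key simply yields an empty list.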

# Addresses Subject Claims
The preceding step gives us everything we need for committing dynamically generated content to the GeoKB. I wrote this as a list of tuples, each containing the subject item (the document representation we will be adding claims to), the object item (a geoscientific concept; a lithology term or mineral species/variety/group in this case), and the specific API call used to derive the claim. The API call serves as the reference for now: following it shows exactly how we derived the claim and leads a user to the snippets, which help evaluate whether the claim is worth writing into the Wikibase.

In [7]:
geokb_doc_subjects[:5]

[('Q40197',
  'Q41757',
  'https://geodeepdive.org/api/snippets?dict_filter=lithologies&publisher=Geoarchive&term=peridotite'),
 ('Q29971',
  'Q41757',
  'https://geodeepdive.org/api/snippets?dict_filter=lithologies&publisher=Geoarchive&term=peridotite'),
 ('Q34809',
  'Q41757',
  'https://geodeepdive.org/api/snippets?dict_filter=lithologies&publisher=Geoarchive&term=peridotite'),
 ('Q40195',
  'Q41757',
  'https://geodeepdive.org/api/snippets?dict_filter=lithologies&publisher=Geoarchive&term=peridotite'),
 ('Q34227',
  'Q41757',
  'https://geodeepdive.org/api/snippets?dict_filter=lithologies&publisher=Geoarchive&term=peridotite')]

# Commit to GeoKB
Once we have the claims data to work with, it's a matter of writing to the GeoKB. There are ultimately better ways of handling this at scale, but looping over a list grouped by the subject item we are adding to is reasonable enough at this point.
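
To make the grouping step concrete, here is a sketch of the same idea using only the standard library. It is not the code used for the commit. `group_by_subject` is a hypothetical helper, the third sample tuple (concept `Q99999`, term `basalt`) is invented for illustration, and unlike a plain `agg(list)` this version also drops exact duplicate claims.

```python
from collections import defaultdict


def group_by_subject(claims):
    """Group (subject, object, reference) tuples into {subject: [(object, reference), ...]}."""
    grouped = defaultdict(list)
    for subject, obj, reference in claims:
        pair = (obj, reference)
        # Skip exact repeats so each claim is written once per document
        if pair not in grouped[subject]:
            grouped[subject].append(pair)
    return dict(grouped)


peridotite_url = ("https://geodeepdive.org/api/snippets"
                  "?dict_filter=lithologies&publisher=Geoarchive&term=peridotite")
basalt_url = ("https://geodeepdive.org/api/snippets"
              "?dict_filter=lithologies&publisher=Geoarchive&term=basalt")

sample_claims = [
    ('Q40197', 'Q41757', peridotite_url),
    ('Q29971', 'Q41757', peridotite_url),
    ('Q40197', 'Q99999', basalt_url),      # invented second concept
    ('Q40197', 'Q41757', peridotite_url),  # exact duplicate, dropped
]

grouped_claims = group_by_subject(sample_claims)
for subject_qid, pairs in grouped_claims.items():
    print(subject_qid, len(pairs))
```

Each key then corresponds to one item fetch and one write, with all of that document's claims added together.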

In [8]:
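# One row per document, with parallel lists of concept QIDs and reference URLs
claims_by_subject = (
    pd.DataFrame(geokb_doc_subjects, columns=['subject','object','reference'])
    .groupby('subject', as_index=False)[['object','reference']]
    .agg(list)
)

for index, row in claims_by_subject.iterrows():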
    item = geokb.wbi.item.get(row['subject'])
    subject_claims = []
    for i, subject_qid in enumerate(row['object']):
        ref = geokb.models.References()
        ref.add(
            geokb.datatypes.URL(
                prop_nr=geokb.prop_lookup['reference URL'],
                value=row['reference'][i]
            )
        )
        subject_claims.append(
            geokb.datatypes.Item(
                prop_nr=geokb.prop_lookup['addresses subject'],
                value=subject_qid,
                references=ref
            )
        )
    item.claims.add(
        subject_claims,
        action_if_exists=geokb.action_if_exists.REPLACE_ALL
    )
    response = item.write(
        summary="Added addresses subject claims pulled from xDD index"
    )
    print(response.id)

Q100365
Q100665
Q101510
Q103030
Q104901
Q104989
Q105506
Q106867
Q106938
Q107170
Q107699
Q109951
Q113086
Q116634
Q117883
Q118029
Q118539
Q120001
Q122136
Q122278
Q122502
Q122519
Q122846
Q123577
Q123709
Q124501
Q125370
Q129085
Q129339
Q130088
Q130301
Q130608
Q131503
Q132548
Q133289
Q133292
Q133931
Q133953
Q134006
Q134019
Q134100
Q135313
Q135357
Q135545
Q136023
Q137627
Q137675
Q28938
Q29186
Q29259
Q29716
Q29891
Q29971
Q30581
Q30742
Q31241
Q31577
Q31737
Q31802
Q32272
Q32279
Q32512
Q32562
Q32591
Q32640
Q32956
Q33378
Q33459
Q34042
Q34227
Q34374
Q34558
Q34809
Q35011
Q35501
Q35583
Q35803
Q35807
Q35815
Q36100
Q37033
Q37106
Q37163
Q37400
Q37599
Q37860
Q37869
Q37901
Q38505
Q38658
Q38759
Q38774
Q38974
Q38987
Q39045
Q39077
Q39307
Q39343
Q40195
Q40197
Q55177
Q75832
Q80287
Q80898
Q83551
Q83591
Q85481
Q86183
Q86560
Q89163
Q89244
Q91418
Q91805
Q92631
Q93042
Q93233
Q93240
Q94435
Q95131
Q95440
Q96127
Q97603
Q97899
Q99754
