This notebook begins working through how to pull value from the documents we are processing with the xDD Cyberinfrastructure back into the representation of those documents in the GeoKB. We will keep improving this process through use and experimentation as we learn what adds the most value to assessments and other research activities.

I'm starting with just the [period](https://geokb.wikibase.cloud/wiki/Special:WhatLinksHere/Item:Q26291) geochronological time concepts because they have an existing dedicated dictionary in xDD, the list is small, and the terms are fairly distinctive. We will likely want to winnow things down from this kind of crude process, which is based solely on indexing hits across the corpus of documents.

In [1]:
import requests
import pandas as pd
import swifter
from wbmaker import WikibaseConnection

In [2]:
geokb = WikibaseConnection("GEOKB_CLOUD")

In [3]:
source_item_id = geokb.ref_lookup['xDD Cyberinfrastructure']

In [29]:
query_period = "PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ26291%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D%0A"
df_period = geokb.wb_ref_data(query=query_period)
df_period["item"] = df_period.item.apply(lambda x: x.split("/")[-1])

query_gddid = "PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3Fgddid%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP93%20%3Fgddid%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D%0A"
df_gddid = geokb.wb_ref_data(query=query_gddid)
df_gddid["item"] = df_gddid.item.apply(lambda x: x.split("/")[-1])
gddid_lookup = df_gddid.set_index('gddid')['item'].to_dict()
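
The percent-encoded query strings above are hard to read and edit. As a minimal sketch, the period query can be kept as readable SPARQL and encoded with the standard library just before use. This assumes `wb_ref_data` expects the same percent-encoded form used above; `period_sparql` and `encoded_period_query` are illustrative names, not part of `wbmaker`.

```python
from urllib.parse import quote, unquote

# Readable form of the period query (decoded from the string used above)
period_sparql = """PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P1 wd:Q26291 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
"""

# safe="" percent-encodes every reserved character, including "/" and ":"
encoded_period_query = quote(period_sparql, safe="")

# Decoding restores the readable text exactly
assert unquote(encoded_period_query) == period_sparql
```

Keeping the readable text as the source makes it easier to change the class (`wd:Q26291`) or property (`wdt:P1`) without hand-editing escape sequences.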

## xDD Interface

I need to keep working through the best way to interface with the xDD APIs for this type of process. The following is just my first attempt, and there are likely more efficient ways to get this done. This process starts from a known linkage: a "period" as a subdivision of geologic time. We previously set up this reference source in the GeoKB by processing the International Geochronostratigraphic Chart, which is essentially the same source that xDD uses across several different dictionaries. Because these are uniquely named entities within this context, we have reasonable assurance that we are dealing with the same concepts.

I initially tried the terms API from xDD, sending a series of requests for the terms we know about in the GeoKB against the corpus of documents that are also represented in the GeoKB. Given the identifiers we have in common, this provides everything necessary to post a set of claims on NI 43-101 reports, linking them to period entities via the broad "addresses subject" predicate.
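
To illustrate why shared identifiers are sufficient, here is a minimal sketch that turns term hits into (document, predicate, period) statements. The record shape, the sample identifiers, and the QIDs are hypothetical stand-ins, not actual xDD responses or GeoKB items, and `hits_to_statements` is an illustrative helper.

```python
def hits_to_statements(hits, gddid_to_qid, predicate="addresses subject"):
    """Map (gddid, period QID) hits to (document QID, predicate, period QID) tuples.

    Hits whose gddid has no matching GeoKB item are skipped.
    """
    statements = set()
    for hit in hits:
        doc_qid = gddid_to_qid.get(hit["gddid"])
        if doc_qid is not None:
            statements.add((doc_qid, predicate, hit["period_qid"]))
    return sorted(statements)

# Hypothetical sample data for illustration only
sample_lookup = {"doc-a": "Q100", "doc-b": "Q200"}
sample_hits = [
    {"gddid": "doc-a", "period_qid": "Q900"},
    {"gddid": "doc-a", "period_qid": "Q900"},  # duplicate hit collapses to one statement
    {"gddid": "doc-b", "period_qid": "Q901"},
    {"gddid": "doc-c", "period_qid": "Q900"},  # no GeoKB item, so it is skipped
]
statements = hits_to_statements(sample_hits, sample_lookup)
```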

However, once I worked that through and thought about how we might want to interact with these claims, I ended up using the snippets API and pulling back highlights to incorporate into the item claims as qualifiers. This may or may not prove the most useful, but we can always revisit.
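
The paging pattern for the snippets API can be sketched as a generator that follows `next_page` links until none remain. This assumes the response shape used in the next cell (`success.data` and `success.next_page`); `iter_snippets` and its `fetch` parameter are hypothetical names, and `fetch` is injectable so the logic can be exercised without network access.

```python
def iter_snippets(first_url, fetch):
    """Yield every record from a paged response, following next_page links.

    fetch is any callable that takes a URL and returns the decoded JSON dict,
    for example: lambda url: requests.get(url, timeout=60).json()
    """
    url = first_url
    while url:
        response = fetch(url)
        success = response.get("success")
        if success is None:
            # An error payload has no "success" key; stop rather than loop forever
            return
        yield from success.get("data", [])
        url = success.get("next_page")  # missing or empty value ends the loop

# Offline demonstration with fake pages (hypothetical URLs and records)
fake_pages = {
    "page1": {"success": {"data": [{"_gddid": "a"}], "next_page": "page2"}},
    "page2": {"success": {"data": [{"_gddid": "b"}], "next_page": ""}},
}
records = list(iter_snippets("page1", fake_pages.__getitem__))
```

Separating the paging logic from the HTTP call keeps the stop conditions in one place and makes them easy to test.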

In [26]:
all_records = []
for index, row in df_period.iterrows():
    subject_qid = row["item"]
    # Search on both the full label and the label with " Period" removed
    subject = f"{row.itemLabel},{row.itemLabel.replace(' Period','').strip()}"

    next_page = f"https://geodeepdive.org/api/snippets?publisher=Geoarchive&term={subject}&dictid_filter=71&full_results&clean"
    while next_page:
        response = requests.get(next_page).json()
        if "success" not in response:
            # Stop paging on an error response; otherwise this loop would never end
            break
        # Tag each record with the period it was retrieved for
        all_records.extend([dict(item, subject_qid=subject_qid, subject=subject) for item in response["success"]["data"]])
        # A missing or empty next_page ends the loop
        next_page = response["success"].get("next_page")

In [32]:
xdd_ref = geokb.datatypes.Item(
    prop_nr=geokb.prop_lookup['knowledge source'],
    value=source_item_id
)

In [None]:
gddids = list(set([i["_gddid"] for i in all_records]))

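for gddid in gddids:
    # Skip documents returned by xDD that have no matching item in the GeoKB
    item_qid = gddid_lookup.get(gddid)
    if item_qid is None:
        continue
    item = geokb.wbi.item.get(item_qid)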

    item_records = [i for i in all_records if i["_gddid"] == gddid]

    claims = []
    for x in item_records:
        highlight_qualifiers = geokb.models.Qualifiers()
        for snippet in x["highlight"]:
            highlight_qualifiers.add(
                geokb.datatypes.String(
                    prop_nr=geokb.prop_lookup['concept highlight'],
                    value=snippet
                )
            )

        snippet_ref = geokb.datatypes.URL(
            prop_nr=geokb.prop_lookup['reference URL'],
            value=f"https://geodeepdive.org/api/snippets?docid={x['_gddid']}&term={x['subject'].replace(' ','%20')}&dictid_filter=71&full_results&clean"
        )

        refs = geokb.models.References()
        refs.add(xdd_ref)
        refs.add(snippet_ref)
        
        claims.append(
            geokb.datatypes.Item(
                prop_nr=geokb.prop_lookup['addresses subject'],
                value=x["subject_qid"],
                references=refs,
                qualifiers=highlight_qualifiers
            )
        )

    item.claims.add(claims)
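    try:
        item.write(
            summary="Added claims for geochronologic period based on xDD snippet extraction",
            clear=True
        )
        print(item_qid)
    except Exception as e:
        # Report the failure instead of silently discarding it, then move on
        print(f"{item_qid}: write failed: {e}")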
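
As the number of records grows, filtering `all_records` once per document becomes slow, and long or untidy snippets may be rejected when written as string qualifiers. One alternative, sketched minimally here, is to group records by `_gddid` in a single pass and normalize each highlight before use. The `max_len` value of 400 is an assumption based on the default Wikibase string length limit and should be checked against the target instance; `group_by_gddid` and `clean_highlight` are hypothetical helpers.

```python
from collections import defaultdict

def group_by_gddid(records):
    """Group snippet records by document identifier in one pass."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["_gddid"]].append(record)
    return dict(grouped)

def clean_highlight(text, max_len=400):
    """Collapse runs of whitespace (including newlines) and truncate to max_len."""
    collapsed = " ".join(text.split())
    return collapsed[:max_len].rstrip()

# Hypothetical sample records for illustration only
sample_records = [
    {"_gddid": "a", "highlight": ["  Jurassic\n sandstone  "]},
    {"_gddid": "b", "highlight": ["Permian basin"]},
    {"_gddid": "a", "highlight": ["x" * 500]},
]
grouped = group_by_gddid(sample_records)
cleaned = [clean_highlight(h) for r in grouped["a"] for h in r["highlight"]]
```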
