This notebook explores retrieving information from the xDD APIs for document entities in the GeoKB that have been run through xDD processing pipelines, and organizing a summary of that information for review and discussion. The process needs a fair bit of work before we would want to run it in production; we will refine it once we work out the best conventions for live use.

The process uses a different MediaWiki Python client (mwclient) that I added to the wbmaker class, which exposes "mw_site" as an authenticated endpoint we can operate against. I also use another Python package here, tabulate, which can format dataframes into different table encodings, including one for MediaWiki.

So far, this method does not seem to be working well in terms of actual engagement with SMEs. I will likely set it aside, but I will come back to reevaluate it.

In [1]:
import os

In [1]:
import requests
from wbmaker import WikibaseConnection
import pandas as pd
from tabulate import tabulate

In [2]:
geokb = WikibaseConnection('GEOKB_CLOUD')

Here I get a small sample of GeoKB items that have gddid claims (property P93; the query is limited to 10 items). We can run something like this on any item with a gddid, and we will continue working out what we want fed back to the GeoKB from that process.

In [3]:
gdd_items = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3Fgddid%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP93%20%3Fgddid%20.%0A%7D%0ALIMIT%2010",
    output_format="dataframe"
)

gdd_items["qid"] = gdd_items["item"].apply(lambda x: x.split("/")[-1])
gdd_items.head()

Unnamed: 0,item,gddid,qid
0,https://geokb.wikibase.cloud/entity/Q28507,620c0472ad0e9c819b01e9d8,Q28507
1,https://geokb.wikibase.cloud/entity/Q28508,620ec508ad0e9c819b0ed48d,Q28508
2,https://geokb.wikibase.cloud/entity/Q28505,620f06ecad0e9c819b0fffef,Q28505
3,https://geokb.wikibase.cloud/entity/Q28504,620f2ebaad0e9c819b109562,Q28504
4,https://geokb.wikibase.cloud/entity/Q28506,620f3149ad0e9c819b10a044,Q28506
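
The `sparql_url` above is hard to read in its percent-encoded form. As a minimal sketch, the same URL can be built from readable SPARQL using only the standard library. The names `SPARQL_ENDPOINT`, `GDD_QUERY`, and `build_sparql_url` are illustrative and not part of wbmaker.

```python
from urllib.parse import quote, unquote

# Endpoint taken from the query URL used in this notebook
SPARQL_ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"

# Readable form of the gddid query (P93 is the property used above)
GDD_QUERY = """PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?gddid
WHERE {
  ?item wdt:P93 ?gddid .
}
LIMIT 10"""


def build_sparql_url(query, endpoint=SPARQL_ENDPOINT):
    """Percent-encode a SPARQL string and attach it as the query parameter."""
    return f"{endpoint}?query={quote(query, safe='')}"


gdd_query_url = build_sparql_url(GDD_QUERY)

# Round-trip check: decoding the URL gives back the readable query
assert unquote(gdd_query_url.split("query=", 1)[1]) == GDD_QUERY
print(gdd_query_url)
```

The resulting string could then be passed as `sparql_url` to `geokb.url_sparql_query`, which keeps the query itself editable in the notebook.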


One of the things we want to do here is link mineral names picked up from xDD indexing to mineral commodity entities in the GeoKB. Here we get the entities identified as commodities and tee up a simple name matching structure.

In [4]:
geokb_commodities = geokb.url_sparql_query(
    sparql_url="https://geokb.wikibase.cloud/query/sparql?query=PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3FitemAltLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ406%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D",
    output_format="dataframe"
)

geokb_commodities["qid"] = geokb_commodities["item"].apply(lambda x: x.split("/")[-1])
geokb_commodities["commodity_name_lower"] = geokb_commodities["itemLabel"].str.lower()
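# Alt labels arrive as one comma-separated string; split, trim, and lowercase them so they can match term_lower
geokb_commodities["commodity_alt_names"] = geokb_commodities["itemAltLabel"].apply(lambda x: [a.strip().lower() for a in x.split(",")] if isinstance(x, str) and x else None)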

geokb_commodity_lookup = pd.concat([
    geokb_commodities[["item","itemLabel"]].rename(columns={"itemLabel": "commodity_name"}),
    geokb_commodities[["item","commodity_name_lower"]].rename(columns={"commodity_name_lower": "commodity_name"}),
    geokb_commodities[["item","commodity_alt_names"]].explode("commodity_alt_names").rename(columns={"commodity_alt_names": "commodity_name"})
])

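# Drop rows with no name (commodities without alt labels) before removing duplicates
geokb_commodity_lookup = geokb_commodity_lookup.dropna(subset=["commodity_name"]).drop_duplicates()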

geokb_commodity_lookup.head()

Unnamed: 0,item,commodity_name
0,https://geokb.wikibase.cloud/entity/Q291,platinum-group elements
1,https://geokb.wikibase.cloud/entity/Q293,aluminum
2,https://geokb.wikibase.cloud/entity/Q295,antimony
3,https://geokb.wikibase.cloud/entity/Q297,arsenic
4,https://geokb.wikibase.cloud/entity/Q302,bismuth
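
To make the name matching structure concrete, here is a small offline sketch that uses a plain dictionary rather than the pandas merge used in this notebook. The helper names and the alt labels in the sample records are hypothetical; only the two item URLs and labels come from the output above.

```python
def split_alt_labels(alt_label):
    """Split a comma-separated alt label string into trimmed names."""
    if not isinstance(alt_label, str) or not alt_label.strip():
        return []
    return [name.strip() for name in alt_label.split(",") if name.strip()]


def build_name_lookup(commodities):
    """Map each lowercased label or alt label to its commodity item."""
    lookup = {}
    for record in commodities:
        names = [record["itemLabel"]] + split_alt_labels(record.get("itemAltLabel"))
        for name in names:
            # setdefault keeps the first item seen when two commodities share a name
            lookup.setdefault(name.lower(), record["item"])
    return lookup


# Hypothetical records shaped like the SPARQL result columns (alt labels are made up)
sample_commodities = [
    {"item": "https://geokb.wikibase.cloud/entity/Q293", "itemLabel": "aluminum", "itemAltLabel": "Aluminium, Al"},
    {"item": "https://geokb.wikibase.cloud/entity/Q295", "itemLabel": "antimony", "itemAltLabel": None},
]
name_lookup = build_name_lookup(sample_commodities)

for term in ["Aluminium", "Antimony", "quartz"]:
    print(term, "->", name_lookup.get(term.lower(), "no match"))
```

Lowercasing both sides of the comparison is the design choice that matters here: a term from xDD only matches when its lowercased form equals a lowercased label or alt label.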


This is a start to functional logic we can work through as we decide how this kind of "tee up for review" process should operate and what it should lay out for discussion.

In [5]:
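def gdd_terms(gddid, dict_id="11"):
    xdd_api_terms = f"https://geodeepdive.org/api/terms?docid={gddid}&dict_id={dict_id}"

    xdd_response_terms = requests.get(xdd_api_terms, timeout=60)
    xdd_response_terms.raise_for_status()
    # Basic error trap: fall back to an empty table when the response has no success/data payload
    term_data = xdd_response_terms.json().get("success", {}).get("data", [])
    df_xdd_terms = pd.DataFrame(term_data, columns=["term", "n_hits", "last_updated"])
    df_xdd_terms["term_lower"] = df_xdd_terms["term"].str.lower()

    return df_xdd_terms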

def gdd_discussion_content(qid, gddid):
    term_api_url = f"https://geodeepdive.org/api/terms?docid={gddid}&dict_id=11"

    texts = []

    texts.append("= Mineral Name Hits and Commodity Linkages =")
    texts.append(f"The following table shows the results of [{term_api_url} xDD indexing of this document] using a dictionary of mineral names. The table shows the number of hits for each term and a potential linkage to a corresponding mineral commodity entity in the GeoKB. Before establishing these linkages programmatically, we are soliciting feedback on a representative sample for this process. Please include your comments by editing the comments section below.")

    mineral_terms = gdd_terms(gddid=gddid)

    geokb_term_matches = pd.merge(
        left=mineral_terms[["term","n_hits","last_updated","term_lower"]].rename(columns={"term_lower": "commodity_name"}),
        right=geokb_commodity_lookup,
        how="left",
        on="commodity_name"
    )

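    # Fill only the linkage column so missing hit counts or dates are not overwritten with this message
    geokb_term_matches["item"] = geokb_term_matches["item"].fillna('not yet identified as a commodity in GeoKB')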

    texts.append(
        tabulate(
            geokb_term_matches[["term","n_hits","item","last_updated"]].sort_values("n_hits", ascending=False),
            headers=["Mineral Term","Number of Occurrences","Proposed Commodity Linkage","Last Updated"],
            tablefmt="mediawiki",
            showindex=False
        )
    )

    # Pull highlighted snippets for the five most frequent terms
    for _, row in geokb_term_matches[["term","n_hits","item"]].sort_values("n_hits", ascending=False)[:5].iterrows():
        xdd_api_snippets = f"https://geodeepdive.org/api/snippets?docid={gddid}&dict_id=11&term={row['term']}"
        xdd_response_snippets = requests.get(xdd_api_snippets, timeout=60).json()

        if row["item"].startswith('https://'):
            texts.append(f"== [[Item:{row['item'].split('/')[-1]}|{row['term']}]] ==")
        else:
            texts.append(f"== {row['term']} ==")
        # Guard against terms that return no snippet data
        snippet_data = xdd_response_snippets.get("success", {}).get("data", [])
        highlights = snippet_data[0].get("highlight", []) if snippet_data else []
        for snippet in highlights:
            snippet_text = snippet.replace('<em class="hl">', "''").replace('</em>', "''")
            texts.append(f"* {snippet_text}")
    
    texts.append("= Discussion =")
    texts.append('When posting comments, please sign them using the "signature and timestamp" wiki markup (four tildes). You can also add subheadings under this section for comments on a particular topic to help organize the discussion.')

    page = geokb.mw_site.pages[f'Item_talk:{qid}']
    response = page.edit("\n".join(texts), summary="Added information on xDD derived minerals for review to discussion page")
    return response
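
The per-term formatting inside the function above can be pulled out into small helpers that are easy to check without calling the xDD API or the wiki. This is only a sketch: `snippet_to_wikitext` and `term_heading` are hypothetical names, and the sample snippet is made up to match the highlight markup the function handles.

```python
import re

# Matches the opening and closing <em> highlight tags
HIGHLIGHT_TAG = re.compile(r"</?em\b[^>]*>")


def snippet_to_wikitext(snippet):
    """Replace highlight tags with wiki italics and flatten line breaks."""
    text = HIGHLIGHT_TAG.sub("''", snippet)
    # A newline inside a snippet would end the bullet early, so collapse whitespace
    return " ".join(text.split())


def term_heading(term, item):
    """Link the heading to the GeoKB item when a commodity match exists."""
    if isinstance(item, str) and item.startswith("https://"):
        return f"== [[Item:{item.rsplit('/', 1)[-1]}|{term}]] =="
    return f"== {term} =="


# Hypothetical snippet text in the highlight format handled above
sample = 'traces of <em class="hl">antimony</em>\nin the ore'
heading = term_heading("antimony", "https://geokb.wikibase.cloud/entity/Q295")
bullet = f"* {snippet_to_wikitext(sample)}"
print(heading)
print(bullet)
```

Collapsing whitespace goes beyond what the function above does; it is included because a line break inside a snippet would otherwise split the wiki bullet.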

Now we loop through the entities that have gddid values and run the function, which pulls what we want to examine from the xDD APIs and lays it out in wiki markup for evaluation and discussion. This could instead run as a function applied to a subset of items, or to every item if needed.

In [6]:
for index, row in gdd_items.iterrows():
    response = gdd_discussion_content(
        qid=row["qid"],
        gddid=row["gddid"]
    )
    display(response)


OrderedDict([('result', 'Success'),
             ('pageid', 54946),
             ('title', 'Item talk:Q28507'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 118260),
             ('newrevid', 118261),
             ('newtimestamp', '2023-06-13T20:39:36Z'),
             ('watched', '')])

OrderedDict([('new', ''),
             ('result', 'Success'),
             ('pageid', 54947),
             ('title', 'Item talk:Q28508'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 0),
             ('newrevid', 118262),
             ('newtimestamp', '2023-06-13T20:39:40Z'),
             ('watched', '')])

OrderedDict([('new', ''),
             ('result', 'Success'),
             ('pageid', 54948),
             ('title', 'Item talk:Q28505'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 0),
             ('newrevid', 118263),
             ('newtimestamp', '2023-06-13T20:39:44Z'),
             ('watched', '')])

OrderedDict([('new', ''),
             ('result', 'Success'),
             ('pageid', 54949),
             ('title', 'Item talk:Q28504'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 0),
             ('newrevid', 118264),
             ('newtimestamp', '2023-06-13T20:39:49Z'),
             ('watched', '')])

OrderedDict([('new', ''),
             ('result', 'Success'),
             ('pageid', 54950),
             ('title', 'Item talk:Q28506'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 0),
             ('newrevid', 118265),
             ('newtimestamp', '2023-06-13T20:39:53Z'),
             ('watched', '')])

OrderedDict([('new', ''),
             ('result', 'Success'),
             ('pageid', 54951),
             ('title', 'Item talk:Q28502'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 0),
             ('newrevid', 118266),
             ('newtimestamp', '2023-06-13T20:39:56Z'),
             ('watched', '')])

OrderedDict([('new', ''),
             ('result', 'Success'),
             ('pageid', 54952),
             ('title', 'Item talk:Q28503'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 0),
             ('newrevid', 118267),
             ('newtimestamp', '2023-06-13T20:39:59Z'),
             ('watched', '')])

OrderedDict([('new', ''),
             ('result', 'Success'),
             ('pageid', 54953),
             ('title', 'Item talk:Q28510'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 0),
             ('newrevid', 118268),
             ('newtimestamp', '2023-06-13T20:40:04Z'),
             ('watched', '')])

OrderedDict([('new', ''),
             ('result', 'Success'),
             ('pageid', 54954),
             ('title', 'Item talk:Q28511'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 0),
             ('newrevid', 118269),
             ('newtimestamp', '2023-06-13T20:40:08Z'),
             ('watched', '')])

OrderedDict([('new', ''),
             ('result', 'Success'),
             ('pageid', 54955),
             ('title', 'Item talk:Q28513'),
             ('contentmodel', 'wikitext'),
             ('oldrevid', 0),
             ('newrevid', 118270),
             ('newtimestamp', '2023-06-13T20:40:11Z'),
             ('watched', '')])
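
If this were run over a larger subset of items, one failed request would stop the whole loop. Below is a minimal sketch of a batch runner that records failures and keeps going. `run_batch` and `fake_worker` are hypothetical; `fake_worker` stands in for `gdd_discussion_content` so the example runs offline, and the item "Q0" is invented to trigger a failure.

```python
import time


def run_batch(items, worker, pause_seconds=1.0):
    """Apply worker(qid, gddid) to each item, collecting results and failures."""
    results, failures = {}, {}
    for position, (qid, gddid) in enumerate(items):
        if position and pause_seconds:
            time.sleep(pause_seconds)  # brief pause between remote calls
        try:
            results[qid] = worker(qid, gddid)
        except Exception as error:  # keep going; report the failure afterwards
            failures[qid] = repr(error)
    return results, failures


def fake_worker(qid, gddid):
    """Stand-in for gdd_discussion_content so the sketch runs offline."""
    if not gddid:
        raise ValueError("missing gddid")
    return {"result": "Success", "title": f"Item talk:{qid}"}


results, failures = run_batch(
    [("Q28507", "620c0472ad0e9c819b01e9d8"), ("Q0", "")],
    fake_worker,
    pause_seconds=0,
)
print(results)
print(failures)
```

With the objects in this notebook, the call might look like `run_batch(zip(gdd_items["qid"], gdd_items["gddid"]), gdd_discussion_content)`, assuming the function keeps its current signature.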