This notebook is a first attempt at a process for adding items to the GeoKB that represent documents in a GeoArchive collection. The knowledgebase is where we will be able to develop the richest suite of linkages for what we are able to extract or identify within document reference materials. It may also serve as a logical alternative location for metadata associated with documents, which would help future-proof us against dependence on any particular tech base for storing document content together with citation metadata.

Working through this process raised a couple of new issues with the GeoKB and led me to establish some new properties:

* I added a GDDID property to provide the linkage to the GeoDeepDive infrastructure where we've run a given document through one or more processing pipelines. This provides the conduit for us to incorporate extracted information into the GeoKB. The formatter URL creates a link that ties into the /articles/ route in the xDD API.
* I added a Zotero ID property to house a compound identifier made up of the unique Library ID and the Item Key for the item containing document metadata. This gives us the hook to keep things in sync over time. These identifiers are also reflected in the reference URL provided to link into the library, but it's the identifiers themselves that are actionable via the API. (Note: ultimately, I still want to establish a handle server to register these identifiers against, creating an evergreen resolver for future-proofing.)
* I did a little bit of work on country place names to round out the reference source in the GeoKB and established linkages based on what was previously brought into the metadata from spreadsheet or file system naming convention sources. I opted to use a new "addresses place" property here as opposed to a more specific property like "country" as I think this is the way we want to model the linkages from a document to different concepts.
* I was able to get a handful of matches for some of the more major mining projects to existing "mine" items previously pulled in from the GNIS process. This is pretty crude at this point, but I put them in as linkages for demonstration purposes at least. I opted to use a different (also new) property here, "addresses feature", as there is a fairly distinct difference in intent between a mine (or a dam or whatever) kind of place and a geopolitical boundary. But I also think we want to use a higher-level concept than "the mine," "the dam," etc. for the features that we are able to determine a document addresses in some way.
* In both cases, place and feature, I included a "reference statement" (may eventually be better as a qualifier) indicating that the particular connection to an identified place or mine was made from imperfect metadata. We'll have to play with this idea in practice as we continue to work through use cases.
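
The compound Zotero ID described above can be sketched as a pair of small helpers. This is a minimal illustration, assuming the `library_id:item_key` form and the group URL pattern used later in this notebook; the `ZoteroRef` class, the helper names, and the sample values in the usage note are hypothetical and not part of the GeoKB tooling.

```python
from typing import NamedTuple


class ZoteroRef(NamedTuple):
    """Hypothetical container for the two parts of a compound Zotero ID."""
    library_id: str
    item_key: str


def parse_zotero_id(zotero_id: str) -> ZoteroRef:
    """Split a 'library_id:item_key' string into its two parts."""
    library_id, sep, item_key = zotero_id.partition(":")
    if not sep or not library_id or not item_key:
        raise ValueError(f"Not a compound Zotero ID: {zotero_id!r}")
    return ZoteroRef(library_id, item_key)


def zotero_item_url(ref: ZoteroRef, group_slug: str = "usgs_ni_43-101_reports") -> str:
    """Rebuild the web URL for an item, following the pattern used in this notebook."""
    return (
        f"https://www.zotero.org/groups/{ref.library_id}/"
        f"{group_slug}/items/{ref.item_key}/library"
    )
```

For example, `parse_zotero_id("1234567:ABCD2345")` (made-up values) yields a `ZoteroRef` whose two fields are what a Zotero API client needs, while `zotero_item_url` only rebuilds the human-facing link.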

A next step here is to pull in the "addresses subject" (another new property) claims from both the citation metadata brought in through the inventory process and the extractions from DeepDive. I'm going to concentrate on the latter, with geologic time as a demonstration.

I also hesitated to pull the point coordinates that we have for some of these records in at this point. From a broader perspective across the knowledgebase, those are really point locations likely indicating where the mining property is. They come from the documents and are associated with them as items, but they will be more useful in the context of the mining property itself. The same could be said for the other concepts here as well, so thinking that through is still something we need to do.

In [1]:
import os
from pyzotero import zotero
import pandas as pd
from wbmaker import WikibaseConnection

In [2]:
geokb = WikibaseConnection("GEOKB_CLOUD")

In [3]:
# Establish connection to Zotero group library from source

# Select source item to process
source_item_id = geokb.ref_lookup['GeoArchive Collection of National Instrument 43-101 Technical Reports']
source_item = geokb.wbi.item.get(source_item_id)

# Extract library identifier for Zotero group
source_instance_of = source_item.claims.get_json()[geokb.prop_lookup['instance of']][0]
# I obviously need to do something better with this once I settle on conventions for different source types
reference_url = source_instance_of["references"][0]["snaks"][geokb.prop_lookup['reference URL']][0]["datavalue"]["value"]
ni43101_library_id = reference_url.split("/")[4]

ni43101_api_key = os.environ['NI43101_KEY']

ni43101_library = zotero.Zotero(
    ni43101_library_id,
    'group', 
    ni43101_api_key
)
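
The positional `split("/")[4]` above works as long as the reference URL keeps exactly the same shape. A slightly more defensive version might look like the sketch below. It assumes the URL path has the form `/groups/<library_id>/...` and that group library IDs are numeric; the function name `group_library_id` and the sample ID are hypothetical.

```python
from urllib.parse import urlparse


def group_library_id(reference_url: str) -> str:
    """Return the group library ID from a Zotero group URL.

    Assumes a path of the form /groups/<numeric id>/...; raises ValueError otherwise.
    """
    segments = [s for s in urlparse(reference_url).path.split("/") if s]
    if len(segments) < 2 or segments[0] != "groups" or not segments[1].isdigit():
        raise ValueError(f"Not a Zotero group URL: {reference_url!r}")
    return segments[1]
```

With a made-up ID, `group_library_id("https://www.zotero.org/groups/1234567/usgs_ni_43-101_reports")` returns `"1234567"`, and a URL of a different shape fails loudly instead of silently yielding the wrong path segment.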

In [4]:
# It takes a long time to pull items, and I need to work through my caching strategy a bit better

#all_items = ni43101_library.everything(ni43101_library.items())

# df_all_items = pd.DataFrame([i["data"] for i in all_items])
# df_all_items.to_pickle("data/ni43101_library_20200419.p")
df_all_items = pd.read_pickle("data/ni43101_library_20200419.p")
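
One possible direction for the caching strategy mentioned above is a small load-or-fetch helper that reuses a pickle file until it gets too old. This is only a sketch using the standard library; the `load_or_fetch` name, the `max_age_days` parameter, and the 30-day default are assumptions, not something already in this workflow.

```python
import pickle
import time
from pathlib import Path
from typing import Any, Callable


def load_or_fetch(cache_path, fetch: Callable[[], Any], max_age_days: float = 30.0) -> Any:
    """Return cached data if the cache file is fresh enough, else call fetch() and cache it."""
    path = Path(cache_path)
    if path.exists():
        age_days = (time.time() - path.stat().st_mtime) / 86400
        if age_days <= max_age_days:
            with path.open("rb") as f:
                return pickle.load(f)

    # Cache is missing or stale: do the slow pull and store the result.
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(data, f)
    return data
```

In this notebook the slow call could be wrapped as `load_or_fetch("data/ni43101_items.p", lambda: ni43101_library.everything(ni43101_library.items()))`, where the file name is a placeholder. Pickle files should only be loaded from locations we control.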

In [46]:
reports = df_all_items[df_all_items.itemType == "report"].reset_index()
reports["project_name"] = reports.seriesTitle.apply(lambda x: x.replace("Project: ", "").strip())
reports["zotero_id"] = reports.key.apply(lambda x: ":".join([ni43101_library_id, x]))
reports["zotero_url"] = reports.key.apply(lambda x: f"https://www.zotero.org/groups/{ni43101_library_id}/usgs_ni_43-101_reports/items/{x}/library")

In [37]:
query_countries = "PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3FitemAltLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ27%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D%0A"
df_countries = geokb.wb_ref_data(query=query_countries)
df_countries["item"] = df_countries.item.apply(lambda x: x.split('/')[-1])

query_us_states = "PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ229%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D%0A"
df_us_states = geokb.wb_ref_data(query=query_us_states)
df_us_states["item"] = df_us_states.item.apply(lambda x: x.split('/')[-1])

df_countries['aliases'] = df_countries.itemAltLabel.apply(lambda x: [i.strip() for i in x.split(",")] if isinstance(x, str) else None)
place_lookup = pd.concat([
    df_us_states,
    df_countries[["item","itemLabel"]],
    df_countries[df_countries.aliases.notnull()][["item","aliases"]].explode("aliases").rename(columns={"aliases": "itemLabel"})
]).drop_duplicates().set_index("itemLabel")["item"].to_dict()

query_mines = "PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ3646%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D%0A"
df_mines = geokb.wb_ref_data(query=query_mines)
df_mines["item"] = df_mines.item.apply(lambda x: x.split('/')[-1])

mine_lookup = df_mines.set_index("itemLabel")["item"].to_dict()

query_commodity = "PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3FitemAltLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ406%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D%0A"
df_commodity = geokb.wb_ref_data(query=query_commodity)
df_commodity["item"] = df_commodity.item.apply(lambda x: x.split('/')[-1])
df_commodity['aliases'] = df_commodity.itemAltLabel.apply(lambda x: [i.strip() for i in x.split(",")] if isinstance(x, str) else None)

commodity_lookup = pd.concat([
    df_commodity[["item","itemLabel"]],
    df_commodity[df_commodity.aliases.notnull()][["item","aliases"]].explode("aliases").rename(columns={"aliases": "itemLabel"})
]).drop_duplicates().set_index("itemLabel")["item"].to_dict()
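
The URL-encoded query strings above are hard to read and differ only in the class QID and whether alias labels are selected. A sketch of building them from readable SPARQL follows. It assumes, as the queries above suggest, that P1 is the "instance of" property and that `wb_ref_data` expects a percent-encoded query string; the template and function names are hypothetical.

```python
from urllib.parse import quote

# Readable form of the "all instances of a class" query used in this cell.
GEOKB_INSTANCE_QUERY = (
    "PREFIX wd: <https://geokb.wikibase.cloud/entity/>\n"
    "PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>\n"
    "\n"
    "SELECT {select_vars}\n"
    "WHERE {{\n"
    "  ?item wdt:P1 wd:{class_qid} .\n"
    '  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" }}\n'
    "}}\n"
)


def encoded_instance_query(class_qid: str, include_aliases: bool = False) -> str:
    """Build the percent-encoded SPARQL query for all instances of class_qid."""
    select_vars = "?item ?itemLabel"
    if include_aliases:
        select_vars += " ?itemAltLabel"
    sparql = GEOKB_INSTANCE_QUERY.format(select_vars=select_vars, class_qid=class_qid)
    # safe="" also encodes "/" so the result matches the strings used above.
    return quote(sparql, safe="")
```

Under these assumptions, `encoded_instance_query("Q229")` should stand in for `query_us_states`, and `encoded_instance_query("Q27", include_aliases=True)` for `query_countries`.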

In [55]:
report_tags = reports[["key","tags"]].explode("tags")
report_tags["tag"] = report_tags.tags.apply(lambda x: x["tag"] if isinstance(x, dict) else x)
report_tags.drop(columns=["tags"], inplace=True)
report_tags = report_tags.dropna()
report_tags["tag_type"] = report_tags.tag.apply(lambda x: x.split(':')[0])
report_tags["tag"] = report_tags.tag.apply(lambda x: x.split(':')[1])
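
The two `split(':')` calls above assume every tag has the form `Type:value` with exactly one colon. A more forgiving parse could be sketched as below; the `split_tag` name and the choice to return `None` as the type for tags without a colon are assumptions.

```python
from typing import Optional, Tuple


def split_tag(raw_tag: str) -> Tuple[Optional[str], str]:
    """Split 'Type:value' into (type, value); tags without a colon get type None."""
    tag_type, sep, value = raw_tag.partition(":")
    if not sep:
        return None, raw_tag.strip()
    # partition splits on the first colon only, so any later colons stay in the value.
    return tag_type.strip(), value.strip()
```

Unlike `x.split(':')[1]`, this does not raise an `IndexError` on a tag with no colon, and it keeps the full remainder when a value itself contains a colon.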

In [61]:
unique_commodity_tags = list(report_tags[report_tags.tag_type == "Commodity"].tag.unique())

In [66]:
df_commodity[df_commodity.itemLabel == 'zinc']

Unnamed: 0,item,itemLabel,itemAltLabel,aliases


In [63]:
[i for i in unique_commodity_tags if i not in commodity_lookup]

['Zn',
 'U',
 'Ni',
 'PGE',
 'Pb',
 'PGM',
 'La',
 'REE',
 'potash',
 'W',
 'V',
 'Ti',
 'Pd',
 'Pt',
 'Nb',
 'sand',
 'silica',
 'B',
 'P',
 'quartz',
 'Sb',
 'Sn',
 'Ta',
 '',
 'Se',
 'barite',
 'Re',
 'Te',
 'Hf',
 'Th',
 'Zr',
 'F',
 'Mg',
 'Ce',
 'Sd',
 'Sil',
 'gypsum',
 'Rh',
 'Sc',
 'Ba',
 'S',
 'Y',
 'Nd',
 'Sm',
 'Cly',
 'TLC',
 'bauxite',
 'pumice',
 'Rb',
 'fluorspar',
 'peat',
 'halloysite',
 'phosphate',
 'CD',
 'travertine',
 'LWA',
 'TL',
 'quartzite',
 'N',
 'wollastonite',
 'Na',
 'O',
 'Al,',
 'Si']
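
Some of the unmatched tags above look like data-entry noise (the empty string, `'Al,'` with a trailing comma) rather than missing commodities. A light cleanup pass before the lookup might be sketched as follows. The function names are hypothetical, and the sample lookup in the usage note reuses only the three label-to-QID pairs printed elsewhere in this notebook.

```python
from typing import Dict, Iterable, List, Tuple


def clean_tag(tag: str) -> str:
    """Trim whitespace and stray trailing or leading commas and semicolons."""
    return tag.strip().strip(",;").strip()


def match_commodities(
    tags: Iterable[str], lookup: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Return ({original tag: QID}, [unmatched cleaned tags]), skipping empty tags."""
    # Case-insensitive fallback; exact matches are tried first because
    # element symbols are case-sensitive.
    folded = {label.casefold(): qid for label, qid in lookup.items()}
    matched: Dict[str, str] = {}
    unmatched: List[str] = []
    for tag in tags:
        cleaned = clean_tag(tag)
        if not cleaned:
            continue
        qid = lookup.get(cleaned) or folded.get(cleaned.casefold())
        if qid:
            matched[tag] = qid
        else:
            unmatched.append(cleaned)
    return matched, unmatched
```

Called as `match_commodities(unique_commodity_tags, commodity_lookup)`, this would leave genuinely missing entries such as `'Zn'` in the unmatched list, which points at labels or aliases that still need to be added to the GeoKB commodity items. The case-insensitive fallback is a convenience that could produce false matches for short symbols, so it deserves a manual check.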

In [47]:
query_zotero_id = "PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3Fzotero_id%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP97%20%3Fzotero_id%20.%0A%7D%0A"
df_zotero_id = geokb.wb_ref_data(query=query_zotero_id)
df_zotero_id["item"] = df_zotero_id.item.apply(lambda x: x.split('/')[-1])
wb_items_lookup = df_zotero_id.set_index("zotero_id")["item"].to_dict()

In [54]:
for index, row in reports.head().iterrows():
    for tag in row.tags:
        if tag["tag"].startswith("Commodity:"):
            commodity = tag["tag"].split(":")[1]
            print(commodity, commodity_lookup[commodity])

diamond Q453
Au Q329
Au Q329
Cu Q315
Au Q329


In [None]:
source_reference = geokb.datatypes.Item(
    prop_nr=geokb.prop_lookup['data source'],
    value=source_item_id
)

for index, row in reports.iterrows():
    item_references = geokb.models.References()
    item_references.add(source_reference)

    # Reuse the existing GeoKB item when this Zotero ID is already registered
    item_qid = wb_items_lookup.get(row.zotero_id)

    if item_qid:
        item = geokb.wbi.item.get(item_qid)
    else:
        item = geokb.wbi.item.new()

    item.labels.set('en', row["title"])
    item.descriptions.set('en', "an NI 43-101 Technical Report pulled from the GeoArchive collection")

    claims = geokb.models.Claims()

    claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup['instance of'],
            value=geokb.class_lookup['NI 43-101 Technical Report'],
            references=item_references
        )
    )

    claims.add(
        geokb.datatypes.ExternalID(
            prop_nr=geokb.prop_lookup['Zotero ID'],
            value=row.zotero_id
        )
    )

    claims.add(
        geokb.datatypes.URL(
            prop_nr=geokb.prop_lookup['reference URL'],
            value=row.zotero_url
        )
    )

    if isinstance(row.callNumber, str) and row.callNumber.startswith("gddid:"):
        gddid = row.callNumber.split(":")[-1].strip()
        claims.add(
            geokb.datatypes.ExternalID(
                prop_nr=geokb.prop_lookup["GDDID"],
                value=gddid
            )
        )

    if row.project_name in mine_lookup:
        mine_ref_statement = geokb.datatypes.String(
            prop_nr=geokb.prop_lookup['reference statement'],
            value="Linkage to this mine was established with a simple name match from imperfect metadata identifying the mineral development project"
        )
        mine_connection_refs = geokb.models.References()
        mine_connection_refs.add(source_reference)
        mine_connection_refs.add(mine_ref_statement)

        claims.add(
            geokb.datatypes.Item(
                prop_nr=geokb.prop_lookup['addresses feature'],
                value=mine_lookup[row.project_name],
                references=mine_connection_refs
            )
        )

    if row.place in place_lookup:
        place_ref_statement = geokb.datatypes.String(
            prop_nr=geokb.prop_lookup['reference statement'],
            value="Linkage to this place was established with a simple name match from imperfect metadata identifying the geographic area of the mining project"
        )
        place_connection_refs = geokb.models.References()
        place_connection_refs.add(source_reference)
        place_connection_refs.add(place_ref_statement)

        claims.add(
            geokb.datatypes.Item(
                prop_nr=geokb.prop_lookup['addresses place'],
                value=place_lookup[row.place],
                references=place_connection_refs
            )
        )

    item.claims.add(claims=claims)

    response = item.write(summary="Added initial NI 43-101 item from Zotero library")
    print(row["title"], response.id)
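
Because the loop above writes straight to the live knowledgebase, it can help to preview which reports would update existing items and which would create new ones. The sketch below does that partition without touching the Wikibase; `plan_writes` is a hypothetical helper, not part of `wbmaker`.

```python
from typing import Dict, Iterable, List, Tuple


def plan_writes(
    zotero_ids: Iterable[str], existing_lookup: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Partition Zotero IDs into ({id: existing QID} to update, [ids] to create)."""
    to_update: Dict[str, str] = {}
    to_create: List[str] = []
    seen = set()
    for zotero_id in zotero_ids:
        if zotero_id in seen:
            # Ignore repeats so a duplicate row is not counted twice.
            continue
        seen.add(zotero_id)
        if zotero_id in existing_lookup:
            to_update[zotero_id] = existing_lookup[zotero_id]
        else:
            to_create.append(zotero_id)
    return to_update, to_create
```

Running `plan_writes(reports.zotero_id, wb_items_lookup)` before the write loop gives a quick count of updates versus new items, which is a cheap sanity check that the Zotero ID lookup is current.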