This notebook builds a representation in the GeoKB of the collection of NI 43-101 Technical Reports that we cite as source material for information on mineral exploration around the world. We store and manage these documents in a Zotero group library, which has its own API, but we also represent them here so that there is a place to link additional information gleaned from the documents in various ways. We created a specific type of government report to classify these entities. We source the bibliographic metadata from the Zotero API and cache both the item metadata and the attachment metadata, encoded as YAML, on the item talk page for each report entity. A few key things are done and decided on here:
* build and record a metadata URL provided via the w3id.org service that abstracts away from Zotero's specific landing page construct; these serve a similar function to a DOI in some respects and are treated as a reasonably permanent URL for the items
* build and record one or more content URLs made up of the attachment key coupled with the library ID and item key
* record the attachment key separately for convenience when using the Zotero API to access attachments (requires an API key)
* indicate MIME types for content negotiation or other purposes on the links
* record the version number for each item so that we know where we stand in relation to the base source in Zotero
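
As a minimal sketch of the two URL forms described in the list above, the helpers below assemble a metadata URL and a content URL from the library ID, item key, and attachment key. The URL patterns are the ones used later in this notebook; the function names and the sample keys are hypothetical and exist only for illustration.

```python
LIBRARY_ID = "4530692"  # Zotero group library ID used in this notebook
GROUP_SLUG = "usgs_ni_43-101_reports"  # group name segment of the Zotero web URL


def metadata_url(item_key: str, library_id: str = LIBRARY_ID) -> str:
    """Build the w3id.org metadata URL for a report item."""
    return f"https://w3id.org/usgs/z/{library_id}/{item_key}"


def content_url(item_key: str, attachment_key: str,
                library_id: str = LIBRARY_ID,
                group_slug: str = GROUP_SLUG) -> str:
    """Build the Zotero web reader URL for one attachment of a report item."""
    return (
        f"https://www.zotero.org/groups/{library_id}/{group_slug}"
        f"/items/{item_key}/attachment/{attachment_key}/reader"
    )


# "ABCD1234" and "WXYZ5678" are made-up keys, not real library items
print(metadata_url("ABCD1234"))
print(content_url("ABCD1234", "WXYZ5678"))
```

Keeping the attachment key as its own value, separate from the content URL, means an API client does not have to parse it back out of the URL.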

In [95]:
import os

import yaml
from pymongo import MongoClient
from pyzotero import zotero
from wbmaker import WikibaseConnection

In [106]:
geokb = WikibaseConnection('GEOKB_CLOUD')

mongo_client = MongoClient(f"mongodb://{os.environ['MONGODB_HOST']}:{os.environ['MONGODB_PORT']}/")
isaid_cache = mongo_client['iSAID']
ni43101_cache = isaid_cache['zotero_ni43101']


In [3]:
z_ni43101 = zotero.Zotero(
    "4530692",
    'group', 
    os.environ['ZOTERO_API_KEY']
)

# Caching Zotero Metadata
Getting every item from a Zotero group library with thousands of items is slow and cumbersome because the API throttles requests and returns results in pages (50 items at a time). The pyzotero package offers a convenient "everything" wrapper as well as an iterator, but it still takes upwards of 30 minutes to pull 15K reports and their attachments. For that reason, I have a process here that pulls and caches everything in a MongoDB instance that we can work against. This only needs to be run once to baseline the collection; after that, we can use the version number with the "since" parameter to fetch only new or modified records.
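
To illustrate why a full pull takes many round trips, here is a minimal sketch of offset paging against a stand-in fetch function. It is not how pyzotero implements its wrapper; `iter_pages`, `fake_fetch`, and the in-memory `fake_library` are hypothetical names and data used only to show the pattern.

```python
from typing import Callable, Iterator, List


def iter_pages(fetch_page: Callable[[int, int], List[dict]],
               page_size: int = 50) -> Iterator[dict]:
    """Yield items page by page until a short or empty page is returned."""
    start = 0
    while True:
        page = fetch_page(start, page_size)
        yield from page
        if len(page) < page_size:
            break  # a short page means there is nothing left to request
        start += page_size


# Stand-in for the remote library: 120 fake items held in memory
fake_library = [{"key": f"ITEM{n:04d}", "version": n} for n in range(120)]
calls = []  # records the offset of every simulated request


def fake_fetch(start: int, limit: int) -> List[dict]:
    calls.append(start)
    return fake_library[start:start + limit]


items = list(iter_pages(fake_fetch))
print(len(items), "items fetched in", len(calls), "requests")
```

With one request per page, the number of requests grows linearly with the size of the library, which is the cost that caching avoids on later runs.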

In [44]:
reports = z_ni43101.everything(z_ni43101.items(itemType='report'))
for item in reports:
    ni43101_cache.update_one({'_id': item['key']}, {'$set': item}, upsert=True)

attachments = z_ni43101.everything(z_ni43101.items(itemType='attachment'))
for item in attachments:
    ni43101_cache.update_one({'_id': item['key']}, {'$set': item}, upsert=True)

# Updating cache from last version recorded
Here, we pull the maximum version number from the cache, request any items newer than that version, and write them to the cache. This could be reworked to read the version from the "permanent" records in the GeoKB instead, leaving out the cache.
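
For comparison, the same "highest version seen so far" value can be computed in plain Python over a list of cached documents. This is a sketch only: `max_cached_version` is a hypothetical helper, the sample documents are made up, and the default of 0 for an empty cache is an assumption chosen so that a later `since` request would ask for everything newer than version 0.

```python
def max_cached_version(docs, default=0):
    """Return the highest 'version' among cached documents, or `default` if none."""
    return max((d["version"] for d in docs if "version" in d), default=default)


# Made-up documents shaped like cached Zotero items
sample_docs = [
    {"_id": "A", "version": 78080},
    {"_id": "B", "version": 78084},
    {"_id": "C"},  # a document without a version is ignored
]

print(max_cached_version(sample_docs))
print(max_cached_version([]))
```

The explicit default matters because indexing the first result of an aggregation over an empty collection raises an `IndexError`.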

In [9]:
max_version_agg = [
    {
        '$group': {
            '_id': None, 
            'maxVersion': {
                '$max': '$version'
            }
        }
    }
]

max_version = list(ni43101_cache.aggregate(max_version_agg))[0]['maxVersion']
print(max_version)

new_items = z_ni43101.everything(z_ni43101.items(since=max_version))
if new_items:
    for item in new_items:
        ni43101_cache.update_one({'_id': item['key']}, {'$set': item}, upsert=True)

max_version = list(ni43101_cache.aggregate(max_version_agg))[0]['maxVersion']
print(max_version)

78084
78085


## All Cached Docs
We could iterate over the cache, but the document set is small enough that it is more convenient to pull it all at once. Documents (items) returned and cached include both the metadata items documenting each report and the attachments (notes are also "items" but are not used in this context). We put everything into a list of dictionaries for further work below.
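
Because reports and attachments sit in the same list, it can help to index the attachments by their parent report once rather than rescanning the full list for every report. The sketch below assumes each cached document has the `data.itemType` and `data.parentItem` fields used elsewhere in this notebook; `attachments_by_parent` and the sample documents are hypothetical.

```python
from collections import defaultdict


def attachments_by_parent(docs):
    """Group attachment documents under the key of their parent item."""
    index = defaultdict(list)
    for doc in docs:
        data = doc.get("data", {})
        # Skip non-attachments and attachments that have no parent item
        if data.get("itemType") == "attachment" and data.get("parentItem"):
            index[data["parentItem"]].append(doc)
    return index


# Made-up documents shaped like cached Zotero items
sample_docs = [
    {"_id": "R1", "key": "R1", "data": {"itemType": "report", "title": "Example report"}},
    {"_id": "A1", "key": "A1", "data": {"itemType": "attachment", "parentItem": "R1"}},
    {"_id": "A2", "key": "A2", "data": {"itemType": "attachment", "parentItem": "R1"}},
    {"_id": "A3", "key": "A3", "data": {"itemType": "attachment"}},  # no parent
]

index = attachments_by_parent(sample_docs)
print([a["key"] for a in index.get("R1", [])])
```

Using `index.get(report_key, [])` for lookups returns an empty list for reports without attachments and avoids adding empty entries to the index.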

In [59]:
cached_zotero_docs = list(ni43101_cache.find())

# GeoKB Entities
Here we pull current NI 43-101 entities from the GeoKB via a SPARQL query to produce a simple mapping from Zotero's item key to GeoKB QID. If our strategy is simply to rebuild relevant parts of entities or check for missing entities, this is sufficient. We would need to pull additional claims information if we wanted to do a more sophisticated update operation.
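
The mapping itself comes down to taking the last path segment of two URLs per result row. The sketch below shows that step in plain Python over made-up rows, including rows where the optional metadata URL is missing; `key_from_url` and `build_qid_lookup` are hypothetical helper names, and the QIDs and Zotero key are invented.

```python
def key_from_url(url):
    """Return the last path segment of a URL, or None when the URL is missing."""
    if not isinstance(url, str) or not url:
        return None  # covers None, NaN, and empty strings
    return url.rstrip("/").split("/")[-1]


def build_qid_lookup(rows):
    """Map Zotero item key -> GeoKB QID, skipping rows without a metadata URL."""
    lookup = {}
    for row in rows:
        z_key = key_from_url(row.get("meta_url"))
        if z_key is not None:
            lookup[z_key] = key_from_url(row["item"])
    return lookup


# Made-up rows shaped like the SPARQL results
rows = [
    {"item": "https://geokb.wikibase.cloud/entity/Q1001",
     "meta_url": "https://w3id.org/usgs/z/4530692/ABCD1234"},
    {"item": "https://geokb.wikibase.cloud/entity/Q1002", "meta_url": None},
    {"item": "https://geokb.wikibase.cloud/entity/Q1003", "meta_url": float("nan")},
]

print(build_qid_lookup(rows))
```

Skipping rows without a metadata URL keeps a meaningless `None` key out of the lookup.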

In [109]:
query_geokb_reports = """
PREFIX gp: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX ge: <https://geokb.wikibase.cloud/entity/>

SELECT ?item ?meta_url
WHERE {
  ?item gp:P1 ge:Q10 .
  OPTIONAL {
    ?item gp:P141 ?meta_url .
  }
}
"""

geokb_reports = geokb.sparql_query(query_geokb_reports)
geokb_reports['qid'] = geokb_reports['item'].apply(lambda x: x.split('/')[-1])
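# The metadata URL is OPTIONAL in the query, so guard against missing (None/NaN) values
geokb_reports['z_key'] = geokb_reports['meta_url'].apply(lambda x: x.split('/')[-1] if isinstance(x, str) and x else None)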

qid_lookup = geokb_reports.set_index('z_key')['qid'].to_dict()

print(len(geokb_reports))

14482


# Represent Zotero items in GeoKB
The sequence below for working with the GeoKB can be run on all report items pulled from the cache (or pulled directly from Zotero); as written, it loops over only the reports that are missing from the GeoKB. It relies on the qid_lookup dictionary to determine whether we already have an item in place, and either pulls the existing item (based on the Zotero key value in the metadata URL) or creates a new one. The sequence (re)builds the item claims directly associated with the Zotero source, ignoring any other claims that might be in place from other sources.
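
The create-or-update decision can be separated from the write logic. As a minimal sketch, assuming each report dictionary carries its Zotero key under `key`, the hypothetical helper below splits reports into those that already map to a QID and those that need a new item; the sample data is made up.

```python
def split_by_presence(reports, qid_lookup):
    """Partition reports into (qid, report) pairs to update and reports to create."""
    to_update, to_create = [], []
    for report in reports:
        qid = qid_lookup.get(report["key"])
        if qid:
            to_update.append((qid, report))
        else:
            to_create.append(report)
    return to_update, to_create


# Made-up reports and lookup for illustration
sample_reports = [{"key": "ABCD1234"}, {"key": "EFGH5678"}, {"key": "IJKL9012"}]
sample_lookup = {"ABCD1234": "Q1001", "IJKL9012": "Q1003"}

to_update, to_create = split_by_presence(sample_reports, sample_lookup)
print("update:", [qid for qid, _ in to_update])
print("create:", [r["key"] for r in to_create])
```

Running the loop over only the `to_create` list corresponds to checking for missing entities, while running it over both lists corresponds to a full rebuild.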

In [119]:
reports = [i for i in cached_zotero_docs if i['data']['itemType'] == 'report']
# Reports in the cache whose Zotero key does not yet map to a GeoKB item
missing_reports = [i for i in reports if i['_id'] not in qid_lookup]
print(len(missing_reports))

2


In [None]:
for source_doc in missing_reports:
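    # Attachments whose parent is this report; standalone attachments may have no parentItem
    source_attachments = [
        i for i in cached_zotero_docs
        if i['data']['itemType'] == 'attachment'
        and i['data'].get('parentItem') == source_doc['key']
    ]
    # Reuse the existing GeoKB item when the Zotero key already maps to a QID
    qid = qid_lookup.get(source_doc['key'])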

    if qid:
        item = geokb.wbi.item.get(qid)
    else:
        item = geokb.wbi.item.new()

    item.labels.set('en', source_doc['data']['title'])
    item.descriptions.set('en', 'an NI 43-101 Technical Report pulled from the GeoArchive collection')

    item.claims.add(
        geokb.datatypes.Item(
            prop_nr=geokb.prop_lookup['instance of'],
            value='Q10'
        ),
        action_if_exists=geokb.action_if_exists.REPLACE_ALL
    )

    item.claims.add(
        geokb.datatypes.URL(
            prop_nr=geokb.prop_lookup['metadata URL'],
            value=f"https://w3id.org/usgs/z/4530692/{source_doc['key']}",
            qualifiers=[
                geokb.datatypes.String(
                    prop_nr=geokb.prop_lookup['MIME type'],
                    value='text/html'
                ),
                geokb.datatypes.String(
                    prop_nr=geokb.prop_lookup['MIME type'],
                    value='application/json'
                )
            ]
        ),
        action_if_exists=geokb.action_if_exists.REPLACE_ALL
    )

    item.claims.add(
        geokb.datatypes.Quantity(
            prop_nr=geokb.prop_lookup['Zotero Version Number'],
            amount=source_doc['version']
        ),
        action_if_exists=geokb.action_if_exists.REPLACE_ALL
    )

    attachment_claims = []
    for attachment in source_attachments:
        attachment_qualifiers = geokb.models.Qualifiers()
        attachment_qualifiers.add(
            geokb.datatypes.String(
                prop_nr=geokb.prop_lookup['MIME type'],
                value=attachment['data']['contentType']
            )
        )
        attachment_qualifiers.add(
            geokb.datatypes.String(
                prop_nr=geokb.prop_lookup['checksum'],
                value=attachment['data']['md5']
            )
        )

        attachment_claims.append(
            geokb.datatypes.URL(
                prop_nr=geokb.prop_lookup['content URL'],
                value=f"https://www.zotero.org/groups/4530692/usgs_ni_43-101_reports/items/{source_doc['key']}/attachment/{attachment['key']}/reader",
                qualifiers=attachment_qualifiers
            )
        )
        attachment_claims.append(
            geokb.datatypes.String(
                prop_nr=geokb.prop_lookup['Zotero Attachment Key'],
                value=attachment['key'],
                qualifiers=attachment_qualifiers
            )
        )
    item.claims.add(attachment_claims, action_if_exists=geokb.action_if_exists.REPLACE_ALL)

    # Write the item data
    try:
        response = item.write(summary=f"item {'updated' if qid else 'added'} from Zotero source information")
        print(f"{'updated' if qid else 'added'} {response.id}")
    except Exception as e:
        print(str(e))
        display(item.get_json())
        continue

    # Cache the item and attachment raw source information
    talk_page = geokb.mw_site.pages[f"Item_talk:{response.id}"]
    cached_content = {
        'item': source_doc['data'],
        'attachments': [i['data'] for i in source_attachments]
    }
    talk_page.save(yaml.dump(cached_content), summary="cached content from Zotero source information")
    print('cached source metadata to item talk page')