This notebook works through a process for interfacing with a Zotero Group Library to harvest or pull its contents for some particular use. This is the start to a Python package that will be leveraged to handle several types of actions we need to perform on Zotero Group Libraries that we will be registering as GeoArchive collections:

* Syncing metadata and file content to deep storage on organizational infrastructue to ensure longevity
* Registering items via a handle mechanism to provide evergreen URLs for use in citations
* Assessing and reporting on the completeness and quality of metadata
* Analyzing item tags against controlled vocabulary sources to establish linked data
* Piping library contents to further NLP and other AI processing in the xDD Digital Library

Our initial use case with the library of [NI 43-101 Reports](https://www.zotero.org/groups/4530692/usgs_ni_43-101_reports/library) provides a reasonably large set of documents to work out our processes. The collection is large enough that live API calls when we try to get all 15K+ documents or 6K+ tags are pretty slow, meaning that we need a process that will leverage Zotero's versioning mechanism and the "since" query parameter we can use with pyzotero (a Python abstraction on the Zotero REST API).

To deal with this dynamic and provide a reasonably performant system, I'm breaking the architecture into two logical parts:

1) A process that can be containerized to periodically check the library for changes and cache metadata files in a special [inventory item](https://www.zotero.org/groups/4530692/usgs_ni_43-101_reports/items/WSFB4RQE/library) within the library itself. The three files stored here (items, collections, and tags) can then be accessed with a read-only API connection from any downstream users. Accessing these files only takes a few seconds vs. 20+ minutes to pull everything fresh from the API.
2) Processes that work with the file caches (which store the raw information from the API) in various ways to accomplish the value-added GeoArchive services we are layering onto Zotero libraries.

Any given system could decide to bypass the file caches and work directly against the live API, guaranteeing the freshest information.

In [55]:
from pyzotero import zotero
from getpass import getpass
import pandas as pd
import json
import os
import requests
import sys
import io
import PyPDF2
import magic

The Zotero API provides access to personal (user) and group libraries, with slightly different methods for each. Here we're only covering access to group libraries as this fits our use case. To instantiate an API connection with pyzotero, you need to provide the group library identifier and an API key. Library IDs are open and available to anyone. Here, we are interfacing with a library/collection of [NI 43-101 Reports](https://www.zotero.org/groups/4530692/usgs_ni_43-101_reports/library). You can see the library ID there in the URL: 4530692.

An API key has to be set up by an authorized user and will specify the type of access allowed and what library or libraries the key will access. In this case, we set up a read-only API key for the specific library, which should be the best practice for this kind of system.

In the following codeblock, we ask for these two parameters, hiding the API key as a secret. (Note: Contact the notebook author if you'd like to participate or set this up to work with your own specified library and API key.) In a production implementation of this idea, the Library ID and API Key parameters could be set up as environment variables for an application or some kind (e.g., Lambda process, etc.).

In [2]:
zotero_api_key = getpass("API Key")

inventory_item_key = "WSFB4RQE"
zotero_library_id = "4530692"

API Key ························


# Common Functions
These are functions that will be in common between read-only and write operations with Zotero. They include making a basic connection to a group library with a library ID and an API key, getting the inventory item and file attachments, loading the raw inventory metadata (items, collections, and tags), and putting the essential inventory data into Pandas dataframes. The latter includes some specialized logic for parsing what are essentially simple compound identifiers for tags that help us align them with vocabularies.

In [3]:
def zotero_library(library_id, api_key):
    # Establish connection to library with an ID and API Key
    zot = zotero.Zotero(
        library_id=library_id,
        library_type='group',
        api_key=api_key,
    )
    return zot

def get_inventory_files(z, inventory_item):
    inventory_files = z.children(inventory_item, itemType="attachment")
    inventory_file_items = {
        "inventory_item": z.item(inventory_item),
        "inventory_item_keys": [inventory_item] + [i["key"] for i in inventory_files]
    }
    for file_type in ["items","collections","tags"]:
        inventory_file_items[file_type] = next(
            (
                i for i in inventory_files 
                if i["data"]["filename"] == f"{file_type}.json"
            ), None)
        
    return inventory_file_items

def load_raw_inventory(z, inventory_item):
    existing_files = get_inventory_files(z, inventory_item)

    inventory = {
        "inventory_item_keys": existing_files["inventory_item_keys"],
        "inventory_files": existing_files
    }
    for file_type in ["items","collections","tags"]:
        raw_data = z.file(existing_files[file_type]["key"])
        inventory[file_type] = raw_data
    
    return inventory

def load_df_inventory(raw_inventory):
    inventory = {}
    for k,v in raw_inventory.items():
        if k == "tags":
            df_tags = pd.DataFrame([
                {
                    "tag": i,
                    "type": i.split(":")[0],
                    "value": i.split(":")[-1]
                } for i in v if ":" in i
            ])
            inventory["tags"] = df_tags.convert_dtypes()
        elif k in ["items","collections"]:
            inventory[k] = pd.DataFrame([i["data"] for i in v]).convert_dtypes()

    return inventory

In [4]:
zot = zotero_library(zotero_library_id, zotero_api_key)

# Baselining a Library
The following functions handle the process of initially baselining items, collections, and tags for a given library into local cache files in JSON format and then uploading those to a specified inventory item in the Zotero library. This could be a little more robust to do things like discover the inventory item based on characteristics or create an inventory item that doesn't already exist, but I chose to make this very explicit by setting the actual key value for the item we want to use in this case.

In [32]:
def baseline_tags(z, output_path="data"):
    tags = z.everything(z.tags())
    json.dump(tags, open(f"{output_path}/tags.json", "w"))

def baseline_collections(z, output_path="data"):
    records = z.everything(z.collections())
    json.dump(records, open(f"{output_path}/collections.json", "w"))

def baseline_items(z, inventory_item, output_path="data"):
    records = z.everything(z.items())
    
    inventory_files = get_inventory_files(z, inventory_item)
    records = [
        i for i in records 
        if i["key"] not in inventory_files["inventory_item_keys"]
    ]
    json.dump(records, open(f"{output_path}/items.json", "w"))

def baseline_cache(
    z, 
    inventory_item, 
    output_path="data"
):
    inventory_files = get_inventory_files(z, inventory_item)

    new_uploads = []
    for file_type in ["items","collections","tags"]:
        if os.path.exists(f"{output_path}/{file_type}.json"):
            if inventory_files[file_type] is not None:
                z.delete_item(z.item(inventory_files[file_type]["key"]))
                print("Deleted existing cache file for", file_type)
            new_uploads.append(os.path.abspath(f"{output_path}/{file_type}.json"))
        
    if new_uploads:
        inventory_item_update = {
            "key": inventory_item,
            "version": inventory_files["inventory_item"]["data"]["version"],
            "extra": z.last_modified_version()
        }
        z.update_item(inventory_item_update)
        print("Updated inventory item with last modified version for the library")
        
        z.attachment_simple(new_uploads, parentid=inventory_item)
        print("Created new files in inventory", new_uploads)


In [17]:
%%time
baseline_tags(zot)

CPU times: user 912 ms, sys: 32.2 ms, total: 944 ms
Wall time: 1min 1s


In [18]:
%%time
baseline_collections(zot)

CPU times: user 33.6 ms, sys: 3.87 ms, total: 37.5 ms
Wall time: 2.18 s


139

In [20]:
%%time
baseline_items(zot)

CPU times: user 8.55 s, sys: 314 ms, total: 8.86 s
Wall time: 24min 56s


In [33]:
%%time
baseline_cache(zot, inventory_item_key)

Deleted existing cache file for items
Deleted existing cache file for collections
Deleted existing cache file for tags
Updated inventory item with last modified version for the library
Created new files in inventory ['/home/jovyan/experiments/data/items.json', '/home/jovyan/experiments/data/collections.json', '/home/jovyan/experiments/data/tags.json']
CPU times: user 287 ms, sys: 20.3 ms, total: 307 ms
Wall time: 15.2 s


# Updating the Inventory
Updating the inventory involves several steps that need to operate in series. These could be made more robust in the code by building a Python class.

1) Load the raw inventory files from the specified Zotero inventory item
2) Determine the applicable sequenced version numbers present in the cached inventories
3) Check for any deletions that have happened since the last cache and remove those from the inventories (a given system may need to handle deletions in a particular way)
4) Get new items, collections, and tags using the version numbers determined from the previous cache
5) Remove the updated records from the previous cache and add new/updated records to the new caches
6) Save the new cache files to some mounted disc
7) Push the new cache files to the Zotero inventory item

In [36]:
def update_inventory(z, inventory_item, output_path="data"):
    inventory_files = get_inventory_files(z, inventory_item)
    raw_inventory = load_raw_inventory(zot, inventory_item_key)

    cache_item_version = max([i["version"] for i in raw_inventory["items"]])
    cache_collection_version = max([i["version"] for i in raw_inventory["collections"]])
    min_cache_version = min([cache_item_version, cache_collection_version])

    deletions = z.deleted(since=min_cache_version)
    
    new_inventory = {}
    for x in ["items","collections"]:
        new_inventory[x] = [
            i for i in raw_inventory[x] if i["key"] not in deletions[x]
        ]
    new_inventory["tags"] = [
        i for i in raw_inventory["tags"] if i not in deletions["tags"]
    ]
    
    new_items = z.everything(z.items(since=cache_item_version))
    new_collections = z.everything(z.collections(since=cache_collection_version))
    new_tags = z.everything(z.tags(since=cache_item_version))
    
    if new_tags:
        new_inventory["tags"].extend(new_tags)
        new_inventory["tags"] = list(set(new_inventory["tags"]))
        
    if new_collections:
        new_inventory["collections"] = [
            i for i in new_inventory["collections"]
            if i["key"] not in [
                x["key"] for x in new_collections
            ]
        ]
        new_inventory["collections"].extend(new_collections)
        
    if new_items:
        items_wo_inventory = [
            i for i in new_items 
            if i["key"] not in raw_inventory["inventory_item_keys"]
        ]
        if items_wo_inventory:
            new_inventory["items"] = [
                i for i in new_inventory["items"]
                if i["key"] not in [
                    x["key"] for x in new_items
                ]
            ]
            new_inventory["items"].extend(items_wo_inventory)
    
    new_uploads = []
    for x in ["items","collections","tags"]:
        if raw_inventory[x] != new_inventory[x]:
            fp = f"{output_path}/{x}.json"
            json.dump(new_inventory[x], open(fp, "w"))
            print(f"WROTE {len(new_inventory[x])} RECORDS TO {fp}")
            
            cache_file_key = raw_inventory["inventory_files"][x]["key"]
            z.delete_item(z.item(cache_file_key))
            print(f"DELETED PREVIOUS CACHE FILE FOR {x}: {cache_file_key}")
            new_uploads.append(os.path.abspath(fp))

    if new_uploads:
        inventory_item_update = {
            "key": inventory_item,
            "version": inventory_files["inventory_item"]["data"]["version"],
            "extra": z.last_modified_version()
        }
        z.update_item(inventory_item_update)
        print("Updated inventory item with last modified version for the library")
        
        z.attachment_simple(new_uploads, parentid=inventory_item)
        print("Created new files in inventory", new_uploads)
        


In [37]:
%%time
update_inventory(zot, inventory_item_key)

WROTE 30993 RECORDS TO data/items.json
DELETED PREVIOUS CACHE FILE FOR items: SWD5EPAI
Updated inventory item with last modified version for the library
Created new files in inventory ['/home/jovyan/experiments/data/items.json']
CPU times: user 4.33 s, sys: 340 ms, total: 4.66 s
Wall time: 21.8 s


# Reading and working with library metadata
Assuming we put a process in place to routinely run the update process laid out above, we'll have an inventory item in a Zotero Group Library registered as a "GeoArchive Repository" that can be worked against in predictable ways to do the 5 major things listed in the intro.

One task is to take the raw content for what the Zotero API gives us and work through that to break away from the specific ways Zotero has of organizing things to an abstraction that is more generalized. We need to at least understand what specific pieces of information we're going to get from where when it comes to mapping that information to some other schema like the xDD "article" schema. For instance, what we access and store as "items" from Zotero include both a metadata structure for a logical "report" (or other item type) and attachments that have their own metadata identifiers/links to actual file content. Attachments can also be what the Zotero user interface refers to as "notes."

The organizational structure in Zotero of collections and sub-collections is also important, adding valuable metadata to the mix. However, we have to use the "placement" of items (via a list of collection identifiers) with collection metadata to derive that additional metadata for the items. Likewise, tags have tremendous utility when applied thoughtfully, but they are really only non-controlled string values in Zotero and need additional work to make them more useful. In both of these areas, collection organization and tagging, we are letting contributors/managers of a Library determine what they want to do that makes things easy and meaningful for their purposes but then layer on some interpretation and value-added processing to get more use out of the structure.

I put in a function to convert the most salient information from the data structures output by the Zotero API to simple tabular data in Pandas dataframes. This lets us do a fair bit of data wrangling with the data in a familiar and useful form.

In [5]:
%%time
raw_inventory = load_raw_inventory(zot, inventory_item_key)
df_inventory = load_df_inventory(raw_inventory)

CPU times: user 1.38 s, sys: 188 ms, total: 1.57 s
Wall time: 7.06 s


In [11]:
df_inventory["tags"]

Unnamed: 0,tag,type,value
0,Project:Hutton Garnet Beaches,Project,Hutton Garnet Beaches
1,Project:F.W. Lewis Battle Mountain Property,Project,F.W. Lewis Battle Mountain Property
2,Commodity:Ba,Commodity,Ba
3,Project:Salt Wells,Project,Salt Wells
4,Project:Verdstone,Project,Verdstone
...,...,...,...
6479,Project:Blanket,Project,Blanket
6480,Project:Axmin properties,Project,Axmin properties
6481,Project:Nord Tirek,Project,Nord Tirek
6482,Project:Teal E&M Central Africa,Project,Teal E&M Central Africa


# Pulling metadata and files for xDD
There are a number of things I don't yet know about how the xDD infrastructure operates and how the group at U. Wisconsin-Madison would go about setting up a harvest process. In the notes and codeblocks below, I'm attempting to provide what I think is a reasonable mapping from what we have in a Zotero group library to what I think the receiving end is going to require. 

## Metadata
The NI 43-101 Reports will be piped into the xDD cyberinfrastructure for further processing using NLP and other methods to help us extract useful data from the reports. They will essentially line up with the "articles" schema used in the xDD system (see below). This is a scaled down set of citation and access metadata that we need to line up with.

Given the fact that we already have a pretty limited schema for our NI 43-101 Reports at this point, we don't have a whole lot to map from - essentially title and year. These are atypical documents anyway. They lack common attributes like "journal" or "publisher" as they are not journal articles. We also do not have a whole lot of author information populated at this time; something we are hoping we can use named entity recognition to help out with at some level. We can populate links and identifiers based on Zotero as the repository source, and we'll layer in anything we do with evergreen URLs at some point.

To help facilitate a fairly straightforward harvesting process, the function below builds a new derivative for a mapping to xDD that includes crafted identifiers and links. Because we have a one-to-one relationship between a report item and a single PDF file, we can simply bring the two item types together in this instance.

## Files
Files have to be pulled through the API because of the way Zotero manages for potential copyright issues. File access for files stored in the Zotero cloud requires being authenticated through a web browser, desktop client, or the API. The pyzotero package will stream file content as binary or string that can be written to wherever (file() function), or the dump() function wraps file() and dumps the file to a mounted path. This provides flexibility in developing an application where files can be streamed to an online storage capacity such as S3 or anything else vs. having a file system mounted.

Incidentally, the "Zotero cloud" is already Amazon S3 storage (in us-east-2), but there is not currently a direct way to access file contents natively from other AWS resources in the same region because of the aforementioned security layer.

In [43]:
# A view of what the article schema could contain from xDD API documentation
xdd_articles_info = requests.get("https://geodeepdive.org/api/articles").json()
xdd_articles_info["success"]["options"]["fields"]

{'type': 'Type of publication (article, book, etc)',
 'title': 'Article title',
 'journal': 'The name of the journal',
 'link': "An array of objects, each containing the keys 'type' and 'url'. The URL is an external link to the original document.",
 'link.type': 'Type of link -- either a link to the publisher (publisher) or the ScienceDirect page for the article (sciencedirect)',
 'link.url': "The URL to the document at the publisher's domain",
 'vol': 'Volume',
 'number': 'Issue',
 'authors': "An array of objects, each containing a key 'name' and a value equal to the name of one author",
 'publisher': 'Publisher (or primary source) of the article (e.g. Elsevier, USGS)',
 'pages': "Articles' page numbers within the issue",
 'year': 'Year of publication',
 'identifier': "An array of objects, each containing a 'type' and 'id'. Used for providing DOIs, as well as other external identifiers"}

The following codeblock builds a scaled down derivative data structure that aligns minimal report metadata together with the file attachment identifier toward the structure xDD is going to need.

The identifiers in this case are really specific to Zotero, and I completely made up the type strings. We can call these anything.

The links (built from the identifiers) are all valid and reasonably persistent resolvable URLs that point to both the Zotero REST API and the Zotero web interface. I included links for both the metadata record for the report and its corresponding primary attachment to the PDF.

I put in a hard value here for "publisher" to the Canadian Securities Administrators. This is the Canadian government regulatory organization that essentially receives these reports from mining companies and then shares them out online, at least for a while. It's a reasonable value that will enable certain important query functionality in the xDD system.

I build both identifier and link into the dataframe using the naming convention and target structure (arrays of objects) from xDD.

In [44]:
def build_identifier(record):
    identifiers = []
    identifiers.append({
        "type": "Zotero Item Key",
        "id": record.key
    })
    identifiers.append({
        "type": "Zotero Attachment Key",
        "id": record.file_key
    })
    return identifiers

def build_link(record):
    links = []
    links.append({
        "type": "Zotero Item API",
        "link": f"https://api.zotero.org/groups/{zotero_library_id}/items/{record.key}"
    })
    links.append({
        "type": "Zotero Attachment API",
        "link": f"https://api.zotero.org/groups/{zotero_library_id}/items/{record.file_key}"
    })
    links.append({
        "type": "Zotero Item Link",
        "link": f"https://www.zotero.org/groups/{zotero_library_id}/items/{record.key}"
    })
    links.append({
        "type": "Zotero Item Link",
        "link": f"https://www.zotero.org/groups/{zotero_library_id}/items/{record.key}/attachment/{record.file_key}"
    })
    return links

i = df_inventory["items"]

reports = i[i.itemType == "report"][["key","title","date"]].rename(columns={"date":"year"}).copy()
files = i[i.itemType == "attachment"][["key","parentItem","filename"]].rename(columns={"key": "file_key"}).copy()

xdd_feed = pd.merge(
    left=reports,
    right=files,
    how="left",
    left_on="key",
    right_on="parentItem"
)

xdd_feed["publisher"] = "Canadian Securities Administrators"
xdd_feed["identifier"] = xdd_feed.apply(lambda x: build_identifier(x), axis=1)
xdd_feed["link"] = xdd_feed.apply(lambda x: build_link(x), axis=1)

xdd_feed

Unnamed: 0,key,title,year,file_key,parentItem,filename,publisher,identifier,link
0,VXXCD8BD,Hutton Prefeasability & Marketing Summary,2001,5EX5ITBM,VXXCD8BD,ac125bb7-5b4a-4701-bf17-59b7dbfa9e88.pdf,Canadian Securities Administrators,"[{'type': 'Zotero Item Key', 'id': 'VXXCD8BD'}, {'type': 'Zotero Attachment Key', 'id': '5EX5ITBM'}]","[{'type': 'Zotero Item API', 'link': 'https://api.zotero.org/groups/4530692/items/VXXCD8BD'}, {'type': 'Zotero Attachment API', 'link': 'https://api.zotero.org/groups/4530692/items/5EX5ITBM'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/VXXCD8BD'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/VXXCD8BD/attachment/5EX5ITBM'}]"
1,IFCEV3WF,"NI 43-101 Technical Report (1) for the F.W. Lewis Battle Mountain Property in Nevada, United States dated 2004",2004,4HK7SGM6,IFCEV3WF,fa7d0ccf-6642-4e12-9875-918f54491c72.pdf,Canadian Securities Administrators,"[{'type': 'Zotero Item Key', 'id': 'IFCEV3WF'}, {'type': 'Zotero Attachment Key', 'id': '4HK7SGM6'}]","[{'type': 'Zotero Item API', 'link': 'https://api.zotero.org/groups/4530692/items/IFCEV3WF'}, {'type': 'Zotero Attachment API', 'link': 'https://api.zotero.org/groups/4530692/items/4HK7SGM6'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/IFCEV3WF'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/IFCEV3WF/attachment/4HK7SGM6'}]"
2,AXHWCJDI,"NI 43-101 Technical Report (1) for the F.W. Lewis Battle Mountain Property in Nevada, United States dated 2007",2007,ZK4HFJ6B,AXHWCJDI,7448b402-7756-45a6-ad24-1b8a336c6c23.pdf,Canadian Securities Administrators,"[{'type': 'Zotero Item Key', 'id': 'AXHWCJDI'}, {'type': 'Zotero Attachment Key', 'id': 'ZK4HFJ6B'}]","[{'type': 'Zotero Item API', 'link': 'https://api.zotero.org/groups/4530692/items/AXHWCJDI'}, {'type': 'Zotero Attachment API', 'link': 'https://api.zotero.org/groups/4530692/items/ZK4HFJ6B'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/AXHWCJDI'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/AXHWCJDI/attachment/ZK4HFJ6B'}]"
3,JXGEEQ33,NI 43-101 Technical Report (1) for the F.W. Lewis Battle Mountain Property in North America dated 2004,2004,T3PDP76G,JXGEEQ33,ab856dd7-353b-4517-acbb-d0c9eb5efcac.pdf,Canadian Securities Administrators,"[{'type': 'Zotero Item Key', 'id': 'JXGEEQ33'}, {'type': 'Zotero Attachment Key', 'id': 'T3PDP76G'}]","[{'type': 'Zotero Item API', 'link': 'https://api.zotero.org/groups/4530692/items/JXGEEQ33'}, {'type': 'Zotero Attachment API', 'link': 'https://api.zotero.org/groups/4530692/items/T3PDP76G'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/JXGEEQ33'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/JXGEEQ33/attachment/T3PDP76G'}]"
4,SK738P4V,NI 43-101 Technical Report (1) for the F.W. Lewis Battle Mountain Property in North America dated 2007,2007,3M5BM6ZT,SK738P4V,874322da-02a4-4430-a3e5-a86e556f6219.pdf,Canadian Securities Administrators,"[{'type': 'Zotero Item Key', 'id': 'SK738P4V'}, {'type': 'Zotero Attachment Key', 'id': '3M5BM6ZT'}]","[{'type': 'Zotero Item API', 'link': 'https://api.zotero.org/groups/4530692/items/SK738P4V'}, {'type': 'Zotero Attachment API', 'link': 'https://api.zotero.org/groups/4530692/items/3M5BM6ZT'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/SK738P4V'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/SK738P4V/attachment/3M5BM6ZT'}]"
...,...,...,...,...,...,...,...,...,...
15492,4KDZ8MD7,NI 43-101 Technical Report for the Blanket Project in Africa dated 2018,2018,F92QMURK,4KDZ8MD7,24ee76d2-8b4c-40f0-acb7-c3e20e34194c.pdf,Canadian Securities Administrators,"[{'type': 'Zotero Item Key', 'id': '4KDZ8MD7'}, {'type': 'Zotero Attachment Key', 'id': 'F92QMURK'}]","[{'type': 'Zotero Item API', 'link': 'https://api.zotero.org/groups/4530692/items/4KDZ8MD7'}, {'type': 'Zotero Attachment API', 'link': 'https://api.zotero.org/groups/4530692/items/F92QMURK'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/4KDZ8MD7'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/4KDZ8MD7/attachment/F92QMURK'}]"
15493,HJ2CGG2R,NI 43-101 Technical Report for the Axmin properties Project in Africa dated 2005,2005,TS4VS4SI,HJ2CGG2R,bc6f3c34-e2d8-45e5-a955-dfd13b3ea3b4.pdf,Canadian Securities Administrators,"[{'type': 'Zotero Item Key', 'id': 'HJ2CGG2R'}, {'type': 'Zotero Attachment Key', 'id': 'TS4VS4SI'}]","[{'type': 'Zotero Item API', 'link': 'https://api.zotero.org/groups/4530692/items/HJ2CGG2R'}, {'type': 'Zotero Attachment API', 'link': 'https://api.zotero.org/groups/4530692/items/TS4VS4SI'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/HJ2CGG2R'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/HJ2CGG2R/attachment/TS4VS4SI'}]"
15494,8V7JI652,NI 43-101 Technical Report for the Asquith Properties Project in Africa dated 2001,2001,32G2MVPS,8V7JI652,53b86681-203a-4721-8da7-814ecf706aa7.pdf,Canadian Securities Administrators,"[{'type': 'Zotero Item Key', 'id': '8V7JI652'}, {'type': 'Zotero Attachment Key', 'id': '32G2MVPS'}]","[{'type': 'Zotero Item API', 'link': 'https://api.zotero.org/groups/4530692/items/8V7JI652'}, {'type': 'Zotero Attachment API', 'link': 'https://api.zotero.org/groups/4530692/items/32G2MVPS'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/8V7JI652'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/8V7JI652/attachment/32G2MVPS'}]"
15495,E2FB73NU,"NI 43-101 Technical Report for the Nord Tirek Project in Africa, Algeria dated 2009",2009,FNRUV4PB,E2FB73NU,31a89683-673e-455b-bd23-b361d4d25c0e.pdf,Canadian Securities Administrators,"[{'type': 'Zotero Item Key', 'id': 'E2FB73NU'}, {'type': 'Zotero Attachment Key', 'id': 'FNRUV4PB'}]","[{'type': 'Zotero Item API', 'link': 'https://api.zotero.org/groups/4530692/items/E2FB73NU'}, {'type': 'Zotero Attachment API', 'link': 'https://api.zotero.org/groups/4530692/items/FNRUV4PB'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/E2FB73NU'}, {'type': 'Zotero Item Link', 'link': 'https://www.zotero.org/groups/4530692/items/E2FB73NU/attachment/FNRUV4PB'}]"


Here's a look at what just the essential information from the dataframe would look like as a JSON array, similar to what would come back from the xDD API. The actual process built to harvest information from the cached inventory files would need to use either the "Zotero Attachment Key" from link here or the file_key from the dataframe to execute file download operations.

In [45]:
xdd_feed.head()[["title","year","publisher","link","identifier"]].to_dict(orient="records")

[{'title': 'Hutton Prefeasability & Marketing Summary',
  'year': '2001',
  'publisher': 'Canadian Securities Administrators',
  'link': [{'type': 'Zotero Item API',
    'link': 'https://api.zotero.org/groups/4530692/items/VXXCD8BD'},
   {'type': 'Zotero Attachment API',
    'link': 'https://api.zotero.org/groups/4530692/items/5EX5ITBM'},
   {'type': 'Zotero Item Link',
    'link': 'https://www.zotero.org/groups/4530692/items/VXXCD8BD'},
   {'type': 'Zotero Item Link',
    'link': 'https://www.zotero.org/groups/4530692/items/VXXCD8BD/attachment/5EX5ITBM'}],
  'identifier': [{'type': 'Zotero Item Key', 'id': 'VXXCD8BD'},
   {'type': 'Zotero Attachment Key', 'id': '5EX5ITBM'}]},
 {'title': 'NI 43-101 Technical Report (1) for the F.W. Lewis Battle Mountain Property in Nevada, United States dated 2004',
  'year': '2004',
  'publisher': 'Canadian Securities Administrators',
  'link': [{'type': 'Zotero Item API',
    'link': 'https://api.zotero.org/groups/4530692/items/IFCEV3WF'},
   {'type'

Here's an example of a basic file read workflow using Python and pyzotero. It grabs the first file key from the xdd_feed dataframe we built, reads the file into memory as bytes, and then shows some information about it. I use python-magic to verify the mime type (though Zotero provides this as well and we could have sent this through in the dataframe). I read the file here with the PyPDF2 library and use that to output number of pages (something we need to feed back into our metadata).

In [62]:
%%time
sample_file_key = xdd_feed.iloc[0].file_key
print("Zotero Attachment Key:", sample_file_key)
f = zot.file(sample_file_key)

print("Python type of object:", type(f))

mime = magic.Magic(mime=True)
print("Mime type of object:", mime.from_buffer(f))

print("Byte size of object:", sys.getsizeof(f))

with io.BytesIO(f) as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    num_pages = read_pdf.getNumPages()
    print("Number of pages in PDF:", num_pages)

Zotero Attachment Key: 5EX5ITBM
Python type of object: <class 'bytes'>
Mime type of object: application/pdf
Byte size of object: 81896
Number of pages in PDF: 29
CPU times: user 37.6 ms, sys: 168 µs, total: 37.7 ms
Wall time: 908 ms


# Next Steps
We need to work out how some of the value-added metadata we are collecting as part of the inventory process or may introduce through other means finds its way into some useful structure for further processing in the xDD infrastructure. This includes the geographic context introduced through the collections structure (see some of that below) and tags that give us mineral and other commodities and project names. These are all bits of information that can be used to help seed NLP processes such as NER.

One option may be to simply build on the xdd_feed idea above and send this information along in a structure similar to link and identifier (e.g., "term"). While this is not currently a part of the xDD schema, they may be able to stick the information somewhere and use it.

Longer term, we're going to end up with other, even more robust annotation on much of this content that will be housed in additional attachments on items. We're experimenting with using Zotfile with Zotero to extract highlights and annotation as Markdown files that will be additional attachments on the report items and also accessible via the API. We may also add other attachments such as GeoJSON files containing explicit geometry of one kind or another when we've done work with these reports through other systems. All of this would be quite useful in building out processing algorithms on full text content.

## Collection-based Metadata
The organization scheme used in a Zotero Group Library could provide different types of value-added metadata about items. Collections are used like folders to provide a visual organizational scheme to library items, helping library users navigate through and work within a given library, particularly one that contains a lot of items.

In the case of the NI 43-101 Reports here, we replicated a scheme that has been used for years where these reports have been accumulating on network shared folders. It uses a hierarchy of folder names to indicate a region of the world and country as well as US States for mining projects that are the subject of the reports. Report items can be found at any of these levels with the immediate parent collection indicating the most specific geographic context available in the hierarchical structure. We may add a further level of structure with project names in future.

Since this information is contained in the organizational structure of the items and does not necessarily exist in individual item metadata (e.g., tags), we need to incorporate an "interpretation" of the organizational structure into our process of assembling useful metadata. In this case, the geographic context for a given report item may be useful in driving NLP processes. It can narrow the field on other, more specific place names we might work to recognize in full text or serve as a validation element in extracting point coordinates or other geospatial information.

Each process like this for reading a Zotero library and digesting its metadata is going to be different. The collection organization scheme is going to be different, introducing different types of information to the process. However, something like what we present here can be extended to essentially classify the collections in a given library as metadata elements of different kinds. For this case, we pull all collections and then use the hierarchical structure (via "parentCollection") to assign a "type" value of Region, Country, or US State to each collection. This essentially gives us an additional set of geographic context tags we can send along with our items for further use downstream.

The function below essentially builds a lookup table of collection keys, which can be mapped from our report items and their "collections" property to assign geographic area tags in addition to other tags.

Note: We did do some normalization work in our process of building out the Zotero library structure for the NI 43-101 Reports to ensure that each collection name can be resolved to official ISO sources for countries and US States. Regions represent continents and a few sub-continent but recognizable region names.

In [47]:
def library_collections(raw_collections):
    collections = [
        {
            "type": "Region", 
            "key": i["key"], 
            "name": i["data"]["name"]
        } for i in raw_collections if not i["data"]["parentCollection"]
    ]

    for r in collections:
        collections.extend([
            {
                "type": "Country", 
                "key": i["key"], 
                "name": i["data"]["name"],
                "parent": r["key"]
            } for i in raw_collections if i["data"]["parentCollection"] == r["key"]
        ])

    us_collection = next((i for i in collections if i["name"] == "United States"), None)

    collections.extend([
        {
            "type": "US State", 
            "key": i["key"], 
            "name": i["data"]["name"],
            "parent": us_collection["key"]
        } for i in raw_collections if i["data"]["parentCollection"] == us_collection["key"]
    ])
    
    return collections

In [49]:
collections = library_collections(raw_inventory["collections"])
df_collections = pd.DataFrame(collections)
df_collections

Unnamed: 0,type,key,name,parent
0,Region,KVSV3898,Europe,
1,Region,ZNCQ3V3A,Middle East,
2,Region,BTJTAMND,Central America,
3,Region,MQWZWWM2,North America,
4,Region,BTZ7EAKX,Caribbean,
...,...,...,...,...
163,US State,VGKU7ZGD,Georgia,AKV3WJ97
164,US State,5CQMNT5V,Florida,AKV3WJ97
165,US State,W8R6UH2U,Arizona,AKV3WJ97
166,US State,JK8JU3E2,Alaska,AKV3WJ97


# Additional Notes to Wade through Later
Via the Zotero API, everything in a given library is an item. This includes metadata principals for things like reports and articles, but file attachments are items along with collections and notes and anything else that is uniquely identified and accessed. The various item methods of pyzotero provide access to items and can be used in various ways to access and work through the contents of a library.

For our use case of retrieving documents for processing with the xDD tools, we will be focusing on pulling report type items for essential metadata, blending in collection item information (per the above discussion), to create xDD records. We'll also pull the file items attached to the report metadata items to get PDF document content for processing. Eventually, we will access and pull annotation items also attached to report items for value-added annotation that can be used in NLP training.

Again, each case may be a little bit different in terms of how a given library is organized and managed, resulting in some tweaks to this process for other circumstances. Hopefully, the basic patterns developed here can serve as a start to a more generalized process.

Baselining a given library for the purposes of creating an xDD collection or some similar purpose means working at a point in time to pull together all necessary item information translated into how the target system understands "item" for its context. 

After a library has been baselined, Zotero provides a useful versioning system that should be fairly convenient to use. Every change to any item results in a sequential version number being incremented. The version number is within the context of a given library (group or user). The last_modified_version() method of pyzotero can be used to simply retrieve the highest incremental version number across everything in the library. Usage patterns could include recording the high version number of the library at the point of sync and/or recording individual item versions and then using those within context for an API call, depending on how a workflow is composed. 

The API accepts a "since" parameter with version numbers that will only pull items changed after that version. This can be combined with other API search parameters to look for items changed within a given context (e.g., report type items changed after the last recorded version retrieved). In syncing items from a Zotero library to some other system, version numbers should be recorded and then used to go after newer items. The version number incremental system also applies for new items added to a library, so that if you ask for everything since the last version in a synced collection, you will get new items as well as updates.

Deleted items are a different issue that should also be taken into account. The pyzotero API provides a deleted method with the "since" parameter that returns lists of collections, items, and other things that were removed from the library. How these are dealt with in a target system like xDD is another consideration we'll have to work through, but the API method provides for pulling the deletions and deciding what actions to take.

The actual bibliographic metadata for items in a Zotero library is pretty straightforward and aligns well with various bibliographic metadata standards with information on title, authors, publishers, etc. The API allows for different standard forms of metadata to be requested, with the default being Zotero's own JSON format, so a harvester could be written to use something more standardized like bibtex. However, the Zotero JSON structure provides everything and is likely the easiest thing to build a comprehensize process upon.

The different itemTypes in Zotero control the schema for items, and the information elements in a given itemType schema are fixed. This introduces some challenges in recording all of the information we might want to record, but it produces a predictable structure with little room for interpretation on the downstream end. Pulling items into something like the xDD Digital Library means mapping the information elements to be encountered in Zotero libraries to the target schema.

In the xDD case, the target schema (viewable in the "fields" object [here](https://geodeepdive.org/api/articles)) is also very limited with a barebones set of identification metadata. So, our task of mapping from the barebones NI 43-101 Report metadata to the xDD "article" target is not difficult.

What will be more interesting is working through how we introduce what will essentially be value-added annotation into this process - additional metadata about the documents in the collection that we want to leverage in NLP processing. Geographic context and important name identifiers (projects, names of active mines, mineral and other commodities, etc.) are all important organizing principles that we are capturing anyway and recording through collections and tags. We may start introducing additional attachments on report items containing explicit geospatial data or more precise temporal information. We will be encouraging scientists and support staff to annotate PDF documents and use Zotero's capabilities to extract annotation (highlights and notes) as additional attachments in Markdown format. These processes will all introduce additional content that we want xDD and other systems to access and take advantage of.

Baselining the cache of a given library is a potentially resource-intensive process, with the biggest time delay in reading what we need from the Zotero API. The API is limited to 50 items in a given request as a throttling mechanism, with pyzotero offering a generator and a couple of other methods for pulling more items. After experimenting with several ways of breaking up the requests, I found that we don't end up buying very much through various parallelization strategies and it will be more simple and reliable to use the everything() wrapper in pyzotero to grab every item we need to operate on. The benchmark I ran comparing an everything request for 30,993 metadata and attachment items in the library vs. a Dask delayed process with 4 workers resulted in 21 minutes for the former and 13 minutes for the latter but a loss of some number of records that I couldn't explain but probably had to do with some API failure I didn't catch.

It does make sense, though, to get our items and then cache them as a file so that a) we can operate on that same file in different ways and b) we can store the file and use it to determine when the library contains something new for us to work on.

The three types of things we need from the Zotero library are collections (worked up previously), metadata items (reports and/or other itemTypes), and attachments. The pyzotero items() function, wrapped with everything(), will get us all of our items, or if we add the "since" parameter with a version number, we can get anything new.

Strategies for caching a library's raw information can vary depending on the platform where things are being implemented. If we're using a versioned file system like S3, we might opt to use that. Or we may come up with some file naming convention that works for our circumstance. In this rough workup, I'm simply using a local mounted file system and passing in a relative path. This will need to be tweaked to cover a broader range of use cases.

I still need to work up the following:

* Process for checking latest version in our cache against the last_modified_version from the library, getting new/changed items, and then either dealing with those separately or integrating them into our cache.
* Process for dealing with deleted items.
* Process for mapping Zotero metadata to a target schema like xDD articles.
* Process for managing file attachment content.

I split this up into three logical parts:
1) Get items for a given collection. This is the part that needs to make an API call to Zotero and may take the most resources to operate.
2) Process the metadata. This takes the raw item content and maps it to a profile, extracting the elements we need for that profile. It also handles blending in any additional metadata that might be useful but is outside the strict mapping to the target profile.
3) Get file attachment keys. The Zotero API offers a couple of different ways of retrieving files. They can be streamed through as bytes to some other process or sent directly to a file system.