This notebook demonstrates how one might go about interfacing with a Zotero Group Library to harvest or pull its contents up for some particular use. In this case, we're looking specifically at how we might use a Zotero library as a source for a special collection of documents introduced into the xDD platform for NLP processing. The idea would be to initially baseline pull the full library and then run a scheduled proccess to check for updates and pull in new or changed metadata and documents.

Access will be through the Zotero API, and this notebook demonstrates that using the pyzotero abstraction on the REST API.

In [48]:
from pyzotero import zotero
from getpass import getpass
import pandas as pd

The Zotero API provides access to personal (user) and group libraries, with slightly different methods for each. Here we're only covering access to group libraries as this fits our use case. To instantiate an API connection with pyzotero, you need to provide the group library identifier and an API key. Library IDs are open and available to anyone. Here, we are interfacing with a library/collection of [NI 43-101 Reports](https://www.zotero.org/groups/4530692/usgs_ni_43-101_reports/library). You can see the library ID there in the URL: 4530692.

An API key has to be set up by an authorized user and will specify the type of access allowed and what library or libraries the key will access. In this case, we set up a read-only API key for the specific library, which should be the best practice for this kind of system.

In the following codeblock, we ask for these two parameters, hiding the API key as a secret. (Note: Contact the notebook author if you'd like to participate or set this up to work with your own specified library and API key.) In a production implementation of this idea, the Library ID and API Key parameters could be set up as environment variables for an application or some kind (e.g., Lambda process, etc.).

In [4]:
zotero_library_id = input("Library ID")
zotero_api_key = getpass("API Key")

Library ID 4530692
API Key ························


In [10]:
def zotero_library(library_id, api_key):
    # Establish connection to library with an ID and API Key
    zot = zotero.Zotero(
        library_id=library_id,
        library_type='group',
        api_key=api_key,
    )
    return zot

In [6]:
zot = zotero_library(zotero_library_id, zotero_api_key)

# Collection-based Metadata
The organization scheme used in a Zotero Group Library could provide different types of value-added metadata about items. Collections are used like folders to provide a visual organizational scheme to library items, helping library users navigate through and work within a given library, particularly one that contains a lot of items.

In the case of the NI 43-101 Reports here, we replicated a scheme that has been used for years where these reports have been accumulating on network shared folders. It uses a hierarchy of folder names to indicate a region of the world and country as well as US States for mining projects that are the subject of the reports. Report items can be found at any of these levels with the immediate parent collection indicating the most specific geographic context available in the hierarchical structure. We may add a further level of structure with project names in future.

Since this information is contained in the organizational structure of the items and does not necessarily exist in individual item metadata (e.g., tags), we need to incorporate an "interpretation" of the organizational structure into our process of assembling useful metadata. In this case, the geographic context for a given report item may be useful in driving NLP processes. It can narrow the field on other, more specific place names we might work to recognize in full text or serve as a validation element in extracting point coordinates or other geospatial information.

Each process like this for reading a Zotero library and digesting its metadata is going to be different. The collection organization scheme is going to be different, introducing different types of information to the process. However, something like what we present here can be extended to essentially classify the collections in a given library as metadata elements of different kinds. For this case, we pull all collections and then use the hierarchical structure (via "parentCollection") to assign a "type" value of Region, Country, or US State to each collection. This essentially gives us an additional set of geographic context tags we can send along with our items for further use downstream.

When retrieving items (see below), we will get a list of collection keys that an item belongs to. We can also request items for a given collection. Any library item can belong to more than one collection. This essentially means we can use our collection metadata we assemble below to "backfill" additional value-added item metadata based on whatever significance the collections provide.

Note: We did do some normalization work in our process of building out the Zotero library structure for the NI 43-101 Reports to ensure that each collection name can be resolved to official ISO sources for countries and US States. Regions represent continents and a few sub-continent but recognizable region names.

In [57]:
def library_collections(z):
    all_collections = z.everything(z.collections())

    collections = [
        {
            "type": "Region", 
            "key": i["key"], 
            "name": i["data"]["name"]
        } for i in all_collections if not i["data"]["parentCollection"]
    ]

    for r in collections:
        collections.extend([
            {
                "type": "Country", 
                "key": i["key"], 
                "name": i["data"]["name"],
                "parent": r["key"]
            } for i in all_collections if i["data"]["parentCollection"] == r["key"]
        ])

    us_collection = next((i for i in collections if i["name"] == "United States"), None)

    collections.extend([
        {
            "type": "US State", 
            "key": i["key"], 
            "name": i["data"]["name"],
            "parent": us_collection["key"]
        } for i in all_collections if i["data"]["parentCollection"] == us_collection["key"]
    ])
    
    return collections

In [58]:
collections = library_collections(zot)
pd.DataFrame(collections)

Unnamed: 0,type,key,name,parent
0,Region,KVSV3898,Europe,
1,Region,ZNCQ3V3A,Middle East,
2,Region,BTJTAMND,Central America,
3,Region,MQWZWWM2,North America,
4,Region,BTZ7EAKX,Caribbean,
...,...,...,...,...
163,US State,VGKU7ZGD,Georgia,AKV3WJ97
164,US State,5CQMNT5V,Florida,AKV3WJ97
165,US State,W8R6UH2U,Arizona,AKV3WJ97
166,US State,JK8JU3E2,Alaska,AKV3WJ97


# Baselining Items
Via the Zotero API, everything in a given library is an item. This includes metadata principals for things like reports and articles, but file attachments are items along with collections and notes and anything else that is uniquely identified and accessed. The various item methods of pyzotero provide access to items and can be used in various ways to access and work through the contents of a library.

For our use case of retrieving documents for processing with the xDD tools, we will be focusing on pulling report type items for essential metadata, blending in collection item information (per the above discussion), to create xDD records. We'll also pull the file items attached to the report metadata items to get PDF document content for processing. Eventually, we will access and pull annotation items also attached to report items for value-added annotation that can be used in NLP training.

Again, each case may be a little bit different in terms of how a given library is organized and managed, resulting in some tweaks to this process for other circumstances. Hopefully, the basic patterns developed here can serve as a start to a more generalized process.

Baselining a given library for the purposes of creating an xDD collection or some similar purpose means working at a point in time to pull together all necessary item information translated into how the target system understands "item" for its context. 

After a library has been baselined, Zotero provides a useful versioning system that should be fairly convenient to use. Every change to any item results in a sequential version number being incremented. The version number is within the context of a given library (group or user). The last_modified_version() method of pyzotero can be used to simply retrieve the highest incremental version number across everything in the library. Usage patterns could include recording the high version number of the library at the point of sync and/or recording individual item versions and then using those within context for an API call, depending on how a workflow is composed. 

The API accepts a "since" parameter with version numbers that will only pull items changed after that version. This can be combined with other API search parameters to look for items changed within a given context (e.g., report type items changed after the last recorded version retrieved). In syncing items from a Zotero library to some other system, version numbers should be recorded and then used to go after newer items. The version number incremental system also applies for new items added to a library, so that if you ask for everything since the last version in a synced collection, you will get new items as well as updates.

Deleted items are a different issue that should also be taken into account. The pyzotero API provides a deleted method with the "since" parameter that returns lists of collections, items, and other things that were removed from the library. How these are dealt with in a target system like xDD is another consideration we'll have to work through, but the API method provides for pulling the deletions and deciding what actions to take.

The actual bibliographic metadata for items in a Zotero library is pretty straightforward and aligns well with various bibliographic metadata standards with information on title, authors, publishers, etc. The API allows for different standard forms of metadata to be requested, with the default being Zotero's own JSON format, so a harvester could be written to use something more standardized like bibtex. However, the Zotero JSON structure provides everything and is likely the easiest thing to build a comprehensize process upon.

The different itemTypes in Zotero control the schema for items, and the information elements in a given itemType schema are fixed. This introduces some challenges in recording all of the information we might want to record, but it produces a predictable structure with little room for interpretation on the downstream end. Pulling items into something like the xDD Digital Library means mapping the information elements to be encountered in Zotero libraries to the target schema.

In the xDD case, the target schema (viewable in the "fields" object [here](https://geodeepdive.org/api/articles)) is also very limited with a barebones set of identification metadata. So, our task of mapping from the barebones NI 43-101 Report metadata to the xDD "article" target is not difficult.

What will be more interesting is working through how we introduce what will essentially be value-added annotation into this process - additional metadata about the documents in the collection that we want to leverage in NLP processing. Geographic context and important name identifiers (projects, names of active mines, mineral and other commodities, etc.) are all important organizing principles that we are capturing anyway and recording through collections and tags. We may start introducing additional attachments on report items containing explicit geospatial data or more precise temporal information. We will be encouraging scientists and support staff to annotate PDF documents and use Zotero's capabilities to extract annotation (highlights and notes) as additional attachments in Markdown format. These processes will all introduce additional content that we want xDD and other systems to access and take advantage of.

In [94]:
# Quick example of getting last modified for the library
library_last_mod = zot.last_modified_version()

# Retrieving "recent" reports since a fictitious previously recorded version
recent_items = zot.everything(zot.items(itemType="report", since=library_last_mod-1000))

print(len(recent_items))

125


In [103]:
# Look at deleted stuff
display(zot.deleted(since=library_last_mod-50000))

{'collections': ['8AICMIB2', 'BCJ8XFGE', 'MGGTEK6C', 'RW29JFBN'],
 'items': ['23N2K345',
  '23U33IWI',
  '33CXVW89',
  '367IE9DG',
  '3M4XR5JH',
  '3TXSKICC',
  '3VW58HDV',
  '4HDKB45X',
  '4PA7GJCN',
  '4RV4KS95',
  '4S3JI354',
  '54UH39PU',
  '63ZXI7WV',
  '6AT6J4ET',
  '6HXPHM58',
  '6ICXXIXA',
  '6PGHCTT2',
  '6Z33PWBT',
  '73HMNNE3',
  '77ZMEMR7',
  '7CMKIXM3',
  '7DVW3E7C',
  '7FJ8JP5X',
  '7RSMHR43',
  '7W3MWJFF',
  '86VP47TM',
  '9S43IDKN',
  'A833T4MX',
  'BDWPC7AR',
  'BG3JENCS',
  'BPSXHXIS',
  'BUFWD7DN',
  'DT4Q5PZU',
  'E9K3UA2U',
  'EANI37KV',
  'EP9U8QS7',
  'EX4K37BU',
  'F6QG8RMM',
  'FP9MHM5N',
  'I5Z354Z4',
  'J4DM736W',
  'J75ECMUH',
  'JCDRDHN2',
  'JD2ZQAAF',
  'JZTBNWXU',
  'M2XWP7PG',
  'M3TUVG9T',
  'NBM9HAHP',
  'P5UIXPIN',
  'PBGQ9JGX',
  'PDPNNTBK',
  'PIS2E8ZS',
  'PSAIQREQ',
  'R2TVZTZE',
  'R4QWDPSI',
  'RDRWEKBI',
  'RH9GUDWB',
  'RKVGFDK3',
  'S8S2M4NZ',
  'SBHV3DQM',
  'SDUNSR26',
  'SGQE29UQ',
  'SUIVZJ3F',
  'SWEMNANC',
  'T7D4AAAN',
  'TCJS45U9',
 

I'll come back to work on an actual process for baselining an entire library. This starts with a look at how long it takes to pull metadata for every report item but shows that we get pretty much everything we need to with that one array to build out a corresponding collection in xDD.

In [55]:
%%time
reports = zot.everything(zot.items(itemType="report"))

CPU times: user 2.75 s, sys: 112 ms, total: 2.86 s
Wall time: 10min 33s


In [56]:
len(reports)

15496

In [59]:
reports[999]

{'key': 'QZM6W75P',
 'version': 6682,
 'library': {'type': 'group',
  'id': 4530692,
  'name': 'USGS NI 43-101 Reports',
  'links': {'alternate': {'href': 'https://www.zotero.org/groups/usgs_ni_43-101_reports',
    'type': 'text/html'}}},
 'links': {'self': {'href': 'https://api.zotero.org/groups/4530692/items/QZM6W75P',
   'type': 'application/json'},
  'alternate': {'href': 'https://www.zotero.org/groups/usgs_ni_43-101_reports/items/QZM6W75P',
   'type': 'text/html'},
  'attachment': {'href': 'https://api.zotero.org/groups/4530692/items/JS2XPAMU',
   'type': 'application/json',
   'attachmentType': 'application/pdf',
   'attachmentSize': 22635765}},
 'meta': {'createdByUser': {'id': 1119084,
   'username': 'skybristol',
   'name': 'Sky Bristol',
   'links': {'alternate': {'href': 'https://www.zotero.org/skybristol',
     'type': 'text/html'}}},
  'parsedDate': '2018',
  'numChildren': 1},
 'data': {'key': 'QZM6W75P',
  'version': 6682,
  'itemType': 'report',
  'title': 'NI 43-101 Te