This notebook demonstrates one way to pull PDF file attachments from a Zotero group library for processing in some other environment (e.g., segmentation, vectorization, and AI-powered pipelines). We use Zotero to manage collections of documents that are not readily accessible through another repository. The Zotero metadata model holds the citation details along with feedback from the different kinds of pipeline processing.

In Zotero group libraries, file content is stored as "attachments" to Zotero library items. These can be retrieved by setting the itemType parameter in the API to 'attachment'. Attachments are not always PDFs; the others are mostly markdown files from notes, either extracted from marked-up PDF files (using the ZotFile extension) or written by a user in the Zotero client.

### Dependencies
* Pyzotero package installed from PyPI or conda-forge
* Zotero group library ID that you want to operate on
* API key with access to the group library
* pandas and a Parquet engine such as pyarrow (used in the Operational Concept section)

In [1]:
from pyzotero import zotero
from getpass import getpass

The following will prompt for a library ID and API key. You can change the code to pull the IDs in from env variables or however you wish to go about it.

In [2]:
library_id = input('Zotero Group Library ID: ')
api_key = getpass(prompt='API Key: ')
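As an alternative to prompting every time, the values can be read from environment variables. This is a minimal sketch: the variable names ZOTERO_LIBRARY_ID and ZOTERO_API_KEY and the helper load_credentials are hypothetical, so substitute whatever your environment uses. It falls back to the same prompts when a variable is not set.

```python
import os
from getpass import getpass

# Hypothetical variable names; use whatever your environment defines.
LIBRARY_ID_VAR = "ZOTERO_LIBRARY_ID"
API_KEY_VAR = "ZOTERO_API_KEY"


def load_credentials(environ=os.environ):
    """Return (library_id, api_key), prompting only for values that are not set."""
    library_id = environ.get(LIBRARY_ID_VAR) or input("Zotero Group Library ID: ")
    api_key = environ.get(API_KEY_VAR) or getpass(prompt="API Key: ")
    return library_id, api_key
```

Calling load_credentials() with no arguments reads from the process environment and returns the pair to assign to library_id and api_key.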

Establish a connection to the Zotero API for the group library.

In [3]:
zot = zotero.Zotero(
    library_id,
    'group',
    api_key
)

### Get Attachment Items

There are several ways to interact with a group library through the API. Because it is a REST API that returns results in pages, working through a sizable library is slow and somewhat cumbersome. In this example, we have nearly 15K reports with PDF attachments to work through, throttled at essentially 50 records at a time. You can create a generator using either the makeiter() or iterfollow() methods. Here, I use the everything() method to simply grab every item of type 'attachment' in one lengthy run. That gives us all the identifiers needed for file retrieval, which we can parallelize within reason.

In [4]:
attachments = zot.everything(zot.items(itemType='attachment'))
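If one lengthy run is not practical, the same records can be walked one page at a time. The following is a minimal sketch and was not part of the original run. It assumes the client accepts the limit and start request parameters on items(), and iter_attachment_pages is a hypothetical helper name.

```python
def iter_attachment_pages(client, page_size=50):
    """Yield lists of attachment items, one page per request.

    `client` is assumed to expose items(itemType=..., limit=..., start=...)
    and to return a list of item dicts (empty when nothing is left).
    """
    start = 0
    while True:
        page = client.items(itemType="attachment", limit=page_size, start=start)
        if not page:
            break
        yield page
        # Advance by what was actually returned, in case a page comes back short.
        start += len(page)
```

Passing the zot object from above as client would let each page be processed or cached as it arrives, so an interrupted run does not lose everything fetched so far.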

### PDF Attachments

If our goal is to get all PDFs for processing elsewhere using something like the pyzotero dump() wrapper, we can tease out just the vital identifiers from the records pulled above. It is important to retain traceability back to the library and metadata item identifiers so that any extracted information can be linked to the core metadata in the future. The following function and list comprehension pull out the library ID, the metadata key that is our core reference point to an individual report, and the attachment key for the PDF. The function also pulls the original file name as stored in Zotero and builds a potential file name from the identifiers, which can be used if desired. Items that are not PDF attachments of a parent item return None and are filtered out.

In [10]:
def file_meta_from_item(item):
    """Return identifiers for a PDF child attachment, or None for anything else."""
    links = item['links']
    # Keep only attachments that have a parent item ('up') and a stored PDF file
    is_child_pdf = (
        'up' in links
        and 'enclosure' in links
        and links['enclosure'].get('type') == 'application/pdf'
    )
    if not is_child_pdf:
        return None

    lib_id = item['library']['id']
    # The item keys are the last path segment of the API links
    meta_key = links['up']['href'].split('/')[-1]
    attachment_key = links['self']['href'].split('/')[-1]
    linked_file_name = f"{lib_id}_{meta_key}_{attachment_key}.pdf"
    original_file_name = links['enclosure'].get('title')

    return {
        'library_id': lib_id,
        'metadata_key': meta_key,
        'attachment_key': attachment_key,
        'linked_file_name': linked_file_name,
        'original_file_name': original_file_name
    }

pdf_attachments = [file_meta_from_item(i) for i in attachments]
pdf_attachments = [i for i in pdf_attachments if i is not None]
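To make the extraction concrete, here is a trimmed record containing only the fields read above, with the keys pulled from its links. The record is illustrative and is not copied from an API response: the identifiers match the sample output shown later in this notebook, while the report.pdf title and the last_path_segment helper are hypothetical.

```python
# Illustrative, trimmed attachment record: only the fields the extraction reads.
sample_item = {
    "library": {"id": 4530692},
    "links": {
        "self": {"href": "https://api.zotero.org/groups/4530692/items/C3433BTD"},
        "up": {"href": "https://api.zotero.org/groups/4530692/items/PRQ264BE"},
        "enclosure": {"type": "application/pdf", "title": "report.pdf"},
    },
}


def last_path_segment(href):
    """Return the final path segment of a link, which is the item key."""
    return href.rstrip("/").split("/")[-1]


meta_key = last_path_segment(sample_item["links"]["up"]["href"])
attachment_key = last_path_segment(sample_item["links"]["self"]["href"])
linked_file_name = f"{sample_item['library']['id']}_{meta_key}_{attachment_key}.pdf"

print(linked_file_name)  # 4530692_PRQ264BE_C3433BTD.pdf
```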


### Operational Concept

In an operational mode, working with the Zotero API can be done fairly efficiently by caching the relevant information as a snapshot at a point in time. Zotero libraries use a sequential versioning scheme, and the API supports querying for changes after a particular version. In the following code block, I pull the max version number from the attachment records retrieved above. I then use pandas to dump the basic information to a Parquet file (this requires a Parquet engine such as pyarrow) for ease of use in a subsequent step to download files. Below that, I demonstrate the pyzotero dump() method, which takes a given file and writes it to local storage. In a real-world case, you would want to write the data to an appropriate location for processing. Pyzotero is not aware of S3 or other cloud storage protocols, so you will need to work out an appropriate routing or method based on your environment.

In [19]:
import pandas as pd

print("MAX VERSION:", max(i['version'] for i in attachments))

df_pdf_attachments = pd.DataFrame(pdf_attachments)
df_pdf_attachments.to_parquet(f'{library_id}.parquet')

MAX VERSION: 77250
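With the max version stored alongside the cache, a later run only needs to ask for what changed. This is a minimal sketch that assumes the client passes the Zotero API's since parameter through on items(). The helper name fetch_changed_attachments is hypothetical, and the sketch does not handle items deleted from the library.

```python
def fetch_changed_attachments(client, last_version):
    """Return (changed_items, new_version) for attachments modified after last_version.

    `client` is assumed to expose items(itemType=..., since=...) and everything(),
    as the zot object above does.
    """
    changed = client.everything(
        client.items(itemType="attachment", since=last_version)
    )
    # Keep the old high-water mark when nothing has changed.
    new_version = max((item["version"] for item in changed), default=last_version)
    return changed, new_version
```

For example, fetch_changed_attachments(zot, 77250) would return only the attachment records modified after the snapshot above, along with the version number to store for the next run.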


In [25]:
sample_record = df_pdf_attachments.sample().iloc[0].to_dict()
print(sample_record['attachment_key'])
print(sample_record['linked_file_name'])

zot.dump(
    sample_record['attachment_key'],
    filename=sample_record['linked_file_name'],
    path='./'
)

C3433BTD
4530692_PRQ264BE_C3433BTD.pdf
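Downloading one file at a time will be slow for a large library. The sketch below shows one way to run downloads in a small thread pool while recording failures instead of stopping on the first error. The helper download_all and its download_one argument are hypothetical names. Whether a single pyzotero client can safely be shared across threads is not established here, so treat that as an assumption to verify and keep the worker count modest.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(records, download_one, max_workers=4):
    """Call download_one(record) for each record using a small thread pool.

    Returns (succeeded, failed): a list of attachment keys, and a list of
    (attachment_key, error message) pairs for downloads that raised.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(download_one, record): record["attachment_key"]
            for record in records
        }
        for future in as_completed(futures):
            key = futures[future]
            try:
                future.result()
                succeeded.append(key)
            except Exception as exc:  # record the failure and keep going
                failed.append((key, str(exc)))
    return succeeded, failed
```

Here records could be df_pdf_attachments.to_dict('records'), and download_one a small function that calls zot.dump() with the record's attachment_key and linked_file_name, as in the cell above. The failed list can then be retried or logged.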
