<a href="https://colab.research.google.com/github/skybristol/experiments/blob/dev/Extracted_PDF_Annotation_via_Zotero.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I'm experimenting here with a process for turning annotations made in PDF files stored in a Zotero library into metadata and structured annotations for the bibliographic record. This is mainly for cases where no good citation metadata already exists somewhere on the web (e.g., for certain types of government reports) and we need to extract that content from within the PDFs. It also covers cases where the built-in structured PDF metadata is no good, which is the case for pretty much anything other than professionally built PDFs (e.g., simply exporting a PDF from your word processor does not produce good metadata). The technique also holds promise for building training data for entity recognition models that auto-extract particular concepts from full texts processed with NLP.

I used the ZotFile plugin for Zotero, inspired by [this video](https://www.youtube.com/watch?v=_Fjhad-Z61o&t=1251s). In Zotero, the process is to store the PDF file as an attachment so that Zotero is "managing" it, annotate the file with some type of PDF markup tool (I used Preview on Mac), and then run the ZotFile tool to extract the annotations from the PDF, which creates a rich text (HTML) note for the item in Zotero. The notes are synced to the group library in Zotero online, where they can be picked up for processing.

For annotation, I highlighted a piece of text and then tagged it with a keyword naming the part of the citation metadata it represents (e.g., title, author). I should then be able to pull these two pieces out of the generated HTML note into a data structure that I can feed back into the corresponding record via the Zotero API.
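
To make the convention concrete, here is a minimal sketch of the highlight/keyword pairing. The sample note HTML is hypothetical: its layout is inferred from the parsing logic later in this notebook, not copied from a real ZotFile note. It also swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra installs; `ParagraphText` and `pair_highlights` are names made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical note HTML (assumed layout): a header paragraph, then each quoted
# highlight followed by a paragraph that starts with the keyword.
SAMPLE_NOTE = (
    "<p><strong>Extracted Annotations</strong></p>"
    '<p>"Resource Estimate Update for the Avino Property, Durango, Mexico"</p>'
    "<p>title</p>"
    '<p>"Hassan Ghaffari"</p>'
    "<p>author</p>"
)


class ParagraphText(HTMLParser):
    """Collect the text content of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def pair_highlights(note_html, keywords=("title", "author")):
    """Return (keyword, highlighted text) pairs from adjacent paragraphs."""
    parser = ParagraphText()
    parser.feed(note_html)
    parser.close()
    pairs = []
    # Walk consecutive paragraph pairs: the highlight comes first, the keyword second.
    for previous, current in zip(parser.paragraphs, parser.paragraphs[1:]):
        words = current.split()
        if words and words[0] in keywords:
            match = re.search(r'"(.*?)"', previous)
            if match:
                pairs.append((words[0], match.group(1)))
    return pairs


pairs = pair_highlights(SAMPLE_NOTE)
print(pairs)
```

Real notes will likely carry extra markup (page references, links back to the PDF), so treat this as an illustration of the pairing idea only.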

For the Python part of the workflow, I used the pyzotero package to connect to the Zotero group library, read items with notes, and process those notes into structured data that can be written back to the items.

In [None]:
!pip install pyzotero

In [1]:
from bs4 import BeautifulSoup
import re
from pyzotero import zotero
from getpass import getpass

The cell below defines a property map and two functions. `structured_annotations` checks whether a given Zotero child item's note was generated by the ZotFile process and, if so, extracts the annotations into a usable data structure (a list of dictionaries). `update_zotero_item` writes those annotations back to the parent item.

In [2]:
annotation_property_map = [
    {
        "property_name": "title",
        "property_type": "single"
    },
    {
        "property_name": "date",
        "property_type": "single"
    },
    {
        "property_name": "institution",
        "property_type": "single"
    },
    {
        "property_name": "author",
        "property_type": "multi"
    },
    {
        "property_name": "project",
        "property_type": "tag"
    },
    {
        "property_name": "place",
        "property_type": "tag"
    },
    {
        "property_name": "commodity",
        "property_type": "tag"
    },
]

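def structured_annotations(item_key, annotation_html, property_map=annotation_property_map):
    """Extract keyword-tagged highlights from a ZotFile-generated note as a list of dicts."""
    extract_keywords = [i["property_name"] for i in property_map]

    annotations_soup = BeautifulSoup(annotation_html, 'html.parser')
    # Highlighted text appears inside double quotes in the extracted note.
    pattern = r'"(.*?)"'

    paragraphs = annotations_soup.find_all("p")

    # Only process ZotFile-generated notes; also guards against notes with no paragraphs.
    if not paragraphs or "Extracted Annotations" not in paragraphs[0].text:
        return None

    annotation_texts = list()
    # Start at 1 so a keyword paragraph always has a preceding highlight paragraph.
    for index, p in enumerate(paragraphs[1:], start=1):
        words = p.text.split()
        # Skip empty paragraphs and comments that do not start with a known keyword.
        if not words or words[0] not in extract_keywords:
            continue
        prop = words[0]
        annotation_text = re.search(pattern, paragraphs[index - 1].text)
        if annotation_text is not None:
            annotation_texts.append({
                "item_key": item_key,
                "text": annotation_text.group(1),
                "property": prop
            })
    return annotation_texts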

def update_zotero_item(
    item_key,
    annotations_list,
    zotero_api,
    commit_update=True,
    property_map=annotation_property_map):
    """Apply extracted annotations to a Zotero item and optionally commit the change."""
    update_item = zotero_api.item(item_key)
    if not update_item:
        return

    available_updates = [i for i in annotations_list if i["item_key"] == item_key]
    if not available_updates:
        # Nothing to apply, so leave the item untouched and skip the API write.
        return update_item

    single_value_props = [i["property_name"] for i in property_map if i["property_type"] == "single"]
    tag_props = [i["property_name"] for i in property_map if i["property_type"] == "tag"]

    # Single-valued fields take the first matching annotation.
    for prop in single_value_props:
        update_value = next((i["text"] for i in available_updates if i["property"] == prop), None)
        if update_value is not None:
            update_item["data"][prop] = update_value

    # Only overwrite multi-valued fields when annotations supply values,
    # so existing data on the item is not blanked out.
    places = [i["text"] for i in available_updates if i["property"] == "place"]
    if places:
        update_item["data"]["place"] = ",".join(places)
    creators = [{'creatorType': 'author', 'name': i["text"]} for i in available_updates if i["property"] == "author"]
    if creators:
        update_item["data"]["creators"] = creators
    tags = [{'tag': i["text"]} for i in available_updates if i["property"] in tag_props]
    if tags:
        update_item["data"]["tags"] = tags

    if commit_update:
        zotero_api.update_item(update_item)

    return update_item


To interface with the Zotero API, you need to provide a library ID and an API key. This should work for anyone who has a group library and has followed the process outlined above.

In [3]:
zot = zotero.Zotero(input("Library ID "), "group", getpass(prompt="API Key "))

Library ID 4373054
API Key ··········
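
For unattended runs, the interactive prompts could be replaced by environment variables. This is only a sketch: `ZOTERO_LIBRARY_ID` and `ZOTERO_API_KEY` are hypothetical variable names chosen for this example, and `zotero_credentials` is a made-up helper, not part of pyzotero.

```python
import os


def zotero_credentials(env=None):
    """Return (library_id, library_type, api_key) read from environment variables."""
    env = os.environ if env is None else env
    # ZOTERO_LIBRARY_ID and ZOTERO_API_KEY are hypothetical names for this sketch.
    library_id = env.get("ZOTERO_LIBRARY_ID")
    api_key = env.get("ZOTERO_API_KEY")
    if not library_id or not api_key:
        raise RuntimeError("Set ZOTERO_LIBRARY_ID and ZOTERO_API_KEY first")
    return library_id, "group", api_key


# Assumed usage, mirroring the constructor call above (not run here):
#   zot = zotero.Zotero(*zotero_credentials())
```

Keeping the key out of the notebook also avoids committing it by accident along with cell outputs.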


This process would need to be worked out further for production use, but we essentially walk the items in a given library looking for notes. There might be a more efficient way to zero in on these, but I haven't figured it out yet with pyzotero. Here, we make a pass through every top-level item, look for items with children, get the children, and keep any that are notes. I'm assuming the results are additive: every note on an item that can be processed contributes annotations to that parent item. I send the note HTML along with the parent item key to the function, which returns the available structured annotations. Since every annotation is essentially a key/value pair (property and text content), we can simply build a list of these with their item keys for further processing.

In [4]:
item_annotations = list()
for item in zot.all_top():
    if item["meta"]["numChildren"] > 0:
        note_children = [i for i in zot.children(item["key"]) if i["data"]["itemType"] == "note"]
        if note_children:
            for note_child in note_children:
                extracted_annotations = structured_annotations(item["key"], note_child["data"]["note"])
                if extracted_annotations:
                    item_annotations.extend(extracted_annotations)
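
One possible shortcut, sketched here under assumptions, is to fetch only note items and group them by parent rather than requesting the children of every top-level item. I believe the Zotero API accepts an `itemType` filter that pyzotero passes through, and that child notes carry a `parentItem` key in their data, but I haven't verified this against the library, so the live call is left as a comment. `notes_by_parent` is a hypothetical helper and the sample records are made up.

```python
from collections import defaultdict


def notes_by_parent(items):
    """Group note HTML by parent item key, given Zotero API item dicts."""
    grouped = defaultdict(list)
    for item in items:
        data = item.get("data", {})
        parent_key = data.get("parentItem")
        # Standalone notes have no parent and are skipped.
        if data.get("itemType") == "note" and parent_key:
            grouped[parent_key].append(data.get("note", ""))
    return dict(grouped)


# Assumed usage against a live library (not run here):
#   note_items = zot.everything(zot.items(itemType="note"))
#   grouped = notes_by_parent(note_items)

# Made-up records shaped like API responses, for illustration only.
sample_items = [
    {"key": "AAAA1111", "data": {"itemType": "note", "parentItem": "QZFHM2ZK", "note": "<p>one</p>"}},
    {"key": "BBBB2222", "data": {"itemType": "note", "parentItem": "QZFHM2ZK", "note": "<p>two</p>"}},
    {"key": "CCCC3333", "data": {"itemType": "note", "note": "<p>standalone</p>"}},
    {"key": "DDDD4444", "data": {"itemType": "report", "title": "x"}},
]
grouped = notes_by_parent(sample_items)
print(grouped)
```

Each parent key and note HTML pair could then be passed to `structured_annotations` in place of the nested loop.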


In this case, we get back the items that I've annotated, with the extracted text snippets matched to the specific metadata elements I'm trying to identify.

In [5]:
item_annotations

[{'item_key': 'QZFHM2ZK',
  'property': 'institution',
  'text': 'Avino Silver & Gold Mines Ltd.'},
 {'item_key': 'QZFHM2ZK',
  'property': 'title',
  'text': 'Resource Estimate Update for the Avino Property, Durango, Mexico'},
 {'item_key': 'QZFHM2ZK', 'property': 'date', 'text': 'JANUARY 13, 2021'},
 {'item_key': 'QZFHM2ZK', 'property': 'author', 'text': 'Hassan Ghaffari'},
 {'item_key': 'QZFHM2ZK', 'property': 'author', 'text': "Michael F. O'Brien"},
 {'item_key': 'QZFHM2ZK', 'property': 'author', 'text': 'Barnard Foo'},
 {'item_key': 'QZFHM2ZK',
  'property': 'author',
  'text': 'Jianhui (John) Huang'},
 {'item_key': 'QZFHM2ZK', 'property': 'place', 'text': 'San Gonzalo Mine'},
 {'item_key': 'QZFHM2ZK', 'property': 'place', 'text': 'Durango, Mexico'},
 {'item_key': 'QZFHM2ZK',
  'property': 'place',
  'text': 'Elena Tolosa (ET) Mine'},
 {'item_key': 'MFA5N9X9',
  'property': 'institution',
  'text': 'BULLFROG GOLD CORP.'},
 {'item_key': 'MFA5N9X9',
  'property': 'title',
  'text': 

I added a second function, `update_zotero_item`, to work through the gathered annotations and commit them back to the respective Zotero items cataloging the annotated files. This round-trips the process and lets us propose a workflow built around marking up "messy" PDF files with metadata keywords, then leveraging Zotero and ZotFile to build out the catalog records from that markup.

In [6]:
for item_key in {i["item_key"] for i in item_annotations}:
    update_zotero_item(item_key, item_annotations, zot, commit_update=True)

What I've done so far is a reasonable start, but there are a few issues.

* This is pretty brittle at this point and requires a very specific convention to be followed when annotating a PDF. It would need to be more robust to variations in text strings and to the different things people might do in free and open annotations. I mitigated this a little in the function by first looking for a set of keywords that identify the annotations we want and then getting the highlighted text each one refers to. However, some conventions would still need to be established and followed for highlighting a chunk of text and marking its significance. If we simply want to pick out the major elements of reasonably complete citation metadata, then something like what I tried here should work well enough.
* I still need to work out the best way to feed everything back into building more usable report reference items in Zotero once the metadata properties are extracted. That should be pretty straightforward, but I want to fiddle with the simplest workflow possible, where someone marks up a bunch of PDFs quickly and loads the files to the Zotero library without making report items from them, and then see if the whole process can work from there.
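
As one way to loosen the convention, the keyword match and the quote pattern could each be made a bit more forgiving. This is a minimal sketch: the alias table is hypothetical and would need to reflect what annotators actually type, and `normalize_keyword` and `extract_highlight` are names invented for this illustration.

```python
import re

KNOWN_KEYWORDS = {"title", "date", "institution", "author", "project", "place", "commodity"}
# Hypothetical alias table; extend it to match what annotators actually type.
ALIASES = {"authors": "author", "org": "institution", "location": "place"}


def normalize_keyword(comment_text):
    """Return the canonical property name for an annotation comment, or None."""
    words = comment_text.strip().split()
    if not words:
        return None
    # Lowercase the first word and drop punctuation such as a trailing colon.
    token = re.sub(r"[^a-z]", "", words[0].lower())
    token = ALIASES.get(token, token)
    return token if token in KNOWN_KEYWORDS else None


def extract_highlight(paragraph_text):
    """Pull the highlighted text from straight or curly double quotes."""
    match = re.search(r'["\u201c](.*?)["\u201d]', paragraph_text)
    return match.group(1).strip() if match else None


print(normalize_keyword("Title:"))
print(extract_highlight("\u201cDurango, Mexico\u201d (p. 1)"))
```

Anything that does not resolve to a known keyword comes back as `None`, so free-form comments in the same PDF are ignored rather than misfiled.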

My takeaway so far is that it's really nice and fast to simply open up a PDF file and start marking it up. Theoretically, this could be done on a whole batch of PDFs totally separate from Zotero: bulk import them to Zotero, run the ZotFile extraction on the annotations, and then generate properly documented items. For the types of files this applies to, Zotero is not going to recognize that they should be "report" type items, so that part would need to be handled through the API. As noted, the real point here is to train an AI to do this work, at least within some contextual boundaries. But even if it were a person sitting down to do this work, it should be much faster to open a PDF, mark it up following a particular convention to identify the important bits, and then have a system take over to parse and catalog the files.
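
As a rough sketch of the "report" step, the extracted annotations for one file could be assembled into the data for a new report item before anything is sent to the API. `build_report_data` and `normalize_date` are hypothetical helpers; the date parsing assumes an English locale and only handles the formats listed, and the sketch leaves places as tags rather than filling the `place` field. Creating the item would then go through pyzotero (for example, filling in an item template and passing it to `create_items`), which is not run here.

```python
from datetime import datetime

TAG_PROPS = ("project", "place", "commodity")


def normalize_date(text):
    """Convert dates like 'JANUARY 13, 2021' to ISO format; return the input if unparsable."""
    for fmt in ("%B %d, %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return text


def build_report_data(annotations, item_key):
    """Assemble Zotero-style data for a 'report' item from one file's annotations."""
    own = [a for a in annotations if a["item_key"] == item_key]
    data = {"itemType": "report"}
    # Single-valued fields take the first matching annotation.
    for prop in ("title", "institution"):
        value = next((a["text"] for a in own if a["property"] == prop), None)
        if value is not None:
            data[prop] = value
    date_text = next((a["text"] for a in own if a["property"] == "date"), None)
    if date_text is not None:
        data["date"] = normalize_date(date_text)
    authors = [a["text"] for a in own if a["property"] == "author"]
    if authors:
        data["creators"] = [{"creatorType": "author", "name": name} for name in authors]
    tags = [a["text"] for a in own if a["property"] in TAG_PROPS]
    if tags:
        data["tags"] = [{"tag": tag} for tag in tags]
    return data


# A few of the annotations shown in the output earlier in this notebook.
sample = [
    {"item_key": "QZFHM2ZK", "property": "institution", "text": "Avino Silver & Gold Mines Ltd."},
    {"item_key": "QZFHM2ZK", "property": "title",
     "text": "Resource Estimate Update for the Avino Property, Durango, Mexico"},
    {"item_key": "QZFHM2ZK", "property": "date", "text": "JANUARY 13, 2021"},
    {"item_key": "QZFHM2ZK", "property": "author", "text": "Hassan Ghaffari"},
    {"item_key": "QZFHM2ZK", "property": "author", "text": "Barnard Foo"},
    {"item_key": "QZFHM2ZK", "property": "place", "text": "Durango, Mexico"},
]
report = build_report_data(sample, "QZFHM2ZK")
print(report)
```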