<a href="https://colab.research.google.com/github/skybristol/notebooks/blob/master/Extracted_PDF_Annotation_via_Zotero.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I'm experimenting here with a process for turning annotations made in PDF files stored in a Zotero library into metadata and structured annotations for the bibliographic record. This is mainly for cases where good citation metadata does not already exist somewhere on the web (e.g., for certain types of government reports) and we need to extract it from the PDFs themselves. It's also for cases where the built-in structured PDF metadata is unreliable, which tends to be true for anything other than professionally built PDFs (e.g., a PDF exported straight from a word processor usually carries poor metadata). The technique also holds promise for building training data for entity recognition models that auto-extract particular concepts from full texts processed with NLP.

I used a combination of the ZotFile and MDNotes for Zotero plugins, inspired by [this video](https://www.youtube.com/watch?v=_Fjhad-Z61o&t=1251s). In Zotero, the process includes storing the PDF file as an attachment so that Zotero is "managing" it, annotating the file using some type of PDF markup tool (I used Preview on Mac), running the ZotFile tool to extract annotations from the PDF (creating a note in Zotero), and then using MDNotes to export the ZotFile extraction to a markdown file. I uploaded that raw MD file here for this experiment.

For annotation, I used a combination of highlighting particular text and then tagging that text with a keyword corresponding to a target part of the citation metadata I'm trying to identify (e.g., title, authors, etc.). I should then be able to pull these two pieces out of the generated markdown into a data structure that I can feed back into the corresponding record via the Zotero API.
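
To make the convention concrete, here is a minimal sketch of turning one highlight/tag pair into a small record. The raw string shapes are an assumption on my part (a quoted highlight followed by a parenthesized citation, and a tag optionally followed by a parenthesized note reference), the citation text `Bryan 2017:1` is made up for illustration, and `parse_pair` is a hypothetical helper name.

```python
import re

# Assumed raw shapes (hypothetical, not confirmed against real ZotFile output):
#   highlight: "BULLFROG GOLD CORP." (Bryan 2017:1)
#   tag note:  institution (note on p.1)
HIGHLIGHT_RE = re.compile(
    r'^\s*["\u201c](?P<text>.*?)["\u201d]\s*\((?P<cite>[^()]*)\)\s*$', re.DOTALL
)
TAG_RE = re.compile(r'^\s*(?P<prop>[^()]+?)\s*(?:\((?P<cite>[^()]*)\))?\s*$')


def parse_pair(raw_highlight, raw_tag):
    """Return {'text', 'property'} for one annotation, or None if either
    string does not follow the assumed convention."""
    highlight = HIGHLIGHT_RE.match(raw_highlight)
    tag = TAG_RE.match(raw_tag)
    if highlight is None or tag is None:
        return None
    return {
        # Collapse line breaks and repeated spaces inside the highlighted text
        "text": " ".join(highlight.group("text").split()),
        # Normalize the tag so "Title" and "title" land on the same property
        "property": tag.group("prop").strip().lower(),
    }


record = parse_pair('"BULLFROG GOLD CORP." (Bryan 2017:1)', "institution (note on p.1)")
```

Returning `None` for anything that does not match keeps malformed annotations from silently producing bad metadata; they can be collected and reviewed by hand instead.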

For the Python part of this, I used the python-markdown package to convert the markdown to HTML and then BeautifulSoup to work with the HTML. I experimented with the py2md package, but got a little lost in the navigation they set up.


In [1]:
import markdown
from bs4 import BeautifulSoup

In [17]:
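# Read the exported markdown (closing the file afterwards) and convert it to HTML
md_path = "/content/UnknownTitle - Extracted Annotations (862021, 41048 AM).md"
with open(md_path, "r", encoding="utf-8") as md_file:
  annotations_html = markdown.markdown(md_file.read())
annotations_soup = BeautifulSoup(annotations_html, 'html.parser')

# Highlighted text comes through as blockquotes, the tags as <em> elements
raw_notes = annotations_soup.find_all("blockquote")
raw_props = annotations_soup.find_all("em")

# The first link carries the group library ID and the annotated file's ID
link_parts = annotations_soup.find("a")["href"].split("/")
zotero_group_id = link_parts[4]
zotero_file_id = link_parts[6].split("?")[0]

# Pair each highlight with its tag; this relies on both appearing in the same order
d_annotations = list()
for note, prop in zip(raw_notes, raw_props):
  d_annotations.append({
      "zotero_group_id": zotero_group_id,
      "zotero_file_id": zotero_file_id,
      "text": note.find("p").text.split('" (')[0].replace('"', ""),
      "property": prop.text.split(" (")[0]
  })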

d_annotations

[{'property': 'institution',
  'text': 'BULLFROG GOLD CORP.',
  'zotero_file_id': 'K5C5CT2F',
  'zotero_group_id': '4373054'},
 {'property': 'title',
  'text': 'NI 43-101 Technical Report Mineral Resource Estimate Bullfrog Gold Project Nye County, Nevada',
  'zotero_file_id': 'K5C5CT2F',
  'zotero_group_id': '4373054'},
 {'property': 'date',
  'text': 'August 9, 2017',
  'zotero_file_id': 'K5C5CT2F',
  'zotero_group_id': '4373054'},
 {'property': 'author',
  'text': 'Rex Bryan',
  'zotero_file_id': 'K5C5CT2F',
  'zotero_group_id': '4373054'},
 {'property': 'project',
  'text': 'Bullfrog Gold Project',
  'zotero_file_id': 'K5C5CT2F',
  'zotero_group_id': '4373054'},
 {'property': 'place',
  'text': 'Bullfrog Mining District',
  'zotero_file_id': 'K5C5CT2F',
  'zotero_group_id': '4373054'},
 {'property': 'place',
  'text': 'Bullfrog Hills',
  'zotero_file_id': 'K5C5CT2F',
  'zotero_group_id': '4373054'},
 {'property': 'place',
  'text': 'Nye County',
  'zotero_file_id': 'K5C5CT2F',
  'zo

What I've done so far is a reasonable start, but there are a few issues.

* This is pretty brittle at this point and requires a very specific convention to be followed when annotating a PDF. It would need to be made more robust in how it handles text strings and the different things people might do in free and open annotations. Even so, some conventions would have to be established and followed for highlighting a chunk of text and then marking up its particular significance. If we simply want to pick out the major elements of reasonably complete citation metadata, then something like what I tried here should work well enough.
* The annotations as extracted by ZotFile and translated to markdown include links intended to take a user working on a local system right to the particular part of a PDF file referenced in the annotation. This is cool in practical use, and the links also carry two identifiers that I pulled into the structure above: one for the group library and one for the PDF file that was annotated. These correspond to the identifiers used online for the synced library and its contents, which should let me trace the annotations back to the Zotero library catalog item itself. I will have to look up the file identifier and work out which item it belongs to in order to get the identifier of the item where the annotated/extracted metadata should be injected.
* The operational side is where this gets a little clunky, particularly for working in a shared group library or fully automating a processing workflow. The thing being operated on here is a markdown file exported from a particular client instance of Zotero working on a shared library item. There is a batch export process that could be used, but the markdown files still have to be sent somewhere for processing. We can add those files as additional attachments on the Zotero item and sync them, but the MDNotes plugin prompts with a dialog for that step. I may be able to work from the actual notes created by ZotFile instead, which I think I can get at through the API, and bypass the markdown step entirely. I just need to work out how to parse the notes in much the same way.
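
As a rough sketch of the identifier lookup described in the second point: the Zotero Web API returns an item's JSON at `/groups/<group_id>/items/<item_key>`, and for an attachment the `data.parentItem` field holds the key of the item it belongs to. The link shape used here (`zotero://open-pdf/groups/<group_id>/items/<item_key>?page=1`) is my assumption, inferred from which pieces the split-based code above picks out, and the helper names are hypothetical. I have not run `fetch_item` against a live library.

```python
import json
import re
import urllib.request

API_ROOT = "https://api.zotero.org"

# Assumed link shape: zotero://open-pdf/groups/<group_id>/items/<item_key>?page=1
LINK_RE = re.compile(r"/groups/(?P<group>\d+)/items/(?P<key>[A-Za-z0-9]+)")


def ids_from_link(href):
    """Return (group_id, item_key) from an annotation link, or None if absent."""
    match = LINK_RE.search(href)
    return (match.group("group"), match.group("key")) if match else None


def item_url(group_id, item_key):
    """Build the Web API URL for a single item in a group library."""
    return f"{API_ROOT}/groups/{group_id}/items/{item_key}"


def parent_key(item_json):
    """Return the parent item's key for an attachment, or None for a top-level item."""
    return item_json.get("data", {}).get("parentItem")


def fetch_item(group_id, item_key, api_key=None):
    """Fetch one item as JSON. Private libraries need an API key."""
    request = urllib.request.Request(
        item_url(group_id, item_key), headers={"Zotero-API-Version": "3"}
    )
    if api_key:
        request.add_header("Zotero-API-Key", api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


group_id, file_key = ids_from_link(
    "zotero://open-pdf/groups/4373054/items/K5C5CT2F?page=1"
)
# parent = parent_key(fetch_item(group_id, file_key))  # needs network access
```

Matching on the `/groups/.../items/...` pattern rather than fixed split positions means the parsing should keep working if the link prefix or query string changes.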

My takeaway so far is that it's actually really nice and fast to simply open up a PDF file and start marking it up. Theoretically, a whole batch of PDFs could be marked up completely separately from Zotero, bulk imported into Zotero, run through the ZotFile annotation extraction, and then turned into properly documented items. For the types of files this applies to, Zotero is not going to recognize that they should be "report" type items, so that part of things would need to be handled through the API. As noted, the real point here is to train an AI to do this work, at least within some contextual boundaries. But even if it were a person sitting down to do this work, it should be much faster to open a PDF, mark it up following a particular convention to identify the important bits, and then have a system take over to parse and catalog the files.
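
Here is one way the parsed annotations might be shaped into a "report" item for the Zotero API. This is a sketch under my own assumptions about the mapping: `title`, `date`, and `institution` go to the report fields of the same name, `author` values become single-field creators, and everything else (such as `project` and `place`, which describe the subject rather than the publication) becomes a tag. `build_report_item` is a hypothetical helper, and the sample values come from the output above.

```python
# Annotation properties that map directly onto Zotero report fields (my assumption)
DIRECT_FIELDS = {"title", "date", "institution"}


def build_report_item(annotations):
    """Collapse a list of {'property', 'text'} records into one report item dict."""
    item = {"itemType": "report", "creators": [], "tags": []}
    for annotation in annotations:
        prop, text = annotation["property"], annotation["text"]
        if prop in DIRECT_FIELDS:
            # Keep the first value seen if a property was tagged more than once
            item.setdefault(prop, text)
        elif prop == "author":
            # Single-field creator, so no guessing at first/last name splits
            item["creators"].append({"creatorType": "author", "name": text})
        else:
            # Anything without an obvious field is preserved as a tag
            item["tags"].append({"tag": f"{prop}: {text}"})
    return item


sample = [
    {"property": "institution", "text": "BULLFROG GOLD CORP."},
    {"property": "title", "text": "NI 43-101 Technical Report Mineral Resource Estimate Bullfrog Gold Project Nye County, Nevada"},
    {"property": "date", "text": "August 9, 2017"},
    {"property": "author", "text": "Rex Bryan"},
    {"property": "project", "text": "Bullfrog Gold Project"},
    {"property": "place", "text": "Nye County"},
]
report_item = build_report_item(sample)
```

The resulting dictionary would still need to be sent to the library with a write-enabled API key, and the tag mapping is only a placeholder until a better home for subject places and project names is worked out.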