<a href="https://colab.research.google.com/github/tylerdq/notebooks/blob/master/pdfgrep_gdrive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visualize key terms in PDF file(s) from a Google Drive folder

*Implementation by [Tyler Quiring](https://github.com/tylerdq/)  
Source available on [GitHub](https://github.com/tylerdq/notebooks/blob/master/pdfgrep_gdrive.ipynb)*

## Documentation

### Functionality
This notebook finds a key term or search phrase in a set of PDFs that contain a [searchable](https://blogs.adobe.com/acrolaw/2007/02/is_that_pdf_sea/) text layer. It returns two bar charts covering every PDF with matches: the total count of the term/phrase in each PDF, and the number of pages in each PDF that contain the term/phrase at least once. It also provides a way to drill down into how a term/phrase is distributed across the pages of a single PDF file.
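
As a rough illustration of the two per-file numbers described above, the sketch below aggregates colon-delimited `file:page:count` lines (the layout the cells below parse from pdfgrep's per-page output) using only the standard library. The function name `summarize_matches` and the sample lines are made up for illustration; they are not part of the notebook.

```python
from collections import defaultdict


def summarize_matches(page_output):
    """Aggregate `file:page:count` lines into per-file totals.

    Returns {file: {"count": total matches, "pages": pages with a match}}.
    """
    totals = defaultdict(lambda: {"count": 0, "pages": 0})
    for line in page_output.splitlines():
        if not line.strip():
            continue
        # Split from the right so a colon inside a file name is preserved.
        name, _page, count = line.rsplit(":", 2)
        totals[name]["count"] += int(count)
        totals[name]["pages"] += 1
    return dict(totals)


# Invented sample output, not real search results.
sample = "a.pdf:1:3\na.pdf:7:1\nb.pdf:2:5\n"
summary = summarize_matches(sample)
print(summary)
```

The notebook itself does the same aggregation with pandas (`groupby` plus `merge`), which is more convenient for plotting.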

### Implementation details
[pdfgrep](https://pdfgrep.org/) drives the PDF search functionality. It is a command-line utility for Unix-like systems, such as the Linux virtual machine that powers this notebook. Importantly, result accuracy depends on the quality of the text layer in each PDF, which for scanned documents comes from whatever [OCR](https://en.wikipedia.org/wiki/Optical_character_recognition) process was applied (if any). OCR is a notoriously difficult problem and, while the technology is constantly improving, this notebook may return inaccurate results for PDFs with optically complex text (such as scans or pages with ligatures) and will miss results for PDFs or pages that have no searchable text layer.

### Data caching/persistence
The cells below take advantage of pdfgrep's cache feature, which retains text from PDFs to speed up subsequent searches (by avoiding re-extracting the text from the PDFs for each search). The cache persists only for the life of the underlying virtual machine, which according to [Google's FAQ](https://research.google.com/colaboratory/faq.html) is reset periodically to free up resources for other users. After a reset the cache is gone, so the first search in a new session has to rebuild it and will be slower.
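
One possible way to keep the cache between sessions is to store it on the mounted Drive. The sketch below assumes that pdfgrep picks its cache directory from the `XDG_CACHE_HOME` environment variable (check `man pdfgrep` for the version you have installed). The helper name `pdfgrep_env` and the Drive path are hypothetical and not part of the notebook.

```python
import os


def pdfgrep_env(cache_root, base_env=None):
    """Return a copy of the environment with XDG_CACHE_HOME set to cache_root.

    Assumption: pdfgrep stores its cache under $XDG_CACHE_HOME when it is set.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["XDG_CACHE_HOME"] = cache_root
    return env


# Hypothetical cache folder on the mounted Drive.
env = pdfgrep_env("/content/drive/MyDrive/.cache", base_env={"PATH": "/usr/bin"})
print(env["XDG_CACHE_HOME"])
```

The resulting dictionary could then be passed as `env=env` to `subprocess.check_output`. Reading a cache through the Drive mount may be slower than reading local disk, so whether this helps is worth measuring.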

### Authentication notes
The set-up step below requires authenticating the Colab notebook to access the contents of your Google Drive. This is a direct authentication between two Google services (Colab and Drive) under your own account. However, if you are uncomfortable with the required authentication steps, do not proceed. This notebook is provided on an AS-IS basis and the user is responsible for any effects of using the software.

## Execution

In [0]:
#@title Set up
#@markdown 1. Run this cell
#@markdown 2. Follow authentication link & instructions
#@markdown 3. Click folder icon in left sidebar
#@markdown 4. Navigate to the location under "drive" folder with your PDFs
#@markdown 5. Right-click the folder, click "Copy path", paste when requested
!sudo apt-get install -y pdfgrep
import subprocess, os, glob, csv
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
path = input('Please enter the path with the PDFs you would like to search: ')
os.chdir(path)
print('Thank you, please proceed to the next cell')

In [0]:
#@title Term and page count for a given search phrase in all PDFs
term = "human" #@param {type:"string"}
pdfs = sorted(glob.glob('*.pdf'))
# -H prints the file name even when the folder holds a single PDF, so the
# output always parses into the same columns
pg_result = subprocess.check_output(['pdfgrep', '-i', term, '-p', '-H', '--cache']
                                    + pdfs).decode('utf-8')
ct_result = subprocess.check_output(['pdfgrep', '-i', term, '-c', '-H', '--cache']
                                    + pdfs).decode('utf-8')
# Colon-delimited output: file:page:count (per page) and file:count (per file)
pg_data = list(csv.reader(pg_result.splitlines(), delimiter=':'))
ct_data = list(csv.reader(ct_result.splitlines(), delimiter=':'))
pg_df = pd.DataFrame(data=pg_data, columns=['file', 'page', 'count'])
ct_df = pd.DataFrame(data=ct_data, columns=['file', 'count'])
# Convert the numeric column explicitly; the file column stays as text
ct_df['count'] = ct_df['count'].astype(int)
# Number of pages with at least one match in each file
pg_df = pg_df.groupby('file').size().to_frame('pages').reset_index()
df = pg_df.merge(ct_df, on='file')
count = df.plot.bar(x='file', y='count', rot=90, figsize=(12, 6))
pages = df.plot.bar(x='file', y='pages', rot=90, figsize=(12, 6))
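
Like grep, pdfgrep exits with status 1 when it finds no match, and `subprocess.check_output` turns any non-zero status into a `CalledProcessError`. A minimal sketch of a more tolerant runner follows. The function name `run_grep_like` is hypothetical, and the stand-in commands use the Python interpreter so the example runs without pdfgrep installed.

```python
import subprocess
import sys


def run_grep_like(cmd):
    """Run a grep-style command and return its decoded stdout.

    Exit status 0 (match) and 1 (no match) are both treated as success;
    any other status is raised as an error.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if proc.returncode in (0, 1):
        return proc.stdout.decode("utf-8")
    raise RuntimeError(proc.stderr.decode("utf-8", errors="replace"))


# Stand-ins for a search with a match and a search without one.
found = run_grep_like([sys.executable, "-c", "print('a.pdf:1:3')"])
missing = run_grep_like([sys.executable, "-c", "import sys; sys.exit(1)"])
print(repr(found.strip()), repr(missing))
```

In the cell above, a call such as `run_grep_like(['pdfgrep', '-i', term, '-p', '-H', '--cache'] + pdfs)` would yield an empty string, and therefore an empty table, when the term does not occur in any PDF.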

In [0]:
#@title Term count for each page with matches in a single PDF
filename = "colebrook2014death.pdf" #@param {type:"string"}
term = "human" #@param {type:"string"}
one_result = subprocess.check_output(['pdfgrep', '-i', term, '-p', '--cache',
                                      filename]).decode('utf-8')
# With a single file, each output line is page:count
data = list(csv.reader(one_result.splitlines(), delimiter=':'))
df = pd.DataFrame(data=data, columns=['page', 'count']).astype(int)
ax = df.plot.bar(x='page', y='count', rot=90, figsize=(24, 6))
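
The chart above shows only pages that contain a match, so gaps between matching pages are not visible. To see the distribution across the whole document, pages without matches can be filled in with zeros. The sketch below assumes the `page:count` layout parsed above; the function name `page_distribution`, the `total_pages` parameter, and the sample lines are hypothetical, and the total page count has to come from elsewhere.

```python
def page_distribution(single_file_output, total_pages):
    """Return one count per page (1..total_pages), using 0 where no match was reported."""
    counts = {}
    for line in single_file_output.splitlines():
        if line.strip():
            page, count = line.split(":")
            counts[int(page)] = int(count)
    return [counts.get(page, 0) for page in range(1, total_pages + 1)]


# Invented sample: matches on pages 2 and 5 of a six-page document.
dist = page_distribution("2:3\n5:1\n", total_pages=6)
print(dist)
```

The resulting list can be plotted directly, for example with `pd.Series(dist, index=range(1, len(dist) + 1)).plot.bar()`.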