When working with digitally available humanities data, you may start to feel that your data is everywhere but never in the format you need for computational analysis. While the amount of digitally available data continues to grow, much of it is available only in Portable Document Format (PDF). Not to fear. There are many Python tools that allow you to extract a PDF's image and/or textual data into a format that is conducive to distant reading.

There are a variety of packages that allow us to extract the textual data of a PDF. We will look at two, PDFminer.six and Poppler. Of the two, PDFminer.six is the easier to set up, especially if you use a Windows machine. Poppler is the more powerful program and may be better suited to your needs if you are working with images.

Both packages can recognize a variety of fonts and characters, including Chinese, Japanese, and Korean text in the case of PDFminer.six.

The first section will show you how to install PDFminer.six, how to pull in a single PDF and extract its text, and then how to loop through a directory and extract the text of multiple PDF documents. 

Note that this code will not run in the notebook. Feel free to copy and paste or save to a file as is useful.

#### Installing PDFminer

In [None]:
# install PDFminer.six
$ pip install pdfminer.six

You may now run PDFminer.six as a Python package or, if you only need to extract a PDF's text occasionally, you can use its command line features.

#### Working with Text in a Single PDF

In [None]:
# call PDFminer's built-in script to extract text from a PDF and print that data to the command line
$ pdf2txt.py source.pdf

# extract textual data from a PDF and write that data as plain text to a new file
$ pdf2txt.py source.pdf -o output.txt
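
The same single-file extraction can also be done from Python rather than from the command line. The sketch below assumes pdfminer.six's high-level `extract_text` function. The helper name `pdf_to_txt` and its `extract` parameter are hypothetical, added here only so that a different extraction function can be swapped in.

```python
from pathlib import Path


def pdf_to_txt(pdf_path, txt_path=None, extract=None):
    """Extract the text of one PDF, write it to a .txt file, and return that file's path."""
    if extract is None:
        # Imported here so the helper can be defined before pdfminer.six is installed
        from pdfminer.high_level import extract_text as extract

    pdf_path = Path(pdf_path)
    # Default to the PDF's own name with a .txt extension, e.g. source.pdf -> source.txt
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")

    text = extract(str(pdf_path))
    txt_path.write_text(text, encoding="utf-8")
    return txt_path
```

Calling `pdf_to_txt("source.pdf")` would write `source.txt` next to the PDF, much like `pdf2txt.py source.pdf -o source.txt`.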

#### Working with Text in Multiple PDFs

The code below loops through a directory, finds all the PDFs, extracts their textual data, and writes that data to a new .txt file in an output directory.

Before you begin, create the output directory, which we have named "out_txt/" below.

In [None]:
import io
import os

from pdfminer.converter import HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

# converts a PDF and returns its content: a string for 'text', bytes for 'HTML'
def convert(case, fn, pages=None):
    # an empty set tells PDFPage.get_pages to process every page
    pagenums = set(pages) if pages else set()
    manager = PDFResourceManager()

    if case == 'text':
        output = io.StringIO()
        converter = TextConverter(manager, output, laparams=LAParams())
    elif case == 'HTML':
        output = io.BytesIO()
        converter = HTMLConverter(manager, output, laparams=LAParams())
    else:
        raise ValueError("case must be 'text' or 'HTML'")

    interpreter = PDFPageInterpreter(manager, converter)

    # open the PDF in binary mode; the with block closes it even if a page fails
    with open(fn, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums, caching=True, check_extractable=True):
            interpreter.process_page(page)

    # close the converter first so that any trailing output is written
    converter.close()
    convertedPDF = output.getvalue()
    output.close()
    return convertedPDF

def convertMultiple(pdfDir, txtDir):
    for pdf in os.listdir(pdfDir):  # iterate through the files in the PDF directory
        if pdf.lower().endswith(".pdf"):
            pdfFilename = os.path.join(pdfDir, pdf)
            text = convert('text', pdfFilename)  # get the text content of the PDF as a string
            textFilename = os.path.join(txtDir, pdf + ".txt")
            # write the text out as UTF-8; the with block closes the file
            with open(textFilename, "w", encoding="utf-8") as textFile:
                textFile.write(text)

pdfDir = "pdf_input_dir/"
txtDir = "out_txt/"
convertMultiple(pdfDir, txtDir)
print('Finished converting PDFs')
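
Before converting a large directory, it can help to preview which files the loop will touch. The sketch below uses only the standard library, and the function name `plan_conversions` is hypothetical. It matches the `.pdf` extension regardless of case and names each output `report.txt` rather than `report.pdf.txt`, which is a design choice and not something PDFminer.six requires.

```python
from pathlib import Path


def plan_conversions(pdf_dir, txt_dir):
    """Return sorted (pdf_path, txt_path) pairs for every PDF directly inside pdf_dir."""
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    pairs = []
    for path in sorted(pdf_dir.iterdir()):
        # Skip subdirectories and anything that is not a PDF (the check ignores case)
        if path.is_file() and path.suffix.lower() == ".pdf":
            pairs.append((path, txt_dir / (path.stem + ".txt")))
    return pairs
```

Printing the pairs returned by `plan_conversions("pdf_input_dir/", "out_txt/")` shows each input file alongside the text file it would produce, without reading any PDF.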

The following section will show the basics of using Poppler to extract text and images from a PDF and from a group of PDFs. There is much more one can do with Poppler.

#### Installing Poppler

In [None]:
# install Poppler (this command uses Homebrew on macOS)
$ brew install poppler

#### Working with Text

In [None]:
# to extract embedded text from a single PDF
$ pdftotext 'input.pdf' 'output.txt'

# to extract embedded text from all the PDFs in your working directory
$ find . -name '*.pdf' -exec pdftotext '{}' '{}.txt' \;
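
If you would rather drive Poppler from Python, the same commands can be run with the standard library's `subprocess` module. This is a minimal sketch, assuming the `pdftotext` executable is on your PATH; the function names are hypothetical. It writes `name.txt` rather than the `name.pdf.txt` produced by the `find` command above.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path=None):
    """Build the argument list for: pdftotext input.pdf output.txt"""
    pdf_path = Path(pdf_path)
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    return ["pdftotext", str(pdf_path), str(txt_path)]


def extract_all_text(directory="."):
    """Run pdftotext on every PDF found under directory, including subdirectories."""
    for pdf_path in sorted(Path(directory).rglob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext reports a failure
        subprocess.run(pdftotext_command(pdf_path), check=True)
```

Passing the command as a list, not as a single shell string, means file names containing spaces or quotes need no extra escaping.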

#### Working with Images

In [None]:
# to extract embedded images from a single PDF
# you may specify the format of the output image; we have used PNG
$ pdfimages -png 'input.pdf' 'output'

# to extract embedded images from all the PDFs in your working directory
$ find . -name '*.pdf' -exec pdfimages -png '{}' '{}' \;

Note that Poppler will only extract images that are embedded as individual images within the PDF. Depending on the nature of the PDF, what appears to you as an image may not be recognized as one by Poppler. There are workarounds for this problem on [Programming Historian](https://programminghistorian.org/en/lessons/extracting-illustrated-pages).
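
One way to check in advance whether a PDF contains embedded images is Poppler's `pdfimages -list` option, which prints a table of the images it finds instead of extracting them. The sketch below counts the rows of that table. It assumes the listing consists of header lines followed by one row per image, each row starting with a page number; the function names are hypothetical.

```python
import subprocess


def count_listed_images(listing):
    """Count the image rows in the text printed by `pdfimages -list`."""
    count = 0
    for line in listing.splitlines():
        fields = line.split()
        # Assumption: image rows begin with a page number; header and rule lines do not
        if fields and fields[0].isdigit():
            count += 1
    return count


def count_embedded_images(pdf_path):
    """Run `pdfimages -list` on a PDF and return how many embedded images it reports."""
    result = subprocess.run(
        ["pdfimages", "-list", str(pdf_path)],
        capture_output=True, text=True, check=True,
    )
    return count_listed_images(result.stdout)
```

A count of zero for a PDF that visibly contains pictures suggests they are not stored as individual embedded images, which is the situation described above.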