# Getting Data from PDFs

When working with digitally available humanities data, you may start to feel that your data is everywhere but never in the format you need for computational analysis. While the amount of digitally available data continues to grow, much of it is available only in Portable Document Format (PDF). It might seem that possessing PDF versions of all your texts gives you access to your materials, but extra work is necessary to convert them into formats you can use for digital text analysis.

Not to fear. There are many Python tools that allow you to extract a PDF's image and/or textual data into a format that is conducive to distant reading. In general, when working with a PDF there is one central question to consider:

* Does your PDF have text information embedded inside of it?
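
One rough way to answer this question programmatically is to try extracting text from the first few pages and see whether anything comes back. The sketch below assumes PDFminer.six (installed later in this section) and its `extract_text` helper. The function names and the 20-character threshold are hypothetical choices made for illustration, not part of any library, and a scanned PDF with a stray embedded caption could still pass the check.

```python
def has_meaningful_text(extracted, min_chars=20):
    """Return True if the string holds at least min_chars non-whitespace characters.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    # str.split() with no arguments drops all whitespace, including the
    # form-feed characters that often separate pages in extracted text.
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def pdf_has_embedded_text(pdf_path, max_pages=3):
    """Extract text from the first few pages and report whether any was found."""
    # Imported here so has_meaningful_text can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    sample = extract_text(pdf_path, maxpages=max_pages)
    return has_meaningful_text(sample)
```

Calling `pdf_has_embedded_text("pdf_input_dir/cane.pdf")` should return `True` for a PDF with embedded text. A PDF made only of page images would typically return `False`, which suggests it needs OCR first.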

TODO: describe the issue and how to know if a PDF is an image vs a PDF with embedded text
TODO: Might illustrate this by taking a screenshot of a few pages of one of the PDF files and running the commands on them to show how they won't find any text in them.
TODO: describe how to OCR things to pull out text  
TODO: transition to remaining information.  
 
## Working with a PDF that has Text Data

There are a variety of packages that allow us to extract the textual data of a PDF. We will look at two, PDFminer.six and Poppler. Of the two, PDFminer.six is the easier to set up, especially if you use a Windows machine. Poppler is the more powerful program and may be better suited to your needs if you are working with images.

Both packages can recognize a variety of fonts and characters, including Chinese, Japanese, and Korean text in the case of PDFminer.six.

The first section will show you how to install PDFminer.six, how to pull in a single PDF and extract its text, and then how to loop through a directory and extract the text of multiple PDF documents. 

Note that in what follows a code block beginning with %%bash indicates commands meant to be typed from the command line.

#### Installing PDFminer

First, we install PDFminer.six:

In [3]:
%%bash
pip install pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer.six-20200726-py3-none-any.whl (5.6 MB)
Collecting cryptography
  Downloading cryptography-3.1-cp35-abi3-macosx_10_10_x86_64.whl (1.8 MB)
Collecting sortedcontainers
  Downloading sortedcontainers-2.2.2-py2.py3-none-any.whl (29 kB)
Collecting cffi!=1.11.3,>=1.8
  Downloading cffi-1.14.2-cp38-cp38-macosx_10_9_x86_64.whl (176 kB)
Collecting pycparser
  Downloading pycparser-2.20-py2.py3-none-any.whl (112 kB)
Installing collected packages: pycparser, cffi, cryptography, sortedcontainers, pdfminer.six
Successfully installed cffi-1.14.2 cryptography-3.1 pdfminer.six-20200726 pycparser-2.20 sortedcontainers-2.2.2


You may now run PDFminer.six as a Python package or, if you only need to extract a PDF's text occasionally, use its command-line features. Note that, depending on your setup and whether you have Python 2 installed, you may need to run `pip3 install pdfminer.six` instead.
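
To confirm that the installation is visible to the Python you are running, you can ask Python whether it can find the module. Note that the package is installed as `pdfminer.six` but imported as `pdfminer`. The helper name `is_installed` below is a hypothetical one used for this sketch.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# The import name is "pdfminer", even though pip installs "pdfminer.six".
print("pdfminer available:", is_installed("pdfminer"))
```

If this prints `False` right after a successful install, `pip` and your notebook are probably using different Python installations.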

#### Working with Text in a Single PDF

The most basic usage is to call PDFminer's built-in script to extract text from a PDF and print that data to the command line.

$ pdf2txt.py input_text.pdf

For the purposes of these examples, we have taken a few plain-text files from Project Gutenberg and saved them as PDFs inside a folder called `pdf_input_dir`. Here is how the above command would work for a PDF version of *Cane* that has its text embedded:

In [7]:
%%bash
pdf2txt.py pdf_input_dir/cane.pdf

The Project Gutenberg EBook of Cane, by Jean Toomer 
 
This eBook is for the use of anyone anywhere in the United States and 
most other parts of the world at no cost and with almost no restrictions 
whatsoever.  You may copy it, give it away or re-use it under the terms 
of the Project Gutenberg License included with this eBook or online at 
www.gutenberg.org.  If you are not located in the United States, you'll 
have to check the laws of the country where you are located before using 
this ebook. 
 
 
 
Title: Cane 
 
Author: Jean Toomer 
 
Contributor: Waldo Frank 
 
Release Date: August 12, 2019 [EBook #60093] 
 
Language: English 
 
Character set encoding: UTF-8 
 
*** START OF THIS PROJECT GUTENBERG EBOOK CANE *** 
 
 
 
 
Produced by Tim Lindell, Robert Tonsing, and the Online 
Distributed Proofreading Team at http://www.pgdp.net (This 
book was produced from images made available by the 
HathiTrust Digital Library.) 
 
 
 
 
 
 
 
 
 
                                 CANE 
 
 


Printing to the screen doesn't do us a lot of good unless we just want to see whether a file has text data in it. To keep the text, we can extract it and write it as plain text to a new file using this formulation:

$ pdf2txt.py source.pdf -o output.txt

Again with the Cane example:

In [9]:
%%bash
pdf2txt.py pdf_input_dir/cane.pdf -o cane.txt

This command prints nothing by default, but we can open the new `cane.txt` file in a GUI or use the `cat` command to check its contents.

In [10]:
%%bash
cat cane.txt

The Project Gutenberg EBook of Cane, by Jean Toomer 
 
This eBook is for the use of anyone anywhere in the United States and 
most other parts of the world at no cost and with almost no restrictions 
whatsoever.  You may copy it, give it away or re-use it under the terms 
of the Project Gutenberg License included with this eBook or online at 
www.gutenberg.org.  If you are not located in the United States, you'll 
have to check the laws of the country where you are located before using 
this ebook. 
 
 
 
Title: Cane 
 
Author: Jean Toomer 
 
Contributor: Waldo Frank 
 
Release Date: August 12, 2019 [EBook #60093] 
 
Language: English 
 
Character set encoding: UTF-8 
 
*** START OF THIS PROJECT GUTENBERG EBOOK CANE *** 
 
 
 
 
Produced by Tim Lindell, Robert Tonsing, and the Online 
Distributed Proofreading Team at http://www.pgdp.net (This 
book was produced from images made available by the 
HathiTrust Digital Library.) 
 
 
 
 
 
 
 
 
 
                                 CANE 
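
The same single-file conversion can also be done from Python rather than the command line. This is a minimal sketch, assuming PDFminer.six's `extract_text` helper in `pdfminer.high_level`. The function names `save_text` and `pdf_to_txt` are hypothetical and introduced only for this example.

```python
from pathlib import Path


def save_text(text, out_path):
    """Write text to out_path as UTF-8 and return the path."""
    out_path = Path(out_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path


def pdf_to_txt(pdf_path, out_path):
    """Extract the text of one PDF and save it as a plain-text file."""
    # Imported here so save_text can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return save_text(extract_text(pdf_path), out_path)
```

With the example files used here, `pdf_to_txt("pdf_input_dir/cane.pdf", "cane.txt")` would play roughly the same role as `pdf2txt.py pdf_input_dir/cane.pdf -o cane.txt`.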
 
 


#### Working with Text in Multiple PDFs

Of course, we very rarely want to execute a command like this on a single file. We usually have a whole corpus. In what follows, we use PDFminer.six in a Python script to perform text extraction across a series of texts. Let's loop through a directory, find all the PDFs, extract their textual data, and write the extracted text as new .txt files in an output directory.

Before you begin, note that the following script assumes a certain directory structure. It assumes that you have an input directory with a number of PDF files in it (`pdf_input_dir` below) and an output directory (`out_txt` below). We use `os.listdir` to get the filenames from a folder, but there are other ways to do this that we cover in [our section on working with the file structure in Python](file_structure.ipynb). What follows is refactored but draws upon a [gist](https://github.com/Shahabks/Converter-pdf-files-to-.txt-or-.html/blob/master/myPDF2txt.py) uploaded by GitHub user [Shahabks](Shahabks).

In [11]:
import os
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_single(fn):
    """Convert a PDF and return its text content as a string."""
    manager = PDFResourceManager()
    output = StringIO()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    with open(fn, 'rb') as infile:
        # Process every page; the converter appends each page's text to `output`.
        for page in PDFPage.get_pages(infile, set(), caching=True, check_extractable=True):
            interpreter.process_page(page)

    # Read the accumulated text once, after all pages (also works for a PDF with no pages).
    converted_pdf = output.getvalue()
    converter.close()
    output.close()
    return converted_pdf


def convert_all(pdf_dir, txt_dir):
    """Convert every PDF in pdf_dir and write each one's text to txt_dir."""
    os.makedirs(txt_dir, exist_ok=True)  # create the output directory if it is missing
    for pdf in os.listdir(pdf_dir):  # iterate through the files in the input directory
        base, file_extension = os.path.splitext(pdf)
        if file_extension.lower() == ".pdf":
            pdf_filename = os.path.join(pdf_dir, pdf)
            text = convert_single(pdf_filename)  # text content of the PDF as a string
            text_filename = os.path.join(txt_dir, base + ".txt")
            with open(text_filename, 'w', encoding='utf-8') as text_file:
                text_file.write(text)


pdf_dir = "pdf_input_dir/"
txt_dir = "out_txt/"
convert_all(pdf_dir, txt_dir)
print('look it finished')

look it finished


Now we could check that it worked by looking in the GUI or by using the terminal:

In [13]:
%%bash
ls out_txt

cane.txt
douglass.txt
wheatley.txt
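
As one of the other ways to gather filenames mentioned earlier, the sketch below uses the standard library's `pathlib` instead of `os.listdir` to pair each PDF with the text file it would be written to. It only plans the filenames and does not convert anything. The function name `plan_outputs` is hypothetical, and the directory names are the ones assumed throughout this section.

```python
from pathlib import Path


def plan_outputs(pdf_dir, txt_dir):
    """Pair each PDF in pdf_dir with the .txt path it would be written to in txt_dir."""
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    pairs = []
    for pdf_path in sorted(pdf_dir.iterdir()):
        # Accept ".pdf" in any capitalization and skip subdirectories.
        if pdf_path.is_file() and pdf_path.suffix.lower() == ".pdf":
            pairs.append((pdf_path, txt_dir / (pdf_path.stem + ".txt")))
    return pairs


# Only list the plan if the example input directory is actually present.
if Path("pdf_input_dir").is_dir():
    for pdf_path, txt_path in plan_outputs("pdf_input_dir", "out_txt"):
        print(pdf_path.name, "->", txt_path.name)
```

Building the list of pairs first makes it easy to check what will be written before running a long conversion over a large corpus.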


TODO: Some sort of conclusion