# Chapter 15: Working with PDF and Word Documents
PDF and Word documents are binary files, which makes them much more complex than plaintext files. In addition to text, they store lots of font, color, and layout information. If you want your programs to read or write to PDFs or Word documents, you’ll need to do more than simply pass their filenames to open().

Fortunately, there are Python modules that make it easy for you to interact with PDFs and Word documents. This chapter will cover two such modules: PyPDF2 and Python-Docx.

## PDF Documents

PDF stands for Portable Document Format and uses the .pdf file extension. Although PDFs support many features, this chapter will focus on the two things you’ll be doing most often with them: reading text content from PDFs and crafting new PDFs from existing documents.

The module you’ll use to work with PDFs is PyPDF2 version 1.26.0. It’s important that you install this version because future versions of PyPDF2 may be incompatible with the code. To install it, run pip install --user PyPDF2==1.26.0 from the command line. This module name is case sensitive, so make sure the y is lowercase and everything else is uppercase. (Check out Appendix A for full details about installing third-party modules.) If the module was installed correctly, running import PyPDF2 in the interactive shell shouldn’t display any errors.

In [1]:
import PyPDF2

### Extracting Text from PDFs

PyPDF2 does not have a way to extract images, charts, or other media from PDF documents, but it can extract text and return it as a Python string. To start learning how PyPDF2 works, we’ll use it on the example PDF shown in Figure 15-1. Download this PDF from https://nostarch.com/automatestuff2/ and enter the following into the interactive shell.

First, import the PyPDF2 module. Then open meetingminutes.pdf in read binary mode and store it in pdfFileObj. To get a PdfFileReader object that represents this PDF, call PyPDF2.PdfFileReader() and pass it pdfFileObj. Store this PdfFileReader object in pdfReader.

The total number of pages in the document is stored in the numPages attribute of a PdfFileReader object ➊. The example PDF has 19 pages, but let’s extract text from only the first page.

To extract text from a page, you need to get a Page object, which represents a single page of a PDF, from a PdfFileReader object. You can get a Page object by calling the getPage() method ➋ on a PdfFileReader object and passing it the page number of the page you’re interested in—in our case, 0.

PyPDF2 uses a zero-based index for getting pages: The first page is page 0, the second is page 1, and so on. This is always the case, even if pages are numbered differently within the document. For example, say your PDF is a three-page excerpt from a longer report, and its pages are numbered 42, 43, and 44. To get the first page of this document, you would want to call pdfReader.getPage(0), not getPage(42) or getPage(1).
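The arithmetic behind that example can be sketched as a small helper. The name pageIndex is hypothetical (it is not part of PyPDF2), and the sketch assumes the document's printed page numbers are consecutive integers:

```python
def pageIndex(printedNumber, firstPrintedNumber):
    """Convert a page number printed in the document to a zero-based index."""
    index = printedNumber - firstPrintedNumber
    if index < 0:
        raise ValueError('printedNumber comes before the first page of the PDF')
    return index

# The three-page excerpt whose pages are numbered 42, 43, and 44:
print(pageIndex(42, 42))  # 0, so the first page is getPage(0)
print(pageIndex(44, 42))  # 2, so the last page is getPage(2)
```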

Once you have your Page object, call its extractText() method to return a string of the page’s text ➌. The text extraction isn’t perfect: The text Charles E. “Chas” Roemer, President from the PDF is absent from the string returned by extractText(), and the spacing is sometimes off. Still, this approximation of the PDF text content may be good enough for your program.

In [None]:
import PyPDF2, os

loadPath = os.path.join('automate_online-materials', 'meetingminutes.pdf')

pdfFileObj = open(loadPath, 'rb') # open the file in read binary mode and store it in pdfFileObj
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) # get a PdfFileReader object that represents this PDF and store it in pdfReader
pdfReader.numPages # the total number of pages is stored in the numPages attribute of the PdfFileReader

19

In [6]:
pageObj = pdfReader.getPage(0) # create a page object from the reader representing the first page
pageObj.extractText() # call the extractText method to return a string of the page's text

'OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS   Meeting of \nMarch 7\n, 2014\n        \n     The Board of Elementary and Secondary Education shall provide leadership and \ncreate policies for education that expand opportunities for children, empower \nfamilies and communities, and advance Louisiana in an increasingly \ncompetitive glob\nal market.\n BOARD \n of ELEMENTARY\n and \n SECONDARY\n EDUCATION\n  '
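The example above reads only the first page. As a minimal sketch, the same calls can be put in a loop to collect the text of every page. The helper name extractAllText is hypothetical; it relies only on the numPages attribute and the getPage() and extractText() methods described in this section:

```python
def extractAllText(pdfReader):
    """Return a list with one string of extracted text per page."""
    pagesText = []
    for pageNum in range(pdfReader.numPages):  # page numbers start at 0
        pageObj = pdfReader.getPage(pageNum)
        pagesText.append(pageObj.extractText())
    return pagesText
```

With the pdfReader object from the example above, extractAllText(pdfReader) should return a list of 19 strings, one per page. The same caveats about missing text and odd spacing apply to every page.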

### Decrypting PDFs

Some PDF documents have an encryption feature that will keep them from being read until whoever is opening the document provides a password. Enter the following into the interactive shell with the PDF you downloaded, which has been encrypted with the password rosebud:

In [7]:
import PyPDF2, os

loadPath = os.path.join('automate_online-materials', 'encrypted.pdf')

pdfReader = PyPDF2.PdfFileReader(open(loadPath, 'rb'))
pdfReader.isEncrypted

True

In [9]:
pdfReader.getPage(0)



PdfReadError: file has not been decrypted

In [10]:
pdfReader = PyPDF2.PdfFileReader(open(loadPath, 'rb'))
pdfReader.decrypt('rosebud')

1

In [14]:
pageObj = pdfReader.getPage(0)
pageObj.extractText()

'OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS   Meeting of \nMarch 7\n, 2014\n        \n     The Board of Elementary and Secondary Education shall provide leadership and \ncreate policies for education that expand opportunities for children, empower \nfamilies and communities, and advance Louisiana in an increasingly \ncompetitive glob\nal market.\n BOARD \n of ELEMENTARY\n and \n SECONDARY\n EDUCATION\n  '

All PdfFileReader objects have an isEncrypted attribute that is True if the PDF is encrypted and False if it isn’t ➊. Any attempt to call a function that reads the file before it has been decrypted with the correct password will result in an error ➋.

NOTE

Due to a bug in PyPDF2 version 1.26.0, calling getPage() on an encrypted PDF before calling decrypt() on it causes future getPage() calls to fail with the following error: IndexError: list index out of range. This is why our example reopened the file with a new PdfFileReader object.

To read an encrypted PDF, call the decrypt() function and pass the password as a string ➌. After you call decrypt() with the correct password, you’ll see that calling getPage() no longer causes an error. If given the wrong password, the decrypt() function will return 0 and getPage() will continue to fail. Note that the decrypt() method decrypts only the PdfFileReader object, not the actual PDF file. After your program terminates, the file on your hard drive remains encrypted. Your program will have to call decrypt() again the next time it is run.
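One way to package these checks is sketched below. The function name decryptReader is hypothetical, and the sketch assumes only the behavior described above: isEncrypted tells you whether a password is needed, and decrypt() returns 0 when the password is wrong.

```python
def decryptReader(pdfReader, password):
    """Try to decrypt pdfReader; return True if its pages can be read."""
    if not pdfReader.isEncrypted:
        return True  # the PDF is not encrypted, so there is nothing to do
    # decrypt() returns 0 when the password is wrong.
    return pdfReader.decrypt(password) != 0
```

A program could call decryptReader(pdfReader, 'rosebud') right after creating the PdfFileReader object and before any getPage() call, which also sidesteps the bug described in the note above.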

### Creating PDFs

PyPDF2’s counterpart to PdfFileReader is PdfFileWriter, which can create new PDF files. But PyPDF2 cannot write arbitrary text to a PDF like Python can do with plaintext files. Instead, PyPDF2’s PDF-writing capabilities are limited to copying pages from other PDFs, rotating pages, overlaying pages, and encrypting files.

PyPDF2 doesn’t allow you to directly edit a PDF. Instead, you have to create a new PDF and then copy content over from an existing document. The examples in this section will follow this general approach:

1. Open one or more existing PDFs (the source PDFs) into PdfFileReader objects.
2. Create a new PdfFileWriter object.
3. Copy pages from the PdfFileReader objects into the PdfFileWriter object.
4. Finally, use the PdfFileWriter object to write the output PDF.

Creating a PdfFileWriter object creates only a value that represents a PDF document in Python. It doesn’t create the actual PDF file. For that, you must call the PdfFileWriter’s write() method.

The write() method takes a regular File object that has been opened in write-binary mode. You can get such a File object by calling Python’s open() function with two arguments: the string of what you want the PDF’s filename to be and 'wb' to indicate the file should be opened in write-binary mode.

If this sounds a little confusing, don’t worry—you’ll see how this works in the following code examples.
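As a minimal sketch of the final step, the helper below opens the output file in write-binary mode and hands it to write(). The name savePdf is hypothetical, and the with statement is a stylistic choice of this sketch rather than something PyPDF2 requires:

```python
def savePdf(pdfWriter, filename):
    """Write the document held by pdfWriter to a new file named filename."""
    # 'wb' opens the file in write-binary mode, which write() needs.
    with open(filename, 'wb') as outputFile:
        pdfWriter.write(outputFile)
    # Leaving the with block closes outputFile, even if write() raised an error.
```

The with statement saves you from calling close() yourself. Keep the source PDF files open until write() has finished, as the examples in this chapter do by closing them last.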

### Copying Pages

You can use PyPDF2 to copy pages from one PDF document to another. This allows you to combine multiple PDF files, cut unwanted pages, or reorder pages.

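Download meetingminutes.pdf and meetingminutes2.pdf from https://nostarch.com/automatestuff2/. The code below loads the PDFs from a folder named automate_online-materials and saves its output to a folder named Files, both inside the current working directory, so place the PDFs there and make sure the Files folder exists. Enter the following into the interactive shell: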

In [16]:
import PyPDF2, os

loadPath1 = os.path.join('automate_online-materials', 'meetingminutes.pdf')
loadPath2 = os.path.join('automate_online-materials', 'meetingminutes2.pdf')
savePath = os.path.join('Files', 'combinedminutes.pdf')

pdf1File = open(loadPath1, 'rb') # open both files in read binary mode
pdf2File = open(loadPath2, 'rb')
pdf1Reader = PyPDF2.PdfFileReader(pdf1File) # create file reader objects for both files
pdf2Reader = PyPDF2.PdfFileReader(pdf2File)
pdfWriter = PyPDF2.PdfFileWriter() # create a file writer object, which represents a blank PDF document

for pageNum in range(pdf1Reader.numPages): # copy all pages from file 1 and add them to the writer object
    pageObj = pdf1Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

for pageNum in range(pdf2Reader.numPages):
    pageObj = pdf2Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

pdfOutputFile = open(savePath, 'wb') # create a new pdf
pdfWriter.write(pdfOutputFile) # write the writer object to the new file
pdf1File.close()
pdf2File.close()
pdfOutputFile.close()


Open both PDF files in read binary mode and store the two resulting File objects in pdf1File and pdf2File. Call PyPDF2.PdfFileReader() and pass it pdf1File to get a PdfFileReader object for meetingminutes.pdf ➊. Call it again and pass it pdf2File to get a PdfFileReader object for meetingminutes2.pdf ➋. Then create a new PdfFileWriter object, which represents a blank PDF document ➌.

Next, copy all the pages from the two source PDFs and add them to the PdfFileWriter object. Get the Page object by calling getPage() on a PdfFileReader object ➍. Then pass that Page object to your PdfFileWriter’s addPage() method ➎. These steps are done first for pdf1Reader and then again for pdf2Reader. When you’re done copying pages, write a new PDF called combinedminutes.pdf by passing a File object to the PdfFileWriter’s write() method ➏.

NOTE

PyPDF2 cannot insert pages in the middle of a PdfFileWriter object; the addPage() method will only add pages to the end.

You have now created a new PDF file that combines the pages from meetingminutes.pdf and meetingminutes2.pdf into a single document. Remember that the File object passed to PyPDF2.PdfFileReader() needs to be opened in read-binary mode by passing 'rb' as the second argument to open(). Likewise, the File object passed to the PdfFileWriter’s write() method needs to be opened in write-binary mode with 'wb'.
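The same getPage() and addPage() calls can also cut or reorder pages if you choose which page numbers to copy. The sketch below is one possible way to do that; copyPages is a hypothetical helper, not a PyPDF2 function:

```python
def copyPages(pdfReader, pdfWriter, pageNums):
    """Append the pages of pdfReader listed in pageNums to pdfWriter, in order."""
    pageNums = list(pageNums)
    # Check every page number first so that nothing is copied if one is invalid.
    for pageNum in pageNums:
        if not 0 <= pageNum < pdfReader.numPages:
            raise IndexError('the source PDF has no page %s' % pageNum)
    for pageNum in pageNums:
        pdfWriter.addPage(pdfReader.getPage(pageNum))
```

For example, copyPages(pdf1Reader, pdfWriter, range(1, pdf1Reader.numPages)) would skip the first page, and copyPages(pdf1Reader, pdfWriter, [2, 0, 1]) would copy the first three pages in a new order. Because addPage() only appends, the order of pageNums is the order of pages in the output.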

### Rotating Pages

The pages of a PDF can also be rotated in 90-degree increments with the rotateClockwise() and rotateCounterClockwise() methods. Pass one of the integers 90, 180, or 270 to these methods. Enter the following into the interactive shell, with the meetingminutes.pdf file in the automate_online-materials folder that the code loads from:

In [None]:
import PyPDF2, os
loadPath = os.path.join('automate_online-materials', 'meetingminutes.pdf')
savePath = os.path.join('Files', 'rotatedPage.pdf')

minutesFile = open(loadPath, 'rb')
pdfReader = PyPDF2.PdfFileReader(minutesFile)
page = pdfReader.getPage(0) # get the first page (index 0) as a page object
page.rotateClockwise(90) # rotate the first page 90 degrees clockwise
pdfWriter = PyPDF2.PdfFileWriter() # create a new writer object
pdfWriter.addPage(page) # add the page object to the writer object
resultPdfFile = open(savePath, 'wb') # create a blank pdf
pdfWriter.write(resultPdfFile) # write the writer object to the blank pdf
resultPdfFile.close()
minutesFile.close()


Here we use getPage(0) to select the first page of the PDF ➊, and then we call rotateClockwise(90) on that page ➋. We write a new PDF with the rotated page and save it as rotatedPage.pdf ➌.

The resulting PDF will have one page, rotated 90 degrees clockwise, as shown in Figure 15-2. The return values from rotateClockwise() and rotateCounterClockwise() contain a lot of information that you can ignore.
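To rotate a whole document rather than a single page, the same method can be called inside a loop. This is a sketch under the assumptions stated in this section (the angle is 90, 180, or 270); rotateAllPages is a hypothetical helper name:

```python
def rotateAllPages(pdfReader, pdfWriter, degrees):
    """Add every page of pdfReader to pdfWriter, rotated clockwise by degrees."""
    if degrees not in (90, 180, 270):
        raise ValueError('degrees must be 90, 180, or 270')
    for pageNum in range(pdfReader.numPages):
        page = pdfReader.getPage(pageNum)
        page.rotateClockwise(degrees)  # the return value can be ignored
        pdfWriter.addPage(page)
```

After calling rotateAllPages(pdfReader, pdfWriter, 90), you would still need to call pdfWriter.write() with a file opened in write-binary mode to save the result.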

### Overlaying Pages

PyPDF2 can also overlay the contents of one page over another, which is useful for adding a logo, timestamp, or watermark to a page. With Python, it’s easy to add watermarks to multiple files and only to pages your program specifies.

Download watermark.pdf from https://nostarch.com/automatestuff2/ and place the PDF in the automate_online-materials folder along with meetingminutes.pdf. Then enter the following into the interactive shell:

In [19]:
import PyPDF2, os
loadPath1 = os.path.join('automate_online-materials', 'meetingminutes.pdf')
loadPath2 = os.path.join('automate_online-materials', 'watermark.pdf')
savePath = os.path.join('Files', 'watermarkedCover.pdf')

minutesFile = open(loadPath1, 'rb') # open the meeting minutes file
pdfReader = PyPDF2.PdfFileReader(minutesFile) # create a reader object from the meeting minutes file
minutesFirstPage = pdfReader.getPage(0) # create a page object from the first page of the minutes file
pdfWatermarkReader = PyPDF2.PdfFileReader(open(loadPath2, 'rb')) # open the watermark pdf and create a reader object from it

# merge the watermark's first page onto the first page of the minutes; minutesFirstPage now represents the watermarked first page
minutesFirstPage.mergePage(pdfWatermarkReader.getPage(0)) 

pdfWriter = PyPDF2.PdfFileWriter() # create a new writer object
pdfWriter.addPage(minutesFirstPage) # add the watermarked first page to the writer object

for pageNum in range(1, pdfReader.numPages): # loop through the rest of the pages in meeting minutes and add each page to the writer object
    pageObj = pdfReader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

resultPdfFile = open(savePath, 'wb') # open a new pdf file
pdfWriter.write(resultPdfFile) # write the writer object to the new file
minutesFile.close()
resultPdfFile.close()
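The example above watermarks only the cover page. A sketch of watermarking any set of pages follows; addWatermark is a hypothetical helper, and pageNums is assumed to hold zero-based page numbers:

```python
def addWatermark(pdfReader, pdfWriter, watermarkPage, pageNums):
    """Copy all pages of pdfReader to pdfWriter, overlaying watermarkPage
    on the pages whose zero-based numbers are listed in pageNums."""
    pagesToMark = set(pageNums)
    for pageNum in range(pdfReader.numPages):
        page = pdfReader.getPage(pageNum)
        if pageNum in pagesToMark:
            page.mergePage(watermarkPage)  # overlay the watermark on this page
        pdfWriter.addPage(page)
```

With the objects from the example above, addWatermark(pdfReader, pdfWriter, pdfWatermarkReader.getPage(0), [0]) should produce the same document, and passing range(pdfReader.numPages) instead of [0] would mark every page.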

### Encrypting PDFs

A PdfFileWriter object can also add encryption to a PDF document. Enter the following into the interactive shell:

Before calling the write() method to save to a file, call the encrypt() method and pass it a password string ➊. PDFs can have a user password (allowing you to view the PDF) and an owner password (allowing you to set permissions for printing, commenting, extracting text, and other features). The user password and owner password are the first and second arguments to encrypt(), respectively. If only one string argument is passed to encrypt(), it will be used for both passwords.

In this example, we copied the pages of meetingminutes.pdf to a PdfFileWriter object. We encrypted the PdfFileWriter with the password swordfish, opened a new PDF called encryptedminutes.pdf, and wrote the contents of the PdfFileWriter to the new PDF. Before anyone can view encryptedminutes.pdf, they’ll have to enter this password. You may want to delete the original, unencrypted meetingminutes.pdf file after ensuring its copy was correctly encrypted.

In [20]:
import PyPDF2, os
loadPath = os.path.join('automate_online-materials', 'meetingminutes.pdf')
savePath = os.path.join('Files', 'encryptedminutes.pdf')

pdfFile = open(loadPath, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFile)
pdfWriter = PyPDF2.PdfFileWriter()
for pageNum in range(pdfReader.numPages):
    pdfWriter.addPage(pdfReader.getPage(pageNum))

pdfWriter.encrypt('swordfish') # set the password for the encrypted pdf
resultPdf = open(savePath, 'wb')
pdfWriter.write(resultPdf)
resultPdf.close()
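The example above passes a single password, which is then used as both the user and the owner password. A sketch of setting them separately is shown below; encryptCopy is a hypothetical helper name, and the password values you pass to it are up to you:

```python
def encryptCopy(pdfReader, pdfWriter, userPassword, ownerPassword=None):
    """Copy all pages of pdfReader to pdfWriter, then encrypt pdfWriter."""
    for pageNum in range(pdfReader.numPages):
        pdfWriter.addPage(pdfReader.getPage(pageNum))
    if ownerPassword is None:
        # One argument: the same string is used for both passwords.
        pdfWriter.encrypt(userPassword)
    else:
        # Two arguments: the user password first, the owner password second.
        pdfWriter.encrypt(userPassword, ownerPassword)
```

As in the example, encrypt() has to be called before write(), so call encryptCopy() first and then write pdfWriter to a file opened in write-binary mode.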