# Working with PDF Files

You will often need to work with PDF files. There are [many libraries in Python for working with PDFs](https://reachtim.com/articles/PDF-Manipulation.html), each with its own pros and cons; the most common one is **PyPDF2**. You can install it with the command below. Note the capitalization: pip is forgiving about case in package names, but the later `import PyPDF2` statement is case-sensitive, so it is simplest to match the capitalization everywhere:

    pip install PyPDF2
    
Keep in mind that this library cannot read every PDF file. PDFs that are blurry scans, use a special encoding, are encrypted, or were created with a program that PyPDF2 handles poorly may not be readable. If you find yourself in this situation, try the libraries linked above, but keep in mind that these may not work either. The reason is that a PDF has many parameters and its settings can be quite non-standard; for example, text may be stored as an image rather than as encoded characters.

As far as PyPDF2 is concerned, it can only read the text from a PDF document; it won't be able to grab images or other media files from a PDF.
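
One practical consequence: when a page stores its text as an image, text extraction usually comes back empty. The sketch below flags such pages. `pages_without_text` is a hypothetical helper, not part of PyPDF2; it assumes only that the reader exposes `.pages` and that each page provides `extract_text()`, as the reader object used later in this notebook does.

```python
def pages_without_text(reader):
    """Return 1-based numbers of pages whose extracted text is empty.

    `reader` is assumed to be a PyPDF2.PdfReader, or any object with a
    `.pages` sequence whose items provide `extract_text()`.
    """
    empty = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""  # treat None like an empty string
        if not text.strip():
            empty.append(number)
    return empty
```

If this returns a non-empty list for a document, those pages probably hold their text as images, and PyPDF2 will not be able to read them.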
___

## Working with PyPDF2

Let's begin by showing the basics of the PyPDF2 library.

In [2]:
# note the capitalization
import PyPDF2

## Reading PDFs

First we open a PDF, then create a reader object for it. Notice that we open the file in binary mode, `'rb'`, instead of plain `'r'`.

In [3]:
# Notice we read it as a binary with 'rb'
f = open('US_Declaration.pdf','rb')

In [5]:
pdf_reader = PyPDF2.PdfReader(f)

In [7]:
len(pdf_reader.pages)

5

In [9]:
page_one = pdf_reader.pages[0]

We can then extract the text:

In [11]:
page_one_text = page_one.extract_text()

In [7]:
page_one_text

"Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the\npolitical bands which have connected them with another, and to assume among the powers of the\nearth, the separate and equal station to which the Laws of Nature and of Nature's God entitle\n\nthem, a decent respect to the opinions of mankind requires that they should declare the causes\n\nwhich impel them to the separation. \nWe hold these truths to be self-evident, that all men are created equal, that they are endowed by\n\ntheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving\n\ntheir just powers from the consent of the governed,ŠThat whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter or 

In [12]:
f.close()

## Adding to PDFs

We cannot write arbitrary text into a PDF with PyPDF2. Python works with a single string type, while a PDF page carries a variety of fonts, placements, and other parameters that a plain string does not describe.

What we *can* do is copy pages and append pages to the end.

In [13]:
f = open('US_Declaration.pdf','rb')
pdf_reader = PyPDF2.PdfReader(f)

In [15]:
first_page = pdf_reader.pages[0]

In [17]:
pdf_writer = PyPDF2.PdfWriter()

In [19]:
pdf_writer.add_page(first_page)

{'/Type': '/Page',
 '/Contents': {},
 '/MediaBox': [0, 0, 612, 792],
 '/Resources': {'/Font': {'/F9': {'/Type': '/Font',
    '/Subtype': '/Type1',
    '/Name': '/F9',
    '/Encoding': '/WinAnsiEncoding',
    '/FirstChar': 31,
    '/LastChar': 255,
    '/Widths': [778,
     250,
     333,
     ...

In [20]:
pdf_output = open("Some_New_Doc.pdf","wb")

In [21]:
pdf_writer.write(pdf_output)

(False, <_io.BufferedWriter name='Some_New_Doc.pdf'>)

In [22]:
pdf_output.close()
f.close()

Now we have copied a page and added it to another new document!
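
The same steps extend to copying several pages at once. This is a minimal sketch that assumes the `reader.pages` and `writer.add_page` names used in the cells above; `copy_pages` itself is a hypothetical helper, not part of PyPDF2.

```python
def copy_pages(reader, writer, page_indices):
    """Append the selected pages of `reader` to `writer`, in the given order.

    `page_indices` are 0-based, matching `reader.pages`.
    Returns the number of pages added.
    """
    count = 0
    for index in page_indices:
        writer.add_page(reader.pages[index])
        count += 1
    return count
```

With a reader and writer created as above, `copy_pages(pdf_reader, pdf_writer, [0, 2])` followed by `pdf_writer.write(pdf_output)` would put the first and third pages into the new file. As in the cells above, keep the source file open until the write has finished.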

___

## Simple Example

Let's try to grab all the text from this PDF file:

In [27]:
f = open('US_Declaration.pdf','rb')

# List of every page's text.
# The index will correspond to the page number.
pdf_text = [0]  # zero is a placeholder so that page 1 sits at index 1

pdf_reader = PyPDF2.PdfReader(f)

for p in range(len(pdf_reader.pages)):
    
    page = pdf_reader.pages[p]
    
    pdf_text.append(page.extract_text())

f.close()

In [28]:
pdf_text

[0,
 "Declaration of Independence\nIN CONGRESS, July 4, 1776.  \nThe unanimous Declaration of the thirteen united States of America,  \nWhen in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.— \x14That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  \x14That whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter o

In [32]:
print(pdf_text[2])

He has dissolved Re presentative Ho uses repeatedly , for opposing wit h manly
firmness his invasions on the rights of the people.
He has refused for a long time, after such dissolutions, to cause others to be
elected; whereby the Leg islative powers, incapable of Annihilation, have returned
to the People at lar ge for their exe rcise; the State r emaining in the me an time
exposed to all the dangers of invasion from without, and convulsions within.
He has endeavou red to prevent the  population of these  States; for that pur pose
obstructing the L aws for Natural ization of Foreig ners; refusing  to pass others to
encourage their migrations hither, and raising the conditions of new
Appropriations of  Lands.
He has obstructed the Administration of Justice, by refusing his Assent to Laws
for establishing  Judiciary pow ers.
He has made Judge s dependent on his Wil l alone, for the te nure of their off ices,
and the amount and  payment of t heir salaries.
He has erected  a multitude of N
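
The placeholder at index 0 is one way to line up indexes with page numbers. An alternative sketch is to build a dictionary keyed by 1-based page number with `enumerate(..., start=1)`. `text_by_page_number` is a hypothetical helper that assumes the same `reader.pages` and `extract_text()` names used above.

```python
def text_by_page_number(reader):
    """Map 1-based page numbers to each page's extracted text."""
    return {
        number: page.extract_text()
        for number, page in enumerate(reader.pages, start=1)
    }
```

With this, `text_by_page_number(pdf_reader)[2]` would give the text of the second page, with no dummy entry to skip. Call it while the file is still open.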

### Excellent work!
That is all for PyPDF2 for now. Remember that it won't work with every PDF file and that its scope is limited to the text of PDFs.
## Next up: Regular Expressions