# Working with PDF Files

You will often need to work with PDF files. There are [many libraries in Python for working with PDFs](https://reachtim.com/articles/PDF-Manipulation.html), each with its own pros and cons; the most common is **PyPDF2**. You can install it with the command below. Note that pip does not care about the capitalization of the package name, but Python does when you import it, so `import PyPDF2` must be capitalized exactly:

    pip install PyPDF2
    
Keep in mind that not every PDF file can be read with this library. PDFs that are scanned or blurry, use an unusual encoding, are encrypted, or were created by a program that doesn't work well with PyPDF2 may not be readable. If you find yourself in this situation, try the other libraries linked above, but keep in mind that these may not work either. The reason is that PDF is a very flexible format with many parameters and non-standard settings; for example, text may be stored as an image instead of as encoded characters.

As used here, PyPDF2 only reads the text from a PDF document; we won't be grabbing images or other media files from a PDF.
___

## Working with PyPDF2

Let's begin by showing the basics of the PyPDF2 library.

In [1]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
Building wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py): started
  Building wheel for PyPDF2 (setup.py): finished with status 'done'
  Created wheel for PyPDF2: filename=PyPDF2-1.26.0-cp37-none-any.whl size=61091 sha256=03d7d00b472a8f8adc2d29177e5551c1b4032007148307faaa71985ce3a64602
  Stored in directory: C:\Users\FALCON\AppData\Local\pip\Cache\wheels\53\84\19\35bc977c8bf5f0c23a8a011aa958acd4da4bbd7a229315c1b7
Successfully built PyPDF2
Installing collected packages: PyPDF2
Successfully installed PyPDF2-1.26.0


In [2]:
import PyPDF2

#### Reading PDFs
First we open a PDF, then create a reader object for it. Notice that we open the file in binary read mode, 'rb', instead of just 'r'.

In [3]:
# Open a file pointer and then use the pointer to read a PDF file
f = open('resources/US_Declaration.pdf', 'rb')

pdf_reader = PyPDF2.PdfFileReader(f)
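
The reason for `'rb'` is that a PDF is a binary format, so it has to be read as raw bytes rather than as decoded text. As a small illustration, the sketch below peeks at the first bytes of a file: a well-formed PDF normally begins with the marker `%PDF-`. The helper name `looks_like_pdf` is made up for this example and is not part of PyPDF2.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header marker."""
    # Binary mode gives us bytes, so we compare against a bytes literal.
    with open(path, 'rb') as f:
        header = f.read(5)
    return header == b'%PDF-'
```

Calling `looks_like_pdf('resources/US_Declaration.pdf')` should return `True` for a valid file. It only checks the header, so it says nothing about whether the text inside can be extracted.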

In [4]:
# To see the number of pages in the PDF document
pdf_reader.numPages

5
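
The names used in this notebook (`PdfFileReader`, `numPages`, `getPage`, `extractText`) belong to PyPDF2 1.26.0, the version installed above. PyPDF2 3.0 removed them in favor of `PdfReader`, `reader.pages`, and `extract_text()`. If you are unsure which version you have, a rough sketch like the following may help. The helper names `page_count` and `page_text` are assumptions made for this example, not PyPDF2 functions.

```python
def page_count(reader):
    """Return the number of pages for an old- or new-style reader."""
    # Newer readers expose a list-like `pages`; prefer it when present.
    pages = getattr(reader, 'pages', None)
    if pages is not None:
        return len(pages)
    # Fall back to the attribute used in this notebook.
    return reader.numPages


def page_text(reader, index):
    """Return the extracted text of the page at zero-based `index`."""
    pages = getattr(reader, 'pages', None)
    page = pages[index] if pages is not None else reader.getPage(index)
    # Use the newer method name if it exists, else the older one.
    extract = getattr(page, 'extract_text', None) or page.extractText
    return extract()
```

With the reader from the cell above, `page_count(pdf_reader)` should give the same number as `pdf_reader.numPages`.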

In [6]:
# Let's extract the text of the first page
page_one_text = pdf_reader.getPage(0)

# getPage returns a page object (not JSON), from which we can extract the text
page_one_text = page_one_text.extractText()

# View the contents
page_one_text

"Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the\npolitical bands which have connected them with another, and to assume among the powers of the\nearth, the separate and equal station to which the Laws of Nature and of Nature's God entitle\n\nthem, a decent respect to the opinions of mankind requires that they should declare the causes\n\nwhich impel them to the separation. \nWe hold these truths to be self-evident, that all men are created equal, that they are endowed by\n\ntheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving\n\ntheir just powers from the consent of the governed,ŠThat whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter or 

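The same two calls can be repeated for every page. The sketch below loops over the whole document using the `numPages`, `getPage`, and `extractText` names from the cells above; the function name `extract_all_text` is invented for this example.

```python
def extract_all_text(pdf_reader):
    """Collect the extracted text of every page into a list of strings."""
    all_text = []
    for page_number in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_number)
        all_text.append(page.extractText())
    return all_text
```

Run it while the file is still open, for example `pages = extract_all_text(pdf_reader)`. The resulting list should have one string per page, so `len(pages)` should equal `pdf_reader.numPages`.
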
In [7]:
# Now we will close the file pointer

f.close()
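
Opening the file in a `with` block is a common alternative to calling `f.close()` by hand, because the file is closed even if an error occurs. Below is a minimal sketch of that pattern, assuming the PyPDF2 1.26.0 names used in this notebook. The function name `first_page_text` and its `reader_factory` parameter are hypothetical; the parameter exists only so that the reader class can be swapped out.

```python
def first_page_text(path, reader_factory=None):
    """Open `path`, extract the text of page 0, and close the file again."""
    if reader_factory is None:
        # Default to the reader class used throughout this notebook.
        import PyPDF2
        reader_factory = PyPDF2.PdfFileReader
    with open(path, 'rb') as f:
        pdf_reader = reader_factory(f)
        # Extract while the file is still open, since the reader
        # pulls data from the file on demand.
        text = pdf_reader.getPage(0).extractText()
    # Leaving the `with` block has closed the file.
    return text
```

For example, `first_page_text('resources/US_Declaration.pdf')` should return the same string as `page_one_text` above.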

#### Adding to PDFs

We cannot write arbitrary text into a PDF with PyPDF2 because of the mismatch between Python's single string type and the variety of fonts, placements, and other parameters that a PDF can have.

What we *can* do is copy pages and append pages to the end.

In [8]:
f = open('resources/US_Declaration.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(f)

In [9]:
# We will get the first page
first_page = pdf_reader.getPage(0)

In [10]:
# Now we will create a PDF writer object and append the first page to the document
pdf_writer = PyPDF2.PdfFileWriter()

# Append the page
pdf_writer.addPage(first_page)

In [11]:
# Now we will output to some new document
pdf_output = open("SomeNewDoc.pdf", 'wb')
pdf_writer.write(pdf_output)

In [12]:
# We will close the objects
pdf_output.close()
f.close()
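
The cells above copy a single page, and the same idea extends to several pages. Here is a small sketch that appends any selection of pages to a writer, using the `getPage` and `addPage` calls shown above. The function name `copy_pages` is an assumption for this example.

```python
def copy_pages(pdf_reader, pdf_writer, page_numbers):
    """Append the given zero-based pages of `pdf_reader` to `pdf_writer`."""
    for page_number in page_numbers:
        pdf_writer.addPage(pdf_reader.getPage(page_number))
    return pdf_writer
```

For example, `copy_pages(pdf_reader, PyPDF2.PdfFileWriter(), range(pdf_reader.numPages))` would queue every page of the source document. The writer still has to be saved with `write()`, and the source file should stay open until that is done.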