# PDF: Portable Document Format

**PDF files are binary files**. In addition to text, they store a lot of other information, such as fonts, colors, and layout.

If we want to read or write PDF files, we need to do a lot more than call open() with the filename.

In [None]:
with open('pdf.pdf', 'rb') as f:
    data = f.read()

Now if we look at the data, we see raw bytes that are not human-readable. Opening the file in binary mode only gives us its contents; it does not decode the PDF structure.

In [None]:
data
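Even though the bytes are not human-readable, the very start of a PDF is recognisable: a PDF file normally begins with a `%PDF-` header followed by a version number. A minimal sketch of a check for that header follows. `looks_like_pdf` is a hypothetical helper name, not part of any library, and the sample bytes are made up for illustration.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker."""
    return data.startswith(b'%PDF-')


# Made-up sample bytes standing in for the contents of real files.
pdf_like = b'%PDF-1.4\n1 0 obj\n'
plain_text = b'Hello, world!\n'

print(looks_like_pdf(pdf_like))    # True
print(looks_like_pdf(plain_text))  # False
```

With the cell above, `looks_like_pdf(data)` would be one quick way to confirm that the file we read is at least shaped like a PDF.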

But luckily, Python has a huge ecosystem of packages. PyPDF2 is a third-party package for working with PDF files, so it has to be installed separately before it can be imported.

### PyPDF2

In [1]:
import PyPDF2

The first thing to do is open the file, Duh!
The file should be opened in binary mode.

In [None]:
pdfFileObj = open('pdf.pdf', 'rb')

Now we can pass the file object to PyPDF2's **PdfFileReader** class.

In [None]:
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

We can check the number of pages in the file by reading the **numPages** attribute.

In [None]:
pdfReader.numPages

We can get a particular page of the PDF file by calling the getPage method and passing in the page number as its argument (page numbers start at 0).

In [None]:
pageObj = pdfReader.getPage(0)
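Because getPage counts from 0 while people usually count pages from 1, an off-by-one mistake is easy to make. The sketch below shows one way to convert and range-check a human page number. `to_page_index` is a hypothetical helper written for this notebook, not a PyPDF2 function.

```python
def to_page_index(page_number: int, num_pages: int) -> int:
    """Convert a 1-based page number to the 0-based index getPage() expects."""
    if not 1 <= page_number <= num_pages:
        raise ValueError(
            'page {} is out of range (document has {} pages)'.format(page_number, num_pages)
        )
    return page_number - 1


print(to_page_index(1, 5))  # 0: the first page
print(to_page_index(5, 5))  # 4: the last page
```

It could be used as `pdfReader.getPage(to_page_index(1, pdfReader.numPages))`.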

We can now take a look at the contents of the first page of the pdf.

In [None]:
pageObj

We can get just the textual information by calling the extractText() function

In [None]:
pageObj.extractText()

Sometimes a PDF file is encrypted. We can check whether the file is encrypted by reading the **isEncrypted** attribute.

In [None]:
pdfReader.isEncrypted

Suppose we have a file which is encrypted. 

In [None]:
pdfFile2 = open('encrypted.pdf', 'rb')

In [None]:
pdfFile2 = PyPDF2.PdfFileReader(pdfFile2)

In [None]:
pdfFile2.isEncrypted

Now if we try to get a particular page of the PDF file, we get an error

In [None]:
# pdfFile2.getPage(0)

So, we need to decrypt the PDF file first.
We can do that by calling **decrypt** on the pdf file and passing in the password as an argument to that method.

In [None]:
pdfFile2.decrypt('rosebud')

Now we will be able to view and do whatever we want with the PDF.

In [None]:
page1 = pdfFile2.getPage(0)

In [None]:
page1.extractText()

In [16]:
import PyPDF2


def open_pdf(filePath):
    """Open a PDF and return (reader, file object), or None if the file cannot be opened."""
    try:
        pdf = open(filePath, 'rb')
    except OSError:
        print("Error: unable to open file.")
        return None

    pdfFile = PyPDF2.PdfFileReader(pdf)
    if pdfFile.isEncrypted:
        pass_out = 0
        # decrypt() returns 0 for a wrong password and a non-zero value on success
        while pass_out == 0:
            password = input('The PDF file is encrypted. Please enter the password: \n')
            pass_out = pdfFile.decrypt(password)
            if pass_out:
                print("successfully decrypted")
                print("PDF file read")
            else:
                print("{} is not the correct password. Enter the right password".format(password))
    else:
        print("PDF file {} read...".format(filePath))

    # Return the reader and the open file in both cases; the caller closes the file
    return pdfFile, pdf
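The password loop in open_pdf keeps asking forever if the right password is never entered. A minimal sketch of a bounded retry loop is shown below. It assumes only that the reader has a `decrypt(password)` method returning 0 on failure and a non-zero value on success, as used above. `decrypt_with_retries`, `ask_password`, `max_attempts`, and `FakeReader` are all hypothetical names introduced for illustration; `FakeReader` merely stands in for a real encrypted reader.

```python
def decrypt_with_retries(reader, ask_password, max_attempts=3):
    """Try passwords from ask_password() until decrypt() reports success.

    Assumes reader.decrypt(password) returns 0 on failure and a non-zero
    value on success. Returns True if decrypted, False otherwise.
    """
    for _ in range(max_attempts):
        if reader.decrypt(ask_password()):
            return True
    return False


class FakeReader:
    """Stand-in for an encrypted reader, used only for this illustration."""

    def __init__(self, secret):
        self._secret = secret

    def decrypt(self, password):
        return 1 if password == self._secret else 0


guesses = iter(['wrong', 'rosebud'])
print(decrypt_with_retries(FakeReader('rosebud'), lambda: next(guesses)))  # True
```

In interactive use, `ask_password` could be something like `lambda: input('Password: ')`.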

In [None]:
open_pdf('rahul.pdf')

In [None]:
pdf_data = open_pdf('pdf.pdf')

In [None]:
pdf_encrypted, encrypted_file = open_pdf('encrypted.pdf')

Now we can use all the usual reader methods on the decrypted PDF.

In [None]:
page1 = pdf_encrypted.getPage(0)

In [None]:
page1.extractText()

### Creating PDFs

PyPDF2's counterpart to PdfFileReader objects is PdfFileWriter objects, which can create new PDF files.

But PyPDF2's PDF-writing capabilities are limited to copying pages from other PDFs, rotating pages, overlaying pages, and encrypting files.
It cannot write arbitrary text to a PDF the way Python can with plaintext files.

PyPDF2 does not allow you to directly edit a PDF. Instead, we have to create a new PDF and then copy content over from an existing document.

First we need to open the file we want to copy.

In [None]:
pdf1File = open('meetingminutes.pdf', 'rb')

Next, we read the PDF file by passing the file object to PdfFileReader.

In [None]:
pdf1Reader = PyPDF2.PdfFileReader(pdf1File)

Now, we create an instance of the PdfFileWriter

In [None]:
pdfWriter = PyPDF2.PdfFileWriter()

Go through each page in the original file and read it.
After reading each page, add it to the new file by calling **addPage** on the pdfWriter instance we created, passing in the page as the argument.

In [None]:
for pageNum in range(pdf1Reader.numPages):
    pageObj = pdf1Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

Now open the file to which the data has to be written. Again, it has to be opened in binary mode.

In [None]:
pdfOutputFile = open('combinedMinutes.pdf', 'wb')

Creating a **PdfFileWriter** object creates only a value that represents a PDF document in Python. It doesn't actually create the PDF file. For that, you must call the PdfFileWriter's write() method.

In [None]:
pdfWriter.write(pdfOutputFile)

Finally, close all the files.

In [None]:
pdfOutputFile.close()

In [None]:
pdf1File.close()
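Calling close() by hand is easy to forget. A `with` statement closes the files automatically, even if an error occurs inside the block. The sketch below shows the pattern on a plain byte-for-byte copy using only the standard library. `copy_binary` and the file names are hypothetical, and the placeholder bytes are not a valid PDF. With PyPDF2, the input file should stay open until write() has finished, so the write call would go inside the same `with` block.

```python
import os
import tempfile


def copy_binary(src_path, dst_path):
    """Copy a file byte for byte; both handles are closed when the block exits."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        dst.write(src.read())
    return src.closed and dst.closed


# Demonstrate with a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'in.bin')
    dst_path = os.path.join(tmp, 'out.bin')
    with open(src_path, 'wb') as f:
        f.write(b'%PDF-1.4 placeholder bytes')
    print(copy_binary(src_path, dst_path))  # True: both files were closed
```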

The addPage() method only appends pages to the end of a PdfFileWriter object. To place a page at another position, PdfFileWriter also provides insertPage(page, index).

We can copy as many PDFs as we want into a single file.

In [None]:
pdf1 = open('meetingminutes.pdf', 'rb')
pdf2 = open('pdf.pdf', 'rb')

pdf_val_1 = PyPDF2.PdfFileReader(pdf1)
pdf_val_2 = PyPDF2.PdfFileReader(pdf2)

out_pdf = PyPDF2.PdfFileWriter()

In [None]:
for page in range(pdf_val_1.numPages):
    page_value = pdf_val_1.getPage(page)
    out_pdf.addPage(page_value)

In [None]:
for page_ in range(pdf_val_2.numPages):
    page_value_ = pdf_val_2.getPage(page_)
    out_pdf.addPage(page_value_)

In [None]:
output = open('output.pdf', 'wb')

In [None]:
out_pdf.write(output)

In [None]:
pdf1.close()
pdf2.close()
output.close()
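The two loops above do the same thing for different readers, so they can be folded into one helper. The sketch below assumes only that each reader offers `numPages` and `getPage()` and that the writer offers `addPage()`, as in the cells above. `append_all_pages` is a hypothetical name, not part of PyPDF2.

```python
def append_all_pages(readers, writer):
    """Append every page of each reader to writer, in order.

    Returns the number of pages added.
    """
    added = 0
    for reader in readers:
        for page_num in range(reader.numPages):
            writer.addPage(reader.getPage(page_num))
            added += 1
    return added
```

With the objects created above, `append_all_pages([pdf_val_1, pdf_val_2], out_pdf)` would replace both loops, and it extends to any number of input files.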

### Rotating Pages

The pages of a PDF can be rotated in increments of 90 degrees with the 
**rotateClockwise()** and **rotateCounterClockwise()** methods.
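Since only multiples of 90 degrees are allowed, the resulting orientation of a page is always one of 0, 90, 180, or 270 degrees. The arithmetic can be sketched as follows. `rotated_angle` is a hypothetical helper that only illustrates the calculation; it does not touch a PDF.

```python
def rotated_angle(current, degrees):
    """Return the page rotation after turning it clockwise by `degrees`.

    Negative values mean counter-clockwise. If `current` is a multiple of
    90, the result is one of 0, 90, 180 or 270.
    """
    if degrees % 90 != 0:
        raise ValueError('rotation must be a multiple of 90 degrees')
    return (current + degrees) % 360


print(rotated_angle(0, 90))     # 90
print(rotated_angle(0, -90))    # 270: counter-clockwise by 90
print(rotated_angle(270, 180))  # 90
```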

In [19]:
file_data, open_file = open_pdf('pdf.pdf')

PDF file pdf.pdf read...


In [20]:
page = file_data.numPages

In [21]:
page

5

Get the page you want to rotate.

In [22]:
page = file_data.getPage(0)

Rotate it by the number of degrees you want (a multiple of 90).

In [23]:
page.rotateClockwise(90)

{'/Annots': [IndirectObject(13, 0),
  IndirectObject(14, 0),
  IndirectObject(15, 0),
  IndirectObject(16, 0),
  IndirectObject(17, 0),
  IndirectObject(18, 0),
  IndirectObject(19, 0),
  IndirectObject(20, 0),
  IndirectObject(21, 0),
  IndirectObject(22, 0),
  IndirectObject(23, 0),
  IndirectObject(24, 0),
  IndirectObject(25, 0),
  IndirectObject(26, 0)],
 '/Contents': {'/Filter': '/FlateDecode'},
 '/MediaBox': [0, 0, 612, 792],
 '/Parent': {'/Count': 5,
  '/Kids': [IndirectObject(4, 0),
   IndirectObject(7, 0),
   IndirectObject(9, 0),
   IndirectObject(73, 0),
   IndirectObject(11, 0)],
  '/Rotate': 0,
  '/Type': '/Pages'},
 '/Resources': {'/ExtGState': {'/R29': {'/BG': {'/BitsPerSample': 8,
     '/Domain': [0, 1],
     '/Filter': '/FlateDecode',
     '/FunctionType': 0,
     '/Range': [0, 1],
     '/Size': [256]},
    '/Name': '/R29',
    '/OPM': 1,
    '/SM': 0.02,
    '/TR': '/Identity',
    '/Type': '/ExtGState',
    '/UCR': {'/BitsPerSample': 8,
     '/Domain': [0, 1],
     

Create a writer

In [24]:
pdfWriter = PyPDF2.PdfFileWriter()

Add the rotated page to the writer instance.

In [25]:
pdfWriter.addPage(page)

In [26]:
out_file = open('out.pdf', 'wb')

Write it out to the file

In [27]:
pdfWriter.write(out_file)

Close the open files

In [28]:
out_file.close()

In [29]:
open_file.close()

In [30]:
inp, open_f = open_pdf('pdf.pdf')

PDF file pdf.pdf read...


In [31]:
output_file = open('all_out.pdf', 'wb')

In [32]:
pdf_writer = PyPDF2.PdfFileWriter()

In [33]:
for i in range(inp.numPages):
    get_val = inp.getPage(i)
    get_val.rotateCounterClockwise(90)
    pdf_writer.addPage(get_val)

In [34]:
pdf_writer.write(output_file)

In [35]:
output_file.close()

In [36]:
open_f.close()

### Overlaying Pages

We can also overlay pages. 

This can be useful when we want to add a logo or watermark to a page.

In [37]:
input_data, open_file = open_pdf('pdf.pdf')

PDF file pdf.pdf read...


Get the page on which something has to be overlaid.

In [38]:
first_page = input_data.getPage(0)

Get the data which has to be overlaid.

In [39]:
watermarker = PyPDF2.PdfFileReader(open('watermark.pdf', 'rb'))

Call **mergePage** on the page on which something has to be overlaid.
Pass the page to overlay as the argument.

In [40]:
first_page.mergePage(watermarker.getPage(0))

Create a file writer object

In [41]:
pdfWrite = PyPDF2.PdfFileWriter()

In [42]:
pdfWrite.addPage(first_page)

In [44]:
# Start at 1: page 0 (now watermarked) has already been added above
for pageNum in range(1, input_data.numPages):
    get_page = input_data.getPage(pageNum)
    pdfWrite.addPage(get_page)

In [45]:
results = open('result.pdf', 'wb')

In [46]:
pdfWrite.write(results)

In [47]:
results.close()

In [48]:
open_file.close()
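The same idea extends from the first page to every page. The sketch below assumes objects with the `numPages`, `getPage()`, `mergePage()`, and `addPage()` members used above. `watermark_all_pages` is a hypothetical helper name, not a PyPDF2 function.

```python
def watermark_all_pages(reader, watermark_page, writer):
    """Overlay watermark_page on every page of reader and add it to writer."""
    for page_num in range(reader.numPages):
        page = reader.getPage(page_num)
        page.mergePage(watermark_page)  # modifies the page object in place
        writer.addPage(page)
    return writer
```

With the objects from this section, a call might look like `watermark_all_pages(input_data, watermarker.getPage(0), PyPDF2.PdfFileWriter())`, followed by write() on the returned writer.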