# Working with PDF files

In [1]:
import PyPDF2

In [4]:
pdfFileObj = open('data/meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages

19

To extract text from a page, first get a Page object, which represents a single page of the PDF, from the `PdfFileReader` object. Call the reader's `getPage()` method and pass it the number of the page you want. Page numbers are zero-based, so 0 is the first page.

In [5]:
pageObj = pdfReader.getPage(0)
pageObj.extractText()

u'OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS   Meeting of March 7, 2014        \n     The Board of Elementary and Secondary Education shall provide leadership and create policies for education that expand opportunities for children, empower families and communities, and advance Louisiana in an increasingly competitive global market. BOARD  of ELEMENTARY and  SECONDARY EDUCATION  '

In [23]:
for numpage in range(pdfReader.numPages):
    print(pdfReader.getPage(numpage).extractText())

OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS   Meeting of March 7, 2014        
     The Board of Elementary and Secondary Education shall provide leadership and create policies for education that expand opportunities for children, empower families and communities, and advance Louisiana in an increasingly competitive global market. BOARD  of ELEMENTARY and  SECONDARY EDUCATION  
 LOUISIANA STATE BOARD OF ELEMENTARY AND SECONDARY EDUCATION   MARCH 7, 2014  
 The Louisiana Purchase Room  Baton Rouge, LA   
 
 
The Louisiana State Board of Elementary and Secondary Education met in regular session on March 7, 2014, in the Louisiana Purchase Room, located in the Claiborne Building in Baton Rouge, Louisiana.  The meeting was called to order at 9:17 a.m. by Board President Chas Roemer and opened with a prayer by Ms. Terry Johnson, Bossier Parish School System.  
Board members present were Dr. Lottie Beebe, Ms. Holly Boffy, Mr. Jim Garvey, Mr. Jay Guillot, Ms. Carolyn Hill, Mr. Walter Lee, Dr.
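
The loop above prints each page as it goes. If the text is needed for later processing, one option is to collect it into a single string instead. The following is a minimal sketch and not part of the original notebook: `extract_all_text` is a hypothetical helper, and it assumes the reader passed in exposes the same `numPages`, `getPage()`, and `extractText()` members used above.

```python
def extract_all_text(reader, separator=u'\n'):
    """Return the text of every page in `reader`, joined by `separator`.

    `reader` is assumed to behave like the PdfFileReader objects used
    above: it has a `numPages` attribute and a `getPage(n)` method whose
    result has an `extractText()` method.
    """
    pages = []
    for page_number in range(reader.numPages):  # page numbers start at 0
        pages.append(reader.getPage(page_number).extractText())
    return separator.join(pages)

# Hypothetical usage with the file from this section:
# with open('data/meetingminutes.pdf', 'rb') as pdf_file:
#     text = extract_all_text(PyPDF2.PdfFileReader(pdf_file))
```

Opening the file in a `with` block, as in the commented usage, closes it automatically once the text has been read.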

## Decrypting PDFs
The file in this example is protected with the password `'rosebud'`.

In [11]:
import PyPDF2
pdfReader = PyPDF2.PdfFileReader(open('data/encrypted.pdf', 'rb'))
pdfReader.isEncrypted

True

In [12]:
# This will not work: pdfReader.getPage(0)
# We need to decrypt first
pdfReader.decrypt('rosebud')

1

In [13]:
pageObj = pdfReader.getPage(0)

Note that the `decrypt()` method decrypts only the `PdfFileReader` object, not the actual PDF file. After your program terminates, the file on your hard drive remains encrypted.
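
The `1` returned by `decrypt()` above indicates success. PyPDF2 documents a return value of `0` when the password does not match, so a wrong password is easy to miss until a later `getPage()` call fails. The sketch below shows one way to make that failure explicit. `decrypt_or_fail` is a hypothetical helper, and it assumes only the `isEncrypted` attribute and `decrypt()` method already used in this section.

```python
def decrypt_or_fail(reader, password):
    """Decrypt `reader` in memory, raising ValueError on a wrong password.

    Assumes `reader` exposes `isEncrypted` and `decrypt()` as the
    PdfFileReader above does, and that `decrypt()` returns 0 on failure.
    Only the in-memory reader is decrypted; the file on disk is unchanged.
    """
    if not reader.isEncrypted:
        return reader  # nothing to decrypt
    if reader.decrypt(password) == 0:
        raise ValueError('wrong password; the reader is still encrypted')
    return reader

# Hypothetical usage with the reader from this section:
# pageObj = decrypt_or_fail(pdfReader, 'rosebud').getPage(0)
```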

## Copying Pages

In [14]:
import PyPDF2
pdf1File = open('data/meetingminutes.pdf', 'rb')
pdf2File = open('data/meetingminutes2.pdf', 'rb')
pdf1Reader = PyPDF2.PdfFileReader(pdf1File)
pdf2Reader = PyPDF2.PdfFileReader(pdf2File)
pdfWriter = PyPDF2.PdfFileWriter()

In [15]:
for pageNum in range(pdf1Reader.numPages):
    pageObj = pdf1Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

In [16]:
for pageNum in range(pdf2Reader.numPages):
    pageObj = pdf2Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

In [17]:
pdfOutputFile = open('data/combinedminutes.pdf', 'wb')
pdfWriter.write(pdfOutputFile)
pdfOutputFile.close()
pdf1File.close()
pdf2File.close()
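
The two copy loops above are identical apart from the reader they use, so they can be folded into one function. This is a sketch rather than the notebook's own code: `append_all_pages` is a hypothetical name, and it relies only on the `numPages`, `getPage()`, and `addPage()` members shown above.

```python
def append_all_pages(writer, readers):
    """Add every page of each reader to `writer`, in the order given.

    Returns the number of pages added. `writer` is assumed to offer
    `addPage()` and each reader `numPages` and `getPage()`, as above.
    """
    added = 0
    for reader in readers:
        for page_number in range(reader.numPages):
            writer.addPage(reader.getPage(page_number))
            added += 1
    return added

# Hypothetical usage with the objects from this section:
# append_all_pages(pdfWriter, [pdf1Reader, pdf2Reader])
```

As in the cells above, keep the input files open until `pdfWriter.write()` has finished, and close them afterwards.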

## Rotating Pages

In [18]:
import PyPDF2
minutesFile = open('data/meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(minutesFile)
page = pdfReader.getPage(0)
page.rotateClockwise(90)

{'/Contents': [IndirectObject(961, 0),
  IndirectObject(962, 0),
  IndirectObject(963, 0),
  IndirectObject(964, 0),
  IndirectObject(965, 0),
  IndirectObject(966, 0),
  IndirectObject(967, 0),
  IndirectObject(968, 0)],
 '/CropBox': [0, 0, 612, 792],
 '/MediaBox': [0, 0, 612, 792],
 '/Parent': {'/Count': 9,
  '/Kids': [IndirectObject(959, 0),
   IndirectObject(1, 0),
   IndirectObject(11, 0),
   IndirectObject(13, 0),
   IndirectObject(15, 0),
   IndirectObject(17, 0),
   IndirectObject(19, 0),
   IndirectObject(24, 0),
   IndirectObject(26, 0)],
  '/Parent': {'/Count': 19,
   '/Kids': [IndirectObject(953, 0),
    IndirectObject(954, 0),
    IndirectObject(955, 0)],
   '/Type': '/Pages'},
  '/Type': '/Pages'},
 '/Resources': {'/ColorSpace': {'/CS0': ['/ICCBased', IndirectObject(969, 0)],
   '/CS1': ['/ICCBased', IndirectObject(970, 0)],
   '/CS2': ['/ICCBased', IndirectObject(970, 0)]},
  '/ExtGState': {'/GS0': {'/AIS': <PyPDF2.generic.BooleanObject at 0x104321510>,
    '/BM': '/Norm

In [19]:
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(page)
resultPdfFile = open('data/rotatedPage.pdf', 'wb')
pdfWriter.write(resultPdfFile)
resultPdfFile.close()
minutesFile.close()
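
The long dictionary echoed above is the rotated page object itself, which `rotateClockwise()` returns. To rotate a whole document rather than one page, the same call can be applied in a loop. The sketch below is an illustration only: `add_rotated_pages` is a hypothetical helper, and the check on `angle` reflects the assumption that rotations are meant to be multiples of 90 degrees.

```python
def add_rotated_pages(reader, writer, angle=90):
    """Rotate every page of `reader` clockwise and add it to `writer`.

    Returns the number of pages added. Assumes the `numPages`,
    `getPage()`, `rotateClockwise()`, and `addPage()` members used above.
    """
    if angle % 90 != 0:
        raise ValueError('angle must be a multiple of 90 degrees')
    for page_number in range(reader.numPages):
        page = reader.getPage(page_number)
        page.rotateClockwise(angle)  # changes the page object in place
        writer.addPage(page)
    return reader.numPages

# Hypothetical usage with the objects from this section:
# add_rotated_pages(pdfReader, pdfWriter, 90)
```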

## Overlaying Pages
`PyPDF2` can also overlay the contents of one page over another, which is useful for adding a logo, timestamp, or watermark to a page. With Python, it’s easy to add watermarks to multiple files and only to pages your program specifies.

In [20]:
import PyPDF2
minutesFile = open('data/meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(minutesFile)
minutesFirstPage = pdfReader.getPage(0)
pdfWatermarkReader = PyPDF2.PdfFileReader(open('data/watermark.pdf', 'rb'))
minutesFirstPage.mergePage(pdfWatermarkReader.getPage(0))
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(minutesFirstPage)

In [21]:
for pageNum in range(1, pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    pdfWriter.addPage(pageObj)
resultPdfFile = open('data/watermarkedCover.pdf', 'wb')
pdfWriter.write(resultPdfFile)
minutesFile.close()
resultPdfFile.close()
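
The cells above watermark only the first page. To watermark an arbitrary set of pages, the merge step can be made conditional on the page number. This is a minimal sketch under stated assumptions: `add_pages_with_watermark` and its `pages_to_mark` parameter are hypothetical names, and the function uses only the `getPage()`, `mergePage()`, and `addPage()` calls shown above.

```python
def add_pages_with_watermark(reader, watermark_page, writer, pages_to_mark):
    """Copy all pages of `reader` to `writer`, watermarking selected ones.

    `pages_to_mark` is an iterable of zero-based page numbers. Returns
    the number of pages that received the watermark.
    """
    selected = set(pages_to_mark)
    marked = 0
    for page_number in range(reader.numPages):
        page = reader.getPage(page_number)
        if page_number in selected:
            page.mergePage(watermark_page)  # overlays onto `page` in place
            marked += 1
        writer.addPage(page)
    return marked

# Hypothetical usage with the objects from this section:
# add_pages_with_watermark(pdfReader, pdfWatermarkReader.getPage(0),
#                          pdfWriter, [0, 2])
```

Page numbers outside the document are simply ignored, which keeps the helper usable across files of different lengths.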

## Encrypting

In [22]:
import PyPDF2
pdfFile = open('data/meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFile)
pdfWriter = PyPDF2.PdfFileWriter()
for pageNum in range(pdfReader.numPages):
    pdfWriter.addPage(pdfReader.getPage(pageNum))

pdfWriter.encrypt('swordfish')
resultPdf = open('data/encryptedminutes.pdf', 'wb')
pdfWriter.write(resultPdf)
resultPdf.close()