# Working With PDFs

1. - We often need to read text data from a PDF file.

   - We can use the PyPDF2 library to read in text data from a PDF file.

   - **Note: Not all PDFs contain text that can be extracted.**

   - Some PDFs are created by scanning rather than by exporting from a word processor such as Word.

     - These scanned PDFs are essentially image files, which makes extracting their text much harder.
     - This usually requires specialized software, such as an optical character recognition (OCR) tool.

   - PyPDF2 is designed to extract text from PDFs exported directly from a word processor, but even some of those PDFs do not contain extractable text.

   - If you haven't installed PyPDF2, run this command in your terminal or command prompt:

     ```
       pip install PyPDF2
     ```

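The note above about scanned PDFs can be turned into a rough check. The sketch below is only a heuristic under a stated assumption: if every page yields empty or whitespace-only text, the file is probably scanned or otherwise not extractable. The helper name `looks_scanned` is hypothetical and is not part of PyPDF2; it works on a plain list of strings, one per page.

```python
def looks_scanned(page_texts):
    """Return True if no page produced any non-whitespace text.

    Assumption: a PDF whose pages all extract to empty strings is
    likely a scan. An empty list (no pages) also returns True.
    """
    return not any(text and text.strip() for text in page_texts)


# Example with made-up extraction results, one string per page
print(looks_scanned(["", " \n "]))        # True
print(looks_scanned(["", "Lorem Ipsum"])) # False
```

A `True` result is only a hint that a different tool may be needed, not proof that the file is a scan.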



## Let's Start

In [1]:
# import library 

import PyPDF2

In [2]:
# Open the PDF in binary read mode

my_file = open('data/dummy.pdf', mode='rb')

In [3]:
# Wrap the open file in a PDF reader

pdf_reader = PyPDF2.PdfFileReader(my_file)

In [4]:
# Number of pages in the document
pdf_reader.numPages

5

In [5]:
# Get the first page (pages are zero-indexed)
page_one = pdf_reader.getPage(0)

In [7]:
# Extract the text of the first page
myText_first = page_one.extractText()

In [9]:
my_file.close()
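The names used in the cells above (`PdfFileReader`, `numPages`, `getPage`, `extractText`) belong to older PyPDF2 releases. PyPDF2 3.0 removed them in favour of `PdfReader`, `len(reader.pages)`, `reader.pages[i]`, and `page.extract_text()`. As a minimal sketch, the hypothetical helper below (not part of PyPDF2) calls whichever text-extraction method a page object offers. It is demonstrated on a stand-in page class so that it runs without a PDF file.

```python
def extract_page_text(page):
    """Return the page's text using whichever method name the page offers."""
    extract = getattr(page, "extract_text", None)  # newer snake_case name
    if extract is None:
        extract = page.extractText  # older camelCase name used in this notebook
    return extract()


# Stand-in page object (hypothetical) so the sketch runs without a PDF file
class FakeOldPage:
    def extractText(self):
        return "Lorem Ipsum"


print(extract_page_text(FakeOldPage()))  # Lorem Ipsum
```

With a real reader, the same call would look like `extract_page_text(page_one)`.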

## Adding to PDFs

We cannot simply write a Python string into a PDF with PyPDF2: a Python string is a single plain type, whereas text on a PDF page carries fonts, placement, and other parameters.

What we *can* do is copy pages and append pages to the end.

In [10]:
f = open('data/dummy.pdf', mode='rb')
pdf_reader = PyPDF2.PdfFileReader(f)

In [11]:
first_page = pdf_reader.getPage(0)

In [12]:
pdf_writer = PyPDF2.PdfFileWriter()

In [13]:
pdf_writer.addPage(first_page)

In [14]:
pdf_output = open("data/New_Doc.pdf", "wb")

In [15]:
pdf_writer.write(pdf_output)

In [16]:
pdf_output.close()
f.close()

Now we have copied a page and added it to a new document!
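The copy steps above can also be collected into one function that uses `with` blocks, so both files are closed even if an error occurs. This is a sketch that assumes the same PyPDF2 version as the cells above (the one providing `PdfFileReader` and `PdfFileWriter`). The function name `copy_first_page` is hypothetical.

```python
def copy_first_page(src_path, dst_path):
    """Copy page 1 of the PDF at src_path into a new PDF at dst_path."""
    # Imported here so the sketch can be loaded even where PyPDF2 is missing
    import PyPDF2

    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        reader = PyPDF2.PdfFileReader(src)
        writer = PyPDF2.PdfFileWriter()
        writer.addPage(reader.getPage(0))
        # Write while the source file is still open, as in the cells above
        writer.write(dst)
```

A call such as `copy_first_page('data/dummy.pdf', 'data/New_Doc.pdf')` would reproduce cells 10 to 16.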

## Simple Example

Let's try to grab all the text from this PDF file:

In [17]:
f = open('data/dummy.pdf', 'rb')

# List of every page's text.
# The index will correspond to the page number.
pdf_text = [0]  # zero is a placeholder so that page 1 sits at index 1

pdf_reader = PyPDF2.PdfFileReader(f)

for p in range(pdf_reader.numPages):
    
    page = pdf_reader.getPage(p)
    
    pdf_text.append(page.extractText())

f.close()
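The placeholder at index 0 in the cell above is one way to line up indexes with page numbers. As an alternative sketch, a dictionary keyed by 1-based page number avoids the dummy entry. The helper name `number_pages` is hypothetical; it takes any iterable of page texts and does not depend on PyPDF2.

```python
def number_pages(page_texts):
    """Map 1-based page numbers to the corresponding page text."""
    return {number: text for number, text in enumerate(page_texts, start=1)}


# Example with made-up page texts
sample = number_pages(["first page text", "second page text"])
print(sample[1])  # first page text
```

With the reader from the cell above, the texts could be supplied as `pdf_reader.getPage(p).extractText() for p in range(pdf_reader.numPages)`, as long as the call happens before `f.close()`.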

In [18]:
pdf_text

[0,
 ' \n \nLorem Ipsum \nis simply dummy text of the printing and typesetting industry. Lorem Ipsum has been \nthe industry\'s standard dummy text ever since the 1500s, when an unknown printer took a galley of \ntype and scrambled it to make a type specimen book. It has survived not only five centuries, but also \nthe leap into electronic typesetting, remaining essentially unchanged. It was popularised in the \n1960s with the release of Letraset sheets containing Lore\nm Ipsum passages, and more recently with \ndesktop publishing software like Aldus PageMaker including versions of Lorem Ipsum \n \nIt is a long established fact that a reader will be distracted by the readable content of a page when \nlooking at its layout. The po\nint of using Lorem Ipsum is that it has a more\n-\nor\n-\nless normal distribution \nof letters, as opposed to using \'Content here, content here\', making it look like readable English. \nMany desktop publishing packages and web page editors now use Lorem Ip

In [19]:
print(pdf_text[3])

the leap into electronic typesetting, remaining essent
ially unchanged. It was popularised in the 
1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with 
desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum 
 
It is a long established fact t
hat a reader will be distracted by the readable content of a page when 
looking at its layout. The point of using Lorem Ipsum is that it has a more
-
or
-
less normal distribution 
of letters, as opposed to using 'Content here, content here', making it look like
 
readable English. 
Many desktop publishing packages and web page editors now use Lorem Ipsum as their default 
model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various 
versions have evolved over the years, somet
imes by accident, sometimes on purpose (injected 
humour and the like). 
 
Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a 