# Working with PDF Files

We often have to deal with PDF files. There are [many libraries in Python for working with PDFs](https://reachtim.com/articles/PDF-Manipulation.html), each with its pros and cons; the most common one is `PyPDF2`. The install command is given below. Note that Python's `import` is case-sensitive, so make sure your capitalisation matches `PyPDF2` exactly.

<strong>Note:</strong> Make sure you are in the AI 2 virtual environment before you execute this command; otherwise you will have problems when importing the `PyPDF2` library.

If you are using a local environment you will need to run the following command in the prompt:

    pip install PyPDF2
    
Keep in mind that not every PDF file can be read with this library. PDFs that are too blurry, use a special encoding, are encrypted, or were created with a program that doesn't work well with PyPDF2 may not be readable.

If you find yourself in this situation, try the libraries linked above, but keep in mind that these may not work either. The PDF format has many parameters and its settings are often non-standard. For example, text may be stored as an image rather than as encoded characters, in which case there is no text to extract.

As far as PyPDF2 is concerned, it can only read the text from a PDF document; it won't be able to grab images or other media files from a PDF.
___

## Working with PyPDF2

Let's begin by looking at the basics of the PyPDF2 library.

In [1]:
!pip install PyPDF2
# Import the PyPDF2 library
# Be careful of spelling and capitalisation
import PyPDF2
from google.colab import drive
drive.mount('/content/gdrive')

Collecting PyPDF2
  Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
Building wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py) ... done
  Created wheel for PyPDF2: filename=PyPDF2-1.26.0-cp37-none-any.whl size=61085 sha256=94e5328715d8260c0d88508b2b92a8af37c2179fb7d14d542a743b333dc4a113
  Stored in directory: /ro

I've copied the document <strong>A Midsummer Night's Dream</strong> into the working directory of the Jupyter notebook you are currently working on.

## Reading a pdf file

First we open a PDF, then create a reader object for it. Notice that we open the file in binary read mode, `rb`, instead of the default text mode, `r`.

In [2]:
# mode="rb" opens the file in binary read mode, because a
# pdf file is a binary file and not a plain text file.
my_pdf_file = open("/content/gdrive/My Drive/NLP/A_Midsummer_Night.pdf", mode="rb")
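
To see why binary mode matters, it helps to look at the raw bytes. The following is a minimal sketch that uses a tiny stand-in file rather than the real play; the file name `sample.pdf` and the helper `looks_like_pdf` are made up for illustration. PDF files begin with the bytes `%PDF-`, and `rb` mode returns those bytes unchanged, whereas text mode tries to decode the whole file as characters and can fail.

```python
import os
import tempfile

def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    # "rb" returns raw bytes; no text decoding is attempted
    with open(path, mode="rb") as f:
        header = f.read(5)
    return header == b"%PDF-"

# Build a tiny stand-in file (hypothetical, not a complete PDF)
sample_path = os.path.join(tempfile.mkdtemp(), "sample.pdf")
with open(sample_path, mode="wb") as f:
    f.write(b"%PDF-1.4\n\x93\x8c\x8b\x9e binary payload")

is_pdf = looks_like_pdf(sample_path)

# Text mode tries to decode the bytes and fails on the binary payload
try:
    with open(sample_path, mode="r", encoding="utf-8") as f:
        f.read()
    text_mode_ok = True
except UnicodeDecodeError:
    text_mode_ok = False

print(is_pdf, text_mode_ok)
```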

Then we initialise a <strong> PDF reader </strong> object. 

In [3]:
# Initialise a pdf reader object
pdf_reader = PyPDF2.PdfFileReader(my_pdf_file)
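
The names used in this notebook match PyPDF2 1.26.0, the version installed above. In PyPDF2 3.0 and later the old camel-case names were removed: `PdfFileReader` became `PdfReader`, `numPages` became `len(reader.pages)`, `getPage(i)` became `reader.pages[i]`, and `extractText()` became `extract_text()`. Here is a hedged sketch of the newer spelling. It uses a blank two-page PDF built in memory as a stand-in for the play, which is an assumption made purely for illustration. The import guard lets the cell run even where only the older release is installed.

```python
import io

page_count = None  # stays None if the newer names are not available

try:
    from PyPDF2 import PdfReader, PdfWriter  # newer class names
except ImportError:
    print("PdfReader not available; this environment has the older API")
else:
    # Build a blank two-page PDF in memory (stand-in document)
    writer = PdfWriter()
    writer.add_blank_page(width=200, height=200)
    writer.add_blank_page(width=200, height=200)
    buffer = io.BytesIO()
    writer.write(buffer)
    buffer.seek(0)

    reader = PdfReader(buffer)
    page_count = len(reader.pages)   # replaces reader.numPages
    first_page = reader.pages[0]     # replaces reader.getPage(0)
    # first_page.extract_text() replaces first_page.extractText()
    print(page_count)
```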



Now we can perform various tasks on the pdf file we've read into the PDF reader object.

In [4]:
pdf_reader.numPages

65

We can now read in text from specific pages. In this example I'm going to read in the first page of the pdf document.

In [5]:
# Indexing of pages in the pdf document starts at 0
first_pdf_page = pdf_reader.getPage(0)

In [6]:
# Extract the text from the first page
first_pdf_page.extractText()

'  \n A Midsummer Night™s Dream  A Play By \n William Shakespeare  '

I can also do this for any page I'd like to view. Now I'm going to look at the second page.

In [7]:
second_pdf_page = pdf_reader.getPage(1)
# Extract the text from the second page
second_pdf_page.extractText()

"ACT I SCENE I. Athens. The palace of THESEUS. \nEnter THESEUS, HIPPOLYTA, PHILOSTRATE, and Attendants\n  THESEUS\n  Now, fair Hippolyta, our nuptial hour \nDraws on apace; four happy days bring in \nAnother moon: but, O, methinks, how slow \n\nThis old moon wanes! she lingers my desires, \n\nLike to a step-dame or a dowager \n\nLong withering out a young man revenue. \nHIPPOLYTA\n  Four days will quickly steep themselves in night; \nFour nights will quickly dream away the time; \n\nAnd then the moon, like to a silver bow \nNew-bent in heaven, shall behold the night \nOf our solemnities. \n\nTHESEUS\n  Go, Philostrate, \nStir up the Athenian youth to merriments; \nAwake the pert and nimble spirit of mirth; \n\nTurn melancholy forth to funerals; \nThe pale companion is not for our pomp. \nExit PHILOSTRATE\n Hippolyta, I woo'd thee with my sword, \nAnd won thy love, doing thee injuries; \nBut I will wed thee in another key, \nWith pomp, with triumph and with revelling. \nEnter EGEUS, HER

We can see that the text also includes the `\n` newline markers. If we want to see the text with the line breaks rendered instead of shown as `\n`, we can wrap the previous command in the `print` function.

In [8]:
print(second_pdf_page.extractText())

ACT I SCENE I. Athens. The palace of THESEUS. 
Enter THESEUS, HIPPOLYTA, PHILOSTRATE, and Attendants
  THESEUS
  Now, fair Hippolyta, our nuptial hour 
Draws on apace; four happy days bring in 
Another moon: but, O, methinks, how slow 

This old moon wanes! she lingers my desires, 

Like to a step-dame or a dowager 

Long withering out a young man revenue. 
HIPPOLYTA
  Four days will quickly steep themselves in night; 
Four nights will quickly dream away the time; 

And then the moon, like to a silver bow 
New-bent in heaven, shall behold the night 
Of our solemnities. 

THESEUS
  Go, Philostrate, 
Stir up the Athenian youth to merriments; 
Awake the pert and nimble spirit of mirth; 

Turn melancholy forth to funerals; 
The pale companion is not for our pomp. 
Exit PHILOSTRATE
 Hippolyta, I woo'd thee with my sword, 
And won thy love, doing thee injuries; 
But I will wed thee in another key, 
With pomp, with triumph and with revelling. 
Enter EGEUS, HERMIA, LYSANDER, and DEMETRIUS
 EGE
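
Note that `print` only changes how the newlines are displayed; the string itself still contains them. For later processing it is often handier to collapse the line breaks and repeated spaces into single spaces. Here is a minimal sketch on a short excerpt of the output above; the helper name `normalise_whitespace` is made up for illustration.

```python
def normalise_whitespace(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(text.split())

# Short excerpt of the extracted page text
raw_text = (
    "ACT I SCENE I. Athens. The palace of THESEUS. \n"
    "Enter THESEUS, HIPPOLYTA, PHILOSTRATE, and Attendants\n"
    "  THESEUS\n"
    "  Now, fair Hippolyta, our nuptial hour \n"
    "Draws on apace;"
)

clean_text = normalise_whitespace(raw_text)
print(clean_text)
```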

We can also store the extracted text in a string variable.

In [9]:
my_pdf_text = second_pdf_page.extractText()

Finally we must close the pdf file.

In [10]:
my_pdf_file.close()
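
Forgetting to call `close()` is easy, so a common alternative is a `with` block, which closes the file automatically when the block ends, even if an error occurs inside it. Here is a small sketch with a temporary stand-in file (the file name is hypothetical). The same pattern should apply to the PDF path used above, with the reader calls placed inside the block.

```python
import os
import tempfile

# Create a small stand-in file so the example is self-contained
stand_in_path = os.path.join(tempfile.mkdtemp(), "stand_in.bin")
with open(stand_in_path, mode="wb") as f:
    f.write(b"some bytes")

with open(stand_in_path, mode="rb") as my_file:
    data = my_file.read()
    closed_inside = my_file.closed   # still open inside the block

closed_after = my_file.closed        # closed automatically on exit
print(closed_inside, closed_after)
```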

## Copying all pages into a string

So far we've looked at reading one page at a time. What if we want a copy of all the text in the PDF document? We can do this quite easily with a `for` loop. If you want to read more about `for` loops, read through the <strong>Loops</strong> notes on Blackboard.
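
The general pattern is to collect one string per page in a list and then join the list into a single string at the end. Here is a minimal sketch on made-up stand-in strings; the `sample_pages` list is hypothetical sample data, not text read from the PDF.

```python
# Stand-in for the text returned by each page (hypothetical sample data)
sample_pages = [
    "A Midsummer Night's Dream\n",
    "ACT I SCENE I. Athens.\n",
    "Enter THESEUS, HIPPOLYTA\n",
]

# Collect the text of each page in a list, one entry per page
page_texts = []
for page_number in range(len(sample_pages)):
    page_texts.append(sample_pages[page_number])

# Join the per-page strings into one string
full_text = "".join(page_texts)
print(len(page_texts), "pages,", len(full_text), "characters")
```

Starting from an empty list matters here: `"".join(...)` only accepts strings, so a placeholder such as `0` in the list would raise a `TypeError`.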

In [11]:
# Open the pdf file for extraction
pdf_file = open("/content/gdrive/My Drive/NLP/A_Midsummer_Night.pdf", mode="rb")

# Define and initialise an empty list to hold the text of each page
all_text = []

# Initialise a pdf reader object
pdf_document_reader = PyPDF2.PdfFileReader(pdf_file)

# Use a for loop to iterate through each page
# and append the text of each page to the list
for page_counter in range(pdf_document_reader.numPages):
    current_page = pdf_document_reader.getPage(page_counter)
    all_text.append(current_page.extractText())

# Finally close the pdf file
pdf_file.close()



In [12]:
# Show the contents of all_text
all_text

['  \n A Midsummer Night™s Dream  A Play By \n William Shakespeare  ',
 "ACT I SCENE I. Athens. The palace of THESEUS. \nEnter THESEUS, HIPPOLYTA, PHILOSTRATE, and Attendants\n  THESEUS\n  Now, fair Hippolyta, our nuptial hour \nDraws on apace; four happy days bring in \nAnother moon: but, O, methinks, how slow \n\nThis old moon wanes! she lingers my desires, \n\nLike to a step-dame or a dowager \n\nLong withering out a young man revenue. \nHIPPOLYTA\n  Four days will quickly steep themselves in night; \nFour nights will quickly dream away the time; \n\nAnd then the moon, like to a silver bow \nNew-bent in heaven, shall behold the night \nOf our solemnities. \n\nTHESEUS\n  Go, Philostrate, \nStir up the Athenian youth to merriments; \nAwake the pert and nimble spirit of mirth; \n\nTurn melancholy forth to funerals; \nThe pale companion is not for our pomp. \nExit PHILOSTRATE\n Hippolyta, I woo'd thee with my sword, \nAnd won thy love, doing thee injuries; \nBut I will wed thee in a