# Liberating Data from PDFs

Like it or not, many institutions love to cram data into PDFs. This data can take the form of unstructured text (like memos, emails, court filings) that we may want to pull into a spreadsheet. Or it may just be tables locked into PDFs.

Today we'll learn to:

- Unlock tables
- Extract text held in PDFs
- Deal with obnoxious PDFs

## Tables scattered in PDFs

As it is, PDFs are notoriously obnoxious. They are designed so people can't change them easily.

PDFs that hold tables are pretty much the worst.

We want to <a href="https://drive.google.com/file/d/1_zzdyqwfMP6F0ukmmlMBaXTwZ36Iadba/view?usp=sharing">scrape data from these sample digital PDFs</a>.

You might have worked with the <a href="https://tabula.technology/">Tabula GUI</a> to extract tables from PDFs. But there's a lot of manual work involved. 

To automate the process, we'll use the **Tabula Python Library**.

#### There's NO satisfaction guarantee, but at least it's a way to try to tackle PDFs with tables.

Run ```!pip install tabula-py``` in a code cell.


In [None]:
pip install -q tabula-py

You may also need to run ```pip install install-jdk``` the first time you use this package, because Tabula depends on Java.

In [None]:
pip install install-jdk

In [None]:
## import needed packages and libraries

import pandas as pd ## pandas to work with data

import tabula
tabula.environment_info() ## not always needed ## prints the Python, Java and tabula-py versions


Still having problems?

Are you getting a ```JDK error``` within your notebook?

Try this: I installed the latest version of Oracle’s Java JDK. For those with an M1 machine, the appropriate one is the ARM file under the macOS tab; for those with Intel, it’s the x64.

https://www.oracle.com/java/technologies/downloads/#jdk17-mac
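
Before reinstalling anything, it can help to check whether Python can see a working Java at all. The sketch below uses only the standard library; the function name `java_version` is made up for this example, and the check is a rough one rather than a full diagnosis.

```python
import shutil
import subprocess


def java_version():
    """Return Java's version banner, or None when no working Java is found."""
    java_path = shutil.which("java")  # look for a `java` executable on PATH
    if java_path is None:
        return None
    # `java -version` normally writes its banner to stderr, not stdout
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    if result.returncode != 0:
        # e.g. the macOS placeholder `java` that only reports that no runtime is installed
        return None
    banner = (result.stderr or result.stdout).strip()
    return banner or None


print(java_version() or "No working Java found; tabula-py needs Java to run.")
```

If this prints a version banner but the notebook still shows a JDK error, restarting the notebook kernel after the install is worth a try.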

In [None]:
# Suppress the specific FutureWarning
import warnings


warnings.filterwarnings("ignore", category=FutureWarning, message=".*errors='ignore' is deprecated.*")


In [None]:
## Let's pull in our first pdf with a single page, single table
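
The next few cells can be sketched roughly as follows. `tabula.read_pdf` returns a Python list of DataFrames, one per table it finds, so the first table is the item at position 0. The helper names and the file name in the usage comments are placeholders, not part of the course files.

```python
def read_tables(pdf_path, pages=1):
    """Return every table tabula-py finds on the requested pages, as a list of DataFrames."""
    import tabula  # imported here so the helper can be defined before Java is set up

    # pages can be an int, a list of ints, a string such as "1-3", or "all"
    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)


def pick_table(tables, position=0):
    """Return one table from the list, with a clearer error when nothing was found."""
    if not tables:
        raise ValueError("no tables found; check the page number or the PDF itself")
    return tables[position]


# Usage (the file name is a placeholder):
# tables = read_tables("my_report.pdf", pages=1)
# print(type(tables), len(tables))   # a list, and how many tables it holds
# df1 = pick_table(tables)           # the first table as a DataFrame
```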


In [None]:
## WHAT TYPE OF DATA?


In [None]:
## what does this list hold?


In [None]:
## let's get the first table


In [None]:
## store into df


In [None]:
## Export and download as CSV file
df1.to_csv("table1.csv", encoding = "UTF-8", index = False)

### Multiple pages/ Multiple tables
We can also target multiple pages and multiple tables in a PDF.

In [None]:
## pdf2


In [None]:
## table extraction


In [None]:
## let's get the second table


## Foundational Multi-page, Multi-table

### Campaign contribution demo

In [None]:
## path to our "campaign_contribs.pdf" PDF
# pdf3 = "pdf_samples/pa-oct-1-contribs.pdf"


In [None]:
## get all the pages


In [None]:
## confirm we have the correct number of tables. should have 601 tables


In [None]:
## check out a couple of tables


In [None]:
## combine all the tables into one df
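
One way to do the count check and the combining step is shown below. It is a sketch that assumes every table shares the same columns; the two small DataFrames are stand-in data so the cell runs without the PDF, and their column names are invented.

```python
import pandas as pd


def combine_tables(tables):
    """Stack a list of DataFrames into one, skipping any empty ones."""
    usable = [table for table in tables if not table.empty]
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 2 per table
    return pd.concat(usable, ignore_index=True)


# Stand-in for the list that tabula.read_pdf would return
demo_tables = [
    pd.DataFrame({"donor": ["A", "B"], "amount": [100, 250]}),
    pd.DataFrame({"donor": ["C"], "amount": [75]}),
]

print(len(demo_tables))   # how many tables we are about to combine
combined = combine_tables(demo_tables)
print(combined.shape)     # (3, 2)
```

With the real list, `len(tables)` is the number to compare against the expected table count before combining.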


### Reality Check

In [None]:
## import who_covid.pdf


In [None]:
## call the first table on page 3


#### Compare to actual PDF table.
What is happening?

# No Satisfaction Guarantee

What did I mean by that?

The results really depend on the PDF and how it was put together.

Here are some issues you will encounter:

1. The tables have too many sub-columns, sub-rows and groupings (bad_table.pdf).

2. Multiple different tables on the same page that are too close together will be processed as a single table and be an utter mess.

3. Documents and reports that have been scanned are really images wrapped in a PDF, and they can't be processed with Tabula or PyPDF2. Tables on these types of scans require advanced Python and graphical analysis skills beyond the scope of this course.
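
For the second issue, tabula-py has a few options that sometimes help: `lattice` for tables drawn with ruling lines, `stream` for tables separated only by whitespace, and `area` to restrict extraction to one region of the page. The sketch below only shows how those options fit together; the helper names are invented, the coordinates are something you would measure for your own PDF, and none of this is guaranteed to rescue a messy table.

```python
def table_options(ruled_lines=False, area=None):
    """Build keyword arguments for tabula.read_pdf.

    ruled_lines: True when cells are separated by drawn lines (lattice mode),
                 False when columns are separated only by whitespace (stream mode).
    area: optional (top, left, bottom, right) in PDF points, to read one region only.
    """
    options = {"lattice": ruled_lines, "stream": not ruled_lines}
    if area is not None:
        options["area"] = list(area)
        options["guess"] = False  # we supply the region, so tabula should not guess one
    return options


def read_region(pdf_path, page, ruled_lines=False, area=None):
    """Read tables from one page using the options built above."""
    import tabula  # imported here so the helper can be defined before Java is set up

    return tabula.read_pdf(pdf_path, pages=page, **table_options(ruled_lines, area))


print(table_options(ruled_lines=True))
print(table_options(area=(50, 0, 300, 600)))
```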

## Extracting Text from PDFs

In many cases, we just need the text from one or more PDFs so we can convert it to structured data or run natural language analysis on it.

It will depend on the type of PDFs we are dealing with. Some PDFs are good, others are just okay and some are just **very, very bad**.

This folder contains PDFs that come in many different flavors. <a href="https://drive.google.com/file/d/1flBD4b2Dz6_6EfC1VuU-6Uv2FAYtKbia/view?usp=sharing">Download it</a> and place it in the same directory as your notebook.

Here are several strategies:

### Good PDFs

Well-behaved PDFs are those where the digital text can easily be copied and pasted. We just don't want to copy and paste for hundreds of files.

We'll use pymupdf4llm, one of the more modern packages for reading PDFs. It is designed to feed PDF text into Large Language Models.



In [None]:
pip install pymupdf4llm

In [None]:
import pymupdf4llm

#### ```to_markdown()```

```pymupdf4llm``` has a powerful ```to_markdown()``` function.

Provide a path to your PDF and it returns the text as a Markdown string that you can store in a variable.

In [None]:
# Extract a simple PDF content as Markdown


#### You don't need to read an entire PDF. You can just specify a page, or a range of pages.
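
`to_markdown()` accepts a `pages` argument, a list of page numbers that start at 0 rather than 1. The sketch below converts the page numbers you see in a PDF viewer into that form; the helper names are made up for this example, and the file name in the usage comment is a placeholder.

```python
def zero_based_pages(first, last=None, extras=()):
    """Turn 1-based page numbers (as shown in a PDF viewer) into a sorted 0-based list."""
    last = first if last is None else last
    wanted = set(range(first, last + 1)) | set(extras)
    return sorted(page - 1 for page in wanted)


def pdf_pages_to_markdown(pdf_path, pages=None):
    """Return the Markdown text of the chosen pages (all pages when pages is None)."""
    import pymupdf4llm  # imported here so the page helper works without the package

    return pymupdf4llm.to_markdown(pdf_path, pages=pages)


print(zero_based_pages(3))                     # [2]
print(zero_based_pages(2, 4))                  # [1, 2, 3]
print(zero_based_pages(2, 4, extras=[9, 7]))   # [1, 2, 3, 6, 8]

# Usage (the file name is a placeholder):
# md_text = pdf_pages_to_markdown("my_report.pdf", pages=zero_based_pages(2, 4))
```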

In [None]:
## extract single page


## Extract a range of pages


In [None]:
## create a range of pages + some


In [None]:

## extract range page




#### We can write it to an ```.md``` file in case we want to hold on to it.
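
A minimal sketch of extracting and saving in one step is below. The function name and its `converter` parameter are inventions for this example: by default it calls `pymupdf4llm.to_markdown`, and the parameter only exists so the function can be tried out with a stand-in when no PDF is at hand.

```python
from pathlib import Path


def pdf_to_md_file(pdf_path, md_path, pages=None, converter=None):
    """Extract a PDF as Markdown and write it to md_path; returns the output Path."""
    if converter is None:
        import pymupdf4llm  # only needed when no stand-in converter is supplied

        converter = pymupdf4llm.to_markdown
    md_text = converter(pdf_path, pages=pages)
    out_path = Path(md_path)
    out_path.write_text(md_text, encoding="utf-8")  # UTF-8 keeps accented characters intact
    return out_path


# Usage (file names are placeholders):
# pdf_to_md_file("my_report.pdf", "my_report.md")
# pdf_to_md_file("my_report.pdf", "pages_2_to_4.md", pages=[1, 2, 3])
```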

In [None]:
# Specify the output Markdown file


In [None]:
## Let's turn into a function

     
         

In [None]:
## export range to md


### Now the problematic kids...

In [None]:
## another with image and text 
## This one doesn't quite work


In [None]:
# on file with image and text


In [None]:
## even more problems


In [None]:
## how about as an md file?


### Obnoxious PDF

In [None]:
## read "columbus_bank_trust.pdf" to a text document
## read and store document in an object


# Strategy to Vanquish Obnoxious PDFs




### The problem:
*   PDFs all have different encodings: UTF-8, ASCII, Unicode, etc.
*   As a result, data can be lost during the conversion.

### The solution:
*   Convert the PDF to an image
*   Use optical character recognition (OCR) to capture the text
*   Export to a text file
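
The three steps in the solution can be sketched with two widely used packages, `pdf2image` and `pytesseract`. This is not how mangoCR is called; it is only an illustration of the same idea, and it assumes Poppler and Tesseract are installed on your machine. The function names and the file names in the usage comment are made up for the example.

```python
from pathlib import Path


def join_pages(page_texts):
    """Label each page's OCR text so page boundaries survive in the output file."""
    return "\n\n".join(
        f"--- page {number} ---\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    )


def ocr_pdf(pdf_path, txt_path, dpi=300):
    """Convert each PDF page to an image, OCR it, and write all the text to txt_path."""
    from pdf2image import convert_from_path  # step 1: PDF pages -> images (needs Poppler)
    import pytesseract                       # step 2: image -> text (needs Tesseract)

    images = convert_from_path(pdf_path, dpi=dpi)
    page_texts = [pytesseract.image_to_string(image) for image in images]
    Path(txt_path).write_text(join_pages(page_texts), encoding="utf-8")  # step 3: export
    return txt_path


# Usage (file names are placeholders):
# ocr_pdf("scanned_memo.pdf", "scanned_memo.txt")
```

A higher `dpi` usually gives the OCR engine more to work with, at the cost of slower processing.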



### mangoCR to the rescue.
This <a href="https://pypi.org/project/mangoCR/">package overcomes</a> many of the problems above.



In [None]:
## import library


In [None]:
## bank pdf


In [None]:
## nixon pdf


## What about a list of PDFs?
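
`glob` builds a list of file paths that match a pattern, which is how we can collect every PDF in a folder before handing the list to an OCR step. This is a small sketch; the function name and the folder name in the usage comment are placeholders.

```python
import glob
import os


def list_pdfs(folder):
    """Return the paths of all .pdf files directly inside folder, in sorted order."""
    pattern = os.path.join(folder, "*.pdf")
    return sorted(glob.glob(pattern))  # sorted so the order is the same on every run


# Usage (the folder name is a placeholder):
# pdf_files = list_pdfs("police_memos")
# print(len(pdf_files), pdf_files[:3])
```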

In [None]:
## import glob


In [None]:
## police cases memos


In [None]:
## mangoCR it


## Now you can tackle nearly any PDF you encounter in your investigations!