# Liberating Data from PDFs

Like it or not, many institutions love to cram data into PDFs. This data can take the form of unstructured text (like memos, emails, court filings) that we may want to pull into a spreadsheet. Or it may just be tables locked into PDFs.

Today we'll learn to:

- Unlock tables
- Extract text held in PDFs
- Deal with obnoxious PDFs

## Tables scattered in PDFs

As it is, PDFs are notoriously obnoxious. They are designed so people can't change them easily.

PDFs that hold tables are pretty much the worst.

We want to <a href="https://drive.google.com/file/d/1_zzdyqwfMP6F0ukmmlMBaXTwZ36Iadba/view?usp=sharing">scrape data from these sample digital PDFs</a>.

You might have worked with the <a href="https://tabula.technology/">Tabula GUI</a> to extract tables from PDFs. But there's a lot of manual work involved. 

To automate the process, we'll use the **Tabula Python Library**.

#### There's NO satisfaction guarantee, but at least it's a way to try to tackle PDFs with tables.

Use  ```!pip install tabula-py``` in a code cell.


In [1]:
pip install -q tabula-py


You may need ```pip install install-jdk``` the first time you run this package.

In [2]:
pip install install-jdk


In [6]:
## import needed packages and libraries

import pandas as pd ## pandas to work with data
import glob
import tabula
tabula.environment_info() ## optional: check which versions are installed


Python version:
    3.9.7 (default, Sep 16 2021, 08:50:36) 
[Clang 10.0.0 ]
Java version:
    java version "19.0.1" 2022-10-18
Java(TM) SE Runtime Environment (build 19.0.1+10-21)
Java HotSpot(TM) 64-Bit Server VM (build 19.0.1+10-21, mixed mode, sharing)
tabula-py version: 2.5.1
platform: macOS-10.16-x86_64-i386-64bit
uname:
    uname_result(system='Darwin', node='Sandeep-Junnarkars-MacBook-Pro.local', release='24.1.0', version='Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:15 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6000', machine='x86_64')
linux_distribution: ('Darwin', '24.1.0', '')
mac_ver: ('10.16', ('', '', ''), 'x86_64')


Still having problems?

Are you getting a ```JDK error``` within your notebook?

Try this: install the latest version of Oracle's Java JDK. For those with an M1 machine, the appropriate download is the ARM64 file under the macOS tab; for those with Intel, it's the x64 file.

https://www.oracle.com/java/technologies/downloads/#jdk17-mac

In [6]:
# Suppress the specific FutureWarning
import warnings


warnings.filterwarnings("ignore", category=FutureWarning, message=".*errors='ignore' is deprecated.*")


In [8]:
## Let's pull in our first pdf with a single page, single table
t1 = tabula.read_pdf("pdf-tables-samples/mockup1.pdf")
t1

'pages' argument isn't specified.Will extract only from page 1 by default.


[   Fringe Benefit Expenses (in millions) Fiscal 2019 Fiscal 2020  \
 0                       Health Insurance      $6,268      $7,173   
 1                        Social Security      $2,161      $2,224   
 2          Supplemental Welfare Benefits      $1,259      $1,333   
 3                  Worker's Compensation        $343        $369   
 4                  Annuity Contributions        $117        $120   
 5                 Allowance for Uniforms         $72         $71   
 6      Worker's Compensation - Uniformed         $41         $42   
 7                 Unemployment Insurance         $36         $38   
 8                  Other Fringe Benefits         $12         $12   
 9               Faculty Welfare Benefits         $33         $10   
 10                  Disability Insurance          $1          $1   
 11                                Total*     $10,642     $11,394   
 
    Percent Change  
 0             14%  
 1              3%  
 2              6%  
 3              8

In [9]:
## WHAT TYPE OF DATA?
type(t1)

list

In [10]:
## what does this list hold?
type(t1[0])

pandas.core.frame.DataFrame

In [11]:
## let's get the first table
df1 = t1[0]
df1

Unnamed: 0,Fringe Benefit Expenses (in millions),Fiscal 2019,Fiscal 2020,Percent Change
0,Health Insurance,"$6,268","$7,173",14%
1,Social Security,"$2,161","$2,224",3%
2,Supplemental Welfare Benefits,"$1,259","$1,333",6%
3,Worker's Compensation,$343,$369,8%
4,Annuity Contributions,$117,$120,3%
5,Allowance for Uniforms,$72,$71,-1%
6,Worker's Compensation - Uniformed,$41,$42,4%
7,Unemployment Insurance,$36,$38,3%
8,Other Fringe Benefits,$12,$12,-3%
9,Faculty Welfare Benefits,$33,$10,-69%


In [None]:
## store into df


In [12]:
## Export and download as CSV file
df1.to_csv("table1.csv", encoding = "UTF-8", index = False)
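
The dollar amounts and percentages come out of Tabula as text (`"$6,268"`, `"14%"`), so they can't be summed or sorted as numbers yet. Here is a minimal sketch of one way to clean them with pandas. It re-creates a few rows from the table above by hand so it runs on its own, and `to_number` is a hypothetical helper name, not part of Tabula.

```python
import pandas as pd

# A few rows copied from the table above, re-created by hand so this runs standalone
sample = pd.DataFrame({
    "Fringe Benefit Expenses (in millions)": [
        "Health Insurance", "Social Security", "Allowance for Uniforms"
    ],
    "Fiscal 2019": ["$6,268", "$2,161", "$72"],
    "Fiscal 2020": ["$7,173", "$2,224", "$71"],
    "Percent Change": ["14%", "3%", "-1%"],
})

def to_number(series):
    """Strip $, commas and % from a text column and convert it to numbers."""
    cleaned = series.str.replace(r"[$,%]", "", regex=True)
    return pd.to_numeric(cleaned)

clean = sample.copy()
for col in ["Fiscal 2019", "Fiscal 2020", "Percent Change"]:
    clean[col] = to_number(clean[col])

clean
```

This sketch does not handle negatives written in parentheses, such as the `($70)` that shows up in some budget tables; those would need an extra step.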

### Multiple pages/ Multiple tables
We can also target multiple pages and multiple tables in a PDF.

In [7]:
## pdf2
pdf2 = "pdf-tables-samples/mockup2.pdf"
t2 = tabula.read_pdf(pdf2, pages="1-2")
t2



[   Fringe Benefit Expenses (in millions) Fiscal 2019 Fiscal 2020  \
 0                       Health Insurance      $6,268      $7,173   
 1                        Social Security      $2,161      $2,224   
 2          Supplemental Welfare Benefits      $1,259      $1,333   
 3                  Worker's Compensation        $343        $369   
 4                  Annuity Contributions        $117        $120   
 5                 Allowance for Uniforms         $72         $71   
 6      Worker's Compensation - Uniformed         $41         $42   
 7                 Unemployment Insurance         $36         $38   
 8                  Other Fringe Benefits         $12         $12   
 9               Faculty Welfare Benefits         $33         $10   
 10                  Disability Insurance          $1          $1   
 11                                Total*     $10,642     $11,394   
 
    Percent Change  
 0             14%  
 1              3%  
 2              6%  
 3              8

In [15]:
## table extraction
t2[1]

Unnamed: 0.1,Unnamed: 0,FY19,FY20,Unnamed: 1,FY21,Unnamed: 2,FY22,Unnamed: 3,FY23
0,Real Property,($70),$0,,$0,,$0,,$0
1,Personal Income,284,152,,120,,122,,87
2,General Corporation,71,71,,67,,78,,41
3,Unincorporated Business,-52,-189,,-133,,-106,,-110
4,Sales and Use,18,98,,114,,112,,112
5,Commercial Rent,11,15,,16,,18,,18
6,Real Property Transfer,-30,45,,44,,48,,45
7,Mortgage Recording,-24,25,,24,,27,,25
8,Utility,0,1,,0,,0,,0
9,Hotel,5,-9,,1,,0,,7


In [None]:
## let's get the second table


## Foundational Multi-page, Multi-table

### Campaign contribution demo

In [16]:
## path to our "contribs-excerpt.pdf" PDF
pdf3 = "pdf-tables-samples/contribs-excerpt.pdf"
pdf3

'pdf-tables-samples/contribs-excerpt.pdf'

In [20]:
## get all the pages
tables = tabula.read_pdf(pdf3, pages="all")
tables

[          committee name state contrib_amount
 0   Harris for President    AK         $2,800
 1    Trump for President    AL         $2,000
 2   Harris for President    AL         $2,800
 3    Trump for President    AR         $5,000
 4    Trump for President    AZ         $2,000
 5   Harris for President    AZ         $2,800
 6    Trump for President    CA         $2,000
 7    Trump for President    CA         $2,000
 8    Trump for President    CA         $2,800
 9    Trump for President    CA         $2,800
 10   Trump for President    CA         $2,800
 11   Trump for President    CA         $2,800
 12   Trump for President    CA         $2,800
 13   Trump for President    CA         $2,800
 14   Trump for President    CA         $2,800
 15   Trump for President    CA         $5,000
 16  Harris for President    CA         $2,000
 17  Harris for President    CA         $2,000
 18  Harris for President    CA         $2,000
 19  Harris for President    CA         $2,300
 20  Harris f

In [21]:
## confirm we have the correct number of tables. this excerpt should give us 4
len(tables)

4

In [25]:
## check out a couple of tables
## (a notebook cell only displays its last expression, so this shows tables[3])
tables[0]
tables[3]

Unnamed: 0,committee name,state,contrib_amount
0,Harris for President,GA,"$3,000"
1,Harris for President,GA,"$5,000"
2,Harris for President,HI,"$2,000"
3,Harris for President,HI,"$5,600"
4,Harris for President,IL,"$2,600"
5,Harris for President,IL,"$2,800"
6,Harris for President,IL,"$2,800"
7,Harris for President,IL,"$2,800"
8,Harris for President,IL,"$5,000"
9,Trump for President,IN,"$2,800"


In [30]:
## combine all the tables into one df
df = pd.concat(tables, ignore_index = True)
df

Unnamed: 0,committee name,state,contrib_amount
0,Harris for President,AK,"$2,800"
1,Trump for President,AL,"$2,000"
2,Harris for President,AL,"$2,800"
3,Trump for President,AR,"$5,000"
4,Trump for President,AZ,"$2,000"
...,...,...,...
95,Harris for President,MA,"$2,600"
96,Harris for President,MA,"$2,800"
97,Harris for President,MA,"$2,800"
98,Harris for President,MA,"$2,800"
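
Before trusting a combined table, it can help to confirm that every page produced the same columns and to record which table each row came from. The sketch below shows one way to do that, using two tiny stand-in tables built from rows shown above. The `source_table` column name is my own label, not something Tabula produces.

```python
import pandas as pd

# Two small stand-ins for the per-page tables, using rows shown above
page_a = pd.DataFrame({
    "committee name": ["Harris for President", "Trump for President"],
    "state": ["AK", "AL"],
    "contrib_amount": ["$2,800", "$2,000"],
})
page_b = pd.DataFrame({
    "committee name": ["Harris for President", "Trump for President"],
    "state": ["AL", "AR"],
    "contrib_amount": ["$2,800", "$5,000"],
})
sample_tables = [page_a, page_b]

# Every table should share the same header; a mismatch usually means a page was misread
expected = list(sample_tables[0].columns)
mismatched = [i for i, t in enumerate(sample_tables) if list(t.columns) != expected]

# Tag each row with the table it came from, then stack the tables
tagged = [t.assign(source_table=i) for i, t in enumerate(sample_tables)]
combined = pd.concat(tagged, ignore_index=True)
combined
```

If `mismatched` is not empty, look at those tables individually before concatenating.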


### Reality Check

In [32]:
## import who_covid.pdf
pdf4 = "pdf-tables-samples/who_covid.pdf"
who_df = tabula.read_pdf(pdf4, pages="3-4")
who_df

[              Province/  Unnamed: 0  Unnamed: 1  Unnamed: 2  Daily Unnamed: 3  \
 0                   NaN  Population         NaN         NaN    NaN        NaN   
 1               Region/         NaN         NaN         NaN    NaN        NaN   
 2                   NaN   (10,000s)  Laboratory  Clinically  Total  Suspected   
 3                  City         NaN   confirmed   diagnosed  cases      cases   
 4                 Hubei        5917         955         888   1843       1036   
 5             Guangdong       11346          22           -     22          2   
 6                 Henan        9605          19           -     19        137   
 7              Zhejiang        5737           5           -      5         23   
 8                 Hunan        6899           3           -      3         36   
 9                 Anhui        6324          12           -     12          6   
 10              Jiangxi        4648          13           -     13          7   
 11             

In [34]:
## call the first table on page 3
who_df[0]

Unnamed: 0.1,Province/,Unnamed: 0,Unnamed: 1,Unnamed: 2,Daily,Unnamed: 3,Unnamed: 4,Unnamed: 5,Cumulative,Unnamed: 6
0,,Population,,,,,,,,
1,Region/,,,,,,,,,
2,,"(10,000s)",Laboratory,Clinically,Total,Suspected,Deaths,Laboratory,Clinically Total,Deaths
3,City,,confirmed,diagnosed,cases,cases,,confirmed,diagnosed cases,
4,Hubei,5917,955,888,1843,1036,139,38839,17410 56249,1596
5,Guangdong,11346,22,-,22,2,0,1316,- 1316,2
6,Henan,9605,19,-,19,137,0,1231,- 1231,13
7,Zhejiang,5737,5,-,5,23,0,1167,- 1167,0
8,Hunan,6899,3,-,3,36,1,1004,- 1004,3
9,Anhui,6324,12,-,12,6,0,962,- 962,6


In [35]:
who_df[1]

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Total,Total cases with,T otal cases with,Unnamed: 3
0,,,,cases with,possible or,site of,
1,,,Confirmed*,travel,confirmed,transmission,Total deaths
2,WHO Region Country/Territory/Area,,,,,,
3,,,cases (new),history to,transmission,under,(new)
4,,,,China outside of China†,,investigation,
5,,,,(new),(new),(new),
6,Singapore,,72 (5),22 (0),49 (5),1 (0),0 (0)
7,Japan,,53 (12),26 (1),27 (11),0 (0),1 (0)
8,Republic of Korea,,29 (1),13 (0),13‡ (1),3 (0),0 (0)
9,Western Pacific Region Malaysia,,22 (1),17 (0),4§ (0),1 (1),0 (0)


#### Compare to actual PDF table.
What is happening?
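
In the first WHO table above, the header is spread over the first four rows, and Tabula fused two of the cumulative columns into one (Hubei shows `17410 56249`). Repairs like this are usually manual. Below is a rough sketch of splitting a fused column back apart with pandas, using three rows re-created by hand from the output. The column names are labels I chose for illustration.

```python
import pandas as pd

# Three data rows re-created from the output above; "clin_total" holds the fused values
raw = pd.DataFrame({
    "province": ["Hubei", "Guangdong", "Henan"],
    "lab_confirmed": ["38839", "1316", "1231"],
    "clin_total": ["17410 56249", "- 1316", "- 1231"],
})

# Split the fused column on whitespace into its two original columns
parts = raw["clin_total"].str.split(expand=True)
raw["clinically_diagnosed"] = parts[0]
raw["total_cases"] = parts[1]

fixed = raw.drop(columns="clin_total")

# "-" means nothing was reported; errors="coerce" turns it into a missing value
for col in ["lab_confirmed", "clinically_diagnosed", "total_cases"]:
    fixed[col] = pd.to_numeric(fixed[col], errors="coerce")

fixed
```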

# No Satisfaction Guarantee

What did I mean by that?

The results really depend on the PDF and how it was put together.

Here are some issues you will encounter:

1. The tables have too many sub-columns, sub-rows and groupings (bad_table.pdf)

2. Multiple different tables on the same page that are too close together will be processed as a single table and be an utter mess.

3. Documents and reports that have been scanned are really just images inside a PDF, so they can't be processed with Tabula or PyPDF2. Tables on these types of scans require advanced Python and graphical analysis skills beyond the scope of this course.

## Extracting Text from PDFs

In many cases, we just need the text from a single or multiple PDFs so we can convert them to structured data or run natural language analysis on them.

The approach depends on the type of PDFs we are dealing with. Some PDFs are good, others just okay and some are just **very, very bad**.

This folder contains PDFs that come in many different flavors. <a href="https://drive.google.com/file/d/1flBD4b2Dz6_6EfC1VuU-6Uv2FAYtKbia/view?usp=sharing">Download it</a> and place it in the same directory as your notebook.

Here are several strategies:

### Good PDFs

Well-behaved PDFs are those where the digital text can easily be copied and pasted. We just don't want to copy and paste for hundreds of files.

We'll use ```pymupdf4llm```, a recent package built to pull PDF content out as Markdown so it can be fed into Large Language Models.



In [36]:
pip install pymupdf4llm


In [1]:
import pymupdf4llm

#### ```to_markdown()```

```pymupdf4llm``` has a powerful ```to_markdown()``` function.

Provide a path to your PDF and it returns the text as a Markdown string.

In [2]:
# Extract a simple PDF content as Markdown
md_text = pymupdf4llm.to_markdown("pdf-mixed-samples/simple.pdf")
md_text

Processing pdf-mixed-samples/simple.pdf...


'# P 1 Quisque varius, ipsum a molestie laoreet\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. In dui ligula, facilisis ac mollis in,\nfringilla id lorem. Praesent facilisis sapien eu sapien fringilla finibus. Sed tortor felis,\nfaucibus ac nisi quis, ullamcorper pretium augue. Duis at rutrum nunc. Maecenas\ninterdum magna quis ex feugiat convallis.\n\n## Quisque condimentum quis sem varius luctus.\n\n  - Ut congue mi eu sem pulvinar, at molestie felis facilisis.\n\n  - Proin vel ex quis erat molestie ultrices at eu ex.\n\n  - Nam accumsan elit ac est pretium viverra.\n\n  - **Nam sit amet leo id libero varius aliquet.**\n\n  - Vivamus gravida ligula non iaculis posuere.\n\nNunc ut nisi pellentesque, iaculis purus vitae, ornare tortor. Nullam egestas porttitor nisl,\nsed auctor erat pellentesque nec.\n\nFusce sed lacus in ex egestas vulputate placerat ut nulla. Nullam vestibulum lacus quis\naccumsan suscipit.\n\nNulla et quam gravida ante luctus viverra non sit amet augue.

#### You don't need to read an entire PDF. You can just specify a page, or a range of pages.

In [3]:
## extract a single page (pages are 0-based, so 4 is the fifth page)
md_text5 = \
pymupdf4llm.to_markdown("pdf-mixed-samples/simple.pdf", pages=[4])
md_text5

Processing pdf-mixed-samples/simple.pdf...


'# P5 Quisque tellus sapien\n\nEtiam pellentesque ipsum erat, eget consequat odio euismod id. Maecenas vitae lobortis\nnisl. Pellentesque ut blandit nisi, sit amet fringilla turpis. Sed arcu ligula, euismod sed mi\nsit amet, suscipit euismod tortor.\n\n## Praesent sit amet sem maximus, pharetra magna eu, egestas velit.\n\n  - **Aenean quis dolor ac nisl vehicula semper.**\n\n  - Nunc tempus massa in tortor egestas dictum.\n\n  - Duis sagittis libero vitae leo hendrerit interdum.\n\n  - Phasellus condimentum dolor quis nulla posuere, ac elementum augue ultrices.\n\nCras quis accumsan urna, eget eHicitur nulla. In hac habitasse platea dictumst. Nunc\nporttitor ex ut nunc rutrum auctor.\n\nQuisque tellus sapien, pretium quis leo non, sagittis imperdiet mi.\n\nSed lectus lorem, dictum a risus ac, blandit tempor urna.\n\n\n-----\n\n'

## Extract a range of pages


In [8]:
## create a range of pages (end_page itself is not included)
start_page, end_page = 0, 3
page_range = list(range(start_page, end_page))
page_range

[0, 1, 2]

In [10]:
## extract range page

pymupdf4llm.to_markdown("pdf-mixed-samples/simple.pdf",
                        pages = page_range)

Processing pdf-mixed-samples/simple.pdf...


'# P 1 Quisque varius, ipsum a molestie laoreet\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. In dui ligula, facilisis ac mollis in,\nfringilla id lorem. Praesent facilisis sapien eu sapien fringilla finibus. Sed tortor felis,\nfaucibus ac nisi quis, ullamcorper pretium augue. Duis at rutrum nunc. Maecenas\ninterdum magna quis ex feugiat convallis.\n\n## Quisque condimentum quis sem varius luctus.\n\n  - Ut congue mi eu sem pulvinar, at molestie felis facilisis.\n\n  - Proin vel ex quis erat molestie ultrices at eu ex.\n\n  - Nam accumsan elit ac est pretium viverra.\n\n  - **Nam sit amet leo id libero varius aliquet.**\n\n  - Vivamus gravida ligula non iaculis posuere.\n\nNunc ut nisi pellentesque, iaculis purus vitae, ornare tortor. Nullam egestas porttitor nisl,\nsed auctor erat pellentesque nec.\n\nFusce sed lacus in ex egestas vulputate placerat ut nulla. Nullam vestibulum lacus quis\naccumsan suscipit.\n\nNulla et quam gravida ante luctus viverra non sit amet augue.

In [13]:
## create a range of pages + some
start_page, end_page = 0, 3
page_range = list(range(start_page, end_page)) + [4]
page_range

[0, 1, 2, 4]
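
Because `pages` counts from 0 (asking for `[4]` above returned the page headed "P5"), it is easy to be off by one. The helper below is a small sketch that lets you think in the page numbers a reader sees and converts them for you. `pages_to_indexes` is a hypothetical name, not part of `pymupdf4llm`.

```python
def pages_to_indexes(*human_pages):
    """Turn page numbers as a reader counts them (1, 2, 3...) into 0-based indexes.

    Each argument is either a single page number or a (first, last) tuple,
    with both ends included.
    """
    indexes = []
    for item in human_pages:
        if isinstance(item, tuple):
            first, last = item
            indexes.extend(range(first - 1, last))
        else:
            indexes.append(item - 1)
    return indexes

# Pages 1 through 3, plus page 5
page_range = pages_to_indexes((1, 3), 5)
page_range
```

The resulting list can be passed as the `pages` argument in the same way as the hand-built list above.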

In [15]:
## extract range page
md_text_mixed = pymupdf4llm.to_markdown("pdf-mixed-samples/simple.pdf",
                        pages = page_range)

md_text_mixed

Processing pdf-mixed-samples/simple.pdf...


'# P 1 Quisque varius, ipsum a molestie laoreet\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. In dui ligula, facilisis ac mollis in,\nfringilla id lorem. Praesent facilisis sapien eu sapien fringilla finibus. Sed tortor felis,\nfaucibus ac nisi quis, ullamcorper pretium augue. Duis at rutrum nunc. Maecenas\ninterdum magna quis ex feugiat convallis.\n\n## Quisque condimentum quis sem varius luctus.\n\n  - Ut congue mi eu sem pulvinar, at molestie felis facilisis.\n\n  - Proin vel ex quis erat molestie ultrices at eu ex.\n\n  - Nam accumsan elit ac est pretium viverra.\n\n  - **Nam sit amet leo id libero varius aliquet.**\n\n  - Vivamus gravida ligula non iaculis posuere.\n\nNunc ut nisi pellentesque, iaculis purus vitae, ornare tortor. Nullam egestas porttitor nisl,\nsed auctor erat pellentesque nec.\n\nFusce sed lacus in ex egestas vulputate placerat ut nulla. Nullam vestibulum lacus quis\naccumsan suscipit.\n\nNulla et quam gravida ante luctus viverra non sit amet augue.

#### We can write it to an ```.md``` file in case we want to hold on to it.

In [17]:
# Specify the output Markdown file
output_file_path = "text_mixed.md"

with open(output_file_path, "w", encoding="utf-8") as f:
    f.write(md_text_mixed)

In [18]:
## Let's turn it into a function
def export_to_markdown(input_text, output_path):
    """Write a string of Markdown text to output_path."""
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(input_text)

In [19]:
## export range to md
export_to_markdown(md_text_mixed, "md2.md")
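
For hundreds of files, the same idea can be wrapped in a loop. The sketch below is one possible shape for that. It takes the conversion function as an argument, so it makes no assumptions about `pymupdf4llm` beyond what we used above (a path goes in, a string comes out). `convert_many` is a hypothetical name.

```python
from pathlib import Path

def convert_many(pdf_paths, out_dir, converter):
    """Run `converter` on each PDF and save the result as a .md file in out_dir.

    `converter` is any function that takes a file path and returns a string,
    for example pymupdf4llm.to_markdown.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in pdf_paths:
        text = converter(str(pdf_path))
        # simple.pdf becomes simple.md in the output folder
        out_path = out_dir / (Path(pdf_path).stem + ".md")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

A call might look like `convert_many(glob.glob("pdf-mixed-samples/*.pdf"), "markdown-out", pymupdf4llm.to_markdown)`, where `markdown-out` is a made-up folder name.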

### Now the problematic kids...

In [20]:
## another with image and text 
## This one doesn't quite work
pdf_path = 'pdf-mixed-samples/Aug-20.pdf'
extract = pymupdf4llm.to_markdown(pdf_path)
extract

Processing pdf-mixed-samples/Aug-20.pdf...


'y p y, y, $\n\nBudget Act forecast of $27.792 billion, largely related to unexpected strength in the 2019 tax year. Preliminary General\nFund agency cash receipts for the entire 2019-20 fiscal year were $1.135 billion above the 2020-21 Budget Act forecast of\n$123.395 billion, or 0.9 percentage point above forecast. Total collections for March through July of 2020 are actually\ndown by 6 percent from the same period in 2019.\n\n� Personal income tax cash receipts for July were $2.7 billion above the month’s forecast of $21 billion. Withholding cash\n\nreceipts were $1.6 billion above the forecast of $4.5 billion. Other cash receipts were $1.1 billion higher than the forecast\nof $19 billion. Refunds issued in July were $47 million lower than the expected $2.1 billion. Proposition 63 requires that\n1.76 percent of total monthly personal income tax collections be transferred to the Mental Health Services Fund\n(MHSF). The amount transferred to the MHSF in July was $49 million higher tha

In [21]:
## one more file with images and text
pdf_path = 'pdf-mixed-samples/Jul-19.pdf'
extract = pymupdf4llm.to_markdown(pdf_path)
extract

Processing pdf-mixed-samples/Jul-19.pdf...


'# Preliminary General Fund agency cash for the entire 2018-19 fiscal year was $1.041 billion above the 2019-20 Budget Act forecast of $143.804 billion, or 0.7 percentage point above forecast. Revenues for June were $409 million above the month’s forecast of $19.387 billion, or 2.1 percent above forecast. June cash receipts represent the second estimated payment of 40 percent of liability due mid-month for personal income tax filers and calendar-year corporations.\n\n\uf06ePersonal income tax revenues for the entire 2018-19 fiscal year were $523 million above the forecast of $98.505\nbillion. Cash receipts to the General Fund in June were $104 million above the month’s forecast of $12.776\nbillion. Withholding receipts were $184 million below the forecast of $5.115 billion. Other receipts were $132 million\nhigher than the forecast of $8.405 billion. Refunds issued in June were $157 million below the expected $515\nmillion. Proposition 63 requires that 1.76 percent of total monthly per

In [22]:
## save in md file
export_to_markdown(extract, "july19.md")

In [None]:
## how about as an md file?


### Obnoxious PDF

In [23]:
## read "columbus_bank_trust.pdf" to a text document
## read and store the document in an object
pdf_path = 'pdf-mixed-samples/columbus_bank_trust.pdf'
extract = pymupdf4llm.to_markdown(pdf_path)
extract

Processing pdf-mixed-samples/columbus_bank_trust.pdf...


'**C\'&%C: 199120 A=1**\n\n_!" $%& \'()")\'" \'* +\'", -\'."/&0, .",&2 &3)/$)"4 056 5", /.78&9$ $\' 9\'",)$)\'"/ %&2&)"5*$&2,&/92)7&,, )"$&2&/$ \'" $%& +\'",/ (5) 6)00 "\'$ 7& )"90.,&, )" 42\'//_\n\n_)"9\'�& *\'2 *&,&250 )"9\'�& $53 (.2(\'/&/ &39&($ *\'2 )"$&2&/$ \'" 5"� \'* $%& +\'",/ *\'2 5"� (&2)\',,.2)"4 6%)9% /.9% +\'",/ 52& %&0, 7� 5 (&2/\'" 6%\' )/ 5 �/.7/$5"$)50_\n_./&2� \'* $%& *59)0)$)&/ *)"5"9&, 7� $%& +\'",/ \'2 5 �2&05$&, (&2/\'"� 6)$%)" $%& �&5")"4 \'* �&9$)\'" ���(7)(��) \'* $%& !"$&2"50 �&�&".& -\',& \'* ����, 5/ 5�&",&,, 5",_\n_(7) 6)00 "\'$ 7& 5" )$&� \'* $53 (2&*&2&"9& *\'2 (.2(\'/&/ \'* $%& 50$&2"5$)�& �)")�.� $53 )�(\'/&, \'" )",)�),.50/ 5", 9\'2(\'25$)\'"/� (2\'�),&,, %\'6&�&2, 6)$% 2&/(&9$ $\'_\n_9\'2(\'25$)\'"/ (5/,&*)"&, *\'2 *&,&250 )"9\'�& $53 (.2(\'/&/), /.9% )"$&2&/$ )/ $5�&" )"$\' 599\'."$ )",&$&2�)")"4 5,8./$&, 9.22&"$ &52")"4/ *\'2 $%& (.2(\'/& \'* 9\'�(.$)"4 $%&_\n_*&,&250 50$&2"5$)�& �)")�.� $53 \'" 9\'2(\'25$)\'"/� !" $%& \'()")\'" \'* +\'", -\'."/&0,

# Strategy to Vanquish Obnoxious PDFs




### The problem:
*   PDFs don't all encode their text the same way (UTF-8, ASCII, other Unicode encodings, etc.)
*   As a result, text can be lost or garbled during the conversion
* Sometimes PDFs are just images (like material received through FOIA requests)

### The solution:
*   Convert the PDF to an image
*   Use optical character recognition (OCR) to capture the text
*   Export to a text file
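
OCR is slow, so with a big batch of files it can be worth checking whether the ordinary text extraction already produced something readable. The function below is a rough heuristic of my own, not a feature of any package used here: it measures what share of the "words" contain at least three letters in a row. The 0.5 cutoff is an arbitrary starting point.

```python
import re

def looks_garbled(text, threshold=0.5):
    """Guess whether extracted text is gibberish.

    Counts the share of whitespace-separated tokens that contain a run of
    at least three letters. Real prose scores high; symbol soup scores low.
    """
    tokens = text.split()
    if not tokens:
        return True  # nothing extracted at all, e.g. a scanned image
    wordlike = sum(1 for t in tokens if re.search(r"[A-Za-z]{3,}", t))
    return wordlike / len(tokens) < threshold

# One readable sentence and one snippet of symbol soup like the bank PDF output
good = "Personal income tax cash receipts for July were above the forecast."
bad = '''_!" $%& '()")'" '* +'", -'."/&0, .",&2 &3)/$)"4 056 5", /.78&9$'''

print(looks_garbled(good), looks_garbled(bad))
```

Files flagged as garbled are candidates for the image-plus-OCR route; the rest can stay with the faster text extraction. This check is English-centric and will misjudge documents that are mostly numbers.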



### mangoCR to the rescue.
This <a href="https://pypi.org/project/mangoCR/">package overcomes</a> many of the problems above.



In [24]:
!brew install tesseract

==> Downloading https://formulae.brew.sh/api/formula.jws.json
==> Downloading https://formulae.brew.sh/api/cask.jws.json
tesseract 5.3.4_1 is already installed but outdated (so it will be upgraded).
Error: Cannot install under Rosetta 2 in ARM default prefix (/opt/homebrew)!
To rerun under ARM use:
    arch -arm64 brew install ...
To install under x86_64, install Homebrew into /usr/local.


In [3]:
!pip install mangoCR



In [4]:
## import library
from mangoCR import pdf2image_ocr

In [5]:
## bank pdf
pdf_path = 'pdf-mixed-samples/columbus_bank_trust.pdf'
pdf2image_ocr(pdf_path,"bank.md")

Processing PDF 1 of 1: columbus_bank_trust.pdf
  - Processed page 1 of 5 in columbus_bank_trust.pdf
  - Processed page 2 of 5 in columbus_bank_trust.pdf
  - Processed page 3 of 5 in columbus_bank_trust.pdf
  - Processed page 4 of 5 in columbus_bank_trust.pdf
  - Processed page 5 of 5 in columbus_bank_trust.pdf
Finished processing columbus_bank_trust.pdf

OCR results have been saved to bank.md


In [7]:
!tesseract --list-langs

List of available languages in "/opt/anaconda3/share/tessdata/" (125):
afr
amh
ara
asm
aze
aze_cyrl
bel
ben
bod
bos
bre
bul
cat
ceb
ces
chi_sim
chi_sim_vert
chi_tra
chi_tra_vert
chr
cos
cym
dan
deu
div
dzo
ell
eng
enm
epo
equ
est
eus
fao
fas
fil
fin
fra
frk
frm
fry
gla
gle
glg
grc
guj
hat
heb
hin
hrv
hun
hye
iku
ind
isl
ita
ita_old
jav
jpn
jpn_vert
kan
kat
kat_old
kaz
khm
kir
kmr
kor
kor_vert
lao
lat
lav
lit
ltz
mal
mar
mkd
mlt
mon
mri
msa
mya
nep
nld
nor
oci
ori
osd
pan
pol
por
pus
que
ron
rus
san
sin
slk
slv
snd
spa
spa_old
sqi
srp
srp_latn
sun
swa
swe
syr
tam
tat
tel
tgk
tha
tir
ton
tur
uig
ukr
urd
uzb
uzb_cyrl
vie
yid
yor


In [8]:
## nixon pdf
pdf_path = 'pdf-mixed-samples/nixon-memo1.pdf'
pdf2image_ocr(pdf_path,"nixon.md")

Processing PDF 1 of 1: nixon-memo1.pdf
  - Processed page 1 of 2 in nixon-memo1.pdf
  - Processed page 2 of 2 in nixon-memo1.pdf
Finished processing nixon-memo1.pdf

OCR results have been saved to nixon.md


## What about a list of PDFs?

In [10]:
## import glob
import glob

In [11]:
## police cases memos
cases_list = glob.glob("pdf-mixed-samples/case*.pdf")
cases_list

['pdf-mixed-samples/case-memos-1.pdf',
 'pdf-mixed-samples/case-memos-2.pdf',
 'pdf-mixed-samples/case-memos-3.pdf',
 'pdf-mixed-samples/case-memos-6.pdf',
 'pdf-mixed-samples/case-memos-4.pdf',
 'pdf-mixed-samples/case-memos-5.pdf']
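
Notice that `glob` returned the files in no particular order (memo 6 comes before memo 4). If the order of the memos matters in the combined output, sort the list first. Plain `sorted()` works for this folder, but it would put a file numbered 10 right after 1. The sketch below uses a "natural" sort key to avoid that. `natural_key` is a hypothetical helper, and the tenth memo is made up to show the difference.

```python
import re

def natural_key(path):
    """Sort key that compares the numbers inside a filename as numbers."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", path)]

# The order glob returned above, plus a made-up tenth memo
unsorted_cases = [
    "pdf-mixed-samples/case-memos-1.pdf",
    "pdf-mixed-samples/case-memos-10.pdf",  # hypothetical, not in the sample folder
    "pdf-mixed-samples/case-memos-2.pdf",
    "pdf-mixed-samples/case-memos-3.pdf",
    "pdf-mixed-samples/case-memos-6.pdf",
    "pdf-mixed-samples/case-memos-4.pdf",
    "pdf-mixed-samples/case-memos-5.pdf",
]

cases_sorted = sorted(unsorted_cases, key=natural_key)
cases_sorted
```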

In [12]:
## mangoCR it
pdf2image_ocr(cases_list, "cases.md")

Processing PDF 1 of 6: case-memos-1.pdf
  - Processed page 1 of 1 in case-memos-1.pdf
Finished processing case-memos-1.pdf

Processing PDF 2 of 6: case-memos-2.pdf
  - Processed page 1 of 1 in case-memos-2.pdf
Finished processing case-memos-2.pdf

Processing PDF 3 of 6: case-memos-3.pdf
  - Processed page 1 of 1 in case-memos-3.pdf
Finished processing case-memos-3.pdf

Processing PDF 4 of 6: case-memos-6.pdf
  - Processed page 1 of 1 in case-memos-6.pdf
Finished processing case-memos-6.pdf

Processing PDF 5 of 6: case-memos-4.pdf
  - Processed page 1 of 1 in case-memos-4.pdf
Finished processing case-memos-4.pdf

Processing PDF 6 of 6: case-memos-5.pdf
  - Processed page 1 of 1 in case-memos-5.pdf
Finished processing case-memos-5.pdf

OCR results have been saved to cases.md


## Now you can tackle most of the PDFs you encounter in your investigations!