# Python Extract Table from PDF

Example PDFs

* McKinsey Global Institute Disruptive technologies

https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/McKinsey%20Digital/Our%20Insights/Disruptive%20technologies/MGI_Disruptive_technologies_Full_report_May2013.ashx

* Food Calories List

http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf

## With tabula-py

#### Installation

https://pypi.org/project/tabula-py/

`pip install tabula-py`
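
tabula-py is a wrapper around the Java library tabula-java, so `read_pdf` needs a Java runtime on the PATH in addition to the pip package. A minimal pre-flight check, as a sketch (the helper name `java_available` is made up for this example):

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if not java_available():
    print("Java not found: install a Java runtime before calling tabula.read_pdf")
```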

#### tabula-py docs

https://www.pydoc.io/pypi/tabula-py-0.9.0/autoapi/wrapper/index.html

In [1]:
from tabula import read_pdf
from tabulate import tabulate

In [2]:
df = read_pdf("./tmp/pdf/Food Calories List.pdf")
df

| | BREADS & CEREALS | Portion size * | per 100 grams (3.5 oz) | Unnamed: 3 | energy content |
|---|---|---|---|---|---|
| 0 | Bagel ( 1 average ) | 140 cals (45g) | 310 cals | NaN | Medium |
| 1 | Biscuit digestives | 86 cals (per biscuit) | 480 cals | NaN | High |
| 2 | Jaffa cake | 48 cals (per biscuit) | 370 cals | NaN | Med-High |
| 3 | Bread white (thick slice) | 96 cals (1 slice 40g) | 240 cals | NaN | Medium |
| 4 | Bread wholemeal (thick) | 88 cals (1 slice 40g) | 220 cals | NaN | Low-med |
| 5 | Chapatis | 250 cals | 300 cals | NaN | Medium |
| 6 | Cornflakes | 130 cals (35g) | 370 cals | NaN | Med-High |
| 7 | Crackerbread | 17 cals per slice | 325 cals | NaN | Low Calorie |
| 8 | Cream crackers | 35 cals (per cracker) | 440 cals | NaN | Low / portion |
| 9 | Crumpets | 93 cals (per crumpet) | 198 cals | NaN | Low-Med |
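
The cells in this notebook show `read_pdf` returning a single DataFrame. From tabula-py 2.0 onward the default return value is a list of DataFrames, one per detected table, so the same call would need an extra indexing step. A small sketch that accepts either shape (the helper `first_table` is hypothetical, not part of tabula-py):

```python
def first_table(result):
    """Return the first table from a read_pdf result.

    Accepts either a single DataFrame (older tabula-py) or a list of
    DataFrames (the tabula-py 2.x default).
    """
    if isinstance(result, list):
        if not result:
            raise ValueError("no tables were detected")
        return result[0]
    return result
```

It would be used as `df = first_table(read_pdf("./tmp/pdf/Food Calories List.pdf"))`.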


In [3]:
df = read_pdf("./tmp/pdf/Food Calories List.pdf")
df = df.dropna(axis='columns')
df

| | BREADS & CEREALS | Portion size * | per 100 grams (3.5 oz) | energy content |
|---|---|---|---|---|
| 0 | Bagel ( 1 average ) | 140 cals (45g) | 310 cals | Medium |
| 1 | Biscuit digestives | 86 cals (per biscuit) | 480 cals | High |
| 2 | Jaffa cake | 48 cals (per biscuit) | 370 cals | Med-High |
| 3 | Bread white (thick slice) | 96 cals (1 slice 40g) | 240 cals | Medium |
| 4 | Bread wholemeal (thick) | 88 cals (1 slice 40g) | 220 cals | Low-med |
| 5 | Chapatis | 250 cals | 300 cals | Medium |
| 6 | Cornflakes | 130 cals (35g) | 370 cals | Med-High |
| 7 | Crackerbread | 17 cals per slice | 325 cals | Low Calorie |
| 8 | Cream crackers | 35 cals (per cracker) | 440 cals | Low / portion |
| 9 | Crumpets | 93 cals (per crumpet) | 198 cals | Low-Med |
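
`dropna(axis='columns')` uses pandas' default `how='any'`, which removes every column that contains at least one missing value. That works here because only the spurious `Unnamed: 3` column has gaps, but it would also discard a real column with a single empty cell. A sketch of the more conservative `how='all'`, using a small made-up frame rather than the PDF:

```python
import numpy as np
import pandas as pd

# Made-up data: "note" has one gap, "Unnamed: 3" is entirely empty.
df = pd.DataFrame({
    "food": ["Bagel", "Jaffa cake"],
    "note": ["toasted", np.nan],
    "Unnamed: 3": [np.nan, np.nan],
})

any_dropped = df.dropna(axis="columns")             # default how='any'
all_dropped = df.dropna(axis="columns", how="all")  # only fully empty columns

print(list(any_dropped.columns))  # ['food']
print(list(all_dropped.columns))  # ['food', 'note']
```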


In [4]:
df = read_pdf("./tmp/pdf/Food Calories List.pdf", pages=3)
print(tabulate(df))

--  ------------------------  -----------------  --------  -----------
 0  Fish fingers              50 cals per piece  220 cals  Medium
 1  Gammon                    320 cals           280 cals  Med-High
 2  Haddock fresh             200 cals           110 cals  Low calorie
 3  Halibut fresh             220 cals           125 cals  Low calorie
 4  Ham                       6 cals             240 cals  Medium
 5  Herring fresh grilled     300 cals           200 cals  Medium
 6  Kidney                    200 cals           160 cals  Medium
 7  Kipper                    200 cals           120 cals  Low calorie
 8  Liver                     200 cals           150 cals  Medium
 9  Liver pate                150 cals           300 cals  Medium
10  Lamb (roast)              300 cals           300 cals  Med-High
11  Lobster boiled            200 cals           100 cals  Low calorie
12  Luncheon meat             300 cals           400 cals  High
13  Mackeral                  320 cals           

In [5]:
df = read_pdf("./tmp/pdf/Food Calories List.pdf", pages=3, output_format="json")
df

[{'extraction_method': 'stream',
  'top': 0.0,
  'left': 0.0,
  'width': 524.6400146484375,
  'height': 725.6300048828125,
  'data': [[{'top': 65.19,
     'left': 120.24,
     'width': 48.599998474121094,
     'height': 7.880000114440918,
     'text': 'Fish cake'},
    {'top': 65.19,
     'left': 241.2,
     'width': 79.91999816894531,
     'height': 7.880000114440918,
     'text': '90 cals per cake'},
    {'top': 65.19,
     'left': 370.08,
     'width': 42.600006103515625,
     'height': 7.880000114440918,
     'text': '200 cals'},
    {'top': 65.19,
     'left': 472.44,
     'width': 43.67999267578125,
     'height': 7.880000114440918,
     'text': 'Medium'}],
   [{'top': 87.75,
     'left': 114.6,
     'width': 60.00000762939453,
     'height': 7.880000114440918,
     'text': 'Fish fingers'},
    {'top': 87.75,
     'left': 239.52,
     'width': 83.27998352050781,
     'height': 7.880000114440918,
     'text': '50 cals per piece'},
    {'top': 87.75,
     'left': 370.08,
     'widt
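
With `output_format="json"` each table is a dict whose `data` key holds rows of cell dicts, as the output above shows. To get plain rows of text it is enough to pull out each cell's `text`. A sketch using a hand-copied fragment of that output (the helper `json_table_to_rows` is hypothetical):

```python
def json_table_to_rows(table):
    """Flatten one tabula JSON table into a list of rows of cell text."""
    return [[cell["text"] for cell in row] for row in table["data"]]


# Fragment copied from the output above, with width/height omitted.
sample = {
    "extraction_method": "stream",
    "data": [
        [{"top": 65.19, "left": 120.24, "text": "Fish cake"},
         {"top": 65.19, "left": 241.2, "text": "90 cals per cake"},
         {"top": 65.19, "left": 370.08, "text": "200 cals"},
         {"top": 65.19, "left": 472.44, "text": "Medium"}],
    ],
}

print(json_table_to_rows(sample))
# [['Fish cake', '90 cals per cake', '200 cals', 'Medium']]
```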

In [6]:
df = read_pdf("./tmp/pdf/Food Calories List.pdf", pages='all', multiple_tables=True)
df

[                             0                            1  \
 0             BREADS & CEREALS               Portion size *   
 1          Bagel ( 1 average )               140 cals (45g)   
 2           Biscuit digestives        86 cals (per biscuit)   
 3                   Jaffa cake        48 cals (per biscuit)   
 4    Bread white (thick slice)       96  cals (1 slice 40g)   
 5      Bread wholemeal (thick)       88  cals (1 slice 40g)   
 6                     Chapatis                     250 cals   
 7                   Cornflakes              130  cals (35g)   
 8                 Crackerbread            17 cals per slice   
 9               Cream crackers        35 cals (per cracker)   
 10                    Crumpets        93 cals (per crumpet)   
 11   Flapjacks basic fruit mix                     320 cals   
 12           Macaroni (boiled)              238 cals (250g)   
 13                      Muesli              195  cals (50g)   
 14         Naan bread (normal)  300 cal

In [7]:
df = read_pdf("http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf", pages=3)
df

| | Fish cake | 90 cals per cake | 200 cals | Medium |
|---|---|---|---|---|
| 0 | Fish fingers | 50 cals per piece | 220 cals | Medium |
| 1 | Gammon | 320 cals | 280 cals | Med-High |
| 2 | Haddock fresh | 200 cals | 110 cals | Low calorie |
| 3 | Halibut fresh | 220 cals | 125 cals | Low calorie |
| 4 | Ham | 6 cals | 240 cals | Medium |
| 5 | Herring fresh grilled | 300 cals | 200 cals | Medium |
| 6 | Kidney | 200 cals | 160 cals | Medium |
| 7 | Kipper | 200 cals | 120 cals | Low calorie |
| 8 | Liver | 200 cals | 150 cals | Medium |
| 9 | Liver pate | 150 cals | 300 cals | Medium |
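
On page 3 the table continues from the previous page, so there is no header row and the first data row (`Fish cake`) ends up as the column names. Passing `pandas_options={'header': None}` to `read_pdf`, as a later cell does, avoids this at read time. If the frame is already loaded, the header can be pushed back into the data. A sketch on a made-up one-row frame (the helper `demote_header` is hypothetical; note that pandas may have renamed duplicate header values, for example `300 cals.1`, which this does not undo):

```python
import pandas as pd


def demote_header(df):
    """Move the column names back into the data as the first row."""
    header_row = pd.DataFrame([list(df.columns)])
    body = pd.DataFrame(df.values)
    return pd.concat([header_row, body], ignore_index=True)


# Stand-in for the frame shown above.
df = pd.DataFrame(
    [["Fish fingers", "50 cals per piece", "220 cals", "Medium"]],
    columns=["Fish cake", "90 cals per cake", "200 cals", "Medium"],
)
fixed = demote_header(df)
print(fixed.values.tolist())
```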


In [8]:
df = read_pdf("./tmp/pdf/Food Calories List.pdf", encoding='ISO-8859-1',
              stream=True, area=[269.875, 12.75, 790.5, 961], pages=4,
              guess=False, pandas_options={'header': None})
df

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Fruits & Vegetables | Portion size * | oz) | energy content |
| 1 | Apple | 44 calories | 44 calories | Low calorie |
| 2 | Banana | 107 cals | 65 calories | Low calorie |
| 3 | Beans baked beans | 170 cals | 80 calories | Low calorie |
| 4 | Beans dried (boiled) | 180 cals | 130 calories | Low calorie |
| 5 | Blackberries | 25 cals | 25 calories | Low calorie |
| 6 | Blackcurrant | 30 cals | 30 calories | Low calorie |
| 7 | Broccoli | 27 cals | 32 cals | Very low |
| 8 | Cabbage (boiled) | 15 calories | 20 calories | Low calorie |
| 9 | Carrot (boiled) | 16 calories | 25 calories | Low calorie |
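
tabula-py reads `area` as `[top, left, bottom, right]`, measured by default in PDF points (1/72 inch) from the top-left corner of the page. If the region was measured in inches, for example with a ruler tool in a PDF viewer, it has to be converted first. A sketch of that conversion (the helper `inches_to_area` and the example box are made up for illustration):

```python
POINTS_PER_INCH = 72  # PDF user-space unit


def inches_to_area(top, left, bottom, right):
    """Convert a box measured in inches to tabula-py's area list in points."""
    if bottom <= top or right <= left:
        raise ValueError("expected top < bottom and left < right")
    return [edge * POINTS_PER_INCH for edge in (top, left, bottom, right)]


# Hypothetical box: starts 3.75 in from the top of a Letter-size page.
print(inches_to_area(3.75, 0.25, 10.5, 8.25))
# [270.0, 18.0, 756.0, 594.0]
```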


In [9]:
df = read_pdf("./tmp/pdf/output.pdf", encoding='ISO-8859-1',
              stream=True, guess=False)
df

| | McKinsey Global Institute |
|---|---|
| 0 | Disruptive technologies: Advances that will tr... |
| 1 | Exhibit E2 |
| 2 | Speed, scope, and economic value at stake of 1... |
| 3 | Illustrative rates of technology improvement I... |
| 4 | and diffusion resources that could be impacted... |
| 5 | Mobile $5 million vs. $40024.3 billion $1.7 tr... |
| 6 | Internet Price of the fastest supercomputer in... |
| 7 | an iPhone 4 today, equal in performance (MFLOP... |
| 8 | 6x 1 billion Interaction and transaction worker |
| 9 | Growth in sales of smartphones and tablets sin... |


In [10]:
df = read_pdf("./tmp/pdf/output.pdf", encoding='ISO-8859-1',
              stream=True, area=[269.875, 12.75, 790.5, 961], guess=False)
df

| | Unnamed: 0 | Unnamed: 1 | over past 5 years across industries such as manufacturing, | industries (manufacturing, health care, |
|---|---|---|---|---|
| 0 | | | 80â90% health care, and mining | and mining) |
| 1 | | | Price decline in MEMS (microelectromechanical ... | |
| 2 | | | systems) sensors in past 5 years Global machin... | |
| 3 | | | connections across sectors like transportation, | |
| 4 | | | security, health care, and utilities | |
| 5 | | Cloud | 18 months 2 billion | $1.7 trillion |
| 6 | | technology | Time to double server performance per dollar G... | GDP related to the Internet |
| 7 | | | 3x like Gmail, Yahoo, and Hotmail | $3 trillion |
| 8 | | | Monthly cost of owning a server vs. renting in... | Enterprise IT spend |
| 9 | | | the cloud North American institutions hosting ... | |
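
The `80â90%` in the first row is most likely an en dash (`80–90%`) whose UTF-8 bytes were decoded as ISO-8859-1, which is what `encoding='ISO-8859-1'` asks for. Assuming that is the cause, the damage is reversible as long as the intermediate characters are still present in the string. A sketch on a literal string, not on the actual DataFrame (the helper `fix_mojibake` is hypothetical):

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as ISO-8859-1 (latin-1)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake; leave the text untouched.
        return text


original = "80\u201390% health care"  # \u2013 is an en dash
garbled = original.encode("utf-8").decode("latin-1")
print(repr(garbled))  # 'â' followed by two control characters
print(fix_mojibake(garbled) == original)  # True
```

Reading the file without the `encoding` override may avoid the problem in the first place, depending on the platform's default encoding.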


## With Camelot

#### Installation

https://pypi.org/project/camelot-py/

`pip install camelot-py`

#### Camelot readme

https://github.com/socialcopsdev/camelot

In [11]:
import camelot
tables = camelot.read_pdf("./tmp/pdf/Food Calories List.pdf")
tables[0].df[1:3]

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 1 | Bagel ( 1 average ) | 140 cals (45g) | 310 cals | Medium |
| 2 | Biscuit digestives | 86 cals (per biscuit) | 480 cals | High |


In [12]:
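# Note: `area` is tabula-py's option name. Camelot's documented option for
# restricting the search region is `table_areas` in current releases. The
# output below matches the uncropped run for page 32 in In [13], so `area`
# appears to have no effect here.
tables1 = camelot.read_pdf("./tmp/pdf/MGI_Disruptive_technologies_Full_report_May2013.pdf", pages='32', area=[269.875, 120.75, 790.5, 561])
print(tabulate(tables1[0].df))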

--  -------------
 0  Mobile
    Internet
 1  Automation
    of knowledge
    work
 2  The Internet
    of Things
 3  Cloud
    technology
 4  Advanced
    robotics
 5  Autonomous
    and near-
    autonomous
    vehicles
 6  Next-
    generation
    genomics
 7  Energy
    storage
 8  3D printing
 9  Advanced
    materials
10  Advanced oil
    and gas
    exploration
    and recovery
11  Renewable
    energy
--  -------------
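
Camelot keeps the line breaks inside each cell, which is why tabulate prints `Mobile` and `Internet` on separate lines. A sketch of a cleanup step that joins the lines of one cell and closes up words hyphenated across a line break (the helper `join_cell_lines` is hypothetical; treating every trailing hyphen as a line-break hyphen is an assumption that happens to suit this table):

```python
def join_cell_lines(text):
    """Collapse a multi-line cell into a single line."""
    joined = ""
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        # Add a separating space unless the previous line ended in a hyphen.
        if joined and not joined.endswith("-"):
            joined += " "
        joined += line
    return joined


print(join_cell_lines("Mobile\nInternet"))
# Mobile Internet
print(join_cell_lines("Autonomous\nand near-\nautonomous\nvehicles"))
# Autonomous and near-autonomous vehicles
```

Applied to the frame above, this could look like `tables1[0].df[0].map(join_cell_lines)`.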


In [13]:
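for i in range(30, 35):
    print(i)
    tables = camelot.read_pdf("./tmp/pdf/MGI_Disruptive_technologies_Full_report_May2013.pdf", pages=str(i))
    try:
        print(tabulate(tables[0].df))
        # A page with fewer than two tables raises IndexError here, so 'NOK'
        # is printed even when the first table was found (see page 32).
        print(tabulate(tables[1].df))
    except IndexError:
        print('NOK')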


30
NOK
31
NOK
32
--  -------------
 0  Mobile
    Internet
 1  Automation
    of knowledge
    work
 2  The Internet
    of Things
 3  Cloud
    technology
 4  Advanced
    robotics
 5  Autonomous
    and near-
    autonomous
    vehicles
 6  Next-
    generation
    genomics
 7  Energy
    storage
 8  3D printing
 9  Advanced
    materials
10  Advanced oil
    and gas
    exploration
    and recovery
11  Renewable
    energy
--  -------------
NOK
33
NOK
34
NOK
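
Page 32 reports `NOK` in the output above even though one table was found, because the `IndexError` comes from `tables[1]`. Camelot's `TableList` supports `len()` and indexing, so the number of tables can be checked instead of catching the exception. A sketch with stand-in objects in place of a real `TableList`, so that it runs without the PDF (the helper `frames_on_page` is hypothetical):

```python
from types import SimpleNamespace


def frames_on_page(tables):
    """Return the .df of every table in a Camelot-style table list."""
    return [tables[i].df for i in range(len(tables))]


# Stand-ins: a page with one table and a page with none.
page_with_table = [SimpleNamespace(df="table 0 frame")]
empty_page = []

for page, tables in [(32, page_with_table), (33, empty_page)]:
    frames = frames_on_page(tables)
    print(page, "NOK" if not frames else f"{len(frames)} table(s)")
```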


## With PyPDF2

#### Installation

https://pypi.org/project/PyPDF2/

`pip install PyPDF2`

In [14]:
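import PyPDF2

# PyPDF2 1.x API. PyPDF2 3.x removed these names in favour of PdfReader,
# len(reader.pages), reader.pages[i] and page.extract_text().
with open('./tmp/pdf/Food Calories List.pdf', 'rb') as pdf_file:
    # Named pdf_reader so it does not shadow tabula's read_pdf imported above.
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = pdf_reader.getNumPages()
    page = pdf_reader.getPage(2)  # zero-based index, i.e. the third page
    page_content = page.extractText()
print(page_content.encode('utf-8'))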

b' Fish cake\n 90 cals per cake\n 200 cals\n Medium\n Fish fingers\n 50 cals per piece\n 220 cals\n Medium\n Gammon\n 320 cals\n 280 cals\n Med\n-High\n Haddock fresh\n 200 cals\n 110 cals\n Low calorie\n Halibut fresh\n 220 cals\n 125 cals\n Low calorie\n Ha\nm 6 cals\n 240 cals\n Medium\n Herring fresh grilled\n 300 cals\n 200 cals\n Medium\n Kidney\n 200 cals\n 160 cals\n Medium\n Kipper\n 200 cals\n 120 cals\n Low calorie\n Liver\n 200 cals\n 150 cals\n Medium\n Liver\n pate\n 150 cals\n 300 cals\n Medium\n Lamb (roast)\n 300 cals\n 300 cals\n Med\n-High\n Lobster boiled\n 200 cals\n 100 cals\n Low calorie\n Luncheon meat\n 300 cals\n 400 cals\n High\n Mackeral\n 320 cals\n 300 cal\ns Medium\n Mussels\n 90 cals\n 90 cals\n Low\n-Med\n Pheasant roast\n 200 cals\n 200 cals\n Medium\n Pilchards (tinned)\n 140 cals\n 140 cals\n Medium\n Prawns\n 180 cals\n 100 cals\n Low\n- Med\n Pork \n 320 cals\n 290 cals\n Med\n-High\n Pork pie\n 320 cals\n 450 cals\n High\n Rabbit\n 200 cals\n 180 



In [15]:
import numpy

table_list = page_content.split('\n')
# Integer division: array_split expects a whole number of sections.
rows = numpy.array_split(table_list, len(table_list) // 4)
for i in range(0, 5):
    print(rows[i])

[' Fish cake' ' 90 cals per cake' ' 200 cals' ' Medium']
[' Fish fingers' ' 50 cals per piece' ' 220 cals' ' Medium']
[' Gammon' ' 320 cals' ' 280 cals' ' Med']
['-High' ' Haddock fresh' ' 200 cals' ' 110 cals']
[' Low calorie' ' Halibut fresh' ' 220 cals' ' 125 cals']
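
The rows above drift out of alignment from the fourth chunk on because `extractText` breaks `Med-High` into `Med` and `-High`, which shifts every later group of four. A partial workaround, as a sketch: glue a line that starts with a hyphen back onto the previous cell before chunking. This only handles the hyphen case; other breaks visible in the raw text (`Ha` / `m 6 cals`, `Liver` / `pate`) would still need separate handling. The sample text is a shortened copy of the start of the extracted page, and the helper `rows_of_four` is hypothetical.

```python
def rows_of_four(page_text):
    """Split extracted text into rows of four cells, rejoining '-' breaks."""
    cells = []
    for line in page_text.split("\n"):
        line = line.strip()
        if not line:
            continue
        if line.startswith("-") and cells:
            cells[-1] += line  # e.g. 'Med' + '-High'
        else:
            cells.append(line)
    return [cells[i:i + 4] for i in range(0, len(cells), 4)]


sample = (" Fish cake\n 90 cals per cake\n 200 cals\n Medium\n"
          " Gammon\n 320 cals\n 280 cals\n Med\n-High\n"
          " Haddock fresh\n 200 cals\n 110 cals\n Low calorie\n")
for row in rows_of_four(sample):
    print(row)
```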
