# Reading tables from PDF files
This project aims to read data tables from the reports of the Brazilian Ministry of Health. 
To do so, we use the *PDFplumber* library, which proved to be the easiest tool we found for this task.  

The sample files are in the following folder:  
~/workplace/github/epidemic_BRdata/BMH_Bulletins/

### General statements

In [1]:
import pandas as pd
import os
import re

In [2]:
import pdfplumber

--------------------

## Listing and checking available files  
- List all the bulletins available in the "covid" directory;  
- Check whether each one has already been downloaded and parsed. 

In [11]:
os.listdir('BMH_Bulletins/covid')

['be-covid-08-final-2.pdf',
 'Boletim-epidemiologico-SVS-04fev20.pdf',
 'BE6-Boletim-Especial-do-COE.pdf',
 'be-covid-08-final.pdf',
 'Boletim-epidemiologico-SVS-01-COE-inundacao.pdf',
 '2020-04-06-BE7-Boletim-Especial-do-COE-Atualizacao-da-Avaliacao-de-Risco.pdf',
 '2020-04-16---BE10---Boletim-do-COE-21h.pdf',
 '2020-02-21-Boletim-Epidemiologico03.pdf',
 'BE12-Boletim-do-COE.pdf',
 '2020-04-11-BE9-Boletim-do-COE.pdf',
 'Boletim-epidemiologico-COEcorona-SVS-13fev20.pdf',
 '2020-03-02-Boletim-Epidemiol--gico-04-corrigido.pdf',
 '2020-04-17---BE11---Boletim-do-COE-21h.pdf']
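
The second step (checking which bulletins still need parsing) is not shown in the cells above. A minimal sketch follows. It assumes each parsed bulletin is saved as a CSV with the same base name in a separate directory. Both that convention and the function name `pending_bulletins` are hypothetical, not part of the original project.

```python
import os

def pending_bulletins(pdf_dir, parsed_dir):
    """List PDF files in pdf_dir that have no matching CSV in parsed_dir.

    Assumption: a parsed bulletin 'name.pdf' is stored as 'name.csv'.
    """
    # Base names (without extension) of the bulletins already parsed.
    parsed = set()
    if os.path.isdir(parsed_dir):
        parsed = {os.path.splitext(name)[0]
                  for name in os.listdir(parsed_dir)
                  if name.lower().endswith('.csv')}
    # Keep only the PDFs whose base name is not among the parsed ones.
    return sorted(name for name in os.listdir(pdf_dir)
                  if name.lower().endswith('.pdf')
                  and os.path.splitext(name)[0] not in parsed)
```

It could be called as `pending_bulletins('BMH_Bulletins/covid', 'BMH_Bulletins/parsed')`, where the second path is only an example.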

In [12]:
file1 = 'BMH_Bulletins/covid/2020-04-17---BE11---Boletim-do-COE-21h.pdf'

In [13]:
file2 = 'BMH_Bulletins/covid/2020-02-21-Boletim-Epidemiologico03.pdf'

--------------------

## Reading tables with PDFplumber:  
General reference: http://blog.rubypdf.com/2019/07/09/extract-table-from-pdf-with-python/

In [14]:
def parse_pdf(file):
    ## Opening the PDF file:
    filename = os.path.basename(file)  # Keep only the file name, without the directories.
    print('Reading the file "{0}"'.format(filename))
    try:
        pdf = pdfplumber.open(file)
    except Exception:
        print('It was not possible to open the file "{0}"'.format(filename))
        raise  # Re-raise the error: there is no `pdf` object to return.
    print('Done: {0} pages read.'.format(len(pdf.pages)))

    return pdf

In [15]:
pdf = parse_pdf(file1)

Reading the file "2020-04-17---BE11---Boletim-do-COE-21h.pdf"
Done: 37 pages read.
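
When only a quick summary of a file is needed, the PDF can be opened in a `with` block so that the file handle is closed afterwards. The sketch below is one possible way to do that. The helper name `summarize_pdf` and its `opener` parameter are hypothetical, added here so the logic can be tried without a real PDF.

```python
import os

def summarize_pdf(path, opener=None):
    """Return (file name, number of pages), closing the PDF afterwards."""
    if opener is None:
        # Imported lazily so the helper can be exercised with a stand-in opener.
        import pdfplumber
        opener = pdfplumber.open
    # The object returned by pdfplumber.open works as a context manager.
    with opener(path) as pdf:
        return os.path.basename(path), len(pdf.pages)
```

Calling `summarize_pdf(file1)` would use `pdfplumber.open` by default.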


In [16]:
pdf.pages

[<pdfplumber.page.Page at 0x7fe330250b00>,
 <pdfplumber.page.Page at 0x7fe330250d30>,
 <pdfplumber.page.Page at 0x7fe330250f98>,
 <pdfplumber.page.Page at 0x7fe3302531d0>,
 <pdfplumber.page.Page at 0x7fe330253438>,
 <pdfplumber.page.Page at 0x7fe3302536a0>,
 <pdfplumber.page.Page at 0x7fe330253940>,
 <pdfplumber.page.Page at 0x7fe330253ba8>,
 <pdfplumber.page.Page at 0x7fe330259048>,
 <pdfplumber.page.Page at 0x7fe3302592e8>,
 <pdfplumber.page.Page at 0x7fe330259550>,
 <pdfplumber.page.Page at 0x7fe3302597b8>,
 <pdfplumber.page.Page at 0x7fe3302599e8>,
 <pdfplumber.page.Page at 0x7fe330259c50>,
 <pdfplumber.page.Page at 0x7fe330259e80>,
 <pdfplumber.page.Page at 0x7fe330261128>,
 <pdfplumber.page.Page at 0x7fe330261550>,
 <pdfplumber.page.Page at 0x7fe330261780>,
 <pdfplumber.page.Page at 0x7fe3302619e8>,
 <pdfplumber.page.Page at 0x7fe330261be0>,
 <pdfplumber.page.Page at 0x7fe330261e10>,
 <pdfplumber.page.Page at 0x7fe3302680b8>,
 <pdfplumber.page.Page at 0x7fe330268358>,
 <pdfplumbe

--------------------

#### Reading the file and extracting its tables manually:

In [45]:
## Opening the PDF file:
pdf = pdfplumber.open(file1)

In [46]:
## Reading the file metadata (there wasn't any useful information in this case)
pdf.metadata

{'CreationDate': "D:20200418000047+00'00'",
 'Creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36',
 'ModDate': "D:20200418000047+00'00'",
 'Producer': 'Skia/PDF m81'}

In [47]:
## Checking the number of pages (pdf.pages is a list, so we take its length)
len(pdf.pages)

37

#### Exploring how PDFplumber works, based on pages of interest from the document:  
We are interested in extracting tables 1 and 2 from the document, which are on pages 2 and 4, respectively. 

In [48]:
page2 = pdf.pages[1]
page4 = pdf.pages[3]
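
Note that `pdf.pages` is zero-indexed, so page 2 of the document is `pdf.pages[1]`. When the pages holding tables are not known in advance, they can be located by scanning the whole document. The sketch below assumes only that each page object offers `extract_tables()`, as the PDFplumber pages used here do. The helper name `find_table_pages` is hypothetical.

```python
def find_table_pages(pages):
    """Map 1-based page numbers to the number of tables detected on each page."""
    found = {}
    # enumerate(..., start=1) converts list positions into document page numbers.
    for number, page in enumerate(pages, start=1):
        tables = page.extract_tables()
        if tables:
            found[number] = len(tables)
    return found
```

`find_table_pages(pdf.pages)` returns a dictionary whose keys are the page numbers as printed in the document, which is a quick way to decide which pages deserve a closer look.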

In [49]:
print(page4.extract_tables()[0][2][0])

Tabela   2:    Distribuição   dos   casos   e   óbitos   por   COVID-19   por   região   e   Unidade   da   Federação.   Brasil,   2020.  
CONFIRMADOS   ÓBITOS  
ID   UF/REGIÃO  
N   (%)   N   (%)  
NORTE   3.158   (9,4%)   193   (6,1%)  
1   AC   135   5  
2   AM   1.809   145  
3   AP   370   10  
4   PA   557   26  
5   RO   92   3  
6   RR   164   3  
7   TO   31   1  
NORDESTE   7.469   (22,2%)   479   (6,4%)  
8   AL   110   7  
9   BA   1.059   36  
10   CE   2.684   149  
11   MA   797   40  
12   PB   195   26  
13   PE   2.006   186  
14   PI   102   8  
15   RN   463   23  
16   SE   53   4  
SUDESTE   19.067   (56,6%)   1.329   (7,0%)  
17   ES   856   25  
18   MG   1.021   35  
19   RJ   4.349   341  
20   SP   12.841   928  
CENTRO-OESTE   1.386   (4,1%)   46   (3,3%)  
21   DF    746   20  
22   GO   335   16  
23   MS   143   5  
24   MT   162   5  
SUL   2.602   (7,7%)   94   (3,6%)  
25   PR   874   42  
26   RS   802   22  
27   SC   926   30  
BRASIL  
33.682   2.1
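
The cell printed above holds the whole table as a single block of text, with irregular runs of spaces between the values. If only this text were available, the state rows could still be recovered with a regular expression. This is a minimal sketch under the assumption that every state row follows the pattern "ID, two-letter UF, confirmed cases, deaths" seen in the output. The names `STATE_ROW`, `parse_br_int` and `parse_state_rows` are hypothetical.

```python
import re

# A state row looks like "2   AM   1.809   145" once whitespace is collapsed:
# ID, two-letter UF code, confirmed cases, deaths.
STATE_ROW = re.compile(r'^(\d+) ([A-Z]{2}) ([\d.]+) ([\d.]+)$')

def parse_br_int(text):
    """Convert a Brazilian-formatted integer such as '1.809' to 1809."""
    return int(text.replace('.', ''))

def parse_state_rows(cell_text):
    """Return (id, uf, confirmed, deaths) tuples for the state rows in the text."""
    rows = []
    for line in cell_text.splitlines():
        line = ' '.join(line.split())  # collapse runs of whitespace
        match = STATE_ROW.match(line)
        if match:  # header and region subtotal lines do not match and are skipped
            id_, uf, confirmed, deaths = match.groups()
            rows.append((int(id_), uf, parse_br_int(confirmed), parse_br_int(deaths)))
    return rows
```

Region subtotals such as the "NORTE" line are skipped on purpose, since they can be recomputed from the state rows.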

In [50]:
table1_df = pd.DataFrame(page4.extract_tables()[1])

In [51]:
table1_df

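|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | ID | UF/REGIÃO | CONFIRMADOS | ÓBITOS |
| 1 |  |  | N (%) | N (%) |
| 2 | NORTE |  | 3.158 (9,4%) | 193 (6,1%) |
| 3 | 1 | AC | 135 | 5 |
| 4 | 2 | AM | 1.809 | 145 |
| 5 | 3 | AP | 370 | 10 |
| 6 | 4 | PA | 557 | 26 |
| 7 | 5 | RO | 92 | 3 |
| 8 | 6 | RR | 164 | 3 |
| 9 | 7 | TO | 31 | 1 |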
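
The raw table still mixes two header rows, region subtotal rows and state rows, and the numbers are strings in Brazilian format. One possible cleaning step is sketched below. It assumes that every row has the four columns shown above and that a region row is one whose second column is empty. The names `split_count` and `rows_to_records` are hypothetical.

```python
import re

# Matches "3.158 (9,4%)" (count plus percentage) as well as a plain "135".
COUNT_PCT = re.compile(r'^([\d.]+)\s*(?:\(([\d,]+)%\))?$')

def split_count(value):
    """Return (count, percent) from '3.158 (9,4%)'; percent is None when absent."""
    match = COUNT_PCT.match(value.strip())
    if match is None:
        raise ValueError('Unexpected value: {0!r}'.format(value))
    count = int(match.group(1).replace('.', ''))   # '.' is the thousands separator
    percent = match.group(2)
    return count, (float(percent.replace(',', '.')) if percent else None)

def rows_to_records(rows, header_rows=2):
    """Turn the raw rows into one dict per state, tagged with its region."""
    records, region = [], None
    for first, second, confirmed, deaths in rows[header_rows:]:
        if not second:       # region subtotal row: name in the first column, no UF
            region = first
            continue
        records.append({'id': int(first), 'uf': second, 'region': region,
                        'confirmed': split_count(confirmed)[0],
                        'deaths': split_count(deaths)[0]})
    return records
```

Under these assumptions, `pd.DataFrame(rows_to_records(page4.extract_tables()[1]))` would give one row per state with integer columns.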
