**Table of contents**<a id='toc0_'></a>    
- 1. [Using camelot-py](#toc1_)    
  - 1.1. [Get tables from a PDF file](#toc1_1_)    
  - 1.2. [Inspect Extracted Tables](#toc1_2_)    
  - 1.3. [View a table as a DataFrame](#toc1_3_)    
  - 1.4. [Export tables as CSV files in a zip archive](#toc1_4_)    
  - 1.5. [Table Processing](#toc1_5_)    
- 2. [Using pdfplumber](#toc2_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
# import libraries
import pandas as pd
import camelot

# 1. <a id='toc1_'></a>[Using camelot-py](#toc0_)

## 1.1. <a id='toc1_1_'></a>[Get tables from a PDF file](#toc0_)


In [2]:
url_file = "https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf"


def get_table_from_pdf(path_file: str = url_file, method: str = "lattice"):
    """Extract every table from a PDF with camelot; `method` is the camelot flavor."""
    if not path_file.lower().endswith(".pdf"):
        raise ValueError("File is not a PDF")

    # Read all pages; exporting is left to the dedicated export step
    tables = camelot.read_pdf(filepath=path_file, pages="all", flavor=method)
    print(f"{path_file} successfully loaded!")

    if len(tables) == 0:
        # Raise instead of printing, so the caller never receives None
        raise ValueError(
            "No table extracted! Try changing the method parameter (e.g. to 'stream')."
        )

    print(f"{len(tables)} tables extracted")
    return tables


In [3]:
# set path to pdf file
pdf_file = r"../files/Rapport potablité Eau RADESS 28-02-24.pdf"

tables = get_table_from_pdf(path_file=pdf_file, method="stream")


../files/Rapport potablité Eau RADESS 28-02-24.pdf successfully loaded!
11 tables extracted
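
The function above suggests switching the flavor by hand when nothing is extracted. A minimal sketch of automating that retry is shown below. `read_with_fallback` and its `reader` parameter are hypothetical names introduced for illustration; in this notebook `reader` would be `camelot.read_pdf`, and a stand-in reader is used here so the example runs without a PDF.

```python
def read_with_fallback(reader, path, flavors=("lattice", "stream")):
    """Try each flavor in order; return (flavor, tables) for the first non-empty result.

    `reader` is any callable accepting (path, pages=..., flavor=...),
    which is how camelot.read_pdf is called in this notebook.
    Hypothetical helper, not part of camelot.
    """
    for flavor in flavors:
        tables = reader(path, pages="all", flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    raise ValueError(f"No table extracted from {path} with flavors {flavors}")


# Stand-in reader (assumption): only the "stream" flavor finds tables.
def fake_reader(path, pages, flavor):
    return ["table-1", "table-2"] if flavor == "stream" else []


flavor, tables_found = read_with_fallback(fake_reader, "report.pdf")
print(flavor, len(tables_found))
```

With camelot installed, the equivalent call would be `read_with_fallback(camelot.read_pdf, pdf_file)`.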


## 1.2. <a id='toc1_2_'></a>[Inspect Extracted Tables](#toc0_)


In [4]:
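def display_tables_info(tables) -> pd.DataFrame:
    """Summarise each table: camelot's parsing report plus the table shape."""
    if len(tables) == 0:
        raise ValueError("No tables found!")
    table_infos = [
        # dict union with | requires Python 3.9+
        table.parsing_report | {"n_rows": table.shape[0], "n_cols": table.shape[1]}
        for table in tables
    ]
    return pd.DataFrame(table_infos)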


display(display_tables_info(tables))


Unnamed: 0,accuracy,whitespace,order,page,n_rows,n_cols
0,95.43,58.39,1,1,23,7
1,97.62,43.45,2,1,21,8
2,99.61,29.62,1,2,46,8
3,99.5,28.57,1,3,35,8
4,94.93,37.5,2,3,16,2
5,99.72,27.29,1,4,57,9
6,99.84,12.73,1,5,48,9
7,99.85,5.05,1,6,44,9
8,99.84,8.94,1,7,46,9
9,99.74,4.94,1,8,45,9
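
The report above can also be used to decide which tables are worth processing. The sketch below keeps the indices of tables whose `accuracy` is high and whose `whitespace` share is low. `select_reliable` and both thresholds are illustrative assumptions, not values recommended by camelot; the sample dicts copy the first four rows of the report.

```python
def select_reliable(reports, min_accuracy=99.0, max_whitespace=30.0):
    """Return the indices of reports that pass both thresholds (hypothetical helper)."""
    return [
        index
        for index, report in enumerate(reports)
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace
    ]


# First four rows of the parsing report shown above
reports = [
    {"accuracy": 95.43, "whitespace": 58.39, "order": 1, "page": 1},
    {"accuracy": 97.62, "whitespace": 43.45, "order": 2, "page": 1},
    {"accuracy": 99.61, "whitespace": 29.62, "order": 1, "page": 2},
    {"accuracy": 99.5, "whitespace": 28.57, "order": 1, "page": 3},
]

print(select_reliable(reports))  # [2, 3]
```

In the notebook, the input would be `[table.parsing_report for table in tables]`.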


## 1.3. <a id='toc1_3_'></a>[View a table as a DataFrame](#toc0_)


In [5]:
# display the first table as a pandas DataFrame
df_table = tables[0].df
display(df_table.head())


Unnamed: 0,0,1,2,3,4,5,6
0,Code échantillon : 945-01/02,,Référence du client : RADEES 728,,,Date/heure début d'analyse : 29/02/2024 à 08h50,
1,Lieu d’exécution des analyses : LC2A,,"Condition de réception : T°C :5,2°C",,Date d’édition : 04/03/2024,,
2,Référence de la méthode d’échantillonnage:,Volume : 6 litre,,,Conditions spécifiques : 3°C à 8°C,,
3,INSPC/15/V01,,,,,,
4,,,,,,Critères,


## 1.4. <a id='toc1_4_'></a>[Export tables as CSV files in a zip archive](#toc0_)


In [6]:
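# compress=True bundles the per-table CSV files into a single zip archive
tables.export(r"../outputs/data.csv", f="csv", compress=True)
# other values of f: "json", "excel", "html", "markdown", "sqlite"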
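
To check what the export produced, the archive can be read back with the standard library alone. The sketch below is a hedged example: `read_csvs_from_zip` is a hypothetical helper, and the member name `data-page-1-table-1.csv` is an assumption about how camelot names the files. The demo builds a small in-memory archive so it runs without the notebook's outputs.

```python
import csv
import io
import zipfile


def read_csvs_from_zip(zip_path):
    """Return {member name: list of rows} for every CSV file in the archive."""
    result = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # Decode the bytes stream so csv.reader receives text
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                result[name] = list(csv.reader(text))
    return result


# Demo archive (assumed member name, made-up content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data-page-1-table-1.csv", "0,1\npH,74\n")
buffer.seek(0)

contents = read_csvs_from_zip(buffer)
print(contents)
```

For the export above, the path to pass would presumably be `../outputs/data.zip`; check the output folder for the actual file name.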


## 1.5. <a id='toc1_5_'></a>[Table Processing](#toc0_)

In [33]:
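df_table: pd.DataFrame = tables[0].df.copy()
# row 6 holds the column titles of the results table
cols = df_table.iloc[6, :]

# the file name ends with the report date as dd-mm-yy ("28-02-24")
df_table["date"] = pd.to_datetime(pdf_file[-12:-4], format="%d-%m-%y")
df_table.set_index("date", inplace=True)
df_table.columns = cols
# camelot stores blank cells as empty strings, not NaN, so dropna removes no rows here
df_table = df_table.dropna(subset=["Paramètre(s) microbiologiques", "Résultat"])

df_table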

date,Paramètre(s) microbiologiques,Méthode/Version,Résultat,(unité,,microbiologiques,Appréciation
2024-02-28,Code échantillon : 945-01/02,,Référence du client : RADEES 728,,,Date/heure début d'analyse : 29/02/2024 à 08h50,
2024-02-28,Lieu d’exécution des analyses : LC2A,,"Condition de réception : T°C :5,2°C",,Date d’édition : 04/03/2024,,
2024-02-28,Référence de la méthode d’échantillonnage:,Volume : 6 litre,,,Conditions spécifiques : 3°C à 8°C,,
2024-02-28,INSPC/15/V01,,,,,,
2024-02-28,,,,,,Critères,
2024-02-28,,,,,Incertitude,,
2024-02-28,Paramètre(s) microbiologiques,Méthode/Version,Résultat,(unité,,microbiologiques,Appréciation
2024-02-28,,,,,(%),,
2024-02-28,,,,,,Marocains(1) (VMA),
2024-02-28,Dénombrement de micro-organismes,,,,,,
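
The slice `pdf_file[-12:-4]` only works while the file name keeps exactly this length and layout. A slightly more defensive sketch, using only the standard library, is shown below. `report_date_from_filename` is a hypothetical helper, and the `dd-mm-yy` pattern is an assumption based on the single file name used in this notebook.

```python
import re
from datetime import date, datetime


def report_date_from_filename(path: str) -> date:
    """Parse a trailing dd-mm-yy date from a PDF file name (hypothetical helper)."""
    match = re.search(r"(\d{2}-\d{2}-\d{2})\.pdf$", path, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"No dd-mm-yy date found at the end of {path!r}")
    # An explicit format avoids any day/month guessing
    return datetime.strptime(match.group(1), "%d-%m-%y").date()


print(report_date_from_filename("../files/Rapport potablité Eau RADESS 28-02-24.pdf"))
# 2024-02-28
```

The returned `date` can be passed to `pd.to_datetime` unchanged if a pandas timestamp is needed for the index.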


# 2. <a id='toc2_'></a>[Using pdfplumber](#toc0_)

In [68]:
import pdfplumber
from pprint import pprint
import pandas as pd
from tqdm.autonotebook import tqdm

In [69]:
pdf_file = r"../files/Rapport potablité Eau RADESS 28-02-24.pdf"

with pdfplumber.open(pdf_file) as pdf:
    # extract_table() returns only the largest table of each page (None if there is none)
    tables = [page.extract_table() for page in tqdm(pdf.pages[:3])]

100%|██████████| 3/3 [00:01<00:00,  2.94it/s]


In [70]:
for table in tables:
    print(f"{'table':=^100}")
    display(pd.DataFrame(table))



Unnamed: 0,0,1,2,3,4,5,6,7
0,Hydrocarbures polycycliques aromatiques (HAP),,,,,,,
1,Paramètre au laboratoire,Méthode/Version,Résultat,Unité,LQ,Incertitude\n(%),VMA*,Appréciation
2,Benzo(b) fluorranthène*,NM ISO 28540 (2014),<LQ,µg/l,001,10,01,S
3,Benzo(k) fluorranthène*,,<LQ,µg/l,001,10,01,S
4,Benzo(ghi) pérylène*,,<LQ,µg/l,001,10,01,S
5,Indénol(1.2.3-cd) pyrène*,,<LQ,µg/l,001,10,01,S
6,Benzo(a) pyrène*,,<LQ,µg/l,001,10,01,S
7,Benzène,NM ISO 17943 (2019),<LQ,µg/l,001,10,1,S
8,Trihalométhanes (THM),,,,,,,
9,Paramètre au laboratoire,Méthode/Version,Résultat,Unité,LQ,Incertitude\n(%),VMA*,Appréciation
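
The table above actually contains several sub-tables, each introduced by a title row such as "Hydrocarbures polycycliques aromatiques (HAP)" in which only the first cell is filled. A minimal sketch of splitting on those rows follows. `split_sections` is a hypothetical helper, and the title rule is an assumption drawn from this output only; the sample rows are shortened copies of the rows shown.

```python
def split_sections(rows):
    """Group rows under the title row that precedes them.

    A row counts as a title when only its first cell is filled.
    Rows that appear before the first title are ignored.
    """
    sections = {}
    current = None
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if cells and cells[0] and not any(cells[1:]):
            current = cells[0]
            sections[current] = []
        elif current is not None and any(cells):
            sections[current].append(cells)
    return sections


rows = [
    ["Hydrocarbures polycycliques aromatiques (HAP)", None, None, None],
    ["Paramètre au laboratoire", "Méthode/Version", "Résultat", "Unité"],
    ["Benzène", "NM ISO 17943 (2019)", "<LQ", "µg/l"],
    ["Trihalométhanes (THM)", "", "", ""],
    ["Paramètre au laboratoire", "Méthode/Version", "Résultat", "Unité"],
]

sections = split_sections(rows)
for title, body in sections.items():
    print(title, len(body))
```

Each value in the result starts with its own header row, so every section can be turned into a separate DataFrame.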




Unnamed: 0,0,1,2,3,4,5,6,7
0,Paramètre au laboratoire,Méthode/Version,Résultat,Unité,LQ,Incertitude\n(%),VMA*,Appréciation
1,pH*,NM ISO 10523 (2012),74,UpH,-,74,"6,5 - 8,5",S
2,Conductivité électrique*,NM ISO 7888 (2001),719,µS/cm à\n20°C,-,154,2700,S
3,Couleur réelle*,NM ISO 7887 (2012),ND,Pt mg/l,-,177,20,S
4,Odeur,NM 03.7.16 (1990),"1,5eme seuil",-,-,10,3,S
5,Saveur,NM 03.7.17 (1990),"1,5eme seuil",-,-,10,3,S
6,Turbidité*,NM ISO 7027-1 (2019),044,NTU,-,165,5,S
7,Oxydabilité au KMnO *\n4,NM ISO 8467 (2012),053,mgO /l\n2,05,177,5,S
8,Température,NM 03.7.008 (1989),198,°C,-,12,Acceptable,S
9,Oxygène dissous,NM ISO 5814 (2012),647,mg/l,-,15,Non spécifique,-
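
In this output the first row is the real header, and some cells keep line breaks from the PDF layout (`Incertitude\n(%)`, `µS/cm à\n20°C`). The sketch below promotes that row to keys and collapses the whitespace. `table_to_records` is a hypothetical helper; it assumes unique header names and does not repair subscript residue such as `KMnO *\n4`.

```python
def table_to_records(rows):
    """Turn [header, *data] rows into dicts keyed by the cleaned header cells."""

    def clean(cell):
        # Empty cells may come back as None; in-cell line breaks become single spaces
        return " ".join((cell or "").split())

    header = [clean(cell) for cell in rows[0]]
    return [dict(zip(header, (clean(cell) for cell in row))) for row in rows[1:]]


rows = [
    ["Paramètre au laboratoire", "Méthode/Version", "Résultat", "Unité",
     "LQ", "Incertitude\n(%)", "VMA*", "Appréciation"],
    ["Conductivité électrique*", "NM ISO 7888 (2001)", "719", "µS/cm à\n20°C",
     "-", "154", "2700", "S"],
]

records = table_to_records(rows)
print(records[0]["Unité"])
print(records[0]["Incertitude (%)"])
```

The resulting list can be passed straight to `pd.DataFrame(records)` to get named columns instead of 0 to 7.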




Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,Pesticis - Organochlorés (OCl),,,,,,,,,
1,Paramètre,Méthode/Version,Résultat,Unité,LQ,Incertitude,,,VMA*,Appréciation
2,,,,,,(%),,,,
3,*Aldrine,Pesticides Organochlorés\nNM 03.7.202 (1996),<LQ,µg/l,001,30,,,003,S
4,*Endousulfane,,<LQ,µg/l,001,30,,,01,S
5,*HCH,,<LQ,µg/l,001,30,,,01,S
6,*Lindane,,<LQ,µg/l,001,30,,,01,S
7,*Dieldrine,,<LQ,µg/l,001,30,,,003,S
8,*Endrine,,<LQ,µg/l,001,30,,,01,S
9,*Heptachlore,,<LQ,µg/l,001,30,,,003,S
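
In the table above, the method appears only on the first row (`*Aldrine`) and is blank on the rows below it. Assuming those blanks come from a vertically merged cell in the PDF, which this output alone cannot confirm, the value can be copied downwards. `forward_fill_column` is a hypothetical helper, and the sample rows are shortened to three columns.

```python
def forward_fill_column(rows, col):
    """Copy the last non-empty value of column `col` into the blank cells below it."""
    filled = []
    last = None
    for row in rows:
        row = list(row)  # copy, so the input rows stay untouched
        if row[col]:
            last = row[col]
        elif last is not None:
            row[col] = last
        filled.append(row)
    return filled


rows = [
    ["*Aldrine", "Pesticides Organochlorés\nNM 03.7.202 (1996)", "<LQ"],
    ["*Endousulfane", None, "<LQ"],
    ["*HCH", "", "<LQ"],
]

for row in forward_fill_column(rows, col=1):
    print(row)
```

The helper should be applied to the data rows of one section at a time; otherwise a method could leak into the next section. With a DataFrame, roughly the same effect comes from `Series.ffill()` after turning empty strings into missing values.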
