**Table of contents**<a id='toc0_'></a>    
- 1. [Using pdfplumber](#toc1_)
  - 1.1. [Extract tables from all PDF files](#toc1_1_)


# 1. <a id='toc1_'></a>[Using pdfplumber](#toc0_)

In [1]:

import pdfplumber
from pprint import pprint
import pandas as pd
from tqdm.notebook import tqdm
import os

## 1.1. <a id='toc1_1_'></a>[Extract tables from all PDF files](#toc0_)

In [212]:
# Collect every PDF under a directory (searched recursively) whose file name contains any of the keywords
def get_pdfs_with_keyword(directory, keywords:list[str]):
    pdf_files_list = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".pdf"):
                if any(keyword in file for keyword in keywords):
                    pdf_files_list.append(os.path.join(root, file))
    return pdf_files_list
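
The same search can also be sketched with `pathlib`, offered here as an alternative rather than a replacement. The helper name `find_pdfs` and the case-insensitive matching are assumptions that do not appear in the original notebook.

```python
from pathlib import Path


def find_pdfs(directory, keywords):
    """Return sorted paths of PDFs under `directory` whose names contain any keyword.

    Hypothetical helper: unlike get_pdfs_with_keyword, matching ignores case
    for both the file extension and the keywords.
    """
    lowered = [keyword.lower() for keyword in keywords]
    return sorted(
        str(path)
        for path in Path(directory).rglob("*")  # walk the tree recursively
        if path.is_file()
        and path.suffix.lower() == ".pdf"
        and any(keyword in path.name.lower() for keyword in lowered)
    )
```

Sorting the result makes the file order reproducible across operating systems, since directory traversal order is not guaranteed.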

In [213]:
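def get_potable_water_data(potable_pdf_reports):
    """Extract the tables on the first two pages of each report into one DataFrame."""
    if not potable_pdf_reports:
        raise ValueError("No PDF files found!")
    print(f"{len(potable_pdf_reports)} pdf files found")
    frames = []

    for pdf_file in tqdm(potable_pdf_reports):
        # File names end in either dd-mm-yy.pdf or dd-mm-yyyy.pdf;
        # normalize both to dd-mm-yy.
        if pdf_file[-8:-6] == "20":
            date = pdf_file[-14:-8] + pdf_file[-6:-4]
        else:
            date = pdf_file[-12:-4]

        with pdfplumber.open(pdf_file) as pdf:
            for page in pdf.pages[:2]:
                table = page.extract_table()
                if table is None:  # no table detected on this page
                    continue
                df = pd.DataFrame(table)
                df["date"] = date
                frames.append(df)

    # Concatenate once at the end instead of growing a DataFrame inside the loop
    return pd.concat(frames, axis=0) if frames else pd.DataFrame()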
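
The slicing in `get_potable_water_data` relies on the date sitting at a fixed offset from the end of the file name. A regular expression states that assumption more explicitly. The following is a minimal sketch, assuming every report name ends in `dd-mm-yy.pdf` or `dd-mm-yyyy.pdf`; the helper name `extract_report_date` is hypothetical and not part of the notebook.

```python
import re

# Day, month, and a two- or four-digit year right before the ".pdf" extension.
DATE_PATTERN = re.compile(r"(\d{2})-(\d{2})-(?:\d{2})?(\d{2})\.pdf$")


def extract_report_date(pdf_file):
    """Return the report date as dd-mm-yy, or None if the name has no trailing date."""
    match = DATE_PATTERN.search(pdf_file)
    if match is None:
        return None
    day, month, year = match.groups()
    return f"{day}-{month}-{year}"


print(extract_report_date("Rapport potablité Eau RADESS 01-03-24.pdf"))    # 01-03-24
print(extract_report_date("Rapport potablité Eau RADESS 01-06-2024.pdf"))  # 01-06-24
```

Unlike the slicing approach, a file name without a trailing date yields `None` instead of an arbitrary substring, so malformed names can be detected and skipped.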

In [187]:
# Directory to search, and the keywords a report's file name must contain
directory_path = r'../files'
keywords = ['RADEES', 'RADESS']
potable_pdf_reprts = get_pdfs_with_keyword(directory_path, keywords)
pprint(potable_pdf_reprts)

['../files\\Rapport potablité Eau RADESS 01-03-24.pdf',
 '../files\\Rapport potablité Eau RADESS 01-04-24.pdf',
 '../files\\Rapport potablité Eau RADESS 01-05-24.pdf',
 '../files\\Rapport potablité Eau RADESS 01-06-2024.pdf',
 '../files\\Rapport potablité Eau RADESS 02-03-24.pdf',
 '../files\\Rapport potablité Eau RADESS 02-04-24.pdf',
 '../files\\Rapport potablité Eau RADESS 02-05-24.pdf',
 '../files\\Rapport potablité Eau RADESS 02-06-2024.pdf',
 '../files\\Rapport potablité Eau RADESS 03-03-24.pdf',
 '../files\\Rapport potablité Eau RADESS 03-04-24.pdf',
 '../files\\Rapport potablité Eau RADESS 03-05-24.pdf',
 '../files\\Rapport potablité Eau RADESS 03-06-2024.pdf',
 '../files\\Rapport potablité Eau RADESS 04-03-24.pdf',
 '../files\\Rapport potablité Eau RADESS 04-04-24.pdf',
 '../files\\Rapport potablité Eau RADESS 04-05-24.pdf',
 '../files\\Rapport potablité Eau RADESS 04-06-2024.pdf',
 '../files\\Rapport potablité Eau RADESS 05-03-24.pdf',
 '../files\\Rapport potablité Eau RADESS

In [214]:
water_dataframe: pd.DataFrame = get_potable_water_data(potable_pdf_reprts[:])
water_dataframe.head()

125 pdf files found



|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | date | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hydrocarbures polycycliques aromatiques (HAP) | | | | | | | | 01-03-24 | | |
| 1 | Paramètre au laboratoire | Méthode/Version | Résultat | Unité | LQ | Incertitude\n(%) | VMA* | Appréciation | 01-03-24 | | |
| 2 | Benzo(b) fluorranthène* | NM ISO 28540 (2014) | <LQ | µg/l | 001 | 10 | 01 | S | 01-03-24 | | |
| 3 | Benzo(k) fluorranthène* | | <LQ | µg/l | 001 | 10 | 01 | S | 01-03-24 | | |
| 4 | Benzo(ghi) pérylène* | | <LQ | µg/l | 001 | 10 | 01 | S | 01-03-24 | | |


In [215]:
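# Promote the second row (the table's header row) to column labels. The "date"
# column would otherwise be labeled with that row's date value, so name it explicitly.
header = water_dataframe.iloc[1].copy()
header["date"] = "date"
df = (
    water_dataframe.set_axis(header, axis="columns")
    .dropna(subset=["Résultat"])  # drop section-title rows, which have no result
    .query("Résultat != 'Résultat'")  # drop the header row repeated for each table
)
df = df[df.columns.dropna()]  # discard columns that have no header label

# Keep only the date, parameter, and result columns, with the date first.
df = df[[col for col in df.columns if col.startswith(("Param", "Résu", "date"))]]
df.insert(0, "date", df.pop("date"))
df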

| 1 | date | Paramètre au laboratoire | Résultat |
|---|---|---|---|
| 2 | 01-03-24 | Benzo(b) fluorranthène* | <LQ |
| 3 | 01-03-24 | Benzo(k) fluorranthène* | <LQ |
| 4 | 01-03-24 | Benzo(ghi) pérylène* | <LQ |
| 5 | 01-03-24 | Indénol(1.2.3-cd) pyrène* | <LQ |
| 6 | 01-03-24 | Benzo(a) pyrène* | <LQ |
| ... | ... | ... | ... |
| 28 | 31-05-24 | Manganèse* (Mn) | <LQ |
| 29 | 31-05-24 | Zinc* (Zn) | 0019 |
| 30 | 31-05-24 | Fer* (Fe) | <LQ |
| 31 | 31-05-24 | Cyanures | <LQ |
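
All three columns are still plain strings at this point. As a possible follow-up, here is a sketch that parses the dates and flags results reported as `<LQ`, assuming the `dd-mm-yy` date format seen in the output above. The sample rows are copied from that output for illustration, and the column names `parsed_date` and `below_lq` are hypothetical.

```python
import pandas as pd

# Illustrative rows shaped like the output above (not the full dataset).
sample = pd.DataFrame(
    {
        "date": ["31-05-24", "31-05-24", "01-03-24"],
        "Paramètre au laboratoire": ["Cyanures", "Zinc* (Zn)", "Benzo(a) pyrène*"],
        "Résultat": ["<LQ", "0019", "<LQ"],
    }
)

# Parse the dd-mm-yy strings into timestamps so rows sort chronologically.
sample["parsed_date"] = pd.to_datetime(sample["date"], format="%d-%m-%y")

# Flag results reported as "<LQ" rather than forcing them into numbers.
sample["below_lq"] = sample["Résultat"].str.strip().eq("<LQ")

print(sample.sort_values("parsed_date"))
```

The numeric results are deliberately left as strings here: a value such as `0019` may have lost its decimal separator during extraction, so converting it to a number without checking the source PDF could be misleading.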


In [216]:
df.to_excel(r'../outputs/data.xlsx', index=False)