
# Task 1A : pure text extraction from a pdf
We must retrieve the plain text from a pdf file and identify the questions in a questionnaire. 


# Index of Titles in the Notebook

1. [Task 1A: Pure Text Extraction from a PDF](#task-1a--pure-text-extraction-from-a-pdf)
2. [PDF Text Extraction Methods](#pdf-text-extraction-methods)
3. [Retriever 1: PdfPlumber](#retriever-1-pdfplumber)
4. [Retriever 2: Pypdf2](#retriever-2-pypdf2)
5. [Retriever 3: PyMuPDF](#retriever-3-pymupdf)
6. [Retriever 4: PDFMiner.six](#retriever-4-pdfminer.six)
7. [Retriever 5: pdftotext](#retriever-5-pdftotext)
8. [Retriever 6: tabula-py](#retriever-6-tabula-py)
9. [Retriever Comparison](#retriever-comparison)
10. [Scalene Report](#scalene-report)
11. [Scalene Performance Result](#scalene-performance-result)
12. [Text Quality Check](#text-quality-check)



## PDF text extraction methods




| Library          | Key Features                              | Installation          |
|------------------|-------------------------------------------|----------------------|
| **PyMuPDF**      | Text, tables, images, metadata, OCR-ready | `pip install pymupdf` |
| **pdfplumber**   | Text + tables, handles complex layouts    | `pip install pdfplumber` |
| **pdfminer.six** | Deep text analysis (fonts, positions)     | `pip install pdfminer.six` |
| **pdftotext**    | Requires `poppler`, clean PDFs only       | `pip install pdftotext` |
| **PyPDF2**       | Lightweight, simple text extraction       | `pip install PyPDF2` |

Beyond getting each library to work, we need to gather statistics that allow a fair comparison. In addition to the usual Python profilers, we will try the Python library Scalene, which is widely used in research. I could not get it to work on notebook cells, so I put the code in Python files and ran it from the terminal. Go to [Scalene Report](#scalene-report) to view the results.


**We will try as many libraries as possible and append each one's measurements (elapsed time, word count, memory change, CPU usage and alphanumeric character count) to the following list.** Check out [Retriever Comparison](#retriever-comparison).


In [1]:
retrieval_stats = []
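
Each retriever cell below repeats the same start/stop tracking lines. As a sketch only (this is not the code that produced the numbers in this notebook), the timing part of that boilerplate could be wrapped in a context manager. The names `timed` and `demo_stats` are hypothetical, CPU usage is approximated from `time.process_time()` instead of being read from `psutil`, and memory tracking is left out.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stats, name):
    """Append a record with wall-clock time and approximate CPU usage to `stats`."""
    record = {"name": name}
    wall_start = time.perf_counter()   # monotonic wall-clock timer
    cpu_start = time.process_time()    # CPU time consumed by this process
    try:
        yield record                   # the caller can add fields such as num_words
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        record["time"] = wall
        record["cpu_percent"] = 100.0 * cpu / wall if wall > 0 else 0.0
        stats.append(record)


# Demo on a short string instead of a real PDF
demo_stats = []
with timed(demo_stats, "demo") as record:
    demo_words = "Hématies 4,94 Téra/L".split()
    record["num_words"] = len(demo_words)

print(demo_stats[0]["name"], demo_stats[0]["num_words"])
```

With a helper like this, a retriever cell would only contain the extraction code inside the `with` block, and the record would land in the stats list even if the extraction raised an error.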


## Retriever 1: PdfPlumber

In [2]:
%pip install pdfplumber

Note: you may need to restart the kernel to use updated packages.


In [3]:


###                                   TRACKING                                     ###
# Initialize tracking
import time
import psutil
import os
process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()
###                                   TRACKING END                                 ###
import pdfplumber
# PDF Processing
words = []
with pdfplumber.open("highlighted_output.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ""  # fall back to an empty string if a page yields no text
        words.extend(text.split())


###                                   TRACKING                                     ###
# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss - mem_before
total_time = time.time() - start_time

# Store results
retrieval_stats.append( {
    "name": "pdfplumber",
    "time": total_time,
    "num_words": len(words),
    "memory_MB": memory_MB / (1024 * 1024),
    "cpu_percent": cpu_used,
    "num_alphanum_chars": sum(1 for word in words for c in word if c.isalnum()),
})
###                                   TRACKING END                                 ###

# Print the words, 10 per line
for i in range(0, len(words), 10):
    print(' '.join(words[i:i+10]))

 Copie électronique   N° FINESS : 34 3
 2  - 
-  (:  7: 04
67 53 15 44   - Biologiste(s) Médical(aux) Docteur
    Madame    CABINET MEDICAL "LA " 250
 DES      (100) Copie à
: Docteur   , DR  X Demande
n° 01/02/ -LABO--TP Edité le, lundi 1 février 2021 Copie
à : Docteur   , DR Patient né(e)   le
  FSE Tiers payant  - 
Prélèvements effectués par le laboratoire le 01/02/21 à 10H27 Vos
résultats sur internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable
1) Communiquez votre mail au laboratoire 2) Recevez un email
dès que vos résultats sont disponibles 3) Cliquez sur le
lien INFORMATION COVID-19 Rendez-vous sur notre site internet dédié pour
connaître notre organisation : https://.fr/depistage-covid-19/ Hématologie Valeurs de référence Antériorités
✔ Hémogramme (Sang total - Variation d'impédance, photométrie, cytométrie en
flux)  -   Hématies ........................................ 4,94Téra/L 3,80 à
5,90 4,97 Hémoglobine .................................... 13,6g/dL 11,5 à 

**Benefits**: 
- Reads all the text quickly and with high precision, and respects the line order of the page.
- Simple installation, high compatibility.

**Downsides**: 
- Only works on PDFs that contain a text layer (text that can be selected and copied), not on images. Scanned or image-only PDFs would need OCR first.
- The example opens the file by its bare name, so the PDF has to be in the working directory. `pdfplumber.open` also accepts a full path, which is what we would use when scaling to other folders.
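
On the scaling concern, a minimal sketch of how the PDFs of a folder could be collected as absolute paths before extraction. The function name `list_pdfs` and the folder layout are hypothetical, and the demo uses empty placeholder files, not real PDFs.

```python
from pathlib import Path
import tempfile


def list_pdfs(folder):
    """Return the absolute paths of all .pdf files directly inside `folder`, sorted."""
    return sorted(p.resolve() for p in Path(folder).glob("*.pdf"))


# Demo with a throwaway folder holding two placeholder PDFs and one other file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b_report.pdf", "a_report.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in list_pdfs(tmp)]

print(found)
```

Each returned path could then be passed to the extraction call in place of the bare file name, for example `pdfplumber.open(str(path))`.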

In [4]:
entire_text = ' '.join(words)

[Return to the top](#task-1a--pure-text-extraction-from-a-pdf)

## Retriever 2: Pypdf2


In [5]:
%pip install pypdf2

Collecting pypdf2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: pypdf2
Successfully installed pypdf2-3.0.1
Note: you may need to restart the kernel to use updated packages.


In [6]:


###                                   TRACKING                                     ###
import psutil
import os
import time

# Initialize tracking
process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()
###                                   TRACKING END                                 ###


# PDF Processing
from PyPDF2 import PdfReader
words2 = []
reader = PdfReader("highlighted_output.pdf")
for page in reader.pages:
    text = page.extract_text() 
    words2.extend(text.split())


###                                   TRACKING                                     ###

# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

retrieval_stats.append( {
    "name": "pypdf2",
    "time": total_time,
    "num_words": len(words2),
    "memory_MB": memory_MB / (1024 * 1024),  # Delta in MB
    "cpu_percent": cpu_used,
    "num_alphanum_chars": sum(1 for word in words2 for c in word if c.isalnum()),
}
)

###                                   TRACKING END                                 ###

for i in range(0, len(words2), 10):
    print(' '.join(words2[i:i+10]))


  N° FINESS :  7
av du  de  -  -  (:
 7: 53 15
44   - Biologiste(s) Médical(aux) Docteur     CABINET
MEDICAL "LA "   Madame    250 
DES    (100) Copie à : Docteur 
 , DR  X Demande n° 01/02/ -LABO--TP
Edité le, lundi 1 février 2021 Patient né(e)   le
Copie à : Docteur   , DR 
FSE Tiers payant  -  Prélèvements effectués par le
laboratoire le 01/02/21 à 10H27 Vos résultats sur internet :
Accès sécurisé, rapide, gratuit, pratique, écoresponsable 1) Communiquez votre mail
au laboratoire 2) Recevez un email dès que vos résultats
sont disponibles 3) Cliquez sur le lien INFORMATION COVID-19 Rendez-vous
sur notre site internet dédié pour connaître notre organisation :
https://.fr/depistage-covid-19/ Hématologie Valeurs de référence Antériorités ✔Hémogramme (Sang total -
Variation d'impédance, photométrie, cytométrie en flux)  -  
Hématies ........................................ 4,94 Téra/L3,80 à 5,90 4,97 Hémoglobine .................................... 13,6
g/dL 11,5 à 17,5 13,8 8,4mmol/L 7,

**Benefits**: 
- Extremely quick: takes even less time than pdfplumber (0.42 s against 0.96 s in the [Retriever Comparison](#retriever-comparison)).
- Extracts all the available text.
- Lightweight and easy installation.

**Downsides**: 
- Cannot recognize characters in images (no OCR); it can only read text that can be copied.
- The example also opens the file by bare name. `PdfReader` accepts a full path as well, so reaching another folder should work, but this has not been tested here.
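
Because only copyable text is returned, a scanned page comes back empty or nearly empty. A rough way to spot such pages is to count the alphanumeric characters per page after extraction. This is a sketch under the assumption that very few characters means there is no text layer; the function name `pages_needing_ocr` and the threshold of 20 characters are hypothetical and would need tuning on real documents.

```python
def pages_needing_ocr(page_texts, min_alnum=20):
    """Return the 1-based numbers of pages with fewer than `min_alnum` alphanumeric characters."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Treat a missing page text (None) like an empty string
        alnum_count = sum(1 for c in (text or "") if c.isalnum())
        if alnum_count < min_alnum:
            flagged.append(number)
    return flagged


# Demo: page 2 is empty and page 3 holds only punctuation, as a scanned page might
sample_pages = [
    "Hémogramme (Sang total - Variation d'impédance, photométrie)",
    "",
    " - . - ",
]
print(pages_needing_ocr(sample_pages))
```

Here `page_texts` stands for the list of strings returned by `page.extract_text()` for each page of a document.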

[Return to the top](#task-1a--pure-text-extraction-from-a-pdf)

## Retriever 3: PyMuPDF

In [7]:
%pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.26.1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.1-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.1
Note: you may need to restart the kernel to use updated packages.


In [8]:

###                                   TRACKING                                     ###
import psutil
import os
import time
# Initialize tracking
process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()

###                                   TRACKING END                                 ###
import pymupdf # imports the pymupdf library

words3 = [] # initialize an empty list to store words

doc = pymupdf.open("highlighted_output.pdf") # open a document
for page in doc: # iterate the document pages
    text = page.get_text() # get plain text encoded as UTF-8
    words3.extend(text.split()) # extend the list without reassigning



###                                   TRACKING                                     ###
# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

retrieval_stats.append( {
    "name": "pymupdf",
    "time": total_time,
    "num_words": len(words3),
    "memory_MB": memory_MB / (1024 * 1024),  # Delta in MB
    "cpu_percent": cpu_used,
    "num_alphanum_chars": sum(1 for word in words3 for c in word if c.isalnum()),
}
)

###                                   TRACKING END                                 ###
for i in range(0, len(words3), 10):
    print(' '.join(words3[i:i+10]))

   N° FINESS : 
 -  - 
(:  7: 53
15 44   - Biologiste(s) Médical(aux) Docteur    
CABINET MEDICAL "LA "   Madame    250
 DES    (100) Copie à : Docteur
  , DR  X Demande n° 01/02/
-LABO--TP Edité le, lundi 1 février 2021 Patient né(e)  
le  Copie à : Docteur   , DR 
 FSE Tiers payant  -  Prélèvements effectués
par le laboratoire le 01/02/21 à 10H27 Vos résultats sur
internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable 1) Communiquez
votre mail au laboratoire 2) Recevez un email dès que
vos résultats sont disponibles 3) Cliquez sur le lien INFORMATION
COVID-19 Rendez-vous sur notre site internet dédié pour connaître notre
organisation : https://.fr/depistage-covid-19/ Hématologie Valeurs de référence Antériorités ✔ Hémogramme
(Sang total - Variation d'impédance, photométrie, cytométrie en flux) 
-   Hématies ........................................ 4,94 Téra/L 3,80 à 5,90
4,97 Hémoglobine .................................... 13,6 g/dL 11,5 à 17,5 13,8 8,4
mmol/L 7,1 à 10

**Benefits**:
- Extremely quick.
- Appears to capture the same amount of text as the other libraries.
- Very quick installation too, with barely any requirements.

**Downsides**: 
- Can probably only capture text from PDFs that have a text layer; scanned or image-only pages would need OCR.
- The example opens the file by bare name. `pymupdf.open` also accepts a full path, although this still needs to be tested here.
  

[Return to the top](#task-1a--pure-text-extraction-from-a-pdf)

## Retriever 4: PDFMiner.six

In [9]:
%pip install pdfminer.six


Note: you may need to restart the kernel to use updated packages.


In [10]:

###                                   TRACKING                                     ###
import psutil
import os
import time
# Initialize tracking
process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()

###                                   TRACKING END                                 ###
from pdfminer.high_level import extract_text
text = extract_text("highlighted_output.pdf")
words4 = text.split()


###                                   TRACKING                                     ###
# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

retrieval_stats.append( {
    "name": "pdfminer",
    "time": total_time,
    "num_words": len(words4),
    "memory_MB": memory_MB / (1024 * 1024),  # Delta in MB
    "cpu_percent": cpu_used,
    "num_alphanum_chars": sum(1 for word in words4 for c in word if c.isalnum()),
}
)

###                                   TRACKING END                                 ###
for i in range(0, len(words4), 10):
    print(' '.join(words4[i:i+10]))

 N° FINESS :  
du  de  -  -  (: 04
67 : 
 - Biologiste(s) Médical(aux)Docteur    CABINET MEDICAL "LA " 
Madame     (100)Copie à :
Docteur   , DR X Demande n° 01/02/
-LABO--TPEdité le, lundi 1 février 2021Patient né(e)   le Copie
à : Docteur   , DR FSE Tiers
payant  -  Prélèvements effectués par le laboratoire le
01/02/21 à 10H27 Vos résultats sur internet : Accès sécurisé,
rapide, gratuit, pratique, écoresponsable1) Communiquez votre mail au laboratoire 2)
Recevez un email dès que vos résultats sont disponibles 3)
Cliquez sur le lien INFORMATION COVID-19 Rendez-vous sur notre site
internet dédié pour connaître notre organisation : https://.fr/depistage-covid-19/ HématologieValeurs de
référenceAntériorités✔Hémogramme(Sang total - Variation d'impédance, photométrie, cytométrie en flux) 
- Hématies ........................................4,94Téra/L3,80 à 5,904,97Hémoglobine ....................................13,6g/dL11,5 à 17,513,88,4mmol/L7,1 à 10,9Hématocrite
..........................

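**Benefits**:
- Quick.
- Appears to capture the same amount of text as the other libraries.
- Very quick installation too, with barely any requirements.
- Very short to set up: a single call handles the whole document, with no page loop to manage.

**Downsides**: 
- Can probably only capture text from PDFs that have a text layer; scanned or image-only pages would need OCR.
- The example opens the file by bare name. `extract_text` also accepts a full path, although this still needs to be tested here.
- In the output above, some neighbouring words are fused (for example `Médical(aux)Docteur`), which lowers the word count.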
  

[Return to the top](#task-1a--pure-text-extraction-from-a-pdf)

## Retriever 5: pdftotext

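`pdftotext` needs the poppler system libraries, so install them from a terminal before running the pip cell:

- Press CTRL + SHIFT + P.
- Type: Open terminal.
- Click on: Bash.
- Once in the terminal (the prompt looks like `(.venv) @mint:~/Path/to/this/folder$`), run the command for your system: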

Debian/Ubuntu:
```bash
sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
```

MacOS: 
```bash
brew install pkg-config poppler python
```

Windows: 
Currently tested only when using conda:

- Install the Microsoft Visual C++ Build Tools
- Install poppler through conda: 

```bash
conda install -c conda-forge poppler
```


In [11]:
# This install fails unless the system packages above are installed first
%pip install pdftotext

Collecting pdftotext
  Downloading pdftotext-3.0.0.tar.gz (113 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: pdftotext
  Building wheel for pdftotext (pyproject.toml) ... done
  Created wheel for pdftotext: filename=pdftotext-3.0.0-cp313-cp313-linux_x86_64.whl size=60801 sha256=63e0d8a630742fd638084a5c4428509461fe609d2e85d16f4273f789586398f2
  Stored in directory: /home//.cache/pip/wheels/9f/21/c1/d0326a643e800bbb8f8b64448b4e14b6c6e525b78a379445f2
Successfully built pdftotext
Installing collected packages: pdftotext
Successfully installed pdftotext-3.0.0
Note: you may need to restart the kernel to use updated packages.


In [12]:
###                                   TRACKING                                     ###
import psutil
import os
import time
# Initialize tracking
process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()
###                                   TRACKING END                                 ###

import pdftotext
with open("highlighted_output.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
words5 = []
for page in pdf:
    words5.extend(page.split())
###                                   TRACKING                                     ###
# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

retrieval_stats.append( {
    "name": "pdftotext",
    "time": total_time,
    "num_words": len(words5),
    "memory_MB": memory_MB / (1024 * 1024),  # Delta in MB
    "cpu_percent": cpu_used,
    "num_alphanum_chars": sum(1 for word in words5 for c in word if c.isalnum()),
}
)
###                                   TRACKING END                                 ###
for i in range(0, len(words5), 10):
    print(' '.join(words5[i:i+10]))



Copie électronique   N° FINESS : 34 3 
2  -  -
 (:  7: 04 67
53 15 44   - Biologiste(s) Médical(aux) Docteur  
  CABINET MEDICAL "LA " Madame    250 
DES      (100) X Demande n°
01/02/ -LABO--TP Edité le, lundi 1 février 2021 Copie à
: Docteur   , DR  Copie à
: Docteur   , DR  Patient né(e)
  le  FSE Tiers payant  -  Prélèvements
effectués par le laboratoire le 01/02/21 à 10H27 Vos résultats
sur internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable 1)
Communiquez votre mail au laboratoire 2) Recevez un email dès
que vos résultats sont disponibles 3) Cliquez sur le lien
INFORMATION COVID-19 Rendez-vous sur notre site internet dédié pour connaître
notre organisation : https://.fr/depistage-covid-19/ Hématologie Valeurs de référence Antériorités ✔
Hémogramme (Sang total - Variation d'impédance, photométrie, cytométrie en flux)
 -   4,97 Hématies ........................................ Hémoglobine .................................... 4,94
Téra/L 13,6 g/dL 3,80 à 5,90 11,5 à

## Retriever 6: tabula-py

In [None]:
%pip install tabula-py
%pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Collecting jpype1
  Downloading jpype1-1.5.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Downloading jpype1-1.5.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (493 kB)
Installing collected packages: jpype1
Successfully installed jpype1-1.5.2
Note: you may need to restart the kernel to use updated packages.


In [None]:
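import os

import tabula
from dotenv import load_dotenv

load_dotenv()  # read variables from a local .env file
pdf_path = os.environ.get("filePath")  # path to the PDF, stored under the key "filePath"
# Extract the tables found on every page, one DataFrame per table
dfs = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)
for i, df in enumerate(dfs):
    print(f"Table {i + 1}:")
    print(df)
    print("\n")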

[Return to the top](#task-1a--pure-text-extraction-from-a-pdf)

## Retriever comparison

In [None]:
%pip install tabulate

Note: you may need to restart the kernel to use updated packages.


In [None]:
from tabulate import tabulate

print(tabulate(retrieval_stats, headers="keys"))


name             time    num_words    memory_MB    cpu_percent    num_alphanum_chars
----------  ---------  -----------  -----------  -------------  --------------------
pdfplumber  0.960923          1961       34.875           97.8                  9931
pypdf2      0.42402           1889        2.875           99                    9931
pymupdf     0.0342188         2007        0              115.2                  9931
pdfminer    0.622337          1673        0               99.6                  9931
pdftotext   0.0811651         2004        0.5             36.7                  9931
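
All five retrievers return the same number of alphanumeric characters (9931) but different word counts, so the differences come from where each library places whitespace. One way to make that visible is the average number of alphanumeric characters per word: a higher value suggests that neighbouring words were fused. The sketch below only reuses the figures from the table above; the variable names are illustrative.

```python
# (name, num_words, num_alphanum_chars) copied from the comparison table
table_rows = [
    ("pdfplumber", 1961, 9931),
    ("pypdf2", 1889, 9931),
    ("pymupdf", 2007, 9931),
    ("pdfminer", 1673, 9931),
    ("pdftotext", 2004, 9931),
]

# Average alphanumeric characters per whitespace-separated word
chars_per_word = {name: chars / words for name, words, chars in table_rows}

# Print from the lowest to the highest average
for name, value in sorted(chars_per_word.items(), key=lambda item: item[1]):
    print(f"{name:<11}{value:.2f}")
```

pdfminer comes out highest, at about 5.9 characters per word against roughly 5.0 for PyMuPDF and pdftotext, which is consistent with the fused words visible in its printed output.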


[Return to the top](#task-1a--pure-text-extraction-from-a-pdf)

## Scalene report 

### Installation: 

In [None]:
%pip install -U scalene

Note: you may need to restart the kernel to use updated packages.


### How to get a report:


> **WARNING**: Remember to run this entire notebook to download the needed libraries.

- CTRL + SHIFT + P
- Open terminal --> Bash
- cd to /RESEARCH/task1scripts/
- on the terminal: 
```bash
scalene --cpu --memory retrievers.py
```
- The report will open in your browser.


### My report:

I used the command line format and sent the output to a txt file:
[fullPerfReport](./task1scripts/fullPerfReport.txt)


In order to get this I did: 
- CTRL + SHIFT + P
- Open terminal --> Bash
- cd to /RESEARCH/task1scripts/
- on the terminal : 
```bash
scalene --cpu --memory --cli --column-width 6000 --outfile=fullPerfReport.txt retrievers.py
```

### My report on HTML

In order to get this I did: 
- CTRL + SHIFT + P
- Open terminal --> Bash
- cd to /RESEARCH/task1scripts/
- on the terminal : 
```bash
scalene --cpu --memory --outfile=fullPerfReport.html retrievers.py
```
- on the same terminal: 
```bash
xdg-open fullPerfReport.html     #Linux
open fullPerfReport.html         #MacOS
explorer.exe fullPerfReport.html #Windows
```
- The report will open in your browser. If you get an installation error, run the command that the error message asks for, either in your Bash terminal or in a code cell of the notebook.

> **Warning**: The statistics depend on your machine, so run several tests to check that the numbers make sense.

### Scalene Performance Result:
The report shows CPU time split between time spent in Python code and time spent in native code, in the columns named `Time Python` and `native`. We know these columns measure CPU thanks to the YouTube video [12].

[Return to the top](#task-1a--pure-text-extraction-from-a-pdf)

### Text Quality Check:

This check compares the text returned by each of the following retrievers with a reference text file:

- [Retriever 1: PdfPlumber](#retriever-1-pdfplumber)
- [Retriever 2: Pypdf2](#retriever-2-pypdf2)
- [Retriever 3: PyMuPDF](#retriever-3-pymupdf)
- [Retriever 4: PDFMiner.six](#retriever-4-pdfminer.six)
- [Retriever 5: pdftotext](#retriever-5-pdftotext)

In [None]:
%pip install simphile

Note: you may need to restart the kernel to use updated packages.


In [None]:
from simphile import jaccard_similarity, compression_similarity, euclidian_similarity


pdfplumber_text = ' '.join(words)
pypdf2_text = ' '.join(words2)
pymupdf_text = ' '.join(words3)
pdfminer_text = ' '.join(words4)
pdftotext_text = ' '.join(words5)
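# Load the reference text from the highlighted_output.txt file
# (UTF-8 is assumed here because the document contains accented characters)
with open("highlighted_output.txt", "r", encoding="utf-8") as file:
    txt_doc_text = file.read()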
# Calculate similarities
similarities = {
    "pdfplumber": {
        "jaccard": jaccard_similarity(txt_doc_text, pdfplumber_text),
        "compression": compression_similarity(txt_doc_text, pdfplumber_text),
        "euclidean": euclidian_similarity(txt_doc_text, pdfplumber_text),
    },
    "pypdf2": {
        "jaccard": jaccard_similarity(txt_doc_text, pypdf2_text),
        "compression": compression_similarity(txt_doc_text, pypdf2_text),
        "euclidean": euclidian_similarity(txt_doc_text, pypdf2_text),
    },
    "pymupdf": {
        "jaccard": jaccard_similarity(txt_doc_text, pymupdf_text),
        "compression": compression_similarity(txt_doc_text, pymupdf_text),
        "euclidean": euclidian_similarity(txt_doc_text, pymupdf_text),
    },
    "pdfminer": {
        "jaccard": jaccard_similarity(txt_doc_text, pdfminer_text),
        "compression": compression_similarity(txt_doc_text, pdfminer_text),
        "euclidean": euclidian_similarity(txt_doc_text, pdfminer_text),

    },
    "pdftotext": {
        "jaccard": jaccard_similarity(txt_doc_text, pdftotext_text),
        "compression": compression_similarity(txt_doc_text, pdftotext_text),
        "euclidean": euclidian_similarity(txt_doc_text, pdftotext_text),
    },
}

# Print similarities
for retriever, metrics in similarities.items():
    print(f"Retriever: {retriever}")
    for metric, value in metrics.items():
        print(f"  {metric.capitalize()} Similarity: {value:.4f}")

# Calculate the total similarity score for each method
total_similarity_scores = {
    retriever: sum(metrics.values())
    for retriever, metrics in similarities.items()
}

# Find the method with the highest similarity score
most_similar_method = max(total_similarity_scores, key=total_similarity_scores.get)

# Print the results
print("\nTotal Similarity Scores:")
for retriever, score in total_similarity_scores.items():
    print(f"{retriever}: {score:.4f}")

print(f"\nThe method that returns the most similar text is: {most_similar_method}")

Retriever: pdfplumber
  Jaccard Similarity: 0.8863
  Compression Similarity: 0.7542
  Euclidean Similarity: 0.9962
Retriever: pypdf2
  Jaccard Similarity: 0.8143
  Compression Similarity: 0.7541
  Euclidean Similarity: 0.9934
Retriever: pymupdf
  Jaccard Similarity: 0.9343
  Compression Similarity: 0.7652
  Euclidean Similarity: 0.9972
Retriever: pdfminer
  Jaccard Similarity: 0.6804
  Compression Similarity: 0.7668
  Euclidean Similarity: 0.9912
Retriever: pdftotext
  Jaccard Similarity: 0.9329
  Compression Similarity: 0.7667
  Euclidean Similarity: 0.9972

Total Similarity Scores:
pdfplumber: 2.6367
pypdf2: 2.5618
pymupdf: 2.6968
pdfminer: 2.4383
pdftotext: 2.6967

The method that returns the most similar text is: pymupdf
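
To give a feel for why pdfminer scores lowest on the Jaccard metric, here is a minimal word-set Jaccard similarity (size of the intersection divided by size of the union). This is a sketch and not necessarily how `simphile` tokenizes or computes its score; the function name `word_jaccard` and the sample strings are illustrative. When two words are fused, both can drop out of the intersection while a new token joins the union, so the score falls even though no characters are lost.

```python
def word_jaccard(text_a, text_b):
    """Jaccard similarity between the sets of whitespace-separated words of two texts."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


reference = "Hématologie Valeurs de référence Antériorités"
well_spaced = "Hématologie Valeurs de référence Antériorités"
fused = "HématologieValeurs de référenceAntériorités"

print(word_jaccard(reference, well_spaced))  # identical word sets
print(word_jaccard(reference, fused))        # only "de" is shared, out of 7 distinct tokens
```

Note also that in the totals above PyMuPDF and pdftotext differ only in the fourth decimal place, so on this document the two are effectively tied.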
