# Task 1C: Optical charachter recognition from a png file. 

We will now focus on extracting text from an image. We will use the png images given by the pdfium library, since it is the quickest and least memory consuming one. That is not a final conclusion on which method to use though. These scripts are surprisingly slow because opening and decompressing the png's takes time. it is just a comparative measure.

| OCR Solution       | Backend/Engine       | Strengths                          | Best For                          | Installation Command              |
|--------------------|----------------------|------------------------------------|-----------------------------------|-----------------------------------|
| **pytesseract**    | Tesseract OCR        | Mature, customizable, many options | General-purpose OCR               | `pip install pytesseract pillow`  |
| **EasyOCR**        | CRNN models          | Multilingual, easy setup           | Quick implementations, multi-lang | `pip install easyocr`             |
| **LayoutParser**   | Multiple backends    | Layout analysis + OCR integration  | Document structure understanding  | `pip install layoutparser`        |
| **PaddleOCR**      | PaddlePaddle         | Good accuracy, multi-language      | Chinese/English documents         | `pip install paddleocr`           |
| **DocTR**          | TensorFlow/PyTorch   | Document-focused models            | PDFs and scanned documents        | `pip install python-doctr`        |
| **OpenCV+Tesseract** | OpenCV + Tesseract | Good with preprocessing           | Noisy/low-quality images          | `pip install opencv-python pytesseract` |

### Backend Options for LayoutParser:

| LayoutParser Backend | Engine Used       | Additional Requirements           |
|----------------------|-------------------|------------------------------------|
| `TesseractAgent()`   | Tesseract OCR     | `pip install pytesseract`          |
| `EasyOCRAgent()`     | EasyOCR           | `pip install easyocr`              |
| `PaddleOCRAgent()`   | PaddleOCR         | `pip install paddleocr`            |

In [3]:
OCR_stats=[]


## Method 1 : Pytesseract 


In [None]:
%pip install pdf2image
%pip install pytesseract opencv-python pillow
%pip install python-dotenv

**RUN IN BASH TERMINAL AFTER PRESSING CTRL+SHIFT+P and writing terminal, press enter and click on bash**


__don't execute on python, open a bash terminal__
```bash
chmod +x ./install_tesseract.sh    #makes it executable
./install_tesseract.sh
```

In [None]:
from dotenv import load_dotenv
load_dotenv()
###                                   TRACKING                                     ###
import time
import psutil
import os
process = psutil.Process(os.getpid())

# Start measurements (proper CPU initialization)
_ = process.cpu_percent(interval=None)  # Dummy call to reset counter
mem_before = process.memory_info().rss
start_time = time.time()
###                                   TRACKING END                                 ###

from PIL import Image
import pytesseract

# Directory containing PNG files
directory = "task1c_reportsTemplates"
words = []
coordinates = []  # List to store coordinates of words/characters

# Sort filenames to process pages in order
sorted_filenames = sorted([f for f in os.listdir(directory) if f.lower().endswith('.png')])

# Process each PNG file in sorted order
for filename in sorted_filenames:
    try:
        image_path = os.path.join(directory, filename)
        img = Image.open(image_path)
        
        # Perform OCR with bounding box information
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        
        for i in range(len(data['text'])):
            text = data['text'][i].strip()
            if text:  # Skip empty text
                words.append(text)
                coordinates.append({
                    "text": text,
                    "left": data['left'][i],
                    "top": data['top'][i],
                    "width": data['width'][i],
                    "height": data['height'][i]
                })
        print(f"Prociessed: {filename}")
    except Exception as e:
        print(f"Error processing {filename}: {str(e)}")

# Print coordinates for debugging or further processing
for coord in coordinates:
    print(coord)
###                                   TRACKING                                     ###

# Calculate metrics
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

# Store results
OCR_stats.append({
    "name": "tesseract",
    "time": total_time,
    "num_words": len(words),
    "memory_MB": memory_MB / (1024 * 1024),  # Delta memory
    "cpu_percent": cpu_used,
    "num_alphanum_chars": sum(1 for word in words for c in word if c.isalnum()),
})
###                                   TRACKING END                                 ###

pytesseract_text = ' '.join(words) 
with open ('task2_prompts/prompting_tesseract.txt','w') as file:
    file.write(pytesseract_text)

Prociessed: pdf2image_denoised_page_1.png
Prociessed: pdf2image_denoised_page_2.png
Prociessed: pdf2image_denoised_page_3.png
Prociessed: pdf2image_denoised_page_4.png
Prociessed: pdf2image_denoised_page_5.png
{'text': 'Copie', 'left': 1773, 'top': 28, 'width': 106, 'height': 40}
{'text': 'électronique', 'left': 1893, 'top': 28, 'width': 224, 'height': 40}
{'text': 'LABORATOIRE', 'left': 1422, 'top': 112, 'width': 359, 'height': 36}
{'text': 'DE', 'left': 1802, 'top': 112, 'width': 64, 'height': 36}
{'text': 'BIOLOGIE', 'left': 1887, 'top': 112, 'width': 242, 'height': 36}
{'text': 'MEDICALE', 'left': 2150, 'top': 101, 'width': 259, 'height': 47}
{'text': 'C', 'left': 59, 'top': 64, 'width': 100, 'height': 146}
{'text': 'LABO', 'left': 274, 'top': 148, 'width': 315, 'height': 67}
{'text': ' ', 'left': 534, 'top': 314, 'width': 261, 'height': 39}
{'text': '', 'left': 820, 'top': 315, 'width': 197, 'height': 38}
{'text': 'N°', 'left': 532, 'top': 375, 'width': 33, 'height': 22}
{'text': 

**Benefits**: 
- turns pdfs to images and then uses pytesseract to captivate the words and charachters from the image. highly adaptable to treat basically any kind of input. 
- works from the Path of the file. Could make it easier when scaling to treat many different files. 

**Downsides**: 
- Programs runs in a much slower manner.
- Installation is harder and requires more software to work, although I figured it out, so for the complexity of the task it is easy.


[Return to the top](#task-1c-optical-charachter-recognition-from-a-png-file)

## Method 2: Layout-Parser with tesseractOCR(again) and Effdet backend

`layout-parser` has it's own pdf text retriever but uses [pdfPlumber](#retriever-1-Pdfplumber) backend so it would be redundant to use it. We will then proceed to turn the pdf into an image with pdf2image. Then, use the model [installed locally](./models/publaynet-tf_efficientdet_d1.pth.tar) from [layoutparser's Hugging face post](https://huggingface.co/layoutparser/efficientdet/tree/main/PubLayNet/tf_efficientdet_d1). We used the `Effdet` backend this time. I will later try with `detectron2` which seems to have more models and documentation available. I recomment that you create a models folder and store them there. I had to remove mine from git because it was getting too heavy. 

> Don't forget to install tesseract as I explained in [Method 1](#method-1--pytesseract)

In [3]:
%pip install layoutparser # Install the base layoutparser library with  
%pip install "layoutparser[layoutmodels]" # Install DL layout model toolkit 
%pip install "layoutparser[effdet]"
%pip install effdet
%pip install "layoutparser[ocr]" # Install OCR toolkit
%pip install pdf2image # Install pdf2image for PDF to image conversion
%pip install python-dotenv


Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


> **Warning** : be sure to restart your kernel with: 

- CTRL + SHIFT + P 
- type: restart kernel 
- Enter

In [None]:


from dotenv import load_dotenv
load_dotenv()
###                                   TRACKING                                     ###
import time
import psutil
import os

process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()

###                                   TRACKING END                                 ###
import layoutparser as lp
from PIL import Image
from layoutparser.models import EfficientDetLayoutModel

words2 = []  # Initialize an empty list to store words
coordinates2 = []  # Initialize an empty list to store coordinates

# Process PDF
filePath = os.environ.get("filePath")
directory = "task1c_reportsTemplates"

# Directly process PNG files from the directory
sorted_filenames = sorted([f for f in os.listdir(directory) if f.lower().endswith('.png')])

images = []
for filename in sorted_filenames:
    image_path = os.path.join(directory, filename)
    img = Image.open(image_path)
    images.append(img)

# Model: publaynet-tf_efficientdet_d1.pth
model = EfficientDetLayoutModel(
    model_path=os.environ.get("model_path"),
    config_path='lp://PubLayNet/tf_efficientdet_d1/config',
    label_map={
        0: "Text",
        1: "Title",
        2: "List",
        3: "Table",
        4: "Figure"
    },
    enforce_cpu=True
)

print("Model loaded successfully.")
ocr_agent = lp.TesseractAgent()

for i, image in enumerate(images):
    layout = model.detect(image)
    
    print(f"\n--- Page {i+1} ---")

    for block in layout:
        x1, y1, x2, y2 = map(int, block.block.coordinates)
        cropped_img = image.crop((x1, y1, x2, y2))
        
        # Perform OCR
        text = ocr_agent.detect(cropped_img)
        block.set(text=text, inplace=True)

        # Clean text and skip empty results
        clean_text = text.strip()
        if clean_text:
            print(f"[{block.type}] {clean_text}")
            words2.append(clean_text)
            coordinates2.append({
                "text": clean_text,
                "left": x1,
                "top": y1,
                "width": x2 - x1,
                "height": y2 - y1
            })




all_words2 = words2  # Store all words for tracking
print(coordinates2)
###                                   TRACKING                                     ###

# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

OCR_stats.append( {
    "name": "layoutparser+EfficientDet",
    "time": total_time,
    "num_words": len(all_words2),
    "memory_MB": memory_MB / (1024 * 1024),  # Delta in MB
    "cpu_percent": cpu_used,
    "num_alphanum_chars": sum(1 for word in all_words2 for c in word if c.isalnum()),
}
)

###                                   TRACKING END                                 ###

In [None]:
import json

# Format coordinates2 into a readable JSON-like structure
formatted_coordinates = json.dumps(coordinates2, indent=4)

# Print the formatted coordinates
print(formatted_coordinates)

[
    {
        "text": "PICrlatwwrIuygic\n\nValeurs de r\u00e9f\u00e9rence Ant\u00e9riorit\u00e9s\nv H\u00e9mogramme\n(Sang total - Variation d'imp\u00e9dance, photom\u00e9trie, cytom\u00e9trie en flux)  - \n\n05/10/2C\u00a2\n\nHEMAtieS ......c cece cceeeeeeeeeeeeeeeeeeeeeeeeees 4,94 T\u00e9ra/L 3,80 a 5,90 4,97\nHEMOGIODING ...........ceceeeeeeeeeeeeeeeeeeenes 13,6 g/dL 11,5 a 17,5 13,\n\n8,4 mmol/L 7,14 10,9\n\nHEMatOcrite .......ccccececececeeeeeeeeeeeeeeeeees 41,1 % 34,0 a 53,0 41,6\nV.G.M. vecccsecerscerscenseeees beceeeeuueeecenuacs 83,14 76,0 a 96,0 84,(\nT.C.M.H. ......eee seen eeeeeeeeeeeeeeeeeeeseeeeeeaes 27,4 pg 24,4 a 34,0 27,7\nC.CLM LH. cece ecccccceeceeeeeeeeeeeeeeeeueeeeeuees 33,0 g/at 31,0 a 36,0 32,5\n\nIndex d'aniSOCytOSe .........cccceeee eee eeeeee 13,7 10 4 16 14,1\n\nLEUCOCYEES .....ecceeeeceeee cence eeeeeeneeeneeeaas 6,5 Giga/L 3,8. 11,0 7.8\n\nPolynucl\u00e9aires neutrophiles ... 58,1 % 3,780 G/L 1,40 47,70 4,24\u00a2\nPolynucl\u00e9aires \u00e9osinophiles ..

**Benefits**: 
- turns pdfs to images then uses an agent I downloaded from `layoutparser` trained on PublicLayNet data and using effdet backend to figure out the layout and get the text blocks. Then tesseract-OCR captivates the words and charachters from the image. 
it is highly adaptable to treat basically any kind of input. 
- works from the Path of the file. Could make it easier when scaling to treat many different files. 

**Downsides**: 
- Is extremely slow and memory demanding. The double use of agents makes the process longer and more demanding.
- The read quality isn't great, although I haven't yet managed to use it with french as language, still, the capture remains slow and poor. 
- Installation is harder and requires more software to work, effdet models are almost undocumented and harder to find. 
- Has a toolkit warning, will need to change/add imports before november this year. 


[Return to the top](#task-1c-optical-charachter-recognition-from-a-png-file)

## Method 3: layoutparser with a Detectron2 layout parser and tesseract-OCR 
pdf2image will be used for 
the model used for the layout parsing remains to be determined. 

In [None]:
%pip install layoutparser torchvision  
%pip install 'git+https://github.com/facebookresearch/detectron2.git'
%pip install "layoutparser[ocr]"

In [4]:
from dotenv import load_dotenv
load_dotenv()
###                                   TRACKING                                     ###
import time
import psutil
import os

process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()

###                                   TRACKING END                                 ###

import layoutparser as lp 
from PIL import Image
words3= []  # Initialize an empty list to store words
coordinates3 = []  # Initialize an empty list to store coordinates
# Process PDF
filePath = os.environ.get("filePath")
directory = "task1c_reportsTemplates"

# Directly process PNG files from the directory
sorted_filenames = sorted([f for f in os.listdir(directory) if f.lower().endswith('.png')])

images = []
for filename in sorted_filenames:
    image_path = os.path.join(directory, filename)
    img = Image.open(image_path)
    images.append(img)


model = lp.Detectron2LayoutModel('lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config', 
                                 model_path=os.environ.get("model2_path"),
                                 extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
                                 label_map={0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"})

print("Model loaded successfully.")
ocr_agent = lp.TesseractAgent()

for i, image in enumerate(images):
    layout = model.detect(image)
    
    print(f"\n--- Page {i+1} ---")

    for block in layout:
        x1, y1, x2, y2 = map(int, block.block.coordinates)
        cropped_img = image.crop((x1, y1, x2, y2))
        
        # Perform OCR
        text = ocr_agent.detect(cropped_img)
        block.set(text=text, inplace=True)

        # Clean text and skip empty results
        clean_text = text.strip()
        if clean_text:
            print(f"[{block.type}] {clean_text}")
            words3.append(clean_text)
            coordinates3.append({
                "text": clean_text,
                "left": x1,
                "top": y1,
                "width": x2 - x1,
                "height": y2 - y1
            })

all_words3 = ' '.join(words3)
###                                   TRACKING                                     ###

# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

OCR_stats.append( {
    "name": "layoutparser+Detectron2",
    "time": total_time,
    "num_words": len(all_words3),
    "memory_MB": memory_MB / (1024 * 1024),  # Delta in MB
    "cpu_percent": cpu_used,
    "num_alphanum_chars": sum(1 for word in all_words3 for c in word if c.isalnum()),
}
)

###                                   TRACKING END                                 ###

Model loaded successfully.


  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]



--- Page 1 ---
[Table] iematologie

Valeurs de référence Antériorités
Hémogramme

(Sang total - Variation d'impédance, photométrie, cytométrie en flux)  - 

Hématies ...... be eeeeceaeeaeeeeaeeeeeeeaeeneneens 4,94 Téra/t 3,80 a 5,90 4,97
HEMOGIODING ........cceeeeeeeeeeeeeee seen enenees 13,6 9/dL 11,54 17,5 13,8

8,4 mmol/L 7,14 10,9

HEMatOcrite .......ccccececececeeeeeeeeeeeeeeeeees 41,1 % 34,0 a 53,0 41,8
VA CH nr see eeeeeeeeeeeeeeeees 83,1 f 76,0 a 96,0 84,0
T.C.M.H. ......eee seen eeeeeeeeeeeeeeeeeeeseeeeeeaes 27,4 pg 24,4 a 34,0 27,7
C.CLM LH. cece ecccccceeceeeeeeeeeeeeeeeeueeeeeuees 33,0 g/at 31,0 a 36,0 32,9

Index d'anisocytose ............. cee eeeeeeeeeee 13,7 10 416 14,1

LEUCOCYEES .....ecceeeeceeee cence eeeeeeneeeneeeaas 6,5 Giga/L 3,8. 11,0 7,8

Polynucléaires neutrophiles ... 58,1 % 3,780 G/L 1,40 47,70 4,240
Polynucléaires éosinophiles ... 3,1 % 0,200 G/L 0,02 a 0,58 0,200
Polynucléaires basophiles ..... 0,9 % 0,060 G/L 0,00 a 0,11 0,060
Lymphocytes ........cceeee

In [14]:
import json

# Format coordinates3 into a readable JSON-like structure
formatted_coordinates3 = json.dumps(coordinates3, indent=4)
print(formatted_coordinates3)

[
    {
        "text": "iematologie\n\nValeurs de r\u00e9f\u00e9rence Ant\u00e9riorit\u00e9s\nH\u00e9mogramme\n\n(Sang total - Variation d'imp\u00e9dance, photom\u00e9trie, cytom\u00e9trie en flux)  - \n\nH\u00e9maties ...... be eeeeceaeeaeeeeaeeeeeeeaeeneneens 4,94 T\u00e9ra/t 3,80 a 5,90 4,97\nHEMOGIODING ........cceeeeeeeeeeeeeee seen enenees 13,6 9/dL 11,54 17,5 13,8\n\n8,4 mmol/L 7,14 10,9\n\nHEMatOcrite .......ccccececececeeeeeeeeeeeeeeeeees 41,1 % 34,0 a 53,0 41,8\nVA CH nr see eeeeeeeeeeeeeeeees 83,1 f 76,0 a 96,0 84,0\nT.C.M.H. ......eee seen eeeeeeeeeeeeeeeeeeeseeeeeeaes 27,4 pg 24,4 a 34,0 27,7\nC.CLM LH. cece ecccccceeceeeeeeeeeeeeeeeeueeeeeuees 33,0 g/at 31,0 a 36,0 32,9\n\nIndex d'anisocytose ............. cee eeeeeeeeeee 13,7 10 416 14,1\n\nLEUCOCYEES .....ecceeeeceeee cence eeeeeeneeeneeeaas 6,5 Giga/L 3,8. 11,0 7,8\n\nPolynucl\u00e9aires neutrophiles ... 58,1 % 3,780 G/L 1,40 47,70 4,240\nPolynucl\u00e9aires \u00e9osinophiles ... 3,1 % 0,200 G/L 0,02 a 0,58 0,200\nPol

This automatically installs the model in '/home/usr/.torch/iopath_cache/s/57zjbwv6gh3srry/model_final.pth'
Be careful, I had to change the model name like this: 
```bash
mv 'model_final.pth?dl=1' model_final.pth
```


**Benefits**:
- Utilizes the `Detectron2` layout parser, which is highly efficient and well-documented.
- Detects complex document layouts and extracts structured text blocks effectively.
- Integrates with `Tesseract-OCR` for text recognition, providing flexibility for multilingual text extraction.
- Supports high accuracy in layout detection due to pre-trained models like `PubLayNet`.

**Downsides**:
- Requires significant computational resources, including memory and CPU usage.
- Installation and setup can be challenging due to dependencies like `Detectron2` and its specific model configurations.
- Slower processing time compared to simpler OCR methods due to the dual use of layout parsing and OCR agents.
- Limited support for certain languages and may require additional tuning for non-standard document layouts.


[Return to the top](#task-1c-optical-charachter-recognition-from-a-png-file)

## Method 4: EasyOCR

In [None]:
%pip install easyocr

In [7]:
###                                   TRACKING                                     ###
import time
import psutil
import os

process = psutil.Process(os.getpid())
_ = process.cpu_percent(interval=None)  # Reset CPU counter
mem_before = process.memory_info().rss
start_time = time.time()

###                                   TRACKING END                                 ###
import easyocr

# Initialize EasyOCR reader
reader = easyocr.Reader(['fr', 'en'],gpu=False)  # Enable GPU for faster processing

# Directory containing PNG files
directory = "task1c_reportsTemplates"

# Sort filenames to process pages in order
sorted_filenames = sorted([f for f in os.listdir(directory) if f.lower().endswith('.png')])

words4 = []  # Initialize an empty list to store all words
coordinates4 = []  # Initialize an empty list to store coordinates

# Process each PNG file
for filename in sorted_filenames:
    image_path = os.path.join(directory, filename)
    try:
        result = reader.readtext(image_path, detail=1)  # Enable detailed output
        print(f"Processed: {filename}")
        for detection in result:
            words4.append(detection[1])  # Append detected text to words4
            coordinates4.append({
                "text": detection[1],
                "left": int(detection[0][0][0]),
                "top": int(detection[0][0][1]),
                "width": int(detection[0][1][0] - detection[0][0][0]),
                "height": int(detection[0][2][1] - detection[0][0][1])
            })
    except Exception as e:
        print(f"Error processing {filename}: {str(e)}")
all_words4 = ' '.join(words4)  # Join all words into a single string for tracking
# Calculate stats
cpu_used = process.cpu_percent(interval=None)
memory_MB = process.memory_info().rss-mem_before
total_time = time.time() - start_time

OCR_stats.append( {
    "name": "easyocr",
    "time": total_time,
    "num_words": len(all_words4),
    "memory_MB": memory_MB / (1024 * 1024),  # Delta in MB
    "cpu_percent": cpu_used,
    "num_alphanum_chars": sum(1 for word in all_words4 for c in word if c.isalnum()),
}
)
for i in range(0, len(all_words4.split()), 10):
    print(' '.join(all_words4.split()[i:i+10]))
###                                   TRACKING END                                 ###

easyOCR_text= ' '.join(words4) 
with open ('task2_prompts/prompting_easy.txt','w') as file:
    file.write(easyOCR_text)

Using CPU. Note: This module is much faster with a GPU.


Processed: pdf2image_denoised_page_1.png
Processed: pdf2image_denoised_page_2.png
Processed: pdf2image_denoised_page_3.png
Processed: pdf2image_denoised_page_4.png
Processed: pdf2image_denoised_page_5.png
Processed: testPic.png
Copie électronique LABORATOIRE DE BIOLOGIE MÉDICALE 6     
No FINESS   du 
de    5: 2:
   Biologiste(s) Médical(aux) Docteur
    Madame    CABINET MEDICAL " " 250
 DES      (100) Copie à
Docteur    , DR  X Demande n'
01/02/ ~LABO--TP Edité le, lundi 1 février 2021 Copie à
Docteur    , DR Patient né(e)   le  
 FSE Tiers payant   Prélèvements effectués par
le laboratoire le 01/02/21 à 10H27 Vos résultats sur internet
Accès sécurisé, rapide, gratuit, pratique, écoresponsable 1) Communiquez votre mail
au laboratoire 2) Recevez un email dès vos résultats sont
disponibles 3) Cliquez sur le lien INFORMATION COVID-19 Rendez-vous sur
notre site internet dédié pour connaître notre organisation https:/ I 
fr/depistage-covid-19/ Hématologie Valeurs de référence Antériorités

In [6]:
import json

formatted_coordinates4 = json.dumps(coordinates4, indent = 4)
print(formatted_coordinates4)

[
    {
        "text": "Copie \u00e9lectronique",
        "left": 1763,
        "top": 18,
        "width": 363,
        "height": 60
    },
    {
        "text": "LABORATOIRE DE BIOLOGIE M\u00c9DICALE",
        "left": 1410,
        "top": 88,
        "width": 1011,
        "height": 73
    },
    {
        "text": "6",
        "left": 50,
        "top": 66,
        "width": 163,
        "height": 229
    },
    {
        "text": " ",
        "left": 266,
        "top": 131,
        "width": 586,
        "height": 98
    },
    {
        "text": "  ",
        "left": 523,
        "top": 304,
        "width": 505,
        "height": 65
    },
    {
        "text": "No FINESS",
        "left": 525,
        "top": 366,
        "width": 163,
        "height": 37
    },
    {
        "text": "",
        "left": 712,
        "top": 366,
        "width": 206,
        "height": 37
    },
    {
        "text": " du G\u00e9n\u00e9ral de ",
        "left": 528,
        "top": 402,
        "width

**Benefits**:  
- EasyOCR is simple to set up and supports multiple languages, including French and English.  
- GPU acceleration allows for faster processing compared to CPU-based methods.  
- Provides detailed output, including bounding box coordinates for detected text.  
- Works directly with image files, making it suitable for various input formats.  

**Downsides**:  
- Accuracy may vary depending on the quality of the input images.  
- Requires significant computational resources, especially when using GPU.  
- Limited support for complex document layouts compared to advanced layout parsers.  
- May struggle with noisy or low-resolution images.  

## NOT WORKING Method 5: DocTR 


https://github.com/mindee/doctr

In [31]:
%pip install python-doctr
%pip install "python-doctr[torch]"
# or with preinstalled packages for visualization & html & contrib module support
%pip install "python-doctr[torch,viz,html,contrib]"

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Collecting onnxruntime>=1.11.0 (from python-doctr[contrib,html,torch,viz])
  Downloading onnxruntime-1.22.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting coloredlogs (from onnxruntime>=1.11.0->python-doctr[contrib,html,torch,viz])
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime>=1.11.0->python-doctr[contrib,html,torch,viz])
  Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Downloading onnxruntime-1.22.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (16.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m16.4/16.4 MB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m:00:01[0m00:01[0m
[?25hDownloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Downloading humanfriendly-10.0-py2.py3-

In [None]:
from dotenv import load_dotenv
load_dotenv()
from doctr.models import ocr_predictor

model = ocr_predictor('linknet_resnet50', 'crnn_mobilenet_v3_large', pretrained=True)

import os
import cv2

# Verify the file exists
image_path = "Bilanconsultation.jpeg"
if not os.path.exists(image_path):
	raise FileNotFoundError(f"File not found: {image_path}")

doc = cv2.imread(image_path)
# Perform OCR
result = model(doc)

# Print the results
print(result.render())
result.show()



## Method 6: Paddle OCR

In [16]:
%pip install pandas tokenizers



Note: you may need to restart the kernel to use updated packages.


In [18]:
# Install paddleocr
%pip install paddleocr

Collecting paddleocr
  Using cached paddleocr-3.0.1-py3-none-any.whl.metadata (15 kB)
Collecting paddlex==3.0.1 (from paddlex[ie,multimodal,ocr]==3.0.1->paddleocr)
  Using cached paddlex-3.0.1-py3-none-any.whl.metadata (72 kB)
Collecting chardet (from paddlex==3.0.1->paddlex[ie,multimodal,ocr]==3.0.1->paddleocr)
  Using cached chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Collecting colorlog (from paddlex==3.0.1->paddlex[ie,multimodal,ocr]==3.0.1->paddleocr)
  Using cached colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Collecting GPUtil>=1.4 (from paddlex==3.0.1->paddlex[ie,multimodal,ocr]==3.0.1->paddleocr)
  Using cached gputil-1.4.0-py3-none-any.whl
Collecting numpy==1.26.4 (from paddlex==3.0.1->paddlex[ie,multimodal,ocr]==3.0.1->paddleocr)
  Using cached numpy-1.26.4-cp313-cp313-linux_x86_64.whl
Collecting pandas<=1.5.3 (from paddlex==3.0.1->paddlex[ie,multimodal,ocr]==3.0.1->paddleocr)
  Using cached pandas-1.5.3.tar.gz (5.2 MB)
  Installing build dependencies ... [?25ldone


In [1]:

from paddleocr import PaddleOCR
import cv2
from matplotlib import pyplot as plt

ocr = PaddleOCR(use_angle_cls=True, lang='en')  # Initialize PaddleOCR
image_path = "/home//DocumentsAIresearchSLICC/RESEARCH/task1c_reportsTemplates/pdf2image_denoised_page_1.png"
img = cv2.imread(image_path)

plt.figure()
plt.imshow(img)
plt.show()
results = ocr.predict(image_path)
# Print the results
print(results)

ModuleNotFoundError: No module named 'paddleocr'

In [None]:
for result in results:
    print("Recognized Text:")
    for text, score in zip(result['rec_texts'], result['rec_scores']):
        print(f"Text: {text}, Confidence: {score:.2f}")

        for i in range(0, len(result['rec_texts']), 15):
            print(' '.join(result['rec_texts'][i:i+15]))

            words6 = [text for text in result['rec_texts']]

Recognized Text:
Text:  , Confidence: 1.00
  MEDICALL BIOLOGIE LABORATOIRE DE    N°FINESS :   du    - Biologiste(s) Médical(aux) Docteur     Madame    CABiNET medicaL " "      (100)
X Demande n° 01/02/ -LABO--TP Edité le, lundi 1 février 2021 Copie a : Docteur    , DR Patient né(e)   le   FSE Tiers payant  -  Prélèvements effectués par le laboratoire le 01/02/21 à 10H27 Vos résultats sur internet : Accès sécurisé, rapide, gratuit, pratique, écoresponsable 1) Communiquez votre mail au laboratoire 2) Recevez un email dès que vos résultats sont disponibes  3) Cliquez sur le lien INFORMATION COVID-19 Rendez-voutnéiouroarraio:p://abdr/dptae-oi/ Hématologie Valeurs de référence Antériorités
Hémogramme (Sang total- Variation dimpédance, photométrie, cytométrie en flux) -   Hématies 4,94 Téra/L 3,80à5,90 4,97 Hémoglobine 13,6 g/dL 11,5à17,5 13,8 8,4 mmol/L 7,1à10,9 Hématocrite 41,1 %
34,0à 53,0 41,8 V.G.M. .. 83,1 fL 76,096,0 84,0 T.C.M.H. 27,4 pg 24,4à34,0 27,7 C.C.M.H. 33,0 g/dL 31,0à 36,0 3

**Benefits**:  
- PaddleOCR supports multiple languages and advanced text detection, including rotated and angled text.
- Provides high accuracy for both printed and handwritten text.
- Includes built-in text detection and recognition models, making it suitable for complex layouts.
- Easy to use with a simple API and good documentation.
- Can process images directly and visualize results with bounding boxes.

**Downsides**:  
- Requires installation of both PaddleOCR and PaddlePaddle, which may be challenging on some systems.
- GPU support is limited to specific hardware and may require additional setup.
- Processing speed can be slower on CPU, especially for large or high-resolution images.
- May require additional configuration for optimal results with non-English languages or custom datasets.

## OCR comparison

In [None]:
%pip install tabulate

In [None]:
from tabulate import tabulate

print(tabulate(OCR_stats, headers="keys"))


[Return to the top](#task-1c-optical-charachter-recognition-from-a-png-file)

### Text Quality Check:

[Methods](#methods)

- [TesseractOCR](#layoutparser--tesseractocr)
- [LayoutParser + EfficientDet](#layoutparser--efficientdet)
- [LayoutParser + Detectron2](#layoutparser--detectron2)
- [EasyOCR](#easyocr)
- [DocTR](#doctr)
- [PaddleOCR](#paddleocr)

In [None]:
%pip install simphile

Note: you may need to restart the kernel to use updated packages.


In [None]:
from simphile import jaccard_similarity, compression_similarity, euclidian_similarity
from tabulate import tabulate
pytesseract_text = ' '.join(words)  # Tesseract words
lpeffdet_text = ' '.join(words2)  # LayoutParser with EfficientDet words
lpdetectron2_text = ' '.join(words3)  # LayoutParser with Detectron2 words
easyocr_text = ' '.join(words4)  # EasyOCR words
#paddleocr_text= ' '.join(words6)  # PaddleOCR words
GCV_text = output.get('fullTextAnnotation', {}).get('text', '')  # Google Cloud Vision text

# Load the text from the highlighted_output.txt file

with open("highlighted_output.txt", "r") as file:
    txt_doc_text = file.read()

with open("highlighted_output_reduced.txt", "w") as file:
    gcv_comparator = file.read()
# Calculate similarities
similarities = {
    "pytesseract": {
        "jaccard": jaccard_similarity(txt_doc_text,pytesseract_text),
        "compression": compression_similarity(txt_doc_text, pytesseract_text),
        "euclidean": euclidian_similarity(txt_doc_text, pytesseract_text),
    },
    "lpeffdetocr": {
        "jaccard": jaccard_similarity(txt_doc_text,lpeffdet_text),
        "compression": compression_similarity(txt_doc_text, lpeffdet_text),
        "euclidean": euclidian_similarity(txt_doc_text, lpeffdet_text),
    },
    "lpdetectron2": {
        "jaccard": jaccard_similarity(txt_doc_text,lpdetectron2_text),
        "compression": compression_similarity(txt_doc_text, lpdetectron2_text),
        "euclidean": euclidian_similarity(txt_doc_text, lpdetectron2_text),
    },
    "easyocr": {
        "jaccard": jaccard_similarity(txt_doc_text,easyocr_text),
        "compression": compression_similarity(txt_doc_text, easyocr_text),
        "euclidean": euclidian_similarity(txt_doc_text, easyocr_text),
    },
    # "paddleocr": {
    #     "jaccard": jaccard_similarity(txt_doc_text,paddleocr_text),
    #     "compression": compression_similarity(txt_doc_text, paddleocr_text),
    #     "euclidean": euclidian_similarity(txt_doc_text, paddleocr_text),
    # },
    "google_cloud_vision": {
        "jaccard": jaccard_similarity(gcv_comparator,GCV_text),
        "compression": compression_similarity(gcv_comparator, GCV_text),
        "euclidean": euclidian_similarity(gcv_comparator, GCV_text),
    }
}
# Print similarities in a tabular format

similarity_table = []
for retriever, metrics in similarities.items():
    similarity_table.append([retriever] + [f"{value:.4f}" for value in metrics.values()])

headers = ["Retriever", "Jaccard Similarity", "Compression Similarity", "Euclidean Similarity"]
print(tabulate(similarity_table, headers=headers, tablefmt="grid"))

# Calculate the average similarity score for each method
average_similarity_scores = {
    retriever: sum(metrics.values()) / len(metrics)
    for retriever, metrics in similarities.items()
}

# Find the method with the highest average similarity score
most_similar_method = max(average_similarity_scores, key=average_similarity_scores.get)

# Print the average similarity scores in a tabular format
average_similarity_table = [[retriever, f"{score:.4f}"] for retriever, score in average_similarity_scores.items()]
print("\nAverage Similarity Scores:")
print(tabulate(average_similarity_table, headers=["Retriever", "Average Similarity"], tablefmt="grid"))

print(f"\nThe method that returns the most similar text is: {most_similar_method}")


+---------------------+----------------------+--------------------------+------------------------+
| Retriever           |   Jaccard Similarity |   Compression Similarity |   Euclidean Similarity |
| pytesseract         |               0.6677 |                   0.6365 |                 0.9833 |
+---------------------+----------------------+--------------------------+------------------------+
| lpeffdetocr         |               0.51   |                   0.5999 |                 0.9829 |
+---------------------+----------------------+--------------------------+------------------------+
| lpdetectron2        |               0.4444 |                   0.5545 |                 0.9806 |
+---------------------+----------------------+--------------------------+------------------------+
| easyocr             |               0.7253 |                   0.6666 |                 0.9847 |
+---------------------+----------------------+--------------------------+------------------------+
| google_c