# Denoising text with an LLM before the Embedding

We will use the output from EasyOCR, the highest-quality OCR engine tried so far, and also the output from pytesseract, the quickest one we have.
From what I've seen, EasyOCR output barely needs denoising, whereas pytesseract output really does; we will see which of the two works better once denoised.

For the LLMs, I will investigate locally installed models later on. For now, we will use the Groq API, which offers a certain number of tokens and characters for free. One advantage of Groq is that, instead of relying on GPUs or CPUs, they developed language processing units (LPUs), so the models are apparently extremely quick.

We will prepare the input by placing it in the folder [task2_prompts](./task2_prompts/).
It holds one input for each of the two OCR results used below: `prompting_tesseract.txt` and `prompting_easy.txt`.
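Before any denoising, it can help to get a rough number for how far the two OCR outputs diverge. The sketch below assumes the two file names used in the cells further down; `load_ocr_text` and `similarity` are hypothetical helpers, and the ratio from `difflib` is only a coarse character-level measure, not a quality score.

```python
from difflib import SequenceMatcher
from pathlib import Path

PROMPT_DIR = Path("task2_prompts")  # folder named in the text above


def load_ocr_text(name, folder=PROMPT_DIR):
    """Read one OCR result as UTF-8 text."""
    return (Path(folder) / name).read_text(encoding="utf-8")


def similarity(a, b):
    """Return a 0..1 character-level similarity ratio (1.0 means identical)."""
    # autojunk=False keeps the ratio meaningful on long, repetitive OCR text
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


tesseract_path = PROMPT_DIR / "prompting_tesseract.txt"
easy_path = PROMPT_DIR / "prompting_easy.txt"

# Only run the comparison when both input files are actually present
if tesseract_path.exists() and easy_path.exists():
    ratio = similarity(
        load_ocr_text(tesseract_path.name),
        load_ocr_text(easy_path.name),
    )
    print(f"pytesseract vs EasyOCR similarity: {ratio:.2f}")
```

The same `similarity` function could later be applied to a raw input and its denoised version to see how heavily a model rewrote the text.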

## Index

1. **Introduction**
    - [Denoising text with an LLM before the Embedding](#denoising-text-with-an-llm-before-the-embedding)

2. **OCR Approaches**
    - [EasyOCR and pytesseract comparison](#ocr-approaches)
    - [Input preparation and folder structure](#ocr-approaches)

3. **LLM Denoising Experiments**
    - [llama-3.3-70b-versatile](#llama-3370b-versatile)
        - [pytesseract input](#pytessearct-input)
        - [Easy OCR input](#easy-ocr-input)
    - [mistral-saba-24b](#mistral-saba-24b)
        - [pytesseract input](#pytessearct-input-1)
        - [Easy OCR input](#easy-ocr-input-1)
    - [qwen-qwq-32b](#qwen-qwq-32b)
        - [pytesseract input](#pytessearct-input-2)
        - [Easy OCR input](#easy-ocr-input-2)


## llama-3.3-70b-versatile

### pyTessearct input

In [None]:
%pip install python-dotenv
%pip install groq
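The install cell pulls in python-dotenv, while the cells below only read `GROQ_API_KEY` from the environment and would pass `None` silently if it is missing. A minimal sketch of loading and validating the key up front, assuming it lives either in the environment or in a local `.env` file; `get_groq_api_key` is a hypothetical helper, not part of the Groq SDK.

```python
import os


def get_groq_api_key(env=None):
    """Return GROQ_API_KEY, raising a clear error if it is not set.

    `env` is any mapping; when omitted, a .env file is loaded if
    python-dotenv is installed, and os.environ is used.
    """
    if env is None:
        try:
            from dotenv import load_dotenv  # provided by python-dotenv
            load_dotenv()  # loads variables from a .env file if one is found
        except ImportError:
            pass  # fall back to whatever is already in the environment
        env = os.environ
    key = env.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set; export it or add it to a .env file"
        )
    return key
```

With this in place, a cell could build the client as `Groq(api_key=get_groq_api_key())` and fail early with a readable message instead of at request time.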

In [10]:
import os

from groq import Groq

# Read the raw pytesseract output once, as UTF-8, and reuse it below
with open('task2_prompts/prompting_tesseract.txt', 'r', encoding='utf-8') as file:
    ocr_text = file.read()

# Show the OCR text as it looks before LLM denoising
print("Before denoising:")
print(ocr_text)

client = Groq(
    # This is the default and can be omitted
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": (
                "You are an expert in cleaning OCR-extracted medical documents. "
                "Your task is to correct errors while preserving clinical accuracy. "
                "Follow these rules strictly:\n"
                "1. Fix OCR artifacts (e.g., 'c0ugh' → 'cough'), but NEVER alter valid medical terms.\n"
                "2. Remove non-text noise (stray symbols, headers/footers).\n"
                "3. Retain original formatting (lists, bullet points) where meaningful.\n"
                "4. Return ONLY the cleaned text, no explanations."
            )
        },
        {
            "role": "user",
            "content": (
                "Denoise this OCR-extracted medical document for embedding:\n\n"
                "=== INPUT TEXT ===\n"
                f"{ocr_text}\n"
                "=== END INPUT ==="
            )
        }
    ],
    model="llama-3.3-70b-versatile",
    temperature=0.2,  # Low temperature reduces randomness in the rewrite
    max_tokens=4096   # Cap on generated tokens; raise it if long outputs get cut off
)

# Print the denoised text returned by the model
print(chat_completion.choices[0].message.content)

Before denoising:
Copie électronique LABORATOIRE DE BIOLOGIE MEDICALE C LABO    N° FINESS :   -  -  @:  &:    - Biologiste(s) Médical(aux) Docteur     Madame    CABINET MEDICAL " "      (100) Copie a : Docteur    , DR  X Demande n° 01/02/ -LABO--TP Edité le, lundi 1 février 2021 Copie a : Docteur    , DR Patient né(e)   le   FSE Tiers payant  -  Prélevements effectués par le laboratoire le 01/02/21 a 10H27 Vos résultats sur internet : Accés sécurisé, rapide, gratuit, pratique, @coresponsable 1) Communiquez votre mail au laboratoire 2) Recevez un email dés que vos résultats sont disponibles 3) Cliquez sur le lien INFORMATION COVID-19 Rendez-vous sur notre site internet dédié pour connaitre notre organisation : https:// .fr/depistage-covid-19/ ia = Hematologie Valeurs de référence Antériorités v Hémogramme (Sang total - Variation d'impédance, photométrie, cytométrie en flux)  -   HEMatieS .......cccccecececeeeeeeeeeeeeeeeeeeeees 4,94 Téra/L 3,80 a 5,90 4,97 HEMOGIODING ........cceeeeeeeeeeeeee
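The same system prompt and request shape are repeated for every model and OCR input in this notebook. A minimal sketch of how that could be factored out, assuming a client object that exposes `chat.completions.create` the way the Groq client is used above; `SYSTEM_PROMPT`, `build_messages` and `denoise` are hypothetical names introduced here for illustration.

```python
SYSTEM_PROMPT = (
    "You are an expert in cleaning OCR-extracted medical documents. "
    "Your task is to correct errors while preserving clinical accuracy. "
    "Follow these rules strictly:\n"
    "1. Fix OCR artifacts (e.g., 'c0ugh' → 'cough'), but NEVER alter valid medical terms.\n"
    "2. Remove non-text noise (stray symbols, headers/footers).\n"
    "3. Retain original formatting (lists, bullet points) where meaningful.\n"
    "4. Return ONLY the cleaned text, no explanations."
)


def build_messages(ocr_text):
    """Build the system + user messages used for one denoising request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": (
                "Denoise this OCR-extracted medical document for embedding:\n\n"
                "=== INPUT TEXT ===\n"
                f"{ocr_text}\n"
                "=== END INPUT ==="
            ),
        },
    ]


def denoise(client, model, ocr_text, temperature=0.2, max_tokens=4096):
    """Send one denoising request and return the model's text reply."""
    completion = client.chat.completions.create(
        messages=build_messages(ocr_text),
        model=model,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return completion.choices[0].message.content
```

With the client created as in the cell above, a run would then reduce to something like `print(denoise(client, "llama-3.3-70b-versatile", ocr_text))`, and only the model name and input file change between experiments.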

### Easy OCR input 

In [None]:
import os

from groq import Groq

# Read the raw EasyOCR output once, as UTF-8
with open('task2_prompts/prompting_easy.txt', 'r', encoding='utf-8') as file:
    ocr_text = file.read()

client = Groq(
    # This is the default and can be omitted
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": (
                "You are an expert in cleaning OCR-extracted medical documents. "
                "Your task is to correct errors while preserving clinical accuracy. "
                "Follow these rules strictly:\n"
                "1. Fix OCR artifacts (e.g., 'c0ugh' → 'cough'), but NEVER alter valid medical terms.\n"
                "2. Remove non-text noise (stray symbols, headers/footers).\n"
                "3. Retain original formatting (lists, bullet points) where meaningful.\n"
                "4. Return ONLY the cleaned text, no explanations."
            )
        },
        {
            "role": "user",
            "content": (
                "Denoise this OCR-extracted medical document for embedding:\n\n"
                "=== INPUT TEXT ===\n"
                f"{ocr_text}\n"
                "=== END INPUT ==="
            )
        }
    ],
    model="llama-3.3-70b-versatile",
    temperature=0.2,  # Low temperature reduces randomness in the rewrite
    max_tokens=4096   # Cap on generated tokens; raise it if long outputs get cut off
)

# Print the denoised text returned by the model
print(chat_completion.choices[0].message.content)

## mistral-saba-24b

### pyTessearct input

In [16]:
import os

from groq import Groq

# Read the raw pytesseract output once, as UTF-8, and reuse it below
with open('task2_prompts/prompting_tesseract.txt', 'r', encoding='utf-8') as file:
    ocr_text = file.read()

# Show the OCR text as it looks before LLM denoising
print("Before denoising:")
print(ocr_text)

client = Groq(
    # This is the default and can be omitted
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": (
                "You are an expert in cleaning OCR-extracted medical documents. "
                "Your task is to correct errors while preserving clinical accuracy. "
                "Follow these rules strictly:\n"
                "1. Fix OCR artifacts (e.g., 'c0ugh' → 'cough'), but NEVER alter valid medical terms.\n"
                "2. Remove non-text noise (stray symbols, headers/footers).\n"
                "3. Retain original formatting (lists, bullet points) where meaningful.\n"
                "4. Return ONLY the cleaned text, no explanations."
            )
        },
        {
            "role": "user",
            "content": (
                "Denoise this OCR-extracted medical document for embedding:\n\n"
                "=== INPUT TEXT ===\n"
                f"{ocr_text}\n"
                "=== END INPUT ==="
            )
        }
    ],
    model="mistral-saba-24b",
    temperature=0.2,  # Low temperature reduces randomness in the rewrite
    max_tokens=4096   # Cap on generated tokens; raise it if long outputs get cut off
)

# Print the denoised text returned by the model
print(chat_completion.choices[0].message.content)

Before denoising:
Copie électronique LABORATOIRE DE BIOLOGIE MEDICALE C LABO    N° FINESS :   -  -  @:  &:    - Biologiste(s) Médical(aux) Docteur     Madame    CABINET MEDICAL " "      (100) Copie a : Docteur    , DR  X Demande n° 01/02/ -LABO--TP Edité le, lundi 1 février 2021 Copie a : Docteur    , DR Patient né(e)   le   FSE Tiers payant  -  Prélevements effectués par le laboratoire le 01/02/21 a 10H27 Vos résultats sur internet : Accés sécurisé, rapide, gratuit, pratique, @coresponsable 1) Communiquez votre mail au laboratoire 2) Recevez un email dés que vos résultats sont disponibles 3) Cliquez sur le lien INFORMATION COVID-19 Rendez-vous sur notre site internet dédié pour connaitre notre organisation : https:// .fr/depistage-covid-19/ ia = Hematologie Valeurs de référence Antériorités v Hémogramme (Sang total - Variation d'impédance, photométrie, cytométrie en flux)  -   HEMatieS .......cccccecececeeeeeeeeeeeeeeeeeeeees 4,94 Téra/L 3,80 a 5,90 4,97 HEMOGIODING ........cceeeeeeeeeeeeee

### Easy OCR input 

In [2]:
import os

from groq import Groq

# Read the raw EasyOCR output once, as UTF-8
with open('task2_prompts/prompting_easy.txt', 'r', encoding='utf-8') as file:
    ocr_text = file.read()

client = Groq(
    # This is the default and can be omitted
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": (
                "You are an expert in cleaning OCR-extracted medical documents. "
                "Your task is to correct errors while preserving clinical accuracy. "
                "Follow these rules strictly:\n"
                "1. Fix OCR artifacts (e.g., 'c0ugh' → 'cough'), but NEVER alter valid medical terms.\n"
                "2. Remove non-text noise (stray symbols, headers/footers).\n"
                "3. Retain original formatting (lists, bullet points) where meaningful.\n"
                "4. Return ONLY the cleaned text, no explanations."
            )
        },
        {
            "role": "user",
            "content": (
                "Denoise this OCR-extracted medical document for embedding:\n\n"
                "=== INPUT TEXT ===\n"
                f"{ocr_text}\n"
                "=== END INPUT ==="
            )
        }
    ],
    model="mistral-saba-24b",
    temperature=0.2,  # Low temperature reduces randomness in the rewrite
    max_tokens=4096   # Cap on generated tokens; raise it if long outputs get cut off
)

# Print the denoised text returned by the model
print(chat_completion.choices[0].message.content)

Copie électronique

LABORATOIRE DE BIOLOGIE MÉDICALE
6  
 

No FINESS 34 3 
2 
 

 
Biologiste(s) Médical(aux)

Docteur    
Madame   

CABINET MEDICAL " "

 

Copie à
Docteur    , DR 

Demande n° 01/02/
LABO--TP
Edité le, lundi 1 février 2021

Copie à
Docteur    , DR 

Patient né(e)   le 

FSE
Tiers payant



Prélèvements effectués par le laboratoire le 01/02/21 à 10H27

Vos résultats sur internet
Accès sécurisé, rapide, gratuit, pratique, écoresponsable
1) Communiquez votre mail au laboratoire
2) Recevez un email dès vos résultats sont disponibles
3) Cliquez sur le lien

INFORMATION COVID-19
Rendez-vous sur notre site internet dédié pour connaître notre organisation
https:// .fr/depistage-covid-19/

Hématologie
Valeurs de référence
Antériorités
Hémogramme (Sang total Variation d'impédance, photométrie, cytométrie en flux)
 

Hématies 4,94 Téra/L 3,80 à 5,90 4,97
Hémoglobine 13,6 g/dL 11,5 à 17,5 13,8
8,4 mmol/L 7,1 à 10,9
Hématocrite 41,1 % 34,0 à 53,0 41,8
VG.M 83,1 fL 76,0 à 96,0 84
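Since the prompt asks the model to preserve clinical accuracy, one cheap sanity check is to confirm that every number in the raw text survives denoising. The sketch below is only that: it compares numeric tokens, not units or ordering, and the two sample strings are shortened illustrations modeled on lines shown in this notebook. `numbers_in` and `number_diff` are hypothetical helpers.

```python
import re
from collections import Counter

# Integers or decimals written with a comma or a dot, e.g. "4,94" or "83.1"
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def numbers_in(text):
    """Count every numeric token found in the text."""
    return Counter(NUMBER_RE.findall(text))


def number_diff(raw, cleaned):
    """Return (missing, added): numbers lost from raw, numbers new in cleaned."""
    raw_counts, clean_counts = numbers_in(raw), numbers_in(cleaned)
    return raw_counts - clean_counts, clean_counts - raw_counts


raw = "HEMatieS .......cccc 4,94 Téra/L 3,80 a 5,90 4,97"
cleaned = "Hématies 4,94 Téra/L 3,80 à 5,90 4,97"
missing, added = number_diff(raw, cleaned)
print(dict(missing), dict(added))  # prints: {} {}
```

Any non-empty `missing` or `added` result would flag a line worth reviewing by hand before the text is embedded.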

## qwen-qwq-32b
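
In the Easy OCR run further down, the qwen-qwq-32b reply opens with a `<think>` block even though the prompt asks for the cleaned text only. A minimal sketch of removing that reasoning before the text is embedded, assuming the block is delimited by literal `<think>` and `</think>` tags; `strip_reasoning` is a hypothetical helper. If the token limit is reached mid-reasoning, the block may never close, in which case everything from the opening tag onward is dropped.

```python
import re

# Non-greedy match so several separate <think> blocks are removed one by one
THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(reply):
    """Remove <think>...</think> blocks and any unclosed trailing <think>."""
    without_blocks = THINK_BLOCK_RE.sub("", reply)
    # An opening tag left over means the reasoning was never closed:
    # keep only the text that came before it
    visible = without_blocks.split("<think>", 1)[0]
    return visible.strip()
```

An empty result is a useful signal on its own: it suggests the model spent its whole budget on reasoning and the request may need a larger `max_tokens`.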

### pyTessearct input

In [3]:
import os

from groq import Groq

# Read the raw pytesseract output once, as UTF-8, and reuse it below
with open('task2_prompts/prompting_tesseract.txt', 'r', encoding='utf-8') as file:
    ocr_text = file.read()

# Show the OCR text as it looks before LLM denoising
print("Before denoising:")
print(ocr_text)

client = Groq(
    # This is the default and can be omitted
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": (
                "You are an expert in cleaning OCR-extracted medical documents. "
                "Your task is to correct errors while preserving clinical accuracy. "
                "Follow these rules strictly:\n"
                "1. Fix OCR artifacts (e.g., 'c0ugh' → 'cough'), but NEVER alter valid medical terms.\n"
                "2. Remove non-text noise (stray symbols, headers/footers).\n"
                "3. Retain original formatting (lists, bullet points) where meaningful.\n"
                "4. Return ONLY the cleaned text, no explanations."
            )
        },
        {
            "role": "user",
            "content": (
                "Denoise this OCR-extracted medical document for embedding:\n\n"
                "=== INPUT TEXT ===\n"
                f"{ocr_text}\n"
                "=== END INPUT ==="
            )
        }
    ],
    model="qwen-qwq-32b",
    temperature=0.2,  # Low temperature reduces randomness in the rewrite
    max_tokens=4096   # Cap on generated tokens; raise it if long outputs get cut off
)

# Print the denoised text returned by the model
print(chat_completion.choices[0].message.content)

Before denoising:
Copie électronique LABORATOIRE DE BIOLOGIE MEDICALE C LABO    N° FINESS :   -  -  @:  &:    - Biologiste(s) Médical(aux) Docteur     Madame    CABINET MEDICAL " "      (100) Copie a : Docteur    , DR  X Demande n° 01/02/ -LABO--TP Edité le, lundi 1 février 2021 Copie a : Docteur    , DR Patient né(e)   le   FSE Tiers payant  -  Prélevements effectués par le laboratoire le 01/02/21 a 10H27 Vos résultats sur internet : Accés sécurisé, rapide, gratuit, pratique, @coresponsable 1) Communiquez votre mail au laboratoire 2) Recevez un email dés que vos résultats sont disponibles 3) Cliquez sur le lien INFORMATION COVID-19 Rendez-vous sur notre site internet dédié pour connaitre notre organisation : https:// .fr/depistage-covid-19/ ia = Hematologie Valeurs de référence Antériorités v Hémogramme (Sang total - Variation d'impédance, photométrie, cytométrie en flux)  -   HEMatieS .......cccccecececeeeeeeeeeeeeeeeeeeeees 4,94 Téra/L 3,80 a 5,90 4,97 HEMOGIODING ........cceeeeeeeeeeeeee

### Easy OCR input 

In [4]:
import os

from groq import Groq

# Read the raw EasyOCR output once, as UTF-8
with open('task2_prompts/prompting_easy.txt', 'r', encoding='utf-8') as file:
    ocr_text = file.read()

client = Groq(
    # This is the default and can be omitted
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": (
                "You are an expert in cleaning OCR-extracted medical documents. "
                "Your task is to correct errors while preserving clinical accuracy. "
                "Follow these rules strictly:\n"
                "1. Fix OCR artifacts (e.g., 'c0ugh' → 'cough'), but NEVER alter valid medical terms.\n"
                "2. Remove non-text noise (stray symbols, headers/footers).\n"
                "3. Retain original formatting (lists, bullet points) where meaningful.\n"
                "4. Return ONLY the cleaned text, no explanations."
            )
        },
        {
            "role": "user",
            "content": (
                "Denoise this OCR-extracted medical document for embedding:\n\n"
                "=== INPUT TEXT ===\n"
                f"{ocr_text}\n"
                "=== END INPUT ==="
            )
        }
    ],
    model="qwen-qwq-32b",
    temperature=0.2,  # Low temperature reduces randomness in the rewrite
    max_tokens=4096   # Cap on generated tokens; raise it if long outputs get cut off
)

# Print the denoised text returned by the model
print(chat_completion.choices[0].message.content)


<think>
Okay, let's tackle this OCR cleaning task. First, I need to go through the provided document carefully. The user wants me to correct OCR errors but keep the clinical terms accurate. 

Starting with the header: "Copie électronique LABORATOIRE DE BIOLOGIE MÉDICALE 6  ..." looks okay except maybe "6  " might be a typo? Wait, maybe it's part of the lab's name. I'll leave it as is unless there's an obvious error.

Looking at the patient info: "Madame    CABINET..." – "CABINET" might be "CABINET" but probably "CABINET MEDICAL" should be "CABINET MÉDICAL" with an accent. Wait, in the original, it's written as "CABINET MEDICAL " "      (100)". "" is likely "CAME" (rue CAME DES ?), but maybe "" is a street name, so maybe it's correct. Hmm, but "" might be "" as a street name. I'll check if "" is a known street, but since I can't verify, I'll leave it as is.

Next, "Demande n' 01/02/ ~LABO--TP" – the apostrophe after "Demande n'" might be a typo. It should probably be "Demande n° 01/02/