<a href="https://colab.research.google.com/github/tgarnier067/MNLP-project-2/blob/main/02_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import data

In [1]:
from google.colab import files
uploaded = files.upload()

ModuleNotFoundError: No module named 'google.colab'

In [4]:
import json
import os

# Colab

with open("the_vampyre_clean.json", "r", encoding="utf-8") as f:
    clean_data = json.load(f)

with open("the_vampyre_ocr.json", "r", encoding="utf-8") as f:
    ocr_data = json.load(f)


# SSPCloud

# Path to the folder where the files were extracted
#extract_dir = os.path.expanduser("~/work/MNLP-project-2/data/eng")

# Full paths to the JSON files
#clean_path = os.path.join(extract_dir, "the_vampyre_clean.json")
#ocr_path = os.path.join(extract_dir, "the_vampyre_ocr.json")

# Load the JSON data
#with open(clean_path, "r", encoding="utf-8") as f:
#    clean_data = json.load(f)

#with open(ocr_path, "r", encoding="utf-8") as f:
#    ocr_data = json.load(f)

# Prepare data

In [13]:
def concat_values_dict(d):
    """
    Concatenate the values of a dict, separating each element with '\n'

    Args:
        d (dict): Dictionary

    Returns:
        str: concatenated text
    """
    return '\n'.join(d.get(str(i), "") for i in range(48))

clean_data_text = concat_values_dict(clean_data)

Possible models that we can try: transducers (seq2seq with attention, or transformers)

T5-small / T5-base

FLAN-T5-small

BART or distilBART

ByT5: specialized in byte-level processing, which is useful for OCR errors.

Charformer / CANINE: character-level models, to get a fine level of granularity

🔹 Baselines:
Levenshtein-based correction

Spell checkers + n-grams (such as Hunspell or SymSpell with context)
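
As an illustration of the Levenshtein-based baseline listed above, here is a minimal sketch, not the implementation used in this project. It replaces each out-of-vocabulary word with the closest vocabulary word within a maximum edit distance. The names `levenshtein`, `correct_word`, the toy `vocabulary` and the threshold `max_distance=2` are all assumptions made for the example. Note that this simple version returns the vocabulary entry in lowercase, so the original casing is lost.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def correct_word(word: str, vocabulary: set, max_distance: int = 2) -> str:
    """Return the closest vocabulary word, or the word itself if none is close enough."""
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    # sorted() makes the choice deterministic when several words are equally close
    best = min(sorted(vocabulary), key=lambda v: levenshtein(lowered, v))
    return best if levenshtein(lowered, best) <= max_distance else word


# Hypothetical toy vocabulary; a real one could be built from a word list.
vocabulary = {"the", "present", "greek", "latin", "belief", "where"}
print([correct_word(w, vocabulary) for w in ["prosent", "Greok", "whoro", "Polidori"]])
```

Such a baseline has no notion of context, which is why the list above pairs spell checkers with n-grams.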

In [14]:
!pip install transformers torch sentencepiece --quiet

# T5: models fine-tuned for grammar

## T5-base

We create a function that sends T5-base a prompt asking it to correct the text and returns the model's output.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

# Load model and tokenizer
model_name = "t5-base" # t5-small could also be used
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def correct_text_with_t5(text):
    prompt = f"Fix errors : {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=False, padding=False).to(device)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
# Take the first 1000 characters of the first noisy text
noisy_text = ocr_data['0'][:1000]

# Split into segments; the prompt is then applied to each segment
segments = noisy_text.split('\n')

segments

['THE VAMPYRE;',
 'A Tale.',
 'By John William Polidori',
 'THEsuperstition upon which this taIe iſ founded is very general in the East. Among tho Arabjans itappeors to be common: it did not, however, extend itself to the Greeks until after the establi shment of Christianity; and it has only aſsumed its prosent form since the division af the Latin and Greok churches; at which time, lhe idea becoming prevalent, that a Lcltin body could not corrvpl if buried in their territory, it gradually increosed, and formed lhe subject of many wonderful stories, ſtill extant, of the dead rising from their graves, and feeding uponlhe blood of tho young and beautiful. In the West itspread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, whoro the helies existed, that vompyresnightly imbi6ed a certain portion of the blood of their victims, who became emaciated, lost their strength, and speedily died of c0nsumptions; whilst these human blood-suckers fattened—and their veins 

In [None]:
# Apply the model to the first four segments
for i in range(4):
    print(correct_text_with_t5(segments[i]))

.: THE VAMPYRE; THE VAMPYRE; THE VAMPY
: A Tale.
Fix errors : By John William Polidori : By John William Polidori :
. : Fix: Fix bug fixes : Fix errors : Fix bug fixes


This model performs poorly, as the output of the cell above shows. I tried modifying the prompt, the length of the inputs, and many other parameters, but the output quality remained too low. This is not entirely surprising, since "Fix errors" is not one of the task prefixes the stock t5-base checkpoint was trained with. We will not pursue this model and will look at another one instead: vennify/t5-base-grammar-correction

## T5-base-grammar-correction

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

# Load fine-tuned grammar correction model
model_name = "vennify/t5-base-grammar-correction"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def correct_text_with_t5(text):
    prompt = f"correct: {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test on a small dataset
noisy_text = ocr_data['0'][:1000]
segments = [s.strip() for s in noisy_text.split('\n') if s.strip()]

corrected_segments = [correct_text_with_t5(seg) for seg in segments]
corrected_text = '\n'.join(corrected_segments)

print("Corrected Output:\n", corrected_text)

Corrected Output:
 THE VAMPYRE.
A Tale.
By John William Polidori.
The superstition upon which this taIe is founded is very general in the East. It did not, however, extend itself to the Greeks until after the establishment of Christianity; and it has only assumed its prosent form since the division of the Latin and Greok churches; and it gradually increosed, and formed the subject of many wonderful stories, still extant, of the dead rising from their graves, and feeding upon the blood of young and beautiful people. In the West it spread, with some slight variation,


Problem: the output is incomplete, so we increase max_length to 512. Unfortunately, this is not enough: very high values of max_length do not produce a longer output than max_length = 512. We also tried disabling early_stopping, without effect. Since the model cannot output very long passages, we apply it several times, on slices of the text.

- Instead of : model(sentence 1, sentence 2, sentence 3...)
- We do : model(sentence 1) + model(sentence 2) + model(sentence 3) + ...

We use spaCy to split the text into sentences.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
import spacy

# Load spaCy for sentence splitting
nlp = spacy.load("en_core_web_sm")

# Load the T5 model fine-tuned for grammar correction
model_name = "vennify/t5-base-grammar-correction"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Correct one sentence with T5
def correct_text_with_t5(text):
    prompt = f"correct: {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)
    outputs = model.generate(
        **inputs,
        max_length=512,
        num_beams=4,
        early_stopping=False,
        no_repeat_ngram_size=3
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split a block of text into sentences with spaCy
def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# Main function: split by line, then by sentence, then correct
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        corrected_sentences = [correct_text_with_t5(sentence) for sentence in sentences]
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

# Example with OCR text
noisy_text = ocr_data['0'][:3000]  # or the whole text
corrected_text = correct_text_by_line_and_sentence(noisy_text)

print("Corrected Output:\n", corrected_text)


Corrected Output:
 THE VAMPYRE.
A Tale.
By John William Polidori.
THE superstition upon which this theory is founded is very general in the East. It did not, however, extend itself to the Greeks until after the establishment of Christianity; and it has only assumed its prosent form since the division of the Latin and Greok churches; at which time, the idea becoming prevalent, that a Lcltin body could not corrvpl if buried in their territory, gradually increased, and formed the subject of many wonderful stories, still extant, of the dead rising from their graves, and feeding upon the blood of young and beautiful. In the West it spread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, whoro the helies existed, that vompyresnightly imbi6ed a certain portion of the blood of their victims, who became emaciated, lost their strength, and quickly died of c0nsumptions; while these human blood-suckers fattened—and their veins became distended to such a state of roplet

In [None]:
len(corrected_text)

2733

In [None]:
print(clean_data_text[:3000])

THE VAMPYRE;
A Tale.
By John William Polidori
THE superstition upon which this tale is founded is very general in the East. Among the Arabians it appears to be common: it did not, however, extend itself to the Greeks until after the establishment of Christianity; and it has only assumed its present form since the division of the Latin and Greek churches; at which time, the idea becoming prevalent, that a Latin body could not corrupt if buried in their territory, it gradually increased, and formed the subject of many wonderful stories, still extant, of the dead rising from their graves, and feeding upon the blood of the young and beautiful. In the West it spread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, where the belief existed, that vampyres nightly imbibed a certain portion of the blood of their victims, who became emaciated, lost their strength, and speedily died of consumptions; whilst these human blood-suckers fattened—and their veins became dist

Problem: the model skips some parts of the text. We input 3000 characters and it outputs only 2733. Looking at the corrected text in detail, we see that some sentences are skipped. Perhaps applying the LLM to smaller slices of the text will avoid this. We now split at each '\n' and at each of the characters ,:;!?
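
To locate where text is dropped, rather than comparing only the total lengths (3000 vs 2733), one option is to compare each input segment with its output. The sketch below is a rough diagnostic under stated assumptions: the helper name `flag_shortened_segments` and the threshold `min_ratio=0.8` are hypothetical, and the example pair is adapted from the outputs shown above. A low ratio only suggests that something was removed; it does not prove it.

```python
def flag_shortened_segments(pairs, min_ratio=0.8):
    """Return (index, ratio) for each segment whose output is much shorter than its input.

    pairs: list of (source_segment, corrected_segment) tuples.
    min_ratio: assumed threshold below which a segment is considered suspicious.
    """
    flagged = []
    for index, (source, output) in enumerate(pairs):
        if not source:
            continue  # nothing to compare against
        ratio = len(output) / len(source)
        if ratio < min_ratio:
            flagged.append((index, round(ratio, 2)))
    return flagged


pairs = [
    ("A Tale.", "A Tale."),
    ("Among tho Arabjans itappeors to be common: it did not, however, extend",
     "It did not, however, extend"),
]
print(flag_shortened_segments(pairs))
```

In the pipeline above, the pairs could be collected inside `correct_text_by_line_and_sentence` by keeping each sentence next to its correction.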

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
import spacy
import re

# Load spaCy for sentence splitting
nlp = spacy.load("en_core_web_sm")

# Load the T5 model fine-tuned for grammar correction
model_name = "vennify/t5-base-grammar-correction"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Correct one sentence with T5
def correct_text_with_t5(text):
    prompt = f"correct: {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)
    outputs = model.generate(
        **inputs,
        max_length=1024,       # Raise the maximum output length (watch GPU memory)
        num_beams=4,
        early_stopping=False, # Do not stop generation early
        length_penalty=1.0,   # Keep outputs at a natural length
        no_repeat_ngram_size=3
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split a block of text into sentences with spaCy
def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]



def split_sentences_and_punct(text):
    # First split into sentences with spaCy
    spacy_sents = split_into_sentences_spacy(text)

    # Then split each sentence on punctuation
    punct_split_sents = []
    for sent in spacy_sents:
        # Split on comma, colon, semicolon, exclamation mark, question mark
        parts = re.split(r'[,:;!?]', sent)
        parts = [p.strip() for p in parts if p.strip()]
        punct_split_sents.extend(parts)
    return punct_split_sents


# Main function: split by line, then by sentence and punctuation, then correct
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_sentences_and_punct(line)
        corrected_sentences = [correct_text_with_t5(sentence) for sentence in sentences]
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

# Example with OCR text
noisy_text = ocr_data['0'][:3000]  # or the whole text
corrected_text = correct_text_by_line_and_sentence(noisy_text)

print("Corrected Output:\n", corrected_text)

Corrected Output:
 THE VAMPYRE is correct.
A Tale.
By John William Polidori.
THE superstition upon which this theory is founded is very general in the East. Among Arabjans itappeors to be common. It did not. However, this is correct: however, the facts are correct. It extends itself to the Greeks until after the establishment of Christianity. And it has only assumed its prosent form since the division of the Latin and Greok churches. At which time is correct: at which time. The idea is becoming prevalent. That a Lcltin body could not be corrvpl if buried in their territory. It gradually increased. And formed the subject of many wonderful stories. The fact is that there are still a lot of fossils that are still extant. The dead are rising from their graves. And feeding on the blood of tho young and beautiful. In the West it spreads throughout the world. With some slight variation. All over Hungary, Hungary is correct. Poland is correct. Austria is correct. And Lorraine is correct. Whoro

New problems: the computation time becomes very high, about 3 min for 3000 characters. Generalizing this to the 48 texts would take hours. Moreover, the outputs are not correct:

- Noisy : In theLond0n Journal, of March, 1732, is a curiovs, and, of course, credible account of a particular case of vampyrifin, which is stated to hove accurred at Madreyga, in Hungary.

- Cleaned : In theLond0n Journal. The month of March is correct. 1732, correct: 1732. Is a curiovs. Correct: and so on. Of course, of course. Credible account of a particular case of vampyrifin. Which is stated to have been accurred at Madreyga. In Hungary.
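
On the runtime issue, one common way to reduce the cost is to send several sentences to the model per call instead of one. The sketch below only shows the grouping logic and is an assumption about how this could be organized, not something tested in this notebook. `correct_batch` is a hypothetical callable that takes a list of sentences and returns a list of corrections of the same length; with Hugging Face models it would typically tokenize the list with `padding=True`, call `model.generate` once, and decode with `tokenizer.batch_decode`. The batch size of 16 is arbitrary.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def correct_in_batches(
    sentences: List[str],
    correct_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[str]:
    """Apply correct_batch to the sentences batch by batch, preserving order."""
    corrected = []
    for batch in batched(sentences, batch_size):
        outputs = correct_batch(batch)
        if len(outputs) != len(batch):
            raise ValueError("correct_batch must return one output per input")
        corrected.extend(outputs)
    return corrected


# Stand-in for a real model call, used here only to show the control flow.
def fake_correct_batch(batch):
    return [s.strip() for s in batch]


print(correct_in_batches([" a ", "b", " c"], fake_correct_batch, batch_size=2))
```

Batching addresses speed only. It does not fix the incorrect outputs above, which come from splitting the text into fragments too short for the model to treat as sentences.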

## T5-small-grammar-correction

This model is an FP16-quantized version of t5-small, fine-tuned on the JFLEG dataset for grammar correction. It is optimized for fast inference while maintaining good accuracy.

Difference from t5-base-grammar-correction:

t5-base-grammar-correction: 220M parameters, medium execution speed

T5-small-grammar-correction: 60M parameters, faster execution.

Each model started from T5 (small or base) and was then fine-tuned.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
import spacy

# Load spaCy for sentence splitting
nlp = spacy.load("en_core_web_sm")

# Load the T5 model fine-tuned for grammar correction
model_name = "AventIQ-AI/T5-small-grammar-correction"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Correct one sentence with T5
def correct_text_with_t5(text):
    prompt = f"correct: {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)
    outputs = model.generate(
        **inputs,
        max_length=512,
        num_beams=4,
        early_stopping=False,
        no_repeat_ngram_size=3
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split a block of text into sentences with spaCy
def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# Main function: split by line, then by sentence, then correct
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        corrected_sentences = [correct_text_with_t5(sentence) for sentence in sentences]
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

# Example with OCR text
noisy_text = ocr_data['0'][:3000]  # or the whole text
corrected_text = correct_text_by_line_and_sentence(noisy_text)

print("Corrected Output:\n", corrected_text)

Corrected Output:
 е аее;
A Tale.
By John William Polidori
The superstition upon which this taIe is founded is very general in the East. Among tho Arabjans itappeors to be common: it did not extend itself to the Greeks until after the establi shment of Christianity; and it has only assumed its prosent form since the division af the Latin and Greok churches; at which time, lhe idea becoming prevalent, that a Lcltin body could not corrvpl if buried in their territory, it gradually increosed, and formed a subject of many wonderful stories, still ex In the West itspread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, whoro the helies existed, that vompyresnightly imbi6ed a certain portion of the blood of their victims, who became emaciated, lost their strength, and speedily died of c0nsumptions; while these human blood-suckers fattened—and their veins became distended to such a state of ropletion, as
In theLond0n Journal, of March, 1732, is a curiovs, and, of 

## T5-efficient-tiny-grammar-correction

We use a model even smaller than T5-small: T5-tiny

In [54]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import spacy

# Load spaCy for sentence splitting
nlp = spacy.load("en_core_web_sm")

# Load the tiny model for grammar correction
model_name = "visheratin/t5-efficient-tiny-grammar-correction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Correct one sentence
def correct_text_with_t5(text):
    prompt = f"correct: {text.strip()}"  # or just `text.strip()` if the prefix brings no gain
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)

    try:
        outputs = model.generate(
            **inputs,
            max_length=128,
            num_beams=2,
            early_stopping=True
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
    except Exception as e:
        print(f"Error on sentence: {text} ({e})")
        return text  # Return the original sentence on error

# Split a block of text into sentences with spaCy
def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# Main pipeline
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        corrected_sentences = [correct_text_with_t5(sentence) for sentence in sentences]
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

# Example with OCR text (replace with your own OCR text)
noisy_text = ocr_data['0'][:3000]  # or raw text
corrected_text = correct_text_by_line_and_sentence(noisy_text)

print("Corrected Output:\n", corrected_text)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



Corrected Output:
 correct: THE VAMPYRE;
The correct: A Tale.
correct: By John William Polidori.
correct: THE superstition upon which this site is founded is very general in the East. correct: Among the Arabjans itappeors to be common: it did not, however, extend itself to the Greeks until after the establishment of Christianity; and it has only assumed its present form since the division of Latin and Greok churches; at which time, the idea of becoming prevalent, that a Latin body could not correct if buried in their territory, it gradually increased, and formed the subject of many wonderful stories, still extant, of the dead rising from their graves, and feeding uponlhe blood of tho young and beautiful. Correct: In the West itspread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, whoro the helies existed, that vompyresnightly imbibed a certain portion of the blood of their victims, who became emaciated, lost their strength, and speedily died of consumptio

Problem: each correction begins with the prefix 'correct: '. Since it is the same for every corrected sentence, we just have to remove it!

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import spacy

# Load spaCy for sentence splitting
nlp = spacy.load("en_core_web_sm")

# Load the tiny model for grammar correction
model_name = "visheratin/t5-efficient-tiny-grammar-correction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Correct one sentence
def correct_text_with_t5(text):
    prompt = f"correct: {text.strip()}"  # or just `text.strip()` if the prefix brings no gain
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)

    try:
        outputs = model.generate(
            **inputs,
            max_length=128,
            num_beams=2,
            early_stopping=True
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
    except Exception as e:
        print(f"Error on sentence: {text} ({e})")
        return text  # Return the original sentence on error

# Split a block of text into sentences with spaCy
def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# Main pipeline
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        corrected_sentences = [correct_text_with_t5(sentence)[9:] for sentence in sentences] # it removes the part 'correct: '
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

# Example with OCR text (replace with your own OCR text)
noisy_text = ocr_data['0'][:3000]  # or raw text
corrected_text = correct_text_by_line_and_sentence(noisy_text)

print("Corrected Output:\n", corrected_text)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Corrected Output:
 THE VAMPYRE;
ct: A Tale.
By John William Polidori.
THE superstition upon which this site is founded is very general in the East. Among the Arabjans itappeors to be common: it did not, however, extend itself to the Greeks until after the establishment of Christianity; and it has only assumed its present form since the division of Latin and Greok churches; at which time, the idea of becoming prevalent, that a Latin body could not correct if buried in their territory, it gradually increased, and formed the subject of many wonderful stories, still extant, of the dead rising from their graves, and feeding uponlhe blood of tho young and beautiful. In the West itspread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, whoro the helies existed, that vompyresnightly imbibed a certain portion of the blood of their victims, who became emaciated, lost their strength, and speedily died of consumption; whilst these human blood-suckers fattened—and their
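
The output above still contains `ct: A Tale.`: the fixed-width slice `[9:]` assumes that every correction starts with exactly `correct: `, whereas the model returned `The correct: A Tale.` for that line. A possible alternative, sketched here as an assumption rather than as what the notebook ran, is to strip the prefix with a regular expression that tolerates capitalization and a leading "The". The names `PREFIX_PATTERN` and `strip_prompt_prefix` are hypothetical.

```python
import re

# Matches an optional leading "The", then "correct:", case-insensitively,
# only at the very start of the string.
PREFIX_PATTERN = re.compile(r"^\s*(?:the\s+)?correct:\s*", flags=re.IGNORECASE)


def strip_prompt_prefix(text: str) -> str:
    """Remove a leading 'correct:' prefix if present; otherwise return text unchanged."""
    return PREFIX_PATTERN.sub("", text, count=1)


for sample in ["correct: THE VAMPYRE;", "The correct: A Tale.", "By John William Polidori."]:
    print(strip_prompt_prefix(sample))
```

Unlike the slice, this leaves a sentence untouched when the model does not echo the prefix, so no characters are lost from the start of a correction.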

# Back Translation

Here, we pass the text through machine translation to obtain a French version, then translate it back to English, to see whether machine translation is able to correct OCR mistakes.

## MarianMTModel

In [None]:
from transformers import MarianMTModel, MarianTokenizer

def translate(text, src_lang, tgt_lang):
    model_name = f"Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    return tokenizer.decode(translated[0], skip_special_tokens=True)

# Source text in English
original_text = "This document was extracted from a noisy OCR scan. It needs to be corrected and translated."
print("📘Original text :", original_text)

# Step 1: translate to French
translated_to_french = translate(original_text, "en", "fr")
print("➡️ French translation:\n", translated_to_french)

# Step 2: translate back to English
translated_back_to_english = translate(translated_to_french, "fr", "en")
print("\n⬅️ Back-translation to English:\n", translated_back_to_english)


📘Original text : This document was extracted from a noisy OCR scan. It needs to be corrected and translated.




Everything works, so we now apply it to our dataset:

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import spacy

# Load the sentence segmentation model
nlp = spacy.load("en_core_web_sm")

# EN → FR model
en_fr_model_name = "Helsinki-NLP/opus-mt-en-fr"
en_fr_tokenizer = AutoTokenizer.from_pretrained(en_fr_model_name)
en_fr_model = AutoModelForSeq2SeqLM.from_pretrained(en_fr_model_name)

# FR → EN model
fr_en_model_name = "Helsinki-NLP/opus-mt-fr-en"
fr_en_tokenizer = AutoTokenizer.from_pretrained(fr_en_model_name)
fr_en_model = AutoModelForSeq2SeqLM.from_pretrained(fr_en_model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
en_fr_model.to(device)
fr_en_model.to(device)

# Generic translation function
def translate(text, tokenizer, model, src_lang="en", tgt_lang="fr"):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    try:
        outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
    except Exception as e:
        print(f"Translation error {src_lang}→{tgt_lang}: {text} ({e})")
        return text

# Text segmentation
def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# Translation and back-translation pipeline
def translate_and_backtranslate_text(text):
    lines = text.split('\n')
    final_lines = []

    for line in lines:
        if not line.strip():
            final_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        processed_sentences = []

        for sent in sentences:
            fr = translate(sent, en_fr_tokenizer, en_fr_model, "en", "fr")
            back_en = translate(fr, fr_en_tokenizer, fr_en_model, "fr", "en")
            processed_sentences.append(back_en)

        final_line = ' '.join(processed_sentences)
        final_lines.append(final_line)

    return '\n'.join(final_lines)

noisy_text = ocr_data['0'][:3000]  # or raw text
translated_and_back = translate_and_backtranslate_text(noisy_text)

print("🔁 Back-Translated Output:\n", translated_and_back)

Everything works, so we apply the translation to the whole dataset and save the result as a CSV file.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import spacy
import pandas as pd
from tqdm import tqdm

# Load the segmentation model
nlp = spacy.load("en_core_web_sm")

# Translation models
en_fr_model_name = "Helsinki-NLP/opus-mt-en-fr"
en_fr_tokenizer = AutoTokenizer.from_pretrained(en_fr_model_name)
en_fr_model = AutoModelForSeq2SeqLM.from_pretrained(en_fr_model_name)

fr_en_model_name = "Helsinki-NLP/opus-mt-fr-en"
fr_en_tokenizer = AutoTokenizer.from_pretrained(fr_en_model_name)
fr_en_model = AutoModelForSeq2SeqLM.from_pretrained(fr_en_model_name)

# GPU / CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
en_fr_model.to(device)
fr_en_model.to(device)

# Generic translation function
def translate(text, tokenizer, model, src_lang="en", tgt_lang="fr"):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    try:
        outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
    except Exception as e:
        print(f"Translation error {src_lang}→{tgt_lang}: {text} ({e})")
        return text

# Sentence segmentation
def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# Back-translation pipeline
def translate_and_backtranslate_text(text):
    lines = text.split('\n')
    final_lines_en = []
    final_lines_fr = []

    for line in lines:
        if not line.strip():
            final_lines_en.append('')
            final_lines_fr.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        processed_sentences_en = []
        processed_sentences_fr = []

        for sent in sentences:
            fr = translate(sent, en_fr_tokenizer, en_fr_model, "en", "fr")
            back_en = translate(fr, fr_en_tokenizer, fr_en_model, "fr", "en")
            processed_sentences_fr.append(fr)
            processed_sentences_en.append(back_en)

        final_line_fr = ' '.join(processed_sentences_fr)
        final_line_en = ' '.join(processed_sentences_en)
        final_lines_fr.append(final_line_fr)
        final_lines_en.append(final_line_en)

    return '\n'.join(final_lines_fr), '\n'.join(final_lines_en)

# Process the whole dataset
results = []

for i in tqdm(range(len(ocr_data)), desc="Back-translation"):
    original_text = ocr_data[str(i)]
    translated_fr_text, back_translated_text = translate_and_backtranslate_text(original_text)
    results.append({
        "index": i,
        "original_text": original_text,
        "translated_fr_text": translated_fr_text,
        "back_translated_text": back_translated_text
    })

# Save to a CSV file
df = pd.DataFrame(results)
df.to_csv("back_translation_correction.csv", index=False)

## Facebook

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import spacy

# Load spaCy for sentence segmentation
nlp = spacy.load("en_core_web_sm")

# NLLB model
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)  # <-- Important
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# NLLB language codes
lang_code = {
    "en": "eng_Latn",
    "fr": "fra_Latn"
}

# NLLB translation function
def translate_nllb(text, tokenizer, model, src_lang="en", tgt_lang="fr"):
    try:
        tokenizer.src_lang = lang_code[src_lang]
        encoded = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
        generated_tokens = model.generate(
            **encoded,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(lang_code[tgt_lang]),  # <-- fix
            max_length=512,
            num_beams=4,
            early_stopping=True
        )
        return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    except Exception as e:
        print(f"Translation error {src_lang}→{tgt_lang}: {text} ({e})")
        return text

# Sentence segmentation
def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# Full pipeline
def translate_and_backtranslate_nllb(text):
    lines = text.split('\n')
    final_lines = []

    for line in lines:
        if not line.strip():
            final_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        processed_sentences = []

        for sent in sentences:
            fr = translate_nllb(sent, tokenizer, model, "en", "fr")
            back_en = translate_nllb(fr, tokenizer, model, "fr", "en")
            processed_sentences.append(back_en)

        final_line = ' '.join(processed_sentences)
        final_lines.append(final_line)

    return '\n'.join(final_lines)

# Usage example
noisy_text = ocr_data['0'][:3000]  # or any raw text
translated_and_back = translate_and_backtranslate_nllb(noisy_text)
print("🔁 Back-Translated Output (NLLB):\n", translated_and_back)

🔁 Back-Translated Output (NLLB):
 The vampire.
It's a story.
By John William Polidori
The superstition on which this story is based is widespread in the East. Among the Arabs, it seems to be common: it did not, however, spread to the Greeks until after the establishment of Christianity; and it took its form only from the division of the Latin and Greek churches; at that time, as the idea became widespread, that a Lcltin body could not be corrupted if it was buried on their territory, it gradually became believed, and formed the subject of many wonderful stories, still existing, of the dead rising from their graves, and feeding on the blood of young and beautiful. In the West, it spread, with a slight variation, throughout Hungary, Poland, Austria, and Lorraine, where healers existed, that vampires absorbed at night a certain portion of the blood of their victims, who became emancipated, lost their strength, and died quickly of constipation; while these suckers of human blood grew fat a

Problem: it took 10 minutes to run on just this small excerpt (the first 3,000 characters of document 0).
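One option that might reduce this running time is to translate several sentences per `generate` call instead of one at a time. The sketch below is an assumption rather than something tested in this notebook: `chunks` and `translate_batch` are hypothetical helpers, and `batch_size=16` is an arbitrary starting value. It relies only on calls already used above (`tokenizer(...)`, `model.generate`, `tokenizer.batch_decode`) plus `model.device`.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` containing at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batch(sentences, tokenizer, model, batch_size=16, **generate_kwargs):
    """Translate a list of sentences, several per forward pass.

    Hypothetical helper: extra keyword arguments are forwarded to
    model.generate (for NLLB, pass forced_bos_token_id this way).
    """
    # Same generation settings as the single-sentence functions above,
    # unless the caller overrides them
    params = {"max_length": 512, "num_beams": 4, **generate_kwargs}
    translations = []
    for batch in chunks(sentences, batch_size):
        # Padding lets sentences of different lengths share one tensor
        inputs = tokenizer(
            batch, return_tensors="pt", padding=True, truncation=True
        ).to(model.device)
        outputs = model.generate(**inputs, **params)
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

With NLLB, `tokenizer.src_lang` would still have to be set before the call, as in `translate_nllb`. Whether batching actually helps depends on the hardware; on CPU the gain may be small.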

# BART

In [66]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import BartTokenizer, BartForConditionalGeneration
import torch
import spacy

# Make sure to replace this name
model_name = "facebook/bart-base"



#tokenizer = AutoTokenizer.from_pretrained(model_name)
#model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def correct_text_with_bart(text):

    #prompt = f"Fix errors: {text}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Load spaCy for sentence splitting
nlp = spacy.load("en_core_web_sm")

# Split a block of text into sentences with spaCy
def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# Main pipeline
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        corrected_sentences = [correct_text_with_bart(sentence) for sentence in sentences]
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

# Example with OCR text (replace with your own OCR text)
noisy_text = ocr_data['0'][:3000]  # or any raw text
corrected_text_BART = correct_text_by_line_and_sentence(noisy_text)

print("Corrected Output:\n", corrected_text_BART)




Corrected Output:
 THE VAMPYRE;
A Tale.
By John William Polidori
THEsuperstition upon which this taIe iſ founded is very general in the East. Among tho Arabjans itappeors to be common: it did not, however, extend itself to the Greeks until after the establi shment of Christianity; and it has only aſsumed its prosent form since the division af the Latin and Greok churches; at which time, lhe idea becoming prevalent, that a Lcltin body could not corrvpl if buried in their territory, it gradually increosed, and formed lhe subject of many wonderful stories, ſtill extant, of the dead rising from their graves, and feeding uponlhe blood of tho young and beautiful In the West itspread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, whoro the helies existed, that vompyresnightly imbi6ed a certain portion of the blood of their victims, who became emaciated, lost their strength, and speedily died of c0nsumptions; whilst these human blood-suckers fattened—and their ve

# Evaluation

In [None]:
# Evaluation

!pip install jiwer --quiet
from jiwer import wer, cer

total_cer = 0
total_wer = 0
n_docs = 0

for res in results:
    # results is the list of dicts built by the back-translation loop; look up
    # the matching clean document by key. Indexing clean_data_text[i] would
    # return a single character of the concatenated string (sometimes just
    # whitespace), which jiwer rejects as an empty reference.
    ref = clean_data.get(str(res["index"]), "")
    hyp = res["back_translated_text"]
    if not ref.strip():
        continue  # skip missing documents: jiwer needs a non-empty reference
    #total_cer += cer(ref, hyp)
    total_wer += wer(ref, hyp)
    n_docs += 1

#print(f"Mean CER: {total_cer / n_docs:.4f}")
print(f"Mean WER: {total_wer / n_docs:.4f}")

In [52]:
!pip install jiwer --quiet
from jiwer import wer, cer

In [64]:
cer(noisy_data[:3000], clean_data_text[:3000])

0.113

In [75]:
!pip install Levenshtein --quiet
import Levenshtein

def levenshtein_score(pred, ref):
    return Levenshtein.distance(pred, ref)

In [79]:
levenshtein_score(corrected_text_BART, clean_data_text[:3000])

205

In [80]:
levenshtein_score(corrected_text, clean_data_text[:3000])

339

In [81]:
levenshtein_score(noisy_text[:3000], clean_data_text[:3000])

124
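In the three scores above, the uncorrected OCR text (124) is closer to the reference than both model outputs (205 and 339). Raw distances depend on text length, so dividing by the reference length may make such comparisons easier to read. The sketch below is self-contained and uses a plain dynamic-programming edit distance instead of the `Levenshtein` package. `edit_distance` and `normalized_distance` are hypothetical helper names.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def normalized_distance(pred, ref):
    """Edit distance divided by the reference length (a character error rate)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(pred, ref) / len(ref)
```

For example, `normalized_distance(corrected_text_BART, clean_data_text[:3000])` would put the BART output on the same scale as the `cer` value computed earlier.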

# Ideas

Possible idea: collect text from the internet, run it through an OCR engine to obtain noisy text, and use the resulting (noisy, clean) pairs to fine-tune a model?
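A cheaper variant of this idea, named plainly as a substitute for running a real OCR engine, would be to inject synthetic noise into clean text. The sketch below is only an illustration: the `CONFUSIONS` table is a hypothetical choice, loosely inspired by errors visible in the OCR sample above (`iſ`, `c0nsumptions`, `imbi6ed`, `lhe`), and `add_ocr_noise` and `make_training_pairs` are made-up helper names.

```python
import random

# Hypothetical character confusions; a real table should be estimated
# from aligned (OCR, clean) data rather than chosen by hand
CONFUSIONS = {"s": "ſ", "o": "0", "e": "o", "t": "l", "a": "o", "b": "6"}


def add_ocr_noise(text, rng, error_rate=0.05):
    """Replace each confusable character with probability `error_rate`."""
    noisy = []
    for char in text:
        if char in CONFUSIONS and rng.random() < error_rate:
            noisy.append(CONFUSIONS[char])
        else:
            noisy.append(char)
    return "".join(noisy)


def make_training_pairs(clean_sentences, error_rate=0.05, seed=0):
    """Build (noisy, clean) pairs that could serve as fine-tuning examples."""
    rng = random.Random(seed)  # fixed seed so the pairs are reproducible
    return [(add_ocr_noise(s, rng, error_rate), s) for s in clean_sentences]
```

This only models one-to-one substitutions. Real OCR output also merges and splits words (`itappeors`, `establi shment`), which this sketch does not cover.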

**Possible idea: pass the input through the model multiple times? The hope is that the first pass removes the obvious mistakes, and that the second, third, and later passes remove the less obvious ones.**
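A minimal sketch of such a loop, assuming a correction function that maps a string to a string (for instance `correct_text_by_line_and_sentence`). `iterative_correct` is a hypothetical helper, and `max_passes=3` is an arbitrary default. The loop stops early once a pass no longer changes the text, and it keeps every intermediate version so that each pass can be scored separately.

```python
def iterative_correct(text, correct_fn, max_passes=3):
    """Apply `correct_fn` repeatedly until the text stops changing.

    Returns the final text and the list of all versions, starting
    with the original input.
    """
    history = [text]
    for _ in range(max_passes):
        corrected = correct_fn(history[-1])
        if corrected == history[-1]:
            break  # fixed point reached: another pass would change nothing
        history.append(corrected)
    return history[-1], history
```

Keeping the history matters here because the Levenshtein scores in the evaluation section show that a single model pass can move the text further from the reference (205 for BART versus 124 for the uncorrected text), so later passes are not guaranteed to help.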