<a href="https://colab.research.google.com/github/tgarnier067/MNLP-project-2/blob/main/02_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Library

In [None]:
# If you are working on sspcloud

!pip install spacy --quiet
!python -m spacy download en_core_web_sm

In [None]:
!pip install transformers torch sentencepiece --quiet

In [None]:
import json
import os
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
import spacy
import re
from transformers import MarianMTModel, MarianTokenizer
import pandas as pd
from tqdm import tqdm
import warnings
import logging
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Import data

In [2]:
#Path to english data
extract_dir = os.path.expanduser("~/work/MNLP-project-2/data/eng")

#path to json files
clean_path = os.path.join(extract_dir, "the_vampyre_clean.json")
ocr_path = os.path.join(extract_dir, "the_vampyre_ocr.json")

#load files
with open(clean_path, "r", encoding="utf-8") as f:
    clean_data = json.load(f)

with open(ocr_path, "r", encoding="utf-8") as f:
    ocr_data = json.load(f)

# Prepare data

In [3]:
def concat_values_dict(d):
    """
    Concatenate the values of a dict, separating each element with '\n'.

    Args:
        d (dict): Dictionary keyed by the strings '0' to '47'

    Returns:
        str: concatenated text
    """
    return '\n'.join(d.get(str(i), "") for i in range(48))

clean_data_text = concat_values_dict(clean_data)
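
The helper above hard-codes 48 entries. A minimal sketch of a variant that takes the order from the keys themselves, assuming every key is a numeric string as in `ocr_data['0']` (the name `concat_values_sorted` is hypothetical, not part of the notebook):

```python
def concat_values_sorted(d):
    """Join the values of d with newlines, ordered by the integer value of each key.

    Assumes every key is a numeric string such as '0', '1', ..., '47'.
    """
    return '\n'.join(d[k] for k in sorted(d, key=int))


# Toy example with keys deliberately out of order
sample = {"10": "third", "2": "second", "0": "first"}
print(concat_values_sorted(sample))
```

Sorting with `key=int` matters because a plain string sort would place `'10'` before `'2'`.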

# T5 : Model fine tuned for grammar

## T5-base

We define a function that sends a prompt asking the T5-base model to correct the text, then we print the model's output.

In [None]:
# Load model and tokenizer
model_name = "t5-base" # t5-small would also work
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def correct_text_with_t5(text):
    prompt = f"Fix errors : {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=False, padding=False).to(device)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
# Take the first 1000 characters of the first noisy text
noisy_text = ocr_data['0'][:1000]

# Split into segments; each segment will be corrected separately
segments = noisy_text.split('\n')

segments

['THE VAMPYRE;',
 'A Tale.',
 'By John William Polidori',
 'THEsuperstition upon which this taIe iſ founded is very general in the East. Among tho Arabjans itappeors to be common: it did not, however, extend itself to the Greeks until after the establi shment of Christianity; and it has only aſsumed its prosent form since the division af the Latin and Greok churches; at which time, lhe idea becoming prevalent, that a Lcltin body could not corrvpl if buried in their territory, it gradually increosed, and formed lhe subject of many wonderful stories, ſtill extant, of the dead rising from their graves, and feeding uponlhe blood of tho young and beautiful. In the West itspread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, whoro the helies existed, that vompyresnightly imbi6ed a certain portion of the blood of their victims, who became emaciated, lost their strength, and speedily died of c0nsumptions; whilst these human blood-suckers fattened—and their veins 

In [None]:
# Apply the model to the first four segments
for i in range(4):
    print(correct_text_with_t5(segments[i]))

.: THE VAMPYRE; THE VAMPYRE; THE VAMPY
: A Tale.
Fix errors : By John William Polidori : By John William Polidori :
. : Fix: Fix bug fixes : Fix errors : Fix bug fixes


This model performs poorly, as the output in the cell above shows. We tried changing the prompt, the input length, and many other parameters, but the output quality remained too low. We therefore set this model aside and turn to another one: vennify/t5-base-grammar-correction

## T5-base-grammar-correction

T5-base-grammar-correction is a transformer model fine-tuned for grammatical error correction. It was trained on JFLEG, a corpus built for developing and evaluating grammatical error correction systems.

In [None]:
# Load fine-tuned grammar correction model
model_name = "vennify/t5-base-grammar-correction"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def correct_text_with_t5(text):
    prompt = f"correct: {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test on a small dataset
noisy_text = ocr_data['0'][:1000]
segments = [s.strip() for s in noisy_text.split('\n') if s.strip()]

corrected_segments = [correct_text_with_t5(seg) for seg in segments]
corrected_text = '\n'.join(corrected_segments)

print("Corrected Output:\n", corrected_text)

Corrected Output:
 THE VAMPYRE.
A Tale.
By John William Polidori.
The superstition upon which this taIe is founded is very general in the East. It did not, however, extend itself to the Greeks until after the establishment of Christianity; and it has only assumed its prosent form since the division of the Latin and Greok churches; and it gradually increosed, and formed the subject of many wonderful stories, still extant, of the dead rising from their graves, and feeding upon the blood of young and beautiful people. In the West it spread, with some slight variation,


Problem: the output is incomplete. We increased max_length to 512, but that was not enough, and much larger values of max_length did not produce longer output than max_length = 512. Removing early_stopping did not help either. Since the model cannot output very long passages, we apply it several times, on slices of the text.

- Instead of : model(sentence 1, sentence 2, sentence 3...)
- We do : model(sentence 1) + model(sentence 2) + model(sentence 3) + ...

We use spaCy to split the text into sentences.
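
One call per sentence is the simplest form of this slicing. As a variation, short consecutive sentences could be packed into one slice to reduce the number of model calls. The sketch below is only an illustration: `pack_sentences` is a hypothetical helper, and the character budget of 400 is an arbitrary assumption used as a rough stand-in for a token limit.

```python
def pack_sentences(sentences, max_chars=400):
    """Group consecutive sentences into chunks of at most max_chars characters.

    A sentence longer than max_chars is kept as its own chunk rather than cut.
    """
    chunks, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(pack_sentences(["A Tale.", "By John William Polidori.", "THE VAMPYRE;"], max_chars=35))
```

Each chunk would then be passed to the correction function in place of a single sentence.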

In [None]:
#Load spacy to split into sentences
nlp = spacy.load("en_core_web_sm")

# Load fine-tuned grammar correction model
model_name = "vennify/t5-base-grammar-correction"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

#function to clean the text
def correct_text_with_t5(text):
    prompt = f"correct: {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)
    outputs = model.generate(
        **inputs,
        max_length=512,
        num_beams=4,
        early_stopping=False,
        no_repeat_ngram_size=3
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# main function : split by lines, then sentences, and correct
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        corrected_sentences = [correct_text_with_t5(sentence) for sentence in sentences]
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

#Example with an OCR extract
noisy_text = ocr_data['0'][:3000] 
corrected_text = correct_text_by_line_and_sentence(noisy_text)

print("Corrected Output:\n", corrected_text)


Corrected Output:
 THE VAMPYRE.
A Tale.
By John William Polidori.
THE superstition upon which this theory is founded is very general in the East. It did not, however, extend itself to the Greeks until after the establishment of Christianity; and it has only assumed its prosent form since the division of the Latin and Greok churches; at which time, the idea becoming prevalent, that a Lcltin body could not corrvpl if buried in their territory, gradually increased, and formed the subject of many wonderful stories, still extant, of the dead rising from their graves, and feeding upon the blood of young and beautiful. In the West it spread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, whoro the helies existed, that vompyresnightly imbi6ed a certain portion of the blood of their victims, who became emaciated, lost their strength, and quickly died of c0nsumptions; while these human blood-suckers fattened—and their veins became distended to such a state of roplet

In [None]:
len(corrected_text)

2733

In [None]:
print(clean_data_text[:3000])

THE VAMPYRE;
A Tale.
By John William Polidori
THE superstition upon which this tale is founded is very general in the East. Among the Arabians it appears to be common: it did not, however, extend itself to the Greeks until after the establishment of Christianity; and it has only assumed its present form since the division of the Latin and Greek churches; at which time, the idea becoming prevalent, that a Latin body could not corrupt if buried in their territory, it gradually increased, and formed the subject of many wonderful stories, still extant, of the dead rising from their graves, and feeding upon the blood of the young and beautiful. In the West it spread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, where the belief existed, that vampyres nightly imbibed a certain portion of the blood of their victims, who became emaciated, lost their strength, and speedily died of consumptions; whilst these human blood-suckers fattened—and their veins became dist

Problem: the model skips parts of the text. We input 3000 characters and it outputs only 2733. Looking at the output in detail, we see that whole passages are missing (for example, "Among the Arabians it appears to be common" is gone). Perhaps applying the model to smaller slices will avoid this. We now split at each '\n' and also at each of the characters , : ; ! ?
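
To locate where text is lost, rather than only comparing total lengths, each input sentence can be compared with its own output. A minimal sketch, assuming the (input, output) pairs have been collected during correction; `flag_shrunken` and the 0.8 threshold are hypothetical choices, not part of the notebook:

```python
def flag_shrunken(pairs, min_ratio=0.8):
    """Return (index, ratio) for every pair whose output is much shorter than its input.

    pairs: list of (input_sentence, output_sentence) tuples.
    """
    flagged = []
    for i, (src, out) in enumerate(pairs):
        # Length ratio of output to input; empty inputs are never flagged
        ratio = len(out) / len(src) if src else 1.0
        if ratio < min_ratio:
            flagged.append((i, round(ratio, 2)))
    return flagged


pairs = [
    ("A Tale.", "A Tale."),
    ("Among tho Arabjans itappeors to be common: it did not, however, extend itself", "It did not, however, extend itself"),
]
print(flag_shrunken(pairs))
```

A low ratio does not prove a sentence was dropped, but it points to the slices worth reading by eye.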

In [None]:
nlp = spacy.load("en_core_web_sm")

# Load fine-tuned grammar correction model
model_name = "vennify/t5-base-grammar-correction"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

#Function to clean an input text
def correct_text_with_t5(text):
    prompt = f"correct: {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)
    outputs = model.generate(
        **inputs,
        max_length=1024,       
        num_beams=4,
        early_stopping=False, 
        length_penalty=1.0,   
        no_repeat_ngram_size=3
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]


def split_sentences_and_punct(text):
    #Firstly split into sentences
    spacy_sents = split_into_sentences_spacy(text)

    # then split according to punctuations
    punct_split_sents = []
    for sent in spacy_sents:
        parts = re.split(r'[,:;!?]', sent)
        parts = [p.strip() for p in parts if p.strip()]
        punct_split_sents.extend(parts)
    return punct_split_sents


# main function: split into lines, then sentences, then at punctuation marks, and correct each piece
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_sentences_and_punct(line)
        corrected_sentences = [correct_text_with_t5(sentence) for sentence in sentences]
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

# Example
noisy_text = ocr_data['0'][:3000] 
corrected_text = correct_text_by_line_and_sentence(noisy_text)

print("Corrected Output:\n", corrected_text)

Corrected Output:
 THE VAMPYRE is correct.
A Tale.
By John William Polidori.
THE superstition upon which this theory is founded is very general in the East. Among Arabjans itappeors to be common. It did not. However, this is correct: however, the facts are correct. It extends itself to the Greeks until after the establishment of Christianity. And it has only assumed its prosent form since the division of the Latin and Greok churches. At which time is correct: at which time. The idea is becoming prevalent. That a Lcltin body could not be corrvpl if buried in their territory. It gradually increased. And formed the subject of many wonderful stories. The fact is that there are still a lot of fossils that are still extant. The dead are rising from their graves. And feeding on the blood of tho young and beautiful. In the West it spreads throughout the world. With some slight variation. All over Hungary, Hungary is correct. Poland is correct. Austria is correct. And Lorraine is correct. Whoro

New problems: the computation time becomes very high, about 3 minutes for 3000 characters. Generalizing to the 48 texts would take hours. Moreover, the outputs are not correct:

- Noisy : In theLond0n Journal, of March, 1732, is a curiovs, and, of course, credible account of a particular case of vampyrifin, which is stated to hove accurred at Madreyga, in Hungary.

- Cleaned : In theLond0n Journal. The month of March is correct. 1732, correct: 1732. Is a curiovs. Correct: and so on. Of course, of course. Credible account of a particular case of vampyrifin. Which is stated to have been accurred at Madreyga. In Hungary.
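
One side effect of `re.split(r'[,:;!?]', sent)` is that the punctuation marks themselves are discarded, so they cannot be restored when the pieces are joined again. A small sketch of a split that keeps each mark attached to its fragment, using a lookbehind (the name `split_keep_punct` is hypothetical). It does not address the quality problem above; it only preserves the original punctuation.

```python
import re


def split_keep_punct(sentence):
    """Split after , : ; ! ? while keeping each mark attached to its fragment."""
    # The lookbehind finds positions right after a mark; \s* eats the following spaces
    parts = re.split(r'(?<=[,:;!?])\s*', sentence)
    return [p for p in parts if p]


fragments = split_keep_punct("In Hungary, Poland; and Lorraine!")
print(fragments)
print(' '.join(fragments))
```

Joining the fragments with a single space gives back the input when it used single spaces after punctuation.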

## T5-small-grammar-correction

This model is a smaller version of T5-base-grammar-correction

In [None]:
#Load spacy
nlp = spacy.load("en_core_web_sm")

# Load fine-tuned grammar correction model
model_name = "AventIQ-AI/T5-small-grammar-correction"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

#function to correct input text with the model
def correct_text_with_t5(text):
    prompt = f"correct: {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)
    outputs = model.generate(
        **inputs,
        max_length=512,
        num_beams=4,
        early_stopping=False,
        no_repeat_ngram_size=3
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

#main function : split into sentences and correct
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        corrected_sentences = [correct_text_with_t5(sentence) for sentence in sentences]
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

# Example
noisy_text = ocr_data['0'][:3000] 
corrected_text = correct_text_by_line_and_sentence(noisy_text)

print("Corrected Output:\n", corrected_text)

Corrected Output:
 е аее;
A Tale.
By John William Polidori
The superstition upon which this taIe is founded is very general in the East. Among tho Arabjans itappeors to be common: it did not extend itself to the Greeks until after the establi shment of Christianity; and it has only assumed its prosent form since the division af the Latin and Greok churches; at which time, lhe idea becoming prevalent, that a Lcltin body could not corrvpl if buried in their territory, it gradually increosed, and formed a subject of many wonderful stories, still ex In the West itspread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, whoro the helies existed, that vompyresnightly imbi6ed a certain portion of the blood of their victims, who became emaciated, lost their strength, and speedily died of c0nsumptions; while these human blood-suckers fattened—and their veins became distended to such a state of ropletion, as
In theLond0n Journal, of March, 1732, is a curiovs, and, of 

## T5-efficient-tiny-grammar-correction

This model is even smaller than T5-small

In [None]:
# To split text
nlp = spacy.load("en_core_web_sm")

# Load fine-tuned grammar correction tiny model
model_name = "visheratin/t5-efficient-tiny-grammar-correction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Function to correct input text
def correct_text_with_t5(text):
    prompt = f"correct: {text.strip()}" 
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)

    try:
        outputs = model.generate(
            **inputs,
            max_length=128,
            num_beams=2,
            early_stopping=True
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
    except Exception as e:
        print(f"Error on sentence: {text} ({e})")
        return text  

def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# main function that split text into sentences and then correct
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        corrected_sentences = [correct_text_with_t5(sentence) for sentence in sentences]
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

#example 
noisy_text = ocr_data['0'][:3000]
corrected_text = correct_text_by_line_and_sentence(noisy_text)

print("Corrected Output:\n", corrected_text)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Corrected Output:
 correct: THE VAMPYRE;
The correct: A Tale.
correct: By John William Polidori.
correct: THE superstition upon which this site is founded is very general in the East. correct: Among the Arabjans itappeors to be common: it did not, however, extend itself to the Greeks until after the establishment of Christianity; and it has only assumed its present form since the division of Latin and Greok churches; at which time, the idea of becoming prevalent, that a Latin body could not correct if buried in their territory, it gradually increased, and formed the subject of many wonderful stories, still extant, of the dead rising from their graves, and feeding uponlhe blood of tho young and beautiful. Correct: In the West itspread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, whoro the helies existed, that vompyresnightly imbibed a certain portion of the blood of their victims, who became emaciated, lost their strength, and speedily died of consumptio

Problem: each correction starts with the prompt prefix 'correct: ' (sometimes capitalized). Since the prefix is the same for every corrected sentence, we can simply remove it.

In [None]:
nlp = spacy.load("en_core_web_sm")

# Load fine-tuned grammar correction tiny model
model_name = "visheratin/t5-efficient-tiny-grammar-correction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Function to correct input text
def correct_text_with_t5(text):
    prompt = f"correct: {text.strip()}" 
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True).to(device)

    try:
        outputs = model.generate(
            **inputs,
            max_length=128,
            num_beams=2,
            early_stopping=True
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
    except Exception as e:
        print(f"Error on sentence: {text} ({e})")
        return text  


def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# main function: split into sentences, correct them, and remove the 'correct: ' prefix at the beginning of each output
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        corrected_sentences = [correct_text_with_t5(sentence)[9:] for sentence in sentences] # it removes the part 'correct: '
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

# Example
noisy_text = ocr_data['0'][:3000]  
corrected_text = correct_text_by_line_and_sentence(noisy_text)

print("Corrected Output:\n", corrected_text)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Corrected Output:
 THE VAMPYRE;
ct: A Tale.
By John William Polidori.
THE superstition upon which this site is founded is very general in the East. Among the Arabjans itappeors to be common: it did not, however, extend itself to the Greeks until after the establishment of Christianity; and it has only assumed its present form since the division of Latin and Greok churches; at which time, the idea of becoming prevalent, that a Latin body could not correct if buried in their territory, it gradually increased, and formed the subject of many wonderful stories, still extant, of the dead rising from their graves, and feeding uponlhe blood of tho young and beautiful. In the West itspread, with some slight variation, all over Hungary, Poland, Austria, and Lorraine, whoro the helies existed, that vompyresnightly imbibed a certain portion of the blood of their victims, who became emaciated, lost their strength, and speedily died of consumption; whilst these human blood-suckers fattened—and their
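
The `ct: A Tale.` line above comes from the fixed `[9:]` slice: for that sentence the model had returned `The correct: A Tale.`, so cutting nine characters removed the wrong span. A minimal sketch of removing the prefix only when it is really there, assuming the echo always has the form `correct:` at the start of the output (the helper name `strip_prompt_prefix` is hypothetical):

```python
import re

# Matches an echoed prompt prefix such as 'correct: ' or 'Correct: ' at the start
_PREFIX = re.compile(r'^\s*correct:\s*', flags=re.IGNORECASE)


def strip_prompt_prefix(text):
    """Remove a leading 'correct:' echo if present; otherwise return text unchanged."""
    return _PREFIX.sub('', text, count=1)


for output in ["correct: By John William Polidori.", "The correct: A Tale.", "Correct: In the West"]:
    print(repr(strip_prompt_prefix(output)))
```

With this version, `The correct: A Tale.` is left untouched instead of being truncated; whether to also strip that variant is a separate choice.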

# Back Translation

Here, we pass the text through machine translation to obtain a French version, then translate it back to English, to see whether machine translation models can correct OCR mistakes.

## MarianMTModel

In [6]:
# Function that translates a text from a source language to a target language
def translate(text, src_lang, tgt_lang):
    model_name = f"Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    return tokenizer.decode(translated[0], skip_special_tokens=True)

# Source text in English
original_text = "This document was extracted from a noisy OCR scan. It needs to be corrected and translated."
print("Original text :", original_text)

# Translate to French
translated_to_french = translate(original_text, "en", "fr")
print("French translation :\n", translated_to_french)

# Back to English
translated_back_to_english = translate(translated_to_french, "fr", "en")
print("\n back translated to English :\n", translated_back_to_english)

Original text : This document was extracted from a noisy OCR scan. It needs to be corrected and translated.
French translation :
 Ce document a été extrait d'un scanner OCR bruyant. Il doit être corrigé et traduit.

 back translated to English :
 This document has been extracted from a noisy OCR scanner. It must be corrected and translated.


Everything works, so we now apply it to our dataset:

In [7]:
#load spacy to split into sentences
nlp = spacy.load("en_core_web_sm")

#English To French Model 
en_fr_model_name = "Helsinki-NLP/opus-mt-en-fr"
en_fr_tokenizer = AutoTokenizer.from_pretrained(en_fr_model_name)
en_fr_model = AutoModelForSeq2SeqLM.from_pretrained(en_fr_model_name)

# French to English Model 
fr_en_model_name = "Helsinki-NLP/opus-mt-fr-en"
fr_en_tokenizer = AutoTokenizer.from_pretrained(fr_en_model_name)
fr_en_model = AutoModelForSeq2SeqLM.from_pretrained(fr_en_model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
en_fr_model.to(device)
fr_en_model.to(device)

# function to translate
def translate(text, tokenizer, model, src_lang="en", tgt_lang="fr"):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    try:
        outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
    except Exception as e:
        print(f"Translation error {src_lang}→{tgt_lang}: {text} ({e})")
        return text

def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# main function that split into sentences, translate and backtranslate
def translate_and_backtranslate_text(text):
    lines = text.split('\n')
    final_lines = []

    for line in lines:
        if not line.strip():
            final_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        processed_sentences = []

        for sent in sentences:
            fr = translate(sent, en_fr_tokenizer, en_fr_model, "en", "fr")
            back_en = translate(fr, fr_en_tokenizer, fr_en_model, "fr", "en")
            processed_sentences.append(back_en)

        final_line = ' '.join(processed_sentences)
        final_lines.append(final_line)

    return '\n'.join(final_lines)

# Test on a small dataset
noisy_text = ocr_data['0'][:1000]  
translated_and_back = translate_and_backtranslate_text(noisy_text)

print(" Back-Translated Output:\n", translated_and_back)



 Back-Translated Output:
 VAMPYRE;
A tale.
By John William Polidori
THE superstition on which this tae is based is very general in the East. However, it did not extend to the Greeks until after the establishment of Christianity; and it did not take the form of prosents since the division that the Latin and Greek churches had; at that time, the idea of becoming dominant, that a body of Lcltin could not corrupt if it was buried in their territory, it gradually creed, and formed the object of many wonderful stories, still existing, of the dead who rose from their graves, and fed on the blood of young and beautiful tho. In the West, it spread, with a slight variation, throughout Hungary, Poland, Austria and Lorraine, which were the helies, that the vampyres every night soaked some of the blood of their victims, who emaciated themselves, lost their strength, and died quickly of c0nomptions; while these human blood suckers were fattening — and their veins became distendable


The pipeline runs end to end, so we apply the back-translation to the whole dataset and save the result as a CSV file.
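
Before running on the whole dataset, it can help to put a number on how close an output is to the clean reference, instead of judging by eye. A minimal sketch using the standard library's `difflib`; the helper `char_similarity` is hypothetical, and the three strings are short fragments copied from the outputs above.

```python
from difflib import SequenceMatcher


def char_similarity(candidate, reference):
    """Similarity in [0, 1] between two strings (1.0 means identical)."""
    return SequenceMatcher(None, candidate, reference, autojunk=False).ratio()


clean = "THE superstition upon which this tale is founded"
noisy = "THEsuperstition upon which this taIe iſ founded"
back = "THE superstition on which this tae is based"

print("noisy vs clean:", round(char_similarity(noisy, clean), 3))
print("back-translated vs clean:", round(char_similarity(back, clean), 3))
```

In the notebook, the same comparison could be made between `translated_and_back` and the matching slice of `clean_data['0']`. `SequenceMatcher` is slow on long strings, so comparing text by text or line by line is safer than comparing the whole corpus at once.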

In [None]:
# Process the whole dataset
results = []

# Apply the back translation to every text
for i in tqdm(range(len(ocr_data)), desc="Back-translation"):
    original_text = ocr_data[str(i)]
    # translate_and_backtranslate_text returns a single string (the back-translated text)
    back_translated_text = translate_and_backtranslate_text(original_text)
    results.append({
        "index": i,
        "original_text": original_text,
        "back_translated_text": back_translated_text
    })

# Saving the correction file
df = pd.DataFrame(results)
df.to_csv("back_translation_correction.csv", index=False)
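
Because a full run can take a long time, it may be worth writing each row as soon as it is ready, so that an interrupted run can resume where it stopped. The sketch below is one possible way to do this with the standard `csv` module; `process_with_checkpoint`, its column names, and the file layout are assumptions for illustration, not the notebook's method.

```python
import csv
import os


def process_with_checkpoint(texts, process_fn, out_path):
    """Apply process_fn to each text, appending one CSV row per text.

    texts: dict mapping numeric string keys ('0', '1', ...) to text.
    Rows already present in out_path are skipped, so an interrupted run can resume.
    """
    fields = ["index", "original_text", "processed_text"]
    done = set()
    write_header = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
    if not write_header:
        # Collect the indices that were already processed in a previous run
        with open(out_path, newline="", encoding="utf-8") as f:
            done = {row["index"] for row in csv.DictReader(f)}

    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        if write_header:
            writer.writeheader()
        for key in sorted(texts, key=int):
            if key in done:
                continue
            writer.writerow({
                "index": key,
                "original_text": texts[key],
                "processed_text": process_fn(texts[key]),
            })
            f.flush()  # push the row to the file before starting the next text


# Possible use in this notebook (not executed here):
# process_with_checkpoint(ocr_data, translate_and_backtranslate_text, "back_translation_correction.csv")
```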

## Facebook NLLB

In [5]:
# To split the text
nlp = spacy.load("en_core_web_sm")

# Model NLLB
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False) 
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# NLLB language codes
lang_code = {
    "en": "eng_Latn",
    "fr": "fra_Latn"
}

# translation function
def translate_nllb(text, tokenizer, model, src_lang="en", tgt_lang="fr"):
    try:
        tokenizer.src_lang = lang_code[src_lang]
        encoded = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
        generated_tokens = model.generate(
            **encoded,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(lang_code[tgt_lang]),  # force the target language as the first generated token
            max_length=512,
            num_beams=4,
            early_stopping=True
        )
        return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    except Exception as e:
        print(f"Translation error {src_lang}→{tgt_lang}: {text} ({e})")
        return text


def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# main function
def translate_and_backtranslate_nllb(text):
    lines = text.split('\n')
    final_lines = []

    for line in lines:
        if not line.strip():
            final_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        processed_sentences = []

        for sent in sentences:
            fr = translate_nllb(sent, tokenizer, model, "en", "fr")
            back_en = translate_nllb(fr, tokenizer, model, "fr", "en")
            processed_sentences.append(back_en)

        final_line = ' '.join(processed_sentences)
        final_lines.append(final_line)

    return '\n'.join(final_lines)

# Example
noisy_text = ocr_data['0'][:3000]  
translated_and_back = translate_and_backtranslate_nllb(noisy_text)
print(" Back-Translated Output (NLLB):\n", translated_and_back)


 Back-Translated Output (NLLB):
 The vampire.
It's a story.
By John William Polidori
The superstition on which this story is based is widespread in the East. Among the Arabs, it seems to be common: it did not, however, spread to the Greeks until after the establishment of Christianity; and it took its form only from the division of the Latin and Greek churches; at that time, as the idea became widespread, that a Lcltin body could not be corrupted if it was buried on their territory, it gradually became believed, and formed the subject of many wonderful stories, still existing, of the dead rising from their graves, and feeding on the blood of young and beautiful. In the West, it spread, with a slight variation, throughout Hungary, Poland, Austria, and Lorraine, where healers existed, that vampires absorbed at night a certain portion of the blood of their victims, who became emancipated, lost their strength, and died quickly of constipation; while these suckers of human blood grew fat an

Problem: it took 10 minutes to run just this small part of the code.
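
To see what this timing would mean for the full corpus, the measurement can be extrapolated linearly. This is a rough sketch: it assumes run time grows in proportion to the number of characters, and the corpus size used below is a made-up placeholder, not a measured value.

```python
def estimate_hours(sample_chars, sample_minutes, total_chars):
    """Linearly extrapolate a timing measured on a sample to the full corpus."""
    minutes = sample_minutes * total_chars / sample_chars
    return minutes / 60


# Hypothetical corpus size, for illustration only; in the notebook it could be
# computed as sum(len(t) for t in ocr_data.values())
total_chars = 120_000
print(f"{estimate_hours(3000, 10, total_chars):.1f} hours")
```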

# BART

Let's use pykale/bart-large-ocr, a BART-large model fine-tuned for the OCR correction task. It was trained on a historical English corpus.

In [None]:
warnings.filterwarnings("ignore")

logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("torch").setLevel(logging.ERROR)

In [None]:
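# spaCy pipeline used for sentence splitting
nlp = spacy.load("en_core_web_sm")

# Load the BART model fine-tuned for OCR correction
model_name = "pykale/bart-large-ocr"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Use the first GPU when available, otherwise fall back to CPU (device=-1)
device_id = 0 if torch.cuda.is_available() else -1
generator = pipeline('text2text-generation', model=model, tokenizer=tokenizer, device=device_id)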

def correct_text_with_bart(text, max_length=1024):
    try:
        outputs = generator(text, max_length=max_length, num_beams=5, early_stopping=True)
        return outputs[0]['generated_text']
    except Exception as e:
        print(f"Correction error: {e}")
        return text

def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
        corrected_sentences = [correct_text_with_bart(sentence) for sentence in sentences]
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

# Correction on all texts
results = []

for i in tqdm(range(len(ocr_data)), desc="OCR texts correction"):
    key = str(i)
    original_text = ocr_data[key]
    corrected_text = correct_text_by_line_and_sentence(original_text)
    results.append({
        "index": i,
        "original_text": original_text,
        "corrected_text": corrected_text
    })

# Saving
df = pd.DataFrame(results)
df.to_csv("bart_correction.csv", index=False)

print("Look at your csv files!")

OCR texts correction: 100%|██████████| 48/48 [3:32:38<00:00, 265.80s/it]

Look at your csv files!
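
Since the full run takes about three and a half hours, it may be worth writing each result to disk as soon as it is ready, so that an interrupted run can resume where it stopped. The following is a minimal sketch, assuming a JSON Lines checkpoint file. `run_with_checkpoint`, `load_done`, the file name, and the `toy_correct` stand-in are all hypothetical and not part of the notebook.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def load_done(path: str) -> Dict[str, dict]:
    """Read previously saved records, keyed by their text key."""
    done: Dict[str, dict] = {}
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done[record["key"]] = record
    return done


def run_with_checkpoint(
    texts: Dict[str, str],
    correct: Callable[[str], str],
    path: str,
) -> List[dict]:
    """Correct every text, skipping keys already present in the checkpoint file."""
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as f:
        for key, original in texts.items():
            if key in done:
                continue  # already processed in an earlier run
            record = {
                "key": key,
                "original_text": original,
                "corrected_text": correct(original),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # make the record visible on disk right away
            done[key] = record
    return [done[key] for key in texts]


# Toy demonstration: the second call only processes the new key "2".
calls: List[str] = []


def toy_correct(text: str) -> str:
    calls.append(text)
    return text.strip()


checkpoint_path = os.path.join(tempfile.mkdtemp(), "bart_checkpoint.jsonl")
first = run_with_checkpoint({"0": " a ", "1": " b "}, toy_correct, checkpoint_path)
second = run_with_checkpoint({"0": " a ", "1": " b ", "2": " c "}, toy_correct, checkpoint_path)
print(len(calls), [r["corrected_text"] for r in second])
```

With the notebook's data, `texts` would be `ocr_data` and `correct` would be `correct_text_by_line_and_sentence`; the final list can still be passed to `pd.DataFrame` as before.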





# Back-translation and T5

Here we combine two methods and correct the text in two passes. The OCR text is first cleaned with the back-translation method, and the result then goes through a second correction pass with the visheratin/t5-efficient-tiny-grammar-correction model.

In [None]:
# Load spaCy for sentence splitting
nlp = spacy.load("en_core_web_sm")

# load the model
model_name = "visheratin/t5-efficient-tiny-grammar-correction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

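# Correct the input text with the T5 grammar-correction model
def correct_text_with_t5(text, max_length=1024):
    try:
        prompt = f"Correct: {text}"
        # Note: with this tokenizer, truncation=True has no effect without an explicit
        # max_length, which is what triggers the truncation warning in the output below.
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
        # max_length caps the number of newly generated tokens
        outputs = model.generate(**inputs, max_new_tokens=max_length, num_beams=5, early_stopping=True)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
    except Exception as e:
        print(f"Correction error: {e}")
        return text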

def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

# Main function: split the text into lines, then sentences, and correct each sentence
def correct_text_by_line_and_sentence(text):
    lines = text.split('\n')
    corrected_lines = []

    for line in lines:
        if not line.strip():
            corrected_lines.append('')
            continue

        sentences = split_into_sentences_spacy(line)
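        # [9:] drops the first 9 characters of each output, i.e. the length of the "Correct: " prompt prefix
        corrected_sentences = [correct_text_with_t5(sentence)[9:] for sentence in sentences]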
        corrected_line = ' '.join(corrected_sentences)
        corrected_lines.append(corrected_line)

    return '\n'.join(corrected_lines)

# Process all back-translated texts
data = pd.read_csv('back_translation_correction.csv')
results = []

for i in tqdm(range(len(data['back_translated_text'])), desc="OCR correction"):
    original_text = data['back_translated_text'][i]
    corrected_text = correct_text_by_line_and_sentence(original_text)
    results.append({
        "index": i,
        "original_text": original_text,
        "corrected_text": corrected_text
    })

# Saving results
df = pd.DataFrame(results)
df.to_csv("back_translation_t5_correction.csv", index=False)

print('Look at your csv files !')

OCR correction:   0%|          | 0/48 [00:00<?, ?it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
OCR correction: 100%|██████████| 48/48 [1:42:49<00:00, 128.53s/it]

Look at your csv files !
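
The `[9:]` slice in `correct_text_by_line_and_sentence` removes the first nine characters of every model output, which matches the length of the `"Correct: "` prompt prefix. If an output does not start with that prefix, the slice cuts into real text. A more defensive variant strips the prefix only when it is present. `strip_prompt_prefix` is a hypothetical helper, and the assumption that the model echoes `"Correct: "` is inferred from the slice length, not verified here.

```python
PROMPT_PREFIX = "Correct: "  # 9 characters, the same length as the [9:] slice


def strip_prompt_prefix(output: str, prefix: str = PROMPT_PREFIX) -> str:
    """Remove `prefix` from the start of `output` only when it is present."""
    if output.startswith(prefix):
        return output[len(prefix):]
    return output


echoed = strip_prompt_prefix("Correct: The vampire.")
not_echoed = strip_prompt_prefix("The vampire.")
blind_slice = "The vampire."[9:]  # what the unconditional slice does to an un-prefixed output

print(repr(echoed))
print(repr(not_echoed))
print(repr(blind_slice))
```

Comparing `not_echoed` with `blind_slice` shows the difference: the helper leaves an un-prefixed sentence intact, while the unconditional slice loses its first nine characters.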



