🏛️ Greek Manuscript OCR: Memory-Efficient Extraction
This cell converts the scanned document into digital text. It is designed to solve two specific problems:

Memory Bottlenecks: Instead of loading a 300+ page PDF into RAM, it uses a generator-like approach to process one page at a time.

Linguistic Accuracy: It uses the Tesseract ell (Modern Greek) language pack to recognize Greek characters in scanned manuscripts. For polytonic Ancient Greek, Tesseract also provides a separate grc pack, which may be a better fit.
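To put the memory concern in numbers, here is a rough back-of-the-envelope sketch. It assumes A4 pages rendered as uncompressed 8-bit RGB at 300 DPI; neither the page size nor the color mode is stated in the notebook, so treat the result as an order-of-magnitude estimate only.

```python
# Rough memory estimate for rendered pages.
# Assumptions (not from the notebook): A4 pages, 8-bit RGB, uncompressed.

def page_bytes(width_in: float, height_in: float, dpi: int, channels: int = 3) -> int:
    """Uncompressed size in bytes of one rendered page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels

A4_WIDTH_IN, A4_HEIGHT_IN = 8.27, 11.69  # A4 dimensions in inches

one_page = page_bytes(A4_WIDTH_IN, A4_HEIGHT_IN, dpi=300)
all_pages = one_page * 300  # a 300-page document rendered all at once

print(f"One page:  {one_page / 1e6:.1f} MB")
print(f"300 pages: {all_pages / 1e9:.1f} GB")
```

Under these assumptions a single page needs tens of megabytes, while rendering the whole document at once would need several gigabytes, which is why the page-by-page approach matters.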

🛠️ Execution Logic
Interactive Selection: Opens a native file explorer for input/output paths to avoid hardcoding.

Sequential Processing: Uses the first_page and last_page parameters of convert_from_path so that only one high-resolution (300 DPI) image is rendered and held in memory at a time.

Incremental Saving: Writes each page to a unique .txt file immediately. This ensures that if the process is interrupted, progress is preserved.

In [None]:
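import os
import tkinter as tk
from tkinter import filedialog

import pytesseract
from pdf2image import convert_from_path
from tqdm import tqdm

# Hide the empty Tk root window so only the dialogs appear
root = tk.Tk()
root.withdraw()

# Select the input PDF
pdf_path = filedialog.askopenfilename(title="Select a scanned Greek PDF")

# Select output folder
output_dir = filedialog.askdirectory(title="Select folder to save OCR text files")

# Both dialogs return an empty string when cancelled, so stop early in that case
if not pdf_path or not output_dir:
    raise SystemExit("No PDF or output folder selected.")
os.makedirs(output_dir, exist_ok=True)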

# Set path to Poppler and Tesseract
poppler_path = r"C:\tools\poppler-24.08.0\Library\bin"
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Get total number of pages
from pdf2image import pdfinfo_from_path
info = pdfinfo_from_path(pdf_path, poppler_path=poppler_path)
num_pages = info["Pages"]

print(f"🔍 PDF has {num_pages} pages. Beginning OCR...")

# Convert and process each page one at a time
for page_num in tqdm(range(1, num_pages + 1), desc="Processing pages"):
    # Convert one page at a time
    image = convert_from_path(pdf_path, dpi=300, poppler_path=poppler_path, first_page=page_num, last_page=page_num)[0]

    # OCR
    text = pytesseract.image_to_string(image, lang='ell')

    # Save output
    out_path = os.path.join(output_dir, f"page_{page_num:03}.txt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)

print("✅ OCR completed: all pages processed one by one.")
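Because each page is saved to its own file, an interrupted run could in principle be resumed rather than restarted. The sketch below shows one way to find the pages that still need OCR. The helper name pending_pages is hypothetical and not part of the cell above; it assumes the same page_NNN.txt naming scheme.

```python
import os

def pending_pages(output_dir: str, num_pages: int) -> list[int]:
    """Return the 1-based page numbers that have no output file yet."""
    pending = []
    for page_num in range(1, num_pages + 1):
        out_path = os.path.join(output_dir, f"page_{page_num:03}.txt")
        # A missing file means the page was never processed.
        # Empty files are kept, since a blank page legitimately yields no text.
        if not os.path.exists(out_path):
            pending.append(page_num)
    return pending
```

The main loop could then iterate over pending_pages(output_dir, num_pages) instead of the full range. Note that a file cut off mid-write by an interruption would still count as done here, so the last file written before a crash may be worth deleting before resuming.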


Data Consolidation: Merging Page-Level OCR into a Single Volume
This final step merges the individual page files into a single continuous text file, keeping the pages in their original order.

🏗️ Architectural Decisions
Sort Order: The script uses sorted(), which compares filenames as strings. This yields the correct page sequence (e.g., page_001.txt, page_002.txt) only because the previous step zero-pads page numbers to three digits; without padding, lexicographic order would place page 10 before page 2. The three-digit padding holds for documents of up to 999 pages.

Structural Separators: A double newline (\n\n) is written after each page. The separator is optional, but it keeps page boundaries visible in the merged text and makes the result easier for researchers to navigate.

UTF-8 Encoding: Files are read and written with explicit utf-8 encoding, so Greek characters, including complex polytonic accents, are preserved without corruption during the merge.
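Since sorted() compares strings character by character, the ordering depends on the zero-padded filenames. If unpadded names ever had to be handled, a natural sort key along these lines could be passed to sorted(). This is a minimal sketch; natural_key is a hypothetical helper, not something the notebook defines.

```python
import re

def natural_key(filename: str) -> list:
    """Split a name into text and number chunks so digits compare numerically."""
    parts = re.split(r"(\d+)", filename)
    # re.split with a capture group puts the digit runs at odd indices
    return [int(part) if i % 2 else part.lower() for i, part in enumerate(parts)]

names = ["page_10.txt", "page_2.txt", "page_1.txt"]
print(sorted(names))                   # ['page_1.txt', 'page_10.txt', 'page_2.txt']
print(sorted(names, key=natural_key))  # ['page_1.txt', 'page_2.txt', 'page_10.txt']
```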

🛠️ Execution Logic
Input Selection: Prompts for the directory containing the .txt files generated in the previous step.

Output Specification: Uses an interactive "Save As" window to define the final file name and location.

Stream-Processing Merge: Opens each file one by one to append its content to the master file, keeping memory usage low even during the final consolidation.

In [None]:
import os
from tkinter import filedialog

# Ask user to select the folder with OCR .txt files
input_folder = filedialog.askdirectory(title="Select folder with OCR text files")

if input_folder:
    # Get and sort all .txt files
    txt_files = sorted([f for f in os.listdir(input_folder) if f.lower().endswith(".txt")])

    # Ask where to save the final combined file
    output_file = filedialog.asksaveasfilename(
        defaultextension=".txt",
        title="Save combined text as",
        filetypes=[("Text files", "*.txt")]
    )

    if output_file:
        with open(output_file, 'w', encoding='utf-8') as outfile:
            for filename in txt_files:
                file_path = os.path.join(input_folder, filename)
                with open(file_path, 'r', encoding='utf-8') as infile:
                    outfile.write(infile.read())
                    outfile.write("\n\n")  # Optional separator
        print(f"✅ Combined file saved at:\n{output_file}")
    else:
        print("⚠️ No save location selected.")
else:
    print("⚠️ No folder selected.")
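For reuse outside the dialogs, the merge logic could be wrapped in a function. The following is a sketch under stated assumptions rather than the notebook's own method: merge_text_files is a hypothetical name, it copies each file in chunks with shutil.copyfileobj instead of reading it whole, and it writes the separator only between pages, not after the last one.

```python
import os
import shutil

def merge_text_files(input_folder: str, output_file: str, separator: str = "\n\n") -> int:
    """Concatenate the .txt files in input_folder into output_file; return the file count."""
    # List the inputs before creating the output, so a new output file is never merged into itself
    names = sorted(f for f in os.listdir(input_folder) if f.lower().endswith(".txt"))
    with open(output_file, "w", encoding="utf-8") as outfile:
        for index, name in enumerate(names):
            if index > 0:
                outfile.write(separator)  # separator between pages only
            with open(os.path.join(input_folder, name), "r", encoding="utf-8") as infile:
                shutil.copyfileobj(infile, outfile)  # copies in fixed-size chunks
    return len(names)
```

It can be called as merge_text_files(input_folder, output_file) with the two paths returned by the dialogs. Saving the combined file outside the input folder avoids picking it up on a later run.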
