This document describes how to obtain a first pass OCR for a scanned document and use it to create a dataset for OCR post-correction.
📌 The process begins with a PDF or images (PNG, JPEG etc.) of each page in the scanned document, for which we get the first pass OCR from an existing system.
📌 A small subset of the pages needs to be manually corrected to form the training data for the post-correction model.
📌 The trained model can then be applied to all the uncorrected pages for automatic OCR post-correction.
Python 3+ is required. Pip can be used to install the packages:
pip install -r ocr_requirements.txt
If the scanned document is in the form of a PDF, poppler
is required to convert the PDF to image files. Follow the "how to install" instructions here.
As described in the main document, some books that contain endangered language texts also contain a translation in another language. If the document contains such a translation, we call it multisource. If not, it is single-source.
We demonstrate all steps of the process with a sample multisource document that contains text in the endangered language Griko with its translation in Italian. We start with a PDF of the document.
Some of the steps are likely not necessary for setting up a dataset with a single-source document.
Since using a single-source document is easier, we recommend starting off with this setting. Even if the document is multisource, the portions with the translation can simply be skipped during processing.
The first step is converting the PDF into a set of images, one per page.
python firstpass_ocr/pdf_to_png.py \
--pdf sample_dataset/images/pdf/griko.pdf \
--output_folder sample_dataset/images/png
This step is likely not necessary for a single-source document.
For a multisource document, the translation may be present on the same page as the endangered language text. This can be in various layouts, such as a two-column format or interleaved pages.
We recommend using a layout analysis tool such as LAREX for complex layouts.
In our case, each page in the PDF contains Griko text on the left and its Italian translation on the right. This is a relatively simple layout to crop: we cut each page's image into two halves down the middle. This is easily done with a tool like image_slicer.
The sample cropped images are here.
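If you prefer a script over a tool like image_slicer, the two-column crop can be sketched with Pillow. This is an illustrative helper (the function name and paths are hypothetical), and it assumes the two text columns meet at the horizontal midpoint of the page; adjust the boundary if the gutter is off-center in your scans.

```python
from pathlib import Path

from PIL import Image  # pip install Pillow


def crop_two_columns(png_path, output_folder):
    """Split a page image into left (endangered language) and right (translation) halves."""
    page = Image.open(png_path)
    width, height = page.size
    left = page.crop((0, 0, width // 2, height))       # left column: Griko text
    right = page.crop((width // 2, 0, width, height))  # right column: Italian translation
    out = Path(output_folder)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(png_path).stem
    left.save(out / f"{stem}_left.png")
    right.save(out / f"{stem}_right.png")
    return left.size, right.size
```

Running this over every PNG in a folder produces the cropped images that are fed to the OCR system in the next step.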
We use the existing OCR system from Google Vision to obtain the first pass OCR.
The steps to use the API are:
- Sign up for Google Cloud.
- Go to the Console and create a new project at the top left corner.
- Enable the Vision API for the project.
- You will then be redirected to a page with a button to "Create Credentials".
- When creating the credential, choose "Cloud Vision API" and "No, I’m not using them".
- In the next step, enter any service name you want and choose "Project --> Owner" as the role.
- A JSON file will be generated with your credentials.
Given your credentials JSON file, run the following commands to get the first pass OCR:
export GOOGLE_APPLICATION_CREDENTIALS=[credentials.json]
python firstpass_ocr/transcribe_image.py \
--image_folder sample_dataset/images/cropped_pngs \
--output_folder sample_dataset/images/ocr_output
The first 1000 images processed per month with Google Cloud are free. The platform also offers a $300 credit to new users.
🚀 If you are unable to sign up for Google Cloud, please email us at srijhwan@cs.cmu.edu and we might be able to help!
Once we obtain a first pass OCR for all the pages in our document, the next step is constructing a dataset to train the OCR post-correction model. The model is trained in a supervised manner.
Since the model is designed for a low-resource setting, a small number of manually annotated pages (~10 pages) is typically sufficient to train a model (although more is better).
Each instance in the training dataset has a source (the first pass OCR of the endangered language text) and target (the corrected "gold" transcription). For the multisource setting, we have an additional source (the first pass OCR of the translation). These are denoted by:
📌 `src1` for the first pass OCR of the endangered language text
📌 `src2` for the first pass OCR of the translation
📌 `tgt` for the corrected "gold" transcription
The steps for creating the dataset are:
- Select the subset of pages that will be manually corrected and used for training. The remaining uncorrected pages will be used for pretraining the model.

- Divide the first pass OCR outputs of `src1` (and `src2` for multisource) as `corrected` and `uncorrected` based on the pages selected for manual annotation: see `sample_dataset/text_outputs` for an example.

- Manually transcribe the pages and add the transcriptions to `corrected/tgt`. Ensure that the first pass OCR outputs and the manual transcriptions are aligned at the sentence or paragraph level. The text files should contain one sentence/paragraph per line: see `sample_dataset/text_outputs/corrected`. Annotation tools like From The Page can make this process easier.

- If `src2` exists, also align the first pass OCR of `src2` at the sentence or paragraph level with respect to `src1`. Note that `src2` does not need to be manually corrected, only aligned.

- For the `uncorrected` first pass OCR of a single-source dataset, simply split the text at the sentence level so there is one sentence per line. For multisource, use a sentence aligner like YASA to automatically align the sentences.

- After alignment and annotation, all files corresponding to the same page (`src1`, `src2`, `tgt`) should have the same number of lines.

- See `sample_dataset/text_outputs` for an example of aligned and annotated text created after all the steps above.

- Finally, run the `prepare_data` script to format the data as pretraining, training, development, and test sets. Exclude the `src2` parameters for a single-source dataset. These sets will be used to train models and run experiments.
python utils/prepare_data.py \
--unannotated_src1 sample_dataset/text_outputs/uncorrected/src1_griko/ \
--unannotated_src2 sample_dataset/text_outputs/uncorrected/src2_italian/ \
--annotated_src1 sample_dataset/text_outputs/corrected/src1_griko/ \
--annotated_src2 sample_dataset/text_outputs/corrected/src2_italian/ \
--annotated_tgt sample_dataset/text_outputs/corrected/tgt_griko/ \
--output_folder sample_dataset/postcorrection
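Before running `prepare_data`, it can help to verify the alignment requirement from the steps above: every file corresponding to the same page should have the same number of lines. A minimal sketch of such a check (the helper name is hypothetical; stdlib only):

```python
def check_alignment(*paths):
    """Raise if the given text files do not all have the same number of lines."""
    counts = {}
    for path in paths:
        with open(path, encoding="utf-8") as f:
            counts[str(path)] = sum(1 for _ in f)  # one sentence/paragraph per line
    if len(set(counts.values())) > 1:
        raise ValueError(f"Line count mismatch: {counts}")
    return counts
```

Calling it on the `src1`, `src2`, and `tgt` files for each annotated page catches misalignments early, before they surface as training errors.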
🚀 The dataset can now be used to train a model: instructions here. The trained model can then be applied to all the unannotated pages and future documents for automatic post-correction!