Latin page scans and ground truth text for testing OCR accuracy.
README.md
leviathansivedem00hobb.png
leviathansivedem00hobb.src
leviathansivedem00hobb.txt
montfaucon_palaeographica_graeca_p182.png
montfaucon_palaeographica_graeca_p182.src
montfaucon_palaeographica_graeca_p182.txt
pugna_porcorum_p8.png
pugna_porcorum_p8.src
pugna_porcorum_p8.txt

README.md

A collection of Latin page scans and their corresponding ground truth text files.

These files are designed for use in testing OCR quality, using the tools from https://gitorious.org/ancient-greek-training-for-tesseract/ocr-evaluation-tools, in particular the tessaccsummary script.
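To illustrate the kind of comparison these files support, here is a minimal Python sketch that scores OCR output against a ground truth text by character-level edit distance. It is not the tessaccsummary script and does not reproduce its interface or its exact metric; the function names `edit_distance` and `char_accuracy` are hypothetical, and normalizing both texts to NFC is an assumption made here.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance between two strings, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(truth, ocr):
    """Fraction of ground truth characters recovered, clamped to [0, 1]."""
    # Assumption: compare NFC-normalized text so that precomposed and
    # decomposed accented characters count as equal.
    truth = unicodedata.normalize("NFC", truth)
    ocr = unicodedata.normalize("NFC", ocr)
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))
```

In use, `truth` would be the contents of a `.txt` file from this collection read as UTF-8, and `ocr` the text an OCR engine produced from the matching `.png`.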

Each page is represented by three files that share a base name:

  • <name>.png - the page scan
  • <name>.txt - the correct UTF-8 encoded text corresponding to the page scan
  • <name>.src - a text file describing the provenance of the page scan
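The naming scheme above can be checked mechanically. The following Python sketch groups file names by base name and reports any page that lacks one of its three files. The helper names `group_pages` and `incomplete_pages` are hypothetical and not part of this repository or the evaluation tools.

```python
from collections import defaultdict
from pathlib import Path

# The three extensions every page is expected to have.
REQUIRED = {".png", ".txt", ".src"}


def group_pages(filenames):
    """Map each base name to the set of required extensions found for it."""
    pages = defaultdict(set)
    for name in filenames:
        path = Path(name)
        if path.suffix in REQUIRED:  # skips README.md and other files
            pages[path.stem].add(path.suffix)
    return dict(pages)


def incomplete_pages(filenames):
    """Return {base name: sorted missing extensions} for incomplete pages."""
    return {
        stem: sorted(REQUIRED - found)
        for stem, found in group_pages(filenames).items()
        if found != REQUIRED
    }


if __name__ == "__main__":
    import os

    # Run from the root of a checkout; an empty dict means every page
    # has its scan, ground truth text, and provenance file.
    print(incomplete_pages(os.listdir(".")))
```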