GitHub - ryanfb/latin-texts-ocr: Plaintext OCR processing results on Internet Archive works in the latin-texts repo

latin-texts-ocr

This repository holds plaintext OCR results for Internet Archive works that are listed in latin_to_annotate.txt in the latin-texts repo and that have derivatives (i.e. PDFs) on IA but no existing OCR text. See missing_ids_200_pdfs.txt, or its make target, in my classification branch of the latin-texts repo.

Processing is being done with:

TESSERACT_FLAGS="-l lat+eng+grc+deu" ~/source/ocrpdf/ocrpdf.sh "${i}.pdf"

Where ocrpdf.sh is a version of this script modified to retain plaintext and hocr output.
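The `${i}` in the command above suggests it is run once per Internet Archive identifier. A minimal sketch of such a loop follows, assuming a text file with one identifier per line and the matching PDFs in the current directory. The `ocr_all` function, the `OCRPDF` variable, and the `DRY_RUN` switch are hypothetical names introduced for illustration and are not part of ocrpdf.sh.

```shell
# Sketch only: the ID list layout, OCRPDF, and DRY_RUN are assumptions.
OCRPDF="${OCRPDF:-$HOME/source/ocrpdf/ocrpdf.sh}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands instead of running them

ocr_all() {
  # $1: file with one Internet Archive identifier per line
  while IFS= read -r i; do
    [ -n "$i" ] || continue   # skip blank lines
    if [ "$DRY_RUN" = 1 ]; then
      echo "TESSERACT_FLAGS=\"-l lat+eng+grc+deu\" $OCRPDF \"${i}.pdf\""
    else
      TESSERACT_FLAGS="-l lat+eng+grc+deu" "$OCRPDF" "${i}.pdf"
    fi
  done < "$1"
}
```

With `DRY_RUN=1` (the default in this sketch), `ocr_all ids.txt` only prints the command for each identifier, which makes it easy to check the list before starting a long OCR run.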

Up to 5cfccaf, the Latin Tesseract training file is v0.1.0-alpha2 from my preliminary Latin OCR training process, and the Greek training file is v2.0 from Nick White's Ancient Greek OCR. From 0fcc280 on, the Latin Tesseract training file is v0.2.0. From 1f3376a on, v0.2.1 is used.

Languages were picked with:

cut -d, -f 2 < djvus-language-confidence.txt|sort|uniq -c|sort -n
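To illustrate what this pipeline does, here is a small sketch run on made-up data. It assumes djvus-language-confidence.txt is comma-separated with the detected language code in the second field; the identifiers and confidence values below are invented for the example.

```shell
# Hypothetical sample data standing in for djvus-language-confidence.txt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
id-one,lat,0.98
id-two,lat,0.91
id-three,eng,0.87
id-four,grc,0.75
id-five,lat,0.99
id-six,eng,0.80
EOF

# Take field 2, group identical values, count each group,
# then sort numerically so the most frequent language comes last.
counts=$(cut -d, -f 2 < "$sample" | sort | uniq -c | sort -n)
printf '%s\n' "$counts"
rm -f "$sample"
```

Each output line is a count followed by a language code, in ascending order of count, so the languages worth passing to Tesseract's `-l` flag appear at the bottom.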
