ryanfb/latin-texts-ocr

latin-texts-ocr

This repository holds plaintext OCR results for Internet Archive (IA) works that are listed in latin_to_annotate.txt in the latin-texts repo and that have derivatives (i.e. PDFs) on IA but no existing OCR text. See missing_ids_200_pdfs.txt or its make target in my classification branch of the latin-texts repo.

Processing is being done with:

TESSERACT_FLAGS="-l lat+eng+grc+deu" ~/source/ocrpdf/ocrpdf.sh "${i}.pdf"

Here, ocrpdf.sh is a version of this script modified to retain plaintext and hOCR output.
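
The `${i}` in the command above suggests it runs inside a loop over identifiers. A minimal sketch of such a loop follows, assuming a list file with one IA identifier per line and the matching `${i}.pdf` files already downloaded into the current directory. The function name `ocr_all` and the `OCRPDF` override variable are hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch only: run ocrpdf.sh over every identifier in a list file.
# Assumptions (not from the repo): one identifier per line in the list,
# and "<identifier>.pdf" present in the current directory.
OCRPDF="${OCRPDF:-$HOME/source/ocrpdf/ocrpdf.sh}"

ocr_all() {
  local list="$1" i
  while IFS= read -r i; do
    [ -n "$i" ] || continue              # skip blank lines
    if [ ! -f "${i}.pdf" ]; then
      echo "skipping ${i}: no PDF found" >&2
      continue
    fi
    # stdin is redirected so the OCR script cannot consume the list
    TESSERACT_FLAGS="-l lat+eng+grc+deu" "$OCRPDF" "${i}.pdf" < /dev/null
  done < "$list"
}
```

It could then be called as `ocr_all missing_ids_200_pdfs.txt`, if that file does hold one identifier per line.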

Up to 5cfccaf, the Latin Tesseract training file is v0.1.0-alpha2 from my preliminary Latin OCR training process, and the Greek training file is v2.0 from Nick White's Ancient Greek OCR. From 0fcc280 on, the Latin Tesseract training file is v0.2.0. From 1f3376a on, v0.2.1 is used.

Languages were picked with:

cut -d, -f 2 < djvus-language-confidence.txt|sort|uniq -c|sort -n

See: djvus-language-confidence.txt
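
To illustrate what that pipeline does: it extracts the second comma-separated field of each line, then prints each distinct value with its count, least frequent first. The sketch below runs the same pipeline on made-up sample data. The three-column "identifier,language,confidence" layout is an assumption about djvus-language-confidence.txt; only the comma delimiter and the use of field 2 come from the command itself. The function name `language_counts` is hypothetical.

```shell
#!/usr/bin/env bash
# Count how often each value in the second comma-separated field occurs,
# least frequent first (the same pipeline as above, wrapped in a function).
language_counts() {
  cut -d, -f 2 < "$1" | sort | uniq -c | sort -n
}

# Hypothetical sample data; the real file's exact layout is an assumption.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
work_a,lat,0.91
work_b,eng,0.88
work_c,lat,0.95
work_d,grc,0.72
work_e,lat,0.90
work_f,eng,0.81
EOF

language_counts "$sample"
```

On this sample the output lists grc once, eng twice, and lat three times, in that order. The most frequent languages end up at the bottom of the list, which is a convenient place to read off candidates for the `-l` flag.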
