This is a repository for plaintext OCR processing results on Internet Archive works in the latin-texts repo in latin_to_annotate.txt that have derivatives (i.e. PDFs) on IA but no existing OCR text. See missing_ids_200_pdfs.txt or its make target in my classification branch of the latin-texts repo.
Processing is being done with:
TESSERACT_FLAGS="-l lat+eng+grc+deu" ~/source/ocrpdf/ocrpdf.sh "${i}.pdf"
Where ocrpdf.sh is a version of this script modified to retain plaintext and hocr output.
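The command above processes a single `${i}.pdf`. As a rough sketch of how it could be driven over a whole list of identifiers, the loop below assumes that missing_ids_200_pdfs.txt holds one IA identifier per line and that the PDFs have already been downloaded into the current directory; the `ocr_all` function and the `OCRPDF` and `IDS_FILE` variables are hypothetical names introduced here, not part of ocrpdf.sh.

```shell
#!/bin/sh
# Hypothetical batch driver around the ocrpdf.sh invocation shown above.
# Assumptions: one identifier per line in the list file, and a matching
# "<identifier>.pdf" already present in the current directory.
OCRPDF="${OCRPDF:-$HOME/source/ocrpdf/ocrpdf.sh}"
IDS_FILE="${IDS_FILE:-missing_ids_200_pdfs.txt}"

ocr_all() {
  # $1: path to a file listing one identifier per line
  while IFS= read -r i; do
    # Skip blank lines in the list.
    [ -n "$i" ] || continue
    if [ ! -f "${i}.pdf" ]; then
      echo "skipping ${i}: no PDF found" >&2
      continue
    fi
    # Same flags as the single-file command; stdin is redirected so the
    # OCR script cannot consume the rest of the identifier list.
    TESSERACT_FLAGS="-l lat+eng+grc+deu" "$OCRPDF" "${i}.pdf" </dev/null
  done < "$1"
}

# Only run when the list file is actually present.
if [ -f "$IDS_FILE" ]; then
  ocr_all "$IDS_FILE"
fi
```

Identifiers without a local PDF are reported on stderr and skipped rather than aborting the run, so the loop can be restarted after fetching more files.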
Up to 5cfccaf, the Latin Tesseract training file is v0.1.0-alpha2 from my preliminary Latin OCR training process, and the Greek training file is v2.0 from Nick White's Ancient Greek OCR. From 0fcc280 on, the Latin Tesseract training file is v0.2.0. From 1f3376a on, v0.2.1 is used.
Languages were picked with:
cut -d, -f 2 < djvus-language-confidence.txt|sort|uniq -c|sort -n
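To illustrate what that pipeline computes, here is a small sketch run on made-up data. It assumes only what the command itself implies, namely that djvus-language-confidence.txt is comma-separated with the language code in the second field; the identifiers, confidence values, and the `count_languages` wrapper are hypothetical.

```shell
#!/bin/sh
# Wrap the pipeline so it can be pointed at any comma-separated file
# whose second field is a language code.
count_languages() {
  # cut: keep field 2; sort + uniq -c: count each distinct language;
  # sort -n: order by count, least frequent first.
  cut -d, -f 2 < "$1" | sort | uniq -c | sort -n
}

# Hypothetical sample rows, NOT taken from the real file.
sample=$(mktemp)
printf '%s\n' \
  'id1,lat,0.98' \
  'id2,lat,0.91' \
  'id3,grc,0.87' \
  'id4,lat,0.95' \
  'id5,deu,0.80' \
  'id6,grc,0.70' > "$sample"

count_languages "$sample"
```

Each output line is a count followed by a language code, with the most frequent language last, which makes it easy to read off the codes worth passing to `-l`.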