OCR Ground truth for Caroline Minuscule

This ground truth repository is a work in process; it currently accounts for a part of our complete Caroline Minuscule training pool of around 70 manuscripts used for our OCRopus Caroline Minuscule model (see ocropus-models repository).

Repository structure

Each manuscript used is contained in a directory, which contains subdirectories for each page and a metadata.txt file.

The page subdirectories are split into lines (by ocropus-gpageseg), with .png images and corresponding .gt.txt ground truth text files. These have been prepared for the OCRopus OCR engine, but should be usable by other OCR engines which work on a line by line basis. Please let us know if you adapt the ground truth to work with other OCR engines, we'd love to hear about it.

There are also now full colour page images in each manuscript directory, alongside plain text and ALTO XML representations of ground truth, which can be used for training with Kraken or transforming into other formats.

License, provenance, metadata

The ground truth contained in this repository should be considered Public Domain, or licensed under Apache License 2.0, whichever suits your needs better.

The metadata.txt files in each manuscript directory describes its provenance and the licensing of the images. We have only used manuscript images which are freely redistributable and reusable.

OCR Ground Truth Transcription Protocol

The ground truth published in this repository has been created according to a number of guidelines designed to find a compromise between the quirks of medieval writing and the limitations of modern software (i.e. its incompatibility with certain medieval characters or the fact that some medieval characters are not represented in unicode). With some respect, we simply made some executive decisions on the transcription guidelines where we felt this would best serve the overall purpose.

In general this meant deciding between diplomatic transcription (i.e. sticking to what it says on the page) and gently modernized features (i.e. reinterpreting medieval signs into modern equivalents) with a view to specific categories. Read on for a summary of the rules and the respective rationale behind them.

SUMMARY

PUNCTUATION

Modern: medieval punctuation is transcribed with modern equivalents; punctus elevatus transcribed as semicolon

CAPITALIZATION

Diplomatic: Original capitalization retained

ABBREVIATIONS

Diplomatic where possible: Retain abbreviations and render glyphs as opposed to expanded versions where possible
"*" where original character isn't served: OCRopus (at the point in time of transcription) could not handle some of the medieval glyphs, even where a Unicode version was present. Abbreviations not in OCRopus are uniformly transcribed as "*", in the case of a combined character (such as a consonant with a macron) as the base character followed by "*" (e.g. "t*"). The list of accepted characters in OCRopus can be found in this repository, and downloaded and used as codec in the OCRopus training process.

SPACING

Diplomatic: Preserve manuscript spacing, i.e. give diplomatic transcription

NUMBERS

Diplomatic: retain original version of both Roman and Arabic numerals

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.github/workflows		.github/workflows
badges		badges
bsb00046285		bsb00046285
bsb00046500		bsb00046500
bsb00046557		bsb00046557
bsb00047183		bsb00047183
bsb00050531		bsb00050531
bsb00054504		bsb00054504
bsb00065407		bsb00065407
bsb00065409		bsb00065409
bsb00065410		bsb00065410
bsb00065411		bsb00065411
bsb00065763		bsb00065763
bsb00071185		bsb00071185
bsb00071369		bsb00071369
bsb00072162		bsb00072162
bsb00073147		bsb00073147
bsb00095929		bsb00095929
bsb00104168		bsb00104168
README.md		README.md
htr-united.yml		htr-united.yml
table.csv		table.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OCR Ground truth for Caroline Minuscule

Repository structure

License, provenance, metadata

OCR Ground Truth Transcription Protocol

About

Releases

Packages

Contributors 3

rescribe/carolineminuscule-groundtruth

Folders and files

Latest commit

History

Repository files navigation

OCR Ground truth for Caroline Minuscule

Repository structure

License, provenance, metadata

OCR Ground Truth Transcription Protocol

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Packages