`ensemble_ocr`

This is a Stata package that improves the quality of variables generated through multiple OCR scans or engines.

The input is a set of variables that reflect specific words or numbers in an OCRed text. This text must have been OCRed with multiple engines, from different paper copies, or through multiple scans, so that the versions differ.

As long as the different methods are unbiased, so that they do not tend to make the same mistake at the same position, picking the most common digit at each position will usually give the correct result. For instance, in this hypothetical example we obtain a number from three engines:

| "Ground Truth" | Abbyy | Tesseract v3 | Tesseract v4 |
|---|---|---|---|
| 123456 | 23456 | 128450 | 123186 |

This package works by first aligning the input:

 23456
128450
123186
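
The alignment in this example can be reproduced with a minimal sketch. It is written in Python purely for illustration and is not the package's Stata code. It assumes that aligning means right-justifying each reading to the longest one, which matches the example above; the function name `right_align` is hypothetical, and the package may align differently.

```python
def right_align(readings):
    """Left-pad each reading with spaces so that all have the same length.

    Assumption: alignment is plain right-justification, as in the example.
    """
    width = max(len(r) for r in readings)
    return [r.rjust(width) for r in readings]


# Readings from Abbyy, Tesseract v3 and Tesseract v4 in the example.
readings = ["23456", "128450", "123186"]
for line in right_align(readings):
    print(line)
```

Right-justification only helps when the missing characters are at the start of a reading; a digit dropped in the middle would need a real sequence alignment.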

Then, the most common digit at each position is picked, and the ground truth is recovered:

123456

Note: this approach is similar in spirit to several papers by WB Lundt.
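
The voting step can be sketched as follows, again in Python for illustration rather than as the package's actual implementation. The function name `majority_vote` is hypothetical, and two choices are assumptions of this sketch: padding spaces are not counted as votes, and ties go to the first reading that cast a tied vote.

```python
from collections import Counter


def majority_vote(aligned):
    """Pick the most common character at each position of equal-length strings."""
    result = []
    for column in zip(*aligned):
        # Assumption: padding is ignored so a missing digit is not a vote.
        votes = Counter(ch for ch in column if ch != " ")
        # most_common(1) returns the first-seen character among ties.
        result.append(votes.most_common(1)[0][0])
    return "".join(result)


# Aligned readings from the example above.
aligned = [" 23456", "128450", "123186"]
print(majority_vote(aligned))  # prints 123456
```

Each engine misreads at least one digit here, but no two engines make the same mistake at the same position, so the vote recovers every digit.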

Installation

net install ensemble_ocr, from(https://github.com/sergiocorreia/stata-ensemble-ocr/raw/master/)

Syntax

ensemble_ocr varlist , generate(newvar)

Warnings

- This is aimed at relatively small numbers (fewer than 15 digits). Larger numbers will fail in unexpected ways because they exceed the capacity of -double-, which stores integers exactly only up to 2^53 (about 9.0e15).
- It currently handles digits only. It could be extended to deal with non-digit strings.
- Cases where strings have different lengths (and thus need to be aligned) are not fully implemented.
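
The digit limit in the first warning comes from double-precision storage and can be checked with a short sketch. It uses Python because Python floats are IEEE 754 doubles, the same 8-byte format as Stata's -double-. The helper name `survives_double` is hypothetical and not part of the package.

```python
# Integers are stored exactly in a double only up to 2**53.
LIMIT = 2**53  # 9007199254740992, a 16-digit number


def survives_double(n):
    """Return True if integer n is unchanged after being stored as a double."""
    return int(float(n)) == n


print(survives_double(999_999_999_999_999))  # largest 15-digit integer
print(survives_double(LIMIT + 1))            # first integer a double cannot hold
```

Every integer with 15 or fewer digits is below `LIMIT` and is stored exactly, while some 16-digit integers are silently rounded to a neighbouring value.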
