Feat/generate trainingsets #205
Open
M3ssman wants to merge 55 commits into tesseract-ocr:main from ulb-sachsen-anhalt:feat/generate-trainingsets
Changes from 48 commits
55 commits
2e99bd7 [app][feat] init training sets (M3ssman)
4e08dca [app][feat] add page 2019, review ordering (M3ssman)
8f4f6c5 [app][dep] extract dev libraries (M3ssman)
1af468a [app][doc] commented cli opts (M3ssman)
e2cb3bf [app][dep] rather use opencv-headless (M3ssman)
3f74072 [app][rfct] renamed test deps (M3ssman)
46bb550 [doc][add] initial include (M3ssman)
a332644 [app][feat] use ocrd-workspace layout if possible (M3ssman)
9724919 [app][rfct] clear concern - reorder is not rtl (M3ssman)
dbd2750 [app][rfct] set 1 printable char as required len (M3ssman)
6490910 python tooling to install GT extractor, update README (kba)
e571231 [ci/cd][mrg] resolve merge conflicts (M3ssman)
f7fe5e5 [test][fix] adjust module import and tests (M3ssman)
8acba6d Merge branch 'feat/generate-trainingsets' of github.com:ulb-sachsen-a… (M3ssman)
a7548e3 [app][dep] fix import module path (M3ssman)
d891fa4 [app][dep] fix missing main attribute (M3ssman)
7171396 [app][test] fix test imports (M3ssman)
7f2b6f2 [app][fix] filter invalid lines (M3ssman)
07ee94d [app][fix] handle page without word elements (M3ssman)
1904c58 [app][fix] alto V4 (M3ssman)
d7b2ec9 [app][mrg] resolve conflicts (M3ssman)
23edc06 [test][rfct] use always constants (M3ssman)
1976e12 [app][fix] remove both rtl and ltr marks (M3ssman)
ea8464b [app][fix] do *not* append rtl (M3ssman)
325d794 Merge remote-tracking branch 'upstream/master' into feat/generate-tra… (M3ssman)
43e103c [app][fix] enhance robustness for invalid PAGE (M3ssman)
15a6180 [app][feat] enhance char filter (M3ssman)
cf54dd9 [test][rfct] unified fixture usage (M3ssman)
055d89a [app][rfct] please lgtome (M3ssman)
ba1ed11 [app][rfct] do not generate, only extract (M3ssman)
91d592d [app][fix] check text exists (M3ssman)
162148d [app][feat] use shapes instead of bboxes (M3ssman)
fd09bb0 [app][feat] filter polygon shapes properly (M3ssman)
3914f78 [app][feat] padding, binarizing, rotating (M3ssman)
94f564a [app][rfct] drop unused import (M3ssman)
1a0d4c7 [app][fix] increase filter threshold (M3ssman)
2e02397 [app][fix] decrease grayscale range default (M3ssman)
2b9c9fa [app][fix] decrease grayscale range even more (M3ssman)
b4edc80 [app][feat] dpi from png too (M3ssman)
9d67309 [app][fix] center from coord pairs mean (M3ssman)
cd8853f [app][doc] update readme (M3ssman)
5fa6ade [app][rfct] clear err msg (M3ssman)
cafe6e0 [app][rfct] clear error subject (M3ssman)
7acd59f Merge branch 'tesseract-ocr:master' into feat/generate-trainingsets (M3ssman)
f076cf8 Merge branch 'master' into tmp (M3ssman)
dbb0553 [ci/cd][fix] pin latest working opencv (M3ssman)
31c617b [ci/cd][upd] follow test layout (M3ssman)
a5a9be7 [app][fix] opencv usage concentrated (M3ssman)
cb33288 [ci/cd][feat] update build (M3ssman)
f7afb3c [app][fix] update impl (M3ssman)
4ab7b4d Merge remote-tracking branch 'upstream/main' into tmp_merge (M3ssman)
78a126c [app][rfct] fit src/tesstrain layout (M3ssman)
90ccf29 [app][feat] gather pairs from data from dirs (M3ssman)
80ee2d5 [app][feat] aggr n of created pairs (M3ssman)
736e73e [app][fix] dont remove from same list (M3ssman)
@@ -21,3 +21,5 @@ master.zip
main.zip
plot/*.LOG
plot/ocrd*
__pycache__
*.egg-info
@@ -195,6 +195,37 @@ cd ./plot
./plot_cer.sh

## Extract training data from ALTO/PAGE and images

tesstrain provides a utility `tesstrain-extract-sets` to generate pairs of text lines and corresponding line images from input data in the form of
[ALTO](https://www.loc.gov/standards/alto/) or
[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) files that represent scanned pages (complete or partial) with existing OCR.

To install `tesstrain-extract-sets`, first set up a virtual environment and install the project via `pip`:

```
# create virtual environment in subfolder "venv"
python3 -m venv venv
# unix
source venv/bin/activate
# win
venv\Scripts\activate.bat

# actual install
pip install .
```

`tesstrain-extract-sets` currently supports OCR data in ALTO V3, PAGE 2013 and PAGE 2019, as well as TIFF, JPEG and PNG images.

Output is written as UTF-8 encoded plain text files and TIFF images. The image frame is produced from the textline coordinates in the OCR data, so make sure the geometric information is properly annotated. Additionally, the tool can add a fixed synthetic padding around the textline (`--padding`) or store it binarized (`--binarize`).

By default, several sanitizing steps are applied to each line image, such as deskewing and the removal of intruders at the top and bottom. To disable them, pass the flag `--no-sanitize`.

See `tesstrain-extract-sets --help` for a brief listing of all supported command line flags and options.

**NOTE:** The text of the lines is extracted as-is; no automatic correction takes place. It is strongly recommended to review the generated data before training Tesseract with it.


## License

Software is provided under the terms of the `Apache 2.0` license.
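The README note above recommends reviewing the extracted pairs before training. As a rough aid for that review, here is a minimal sketch that walks an output folder and flags pairs worth a manual look. The function name `find_suspicious_pairs` and the suffixes `.gt.txt` and `.tif` are assumptions made for illustration; they are not taken from this pull request and should be adjusted to whatever the tool actually writes.

```python
from pathlib import Path


def find_suspicious_pairs(folder, min_chars=1,
                          text_suffix=".gt.txt", image_suffix=".tif"):
    """Return (name, reason) tuples for pairs that deserve a manual look.

    Both suffixes are assumptions for illustration only.
    """
    folder = Path(folder)
    problems = []
    for text_path in sorted(folder.glob("*" + text_suffix)):
        stem = text_path.name[: -len(text_suffix)]
        text = text_path.read_text(encoding="utf-8").strip()
        # count printable, non-whitespace characters only
        printable = [c for c in text if c.isprintable() and not c.isspace()]
        if len(printable) < min_chars:
            problems.append((stem, "too few printable characters"))
        if not (folder / (stem + image_suffix)).exists():
            problems.append((stem, "missing line image"))
    return problems
```

Such a check only catches structural problems; wrong transcriptions still have to be found by reading the lines against their images.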
@@ -0,0 +1,20 @@
from .training_sets import (
    TrainingSets,
    gray_canvas,
    read_dpi,
    calculate_grayscale,
    clear_vertical_borders,
    rotate_text_line_center,
    coords_center,
    DEFAULT_OUTDIR_PREFIX,
    DEFAULT_MIN_CHARS,
    DEFAULT_USE_SUMMARY,
    DEFAULT_USE_REORDER,
    DEFAULT_INTRUSION_RATIO,
    DEFAULT_ROTATION_THRESH,
    DEFAULT_BINARIZE,
    DEFAULT_SANITIZE,
    SUMMARY_SUFFIX,
    DEFAULT_PADDING,
    XML_NS
)
Binary file not shown.
@@ -0,0 +1,129 @@
# -*- coding: utf-8 -*-
"""Generate Sets of Training Data TextLine + Image Pairs"""

import argparse
import os

from . import (
    TrainingSets,
    DEFAULT_OUTDIR_PREFIX,
    DEFAULT_MIN_CHARS,
    DEFAULT_USE_SUMMARY,
    DEFAULT_USE_REORDER,
    DEFAULT_INTRUSION_RATIO,
    DEFAULT_ROTATION_THRESH,
    DEFAULT_SANITIZE,
    DEFAULT_BINARIZE,
    DEFAULT_PADDING,
    SUMMARY_SUFFIX
)


########
# MAIN #
########
def main():
    PARSER = argparse.ArgumentParser(description="generate pairs of textlines and image frames from existing OCR and image data")
    PARSER.add_argument(
        "data",
        type=str,
        help="path to local alto|page file corresponding to image")
    PARSER.add_argument(
        "-i",
        "--image",
        required=False,
        help="path to local image file tif|jpg|png corresponding to ocr. (default: read from OCR-Data)")
    PARSER.add_argument(
        "-o",
        "--output",
        required=False,
        help="optional: output directory, re-created if already exists. (default: <script-dir>/<{}-ocr-name>)".format(DEFAULT_OUTDIR_PREFIX))
    PARSER.add_argument(
        "-m",
        "--minchars",
        required=False,
        type=int,
        default=int(DEFAULT_MIN_CHARS),
        help="optional: minimum printable chars required for a line to be included into set (default: {})".format(DEFAULT_MIN_CHARS))
    PARSER.add_argument(
        "-s",
        "--summary",
        required=False,
        action='store_true',
        default=DEFAULT_USE_SUMMARY,
        help="optional: print all lines in additional file (default: {}, pattern: <default-output-dir>{})".format(DEFAULT_USE_SUMMARY, SUMMARY_SUFFIX))
    PARSER.add_argument(
        "-r",
        "--reorder",
        required=False,
        action='store_true',
        default=DEFAULT_USE_REORDER,
        help="optional: re-order word tokens from right-to-left (default: {})".format(DEFAULT_USE_REORDER))
    PARSER.add_argument(
        "--binarize",
        required=False,
        action='store_true',
        default=DEFAULT_BINARIZE,
        help="optional: binarize textline images (default: {})".format(DEFAULT_BINARIZE))
    PARSER.add_argument(
        "--sanitize",
        required=False,
        action='store_true',  # was type=bool, which turns any non-empty string (even "False") into True
        default=DEFAULT_SANITIZE,
        help="optional: sanitize textline images (default: {})".format(DEFAULT_SANITIZE))
    PARSER.add_argument('--no-sanitize', dest='sanitize', action='store_false')
    PARSER.add_argument(
        "--intrusion-ratio",
        required=False,
        default=DEFAULT_INTRUSION_RATIO,
        help="optional: alter threshold for top and bottom ratios for intrusion detection for sanitizing (default: {})".format(DEFAULT_INTRUSION_RATIO))
    PARSER.add_argument(
        "--rotation-threshold",
        required=False,
        type=float,
        default=DEFAULT_ROTATION_THRESH,
        help="optional: alter threshold for rotation of textline image (default: {})".format(DEFAULT_ROTATION_THRESH))
    PARSER.add_argument(
        "-p",
        "--padding",
        required=False,
        type=int,
        default=DEFAULT_PADDING,
        help="optional: additional padding for existing textline image (default: {})".format(DEFAULT_PADDING))

    ARGS = PARSER.parse_args()
    PATH_OCR = ARGS.data
    PATH_IMG = ARGS.image
    FOLDER_OUTPUT = ARGS.output
    MIN_CHARS = ARGS.minchars
    SUMMARY = ARGS.summary
    REORDER = ARGS.reorder
    BINARIZE = ARGS.binarize
    SANITIZE = ARGS.sanitize
    INTR_RATIO = ARGS.intrusion_ratio
    if isinstance(INTR_RATIO, str) and ',' in INTR_RATIO:
        INTR_RATIO = [float(n) for n in INTR_RATIO.split(',')]
    else:
        INTR_RATIO = float(INTR_RATIO)
    ROTA_THRESH = ARGS.rotation_threshold
    PADDING = ARGS.padding

    if os.path.exists(PATH_OCR):
        print("[INFO ] generate trainingsets from '{}'".format(PATH_OCR))
        print("[DEBUG] args: {}".format(ARGS))
        TRAINING_DATA = TrainingSets(PATH_OCR, PATH_IMG)
        RESULT = TRAINING_DATA.create(
            folder_out=FOLDER_OUTPUT,
            min_chars=MIN_CHARS,
            summary=SUMMARY,
            reorder=REORDER,
            intrusion_ratio=INTR_RATIO, rotation_threshold=ROTA_THRESH,
            binarize=BINARIZE, sanitize=SANITIZE, padding=PADDING)
        print(
            "[SUCCESS] created '{}' training data sets from '{}' in '{}', please review".format(
                len(RESULT), PATH_OCR, TRAINING_DATA.path_out))
    else:
        print(
            "[ERROR ] missing OCR '{}' or Image Data '{}'!".format(
                PATH_OCR,
                PATH_IMG))
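The CLI above exposes sanitizing through a `--sanitize` / `--no-sanitize` pair that writes to one destination, and it accepts `--intrusion-ratio` either as a single number or as a comma-separated list. The following standalone sketch isolates both patterns. The helper names `build_parser` and `parse_ratio` and the default ratio `"0.1"` are assumptions for illustration and are not part of the pull request.

```python
import argparse


def build_parser(default_sanitize=True):
    """Paired on/off flags that write to the same destination."""
    parser = argparse.ArgumentParser()
    # store_true/store_false avoid type=bool: argparse would call bool() on
    # the raw string, and bool("False") is True because the string is non-empty.
    parser.add_argument("--sanitize", dest="sanitize", action="store_true",
                        default=default_sanitize)
    parser.add_argument("--no-sanitize", dest="sanitize", action="store_false")
    # Assumed default; the real value comes from DEFAULT_INTRUSION_RATIO.
    parser.add_argument("--intrusion-ratio", default="0.1")
    return parser


def parse_ratio(raw):
    """Accept a single ratio ("0.1") or a comma-separated list ("0.1,0.2")."""
    if isinstance(raw, str) and "," in raw:
        return [float(part) for part in raw.split(",")]
    return float(raw)
```

When two options share a `dest`, argparse takes the default from the option that was added first, so the default passed to `--sanitize` decides the behaviour when neither flag is given.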
Isn't padding for raw images going to be a disaster? I'd recommend disallowing this combination in the CLI right away.
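One way the suggested restriction could look is sketched below, assuming that "raw" means not binarized and that a padding of 0 means no padding was requested. Both are assumptions, since the value of `DEFAULT_PADDING` is not shown in this diff. The flag names follow the CLI in this pull request; the function `parse_args` is hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Reject padding unless the line images are binarized as well."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--binarize", action="store_true")
    # Assumption: 0 means "no padding requested".
    parser.add_argument("-p", "--padding", type=int, default=0)
    args = parser.parse_args(argv)
    if args.padding > 0 and not args.binarize:
        # parser.error prints the usage text plus this message to stderr
        # and exits with status 2.
        parser.error("--padding requires --binarize")
    return args
```

Checking right after parsing keeps the rule in one place and fails before any image is read.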