Feat/generate trainingsets #205

Open
wants to merge 55 commits into
base: main
Changes from 48 commits
Commits
2e99bd7
[app][feat] init training sets
M3ssman Nov 17, 2020
4e08dca
[app][feat] add page 2019, review ordering
M3ssman Nov 18, 2020
8f4f6c5
[app][dep] extract dev libraries
M3ssman Nov 18, 2020
1af468a
[app][doc] commented cli opts
M3ssman Nov 18, 2020
e2cb3bf
[app][dep] rather use opencv-headless
M3ssman Nov 19, 2020
3f74072
[app][rfct] renamed test deps
M3ssman Nov 19, 2020
46bb550
[doc][add] initial include
M3ssman Nov 19, 2020
a332644
[app][feat] use ocrd-workspace layout if possible
M3ssman Nov 19, 2020
9724919
[app][rfct] clear concern - reorder is not rtl
M3ssman Dec 7, 2020
dbd2750
[app][rfct] set 1 printable char as required len
M3ssman Dec 7, 2020
6490910
python tooling to install GT extractor, update README
kba Nov 19, 2020
e571231
[ci/cd][mrg] resolve merge conflicts
M3ssman Jan 20, 2021
f7fe5e5
[test][fix] adjust module import and tests
M3ssman Nov 19, 2020
8acba6d
Merge branch 'feat/generate-trainingsets' of github.com:ulb-sachsen-a…
M3ssman Jan 20, 2021
a7548e3
[app][dep] fix import module path
M3ssman Dec 7, 2020
d891fa4
[app][dep] fix missing main attribute
M3ssman Dec 7, 2020
7171396
[app][test] fix test imports
M3ssman Dec 7, 2020
7f2b6f2
[app][fix] filter invalid lines
M3ssman Dec 7, 2020
07ee94d
[app][fix] handle page without word elements
M3ssman Dec 13, 2020
1904c58
[app][fix] alto V4
M3ssman Dec 14, 2020
d7b2ec9
[app][mrg] resolve conflicts
M3ssman Jan 20, 2021
23edc06
[test][rfct] use always constants
M3ssman Jan 19, 2021
1976e12
[app][fix] remove both rtl and ltr marks
M3ssman Mar 27, 2021
ea8464b
[app][fix] do *not* append rtl
M3ssman Mar 27, 2021
325d794
Merge remote-tracking branch 'upstream/master' into feat/generate-tra…
M3ssman Mar 27, 2021
43e103c
[app][fix] enhance robustness for invalid PAGE
M3ssman May 5, 2021
15a6180
[app][feat] enhance char filter
M3ssman May 5, 2021
cf54dd9
[test][rfct] unified fixture usage
M3ssman May 5, 2021
055d89a
[app][rfct] please lgtome
M3ssman May 5, 2021
ba1ed11
[app][rfct] do not generate, only extract
M3ssman May 7, 2021
91d592d
[app][fix] check text exists
M3ssman May 7, 2021
162148d
[app][feat] use shapes instead of bboxes
M3ssman May 28, 2021
fd09bb0
[app][feat] filter polygon shapes properly
M3ssman May 30, 2021
3914f78
[app][feat] padding, binarizing, rotating
M3ssman Jun 1, 2021
94f564a
[app][rfct] drop unused import
M3ssman Jun 1, 2021
1a0d4c7
[app][fix] increase filter threshold
M3ssman Jun 2, 2021
2e02397
[app][fix] decrease grayscale range default
M3ssman Jun 3, 2021
2b9c9fa
[app][fix] decrease grayscale range even more
M3ssman Jun 3, 2021
b4edc80
[app][feat] dpi from png too
M3ssman Jun 3, 2021
9d67309
[app][fix] center from coord pairs mean
M3ssman Jun 4, 2021
cd8853f
[app][doc] update readme
M3ssman Jun 4, 2021
5fa6ade
[app][rfct] clear err msg
M3ssman Jul 5, 2021
cafe6e0
[app][rfct] clear error subject
M3ssman Jul 5, 2021
7acd59f
Merge branch 'tesseract-ocr:master' into feat/generate-trainingsets
M3ssman Aug 2, 2021
f076cf8
Merge branch 'master' into tmp
M3ssman Jan 26, 2023
dbb0553
[ci/cd][fix] pin latest working opencv
M3ssman Jan 26, 2023
31c617b
[ci/cd][upd] follow test layout
M3ssman Jan 26, 2023
a5a9be7
[app][fix] opencv usage concentrated
M3ssman Jan 26, 2023
cb33288
[ci/cd][feat] update build
M3ssman Sep 20, 2023
f7afb3c
[app][fix] update impl
M3ssman Sep 20, 2023
4ab7b4d
Merge remote-tracking branch 'upstream/main' into tmp_merge
M3ssman Sep 21, 2023
78a126c
[app][rfct] fit src/tesstrain layout
M3ssman Sep 21, 2023
90ccf29
[app][feat] gather pairs from data from dirs
M3ssman Sep 22, 2023
80ee2d5
[app][feat] aggr n of created pairs
M3ssman Sep 22, 2023
736e73e
[app][fix] dont remove from same list
M3ssman Sep 25, 2023
2 changes: 2 additions & 0 deletions .gitignore
@@ -21,3 +21,5 @@ master.zip
main.zip
plot/*.LOG
plot/ocrd*
__pycache__
*.egg-info
31 changes: 31 additions & 0 deletions README.md
@@ -195,6 +195,37 @@ cd ./plot
./plot_cer.sh
```

## Extract training data from ALTO/PAGE and images

tesstrain provides a utility `tesstrain-extract-sets` to generate pairs of text lines and corresponding line images from input data in the form of
[ALTO](https://www.loc.gov/standards/alto/) or
[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) files that represent scanned pages (complete or partial) with existing OCR.

To install `tesstrain-extract-sets`, first set up a virtual environment and install the project via `pip`:

```
# create virtual environment in subfolder "venv"
python3 -m venv venv
# unix
source venv/bin/activate
# win
venv\Scripts\activate.bat

# actual install
pip install .
```

`tesstrain-extract-sets` currently supports OCR data in ALTO V3, PAGE 2013 and PAGE 2019, as well as TIFF, JPEG and PNG images.

Output is written as UTF-8 encoded plain text files and TIFF images. The image frame is produced from the textline coordinates in the OCR data, so please take care of properly annotated geometrical information. Additionally, the tool can add a fixed synthetic padding around the textline or store it binarized (`--binarize`).
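
To illustrate how an image frame can follow from textline coordinates, here is a minimal sketch. It assumes a PAGE-style `points` string of the form `x1,y1 x2,y2 ...`; the helper names `parse_points`, `bounding_box` and `center` are hypothetical and are not taken from this pull request.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def parse_points(points: str) -> List[Point]:
    """Parse a PAGE-style points string such as '10,20 110,20 110,60 10,60'."""
    pairs = []
    for token in points.split():
        x, y = token.split(",")
        pairs.append((int(x), int(y)))
    return pairs


def bounding_box(points: List[Point]) -> Tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) enclosing all points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def center(points: List[Point]) -> Tuple[float, float]:
    """Return the mean of all coordinate pairs (an assumed notion of 'center')."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return sum(xs) / len(xs), sum(ys) / len(ys)


line = parse_points("10,20 110,20 110,60 10,60")
box = bounding_box(line)   # frame that would be cut from the page image
mid = center(line)         # a possible pivot, e.g. for rotating the line
```

A line with missing or degenerate coordinates yields an empty or zero-area frame, which is why the geometry in the OCR data needs to be annotated properly.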
Review comment (Collaborator):

Isn't padding for raw images going to be a disaster? I'd recommend making this combination disallowed in the CLI right away.
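
One way the suggested restriction could look is sketched below. It reuses the flag names `--binarize` and `-p/--padding` from `cli.py`; the default padding of `0`, the `prog` name and the helper `parse_checked` are assumptions made only for this illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Reduced parser with only the two options relevant here.
    parser = argparse.ArgumentParser(prog="extract-sketch")
    parser.add_argument("--binarize", action="store_true")
    # Assumption: a padding of 0 means "no padding".
    parser.add_argument("-p", "--padding", type=int, default=0)
    return parser


def parse_checked(argv):
    """Parse argv and reject padding for non-binarized (raw) line images."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.padding > 0 and not args.binarize:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--padding is only supported together with --binarize")
    return args
```

With this check, `parse_checked(["--binarize", "-p", "5"])` is accepted, while `parse_checked(["-p", "5"])` terminates with a usage error.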


By default, several sanitize actions are performed at image line level, like deskewing or removement of top-bottom intruders. To disable this, add flag `--no-sanitze`.
Review comment (Collaborator):

Suggested change
By default, several sanitize actions are performed at image line level, like deskewing or removement of top-bottom intruders. To disable this, add flag `--no-sanitze`.
By default, several optimization actions are performed at image line level, like deskewing or removal of top-bottom intruders. To disable this, add flag `--no-sanitize`.
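
A side note on the flag itself: in `cli.py`, `--sanitize` is declared with `type=bool`. argparse passes the raw string to `bool()`, so any non-empty value, including `False`, is parsed as true. Below is a minimal sketch of an alternative using a `store_true`/`store_false` pair. The default of `True` is an assumption derived from the README wording ("By default, several sanitize actions are performed"), and the `prog` name is hypothetical.

```python
import argparse

# bool() of any non-empty string is True, which is why type=bool misleads.
assert bool("False") is True


def build_parser(default_sanitize: bool = True) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sanitize-sketch")
    # Both flags write to the same destination; neither takes a value.
    parser.add_argument("--sanitize", dest="sanitize", action="store_true",
                        help="sanitize textline images")
    parser.add_argument("--no-sanitize", dest="sanitize", action="store_false",
                        help="skip sanitizing of textline images")
    parser.set_defaults(sanitize=default_sanitize)
    return parser


parser = build_parser()
default_args = parser.parse_args([])
disabled_args = parser.parse_args(["--no-sanitize"])
```

On Python 3.9 or newer, `argparse.BooleanOptionalAction` generates the `--no-` variant automatically and would be another option.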


See `tesstrain-extract-sets --help` for a brief listing of all supported command line flags and options.

**NOTE:** The text of the lines is extracted as-is, no automatic correction takes place. It is strongly recommended to review the generated data before training Tesseract with it.


## License

Software is provided under the terms of the `Apache 2.0` license.
20 changes: 20 additions & 0 deletions extract_sets/__init__.py
@@ -0,0 +1,20 @@
from .training_sets import (
    TrainingSets,
    gray_canvas,
    read_dpi,
    calculate_grayscale,
    clear_vertical_borders,
    rotate_text_line_center,
    coords_center,
    DEFAULT_OUTDIR_PREFIX,
    DEFAULT_MIN_CHARS,
    DEFAULT_USE_SUMMARY,
    DEFAULT_USE_REORDER,
    DEFAULT_INTRUSION_RATIO,
    DEFAULT_ROTATION_THRESH,
    DEFAULT_BINARIZE,
    DEFAULT_SANITIZE,
    SUMMARY_SUFFIX,
    DEFAULT_PADDING,
    XML_NS
)
Binary file added extract_sets/__init__.pyc
Binary file not shown.
129 changes: 129 additions & 0 deletions extract_sets/cli.py
@@ -0,0 +1,129 @@
# -*- coding: utf-8 -*-
"""Generate Sets of Training Data TextLine + Image Pairs"""

import argparse
import os

from . import (
    TrainingSets,
    DEFAULT_OUTDIR_PREFIX,
    DEFAULT_MIN_CHARS,
    DEFAULT_USE_SUMMARY,
    DEFAULT_USE_REORDER,
    DEFAULT_INTRUSION_RATIO,
    DEFAULT_ROTATION_THRESH,
    DEFAULT_SANITIZE,
    DEFAULT_BINARIZE,
    DEFAULT_PADDING,
    SUMMARY_SUFFIX
)


########
# MAIN #
########
def main():
    PARSER = argparse.ArgumentParser(description="generate pairs of textlines and image frames from existing OCR and image data")
    PARSER.add_argument(
        "data",
        type=str,
        help="path to local alto|page file corresponding to image")
    PARSER.add_argument(
        "-i",
        "--image",
        required=False,
        help="path to local image file tif|jpg|png corresponding to ocr. (default: read from OCR-Data)")
    PARSER.add_argument(
        "-o",
        "--output",
        required=False,
        help="optional: output directory, re-created if already exists. (default: <script-dir>/<{}-ocr-name>)".format(DEFAULT_OUTDIR_PREFIX))
    PARSER.add_argument(
        "-m",
        "--minchars",
        required=False,
        type=int,
        default=int(DEFAULT_MIN_CHARS),
        help="optional: minimum printable chars required for a line to be included into set (default: {})".format(DEFAULT_MIN_CHARS))
    PARSER.add_argument(
        "-s",
        "--summary",
        required=False,
        action='store_true',
        default=DEFAULT_USE_SUMMARY,
        help="optional: print all lines in additional file (default: {}, pattern: <default-output-dir>{})".format(DEFAULT_USE_SUMMARY, SUMMARY_SUFFIX))
    PARSER.add_argument(
        "-r",
        "--reorder",
        required=False,
        action='store_true',
        default=DEFAULT_USE_REORDER,
        help="optional: re-order word tokens from right-to-left (default: {})".format(DEFAULT_USE_REORDER))
    PARSER.add_argument(
        "--binarize",
        required=False,
        action='store_true',
        default=DEFAULT_BINARIZE,
        help="optional: binarize textline images (default: {})".format(DEFAULT_BINARIZE))
    PARSER.add_argument(
        "--sanitize",
        required=False,
        type=bool,
        default=DEFAULT_SANITIZE,
        help="optional: sanitize textline images (default: {})".format(DEFAULT_SANITIZE))
    PARSER.add_argument('--no-sanitize', dest='sanitize', action='store_false')
    PARSER.add_argument(
        "--intrusion-ratio",
        required=False,
        default=DEFAULT_INTRUSION_RATIO,
        help="optional: alter threshold for top and bottom ratios for intrusion detection for sanitizing (default: {})".format(DEFAULT_INTRUSION_RATIO))
    PARSER.add_argument(
        "--rotation-threshold",
        required=False,
        type=float,
        default=DEFAULT_ROTATION_THRESH,
        help="optional: alter threshold for rotation of textline image (default: {})".format(DEFAULT_ROTATION_THRESH))
    PARSER.add_argument(
        "-p",
        "--padding",
        required=False,
        type=int,
        default=DEFAULT_PADDING,
        help="optional: additional padding for existing textline image (default: {})".format(DEFAULT_PADDING))

    ARGS = PARSER.parse_args()
    PATH_OCR = ARGS.data
    PATH_IMG = ARGS.image
    FOLDER_OUTPUT = ARGS.output
    MIN_CHARS = ARGS.minchars
    SUMMARY = ARGS.summary
    REORDER = ARGS.reorder
    BINARIZE = ARGS.binarize
    SANITIZE = ARGS.sanitize
    INTR_RATIO = ARGS.intrusion_ratio
    if isinstance(INTR_RATIO, str) and ',' in INTR_RATIO:
        INTR_RATIO = [float(n) for n in INTR_RATIO.split(',')]
    else:
        INTR_RATIO = float(INTR_RATIO)
    ROTA_THRESH = ARGS.rotation_threshold
    PADDING = ARGS.padding

    if os.path.exists(PATH_OCR):
        print("[INFO ] generate trainingsets from '{}'".format(PATH_OCR))
        print("[DEBUG] args: {}".format(ARGS))
        TRAINING_DATA = TrainingSets(PATH_OCR, PATH_IMG)
        RESULT = TRAINING_DATA.create(
            folder_out=FOLDER_OUTPUT,
            min_chars=MIN_CHARS,
            summary=SUMMARY,
            reorder=REORDER,
            intrusion_ratio=INTR_RATIO, rotation_threshold=ROTA_THRESH,
            binarize=BINARIZE, sanitize=SANITIZE, padding=PADDING)
        print(
            "[SUCCESS] created '{}' training data sets from '{}' in '{}', please review".format(
                len(RESULT), PATH_OCR, TRAINING_DATA.path_out))
    else:
        print(
            "[ERROR ] missing OCR '{}' or Image Data '{}'!".format(
                PATH_OCR,
                PATH_IMG))
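
The manual conversion of `--intrusion-ratio` after parsing (a single float, or a comma-separated list of floats) could alternatively live in an argparse `type` callable, so that malformed input is reported as a regular usage error. The following is only a sketch: the function name `intrusion_ratio`, the `prog` name and the default of `0.1` are hypothetical and are not taken from this pull request.

```python
import argparse
from typing import List, Union


def intrusion_ratio(value: str) -> Union[float, List[float]]:
    """Convert '0.1' to a float and '0.1,0.2' to a list of floats."""
    try:
        if "," in value:
            return [float(n) for n in value.split(",")]
        return float(value)
    except ValueError:
        raise argparse.ArgumentTypeError(
            "expected a float or comma-separated floats, got {!r}".format(value))


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ratio-sketch")
    # Assumption: 0.1 stands in for the project's real default value.
    parser.add_argument("--intrusion-ratio", type=intrusion_ratio, default=0.1)
    return parser


parser = build_parser()
single = parser.parse_args(["--intrusion-ratio", "0.15"]).intrusion_ratio
pair = parser.parse_args(["--intrusion-ratio", "0.1,0.2"]).intrusion_ratio
```

Note that argparse applies `type` to a default only when that default is a string, so a numeric default is passed through unchanged.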