Don't set page segmentation mode for hocr, pdf and tsv configs
Setting the page segmentation mode in those config files gives unexpected
results: the text recognized with no config, or with only txt, changes
when txt is combined with any of hocr, pdf or tsv.

In a test set of nearly 200 pages from historical books, using
segmentation mode 1 is typically slightly better than the default,
but there are also cases where it is much worse. Therefore the user
should be able to decide which page segmentation mode is best.

Old results for hocr, pdf or tsv now need an explicit `--psm 1` for
reproduction.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil committed Oct 4, 2018
1 parent b15fbf1 commit ecfee53
Showing 3 changed files with 0 additions and 3 deletions.
1 change: 0 additions & 1 deletion tessdata/configs/hocr
@@ -1,3 +1,2 @@
 tessedit_create_hocr 1
-tessedit_pageseg_mode 1
 hocr_font_info 0
1 change: 0 additions & 1 deletion tessdata/configs/pdf
@@ -1,2 +1 @@
 tessedit_create_pdf 1
-tessedit_pageseg_mode 1
1 change: 0 additions & 1 deletion tessdata/configs/tsv
@@ -1,2 +1 @@
 tessedit_create_tsv 1
-tessedit_pageseg_mode 1
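
As a rough illustration of the reproduction note in the commit message, the sketch below prints command lines that pass `--psm 1` explicitly alongside each of the three affected configs. The names `page.png` and `out_*` are hypothetical placeholders, and `legacy_cmd` is a helper invented for this example, not part of tesseract. The function only prints the commands, so nothing runs until the output is executed.

```shell
#!/bin/sh
# Sketch only: build command lines that should match the output produced
# before this commit, when the hocr, pdf and tsv configs still set
# tessedit_pageseg_mode 1 themselves.
# legacy_cmd is a hypothetical helper; page.png and out_* are placeholders.
legacy_cmd() {
    # $1: input image, $2: output base name, $3: config (hocr, pdf or tsv)
    echo "tesseract $1 $2 --psm 1 $3"
}

# Print one command per affected config.
for cfg in hocr pdf tsv; do
    legacy_cmd page.png "out_$cfg" "$cfg"
done
```

Dropping `--psm 1` from these commands leaves the choice to tesseract's default page segmentation mode, which is the behaviour this commit makes uniform across the txt, hocr, pdf and tsv outputs.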
