Eng text with leader dots causes OCR output to be very low quality #4126

Closed
cdrini opened this issue Sep 20, 2023 · 1 comment

cdrini commented Sep 20, 2023

Current Behavior

Table of contents pages very often include decorative dots (called "Leader dots") between the chapter name and the page number. For example:

These dots make the OCR output unusually poor for the entire page, not just for the dotted regions. Here is the OCR output:

Setup:

sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
sudo apt-get update -qq
sudo apt install -y -qq tesseract-ocr tesseract-ocr-eng
wget -O tmp.jpg 'https://archive.org/download/publishedwriting00fost/publishedwriting00fost_jp2.zip/publishedwriting00fost_jp2%2Fpublishedwriting00fost_0017.jp2?ext=jpg&reduce=0&quality=100'
tesseract tmp.jpg tmp --dpi 400 -l eng
cat tmp.txt

Output:

CON TENT BS.

Frontispiece, portrait of Mr. George N. Lawrence.

OUMIGIHIUG) 25 ¢ 00506050505 bo08 dob osecsececss Sascccnoceoo cog sraScars oabs Co0e
EOC Tap Da CPEs KUCH scr se[a coe eo cfalainis <6.o.5 = saaiere o\afs sles oeiayor i= = sy-lom sia ee
Avian genus named in honor of Mr. George N. Lawrence ......---.--.-.-----
Species of birds named in honor of Mr. George N. Lawrence...--..--...----.

Chronological catalogue of the publications of Mr. George N. Lawrence, 1844
OPE ees ars aris aja cme ciers stoio cS Sis. ciioS 5 Hele wesc See oles are aicisia sing <5 So's =, eeeees

Alphabetical list of new species and subspecies of birds, described by Mr.

George N. Lawrence, 1846 to 1891, with habitat of type specimen-.-.-..-.----.
HDG Coe Sod. con dOnGd Coot COD GTO SES= DEUS Ne GE EO An eebe Berio snnbeeere 6 mmc Gace

Note: --psm 4 and --psm 11 yield very slightly improved results.
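
For anyone who wants to reproduce the page segmentation mode comparison, here is a minimal sketch that runs the same image through several --psm values and writes one text file per mode. The function name psm_sweep, the out_psmN output names, and the TESSERACT_BIN override are assumptions made for illustration; they are not part of Tesseract. The --dpi, --psm and -l options are the ones already used in this report.

```shell
#!/usr/bin/env bash
# Sketch: OCR one image under several page segmentation modes.
# TESSERACT_BIN is a hypothetical override so another binary (or a stub)
# can be substituted; by default the "tesseract" on PATH is used.
psm_sweep() {
  local image="$1"
  shift
  local bin="${TESSERACT_BIN:-tesseract}"
  local psm
  for psm in "$@"; do
    # Tesseract appends ".txt" to the output base, so this writes out_psmN.txt
    "$bin" "$image" "out_psm${psm}" --dpi 400 --psm "$psm" -l eng
  done
}
```

Calling psm_sweep tmp.jpg 3 4 11 (3 being the default mode) and then diffing out_psm3.txt against the other files should show how much each mode changes the result on the leader-dot lines.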

Manually removing the leader dots causes the OCR quality to improve significantly:

CON TEN? 5.

Frontispiece, portrait of Mr. George N. Lawrence.

Contents

Biographical sketch

Avian genus named in honor of Mr. George N. Lawrence

Species of birds named in honor of Mr. George N. Lawrence

Chronological catalogue of the publications of Mr. George N. Lawrence, 1844
to 1891

Alphabetical list of new species and subspecies of birds, described by Mr.

George N. Lawrence, 1846 to 1891, with habitat of type specimen
Index

Note that the improvement is not only that the noise is gone; words that were entirely mis-OCR'd before are now recognized correctly (e.g. "Biographical sketch").
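
As a stopgap on the text side only, the dot and dash residue can be trimmed from lines whose wording survived. This is a sketch of a post-processing filter, not the manual image edit described above, and it cannot recover lines that were mis-OCR'd outright. The function name strip_leaders and the threshold of four characters are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: strip trailing leader-dot residue from OCR text read on stdin.
# Removes a run of 4 or more spaces, dots, or hyphens at the end of each line.
# The threshold of 4 is an arbitrary assumption; shorter runs (such as the
# period ending a sentence) are left alone.
strip_leaders() {
  sed -E 's/[ .-]{4,}$//'
}
```

For example, tmp.txt can be piped through it with strip_leaders < tmp.txt. Lines such as "Avian genus named in honor of Mr. George N. Lawrence ......---.--.-.-----" lose their trailing run, while fully garbled lines stay garbled.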

Expected Behavior

The leader dots shouldn't cause the entire line of text to be mis-OCR'd.

Suggested Fix

Perhaps including cases of leader dots in the training data might help? I might be able to provide more examples of those.

tesseract -v

tesseract 5.3.2
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.14

Operating System

Ubuntu 22.04 Jammy

Other Operating System

No response

uname -a

Linux 0bb489023524 5.15.109+ #1 SMP Fri Jun 9 10:57:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

No response

Collaborator

amitdo commented Sep 24, 2023

Closing because this is a duplicate of other issues. See the TOC label.
