CON TENT BS.
Frontispiece, portrait of Mr. George N. Lawrence.
OUMIGIHIUG) 25 ¢ 00506050505 bo08 dob osecsececss Sascccnoceoo cog sraScars oabs Co0e
EOC Tap Da CPEs KUCH scr se[a coe eo cfalainis <6.o.5 = saaiere o\afs sles oeiayor i= = sy-lom sia ee
Avian genus named in honor of Mr. George N. Lawrence ......---.--.-.-----
Species of birds named in honor of Mr. George N. Lawrence...--..--...----.
Chronological catalogue of the publications of Mr. George N. Lawrence, 1844
OPE ees ars aris aja cme ciers stoio cS Sis. ciioS 5 Hele wesc See oles are aicisia sing <5 So's =, eeeees
Alphabetical list of new species and subspecies of birds, described by Mr.
George N. Lawrence, 1846 to 1891, with habitat of type specimen-.-.-..-.----.
HDG Coe Sod. con dOnGd Coot COD GTO SES= DEUS Ne GE EO An eebe Berio snnbeeere 6 mmc Gace
(Note: --psm 4 and --psm 11 yield very slightly improved results.)
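To make the page segmentation comparison reproducible, something like the following could run the same page through several `--psm` values. This is only a sketch: the image name `toc.png` and the helper names are assumptions, not part of the original report, and it relies on the `tesseract` CLI accepting `stdout` as the output base.

```python
import shutil
import subprocess


def tesseract_command(image_path, psm):
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    # Passing "stdout" as the output base makes the CLI print the text
    # instead of writing a .txt file.
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def run_psm_modes(image_path, modes=(3, 4, 11)):
    """Run OCR once per page segmentation mode and return {psm: text}."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    results = {}
    for psm in modes:
        proc = subprocess.run(
            tesseract_command(image_path, psm),
            capture_output=True,
            text=True,
            check=True,
        )
        results[psm] = proc.stdout
    return results
```

Calling `run_psm_modes("toc.png")` (hypothetical file name) would return one text result per mode, which can then be compared side by side.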
Manually removing the leader dots causes the OCR quality to improve significantly:
CON TEN? 5.
Frontispiece, portrait of Mr. George N. Lawrence.
Contents
Biographical sketch
Avian genus named in honor of Mr. George N. Lawrence
Species of birds named in honor of Mr. George N. Lawrence
Chronological catalogue of the publications of Mr. George N. Lawrence, 1844
to 1891
Alphabetical list of new species and subspecies of birds, described by Mr.
George N. Lawrence, 1846 to 1891, with habitat of type specimen
Index
Note that the improvement is not just that some of the noise is gone; words which were entirely mis-OCR'd before are now recognized correctly (e.g. "Biographical sketch").
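The dots in this report were removed by hand. As a rough illustration of how the same step might be automated, the sketch below erases small connected ink blobs from a binarized image. It is an assumption-laden workaround, not the author's method and not a Tesseract feature: the image is modelled as a plain list of 0/1 rows, the `max_area` threshold is hypothetical, and the filter would also erase periods and the dots on "i" and "j" unless it is restricted to the leader region.

```python
from collections import deque


def remove_small_components(image, max_area=4):
    """Return a copy of a binary image with small ink blobs erased.

    `image` is a list of rows; 1 means ink, 0 means background.
    Every 8-connected ink component with at most `max_area` pixels is cleared.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Small components are treated as leader dots and erased.
            if len(component) <= max_area:
                for cy, cx in component:
                    cleaned[cy][cx] = 0
    return cleaned
```

In practice the pixel grid would come from an imaging library and the threshold would depend on scan resolution; both are left out here to keep the sketch self-contained.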
Expected Behavior
The leader dots shouldn't cause the entire line of text to be mis-OCR'd.
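One way to turn this expectation into a measurable check is to score each expected line against the closest line in the OCR output. The sketch below uses Python's `difflib.SequenceMatcher`; the helper names are hypothetical, and what counts as an acceptable score is left open.

```python
from difflib import SequenceMatcher


def line_similarity(expected, actual):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, expected, actual).ratio()


def best_match_scores(expected_lines, ocr_lines):
    """For each expected line, return the score of its closest OCR line."""
    scores = {}
    for expected in expected_lines:
        scores[expected] = max(
            (line_similarity(expected, line) for line in ocr_lines),
            default=0.0,  # no OCR lines at all
        )
    return scores
```

Running `best_match_scores` on the output with and without leader dots would show which entries (for example "Biographical sketch") are lost entirely rather than merely followed by noise.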
Suggested Fix
Perhaps including cases of leader dots in the training data might help? I might be able to provide more examples of those.
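As a sketch of what such examples might look like, the snippet below generates synthetic table-of-contents lines with different leader styles. The page numbers are placeholders, the function names are hypothetical, and how the resulting text would be rendered to images and fed into Tesseract's training tooling is deliberately not covered.

```python
def toc_line(title, page, width=60, leader="."):
    """Format one contents line: title, a run of leaders, then the page number."""
    page_text = str(page)
    # One space separates the leader run from the title and from the page number.
    fill = width - len(title) - len(page_text) - 2
    if fill < 3:
        raise ValueError("width too small for title, leaders, and page number")
    # Repeat the leader pattern and cut it to the exact fill length,
    # so multi-character patterns such as ". " also work.
    run = (leader * fill)[:fill]
    return f"{title} {run} {page_text}"


def toc_samples(entries, leaders=(".", ". ", "-"), width=60):
    """Return one formatted line per (title, page) entry and leader style."""
    return [
        toc_line(title, page, width=width, leader=leader)
        for leader in leaders
        for title, page in entries
    ]


# Titles are taken from the report; the page numbers are made up.
EXAMPLE_ENTRIES = [("Contents", 1), ("Biographical sketch", 2), ("Index", 3)]
```

`toc_samples(EXAMPLE_ENTRIES)` yields nine lines (three entries in three leader styles), each padded to the same width.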
tesseract -v
tesseract 5.3.2
leptonica-1.82.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.14
Operating System
Ubuntu 22.04 Jammy
Other Operating System
No response
uname -a
Linux 0bb489023524 5.15.109+ #1 SMP Fri Jun 9 10:57:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Compiler
No response
CPU
No response
Virtualization / Containers
No response
Other Information
No response
Current Behavior
Table of contents pages very often include decorative dots (called "leader dots") between the chapter name and the page number.
These cause the OCR output to be unnaturally low quality for the entire page; the raw output is the first listing in this report.