In [2]:
!pip install pytesseract
!pip install fpdf



In this notebook I run through a few examples showing my progress working with Tesseract. Note that Tesseract itself has to be installed locally before the Python API can be used; installation instructions are at https://tesseract-ocr.github.io/tessdoc/Installation.html. Information on pytesseract can be found here: https://pypi.org/project/pytesseract/. To run OCR on other languages, you also have to install the language data for those languages.

In [1]:
from PIL import Image

import pytesseract
import re

# If you don't have tesseract executable in your PATH, include the following:
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
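
Hard-coding the executable path only works on one machine. As a minimal sketch, the lookup could first check the PATH and only then fall back to a fixed location. The helper name find_tesseract and the fallback path (including the .exe suffix) are assumptions for illustration, not part of pytesseract.

```python
import shutil
from pathlib import Path

# Assumed default install location on Windows; adjust for your machine.
DEFAULT_WINDOWS_CMD = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback=DEFAULT_WINDOWS_CMD):
    """Return a path to the tesseract executable, or None if none is found."""
    # shutil.which searches the directories listed in PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise accept the fallback only if a file really exists there.
    if Path(fallback).is_file():
        return fallback
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("tesseract not found; install it or set tesseract_cmd by hand")
else:
    print(f"using tesseract at {tesseract_path}")
    # With pytesseract imported, this value can be assigned to
    # pytesseract.pytesseract.tesseract_cmd as in the cell above.
```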

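Since we will be using Tesseract's script-level trained data rather than per-language data, we will tell Tesseract to look in the script folder (the function further down does this by passing '-l script/' followed by the detected script name).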

Here we will run a basic example. We pass the path of the test image to pytesseract and write the result to our results folder as a PDF. Tesseract will automatically generate the hOCR component for the input png. Since we specify the extension to be 'pdf', it will merge the png and the hOCR layer into a searchable PDF. Note that the bulk of the testing data was found on the Tesseract GitHub at: https://github.com/tesseract-ocr/tesseract. The file basic_test.png was downloaded from https://i.pinimg.com/originals/e0/80/96/e08096c170d005631906ef908b4d209b.jpg.

In [2]:
file_name = 'basic_test.png'
pdf = pytesseract.image_to_pdf_or_hocr(f'data\\{file_name}', extension='pdf')
with open(f'results\\{file_name.split(".")[0]}.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default
print(pytesseract.image_to_string(f'data\\{file_name}'))

ANNA KARENINA 5

‘But then, while she was here in the house with us, I
did not permit myself any liberties. And the worst of
all is that she is already... All this must needs happen
just to spite me. Ar! ar! ar! But what, what is to be
done?”

There was no answer except that common answer
which life gives to all the most complicated and unsolva-
ble questions, —this answer: You must live according
to circumstances, in other words, forget yourself. But
as you cannot forget yourself in sleep—at least till
night, as you cannot return to that music which the
water-bottle woman sang, therefore you must forget
yourself in the dream of life!

“We shali see by and by,” said Stepan Arkadyevitch
to himself, and rising he put on his gray dressing-gown
with blue silk lining, tied the tassels into a knot, and
took a full breath into his ample lungs. Then with his
usual firm step, his legs spread somewhat apart and
easily bearing the solid weight of his body, he went
over to the window, lifted the c
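
The cell above builds the output name with file_name.split(".")[0], which keeps only the text before the first dot. That is fine for basic_test.png but would shorten a name such as 8087_054.3B.tif to 8087_054. A minimal sketch of an alternative using pathlib follows; the helper name result_path and its parameters are hypothetical, and PureWindowsPath is used only so the backslash-style paths in this notebook behave the same on any operating system.

```python
from pathlib import PureWindowsPath


def result_path(file_name, extension="pdf", results_dir="results"):
    """Build the output path by swapping only the final suffix of file_name."""
    name = PureWindowsPath(file_name)
    # with_suffix replaces the last suffix only, so inner dots are preserved.
    return str(PureWindowsPath(results_dir) / name.with_suffix(f".{extension}"))


for example in ["basic_test.png", "8087_054.3B.tif"]:
    print(example, "->", result_path(example))
```

On a real run, pathlib.Path could be used in place of PureWindowsPath so that the separator matches the current platform.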

Here I experiment with reading in diagonal text. As you can see by looking at the results, the text is not generated correctly, though it is positioned correctly. One way to handle this would be to rotate the image before generating the pdf, but this is not preferred. I am investigating whether there is some method for natively reading rotated text in tesseract.

In [3]:
file_name = 'phototest-rotated-L.png'
pdf = pytesseract.image_to_pdf_or_hocr(f'data\\{file_name}', extension='pdf')
with open(f'results\\{file_name.split(".")[0]}.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default
print(pytesseract.image_to_string(f'data\\{file_name}'))

“xoy AZe] 94} JBAO peduunl Bop umosgq
yoinb ay *xo} Aze] eu} Jeno peduuni
Bop umojg yoInb ay “xo Aze] ay} JaA0
peduun{ Bop umoiq yoinb aus ‘xoy AzZe]
24} J8A0 peduin{ Bop umosg yoinb ayy
“JEUOY SII} JO

sad} ||— UO SyJOM }I J 89S PUB BpPod 190
OU} 4S9} 0} }X9} JulOd Z|, JO JO] © SI SIL



We can use the image_to_osd method to get useful information about the scan, such as its orientation and script.

In [4]:
file_name = 'hebrew-nikud-genesis-1-2.png'
print(pytesseract.image_to_osd(f'data\\{file_name}'))

Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 41.23
Script: Hebrew
Script confidence: 63.70
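
The report above is plain 'Key: value' text, so it can be turned into a dictionary before use. The sketch below does this with the standard library only, using the output printed above as sample input; the function name parse_osd is hypothetical. (pytesseract can, as far as I know, also return this as a dictionary via its output_type argument, but that is not used here.)

```python
# Sample text copied from the image_to_osd output shown above.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 41.23
Script: Hebrew
Script confidence: 63.70
"""


def parse_osd(osd_text):
    """Turn the 'Key: value' lines of an OSD report into a dict of strings."""
    info = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip blank lines and anything without a colon
            info[key.strip()] = value.strip()
    return info


osd = parse_osd(SAMPLE_OSD)
rotate = int(osd["Rotate"])                  # rotation in degrees
lang_config = f"-l script/{osd['Script']}"   # config string for script data
print(rotate, osd["Script"], lang_config)
```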



Here is a function that creates a PDF in our results folder. It detects the rotation and the script of the image, rotates the image accordingly, and runs OCR with the trained data for the detected script.

In [5]:
def create_pdf_w_ocr(file_name, extension='pdf'):
    # open the image and do all work on it while the file is still open
    with Image.open(f'data\\{file_name}') as img:
        # orientation and script detection report (plain text)
        osd_info = pytesseract.image_to_osd(img)

        # getting the rotation angle (raw string avoids an invalid-escape warning)
        rotation_angle = re.search(r'(?<=Rotate: )\d+', osd_info).group(0)

        # Image.rotate turns counter-clockwise, so the negative angle turns clockwise
        rotated_img = img.rotate(-float(rotation_angle), expand=True)

    # getting the detected script, e.g. 'Hebrew'
    script = re.search(r'Script: ([a-zA-Z]+)\n', osd_info).group(1)

    # creating our pdf (or other file) with the script-level trained data
    pdf = pytesseract.image_to_pdf_or_hocr(rotated_img, config='-l script/' + script, extension=extension)

    # saving it under the requested extension; rsplit keeps dots inside the base name
    with open(f'results\\{file_name.rsplit(".", 1)[0]}.{extension}', 'w+b') as f:
        f.write(pdf)

file_name = 'hebrew-nikud-genesis-1-2.png'
create_pdf_w_ocr(file_name)

In [6]:
file_name = 'hebrew-nikud-genesis-1-2.png'
create_pdf_w_ocr(file_name)

In [7]:
file_name = 'phototest-rotated-L.png'
create_pdf_w_ocr(file_name)

Below we see that .tif files (and therefore also .tiff files) work as well.

In [None]:
file_name = '8087_054.3B.tif'
create_pdf_w_ocr(file_name)

As we see below, the model makes occasional errors: here 324 is incorrectly read as 824. If you check the resulting PDF (324.pdf), though, the text is still overlaid in the correct position.

In [8]:
file_name = '324.tif'
pdf = pytesseract.image_to_pdf_or_hocr(f'data\\{file_name}', extension='pdf')
with open(f'results\\{file_name.split(".")[0]}.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default
print(pytesseract.image_to_string(f'data\\{file_name}'))

824



Here we test out a multi-page scan. The test is a series of png files generated from a document I filled with random lorem ipsum text. The input is the path to a text file that lists the paths of the pngs.

In [9]:
file_name = 'multi_page_test\\multi_page_test.txt'
pdf = pytesseract.image_to_pdf_or_hocr(f'data\\{file_name}', extension='pdf')
with open(f'results\\{file_name.split(".")[0]}.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default
print(pytesseract.image_to_string(f'data\\{file_name}'))

Labore dolore dolores accusam facilisi nisl takimata tempor ut laoreet. Rebum esse
accusam facilisis takimata accusam veniam accusam in diam. Sanctus vero diam clita
invidunt sit. Et sit ad invidunt ea.

Lorem possim qui stet eu gubergren. Adipiscing stet dolore magna eirmod est placerat
dolor elitr clita mazim tincidunt at aliquyam. Est ut tempor dolor. Sit consetetur vero.

Ipsum stet diam vero ea duo aliquyam diam clita voluptua autem iriure blandit ut
voluptua. Labore qui sit sed elitr gubergren eros clita takimata dolore diam ea sanctus
erat justo velit et. Elitr dignissim consectetuer. Tempor sea facer sed ut magna clita
dolor magna volutpat consetetur elitr in option ut. Clita erat lobortis consequat duo eros
takimata. Ut iriure odio in nonummy sanctus sanctus nibh sed dolor consetetur ut sit
dolor accusam dolore. Zzril consequat ea diam stet clita. Sit feugiat kasd lorem dolore
dolor stet est magna eros invidunt nonummy. Dolor gubergren magna nisl invidunt
ipsum. lpsum duis ver
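
The list file used in the multi-page cell above is just a text file with one image path per line. A minimal sketch of generating such a file is shown below. The helper name write_page_list and the page file names are hypothetical, and the sketch assumes the pages should appear in the order in which they are listed. A temporary directory is used so the example does not touch the data folder.

```python
import tempfile
from pathlib import Path


def write_page_list(image_paths, list_path):
    """Write one image path per line to list_path and return it as a Path."""
    list_path = Path(list_path)
    text = "\n".join(str(p) for p in image_paths) + "\n"
    list_path.write_text(text, encoding="utf-8")
    return list_path


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical page images; in the notebook these would be the real pngs.
    pages = [Path(tmp) / f"page_{i}.png" for i in range(1, 4)]
    list_file = write_page_list(pages, Path(tmp) / "multi_page_test.txt")
    lines = list_file.read_text(encoding="utf-8").splitlines()
    print(len(lines), "pages listed")
```

The resulting text file would then be passed to image_to_pdf_or_hocr in the same way as multi_page_test.txt above.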