The purpose of this repo and this document is to report on my attempt to perform optical character recognition (OCR) on archaeological reports using Python. It is a challenging task, since some of the source PDFs are in Cyrillic and are of poor quality.
- Vojtěch Kaše, Aarhus University/University of West Bohemia
CC-BY-SA 4.0, see attached License.md
The data for this project are scanned PDFs of various archaeological reports, differing in language and quality.
The tools in this repo have been tested on macOS with a local installation of Python 3. While I normally work online in Google Colab, here I am working with my local Python, since the workflow relies on the other two programs below through Python bindings.
- Python 3
- Tesseract
- MuPDF
Description: Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open-source OCR engines then available. (wiki)
- official tesseract repo
- repo with simple python example - The aim of this Repository is to be able to recognise text from an image file using the Tesseract Library in the Python Programming Language (video tutorial with simple .ipynb).
- extensive video tutorial
a) linux:
sudo apt-get install tesseract-ocr
b) macOS: you need either brew or macports installed. Ports are easier, but you must have "Command Line Tools for Xcode"; then you can run:
$ sudo port install tesseract
or:
$ sudo port install tesseract-<language>
to install a specific language version, e.g. tesseract-ces for Czech.
Alternatively, with brew you run:
$ brew install tesseract
However, I then do not know the syntax for installing individual language mutations, because brew install tesseract-lang does not work. An overview of languages is here.
- you might face problems with permissions to write into certain directories; to solve this, run:
$ sudo chown -R au648597 /usr/local/lib/pkgconfig /usr/local/share/info /usr/local/share/man/man3
$ chmod u+w /usr/local/lib/pkgconfig /usr/local/share/info /usr/local/share/man/man3
Simple command line example:
- move into a folder close to some test data and run:
tesseract data/test-image.png data/output
(Note that tesseract takes the output base name without an extension and appends .txt itself, producing data/output.txt.)
Simple pytesseract example from console
pytesseract
is a Python wrapper for tesseract. To open images inside Python, you also need the pillow
package, therefore:
- install both dependencies using pip (or pip3 for Python 3 specifically):
$ pip3 install pytesseract pillow
- open python
$ python3
- import the packages:
>>> from PIL import Image
>>> import pytesseract
- define a string variable and assign to it the output of the image_to_string()
function. (You have to be in the directory where "test-image.png" is located.)
>>> string = pytesseract.image_to_string(Image.open("test-image.png"), lang="eng")
- print the output:
>>> print(string)
It works very well with Czech texts as well. There are many apps relying on tesseract, like the Ancient Greek OCR app. You always have to install the language data first by running:
$ sudo port install tesseract-grc
Perhaps the most efficient way to work with PDFs within Python is to use MuPDF and PyMuPDF.
Tesseract itself works on images, so I was looking for a straightforward solution to work with PDFs. I was especially interested in extracting PDF pages into computer-vision-readable objects used by the cv2
library, which would enable me to do some additional transformations. Therefore I turned to PyMuPDF, based on MuPDF (following this thread). First, you have to install MuPDF (docs). From bash, you can install it using brew.
In Linux, you need to follow the same course of installing MuPDF and then PyMuPDF; to make the process smoother, see the instructions below this Mac section.
You can either install it straightforwardly:
$ pip3 install pymupdf
Or you can first install MuPDF itself, as I did. In my case, I first had to install xquartz:
$ brew cask install xquartz
and then I was finally able to run:
$ brew install mupdf
It was installed here: /usr/local/Cellar/mupdf
Installation of MuPDF and PyMuPDF on Ubuntu 18.04 can be a bit involved if you have different versions of Python and do not update your system very often. You may get errors (e.g. a pip3 install pymupdf failure, or others) when installing these prerequisites. In order to avoid problems, let's start with a system update:
$ sudo apt-get update
Then you can install MuPDF:
$ sudo apt-get install mupdf mupdf-tools
My terminal was happy after running these, and a diagnostic ls revealed the mupdf library in place:
$ ls /usr/lib/mupdf/mupdf* -1
> /usr/lib/mupdf/mupdf-x11
Thereafter, you should theoretically be able to install the Python bindings for MuPDF with a single pip3 install command. This method uses Python wheels and should be self-contained, i.e. you should not need to download or install any other software, including MuPDF itself, to run PyMuPDF scripts. The command goes to GitHub, grabs the most recent version of the binaries (or the one you indicate with ==), unpacks, and installs it. It works on most 64-bit Linux platforms with Python versions 2.7 through 3.8. The catch is: you may have multiple versions of Python, and your pip3 command may be out of date. So before pip3 install, it is good to check what versions of Python you have ...
$ ls /usr/lib/python
> python2.7/ python3/ python3.6/ python3.7/ python3.8/
Ok, I have a bunch of Python versions, which is fine as long as the recent 3.7 and 3.8 are there. Pip3 should get the installation job done as long as you update it first:
$ pip3 install --upgrade pip
Collecting pip
Downloading https://files.pythonhosted.org/packages/54/2e/df11ea7e23e7e761d484ed3740285a34e38548cf2bad2bed3dd5768ec8b9/pip-20.1-py2.py3-none-any.whl (1.5MB)
100% |████████████████████████████████| 1.5MB 1.1MB/s
Installing collected packages: pip
Successfully installed pip-20.1
And now, nothing stands in the way of installing PyMuPDF. I selected the most recent version from the releases page after verifying it fits my system, but pip3 will do that for you:
$ pip3 install PyMuPDF==1.16.18
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.16.18
Bingo! You are ready to move on to real work now.
(All the code below is from the jupyter notebook scripts/pdf-and-image-preprocessing.ipynb.)
PyMuPDF has a very intuitive usage (see the tutorial). To read a pdf, you just run (note that PyMuPDF is imported under the name fitz):
import fitz
doc = fitz.open("data/test-cyr.pdf") ### open the pdf
Then you can easily iterate over pages and do anything you wish (look for annotations, links, etcs.).
The most important thing for us is to render pages into a Pixmap
object. A Pixmap object is an RGB image of a page. As the documentation says, "[m]ethod Page.getPixmap()
offers lots of variations for controlling the image: resolution, colorspace (e.g. to produce a grayscale image or an image with a subtractive color scheme), transparency, rotation, mirroring, shifting, shearing, etc."
for page in doc:
    pix = page.getPixmap(colorspace = "GRAY") # try "csGRAY"
There are very nice recipes for this procedure. For instance, to get a better resolution, you can use the matrix parameter and, so to speak, "zoom in".
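To get a feel for what the zoom matrix does to pixel dimensions, here is a minimal sketch (no PyMuPDF needed; zoomed_size is a hypothetical helper, not part of the fitz API). At the default rendering resolution of 72 dpi, one PDF point maps to one pixel, so fitz.Matrix(2, 2) roughly doubles both dimensions of the rendered page:

```python
def zoomed_size(width_pt, height_pt, zoom_x, zoom_y):
    """Approximate pixel size of a page rendered with fitz.Matrix(zoom_x, zoom_y),
    assuming the default 72 dpi base (1 pt -> 1 px at zoom 1)."""
    return int(width_pt * zoom_x), int(height_pt * zoom_y)

# an A4 page is 595 x 842 pt:
print(zoomed_size(595, 842, 1, 1))  # (595, 842)
print(zoomed_size(595, 842, 2, 2))  # (1190, 1684)
```

More pixels per glyph generally gives tesseract more to work with, which is why the zoomed renderings below OCR better.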
To really test different parametrizations of getPixmap(), I produced 5 images of the same area from the third page of a pdf doc.
I also defined a function called rect
to capture a rectangular area of interest defined by a list of ratio values: [start height, end height, start width, end width].
def rect(img, ratios):
    '''return the rectangle defined by side ratios'''
    h = img.shape[0]
    w = img.shape[1]
    return img[int(h * ratios[0]):int(h * ratios[1]), int(w * ratios[2]):int(w * ratios[3])]
test_img = doc[2] ### select the page
test_imgs = []
test_imgs.append(pix2np(test_img.getPixmap()))
test_imgs.append(pix2np(test_img.getPixmap(colorspace="csGRAY")))
test_imgs.append(pix2np(test_img.getPixmap(matrix = fitz.Matrix(2, 2))))
test_imgs.append(pix2np(test_img.getPixmap(matrix = fitz.Matrix(2, 2), colorspace="csGRAY")))
test_imgs.append(pix2np(test_img.getPixmap(matrix = fitz.Matrix(3, 3), colorspace="csGRAY")))
test_imgs = [rect(img, [0.07, 0.57, 0.5, 1]) for img in test_imgs] # crop to the area of interest
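The pix2np() helper used above is defined only in the full script at the end of this document; in a nutshell, it reinterprets the Pixmap's raw byte buffer as a NumPy array. A self-contained sketch of the same idea on a toy buffer (no PyMuPDF needed):

```python
import numpy as np

# A fitz Pixmap exposes its pixels as a flat bytes buffer (pix.samples)
# of pix.h * pix.w * pix.n bytes; reshaping that buffer yields an
# (h, w, n) array that cv2 and pytesseract can consume.
h, w, n = 4, 3, 1                      # toy grayscale "pixmap"
samples = bytes(range(h * w * n))      # stand-in for pix.samples
im = np.frombuffer(samples, dtype=np.uint8).reshape(h, w, n)
print(im.shape)  # (4, 3, 1)
```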
Already a very preliminary look at these outputs indicates that zooming brings a lot of improvement. A zoom with matrix = (2, 2) appears to produce the best results.
Subsequently, I tested some basic morphological transformations, namely Erosion, Dilation, and Closing.
So I have the image extracted from the pdf with these parameters:
img = pix2np(test_img.getPixmap(matrix = fitz.Matrix(3, 3), colorspace="csGRAY"))
Finally, I produced four variants of this image by employing the above-mentioned transformations, and to each of them applied the pytesseract.image_to_string(img, lang="bul")
method.
img = test_imgs[3]
imgs_transf = []
# ORIG IMG
imgs_transf.append(img)
# DILATION
kernel = np.ones((1, 1), np.uint8)
img_dil = cv2.dilate(img, kernel, iterations=1)
imgs_transf.append(img_dil)
# EROSION
kernel = np.ones((2, 2), np.uint8)
img_er = cv2.erode(img, kernel, iterations=1)
imgs_transf.append(img_er)
# CLOSING
kernel = np.ones((1, 1), np.uint8)
img_clo = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)
imgs_transf.append(img_clo)
fig, axs = plt.subplots(4, 2, figsize=(15, 20), tight_layout=True)
for img, ax_pair, title in zip(imgs_transf, axs, ["original", "dilation", "erosion", "closing"]):
    ax_pair[0].imshow(img)
    ax_pair[0].axis("off")
    ax_pair[0].set_title(title)
    txt = pytesseract.image_to_string(img, lang="bul")
    ax_pair[1].text(0, 0, txt, fontsize=12)
    ax_pair[1].axis("off")
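To build intuition for what erosion does to scanned text, here is a minimal pure-NumPy sketch of erosion with a 2x2 kernel of ones (a toy stand-in for cv2.erode; for brevity the border pixels are simply left white, whereas cv2 replicates the border). On a grayscale page with dark ink on a white background, erosion takes the neighbourhood minimum, so dark glyph strokes grow thicker:

```python
import numpy as np

def erode2x2(img):
    """Toy erosion: each pixel becomes the minimum of the 2x2 window
    anchored at it, i.e. dark (low) values expand."""
    out = np.full_like(img, 255)        # leave the border white for simplicity
    h, w = img.shape
    for y in range(h - 1):
        for x in range(w - 1):
            out[y, x] = img[y:y + 2, x:x + 2].min()
    return out

page = np.full((5, 5), 255, dtype=np.uint8)
page[2, 2] = 0                          # a single dark "ink" pixel
eroded = erode2x2(page)
print(int((eroded == 0).sum()))         # the dark pixel grew into a 2x2 block: 4
```

Dilation is the mirror image (neighbourhood maximum, so white expands and ink thins), and closing is a dilation followed by an erosion.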
All the steps above can be combined into one handy script. You just have to navigate your terminal to the script, point it to the pdf file you want to analyze, and specify the language of the pdf and the name of the output.
Once you are in the OCR repo main directory, you can run the script by copying the following line into your terminal:
$ python3 scripts/pdf-to-txt.py
You will be prompted to provide a path to a file. In this case, navigate to a file in the data
subdirectory; the path looks like this:
data/test-pdf.pdf
The next prompt will ask about the language of the text. Here you need to know the appropriate language abbreviations, or else the script will error out. In this case, the text is in English, so you type in 'eng'.
The whole code in the pdf-to-txt.py file is here:
### binding for tesseract:
import pytesseract
### computer vision:
import cv2
### computer vision relies to a substantial extent on numpy arrays
import numpy as np
### PyMuPDF is called fitz:
import fitz
### to timestamp progress messages:
from datetime import datetime
### CRUCIAL FUNCTIONS
def pix2np(pix):
    im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.h, pix.w, pix.n)
    #im = np.ascontiguousarray(im[..., [2, 1, 0]]) # rgb to bgr
    return im

def get_text(doc):
    i = 1
    pages = ""
    for page in doc: ### or you can specify: doc(start, end, step)
        pix = page.getPixmap(matrix = fitz.Matrix(2, 2), colorspace="csGRAY")
        img = pix2np(pix)
        kernel = np.ones((2, 2), np.uint8)
        img_er = cv2.erode(img, kernel, iterations=1)
        txt = pytesseract.image_to_string(img_er, lang=language) + "\n\n[end-of-page" + str(i) + "]\n\n"
        pages += txt
        i = i + 1
    return pages

inputfile = input("file for ocr: ")
try:
    doc = fitz.open(inputfile)
except Exception:
    print("reading of the file failed, have you correctly specified its relative path from here? Try again:")
    inputfile = input("file for ocr: ")
    doc = fitz.open(inputfile)
language = input("specify language of the pdf (use '+' for more languages): ")
print(datetime.now(), "ocr analysis started")
pages_str = get_text(doc)
outputfile = input("specify name of the output file: ")
### SAVE THE FILE
file = open(outputfile, "w")
file.write(pages_str)
file.close()
print(datetime.now(), "ocr analysis ended and the output file was saved.")
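The [end-of-page N] markers that get_text() appends make it easy to split the saved .txt file back into individual pages later; a small sketch using only the standard library:

```python
import re

# toy stand-in for the OCR output produced by get_text()
ocr_output = "first page\n\n[end-of-page1]\n\nsecond page\n\n[end-of-page2]\n\n"

pages = [p.strip() for p in re.split(r"\[end-of-page\d+\]", ocr_output) if p.strip()]
print(pages)  # ['first page', 'second page']
```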
To proceed further, I started a project with a virtual machine instance on Google Cloud Platform, which gives me faster computing. You can configure your virtual machine instance on GCP in many different ways. Here I describe my configuration of a machine running Ubuntu Linux and Python 3.7+.
First, you have to go to console.cloud.google.com
and create a project.
Within a project, you go to the Compute Engine
part of the platform and the section VM instances
. Here you can either return to an instance you used in the past (even if an instance is stopped, it still keeps your data on its persistent disk), or create a new one.
Region
Creating a new instance, there are dozens of different options and their combinations. The first important thing is to choose a Region
and Zone
for your machine. It basically means choosing from the places where Google has physically located its servers (I am not sure to what extent these overlap with the locations of Google data centres). For my first experiment, I set the region to europe-west3 (Frankfurt)
. However, with this setting I was not allowed to use the most powerful machines on the list. Therefore, on my second try, I chose europe-north1 (Finland)
.
Machine configuration
In the following section, you first have to choose a Machine family
: whether you want a General-purpose
or a Memory-optimized
machine. Here is their list. Taken together, the choice of Region
and Machine family
determines what you see below as available Series
and Machine type
: if you choose us-central1 (Iowa) + Memory-optimized
, the most powerful option on the list is m1-megamem-96 (96 vCPU, 1.4 TB memory)
, but if you choose europe-north1 (Finland) + Memory-optimized
, the most powerful option on the list is m1-ultramem-40 (40 vCPU, 961 GB memory).
However, it seems that an ordinary user is not allowed to use these most powerful machines. Trying to use megamem-96
, I was rejected with this warning: "Quota 'CPUS_ALL_REGIONS' exceeded. Limit: 12.0 globally."
Therefore, I turned to n1-highmem-8 (8 vCPU, 52 GB memory)
. (In a previous session, I had n1-standard-8 (8 vCPU, 30 GB memory)
; I am not sure how big a measurable difference this makes for my tasks, if any.)
Boot disk
As a next step, you have to choose a Boot disk. Since I faced some problems with upgrading the Python 3 shipped with older Linux distributions (after an upgrade, the default Python 3 remained unchanged), I opted for Ubuntu 20.04 LTS, which contains Python 3.8+ by default (and not Python 3.4 or 3.5, as older Ubuntu versions do).
The last important option is the Firewall, to allow specific network traffic from the internet. I always allow both HTTP and HTTPS traffic.
System configuration.
Once you have an instance, the most straightforward way to use it is via the SSH
(= secure shell) terminal associated with it.
First, I had to inspect my Python 3 version. Since we are using Ubuntu 20.04, it should be 3.8+:
$ python3 --version
> Python 3.8.2
Next, we have to install pip3 to be able to install packages for Python 3. However, this is not so straightforward here, because apt
is not able to locate it at first. So you have to update and upgrade apt first:
$ sudo apt update
$ sudo apt upgrade
$ sudo apt install python3-pip
Now you have only base Python, so you have to install all the important packages like numpy, pandas, matplotlib, etc.:
$ pip3 install pandas matplotlib
Crucially, we also have to install our sddk
package:
$ pip3 install sddk
Next, we can install the software crucial for our task at hand, i.e. the OCR analysis of pdf documents.
- tesseract
$ sudo apt install tesseract-ocr
- individual languages:
$ sudo apt install tesseract-ocr-bul tesseract-ocr-ces tesseract-ocr-grc
- python bindings:
$ pip3 install pytesseract
- pymupdf
$ pip3 install pymupdf
- open cv2 (opencv-python)
$ pip3 install opencv-python
To make it functional, you also need:
$ sudo apt-get install -y libsm6
$ sudo apt-get install -y libxext6
$ sudo apt-get install -y libxrender-dev
- Beautiful Soup
$ pip3 install beautifulsoup4
Now we can test whether we are actually able to call all these packages within Python 3. Open it by:
$ python3
In python, test this:
>>> import sddk, fitz, pytesseract, cv2
To quit the python console, run quit().
Having our environment fully functional, we can run any script using the tools we installed. Instead of trying to figure out how to upload scripts programmatically, we can just create a new file and edit its content using the nano
text editor (in my former effort, nano
was already installed, but this time I had to install it on my own: $ sudo apt-get install nano
).
$ nano ocr_cyr.py
Here is the script I used here:
### binding for tesseract:
import pytesseract
### computer vision:
import cv2
### computer vision relies to a substantial extent on numpy arrays
import numpy as np
### PyMuPDF is called fitz:
import fitz
### to plot pages and everything else:
from matplotlib import pyplot as plt
### to import data from sciencedata.dk
import sddk
from bs4 import BeautifulSoup
from datetime import datetime
### configure sddk session
conf = sddk.configure_session_and_url("SDAM_root", "648597@au.dk")
directory = input("you are in " + conf[1] + ", specify subdirectory: ")
language = input("specify language (use '+' for more languages): ")
resp = conf[0].get(conf[1] + directory)
soup = BeautifulSoup(resp.content, "html.parser")
filenames = []
for a in soup.find_all("a"):
    a_str = str(a.get_text())
    if ".pdf" in a_str:
        filenames.append(a_str)
print("files in the folder: ")
print(filenames)
def pix2np(pix):
    im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.h, pix.w, pix.n)
    #im = np.ascontiguousarray(im[..., [2, 1, 0]]) # rgb to bgr
    return im

def get_text(doc):
    i = 1
    pages = ""
    for page in doc: ### or you can specify: doc(start, end, step)
        pix = page.getPixmap(matrix = fitz.Matrix(2, 2), colorspace="csGRAY")
        img = pix2np(pix)
        kernel = np.ones((2, 2), np.uint8)
        img_er = cv2.erode(img, kernel, iterations=1)
        txt = pytesseract.image_to_string(img_er, lang=language) + "\n\n[end-of-page" + str(i) + "]\n\n"
        pages += txt
        i = i + 1
    return pages
for filename in filenames:
    print(datetime.now(), "started to read " + filename)
    resp = conf[0].get(conf[1] + directory + filename)
    doc = fitz.open(stream=resp.content, filetype="pdf")
    pages_str = get_text(doc)
    filepathname = "/SDAM_data/OCR/outputs/" + filename.rpartition(".")[0] + ".txt"
    conf[0].put(conf[1] + filepathname, data=pages_str.encode('utf-8'))
    print(datetime.now(), "ended ocr analysis of " + filename + " and saved it to sciencedata.dk")
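The output path derivation in the loop above uses str.rpartition to strip the extension; in isolation it behaves like this:

```python
filename = "Adams1965_LandBehindBagdad.pdf"
# rpartition(".") splits at the LAST dot, so the stem survives even if
# the name itself contains dots
stem = filename.rpartition(".")[0]
filepathname = "/SDAM_data/OCR/outputs/" + stem + ".txt"
print(filepathname)  # /SDAM_data/OCR/outputs/Adams1965_LandBehindBagdad.txt
```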
The script interactively asks you for several inputs:
- sciencedata.dk username (format '123456@au.dk')
- sciencedata.dk password
- specification of subdirectory (by default, you are in "SDAM_root")
- language of the analysis (e.g. "bul" or "bul+eng")
Subsequently, it lists all pdf files in the directory you chose and prints a message whenever it starts and ends working on a file:
> files in the folder: ['Adams1965_LandBehindBagdad.pdf', 'Cherry1991_Keos.pdf', 'Isaac1986_GreekSettlementsAncientThrace_best.pdf']
> 2020-04-26 08:53:19.048385 started to read Adams1965_LandBehindBagdad.pdf
> 2020-04-26 09:02:13.878638 ended ocr analysis of Adams1965_LandBehindBagdad.pdf and saved it to sciencedata.dk
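The pdf-filtering step of the script can be sketched on a toy HTML directory listing (requires beautifulsoup4; "html.parser" is Python's built-in parser):

```python
from bs4 import BeautifulSoup

# toy stand-in for the sciencedata.dk folder listing
html = """<html><body>
<a href="Adams1965_LandBehindBagdad.pdf">Adams1965_LandBehindBagdad.pdf</a>
<a href="notes.txt">notes.txt</a>
<a href="Cherry1991_Keos.pdf">Cherry1991_Keos.pdf</a>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
filenames = [str(a.get_text()) for a in soup.find_all("a") if ".pdf" in a.get_text()]
print(filenames)  # ['Adams1965_LandBehindBagdad.pdf', 'Cherry1991_Keos.pdf']
```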
The outputs are in SDAM_root/SDAM_data/OCR/outputs
. As we inspect them, these results are far from sufficient. It seems that we will have to tune the "morphological transformations" for each file independently.