In [1]:
import os

os.chdir("..")

OCR with pytesseract
====================

links:
- https://cran.r-project.org/web/packages/tesseract/index.html
- https://tesseract-ocr.github.io/tessdoc/FAQ.html#running-tesseract
- https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html#page-segmentation-method
- https://pyimagesearch.com/2021/08/16/installing-tesseract-pytesseract-and-python-ocr-packages-on-your-system/
- https://guides.nyu.edu/tesseract/usage
- https://github.com/madmaze/pytesseract
- https://nanonets.com/blog/ocr-with-tesseract/

## Setup

### Importing the necessary libraries

In [2]:
import pytesseract

pytesseract.get_tesseract_version()

<Version('5.1.0')>

### Loading the images into Python

Each path below points to an image file that has already been appropriately
prepared for OCR.

In [3]:
simple_img = './data/alice_start-gutenberg.jpg'
fr_img = './data/fr_ocr-wikipedia.png'
kor_img = './data/kor_ocr-wikipedia.png'
toc = './data/alic_toc-gutenberg.jpg'
two_column = './data/two_column-google.png'

#### supported image formats

pytesseract can operate on any PIL Image, NumPy array, or file path of an image
that can be processed by Tesseract. Tesseract supports most image formats,
including PNG, JPEG, TIFF, BMP, and GIF.

Notably, neither pytesseract nor Tesseract works on PDF files. In order to
perform OCR on a PDF file, you must first convert it to a supported image
format. See the 'Prework' section for details on how to do this.
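
As a rough sketch of the rule above, a file can be screened by its extension before it is handed to pytesseract. The helper name `needs_conversion` and the extension set are assumptions made for illustration; the set only mirrors the formats listed in this section and is not an exhaustive list of what Tesseract reads.

```python
from pathlib import Path

# Extensions for the formats named in this section (illustrative, not exhaustive).
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif"}


def needs_conversion(path):
    """Return True if the file should be converted to an image before OCR."""
    return Path(path).suffix.lower() not in SUPPORTED_EXTENSIONS


print(needs_conversion("./data/alice_start-gutenberg.jpg"))  # False
print(needs_conversion("./data/report.pdf"))  # True (hypothetical file name)
```

A check like this only looks at the file name, so it cannot catch a mislabeled or corrupt file.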

## Usage

In order to maximize the quality of results from OCR with Tesseract, it's often
necessary to customize the behavior of the OCR through parameters. With
Tesseract, you can specify one or multiple languages you expect in the
document, which OCR engine to use, and information about the layout of the text
within the document.

By default, Tesseract uses its English training data. Tesseract detects
characters and then tries to map each detected character to its closest
neighbor. Both of these processes are greatly affected by the assumed language
of the text, so Tesseract lets you specify the language or languages for the
OCR engine to use.

Tesseract can also be configured to use different OCR 'engine modes'. This can
be very useful when working with software or on systems that don't support the
newest engines, or where computational performance is a limiting factor. In
addition, not all languages have training data for each engine mode.

Finally, Tesseract supports different assumptions about how the text is laid
out on the page. For example, it has options for images expected to contain
just a single character, a single line, multiple columns, and several others.

In addition to modifying the behavior of the OCR engine, we can configure the
format of the output, including how much information we want about the
extracted text, such as its location on the page and confidence values.

### Simplest Usage

The simplest way to get the text from an image with pytesseract is with `pytesseract.image_to_string`:

In [4]:
pytesseract.image_to_string(simple_img)

'Chapter 1\n\nDown the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the bank,\nand of having nothing to do: once or twice she had peeped into the book her\nsister was reading, but it had no pictures or conversations in it, ‘and what is\nthe use of a book,’ thought Alice ‘without pictures or conversation?’\n\nSo she was considering in her own mind (as well as she could, for the hot\nday made her feel very sleepy and stupid), whether the pleasure of making a\ndaisy-chain would be worth the trouble of getting up and picking the daisies,\nwhen suddenly a White Rabbit with pink eyes ran close by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, ‘Oh dear! Oh\ndear! I shall be late!’ (when she thought it over afterwards, it occurred to\nher that she ought to have wondered at this, but at the time it all seemed\nquite natural); but when the Rabbit actually TOOK A WATCH 

This returns just a string of all the text detected in the image.

### Language Specification

By default, Tesseract uses its English training data. This can lead to very poor results if there are non-English characters in the image, especially if the image contains text that doesn't use a Latin alphabet.

In [5]:
pytesseract.image_to_string(fr_img)

"Reconnaissance optique de caractéres\n\nLa reconnaissance optique de caractéres (ROC, ou OCR pour l'anglais optical character recognition), ou\nocérisation, désigne les procédés informatiques pour la traduction d'images de textes imprimés ou\ndactylographiés en fichiers de texte.\n\nUn ordinateur réclame pour 'exécution de cette tache un logiciel d'OCR. Celui-ci permet de récupérer le texte\ndans l'image d'un texte imprimé et de le sauvegarder dans un fichier pouvant étre exploité dans un traitement\nde texte pour enrichissement, et stocké dans une base de données ou sur un autre support exploitable par un\nsystéme informatique.\n"

In [6]:
pytesseract.image_to_string(kor_img)

'Bet xt ol}\n\n‘lvls, $2] 20) sahara\n\n‘BB BAt 24\\(Optical character recognition; OCR) Ateto| M74L} 7| A= last SLO] Bars o|o|x| A.\nMAS 85810} 7/717} AS + Qk= VAS Wetst= AOIch.\n\n0[0|4| Atos YS + We SM] Mt BSS ARE BS 7st BABS S92] BACs Helse 2zE\n\nAOSM Ubos OCRO|A}T SOY, OCRE 2ISAlSOILt 7124] Al2H(machine vision) 2] SP HOFS AlAHE| A\nch\n'

In order to use OCR on languages other than English, we need to download the language's associated training data for Tesseract. See the index for how to do so on your system.

With pytesseract we can see all the available languages with:

In [7]:
pytesseract.get_languages()

['afr',
 'amh',
 'ara',
 'asm',
 'aze',
 'aze_cyrl',
 'bel',
 'ben',
 'bod',
 'bos',
 'bre',
 'bul',
 'cat',
 'ceb',
 'ces',
 'chi_sim',
 'chi_sim_vert',
 'chi_tra',
 'chi_tra_vert',
 'chr',
 'cos',
 'cym',
 'dan',
 'deu',
 'div',
 'dzo',
 'ell',
 'eng',
 'enm',
 'epo',
 'equ',
 'est',
 'eus',
 'fao',
 'fas',
 'fil',
 'fin',
 'fra',
 'frk',
 'frm',
 'fry',
 'gla',
 'gle',
 'glg',
 'grc',
 'guj',
 'hat',
 'heb',
 'hin',
 'hrv',
 'hun',
 'hye',
 'iku',
 'ind',
 'isl',
 'ita',
 'ita_old',
 'jav',
 'jpn',
 'jpn_vert',
 'kan',
 'kat',
 'kat_old',
 'kaz',
 'khm',
 'kir',
 'kmr',
 'kor',
 'kor_vert',
 'lao',
 'lat',
 'lav',
 'lit',
 'ltz',
 'mal',
 'mar',
 'mkd',
 'mlt',
 'mon',
 'mri',
 'msa',
 'mya',
 'nep',
 'nld',
 'nor',
 'oci',
 'ori',
 'osd',
 'pan',
 'pol',
 'por',
 'pus',
 'que',
 'ron',
 'rus',
 'san',
 'sin',
 'slk',
 'slv',
 'snd',
 'snum',
 'spa',
 'spa_old',
 'sqi',
 'srp',
 'srp_latn',
 'sun',
 'swa',
 'swe',
 'syr',
 'tam',
 'tat',
 'tel',
 'tgk',
 'tha',
 'tir',
 'ton',
 'tur

To specify the language to use, pass the language code as the `lang` parameter to `pytesseract.image_to_string`:

In [8]:
pytesseract.image_to_string(kor_img, lang='kor')

'광학 문자 인식\n\n위키백과, 우리 모두의 백과사전.\n\n광학 문자 인식(20068! 08180 『600901007; 0ㄷㅠ83:은 사람이 쓰거나 기계로 인쇄한 문자의 영상을 이미지 스\n캐너로 획득하여 기계가 읽을 수 있는 문자로 변환하는 것이다.\n\n이미지 스캔으로 얻을 수 있는 문서의 활자 영상을 컴퓨터가 편집 가능한 문자코드 등의 형식으로 변환하는 소프트\n\n웨어로써 일반적으로 0ㄷ이라고 하며, 0은 인공지능이나 기계 시각(07106 1510/의 연구분야로 시작되었\n다\n'

To use multiple languages at once, join their codes with `+`:

In [9]:
pytesseract.image_to_string(kor_img, lang='kor+eng')

'광학 문자 인식\n\n위키백과, 우리 모두의 백과사전.\n\n광학 문자 인식(20068! character recognition; OCR) 사람이 쓰거나 기계로 인쇄한 문자의 영상을 이미지 스\n캐너로 획득하여 기계가 읽을 수 있는 문자로 변환하는 것이다.\n\n이미지 스캔으로 얻을 수 있는 문서의 활자 영상을 컴퓨터가 편집 가능한 문자코드 등의 형식으로 변환하는 소프트\n\n웨어로써 일반적으로 0ㄷ이라고 하며, OCRE 인공지능이나 기계 Al2H(machine 1510/의 연구분야로 시작되었\n다\n'
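
Since a language only works once its training data is installed, it can help to check the codes before building the `lang` string. The following is a minimal sketch; `build_lang` is a hypothetical helper, and the `available` list is a shortened sample standing in for the result of `pytesseract.get_languages()`.

```python
def build_lang(languages, available):
    """Join language codes with '+', checking that each one is installed."""
    missing = [code for code in languages if code not in available]
    if missing:
        raise ValueError(f"No training data installed for: {', '.join(missing)}")
    return "+".join(languages)


# In the notebook, `available` would come from pytesseract.get_languages().
available = ["eng", "fra", "kor"]  # shortened sample for illustration
lang = build_lang(["kor", "eng"], available)
print(lang)  # kor+eng
```

The result can then be passed along as `pytesseract.image_to_string(kor_img, lang=lang)`.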

### Engine Selection

Tesseract supports the following options for selecting the engine:
```
0 = Original Tesseract only.
1 = Neural nets LSTM only.
2 = Tesseract + LSTM.
3 = Default, based on what is available.
```

I recommend using option 1. The default (option 3) seems to behave like option 2.

To set the OCR engine mode (`--oem`) with pytesseract, we pass it in the `config` parameter:

In [10]:
custom_oem_psm_config = r'--oem 1'
pytesseract.image_to_string(simple_img, config=custom_oem_psm_config)

'Chapter 1\n\nDown the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the bank,\nand of having nothing to do: once or twice she had peeped into the book her\nsister was reading, but it had no pictures or conversations in it, ‘and what is\nthe use of a book,’ thought Alice ‘without pictures or conversation?’\n\nSo she was considering in her own mind (as well as she could, for the hot\nday made her feel very sleepy and stupid), whether the pleasure of making a\ndaisy-chain would be worth the trouble of getting up and picking the daisies,\nwhen suddenly a White Rabbit with pink eyes ran close by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, ‘Oh dear! Oh\ndear! I shall be late!’ (when she thought it over afterwards, it occurred to\nher that she ought to have wondered at this, but at the time it all seemed\nquite natural); but when the Rabbit actually TOOK A WATCH 

### Page Layouts

Tesseract supports a variety of common Page Segmentation Modes.
```
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
11 = Sparse text. Find as much text as possible in no particular order.
12 = Sparse text with OSD.
13 = Raw line. Treat the image as a single text line,
     bypassing hacks that are Tesseract-specific.
```

Just like with the OCR engine mode, we set the Page Segmentation Mode as part of the config string.
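
Because both settings end up in one config string, a small helper can assemble it and reject values outside the ranges listed above. This is only a sketch; `build_config` is a hypothetical name and not part of pytesseract.

```python
def build_config(oem=None, psm=None):
    """Build a Tesseract config string from an engine mode and a layout mode."""
    parts = []
    if oem is not None:
        if oem not in range(4):  # engine modes 0-3
            raise ValueError(f"oem must be between 0 and 3, got {oem}")
        parts.append(f"--oem {oem}")
    if psm is not None:
        if psm not in range(14):  # page segmentation modes 0-13
            raise ValueError(f"psm must be between 0 and 13, got {psm}")
        parts.append(f"--psm {psm}")
    return " ".join(parts)


print(build_config(oem=1, psm=3))  # --oem 1 --psm 3
```

The returned string is what gets passed as `config=` to `pytesseract.image_to_string`.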

#### Auto page segmentation

By default, Tesseract will attempt to automatically detect the text layout. If
we have prior knowledge of the layout, it's best to specify the mode that is
most appropriate.

In [11]:
custom_oem_psm_config = r'--oem 1 --psm 3'
pytesseract.image_to_string(two_column, config=custom_oem_psm_config)

'An Overview of the Tesseract OCR Engine\n\nRay Smith\nGoogle Inc.\ntheraysmith@gmail.com\n\nAbstract\n\nThe Tesseract OCR engine, as was the HP Research\nPrototype in the UNLV Fourth Annual Test of OCR\nAccuracy[1], is described in a _ comprehensive\noverview. Emphasis is placed on aspects that are novel\nor at least unusual in an OCR engine, including in\nparticular the line finding, features/classification\nmethods, and the adaptive classifier.\n\n1. Introduction — Motivation and History\n\nTesseract is an open-source OCR engine that was\ndeveloped at HP between 1984 and 1994. Like a super-\nnova, it appeared from nowhere for the 1995 UNLV\nAnnual Test of OCR Accuracy [1], shone brightly with\nits results, and then vanished back under the same\ncloak of secrecy under which it had been developed.\nNow for the first time, details of the architecture and\nalgorithms can be revealed.\n\nTesseract began as a PhD research project [2] in HP\nLabs, Bristol, and gained momentum as a possible

Notice that with the default page segmentation mode (fully automatic),
Tesseract correctly identifies that the lines of text are split between the two
columns on the page.

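If we were to use a psm that assumes a single block of text, lines from the two
columns would likely be read straight across the page and merged.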


#### Uniform block
--psm 6
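
One way to see the effect of this mode is to run the same image under several page segmentation modes and compare the text. The helper below is a sketch under stated assumptions: `compare_psm` is a hypothetical name, and the `ocr` argument exists only so a stand-in function can be swapped in where Tesseract is not installed.

```python
def compare_psm(image, modes, ocr=None, oem=1):
    """Run OCR once per page segmentation mode and collect the text by mode."""
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract.
        import pytesseract

        ocr = pytesseract.image_to_string
    return {psm: ocr(image, config=f"--oem {oem} --psm {psm}") for psm in modes}
```

In this notebook, `compare_psm(two_column, [3, 6])` would return the text for the default mode and for the uniform-block mode side by side.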

#### Tables

--psm 4


### Output Formats

```
string
boxes
data
osd
alto xml
```
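
Each of these formats corresponds to a pytesseract function. The sketch below maps the names in the list to those functions; the `run_ocr` wrapper and the `OUTPUT_FUNCTIONS` table are assumptions made for illustration, while the function names themselves come from pytesseract.

```python
# Output format name -> pytesseract function that produces it.
OUTPUT_FUNCTIONS = {
    "string": "image_to_string",
    "boxes": "image_to_boxes",
    "data": "image_to_data",
    "osd": "image_to_osd",
    "alto xml": "image_to_alto_xml",
}


def run_ocr(image, output="string", **kwargs):
    """Dispatch to the pytesseract function for the requested output format."""
    if output not in OUTPUT_FUNCTIONS:
        raise ValueError(f"Unknown output format: {output!r}")
    import pytesseract  # imported here so the table can be inspected without it

    return getattr(pytesseract, OUTPUT_FUNCTIONS[output])(image, **kwargs)
```

For example, `run_ocr(simple_img, "data", config="--oem 1")` would call `pytesseract.image_to_data` with the same config string used earlier.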

#### string

#### data


## Validation

### Confidence scores
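
This section is still a stub, so the following is only one possible approach. `pytesseract.image_to_data` with `output_type=pytesseract.Output.DICT` returns parallel lists that include `text` and `conf`, where rows that are not words carry a confidence of -1. The helper name `words_with_confidence` and the sample values are made up for illustration.

```python
def words_with_confidence(data, min_conf=0.0):
    """Pair each recognised word with its confidence, dropping non-word rows."""
    pairs = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some versions return strings, others numbers
        if text.strip() and conf >= min_conf:
            pairs.append((text, conf))
    return pairs


# Hand-written sample in the shape of image_to_data(..., output_type=Output.DICT);
# the values are invented for illustration.
sample = {
    "text": ["", "Chapter", "1", " ", "Rabbit-Hole"],
    "conf": [-1, 96.1, 91.4, -1, 58.0],
}
print(words_with_confidence(sample, min_conf=60))
# [('Chapter', 96.1), ('1', 91.4)]
```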

### Summary statistics
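
As a sketch of what a summary could look like, the word-level confidence values (the `conf` column from `pytesseract.image_to_data`) can be reduced to a few numbers per page. The function name `confidence_summary` and the threshold of 60 are assumptions chosen for illustration, not recommended values.

```python
from statistics import mean, median


def confidence_summary(confidences, low_threshold=60.0):
    """Summarise word confidences, ignoring the -1 rows that are not words."""
    scores = [float(c) for c in confidences if float(c) >= 0]
    if not scores:
        return {"count": 0}
    return {
        "count": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
        # Share of words scoring below the chosen threshold.
        "share_low": sum(s < low_threshold for s in scores) / len(scores),
    }


print(confidence_summary([-1, 90, 80, 50]))
```

A low mean or a high `share_low` is a hint that the page deserves a manual look or different OCR settings.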

### Vocabulary
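
This heading has no content yet, so the check below is one plausible reading of it: measure how many of the extracted words appear in a known word list. The function name `vocabulary_coverage` and the tiny vocabulary are hypothetical; in practice the vocabulary would come from a dictionary file for the document's language.

```python
import re


def vocabulary_coverage(text, vocabulary):
    """Return the share of word tokens found in `vocabulary` and the unknown ones."""
    # Runs of letters only; digits, punctuation and underscores split tokens.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0, []
    unknown = [token for token in tokens if token not in vocabulary]
    return 1 - len(unknown) / len(tokens), unknown


vocabulary = {"alice", "was", "beginning"}  # tiny sample for illustration
coverage, unknown = vocabulary_coverage("Alice was beginnlng", vocabulary)
print(unknown)  # ['beginnlng']
```

Words missing from the vocabulary are not always OCR errors (names and rare words will show up too), so the list is best treated as a set of candidates to review.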