# Use OCR to get text from an image (and images from text?)

In [1]:
import pytesseract
import tempfile
import cv2
from pathlib import Path

OCR is easy to try, but hard to get right. Sometimes it just works; other times you need to pre-process the images or adjust the default segmentation settings. At the very least, you probably want to convert colour images to greyscale.

In this example we'll load [the image](../images/nla.obj-62330748-1.jpg) into OpenCV (that's the `cv2` prefix), and convert it to greyscale. Then we'll feed it to Tesseract to do the OCR. Because this is a poster, rather than a normal page of text, I've set the page segmentation mode (`psm`) to 4, which tells Tesseract to assume a single column of text of variable sizes.

In [3]:
img = cv2.imread('7539064-p4.jpg')
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
text = pytesseract.image_to_string(grey, config='--psm 4')
print(text)

aC

BOX 2343, G.P.0. PREMIER'S DEPARTMENT
ADELAIDE, 8.A, $001

225 40811 F ADELAIDE, SOUTH AUSTRALIA
reves RAST b
Ww rerQeWkeA be Quore
REF, AA,

24 peo 1976

———a
Dear Mr. Bllicott,

= am writing to you as Minteter in charge of the
Classification of Publications Act in order to bring to your

attention the present position in regard to eensorship within
Australia.

You will be oware that for many years the Vommonwealth
Gevernment prohibited the importation of almost all pornographic
material from overseas, but that more recently standards have
been relaxed. Hach of the States has now eome method of
controlling the sale of such material although there aro
variations in standards; be expected and indeed
the right to have local differences is one of the reasons for
the of State Governments. However, the previous
 tepomeadg ped Government and the various States had mado some

r & jewards a uniform system of classification, although
it allowed for variations of standard if so desired by a


## Get images of the letters

Pytesseract's `image_to_boxes()` function gives us back individual letters and their bounding boxes. We can then use these bounding boxes to crop images of the letters from the poster. Or at least that's the theory.

In [None]:
Path('letters').mkdir(parents=True, exist_ok=True)
h, w, c = img.shape
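boxes = pytesseract.image_to_boxes(grey, config='--psm 4 --oem 1')
for i, b in enumerate(boxes.splitlines()):
    # Each line holds: character, left, bottom, right, top, page
    b = b.split(' ')
    # img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
    if b[0].isalpha():
        # Tesseract measures y from the bottom of the image, but array rows
        # count from the top, so subtract the y values from the image height
        letter = img[h - int(b[4]):h - int(b[2]), int(b[1]):int(b[3])]
        # Skip zero-area boxes, which cv2.imwrite can't save
        if letter.size > 0:
            cv2.imwrite(f'letters/{b[0]}-{i}.jpg', letter)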
cv2.imwrite('test.jpg', img)

Have a look in the `letters` directory to see if it worked.
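
The coordinate handling in the loop above is easier to check in isolation. Here's a minimal sketch that parses one line of box output and flips the y values so they can be used as array slices. It assumes each line has the form `character left bottom right top page`, with the origin at the bottom-left of the image. The function name `box_to_crop` and the sample values are made up for illustration.

```python
def box_to_crop(box_line, image_height):
    """Convert a line of Tesseract box output into
    (character, row_start, row_end, col_start, col_end),
    using top-left-origin coordinates suitable for array slicing."""
    # Split from the right so the five numeric fields are always found
    char, left, bottom, right, top, _page = box_line.rsplit(' ', 5)
    # Box y values are measured up from the bottom of the image,
    # array rows are counted down from the top
    row_start = image_height - int(top)
    row_end = image_height - int(bottom)
    return char, row_start, row_end, int(left), int(right)


# Hypothetical box lines for an image 100 pixels high
sample_boxes = "D 10 80 20 95 0\ne 22 80 30 90 0"
for line in sample_boxes.splitlines():
    print(box_to_crop(line, 100))
```

The four numbers returned could then be used as `img[row_start:row_end, col_start:col_end]` to crop the letter.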

## What next?

I found the poster I'm using as an example in the `book` zone of Trove. It's part of the ephemera collection. If you'd like to play with more like this, have a look at the [harvesting ephemera](nla.obj-62330748-1.jpg) notebook.

What about experimenting with extracting text from other sources? Perhaps digitised files in the National Archives of Australia...

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).