# Tesseract OCR for Non-English Languages

### Install the necessary packages

In [None]:
!sudo apt-get install tesseract-ocr
!pip install pytesseract
!pip install textblob

### Downloading and Adding Language Packs to Tesseract OCR

In [2]:
!git clone https://github.com/tesseract-ocr/tessdata

Cloning into 'tessdata'...
remote: Enumerating objects: 769, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 769 (delta 0), reused 1 (delta 0), pack-reused 768
Receiving objects: 100% (769/769), 3.17 GiB | 19.26 MiB/s, done.
Resolving deltas: 100% (178/178), done.
Checking out files: 100% (172/172), done.


In [1]:
import os
os.environ["TESSDATA_PREFIX"] = "/content/tessdata"

In [2]:
%cd tesseract-non-english

/content/tesseract-non-english


### Import Packages

In [3]:
# import the necessary packages
from matplotlib import pyplot as plt
from textblob import TextBlob
import pytesseract
import argparse
import cv2

In [4]:
def plt_imshow(title, image):
	# convert the image from BGR to RGB color space and display it
	image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
	plt.imshow(image)
	plt.title(title)
	plt.grid(False)
	plt.show()

### Implementing Our Tesseract Script for Non-English Languages


In [5]:
# since we are using Jupyter Notebooks we can replace our argument
# parsing code with *hard coded* arguments and values
args = {
	"image": "images/german_block.png",
	"lang": "deu",
	"to": "en",
	"psm": 3
}
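
The `lang` and `psm` values above end up in a Tesseract configuration string. As a hedged sketch of how that string could be built with some validation, the hypothetical helper below (`build_tesseract_config` is an illustrative name, not a pytesseract function) joins one or more language codes with `+` and checks that the page segmentation mode falls in the 0 to 13 range that Tesseract 4 and 5 document.

```python
def build_tesseract_config(langs, psm=3):
    """Build a "-l ... --psm ..." option string for pytesseract."""
    # accept a single code ("deu") or a list of codes (["deu", "eng"])
    if isinstance(langs, str):
        langs = [langs]
    if not langs:
        raise ValueError("at least one language code is required")
    # Tesseract 4/5 document page segmentation modes 0 through 13
    if psm not in range(14):
        raise ValueError("psm must be an integer between 0 and 13")
    # multiple languages are joined with "+", e.g. "deu+eng"
    return "-l {} --psm {}".format("+".join(langs), psm)
```

With the arguments in this notebook, `build_tesseract_config(args["lang"], args["psm"])` yields `"-l deu --psm 3"`, which can be passed as the `config` argument of `pytesseract.image_to_string`.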

In [6]:
# load the input image and convert it from BGR to RGB channel
# ordering
image = cv2.imread(args["image"])
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# OCR the image, supplying the language code as the language parameter
options = "-l {} --psm {}".format(args["lang"], args["psm"])
text = pytesseract.image_to_string(rgb, config=options)

# show the original OCR'd text
print("ORIGINAL")
print("========")
print(text)
print("")

ORIGINAL
========
Erstes Kapitel

Gustav Aschenbach oder von Aschenbach, wie seit seinem fünfzigsten
Geburtstag amtlich sein Name lautete, hatte an einem
Frühlingsnachmittag des Jahres 19.., das unserem Kontinent monatelang
eine so gefahrdrohende Miene zeigte, von seiner Wohnung in der Prinz-
Regentenstraße zu München aus, allein einen weiteren Spaziergang
unternommen. Überreizt von der schwierigen und gefährlichen, eben
jetzt eine höchste Behutsamkeit, Umsicht, Eindringlichkeit und
Genauigkeit des Willens erfordernden Arbeit der Vormittagsstunden,
hatte der Schriftsteller dem Fortschwingen des produzierenden
Triebwerks in seinem Innern, jenem »motus animi continuus«, worin
nach Cicero das Wesen der Beredsamkeit besteht, auch nach der
Mittagsmahlzeit nicht Einhalt zu tun vermocht und den entlastenden
Schlummer nicht gefunden, der ihm, bei zunehmender Abnutzbarkeit
seiner Kräfte, einmal untertags so nötig war. So hatte er bald nach dem
Tee das Freie gesucht, in der Hoffnung, daß Luft und Bewegun
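
The OCR output above keeps the hard line breaks of the scanned page, so sentences are split across lines. That may be one reason a translation of this text can read awkwardly line by line. A minimal sketch of one possible preprocessing step follows, assuming that blank lines mark paragraph boundaries; `unwrap_lines` is a hypothetical helper and is not part of the original script. It does not attempt to rejoin words hyphenated at a line end.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines into paragraphs, keeping blank-line breaks."""
    # assumption: one or more blank lines separate paragraphs
    paragraphs = re.split(r"\n\s*\n", text.strip())
    unwrapped = []
    for paragraph in paragraphs:
        # collapse the lines of each paragraph into a single line
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        unwrapped.append(" ".join(lines))
    return "\n\n".join(unwrapped)
```

Calling `unwrap_lines(text)` before translation would hand the translator whole sentences rather than page-width fragments.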

In [7]:
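# translate the text into a different language
# NOTE: TextBlob.translate() was deprecated in TextBlob 0.16.0 and removed
# in later releases, so this call requires an older TextBlob version
tb = TextBlob(text)
translated = tb.translate(to=args["to"])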

# show the translated text
print("TRANSLATED")
print("==========")
print(translated)

TRANSLATED
==========
First chapter

Gustav Aschenbach or von Aschenbach, like since his fiftieth
His official name was his birthday on one
Spring afternoon of the year 19 .. that our continent for months
showed such a threatening expression, from his apartment in the
Regentenstrasse to Munich, just another walk
undertaken. Overstimulated by the difficult and dangerous, just
now the greatest care, caution, urgency and
Morning work requiring accuracy of will,
the writer had the ripple of the producing
Engine in its interior, that "motus animi continuus," in which
according to Cicero the essence of eloquence exists, even according to the
Midday meal could not curb and the exonerating
Slumber not found for him, with increasing wear and tear
of his strength once during the day was so necessary. So he had soon after
Tea sought the outdoors, in the hope that air and movement would leave him
restore and help him to have a fruitful evening
would.

It was the beginning of May and, after the wet and cold w