## Optical Character Recognition using Tesseract
- References:
 - https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/
 - https://www.pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/

### Environment Preparation

In [None]:
# Install Tesseract
!apt install tesseract-ocr

In [None]:
# Install Python bindings and other related packages
!pip install pillow
!pip install pytesseract
!pip install imutils
!pip install textblob   # For non-English translation

In [None]:
# Check that the installation succeeded
!tesseract -v

### Tesseract Command Options

In [4]:
# OCR Engine Mode (Default: 3, see the help output below)
# Controls which recognition engine Tesseract uses (legacy, LSTM, or both)
!tesseract --help-oem

OCR Engine modes: (see https://github.com/tesseract-ocr/tesseract/wiki#linux)
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.


In [5]:
# Page Segmentation Mode (Default: 3; modes 6 and 7 often work well
# when the image holds a single block or a single line of text)
# Controls how Tesseract splits the page into text regions before recognition
!tesseract --help-psm

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
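
The two mode tables above translate directly into the `--oem` and `--psm` flags passed to Tesseract. As a minimal sketch, the helper below builds such a flag string and rejects values outside the ranges listed by `--help-oem` (0 to 3) and `--help-psm` (0 to 13). The function name `build_tesseract_config` is hypothetical and not part of pytesseract; the `eng` language code is only an example.

```python
def build_tesseract_config(lang: str, oem: int, psm: int) -> str:
    """Build a Tesseract flag string (hypothetical helper, not a pytesseract API)."""
    if oem not in range(0, 4):       # --help-oem lists modes 0..3
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    if psm not in range(0, 14):      # --help-psm lists modes 0..13
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"-l {lang} --oem {oem} --psm {psm}"


# Example: LSTM engine, image treated as a single text line
config_line = build_tesseract_config("eng", 1, 7)
print(config_line)
```

The resulting string is the kind of value that can be passed as the `config` argument of `pytesseract.image_to_string`.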


### Non-English Languages

In [None]:
# Clone the tessdata repo to get Tesseract's language packs
!git clone https://github.com/tesseract-ocr/tessdata

In [60]:
# Set the TESSDATA_PREFIX environment variable so Tesseract finds the cloned language packs
# An `export` inside a `!` shell command does not persist in Colab, so set it from Python
import os
os.environ['TESSDATA_PREFIX'] = '/content/tessdata'
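
Before running OCR in another language, it can help to confirm that the matching language pack is really present. The sketch below assumes the layout of the cloned `tessdata` repository, where each language is stored as `<lang>.traineddata` at the top level. The function name `language_pack_exists` is hypothetical, and the fallback path simply mirrors the value set above.

```python
import os
from pathlib import Path


def language_pack_exists(lang: str, tessdata_dir: str) -> bool:
    """Return True if <tessdata_dir>/<lang>.traineddata is an existing file."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


# Fall back to the Colab path used in this notebook if the variable is unset
tessdata_dir = os.environ.get("TESSDATA_PREFIX", "/content/tessdata")
print(language_pack_exists("swa", tessdata_dir))
```

If this prints `False`, the clone step or the `TESSDATA_PREFIX` value is the first thing to check.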

### Import Libraries

In [47]:
import cv2
import pytesseract

from matplotlib import pyplot as plt
from pytesseract import Output
from textblob import TextBlob

### Parameters

In [78]:
# Input Image
IMG_PATH = '2.png'

# Tesseract Command Options
LANG = 'swa'
OEM = 1
PSM = 3

# Minimum confidence (0-100) a detected word needs to be drawn
CONF = 80

### Apply Tesseract

In [73]:
# Read text image data
img = cv2.imread(IMG_PATH)

In [79]:
# Tesseract Command Options
config = '-l %s --oem %d --psm %d' % (LANG, OEM, PSM)

In [None]:
# Apply Tesseract (Get Text Only)
text = pytesseract.image_to_string(img, config=config)
print(text)

In [None]:
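# Translation (TextBlob may fail when given the text from the Output.DICT result)
# Note: translate() calls an online service and is deprecated in newer TextBlob releases
tb = TextBlob(text)
print(text, tb.translate(to='en'))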

In [93]:
# Apply Tesseract (Get Text / Box Coordinate / Confidence Level)
data = pytesseract.image_to_data(img, config=config, output_type=Output.DICT)
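
The dictionary returned with `Output.DICT` holds parallel lists (`text`, `conf`, `left`, `top`, `width`, `height`, and others), one entry per detected element. As a rough sketch, the helper below keeps only non-empty words above a confidence threshold. The name `confident_words` is hypothetical, and `sample` is made-up data shaped like that dictionary, not real OCR output.

```python
def confident_words(data: dict, min_conf: float) -> list:
    """Return (text, confidence, (x, y, w, h)) for each non-empty, confident word."""
    words = []
    rows = zip(data["text"], data["conf"], data["left"],
               data["top"], data["width"], data["height"])
    for text, conf, x, y, w, h in rows:
        # float() accepts ints, floats, and numeric strings such as "-1" or "42.5"
        if text.strip() and float(conf) > min_conf:
            words.append((text, float(conf), (x, y, w, h)))
    return words


# Made-up sample mimicking the structure of an Output.DICT result
sample = {
    "text": ["", "Habari", "ya", " "],
    "conf": ["-1", 96, "42.5", "95"],
    "left": [0, 10, 80, 120],
    "top": [0, 5, 5, 5],
    "width": [200, 60, 25, 5],
    "height": [50, 20, 20, 20],
}
print(confident_words(sample, 80))
```

Skipping whitespace-only entries avoids drawing boxes for the page, block, and line rows, which carry no word text.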

### Visualize Result

In [None]:
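img_vis = img.copy()
# Iterate over every detected element
for i in range(len(data['level'])):
    # conf can be an int, a float, or a numeric string depending on the version; -1 marks rows without a word
    if float(data['conf'][i]) > CONF:
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        # Draw the bounding box, then the recognized text and its confidence at the box center
        cv2.rectangle(img_vis, (x, y), (x+w, y+h), (0, 255, 0), 1)
        cv2.putText(img_vis, '%s: %s' % (data['text'][i], data['conf'][i]), (int(x+w/2), int(y+h/2)), cv2.FONT_HERSHEY_COMPLEX, 1, (255, 0, 0), 1, cv2.LINE_AA)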

plt.figure(figsize=(10, 10))
plt.imshow(cv2.cvtColor(img_vis, cv2.COLOR_BGR2RGB))