# OCR Experiments

## Problem

Given images containing text, we need to write code that can extract data from them.

### Images

All the images are in the `img` folder. Each contains tabular or plain-text data. Some images are good quality (i.e. higher resolution) while others are not. We need a strategy for identifying which pre-processing steps an image needs before it can be parsed and converted into text.

### Libraries Used
We use OpenCV to load and preprocess the images, and the Tesseract library (through the `pytesseract` wrapper) to parse text from them.

In [None]:
import cv2
from matplotlib import pyplot as plt
import numpy as np

## Preprocessing

Almost all of our images need to be preprocessed in one or more ways, because they deviate from an ideal OCR input in several independent respects. To name a few:

- Resolution is not very high
- Use of Indian languages
- Tables
- Highly inconsistent fonts, weights, colors and sizes of text

Some things to try are:
- Strip the vertical and horizontal lines from images
- Sharpen images
- Apply thresholding or blurs
- Use grayscale (remove the color factor as long as we don't care about colors)

So far there hasn't been a use case that depends on color, so let's get rid of it. Let's load the image we want to work with and create a grayscale version of it.

In [None]:
# cv2.imread returns the image in BGR channel order (or None if the path is wrong)
image = cv2.imread('./img/raj.jpeg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

plt.imshow(gray, cmap='Greys_r')
plt.show()

In [None]:
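# Otsu picks the threshold automatically; BINARY_INV turns dark pixels (typically the text) white
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
plt.imshow(thresh, cmap='gray')
plt.show()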

In [None]:
def make_gray(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

def make_thresh(image):
    gray = make_gray(image)
    return cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]


Let's use the thresholded image as the base for detecting the vertical and horizontal lines, and then remove them from the image.

In [None]:
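def run_kernel(img, kernel, iterations=2):
    # Threshold the image passed in instead of relying on the global `thresh`
    thresh = make_thresh(img)
    # Opening keeps only shapes the kernel fits inside, i.e. long straight lines
    detected_lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=iterations)
    cnts = cv2.findContours(detected_lines, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # findContours returns 2 values in OpenCV 2.x/4.x and 3 values in OpenCV 3.x
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    for c in cnts:
        # Paint over each detected line in white, modifying img in place
        cv2.drawContours(img, [c], -1, (255, 255, 255), 2)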

# Remove horizontal
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (30, 1))
run_kernel(image, h_kernel)

# Remove vertical
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 30))
run_kernel(image, v_kernel)

# Threshold the cleaned image and display that result
img = make_thresh(image)
plt.imshow(img, cmap='gray')
plt.show()

In [None]:
import pytesseract

In [None]:
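# --oem 3: default OCR engine mode; --psm 6: assume a single uniform block of text
custom_config = r'--oem 3 --psm 6'
# pytesseract assumes RGB channel order, while OpenCV loads images as BGR
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
d = pytesseract.image_to_string(rgb, config=custom_config)
print(d)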