# Beginning

Given that:

*   OCR is not designed to detect document layouts or to distinguish between different types of text.
*   **Layout Parser** currently supports **two OCR engines**: Google Cloud Vision and Tesseract.
    *   https://layout-parser.readthedocs.io/en/latest/example/parse_ocr/index.html

To make complex document structures, such as the multi-column pages of Moody's manual, readable for a computer, we need a more targeted approach. This has led us to explore the VGG Image Annotator (VIA) from the University of Oxford.

*   VGG Image Annotator is a simple, standalone **manual annotation tool** for images.
*   VIA is an open source project based solely on HTML, JavaScript and CSS (no dependency on external libraries).

Because it runs without external dependencies, VIA is a potentially more accessible and flexible tool. With VIA, we could manually 'block' and annotate each area of interest, which could let us define the structure of complex documents and parse them more accurately.
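
As a minimal sketch of what such manual 'blocks' look like once exported, the snippet below parses a hand-written sample shaped like the VIA version 2 JSON that the notebook reads later on. The entry key, filename and coordinates are made up for illustration, and `rect_boxes` is a hypothetical helper, not part of VIA.

```python
import json

# Hand-written sample shaped like a VIA v2 JSON export.
# The key, filename and coordinates are invented for illustration.
SAMPLE_EXPORT = """
{
  "page.jpg123456": {
    "filename": "page.jpg",
    "size": 123456,
    "regions": [
      {"shape_attributes": {"name": "rect", "x": 40, "y": 30, "width": 300, "height": 500},
       "region_attributes": {}},
      {"shape_attributes": {"name": "rect", "x": 360, "y": 30, "width": 300, "height": 500},
       "region_attributes": {}},
      {"shape_attributes": {"name": "polygon", "all_points_x": [1, 2, 3], "all_points_y": [1, 2, 3]},
       "region_attributes": {}}
    ],
    "file_attributes": {}
  }
}
"""


def rect_boxes(export):
    """Return (x, y, width, height) for every rectangular region, in annotation order."""
    boxes = []
    for entry in export.values():
        for region in entry["regions"]:
            shape = region["shape_attributes"]
            if shape.get("name") != "rect":
                continue  # polygons, circles, etc. have no x/y/width/height
            boxes.append((int(shape["x"]), int(shape["y"]),
                          int(shape["width"]), int(shape["height"])))
    return boxes


boxes = rect_boxes(json.loads(SAMPLE_EXPORT))
print(boxes)
```

Filtering on the shape name is a design choice for this sketch: only rectangles carry the four fields that a simple crop needs.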




# Introduction

*   After annotating the image and identifying the rectangles, we would like to merge them **in the order of their annotation to assemble a larger image**.
    *   https://www.robots.ox.ac.uk/~vgg/software/via/
    *   Version 2 image annotator


*   The goal is to stack all the elements (blocks of text) vertically, one below another, rather than leaving them side by side. This way, **the final image will have the text parts aligned vertically**, creating a more cohesive and organized presentation.
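
One way to picture the vertical arrangement described above is the sketch below, which stacks crops of different widths into a single tall image with NumPy. It is an illustration only: `stack_vertically` is a hypothetical helper, the white padding is an assumption, and the crops are synthetic arrays rather than regions of a real page.

```python
import numpy as np


def stack_vertically(crops, pad_value=255):
    """Stack H x W x 3 uint8 crops top to bottom, padding narrower ones on the right."""
    if not crops:
        raise ValueError("no crops to stack")
    max_width = max(c.shape[1] for c in crops)
    padded = []
    for c in crops:
        # Start from a blank (white by default) canvas as wide as the widest crop
        canvas = np.full((c.shape[0], max_width, 3), pad_value, dtype=np.uint8)
        canvas[:, :c.shape[1]] = c
        padded.append(canvas)
    return np.vstack(padded)


# Synthetic black crops standing in for two annotated regions
first = np.zeros((20, 50, 3), dtype=np.uint8)
second = np.zeros((30, 80, 3), dtype=np.uint8)

page = stack_vertically([first, second])
print(page.shape)  # (50, 80, 3)
```

Padding to the widest crop is needed because `np.vstack` only accepts arrays that share the same width.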

# Code

In [None]:
#This is used later in the script to convert images to a PDF.
!pip install img2pdf



In [None]:
import json
import cv2
import os
from google.colab import files
from PIL import Image
import img2pdf

In [None]:
# Upload the original image (page.jpg) and the annotation file exported from
# VGG Image Annotator (annotation.json).

uploaded = files.upload()

Saving annotation.json to annotation.json
Saving page.jpg to page.jpg


In [None]:
# Load the annotations exported from VIA
with open('annotation.json') as f:
    annotations = json.load(f)

# Get the first image entry and its region annotations
filename = list(annotations.keys())[0]
image_annotations = annotations[filename]['regions']

# Load the original image; cv2.imread returns None (it does not raise) if the file cannot be read
original_image = cv2.imread('page.jpg', cv2.IMREAD_COLOR)
if original_image is None:
    raise FileNotFoundError("Could not read 'page.jpg'")

# Create the "sample" directory to store the cropped images
os.makedirs('sample', exist_ok=True)

In [None]:
# Crop each annotated region from the original image and save it as a separate
# .jpg file in the "sample" directory.
# Regions are processed top to bottom (sorted by their 'y' value).
for i, annotation in enumerate(sorted(image_annotations, key=lambda r: r['shape_attributes']['y'])):
    # Get the bounding box coordinates
    shape = annotation['shape_attributes']
    x = int(shape['x'])
    y = int(shape['y'])
    width = int(shape['width'])
    height = int(shape['height'])

    # Extract the region from the original image (NumPy indexing is rows first: [y, x])
    region = original_image[y:y+height, x:x+width]

    # Save the cropped image; the zero-padded index keeps the files in order when
    # they are later sorted by name (otherwise "_10" would sort before "_2")
    cv2.imwrite(f'sample/cropped_image_{i:03d}.jpg', region)

# Directory containing images
img_dir = "sample"

# Get all .jpg files in the directory
imgs = [i for i in os.listdir(img_dir) if i.endswith(".jpg")]

# Sort the images by name to ensure they're in the order you expect
imgs = sorted(imgs)

# Convert the images to a single PDF with img2pdf, one image per page.
# img2pdf.convert accepts file paths directly; the resulting PDF bytes are stored in pdf_bytes.
pdf_bytes = img2pdf.convert([os.path.join(img_dir, img) for img in imgs])

# Write the PDF bytes to a file
with open("output.pdf", "wb") as f:
    f.write(pdf_bytes)

# Download the output file
files.download("output.pdf")
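
Sorting filenames with `sorted` compares them as text, so a name containing `10` sorts before one containing `2` unless the numbers are zero-padded. The sketch below shows one possible workaround, a natural-sort key; `natural_key` is a hypothetical helper and the filenames are examples only.

```python
import re


def natural_key(name):
    """Split a filename into text and integer chunks so numbers compare numerically."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


names = ["cropped_image_10.jpg", "cropped_image_2.jpg", "cropped_image_0.jpg"]

plain_order = sorted(names)                     # text comparison
natural_order = sorted(names, key=natural_key)  # numeric comparison of the index

print(plain_order)
print(natural_order)
```

Either zero-padding the index when writing the files or sorting with such a key keeps the PDF pages in the intended order.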


In [None]:
!pip install requests
!pip install ocrspace
!pip install arabic_reshaper
!pip install python-bidi



In [None]:
import requests
import json

In [None]:
def ocr_space_file(filename, overlay=False, isTable=False, api_key='K86160585588957', language='eng'):
    """Send a local file to the OCR.space API and return the raw JSON response as a string."""
    payload = {
        'isOverlayRequired': overlay,
        'isTable': isTable,
        'apikey': api_key,
        'language': language,
    }
    with open(filename, 'rb') as f:
        r = requests.post('https://api.ocr.space/parse/image',
                          files={filename: f},
                          data=payload,
                          )
    return r.content.decode()


def ocr_space_url(url, overlay=False, isTable=False, api_key='helloworld', language='eng'):
    """Send an image URL to the OCR.space API and return the raw JSON response as a string."""
    payload = {
        'url': url,
        'isOverlayRequired': overlay,
        'isTable': isTable,
        'apikey': api_key,
        'language': language,
    }
    r = requests.post('https://api.ocr.space/parse/image',
                      data=payload,
                      )
    return r.content.decode()
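
The two functions above return the raw response text without checking whether the OCR succeeded. The sketch below shows one way such a check might look, using the `IsErroredOnProcessing`, `ErrorMessage` and `ParsedResults` fields of the OCR.space JSON response. `parsed_text_or_raise` is a hypothetical helper, and both sample responses are hand-written stand-ins, not real API output.

```python
import json


def parsed_text_or_raise(ocr_json):
    """Return the concatenated ParsedText, or raise if the response reports an error."""
    data = json.loads(ocr_json)
    if not isinstance(data, dict):
        # Some failures may come back as a bare JSON string rather than an object
        raise RuntimeError(f"Unexpected OCR response: {data!r}")
    if data.get("IsErroredOnProcessing"):
        raise RuntimeError(f"OCR failed: {data.get('ErrorMessage')}")
    results = data.get("ParsedResults") or []
    return "".join(result.get("ParsedText", "") for result in results)


# Hand-written stand-ins for API responses (illustrative only)
ok_response = json.dumps({
    "IsErroredOnProcessing": False,
    "ParsedResults": [{"ParsedText": "Name\tValue\t\r\n"}],
})
bad_response = json.dumps({
    "IsErroredOnProcessing": True,
    "ErrorMessage": ["example error"],
})

print(repr(parsed_text_or_raise(ok_response)))
try:
    parsed_text_or_raise(bad_response)
except RuntimeError as exc:
    print(exc)
```

Failing loudly here avoids writing an empty spreadsheet when the request was rejected.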


In [None]:
def extract_table_info(ocr_json):
    parsed_data = json.loads(ocr_json)
    lines = []

    if 'ParsedResults' in parsed_data:
        for result in parsed_data['ParsedResults']:
            if 'ParsedText' in result:
                # splitlines() also drops the '\r' that a '\r\n' line ending would leave behind
                lines.extend(result['ParsedText'].splitlines())

    # Now lines contain the text, line by line, which you can process further for your specific needs.
    return lines

# Example usage
if __name__ == "__main__":
    # Use the PDF produced earlier in the notebook (no 'output.jpg' is created above)
    ocr_result_file = ocr_space_file(filename='output.pdf', isTable=True)
    lines = extract_table_info(ocr_result_file)

    for line in lines:
        print(line)

In [None]:
!pip install pandas openpyxl



In [None]:
import pandas as pd

def write_to_excel(lines, filename='output.xlsx'):
    # Assuming each line in 'lines' is a tab-separated row
    rows = [line.split('\t') for line in lines]

    df = pd.DataFrame(rows)
    df.to_excel(filename, index=False, header=False)

# Example usage
if __name__ == "__main__":
    ocr_result_file = ocr_space_file(filename='output.pdf', isTable=True)
    lines = extract_table_info(ocr_result_file)

    write_to_excel(lines)

# Conclusion

Here is the proposed workflow in plain language:

*   Step 1: Image Segmentation

First, we segment the image to separate its different elements. Specifically, we want to distinguish between text and tables.
For the text regions, we can reuse the existing OCR (Optical Character Recognition) workflow to extract the text from each segmented region.
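
If each region were labelled in VIA when it is drawn, the split between text and tables could be read straight from the annotations. The sketch below assumes a region attribute named `type` with the values `text` and `table`. That attribute is hypothetical (it would have to be defined in VIA first), as is the `group_regions_by_type` helper, and the sample regions are invented.

```python
from collections import defaultdict


def group_regions_by_type(regions, attribute="type", default="text"):
    """Group shape_attributes by a (hypothetical) region attribute; unlabeled regions use the default."""
    groups = defaultdict(list)
    for region in regions:
        label = region.get("region_attributes", {}).get(attribute) or default
        groups[label].append(region["shape_attributes"])
    return dict(groups)


# Invented regions in the shape of VIA annotations
regions = [
    {"shape_attributes": {"name": "rect", "x": 10, "y": 10, "width": 200, "height": 80},
     "region_attributes": {"type": "text"}},
    {"shape_attributes": {"name": "rect", "x": 10, "y": 100, "width": 400, "height": 150},
     "region_attributes": {"type": "table"}},
    {"shape_attributes": {"name": "rect", "x": 10, "y": 260, "width": 200, "height": 60},
     "region_attributes": {}},
]

groups = group_regions_by_type(regions)
print({label: len(shapes) for label, shapes in groups.items()})
```

Text regions could then go through the plain OCR call and table regions through the call with `isTable=True`.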

*   Step 2: Text Processing

After obtaining the text with our preferred OCR engine, we may need to make some manual corrections. This step involves reviewing the OCR results and fixing any inaccuracies or errors introduced during the OCR process, so that the extracted text is accurate.

*   Step 3: Table Processing

For the table part of the image, we'll also use the OCR Workflow to extract the content. However, tables can be more complex than plain text, and OCR might not perfectly capture all the data.
To address this, we could perform manual correction on the extracted table data. This step involves carefully reviewing the table content, cross-referencing it with the original image, and making necessary adjustments to ensure the table is accurately represented.
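
To narrow down where manual correction of a table is likely to be needed, one option is to flag rows whose number of cells differs from the most common count. The sketch below assumes tab-separated lines like those handled by `write_to_excel`; `flag_ragged_rows` is a hypothetical helper and the figures in the sample are made up.

```python
from collections import Counter


def flag_ragged_rows(lines, sep="\t"):
    """Return (expected_cell_count, [(line_index, cells), ...]) for rows that deviate."""
    rows = [(i, line.rstrip("\r\n").split(sep))
            for i, line in enumerate(lines) if line.strip()]
    if not rows:
        return 0, []
    # Treat the most frequent cell count as the expected table width
    expected = Counter(len(cells) for _, cells in rows).most_common(1)[0][0]
    return expected, [(i, cells) for i, cells in rows if len(cells) != expected]


# Made-up OCR output for illustration
sample_lines = [
    "Year\tRevenue\tNet income",
    "1920\t1,200\t300",
    "1921\t1,350",
    "",
    "1922\t1,500\t410",
]

expected, suspects = flag_ragged_rows(sample_lines)
print(expected, suspects)
```

A flagged row is only a candidate for review; a row with the expected number of cells can still contain misread characters.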

*   Overall

The end result will be a comprehensive and accurate extraction of both the text and the table elements from the image. The text will be processed using the OCR workflow (AWS, I believe), with manual corrections where needed, and the tables will go through the same process. Following these three steps should give us a segmented image with its textual and tabular content extracted.
