# LLM4LLU: OCR to text / JSON analysis

Optical character recognition (OCR) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, or a scene photo. In the context of LLM4LLU, we expect many users to provide images of documents to the application in order to get help filling out forms, applying for programmes, checking requirements, or simply understanding the contents of a document. It is therefore important for our application to have a robust OCR system so that the LLM in question (GPT4) has access to all the relevant text it needs to answer user queries.

After a brief literature review, we learn that while GPT4 performs well on a variety of OCR tasks, it can still struggle with the more complex ones. Furthermore, it does not outperform existing state-of-the-art OCR models, so using GPT4 for downstream OCR tasks remains an open problem (https://arxiv.org/abs/2310.16809). However, combining GPT with a model specifically trained to extract information from documents leads to very promising results: the information reaches the LLM as text, the modality best suited to assessing context and semantics, so the LLM is not restricted by its image-handling limitations (https://arxiv.org/abs/2403.07553).

## Our Approach

Following this idea, we plan to use GPT4 in tandem with an OCR model and a second, smaller LLM that converts the OCR output into structured JSON. Keeping the cost of running our backend low is also one of the project's objectives, and delegating the OCR-to-JSON conversion to a smaller LLM reduces the number of OpenAI API calls.

We use PaddleOCR for OCR, a state-of-the-art, ultra-lightweight, pre-trained model interface for OCR tasks and data annotation (https://github.com/PaddlePaddle/PaddleOCR). For the text-to-JSON LLM, we use a quantized version of Gemma7b through the Unsloth platform (https://huggingface.co/unsloth/gemma-7b-bnb-4bit), as their version of Gemma7b is further optimized to use less VRAM and run inference faster, which helps us provide timely responses to our users.
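To give a rough sense of why 4-bit quantization matters for VRAM, the weight footprint can be estimated as parameters × bits per parameter / 8. The sketch below assumes roughly 8.5 billion parameters for Gemma 7B (an approximate figure that does not come from this notebook) and counts weights only, ignoring activations, the KV cache, and any quantization overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~8.5e9 parameters for Gemma 7B (approximate, not measured here).
GEMMA_7B_PARAMS = 8.5e9

fp16_gb = weight_memory_gb(GEMMA_7B_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(GEMMA_7B_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint, which leaves considerably more headroom on a single GPU for the prompt and generated tokens.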

This notebook walks through the process of using PaddleOCR on a NADRA document (first in English, then in Urdu), and then shows how the OCR output can be provided to Gemma7b to generate a JSON object suitable for passing to GPT4 together with the original image.
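As an overview, the two stages can be sketched as a single function that takes the OCR step and the structuring step as plain callables. This is only an illustration of the data flow: `build_form_context`, `fake_ocr`, and `fake_structurer` are hypothetical names, and the stand-in callables replace PaddleOCR and Gemma7b so that the sketch runs without any model installed.

```python
import json
from typing import Callable, List


def build_form_context(
    image,
    ocr_fn: Callable[[object], List[str]],
    structure_fn: Callable[[List[str]], str],
) -> dict:
    """Run OCR, then ask a small LLM to structure the text as JSON.

    Returns the two pieces that would be sent to GPT4 together:
    the original image and the JSON string describing the form.
    """
    texts = ocr_fn(image)            # stage 1: OCR (PaddleOCR in this notebook)
    form_json = structure_fn(texts)  # stage 2: small LLM (Gemma7b in this notebook)
    return {"image": image, "form_json": form_json}


# Hypothetical stand-ins for the real models.
def fake_ocr(image) -> List[str]:
    return ["Date of Birth", "Gender"]


def fake_structurer(lines: List[str]) -> str:
    return json.dumps({line: "" for line in lines})


context = build_form_context("form.png", fake_ocr, fake_structurer)
print(context["form_json"])
```

Passing the stages in as callables keeps the GPT4 call outside this function, so the OCR model or the structuring LLM can be swapped without touching the rest of the flow.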

## Implementation

First, we import all libraries and set the environment variables as necessary.

In [1]:
# !pip install paddlepaddle paddleocr
# !pip install --upgrade paddleocr

In [2]:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [3]:
import numpy as np
from PIL import Image

Next, we open an image of a NADRA document template from their website. The document is completely in English.

In [35]:
input_image = Image.open('nadra_d_certificate.png')
input_image_array = np.array(input_image.convert('RGB'))

We now pass the image data to PaddleOCR, set to detect English text only ("en").

In [36]:
from paddleocr import PaddleOCR, draw_ocr
from ast import literal_eval
import json

paddleocr = PaddleOCR(lang="en",ocr_version="PP-OCRv4",show_log = False,use_gpu=True)

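def paddle_scan(paddleocr, img_path_or_nparray):
    """Run OCR and return (recognized texts, raw per-line results)."""
    result = paddleocr.ocr(img_path_or_nparray, cls=True)
    result = result[0] or []                   # results for the first image; may be None if nothing is detected
    boxes = [line[0] for line in result]       # bounding boxes (unused, kept for reference)
    txts = [line[1][0] for line in result]     # raw text
    scores = [line[1][1] for line in result]   # confidence scores (unused, kept for reference)
    return txts, result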

# perform ocr scan
receipt_texts, receipt_boxes = paddle_scan(paddleocr,input_image_array)
print(50*"--","\ntext only:\n",receipt_texts)
print(50*"--","\nocr boxes:\n",receipt_boxes)

---------------------------------------------------------------------------------------------------- 
text only:
 ['APPLICATION FORM FOR DEATH CERTIFICATE', 'To', 'The Registrar of Births & Deaths, Municipality, Koraput.', 'Sub:', 'Issue of Death Certificate', 'Sir/Madam,', 'I am submitting here with the following particulars for issue of Death Certificate under section-17.', '.Copy/Copies)', 'For Office use only', 'Registration No.', 'Date of Registration :', 'Application No.', 'Search Fee', 'No.of year', 'Challan Amount:Rs.', 'Challan No:', 'Challan Date.', 'M/R Amount:Rs.', 'M/R No', 'M/R Date', 'Issue No.', 'Issue Date', 'ID No.', 'ID Type ', 'Ortps Ack. No.', 'Date:', '(FILL THE BLANKS USING CAPITAL LETTERS)', 'Date of Death*', '(DD/MM/YYYY)', 'Gender', '(Male/Female)', 'Deceased Name :', 'First Name)', '(Middle Name)', '(Last/Surname)', 'Care of*', 'O Father', 'O Husband', 'Father/Husband Name*', '(First Name)', '(Middle Name)', '(Last/Surname)', 'Deceased Age (In Years)*', 'Age 

As we can see, the OCR has detected the text with high accuracy, with only minor slips such as the dropped opening parenthesis in 'First Name)'. Now we will set up Gemma7b and construct a prompt that gives it the context of our task and asks it to generate a structured JSON object based on the text provided and whatever meaning it can infer from it.

In [1]:
import torch
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

In [7]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 8192 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

In [21]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-7b-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth: Fast Gemma patching release 2024.4
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


In [16]:
EOS_TOKEN = tokenizer.eos_token

In [17]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [37]:
ocr_text = str(receipt_texts)  # named ocr_text to avoid shadowing the built-in input()

In [38]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "This is the output I got after performing OCR on an image of a form/document. I would like to see if you can understand the different parts of the form and properly GENERATE a structure and convert this into a JSON object", # instruction
        ocr_text, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1000)

<bos>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
This is the output I got after performing OCR on an image of a form/document. I would like to see if you can understand the different parts of the form and properly GENERATE a structure and convert this into a JSON object

### Input:
['APPLICATION FORM FOR DEATH CERTIFICATE', 'To', 'The Registrar of Births & Deaths, Municipality, Koraput.', 'Sub:', 'Issue of Death Certificate', 'Sir/Madam,', 'I am submitting here with the following particulars for issue of Death Certificate under section-17.', '.Copy/Copies)', 'For Office use only', 'Registration No.', 'Date of Registration :', 'Application No.', 'Search Fee', 'No.of year', 'Challan Amount:Rs.', 'Challan No:', 'Challan Date.', 'M/R Amount:Rs.', 'M/R No', 'M/R Date', 'Issue No.', 'Issue Date', 'ID No.', 'ID Type ', 'Ortps Ack. No.', 'Date:', '(FILL THE BLANKS

As we can see, Gemma has appropriately identified the variable names for each of the fields in the form and assigned them empty strings. This helps indicate to GPT4 which parts of the form still need to be filled in. Gemma7b has also identified the hierarchical structure of the form and represented it in the JSON output through nested objects, giving GPT4 further context for understanding the document.
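To use the generated JSON downstream, it first has to be separated from the echoed prompt and parsed. The sketch below is one possible way to do this, assuming the decoded generation is available as a string and that the JSON object follows the `### Response:` marker of the prompt template. The helper names and the sample string are hypothetical; the sample is not the model's actual output.

```python
import json

RESPONSE_MARKER = "### Response:"


def extract_json(generated_text: str):
    """Parse the first JSON object found after the response marker.

    Returns None if the marker is missing or no valid JSON object follows it.
    """
    _, _, response = generated_text.partition(RESPONSE_MARKER)
    start = response.find("{")
    if start == -1:
        return None
    try:
        # raw_decode stops at the end of the object, so trailing text
        # (for example an end-of-sequence token) is ignored.
        obj, _ = json.JSONDecoder().raw_decode(response[start:])
    except json.JSONDecodeError:
        return None
    return obj


def empty_fields(node, path=""):
    """Yield dotted paths of every leaf whose value is an empty string."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from empty_fields(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from empty_fields(value, f"{path}[{i}]")
    elif node == "":
        yield path


# Hypothetical generation, for illustration only.
sample = (
    "### Response:\n"
    '{"title": "APPLICATION FORM FOR DEATH CERTIFICATE", '
    '"deceased": {"date_of_death": "", "gender": ""}}<eos>'
)
form = extract_json(sample)
print(list(empty_fields(form)))
```

The list of empty paths is one way to make explicit which fields of the form are still unfilled, alongside the nested structure itself.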

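We try reading another NADRA form through OCR in English. The English labels are again recognized well, although a handful of short spurious tokens (for example 'Aige!', 'Jiwsis' and 'Sni.') appear in the output, most likely where the form contains Urdu text that the English model cannot read.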

In [44]:
input_image = Image.open('nadra_b_form.png')
input_image_array = np.array(input_image.convert('RGB'))

paddleocr = PaddleOCR(lang="en",ocr_version="PP-OCRv4",show_log = False,use_gpu=True)
receipt_texts, receipt_boxes = paddle_scan(paddleocr,input_image_array)
print(50*"--","\ntext only:\n",receipt_texts)
print(50*"--","\nocr boxes:\n",receipt_boxes)

---------------------------------------------------------------------------------------------------- 
text only:
 ['www.wcb.gov.pk', 'THIS FORM IS FOR OFFICE RECORD ONLY AND WILL NOT BE USED AS BIRTH REGISTRATION CERTIFICATE', 'NADRA', "Applicant's Name", "Applicant's CNIC No", '36', "Child's Name", 'Relation', 'Gender', 'Religion', "Father's Name", "Father's CNIC No", "Mother's Name", "Mother's CNIC No", 'Distt./Cantt Area of Birth', '1', 'J', 'Aige!', 'Date of Birth', 'Vaccinated', 'Yes', 'No', 'wit', 'Disability', 'Sni.', 'Address', 'District', 'Jiwsis', 'H', 'CBRC NO.ISSUED', 'Form is also available on www.wcb.gov.pk']
---------------------------------------------------------------------------------------------------- 
ocr boxes:
 [[[[304.0, 13.0], [369.0, 13.0], [369.0, 24.0], [304.0, 24.0]], ('www.wcb.gov.pk', 0.9871886372566223)], [[[33.0, 36.0], [673.0, 36.0], [673.0, 49.0], [33.0, 49.0]], ('THIS FORM IS FOR OFFICE RECORD ONLY AND WILL NOT BE USED AS BIRTH REGISTRATION CERTIFIC
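Each entry in the raw result pairs a bounding box with a `(text, score)` tuple, so the recognition score can be used to drop lines the model is unsure about. The following is a minimal sketch of that idea; the function name and the 0.8 threshold are assumptions, and whether the spurious tokens in this output actually receive low scores has not been verified here.

```python
def filter_by_confidence(ocr_lines, min_score=0.8):
    """Keep the text of lines whose recognition score is at least min_score.

    ocr_lines follows the structure printed above:
    [[box_points, (text, score)], ...]
    """
    return [text for _box, (text, score) in ocr_lines if score >= min_score]


sample_lines = [
    # First entry copied from the output above.
    [[[304.0, 13.0], [369.0, 13.0], [369.0, 24.0], [304.0, 24.0]],
     ("www.wcb.gov.pk", 0.9871886372566223)],
    # Hypothetical box and score, for illustration only.
    [[[0.0, 0.0], [10.0, 0.0], [10.0, 5.0], [0.0, 5.0]],
     ("Aige!", 0.42)],
]

print(filter_by_confidence(sample_lines))
```

A suitable threshold would need to be tuned on real forms, since setting it too high can also discard correctly read labels.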

Lastly, through PaddleOCR, we try reading a NADRA form in Urdu, this time with the detection language set to Urdu ("ur").

In [48]:
input_image = Image.open('urdu_form.jpeg')
input_image_array = np.array(input_image.convert('RGB'))

paddleocr = PaddleOCR(lang="ur",ocr_version="PP-OCRv4",show_log = False,use_gpu=True)
receipt_texts, receipt_boxes = paddle_scan(paddleocr,input_image_array)
print(50*"--","\ntext only:\n",receipt_texts)
print(50*"--","\nocr boxes:\n",receipt_boxes)

---------------------------------------------------------------------------------------------------- 
text only:
 ['.p٧Aw.wcbgo', 'THIS', 'FORM', 'FOR', 'OFFICE', 'RECORD', 'ONLY', 'ND', 'WILL NOT', ' E', 'USED', 'BIRTH', 'REGISTRATION', 'CERTIFICATE', 'نشيرسجر', 'هترب زئارثويمك', 'رئارب مراف', 'تساوخرد', 'CBW', ' اا', 'AD', "Aplicant's Name", 'ماكد  تساوخرو', ' C Aliat', 'مىاوا', "Chid's Namc", 'Rclati', 'Gender', 'ب', 'Religion', 'بج ', "Fatther's Nammec.", 'ماكلو', "Fther's CNC No", 'مرداك ىىاشثاكدلاو', "Mother's Namme", 'مااكودلاو', 's CNIC NoMhcr', 'ميرداك ىاش كودلاو', 'يقداج /علض تاد', 'Dis', 'a', 'Ara of Birth', 'ر', 'لتب', 'أري ةاب', 'Date of Birh', 'ي', 'Vaccinated', ' Yes', '٥N', 'جادردأ ر', 'Disability', 'ىرورحم', 'Addres', 'District', 'رات', 'ودجو ماخرو تم', 'ىرتفد', 'رئارب', 'يرات', 'ء', 'سوك نم ', 'CBR NO', 'ISSUED', '.pk,wcb go,Fom is also available on www', 'نار لشخ ']
---------------------------------------------------------------------------------------------------- 


This time, however, the results are poor. Even though PaddleOCR is a state-of-the-art OCR interface, it struggles heavily to identify Urdu characters and to output coherent, intact sentences; with the language set to Urdu, even the English labels degrade (for example "Aplicant's Name" and "Chid's Namc"). This is probably due to a lack of Urdu training data when PaddleOCR was trained. Unfortunately, this forces us to abandon the idea of using OCR to structure Urdu text and to rely on the image-perception capabilities of GPT4 alone.
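For the image-only path, the request to GPT4 would carry the image itself and no OCR-derived JSON. The sketch below builds such a user message, assuming the OpenAI chat-completions message format for image inputs (a base64 data URL inside an `image_url` content part). The function name and prompt wording are hypothetical, and no API call is made.

```python
import base64


def build_user_message(question, image_bytes, form_json=None, mime="image/png"):
    """Build one chat message with the image and, optionally, the form JSON.

    When form_json is None (as for Urdu forms), the model has to rely on
    the image alone.
    """
    text = question
    if form_json is not None:
        text += "\n\nStructured form extracted via OCR:\n" + form_json
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# Placeholder bytes stand in for the contents of an image file.
urdu_message = build_user_message(
    "Which fields of this form still need to be filled in?",
    b"\xff\xd8\xff",  # hypothetical stand-in for JPEG data
    form_json=None,
    mime="image/jpeg",
)
print(urdu_message["content"][0]["text"])
```

The same helper covers the English path by passing the JSON string produced by Gemma7b as `form_json`, so both cases share one message shape.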