# OCR For Text Extraction
---
## 1. Installing Packages & Importing Modules
The packages required for text extraction are:
- **TensorFlow**
: Deep learning framework used to build the text detection and text recognition models
- **PyPDF2**
: PDF document handler, capable of retrieving text and metadata from PDFs
- **DocTR**
: OCR library, capable of parsing textual information from documents or images

In [1]:
import sys
!{sys.executable} -m pip install tensorflow
!{sys.executable} -m pip install PyPDF2
!{sys.executable} -m pip install python-doctr
!{sys.executable} -m pip install python-doctr[tf]
!{sys.executable} -m pip install python-doctr[torch]
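
To confirm that the installs succeeded before moving on, the installed versions can be queried from the standard library. This is an optional sketch: `installed_versions` is a hypothetical helper name, and the distribution names are the ones passed to pip above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment
            versions[name] = None
    return versions


print(installed_versions(["tensorflow", "PyPDF2", "python-doctr"]))
```

Any entry reported as `None` means the corresponding install did not complete in the current environment.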







**UserWarning** \
If a warning message appears while the modules are being imported, you can ignore it.

In [2]:
import tensorflow as tf
import PyPDF2 as pdf
from doctr.io import DocumentFile
from doctr.models import ocr_predictor


TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 



## 2. Building Models
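For OCR to work, we need to combine a text detection model and a text recognition model into a single OCR pipeline: the detection model locates the regions of text on the page, and the recognition model reads the characters in each region.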

### Text Detection Model
For the text detection model, we will use **DB ResNet50**. From the sources I have read, this model is popular among document OCR applications for text detection. It is also recommended by DocTR over the other detection models. Further research is needed to understand why it is preferred.

### Text Recognition Model
For the text recognition model, we will use **CRNN with a VGG-16 backbone**. Like DB ResNet50, this model is popular among OCR applications, in this case for text recognition. It is also recommended by DocTR over the other recognition models. Further research is needed to understand why it is preferred.

**Status:** \
Combining these two models is somewhat complicated. I have built the models in my local environment, but I have not yet worked out how to export them so that they can be passed as arguments to DocTR's `ocr_predictor`. For now, we will use the pretrained models from DocTR itself so that `ocr_predictor` does not crash. The results from the self-trained models and the pretrained models are nearly the same anyway, so this substitution should not matter much.

In [3]:
# this block of code will be updated as soon as my self-trained models work properly
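
One possible way to plug self-trained weights into `ocr_predictor` is sketched below. It is untested in this notebook and rests on several assumptions: that the installed DocTR version accepts model instances for `det_arch` and `reco_arch`, that the TensorFlow backend is active (under PyTorch, `load_state_dict` would take the place of `load_weights`), and that the weights were trained with the same architectures and vocabulary. `build_custom_predictor` and both path arguments are hypothetical.

```python
def build_custom_predictor(det_weights_path, reco_weights_path):
    """Sketch: build an OCR predictor from locally trained weights.

    Assumes a DocTR version with the TensorFlow backend in which
    ocr_predictor accepts model instances. Both paths are hypothetical.
    """
    # Imported here so the function can be defined before DocTR is installed
    from doctr.models import crnn_vgg16_bn, db_resnet50, ocr_predictor

    # Build the same architectures used in this notebook, without
    # downloading DocTR's pretrained weights
    det_model = db_resnet50(pretrained=False, pretrained_backbone=False)
    reco_model = crnn_vgg16_bn(pretrained=False, pretrained_backbone=False)

    # Assumption: weights were saved in a format Keras load_weights can read
    det_model.load_weights(det_weights_path)
    reco_model.load_weights(reco_weights_path)

    # Pass the model instances instead of architecture names
    return ocr_predictor(det_arch=det_model, reco_arch=reco_model, pretrained=False)
```

If this works in your environment, the returned predictor could replace the `ocr_predictor(...)` call that uses the pretrained models.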

## 3. Extract Text using Trained OCR Models with DocTR
To extract text with DocTR, we use its built-in `ocr_predictor` function. This function accepts either self-trained or pretrained models for text detection and text recognition. The predictor returns a document object with a nested structure (Page, Block, Line, Word, Artefact).

In [4]:
def extract_with_ocr(file):
    # pretrained model
    model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)

    # reading files
    document = DocumentFile.from_pdf(file)

    # analyze
    result = model(document)

    # export to json
    output = result.export()

    # grouping detected words
    separated_words = []
    for page in output["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    separated_words.append(word["value"])
    
    # combining separated words into sentences
    output = " ".join(separated_words)
    
    return output
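
Joining every word with a single space discards the line structure that DocTR detected. If line breaks matter, a variant like the sketch below keeps one output line per detected line. `words_by_line` and `sample_export` are hypothetical names, and the sample dictionary is hand-written to mimic only the keys used above (`pages`, `blocks`, `lines`, `words`, `value`); it is not a real DocTR export.

```python
def words_by_line(exported):
    """Join words with spaces inside each line, and lines with newlines."""
    lines_out = []
    for page in exported["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                words = [word["value"] for word in line["words"]]
                if words:  # skip lines that contain no words
                    lines_out.append(" ".join(words))
    return "\n".join(lines_out)


# Hand-written stand-in for the dictionary returned by result.export()
sample_export = {
    "pages": [{"blocks": [{"lines": [
        {"words": [{"value": "Work"}, {"value": "Experience"}]},
        {"words": [{"value": "Boutique"}, {"value": "Facilitator"}]},
    ]}]}]
}

print(words_by_line(sample_export))
```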

The function below checks whether the PDF file is a scanned PDF (image based) or a digital PDF (text based), looking at the first page only. If PyPDF2 extracts 100 characters or fewer from that page, the file is treated as a scanned PDF and the text is extracted with OCR. If it extracts more than 100 characters, the file is treated as a digital PDF and the text is taken directly from PyPDF2 for better accuracy.

This function returns the extracted text as a string: for a digital PDF, the cleaned text of the first page; for a scanned PDF, the OCR output for the whole file.
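
The threshold check described above can be isolated into a small helper so that it is easy to tune and test. The sketch below is one possible variant, not the notebook's implementation: `looks_digital` is a hypothetical name, it takes the page texts as plain strings (for example, collected by calling `extract_text()` on each page in `reader.pages`), and it counts characters across all pages rather than only the first one.

```python
def looks_digital(page_texts, min_chars=100):
    """Return True if the pages contain more than min_chars extracted characters."""
    # Treat None as empty and ignore surrounding whitespace on each page
    total = sum(len((text or "").strip()) for text in page_texts)
    return total > min_chars


print(looks_digital(["x" * 150]))        # one text-rich page
print(looks_digital(["", "   ", None]))  # no extractable text on any page
```

Counting over all pages avoids sending a file to OCR only because its first page happens to be an image, at the cost of reading every page before deciding.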

In [5]:
import re

def text_extractor(file):
    # read file
    reader = pdf.PdfReader(file)
    page = reader.pages[0]
    
    # checks for text on the first page
    # if more than 100 chars are detected, it will be considered a digital pdf
    text = page.extract_text()
    if len(text) > 100: # tweak this number for character count if you want
        # the text layer is usable, so OCR is not needed
        
        # clean text
        clean_text = text.lower()

        # remove symbols
        clean_text = re.sub(r'[^\w\ \n]', '', clean_text)
        
        return clean_text
    else:
        return extract_with_ocr(file)
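
The cleaning step lowercases the text and removes every character that is not a word character, a space, or a newline. The small demo below applies the same regular expression to a made-up string (`clean` and `sample` are hypothetical names) to show the effect, including the loss of punctuation inside e-mail addresses and phone numbers.

```python
import re


def clean(text):
    """Lowercase the text and strip everything except word chars, spaces, newlines."""
    return re.sub(r'[^\w\ \n]', '', text.lower())


# Made-up contact details, used only to illustrate the cleaning
sample = "Jane Doe\njane.doe@example.com\n(555) 010-2030"
print(clean(sample))
# jane doe
# janedoeexamplecom
# 555 0102030
```

If addresses or phone numbers need to stay readable, the character class would have to be widened to keep symbols such as `@`, `.`, and `-`.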

## 4. Let's Test It
This cell prints the extracted text from your documents. You can try it with any PDF file you want to extract text from: just change the values of `file_name_1` and `file_name_2` to the paths of your PDF documents.

In [6]:
if __name__ == "__main__":
    # test digital pdf
    print("Digital PDF extracted text example:")
    file_name_1 = "example_digital.pdf" # select path to your pdf document in your local environment
    print(text_extractor(file_name_1))
    
    print("="*125)
    
    # test scanned pdf
    print("Scanned PDF extracted text example:")
    file_name_2 = "example_scanned.pdf" # select path to your pdf document in your local environment
    print(text_extractor(file_name_2))

Digital PDF extracted text example:
janice walton  
janicewaltongmailcom  
3492618950  
 
objective  
service minded and team focus ed boutique facilitator with 5 years of experience in a luxury retail environment 
eager to support the house of chanel with top class organizational skills and providing the highest standards of 
service in previous roles increased client facing time by over 30 won  facilitator of the year  award  
 
work experience  
boutique facilitator  
balenciaga boutique new york city ny  
2016 present  
 delivered excellent customer service based on the company values including welcoming and greeting all 
clients analyzing their needs and offering solutions  
 supported the operations division in maintaining stock order and assisting in cycle count activit y 
 opened and closed cash registers and assisted with handling cash and deposits  
 answered phone calls to ensure that all client issues are resolved promptly and professionally  
 maintained the highest profes

---
# Summary
So, this OCR notebook works as intended, and the OCR models predict the words with good accuracy. However, the models used here are DocTR's pretrained models, so good extraction results are to be expected. The self-trained models will be used once they work properly as arguments to DocTR's `ocr_predictor` function. Further research may be conducted to re-evaluate and rebuild the models in order to improve OCR accuracy.

In [None]:
# Thanks