# OCR For Text Extraction
---
## 1. Installing Packages & Importing Modules
The following packages are required for text extraction:
- **TensorFlow**
: Backend used to build and run the text detection and text recognition models
- **DocTR [TensorFlow]**
: OCR library used to extract textual information from documents or images, running on TensorFlow 2
- **PyPDF2**
: PDF document handler, capable of retrieving text and metadata from PDFs
- **PyFPDF**
: Library for generating PDF documents in Python

In [82]:
# Main Libs
!pip install tensorflow
!pip install python-doctr[tf]
!pip install PyPDF2
!pip install fpdf

# Downgraded Libs
!pip install rapidfuzz==2.15.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**UserWarning** \
If a warning message appears, you can ignore it.

In [83]:
import tensorflow as tf
import PyPDF2 as pypdf

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

## 2. Building Models
For OCR to work, a text detection model and a text recognition model must be combined into a single pipeline: the first locates text regions, the second reads the characters in them. DocTR provides pretrained models for both stages, as described in its documentation (https://mindee.github.io/doctr/using_doctr/using_models.html).

### Text Detection Model
For the text detection model, we will use **DB ResNet50**. According to the DocTR documentation, DB ResNet50 is among the top-performing text detection models.

### Text Recognition Model
For the text recognition model, we will use **CRNN MobileNetV3 Small**. According to the DocTR documentation, this text recognition model is well suited to a resume or CV screener. Combined with DB ResNet50, it reaches top-tier precision among the available model combinations.

**Problem:** \
Training both of these models is somewhat complicated. I tried to build the models in my local environment, but I have not yet worked out how to export them in a form that can be passed as `ocr_predictor` arguments. For now, we use the pretrained models from DocTR itself so that the `ocr_predictor` call does not crash. This might be considered rule-breaking, but at least the models work correctly. I am out of time; further research is needed to build and fine-tune self-trained models in place of the pretrained ones in future development.

In [84]:
# Did not work; kept for later reference
# from doctr.models import db_resnet50
#
# model = db_resnet50(pretrained=True)
# input_tensor = tf.random.uniform(shape=[1, 1024, 1024, 3], maxval=1, dtype=tf.float32)
#
# TEXT_DETECTION_ARCHITECTURE = model(input_tensor)

In [85]:
# Did not work; kept for later reference
# from doctr.models import crnn_mobilenet_v3_small
#
# model = crnn_mobilenet_v3_small(pretrained=True)
# input_tensor = tf.random.uniform(shape=[1, 32, 128, 3], maxval=1, dtype=tf.float32)
#
# TEXT_RECOGNITION_ARCHITECTURE = model(input_tensor)

## 3. Extract Text using Trained OCR Models with DocTR
To extract text with DocTR, we use its built-in `ocr_predictor` function. This function lets us use self-trained or pretrained models for text detection and text recognition. The predictor returns a document object with a nested structure (Page, Block, Line, Word, Artefact).

In [86]:
def extract_with_ocr(file):
    # pretrained model
    model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_mobilenet_v3_small', pretrained=True)

    # reading files
    document = DocumentFile.from_pdf(file)

    # analyze
    result = model(document)

    # export to json
    output = result.export()

    # grouping detected words
    separated_words = []
    for page in output["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    separated_words.append(word["value"])

    # combining separated words into sentences
    output = " ".join(separated_words)

    return output
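Because the exported result is nested as pages, blocks, lines, and words, the line boundaries can be kept instead of flattening everything into one string. The following is a minimal sketch under that assumption: `words_by_line` and `sample_export` are hypothetical names, and the sample dict is a hand-written stand-in that only mimics the keys used in `extract_with_ocr`, not real DocTR output.

```python
def words_by_line(export: dict) -> str:
    """Join words line by line from a dict shaped like result.export()."""
    lines = []
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                # collect the recognized text of every word on this line
                words = [word["value"] for word in line.get("words", [])]
                if words:
                    lines.append(" ".join(words))
    # one detected line per output line
    return "\n".join(lines)


# Hand-written stand-in for result.export(); a real export carries
# additional keys per element, which this helper ignores.
sample_export = {
    "pages": [{"blocks": [{"lines": [
        {"words": [{"value": "Muhammad"}, {"value": "Alfian"}]},
        {"words": [{"value": "Skills"}]},
    ]}]}]
}

print(words_by_line(sample_export))
```

Keeping one detected line per output line may make headings such as "Skills" easier to separate from body text in later processing.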

The function below checks whether the PDF file is a scanned PDF (image-based) or a digital PDF (text-based). It reads the text embedded in the first page: if that text has more than a threshold number of characters (100 here), the file is treated as a digital PDF and the text is extracted directly with PyPDF2 for better accuracy. Otherwise, the file is treated as a scanned PDF and the text is extracted with OCR.

This function returns the extracted text as a string. Note that on the digital path only the first page is read, while the OCR path processes the whole document.

In [87]:
import re

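def text_extractor(file):
    reader = pypdf.PdfReader(file)
    # only the first page is inspected and returned on the digital path
    page = reader.pages[0]
    text = page.extract_text() or ""

    # minimum number of characters for the PDF to count as text-based
    chars_count = 100
    if len(text) > chars_count:
        # keep only characters in the \x1F-\x7F range, then drop "_" and "|"
        clean_text = re.sub(r'[^\x1F-\x7F]+', '', text)
        clean_text = re.sub(r'[_|]', '', clean_text)

        return clean_text
    else:
        # too little embedded text: treat as a scanned PDF and run OCR
        return extract_with_ocr(file)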
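The scanned-versus-digital decision can also be made over all pages rather than only the first one. The sketch below is one possible variant, not the method used in this notebook: `normalize_text` and `is_digital_pdf` are hypothetical helpers, the 100-character threshold is carried over from above, and the character range is narrowed to printable ASCII as an assumption about what the cleaning step is meant to keep.

```python
import re


def normalize_text(text: str) -> str:
    """Keep printable ASCII, drop '_' and '|', and collapse whitespace."""
    text = re.sub(r"[^\x20-\x7E\s]+", "", text)  # remove non-printable / non-ASCII
    text = re.sub(r"[_|]", "", text)             # remove table-border debris
    return re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace


def is_digital_pdf(page_texts, min_chars: int = 100) -> bool:
    """Return True when the pages together hold more than min_chars characters.

    page_texts is a list of per-page strings (None is treated as empty).
    """
    total = sum(len(normalize_text(t or "")) for t in page_texts)
    return total > min_chars


print(normalize_text("Data\u2022 Science_|  \n R"))
print(is_digital_pdf(["", ""]))
```

With PyPDF2, `page_texts` could be built as `[p.extract_text() for p in reader.pages]`, so a document whose first page is only a cover image would still be recognized as text-based.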

## 4. Let's Test It
The cells below print the text extracted from your document. You can try any PDF file you like: just change the `file_name` value to the path of your PDF document.


Run the code below if you want to run it with OCR directly

In [42]:
# test scanned pdf
print("Scanned PDF extracted text example:")
file_name = "example.pdf"
output = extract_with_ocr(file_name)

output

Scanned PDF extracted text example:


'Muhammad Alfian Pratama Linkedin I +62-855-20/8-1007 I e aifianp613.aithub.io. I M altanp613@email.com OGitHub I\'m a 6th semester student curious and interested in Data Science and Machine Learning. I possess advanced proficiency in Python and Ri programming languages, and Ihave honed my skills in prominent frameworks such as TensorFlow and Flask. lam currently seeking an opportunity to expand and apply my skills through a one semester industry placement, with a particular focus on data-related roles. I am eager to delve deeper into the practical aspects of the field and gain invaluable real-world experience. Skills Education Python I R I HTML I CSS I Javascript I Tableau I Microsoft Excel I Flask I Tensorflow I SPSS I Minitab I MySQL I NoSQL I Firebase Machine Learning I Data Science I Data Analytics I Statistics I Microservices I Backend I English Machine Learning Learning Path Student with Ahead of Schedule Status Specialization Bachelor of Data Science Major in Data Science Techn

Run the code below if you want the code to decide whether the text should be extracted with PyPDF2 or DocTR

In [79]:
# test unconfirmed pdf
print("Unconfirmed PDF extracted text example:")
file_name = "example.pdf"
output = text_extractor(file_name)

output

Unconfirmed PDF extracted text example:


'Muhammad Alfian Pratama   LinkedIn     +62-855-2078 -1007      alfianp613.github.io      alfianp613@gmail.com       GitHub  Im a 6th semester student curious and interested in Data Science and Machine Learning . I possess advanced proficiency in Python and R programming languages, and I have honed my skills in prominent frameworks such as TensorFlow and Flask. I am currently seek ing an opportunity to expand and apply my skills through a one semester industry plac ement, with a particular focus on data -related roles. I am eager to delve deeper into the practical aspects of the field and gain invaluable real -world experience.   Skills          Python  R  HTML  CSS  Javascript  Tableau  Microsoft Excel  Flask  Tensorflow  SPSS  Minitab  MySQL  NoSQL  Fire base     Machine Learning  Data Science  Data Analytics  Statistics  Microservices  Backend  English  Education     Machine Learning Learning Path   Bangkit  Academy 2023 By Google, GoTo, & Traveloka  Indonesia  02/2023 - Current   M

## 5. Exporting String Output into PDF
The extracted text is written back into a PDF file, which is needed for CV summarization.

In [80]:
from fpdf import FPDF

# create the FPDF document object
pdf = FPDF()
# add a page
pdf.add_page()

# set the font family and size
pdf.set_font("Arial", size = 12)

# write the extracted text into a single cell
# (cell() does not wrap; long text continues past the page edge)
pdf.cell(200, 10, txt = output,
         ln = 1, align = 'J')

# save the PDF as temp.pdf
pdf.output("temp.pdf")

''
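Before text is handed to FPDF, it may help to sanitize and pre-wrap it. The sketch below assumes that the built-in Arial font is limited to Latin-1 characters and that a single cell does not wrap long text; `prepare_for_fpdf` is a hypothetical helper, and the width of 90 characters is an arbitrary choice rather than a value derived from the page size.

```python
import textwrap


def prepare_for_fpdf(text: str, width: int = 90) -> list:
    """Return Latin-1-safe lines of at most `width` characters."""
    # characters outside Latin-1 are replaced with '?'
    safe = text.encode("latin-1", errors="replace").decode("latin-1")
    # split the text into lines short enough to write one per cell
    return textwrap.wrap(safe, width=width)


for line in prepare_for_fpdf("na\u00efve caf\u00e9 \u2713 done", width=100):
    print(line)
```

Each returned line could then be written with its own `pdf.cell(200, 10, txt = line, ln = 1)` call, so the text stays inside the page. OCR output is the case where this is most likely to matter, because only the PyPDF2 path strips non-ASCII characters.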

In [81]:
# test temp pdf
file_name = "temp.pdf"
output = text_extractor(file_name)

output

'Muhammad Alfian Pratama   LinkedIn     +62-855-2078 -1007      alfianp613.github.io      alfianp613@gmail.com       GitHub  Im a 6th semester student curious and interested in Data Science and Machine Learning . I possess advanced proficiency in Python and R programming languages, and I have honed my skills in prominent frameworks such as TensorFlow and Flask. I am currently seek ing an opportunity to expand and apply my skills through a one semester industry plac ement, with a particular focus on data -related roles. I am eager to delve deeper into the practical aspects of the field and gain invaluable real -world experience.   Skills          Python  R  HTML  CSS  Javascript  Tableau  Microsoft Excel  Flask  Tensorflow  SPSS  Minitab  MySQL  NoSQL  Fire base     Machine Learning  Data Science  Data Analytics  Statistics  Microservices  Backend  English  Education     Machine Learning Learning Path   Bangkit  Academy 2023 By Google, GoTo, & Traveloka  Indonesia  02/2023 - Current   M

---
# Summary
This OCR notebook works as intended, and the OCR models predict words with good accuracy. However, the models used here are DocTR's pretrained models, so the good extraction quality comes from DocTR rather than from models trained in this project. Self-trained models will be used once they can be passed correctly as DocTR `ocr_predictor` arguments. Further research is needed to re-evaluate and retrain the OCR models to improve their precision.

In [None]:
# Thanks