# OCR For Text Extraction
---
## 1. Installing Packages & Importing Modules
The following packages are required for text extraction:
- **TensorFlow**
: Backend used to build the text detection and text recognition models
- **PyPDF2**
: PDF document handler, capable of retrieving text and metadata from PDFs
- **DocTR**
: OCR logic handler, capable of parsing textual information from documents or images

In [3]:
# Main Libs
!pip install tensorflow
!pip install PyPDF2
!pip install python-doctr

# Supporting Libs
!pip install python-doctr[tf]
!pip install python-doctr[torch]
!pip install rapidfuzz==2.15.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rapidfuzz==2.15.1
  Downloading rapidfuzz-2.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 25.5 MB/s eta 0:00:00
Installing collected packages: rapidfuzz
  Attempting uninstall: rapidfuzz
    Found existing installation: rapidfuzz 3.1.1
    Uninstalling rapidfuzz-3.

**UserWarning** \
If a warning message appears, it can be ignored.
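
Before running the imports, it can help to confirm that the installed packages are importable. The snippet below is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the module names are the import names used in this notebook (`doctr` is the import name of the `python-doctr` package).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# import names of the packages installed above
required = ["tensorflow", "PyPDF2", "doctr"]
print("Missing modules:", missing_modules(required) or "none")
```

If the list is not empty, re-running the install cell (and restarting the runtime if needed) is usually the first thing to try.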

In [1]:
import tensorflow as tf
import PyPDF2 as pdf
from doctr.io import DocumentFile
from doctr.models import ocr_predictor


TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 



## 2. Building Models
For OCR to work, we need to combine a text detection model and a text recognition model into a single pipeline: the detection model locates the text regions, and the recognition model reads the characters in each region.
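
The two-stage idea can be sketched with toy stand-ins. This is only an illustration of how detection feeds recognition, not DocTR code: `run_ocr_pipeline`, `toy_detect`, `toy_recognize`, and the box format are all hypothetical names chosen for this example.

```python
from typing import Callable, List, Tuple

# hypothetical box format: (x_min, y_min, x_max, y_max)
Box = Tuple[int, int, int, int]


def run_ocr_pipeline(page, detect: Callable, recognize: Callable) -> List[str]:
    """Stage 1 finds text regions, stage 2 reads the text inside each region."""
    words = []
    for box in detect(page):
        words.append(recognize(page, box))
    return words


# Toy "page": a dict mapping each box to the text it contains.
toy_page = {(0, 0, 50, 10): "Hello", (60, 0, 110, 10): "world"}


def toy_detect(page) -> List[Box]:
    # sort boxes so the words come out left to right
    return sorted(page.keys())


def toy_recognize(page, box: Box) -> str:
    # a real recognizer would read pixels; here we just look the text up
    return page[box]


print(run_ocr_pipeline(toy_page, toy_detect, toy_recognize))
```

In the real pipeline, the detection and recognition steps are neural networks, and DocTR wires them together for us.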

### Text Detection Model
For the text detection model, we will use **DB ResNet50**. From the sources that I have read, this model is very popular among document OCR apps for text detection. DocTR also recommends it over the other models. Further research may be needed to understand why it is preferred.

### Text Recognition Model
For the text recognition model, we will use **CRNN with a VGG-16 backbone**. Like DB ResNet50, this model is very popular among OCR apps for text recognition. DocTR also recommends it over the other models. Further research may be needed to understand why it is preferred.

**Status:** \
Combining these two models is somewhat complicated. I have built the models in my local environment, but I have not yet worked out how to export them so they can be passed as arguments to DocTR's ocr_predictor. For now, we will use the pretrained models from DocTR itself so that ocr_predictor doesn't crash. The results from the self-trained and pretrained models are nearly the same anyway, so this should not matter much.

In [2]:
# this block of code will be updated as soon as my self-trained models work properly

## 3. Extract Text using Trained OCR Models with DocTR
To extract text with DocTR, we use its built-in ocr_predictor function. This function allows us to use self-trained or pretrained models for text detection and text recognition. Calling the resulting predictor on a document returns a document object with a nested structure (Page, Block, Line, Word, Artefact).
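
To make the nested structure more concrete, the sketch below walks a small hand-written dict that mimics the exported result, using the same keys as the extraction function in this section (`pages`, `blocks`, `lines`, `words`, `value`). The sample data is made up for illustration, and the helper name `lines_from_export` is hypothetical. Unlike a flat join, it keeps one string per detected line, which can be useful when line breaks matter.

```python
def lines_from_export(export):
    """Return one string per detected line from a dict shaped like the export."""
    lines = []
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                words = [word["value"] for word in line.get("words", [])]
                if words:
                    lines.append(" ".join(words))
    return lines


# Hand-written stand-in for an exported result (not real OCR output).
sample_export = {
    "pages": [
        {"blocks": [
            {"lines": [
                {"words": [{"value": "Hello"}, {"value": "world"}]},
                {"words": [{"value": "from"}, {"value": "OCR"}]},
            ]}
        ]}
    ]
}

print("\n".join(lines_from_export(sample_export)))
```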

In [3]:
def extract_with_ocr(file):
    # build the OCR predictor from DocTR's pretrained detection and recognition models
    model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)

    # read the pdf file
    document = DocumentFile.from_pdf(file)

    # run text detection and recognition on the document
    result = model(document)

    # export the nested result (pages > blocks > lines > words) as a dict
    exported = result.export()

    # collect the detected words
    separated_words = []
    for page in exported["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    separated_words.append(word["value"])

    # join the separated words into a single string
    return " ".join(separated_words)

The function below checks whether the PDF file is a scanned (image-based) PDF or a digital (text-based) PDF. It reads the first page with PyPDF2. If more than 100 characters are found, the file is considered a digital PDF, and the text of that page is extracted directly with PyPDF2 for better accuracy. Otherwise, the file is considered a scanned PDF, and the text is extracted with OCR.

This function returns the extracted text as a string.

In [4]:
def text_extractor(file):
    # read file
    reader = pdf.PdfReader(file)
    page = reader.pages[0]

    # check the first page for embedded text
    text = page.extract_text()

    # if more than 100 chars are detected, it is considered a digital pdf
    if len(text) > 100: # tweak this number for character count if you want
        # return the text extracted directly from the first page
        return text
    else:
        # otherwise it is considered a scanned pdf, so fall back to OCR
        return extract_with_ocr(file)
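
The decision rule used above can be isolated into a small pure function, which makes the threshold easy to test without a PDF file. This is a minimal sketch: the names `looks_digital` and `choose_extractor` and the `min_chars` parameter are hypothetical, and the default of 100 simply mirrors the threshold used in text_extractor.

```python
def looks_digital(first_page_text, min_chars=100):
    """Return True when the first page has more than min_chars characters of text."""
    text = first_page_text or ""  # treat a missing text layer as empty
    return len(text) > min_chars


def choose_extractor(first_page_text, min_chars=100):
    """Name the extraction path that the threshold rule would pick."""
    return "direct" if looks_digital(first_page_text, min_chars) else "ocr"


print(choose_extractor("x" * 500))  # long text layer
print(choose_extractor(""))         # no text layer
```

Lowering `min_chars` sends more files down the direct path. Raising it sends more files to OCR, which is slower but does not depend on an embedded text layer.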

## 4. Let's Test It
The cell below prints the extracted text from your document. You can play around with any PDF file you want to extract text from. Just change the file_name_2 value to the path of your PDF document.

In [6]:
if __name__ == "__main__":
    # test scanned pdf
    print("Scanned PDF extracted text example:")
    file_name_2 = "2.pdf" # select path to your pdf document in your local environment
    print(extract_with_ocr(file_name_2))

Scanned PDF extracted text example:
MUHAMMAD RAZAN FAWWAZ +6285397946743 m razan@mhs. unsyiah. ac.id I htps/linkedin.comiin/razanfawwaz htins/rmzanfawwazxy I https:/lgithub. com/razanfawwaz Banda Aceh EDUCATION Informatics Kuala. UNIVERSITAS SYIAH KUALA Sep 2020 - Sep 2024 (Expected) The Most Outstanding Student III from the Faculty of Natural Sciences and Mathematics Universitas Syiah Lab assistant of Web Programming odd semester class. Teach 49 students basic Web Programming including Laravel, TailwindCSS, Github, Conventional Commit, and Deployment. BANGKIT ACADEMY LED BY GOOGLE, GOTO, TRAVELOKA Feb 2023 -Jul 2023 Cloud Computing Actively engaged in discussions and provided assistance on various platforms, including meet sessions and Contributed to the development of a backend system for the capstone project, utilizing technologies such as Demonstrated leadership skills by leading the capstone team, coordinating efforts, and ensuring successful Shared knowledge and insights about Cl

---
# Summary
So, this OCR notebook works just fine, and the OCR models predict the words with good accuracy. However, the models used here are DocTR's pretrained models, so good text extraction results are to be expected. The self-trained models will be used once they work properly with the arguments of DocTR's ocr_predictor function. Further research may be conducted to re-evaluate and remodel the OCR models in order to improve their accuracy.

In [None]:
# Thanks

In [17]:
# test scanned pdf
print("Scanned PDF extracted text example:")
file_name_2 = "temp.pdf" # select path to your pdf document in your local environment
output = text_extractor(file_name_2)

Scanned PDF extracted text example:


In [18]:
output

'Muhammad Alfian Pratama Linkedin I 1+62-855-2078-1007 I e alfianp613.github.io I M allanp61Begmal.com o GitHub I\'m a 6th semester student curious and interested in Data Science and Machine Learning. I possess advanced proficiency in Python and R programming languages, and I have honed my skills in prominent frameworks such as TensorFlow and Flask. I am currently seeking an opportunity to expand and apply my skills through a one semester industry placement, with a particular focus on data-related roles. I am eager to delve deeper into the practical aspects of the field and gain invaluable real-world experience. Skills Education Python IR I HTML I CSS I Javascript I Tableau I Microsoft Excel I Flask I Tensorflow I SPSS I Minitab I MySQL I NoSQL I Firebase Machine Learning I Data Science I Data Analytics I Statistics I Microservices I Backend I English Machine Learning Learning Path Student with Ahead of Schedule Status Specialization Bachelor of Data Science Major in Data Science Techn

In [21]:
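from fpdf import FPDF

# create the FPDF object
# (named pdf_doc so it does not shadow the PyPDF2 alias `pdf` used by text_extractor)
pdf_doc = FPDF()
# add a page
pdf_doc.add_page()

# set style and size of font
# that you want in the pdf
pdf_doc.set_font("Arial", size = 12)

# write the text with multi_cell, which wraps long text across lines
# and supports justified ('J') alignment
pdf_doc.multi_cell(0, 10, txt = output, align = 'J')

# save the pdf as temp.pdf
pdf_doc.output("temp.pdf")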

''