# OCR For Text Extraction
---
## 1. Installing Packages & Importing Modules
Packages required for text extraction:
- **DocTR [TensorFlow]**
: OCR logic handler, used for extracting textual information from documents or images. Supported in TensorFlow 2
- **PyPDF2**
: PDF document handler, capable of retrieving text and metadata from PDFs
- **PyFPDF**
: PyFPDF is a library for PDF document generation in Python

In [None]:
# Main Libs
!pip install python-doctr[tf]
!pip install PyPDF2
!pip install fpdf

# Downgraded Libs
!pip install rapidfuzz==2.15.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-doctr[tf]
  Downloading python_doctr-0.6.0-py3-none-any.whl (239 kB)
Collecting importlib-metadata (from python-doctr[tf])
  Downloading importlib_metadata-6.6.0-py3-none-any.whl (22 kB)
Collecting pypdfium2<4.0.0,>=3.0.0 (from python-doctr[tf])
  Downloading pypdfium2-3.21.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
Collecting pyclipper<2.0.0,>=1.2.0 (from python-doctr[tf])
  Downloading pyclipper-1.3.0.post4-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (813 kB)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fpdf
  Downloading fpdf-1.7.2.tar.gz (39 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: fpdf
  Building wheel for fpdf (setup.py) ... done
  Created wheel for fpdf: filename=fpdf-1.7.2-py2.py3-none-any.whl size=40704 sha256=9c0fdc32f66cc082a0d5dafce950f03df3fc705d76d51ae192c9ae258f1b1976
  Stored in directory: /root/.cache/pip/wheels/f9/95/ba/f418094659025eb9611f17cbcaf2334236bf39a0c3453ea455
Successfully built fpdf
Installing collected packages: fpdf
S

**UserWarning** \
If a warning message pops up while importing, you can ignore it.

In [None]:
import PyPDF2 as pypdf
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

## 2. Building Models
For OCR to work, we need to combine a text detection model and a text recognition model into one OCR pipeline: the detection model locates the words on the page, and the recognition model reads the characters in each detected region.

### Text Detection Model
For the text detection model, we will use **DB ResNet50**. From the sources that I have read, this model is very popular among document OCR apps for text detection. DocTR also recommends it over its other detection models. Further research may be needed to understand why it is used.

### Text Recognition Model
For the text recognition model, we will use **CRNN with a VGG-16 backbone**. Like DB ResNet50, this model is very popular among OCR apps for text recognition. DocTR also recommends it over its other recognition models. Further research may be needed to understand why it is used.

**Status:** \
Combining both of these models is somewhat complicated. I have built the models in my local environment, but I still have no idea how to export them so they can be passed as DocTR ocr_predictor arguments. For now, we will use the pretrained models from DocTR itself so that nothing crashes when passing the ocr_predictor args. The results from the self-trained models and the pretrained models are nearly the same anyway, so don't worry about it.

In [None]:
# this block of code will be updated as soon as my self-trained models work properly

## 3. Extract Text using Trained OCR Models with DocTR
To extract text with DocTR, we use its built-in ocr_predictor function. This function allows us to use self-trained or pretrained models for text detection and text recognition. ocr_predictor builds the predictor, and calling that predictor on a document returns a document object with a nested structure (Page, Block, Line, Word, Artefact).

In [None]:
def extract_with_ocr(file):
    # pretrained model
    model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)

    # reading files
    document = DocumentFile.from_pdf(file)

    # analyze
    result = model(document)

    # export to a nested dict (JSON-like)
    output = result.export()

    # grouping detected words
    separated_words = []
    for page in output["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                for word in line["words"]:
                    separated_words.append(word["value"])

    # combining separated words into sentences
    output = " ".join(separated_words)

    return output
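
Joining every word with a single space throws away the line breaks that the nested structure already knows about. As a minimal sketch, the same traversal can keep one output line per detected line. The `sample_export` dict and the `join_by_line` helper are hypothetical: the dict is hand-written with made-up values and only mimics the pages/blocks/lines/words keys that the loop above reads, so no model is needed to run it.

```python
# Hypothetical sample shaped like the nested dict walked by extract_with_ocr
# (pages -> blocks -> lines -> words); the values are made up for illustration.
sample_export = {
    "pages": [
        {
            "blocks": [
                {
                    "lines": [
                        {"words": [{"value": "Muhammad"}, {"value": "Alfian"}]},
                        {"words": [{"value": "Data"}, {"value": "Science"}]},
                    ]
                }
            ]
        }
    ]
}


def join_by_line(export):
    """Join words with spaces inside a line, and lines with newlines."""
    lines = []
    for page in export["pages"]:
        for block in page["blocks"]:
            for line in block["lines"]:
                lines.append(" ".join(word["value"] for word in line["words"]))
    return "\n".join(lines)


print(join_by_line(sample_export))
```

Whether the line breaks are worth keeping depends on what consumes the text afterwards. For a plain bag of words, the single-space join used in extract_with_ocr is enough.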

The text_extractor function below checks whether the PDF file is a scanned PDF (image based) or a digital PDF (text based), looking only at the first page. If it finds no more than 100 characters there, the file is treated as a scanned PDF and the text is extracted with OCR. If it finds more than 100 characters, the file is treated as a digital PDF and the text is extracted directly with PyPDF2 for better accuracy. Note that the PyPDF2 branch returns the text of the first page only, with punctuation stripped.

This function returns the extracted text from the PDF file as a string.

In [None]:
import re

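def text_extractor(file):
    reader = pypdf.PdfReader(file)
    page = reader.pages[0]

    # minimum number of characters for the page to count as text based
    chars_count = 100

    # extract once and reuse the result
    text = page.extract_text()
    if len(text) > chars_count:
        # keep only word characters, spaces and newlines
        clean_text = re.sub(r'[^\w \n]', '', text)

        return clean_text
    else:
        # too little embedded text: treat the file as a scanned PDF
        return extract_with_ocr(file)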
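
Since text_extractor only looks at the first page, a document that mixes text pages and scanned pages is classified by that one page. A minimal sketch of a per-page version of the same threshold rule is shown here. `pages_needing_ocr` is a hypothetical helper, not part of PyPDF2 or DocTR, and it works on plain strings so it can be tried without a PDF file.

```python
def pages_needing_ocr(page_texts, min_chars=100):
    """Return the indices of pages whose extracted text is too short.

    Same rule as text_extractor: a page counts as text based only when it
    has more than `min_chars` characters.
    """
    return [i for i, text in enumerate(page_texts) if len(text) <= min_chars]


# Made-up page texts: a long page, an empty page, and two borderline pages.
page_texts = ["x" * 250, "", "y" * 100, "z" * 101]
print(pages_needing_ocr(page_texts))
```

In the notebook, the input list could be built with `[page.extract_text() for page in reader.pages]`, and only the returned page indices would then have to go through OCR.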

## 4. Let's Test It
The cells below print out the extracted text from your document. You can play around with any PDF file you want to extract text from. Just change the file_name value to your PDF document path.


Run the code below if you want to run it with OCR directly.

In [None]:
# test scanned pdf
print("Scanned PDF extracted text example:")
file_name = "Muhammad Alfian Pratama new resume.pdf" # select path to your pdf document in your local environment
output = extract_with_ocr(file_name)

output

Scanned PDF extracted text example:


'Muhammad Alfian Pratama LinkedIn I +62-855-2078-1007 I e alfianp613.github.io I M alfianp613@gmail.com OGitHub I\'m a 6th semester student curious and interested in Data Science and Machine Learning. I possess advanced proficiency in Python and R programming languages, and Ihave honed my skills in prominent frameworks such as TensorFlow and Flask. lam currently seeking an opportunity to expand and apply my skills through a one semester industry placement, with a particular focus on data-related roles. I am eager to delve deeper into the practical aspects of the field and gain invaluable real-world experience. Skills Education Python I R I HTML I CSS I Javascript I Tableau I Microsoft Excel I Flask I Tensorflow I SPSS I Minitab I MySQL I NoSQL I Firebase Machine Learning I Data Science I Data Analytics I Statistics I Microservices I Backend I English Machine Learning Learning Path Student with Ahead of Schedule Status Specialization Bachelor of Data Science Major in Data Science Techno

Run the code below if you want to let the code decide whether the text should be extracted with PyPDF2 or DocTR.

In [None]:
# test unconfirmed pdf
print("Unconfirmed PDF extracted text example:")
file_name = "Muhammad Alfian Pratama new resume.pdf" # select path to your pdf document in your local environment
output = text_extractor(file_name)

output

Unconfirmed PDF extracted text example:


'Muhammad Alfian Pratama  \n LinkedIn     628552078 1007      alfianp613githubio      alfianp613gmailcom       GitHub  \nIm a 6th semester student curious and interested in Data Science and Machine Learning  I possess advanced proficiency in Python and \nR programming languages and I have honed my skills in prominent frameworks such as TensorFlow and Flask I am currently seek ing an \nopportunity to expand and apply my skills through a one semester industry plac ement with a particular focus on data related roles I \nam eager to delve deeper into the practical aspects of the field and gain invaluable real world experience  \n \nSkills  ____________________________________________________________________________________ ________ ___  \n \n   Python  R  HTML  CSS  Javascript  Tableau  Microsoft Excel  Flask  Tensorflow  SPSS  Minitab  MySQL  NoSQL  Fire base  \n   Machine Learning  Data Science  Data Analytics  Statistics  Microservices  Backend  English  \nEducation  _________ _________
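
The missing punctuation in the PyPDF2 output above (for example `alfianp613gmailcom`) comes from the `re.sub` call in text_extractor, which drops everything that is not a word character, a space, or a newline. The sketch below reproduces that on a made-up sample string and compares it with a more lenient pattern. The lenient pattern is only an assumption about what might be worth keeping (`@`, `.`, `+`, `-`); it is not what the notebook uses.

```python
import re

# Made-up sample resembling the contact line of the resume.
sample = "(alfianp613@gmail.com) +62-855"

# Pattern used in text_extractor: keep word characters, spaces, newlines.
strict = re.sub(r"[^\w \n]", "", sample)

# Hypothetical lenient variant: additionally keep @ . + and -
lenient = re.sub(r"[^\w \n@.+\-]", "", sample)

print(strict)
print(lenient)
```

With the lenient variant, e-mail addresses and phone numbers stay readable, at the cost of letting through more stray symbols from the PDF.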

## 5. Exporting String Output into PDF
Exporting the extracted text to a PDF is needed for CV summarization.

In [None]:
from fpdf import FPDF

# save FPDF() class into a
# variable pdf
pdf = FPDF()
# Add a page
pdf.add_page()

# set style and size of font
# that you want in the pdf
pdf.set_font("Arial", size = 12)

# write the text with multi_cell, which wraps long text
# across lines (a single cell would not wrap it)
pdf.multi_cell(0, 10, txt = output, align = 'J')

# save the pdf with name .pdf
pdf.output("temp.pdf")

''
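
One thing to watch out for: as far as I know, the fpdf 1.7.2 package installed above encodes text as Latin-1 when a core font such as Arial is used, so a character outside Latin-1 in the extracted string can raise a UnicodeEncodeError when the PDF is written. A minimal sketch of a pre-cleaning step is shown here. `to_latin1_safe` is a hypothetical helper and uses only the standard library.

```python
def to_latin1_safe(text):
    """Replace every character that Latin-1 cannot encode with '?'."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


# Made-up sample: 'é' exists in Latin-1, the check mark does not.
print(to_latin1_safe("Résumé ✓ data"))
```

Under that assumption, passing `to_latin1_safe(output)` instead of `output` to the PDF writer would keep the export from failing on unusual characters, at the cost of replacing them with `?`.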

In [None]:
# test temp pdf
file_name = "temp.pdf" # select path to your pdf document in your local environment
output = text_extractor(file_name)

output

'Muhammad Alfian Pratama  \n LinkedIn     628552078 1007      alfianp613githubio      alfianp613gmailcom       GitHub  \nIm a 6th semester student curious and interested in Data Science and Machine Learning  I possess advanced proficiency in Python and \nR programming languages and I have honed my skills in prominent frameworks such as TensorFlow and Flask I am currently seek ing an \nopportunity to expand and apply my skills through a one semester industry plac ement with a particular focus on data related roles I \nam eager to delve deeper into the practical aspects of the field and gain invaluable real world experience  \n \nSkills  ____________________________________________________________________________________ ________ ___  \n \n   Python  R  HTML  CSS  Javascript  Tableau  Microsoft Excel  Flask  Tensorflow  SPSS  Minitab  MySQL  NoSQL  Fire base  \n   Machine Learning  Data Science  Data Analytics  Statistics  Microservices  Backend  English  \nEducation  _________ _________

---
# Summary
So, this OCR notebook works just fine. The OCR models predict the words with good accuracy too. However, the models used here are DocTR's pretrained models, so good text extraction results were to be expected. Self-trained models will be used once they work properly with the DocTR ocr_predictor function args. Further research may be conducted to re-evaluate and remodel them in order to upgrade and boost the accuracy of the OCR model.

In [None]:
# Thanks