# Get Text From Any PDF File in 3 Simple Steps with OCR


### Problem statement:  
Imagine you have to build a tool that extracts text from PDF files submitted by two different groups of clients: 
- one group submits its PDFs as scanned/image files   
- the other group sends native PDF files. 

#### What to do in such a case? 
- Create two different tools? No: that is expensive and hard to maintain. 

#### A solution? (not the only one)   
Any PDF, scanned or native, can be rendered page by page as images, so a single tool with the following features covers both cases:  
- normalise all the files into images   
- extract the content of the images using OCR   

Source code below.

In [1]:
# Useful libraries
from pdf2image import convert_from_path
from pytesseract import image_to_string
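
Both libraries are thin wrappers around external programs: `pytesseract` calls the Tesseract OCR binary and `pdf2image` calls Poppler's `pdftoppm`. A minimal sketch of a pre-flight check is shown here; the helper name `missing_system_tools` is hypothetical, and the executable names are an assumption that holds for typical Linux/macOS installs.

```python
import shutil

# Assumption: usual executable names on Linux/macOS. On Windows the binaries
# may live outside PATH and need explicit configuration instead.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_system_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_system_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("Tesseract and Poppler appear to be installed.")
```

Running this before the rest of the notebook can turn a confusing runtime error into a clear message about what still needs to be installed.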

In [2]:
def convert_pdf_to_img(pdf_file):
    """
    @desc: this function converts each page of a PDF into an image
    
    @params:
        - pdf_file: the path of the PDF file to be converted
    
    @returns:
        - a list of PIL images, one per page of the PDF
    """
    return convert_from_path(pdf_file)


def convert_image_to_text(file):
    """
    @desc: this function extracts text from an image
    
    @params:
        - file: the image to extract the text from
    
    @returns:
        - the textual content of a single image
    """
    
    text = image_to_string(file)
    return text


def get_text_from_any_pdf(pdf_file):
    """
    @desc: this function is our final system combining the previous functions
    
    @params:
        - pdf_file: the original PDF file
    
    @returns:
        - the textual content of ALL the pages
    """
    images = convert_pdf_to_img(pdf_file)
    final_text = ""
    for img in images:
        # Append the OCR text of each page, in page order
        final_text += convert_image_to_text(img)
    
    return final_text
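
The function above returns one long string. When page boundaries matter, a variant that keeps one string per page can be handy. The sketch below is not part of the original tool: `get_text_per_page`, and its `to_images` and `to_text` hooks, are hypothetical names introduced for illustration. It relies on the `dpi` parameter of `convert_from_path` and the `lang` parameter of `image_to_string`; the value 300 is an assumed starting point, not a tuned setting.

```python
def get_text_per_page(pdf_file, dpi=300, lang="eng", to_images=None, to_text=None):
    """Return a list with the OCR text of each page of `pdf_file`.

    `to_images` and `to_text` are hypothetical hooks (not in the original
    code) that let you swap converters or test without OCR installed.
    """
    if to_images is None:
        # Imported lazily so the function also works with custom hooks only.
        from pdf2image import convert_from_path

        def to_images(path):
            return convert_from_path(path, dpi=dpi)

    if to_text is None:
        from pytesseract import image_to_string

        def to_text(image):
            return image_to_string(image, lang=lang)

    # One OCR call per rendered page, in page order.
    return [to_text(image) for image in to_images(pdf_file)]
```

Calling `"\n".join(get_text_per_page(pdf_file))` gives a single string comparable to the output of `get_text_from_any_pdf`, while the list form makes it easy to report which page a piece of text came from.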

In [3]:
path_to_pdf = './data/First Cry Image.pdf'

In [4]:
print(get_text_from_any_pdf(path_to_pdf))

www.FirstCry.com
Sold By

Digital Age Retails Pvt. Ltd.

585/2,PALAM ROAD,BIJWASA, NEW

DELHI,BIJAWASAN., 110061

GST NO : 07AADCD8136E1ZT
Shipment ID : 1112027167787

Order No : 305835LLH6B19975

TAX INVOICE

Shipping Address
Rashmita Devi

Flat No A2/1, Second Floor, IRCON
Tower Newtown, Kolkata Opposite

of Reliance fresh
North twenty four Parganas, We:
Bengal, 700156

Mobile No : +91-9748167266

st

Place Of Supply : 19-West Bengal

Invoice No : 06CV0003436591

Invoice Date : 2020-12-05
Order Date: 2020-12-04

Dispatch From : Wholsum Foods Pvt Ltd Pickup Address:Holisol Logistics Plot No. - 66,
Dwarka Sec-28, Village - Bamnoli, New Delhi - 110077 (India) Billing Address: Wholsum
Foods Pvt Ltd, C-533, Triveni Apartments, Sheikh Sarai, Phase 1, New Delhi 110017 Delhi


Delhi 110077
Total] Tax | Tax] Tax | Sub
Item Name Qty) MRP | pisc.| Type |Rate| Amt | Total
Slurrp Farm Immunity Booster
Combo - Organic Jaggery and Nut
Powder, 3001009
H