# Document identification with Optical Character Recognition (OCR) and Natural Language Processing (NLP)

### Step-by-Step Rundown

1. Input Handling: Accept images or PDFs as input.
2. Convert PDF to Images (if needed): Convert PDF pages to images since OCR typically works on image data.
3. Apply OCR to Extract Text: Use an OCR library to extract text from each image.
4. Pre-process the Extracted Text: Clean and prepare the text for analysis. This may involve removing noise, special characters, and stopwords.
5. NLP to Classify Document Type: Use NLP techniques such as keyword extraction, text classification, or topic modeling to determine the type of document.
6. Output the Results: Display or store the results, indicating the document type.

### Libraries and Tools

- PDF to Image Conversion: Use PyMuPDF to render PDF pages as images.
- OCR: Use Keras-OCR to extract text from images.
- Text Pre-processing: Use NLTK's English stopword list and Python's `string` module to clean the text.
- NLP: Use spaCy (with the `en_core_web_sm` model) to tokenize the text for classification.


In [1]:
#!python -m spacy download en_core_web_sm

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [3]:
import fitz  # PyMuPDF
import keras_ocr
import nltk
from nltk.corpus import stopwords
import string
from PIL import Image
import numpy as np
# Raw strings keep the Windows backslashes from being read as escape sequences
invoice_path = r"Q:\Code\document_data\sample-invoice.pdf"
resume_path  = r"Q:\Code\document_data\cv_1.pdf"

nltk.download('stopwords')

In [4]:
# Initialize Keras-OCR pipeline and spaCy model
pipeline = keras_ocr.pipeline.Pipeline()

# Load NLP model
nlp = spacy.load("en_core_web_sm")

Looking for C:\Users\simon\.keras-ocr\craft_mlt_25k.h5
Instructions for updating:
Use `tf.image.resize(...method=ResizeMethod.BILINEAR...)` instead.
Looking for C:\Users\simon\.keras-ocr\crnn_kurapan.h5


### Convert PDF to images

You could extract text directly from the PDF, but rendering each page to an image and running OCR is more robust: it also handles PDFs that have no embedded text layer (e.g. scans).

In [5]:
# Function to convert PDF to images using PyMuPDF
def pdf_to_images(pdf_path):
    images = []

    # Open the document in a context manager so the file handle is released
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Render the page to a pixmap (RGB, no alpha channel by default)
            pix = page.get_pixmap()
            # Convert the pixmap to a PIL Image
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            images.append(img)

    return images
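
A plain `page.get_pixmap()` call renders at PyMuPDF's base resolution of 72 dpi, which may be too coarse for OCR on small print. The sketch below shows how a zoom factor for a higher resolution could be computed. `zoom_for_dpi` and `rendered_size` are hypothetical helpers, and 300 dpi is an assumed target, not a value taken from this notebook.

```python
# PDF page coordinates are measured in points: 72 points per inch.
POINTS_PER_INCH = 72

def zoom_for_dpi(dpi):
    """Return the scale factor that renders a page at `dpi` (hypothetical helper)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / POINTS_PER_INCH

def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel size of a page with the given point size at `dpi`."""
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)

# An A4 page is roughly 595 x 842 points.
print(rendered_size(595, 842, 72))
print(rendered_size(595, 842, 300))

# Inside pdf_to_images, the zoom could then be applied like this (assumed usage):
#   zoom = zoom_for_dpi(300)
#   pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
```

Larger renders give the OCR model more pixels per character, at the cost of slower recognition and higher memory use, so the target resolution is worth tuning per document set.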

### Extract the text from the images using Keras-OCR

In [6]:
# Function to extract text from images using Keras-OCR
def extract_text_from_image(image):
    # Convert PIL image to numpy array
    image_array = np.array(image)
    # Run OCR on the image using Keras-OCR pipeline
    prediction_groups = pipeline.recognize([image_array])
    # Join all the recognized text from the predictions
    extracted_text = ' '.join([text for text, box in prediction_groups[0]])
    return extracted_text
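
The words are joined in whatever order the pipeline returns them, which is not guaranteed to be reading order. A minimal sketch of one way to reorder them, assuming each prediction is a `(word, box)` pair whose box holds four `(x, y)` corner points. `sort_reading_order` and its `line_tolerance` parameter are hypothetical, and the grouping heuristic is a simple one that will not handle multi-column layouts.

```python
def sort_reading_order(predictions, line_tolerance=0.5):
    """Order (word, box) pairs top-to-bottom, then left-to-right.

    Each box is assumed to be four (x, y) corner points.
    """
    items = []
    for word, box in predictions:
        xs = [float(p[0]) for p in box]
        ys = [float(p[1]) for p in box]
        items.append({
            "word": word,
            "x": min(xs),              # left edge
            "y": sum(ys) / len(ys),    # vertical centre
            "h": max(ys) - min(ys),    # box height
        })

    # Walk the words from top to bottom and group them into text lines:
    # a word joins the current line if its centre is close to the previous word's.
    items.sort(key=lambda it: it["y"])
    lines = []
    for it in items:
        if lines and abs(it["y"] - lines[-1][-1]["y"]) <= line_tolerance * max(it["h"], 1.0):
            lines[-1].append(it)
        else:
            lines.append([it])

    # Within each line, read left to right
    return [it["word"] for line in lines for it in sorted(line, key=lambda it: it["x"])]


# Made-up boxes: "hello" and "world" share a line, "invoice" sits below them.
sample = [
    ("world", [(60, 10), (110, 10), (110, 30), (60, 30)]),
    ("invoice", [(0, 50), (70, 50), (70, 70), (0, 70)]),
    ("hello", [(0, 12), (50, 12), (50, 32), (0, 32)]),
]
print(" ".join(sort_reading_order(sample)))  # hello world invoice
```

In `extract_text_from_image`, this could replace the plain join with `' '.join(sort_reading_order(prediction_groups[0]))`.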

### Prepare the text to be fed into the NLP model

In [7]:
def preprocess_text(text):
    # Lowercase and remove punctuation
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    text = " ".join([word for word in text.split() if word not in stop_words])
    
    return text
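
To make the effect of this cleaning step concrete, here is a self-contained sketch of the same two operations. It swaps NLTK's English stopword list for a tiny hard-coded set (`DEMO_STOP_WORDS`, a hypothetical stand-in) so that it runs without any downloads; the sample sentence is made up.

```python
import string

# Stand-in for stopwords.words('english'); the real list is much longer.
DEMO_STOP_WORDS = {"the", "is", "a", "for", "of", "and", "to"}

def preprocess_demo(text, stop_words=DEMO_STOP_WORDS):
    # Lowercase and strip every ASCII punctuation character
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Drop stopwords; split() also collapses newlines and repeated spaces
    return " ".join(word for word in text.split() if word not in stop_words)

print(preprocess_demo("Invoice No. 123: the total amount is 190.00 EUR!"))
# invoice no 123 total amount 19000 eur
```

Note that stripping punctuation turns `190.00` into `19000`. That is harmless for keyword-based classification, but it would matter if amounts or dates were extracted from the cleaned text later.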

### Classify the document using NLP

In [8]:
# Function to classify document type
def classify_document(text):
    # Example: simple keyword-based classification
    doc = nlp(text)
    # Collect the token texts once (the input is already lowercased)
    tokens = {token.text for token in doc}

    # Check for certain keywords. 'curriculum vitae' spans two tokens,
    # so it is matched against the full text rather than single tokens.
    if tokens & {'invoice', 'bill'}:
        return 'Invoice'
    elif 'resume' in tokens or 'curriculum vitae' in text:
        return 'Resume'
    elif tokens & {'report', 'analysis'}:
        return 'Report'
    else:
        return 'Unknown'
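
Because the function returns on the first matching keyword, a resume that happens to contain the word "bill" would be labelled an invoice. One possible refinement, sketched here under assumptions, is to count keyword hits per category and pick the highest score. The `KEYWORDS` table and `classify_by_score` are hypothetical names; the keywords themselves are the ones used above.

```python
import re

# Keyword lists per document type (same keywords as the classifier above)
KEYWORDS = {
    "Invoice": ["invoice", "bill"],
    "Resume": ["resume", "curriculum vitae"],
    "Report": ["report", "analysis"],
}

def classify_by_score(text):
    """Return the label with the most whole-word keyword hits, or 'Unknown'."""
    text = text.lower()
    scores = {}
    for label, phrases in KEYWORDS.items():
        # \b anchors match whole words only, so 'billing' does not count as 'bill'
        scores[label] = sum(
            len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))
            for phrase in phrases
        )
    # On a tie, max() keeps the label listed first in KEYWORDS
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unknown"

print(classify_by_score("functional resume sample resume paid the bill"))  # Resume
print(classify_by_score("curriculum vitae john smith"))                    # Resume
print(classify_by_score("hello world"))                                    # Unknown
```

Counting hits is still a heuristic. With labelled examples available, a trained text classifier would be the natural next step, as the rundown at the top suggests.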


### Building the main function to run all previous steps

In [9]:

# Main Function
def main(pdf_path):
    # Step 1: Convert PDF to images
    images = pdf_to_images(pdf_path)

    # Step 2: Extract text from images
    extracted_text = ""
    for image in images:
        extracted_text += extract_text_from_image(image) + "\n"

    # Step 3: Pre-process extracted text
    preprocessed_text = preprocess_text(extracted_text)

    # Step 4: Classify document type
    document_type = classify_document(preprocessed_text)

    print(f"\nExtracted Text: \n{extracted_text[:500]}...")  # Print first 500 characters of the text
    print('\n-----------------------------------')
    print(f"Document Type: {document_type}")
    print('-----------------------------------')


### Testing the function with an invoice PDF

In [10]:
# Run the main function
if __name__ == "__main__":
    # Provide the path to your PDF file
    pdf_path = invoice_path
    main(pdf_path)


Extracted Text: 
cpb cpb software germany gmbh irucn ojopr oreog m olofon le sit fjus o ocpo geman onere con ee ollnes row cow cpe sofnere greh bnuch csber mtonese van gamany m s muslerkunde ag mr john doe 25 musterstr 12345 musterstadt slelanie nae sller phone sag coti stoco invoice wmaccess internet vat de1sgito66 no invoice no customer no invoice period date 2315 o1 02 2024 29 02 2024 maz 024 123100401 amount description total serice quantity amount without vat oo sasic ce dnvicr 190 10 o0 baee orioral ooo ee...

-----------------------------------
Document Type: Invoice
-----------------------------------


### Testing the function with a resume PDF

In [11]:
# Run the main function
if __name__ == "__main__":
    # Provide the path to your PDF file
    pdf_path = resume_path
    main(pdf_path)


Extracted Text: 
functional resume sample we smith john 2002 way font collins co 8025 front range iwsmithacolostate edu career summary four in childhood diverse background of experience early development with in the years care c children special needs and adults care experience adult 1s0 adult clients determined work placement for special needs maintained client databases and records coordinated client health basis with local professionals monthly contact care on d 25 volunteer managed workers childcare experien...

-----------------------------------
Document Type: Resume
-----------------------------------


### Despite the noisy OCR output, the keyword-based classifier correctly identified the two documents as an invoice and a resume, respectively.