<a href="https://colab.research.google.com/github/santoshpremi/Automatic-Document-Analysis/blob/main/Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**In this tutorial, we compare two versions of a PDF document at three levels: word-level differences, sentence-level differences, and BERT-based semantic similarity.**

**Table of Contents**
1. Setup Environment
2. Text Extraction from PDFs
3. Word-Level Differences
4. Sentence-Level Comparison
5. Semantic Similarity Analysis

# **1. Setup Environment**

**Google Drive Mount:** Gives the notebook access to PDF files stored in your Google Drive.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

**Tesseract OCR:** Handles scanned PDFs that have no extractable text layer.

**Libraries:**

**PyMuPDF (fitz):**  Extracts text from standard PDFs. <br>
**pytesseract:** Integrates Tesseract for OCR. <br>
**spaCy:** Splits text into sentences for granular analysis. <br>
**transformers:**  Provides BERT for semantic similarity calculations.

In [None]:
!sudo apt install -y tesseract-ocr
!pip install pytesseract PyMuPDF transformers spacy scikit-learn torch pillow
!python -m spacy download en_core_web_sm
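
After installing, it can help to confirm that every dependency is actually visible to the notebook. The following is a minimal sketch using only the standard library; the helper name `check_environment` and the module list are assumptions made for illustration, not part of the original notebook.

```python
import importlib.util
import shutil

# Import names (not pip names) of the packages this notebook relies on
REQUIRED_MODULES = ["fitz", "pytesseract", "spacy", "transformers", "sklearn", "PIL"]

def check_environment(modules=REQUIRED_MODULES, binary="tesseract"):
    """Return a dict mapping each dependency to True if it can be found."""
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # pytesseract is only a wrapper; the tesseract executable must be on PATH
    status[binary] = shutil.which(binary) is not None
    return status

for name, ok in check_environment().items():
    print(f"{name:<14}{'found' if ok else 'MISSING'}")
```

Any entry reported as missing points to an install step that did not complete.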

In [18]:
# Import libraries
import fitz
import difflib
import spacy
import pytesseract
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertModel
import numpy as np
from PIL import Image
import io

In [19]:
# Replace these paths with your PDF locations in Google Drive
pdf_v1_path = '/content/drive/MyDrive/Adhikari_Cover_letter.pdf'
pdf_v2_path = '/content/drive/MyDrive/Adhikari_Cover_letter_5.pdf'

# **2. Text Extraction from PDFs**

A PDF page may carry an embedded text layer or consist only of a scanned image. Our extraction function handles both cases:

In [20]:
# ==================== DOCUMENT COMPARISON ====================
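def extract_text(pdf_path):
    """Return the text of a PDF, using OCR for pages without a text layer."""
    text = ""
    # The context manager closes the file once extraction is done
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Try direct text extraction first
            page_text = page.get_text()
            if page_text.strip():
                text += page_text
            else:
                # Fallback to OCR: render the page to a PNG image and run Tesseract.
                # A higher DPI than the 72 default usually improves recognition.
                pix = page.get_pixmap(dpi=300)
                img = Image.open(io.BytesIO(pix.tobytes("png")))
                text += pytesseract.image_to_string(img)
    return text

# Load PDFs from Drive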
text1 = extract_text(pdf_v1_path)
text2 = extract_text(pdf_v2_path)

**How It Works:**

**1. Text Extraction:** fitz retrieves text directly from PDF pages. <br>
**2. OCR Fallback:** If a page has no extractable text (e.g., scanned PDF), it converts the page to an image and uses Tesseract OCR. <br>
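**3. Coverage:** Combining both methods recovers text from digital and scanned pages alike. A page that mixes a text layer with scanned images is treated as text-only, so text inside its images is not OCR'd.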
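
The per-page decision described above can be sketched without any PDF library. In this illustration, `extract_pages`, `read_text`, and `run_ocr` are hypothetical names: the two callables stand in for `page.get_text()` and the Tesseract call, and the pages are toy dictionaries. The sketch also records which pages went through OCR, since OCR output tends to be noisier and may explain unexpected differences later.

```python
def extract_pages(pages, read_text, run_ocr):
    """Return (text, ocr_page_numbers) for a sequence of page objects.

    read_text and run_ocr are placeholder callables; in the notebook they
    would correspond to page.get_text() and the Tesseract call.
    """
    parts, ocr_pages = [], []
    for number, page in enumerate(pages, start=1):
        page_text = read_text(page)
        if not page_text.strip():
            # No text layer on this page: fall back to OCR and remember it
            page_text = run_ocr(page)
            ocr_pages.append(number)
        parts.append(page_text)
    return "".join(parts), ocr_pages

# Toy pages: dicts with a text layer and a pretend OCR result
pages = [
    {"text": "Dear Hiring Team,\n", "scan": ""},
    {"text": "   ", "scan": "Scanned second page\n"},
]
text, ocr_pages = extract_pages(pages, lambda p: p["text"], lambda p: p["scan"])
print(ocr_pages)
```

Here only the second page is routed to the OCR stand-in, because its text layer contains nothing but whitespace.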

# **3. Word-Level Differences**

Identify additions, deletions, and modifications at the word level with color coding:

In [21]:
# ==================== VISUAL DIFF DISPLAY ====================
def print_colored_diff(old_text, new_text):
    """
    Shows word-level differences with color coding
    - Red: Removed text
    - Green: Added text
    """
    d = difflib.Differ()
    diff = d.compare(old_text.split(), new_text.split())

    current_line = []
    line_length = 0
    for word in diff:
        if word.startswith('? '):
            # Differ's hint lines carry no words; skip them
            continue
        plain_word = word[2:]
        # Format word with color
        if word.startswith('- '):
            formatted_word = f"\033[91m{plain_word}\033[0m"
        elif word.startswith('+ '):
            formatted_word = f"\033[92m{plain_word}\033[0m"
        else:
            formatted_word = plain_word

        # Wrap on the visible width; ANSI color codes take no space on screen
        if current_line and line_length + len(plain_word) > 80:
            print(' '.join(current_line))
            current_line = []
            line_length = 0

        current_line.append(formatted_word)
        line_length += len(plain_word) + 1

    if current_line:
        print(' '.join(current_line))


**Color Coding:** Red for removed words, green for added words. <br>
**Line Wrapping:** Ensures output remains readable by limiting line length.
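
To see what `difflib.Differ` produces before any coloring is applied, the short sketch below tags each word as unchanged, removed, or added. The helper `tag_words` is a hypothetical name introduced for this example, and the two phrases are shortened from the cover letter output shown at the end of this notebook.

```python
import difflib

def tag_words(old_text, new_text):
    """Return (status, word) pairs, where status is 'same', 'removed' or 'added'."""
    labels = {"  ": "same", "- ": "removed", "+ ": "added"}
    tagged = []
    for token in difflib.Differ().compare(old_text.split(), new_text.split()):
        prefix, word = token[:2], token[2:]
        if prefix in labels:  # skip '? ' hint lines
            tagged.append((labels[prefix], word))
    return tagged

pairs = tag_words("align well with the requirements of this role",
                  "align well with the lending of this role")
changes = [p for p in pairs if p[0] != "same"]
print(changes)
```

Each token from `Differ` starts with a two-character prefix, so a replaced word shows up as one removal followed by one addition rather than as a single "modified" entry.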


# **4. Sentence-Level Comparison**

Analyze changes at the sentence level to understand structural edits:

In [22]:
# Load English NLP model for sentence splitting
nlp = spacy.load("en_core_web_sm")
# ==================== SENTENCE-LEVEL ANALYSIS ====================
def compare_sentences(text1, text2):
    """
    Compare documents at sentence level using spaCy with inline coloring
    """
    doc1 = nlp(text1)
    doc2 = nlp(text2)

    sentences1 = [sent.text for sent in doc1.sents]
    sentences2 = [sent.text for sent in doc2.sents]

    matcher = difflib.SequenceMatcher(None, sentences1, sentences2)
    count_sentence = 1
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'replace':
            original = ' '.join(sentences1[i1:i2])
            modified = ' '.join(sentences2[j1:j2])
            # Diff the two versions once at the word level
            tokens = list(difflib.Differ().compare(original.split(), modified.split()))

            print(f"{count_sentence}. MODIFIED SENTENCES")
            count_sentence += 1

            print("\nOriginal Version (removed content in red):\n" + "─" * 40)
            # Keep unchanged and removed words; color the removals red
            print(' '.join(
                f"\033[91m{t[2:]}\033[0m" if t.startswith('- ') else t[2:]
                for t in tokens if t.startswith(('- ', '  '))))

            print("\nNew Version (added content in green):\n" + "─" * 40)
            # Keep unchanged and added words; color the additions green
            print(' '.join(
                f"\033[92m{t[2:]}\033[0m" if t.startswith('+ ') else t[2:]
                for t in tokens if t.startswith(('+ ', '  '))))
            print("\n")

        elif tag == 'delete':
            print(" DELETED SENTENCES ".center(80, '─'))
            print('\n'.join([f"• \033[91m{sent}\033[0m" for sent in sentences1[i1:i2]]))
            print("\n" + "─" * 80)

        elif tag == 'insert':
            print(" ADDED SENTENCES ".center(80, '─'))
            print('\n'.join([f"• \033[92m{sent}\033[0m" for sent in sentences2[j1:j2]]))
            print("\n" + "─" * 80)

**Workflow:**

**Sentence Splitting:** spaCy divides text into sentences. <br>
**Sequence Matching:** difflib identifies replacements, deletions, and insertions. <br>
**Visual Feedback:** Highlights modified sentences and lists added/removed content.
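
The sequence-matching step can be tried on its own with hand-written sentence lists, which avoids loading spaCy. This is a small sketch; `sentence_changes` is a hypothetical helper and the sentences are invented for illustration.

```python
import difflib

def sentence_changes(old_sentences, new_sentences):
    """Return (tag, old_span, new_span) for every non-equal opcode."""
    matcher = difflib.SequenceMatcher(None, old_sentences, new_sentences)
    return [
        (tag, old_sentences[i1:i2], new_sentences[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

old = ["I am applying.", "I know Python.", "Thank you."]
new = ["I am applying.", "I know Python and SQL.", "Thank you.", "Best regards."]
for change in sentence_changes(old, new):
    print(change)
```

`SequenceMatcher` compares list items by exact equality, so a sentence that differs by a single character is reported as a replacement, not as a partial match.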

# **5. Semantic Similarity Analysis**

Measure how close two documents are in meaning using BERT embeddings:

In [23]:
# ==================== SEMANTIC SIMILARITY ANALYSIS ====================
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

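def get_embedding(text):
    # truncation=True keeps only the first 512 tokens (BERT's input limit),
    # so longer documents are compared on their opening portion only
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    outputs = model(**inputs)
    # Mean-pool the token vectors into a single fixed-size document vector
    return outputs.last_hidden_state.mean(dim=1).detach().numpy()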

def semantic_similarity(text1, text2):
    embedding1 = get_embedding(text1)
    embedding2 = get_embedding(text2)
    return cosine_similarity(embedding1, embedding2)[0][0]

**BERT Embeddings:** Converts text into numerical vectors capturing contextual meaning. <br>
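**Cosine Similarity:** Measures the angle between the two vectors. The score ranges from -1 to 1; values near 1 mean the embeddings point in almost the same direction, which suggests similar content but does not prove the texts are identical. <br>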
**Use Case:** Detects paraphrasing or structural changes that aren’t visible in word/sentence diffs.
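
For intuition about the score itself, cosine similarity can be computed by hand on tiny vectors. The sketch below uses only the standard library; the function `cosine` and the example vectors are illustrative and stand in for the 768-dimensional BERT embeddings used in this notebook.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

The score depends only on direction, not on vector length, which is why the first pair scores (up to rounding) as high as possible even though the vectors differ in magnitude.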

In [24]:
# ==================== MAIN EXECUTION ====================
# Show detailed differences
print(" WORD-LEVEL CHANGES ".center(80, '═'))
print_colored_diff(text1, text2)
print("\n")
# Show sentence-level analysis
print(" SENTENCE-LEVEL CHANGES ".center(80, '═'))
compare_sentences(text1, text2)

# Calculate and show semantic similarity
similarity_score = semantic_similarity(text1, text2)

# Show the score and a threshold-based interpretation
print(" SEMANTIC SIMILARITY ANALYSIS ".center(80, '═'))
print(f"Semantic Similarity Score: {similarity_score:.2f}")

if similarity_score > 0.85:
    print("✅ Documents are highly similar semantically")
elif similarity_score > 0.7:
    print("⚠️ Documents have moderate semantic differences")
else:
    print("❌ Documents are significantly different semantically")

══════════════════════════════ WORD-LEVEL CHANGES ══════════════════════════════
Dear Hiring Team, I am writing to express my sincere interest in the Software
Engineering (Working Student) position at Fulfin. As a Master’s student in
Computer Science at Julius Maximilians University Wurzburg, I have developed
proficiency in Python, PostgreSQL, and full-stack development, which I believe
align well with the [91mrequirements[0m [92mlending[0m of this role.
Relevant Experience and Skills Python & JavaScript Development: I have extensive
experience in Python, including the development of machine learning models. For
instance, I created Background Remover AI using PyTorch, which leverages Python
for backend processing. This project honed my ability to develop and implement
algorithms efficiently. Full-Stack Development: During my software developer
role at Real Time Solutions, I have built and maintained full-stack applications
using JavaScript Reactjs and Python. Collaboration & Proble