<a href="https://colab.research.google.com/github/sssangeetha/OutamationAI_OCR_RAG_Automation/blob/main/Copy_of_Designing_a_Page_Level_Detection_Strategy_Using_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Colab notebook includes all the code you need to experiment with page-level document classification using a RAG-based strategy. Upload your own multi-document PDF, follow along step by step, and learn how to detect document boundaries, label each document by type, and generate clean page-level metadata for smarter extraction and automation.


**Step 1: Extract Page-Level Content from PDF**





In [1]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [2]:
from PyPDF2 import PdfReader

reader = PdfReader("/content/sample_data/Blob File Sample.pdf")
pages = [page.extract_text() for page in reader.pages]
doc_pages = [{"page_num": i, "text": p} for i, p in enumerate(pages)]
doc_pages

[{'page_num': 0,
  'text': 'Functional Resume Sample \n \nJohn W. Smith   \n2002 Front Range Way Fort Collins, CO 80525  \njwsmith@colostate.edu  \n \nCareer Summary \n \nFour years experience in early childhood development with a di verse background in the care of \nspecial needs children and adults.  \n  \nAdult Care Experience  \n \n• Determined work placement for 150 special needs adult clients.  \n• Maintained client databases and records.  \n• Coordinated client contact with local health care professionals on a monthly basis.     \n• Managed 25 volunteer workers.     \n \nChildcare Experience  \n \n• Coordinated service assignments for 20 part -time counselors and 100 client families. \n• Oversaw daily activity and outing planning for 100 clients.  \n• Assisted families of special needs clients with researching financial assistance and \nhealthcare. \n• Assisted teachers with managing daily classroom activities.    \n• Oversaw daily and special st udent activities.     \n \nEmplo
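
The extracted text above carries a lot of stray whitespace, and pages with no extractable text (for example, scanned images) can come back empty. As a minimal sketch, assuming the per-page strings are already available as a list, the helper below tidies each page and flags empty ones. The name `build_page_records` and the `is_empty` field are hypothetical additions, not part of the original notebook.

```python
import re


def build_page_records(raw_texts):
    """Turn per-page extracted text into cleaned page records.

    raw_texts: list of str or None, one item per page (hypothetical input,
    e.g. the `pages` list built in the cell above).
    """
    records = []
    for page_num, raw in enumerate(raw_texts):
        # Treat a missing extraction result as empty text.
        text = raw or ""
        # Collapse runs of spaces and tabs into a single space.
        text = re.sub(r"[ \t]+", " ", text)
        # Trim each line and drop lines that end up blank.
        lines = [line.strip() for line in text.splitlines()]
        cleaned = "\n".join(line for line in lines if line)
        records.append({
            "page_num": page_num,
            "text": cleaned,
            "is_empty": not cleaned,  # True when nothing usable was extracted
        })
    return records
```

Pages flagged with `is_empty` would need OCR or manual review before they can be compared or classified.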

**Step 2: Write the "Same Document?" Function with RAG**

In [3]:
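import os

import google.generativeai as genai


def gemini_model(prompt):
    # Read the API key from the environment instead of hardcoding it in the notebook.
    # GEMINI_API_KEY is an assumed variable name; set it before running this cell.
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])

    model = genai.GenerativeModel("models/gemini-2.0-flash")
    response = model.generate_content(prompt)

    # Return only the text of the model's reply.
    return response.text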
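
Hosted LLM calls can fail transiently, for instance when a rate limit is hit. The sketch below wraps any prompt-to-text function, such as `gemini_model`, in a retry loop with exponential backoff. The helper name `call_with_retry` and its parameters are hypothetical, and catching every `Exception` is a simplification; a real notebook would probably narrow that to the errors the client library actually raises.

```python
import time


def call_with_retry(fn, prompt, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(prompt), retrying with exponential backoff on any exception.

    fn: any callable that takes a prompt string and returns text.
    sleep: injectable so the wait can be skipped or recorded in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(prompt)
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** (attempt - 1))
```

With this in place, a call such as `call_with_retry(gemini_model, prompt)` could stand in for a direct `gemini_model(prompt)` call.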


In [4]:
def is_same_document(prev_text, curr_text, doc_type=None):
    prompt = f"""
    You are checking whether two pages belong to the same document.
    Previous page type: {doc_type or 'unknown'}

    Previous Page:
    {prev_text}

    Current Page:
    {curr_text}

    Answer ONLY 'Yes' or 'No'. Do NOT explain.
    """
    response = gemini_model(prompt)  # Any LLM call that returns text can be swapped in here
    return response.strip().lower().startswith("yes")


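# Sanity check on two pages from different documents; the expected answer is False.
prev_text = doc_pages[0]["text"]
curr_text = doc_pages[2]["text"]
doc_type = "Resume"  # Type of the previous page; optional, can be "unknown" or None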

is_same_document(prev_text, curr_text, doc_type)


False
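
Note that `startswith("yes")` treats every other reply as "No", including a quoted `'Yes'`, an empty string, or an unrelated sentence. A stricter parser can separate a clear "No" from an unusable answer. This is a minimal sketch; `parse_yes_no` is a hypothetical helper, not part of the original notebook.

```python
def parse_yes_no(response):
    """Return True for 'yes', False for 'no', and None for anything else."""
    words = response.strip().split()
    if not words:
        return None
    # Look only at the first word, without surrounding quotes or punctuation.
    first = words[0].strip("'\"`*.,:;!").lower()
    if first == "yes":
        return True
    if first == "no":
        return False
    # Ambiguous reply: let the caller decide (retry, or assume a new document).
    return None
```

A caller could retry the LLM call when the result is `None`, or fall back to treating the page as the start of a new document.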

**Step 3: Write the Document Type Classifier**

In [5]:
def classify_document_type(text):
    prompt = f"""
    This is the start of a new document. Based on the content, classify it.

    Page Content:
    {text}

    Choose from: Resume, Contract, Lender Fee Sheet, ID, PaySlip, Other.
    Just respond with the type.
    """
    response = gemini_model(prompt).strip().lower().replace(".", "")
    result = response.title() # Capitalize the first letter of each word
    return result

classify_document_type(doc_pages[3]["text"])

'Payslip'
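
Because the classifier lowercases the reply and then calls `.title()`, the returned label does not always match the prompt's list: `PaySlip` comes back as `Payslip`, and `ID` would come back as `Id`. One option, sketched here under the assumption that the label list in the prompt is the source of truth, is to map the reply back onto that list and fall back to `Other` for anything unexpected. `ALLOWED_TYPES` and `normalize_doc_type` are hypothetical names.

```python
# Same labels as the "Choose from" line in the classification prompt.
ALLOWED_TYPES = ["Resume", "Contract", "Lender Fee Sheet", "ID", "PaySlip", "Other"]


def normalize_doc_type(response, allowed=ALLOWED_TYPES, fallback="Other"):
    """Map a free-form model reply onto one of the allowed labels."""
    # Trim whitespace, then trailing periods and surrounding quotes.
    key = response.strip().strip(".'\"").lower()
    # Case-insensitive lookup that returns the canonical spelling.
    lookup = {label.lower(): label for label in allowed}
    return lookup.get(key, fallback)
```

With this helper the payslip pages would be labeled `PaySlip`, exactly as spelled in the prompt, rather than `Payslip`.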

**Step 4: Loop Through Pages and Generate Page-Level Metadata**

In [7]:
import time

results = []
current_doc_type = None
doc_counter = 0

for i, page in enumerate(doc_pages):
    if i == 0:
        current_doc_type = classify_document_type(page["text"])
    else:
        prev_text = doc_pages[i - 1]["text"]
        same = is_same_document(prev_text, page["text"], current_doc_type)
        if not same:
            doc_counter += 1
            current_doc_type = classify_document_type(page["text"])

    results.append({
        "page": i,
        "doc_id": doc_counter,
        "doc_type": current_doc_type
    })

    time.sleep(1) # Add a small delay to avoid hitting rate limits

for r in results:
    print(r)

{'page': 0, 'doc_id': 0, 'doc_type': 'Resume'}
{'page': 1, 'doc_id': 1, 'doc_type': 'Lender Fee Sheet'}
{'page': 2, 'doc_id': 2, 'doc_type': 'Payslip'}
{'page': 3, 'doc_id': 2, 'doc_type': 'Payslip'}
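
The loop above can also be written as a function that receives the two LLM-backed checks as arguments, which makes the boundary logic testable without any API calls. This is a sketch of the same logic, not a replacement for the cell above; `segment_pages` and its parameter names are hypothetical.

```python
def segment_pages(doc_pages, same_fn, classify_fn):
    """Assign a doc_id and doc_type to every page.

    same_fn(prev_text, curr_text, doc_type) -> bool
    classify_fn(text) -> str
    """
    results = []
    doc_id = 0
    doc_type = None
    for i, page in enumerate(doc_pages):
        if i == 0:
            # The first page always starts the first document.
            doc_type = classify_fn(page["text"])
        elif not same_fn(doc_pages[i - 1]["text"], page["text"], doc_type):
            # A boundary was detected: start a new document and classify it.
            doc_id += 1
            doc_type = classify_fn(page["text"])
        results.append({"page": i, "doc_id": doc_id, "doc_type": doc_type})
    return results
```

In the notebook, `segment_pages(doc_pages, is_same_document, classify_document_type)` would run the same logic as the loop above, minus the one-second delay between pages.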


**Step 5: Visualize Results**

In [8]:
import pandas as pd

df = pd.DataFrame(results)
df.head()

   page  doc_id          doc_type
0     0       0            Resume
1     1       1  Lender Fee Sheet
2     2       2           Payslip
3     3       2           Payslip
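
For downstream extraction it is often handier to have one row per document than one row per page. The sketch below collapses the page-level `results` into page ranges. `summarize_documents` is a hypothetical helper, and it assumes the rows are in page order, as they are in the loop from Step 4.

```python
def summarize_documents(results):
    """Collapse page-level rows into one entry per document."""
    summary = []
    for row in results:
        if summary and summary[-1]["doc_id"] == row["doc_id"]:
            # Same document as the previous row: extend its page range.
            summary[-1]["end_page"] = row["page"]
        else:
            # First page of a new document.
            summary.append({
                "doc_id": row["doc_id"],
                "doc_type": row["doc_type"],
                "start_page": row["page"],
                "end_page": row["page"],
            })
    return summary
```

The returned list can be passed to `pd.DataFrame(...)` in the same way as `results`.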
