### 1. Importing libraries

In [19]:
import hashlib
import json
import os

- hashlib: Provides hash functions (like SHA-256) to create unique digital fingerprints of files or data.

- json: Allows reading/writing JSON data (a text-based format to store structured data).

- os: Provides utilities to interact with the operating system (like checking if files exist).



### 2. Setting constant filename for tracking processed PDFs

In [20]:
PROCESSED_RECORD_FILE = "processed_pdfs.json"

- This constant holds the name of the file that records which PDFs have already been processed.

- The file will store hashes (unique IDs) of PDFs processed so far.

### 3. Function: Generate a hash for a file

In [21]:
def get_file_hash(file_path):
    """Generate a hash for the file content."""
    hasher = hashlib.sha256()                          # Create a new SHA-256 hash object
    with open(file_path, "rb") as f:                   # Open the file in binary read mode
        for chunk in iter(lambda: f.read(4096), b""):  # Read file in 4KB chunks until EOF (empty bytes)
            hasher.update(chunk)                       # Update hash with current chunk data
    return hasher.hexdigest()                          # Return the final hex digest (unique hash string)


- Purpose: Calculate a unique fingerprint of the file content to detect duplicates.

- Opens the file as binary because hashing works on raw bytes.

- Reads file in chunks (4 KB at a time) for memory efficiency — useful for big files.

- Updates the hash object incrementally for each chunk.

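- Returns the final hexadecimal representation of the hash: a 64-character string that is, for practical purposes, unique to that file content (SHA-256 collisions are theoretically possible but not a realistic concern here).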

In [22]:
# Let's try with a sample file
sample_file_path = "./test_data/Passport1.pdf"
sample_file_hash = get_file_hash(sample_file_path)
print(f"Hash for {sample_file_path}: {sample_file_hash}")

Hash for ./test_data/Passport1.pdf: c2cfca0247c8cce1e59a62bb32598b98f2117f443ca6d2897338422be16ddad4
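The hash depends only on the bytes in the file, not on its name or location. The sketch below shows this with throwaway files. The file names, the contents, and the write_temp_file helper are made up for illustration, and get_file_hash is repeated so the block runs on its own.

```python
import hashlib
import os
import tempfile


def get_file_hash(file_path):
    """Generate a SHA-256 hash for the file content (same logic as above)."""
    hasher = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def write_temp_file(directory, name, data):
    """Hypothetical helper: write bytes to a file and return its path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
    return path


# Demo payload (not a real PDF); large enough to span several 4 KB chunks.
payload = b"same bytes" * 1000

with tempfile.TemporaryDirectory() as tmp_dir:
    original = write_temp_file(tmp_dir, "original.pdf", payload)
    renamed_copy = write_temp_file(tmp_dir, "copy.pdf", payload)
    edited = write_temp_file(tmp_dir, "edited.pdf", payload + b"!")

    hash_original = get_file_hash(original)
    hash_copy = get_file_hash(renamed_copy)
    hash_edited = get_file_hash(edited)

print(hash_original == hash_copy)    # same content under a different name
print(hash_original == hash_edited)  # a single extra byte changes the digest
print(len(hash_original))            # length of the hex digest
```

This is why a renamed copy of an already processed PDF would still be detected as a duplicate, while a file with any change in its bytes would be treated as new.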


### 4. Function: Load processed PDF hashes from JSON file

In [24]:
def load_processed_records():
    if os.path.exists(PROCESSED_RECORD_FILE):        # Check if JSON file exists
        with open(PROCESSED_RECORD_FILE, "r") as f:  # Open JSON file in read mode
            return set(json.load(f))                 # Load list from JSON, convert to set for faster lookup
    return set()                                     # If file doesn't exist, return empty set

- Checks whether the processed_pdfs.json file exists.

- If it does, loads the list of processed file hashes from it.

- Converts the list to a Python set, which makes checking whether a hash is present fast.

- If the file does not exist, returns an empty set, meaning no PDFs have been processed yet.

### 5. Function: Save updated processed PDF hashes back to JSON

In [26]:
def save_processed_records(processed_set):
    with open(PROCESSED_RECORD_FILE, "w") as f:    # Open JSON file in write mode (overwrites existing)
        json.dump(list(processed_set), f)          # Convert set to list and write as JSON array

- Writes the updated set of processed hashes back to the JSON file.

- Converts set back to list because JSON doesn’t support sets.

- Overwrites the file each time to keep it updated.
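The load and save functions can be checked with a quick round trip, as in the minimal sketch below. It assumes the record file lives in a temporary directory (so it does not touch a real processed_pdfs.json) and uses placeholder strings instead of real digests. Both functions are repeated so the block runs on its own.

```python
import json
import os
import tempfile

# Assumption for this demo: keep the record file in a throwaway directory.
demo_dir = tempfile.mkdtemp()
PROCESSED_RECORD_FILE = os.path.join(demo_dir, "processed_pdfs.json")


def load_processed_records():
    """Return the set of stored hashes, or an empty set if no file exists."""
    if os.path.exists(PROCESSED_RECORD_FILE):
        with open(PROCESSED_RECORD_FILE, "r") as f:
            return set(json.load(f))
    return set()


def save_processed_records(processed_set):
    """Write the set of hashes to disk as a JSON array."""
    with open(PROCESSED_RECORD_FILE, "w") as f:
        json.dump(list(processed_set), f)


first_load = load_processed_records()          # no file yet, so an empty set
save_processed_records({"hash_a", "hash_b"})   # placeholder values, not real digests
second_load = load_processed_records()         # same values read back from disk

print(first_load)
print(sorted(second_load))
```

Because sets are unordered, the order of entries in the JSON array can differ between runs. If a stable file is useful (for example, when it is kept under version control), writing sorted(processed_set) instead of list(processed_set) would be one option.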

### 6. Main function: Process a PDF

In [34]:
from modules.pdf_loader import load_pdf
from modules.text_splitter import split_text
from modules.embed_store import embed_and_store_documents
from modules.qa_with_retriever import answer_query_with_gemini

def process_pdf(pdf_path):
    processed = load_processed_records()        # Load processed hashes so far
    pdf_hash = get_file_hash(pdf_path)          # Calculate hash for current PDF

    if pdf_hash in processed:                    # Check if this PDF was already processed
        print("PDF already processed, skipping.")
        return                                  # Exit function early if processed

    # If not processed:
    pdf_text = load_pdf(pdf_path)                # Load the text content from PDF
    chunks = split_text(pdf_text, chunk_size=1000, chunk_overlap=100)  # Split into chunks for RAG
    vectorstore = embed_and_store_documents(chunks, persist_directory="./chroma_store", model_name="embed-multilingual-v3.0")  # Embed & store

    processed.add(pdf_hash)                       # Add this PDF’s hash to the processed set
    save_processed_records(processed)             # Save updated processed hashes back to JSON
    print("PDF processed and embeddings stored successfully.")

- Loads all processed PDF hashes to check for duplicates.

- Gets the hash of the current PDF file.

- If the hash is already in the processed set → prints a message and returns early, skipping further processing.

- If not processed:

    - Reads and extracts text from the PDF (load_pdf, imported from modules.pdf_loader, which you must define).

    - Splits the text into chunks with the specified size and overlap (split_text, imported from modules.text_splitter, which you must define).

    - Embeds the chunks and stores them persistently in a vector store (embed_and_store_documents, imported from modules.embed_store, which you must define).

    - Adds the current PDF hash to the processed set.

    - Saves the updated processed set back to the JSON file for future reference.

    - Prints a success message.
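The skip-if-already-processed flow can be tried without the real PDF, splitting, and embedding modules. The sketch below replaces them with simple stand-ins: reading the raw bytes as text, slicing into fixed-size pieces, and a fake_embed_and_store function that only records that it was called. These stand-ins, the sample file, and the return values "processed" and "skipped" are assumptions for illustration, not the behaviour of the modules package.

```python
import hashlib
import json
import os
import tempfile

# Assumption for this demo: all files live in a throwaway directory.
demo_dir = tempfile.mkdtemp()
PROCESSED_RECORD_FILE = os.path.join(demo_dir, "processed_pdfs.json")
embed_calls = []  # one entry per run of the stand-in embedding step


def get_file_hash(file_path):
    hasher = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def load_processed_records():
    if os.path.exists(PROCESSED_RECORD_FILE):
        with open(PROCESSED_RECORD_FILE, "r") as f:
            return set(json.load(f))
    return set()


def save_processed_records(processed_set):
    with open(PROCESSED_RECORD_FILE, "w") as f:
        json.dump(list(processed_set), f)


def fake_embed_and_store(chunks):
    """Stand-in for embed_and_store_documents: only records the call."""
    embed_calls.append(len(chunks))


def process_pdf(pdf_path):
    processed = load_processed_records()
    pdf_hash = get_file_hash(pdf_path)
    if pdf_hash in processed:
        return "skipped"                      # already seen this content

    with open(pdf_path, "rb") as f:           # stand-in for load_pdf
        text = f.read().decode("utf-8")
    chunks = [text[i:i + 10] for i in range(0, len(text), 10)]  # stand-in for split_text
    fake_embed_and_store(chunks)

    processed.add(pdf_hash)                   # record only after embedding succeeded
    save_processed_records(processed)
    return "processed"


sample_path = os.path.join(demo_dir, "sample.pdf")
with open(sample_path, "wb") as f:
    f.write(b"not a real PDF, just demo bytes")

first_result = process_pdf(sample_path)
second_result = process_pdf(sample_path)
print(first_result, second_result, len(embed_calls))
```

The second call returns early, so the embedding step runs only once. Note that the hash is recorded after embedding, so a failure partway through leaves the PDF eligible for a retry on the next run.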
