# Semantic Search Across Multiple Documents

When working with a collection of PDFs, you might need to find information relevant to a specific query across all documents, not just within a single one. This tutorial demonstrates how to perform semantic search over a `PDFCollection`.

In [1]:
#%pip install "natural-pdf[all]"
#%pip install "natural-pdf[search]"  # Ensure search dependencies are installed

In [2]:
import logging
import natural_pdf

# Optional: Configure logging to see progress
natural_pdf.configure_logging(level=logging.INFO)

# Define the paths to your PDF files
pdf_paths = [
    "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf",
    "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf"
    # Add more PDF paths as needed
]

# Create a PDFCollection
collection = natural_pdf.PDFCollection(pdf_paths)
print(f"Created collection with {len(collection.pdfs)} PDFs.")

natural_pdf.collections.pdf_collection - INFO - Initializing 2 PDF objects...



natural_pdf.core.pdf - INFO - Downloading PDF from URL: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf


natural_pdf.core.pdf - INFO - PDF downloaded to temporary file: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmpc7n5dufd.pdf


natural_pdf.core.pdf - INFO - Initializing PDF from /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmpc7n5dufd.pdf


natural_pdf.ocr.ocr_manager - INFO - OCRManager initialized.


natural_pdf.analyzers.layout.layout_manager - INFO - LayoutManager initialized. Available engines: ['yolo', 'tatr', 'paddle', 'surya', 'docling', 'gemini']


natural_pdf.core.highlighting_service - INFO - HighlightingService initialized with ColorManager.


natural_pdf.classification.manager - INFO - ClassificationManager initialized on device: None


natural_pdf.core.pdf - INFO - PDF 'https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf' initialized with 1 pages.


natural_pdf.classification.manager - INFO - ClassificationManager initialized on device: None


natural_pdf.extraction.manager - INFO - Initialized StructuredDataManager.


natural_pdf.core.pdf - INFO - Downloading PDF from URL: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf


natural_pdf.core.pdf - INFO - PDF downloaded to temporary file: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp5mocjexv.pdf


natural_pdf.core.pdf - INFO - Initializing PDF from /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp5mocjexv.pdf


natural_pdf.ocr.ocr_manager - INFO - OCRManager initialized.


natural_pdf.analyzers.layout.layout_manager - INFO - LayoutManager initialized. Available engines: ['yolo', 'tatr', 'paddle', 'surya', 'docling', 'gemini']


natural_pdf.core.highlighting_service - INFO - HighlightingService initialized with ColorManager.


natural_pdf.classification.manager - INFO - ClassificationManager initialized on device: None


natural_pdf.core.pdf - INFO - PDF 'https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf' initialized with 5 pages.


natural_pdf.classification.manager - INFO - ClassificationManager initialized on device: None


natural_pdf.extraction.manager - INFO - Initialized StructuredDataManager.


natural_pdf.collections.pdf_collection - INFO - Successfully initialized 2 PDFs. Failed: 0


Created collection with 2 PDFs.
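
The example above lists two URLs by hand. For a larger set of files in a local folder, the path list can be built programmatically. The following is a minimal sketch: `collect_pdf_paths` is a hypothetical helper (not part of natural-pdf), and passing local file paths to `PDFCollection` in place of URLs is an assumption, since this tutorial only demonstrates URLs.

```python
from pathlib import Path
import tempfile


def collect_pdf_paths(folder):
    """Return sorted string paths for every .pdf file directly inside `folder`.

    Hypothetical helper for illustration; matching is case-insensitive on the suffix.
    """
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demo with a throwaway directory so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.pdf", "a.PDF", "notes.txt"]:
        (Path(tmp) / name).write_bytes(b"")  # empty placeholder files
    local_paths = collect_pdf_paths(tmp)
    names = [Path(p).name for p in local_paths]
    print(names)
```

The resulting list could then be passed wherever `pdf_paths` is used above, assuming local paths are accepted there.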


## Initializing the Search Index

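Before you can search, initialize the collection's search capabilities. This processes the documents and builds an index. In the run shown here, the log reports six indexable items (matching the six pages across the two PDFs), embedded with the default `sentence-transformers/all-MiniLM-L6-v2` model and written to a non-persistent, in-memory store.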

In [3]:
# Initialize search. 'index=True' builds the index immediately.
# This might take some time depending on the number and size of PDFs.
collection.init_search(index=True) 
print("Search index initialized.")

natural_pdf.search.searchable_mixin - INFO - Using default collection name 'default_collection' for in-memory service.


natural_pdf.search.searchable_mixin - INFO - Creating new SearchService: name='default_collection', persist=False, model=default


natural_pdf.search.haystack_search_service - INFO - HaystackSearchService initialized for collection='default_collection' (persist=False, model='sentence-transformers/all-MiniLM-L6-v2'). Default path: './natural_pdf_index'


natural_pdf.search - INFO - Created new HaystackSearchService instance for collection 'default_collection'.


natural_pdf.search.searchable_mixin - INFO - index=True: Proceeding to index collection immediately after search initialization.


natural_pdf.search.searchable_mixin - INFO - Starting internal indexing process into SearchService collection 'default_collection'...


natural_pdf.search.searchable_mixin - INFO - Prepared 6 indexable items for indexing.


natural_pdf.search.haystack_search_service - INFO - Index request for collection='default_collection', docs=6, model='sentence-transformers/all-MiniLM-L6-v2', force=False, persist=False


natural_pdf.search.haystack_search_service - INFO - Created SentenceTransformersDocumentEmbedder. Model: sentence-transformers/all-MiniLM-L6-v2, Device: ComponentDevice(_single_device=Device(type=<DeviceType.MPS: 'mps'>, id=None), _multiple_devices=None)


natural_pdf.search.haystack_search_service - INFO - Preparing Haystack Documents from 6 indexable items...


natural_pdf.search.haystack_search_service - INFO - Embedding 6 documents using 'sentence-transformers/all-MiniLM-L6-v2'...



natural_pdf.search.haystack_search_service - INFO - Successfully embedded 6 documents.


natural_pdf.search.haystack_search_service - INFO - Writing 6 embedded documents to store 'default_collection'...


natural_pdf.search.haystack_search_service - INFO - Successfully wrote 6 documents to store 'default_collection'.


natural_pdf.search.haystack_search_service - INFO - Store 'default_collection' document count after write: 6


natural_pdf.search.searchable_mixin - INFO - Successfully completed indexing into SearchService collection 'default_collection'.


Search index initialized.


## Performing a Semantic Search

Once the index is ready, you can use the `find_relevant()` method to search for content semantically related to your query.

In [4]:
# Perform a search query
query = "american president"
results = collection.find_relevant(query)

print(f"Found {len(results)} results for '{query}':")

natural_pdf.search.searchable_mixin - INFO - Searching collection 'default_collection' via HaystackSearchService...


natural_pdf.search.haystack_search_service - INFO - Search request for collection='default_collection', query_type=str, options=TextSearchOptions(top_k=10, retriever_top_k=20, filters=None, use_reranker=True, reranker_instance=None, reranker_model=None, reranker_api_key=None)


natural_pdf.search.haystack_search_service - INFO - Created SentenceTransformersTextEmbedder. Model: sentence-transformers/all-MiniLM-L6-v2, Device: ComponentDevice(_single_device=Device(type=<DeviceType.MPS: 'mps'>, id=None), _multiple_devices=None)



natural_pdf.search.haystack_search_service - INFO - Running retrieval pipeline for collection 'default_collection'...


natural_pdf.search.haystack_search_service - INFO - Retrieved 6 documents.


natural_pdf.search.searchable_mixin - INFO - SearchService returned 6 results from collection 'default_collection'.


Found 6 results for 'american president':
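
To build intuition for what an embedding-based search does, here is a toy sketch that ranks documents by cosine similarity between a query vector and document vectors and keeps the top `k`. This is not natural-pdf's implementation: the three-dimensional vectors, the document names, and the use of cosine similarity are all illustrative assumptions. The real pipeline, as the log shows, embeds text with a sentence-transformers model and can also apply a reranker.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-made "embeddings"; a real model would produce hundreds of dimensions.
toy_docs = {
    "page_about_presidents": [0.9, 0.1, 0.0],
    "page_about_libraries": [0.1, 0.9, 0.1],
    "page_about_inspections": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]  # stands in for an embedded query string

ranked = top_k(toy_query, toy_docs, k=2)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.4f}")
```

Because ranking is relative, a search of this kind returns the closest items even when none of them is a strong match.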


## Understanding Search Results

The `find_relevant()` method returns a list of dictionaries, each representing a relevant text chunk found in one of the PDFs. Each result includes:

*   `pdf_path`: The path to the PDF document where the result was found.
*   `page_number`: The page number within the PDF.
*   `score`: A relevance score (higher means more relevant). Weak matches can score close to zero or even below it.
*   `content_snippet`: A snippet of the text chunk that matched the query.

In [5]:
# Process and display the results
if results:
    for i, result in enumerate(results, start=1):
        print(f"  {i}. PDF: {result['pdf_path']}")
        print(f"     Page: {result['page_number']} (Score: {result['score']:.4f})")
        # Display a snippet of the content
        snippet = result.get('content_snippet', '')
        print(f"     Snippet: {snippet}...")
else:
    print("  No relevant results found.")

# You can access the full content if needed via the result object, 
# though 'content_snippet' is usually sufficient for display.

  1. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp5mocjexv.pdf
     Page: 2 (Score: 0.0708)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
The Anasazi (Removed: 1)
Author: Petersen, David. ISBN: 0-516-01121-9 (trade) Published: 1991
Sit...
  2. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp5mocjexv.pdf
     Page: 5 (Score: 0.0669)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
Centennial Place 33170000562167 $13.10 11/5/1999 33554-43170
Academy (Charter)
Was Available -- W...
  3. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmpc7n5dufd.pdf
     Page: 1 (Score: -0.0040)
     Snippet: Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men...
  4. PDF: /var/folders/25/h3prywj
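
Since every page comes back with a score, including very weak matches, it can help to post-process the results before displaying them. The sketch below drops results under a score threshold and groups the remaining page numbers by PDF. It assumes results are plain dictionaries with the `pdf_path`, `page_number`, and `score` keys described above. The sample data, file paths, threshold, and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data shaped like the result fields described above.
sample_results = [
    {"pdf_path": "docs/a.pdf", "page_number": 2, "score": 0.0708, "content_snippet": "..."},
    {"pdf_path": "docs/a.pdf", "page_number": 5, "score": 0.0669, "content_snippet": "..."},
    {"pdf_path": "docs/b.pdf", "page_number": 1, "score": -0.0040, "content_snippet": "..."},
]


def filter_by_score(results, min_score):
    """Keep only results whose score is at least `min_score`."""
    return [r for r in results if r["score"] >= min_score]


def group_pages_by_pdf(results):
    """Map each pdf_path to the list of matching page numbers, in result order."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r["pdf_path"]].append(r["page_number"])
    return dict(grouped)


kept = filter_by_score(sample_results, min_score=0.0)
pages = group_pages_by_pdf(kept)
print(pages)
```

A sensible threshold depends on the embedding model and the documents, so treat `0.0` here as a placeholder rather than a recommended value.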

Semantic search matches on the meaning and context of a query rather than on exact keywords, so you can query a large set of documents and still surface the most relevant passages.