# Semantic Search Across Multiple Documents

When working with a collection of PDFs, you might need to find information relevant to a specific query across all documents, not just within a single one. This tutorial demonstrates how to perform semantic search over a `PDFCollection`.

In [1]:
#%pip install "natural-pdf[all]"
#%pip install "natural-pdf[search]"  # Ensure search dependencies are installed

In [2]:
import logging
import natural_pdf

# Optional: Configure logging to see progress
natural_pdf.configure_logging(level=logging.INFO)

# Define the paths to your PDF files
pdf_paths = [
    "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf",
    "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf",
    # Add more PDF paths as needed
]

# Create a PDFCollection
collection = natural_pdf.PDFCollection(pdf_paths)
print(f"Created collection with {len(collection.pdfs)} PDFs.")

natural_pdf.collections.pdf_collection - INFO - Initializing 2 PDF objects...

natural_pdf.core.pdf - INFO - Downloading PDF from URL: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf

natural_pdf.core.pdf - INFO - PDF downloaded to temporary file: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmplduy73q3.pdf

natural_pdf.core.pdf - INFO - Initializing PDF from /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmplduy73q3.pdf

natural_pdf.ocr.ocr_manager - INFO - OCRManager initialized.

natural_pdf.analyzers.layout.layout_manager - INFO - LayoutManager initialized. Available engines: ['yolo', 'tatr', 'paddle', 'surya', 'docling']

natural_pdf.core.highlighting_service - INFO - HighlightingService initialized with ColorManager.

natural_pdf.core.pdf - INFO - Initialized HighlightingService.

natural_pdf.core.pdf - INFO - PDF 'https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf' initialized with 1 pages.

natural_pdf.core.pdf - INFO - Downloading PDF from URL: https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf

natural_pdf.core.pdf - INFO - PDF downloaded to temporary file: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp__h0cd9h.pdf

natural_pdf.core.pdf - INFO - Initializing PDF from /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp__h0cd9h.pdf

natural_pdf.ocr.ocr_manager - INFO - OCRManager initialized.

natural_pdf.analyzers.layout.layout_manager - INFO - LayoutManager initialized. Available engines: ['yolo', 'tatr', 'paddle', 'surya', 'docling']

natural_pdf.core.highlighting_service - INFO - HighlightingService initialized with ColorManager.

natural_pdf.core.pdf - INFO - Initialized HighlightingService.

natural_pdf.core.pdf - INFO - PDF 'https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf' initialized with 5 pages.

Loading PDFs: 100%|██████████████████████████████████████| 2/2 [00:00<00:00,  3.37it/s]

natural_pdf.collections.pdf_collection - INFO - Successfully initialized 2 PDFs. Failed: 0

Created collection with 2 PDFs.
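
If your PDFs live in a local folder rather than at URLs, the path list can be built programmatically instead of typed by hand. The following is a minimal sketch using only the standard library. The helper name `collect_pdf_paths` and the folder name `my_pdfs` are hypothetical, and the sketch assumes `PDFCollection` accepts local path strings in the same way it accepts the URLs above.

```python
from pathlib import Path


def collect_pdf_paths(folder):
    """Return the sorted string paths of every .pdf file directly inside `folder`.

    Hypothetical helper for illustration; it is not part of natural-pdf.
    The match is on the lowercase ".pdf" suffix and does not recurse into subfolders.
    """
    return sorted(str(path) for path in Path(folder).glob("*.pdf"))


# Hypothetical usage ("my_pdfs" is a placeholder folder name):
# pdf_paths = collect_pdf_paths("my_pdfs")
# collection = natural_pdf.PDFCollection(pdf_paths)
```

Sorting the paths keeps the collection order stable between runs, which makes results easier to compare.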


## Initializing the Search Index

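Before you can search, you need to initialize the collection's search capabilities. Calling `init_search(index=True)` processes the documents and builds the index in one step. In the run below, the log shows six indexable items (matching the six pages across the two PDFs) being embedded with the default model, `sentence-transformers/all-MiniLM-L6-v2`, and written to an in-memory store (`persist=False`).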

In [3]:
# Initialize search. 'index=True' builds the index immediately.
# This might take some time depending on the number and size of PDFs.
collection.init_search(index=True) 
print("Search index initialized.")

natural_pdf.search.searchable_mixin - INFO - Using default collection name 'default_collection' for in-memory service.

natural_pdf.search.searchable_mixin - INFO - Creating new SearchService: name='default_collection', persist=False, model=default

natural_pdf.search.haystack_search_service - INFO - HaystackSearchService initialized for collection='default_collection' (persist=False, model='sentence-transformers/all-MiniLM-L6-v2'). Default path: './natural_pdf_index'

natural_pdf.search - INFO - Created new HaystackSearchService instance for collection 'default_collection'.

natural_pdf.search.searchable_mixin - INFO - index=True: Proceeding to index collection immediately after search initialization.

natural_pdf.search.searchable_mixin - INFO - Starting internal indexing process into SearchService collection 'default_collection'...

natural_pdf.search.searchable_mixin - INFO - Prepared 6 indexable items for indexing.

natural_pdf.search.haystack_search_service - INFO - Index request for collection='default_collection', docs=6, model='sentence-transformers/all-MiniLM-L6-v2', force=False, persist=False

natural_pdf.search.haystack_search_service - INFO - Created SentenceTransformersDocumentEmbedder. Model: sentence-transformers/all-MiniLM-L6-v2, Device: ComponentDevice(_single_device=Device(type=<DeviceType.MPS: 'mps'>, id=None), _multiple_devices=None)

natural_pdf.search.haystack_search_service - INFO - Preparing Haystack Documents from 6 indexable items...

natural_pdf.search.haystack_search_service - INFO - Embedding 6 documents using 'sentence-transformers/all-MiniLM-L6-v2'...

natural_pdf.search.haystack_search_service - INFO - Successfully embedded 6 documents.

natural_pdf.search.haystack_search_service - INFO - Writing 6 embedded documents to store 'default_collection'...

natural_pdf.search.haystack_search_service - INFO - Successfully wrote 6 documents to store 'default_collection'.

natural_pdf.search.haystack_search_service - INFO - Store 'default_collection' document count after write: 6

natural_pdf.search.searchable_mixin - INFO - Successfully completed indexing into SearchService collection 'default_collection'.

Search index initialized.


## Performing a Semantic Search

Once the index is ready, you can use the `find_relevant()` method to search for content semantically related to your query.

In [4]:
# Perform a search query
query = "american president"
results = collection.find_relevant(query)

print(f"Found {len(results)} results for '{query}':")

natural_pdf.search.searchable_mixin - INFO - Searching collection 'default_collection' via HaystackSearchService...

natural_pdf.search.haystack_search_service - INFO - Search request for collection='default_collection', query_type=str, options=TextSearchOptions(top_k=10, retriever_top_k=20, filters=None, use_reranker=True, reranker_instance=None, reranker_model=None, reranker_api_key=None)

natural_pdf.search.haystack_search_service - INFO - Created SentenceTransformersTextEmbedder. Model: sentence-transformers/all-MiniLM-L6-v2, Device: ComponentDevice(_single_device=Device(type=<DeviceType.MPS: 'mps'>, id=None), _multiple_devices=None)

natural_pdf.search.haystack_search_service - INFO - Running retrieval pipeline for collection 'default_collection'...

natural_pdf.search.haystack_search_service - INFO - Retrieved 6 documents.

natural_pdf.search.searchable_mixin - INFO - SearchService returned 6 results from collection 'default_collection'.

Found 6 results for 'american president':
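
The log output shows the two halves of an embedding-based search: documents are embedded and stored at indexing time, then the query is embedded and compared against the stored vectors at search time. The toy sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. It is not natural-pdf's implementation: the vectors, the file names, and the `rank` helper are hypothetical, and whether the library's backend scores with exactly this metric is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "index": (pdf name, page number) -> made-up embedding vector.
# A real embedding model produces vectors with hundreds of dimensions.
toy_index = {
    ("report.pdf", 1): [0.9, 0.1, 0.0],
    ("report.pdf", 2): [0.1, 0.8, 0.1],
    ("log.pdf", 1): [0.0, 0.2, 0.9],
}


def rank(query_vector, index, top_k=2):
    """Return the top_k (score, key) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vector, vec), key) for key, vec in index.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# A made-up query vector that points in nearly the same direction as report.pdf page 1.
ranked = rank([1.0, 0.0, 0.0], toy_index)
for score, (pdf_name, page) in ranked:
    print(f"{pdf_name} page {page}: {score:.4f}")
```

Because ranking depends on vector direction rather than shared words, a page can rank highly without containing any of the query's exact terms.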


## Understanding Search Results

The `find_relevant()` method returns a list of dictionaries, each representing a relevant text chunk found in one of the PDFs. Each result includes:

*   `pdf_path`: The path to the PDF document where the result was found.
*   `page_number`: The page number within the PDF.
*   `score`: A relevance score (higher means more relevant).
*   `content_snippet`: A snippet of the text chunk that matched the query.
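
Since each result is a plain dictionary with the keys listed above, ordinary Python is enough to post-process a result list. The sketch below drops low-scoring hits and groups the rest by document. The helper `group_by_pdf`, the `min_score` threshold, and the sample data are hypothetical; only the key names come from the list above.

```python
from collections import defaultdict


def group_by_pdf(results, min_score=0.0):
    """Group result dictionaries by 'pdf_path', keeping hits with score >= min_score."""
    grouped = defaultdict(list)
    for result in results:
        if result["score"] >= min_score:
            grouped[result["pdf_path"]].append(result)
    return dict(grouped)


# Hypothetical sample data in the shape described above (not real search output).
sample_results = [
    {"pdf_path": "schools.pdf", "page_number": 2, "score": 0.07, "content_snippet": "Library Weeding Log"},
    {"pdf_path": "schools.pdf", "page_number": 5, "score": 0.06, "content_snippet": "Library Weeding Log"},
    {"pdf_path": "inspection.pdf", "page_number": 1, "score": -0.01, "content_snippet": "Violation Count: 7"},
]

grouped = group_by_pdf(sample_results, min_score=0.05)
for pdf_path, hits in grouped.items():
    pages = [hit["page_number"] for hit in hits]
    print(f"{pdf_path}: pages {pages}")
```

A sensible threshold depends on the model and the query, so it is worth inspecting a few real scores before choosing one.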

In [5]:
# Process and display the results
if results:
    for i, result in enumerate(results):
        print(f"  {i+1}. PDF: {result['pdf_path']}")
        print(f"     Page: {result['page_number']} (Score: {result['score']:.4f})")
        # Display a snippet of the content
        snippet = result.get('content_snippet', '')
        print(f"     Snippet: {snippet}...") 
else:
    print("  No relevant results found.")

# You can access the full content if needed via the result object, 
# though 'content_snippet' is usually sufficient for display.

  1. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp__h0cd9h.pdf
     Page: 2 (Score: 0.0708)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
The Anasazi (Removed: 1)
Author: Petersen, David. ISBN: 0-516-01121-9 (trade) Published: 1991
Sit...
  2. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp__h0cd9h.pdf
     Page: 5 (Score: 0.0669)
     Snippet: Library Weeding Log Atlanta Public Schools
From: 8/1/2017 To: 6/30/2023
6/6/2023 - Copies Removed: 130
Centennial Place 33170000562167 $13.10 11/5/1999 33554-43170
Academy (Charter)
Was Available -- W...
  3. PDF: /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmplduy73q3.pdf
     Page: 1 (Score: -0.0040)
     Snippet: Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking  Chicago, Ill.
Date:  February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer m...
  4. PDF: /var/folders/25/h3prywj

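Semantic search lets you query a large set of documents by the meaning and context of your query rather than by exact keyword matches. Treat the scores as a ranking signal: in the run above, the best score is only about 0.07, which suggests that neither sample PDF closely matches the query "american president".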