# Solar Cell Agentic RAG: Docling-to-RAG Pipeline Tutorial

This notebook walks through **the Docling-based RAG pipeline end to end**, reusing the project's own modules rather than re-implementing them.

We will cover:
- Ingestion with `DoclingProcessor` (Docling parsing + hybrid chunking)
- Saving extracted chunks / tables / images to `data/processed`
- Indexing into the vector store
- Running queries through the high-level `RAGPipeline`

> Run this notebook from the project root so that paths like `data/raw` and `src/...` resolve correctly.


In [2]:
# Ensure we can import the project modules
import sys
from pathlib import Path

# Try to auto-detect the project root.
# If you run the notebook from `notebooks/`, this will point one level up.
cwd = Path.cwd()
if (cwd / "pyproject.toml").exists():
    project_root = cwd
else:
    project_root = cwd.parent

# You can also hard-code it if needed, for example:
# project_root = Path(r"C:/Users/My Pc/Downloads/rag4chat-main/SAMI/solar-cell-agentic-rag")

if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print("Project root:", project_root)

# Quick sanity check: can we import the pipeline?
from src.rag_pipeline import RAGPipeline
from src.ingestion.docling_processor import DoclingProcessor
from configuration.config_loader import config

print("Loaded config processing settings:")
print("  enable_ocr:", config.processing.enable_ocr)
print("  max_chunk_tokens:", config.processing.max_chunk_tokens)


Project root: c:\Users\My Pc\Downloads\rag4chat-main\SAMI\solar-cell-agentic-rag


2025-12-18 11:31:12.298 | INFO     | src.agent.graph:create_agent_graph:19 - Creating agent graph...
2025-12-18 11:31:12.320 | INFO     | src.agent.graph:create_agent_graph:47 - Agent graph compiled successfully


Loaded config processing settings:
  enable_ocr: True
  max_chunk_tokens: 256


## 1. Select input PDF(s)

We will work with PDF files from the `data/raw` directory.
You can change `pdf_paths` to point to any other PDFs.

The `DoclingProcessor` will:
- Use Docling to **convert each PDF into a structured document**
- Run the **HybridChunker** to create semantically meaningful text chunks
- Extract **tables** and **images** together with their metadata (page, bounding box, caption)
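
To give a feel for what a token budget such as `max_chunk_tokens: 256` means, here is a deliberately simplified sketch. It is not Docling's `HybridChunker` algorithm: it counts whitespace-separated words instead of model tokens and ignores document structure. The function name `chunk_paragraphs` is hypothetical.

```python
def chunk_paragraphs(paragraphs, max_tokens=256):
    """Greedily pack paragraphs into chunks of at most `max_tokens` words.

    Simplification: a "token" here is a whitespace-separated word, not a
    tokenizer token, and headings or tables are not treated specially.
    """
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        # Split an oversized paragraph into pieces that each fit the budget.
        pieces = [words[i:i + max_tokens] for i in range(0, len(words), max_tokens)]
        for piece in pieces:
            # Start a new chunk when adding this piece would exceed the budget.
            if current and len(current) + len(piece) > max_tokens:
                chunks.append(" ".join(current))
                current = []
            current.extend(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


demo_chunks = chunk_paragraphs(["a b c", "d e", "f g h i j k l"], max_tokens=4)
print(demo_chunks)
```

With a budget of 4 words, the three paragraphs end up in four chunks, because the last paragraph is too long to fit in one.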


In [3]:
from pprint import pprint

raw_dir = project_root / "data" / "raw"
pdf_paths = sorted(raw_dir.glob("*.pdf"))

print("Found PDFs in data/raw:")
for p in pdf_paths:
    print(" -", p.name)

# For the tutorial, pick one or a few PDFs
selected_pdfs = pdf_paths[:2]
print("\nUsing these PDFs:")
for p in selected_pdfs:
    print(" -", p)

selected_pdfs


Found PDFs in data/raw:
 - APC n° 2025-2251 du 14_11_2025.pdf
 - CELEX_62008CJ0059_SUM_EN_TXT.pdf
 - Test.pdf
 - TEST2.pdf

Using these PDFs:
 - c:\Users\My Pc\Downloads\rag4chat-main\SAMI\solar-cell-agentic-rag\data\raw\APC n° 2025-2251 du 14_11_2025.pdf
 - c:\Users\My Pc\Downloads\rag4chat-main\SAMI\solar-cell-agentic-rag\data\raw\CELEX_62008CJ0059_SUM_EN_TXT.pdf


[WindowsPath('c:/Users/My Pc/Downloads/rag4chat-main/SAMI/solar-cell-agentic-rag/data/raw/APC n° 2025-2251 du 14_11_2025.pdf'),
 WindowsPath('c:/Users/My Pc/Downloads/rag4chat-main/SAMI/solar-cell-agentic-rag/data/raw/CELEX_62008CJ0059_SUM_EN_TXT.pdf')]

## 2. Process PDFs with `DoclingProcessor`

This mirrors the logic in `src/ingestion/docling_processor.py`:
- Configure Docling `PdfPipelineOptions` (OCR, table structure, images)
- Convert each PDF to a Docling document object
- Run `HybridChunker` to produce text chunks with metadata
- Extract tables (`TableItem`) and images (`PictureItem`)

We’ll wrap this into a simple loop over the selected PDFs and inspect the result structure.
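
As a rough guide to the result structure, the sketch below checks the keys that the processing loop in this section reads (`doc_id`, `pages`, `chunks`, `tables`, `images`) and returns a compact summary. The helper `summarize_result` is hypothetical, and it assumes each result is a plain dictionary with exactly those keys.

```python
from typing import Any

# Keys the tutorial loop reads from each DoclingProcessor result (assumed complete).
EXPECTED_KEYS = ("doc_id", "pages", "chunks", "tables", "images")


def summarize_result(result: dict[str, Any]) -> dict[str, Any]:
    """Validate the assumed keys and return counts for one processed document."""
    missing = [key for key in EXPECTED_KEYS if key not in result]
    if missing:
        raise KeyError(f"result is missing keys: {missing}")
    return {
        "doc_id": result["doc_id"],
        "pages": result["pages"],
        "n_chunks": len(result["chunks"]),
        "n_tables": len(result["tables"]),
        "n_images": len(result["images"]),
    }
```

Calling `summarize_result(result)` inside the loop would fail early with a clear message if the processor's output format ever changes.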


In [4]:
processor = DoclingProcessor()

processed_results = []
for pdf_path in selected_pdfs:
    print("\nProcessing:", pdf_path.name)
    result = processor.process_document(pdf_path)
    processed_results.append(result)
    
    print("  doc_id:", result["doc_id"])
    print("  pages:", result["pages"])
    print("  #chunks:", len(result["chunks"]))
    print("  #tables:", len(result["tables"]))
    print("  #images:", len(result["images"]))

# Peek at one chunk / table / image metadata
example = processed_results[0]
print("\nExample chunk:")
pprint(example["chunks"][0])

if example["tables"]:
    print("\nExample table:")
    pprint(example["tables"][0])

if example["images"]:
    print("\nExample image:")
    # We only print metadata, not raw image bytes
    img_meta = {k: v for k, v in example["images"][0].items() if k != "image_data"}
    pprint(img_meta)


2025-12-18 11:31:13.116 | INFO     | src.ingestion.docling_processor:__init__:59 - DoclingProcessor initialized (OCR: True, max_tokens: 256)
2025-12-18 11:31:13.119 | INFO     | src.ingestion.docling_processor:process_document:82 - Processing document: APC n° 2025-2251 du 14_11_2025.pdf
2025-12-18 11:31:13,125 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]



Processing: APC n° 2025-2251 du 14_11_2025.pdf


2025-12-18 11:31:13,386 - INFO - Going to convert document batch...
2025-12-18 11:31:13,388 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 5e587a31a6093e580adab56a04d23dec
2025-12-18 11:31:13,467 - INFO - Loading plugin 'docling_defaults'
2025-12-18 11:31:13,471 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-12-18 11:31:13,541 - INFO - Loading plugin 'docling_defaults'
2025-12-18 11:31:13,555 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-12-18 11:31:14,084 - INFO - Accelerator device: 'cpu'
2025-12-18 11:31:18,417 - INFO - Accelerator device: 'cpu'
2025-12-18 11:31:20,208 - INFO - Accelerator device: 'cpu'
2025-12-18 11:31:21,738 - INFO - Processing document APC n° 2025-2251 du 14_11_2025.pdf
2025-12-18 11:34:08,390 - INFO - Finished converting document APC n° 2025-2251 du 14_11_2025.pdf in 175.27 sec.
2025-12-18 11:34:08.394 | INFO     | src.ingestion.docling_processor ...

  doc_id: APC n° 2025-2251 du 14_11_2025_0b1b375a
  pages: 4
  #chunks: 8
  #tables: 0
  #images: 2

Processing: CELEX_62008CJ0059_SUM_EN_TXT.pdf


2025-12-18 11:34:08,765 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-18 11:34:08,803 - INFO - Going to convert document batch...
2025-12-18 11:34:08,812 - INFO - Processing document CELEX_62008CJ0059_SUM_EN_TXT.pdf
2025-12-18 11:34:21,473 - INFO - Finished converting document CELEX_62008CJ0059_SUM_EN_TXT.pdf in 12.94 sec.
2025-12-18 11:34:21.474 | INFO     | src.ingestion.docling_processor:process_document:92 - Document converted: 4 pages
2025-12-18 11:34:21.513 | INFO     | src.ingestion.docling_processor:process_document:96 - Extracted 1 tables
2025-12-18 11:34:21.515 | INFO     | src.ingestion.docling_processor:process_document:100 - Extracted 0 images
2025-12-18 11:34:21.672 | INFO     | src.ingestion.docling_processor:process_document:104 - Created 8 text chunks

  doc_id: CELEX_62008CJ0059_SUM_EN_TXT_10b172e9
  pages: 4
  #chunks: 8
  #tables: 1
  #images: 0

Example chunk:
{'chunk_id': 'APC n° 2025-2251 du 14_11_2025_0b1b375a_chunk_0',
 'content': "Considérant, par conséquent, qu'il y a lieu de réglementer les "
            'conditions délimination de ces composts ainsi que les conditions '
            "dentreposage dans l'attente de leur élimination finale ;\n"
            'Sur proposition du Secrétaire Général de la préfecture de la '
            'Meuse,',
 'doc_id': 'APC n° 2025-2251 du 14_11_2025_0b1b375a',
 'heading': None,
 'page': 2,
 'token_count': 38,
 'type': 'text'}

Example image:
{'bbox': {'x0': 51.5792236328125,
          'x1': 120.62897491455078,
          'y0': 814.3222522735596,
          'y1': 753.4652481079102},
 'caption': None,
 'doc_id': 'APC n° 2025-2251 du 14_11_2025_0b1b375a',
 'image_id': 'APC n° 2025-2251 du 14_11_2025_0b1b375a_img1',
 'page': 1}


## 3. Persist Docling outputs to `data/processed`

The project expects processed artifacts in:
- `data/processed/text` → JSON list of text chunks
- `data/processed/tables` → JSON list of tables
- `data/processed/images` → PNG image files + metadata JSON

We will save the structures produced by `DoclingProcessor` in a format compatible with the existing indexing scripts (e.g. `scripts/index_processed_data.py`).


In [5]:
import json

processed_root = project_root / "data" / "processed"
text_dir = processed_root / "text"
tables_dir = processed_root / "tables"
images_dir = processed_root / "images"

for d in [text_dir, tables_dir, images_dir]:
    d.mkdir(parents=True, exist_ok=True)

for doc in processed_results:
    doc_id = doc["doc_id"]
    
    # --- Save text chunks ---
    text_path = text_dir / f"{doc_id}_chunks.json"
    with text_path.open("w", encoding="utf-8") as f:
        json.dump(doc["chunks"], f, ensure_ascii=False, indent=2)
    print("Saved text chunks to", text_path)
    
    # --- Save tables ---
    if doc["tables"]:
        tables_path = tables_dir / f"{doc_id}_tables.json"
        # Optionally attach doc_id to each table
        tables_with_doc = []
        for t in doc["tables"]:
            t = dict(t)
            t.setdefault("doc_id", doc_id)
            tables_with_doc.append(t)
        with tables_path.open("w", encoding="utf-8") as f:
            json.dump(tables_with_doc, f, ensure_ascii=False, indent=2)
        print("Saved tables to", tables_path)
    
    # --- Save images + metadata ---
    if doc["images"]:
        images_meta = []
        for img_info in doc["images"]:
            img_id = img_info["image_id"]
            pil_image = img_info["image_data"]
            img_file = images_dir / f"{img_id}.png"
            
            # Save the image
            pil_image.save(img_file)
            
            # Prepare metadata record
            meta = {
                "image_id": img_id,
                "doc_id": img_info["doc_id"],
                "page": img_info["page"],
                "image_path": str(img_file.relative_to(project_root)),
                "bbox": img_info.get("bbox"),
                "caption": img_info.get("caption"),
            }
            images_meta.append(meta)
        
        meta_path = images_dir / f"{doc_id}_images_metadata.json"
        with meta_path.open("w", encoding="utf-8") as f:
            json.dump(images_meta, f, ensure_ascii=False, indent=2)
        print("Saved image metadata to", meta_path)


Saved text chunks to c:\Users\My Pc\Downloads\rag4chat-main\SAMI\solar-cell-agentic-rag\data\processed\text\APC n° 2025-2251 du 14_11_2025_0b1b375a_chunks.json
Saved image metadata to c:\Users\My Pc\Downloads\rag4chat-main\SAMI\solar-cell-agentic-rag\data\processed\images\APC n° 2025-2251 du 14_11_2025_0b1b375a_images_metadata.json
Saved text chunks to c:\Users\My Pc\Downloads\rag4chat-main\SAMI\solar-cell-agentic-rag\data\processed\text\CELEX_62008CJ0059_SUM_EN_TXT_10b172e9_chunks.json
Saved tables to c:\Users\My Pc\Downloads\rag4chat-main\SAMI\solar-cell-agentic-rag\data\processed\tables\CELEX_62008CJ0059_SUM_EN_TXT_10b172e9_tables.json
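
To confirm that a saved file round-trips correctly, including non-ASCII characters such as the `°` in the first document's name, it can be read back with the same encoding. This is a small sketch; `load_records` is a hypothetical helper, not a project function.

```python
import json
from pathlib import Path


def load_records(path: Path) -> list:
    """Load one JSON file written by the persistence cell and return its list."""
    with path.open("r", encoding="utf-8") as f:
        records = json.load(f)
    # The cell above always dumps a list, so anything else signals a bad file.
    if not isinstance(records, list):
        raise ValueError(f"{path} does not contain a JSON list")
    return records
```

For example, `load_records(text_dir / f"{doc_id}_chunks.json")` should return the same list of chunk dictionaries that was saved.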


## 4. Index processed data into the vector store

At this point, `data/processed` contains the Docling-derived chunks, tables, and images.

Instead of re-implementing the indexing logic, we use the high-level `RAGPipeline.index_processed_data`, which internally:
- Loads JSON chunk/table/image files from `data/processed`
- Uses `CLIPEmbedder` to embed text and images
- Stores them in `VectorStore` (Chroma)

This is similar to the behavior of `scripts/index_processed_data.py`, but callable from Python.
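
Before indexing, it can help to count what is actually on disk, so the numbers reported by the pipeline can be compared against something. The sketch below relies only on the directory layout and file names used when saving in the previous section; `count_processed_items` is a hypothetical helper and does not reproduce what `index_processed_data` does internally.

```python
import json
from pathlib import Path

# Sub-directory -> filename pattern, matching the names used when saving.
PATTERNS = {
    "text": "*_chunks.json",
    "tables": "*_tables.json",
    "images": "*_images_metadata.json",
}


def count_processed_items(processed_root: Path) -> dict[str, int]:
    """Count JSON records per artifact type under `processed_root`."""
    counts = {}
    for subdir, pattern in PATTERNS.items():
        directory = processed_root / subdir
        total = 0
        if directory.is_dir():
            for path in sorted(directory.glob(pattern)):
                with path.open("r", encoding="utf-8") as f:
                    total += len(json.load(f))  # each file holds a JSON list
        counts[subdir] = total
    return counts
```

Running `count_processed_items(project_root / "data" / "processed")` gives a baseline to compare with the counts and statistics printed after indexing.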


In [6]:
pipeline = RAGPipeline(auto_index=False)

print("Indexing processed data into the vector store...")
index_counts = pipeline.index_processed_data()

print("\nIndexed items:")
print("  text chunks:", index_counts.get("text"))
print("  tables:", index_counts.get("tables"))
print("  images:", index_counts.get("images"))

stats = pipeline.get_statistics()
print("\nVector store statistics:")
print("  total_documents:", stats["total_documents"]) 
print("  by_type:", stats["by_type"])


2025-12-18 11:34:21.901 | INFO     | src.rag_pipeline:__init__:49 - Initializing RAG Pipeline...
2025-12-18 11:34:22.613 | INFO     | src.ingestion.docling_processor:__init__:59 - DoclingProcessor initialized (OCR: True, max_tokens: 256)
2025-12-18 11:34:22.617 | INFO     | src.ingestion.image_processor:__init__:29 - ImageProcessor initialized (output: C:\Users\My Pc\Downloads\rag4chat-main\SAMI\solar-cell-agentic-rag\data\processed\images)
2025-12-18 11:34:22.622 | INFO     | src.ingestion.pipeline:__init__:40 - IngestionPipeline initialized
2025-12-18 11:34:22.626 | INFO     | src.embeddings.clip_embedder:__init__:40 - Initializing CLIP embedder: openai/clip-vit-base-patch32 on cpu
2025-12-18 11:34:26.462 | INFO     ...

Indexing processed data into the vector store...

Indexed items:
  text chunks: 0
  tables: 0
  images: 0

Vector store statistics:
  total_documents: 53
  by_type: {'table': 8, 'image': 6, 'text': 39}


## 5. Ask questions with the RAG pipeline

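With the Docling-parsed and chunked content in the vector store, we can issue natural-language questions. If `index_processed_data` reports zero newly indexed items, as in the run above, check `total_documents` in the statistics to confirm that the store already holds your documents before querying.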

Here we show two options:
- **Direct `RAGPipeline.query`** → assumes data is already in the vector store
- **`process_and_query`** → ingest + index + query in one call (handy for ad-hoc PDFs)

We’ll start with `query` to reuse the data we just ingested.


In [7]:
question = "What is the maximum power output?"

print("Question:", question)

result = pipeline.query(question)

print("\nAnswer (truncated):")
print(result["answer"][:500])

print("\nMetadata:")
for k, v in result.get("metadata", {}).items():
    print(f"  {k}: {v}")


2025-12-18 11:34:27.659 | INFO     | src.rag_pipeline:query:297 - Processing query: 'What is the maximum power output?' (history: 0 messages)
2025-12-18 11:34:27.709 | INFO     | src.agent.nodes:retrieve_node:35 - [RETRIEVE NODE] Processing query: 'What is the maximum power output?'
2025-12-18 11:34:27.720 | INFO     | src.agent.tools:search_knowledge_base:69 - Tool called: search_knowledge_base(query='What is the maximum power output?...', top_k=None)
2025-12-18 11:34:27.723 | INFO     | src.agent.tools:_get_retriever:27 - Initializing retrieval components for agent tool
2025-12-18 11:34:27.724 | INFO     | src.embeddings.clip_embedder:__init__:40 - Initializing CLIP embedder: openai/clip-vit-base-patch32 on cpu


Question: What is the maximum power output?


2025-12-18 11:34:30.371 | INFO     | src.embeddings.clip_embedder:__init__:49 - CLIP model loaded successfully (device: cpu)
2025-12-18 11:34:30.373 | INFO     | src.retrieval.vector_store:__init__:36 - Initializing ChromaDB at C:\Users\My Pc\Downloads\rag4chat-main\SAMI\solar-cell-agentic-rag\data\vectordb
2025-12-18 11:34:30.428 | INFO     | src.retrieval.vector_store:__init__:53 - Vector store initialized: collection 'datasheet_collection' with 53 documents
2025-12-18 11:34:30.430 | INFO     | src.retrieval.unified_retriever:__init__:39 - UnifiedRetriever initialized (top_k: 30)
2025-12-18 11:34:30.433 | INFO     | src.retrieval.reranker:__init__:47 - JinaReranker initialized (model: jina-colbert-v2, top_n: 15)


Answer (truncated):
I apologize, but I'm unable to generate an answer at this time due to technical difficulties. Please try again later.

Metadata:
  retrieval_time: 3.4383327960968018
  num_chunks_retrieved: 15
  num_images_retrieved: 0
  retrieval_stats: {'total_retrieved': 30, 'text_chunks': 30, 'images': 0, 'avg_score': 0.7571517209211985, 'min_score': 0.7256559133529663, 'max_score': 0.7974576950073242}
  synthesis_time: 8.031483888626099
  model_used: fallback
  confidence: low
  sources_used: 0
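
In the run above the metadata reports `model_used: fallback` and `sources_used: 0`, so the printed text is a placeholder rather than a grounded answer. A small guard can make that explicit in downstream code. This sketch assumes the metadata keys shown in this output; `is_fallback_answer` is a hypothetical helper.

```python
def is_fallback_answer(result: dict) -> bool:
    """Return True when the metadata suggests no real answer was generated.

    Assumes the `metadata` keys printed by the query cell: `model_used`
    and `sources_used`. Missing keys are not treated as a fallback.
    """
    metadata = result.get("metadata", {})
    return (
        metadata.get("model_used") == "fallback"
        or metadata.get("sources_used") == 0
    )
```

A caller could check `is_fallback_answer(result)` before displaying `result["answer"]` and, if it is true, inspect the synthesis logs instead.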


### Optional: One-shot `process_and_query`

If you want to go **from raw PDFs directly to an answer in one call**, you can use `RAGPipeline.process_and_query`.

This will internally:
1. Use `DoclingProcessor` and the configured chunking strategy to process the PDFs
2. Index the outputs
3. Run a RAG query and return answer + metadata.


In [8]:
# Example: end-to-end in one call (commented out to avoid re-indexing every run)
#
# end_to_end_result = pipeline.process_and_query(
#     pdf_paths=[str(p) for p in selected_pdfs],
#     question="Summarize the main performance characteristics of this solar cell."
# )
#
# print("Answer:")
# print(end_to_end_result["query"]["answer"])
#
# print("\nIngestion summary:")
# print(end_to_end_result["ingestion"])
#
# print("\nIndexing summary:")
# print(end_to_end_result["indexing"])


## 6. Recap

In this notebook we:
- Used **`DoclingProcessor`** to parse PDFs and produce structured content (chunks, tables, images)
- Persisted these artifacts under `data/processed` in a format expected by the project
- Called **`RAGPipeline.index_processed_data`** to embed and store them in the vector store
- Queried the system via **`RAGPipeline.query`** and (optionally) `process_and_query`

You can now adapt this notebook to:
- Point to your own PDFs in `data/raw`
- Change chunking / OCR parameters in `configuration/config.yaml`
- Experiment with different questions and evaluation scripts in `evaluation/`.
