# Neo4j SimpleKGPipeline Integration

This notebook demonstrates how Vertector's loaders and text splitter integrate with Neo4j's SimpleKGPipeline. It covers:
- Document loading and chunking with rich metadata
- Audio transcription and segment-based chunking
- Multimodal pipeline handling both documents and audio
- State isolation between different modalities
- Metadata preservation through the pipeline

## Setup

In [1]:
from pathlib import Path

from vertector_data_ingestion import setup_logging
from vertector_data_ingestion.integrations.neo4j import (
    MultimodalLoader,
    VertectorAudioLoader,
    VertectorDataLoader,
    VertectorTextSplitter,
)

# Setup logging
setup_logging(log_level="INFO")

print("✓ Neo4j integration components imported successfully")

2026-01-03 16:22:02 | INFO     | vertector_data_ingestion.monitoring.logger:setup_logging:51 - Logging initialized at INFO level


✓ Neo4j integration components imported successfully


## Part 1: Document Processing

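Load a PDF and chunk it with Docling's structure-aware HybridChunker. The splitter is constructed with a reference to the loader, and the chunk metadata shown below (page numbers, bounding boxes) suggests that it chunks the loader's most recently parsed document rather than only the plain text passed to `run`.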

In [2]:
# Initialize document loader and splitter
doc_loader = VertectorDataLoader()
doc_splitter = VertectorTextSplitter(loader=doc_loader, chunk_size=512)

# Load document
pdf_path = Path("../test_documents/2112.13734v2.pdf")
print(f"Loading document: {pdf_path.name}")

doc_result = await doc_loader.run(pdf_path)

print(f"\n✓ Document loaded:")
print(f"  Type: {doc_result.document_info.document_type}")
print(f"  Pages: {doc_result.document_info.metadata['num_pages']}")
print(f"  Processing time: {doc_result.document_info.metadata['processing_time']}s")
print(f"  Text length: {len(doc_result.text)} characters")

2026-01-03 16:22:53 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2026-01-03 16:22:53 | INFO     | vertector_data_ingestion.core.pipeline_router:__init__:55 - Hardware detected: mps
2026-01-03 16:22:53 | INFO     | vertector_data_ingestion.core.universal_converter:__init__:44 - Initialized UniversalConverter on mps
2026-01-03 16:22:53 | INFO     | vertector_data_ingestion.core.universal_converter:_ensure_models_available:67 - Checking model availability...

Loading document: 2112.13734v2.pdf
Consider using the pymupdf_layout package for a greatly improved page layout analysis.


2026-01-03 16:22:54,755 - INFO - Loading plugin 'docling_defaults'
2026-01-03 16:22:54,757 - INFO - Registered picture descriptions: ['vlm', 'api']
2026-01-03 16:22:54,765 - INFO - Loading plugin 'docling_defaults'
2026-01-03 16:22:54,771 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2026-01-03 16:22:55,285 - INFO - Loading plugin 'docling_defaults'
2026-01-03 16:22:55,288 - INFO - Registered layout engines: ['docling_layout_default', 'docling_experimental_table_crops_layout']
2026-01-03 16:22:55,294 - INFO - Accelerator device: 'mps'
2026-01-03 16:22:56,103 - INFO - Loading plugin 'docling_defaults'
2026-01-03 16:22:56,104 - INFO - Registered table structure engines: ['docling_tableformer']
2026-01-03 16:22:56,332 - INFO - Accelerator device: 'mps'
2026-01-03 16:22:57,159 - INFO - Processing document 2112.13734v2.pdf
2026-01-03 16:23:02,772 - INFO - Finished converting document 2112.13734v2.pdf in 8.08 sec.


✓ Document loaded:
  Type: document
  Pages: 4
  Processing time: 8.23647403717041s
  Text length: 17923 characters


In [3]:
# Chunk the document
print("Chunking document with structure-aware HybridChunker...")
doc_chunks = await doc_splitter.run(doc_result.text)

print(f"\n✓ Created {len(doc_chunks.chunks)} document chunks")
print(f"\nFirst chunk:")
print(f"  Text: {doc_chunks.chunks[0].text[:100]}...")
print(f"  Index: {doc_chunks.chunks[0].index}")
print(f"  Metadata keys: {list(doc_chunks.chunks[0].metadata.keys())}")

2026-01-03 16:51:26 | INFO     | vertector_data_ingestion.chunkers.hybrid_chunker:chunk_document:68 - Chunking document: 2112.13734v2.pdf (4 pages)
2026-01-03 16:51:27 | INFO     | vertector_data_ingestion.chunkers.hybrid_chunker:chunk_document:99 - Created 22 chunks


Chunking document with structure-aware HybridChunker...

✓ Created 22 document chunks

First chunk:
  Text: Enoch Tetteh Mila, Quebec AI Institute AMMI, AIMS Rwanda etetteh@aimsammi.org
Joseph Viviano Mila, Q...
  Index: 0
  Metadata keys: ['chunk_id', 'token_count', 'document_id', 'page_no', 'subsection_path', 'bbox']


In [None]:
# Examine rich metadata from multiple chunks
print("Document chunk metadata examples:\n")

for i, chunk in enumerate(doc_chunks.chunks[:3]):
    # subsection_path may be missing or not a string, so normalize before slicing
    section = str(chunk.metadata.get('subsection_path') or 'N/A')[:80]
    print(f"Chunk {i}:")
    print(f"  Token count: {chunk.metadata.get('token_count')}")
    print(f"  Page number: {chunk.metadata.get('page_no', 'N/A')}")
    print(f"  Section: {section}")
    # Use boolean defaults; the string 'False' would be truthy
    print(f"  Is table: {chunk.metadata.get('is_table', False)}")
    print(f"  Is heading: {chunk.metadata.get('is_heading', False)}")
    print()

## Part 2: Audio Processing

Transcribe and chunk audio using Whisper segments with timestamps.

In [4]:
# Initialize audio loader and splitter
audio_loader = VertectorAudioLoader()
audio_splitter = VertectorTextSplitter(loader=audio_loader, chunk_size=512)

# Load audio
audio_path = Path("../test_documents/harvard.wav")
print(f"Loading audio: {audio_path.name}")

audio_result = await audio_loader.run(audio_path)

print(f"\n✓ Audio loaded:")
print(f"  Type: {audio_result.document_info.document_type}")
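# Note: in the run below, metadata['duration'] (1.93s) matches the logged transcription time, not the ~17s span of the clip
print(f"  Duration: {audio_result.document_info.metadata['duration']}s")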
print(f"  Language: {audio_result.document_info.metadata['language']}")
print(f"  Segments: {audio_result.document_info.metadata['segments']}")
print(f"  Model: {audio_result.document_info.metadata['model']}")

2026-01-03 16:53:12 | INFO     | vertector_data_ingestion.audio.audio_factory:create_audio_transcriber:23 - Creating audio transcriber: model=base, backend=auto
2026-01-03 16:53:12 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2026-01-03 16:53:12 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:__init__:56 - Initializing WhisperTranscriber with model=base, device=mlx, backend=auto

Loading audio: harvard.wav


2026-01-03 16:53:15 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:transcribe:193 - Transcription complete in 1.93s: 216 chars, 6 segments



✓ Audio loaded:
  Type: audio
  Duration: 1.9336051940917969s
  Language: en
  Segments: 6
  Model: whisper-base-mlx


In [5]:
# Chunk the audio
print("Chunking audio using Whisper segments...")
audio_chunks = await audio_splitter.run(audio_result.text)

print(f"\n✓ Created {len(audio_chunks.chunks)} audio chunks")
print(f"\nFirst chunk:")
print(f"  Text: {audio_chunks.chunks[0].text}")
print(f"  Index: {audio_chunks.chunks[0].index}")
print(f"  Metadata keys: {list(audio_chunks.chunks[0].metadata.keys())}")

Chunking audio using Whisper segments...

✓ Created 6 audio chunks

First chunk:
  Text: The stale smell of old beer lingers.
  Index: 0
  Metadata keys: ['chunk_id', 'token_count', 'document_id', 'modality', 'start_time', 'end_time', 'duration', 'language']
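
The segment-based chunking above can be sketched without the library. The following is a minimal illustration, not Vertector's implementation: it assumes Whisper-style segments are dictionaries with `start`, `end`, and `text` keys, and that the document ID is the file stem (which is consistent with the `harvard` / `harvard_0` IDs printed in this notebook). The `Chunk` dataclass and `segments_to_chunks` function are hypothetical names. `token_count` is omitted because the tokenizer used by the splitter is not shown here.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Chunk:
    """Hypothetical stand-in for the splitter's chunk objects."""
    text: str
    index: int
    metadata: dict = field(default_factory=dict)


def segments_to_chunks(segments, audio_path, language="en"):
    """Turn Whisper-style segments into one chunk per segment."""
    document_id = Path(audio_path).stem  # "harvard.wav" -> "harvard"
    chunks = []
    for index, seg in enumerate(segments):
        # Round to centiseconds to avoid float noise such as 0.8600000000000003
        start = round(seg["start"], 2)
        end = round(seg["end"], 2)
        chunks.append(
            Chunk(
                text=seg["text"].strip(),
                index=index,
                metadata={
                    "chunk_id": f"{document_id}_{index}",
                    "document_id": document_id,
                    "modality": "audio",
                    "start_time": start,
                    "end_time": end,
                    "duration": round(end - start, 2),
                    "language": language,
                },
            )
        )
    return chunks


segments = [
    {"start": 0.8600000000000003, "end": 3.64, "text": " The stale smell of old beer lingers."},
    {"start": 4.18, "end": 6.18, "text": " It takes heat to bring out the odor."},
]
chunks = segments_to_chunks(segments, "../test_documents/harvard.wav")
for chunk in chunks:
    meta = chunk.metadata
    print(meta["chunk_id"], meta["start_time"], meta["end_time"], meta["duration"])
```

Rounding at chunk-creation time is one way to keep timestamps readable when they are later stored as graph properties; whether the real splitter should round is a design choice left to the library.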


In [6]:
# Examine audio chunk metadata with timestamps
print("Audio chunk metadata (with timestamps):\n")

for i, chunk in enumerate(audio_chunks.chunks):
    print(f"Chunk {i}:")
    print(f"  Text: {chunk.text}")
    print(f"  Start: {chunk.metadata['start_time']}s")
    print(f"  End: {chunk.metadata['end_time']}s")
    print(f"  Duration: {chunk.metadata['duration']}s")
    print(f"  Token count: {chunk.metadata['token_count']}")
    print(f"  Document ID: {chunk.metadata['document_id']}")
    print(f"  Chunk ID: {chunk.metadata['chunk_id']}")
    print()

Audio chunk metadata (with timestamps):

Chunk 0:
  Text: The stale smell of old beer lingers.
  Start: 0.8600000000000003s
  End: 3.64s
  Duration: 2.78s
  Token count: 9
  Document ID: harvard
  Chunk ID: harvard_0

Chunk 1:
  Text: It takes heat to bring out the odor.
  Start: 4.18s
  End: 6.18s
  Duration: 2.0s
  Token count: 9
  Document ID: harvard
  Chunk ID: harvard_1

Chunk 2:
  Text: A cold dip restores health and zest.
  Start: 7.02s
  End: 9.16s
  Duration: 2.1400000000000006s
  Token count: 8
  Document ID: harvard
  Chunk ID: harvard_2

Chunk 3:
  Text: A salt pickle tastes fine with ham.
  Start: 9.96s
  End: 12.0s
  Duration: 2.039999999999999s
  Token count: 8
  Document ID: harvard
  Chunk ID: harvard_3

Chunk 4:
  Text: Tacos al pastor are my favorite.
  Start: 12.68s
  End: 14.32s
  Duration: 1.6400000000000006s
  Token count: 8
  Document ID: harvard
  Chunk ID: harvard_4

Chunk 5:
  Text: A zestful food is the hot cross bun.
  Start: 15.12s
  End: 17.42s
  Duratio

## Part 3: Multimodal Pipeline

Use a single MultimodalLoader to handle both documents and audio, with automatic modality detection.

In [7]:
# Initialize multimodal loader and splitter
multimodal_loader = MultimodalLoader()
multimodal_splitter = VertectorTextSplitter(loader=multimodal_loader, chunk_size=512)

print("✓ Multimodal pipeline initialized")
print("  Supports: PDF, DOCX, PPTX, XLSX, WAV, MP3, M4A, FLAC, OGG")

2026-01-03 16:56:20 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support

✓ Multimodal pipeline initialized
  Supports: PDF, DOCX, PPTX, XLSX, WAV, MP3, M4A, FLAC, OGG
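
Extension-based modality detection, as described for `MultimodalLoader`, could look roughly like the sketch below. It is an assumption about the approach, not the library's code: the extension sets are taken from the "Supports" line above, the return values mirror the `document` and `audio` types printed earlier, and `detect_modality` is a hypothetical helper name.

```python
from pathlib import Path

# Extension sets taken from the "Supports" line printed by the notebook
DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx"}
AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}


def detect_modality(path):
    """Return "document" or "audio" based on the file extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match
    if suffix in DOCUMENT_EXTENSIONS:
        return "document"
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")


for name in ["2112.13734v2.pdf", "harvard.wav", "Talk.MP3"]:
    print(name, "->", detect_modality(name))
```

Raising on unknown extensions, rather than guessing a default modality, keeps a misnamed file from being silently routed to the wrong sub-loader.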


### Test 1: Document → Audio (State Isolation)

In [8]:
# Load and chunk document first
print("Step 1: Processing document...")
doc_result = await multimodal_loader.run(pdf_path)
doc_chunks = await multimodal_splitter.run(doc_result.text)

print(f"✓ Document: {len(doc_chunks.chunks)} chunks")
print(f"  First chunk text: {doc_chunks.chunks[0].text[:80]}...")
print(f"  Has page_no: {'page_no' in doc_chunks.chunks[0].metadata}")
print(f"  Has start_time: {'start_time' in doc_chunks.chunks[0].metadata}")

# Now load and chunk audio (state should be isolated)
print("\nStep 2: Processing audio (after document)...")
audio_result = await multimodal_loader.run(audio_path)
audio_chunks = await multimodal_splitter.run(audio_result.text)

print(f"✓ Audio: {len(audio_chunks.chunks)} chunks")
print(f"  First chunk text: {audio_chunks.chunks[0].text[:80]}...")
print(f"  Has page_no: {'page_no' in audio_chunks.chunks[0].metadata}")
print(f"  Has start_time: {'start_time' in audio_chunks.chunks[0].metadata}")

# Verify state isolation
print("\n✓ State Isolation Verified:")
print(f"  Document chunks != Audio chunks: {doc_chunks.chunks[0].text != audio_chunks.chunks[0].text}")
print(f"  loader.last_document is None: {multimodal_loader.last_document is None}")
print(f"  loader.last_transcription_result is not None: {multimodal_loader.last_transcription_result is not None}")

2026-01-03 16:57:11 | INFO     | vertector_data_ingestion.core.pipeline_router:determine_pipeline:105 - Using Classic pipeline (default) for 2112.13734v2.pdf
2026-01-03 16:57:11 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:175 - Converting 2112.13734v2.pdf with classic pipeline
2026-01-03 16:57:11,588 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-01-03 16:57:11,592 - INFO - Going to convert document batch...
2026-01-03 16:57:11,594 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 870e160bad93d15722a8ae8d62725e09
2026-01-03 16:57:11,594 - INFO - Accelerator device: 'mps'


Step 1: Processing document...


2026-01-03 16:57:12,391 - INFO - Accelerator device: 'mps'
2026-01-03 16:57:13,216 - INFO - Processing document 2112.13734v2.pdf
2026-01-03 16:57:18,891 - INFO - Finished converting document 2112.13734v2.pdf in 7.30 sec.
2026-01-03 16:57:18 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:194 - Converted 2112.13734v2.pdf in 7.43s (4 pages, 0.5 pages/sec)
2026-01-03 16:57:18 | INFO     | vertector_data_ingestion.chunkers.hybrid_chunker:chunk_document:68 - Chunking document: 2112.13734v2.pdf (4 pages)
2026-01-03 16:57:19 | INFO     | vertector_data_ingestion.chunkers.hybrid_chunker:chunk_document:99 - Created 22 chunks
2026-01-03 16:57:19 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:_load_model:90 - Loading Whisper model: base

✓ Document: 22 chunks
  First chunk text: Enoch Tetteh Mila, Quebec AI Institute AMMI, AIMS Rwanda etetteh@aimsammi.org
Jo...
  Has page_no: True
  Has start_time: False

Step 2: Processing audio (after document)...


2026-01-03 16:57:19 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:transcribe:193 - Transcription complete in 0.85s: 216 chars, 6 segments


✓ Audio: 6 chunks
  First chunk text: The stale smell of old beer lingers....
  Has page_no: False
  Has start_time: True

✓ State Isolation Verified:
  Document chunks != Audio chunks: True
  loader.last_document is None: True
  loader.last_transcription_result is not None: True
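
The state-isolation behavior verified above can be modeled in a few lines. This is a hypothetical sketch rather than the actual `MultimodalLoader`: it is synchronous for brevity (the real loader is awaited), stores placeholder dictionaries instead of parsed documents or transcripts, and only reuses the attribute names `last_document` and `last_transcription_result` that the notebook inspects.

```python
from pathlib import Path


class IsolatedMultimodalLoader:
    """Hypothetical sketch: loading one modality clears the other's state."""

    AUDIO_SUFFIXES = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}

    def __init__(self):
        self.last_document = None
        self.last_transcription_result = None

    def run(self, path):
        path = Path(path)
        if path.suffix.lower() in self.AUDIO_SUFFIXES:
            # Audio load: drop any previously parsed document
            self.last_document = None
            self.last_transcription_result = {"source": path.name}
            return self.last_transcription_result
        # Document load: drop any previous transcription
        self.last_transcription_result = None
        self.last_document = {"source": path.name}
        return self.last_document


loader = IsolatedMultimodalLoader()

loader.run("2112.13734v2.pdf")
print("after document:", loader.last_document is not None, loader.last_transcription_result is None)

loader.run("harvard.wav")
print("after audio:   ", loader.last_document is None, loader.last_transcription_result is not None)
```

Clearing the other modality's state on every load means a splitter that reads from the loader can never pick up a stale document when asked to chunk audio, or the reverse.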


### Test 2: Audio → Document (Reverse Order)

In [9]:
# Create fresh loader
fresh_loader = MultimodalLoader()
fresh_splitter = VertectorTextSplitter(loader=fresh_loader, chunk_size=512)

# Load audio first
print("Step 1: Processing audio...")
audio_result = await fresh_loader.run(audio_path)
audio_chunks = await fresh_splitter.run(audio_result.text)

print(f"✓ Audio: {len(audio_chunks.chunks)} chunks")
print(f"  Modality: {audio_chunks.chunks[0].metadata.get('modality', 'N/A')}")

# Then load document
print("\nStep 2: Processing document (after audio)...")
doc_result = await fresh_loader.run(pdf_path)
doc_chunks = await fresh_splitter.run(doc_result.text)

print(f"✓ Document: {len(doc_chunks.chunks)} chunks")
print(f"  Has page_no: {'page_no' in doc_chunks.chunks[0].metadata}")

# Verify state isolation
print("\n✓ Reverse State Isolation Verified:")
print(f"  Audio chunks != Document chunks: {audio_chunks.chunks[0].text != doc_chunks.chunks[0].text}")
print(f"  loader.last_document is not None: {fresh_loader.last_document is not None}")
print(f"  loader.last_transcription_result is None: {fresh_loader.last_transcription_result is None}")

2026-01-03 16:57:56 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support

Step 1: Processing audio...


2026-01-03 16:57:57 | INFO     | vertector_data_ingestion.audio.whisper_transcriber:transcribe:193 - Transcription complete in 0.60s: 216 chars, 6 segments
2026-01-03 16:57:57 | INFO     | vertector_data_ingestion.core.pipeline_router:determine_pipeline:105 - Using Classic pipeline (default) for 2112.13734v2.pdf
2026-01-03 16:57:57 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:175 - Converting 2112.13734v2.pdf with classic pipeline
2026-01-03 16:57:57,994 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]

✓ Audio: 6 chunks
  Modality: audio

Step 2: Processing document (after audio)...


2026-01-03 16:57:58,490 - INFO - Accelerator device: 'mps'
2026-01-03 16:57:59,148 - INFO - Processing document 2112.13734v2.pdf
2026-01-03 16:58:04,327 - INFO - Finished converting document 2112.13734v2.pdf in 6.33 sec.
2026-01-03 16:58:04 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:194 - Converted 2112.13734v2.pdf in 6.39s (4 pages, 0.6 pages/sec)
2026-01-03 16:58:04 | INFO     | vertector_data_ingestion.chunkers.hybrid_chunker:chunk_document:68 - Chunking document: 2112.13734v2.pdf (4 pages)
2026-01-03 16:58:04 | INFO     | vertector_data_ingestion.chunkers.hybrid_chunker:chunk_document:99 - Created 22 chunks


✓ Document: 22 chunks
  Has page_no: True

✓ Reverse State Isolation Verified:
  Audio chunks != Document chunks: True
  loader.last_document is not None: True
  loader.last_transcription_result is None: True


## Part 4: Metadata Comparison

Compare the metadata between document and audio chunks.

In [10]:
import pandas as pd

# Create comparison DataFrame
comparison = pd.DataFrame({
    'Feature': [
        'chunk_id',
        'document_id', 
        'token_count',
        'page_no',
        'section_title',
        'subsection_path',
        'is_table',
        'is_heading',
        'bbox',
        'modality',
        'start_time',
        'end_time',
        'duration',
        'language'
    ],
    'Document Chunks': [
        '✓' if 'chunk_id' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'document_id' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'token_count' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'page_no' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'section_title' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'subsection_path' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'is_table' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'is_heading' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'bbox' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'modality' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'start_time' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'end_time' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'duration' in doc_chunks.chunks[0].metadata else '✗',
        '✓' if 'language' in doc_chunks.chunks[0].metadata else '✗',
    ],
    'Audio Chunks': [
        '✓' if 'chunk_id' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'document_id' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'token_count' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'page_no' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'section_title' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'subsection_path' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'is_table' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'is_heading' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'bbox' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'modality' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'start_time' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'end_time' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'duration' in audio_chunks.chunks[0].metadata else '✗',
        '✓' if 'language' in audio_chunks.chunks[0].metadata else '✗',
    ]
})

print("Metadata Feature Comparison:")
print(comparison.to_string(index=False))

Metadata Feature Comparison:
        Feature Document Chunks Audio Chunks
       chunk_id               ✓            ✓
    document_id               ✓            ✓
    token_count               ✓            ✓
        page_no               ✓            ✗
  section_title               ✗            ✗
subsection_path               ✓            ✗
       is_table               ✗            ✗
     is_heading               ✗            ✗
           bbox               ✓            ✗
       modality               ✗            ✓
     start_time               ✗            ✓
       end_time               ✗            ✓
       duration               ✗            ✓
       language               ✗            ✓
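
The same comparison can be expressed more compactly with set operations. The sketch below hard-codes the two key lists exactly as they were printed for the first document chunk and the first audio chunk earlier in this notebook, so it runs without the library; in the notebook itself the sets would come from `doc_chunks.chunks[0].metadata` and `audio_chunks.chunks[0].metadata`.

```python
# Metadata keys as printed for the first document chunk and first audio chunk
doc_keys = {"chunk_id", "token_count", "document_id", "page_no", "subsection_path", "bbox"}
audio_keys = {
    "chunk_id", "token_count", "document_id", "modality",
    "start_time", "end_time", "duration", "language",
}

shared = sorted(doc_keys & audio_keys)      # present on both modalities
doc_only = sorted(doc_keys - audio_keys)    # layout-specific keys
audio_only = sorted(audio_keys - doc_keys)  # time-specific keys

print("Shared:       ", shared)
print("Document only:", doc_only)
print("Audio only:   ", audio_only)
```

The three shared keys (`chunk_id`, `document_id`, `token_count`) are the ones a downstream consumer can rely on regardless of modality.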


## Part 5: Neo4j Integration Example

Demonstrates how to use these components with Neo4j SimpleKGPipeline.

In [11]:
# Example: How to use with Neo4j SimpleKGPipeline
print("Neo4j SimpleKGPipeline Integration Pattern:\n")

example_code = '''
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from vertector_data_ingestion.integrations.neo4j import (
    MultimodalLoader,
    VertectorTextSplitter
)

# Initialize components
loader = MultimodalLoader()
splitter = VertectorTextSplitter(loader=loader, chunk_size=512)

# Create Neo4j pipeline with the Vertector components plugged in.
# Assumes both components implement neo4j_graphrag's loader and
# splitter component interfaces.
pipeline = SimpleKGPipeline(
    llm=your_llm,
    driver=your_neo4j_driver,
    embedder=your_embedder,
    entities=[...],
    relations=[...],
    pdf_loader=loader,        # replaces the default PDF loader
    text_splitter=splitter,   # replaces the default text splitter
    from_pdf=True             # the pipeline calls the loader for each file_path
)

# run_async accepts file_path (or text); it has no chunks parameter,
# so loading and chunking happen inside the pipeline via the components above
await pipeline.run_async(file_path="document.pdf")
await pipeline.run_async(file_path="meeting.wav")
'''

print(example_code)

## Summary

This notebook demonstrated:

✅ **Document Processing**
- Structure-aware chunking with Docling HybridChunker
- Rich metadata: page numbers, subsection paths, and bounding boxes (`section_title`, `is_table`, and `is_heading` were not present on the first chunk in this run)

✅ **Audio Processing**  
- Whisper transcription with MLX acceleration
- Segment-based chunking with timestamps
- Audio metadata: start_time, end_time, duration, language

✅ **Multimodal Pipeline**
- Single loader handles both documents and audio
- Automatic modality detection by file extension
- Proper state isolation between modalities

✅ **Neo4j Integration**
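- Loader and splitter are designed to plug into Neo4j SimpleKGPipeline
- Chunks carry the metadata needed for knowledge graph construction
- Ingestion into a live Neo4j database is not exercised in this notebook (see Next Steps)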

### Key Features

1. **Property Delegation**: MultimodalLoader properly exposes sub-loader state
2. **State Isolation**: Loading one modality clears the state of the other
3. **Metadata Preservation**: Page, section-path, bounding-box, and timestamp metadata stay attached to each chunk produced by the splitter
4. **Document IDs**: Proper document_id, chunk_id for both documents and audio
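
If chunk metadata is later written as node properties, it is worth remembering that Neo4j property values must be primitives or homogeneous lists of primitives; nested maps and nulls cannot be stored. The sketch below shows one way to normalize a metadata dictionary under that constraint. It is an illustration only: `to_neo4j_properties` is a hypothetical helper, the sample `chunk_id` is made up, and the shape of `bbox` is assumed, since the notebook prints only the key name.

```python
import json

PRIMITIVES = (str, int, float, bool)


def to_neo4j_properties(metadata):
    """Flatten a metadata dict into values Neo4j can store as properties."""
    properties = {}
    for key, value in metadata.items():
        if value is None:
            continue  # null properties are not stored, so skip them
        if isinstance(value, PRIMITIVES):
            properties[key] = value
        elif (
            isinstance(value, (list, tuple))
            and all(isinstance(v, PRIMITIVES) for v in value)
            and len({type(v) for v in value}) <= 1
        ):
            properties[key] = list(value)  # homogeneous list of primitives
        else:
            # Nested or mixed structures are serialized to a JSON string
            properties[key] = json.dumps(value, sort_keys=True)
    return properties


metadata = {
    "chunk_id": "doc_0",                                   # hypothetical ID
    "token_count": 128,
    "page_no": 1,
    "bbox": {"l": 10.0, "t": 20.0, "r": 110.0, "b": 40.0},  # assumed shape
    "section_title": None,
}
for key, value in to_neo4j_properties(metadata).items():
    print(f"{key}: {value!r}")
```

Serializing nested values to JSON keeps them retrievable, at the cost of not being directly queryable in Cypher; splitting `bbox` into separate numeric properties would be the alternative if spatial filtering matters.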

### Next Steps

- Connect to Neo4j database
- Define entity and relation schemas
- Build knowledge graph from multimodal data
- Query and visualize the graph