# Document Ingestion and Indexing Pipeline

This notebook implements the document ingestion and indexing pipeline for the Internal Knowledge Base Q&A RAG Agent.

## Pipeline Overview
1. **Setup and Imports** - Install dependencies and import required libraries
2. **Configure Document Connectors** - Set up LlamaIndex readers for PDFs, Markdown, Notion, and Google Docs
3. **Load Documents** - Read documents from the `sample-datasets` folder
4. **Chunk Documents** - Break documents into manageable segments
5. **Generate Embeddings** - Convert chunks to vector embeddings
6. **Build Vector Index** - Store embeddings in a vector index for retrieval
7. **Test Retrieval** - Query the index to verify it works
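
To make steps 3 to 7 concrete before the real implementation, here is a toy, standard-library-only sketch of the same flow. It is not the LlamaIndex pipeline built in this notebook: the corpus, the bag-of-words "embedding", and the names `DOCS`, `chunk`, `embed`, `build_index`, and `retrieve` are all illustrative assumptions.

```python
import math
import re
from collections import Counter

# Step 3 stand-in: a made-up corpus instead of files read from disk.
DOCS = {
    "handbook.md": "Install Docker before cloning the repository.\n\nRequest VPN access from IT.",
    "troubleshooting.md": "If Docker fails to start, restart the daemon.",
}

def chunk(text):
    """Step 4: split a document into paragraph-level chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def embed(text):
    """Step 5: toy 'embedding' as a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(docs):
    """Step 6: store (source, chunk, vector) triples in a plain list."""
    return [(name, c, embed(c)) for name, text in docs.items() for c in chunk(text)]

def retrieve(index, query, top_k=1):
    """Step 7: rank chunks by similarity to the query and return the best ones."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[2]), reverse=True)
    return [(name, text) for name, text, _ in ranked[:top_k]]

index = build_index(DOCS)
results = retrieve(index, "How do I get VPN access?")
print(results)
```

In the real pipeline, a neural embedding model replaces the word counts and a vector store replaces the list, but the shape of the data flow is the same.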

## 1. Setup and Imports

Install required dependencies and import necessary libraries for the pipeline.

In [None]:
# Install required packages
%pip install llama-index
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-gemini
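%pip install llama-index-vector-stores-chroma
%pip install google-generativeai
%pip install chromadb
# python-dotenv provides the `dotenv` module imported in the API-key cell below
%pip install python-dotenv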

Collecting llama-index
  Using cached llama_index-0.14.5-py3-none-any.whl.metadata (13 kB)
Collecting llama-index-cli<0.6,>=0.5.0 (from llama-index)
  Using cached llama_index_cli-0.5.3-py3-none-any.whl.metadata (1.4 kB)
Collecting llama-index-embeddings-openai<0.6,>=0.5.0 (from llama-index)
  Using cached llama_index_embeddings_openai-0.5.1-py3-none-any.whl.metadata (400 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.4.0 (from llama-index)
  Using cached llama_index_indices_managed_llama_cloud-0.9.4-py3-none-any.whl.metadata (3.7 kB)
Collecting llama-index-llms-openai<0.7,>=0.6.0 (from llama-index)
  Using cached llama_index_llms_openai-0.6.5-py3-none-any.whl.metadata (3.0 kB)
Collecting llama-index-readers-file<0.6,>=0.5.0 (from llama-index)
  Using cached llama_index_readers_file-0.5.4-py3-none-any.whl.metadata (5.7 kB)
Collecting llama-index-readers-llama-parse>=0.4.0 (from llama-index)
  Using cached llama_index_readers_llama_parse-0.5.1-py3-none-any.whl.metadata (3.

In [7]:
# Import core libraries
import os
import sys
from pathlib import Path
from typing import List, Dict, Any

# LlamaIndex core imports
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    Settings,
    Document
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.vector_stores.chroma import ChromaVectorStore

# ChromaDB for vector storage
import chromadb

print("✅ All imports successful!")

✅ All imports successful!


In [8]:
# Set up paths
PROJECT_ROOT = Path("/Users/teddytesfa/projects/AI-data-science-and-ML/Enterprise Internal Knowlwge Base Q&A Agentic RAG")
SAMPLE_DATA_DIR = PROJECT_ROOT / "resources" / "sample-datasets"
VECTOR_DB_DIR = PROJECT_ROOT / "data" / "vector_db"

# Create necessary directories
VECTOR_DB_DIR.mkdir(parents=True, exist_ok=True)

print(f"📁 Project Root: {PROJECT_ROOT}")
print(f"📄 Sample Data Directory: {SAMPLE_DATA_DIR}")
print(f"💾 Vector DB Directory: {VECTOR_DB_DIR}")
print(f"\n✅ Paths configured successfully!")

📁 Project Root: /Users/teddytesfa/projects/AI-data-science-and-ML/Enterprise Internal Knowlwge Base Q&A Agentic RAG
📄 Sample Data Directory: /Users/teddytesfa/projects/AI-data-science-and-ML/Enterprise Internal Knowlwge Base Q&A Agentic RAG/resources/sample-datasets
💾 Vector DB Directory: /Users/teddytesfa/projects/AI-data-science-and-ML/Enterprise Internal Knowlwge Base Q&A Agentic RAG/data/vector_db

✅ Paths configured successfully!


In [10]:
# Configure API keys (if needed)
# For Gemini LLM
import dotenv

dotenv.load_dotenv()

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY", "")

if not GOOGLE_API_KEY:
    print("⚠️  WARNING: GOOGLE_API_KEY not found in environment variables.")
    print("   Please set it using: export GOOGLE_API_KEY='your-api-key'")
    print("   Or set it directly in this cell (not recommended for production)")
    # GOOGLE_API_KEY = "your-api-key-here"  # Uncomment and add your key if needed
else:
    print("✅ GOOGLE_API_KEY found!")
    os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

✅ GOOGLE_API_KEY found!
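
The cell above only prints a warning when the key is missing, so a later cell can still fail with a less obvious error. One possible alternative, sketched below with the standard library only, is to fail fast and to mask the key whenever it is printed. The helpers `require_env` and `mask` and the variable `DEMO_API_KEY` are hypothetical and not part of this notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to a .env file.")
    return value

def mask(secret: str, visible: int = 4) -> str:
    """Show only the last few characters so logs never contain the full key."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]

# Demo with a placeholder value; a real key would come from the environment.
os.environ["DEMO_API_KEY"] = "abcd1234efgh5678"
print(mask(require_env("DEMO_API_KEY")))
```

Raising early keeps the failure next to its cause, which is useful when the notebook is run top to bottom in a fresh environment.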


In [11]:
# Initialize global settings for LlamaIndex
# Using HuggingFace embeddings (free, local) and Gemini 2.0 Flash (experimental) for the LLM

# Set up embedding model
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",  # Lightweight, high-quality embedding model
    cache_folder=str(PROJECT_ROOT / "models")
)

# Set up LLM. context.json specifies Gemini 2.5 Pro, but this cell currently
# runs the experimental Gemini 2.0 Flash model.
llm = Gemini(
    model="models/gemini-2.0-flash-exp",
    api_key=GOOGLE_API_KEY or None
)

# Configure global settings
Settings.embed_model = embed_model
Settings.llm = llm
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("✅ LlamaIndex settings configured:")
print("   - Embedding Model: BAAI/bge-small-en-v1.5")
print("   - LLM: Gemini 2.0 Flash Exp")
print(f"   - Chunk Size: {Settings.chunk_size}")
print(f"   - Chunk Overlap: {Settings.chunk_overlap}")

✅ LlamaIndex settings configured:
   - Embedding Model: BAAI/bge-small-en-v1.5
   - LLM: Gemini 2.0 Flash Exp
   - Chunk Size: 512
   - Chunk Overlap: 50
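
To show how a chunk size of 512 and an overlap of 50 interact, here is a simplified sliding-window sketch. It assumes the text is already tokenized into a list and ignores sentence boundaries, which LlamaIndex's `SentenceSplitter` tries to respect, so actual chunk boundaries in the pipeline will differ. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this many tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the input
    return chunks

# 1,200 placeholder tokens standing in for a tokenized document.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 276]
```

Each window starts 462 tokens after the previous one, so the last 50 tokens of a chunk reappear at the start of the next. The overlap helps keep context that would otherwise be cut at a chunk boundary.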


## 2. Configure Document Connectors

Set up LlamaIndex readers to support multiple document formats:
- **Markdown files** (.md)
- **PDF documents** (.pdf)
- **Notion pages** (exported)
- **Google Docs** (exported or via API)

For v1.0, we'll use `SimpleDirectoryReader`, which selects a parser for each file based on its extension and so handles multiple formats without extra configuration.

In [12]:
# Configure document connectors
# LlamaIndex's SimpleDirectoryReader supports multiple formats out of the box

class DocumentConnector:
    """
    Unified document connector for loading various file formats.
    Supports: PDF, Markdown, TXT, DOCX, and more.
    """
    
    def __init__(self, data_dir: Path):
        self.data_dir = data_dir
        self.supported_formats = {
            'markdown': ['.md', '.markdown'],
            'pdf': ['.pdf'],
            'text': ['.txt'],
            'docx': ['.docx'],
            # Notion exports are typically Markdown, so they are counted under
            # 'markdown'; a separate 'notion' entry with '.md' would never match.
        }
    
    def get_supported_files(self) -> Dict[str, List[Path]]:
        """Scan directory and categorize files by format."""
        files_by_type = {fmt: [] for fmt in self.supported_formats.keys()}
        
        if not self.data_dir.exists():
            print(f"⚠️  Directory not found: {self.data_dir}")
            return files_by_type
        
        for file_path in self.data_dir.rglob('*'):
            if file_path.is_file():
                suffix = file_path.suffix.lower()
                for fmt, extensions in self.supported_formats.items():
                    if suffix in extensions:
                        files_by_type[fmt].append(file_path)
                        break
        
        return files_by_type
    
    def load_documents(self, file_extensions: List[str] = None) -> List[Document]:
        """
        Load documents from the data directory.
        
        Args:
            file_extensions: List of file extensions to load (e.g., ['.md', '.pdf'])
                           If None, loads all supported formats.
        
        Returns:
            List of LlamaIndex Document objects
        """
        try:
            reader = SimpleDirectoryReader(
                input_dir=str(self.data_dir),
                required_exts=file_extensions,
                recursive=True,
                filename_as_id=True  # Use filename as document ID
            )
            
            documents = reader.load_data()
            return documents
            
        except Exception as e:
            print(f"❌ Error loading documents: {e}")
            return []
    
    def display_summary(self):
        """Display a summary of available documents."""
        files_by_type = self.get_supported_files()
        
        print("📚 Document Connector Summary")
        print("=" * 60)
        print(f"Data Directory: {self.data_dir}")
        print(f"\nSupported File Types:")
        
        total_files = 0
        for fmt, files in files_by_type.items():
            count = len(files)
            total_files += count
            if count > 0:
                print(f"  • {fmt.capitalize()}: {count} file(s)")
                for file in files:
                    print(f"    - {file.name}")
        
        print(f"\n📊 Total Files: {total_files}")
        print("=" * 60)

# Initialize the document connector
connector = DocumentConnector(SAMPLE_DATA_DIR)
connector.display_summary()

📚 Document Connector Summary
Data Directory: /Users/teddytesfa/projects/AI-data-science-and-ML/Enterprise Internal Knowlwge Base Q&A Agentic RAG/resources/sample-datasets

Supported File Types:
  • Markdown: 3 file(s)
    - company_handbook.md
    - troubleshooting_local_setup.md
    - project_nexus_onboarding_guide.md

📊 Total Files: 3


In [13]:
# Test the connector by loading documents
print("🔄 Loading documents from sample-datasets folder...\n")

# Load all supported documents
documents = connector.load_documents()

print(f"✅ Successfully loaded {len(documents)} document(s)\n")

# Display document details
if documents:
    print("📄 Document Details:")
    print("-" * 60)
    for idx, doc in enumerate(documents, 1):
        # Get metadata
        filename = doc.metadata.get('file_name', 'Unknown')
        file_path = doc.metadata.get('file_path', 'Unknown')
        
        # Get content preview (first 150 characters)
        content_preview = doc.text[:150].replace('\n', ' ') + "..."
        
        print(f"\n{idx}. {filename}")
        print(f"   Path: {file_path}")
        print(f"   Length: {len(doc.text)} characters")
        print(f"   Preview: {content_preview}")
    print("\n" + "-" * 60)
else:
    print("⚠️  No documents loaded. Please check the data directory.")

🔄 Loading documents from sample-datasets folder...



✅ Successfully loaded 3 document(s)

📄 Document Details:
------------------------------------------------------------

1. company_handbook.md
   Path: /Users/teddytesfa/projects/AI-data-science-and-ML/Enterprise Internal Knowlwge Base Q&A Agentic RAG/resources/sample-datasets/company_handbook.md
   Length: 1172 characters
   Preview: # Company Handbook  ## New Hire Onboarding  ### Standard Local Development Environment Setup  Welcome to the company! This guide will walk you through...

2. project_nexus_onboarding_guide.md
   Path: /Users/teddytesfa/projects/AI-data-science-and-ML/Enterprise Internal Knowlwge Base Q&A Agentic RAG/resources/sample-datasets/project_nexus_onboarding_guide.md
   Length: 1200 characters
   Preview: # Project 'Nexus' - Onboarding Guide  ## Local Setup for Project Nexus  This guide will help you set up your local environment for Project Nexus.  **1...

3. troubleshooting_local_setup.md
   Path: /Users/teddytesfa/projects/AI-data-science-and-ML/Enterprise Inter

### Document Connector Configuration Notes

**Current Setup (v1.0):**
- Using `SimpleDirectoryReader` for local file loading
- Supports: Markdown (.md), PDF (.pdf), TXT (.txt), DOCX (.docx)
- Automatically detects file types and applies appropriate parsers

**Future Enhancements:**
- **Google Docs Integration**: Use `GoogleDocsReader` from `llama-index-readers-google`
- **Notion Integration**: Use `NotionPageReader` from `llama-index-readers-notion`
- **Cloud Storage**: Add support for Google Drive and OneDrive connectors
- **Custom Parsers**: Implement specialized parsers for company-specific formats

**For v1.0**, we're focusing on the Markdown files in `resources/sample-datasets/`.