# Pluggable Architecture: Vector Stores & Embeddings Providers

This notebook demonstrates the pluggable architecture that enables mixing and matching different vector stores and embeddings providers.

## What You'll Learn

1. **Vector Store Options**: Azure AI Search vs ChromaDB
2. **Embeddings Options**: Azure OpenAI, Hugging Face, Cohere, OpenAI
3. **Mix & Match**: Combine any vector store with any embeddings provider
4. **Offline Setup**: Run completely offline with ChromaDB + Hugging Face
5. **Cost Optimization**: Choose free or low-cost options

## Supported Combinations (8 total)

| Vector Store | Embeddings | Offline | Use Case |
|--------------|------------|---------|----------|
| Azure Search | Azure OpenAI | ❌ | Production cloud (default) |
| Azure Search | Hugging Face | ❌ | Hybrid (save on embedding costs) |
| Azure Search | Cohere | ❌ | Cloud optimized |
| Azure Search | OpenAI | ❌ | Native OpenAI |
| ChromaDB | Azure OpenAI | ❌ | Local storage, cloud embeddings |
| **ChromaDB** | **Hugging Face** | **✅** | **Fully offline!** |
| ChromaDB | Cohere | ❌ | Local storage, cloud embeddings |
| ChromaDB | OpenAI | ❌ | Local storage, cloud embeddings |
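
The eight rows are the cross product of two vector stores and four embeddings providers, and a combination is offline only when both halves run locally. A minimal sketch of that rule in plain Python follows; the names `VECTOR_STORES`, `EMBEDDINGS`, and `is_offline` are illustrative and not part of `ingestor`.

```python
from itertools import product

# Illustrative labels only; these are not ingestor identifiers.
# Each entry maps a component to whether it can run without network access
# (Hugging Face models still need one initial download).
VECTOR_STORES = {"Azure Search": False, "ChromaDB": True}
EMBEDDINGS = {
    "Azure OpenAI": False,
    "Hugging Face": True,
    "Cohere": False,
    "OpenAI": False,
}


def is_offline(store: str, embeddings: str) -> bool:
    """A combination is offline only if both components run locally."""
    return VECTOR_STORES[store] and EMBEDDINGS[embeddings]


combinations = list(product(VECTOR_STORES, EMBEDDINGS))
offline = [pair for pair in combinations if is_offline(*pair)]

print(f"{len(combinations)} combinations, offline: {offline}")
```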

## Example 1: Default Configuration (Azure Search + Azure OpenAI)

This is the original configuration; it continues to work without changes.

In [None]:
import os
from ingestor import ConfigBuilder, Pipeline

# Legacy configuration (still works!)
config = (
    ConfigBuilder()
    .with_local_files("../test_data/*.pdf")
    .with_azure_search(
        service_name="your-search-service",
        index_name="documents",
        api_key="your-key"
    )
    .with_azure_openai(
        endpoint="https://your-openai.openai.azure.com/",
        api_key="your-key",
        embedding_deployment="text-embedding-ada-002"
    )
    .with_local_artifacts("../artifacts")
    .build()
)

print(f"Vector Store Mode: {config.vector_store_mode}")
print(f"Embeddings Mode: {config.embeddings_mode}")
print("\n✅ Backward compatibility maintained!")

## Example 2: Fully Offline (ChromaDB + Hugging Face)

Process documents completely offline with no cloud dependencies or API costs.

In [None]:
# Install dependencies first (run once):
# !pip install -r ../../requirements-chromadb.txt
# !pip install -r ../../requirements-embeddings.txt

from ingestor import ConfigBuilder, Pipeline
from ingestor.config import VectorStoreMode, EmbeddingsMode

# Fully offline configuration
config = (
    ConfigBuilder()
    .with_local_files("../test_data/*.pdf")
    .with_local_artifacts("../artifacts_offline")
    .build()
)

# Override for ChromaDB + Hugging Face
config.vector_store_mode = VectorStoreMode.CHROMADB
config.vector_store_config = {
    'collection_name': 'offline-docs',
    'persist_directory': './chroma_db',
    'batch_size': 1000
}

config.embeddings_mode = EmbeddingsMode.HUGGINGFACE
config.embeddings_config = {
    'model_name': 'sentence-transformers/all-MiniLM-L6-v2',
    'device': 'cpu',
    'batch_size': 32,
    'normalize_embeddings': True
}

# Disable Azure services
config.media_describer_mode = 'disabled'

print("Configuration:")
print(f"  Vector Store: {config.vector_store_mode}")
print(f"  Embeddings: {config.embeddings_mode}")
print("  Offline: ✅ YES")
print("  Cost: $0/month")

# Note: First run will download the model (~90MB)
# Subsequent runs will use cached model from ~/.cache/huggingface/

# Run pipeline
# pipeline = Pipeline(config)
# await pipeline.run()
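
The `normalize_embeddings` setting above presumably corresponds to the flag of the same name on sentence-transformers' `encode` method, which scales each vector to unit length. That mapping inside `ingestor` is an assumption. The sketch below shows the operation itself in plain Python and why it matters for search: once vectors have unit length, the dot product equals cosine similarity. The helpers `l2_normalize` and `dot` are illustrative names.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

# For unit-length vectors the dot product is the cosine similarity.
print("unit vector:", a)
print("cosine similarity:", dot(a, b))
```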

## Example 3: Using Environment Variables

The easiest way to configure the pipeline is through environment variables.

In [None]:
import os

# Set environment variables
os.environ.update({
    # ChromaDB configuration
    'VECTOR_STORE_MODE': 'chromadb',
    'CHROMADB_COLLECTION_NAME': 'my-documents',
    'CHROMADB_PERSIST_DIR': './chroma_db',
    
    # Hugging Face configuration
    'EMBEDDINGS_MODE': 'huggingface',
    'HUGGINGFACE_MODEL_NAME': 'all-MiniLM-L6-v2',
    'HUGGINGFACE_DEVICE': 'cpu',
    
    # Input/Output
    'INPUT_MODE': 'local',
    'LOCAL_INPUT_GLOB': '../test_data/*.pdf',
    'ARTIFACTS_MODE': 'local',
    'LOCAL_ARTIFACTS_DIR': '../artifacts',
})

# Load configuration from environment
from ingestor.config import PipelineConfig

# This will fail because we're missing some required configs, but shows auto-detection
try:
    config = PipelineConfig.from_env()
    print(f"✅ Auto-detected vector store: {config.vector_store_mode}")
    print(f"✅ Auto-detected embeddings: {config.embeddings_mode}")
except Exception as e:
    print(f"Note: {str(e)[:100]}...")
    print("\nThis is expected - some Azure configs are required even for offline mode.")
    print("Use the examples/*.py scripts for complete working examples.")
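
Auto-detection of this kind usually amounts to reading two variables and mapping them onto enums. The sketch below is a guess at the shape of that logic, not the actual `PipelineConfig.from_env()` implementation: the enum classes are stand-ins for the ones in `ingestor.config`, `detect_modes` is a hypothetical helper, and the defaults are assumed from the "Production cloud (default)" row in the table above.

```python
import os
from enum import Enum


class VectorStoreMode(str, Enum):
    """Stand-in for ingestor.config.VectorStoreMode."""
    AZURE_SEARCH = "azure_search"
    CHROMADB = "chromadb"


class EmbeddingsMode(str, Enum):
    """Stand-in for ingestor.config.EmbeddingsMode."""
    AZURE_OPENAI = "azure_openai"
    HUGGINGFACE = "huggingface"
    COHERE = "cohere"
    OPENAI = "openai"


def detect_modes(env=None):
    """Read both modes from a mapping (os.environ by default).

    Raises ValueError if a variable holds an unsupported value.
    """
    env = os.environ if env is None else env
    store_raw = env.get("VECTOR_STORE_MODE", "azure_search")
    embed_raw = env.get("EMBEDDINGS_MODE", "azure_openai")
    store = VectorStoreMode(store_raw.strip().lower())
    embeddings = EmbeddingsMode(embed_raw.strip().lower())
    return store, embeddings


store, embeddings = detect_modes(
    {"VECTOR_STORE_MODE": "chromadb", "EMBEDDINGS_MODE": "huggingface"}
)
print(store.value, embeddings.value)
```

Passing the mapping explicitly keeps the helper easy to test without mutating `os.environ`.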

## Example 4: Model Comparison

Compare different embedding models for your use case.

In [None]:
from ingestor.config import (
    HuggingFaceEmbeddingsConfig,
    CohereEmbeddingsConfig,
    OpenAIEmbeddingsConfig,
    AzureOpenAIConfig
)

models = [
    {
        'name': 'Azure OpenAI ada-002',
        'config': AzureOpenAIConfig,
        'dimensions': 1536,
        'languages': 'English++',
        'cost': '$$$',
        'offline': False
    },
    {
        'name': 'HF all-MiniLM-L6-v2',
        'config': HuggingFaceEmbeddingsConfig,
        'dimensions': 384,
        'languages': 'English',
        'cost': 'Free',
        'offline': True
    },
    {
        'name': 'HF multilingual-e5-large',
        'config': HuggingFaceEmbeddingsConfig,
        'dimensions': 1024,
        'languages': '100+',
        'cost': 'Free',
        'offline': True
    },
    {
        'name': 'Cohere v3 multilingual',
        'config': CohereEmbeddingsConfig,
        'dimensions': 1024,
        'languages': '100+',
        'cost': '$$',
        'offline': False
    },
    {
        'name': 'OpenAI text-embedding-3-large',
        'config': OpenAIEmbeddingsConfig,
        'dimensions': 3072,
        'languages': 'English++',
        'cost': '$$$',
        'offline': False
    },
]

import pandas as pd
df = pd.DataFrame(models)
print("\nEmbedding Models Comparison:")
print(df.to_string(index=False))
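
Dimensions also drive storage. As a rough sketch, assuming vectors are stored as 32-bit floats and ignoring index overhead and metadata (both assumptions, since actual storage depends on the vector store), the raw size is `dimensions * 4 bytes * number_of_chunks`. The helper `raw_vector_bytes` is an illustrative name.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as 32-bit floats


def raw_vector_bytes(dimensions: int, num_chunks: int) -> int:
    """Raw embedding payload in bytes, excluding index and metadata overhead."""
    return dimensions * BYTES_PER_FLOAT32 * num_chunks


# Dimensions taken from the comparison table above.
model_dimensions = {
    "HF all-MiniLM-L6-v2": 384,
    "HF multilingual-e5-large": 1024,
    "Azure OpenAI ada-002": 1536,
    "OpenAI text-embedding-3-large": 3072,
}

for name, dims in model_dimensions.items():
    mib = raw_vector_bytes(dims, 1_000_000) / 2**20
    print(f"{name}: about {mib:,.0f} MiB per million chunks")
```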

## Example 5: Testing Factory Functions

The factory pattern makes it easy to switch between implementations.

In [None]:
from ingestor.config import VectorStoreMode, EmbeddingsMode, SearchConfig
from ingestor.vector_store import create_vector_store
from ingestor.embeddings_provider import create_embeddings_provider

# Create Azure Search vector store
search_config = SearchConfig(
    endpoint='https://test.search.windows.net',
    index_name='test',
    api_key='test-key'
)

store = create_vector_store(VectorStoreMode.AZURE_SEARCH, search_config)
print(f"✅ Vector Store Created: {type(store).__name__}")
print(f"   Expected dimensions: {store.get_dimensions()}")
print(f"   Methods: {[m for m in dir(store) if not m.startswith('_')]}")

# Test embeddings provider creation
from ingestor.config import OpenAIEmbeddingsConfig

openai_config = OpenAIEmbeddingsConfig(
    api_key='test-key',
    model_name='text-embedding-3-small'
)

provider = create_embeddings_provider(EmbeddingsMode.OPENAI, openai_config)
print(f"\n✅ Embeddings Provider Created: {type(provider).__name__}")
print(f"   Model: {provider.get_model_name()}")
print(f"   Dimensions: {provider.get_dimensions()}")
print(f"   Methods: {[m for m in dir(provider) if not m.startswith('_')]}")
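
For readers unfamiliar with the pattern, a factory of this kind is often just a registry that maps a mode to a class. The sketch below illustrates the idea with stand-in classes; it is not the code behind `create_embeddings_provider`, and every name in it (`FakeHuggingFaceProvider`, `FakeOpenAIProvider`, `create_provider_sketch`) is hypothetical.

```python
from enum import Enum


class EmbeddingsMode(str, Enum):
    """Stand-in for ingestor.config.EmbeddingsMode (two modes only)."""
    HUGGINGFACE = "huggingface"
    OPENAI = "openai"


class FakeHuggingFaceProvider:
    """Hypothetical provider; a real one would load a local model."""

    def __init__(self, config: dict):
        self.config = config

    def get_model_name(self) -> str:
        return self.config["model_name"]


class FakeOpenAIProvider:
    """Hypothetical provider; a real one would call a remote API."""

    def __init__(self, config: dict):
        self.config = config

    def get_model_name(self) -> str:
        return self.config["model_name"]


# The registry is the only place that knows about concrete classes.
_REGISTRY = {
    EmbeddingsMode.HUGGINGFACE: FakeHuggingFaceProvider,
    EmbeddingsMode.OPENAI: FakeOpenAIProvider,
}


def create_provider_sketch(mode: EmbeddingsMode, config: dict):
    """Look up the class registered for `mode` and instantiate it."""
    try:
        provider_cls = _REGISTRY[mode]
    except KeyError:
        raise ValueError(f"Unsupported embeddings mode: {mode!r}") from None
    return provider_cls(config)


sketch_provider = create_provider_sketch(
    EmbeddingsMode.HUGGINGFACE, {"model_name": "all-MiniLM-L6-v2"}
)
print(type(sketch_provider).__name__, sketch_provider.get_model_name())
```

Because callers depend only on the shared method names, switching providers means changing the mode and config rather than the calling code.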

## Example 6: Configuration Scenarios

### Scenario 1: Fully Offline (Zero Cloud Dependencies)

In [None]:
# Fully Offline Setup
offline_env = """
# Vector Store
VECTOR_STORE_MODE=chromadb
CHROMADB_PERSIST_DIR=./chroma_db

# Embeddings
EMBEDDINGS_MODE=huggingface
HUGGINGFACE_MODEL_NAME=all-MiniLM-L6-v2

# Input/Output
INPUT_MODE=local
LOCAL_INPUT_GLOB=./documents/**/*.pdf
ARTIFACTS_MODE=local

# Processing
AZURE_OFFICE_EXTRACTOR_MODE=markitdown
AZURE_MEDIA_DESCRIBER=disabled
"""

print("Fully Offline Configuration:")
print(offline_env)
print("\nBenefits:")
print("  ✅ Zero API costs")
print("  ✅ Complete data privacy")
print("  ✅ Works without internet (after initial model download)")
print("  ✅ Fast local development")
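
The scenario strings in this section are only printed, not applied. To try one in a notebook session, a small parser can turn the `KEY=VALUE` text into a dictionary for `os.environ.update`. The sketch below is a deliberately simple, hypothetical helper (`parse_env_text`); it skips blank lines and comments and strips trailing ` # ...` comments, but does not handle quoting or escapes. For real `.env` files a dedicated library such as python-dotenv is the safer choice.

```python
def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that contain no '='
        # Drop a trailing inline comment such as "cuda  # GPU acceleration".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


sample_env = """
# Vector Store
VECTOR_STORE_MODE=chromadb
CHROMADB_PERSIST_DIR=./chroma_db

# Embeddings
HUGGINGFACE_DEVICE=cuda  # GPU acceleration
"""

settings = parse_env_text(sample_env)
print(settings)

# To apply the settings to the current process:
# import os; os.environ.update(settings)
```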

### Scenario 2: Hybrid Cloud/Local (Cost Optimized)

In [None]:
# Hybrid: Azure Search + Local Embeddings
hybrid_env = """
# Vector Store: Azure Search (enterprise features)
VECTOR_STORE_MODE=azure_search
AZURE_SEARCH_SERVICE=your-service
AZURE_SEARCH_INDEX=documents

# Embeddings: Hugging Face (zero cost)
EMBEDDINGS_MODE=huggingface
HUGGINGFACE_MODEL_NAME=intfloat/multilingual-e5-large
HUGGINGFACE_DEVICE=cuda  # GPU acceleration

# Disable integrated vectorization
AZURE_USE_INTEGRATED_VECTORIZATION=false
"""

print("Hybrid Configuration:")
print(hybrid_env)
print("\nBenefits:")
print("  ✅ Azure Search enterprise features")
print("  ✅ Zero embedding costs (local)")
print("  ✅ Strong multilingual coverage (multilingual-e5-large)")
print("  ✅ GPU-accelerated embeddings")
print("\nCost Savings (illustrative figures; actual cost depends on token volume and current pricing):")
print("  Before: $1,000/month (Azure OpenAI embeddings for 10M tokens)")
print("  After: $0/month (local embeddings)")
print("  Savings: $1,000/month")
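
The savings figure is simple arithmetic on token volume and per-token price. The example's $1,000 for 10M tokens works out to $0.10 per 1,000 tokens; that rate is implied by the example, not a quoted price, so substitute the current rate for your provider. A minimal sketch, with `monthly_embedding_cost` as an illustrative helper name:

```python
def monthly_embedding_cost(tokens_per_month: int, price_per_1k_tokens: float) -> float:
    """Estimated monthly API cost for embedding a given token volume."""
    return tokens_per_month / 1000 * price_per_1k_tokens


# Rate implied by the example figures above; a placeholder, not a published price.
IMPLIED_PRICE_PER_1K = 0.10

before = monthly_embedding_cost(10_000_000, IMPLIED_PRICE_PER_1K)
after = 0.0  # local embeddings incur no per-token API charge

print(f"Before: ${before:,.2f}/month")
print(f"After:  ${after:,.2f}/month")
print(f"Savings: ${before - after:,.2f}/month")
```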

### Scenario 3: Cloud Optimized (Cohere)

In [None]:
# Cloud: Azure Search + Cohere
cohere_env = """
# Vector Store: Azure Search
VECTOR_STORE_MODE=azure_search
AZURE_SEARCH_SERVICE=your-service

# Embeddings: Cohere v3 Multilingual
EMBEDDINGS_MODE=cohere
COHERE_API_KEY=your-cohere-key
COHERE_MODEL_NAME=embed-multilingual-v3.0

# Disable integrated vectorization
AZURE_USE_INTEGRATED_VECTORIZATION=false
"""

print("Cohere Configuration:")
print(cohere_env)
print("\nBenefits:")
print("  ✅ Latest multilingual models (100+ languages)")
print("  ✅ Competitive pricing")
print("  ✅ Optimized for semantic search")
print("  ✅ Simple API")

## Example 7: Troubleshooting

Common issues and solutions.

In [None]:
# Test dependency availability
print("Checking dependencies...\n")

# Check ChromaDB
try:
    import chromadb
    print("✅ chromadb installed")
except ImportError:
    print("❌ chromadb not installed")
    print("   Install with: pip install chromadb")

# Check sentence-transformers
try:
    import sentence_transformers
    print("✅ sentence-transformers installed")
except ImportError:
    print("❌ sentence-transformers not installed")
    print("   Install with: pip install sentence-transformers")

# Check cohere
try:
    import cohere
    print("✅ cohere installed")
except ImportError:
    print("❌ cohere not installed")
    print("   Install with: pip install cohere")

# Check openai
try:
    import openai
    print("✅ openai installed")
except ImportError:
    print("❌ openai not installed")
    print("   Install with: pip install openai")

# Check torch
try:
    import torch
    print("✅ torch installed")
    if torch.cuda.is_available():
        print(f"   🚀 CUDA available: {torch.cuda.get_device_name(0)}")
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        print("   🚀 MPS (Apple Silicon) available")
    else:
        print("   💻 CPU only")
except ImportError:
    print("❌ torch not installed")
    print("   Install with: pip install torch")
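
The same availability check can be written without importing each package, which avoids the load time of heavy modules such as `torch`. The sketch below uses the standard library's `importlib.util.find_spec`; the mapping `OPTIONAL_PACKAGES` and the helper `missing_packages` are illustrative names.

```python
from importlib.util import find_spec

# Import name -> pip package name, matching the checks above.
OPTIONAL_PACKAGES = {
    "chromadb": "chromadb",
    "sentence_transformers": "sentence-transformers",
    "cohere": "cohere",
    "openai": "openai",
    "torch": "torch",
}


def missing_packages(packages: dict) -> list:
    """Return pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for module_name, pip_name in packages.items()
        if find_spec(module_name) is None
    ]


missing = missing_packages(OPTIONAL_PACKAGES)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All optional dependencies are installed")
```

This only confirms that a package is installed; it does not report GPU availability, which still requires importing `torch` as in the cell above.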

## Example 8: Quick Reference

### Installation Commands

```bash
# Base package
pip install -e .

# ChromaDB support
pip install -r requirements-chromadb.txt

# All embeddings providers
pip install -r requirements-embeddings.txt

# Individual packages
pip install chromadb
pip install sentence-transformers torch
pip install cohere
pip install openai
```

### Environment Variables Quick Reference

**Vector Store:**
```bash
VECTOR_STORE_MODE=azure_search  # or chromadb
```

**Embeddings:**
```bash
EMBEDDINGS_MODE=azure_openai  # or huggingface, cohere, openai
```

### Documentation Links

- [Vector Stores Guide](../../docs/vector_stores.md)
- [Embeddings Guide](../../docs/embeddings_providers.md)
- [Configuration Examples](../../docs/configuration_examples.md)
- [Implementation Summary](../../PLUGGABLE_ARCHITECTURE_SUMMARY.md)