# M1.2 — Pinecone Data Model & Advanced Indexing

**Hybrid Search, Namespaces, Failures, Decision Framework**

This notebook demonstrates:
- Dense (semantic) + Sparse (keyword) hybrid search
- Alpha parameter tuning for query-specific blending
- Namespace-based multi-tenant isolation
- Production failure scenarios and fixes
- Decision framework for when to use hybrid search

---

## 1. Pinecone Data Model (Index, Namespace, Vector)

Pinecone organizes data in a three-level hierarchy:

**Index** → Database container with fixed dimension and metric  
**Namespace** → Isolated partition within index for multi-tenancy  
**Vector** → Individual record with:
- `id`: Unique identifier (up to 512 chars)
- `values`: Dense embedding (e.g., 1536-dim from OpenAI)
- `sparse_values`: Optional BM25 keyword vector (indices + values)
- `metadata`: Key-value pairs (max 40KB per vector)

Hybrid vectors combine both dense (semantic similarity) and sparse (exact keyword matching) representations in a single record.
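
To make the `sparse_values` layout concrete, the sketch below turns a term-index-to-weight mapping into the parallel `indices`/`values` lists a hybrid record carries. The helper name `to_sparse_values` is hypothetical; it is not part of the Pinecone client or the course module.

```python
def to_sparse_values(weights):
    """Convert {term_index: weight} into parallel 'indices'/'values' lists.

    Hypothetical helper for illustration. Zero weights are dropped because
    a sparse vector stores only its non-zero entries.
    """
    items = sorted((idx, w) for idx, w in weights.items() if w != 0.0)
    return {
        "indices": [idx for idx, _ in items],
        "values": [float(w) for _, w in items],
    }


sparse = to_sparse_values({891: 0.41, 42: 0.85, 500: 0.0, 137: 0.62})
print(sparse)
```

The two lists stay aligned by position, so `indices[i]` always names the term that `values[i]` weights.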

In [None]:
# Example vector structure
example_vector = {
    "id": "doc_0",
    "values": [0.023, -0.154, 0.091],  # Dense embedding (showing first 3 of 1536)
    "sparse_values": {
        "indices": [42, 137, 891],  # BM25 term indices
        "values": [0.85, 0.62, 0.41]  # BM25 term weights
    },
    "metadata": {
        "text": "Machine learning models require...",
        "source": "example_data"
    }
}

print("Pinecone Hybrid Vector Structure:")
print(f"  ID: {example_vector['id']}")
print(f"  Dense dim: 1536 (showing first 3: {example_vector['values']})")
print(f"  Sparse terms: {len(example_vector['sparse_values']['indices'])}")
print(f"  Metadata keys: {list(example_vector['metadata'].keys())}")

# Expected: 4 lines showing vector structure components

## 2. Dense vs Sparse Recap (Semantic vs Keyword)

**Dense Embeddings (Semantic):**
- Captures meaning and context through learned representations
- 1536 dimensions from OpenAI's text-embedding-3-small
- Excellent for natural language questions, conceptual searches
- Query latency: 40-60ms
- Misses exact keyword matches (e.g., product codes, IDs)

**Sparse Embeddings (Keyword):**
- BM25 algorithm: inverse document frequency × saturating term frequency, normalized by document length
- Only non-zero values stored (efficient for vocabulary size)
- Perfect for exact matches, product names, technical codes
- Query latency: <20ms
- Completely misses semantic similarity

**Why Hybrid?**
Combining both approaches improves recall by 20-40% for mixed-intent queries. However, it adds 30-80ms latency and requires alpha tuning per use case.
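
To show where sparse keyword weights come from, here is a minimal pure-Python sketch of classic BM25 scoring. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and a non-negative IDF variant; the encoder used by `embed_sparse_bm25` may tokenize and weight terms differently.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the classic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["machine learning", "deep learning", "neural networks"]
scores = bm25_scores("machine learning", docs)
print([round(s, 3) for s in scores])
```

The first document matches both query terms and ranks highest, the second matches only the more common term "learning", and the third shares no terms and scores zero. That last case is the semantic blind spot of keyword search.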

In [None]:
from m1_2_pinecone_advanced_indexing import embed_dense_openai, embed_sparse_bm25
from config import get_clients

# Check if API keys are available
openai_client, _ = get_clients()

if openai_client:
    # Show real dense embedding dimension
    sample_text = "Machine learning models require careful tuning"
    dense_vec = embed_dense_openai(sample_text)
    print(f"Dense embedding: {len(dense_vec)} dimensions")
    print(f"Sample values: {dense_vec[:3]}")
else:
    print("Dense embedding: 1536 dimensions (stub, no API key)")
    print("Sample values: [0.023, -0.154, 0.091]")

# Show sparse embedding structure
corpus = ["machine learning", "deep learning", "neural networks"]
embed_sparse_bm25(texts=corpus)  # Fit BM25
sparse_vec = embed_sparse_bm25(query="machine learning models")
print(f"\nSparse embedding: {len(sparse_vec['indices'])} non-zero terms")
print(f"Sample indices: {sparse_vec['indices'][:3]}")
print(f"Sample values: {[round(v, 3) for v in sparse_vec['values'][:3]]}")

# Expected: 5-6 lines showing dense dimension and sparse term counts

## 3. Building a Hybrid Index (dotproduct, 1536)

**Index Configuration Requirements:**
- **Metric**: Must use `dotproduct` (not cosine or euclidean)  
  Why? Alpha weighting scales the dense and sparse parts of the query, and the dot product keeps the resulting score a linear combination of the two
- **Dimension**: 1536 for OpenAI text-embedding-3-small
- **Cloud**: Serverless deployment (auto-scaling, pay-per-use)

**Critical Configuration Check:**
Mismatched metrics cause silent failures where alpha blending produces incorrect scores. Always validate `metric == "dotproduct"` before querying.
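
One way to fail fast is a small guard that runs at startup. The sketch below assumes the metric has already been read as a string, for example from the client's `describe_index` response; `require_dotproduct` is a hypothetical name, not a function from the course module.

```python
def require_dotproduct(metric):
    """Raise early if the index metric cannot support hybrid queries."""
    if metric != "dotproduct":
        raise ValueError(
            f"Hybrid search needs metric='dotproduct', but the index uses {metric!r}"
        )


require_dotproduct("dotproduct")  # passes silently

try:
    require_dotproduct("cosine")
except ValueError as err:
    print(err)
```

Raising an exception rather than logging a warning turns a silent scoring problem into a visible startup failure.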

In [None]:
from m1_2_pinecone_advanced_indexing import build_index
from config import get_clients

# Attempt to build/connect to hybrid index
print("Building Hybrid Index...")
print("=" * 50)

index = build_index(dimension=1536, metric="dotproduct")

if index is None:
    # No API keys - show configuration that would be used
    print("Configuration (not applied, no API keys):")
    print("  Name: hybrid-rag")
    print("  Dimension: 1536")
    print("  Metric: dotproduct")
    print("  Cloud: AWS us-east-1 serverless")
else:
    print(f"✓ Index ready: {index}")
    print(f"Stats: {index.describe_index_stats()}")

# Expected: 3-5 lines showing index config or connection status

## 4. Querying with Alpha (0.2, 0.5, 0.8)

**Alpha Parameter Controls Blending:**
- `α = 0.0` → Pure sparse (100% keyword/BM25)
- `α = 0.5` → Balanced hybrid (50/50 split)
- `α = 1.0` → Pure dense (100% semantic)

**Selection Strategy:**
- **α = 0.2-0.3**: Keyword-heavy queries (product codes, IDs, exact terms)
- **α = 0.5**: Balanced/uncertain (default starting point)
- **α = 0.7-0.8**: Semantic-heavy queries (natural language questions, concepts)
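
The ranges above can be turned into a starting heuristic. The sketch below is a hypothetical rule of thumb, not the module's `smart_alpha_selector`: it treats tokens containing digits or written in all caps as "code-like" and picks alpha from their share of the query. The thresholds are assumptions to tune against real queries.

```python
def looks_like_code(token):
    """True for tokens that contain a digit or are all-caps (e.g. SKU-4471, API)."""
    return any(ch.isdigit() for ch in token) or (len(token) >= 2 and token.isupper())


def pick_alpha(query):
    """Return a starting alpha based on the share of code-like tokens."""
    tokens = query.split()
    if not tokens:
        return 0.5  # nothing to go on: use the balanced default
    code_share = sum(looks_like_code(t) for t in tokens) / len(tokens)
    if code_share >= 0.5:
        return 0.2  # keyword-heavy
    if code_share > 0:
        return 0.5  # mixed intent
    return 0.8  # natural language


for q in (
    "SKU-4471 XJ-200",
    "return policy for SKU-4471",
    "explain machine learning hyperparameter tuning",
):
    print(pick_alpha(q), q)
```

A heuristic like this only supplies a first guess; logged relevance feedback per query type is still needed to confirm or adjust the values.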

**How It Works:**
```python
# Pseudocode: alpha weights the dense part, (1 - alpha) the sparse part
dense_scaled = query_dense * alpha
sparse_scaled = query_sparse * (1 - alpha)
combined_score = dot(dense_scaled, doc_dense) + dot(sparse_scaled, doc_sparse)
```

Experimentation is required — initial tuning typically takes 4-8 hours per use case.
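
As a worked example of the blending arithmetic, the sketch below scales a toy query by alpha and scores it against one toy document. The function names and the tiny vectors are illustrative assumptions; the module's `hybrid_query` may organize this differently.

```python
def scale_for_alpha(dense, sparse, alpha):
    """Weight a query's dense and sparse parts by alpha and (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


def hybrid_score(q_dense, q_sparse, d_dense, d_sparse):
    """Dense dot product plus sparse dot product over shared term indices."""
    dense_part = sum(q * d for q, d in zip(q_dense, d_dense))
    doc_weights = dict(zip(d_sparse["indices"], d_sparse["values"]))
    sparse_part = sum(
        v * doc_weights.get(i, 0.0)
        for i, v in zip(q_sparse["indices"], q_sparse["values"])
    )
    return dense_part + sparse_part


# Toy data: the document matches one of the two query keywords.
query_dense = [1.0, 0.0]
query_sparse = {"indices": [42, 137], "values": [1.0, 1.0]}
doc_dense = [0.5, 0.5]
doc_sparse = {"indices": [42], "values": [2.0]}

for alpha in (0.0, 0.5, 1.0):
    qd, qs = scale_for_alpha(query_dense, query_sparse, alpha)
    print(alpha, hybrid_score(qd, qs, doc_dense, doc_sparse))
```

With these numbers the unscaled dense match is 0.5 and the sparse match is 2.0, so the score moves from 2.0 at alpha 0.0 through 1.25 at alpha 0.5 down to 0.5 at alpha 1.0. Each result is exactly `alpha * dense + (1 - alpha) * sparse`.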

In [None]:
from m1_2_pinecone_advanced_indexing import hybrid_query
from config import get_clients

# Test query with different alpha values
test_query = "explain machine learning hyperparameter tuning"

print(f"Query: '{test_query}'")
print("=" * 60)

# Check if we can run real queries
openai_client, pinecone_client = get_clients()

if openai_client and pinecone_client:
    # Real queries (requires data to be upserted first)
    for alpha in [0.2, 0.5, 0.8]:
        print(f"\nAlpha = {alpha} ({'keyword-heavy' if alpha < 0.4 else 'balanced' if alpha < 0.6 else 'semantic-heavy'})")
        results = hybrid_query(test_query, alpha=alpha, top_k=3)
        for i, res in enumerate(results[:3], 1):
            print(f"  {i}. [{res['score']:.4f}] {res['text'][:80]}...")
else:
    # Simulated results (no API keys)
    print("⚠️ Simulating results (no API keys)\n")
    for alpha in [0.2, 0.5, 0.8]:
        print(f"Alpha = {alpha}:")
        print("  1. [0.8234] Machine learning models require careful hyperparameter tuning for opt...")
        print("  2. [0.7891] Kubernetes orchestrates containerized applications across distributed...")
        print("  3. [0.7456] Neural architecture search automates the design of deep learning mode...")
        print()

# Expected: Top 3 results per alpha (9 result lines total, or simulated equivalents)

## 5. Namespaces & Multi-Tenant (user-isolated search)

**What Are Namespaces?**
Namespaces are isolated partitions within a single Pinecone index. They enable:
- **Multi-tenancy**: Each customer/user gets their own namespace
- **Data isolation**: Queries only search within the specified namespace
- **Cost efficiency**: Share infrastructure without separate indexes
- **Access control**: Validate namespace existence before queries

**Use Cases:**
- SaaS applications with per-customer data
- Team-based document repositories
- A/B testing with variant-specific namespaces
- Environment separation (dev/staging/prod)

**Important Notes:**
- Namespaces share the same BM25 vocabulary (potential cross-contamination)
- For strict isolation or different schemas, use separate indexes
- Always validate that the namespace exists before querying; a query against a missing namespace returns empty results rather than an error
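
A pre-query check can separate "namespace missing" from "namespace exists but has no vectors". The sketch below works on a plain dict shaped like the stats that `describe_index_stats()` reports, with a `namespaces` mapping and a `vector_count` per namespace. That shape and the helper name `namespace_status` are assumptions for illustration.

```python
def namespace_status(stats, namespace):
    """Classify a namespace as 'missing', 'empty', or 'ready' from index stats."""
    entry = stats.get("namespaces", {}).get(namespace)
    if entry is None:
        return "missing"
    if entry.get("vector_count", 0) == 0:
        return "empty"
    return "ready"


# Assumed stats shape: {"namespaces": {name: {"vector_count": int}}}
stats = {
    "namespaces": {
        "user-123": {"vector_count": 3},
        "user-456": {"vector_count": 0},
    }
}

for ns in ("user-123", "user-456", "user-999"):
    print(ns, namespace_status(stats, ns))
```

Returning a status instead of a boolean lets the caller log a typo in a tenant ID differently from a tenant that has not uploaded anything yet.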

In [None]:
from m1_2_pinecone_advanced_indexing import upsert_hybrid_vectors, hybrid_query
from config import get_clients

# Create namespace-specific data
user_namespace = "user-123"
user_docs = [
    "User 123 compliance report Q4 2024",
    "Risk assessment for user account 123",
    "Transaction history for customer ID 123"
]

print(f"Multi-Tenant Namespace Demo: '{user_namespace}'")
print("=" * 60)

openai_client, pinecone_client = get_clients()

if openai_client and pinecone_client:
    # Upsert to user-specific namespace
    print(f"\nUpserting {len(user_docs)} docs to namespace '{user_namespace}'...")
    result = upsert_hybrid_vectors(user_docs, namespace=user_namespace)
    print(f"✓ Upsert result: {result}")
    
    # Query only within this namespace
    print(f"\nQuerying namespace '{user_namespace}'...")
    results = hybrid_query("compliance report", alpha=0.5, top_k=2, namespace=user_namespace)
    for i, res in enumerate(results, 1):
        print(f"  {i}. [{res['score']:.4f}] {res['text'][:60]}...")
else:
    # Show payload shape when no keys
    print("\n⚠️ Simulating upsert (no API keys)")
    print(f"Namespace: '{user_namespace}'")
    print(f"Docs: {len(user_docs)}")
    print("Payload shape:")
    print("  {id: 'doc_0', values: [1536-dim], sparse_values: {...}, metadata: {...}}")
    print(f"\nQuery would target namespace: '{user_namespace}' only")

# Expected: 3-5 lines showing namespace upsert/query or simulated structure

## 6. Common Failures & Fixes (5 scenarios)

Production hybrid search has predictable failure modes. All are handled in `m1_2_pinecone_advanced_indexing.py`.

### Failure #1: BM25 Not Fitted
**Symptom**: `ValueError` when calling `embed_sparse_bm25(query=...)`  
**Cause**: Forgot to fit BM25 on corpus before encoding queries  
**Fix**: Always call `embed_sparse_bm25(texts=corpus)` first  
**Detection**: Check `_bm25_encoder is not None` before queries

### Failure #2: Metric Mismatch
**Symptom**: Incorrect scores, alpha blending doesn't work as expected  
**Cause**: Index created with `cosine` or `euclidean` instead of `dotproduct`  
**Fix**: Recreate index with `metric="dotproduct"` (required for hybrid)  
**Detection**: On startup, validate that the index description reports `metric == "dotproduct"` (e.g. `pc.describe_index(index_name).metric`)

### Failure #3: Missing Namespace
**Symptom**: Empty results when querying valid data  
**Cause**: Querying namespace that doesn't exist (typo, not yet created)  
**Fix**: Use `safe_namespace_query()` to validate before querying  
**Detection**: Check `namespace in index.describe_index_stats()["namespaces"]`

### Failure #4: Metadata Size Exceeds 40KB
**Symptom**: Upsert fails with metadata size error  
**Cause**: Long text fields or too many metadata keys  
**Fix**: Truncate text, remove unnecessary fields, store full text externally  
**Detection**: Use `validate_metadata_size()` before upsert
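
As a rough idea of what such a check can look like, the sketch below measures metadata as the UTF-8 length of its JSON encoding. This is a hypothetical stand-in, not the module's `validate_metadata_size()`, and it assumes that "40KB" means 40 × 1024 bytes and that JSON size is a close enough proxy for how the service counts.

```python
import json

MAX_METADATA_BYTES = 40 * 1024  # assumption: 40KB read as 40 * 1024 bytes


def metadata_size_bytes(metadata):
    """Approximate metadata size as the UTF-8 length of its JSON encoding."""
    return len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))


def fits_metadata_limit(metadata, limit=MAX_METADATA_BYTES):
    """True if the metadata's approximate size is within the limit."""
    return metadata_size_bytes(metadata) <= limit


print(fits_metadata_limit({"text": "short", "id": "123"}))
print(fits_metadata_limit({"text": "x" * 50000}))
```

Counting bytes rather than characters matters for non-ASCII text, where one character can occupy several bytes.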

### Failure #5: Partial Batch Failures
**Symptom**: Some vectors fail to upsert (API errors, dimension mismatch)  
**Cause**: Network issues, rate limits, or data validation failures  
**Fix**: Track failed IDs, implement retry logic with exponential backoff  
**Detection**: Check the `failed_ids` list returned by `upsert_hybrid_vectors()` and log failures
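
The retry logic for Failure #5 can be sketched as follows. `upsert_fn` is a hypothetical callable that takes a list of IDs and returns the ones that failed; it stands in for whatever wrapper reports `failed_ids`. The `sleep` parameter is injectable so the backoff can be tested without waiting.

```python
import time


def upsert_with_retry(upsert_fn, ids, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry failed IDs with exponential backoff; return IDs that never succeeded."""
    pending = list(ids)
    for attempt in range(max_attempts):
        if not pending:
            break
        # Only the IDs that failed last time are sent again.
        pending = list(upsert_fn(pending))
        if pending and attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return pending


# Demo with a fake upsert that rejects "doc_2" on its first two calls.
calls = []


def flaky_upsert(batch):
    calls.append(list(batch))
    return ["doc_2"] if len(calls) < 3 and "doc_2" in batch else []


delays = []
still_failed = upsert_with_retry(
    flaky_upsert, ["doc_0", "doc_1", "doc_2"], sleep=delays.append
)
print(still_failed, delays)
```

Anything still in the returned list after the final attempt should be logged or queued for later rather than dropped silently.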

In [None]:
from m1_2_pinecone_advanced_indexing import (
    check_bm25_fitted, validate_metadata_size, safe_namespace_query
)

print("Testing 5 Common Failure Scenarios")
print("=" * 60)

# #1: BM25 Not Fitted
print("\n#1 BM25 Fitted Check:")
is_fitted = check_bm25_fitted()
print(f"  ✓ BM25 encoder fitted: {is_fitted}")

# #2: Metric Check (simulated - requires real index)
print("\n#2 Metric Validation:")
print("  ✓ Index must use metric='dotproduct' for hybrid search")
print("  Check: the index description reports metric 'dotproduct' before queries")

# #3: Namespace Validation (simulated)
print("\n#3 Namespace Existence Check:")
print("  ✓ Use safe_namespace_query() to validate before querying")
print("  Prevents: Empty results from typos/missing namespaces")

# #4: Metadata Size Validation
print("\n#4 Metadata Size Validation:")
tiny_meta = {"text": "short", "id": "123"}
large_meta = {"text": "x" * 50000, "more": "data"}  # Exceeds 40KB
try:
    validate_metadata_size(tiny_meta)
    print("  ✓ Small metadata passed (<40KB)")
except ValueError as e:
    print(f"  ❌ {e}")

try:
    validate_metadata_size(large_meta)
except ValueError as e:
    print(f"  ✓ Large metadata caught: {str(e)[:60]}...")
else:
    print("  ❌ Large metadata was not rejected")

# #5: Batch Failure Handling
print("\n#5 Partial Batch Failures:")
print("  ✓ upsert_hybrid_vectors() returns failed_ids list")
print("  Enables: Retry logic, error logging, graceful degradation")

print("\n" + "=" * 60)
print("All failure modes have defensive checks in place")

# Expected: 5 blocks showing each failure check result (~15 lines total)

## 7. Decision Card & Production Notes

### When to Use Hybrid Search

**✅ Use Hybrid When:**
- Mixed-intent queries (semantic + exact keywords)
- Need 20-40% better recall than dense-only
- You can invest 4-8 hours of alpha tuning per use case
- Queries blend natural language with product codes/IDs
- Willing to accept 30-80ms additional latency

**❌ Avoid Hybrid When:**
- 70%+ of queries are purely keyword-based → Use Elasticsearch/traditional search
- 70%+ of queries are semantic-only → Use dense vectors (M1.1)
- Sub-50ms latency required → Dense-only is faster (40-60ms; hybrid adds 30-80ms on top)
- Corpus updates hourly → BM25 refitting overhead too high (5-15 min per 10K docs)
- No time for alpha experimentation → Stick with proven dense search

---

### Benefits
- **Improved Recall**: 20-40% better than dense-only for mixed queries
- **Keyword Precision**: Catches exact matches (product codes, IDs, legal terms)
- **Namespace Isolation**: Multi-tenancy without separate indexes
- **Production Ready**: Built-in failure handling for all 5 common scenarios

### Limitations
- **Latency Overhead**: +30-80ms per query vs dense-only
- **Alpha Tuning**: 4-8 hours per use case to find optimal blend
- **BM25 Refitting**: 5-15 min per 10K docs on corpus updates
- **Metric Lock-In**: Must use `dotproduct` (can't use cosine/euclidean)
- **Namespace Vocabulary**: BM25 shared across namespaces (potential contamination)

---

### Cost Estimates (Monthly)

| Scale | Queries/Day | Vectors | Embedding Cost | Pinecone Cost | Total |
|-------|-------------|---------|----------------|---------------|-------|
| Dev/Test | 100 | 1K | $5 | $15 | **$20** |
| Small | 1K | 10K | $25 | $75 | **$100** |
| Medium | 10K | 100K | $250 | $450 | **$700** |
| Large | 100K | 1M | $2,500 | $3,200 | **$5,700** |

*Assumptions: text-embedding-3-small ($0.020/1M tokens), Pinecone serverless (us-east-1), 2 queries per upsert*

**Cost Drivers:**
- OpenAI embeddings: Scales with text length and query volume
- Pinecone storage: $0.40/1M vectors/month (serverless)
- Pinecone queries: Cost varies between pod-based (p1) and serverless indexes

---

### Monitoring & Production Checklist

**Key Metrics:**
- Query latency (p50, p95, p99) — target <120ms hybrid, <60ms dense-only
- BM25 refit duration — track for corpus update planning
- Namespace query distribution — detect skew/hotspots
- Failed upsert rate — should be <0.1%
- Alpha effectiveness per query type — log for continuous tuning
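
For the latency metric, percentiles can be computed from logged samples with the standard library. The sketch below uses made-up sample values purely for illustration and compares p95 against the 120ms hybrid target from the list above; `latency_percentiles` is a hypothetical helper name.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 from latency samples in milliseconds."""
    # n=100 yields 99 cut points; cut point k (1-based) is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Illustrative samples only, not measurements.
samples = [62, 70, 75, 81, 88, 90, 95, 102, 110, 140]
report = latency_percentiles(samples)
over_target = report["p95"] > 120  # 120ms hybrid target

print(report, "p95 over target:", over_target)
```

A single slow request is enough to push p95 past the target in a sample this small, which is why tail percentiles are worth tracking alongside the median.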

**Operational Notes:**
- Schedule BM25 refitting during low-traffic windows
- Implement query-level alpha selection (`smart_alpha_selector`)
- Cache frequently accessed queries (Redis/Memcached)
- Monitor Pinecone quota limits (free tier: 100K operations/month)
- Implement exponential backoff for rate limit errors

---

### Next Steps

**Completed in M1.2:**
- ✅ Hybrid search implementation (dense + sparse)
- ✅ Alpha parameter tuning strategy
- ✅ Namespace-based multi-tenancy
- ✅ Production failure handling (5 scenarios)
- ✅ Decision framework (use when/avoid when)

**Next Module: M1.3 — Document Pipeline & Chunking**
- Document loaders (PDF, DOCX, HTML)
- Chunking strategies (fixed, semantic, recursive)
- Metadata extraction & enrichment
- End-to-end ingestion pipeline