# Vector Databases — Concepts, Workflow, and Hands‑On with LangChain

**Goal:** Learn vector DBs end‑to‑end with FAISS, Chroma, Qdrant, Pinecone, Weaviate.

In [None]:
# %pip install -q langchain langchain-community langchain-core langchain-text-splitters
# %pip install -q faiss-cpu sentence-transformers chromadb qdrant-client weaviate-client pinecone-client

## 1) Why Vector Databases?

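Vector databases store embeddings (dense numeric representations of text, images, or other data) and use approximate nearest neighbor (ANN) indexes to find semantically similar items quickly, even when the query shares no keywords with the result.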
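
To make the idea concrete, here is a minimal sketch of exact (brute-force) cosine search over a few hand-written vectors. The three-dimensional "embeddings" and the names `toy_vectors` and `brute_force_search` are illustrative assumptions, not part of any library. An ANN index approximates this same ranking without scoring every stored vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real models emit hundreds of dimensions.
toy_vectors = {
    "vector databases": [0.9, 0.1, 0.0],
    "semantic search": [0.8, 0.3, 0.1],
    "banana bread recipe": [0.0, 0.2, 0.9],
}

def brute_force_search(query_vec, vectors, k=2):
    """Exact nearest-neighbor search: score every stored vector, keep the top k."""
    scored = [(cosine(query_vec, vec), name) for name, vec in vectors.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

top_hits = brute_force_search([1.0, 0.2, 0.0], toy_vectors, k=2)
print(top_hits)
```

Brute force costs one similarity computation per stored vector per query, which is why larger collections rely on ANN structures such as HNSW or IVF.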

## 2) Toy Corpus

In [None]:
toy_corpus = [{'text':'Vector databases enable semantic search','meta':{'topic':'vector_db'}}]

## 3) Embeddings (HuggingFace)

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
emb = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

## 4) FAISS Demo

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Keep each record's metadata so it stays attached to the indexed chunks.
docs = [Document(page_content=d['text'], metadata=d['meta']) for d in toy_corpus]
# Split into chunks of up to 200 characters with a 20-character overlap.
chunks = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20).split_documents(docs)
# Build an in-memory FAISS index from the chunk embeddings.
faiss = FAISS.from_documents(chunks, emb)
faiss.similarity_search('semantic search', k=1)

## 5) Chroma (try/except)

In [None]:
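try:
    from langchain_community.vectorstores import Chroma
    # In-memory collection; pass persist_directory=... to keep it on disk.
    chroma = Chroma.from_documents(chunks, emb, collection_name='toy')
    # Expressions inside try/except are not echoed by the notebook, so print explicitly.
    print(chroma.similarity_search('semantic', k=1))
except Exception as e:
    print('Chroma not available: ' + str(e))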

## 6) Qdrant (try/except)

In [None]:
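try:
    from langchain_community.vectorstores import Qdrant
    from qdrant_client import QdrantClient
    # Fail fast if no Qdrant server is listening on the default port.
    QdrantClient(url='http://localhost:6333').get_collections()
    qvs = Qdrant.from_documents(chunks, emb, url='http://localhost:6333', collection_name='toy_q')
    # Expressions inside try/except are not echoed by the notebook, so print explicitly.
    print(qvs.similarity_search('HNSW', k=1))
except Exception as e:
    print('Qdrant not available: ' + str(e))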

## 7) Pinecone (commented)

In [None]:
# from pinecone import Pinecone, ServerlessSpec
# ... setup and search ...

## 8) Weaviate (commented)

In [None]:
# import weaviate
# ... setup and search ...

## 9) Comparison

- Compare the stores on index tuning options, metadata filtering, hybrid (keyword + dense) search, security, observability, and evaluation metrics.

## 9.1 Production‑Grade Comparison (Deep Dive)

The table below compares **FAISS, Chroma, Qdrant, Pinecone, and Weaviate** on factors that matter in production. Values are indicative; verify against current docs & your workload.

| Factor | **FAISS** | **Chroma** | **Qdrant** | **Pinecone** | **Weaviate** |
|---|---|---|---|---|---|
| Deployment model | In‑proc lib | Local/embedded | OSS server (Docker/K8s) | Managed SaaS | OSS server / Cloud |
| Primary ANN | Flat/IVF/PQ/HNSW | HNSW (hnswlib) | **HNSW** | Proprietary (managed) | **HNSW** |
| Vector types | float32 (CPU/GPU) | float32 | float32 | float32 | float32 |
| GPU support | ✅ (FAISS‑GPU) | ❌ | Indexing only (recent releases; verify) | ❌ (managed infra) | ❌ |
| Hybrid (BM25+dense) | Manual fusion | Limited | Sparse vectors + RRF fusion | API-level support | **Native hybrid** |
| Filters expressiveness | Client‑side only | Metadata filters | **Rich payload filters** (AND/OR/IN/range/geo) | **Rich filters** | **Rich filters + hybrid scoring** |
| Namespaces/collections | Multiple indexes | Collections | Collections | **Namespaces + indexes** | Collections (formerly classes) |
| Sharding & scaling | Process-level only | Single node (basic) | **Sharding/replication** | **Serverless/hosted scaling** | **Sharding/replication** |
| Upsert latency | N/A (in‑proc) | Low | Low‑Med | Low (SLA-backed) | Low‑Med |
| Query latency (p50) | Low (in‑mem) | Low | Low‑Med | **Low with SLA** | Low‑Med |
| Durability (WAL/snapshots) | Manual (save/load index files) | Disk persist dir | **Snapshots + WAL** | **Managed backups** | **Backups/snapshots** |
| DR/HA | Your responsibility | Local | **Replicas + failover** | **Multi‑AZ/region options** | **Replicas + failover** |
| Multi‑tenancy / RBAC | In your app | Basic | **Collections + ACL (enterprise)** | **Projects/Indexes + RBAC** | **Tenants/ACL** |
| Observability | Your code | Basic logs | **Metrics (Prometheus), logs, traces** | **Dashboards, metrics, logs** | **Metrics, modules** |
| Cost model | Infra only | Local | Infra (self‑host) | **Usage‑based SaaS** | Infra (self‑host/cloud) |
| Max dim / payload | Compile‑time/host | Moderate | **High dims + payload** | **High dims + payload** | **High dims + payload** |
| Third‑party ecosystem | Huge (research) | Good for LLM apps | **Strong OSS community** | **Enterprise ecosystem** | **ML/semantic ecosystem** |
| Typical best‑fit | Prototyping, research | Small/medium apps | Scalable OSS prod | Enterprise prod, SLAs | Hybrid search, semantic apps |

**When to choose…**
- **FAISS:** offline prototyping, research benchmarks, custom GPU pipelines.  
- **Chroma:** quick local apps and demos with persistence.  
- **Qdrant:** OSS production with filtering, sharding, replicas; K8s friendly.  
- **Pinecone:** managed enterprise workloads with SLAs and serverless ops.  
- **Weaviate:** hybrid search (BM25 + dense), rich schema, semantic features.

## 10) Data Scientist Production Checklist, with Explanations

### 1. Data & Embeddings
- **Gold evaluation sets:** curate representative question–answer or document–query pairs; used for offline recall@k or NDCG benchmarking.  
- **Embedding drift:** monitor cosine distance between current and baseline embeddings; re-embed if semantic space drifts after model updates or data changes.  
- **Versioning:** store model versions, vector index snapshots, and associated metadata (commit IDs) for full reproducibility.  
- **Deduplication:** apply MinHash/SimHash or cosine thresholds to remove near-duplicates; enforce TTL or “freshness” policies for time-sensitive content.
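
The offline benchmarking mentioned under gold evaluation sets can be sketched in a few lines. This is a minimal illustration assuming binary relevance labels; the `gold` and `runs` dictionaries and the document ids are hypothetical placeholders for your own evaluation data and retriever output.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical gold set: query id -> set of relevant document ids.
gold = {"q1": {"d1", "d4"}, "q2": {"d7"}}
# Hypothetical retriever output: query id -> ranked document ids.
runs = {"q1": ["d4", "d2", "d1"], "q2": ["d3", "d5", "d6"]}

mean_recall = sum(recall_at_k(runs[q], gold[q], 3) for q in gold) / len(gold)
mrr = sum(reciprocal_rank(runs[q], gold[q]) for q in gold) / len(gold)
mean_ndcg = sum(ndcg_at_k(runs[q], gold[q], 3) for q in gold) / len(gold)
print(mean_recall, mrr, round(mean_ndcg, 4))
```

Running the same gold set before and after an embedding model change gives a like-for-like comparison, which is what makes versioning the index and model together worthwhile.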

### 2. Index Design & Tuning
- **Choose ANN structure:** HNSW (high recall at low latency, at the cost of more memory), IVF (coarse clustering that trades recall for speed at large scale), PQ/OPQ (compressed storage with some accuracy loss).  
- **Parameter tuning:** adjust `M`, `efConstruction`, `efSearch` (HNSW) or `nlist`, `nprobe` (IVF) to balance recall and latency.  
- **Similarity metric:** L2-normalize embeddings when using cosine similarity; raw dot product also rewards vector magnitude; Euclidean distance is sensitive to scale.  
- **Namespaces/collections:** partition vectors by tenant, region, or domain to reduce cross-contamination.  
- **Blue/green rollouts:** build a new index in parallel, validate quality, then switch traffic atomically.
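
The similarity-metric point can be checked with a small experiment. The sketch below uses hand-picked two-dimensional vectors (an assumption for readability) to show that raw dot product favors longer vectors, while dot product on L2-normalized vectors equals cosine similarity and depends only on direction.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

short_vec = [1.0, 1.0]
long_vec = [10.0, 10.0]   # same direction as short_vec, 10x the magnitude
query_vec = [3.0, 0.0]

# Raw dot product rewards magnitude: the longer vector scores higher.
raw_short = dot(query_vec, short_vec)
raw_long = dot(query_vec, long_vec)

# After normalization both candidates score the same, because only direction matters.
cos_short = dot(l2_normalize(query_vec), l2_normalize(short_vec))
cos_long = dot(l2_normalize(query_vec), l2_normalize(long_vec))
print(raw_short, raw_long, round(cos_short, 4), round(cos_long, 4))
```

If an index is configured for inner product but the embeddings are not normalized, rankings can shift toward high-magnitude vectors, so it helps to normalize once at ingestion and again at query time.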

### 3. Retrieval Quality
- **Query rewriting:** expand under-specified queries using HyDE or multi-query generation for better recall.  
- **Reranking:** use a cross-encoder model (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2`) to reorder the top-k results by semantic relevance.  
- **Hybrid search:** blend BM25 (keyword) and vector results using Reciprocal Rank Fusion (RRF).  
- **Metrics:** measure Recall@k (coverage), NDCG (graded ranking quality), and MRR (rank of the first relevant result), and track queries that return no hits.  
- **Shadow/AB testing:** evaluate new embeddings or retrievers against live traffic before deployment.
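
Reciprocal Rank Fusion, named in the hybrid search item, is small enough to sketch directly. Each ranked list contributes `1 / (k + rank)` to a document's fused score. The constant `k=60` is the commonly used default, and the two input rankings are hypothetical stand-ins for BM25 and vector search output.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_hits = ["d2", "d1", "d5"]    # hypothetical keyword ranking
dense_hits = ["d1", "d3", "d2"]   # hypothetical vector ranking

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score calibration between BM25 and cosine similarity, which are on different scales.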

### 4. Latency & Capacity
- **Service SLOs:** define explicit latency targets (e.g., p95 < 200 ms end-to-end).  
- **Capacity planning:** raw float32 vector memory ≈ `N × dim × 4 bytes` (N = number of stored vectors), plus index overhead such as HNSW graph links; size shards by memory and replicas by QPS.  
- **Batch embeddings:** embed documents asynchronously in bulk jobs to reduce API cost.  
- **Caching:** store hot vectors or recent queries (LRU/LFU) to cut downstream load.  
- **Async upserts:** queue writes to avoid locking read paths.
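
The caching item can be illustrated with the standard library's LRU cache. In this sketch, `vector_search` is a hypothetical stand-in for the real embed-and-search round trip, and the cache size of 1024 is an arbitrary assumption.

```python
from functools import lru_cache

search_calls = 0  # counts how often the (expensive) backend is actually hit

def vector_search(query, k):
    """Hypothetical stand-in for an embed-and-search round trip to a vector DB."""
    global search_calls
    search_calls += 1
    return (f"result for {query!r}",) * k

@lru_cache(maxsize=1024)
def cached_search(query, k=3):
    # Arguments must be hashable; results come back as an immutable tuple.
    return vector_search(query, k)

cached_search("what is hnsw")
cached_search("what is hnsw")      # served from the cache, backend not called
cached_search("what is ivf")
cache_stats = cached_search.cache_info()
print(search_calls, cache_stats.hits, cache_stats.misses)
```

A cache like this can serve stale results after upserts, so one option is to call `cached_search.cache_clear()` whenever the index is refreshed.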

### 5. Security & Compliance
- **PII handling:** detect and redact sensitive data before embedding.  
- **Access control:** implement RBAC or per-namespace API keys.  
- **Encryption:** enforce TLS in transit, encrypted disks/snapshots at rest, and manage keys via KMS.  
- **Data residency:** ensure vector and metadata storage comply with local regulations (e.g., GDPR) and with the security frameworks your organization is audited against (e.g., ISO 27001, SOC 2).  
- **Audit trails:** log queries and upserts with trace IDs for investigation.
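
As a rough illustration of redacting before embedding, the sketch below masks email addresses and phone-like digit runs with regular expressions. The patterns are deliberately simple assumptions and will miss many real-world formats; production systems typically combine pattern matching with a dedicated PII detection step.

```python
import re

# Deliberately simple patterns for illustration; they will miss many real formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    """Replace email addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

raw = "Contact Jane at jane.doe@example.com or +1 (555) 010-9999 for access."
clean = redact_pii(raw)
print(clean)
```

Redaction has to happen before the text reaches the embedding model, because an embedding cannot be selectively scrubbed afterwards; the affected vectors would need to be deleted and re-embedded.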

### 6. Ops & Observability
- **Metrics:** collect QPS, latency (p50/p95), recall@k, CPU/memory, index size, ingestion lag.  
- **Dashboards:** visualize recall trends vs latency to detect regressions.  
- **Alerts:** trigger on recall dips, ingestion backlog, replica failures, or memory pressure.  
- **Backups & DR:** automate periodic snapshots; test restore within RPO/RTO windows.  
- **Runbooks:** maintain step-by-step guides for failures (e.g., corrupted index, embedding outage).
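
The metrics and alerts items can be sketched with the standard library. The example computes p50/p95 from latency samples and compares them to thresholds. The 200 ms budget mirrors the SLO example earlier in this checklist, while the recall floor of 0.9 and the sample data are assumptions.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return (p50, p95) from a list of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]

def check_alerts(p95_ms, recall_at_10, p95_budget_ms=200.0, recall_floor=0.9):
    """Compare current metrics to thresholds; the defaults here are assumptions."""
    alerts = []
    if p95_ms > p95_budget_ms:
        alerts.append("p95 latency over budget")
    if recall_at_10 < recall_floor:
        alerts.append("recall@10 below floor")
    return alerts

samples = list(range(1, 101))          # hypothetical latencies: 1..100 ms
p50, p95 = latency_percentiles(samples)
alerts = check_alerts(p95_ms=p95, recall_at_10=0.85)
print(p50, p95, alerts)
```

In practice these values would come from a metrics backend and not from an in-process list, but the threshold logic stays the same.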

### 7. Cost Management
- **Cost attribution:** tag vectors by project/tenant for per-use billing.  
- **Compression:** FP16 storage halves vector memory; product quantization (PQ) compresses much further, at some cost in recall.  
- **Pruning:** periodically drop stale or low-usage vectors.  
- **Scaling model:** use serverless for bursty workloads; dedicated clusters for steady traffic.  
- **Batch jobs:** schedule heavy embedding or reindexing off-peak.
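
The FP16 saving mentioned under compression is easy to verify with the standard library's half-precision packing. The sketch uses a synthetic 384-dimensional vector (the output size of `all-MiniLM-L6-v2`); the values themselves are made up for illustration.

```python
import math
import struct

dim = 384  # all-MiniLM-L6-v2 produces 384-dimensional vectors
vector = [math.sin(i) for i in range(dim)]  # synthetic values in [-1, 1]

fp32_bytes = struct.pack(f"<{dim}f", *vector)   # 4 bytes per component
fp16_bytes = struct.pack(f"<{dim}e", *vector)   # 2 bytes per component

# Round-trip through FP16 to see how much precision is lost.
restored = struct.unpack(f"<{dim}e", fp16_bytes)
max_error = max(abs(a - b) for a, b in zip(vector, restored))
savings = 1 - len(fp16_bytes) / len(fp32_bytes)
print(len(fp32_bytes), len(fp16_bytes), savings, max_error)
```

The rounding error per component is small for values in this range, but whether it affects ranking depends on the data, so it is worth re-running recall benchmarks after switching precision.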

### 8. Developer Experience & Governance
- **Infrastructure as Code:** manage DB and index definitions with Terraform/Helm.  
- **CI/CD pipelines:** automate retraining, embedding, and index refresh jobs.  
- **Schema evolution:** version metadata schema to avoid breaking API clients.  
- **Documentation:** maintain tuning parameters, SLAs, known issues, and troubleshooting guides.  
- **On-call readiness:** ensure rotation playbooks and monitoring dashboards are always current.