[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/advanced/Advanced_Vector_Store_and_Search.ipynb)

# Advanced Vector Store - Made Easy

## What You'll Learn

This notebook shows you **practical ways** to use vector stores in real applications. Each example is simple and ready to use.

### Topics

1. **Choosing the Right Index** - Which one to use and when
2. **Smart Filtering** - Find exactly what you need
3. **Combining Results** - Merge searches from different sources
4. **Organizing Data** - Keep different users' data separate

---

In [8]:
!pip install semantica

## Part 0: Setup Embeddings

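First, select an embedding provider and model. Semantica supports multiple providers, such as Sentence Transformers and FastEmbed. If `fastembed` is not installed, Semantica falls back to a built-in embedding method, as in the recorded output below; the reported dimension then comes from the fallback rather than from the selected model.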


In [9]:
from semantica.embeddings import TextEmbedder

# Choose provider and model
embedder = TextEmbedder(method="fastembed", model_name="BAAI/bge-small-en-v1.5")
dimension = embedder.get_embedding_dimension()

print(f"Selected model: {embedder.get_model_info()['model_name']}")
print(f"Embedding dimension: {dimension}")


fastembed not available. Install with: pip install fastembed. Using fallback embedding method.


Selected model: BAAI/bge-small-en-v1.5
Embedding dimension: 128


## Part 1: Choosing the Right Index

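Choosing an index is like choosing a filing system:
- **Flat**: Like reading every page of a small notebook - exact results, but slow as the collection grows
- **HNSW**: Like a well-organized library - fast, with approximate but usually very accurate results
- **IVF**: Like a warehouse with sections - only the most relevant sections are searched, which keeps huge collections fast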

### Simple Rule
- Fewer than 10,000 items? Use **Flat**
- Between 10,000 and 1 million? Use **HNSW** ✅ (recommended)
- More than 1 million? Use **IVF**
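
The rule above can be written as a small helper. This is a minimal sketch: `choose_index_type` is a hypothetical function, not part of Semantica, and the thresholds are the rule-of-thumb values from this section rather than hard limits.

```python
def choose_index_type(num_vectors: int) -> str:
    """Map a collection size to an index type using the rule of thumb above."""
    if num_vectors < 0:
        raise ValueError("num_vectors must be non-negative")
    if num_vectors < 10_000:
        return "flat"   # exact search is affordable at this size
    if num_vectors <= 1_000_000:
        return "hnsw"   # approximate graph-based search
    return "ivf"        # partitioned search for very large collections


for n in (5_000, 250_000, 5_000_000):
    print(f"{n:>9,} vectors -> {choose_index_type(n)}")
```

The returned string could then be passed as the `index_type` argument used in the cell below.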

In [10]:
from semantica.vector_store import FAISSStore
import numpy as np

# Create some example vectors (like document embeddings)
vectors = np.random.rand(5000, 768).astype('float32')
query = np.random.rand(768).astype('float32')

adapter = FAISSStore(dimension=768)

# HNSW Index - Best for most cases
index = adapter.create_index(index_type="hnsw", metric="L2", m=16)
adapter.add_vectors(vectors, ids=[f"doc_{i}" for i in range(len(vectors))])

# Search for similar vectors
results = adapter.search_similar(query, k=5)

print("Found 5 most similar documents:")
for i, result in enumerate(results, 1):
    print(f"  {i}. Document {result['id']} (distance: {result['distance']:.3f})")


Found 5 most similar documents:
  1. Document doc_1935 (distance: 110.917)
  2. Document doc_3860 (distance: 111.535)
  3. Document doc_277 (distance: 113.270)
  4. Document doc_1903 (distance: 113.371)
  5. Document doc_2959 (distance: 113.612)
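
The distances above (around 110) can look surprisingly large. They are consistent with squared L2 distances, which is what FAISS L2 indexes report: for two vectors with independent uniform [0, 1) components, each dimension contributes 1/6 on average, so 768 dimensions give about 128, and the nearest neighbours sit somewhat below that. The sketch below is a brute-force exact search in plain Python, the same computation a flat index performs. It assumes Semantica passes FAISS's squared distances through unchanged; `squared_l2` and `exact_search` are illustrative helpers, not library functions.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_search(query, vectors, k):
    """Compare the query with every vector and return the k closest as (distance, index)."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]


rng = random.Random(0)
dim = 768
vectors = [[rng.random() for _ in range(dim)] for _ in range(500)]
query = [rng.random() for _ in range(dim)]

top = exact_search(query, vectors, k=5)
mean_dist = sum(squared_l2(query, v) for v in vectors) / len(vectors)

print(f"Mean squared distance: {mean_dist:.1f} (theory: {dim / 6:.1f})")
for dist, idx in top:
    print(f"  doc_{idx}: {dist:.3f}")
```

With real embeddings the distances carry meaning; with random vectors, as here, all points are roughly equally far apart, so the ranking is arbitrary.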


## Part 2: Smart Filtering with Metadata

Imagine searching for "similar articles" but only from 2024 and only in the "Technology" category. That's what metadata filtering does!

### Real-World Example
You're building a document search where users want:
- Similar documents (vector search)
- From specific categories (metadata filter)
- From recent years (metadata filter)

In [11]:
from semantica.vector_store import HybridSearch, MetadataFilter
import numpy as np

# Create sample documents with metadata
documents = [
    {"id": 0, "text": "AI in Healthcare", "category": "Technology", "year": 2024},
    {"id": 1, "text": "Machine Learning Basics", "category": "Technology", "year": 2023},
    {"id": 2, "text": "Business Strategy", "category": "Business", "year": 2024},
    {"id": 3, "text": "Data Science Guide", "category": "Technology", "year": 2024},
    {"id": 4, "text": "Marketing Tips", "category": "Business", "year": 2023},
]

# Create vectors for each document
vectors = [np.random.rand(768) for _ in documents]
metadata = [{"category": d["category"], "year": d["year"]} for d in documents]
vector_ids = [f"doc_{d['id']}" for d in documents]

# Create search
search = HybridSearch()
query = np.random.rand(768)

# Example 1: Find Technology articles from 2024
filter1 = MetadataFilter().eq("category", "Technology").eq("year", 2024)
results = search.search(query, vectors, metadata, vector_ids, filter=filter1, k=10)

print("Technology articles from 2024:")
for r in results:
    doc_id = int(r['id'].split('_')[1])
    print(f"  - {documents[doc_id]['text']}")

# Example 2: Find any article from 2024
filter2 = MetadataFilter().eq("year", 2024)
results2 = search.search(query, vectors, metadata, vector_ids, filter=filter2, k=10)

print("\nAll articles from 2024:")
for r in results2:
    doc_id = int(r['id'].split('_')[1])
    print(f"  - {documents[doc_id]['text']} ({documents[doc_id]['category']})")

Technology articles from 2024:
  - AI in Healthcare
  - Business Strategy
  - Data Science Guide
  - Marketing Tips
  - Machine Learning Basics

All articles from 2024:
  - AI in Healthcare (Technology)
  - Business Strategy (Business)
  - Data Science Guide (Technology)
  - Marketing Tips (Business)
  - Machine Learning Basics (Technology)
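
The recorded output above lists all five documents for both filters, which suggests the filters were not applied in that run. What the two filters should return can be checked without the library, using plain Python. In this sketch, `matches` is a hypothetical helper that applies equality conditions with AND semantics; it is not part of Semantica.

```python
documents = [
    {"id": 0, "text": "AI in Healthcare", "category": "Technology", "year": 2024},
    {"id": 1, "text": "Machine Learning Basics", "category": "Technology", "year": 2023},
    {"id": 2, "text": "Business Strategy", "category": "Business", "year": 2024},
    {"id": 3, "text": "Data Science Guide", "category": "Technology", "year": 2024},
    {"id": 4, "text": "Marketing Tips", "category": "Business", "year": 2023},
]


def matches(meta, conditions):
    """Return True only if every condition equals the corresponding metadata value."""
    return all(meta.get(key) == value for key, value in conditions.items())


tech_2024 = [d["text"] for d in documents
             if matches(d, {"category": "Technology", "year": 2024})]
any_2024 = [d["text"] for d in documents if matches(d, {"year": 2024})]

print(tech_2024)  # ['AI in Healthcare', 'Data Science Guide']
print(any_2024)   # ['AI in Healthcare', 'Business Strategy', 'Data Science Guide']
```

If a filtered search returns documents outside these lists, the filter did not take effect; check how the filter is passed to `search` in your installed version.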


## Part 3: Combining Search Results

Sometimes you want to search in multiple places and combine the results, like searching both your email and your documents and then showing the best matches from both.

### When to Use This
- Searching multiple databases
- Combining different search strategies
- Giving more weight to certain sources

In [12]:
from semantica.vector_store import SearchRanker

# Simulate two different searches
# Search 1: Recent documents
recent_results = [
    {"id": "doc_3", "score": 0.95, "source": "recent"},
    {"id": "doc_0", "score": 0.90, "source": "recent"},
    {"id": "doc_2", "score": 0.85, "source": "recent"},
]

# Search 2: Popular documents
popular_results = [
    {"id": "doc_1", "score": 0.92, "source": "popular"},
    {"id": "doc_3", "score": 0.88, "source": "popular"},
    {"id": "doc_4", "score": 0.80, "source": "popular"},
]

# Method 1: Fair combination (RRF)
ranker = SearchRanker(strategy="reciprocal_rank_fusion")
combined = ranker.rank([recent_results, popular_results])

print("Combined results (fair ranking):")
for i, result in enumerate(combined[:3], 1):
    doc_id = int(result['id'].split('_')[1])
    print(f"  {i}. {documents[doc_id]['text']} (score: {result['score']:.3f})")

# Method 2: Prefer recent documents (70% recent, 30% popular)
weighted_ranker = SearchRanker(strategy="weighted_average")
weighted_combined = weighted_ranker.rank(
    [recent_results, popular_results],
    weights=[0.7, 0.3]
)

print("\nCombined results (prefer recent):")
for i, result in enumerate(weighted_combined[:3], 1):
    doc_id = int(result['id'].split('_')[1])
    print(f"  {i}. {documents[doc_id]['text']} (score: {result['score']:.3f})")

Combined results (fair ranking):
  1. Data Science Guide (score: 0.033)
  2. Machine Learning Basics (score: 0.016)
  3. AI in Healthcare (score: 0.016)

Combined results (prefer recent):
  1. Data Science Guide (score: 0.929)
  2. AI in Healthcare (score: 0.630)
  3. Business Strategy (score: 0.595)
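
The recorded scores above can be reproduced by hand. The sketch below assumes the RRF constant k = 60 (a common default, and the value consistent with the 0.033 and 0.016 scores shown) and treats the weighted strategy as a weighted sum in which a source that does not return a document contributes 0. Both functions are illustrative re-implementations, not Semantica's code.

```python
from collections import defaultdict

recent_results = [
    {"id": "doc_3", "score": 0.95},
    {"id": "doc_0", "score": 0.90},
    {"id": "doc_2", "score": 0.85},
]
popular_results = [
    {"id": "doc_1", "score": 0.92},
    {"id": "doc_3", "score": 0.88},
    {"id": "doc_4", "score": 0.80},
]


def reciprocal_rank_fusion(result_lists, k=60):
    """Score each document by the sum of 1 / (k + rank) over the lists it appears in."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            scores[item["id"]] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


def weighted_sum(result_lists, weights):
    """Score each document by the sum of weight * score over the lists it appears in."""
    scores = defaultdict(float)
    for results, weight in zip(result_lists, weights):
        for item in results:
            scores[item["id"]] += weight * item["score"]
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


rrf = reciprocal_rank_fusion([recent_results, popular_results])
weighted = weighted_sum([recent_results, popular_results], [0.7, 0.3])

print([(doc, round(s, 3)) for doc, s in rrf[:3]])
# [('doc_3', 0.033), ('doc_1', 0.016), ('doc_0', 0.016)]
print([(doc, round(s, 3)) for doc, s in weighted[:3]])
# [('doc_3', 0.929), ('doc_0', 0.63), ('doc_2', 0.595)]
```

RRF uses only rank positions, so it is a reasonable choice when the sources score on different scales; the weighted sum uses the raw scores and therefore assumes they are comparable.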


## Part 4: Keeping User Data Separate

If you're building an app with multiple users or companies, you need to keep their data separate. Namespaces group vector IDs per tenant, and access controls define who may read, write, or delete within each namespace.

### Real Example
You're building a SaaS app where:
- Company A has its own documents
- Company B has its own documents
- Neither should ever see the other's data

In [13]:
from semantica.vector_store import NamespaceManager

# Create manager
manager = NamespaceManager()

# Create separate spaces for each company
company_a = manager.create_namespace("company_a", "Company A's documents")
company_b = manager.create_namespace("company_b", "Company B's documents")

# Add documents to Company A
for i in range(10):
    manager.add_vector_to_namespace(f"company_a_doc_{i}", "company_a")

# Add documents to Company B
for i in range(15):
    manager.add_vector_to_namespace(f"company_b_doc_{i}", "company_b")

# Get each company's documents
a_docs = manager.get_namespace_vectors("company_a")
b_docs = manager.get_namespace_vectors("company_b")

print(f"Company A has {len(a_docs)} documents")
print(f"Company B has {len(b_docs)} documents")

# Set permissions (who can access what)
company_a.set_access_control("admin@companya.com", ["read", "write", "delete"])
company_a.set_access_control("user@companya.com", ["read"])  # Read-only

# Check permissions
print(f"\nAdmin can delete: {company_a.has_permission('admin@companya.com', 'delete')}")
print(f"User can delete: {company_a.has_permission('user@companya.com', 'delete')}")

Company A has 10 documents
Company B has 15 documents

Admin can delete: True
User can delete: False
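
To show how namespace membership and permissions can work together at query time, here is a minimal sketch in plain Python. `TenantSpace` and `visible_ids` are hypothetical names for illustration only; they do not describe how Semantica implements namespaces.

```python
from dataclasses import dataclass, field


@dataclass
class TenantSpace:
    name: str
    vector_ids: set = field(default_factory=set)
    permissions: dict = field(default_factory=dict)  # user -> set of allowed actions

    def allow(self, user, actions):
        """Grant a user a set of actions such as 'read' or 'delete'."""
        self.permissions[user] = set(actions)

    def can(self, user, action):
        """Unknown users have no permissions at all."""
        return action in self.permissions.get(user, set())


def visible_ids(space, user, candidate_ids):
    """Keep only candidates inside the space, and only if the user may read it."""
    if not space.can(user, "read"):
        return []
    return [vid for vid in candidate_ids if vid in space.vector_ids]


space_a = TenantSpace("company_a", {f"company_a_doc_{i}" for i in range(10)})
space_a.allow("user@companya.com", ["read"])

# Pretend these IDs came back from a search over a shared index.
candidates = ["company_a_doc_1", "company_b_doc_3", "company_a_doc_7"]

print(visible_ids(space_a, "user@companya.com", candidates))
# ['company_a_doc_1', 'company_a_doc_7']
print(visible_ids(space_a, "user@companyb.com", candidates))
# []
```

The design point is to deny by default: a user with no entry gets nothing, and results are restricted to the namespace's own IDs before they are returned.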


## Quick Reference Guide

### Which Index Should I Use?

```python
# Small dataset (< 10,000 items)
index = adapter.create_index(index_type="flat", metric="L2")

# Medium dataset (10,000 - 1,000,000 items) ✅ RECOMMENDED
index = adapter.create_index(index_type="hnsw", metric="L2", m=16)

# Large dataset (> 1,000,000 items)
index = adapter.create_index(index_type="ivf", metric="L2", nlist=100)
```

### How Do I Filter Results?

```python
# Single condition (named metadata_filter to avoid shadowing Python's built-in filter)
metadata_filter = MetadataFilter().eq("category", "Technology")

# Multiple conditions (AND)
metadata_filter = (
    MetadataFilter()
    .eq("category", "Technology")
    .eq("year", 2024)
)

# Range condition (greater than)
metadata_filter = MetadataFilter().gt("year", 2020)
```

### How Do I Combine Results?

```python
# Fair combination
ranker = SearchRanker(strategy="reciprocal_rank_fusion")
combined = ranker.rank([results1, results2])

# Weighted combination (prefer first source)
ranker = SearchRanker(strategy="weighted_average")
combined = ranker.rank([results1, results2], weights=[0.7, 0.3])
```

### How Do I Separate User Data?

```python
# Create namespace for each user/company
manager = NamespaceManager()
user_space = manager.create_namespace("user_123", "User 123's data")

# Add data to namespace
manager.add_vector_to_namespace("doc_1", "user_123")

# Get user's data
user_docs = manager.get_namespace_vectors("user_123")
```

---

## Summary

You've learned:

1. ✅ **Index Selection**: Use HNSW for most cases
2. ✅ **Smart Filtering**: Combine vector search with metadata
3. ✅ **Result Fusion**: Merge searches from different sources
4. ✅ **Data Isolation**: Keep users' data separate

### Next Steps

- Try these examples with your own data
- Experiment with different filters
- Build a multi-user application
- Explore the [introduction notebook](../introduction/13_Vector_Store.ipynb) for more basics

**Need Help?** Check our [documentation](https://semantica.readthedocs.io) or ask on [GitHub](https://github.com/Hawksight-AI/semantica).