# Semantic Search with ScyllaDB and Google Gemini

This notebook demonstrates semantic search capabilities using:
- **AsyncScyllaDBStore** - Vector-enabled key-value store
- **Google Gemini Embeddings** - 3072-dimensional embeddings
- **sklearn cosine_similarity** - Efficient similarity computation

## Prerequisites
- ScyllaDB running on localhost:9042
- GOOGLE_API_KEY in .env file

## Setup and Imports

In [None]:
import os
import asyncio
from pathlib import Path
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from vertector_scylladbstore import AsyncScyllaDBStore

# Load environment variables
def load_env():
    env_path = Path(".") / ".env"
    if env_path.exists():
        with open(env_path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith('#') and '=' in line:
                    key, value = line.split('=', 1)
                    value = value.strip().strip('"').strip("'")
                    os.environ[key.strip()] = value

load_env()

# Verify API key
if not os.getenv("GOOGLE_API_KEY"):
    print("⚠️  GOOGLE_API_KEY not found in .env file")
else:
    print("✓ GOOGLE_API_KEY loaded")

## 1. Initialize Gemini Embeddings

The Gemini embedding model accepts a task type that tunes each vector for its intended use:
- `RETRIEVAL_DOCUMENT` - For embedding documents to store
- `RETRIEVAL_QUERY` - For embedding search queries
- `SEMANTIC_SIMILARITY` - General similarity tasks
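
As a minimal sketch of how the task type might be chosen per use, the lookup below maps a usage label to one of the task types listed above. The helper name `task_type_for` and the labels `"store"`, `"search"`, and `"compare"` are assumptions made for illustration; they are not part of `langchain_google_genai`.

```python
# Hypothetical helper: map what a text is used for to a Gemini task type.
TASK_TYPES = {
    "store": "RETRIEVAL_DOCUMENT",     # documents written to the store
    "search": "RETRIEVAL_QUERY",       # search queries
    "compare": "SEMANTIC_SIMILARITY",  # general similarity tasks
}


def task_type_for(usage: str) -> str:
    """Return the task type for a usage label, or raise on an unknown label."""
    try:
        return TASK_TYPES[usage]
    except KeyError:
        raise ValueError(
            f"unknown usage {usage!r}; expected one of {sorted(TASK_TYPES)}"
        ) from None


print(task_type_for("store"))
```

The returned string could then be passed as `task_type=` when constructing an embeddings instance, as the next cell does with `"RETRIEVAL_DOCUMENT"`.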

In [None]:
# Create embeddings with RETRIEVAL_DOCUMENT task type
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    task_type="RETRIEVAL_DOCUMENT"
)

# Test embedding generation
test_text = "Hello, world!"
test_embedding = embeddings.embed_query(test_text)

print("✓ Embeddings initialized")
print("  Model: gemini-embedding-001")
print(f"  Dimensions: {len(test_embedding)}")
print(f"  Sample (first 5): {test_embedding[:5]}")

## 2. Create ScyllaDB Store with Semantic Search

Configure the store with `IndexConfig` to enable semantic search:
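
Because the `dims` value has to agree with the length of the vectors the embedder actually returns, a quick check before creating the store can catch a mismatch early. The sketch below is an assumption-laden illustration: `check_dims` and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real embedder so the example runs without an API key.

```python
from typing import Callable, Sequence


def check_dims(dims: int, embed_one: Callable[[str], Sequence[float]]) -> int:
    """Embed a probe string and confirm the vector length equals `dims`."""
    actual = len(embed_one("dimension probe"))
    if actual != dims:
        raise ValueError(f"index expects {dims} dims, embedder returned {actual}")
    return actual


# Stand-in embedder (hypothetical): always returns a 3072-element zero vector.
def fake_embed(text: str) -> list[float]:
    return [0.0] * 3072


print(check_dims(3072, fake_embed))
```

In this notebook the same check could be written as `check_dims(index_config["dims"], embeddings.embed_query)`, at the cost of one extra embedding request.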

In [None]:
# Define IndexConfig for semantic search
index_config = {
    "dims": 3072,  # Output size of gemini-embedding-001; must match the embedding length
    "embed": embeddings,  # Pass embeddings instance
    "fields": ["$"]  # Embed entire value dict
}

# The store itself is created later, inside an async context
print("✓ IndexConfig created")
print(f"  Dimensions: {index_config['dims']}")
print(f"  Fields to embed: {index_config['fields']}")

## 3. Insert Documents with Auto-Embedding

When you call `aput()`, the store generates embeddings automatically using the `embed` instance from `index_config`.
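
Since every insert triggers an embedding request, larger document sets may benefit from running a few inserts at a time instead of strictly one after another. The sketch below shows one possible pattern with a bounded `asyncio.Semaphore`. The helper `put_all` is hypothetical, and it assumes that the `put` callable (for example `store.aput`) tolerates concurrent calls and that the embedding API's rate limits allow it; neither is confirmed by this notebook.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def put_all(
    put: Callable[..., Awaitable[Any]],
    namespace: tuple[str, ...],
    documents: list[dict],
    max_concurrency: int = 4,
) -> int:
    """Call `put` once per document, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def put_one(doc: dict) -> None:
        # The semaphore caps how many puts are in flight simultaneously.
        async with semaphore:
            await put(namespace=namespace, key=doc["id"], value=doc)

    await asyncio.gather(*(put_one(doc) for doc in documents))
    return len(documents)
```

Inside a notebook cell this would be awaited directly, for example `await put_all(store.aput, ("knowledge", "tech"), documents)`.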

In [None]:
async def insert_documents():
    """Insert sample documents with automatic embedding."""
    
    async with AsyncScyllaDBStore.from_contact_points(
        contact_points=["127.0.0.1"],
        keyspace="notebook_demo",
        index=index_config,
        enable_circuit_breaker=False
    ) as store:
        
        # Setup schema
        await store.setup()
        print("✓ Store created and schema initialized\n")
        
        # Sample documents about different topics
        documents = [
            {
                "id": "ai1",
                "title": "Machine Learning Basics",
                "content": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without explicit programming.",
                "category": "AI",
                "tags": ["ml", "ai", "data-science"]
            },
            {
                "id": "ai2",
                "title": "Neural Networks",
                "content": "Deep learning uses neural networks with multiple layers to process complex patterns in data, revolutionizing computer vision and natural language processing.",
                "category": "AI",
                "tags": ["deep-learning", "neural-nets", "ai"]
            },
            {
                "id": "db1",
                "title": "NoSQL Databases",
                "content": "ScyllaDB is a high-performance NoSQL database that provides low-latency and high-throughput for modern applications, compatible with Apache Cassandra.",
                "category": "Database",
                "tags": ["nosql", "database", "scylladb"]
            },
            {
                "id": "prog1",
                "title": "Python Programming",
                "content": "Python is a versatile programming language widely used for web development, data science, automation, and machine learning applications.",
                "category": "Programming",
                "tags": ["python", "coding", "development"]
            },
            {
                "id": "ai3",
                "title": "Vector Embeddings",
                "content": "Vector embeddings represent text, images, or other data as numerical vectors in high-dimensional space, enabling semantic similarity searches.",
                "category": "AI",
                "tags": ["embeddings", "vectors", "semantic-search"]
            },
            {
                "id": "cloud1",
                "title": "Cloud Computing",
                "content": "Cloud platforms provide scalable infrastructure for deploying applications, offering services like compute, storage, and managed databases.",
                "category": "Cloud",
                "tags": ["cloud", "aws", "infrastructure"]
            }
        ]
        
        # Insert documents
        print("Inserting documents with auto-embedding...\n")
        for doc in documents:
            await store.aput(
                namespace=("knowledge", "tech"),
                key=doc["id"],
                value=doc
            )
            print(f"  ✓ {doc['id']:8} - {doc['title']}")
        
        print(f"\n✓ Inserted {len(documents)} documents with embeddings")
        return len(documents)

# Run insertion
doc_count = await insert_documents()
print(f"\nReady for semantic search with {doc_count} documents!")

## 4. Semantic Search - Basic Queries

Search using natural language queries. Results are ranked by cosine similarity.
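
To make the ranking step concrete, here is a small pure-Python sketch of cosine similarity and top-k selection. It is not the store's implementation (the notebook states that sklearn's `cosine_similarity` is used); the function names and the 3-dimensional toy vectors are made up for illustration.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], docs: dict[str, Sequence[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Score every document against the query and return the k best, highest first."""
    scored = [(key, cosine(query, vec)) for key, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (not real embeddings): each axis loosely stands for one topic.
toy_docs = {
    "ai1": [0.9, 0.1, 0.0],
    "db1": [0.0, 1.0, 0.0],
    "cloud1": [0.1, 0.2, 0.9],
}
ranking = top_k([1.0, 0.0, 0.0], toy_docs, k=2)
print(ranking)
```

A query pointing along the first axis ranks `ai1` first because its vector points in nearly the same direction; vector length does not affect the score.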

In [None]:
async def semantic_search(query_text, limit=3):
    """Perform semantic search and display results."""
    
    async with AsyncScyllaDBStore.from_contact_points(
        contact_points=["127.0.0.1"],
        keyspace="notebook_demo",
        index=index_config,
        enable_circuit_breaker=False
    ) as store:
        
        await store.setup()  # Ensure prepared statements are ready
        
        # Perform semantic search
        results = await store.asearch(
            ("knowledge", "tech"),  # namespace_prefix (positional-only)
            query=query_text,
            limit=limit
        )
        
        return results

# Test query 1: AI-related
print("Query: 'What is artificial intelligence and how does it work?'\n")
results = await semantic_search("What is artificial intelligence and how does it work?")

for i, item in enumerate(results, 1):
    print(f"{i}. Score: {item.score:.4f}")
    print(f"   Title: {item.value['title']}")
    print(f"   Category: {item.value['category']}")
    print(f"   Content: {item.value['content'][:80]}...")
    print()

In [None]:
# Test query 2: Database-related
print("Query: 'Tell me about high-performance databases'\n")
results = await semantic_search("Tell me about high-performance databases")

for i, item in enumerate(results, 1):
    print(f"{i}. Score: {item.score:.4f}")
    print(f"   Title: {item.value['title']}")
    print(f"   Category: {item.value['category']}")
    print(f"   Content: {item.value['content'][:80]}...")
    print()

In [None]:
# Test query 3: Vector/Embedding-related
print("Query: 'How do vector representations work for similarity?'\n")
results = await semantic_search("How do vector representations work for similarity?")

for i, item in enumerate(results, 1):
    print(f"{i}. Score: {item.score:.4f}")
    print(f"   Title: {item.value['title']}")
    print(f"   Category: {item.value['category']}")
    print(f"   Content: {item.value['content'][:80]}...")
    print()

## 5. Combined Filter + Semantic Search

Filter results by metadata, then rank by semantic similarity.
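
The filter-then-rank idea can be sketched in a few lines of plain Python. This is an illustration only: it handles exact-match equality on top-level keys, the helper names are hypothetical, the scores are made up, and it says nothing about which filter operators the store itself supports.

```python
from typing import Any


def matches(value: dict[str, Any], filter_dict: dict[str, Any]) -> bool:
    """True when every filter key is present in `value` with an equal value."""
    return all(
        key in value and value[key] == expected
        for key, expected in filter_dict.items()
    )


def filter_then_rank(
    items: list[tuple[dict[str, Any], float]],
    filter_dict: dict[str, Any],
    limit: int = 3,
) -> list[tuple[dict[str, Any], float]]:
    """Keep (value, score) pairs that pass the filter, then sort by score descending."""
    kept = [(value, score) for value, score in items if matches(value, filter_dict)]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Made-up similarity scores for three of the sample documents.
scored = [
    ({"title": "Machine Learning Basics", "category": "AI"}, 0.81),
    ({"title": "NoSQL Databases", "category": "Database"}, 0.92),
    ({"title": "Neural Networks", "category": "AI"}, 0.87),
]
for value, score in filter_then_rank(scored, {"category": "AI"}):
    print(f"{score:.2f}  {value['title']}")
```

Note that the highest-scoring item overall is dropped because it fails the filter, which is the behavior a metadata filter is meant to provide.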

In [None]:
async def filtered_semantic_search(query_text, filter_dict, limit=3):
    """Perform filtered semantic search."""
    
    async with AsyncScyllaDBStore.from_contact_points(
        contact_points=["127.0.0.1"],
        keyspace="notebook_demo",
        index=index_config,
        enable_circuit_breaker=False
    ) as store:
        
        await store.setup()  # Ensure prepared statements are ready
        
        results = await store.asearch(
            ("knowledge", "tech"),
            query=query_text,
            filter=filter_dict,
            limit=limit
        )
        
        return results

# Search only AI category documents
print("Query: 'learning and intelligence'")
print("Filter: category = 'AI'\n")

results = await filtered_semantic_search(
    "learning and intelligence",
    {"category": "AI"}
)

for i, item in enumerate(results, 1):
    print(f"{i}. Score: {item.score:.4f}")
    print(f"   Title: {item.value['title']}")
    print(f"   Category: {item.value['category']}")
    print(f"   Tags: {', '.join(item.value['tags'])}")
    print()

## 6. Field-Specific Embedding

Control which fields are embedded with the `index` parameter of `aput()`: pass a list of field names to embed only those fields, or `False` to skip embedding entirely.
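
One way to picture field selection is as a step that builds the text to embed from the chosen fields. The sketch below is a guess at the general idea, not the library's actual logic: `text_to_embed` is a hypothetical helper, and how `AsyncScyllaDBStore` serializes values (including the `"$"` case) is not shown in this notebook.

```python
import json
from typing import Any


def text_to_embed(value: dict[str, Any], fields: list[str]) -> str:
    """Build the text that would be embedded for `value`.

    "$" selects the whole dict (serialized here as JSON); any other entry is
    treated as a top-level key, and missing keys are skipped.
    """
    if "$" in fields:
        return json.dumps(value, sort_keys=True)
    return "\n".join(str(value[name]) for name in fields if name in value)


doc = {
    "title": "Kubernetes Orchestration",
    "content": "Kubernetes automates deployment of containerized applications.",
    "secret_key": "xyz123",
}
print(text_to_embed(doc, ["title", "content"]))
```

With `["title", "content"]` the `secret_key` value never reaches the embedding text, whereas `["$"]` would include every field.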

In [None]:
async def insert_with_field_control():
    """Demonstrate field-specific embedding control."""
    
    async with AsyncScyllaDBStore.from_contact_points(
        contact_points=["127.0.0.1"],
        keyspace="notebook_demo",
        index=index_config,
        enable_circuit_breaker=False
    ) as store:
        
        await store.setup()  # Ensure prepared statements are ready
        
        # Insert with only title and content embedded (not metadata)
        await store.aput(
            namespace=("knowledge", "tech"),
            key="custom1",
            value={
                "id": "custom1",
                "title": "Kubernetes Orchestration",
                "content": "Kubernetes automates deployment, scaling, and management of containerized applications across clusters.",
                "category": "Cloud",
                "metadata": "Internal doc - do not embed this field",
                "secret_key": "xyz123"  # Won't be embedded
            },
            index=["title", "content"],  # Only embed these fields
            wait_for_vector_sync=True
        )
        
        print("✓ Inserted document with field-specific embedding")
        print("  Embedded fields: title, content")
        print("  Skipped fields: metadata, secret_key\n")
        
        # Insert without embedding
        await store.aput(
            namespace=("knowledge", "tech"),
            key="temp1",
            value={
                "id": "temp1",
                "title": "Temporary Data",
                "content": "This is temporary and won't be searchable"
            },
            index=False,  # Skip embedding entirely
            wait_for_vector_sync=True
        )
        
        print("✓ Inserted document without embedding (index=False)")
        print("  This document won't appear in semantic search\n")
        
        # Verify semantic search finds custom1 but not temp1
        results = await store.asearch(
            ("knowledge", "tech"),
            query="container orchestration kubernetes",
            limit=5
        )
        
        custom1_found = any(item.key == "custom1" for item in results)
        temp1_found = any(item.key == "temp1" for item in results)
        
        print("Semantic search results:")
        print(f"  custom1 found: {custom1_found}")
        print(f"  temp1 found: {temp1_found}")
        
        if custom1_found:
            custom1_result = next(item for item in results if item.key == "custom1")
            print(f"\n  custom1 score: {custom1_result.score:.4f}")
            print(f"  custom1 title: {custom1_result.value['title']}")

await insert_with_field_control()

## 7. Performance Metrics

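View query statistics and latency metrics. Metrics are tracked per store instance, so the fresh instance created in this cell reports only its own activity.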
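
For readers curious how such statistics might be derived, the following sketch aggregates recorded latencies into the same keys the cell below prints. `QueryStats` is a hypothetical stand-in written for this example; it is not the store's metrics class, and the recorded values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class QueryStats:
    """Hypothetical in-memory recorder for query latencies and errors."""

    latencies_ms: list[float] = field(default_factory=list)
    errors: int = 0

    def record(self, latency_ms: float, ok: bool = True) -> None:
        """Store one query's latency and count it as an error if `ok` is False."""
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def get_stats(self) -> dict[str, float]:
        """Summarize recorded queries; all values are zero when nothing was recorded."""
        total = len(self.latencies_ms)
        if total == 0:
            return {
                "total_queries": 0,
                "avg_latency_ms": 0.0,
                "min_latency_ms": 0.0,
                "max_latency_ms": 0.0,
                "error_rate": 0.0,
            }
        return {
            "total_queries": total,
            "avg_latency_ms": sum(self.latencies_ms) / total,
            "min_latency_ms": min(self.latencies_ms),
            "max_latency_ms": max(self.latencies_ms),
            "error_rate": self.errors / total,
        }


recorder = QueryStats()
recorder.record(10.0)
recorder.record(20.0)
recorder.record(30.0, ok=False)
print(recorder.get_stats())
```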

In [None]:
async def show_metrics():
    """Display store performance metrics."""
    
    async with AsyncScyllaDBStore.from_contact_points(
        contact_points=["127.0.0.1"],
        keyspace="notebook_demo",
        index=index_config,
        enable_circuit_breaker=False
    ) as store:
        
        await store.setup()  # Ensure prepared statements are ready
        
        stats = store.metrics.get_stats()
        
        print("Store Performance Metrics")
        print("=" * 50)
        print(f"Total queries:      {stats['total_queries']}")
        print(f"Average latency:    {stats['avg_latency_ms']:.2f} ms")
        print(f"Min latency:        {stats['min_latency_ms']:.2f} ms")
        print(f"Max latency:        {stats['max_latency_ms']:.2f} ms")
        print(f"Error rate:         {stats['error_rate']:.2%}")

await show_metrics()

## 8. Interactive Search

Try your own queries!

In [None]:
# Try your own query
your_query = "programming languages for data science"  # Change this!

print(f"Your Query: '{your_query}'\n")
results = await semantic_search(your_query, limit=5)

for i, item in enumerate(results, 1):
    print(f"{i}. Score: {item.score:.4f} | {item.value['title']}")
    print(f"   {item.value['content'][:100]}...\n")

## Cleanup

Optionally clean up all test data.

In [None]:
async with AsyncScyllaDBStore.from_contact_points(
    contact_points=["127.0.0.1"],
    keyspace="notebook_demo",
    enable_circuit_breaker=False
) as store:

    await store.setup()

    # List all namespaces
    namespaces = await store.alist_namespaces()

    # Delete the items in each namespace (up to 1000 per namespace; raise the limit for larger datasets)
    for ns in namespaces:
        items = await store.asearch(ns, limit=1000)
        for item in items:
            await store.adelete(item.namespace, item.key)
            print(f"Deleted: {item.namespace} / {item.key}")

## Summary

This notebook demonstrated:

1. ✅ **IndexConfig** - Configure semantic search with Gemini embeddings
2. ✅ **Auto-embedding** - Embeddings generated automatically on `aput()`
3. ✅ **Semantic search** - Natural language queries with similarity ranking
4. ✅ **Filtered search** - Combine metadata filters with semantic search
5. ✅ **Field control** - Choose which fields to embed
6. ✅ **Performance** - Query metrics and latency tracking

### Key Features

- **Vector Storage**: ScyllaDB `VECTOR<FLOAT, 3072>` column
- **Embeddings**: Google Gemini `gemini-embedding-001`
- **Similarity**: sklearn `cosine_similarity` for efficient batch computation
- **Compatible**: LangGraph BaseStore interface

### Architecture Notes

**How it works:**
1. Documents are inserted with `aput()` which automatically generates embeddings
2. Search queries are embedded using the same model
3. Cosine similarity is computed between the query embedding and every document embedding
4. Results are ranked by similarity score (higher = more relevant)

**Important:** Each time you create a new store instance with `from_contact_points()`, you must call `await store.setup()` to prepare statements. The notebook cells each create independent instances, so they don't share metrics.
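
To share one instance across cells, the store can be opened once and kept open with `contextlib.AsyncExitStack`. The sketch below is one possible pattern, not something this notebook or the library prescribes: `SharedStore` and `FakeStore` are hypothetical names, and `FakeStore` only mimics the async-context and `setup()` shape so the example runs without a database.

```python
import asyncio
import contextlib
from typing import Any, Callable, Optional


class SharedStore:
    """Open an async-context-managed store once and hand out the same instance."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory  # must return an async context manager
        self._stack: Optional[contextlib.AsyncExitStack] = None
        self._store: Any = None

    async def get(self) -> Any:
        """Return the open store, entering its context and calling setup() on first use."""
        if self._store is None:
            stack = contextlib.AsyncExitStack()
            store = await stack.enter_async_context(self._factory())
            try:
                await store.setup()
            except BaseException:
                await stack.aclose()  # do not leak a half-initialized store
                raise
            self._stack, self._store = stack, store
        return self._store

    async def close(self) -> None:
        """Exit the store's context; a later get() opens a new instance."""
        if self._stack is not None:
            await self._stack.aclose()
            self._stack, self._store = None, None


class FakeStore:
    """Stand-in with the same async-context and setup() shape, for a dry run."""

    def __init__(self) -> None:
        self.setup_calls = 0
        self.closed = False

    async def __aenter__(self) -> "FakeStore":
        return self

    async def __aexit__(self, *exc_info: Any) -> None:
        self.closed = True

    async def setup(self) -> None:
        self.setup_calls += 1


async def demo() -> tuple[bool, int, bool]:
    shared = SharedStore(FakeStore)
    first = await shared.get()
    second = await shared.get()
    await shared.close()
    return first is second, first.setup_calls, first.closed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this notebook the factory could be `lambda: AsyncScyllaDBStore.from_contact_points(contact_points=["127.0.0.1"], keyspace="notebook_demo", index=index_config, enable_circuit_breaker=False)`, with `store = await shared.get()` in each cell and `await shared.close()` at the end. Inside a notebook, use `await demo()` instead of `asyncio.run(demo())`, since a loop is already running.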

### Next Steps

- Try different queries and see how well semantic search works
- Experiment with different embedding models (OpenAI, Cohere, etc.)
- Add more documents and test performance at scale
- Combine with hybrid search (keyword + semantic)
- Use a single store instance across cells to accumulate metrics