# Semantic Search Pipeline

## Overview

This notebook walks through a semantic search pipeline end to end: parse documents, generate embeddings, store them in a vector store, and retrieve results for a query.

### Learning Objectives

- Parse documents for search
- Generate embeddings for semantic search
- Store embeddings in a vector store
- Query and retrieve relevant results

---

## Workflow

**Documents → Embeddings → Vector Store → Query → Results**

Each stage feeds the next: parsed text is embedded, the embeddings are stored, and a query embedded by the same generator is matched against the stored vectors.

---

## Step 1: Parse Documents

Start by parsing documents to extract searchable content.


In [None]:
from semantica.parse import DocumentParser

# Small in-memory corpus so the notebook runs without external files.
sample_docs = [
    "Python is a high-level programming language.",
    "JavaScript is used for web development.",
    "Machine learning algorithms learn from data.",
]

parser = DocumentParser()

try:
    parsed_docs = [parser.parse_document(doc) for doc in sample_docs]
    print(f"✓ Parsed {len(parsed_docs)} documents")
except Exception as e:
    # Fall back to the raw strings so the remaining steps can still run.
    print(f"✗ Error parsing documents: {e}")
    parsed_docs = sample_docs
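
Extracting searchable content often includes light text cleanup before embedding. The sketch below illustrates that idea only. It does not describe what `DocumentParser` does internally, and `clean_documents` is a hypothetical helper that is not part of semantica.

```python
import re


def clean_documents(docs):
    """Collapse runs of whitespace and drop documents that end up empty.

    Hypothetical helper for illustration; not part of semantica.
    """
    cleaned = []
    for doc in docs:
        text = re.sub(r"\s+", " ", doc).strip()
        if text:
            cleaned.append(text)
    return cleaned


raw_docs = ["  Python is a\nhigh-level   language. ", "", "\t\n"]
print(clean_documents(raw_docs))
```

Dropping empty entries at this stage keeps the document list and the embedding matrix the same length in later steps.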


## Step 2: Generate Embeddings

Generate embeddings for the parsed documents to enable semantic search.


In [None]:
from semantica.embeddings import EmbeddingGenerator
import numpy as np

generator = EmbeddingGenerator()

try:
    embeddings = generator.generate(parsed_docs)
    print("✓ Embeddings generated")
    print(f"  Documents: {len(parsed_docs)}")
    print("  Embeddings ready for storage")
except Exception as e:
    print(f"✗ Error generating embeddings: {e}")
    # Random vectors keep the notebook runnable but carry no semantic meaning,
    # so any search results built on them are arbitrary.
    embeddings = np.random.rand(len(parsed_docs), 1536).astype(np.float32)
    print("  Using demo embeddings")
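
Before storing embeddings it can help to check their shape and scale them to unit length, so that a dot product between two vectors equals their cosine similarity. The notebook does not say whether `EmbeddingGenerator.generate` already returns normalized vectors, so the following is a minimal sketch under that uncertainty. `normalize_rows` is a hypothetical helper, and it assumes the embeddings can be converted to a 2-D NumPy array.

```python
import numpy as np


def normalize_rows(embeddings):
    """Return a float32 array with each row of `embeddings` scaled to unit L2 norm.

    Hypothetical helper for illustration; not part of semantica.
    """
    matrix = np.asarray(embeddings, dtype=np.float32)
    if matrix.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {matrix.shape}")
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return matrix / np.maximum(norms, 1e-12)


demo = np.array([[3.0, 4.0], [0.0, 2.0]])
unit = normalize_rows(demo)
print(unit)
print(np.linalg.norm(unit, axis=1))
```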


## Step 3: Store in Vector Store

Store the embeddings along with documents and metadata in a vector store.


In [None]:
from semantica.vector_store import VectorStore

vector_store = VectorStore()

metadata = [{"id": i, "source": "demo"} for i in range(len(parsed_docs))]

try:
    vector_store.store(embeddings, parsed_docs, metadata)
    print("✓ Documents stored in vector store")
    print(f"  Stored {len(parsed_docs)} documents")
    
except Exception as e:
    print(f"✗ Error storing in vector store: {e}")
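
To make the storage step concrete, here is a toy in-memory store that accepts the same three arguments as the call above. It is a sketch for intuition only: `InMemoryVectorStore` is a hypothetical class, and nothing here reflects how semantica's `VectorStore` is implemented.

```python
import numpy as np


class InMemoryVectorStore:
    """Toy store that keeps vectors, documents, and metadata in parallel lists.

    Hypothetical class for illustration; not semantica's VectorStore.
    """

    def __init__(self):
        self.vectors = []
        self.documents = []
        self.metadata = []

    def store(self, embeddings, documents, metadata):
        embeddings = np.asarray(embeddings, dtype=np.float32)
        # The three inputs are aligned by position, so their lengths must match.
        if not (len(embeddings) == len(documents) == len(metadata)):
            raise ValueError("embeddings, documents, and metadata must have equal length")
        self.vectors.extend(embeddings)
        self.documents.extend(documents)
        self.metadata.extend(metadata)

    def __len__(self):
        return len(self.documents)


store = InMemoryVectorStore()
store.store(
    np.eye(2),
    ["doc a", "doc b"],
    [{"id": 0, "source": "demo"}, {"id": 1, "source": "demo"}],
)
print(len(store))
```

Checking the lengths up front catches misaligned inputs early, before a vector can be paired with the wrong document.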


## Step 4: Query and Retrieve

Query the vector store and retrieve the most relevant results.


In [None]:
from semantica.vector_store import VectorRetriever

retriever = VectorRetriever(vector_store)

query = "programming language"

try:
    query_embedding = generator.generate([query])[0]
except Exception as e:
    # Mirror the Step 2 fallback: a random vector keeps the cell runnable,
    # but the resulting ranking is arbitrary.
    print(f"✗ Error embedding query: {e}")
    query_embedding = np.random.rand(1536).astype(np.float32)

try:
    results = retriever.retrieve(query_embedding, top_k=3)

    print("✓ Production-ready semantic search complete")
    print(f"  Query: '{query}'")
    print(f"  Found {len(results) if results else 0} results")

    if results:
        print("\nTop Results:")
        for i, result in enumerate(results, start=1):
            # Probe attributes defensively and coerce the document to a string
            # so slicing works whatever type the retriever returns.
            score = getattr(result, "score", "N/A")
            doc = str(getattr(result, "document", "N/A"))
            print(f"  {i}. Score: {score}, Document: {doc[:60]}...")
    else:
        print("  Note: Results would show most semantically similar documents")

except Exception as e:
    print(f"✗ Error retrieving results: {e}")
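
The ranking behind a top-k query can be sketched with plain NumPy. This example assumes cosine similarity as the metric, which is a common choice, but the notebook does not state which metric `VectorRetriever` uses. `top_k_cosine` is a hypothetical helper and not part of semantica.

```python
import numpy as np


def top_k_cosine(query, matrix, k=3):
    """Return (row_index, cosine_similarity) pairs for the k best-matching rows.

    Hypothetical helper for illustration; not part of semantica.
    """
    query = np.asarray(query, dtype=np.float64)
    matrix = np.asarray(matrix, dtype=np.float64)
    # Cosine similarity = dot product divided by the product of the norms.
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    scores = (matrix @ query) / np.maximum(denom, 1e-12)
    # Sort by descending score and keep the first k row indices.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hits = top_k_cosine([1.0, 0.0], vectors, k=2)
print(hits)
```

In the toy data, row 0 points in the same direction as the query and ranks first, row 2 is at 45 degrees and ranks second, and the orthogonal row 1 is cut off by `k=2`.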
