# Day 10: Semantic Search

Traditional search matches **keywords**. Type "cat" and you find documents containing "cat".

But what if the document says "feline" instead?

**Semantic search** finds documents by **meaning**, not exact words.
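
To make the gap concrete, here is a minimal sketch of naive keyword matching. The `keyword_search` helper and the two sample sentences are illustrative assumptions, not part of this notebook's corpus.

```python
def keyword_search(query, docs):
    """Return documents that contain every query word (case-insensitive)."""
    query_words = query.lower().split()
    return [
        doc for doc in docs
        if all(word in doc.lower().split() for word in query_words)
    ]

# Hypothetical sample documents, for illustration only
sample_docs = [
    "The cat sat on the windowsill.",
    "A feline napped in the sun.",
]

cat_hits = keyword_search("cat", sample_docs)
print(cat_hits)
```

Only the sentence containing the literal word "cat" comes back. The "feline" sentence is about the same thing but is never matched, which is the problem embeddings are meant to address.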

## Setup

In [104]:
from google import genai
import os
from dotenv import load_dotenv
import numpy as np

load_dotenv(dotenv_path='../.env')
API_KEY = os.environ["GEMINI_API_KEY"]
client = genai.Client(api_key=API_KEY)

def cosine_similarity(vec1, vec2):
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
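
As a quick sanity check, the same function can be tried on toy 2-D vectors where the answer is known in advance. The vectors below are made up for illustration; real embeddings have thousands of dimensions.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Parallel vectors point the same way, so the score should be close to 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors share nothing, so the score should be close to 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite vectors should score close to -1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Note that the score depends only on direction, not length: `[1, 2]` and `[2, 4]` are treated as identical in meaning.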

## The Document Corpus

Let's create a small knowledge base: 5 documents about different topics.

In [105]:
documents = [
    "Python is a popular programming language for data science and machine learning.",
    "Neural networks are computing systems inspired by biological brain structures.",
    "Cloud computing provides on-demand access to computing resources over the internet.",
    "REST APIs allow different software systems to communicate over HTTP.",
    "Docker containers package applications with their dependencies for consistent deployment."
]

print("📚 Document Corpus:")
for i, doc in enumerate(documents):
    print(f"  {i+1}. {doc}")

📚 Document Corpus:
  1. Python is a popular programming language for data science and machine learning.
  2. Neural networks are computing systems inspired by biological brain structures.
  3. Cloud computing provides on-demand access to computing resources over the internet.
  4. REST APIs allow different software systems to communicate over HTTP.
  5. Docker containers package applications with their dependencies for consistent deployment.


## Step 1: Index the Documents

Convert each document to an embedding. This is done **once** when building the index.

In [106]:
# Generate embeddings for all documents
doc_embeddings = []
for doc in documents:
    response = client.models.embed_content(
        model="gemini-embedding-001",
        contents=doc
    )
    doc_embeddings.append(response.embeddings[0].values)

print(f"✅ Indexed {len(doc_embeddings)} documents")
print(f"📐 Each embedding: {len(doc_embeddings[0])} dimensions")

✅ Indexed 5 documents
📐 Each embedding: 3072 dimensions
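
Because indexing only has to happen once, one option is to save the embeddings to disk and reload them later instead of calling the embedding model again. The sketch below assumes the index fits in memory as a NumPy array; the random vectors stand in for `doc_embeddings`, and the file name `doc_index.npy` is hypothetical.

```python
import os
import tempfile
import numpy as np

# Stand-in for doc_embeddings: 5 random vectors (real ones come from the embedding model)
rng = np.random.default_rng(seed=0)
fake_embeddings = rng.normal(size=(5, 8)).tolist()

# Stack the list of vectors into one (num_docs, dim) array and save it
index_matrix = np.array(fake_embeddings)
index_path = os.path.join(tempfile.mkdtemp(), "doc_index.npy")  # hypothetical file name
np.save(index_path, index_matrix)

# Later (for example in a new session), reload instead of re-embedding
loaded_matrix = np.load(index_path)
print(loaded_matrix.shape)
```

The documents themselves would need to be stored alongside the matrix, in the same order, so that row `i` still maps back to document `i`.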


## Step 2: Search Function

When a query comes in:
1. Embed the query
2. Compare to all document embeddings
3. Return documents sorted by similarity

In [107]:
def search(query, top_k=3):
    # Embed the query
    query_response = client.models.embed_content(
        model="gemini-embedding-001",
        contents=query
    )
    query_embedding = query_response.embeddings[0].values
    
    # Calculate similarity with all documents
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        similarity = cosine_similarity(query_embedding, doc_emb)
        scores.append((similarity, i))
    
    # Sort by similarity (highest first)
    scores.sort(reverse=True)
    
    # Return top-k results
    results = []
    for sim, idx in scores[:top_k]:
        results.append({
            "document": documents[idx],
            "score": sim
        })
    
    return results
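
For larger corpora, the per-document loop can be replaced by a single matrix operation. The following is a sketch of that idea, not a drop-in replacement for `search`: `rank_by_cosine` is a hypothetical helper, and the toy 2-D vectors stand in for real embeddings.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_matrix, top_k=3):
    """Return (index, score) pairs for the top_k rows most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Normalize to unit length so that a dot product equals cosine similarity
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    # One matrix-vector product scores every document at once
    scores = doc_units @ query_unit
    # argsort is ascending, so reverse it and keep the first top_k indices
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top_indices]

# Toy "embeddings" for three documents
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = rank_by_cosine([1.0, 0.1], toy_docs, top_k=2)
print(ranked)
```

The work is still linear in the number of documents, but it runs inside NumPy rather than in a Python loop, which tends to matter once the corpus grows beyond a handful of entries.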

## Test: Keyword Match Query

In [108]:
query = "What is Python used for?"

print(f"🔎 Query: '{query}'\n")
results = search(query)

print("📋 Results:")
for i, r in enumerate(results, 1):
    print(f"  {i}. [{r['score']:.4f}] {r['document']}")

🔎 Query: 'What is Python used for?'

📋 Results:
  1. [0.7176] Python is a popular programming language for data science and machine learning.
  2. [0.5636] Neural networks are computing systems inspired by biological brain structures.
  3. [0.5424] Docker containers package applications with their dependencies for consistent deployment.


## Test: Semantic Query (No Keyword Match)

In [109]:
query = "How do AI systems learn?"

print(f"🔎 Query: '{query}'\n")
results = search(query)

print("📋 Results:")
for i, r in enumerate(results, 1):
    print(f"  {i}. [{r['score']:.4f}] {r['document']}")

🔎 Query: 'How do AI systems learn?'

📋 Results:
  1. [0.6794] Neural networks are computing systems inspired by biological brain structures.
  2. [0.5839] Python is a popular programming language for data science and machine learning.
  3. [0.5731] REST APIs allow different software systems to communicate over HTTP.


## Test: Infrastructure Query

In [None]:
query = "How to deploy applications consistently?"

print(f"🔎 Query: '{query}'\n")
results = search(query)

print("📋 Results:")
for i, r in enumerate(results, 1):
    print(f"  {i}. [{r['score']:.4f}] {r['document']}")

## Key Takeaways

1. **Semantic search** finds documents by meaning, not keywords
2. **Index once**: embed all documents upfront
3. **Search in O(n)**: compare the query to every document embedding
4. **Top-k retrieval**: return the most relevant documents

---

**Next:** Day 11, RAG: Combining search with LLM generation