In [1]:
%run supportvectors-common.ipynb



<center><img src="images/logo-poster-transparent.png" width="400"/> </center>
<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only participants in SupportVectors workshops are currently allowed to study these notebooks, and only for educational purposes; copying them or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



🎯 Project root: /Users/asifqamar/github/rag_to_riches
📁 Working directory: /Users/asifqamar/github/rag_to_riches
✅ Ready to import rag_to_riches modules!


# RAG to Riches: Lesson 1 - The Power of Semantic Search

## 🎯 Learning Objectives

By the end of this lesson, you will understand:

1. **What** semantic search is and how it differs from traditional keyword search
2. **Why** semantic search is revolutionary for information retrieval
3. **Where** semantic search excels and its best use cases
4. **How** to implement semantic search with embeddings and vector databases

---

## 📚 Table of Contents

0. [The Intuition Behind Semantic Search](#intuition)

1. [The Three W's of Semantic Search](#the-three-ws)
   - [What is Semantic Search?](#what)
   - [Why Semantic Search?](#why) 
   - [Where to Use Semantic Search?](#where)

2. [The HOW: Implementation Deep Dive](#implementation)
   - [Data Structure: AnimalQuote Examples](#data-structure)
   - [The Indexing Pipeline](#indexing-pipeline)
   - [The Search Process](#search-process)
   - [Hands-on Examples](#examples)

3. [Key Takeaways](#takeaways)

---


## The Intuition Behind Semantic Search {#intuition}

# Semantic search

Traditionally, one would search a corpus of documents with a keyword-based search engine such as Lucene, Solr, or Elasticsearch. While the technology has matured, the basic approach behind keyword search engines is to maintain an *inverted index* that maps each keyword to the list of documents containing it, with associated relevance scores.

In general, the keyword-based approach has been quite successful over the years, and it has matured with added features and linguistic capabilities.

However, this approach has its limitations. The principal cause is that when we enter keywords, we tend to describe the intent of what we are looking for. For example, if we enter "breakfast places", we implicitly also mean restaurants, cafes, and similar places that serve items appropriate for breakfast. A restaurant described as a shop for espresso or crepes will likely be missed by a keyword search, since its keywords do not match the query terms. And yet, we would hope to see it near the top of the search results.
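To make the limitation concrete, here is a minimal sketch of an inverted index with Boolean AND matching. The three-document corpus and the function names `build_inverted_index` and `keyword_search` are invented for illustration; real engines such as Lucene add tokenization, stemming, and relevance scoring, none of which is modeled here.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def keyword_search(index, query):
    """Return ids of documents that contain every query token (Boolean AND)."""
    token_sets = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()


# Hypothetical mini-corpus, invented for illustration.
docs = {
    1: "popular breakfast places downtown",
    2: "a cozy shop for espresso and crepes",
    3: "late night pizza places",
}
index = build_inverted_index(docs)
hits = keyword_search(index, "breakfast places")
print(hits)  # document 2 is absent: it shares no keyword with the query
```

The espresso shop is exactly the kind of result we would want, yet it cannot match because the lookup works on surface tokens only.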

Semantic search is an NLP approach that relies largely on deep neural networks, in particular transformers, which make it possible to infer more closely the human intent behind the search terms, the relationships between the words, and the underlying context. It allows entire sentences -- and even paragraphs -- to describe the searcher's intent, and it retrieves results that are more relevant to that intent.

## How would we do this NLP task with AI?

Let us represent the functional behavior we expect: 


![](images/semantic-search-functionality.png)


### Magic happens: breaking it down into steps

We recall that machine-learning algorithms work with vector representations ($\mathbf{X}$) of data.

So the first order of business would be to map each of the document texts $D_i$ to its corresponding vector $X_i$ in an appropriate $d$-dimensional space, $\mathbb{R}^d$, i.e.

\begin{equation}
D_i \longrightarrow X_i \in \mathbb{R}^d
\end{equation}

These resulting vectors are called **sentence embeddings**. Once the embeddings have been computed for each of the documents, we can store the collection of tuples $[\langle D_1, X_1 \rangle, \langle D_2, X_2 \rangle, \ldots, \langle D_n, X_n \rangle]$. Here each tuple corresponds to a document and its sentence embedding.

This collection of tuples, therefore, becomes our **search index**.
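As a rough sketch of this data structure, the snippet below builds a list of (document, vector) tuples. The function `toy_embed` is a hypothetical stand-in for a real sentence-embedding model: it hashes words into a small fixed-size count vector, so it is not semantic at all and only illustrates the mapping $D_i \longrightarrow X_i \in \mathbb{R}^d$.

```python
import zlib

D = 8  # toy dimensionality; real sentence embeddings use hundreds of dimensions


def toy_embed(text, d=D):
    """Stand-in for a sentence-embedding model: hashed bag-of-words counts.

    This is NOT semantic; it only illustrates mapping a text to a vector in R^d.
    """
    vector = [0.0] * d
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vector[zlib.crc32(token.encode("utf-8")) % d] += 1.0
    return vector


documents = [
    "Animals are such agreeable friends",
    "Hold fast to dreams",
]

# The search index: one (document, embedding) tuple per document.
search_index = [(doc, toy_embed(doc)) for doc in documents]

for doc, vector in search_index:
    print(len(vector), doc)
```

In practice the list would live in a vector database, but the shape of each entry stays the same: the text (or a reference to it) paired with its embedding.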

### Search

Now, when the user describes what she is looking for, we consider the entire text as a "sentence".
<p>
<div class="alert-box alert-warning" style="padding-top:30px">
   
<b >Caveat Emptor</b>

> Note that we have a rather relaxed definition of a *sentence* in NLP: it diverges somewhat from the grammatical definition of a sentence. For example, in the English language, we would consider a sentence to be terminated with a punctuation mark, such as a period, question mark, or exclamation mark. However, in NLP, we loosely consider the entire text -- whether it is just a word, or a few keywords, or an English sentence, or a few sentences together -- as one **sentence** for the purposes of a natural language processing task.
    
<p>
</div>
    
Therefore, it is common to treat an entire document text as a *sentence* if the text is relatively short. Otherwise, the text is partitioned into smaller chunks (of, say, 512 tokens each), and each chunk is treated as an NLP *sentence*.
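The chunking idea can be sketched in a few lines. This is a simplification under a stated assumption: whitespace-separated words stand in for model tokens, whereas a real pipeline would count tokens with the embedding model's own tokenizer. The name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(text, max_tokens=512):
    """Split text into consecutive chunks of at most max_tokens tokens.

    Whitespace-separated words stand in for model tokens here; a real
    pipeline would count tokens with the embedding model's own tokenizer.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


chunks = chunk_tokens("one two three four five six seven", max_tokens=3)
print(chunks)  # three chunks: 3 + 3 + 1 words
```

Each chunk would then be embedded and indexed as its own "sentence".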

Since we consider the entire query text as a sentence, we can map it to its **sentence embedding vector**, ${Q}$.

#### Vector Similarity
Once we have this, we simply need to compare the query vector ${Q}$ with each of the document vectors $X_i$, and sort the document vectors in descending order of similarity.

The rest is trivial: pick the top-$k$ vectors in the sorted list. Then, for each vector, look up its corresponding document, and return the documents as the sorted list of relevant search results.

We expect that these documents will exhibit high semantic similarity with the search query, assuming that the search index did contain such documents.

<figure>
    <img src="https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/docs/img/SemanticSearch.png">
    <caption> Semantic similarity as vector proximity in the embedding space. <br>
    (Figure source: Sbert.net documentation).
    </caption>
</figure>
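The compare, sort, and pick-top-$k$ procedure can be sketched as follows. The three-dimensional vectors are hand-made for illustration, a plain dot product serves as the similarity score, and `top_k` is a hypothetical helper name. The scan over all documents is exhaustive, which is fine for a small index.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vector, search_index, k=2):
    """Return the k (score, document) pairs most similar to the query, best first."""
    scored = ((dot(query_vector, x), doc) for doc, x in search_index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Hand-made 3-dimensional vectors, invented for illustration.
search_index = [
    ("doc about dogs", [0.9, 0.1, 0.0]),
    ("doc about cats", [0.7, 0.3, 0.1]),
    ("doc about taxes", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.0, 0.0]

results = top_k(query, search_index, k=2)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

`heapq.nlargest` avoids sorting the whole list when only a few results are needed; vector databases go further and use approximate nearest-neighbor indices to avoid scoring every document.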


#### Similarity measures

The sentence embedding vectors typically live in a high-dimensional space (a few hundred dimensions or more; the model used in this notebook produces 384). In such high-dimensional spaces, Euclidean distance tends to be less discriminative. Therefore, it is far more common to use one of the two measures below for vector similarity:

* **dot-product**, the (inner) dot-product between the embedding vectors.

\begin{equation}
\text{dot-similarity} = \langle X_i, X_j \rangle
\end{equation}

* **cosine-similarity**, where $\cos \left(\theta_{ij}\right)$ gives the degree of directional alignment between the vectors but ignores their magnitudes. Here, $\theta_{ij}$ is the angle between the embedding vectors $X_i$ and $X_j$.

\begin{equation} 
\text{cosine-similarity} = \frac{\langle X_i, X_j \rangle} {\| X_i \| \| X_j \|}
\end{equation}

<div class="alert-box alert-info" style="padding-top:30px">
   
**Important**
    
>  Sentence transformer models trained with cosine-similarity tend to favor the shorter document texts in the search results, whereas the models trained on the dot-product similarity tend to favor longer texts.
</div>
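A small numeric check makes the difference between the two measures visible. The two-dimensional vectors are invented for illustration; `x_long` points in the same direction as `x_short` but is three times longer.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


x_short = [1.0, 2.0]
x_long = [3.0, 6.0]   # same direction as x_short, three times the magnitude
q = [2.0, 1.0]

print(dot(q, x_short), dot(q, x_long))        # dot product grows with magnitude
print(cosine(q, x_short), cosine(q, x_long))  # cosine depends only on direction
```

When all embeddings are normalized to unit length, the denominator in the cosine formula is 1 and the two measures coincide.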

## 🔍 The Three W's of Semantic Search {#the-three-ws}

### What is Semantic Search? {#what}

**Semantic search** is an information retrieval technique that understands the **meaning** and **context** of queries and documents, rather than just matching exact keywords.

#### Traditional Keyword Search vs. Semantic Search

| Aspect | Keyword Search | Semantic Search |
|--------|----------------|-----------------|
| **Matching** | Exact text matches | Conceptual similarity |
| **Understanding** | Lexical (word-level) | Semantic (meaning-level) |
| **Query** | "dog loyalty" | "dog loyalty" |
| **Finds** | Documents containing "dog" AND "loyalty" | Documents about faithful pets, even without exact words |
| **Technology** | TF-IDF, BM25, Boolean logic | Neural embeddings, vector similarity |

#### Key Concepts

- **Embeddings**: Dense vector representations of text that capture semantic meaning
- **Vector Space**: High-dimensional space where similar concepts are positioned close together
- **Cosine Similarity**: Mathematical measure of how similar two vectors (meanings) are
- **Dense Retrieval**: Finding relevant information based on semantic similarity rather than keyword overlap

### Why Semantic Search? {#why}

Semantic search has **revolutionized** information retrieval in several fundamental ways:

#### 🎯 **1. Intent Understanding**
- **Problem**: User searches "best friend" but documents say "loyal companion"
- **Solution**: Semantic search understands these concepts are related
- **Impact**: 40-60% improvement in search relevance

#### 🌐 **2. Language Flexibility** 
- **Synonyms**: "automobile" matches "car", "vehicle", "auto"
- **Paraphrasing**: "How to cook pasta" matches "pasta preparation methods"
- **Multilingual**: Can work across languages with multilingual embeddings

#### 🧠 **3. Context Awareness**
- **Polysemy**: "bank" (financial) vs "bank" (river) - context determines meaning
- **Nuanced queries**: "sad movie that makes you cry" vs "sad movie with bad reviews"
- **Conceptual search**: Find documents about concepts, not just keywords

#### 📈 **4. Transformational Impact on Industries**

- **Search Engines**: Google reported in 2019 that its BERT-based update would affect about 1 in 10 English-language searches in the US
- **E-commerce**: Amazon's semantic search increased conversion rates by 15-25%
- **Enterprise**: Microsoft's semantic search in Office 365 improved productivity
- **Legal**: Semantic search helps lawyers find relevant case law beyond keyword matches

### Where to Use Semantic Search? {#where}

#### 🏆 **Best Use Cases**

1. **Document Collections with Rich Content**
   - Research papers, articles, books
   - Legal documents, contracts
   - Medical records, patient notes
   - **Our use case**: Animal wisdom quotes

2. **Customer Support & FAQ**
   - Users ask questions in natural language
   - Need to find relevant answers regardless of exact wording
   - Example: "My order is late" → finds "delivery delays" content

3. **Product Discovery**
   - E-commerce: "comfortable running shoes for flat feet"
   - Real estate: "cozy family home near good schools"
   - Content: "funny movies for date night"

4. **Knowledge Management**
   - Corporate wikis and documentation
   - Research databases
   - Personal note-taking systems (Obsidian, Notion)

#### ⚠️ **When NOT to Use Semantic Search**

1. **Exact Match Requirements**
   - Legal document numbers, product SKUs
   - Code search (variable names, function signatures)
   - Database queries with specific criteria

2. **Very Small Datasets**
   - < 100 documents: keyword search may be sufficient
   - Overhead of embeddings not justified

3. **Highly Technical/Domain-Specific**
   - Without domain-specific embeddings
   - Very specialized jargon that general models don't understand

---


## 🛠️ The HOW: Implementation Deep Dive {#implementation}

Now that we understand the **what**, **why**, and **where** of semantic search, let's dive into the **how**. We'll use our **Animals Wisdom Quotes** corpus to demonstrate a complete semantic search implementation.

### 📊 Data Structure: AnimalQuote Examples {#data-structure}

Let's first examine the structure of our data and see some example quotes to understand what we're working with.


In [2]:
# Import required modules
from pathlib import Path
import json
from rag_to_riches.corpus.animals import AnimalQuote, AnimalWisdom, Animals
from rag_to_riches.vectordb.embedded_vectordb import EmbeddedVectorDB
from rag_to_riches.vectordb.embedder import SimpleTextEmbedder

print("🐾 Modules imported successfully!")
print("📁 Current working directory:", Path.cwd())




🐾 Modules imported successfully!
📁 Current working directory: /Users/asifqamar/github/rag_to_riches


In [3]:
# 🚀 Initialize Shared Components (Vector Database & Embedder)
# ============================================================================
# We initialize these components ONCE at the beginning to avoid database lock issues
# and reuse them throughout the notebook for efficiency and consistency.

print("🔧 Initializing Shared Components for the Entire Notebook")
print("=" * 60)

# Initialize Vector Database (shared instance)
print("1️⃣ Initializing Vector Database (Qdrant)...")
vector_db = EmbeddedVectorDB()
print("   ✅ Vector database connected and ready for reuse")

# Initialize Text Embedder (shared instance)  
print("\n2️⃣ Initializing Text Embedder (Sentence Transformers)...")
embedder = SimpleTextEmbedder(model_name="sentence-transformers/all-MiniLM-L6-v2")
print(f"   ✅ Embedder loaded: {embedder.model_name}")
print(f"   📐 Vector dimensions: {embedder.get_vector_size()}")
print(f"   📏 Distance metric: {embedder.get_distance_metric()}")

print("\n🎯 Shared components ready! These will be reused throughout the notebook.")
print("💡 This prevents database lock issues and improves performance.")

# Set the data path
jsonl_path = Path("data/corpus/animals/animals.jsonl")

print("\n📝 Note: If you encounter a database lock error, restart the notebook kernel.")
print("   This ensures a clean start and releases any existing database connections.")


2025-06-28 12:44:47 | INFO     | rag_to_riches.vectordb.embedded_vectordb:__init__:58 - Connected to embedded vector database at qdrant_db


🔧 Initializing Shared Components for the Entire Notebook
1️⃣ Initializing Vector Database (Qdrant)...
   ✅ Vector database connected and ready for reuse

2️⃣ Initializing Text Embedder (Sentence Transformers)...


2025-06-28 12:44:48 | INFO     | rag_to_riches.vectordb.embedder:__init__:115 - Initialized SimpleTextEmbedder with model 'sentence-transformers/all-MiniLM-L6-v2', vector size: 384


   ✅ Embedder loaded: sentence-transformers/all-MiniLM-L6-v2
   📐 Vector dimensions: 384
   📏 Distance metric: Cosine

🎯 Shared components ready! These will be reused throughout the notebook.
💡 This prevents database lock issues and improves performance.

📝 Note: If you encounter a database lock error, restart the notebook kernel.
   This ensures a clean start and releases any existing database connections.


In [4]:
# Let's examine the structure of our animal quotes data
print(f"📂 Reading from: {jsonl_path}")
print(f"📄 File exists: {jsonl_path.exists()}")


📂 Reading from: data/corpus/animals/animals.jsonl
📄 File exists: True


In [5]:
# 📚 Load and Index Animal Quotes Using Shared Components
print("🔧 Creating Animals corpus loader using shared components...")
animals_loader = Animals(
    vector_db=vector_db,  # Reusing shared vector_db instance
    embedder=embedder,    # Reusing shared embedder instance
)

animals_loader.recreate_collection()

print("📊 Loading and indexing animal quotes...")
wisdom, point_ids = animals_loader.load_and_index(jsonl_path)
        

2025-06-28 12:44:48 | INFO     | rag_to_riches.vectordb.embedded_vectordb:_ensure_existing_collection_matches:469 - Collection 'animals' exists with correct parameters
2025-06-28 12:44:48 | INFO     | rag_to_riches.vectordb.embedded_vectordb:get_collection_info:295 - Retrieved info for collection 'animals'
2025-06-28 12:44:48 | INFO     | rag_to_riches.search.semantic_search:consistency_check:233 - Consistency check passed for collection 'animals'
2025-06-28 12:44:48 | INFO     | rag_to_riches.search.semantic_search:__init__:182 - Initialized SemanticSearch for collection 'animals' with SimpleTextEmbedder
2025-06-28 12:44:48 | INFO     | rag_to_riches.corpus.animals:__init__:208 - Initialized Animals corpus loader for collection 'anim

🔧 Creating Animals corpus loader using shared components...
📊 Loading and indexing animal quotes...


2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:upsert_points:450 - Upserted 100 points to collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:index_all_text:457 - Indexed 100 texts into collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.corpus.animals:index_all_quotes:311 - Successfully indexed 100 animal quotes into collection 'animals'


In [6]:
# Display information about what was loaded and indexed
print(f"✅ Successfully loaded and indexed animal quotes!")
print(f"📊 Loaded {len(wisdom)} quotes from {wisdom.source_file}")
print(f"🔗 Indexed {len(point_ids)} points into collection '{animals_loader.collection_name}'")

# Show some statistics
stats = animals_loader.get_collection_stats()
print(f"\n📈 Collection Statistics:")
print(f"   • Collection Name: {stats['collection_name']}")
print(f"   • Points in Database: {stats['point_count']}")
print(f"   • Unique Categories: {len(stats['categories'])}")
print(f"   • Unique Authors: {len(stats['authors'])}")

print(f"\n🏷️ Sample Categories: {', '.join(stats['categories'][:3])}...")
print(f"✍️ Sample Authors: {', '.join(stats['authors'][:5])}...")

print(f"\n🎯 Ready for semantic search!")


2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:count_points:105 - Collection 'animals' contains 100 points


✅ Successfully loaded and indexed animal quotes!
📊 Loaded 100 quotes from data/corpus/animals/animals.jsonl
🔗 Indexed 100 points into collection 'animals'

📈 Collection Statistics:
   • Collection Name: animals
   • Points in Database: 100
   • Unique Categories: 20
   • Unique Authors: 85

🏷️ Sample Categories: Animal Morality, Animals as Reflections, Animals in Narrative...
✍️ Sample Authors: A.A. Milne, A.P.J. Abdul Kalam, Abraham Lincoln, African proverb, Albert Schweitzer...

🎯 Ready for semantic search!


### 🔧 The Indexing Pipeline {#indexing-pipeline}

The semantic search indexing pipeline consists of several key steps:

```
Raw Data (JSONL) → AnimalQuote Objects → AnimalWisdom Collection → Embeddings → Vector Database
```

#### Step-by-Step Process:

1. **Data Loading**: Parse JSONL into validated `AnimalQuote` objects
2. **Collection Creation**: Group quotes into `AnimalWisdom` container
3. **Embedding Generation**: Convert text to dense vectors using neural models
4. **Vector Storage**: Store embeddings + metadata in Qdrant vector database
5. **Indexing**: Create efficient search indices for fast retrieval

#### Key Components:

- **📝 AnimalQuote**: Individual quote with validation
- **📚 AnimalWisdom**: Collection of quotes with analysis methods  
- **🧠 Embedder**: Neural model that converts text → vectors
- **🗄️ VectorDB**: Qdrant database for storing and searching vectors
- **🔍 Animals**: Orchestrator class that manages the entire pipeline
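The pipeline steps above can be sketched end to end with the standard library alone. Everything here is a simplified stand-in and not the `rag_to_riches` implementation: `Quote` is a hypothetical dataclass, the JSONL field names `content`, `author`, and `category` are assumed to match the payload keys used elsewhere in this notebook, `toy_embed` replaces the neural embedder with hashed word counts, and a plain Python list replaces Qdrant (so step 5, efficient search indices, is not modeled).

```python
import json
import zlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Quote:
    """Hypothetical stand-in for a validated quote record."""
    content: str
    author: str
    category: str

    def __post_init__(self):
        # Minimal validation: reject quotes with no text.
        if not self.content.strip():
            raise ValueError("content must not be empty")


def toy_embed(text, d=8):
    """Placeholder for a neural embedder: hashed bag-of-words counts."""
    vector = [0.0] * d
    for token in text.lower().split():
        vector[zlib.crc32(token.encode("utf-8")) % d] += 1.0
    return vector


def load_quotes(jsonl_text):
    """Steps 1-2: parse JSONL lines into a list of validated Quote objects."""
    return [Quote(**json.loads(line)) for line in jsonl_text.splitlines() if line.strip()]


def index_quotes(quotes):
    """Steps 3-4: embed each quote and store the vector next to its metadata."""
    return [
        {"id": i, "vector": toy_embed(q.content), "payload": asdict(q)}
        for i, q in enumerate(quotes)
    ]


jsonl_text = "\n".join([
    json.dumps({"content": "Animals share with us the privilege of having a soul.",
                "author": "Pythagoras", "category": "Literary and Poetic Imagery"}),
    json.dumps({"content": "Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly.",
                "author": "Langston Hughes", "category": "Symbolism and Allegory"}),
])

points = index_quotes(load_quotes(jsonl_text))
print(len(points), points[0]["payload"]["author"])
```

Keeping the metadata in a payload next to each vector is what later makes author and category filters possible.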


### 🔍 Understanding Embeddings: The Magic Behind Semantic Search

Before we dive into search examples, let's understand what happens when we convert text to embeddings.

#### What are Embeddings?

**Embeddings** are dense vector representations of text that capture semantic meaning in a high-dimensional space. Each dimension represents some aspect of meaning that the neural model has learned.

#### Key Properties:
- **Dense**: Every dimension has a meaningful value (vs sparse keyword vectors)
- **Semantic**: Similar meanings → similar vectors  
- **High-dimensional**: Our model uses 384 dimensions
- **Learned**: Trained on massive text corpora to understand language
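The contrast between sparse keyword vectors and dense vectors can be illustrated with a toy example. The vocabulary is invented, and the dense vectors are hand-made numbers chosen for illustration, not the output of any model.

```python
def bag_of_words(text, vocabulary):
    """Sparse keyword vector: one count per vocabulary word."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


vocabulary = ["best", "friend", "loyal", "companion", "tax", "form"]
a = bag_of_words("best friend", vocabulary)
b = bag_of_words("loyal companion", vocabulary)
print(a, b, dot(a, b))  # no shared word, so the sparse vectors are orthogonal

# Hand-made dense vectors, invented for illustration (not model output).
dense_a = [0.8, 0.6, 0.1]
dense_b = [0.7, 0.7, 0.0]
print(dot(dense_a, dense_b))  # dense vectors can overlap in every dimension
```

With keyword vectors, two phrases without a common word always score zero, however close their meanings. A trained embedding model is what places related phrases near each other in the dense space.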


### 🎯 The Search Process {#search-process}

Now let's explore how semantic search works in practice. The search process involves:

1. **Query Embedding**: Convert the search query into a vector
2. **Similarity Calculation**: Compare query vector with all stored vectors  
3. **Ranking**: Sort results by similarity score (cosine similarity)
4. **Filtering**: Apply metadata filters (author, category, score threshold)
5. **Return Results**: Present top-k most similar documents

#### Search Types We'll Demonstrate:

1. **Basic Semantic Search**: Find conceptually similar quotes
2. **Author-Filtered Search**: Search within specific author's quotes  
3. **Category-Filtered Search**: Search within specific categories
4. **High-Confidence Search**: Only return very similar results
5. **Combined Filters**: Multiple criteria simultaneously
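Steps 2 to 5 of the search process can be sketched as one function. This is a minimal illustration under stated assumptions, not the library's code: the query is assumed to be embedded already (step 1), the points are hand-made two-dimensional vectors, and the parameter name `score_threshold` is hypothetical (the names `limit`, `author`, and `category` mirror the search calls used in this notebook).

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm


def search(query_vector, points, limit=4, author=None, category=None, score_threshold=None):
    """Score every point, rank by cosine similarity, filter, return the top results."""
    scored = [(cosine(query_vector, p["vector"]), p["payload"]) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    results = []
    for score, payload in scored:
        if author is not None and payload["author"] != author:
            continue
        if category is not None and payload["category"] != category:
            continue
        if score_threshold is not None and score < score_threshold:
            continue
        results.append((score, payload))
    return results[:limit]


# Hand-made points, invented for illustration.
points = [
    {"vector": [1.0, 0.0], "payload": {"author": "A", "category": "Wisdom"}},
    {"vector": [0.8, 0.6], "payload": {"author": "B", "category": "Wisdom"}},
    {"vector": [0.0, 1.0], "payload": {"author": "A", "category": "Humor"}},
]
query = [1.0, 0.0]

print(search(query, points, limit=2))
print(search(query, points, author="A", score_threshold=0.5))
```

Because every point is scored before filtering, a filter here can never hide a matching document; the cost is a full scan, which only suits small collections.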


In [7]:
# Helper function to display search results nicely
def display_search_results(results, search_description, max_text_length=120):
    """Display search results in a formatted way."""
    print(f"\n🔍 {search_description}")
    print("=" * len(f"🔍 {search_description}"))
    
    if not results:
        print("   ❌ No results found.")
        return
    
    print(f"   📊 Found {len(results)} results")
    print()
    
    for i, result in enumerate(results, 1):
        content = result.payload.get("content", "")
        author = result.payload.get("author", "Unknown")
        category = result.payload.get("category", "Unknown")
        
        # Truncate long quotes for readability
        display_content = content if len(content) <= max_text_length else content[:max_text_length-3] + "..."
        
        print(f"   {i}. 📊 Score: {result.score:.3f}")
        print(f"      💬 Quote: \"{display_content}\"")
        print(f"      ✍️  Author: {author}")
        print(f"      🏷️  Category: {category}")
        print()

print("🛠️ Search helper function defined!")


🛠️ Search helper function defined!


### 🎪 Hands-on Search Examples {#examples}

Now for the exciting part! Let's demonstrate the power of semantic search with various examples that showcase different capabilities.


In [8]:
# Example 0: Basic Semantic Search - A First Query
print("🎯 EXAMPLE 0: Basic Semantic Search")
print("=" * 50)
print("Query: 'a friendship with animals'")


results = animals_loader.search("a friendship with animals", limit=4)
display_search_results(results, "Basic Semantic Search Results")

2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:search_points:413 - Found 4 points in collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:search_with_text:347 - Text search for 'a friendship with animals...' returned 4 results
2025-06-28 12:44:49 | INFO     | rag_to_riches.corpus.animals:search:426 - Animal quotes search for 'a friendship with animals...' returned 4 results


🎯 EXAMPLE 0: Basic Semantic Search
Query: 'a friendship with animals'

🔍 Basic Semantic Search Results
   📊 Found 4 results

   1. 📊 Score: 0.625
      💬 Quote: "Animals are such agreeable friends—they ask no questions; they pass no criticisms."
      ✍️  Author: George Eliot
      🏷️  Category: Famous Literary Passages

   2. 📊 Score: 0.586
      💬 Quote: "The best thing about animals is that they don't talk much."
      ✍️  Author: Thornton Wilder
      🏷️  Category: Famous Literary Passages

   3. 📊 Score: 0.548
      💬 Quote: "Some people talk to animals. Not many listen though. That's the problem."
      ✍️  Author: A.A. Milne
      🏷️  Category: Proverbs and Sayings

   4. 📊 Score: 0.535
      💬 Quote: "Animals are reliable, many full of love, true in their affections, predictable in their actions, grateful and loyal. ..."
      ✍️  Author: Alfred A. Montapert
      🏷️  Category: Reflections and Lessons



In [9]:
# Example 1: Basic Semantic Search - The Power of Meaning
print("🎯 EXAMPLE 1: Basic Semantic Search")
print("=" * 50)
print("Query: 'loyal companions and friendship'")
print("🔍 This should find quotes about loyalty, friendship, and companionship")
print("   even if they don't contain these exact words!")

results = animals_loader.search("loyal companions and friendship", limit=4)
display_search_results(results, "Basic Semantic Search Results")


2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:search_points:413 - Found 4 points in collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:search_with_text:347 - Text search for 'loyal companions and friendship...' returned 4 results
2025-06-28 12:44:49 | INFO     | rag_to_riches.corpus.animals:search:426 - Animal quotes search for 'loyal companions and friendship...' returned 4 results


🎯 EXAMPLE 1: Basic Semantic Search
Query: 'loyal companions and friendship'
🔍 This should find quotes about loyalty, friendship, and companionship
   even if they don't contain these exact words!

🔍 Basic Semantic Search Results
   📊 Found 4 results

   1. 📊 Score: 0.476
      💬 Quote: "If you want loyalty, get a dog. If you want loyalty and attention, get a smart dog."
      ✍️  Author: Grant Fairley
      🏷️  Category: Animal Morality

   2. 📊 Score: 0.428
      💬 Quote: "Animals are reliable, many full of love, true in their affections, predictable in their actions, grateful and loyal. ..."
      ✍️  Author: Alfred A. Montapert
      🏷️  Category: Reflections and Lessons

   3. 📊 Score: 0.368
      💬 Quote: "Animals share with us the privilege of having a soul."
      ✍️  Author: Pythagoras
      🏷️  Category: Literary and Poetic Imagery

   4. 📊 Score: 0.359
      💬 Quote: "Animals are such agreeable friends—they ask no questions; they pass no criticisms."
      ✍️  Author: George 

In [10]:
# Example 2: Concept-Based Search - Abstract Ideas
print("🎯 EXAMPLE 2: Concept-Based Search")
print("=" * 50)
print("Query: 'wisdom and life lessons'")
print("🔍 Searching for philosophical insights and wisdom")
print("   Notice how we find deep concepts, not just keyword matches!")

results = animals_loader.search("wisdom and life lessons", limit=4)
display_search_results(results, "Concept-Based Search Results")


2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:search_points:413 - Found 4 points in collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:search_with_text:347 - Text search for 'wisdom and life lessons...' returned 4 results
2025-06-28 12:44:49 | INFO     | rag_to_riches.corpus.animals:search:426 - Animal quotes search for 'wisdom and life lessons...' returned 4 results


🎯 EXAMPLE 2: Concept-Based Search
Query: 'wisdom and life lessons'
🔍 Searching for philosophical insights and wisdom
   Notice how we find deep concepts, not just keyword matches!

🔍 Concept-Based Search Results
   📊 Found 4 results

   1. 📊 Score: 0.297
      💬 Quote: "Animals share with us the privilege of having a soul."
      ✍️  Author: Pythagoras
      🏷️  Category: Literary and Poetic Imagery

   2. 📊 Score: 0.296
      💬 Quote: "To my mind, the life of a lamb is no less precious than that of a human being."
      ✍️  Author: Mahatma Gandhi
      🏷️  Category: Literary Masterpieces

   3. 📊 Score: 0.293
      💬 Quote: "Hold fast to dreams, for if dreams die, life is a broken-winged bird that cannot fly."
      ✍️  Author: Langston Hughes
      🏷️  Category: Symbolism and Allegory

   4. 📊 Score: 0.288
      💬 Quote: "Dogs teach us a very important lesson in life: The mailman is not to be trusted."
      ✍️  Author: Sian Ford
      🏷️  Category: Literary Masterpieces



In [11]:
# Example 3: Emotional Search - Finding Feelings
print("🎯 EXAMPLE 3: Emotional Search")
print("=" * 50)
print("Query: 'sadness and loss'")
print("🔍 Searching for quotes that deal with sad emotions")
print("   Semantic search can understand emotional concepts!")

results = animals_loader.search("sadness and loss", limit=4)
display_search_results(results, "Emotional Search Results")


2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:search_points:413 - Found 4 points in collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:search_with_text:347 - Text search for 'sadness and loss...' returned 4 results
2025-06-28 12:44:49 | INFO     | rag_to_riches.corpus.animals:search:426 - Animal quotes search for 'sadness and loss...' returned 4 results


🎯 EXAMPLE 3: Emotional Search
Query: 'sadness and loss'
🔍 Searching for quotes that deal with sad emotions
   Semantic search can understand emotional concepts!

🔍 Emotional Search Results
   📊 Found 4 results

   1. 📊 Score: 0.472
      💬 Quote: "There are two means of refuge from the misery of life—music and cats."
      ✍️  Author: Albert Schweitzer
      🏷️  Category: Powerful Analogies

   2. 📊 Score: 0.462
      💬 Quote: "Like a bird singing in the rain, let grateful memories survive in time of sorrow."
      ✍️  Author: Robert Louis Stevenson
      🏷️  Category: Symbolism and Allegory

   3. 📊 Score: 0.342
      💬 Quote: "Until one has loved an animal, a part of one's soul remains unawakened."
      ✍️  Author: Anatole France
      🏷️  Category: Wisdom and Philosophy

   4. 📊 Score: 0.342
      💬 Quote: "Until one has loved an animal, a part of one's soul remains unawakened."
      ✍️  Author: Anatole France
      🏷️  Category: Famous Literary Passages



In [12]:
# Example 4: Filtered Search - Author-Specific
print("🎯 EXAMPLE 4: Author-Filtered Search")
print("=" * 50)
print("Query: 'animals' filtered by author: 'Mark Twain'")
print("🔍 This combines semantic search with metadata filtering")
print("   We're looking for Mark Twain's thoughts on animals specifically")

results = animals_loader.search("animals", limit=4, author="Mark Twain")
display_search_results(results, "Author-Filtered Search Results")

# Let's also try a different author
print("\n" + "🎯 BONUS: Same query, different author")
print("Query: 'animals' filtered by author: 'Albert Schweitzer'")

results = animals_loader.search("animals", limit=3, author="Albert Schweitzer")
display_search_results(results, "Albert Schweitzer's Animal Quotes")


2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:search_points:413 - Found 12 points in collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:search_with_text:347 - Text search for 'animals...' returned 12 results
2025-06-28 12:44:49 | INFO     | rag_to_riches.corpus.animals:search:426 - Animal quotes search for 'animals...' returned 1 results (filtered by author)
2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:search_points:413 - Found 9 points in collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:search_with_text:347 - Text search for 'animals...' returned 9 results
2025-06-28 12:44:49 | IN

🎯 EXAMPLE 4: Author-Filtered Search
Query: 'animals' filtered by author: 'Mark Twain'
🔍 This combines semantic search with metadata filtering
   We're looking for Mark Twain's thoughts on animals specifically

🔍 Author-Filtered Search Results
   📊 Found 1 results

   1. 📊 Score: 0.459
      💬 Quote: "If animals could speak, the dog would be a blundering outspoken fellow; but the cat would have the rare grace of neve..."
      ✍️  Author: Mark Twain
      🏷️  Category: Literary and Poetic Imagery


🎯 BONUS: Same query, different author
Query: 'animals' filtered by author: 'Albert Schweitzer'

🔍 Albert Schweitzer's Animal Quotes
   ❌ No results found.
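
The log lines for this example (12 points fetched for `limit=4`, 9 for `limit=3`, then far fewer after filtering) are consistent with a search that over-fetches candidates and applies the author filter afterwards. The sketch below illustrates that idea with made-up hits. The function `post_filter_search`, the `overfetch` factor of 3, and the sample data are assumptions for illustration, not the actual `Animals.search` implementation. It also shows one way an author can come back empty: their quotes may simply rank below the candidate window (they may also be absent from the corpus altogether).

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A scored search hit with its metadata (toy stand-in for a vector DB point)."""
    score: float
    text: str
    author: str


def post_filter_search(ranked_hits, limit, author=None, overfetch=3):
    """Take the top candidates by score, then keep only the requested author.

    `overfetch` is a hypothetical multiplier: we look at limit * overfetch
    candidates so that some of them survive the metadata filter.
    """
    window = limit * overfetch if author else limit
    candidates = sorted(ranked_hits, key=lambda h: h.score, reverse=True)[:window]
    if author:
        candidates = [h for h in candidates if h.author == author]
    return candidates[:limit]


# Made-up hits, already scored against some query.
hits = [
    Hit(0.52, "quote A", "Author X"),
    Hit(0.50, "quote B", "Author Y"),
    Hit(0.46, "quote C", "Mark Twain"),
    Hit(0.41, "quote D", "Author X"),
    Hit(0.30, "quote E", "Albert Schweitzer"),
]

twain = post_filter_search(hits, limit=1, author="Mark Twain")
schweitzer = post_filter_search(hits, limit=1, author="Albert Schweitzer")
print([h.text for h in twain])       # the Twain quote sits inside the top-3 window
print([h.text for h in schweitzer])  # the Schweitzer quote ranks 5th, outside the window
```

If a filter must never miss matching items, applying it inside the vector database (before ranking) is the usual alternative, at the cost of depending on the database's filtering support.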


In [13]:
# Example 5: Category-Filtered Search
print("🎯 EXAMPLE 5: Category-Filtered Search")
print("=" * 50)
print("Query: 'love and compassion' in category: 'Wisdom and Philosophy'")
print("🔍 Finding philosophical quotes about love and compassion")

results = animals_loader.search("love and compassion", limit=4, category="Wisdom and Philosophy")
display_search_results(results, "Philosophy Category Search Results")

# Let's see what categories we have available
print("\n📚 Available categories in our corpus:")
all_categories = sorted(animals_loader.get_collection_stats()['categories'])
for i, category in enumerate(all_categories[:10], 1):  # Show first 10 categories
    print(f"   {i:2d}. {category}")
if len(all_categories) > 10:  # Avoid printing "... and 0 more"
    print(f"   ... and {len(all_categories) - 10} more")


2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:search_points:413 - Found 12 points in collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:search_with_text:347 - Text search for 'love and compassion...' returned 12 results
2025-06-28 12:44:49 | INFO     | rag_to_riches.corpus.animals:search:426 - Animal quotes search for 'love and compassion...' returned 1 results (filtered by category)
2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:count_points:105 - Collection 'animals' contains 100 points


🎯 EXAMPLE 5: Category-Filtered Search
Query: 'love and compassion' in category: 'Wisdom and Philosophy'
🔍 Finding philosophical quotes about love and compassion

🔍 Philosophy Category Search Results
   📊 Found 1 results

   1. 📊 Score: 0.398
      💬 Quote: "Until one has loved an animal, a part of one's soul remains unawakened."
      ✍️  Author: Anatole France
      🏷️  Category: Wisdom and Philosophy


📚 Available categories in our corpus:
    1. Animal Morality
    2. Animals as Reflections
    3. Animals in Narrative
    4. Evocative Descriptions
    5. Famous Literary Passages
    6. Humorous Quotes
    7. Humorous Yet Profound
    8. Insightful Observations
    9. Literary Classics
   10. Literary Masterpieces
   ... and 10 more
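
A category filter can also be applied before ranking, so that every returned slot goes to a quote from the requested category. The block below is a minimal sketch of that pre-filtering idea with made-up scores. The function `pre_filter_search` and the sample tuples are illustrative assumptions, not the notebook's `Animals.search`.

```python
from collections import Counter


def pre_filter_search(scored_quotes, category, limit):
    """Keep only quotes in `category`, then return the top `limit` by score."""
    in_category = [q for q in scored_quotes if q[2] == category]
    return sorted(in_category, key=lambda q: q[0], reverse=True)[:limit]


# (score, text, category) tuples with made-up scores for a single query.
scored_quotes = [
    (0.48, "quote A", "Humorous Quotes"),
    (0.45, "quote B", "Literary Classics"),
    (0.40, "quote C", "Wisdom and Philosophy"),
    (0.33, "quote D", "Wisdom and Philosophy"),
    (0.21, "quote E", "Wisdom and Philosophy"),
]

top = pre_filter_search(scored_quotes, "Wisdom and Philosophy", limit=2)
print([text for _, text, _ in top])

# Counting quotes per category shows the most a category filter can ever return.
per_category = Counter(category for _, _, category in scored_quotes)
print(per_category["Wisdom and Philosophy"])
```

Counting items per category before searching is a cheap sanity check: a filter on a category with one matching quote can return at most one result, whatever `limit` is.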


In [14]:
# Example 6: High-Confidence Search with Score Threshold
print("🎯 EXAMPLE 6: High-Confidence Search")
print("=" * 50)
print("Query: 'faithful dogs' with score threshold > 0.6")
print("🔍 Only returning results with high semantic similarity")
print("   This filters out loosely related results")

results = animals_loader.search("faithful dogs", limit=5, score_threshold=0.6)
display_search_results(results, "High-Confidence Search Results")

# Compare with no threshold
print("\n🔄 COMPARISON: Same query without score threshold")
results_all = animals_loader.search("faithful dogs", limit=5)
display_search_results(results_all, "All Results (No Threshold)")

print(f"\n📊 Insight: Threshold filtering removed {len(results_all) - len(results)} lower-quality results")


2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:search_points:413 - Found 0 points in collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:search_with_text:347 - Text search for 'faithful dogs...' returned 0 results
2025-06-28 12:44:49 | INFO     | rag_to_riches.corpus.animals:search:426 - Animal quotes search for 'faithful dogs...' returned 0 results
2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:search_points:413 - Found 5 points in collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:search_with_text:347 - Text search for 'faithful dogs...' returned 5 results
2025-06-28 12:44:49 | INFO

🎯 EXAMPLE 6: High-Confidence Search
Query: 'faithful dogs' with score threshold > 0.6
🔍 Only returning results with high semantic similarity
   This filters out loosely related results

🔍 High-Confidence Search Results
   ❌ No results found.

🔄 COMPARISON: Same query without score threshold

🔍 All Results (No Threshold)
   📊 Found 5 results

   1. 📊 Score: 0.513
      💬 Quote: "Dogs are our link to paradise."
      ✍️  Author: Milan Kundera
      🏷️  Category: Proverbs and Sayings

   2. 📊 Score: 0.502
      💬 Quote: "Every dog has his day."
      ✍️  Author: Jonathan Swift
      🏷️  Category: Symbolism and Allegory

   3. 📊 Score: 0.492
      💬 Quote: "Animals are reliable, many full of love, true in their affections, predictable in their actions, grateful and loyal. ..."
      ✍️  Author: Alfred A. Montapert
      🏷️  Category: Reflections and Lessons

   4. 📊 Score: 0.480
      💬 Quote: "I wonder if other dogs think poodles are members of a weird religious cult."
      ✍️  Author: R
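
In the run above, the best score for 'faithful dogs' is 0.513, so a threshold of 0.6 removes every result. The sketch below replays that effect on the four scores visible in the output. `apply_threshold` and `apply_relative_cutoff` are hypothetical helpers. Whether the library treats its threshold as inclusive is not shown in the notebook, so the `>=` here is an assumption, and the relative cutoff is offered only as one possible alternative when absolute scores are hard to calibrate.

```python
def apply_threshold(scores, threshold):
    """Keep scores at or above an absolute similarity threshold."""
    return [s for s in scores if s >= threshold]


def apply_relative_cutoff(scores, margin):
    """Keep scores within `margin` of the best score (a hypothetical alternative)."""
    if not scores:
        return []
    best = max(scores)
    return [s for s in scores if best - s <= margin]


# The four scores visible in the output above for 'faithful dogs'.
scores = [0.513, 0.502, 0.492, 0.480]

print(apply_threshold(scores, 0.6))         # nothing reaches 0.6
print(apply_threshold(scores, 0.5))         # a lower absolute threshold keeps the top two
print(apply_relative_cutoff(scores, 0.02))  # keep results close to the best match
```

A sensible threshold depends on the embedding model and the corpus, so it is worth inspecting unfiltered scores for a few representative queries before fixing a value.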

In [15]:
# Example 7: Demonstrating Semantic vs Keyword Search
print("🎯 EXAMPLE 7: Semantic vs Keyword Search Comparison")
print("=" * 55)
print("🔍 Let's compare semantic search with what keyword search would find")

# Semantic search for concepts
query = "creatures that bring joy and happiness"
print(f"\nQuery: '{query}'")
print("🧠 SEMANTIC SEARCH: Finds quotes about the CONCEPT of joy from animals")

semantic_results = animals_loader.search(query, limit=3)
display_search_results(semantic_results, "Semantic Search Results")

# Simulate what keyword search might miss
print("\n🔑 KEYWORD SEARCH would look for:")
print("   - Documents containing 'creatures' AND 'joy' AND 'happiness'")
print("   - Would miss quotes about 'pets', 'animals', 'delight', 'bliss', etc.")
print("   - Would miss conceptually related but differently worded content")

print("\n💡 SEMANTIC ADVANTAGE:")
print("   ✅ Understands synonyms and related concepts")
print("   ✅ Captures intent, not just keywords")
print("   ✅ Finds relevant content with different vocabulary")
print("   ✅ More natural, human-like understanding")


2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:search_points:413 - Found 3 points in collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:search_with_text:347 - Text search for 'creatures that bring joy and happiness...' returned 3 results
2025-06-28 12:44:49 | INFO     | rag_to_riches.corpus.animals:search:426 - Animal quotes search for 'creatures that bring joy and happiness...' returned 3 results


🎯 EXAMPLE 7: Semantic vs Keyword Search Comparison
🔍 Let's compare semantic search with what keyword search would find

Query: 'creatures that bring joy and happiness'
🧠 SEMANTIC SEARCH: Finds quotes about the CONCEPT of joy from animals

🔍 Semantic Search Results
   📊 Found 3 results

   1. 📊 Score: 0.505
      💬 Quote: "Whoever said you can’t buy happiness forgot little puppies."
      ✍️  Author: Gene Hill
      🏷️  Category: Animals as Reflections

   2. 📊 Score: 0.498
      💬 Quote: "The only creatures that are evolved enough to convey pure love are dogs and infants."
      ✍️  Author: Johnny Depp
      🏷️  Category: Proverbs and Sayings

   3. 📊 Score: 0.468
      💬 Quote: "Animals share with us the privilege of having a soul."
      ✍️  Author: Pythagoras
      🏷️  Category: Literary and Poetic Imagery


🔑 KEYWORD SEARCH would look for:
   - Documents containing 'creatures' AND 'joy' AND 'happiness'
   - Would miss quotes about 'pets', 'animals', 'delight', 'bliss', etc.
   - Wo
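
The keyword-search behaviour described above can be checked directly on the three quotes that semantic search returned. The sketch below implements only a strict boolean AND over exact word tokens. The helpers `tokenize` and `keyword_and_match` are illustrative, and real keyword engines add stemming, OR queries, and relevance scoring that this toy version omits.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_and_match(query_terms, documents):
    """Return documents that contain every query term as an exact token."""
    terms = {t.lower() for t in query_terms}
    return [doc for doc in documents if terms <= tokenize(doc)]


# The three quotes returned by the semantic search above.
quotes = [
    "Whoever said you can’t buy happiness forgot little puppies.",
    "The only creatures that are evolved enough to convey pure love are dogs and infants.",
    "Animals share with us the privilege of having a soul.",
]

# Strict AND over all three terms matches none of the semantic hits.
print(keyword_and_match(["creatures", "joy", "happiness"], quotes))

# A single term matches only the quote that uses that exact word.
print(keyword_and_match(["happiness"], quotes))
```

None of the three quotes contains all of 'creatures', 'joy', and 'happiness', so the strict AND query returns nothing even though each quote is arguably on topic.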

In [16]:
# BONUS: Exploring the Enhanced Animals Class
print("🎯 BONUS: Enhanced Animals Class Features")
print("=" * 50)
print("🔧 The Animals class now leverages SemanticSearch internally!")

# Show the new consistency check feature
print("\n1️⃣ Collection Consistency Check:")
is_consistent = animals_loader.consistency_check()
print(f"   ✅ Collection parameters consistent with embedder: {is_consistent}")

# Show access to the underlying semantic search engine
print("\n2️⃣ Access to Underlying SemanticSearch:")
print(f"   🔍 SemanticSearch instance: {type(animals_loader.semantic_search).__name__}")
print(f"   📊 Embedder type: {type(animals_loader.semantic_search.embedder).__name__}")
print(f"   🗄️ Vector DB type: {type(animals_loader.semantic_search.vector_db).__name__}")

# Demonstrate single quote indexing
print("\n3️⃣ Single Quote Indexing (New Feature):")
sample_quote = AnimalQuote(
    text="The greatness of a nation can be judged by the way its animals are treated.",
    author="Mahatma Gandhi", 
    category="Wisdom and Philosophy"
)
try:
    point_id = animals_loader.index_single_quote(sample_quote)
    print(f"   ✅ Successfully indexed single quote with ID: {point_id[:8]}...")
except Exception as e:
    print(f"   ℹ️ Single quote indexing: {e}")

print("\n💡 Benefits of the Refactoring:")
print("   🚀 Reduced code duplication by ~60 lines")
print("   🛠️ Better maintainability and consistency")
print("   🔧 Access to full SemanticSearch capabilities")
print("   ✨ Enhanced functionality while keeping the same API")
print("   🎯 Composition over duplication design pattern")


2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:get_collection_info:295 - Retrieved info for collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.search.semantic_search:consistency_check:233 - Consistency check passed for collection 'animals'
2025-06-28 12:44:49 | INFO     | rag_to_riches.vectordb.embedded_vectordb:upsert_points:450 - Upserted 1 points to collection 'animals'


🎯 BONUS: Enhanced Animals Class Features
🔧 The Animals class now leverages SemanticSearch internally!

1️⃣ Collection Consistency Check:
   ✅ Collection parameters consistent with embedder: True

2️⃣ Access to Underlying SemanticSearch:
   🔍 SemanticSearch instance: SemanticSearch
   📊 Embedder type: SimpleTextEmbedder
   🗄️ Vector DB type: EmbeddedVectorDB

3️⃣ Single Quote Indexing (New Feature):
   ✅ Successfully indexed single quote with ID: 8117cdf3...

💡 Benefits of the Refactoring:
   🚀 Reduced code duplication by ~60 lines
   🛠️ Better maintainability and consistency
   🔧 Access to full SemanticSearch capabilities
   ✨ Enhanced functionality while keeping the same API
   🎯 Composition over duplication design pattern
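
The composition described above can be sketched as a domain class that holds a search engine and delegates to it. Every class here (`ToyEmbedder`, `ToySearch`, `ToyAnimals`) is a hypothetical stand-in, not `rag_to_riches` code. The consistency check merely compares vector sizes, which is an assumption about what such a check might verify.

```python
class ToyEmbedder:
    """Stand-in embedder that reports a fixed vector size."""
    vector_size = 4

    def embed(self, text):
        # Not a real embedding: letter counts bucketed into four slots.
        vec = [0.0] * self.vector_size
        for ch in text.lower():
            if ch.isalpha():
                vec[ord(ch) % self.vector_size] += 1.0
        return vec


class ToySearch:
    """Stand-in search engine owning an embedder and an in-memory collection."""

    def __init__(self, embedder, collection_vector_size):
        self.embedder = embedder
        self.collection_vector_size = collection_vector_size
        self.points = []

    def consistency_check(self):
        # Assumed meaning: the collection stores vectors of the embedder's size.
        return self.collection_vector_size == self.embedder.vector_size

    def index(self, text):
        self.points.append((self.embedder.embed(text), text))
        return len(self.points) - 1


class ToyAnimals:
    """Domain wrapper that delegates to ToySearch instead of duplicating its logic."""

    def __init__(self, search):
        self.search_engine = search

    def consistency_check(self):
        return self.search_engine.consistency_check()

    def index_single_quote(self, text):
        return self.search_engine.index(text)


animals = ToyAnimals(ToySearch(ToyEmbedder(), collection_vector_size=4))
print(animals.consistency_check())
print(animals.index_single_quote("Dogs are our link to paradise."))
```

Because the wrapper only forwards calls, a fix or new feature in the search engine reaches every domain class that composes it, which is the maintainability benefit the cell above lists.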


## 🎯 Key Takeaways {#takeaways}

### 🔑 What We've Learned

1. **Semantic Search Revolution**
   - Goes beyond keyword matching to understand **meaning**
   - Uses neural embeddings to capture semantic relationships
   - Transforms how we find and discover information

2. **Technical Implementation**
   - **Embeddings**: Dense vectors that represent semantic meaning
   - **Vector Database**: Efficient storage and similarity search
   - **Pipeline**: Data → Validation → Embeddings → Indexing → Search
   - **Architecture**: Composition pattern with SemanticSearch for code reuse

3. **Practical Benefits**
   - **Intent Understanding**: Finds what you mean, not just what you say
   - **Language Flexibility**: Works with synonyms, paraphrases, concepts
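   - **Better Relevance**: Often surfaces relevant results that keyword search misses; the size of the gain depends on the corpus, the queries, and the embedding model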
   - **Natural Queries**: Search like you think and speak

4. **Real-World Applications**
   - Document search and knowledge management
   - E-commerce product discovery  
   - Customer support and FAQ systems
   - Research and academic databases
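
The pipeline from item 2 (Data → Validation → Embeddings → Indexing → Search) can be sketched end to end in a few lines. The `embed` function below is a deliberately crude stand-in (hashed word counts) so that the block runs without a model. It captures word overlap rather than meaning, so a real sentence-embedding model would take its place in practice. All function names are illustrative and are not part of `rag_to_riches`.

```python
import math
import zlib

DIM = 64  # size of the toy vectors


def validate(quotes):
    """Validation step: drop empty or whitespace-only texts."""
    return [q.strip() for q in quotes if q and q.strip()]


def embed(text):
    """Toy embedding: hashed bag of words, L2-normalised (not a semantic model)."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        word = word.strip(".,!?'\"")
        if word:
            vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two unit vectors is their dot product."""
    return sum(x * y for x, y in zip(a, b))


def build_index(quotes):
    """Indexing step: store (vector, text) pairs in a plain list."""
    return [(embed(q), q) for q in validate(quotes)]


def search(index, query, limit=3):
    """Search step: embed the query and rank stored texts by similarity."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [(round(cosine(q, vec), 3), text) for vec, text in ranked[:limit]]


index = build_index([
    "Dogs are our link to paradise.",
    "Every dog has his day.",
    "",
    "Animals share with us the privilege of having a soul.",
])
for score, text in search(index, "dogs and paradise", limit=2):
    print(score, text)
```

A vector database replaces the plain list and the linear scan with persistent storage and an approximate nearest-neighbour index, but the data flow stays the same.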

### 🚀 Next Steps

In future lessons, we'll explore:
- **RAG (Retrieval-Augmented Generation)**: Combining search with LLMs
- **Advanced Embeddings**: Domain-specific and multimodal models
- **Vector Database Optimization**: Performance and scaling
- **Evaluation Metrics**: Measuring search quality and relevance

### 💡 Remember

> **"Semantic search doesn't just find documents that match your keywords—it finds documents that match your thoughts."**

The power of semantic search lies in its ability to bridge the gap between human intent and information retrieval, making search more intuitive, effective, and intelligent.

---

**🎉 Congratulations!** You've completed Lesson 1 and now understand the fundamentals of semantic search. You're ready to build more sophisticated RAG applications!
