# Introduction to Vector Databases

This notebook demonstrates how to use vector databases for efficient semantic search and retrieval.

## What is a Vector Database?

A **vector database** is a specialized database designed to store, index, and search high-dimensional vectors (embeddings) efficiently.

### Why Do We Need Vector Databases?

Traditional databases (SQL, NoSQL) are great for exact matches and range filters:
- `WHERE name = 'John'`
- `WHERE price > 100`

But they struggle with:
- Finding similar meanings: "laptop" vs "notebook computer"
- Semantic search: "affordable portable computers" → find laptops
- Nearest neighbor search in high-dimensional space
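
To make "similar meanings" concrete, here is a minimal sketch of how closeness between embeddings is commonly measured. The three-dimensional vectors are made-up values chosen for illustration only (real embedding models produce hundreds of dimensions), and `cosine_similarity` is a helper defined here, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (hand-picked, not produced by any model)
toy_embeddings = {
    "laptop": [0.9, 0.8, 0.1],
    "notebook computer": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.2, 0.95],
}

query = toy_embeddings["laptop"]
for name, vec in toy_embeddings.items():
    print(f"{name:>18}: {cosine_similarity(query, vec):.3f}")
```

With vectors like these, the two computer-related phrases score much closer to each other than either does to the unrelated word, which is the property a vector database exploits.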

### Vector Database Capabilities:

1. **Efficient Storage**: Store millions of high-dimensional vectors
2. **Fast Similarity Search**: Find nearest neighbors in milliseconds
3. **Metadata Filtering**: Combine semantic search with traditional filters
4. **Scalability**: Handle large-scale applications

### Use Cases:

- **RAG (Retrieval Augmented Generation)**: Find relevant context for LLMs
- **Semantic Search**: Search by meaning, not keywords
- **Recommendation Systems**: Find similar items
- **Duplicate Detection**: Find similar documents
- **Question Answering**: Match questions to answers

---

## Vector Databases We'll Explore

### 1. FAISS (Facebook AI Similarity Search)
- **Type**: In-memory vector search library
- **Best For**: Fast similarity search, research, prototyping
- **Pros**: Extremely fast, battle-tested, many index types
- **Cons**: Stores vectors only (metadata must be kept separately); persistence is limited to serializing an index to a file

### 2. LanceDB
- **Type**: Embedded vector database
- **Best For**: Production applications, persistent storage
- **Pros**: Disk-based, SQL-like queries, versioning, cloud-native
- **Cons**: Newer, smaller community

---

## Setup and Installation

In [None]:
# Install required packages
! pip install faiss-cpu lancedb sentence-transformers pypdf python-dotenv mistralai pandas numpy -q

In [None]:
# Import libraries
import os
import json
import numpy as np
import pandas as pd
from pathlib import Path
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# PDF processing
from pypdf import PdfReader

# Embeddings
from sentence_transformers import SentenceTransformer
from dotenv import load_dotenv
from mistralai import Mistral

# Vector databases
import faiss
import lancedb

print("✓ All libraries imported successfully!")

✓ All libraries imported successfully!


---

## Part 1: PDF Parsing and Chunking

First, we need to extract text from PDFs and split it into manageable chunks.

In [None]:
def parse_pdf(pdf_path: str) -> List[Dict]:
    """
    Parse PDF and extract text with metadata.
    
    Returns:
        List of dictionaries with page text and metadata
    """
    print(f"\n📄 Parsing PDF: {pdf_path}")
    
    reader = PdfReader(pdf_path)
    pages_data = []
    
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text()
        
        if text.strip():  # Only add if there's actual text
            pages_data.append({
                'page_number': page_num,
                'text': text,
                'file_name': Path(pdf_path).name
            })
    
    print(f"✓ Extracted text from {len(pages_data)} pages")
    return pages_data

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Split text into overlapping chunks with sentence boundary awareness.
    """
    chunks = []
    start = 0
    text_len = len(text)
    
    while start < text_len:
        # Define the end of this chunk
        end = min(start + chunk_size, text_len)
        
        # Extract chunk
        chunk = text[start:end]
        
        # Try to break at sentence boundary (only if not at the very end)
        if end < text_len:
            # Look for sentence boundaries
            last_period = chunk.rfind('.')
            last_newline = chunk.rfind('\n')
            last_question = chunk.rfind('?')
            last_exclamation = chunk.rfind('!')
            
            break_point = max(last_period, last_newline, last_question, last_exclamation)
            
            # Only use the break point if it's reasonably far into the chunk
            if break_point > chunk_size * 0.5:
                chunk = chunk[:break_point + 1]
        
        # Add chunk if it has content
        if chunk.strip():
            chunks.append(chunk.strip())
        
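        # Advance by a fixed stride of (chunk_size - overlap), which guarantees progress
        # as long as overlap < chunk_size. Caveat: if the chunk was shortened at a
        # sentence boundary, text between that boundary and the next start may be skipped.
        start += chunk_size - overlap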
    
    return chunks


def process_pdf_to_chunks(pdf_path: str, chunk_size: int = 500, overlap: int = 50) -> List[Dict]:
    """
    Parse PDF and split into chunks with metadata.
    
    Returns:
        List of dictionaries with chunks and metadata
    """
    pages_data = parse_pdf(pdf_path)
    all_chunks = []
    chunk_id = 0
    
    for page_data in pages_data:
        chunks = chunk_text(page_data['text'], chunk_size, overlap)
        
        for chunk_num, chunk in enumerate(chunks, start=1):
            all_chunks.append({
                'chunk_id': chunk_id,
                'text': chunk,
                'file_name': page_data['file_name'],
                'page_number': page_data['page_number'],
                'chunk_number': chunk_num,
                'char_count': len(chunk)
            })
            chunk_id += 1
    
    print(f"✓ Created {len(all_chunks)} chunks from {len(pages_data)} pages")
    return all_chunks

print("✓ PDF parsing and chunking functions defined")

✓ PDF parsing and chunking functions defined
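
`chunk_text` above advances by a fixed stride of `chunk_size - overlap`, so when a chunk is cut short at a sentence boundary, the characters between that boundary and the next start position may not land in any chunk. One possible variant, sketched below under the assumption that full coverage matters more than a fixed stride, advances from the actual end of each chunk instead. The name `chunk_text_contiguous` is hypothetical and not part of the notebook.

```python
from typing import List

def chunk_text_contiguous(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks without skipping any characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Prefer the last sentence-like boundary in the second half of the window
            break_point = max(window.rfind(c) for c in ".?!\n")
            if break_point > chunk_size * 0.5:
                end = start + break_point + 1
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        if end >= len(text):
            break
        # Step back by `overlap` from the real end of this chunk; max() ensures progress
        start = max(end - overlap, start + 1)
    return chunks

sample = "First sentence here. Second sentence follows! Third one asks? " * 5
for piece in chunk_text_contiguous(sample, chunk_size=80, overlap=10):
    print(repr(piece))
```

The trade-off is that chunk start positions are no longer evenly spaced, and the number of chunks will differ from what `chunk_text` produces for the same input.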


### Create a Sample Document for Testing

Let's create a sample text file with technical content to stand in for a PDF in this demonstration.

In [None]:
# Create sample content
sample_content = """Machine Learning Fundamentals

Introduction to Machine Learning
Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves.

Supervised Learning
Supervised learning is a type of machine learning where the algorithm learns from labeled training data. The algorithm makes predictions based on input data and is corrected when its predictions are incorrect. Common applications include classification and regression tasks.

Classification algorithms predict discrete labels. For example, determining whether an email is spam or not spam. Popular classification algorithms include logistic regression, decision trees, random forests, and support vector machines.

Regression algorithms predict continuous values. For instance, predicting house prices based on features like size, location, and age. Linear regression and polynomial regression are fundamental regression techniques.

Unsupervised Learning
Unsupervised learning involves training algorithms on unlabeled data. The system tries to learn the patterns and structure from the data without explicit guidance. Clustering and dimensionality reduction are primary unsupervised learning techniques.

K-means clustering groups similar data points together. It's widely used in customer segmentation, image compression, and anomaly detection. The algorithm iteratively assigns data points to clusters based on feature similarity.

Deep Learning
Deep learning uses artificial neural networks with multiple layers to progressively extract higher-level features from raw input. It has revolutionized fields like computer vision, natural language processing, and speech recognition.

Convolutional Neural Networks (CNNs) are particularly effective for image processing tasks. They use convolutional layers to automatically learn spatial hierarchies of features, making them ideal for tasks like image classification and object detection.

Recurrent Neural Networks (RNNs) are designed for sequence data. They maintain an internal state that captures information about previous inputs, making them suitable for tasks like language modeling and time series prediction.

Natural Language Processing
Natural Language Processing (NLP) focuses on the interaction between computers and human language. Modern NLP uses transformer architectures like BERT and GPT, which have achieved state-of-the-art results across various language tasks.

Word embeddings represent words as dense vectors in a continuous vector space, where semantically similar words are closer together. Popular embedding techniques include Word2Vec, GloVe, and contextual embeddings from transformer models.

Model Evaluation
Evaluating machine learning models is crucial for understanding their performance. Common metrics include accuracy, precision, recall, F1-score for classification, and mean squared error, R-squared for regression.

Cross-validation helps assess how well a model generalizes to unseen data. K-fold cross-validation divides the data into k subsets and trains the model k times, each time using a different subset for validation.

Overfitting and Underfitting
Overfitting occurs when a model learns the training data too well, including its noise, resulting in poor performance on new data. Regularization techniques like L1 and L2 regularization help prevent overfitting.

Underfitting happens when a model is too simple to capture the underlying patterns in the data. This can be addressed by increasing model complexity or using more relevant features.

Feature Engineering
Feature engineering is the process of using domain knowledge to create features that make machine learning algorithms work better. Good features can significantly improve model performance.

Feature scaling ensures that all features contribute equally to the model. Standardization and normalization are common scaling techniques that transform features to a similar scale.
"""

# Write to a text file (simulating PDF content)
os.makedirs('sample_data', exist_ok=True)
with open('sample_data/ml_fundamentals.txt', 'w') as f:
    f.write(sample_content)

print("✓ Sample content created")
print(f"Content length: {len(sample_content)} characters")
print(f"\nFirst 200 characters:\n{sample_content[:200]}...")

✓ Sample content created
Content length: 4082 characters

First 200 characters:
Machine Learning Fundamentals

Introduction to Machine Learning
Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicit...


In [None]:
# For this demo, we'll work with the text file
# In practice, you would use actual PDF files

def parse_text_file(file_path: str) -> List[Dict]:
    """Parse text file as if it were a PDF."""
    with open(file_path, 'r') as f:
        content = f.read()
    
    # Split into "pages" by double newlines (paragraph breaks)
    sections = content.split('\n\n')
    
    pages_data = []
    for page_num, section in enumerate(sections, start=1):
        if section.strip():
            pages_data.append({
                'page_number': page_num,
                'text': section,
                'file_name': Path(file_path).name
            })
    
    return pages_data

# Process the sample file
print("=" * 80)
print("PROCESSING SAMPLE DOCUMENT")
print("=" * 80)

file_path = 'sample_data/ml_fundamentals.txt'
pages_data = parse_text_file(file_path)

print(f"\n✓ Extracted {len(pages_data)} sections")

PROCESSING SAMPLE DOCUMENT

✓ Extracted 18 sections
{'page_number': 1, 'text': 'Machine Learning Fundamentals', 'file_name': 'ml_fundamentals.txt'}
{'page_number': 2, 'text': 'Introduction to Machine Learning\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves.', 'file_name': 'ml_fundamentals.txt'}
{'page_number': 3, 'text': 'Supervised Learning\nSupervised learning is a type of machine learning where the algorithm learns from labeled training data. The algorithm makes predictions based on input data and is corrected when its predictions are incorrect. Common applications include classification and regression tasks.', 'file_name': 'ml_fundamentals.txt'}
{'page_number': 4, 'text': 'Classification algorithms predict discrete labels. For example, determining whether an email is spam 

In [None]:
# Create chunks
all_chunks = []
chunk_id = 0

for page_data in pages_data:
    print(page_data)
    chunks = chunk_text(page_data['text'], chunk_size=300, overlap=50)
    
    for chunk_num, chunk in enumerate(chunks, start=1):
        all_chunks.append({
            'chunk_id': chunk_id,
            'text': chunk,
            'file_name': page_data['file_name'],
            'page_number': page_data['page_number'],
            'chunk_number': chunk_num,
            'char_count': len(chunk)
        })
        chunk_id += 1

print(f"✓ Created {len(all_chunks)} chunks")

# Display sample chunks
print("\n" + "=" * 80)
print("SAMPLE CHUNKS")
print("=" * 80)

for i, chunk in enumerate(all_chunks[:3]):
    print(f"\nChunk {i}:")
    print(f"  File: {chunk['file_name']}")
    print(f"  Page: {chunk['page_number']} | Chunk: {chunk['chunk_number']}")
    print(f"  Length: {chunk['char_count']} chars")
    print(f"  Text: {chunk['text'][:150]}...")
    print("-" * 80)

In [None]:
all_chunks = process_pdf_to_chunks('stephen_hawking_a_brief_history_of_time.pdf', chunk_size=300, overlap=50)


📄 Parsing PDF: stephen_hawking_a_brief_history_of_time.pdf
✓ Extracted text from 101 pages
✓ Created 1599 chunks from 101 pages


---

## Part 2: Generate Embeddings

Now we'll generate vector embeddings for each chunk using Sentence Transformers.

In [None]:
print("=" * 80)
print("GENERATING EMBEDDINGS")
print("=" * 80)

# Load embedding model
print("\nLoading Sentence Transformer model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embedding_dim = embedding_model.get_sentence_embedding_dimension()

print("✓ Model loaded")
print(f"Embedding dimension: {embedding_dim}")

# Generate embeddings for all chunks
print(f"\nGenerating embeddings for {len(all_chunks)} chunks...")
texts = [chunk['text'] for chunk in all_chunks]
embeddings = embedding_model.encode(texts, show_progress_bar=True)

print(f"\n✓ Generated embeddings")
print(f"Embeddings shape: {embeddings.shape}")
print(f"  • {embeddings.shape[0]} chunks")
print(f"  • {embeddings.shape[1]} dimensions per embedding")

# Add embeddings to chunks
for chunk, embedding in zip(all_chunks, embeddings):
    chunk['embedding'] = embedding

print("\n✓ Embeddings added to chunk metadata")

GENERATING EMBEDDINGS

Loading Sentence Transformer model...
✓ Model loaded
Embedding dimension: 384

Generating embeddings for 1599 chunks...

Batches: 100%|██████████| 50/50 [00:17<00:00,  2.92it/s]

✓ Generated embeddings
Embeddings shape: (1599, 384)
  • 1599 chunks
  • 384 dimensions per embedding

✓ Embeddings added to chunk metadata

---

## Part 3: Store in FAISS Vector Database

FAISS is a library for efficient similarity search of high-dimensional vectors.

### Key Concepts:
- **Index**: The data structure that stores vectors and enables fast search
- **IndexFlatL2**: Exact search using L2 (Euclidean) distance
- **IndexFlatIP**: Exact search using inner product (equivalent to cosine similarity only when the vectors are L2-normalized)
- **IndexIVF**: Approximate search using inverted file indexes (faster, less accurate)
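
The link between the L2 and inner-product metrics can be checked by hand: for unit-length vectors, the squared L2 distance equals 2 minus twice the inner product, so both metrics rank neighbors identically. The sketch below uses plain Python and made-up vectors rather than FAISS itself; `normalize`, `dot`, and `squared_l2` are helpers defined here for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical vectors, normalized so that inner product == cosine similarity
query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 1.9, 0.4], [0.2, 0.1, 3.0], [2.0, 0.5, 0.5])]

for d in docs:
    ip = dot(query, d)
    print(f"inner product={ip:.4f}  squared L2={squared_l2(query, d):.4f}  2-2*ip={2 - 2 * ip:.4f}")

# Larger inner product and smaller L2 distance give the same ordering
rank_by_ip = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
rank_by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
print(rank_by_ip == rank_by_l2)
```

With FAISS, the usual way to get cosine similarity from `IndexFlatIP` is to L2-normalize the vectors before adding them and before searching, for example with `faiss.normalize_L2`.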

In [None]:
print("=" * 80)
print("CREATING FAISS INDEX")
print("=" * 80)

# Create FAISS index
# Using IndexFlatL2 for exact search with L2 distance
dimension = embedding_dim
faiss_index = faiss.IndexFlatL2(dimension)

print(f"\nCreated FAISS IndexFlatL2 with dimension {dimension}")
print(f"Index is trained: {faiss_index.is_trained}")
print(f"Number of vectors: {faiss_index.ntotal}")

# Convert embeddings to float32 (FAISS requirement)
embeddings_array = np.array([chunk['embedding'] for chunk in all_chunks]).astype('float32')

# Add vectors to index
faiss_index.add(embeddings_array)

print(f"\n✓ Added {faiss_index.ntotal} vectors to FAISS index")

# Save index to disk
os.makedirs('vector_dbs', exist_ok=True)
faiss.write_index(faiss_index, 'vector_dbs/faiss_index.bin')
print("✓ Saved FAISS index to 'vector_dbs/faiss_index.bin'")

# Save metadata separately (FAISS only stores vectors)
metadata = []
for chunk in all_chunks:
    chunk_meta = chunk.copy()
    chunk_meta.pop('embedding')  # Remove embedding array for JSON serialization
    metadata.append(chunk_meta)

with open('vector_dbs/faiss_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print("✓ Saved metadata to 'vector_dbs/faiss_metadata.json'")

CREATING FAISS INDEX

Created FAISS IndexFlatL2 with dimension 384
Index is trained: True
Number of vectors: 0

✓ Added 1599 vectors to FAISS index
✓ Saved FAISS index to 'vector_dbs/faiss_index.bin'
✓ Saved metadata to 'vector_dbs/faiss_metadata.json'


---

## Part 4: Store in LanceDB Vector Database

LanceDB is a modern vector database with built-in metadata support and SQL-like queries.

### Key Features:
- **Persistent**: Data is stored on disk
- **Metadata**: Store vectors and metadata together
- **SQL-like**: Familiar query syntax
- **Versioning**: Track data changes over time

In [None]:
print("=" * 80)
print("CREATING LANCEDB DATABASE")
print("=" * 80)

# Connect to LanceDB (creates database if it doesn't exist)
lance_db = lancedb.connect('vector_dbs/lancedb')

print("\n✓ Connected to LanceDB")

# Prepare data for LanceDB
# LanceDB expects a list of dictionaries with vector and metadata
lance_data = []
for chunk in all_chunks:
    lance_data.append({
        'chunk_id': chunk['chunk_id'],
        'text': chunk['text'],
        'file_name': chunk['file_name'],
        'page_number': chunk['page_number'],
        'chunk_number': chunk['chunk_number'],
        'char_count': chunk['char_count'],
        'vector': chunk['embedding'].tolist()  # Convert numpy array to list
    })

# Create table (or overwrite if exists)
table_name = 'document_chunks'
try:
    # Drop table if it exists
    lance_db.drop_table(table_name)
except Exception:
    # Table did not exist yet; nothing to drop
    pass

# Create new table
table = lance_db.create_table(table_name, data=lance_data)

print(f"\n✓ Created table '{table_name}' in LanceDB")
print(f"✓ Added {len(lance_data)} records")
print(f"✓ Database saved to 'vector_dbs/lancedb'")

# Display table info
print(f"\nTable schema:")
print(f"  • chunk_id: integer")
print(f"  • text: string")
print(f"  • file_name: string")
print(f"  • page_number: integer")
print(f"  • chunk_number: integer")
print(f"  • char_count: integer")
print(f"  • vector: float array[{embedding_dim}]")

CREATING LANCEDB DATABASE

✓ Connected to LanceDB

✓ Created table 'document_chunks' in LanceDB
✓ Added 1599 records
✓ Database saved to 'vector_dbs/lancedb'

Table schema:
  • chunk_id: integer
  • text: string
  • file_name: string
  • page_number: integer
  • chunk_number: integer
  • char_count: integer
  • vector: float array[384]


---

## Part 5: Retrieval - Semantic Search

Now let's implement semantic search using both FAISS and LanceDB.

### How Semantic Search Works:
1. Convert query text to embedding vector
2. Find k-nearest neighbors in vector space
3. Return chunks with highest similarity scores
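
The three steps can be sketched without any vector database as a brute-force scan. This is only an illustration: the two-dimensional vectors are made up and stand in for real embeddings, `top_k_search` is a hypothetical helper, and the `1 / (1 + distance)` score mirrors the conversion used by the search functions in this part.

```python
import heapq

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k_search(query_vec, stored, k=2):
    """Return the k stored items closest to query_vec, nearest first."""
    scored = [(squared_l2(query_vec, vec), text) for text, vec in stored]
    nearest = heapq.nsmallest(k, scored)
    return [
        {"text": text, "distance": dist, "similarity_score": 1 / (1 + dist)}
        for dist, text in nearest
    ]

# Step 1 (assumed done elsewhere): texts and query already converted to vectors
stored = [
    ("black holes", [0.9, 0.1]),
    ("expanding universe", [0.2, 0.8]),
    ("big bang", [0.3, 0.9]),
]
query_vec = [0.22, 0.82]  # hypothetical embedding of a query about cosmology

# Steps 2 and 3: find the nearest neighbors and report their scores
for hit in top_k_search(query_vec, stored, k=2):
    print(f"{hit['similarity_score']:.3f}  {hit['text']}")
```

A scan like this compares the query against every stored vector, which is what an exact index does; approximate indexes trade some accuracy to avoid visiting every vector.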

In [None]:
def semantic_search_faiss(query: str, k: int = 5) -> List[Dict]:
    """
    Perform semantic search using FAISS.
    
    Args:
        query: Search query text
        k: Number of results to return
    
    Returns:
        List of dictionaries with results and scores
    """
    # Generate query embedding
    query_embedding = embedding_model.encode([query])[0].astype('float32')
    query_embedding = np.array([query_embedding])  # FAISS expects 2D array
    
    # Search in FAISS index
    distances, indices = faiss_index.search(query_embedding, k)
    
    # Load metadata
    with open('vector_dbs/faiss_metadata.json', 'r') as f:
        metadata = json.load(f)
    
    # Prepare results
    results = []
    for idx, distance in zip(indices[0], distances[0]):
        result = metadata[idx].copy()
        result['distance'] = float(distance)  # IndexFlatL2 reports squared L2 distance
        result['similarity_score'] = float(1 / (1 + distance))  # Map distance to a (0, 1] similarity
        results.append(result)
    
    return results


def semantic_search_lancedb(query: str, k: int = 5) -> List[Dict]:
    """
    Perform semantic search using LanceDB.
    
    Args:
        query: Search query text
        k: Number of results to return
    
    Returns:
        List of dictionaries with results and scores
    """
    # Generate query embedding
    query_embedding = embedding_model.encode([query])[0]
    
    # Open table
    table = lance_db.open_table('document_chunks')
    
    # Search using vector similarity
    results = table.search(query_embedding).limit(k).to_list()
    
    # Format results
    formatted_results = []
    for result in results:
        formatted_result = {
            'chunk_id': result['chunk_id'],
            'text': result['text'],
            'file_name': result['file_name'],
            'page_number': result['page_number'],
            'chunk_number': result['chunk_number'],
            'char_count': result['char_count'],
            'distance': result['_distance'],
            'similarity_score': 1 / (1 + result['_distance'])
        }
        formatted_results.append(formatted_result)
    
    return formatted_results

print("✓ Semantic search functions defined")

✓ Semantic search functions defined


In [None]:
# Test semantic search
print("=" * 80)
print("SEMANTIC SEARCH TEST")
print("=" * 80)

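# Queries suited to the sample ML document
test_queries = [
    "How do neural networks learn from data?",
    "What is the difference between classification and regression?",
    "Explain clustering algorithms"
]

# Queries suited to the PDF loaded earlier; this assignment replaces the list above
test_queries = [
    "Fate of the Universe",
    "accurate measurements",
    "astronomy"
]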

for query in test_queries:
    print(f"\n{'='*80}")
    print(f"Query: '{query}'")
    print(f"{'='*80}")
    
    # Search with FAISS
    print("\n🔍 FAISS Results:")
    print("-" * 80)
    faiss_results = semantic_search_faiss(query, k=3)
    
    for i, result in enumerate(faiss_results, 1):
        print(f"\n{i}. [Score: {result['similarity_score']:.3f}] "
              f"Page {result['page_number']}, Chunk {result['chunk_number']}")
        print(f"   {result['text'][:200]}...")
    
    # Search with LanceDB
    print("\n\n🔍 LanceDB Results:")
    print("-" * 80)
    lance_results = semantic_search_lancedb(query, k=3)
    
    for i, result in enumerate(lance_results, 1):
        print(f"\n{i}. [Score: {result['similarity_score']:.3f}] "
              f"Page {result['page_number']}, Chunk {result['chunk_number']}")
        print(f"   {result['text'][:200]}...")

SEMANTIC SEARCH TEST

Query: 'Fate of the Universe'

🔍 FAISS Results:
--------------------------------------------------------------------------------

1. [Score: 0.545] Page 1, Chunk 2
   t So Black
Chapter 8 - The Origin and Fate of the Universe
Chapter 9 - The Arrow of Time
Chapter 10 - Wormholes and Time Travel
Chapter 11 - The Unification of Physics
Chapter 12 - Conclusion
Glossary...

2. [Score: 0.539] Page 71, Chunk 9
   ofound implications for the role
of God in the affairs of the universe. With the success of scientific theories in describing events, most people have
come to believe that God allows the universe to e...

3. [Score: 0.537] Page 94, Chunk 2
   been an effective beginning of time. Similarly, if the whole universe recollapsed, there
must be another state of infinite density in the future, the big crunch, which would be an end of time. Even if...


🔍 LanceDB Results:
--------------------------------------------------------------------------------

1. [Score: 0

---

## Part 6: Retrieval - Keyword Search

Traditional keyword search looks for exact text matches.

### Keyword Search vs Semantic Search:

| Aspect | Keyword Search | Semantic Search |
|--------|----------------|------------------|
| Matching | Exact text match | Meaning-based |
| Synonyms | Misses synonyms | Finds synonyms |
| Context | No understanding | Context-aware |
| Speed | Very fast | Fast (with index) |
| Use Case | Known terms | Natural questions |
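
One detail worth knowing before reading the functions in this part: they test `term in text`, which is a substring check, so a short term such as "of" also matches inside "profound". A possible refinement, sketched below with the hypothetical helpers `tokenize` and `whole_word_score`, splits the text into words with a regular expression and counts only whole-word matches.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def whole_word_score(query, text):
    """Fraction of query terms that appear in text as whole words."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    text_tokens = set(tokenize(text))
    matched = sum(1 for term in query_terms if term in text_tokens)
    return matched / len(query_terms)

text = "A profound theory about other worlds."
print("of" in text.lower())                       # substring check
print(whole_word_score("of", text))               # whole-word check
print(whole_word_score("profound theory", text))  # both terms are whole words
```

Whole-word matching is stricter, so it can also miss inflected forms (for example "theories" for "theory") that a substring check would catch by accident.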

In [None]:
def keyword_search(query: str, chunks: List[Dict], top_k: int = 5) -> List[Dict]:
    """
    Perform simple keyword search.
    
    Args:
        query: Search query
        chunks: List of chunks to search
        top_k: Number of results to return
    
    Returns:
        List of matching chunks with scores
    """
    query_lower = query.lower()
    query_terms = query_lower.split()
    
    results = []
    for chunk in chunks:
        text_lower = chunk['text'].lower()
        
        # Count term matches
        matches = sum(1 for term in query_terms if term in text_lower)
        
        if matches > 0:
            score = matches / len(query_terms)  # Normalized score
            result = chunk.copy()
            result.pop('embedding', None)
            result['keyword_score'] = score
            result['matched_terms'] = matches
            results.append(result)
    
    # Sort by score
    results.sort(key=lambda x: x['keyword_score'], reverse=True)
    
    return results[:top_k]


def keyword_search_lancedb(query: str, k: int = 5) -> List[Dict]:
    """
    Perform keyword search over the rows stored in LanceDB.
    
    The table is loaded into pandas and filtered in Python; no LanceDB
    query or full-text index is used.
    
    Args:
        query: Search query
        k: Number of results to return
    
    Returns:
        List of matching chunks
    """
    table = lance_db.open_table('document_chunks')
    
    # Simple substring ("contains") search on lowercased text
    query_lower = query.lower()
    
    # For simplicity, load every row and filter in Python rather than
    # setting up a full-text search index
    all_results = table.to_pandas()
    
    # Filter based on keyword presence
    query_terms = query_lower.split()
    matches = []
    
    for _, row in all_results.iterrows():
        text_lower = row['text'].lower()
        matched_terms = sum(1 for term in query_terms if term in text_lower)
        
        if matched_terms > 0:
            result = row.to_dict()
            result.pop('vector', None)  # Remove vector for display
            result['keyword_score'] = matched_terms / len(query_terms)
            result['matched_terms'] = matched_terms
            matches.append(result)
    
    # Sort by score
    matches.sort(key=lambda x: x['keyword_score'], reverse=True)
    
    return matches[:k]

print("✓ Keyword search functions defined")

✓ Keyword search functions defined


In [None]:
# Test keyword search
print("=" * 80)
print("KEYWORD SEARCH TEST")
print("=" * 80)

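# Queries suited to the sample ML document
test_keyword_queries = [
    "neural networks",
    "classification regression",
    "clustering"
]

# Queries suited to the PDF loaded earlier; this assignment replaces the list above
test_keyword_queries = [
    "Fate of the Universe",
    "accurate measurements",
    "astronomy"
]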

for query in test_keyword_queries:
    print(f"\n{'='*80}")
    print(f"Query: '{query}'")
    print(f"{'='*80}")
    
    # Keyword search (simple)
    print("\n🔎 Keyword Search Results (In-Memory):")
    print("-" * 80)
    keyword_results = keyword_search(query, all_chunks, top_k=3)
    
    if keyword_results:
        for i, result in enumerate(keyword_results, 1):
            print(f"\n{i}. [Score: {result['keyword_score']:.2f}, "
                  f"Matched: {result['matched_terms']} terms] "
                  f"Page {result['page_number']}, Chunk {result['chunk_number']}")
            print(f"   {result['text'][:200]}...")
    else:
        print("No matches found.")
    
    # Keyword search with LanceDB
    print("\n\n🔎 Keyword Search Results (LanceDB):")
    print("-" * 80)
    lance_keyword_results = keyword_search_lancedb(query, k=3)
    
    if lance_keyword_results:
        for i, result in enumerate(lance_keyword_results, 1):
            print(f"\n{i}. [Score: {result['keyword_score']:.2f}, "
                  f"Matched: {result['matched_terms']} terms] "
                  f"Page {result['page_number']}, Chunk {result['chunk_number']}")
            print(f"   {result['text'][:200]}...")
    else:
        print("No matches found.")

KEYWORD SEARCH TEST

Query: 'Fate of the Universe'

🔎 Keyword Search Results (In-Memory):
--------------------------------------------------------------------------------

1. [Score: 1.00, Matched: 4 terms] Page 1, Chunk 2
   t So Black
Chapter 8 - The Origin and Fate of the Universe
Chapter 9 - The Arrow of Time
Chapter 10 - Wormholes and Time Travel
Chapter 11 - The Unification of Physics
Chapter 12 - Conclusion
Glossary...

2. [Score: 1.00, Matched: 4 terms] Page 60, Chunk 19
   swers that this
approach suggests for the origin and fate of the universe and its contents, such as astronauts, will be de-scribed in the
next two chapters. We shall see that although the uncertainty ...

3. [Score: 1.00, Matched: 4 terms] Page 61, Chunk 1
   CHAPTER 8
THE ORIGIN AND FATE OF THE UNIVERSE
Einstein’s general theory of relativity, on its own, predicted that space-time began at the big bang singularity and
would come to an end either at the ...


🔎 Keyword Search Results (LanceDB):
--

---

## Part 7: Comparison - Keyword vs Semantic Search

Let's compare how the two search methods perform on the same query.

In [None]:
print("=" * 80)
print("KEYWORD VS SEMANTIC SEARCH COMPARISON")
print("=" * 80)

comparison_queries = [
    ("What techniques prevent overfitting?", "Query with no exact keyword matches"),
    ("neural networks deep learning", "Query with exact keywords"),
    ("How to evaluate model performance?", "Natural language question")
]

for query, description in comparison_queries:
    print(f"\n{'='*80}")
    print(f"Query: '{query}'")
    print(f"Type: {description}")
    print(f"{'='*80}")
    
    # Keyword search
    print("\n🔎 KEYWORD SEARCH:")
    print("-" * 80)
    keyword_results = keyword_search(query, all_chunks, top_k=2)
    
    if keyword_results:
        for i, result in enumerate(keyword_results, 1):
            print(f"\n{i}. Score: {result['keyword_score']:.2f}")
            print(f"   {result['text'][:150]}...")
    else:
        print("❌ No matches found")
    
    # Semantic search
    print("\n\n🔍 SEMANTIC SEARCH:")
    print("-" * 80)
    semantic_results = semantic_search_faiss(query, k=2)
    
    for i, result in enumerate(semantic_results, 1):
        print(f"\n{i}. Score: {result['similarity_score']:.3f}")
        print(f"   {result['text'][:150]}...")
    
    print("\n" + "="*80)
    print("📊 Analysis:")
    if not keyword_results:
        print("  • Keyword search failed (no exact matches)")
        print("  • Semantic search succeeded (understood meaning)")
    elif len(keyword_results) < len(semantic_results):
        print("  • Keyword search found fewer results")
        print("  • Semantic search more comprehensive")
    else:
        print("  • Both methods found results")
        print("  • Semantic search may find more relevant context")

KEYWORD VS SEMANTIC SEARCH COMPARISON

Query: 'What techniques prevent overfitting?'
Type: Query with no exact keyword matches

🔎 KEYWORD SEARCH:
--------------------------------------------------------------------------------

1. Score: 0.50
   Overfitting and Underfitting
Overfitting occurs when a model learns the training data too well, including its noise, resulting in poor performance on ...

2. Score: 0.25
   Regression algorithms predict continuous values. For instance, predicting house prices based on features like size, location, and age. Linear regressi...


🔍 SEMANTIC SEARCH:
--------------------------------------------------------------------------------

1. Score: 0.582
   Overfitting and Underfitting
Overfitting occurs when a model learns the training data too well, including its noise, resulting in poor performance on ...

2. Score: 0.528
   Underfitting happens when a model is too simple to capture the underlying patterns in the data. This can be addressed by incr
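
One of the keyword hits above is a regression chunk scoring 0.25 on a query about overfitting, which is what a match on a generic word rather than on the topic would look like. The sketch below illustrates that effect under an assumption: that the keyword score is the fraction of query terms found in a chunk. `overlap_score`, `STOPWORDS`, and the sample chunk are hypothetical and only for illustration; the notebook's own `keyword_search` may tokenize and score differently.

```python
import re

# Hypothetical stopword list for illustration; real lists are much longer.
STOPWORDS = {"what", "how", "the", "a", "an", "is", "to", "in", "of", "does"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, chunk, drop_stopwords=False):
    """Fraction of distinct query terms that also appear in the chunk."""
    terms = set(tokenize(query))
    if drop_stopwords:
        terms -= STOPWORDS
    if not terms:
        return 0.0
    chunk_terms = set(tokenize(chunk))
    return len(terms & chunk_terms) / len(terms)

query = "What techniques prevent overfitting?"
# Made-up chunk text, not taken from the notebook's document
chunk = "What is regression? Regression algorithms predict continuous values."

print(overlap_score(query, chunk))                       # only "what" matches
print(overlap_score(query, chunk, drop_stopwords=True))  # no content-word overlap
```

Dropping stopwords before scoring is one way to keep generic words from producing spurious keyword hits.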

---

## Part 8: Hybrid Search (Best of Both Worlds)

Combine keyword and semantic scores into a single weighted ranking, so that exact term matches and semantic similarity both contribute to the final order.

In [None]:
def hybrid_search(query: str, k: int = 5, keyword_weight: float = 0.3, 
                 semantic_weight: float = 0.7) -> List[Dict]:
    """
    Combine keyword and semantic search with weighted scoring.
    
    Args:
        query: Search query
        k: Number of results
        keyword_weight: Weight for keyword scores
        semantic_weight: Weight for semantic scores
    
    Returns:
        List of results with hybrid scores
    """
    # Get both types of results
    keyword_results = keyword_search(query, all_chunks, top_k=k*2)
    semantic_results = semantic_search_faiss(query, k=k*2)
    
    # Create score dictionary
    hybrid_scores = {}
    
    # Add keyword scores
    for result in keyword_results:
        chunk_id = result['chunk_id']
        hybrid_scores[chunk_id] = {
            'chunk': result,
            'keyword_score': result['keyword_score'],
            'semantic_score': 0.0
        }
    
    # Add/update with semantic scores
    for result in semantic_results:
        chunk_id = result['chunk_id']
        if chunk_id in hybrid_scores:
            hybrid_scores[chunk_id]['semantic_score'] = result['similarity_score']
        else:
            hybrid_scores[chunk_id] = {
                'chunk': result,
                'keyword_score': 0.0,
                'semantic_score': result['similarity_score']
            }
    
    # Calculate hybrid scores
    results = []
    for chunk_id, scores in hybrid_scores.items():
        hybrid_score = (keyword_weight * scores['keyword_score'] + 
                       semantic_weight * scores['semantic_score'])
        
        result = scores['chunk'].copy()
        result['keyword_score'] = scores['keyword_score']
        result['semantic_score'] = scores['semantic_score']
        result['hybrid_score'] = hybrid_score
        results.append(result)
    
    # Sort by hybrid score
    results.sort(key=lambda x: x['hybrid_score'], reverse=True)
    
    return results[:k]

print("✓ Hybrid search function defined")

✓ Hybrid search function defined
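
The weighted sum in `hybrid_search` adds a keyword score and a similarity score directly, and the two are not necessarily on comparable scales. A common alternative is reciprocal rank fusion (RRF), which combines rank positions instead of raw scores. The following is a minimal sketch of that swapped-in technique, not the notebook's method; `rrf_fuse`, the constant `c=60` (a conventional default), and the toy chunk ids are assumptions for illustration.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=5, c=60):
    """Fuse several ranked lists of chunk ids with reciprocal rank fusion.

    Each list is ordered best-first. A chunk's fused score is the sum of
    1 / (c + rank) over every list it appears in (rank starts at 1).
    """
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] += 1.0 / (c + rank)
    # Sort by fused score, highest first, and keep the top k
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:k]

# Toy chunk ids, not taken from the notebook's data
keyword_ranking = [15, 3, 8]
semantic_ranking = [15, 8, 9]

for chunk_id, score in rrf_fuse([keyword_ranking, semantic_ranking], k=3):
    print(chunk_id, round(score, 4))
```

With the notebook's functions, each ranking could be built from the `chunk_id` field of the results, for example `[r['chunk_id'] for r in keyword_results]`.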


In [None]:
# Test hybrid search
print("=" * 80)
print("HYBRID SEARCH TEST")
print("=" * 80)

query = "How does regularization prevent overfitting in neural networks?"

print(f"\nQuery: '{query}'")
print("="*80)

hybrid_results = hybrid_search(query, k=3)

print("\n🔀 HYBRID SEARCH RESULTS:")
print("-" * 80)

for i, result in enumerate(hybrid_results, 1):
    print(f"\n{i}. Hybrid Score: {result['hybrid_score']:.3f}")
    print(f"   └─ Keyword: {result['keyword_score']:.3f} | "
          f"Semantic: {result['semantic_score']:.3f}")
    print(f"   Page {result['page_number']}, Chunk {result['chunk_number']}")
    print(f"   {result['text'][:200]}...")

print("\n" + "="*80)
print("✅ Hybrid search combines:")
print("  • Keyword matching for exact term relevance")
print("  • Semantic understanding for context and meaning")
print("  • Weighted scoring for balanced results")

HYBRID SEARCH TEST

Query: 'How does regularization prevent overfitting in neural networks?'

🔀 HYBRID SEARCH RESULTS:
--------------------------------------------------------------------------------

1. Hybrid Score: 0.618
   └─ Keyword: 0.500 | Semantic: 0.669
   Page 15, Chunk 1
   Overfitting and Underfitting
Overfitting occurs when a model learns the training data too well, including its noise, resulting in poor performance on new data. Regularization techniques like L1 and L2...

2. Hybrid Score: 0.389
   └─ Keyword: 0.250 | Semantic: 0.449
   Page 8, Chunk 1
   Deep Learning
Deep learning uses artificial neural networks with multiple layers to progressively extract higher-level features from raw input. It has revolutionized fields like computer vision, natur...

3. Hybrid Score: 0.364
   └─ Keyword: 0.250 | Semantic: 0.414
   Page 9, Chunk 1
   Convolutional Neural Networks (CNNs) are particularly effective for image processing tasks. They use convolutional layers
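
As a quick check of the weighting, the hybrid scores can be recomputed from the two component scores printed above, assuming the default weights of `hybrid_search` (0.3 keyword, 0.7 semantic). The inputs are the rounded values from the output, so the last digit may not always agree with the printed score.

```python
keyword_weight, semantic_weight = 0.3, 0.7

def hybrid(keyword_score, semantic_score):
    """Weighted sum used by hybrid_search, applied to already-rounded inputs."""
    return keyword_weight * keyword_score + semantic_weight * semantic_score

print(f"{hybrid(0.500, 0.669):.3f}")  # result 1
print(f"{hybrid(0.250, 0.414):.3f}")  # result 3
```

The first line reproduces the printed 0.618. The third result comes out as 0.365 rather than the printed 0.364, because the semantic score used here has already been rounded to three decimals.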

---

## Part 9: Loading Saved Indexes

This part reloads the saved FAISS index, its metadata file, and the LanceDB table from disk, then runs a test query against each to confirm they still work.

In [None]:
print("=" * 80)
print("LOADING SAVED VECTOR DATABASES")
print("=" * 80)

# Load FAISS index
print("\n📂 Loading FAISS index...")
loaded_faiss_index = faiss.read_index('vector_dbs/faiss_index.bin')
print(f"✓ Loaded FAISS index with {loaded_faiss_index.ntotal} vectors")

# Load FAISS metadata
with open('vector_dbs/faiss_metadata.json', 'r') as f:
    loaded_metadata = json.load(f)
print(f"✓ Loaded {len(loaded_metadata)} metadata records")

# Connect to LanceDB
print("\n📂 Loading LanceDB...")
loaded_lance_db = lancedb.connect('vector_dbs/lancedb')
loaded_table = loaded_lance_db.open_table('document_chunks')
record_count = loaded_table.count_rows()
print(f"✓ Loaded LanceDB with {record_count} records")

# Test search with loaded indexes
print("\n🔍 Testing search with loaded indexes...")
test_query = "machine learning"
print(f"Query: '{test_query}'\n")

# Search loaded FAISS
query_embedding = embedding_model.encode([test_query])[0].astype('float32')
query_embedding = np.array([query_embedding])
distances, indices = loaded_faiss_index.search(query_embedding, 2)

print("FAISS Results:")
for idx, distance in zip(indices[0], distances[0]):
    print(f"  • Score: {1/(1+distance):.3f} - {loaded_metadata[idx]['text'][:100]}...")

# Search loaded LanceDB
query_embedding = embedding_model.encode([test_query])[0]
lance_results = loaded_table.search(query_embedding).limit(2).to_list()

print("\nLanceDB Results:")
for result in lance_results:
    print(f"  • Score: {1/(1+result['_distance']):.3f} - {result['text'][:100]}...")

print("\n✅ Successfully loaded and searched both vector databases!")

LOADING SAVED VECTOR DATABASES

📂 Loading FAISS index...
✓ Loaded FAISS index with 23 vectors
✓ Loaded 23 metadata records

📂 Loading LanceDB...
✓ Loaded LanceDB with 23 records

🔍 Testing search with loaded indexes...
Query: 'machine learning'

FAISS Results:
  • Score: 0.676 - Machine Learning Fundamentals...
  • Score: 0.563 - Introduction to Machine Learning
Machine learning is a subset of artificial intelligence that enable...

LanceDB Results:
  • Score: 0.676 - Machine Learning Fundamentals...
  • Score: 0.563 - Introduction to Machine Learning
Machine learning is a subset of artificial intelligence that enable...

✅ Successfully loaded and searched both vector databases!
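
When an index and its metadata are reloaded from separate files, it helps to guard the lookup between them. FAISS returns -1 in the index array for result slots it could not fill (for example when `k` exceeds the number of stored vectors), and in Python `loaded_metadata[-1]` would silently return the last record. The helper below is a sketch with a hypothetical name, `hits_to_records`, and toy inputs; it reuses the `1/(1+distance)` score conversion from the cell above.

```python
def hits_to_records(indices, distances, metadata):
    """Pair FAISS-style search output with metadata, skipping unusable slots."""
    records = []
    for idx, distance in zip(indices, distances):
        if idx < 0 or idx >= len(metadata):
            continue  # padded (-1) or out-of-range hit: no metadata to attach
        record = dict(metadata[idx])  # copy so the stored metadata is not mutated
        record["similarity_score"] = 1 / (1 + distance)
        records.append(record)
    return records

# Toy data standing in for one row of `indices` / `distances` and for `loaded_metadata`
toy_metadata = [{"text": "chunk A"}, {"text": "chunk B"}]
print(hits_to_records([1, -1], [0.25, 1e30], toy_metadata))
```

With the loaded objects it would be called as `hits_to_records(indices[0], distances[0], loaded_metadata)`. Comparing `loaded_faiss_index.ntotal` with `len(loaded_metadata)` right after loading is a cheap additional consistency check.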


---

## Summary and Key Takeaways

### 🎯 What We Learned

1. **Vector Databases** enable efficient similarity search at scale
2. **FAISS** provides fast in-memory vector search
3. **LanceDB** offers persistent storage with metadata support

### 📊 Comparison: FAISS vs LanceDB

| Feature | FAISS | LanceDB |
|---------|-------|----------|
| **Storage** | In-memory (with save/load) | Disk-based, persistent |
| **Metadata** | Separate storage required | Built-in support |
| **Speed** | Extremely fast | Fast |
| **Scalability** | Limited by RAM | Scales to disk |
| **Queries** | Vector search only | Vector + SQL-like |
| **Use Case** | Research, prototypes | Production apps |
| **Updates** | Append with `add`; deletion support depends on index type | Easy updates |
| **Versioning** | Manual | Built-in |

### 🔍 Search Methods Comparison

| Method | Pros | Cons | Best For |
|--------|------|------|----------|
| **Keyword** | Fast, exact matches | Misses synonyms, no context | Known terms, filters |
| **Semantic** | Understands meaning | Slower, needs embeddings | Natural questions |
| **Hybrid** | Best of both worlds | More complex | Production systems |

### 🚀 Production Considerations

1. **Chunking Strategy**: Balance context vs specificity (300-500 chars)
2. **Overlap**: Use 10-20% overlap to maintain context
3. **Metadata**: Store page, chunk, file info for citations
4. **Index Type**: Choose based on scale and accuracy needs
5. **Hybrid Search**: Combine methods for best results
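
The chunk size and overlap guidance in items 1 and 2 can be made concrete with a small character-based splitter. This is a sketch, not the notebook's chunking code: `chunk_text` is a hypothetical helper, and the defaults (400 characters with 60 characters of overlap, i.e. 15%) are simply values inside the ranges given above.

```python
def chunk_text(text, chunk_size=400, overlap=60):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = "x" * 1000  # placeholder text
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

Splitting on sentence or paragraph boundaries instead of raw character offsets is a common refinement, at the cost of less uniform chunk sizes.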

### 🛠️ Next Steps

- Implement RAG pipeline with vector database
- Add filtering by metadata
- Experiment with different chunking strategies
- Try other vector databases (Chroma, Pinecone, Weaviate)
- Implement re-ranking for better results
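
As a starting point for the metadata filtering step, search results can be post-filtered in plain Python. The sketch below assumes result dicts that carry fields such as `page_number`, like those printed in the hybrid search test; `filter_by_metadata` is a hypothetical helper.

```python
def filter_by_metadata(results, **conditions):
    """Keep only results whose metadata fields equal every given condition."""
    return [
        r for r in results
        if all(r.get(field) == value for field, value in conditions.items())
    ]

# Results shaped like the dicts printed by the hybrid search test
toy_results = [
    {"page_number": 15, "chunk_number": 1, "hybrid_score": 0.618},
    {"page_number": 8, "chunk_number": 1, "hybrid_score": 0.389},
    {"page_number": 9, "chunk_number": 1, "hybrid_score": 0.364},
]
print(filter_by_metadata(toy_results, page_number=15))
```

Post-filtering can leave fewer than `k` results, so it is usually paired with fetching more candidates than needed. LanceDB's query builder also offers a `where(...)` clause for filtering inside the database; its documentation describes the exact filter syntax.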

### 📚 Additional Vector Databases

- **Chroma**: Simple, embedded, great for prototypes
- **Pinecone**: Managed, cloud-native, production-ready
- **Weaviate**: GraphQL API, hybrid search built-in
- **Qdrant**: Rust-based, high performance
- **Milvus**: Large-scale, distributed

---

## 🎓 Conclusion

Vector databases underpin many modern AI applications. They enable:
- **Fast semantic search** across millions of documents
- **Efficient RAG** pipelines for LLMs
- **Scalable** similarity search
- **Flexible** metadata filtering

Choose the right vector database for your use case:
- **Prototype/Research**: FAISS, Chroma
- **Production**: LanceDB, Pinecone, Weaviate
- **Large Scale**: Milvus, Qdrant

**The combination of embeddings + vector databases powers the next generation of AI applications!** 🚀