# Week 3: Keyword Search First - The Critical Foundation

> **The 90% Problem:** Most RAG systems jump straight to vector search and miss the foundation that powers the best retrieval systems. We're doing it right!

## ESSENTIAL SETUP - Do This First!

**Before running any cells, ensure your environment is properly configured:**

```bash
# 1. CRITICAL: Copy the environment configuration
cp .env.example .env

# 2. Verify these Week 3 settings are in your .env:
# OPENSEARCH__HOST=http://opensearch:9200
# OPENSEARCH__INDEX_NAME=arxiv-papers
# ARXIV__MAX_RESULTS=15
```

**Important:** Week 3 requires the `.env` file for OpenSearch connectivity and service configuration. The defaults in `.env.example` work perfectly out of the box!
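If you want to double-check the file from Python, here is a minimal sketch. It assumes `.env` uses plain `KEY=VALUE` lines and that the notebook runs from the project root; the helper names `parse_env` and `missing_keys` are made up for this example and are not part of the project.

```python
from pathlib import Path

# Keys listed in the Week 3 setup above.
REQUIRED_KEYS = ("OPENSEARCH__HOST", "OPENSEARCH__INDEX_NAME", "ARXIV__MAX_RESULTS")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    settings = parse_env(text)
    return [key for key in required if not settings.get(key)]


env_path = Path(".env")  # assumption: current directory is the project root
if env_path.exists():
    print("Missing keys:", missing_keys(env_path.read_text()) or "none")
else:
    print("No .env found in the current directory")
```

An empty "missing" list only means the keys are present; it does not prove the values are reachable from your machine.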

**Why Keyword Search First?**
- **Exact Match Power:** Find specific technical terms and paper IDs precisely
- **Speed & Efficiency:** BM25 is fast and doesn't require expensive embedding models
- **Interpretable:** You understand exactly why papers were retrieved
- **Production Reality:** Search engines such as Elasticsearch and OpenSearch use BM25 keyword ranking as their default relevance model
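To make the "interpretable" point concrete, here is a minimal pure-Python sketch of the classic BM25 formula. It assumes lowercase whitespace tokenization, the common defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant used by Lucene. OpenSearch applies its own analyzers and index statistics, so real scores will differ; the function name and toy corpus are illustrative only.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with classic BM25.

    Assumptions: lowercase whitespace tokenization and the IDF
    ln(1 + (N - n + 0.5) / (n + 0.5)).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # Term frequency saturates via k1; b normalizes for document length.
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "deep learning for image recognition",
    "reinforcement learning agents",
    "graph theory and combinatorics",
]
print(bm25_scores("deep learning", toy_docs))
```

Every score is a sum of per-term contributions, which is why you can always trace a result back to the words that matched.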

---

# Week 3: OpenSearch Integration & BM25 Search

**What We're Building This Week:**

Week 3 focuses on implementing OpenSearch integration for full-text search capabilities using BM25 scoring. This transforms our system from a simple storage solution into a searchable knowledge base.

## Week 3 Focus Areas

### Core Objectives
- **OpenSearch Integration**: Connect our FastAPI application to OpenSearch cluster
- **Index Management**: Create and manage the arxiv-papers index with proper mappings
- **BM25 Search**: Implement full-text search with relevance scoring
- **Data Pipeline**: Transfer papers from PostgreSQL to OpenSearch
- **Search API**: Expose search functionality through REST endpoints

### What We'll Test In This Notebook
1. **Infrastructure Verification** - Ensure all services from Week 1-2 are running
2. **OpenSearch Service Integration** - Test client creation and health checks
3. **Index Creation & Management** - Create arxiv-papers index with proper mappings
4. **Data Pipeline** - Transfer papers from PostgreSQL to OpenSearch
5. **BM25 Search Functionality** - Test search queries with relevance scoring
6. **Search API Endpoints** - Verify FastAPI search endpoints work correctly

### Success Metrics
- OpenSearch cluster healthy and accessible
- arxiv-papers index created with proper mappings
- Papers successfully indexed from PostgreSQL
- BM25 search returns relevant results with scores
- Search API endpoints respond correctly
- All components ready for production use

---

## Week 3 Component Status
| Component | Purpose | Status |
|-----------|---------|--------|
| **OpenSearch Client** | Connect to OpenSearch cluster | ✅ Complete |
| **Index Management** | Create and manage search indices | ✅ Complete |
| **Query Builder** | Build complex search queries | ✅ Complete |
| **Data Pipeline** | Transfer papers to OpenSearch | ✅ Complete |
| **Search API** | REST endpoints for search | ✅ Complete |
| **BM25 Scoring** | Relevance-based search results | ✅ Complete |

## IMPORTANT: Week 3 Docker Services Restart

**NEW USERS OR INTEGRATION CONFLICTS**: Week 3 introduces OpenSearch integration that requires fresh container state. Use this clean restart approach:

### Fresh Start (Recommended for Week 3)
```bash
# Complete clean slate - removes all data but ensures correct OpenSearch state
docker compose down -v

# Build fresh containers with latest code
docker compose up --build -d
```

**When to use this:**
- First time running Week 3 
- OpenSearch connection issues
- Index conflicts or mapping errors
- Want to start with clean OpenSearch state

**Note**: This destroys existing data but ensures you have the correct Week 3 configuration with proper OpenSearch integration.
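After a fresh start, OpenSearch can take a little while to accept requests. A minimal polling sketch, assuming the default `http://localhost:9200` port mapping listed in this notebook; the helper names are hypothetical and use only the standard library.

```python
import json
import time
import urllib.request


def fetch_cluster_status(url="http://localhost:9200/_cluster/health"):
    """Return the cluster status string, or None if the service is not reachable yet."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return json.load(response).get("status")
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return None


def wait_until_ready(fetch=fetch_cluster_status, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll until the cluster reports green or yellow; return True on success."""
    for attempt in range(attempts):
        if fetch() in ("green", "yellow"):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Call `wait_until_ready()` once the containers are up; with the defaults it gives up after roughly a minute.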

---

## Prerequisites Check

**Before starting:**
1. Week 1 infrastructure completed
2. Week 2 arXiv integration working
3. UV environment activated
4. Docker Desktop running
5. Some papers already in PostgreSQL from Week 2

**Why fresh containers?** Week 3 includes OpenSearch integration that requires proper cluster initialization and may conflict with existing index states.

**Service Access Points:**
- **FastAPI**: http://localhost:8000/docs (API documentation)
- **PostgreSQL**: via API or `docker exec -it rag-postgres psql -U rag_user -d rag_db`
- **OpenSearch**: http://localhost:9200/_cluster/health
- **Ollama**: http://localhost:11434 (LLM service)
- **Airflow**: http://localhost:8080 (Username: `admin`, Password: `admin`)

## Environment Setup

In [3]:
# Environment Setup and Path Configuration
import sys
from pathlib import Path
import json
import requests

print(f"Python Version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"Environment: {sys.executable}")

# Find project root and add to Python path
current_dir = Path.cwd()
if current_dir.name == "week3" and current_dir.parent.name == "notebooks":
    project_root = current_dir.parent.parent
elif (current_dir / "compose.yml").exists():
    project_root = current_dir
else:
    project_root = None

if project_root and (project_root / "compose.yml").exists():
    print(f"Project root: {project_root}")
    sys.path.insert(0, str(project_root))
else:
    print("Missing compose.yml - check directory")
    exit()

Python Version: 3.12.12
Environment: d:\Projects\Agentic_RAG\arxiv-paper-curator\.venv\Scripts\python.exe
Project root: d:\Projects\Agentic_RAG\arxiv-paper-curator


## 1. Infrastructure Verification

In [4]:
# Service Health Verification
print("WEEK 3 PREREQUISITE CHECK")
print("=" * 50)

services_to_test = {
    "FastAPI": "http://localhost:8000/api/v1/health",
    "PostgreSQL (via API)": "http://localhost:8000/api/v1/health", 
    "OpenSearch": "http://localhost:9200/_cluster/health",
    "Airflow": "http://localhost:8080/health"  
}

all_healthy = True

for service_name, url in services_to_test.items():
    try:
        response = requests.get(url, timeout=5)
        if response.status_code == 200:
            print(f"✓ {service_name}: Healthy")
        else:
            print(f"✗ {service_name}: HTTP {response.status_code}")
            all_healthy = False
    except requests.exceptions.ConnectionError:
        print(f"✗ {service_name}: Not accessible")
        all_healthy = False
    except Exception as e:
        print(f"✗ {service_name}: {type(e).__name__}")
        all_healthy = False

print()
if all_healthy:
    print("All services healthy! Ready for Week 3 OpenSearch integration.")
else:
    print("Some services need attention. Please run: docker compose up --build")

WEEK 3 PREREQUISITE CHECK
✓ FastAPI: Healthy
✓ PostgreSQL (via API): Healthy
✓ OpenSearch: Healthy
✓ Airflow: Healthy

All services healthy! Ready for Week 3 OpenSearch integration.


## 2. OpenSearch Client Setup

In [5]:
# OpenSearch Client Setup
from src.services.opensearch.factory import make_opensearch_client
from opensearchpy import OpenSearch

print("OPENSEARCH CLIENT SETUP")
print("=" * 40)

# Create OpenSearch client using factory pattern
opensearch_client = make_opensearch_client()

# Override for notebook execution (localhost instead of container hostname)
opensearch_client.host = "http://localhost:9200"
opensearch_client.client = OpenSearch(
    hosts=["http://localhost:9200"],
    http_compress=True,
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
)

print(f"Client configured with host: {opensearch_client.host}")
print(f"Index name: {opensearch_client.index_name}")

# Test health check
is_healthy = opensearch_client.health_check()
if is_healthy:
    print("✓ OpenSearch health check: PASSED")
else:
    print("✗ OpenSearch health check: FAILED")

OPENSEARCH CLIENT SETUP
Client configured with host: http://localhost:9200
Index name: arxiv-papers
✓ OpenSearch health check: PASSED


In [7]:
# Test OpenSearch Endpoints Directly
print("TESTING OPENSEARCH ENDPOINTS")
print("=" * 40)

# Test cluster health
try:
    response = requests.get("http://localhost:9200/_cluster/health", timeout=5)
    print(f"✓ Cluster Health Response (HTTP {response.status_code}):")
    print(json.dumps(response.json(), indent=2))
except Exception as e:
    print(f"✗ Error: {e}")

print("\n" + "-" * 40)

# Test cluster info (root endpoint)
try:
    response = requests.get("http://localhost:9200/", timeout=5)
    print(f"\n✓ Root Endpoint Response (HTTP {response.status_code}):")
    data = response.json()
    print(f"  Cluster: {data.get('cluster_name', 'N/A')}")
    print(f"  Version: {data.get('version', {}).get('number', 'N/A')}")
except Exception as e:
    print(f"✗ Error: {e}")

print("\n" + "-" * 40)
print("\n💡 NOTE: If these work but your browser hangs:")
print("   • OpenSearch is working fine - it's a browser issue")
print("   • Try: Clear browser cache, use incognito mode")
print("   • Or: Use curl in terminal instead")
print("   • The Python client is the proper way to interact anyway!")


TESTING OPENSEARCH ENDPOINTS
✓ Cluster Health Response (HTTP 200):
{
  "cluster_name": "docker-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "discovered_master": true,
  "discovered_cluster_manager": true,
  "active_primary_shards": 6,
  "active_shards": 6,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 1,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 85.71428571428571
}

----------------------------------------

✓ Root Endpoint Response (HTTP 200):
  Cluster: docker-cluster
  Version: 2.19.0

----------------------------------------

💡 NOTE: If these work but your browser hangs:
   • OpenSearch is working fine - it's a browser issue
   • Try: Clear browser cache, use incognito mode
   • Or: Use curl in terminal instead
   • The Python client is the proper way to interact anyway!
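The `"status": "yellow"` in the cluster health output above is normal for a single-node setup: all primary shards are assigned, but a replica cannot be placed on the same node as its primary, so it stays unassigned. A small sketch for reading the response; `summarize_health` is a hypothetical helper, not part of the project client.

```python
def summarize_health(health):
    """Turn a _cluster/health response into a one-line, human-readable summary."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    nodes = health.get("number_of_nodes", 0)

    if status == "green":
        return "green: all primary and replica shards are assigned"
    if status == "yellow":
        note = "expected on a single node" if nodes == 1 else "check replica allocation"
        return f"yellow: all primaries assigned, {unassigned} replica shard(s) unassigned ({note})"
    if status == "red":
        return "red: at least one primary shard is unassigned; searches may miss data"
    return f"unknown status: {status!r}"


# Values copied from the cluster health output above.
print(summarize_health({"status": "yellow", "number_of_nodes": 1, "unassigned_shards": 1}))
```

For local development, yellow is fine to work with; red is the status that needs attention.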

## Index Configuration

In [8]:
# Display Index Configuration
from src.services.opensearch.index_config import ARXIV_PAPERS_INDEX, ARXIV_PAPERS_MAPPING

print("INDEX CONFIGURATION")
print("=" * 40)
print(f"Index Name: {ARXIV_PAPERS_INDEX}")
print(f"\nKey Features:")
print("• Custom text analyzers for better search")
print("• Multi-field mapping (text + keyword)")
print("• 10 specialized fields for papers")
print("\nField Types:")

properties = ARXIV_PAPERS_MAPPING["mappings"]["properties"]
for field_name, config in properties.items():
    field_type = config.get("type")
    analyzer = config.get("analyzer", "")
    if analyzer:
        print(f"  • {field_name}: {field_type} [{analyzer}]")
    else:
        print(f"  • {field_name}: {field_type}")

INDEX CONFIGURATION
Index Name: arxiv-papers

Key Features:
• Custom text analyzers for better search
• Multi-field mapping (text + keyword)
• 10 specialized fields for papers

Field Types:
  • arxiv_id: keyword
  • title: text [text_analyzer]
  • authors: text [standard_analyzer]
  • abstract: text [text_analyzer]
  • categories: keyword
  • raw_text: text [text_analyzer]
  • pdf_url: keyword
  • published_date: date
  • created_at: date
  • updated_at: date
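The listing above shows field types but not what a custom analyzer or a multi-field mapping looks like in the index body. The sketch below is a hypothetical, trimmed-down mapping in the same spirit. The analyzer filters and the `title.keyword` sub-field are assumptions for illustration; the project's actual `ARXIV_PAPERS_MAPPING` may be defined differently.

```python
# Hypothetical mapping for illustration; the project's ARXIV_PAPERS_MAPPING may differ.
example_mapping = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Custom analyzer: tokenize, lowercase, drop stop words, stem.
                "text_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "snowball"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # keyword = exact match only, no analysis.
            "arxiv_id": {"type": "keyword"},
            # text + keyword sub-field: full-text search on `title`,
            # exact match / sorting / aggregations on `title.keyword`.
            "title": {
                "type": "text",
                "analyzer": "text_analyzer",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            },
            "categories": {"type": "keyword"},
            "published_date": {"type": "date"},
        }
    },
}


def multi_fields(mapping):
    """Return 'field.subfield' names declared through multi-field mappings."""
    names = []
    for field, config in mapping["mappings"]["properties"].items():
        for subfield in config.get("fields", {}):
            names.append(f"{field}.{subfield}")
    return names


print(multi_fields(example_mapping))
```

The split between `text` and `keyword` matters later: `match` queries work on analyzed text, while `terms` filters and sorting need exact values.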


### Create Index

In [9]:
# Create Index if it doesn't exist
print("INDEX CREATION")
print("=" * 40)

try:
    # Check if index already exists
    index_exists = opensearch_client.client.indices.exists(index=opensearch_client.index_name)
    
    if index_exists:
        print(f"✓ Index '{opensearch_client.index_name}' already exists")
        
        # Get current index statistics
        stats = opensearch_client.get_index_stats()
        if stats and 'error' not in stats:
            print(f"\nCurrent Statistics:")
            print(f"   Documents: {stats.get('document_count', 0)}")
            print(f"   Size: {stats.get('size_in_bytes', 0):,} bytes")
    else:
        print(f"Creating new index: {opensearch_client.index_name}")
        
        # Create the index with our custom mapping
        success = opensearch_client.create_index()
        
        if success:
            print(f"✓ Index created successfully!")
        else:
            print(f"✗ Index creation failed")
            
except Exception as e:
    print(f"✗ Error with index management: {e}")

INDEX CREATION
✓ Index 'arxiv-papers' already exists

Current Statistics:
   Documents: 0
   Size: 208 bytes


## 3. Data Pipeline - Run Airflow DAG

The **arxiv_paper_ingestion** DAG automatically:
1. Fetches papers from arXiv API
2. Stores papers in PostgreSQL
3. **Indexes papers into OpenSearch**
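The indexing step can be pictured as turning database rows into bulk actions. This is a rough sketch, not the DAG's actual code: `to_bulk_actions` and the sample rows are hypothetical, and the action format shown is the one accepted by `opensearchpy.helpers.bulk`.

```python
def to_bulk_actions(papers, index_name="arxiv-papers"):
    """Yield one bulk 'index' action per paper, keyed by arXiv ID.

    Using the arXiv ID as `_id` makes re-runs overwrite documents instead of
    duplicating them.
    """
    for paper in papers:
        yield {
            "_op_type": "index",
            "_index": index_name,
            "_id": paper["arxiv_id"],
            "_source": paper,
        }


# Hypothetical rows, shaped like the fields in the index mapping.
rows = [
    {"arxiv_id": "2601.00417v1", "title": "Deep Delta Learning", "categories": ["cs.LG"]},
    {"arxiv_id": "2601.00455v1", "title": "Deep Networks Learn Deep Hierarchical Models", "categories": ["cs.LG"]},
]
actions = list(to_bulk_actions(rows))
print(len(actions), "actions prepared")

# With a live client, the actions could be sent in one request, for example:
#   from opensearchpy import helpers
#   helpers.bulk(opensearch_client.client, actions)
```

Sending documents in bulk avoids one HTTP round trip per paper, which matters once the index grows past a handful of documents.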

### Instructions:

**Before proceeding, run the Airflow DAG:**

1. Open Airflow UI: http://localhost:8080
2. Login: username `admin`, password `admin`
3. Find **`arxiv_paper_ingestion`** DAG
4. Click the DAG name to open it
5. Click **"Trigger DAG"** button (▶️ play icon)
6. Wait ~10 minutes for completion
7. Check that all tasks turn green

Then run the cell below to verify:

In [11]:
# Verify Data Pipeline Results
print("VERIFYING DATA PIPELINE")
print("=" * 40)

stats = opensearch_client.get_index_stats()

if stats and 'error' not in stats:
    doc_count = stats.get('document_count', 0)
    
    if doc_count > 0:
        print(f"✓ Success! Found {doc_count} documents in OpenSearch")
        
        # Show sample papers
        sample = opensearch_client.search_papers("*", size=3)
        if sample.get('hits'):
            print(f"\nSample papers:")
            for i, paper in enumerate(sample['hits'], 1):
                title = paper.get('title', 'Unknown')[:60]
                print(f"  {i}. {title}...")
    else:
        print("⚠️  No documents in OpenSearch yet")
        print("\nPlease run the Airflow DAG first (see instructions above)")
else:
    print("✗ Could not retrieve index stats")

VERIFYING DATA PIPELINE
✓ Success! Found 15 documents in OpenSearch


## 4. Simple BM25 Search

Let's start with a simple search to demonstrate BM25 scoring:

In [12]:
# Simple BM25 Search
print("SIMPLE BM25 SEARCH")
print("=" * 40)

# Change this to any word from your papers
search_term = "learning"  # Try different terms!

print(f"Searching for: '{search_term}'\n")

results = opensearch_client.search_papers(
    query=search_term,
    size=5
)

if results.get('hits'):
    print(f"Found {results.get('total', 0)} total matches\n")
    
    for i, paper in enumerate(results['hits'], 1):
        print(f"{i}. {paper.get('title', 'Unknown')[:70]}...")
        print(f"   Score: {paper.get('score', 0):.2f}")
        print(f"   arXiv ID: {paper.get('arxiv_id', 'N/A')}\n")
else:
    print("No results found. Try searching for:")
    print("  • 'neural', 'model', 'algorithm'")
    print("  • Use '*' to see all papers")

SIMPLE BM25 SEARCH
Searching for: 'learning'

Found 5 total matches

1. Deep Delta Learning...
   Score: 5.17
   arXiv ID: 2601.00417v1

2. Deep Networks Learn Deep Hierarchical Models...
   Score: 4.32
   arXiv ID: 2601.00455v1

3. Neural Chains and Discrete Dynamical Systems...
   Score: 3.60
   arXiv ID: 2601.00473v1

4. E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for ...
   Score: 3.39
   arXiv ID: 2601.00423v1

5. Adaptive Causal Coordination Detection for Social Media: A Memory-Guid...
   Score: 3.25
   arXiv ID: 2601.00400v1



## 5. Advanced OpenSearch Queries

Now let's explore different query types using the OpenSearch Python client directly. This shows the power of BM25 without needing vectors!

### 5.1 Match Query

The `match` query is the standard query for full-text search on a single field:

In [13]:
# Match Query - Search in title field
print("MATCH QUERY - Single Field Search")
print("=" * 40)

query = {
    "query": {
        "match": {
            "title": "machine learning"
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    print(f"Title: {hit['_source']['title'][:70]}...")

MATCH QUERY - Single Field Search
Found 4 results

Title: Deep Delta Learning...
Title: Deep Networks Learn Deep Hierarchical Models...
Title: E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for ...
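Notice that the results above include titles without the word "machine". That is because `match` combines the analyzed query terms with OR by default, so a document matching any one term qualifies. A minimal sketch of requiring every term instead; `title_match_query` is a hypothetical helper that only builds the request body.

```python
def title_match_query(text, require_all_terms=False, size=3):
    """Build a `match` query on the title field.

    With operator "or" (the default) any query term may match;
    with "and" every term must appear in the title.
    """
    operator = "and" if require_all_terms else "or"
    return {
        "query": {"match": {"title": {"query": text, "operator": operator}}},
        "size": size,
    }


strict_query = title_match_query("machine learning", require_all_terms=True)
print(strict_query)
```

Passing `strict_query` as `body` to `opensearch_client.client.search(...)` should return fewer, more precise hits than the default OR behaviour.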


### 5.2 Multi-Match Query

Search across multiple fields simultaneously:

In [14]:
# Multi-Match Query - Search across multiple fields
print("MULTI-MATCH QUERY - Search Multiple Fields")
print("=" * 40)

query = {
    "query": {
        "multi_match": {
            "query": "AI Agents",
            "fields": ["title^2", "abstract", "authors"],  # ^2 boosts title field
            "type": "best_fields"
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    print(f"Title: {hit['_source']['title'][:70]}...")
    print(f"Score: {hit['_score']:.2f}")
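    # `authors` is stored as a single string, so slicing it directly yields
    # characters (e.g. "S, a"). Split it first; comma separation is an assumption.
    authors = hit['_source']['authors']
    if isinstance(authors, str):
        authors = [name.strip() for name in authors.split(',')]
    print(f"Authors: {', '.join(authors[:2])}...\n")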

MULTI-MATCH QUERY - Search Multiple Fields
Found 6 results

Title: Progressive Ideation using an Agentic AI Framework for Human-AI Co-Cre...
Score: 9.02
Authors: S, a...

Title: Multi-Agent Coordinated Rename Refactoring...
Score: 3.71
Authors: A, b...

Title: MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Ob...
Score: 3.11
Authors: T, i...
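How does `best_fields` turn several per-field scores into one? It takes the highest boosted field score and, if a `tie_breaker` is set, adds a fraction of the remaining field scores. The sketch below reproduces only that arithmetic with made-up per-field scores; it is not how OpenSearch computes the BM25 values themselves.

```python
def best_fields_score(field_scores, boosts=None, tie_breaker=0.0):
    """Combine per-field scores the way a best_fields multi_match does.

    field_scores: mapping of field name to its raw score for one document.
    boosts: mapping of field name to boost factor (e.g. {"title": 2.0} for title^2).
    """
    boosts = boosts or {}
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    # Other fields contribute only through the tie breaker.
    return best + tie_breaker * (sum(boosted) - best)


# Hypothetical per-field scores for a single paper.
paper_scores = {"title": 2.0, "abstract": 3.0, "authors": 0.0}
print(best_fields_score(paper_scores, boosts={"title": 2.0}))
print(best_fields_score(paper_scores, boosts={"title": 2.0}, tie_breaker=0.5))
```

With the `^2` boost the title wins even though the abstract scored higher on its own, which is the effect the `title^2` setting above is after.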



### 5.3 Boosting Query

Boost certain results while demoting others:

In [15]:
# Boosting Query - Promote and demote results
print("BOOSTING QUERY - Promote/Demote Results")
print("=" * 40)

query = {
    "query": {
        "boosting": {
            "positive": {
                "match": {
                    "abstract": "deep learning"
                }
            },
            "negative": {
                "match": {
                    "abstract": "multimodal"
                }
            },
            "negative_boost": 0.1  # Reduce score of negative matches
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Query: Boost 'deep learning', demote 'multimodal' papers\n")
print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    abstract_snippet = hit['_source']['abstract'][:100]
    print(f"Title: {title}...")
    print(f"Score: {hit['_score']:.2f}")
    print(f"Abstract: {abstract_snippet}...\n")

BOOSTING QUERY - Promote/Demote Results
Query: Boost 'deep learning', demote 'multimodal' papers

Found 5 results

Title: Deep Networks Learn Deep Hierarchical Models...
Score: 4.85
Abstract: We consider supervised learning with $n$ labels and show that layerwise SGD on residual networks can...

Title: Deep Delta Learning...
Score: 3.66
Abstract: The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connecti...

Title: Neural Chains and Discrete Dynamical Systems...
Score: 1.80
Abstract: We inspect the analogy between machine-learning (ML) applications based on the transformer architect...
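The effect of `negative_boost` is simple arithmetic: a document that matches the positive query keeps its score, unless it also matches the negative query, in which case the score is multiplied by `negative_boost`. Documents are demoted, not removed. A minimal sketch with made-up scores; the helper names are hypothetical.

```python
def boosting_score(positive_score, matches_negative, negative_boost=0.1):
    """Final score of a document under a boosting query."""
    return positive_score * negative_boost if matches_negative else positive_score


def rerank(candidates, negative_boost=0.1):
    """Sort (title, positive_score, matches_negative) tuples by final score."""
    scored = [
        (title, boosting_score(score, is_negative, negative_boost))
        for title, score, is_negative in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: paper B scores higher but mentions the demoted term.
candidates = [("Paper A", 4.0, False), ("Paper B", 5.0, True)]
for title, score in rerank(candidates):
    print(f"{title}: {score:.2f}")
```

Use `must_not` inside a `bool` query instead when you want matching documents excluded entirely.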



### 5.4 Filter Query

Filter results by specific criteria (doesn't affect scoring):

In [16]:
# Filter Query - Filter by categories
print("FILTER QUERY - Category Filtering")
print("=" * 40)

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "abstract": "neural"
                    }
                }
            ],
            "filter": [
                {
                    "terms": {
                        "categories": ["cs.AI"]
                    }
                }
            ]
        }
    },
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    categories = ', '.join(hit['_source']['categories'])
    print(f"Title: {title}...")
    print(f"Categories: {categories}")
    print(f"Score: {hit['_score']:.2f}\n")

FILTER QUERY - Category Filtering
Found 1 results

Title: Neural Chains and Discrete Dynamical Systems...
Categories: cs.LG, cs.AI
Score: 3.70
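Filter clauses answer yes or no and contribute nothing to `_score`, so they are a good fit for exact criteria such as categories or date ranges. Here is a small sketch of a builder that adds filters only when they are requested; `filtered_search` is a hypothetical helper and not part of the project's query builder.

```python
def filtered_search(text, categories=None, published_from=None, size=3):
    """Build a bool query: scored match on abstract plus optional unscored filters."""
    filters = []
    if categories:
        # `terms` matches documents having ANY of the listed categories.
        filters.append({"terms": {"categories": list(categories)}})
    if published_from:
        # `gte` means on or after the given date.
        filters.append({"range": {"published_date": {"gte": published_from}}})
    return {
        "query": {
            "bool": {
                "must": [{"match": {"abstract": text}}],
                "filter": filters,
            }
        },
        "size": size,
    }


body = filtered_search("neural", categories=["cs.AI", "cs.LG"], published_from="2024-01-01")
print(body)
```

The resulting `body` can be passed to `opensearch_client.client.search(...)` exactly like the hand-written query above.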



### 5.5 Sorting Query

Sort results by different criteria:

In [17]:
# Sorting Query - Sort by publication date
print("SORTING QUERY - Latest Papers First")
print("=" * 40)

query = {
    "query": {
        "match_all": {}  # Get all papers
    },
    "sort": [
        {
            "published_date": {
                "order": "desc"  # Latest first
            }
        }
    ],
    "size": 5
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Query: All papers sorted by publication date (newest first)\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    pub_date = hit['_source']['published_date'][:10]
    print(f"Date: {pub_date} | {title}...")

SORTING QUERY - Latest Papers First
Query: All papers sorted by publication date (newest first)

Date: 2026-01-01 | MotionPhysics: Learnable Motion Distillation for Text-Guided Simulatio...
Date: 2026-01-01 | Multi-Agent Coordinated Rename Refactoring...
Date: 2026-01-01 | MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Ob...
Date: 2026-01-01 | Progressive Ideation using an Agentic AI Framework for Human-AI Co-Cre...
Date: 2026-01-01 | Neural Chains and Discrete Dynamical Systems...
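All five papers above share the same date, so their relative order is not determined by the sort. Two details are worth knowing here: a secondary sort key makes the order stable, and when you sort by a field, OpenSearch skips score computation unless `track_scores` is enabled. A minimal sketch, assuming `arxiv_id` is a `keyword` field as shown in the index configuration; the helper name is made up.

```python
def latest_papers_query(size=5, track_scores=False):
    """Newest papers first, with arxiv_id as a deterministic tie breaker."""
    return {
        "query": {"match_all": {}},
        "sort": [
            {"published_date": {"order": "desc"}},
            {"arxiv_id": {"order": "asc"}},  # breaks ties between equal dates
        ],
        # Without this, hits sorted by a field come back with a null _score.
        "track_scores": track_scores,
        "size": size,
    }


print(latest_papers_query(track_scores=True))
```

A stable order also matters for pagination, where ties could otherwise shuffle documents between pages.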


### 5.6 Combined Query

Combine multiple query types for complex searches:

In [18]:
# Combined Query - Complex search with multiple criteria
print("COMBINED QUERY - Complex Search")
print("=" * 40)

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "transformer",
                        "fields": ["title^3", "abstract"],
                        "type": "best_fields"
                    }
                }
            ],
            "filter": [
                {
                    "range": {
                        "published_date": {
                            "gte": "2024-01-01"
                        }
                    }
                }
            ],
            "should": [
                {
                    "match": {
                        "categories": "cs.AI"
                    }
                }
            ]
        }
    },
    "sort": [
        "_score",
        {"published_date": {"order": "desc"}}
    ],
    "size": 3
}

response = opensearch_client.client.search(
    index=opensearch_client.index_name,
    body=query
)

print(f"Complex Query:")
print(f"  • Must contain 'transformer' (title boosted 3x)")
print(f"  • Filter: published after 2024-01-01")
print(f"  • Prefer: cs.AI category")
print(f"  • Sort: by relevance, then date\n")

print(f"Found {response['hits']['total']['value']} results\n")

for hit in response['hits']['hits']:
    title = hit['_source']['title'][:70]
    pub_date = hit['_source']['published_date'][:10]
    score = hit['_score']
    categories = ', '.join(hit['_source']['categories'][:2])
    
    print(f"Title: {title}...")
    print(f"  Date: {pub_date} | Score: {score:.2f}")
    print(f"  Categories: {categories}\n")

COMBINED QUERY - Complex Search
Complex Query:
  • Must contain 'transformer' (title boosted 3x)
  • Filter: published after 2024-01-01
  • Prefer: cs.AI category
  • Sort: by relevance, then date

Found 5 results

Title: RMAAT: Astrocyte-Inspired Memory Compression and Replay for Efficient ...
  Date: 2026-01-01 | Score: 6.65
  Categories: cs.NE, cs.AI

Title: Deep Delta Learning...
  Date: 2026-01-01 | Score: 1.73
  Categories: cs.LG, cs.AI

Title: Language as Mathematical Structure: Examining Semantic Field Theory Ag...
  Date: 2026-01-01 | Score: 1.15
  Categories: cs.CL, cs.AI
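To see why one hit scores much higher than another, you can add `"explain": True` to the search body; each hit then carries an `_explanation` tree with `value`, `description`, and `details`. The sketch below formats such a tree. The sample explanation is invented and heavily simplified, so real output will be longer and worded differently.

```python
def format_explanation(node, depth=0):
    """Flatten an _explanation tree into indented 'value  description' lines."""
    lines = [f"{'  ' * depth}{node['value']:.2f}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines


# Invented, simplified explanation tree for illustration only.
sample_explanation = {
    "value": 3.0,
    "description": "max of:",
    "details": [
        {"value": 3.0, "description": "weight(title:transformer)", "details": []},
        {"value": 1.0, "description": "weight(abstract:transformer)", "details": []},
    ],
}

for line in format_explanation(sample_explanation):
    print(line)
```

With a live cluster you would call `format_explanation(hit["_explanation"])` for each hit after setting `query["explain"] = True`.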



## Summary

### What We Demonstrated

**BM25 Search is Powerful!** Without any vector embeddings, we can:

1. **Simple Search**: Basic keyword search with relevance scoring
2. **Match Queries**: Search specific fields
3. **Multi-Match**: Search across multiple fields with boosting
4. **Boosting**: Promote or demote certain results
5. **Filtering**: Apply filters without affecting scores
6. **Sorting**: Order results by date, score, or other fields
7. **Complex Queries**: Combine all techniques for sophisticated searches

### Key Takeaways

- **BM25 works great** for many search use cases
- **No vectors needed** for effective full-text search
- **Simple and fast** compared to embedding-based approaches
- **Filters and sorting** make searches precise and relevant
- **Field boosting** helps prioritize important content

### When to Use BM25 vs Vectors

**Use BM25 when:**
- Searching for specific keywords or phrases
- Need fast, simple implementation
- Have good text fields with clear terminology
- Want explainable search results

**Consider vectors when:**
- Need semantic similarity (concepts, not keywords)
- Dealing with synonyms and paraphrasing
- Cross-language search requirements
- Very short queries or documents

Remember: **You can also combine both** (hybrid search) for best results!
We will see this next week :)
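As a small preview, one common way to merge a keyword ranking with a vector ranking is Reciprocal Rank Fusion (RRF). This is a generic sketch of that technique with made-up document IDs and the conventional constant `k=60`; it is not necessarily the method next week's material uses.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) per list it appears in (rank starts at 1);
    documents ranked well by several lists rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a BM25 search and a vector search.
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

RRF only needs ranks, not raw scores, which is convenient because BM25 scores and vector similarities live on different scales.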