# Reciprocal Rank Fusion (RRF) Demo with MongoDB Atlas

This notebook demonstrates how to implement **Reciprocal Rank Fusion (RRF)** to combine results from MongoDB Atlas **Full Text Search** and **Vector Search**.

RRF combines multiple result sets whose scores are on different scales by using each document's **rank** position instead of its raw score. This is particularly useful when combining keyword search scores (BM25) with vector similarity scores (cosine), because their ranges and distributions differ widely.
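
To make the scale mismatch concrete, the sketch below uses invented scores (not taken from the `mobile_reviews` data) for three documents and converts each list to ranks. The helper name `to_ranks` is hypothetical and exists only for this illustration.

```python
def to_ranks(scores):
    """Map each document ID to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc_id: position for position, doc_id in enumerate(ordered, start=1)}

# Invented, illustrative values: BM25 scores are unbounded,
# while cosine-based scores are bounded and often tightly clustered.
keyword_scores = {"A": 12.4, "B": 9.1, "C": 3.3}
vector_scores = {"A": 0.88, "B": 0.91, "C": 0.90}

keyword_ranks = to_ranks(keyword_scores)
vector_ranks = to_ranks(vector_scores)

print("keyword ranks:", keyword_ranks)
print("vector ranks: ", vector_ranks)
```

Adding the raw scores directly would let the keyword values dominate, whereas the two rank dictionaries are on the same scale and can be combined fairly.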

## Formula
$$ RRFscore(d) = \sum_{r \in R} \frac{1}{k + rank(d, r)} $$
Where:
*   $d$ is a document.
*   $R$ is the set of rankers (e.g., Vector Search results, Keyword Search results).
*   $rank(d, r)$ is the rank of document $d$ in result set $r$ (1-based).
*   $k$ is a constant (typically 60) that dampens the advantage of top-ranked documents: the larger $k$ is, the smaller the gap between the contributions of adjacent ranks.
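
As a small worked example of the formula, the sketch below evaluates the per-ranker term $1 / (k + rank)$ for a document ranked #1 and #3 by two rankers, and for a document ranked #1 by only one. The function name `rrf_contribution` is an assumption made for this illustration.

```python
def rrf_contribution(rank, k=60):
    """Score contribution of one ranker for a document at a 1-based rank."""
    if rank < 1:
        raise ValueError("rank must be 1-based (>= 1)")
    return 1 / (k + rank)

# A document ranked #1 by one ranker and #3 by the other: 1/61 + 1/63.
both = rrf_contribution(1) + rrf_contribution(3)

# A document ranked #1 by a single ranker only: 1/61.
single = rrf_contribution(1)

print(f"both lists:  {both:.5f}")
print(f"single list: {single:.5f}")
```

With $k = 60$ the first document scores roughly 0.0323 and the second roughly 0.0164, so a document found by both rankers scores about twice as much as one that tops a single list.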

## Prerequisites
1.  **Data Loaded**: Ensure the `mobile_reviews` collection is populated (see `b.mongodb_atlas_setup.ipynb`).
2.  **Indexes**:
    *   `vector_index` (Vector Search)
    *   `default` (Full Text Search; see `c.rsf_demo.ipynb` to create it if missing).
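
One possible way to check the index prerequisite is sketched below. The helper `missing_indexes` is a hypothetical name; the commented lines assume PyMongo 4.5 or later (which provides `Collection.list_search_indexes()`) and the `collection` object created in the connection cell of this notebook.

```python
def missing_indexes(existing_names, required=("vector_index", "default")):
    """Return the required index names that are absent from existing_names."""
    existing = set(existing_names)
    return [name for name in required if name not in existing]

# Against a live cluster (assumption: PyMongo 4.5+ and `collection` already defined):
#   names = [idx["name"] for idx in collection.list_search_indexes()]
#   print(missing_indexes(names))

# Offline illustration with a hypothetical list of index names:
print(missing_indexes(["vector_index"]))
```

An empty list means both indexes are present; any name it returns still needs to be created before running the searches.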

In [None]:
%pip install pymongo ollama pandas

In [None]:
import pymongo
import ollama
import pandas as pd

# --- CONFIGURATION --- 
# Replace with your actual connection string
MONGODB_URI = "<connection_string>"
DB_NAME = "tech_on_the_rock"
COLLECTION_NAME = "mobile_reviews"

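# Connect (MongoClient connects lazily, so ping the server to verify the connection)
client = pymongo.MongoClient(MONGODB_URI)
client.admin.command("ping")
db = client[DB_NAME]
collection = db[COLLECTION_NAME]

print("Connected to MongoDB Atlas.")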

## 1. Define Search Functions

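We reuse the vector and keyword search functions. Each returns a list of documents ordered from most to least relevant, along with a raw `score` field that RRF keeps only for display.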

In [None]:
def get_query_embedding(query):
    """Generates vector embedding for the query using Ollama."""
    # Ensure you have 'qwen3-embedding' or the model you used for indexing available in Ollama
    response = ollama.embeddings(model='qwen3-embedding', prompt=query)
    return response['embedding']

def search_vector(query, limit=20):
    """Performs Semantic Search using $vectorSearch."""
    query_vector = get_query_embedding(query)
    
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "review_embedding",
                "queryVector": query_vector,
                "numCandidates": limit * 10,
                "limit": limit
            }
        },
        {
            "$project": {
                "_id": 0,
                "review_id": 1,
                "product_name": 1,
                "review_text": 1,
                "score": { "$meta": "vectorSearchScore" }
            }
        }
    ]
    return list(collection.aggregate(pipeline))

def search_keyword(query, limit=20):
    """Performs Keyword Search using $search."""
    pipeline = [
        {
            "$search": {
                "index": "default",
                "text": {
                    "query": query,
                    "path": ["review_text", "product_name", "tags"]
                }
            }
        },
        {
            "$limit": limit
        },
        {
            "$project": {
                "_id": 0,
                "review_id": 1,
                "product_name": 1,
                "review_text": 1,
                "score": { "$meta": "searchScore" }
            }
        }
    ]
    return list(collection.aggregate(pipeline))

## 2. Implement Reciprocal Rank Fusion (RRF)

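The function below implements the core RRF algorithm: for every document it accumulates $1 / (k + rank)$ across all rankers, then sorts the documents by the total.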

In [None]:
def reciprocal_rank_fusion(results_dict, k=60):
    """
    Combines multiple lists of results using RRF.
    
    Args:
        results_dict: Dictionary where key is the name of the ranker (e.g., 'vector', 'keyword') 
                      and value is the list of result documents.
        k: Smoothing constant (default 60).
        
    Returns:
        List of documents sorted by RRF score.
    """
    fused_scores = {}
    
    for ranker_name, results in results_dict.items():
        for rank, doc in enumerate(results):
            # enumerate() is 0-based, but the RRF formula uses 1-based ranks,
            # so each contribution is 1 / (k + rank + 1).
            
            doc_id = doc['review_id']
            
            if doc_id not in fused_scores:
                fused_scores[doc_id] = {
                    "doc": doc,
                    "rrf_score": 0,
                    "details": {}
                }
            
            # Calculate score contribution from this ranker
            score = 1 / (k + rank + 1)
            fused_scores[doc_id]['rrf_score'] += score
            
            # Store details for display
            fused_scores[doc_id]['details'][ranker_name] = {
                "rank": rank + 1,
                "raw_score": doc['score']
            }

    # Convert to list
    final_results = []
    for doc_id, data in fused_scores.items():
        item = data['doc'].copy()
        item.pop('score', None)  # Drop the single raw score; per-ranker scores are kept in 'ranks'
        item['rrf_score'] = data['rrf_score']
        item['ranks'] = data['details']
        final_results.append(item)
    
    # Sort by RRF score descending
    return sorted(final_results, key=lambda x: x['rrf_score'], reverse=True)

## 3. Run the Demo

We search for **"night photography"** again so that the ranking can be compared with the RSF results from `c.rsf_demo.ipynb`.

In [None]:
QUERY = "night photography"

print(f"--- Searching for: '{QUERY}' ---")

# 1. Fetch Results
print("Running Vector Search...")
vec_results = search_vector(QUERY, limit=10)

print("Running Keyword Search...")
kw_results = search_keyword(QUERY, limit=10)

# 2. Apply RRF
print("Applying Reciprocal Rank Fusion (k=60)...")
rrf_results = reciprocal_rank_fusion({
    "vector": vec_results,
    "keyword": kw_results
}, k=60)

# 3. Display Results
df = pd.DataFrame(rrf_results)

# Format the 'ranks' column for readability, e.g. "vector:#1, keyword:#3"
def format_ranks(ranks_dict):
    return ", ".join(f"{name}:#{info['rank']}" for name, info in ranks_dict.items())

df['rank_info'] = df['ranks'].apply(format_ranks)

columns = ['product_name', 'rrf_score', 'rank_info', 'review_text']
print(df[columns].head(10))

### Observation

*   **RRF Score**: Documents that appear in *both* result lists accumulate a contribution from each ranker, so they tend to score higher than documents found by only one method.
*   **Rank Info**: Shows the rank of the document in each individual search method (e.g., `vector:#1, keyword:#3`).
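
The first observation can be reproduced offline with a toy example. The sketch below is a minimal, ID-only variant of RRF (the name `fuse_ids` and the review IDs are hypothetical, not part of the notebook's data) in which `r2` is ranked second by both rankers while `r1` and `r4` each top a single list.

```python
from collections import defaultdict

def fuse_ids(ranked_lists, k=60):
    """Minimal RRF over lists of document IDs, each ordered best-first."""
    scores = defaultdict(float)
    for ids in ranked_lists.values():
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] += 1 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical IDs: r2 appears in both lists, r1 and r4 lead one list each.
toy = {
    "vector": ["r1", "r2", "r3"],
    "keyword": ["r4", "r2", "r5"],
}

fused = fuse_ids(toy)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Here `r2` collects $2/62$ and ends up first, ahead of `r1` and `r4` at $1/61$ each, even though neither ranker placed it at the top.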