# Lesson 4: Boosting Search Results
## Re-ranking Vector Search Results with Business Metrics

**What we'll learn:**
1. Why pure vector search isn't always enough
2. How to combine vector similarity with review scores & popularity
3. Building aggregation pipelines to boost result quality
4. Comparing rankings before and after boosting

**Run cells in order: 1 → 2 → 3 → ... → 17**

In [1]:
# Cell 1: Setup - Import libraries
import warnings
import os
import pandas as pd
import custom_utils
from datasets import load_dataset
from pydantic import BaseModel
from typing import Optional
from IPython.display import display, HTML

warnings.filterwarnings('ignore')
print("✅ All imports loaded!")

✅ All imports loaded!


In [2]:
# Cell 2: Load API keys from environment
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
MONGO_URI = os.environ.get("MONGO_URI")
HF_TOKEN = os.environ.get("HF_TOKEN")

print("✅ OpenAI API Key:", "Found" if OPENAI_API_KEY else "❌ Missing")
print("✅ MongoDB URI:", "Found" if MONGO_URI else "❌ Missing")
print("✅ HuggingFace Token:", "Found" if HF_TOKEN else "❌ Missing")

✅ OpenAI API Key: Found
✅ MongoDB URI: Found
✅ HuggingFace Token: Found


In [3]:
# Cell 3: Load Airbnb dataset (100 records)
print("🔄 Loading Airbnb dataset...")

dataset = load_dataset("MongoDB/airbnb_embeddings", streaming=True, split="train")
dataset = dataset.take(100)  # Take 100 records for this lesson
dataset_df = pd.DataFrame(dataset)

print(f"✅ Loaded {len(dataset_df)} records")
print(f"📊 Columns: {list(dataset_df.columns[:10])}...")  # Show first 10 columns

# Show a sample
sample = dataset_df.iloc[0]
print(f"\n🏠 Sample listing:")
print(f"  Name: {sample['name']}")
print(f"  Price: ${sample['price']}")
print(f"  Reviews: {sample['number_of_reviews']}")
print(f"  Accommodates: {sample['accommodates']} guests")

🔄 Loading Airbnb dataset...
✅ Loaded 100 records
📊 Columns: ['_id', 'listing_url', 'name', 'summary', 'space', 'description', 'neighborhood_overview', 'notes', 'transit', 'access']...

🏠 Sample listing:
  Name: Ribeira Charming Duplex
  Price: $80
  Reviews: 51
  Accommodates: 8 guests


In [4]:
# Cell 4: Process records with Pydantic validation
print("🔄 Processing and validating records...\n")

listings = custom_utils.process_records(dataset_df)

print(f"\n🎯 Ready to insert {len(listings)} validated listings into MongoDB")

🔄 Processing and validating records...

✅ Processed 100 listings successfully

🎯 Ready to insert 100 validated listings into MongoDB


In [5]:
# Cell 5: Connect to MongoDB Atlas
print("🔌 Connecting to MongoDB Atlas...\n")

db, collection = custom_utils.connect_to_database()

print(f"\n🎯 Connected successfully!")

🔌 Connecting to MongoDB Atlas...

✅ Connection to MongoDB successful
📋 Database: airbnb_dataset
📋 Collection: listings_reviews

🎯 Connected successfully!


In [7]:
# Cell 6: Insert data into MongoDB
print("💾 Preparing to insert data into MongoDB...\n")

# Clear existing data
print("🗑️  Clearing existing data...")
result = collection.delete_many({})
print(f"Deleted {result.deleted_count} existing records")

# Insert new data
print(f"\n📥 Inserting {len(listings)} records...")
collection.insert_many(listings)

print(f"✅ Data ingestion completed!")
print(f"📊 Collection now has {collection.count_documents({})} documents")

💾 Preparing to insert data into MongoDB...

🗑️  Clearing existing data...
Deleted 100 existing records

📥 Inserting 100 records...
✅ Data ingestion completed!
📊 Collection now has 100 documents


In [8]:
# Cell 7: Create vector search index WITH filterable fields
print("🔍 Creating vector search index with filterable fields...\n")

# This creates an index that supports:
# 1. Vector search on text_embeddings
# 2. Filtering on accommodates (number field)
# 3. Filtering on bedrooms (number field)

custom_utils.setup_vector_search_index_with_filter(collection=collection)

print("\n💡 This index allows us to:")
print("  - Find semantically similar listings (vector search)")
print("  - Filter by number fields before or during search")
print("  - Access review scores for boosting calculations")

🔍 Creating vector search index with filterable fields...

Creating index with filters...
✅ Index 'vector_index_with_filter' created successfully: vector_index_with_filter
💡 Wait a few minutes before conducting searches

💡 This index allows us to:
  - Find semantically similar listings (vector search)
  - Filter by number fields before or during search
  - Access review scores for boosting calculations


## Understanding Boosting

### The Problem with Pure Vector Search

Vector search finds **semantically similar** listings based on text content. However:

- A listing might match your query perfectly but have **terrible reviews**
- A listing might be described beautifully but be **rarely booked** (low trust signal)
- Two similar listings might have very different **quality scores**

**Example:**
```
Query: "cozy beachfront apartment"

Result 1: Vector similarity = 0.95, Review avg = 6.2/10, Reviews = 2
Result 2: Vector similarity = 0.88, Review avg = 9.8/10, Reviews = 157
```

Pure vector search ranks Result 1 first, but Result 2 is clearly the better choice!

### The Solution: Boosting

**Boosting** = Re-ranking vector search results with business metrics to improve result quality.

We:
1. Run vector search first (cast a wide net)
2. Calculate quality scores from review data
3. Combine the scores with a weighted formula
4. Re-sort results by the combined score

Vector similarity decides which listings make the candidate set; review data decides their final order. This gives us the **best of both worlds**: semantic relevance + quality signals.
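
The four steps above can be sketched in plain Python, without MongoDB. The two candidates reuse the numbers from the "cozy beachfront apartment" example; `rerank` and the dictionary keys are illustrative names, not part of `custom_utils`, and the weights are the ones this lesson introduces in the next section.

```python
# Candidates as vector search might return them, ordered by similarity.
# Values come from the "cozy beachfront apartment" example above.
candidates = [
    {"name": "Result 1", "similarity": 0.95, "review_avg": 6.2, "reviews": 2},
    {"name": "Result 2", "similarity": 0.88, "review_avg": 9.8, "reviews": 157},
]


def rerank(results, quality_weight=0.9, popularity_weight=0.1):
    """Score each candidate from its review data and sort best-first."""
    scored = [
        {
            **r,
            "combined": r["review_avg"] * quality_weight
            + r["reviews"] * popularity_weight,
        }
        for r in results
    ]
    return sorted(scored, key=lambda r: r["combined"], reverse=True)


by_similarity = [r["name"] for r in candidates]
by_combined = [r["name"] for r in rerank(candidates)]

print("Vector order: ", by_similarity)
print("Boosted order:", by_combined)
```

Note that `similarity` is not used inside `rerank`: in this scheme it only determines which listings are candidates at all.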

## The Boosting Formula

### Step 1: Calculate Average Review Score

Airbnb has 6 different review dimensions:
- Accuracy
- Cleanliness  
- Checkin
- Communication
- Location
- Value

We average these to get one overall quality score:

```
averageReviewScore = (accuracy + cleanliness + checkin + communication + location + value) / 6
```
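
As a rough Python equivalent of this formula, the sketch below averages the six sub-scores of one listing. The field names follow the `review_scores_*` keys used later in the aggregation stage; the function name and the choice to return `None` when a sub-score is missing are assumptions made for illustration.

```python
from typing import Optional

REVIEW_DIMENSIONS = [
    "accuracy", "cleanliness", "checkin", "communication", "location", "value",
]


def average_review_score(review_scores: dict) -> Optional[float]:
    """Mean of the six sub-scores, or None if any of them is missing."""
    values = [review_scores.get(f"review_scores_{d}") for d in REVIEW_DIMENSIONS]
    if any(v is None for v in values):
        return None  # assumption: an incomplete listing gets no average
    return sum(values) / len(values)


complete = {f"review_scores_{d}": 10 for d in REVIEW_DIMENSIONS}
complete["review_scores_value"] = 9  # five 10s and one 9
incomplete = {"review_scores_accuracy": 10}

print(average_review_score(complete))
print(average_review_score(incomplete))
```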

### Step 2: Combine Scores with Weights

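We create a `combinedScore` that blends:
- **90% weight** on review quality (is it good?)
- **10% weight** on review count (is it popular/trusted?)

Note that these weights are coefficients, not shares of the final score: `averageReviewScore` tops out at 10, while `number_of_reviews` has no upper bound, so for listings with many reviews the review count contributes most of `combinedScore`.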

```
combinedScore = (averageReviewScore × 0.9) + (number_of_reviews × 0.1)
```
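
To see how the two terms contribute in practice, here is a small sketch of the formula in Python. The three listings are hypothetical inputs chosen for illustration, and `combined_score` is not a function from this lesson's utilities.

```python
def combined_score(average_review_score, number_of_reviews):
    """Return (total, quality_term, popularity_term) for one listing."""
    quality_term = average_review_score * 0.9
    popularity_term = number_of_reviews * 0.1
    return quality_term + popularity_term, quality_term, popularity_term


# Hypothetical listings: (average review score, number of reviews)
examples = [(9.8, 5), (9.8, 100), (6.0, 200)]

for average, count in examples:
    total, quality_term, popularity_term = combined_score(average, count)
    share = popularity_term / total
    print(
        f"avg={average}, reviews={count}: combinedScore={total:.2f} "
        f"(review count contributes {share:.0%})"
    )
```

With these inputs, the 6.0-rated listing with 200 reviews ends up with a higher `combinedScore` than the 9.8-rated listing with 100 reviews, which is worth keeping in mind when choosing the coefficients.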

### Step 3: Re-sort Results

After vector search returns the top 20 matches, we re-sort them by `combinedScore` in descending order.

**High combinedScore** = Great reviews + Popular = Floats to the top! 🎯

## Example: How Boosting Changes Rankings

### Scenario 1: Perfect match, but terrible reviews
```
Listing A:
  Vector similarity: 0.95 (excellent match!)
  Average review: 5.2/10 (poor quality)
  Number of reviews: 3 (not trusted)
  
  combinedScore = (5.2 × 0.9) + (3 × 0.1) = 4.68 + 0.3 = 4.98
```

### Scenario 2: Good match, excellent reviews
```
Listing B:
  Vector similarity: 0.82 (good match)
  Average review: 9.7/10 (excellent quality!)
  Number of reviews: 143 (highly trusted)
  
  combinedScore = (9.7 × 0.9) + (143 × 0.1) = 8.73 + 14.3 = 23.03
```

### Result:
- **Without boosting**: Listing A ranks first (higher similarity)
- **With boosting**: Listing B ranks first (much higher combinedScore)

This is exactly what we want! We surface high-quality, trusted listings to users. 🎉

In [9]:
# Cell 11: Create review average calculation stage
print("📊 Building aggregation stage 1: Calculate average review score\n")

review_average_stage = {
    "$addFields": {  # Add new computed fields to each document
        "averageReviewScore": {
            "$divide": [  # Divide total by count
                {
                    "$add": [  # Add all 6 review dimensions
                        "$review_scores.review_scores_accuracy",
                        "$review_scores.review_scores_cleanliness",
                        "$review_scores.review_scores_checkin",
                        "$review_scores.review_scores_communication",
                        "$review_scores.review_scores_location",
                        "$review_scores.review_scores_value",
                    ]
                },
                6  # Divide by 6 to get average
            ]
        },
        # Also create a boost factor based on review count
        "reviewCountBoost": "$number_of_reviews"
    }
}

print("✅ Stage 1 created: Adds 'averageReviewScore' and 'reviewCountBoost' fields")
print("\n💡 What this does:")
print("  - Takes 6 review score fields from each listing")
print("  - Adds them together")
print("  - Divides by 6 to get the average")
print("  - Also captures number_of_reviews for popularity signal")

📊 Building aggregation stage 1: Calculate average review score

✅ Stage 1 created: Adds 'averageReviewScore' and 'reviewCountBoost' fields

💡 What this does:
  - Takes 6 review score fields from each listing
  - Adds them together
  - Divides by 6 to get the average
  - Also captures number_of_reviews for popularity signal
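
One behavior to be aware of: to my understanding, MongoDB arithmetic operators such as `$add` return `null` when any operand is null or missing, so a listing lacking a single sub-score would get no `averageReviewScore`. A possible variant, not used in this lesson, is the `$avg` operator, which accepts a list of expressions inside `$addFields` and skips non-numeric values. The name `review_average_stage_tolerant` is hypothetical.

```python
REVIEW_FIELDS = [
    "accuracy", "cleanliness", "checkin", "communication", "location", "value",
]

# Variant of Stage 1: $avg averages whichever sub-scores are present,
# instead of summing all six and dividing by a fixed 6.
review_average_stage_tolerant = {
    "$addFields": {
        "averageReviewScore": {
            "$avg": [f"$review_scores.review_scores_{name}" for name in REVIEW_FIELDS]
        },
        "reviewCountBoost": "$number_of_reviews",
    }
}

print(review_average_stage_tolerant["$addFields"]["averageReviewScore"]["$avg"])
```

The trade-off is that a listing with only one or two sub-scores would then be averaged over fewer values, which may or may not be what you want.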


In [10]:
# Cell 12: Create weighting stage to combine scores
print("⚖️  Building aggregation stage 2: Combine scores with weights\n")

weighting_stage = {
    "$addFields": {
        "combinedScore": {
            "$add": [  # Add two weighted components
                # Component 1: 90% weight on review quality
                {"$multiply": ["$averageReviewScore", 0.9]},
                
                # Component 2: 10% weight on review count (popularity)
                {"$multiply": ["$reviewCountBoost", 0.1]}
            ]
        }
    }
}

print("✅ Stage 2 created: Adds 'combinedScore' field")
print("\n💡 What this does:")
print("  - Takes averageReviewScore and multiplies by 0.9 (90% weight)")
print("  - Takes reviewCountBoost and multiplies by 0.1 (10% weight)")
print("  - Adds them together to create final combinedScore")
print("\n⚖️  Weight distribution:")
print("  - 90% = Quality (high reviews matter most)")
print("  - 10% = Popularity (many reviews = more trust)")

⚖️  Building aggregation stage 2: Combine scores with weights

✅ Stage 2 created: Adds 'combinedScore' field

💡 What this does:
  - Takes averageReviewScore and multiplies by 0.9 (90% weight)
  - Takes reviewCountBoost and multiplies by 0.1 (10% weight)
  - Adds them together to create final combinedScore

⚖️  Weight distribution:
  - 90% = Quality (high reviews matter most)
  - 10% = Popularity (many reviews = more trust)


In [11]:
# Cell 13: Create sorting stage
print("🔀 Building aggregation stage 3: Sort by combined score\n")

sorting_stage = {
    "$sort": {
        "combinedScore": -1  # -1 = descending order (highest scores first)
    }
}

print("✅ Stage 3 created: Sorts results by combinedScore")
print("\n💡 What this does:")
print("  - Re-orders all results by combinedScore")
print("  - -1 means descending (highest score first)")
print("  - This is where boosting happens: high-quality listings rise to top!")

🔀 Building aggregation stage 3: Sort by combined score

✅ Stage 3 created: Sorts results by combinedScore

💡 What this does:
  - Re-orders all results by combinedScore
  - -1 means descending (highest score first)
  - This is where boosting happens: high-quality listings rise to top!


In [12]:
# Cell 14: Combine all stages into pipeline
print("🔗 Combining all stages into aggregation pipeline\n")

additional_stages = [
    review_average_stage,  # Stage 1: Calculate averageReviewScore
    weighting_stage,       # Stage 2: Calculate combinedScore  
    sorting_stage          # Stage 3: Re-sort by combinedScore
]

print("✅ Pipeline created with 3 stages")
print("\n📋 Pipeline flow:")
print("  1. Vector search finds top 20 similar listings")
print("  2. Calculate average review score for each listing")
print("  3. Combine review scores with popularity into combinedScore")
print("  4. Re-sort the 20 results by combinedScore")
print("  5. Return re-ranked results to user")
print("\n🎯 Result: Best semantic matches with highest quality rise to the top!")

🔗 Combining all stages into aggregation pipeline

✅ Pipeline created with 3 stages

📋 Pipeline flow:
  1. Vector search finds top 20 similar listings
  2. Calculate average review score for each listing
  3. Combine review scores with popularity into combinedScore
  4. Re-sort the 20 results by combinedScore
  5. Return re-ranked results to user

🎯 Result: Best semantic matches with highest quality rise to the top!
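
`custom_utils.vector_search_with_filter` is not shown in this lesson, so the following is only a guess at how the boosting stages might be attached: a `$vectorSearch` stage first, then the extra stages in order. The `path` value follows the `text_embeddings` field mentioned in Cell 7; `build_pipeline`, `numCandidates=150`, and the placeholder query vector are assumptions.

```python
def build_pipeline(query_vector, extra_stages=None,
                   index="vector_index_with_filter", limit=20):
    """Assemble an aggregation pipeline: vector search, then boosting stages."""
    vector_search_stage = {
        "$vectorSearch": {
            "index": index,
            "path": "text_embeddings",    # field holding the embeddings
            "queryVector": query_vector,  # embedding of the user query
            "numCandidates": 150,         # assumed value, tune as needed
            "limit": limit,               # top matches passed to later stages
        }
    }
    return [vector_search_stage] + list(extra_stages or [])


# A stand-in for additional_stages, so the sketch runs on its own
demo_stages = [{"$sort": {"combinedScore": -1}}]
pipeline = build_pipeline([0.1, 0.2, 0.3], demo_stages)

print([next(iter(stage)) for stage in pipeline])
```

A pipeline built this way would typically be run with `collection.aggregate(pipeline)`.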


In [13]:
# Cell 15: Define SearchResultItem model for clean display
print("📋 Creating SearchResultItem model for displaying results\n")

class SearchResultItem(BaseModel):
    """Model for displaying search results with boosting scores"""
    name: str
    accommodates: Optional[int] = None
    address: custom_utils.Address
    price: int
    property_type: str
    # Boosting-specific fields
    averageReviewScore: Optional[float] = None
    number_of_reviews: Optional[int] = None
    combinedScore: Optional[float] = None

print("✅ SearchResultItem model created")
print("\n💡 This model includes:")
print("  - Basic listing info (name, price, type, location)")
print("  - Boosting scores (averageReviewScore, combinedScore)")
print("  - Review count (number_of_reviews)")
print("\n🎯 We'll display these scores so you can see boosting in action!")

📋 Creating SearchResultItem model for displaying results

✅ SearchResultItem model created

💡 This model includes:
  - Basic listing info (name, price, type, location)
  - Boosting scores (averageReviewScore, combinedScore)
  - Review count (number_of_reviews)

🎯 We'll display these scores so you can see boosting in action!


In [15]:
# Cell 16: Create query handler function with boosting
def handle_user_query(query, db, collection, stages=None, vector_index="vector_index_with_filter"):
    """
    Complete search pipeline with boosting:
    1. Run vector search to find similar listings
    2. Apply boosting stages (calculate scores, re-sort)
    3. Convert results to clean format
    4. Use GPT to generate recommendations
    5. Display results with visible scores
    """
    # Avoid a mutable default argument: a shared list would persist across calls
    stages = stages if stages is not None else []

    # Step 1: Run vector search WITH boosting stages
    get_knowledge = custom_utils.vector_search_with_filter(
        query,
        db,
        collection,
        stages,  # This is where our boosting pipeline goes!
        vector_index
    )
    
    # Check if we got results (return a single string, like the success path)
    if not get_knowledge:
        return "No results found."
    
    print("\n📋 Fields available in results:")
    print(list(get_knowledge[0].keys())[:15])  # Show first 15 fields
    
    # Step 2: Convert to SearchResultItem models
    search_results_models = [
        SearchResultItem(**result)
        for result in get_knowledge
    ]
    
    # Convert to DataFrame for display and GPT
    search_results_df = pd.DataFrame([item.dict() for item in search_results_models])
    
    # Step 3: Generate GPT recommendation
    print("\n🤖 Generating recommendation with GPT...\n")
    
    completion = custom_utils.openai.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": "You are an Airbnb listing recommendation system. Focus on listings with high review scores and many reviews."
            },
            {
                "role": "user",
                "content": f"Answer this user query: {query} with the following context:\n{search_results_df}"
            }
        ]
    )
    
    system_response = completion.choices[0].message.content
    
    # Step 4: Display results
    print("=" * 80)
    print("❓ USER QUESTION:")
    print(query)
    print("\n" + "=" * 80)
    print("🤖 RECOMMENDATION:")
    print(system_response)
    print("\n" + "=" * 80)
    print("📊 TOP 5 RESULTS WITH BOOSTING SCORES:")
    print("=" * 80)
    
    # Show top 5 with key metrics
    top_5 = search_results_df.head(5)
    
    for i, row in top_5.iterrows():
        print(f"\n{i+1}. {row['name']}")
        print(f"   💰 Price: ${row['price']}")
        print(f"   🏠 Type: {row['property_type']}")
        print(f"   📍 Location: {row['address']['market']}, {row['address']['country']}")
        # pd.notna handles None/NaN and still prints a legitimate score of 0
        print(f"   ⭐ Avg Review Score: {row['averageReviewScore']:.2f}/10" if pd.notna(row['averageReviewScore']) else "   ⭐ Avg Review Score: N/A")
        print(f"   📝 Number of Reviews: {row['number_of_reviews']}")
        print(f"   🎯 Combined Score: {row['combinedScore']:.2f}" if pd.notna(row['combinedScore']) else "   🎯 Combined Score: N/A")
    
    print("\n" + "=" * 80)
    print("📋 FULL RESULTS TABLE:")
    display(HTML(search_results_df.to_html()))
    
    return system_response

print("✅ Query handler with boosting ready!")

✅ Query handler with boosting ready!
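
To compare rankings before and after boosting, one option is to collect the listing names from two runs (one with no extra stages, one with `additional_stages`) and line them up. `handle_user_query` only returns the GPT text, so the sketch below assumes you have gathered the two name lists yourself; `rank_changes` and the listing names are hypothetical.

```python
def rank_changes(before, after):
    """Return (name, old_rank, new_rank) for each name in `after`; ranks are 1-based."""
    old_rank = {name: position for position, name in enumerate(before, start=1)}
    return [
        (name, old_rank.get(name), new_rank)
        for new_rank, name in enumerate(after, start=1)
    ]


# Hypothetical orderings: by vector similarity, then by combinedScore
before = ["Listing A", "Listing B", "Listing C"]
after = ["Listing B", "Listing C", "Listing A"]

for name, old, new in rank_changes(before, after):
    print(f"{name}: {old} -> {new}")
```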


In [16]:
# Cell 17: Test the complete boosting pipeline
print("🧪 Testing boosting with a natural language query\n")

query = """
I want to stay in a place that's warm and friendly, 
and not too far from restaurants. Can you recommend a place? 
Include a reason as to why you've chosen your selection.
"""

# Run search WITH boosting stages
handle_user_query(
    query,
    db,
    collection,
    additional_stages,  # This applies our boosting pipeline!
    vector_index="vector_index_with_filter"
)

print("\n" + "=" * 80)
print("💡 WHAT JUST HAPPENED:")
print("=" * 80)
print("1. Vector search found 20 listings similar to 'warm, friendly, near restaurants'")
print("2. We calculated average review scores for each listing")
print("3. We combined review quality (90%) + popularity (10%) into combinedScore")
print("4. We re-sorted the 20 results by combinedScore")
print("5. The top results now have BOTH semantic relevance AND high quality!")
print("\n🎯 Notice how top results have high averageReviewScore and many reviews!")

🧪 Testing boosting with a natural language query

⚡ Search completed in 0.528349 milliseconds

📋 Fields available in results:
['_id', 'listing_url', 'name', 'summary', 'space', 'description', 'neighborhood_overview', 'notes', 'transit', 'access', 'interaction', 'house_rules', 'property_type', 'room_type', 'bed_type']

🤖 Generating recommendation with GPT...

❓ USER QUESTION:

I want to stay in a place that's warm and friendly, 
and not too far from restaurants. Can you recommend a place? 
Include a reason as to why you've chosen your selection.


🤖 RECOMMENDATION:
Based on your preferences for a warm and friendly atmosphere and proximity to restaurants, I recommend:

**Best location 1BR Apt in HK - Shops & Sights**

- **Location:** Hong Kong, Kowloon, Hong Kong
- **Type:** Apartment
- **Accommodates:** 4 guests
- **Average Review Score:** 9.83 (out of 10)
- **Number of Reviews:** 145

**Why I recommend this place:**
This listing consistently receives excellent reviews for its hospitali

Unnamed: 0,name,accommodates,address,price,property_type,averageReviewScore,number_of_reviews,combinedScore
0,A bedroom far away from home,2,"{'street': 'Queens, NY, United States', 'government_area': 'Briarwood', 'market': 'New York', 'country': 'United States', 'country_code': 'US', 'location': {'type': 'Point', 'coordinates': [-73.82257, 40.71485], 'is_location_exact': True}}",45,Apartment,9.833333,239,32.75
1,Homely Room in 5-Star New Condo@MTR,2,"{'street': 'Mongkok, Kowloon, Hong Kong', 'government_area': 'Yau Tsim Mong', 'market': 'Hong Kong', 'country': 'Hong Kong', 'country_code': 'HK', 'location': {'type': 'Point', 'coordinates': [114.17094, 22.32074], 'is_location_exact': False}}",479,Condominium,9.5,179,26.45
2,Cozy double bed room 東涌鄉村雅緻雙人房,2,"{'street': 'Hong Kong, New Territories, Hong Kong', 'government_area': 'Islands', 'market': 'Hong Kong', 'country': 'Hong Kong', 'country_code': 'HK', 'location': {'type': 'Point', 'coordinates': [113.92823, 22.27671], 'is_location_exact': False}}",487,Guesthouse,9.666667,162,24.9
3,The Garden Studio,2,"{'street': 'Marrickville, NSW, Australia', 'government_area': 'Marrickville', 'market': 'Sydney', 'country': 'Australia', 'country_code': 'AU', 'location': {'type': 'Point', 'coordinates': [151.15036, -33.90318], 'is_location_exact': False}}",129,Guesthouse,9.833333,146,23.45
4,Best location 1BR Apt in HK - Shops & Sights,4,"{'street': 'Hong Kong, Kowloon, Hong Kong', 'government_area': 'Yau Tsim Mong', 'market': 'Hong Kong', 'country': 'Hong Kong', 'country_code': 'HK', 'location': {'type': 'Point', 'coordinates': [114.17088, 22.29663], 'is_location_exact': True}}",997,Apartment,9.833333,145,23.35
5,Cozy Art Top Floor Apt in PRIME Williamsburg!,2,"{'street': 'Brooklyn, NY, United States', 'government_area': 'Williamsburg', 'market': 'New York', 'country': 'United States', 'country_code': 'US', 'location': {'type': 'Point', 'coordinates': [-73.96053, 40.71577], 'is_location_exact': True}}",175,Apartment,9.833333,117,20.55
6,Sydney Hyde Park City Apartment (checkin from 6am),2,"{'street': 'Darlinghurst, NSW, Australia', 'government_area': 'Sydney', 'market': 'Sydney', 'country': 'Australia', 'country_code': 'AU', 'location': {'type': 'Point', 'coordinates': [151.21346, -33.87603], 'is_location_exact': False}}",185,Apartment,10.0,109,19.9
7,"Studio convenient to CBD, beaches, street parking.",5,"{'street': 'Balgowlah, NSW, Australia', 'government_area': 'Manly', 'market': 'Sydney', 'country': 'Australia', 'country_code': 'AU', 'location': {'type': 'Point', 'coordinates': [151.26108, -33.7975], 'is_location_exact': True}}",45,Guest suite,9.833333,104,19.25
8,Banyan Bungalow,2,"{'street': 'Waialua, HI, United States', 'government_area': 'North Shore Oahu', 'market': 'Oahu', 'country': 'United States', 'country_code': 'US', 'location': {'type': 'Point', 'coordinates': [-158.1602, 21.57561], 'is_location_exact': False}}",100,Bungalow,9.666667,99,18.6
9,Cheerful new renovated central apt,8,"{'street': 'Beyoğlu, İstanbul, Turkey', 'government_area': 'Beyoglu', 'market': 'Istanbul', 'country': 'Turkey', 'country_code': 'TR', 'location': {'type': 'Point', 'coordinates': [28.97477, 41.03735], 'is_location_exact': False}}",264,Apartment,9.333333,77,16.1



💡 WHAT JUST HAPPENED:
1. Vector search found 20 listings similar to 'warm, friendly, near restaurants'
2. We calculated average review scores for each listing
3. We combined review quality (90%) + popularity (10%) into combinedScore
4. We re-sorted the 20 results by combinedScore
5. The top results now have BOTH semantic relevance AND high quality!

🎯 Notice how top results have high averageReviewScore and many reviews!
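
The `combinedScore` column in the results table can be checked against the formula by hand. The sketch below recomputes it for three rows copied from the table and also reports how much of each score comes from the review count.

```python
# (name, averageReviewScore, number_of_reviews, combinedScore) from the table
rows = [
    ("A bedroom far away from home", 9.833333, 239, 32.75),
    ("Homely Room in 5-Star New Condo@MTR", 9.5, 179, 26.45),
    ("Sydney Hyde Park City Apartment (checkin from 6am)", 10.0, 109, 19.9),
]

for name, average, count, reported in rows:
    recomputed = average * 0.9 + count * 0.1
    popularity_share = (count * 0.1) / recomputed
    print(
        f"{name}: recomputed={recomputed:.2f}, reported={reported}, "
        f"review-count share={popularity_share:.0%}"
    )
```

For these rows the review count accounts for more than half of `combinedScore`, which is why the listing with a perfect 10.0 average but fewer reviews sits lower in the table.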
