# Lesson 2: Intelligent Search with Filter Extraction & Pre-filtering

In this notebook, we'll build a realistic search system that:
1. Extracts structured filters from natural language queries using an LLM
2. Performs efficient vector search with pre-filtering
3. Generates helpful answers from the results

Together, these three steps form a complete retrieval-augmented generation (RAG) pipeline: filter extraction, filtered retrieval, then answer generation.

## Setup & Imports

In [68]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

In [69]:
# Get API keys from environment
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
MONGO_URI = os.environ.get("MONGO_URI")

if not OPENAI_API_KEY or not MONGO_URI:
    print("ERROR: Please set OPENAI_API_KEY and MONGO_URI in .env file")
else:
    print("✓ API keys loaded successfully")

✓ API keys loaded successfully


## Data Loading

We'll use the same Airbnb dataset from Lesson 1 (100 listings with embeddings).

In [70]:
from datasets import load_dataset
import pandas as pd

# Load dataset from HuggingFace
dataset = load_dataset("MongoDB/airbnb_embeddings", streaming=True, split="train")
dataset = dataset.take(100)
dataset_df = pd.DataFrame(dataset)

print(f"Loaded {len(dataset_df)} listings")
dataset_df.head(3)

Loaded 100 listings


Unnamed: 0,_id,listing_url,name,summary,space,description,neighborhood_overview,notes,transit,access,...,images,host,address,availability,review_scores,reviews,weekly_price,monthly_price,text_embeddings,image_embeddings
0,10006546,https://www.airbnb.com/rooms/10006546,Ribeira Charming Duplex,Fantastic duplex apartment with three bedrooms...,Privileged views of the Douro River and Ribeir...,Fantastic duplex apartment with three bedrooms...,"In the neighborhood of the river, you can find...",Lose yourself in the narrow streets and stairc...,Transport: • Metro station and S. Bento railwa...,We are always available to help guests. The ho...,...,"{'thumbnail_url': '', 'medium_url': '', 'pictu...","{'host_id': '51399391', 'host_url': 'https://w...","{'street': 'Porto, Porto, Portugal', 'suburb':...","{'availability_30': 28, 'availability_60': 47,...","{'review_scores_accuracy': 9, 'review_scores_c...","[{'_id': '58663741', 'date': 2016-01-03 05:00:...",,,"[0.0123710884, -0.0180913936, -0.016843712, -0...","[-0.1302358955, 0.1534578055, 0.0199299306, -0..."
1,10021707,https://www.airbnb.com/rooms/10021707,Private Room in Bushwick,Here exists a very cozy room for rent in a sha...,,Here exists a very cozy room for rent in a sha...,,,,,...,"{'thumbnail_url': '', 'medium_url': '', 'pictu...","{'host_id': '11275734', 'host_url': 'https://w...","{'street': 'Brooklyn, NY, United States', 'sub...","{'availability_30': 0, 'availability_60': 0, '...","{'review_scores_accuracy': 10, 'review_scores_...","[{'_id': '61050713', 'date': 2016-01-31 05:00:...",,,"[0.0153845912, -0.0348115042, -0.0093448907, 0...","[0.0340401195, 0.1742489338, -0.1572628617, 0...."
2,1001265,https://www.airbnb.com/rooms/1001265,Ocean View Waikiki Marina w/prkg,A short distance from Honolulu's billion dolla...,Great studio located on Ala Moana across the s...,A short distance from Honolulu's billion dolla...,You can breath ocean as well as aloha.,,Honolulu does have a very good air conditioned...,"Pool, hot tub and tennis",...,"{'thumbnail_url': '', 'medium_url': '', 'pictu...","{'host_id': '5448114', 'host_url': 'https://ww...","{'street': 'Honolulu, HI, United States', 'sub...","{'availability_30': 16, 'availability_60': 46,...","{'review_scores_accuracy': 9, 'review_scores_c...","[{'_id': '4765259', 'date': 2013-05-24 04:00:0...",650.0,2150.0,"[-0.0400562622, -0.0405789167, 0.000644172, 0....","[-0.1640156209, 0.1256971657, 0.6594450474, -0..."


## Document Models

Define Pydantic models for data validation (same as Lesson 1).

In [72]:
from typing import List, Optional
from pydantic import BaseModel, ValidationError
from datetime import datetime

class Host(BaseModel):
    host_id: str
    host_url: str
    host_name: str
    host_location: str
    host_about: str
    host_response_time: Optional[str] = None
    host_thumbnail_url: str
    host_picture_url: str
    host_response_rate: Optional[int] = None
    host_is_superhost: bool
    host_has_profile_pic: bool
    host_identity_verified: bool

class Location(BaseModel):
    type: str
    coordinates: List[float]
    is_location_exact: bool

class Address(BaseModel):
    street: str
    government_area: str
    market: str
    country: str
    country_code: str
    location: Location

class Review(BaseModel):
    _id: str
    date: Optional[datetime] = None
    listing_id: str
    reviewer_id: str
    reviewer_name: Optional[str] = None
    comments: Optional[str] = None

class Listing(BaseModel):
    _id: int
    listing_url: str
    name: str
    summary: str
    space: str
    description: str
    neighborhood_overview: Optional[str] = None
    notes: Optional[str] = None
    transit: Optional[str] = None
    access: str
    interaction: Optional[str] = None
    house_rules: str
    property_type: str
    room_type: str
    bed_type: str
    minimum_nights: int
    maximum_nights: int
    cancellation_policy: str
    last_scraped: Optional[datetime] = None
    calendar_last_scraped: Optional[datetime] = None
    first_review: Optional[datetime] = None
    last_review: Optional[datetime] = None
    accommodates: int
    bedrooms: Optional[float] = 0
    beds: Optional[float] = 0
    number_of_reviews: int
    bathrooms: Optional[float] = 0
    amenities: List[str]
    price: int
    security_deposit: Optional[float] = None
    cleaning_fee: Optional[float] = None
    extra_people: int
    guests_included: int
    images: dict
    host: Host
    address: Address
    availability: dict
    review_scores: dict
    reviews: List[Review]
    text_embeddings: List[float]

print("✓ Models defined")

✓ Models defined


In [73]:
# Convert dataframe to validated Listing objects
records = dataset_df.to_dict(orient='records')

# Handle NaT values
for record in records:
    for key, value in record.items():
        if isinstance(value, list):
            processed_list = [None if pd.isnull(v) else v for v in value]
            record[key] = processed_list
        else:
            if pd.isnull(value):
                record[key] = None

# Validate and convert to dictionaries
listings = [Listing(**record).dict() for record in records]
print(f"✓ Validated {len(listings)} listings")

✓ Validated 100 listings


## MongoDB Connection & Data Ingestion

In [74]:
from pymongo.mongo_client import MongoClient
from pymongo.operations import SearchIndexModel
import time

database_name = "airbnb_dataset"
collection_name = "listings_reviews_lesson2"

def get_mongo_client(mongo_uri):
    """Establish connection to MongoDB"""
    try:
        # Create client with extended timeout for SSL handshake
        client = MongoClient(
            mongo_uri, 
            appname="lesson2.intelligent_search",
            serverSelectionTimeoutMS=60000,  # 60 second timeout for initial connection
            socketTimeoutMS=60000,
            connectTimeoutMS=60000
        )
        
        # Force connection by running a simple command (this triggers SSL handshake)
        client.admin.command('ping')
        
        print("✓ Connection to MongoDB successful")
        return client
        
    except Exception as e:
        print(f"❌ Connection failed: {str(e)}")
        print("\nTroubleshooting tips:")
        print("1. Check your MONGO_URI is correct")
        print("2. Ensure your IP is whitelisted in MongoDB Atlas")
        print("3. Check your internet connection")
        raise

mongo_client = get_mongo_client(MONGO_URI)
db = mongo_client.get_database(database_name)
collection = db.get_collection(collection_name)

print(f"📋 Database: {database_name}")
print(f"📋 Collection: {collection_name}")

✓ Connection to MongoDB successful
📋 Database: airbnb_dataset
📋 Collection: listings_reviews_lesson2


In [75]:
# Clear existing data and insert fresh listings
collection.delete_many({})
collection.insert_many(listings)
print(f"✓ Inserted {len(listings)} listings into MongoDB")

✓ Inserted 100 listings into MongoDB


## Create Comprehensive Vector Search Index

This index includes the vector field PLUS filterable fields for:
- price (number)
- accommodates (number)
- bedrooms (number)
- address.country (string)
- address.market (string)

In [76]:
vector_index_name = "vector_index_with_filters"

# Define comprehensive index with filterable fields
vector_search_index_model = SearchIndexModel(
    definition={
        "mappings": {
            "dynamic": True,
            "fields": {
                # Vector field for semantic search
                "text_embeddings": {
                    "dimensions": 1536,
                    "similarity": "cosine",
                    "type": "knnVector",
                },
                # Filterable fields
                "price": {"type": "number"},
                "accommodates": {"type": "number"},
                "bedrooms": {"type": "number"},
                # String fields for location filtering
                "address": {
                    "type": "document",
                    "fields": {
                        "country": {"type": "token"},
                        "market": {"type": "token"}
                    }
                },
            },
        }
    },
    name=vector_index_name,
)

print("✓ Index model defined")

✓ Index model defined
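
The definition above uses the Atlas Search `mappings` syntax with a `knnVector` field. MongoDB's documentation also describes a dedicated `vectorSearch` index type, in which every field used in a `$vectorSearch` `filter` is declared as a `filter` field. The sketch below only builds such a definition as a plain dictionary; `FILTER_PATHS` and `vector_search_definition` are hypothetical names, and passing the result to `SearchIndexModel(..., type="vectorSearch")` assumes a recent PyMongo release, so check it against your driver version first.

```python
# Hypothetical alternative definition using the "vectorSearch" index type.
# Field names mirror the index defined above; nothing here talks to Atlas.
FILTER_PATHS = ["price", "accommodates", "bedrooms", "address.country", "address.market"]

vector_search_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "text_embeddings",
            "numDimensions": 1536,
            "similarity": "cosine",
        },
        # One "filter" entry per field that may appear in a $vectorSearch filter
        *[{"type": "filter", "path": path} for path in FILTER_PATHS],
    ]
}

print(len(vector_search_definition["fields"]))
```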


In [77]:
# Delete the old index with incorrect definition
try:
    collection.drop_search_index(vector_index_name)
    print(f"✓ Deleted old index '{vector_index_name}'")
    time.sleep(5)  # Wait for deletion to complete
except Exception as e:
    print(f"Note: {str(e)}")

Note: Search index airbnb_dataset.listings_reviews_lesson2.vector_index_with_filters cannot be found, full error: {'ok': 0.0, 'errmsg': 'Search index airbnb_dataset.listings_reviews_lesson2.vector_index_with_filters cannot be found', 'code': 27, 'codeName': 'IndexNotFound', '$clusterTime': {'clusterTime': Timestamp(1759583359, 3), 'signature': {'hash': b'\xa1\x1bw\xd6\x84\x95\xc3Z\xe7\xceV\xb4nU\x15\x0fB\xa9\x16l', 'keyId': 7522120776351219727}}, 'operationTime': Timestamp(1759583359, 3)}


In [78]:
# Check if the search index already exists
# (search indexes are returned by list_search_indexes(), not list_indexes())
index_exists = False
for index in collection.list_search_indexes():
    if index.get('name') == vector_index_name:
        index_exists = True
        print(f"✓ Index '{vector_index_name}' already exists")
        break

# Create index if it doesn't exist
if not index_exists:
    try:
        result = collection.create_search_index(model=vector_search_index_model)
        print("Creating index...")
        time.sleep(20)  # Wait for index to initialize
        print(f"✓ Index '{vector_index_name}' created successfully")
        print("Note: Wait a few minutes before searching to ensure index is fully ready")
    except Exception as e:
        print(f"Error creating index: {str(e)}")
        if "Duplicate Index" in str(e):
            print("You can proceed if you want to use the existing index")

Creating index...
✓ Index 'vector_index_with_filters' created successfully
Note: Wait a few minutes before searching to ensure index is fully ready
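
Instead of a fixed `time.sleep(20)` followed by a manual wait, the notebook could poll until Atlas reports the index as queryable. This is a minimal sketch: `wait_until_queryable` is a hypothetical helper, and it assumes each document returned by `collection.list_search_indexes()` carries a boolean `queryable` field.

```python
import time

def wait_until_queryable(collection, index_name, timeout_s=300, poll_s=5):
    """Poll the collection's search indexes until `index_name` is queryable.

    Returns True once the index reports queryable, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for index in collection.list_search_indexes():
            if index.get("name") == index_name and index.get("queryable"):
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

In this notebook it would be called as `wait_until_queryable(collection, vector_index_name)` right after `create_search_index`.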


## Helper: Get Embeddings

Function to generate embeddings from text using OpenAI.

In [79]:
import openai

openai.api_key = OPENAI_API_KEY

def get_embedding(text):
    """Generate embedding for given text using OpenAI"""
    if not text or not isinstance(text, str):
        return None
    
    embedding = openai.embeddings.create(
        input=text,
        model="text-embedding-3-small",
        dimensions=1536
    ).data[0].embedding
    
    return embedding

# Test it
test_embedding = get_embedding("test")
print(f"✓ Embedding function working (dimension: {len(test_embedding)})")

✓ Embedding function working (dimension: 1536)


## Step 1: Filter Extraction from Natural Language

This is where we use an LLM to parse user queries and extract structured MongoDB filters.

In [80]:
import json

def extract_filters_from_query(user_query):
    """
    Use GPT to extract structured MongoDB filters from natural language query.
    
    Returns:
        dict with:
            - semantic_query: the semantic part to search with vectors
            - filters: MongoDB filter conditions
    """
    
    prompt = f"""You are a filter extraction system for an Airbnb search engine.

Extract structured filters from the user's query and separate the semantic search part.

User Query: "{user_query}"

Extract filters for these fields:
- price: Extract from "cheap" (under 150), "expensive" (over 300), "under $X", etc.
- accommodates: Extract from "X people", "for X guests", etc.
- bedrooms: Extract from "X bedroom", "X bed", etc.
- address.country: Extract country name if mentioned
- address.market: Extract city name if mentioned

IMPORTANT PRICE THRESHOLDS (adjusted for real Airbnb data):
- "cheap" or "affordable" = under $150 (not $100)
- "moderate" = $150-$250
- "expensive" or "luxury" = over $300

Return a JSON object with:
{{
    "semantic_query": "the semantic/descriptive part of the query",
    "filters": {{
        "price": {{"$lt": 150}},
        "accommodates": {{"$gte": 4}},
        ...
    }}
}}

Rules:
- Only include filters that are clearly mentioned
- Use MongoDB query operators: $lt, $lte, $gt, $gte, $eq
- For location, use exact string match (no operators)
- If no filters found, return empty filters object
- Be realistic with price thresholds - most listings are $100-$300/night

Examples:
Query: "cheap cozy place in New York for 4 people"
Response:
{{
    "semantic_query": "cozy place",
    "filters": {{
        "price": {{"$lt": 150}},
        "accommodates": {{"$gte": 4}},
        "address.market": "New York"
    }}
}}

Query: "warm and friendly place near restaurants"
Response:
{{
    "semantic_query": "warm and friendly place near restaurants",
    "filters": {{}}
}}

Query: "luxury apartment in Barcelona"
Response:
{{
    "semantic_query": "apartment",
    "filters": {{
        "price": {{"$gt": 300}},
        "address.market": "Barcelona"
    }}
}}

Now extract filters from the user query above. Return ONLY valid JSON.
"""
    
    try:
        response = openai.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[
                {"role": "system", "content": "You are a filter extraction system. Return only valid JSON."},
                {"role": "user", "content": prompt}
            ],
            temperature=0
        )
        
        result_text = response.choices[0].message.content.strip()
        
        # Remove markdown code blocks if present
        if result_text.startswith("```"):
            result_text = result_text.split("```")[1]
            if result_text.startswith("json"):
                result_text = result_text[4:]
        
        result = json.loads(result_text)
        return result
        
    except Exception as e:
        print(f"Error extracting filters: {e}")
        # Fall back to no filters
        return {
            "semantic_query": user_query,
            "filters": {}
        }

print("✓ Filter extraction function defined (with realistic thresholds)")

✓ Filter extraction function defined (with realistic thresholds)
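
The fence-stripping step inside `extract_filters_from_query` can be pulled out into a small helper, which makes it testable without calling the API. The sketch below is one possible version, not the notebook's own code: `parse_llm_json` is a hypothetical name, and it takes the span from the first `{` to the last `}` instead of splitting on code fences.

```python
import json

def parse_llm_json(text):
    """Parse a JSON object from model output that may be wrapped in a Markdown fence."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])

# Build a fenced sample response (the fence is assembled to keep this block readable)
fence = "`" * 3
fenced = fence + 'json\n{"semantic_query": "cozy place", "filters": {"price": {"$lt": 150}}}\n' + fence
print(parse_llm_json(fenced))
```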


### Test Filter Extraction

Let's test the filter extraction with a few example queries.

In [81]:
# Test 1: Query with multiple filters
test_query_1 = "cheap cozy place in NYC for 4 people"
result_1 = extract_filters_from_query(test_query_1)

print("Test 1:")
print(f"Query: {test_query_1}")
print(f"Semantic Query: {result_1['semantic_query']}")
print(f"Filters: {json.dumps(result_1['filters'], indent=2)}")
print()

Test 1:
Query: cheap cozy place in NYC for 4 people
Semantic Query: cozy place
Filters: {
  "price": {
    "$lt": 150
  },
  "accommodates": {
    "$gte": 4
  },
  "address.market": "NYC"
}



In [82]:
# Test 2: Query with no filters (pure semantic)
test_query_2 = "warm and friendly place near restaurants"
result_2 = extract_filters_from_query(test_query_2)

print("Test 2:")
print(f"Query: {test_query_2}")
print(f"Semantic Query: {result_2['semantic_query']}")
print(f"Filters: {json.dumps(result_2['filters'], indent=2)}")
print()

Test 2:
Query: warm and friendly place near restaurants
Semantic Query: warm and friendly place near restaurants
Filters: {}



In [83]:
# Test 3: Query with price and bedroom filters
test_query_3 = "apartment under $150 with 2 bedrooms"
result_3 = extract_filters_from_query(test_query_3)

print("Test 3:")
print(f"Query: {test_query_3}")
print(f"Semantic Query: {result_3['semantic_query']}")
print(f"Filters: {json.dumps(result_3['filters'], indent=2)}")

Test 3:
Query: apartment under $150 with 2 bedrooms
Semantic Query: apartment
Filters: {
  "price": {
    "$lt": 150
  },
  "bedrooms": {
    "$eq": 2
  }
}
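
The extracted filters are passed to MongoDB as-is, so it may be worth validating them first. The following is a minimal sketch that is not part of the original notebook: `sanitize_filters` is a hypothetical helper that keeps only the five indexed fields and the operators listed in the prompt, and silently drops everything else.

```python
ALLOWED_OPERATORS = {"$lt", "$lte", "$gt", "$gte", "$eq"}
NUMERIC_FIELDS = {"price", "accommodates", "bedrooms"}
STRING_FIELDS = {"address.country", "address.market"}

def sanitize_filters(filters):
    """Return a copy of `filters` restricted to known fields and operators."""
    clean = {}
    for field, condition in (filters or {}).items():
        if field in STRING_FIELDS and isinstance(condition, str):
            # Location fields use an exact string match
            clean[field] = condition
        elif field in NUMERIC_FIELDS and isinstance(condition, dict):
            # Keep only whitelisted comparison operators with numeric operands
            kept = {
                op: value
                for op, value in condition.items()
                if op in ALLOWED_OPERATORS
                and isinstance(value, (int, float))
                and not isinstance(value, bool)
            }
            if kept:
                clean[field] = kept
    return clean

raw = {
    "price": {"$lt": 150, "$where": "1 == 1"},
    "address.market": "New York",
    "host.host_name": "Alice",
}
print(sanitize_filters(raw))
```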


## Step 2: Vector Search with Pre-filtering

Implement vector search that applies filters BEFORE computing similarities.

In [84]:
import pprint

def vector_search_with_prefilter(semantic_query, filters=None, limit=10, debug=False):
    """
    Perform vector search with optional pre-filtering.
    
    Args:
        semantic_query: The text to search for semantically
        filters: MongoDB filter conditions to apply before search
        limit: Number of results to return
        debug: If True, print the explain structure
    
    Returns:
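        Tuple (results, millis_elapsed): the matching documents and the
        execution time reported by explain ("N/A" if unavailable).
        Returns (None, 0) if the query could not be embedded.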
    """
    
    # Generate embedding for the semantic query
    query_embedding = get_embedding(semantic_query)
    
    if query_embedding is None:
        return None, 0
    
    # Build vector search stage
    vector_search_stage = {
        "$vectorSearch": {
            "index": vector_index_name,
            "queryVector": query_embedding,
            "path": "text_embeddings",
            "numCandidates": 150,
            "limit": limit,
        }
    }
    
    # Add filters if provided (this is PRE-filtering)
    if filters and len(filters) > 0:
        vector_search_stage["$vectorSearch"]["filter"] = filters
    
    # Build aggregation pipeline
    pipeline = [vector_search_stage]
    
    # Execute search
    results = list(collection.aggregate(pipeline))
    
    # Get execution statistics
    explain_query = db.command(
        'explain',
        {
            'aggregate': collection.name,
            'pipeline': pipeline,
            'cursor': {}
        },
        verbosity='executionStats'
    )
    
    # Debug: Print the explain structure
    if debug:
        print("\n=== EXPLAIN QUERY STRUCTURE ===")
        pprint.pprint(explain_query)
        print("="*50 + "\n")
    
    # Try to extract execution time; the explain layout varies by Atlas version
    try:
        millis_elapsed = explain_query['stages'][0]['$vectorSearch']['explain']['collectors']['allCollectorStats']['millisElapsed']
    except (KeyError, IndexError, TypeError) as e:
        if debug:
            print(f"Could not extract execution time using original path: {e}")
            print("Trying alternative paths...")
        
        try:
            # Alternative 1: millisElapsed directly under explain
            millis_elapsed = explain_query['stages'][0]['$vectorSearch']['explain']['millisElapsed']
        except (KeyError, IndexError, TypeError):
            # Alternative 2: executionStats at top level, otherwise "N/A"
            millis_elapsed = explain_query.get('executionStats', {}).get('executionTimeMillis', 'N/A')
    
    return results, millis_elapsed

print("✓ Pre-filtered vector search function defined (with debug mode)")

✓ Pre-filtered vector search function defined (with debug mode)
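
To see why the order of filtering matters, here is a toy in-memory comparison. It uses brute-force cosine similarity over four made-up documents and is only an illustration of the idea, not how Atlas executes `$vectorSearch`; all names and data in it are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(docs, query, k):
    """Return the k documents most similar to the query vector."""
    return sorted(docs, key=lambda d: cosine(d["vec"], query), reverse=True)[:k]

def pre_filter_search(docs, query, predicate, k):
    # Filter first, then rank only the documents that passed
    return top_k([d for d in docs if predicate(d)], query, k)

def post_filter_search(docs, query, predicate, k):
    # Rank everything first, then discard results that fail the filter
    return [d for d in top_k(docs, query, k) if predicate(d)]

def is_cheap(doc):
    return doc["price"] < 100

toy_docs = [
    {"name": "A", "price": 300, "vec": [1.0, 0.0]},
    {"name": "B", "price": 250, "vec": [0.9, 0.1]},
    {"name": "C", "price": 90, "vec": [0.6, 0.4]},
    {"name": "D", "price": 80, "vec": [0.2, 0.8]},
]
query = [1.0, 0.0]

pre = pre_filter_search(toy_docs, query, is_cheap, k=2)
post = post_filter_search(toy_docs, query, is_cheap, k=2)
print([d["name"] for d in pre])   # ['C', 'D']
print([d["name"] for d in post])  # []
```

With post-filtering, the two most similar documents are both too expensive, so nothing survives the filter. Pre-filtering ranks only the eligible documents and still returns two results.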


### Test Pre-filtered Search

**Important Notes:**
1. **Execution time may show "N/A"**: The layout of the MongoDB Atlas vector search explain output varies by version, so the timing metrics may not be at any of the paths the function checks. This doesn't affect search functionality, only the performance reporting.

2. **Index must be ready**: If you just created the vector index, wait 2-3 minutes before searching.
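
Because the explain layout varies, a small helper that tries several key paths in order can keep the timing lookup readable. This is a sketch only: `first_present` and `TIMING_PATHS` are hypothetical names, the paths are the three already checked in `vector_search_with_prefilter`, and `sample_explain` is a made-up stand-in for a real explain result.

```python
def first_present(document, paths, default="N/A"):
    """Return the value at the first path that exists in a nested dict/list structure."""
    for path in paths:
        current = document
        try:
            for key in path:
                current = current[key]
        except (KeyError, IndexError, TypeError):
            continue  # This path is missing; try the next one
        return current
    return default

TIMING_PATHS = [
    ("stages", 0, "$vectorSearch", "explain", "collectors", "allCollectorStats", "millisElapsed"),
    ("stages", 0, "$vectorSearch", "explain", "millisElapsed"),
    ("executionStats", "executionTimeMillis"),
]

# Made-up explain result in which only the second path exists
sample_explain = {"stages": [{"$vectorSearch": {"explain": {"millisElapsed": 0.28}}}]}
print(first_present(sample_explain, TIMING_PATHS))
```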

Before running the search, let's check how many documents match a few hand-written filters:

In [85]:
# DIAGNOSTIC: Let's see what data we actually have
print("=== DATA DIAGNOSTICS ===\n")

# Count total listings
total = collection.count_documents({})
print(f"Total listings: {total}\n")

# Check price distribution
cheap = collection.count_documents({"price": {"$lt": 100}})
moderate = collection.count_documents({"price": {"$gte": 100, "$lt": 200}})
expensive = collection.count_documents({"price": {"$gte": 200}})
print(f"Price distribution:")
print(f"  - Under $100: {cheap}")
print(f"  - $100-$200: {moderate}")
print(f"  - Over $200: {expensive}\n")

# Check accommodates
print(f"Accommodates distribution:")
for size in [2, 4, 6]:
    count = collection.count_documents({"accommodates": {"$gte": size}})
    print(f"  - {size}+ people: {count}")

# Check what market values actually exist
markets = collection.distinct("address.market")
print(f"\nAll unique markets in dataset:")
for market in sorted(markets):
    count = collection.count_documents({"address.market": market})
    print(f"  - {market}: {count} listings")

# Now test REALISTIC filter combinations
print("\n" + "="*50)
print("TESTING REALISTIC FILTER COMBINATIONS:\n")

test_cases = [
    ("Cheap (<$100) + Small (2+ people) in New York", 
     {"price": {"$lt": 100}, "accommodates": {"$gte": 2}, "address.market": "New York"}),
    
    ("Moderate (<$150) + Medium (4+ people) in New York", 
     {"price": {"$lt": 150}, "accommodates": {"$gte": 4}, "address.market": "New York"}),
    
    ("Cheap (<$100) + Medium (4+ people) in Rio", 
     {"price": {"$lt": 100}, "accommodates": {"$gte": 4}, "address.market": "Rio De Janeiro"}),
    
    ("Moderate (<$200) + Medium (4+ people) in New York", 
     {"price": {"$lt": 200}, "accommodates": {"$gte": 4}, "address.market": "New York"}),
    
    ("Just cheap (<$150) in any city", 
     {"price": {"$lt": 150}}),
]

for description, filters in test_cases:
    count = collection.count_documents(filters)
    status = "✓" if count > 0 else "✗"
    print(f"{status} {description}: {count} matches")
        
print("\n" + "="*50)

=== DATA DIAGNOSTICS ===

Total listings: 100

Price distribution:
  - Under $100: 26
  - $100-$200: 32
  - Over $200: 42

Accommodates distribution:
  - 2+ people: 91
  - 4+ people: 44
  - 6+ people: 19

All unique markets in dataset:
  - Barcelona: 3 listings
  - Hong Kong: 10 listings
  - Istanbul: 13 listings
  - Maui: 2 listings
  - Montreal: 4 listings
  - New York: 16 listings
  - Oahu: 4 listings
  - Porto: 8 listings
  - Rio De Janeiro: 25 listings
  - Sydney: 11 listings
  - The Big Island: 4 listings

TESTING REALISTIC FILTER COMBINATIONS:

✓ Cheap (<$100) + Small (2+ people) in New York: 2 matches
✓ Moderate (<$150) + Medium (4+ people) in New York: 1 matches
✗ Cheap (<$100) + Medium (4+ people) in Rio: 0 matches
✓ Moderate (<$200) + Medium (4+ people) in New York: 2 matches
✓ Just cheap (<$150) in any city: 50 matches



In [86]:
# DIAGNOSTIC: Check why a specific filter combination returns no matches
print("=== DATA DIAGNOSTICS ===\n")

# Count total listings
total = collection.count_documents({})
print(f"Total listings: {total}\n")

# Check price distribution
cheap = collection.count_documents({"price": {"$lt": 100}})
print(f"Listings with price < $100: {cheap}")

# Check accommodates
big_enough = collection.count_documents({"accommodates": {"$gte": 4}})
print(f"Listings that accommodate >= 4 people: {big_enough}")

# Check what market values actually exist
markets = collection.distinct("address.market")
print(f"\nAll unique markets in dataset:")
for market in sorted(markets):
    count = collection.count_documents({"address.market": market})
    print(f"  - {market}: {count} listings")

# Check combined filter
combined = collection.count_documents({
    "price": {"$lt": 100},
    "accommodates": {"$gte": 4},
    "address.market": "New York"
})
print(f"\nListings matching ALL filters (price<100, accommodates>=4, market='New York'): {combined}")

# Let's also check other variations
ny_variations = ["New York", "New York City", "NYC", "Brooklyn", "Manhattan"]
print(f"\nChecking New York area variations:")
for variation in ny_variations:
    count = collection.count_documents({
        "price": {"$lt": 100},
        "accommodates": {"$gte": 4},
        "address.market": variation
    })
    if count > 0:
        print(f"  - '{variation}': {count} matches")
        
print("\n" + "="*50)

=== DATA DIAGNOSTICS ===

Total listings: 100

Listings with price < $100: 26
Listings that accommodate >= 4 people: 44

All unique markets in dataset:
  - Barcelona: 3 listings
  - Hong Kong: 10 listings
  - Istanbul: 13 listings
  - Maui: 2 listings
  - Montreal: 4 listings
  - New York: 16 listings
  - Oahu: 4 listings
  - Porto: 8 listings
  - Rio De Janeiro: 25 listings
  - Sydney: 11 listings
  - The Big Island: 4 listings

Listings matching ALL filters (price<100, accommodates>=4, market='New York'): 0

Checking New York area variations:
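
Test 1 earlier extracted `"address.market": "NYC"`, while the markets stored in the collection use names such as `New York`, so an exact-match filter on `NYC` can never match. One way to handle this is to normalize the extracted market before searching. The sketch below is an assumption-laden example: `KNOWN_MARKETS` is copied from the diagnostics output, while `MARKET_ALIASES` and `normalize_market` are hypothetical.

```python
# Canonical market names, taken from the diagnostics output above
KNOWN_MARKETS = [
    "Barcelona", "Hong Kong", "Istanbul", "Maui", "Montreal", "New York",
    "Oahu", "Porto", "Rio De Janeiro", "Sydney", "The Big Island",
]

# Hypothetical alias table; extend it with the spellings users actually type
MARKET_ALIASES = {
    "nyc": "New York",
    "new york city": "New York",
    "rio": "Rio De Janeiro",
}

def normalize_market(filters):
    """Return a copy of `filters` with address.market mapped to a stored value."""
    market = filters.get("address.market")
    if not isinstance(market, str):
        return dict(filters)
    key = market.strip().lower()
    by_lowercase = {name.lower(): name for name in KNOWN_MARKETS}
    canonical = MARKET_ALIASES.get(key) or by_lowercase.get(key)
    normalized = dict(filters)
    if canonical is None:
        # Unknown market: drop the filter instead of guaranteeing zero results
        normalized.pop("address.market")
    else:
        normalized["address.market"] = canonical
    return normalized

print(normalize_market({"price": {"$lt": 150}, "address.market": "NYC"}))
```

Dropping an unknown market is a design choice made for this sketch; returning an explicit "location not in dataset" message would be an equally reasonable alternative.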



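## Step 3: Answer Generation & Complete Pipeline

Combine filter extraction and pre-filtered vector search, then ask GPT to generate a recommendation from the results.

In [87]: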
from IPython.display import display, HTML

def intelligent_search(user_query, limit=5, show_details=True):
    """
    Complete intelligent search pipeline:
    1. Extract filters from natural language query
    2. Perform pre-filtered vector search
    3. Generate answer using GPT
    
    Args:
        user_query: Natural language search query
        limit: Number of results to return
        show_details: If True, show extracted filters and results table
    """
    
    print("=" * 80)
    print(f"🔍 USER QUERY: {user_query}")
    print("=" * 80)
    
    # Step 1: Extract filters from query
    print("\n📋 Step 1: Extracting filters from query...")
    extraction_result = extract_filters_from_query(user_query)
    semantic_query = extraction_result['semantic_query']
    filters = extraction_result['filters']
    
    if show_details:
        print(f"  → Semantic part: '{semantic_query}'")
        if filters:
            print(f"  → Filters extracted: {json.dumps(filters, indent=6)}")
        else:
            print(f"  → No filters found (pure semantic search)")
    
    # Step 2: Perform pre-filtered vector search
    print(f"\n🔎 Step 2: Searching with pre-filtering...")
    results, exec_time = vector_search_with_prefilter(semantic_query, filters, limit=limit)
    
    # results is None when the query could not be embedded
    print(f"  → Found {len(results or [])} results")
    print(f"  → Execution time: {exec_time}ms")
    
    # Check if we got results
    if not results or len(results) == 0:
        print("\n❌ No results found!")
        print("\nPossible reasons:")
        print("  - Filters too restrictive (try broader criteria)")
        print("  - Location not in dataset (try: New York, Barcelona, Rio, Hong Kong, etc.)")
        print("  - Price threshold too low (most listings are $100-$300)")
        return None
    
    # Show a preview table of the results (still part of Step 2)
    if show_details:
        print(f"\n📊 Results Preview:")
        
        # Create simple result display
        result_items = []
        for i, result in enumerate(results, 1):
            result_items.append({
                '#': i,
                'name': result.get('name', 'N/A')[:50],
                'price': f"${result.get('price', 'N/A')}",
                'accommodates': result.get('accommodates', 'N/A'),
                'bedrooms': result.get('bedrooms', 'N/A'),
                'location': result.get('address', {}).get('market', 'N/A')
            })
        
        results_df = pd.DataFrame(result_items)
        display(HTML(results_df.to_html(index=False)))
    
    # Step 3: Generate answer using GPT
    print(f"\n🤖 Step 3: Generating personalized recommendation...")
    
    # Prepare context for GPT
    context = []
    for result in results:
        context.append({
            'name': result.get('name'),
            'summary': result.get('summary', ''),
            'space': result.get('space', ''),
            'price': result.get('price'),
            'accommodates': result.get('accommodates'),
            'bedrooms': result.get('bedrooms'),
            'location': f"{result.get('address', {}).get('market', '')}, {result.get('address', {}).get('country', '')}"
        })
    
    completion = openai.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful Airbnb recommendation assistant. Provide friendly, personalized recommendations based on the search results."
            },
            {
                "role": "user",
                "content": f"""User Query: {user_query}

Search Results:
{json.dumps(context, indent=2)}

Please recommend 1-2 listings from these results that best match the user's query. 
Explain WHY each recommendation fits their needs. Be concise and friendly."""
            }
        ],
        temperature=0.7
    )
    
    answer = completion.choices[0].message.content
    
    print("\n" + "=" * 80)
    print("💡 RECOMMENDATION:")
    print("=" * 80)
    print(answer)
    print("=" * 80 + "\n")
    
    return {
        'query': user_query,
        'semantic_query': semantic_query,
        'filters': filters,
        'results': results,
        'answer': answer,
        'execution_time_ms': exec_time
    }

print("✓ Complete intelligent_search function defined")

✓ Complete intelligent_search function defined
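
`$vectorSearch` returns whole documents, including the 1536-dimensional `text_embeddings` array and every review, even though `intelligent_search` only reads a handful of fields. One option is to add a `$project` stage that keeps just those fields plus the similarity score. The sketch below only builds the pipeline as plain dictionaries; `build_search_pipeline` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
def build_search_pipeline(index_name, query_embedding, filters=None, limit=10, num_candidates=150):
    """Build a $vectorSearch pipeline that returns only the fields used downstream."""
    vector_stage = {
        "index": index_name,
        "queryVector": query_embedding,
        "path": "text_embeddings",
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if filters:
        # Pre-filter: applied inside the $vectorSearch stage
        vector_stage["filter"] = filters

    project_stage = {
        "name": 1,
        "summary": 1,
        "space": 1,
        "price": 1,
        "accommodates": 1,
        "bedrooms": 1,
        "address.market": 1,
        "address.country": 1,
        # Similarity score computed by $vectorSearch
        "score": {"$meta": "vectorSearchScore"},
    }
    return [{"$vectorSearch": vector_stage}, {"$project": project_stage}]

pipeline = build_search_pipeline(
    "vector_index_with_filters", [0.0] * 1536, {"price": {"$lt": 100}}, limit=5
)
print([next(iter(stage)) for stage in pipeline])
```

The resulting list could be passed to `collection.aggregate(pipeline)` in place of the single-stage pipeline built in `vector_search_with_prefilter`.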


### Test Complete Intelligent Search

Now let's test queries that combine filter extraction + vector search + answer generation.

In [88]:
# Test 1: Query with filters
intelligent_search("Apartment under 100 dollars in Barcelona!")

🔍 USER QUERY: Apartment under 100 dollars in Barcelona!

📋 Step 1: Extracting filters from query...
  → Semantic part: 'Apartment'
  → Filters extracted: {
      "price": {
            "$lt": 100
      },
      "address.market": "Barcelona"
}

🔎 Step 2: Searching with pre-filtering...
  → Found 3 results
  → Execution time: 0.032621ms

📊 Results Preview:


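| # | name | price | accommodates | bedrooms | location |
|---|------|-------|--------------|----------|----------|
| 1 | Nice room in Barcelona Center | $50 | 2 | 1.0 | Barcelona |
| 2 | Park Guell apartment with terrace | $85 | 6 | 2.0 | Barcelona |
| 3 | Cozy bedroom Sagrada Familia | $20 | 2 | 1.0 | Barcelona |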



🤖 Step 3: Generating personalized recommendation...

💡 RECOMMENDATION:
Here are 2 great Airbnb options in Barcelona under $100:

1. **Nice room in Barcelona Center ($50/night)**  
This cozy double room is in the central Eixample neighborhood, just a short walk from iconic sights like Sagrada Familia and Passeig de Gracia. You'll be right in the heart of the city, with easy metro access and amazing views. Perfect if you want convenience and a true Barcelona experience!

2. **Park Guell apartment with terrace ($85/night)**  
If you'd like your own space, this fully equipped apartment comes with a lovely terrace and can accommodate up to 6 guests. It's on a quiet street just 5 minutes from Park Güell and a quick metro ride to the city center—ideal for relaxing after a day of exploring.

Both options fit your budget and offer great locations—one for a central city vibe, the other for a peaceful stay near a famous park! Let me know if you want more details on either.



{'query': 'Apartment under 100 dollars in Barcelona!',
 'semantic_query': 'Apartment',
 'filters': {'price': {'$lt': 100}, 'address.market': 'Barcelona'},
 'results': [{'_id': ObjectId('68e11c7705d92b196b597d81'),
   'listing_url': 'https://www.airbnb.com/rooms/10082422',
   'name': 'Nice room in Barcelona Center',
   'summary': 'Hi!  Cozy double bed room in amazing flat next to Passeig de Sant Joan and to metro stop Verdaguer. 3 streets to Sagrada Familia and 4 streets to Passeig de Gracia. Flat located in the center of the city.  View to Sagrada Familia and Torre Agbar.',
   'space': 'Nice flat in the central neighboorhood of Eixample.',
   'description': "Hi!  Cozy double bed room in amazing flat next to Passeig de Sant Joan and to metro stop Verdaguer. 3 streets to Sagrada Familia and 4 streets to Passeig de Gracia. Flat located in the center of the city.  View to Sagrada Familia and Torre Agbar. Nice flat in the central neighboorhood of Eixample. Ideal couple or 2 friends. Dreta d

In [89]:
# Test 2: Pure semantic query (no filters)
intelligent_search("warm and friendly place near restaurants")

🔍 USER QUERY: warm and friendly place near restaurants

📋 Step 1: Extracting filters from query...
  → Semantic part: 'warm and friendly place near restaurants'
  → No filters found (pure semantic search)

🔎 Step 2: Searching with pre-filtering...
  → Found 5 results
  → Execution time: 0.280029ms

📊 Results Preview:


#,name,price,accommodates,bedrooms,location
1,Cozy house at Beyoğlu,$58,2,1.0,Istanbul
2,Downtown Oporto Inn (room cleaning),$40,2,1.0,Porto
3,Cozy bedroom Sagrada Familia,$20,2,1.0,Barcelona
4,Be Happy in Porto,$30,2,1.0,Porto
5,Banyan Bungalow,$100,2,0.0,Oahu



🤖 Step 3: Generating personalized recommendation...

💡 RECOMMENDATION:
Based on your request for a warm and friendly place near restaurants, here are my top recommendations:

1. **Be Happy in Porto (Porto, Portugal)**  
This beautifully renovated apartment in the heart of Porto offers a cozy, welcoming atmosphere. It’s surrounded by coffee shops, bakeries, restaurants, and bars, so you’ll have plenty of dining options right outside your door. The central location also makes exploring the city easy and enjoyable!

2. **Cozy house at Beyoğlu (Istanbul, Turkey)**  
This inviting home is centrally located near Taksim Square, with a bus stop just 100 meters away for easy access to the city’s best restaurants and attractions. The host emphasizes a safe, quiet, and spacious environment—perfect for a warm, local experience.

Both options offer comfort, friendliness, and easy access to great food nearby! Let me know if you’d like more details on either.



{'query': 'warm and friendly place near restaurants',
 'semantic_query': 'warm and friendly place near restaurants',
 'filters': {},
 'results': [{'_id': ObjectId('68e11c7705d92b196b597d93'),
   'listing_url': 'https://www.airbnb.com/rooms/10092679',
   'name': 'Cozy house at Beyoğlu',
   'summary': 'Hello dear Guests, wellcome to istanbul. My House is 2+1 and at second floor. 1 privite room is for my international guests. House is Very close to Taksim Square. You can Walk in 30 minutes or you can take a bus.  The bus stop is only 100 m from home. You can go Taksim, Eminönü, Karaköy, Kadıköy, Beyazıt, Sultanahmet easily from home.  I have 1 bed, two people can sleep together. Second person should pay extra. You can use kitchen, bathroom, free Wifi, dishwasher, washing machine, Ironing.',
   'space': 'Safe, quite, big house, wiev, Central, near the bus stop.',
   'description': 'Hello dear Guests, wellcome to istanbul. My House is 2+1 and at second floor. 1 privite room is for my intern

In [90]:
# Test 3: Query with price and bedroom filters
intelligent_search("spacious apartment under $200 with at least 2 bedrooms")

🔍 USER QUERY: spacious apartment under $200 with at least 2 bedrooms

📋 Step 1: Extracting filters from query...
  → Semantic part: 'spacious apartment'
  → Filters extracted: {
      "price": {
            "$lt": 200
      },
      "bedrooms": {
            "$gte": 2
      }
}

🔎 Step 2: Searching with pre-filtering...
  → Found 5 results
  → Execution time: 0.047874ms

📊 Results Preview:


#,name,price,accommodates,bedrooms,location
1,3 chambres au coeur du Plateau,$140,6,3.0,Montreal
2,Ribeira Charming Duplex,$80,8,3.0,Porto
3,Park Guell apartment with terrace,$85,6,2.0,Barcelona
4,Large railroad style 3 bedroom apt in Manhattan!,$180,9,3.0,New York
5,BBC OPORTO 4X2,$100,8,4.0,Porto



🤖 Step 3: Generating personalized recommendation...

💡 RECOMMENDATION:
Here are two great options that perfectly fit your search for a spacious apartment under $200 with at least 2 bedrooms:

**1. Ribeira Charming Duplex (Porto, Portugal) – $80/night**  
- 3 bedrooms, accommodates up to 8 guests  
- Spacious and fully equipped, located in the heart of historic Porto  
- Great value for the price, perfect for families or groups  
This apartment offers plenty of space and charm at a fantastic price, right in a vibrant UNESCO World Heritage area.

**2. Park Guell apartment with terrace (Barcelona, Spain) – $85/night**  
- 2 bedrooms, accommodates 6 guests  
- Renovated, with a cozy terrace and colonial decor  
- Quiet street, close to Park Güell and public transport  
This is a comfortable, spacious option in a great Barcelona location, complete with a lovely outdoor terrace.

Both listings are under $200, have at least 2 bedrooms, and offer great amenities for a comfortable stay. Let me

{'query': 'spacious apartment under $200 with at least 2 bedrooms',
 'semantic_query': 'spacious apartment',
 'filters': {'price': {'$lt': 200}, 'bedrooms': {'$gte': 2}},
 'results': [{'_id': ObjectId('68e11c7705d92b196b597d80'),
   'listing_url': 'https://www.airbnb.com/rooms/10066928',
   'name': '3 chambres au coeur du Plateau',
   'summary': 'Notre appartement comporte 3 chambres avec chacune un lit queen. Nous avons également un salon, une salle de bain avec baignoire, et une cuisine toute équipée, avec laveuse et sécheuse.',
   'space': "Notre logement est lumineux, plein de vie et chaleureux! Vous disposerez de l'appartement entier avec 3 chambres fermées, chacune avec 1 lit queen size.",
   'description': "Notre appartement comporte 3 chambres avec chacune un lit queen. Nous avons également un salon, une salle de bain avec baignoire, et une cuisine toute équipée, avec laveuse et sécheuse. Notre logement est lumineux, plein de vie et chaleureux! Vous disposerez de l'appartement 
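
A quick way to sanity-check pre-filtering is to confirm that every returned listing actually satisfies the extracted filters. The sketch below re-implements a small subset of MongoDB's matching rules in plain Python: equality, dotted paths, and the `$lt`, `$lte`, `$gt`, `$gte`, and `$eq` operators. The helper names `get_nested` and `matches` are hypothetical and this is not MongoDB's own matcher. The rows are copied from the Test 3 results preview above.

```python
from typing import Any

# Comparison operators supported by this sketch (a small subset of MongoDB's).
COMPARATORS = {
    "$lt": lambda value, bound: value < bound,
    "$lte": lambda value, bound: value <= bound,
    "$gt": lambda value, bound: value > bound,
    "$gte": lambda value, bound: value >= bound,
    "$eq": lambda value, bound: value == bound,
}


def get_nested(doc: dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path such as 'address.market'; return None if missing."""
    current: Any = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(doc: dict[str, Any], filters: dict[str, Any]) -> bool:
    """Return True if the document satisfies every condition in filters."""
    for field, condition in filters.items():
        value = get_nested(doc, field)
        if isinstance(condition, dict):
            # Operator form, e.g. {"$lt": 200}; a missing field never matches.
            for op, bound in condition.items():
                if value is None or not COMPARATORS[op](value, bound):
                    return False
        elif value != condition:
            # Plain equality form, e.g. {"address.market": "Barcelona"}.
            return False
    return True


# Name, price, and bedrooms copied from the Test 3 results preview.
preview_rows = [
    {"name": "3 chambres au coeur du Plateau", "price": 140, "bedrooms": 3},
    {"name": "Ribeira Charming Duplex", "price": 80, "bedrooms": 3},
    {"name": "Park Guell apartment with terrace", "price": 85, "bedrooms": 2},
    {"name": "Large railroad style 3 bedroom apt in Manhattan!", "price": 180, "bedrooms": 3},
    {"name": "BBC OPORTO 4X2", "price": 100, "bedrooms": 4},
]
test3_filters = {"price": {"$lt": 200}, "bedrooms": {"$gte": 2}}

print(all(matches(row, test3_filters) for row in preview_rows))  # True
```

All five previewed listings pass, which is what pre-filtering should guarantee. The same check could be run against the full `results` list returned by `intelligent_search`.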

In [None]:
# Test 4: Try your own query (replace the empty string before running)
intelligent_search("")