# MongoDB Vector Search Tutorial
## Step-by-step learning with Airbnb dataset

Run cells in order: 1 → 2 → 3 → ... → 17

**What we'll build:**
1. Load Airbnb dataset with embeddings (Cells 1-4)
2. Create Pydantic models for data validation (Cells 5-8)
3. Connect to MongoDB Atlas and insert data (Cells 9-10)
4. Create vector search index (Cell 11)
5. Implement vector search (Cells 12-14)
6. Build GPT-powered recommendation system (Cells 15-17)

In [12]:
# Cell 1: Setup
import warnings
import os
import pandas as pd
import numpy as np
from datasets import load_dataset
from dotenv import load_dotenv, find_dotenv
from typing import List, Optional
from pydantic import BaseModel, ValidationError
from pymongo.mongo_client import MongoClient
from pymongo.operations import SearchIndexModel

warnings.filterwarnings('ignore')
print("✅ All imports loaded!")

✅ All imports loaded!


In [13]:
# Cell 2: Load API keys
load_dotenv(find_dotenv())

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
MONGO_URI = os.environ.get("MONGO_URI") 
HF_TOKEN = os.environ.get("HF_TOKEN")

print("✅ OpenAI API Key:", "Found" if OPENAI_API_KEY else "❌ Missing")
print("✅ MongoDB URI:", "Found" if MONGO_URI else "❌ Missing")  
print("✅ HuggingFace Token:", "Found" if HF_TOKEN else "❌ Missing")

✅ OpenAI API Key: Found
✅ MongoDB URI: Found
✅ HuggingFace Token: Found


In [14]:
# Cell 3: Load dataset
print("🔄 Loading Airbnb dataset...")

dataset = load_dataset("MongoDB/airbnb_embeddings", streaming=True, split="train")
dataset = dataset.take(20)
dataset_df = pd.DataFrame(dataset)

print(f"✅ Loaded {len(dataset_df)} records")
print(f"📊 Shape: {dataset_df.shape}")
print(f"🔗 Embeddings: {len(dataset_df.iloc[0]['text_embeddings'])} dimensions")

# Show sample
sample = dataset_df.iloc[0]
print(f"\n🏠 Sample: {sample['name']} - ${sample['price']} - {sample['accommodates']} guests")

🔄 Loading Airbnb dataset...
✅ Loaded 20 records
📊 Shape: (20, 43)
🔗 Embeddings: 1536 dimensions

🏠 Sample: Ribeira Charming Duplex - $80 - 8 guests


In [15]:
# Cell 4: Analyze data structure
print("🔍 How I identify main vs nested structures:\n")

sample = dataset_df.iloc[0]

print("📊 Field types:")
for field_name in ['_id', 'name', 'price', 'host', 'address', 'text_embeddings']:
    if field_name in sample:
        value = sample[field_name]
        field_type = type(value).__name__
        
        if isinstance(value, dict):
            preview = f"dict with {len(value)} keys: {list(value.keys())[:3]}..."
        elif isinstance(value, list):
            preview = f"list with {len(value)} items"
        else:
            preview = f"{str(value)[:30]}..."
            
        print(f"  {field_name:15} | {field_type:8} | {preview}")

print("\n💡 Pattern:")
print("  - Simple types (str, int) = main fields")
print("  - dict type = needs separate Pydantic model")
print("  - list of floats = embeddings for vector search")

🔍 How I identify main vs nested structures:

📊 Field types:
  _id             | int64    | 10006546...
  name            | str      | Ribeira Charming Duplex...
  price           | int64    | 80...
  host            | dict     | dict with 16 keys: ['host_id', 'host_url', 'host_name']...
  address         | dict     | dict with 7 keys: ['street', 'suburb', 'government_area']...
  text_embeddings | list     | list with 1536 items

💡 Pattern:
  - Simple types (str, int) = main fields
  - dict type = needs separate Pydantic model
  - list of floats = embeddings for vector search
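
The same pattern can be applied recursively, so that dicts nested inside dicts (such as `address.location`) are also reported. The following is a minimal sketch using only the standard library; `describe_fields` and `toy_record` are hypothetical names, and the record values are made up for illustration rather than taken from the dataset.

```python
def describe_fields(record, prefix=""):
    """Return (dotted_path, kind) pairs for every field in a nested record."""
    rows = []
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # A dict is a nested structure: report it, then walk into it
            rows.append((path, "nested"))
            rows.extend(describe_fields(value, prefix=f"{path}."))
        elif isinstance(value, list) and value and all(isinstance(x, float) for x in value):
            rows.append((path, "float_list"))
        else:
            rows.append((path, "simple"))
    return rows


# Hypothetical miniature record, shaped like one dataset row (values are made up)
toy_record = {
    "name": "Example listing",
    "price": 80,
    "address": {
        "market": "Porto",
        "location": {"type": "Point", "coordinates": [-8.61, 41.14]},
    },
    "text_embeddings": [0.1, -0.2, 0.3],
}

for path, kind in describe_fields(toy_record):
    print(f"{path:30} | {kind}")
```

Note that `address.location.coordinates` is also reported as a float list, so the field name and length are still needed to tell an embedding apart from other numeric lists.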


In [16]:
# Cell 5: Create Pydantic models
print("🏗️ Creating Pydantic models...")

class Host(BaseModel):
    host_id: str
    host_name: str
    host_is_superhost: bool
    host_has_profile_pic: bool
    host_identity_verified: bool
    host_url: Optional[str] = None
    host_location: Optional[str] = None
    host_about: Optional[str] = None
    host_response_time: Optional[str] = None
    host_thumbnail_url: Optional[str] = None
    host_picture_url: Optional[str] = None
    host_response_rate: Optional[int] = None

class Location(BaseModel):
    type: str
    coordinates: List[float]
    is_location_exact: bool

class Address(BaseModel):
    street: str
    government_area: str
    market: str
    country: str
    country_code: str
    location: Location

class Listing(BaseModel):
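    # NOTE: Pydantic does not treat underscore-prefixed names as fields, so the
    # dataset's `_id` is not kept; MongoDB assigns a new ObjectId on insert (see Cell 10)
    _id: int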
    name: str
    summary: str
    property_type: str
    room_type: str
    price: int
    accommodates: int
    host: Host
    address: Address
    text_embeddings: List[float]
    
    # Optional fields
    listing_url: Optional[str] = None
    space: Optional[str] = None
    description: Optional[str] = None
    neighborhood_overview: Optional[str] = None
    notes: Optional[str] = None
    transit: Optional[str] = None
    access: Optional[str] = None
    interaction: Optional[str] = None
    house_rules: Optional[str] = None
    bed_type: Optional[str] = None
    minimum_nights: Optional[int] = None
    maximum_nights: Optional[int] = None
    cancellation_policy: Optional[str] = None
    bedrooms: Optional[float] = 0
    beds: Optional[float] = 0
    bathrooms: Optional[float] = 0
    number_of_reviews: Optional[int] = 0
    amenities: Optional[List[str]] = []
    security_deposit: Optional[float] = None
    cleaning_fee: Optional[float] = None
    extra_people: Optional[int] = 0
    guests_included: Optional[int] = 1

print("✅ Models created: Host, Location, Address, Listing")

🏗️ Creating Pydantic models...
✅ Models created: Host, Location, Address, Listing


In [17]:
# Cell 6: Simple data cleaner (handles the NaN/array issue)
def clean_record(record):
    """Clean record - handle NaN values safely"""
    cleaned = {}
    for key, value in record.items():
        if value is None:
            cleaned[key] = None
        elif isinstance(value, (dict, list)):
            cleaned[key] = value  # Keep complex types as-is
        else:
            # Only check scalar values for NaN
            try:
                if pd.isna(value):
                    cleaned[key] = None
                else:
                    cleaned[key] = value
            except (ValueError, TypeError):
                cleaned[key] = value  # Keep original if check fails
    return cleaned

print("✅ Data cleaner function ready")

✅ Data cleaner function ready
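
`clean_record` keeps dicts and lists as-is, so a NaN sitting inside a nested dict such as `host` would pass through unchanged. If that turns out to matter for your data, a recursive variant is one option. This is a sketch under the assumption that the only missing-value marker is a float NaN; `clean_value` is a hypothetical helper, and it does not handle pandas-specific markers such as `pd.NA` or `NaT`.

```python
import math


def clean_value(value):
    """Replace float NaN with None, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {key: clean_value(item) for key, item in value.items()}
    if isinstance(value, list):
        return [clean_value(item) for item in value]
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


# Made-up record with NaN at the top level and inside a nested dict
toy = {
    "price": 80,
    "cleaning_fee": float("nan"),
    "host": {"host_name": "Ana", "host_response_rate": float("nan")},
    "text_embeddings": [0.1, 0.2],
}
print(clean_value(toy))
```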


In [18]:
# Cell 7: Test with one record
print("🧪 Testing Pydantic with first record...")

sample_record = dataset_df.iloc[0].to_dict()
cleaned_record = clean_record(sample_record)

try:
    listing = Listing(**cleaned_record)
    
    print("🎉 SUCCESS! Pydantic validation passed!")
    print(f"📋 Name: {listing.name}")
    print(f"💰 Price: ${listing.price} (converted to int)")
    print(f"👥 Accommodates: {listing.accommodates}")
    print(f"🏠 Type: {listing.property_type}")
    print(f"🌍 Location: {listing.address.market}, {listing.address.country}")
    print(f"⭐ Superhost: {listing.host.host_is_superhost}")
    print(f"🔗 Embeddings: {len(listing.text_embeddings)} dimensions")
    
except ValidationError as e:
    print("❌ Validation failed:")
    for error in e.errors():
        field = ' -> '.join(str(x) for x in error['loc'])
        print(f"  {field}: {error['msg']}")
        
except Exception as e:
    print(f"❓ Error: {e}")

🧪 Testing Pydantic with first record...
🎉 SUCCESS! Pydantic validation passed!
📋 Name: Ribeira Charming Duplex
💰 Price: $80 (converted to int)
👥 Accommodates: 8
🏠 Type: House
🌍 Location: Porto, Portugal
⭐ Superhost: False
🔗 Embeddings: 1536 dimensions


In [19]:
# Cell 8: Process all records
print("🔄 Processing all 20 records...")

validated_listings = []
errors = []

for i, record in enumerate(dataset_df.to_dict('records')):
    try:
        cleaned_record = clean_record(record)
        listing = Listing(**cleaned_record)
        validated_listings.append(listing.dict())
        
        if i < 3:
            print(f"✅ {i+1}: {listing.name[:30]}... (${listing.price})")
        
    except Exception as e:
        errors.append((i, str(e)))
        print(f"❌ {i+1}: Failed")

print(f"\n📊 Results:")
print(f"  ✅ Validated: {len(validated_listings)} records")
print(f"  ❌ Failed: {len(errors)} records")

if validated_listings:
    print("\n🎯 Ready for MongoDB! Each record has:")
    print("  - Validated data types")
    print("  - 1536-dimension embeddings")
    print("  - Structured host & address info")

🔄 Processing all 20 records...
✅ 1: Ribeira Charming Duplex... ($80)
✅ 2: Private Room in Bushwick... ($40)
✅ 3: Ocean View Waikiki Marina w/pr... ($115)

📊 Results:
  ✅ Validated: 20 records
  ❌ Failed: 0 records

🎯 Ready for MongoDB! Each record has:
  - Validated data types
  - 1536-dimension embeddings
  - Structured host & address info


In [20]:
# Cell 9: MongoDB Atlas Connection
print("🔌 Setting up MongoDB Atlas connection...")

# Database and collection names
database_name = "airbnb_dataset"
collection_name = "listings_reviews"

def get_mongo_client(mongo_uri):
    """Establish connection to MongoDB Atlas"""
    client = MongoClient(mongo_uri, appname="devrel.deeplearningai.lesson1.python")
    print("✅ Connection to MongoDB Atlas successful!")
    return client

if not MONGO_URI:
    print("❌ MONGO_URI not set in environment variables")
else:
    # Connect to MongoDB Atlas
    mongo_client = get_mongo_client(MONGO_URI)
    
    # Get database and collection
    db = mongo_client.get_database(database_name)
    collection = db.get_collection(collection_name)
    
    print(f"📋 Database: {database_name}")
    print(f"📋 Collection: {collection_name}")
    print(f"🎯 Ready to insert {len(validated_listings)} records!")

🔌 Setting up MongoDB Atlas connection...
✅ Connection to MongoDB Atlas successful!
📋 Database: airbnb_dataset
📋 Collection: listings_reviews
🎯 Ready to insert 20 records!


In [21]:
# Cell 10: Insert Data into MongoDB
print("💾 Inserting validated data into MongoDB...")

if MONGO_URI and validated_listings:
    # Clear any existing data (optional - be careful!)
    print("🗑️ Clearing existing data...")
    result = collection.delete_many({})
    print(f"Deleted {result.deleted_count} existing records")
    
    # Insert our validated listings
    print(f"📥 Inserting {len(validated_listings)} records...")
    insert_result = collection.insert_many(validated_listings)
    
    print(f"✅ Successfully inserted {len(insert_result.inserted_ids)} records!")
    print(f"🎯 Collection now has {collection.count_documents({})} documents")
    
    # Show a sample document
    sample_doc = collection.find_one()
    if sample_doc:
        print(f"\n📋 Sample document structure:")
        print(f"  - ID: {sample_doc['_id']}")
        print(f"  - Name: {sample_doc['name']}")
        print(f"  - Price: ${sample_doc['price']}")
        print(f"  - Embeddings: {len(sample_doc['text_embeddings'])} dimensions")
else:
    print("❌ Cannot insert - missing MONGO_URI or no validated listings")

💾 Inserting validated data into MongoDB...
🗑️ Clearing existing data...
Deleted 20 existing records
📥 Inserting 20 records...
✅ Successfully inserted 20 records!
🎯 Collection now has 20 documents

📋 Sample document structure:
  - ID: 68de8b5b53eb7044f9e73371
  - Name: Ribeira Charming Duplex
  - Price: $80
  - Embeddings: 1536 dimensions


## Understanding Vector Search Index

**What is a Vector Search Index?**
A vector search index is a special data structure that enables fast similarity searches through high-dimensional vectors (our embeddings).

**Key Components:**
- **Field**: `text_embeddings` (our 1536-dimension vectors)
- **Dimensions**: 1536 (matches OpenAI's text-embedding-3-small model)
- **Similarity**: `cosine` (measures angle between vectors, not magnitude)
- **Type**: `knnVector` (k-nearest neighbors for fast similarity search)

**How it works:**
1. 🔍 You provide a query (e.g., "cozy apartment near restaurants")
2. 🔄 Query gets converted to a 1536-dimension vector
3. 📐 MongoDB runs an approximate nearest-neighbor search over the indexed vectors, scoring candidates by cosine similarity
4. 📊 Returns the most similar listings ranked by similarity score

**Why cosine similarity?**
- Focuses on direction/content rather than magnitude
- Perfect for text embeddings where meaning matters more than absolute values
- Range: -1 (opposite direction) to 1 (same direction)
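
To make the ranking step and the "direction, not magnitude" point concrete, here is a small sketch in plain Python. It compares the query against every document exhaustively, which is only practical for toy data; a vector index avoids that full scan. The 3-dimension vectors, `toy_docs`, and both function names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, docs, k=2):
    """Score every document against the query and keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in docs.items()]
    return sorted(scored, reverse=True)[:k]


# Made-up 3-dimension "embeddings" standing in for the real 1536-dimension ones
toy_docs = {
    "beach flat": [0.9, 0.1, 0.0],
    "city loft": [0.1, 0.9, 0.1],
    "mountain cabin": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.0]

# Scaling a vector changes its magnitude but not its direction,
# so the similarity stays at 1 (up to floating-point rounding)
scaled = [10 * x for x in query_vec]
print(f"query vs scaled query: {cosine_similarity(query_vec, scaled):.6f}")

for score, name in rank_by_similarity(query_vec, toy_docs):
    print(f"{score:.3f}  {name}")
```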

**Index benefits:**
- ⚡ **Speed**: Fast searches through millions of vectors
- 🎯 **Accuracy**: Finds semantically similar content
- 📈 **Scalability**: Efficient even with large datasets

In [22]:
# Cell 11: Create Vector Search Index (with detailed explanations)
print("🔍 Creating vector search index with detailed configuration...")

if MONGO_URI:
    # Define index configuration (following Lesson_1.md approach)
    text_embedding_field_name = "text_embeddings"  # Field containing our embeddings
    vector_search_index_name = "vector_index_text"  # Index identifier
    
    print(f"📋 Index Configuration:")
    print(f"  Field: {text_embedding_field_name}")
    print(f"  Name: {vector_search_index_name}")
    print(f"  Dimensions: 1536 (OpenAI text-embedding-3-small)")
    print(f"  Similarity: cosine (best for text embeddings)")
    print(f"  Type: knnVector (k-nearest neighbors)")
    
    vector_search_index_model = SearchIndexModel(
        definition={
            "mappings": {  # Describes how fields are indexed and stored
                "dynamic": True,  # Automatically index new fields that appear
                "fields": {  # Properties of the fields that will be indexed
                    text_embedding_field_name: {
                        "dimensions": 1536,  # Size of the vector (must match embeddings)
                        "similarity": "cosine",  # Algorithm for computing similarity
                        "type": "knnVector",  # Vector search type
                    }
                },
            }
        },
        name=vector_search_index_name,  # Identifier for the vector search index
    )
    
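    # Check if index already exists
    index_exists = False
    print("\n🔍 Checking existing indexes...")
    
    for index in collection.list_indexes():
        print(f"  Found index: {index.get('name', 'unnamed')}")
    
    # Search indexes are not returned by list_indexes(); query them separately
    for index in collection.list_search_indexes():
        print(f"  Found search index: {index.get('name', 'unnamed')}")
        if index.get('name') == vector_search_index_name:
            index_exists = True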
    
    if not index_exists:
        print(f"\n📝 Creating new vector search index: {vector_search_index_name}")
        try:
            result = collection.create_search_index(model=vector_search_index_model)
            print(f"✅ Vector search index created successfully!")
            print(f"📄 Index ID: {result}")
            print("⏳ Index is initializing... (may take a few minutes)")
            print("💡 You can proceed - the index will be ready shortly")
        except Exception as e:
            print(f"❌ Error creating index: {e}")
            if "Duplicate Index" in str(e):
                print("💡 An index with this name already exists")
    else:
        print(f"\n✅ Vector search index '{vector_search_index_name}' already exists!")
        print("🎯 Ready for vector searches!")
else:
    print("❌ Cannot create index - missing MONGO_URI")

🔍 Creating vector search index with detailed configuration...
📋 Index Configuration:
  Field: text_embeddings
  Name: vector_index_text
  Dimensions: 1536 (OpenAI text-embedding-3-small)
  Similarity: cosine (best for text embeddings)
  Type: knnVector (k-nearest neighbors)

🔍 Checking existing indexes...
  Found index: _id_

📝 Creating new vector search index: vector_index_text
✅ Vector search index created successfully!
📄 Index ID: vector_index_text
⏳ Index is initializing... (may take a few minutes)
💡 You can proceed - the index will be ready shortly
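
Because the index builds in the background, searches issued before it is ready may come back empty. One way to wait is to poll `list_search_indexes` until the index document reports `queryable: True`. The helper below is a sketch, not part of the original notebook: `wait_until_queryable` and its timeout defaults are assumptions, and it works with any object that exposes a PyMongo-style `list_search_indexes(name)` method.

```python
import time


def wait_until_queryable(collection, index_name, timeout_s=300, poll_s=5):
    """Poll until the named search index reports queryable=True.

    Returns True once the index is queryable, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        time.sleep(poll_s)
    return False


# Usage in this notebook (requires the live `collection` from Cell 9):
# if wait_until_queryable(collection, vector_search_index_name):
#     print("Index is ready for queries")
```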


## Summary

We've successfully:
1. ✅ Loaded Airbnb dataset with embeddings
2. ✅ Analyzed data structure (main vs nested fields)
3. ✅ Created Pydantic models for validation
4. ✅ Processed and validated all records
5. ✅ Connected to MongoDB Atlas
6. ✅ Inserted data into MongoDB
7. ✅ Created vector search index

**Next steps:** Implement vector search functionality and test queries!

In [23]:
# Cell 12: Create embedding function for queries
import openai

openai.api_key = OPENAI_API_KEY

def get_embedding(text):
    """
    Generate an embedding for the given text using OpenAI's API.
    
    This converts any text into a 1536-dimension vector that can be
    compared with our stored listing embeddings.
    """
    # Check for valid input
    if not text or not isinstance(text, str):
        return None
    
    try:
        # Call OpenAI API to get the embedding
        embedding = openai.embeddings.create(
            input=text,
            model="text-embedding-3-small", 
            dimensions=1536
        ).data[0].embedding
        
        print(f"✅ Created embedding: {len(embedding)} dimensions")
        return embedding
        
    except Exception as e:
        print(f"❌ Error in get_embedding: {e}")
        return None

# Test it
print("🧪 Testing embedding generation...")
test_query = "cozy apartment near beach"
test_embedding = get_embedding(test_query)
if test_embedding is not None:  # get_embedding returns None on failure
    print(f"📊 Query: '{test_query}' → {len(test_embedding)} dimensions")

🧪 Testing embedding generation...
✅ Created embedding: 1536 dimensions
📊 Query: 'cozy apartment near beach' → 1536 dimensions
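
The index definition states that the vector size must match the stored embeddings, so it can be worth checking a query vector before sending it to MongoDB. A minimal sketch follows; `validate_query_vector` is a hypothetical helper, and the default of 1536 is simply the dimension used throughout this notebook.

```python
EXPECTED_DIMENSIONS = 1536  # matches the "dimensions" value in the index definition


def validate_query_vector(vector, expected=EXPECTED_DIMENSIONS):
    """Raise ValueError unless `vector` is a list of `expected` numbers."""
    if vector is None:
        raise ValueError("No embedding was generated for the query")
    if len(vector) != expected:
        raise ValueError(f"Expected {expected} dimensions, got {len(vector)}")
    if not all(isinstance(x, (int, float)) for x in vector):
        raise ValueError("Embedding must contain only numbers")
    return vector


# A vector of the right size passes through unchanged
ok = validate_query_vector([0.0] * EXPECTED_DIMENSIONS)
print(f"Valid query vector with {len(ok)} dimensions")
```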


In [24]:
# Cell 13: Vector Search Function
def vector_search(user_query, db, collection, vector_index="vector_index_text"):
    """
    Perform a vector search in MongoDB based on user query.
    
    How it works:
    1. Convert user's text query into an embedding (1536-dim vector)
    2. Use MongoDB's $vectorSearch to find similar listings
    3. Return top 20 most similar results
    """
    
    print(f"🔍 Searching for: '{user_query}'")
    
    # Step 1: Generate embedding for the user query
    query_embedding = get_embedding(user_query)
    
    if query_embedding is None:
        print("❌ Invalid query or embedding generation failed.")
        return []  # Empty list keeps the return type consistent for callers
    
    # Step 2: Define the vector search stage
    vector_search_stage = {
        "$vectorSearch": {
            "index": vector_index,  # Which index to use
            "queryVector": query_embedding,  # Our query as a vector
            "path": text_embedding_field_name,  # Field to search in documents
            "numCandidates": 150,  # Number of candidates to consider
            "limit": 20  # Return top 20 matches
        }
    }
    
    # Step 3: Build aggregation pipeline
    pipeline = [vector_search_stage]
    
    # Step 4: Execute the search
    print("🔎 Running vector search...")
    results = collection.aggregate(pipeline)
    
    # Step 5: Get execution stats (note: explain runs the pipeline a second time)
    explain_query_execution = db.command(
        'explain', {
            'aggregate': collection.name,
            'pipeline': pipeline,
            'cursor': {}
        },
        verbosity='executionStats'
    )
    
    vector_search_explain = explain_query_execution['stages'][0]['$vectorSearch']
    millis_elapsed = vector_search_explain['explain']['collectors']['allCollectorStats']['millisElapsed']
    
    print(f"⚡ Search completed in {millis_elapsed} milliseconds")
    
    return list(results)

print("✅ Vector search function ready!")

✅ Vector search function ready!
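
The function above returns whole documents, including the 1536-float embedding, and does not surface the similarity score. A `$project` stage after `$vectorSearch` can trim the output and add the score through `{"$meta": "vectorSearchScore"}`. The sketch below only builds the pipeline as plain Python data, so it runs without a database; `build_search_pipeline` is a hypothetical helper and the projected field list is an assumption based on what the later cells display.

```python
def build_search_pipeline(query_vector, index_name="vector_index_text",
                          path="text_embeddings", num_candidates=150, limit=20):
    """Build a $vectorSearch pipeline that returns selected fields plus a score."""
    vector_search_stage = {
        "$vectorSearch": {
            "index": index_name,
            "queryVector": query_vector,
            "path": path,
            "numCandidates": num_candidates,  # candidates considered before ranking
            "limit": limit,                   # documents actually returned
        }
    }
    project_stage = {
        "$project": {
            "_id": 0,
            "name": 1,
            "price": 1,
            "property_type": 1,
            "address": 1,
            "summary": 1,
            "score": {"$meta": "vectorSearchScore"},  # similarity score per result
        }
    }
    return [vector_search_stage, project_stage]


pipeline = build_search_pipeline([0.0] * 1536)
print([list(stage.keys())[0] for stage in pipeline])
```

With a live connection, the result of `build_search_pipeline(get_embedding(query))` could be passed to `collection.aggregate(...)` in place of the single-stage pipeline.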


In [25]:
# Cell 14: Test Vector Search (Simple)
print("🧪 Testing vector search with a simple query...\n")

# Simple test query
test_query = "beach house with pool"

# Run the search
search_results = vector_search(test_query, db, collection)

print(f"\n📊 Found {len(search_results)} results\n")

# Show top 3 results
print("🏆 Top 3 matches:")
for i, result in enumerate(search_results[:3], 1):
    print(f"\n{i}. {result['name']}")
    print(f"   💰 Price: ${result['price']}")
    print(f"   🏠 Type: {result['property_type']}")
    print(f"   📍 Location: {result['address']['market']}, {result['address']['country']}")
    print(f"   📝 Summary: {result['summary'][:100]}...")

🧪 Testing vector search with a simple query...

🔍 Searching for: 'beach house with pool'
✅ Created embedding: 1536 dimensions
🔎 Running vector search...
⚡ Search completed in 0.158292 milliseconds

📊 Found 20 results

🏆 Top 3 matches:

1. Surry Hills Studio - Your Perfect Base in Sydney
   💰 Price: $181
   🏠 Type: Apartment
   📍 Location: Sydney, Australia
   📝 Summary: This spacious, light filled studio has everything you need to enjoy Sydney and is the perfect base f...

2. Ocean View Waikiki Marina w/prkg
   💰 Price: $115
   🏠 Type: Condominium
   📍 Location: Oahu, United States
   📝 Summary: A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking incl...

3. Copacabana Apartment Posto 6
   💰 Price: $119
   🏠 Type: Apartment
   📍 Location: Rio De Janeiro, Brazil
   📝 Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the ...


In [26]:
# Cell 15: Create SearchResultItem model (for clean display)
class SearchResultItem(BaseModel):
    name: str
    accommodates: Optional[int] = None
    address: Address
    summary: Optional[str] = None
    description: Optional[str] = None
    neighborhood_overview: Optional[str] = None
    price: int
    property_type: str

print("✅ SearchResultItem model ready for displaying results")

✅ SearchResultItem model ready for displaying results


In [28]:
# Cell 16: Handle User Query with GPT Recommendations
from IPython.display import display, HTML

def handle_user_query(query, db, collection):
    """
    Complete search pipeline:
    1. Run vector search to find similar listings
    2. Convert results to clean format
    3. Use GPT to generate natural language recommendations
    4. Display results nicely
    """
    
    # Step 1: Run vector search
    get_knowledge = vector_search(query, db, collection)
    
    # Check if there are any results
    if not get_knowledge:
        return "No results found."  # Single string, same type as the normal return value
    
    # Step 2: Convert search results into SearchResultItem models
    search_results_models = [
        SearchResultItem(**result)
        for result in get_knowledge
    ]
    
    # Convert to DataFrame for GPT and display
    search_results_df = pd.DataFrame([item.dict() for item in search_results_models])
    
    # Step 3: Generate system response using GPT
    print("\n🤖 Generating recommendation with GPT...\n")
    
    completion = openai.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": "You are an Airbnb listing recommendation system."
            },
            {
                "role": "user",
                "content": f"Answer this user query: {query} with the following context:\n{search_results_df}"
            }
        ]
    )
    
    system_response = completion.choices[0].message.content
    
    # Step 4: Display results
    print("━" * 80)
    print("❓ USER QUESTION:")
    print(f"{query}\n")
    print("━" * 80)
    print("🤖 RECOMMENDATION:")
    print(f"{system_response}\n")
    print("━" * 80)
    print("📋 SOURCE DATA:")
    display(HTML(search_results_df.to_html()))
    
    return system_response

print("✅ Complete query handler ready!")

✅ Complete query handler ready!
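
Interpolating a DataFrame into an f-string inserts its printed representation, which pandas may truncate for long strings or wide frames, so the model might not see the full summaries. One alternative is to build the context text explicitly. This is a sketch only: `format_context` and the 200-character cut-off are assumptions, and the listing below is made up for illustration.

```python
def format_context(results, max_summary_chars=200):
    """Turn search results (a list of dicts) into one compact line per listing."""
    lines = []
    for i, doc in enumerate(results, 1):
        address = doc.get("address") or {}
        summary = (doc.get("summary") or "")[:max_summary_chars]
        lines.append(
            f"{i}. {doc.get('name')} | {doc.get('property_type')} | "
            f"${doc.get('price')} | {address.get('market')}, {address.get('country')} | "
            f"{summary}"
        )
    return "\n".join(lines)


# Made-up result, shaped like the documents returned by vector_search
toy_results = [
    {
        "name": "Example flat",
        "property_type": "Apartment",
        "price": 100,
        "address": {"market": "Porto", "country": "Portugal"},
        "summary": "Bright flat close to cafes and restaurants.",
    }
]
print(format_context(toy_results))
```

The returned string could replace `{search_results_df}` in the user message, while the DataFrame is still used for the HTML display.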


In [None]:
# Cell 17: Test Full System with Natural Language Query
query = """
I want to stay in a place that's warm and friendly, 
and not too far from restaurants. Can you recommend a place? 
Include a reason as to why you've chosen your selection.
"""

handle_user_query(query, db, collection)