<a href="https://colab.research.google.com/github/yilinmiao/personalized_real_estate_agent/blob/main/personalized_real_estate_agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project implements a personalized real estate agent application called "HomeMatch". It uses an LLM to generate synthetic real estate listings, stores them in a vector database, and then retrieves and personalizes the listings that best match a buyer's preferences.

# Setting Up

In [None]:
!pip install langchain openai chromadb

In [None]:
!pip install langchain_openai

In [None]:
!pip install langchain_community

In [5]:
import os
import json
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import Chroma  # Community package installed above
from langchain_core.documents import Document
import chromadb # Import ChromaDB client library directly for persistence setup

In [6]:
from google.colab import userdata
openai_api_key = userdata.get('OPENAI_API_KEY')
if not openai_api_key:
    print('🛑 OpenAI API key not found. Please add OPENAI_API_KEY to the Colab secrets for this notebook.')
    exit()
else:
    print('✅ OpenAI API key loaded successfully.')


✅ OpenAI API key loaded successfully.


In [7]:
# --- Configuration ---
LISTINGS_FILE = "listings.json"
VECTORDB_DIR = "chroma_db"
COLLECTION_NAME = "real_estate_listings"

Initialize the LLM with a model suited to text generation and instruction following.

When using `OpenAI` directly, pass the API key explicitly unless it is already set globally through the `OPENAI_API_KEY` environment variable.


In [8]:
try:
    llm = OpenAI(temperature=0.7, openai_api_key=openai_api_key, model_name='gpt-3.5-turbo-instruct')
    embeddings_model = OpenAIEmbeddings(openai_api_key=openai_api_key)
    print("✅ LLM and Embeddings Model initialized.")
except Exception as e:
    print(f"🛑 Error initializing OpenAI models: {e}")
    exit()

✅ LLM and Embeddings Model initialized.


# Generating Real Estate Listings
Define the prompt template for generating listings

In [9]:
listing_template = """
Generate a realistic real estate listing with the following details:
Neighborhood: {neighborhood_type}
Price: Around ${price_range}
Bedrooms: {bedrooms}
Bathrooms: {bathrooms}
House Size: Approximately {house_size} sqft

Include a compelling description highlighting key features relevant to the neighborhood type and target audience.
Also, provide a short description of the neighborhood itself, mentioning local amenities, atmosphere, or points of interest.

Format the output STRICTLY as a JSON object with keys: 'neighborhood', 'price' (as integer), 'bedrooms', 'bathrooms', 'house_size_sqft', 'description', 'neighborhood_description'.
DO NOT include any text before or after the JSON object.

Example Format:
{{
  "neighborhood": "Example Neighborhood",
  "price": 500000,
  "bedrooms": 3,
  "bathrooms": 2,
  "house_size_sqft": 1800,
  "description": "Charming updated bungalow...",
  "neighborhood_description": "Quiet tree-lined streets..."
}}

Listing Details:
Neighborhood Type: {neighborhood_type}
Price Range: ${price_range}
Bedrooms: {bedrooms}
Bathrooms: {bathrooms}
House Size: {house_size} sqft
"""


In [10]:
listing_prompt = PromptTemplate(
    input_variables=["neighborhood_type", "price_range", "bedrooms", "bathrooms", "house_size"],
    template=listing_template
)

# Using RunnableSequence with prompt | llm
listing_chain = listing_prompt | llm
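
The doubled braces around the example JSON in `listing_template` matter: the template is filled in with Python-style format fields, so a literal `{` has to be written as `{{`. A minimal sketch using plain `str.format` as a stand-in for `PromptTemplate` (an assumption made so the example runs without LangChain; the default f-string template format follows the same escaping rule). The shortened `mini_template` is hypothetical.

```python
# Shortened, hypothetical stand-in for listing_template.
mini_template = """Bedrooms: {bedrooms}
Example Format:
{{
  "bedrooms": 3
}}"""

# Single braces are placeholders; doubled braces come out as literal braces.
rendered = mini_template.format(bedrooms=4)
print(rendered)
```

With single braces around the JSON example, formatting would fail because the JSON body would be read as a placeholder name.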

In [11]:
def generate_listings(num_listings=10):
    """Generates a specified number of diverse real estate listings using the LLM."""
    listings = []
    # Define diverse parameters for generation
    listing_params = [
        {"neighborhood_type": "Family-Friendly Suburb", "price_range": "650,000", "bedrooms": 4, "bathrooms": 3, "house_size": 2500},
        {"neighborhood_type": "Trendy Urban Area", "price_range": "800,000", "bedrooms": 2, "bathrooms": 2, "house_size": 1200},
        {"neighborhood_type": "Quiet Rural Outskirts", "price_range": "450,000", "bedrooms": 3, "bathrooms": 2, "house_size": 2000},
        {"neighborhood_type": "Luxury Downtown District", "price_range": "1,500,000", "bedrooms": 3, "bathrooms": 3.5, "house_size": 3000},
        {"neighborhood_type": "Eco-Conscious Community", "price_range": "700,000", "bedrooms": 3, "bathrooms": 2, "house_size": 1800},
        {"neighborhood_type": "Historic Neighborhood", "price_range": "900,000", "bedrooms": 4, "bathrooms": 2.5, "house_size": 2800},
        {"neighborhood_type": "Lakeside Community", "price_range": "1,100,000", "bedrooms": 5, "bathrooms": 4, "house_size": 3500},
        {"neighborhood_type": "Affordable Starter Home Area", "price_range": "350,000", "bedrooms": 2, "bathrooms": 1, "house_size": 1000},
        {"neighborhood_type": "University Town Neighborhood", "price_range": "550,000", "bedrooms": 3, "bathrooms": 2, "house_size": 1600},
        {"neighborhood_type": "Retirement Community", "price_range": "400,000", "bedrooms": 2, "bathrooms": 2, "house_size": 1400},
    ]

    # Repeat the parameter sets as needed, then truncate to exactly num_listings
    params_to_use = (listing_params * (num_listings // len(listing_params) + 1))[:num_listings]

    print(f"Generating {num_listings} listings...")
    for i, params in enumerate(params_to_use):
        print(f"  Generating listing {i+1}/{num_listings} ({params['neighborhood_type']})...")
        retries = 3
        for attempt in range(retries):
            try:
                generated_text = listing_chain.invoke(params)
                # Attempt to parse the LLM output as JSON
                try:
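                    # Remove optional markdown fences. str.strip('```json') would strip a
                    # set of characters, not the literal prefix, so use removeprefix/removesuffix.
                    cleaned_text = generated_text.strip().removeprefix('```json').removeprefix('```').removesuffix('```').strip()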
                    listing_data = json.loads(cleaned_text)

                    # Basic validation
                    required_keys = ['neighborhood', 'price', 'bedrooms', 'bathrooms', 'house_size_sqft', 'description', 'neighborhood_description']
                    if all(key in listing_data for key in required_keys):
                        # Add a unique ID
                        listing_data['id'] = f"listing_{i+1:03d}"
                        # Convert price to int if it's a string
                        if isinstance(listing_data['price'], str):
                            listing_data['price'] = int(listing_data['price'].replace(',', '').replace('$', ''))
                        listings.append(listing_data)
                        print(f"    ✅ Successfully generated and parsed listing {i+1}")
                        break # Success, move to next listing
                    else:
                        print(f"    ⚠️ Warning: Generated listing {i+1} (Attempt {attempt+1}) missing required keys. Output: {generated_text}")

                except json.JSONDecodeError as json_e:
                    print(f"    ⚠️ Warning: Could not decode JSON for listing {i+1} (Attempt {attempt+1}). Error: {json_e}. Output: {generated_text}")
                except ValueError as val_e:
                    print(f"    ⚠️ Warning: Error processing price for listing {i+1} (Attempt {attempt+1}). Error: {val_e}. Output: {generated_text}")

            except Exception as e:
                print(f"    🛑 Error generating listing {i+1} (Attempt {attempt+1}): {e}")

            # Reached on the last attempt only when it did not break, i.e. every retry failed
            if attempt == retries - 1:
                print(f"    🛑 Failed to generate valid listing {i+1} after {retries} attempts.")

    return listings
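
The parsing steps inside `generate_listings` (remove fences, decode JSON, check keys, normalize the price) can be isolated and exercised without calling the LLM. The following is a minimal sketch under that assumption; `parse_listing_output` and `REQUIRED_KEYS` are hypothetical names, not part of the notebook.

```python
import json

REQUIRED_KEYS = ["neighborhood", "price", "bedrooms", "bathrooms",
                 "house_size_sqft", "description", "neighborhood_description"]

def parse_listing_output(raw_text):
    """Return a listing dict, or None if the text is not a valid listing."""
    text = raw_text.strip()
    # removeprefix/removesuffix (Python 3.9+) remove exact substrings,
    # whereas str.strip("```json") strips any of those characters.
    text = text.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not all(key in data for key in REQUIRED_KEYS):
        return None
    if isinstance(data["price"], str):
        try:
            data["price"] = int(data["price"].replace(",", "").replace("$", ""))
        except ValueError:
            return None
    return data

# A fenced response with a string price, as an LLM might return it.
fenced = (
    '```json\n'
    '{"neighborhood": "Oak Hill", "price": "$500,000", "bedrooms": 3, '
    '"bathrooms": 2, "house_size_sqft": 1800, "description": "d", '
    '"neighborhood_description": "n"}\n'
    '```'
)
listing = parse_listing_output(fenced)
print(listing["price"])
```

Returning `None` for every failure mode keeps the retry loop simple: the caller only needs to check whether a listing came back.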

In [12]:
# Function to save listings to a file
def save_listings(listings, filename=LISTINGS_FILE):
    """Saves the list of listing dictionaries to a JSON file."""
    try:
        with open(filename, 'w') as f:
            json.dump(listings, f, indent=4)
        print(f"✅ Successfully saved {len(listings)} listings to {filename}")
    except IOError as e:
        print(f"🛑 Error saving listings to {filename}: {e}")

In [None]:
# Check if listings file already exists
if os.path.exists(LISTINGS_FILE):
    print(f"'{LISTINGS_FILE}' already exists. Loading listings from file.")
    try:
        with open(LISTINGS_FILE, 'r') as f:
            generated_listings = json.load(f)
        print(f"✅ Loaded {len(generated_listings)} listings from {LISTINGS_FILE}.")
        if len(generated_listings) < 10:
            print("⚠️ Warning: Existing file has fewer than 10 listings. Consider regenerating.")
    except (IOError, json.JSONDecodeError) as e:
        print(f"🛑 Error loading {LISTINGS_FILE}: {e}. Will attempt to regenerate.")
        generated_listings = generate_listings(num_listings=10)
        if generated_listings:
            save_listings(generated_listings)
        else:
            print("🛑 Failed to generate listings. Exiting.")
            exit()
else:
    print(f"'{LISTINGS_FILE}' not found. Generating new listings...")
    generated_listings = generate_listings(num_listings=10)
    if generated_listings:
        save_listings(generated_listings)
    else:
        print("🛑 Failed to generate listings. Exiting.")
        exit()
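
The cell above follows a load-or-generate pattern: reuse the cached file when it is readable, otherwise generate and save. A minimal sketch of the same flow as one function, with the generator passed in so it can be tested with a stub instead of an API call. `load_or_generate` and `fake_generate` are hypothetical names.

```python
import json
import os
import tempfile

def load_or_generate(path, generate_fn):
    """Load listings from path if possible; otherwise generate and save them."""
    if os.path.exists(path):
        try:
            with open(path, "r") as f:
                return json.load(f)
        except (IOError, json.JSONDecodeError):
            pass  # Unreadable cache: fall through and regenerate
    listings = generate_fn()
    if listings:
        with open(path, "w") as f:
            json.dump(listings, f, indent=4)
    return listings

# Stub generator that records how often it is called.
calls = []
def fake_generate():
    calls.append(1)
    return [{"id": "listing_001"}]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "listings.json")
    first = load_or_generate(demo_path, fake_generate)   # generates and saves
    second = load_or_generate(demo_path, fake_generate)  # loads from disk
```

The second call never reaches the generator, which is the property that saves API usage when the notebook is re-run.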


# Storing Listings in Vector DB

In [14]:
def setup_vector_db_and_store_listings(listings, embeddings, persist_directory=VECTORDB_DIR, collection_name=COLLECTION_NAME):
    """Initializes ChromaDB, creates documents, generates embeddings, and stores them."""
    print(f"Setting up vector database at: {persist_directory}")
    print(f"Using collection: {collection_name}")

    # Create Chroma client with persistence
    # client = chromadb.PersistentClient(path=persist_directory)

    # Check if collection already exists before trying to add potentially duplicate data
    # vector_store = Chroma(
    #     client=client,
    #     collection_name=collection_name,
    #     embedding_function=embeddings,
    #     persist_directory=persist_directory # Redundant if client is persistent? Check docs.
    # )

    # Simplified initialization using Chroma.from_documents which handles client/persistence
    # It's generally better to create the collection and add documents separately for idempotency

    # Prepare documents for LangChain/Chroma
    documents = []
    doc_ids = []
    print(f"Preparing {len(listings)} documents for embedding...")
    for listing in listings:
        # Combine relevant text fields for semantic meaning
        content = f"Neighborhood: {listing.get('neighborhood', 'N/A')}\n"
        content += f"Price: ${listing.get('price', 0):,}\n"
        content += f"Size: {listing.get('house_size_sqft', 0)} sqft, {listing.get('bedrooms', 0)} bed, {listing.get('bathrooms', 0)} bath\n"
        content += f"Description: {listing.get('description', '')}\n"
        content += f"Neighborhood Overview: {listing.get('neighborhood_description', '')}"

        # Metadata should contain all original fields for retrieval
        metadata = listing.copy() # Use the whole listing dict as metadata

        doc = Document(page_content=content, metadata=metadata)
        documents.append(doc)
        doc_ids.append(listing['id']) # Use the unique ID assigned in generate_listings

    # Initialize Chroma with persistence and add documents
    # Use from_documents for initial creation or if you want it to handle everything.
    # However, for more control over updates/duplicates, manage the client and collection explicitly.

    print(f"Initializing/loading ChromaDB collection '{collection_name}' at '{persist_directory}'...")
    try:
        vector_store = Chroma(
            collection_name=collection_name,
            embedding_function=embeddings,
            persist_directory=persist_directory
        )

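        # Check existing IDs to avoid duplicates.
        # get() with no arguments returns every stored entry; its first positional
        # parameter is a list of IDs to fetch, so passing ["ids"] would match nothing.
        existing_ids = set(vector_store.get()["ids"])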
        print(f"Found {len(existing_ids)} existing documents in the collection.")

        docs_to_add = []
        ids_to_add = []
        for doc, doc_id in zip(documents, doc_ids):
            if doc_id not in existing_ids:
                docs_to_add.append(doc)
                ids_to_add.append(doc_id)

        if docs_to_add:
            print(f"Adding {len(docs_to_add)} new documents to the collection...")
            vector_store.add_documents(documents=docs_to_add, ids=ids_to_add)
            vector_store.persist() # Only needed for chromadb < 0.4; newer versions persist automatically
            print(f"✅ Added {len(docs_to_add)} new documents.")
        else:
            print("✅ No new documents to add. Collection is up-to-date.")

        # Verify count (may not be perfectly instant after add?)
        # count = vector_store._collection.count()
        # print(f"Collection now contains {count} documents.")

        print("✅ Vector store initialized and populated successfully.")
        return vector_store

    except Exception as e:
        print(f"🛑 Error setting up or populating vector store: {e}")
        # Consider more specific error handling (permissions, API keys, etc.)
        return None
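
The duplicate check in the function above reduces to a set-membership filter on listing IDs. A minimal sketch with plain dictionaries standing in for documents and a set standing in for the collection's stored IDs; `select_new_listings` is a hypothetical helper.

```python
def select_new_listings(listings, existing_ids):
    """Return only the listings whose 'id' is not already stored."""
    seen = set(existing_ids)
    new_listings = []
    for listing in listings:
        if listing["id"] not in seen:
            new_listings.append(listing)
            seen.add(listing["id"])  # Also guards against repeats within the batch
    return new_listings

stored_ids = {"listing_001", "listing_002"}  # Simulated contents of the collection
batch = [{"id": "listing_002"}, {"id": "listing_003"}, {"id": "listing_003"}]
to_add = select_new_listings(batch, stored_ids)
print([listing["id"] for listing in to_add])
```

Because the filter depends only on IDs, re-running the notebook with the same listings file adds nothing new, provided the IDs are stable across runs.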

In [15]:
if generated_listings:
    vector_db = setup_vector_db_and_store_listings(generated_listings, embeddings_model)
else:
    print("🛑 No listings generated or loaded, skipping vector DB setup.")
    vector_db = None

Setting up vector database at: chroma_db
Using collection: real_estate_listings
Preparing 4 documents for embedding...
Initializing/loading ChromaDB collection 'real_estate_listings' at 'chroma_db'...


Found 0 existing documents in the collection.
Adding 4 new documents to the collection...
✅ Added 4 new documents.
✅ Vector store initialized and populated successfully.



# Defining Buyer Preferences

In [17]:
# Hardcoded example buyer preferences based on Instructions.txt
buyer_preferences = {
    "house_size_description": "A comfortable three-bedroom house with a spacious kitchen and a cozy living room.",
    "important_factors": "A quiet neighborhood, good local schools, and convenient shopping options.",
    "desired_amenities": "A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.",
    "transportation_needs": "Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads.",
    "neighborhood_vibe": "A balance between suburban tranquility and access to urban amenities like restaurants and theaters."
}

In [18]:
# Combine preferences into a single string for semantic search query
buyer_preference_summary = (
    f"Looking for: {buyer_preferences['house_size_description']}. "
    f"Key factors: {buyer_preferences['important_factors']}. "
    f"Wants amenities like: {buyer_preferences['desired_amenities']}. "
    f"Needs transportation options: {buyer_preferences['transportation_needs']}. "
    f"Prefers neighborhood vibe: {buyer_preferences['neighborhood_vibe']}"
)
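
Each preference answer already ends with a period, so joining them with `". "` yields doubled periods in the query, as the printed summary shows. One possible way to avoid that, sketched with a hypothetical helper `summarize_preferences` and made-up sample answers:

```python
def summarize_preferences(labelled_answers):
    """Join (label, answer) pairs into one query string with single periods."""
    parts = []
    for label, answer in labelled_answers:
        # Drop trailing periods and whitespace, then add exactly one period back.
        parts.append(f"{label}: {answer.strip().rstrip('.')}.")
    return " ".join(parts)

# Hypothetical sample answers, one with and one without a trailing period.
sample = [
    ("Looking for", "A quiet street."),
    ("Key factors", "Good schools"),
]
summary = summarize_preferences(sample)
print(summary)
```

Whether the doubled periods measurably affect the embedding is not something this notebook tests; the sketch only tidies the query text.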

In [19]:
print("Defined Buyer Preferences:")
for key, value in buyer_preferences.items():
    print(f"  - {key.replace('_', ' ').title()}: {value}")
print("\nCombined Preference Summary for Search:")
print(buyer_preference_summary)

Defined Buyer Preferences:
  - House Size Description: A comfortable three-bedroom house with a spacious kitchen and a cozy living room.
  - Important Factors: A quiet neighborhood, good local schools, and convenient shopping options.
  - Desired Amenities: A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.
  - Transportation Needs: Easy access to a reliable bus line, proximity to a major highway, and bike-friendly roads.
  - Neighborhood Vibe: A balance between suburban tranquility and access to urban amenities like restaurants and theaters.

Combined Preference Summary for Search:
Looking for: A comfortable three-bedroom house with a spacious kitchen and a cozy living room.. Key factors: A quiet neighborhood, good local schools, and convenient shopping options.. Wants amenities like: A backyard for gardening, a two-car garage, and a modern, energy-efficient heating system.. Needs transportation options: Easy access to a reliable bus line, proxi

# Searching Based on Preferences

In [20]:
def search_listings(vector_store, query, k=3):
    """Performs semantic search on the vector store based on the query."""
    if not vector_store:
        print("🛑 Vector store not available. Cannot perform search.")
        return []

    print(f"Searching for top {k} listings matching the query:")
    print(f"  Query: {query[:100]}..." ) # Print truncated query

    try:
        # Perform similarity search with scores
        results_with_scores = vector_store.similarity_search_with_score(query, k=k)

        if not results_with_scores:
            print("  No matching listings found.")
            return []

        print(f"\nFound {len(results_with_scores)} matching listings:")
        matched_listings = []
        for i, (doc, score) in enumerate(results_with_scores):
            # The metadata contains the original listing dictionary
            listing_data = doc.metadata
            print(f"  {i+1}. Listing ID: {listing_data.get('id', 'N/A')}")
            print(f"     Neighborhood: {listing_data.get('neighborhood', 'N/A')}")
            print(f"     Price: ${listing_data.get('price', 0):,}")
            print(f"     Specs: {listing_data.get('bedrooms', 0)} bed, {listing_data.get('bathrooms', 0)} bath, {listing_data.get('house_size_sqft', 0)} sqft")
            # print(f"     Description: {listing_data.get('description', '')[:100]}...") # Optionally show snippet
            print(f"     Similarity Score: {score:.4f}") # This is a distance (Chroma defaults to L2), so lower means more similar
            print("---")
            matched_listings.append(listing_data) # Return the full listing data

        return matched_listings # Return the list of matched listing dictionaries

    except Exception as e:
        print(f"🛑 Error during similarity search: {e}")
        return []
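
The score returned by `similarity_search_with_score` is best read as a distance: smaller values indicate closer matches. A minimal sketch of that ranking on toy two-dimensional vectors, assuming the collection uses Chroma's default squared-L2 metric. The vectors and IDs are made up for illustration, and real embeddings have far more dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

query = [1.0, 0.0]
docs = {
    "listing_a": [0.9, 0.1],  # Nearly the same direction as the query
    "listing_b": [0.0, 1.0],  # Orthogonal to the query
    "listing_c": [0.5, 0.5],  # In between
}

# Sort ascending: the smallest distance is the best match.
ranked = sorted(docs, key=lambda doc_id: squared_l2(query, docs[doc_id]))
print(ranked)
```

This matches the ordering in the search output above, where the first result carries the lowest score.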

In [21]:
top_listings = []
if vector_db:
    top_listings = search_listings(vector_db, buyer_preference_summary, k=3)
else:
    print("🛑 Skipping search because vector database setup failed or was skipped.")

Searching for top 3 listings matching the query:
  Query: Looking for: A comfortable three-bedroom house with a spacious kitchen and a cozy living room.. Key ...

Found 3 matching listings:
  1. Listing ID: listing_005
     Neighborhood: Eco-Conscious Community
     Price: $700,000
     Specs: 3 bed, 2 bath, 1800 sqft
     Similarity Score: 0.3396
---
  2. Listing ID: listing_006
     Neighborhood: Historic Neighborhood
     Price: $900,000
     Specs: 4 bed, 2.5 bath, 2800 sqft
     Similarity Score: 0.3475
---
  3. Listing ID: listing_008
     Neighborhood: Affordable Starter Home Area
     Price: $350,000
     Specs: 2 bed, 1 bath, 1000 sqft
     Similarity Score: 0.3731
---


# Personalizing Listing Descriptions

In [25]:
personalization_template = """
Rewrite the following real estate listing description to highlight aspects that match the buyer's preferences, while keeping all factual information (like price, size, number of beds/baths, specific features mentioned) the same. Make it sound appealing to this specific buyer.

Buyer Preferences Summary:
{buyer_preferences}

Original Listing Description:
{original_description}

Neighborhood Description (for context):
{neighborhood_description}

Personalized Listing Description:
"""

personalization_prompt = PromptTemplate(
    input_variables=["buyer_preferences", "original_description", "neighborhood_description"],
    template=personalization_template
)

personalization_chain = personalization_prompt | llm

def personalize_listing_descriptions(matched_listings, buyer_prefs_summary):
    """Generates personalized descriptions for matched listings using the LLM."""
    if not matched_listings:
        print("No matched listings to personalize.")
        return []  # Keep the return type consistent with the success path

    print(f"Personalizing descriptions for {len(matched_listings)} matched listings...")
    personalized_results = []

    for i, listing in enumerate(matched_listings):
        print(f"\n--- Personalizing Listing {i+1} (ID: {listing.get('id', 'N/A')}) ---")
        original_desc = listing.get('description', '')
        neighborhood_desc = listing.get('neighborhood_description', '')

        print(f"Original Description:\n{original_desc}\n")

        try:
            # Generate personalized description
            personalized_desc = personalization_chain.invoke({
                "buyer_preferences": buyer_prefs_summary,
                "original_description": original_desc,
                "neighborhood_description": neighborhood_desc
            })
            print(f"Personalized Description:\n{personalized_desc.strip()}\n")
            personalized_results.append({
                "listing_id": listing.get('id', 'N/A'),
                "original": original_desc,
                "personalized": personalized_desc.strip()
            })
        except Exception as e:
            print(f"🛑 Error personalizing description for listing {listing.get('id', 'N/A')}: {e}")
            personalized_results.append({
                "listing_id": listing.get('id', 'N/A'),
                "original": original_desc,
                "personalized": "Error during personalization."
            })

    return personalized_results
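
The prompt asks the model to keep factual details unchanged, but nothing in the function verifies that. A rough sanity check, sketched here as an assumption rather than part of the original project, is to confirm that every number in the original description still appears in the rewrite. `numbers_in` and `facts_preserved` are hypothetical helpers, and the sample texts are made up.

```python
import re

def numbers_in(text):
    """Extract numeric tokens such as 3, 2.5 or 700,000 (commas removed)."""
    return {m.replace(",", "") for m in re.findall(r"\d[\d,]*(?:\.\d+)?", text)}

def facts_preserved(original, personalized):
    """True if every number in the original also appears in the rewrite."""
    return numbers_in(original) <= numbers_in(personalized)

original = "This house has 3 bedrooms and 2.5 baths for $700,000."
good_rewrite = "For $700,000 you get 3 cozy bedrooms and 2.5 baths."
bad_rewrite = "Enjoy 4 bedrooms and 2.5 baths for $700,000."

print(facts_preserved(original, good_rewrite))
print(facts_preserved(original, bad_rewrite))
```

This only catches changed or dropped numbers; facts written as words (for example "three bedrooms") or non-numeric features would need a different check.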

In [26]:
if top_listings:
    personalization_results = personalize_listing_descriptions(top_listings, buyer_preference_summary)
else:
    personalization_results = []  # Keep the name defined for any later cells
    print("🛑 Skipping personalization as no top listings were found or search failed.")

Personalizing descriptions for 3 matched listings...

--- Personalizing Listing 1 (ID: listing_005) ---
Original Description:
Welcome to your dream home in the heart of the Eco-Conscious Community! This stunning house boasts 3 spacious bedrooms and 2 modern bathrooms, perfect for a family looking to live in a sustainable and environmentally friendly neighborhood.

Personalized Description:
Welcome to your perfect family home in the idyllic Eco-Conscious Community! This charming house features 3 cozy bedrooms and 2 modern bathrooms, ideal for a family seeking a sustainable and environmentally friendly lifestyle.

Neighborhood Description (for context):
Indulge in the peaceful atmosphere of this green community, surrounded by like-minded individuals. Take a leisurely stroll along the calm, tree-lined streets and admire the beautifully landscaped gardens. Enjoy the convenience of local amenities, including a community center, organic market, and parks, all just a short walk away. Embrace 