# The price of state - NOW RAG

| Phase       | What we build                   |
| ----------- | -------------------------------- |
| Phase 1     | RAG for comparables              |
| Phase 2     | ML + RAG blended estimator       |
| Phase 3     | Agent + strategy                 |
| Phase 4     | UI                               |


This project demonstrates an agentic pricing assistant for Airbnb hosts that combines:

- Price estimation based on machine learning (ML)
- Retrieval-Augmented Generation (RAG) over similar listings
- LLM reasoning to build a pricing strategy and explain the recommendation

The system is designed as a decision-support tool rather than a pure prediction model, with a focus on transparency, market fit, and practical applicability.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sentence_transformers import SentenceTransformer

In [2]:
# load dataset from phase 2
df_phase3 = pd.read_pickle("/Users/macos/projects/aicourse/week1/df_phase2.pkl")

In [3]:
# 1. Choose what goes into embeddings
EMBED_TEXT_COLS = [
    "final_text_clean"
]

df_rag = df_phase3[
    ["id", "price", "latitude", "longitude", "room_type", "property_type", "final_text_clean"]
].dropna(subset=["final_text_clean", "price"])

## Vector Retrieval (Chroma + SentenceTransformer)

all-MiniLM is a very useful model from HuggingFace that maps sentences and paragraphs to 384-dimensional vectors, which makes it well suited to tasks such as semantic search.

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

The model runs fairly quickly on a local machine.

OpenAI also offers a closed-source embeddings model. Compared with OpenAI's embeddings, all-MiniLM has some advantages:

- It is free.
- It runs entirely on a local machine, so your data never leaves your system. This is especially useful if you are building a personal RAG system.
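To make "semantic search" concrete, here is a minimal sketch of what happens at query time: stored vectors are ranked by their similarity to the query vector. Cosine similarity is used here as one common choice; it is not a claim about the metric the vector store in this notebook is configured with. The helper names (`cosine_similarity`, `top_k`) and the toy 3-dimensional vectors are illustrative assumptions standing in for real 384-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=2):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked[:k]]

# Toy 3-d "embeddings" (hypothetical; real all-MiniLM vectors have 384 dims)
toy_store = [
    ("condo_near_bts", [0.9, 0.1, 0.0]),
    ("hostel_bunk_bed", [0.0, 0.2, 0.9]),
    ("studio_near_bts", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]

nearest = top_k(query, toy_store, k=2)
print(nearest)
```

The two listings whose vectors point in nearly the same direction as the query rank first, while the hostel vector, which points elsewhere, is left out.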

In [4]:
# 2. SentenceTransformer encoder

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

In [5]:
import chromadb
from tqdm import tqdm

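# Initialize the ChromaDB client with persistent on-disk storage.
# Named chroma_client so the OpenAI client created later does not shadow it.
chroma_client = chromadb.PersistentClient(path="vectorstore")
collection = chroma_client.get_or_create_collection("bangkok_airbnb")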

In [None]:
# Embed the data and store it in ChromaDB (run once)

BATCH = 512

for i in tqdm(range(0, len(df_rag), BATCH)):
    batch = df_rag.iloc[i:i+BATCH]

    documents = batch["final_text_clean"].tolist()
    embeddings = encoder.encode(documents).astype(float).tolist()

    metadatas = [
        {
            "price": float(p),
            "room_type": rt,
            "property_type": pt,
            "lat": lat,
            "lon": lon
        }
        for p, rt, pt, lat, lon in zip(
            batch.price,
            batch.room_type,
            batch.property_type,
            batch.latitude,
            batch.longitude
        )
    ]

    ids = [f"listing_{idx}" for idx in batch.index]

    collection.add(
        documents=documents,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids
    )
# --> Now we have a semantic Airbnb market memory

In [6]:
# Retrieval: "Find comparable listings"
from functools import lru_cache

@lru_cache(maxsize=1024)
def encode_text(text: str):
    return encoder.encode([text])[0]

def retrieve_similar_listings(text, k=7):
    vec = encode_text(text)
    results = collection.query(
        query_embeddings=[vec.astype(float).tolist()],
        n_results=k
    )
    docs = results["documents"][0]
    prices = [m["price"] for m in results["metadatas"][0]]
    return docs, np.array(prices)

In [None]:
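# Test it. Note: this listing is itself stored in the collection,
# so it will normally show up among its own nearest neighbours.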
docs, prices = retrieve_similar_listings(df_rag.iloc[0].final_text_clean)
prices

In [8]:
# 3. Turn retrieval into a RAG prompt: build the context for the LLM

def make_airbnb_context(docs, prices):
    msg = "Here are similar Airbnb listings in Bangkok:\n\n"
    for d, p in zip(docs, prices):
        msg += f"Listing description:\n{d[:300]}...\nPrice: {p:.0f} THB/night\n\n"
    return msg

In [9]:
# The Airbnb Price RAG Agent (LLM) - Prompt for LLM

def messages_for_listing(listing_text, docs, prices):
    msg = (
        "Estimate a reasonable nightly Airbnb price range in THB.\n"
        "Respond in JSON with EXACTLY this format:\n"
        "{\n"
        '  "low": number,\n'
        '  "mid": number,\n'
        '  "high": number\n'
        "}\n"
        "Do NOT include any text, explanation, or currency symbols.\n\n"
        "Target listing:\n"
        f"{listing_text}\n\n"
    )

    msg += make_airbnb_context(docs, prices)

    return [{"role": "user", "content": msg}]

In [10]:
def price_stats(prices):
    return {
        "p25": np.percentile(prices, 25),
        "median": np.median(prices),
        "p75": np.percentile(prices, 75)
    }

In [11]:
# Call the LLM (OpenAI / Ollama / GPT-4o-mini / llama): RAG agent that combines retrieval + LLM
from openai import OpenAI
import os
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv() 

client = OpenAI()

def airbnb_price_rag(listing_text):
    docs, prices = retrieve_similar_listings(listing_text)
    messages = messages_for_listing(listing_text, docs, prices)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        # reasoning_effort="none"
    )

    return response.choices[0].message.content

In [None]:
# Take one listing to test
test_listing = df_rag.iloc[0]
test_text = test_listing["final_text_clean"]
actual_price = test_listing["price"]

print("=" * 60)
print("LISTING DESCRIPTION:")
print(test_text[:300] + "...")
print(f"\n💰 ACTUAL PRICE: {actual_price:.0f} THB/night")
print("=" * 60)

# Find similar listings
print("\n🔍 FINDING 5 SIMILAR LISTINGS:")
docs, prices = retrieve_similar_listings(test_text, k=5)
for i, (doc, price) in enumerate(zip(docs, prices), 1):
    print(f"\n{i}. Price: {price:.0f} THB")
    print(f"   Description: {doc[:150]}...")

# Predict the price with RAG
print("\n" + "=" * 60)
print("🤖 AI IS PREDICTING THE PRICE...")
predicted_price = airbnb_price_rag(test_text)
print(f"💡 PREDICTED PRICE: {predicted_price} THB/night")
print(f"💰 ACTUAL PRICE: {actual_price:.0f} THB/night")

import json

try:
    pred_json = json.loads(predicted_price)

    low = float(pred_json["low"])
    mid = float(pred_json["mid"])
    high = float(pred_json["high"])

    # percentage error relative to the mid price
    error = abs(mid - actual_price) / actual_price * 100

    print(f"📊 LOW: {low:.0f} | MID: {mid:.0f} | HIGH: {high:.0f}")
    print(f"📊 ERROR (MID): {error:.1f}%")

except Exception as e:
    print("⚠️  Could not parse JSON:", e)

# Phase 2: Blending ML + RAG

| ML side                            | RAG side                                  |
|------------------------------------|-------------------------------------------|
| model_tfidf → predicts log(price)  | retrieve_similar_listings(text)           |
| MAE ≈ 592 THB                      | Comparable prices list: [p1, p2, ..., pk] |

In [None]:
# --- Bring back tf-idf + Ridge 

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error

# Core sklearn pipeline
from sklearn.pipeline import Pipeline

# Preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Text vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

y = np.log1p(df_phase3["price"])

## 1. Numeric feature preparation
# select only features that carry economic meaning
NUMERIC_COLS = [
    "accommodates",
    "bedrooms",
    "beds",
    "minimum_nights",
    "number_of_reviews",
    "reviews_per_month",
    "latitude",
    "longitude"
]

df_num = df_phase3[NUMERIC_COLS].copy()

# numeric process pipeline
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

## 2. text data
texts = df_phase3["final_text_clean"].values

X_num_train, X_num_test, X_txt_train, X_txt_test, y_train, y_test = train_test_split(
    df_num,
    texts,
    y,
    test_size=0.2,
    random_state=42
)

X_num_train_proc = numeric_pipeline.fit_transform(X_num_train)
X_num_test_proc  = numeric_pipeline.transform(X_num_test)

baseline_model = Ridge(alpha=1.0)
baseline_model.fit(X_num_train_proc, y_train)

y_pred = baseline_model.predict(X_num_test_proc)
mae_numeric = mean_absolute_error(
    np.expm1(y_test),
    np.expm1(y_pred)
)

print(f"Numeric only MAE: {mae_numeric:,.0f} THB")

tfidf = TfidfVectorizer(
    max_features=8000,
    ngram_range=(1, 2),
    stop_words="english",
    min_df=5
)

X_txt_train_tfidf = tfidf.fit_transform(X_txt_train)
X_txt_test_tfidf  = tfidf.transform(X_txt_test)

# combine numeric + text
from scipy.sparse import hstack

X_train_all = hstack([X_num_train_proc, X_txt_train_tfidf])
X_test_all  = hstack([X_num_test_proc,  X_txt_test_tfidf])

model_tfidf = Ridge(alpha=1.0)
model_tfidf.fit(X_train_all, y_train)

y_pred = model_tfidf.predict(X_test_all)

mae_tfidf = mean_absolute_error(
    np.expm1(y_test),
    np.expm1(y_pred)
)

print(f"Numeric + TF-IDF MAE: {mae_tfidf:,.0f} THB")

In [13]:
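def ml_price_predict(listing_text, numeric_df):
    # Accept either a single row (Series) or a one-row DataFrame
    if isinstance(numeric_df, pd.Series):
        numeric_df = numeric_df.to_frame().T

    # Enforce the column order the pipeline was fitted on
    num_proc = numeric_pipeline.transform(numeric_df[NUMERIC_COLS])
    txt_vec = tfidf.transform([listing_text])
    X = hstack([num_proc, txt_vec])
    log_pred = model_tfidf.predict(X)[0]
    # The target was log1p(price), so expm1 maps the prediction back to THB
    return np.expm1(log_pred)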

In [14]:
# Comparable-based distribution

def rag_price_stats(listing_text):
    _, prices = retrieve_similar_listings(listing_text)
    
    return {
        "low": np.percentile(prices, 25),
        "mid": np.median(prices),
        "high": np.percentile(prices, 75),
        "mean": prices.mean(),
        "std": prices.std()
    }

In [None]:
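# Why the RAG stats rely on the median: a single outlier (5000) pulls the
# mean up to 2075, while the median stays at 1150.
ex_prices = [1000, 1100, 1200, 5000]

mean_price = np.mean(ex_prices)
median_price = np.median(ex_prices)

mean_price, median_price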

In [None]:
#---------------

## Core idea: ML = anchor, RAG = correction

The mental model:

- ML says where the market usually is.
- RAG says where this listing fits locally.
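The anchor-and-correction idea can be made concrete with a small worked example. This is a sketch under stated assumptions: `blend_mid` is a hypothetical helper, the two prices are invented, and the 0.6 weight simply mirrors the default used in this notebook rather than a tuned value.

```python
def blend_mid(ml_price, rag_median, w_ml=0.6):
    """Weighted average of the ML anchor and the RAG median."""
    return w_ml * ml_price + (1 - w_ml) * rag_median

ml_price = 1500.0    # hypothetical ML estimate (THB)
rag_median = 2000.0  # hypothetical median of retrieved comparables (THB)

# w_ml = 1.0 trusts only the model, w_ml = 0.0 trusts only the comparables
for w in (1.0, 0.6, 0.0):
    blended = blend_mid(ml_price, rag_median, w)
    print(f"w_ml={w:.1f} -> blended mid = {blended:.0f} THB")
```

With `w_ml=0.6` the comparables pull the 1,500 THB anchor up to 1,700 THB: the correction moves the estimate 40% of the way toward the local median.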

In [16]:
# Weighted average

def blend_prices(ml_price, rag_stats, w_ml=0.6):
    w_rag = 1 - w_ml
    
    blended_mid = (
        w_ml * ml_price +
        w_rag * rag_stats["mid"]
    )
    
    return {
        "low": max(0, blended_mid - rag_stats["std"]),
        "mid": blended_mid,
        "high": blended_mid + rag_stats["std"]
    }

In [17]:
# pricing function

def price_phase2(listing_text, numeric_row):
    ml_price = ml_price_predict(listing_text, numeric_row)
    rag_stats = rag_price_stats(listing_text)
    
    blended = blend_prices(ml_price, rag_stats)
    
    return {
        "ml_price": ml_price,
        "rag_median": rag_stats["mid"],
        "final_range": blended
    }

In [None]:
# Test blending

print("\n" + "=" * 80)
print("Testing ML + RAG Blending")
print("=" * 80)

test_row = df_phase3.iloc[10]
test_result = price_phase2(
    listing_text=test_row.final_text_clean,
    numeric_row=test_row[NUMERIC_COLS]
)

print(f"\nML prediction: {test_result['ml_price']:,.0f} THB")
print(f"RAG median: {test_result['rag_median']:,.0f} THB")
print(f"Final range: {test_result['final_range']['low']:,.0f} - {test_result['final_range']['high']:,.0f} THB")
print(f"   (recommended: {test_result['final_range']['mid']:,.0f} THB)")
print(f"\nActual price: {test_row.price:,.0f} THB")


# Phase 3: Agent

“How should we price given goals and market context?”

We model pricing as a decision problem, not a regression task.
ML + RAG generate market intelligence, and an agent applies pricing strategy based on host goals and market volatility.
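As a rough illustration of pricing as a decision rather than a regression output, the sketch below first labels the gap between the ML estimate and the comparable median, then maps that label to a point in a price range. The names (`pick_strategy`, `strategy_to_price`), the 10% band, and all numbers are assumptions for illustration. The range is held fixed here to keep the example small, whereas in practice it would be recomputed for each listing.

```python
def pick_strategy(ml_price, rag_median, band=0.10):
    """Label the gap between the ML estimate and the comparable median."""
    if ml_price < rag_median * (1 - band):
        return "aggressive"
    if ml_price > rag_median * (1 + band):
        return "premium"
    return "balanced"

# Hypothetical blended range (THB) and the point each strategy selects
price_range = {"low": 1400, "mid": 1700, "high": 2000}
strategy_to_price = {"aggressive": "low", "balanced": "mid", "premium": "high"}

decisions = {}
for ml_price in (1500, 1950, 2400):
    strategy = pick_strategy(ml_price, rag_median=2000)
    decisions[ml_price] = (strategy, price_range[strategy_to_price[strategy]])
    print(f"ML={ml_price} THB -> {strategy} -> {decisions[ml_price][1]} THB")
```

Separating the label from the price lookup keeps the rule easy to audit: the first step explains why a strategy was chosen, the second shows what it costs.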

In [35]:
#---------------

In [18]:
# decision engine 

def choose_strategy(market):
    ml = market["ml_price"]
    mid = market["rag_median"]

    if ml < mid * 0.9:
        return "aggressive", "ML price is significantly below market median."
    elif ml > mid * 1.1:
        return "premium", "ML price is well above comparable listings."
    else:
        return "balanced", "ML price aligns closely with market median."

In [19]:
# apply strategy

def apply_strategy(market, strategy):
    r = market["final_range"]
    return {
        "aggressive": r["low"],
        "balanced": r["mid"],
        "premium": r["high"]
    }[strategy]

In [20]:
# decision engine class

class DecisionEngine:
    def price(self, listing_text, numeric_row):
        ml_price = ml_price_predict(listing_text, numeric_row)
        rag_stats = rag_price_stats(listing_text)


        # Reuse the Phase 2 blend (60% ML, 40% RAG median by default)
        market = {
            "ml_price": ml_price,
            "rag_median": rag_stats["mid"],
            "final_range": blend_prices(ml_price, rag_stats),
        }


        strategy, rule_reason = choose_strategy(market)
        final_price = apply_strategy(market, strategy)


        return {
            "recommended_price": round(final_price),
            "strategy": strategy,
            "rule_reasoning": rule_reason,
            "market": market
        }

In [21]:
from openai import OpenAI
client = OpenAI()


class ExplanationEngine:
    def explain(self, decision):
        prompt = f"""
Explain this Airbnb pricing decision clearly to a host.


Strategy: {decision['strategy']}
ML estimate: {decision['market']['ml_price']:.0f} THB
Market median: {decision['market']['rag_median']:.0f} THB
Range: {decision['market']['final_range']['low']:.0f}–{decision['market']['final_range']['high']:.0f} THB


Do NOT change the price.
Keep it short and practical.
"""

        try:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                timeout=5
            )
            return resp.choices[0].message.content
        except Exception:
            # Fall back to the rule-based reason if the LLM call fails or times out
            return decision["rule_reasoning"]

In [22]:
# complete agent

class AirbnbPricingAgent:
    def __init__(self, explain=True):
        self.decision_engine = DecisionEngine()
        self.explainer = ExplanationEngine() if explain else None

    def price(self, listing_text, numeric_row):
        decision = self.decision_engine.price(listing_text, numeric_row)

        reasoning = (
            self.explainer.explain(decision)
            if self.explainer else decision["rule_reasoning"]
        )

        return {
            "price": decision["recommended_price"],
            "strategy": decision["strategy"],
            "reasoning": reasoning,
            "market": decision["market"]
        }

In [None]:
agent = AirbnbPricingAgent()

agent.price(
    test_row.final_text_clean,
    test_row[NUMERIC_COLS]
)

In [None]:
test_row.price

# UI

In [None]:
# ui adapter function

def build_numeric_df(values: dict) -> pd.DataFrame:
    """
    Ensure numeric features match training columns & order
    """
    return pd.DataFrame(
        [[values[col] for col in NUMERIC_COLS]],
        columns=NUMERIC_COLS
    )

agent = AirbnbPricingAgent(explain=True)

def price_room_ui(
    description,
    accommodates,
    bedrooms,
    beds,
    latitude,
    longitude,
    minimum_nights,
    number_of_reviews,
    reviews_per_month,
):
    try:
        values = {
            "accommodates": accommodates,
            "bedrooms": bedrooms,
            "beds": beds,
            "latitude": latitude,
            "longitude": longitude,
            "minimum_nights": minimum_nights,
            "number_of_reviews": number_of_reviews,
            "reviews_per_month": reviews_per_month,
        }

        numeric_df = build_numeric_df(values)

        result = agent.price(description, numeric_df)

        return (
            f"💰 {result['price']:,.0f} THB / night",
            result["strategy"].capitalize(),
            result["reasoning"],
            f"""
**ML estimate:** {result['market']['ml_price']:,.0f} THB  
**Market median:** {result['market']['rag_median']:,.0f} THB  
**Range:** {result['market']['final_range']['low']:,.0f} – {result['market']['final_range']['high']:,.0f} THB
"""
        )

    except Exception as e:
        return (
            "Error",
            "-",
            str(e),
            ""
        )

In [None]:
import gradio as gr

with gr.Blocks(title="🏠 Airbnb Host Pricing Assistant") as demo:
    gr.Markdown("## 🏠 Airbnb Host Pricing Assistant")
    gr.Markdown("ML + Market pricing with AI explanations")

    description = gr.Textbox(
        label="Room Description",
        lines=6,
        placeholder="Describe your Airbnb listing..."
    )

    with gr.Row():
        accommodates = gr.Number(label="Accommodates", value=2)
        bedrooms = gr.Number(label="Bedrooms", value=1)
        beds = gr.Number(label="Beds", value=1)

    with gr.Row():
        latitude = gr.Number(label="Latitude", value=13.75)
        longitude = gr.Number(label="Longitude", value=100.50)

    with gr.Row():
        minimum_nights = gr.Number(label="Minimum nights", value=1)
        number_of_reviews = gr.Number(label="Number of reviews", value=10)
        reviews_per_month = gr.Number(label="Reviews per month", value=1.0)

    btn = gr.Button("Recommend Price 🚀")

    price_out = gr.Textbox(label="Recommended Price")
    strategy_out = gr.Textbox(label="Strategy")
    reasoning_out = gr.Textbox(label="Reasoning", lines=4)
    context_out = gr.Markdown()

    btn.click(
        price_room_ui,
        inputs=[
            description,
            accommodates, bedrooms, beds,
            latitude, longitude,
            minimum_nights,
            number_of_reviews,
            reviews_per_month,
        ],
        outputs=[
            price_out,
            strategy_out,
            reasoning_out,
            context_out
        ],
    )

demo.queue()
demo.launch()