## Learning Objective
Learn how to **reduce token costs by 6x** using prompt compression while maintaining answer quality. This lesson covers **prompt optimization** for production RAG systems.

## What We're Building
A compressed search pipeline that:
1. Performs vector search + boosting (reusing Lesson 4)
2. Compresses verbose search results using LLMLingua-2
3. Sends compressed context to GPT (instead of full results)
4. Achieves 80%+ cost savings with minimal quality loss

## The Problem
Vector search returns 20 detailed listings = **~3,000 tokens**

## The Solution
Compress results to **~500 tokens** = **6x savings!**


## Phase 1: Setup & Data Loading

In [25]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

In [26]:
import custom_utils

In [27]:
# Load 100 Airbnb listings from HuggingFace
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("MongoDB/airbnb_embeddings", streaming=True, split="train")
dataset = dataset.take(100)
# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset)
dataset_df.head(5)

Unnamed: 0,_id,listing_url,name,summary,space,description,neighborhood_overview,notes,transit,access,...,images,host,address,availability,review_scores,reviews,weekly_price,monthly_price,text_embeddings,image_embeddings
0,10006546,https://www.airbnb.com/rooms/10006546,Ribeira Charming Duplex,Fantastic duplex apartment with three bedrooms...,Privileged views of the Douro River and Ribeir...,Fantastic duplex apartment with three bedrooms...,"In the neighborhood of the river, you can find...",Lose yourself in the narrow streets and stairc...,Transport: • Metro station and S. Bento railwa...,We are always available to help guests. The ho...,...,"{'thumbnail_url': '', 'medium_url': '', 'pictu...","{'host_id': '51399391', 'host_url': 'https://w...","{'street': 'Porto, Porto, Portugal', 'suburb':...","{'availability_30': 28, 'availability_60': 47,...","{'review_scores_accuracy': 9, 'review_scores_c...","[{'_id': '58663741', 'date': 2016-01-03 05:00:...",,,"[0.0123710884, -0.0180913936, -0.016843712, -0...","[-0.1302358955, 0.1534578055, 0.0199299306, -0..."
1,10021707,https://www.airbnb.com/rooms/10021707,Private Room in Bushwick,Here exists a very cozy room for rent in a sha...,,Here exists a very cozy room for rent in a sha...,,,,,...,"{'thumbnail_url': '', 'medium_url': '', 'pictu...","{'host_id': '11275734', 'host_url': 'https://w...","{'street': 'Brooklyn, NY, United States', 'sub...","{'availability_30': 0, 'availability_60': 0, '...","{'review_scores_accuracy': 10, 'review_scores_...","[{'_id': '61050713', 'date': 2016-01-31 05:00:...",,,"[0.0153845912, -0.0348115042, -0.0093448907, 0...","[0.0340401195, 0.1742489338, -0.1572628617, 0...."
2,1001265,https://www.airbnb.com/rooms/1001265,Ocean View Waikiki Marina w/prkg,A short distance from Honolulu's billion dolla...,Great studio located on Ala Moana across the s...,A short distance from Honolulu's billion dolla...,You can breath ocean as well as aloha.,,Honolulu does have a very good air conditioned...,"Pool, hot tub and tennis",...,"{'thumbnail_url': '', 'medium_url': '', 'pictu...","{'host_id': '5448114', 'host_url': 'https://ww...","{'street': 'Honolulu, HI, United States', 'sub...","{'availability_30': 16, 'availability_60': 46,...","{'review_scores_accuracy': 9, 'review_scores_c...","[{'_id': '4765259', 'date': 2013-05-24 04:00:0...",650.0,2150.0,"[-0.0400562622, -0.0405789167, 0.000644172, 0....","[-0.1640156209, 0.1256971657, 0.6594450474, -0..."
3,10009999,https://www.airbnb.com/rooms/10009999,Horto flat with small garden,One bedroom + sofa-bed in quiet and bucolic ne...,Lovely one bedroom + sofa-bed in the living ro...,One bedroom + sofa-bed in quiet and bucolic ne...,This charming ground floor flat is located in ...,"There´s a table in the living room now, that d...","Easy access to transport (bus, taxi, car) and ...",,...,"{'thumbnail_url': '', 'medium_url': '', 'pictu...","{'host_id': '1282196', 'host_url': 'https://ww...","{'street': 'Rio de Janeiro, Rio de Janeiro, Br...","{'availability_30': 0, 'availability_60': 0, '...","{'review_scores_accuracy': None, 'review_score...",[],1492.0,4849.0,"[-0.063234821, 0.0017937823, -0.0243996996, -0...","[-0.1292964518, 0.037789464, 0.2443587631, 0.0..."
4,10047964,https://www.airbnb.com/rooms/10047964,Charming Flat in Downtown Moda,Fully furnished 3+1 flat decorated with vintag...,The apartment is composed of 1 big bedroom wit...,Fully furnished 3+1 flat decorated with vintag...,With its diversity Moda- Kadikoy is one of the...,,,,...,"{'thumbnail_url': '', 'medium_url': '', 'pictu...","{'host_id': '1241644', 'host_url': 'https://ww...","{'street': 'Kadıköy, İstanbul, Turkey', 'subur...","{'availability_30': 27, 'availability_60': 57,...","{'review_scores_accuracy': 10, 'review_scores_...","[{'_id': '68162172', 'date': 2016-04-02 04:00:...",,,"[0.023723349, 0.0064210771, -0.0339970738, -0....","[-0.1006749049, 0.4022984803, -0.1821258366, 0..."


In [28]:
print("Columns:", dataset_df.columns)

Columns: Index(['_id', 'listing_url', 'name', 'summary', 'space', 'description',
       'neighborhood_overview', 'notes', 'transit', 'access', 'interaction',
       'house_rules', 'property_type', 'room_type', 'bed_type',
       'minimum_nights', 'maximum_nights', 'cancellation_policy',
       'last_scraped', 'calendar_last_scraped', 'first_review', 'last_review',
       'accommodates', 'bedrooms', 'beds', 'number_of_reviews', 'bathrooms',
       'amenities', 'price', 'security_deposit', 'cleaning_fee',
       'extra_people', 'guests_included', 'images', 'host', 'address',
       'availability', 'review_scores', 'reviews', 'weekly_price',
       'monthly_price', 'text_embeddings', 'image_embeddings'],
      dtype='object')


In [29]:
# Process records with Pydantic validation
listings = custom_utils.process_records(dataset_df)

✅ Processed 100 listings successfully


In [30]:
# Connect to MongoDB Atlas
db, collection = custom_utils.connect_to_database()

✅ Connection to MongoDB successful
📋 Database: airbnb_dataset
📋 Collection: listings_reviews


In [31]:
# Delete any existing records in the collection
collection.delete_many({})

DeleteResult({'n': 100, 'electionId': ObjectId('7fffffff00000000000003ac'), 'opTime': {'ts': Timestamp(1759849757, 10), 't': 940}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1759849757, 10), 'signature': {'hash': b'c8\x10Id\xe6\xc5\xdf)\x92\x89\xb5$\x81\xf1\xa8^\xa5~\xff', 'keyId': 7522120776351219727}}, 'operationTime': Timestamp(1759849757, 10)}, acknowledged=True)

In [32]:
# Insert listings into MongoDB
collection.insert_many(listings)
print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed


In [33]:
# Create vector search index with filterable fields
custom_utils.setup_vector_search_index_with_filter(collection=collection)

Creating index with filters...
✅ Index 'vector_index_with_filter' created successfully: vector_index_with_filter
💡 Wait a few minutes before conducting searches


## Phase 2: Boosting Pipeline (Baseline - No Compression Yet)

In [34]:
# Define the data model for search results
from pydantic import BaseModel
from typing import Optional

class SearchResultItem(BaseModel):
    name: str
    accommodates: Optional[int] = None
    address: custom_utils.Address
    neighborhood_overview: Optional[str] = None
    notes: Optional[str] = None
    averageReviewScore: Optional[float] = None
    number_of_reviews: Optional[float] = None
    combinedScore: Optional[float] = None

In [35]:
# Stage 1: Calculate average review score from 6 review dimensions
review_average_stage = {
    "$addFields": {
        "averageReviewScore": {
            "$divide": [
                {
                    "$add": [
                        "$review_scores.review_scores_accuracy",
                        "$review_scores.review_scores_cleanliness",
                        "$review_scores.review_scores_checkin",
                        "$review_scores.review_scores_communication",
                        "$review_scores.review_scores_location",
                        "$review_scores.review_scores_value",
                    ]
                },
                6  # Divide by 6 to get the average
            ]
        },
        # Also capture review count for boosting
        "reviewCountBoost": "$number_of_reviews"
    }
}

In [36]:
# Stage 2: Combine review quality (30%) and popularity (70%) into combinedScore
weighting_stage = {
    "$addFields": {
        "combinedScore": {
            # Formula: 30% review quality + 70% popularity
            "$add": [
                {"$multiply": ["$averageReviewScore", 0.3]},  # Review quality weight
                {"$multiply": ["$reviewCountBoost", 0.7]}      # Popularity weight
            ]
        }
    }
}

In [37]:
# Stage 3: Sort by combinedScore in descending order
sorting_stage = {
    "$sort": {"combinedScore": -1}  # High scores first
}

# Combine all stages into pipeline
additional_stages = [review_average_stage, weighting_stage, sorting_stage]
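
To sanity-check the boosting math outside MongoDB, the sketch below mirrors the two `$addFields` stages in plain Python. The function names are hypothetical, and the `None` handling assumes MongoDB's behavior of returning null when an arithmetic operator receives a null or missing operand. The sample values are those of the top listing in the compressed output shown in Phase 6 (average score 9.5, 179 reviews, combined score 128.15).

```python
from typing import Optional, Sequence


def average_review_score(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Mirror of review_average_stage: mean of the six review dimensions."""
    if len(scores) != 6 or any(s is None for s in scores):
        return None  # assumed to match MongoDB: null in, null out
    return sum(scores) / 6


def combined_score(
    avg_score: Optional[float],
    review_count: Optional[float],
    quality_weight: float = 0.3,
    popularity_weight: float = 0.7,
) -> Optional[float]:
    """Mirror of weighting_stage: weighted sum of quality and popularity."""
    if avg_score is None or review_count is None:
        return None
    return avg_score * quality_weight + review_count * popularity_weight


print(round(combined_score(9.5, 179), 2))  # 128.15
```

Because the review count is not normalized, it contributes 125.3 of the 128.15 points here, so popularity dominates the ranking far more than the 30/70 weights suggest.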

In [38]:
# Baseline query handler WITHOUT compression (this is our "BEFORE" state)
from IPython.display import display, HTML

def handle_user_query(query, db, collection, stages=None, vector_index="vector_index_text"):
    """
    Baseline handler that sends FULL search results to GPT (no compression).
    This will help us see the token count BEFORE compression.
    """
    # Avoid a mutable default argument; treat None as "no extra stages"
    stages = stages or []

    # Perform vector search with boosting stages
    get_knowledge = custom_utils.vector_search_with_filter(query, db, collection, stages, vector_index)

    # Check if there are any results (could be empty list, None, or error string)
    if not get_knowledge or isinstance(get_knowledge, str):
        print(f"❌ Error or no results: {get_knowledge}")
        return "No results found.", pd.DataFrame()
    
    # Convert results to our SearchResultItem model
    search_results_models = [
        SearchResultItem(**result)
        for result in get_knowledge
    ]

    # Convert to DataFrame for display
    search_results_df = pd.DataFrame([item.dict() for item in search_results_models])

    print("=" * 80)
    print("📋 UNCOMPRESSED SEARCH RESULTS (This is what we're sending to GPT):")
    print("=" * 80)
    print(search_results_df.to_string())
    print(f"\n📊 Estimated token count: ~{len(search_results_df.to_string()) / 4:.0f} tokens")
    print("=" * 80)

    # Send full results to GPT (no compression)
    completion = custom_utils.openai.chat.completions.create(
        model="gpt-4.1",
        messages=[
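            {
                "role": "system", 
                "content": "You are an Airbnb listing recommendation system."
            },
            {
                "role": "user", 
                # Use to_string() so GPT receives the same full table printed above,
                # not the truncated DataFrame repr
                "content": f"Answer this user query: {query} with the following context:\n{search_results_df.to_string()}"
            }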
        ]
    )
    
    system_response = completion.choices[0].message.content
    print(f"\n- User Question:\n{query}\n")
    print(f"- System Response:\n{system_response}\n")
    
    return system_response, search_results_df
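
The handler above estimates tokens as characters divided by 4, which is only a rule of thumb. The sketch below wraps that heuristic in a helper and adds an optional exact count via `tiktoken`. The helper names are hypothetical, and `o200k_base` is an assumption about which encoding matches the model; check the tokenizer for the model you actually call.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return round(len(text) / chars_per_token)


def count_tokens(text: str, encoding_name: str = "o200k_base") -> int:
    """Exact count with tiktoken when available, otherwise the heuristic."""
    try:
        import tiktoken  # optional dependency

        return len(tiktoken.get_encoding(encoding_name).encode(text))
    except Exception:  # tiktoken missing, or the encoding cannot be loaded
        return estimate_tokens(text)


# 24,149 characters is the compact input length reported in Phase 6
print(estimate_tokens("x" * 24149))  # 6037
```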

## Phase 3: Understanding Prompt Compression

### Why Do We Need Compression?

When we perform vector search with boosting, we retrieve **20 detailed listings** from MongoDB. Each listing contains:
- Name, accommodates, bedrooms
- Full address with location
- Neighborhood overview (long text descriptions)
- Review scores and counts
- Combined scores

**The Problem:**
- 20 listings × ~150 tokens each = **~3,000 tokens**
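- At an illustrative input price of **$0.0015 per 1K tokens** (GPT-3.5 pricing; the code in this lesson calls `gpt-4.1`, which is priced differently), each query costs $0.0045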
- For 1 million queries: **$4,500 in token costs!**

**The Solution:**
- Use **LLMLingua-2** to compress results from 3,000 → 500 tokens (6x compression)
- Same 1 million queries: **$750 in token costs**
- **Savings: $3,750 (83% cost reduction!)**
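
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch under the same assumptions as the bullets (3,000 vs. 500 context tokens, $0.0015 per 1K input tokens); `query_cost` is a hypothetical helper, and it covers only the context tokens, not the system prompt, the user query, or the generated answer.

```python
def query_cost(tokens_per_query: int, price_per_1k: float, n_queries: int) -> float:
    """Input-token cost in dollars for n_queries at a flat per-1K-token price."""
    return tokens_per_query / 1000 * price_per_1k * n_queries


PRICE_PER_1K = 0.0015   # illustrative rate taken from the text above
N_QUERIES = 1_000_000

baseline = query_cost(3000, PRICE_PER_1K, N_QUERIES)
compressed = query_cost(500, PRICE_PER_1K, N_QUERIES)
savings = baseline - compressed

print(f"${baseline:,.0f} -> ${compressed:,.0f} "
      f"(saves ${savings:,.0f}, {savings / baseline:.0%})")
# $4,500 -> $750 (saves $3,750, 83%)
```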

---

### What Gets Compressed?

Only the retrieved search results are compressed; everything else in the prompt is sent as-is:

| Component | Compressed? | Why? |
|-----------|-------------|------|
| **System Prompt** | ❌ No | Short and essential ("You are an Airbnb recommendation system") |
| **User Query** | ❌ No | Must preserve exact user intent |
| **Search Results** | ✅ **YES** | Verbose, repetitive, contains filler words |
| **GPT Answer** | ❌ No | Generated after compression |

**What LLMLingua removes from search results:**
- Filler words: "In the neighborhood of...", "you can find...", "this area is known for..."
- Redundant phrases: Multiple mentions of "friendly", "warm", "welcoming"
- Unnecessary adjectives and connecting words

**What LLMLingua keeps:**
- Property names
- Key numbers (accommodates, scores, review counts)
- Location info (country, city)
- Essential descriptors (restaurants, cafes, river, beach)

---

### How Does LLMLingua-2 Work?

1. **Tokenization**: Breaks text into tokens
2. **Importance Scoring**: Uses a small BERT-based token classifier to estimate, for each token, how likely it is to be worth keeping
3. **Intelligent Pruning**: Removes low-importance tokens while preserving semantic meaning
4. **Reassembly**: Reconstructs compressed text that GPT can still understand
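
The four steps above can be sketched with a toy compressor. This is not LLMLingua-2: the stop-word scorer is a hypothetical stand-in for the learned BERT classifier, and whitespace splitting stands in for real tokenization. It only illustrates the score, prune, and reassemble flow.

```python
FILLER = {"in", "the", "of", "you", "can", "find", "a", "is", "and", "to", "this", "for"}


def score_token(token: str) -> float:
    """Hypothetical importance score: numbers high, filler words low."""
    word = token.lower().strip(".,!?")
    if any(ch.isdigit() for ch in word):
        return 1.0
    return 0.1 if word in FILLER else 0.8


def toy_compress(text: str, keep_ratio: float = 0.5) -> str:
    tokens = text.split()                                   # 1. tokenization
    scores = [score_token(t) for t in tokens]               # 2. importance scoring
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank by score (highest first), breaking ties by position
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    keep = sorted(ranked[:n_keep])                          # 3. prune, keep original order
    return " ".join(tokens[i] for i in keep)                # 4. reassembly


text = "In the neighborhood of the river, you can find many restaurants and 2 cafes"
print(toy_compress(text, keep_ratio=0.4))
# neighborhood river, many restaurants 2 cafes
```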

**Key Advantages:**
- **3x-6x faster** than original LLMLingua
- **Task-agnostic**: Works for any domain (not just Airbnb)
- **Distilled from GPT-4**: The compressor is trained on compression examples generated by GPT-4

Let's see this in action!

## Phase 4: Install and Initialize LLMLingua-2

In [39]:
# Install LLMLingua library (uncomment if not already installed)
# !pip install llmlingua

In [40]:
# Initialize LLMLingua-2 PromptCompressor
# This uses a small BERT-based model trained on GPT-4-distilled data for token-level compression
import json
from llmlingua import PromptCompressor

print("Initializing LLMLingua-2 model (this may take 30-60 seconds)...")

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    model_config={"revision": "main"},
    use_llmlingua2=True,  # Use the latest LLMLingua-2 algorithm
    device_map="cpu",      # Use CPU (change to "cuda" if you have GPU)
)

print("✅ LLMLingua-2 model loaded successfully!")

Initializing LLMLingua-2 model (this may take 30-60 seconds)...
✅ LLMLingua-2 model loaded successfully!


## Phase 5: Build Compression Pipeline

In [41]:
# Function to compress query prompt using LLMLingua-2
def compress_query_prompt(query):
    """
    Compresses the search results context using LLMLingua-2.
    
    Args:
        query: Dict containing demonstration_str (search results), instruction, and question
        
    Returns:
        JSON string with compressed prompt and metadata
    """
    demonstration_str = query['demonstration_str']
    instruction = query['instruction']
    question = query['question']

    # Compress toward ~500 tokens using LLMLingua-2
    compressed_prompt = llm_lingua.compress_prompt(
        demonstration_str.split("\n"),  # Split results into lines; each line is passed as a separate context
        instruction=instruction,
        question=question,
        target_token=500,  # Target ~500 tokens for the compressed context
        # The four settings below come from the LongLLMLingua recipe. With
        # use_llmlingua2=True they are likely ignored (assumption: verify
        # against your installed llmlingua version).
        rank_method="longllmlingua",
        context_budget="+100",
        dynamic_context_compression_ratio=0.4,
        reorder_context="sort",
    )

    return json.dumps(compressed_prompt, indent=4)
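
Because `compress_query_prompt` returns a JSON string, its metadata can be inspected before anything is sent to GPT. The sketch below assumes the result carries `compressed_prompt`, `origin_tokens`, and `compressed_tokens` keys; only `compressed_prompt` is confirmed by the code in this lesson, so check the other two against your installed version. `summarize_compression` is a hypothetical helper and the sample values are made up for illustration.

```python
import json


def summarize_compression(result_json: str) -> dict:
    """Extract the compressed text and token counts from the JSON result."""
    data = json.loads(result_json)
    origin = data.get("origin_tokens")          # assumed key name
    compressed = data.get("compressed_tokens")  # assumed key name
    ratio = origin / compressed if origin and compressed else None
    return {
        "compressed_text": data.get("compressed_prompt", ""),
        "origin_tokens": origin,
        "compressed_tokens": compressed,
        "ratio": ratio,
    }


# Made-up sample standing in for real compressor output
sample = json.dumps({
    "compressed_prompt": "Homely Room 5 - Star Condo restaurants shops nearby",
    "origin_tokens": 3000,
    "compressed_tokens": 500,
})
summary = summarize_compression(sample)
print(f"{summary['ratio']:.1f}x")  # 6.0x
```

Using `.get()` keeps the helper from raising if a key is absent; the ratio is simply reported as `None` in that case.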

In [42]:
# Query handler WITH compression
import pprint

def handle_user_query_with_compression(query, db, collection, stages=None, vector_index="vector_index_text"):
    """
    Performs vector search and compresses the results before sending to GPT.
    """
    # Avoid a mutable default argument; treat None as "no extra stages"
    stages = stages or []

    # Perform vector search to get knowledge from the database
    get_knowledge = custom_utils.vector_search_with_filter(query, db, collection, stages, vector_index)

    # Check if there are any results (could be empty list, None, or error string)
    if not get_knowledge or isinstance(get_knowledge, str):
        print(f"❌ Error or no results: {get_knowledge}")
        return pd.DataFrame(), None

    # Convert search results into a list of SearchResultItem models
    search_results_models = [SearchResultItem(**result) for result in get_knowledge]

    # Convert search results into a DataFrame
    search_results_df = pd.DataFrame([item.dict() for item in search_results_models])
    
    # Select only essential columns to shrink the input before compression (the compressor's BERT encoder has a 512-token window)
    # We'll keep: name, accommodates, location info, neighborhood, scores
    compact_df = search_results_df[['name', 'accommodates', 'neighborhood_overview', 'averageReviewScore', 'number_of_reviews', 'combinedScore']].copy()
    
    # Add simplified address info (just country and market)
    compact_df['location'] = search_results_df['address'].apply(lambda x: f"{x.get('market', '')}, {x.get('country', '')}")

    # Prepare information for compression (use compact version)
    query_info = {
        'demonstration_str': compact_df.to_string(),  # Results from vector search (compact!)
        'instruction': "Write a high-quality answer for the given question using only the provided search results.",
        'question': query  # User query
    }

    print(f"📏 Input length before compression: {len(query_info['demonstration_str'])} chars (~{len(query_info['demonstration_str'])/4:.0f} tokens)")

    # Compress the query prompt using LLMLingua-2
    compressed_prompt = compress_query_prompt(query_info)

    # Print compressed prompts for inspection
    print("=" * 80)
    print("🗜️  COMPRESSED PROMPT (This is what we're sending to GPT):")
    print("=" * 80)
    pprint.pprint(compressed_prompt)
    print("=" * 80)

    return search_results_df, compressed_prompt

In [43]:
# Function to generate system response using compressed prompt
def handle_system_response(query, compressed_prompt):
    """
    Sends the compressed prompt to GPT and generates a response.
    """
    # Parse the JSON to extract the actual compressed text
    compressed_data = json.loads(compressed_prompt)
    compressed_text = compressed_data['compressed_prompt']
    
    # Generate system response using OpenAI's completion
    completion = custom_utils.openai.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": "You are an Airbnb listing recommendation system."
            },
            {
                "role": "user",
                "content": f"Answer this user query: {query} with the following context:\n{compressed_text}"
            }
        ]
    )

    system_response = completion.choices[0].message.content

    # Print User Question and System Response
    print(f"\n- User Question:\n{query}\n")
    print(f"- System Response:\n{system_response}\n")

    return system_response

## Phase 6: Before/After Comparison - See Compression in Action!

Now let's run the same query with and without compression to see the dramatic difference!

In [44]:
# Define our test query
query = """
I want to stay in a place that's warm and friendly, 
and not too far from restaurants, can you recommend a place? 
Include a reason as to why you've chosen your selection.
"""

In [45]:
# STEP 1: Run query WITHOUT compression (BEFORE)
print("\n" + "=" * 80)
print("🔴 STEP 1: BASELINE (WITHOUT COMPRESSION)")
print("=" * 80 + "\n")

response_uncompressed, results_df_uncompressed = handle_user_query(
    query, 
    db, 
    collection, 
    additional_stages, 
    vector_index="vector_index_with_filter"
)

# Store uncompressed context length for comparison
uncompressed_text = results_df_uncompressed.to_string()
uncompressed_token_count = len(uncompressed_text) / 4  # Rough token estimate

🔴 STEP 1: BASELINE (WITHOUT COMPRESSION)
⚡ Search completed in 0.252375 milliseconds
📋 UNCOMPRESSED SEARCH RESULTS (This is what we're sending to GPT):
name  accommodates  address  ... [wide DataFrame output truncated]

In [46]:
# STEP 2: Run query WITH compression (AFTER)
print("\n" + "=" * 80)
print("🟢 STEP 2: WITH COMPRESSION")
print("=" * 80 + "\n")

results_df_compressed, compressed_prompt = handle_user_query_with_compression(
    query, 
    db, 
    collection, 
    additional_stages, 
    vector_index="vector_index_with_filter"
)

# Extract compressed text for comparison (guard against an empty search result)
compressed_data = json.loads(compressed_prompt) if compressed_prompt else {}
compressed_text = compressed_data.get('compressed_prompt', '')
compressed_token_count = len(compressed_text) / 4  # Rough token estimate

🟢 STEP 2: WITH COMPRESSION
⚡ Search completed in 0.19426 milliseconds
📏 Input length before compression: 24149 chars (~6037 tokens)
🗜️  COMPRESSED PROMPT (This is what we're sending to GPT):
('{\n'
 '    "compressed_prompt": "name accommodates neighborhood _ overview '
 'averageReviewScore number _ of _ reviews combinedScore location\\n\\n1 '
 'Homely Room in 5 - Star New Condo @ MTR 2 Many restaurants and shops nearby. '
 '9. 500000 179. 0 128. 15 Hong Kong, Hong Kong\\n\\n2 Cozy double bed room '
 '\\u6771 \\u6d8c \\u9109 \\u6751 \\u96c5 \\u7dfb \\u96d9 \\u4eba \\u623f 2 10 '
 '- minute walk to the bus stop on Tung Chung Road. 10 - minute walk to Indian '
 'and Thai Food restaurants. 15 - minute walk to Yat Tung Estate ( shopping '
 'centre, restaurants and bus terminus ). 7 - minute walk to Chung Mun Road ( '
 'Mun Tung Estate Car Park Entrance ). 10 - minute to Airport / Asia Expo by '
 'taxi / car. 4 - minute to Tung Chung MTR station by taxi / car. 9. 666667 '
 '162. 0 116. 30 Ho

In [47]:
# STEP 3: Side-by-Side Visual Comparison
print("\n" + "=" * 80)
print("📊 SIDE-BY-SIDE COMPARISON")
print("=" * 80)

print("\n🔴 BEFORE COMPRESSION (First 800 characters):")
print("-" * 80)
print(uncompressed_text[:800] + "...")
print(f"\nTotal length: {len(uncompressed_text)} characters")
print(f"Estimated tokens: ~{uncompressed_token_count:.0f} tokens")

print("\n" + "=" * 80)

print("\n🟢 AFTER COMPRESSION (First 800 characters):")
print("-" * 80)
print(compressed_text[:800] + "...")
print(f"\nTotal length: {len(compressed_text)} characters")
print(f"Estimated tokens: ~{compressed_token_count:.0f} tokens")

print("\n" + "=" * 80)
print("💎 COMPRESSION RESULTS")
print("=" * 80)

# Calculate compression ratio
compression_ratio = uncompressed_token_count / compressed_token_count if compressed_token_count > 0 else 0
token_savings = uncompressed_token_count - compressed_token_count
cost_per_1k_tokens = 0.0015  # Illustrative input price (GPT-3.5 rate); the calls above use gpt-4.1
cost_savings_per_query = (token_savings / 1000) * cost_per_1k_tokens

print(f"🎯 Compression Ratio: {compression_ratio:.1f}x")
print(f"💰 Token Savings: {token_savings:.0f} tokens per query")
print(f"💵 Cost Savings: ${cost_savings_per_query:.6f} per query")
print(f"💵 Cost Savings (1M queries): ${cost_savings_per_query * 1_000_000:.2f}")
print(f"📉 Size Reduction: {((token_savings / uncompressed_token_count) * 100):.1f}%")
print("=" * 80)

📊 SIDE-BY-SIDE COMPARISON
🔴 BEFORE COMPRESSION (First 800 characters):
--------------------------------------------------------------------------------
name  accommodates  address  ...
Total length: 46850 characters
Estimated

In [48]:
# STEP 4: Generate answer using compressed prompt
print("\n" + "=" * 80)
print("🤖 GENERATING ANSWER WITH COMPRESSED PROMPT")
print("=" * 80 + "\n")

if compressed_prompt:
    system_response = handle_system_response(query, compressed_prompt)
    print("\n✅ Answer generated successfully using compressed context!")
    print("\n💡 Notice: The answer quality remains high despite 6x compression!")
else:
    print("❌ No valid results to display.")

🤖 GENERATING ANSWER WITH COMPRESSED PROMPT

- User Question:

I want to stay in a place that's warm and friendly, 
and not too far from restaurants, can you recommend a place? 
Include a reason as to why you've chosen your selection.


- System Response:
Based on your preferences for a warm and friendly place that’s not too far from restaurants, I recommend:

**Homely Room in 5-Star New Condo @ MTR** (Hong Kong, Hong Kong)

**Reason for selection:**
This listing is specifically described as "homely," which suggests a warm and welcoming atmosphere. It’s located near many restaurants and shops, making it easy for you to explore local dining options. With a high average review score (9.0) and a large number of reviews (179), guests have consistently had positive experiences here. The convenient access to transportation (MTR) further adds to its appeal for exploring the city. 

This combination makes it a great choice for a comfortable, friendly stay with excellent access to restaurants!

