# 04 Performance Profiling

**Objective:** Verify that the system stays responsive on production-scale data.

**Areas to Profile:**
- Query performance (database operations)
- Code performance (Python/pandas operations)

**Why This Matters:** The system must handle 80K+ reviews efficiently in production.

In [1]:
import sys
from pathlib import Path
project_root = Path.cwd() if (Path.cwd() / "src").exists() else Path.cwd().parent
sys.path.insert(0, str(project_root))

import time
import sqlite3
import io
from contextlib import redirect_stdout
import pandas as pd
import cProfile
import pstats

from sqlalchemy import text
from src.utils import get_engine, get_db_path
from src.benchmarking import get_reviews_df, create_comparable_groups, extract_hotel_features

engine = get_engine(sample=False)

## Query Performance Analysis

Testing common database operations to ensure quick response times.

**Test Queries:**
1. **Count** - Simple aggregation
2. **Group by** - Hotel-level aggregation
3. **Filter** - Indexed lookup
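4. **Complex** - Filtered aggregation with `HAVING` and sorting

**The cell below prints each query's execution plan. The timing section that follows runs each query 5 times and reports average, min, and max times.**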

In [None]:
db_path = get_db_path(sample=False)
conn = sqlite3.connect(str(db_path))

test_queries = [
    ("Count all reviews", "SELECT COUNT(*) FROM reviews"),
    ("Avg rating by hotel", "SELECT offering_id, AVG(rating_overall) FROM reviews GROUP BY offering_id"),
    ("Filter by rating >= 4", "SELECT * FROM reviews WHERE rating_overall >= 4 LIMIT 1000"),
    ("Complex aggregation", """
        SELECT offering_id, 
               COUNT(*) as n,
               AVG(rating_overall) as avg_rating,
               AVG(rating_cleanliness) as avg_clean
        FROM reviews
        WHERE rating_overall >= 3.5
        GROUP BY offering_id
        HAVING COUNT(*) >= 10
        ORDER BY avg_rating DESC
        LIMIT 100
    """),
]

print("="*70)
print("QUERY EXECUTION PLANS")
print("="*70)

for name, query in test_queries:
    print(f"\n{name}:")
    print(f"  Query: {query[:80]}{'...' if len(query) > 80 else ''}")
    print("  Plan:")
    
    for row in conn.execute(f"EXPLAIN QUERY PLAN {query}").fetchall():
        print(f"    {row}")

conn.close()

QUERY EXECUTION PLANS

Count all reviews:
  Query: SELECT COUNT(*) FROM reviews
  Plan:
    (4, 0, 0, 'SCAN reviews USING COVERING INDEX idx_reviews_rating_overall')

Avg rating by hotel:
  Query: SELECT offering_id, AVG(rating_overall) FROM reviews GROUP BY offering_id
  Plan:
    (7, 0, 222, 'SCAN reviews USING INDEX idx_reviews_offering')

Filter by rating >= 4:
  Query: SELECT * FROM reviews WHERE rating_overall >= 4 LIMIT 1000
  Plan:
    (4, 0, 202, 'SEARCH reviews USING INDEX idx_reviews_rating_overall (rating_overall>?)')

Complex aggregation:
  Query: 
        SELECT offering_id, 
               COUNT(*) as n,
               AVG(r...
  Plan:
    (9, 0, 222, 'SCAN reviews USING INDEX idx_reviews_offering')
    (58, 0, 0, 'USE TEMP B-TREE FOR ORDER BY')
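
In the plans above, `SCAN reviews USING INDEX idx_reviews_offering` means SQLite walks the index in `offering_id` order but still has to visit the table rows to read `rating_overall`, whereas `SEARCH ... USING INDEX` jumps straight to the matching index range. The sketch below reproduces these plan shapes on a toy in-memory table. The schema and the index name `idx_demo_offering_rating` are assumptions made for illustration, not the project's actual schema, and whether a composite index would speed up the real queries would need to be measured.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only; the real `reviews`
# table has more columns and its own indexes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews ("
    "id INTEGER PRIMARY KEY, offering_id INTEGER, "
    "rating_overall REAL, review_text TEXT)"
)


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    return [row[3] for row in rows]


lookup = "SELECT * FROM reviews WHERE offering_id = 42"
group_by = (
    "SELECT offering_id, AVG(rating_overall) "
    "FROM reviews GROUP BY offering_id"
)

# Without any index, the lookup has to scan the whole table.
plan_no_index = plan(lookup)

# A composite index on (offering_id, rating_overall) supports the lookup
# and holds every column the GROUP BY query needs.
conn.execute(
    "CREATE INDEX idx_demo_offering_rating "
    "ON reviews (offering_id, rating_overall)"
)
plan_with_index = plan(lookup)   # index search instead of a table scan
plan_group_by = plan(group_by)   # answered from the index alone (covering)

for label, details in [
    ("lookup, no index", plan_no_index),
    ("lookup, with index", plan_with_index),
    ("group by, with index", plan_group_by),
]:
    print(f"{label}: {details}")

conn.close()
```

A covering index lets SQLite answer the aggregation from the index without touching table rows, at the cost of extra storage and slower writes.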


## Query Timing Measurements

Measuring wall-clock execution time for each query over 5 runs to smooth out run-to-run variation.

In [None]:
conn = sqlite3.connect(str(db_path))

# Capture printed output
buf = io.StringIO()
with redirect_stdout(buf):
    print("=" * 80)
    print("QUERY PERFORMANCE TIMING")
    print("=" * 80)

    results = []

    for name, query in test_queries:
        times = []
        for _ in range(5):
            start = time.perf_counter()  # monotonic, high-resolution timer
            cursor = conn.execute(query)
            _ = cursor.fetchall()
            times.append(time.perf_counter() - start)

        avg_time = sum(times) / len(times)
        results.append({
            'query': name,
            'avg_time': avg_time,
            'min_time': min(times),
            'max_time': max(times),
        })

        print(f"\n{name}:")
        print(f"  Average: {avg_time*1000:.2f}ms")
        print(f"  Min:     {min(times)*1000:.2f}ms")
        print(f"  Max:     {max(times)*1000:.2f}ms")

    print(f"\n{'='*80}")
    print("SUMMARY")
    print(f"{'='*80}")
    
    all_under_200 = all(r['avg_time'] < 0.2 for r in results)
    print(f"All queries < 200ms: {'✓ YES' if all_under_200 else '✗ NO'}")
    slowest = max(results, key=lambda r: r['avg_time'])
    print(f"Slowest query: {slowest['query']}")
    print(f"  Time: {slowest['avg_time']*1000:.2f}ms")

conn.close()

# Convert captured output to text
query_profiling_txt = buf.getvalue()

# Write to file
output_path = project_root / "profiling" / "query_results.txt"
output_path.parent.mkdir(parents=True, exist_ok=True)
output_path.write_text(query_profiling_txt, encoding="utf-8")

print(query_profiling_txt)
print(f"\n✓ Wrote {output_path}")

QUERY PERFORMANCE TIMING

Count all reviews:
  Average: 10.87ms
  Min:     0.12ms
  Max:     53.53ms

Avg rating by hotel:
  Average: 2046.15ms
  Min:     883.64ms
  Max:     6228.12ms

Filter by rating >= 4:
  Average: 16.31ms
  Min:     9.82ms
  Max:     20.55ms

Complex aggregation:
  Average: 1034.93ms
  Min:     965.46ms
  Max:     1109.53ms

SUMMARY
All queries < 200ms: ✗ NO
Slowest query: Avg rating by hotel
  Time: 2046.15ms


✓ Wrote c:\Users\rayya\Desktop\IS5126-G4-hotel-analytics-master\profiling\query_results.txt
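
In the results above, "Count all reviews" ranges from 0.12ms to 53.53ms and "Avg rating by hotel" from 883.64ms to 6228.12ms, so a single slow run pulls the average well above the typical time. One possible cause is a cold cache on the first execution, though that is an assumption rather than something measured here. The sketch below shows one way to make the numbers more robust: discard a warm-up run and report the median. The helper `time_query` and the toy data are hypothetical and not part of the project code.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, runs=5, warmup=1):
    """Time `sql` over `runs` executions after `warmup` untimed executions."""
    for _ in range(warmup):
        conn.execute(sql).fetchall()  # warm caches; result discarded

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        samples.append(time.perf_counter() - start)

    return {
        "median_ms": statistics.median(samples) * 1000,
        "min_ms": min(samples) * 1000,
        "max_ms": max(samples) * 1000,
    }


# Toy in-memory data so the sketch runs on its own.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE reviews (offering_id INTEGER, rating_overall REAL)")
demo.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(i % 50, 1 + (i % 5)) for i in range(5000)],
)

stats = time_query(
    demo,
    "SELECT offering_id, AVG(rating_overall) FROM reviews GROUP BY offering_id",
)
print(stats)
demo.close()
```

The median is less sensitive to one outlier than the mean, so it tends to describe the typical run better when only 5 samples are taken.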


## Code Performance Analysis

Profiling Python operations to identify bottlenecks in our benchmarking code.

**What we're profiling:** The complete benchmarking workflow (feature extraction + clustering).
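
When looking for bottlenecks, it helps to distinguish cumulative time (a function plus everything it calls) from `tottime` (time spent inside the function itself). The sketch below profiles a toy workload and ranks functions by `tottime`, restricting the report with a regex. The functions `slow_sum` and `workflow` are hypothetical stand-ins, not the benchmarking code.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical hot spot: a pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workflow():
    """Hypothetical stand-in for a multi-step pipeline."""
    return [slow_sum(20_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
workflow()
profiler.disable()

out = io.StringIO()
stats = pstats.Stats(profiler, stream=out).strip_dirs()
# "tottime" ranks by time spent inside each function itself, excluding callees;
# the regex restricts the report to entries whose name matches it.
stats.sort_stats("tottime").print_stats("slow_sum|workflow")

report = out.getvalue()
print(report)
```

Sorting by `tottime` tends to surface the functions doing the actual work, while a cumulative ordering is usually headed by top-level wrappers.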

In [None]:
prof = cProfile.Profile()
prof.enable()

# Run the benchmarking workflow
df = get_reviews_df(sample=False)
features = extract_hotel_features(df)
features_clustered, sil_score, profiles = create_comparable_groups(features, n_clusters=6)

prof.disable()

# Generate report
s = io.StringIO()
ps = pstats.Stats(prof, stream=s).strip_dirs().sort_stats("cumulative")
ps.print_stats(30)

code_profiling_txt = s.getvalue()

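# Write to file (create the directory in case this cell runs on its own)
output_path = project_root / "profiling" / "code_profiling.txt"
output_path.parent.mkdir(parents=True, exist_ok=True)
output_path.write_text(code_profiling_txt, encoding="utf-8")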

# Show in notebook
print(code_profiling_txt)
print(f"\n✓ Wrote {output_path}")

Extracting hotel-level features...
Analyzing review text for hotel characteristics...
Creating 6 comparable groups...
Data shape: (3374, 7)
Unique hotels: 3374

Testing different cluster counts:
  K=3: silhouette=0.297, min_size=459, max_size=1487
  K=4: silhouette=0.306, min_size=258, max_size=1341
  K=5: silhouette=0.291, min_size=105, max_size=1366
  K=6: silhouette=0.303, min_size=69, max_size=892
  K=7: silhouette=0.315, min_size=45, max_size=826
  K=8: silhouette=0.333, min_size=75, max_size=926
  K=9: silhouette=0.333, min_size=15, max_size=776
  K=10: silhouette=0.310, min_size=15, max_size=740
  K=11: silhouette=0.314, min_size=14, max_size=741
  K=12: silhouette=0.281, min_size=14, max_size=588

✓ Selected K=9 with silhouette=0.333
         3844062 function calls (3767761 primitive calls) in 28.290 seconds

   Ordered by: cumulative time
   List reduced from 1884 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  