# Rental Review Analysis Notebook

**Purpose:**
This notebook implements an end-to-end pipeline to analyze short-term rental reviews per your request: (1) aggregate multiple reviews per property, (2) summarize them using an LLM (OpenAI example provided), (3) canonicalize/normalize issues, and (4) add embeddings-based clustering to find common issues.  

**Notes:** Replace CSV path placeholders and set `OPENAI_API_KEY` if you want to run LLM calls. The notebook is modular so you can swap LLM providers.

- find the common issues, such as smoke alarm, sink drain
- segment comparison by apartment vs houses, seattle vs remote, 2 bdr vs 3bdr+

- check the final solutions for the issues: categorized, and whether it results in client request refund
- Check whether 5 stars but leave private note,how about 4 stars with private note
- Check whether there are issues raisen but no reviews
- trace where some tasks didn't complete in time, and what the reactions from the future guests.

In [3]:

# 0. Requirements - run this cell once to install packages (uncomment to run)
%pip install openai pandas numpy scikit-learn matplotlib seaborn tiktoken sentence-transformers umap-learn

print('Notebook ready. Install dependencies if needed.')

Defaulting to user installation because normal site-packages is not writeable
Collecting openai
  Downloading openai-2.3.0-py3-none-any.whl (999 kB)
[K     |████████████████████████████████| 999 kB 11.2 MB/s eta 0:00:01
Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp39-cp39-macosx_12_0_arm64.whl (11.1 MB)
[K     |████████████████████████████████| 11.1 MB 79.6 MB/s eta 0:00:01
[?25hCollecting matplotlib
  Downloading matplotlib-3.9.4-cp39-cp39-macosx_11_0_arm64.whl (7.8 MB)
[K     |████████████████████████████████| 7.8 MB 14.8 MB/s eta 0:00:01
[?25hCollecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
[K     |████████████████████████████████| 294 kB 55.0 MB/s eta 0:00:01
[?25hCollecting tiktoken
  Downloading tiktoken-0.12.0-cp39-cp39-macosx_11_0_arm64.whl (997 kB)
[K     |████████████████████████████████| 997 kB 65.1 MB/s eta 0:00:01
[?25hCollecting sentence-transformers
  Downloading sentence_transformers-5.1.1-py3-none-any.whl (486 kB)
[K    

## 1) Configuration and imports
Update the `CSV_PATH` to point at your reviews CSV. The notebook expects the columns you provided:

`Listing, Reservation, Guest.name, checkin_date, checkout_date, booking_platform, createdAt, Overall, Check.in.score, Accuracy, Cleanliness, Communication, Location, Value, Public.Review, PropertyType, Region, BEDROOMS`


In [21]:
import os
import json
import time
from pathlib import Path
import pandas as pd
import numpy as np

# LLM client - optional (OpenAI example included)
try:
    import openai
except Exception:
    openai = None

# Embedding fallback / clustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Config - update paths and API key as needed
review_PATH = "/Users/ylin/My Drive/Cohost/Data and Reporting/06-Reviews/Data/PropertyReviews.xlsx" # update as needed
openai.api_key = os.environ["OPENAI_API_KEY"] # or set here as string (not recommended in notebooks)
LLM_MODEL = 'gpt-5'  # example; swap if needed
EMBEDDING_MODEL = 'text-embedding-3-small'  # example
OUTPUT_DIR = Path('notebook_outputs')
OUTPUT_DIR.mkdir(exist_ok=True)


KeyError: 'OPENAI_API_KEY'

## 2) Load your CSV and preview
We'll load the CSV and ensure the expected columns exist. The notebook will create convenient column aliases matching your provided schema.

In [None]:

# Load CSV
df = pd.read_excel(review_PATH)
print('Loaded rows:', len(df))
expected_cols = ['Listing','Reservation','Guest.name','checkin_date','checkout_date','booking_platform','createdAt','Overall','Check.in.score','Accuracy','Cleanliness','Communication','Location','Value','Public.Review','PropertyType','Region','BEDROOMS']
missing = [c for c in expected_cols if c not in df.columns]
if missing:
    print('WARNING - missing columns from CSV:', missing)
else:
    print('All expected columns present.')

df.head(3)

Loaded rows: 3926
All expected columns present.


Unnamed: 0,Listing,Reservation,Guest.name,checkin_date,checkout_date,booking_platform,createdAt,Overall,Check.in.score,Accuracy,Cleanliness,Communication,Location,Value,Public.Review,PropertyType,Region,BEDROOMS
0,Beachwood 1,HMYAAXKZCY,Eƶéquias,2023-12-08,2023-12-11,Airbnb,2023-12-13,5.0,5.0,4.0,5.0,5.0,5.0,5.0,"Great place to stay, I thought I would have pr...",Condo,Seattle,2
1,Beachwood 1,HMP9YF2R2A,Aimée,2023-08-04,2023-08-09,Airbnb,2023-08-09,5.0,5.0,5.0,5.0,5.0,5.0,5.0,Beautiful space with access to the Puget Sound...,Condo,Seattle,2
2,Beachwood 1,HMSWBTETFP,Hong,2023-10-03,2023-10-04,Airbnb,2023-10-05,5.0,5.0,5.0,5.0,5.0,5.0,5.0,"The unit is perfect, with a great view of the ...",Condo,Seattle,2


### Quick column mapping
Create normalized column names that we'll use in the pipeline.

In [10]:

# Column mapping - adapt if your CSV uses slightly different names
col_map = {
    'Listing':'property_id',
    'Reservation':'booking_id',
    'Guest.name':'guest_name',
    'checkin_date':'checkin_date',
    'checkout_date':'checkout_date',
    'booking_platform':'platform',
    'createdAt':'created_at',
    'Overall':'rating',
    'Check.in.score':'checkin_score',
    'Accuracy':'accuracy_score',
    'Cleanliness':'cleanliness_score',
    'Communication':'communication_score',
    'Location':'location_score',
    'Value':'value_score',
    'Public.Review':'review_text',
    'PropertyType':'property_type',
    'Region':'region',
    'BEDROOMS':'bedrooms'
}

# Apply mapping for columns that exist
available_map = {k:v for k,v in col_map.items() if k in df.columns}
df = df.rename(columns=available_map)
# Ensure key columns exist
for c in ['property_id','booking_id','review_text']:
    if c not in df.columns:
        df[c] = None

# Basic cleaning
df['review_text'] = df['review_text'].astype(str).fillna('')
df['property_id'] = df['property_id'].astype(str)
df['booking_id'] = df['booking_id'].astype(str)
df['bedrooms'] = pd.to_numeric(df.get('bedrooms', pd.Series(np.nan)), errors='coerce').fillna(0).astype(int)

print('Columns after mapping:', df.columns.tolist())
df[['property_id','booking_id','rating','review_text','property_type','region','bedrooms']].head(3)

Columns after mapping: ['property_id', 'booking_id', 'guest_name', 'checkin_date', 'checkout_date', 'platform', 'created_at', 'rating', 'checkin_score', 'accuracy_score', 'cleanliness_score', 'communication_score', 'location_score', 'value_score', 'review_text', 'property_type', 'region', 'bedrooms']


Unnamed: 0,property_id,booking_id,rating,review_text,property_type,region,bedrooms
0,Beachwood 1,HMYAAXKZCY,5.0,"Great place to stay, I thought I would have pr...",Condo,Seattle,2
1,Beachwood 1,HMP9YF2R2A,5.0,Beautiful space with access to the Puget Sound...,Condo,Seattle,2
2,Beachwood 1,HMSWBTETFP,5.0,"The unit is perfect, with a great view of the ...",Condo,Seattle,2


## 3) Aggregate multiple reviews per property
We'll combine all public reviews per `property_id` into a single text block for summarization. We'll keep individual reviews too for later linking (refunds, tasks).

In [13]:

# Aggregate public reviews per property
grouped = df.groupby('property_id').agg({
    'review_text': lambda texts: '\n\n'.join([t for t in texts if str(t).strip()!='']),
    'rating': 'mean',
    'property_type':'first',
    'region':'first',
    'bedrooms':'first'
}).reset_index().rename(columns={'rating':'avg_rating'})

print('Grouped properties:', len(grouped))
grouped.head(3)

Grouped properties: 81


Unnamed: 0,property_id,review_text,avg_rating,property_type,region,bedrooms
0,Beachwood 1,"Great place to stay, I thought I would have pr...",4.794872,Condo,Seattle,2
1,Beachwood 10,"nan\n\ngreat communication, beautiful apartmen...",4.272727,Condo,Seattle,1
2,Beachwood 3,"A really nice, well equipped apartment with a ...",5.0,Condo,Seattle,1


## 4) LLM summarization function
This cell contains a **framework-neutral** summarization wrapper. It uses the OpenAI API if `openai` is installed and an API key is set. Otherwise it will return a mock summary so you can test the pipeline locally.

**Output format (JSON):** `{positives:[...], negatives:[...], critical_issues:[...], refund_mentions:[...], sentiment_trend: 'improving'|'stable'|'declining'}`

In [14]:

def summarize_reviews_with_llm(text, model=LLM_MODEL, openai_client=openai, api_key=OPENAI_API_KEY):
    # If no text return empty
    if not text or str(text).strip()=='':
        return {'positives': [], 'negatives': [], 'critical_issues': [], 'refund_mentions': [], 'sentiment_trend': 'stable'}
    # If openai not available or no key, return a mock summary for testing
    if openai_client is None or not api_key:
        # crude heuristics for demo
        positives = []
        negatives = []
        if 'view' in text.lower() or 'sunset' in text.lower():
            positives.append('view')
        if 'clean' in text.lower():
            positives.append('cleanliness')
        if 'parking' in text.lower():
            negatives.append('parking')
        if 'smoke' in text.lower() or 'marijuana' in text.lower():
            negatives.append('smoke odor')
            critical = ['smoke odor']
        else:
            critical = []
        return {'positives': positives, 'negatives': negatives, 'critical_issues': critical, 'refund_mentions': [], 'sentiment_trend': 'stable'}
    # Real OpenAI call path
    openai_client.api_key = api_key
    prompt = f"""You are an assistant analyzing reviews for a short-term rental property.
Summarize these reviews. Provide a JSON object with keys:
- positives: up to 5 short positive themes
- negatives: up to 5 recurring complaints
- critical_issues: safety/maintenance issues (short phrases)
- refund_mentions: if guests requested refunds or compensation, list brief notes
- sentiment_trend: one of improving, stable, declining

Reviews:
{ text }

Return only valid JSON."""
    try:
        resp = openai_client.ChatCompletion.create(
            model=model,
            messages=[{'role':'user','content':prompt}],
            temperature=0.0,
            max_tokens=400
        )
        raw = resp.choices[0].message['content']
        # try parse JSON
        parsed = None
        try:
            parsed = json.loads(raw)
        except Exception:
            import re
            m = re.search(r'(\{.*\})', raw, flags=re.S)
            if m:
                parsed = json.loads(m.group(1))
        if parsed is None:
            parsed = {'raw': raw}
        return parsed
    except Exception as e:
        return {'error':'llm_error','detail': str(e), 'raw':''}

### 4b) Run LLM summarization on grouped properties (demo / batch)
**Caution:** Running LLM calls may incur cost. If you have many properties, run in batches and set `OPENAI_API_KEY` in your environment.

In [15]:

# Example: run summarization on grouped (here we will run a mock if no API key)
summaries = []
for _, row in grouped.iterrows():
    s = summarize_reviews_with_llm(row['review_text'])
    summaries.append(s)

grouped['llm_summary'] = summaries
grouped[['property_id','avg_rating','llm_summary']].head(3)

Unnamed: 0,property_id,avg_rating,llm_summary
0,Beachwood 1,4.794872,"{'error': 'llm_error', 'detail': ' You tried ..."
1,Beachwood 10,4.272727,"{'error': 'llm_error', 'detail': ' You tried ..."
2,Beachwood 3,5.0,"{'error': 'llm_error', 'detail': ' You tried ..."


In [19]:
from openai import OpenAI
client = OpenAI()

result = client.responses.create(
    model="gpt-5",
    input="Write a haiku about code.",
    reasoning={ "effort": "low" },
    text={ "verbosity": "low" },
)

print(result.output_text)

OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

## 5) Extract critical issues into an issues table and normalize
We'll pull `critical_issues` from the LLM summaries and canonicalize them with a simple mapping. Later we cluster similar phrases using embeddings or TF-IDF + KMeans.

In [None]:

# Extract issues
issue_rows = []
for _, r in grouped.iterrows():
    summ = r['llm_summary'] or {}
    issues = summ.get('critical_issues') if isinstance(summ, dict) else []
    if isinstance(issues, str):
        issues = [issues]
    for iss in issues:
        if iss and str(iss).strip():
            issue_rows.append({'property_id': r['property_id'], 'issue_raw': str(iss).strip()})
issues_df = pd.DataFrame(issue_rows)
if issues_df.empty:
    print('No critical issues found in summaries (this is OK for demo).')
else:
    display(issues_df.head())

In [None]:

# Simple canonicalization mapping - extend as needed
canonical_map = {
    'smoke odor':'smoke_alarm_or_odor',
    'smoke':'smoke_alarm_or_odor',
    'smoke alarm':'smoke_alarm_or_odor',
    'sink drain':'sink_drain',
    'slow sink':'sink_drain',
    'shower small':'bathroom_shower_size',
    'door lock':'door_lock',
    'keypad not locking':'door_lock',
    'parking':'parking',
    'noise':'noise'
}

def canonicalize_issue(s):
    s_low = s.lower().strip()
    for k,v in canonical_map.items():
        if k in s_low:
            return v
    # fallback: normalized text with spaces -> underscores
    return s_low.replace(' ','_')

if not issues_df.empty:
    issues_df['issue_canonical'] = issues_df['issue_raw'].apply(canonicalize_issue)
    issues_df = issues_df.merge(df[['property_id','property_type','region','bedrooms']].drop_duplicates('property_id'), on='property_id', how='left')
    issues_df.to_csv(OUTPUT_DIR / 'issues_table.csv', index=False)
    issues_df.head()

## 6) Embeddings-based clustering (or TF-IDF fallback)
If you have an embeddings provider API key (OpenAI or sentence-transformers), use it. Otherwise the notebook falls back to TF-IDF + KMeans. The clusters will help discover common issue themes across all review texts (not only critical_issues).

In [None]:

# Prepare texts for clustering: use all review_texts per property
texts = grouped['review_text'].fillna('').tolist()

use_openai_embeddings = (openai is not None and OPENAI_API_KEY)

if use_openai_embeddings:
    # OpenAI embeddings example
    openai.api_key = OPENAI_API_KEY
    embs = []
    for t in texts:
        try:
            resp = openai.Embedding.create(input=t, model=EMBEDDING_MODEL)
            embs.append(resp['data'][0]['embedding'])
            time.sleep(0.2)
        except Exception as e:
            print('Embedding error', e)
            embs.append([0]*1536)
    X = np.array(embs)
else:
    # TF-IDF fallback
    vec = TfidfVectorizer(max_features=1000, stop_words='english')
    X = vec.fit_transform(texts).toarray()

# Run KMeans for clustering themes
n_clusters = min(8, max(2, int(len(texts)/5))) if len(texts)>0 else 2
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
grouped['cluster'] = kmeans.labels_

# Show cluster top terms if TF-IDF used
if not use_openai_embeddings:
    terms = vec.get_feature_names_out()
    order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
    clusters_summary = {}
    for i in range(n_clusters):
        top_terms = [terms[ind] for ind in order_centroids[i, :10]]
        clusters_summary[f'cluster_{i}'] = top_terms[:8]
    print('TF-IDF cluster top terms (sample):')
    print(json.dumps(clusters_summary, indent=2))

## 7) Segment comparisons & basic analytics
Produce breakdowns by `property_type`, `region`, and bedroom buckets (2bdr vs 3bdr+). Also compute refund linkage if you provide refunds data later.

In [None]:

# Bedroom bucket
grouped['bedroom_bucket'] = grouped['bedrooms'].apply(lambda x: '2bdr' if x==2 else ('3bdr+' if x>=3 else 'other'))

# Cluster counts by segment
seg_counts = grouped.groupby(['region','property_type','bedroom_bucket','cluster']).size().reset_index(name='count')
seg_counts.head(10).to_csv(OUTPUT_DIR/'cluster_segment_counts.csv', index=False)
seg_counts.head(10)

## 8) Save outputs and next steps
This notebook saves outputs to `notebook_outputs/`. Next steps you can run: join with refunds, tasks tables; run LLM summarization in batches; refine canonicalization mapping; build dashboards (Streamlit/Power BI).

In [None]:

# save grouped with clusters and summaries
grouped.to_csv(OUTPUT_DIR/'grouped_properties_with_summaries.csv', index=False)
print('Saved grouped data and cluster segment counts to', OUTPUT_DIR)
