# Yelp Open Dataset - Drivers of High Ratings
## Analysis overview

### 1. Environment & Paths

**1.1 Importing Libraries**  
Loaded core Python libraries for data handling, modeling, and visualization.

**1.2 Path Discovery**   
Set up folder paths so data and outputs stay organized and reproducible across machines.  

**1.3 Display Options**    
Adjusted notebook display settings to keep tables and results easy to read.  

**1.4 Brand Palette**  
Defined a consistent color scheme and chart style to make visuals clear and presentation-ready.

### 2. Load & Clean dataset

**2.1 Reading business.json**  
Loaded the Yelp business dataset in manageable chunks to handle large file size efficiently.  
*Observation:* First chunk confirmed structure (IDs, names, stars, reviews, location) and showed dataset is wide with many fields.  
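
A minimal sketch of chunked loading, assuming `business.json` is newline-delimited JSON (one business per line). The function name and the `chunk_size` default are hypothetical, and the notebook may well have used a dataframe library's chunked reader instead of plain Python.

```python
import json
from itertools import islice
from typing import Iterator, List


def read_json_lines_in_chunks(path: str, chunk_size: int = 10_000) -> Iterator[List[dict]]:
    """Yield lists of parsed records, reading at most chunk_size lines per list."""
    with open(path, encoding="utf-8") as handle:
        while True:
            # islice pulls the next chunk_size lines without loading the whole file.
            lines = list(islice(handle, chunk_size))
            if not lines:
                break
            # Blank lines are skipped so they do not break json.loads.
            yield [json.loads(line) for line in lines if line.strip()]
```

Each yielded chunk can be cleaned and appended to the output independently, which keeps peak memory bounded by the chunk size rather than the file size.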

**2.2 Column Overview & Null Scan**  
Checked column types, missing data, and attribute coverage to understand data quality upfront.  
*Observation:* Missing values mainly in `hours` (~15%) and `attributes` (~9%), while key fields like `stars` and `review_count` were mostly complete.  
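
The null scan amounts to a per-column missing share. A small sketch, assuming records are plain dicts and that a field counts as missing when it is absent or `None` (the helper name is hypothetical):

```python
from typing import Dict, List


def missing_share(records: List[dict], columns: List[str]) -> Dict[str, float]:
    """Return the fraction of records where each column is absent or None."""
    total = len(records)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for rec in records if rec.get(col) is None) / total
        for col in columns
    }
```

Note that an empty dict or empty string still counts as present under this definition, so a stricter scan may want to treat those as missing too.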

**2.3 Cleaning Helpers**  
Built functions to standardize messy fields, flatten attributes, and parse categories/hours into consistent formats.  
*Observation:* Frequent attributes included credit cards, parking, and price range; rare ones included drive-thru and Bitcoin payments.  

**2.4 Apply Cleaning**  
Applied cleaning pipeline to create a structured dataset with a clear target variable (`high_rating`).  
*Observation:* Dataset shape was ~50k rows × 56 columns, with star ratings averaging 3.6 (75% ≥ 3.5). Coverage was good for core fields, weaker for auxiliary ones like price and hours.  

**2.5 Attribute Normalization**  
Standardized text values, converted yes/no fields, and expanded nested attributes (e.g., parking, ambience) for consistency.  
*Observation:* Normalization boosted usable values significantly — e.g., `wifi` grew from 0 raw entries to ~19k usable entries.  
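
One plausible reason raw values are unusable is that attribute values arrive as stringified Python literals (for example `u'free'` or `'no'`). The sketch below assumes that encoding; the helper names and the exact token sets are illustrative, not the notebook's own.

```python
from typing import Optional

_TRUE = {"true", "yes", "1"}
_FALSE = {"false", "no", "0"}
_MISSING = {"", "none", "nan", "null"}


def clean_text(raw) -> Optional[str]:
    """Strip repr residue such as u'free' or 'no' and lower-case the value."""
    if raw is None:
        return None
    text = str(raw).strip()
    # Drop a leading unicode marker only when it is followed by a quote.
    if text[:2] in ("u'", 'u"'):
        text = text[1:]
    text = text.strip("'\"").strip().lower()
    return None if text in _MISSING else text


def to_bool(raw) -> Optional[bool]:
    """Map yes/no style values to True/False; anything else becomes None."""
    text = clean_text(raw)
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    return None
```

With this, `clean_text("u'free'")` yields `free`, while `to_bool("u'free'")` yields `None` because the value is categorical rather than yes/no.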

**2.6 Process Full Dataset**  
Ran the full cleaning pipeline across all rows, saving results into a structured CSV for downstream analysis.  
*Observation:* Final cleaned dataset combined identifiers, ratings, categories, and attributes into one consistent, analysis-ready table.  


### 3. Coverage & Attribute Parsing

**3.1 Load Dataset**  
Loaded the cleaned dataset (`business_clean.csv`) and validated core fields like ratings, price, weekly hours, and categories.  
*Observation:* Dataset opened successfully with expected fields available for analysis.  

**3.2 Initial EDA: Distributions & Coverage**  
Explored distributions of star ratings, review counts, attribute coverage, and top categories.  
*Observation:* Most businesses are rated 3–5 stars (avg ~4). Review counts are heavily skewed, with many businesses below 10 reviews. Coverage is high for core fields (stars, credit cards) but sparse for fields like wifi and hours.  

**3.3 Fixed Gaps — Data Reliability & Normalization**  
Created new features to address data weaknesses: reliable ratings (≥20 reviews), Bayesian-adjusted stars, missing flags, and standardized ratings by category/city.  
*Observation:* These fixes reduce small-sample noise and allow fair comparisons across categories and cities.  
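
A common form of Bayesian adjustment shrinks each business's rating toward a prior mean, weighted by review count, and z-scores then standardize ratings within a group. The sketch below assumes that form; `prior_weight=20` is a hypothetical choice and both function names are illustrative.

```python
from statistics import mean, pstdev
from typing import List


def bayes_stars(stars: float, review_count: int, prior_mean: float,
                prior_weight: float = 20) -> float:
    """Shrink a rating toward prior_mean; few reviews means more shrinkage."""
    return (review_count * stars + prior_weight * prior_mean) / (review_count + prior_weight)


def group_z_scores(values: List[float]) -> List[float]:
    """Standardize ratings within one group (e.g. one category or one city)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All ratings identical: no spread to standardize against.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Under these assumptions, a 5-star business with 2 reviews and a prior mean of 3.6 is pulled down to about 3.73, while the same rating with 200 reviews stays near 4.87.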

**3.4 EDA with Gap Fixes**  
Analyzed how adjusted ratings relate to reviews, hours, and attributes (e.g., wifi, alcohol, outdoor seating).  
*Observation:* Free WiFi, alcohol service, reservations, and outdoor seating show slight positive lifts; shorter weekly hours (<40) associate with higher ratings; z-scores reveal city/category effects.  

**3.5 Safety Checks**  
Verified all engineered features exist (`high_rating_reliable`, `stars_bayes`, `stars_cat_z`, `stars_city_z`, `hours_bin`, and missing flags).  
*Observation:* Guarantees consistency so later modeling steps are robust, even if earlier cells were skipped.  


### 4. Q1 model + diagnostics

**4.1 Model Setup — Target & Feature Coverage**  
Defined `high_rating = (stars ≥ 4.5)`, computed base rate, and audited feature coverage to know what’s reliably usable.  
*Why now:* Establishes the benchmark and confirms we have enough signal before modeling.  
*Observation:* Base rate ≈ 10–11%; high coverage for credit cards/price; moderate for delivery/outdoor seating; very low for hours & parking.

**4.2 Parking Attribute Remediation**  
Parsed messy/nested parking fields into clean columns (garage/street/lot/valet/validated) and standardized true/false values.  
*Why now:* Coverage for parking was near-zero and stringified; without fixing it, parking could not be evaluated fairly.  
*Observation:* Parking coverage increased materially (each type ~57% usable), enabling downstream analysis.
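
A sketch of the parking fix, assuming the raw field is a stringified Python dict such as `"{'garage': False, 'street': True}"`. The `parking_*` output names and the function name are hypothetical.

```python
import ast
from typing import Dict, Optional

PARKING_TYPES = ("garage", "street", "lot", "valet", "validated")


def parse_parking(raw) -> Dict[str, Optional[bool]]:
    """Turn a stringified parking dict into one True/False/None flag per type."""
    flags: Dict[str, Optional[bool]] = {f"parking_{kind}": None for kind in PARKING_TYPES}
    if isinstance(raw, str):
        try:
            # literal_eval only parses literals, so it is safe on untrusted text.
            raw = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            return flags
    if not isinstance(raw, dict):
        return flags
    for kind in PARKING_TYPES:
        value = raw.get(kind)
        # Keep only real booleans; None or other junk stays missing.
        if isinstance(value, bool):
            flags[f"parking_{kind}"] = value
    return flags
```

Unparseable or missing input returns all-`None` flags rather than raising, so coverage can be measured afterwards as the share of non-missing flags.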

**4.3 Attribute Lifts & Simple Correlations**  
Screened features with transparent point lifts/correlations to see which attributes associate with high ratings.  
*Why now:* Quick sanity-check to shortlist candidates and spot odd signs before complex models.  
*Observation:* Positives—dogs allowed, street parking, outdoor seating; Negatives—wheelchair accessible, accepts credit cards, valet, happy hour, table service, TV, takeout; numeric features showed weak predictive power.
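
A point lift here can be read as the high-rating rate among businesses with an attribute minus the rate among those without it. A minimal sketch under that assumption (the function name is hypothetical, and rows with a missing flag are excluded from both groups):

```python
from typing import List, Optional


def point_lift(flags: List[Optional[bool]], targets: List[int]) -> Optional[float]:
    """Lift in percentage points: rate(target | flag) minus rate(target | not flag)."""
    with_attr = [t for f, t in zip(flags, targets) if f is True]
    without_attr = [t for f, t in zip(flags, targets) if f is False]
    if not with_attr or not without_attr:
        # One group is empty, so the comparison is undefined.
        return None
    rate_with = sum(with_attr) / len(with_attr)
    rate_without = sum(without_attr) / len(without_attr)
    return 100 * (rate_with - rate_without)
```

These lifts are unadjusted associations, which is why unexpected signs (such as negative lifts for accessibility or credit cards) call for the multivariate checks that follow.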

**4.4 Segment Baseline — High-Rating Rates by Category**  
Computed base rates and coverage within top business segments to avoid pooled averages hiding differences.  
*Why now:* Restaurants dominate volume and can distort overall benchmarks; segment baselines make comparisons fair.  
*Observation:* Overall base ≈ 11%, but Restaurants drop to ≈ 2.6% (much lower than other service segments).

**4.5 Macro-Segment Mapping**  
Mapped noisy category text to a few intuitive macro-segments and summarized counts/base rates.  
*Why now:* Clean segment labels are needed for stable filtering, comparisons, and modeling controls.  
*Observation:* Food & Beverage dominates (~71k) yet has the **lowest** base rate (~5%); service-oriented segments show much higher (≈15–23%).

**4.6 Segment Analysis — Food & Beverage (Univariate)**  
Within F&B, measured lifts/odds for binary & categorical attributes and correlations for numeric features.  
*Why now:* Focuses on the target domain to identify levers specific to F&B before multivariate modeling.  
*Observation:* F&B base ≈ 5.2%; strongest positives—dogs allowed, street parking; negatives—wheelchair accessible, credit cards, happy hour, table service, TV, drive-thru; “unknown” levels often score higher; casual/family-friendly signals trend positive while formal/bar/fast-food contexts trend negative.

**4.7 Segment Model — F&B (Regularized Logistic Regression)**  
Trained an interpretable logistic regression (L1, L2, or Elastic Net penalty) with balanced class weights, cross-validation, and a clean preprocessing pipeline.  
*Why now:* Univariate effects can be confounded; regularized logistic provides multivariate, explainable drivers.  
*Observation:* ROC-AUC ≈ 0.818 (acc ≈ 0.678); recall high, precision low due to ~5% positives; positives—street parking, “attire unknown,” Mexican, no alcohol, outdoor seating; negatives—fast food, full bar, formal attire, groups, pizza, TV, delivery.

**4.8 Result Refinement — Step 1: Drop `parking_any`, Pool Rare Levels, Refit**  
Removed synthetic aggregates and pooled rare categories to stabilize coefficients and interpretation.  
*Why now:* `parking_any` masked specific effects; rare levels caused noisy, unstable estimates.  
*Observation:* Performance held (ROC-AUC ≈ 0.816, PR-AUC ≈ 0.170) with cleaner, more interpretable drivers; overall pattern unchanged.

**4.8 Result Refinement — Step 2: Add Cuisine Tokens + State Controls, Refit**  
Restored rich `categories`, tokenized cuisines, and added state fixed effects to reduce confounding.  
*Why now:* Category text carries true concept signal; geography drives baseline differences—both distort effects if omitted.  
*Observation:* ROC-AUC ↑ to ≈ 0.852, PR-AUC ↑ to ≈ 0.218; clearer signals emerged. Remaining quirks: “attire=unknown” is a missingness proxy; non-F&B tokens leak in; alcohol conflates venue concept (e.g., bar) with amenity; geography is a **control**, not a lever.  

**4.8 Result Refinement —  Step 3: Whitelist Cuisine/Venue Tokens, Drop Alcohol, Refit Logistic**  
Kept only F&B-relevant tokens (cuisines, venue types) and dropped alcohol to avoid double counting bar/cocktail features.  
*Observation:* ROC-AUC stayed strong (~0.83); food trucks, Latin cuisine, cafes, vegan options → positive; Chinese, Tex-Mex, burgers → negative.

**4.8 Result Refinement — Step 4: Final, Name-Safe and NaN-Proof Logistic (F&B)**  
Hardened model by scrubbing missing values, rebuilding cuisine indicators, and cleaning labels for a stakeholder-ready driver table.  
*Observation:* Precision for positives remains low, but signals are stable; food trucks, cafes, vegan/Latin cuisines push ratings up, while formal cuisines/ambience pull them down.

### 5. Q1 roll-up charts/tables

**5.1 ROC & PR Curves**  
Plotted ROC and precision–recall curves to assess discrimination and performance under class imbalance.  
*Why now:* Confirms the logistic model separates signal from noise before thresholding or driver interpretation.  
*Observation:* ROC-AUC ≈ 0.833, PR-AUC ≈ 0.204 (~4× the ≈ 5.2% prevalence a random ranker would score). Strong ranking power; the model effectively prioritizes top businesses despite low prevalence.

**5.2 Driver Charts — Probability Lift**  
Converted odds ratios into intuitive probability-point changes relative to the 5.2% base.  
*Why now:* Stakeholders need interpretable “what moves the needle” views beyond coefficients.  
*Observation:* Food trucks (+10.6 pts) and Latin American (+7.1 pts) strongest positives; upscale (–4.5 pts) and touristy (–4.0 pts) strongest negatives; most other factors cluster near neutral.
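
The conversion from an odds ratio to a probability-point change can be sketched as follows: turn the base rate into odds, multiply by the odds ratio, and convert back. The function name is hypothetical, and the odds ratio of 2.0 in the usage note is an illustrative value, not one taken from the model.

```python
def odds_ratio_to_point_lift(odds_ratio: float, base_rate: float = 0.052) -> float:
    """Probability-point change (in pct points) implied by an odds ratio at base_rate."""
    base_odds = base_rate / (1 - base_rate)
    new_odds = base_odds * odds_ratio
    new_rate = new_odds / (1 + new_odds)
    return 100 * (new_rate - base_rate)
```

At the 5.2% base, an odds ratio of 2.0 gives roughly +4.7 points, so doubling the odds does not double the probability gain seen at higher base rates; the same odds ratio moves fewer points when the base is this low.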

**5.3 Threshold Selection**  
Tested multiple thresholds to balance precision vs. recall for practical use.  
*Why now:* ROC/PR curves rank order, but decisions require a cut-off aligned to business tolerance for false positives/negatives.  
*Observation:* Best F1 ≈ 0.28 at threshold ≈ 0.78. Accuracy ≈ 0.90, but recall is low due to imbalance. Lowering threshold ↑ recall but ↓ precision.
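
A threshold sweep of this kind can be sketched as computing F1 at each candidate cut-off and keeping the best. Function names and the toy data in the usage note are hypothetical.

```python
from typing import List, Sequence


def f1_at_threshold(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """F1 for the positive class when predicting 1 wherever prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true: Sequence[int], y_prob: Sequence[float],
                   candidates: List[float]) -> float:
    """Return the candidate threshold with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))
```

For example, `best_threshold([0, 0, 0, 0, 1, 0, 1, 1], [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9], [0.25, 0.45, 0.65])` picks `0.45`: the lower cut-off admits too many false positives and the higher one misses a true positive.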

**5.4 Roll-Up Analysis — Cuisine, Venue, Operations**  
Aggregated individual attributes into higher-order cuisine, venue, and operational themes.  
*Why now:* Collapsing micro-features reveals consistent directional patterns across related drivers.  
*Observation:*  
- **Cuisine:** Specialty/health-conscious cuisines (Latin American, vegan, gluten-free) outperform; mainstream comfort foods underperform.  
- **Venue:** Food trucks and casual venues positive; upscale/touristy negative.  
- **Operations:** Street parking & family-friendly cues help; group-oriented/“add-on” amenities (valet, TVs, delivery) hurt.  
- **Overall:** Flexible/casual formats succeed; formal/upscale contexts underperform consistently.


### 6. Q1 Validation

**6.1 Validations / Sanity Checks**  
Confirmed logistic drivers with tree models (Permutation Importance, SHAP) and calibration curves.  
*Observation:* Core drivers stable; directions align; mild overconfidence at extremes.  

**6.2 Calibration**  
Applied Platt scaling & isotonic regression.  
*Observation:* ROC-AUC ~0.83 unchanged; probability quality improved (Brier ↓, Log Loss ↓).  
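
For reference, the two probability-quality metrics named above can be computed as in this sketch. It shows only the metrics, not the Platt or isotonic fitting, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true: Sequence[int], y_prob: Sequence[float], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood, with probabilities clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

Both metrics fall when predicted probabilities move closer to observed frequencies. Because ROC-AUC depends only on rank order, a monotone recalibration can improve these scores while leaving ROC-AUC essentially unchanged.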

**6.3 Thresholds**  
No new cutoff—focus remains on calibrated probabilities.  
*Observation:* Outputs are stable and business-ready.  


### 7. Q2 Recommendations (playbook + data gaps)

**7.1 Operator Playbook & Data Gaps**  
Built an operator playbook from model results, mapping factors to actions, metrics, and experiments; logged data gaps for future fixes.  
*Observation:* Playbook converts model signals into operator levers (impact × coverage); surfaced confidence via bootstraps; gaps highlight missing ops/ambience/menu/geo data.

**7.2 Playbook Charts**  
Visualized top drivers by *Impact × Reach* and coverage vs probability lift.  
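
One way to read *Impact × Reach* is absolute probability lift multiplied by the share of businesses the driver applies to. The sketch below assumes that definition; the function name is hypothetical, and the coverage shares in the usage note are made up for illustration (only the lifts come from section 5.2).

```python
from typing import List, Tuple


def rank_by_impact_reach(
    drivers: List[Tuple[str, float, float]]
) -> List[Tuple[str, float, float, float]]:
    """Sort (name, lift_points, coverage_share) rows by |lift| * coverage, descending."""
    scored = [(name, lift, cov, abs(lift) * cov) for name, lift, cov in drivers]
    return sorted(scored, key=lambda row: row[3], reverse=True)


# Coverage shares below are illustrative placeholders, not measured values.
example = rank_by_impact_reach([
    ("Food trucks", 10.6, 0.02),
    ("Latin American", 7.1, 0.05),
    ("Upscale", -4.5, 0.03),
    ("Touristy", -4.0, 0.10),
])
```

Under these placeholder shares, a modest but widespread driver can outrank a large but rare one, which is the trade-off this kind of chart is meant to surface.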


### 8. Conclusion

*Conclusion:* The final logistic model (ROC-AUC ~0.83) identifies clear drivers: positive (Food Trucks, Cafes, Vegan/Gluten-Free) vs. negative (Tourist vibe, Burgers, “Good for Groups”, TV).  
*Observation:* Base rate is low (~5%), so precision–recall focus + playbook actions (parking, outdoor seating, delivery speed) are key for impact.  
