# ECON 0150 | Replication Notebook

**Title:** Restaurant Costs and Ratings

**Original Authors:** Zhang

**Original Date:** Fall 2024

---

This notebook replicates the analysis from a student final project in ECON 0150: Economic Data Analysis.

## About This Replication

**Research Question:** Do restaurants with higher average costs receive higher customer ratings?

**Data Source:** Zomato restaurant dataset (~7,000 restaurants)

**Methods:** OLS regression of restaurant rating on average cost for two people

**Main Finding:** Positive relationship: each ₹1 increase in average cost for two is associated with a 0.0004-point higher rating (p < 0.001), but cost explains only a small share of the variation in ratings (R² = 0.14).
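
As a sketch of the specification behind the Methods and Main Finding lines (the symbols $\beta_0$, $\beta_1$, and $\varepsilon_i$ are notation introduced here, not taken from the original project), the simple regression can be written as:

$$
\text{rating}_i = \beta_0 + \beta_1 \, \text{cost}_i + \varepsilon_i
$$

Here $\text{cost}_i$ is the average cost for two at restaurant $i$, and $\beta_1$ is the slope reported as 0.0004. In a regression with a single predictor, $R^2$ equals the squared correlation between rating and cost.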

**Course Concepts Used:**
- Simple linear regression
- Interpreting small but significant coefficients
- Scatter plots with large datasets
- R² interpretation

---
## Step 0 | Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

In [None]:
# Load data from course website
base_url = 'https://tayweid.github.io/econ-0150/projects/replications/0026/data/'

df = pd.read_csv(base_url + 'zomato.csv')

print(f"Number of restaurants: {len(df):,}")
print(f"Columns: {df.columns.tolist()}")
df.head()

---
## Step 1 | Data Preparation

In [None]:
# Select and clean relevant columns
data = df[['rate (out of 5)', 'avg cost (two people)']].copy()

# Convert to numeric (handle any non-numeric values)
data['rating'] = pd.to_numeric(data['rate (out of 5)'], errors='coerce')
data['avg_cost'] = pd.to_numeric(data['avg cost (two people)'], errors='coerce')

# Drop missing values
data = data.dropna(subset=['rating', 'avg_cost'])

# Keep only the cleaned columns
data = data[['rating', 'avg_cost']].copy()  # .copy() avoids a SettingWithCopyWarning when columns are added later

print(f"Clean data: {len(data):,} restaurants")

---
## Step 2 | Data Exploration

In [None]:
# Summary statistics
print("Summary Statistics:")
print(data.describe())

In [None]:
# Distribution of variables
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(data['rating'], bins=30, edgecolor='black')
axes[0].set_xlabel('Rating (out of 5)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Restaurant Ratings')

axes[1].hist(data['avg_cost'], bins=50, edgecolor='black')
axes[1].set_xlabel('Average Cost for Two')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Average Costs')

plt.tight_layout()
plt.show()

In [None]:
# Correlation
correlation = data['rating'].corr(data['avg_cost'])
print(f"Correlation between rating and cost: {correlation:.3f}")

---
## Step 3 | Visualization

In [None]:
# Scatter plot with transparency (many overlapping points)
plt.figure(figsize=(10, 6))
plt.scatter(data['avg_cost'], data['rating'], alpha=0.15, s=15)
plt.xlabel('Average Cost for Two')
plt.ylabel('Rating (out of 5)')
plt.title('Restaurant Rating vs Average Cost')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
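# Binned analysis - average rating by cost category
# The last edge is np.inf so that costs above 5,000 still fall in the '2000+' bin
cost_edges = [0, 200, 400, 600, 800, 1000, 2000, np.inf]
cost_labels = ['0-200', '200-400', '400-600', '600-800', '800-1000', '1000-2000', '2000+']
data['cost_bin'] = pd.cut(data['avg_cost'], bins=cost_edges, labels=cost_labels)

# observed=False keeps every cost category in the table, even if it is empty
avg_by_cost = data.groupby('cost_bin', observed=False)['rating'].agg(['mean', 'count']).reset_index()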

plt.figure(figsize=(10, 5))
plt.bar(avg_by_cost['cost_bin'].astype(str), avg_by_cost['mean'], color='steelblue')
plt.xlabel('Average Cost Category')
plt.ylabel('Average Rating')
plt.title('Average Rating by Cost Category')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print("\nSample sizes by category:")
print(avg_by_cost)

---
## Step 4 | Statistical Analysis

In [None]:
# OLS Regression
model = smf.ols('rating ~ avg_cost', data=data).fit()
print(model.summary())

In [None]:
# Scatter plot with regression line
plt.figure(figsize=(10, 6))

# Plot points with transparency
plt.scatter(data['avg_cost'], data['rating'], alpha=0.1, s=10)

# Add regression line
x_range = np.linspace(data['avg_cost'].min(), data['avg_cost'].max(), 100)
y_pred = model.params['Intercept'] + model.params['avg_cost'] * x_range
plt.plot(x_range, y_pred, 'r-', linewidth=2, label='Regression Line')

plt.xlabel('Average Cost for Two')
plt.ylabel('Rating (out of 5)')
plt.title('Rating vs Cost with Regression Line')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Practical interpretation
print("\n" + "="*50)
print("KEY RESULTS")
print("="*50)
print(f"Intercept: {model.params['Intercept']:.4f}")
print(f"Cost coefficient: {model.params['avg_cost']:.6f}")
print("\nInterpretation:")
print("  - Each ₹100 increase in cost is associated with")
print(f"    {model.params['avg_cost'] * 100:.3f} higher rating")
print("  - Each ₹1000 increase in cost is associated with")
print(f"    {model.params['avg_cost'] * 1000:.2f} higher rating")
print(f"\nR-squared: {model.rsquared:.3f}")
print(f"P-value: {model.pvalues['avg_cost']:.2e}")

---
## Step 5 | Results Interpretation

### Key Findings

| Metric | Value |
|--------|-------|
| Cost Coefficient | 0.0004 |
| R-squared | 0.14 |
| P-value | < 0.001 |

### Interpretation

1. **Statistically Significant:** The positive relationship between cost and rating is statistically significant (p < 0.001).

2. **Small Effect Size:** A ₹1,000 increase in average cost for two is associated with a rating only 0.4 points higher (on a 5-point scale).

3. **Low R²:** Cost explains only 14% of the variation in ratings.
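
The figures in the table and list above can be checked with a few lines of arithmetic. This is a minimal sketch that uses the rounded values from the Key Findings table, not the fitted model itself, so the variable names here are illustrative and the results are approximate.

```python
import math

# Rounded values copied from the Key Findings table.
cost_coefficient = 0.0004   # rating points per 1 rupee of average cost for two
r_squared = 0.14

# Implied rating difference for a few cost gaps (in rupees).
for cost_gap in (100, 500, 1000):
    rating_gap = cost_coefficient * cost_gap
    print(f"{cost_gap:>5,} rupees more for two -> about {rating_gap:.2f} rating points higher")

# In a one-predictor OLS model, R-squared is the squared correlation.
# The slope is positive here, so the correlation is the positive root.
implied_r = math.sqrt(r_squared)
print(f"Implied correlation between rating and cost: {implied_r:.2f}")
```

The implied correlation can be compared with the value printed in the Step 2 correlation cell. Small differences are expected because the inputs here are rounded.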

### What Else Matters?

The low R² suggests many other factors affect ratings:
- Food quality
- Service
- Location/ambiance
- Cuisine type
- Online order availability

### Causal Interpretation?

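Does higher cost *cause* higher ratings, or do better restaurants both charge more and receive better ratings? The second story is likely: this is **confounding**, where underlying quality drives both cost and ratings, so the coefficient should not be read as the effect of raising prices.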
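
The confounding story can be illustrated with simulated data. The sketch below is entirely hypothetical: the `quality` variable, the coefficients, and the noise levels are assumptions chosen for illustration, and none of it comes from the Zomato dataset. Cost is given no direct effect on rating, yet a regression that omits quality still finds a positive cost slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process: unobserved quality drives both
# price and rating, while cost has NO direct effect on rating.
quality = rng.normal(0, 1, n)
cost = 500 + 200 * quality + rng.normal(0, 100, n)
rating = 3.7 + 0.3 * quality + rng.normal(0, 0.3, n)

def ols(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Index 1 is the cost coefficient in both fits (index 0 is the intercept).
naive_slope = ols([cost], rating)[1]
adjusted_slope = ols([cost, quality], rating)[1]

print(f"Cost slope, quality omitted:  {naive_slope:.5f}")
print(f"Cost slope, quality included: {adjusted_slope:.5f}")
```

In this simulation the cost slope is clearly positive when quality is left out and close to zero once quality is included. Quality is not observed in the real data, so the same adjustment is not available there.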

---
## Replication Exercises

### Exercise 1: Restaurant Type
Does the relationship differ by restaurant type (Quick Bites, Casual Dining, Fine Dining)?
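
One possible starting point is to fit a separate line within each restaurant type and compare the slopes. The sketch below runs on made-up data so that it is self-contained. The `toy` DataFrame, the `restaurant_type` column name, and the slopes built into it are assumptions for illustration, not results from the Zomato data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned data (all values are made up).
rng = np.random.default_rng(1)
assumed_slopes = {'Quick Bites': 0.0008, 'Casual Dining': 0.0004, 'Fine Dining': 0.0001}

frames = []
for type_name, slope in assumed_slopes.items():
    cost = rng.uniform(100, 2000, size=300)
    rating = 3.2 + slope * cost + rng.normal(0, 0.2, size=300)
    frames.append(pd.DataFrame({'restaurant_type': type_name,
                                'avg_cost': cost,
                                'rating': rating}))
toy = pd.concat(frames, ignore_index=True)

# Fit rating = intercept + slope * avg_cost separately within each type.
rows = {}
for type_name, group in toy.groupby('restaurant_type'):
    slope, intercept = np.polyfit(group['avg_cost'], group['rating'], deg=1)
    rows[type_name] = {'slope': slope, 'intercept': intercept, 'n': len(group)}

by_type = pd.DataFrame.from_dict(rows, orient='index')
print(by_type)
```

On the real data, the same loop would run over the notebook's `df` grouped by its `'restaurant type'` column, after converting the rating and cost columns to numeric as in Step 1.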

### Exercise 2: Multiple Regression
Add the number of ratings as a predictor. Do popular restaurants have higher ratings?

### Exercise 3: Cuisine Analysis
Which cuisines have the highest ratings? Does the cost-rating relationship vary by cuisine?

### Challenge Exercise
Research the economics of restaurant pricing. What determines optimal price points?

In [None]:
# Your code for exercises

# Example: Look at restaurant types
# types = df.groupby('restaurant type')['rate (out of 5)'].agg(['mean', 'count'])
# print(types.sort_values('mean', ascending=False))